
mOS for HPC v0.4 User's Guide


Launching applications on mOS for HPC

Applications are run under mOS for HPC by means of a launcher command called yod.  Any program not launched with yod simply runs under the Linux kernel.  This document discusses how to use yod in conjunction with mpirun; job schedulers are not covered.

Launching processes with yod

The yod utility of mOS is the fundamental mechanism for spawning Light-weight Kernel (LWK) processes.  The syntax is:

yod yod-arguments program program-arguments

One of yod's principal jobs is to reserve LWK resources (CPUs, memory) for the process being spawned.  yod supports a simple syntax for reserving a fraction of the LWK resources, which is useful when launching multiple MPI ranks per node.  In such cases, the general pattern looks like this:

mpirun -ppn N mpirun-args yod -R 1/N yod-args program program-args

This gives each MPI rank an equal portion of the LWK resources.
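
For example, to launch four ranks per node with each rank receiving one quarter of the node's LWK CPUs and memory, the command line might look like the one below; the rank count and the application name ./app are placeholders:

mpirun -ppn 4 yod -R 1/4 ./app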

Please consult the yod man page for a more thorough description of the yod arguments, and the mpirun man page for further information on mpirun-args.

In addition to the arguments documented in the yod man page, there are some backdoor options.  These options are considered experimental and could easily change or even disappear in future releases.  Some of the backdoor options are described below.  All of them are passed to yod via the --opt option.
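
For illustration, the hypothetical invocation below passes one of these options to a placeholder application ./app; it is assumed that an option taking no argument is passed by name alone, while options that take a value use the name=value form shown in the entries that follow:

yod --opt move-syscalls-disable ./app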

 

Option: move-syscalls-disable
Arguments: none
Description: Disables system call migration from LWK CPUs to the designated system call CPUs.

Option: lwkmem-blocks-allocated
Arguments: none
Description: Enables the tracking and reporting of allocated blocks.  Blocks are counted by size and NUMA domain, and the counts are reported in the kernel console at process exit.
Notes: This is a total count of block allocations; block frees are not counted.  Thus the total amount of memory allocation reported may exceed the amount of memory designated for the given NUMA domain.  This option is useful for debugging and has no other noticeable effect on the hosted applications.

Option: lwkmem-brk-clear-len
Arguments: length
Description: Specifies the number of bytes to clear (zero) at the beginning of an expanded heap (sbrk) region.  The default is 4096 bytes.  A negative number clears all bytes.
Notes: The brk system call documentation does not require that heap extensions contain zeroed memory.  In practice, however, the Linux implementation initially backs heap extensions with the zero page.  mOS tends to back the heap with large pages, and zeroing an entire large page is expensive and useful only if user-space software assumes that the memory is zeroed.  Performance is gained by zeroing less than the full amount; by default, mOS clears only the first 4 KiB.  AMG is one application that is known to fail if the entire heap is not cleared by the kernel (possibly because it uses malloc where it should use calloc).  It is recommended that AMG be run with --opt lwkmem-brk-clear-len=-1, as shown in the example after this list.

Option: lwkmem-load-elf-disable
Arguments: none
Description: Disables loading the initialized/uninitialized sections of the ELF binary image into mOS memory for an mOS process.  This feature is enabled by default when the application is launched: the .data and/or .bss sections of the program are loaded into LWK memory.  This option forces the .data and .bss sections to be loaded into Linux memory instead.

Option: lwkmem-mmap-fixed
Arguments: size
Description: Private, anonymous mmaps larger than size are placed on fixed, 1 GiB aligned boundaries, providing consistent behavior from run to run and adequate space for VMAs to expand (via mremap) without having to be moved.
Notes: This option may be useful for applications that dynamically grow mmap'd regions, since the alignment of VMAs leaves room for expansion between regions.  LAMMPS, in particular, can take advantage of this.

Option: lwksched-stats
Arguments: level
Description: Outputs counters to the kernel log at process exit.  The level of detail is controlled by level: a value of 1 generates an entry for every mOS CPU that had more than one mOS thread committed to run on it; a value of 2 adds a summary record for the exiting mOS process; a value of 3 adds records for all CPUs in the process and a process summary record for the exiting process, regardless of commitment levels.  The following information is provided:

PID: the TGID of the process.  This can be used to visually group the CPUs that belong to a specific process.

CPUID: the CPU corresponding to the data being displayed.

THREADS: the number of threads within the process (main thread plus pthreads).

CPUS: the number of CPUs reserved for use by this process.

MAX_COMMIT: high-water mark of the number of mOS threads assigned to run on this CPU.

MAX_RUNNING: high-water mark of the number of tasks enqueued on the mOS run queue, including kernel tasks.

GUEST_DISPATCH: the number of times a non-mOS thread (kernel thread) was dispatched on this CPU.

TIMER_POP: the number of timer interrupts.  Typically these are the result of a POSIX timer expiring or of round-robin dispatching if enabled through the option lwksched-enable-rr.

SYS_MIGR: the number of system calls that were migrated to a Linux CPU for execution.

SETAFFINITY: the number of setaffinity system calls executed by this CPU.

UTIL-CPU: indicates that this CPU has been designated as a utility CPU, meant to run utility threads such as the OMP monitor and the PSM progress threads.

Notes: This option is useful for debugging.  The content and format of the output are highly dependent on the current implementation of the mOS scheduler and are therefore likely to change in future releases.
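
As a concrete illustration of the options above, the hypothetical command line below applies the AMG recommendation (clear the entire expanded heap) together with scheduler statistics at detail level 2.  The rank count, resource fraction, and binary name ./amg are placeholders, and it is assumed that --opt may be repeated to pass more than one option:

mpirun -ppn 2 yod -R 1/2 --opt lwkmem-brk-clear-len=-1 --opt lwksched-stats=2 ./amg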

Recommended environment variables

The mOS kernel will assign unique CPU and memory resources to each process/rank within a node and will assign threads to the CPU resources owned by the process.  For these reasons, it is advisable to set the runtime-specific environment variables below, which prevent the MPI and OpenMP runtimes from interfering with this mOS behavior.

Name: I_MPI_PIN
Value: off
Description: Disables process pinning in Intel MPI.  Without this set, Intel MPI gets confused by isolated CPUs (including mOS LWK CPUs) and may attempt to assign ranks to cores not controlled by mOS; symptoms include core dumps from the pmi_proxy (HYDRA).  When pinning is disabled via I_MPI_PIN=off, processes forked by the pmi_proxy inherit the affinity mask of the proxy, which is what mOS' yod expects.  See https://software.intel.com/en-us/node/528818

Name: I_MPI_FABRICS
Value: shm:tmi
Description: For use on clusters with Intel(R) Omni-Path Fabric.

Name: I_MPI_TMI_PROVIDER
Value: psm2
Description: For use on clusters with Intel(R) Omni-Path Fabric.

Name: I_MPI_FALLBACK
Value: 0

Name: PSM2_RCVTHREAD
Value: 0
Description: Disables the PSM2 progress thread.  If not disabled, the MPICH run-time will create an additional thread within the process, which could interfere with mOS process/thread placement and reduce performance.  Some application environments may require the progress thread in order to make forward progress; in those environments, the existence of the PSM2 progress thread must be made known to the mOS kernel through the yod --util_threads option.  Please consult the yod man page for a more detailed description of this option.

Name: PSM2_MQ_RNDV_HFI_WINDOW
Value: 4194304
Description: For use on clusters with Intel(R) Omni-Path Fabric.

Name: PSM2_MQ_EAGER_SDMA_SZ
Value: 65536
Description: For use on clusters with Intel(R) Omni-Path Fabric.

Name: PSM2_MQ_RNDV_HFI_THRESH
Value: 200000
Description: For use on clusters with Intel(R) Omni-Path Fabric.

Name: KMP_AFFINITY
Value: none
Description: Does not bind OpenMP threads to CPU resources, allowing the mOS kernel to place the threads on CPU resources.  If the operating system supports affinity, the compiler still uses the OpenMP thread affinity interface to determine the machine topology; specify KMP_AFFINITY=verbose,none to list a machine topology map.  The other KMP_AFFINITY options are supported with mOS and can be specified if it is desired to allow the OpenMP run-time to place the OpenMP threads.  See https://software.intel.com/en-us/node/522691

Name: HFI_NO_CPUAFFINITY
Value: 1
Description: For use on clusters with Intel(R) Omni-Path Fabric.  Prevents PSM2 from affinitizing the PSM2 progress thread.
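
Putting the recommendations together, a hypothetical job-script fragment for a cluster with Intel(R) Omni-Path Fabric might look like the following; the rank count, resource fraction, and application name ./app are placeholders, and it is assumed that the launcher propagates this environment to the ranks:

export I_MPI_PIN=off
export I_MPI_FABRICS=shm:tmi
export I_MPI_TMI_PROVIDER=psm2
export I_MPI_FALLBACK=0
export PSM2_RCVTHREAD=0
export PSM2_MQ_RNDV_HFI_WINDOW=4194304
export PSM2_MQ_EAGER_SDMA_SZ=65536
export PSM2_MQ_RNDV_HFI_THRESH=200000
export KMP_AFFINITY=none
export HFI_NO_CPUAFFINITY=1

mpirun -ppn 4 yod -R 1/4 ./app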