mOS for HPC v0.4 User's Guide
Applications are run under mOS for HPC using a launcher command called yod. Any program not launched with yod will simply run under the Linux kernel. This document discusses how to use yod in conjunction with mpirun; job schedulers are not discussed here.
The yod utility of mOS is the fundamental mechanism for spawning Light-weight Kernel (LWK) processes. The syntax is:
yod yod-arguments program program-arguments
One of yod's principal jobs is to reserve LWK resources (CPUs and memory) for the process being spawned. yod supports a simple syntax for reserving a fraction of the LWK resources, which is useful when launching multiple MPI ranks per node. In such cases, the general pattern looks like this:
mpirun -ppn N mpirun-args yod -R 1/N yod-args program program-args
This gives each MPI rank an equal portion of the LWK resources.
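As a concrete sketch, with four ranks per node the invocation might look like the following (the executable name ./my_app is a placeholder):

mpirun -ppn 4 yod -R 1/4 ./my_app

Here each rank is given one quarter of the node's LWK CPUs and LWK memory.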
Please consult the yod man page for a more thorough description of the yod arguments. Please consult the mpirun man page for further information on mpirun-args.
In addition to the arguments documented in the yod man page, yod accepts a number of backdoor options. These options are considered experimental and could easily change or even disappear in future releases. Some of the backdoor options are described in the table below. All of them are passed to yod via the --opt option.
Option | Arguments | Description | Additional Notes |
---|---|---|---|
move-syscalls-disable | None | Disables system call migration from LWK CPUs to designated system call CPUs. | |
lwkmem-blocks-allocated | None | Enables the tracking and reporting of blocks allocated. Blocks are counted by size and NUMA domain. The block counts are reported in the kernel console at process exit. | This is a total count of block allocations; block frees are not counted. Thus the total amount of memory allocation reported may exceed the amount of memory designated for the given NUMA domain. This option is useful for debugging and has no other noticeable effect on the hosted application. |
lwkmem-brk-clear-len | length | Specifies the number of bytes to clear (zero) at the beginning of an expanded heap (sbrk) region. The default is 4096 bytes. A negative number will clear all bytes. | The brk system call documentation does not require heap extensions to contain zeroed memory. In practice, however, the Linux implementation initially backs heap extensions with the zero page. mOS tends to use large pages to back the heap, and zeroing an entire large page is expensive and useful only if user-space software assumes that memory is zeroed, so performance is gained by zeroing less than the full amount; by default, mOS clears only the first 4 KiB. AMG is one application known to fail if the entire heap is not cleared by the kernel, possibly because it uses malloc where it should use calloc. It is recommended that AMG be run with a negative length so that the entire extension is cleared. |
lwkmem-load-elf-disable | None | Disables loading the initialized/uninitialized sections of the ELF binary image into mOS memory for an mOS process. This feature is enabled by default when the application is launched. | The .data and/or .bss sections of the program are loaded into LWK memory by default. This option forces the .data and .bss sections to be loaded into Linux memory instead. |
lwkmem-mmap-fixed | size | Private, anonymous mmaps larger than size are placed on fixed, 1 GiB aligned boundaries, providing consistent behavior from run to run and adequate space for VMAs to expand (via mremap) without having to be moved. | This option may be useful for applications that dynamically grow mmap'd regions, since the alignment of VMAs leaves room for expansion between regions. LAMMPS, in particular, can take advantage of this. |
lwksched-stats | level | Outputs counters to the kernel log at process exit. The amount of detail is controlled by level: a value of 1 generates an entry for every mOS CPU that had more than one mOS thread committed to run on it; a value of 2 adds a summary record for the exiting mOS process; a value of 3 adds records for all CPUs in the process plus a process summary record, regardless of commitment levels. Fields reported: PID: the TGID of the process, which can be used to visually group the CPUs that belong to a specific process. CPUID: the CPU corresponding to the data being displayed. THREADS: the number of threads within the process (main thread plus pthreads). CPUS: the number of CPUs reserved for use by this process. MAX_COMMIT: the high-water mark of the number of mOS threads assigned to run on this CPU. MAX_RUNNING: the high-water mark of the number of tasks enqueued on the mOS run queue, including kernel tasks. GUEST_DISPATCH: the number of times a non-mOS thread (kernel thread) was dispatched on this CPU. TIMER_POP: the number of timer interrupts, typically the result of a POSIX timer expiring or of RR dispatching if enabled through the option lwksched-enable-rr. SYS_MIGR: the number of system calls that were migrated to a Linux CPU for execution. SETAFFINITY: the number of setaffinity system calls executed by this CPU. UTIL-CPU: an indicator that this CPU has been designated as a utility CPU, meant to run utility threads such as the OMP monitor and the PSM progress threads. | This option is useful for debugging. The content and format of the output are highly dependent on the current implementation of the mOS scheduler and are therefore likely to change in future releases. |
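As an illustration, a backdoor option is named directly after --opt; for options that take an argument, a name=value form is assumed here (consult the yod man page to confirm the exact syntax). The executable name ./my_app and the rank layout are placeholders:

yod --opt lwksched-stats=2 ./my_app

mpirun -ppn 4 yod -R 1/4 --opt lwkmem-brk-clear-len=-1 ./my_app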
The mOS kernel will assign unique CPU and memory resources to each process/rank within a node and will assign threads to the CPU resources owned by the process. To keep the MPI and OpenMP runtimes from interfering with this placement, it is advisable to set the runtime-specific environment variables below. A sample launch sketch follows the table.
Name | Value | Description |
---|---|---|
I_MPI_PIN | off | Disables process pinning in Intel MPI. Without this set, Intel MPI gets confused by isolated CPUs (including mOS LWK CPUs) and may attempt to assign ranks to cores not controlled by mOS; symptoms include core dumps from the pmi_proxy (HYDRA). With pinning disabled via I_MPI_PIN=off, processes forked by the pmi_proxy inherit the affinity mask of the proxy, which is what mOS' yod expects. |
I_MPI_FABRICS | shm:tmi | For use on clusters with Intel(R) Omni-Path Fabric. |
I_MPI_TMI_PROVIDER | psm2 | For use on clusters with Intel(R) Omni-Path Fabric. |
I_MPI_FALLBACK | 0 | |
PSM2_RCVTHREAD | 0 | Disables the PSM2 progress thread. If not disabled, the MPICH run-time will create an additional thread within the process; this additional thread could interfere with mOS process/thread placement and reduce performance. Some application environments may require this progress thread in order to make forward progress. In those environments, the existence of the PSM2 progress thread must be made known to the mOS kernel through the yod --util_threads option. Please consult the yod man page for a more detailed description of this option. |
PSM2_MQ_RNDV_HFI_WINDOW | 4194304 | For use on clusters with Intel(R) Omni-Path Fabric. |
PSM2_MQ_EAGER_SDMA_SZ | 65536 | For use on clusters with Intel(R) Omni-Path Fabric. |
PSM2_MQ_RNDV_HFI_THRESH | 200000 | For use on clusters with Intel(R) Omni-Path Fabric. |
KMP_AFFINITY | none | Does not bind OpenMP threads to CPU resources, allowing the mOS kernel to place the threads on CPU resources. If the operating system supports affinity, the compiler still uses the OpenMP thread affinity interface to determine machine topology. |
HFI_NO_CPUAFFINITY | 1 | For use on clusters with Intel(R) Omni-Path Fabric. |
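A minimal launch sketch that applies these settings, assuming Intel MPI on a cluster with Intel(R) Omni-Path Fabric, four ranks per node, and a placeholder executable ./my_app (the PSM2_MQ_* tuning values from the table can be exported the same way):

export I_MPI_PIN=off

export I_MPI_FABRICS=shm:tmi

export I_MPI_TMI_PROVIDER=psm2

export I_MPI_FALLBACK=0

export PSM2_RCVTHREAD=0

export KMP_AFFINITY=none

export HFI_NO_CPUAFFINITY=1

mpirun -ppn 4 yod -R 1/4 ./my_app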