mOS for HPC v1.0 Administrators Guide
This document provides the instructions to check out, build, install, boot and validate mOS for HPC. All the instructions provided below have been validated on the following system configuration:
Component | Build Configuration | Boot & Test Configuration
---|---|---
Processor | Intel(R) Xeon(R) E5-2699 v3 processor | 2 Intel(R) Xeon(R) Platinum 8470 processors, 6 Ponte Vecchio Xe-HPC(R) GPUs, 4 HPE Slingshot(R) NICs
Cluster mode | n/a | non-SNC/SNC4
Memory | 64 GB DDR4 | 1 TB DDR5, 128 GB HBM
Distribution | SLES 15 SP3 | SLES 15 SP4 / SLES 15 SP3
Boot loader | n/a | GRUB
You may need to modify the steps documented here if you have different hardware or software.
The mOS for HPC source can be checked out from GitHub at https://github.com/intel/mOS.
git clone https://github.com/intel/mOS.git
cd mOS
The mOS for HPC source contains an example configuration file, config.mos, that can be used as a starting point for configuring the kernel. Your specific compute node hardware may require a different configuration. The table below shows the settings required to configure the source code.
Mandatory setting | Description
---|---
CONFIG_MOS_FOR_HPC=y | Activates the mOS for HPC code in the Linux kernel.
CONFIG_MOS_LWKMEM=y | Enables the mOS for HPC memory management. It is possible to build mOS without LWKMEM enabled, but this is not recommended.
CONFIG_MOS_ONEAPI_LEVELZERO=y | Enables GPU resource management in the mOS kernel. Requires that both the level-zero-* and level-zero-devel-* packages are installed on the build system, and that the level-zero-* package is installed on the compute node. If your system does not contain accelerators managed by oneAPI, this option does not need to be set and the level-zero packages are not required.
CONFIG_NO_HZ_FULL=y | Activates the tickless feature of Linux. In conjunction with the mOS for HPC scheduler, this limits noise on LWK CPUs.
CONFIG_NODES_SHIFT=6 | Controls the size of the NUMA node masks within the kernel. A value of 6 represents a mask size of 8 bytes and is sufficient for any system with 64 or fewer NUMA domains. If a value larger than 6 is specified, kernel memory is wasted and performance is impacted.
CONFIG_MAXSMP is not set | Ensure this configuration option is not set to 'y'. It would create very large CPU and NUMA masks that waste large amounts of kernel memory.
CONFIG_DAX=y CONFIG_DEV_DAX=y CONFIG_DEV_DAX_HMEM=y CONFIG_DEV_DAX_HMEM_DEVICES=y CONFIG_DEV_DAX_KMEM=y CONFIG_FS_DAX=y CONFIG_FS_DAX_PMD=y | Required to have HBM memory presented as NUMA domains in OS system memory.
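Before building, you may want to confirm that the mandatory settings above are present in the example configuration. The check below is a minimal sketch; the grep pattern covers only a subset of the options listed above and is purely illustrative:
# Sanity check: list the mandatory mOS settings as they appear in config.mos
grep -E 'CONFIG_MOS_FOR_HPC|CONFIG_MOS_LWKMEM|CONFIG_NO_HZ_FULL|CONFIG_NODES_SHIFT|CONFIG_MAXSMP' config.mos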
It is recommended that you build kernel RPMs for installation of mOS for HPC. The minimum build system requirements can be found at https://www.kernel.org/doc/html/latest/process/changes.html. Please run the following commands from the directory where you checked out mOS for HPC:
cp config.mos .config
make -j 32 rpm-pkg
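When the build completes, the kernel RPMs are placed in the rpmbuild tree of the building user (the same location used by the install commands below). A quick way to list them, assuming the default rpmbuild output location:
# List the mOS kernel RPMs produced by the build
ls ${HOME}/rpmbuild/RPMS/x86_64/kernel*mos*.rpm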
The RPMs built in the previous step need to be installed on the compute nodes or into the compute node image.
At a minimum install the kernel-5.14.21_1.0.mos-1.x86_64 and kernel-mOS-5.14.21_1.0.mos-1.x86_64 RPMs into your compute node image. The exact RPM names may vary depending on the state of the code, whether a local version name is specified (in something like make menuconfig), and how many times the RPMs are built. However, the 5.14.21_1.0.mos part of the name should remain constant.
sudo rpm -ivh --force ${HOME}/rpmbuild/RPMS/x86_64/kernel-5.14.21_1.0.mos-1.x86_64.rpm
sudo rpm -ivh --force ${HOME}/rpmbuild/RPMS/x86_64/kernel-mOS-5.14.21_1.0.mos-1.x86_64.rpm
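After installation you can verify that the packages and the kernel image are in place. This is a minimal check, assuming an RPM-based compute node image where the kernel image lands under /boot with the mOS version string in its name:
# Confirm the mOS kernel packages and boot image are installed
rpm -qa | grep -i mos
ls /boot/vmlinuz-*mos*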
If you wish to use the GPU resource management support in mOS, the level-zero-* package must be installed on the compute node.
After RPM installation the kernel needs to be added to the GRUB menu on the compute nodes. The kernel parameters are taken from the GRUB_CMDLINE_LINUX variable in /etc/default/grub. Update or replace the GRUB_CMDLINE_LINUX variable in that file, as in one of the following examples:
GRUB_CMDLINE_LINUX="selinux=0 intel_pstate=disable nmi_watchdog=0 kernelcore=64G pci=noaer,noats,pcie_bus_perf tsc=reliable nohz_full=1-51,53-103,105-155,157-207 console=tty0 console=ttyS0,115200"
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 selinux=0 intel_pstate=disable nmi_watchdog=0 kernelcore=8% movable_node nohz_full=<see following table>"
The following parameters and values are recommended for mOS for HPC. Not all combinations and variations of boot parameters have been validated and tested. Boot failure is possible if, for example, lwkcpus and lwkmem are not properly set for your system. It is recommended that the lwkcpus and lwkmem parameters be omitted and the lightweight kernel partition created after booting using the lwkctl command. The lwkctl command supports creating an LWK partition using automatic configuration of memory and CPUs, and also supports deleting the LWK partition. The resulting auto configuration from the lwkctl command can later be used in the kernel boot parameters as a known good initial configuration if desired. Please refer to Documentation/kernel-parameters.txt in the mOS for HPC kernel source for further details.
Name | Recommended Value | Description
---|---|---
nmi_watchdog | 0 | Disables the NMI watchdog interrupt in order to eliminate this additional source of noise on the CPUs. An alternative method of turning off the watchdog is writing a zero to /proc/sys/kernel/nmi_watchdog, which eliminates the need to set it here.
intel_pstate | disable | Do not allow the system to dynamically adjust the frequency of the CPUs. When running HPC applications, we want a stable, consistent CPU frequency across the entire job.
lwkcpus (not needed if using lwkctl) | topology dependent | Designates the CPUs to be controlled by the mOS LWK. This includes the CPUs that will be exclusively owned by the LWK and also the Linux CPUs that mOS will use to host utility threads. Boot-time designation of LWK CPUs is no longer necessary; it is now possible (and recommended) to boot the system without this option and then use the lwkctl command to designate LWK CPUs and memory once the system is up.
lwkmem (not needed if using lwkctl) | topology dependent | Designates memory for use by mOS. Boot-time designation of LWK memory is no longer necessary; it is now possible (and recommended) to boot the system without this option and then use the lwkctl command to designate LWK CPUs and memory once the system is up. The amount of memory requested is specified in parse_mem format (K,M,G), or memory can be designated per NUMA domain. LWK memory requested on the kernel command line can only come from the movable memory in the system; use the 'kernelcore' argument explained below to control the split between non-movable and movable memory. Example 1: lwkmem=126G requests that the kernel designate a total of 126G of physical memory to the LWK; the memory will be allocated from all online NUMA nodes which have movable memory. Example 2: lwkmem=0:58G,1:16G requests that the kernel designate 58G of physical memory from NUMA node 0 and 16G from NUMA node 1 to the LWK. If the full amount of requested memory cannot be allocated on a specified NUMA node, the remainder of the request is distributed uniformly among the subsequent NUMA nodes in the request list; in this example, if the kernel could designate only 50G on NUMA node 0, the remaining 8G would be added to the 16G requested from NUMA node 1.
kernelcore | 8%, or an explicit value that is at least 8% of total memory | Sets the total non-movable memory in the system. Non-movable memory is used only by the Linux kernel and cannot be designated to the LWK; the kernel treats the rest of physical memory as movable memory, which can be dynamically provisioned between Linux and the LWK. The memory requested via the 'lwkmem' parameter described above can only come from movable memory. Adjust the 'kernelcore' value based on your requirements. In mOS for HPC it is desirable to keep HBM fully movable so that it remains available to applications; this can be accomplished by specifying the 'movable_node' kernel parameter (described below) along with the 'kernelcore' parameter. Please see the BIOS settings below for MCDRAM (HBM) configuration. Example: kernelcore=16G movable_node on a system with 96G DDR and 16G HBM. In this case the 16G of non-movable kernel memory is taken from DDR and all of the HBM remains movable.
movable_node | (no value) | If HBM is identified as EFI special purpose memory (Sapphire Rapids BIOS) or as hot-pluggable memory (KNL BIOS), then movable_node marks the HBM NUMA nodes as movable nodes; there will not be any kernel memory allocation in HBM and all of it is available for use by applications (Linux or LWK). Please see the BIOS settings below for MCDRAM configuration.
nohz_full | CPU dependent | Recommend specifying all CPUs except those on the first core of each socket. For example, on the Aurora blades, which have 52 cores per socket: nohz_full=1-51,53-103,105-155,157-207.
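After rebooting into the mOS kernel, you can confirm which parameters actually took effect. The following is a small sketch using standard kernel interfaces; the nohz_full file is present only when CONFIG_NO_HZ_FULL is enabled:
# Confirm the kernel command line that was used and the tickless CPU set adopted
cat /proc/cmdline
cat /sys/devices/system/cpu/nohz_full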
The last step is to update the grub configuration using the grub2-mkconfig command. Please ensure that appropriate rd.lvm.lv settings are specified for your system. The grub configuration file is grub.cfg. The location of this file varies. The example below shows a system where it is located in /boot/efi/EFI/centos/grub.cfg. Other systems might have it in /boot/grub2/grub.cfg. You may want to save a backup copy of your grub.cfg file before the following step.
$ sudo grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
Note: This command will add the kernel parameters in GRUB_CMDLINE_LINUX to every entry in the grub menu. You should preserve and restore the existing kernel entries in grub.cfg after running grub2-mkconfig.
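As noted above, it is worth saving a copy of the existing configuration before running grub2-mkconfig. A minimal sketch, assuming the example grub.cfg location shown above (adjust the path for your system):
# Back up the current GRUB configuration before regenerating it
sudo cp /boot/efi/EFI/centos/grub.cfg /boot/efi/EFI/centos/grub.cfg.bak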
When running on a Xeon Phi(R) platform (i.e. KNL), it is recommended to treat MCDRAM as hot-pluggable memory. This setting, in conjunction with the 'movable_node' kernel parameter, is necessary for maximum MCDRAM availability for applications (either Linux or LWK). The following BIOS menu is used to configure MCDRAM:
EDKII Menu -> Advanced -> Uncore Configuration -> Treat MCDRAM as Hot-Pluggable Memory ==> <Yes>
If mOS for HPC has been properly installed and configured then the grub boot menu should have an entry for mOS. Select the 5.14.21_1.0.mos entry during boot.
HBM is treated as "EFI Specific Purpose Memory" by the BIOS. The kernel must be built with DAX configured. HBM memory on a two-socket SPR system is presented to the OS as either two DAX devices in non-SNC mode or eight DAX devices in SNC-4 mode. These devices can then be targeted using the daxctl command to reconfigure the contained HBM memory into system memory available to the OS as additional NUMA nodes. The command places all of this memory in zone movable and prevents it from being used by any kernel allocations. The HBM memory in each of the DAX devices will be reconfigured into Linux zone movable system memory. The following is the command that must be executed:
daxctl reconfigure-device --mode=system-ram all
Depending on your Linux distribution, you may need to modify the above command. It is possible that your system has udev rules that attempt to online system memory. If so, the above daxctl command will race with the udev rule to online memory, which may cause fragmentation in the Linux buddy allocator. This can lead to less than optimal usage of HBM memory both by Linux and by the LWK partition. Proper behavior can be observed by looking at /proc/buddyinfo after boot. If all of the memory for the HBM NUMA nodes does not show up in the last column of the table, then this race condition is likely occurring. To fix this, modify the daxctl command as shown below so that only the udev rule is onlining the memory. Also be sure to specify the movable_node parameter on the kernel command line (described earlier); otherwise the memory onlined by the udev rule will all be in zone normal and not usable by the LWK.
daxctl reconfigure-device --mode=system-ram --no-online all
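After reconfiguring the DAX devices, you can confirm that the HBM NUMA nodes came online and that their memory is unfragmented. This is a minimal check using standard tools; node numbers and device names are system dependent:
# Verify the DAX devices and the resulting NUMA nodes
daxctl list
numactl -H
# All HBM memory should appear in the highest-order column for its NUMA nodes
cat /proc/buddyinfo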
In mOS for HPC the resources, CPUs and memory, can be dynamically partitioned between Linux and the LWK. An LWK partition can be created using the user space command utility lwkctl after the kernel boots. This command line utility offlines the resources on Linux and hands them over (designates them) to the LWK, and vice versa. Once this partitioning is complete, further resource partitioning between LWK processes (reservation) is done using the mOS job launch utility yod. The lwkctl command requires root privileges in order to create or delete an LWK partition. Both LWK CPU and LWK memory specifications must be provided when creating an LWK partition. The specification value auto is supported for both the LWK CPU and the LWK memory specifications; when auto is used, mOS generates a topology-aware specification suitable for most HPC application environments. Deleting an LWK partition deletes both LWK CPU and LWK memory designations. The command can also be used to view the current LWK partition.
Quick Reference:
- Creating an LWK partition:
sudo lwkctl -c 'lwkcpus=<lwkcpu_spec> lwkmem=<lwkmem_spec>'
Example 1 - mOS determines the configuration:
sudo lwkctl -c 'lwkcpus=auto lwkmem=auto'
Example 2 - explicit specification (for Intel(R) Xeon Phi(TM)):
sudo lwkctl -c 'lwkcpus=1.52-67,256-271:69.120-135,188-203:137.2-17,206-221:205.70-85,138-153:19.20-35,224-239:87.88-103,156-171:155.36-51,240-255:223.104-119,172-187 lwkmem=0:16G,1:16G,2:16G,3:16G,4:3968M,5:3968M,6:3968M,7:3968M'
Note that the entire specification needs to be enclosed within single quotes (' ').
- Deleting the LWK partition:
sudo lwkctl -d
- Viewing the existing LWK partition:
To output in human readable format:
lwkctl -s
To output in raw format:
lwkctl -s -r
The following table is a complete description of the lwkctl command options.
LWKCTL Option | Description
---|---
--create, -c '<spec>' | Creates a new LWK partition. If an LWK partition already exists, it is deleted before the new partition is created. LWK CPU and memory specifications for the new partition are passed as arguments in the following format: '<spec>': 'lwkcpus=<lwkcpu_spec> lwkmem=<lwkmem_spec>'. Specifying lwkcpus=auto and lwkmem=auto causes the system to automatically generate an LWK partition based on the available CPU and memory resources, using the system topology; the resulting LWK partition works well for many HPC applications. The lwkmem=auto:max specification aggressively designates all available memory to the LWK partition without regard to balancing memory designations across the NUMA domains. The syntax for <lwkcpu_spec> is: <utility cpu1>.<lwkcpu set1>:<utility cpu2>.<lwkcpu set2>... The syntax for <lwkmem_spec> is: <n1>:<size1>,<n2>:<size2>,... where n1,n2,... are NUMA node numbers and size1,size2,... are the sizes of the LWKMEM requests on the corresponding NUMA nodes. Based on available kernel movable memory and alignment, the designated LWK memory will be less than or equal to the requested size. LWK memory requests are aligned on a sparse memory section boundary, which is in general 128MB. LWK memory is allocated from the movable memory of the Linux kernel (i.e. from ZONE_MOVABLE pages); in order to have movable memory on every node, the kernel needs to be booted with the kernelcore argument to specify the total non-movable memory in the system.
--delete, -d | Deletes the existing LWK partition and releases the corresponding resources to Linux.
--show, -s | Displays information about the existing LWK partition.
--raw, -r | Modifies the format of the --show/-s option to display the unprocessed LWK partition specification.
--precise, -p <yn> | Modifies the behavior of the --create/-c option. If --precise yes is specified and the requested memory designation cannot be completed, or if the requested memory size for any NUMA domain is not on a 128M boundary, the attempt to create the LWK partition will fail. If --precise no is specified and the requested memory designation cannot be completed, the partition will be created using the available memory. If the option is not specified, the default behavior is '--precise no'.
--timeout, -t | Specifies the timeout (in seconds) that lwkctl will wait for a serialization lock. This lock prevents multiple invocations of lwkctl that alter the partition from running concurrently. The default is 300 (five minutes). If specified as zero, lwkctl will block indefinitely waiting for the lock.
--force, -f | Forces the deletion or creation of the partition even if there is a job running on the node. Please note that using this option could impact active jobs, and if resources are currently being consumed by active jobs, this option does not guarantee that those resources are freed or that the partitioning completes successfully. Use with caution.
--verbose, -v <number> | Controls the verbosity of lwkctl. Number is an integer between 0 and 4.
--help, -h | Prints a terse version of this documentation.
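As an illustration of combining these options (the values below are examples only): request an automatically generated partition, fail rather than fall back if the memory designation cannot be met exactly, and wait at most 60 seconds for the serialization lock:
# Illustrative combination of lwkctl options; adjust values for your system
sudo lwkctl --precise yes --timeout 60 -c 'lwkcpus=auto lwkmem=auto'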
A default LWK partition can also be created during the kernel boot up by specifying the LWK resources needed on the kernel command line through kernel parameters 'lwkcpus=' and 'lwkmem='.
lwkcpus=<linux cpu1>.<lwkcpu set1>:<linux cpu2>.<lwkcpu set2>...
lwkmem=<n1>:<size1>,<n2>:<size2>,...
Where, n1,n2,.. are NUMA node numbers. size1,size2,.. are sizes of the LWKMEM requests on corresponding NUMA node.
Based on the system need, this default LWK partition can be deleted later after the boot-up and a new LWK partition can be created using the lwkctl command.
The specification value auto is not supported on the kernel command line for the lwkcpus or lwkmem parameters.
Tip: If you plan to use boot-time LWK partitioning, first use lwkctl to create the partition and display the resulting partition configuration. These values can then be used as a starting point for the boot time kernel parameters.
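A minimal sketch of that workflow, using only the lwkctl options described above:
# Create an automatically generated partition, then capture the raw specification
# for reuse as boot-time lwkcpus=/lwkmem= parameters
sudo lwkctl -c 'lwkcpus=auto lwkmem=auto'
lwkctl -s -r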
Note: The Linux interfaces for hot-plugging CPUs and memory cannot be used while an LWK partition is in place. The LWK partition must be deleted before using those Linux interfaces.
In order to validate a successful installation, perform the following steps on the compute nodes where mOS for HPC is installed.
To test that yod is functional, launch a simple application using yod:
$ yod /bin/echo hello
hello
If LWK memory is active then you should be able to see some LWK entries in the process mapping of an LWK process:
jeattine@skl-10:~>yod cat /proc/self/maps | grep LWK
155555327000-15555532b000 rw-p 00000000 00:00 0 [anon_private] [LWK]
1555554c4000-1555554e6000 rw-p 00000000 00:00 0 [anon_private] [LWK]
15555553e000-155555540000 rw-p 00000000 00:00 0 [anon_private] [LWK]
555555607000-555555609000 rw-p 00000000 00:00 0 [dbss] [LWK]
555555800000-555555821000 rw-p 00000000 00:00 0 [heap] [LWK]
The above example runs the cat program as an mOS process, reserving LWK memory resources for it.
Alternatively, you can use the lwkctl utility to view the mOS version and LWK configuration.
jeattine@skl-10:~>lwkctl -s
mOS version : 1.0
Linux CPU(s): 0,24,48,72 [ 4 CPU(s) ]
LWK CPU(s): 1-23,25-47,49-71,73-95 [ 92 CPU(s) ]
Utility CPU(s): 0,24,48,72 [ 4 CPU(s) ]
LWK/Linux GPU(s): [ 0 GPU(s) ]
LWK Memory(KB): 88080384 88080384 [ 2 NUMA nodes ]
$ lwkctl -s
mOS version : 1.0
Linux CPU(s): 0-1,18-19,68-69,86-87,136-137,154-155,204-205,222-223 [ 16 CPU(s) ]
LWK CPU(s): 2-17,20-67,70-85,88-135,138-153,156-203,206-221,224-271 [ 256 CPU(s) ]
Utility CPU(s): 0-1,18-19,68-69,86-87,136-137,154-155,204-205,222-223 [ 16 CPU(s) ]
LWK Memory(KB): 19922944 19922944 19922944 19922944 3145728 3145728 3145728 3145728 [ 8 NUMA nodes ]
CPU specification was automatically generated.
Memory specification was automatically generated.
Check the dmesg log for mOS entries (Intel(R) Xeon(R) example):
[ 2304.588958] mOS-mem: Initializing memory management. precise=no
[ 2304.706098] mOS-mem: Node 0: Free range [0x0000000298000000-0x00000015b7ffffff] pfn [2719744-22773759] [20054016] pages
[ 2304.716964] mOS-mem: Node 0: Offlining [0x0000000298000000-0x00000015b7ffffff] pfn [2719744-22773759] [20054016] pages
[ 2305.978386] mOS-mem: Node 0: Requested 78336 MB Allocated 78336 MB
[ 2306.178184] mOS-mem: Node 1: Free range [0x0000001a40000000-0x0000002dafffffff] pfn [27525120-47906815] [20381696] pages
[ 2306.189132] mOS-mem: Node 1: Offlining [0x0000001a40000000-0x0000002dafffffff] pfn [27525120-47906815] [20381696] pages
[ 2307.638936] mOS-mem: Node 1: Requested 79616 MB Allocated 79616 MB
[ 2307.645205] mOS-mem: Requested 157952 MB Allocated 157952 MB
[ 2307.650865] mOS: Configured LWK CPUs: 3-23,27-47,51-71,75-95
[ 2307.656521] mOS: Configured Utility CPUs: 0,24,48,72
[ 2307.661502] mOS: LWK CPU profile set to: normal
[ 2307.666376] mOS-sched: set unbound workqueue cpumask to 0-2,24-26,48-50,72-74
[ 2307.673507] mOS-sched: IDLE MWAIT enabled. Hints min/max=00000000/00000021. CPUID_MWAIT substates=00002020
[ 2324.295122] mOS-mem: Clearing memory in parallel on NUMA domains[0-1]
[ 2327.353908] mOS-mem: Clearing LWK memory[SUCCESS]
[ 2327.358624] mOS: Kernel processing request: [lwkcpus=0.3-12,51-60:48.13-23,61-71:24.27-36,75-84:72.37-47,85-95 lwkmem=0:78336M,1:79616M precise=no] [SUCCESS]
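To pull out just these entries on a running node, a simple filter such as the following can be used; the grep pattern is only illustrative:
# Show mOS-related kernel messages
dmesg | grep -E 'mOS(-mem|-sched)?:'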
When mOS is booted and managing resources, an obvious question is what common system tools tell you about the machine state. The table below summarizes their behavior.
Command / Tool | Notes
---|---
top, htop | Behave as expected, showing CPU utilization and process placement across CPUs. The LWK CPUs may show updates less frequently than Linux CPUs.
/proc/meminfo, free | By default show memory usage statistics for both Linux and the LWK. In addition, the mosview tool can be used to see only LWK-side usage or only Linux-side usage.
dmesg | The mOS kernel writes information to the syslog; this is a good place to check for operational health.
debugging and profiling tools | mOS maintains compatibility with Linux so that tools such as ptrace, strace, ftrace, and gdb continue to work as expected. In addition, Intel(R) Parallel Studio XE tools such as Intel(R) VTune(TM) Profiler and Intel(R) Advisor also work as designed.
Typically, the distribution of interrupts across CPUs is managed by a user space daemon called irqbalance, which is usually launched at boot by systemd. mOS supports the use of the irqbalance daemon for interrupt balancing; other, similar tools may require further adaptation in the mOS tool lwkctl.
In mOS the kernel tries to keep interrupts on LWK CPUs to a minimum. If the balancer attempts to affinitize an interrupt that is not on the approved list to LWK CPUs, the request is denied by the mOS code and an informational message is printed to the console:
__irq_set_affinity(irq 333, mask 15) can not affinitize only to LWKCPUs ret -22
The approved devices allowed to have interrupts on the LWK CPUs are typically the high-speed NICs and GPUs, since these devices contribute directly to application performance.
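To see whether any such requests have been denied on a node, you can search for the console message shown above, assuming it is retained in the kernel log:
# Look for denied attempts to affinitize interrupts to LWK CPUs
dmesg | grep 'can not affinitize'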