add-cng*: add noatime,flock,lazystatfs mount options for FSx Lustre #1133
DaisukeMiyamoto wants to merge 1 commit into
The Lustre mount was running with zero mount options (all client defaults). For ML training workloads this leaves performance on the table:
- noatime: eliminates an MDS round-trip per read I/O. Large-scale dataset reads (HuggingFace cache, checkpoint loading) generate millions of atime updates that provide no value and add metadata load. noatime is strictly superior to the default relatime for ML.
- flock: enables cluster-wide POSIX file locking via the Lustre DLM. Lustre 2.15 enables this by default at the server level, but explicit mount-option declaration ensures correctness regardless of server configuration. Required by HuggingFace Transformers, PyTorch distributed checkpoint writing, and many Python libraries that use fcntl/flock.
- lazystatfs: statfs() returns from client cache instead of querying every OST synchronously. Prevents training jobs from stalling when a single OST is momentarily slow (e.g. during OST rebalancing or metadata-heavy neighbor workloads). The data may be seconds stale, which is acceptable: no ML pipeline depends on real-time df accuracy.
Applied consistently to all 4 CNG templates (add-cng.yaml, add-cng-p5.yaml, add-cng-p6-b200.yaml, add-cng-p6-b300.yaml).
Further tuning (lctl runtime params, kernel module options, stripe configuration, sysctl) is documented in a separate performance report and will follow in a subsequent PR.
KeitaW
left a comment
Review Batch 1/3 — Deployment Pipeline & Operational Correctness
One blocking issue, repeated identically across the three GPU CNG templates: a malformed, orphaned shell block was committed alongside the intended one-line mount change. The CPU template (add-cng.yaml) got only the clean one-liner and is the reference for the correct end state. Each inline comment carries a one-click suggestion that deletes the stray block.
| # with noatime (the only option not already set by default). | ||
| - | | ||
| if [ ! -z "${FSxLustreFilesystemId}" ]; then | ||
| else | ||
| ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx | ||
| fi |
Malformed orphaned shell block (delete it)
The real change is the one-liner just above (mount -t lustre -o noatime,flock,lazystatfs …), which is correct and matches what landed cleanly in add-cng.yaml. The block below it is a leftover edit artifact, broken three ways: (1) then is followed immediately by else with no body — a hard bash parse error (syntax error near unexpected token 'else'); (2) the else branch is a bare FSx DNS string with no mount in front of it; (3) the mount it's trying to express was already done on the line above, so the block is dead weight.
Why it matters: cloud-init concatenates every runcmd entry into one /bin/sh script run as a single process, and a parse error aborts the rest of that script (the shell never even evaluates the if). Tracing the full template bounds the blast radius — the # Monitoring stack installation block is the last runcmd entry, and /fsx mount, /home fstab mount, and the PCS post-install hook all run on earlier entries. So mounts and cluster join are unaffected; the sole functional casualty is the monitoring install where DeployMonitoring=true. Still blocking, though: it's a shell syntax error shipped in three templates, so cloud-init exits non-zero and reports a degraded boot on every P5/P6 node, for a block that does nothing.
This slipped the test plan because the validation deploy was c6i.4xlarge (CPU → the clean add-cng.yaml); the three GPU templates were never instantiated, and CloudFormation validate-template does not parse shell inside UserData.
Deleting the block makes this template match the single-line change already correct in add-cng.yaml. Worth a follow-up deploy on one P5/P6 CNG to confirm.
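The failure mode can be reproduced locally without a deploy. The sketch below is an illustration under assumptions, not part of the PR: it writes a stand-in script (the temp file and the FSX_ID variable are hypothetical, standing in for the script cloud-init builds from runcmd) with the same empty then/else shape, lints it with sh -n (parse only, nothing executed), then runs it to show that commands before the broken block still execute while commands after it do not.

```shell
#!/bin/sh
# Stand-in for the script cloud-init assembles from runcmd entries.
# FSX_ID and the echo steps are hypothetical placeholders.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
echo "mount step ran"
if [ ! -z "$FSX_ID" ]; then
else
  echo "unreachable"
fi
echo "monitoring step ran"
EOF

# Parse-only check: non-zero status means a syntax error was found.
lint_status=0
sh -n "$tmp" 2>/dev/null || lint_status=$?

# Real run: the shell executes commands until it reaches the bad block.
run_status=0
run_output=$(sh "$tmp" 2>/dev/null) || run_status=$?
rm -f "$tmp"

echo "lint exit status: $lint_status"
echo "run exit status:  $run_status"
echo "run output:       $run_output"
```

A parse-only pass like this over the rendered UserData script would flag the block before launch, which validate-template does not do.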
| # with noatime (the only option not already set by default). | ||
| - | | ||
| if [ ! -z "${FSxLustreFilesystemId}" ]; then | ||
| else | ||
| ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx | ||
| fi |
Same orphaned block — delete it
Identical leftover block to the one in add-cng-p5.yaml. A then followed immediately by else, with no body in between, is a hard shell parse error that aborts the rest of the cloud-init runcmd script (only the monitoring install, the last entry, is functionally affected; mounts and PCS join run earlier). Delete it so this template matches the clean one-line change in add-cng.yaml.
| - if [ ! -z "${FSxLustreFilesystemId}" ]; then mkdir -p /fsx; mount -t lustre -o noatime,flock,lazystatfs ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx; chmod 1777 /fsx; fi | ||
| # with noatime (the only option not already set by default). | ||
| - | | ||
| if [ ! -z "${FSxLustreFilesystemId}" ]; then | ||
| else | ||
| ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx |
Same orphaned block — delete it
Identical leftover block to the one in add-cng-p5.yaml. A then followed immediately by else, with no body in between, is a hard shell parse error that aborts the rest of the cloud-init runcmd script (only the monitoring install, the last entry, is functionally affected; mounts and PCS join run earlier). Delete it so this template matches the clean one-line change in add-cng.yaml.
KeitaW
left a comment
Review Batch 2/3 — Documentation Consistency
Two small inline nits (a typo and a cross-doc number mismatch), plus one body-only item:
A second Test 10 link still points at the old anchor — tests/README.md line 70 (outside this diff). Renaming the heading to ## Test 10: FSx storage health and performance changes its anchor to #test-10-fsx-storage-health-and-performance. The table row at line 46 was updated (good), but a second table at line 70 still links to the old #test-10-fsx-storage-health, which is now dangling. A repo code-search found no other references, so it's contained to this file. Could you update line 70 to the new anchor as well so both tables resolve?
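For reference, the new anchor can be derived mechanically from the heading text. The helper below is a rough approximation of GitHub's slug rule (lowercase, drop punctuation, spaces to hyphens), not GitHub's actual implementation; the function name heading_anchor is made up for this sketch.

```shell
#!/bin/sh
# Approximate a GitHub heading anchor: lowercase the text, drop every
# character that is not a letter, digit, space, underscore or hyphen,
# then turn each space into a hyphen. An approximation only.
heading_anchor() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9 _-]//g' -e 's/ /-/g'
}

new_anchor=$(heading_anchor "Test 10: FSx storage health and performance")
echo "#$new_anchor"
```

Note that a plain search for the old anchor string also matches the new one, since the old anchor is a prefix of it; each hit in tests/README.md needs a look to see which form it is.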
| On a small filesystem (2 OSTs, MDS far from saturation), single-node deltas | ||
| are in the noise. Multi-node shows the beginning of MDS contention relief from | ||
| `noatime`. At production scale (64+ nodes, 10+ OSTs), improvements from | ||
| `noatime` + `mdc` tunables are expected to be 10–30×× larger. |
Typo: doubled multiplication sign
10–30×× has a doubled ×.
| `noatime` + `mdc` tunables are expected to be 10–30×× larger. | |
| `noatime` + `mdc` tunables are expected to be 10–30× larger. |
| | Environment | relatime (default) | noatime | Delta | | ||
| |---|---|---|---| | ||
| | 1 node, 10K stat | 1851 ms | 1930 ms | ±noise | | ||
| | 4 nodes × 4 procs (16 streams), 10K stat | 5033 ms | 4812 ms | **-4%** | |
Align benchmark delta with the other doc
The same 5033 ms → 4812 ms result is reported as -4.4% in tests/README.md (line 902) and the PR body, but -4% here. The precise figure is -4.4%; aligning keeps the two documents telling the same number.
| | 4 nodes × 4 procs (16 streams), 10K stat | 5033 ms | 4812 ms | **-4%** | | |
| | 4 nodes × 4 procs (16 streams), 10K stat | 5033 ms | 4812 ms | **-4.4%** | |
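The arithmetic behind the figure, as a small sketch (pct_delta is a hypothetical helper name; the two timings are the ones in the table row):

```shell
#!/bin/sh
# Percent change from a baseline timing to a candidate timing,
# rounded to one decimal place.
pct_delta() {
  awk -v base="$1" -v cand="$2" \
    'BEGIN { printf "%.1f", (cand - base) / base * 100 }'
}

delta=$(pct_delta 5033 4812)   # relatime -> noatime, in ms
echo "${delta}%"
```

(4812 - 5033) / 5033 is about -0.0439, so the one-decimal figure is -4.4%.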
KeitaW
left a comment
Review Batch 3/3 — Evaluation Methodology
One inline note on how the headline benchmark number is framed.
Things That Look Great
- The add-cng.yaml (CPU) change is exactly the surgical one-liner this PR should be everywhere: -o noatime,flock,lazystatfs and nothing else. It's the reference for the three GPU templates.
- Honest measurement framing: single-node deltas are labeled ±noise rather than dressed up as wins, with an explicit why (2-OST filesystem, MDS far from saturation).
- The Part B A/B methodology (mount -o remount,relatime / remount,noatime across login + compute nodes, cache-drops between runs) is a genuinely reusable regression harness.
- OPERATIONS.md §5.1's "why not fstab?" explanation (Lustre needs the kernel module + LNet up, which _netdev alone doesn't guarantee) is exactly the tribal knowledge worth writing down.
- §5.2–5.5 tunables are correctly scoped as opt-in ("not set by templates by default") with clear "when to apply / when to skip" guidance.
Sources
- cloud-init runcmd semantics (entries concatenated into one script, run under a single shell): https://cloudinit.readthedocs.io/en/latest/reference/modules.html#runcmd
| | Metric | relatime | noatime | Delta | | ||
| |---|---|---|---| | ||
| | 16-stream stat 10K files | 5033 ms | 4812 ms | **-4.4%** | |
The -4.4% multi-node result is a single A/B pair
The Part B harness itself is solid — cache drops between runs, remount A/B without a redeploy, multi-node concurrent stat as the primary indicator. My one concern is presentation: 5033 ms → 4812 ms = -4.4% is a single before/after pair (n=1), and a ~4% delta on a metadata-stat benchmark is within typical run-to-run variance. The doc is admirably honest about the single-node numbers (labeling them ±noise), so it's a little inconsistent that the multi-node figure is then carried into the "regression baseline" as a measured improvement. Could you either report n (repetitions) and a variance/spread for that row, or label the -4.4% as indicative rather than a confirmed improvement? The >10% regression gate is sensible precisely because deltas this small aren't distinguishable from noise on a 2-OST filesystem — worth saying so next to the number.
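One possible way to report n and spread for that row is sketched below. The helper name summarize is made up, and the three timings passed to it are placeholder values chosen for easy arithmetic, NOT measurements from this PR or its benchmark runs.

```shell
#!/bin/sh
# Summarize repeated timings (ms): count, mean, sample standard deviation,
# and the standard deviation as a percentage of the mean.
summarize() {
  printf '%s\n' "$@" | awk '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
      mean = sum / n
      var = (n > 1) ? (sumsq - n * mean * mean) / (n - 1) : 0
      if (var < 0) var = 0   # guard against rounding below zero
      sd = sqrt(var)
      printf "n=%d mean=%.1f sd=%.1f cv=%.1f%%\n", n, mean, sd, sd / mean * 100
    }'
}

# Placeholder inputs for illustration only, not real benchmark data.
summary=$(summarize 5000 5100 4900)
echo "$summary"
```

If the spread across repetitions of the same configuration is comparable to the relatime/noatime gap, the row reads as indicative; if it is much smaller, the improvement can be stated with more confidence.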
Summary
Adds -o noatime,flock,lazystatfs to the Lustre mount command in all four CNG templates. Purely additive: the only net-new behaviour is noatime (FSx for Lustre 2.15 already enables flock and lazystatfs server-side by default; the explicit options are documentation + forward-compatibility).
What each option does
- noatime
- flock
- lazystatfs: statfs() returns from cache, not blocking on all OSTs
Why not fstab?
The /home (OpenZFS/NFS) mount uses fstab because NFS + _netdev reliably waits for network. Lustre needs the lustre kernel module + LNet initialized before mount; _netdev alone doesn't guarantee this ordering. Explicit mount in cloud-init runcmd is the most portable approach.
Validation (us-east-2, 2026-06-11)
Deployed via pcs-ml-cluster-deploy-all.yaml (c6i.4xlarge, 4 nodes, PERSISTENT_2 1.2 TiB / 2 OST). A/B comparison: mount -o remount,noatime vs remount,relatime on all nodes with cache-drop between runs.
No regression on any metric. Multi-node improvement confirmed. On small filesystems (2 OSTs) the MDS is far from saturation so single-node delta is in the noise; at production scale (64+ nodes) the effect is proportionally larger.
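A minimal single-node sketch of one A/B round follows, assuming the mount point is /fsx and that the cache drop is done through /proc/sys/vm/drop_caches (the PR text says only "cache-drop"). The benchmark command ./stat_bench.sh and the helper names run and ab_round are hypothetical. It defaults to a dry run that prints the commands instead of executing them, since the real commands need root and a mounted filesystem.

```shell
#!/bin/sh
# Dry run by default: print each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
MOUNTPOINT=${MOUNTPOINT:-/fsx}

run() {
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# ab_round MODE CMD...: remount with the given atime mode, drop caches,
# then run the benchmark command.
ab_round() {
  mode=$1; shift
  run mount -o "remount,$mode" "$MOUNTPOINT"
  run sync
  run sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  run "$@"
}

# ./stat_bench.sh is a hypothetical stand-in for the real benchmark.
plan=$(ab_round relatime ./stat_bench.sh; ab_round noatime ./stat_bench.sh)
echo "$plan"
```

Running the same sequence on every node at once, as the validation describes, would need a job launcher or parallel shell on top of this.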
Documentation
- docs/OPERATIONS.md: new §5.1–5.5 covering mount options rationale + benchmark results; recommended lctl runtime tunables (Phase 2 reference); stripe configuration guidance; kernel module params; OS-level sysctl. All marked as "not set by templates by default", i.e. opt-in guidance for production.
- tests/README.md: Test 10 restructured as a reusable performance regression/improvement test procedure (Part A: health check; Part B: benchmark suite with A/B comparison methodology, multi-node tests, and regression criteria >10% = block).
Test plan
- aws cloudformation validate-template clean
- noatime in /proc/mounts
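The /proc/mounts check could be scripted along these lines. This is a sketch: missing_mount_opts is a made-up helper, and the sample mounts line (filesystem ID, mount name) is invented so the snippet runs anywhere. Whether flock and lazystatfs are listed in /proc/mounts may depend on the Lustre client version; the test plan above names only noatime.

```shell
#!/bin/sh
# missing_mount_opts MOUNTS_FILE MOUNTPOINT OPT...
# Prints the requested options that are absent from the lustre mount at
# MOUNTPOINT (empty output means all are present).
missing_mount_opts() {
  mounts_file=$1; mountpoint=$2; shift 2
  opts=$(awk -v mp="$mountpoint" \
    '$2 == mp && $3 == "lustre" { print $4; exit }' "$mounts_file")
  missing=""
  for want in "$@"; do
    case ",$opts," in
      *",$want,"*) ;;                    # option present
      *) missing="$missing $want" ;;     # option absent
    esac
  done
  printf '%s' "${missing# }"
}

# Invented sample line in /proc/mounts format, for illustration only.
sample=$(mktemp)
echo "fs-example.fsx.us-east-2.amazonaws.com@tcp:/example /fsx lustre rw,noatime,flock,lazystatfs 0 0" > "$sample"

all_present=$(missing_mount_opts "$sample" /fsx noatime flock lazystatfs)
one_missing=$(missing_mount_opts "$sample" /fsx noatime relatime)
rm -f "$sample"
echo "missing (expected none): '$all_present'"
echo "missing: '$one_missing'"
```

On a node, the first argument would be /proc/mounts instead of the sample file.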