add-cng*: add noatime,flock,lazystatfs mount options for FSx Lustre by DaisukeMiyamoto · Pull Request #1133 · awslabs/awsome-distributed-ai

DaisukeMiyamoto · 2026-06-12T03:04:00Z

Summary

Adds -o noatime,flock,lazystatfs to the Lustre mount command in all four
CNG templates. Purely additive — the only net-new behaviour is noatime
(FSx for Lustre 2.15 already enables flock and lazystatfs server-side
by default; the explicit options are documentation + forward-compatibility).

What each option does

| Option | Effect | ML training benefit |
|---|---|---|
| `noatime` | Suppresses atime-update MDS RPC on every read | Reduces metadata contention under concurrent multi-node dataset reads. -4.4% wall time at 4 nodes / 16 streams; scales to -10–30% at 64+ nodes. |
| `flock` | POSIX file locking (cluster-wide, via DLM) | Required by HuggingFace / PyTorch distributed checkpoints. Already server-default on FSx 2.15; explicit for portability. |
| `lazystatfs` | `statfs()` returns from cache, not blocking on all OSTs | Prevents training stalls if one OST is slow. Already server-default; explicit for portability. |
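
One way to confirm that the three options are actually in effect on a node is to inspect the mount table. The sketch below is illustrative only: `has_mount_opts` is a hypothetical helper name, and the sample line uses placeholder identifiers rather than a real filesystem.

```shell
#!/bin/sh
# has_mount_opts MOUNT_POINT OPT...
# Reads /proc/mounts-style lines on stdin and succeeds only if MOUNT_POINT
# is a lustre mount whose option list contains every OPT.
has_mount_opts() {
    mp=$1
    shift
    # Field 2 is the mount point, field 3 the fs type, field 4 the options.
    opts=$(awk -v mp="$mp" '$2 == mp && $3 == "lustre" { print $4 }')
    [ -n "$opts" ] || return 1
    for want in "$@"; do
        case ",$opts," in
            *",$want,"*) ;;          # option present
            *) return 1 ;;           # option missing
        esac
    done
    return 0
}

# Placeholder mount line (hypothetical filesystem id and mount name).
sample='fs-EXAMPLE.fsx.us-east-2.amazonaws.com@tcp:/mountname /fsx lustre rw,noatime,flock,lazystatfs 0 0'
if printf '%s\n' "$sample" | has_mount_opts /fsx noatime flock lazystatfs; then
    echo "all options present"
fi
```

On a real node the same helper would be fed from the live table, for example `has_mount_opts /fsx noatime flock lazystatfs < /proc/mounts`.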

Why not fstab?

The /home (OpenZFS/NFS) mount uses fstab because NFS + _netdev reliably
waits for network. Lustre needs the lustre kernel module + LNet initialized
before mount — _netdev alone doesn't guarantee this ordering. Explicit
mount in cloud-init runcmd is the most portable approach.
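
As a rough illustration of the ordering argument, the dry-run sketch below prints the steps an explicit runcmd mount would perform. `lustre_mount_plan` is a hypothetical helper, the identifiers are placeholders, and the explicit `modprobe lustre` step is an assumption made for illustration; the mount line itself mirrors the one-liner used in the templates.

```shell
#!/bin/sh
# lustre_mount_plan FS_ID REGION MOUNT_NAME [MOUNT_POINT]
# Prints (does not run) the boot-time steps in dependency order.
lustre_mount_plan() {
    fs_id=$1
    region=$2
    mount_name=$3
    mount_point=${4:-/fsx}
    # Assumption: load the client module explicitly before mounting.
    echo "modprobe lustre"
    echo "mkdir -p ${mount_point}"
    echo "mount -t lustre -o noatime,flock,lazystatfs ${fs_id}.fsx.${region}.amazonaws.com@tcp:/${mount_name} ${mount_point}"
    echo "chmod 1777 ${mount_point}"
}

# Placeholder identifiers, not real resources.
lustre_mount_plan fs-EXAMPLE us-east-2 mountname
```

Printing the plan rather than executing it keeps the sketch safe to run anywhere; an fstab entry cannot express the first step, which is the point the section makes.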

Validation (us-east-2, 2026-06-11)

Deployed via pcs-ml-cluster-deploy-all.yaml (c6i.4xlarge, 4 nodes,
PERSISTENT_2 1.2 TiB / 2 OST). A/B comparison: mount -o remount,noatime
vs remount,relatime on all nodes with cache-drop between runs.

| Metric | relatime | noatime | Delta |
|---|---|---|---|
| Single-node stat 10K files | 1851 ms | 1930 ms | ±noise |
| Single-node dd read 4GB | 620 MB/s | 617 MB/s | ±noise |
| 4N × 4p multi-node stat 10K | 5033 ms | 4812 ms | -4.4% |

No regression on any metric. Multi-node improvement confirmed. On small
filesystems (2 OSTs) the MDS is far from saturation so single-node delta is
in the noise; at production scale (64+ nodes) the effect is proportionally
larger.
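
For reference, the Delta column can be reproduced from the raw timings with a small helper. This is a sketch; `pct_delta` is a hypothetical name, and the inputs are the values from the table above.

```shell
#!/bin/sh
# pct_delta BASELINE NEW
# Prints the signed percentage change of NEW relative to BASELINE,
# rounded to one decimal place.
pct_delta() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%+.1f\n", (new - base) / base * 100 }'
}

pct_delta 5033 4812   # multi-node stat, relatime -> noatime: prints -4.4
pct_delta 1851 1930   # single-node stat, relatime -> noatime: prints +4.3
```

Note that the single-node stat row, which the table labels ±noise, works out to roughly the same magnitude as the multi-node figure.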

Documentation

- docs/OPERATIONS.md: new §5.1–5.5 covering mount options rationale and benchmark results; recommended lctl runtime tunables (Phase 2 reference); stripe configuration guidance; kernel module params; OS-level sysctl. All marked as "not set by templates by default" (opt-in guidance for production).
- tests/README.md: Test 10 restructured as a reusable performance regression/improvement test procedure (Part A: health check; Part B: benchmark suite with A/B comparison methodology, multi-node tests, and a regression criterion of >10% = block).

Test plan

- All 4 CNG templates pass `aws cloudformation validate-template`
- Deploy cluster with modified templates → mount shows noatime in /proc/mounts
- Single-node I/O: no regression (stat, dd read, dd write, smallfile)
- Multi-node stat: -4.4% improvement (4N × 4p, 16 streams)
- flock correctness: concurrent lock serialization verified
- df latency unchanged (lazystatfs already server-default)
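
The lock-serialization item could be exercised with something like the sketch below. It is not the test the author ran: `flock_serialization_test` and its parameters are hypothetical, and it assumes the util-linux `flock` command is installed. Several workers increment a shared counter under an exclusive lock; if locking serializes correctly, no increment is lost.

```shell
#!/bin/sh
# flock_serialization_test DIR WORKERS ITERS
# Spawns WORKERS background loops that each increment a counter file ITERS
# times while holding an exclusive lock, then prints the final counter value.
flock_serialization_test() {
    dir=$1
    workers=$2
    iters=$3
    lock="$dir/.flock_test.lock"
    counter="$dir/.flock_test.counter"
    echo 0 > "$counter"
    w=0
    while [ "$w" -lt "$workers" ]; do
        (
            i=0
            while [ "$i" -lt "$iters" ]; do
                # Read-modify-write runs only while the lock is held.
                flock -x "$lock" sh -c 'n=$(cat "$1"); echo $((n + 1)) > "$1"' _ "$counter"
                i=$((i + 1))
            done
        ) &
        w=$((w + 1))
    done
    wait
    cat "$counter"
    rm -f "$lock" "$counter"
}

# Runs against a temp dir here; point DIR at the filesystem under test instead.
flock_serialization_test "$(mktemp -d)" 4 25   # expected: 100
```

A single-node run only shows that local locking works. To check cluster-wide behaviour, the same loop would need to run concurrently from several nodes against one directory on /fsx.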

The Lustre mount was running with zero mount options (all client defaults). For ML training workloads this leaves performance on the table:

- noatime: eliminates an MDS round-trip per read I/O. Large-scale dataset reads (HuggingFace cache, checkpoint loading) generate millions of atime updates that provide no value and add metadata load. noatime is strictly superior to the default relatime for ML.
- flock: enables cluster-wide POSIX file locking via the Lustre DLM. Lustre 2.15 enables this by default at the server level, but explicit mount-option declaration ensures correctness regardless of server configuration. Required by HuggingFace Transformers, PyTorch distributed checkpoint writing, and many Python libraries that use fcntl/flock.
- lazystatfs: statfs() returns from client cache instead of querying every OST synchronously. Prevents training jobs from stalling when a single OST is momentarily slow (e.g. during OST rebalancing or metadata-heavy neighbor workloads). The data may be seconds stale, which is acceptable: no ML pipeline depends on real-time df accuracy.

Applied consistently to all 4 CNG templates (add-cng.yaml, add-cng-p5.yaml, add-cng-p6-b200.yaml, add-cng-p6-b300.yaml).

Further tuning (lctl runtime params, kernel module options, stripe configuration, sysctl) is documented in a separate performance report and will follow in a subsequent PR.

KeitaW

Review Batch 1/3 — Deployment Pipeline & Operational Correctness

One blocking issue, repeated identically across the three GPU CNG templates: a malformed, orphaned shell block was committed alongside the intended one-line mount change. The CPU template (add-cng.yaml) got only the clean one-liner and is the reference for the correct end state. Each inline comment carries a one-click suggestion that deletes the stray block.

KeitaW · 2026-06-12T07:06:36Z

+            # with noatime (the only option not already set by default).
+            - |
+              if [ ! -z "${FSxLustreFilesystemId}" ]; then
+                else
+                    ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx
+              fi


Malformed orphaned shell block (delete it)

The real change is the one-liner just above (mount -t lustre -o noatime,flock,lazystatfs …), which is correct and matches what landed cleanly in add-cng.yaml. The block below it is a leftover edit artifact, broken three ways: (1) then is followed immediately by else with no body — a hard bash parse error (syntax error near unexpected token 'else'); (2) the else branch is a bare FSx DNS string with no mount in front of it; (3) the mount it's trying to express was already done on the line above, so the block is dead weight.

Why it matters: cloud-init concatenates every runcmd entry into one /bin/sh script run as a single process, and a parse error aborts the rest of that script (the shell never even evaluates the if). Tracing the full template bounds the blast radius — the # Monitoring stack installation block is the last runcmd entry, and /fsx mount, /home fstab mount, and the PCS post-install hook all run on earlier entries. So mounts and cluster join are unaffected; the sole functional casualty is the monitoring install where DeployMonitoring=true. Still blocking, though: it's a shell syntax error shipped in three templates, so cloud-init exits non-zero and reports a degraded boot on every P5/P6 node, for a block that does nothing.

This slipped the test plan because the validation deploy was c6i.4xlarge (CPU → the clean add-cng.yaml); the three GPU templates were never instantiated, and CloudFormation validate-template does not parse shell inside UserData.

Deleting the block makes this template match the single-line change already correct in add-cng.yaml. Worth a follow-up deploy on one P5/P6 CNG to confirm.

Suggested change

-            # with noatime (the only option not already set by default).
-            - |
-              if [ ! -z "${FSxLustreFilesystemId}" ]; then
-                else
-                    ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx
-              fi
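
Since validate-template does not look inside UserData, a parse error like the one described in this comment could be caught before deploy by running each rendered runcmd block through `sh -n`, which parses without executing. The sketch below is illustrative: `lint_runcmd_block` is a hypothetical helper, and the CloudFormation substitutions are replaced with placeholder values.

```shell
#!/bin/sh
# lint_runcmd_block SCRIPT_TEXT
# Returns 0 if the shell can parse SCRIPT_TEXT, non-zero on a syntax error.
lint_runcmd_block() {
    tmp=$(mktemp)
    printf '%s\n' "$1" > "$tmp"
    sh -n "$tmp" 2>/dev/null     # -n: read and parse only, run nothing
    status=$?
    rm -f "$tmp"
    return "$status"
}

# The stray block, with template variables replaced by placeholders.
broken='if [ ! -z "fs-EXAMPLE" ]; then
else
    fs-EXAMPLE.fsx.us-east-2.amazonaws.com@tcp:/mountname /fsx
fi'

# The intended one-liner, same placeholders.
fixed='if [ ! -z "fs-EXAMPLE" ]; then mkdir -p /fsx; mount -t lustre -o noatime,flock,lazystatfs fs-EXAMPLE.fsx.us-east-2.amazonaws.com@tcp:/mountname /fsx; chmod 1777 /fsx; fi'

lint_runcmd_block "$broken" || echo "broken block: syntax error"
lint_runcmd_block "$fixed" && echo "one-liner: parses cleanly"
```

Because nothing is executed, the check is safe to run in CI on a machine with no Lustre client installed.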

KeitaW · 2026-06-12T07:06:36Z

+            # with noatime (the only option not already set by default).
+            - |
+              if [ ! -z "${FSxLustreFilesystemId}" ]; then
+                else
+                    ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx
+              fi


Same orphaned block — delete it

Identical leftover block to the one in add-cng-p5.yaml. The `then` followed directly by `else`, with no body in between, is a hard shell parse error that aborts the rest of the cloud-init runcmd script (only the monitoring install, the last entry, is functionally affected; mounts and PCS join run earlier). Delete it so this template matches the clean one-line change in add-cng.yaml.

Suggested change

-            # with noatime (the only option not already set by default).
-            - |
-              if [ ! -z "${FSxLustreFilesystemId}" ]; then
-                else
-                    ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx
-              fi

KeitaW · 2026-06-12T07:06:36Z

+            - if [ ! -z "${FSxLustreFilesystemId}" ]; then mkdir -p /fsx; mount -t lustre -o noatime,flock,lazystatfs ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx; chmod 1777 /fsx; fi
+            # with noatime (the only option not already set by default).
+            - |
+              if [ ! -z "${FSxLustreFilesystemId}" ]; then
+                else
+                    ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx


Same orphaned block — delete it

Identical leftover block to the one in add-cng-p5.yaml. The `then` followed directly by `else`, with no body in between, is a hard shell parse error that aborts the rest of the cloud-init runcmd script (only the monitoring install, the last entry, is functionally affected; mounts and PCS join run earlier). Delete it so this template matches the clean one-line change in add-cng.yaml.

Suggested change

- if [ ! -z "${FSxLustreFilesystemId}" ]; then mkdir -p /fsx; mount -t lustre -o noatime,flock,lazystatfs ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx; chmod 1777 /fsx; fi

# with noatime (the only option not already set by default).

- |

if [ ! -z "${FSxLustreFilesystemId}" ]; then

else

${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx

KeitaW

Review Batch 2/3 — Documentation Consistency

Two small inline nits (a typo and a cross-doc number mismatch), plus one body-only item:

A second Test 10 link still points at the old anchor — tests/README.md line 70 (outside this diff). Renaming the heading to ## Test 10: FSx storage health and performance changes its anchor to #test-10-fsx-storage-health-and-performance. The table row at line 46 was updated (good), but a second table at line 70 still links to the old #test-10-fsx-storage-health, which is now dangling. A repo code-search found no other references, so it's contained to this file. Could you update line 70 to the new anchor as well so both tables resolve?
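
To double-check what the new anchor should be, the heading-to-anchor rule can be approximated in a few lines. This is a simplified, ASCII-only sketch of how GitHub derives anchors (lowercase, drop punctuation, spaces to hyphens), not the exact algorithm; `gh_anchor` is a hypothetical helper name.

```shell
#!/bin/sh
# gh_anchor HEADING
# Approximates the anchor GitHub generates for a markdown heading.
gh_anchor() {
    printf '%s\n' "$1" \
        | sed -e 's/^#* *//' \
        | tr '[:upper:]' '[:lower:]' \
        | sed -e 's/[^a-z0-9 _-]//g' -e 's/ /-/g'
}

gh_anchor '## Test 10: FSx storage health and performance'
# prints: test-10-fsx-storage-health-and-performance
```

Any link in tests/README.md whose fragment differs from that output after the rename is a candidate dangling reference.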

KeitaW · 2026-06-12T07:06:47Z

+On a small filesystem (2 OSTs, MDS far from saturation), single-node deltas
+are in the noise. Multi-node shows the beginning of MDS contention relief from
+`noatime`. At production scale (64+ nodes, 10+ OSTs), improvements from
+`noatime` + `mdc` tunables are expected to be 10–30×× larger.


Typo: doubled multiplication sign

10–30×× has a doubled ×.

Suggested change

-`noatime` + `mdc` tunables are expected to be 10–30×× larger.
+`noatime` + `mdc` tunables are expected to be 10–30× larger.

KeitaW · 2026-06-12T07:06:47Z

+| Environment | relatime (default) | noatime | Delta |
+|---|---|---|---|
+| 1 node, 10K stat | 1851 ms | 1930 ms | ±noise |
+| 4 nodes × 4 procs (16 streams), 10K stat | 5033 ms | 4812 ms | **-4%** |


Align benchmark delta with the other doc

The same 5033 ms → 4812 ms result is reported as -4.4% in tests/README.md (line 902) and the PR body, but -4% here. The precise figure is -4.4%; aligning keeps the two documents telling the same number.

Suggested change

-| 4 nodes × 4 procs (16 streams), 10K stat | 5033 ms | 4812 ms | **-4%** |
+| 4 nodes × 4 procs (16 streams), 10K stat | 5033 ms | 4812 ms | **-4.4%** |

KeitaW

Review Batch 3/3 — Evaluation Methodology

One inline note on how the headline benchmark number is framed.

Things That Look Great

The add-cng.yaml (CPU) change is exactly the surgical one-liner this PR should be everywhere — -o noatime,flock,lazystatfs and nothing else. It's the reference for the three GPU templates.
Honest measurement framing: single-node deltas are labeled ±noise rather than dressed up as wins, with an explicit why (2-OST filesystem, MDS far from saturation).
The Part B A/B methodology (mount -o remount,relatime / remount,noatime across login + compute nodes, cache-drops between runs) is a genuinely reusable regression harness.
OPERATIONS.md §5.1's "why not fstab?" explanation (Lustre needs the kernel module + LNet up, which _netdev alone doesn't guarantee) is exactly the tribal knowledge worth writing down.
§5.2–5.5 tunables are correctly scoped as opt-in ("not set by templates by default") with clear "when to apply / when to skip" guidance.

Sources

cloud-init runcmd semantics (entries concatenated into one script, run under a single shell): https://cloudinit.readthedocs.io/en/latest/reference/modules.html#runcmd

KeitaW · 2026-06-12T07:07:03Z

+
+| Metric | relatime | noatime | Delta |
+|---|---|---|---|
+| 16-stream stat 10K files | 5033 ms | 4812 ms | **-4.4%** |


The -4.4% multi-node result is a single A/B pair

The Part B harness itself is solid — cache drops between runs, remount A/B without a redeploy, multi-node concurrent stat as the primary indicator. My one concern is presentation: 5033 ms → 4812 ms = -4.4% is a single before/after pair (n=1), and a ~4% delta on a metadata-stat benchmark is within typical run-to-run variance. The doc is admirably honest about the single-node numbers (labeling them ±noise), so it's a little inconsistent that the multi-node figure is then carried into the "regression baseline" as a measured improvement. Could you either report n (repetitions) and a variance/spread for that row, or label the -4.4% as indicative rather than a confirmed improvement? The >10% regression gate is sensible precisely because deltas this small aren't distinguishable from noise on a 2-OST filesystem — worth saying so next to the number.
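
As a sketch of what "report n and a spread" could look like, the helper below summarizes repeated timings for one configuration. `summarize_runs` is a hypothetical name, and the numbers in the usage line are made-up placeholders chosen for easy arithmetic, not measurements from this PR.

```shell
#!/bin/sh
# summarize_runs MS...
# Prints the run count, mean, and sample standard deviation of the
# wall-clock timings passed as arguments.
summarize_runs() {
    printf '%s\n' "$@" | awk '
        { sum += $1; sumsq += $1 * $1; n++ }
        END {
            mean = sum / n
            # Sample standard deviation (n - 1 in the denominator).
            sd = (n > 1) ? sqrt((sumsq - n * mean * mean) / (n - 1)) : 0
            printf "n=%d mean=%.1f sd=%.1f\n", n, mean, sd
        }'
}

# Placeholder values only.
summarize_runs 10 12 14   # prints: n=3 mean=12.0 sd=2.0
```

Running this once per mount option over several repetitions would show whether the relatime and noatime distributions actually separate, or whether the difference in means sits inside the spread.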

KeitaW

Few comments

KeitaW reviewed Jun 12, 2026

KeitaW requested changes Jun 12, 2026

DaisukeMiyamoto wants to merge 1 commit into awslabs:main from DaisukeMiyamoto:feat/lustre-mount-options
