
add-cng*: add noatime,flock,lazystatfs mount options for FSx Lustre #1133

Open
DaisukeMiyamoto wants to merge 1 commit into awslabs:main from DaisukeMiyamoto:feat/lustre-mount-options

@DaisukeMiyamoto

Contributor

Summary

Adds -o noatime,flock,lazystatfs to the Lustre mount command in all four
CNG templates. Purely additive — the only net-new behaviour is noatime
(FSx for Lustre 2.15 already enables flock and lazystatfs server-side
by default; the explicit options are documentation + forward-compatibility).

What each option does

| Option | Effect | ML training benefit |
|---|---|---|
| noatime | Suppresses atime-update MDS RPC on every read | Reduces metadata contention under concurrent multi-node dataset reads. -4.4% wall time at 4 nodes / 16 streams; expected to scale to -10–30% at 64+ nodes. |
| flock | POSIX file locking (cluster-wide, via DLM) | Required by HuggingFace / PyTorch distributed checkpoints. Already server-default on FSx 2.15; explicit for portability. |
| lazystatfs | statfs() returns from cache, not blocking on all OSTs | Prevents training stalls if one OST is slow. Already server-default; explicit for portability. |
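
One way to confirm the options are in effect on a node is to read them back from /proc/mounts. The following is a minimal sketch, assuming the filesystem is mounted at /fsx as in the templates. `has_mount_opts` is a hypothetical helper name, and the exact option string the Lustre client prints may differ from the three names passed to mount.

```shell
# has_mount_opts MOUNTPOINT "opt1 opt2 ..." [MOUNTS_FILE]
# Returns 0 if every listed option is present, 1 if one is missing,
# 2 if the mount point is not found. MOUNTS_FILE defaults to /proc/mounts.
has_mount_opts() {
  local mnt="$1" want="$2" file="${3:-/proc/mounts}" opts opt
  # Field 2 is the mount point, field 4 the comma-separated option list.
  opts=$(awk -v m="$mnt" '$2 == m { print $4; exit }' "$file")
  [ -n "$opts" ] || return 2
  for opt in $want; do
    case ",$opts," in
      *",$opt,"*) ;;
      *) echo "missing: $opt" >&2; return 1 ;;
    esac
  done
}

# Example (on a node with /fsx mounted):
#   has_mount_opts /fsx "noatime flock lazystatfs" && echo "options present"
```

Matching on `,opt,` rather than a plain substring keeps `atime` from matching `noatime` or `relatime`.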

Why not fstab?

The /home (OpenZFS/NFS) mount uses fstab because NFS + _netdev reliably
waits for network. Lustre needs the lustre kernel module + LNet initialized
before mount — _netdev alone doesn't guarantee this ordering. Explicit
mount in cloud-init runcmd is the most portable approach.
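
To illustrate the ordering concern, here is a sketch of a small retry wrapper that a runcmd entry could put around the mount. This is an assumption for illustration, not something the templates in this PR do: they issue a single explicit mount. `retry` is a hypothetical helper, and the filesystem DNS name and mount name in the comment are placeholders.

```shell
# retry MAX DELAY CMD...: run CMD until it succeeds or MAX attempts are used.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      return 1                      # give up after MAX failed attempts
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Hypothetical use inside a runcmd entry (placeholder names):
#   modprobe lustre
#   retry 5 10 mount -t lustre -o noatime,flock,lazystatfs \
#     fs-0123456789abcdef0.fsx.us-east-2.amazonaws.com@tcp:/abcdefgh /fsx
```

A retry only papers over a module or LNet that comes up late; it does not replace loading the module before the mount.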

Validation (us-east-2, 2026-06-11)

Deployed via pcs-ml-cluster-deploy-all.yaml (c6i.4xlarge, 4 nodes,
PERSISTENT_2 1.2 TiB / 2 OST). A/B comparison: mount -o remount,noatime
vs remount,relatime on all nodes with cache-drop between runs.

| Metric | relatime | noatime | Delta |
|---|---|---|---|
| Single-node stat 10K files | 1851 ms | 1930 ms | ±noise |
| Single-node dd read 4GB | 620 MB/s | 617 MB/s | ±noise |
| 4N × 4p multi-node stat 10K | 5033 ms | 4812 ms | -4.4% |

No regression on any metric. Multi-node improvement confirmed. On small
filesystems (2 OSTs) the MDS is far from saturation so single-node delta is
in the noise; at production scale (64+ nodes) the effect is proportionally
larger.
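
For reference, the Delta column can be reproduced from the two timings in each row. A minimal sketch, with `pct_delta` as a hypothetical helper name; it computes the relative change from the relatime value to the noatime value.

```shell
# pct_delta BASELINE CANDIDATE: signed percentage change, one decimal place.
pct_delta() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%+.1f\n", (b - a) / a * 100 }'
}

pct_delta 5033 4812   # multi-node stat row   -> -4.4
pct_delta 1851 1930   # single-node stat row  -> +4.3
```

Note that the single-node row, labeled ±noise, moves by about the same magnitude in the opposite direction.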

Documentation

  • docs/OPERATIONS.md — new §5.1–5.5: mount options rationale + benchmark
    results; recommended lctl runtime tunables (Phase 2 reference); stripe
    configuration guidance; kernel module params; OS-level sysctl. All marked
    as "not set by templates by default" — opt-in guidance for production.
  • tests/README.md — Test 10 restructured as a reusable performance
    regression/improvement test procedure (Part A: health check; Part B:
    benchmark suite with A/B comparison methodology, multi-node tests, and
    regression criteria >10% = block).

Test plan

  • All 4 CNG templates pass aws cloudformation validate-template
  • Deploy cluster with modified templates → mount shows noatime in
    /proc/mounts
  • Single-node I/O: no regression (stat, dd read, dd write, smallfile)
  • Multi-node stat: -4.4% improvement (4 nodes × 4 procs = 16 streams)
  • flock correctness: concurrent lock serialization verified
  • df latency unchanged (lazystatfs already server-default)
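
The flock item can be exercised with a lock-protected counter. The sketch below is an assumption about how such a check might look, not the procedure used for this PR. It relies on flock(1) from util-linux, and `locked_increment_test` is a hypothetical name. As written, all writers run on one node; a cluster-wide check would launch the same loop from several nodes against a directory under /fsx.

```shell
# locked_increment_test DIR WRITERS
# Starts WRITERS concurrent processes; each takes an exclusive lock,
# reads the counter, and writes back counter + 1. Prints the final value,
# which equals WRITERS only if the read-modify-write steps were serialized.
locked_increment_test() {
  local dir="$1" writers="$2" i=0
  local counter="$dir/counter" lock="$dir/counter.lock"
  echo 0 > "$counter"
  while [ "$i" -lt "$writers" ]; do
    (
      flock -x 9                    # block until we hold the lock on fd 9
      n=$(cat "$counter")
      echo $((n + 1)) > "$counter"
    ) 9>"$lock" &
    i=$((i + 1))
  done
  wait                              # all writers finished
  cat "$counter"
}

# Example: locked_increment_test /fsx/locktest 32
```

A final value lower than the writer count would indicate lost updates, that is, locks that were not honored.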

The Lustre mount was running with zero mount options (all client
defaults). For ML training workloads this leaves performance on the
table:

- noatime: eliminates an MDS round-trip per read I/O. Large-scale
  dataset reads (HuggingFace cache, checkpoint loading) generate
  millions of atime updates that provide no value and add metadata
  load. noatime is strictly superior to the default relatime for ML.

- flock: enables cluster-wide POSIX file locking via the Lustre DLM.
  Lustre 2.15 enables this by default at the server level, but
  explicit mount-option declaration ensures correctness regardless of
  server configuration. Required by HuggingFace Transformers, PyTorch
  distributed checkpoint writing, and many Python libraries that use
  fcntl/flock.

- lazystatfs: statfs() returns from client cache instead of querying
  every OST synchronously. Prevents training jobs from stalling when
  a single OST is momentarily slow (e.g. during OST rebalancing or
  metadata-heavy neighbor workloads). The data may be seconds stale
  (acceptable — no ML pipeline depends on real-time df accuracy).

Applied consistently to all 4 CNG templates (add-cng.yaml,
add-cng-p5.yaml, add-cng-p6-b200.yaml, add-cng-p6-b300.yaml).

Further tuning (lctl runtime params, kernel module options, stripe
configuration, sysctl) is documented in a separate performance
report and will follow in a subsequent PR.

@KeitaW KeitaW left a comment

Collaborator

Review Batch 1/3 — Deployment Pipeline & Operational Correctness

One blocking issue, repeated identically across the three GPU CNG templates: a malformed, orphaned shell block was committed alongside the intended one-line mount change. The CPU template (add-cng.yaml) got only the clean one-liner and is the reference for the correct end state. Each inline comment carries a one-click suggestion that deletes the stray block.

Comment on lines +380 to +385
# with noatime (the only option not already set by default).
- |
if [ ! -z "${FSxLustreFilesystemId}" ]; then
else
${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx
fi

Collaborator

Malformed orphaned shell block (delete it)

The real change is the one-liner just above (mount -t lustre -o noatime,flock,lazystatfs …), which is correct and matches what landed cleanly in add-cng.yaml. The block below it is a leftover edit artifact, broken three ways: (1) then is followed immediately by else with no body — a hard bash parse error (syntax error near unexpected token 'else'); (2) the else branch is a bare FSx DNS string with no mount in front of it; (3) the mount it's trying to express was already done on the line above, so the block is dead weight.

Why it matters: cloud-init concatenates every runcmd entry into one /bin/sh script run as a single process, and a parse error aborts the rest of that script (the shell never even evaluates the if). Tracing the full template bounds the blast radius — the # Monitoring stack installation block is the last runcmd entry, and /fsx mount, /home fstab mount, and the PCS post-install hook all run on earlier entries. So mounts and cluster join are unaffected; the sole functional casualty is the monitoring install where DeployMonitoring=true. Still blocking, though: it's a shell syntax error shipped in three templates, so cloud-init exits non-zero and reports a degraded boot on every P5/P6 node, for a block that does nothing.

This slipped the test plan because the validation deploy was c6i.4xlarge (CPU → the clean add-cng.yaml); the three GPU templates were never instantiated, and CloudFormation validate-template does not parse shell inside UserData.

Deleting the block makes this template match the single-line change already correct in add-cng.yaml. Worth a follow-up deploy on one P5/P6 CNG to confirm.
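
Since validate-template does not look inside UserData, a parse-only pass with `sh -n` over the rendered shell would flag this class of error before a deploy. A minimal sketch, assuming the CloudFormation substitutions have already been replaced with concrete values (the filesystem ID, region, and mount name below are placeholders); `shell_syntax_ok` is a hypothetical helper name.

```shell
# shell_syntax_ok SCRIPT_TEXT: parse only (-n), never execute.
shell_syntax_ok() {
  printf '%s\n' "$1" | sh -n 2>/dev/null
}

# The orphaned block, with placeholder values in place of the substitutions.
broken='if [ ! -z "fs-0123456789abcdef0" ]; then
else
  fs-0123456789abcdef0.fsx.us-east-2.amazonaws.com@tcp:/abcdefgh /fsx
fi'

if shell_syntax_ok "$broken"; then
  echo "parses"
else
  echo "syntax error"             # this branch is taken
fi
```

Because `-n` only parses, it catches the empty `then` branch but would not catch the bare DNS string on its own, which is syntactically a valid command line.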

Suggested change
# with noatime (the only option not already set by default).
- |
if [ ! -z "${FSxLustreFilesystemId}" ]; then
else
${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx
fi

Comment on lines +367 to +372
# with noatime (the only option not already set by default).
- |
if [ ! -z "${FSxLustreFilesystemId}" ]; then
else
${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx
fi

Collaborator

Same orphaned block — delete it

Identical leftover block to the one in add-cng-p5.yaml. The empty then branch followed directly by else is a hard shell parse error that aborts the rest of the cloud-init runcmd script (only the monitoring install, the last entry, is functionally affected; mounts and PCS join run earlier). Delete it so this template matches the clean one-line change in add-cng.yaml.

Suggested change
# with noatime (the only option not already set by default).
- |
if [ ! -z "${FSxLustreFilesystemId}" ]; then
else
${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx
fi

Comment on lines +369 to +374
- if [ ! -z "${FSxLustreFilesystemId}" ]; then mkdir -p /fsx; mount -t lustre -o noatime,flock,lazystatfs ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx; chmod 1777 /fsx; fi
# with noatime (the only option not already set by default).
- |
if [ ! -z "${FSxLustreFilesystemId}" ]; then
else
${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx

Collaborator

Same orphaned block — delete it

Identical leftover block to the one in add-cng-p5.yaml. The empty then branch followed directly by else is a hard shell parse error that aborts the rest of the cloud-init runcmd script (only the monitoring install, the last entry, is functionally affected; mounts and PCS join run earlier). Delete it so this template matches the clean one-line change in add-cng.yaml.

Suggested change
- if [ ! -z "${FSxLustreFilesystemId}" ]; then mkdir -p /fsx; mount -t lustre -o noatime,flock,lazystatfs ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx; chmod 1777 /fsx; fi
# with noatime (the only option not already set by default).
- |
if [ ! -z "${FSxLustreFilesystemId}" ]; then
else
${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /fsx

@KeitaW KeitaW left a comment

Collaborator

Review Batch 2/3 — Documentation Consistency

Two small inline nits (a typo and a cross-doc number mismatch), plus one body-only item:

A second Test 10 link still points at the old anchor: tests/README.md line 70 (outside this diff). Renaming the heading to ## Test 10: FSx storage health and performance changes its anchor to #test-10-fsx-storage-health-and-performance. The table row at line 46 was updated (good), but a second table at line 70 still links to the old #test-10-fsx-storage-health, which is now dangling. A repo code-search found no other references, so it's contained to this file. Could you update line 70 to the new anchor as well so both tables resolve?
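
For anyone checking the anchors by hand, the heading-to-anchor mapping can be approximated in a few lines. This is a rough sketch of the usual rule (lowercase, drop punctuation, spaces to hyphens), not GitHub's exact implementation; `gh_anchor` is a hypothetical helper name.

```shell
# gh_anchor HEADING_TEXT: approximate the anchor generated for a heading.
gh_anchor() {
  printf '%s\n' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9 _-]//g' -e 's/ /-/g'
}

gh_anchor "Test 10: FSx storage health and performance"
# -> test-10-fsx-storage-health-and-performance
gh_anchor "Test 10: FSx storage health"
# -> test-10-fsx-storage-health
```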

On a small filesystem (2 OSTs, MDS far from saturation), single-node deltas
are in the noise. Multi-node shows the beginning of MDS contention relief from
`noatime`. At production scale (64+ nodes, 10+ OSTs), improvements from
`noatime` + `mdc` tunables are expected to be 10–30×× larger.

Collaborator

Typo: doubled multiplication sign

10–30×× has a doubled ×.

Suggested change
- `noatime` + `mdc` tunables are expected to be 10–30×× larger.
+ `noatime` + `mdc` tunables are expected to be 10–30× larger.

| Environment | relatime (default) | noatime | Delta |
|---|---|---|---|
| 1 node, 10K stat | 1851 ms | 1930 ms | ±noise |
| 4 nodes × 4 procs (16 streams), 10K stat | 5033 ms | 4812 ms | **-4%** |

Collaborator

Align benchmark delta with the other doc

The same 5033 ms → 4812 ms result is reported as -4.4% in tests/README.md (line 902) and the PR body, but -4% here. The precise figure is -4.4%; aligning keeps the two documents telling the same number.

Suggested change
- | 4 nodes × 4 procs (16 streams), 10K stat | 5033 ms | 4812 ms | **-4%** |
+ | 4 nodes × 4 procs (16 streams), 10K stat | 5033 ms | 4812 ms | **-4.4%** |

@KeitaW KeitaW left a comment

Collaborator

Review Batch 3/3 — Evaluation Methodology

One inline note on how the headline benchmark number is framed.


Things That Look Great

  • The add-cng.yaml (CPU) change is exactly the surgical one-liner this PR should be everywhere — -o noatime,flock,lazystatfs and nothing else. It's the reference for the three GPU templates.
  • Honest measurement framing: single-node deltas are labeled ±noise rather than dressed up as wins, with an explicit why (2-OST filesystem, MDS far from saturation).
  • The Part B A/B methodology (mount -o remount,relatime / remount,noatime across login + compute nodes, cache-drops between runs) is a genuinely reusable regression harness.
  • OPERATIONS.md §5.1's "why not fstab?" explanation (Lustre needs the kernel module + LNet up, which _netdev alone doesn't guarantee) is exactly the tribal knowledge worth writing down.
  • §5.2–5.5 tunables are correctly scoped as opt-in ("not set by templates by default") with clear "when to apply / when to skip" guidance.

| Metric | relatime | noatime | Delta |
|---|---|---|---|
| 16-stream stat 10K files | 5033 ms | 4812 ms | **-4.4%** |

Collaborator

The -4.4% multi-node result is a single A/B pair

The Part B harness itself is solid — cache drops between runs, remount A/B without a redeploy, multi-node concurrent stat as the primary indicator. My one concern is presentation: 5033 ms → 4812 ms = -4.4% is a single before/after pair (n=1), and a ~4% delta on a metadata-stat benchmark is within typical run-to-run variance. The doc is admirably honest about the single-node numbers (labeling them ±noise), so it's a little inconsistent that the multi-node figure is then carried into the "regression baseline" as a measured improvement. Could you either report n (repetitions) and a variance/spread for that row, or label the -4.4% as indicative rather than a confirmed improvement? The >10% regression gate is sensible precisely because deltas this small aren't distinguishable from noise on a 2-OST filesystem — worth saying so next to the number.
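
As a sketch of what reporting n and a spread could look like, the helper below summarizes repeated timings from one configuration. `summarize` is a hypothetical name, and the input values in the example are toy numbers chosen so the arithmetic is easy to check, not measurements from this PR.

```shell
# summarize T1 T2 ...: print sample count, mean, and sample standard deviation.
summarize() {
  printf '%s\n' "$@" | awk '
    { n++; sum += $1; sumsq += $1 * $1 }
    END {
      mean = sum / n
      # Sample standard deviation (n - 1 in the denominator); 0 when n = 1.
      sd = (n > 1) ? sqrt((sumsq - n * mean * mean) / (n - 1)) : 0
      printf "n=%d mean=%.1f sd=%.1f\n", n, mean, sd
    }'
}

summarize 2 4 4 4 5 5 7 9   # toy values -> n=8 mean=5.0 sd=2.1
```

Running this once per configuration (relatime and noatime) over the repeated runs would show whether the difference in means exceeds the run-to-run spread.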

@KeitaW KeitaW left a comment

Collaborator

A few comments
