[Common] Reduce shared-memory bank conflicts in the colwise scaling path of the tuned NVFP4 kernel by Oleg-Goncharov · Pull Request #3106 · NVIDIA/TransformerEngine

Oleg-Goncharov · 2026-06-08T16:08:01Z

Description

This PR optimizes shared-memory accesses in the colwise scaling path by reducing bank conflicts, resulting in an average performance improvement of about 1%.

The colwise Y-lane mapping changed from:

(thread_lane % 4 + warp) % 4

to:

(thread_lane / 2 + warp) % 4

The intent is to change the shared-memory swizzle pattern used by the sOut_tr b64 stores. With the previous %4 mapping, the bank pattern repeats every 4 lanes, which leads to an 8-way conflict pattern for this store layout. Using /2 makes the mapping repeat every 8 lanes instead, reducing the conflict degree to 4-way for the same access pattern.
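
The conflict counts above can be reproduced with a small host-side model. This is a sketch, not the kernel code. It assumes 32 four-byte banks, a 32-byte row pitch for sOut_tr, a store row of thread_lane * 2, and a byte column of tid_Y_colwise * 8. The row pitch and the byte units are assumptions that the PR does not state, and the helper names are hypothetical. The model counts lanes across a whole warp and ignores how the hardware may split 64-bit stores into separate transactions.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

constexpr int kThreadsPerWarp = 32;
constexpr int kNumBanks = 32;       // 4-byte shared-memory banks
constexpr int kBankBytes = 4;
constexpr int kRowPitchBytes = 32;  // ASSUMPTION: sOut_tr row pitch, not stated in the PR

// Old and new colwise Y-lane mappings, as quoted in the PR description.
constexpr int tid_y_old(int lane, int warp) { return (lane % 4 + warp) % 4; }
constexpr int tid_y_new(int lane, int warp) { return (lane / 2 + warp) % 4; }

// Bank holding the first 4-byte word of a lane's 8-byte store.
// ASSUMPTION: row = lane * 2, byte column = tid_y * 8.
constexpr int first_bank(int lane, int tid_y) {
  const int byte_addr = (lane * 2) * kRowPitchBytes + tid_y * 8;
  return (byte_addr / kBankBytes) % kNumBanks;
}

// Largest number of lanes in one warp whose stores start in the same bank.
int max_lanes_per_bank(int warp, bool use_new_mapping) {
  assert(warp >= 0 && warp < 4);
  std::array<int, kNumBanks> hits{};
  for (int lane = 0; lane < kThreadsPerWarp; ++lane) {
    const int tid_y =
        use_new_mapping ? tid_y_new(lane, warp) : tid_y_old(lane, warp);
    ++hits[first_bank(lane, tid_y)];
  }
  return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, max_lanes_per_bank(w, false) evaluates to 8 and max_lanes_per_bank(w, true) to 4 for each of the four warps, in line with the 8-way and 4-way figures quoted above.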

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Changed the shared-memory access pattern of the sOut_tr stores in the colwise scaling path.

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

Oleg-Goncharov · 2026-06-08T16:09:31Z

/te-ci

greptile-apps · 2026-06-08T16:12:51Z

Greptile Summary

Replaces thread_lane % 4 with thread_lane / 2 when computing tid_Y_colwise in the colwise scaling path of the tuned NVFP4 kernel, reducing shared-memory bank conflicts on the sOut_tr 64-bit stores from 8-way to 4-way by doubling the swizzle period from 4 threads to 8. The bijection between threads and output regions is preserved; input-read bank access (determined solely by thread_lane) and scale-factor store bank access are unaffected.

The formula change is mathematically valid: (thread_lane / 2 + warp) % 4 still produces a bijection over all 128 (thread_lane, tid_Y_colwise) pairs, covering every element of the 64×64 tile exactly once.
The sOut_tr b64 store bank calculation changes from a period-4 pattern (yielding 8 threads per bank) to a period-8 pattern (yielding 4 threads per bank), consistent with the claimed ~1% average improvement.
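
The bijection claim can be checked by brute force. The sketch below assumes a block of 128 threads (4 warps of 32), which is implied by the "128 pairs" in the summary but not stated outright. The function and constant names are hypothetical.

```cpp
#include <array>
#include <cassert>

constexpr int kThreadsPerWarp = 32;
constexpr int kThreadsPerBlock = 128;  // ASSUMPTION: 4 warps per block
constexpr int kYSlots = 4;             // tid_Y_colwise takes values 0..3

constexpr int tid_y_old(int lane, int warp) { return (lane % 4 + warp) % kYSlots; }
constexpr int tid_y_new(int lane, int warp) { return (lane / 2 + warp) % kYSlots; }

// Returns true when the threads of a block produce every
// (thread_lane, tid_Y_colwise) pair exactly once.
bool is_bijective(int (*tid_y)(int, int)) {
  std::array<int, kThreadsPerWarp * kYSlots> seen{};
  for (int tid = 0; tid < kThreadsPerBlock; ++tid) {
    const int warp = tid / kThreadsPerWarp;
    const int lane = tid % kThreadsPerWarp;
    const int slot = lane * kYSlots + tid_y(lane, warp);
    assert(slot >= 0 && slot < static_cast<int>(seen.size()));
    ++seen[slot];
  }
  for (int count : seen) {
    if (count != 1) return false;
  }
  return true;
}
```

Both is_bijective(tid_y_old) and is_bijective(tid_y_new) return true here: for a fixed lane, adding the warp index 0..3 modulo 4 only permutes the four Y slots, whichever per-lane term is used.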

Confidence Score: 4/5

Safe to merge; the one-line change is a correct swizzle substitution with no effect on output correctness.

The formula thread_lane / 2 produces the same bijective coverage as the old thread_lane % 4 formula, verified analytically. Input reads are bank-access-neutral to this change, scale-factor stores are unaffected, and only the sOut_tr 64-bit stores benefit from fewer conflicts. The only open item is a missing in-code comment explaining the swizzle rationale, which makes future maintenance harder but does not threaten correctness or performance.

No files require special attention. The single changed file is self-contained and the optimization is isolated to the colwise_scaling device function.

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/cast/nvfp4/specialized/quantize_transpose_nvfp4_tuned_1D.cuh | Single-line change replacing `thread_lane % 4` with `thread_lane / 2` in the colwise tid_Y swizzle formula; verified correct bijection and bank-conflict reduction from 8-way to 4-way on sOut_tr b64 stores. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["threadIdx.x"] --> B["warp = threadIdx.x / 32\nthread_lane = threadIdx.x % 32"]
    B --> C["tid_Y_colwise = (thread_lane / 2 + warp) % 4\n[NEW: period-8 swizzle]"]
    B --> D["tid_X_colwise = thread_lane"]
    C --> E["in_thread_offset_Y = tid_Y_colwise x 16\nin_thread_offset_X = thread_lane"]
    D --> E
    E --> F["ld_shared_b32 from sIn2x\n[bank = thread_lane, no conflicts]"]
    F --> G["Compute block AMAX and scaling coefficient"]
    G --> H["out_tr_offset_Y = thread_lane x 2\nout_tr_offset_X = tid_Y_colwise x 8"]
    H --> I["st_shared_b64 to sOut_tr\n[bank conflicts: 4-way, down from 8-way]"]
    C --> J["scale_tr_offset_X = stage_Y x 4 + tid_Y_colwise"]
    J --> K["Store scale to sSFcolwise\n[bank conflicts unchanged]"]
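
The index derivation in the flowchart can be written out as plain host code for checking by hand. This is only a sketch of the arithmetic shown in the nodes above. The struct and function names are hypothetical, stage_Y is passed in as a parameter, and nothing here models the actual loads, stores, or scaling math.

```cpp
#include <cassert>

constexpr int THREADS_PER_WARP = 32;

// Per-thread indices named after the flowchart nodes.
struct ColwiseOffsets {
  int warp;
  int thread_lane;
  int tid_Y_colwise;
  int in_thread_offset_Y;
  int in_thread_offset_X;
  int out_tr_offset_Y;
  int out_tr_offset_X;
  int scale_tr_offset_X;
};

// Hypothetical helper: derives the offsets for one thread of the block.
ColwiseOffsets colwise_offsets(int thread_idx_x, int stage_Y) {
  assert(thread_idx_x >= 0 && stage_Y >= 0);
  ColwiseOffsets o{};
  o.warp = thread_idx_x / THREADS_PER_WARP;
  o.thread_lane = thread_idx_x % THREADS_PER_WARP;
  o.tid_Y_colwise = (o.thread_lane / 2 + o.warp) % 4;  // new period-8 swizzle
  o.in_thread_offset_Y = o.tid_Y_colwise * 16;
  o.in_thread_offset_X = o.thread_lane;
  o.out_tr_offset_Y = o.thread_lane * 2;
  o.out_tr_offset_X = o.tid_Y_colwise * 8;
  o.scale_tr_offset_X = stage_Y * 4 + o.tid_Y_colwise;
  return o;
}
```

For example, thread 37 with stage_Y = 2 sits in warp 1 at lane 5, so tid_Y_colwise is (5 / 2 + 1) % 4 = 3, giving an input offset of (48, 5) in (Y, X), an output offset of (10, 24), and a scale offset of 11.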

Reviews (1): Last reviewed commit: "Optimized shared memory stores in colwis..."

greptile-apps · 2026-06-08T16:12:55Z

  const int thread_lane = threadIdx.x % THREADS_PER_WARP;

-  const int tid_Y_colwise = (thread_lane % 4 + warp) % 4;
+  const int tid_Y_colwise = (thread_lane / 2 + warp) % 4;


Missing comment for non-obvious swizzle formula

The change from thread_lane % 4 to thread_lane / 2 is a subtle shared-memory swizzle that is hard to verify without derivation. The rationale is that the period of the bank pattern doubles from 4 threads to 8 threads, which halves the bank-conflict degree on the sOut_tr b64 stores from 8-way to 4-way. This rationale is not captured anywhere in the code, yet the checklist item "I have commented my code, particularly in hard-to-understand areas" is left unchecked. A one-line comment like // /2 gives period-8 swizzle → 4-way bank conflicts on sOut_tr b64 stores (vs 8-way with %4) would help future readers validate the formula without re-deriving it.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

ptrendx · 2026-06-09T12:41:42Z

I agree that it would be good to provide some comment on the logic here, not necessarily in the code itself (that one would require a little bit more context on what you are trying to achieve here), but definitely in the PR description.

Oleg-Goncharov · 2026-06-09T14:43:38Z

Good point, I expanded the PR description with the rationale and the updated mapping scheme.

Optimized shared memory stores in colwise path

94b74fe

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

Oleg-Goncharov added the performance (Performance issues) and fp4 labels Jun 8, 2026

Oleg-Goncharov requested review from ksivaman and ptrendx June 8, 2026 16:08

greptile-apps Bot reviewed Jun 8, 2026

ptrendx approved these changes Jun 9, 2026

ptrendx merged commit 2323e54 into NVIDIA:main Jun 9, 2026
35 of 43 checks passed

ptrendx merged 1 commit into NVIDIA:main from Oleg-Goncharov:pr_nvfp4_micro_optimization

2 participants
