[Common] Reduce shared-memory bank conflicts in the colwise scaling path of the tuned NVFP4 kernel#3106

Merged: ptrendx merged 1 commit into NVIDIA:main from Oleg-Goncharov:pr_nvfp4_micro_optimization on Jun 9, 2026

Conversation

@Oleg-Goncharov Oleg-Goncharov commented Jun 8, 2026

Collaborator

Description

This PR optimizes shared-memory accesses in the colwise scaling path by reducing bank conflicts, resulting in an average performance improvement of about 1%.

The colwise Y-lane mapping changed from:

(thread_lane % 4 + warp) % 4

to:

(thread_lane / 2 + warp) % 4

The intent is to change the shared-memory swizzle pattern used by the sOut_tr b64 stores. With the previous % 4 mapping, the bank pattern repeats every 4 lanes, so the 32 lanes of a warp share only 4 distinct bank positions and 8 lanes hit each one (an 8-way conflict for this store layout). With / 2, the pattern repeats every 8 lanes instead, which spreads the warp over 8 bank positions and reduces the conflict degree to 4-way for the same access pattern.
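
The 8-way and 4-way figures can be reproduced with a small host-side model. This is only a sketch under stated assumptions, not code from the kernel: it assumes that lane l stores to row 2 * l of sOut_tr, that each row is 32 bytes wide, that the column offset tid_Y_colwise * 8 is in bytes, and that shared memory has 32 banks of 4 bytes each. It tracks only the first 4-byte word of each b64 store. The helper names (tid_y_old, tid_y_new, first_bank, max_lanes_per_bank) are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Old and new colwise Y-lane mappings, as quoted in the PR description.
constexpr int tid_y_old(int thread_lane, int warp) { return (thread_lane % 4 + warp) % 4; }
constexpr int tid_y_new(int thread_lane, int warp) { return (thread_lane / 2 + warp) % 4; }

// ASSUMPTION: hypothetical store layout, not taken from the kernel source.
// Lane l writes row 2*l of a tile with 32-byte rows, at byte column 8*tid_y.
// Shared memory is modeled as 32 banks, each 4 bytes wide.
constexpr int first_bank(int thread_lane, int tid_y) {
    const int byte_addr = (2 * thread_lane) * 32 + tid_y * 8;
    return (byte_addr / 4) % 32;
}

// Largest number of lanes in one warp whose store starts in the same bank.
template <typename Mapping>
int max_lanes_per_bank(Mapping tid_y, int warp) {
    assert(warp >= 0 && warp < 4);  // ASSUMPTION: 4 warps per block
    std::array<int, 32> hits{};
    for (int lane = 0; lane < 32; ++lane) {
        ++hits[first_bank(lane, tid_y(lane, warp))];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, max_lanes_per_bank(tid_y_old, warp) evaluates to 8 and max_lanes_per_bank(tid_y_new, warp) evaluates to 4 for every warp in 0..3, matching the conflict degrees stated above. The model says nothing about how the hardware splits a 64-bit store into transactions, so it should be read as a counting argument rather than a performance prediction.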

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

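  • Changed the colwise Y-lane mapping (tid_Y_colwise) from (thread_lane % 4 + warp) % 4 to (thread_lane / 2 + warp) % 4 to reduce bank conflicts on the sOut_tr b64 shared-memory stores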

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Oleg-Goncharov added the performance (Performance issues) and fp4 labels on Jun 8, 2026
@Oleg-Goncharov

Collaborator Author

/te-ci

greptile-apps Bot commented Jun 8, 2026

Contributor

Greptile Summary

Replaces thread_lane % 4 with thread_lane / 2 when computing tid_Y_colwise in the colwise scaling path of the tuned NVFP4 kernel, reducing shared-memory bank conflicts on the sOut_tr 64-bit stores from 8-way to 4-way by doubling the swizzle period from 4 threads to 8. The bijection between threads and output regions is preserved; input-read bank access (determined solely by thread_lane) and scale-factor store bank access are unaffected.

  • The formula change is mathematically valid: (thread_lane / 2 + warp) % 4 still maps the 128 threads one-to-one onto the 128 (thread_lane, tid_Y_colwise) pairs, covering every element of the 64×64 tile exactly once.
  • The sOut_tr b64 store bank calculation changes from a period-4 pattern (yielding 8 threads per bank) to a period-8 pattern (yielding 4 threads per bank), consistent with the claimed ~1% average improvement.
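
The bijection claim can be checked by brute force. The sketch below assumes a block of 4 warps with 32 lanes each (128 threads), which is inferred from the 128 pairs mentioned above rather than read from the kernel source. The names kWarps, kLanes, and covers_every_slot_once are hypothetical.

```cpp
#include <array>
#include <cassert>

constexpr int kWarps = 4;   // ASSUMPTION: 128 threads per block = 4 warps
constexpr int kLanes = 32;  // lanes per warp

// New colwise Y-lane mapping from this PR.
constexpr int tid_y_colwise(int thread_lane, int warp) {
    return (thread_lane / 2 + warp) % 4;
}

// Returns true if the 128 (thread_lane, tid_Y_colwise) pairs are all distinct,
// i.e. each of the 32 x 4 slots is claimed by exactly one thread.
bool covers_every_slot_once() {
    std::array<int, kLanes * 4> claims{};
    for (int warp = 0; warp < kWarps; ++warp) {
        for (int lane = 0; lane < kLanes; ++lane) {
            const int slot = lane * 4 + tid_y_colwise(lane, warp);
            assert(slot >= 0 && slot < kLanes * 4);
            ++claims[slot];
        }
    }
    for (int count : claims) {
        if (count != 1) return false;
    }
    return true;
}
```

For a fixed lane, thread_lane / 2 is a constant, so adding warp modulo 4 walks through all four tid_Y_colwise values as warp runs over 0..3. That is why the check passes, and the same argument applies to the old thread_lane % 4 term.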

Confidence Score: 4/5

Safe to merge; the one-line change is a correct swizzle substitution with no effect on output correctness.

The formula thread_lane / 2 produces the same bijective coverage as the old thread_lane % 4 formula, verified analytically. Input reads are bank-access-neutral to this change, scale-factor stores are unaffected, and only the sOut_tr 64-bit stores benefit from fewer conflicts. The only open item is a missing in-code comment explaining the swizzle rationale, which makes future maintenance harder but does not threaten correctness or performance.

No files require special attention. The single changed file is self-contained and the optimization is isolated to the colwise_scaling device function.

Important Files Changed

Filename: transformer_engine/common/cast/nvfp4/specialized/quantize_transpose_nvfp4_tuned_1D.cuh
Overview: Single-line change replacing thread_lane % 4 with thread_lane / 2 in the colwise tid_Y swizzle formula; verified correct bijection and bank-conflict reduction from 8-way to 4-way on sOut_tr b64 stores.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["threadIdx.x"] --> B["warp = threadIdx.x / 32\nthread_lane = threadIdx.x % 32"]
    B --> C["tid_Y_colwise = (thread_lane / 2 + warp) % 4\n[NEW: period-8 swizzle]"]
    B --> D["tid_X_colwise = thread_lane"]
    C --> E["in_thread_offset_Y = tid_Y_colwise x 16\nin_thread_offset_X = thread_lane"]
    D --> E
    E --> F["ld_shared_b32 from sIn2x\n[bank = thread_lane, no conflicts]"]
    F --> G["Compute block AMAX and scaling coefficient"]
    G --> H["out_tr_offset_Y = thread_lane x 2\nout_tr_offset_X = tid_Y_colwise x 8"]
    H --> I["st_shared_b64 to sOut_tr\n[bank conflicts: 4-way, down from 8-way]"]
    C --> J["scale_tr_offset_X = stage_Y x 4 + tid_Y_colwise"]
    J --> K["Store scale to sSFcolwise\n[bank conflicts unchanged]"]

Reviews (1): Last reviewed commit: "Optimized shared memory stores in colwis..."

const int thread_lane = threadIdx.x % THREADS_PER_WARP;

- const int tid_Y_colwise = (thread_lane % 4 + warp) % 4;
+ const int tid_Y_colwise = (thread_lane / 2 + warp) % 4;

Contributor


P2 Missing comment for non-obvious swizzle formula

The change from thread_lane % 4 to thread_lane / 2 is a subtle shared-memory swizzle that is hard to verify without derivation. The rationale—halving the bank-conflict degree on the sOut_tr b64 stores from 8-way to 4-way because the period of the bank pattern doubles from 4 threads to 8 threads—is not captured anywhere in the code, yet the checklist item "I have commented my code, particularly in hard-to-understand areas" is left unchecked. A one-line comment like // /2 gives period-8 swizzle → 4-way bank conflicts on sOut_tr b64 stores (vs 8-way with %4) would help future readers validate the formula without re-deriving it.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

ptrendx commented Jun 9, 2026

Member

I agree that it would be good to provide some comment on the logic here, not necessarily in the code itself (that one would require a little bit more context on what you are trying to achieve here), but definitely in the PR description.

@Oleg-Goncharov

Collaborator Author

Good point, I expanded the PR description with the rationale and the updated mapping scheme.

@ptrendx ptrendx merged commit 2323e54 into NVIDIA:main Jun 9, 2026
35 of 43 checks passed
Labels

fp4, performance (Performance issues)


2 participants