Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -201,7 +201,7 @@ __device__ __forceinline__ void colwise_scaling(const IType *__restrict__ sIn_pt
const int warp = threadIdx.x / THREADS_PER_WARP;
const int thread_lane = threadIdx.x % THREADS_PER_WARP;

const int tid_Y_colwise = (thread_lane % 4 + warp) % 4;
const int tid_Y_colwise = (thread_lane / 2 + warp) % 4;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Missing comment for non-obvious swizzle formula

The change from thread_lane % 4 to thread_lane / 2 is a subtle shared-memory swizzle that is hard to verify without derivation. The rationale—halving the bank-conflict degree on the sOut_tr b64 stores from 8-way to 4-way because the period of the bank pattern doubles from 4 threads to 8 threads—is not captured anywhere in the code, yet the checklist item "I have commented my code, particularly in hard-to-understand areas" is left unchecked. A one-line comment like // /2 gives period-8 swizzle → 4-way bank conflicts on sOut_tr b64 stores (vs 8-way with %4) would help future readers validate the formula without re-deriving it.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

const int tid_X_colwise = thread_lane;

const int thread_offset_Y_colwise = tid_Y_colwise * SCALE_DIM;
Expand Down
Loading