SurbhiJainUSC
Collaborator

Description

An issue was identified while running MaxText/Tunix SFT with Deepseek-V3 on v6e-256: two axes within the MLA layer are assigned the exact same sharding rule by the logical_axis_rules defined here.

Conflict 1: The embed axis and the q_lora axis both use an identical sharding specification: ['fsdp', 'sequence', 'context', 'expert'].

Conflict 2: The embed axis and the kv_lora axis also use the same sharding specification: ['fsdp', 'sequence', 'context', 'expert'].

During optimizer sharding in Tunix here, jax.lax.with_sharding_constraint() uses the sharding rules defined for each axis to determine how to shard the optimizer state.

The error points to this line in the JAX source:
https://github.com/jax-ml/jax/blob/1a91543e92778bb659939cc3bdc3d4b7978191b6/jax/_src/named_sharding.py#L473

When JAX encounters the same sharding specification for two different axes of one array (embed and q_lora), each mesh axis would be used for more than one dimension. JAX treats this as an internal inconsistency and raises DuplicateSpecError.
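
The failure mode can be illustrated without JAX. The sketch below is a plain-Python stand-in, not JAX's or MaxText's implementation: `RULES`, `resolve`, and `find_duplicates` are hypothetical names, and the rule table only mirrors the conflicting entries described above. It resolves each logical axis of a weight to mesh axes and reports any mesh axis claimed by more than one dimension.

```python
# Hypothetical rule table mirroring the conflicting entries described above.
MESH_AXES = ["fsdp", "sequence", "context", "expert"]
RULES = {
    "embed": MESH_AXES,
    "q_lora": MESH_AXES,
    "kv_lora": MESH_AXES,
}


def resolve(logical_axes, rules):
    """Map each logical axis of an array to a tuple of mesh axes (() = no rule)."""
    return [tuple(rules.get(name, ())) for name in logical_axes]


def find_duplicates(spec):
    """Return mesh axes that are assigned to more than one array dimension."""
    seen, duplicates = set(), []
    for dim_axes in spec:
        for axis in dim_axes:
            if axis in seen and axis not in duplicates:
                duplicates.append(axis)
            seen.add(axis)
    return duplicates


# A weight whose two dimensions are labeled (embed, q_lora); the exact layer
# is an assumption made for illustration.
spec = resolve(("embed", "q_lora"), RULES)
conflicts = find_duplicates(spec)
print(conflicts)
```

In this sketch every mesh axis in the shared rule is reported as a conflict, because both dimensions of the same array ask for it.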

This PR resolves that issue by removing the sharding rules for q_lora and kv_lora.
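
As a rough illustration of why dropping the two rules avoids the conflict, here is a minimal plain-Python sketch. It assumes the rules are a list of [logical axis, mesh axes] pairs and that an axis with no rule is simply left unsharded; `rules_before`, `rules_after`, `to_spec`, and `is_valid` are hypothetical names and do not reproduce the actual MaxText config or JAX checks.

```python
# Hypothetical excerpt of a logical_axis_rules list: [logical axis, mesh axes] pairs.
MESH_AXES = ["fsdp", "sequence", "context", "expert"]
rules_before = [
    ["embed", MESH_AXES],
    ["q_lora", MESH_AXES],
    ["kv_lora", MESH_AXES],
]

# The change described above: drop the q_lora and kv_lora entries.
rules_after = [rule for rule in rules_before if rule[0] not in ("q_lora", "kv_lora")]


def to_spec(logical_axes, rules):
    """Resolve logical axes to mesh axes; an axis with no rule becomes None.

    Assumption of this sketch: a missing rule means "do not shard this dimension".
    """
    table = {name: tuple(axes) for name, axes in rules}
    return tuple(table.get(name) for name in logical_axes)


def is_valid(spec):
    """A spec is valid here if no mesh axis appears in more than one dimension."""
    used = [axis for dim in spec if dim is not None for axis in dim]
    return len(used) == len(set(used))


for axes in (("embed", "q_lora"), ("embed", "kv_lora")):
    before = to_spec(axes, rules_before)
    after = to_spec(axes, rules_after)
    print(axes, "valid before:", is_valid(before), "valid after:", is_valid(after))
```

Under these assumptions the embed dimension keeps its sharding while the q_lora and kv_lora dimensions are left unsharded, so no mesh axis is requested twice for the same array.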

FIXES: b/444495481

Tests

Tested Deepseek-V3 on v6e-256: https://cloudlogging.app.goo.gl/FYJACjyjTdAF3uQE7

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

Collaborator

@richjames0 richjames0 left a comment

lgtm

@copybara-service copybara-service bot merged commit 0baff00 into main Sep 22, 2025
27 checks passed
@copybara-service copybara-service bot deleted the deepseek_sharding branch September 22, 2025 17:11