
[replica-parallel] Add replica slices concept #1319

Merged

1 commit merged into google:main on Nov 14, 2024

Conversation

@gspschmid (Contributor) commented on Nov 11, 2024

(Followed by #1320)

Adds the concept of "replica slices", an explicit representation of which replica ids save which slices of an array. The intent is for replica slices to let us generalize beyond Orbax's current restriction that each shard be saved by exactly one replica.

Motivation: Depending on their sharding and replication, JAX arrays may consist of multiple shards. Under replication, each shard carries a distinct replica_id that distinguishes copies of the same logical shard from one another. Orbax's current behavior is to save the copy with the same fixed replica_id for all shards of all arrays ("single-replica" saving). In the presence of replication this is suboptimal, since the work could instead be parallelized across all replicas.
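For context, a minimal JAX sketch (not part of this PR; the device count and mesh shape are assumptions) showing how replica_id distinguishes the copies of a fully replicated array's single logical shard:

```python
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Assumes 8 local devices arranged in a 2x4 mesh. An empty PartitionSpec
# fully replicates the array, so every device holds a copy of the same
# logical shard.
mesh = Mesh(np.array(jax.devices()).reshape(2, 4), ('x', 'y'))
arr = jax.device_put(
    np.arange(16.0).reshape(8, 2), NamedSharding(mesh, PartitionSpec())
)

for shard in arr.addressable_shards:
    # All shards cover the same index; replica_id tells the copies apart.
    # Single-replica saving writes only the copy with one chosen replica_id.
    print(shard.index, shard.replica_id, shard.data.shape)
```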

This PR is an initial step in the direction of "replica-parallel" saving: we make "replica slices" and related metadata explicit, but do not change any of Orbax's behavior.

Care is taken to compute the resulting local_shape (the shape of the slices written by each replica) even when the local process does not end up saving any data. This seems necessary because, to the best of my understanding, only one particular process may set the tensorstore metadata, and tensorstore's chunk shape in particular is derived from local_shape.
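To illustrate the point about local_shape, a hedged sketch of the idea (not the PR's actual code; since this PR keeps single-replica behavior, the slice written per replica is simply a full shard):

```python
import jax

def local_shape_for_metadata(arr: jax.Array) -> tuple[int, ...]:
    # The per-shard shape is a property of the sharding and the global shape,
    # not of which process ends up owning data, so every process can compute
    # it, including one that saves no replica slices. The process that writes
    # the tensorstore metadata derives the chunk shape from this value.
    return arr.sharding.shard_shape(arr.shape)
```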

@gspschmid (Contributor Author) commented:

@cpgaffney1

@cpgaffney1 (Collaborator) left a comment

Thanks!! Looks good overall, just some minor comments.

sharding: jax.sharding.Sharding
dtype: np.dtype
# Whether the replica slices have been transferred and are ready as ndarrays
transferred: bool
Collaborator:
Is this really necessary if we can just check replica_slices to see whether it contains numpy arrays or not?

@gspschmid (Contributor Author) replied Nov 13, 2024:

There might be no replica_slices (e.g. because we are in single-replica mode and the current process has no shards with shard.replica_id == replica_id)?

dtype: np.dtype
# Whether the replica slices have been transferred and are ready as ndarrays
transferred: bool
replica_slices: list[ReplicaSliceOnDevice] | list[tuple[Index, np.ndarray]]
Collaborator:
It seems like we could just have a ReplicaSlice object that can store data: jax.Array | np.ndarray, with optional replica_id. That would simplify this typing a bit.

Contributor Author:
Moved from ReplicaSliceOnDevice to ReplicaSlice (which may be either on-device or on-host) and added some invariants.
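A rough sketch of what such a unified type could look like (field names and invariants here are illustrative assumptions, not the code that was merged):

```python
import dataclasses
import jax
import numpy as np

Index = tuple[slice, ...]


@dataclasses.dataclass(frozen=True)
class ReplicaSlice:
    """A slice of an array saved by one replica, either on device or on host."""

    index: Index
    # Only meaningful while the slice is still on device; dropped once the
    # data has been transferred to host as a numpy array.
    replica_id: int | None
    data: jax.Array | np.ndarray

    @property
    def is_on_host(self) -> bool:
        return isinstance(self.data, np.ndarray)

    def __post_init__(self):
        # Invariant: on-host slices no longer carry a replica_id.
        assert self.is_on_host == (self.replica_id is None)
```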

assert num_devices >= 2
assert is_pow_of_two(num_devices)

def test_get_replica_slices_single_replica(self):
Collaborator:
Would be good to add new cases, or parameterize the existing ones, to include tests for arrays that are not fully replicated.

Contributor Author:
Added a second variant of the test that operates on a partially-replicated array (first dim partitioned).
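For reference, one way such a partially-replicated test input could be constructed (mesh shape, sizes, and the helper name are assumptions for illustration, not the test's actual code):

```python
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec

def make_partially_replicated_array() -> jax.Array:
    # 2x4 mesh: the first array dimension is partitioned across 'x', while
    # 'y' only replicates, so each logical shard exists in 4 replica copies.
    devices = np.array(jax.devices()[:8]).reshape(2, 4)
    mesh = Mesh(devices, ('x', 'y'))
    sharding = NamedSharding(mesh, PartitionSpec('x'))
    return jax.device_put(np.arange(64.0).reshape(8, 8), sharding)
```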

@gspschmid force-pushed the gschmid/replica_parallel_0 branch from fcdeb8e to 5f533d5 on November 13, 2024 16:58
@gspschmid (Contributor Author)

@cpgaffney1 Thanks for the review, PTAL!

@cpgaffney1 (Collaborator)

This conflicts with a few internal behaviors. I have created a change that fixes most issues and am just waiting for advice on one particular issue. Likely by tomorrow the internal change should be ready to go, at which point I will merge this change and submit mine just after.

@gspschmid (Contributor Author)

Sounds good, thanks for helping shepherd this through! :-)

copybara-service bot pushed a commit that referenced this pull request Nov 14, 2024
@cpgaffney1 merged commit acd7869 into google:main on Nov 14, 2024
1 check passed
copybara-service bot pushed a commit that referenced this pull request Nov 14, 2024
@gspschmid deleted the gschmid/replica_parallel_0 branch on November 15, 2024 09:57
copybara-service bot pushed a commit that referenced this pull request Jan 27, 2025

…1319, as performance regressions have been observed. Note also that simply setting `use_replica_parallel=False` does not fix the issue.

PiperOrigin-RevId: 720202277

copybara-service bot pushed a commit that referenced this pull request Jan 28, 2025

…1319, as performance regressions have been observed. Note also that simply setting `use_replica_parallel=False` does not fix the issue. Also disable `enable_pinned_host_transfer` feature, as allowing this also results in poor performance.

PiperOrigin-RevId: 720622279