Commit e52ff39

fix diloco integration test (#218)
Summary:
- for DiLoCo, the model parameters, as saved by the test, can differ across replicas
- only the global parameters are expected to match
- fix the test to validate that the global parameters are the same, instead of the local model parameters

Test Plan:
```
$ pytest -v ./torchft/local_sgd_integ_test.py::LocalSGDIntegTest::test_diloco_recovery_0
```
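To make the distinction between local and global parameters concrete, here is a toy sketch in plain Python. It is not torchft's implementation: the `Replica` class, the scalar "model", the gradients, and the use of plain SGD for both the inner and outer optimizers are all assumptions made for illustration. It only shows that replicas drift apart between syncs while the globally synced value stays identical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    """Hypothetical toy replica holding a single scalar weight."""

    local_param: float     # updated by the inner optimizer on every step
    original_param: float  # global parameter, only changes at a sync

    def inner_step(self, grad: float, lr: float = 0.1) -> None:
        # Plain SGD stands in for the inner optimizer (assumption).
        self.local_param -= lr * grad


def outer_sync(replicas: List[Replica], outer_lr: float = 1.0) -> None:
    # Pseudo-gradient: how far each replica drifted from the global parameter.
    pseudo_grads = [r.original_param - r.local_param for r in replicas]
    avg = sum(pseudo_grads) / len(pseudo_grads)
    for r in replicas:
        # Plain SGD stands in for the outer optimizer (assumption).
        r.original_param -= outer_lr * avg
        # After a sync, the local model restarts from the global parameter.
        r.local_param = r.original_param


r1 = Replica(local_param=1.0, original_param=1.0)
r2 = Replica(local_param=1.0, original_param=1.0)

# Each replica sees different data, hence different (made-up) gradients.
for g1, g2 in [(0.5, -0.3), (0.2, 0.4)]:
    r1.inner_step(g1)
    r2.inner_step(g2)

local_before_sync = (r1.local_param, r2.local_param)
global_before_sync = (r1.original_param, r2.original_param)

outer_sync([r1, r2])
```

Before the sync, `local_before_sync` holds two different values while `global_before_sync` holds two equal ones, which is why comparing local model parameters across replicas is only safe immediately after a successful sync.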
1 parent 87fbc95 commit e52ff39

1 file changed, +18 -5 lines changed

torchft/local_sgd_integ_test.py

Lines changed: 18 additions & 5 deletions
@@ -414,13 +414,26 @@ def test_diloco_recovery(self, use_cuda: bool) -> None:
         rep0, rep1 = state_dicts
 
         for step in rep0.keys():
-            # Inner optimizer will be different, outer optimizer and model should be the same
+            # Inner optimizer and local model parameters will be different e.g.
+            # with 2 replicas r1 and r2, we sync every 2 steps
+            #
+            # - Manager Step 1
+            #   - Step 1: r1 and r2 step
+            #   - Step 2: r1 and r2 step, sync the model, quorum succeeds
+            # - Manager Step 2
+            #   - Step 1: r1 steps but r2 fails
+            #   - Step 2:
+            #     - r1 steps, sync fails because r2 is down
+            #     - r1 recovers r2 from the model state at this step
+            #       that is different from the model for r1 at the beginning
+            #       of step Manager Step 2
+            #
+            # Outer optimizer and global model should be the same
+
             torch.testing.assert_close(
-                rep1[step]["model"],
-                rep0[step]["model"],
+                rep1[step]["original_params"],
+                rep0[step]["original_params"],
                 check_device=False,
-                rtol=1e-4,
-                atol=1e-4,
             )
             torch.testing.assert_close(
                 rep1[step]["outer_optim"],
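The timeline described in the new comment can be replayed with a small plain-Python sketch. Several things here are assumptions rather than facts about the test: each replica is assumed to record a snapshot the first time it is live in a manager step, the "model" is a single float, the gradients are made up, and the sync simply averages the local models. `math.isclose` stands in for `torch.testing.assert_close` so the sketch has no dependencies.

```python
import math
from typing import Dict, Tuple

INNER_LR = 0.1  # made-up inner learning rate

Snapshots = Dict[int, Dict[str, float]]


def simulate() -> Tuple[Snapshots, Snapshots]:
    """Replay the two manager steps for hypothetical replicas r1 and r2."""
    r1 = {"model": 1.0, "original_params": 1.0}
    r2 = {"model": 1.0, "original_params": 1.0}
    rep0: Snapshots = {}
    rep1: Snapshots = {}

    # Manager Step 1: both replicas are live from the beginning.
    rep0[1] = dict(r1)
    rep1[1] = dict(r2)
    for g1, g2 in [(0.5, -0.3), (0.2, 0.4)]:
        r1["model"] -= INNER_LR * g1
        r2["model"] -= INNER_LR * g2
    # Quorum succeeds: the averaged model becomes the new global parameter.
    synced = (r1["model"] + r2["model"]) / 2
    r1 = {"model": synced, "original_params": synced}

    # Manager Step 2: r1 records its state at the beginning of the step.
    rep0[2] = dict(r1)
    # Step 1: r1 steps, r2 has failed.
    r1["model"] -= INNER_LR * 0.7
    # Step 2: r1 steps; the sync fails, so original_params is left unchanged.
    r1["model"] -= INNER_LR * 0.1
    # r2 is recovered from r1's current state, which has already drifted.
    r2 = dict(r1)
    rep1[2] = dict(r2)
    return rep0, rep1


rep0, rep1 = simulate()
for step in rep0.keys():
    # The global parameters agree at every manager step.
    assert math.isclose(rep1[step]["original_params"], rep0[step]["original_params"])

# The local models recorded for Manager Step 2 differ between the replicas.
models_match_at_step_2 = math.isclose(rep1[2]["model"], rep0[2]["model"])
```

Under these assumptions `models_match_at_step_2` is `False`, which mirrors why the old assertion on `"model"` was replaced by one on `"original_params"`.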
