This repository was archived by the owner on Dec 20, 2024. It is now read-only.

MLFlow Offline Failover #114

Draft

HCookie wants to merge 1 commit into develop from feat/mlflow-offline-failover

Member

HCookie commented Oct 29, 2024

Wraps experiment to capture connection errors
Logs to disk if server not found


Initial commit to provide failover

223c66f

- Wraps experiment to capture connection errors
- Logs to disk if server not found

anaprietonem assigned HCookie

HCookie requested a review from anaprietonem

December 13, 2024 10:58

anaprietonem reviewed

Contributor

anaprietonem left a comment

Thanks for this, Harrison. We will probably need to iterate on this, but it would be nice to port this PR over once the new repo is ready so we can keep working on it.

@gmertes we would also need a way to test this functionality. Do you know if we could use the mlflow-test server for that?

src/anemoi/training/diagnostics/mlflow/logger.py

+                      from anemoi.training.utils.mlflow_sync import MlFlowSync
+                      try:
+                          MlFlowSync(

Contributor

anaprietonem Dec 16, 2024

Would this try to sync the offline part of the run before logging online again? We'd probably need to use self._run_id and call _get_mlflow_run_params again so that it restarts from the synced run.

src/anemoi/training/diagnostics/mlflow/logger.py

@@ -323,6 +328,7 @@ def __init__(
                           tracking_uri=tracking_uri,
                           on_resume_create_child=on_resume_create_child,
                       )
+                      self._save_dir = f"{LOCAL_FILE_URI_PREFIX}{save_dir}"

Contributor

anaprietonem Dec 16, 2024

What is the LOCAL_FILE_URI_PREFIX doing here?
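For context on the question above, a minimal sketch of what the prefix does. It assumes the constant has the same value as the one defined in lightning.pytorch.loggers.mlflow ("file:"), and redefines it locally so the snippet runs without Lightning installed. The helper names are hypothetical and are not part of this PR.

```python
from typing import Optional

# Assumption: same value as the constant in lightning.pytorch.loggers.mlflow.
LOCAL_FILE_URI_PREFIX = "file:"


def to_local_tracking_uri(save_dir: str) -> str:
    """Turn a directory path into a file-based tracking URI."""
    return f"{LOCAL_FILE_URI_PREFIX}{save_dir}"


def to_save_dir(tracking_uri: str) -> Optional[str]:
    """Recover the directory from a file-based URI; None for remote URIs."""
    if tracking_uri.startswith(LOCAL_FILE_URI_PREFIX):
        return tracking_uri[len(LOCAL_FILE_URI_PREFIX):]
    return None
```

Under that assumption, the prefixed value stored in self._save_dir is a URI rather than a bare path, so any code that later treats it as a directory would need to strip the prefix first.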

src/anemoi/training/diagnostics/mlflow/logger.py

+                      parent_obj = super()
+                      logger_obj = self
+                      experiment = super().experiment

Contributor

anaprietonem Dec 16, 2024

I think the 'experiment' property does not return an experiment object but rather the MLflow client (see https://lightning.ai/docs/pytorch/stable/_modules/lightning/pytorch/loggers/mlflow.html#MLFlowLogger). So what is the idea behind experiment = super().experiment? The new experiment property won't return the same thing as the original, which could introduce problems.
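One way to keep the returned object interchangeable with the original client is a proxy that forwards every attribute and only intercepts connection failures. The sketch below is illustrative only: FakeClient and FailoverProxy are hypothetical names, the fake client stands in for the real tracking client, and this is not the wrapper implemented in this PR.

```python
class FakeClient:
    """Hypothetical stand-in for a remote tracking client."""

    def __init__(self) -> None:
        self.online = True
        self.records = []

    def log_metric(self, run_id: str, key: str, value: float) -> None:
        if not self.online:
            raise ConnectionError("tracking server unreachable")
        self.records.append((run_id, key, value))


class FailoverProxy:
    """Forwards calls to `primary`; switches to `fallback` after a ConnectionError."""

    def __init__(self, primary, fallback) -> None:
        self._primary = primary
        self._fallback = fallback
        self.failed_over = False

    def __getattr__(self, name):
        # Only reached for names that are not set on the proxy itself.
        target = self._fallback if self.failed_over else self._primary
        attr = getattr(target, name)
        if self.failed_over or not callable(attr):
            return attr

        def guarded(*args, **kwargs):
            try:
                return attr(*args, **kwargs)
            except ConnectionError:
                # Remember the failure so later calls go straight to the fallback.
                self.failed_over = True
                return getattr(self._fallback, name)(*args, **kwargs)

        return guarded


remote, local = FakeClient(), FakeClient()
client = FailoverProxy(remote, local)
client.log_metric("run-1", "loss", 0.5)  # reaches the remote client
remote.online = False
client.log_metric("run-1", "loss", 0.4)  # fails over to the local client
```

Because the proxy forwards by name, callers that expect the client's interface keep working, which is the compatibility concern raised above. Whether the real client raises ConnectionError or a library-specific exception is an assumption that would need checking.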

src/anemoi/training/diagnostics/mlflow/logger.py

@@ -426,7 +432,71 @@ def _get_mlflow_run_params(
                   def experiment(self) -> MLFlowLogger.experiment:
                       if rank_zero_only.rank == 0:
                           self.auth.authenticate()
-                      return super().experiment
+                      parent_obj = super()

Contributor

anaprietonem Dec 16, 2024

This is probably just me not being familiar with this, but could you explain what parent_obj = super() does here?
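A possible reading, sketched with toy classes: zero-argument super() only resolves inside the method where it is written, so binding it to a name lets a nested class or function reach the parent implementation later. Base, Child and Wrapper below are hypothetical and only illustrate the pattern, not the actual logger code.

```python
class Base:
    def log(self, msg: str) -> str:
        return f"base logged: {msg}"


class Child(Base):
    @property
    def experiment(self):
        # Zero-argument super() relies on the enclosing method's `self`,
        # so it is captured here for use inside the nested class.
        parent_obj = super()
        logger_obj = self

        class Wrapper:
            def log(self, msg: str) -> str:
                # `self` is the Wrapper here; a bare super() would not refer to Base.
                return parent_obj.log(msg) + f" via {type(logger_obj).__name__}"

        return Wrapper()


result = Child().experiment.log("x")
```

The same capture applies to logger_obj = self, since self inside the nested class refers to the wrapper rather than the logger.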

src/anemoi/training/diagnostics/mlflow/logger.py

+                      logger_obj = self
+                      experiment = super().experiment
+                      if self._failed_to_offline:

Contributor

anaprietonem Dec 16, 2024

What happens if the user is already running in offline mode? That's needed on HPCs like Leonardo and MN5.

Member

gmertes commented Dec 16, 2024

> @gmertes we would also need a way to test this functionality. Do you know if we could use the mlflow-test server for that?

@anaprietonem Yes, I can bring mlflow-test down during an experiment to test.
