This repository was archived by the owner on Dec 20, 2024. It is now read-only.

MLFlow Offline Failover #114

Draft
wants to merge 1 commit into
base: develop

Conversation

HCookie
Member

@HCookie HCookie commented Oct 29, 2024

  • Wraps experiment to capture connection errors
  • Logs to disk if server not found

Contributor

@anaprietonem anaprietonem left a comment

Thanks for this, Harrison. We will probably need to iterate on this, but it would be nice to port this PR over once the new repo is ready so we can keep working on it.

@gmertes we would also need a way to test this functionality. Do you know if we could use the mlflow-test server for that?

from anemoi.training.utils.mlflow_sync import MlFlowSync

try:
    MlFlowSync(
Contributor

Would this try to sync the offline part of the run before logging online again? We'd probably need to use self._run_id and call _get_mlflow_run_params again so that it restarts from the synced run.

@@ -323,6 +328,7 @@ def __init__(
tracking_uri=tracking_uri,
on_resume_create_child=on_resume_create_child,
)
self._save_dir = f"{LOCAL_FILE_URI_PREFIX}{save_dir}"
Contributor

What is the LOCAL_FILE_URI_PREFIX doing here?
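
For context on the question above: as far as I can tell, LOCAL_FILE_URI_PREFIX in Lightning's MLflow logger module is the string "file:", which the logger prepends to save_dir to build a local tracking URI when no tracking_uri is given. A minimal sketch of that convention follows; the constant is redefined locally as an assumption rather than imported, and both helper names are hypothetical.

```python
# Assumed value, mirroring the constant in Lightning's MLflow logger module.
LOCAL_FILE_URI_PREFIX = "file:"


def to_local_tracking_uri(save_dir: str) -> str:
    """Build a local file-store tracking URI from a directory path."""
    return f"{LOCAL_FILE_URI_PREFIX}{save_dir}"


def to_save_dir(tracking_uri: str):
    """Return the directory for a local URI, or None for a remote one."""
    if tracking_uri.startswith(LOCAL_FILE_URI_PREFIX):
        return tracking_uri[len(LOCAL_FILE_URI_PREFIX):]
    return None
```

Under this reading, the prefixed self._save_dir in the diff would be a URI for a local store rather than a plain directory path.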


parent_obj = super()
logger_obj = self
experiment = super().experiment
Contributor

I think the 'experiment' property does not return an experiment object but rather the MLflow client (see https://lightning.ai/docs/pytorch/stable/_modules/lightning/pytorch/loggers/mlflow.html#MLFlowLogger), so what is the idea behind experiment = super().experiment? The new experiment method won't return the same thing as the original, so could that introduce problems?

@@ -426,7 +432,71 @@ def _get_mlflow_run_params(
def experiment(self) -> MLFlowLogger.experiment:
    if rank_zero_only.rank == 0:
        self.auth.authenticate()
    return super().experiment

parent_obj = super()
Contributor

This is probably just me not being familiar with this, but could you explain what parent_obj = super() does here?
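
One plausible reading, sketched below with hypothetical classes (`Base`, `Child`, `Wrapper`) rather than the PR's code: zero-argument super() only resolves correctly directly inside a method of the class it belongs to, so storing the proxy in a local variable makes the parent's implementation reachable from a nested class or closure defined in that method.

```python
class Base:
    def greet(self):
        return "base"


class Child(Base):
    def greet(self):
        # Equivalent to super(Child, self): a proxy bound to this instance.
        parent_obj = super()

        class Wrapper:
            def call(self):
                # A zero-argument super() here would resolve against Wrapper,
                # not Child, so the captured proxy is used instead.
                return parent_obj.greet()

        return "child->" + Wrapper().call()
```

Calling Child().greet() goes through Wrapper.call and still ends up in Base.greet via the captured proxy.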

logger_obj = self
experiment = super().experiment

if self._failed_to_offline:
Contributor

What happens if the user is already running in offline mode? That's needed on HPCs like Leonardo and MN5.

@gmertes
Member

gmertes commented Dec 16, 2024

> @gmertes we would also need a way to test this functionality. Do you know if we could use the mlflow-test server for that?

@anaprietonem Yes, I can bring mlflow-test down during an experiment to test.

3 participants