MLFlow Offline Failover #114
base: develop
HCookie
commented
Oct 29, 2024
- Wraps experiment to capture connection errors
- Logs to disk if server not found
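The two bullets above can be sketched roughly as follows. This is a minimal illustration of the failover pattern, not the code in this PR: `FlakyClient`, `FailoverClient`, and the `offline_calls.jsonl` file name are hypothetical, and only the idea of a "failed to offline" flag is taken from the diff.

```python
import json
import tempfile
from pathlib import Path


class FlakyClient:
    """Hypothetical stand-in for a tracking client whose server can disappear."""

    def __init__(self):
        self.online = True
        self.received = []

    def log_metric(self, run_id, key, value):
        if not self.online:
            raise ConnectionError("tracking server not reachable")
        self.received.append((run_id, key, value))


class FailoverClient:
    """Forwards calls to the wrapped client; after a connection error, logs to disk."""

    def __init__(self, client, save_dir):
        self._client = client
        self._log_file = Path(save_dir) / "offline_calls.jsonl"  # hypothetical name
        self.failed_to_offline = False

    def __getattr__(self, name):
        # Only reached for attributes not defined on the wrapper itself.
        target = getattr(self._client, name)

        def call(*args, **kwargs):
            if not self.failed_to_offline:
                try:
                    return target(*args, **kwargs)
                except ConnectionError:
                    # Server not found: stay offline for the rest of the run.
                    self.failed_to_offline = True
            # Offline: record the call on disk so it could be replayed later.
            record = {"method": name, "args": list(args), "kwargs": kwargs}
            with self._log_file.open("a") as fh:
                fh.write(json.dumps(record) + "\n")
            return None

        return call


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        backend = FlakyClient()
        client = FailoverClient(backend, tmp)
        client.log_metric("run-1", "loss", 0.5)  # goes to the "server"
        backend.online = False
        client.log_metric("run-1", "loss", 0.4)  # falls back to disk
        print((Path(tmp) / "offline_calls.jsonl").read_text())
```

The sketch stays offline once a connection error has occurred; whether and when to retry the server is a separate design decision.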
Thanks for this, Harrison. We'll probably need to iterate on this, but it would be nice to port this PR over once the new repo is ready so we can keep working on it.
@gmertes we would also need a way to test this functionality. Do you know if we could use the mlflow-test server for that?
from anemoi.training.utils.mlflow_sync import MlFlowSync

try:
MlFlowSync(
Would this try to sync the offline part of the run before logging online again? We'd probably need to use `self._run_id` and call `_get_mlflow_run_params` again so that it restarts from the synced run ID.
@@ -323,6 +328,7 @@ def __init__(
tracking_uri=tracking_uri,
on_resume_create_child=on_resume_create_child,
)
self._save_dir = f"{LOCAL_FILE_URI_PREFIX}{save_dir}"
What is `LOCAL_FILE_URI_PREFIX` doing here?
parent_obj = super()
logger_obj = self
experiment = super().experiment
I think the `experiment` method does not return an experiment object but rather the MLflow client (see https://lightning.ai/docs/pytorch/stable/_modules/lightning/pytorch/loggers/mlflow.html#MLFlowLogger), so what's the idea behind `experiment = super().experiment`? This new `experiment` method won't return the same thing as the original, which could introduce problems.
@@ -426,7 +432,71 @@ def _get_mlflow_run_params(
def experiment(self) -> MLFlowLogger.experiment:
if rank_zero_only.rank == 0:
self.auth.authenticate()
return super().experiment

parent_obj = super()
This is probably just me not being familiar with this, but could you explain what `parent_obj = super()` does here?
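For context on the question above, here is a small standalone sketch of what binding `super()` to a name does in Python. It is one plausible reading of the pattern, not confirmed by the diff shown here: `Base`, `Child`, and `Wrapper` are hypothetical names, and the returned string stands in for whatever the parent's `experiment` returns.

```python
class Base:
    @property
    def experiment(self):
        return "client from Base"


class Child(Base):
    @property
    def experiment(self):
        # Zero-argument super() returns a proxy bound to this instance;
        # attribute lookup on it starts after Child in the MRO.
        parent_obj = super()
        logger_obj = self

        class Wrapper:
            def original(self):
                # Inside Wrapper, a bare super() would refer to Wrapper's own
                # parents, so the proxy captured above is used instead.
                return parent_obj.experiment

            def owner(self):
                return logger_obj

        return Wrapper()


child = Child()
wrapped = child.experiment
print(wrapped.original())
print(wrapped.owner() is child)
```

Capturing the proxy in a local name lets a nested class or closure reach the parent implementation without `Child.experiment` recursing into itself.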
logger_obj = self
experiment = super().experiment

if self._failed_to_offline:
What happens if the user is already running in offline mode? That's needed on HPC systems like Leonardo and MN5.
@anaprietonem Yes, I can bring mlflow-test down during an experiment to test.