[2.5] Remove hardcode heartbeat_timeout 0 #3176

Merged
merged 1 commit into from
Jan 23, 2025

@YuanTingHsieh (Collaborator) commented Jan 23, 2025

Description

The original hardcoded heartbeat_timeout of 0 has a problem: if the external user code hits an exception and the program never returns, our FL client job process (running LauncherExecutor) never ends.
With the default heartbeat_timeout value, if the FL client job process does not receive a heartbeat from the user process for heartbeat_timeout seconds, we consider the user process dead.
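
To illustrate the behavior described above, here is a minimal sketch of a heartbeat liveness check. It is not the NVFlare implementation: the function name `peer_considered_dead` and its parameters are hypothetical, and the reading that a timeout of 0 disables the check is an assumption inferred from the description.

```python
import time
from typing import Optional


def peer_considered_dead(
    last_heartbeat: float,
    heartbeat_timeout: float,
    now: Optional[float] = None,
) -> bool:
    """Return True if no heartbeat arrived within heartbeat_timeout seconds.

    Hypothetical helper for illustration only, not part of NVFlare.
    Assumption: a timeout of 0 (or less) disables the check, so a hung
    peer is never declared dead and the waiting process never exits.
    """
    if now is None:
        now = time.monotonic()
    if heartbeat_timeout <= 0:
        # Check disabled: the waiter keeps waiting no matter how long
        # the peer has been silent.
        return False
    return (now - last_heartbeat) > heartbeat_timeout
```

With a positive timeout, a user process that stops sending heartbeats is eventually treated as dead, which lets the waiting process shut down instead of blocking forever.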

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

@YuanTingHsieh (Collaborator, Author) commented

/build

@IsaacYangSLA (Collaborator) left a comment

LGTM.

@IsaacYangSLA IsaacYangSLA merged commit 215fd4d into NVIDIA:2.5 Jan 23, 2025
20 checks passed
@YuanTingHsieh YuanTingHsieh deleted the remove_hardcode_heartbeat_25 branch January 23, 2025 23:18