Set device according to local rank #788
base: main
Conversation
Pull request overview
This PR fixes device allocation issues for Gaudi2E multi-device setups by ensuring proper mapping between local rank and module IDs. The changes prevent HCCL failures when world_size > 4 by setting devices according to local rank and automatically managing available Habana modules.
Key Changes:
- Added automatic detection and configuration of available Habana modules using `pyhlml`
- Set device based on `local_rank` to ensure correct HPU assignment
- Added validation to ensure sufficient available modules for the requested world size
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
🚧 CI Blocked: The main CI workflow was not started for the following reason:
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
@yangulei The current implementation places all of the device-selection and environment-configuration logic inline, and it is a bit dense. Consider encapsulating this logic in a private helper function, e.g. `_configure_habana_visible_modules(world_size)`.
I’m not sure this approach is entirely safe. There’s a potential race condition here: if devices become busy after the availability check but before they’re actually used, the assumption of an idle state could fail. This scenario is especially likely when running vllm-gaudi in Kubernetes with multiple pods scheduled on the same node (with the user not using device plugins and resource limits).
Yes, you are right, this will result in
Done, thanks!
Force-pushed: d28490c to 1761c24
vllm_gaudi/v1/worker/hpu_worker.py (Outdated)

```python
        if utility.aip == 0 and utility.memory == 0:
            module_id = pyhlml.hlmlDeviceGetModuleID(device)
            available_module_ids.append(module_id)
    except Exception:
```
In what circumstances might there be an exception we want to ignore? A busy device?
I saw a system with 8 HPUs installed but one of them failed to be discovered in hl-smi. I cannot remember whether the HPU indexes are contiguous; if they are not, the scan over device indexes might try to access the invalid one.
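The situation described above can be sketched with a stand-in for the per-index device lookup (the `_get_handle` name and the simulated broken index are hypothetical; the real code would use a `pyhlml` per-index lookup):

```python
def _get_handle(index, broken=frozenset({3})):
    # Hypothetical stand-in for a pyhlml per-index device lookup;
    # index 3 simulates the HPU that hl-smi fails to discover.
    if index in broken:
        raise RuntimeError("device not found")
    return index

def scan(device_count):
    found = []
    for i in range(device_count):
        try:
            found.append(_get_handle(i))
        except Exception:
            continue  # skip undiscoverable devices, as in the PR
    return found

print(scan(8))  # [0, 1, 2, 4, 5, 6, 7]
```

The bare `except` then makes the scan tolerant of a non-contiguous or partially failed device set.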
vllm_gaudi/v1/worker/hpu_worker.py (Outdated)

```python
    except Exception:
        continue
if len(available_module_ids) < 1:
    raise RuntimeError("No available Habana modules found. All modules are currently in use.")
```
shutdown on pyhlml is not called here; consider using a context manager or try/finally.
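A minimal sketch of this suggestion, with `_init`/`_shutdown` as stand-ins for `pyhlml.hlmlInit()`/`pyhlml.hlmlShutdown()` (the `hlml_session` name is hypothetical, not part of the PR):

```python
from contextlib import contextmanager

calls = []

def _init():        # stand-in for pyhlml.hlmlInit()
    calls.append("init")

def _shutdown():    # stand-in for pyhlml.hlmlShutdown()
    calls.append("shutdown")

@contextmanager
def hlml_session():
    _init()
    try:
        yield
    finally:
        _shutdown()  # runs even if the body raises

try:
    with hlml_session():
        raise RuntimeError("No available Habana modules found.")
except RuntimeError:
    pass

print(calls)  # ['init', 'shutdown']
```

With this pattern, every `raise RuntimeError(...)` path in the scan no longer needs an explicit shutdown call before it.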
Done, thanks!
vllm_gaudi/v1/worker/hpu_worker.py (Outdated)

```python
if any(not c.isdigit() for c in env_visible_modules.split(",")) and env_visible_modules.lower() != "all":
    raise RuntimeError(f"Invalid HABANA_VISIBLE_MODULES={env_visible_modules}. "
                       "It should be a comma-separated list of integers or 'all'.")
env_module_ids = list(map(int, env_visible_modules.split(",")))
if any(module_id < 0 or module_id >= device_count for module_id in env_module_ids):
    pyhlml.hlmlShutdown()
    raise RuntimeError(f"Invalid HABANA_VISIBLE_MODULES={env_visible_modules}. "
                       f"Module IDs should be between 0 and {device_count - 1}.")
if any(env_module_id not in available_module_ids for env_module_id in env_module_ids):
    logger.warning("Some devices for HABANA_VISIBLE_MODULES=%s are not available.", env_visible_modules)
selected_modules = [x for x in env_module_ids if x in available_module_ids]
if len(selected_modules) < self.parallel_config.world_size:
    pyhlml.hlmlShutdown()
    raise RuntimeError(
        f"Not enough available modules for world_size={self.parallel_config.world_size}. "
        "Set HABANA_VISIBLE_MODULES to include more available modules and try again.")
else:
    selected_modules_str = ",".join(map(str, sorted(selected_modules)))
    os.environ["HABANA_VISIBLE_MODULES"] = selected_modules_str
    logger.warning("Using selected available modules: %s", selected_modules_str)
```
I think this should go into the README (how the env var should look) instead of complex logic here.
The usage of this ENV is already in the "Setting HABANA_VISIBLE_MODULES" section of the documentation.
Most of the logic is sanity checks, plus a useful path to filter out the busy modules.
run_deepseek_v2_inc_dynamic_tp2_test failed because of CI issues. The test case will be disabled ASAP and fixed after that.
Force-pushed: 979be75 to 475fc24
Signed-off-by: Youlei Yang <[email protected]>
Motivation
For a typical node with 8x Gaudi2E HPUs, the devices are split into two groups, each with 4 HPUs connected by a top board. The current random mapping between `local_rank` and `module_id` will cause HCCL failures for `world_size > 4` cases.

Changes
- Use `pyhlml` to set `HABANA_VISIBLE_MODULES` to the available modules. This is necessary if multiple jobs with `world_size = 1/2/4` want to run on the same node simultaneously, or if the available `module_id`s do not start at 0.
- Set the device according to `local_rank` to ensure the correct HPU assignment.
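A rough sketch of the deterministic rank-to-module mapping these changes aim for (the helper names below are hypothetical, not the actual implementation):

```python
import os

def configure_visible_modules(world_size, available_module_ids):
    # Hypothetical helper: pin HABANA_VISIBLE_MODULES to a deterministic,
    # sorted subset so every rank sees the devices in the same order.
    mods = sorted(available_module_ids)
    if len(mods) < world_size:
        raise RuntimeError(f"Need {world_size} modules, only {len(mods)} available")
    selected = mods[:world_size]
    os.environ["HABANA_VISIBLE_MODULES"] = ",".join(map(str, selected))
    return selected

def module_for_rank(local_rank, selected):
    # With a fixed, sorted HABANA_VISIBLE_MODULES, device index local_rank
    # always maps to the same module_id, replacing the random mapping
    # described in the Motivation (illustrative only).
    return selected[local_rank]

selected = configure_visible_modules(4, [2, 0, 3, 1, 6, 7])
print(os.environ["HABANA_VISIBLE_MODULES"])  # 0,1,2,3
print(module_for_rank(2, selected))          # 2
```

Which subset to prefer on a two-board node (keeping all ranks on one top board when `world_size <= 4`) is a placement policy this sketch does not model.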