@yangulei (Contributor) commented Jan 8, 2026

Motivation

For a typical node with 8 Gaudi2E HPUs, the devices are split into two groups of 4 HPUs, each group connected to its own top board. The current random mapping between local_rank and module_id causes HCCL failures when world_size > 4.

Changes

  • Set device according to local rank.
  • Use pyhlml to set HABANA_VISIBLE_MODULES to the available modules. This is necessary if multiple cases with world_size=1/2/4 need to run on the same node simultaneously, or if the available module_ids do not start at 0.
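
A minimal sketch of the mapping the two changes aim for, not the PR's actual code. It assumes that HABANA_VISIBLE_MODULES is exported in sorted order and that each process then picks its device by local rank, so that rank i lands on the i-th smallest visible module. The name `map_ranks_to_modules` is hypothetical, and treating module IDs 0-3 and 4-7 as the two groups is an assumption made only for illustration.

```python
def map_ranks_to_modules(available_module_ids, world_size):
    """Return {local_rank: module_id} using the sorted available modules.

    Hypothetical helper for illustration; it only models the mapping and
    does not touch any device.
    """
    modules = sorted(set(available_module_ids))
    if len(modules) < world_size:
        raise RuntimeError(
            f"Need {world_size} modules but only {len(modules)} are available.")
    return {rank: modules[rank] for rank in range(world_size)}


if __name__ == "__main__":
    # All 8 modules free: ranks follow module order across both groups.
    print(map_ranks_to_modules(range(8), world_size=8))
    # Modules 0-3 busy: a world_size=4 job is mapped onto modules 4-7.
    print(map_ranks_to_modules([7, 5, 4, 6], world_size=4))
```

Because the mapping is deterministic, two jobs that see disjoint sets of available modules do not compete for the same device.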

Copilot AI (Contributor) left a comment

Pull request overview

This PR fixes device allocation issues for Gaudi2E multi-device setups by ensuring proper mapping between local rank and module IDs. The changes prevent HCCL failures when world_size > 4 by setting devices according to local rank and automatically managing available Habana modules.

Key Changes:

  • Added automatic detection and configuration of available Habana modules using pyhlml
  • Set device based on local_rank to ensure correct HPU assignment
  • Added validation to ensure sufficient available modules for the requested world size

@yangulei yangulei requested a review from Copilot January 8, 2026 07:30
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


github-actions bot commented Jan 8, 2026

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.


@yafshar (Contributor) commented Jan 8, 2026

@yangulei The current implementation places all the device selection and environment configuration logic inline, and it is a bit dense. Consider encapsulating this logic into a (private) helper function (e.g., _configure_habana_visible_modules(world_size))

@yafshar (Contributor) commented Jan 8, 2026

I’m not sure this approach is entirely safe. There’s a potential race condition here: if devices become busy after the availability check but before they’re actually used, the assumption of an idle state could fail. This scenario is especially likely when running vllm-gaudi in Kubernetes with multiple pods scheduled on the same node (when users are not using device plugins and resource limits).

@yangulei (Contributor, Author) commented Jan 9, 2026

> I’m not sure this approach is entirely safe. There’s a potential race condition here: if devices become busy after the availability check but before they’re actually used, the assumption of an idle state could fail. This scenario is especially likely when running vllm-gaudi in Kubernetes with multiple pods scheduled on the same node (when users are not using device plugins and resource limits).

Yes, you are right; this will result in a "Device acquire failed" error. But I can't find a better solution here. Do you have any ideas?
BTW, I'm not sure whether utility.aip == 0 and utility.memory == 0 is the best condition for identifying an available device.
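
One possible direction, offered only as a sketch and not something this PR implements: cooperating processes could claim a module with an advisory file lock before using it, which makes the check and the claim a single atomic step. The function name `try_claim_module`, the lock file naming, and the shared `lock_dir` are all hypothetical. The approach only helps between processes that share the lock directory and all follow the same convention, and `fcntl` is available on Unix-like systems only.

```python
import fcntl
import os
import tempfile


def try_claim_module(module_id, lock_dir):
    """Try to take an exclusive, non-blocking lock for one module.

    Returns an open file object that holds the lock, or None if another
    holder already has it. The lock is released when the file is closed.
    """
    path = os.path.join(lock_dir, f"hpu_module_{module_id}.lock")
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else holds the lock; treat the module as taken.
        handle.close()
        return None
    return handle


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as lock_dir:
        first = try_claim_module(3, lock_dir)
        second = try_claim_module(3, lock_dir)  # same module, already claimed
        print(first is not None, second is None)
        first.close()
```

This does not protect against processes that skip the lock, so a "Device acquire failed" error would still need to be handled.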

@yangulei (Contributor, Author) commented Jan 9, 2026

> @yangulei The current implementation places all the device selection and environment configuration logic inline, and it is a bit dense. Consider encapsulating this logic into a (private) helper function (e.g., _configure_habana_visible_modules(world_size))

Done, thanks!

@yangulei yangulei force-pushed the set_device branch 2 times, most recently from d28490c to 1761c24 Compare January 13, 2026 08:35
        if utility.aip == 0 and utility.memory == 0:
            module_id = pyhlml.hlmlDeviceGetModuleID(device)
            available_module_ids.append(module_id)
    except Exception:
Collaborator

Under what circumstances might there be an exception that we want to ignore? A busy device?

Contributor Author

I saw a system with 8 HPUs installed where one of them failed to be discovered in hl-smi. I can't remember whether the HPU indexes were contiguous; if they were not, the scan over device indexes might try to access an invalid one.
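
The scan described here can be sketched as follows. This is an illustration under assumptions, not the PR's code: `find_idle_module_ids` and `FakeHlml` are hypothetical, the library object is injected so the example runs without hardware, and the call names `hlmlDeviceGetHandleByIndex` and `hlmlDeviceGetUtilizationRates` are assumed (only `hlmlDeviceGetModuleID` appears in the excerpt above).

```python
from types import SimpleNamespace


def find_idle_module_ids(hlml, device_count):
    """Collect module IDs of devices that report no compute or memory use.

    Indexes whose device cannot be queried are skipped instead of aborting
    the whole scan.
    """
    idle = []
    for index in range(device_count):
        try:
            device = hlml.hlmlDeviceGetHandleByIndex(index)  # assumed call name
            utility = hlml.hlmlDeviceGetUtilizationRates(device)  # assumed call name
            if utility.aip == 0 and utility.memory == 0:
                idle.append(hlml.hlmlDeviceGetModuleID(device))
        except Exception:
            # For example, an installed HPU that was not discovered.
            continue
    return idle


class FakeHlml:
    """Stand-in for pyhlml: index 2 cannot be queried, index 1 is busy."""

    def hlmlDeviceGetHandleByIndex(self, index):
        if index == 2:
            raise RuntimeError("device not found")
        return index

    def hlmlDeviceGetUtilizationRates(self, device):
        busy = device == 1
        return SimpleNamespace(aip=50 if busy else 0, memory=10 if busy else 0)

    def hlmlDeviceGetModuleID(self, device):
        return device


if __name__ == "__main__":
    print(find_idle_module_ids(FakeHlml(), device_count=4))
```

Catching a narrower exception type than `Exception` would make the intent clearer if the library exposes one.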

    except Exception:
        continue
if len(available_module_ids) < 1:
    raise RuntimeError("No available Habana modules found. All modules are currently in use.")
Collaborator

The pyhlml shutdown is not called here; consider using a context manager or try-finally.

Contributor Author

Done, thanks!
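
For reference, the context-manager variant of this suggestion could look like the sketch below. It is not taken from the PR: `hlml_session` and `FakeHlml` are hypothetical names, the library object is injected so the example runs without pyhlml, and `hlmlInit` is assumed to be the counterpart of the `hlmlShutdown` call seen in the diff.

```python
from contextlib import contextmanager


@contextmanager
def hlml_session(hlml):
    """Initialize the library and always shut it down, even on errors."""
    hlml.hlmlInit()  # assumed counterpart of hlmlShutdown
    try:
        yield hlml
    finally:
        hlml.hlmlShutdown()


class FakeHlml:
    """Stand-in that records calls instead of talking to a driver."""

    def __init__(self):
        self.calls = []

    def hlmlInit(self):
        self.calls.append("init")

    def hlmlShutdown(self):
        self.calls.append("shutdown")


if __name__ == "__main__":
    fake = FakeHlml()
    try:
        with hlml_session(fake):
            raise RuntimeError("Not enough available modules")
    except RuntimeError:
        pass
    print(fake.calls)
```

With this shape, the explicit shutdown calls before each `raise` are no longer needed, because the `finally` clause covers every exit path.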

Comment on lines 126 to 145
if any(not c.isdigit() for c in env_visible_modules.split(",")) and env_visible_modules.lower() != "all":
    raise RuntimeError(f"Invalid HABANA_VISIBLE_MODULES={env_visible_modules}. "
                       "It should be a comma-separated list of integers or 'all'.")
env_module_ids = list(map(int, env_visible_modules.split(",")))
if any(module_id < 0 or module_id >= device_count for module_id in env_module_ids):
    pyhlml.hlmlShutdown()
    raise RuntimeError(f"Invalid HABANA_VISIBLE_MODULES={env_visible_modules}. "
                       f"Module IDs should be between 0 and {device_count - 1}.")
if any(env_module_id not in available_module_ids for env_module_id in env_module_ids):
    logger.warning("Some device for HABANA_VISIBLE_MODULES=%s are not available.", env_visible_modules)
selected_modules = [x for x in env_module_ids if x in available_module_ids]
if len(selected_modules) < self.parallel_config.world_size:
    pyhlml.hlmlShutdown()
    raise RuntimeError(
        f"Not enough available modules for world_size={self.parallel_config.world_size}. "
        "Set HABANA_VISIBLE_MODULES to include more available modules and try again.")
else:
    selected_modules_str = ",".join(map(str, sorted(selected_modules)))
    os.environ["HABANA_VISIBLE_MODULES"] = selected_modules_str
    logger.warning("Using selected available modules: %s", selected_modules_str)
Collaborator

I think this belongs in the README, as a description of what the env should look like, instead of complex logic here.

Contributor Author

The usage of this ENV is already covered in the "Setting HABANA_VISIBLE_MODULES" section of the documentation.
Most of the logic is sanity checks, plus a useful path that filters out busy modules.
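
The sanity checks can also be isolated into a small pure function, sketched here under assumptions and not taken from the PR. `parse_visible_modules` is a hypothetical name, and expanding 'all' to every index below `device_count` is an assumption of this sketch. It handles 'all' before the integer conversion, since in the excerpt as shown a value of 'all' would pass the first check and then reach `int(...)` unless an earlier branch intercepts it.

```python
def parse_visible_modules(value, device_count):
    """Parse a HABANA_VISIBLE_MODULES-style string into a list of module IDs.

    Accepts 'all' (any case) or a comma-separated list of integers in
    [0, device_count). Raises ValueError on anything else.
    """
    text = value.strip()
    if text.lower() == "all":
        # Assumption: 'all' means every module index on the node.
        return list(range(device_count))
    parts = [part.strip() for part in text.split(",")]
    if any(not (part.isascii() and part.isdigit()) for part in parts):
        raise ValueError(
            f"Invalid HABANA_VISIBLE_MODULES={value!r}. It should be a "
            "comma-separated list of integers or 'all'.")
    module_ids = [int(part) for part in parts]
    if any(module_id >= device_count for module_id in module_ids):
        raise ValueError(
            f"Invalid HABANA_VISIBLE_MODULES={value!r}. Module IDs should be "
            f"between 0 and {device_count - 1}.")
    return module_ids


if __name__ == "__main__":
    print(parse_visible_modules("4,5,6,7", device_count=8))
    print(parse_visible_modules("ALL", device_count=4))
```

The returned list can then be intersected with the idle modules found by the scan, which keeps the filtering path separate from the validation.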

@iboiko-habana (Collaborator)

run_deepseek_v2_inc_dynamic_tp2_test failed because of CI issues. The test case will be disabled ASAP and fixed after that.

@yangulei yangulei force-pushed the set_device branch 2 times, most recently from 979be75 to 475fc24 Compare January 15, 2026 17:26