Set device according to local rank #788

yangulei · 2026-01-08T07:07:55Z

Motivation

For a typical node with 8xGaudi2E HPUs, the devices are split into two groups of 4 HPUs, with the HPUs in each group connected through a top board. The current random mapping between local_rank and module_id causes HCCL failures when world_size > 4.

Changes

Set device according to local rank.
Use pyhlml to set HABANA_VISIBLE_MODULES to the available modules. This is necessary when multiple cases with world_size=1/2/4 need to run on the same node simultaneously, or when the available module_ids do not start at 0.
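
A minimal sketch of the second change, assuming the idle module IDs have already been collected (for example with pyhlml). The helper name visible_modules_value is hypothetical and does not come from the PR; it only illustrates turning the idle module IDs into a HABANA_VISIBLE_MODULES value and rejecting a world_size that cannot be satisfied.

```python
def visible_modules_value(available_module_ids, world_size):
    """Build a HABANA_VISIBLE_MODULES value from idle module IDs.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    module_ids = sorted(set(available_module_ids))
    if len(module_ids) < world_size:
        raise RuntimeError(
            f"Not enough available modules for world_size={world_size}: "
            f"found {len(module_ids)}.")
    # Expose every idle module; the IDs do not need to start at 0.
    return ",".join(str(m) for m in module_ids)


if __name__ == "__main__":
    # Assume modules 0-3 are busy with another job and 4-7 are idle.
    print(visible_modules_value([6, 4, 7, 5], world_size=4))
```

The returned string would then be written to os.environ["HABANA_VISIBLE_MODULES"] before the device is acquired.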

Copilot

Pull request overview

This PR fixes device allocation issues for Gaudi2E multi-device setups by ensuring proper mapping between local rank and module IDs. The changes prevent HCCL failures when world_size > 4 by setting devices according to local rank and automatically managing available Habana modules.

Key Changes:

Added automatic detection and configuration of available Habana modules using pyhlml
Set device based on local_rank to ensure correct HPU assignment
Added validation to ensure sufficient available modules for the requested world size

vllm_gaudi/v1/worker/hpu_worker.py

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

vllm_gaudi/v1/worker/hpu_worker.py

github-actions · 2026-01-08T07:45:20Z

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

vllm_gaudi/v1/worker/hpu_worker.py

yafshar · 2026-01-08T19:00:05Z

@yangulei The current implementation places all the device selection and environment configuration logic inline, and it is a bit dense. Consider encapsulating this logic into a (private) helper function (e.g., _configure_habana_visible_modules(world_size))

yafshar · 2026-01-08T19:04:50Z

I’m not sure this approach is entirely safe. There’s a potential race condition here: if devices become busy after the availability check but before they’re actually used, the assumption of idle state could fail. This scenario is especially likely when running vllm-gaudi in Kubernetes with multiple pods scheduled on the same node (with user not using device plugins and resource limits)

yangulei · 2026-01-09T04:59:30Z

> I’m not sure this approach is entirely safe. There’s a potential race condition here: if devices become busy after the availability check but before they’re actually used, the assumption of idle state could fail. This scenario is especially likely when running vllm-gaudi in Kubernetes with multiple pods scheduled on the same node (with user not using device plugins and resource limits)

Yes, you are right: this will result in a "Device acquire failed" error. I can't find a better solution here, though. Do you have any ideas?
BTW, I'm not sure whether utility.aip == 0 and utility.memory == 0 is the best condition for identifying an available device.

yangulei · 2026-01-09T05:00:08Z

> @yangulei The current implementation places all the device selection and environment configuration logic inline, and it is a bit dense. Consider encapsulating this logic into a (private) helper function (e.g., _configure_habana_visible_modules(world_size))

Done, thanks!

michalkuligowski · 2026-01-13T14:11:30Z

vllm_gaudi/v1/worker/hpu_worker.py

+                if utility.aip == 0 and utility.memory == 0:
+                    module_id = pyhlml.hlmlDeviceGetModuleID(device)
+                    available_module_ids.append(module_id)
+            except Exception:


In what circumstances might there be an exception that we want to ignore? A busy device?

yangulei (Jan 14, 2026): I saw a system with 8 HPUs installed where one of them failed to be discovered by hl-smi. I can't remember whether the HPU indexes were contiguous; if they were not, scanning the device indexes might try to access an invalid one.
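
The scan discussed in this thread can be sketched without Gaudi hardware. This is only an illustration under stated assumptions: probe is a hypothetical callable standing in for the pyhlml queries (device handle, utilization, module ID), and fake_probe simulates a node where one index cannot be discovered.

```python
def scan_idle_modules(device_count, probe):
    """Return module IDs of idle devices, skipping indexes that cannot be probed.

    `probe(index)` is a hypothetical callable that returns a tuple
    (module_id, aip_utilization, memory_utilization) or raises on failure.
    """
    idle = []
    for index in range(device_count):
        try:
            module_id, aip, memory = probe(index)
        except Exception:
            # For example, a device that failed discovery: skip this index.
            continue
        # Same idle condition as in the diff: no compute and no memory use.
        if aip == 0 and memory == 0:
            idle.append(module_id)
    return idle


def fake_probe(index):
    """Simulated node: indexes 0-1 are busy, index 2 is undiscoverable."""
    if index == 2:
        raise RuntimeError("device not found")
    busy = index < 2
    return index, (50 if busy else 0), (10 if busy else 0)


if __name__ == "__main__":
    print(scan_idle_modules(5, fake_probe))
```

Catching a narrower exception type than Exception would make unexpected failures easier to spot, but which exception pyhlml raises in this situation is not stated in the thread.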

michalkuligowski · 2026-01-13T14:12:14Z

vllm_gaudi/v1/worker/hpu_worker.py

+            except Exception:
+                continue
+        if len(available_module_ids) < 1:
+            raise RuntimeError("No available Habana modules found. All modules are currently in use.")


The pyhlml shutdown is not called on this path; consider using a context manager or try-finally.

yangulei (Jan 14, 2026): Done, thanks!
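
One way the context-manager suggestion could look, as a sketch rather than the code that was merged. It assumes the library object exposes hlmlInit() and hlmlShutdown(); hlml_session and FakeHlml are hypothetical names, and FakeHlml only stands in for pyhlml so the example runs anywhere.

```python
from contextlib import contextmanager


@contextmanager
def hlml_session(lib):
    """Initialise `lib` and guarantee shutdown, even if the body raises."""
    lib.hlmlInit()
    try:
        yield lib
    finally:
        lib.hlmlShutdown()


class FakeHlml:
    """Stand-in for pyhlml that records which calls were made."""

    def __init__(self):
        self.calls = []

    def hlmlInit(self):
        self.calls.append("init")

    def hlmlShutdown(self):
        self.calls.append("shutdown")


if __name__ == "__main__":
    fake = FakeHlml()
    try:
        with hlml_session(fake):
            raise RuntimeError("No available Habana modules found.")
    except RuntimeError:
        pass
    print(fake.calls)
```

With this shape, the explicit pyhlml.hlmlShutdown() calls before each raise in the diff would no longer be needed.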

michalkuligowski · 2026-01-13T14:13:43Z

vllm_gaudi/v1/worker/hpu_worker.py

+            if any(not c.isdigit() for c in env_visible_modules.split(",")) and env_visible_modules.lower() != "all":
+                raise RuntimeError(f"Invalid HABANA_VISIBLE_MODULES={env_visible_modules}. "
+                                   "It should be a comma-separated list of integers or 'all'.")
+            env_module_ids = list(map(int, env_visible_modules.split(",")))
+            if any(module_id < 0 or module_id >= device_count for module_id in env_module_ids):
+                pyhlml.hlmlShutdown()
+                raise RuntimeError(f"Invalid HABANA_VISIBLE_MODULES={env_visible_modules}. "
+                                   f"Module IDs should be between 0 and {device_count - 1}.")
+            if any(env_module_id not in available_module_ids for env_module_id in env_module_ids):
+                logger.warning("Some device for HABANA_VISIBLE_MODULES=%s are not available.", env_visible_modules)
+                selected_modules = [x for x in env_module_ids if x in available_module_ids]
+                if len(selected_modules) < self.parallel_config.world_size:
+                    pyhlml.hlmlShutdown()
+                    raise RuntimeError(
+                        f"Not enough available modules for world_size={self.parallel_config.world_size}. "
+                        "Set HABANA_VISIBLE_MODULES to include more available modules and try again.")
+                else:
+                    selected_modules_str = ",".join(map(str, sorted(selected_modules)))
+                    os.environ["HABANA_VISIBLE_MODULES"] = selected_modules_str
+                    logger.warning("Using selected available modules: %s", selected_modules_str)


I think a description of how the env variable should look belongs in the README, instead of complex logic here.

yangulei (Jan 14, 2026): The usage of this env variable is already covered in the "Setting HABANA_VISIBLE_MODULES" section of the documentation.
Most of the logic consists of sanity checks, plus a useful path that filters out the busy modules.
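
The sanity checks could also be isolated in a small pure function, which keeps the worker code short and is easy to unit test. The sketch below is an assumption-laden illustration, not the PR's code: parse_visible_modules is a hypothetical name, module IDs are assumed to lie in 0..device_count-1 as in the diff, and treating "all" as every module ID is an assumption. In the excerpt above, a value of "all" appears to pass the first check and then reach the int conversion; the surrounding code, which is not shown, may handle that case.

```python
def parse_visible_modules(value, device_count):
    """Parse a HABANA_VISIBLE_MODULES string into a sorted list of module IDs.

    Hypothetical helper: accepts 'all' or a comma-separated list of integers.
    """
    text = value.strip()
    if text.lower() == "all":
        # Assumption: 'all' means every module ID on the node.
        return list(range(device_count))
    parts = [p.strip() for p in text.split(",")]
    if any(not (p.isascii() and p.isdigit()) for p in parts):
        raise ValueError(
            f"Invalid HABANA_VISIBLE_MODULES={value}. It should be a "
            "comma-separated list of integers or 'all'.")
    module_ids = sorted({int(p) for p in parts})
    if any(m >= device_count for m in module_ids):
        raise ValueError(
            f"Invalid HABANA_VISIBLE_MODULES={value}. Module IDs should be "
            f"between 0 and {device_count - 1}.")
    return module_ids


if __name__ == "__main__":
    print(parse_visible_modules("all", 4))
    print(parse_visible_modules("3, 1", 8))
```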

iboiko-habana · 2026-01-14T09:52:02Z

run_deepseek_v2_inc_dynamic_tp2_test failed because of CI issues. The test case will be disabled ASAP and fixed afterwards.

Signed-off-by: Youlei Yang <[email protected]>

Copilot AI review requested due to automatic review settings January 8, 2026 07:07

yangulei requested review from adobrzyn, afierka-intel, iboiko-habana, kamil-kaczor, ksmusz, kzawora-intel, mgawarkiewicz-intel, michalkuligowski and xuechendi as code owners January 8, 2026 07:07

Copilot AI reviewed Jan 8, 2026

yangulei requested a review from Copilot January 8, 2026 07:30

Copilot AI reviewed Jan 8, 2026

yangulei force-pushed the set_device branch from 2b61ec5 to a0cc7f8 Compare January 8, 2026 07:54

yangulei requested a review from Copilot January 8, 2026 08:13

Copilot AI reviewed Jan 8, 2026

github-actions bot mentioned this pull request Jan 8, 2026

🚦 Team Review Dashboard #701

Open

yangulei force-pushed the set_device branch from f214254 to 0aacb48 Compare January 9, 2026 08:51

yangulei force-pushed the set_device branch from 0aacb48 to f07b832 Compare January 9, 2026 08:53

yangulei force-pushed the set_device branch 2 times, most recently from d28490c to 1761c24 Compare January 13, 2026 08:35

michalkuligowski reviewed Jan 13, 2026

yangulei force-pushed the set_device branch from c71e423 to 69d6a56 Compare January 14, 2026 03:07

yangulei force-pushed the set_device branch from 79d1c3e to 1100df1 Compare January 15, 2026 01:27

yangulei force-pushed the set_device branch 2 times, most recently from 979be75 to 475fc24 Compare January 15, 2026 17:26

yangulei added 3 commits January 16, 2026 01:29

set device to local_rank

7a01685

Signed-off-by: Youlei Yang <[email protected]>

auto set HABANA_VISIBLE_MODULES

9eb6ee5

Signed-off-by: Youlei Yang <[email protected]>

set HABANA_VISIBLE_MODULES for the tests

63d010b

Signed-off-by: Youlei Yang <[email protected]>

yangulei force-pushed the set_device branch from 475fc24 to 63d010b Compare January 15, 2026 17:35
