When I set debug_train_only: true, I get the following error:
[2026-04-27 13:22:40,186] factory.py:184 INFO Initializing 1 Sgl engines (2 GPU(s) each, nnodes=1, replicas=1)
rgpuid: []
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/TorchSpec/torchspec/train_entry.py", line 367, in <module>
    train_async_no_generation(args)
  File "/root/TorchSpec/torchspec/train_entry.py", line 323, in train_async_no_generation
    inference_engines, engine_init_refs = prepare_inference_engines(
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/TorchSpec/torchspec/inference/factory.py", line 85, in prepare_inference_engines
    engines, init_refs = _prepare_sgl_engines(args, inference_pg, mooncake_config, engine_group)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/TorchSpec/torchspec/inference/factory.py", line 203, in _prepare_sgl_engines
    base_gpu_id = int(reordered_gpu_ids[bundle_offset])
                      ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
IndexError: list index out of range
pg is never initialized with reordered GPU ids in this mode, so reordered_gpu_ids[bundle_offset] raises the IndexError above when debug train is enabled.
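For illustration only, here is a minimal sketch of the kind of guard that would turn this crash into a readable error. It is not the TorchSpec code: the helper name pick_base_gpu_id is hypothetical, and I am assuming reordered_gpu_ids is a plain list of ids and bundle_offset an integer index, as the traceback suggests.

```python
def pick_base_gpu_id(reordered_gpu_ids, bundle_offset):
    """Return the base GPU id for an engine (hypothetical helper, not in TorchSpec).

    Raises a descriptive RuntimeError instead of a bare IndexError when the
    list of reordered GPU ids is empty or shorter than expected.
    """
    if not 0 <= bundle_offset < len(reordered_gpu_ids):
        raise RuntimeError(
            f"bundle_offset={bundle_offset} is out of range for "
            f"{len(reordered_gpu_ids)} reordered GPU id(s); "
            "was the placement group initialized with reordered GPU ids?"
        )
    return int(reordered_gpu_ids[bundle_offset])


if __name__ == "__main__":
    # Normal case: ids may arrive as strings, so they are converted to int.
    print(pick_base_gpu_id(["2", "3"], 1))

    # Failure case matching this report: the list is empty.
    try:
        pick_base_gpu_id([], 0)
    except RuntimeError as exc:
        print(f"error: {exc}")
```

A guard like this only improves the error message. The actual fix would still be to populate the reordered GPU ids on the debug_train_only path.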
Replication
Set debug_train_only: true for Qwen3 8B and run ./examples/qwen3-8b-single-node/run.sh