List index out of range when debug_train_only is enabled #84

@Dogacel

Description

When I set debug_train_only: true, I get the following error:

[2026-04-27 13:22:40,186] factory.py:184 INFO Initializing 1 Sgl engines (2 GPU(s) each, nnodes=1, replicas=1)
rgpuid: []
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/TorchSpec/torchspec/train_entry.py", line 367, in <module>
    train_async_no_generation(args)
  File "/root/TorchSpec/torchspec/train_entry.py", line 323, in train_async_no_generation
    inference_engines, engine_init_refs = prepare_inference_engines(
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/TorchSpec/torchspec/inference/factory.py", line 85, in prepare_inference_engines
    engines, init_refs = _prepare_sgl_engines(args, inference_pg, mooncake_config, engine_group)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/TorchSpec/torchspec/inference/factory.py", line 203, in _prepare_sgl_engines
    base_gpu_id = int(reordered_gpu_ids[bundle_offset])
                      ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
IndexError: list index out of range

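The pg is never actually initialized with reordered GPU ids, so reordered_gpu_ids appears to be empty (the log prints rgpuid: []) and indexing it with bundle_offset raises the IndexError when debug_train_only is enabled.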
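
To make the failure mode concrete, here is a minimal sketch of a guard around the failing lookup. It is not TorchSpec code: pick_base_gpu_id is a hypothetical helper, and only the names reordered_gpu_ids and bundle_offset come from the traceback above. It does not fix the missing initialization; it only turns the bare IndexError into a message that points at the cause.

```python
from typing import Sequence, Union


def pick_base_gpu_id(
    reordered_gpu_ids: Sequence[Union[int, str]],
    bundle_offset: int,
) -> int:
    """Hypothetical helper: look up the base GPU id with explicit checks.

    Mirrors `int(reordered_gpu_ids[bundle_offset])` from the traceback,
    but fails with a descriptive error instead of a bare IndexError.
    """
    # An empty list is the case reported in this issue (rgpuid: []).
    if not reordered_gpu_ids:
        raise RuntimeError(
            "reordered_gpu_ids is empty; the placement group was probably "
            "not initialized with GPU ids (debug_train_only path?)"
        )
    # Reject offsets that fall outside the list, including negative ones.
    if not 0 <= bundle_offset < len(reordered_gpu_ids):
        raise RuntimeError(
            f"bundle_offset={bundle_offset} is out of range for "
            f"{len(reordered_gpu_ids)} GPU id(s)"
        )
    return int(reordered_gpu_ids[bundle_offset])
```

With a check like this, the run would stop with a message naming the empty list rather than failing at the index expression.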


Replication

Set debug_train_only: true for Qwen3 8B and run ./examples/qwen3-8b-single-node/run.sh
