Running dflash with the SGLang Docker images fails #99

@lixinqi7

The image built from /TorchSpec/docker/sglang/v0.5.10.post1/Dockerfile fails at startup. I checked the container and the installed mooncake-transfer-engine version is 0.3.10.post2, which should satisfy the >= 0.3.10.post1 requirement, but the version check still raises this error:
(SglEngine pid=12651) [2026-05-14 07:19:44] Disable piecewise CUDA graph because the capture size is not set
(SglEngine pid=12651) [2026-05-14 07:19:44] Scheduler hit an exception: Traceback (most recent call last):
(SglEngine pid=12651) File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3653, in run_scheduler_process
(SglEngine pid=12651) scheduler = Scheduler(
(SglEngine pid=12651) ^^^^^^^^^^
(SglEngine pid=12651) File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 459, in __init__
(SglEngine pid=12651) raise self._mooncake_init_error
(SglEngine pid=12651) File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 395, in _init_mooncake
(SglEngine pid=12651) self.init_eagle_mooncake_store(device=mooncake_device)
(SglEngine pid=12651) File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 898, in init_eagle_mooncake_store
(SglEngine pid=12651) store.setup(device=device or self.device)
(SglEngine pid=12651) File "/root/torchspec/torchspec/transfer/mooncake/store.py", line 98, in setup
(SglEngine pid=12651) self._verify_force_delete()
(SglEngine pid=12651) File "/root/torchspec/torchspec/transfer/mooncake/store.py", line 252, in _verify_force_delete
(SglEngine pid=12651) raise RuntimeError(
(SglEngine pid=12651) RuntimeError: Mooncake version too old: batch_remove() not found. Requires mooncake-transfer-engine >= 0.3.10.post1.
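
Since the reported package version already meets the stated minimum, one thing worth checking is whether the module that Python actually imports inside the container exposes batch_remove(). The following is a minimal diagnostic sketch, not part of TorchSpec. The module name `mooncake.store` and the class name `MooncakeDistributedStore` are assumptions and should be replaced with whatever torchspec/transfer/mooncake/store.py really imports.

```python
import importlib
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def class_has_method(module_name, class_name, method_name):
    """Import module_name and report whether class_name has a callable method_name.

    Returns (found, module_file) so the caller can see which copy was imported.
    """
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name, None)
    found = cls is not None and callable(getattr(cls, method_name, None))
    return found, getattr(module, "__file__", None)


if __name__ == "__main__":
    print("version:", installed_version("mooncake-transfer-engine"))
    try:
        # Assumed module and class names; adjust to match store.py.
        print(class_has_method("mooncake.store", "MooncakeDistributedStore", "batch_remove"))
    except ImportError as exc:
        print("import failed:", exc)
```

If the version prints as 0.3.10.post2 but the method is reported missing, the printed module path may point at a second, older copy of the package on sys.path, or the wheel may simply not ship that method.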

The image built from /TorchSpec/docker/sglang/v0.5.8.post1/Dockerfile does not work either. It fails with a different error:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/root/torchspec/torchspec/train_entry.py", line 367, in <module>
train_async_no_generation(args)
File "/root/torchspec/torchspec/train_entry.py", line 333, in train_async_no_generation
all_results = timer.wait("Actor initialization", train_init_refs + engine_init_refs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/torchspec/torchspec/train_entry.py", line 87, in wait
result = ray.get(refs)
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 107, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2980, in get
values, debugger_breakpoint = worker.get_objects(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1023, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::SglEngine.init() (pid=12497, ip=10.90.91.146, actor_id=a40d25a9423bd0735314071b01000000, repr=<torchspec.inference.engine.sgl_engine.SglEngine object at 0x7f22ae0b13a0>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/torchspec/torchspec/inference/engine/sgl_engine.py", line 292, in init
self._engine = sgl.Engine(**engine_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/utils.py", line 325, in __call__
return module(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 152, in __init__
server_args = self.server_args_class(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: ServerArgs.__init__() got an unexpected keyword argument 'spec_training_store_last_hidden_states'
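
This TypeError suggests that the SGLang build inside the v0.5.8.post1 image does not define the spec_training_store_last_hidden_states field that TorchSpec passes through engine_kwargs. A small helper like the one below can list which keyword arguments a class constructor would reject before the engine is created. It is only a sketch: `unsupported_kwargs` and `FakeServerArgs` are hypothetical names for illustration, and the real check would be run against the ServerArgs class from the SGLang installed in the container.

```python
import inspect
from dataclasses import dataclass


def unsupported_kwargs(cls, kwargs):
    """Return the sorted keys in kwargs that cls's constructor would reject."""
    params = inspect.signature(cls).parameters
    # A **kwargs parameter means any keyword is accepted.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    rejected = []
    for key in kwargs:
        param = params.get(key)
        # Unknown names and positional-only parameters cannot be passed by keyword.
        if param is None or param.kind is inspect.Parameter.POSITIONAL_ONLY:
            rejected.append(key)
    return sorted(rejected)


@dataclass
class FakeServerArgs:
    """Hypothetical stand-in for a ServerArgs class that lacks the new field."""
    model_path: str
    tp_size: int = 1


engine_kwargs = {
    "model_path": "some/model",
    "spec_training_store_last_hidden_states": True,
}
print(unsupported_kwargs(FakeServerArgs, engine_kwargs))
# prints ['spec_training_store_last_hidden_states']
```

Running the same helper inside the container against the installed ServerArgs class would show whether the field is missing there, which would mean the image's SGLang does not match what this TorchSpec revision expects.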
