refactor(orchestration): harden command line passing and shell execution #114

Merged
dbotwinick merged 2 commits into spark-arena:develop from jlapenna:fix-docker-bash-escaping-base64
Apr 9, 2026

Conversation

@jlapenna
Contributor

@jlapenna jlapenna commented Apr 5, 2026

This one stemmed from a specific recipe.yaml --> https://spark-arena.com/benchmark/ad49f140-0581-41e6-9ec5-8a7c524451d6

Its hf-override flags were getting misinterpreted as Python substitutions; I started a little cleanup here, and the agents noticed some inconsistencies, and it snowballed from there.

Happy to try to scale this down to be a bit more of a point-fix; lmk!

(Also, I'll fix the merge conflicts shortly)

--the human, Joe.

Hardened Command Line Passing & Shell Execution Security

This PR implements rigorous defense-in-depth measures against shell injection and quoting vulnerabilities across the entire sparkrun codebase. Sparkrun relies heavily on dynamically generated bash scripts piped over SSH or injected into docker exec boundaries.

This refactor makes string interpolation and command building robust by combining Python's shlex.quote, safe base64-encoded pipelines, and strictly formatted printf outputs in bash.

1. Base64 Command Pipeline Hardening

The previous echo <b64> | base64 -d implementation was vulnerable to edge cases if the base64 string somehow started with hyphens or was misinterpreted by varying system echo implementations.

  • Replaced echo with printf: Uses printf '%s' '{b64_cmd}' to safely pipe the literal base64 string.
  • Flag protection: Added the -- delimiter to base64 -d -- to definitively stop option parsing.
  • Clean execution environment: The resulting decoded command is now executed using bash --noprofile --norc to prevent interference from the system or the user's login shell configuration.
  • Centralization: Centralized the pipeline wrapping logic to sparkrun.utils.shell.b64_wrap_bash.
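A minimal sketch of what the centralized wrapper could look like (the module path matches the PR description, but the exact signature is an assumption):

```python
import base64


def b64_wrap_bash(cmd: str) -> str:
    """Wrap a bash command in a base64 pipeline (sketch; signature assumed).

    The transport layers (SSH, docker exec) only ever see [A-Za-z0-9+/=],
    so no intermediate quoting or escaping rule can mangle the payload.
    """
    b64 = base64.b64encode(cmd.encode("utf-8")).decode("ascii")
    # printf '%s' emits the literal string (no echo flag/escape quirks);
    # `--` stops base64 option parsing; --noprofile --norc keeps login
    # shell configuration out of the decoded script's environment.
    return f"printf '%s' '{b64}' | base64 -d -- | bash --noprofile --norc"


wrapped = b64_wrap_bash("vllm serve --hf-overrides '{\"factor\": 4.0}'")
```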

2. Widespread shlex.quote Application

Previously, environment variables and some CLI flags were manually wrapped in single quotes (e.g. export KEY='%s'), which breaks if the value itself contains single quotes.

  • Applied shlex.quote to all dynamically generated docker run and docker exec CLI flags in DockerExecutor, including container names, network settings, IPC modes, memory limits, and environment variables.
  • Updated sparkrun.orchestration.networking.py to use shlex.quote for all exported variables in the cx7 bring-up and arping scripts.
  • Ensured SSH pipeline target strings in ssh.py (ssh ... <target> <remote_cmd>) safely quote the <remote_cmd>.
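As a sketch of the pattern (the helper name is illustrative, not sparkrun's actual API), every dynamic docker flag value is routed through shlex.quote:

```python
import shlex


def docker_env_flags(env: dict[str, str]) -> list[str]:
    # Each KEY=value pair is quoted as a single unit, so embedded quotes
    # and spaces survive the shell that eventually parses the command line.
    parts: list[str] = []
    for key, value in env.items():
        parts.extend(["-e", shlex.quote(f"{key}={value}")])
    return parts


# A value with an embedded single quote breaks the old export KEY='%s'
# wrapping, but round-trips cleanly through shlex.quote:
flags = docker_env_flags({"MOTD": "it's a trap", "PORT": "8000"})
```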

3. Bash Template Hardening (.sh files and Python templates)

Bash scripts generating logs or informational output were using vulnerable double-quoted interpolation (e.g., echo "Launching {container_name}"), which could still evaluate command substitutions if the interpolated Python value contained a subshell (e.g., '$(reboot)').

  • Replaced all vulnerable echo "..." {variable} usages in .sh files (like container_launch.sh, exec_serve_detached.sh, exec_serve_foreground.sh) with safe format strings: printf "Launching %%s\n" "{container_name}".
  • Similarly hardened inline Python bash string templates (like generate_node_script in executor.py) to use printf '... %s ...\n' %(name)s.
  • Hardened the ssh-keyscan processing in networking.py by replacing echo "$keys" >> known_hosts with printf "%s\n" "$keys" >> known_hosts.
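The principle behind the swap can be sketched like this (values are illustrative): the hardened form keeps the dynamic value out of the format string and passes it as a separate, quoted printf argument.

```python
import shlex
import subprocess

container_name = "web-1"

# Old, vulnerable pattern: the value is interpolated straight into a
# double-quoted bash string, where a $(...) payload would be evaluated.
old_line = f'echo "Launching {container_name}"'

# Hardened pattern: a fixed printf format plus a shlex-quoted argument,
# so the value is printed verbatim and never re-parsed by bash.
new_line = f'printf "Launching %s\\n" {shlex.quote(container_name)}'

out = subprocess.run(["sh", "-c", new_line], capture_output=True, text=True)
```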

4. Code Quality & Standards Enforcement

  • Idiomatic Python & Imports: Standardized the codebase by moving all inline imports (e.g., import shlex or import base64 embedded inside functions) to the top of their respective files, strictly adhering to PEP 8 standards. Fixed a variable scoping bug (NameError for target) that was uncovered during this cleanup.
  • Linting: Ran ruff check --fix and ruff format across the src/ and tests/ directories, removing unused variables and fixing type() comparisons (using isinstance() or is).

5. Documentation (DEVELOPERS.md)

Added a new Shell Execution & Security section to DEVELOPERS.md instructing developers on how to properly handle string interpolation for new features using shlex.quote, b64_encode_cmd, and printf.

Testing

  • Updated all string generation tests in test_executor.py, test_scripts.py, and test_shell.py to match the new, hardened formats.
  • Added new tests proving the correct handling of variables with spaces and special quotes.
  • All 2,138 tests pass.

@jlapenna
Contributor Author

jlapenna commented Apr 5, 2026

This is a work in progress, and not ready for review.

@jlapenna jlapenna force-pushed the fix-docker-bash-escaping-base64 branch 3 times, most recently from ec167ca to 8ffbc58 on April 5, 2026 22:00
@jlapenna
Contributor Author

jlapenna commented Apr 5, 2026

This one stemmed from a specific recipe.yaml --> https://spark-arena.com/benchmark/ad49f140-0581-41e6-9ec5-8a7c524451d6

Its hf-override flags were getting misinterpreted as Python substitutions; I started a little cleanup here, and the agents noticed some inconsistencies, and it snowballed from there.

Happy to try to scale this down to be a bit more of a point-fix; lmk!

(Also, I'll fix the merge conflicts shortly)

@dbotwinick
Contributor

> This one stemmed from a specific recipe.yaml --> https://spark-arena.com/benchmark/ad49f140-0581-41e6-9ec5-8a7c524451d6
>
> Its hf-override flags were getting misinterpreted as Python substitutions; I started a little cleanup here, and the agents noticed some inconsistencies, and it snowballed from there.
>
> Happy to try to scale this down to be a bit more of a point-fix; lmk!
>
> (Also, I'll fix the merge conflicts shortly)

Yeah. I've sort of just let that be so far -- the basis of the arg substitution is basically from code I wrote 12+ years ago... but also those should pass through unchanged. I know they run the risk of substitution (and they do technically get picked up) -- but it would be pretty difficult to accomplish -- and then I'm not sure if it would actually be a security threat if the processed CLI args are properly quoted in later stages. So it's something I just ignored because it basically meant it was picked up, no match found, returned as itself -- and since we should use shell quoting for what makes it through to the container, there should be no harm in that -- or at least that's what I was thinking.

@dbotwinick
Contributor

Also, I'm happy if someone reviews the b64 command pattern. That was added to enable the "indirect sudo" scenario -- which is primarily needed for the "cross-user" scenario where the cluster user doesn't match the OS user but we need sudo ops and the OS user is the one with sudo rights. That got messy... and the quick b64 pattern was a way to avoid issues with nested^nested shell quoting...

@jlapenna jlapenna force-pushed the fix-docker-bash-escaping-base64 branch from 8ffbc58 to 133dc70 on April 6, 2026 02:21
@jlapenna
Contributor Author

jlapenna commented Apr 6, 2026

I reproduced the issue on the main branch to confirm exactly what this PR fixes. When running a recipe with complex JSON arguments (like ad49f140-0581-41e6-9ec5-8a7c524451d6), the previous shell escaping logic caused quote stripping.

The Issue

The recipe explicitly supplies this JSON string for the --hf-overrides argument:
'{"rope_scaling": {"rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 262144}}'

Because docker exec executes this via bash -c '...', the old escaping logic (replacing ' with '\'' and wrapping in ') caused the shell to strip the inner double quotes before the string actually reached the vllm process. As a result, vllm received invalid JSON:
{rope_scaling: {rope_type: yarn, factor: 4.0, original_max_position_embeddings: 262144}}

This caused vllm serve to crash when attempting to call json.loads().
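The stripping can be reproduced generically: quoting protects an argument through exactly one shell parse, and any additional unguarded parse consumes the inner double quotes. This is a simplified stand-in (using eval to simulate the extra SSH / bash -c layer), not sparkrun's actual pipeline:

```python
import shlex
import subprocess

json_arg = '{"rope_type": "yarn", "factor": 4.0}'
cmd = "printf '%s' " + shlex.quote(json_arg)

# One shell parse: the JSON survives intact.
one = subprocess.run(["sh", "-c", cmd], capture_output=True, text=True).stdout
assert one == json_arg

# A second, unguarded parse (simulated with eval): the double quotes are
# consumed, leaving invalid JSON shaped like {rope_type:yarn, ...}.
two = subprocess.run(["sh", "-c", "eval " + cmd],
                     capture_output=True, text=True).stdout
assert '"' not in two
```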

The Fix

This PR bypasses the initial shell parser entirely by converting the entire vllm serve command string into Base64 (echo <base64_string> | base64 -d | bash). The Docker daemon and SSH simply transport a solid block of Base64 characters, avoiding all quote stripping, escaping issues, or syntax collisions. Once inside the container, it's safely decoded and executed verbatim as a script file.
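Sketched end to end (the serve command is abbreviated and the harness is illustrative): the JSON argument is quoted once in Python, the whole command is base64-encoded for transport, and the decode is byte-identical.

```python
import base64
import shlex
import subprocess

serve_cmd = "vllm serve --hf-overrides " + shlex.quote(
    '{"rope_scaling": {"rope_type": "yarn", "factor": 4.0}}'
)

# Encode once; every transport layer (SSH, docker exec, bash -c) sees
# only [A-Za-z0-9+/=], which no quoting rule can mangle.
b64 = base64.b64encode(serve_cmd.encode()).decode()
wrapped = f"printf '%s' '{b64}' | base64 -d | bash"

# Round-trip through a real shell: the decoded command survives verbatim.
decoded = subprocess.run(
    ["sh", "-c", f"printf '%s' '{b64}' | base64 -d"],
    capture_output=True, text=True,
).stdout
assert decoded == serve_cmd
```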

Full Command Output (from main before this PR)

Click to expand
sparkrun v0.2.20

Runtime:   vllm-distributed
Image:     vllm-node-tf5
Model:     Intel/Qwen3-Coder-Next-int4-AutoRound
Mode:      solo

VRAM Estimation:
  Model dtype:      int4
  Model params:     11,823,991,872
  KV cache dtype:   bfloat16
  Architecture:     48 layers, 2 KV heads, 256 head_dim
  Model weights:    5.51 GB
  KV cache:         96.00 GB (max_model_len=1,048,576)
  Tensor parallel:  1
  Per-GPU total:    101.51 GB
  DGX Spark fit:    YES

  GPU Memory Budget:
    gpu_memory_utilization: 70%
    Usable GPU memory:     84.7 GB (121 GB x 70%)
    Available for KV:      79.2 GB
    Max context tokens:    865,009
    Context multiplier:    0.8x (vs max_model_len=1,048,576)
    WARNING: max_model_len exceeds available KV budget (82.5% fits)

Hosts:     --hosts
  Target:  localhost

[1/6] Preparing
  done (0.0s)
[2/6] Building image
  done (0.4s)
[3/6] Distributing resources
Fetching 24 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:00<00:00, 188508.04it/s]
Download complete: 0.00B [00:00, ?B/s]
  done (0.1s)
[4/6] Syncing tuning configs
  done (0.0s)
[5/6] Launching vllm runtime
  Step 1/3: Detecting InfiniBand
  Step 2/3: Launching container
  Step 3/3: Executing serve command
  done (4.3s)
Cluster:   sparkrun_3f3f7cd8d285

Serve command:
  vllm serve Intel/Qwen3-Coder-Next-int4-AutoRound \
    --max-model-len 1048576 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --gpu-memory-utilization 0.7 \
    --host 0.0.0.0 \
    --port 8000 \
    --load-format fastsafetensors \
    --language-model-only \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-batched-tokens 49152 \
    --max-num-seqs 384 \
    --kv-cache-dtype fp8 \
    --optimization-level 3 \
    --performance-mode throughput \
    --mamba-cache-mode align \
    --hf-overrides '{"rope_scaling": {"rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 262144}}'

Runtime versions:
  build_base_image: nvidia/cuda:13.2.0-devel-ubuntu24.04
  build_build_args_build_jobs: 16
  build_build_args_exp_mxfp4: False
  build_build_args_transformers_5: True
  build_build_args_vllm_prs: 
  build_build_args_vllm_ref: main
  build_build_date: 2026-03-26 05:23:08+00:00
  build_build_script_commit: 3dcd2a90c1dc668d6f9445eaf6c04ecd96489791
  build_flashinfer_commit: ede7a275
  build_gpu_arch: 12.1a
  build_vllm_commit: cd7643015
  build_vllm_version: 0.18.1rc1.dev121+gcd7643015.d20260325.cu132
  container_maintainer: NVIDIA CORPORATION <[email protected]>
  container_org_opencontainers_image_ref_name: ubuntu
  container_org_opencontainers_image_version: 24.04
  cuda:      13.2
  nccl:      (2, 29, 7)
  python:    3.12.3
  torch:     2.12.0.dev20260325+cu130
  vllm:      0.18.1rc1.dev121+gcd7643015.d20260325

[6/6] Post-launch hooks — skipped
/usr/local/lib/python3.12/dist-packages/torch/compiler/__init__.py:148: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
  return torch._dynamo.allow_in_graph(fn)
usage: vllm serve [model_tag] [options]
vllm serve: error: argument --hf-overrides: Value {rope_scaling: {rope_type: yarn, factor: 4.0, original_max_position_embeddings: 262144}} cannot be converted to <function loads at 0xe7d9ee978fe0>.

@dbotwinick
Contributor

Oh I see. Not what I was thinking when I read your initial comments (obviously), but that makes sense. Perhaps I haven't tested one of those (JSON embedded in arg) since more recent security hardening efforts on shell quoting. (And I guess I had ended up adding b64 shell for indirect sudo for similar reason, but totally unrelated other than the b64 shell part.) Gotcha.

@jlapenna jlapenna marked this pull request as ready for review April 7, 2026 01:28
@jlapenna
Contributor Author

jlapenna commented Apr 7, 2026

Kind of wondering what you'd want to do here. Should I start with something pointed just for the quoting issue that started this yak shave?

@jlapenna jlapenna force-pushed the fix-docker-bash-escaping-base64 branch from 133dc70 to 67be868 on April 7, 2026 04:22
@dbotwinick
Contributor

Yeah I guess that makes sense. I'll take a look and I guess we'll go from there...

@jlapenna jlapenna force-pushed the fix-docker-bash-escaping-base64 branch 3 times, most recently from 64be3a0 to 7ae8daa on April 7, 2026 05:26
@jlapenna
Contributor Author

jlapenna commented Apr 7, 2026 via email

@jlapenna jlapenna force-pushed the fix-docker-bash-escaping-base64 branch from 7ae8daa to 7ef0a80 on April 7, 2026 16:46
@dbotwinick
Contributor

Working on review. Not finished. I'm debating how I feel, but realistically, having a utility lib to help with shell quoting issues and then relying on that to avoid problems seems good. Part of me doesn't love the base64 due to its perceived nature of obfuscating activities, but on the other hand, it is an effective and relatively well-understood style of solution. (I mentioned how I also used it somewhere else... and felt the same conflict... but went with it anyway). I might want to even increase the degree of "standardization" to rely on util lib for shell quoting so that we don't "manually" shlex.quote anywhere -- we used the sparkrun canonical quoting functions -- which might mostly just be shlex.quote -- but then we could expand on rulesets/details later and reduce the blast radius (counter point: we have one place where we could break everything at once + not helpful to just add indirection if it ends up being just shlex.quote).

Anyway... I'm going to slowly work my way through it. At least as slowly as we all do things these days in post-LLM world. I'm of the mind that we try to go for it. So it's mostly about finding the right path to do it quickly and without breaking everything.
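The indirection being weighed above could be as small as this (hypothetical names; today it would just delegate to shlex.quote):

```python
import shlex


def quote(value: str) -> str:
    """sparkrun-canonical shell quoting choke point (hypothetical sketch).

    Currently a thin wrapper over shlex.quote; extra rulesets (e.g.
    rejecting control characters) could be added here later without
    touching every call site.
    """
    return shlex.quote(value)


# Safe strings pass through untouched; unsafe ones round-trip intact.
assert quote("plain-value") == "plain-value"
assert shlex.split(quote("it's a trap")) == ["it's a trap"]
```

The counterpoint from the comment stands: until the ruleset diverges from shlex.quote, this is pure indirection.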

Contributor

@dbotwinick dbotwinick left a comment


Left a few comments -- e.g. didn't hit every location of shlex usage -- but interested in your take / opinion from what I commented on. Then let's get stuff integrated.

@click.argument("recipe_name", type=RECIPE_NAME)
@host_options
@recipe_override_options
@click.option("--name", "cluster_id_override", default=None, help="Override deterministic cluster ID (static container name)")
Contributor


Is there ever a circumstance when we want user to be able to override the cluster ID? It's intended to be an automatic deterministic ID based on what is being run and where. The only real risk of problems there is collisions or changes in calculations inputs across version changes, but both of those are low impact and low likelihood.

dashboard: bool = False,
init_port: int | None = None,
topology: str | None = None,
cluster_id_override: str | None = None,
Contributor


Not sure if we need it (see comment on _run.py)... but... I think it's probably OK to retain as an option as part of the internal API because there may be circumstances in the future where we need to consider it (and it has small effect on usage+code+maintenance if retained here).

executor_keys = {
"auto_remove",
"restart_policy",
"privileged",
Contributor


do we still need this if we transition to having executor args available as their own overrides (ref #119 )

parts.extend(["-e", f"{key}={value}"])
escaped_cmd = command.replace("'", "'\\''")
parts.extend([shlex.quote(container_name), "bash", "-c", "'%s'" % escaped_cmd])
parts.extend(["-e", shlex.quote(f"{key}={value}")])
Contributor


this is where it gets messy -- would it be better to have a sparkrun quoting function that we keep in utils/shell.py? and ref here instead of shlex.quote? On one hand, I think it could give us future flexibility, on the other, it's adding risks + unnecessary indirection.

@jlapenna jlapenna closed this Apr 9, 2026
@jlapenna jlapenna force-pushed the fix-docker-bash-escaping-base64 branch from 7ef0a80 to 4dfaae0 on April 9, 2026 00:46
@jlapenna jlapenna reopened this Apr 9, 2026
… docker exec

This commit consolidates several improvements and fixes to shell command
execution within orchestration:

- Use base64 encoding for `docker exec` commands to prevent argument
  parsing and quote stripping issues.
- Centralize base64 shell wrapping into utility functions.
- Apply `shlex.quote` to docker run/exec flags and shell wrapping.
- Use `printf` instead of `echo` for known_hosts population.
- Add comprehensive tests for shell utilities and orchestration.
- Update documentation on shell execution and security guidelines.
@jlapenna jlapenna force-pushed the fix-docker-bash-escaping-base64 branch from f117ce1 to a7d886d on April 9, 2026 02:49
@dbotwinick dbotwinick merged commit ed6cb19 into spark-arena:develop Apr 9, 2026
@dbotwinick
Contributor

FYI. I merged this in (obviously). I'm chewing through checks and consistency on it and will be updating the develop branch a bit later with the latest. Trying to get to a good coherent whole ready for marking next release.
