refactor(orchestration): harden command line passing and shell execution#114
Conversation
|
This is a work in progress, and not ready for review. |
ec167ca to
8ffbc58
Compare
|
This one stemmed from a specific recipe.yaml --> https://spark-arena.com/benchmark/ad49f140-0581-41e6-9ec5-8a7c524451d6 Its hf-override flags were getting misinterprted as python substitutions; i started a little cleanup here, and the agents noticed some inconsistencies, and it snowballed from there. Happy to try to scale this down to be a bit more of a point-fix; lmk! (Also, I'll fix the merge conflcits shortly) |
Yeah. I've sort of just let that be so far -- the basis of the arg substitution is basically from code I wrote 12+ years ago... but also those should pass through unchanged. I know they run the risk of substitution (and they do technically get picked up) -- but it would be pretty difficult to accomplish -- and then I'm not sure if it would actually be a security threat if the processed CLI args are properly quoted in later stages. So it's something I just ignored because it basically meant it was picked up, no match found, returned as itself -- and since we should use shell quoting for what makes it through to the container, there should be no harm in that -- or at least that's what I was thinking. |
|
Also I'm happy if someone is review the b64 command pattern. That was added to enable "indirect sudo" scenario -- which is primarily needed for "cross-user" scenario where the cluster user doesn't match the OS user but we need sudo ops and the OS user is the one with sudo rights. That got messy.... and the quick b64 pattern was a way to avoid issues with nested^nested shell quoting... |
8ffbc58 to
133dc70
Compare
|
I reproduced the issue on the The IssueThe recipe explicitly supplies this JSON string for the Because This caused The FixThis PR bypasses the initial shell parser entirely by converting the entire Full Command Output (from
|
|
Oh I see. Not what I was thinking when I read your initial comments (obviously), but that makes sense. Perhaps I haven't tested one of those (JSON embedded in arg) since more recent security hardening efforts on shell quoting. (And I guess I had ended up adding b64 shell for indirect sudo for similar reason, but totally unrelated other than the b64 shell part.) Gotcha. |
|
Kind of wondering what you'd want to do here. Shoudl I start with something pointed just for the quoting issue that started this yak shave? |
133dc70 to
67be868
Compare
|
Yeah I guess that makes sense. I'll take a look and I guess we'll go from there... |
64be3a0 to
7ae8daa
Compare
|
Dont feel obligated if this doesn't align with what you're looking for. We
can patch this specific issue only, if you want.
…On Mon, Apr 6, 2026, 9:58 PM Drew Botwinick ***@***.***> wrote:
*dbotwinick* left a comment (spark-arena/sparkrun#114)
<#114 (comment)>
Yeah I guess that makes sense. I'll take a look and I guess we'll go from
there...
—
Reply to this email directly, view it on GitHub
<#114 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAALGVSMXIFJ6FQDKLNIB734USDGDAVCNFSM6AAAAACXNPNK32VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHM2DCOJWGU3DKMJXHE>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
7ae8daa to
7ef0a80
Compare
|
Working on review. Not finished. I'm debating how I feel, but realistically, having a utility lib to help with shell quoting issues and then relying on that to avoid problems seems good. Part of me doesn't love the base64 due to its perceived nature of obfuscating activities, but on the other hand, it is an effective and relatively well-understood style of solution. (I mentioned how I also used it somewhere else... and felt the same conflict... but went with it anyway). I might want to even increase the degree of "standardization" to rely on util lib for shell quoting so that we don't "manually" shlex.quote anywhere -- we used the sparkrun canonical quoting functions -- which might mostly just be shlex.quote -- but then we could expand on rulesets/details later and reduce the blast radius (counter point: we have one place where we could break everything at once + not helpful to just add indirection if it ends up being just shlex.quote). Anyway... I'm going to slowly work my way through it. At least as slowly as we all do things these days in post-LLM world. I'm of the mind that we try to go for it. So it's mostly about finding the right path to do it quickly and without breaking everything. |
dbotwinick
left a comment
There was a problem hiding this comment.
Left a few comments -- e.g. didn't hit every location of shlex usage -- but interested in your take / opinion from what I commented on. Then let's get stuff integrated.
| @click.argument("recipe_name", type=RECIPE_NAME) | ||
| @host_options | ||
| @recipe_override_options | ||
| @click.option("--name", "cluster_id_override", default=None, help="Override deterministic cluster ID (static container name)") |
There was a problem hiding this comment.
Is there ever a circumstance when we want user to be able to override the cluster ID? It's intended to be an automatic deterministic ID based on what is being run and where. The only real risk of problems there is collisions or changes in calculations inputs across version changes, but both of those are low impact and low likelihood.
| dashboard: bool = False, | ||
| init_port: int | None = None, | ||
| topology: str | None = None, | ||
| cluster_id_override: str | None = None, |
There was a problem hiding this comment.
Not sure if we need it (see comment on _run.py)... but... I think it's probably OK to retain as an option as part of the internal API because there may be circumstances in the future where we need to consider it (and it has small effect on usage+code+maintenance if retained here).
| executor_keys = { | ||
| "auto_remove", | ||
| "restart_policy", | ||
| "privileged", |
There was a problem hiding this comment.
do we still need this if we transition to having executor args available as their own overrides (ref #119 )
| parts.extend(["-e", f"{key}={value}"]) | ||
| escaped_cmd = command.replace("'", "'\\''") | ||
| parts.extend([shlex.quote(container_name), "bash", "-c", "'%s'" % escaped_cmd]) | ||
| parts.extend(["-e", shlex.quote(f"{key}={value}")]) |
There was a problem hiding this comment.
this is where it gets messy -- would it be better to have a sparkrun quoting function that we keep in utils/shell.py? and ref here instead of shlex.quote? On one hand, I think it could give us future flexibility, on the other, it's adding risks + unnecessary indirection.
7ef0a80 to
4dfaae0
Compare
… docker exec This commit consolidates several improvements and fixes to shell command execution within orchestration: - Use base64 encoding for `docker exec` commands to prevent argument parsing and quote stripping issues. - Centralize base64 shell wrapping into utility functions. - Apply `shlex.quote` to docker run/exec flags and shell wrapping. - Use `printf` instead of `echo` for known_hosts population. - Add comprehensive tests for shell utilities and orchestration. - Update documentation on shell execution and security guidelines.
f117ce1 to
a7d886d
Compare
|
FYI. I merged this in (obviously). I'm chewing through checks and consistency on it and will be updating the develop branch a bit later with the latest. Trying to get to a good coherent whole ready for marking next release. |
This one stemmed from a specific recipe.yaml --> https://spark-arena.com/benchmark/ad49f140-0581-41e6-9ec5-8a7c524451d6
Its hf-override flags were getting misinterprted as python substitutions; i started a little cleanup here, and the agents noticed some inconsistencies, and it snowballed from there.
Happy to try to scale this down to be a bit more of a point-fix; lmk!
(Also, I'll fix the merge conflcits shortly)
--the human, Joe.
Hardened Command Line Passing & Shell Execution Security
This PR implements rigorous defense-in-depth measures against shell injection and quoting vulnerabilities across the entire
sparkruncodebase. Sparkrun relies heavily on dynamically generated bash scripts piped over SSH or injected intodocker execboundaries.This refactor makes string interpolation and command building robust by combining Python's
shlex.quote, safe base64-encoded pipelines, and strictly formattedprintfoutputs in bash.1. Base64 Command Pipeline Hardening
The previous
echo <b64> | base64 -dimplementation was vulnerable to edge cases if the base64 string somehow started with hyphens or was misinterpreted by varying systemechoimplementations.echowithprintf: Usesprintf '%s' '{b64_cmd}'to safely pipe the literal base64 string.--delimiter tobase64 -d --to definitively stop option parsing.bash --noprofile --norcto prevent interference from the system or the user's login shell configuration.sparkrun.utils.shell.b64_wrap_bash.2. Widespread
shlex.quoteApplicationPreviously, environment variables and some CLI flags were manually wrapped in single quotes (e.g.
export KEY='%s'), which breaks if the value itself contains single quotes.shlex.quoteto all dynamically generateddocker runanddocker execCLI flags inDockerExecutor, including container names, network settings, IPC modes, memory limits, and environment variables.sparkrun.orchestration.networking.pyto useshlex.quotefor all exported variables in thecx7bring-up and arping scripts.ssh.py(ssh ... <target> <remote_cmd>) safely quote the<remote_cmd>.3. Bash Template Hardening (
.shfiles and Python templates)Bash scripts generating logs or informational output were using vulnerable double-quoted interpolation (e.g.,
echo "Launching {container_name}"), which could still evaluate command substitutions if the Python variable outputted a single-quoted string containing a subshell (e.g.,'$(reboot)').echo "..." {variable}usages in.shfiles (likecontainer_launch.sh,exec_serve_detached.sh,exec_serve_foreground.sh) with safe format strings:printf "Launching %%s\n" "{container_name}".generate_node_scriptinexecutor.py) to useprintf '... %s ...\n' %(name)s.ssh-keyscanprocessing innetworking.pyby replacingecho "$keys" >> known_hostswithprintf "%s\n" "$keys" >> known_hosts.4. Code Quality & Standards Enforcement
import shlexorimport base64embedded inside functions) to the top of their respective files, strictly adhering to PEP 8 standards. Fixed a variable scoping bug (NameErrorfortarget) that was uncovered during this cleanup.ruff check --fixandruff formatacross thesrc/andtests/directories, removing unused variables and fixingtype()comparisons (usingisinstance()oris).5. Documentation (
DEVELOPERS.md)Added a new
Shell Execution & Securitysection toDEVELOPERS.mdinstructing developers on how to properly handle string interpolation for new features usingshlex.quote,b64_encode_cmd, andprintf.Testing
test_executor.py,test_scripts.py, andtest_shell.pyto match the new, hardened formats.