Restore EMR Spark plugin startup scripts#625

Merged
nvliyuan merged 7 commits into
NVIDIA:mainfrom
nvliyuan:restore-emr-spark-plugin-startup
May 15, 2026

@nvliyuan

Collaborator

Summary

Test plan

  • Parsed emr-spark-plugin-startup.py with Python AST.
  • Loaded config-emr6.json and config-emr7.json with Python JSON parser.
  • Ran bash -n on both EMR cgroup bootstrap scripts.
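The three checks in the test plan could be reproduced roughly as follows. This is only a sketch: the helper names (`python_parses`, `json_parses`, `bash_syntax_ok`) are hypothetical and not part of the PR, and the checks are syntax-only, so they say nothing about runtime behaviour against AWS.

```python
import ast
import json
import shutil
import subprocess


def python_parses(source: str) -> bool:
    """Return True if the text is syntactically valid Python (AST parse only)."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def json_parses(text: str) -> bool:
    """Return True if the text loads with the standard JSON parser."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def bash_syntax_ok(path: str) -> bool:
    """Run `bash -n` (parse without executing) on a shell script."""
    bash = shutil.which("bash")
    if bash is None:
        raise RuntimeError("bash not found on PATH")
    return subprocess.run([bash, "-n", path], capture_output=True).returncode == 0
```

Note that a `${...}` placeholder inside a JSON string value is still valid JSON, so the template files can be loaded before substitution.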

nvliyuan added 2 commits May 15, 2026 10:41
Bring back the EMR startup helper and companion configuration files that remained on legacy-main after the branch split.

Signed-off-by: liyuan <yuali@nvidia.com>
Add the standard Apache license headers required by CI for the restored EMR Python and shell scripts.

Signed-off-by: liyuan <yuali@nvidia.com>
@greptile-apps

greptile-apps Bot commented May 15, 2026

Contributor

Greptile Summary

This PR restores the EMR Spark RAPIDS startup tooling — a Python cluster-creation script, two cgroup bootstrap shell scripts, and two EMR 6/7 JSON configuration files — that existed on the legacy-main branch but was absent from main after the branch split.

  • emr-spark-plugin-startup.py: orchestrates EMR cluster creation — selects the right config/bootstrap pair based on the EMR release label, substitutes ${executor_cores} and ${task_gpu_amount} placeholders, uploads the bootstrap script to S3, writes the finalised config to a NamedTemporaryFile, and calls aws emr create-cluster with all required EC2 attributes (KeyName, SubnetId, AvailabilityZone, InstanceProfile).
  • config-emr6.json / config-emr7.json: provide YARN GPU and Spark RAPIDS defaults; both use ${...} placeholder syntax that matches the Python replace() calls.
  • cgroup-bootstrap-action-emr6.sh / cgroup-bootstrap-action-emr7.sh: set up the cgroup permissions required by YARN's Linux Container Executor for GPU isolation on each EMR generation.
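The placeholder substitution and temp-file step described above might look like the sketch below. The template contents and the two property keys are assumptions for illustration (the actual config-emr6.json / config-emr7.json bodies are not shown on this page), and `render_config` / `write_temp_config` are hypothetical names.

```python
import json
import tempfile

# Hypothetical template: the real config files are not reproduced here.
TEMPLATE = """[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.cores": "${executor_cores}",
      "spark.task.resource.gpu.amount": "${task_gpu_amount}"
    }
  }
]"""


def render_config(template: str, executor_cores: int, task_gpu_amount: float) -> str:
    """Fill both placeholders with plain str.replace() calls."""
    rendered = template.replace("${executor_cores}", str(executor_cores))
    rendered = rendered.replace("${task_gpu_amount}", str(task_gpu_amount))
    json.loads(rendered)  # fail early if substitution broke the JSON
    return rendered


def write_temp_config(rendered: str) -> str:
    """Write the finalised config to a temp file so the source template is untouched."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
        tmp.write(rendered)
        return tmp.name


rendered = render_config(TEMPLATE, executor_cores=8, task_gpu_amount=0.125)
config_path = write_temp_config(rendered)
```

Writing to a temporary file rather than back to the template means repeated runs always start from the same unsubstituted placeholders.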

Confidence Score: 5/5

Safe to merge — all files are new additions with no changes to existing code paths.

Every issue raised in prior review rounds has been addressed: placeholder syntax is consistent between the JSON configs and Python replacements, the unsupported-instance guard is present, paths are resolved relative to __file__, the config is written to a temp file rather than overwriting the source, the S3 upload return value is checked before proceeding, and all required EC2 attributes are included in the cluster command. No new defects were found.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| scripts/csp-startup-scripts/emr/emr-spark-plugin-startup.py | Main orchestration script: resolves config/bootstrap paths via _SCRIPT_DIR, guards unsupported instance types, writes updated config to a temp file (preserving originals), checks the S3 upload return value, and correctly populates --ec2-attributes with KeyName, SubnetId, AvailabilityZone, and InstanceProfile. |
| scripts/csp-startup-scripts/emr/config-emr6.json | EMR 6 cluster configuration with ${executor_cores} and ${task_gpu_amount} placeholders consistent with the Python script's replace() calls; spark.task.cpus key has no trailing space. |
| scripts/csp-startup-scripts/emr/config-emr7.json | EMR 7 cluster configuration using cgroupv1 mount path /spark-rapids-cgroup; placeholders use ${executor_cores} and ${task_gpu_amount} matching the Python replacements. |
| scripts/csp-startup-scripts/emr/cgroup-bootstrap-action-emr6.sh | Grants broad cgroup permissions on /sys/fs/cgroup/cpu,cpuacct and /sys/fs/cgroup/devices required for YARN GPU scheduling on EMR 6; uses set -ex for fail-fast behaviour. |
| scripts/csp-startup-scripts/emr/cgroup-bootstrap-action-emr7.sh | Creates and mounts a dedicated cgroupv1 devices hierarchy at /spark-rapids-cgroup for EMR 7; uses set -ex. |
| scripts/csp-startup-scripts/README.md | New top-level README pointing readers to the NVIDIA docs for EMR-specific Spark RAPIDS setup; no functional content. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User invokes emr-spark-plugin-startup.py] --> B{Parse CLI args}
    B --> C{Release label contains emr-7?}
    C -- Yes --> D[config-emr7.json + cgroup-bootstrap-emr7.sh]
    C -- No --> E[config-emr6.json + cgroup-bootstrap-emr6.sh]
    D --> F{worker_instance in g4dn_instance_map?}
    E --> F
    F -- No --> G[Print error and return]
    F -- Yes --> H[Compute exec_cores and task_gpu_amount]
    H --> I[Read config JSON, replace placeholders]
    I --> J[Upload bootstrap script to S3]
    J -- Failed --> K[Return early]
    J -- OK --> L[Write config to NamedTemporaryFile]
    L --> M[aws emr create-cluster with ec2-attributes and bootstrap-actions]
    M -- Success --> N[Print cluster ID]
    M -- Error --> O[Print stderr]
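
The first branches of the flowchart (release-label check, instance guard, derived values) can be sketched as below. The map contents and the `1 / exec_cores` formula are assumptions made for illustration: the real g4dn_instance_map and the way the script computes exec_cores and task_gpu_amount are not shown on this page. `select_files` and `plan_cluster` are hypothetical names.

```python
# Hypothetical map from worker instance type to executor cores.
G4DN_INSTANCE_MAP = {"g4dn.2xlarge": 8, "g4dn.4xlarge": 16}


def select_files(release_label: str):
    """Pick the config/bootstrap pair based on the EMR release label."""
    if "emr-7" in release_label:
        return "config-emr7.json", "cgroup-bootstrap-action-emr7.sh"
    return "config-emr6.json", "cgroup-bootstrap-action-emr6.sh"


def plan_cluster(release_label: str, worker_instance: str):
    """Return the values needed for cluster creation, or None if unsupported."""
    config_name, bootstrap_name = select_files(release_label)
    if worker_instance not in G4DN_INSTANCE_MAP:
        print(f"Unsupported worker instance type: {worker_instance}")
        return None
    exec_cores = G4DN_INSTANCE_MAP[worker_instance]
    return {
        "config": config_name,
        "bootstrap": bootstrap_name,
        "exec_cores": exec_cores,
        # Assumption: one GPU shared evenly across executor cores.
        "task_gpu_amount": 1 / exec_cores,
    }
```

Returning None for an unsupported instance type mirrors the "Print error and return" node, so the caller never reaches the upload or create-cluster steps.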

Reviews (4): Last reviewed commit: "Resolve EMR config paths relative to scr..."

Comment thread scripts/csp-startup-scripts/emr/config-emr7.json Outdated
Comment thread scripts/csp-startup-scripts/emr/emr-spark-plugin-startup.py
Comment thread scripts/csp-startup-scripts/emr/config-emr6.json Outdated
Comment thread scripts/csp-startup-scripts/emr/emr-spark-plugin-startup.py Outdated
Comment thread scripts/csp-startup-scripts/emr/emr-spark-plugin-startup.py Outdated
nvliyuan and others added 2 commits May 15, 2026 10:48
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Comment thread scripts/csp-startup-scripts/emr/emr-spark-plugin-startup.py Outdated
Comment thread scripts/csp-startup-scripts/emr/emr-spark-plugin-startup.py
Validate supported worker instance types, pass the key pair and subnet into the EMR cluster attributes, and write substituted EMR configuration to a temporary file instead of modifying the source JSON.

Signed-off-by: liyuan <yuali@nvidia.com>
Comment thread scripts/csp-startup-scripts/emr/emr-spark-plugin-startup.py
nvliyuan added 2 commits May 15, 2026 12:19
Return upload status from the S3 helper and skip cluster creation when the bootstrap action cannot be uploaded.

Signed-off-by: liyuan <yuali@nvidia.com>
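
A minimal sketch of the pattern this commit describes: the upload helper reports success or failure, and the caller skips cluster creation on failure. It assumes the upload goes through the `aws s3 cp` CLI command, which this page does not confirm; `upload_to_s3` and `create_cluster` are hypothetical names, and the `run` / `upload` parameters exist only so the logic can be exercised without AWS access.

```python
import subprocess


def upload_to_s3(local_path: str, s3_uri: str, run=subprocess.run) -> bool:
    """Return True when `aws s3 cp` exits with status 0, False otherwise."""
    try:
        result = run(
            ["aws", "s3", "cp", local_path, s3_uri],
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # The aws CLI is not installed or not on PATH.
        print("Upload failed: aws CLI not found")
        return False
    if result.returncode != 0:
        print(f"Upload failed: {result.stderr.strip()}")
        return False
    return True


def create_cluster(local_bootstrap: str, s3_uri: str, upload=upload_to_s3) -> bool:
    """Skip cluster creation when the bootstrap script cannot be uploaded."""
    if not upload(local_bootstrap, s3_uri):
        print("Skipping cluster creation: bootstrap script upload failed.")
        return False
    # The real script would call `aws emr create-cluster` at this point.
    return True
```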
Load the EMR configuration files and upload bootstrap scripts using paths anchored to the startup script directory so the helper works from any current working directory.

Signed-off-by: liyuan <yuali@nvidia.com>
@nvliyuan

Collaborator Author

Merged, since this just cherry-picks the missing #476.

@nvliyuan nvliyuan merged commit 1629594 into NVIDIA:main May 15, 2026
4 checks passed