Add patcher and oracle solver for Nemotron-Terminal-Synthetic-Tasks#22
harvenstar wants to merge 4 commits into open-thoughts:main
Conversation
Patches nvidia/Nemotron-Terminal-Synthetic-Tasks to use shared base images instead of per-task unique images, enabling RL training on Daytona. Two sources of unique images are fixed:

- Removes `docker_image` from task.toml (it pointed to a private Nvidia GitLab registry, inaccessible outside Nvidia)
- Removes `COPY files/ /app/` from Dockerfiles (it baked per-task data files into images, making every image unique)

Task-specific data files are moved from environment/files/ to setup_files/, which Harbor uploads to /setup_files/ in the container before the agent runs. A setup preamble is prepended to instruction.md for tasks that have data files.

Result: ~10 unique Dockerfiles (one per category: data_science, security, debugging, etc.), within Daytona's snapshot limit.
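The per-task patch described above could look roughly like the sketch below. Directory layout and the `patch_task` helper are assumed from the description; the real script also edits task.toml, which is omitted here.

```python
from pathlib import Path

def patch_task(task_dir: Path) -> None:
    """Sketch: strip the per-task COPY layer and relocate data files
    so every task in a category can share one base image."""
    dockerfile = task_dir / "environment" / "Dockerfile"
    lines = dockerfile.read_text().splitlines(keepends=True)
    # Drop the layer that baked task-specific files into the image.
    patched = [ln for ln in lines
               if not ln.strip().startswith("COPY files/ /app/")]
    dockerfile.write_text("".join(patched))

    # Move task data out of the image build context; Harbor uploads
    # setup_files/ to /setup_files/ in the container at run time.
    files_dir = task_dir / "environment" / "files"
    if files_dir.is_dir():
        files_dir.rename(task_dir / "setup_files")
        # Tell the agent where its data now lives.
        instruction = task_dir / "instruction.md"
        preamble = "Task data files are available under /setup_files/.\n\n"
        instruction.write_text(preamble + instruction.read_text())
```

With the COPY layer gone, every Dockerfile in a category becomes byte-identical, which is what lets Daytona reuse one snapshot per category.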
Generates solution/solve.sh for each task using GPT-5-mini, reading only instruction.md (never test files) so the LLM cannot cheat by hardcoding expected outputs or modifying verifiers.
- Remove temperature param (gpt-5-mini only supports the default)
- Use max_completion_tokens instead of max_tokens
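The two API fixes above can be seen in a request-building sketch (the function name and prompt wording are illustrative, not taken from the script):

```python
# Sketch: build kwargs for client.chat.completions.create(**kwargs).
# gpt-5-mini rejects a non-default `temperature` and the legacy
# `max_tokens` field, so neither appears in the payload.

def build_solver_request(instruction_md: str,
                         model: str = "gpt-5-mini",
                         max_completion_tokens: int = 8192) -> dict:
    """Return the chat-completions kwargs for one solver synthesis call."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Write a bash solve.sh for this task. "
                        "Output only the script."},
            {"role": "user", "content": instruction_md},
        ],
        # Reasoning models count thinking tokens against this budget,
        # so it must be generous (max_tokens would be rejected).
        "max_completion_tokens": max_completion_tokens,
    }
```

Only instruction.md is placed in the user message, which enforces the "never read test files" constraint at the prompt level.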
Code Review
This pull request introduces two new Python scripts: patch_nemotron_synthetic_tasks.py and synthesize_oracle_solvers.py. The first script patches Nemotron-Terminal-Synthetic-Tasks to optimize Docker image usage for Daytona's snapshot system by modifying Dockerfiles, task configurations, and instruction files. The second script generates oracle solver scripts for these tasks using an LLM. Review comments suggest improving the robustness of a regex in the patching script, enhancing the guidance provided in a Dockerfile consolidation warning, and including exception messages in the error reporting for the solver synthesis script for better debugging.
gpt-5-mini uses reasoning tokens that count against max_completion_tokens. With a 2048-token budget the model often exhausted it on thinking, leaving zero tokens for output.
Summary
Adds two scripts to prepare nvidia/Nemotron-Terminal-Synthetic-Tasks for RL training on Daytona:
- `data/patchers/patch_nemotron_synthetic_tasks.py`: Patches tasks to use shared base images instead of per-task unique images. Removes private Nvidia registry references from task.toml, removes `COPY files/ /app/` from Dockerfiles, and moves task data files to setup_files/ for Harbor injection. Result: ~10 unique Dockerfiles (one per category), within Daytona's snapshot limit.
- `data/patchers/synthesize_oracle_solvers.py`: Generates `solution/solve.sh` for each task using an LLM (gpt-5-mini by default). Only reads instruction.md (never test files) to prevent the LLM from cheating. Supports parallel generation with configurable concurrency and automatic retry with exponential backoff.

Test Results (200 tasks: 100 easy + 100 medium)
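The concurrency-with-backoff pattern mentioned above can be sketched as follows (function names are illustrative; the script's own helpers may differ):

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def with_backoff(fn, *args, retries=5, base_delay=1.0):
    """Retry fn with exponential backoff and jitter on any exception."""
    for attempt in range(retries):
        try:
            return fn(*args)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; surface the error
            # Sleep base_delay * 2^attempt, scaled by random jitter.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

def synthesize_all(tasks, synth_fn, concurrency=8):
    """Generate solvers for all tasks with bounded parallelism."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda t: with_backoff(synth_fn, t), tasks))
```

Bounding `max_workers` keeps the script under the API's rate limit, while the jittered backoff absorbs transient 429s without hammering the endpoint.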
Medium failures breakdown: 77 solve.sh quality issues, 18 environment dependency issues (e.g. model_training tests require torch but Dockerfile only has data science packages — this appears to be an issue in the original dataset), 5 other.
Notes