Rename evaluator/ to eval/ and remove legacy evaluator.py files #37

Open
akrentsel wants to merge 5 commits into skydiscover-ai:main from akrentsel:rename-evaluator-to-eval

@akrentsel

Collaborator

Summary

  • Rename all containerized evaluator directories from evaluator/ to eval/ across all 38 benchmarks
  • Remove the 17 old root-level evaluator.py files that were kept for backwards compatibility in fully containerized benchmarks
  • Update container_evaluator.py Docker tag detection to handle both eval and evaluator directory names
  • Update all documentation: benchmarks/README.md, main README.md, 18 sub-benchmark READMEs, and config.yaml usage comments
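
A minimal sketch of tag detection that accepts both directory names. This is not the code in container_evaluator.py: the function name `image_tag_for`, the `skydiscover-` prefix, and the `:latest` suffix are assumptions made for illustration only.

```python
import string
from pathlib import Path

# Hypothetical: directory names treated as "the containerized evaluator folder".
EVAL_DIR_NAMES = {"eval", "evaluator"}

# Characters kept as-is in the tag; everything else becomes "-".
_SAFE_CHARS = set(string.ascii_lowercase + string.digits + "_.-")


def image_tag_for(eval_dir: str) -> str:
    """Derive an image tag from a containerized evaluator directory (illustrative)."""
    path = Path(eval_dir)
    # When the leaf is eval/ or evaluator/, name the image after the benchmark
    # directory above it so both layouts produce the same tag.
    name = path.parent.name if path.name in EVAL_DIR_NAMES else path.name
    safe = "".join(c if c in _SAFE_CHARS else "-" for c in name.lower())
    return f"skydiscover-{safe}:latest"  # prefix and suffix are assumptions
```

With this shape, `benchmarks/math/circle_packing_rect/eval` and the old `benchmarks/math/circle_packing_rect/evaluator` map to the same tag, so images built before the rename would still be found.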

Test plan

  • Verify a containerized benchmark runs with the new eval/ directory path (e.g., benchmarks/math/circle_packing_rect/eval)
  • Verify Docker image tag generation still works correctly in container_evaluator.py
  • Spot-check documentation links and run commands in READMEs
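
One way to complement the spot-checks is a small scan for leftovers. The sketch below is hypothetical (the helper `find_legacy_paths` is not part of the repository) and assumes a legacy file is an `evaluator.py` sitting next to an `eval/` directory; hits are candidates for manual review, not definite errors.

```python
from pathlib import Path


def find_legacy_paths(benchmarks_root: str) -> list[Path]:
    """List leftover evaluator/ directories and evaluator.py files beside eval/."""
    root = Path(benchmarks_root)
    leftovers = []
    # Any directory still named evaluator/ was missed by the rename.
    for path in root.rglob("evaluator"):
        if path.is_dir():
            leftovers.append(path)
    # A root-level evaluator.py next to eval/ may be a legacy copy.
    for eval_dir in root.rglob("eval"):
        if not eval_dir.is_dir():
            continue
        legacy = eval_dir.parent / "evaluator.py"
        if legacy.is_file():
            leftovers.append(legacy)
    return sorted(leftovers)


if __name__ == "__main__":
    for leftover in find_legacy_paths("benchmarks"):
        print(leftover)
```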

🤖 Generated with Claude Code

exe.dev user and others added 2 commits March 16, 2026 00:50
Convert every benchmark to the containerized evaluator format:

- ALE Bench (10 problems): standard wrapper, ale-bench deps
- GPU Mode (4 problems): NVIDIA CUDA base image, self-contained
  shared_eval.py + reference.py with local EvaluationResult definition
- Frontier-CS: Docker-in-Docker for judge server, clones Frontier-CS
- Image Gen: standard wrapper, OpenAI GPT-5 judge
- Prompt Optimization: custom JSON bridge for EvaluationResult, heavy
  NLP deps (dspy, bm25s, datasets)
- Cleanup: remove old circle_packing/evaluator.py (containerized
  version already exists)

Old top-level evaluator.py files are kept for backwards compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Standardize containerized evaluator directory naming from `evaluator/`
to `eval/` across all benchmarks. Remove the old root-level
`evaluator.py` files that were kept for backwards compatibility in
benchmarks that have been fully containerized. Update all documentation,
config comments, and the container_evaluator.py tag detection logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the project's benchmark evaluation structure. The primary goal was to streamline the naming convention for containerized evaluator directories, moving from a longer evaluator/ to a more concise eval/. This change enhances consistency across the benchmark suite and removes redundant legacy files, simplifying the overall project architecture and improving maintainability.

Highlights

  • Standardized Evaluator Directory Naming: Renamed all containerized evaluator directories from evaluator/ to eval/ across 38 benchmarks for consistency.
  • Removed Legacy Evaluator Files: Eliminated 17 old root-level evaluator.py files that were previously maintained for backwards compatibility in fully containerized benchmarks.
  • Updated Docker Tag Detection: Modified container_evaluator.py to correctly detect Docker tags using both the new eval and the old evaluator directory names.
  • Comprehensive Documentation Update: Revised all relevant documentation, including main READMEs, 18 sub-benchmark READMEs, and config.yaml usage comments, to reflect the new eval/ naming convention.
  • Introduced Wrapper for Backwards Compatibility: Added a wrapper.py module to new eval/ directories to ensure existing Python-based evaluators continue to function correctly with the updated structure.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request refactors the benchmark evaluation system by containerizing evaluators. It introduces new eval/ subdirectories for each benchmark, containing a Dockerfile, evaluate.sh entrypoint, and a Python evaluator.py script. Corresponding README.md and config.yaml files are updated to reflect the new eval/ directory structure in command examples and descriptions. The evaluator.py scripts are modified or wrapped with wrapper.py to standardize JSON output, and the core skydiscover/evaluation/container_evaluator.py is updated to correctly tag Docker images for these new containerized evaluators.
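
For readers unfamiliar with the wrapper pattern mentioned above, here is a rough sketch of how a wrapper might standardize an evaluator's result as JSON. The function name `run_and_report` and the `status`/`metrics`/`error` keys are assumptions for illustration; the actual wrapper.py in this PR may use a different schema.

```python
import json


def run_and_report(evaluate, program_path: str) -> str:
    """Call evaluate(program_path) and return a one-line JSON report (illustrative)."""
    try:
        result = evaluate(program_path)
        # Accept either a plain dict or an object whose attributes are the metrics.
        metrics = result if isinstance(result, dict) else vars(result)
        payload = {"status": "success", "metrics": metrics}
    except Exception as exc:
        # Report failures as JSON too, so the caller always parses one format.
        payload = {"status": "error", "error": f"{type(exc).__name__}: {exc}"}
    # default=str keeps non-JSON-native values (e.g. paths) from raising.
    return json.dumps(payload, default=str)
```

An entrypoint such as evaluate.sh could then print this string to stdout, giving the host process a single format to parse regardless of which benchmark produced it.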

Tests the full Docker pipeline: image build, container start, program
evaluation, and teardown using the new eval/ naming convention. Also
tests tag generation and create_evaluator routing logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@akrentsel

Collaborator Author

Fixes #36

exe.dev user and others added 2 commits March 18, 2026 22:21
Merge upstream changes (Harbor task support) and update the new
evaluator format table in README.md to use eval/ instead of evaluator/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@akrentsel akrentsel closed this Mar 18, 2026
@akrentsel akrentsel reopened this Mar 18, 2026