Rename evaluator/ to eval/ and remove legacy evaluator.py files by akrentsel · Pull Request #37 · skydiscover-ai/skydiscover

akrentsel · 2026-03-18T21:28:14Z

Summary

Rename all containerized evaluator directories from evaluator/ to eval/ across all 38 benchmarks
Remove the 17 old root-level evaluator.py files that were kept for backwards compatibility in fully containerized benchmarks
Update container_evaluator.py Docker tag detection to handle both eval and evaluator directory names
Update all documentation: benchmarks/README.md, main README.md, 18 sub-benchmark READMEs, and config.yaml usage comments
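
The tag-detection change can be pictured with a minimal sketch. It assumes the image tag is derived from the benchmark directory that contains `eval/` (or the legacy `evaluator/`). The helper name `image_tag_for` and the `skydiscover` prefix are hypothetical and are not taken from container_evaluator.py.

```python
from pathlib import Path

# Directory names treated as "the evaluator folder": the new name first,
# the legacy name kept so older layouts still resolve.
EVAL_DIR_NAMES = ("eval", "evaluator")


def image_tag_for(eval_path: str, prefix: str = "skydiscover") -> str:
    """Hypothetical helper: build a Docker tag from an evaluator directory path."""
    path = Path(eval_path)
    # Drop the trailing eval/ or evaluator/ component so that both layouts
    # map to the same benchmark name.
    if path.name in EVAL_DIR_NAMES:
        path = path.parent
    # Docker repository names must be lowercase.
    return f"{prefix}-{path.name.lower()}:latest"
```

Under this assumption, `benchmarks/math/circle_packing_rect/eval` and `benchmarks/math/circle_packing_rect/evaluator` yield the same tag, which is the property the rename needs to preserve.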

Test plan

Verify a containerized benchmark runs with the new eval/ directory path (e.g., benchmarks/math/circle_packing_rect/eval)
Verify Docker image tag generation still works correctly in container_evaluator.py
Spot-check documentation links and run commands in READMEs
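
The README spot-check could be partly automated. The following is a sketch only, not part of this PR: it assumes that stale references appear as the literal path component `evaluator/`, and the function name `find_stale_refs` is hypothetical.

```python
import re
from pathlib import Path

# Matches "evaluator/" as its own path component. The lookbehind skips names
# such as "container_evaluator/" where "evaluator" is only a suffix.
STALE_REF = re.compile(r"(?<![\w.])evaluator/")


def find_stale_refs(root: str) -> list[tuple[str, int, str]]:
    """Return (file, line number, line) for README lines that still mention evaluator/."""
    hits = []
    for readme in sorted(Path(root).rglob("README.md")):
        text = readme.read_text(encoding="utf-8")
        for lineno, line in enumerate(text.splitlines(), 1):
            if STALE_REF.search(line):
                hits.append((str(readme), lineno, line.strip()))
    return hits
```

Running `find_stale_refs("benchmarks")` from the repository root would list any run command that still points at the old directory name.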

🤖 Generated with Claude Code

Convert every benchmark to the containerized evaluator format:

- ALE Bench (10 problems): standard wrapper, ale-bench deps
- GPU Mode (4 problems): NVIDIA CUDA base image, self-contained shared_eval.py + reference.py with local EvaluationResult definition
- Frontier-CS: Docker-in-Docker for judge server, clones Frontier-CS
- Image Gen: standard wrapper, OpenAI GPT-5 judge
- Prompt Optimization: custom JSON bridge for EvaluationResult, heavy NLP deps (dspy, bm25s, datasets)
- Cleanup: remove old circle_packing/evaluator.py (containerized version already exists)

Old top-level evaluator.py files are kept for backwards compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Standardize containerized evaluator directory naming from `evaluator/` to `eval/` across all benchmarks. Remove the old root-level `evaluator.py` files that were kept for backwards compatibility in benchmarks that have been fully containerized. Update all documentation, config comments, and the container_evaluator.py tag detection logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

gemini-code-assist · 2026-03-18T21:35:15Z

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the project's benchmark evaluation structure. The primary goal was to streamline the naming convention for containerized evaluator directories, moving from a longer evaluator/ to a more concise eval/. This change enhances consistency across the benchmark suite and removes redundant legacy files, simplifying the overall project architecture and improving maintainability.

Highlights

Standardized Evaluator Directory Naming: Renamed all containerized evaluator directories from evaluator/ to eval/ across 38 benchmarks for consistency.
Removed Legacy Evaluator Files: Eliminated 17 old root-level evaluator.py files that were previously maintained for backwards compatibility in fully containerized benchmarks.
Updated Docker Tag Detection: Modified container_evaluator.py to correctly detect Docker tags using both the new eval and the old evaluator directory names.
Comprehensive Documentation Update: Revised all relevant documentation, including main READMEs, 18 sub-benchmark READMEs, and config.yaml usage comments, to reflect the new eval/ naming convention.
Introduced Wrapper for Backwards Compatibility: Added a wrapper.py module to new eval/ directories to ensure existing Python-based evaluators continue to function correctly with the updated structure.

gemini-code-assist

Code Review

This pull request refactors the benchmark evaluation system by containerizing evaluators. It introduces new eval/ subdirectories for each benchmark, containing a Dockerfile, evaluate.sh entrypoint, and a Python evaluator.py script. Corresponding README.md and config.yaml files are updated to reflect the new eval/ directory structure in command examples and descriptions. The evaluator.py scripts are modified or wrapped with wrapper.py to standardize JSON output, and the core skydiscover/evaluation/container_evaluator.py is updated to correctly tag Docker images for these new containerized evaluators.
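
As an illustration of what "standardize JSON output" could mean for a wrapper, here is a minimal sketch. It is not the PR's wrapper.py: the names `to_payload` and `run`, the `metrics` attribute, and the output schema are all assumptions made for the example.

```python
import json
from typing import Any, Callable


def to_payload(result: Any) -> dict:
    """Normalize an evaluator's return value into a JSON-serializable dict."""
    # Assumption: result objects expose their scores as a `metrics` mapping.
    if hasattr(result, "metrics"):
        return {"metrics": dict(result.metrics)}
    # Plain dicts are treated as the metrics themselves.
    if isinstance(result, dict):
        return {"metrics": result}
    raise TypeError(f"unsupported result type: {type(result).__name__}")


def run(evaluate: Callable[[str], Any], program_path: str) -> str:
    """Call the evaluator and return one JSON document; failures become an error record."""
    try:
        payload = to_payload(evaluate(program_path))
    except Exception as exc:  # the container should always emit valid JSON
        payload = {"metrics": {}, "error": f"{type(exc).__name__}: {exc}"}
    return json.dumps(payload, sort_keys=True)
```

The design point is that the caller on the host side only ever parses one shape, whether the wrapped evaluator returned a dict, returned a result object, or raised.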

Tests the full Docker pipeline: image build, container start, program evaluation, and teardown using the new eval/ naming convention. Also tests tag generation and create_evaluator routing logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
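
The routing behavior mentioned in this commit can be sketched as follows. This is a guess at the kind of rule involved, not the actual create_evaluator code: it assumes a directory containing a Dockerfile is routed to the container evaluator and a `.py` file to an in-process one, and `pick_evaluator_kind` is a hypothetical name.

```python
from pathlib import Path


def pick_evaluator_kind(evaluation_target: str) -> str:
    """Hypothetical routing rule: return "container" or "python" for a given path."""
    path = Path(evaluation_target)
    # A directory holding a Dockerfile is treated as a containerized evaluator.
    if path.is_dir() and (path / "Dockerfile").is_file():
        return "container"
    # A plain Python file is treated as an in-process evaluator.
    if path.is_file() and path.suffix == ".py":
        return "python"
    raise ValueError(f"cannot route evaluator path: {evaluation_target}")
```

A rule of this shape does not depend on whether the directory is called `eval` or `evaluator`, so a routing test would be expected to pass for both names.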

akrentsel · 2026-03-18T22:19:26Z

Fixes #36

exe.dev user and others added 2 commits March 16, 2026 00:50

gemini-code-assist Bot reviewed Mar 18, 2026

exe.dev user and others added 2 commits March 18, 2026 22:21

Merge origin/main and fix remaining evaluator/ reference

af3b2f2

Merge upstream changes (Harbor task support) and update the new evaluator format table in README.md to use eval/ instead of evaluator/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove accidentally staged worktree directories

80911ef

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

akrentsel closed this Mar 18, 2026

akrentsel reopened this Mar 18, 2026

akrentsel wants to merge 5 commits into skydiscover-ai:main from akrentsel:rename-evaluator-to-eval
