Rename evaluator/ to eval/ and remove legacy evaluator.py files #37
akrentsel wants to merge 5 commits into
Convert every benchmark to the containerized evaluator format:
- ALE Bench (10 problems): standard wrapper, ale-bench deps
- GPU Mode (4 problems): NVIDIA CUDA base image, self-contained shared_eval.py + reference.py with local EvaluationResult definition
- Frontier-CS: Docker-in-Docker for judge server, clones Frontier-CS
- Image Gen: standard wrapper, OpenAI GPT-5 judge
- Prompt Optimization: custom JSON bridge for EvaluationResult, heavy NLP deps (dspy, bm25s, datasets)
- Cleanup: remove old circle_packing/evaluator.py (containerized version already exists)

Old top-level evaluator.py files are kept for backwards compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Standardize containerized evaluator directory naming from `evaluator/` to `eval/` across all benchmarks. Remove the old root-level `evaluator.py` files that were kept for backwards compatibility in benchmarks that have been fully containerized. Update all documentation, config comments, and the container_evaluator.py tag detection logic. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
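To illustrate the tag detection change described in this commit, here is a minimal sketch of how an image tag could be derived from an evaluator directory while accepting both the new `eval/` and the old `evaluator/` names. The function name `image_tag_for`, the `skydiscover-eval` prefix, and the tag format are assumptions for illustration only; the actual logic in container_evaluator.py may differ.

```python
import re
from pathlib import Path

# Hypothetical: directory names treated as "the evaluator wrapper directory".
WRAPPER_DIR_NAMES = {"eval", "evaluator"}


def image_tag_for(evaluator_dir: str, prefix: str = "skydiscover-eval") -> str:
    """Derive a Docker image tag from an evaluator directory path (sketch)."""
    path = Path(evaluator_dir)
    # If the path ends in eval/ or evaluator/, name the image after the
    # benchmark directory that contains it rather than after the wrapper dir.
    name_dir = path.parent if path.name in WRAPPER_DIR_NAMES else path
    # Image names must be lowercase; replace anything outside a conservative
    # character set with a dash.
    slug = re.sub(r"[^a-z0-9_.-]+", "-", name_dir.name.lower()).strip("-")
    return f"{prefix}-{slug or 'benchmark'}:latest"
```

Under these assumptions, `benchmarks/math/circle_packing_rect/eval` and `benchmarks/math/circle_packing_rect/evaluator` both map to the same tag, so an image built before the rename would still be found afterwards.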
Summary of Changes
This pull request significantly refactors the project's benchmark evaluation structure. The primary goal was to streamline the naming convention for containerized evaluator directories.
Code Review
This pull request refactors the benchmark evaluation system by containerizing evaluators. It introduces new eval/ subdirectories for each benchmark, containing a Dockerfile, evaluate.sh entrypoint, and a Python evaluator.py script. Corresponding README.md and config.yaml files are updated to reflect the new eval/ directory structure in command examples and descriptions. The evaluator.py scripts are modified or wrapped with wrapper.py to standardize JSON output, and the core skydiscover/evaluation/container_evaluator.py is updated to correctly tag Docker images for these new containerized evaluators.
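As a rough illustration of what "standardizing JSON output" in a wrapper could look like, the sketch below calls a benchmark's evaluation function and always returns a single JSON document, even when the evaluation raises. The function name `run_and_report`, the payload keys (`status`, `metrics`, `error`), and the `combined_score` metric used in the usage note are assumptions, not the schema actually used by wrapper.py in this PR.

```python
import json
from typing import Any, Callable, Dict


def run_and_report(evaluate: Callable[[str], Dict[str, Any]], program_path: str) -> str:
    """Run `evaluate` on a candidate program and return one JSON string (sketch)."""
    try:
        metrics = evaluate(program_path)
        payload = {"status": "success", "metrics": metrics}
    except Exception as exc:  # report failures as data instead of crashing
        payload = {
            "status": "error",
            "metrics": {},
            "error": f"{type(exc).__name__}: {exc}",
        }
    return json.dumps(payload)
```

A container entrypoint such as evaluate.sh could then print this string to stdout, for example `print(run_and_report(evaluate, sys.argv[1]))`, so that the host side only ever has to parse one JSON object per run.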
Tests the full Docker pipeline: image build, container start, program evaluation, and teardown using the new eval/ naming convention. Also tests tag generation and create_evaluator routing logic. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
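The routing logic exercised by these tests can be pictured with the sketch below: a directory containing a Dockerfile is treated as a containerized evaluator, while a plain `.py` file is treated as a legacy Python evaluator. The helper name `evaluator_kind` and its return values are hypothetical; the real `create_evaluator` may use different criteria and returns evaluator objects rather than strings.

```python
from pathlib import Path


def evaluator_kind(evaluation_path: str) -> str:
    """Classify an evaluator path as 'container' or 'python' (sketch)."""
    path = Path(evaluation_path)
    # Assumption: a containerized evaluator is a directory with a Dockerfile.
    if path.is_dir() and (path / "Dockerfile").is_file():
        return "container"
    # Assumption: a legacy evaluator is a single Python file.
    if path.is_file() and path.suffix == ".py":
        return "python"
    raise ValueError(f"Cannot determine evaluator type for: {evaluation_path}")
```

Raising on anything else keeps a mistyped path (for example, still pointing at a removed `evaluator/` directory) from silently falling through to the wrong evaluator.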
Fixes #36
Merge upstream changes (Harbor task support) and update the new evaluator format table in README.md to use eval/ instead of evaluator/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Rename `evaluator/` to `eval/` across all 38 benchmarks
- Remove legacy `evaluator.py` files that were kept for backwards compatibility in fully containerized benchmarks
- Update `container_evaluator.py` Docker tag detection to handle both `eval` and `evaluator` directory names

Test plan
- … `eval/` directory path (e.g., `benchmarks/math/circle_packing_rect/eval`)
- … `container_evaluator.py`

🤖 Generated with Claude Code