Fe chh v2 by ahydchh · Pull Request #62 · EinsiaLab/Frontier-Engineering

ahydchh · 2026-04-26T09:35:07Z

No description provided.

… holographic seed

- scripts/setup_uv_envs.sh + scripts/requirements/: uv-based env setup for fe-base, fe-jobshop, fe-pyportfolioopt, fe-optics, replacing per-task conda deps
- scripts/run_full_baseline_validation.py: switch JobShop/Optics/PyPortfolioOpt/CoFlyers/Dawn/DuckDB/EV2Gym/PyMOTO tasks to uv venvs via task.runtime.python_path; add ProtonTherapyPlanning and perturbation_prediction (76 tasks total, was 74); inject HOLO_EVAL_SEED=3 for holographic_multispectral_focusing
- benchmarks/ParticlePhysics/ProtonTherapyPlanning/frontier_eval/: unified metadata + verification/evaluate_unified.py wrapper (run candidate → plan.json → score)
- benchmarks/SingleCellAnalysis/perturbation_prediction/frontier_eval/: unified metadata + verification/evaluate_unified.py wrapper (candidate → prediction.h5ad → Pearson/Spearman/cosine; dataset auto-downloaded from OpenProblems S3)
- benchmarks/Optics/frontier_eval/run_eval.sh: add --seed ${HOLO_EVAL_SEED:-0} for holographic tasks; fixes baseline validity failure at default seed=0 for holographic_multispectral_focusing (mean_target_efficiency 0.00377 < 0.004)
- docs/baseline_validation_report_2026-04-24.md: baseline run results for 15 tasks

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
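The `--seed ${HOLO_EVAL_SEED:-0}` expansion in run_eval.sh falls back to 0 when the variable is unset or empty. A minimal Python sketch of the same fallback rule is shown here for readers who want to mirror it in a validation script; the function name `resolve_eval_seed` is hypothetical and does not come from the repository.

```python
import os


def resolve_eval_seed(env=None, var="HOLO_EVAL_SEED", default=0):
    """Mirror the shell expansion ${HOLO_EVAL_SEED:-0}.

    The default is used when the variable is unset OR set to an empty
    string, which is what the ':-' form of the expansion does.
    `resolve_eval_seed` is an illustrative name, not a repository function.
    """
    env = os.environ if env is None else env
    raw = env.get(var, "")
    return int(raw) if raw else default


if __name__ == "__main__":
    # With HOLO_EVAL_SEED=3 injected by the validation script this prints 3.
    print(resolve_eval_seed())
```

Passing a plain dict as `env` keeps the helper easy to test without touching the real process environment.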

… compat

- Add bootstrap script (scripts/bootstrap/setup_denoising_task.sh) and env.sh for repo-local viash/nextflow/JDK tooling and task_denoising checkout
- Add python310_compat.patch: switch methods/magic, metrics/mse, metrics/poisson to python:3.10 base image; scprep requires pandas<2.1, which has no Python 3.12 wheels and cannot be built from source on Python 3.12 (pkg_resources missing)
- Update setup_denoising_task.sh to apply python310_compat.patch automatically
- Update evaluator (frontier_eval/tasks/denoising/evaluator/python.py) with full viash-build + nextflow + rank_scores pipeline; verified valid=1
- Update README.md / README_zh-CN.md: document Docker group setup, proxy config for Docker Hub access, and the Python 3.10 compatibility fix rationale

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-04-26T09:35:43Z

🤖 AI Code Review (gemini-3-flash-preview)

🇬🇧 English Analysis

1. Executive Summary

Core Purpose: This PR performs a significant refactoring of the verification logic across multiple communication engineering benchmarks (LDPCErrorFloor, PMDSimulation, RayleighFadingBER). The primary shift is moving the simulation loop and importance sampling weight calculations from the candidate-provided "sampler" to the "evaluator" (Evaluator-Owned Simulation). This ensures that the evaluation framework independently verifies the statistical validity of the results.
Modified File Structure & Modifications:
- .gitignore: Added **/temp/ to prevent local temporary files from being tracked.
- benchmarks/CommunicationEngineering/LDPCErrorFloor/baseline/solution.py & scripts/init.py: Adjusted bias_factor from 1.5 to 1.0 to align with the new evaluator-owned sampling logic.
- benchmarks/CommunicationEngineering/LDPCErrorFloor/verification/evaluator.py: Implemented _run_evaluator_owned_simulation. The evaluator now calls sampler.sample and performs its own decoding and weight calculation. Updated R0_DEV anchor.
- benchmarks/CommunicationEngineering/PMDSimulation/scripts/init.py: Updated bias_strength to 0.25 for the baseline sampler.
- benchmarks/CommunicationEngineering/PMDSimulation/verification/evaluator.py: Similar refactor to LDPC; implemented internal simulation logic and updated reference outage probability (R0_DEV).
- benchmarks/CommunicationEngineering/RayleighFadingBER/verification/evaluator.py: Refactored to include _run_evaluator_owned_simulation and standardized validation helpers.
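The evaluator-owned pattern described in the list above can be sketched on a toy problem. This is a minimal illustration, not the repository's `_run_evaluator_owned_simulation`: the class `ShiftedGaussianSampler`, the `sample(rng, n)` signature, and the convention that the sampler returns its own proposal log-density are all assumptions made for the example. The toy task estimates the Gaussian tail probability P(X > t) for X ~ N(0, 1).

```python
import numpy as np

LOG_NORM = -0.5 * np.log(2.0 * np.pi)


class ShiftedGaussianSampler:
    """Hypothetical candidate: proposes x ~ N(shift, 1).

    It only draws samples and reports the log-density it drew them from;
    it never computes weights, events, or the final estimate.
    """

    def __init__(self, shift):
        self.shift = float(shift)

    def sample(self, rng, n):
        x = rng.normal(self.shift, 1.0, size=n)
        log_pdf_biased = -0.5 * (x - self.shift) ** 2 + LOG_NORM
        return x, log_pdf_biased


def run_evaluator_owned_simulation(sampler, threshold, n, seed=0):
    """Evaluator side: owns the RNG, the event test, and the weights."""
    rng = np.random.Generator(np.random.Philox(seed))
    x, log_pdf_biased = sampler.sample(rng, n)

    # The evaluator evaluates the true density itself (standard normal here).
    log_pdf_true = -0.5 * x**2 + LOG_NORM
    weights = np.exp(log_pdf_true - log_pdf_biased)

    # The evaluator also decides what counts as an event.
    events = x > threshold
    return float(np.mean(weights * events))


if __name__ == "__main__":
    est = run_evaluator_owned_simulation(
        ShiftedGaussianSampler(4.0), threshold=4.0, n=200_000
    )
    print(f"estimated P(X > 4) = {est:.3e}")
```

The exact tail probability for this toy case is about 3.17e-5, so the printed estimate can be checked by eye. In the real benchmarks the event test would be LDPC decoding failure or a DGD threshold crossing rather than a scalar comparison.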

2. AI Content Analysis

Estimated AI Component: 35%
Reasoning & Evidence: The refactoring follows a highly consistent, almost "templated" pattern across three different benchmarks. Functions like _as_1d_float_array, _summarize_weighted_event_run, and the structure of _run_evaluator_owned_simulation are nearly identical in logic and naming conventions across files. This suggests the use of AI to propagate a specific architectural pattern (Evaluator-Owned Simulation) across the codebase. However, the domain-specific math (e.g., PMD DGD calculations, LDPC LLRs) remains specialized.

3. Engineering & Economic Assessment

Engineering Reality Check: This is a high-quality, production-grade improvement. In LLM-based code evaluation, allowing the model-generated code to control the simulation loop (simulate_variance_controlled) is a "black box" risk. By forcing the model to only provide the sample method, the evaluator can independently verify the Importance Sampling (IS) weights and convergence. This effectively prevents "cheating" or numerical instability in user-provided loops.
Economic Value: High. It significantly reduces technical debt by standardizing the evaluation protocol. It improves the reliability of the benchmarks, ensuring that performance scores are based on mathematically sound simulations rather than potentially biased or buggy candidate implementations.

4. Quality Assurance

Verification & Testing:
- frontier_eval Integration: Yes.
- task_name: LDPCErrorFloor, PMDSimulation, RayleighFadingBER.
- Execution & Dependencies: The PR modifies the internal logic but does not update the .md files. It assumes the existing environment (NumPy, SciPy, Philox generator) is present.
Documentation Quality: The code is well-commented, particularly regarding the rationale for the refactor (e.g., "evaluator-owned sampling"). However, there is no update to the user-facing READMEs to explain that the required interface for TrappingSetSampler or PMDSampler has shifted from a full simulation method to a raw sample method.
Organizational Structure: Excellent. The move toward a shared logic pattern for evaluators makes the project more maintainable and scalable.

5. Security & Privacy Check

Sensitive Files: Clean. The addition of **/temp/ to .gitignore is a proactive security/cleanliness measure.
Absolute Paths: None detected. All imports and paths use relative logic or repo-root anchors.

github-actions · 2026-04-26T16:14:12Z

🤖 AI Code Review (gemini-3-flash-preview)

🇬🇧 English Analysis

1. Executive Summary

Core Purpose: This PR primarily refactors the evaluation mechanism for the Communication Engineering benchmarks (LDPCErrorFloor and PMDSimulation). It moves the simulation control loop from the candidate's solution to the evaluator ("evaluator-owned simulation") to ensure integrity and consistency. Additionally, it registers four new tasks in the materials engineering and particle physics domains.
Modified File Structure & Modifications:
- .gitignore: Added **/temp/ to prevent local temporary files from being committed.
- TASK_DETAILS.md & TASK_DETAILS_zh-CN.md: Added metadata for PETScannerOptimization, MicrowaveAbsorberDesign, LightweightBroadbandAbsorber, and NanoCarbonAbsorberOptimization.
- benchmarks/CommunicationEngineering/LDPCErrorFloor/baseline/solution.py & scripts/init.py: Adjusted bias_factor to 1.0 and updated comments to reflect the new evaluator-led sampling approach.
- benchmarks/CommunicationEngineering/LDPCErrorFloor/verification/evaluator.py: Major refactor. Implemented _run_evaluator_owned_simulation, added rigorous result validation (_validate_result), and updated reference anchors (R0_DEV) for smoke tests.
- benchmarks/CommunicationEngineering/PMDSimulation/scripts/init.py: Updated bias_strength to 0.25 for baseline calibration.
- benchmarks/CommunicationEngineering/PMDSimulation/verification/evaluator.py: Major refactor. Implemented evaluator-side importance sampling logic and updated outage probability reference values.

2. AI Content Analysis

Estimated AI Component: 35%
Reasoning & Evidence: The validation helper functions (e.g., _as_1d_float_array, _validate_result, and _as_noise_batch in evaluator.py) exhibit high structural regularity and standard error-handling patterns typical of AI-assisted boilerplate generation. The use of np.isfinite, math.isclose, and descriptive exception strings is very "clean" and follows common LLM coding styles. However, the core logic involving importance sampling weights (log_pdf_true - log_pdf_biased) and domain-specific physics (DGD thresholding, LDPC decoding) shows specific engineering intent unlikely to be purely AI-generated without heavy human guidance.
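For readers unfamiliar with the helper style described above, here is a minimal sketch of what such validation code typically looks like. The function names echo the ones mentioned in the review, but the bodies are assumptions written for illustration and are not copied from evaluator.py.

```python
import math

import numpy as np


def as_1d_float_array(values, name, expected_len=None):
    """Coerce candidate output to a finite 1-D float array or raise.

    Illustrative stand-in for the `_as_1d_float_array` helper named in
    the review; the real signature may differ.
    """
    arr = np.asarray(values, dtype=float)
    if arr.ndim != 1:
        raise ValueError(f"{name} must be 1-D, got shape {arr.shape}")
    if expected_len is not None and arr.shape[0] != expected_len:
        raise ValueError(
            f"{name} must have length {expected_len}, got {arr.shape[0]}"
        )
    if not np.all(np.isfinite(arr)):
        raise ValueError(f"{name} contains NaN or infinite values")
    return arr


def check_consistent(reported, recomputed, name, rel_tol=1e-9):
    """Reject a reported scalar that disagrees with the evaluator's value."""
    if not math.isclose(reported, recomputed, rel_tol=rel_tol, abs_tol=0.0):
        raise ValueError(
            f"{name}: reported {reported!r} != recomputed {recomputed!r}"
        )


if __name__ == "__main__":
    weights = as_1d_float_array([0.5, 1.5, 1.0], "weights", expected_len=3)
    check_consistent(1.0, float(weights.mean()), "mean weight")
    print("validation passed")
```

Raising with a descriptive message, rather than returning a flag, lets the evaluator mark a run invalid while still recording why.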

3. Engineering & Economic Assessment

Engineering Reality Check: This is a high-quality, production-grade improvement. In benchmarking systems, allowing a submitted solution to report its own convergence and error rates is a trust-without-verification anti-pattern. By moving the simulation loop into evaluator.py, the system now independently verifies the candidate's sample method. Raising R0_DEV for the smoke tests is a practical engineering trade-off that allows fast verification in CI environments.
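Why a larger reference probability makes a smoke test cheaper can be shown with the standard sample-size formula for plain Monte Carlo, where the relative standard error of an estimated probability p from N samples is sqrt((1 - p) / (N p)). The sketch below is only an illustration under that plain Monte Carlo assumption: the evaluators use importance sampling, and the actual R0_DEV values are not shown on this page, so the probabilities used are hypothetical.

```python
def plain_mc_samples_needed(p, rel_err):
    """Samples N at which sqrt((1 - p) / (N * p)) equals rel_err.

    Applies to plain (unbiased, unweighted) Monte Carlo only; importance
    sampling exists precisely to need far fewer samples than this.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if rel_err <= 0.0:
        raise ValueError("rel_err must be positive")
    return (1.0 - p) / (p * rel_err**2)


if __name__ == "__main__":
    # Hypothetical probabilities, chosen only to show the scaling.
    for p in (1e-2, 1e-4, 1e-6):
        n = plain_mc_samples_needed(p, rel_err=0.1)
        print(f"p = {p:g}: about {n:.3g} samples for 10% relative error")
```

At 10% relative error, p = 1e-2 needs roughly 9.9e3 samples while p = 1e-6 needs roughly 1e8, which is why a smoke test anchored at a more probable event finishes quickly.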
Economic Value: Medium. It significantly reduces the risk of "cheating" or accidental misreporting in the benchmark, thereby increasing the scientific value of the leaderboard. It reduces technical debt by standardizing how importance sampling tasks are evaluated.

4. Quality Assurance

Verification & Testing:
- frontier_eval Integration: Yes.
- task_name: LDPCErrorFloor, PMDSimulation, PETScannerOptimization, MicrowaveAbsorberDesign, LightweightBroadbandAbsorber, NanoCarbonAbsorberOptimization.
- Execution & Dependencies: The PR updates the internal logic but relies on existing environment setups. TASK_DETAILS.md provides high-level descriptions, though specific installation steps for the new materials engineering tasks are not fully detailed in this diff.
Documentation Quality: The documentation in TASK_DETAILS.md is clear and follows the existing tabular format. Comments in the code (e.g., explaining the shift to "evaluator-owned simulation") are helpful for future maintainers.
Organizational Structure: The structure remains logical. The move toward a more robust verification/evaluator.py improves the modularity between "what is being solved" and "how it is being graded."

5. Security & Privacy Check

Sensitive Files: Clean. No .env, API keys, or IDE-specific files were detected. The addition of **/temp/ to .gitignore is a proactive security measure.
Absolute Paths: None detected. The code uses relative paths or Path objects correctly.

ahydchh and others added 8 commits April 24, 2026 13:40

feat(v2): add unified task-set envs and harden evaluators

611d293

feat(v2): add microwave absorber and PET scanner tasks

1805a86

chore: drop generated PET solution artifact

fffa514

docs(v2): record new task integration details

071f194

docs(v2): align task docs and clean repo artifacts

4fa4fc2

feat(v2): unify proton therapy and perturbation tasks

da2c8b1

ahydchh added 4 commits April 26, 2026 13:20

refactor(material): align absorber tasks to unified flow

e269308

chore(material): drop generated microwave submission artifact

971e00f

feat(material): add nanocarbon absorber task

68dc7b8

docs(v2): align task index and env guidance

5b77500

ahydchh added 4 commits April 27, 2026 09:47

feat(v2): add first PR44 batch tasks

7d744e4

feat(v2): add second PR44 robotics batch

d1fc86f

chore(v2): align PR44 task docs and runtime wiring

45e86ca

fix(v2): enable unified DuckDB task execution

abc171f

ahydchh wants to merge 16 commits into main from fe_chh_v2
