time shuffle traj ablation expms by wlg1 · Pull Request #13 · withmartian/mi-cot

wlg1 · 2026-05-25T03:41:48Z

Generated description

Below is a concise technical summary of the changes proposed in this PR:
Implement a correct-trace ablation platform that grades Hugging Face SDS artifacts, caches correctness manifests/paired indices, and reruns CEBRA+SDS on all, within_correct, and paired_correct subsets to compare base vs RFT. Document and script the end-to-end workflows (single model, dataset sweep, full analysis, sweep runner) plus a complementary time-shuffle ablation that permutes sentence-ordered trajectories before SDS and reports pass/fail outcomes and visual gaps.

Topic Details

Time-shuffle ablation

Explain the time-shuffle ablation suite that permutes sentence-aligned trajectories via shuffle_sequence_lists, reruns CEBRA/SDS for real vs shuffled order, and orchestrates comparison/batch runners plus README documentation to validate whether SDS metrics collapse without temporal structure.

Modified files (7)

ablate_cot_trace_expms/README.md
ablate_cot_trace_expms/time_shuffle/README.md
ablate_cot_trace_expms/time_shuffle/__init__.py
ablate_cot_trace_expms/time_shuffle/run_compare_base_rft.py
ablate_cot_trace_expms/time_shuffle/run_remaining_batch.sh
ablate_cot_trace_expms/time_shuffle/run_time_shuffle.py
ablate_cot_trace_expms/time_shuffle/shuffle.py

Latest Contributors(0)

User	Commit	Date

Correct-trace platform

Describe the correct-trace stack: dataset/model path constants, grading rules, HF artifact downloads, manifest caching, subset selection, CEBRA/SDS subset execution, evaluation metrics, paired-index caching, README guidance, sweep summaries, and cleanup scripts to support the paired-correct pass/fail checks in manifest_cache/results.

Modified files (16)

ablate_cot_trace_expms/correct_only/README.md
ablate_cot_trace_expms/correct_only/__init__.py
ablate_cot_trace_expms/correct_only/config.py
ablate_cot_trace_expms/correct_only/evaluate.py
ablate_cot_trace_expms/correct_only/grading.py
ablate_cot_trace_expms/correct_only/hf_utils.py
ablate_cot_trace_expms/correct_only/manifest.py
ablate_cot_trace_expms/correct_only/manifest_cache/.gitignore
ablate_cot_trace_expms/correct_only/manifest_cache/.gitkeep
ablate_cot_trace_expms/correct_only/pipeline.py
ablate_cot_trace_expms/correct_only/run_compare_base_rft.py
ablate_cot_trace_expms/correct_only/run_full_analysis.py
ablate_cot_trace_expms/correct_only/run_remaining_sweep.sh
ablate_cot_trace_expms/correct_only/run_single_model.py
ablate_cot_trace_expms/correct_only/subsets.py
ablate_cot_trace_expms/correct_only/summarize_sweep.py

Latest Contributors(0)

User	Commit	Date

Other

Other files

Modified files (2)

ablate_cot_trace_expms/__init__.py
exploration/cebra_EM.py

Latest Contributors(0)

User	Commit	Date

_{Review this PR on Baz | Customize your next review}

baz-reviewer · 2026-05-25T03:55:15Z

+```bash
+python ablate_cot_length_expms/time_shuffle/run_time_shuffle.py \
+  --features-path /path/to/all_sentences_features.pkl \
+  --limit-problems 500 \
+  --shuffle-seeds 0 1 2 3 4 \
+  --out-dir ablate_cot_length_expms/time_shuffle/results


Both command blocks reference ablate_cot_length_expms/... instead of ablate_cot_trace_expms/..., so copy-pasting them fails — should we update run_time_shuffle.py and run_compare_base_rft.py paths (and their --out-dir examples) to match this README's location?

_{Finding types: Naming and Typos Breaking Changes | Severity: 🟢 Low}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

In ablate_cot_trace_expms/time_shuffle/README.md, replace all occurrences of the `ablate_cot_length_expms/...` prefix with `ablate_cot_trace_expms/...` in both command blocks: 1. Around lines 105-110: fix the `run_time_shuffle.py` entrypoint path. 2. Around lines 121-140: fix the `run_compare_base_rft.py` entrypoint path. Also verify that any `--out-dir` examples are updated consistently to point to `ablate_cot_trace_expms/time_shuffle/results/...` so all documented paths match the actual script locations.

baz-reviewer · 2026-05-25T03:55:16Z

+        if mode == "full":
+            perm = rng.permutation(n)
+        elif mode == "block":
+            blocks = [np.arange(i, min(i + block_size, n)) for i in range(0, n, block_size)]
+            block_order = rng.permutation(len(blocks))
+            perm = np.concatenate([blocks[i] for i in block_order])
+        else:


The mode == "block" / block_size validation logic is duplicated between shuffle_sequence_lists and permute_trajectory, and shuffle_sequence_lists is missing the block_size >= 1 guard — so --block-size 0 hits range(0, n, 0) and raises an indirect ValueError; should we add the guard here and factor the shared permutation logic into a helper?

_{Finding types: Code Dedup and Conventions Logical Bugs | Severity: 🟢 Low}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

In ablate_cot_trace_expms/time_shuffle/shuffle.py, inside `shuffle_sequence_lists()`'s `mode == "block"` branch (lines 95-97), add the same `block_size >= 1` validation used in `permute_trajectory()`, raising a clear `ValueError` before any `range(..., ..., block_size)` construction (keep the error message consistent with `permute_trajectory()`). Then consider extracting the shared block/full permutation logic into a reusable helper that both functions delegate to.

baz-reviewer · 2026-05-25T03:55:16Z

+    keys = ["persistence", "mean_self_transition", "specialization", "sss", "K_eff", "regime_r2", "bic"]
+    out = {"K": k, "order": "shuffled_mean", "n_seeds": len(sub)}
+    for key in keys:
+        vals = [r[key] for r in sub]
+        out[f"{key}_mean"] = float(np.mean(vals))
+        out[f"{key}_std"] = float(np.std(vals))
+    return out


_aggregate_shuffled() never writes delta_r2_mean, but run_compare_base_rft._metric_at_k() always reads it for shuffled delta_r2, so evaluate_pair() hits a missing-key failure; should we add delta_r2 to the aggregated keys list or write delta_r2_mean into comparison[*]['shuffled']?

_{Finding type: Type Inconsistency | Severity: 🔴 High}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In ablate_cot_trace_expms/time_shuffle/run_time_shuffle.py around lines 123-133, update the `_aggregate_shuffled(rows, k)` function so the aggregated result includes `delta_r2_mean` (and `delta_r2_std`) for the shuffled comparison. Right now `_aggregate_shuffled` only computes mean/std for `persistence`, `mean_self_transition`, `specialization`, `sss`, `K_eff`, `regime_r2`, and `bic`, but `execute_ablation` already stores `delta_r2` on each per-run row and downstream comparison code expects `comp['shuffled']['delta_r2_mean']`. Fix by adding `delta_r2` to the `keys` list in `_aggregate_shuffled`, ensuring it is included in the `out` dict as `delta_r2_mean` (and optionally `delta_r2_std`) for each K.

baz-reviewer · 2026-05-25T03:55:16Z

+    if cebra_cache and cebra_cache.exists():
+        log(f"[{model_tag}] Loading CEBRA cache from {cebra_cache}")
+        with open(cebra_cache, "rb") as f:
+            cache = pickle.load(f)
+        cebra_seqs = cache["cebra_seqs"]
+        pca_seqs = cache["pca_seqs"]
+        labels = cache["labels"]
+        ar_r2 = cache["ar_r2"]


--cebra-cache is caller-controlled and reused on cebra_cache.exists() alone — pickle.load() runs without validating provenance, features_path, limit_problems, or payload shape, so stale/malicious data can flow into shuffle_sequence_lists()/fit_and_evaluate(). Should we validate cache metadata against current args before deserializing, recompute on mismatch, and consider replacing pickle with npz+JSON to avoid arbitrary code execution on load?

_{Finding types: Basic Security Patterns Logical Bugs Type Inconsistency | Severity: 🔴 High}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In ablate_cot_trace_expms/time_shuffle/run_time_shuffle.py around lines 159-197 inside `execute_ablation()`: 1. **Security**: The code loads a user-supplied `--cebra-cache` path via `pickle.load()` as soon as it exists, which is an unsafe arbitrary-pickle deserialization path. Consider replacing pickle-based caching with a safe format (e.g., `np.save`/`npz` for arrays and JSON for metadata) so the cache loader never executes code during deserialization. If pickle must be kept, restrict loading to caches within an allowed/expected directory. 2. **Cache invalidation**: Refactor the cache-loading branch to validate that the stored metadata (at minimum `features_path`, `limit_problems`, and ideally `cebra_mode`) matches the current function arguments before reusing the cache. If validation fails (missing keys, wrong types, or mismatched values), log a clear message, recompute CEBRA, and overwrite the cache rather than silently using stale trajectories. 3. **Payload validation**: After loading (before the "Real temporal order" loop), verify that `cebra_seqs`, `pca_seqs`, and `labels` are present, are lists of the same length, and that each trajectory's embedding/label lengths are compatible with downstream consumers. Raise a clear error or force recomputation if any check fails. When saving the cache (around lines 182-196), include a script-specific identifier/version/hash alongside the metadata so it can be verified on reload.

baz-reviewer · 2026-05-25T03:55:16Z

+        log(f"[{model_tag}] Loading features: {features_path}")
+        all_f, triplets = load_and_prepare_cebra(
+            str(features_path),
+            mode=cebra_mode,
+            limit_problems=limit_problems,
+        )
+        log(f"[{model_tag}] Training CEBRA (real temporal order)...")
+        cebra_seqs, pca_seqs, labels = train_cebra_projection(all_f, triplets)


--features-path passes a caller-controlled file to load_and_prepare_cebra (which calls pickle.load) without path validation or schema checks — should we restrict to trusted filenames/directories and validate required keys like problem_id and hidden_state_last before unpickling?

_{Finding types: Basic Security Patterns Type Inconsistency | Severity: 🔴 High}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

In ablate_cot_trace_expms/time_shuffle/run_time_shuffle.py around lines 168-175 inside `execute_ablation`: 1. Validate `features_path` before passing to `load_and_prepare_cebra`: reject paths that don't match the expected filename (e.g., `all_sentences_features.pkl`) and/or aren't under a known trusted directory; consider requiring an explicit `--allow-unsafe-pickle` flag for untrusted inputs. 2. After loading (or before calling `load_and_prepare_cebra`), add a schema validation step that checks required top-level keys (`problem_id`, `hidden_state_last`, etc.) and basic shapes/types, raising a clear `ValueError` on mismatch. 3. Apply the same key validation to the cached path branch (`cebra_cache`) to ensure `cebra_seqs/pca_seqs/labels/ar_r2` are present before training. 4. In `exploration/cebra_EM.py`, consider switching the `load_and_prepare_cebra` pickle load to a safer format (npz/json) or adding integrity verification (e.g., SHA256 allowlist) before any unpickling.

baz-reviewer · 2026-05-25T03:55:16Z

+            if metric == "delta_r2":
+                passed = gap_real > gap_shuf + 0.01
+            else:
+                passed = gap_real >= min_gap_shrink_ratio * (gap_shuf + epsilon)


gap_real >= min_gap_shrink_ratio * (gap_shuf + epsilon) doesn't match the documented ordering check when gap_shuf < 0, so a still-negative gap_real can pass; should we switch this branch to gap_real > gap_shuf or normalize the sign first?

_{Finding type: Logical Bugs | Severity: 🔴 High}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In ablate_cot_trace_expms/time_shuffle/run_compare_base_rft.py around lines 103-123 inside `evaluate_pair`, fix the gap check that currently uses `gap_real >= min_gap_shrink_ratio * (gap_shuf + epsilon)` for `metric != "delta_r2"`. This ratio-based threshold is incorrect for negative `gap_shuf` because it can make the acceptance threshold more negative, allowing cases where `(RFT-Base)_real` is not actually greater than `(RFT-Base)_shuffled` to pass. Refactor the `else:` branch to implement the documented ordering check directly: require `gap_real > gap_shuf` (or `gap_real >= gap_shuf + epsilon` for numerical stability), and only use `min_gap_shrink_ratio` if you normalize it in a sign-safe way; otherwise remove it from this comparison. Make sure the `add_check` detail message remains accurate and that the strictness is consistent with the docstring (gap_real must be strictly larger unless you intentionally define inclusive behavior).

baz-reviewer · 2026-05-26T00:49:28Z

+        pids = resolve_subset_pids(
+            subset,
+            manifest=manifest,
+            limit_problems=limit,
+            base_manifest=base_manifest,
+            rft_manifest=rft_manifest,
+            reference_manifest=manifest,
+        )


_run_model_subsets() passes reference_manifest=manifest to resolve_subset_pids(), but the callee doesn't accept it, so this hits TypeError on the first subset — should we drop the extra kwarg or add it to the signature?

_{Finding type: Logical Bugs | Severity: 🔴 High}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In ablate_cot_trace_expms/correct_only/run_compare_base_rft.py around lines 78-85 inside `_run_model_subsets()`, remove the `reference_manifest=manifest` keyword argument from the call to `resolve_subset_pids()` (or update `_run_model_subsets` to only pass the parameters that `resolve_subset_pids` actually supports). Then verify the script runs through the subset loop without raising TypeError and that subset PID resolution still behaves as intended. If `reference_manifest` is truly required for paired/within_correct logic, instead add `reference_manifest` to `ablate_cot_trace_expms/correct_only/subsets.py::resolve_subset_pids()` signature and thread it through the internal logic, keeping defaults so existing calls remain compatible.

baz-reviewer · 2026-05-26T00:49:28Z

+from .grading import DATASET_CFG, _extract_answer_from_cot, is_correct_answer, load_benchmark_rows
+


manifest.py uses package-relative imports, but the documented entrypoints load it as a top-level module, so ImportError happens before CLI logic — should we switch to package imports / python -m ... execution, or make manifest.py import-safe as a script module?

_{Finding type: Logical Bugs | Severity: 🔴 High}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In ablate_cot_trace_expms/correct_only/manifest.py around lines 13-14 and 256 (imports from `.grading` and `.subsets`), the code uses package-relative imports that break when `manifest.py` is loaded as a top-level module (e.g., after adding `correct_only/` to `sys.path`). Refactor the imports to be import-safe by attempting the relative import first and falling back to absolute imports when `__package__` is empty/None (so top-level execution works). Apply the same strategy to the local import inside `get_or_build_paired_index` rather than leaving it as `from .subsets ...` so CLI entrypoints don’t crash before running.

baz-reviewer · 2026-05-26T00:49:28Z

+    cot_data_path = cot_data_path.resolve()
+    with open(cot_data_path, "rb") as f:
+        cot_data: dict = pickle.load(f)


Unsafe Deserialization (CWE-502): build_manifest() passes externally sourced cot_data.pkl (fetched via ensure_model_artifacts()) into pickle.load(f) — should we replace this with a non-executable format (JSON/MsgPack) or at minimum gate loading on a SHA-256 integrity check against a pinned digest?

_{Finding type: Basic Security Patterns | Severity: 🔴 High}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

In ablate_cot_trace_expms/correct_only/manifest.py around lines 108-110 inside `build_manifest()`, remove the unsafe `pickle.load(f)` of `cot_data.pkl` sourced from external HF downloads via `ensure_model_artifacts()`. Preferred fix: refactor `cot_data` to be stored/loaded in a non-code-executing format (e.g., JSON/MsgPack) and update all callers/writers accordingly. If pickle must be kept temporarily: (1) compute the file's SHA-256 and compare against a pinned digest from HF artifact metadata/revision before unpickling, aborting with a clear error on mismatch; (2) require the local file to already exist (no downloads during manifest construction); (3) optionally use a restricted unpickler/whitelist that only allows primitive containers/types to prevent arbitrary object construction.

baz-reviewer · 2026-05-26T00:49:28Z

+    if not cot.is_file():
+        cot = ensure_hf_file(repo, f"{prefix}/cot_data.pkl", local_dir=root)
+    if not features.is_file():
+        features = ensure_hf_file(repo, f"{prefix}/all_sentences_features.pkl", local_dir=root)


ensure_model_artifacts() now pulls all_sentences_features.pkl from Hugging Face during normal runs, and load_and_prepare_cebra() pickle.load(f)s it with no pin/checksum, so should we store this in a non-code format or verify the artifact before unpickling?

_{Finding type: Basic Security Patterns | Severity: 🔴 High}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In ablate_cot_trace_expms/correct_only/hf_utils.py around lines 24-44, in ensure_model_artifacts(), the function downloads all_sentences_features.pkl from a Hugging Face dataset repo when missing locally, but the downstream pipeline uses pickle.load() on it; this is unsafe because the artifact is not revision-pinned or integrity-verified. Refactor so artifacts are either (a) stored in a non-executable, non-pickle format (e.g., .npz/.hdf5/parquet) and loaded with a safe reader, or (b) if you must keep pickle, download a specific pinned revision and verify an expected sha256 checksum before returning/allowing it to be loaded. Concretely, change ensure_model_artifacts() to require a repo revision and expected hash (e.g., passed via paths/config), download that exact revision, compute/validate the hash of the local file, and fail closed if it doesn’t match; then update the artifact filename/path used by ensure_model_artifacts and the corresponding loader to match the safe format.

baz-reviewer · 2026-05-26T00:49:28Z

+        gap_all = metric_at_k(rft_summaries["all"], k_focus, metric) - metric_at_k(
+            base_summaries["all"], k_focus, metric
+        )
+        gap_paired = metric_at_k(rft_summaries["paired_correct"], k_focus, metric) - metric_at_k(
+            base_summaries["paired_correct"], k_focus, metric


gap_all / gap_paired subtract metric_at_k() results unconditionally; should we mirror the earlier None check before the gap arithmetic, otherwise missing K can raise TypeError and abort evaluation?

_{Finding type: Logical Bugs | Severity: 🟢 Low}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In ablate_cot_trace_expms/correct_only/evaluate.py around lines 48-67 inside evaluate_compare_summaries’ gap-retention logic for metrics (“persistence”, “delta_r2”), mirror the defensive handling used above: after computing gap_all and gap_paired, first check whether metric_at_k(...) for both base and rft returned None (e.g., missing K=k_focus) and in that case add a failed check entry (include subset/all vs paired_correct and the missing K) and skip the subtraction/arithmetic. Then only compute gaps and the min_paired_gap_fraction comparison when all required values are non-None, so evaluation never raises TypeError when K is absent.

baz-reviewer · 2026-05-26T00:49:28Z

+MODEL_PAIRS: dict[str, tuple[str, str]] = {
+    "llama8": ("llama_8B_base", "llama_8b_reasoning"),
+    "qwen14": ("qwen_14B_base", "qwen_14b_reasoning"),
+    "qwen1.5": ("qwen_1.5B_base", "qwen1.5b_reasoning"),


MODEL_PAIRS hardcodes one (base_folder, rft_folder) per family, but the HF artifact folders vary by dataset, so resolve_paths()/ensure_model_artifacts() can request nonexistent filenames for some dataset×family combos — should we make the mapping dataset-aware or add per-dataset overrides?

_{Finding type: Breaking Changes | Severity: 🔴 High}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In ablate_cot_trace_expms/correct_only/config.py around lines 17-21 where `MODEL_PAIRS` is defined and in `resolve_paths()` around lines 60-82, the code assumes a single (base_folder, rft_folder) per `model_family`, but HF artifact folder names vary by `dataset_key`, causing nonexistent `<folder>/layer_<n>/cot_data.pkl` and `all_sentences_features.pkl` requests. Refactor by making the folder mapping dataset-aware (e.g., change `MODEL_PAIRS` to a nested mapping keyed by dataset_key then model_family, or introduce a per-dataset override dict) and update `resolve_paths()` to derive `folder` from both `dataset_key` and `model_family` for the requested `variant` ('base' vs 'rft'). Add explicit error messages when a dataset×family override is missing so failures are actionable.

baz-reviewer · 2026-05-26T00:49:28Z

+    cache = None
+    if cebra_cache and cebra_cache.exists():
+        log(f"[{model_tag}] Loading CEBRA cache from {cebra_cache}")
+        with open(cebra_cache, "rb") as f:
+            loaded = pickle.load(f)


execute_correct_subset duplicates time_shuffle/run_time_shuffle.py::execute_ablation's cache+CEBRA training+SDS-summary loop; should we reuse execute_ablation or extract a shared helper so cache keys, logging, and summary format stay in sync?

_{Finding type: Code Dedup and Conventions | Severity: 🟢 Low}

Want Baz to fix this for you? Activate Fixer

baz-reviewer · 2026-05-26T02:54:48Z

+                    "gap_persist": metric_check("paired_correct", "persistence")
+                    and checks.get("paired_retains_gap_persistence", {}).get("passed"),


gap_persist doesn’t match the other PASS/FAIL/None summary columns — should we map paired_retains_gap_persistence to a status label instead of writing the raw boolean?

_{Finding type: Logical Bugs | Severity: 🟢 Low}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In ablate_cot_trace_expms/correct_only/summarize_sweep.py around lines 68-69 inside `main()`, fix the `gap_persist` field construction in the `rows.append({...})` dict. Right now `gap_persist` is computed with `metric_check(...) and checks.get(...).get('passed')`, but since `metric_check()` returns the strings 'PASS'/'FAIL', Python’s `and` returns the second operand (a boolean) instead of a status label, so the CSV mixes `str|None` and `bool|None`. Refactor `gap_persist` to follow the exact same PASS/FAIL/None contract as `paired_*` by either calling `metric_check` with the correct check key for the gap persistence metric, or explicitly mapping the boolean `passed` value to 'PASS'/'FAIL' (and returning None when the check is missing).

baz-reviewer · 2026-05-26T02:54:48Z

+RUN_NAME="${RUN_NAME:-middle_layer_sweep}"
+HF_CACHE="${HF_CACHE:-$MI_COT_ROOT/hf_cache}"
+RESULTS="$CORRECT_ONLY/results/compare/$RUN_NAME"


RESULTS and HF_CACHE come from env vars and later feed find ... -delete / rm -rf; should we resolve both paths and assert they stay under the intended workspace before cleanup?

_{Finding type: Basic Security Patterns | Severity: 🔴 High}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In ablate_cot_trace_expms/correct_only/run_remaining_sweep.sh around lines 9-11 (definitions of RUN_NAME, HF_CACHE, and RESULTS) and the cleanup logic at lines 35 and 64-69, harden the script against environment-variable path traversal. Resolve RUN_NAME and HF_CACHE to safe, absolute paths (use basename/sanity checks to reject any RUN_NAME containing “/” or “..”, and realpath/readlink to compute canonical absolute paths), then assert that RESULTS is under “$CORRECT_ONLY/results/compare” and that HF_CACHE is under “$MI_COT_ROOT/hf_cache” (or the repo’s intended cache root). If either check fails, print an error and exit before any find -delete or rm -rf is executed, so cleanup never targets outside the workspace.

baz-reviewer · 2026-05-26T02:54:48Z

+  # HF activation pickles for this combo's folders
+  local -a folders=()
+  case "$dataset" in
+    math500|svamp)
+      case "$family" in
+        qwen14) folders=(Qwen_14B_base Qwen_14B_reasoning) ;;
+        qwen1.5) folders=(Qwen_1_5B_base Qwen_1_5B_reasoning) ;;
+        llama8) folders=(Llama_8B_base Llama_8B_reasoning) ;;
+      esac
+      ;;
+    gsm8k)
+      case "$family" in


Lines 37-65 duplicate the dataset/model↔folder mapping from MODEL_PAIRS_BY_REPO in config.py; can we derive the folder list from there instead of hardcoding the same cases twice?

_{Finding type: Code Dedup and Conventions | Severity: 🟢 Low}

Want Baz to fix this for you? Activate Fixer

baz-reviewer · 2026-05-26T02:54:48Z

+            log(
+                f"[{model_tag}] SKIP subset {subset_name}: "
+                f"{len(all_f)} sentence features < PCA_DIM={PCA_DIM}"
+            )
+            return {


The len(all_f) < PCA_DIM skip path returns a different summary and bypasses out_dir/summary.json, so skipped subsets miss per_run and metric_at_k() can crash on `summary[

_{Finding type: Breaking Changes | Severity: 🔴 High}

Want Baz to fix this for you? Activate Fixer

Other fix methods

Prompt for AI Agents

Before applying, verify this suggestion against the current code. In ablate_cot_trace_expms/correct_only/pipeline.py around lines 90-94 (the `len(all_f) < PCA_DIM` early-return inside the subset processing logic), the skip path returns a non-standard summary that omits the `per_run` field and then exits before the normal `out_dir/summary.json` write. Refactor this branch so it produces a summary that matches the same schema as the non-skipped path (at minimum include `per_run: []`, and any other required keys like `subset`/`skipped`/`reason`), and ensure the code still writes `out_dir/summary.json` for skipped subsets rather than returning early before artifact creation. After the change, `metric_at_k()` and `evaluate_compare_summaries()` should be able to read `summary['per_run']` safely without crashing.

wlg1 added 2 commits May 24, 2026 20:41

time shuffle traj ablation expms

5b22180

time shuffle traj ablation expms

308e4c5

baz-reviewer Bot reviewed May 25, 2026

View reviewed changes

correct-only expms

56cbd16

baz-reviewer Bot reviewed May 26, 2026

View reviewed changes

correct-only expms update

6753aed

baz-reviewer Bot reviewed May 26, 2026

View reviewed changes

		from .grading import DATASET_CFG, _extract_answer_from_cot, is_correct_answer, load_benchmark_rows

		"gap_persist": metric_check("paired_correct", "persistence")
		and checks.get("paired_retains_gap_persistence", {}).get("passed"),

Conversation

wlg1 commented May 25, 2026 • edited by baz-reviewer Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Generated description

Uh oh!

baz-reviewer Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

baz-reviewer Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

wlg1 commented May 25, 2026 •

edited by baz-reviewer Bot

Loading