Skip to content

time shuffle traj ablation expms#13

Open
wlg1 wants to merge 4 commits into
withmartian:mainfrom
wlg1:mike/ablate_cot_length
Open

time shuffle traj ablation expms#13
wlg1 wants to merge 4 commits into
withmartian:mainfrom
wlg1:mike/ablate_cot_length

Conversation

@wlg1
Copy link
Copy Markdown
Contributor

@wlg1 wlg1 commented May 25, 2026

Generated description

Below is a concise technical summary of the changes proposed in this PR:
Implement a correct-trace ablation platform that grades Hugging Face SDS artifacts, caches correctness manifests/paired indices, and reruns CEBRA+SDS on all, within_correct, and paired_correct subsets to compare base vs RFT. Document and script the end-to-end workflows (single model, dataset sweep, full analysis, sweep runner) plus a complementary time-shuffle ablation that permutes sentence-ordered trajectories before SDS and reports pass/fail outcomes and visual gaps.

TopicDetails
Time-shuffle ablation Explain the time-shuffle ablation suite that permutes sentence-aligned trajectories via shuffle_sequence_lists, reruns CEBRA/SDS for real vs shuffled order, and orchestrates comparison/batch runners plus README documentation to validate whether SDS metrics collapse without temporal structure.
Modified files (7)
  • ablate_cot_trace_expms/README.md
  • ablate_cot_trace_expms/time_shuffle/README.md
  • ablate_cot_trace_expms/time_shuffle/__init__.py
  • ablate_cot_trace_expms/time_shuffle/run_compare_base_rft.py
  • ablate_cot_trace_expms/time_shuffle/run_remaining_batch.sh
  • ablate_cot_trace_expms/time_shuffle/run_time_shuffle.py
  • ablate_cot_trace_expms/time_shuffle/shuffle.py
Latest Contributors(0)
UserCommitDate
Correct-trace platform Describe the correct-trace stack: dataset/model path constants, grading rules, HF artifact downloads, manifest caching, subset selection, CEBRA/SDS subset execution, evaluation metrics, paired-index caching, README guidance, sweep summaries, and cleanup scripts to support the paired-correct pass/fail checks in manifest_cache/results.
Modified files (16)
  • ablate_cot_trace_expms/correct_only/README.md
  • ablate_cot_trace_expms/correct_only/__init__.py
  • ablate_cot_trace_expms/correct_only/config.py
  • ablate_cot_trace_expms/correct_only/evaluate.py
  • ablate_cot_trace_expms/correct_only/grading.py
  • ablate_cot_trace_expms/correct_only/hf_utils.py
  • ablate_cot_trace_expms/correct_only/manifest.py
  • ablate_cot_trace_expms/correct_only/manifest_cache/.gitignore
  • ablate_cot_trace_expms/correct_only/manifest_cache/.gitkeep
  • ablate_cot_trace_expms/correct_only/pipeline.py
  • ablate_cot_trace_expms/correct_only/run_compare_base_rft.py
  • ablate_cot_trace_expms/correct_only/run_full_analysis.py
  • ablate_cot_trace_expms/correct_only/run_remaining_sweep.sh
  • ablate_cot_trace_expms/correct_only/run_single_model.py
  • ablate_cot_trace_expms/correct_only/subsets.py
  • ablate_cot_trace_expms/correct_only/summarize_sweep.py
Latest Contributors(0)
UserCommitDate
Other Other files
Modified files (2)
  • ablate_cot_trace_expms/__init__.py
  • exploration/cebra_EM.py
Latest Contributors(0)
UserCommitDate
Review this PR on Baz | Customize your next review

Comment on lines +105 to +110
```bash
python ablate_cot_length_expms/time_shuffle/run_time_shuffle.py \
--features-path /path/to/all_sentences_features.pkl \
--limit-problems 500 \
--shuffle-seeds 0 1 2 3 4 \
--out-dir ablate_cot_length_expms/time_shuffle/results
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both command blocks reference ablate_cot_length_expms/... instead of ablate_cot_trace_expms/..., so copy-pasting them fails — should we update run_time_shuffle.py and run_compare_base_rft.py paths (and their --out-dir examples) to match this README's location?

Finding types: Naming and Typos Breaking Changes | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
In ablate_cot_trace_expms/time_shuffle/README.md, replace all occurrences of the
`ablate_cot_length_expms/...` prefix with `ablate_cot_trace_expms/...` in both command
blocks: 1. Around lines 105-110: fix the `run_time_shuffle.py` entrypoint path. 2.
Around lines 121-140: fix the `run_compare_base_rft.py` entrypoint path. Also verify
that any `--out-dir` examples are updated consistently to point to
`ablate_cot_trace_expms/time_shuffle/results/...` so all documented paths match the
actual script locations.

Comment on lines +93 to +99
if mode == "full":
perm = rng.permutation(n)
elif mode == "block":
blocks = [np.arange(i, min(i + block_size, n)) for i in range(0, n, block_size)]
block_order = rng.permutation(len(blocks))
perm = np.concatenate([blocks[i] for i in block_order])
else:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mode == "block" / block_size validation logic is duplicated between shuffle_sequence_lists and permute_trajectory, and shuffle_sequence_lists is missing the block_size >= 1 guard — so --block-size 0 hits range(0, n, 0) and raises an indirect ValueError; should we add the guard here and factor the shared permutation logic into a helper?

Finding types: Code Dedup and Conventions Logical Bugs | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
In ablate_cot_trace_expms/time_shuffle/shuffle.py, inside `shuffle_sequence_lists()`'s
`mode == "block"` branch (lines 95-97), add the same `block_size >= 1` validation used
in `permute_trajectory()`, raising a clear `ValueError` before any `range(..., ...,
block_size)` construction (keep the error message consistent with
`permute_trajectory()`). Then consider extracting the shared block/full permutation
logic into a reusable helper that both functions delegate to.

Comment on lines +127 to +133
keys = ["persistence", "mean_self_transition", "specialization", "sss", "K_eff", "regime_r2", "bic"]
out = {"K": k, "order": "shuffled_mean", "n_seeds": len(sub)}
for key in keys:
vals = [r[key] for r in sub]
out[f"{key}_mean"] = float(np.mean(vals))
out[f"{key}_std"] = float(np.std(vals))
return out
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_aggregate_shuffled() never writes delta_r2_mean, but run_compare_base_rft._metric_at_k() always reads it for shuffled delta_r2, so evaluate_pair() hits a missing-key failure; should we add delta_r2 to the aggregated keys list or write delta_r2_mean into comparison[*]['shuffled']?

Finding type: Type Inconsistency | Severity: 🔴 High


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/time_shuffle/run_time_shuffle.py around lines 123-133, update the
`_aggregate_shuffled(rows, k)` function so the aggregated result includes
`delta_r2_mean` (and `delta_r2_std`) for the shuffled comparison. Right now
`_aggregate_shuffled` only computes mean/std for `persistence`, `mean_self_transition`,
`specialization`, `sss`, `K_eff`, `regime_r2`, and `bic`, but `execute_ablation` already
stores `delta_r2` on each per-run row and downstream comparison code expects
`comp['shuffled']['delta_r2_mean']`. Fix by adding `delta_r2` to the `keys` list in
`_aggregate_shuffled`, ensuring it is included in the `out` dict as `delta_r2_mean` (and
optionally `delta_r2_std`) for each K.

Comment on lines +159 to +166
if cebra_cache and cebra_cache.exists():
log(f"[{model_tag}] Loading CEBRA cache from {cebra_cache}")
with open(cebra_cache, "rb") as f:
cache = pickle.load(f)
cebra_seqs = cache["cebra_seqs"]
pca_seqs = cache["pca_seqs"]
labels = cache["labels"]
ar_r2 = cache["ar_r2"]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--cebra-cache is caller-controlled and reused on cebra_cache.exists() alone — pickle.load() runs without validating provenance, features_path, limit_problems, or payload shape, so stale/malicious data can flow into shuffle_sequence_lists()/fit_and_evaluate(). Should we validate cache metadata against current args before deserializing, recompute on mismatch, and consider replacing pickle with npz+JSON to avoid arbitrary code execution on load?

Finding types: Basic Security Patterns Logical Bugs Type Inconsistency | Severity: 🔴 High


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/time_shuffle/run_time_shuffle.py around lines 159-197 inside
`execute_ablation()`: 1. **Security**: The code loads a user-supplied `--cebra-cache`
path via `pickle.load()` as soon as it exists, which is an unsafe arbitrary-pickle
deserialization path. Consider replacing pickle-based caching with a safe format (e.g.,
`np.save`/`npz` for arrays and JSON for metadata) so the cache loader never executes
code during deserialization. If pickle must be kept, restrict loading to caches within
an allowed/expected directory. 2. **Cache invalidation**: Refactor the cache-loading
branch to validate that the stored metadata (at minimum `features_path`,
`limit_problems`, and ideally `cebra_mode`) matches the current function arguments
before reusing the cache. If validation fails (missing keys, wrong types, or mismatched
values), log a clear message, recompute CEBRA, and overwrite the cache rather than
silently using stale trajectories. 3. **Payload validation**: After loading (before the
"Real temporal order" loop), verify that `cebra_seqs`, `pca_seqs`, and `labels` are
present, are lists of the same length, and that each trajectory's embedding/label
lengths are compatible with downstream consumers. Raise a clear error or force
recomputation if any check fails. When saving the cache (around lines 182-196), include
a script-specific identifier/version/hash alongside the metadata so it can be verified
on reload.

Comment on lines +168 to +175
log(f"[{model_tag}] Loading features: {features_path}")
all_f, triplets = load_and_prepare_cebra(
str(features_path),
mode=cebra_mode,
limit_problems=limit_problems,
)
log(f"[{model_tag}] Training CEBRA (real temporal order)...")
cebra_seqs, pca_seqs, labels = train_cebra_projection(all_f, triplets)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--features-path passes a caller-controlled file to load_and_prepare_cebra (which calls pickle.load) without path validation or schema checks — should we restrict to trusted filenames/directories and validate required keys like problem_id and hidden_state_last before unpickling?

Finding types: Basic Security Patterns Type Inconsistency | Severity: 🔴 High


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
In ablate_cot_trace_expms/time_shuffle/run_time_shuffle.py around lines 168-175 inside
`execute_ablation`: 1. Validate `features_path` before passing to
`load_and_prepare_cebra`: reject paths that don't match the expected filename (e.g.,
`all_sentences_features.pkl`) and/or aren't under a known trusted directory; consider
requiring an explicit `--allow-unsafe-pickle` flag for untrusted inputs. 2. After
loading (or before calling `load_and_prepare_cebra`), add a schema validation step that
checks required top-level keys (`problem_id`, `hidden_state_last`, etc.) and basic
shapes/types, raising a clear `ValueError` on mismatch. 3. Apply the same key validation
to the cached path branch (`cebra_cache`) to ensure `cebra_seqs/pca_seqs/labels/ar_r2`
are present before training. 4. In `exploration/cebra_EM.py`, consider switching the
`load_and_prepare_cebra` pickle load to a safer format (npz/json) or adding integrity
verification (e.g., SHA256 allowlist) before any unpickling.

Comment on lines +113 to +116
if metric == "delta_r2":
passed = gap_real > gap_shuf + 0.01
else:
passed = gap_real >= min_gap_shrink_ratio * (gap_shuf + epsilon)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gap_real >= min_gap_shrink_ratio * (gap_shuf + epsilon) doesn't match the documented ordering check when gap_shuf < 0, so a still-negative gap_real can pass; should we switch this branch to gap_real > gap_shuf or normalize the sign first?

Finding type: Logical Bugs | Severity: 🔴 High


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/time_shuffle/run_compare_base_rft.py around lines 103-123 inside
`evaluate_pair`, fix the gap check that currently uses `gap_real >= min_gap_shrink_ratio
* (gap_shuf + epsilon)` for `metric != "delta_r2"`. This ratio-based threshold is
incorrect for negative `gap_shuf` because it can make the acceptance threshold more
negative, allowing cases where `(RFT-Base)_real` is not actually greater than
`(RFT-Base)_shuffled` to pass. Refactor the `else:` branch to implement the documented
ordering check directly: require `gap_real > gap_shuf` (or `gap_real >= gap_shuf +
epsilon` for numerical stability), and only use `min_gap_shrink_ratio` if you normalize
it in a sign-safe way; otherwise remove it from this comparison. Make sure the
`add_check` detail message remains accurate and that the strictness is consistent with
the docstring (gap_real must be strictly larger unless you intentionally define
inclusive behavior).

Comment on lines +78 to +85
pids = resolve_subset_pids(
subset,
manifest=manifest,
limit_problems=limit,
base_manifest=base_manifest,
rft_manifest=rft_manifest,
reference_manifest=manifest,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_run_model_subsets() passes reference_manifest=manifest to resolve_subset_pids(), but the callee doesn't accept it, so this hits TypeError on the first subset — should we drop the extra kwarg or add it to the signature?

Finding type: Logical Bugs | Severity: 🔴 High


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/correct_only/run_compare_base_rft.py around lines 78-85 inside
`_run_model_subsets()`, remove the `reference_manifest=manifest` keyword argument from
the call to `resolve_subset_pids()` (or update `_run_model_subsets` to only pass the
parameters that `resolve_subset_pids` actually supports). Then verify the script runs
through the subset loop without raising TypeError and that subset PID resolution still
behaves as intended. If `reference_manifest` is truly required for paired/within_correct
logic, instead add `reference_manifest` to
`ablate_cot_trace_expms/correct_only/subsets.py::resolve_subset_pids()` signature and
thread it through the internal logic, keeping defaults so existing calls remain
compatible.

Comment on lines +13 to +14
from .grading import DATASET_CFG, _extract_answer_from_cot, is_correct_answer, load_benchmark_rows

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

manifest.py uses package-relative imports, but the documented entrypoints load it as a top-level module, so ImportError happens before CLI logic — should we switch to package imports / python -m ... execution, or make manifest.py import-safe as a script module?

Finding type: Logical Bugs | Severity: 🔴 High


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/correct_only/manifest.py around lines 13-14 and 256 (imports from
`.grading` and `.subsets`), the code uses package-relative imports that break when
`manifest.py` is loaded as a top-level module (e.g., after adding `correct_only/` to
`sys.path`). Refactor the imports to be import-safe by attempting the relative import
first and falling back to absolute imports when `__package__` is empty/None (so
top-level execution works). Apply the same strategy to the local import inside
`get_or_build_paired_index` rather than leaving it as `from .subsets ...` so CLI
entrypoints don’t crash before running.

Comment on lines +108 to +110
cot_data_path = cot_data_path.resolve()
with open(cot_data_path, "rb") as f:
cot_data: dict = pickle.load(f)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsafe Deserialization (CWE-502): build_manifest() passes externally sourced cot_data.pkl (fetched via ensure_model_artifacts()) into pickle.load(f) — should we replace this with a non-executable format (JSON/MsgPack) or at minimum gate loading on a SHA-256 integrity check against a pinned digest?

Finding type: Basic Security Patterns | Severity: 🔴 High


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
In ablate_cot_trace_expms/correct_only/manifest.py around lines 108-110 inside
`build_manifest()`, remove the unsafe `pickle.load(f)` of `cot_data.pkl` sourced from
external HF downloads via `ensure_model_artifacts()`. Preferred fix: refactor `cot_data`
to be stored/loaded in a non-code-executing format (e.g., JSON/MsgPack) and update all
callers/writers accordingly. If pickle must be kept temporarily: (1) compute the file's
SHA-256 and compare against a pinned digest from HF artifact metadata/revision before
unpickling, aborting with a clear error on mismatch; (2) require the local file to
already exist (no downloads during manifest construction); (3) optionally use a
restricted unpickler/whitelist that only allows primitive containers/types to prevent
arbitrary object construction.

Comment on lines +40 to +43
if not cot.is_file():
cot = ensure_hf_file(repo, f"{prefix}/cot_data.pkl", local_dir=root)
if not features.is_file():
features = ensure_hf_file(repo, f"{prefix}/all_sentences_features.pkl", local_dir=root)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ensure_model_artifacts() now pulls all_sentences_features.pkl from Hugging Face during normal runs, and load_and_prepare_cebra() pickle.load(f)s it with no pin/checksum, so should we store this in a non-code format or verify the artifact before unpickling?

Finding type: Basic Security Patterns | Severity: 🔴 High


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/correct_only/hf_utils.py around lines 24-44, in
ensure_model_artifacts(), the function downloads all_sentences_features.pkl from a
Hugging Face dataset repo when missing locally, but the downstream pipeline uses
pickle.load() on it; this is unsafe because the artifact is not revision-pinned or
integrity-verified. Refactor so artifacts are either (a) stored in a non-executable,
non-pickle format (e.g., .npz/.hdf5/parquet) and loaded with a safe reader, or (b) if
you must keep pickle, download a specific pinned revision and verify an expected sha256
checksum before returning/allowing it to be loaded. Concretely, change
ensure_model_artifacts() to require a repo revision and expected hash (e.g., passed via
paths/config), download that exact revision, compute/validate the hash of the local
file, and fail closed if it doesn’t match; then update the artifact filename/path used
by ensure_model_artifacts and the corresponding loader to match the safe format.

Comment on lines +48 to +52
gap_all = metric_at_k(rft_summaries["all"], k_focus, metric) - metric_at_k(
base_summaries["all"], k_focus, metric
)
gap_paired = metric_at_k(rft_summaries["paired_correct"], k_focus, metric) - metric_at_k(
base_summaries["paired_correct"], k_focus, metric
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gap_all / gap_paired subtract metric_at_k() results unconditionally; should we mirror the earlier None check before the gap arithmetic, otherwise missing K can raise TypeError and abort evaluation?

Finding type: Logical Bugs | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/correct_only/evaluate.py around lines 48-67 inside
evaluate_compare_summaries’ gap-retention logic for metrics (“persistence”,
“delta_r2”), mirror the defensive handling used above: after computing gap_all and
gap_paired, first check whether metric_at_k(...) for both base and rft returned None
(e.g., missing K=k_focus) and in that case add a failed check entry (include subset/all
vs paired_correct and the missing K) and skip the subtraction/arithmetic. Then only
compute gaps and the min_paired_gap_fraction comparison when all required values are
non-None, so evaluation never raises TypeError when K is absent.

Comment on lines +17 to +20
MODEL_PAIRS: dict[str, tuple[str, str]] = {
"llama8": ("llama_8B_base", "llama_8b_reasoning"),
"qwen14": ("qwen_14B_base", "qwen_14b_reasoning"),
"qwen1.5": ("qwen_1.5B_base", "qwen1.5b_reasoning"),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MODEL_PAIRS hardcodes one (base_folder, rft_folder) per family, but the HF artifact folders vary by dataset, so resolve_paths()/ensure_model_artifacts() can request nonexistent filenames for some dataset×family combos — should we make the mapping dataset-aware or add per-dataset overrides?

Finding type: Breaking Changes | Severity: 🔴 High


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/correct_only/config.py around lines 17-21 where `MODEL_PAIRS` is
defined and in `resolve_paths()` around lines 60-82, the code assumes a single
(base_folder, rft_folder) per `model_family`, but HF artifact folder names vary by
`dataset_key`, causing nonexistent `<folder>/layer_<n>/cot_data.pkl` and
`all_sentences_features.pkl` requests. Refactor by making the folder mapping
dataset-aware (e.g., change `MODEL_PAIRS` to a nested mapping keyed by dataset_key then
model_family, or introduce a per-dataset override dict) and update `resolve_paths()` to
derive `folder` from both `dataset_key` and `model_family` for the requested `variant`
('base' vs 'rft'). Add explicit error messages when a dataset×family override is
missing so failures are actionable.

Comment on lines +64 to +68
cache = None
if cebra_cache and cebra_cache.exists():
log(f"[{model_tag}] Loading CEBRA cache from {cebra_cache}")
with open(cebra_cache, "rb") as f:
loaded = pickle.load(f)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

execute_correct_subset duplicates time_shuffle/run_time_shuffle.py::execute_ablation's cache+CEBRA training+SDS-summary loop; should we reuse execute_ablation or extract a shared helper so cache keys, logging, and summary format stay in sync?

Finding type: Code Dedup and Conventions | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

Comment on lines +68 to +69
"gap_persist": metric_check("paired_correct", "persistence")
and checks.get("paired_retains_gap_persistence", {}).get("passed"),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gap_persist doesn’t match the other PASS/FAIL/None summary columns — should we map paired_retains_gap_persistence to a status label instead of writing the raw boolean?

Finding type: Logical Bugs | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/correct_only/summarize_sweep.py around lines 68-69 inside
`main()`, fix the `gap_persist` field construction in the `rows.append({...})` dict.
Right now `gap_persist` is computed with `metric_check(...) and
checks.get(...).get('passed')`, but since `metric_check()` returns the strings
'PASS'/'FAIL', Python’s `and` returns the second operand (a boolean) instead of a
status label, so the CSV mixes `str|None` and `bool|None`. Refactor `gap_persist` to
follow the exact same PASS/FAIL/None contract as `paired_*` by either calling
`metric_check` with the correct check key for the gap persistence metric, or explicitly
mapping the boolean `passed` value to 'PASS'/'FAIL' (and returning None when the check
is missing).

Comment on lines +9 to +11
RUN_NAME="${RUN_NAME:-middle_layer_sweep}"
HF_CACHE="${HF_CACHE:-$MI_COT_ROOT/hf_cache}"
RESULTS="$CORRECT_ONLY/results/compare/$RUN_NAME"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RESULTS and HF_CACHE come from env vars and later feed find ... -delete / rm -rf; should we resolve both paths and assert they stay under the intended workspace before cleanup?

Finding type: Basic Security Patterns | Severity: 🔴 High


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/correct_only/run_remaining_sweep.sh around lines 9-11
(definitions of RUN_NAME, HF_CACHE, and RESULTS) and the cleanup logic at lines 35 and
64-69, harden the script against environment-variable path traversal. Resolve RUN_NAME
and HF_CACHE to safe, absolute paths (use basename/sanity checks to reject any RUN_NAME
containing “/” or “..”, and realpath/readlink to compute canonical absolute
paths), then assert that RESULTS is under “$CORRECT_ONLY/results/compare” and that
HF_CACHE is under “$MI_COT_ROOT/hf_cache” (or the repo’s intended cache root). If
either check fails, print an error and exit before any find -delete or rm -rf is
executed, so cleanup never targets outside the workspace.

Comment on lines +37 to +48
# HF activation pickles for this combo's folders
local -a folders=()
case "$dataset" in
math500|svamp)
case "$family" in
qwen14) folders=(Qwen_14B_base Qwen_14B_reasoning) ;;
qwen1.5) folders=(Qwen_1_5B_base Qwen_1_5B_reasoning) ;;
llama8) folders=(Llama_8B_base Llama_8B_reasoning) ;;
esac
;;
gsm8k)
case "$family" in
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 37-65 duplicate the dataset/model↔folder mapping from MODEL_PAIRS_BY_REPO in config.py; can we derive the folder list from there instead of hardcoding the same cases twice?

Finding type: Code Dedup and Conventions | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

Comment on lines +90 to +94
log(
f"[{model_tag}] SKIP subset {subset_name}: "
f"{len(all_f)} sentence features < PCA_DIM={PCA_DIM}"
)
return {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The len(all_f) < PCA_DIM skip path returns a different summary and bypasses out_dir/summary.json, so skipped subsets miss per_run and metric_at_k() can crash on `summary[

Finding type: Breaking Changes | Severity: 🔴 High


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/correct_only/pipeline.py around lines 90-94 (the `len(all_f) <
PCA_DIM` early-return inside the subset processing logic), the skip path returns a
non-standard summary that omits the `per_run` field and then exits before the normal
`out_dir/summary.json` write. Refactor this branch so it produces a summary that matches
the same schema as the non-skipped path (at minimum include `per_run: []`, and any other
required keys like `subset`/`skipped`/`reason`), and ensure the code still writes
`out_dir/summary.json` for skipped subsets rather than returning early before artifact
creation. After the change, `metric_at_k()` and `evaluate_compare_summaries()` should be
able to read `summary['per_run']` safely without crashing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant