time shuffle traj ablation expms#13
Conversation
| ```bash | ||
| python ablate_cot_length_expms/time_shuffle/run_time_shuffle.py \ | ||
| --features-path /path/to/all_sentences_features.pkl \ | ||
| --limit-problems 500 \ | ||
| --shuffle-seeds 0 1 2 3 4 \ | ||
| --out-dir ablate_cot_length_expms/time_shuffle/results |
There was a problem hiding this comment.
Both command blocks reference ablate_cot_length_expms/... instead of ablate_cot_trace_expms/..., so copy-pasting them fails — should we update run_time_shuffle.py and run_compare_base_rft.py paths (and their --out-dir examples) to match this README's location?
Finding types: Naming and Typos Breaking Changes | Severity: 🟢 Low
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
In ablate_cot_trace_expms/time_shuffle/README.md, replace all occurrences of the
`ablate_cot_length_expms/...` prefix with `ablate_cot_trace_expms/...` in both command
blocks: 1. Around lines 105-110: fix the `run_time_shuffle.py` entrypoint path. 2.
Around lines 121-140: fix the `run_compare_base_rft.py` entrypoint path. Also verify
that any `--out-dir` examples are updated consistently to point to
`ablate_cot_trace_expms/time_shuffle/results/...` so all documented paths match the
actual script locations.
| if mode == "full": | ||
| perm = rng.permutation(n) | ||
| elif mode == "block": | ||
| blocks = [np.arange(i, min(i + block_size, n)) for i in range(0, n, block_size)] | ||
| block_order = rng.permutation(len(blocks)) | ||
| perm = np.concatenate([blocks[i] for i in block_order]) | ||
| else: |
There was a problem hiding this comment.
The mode == "block" / block_size validation logic is duplicated between shuffle_sequence_lists and permute_trajectory, and shuffle_sequence_lists is missing the block_size >= 1 guard — so --block-size 0 hits range(0, n, 0) and raises an indirect ValueError; should we add the guard here and factor the shared permutation logic into a helper?
Finding types: Code Dedup and Conventions Logical Bugs | Severity: 🟢 Low
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
In ablate_cot_trace_expms/time_shuffle/shuffle.py, inside `shuffle_sequence_lists()`'s
`mode == "block"` branch (lines 95-97), add the same `block_size >= 1` validation used
in `permute_trajectory()`, raising a clear `ValueError` before any `range(..., ...,
block_size)` construction (keep the error message consistent with
`permute_trajectory()`). Then consider extracting the shared block/full permutation
logic into a reusable helper that both functions delegate to.
| keys = ["persistence", "mean_self_transition", "specialization", "sss", "K_eff", "regime_r2", "bic"] | ||
| out = {"K": k, "order": "shuffled_mean", "n_seeds": len(sub)} | ||
| for key in keys: | ||
| vals = [r[key] for r in sub] | ||
| out[f"{key}_mean"] = float(np.mean(vals)) | ||
| out[f"{key}_std"] = float(np.std(vals)) | ||
| return out |
There was a problem hiding this comment.
_aggregate_shuffled() never writes delta_r2_mean, but run_compare_base_rft._metric_at_k() always reads it for shuffled delta_r2, so evaluate_pair() hits a missing-key failure; should we add delta_r2 to the aggregated keys list or write delta_r2_mean into comparison[*]['shuffled']?
Finding type: Type Inconsistency | Severity: 🔴 High
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/time_shuffle/run_time_shuffle.py around lines 123-133, update the
`_aggregate_shuffled(rows, k)` function so the aggregated result includes
`delta_r2_mean` (and `delta_r2_std`) for the shuffled comparison. Right now
`_aggregate_shuffled` only computes mean/std for `persistence`, `mean_self_transition`,
`specialization`, `sss`, `K_eff`, `regime_r2`, and `bic`, but `execute_ablation` already
stores `delta_r2` on each per-run row and downstream comparison code expects
`comp['shuffled']['delta_r2_mean']`. Fix by adding `delta_r2` to the `keys` list in
`_aggregate_shuffled`, ensuring it is included in the `out` dict as `delta_r2_mean` (and
optionally `delta_r2_std`) for each K.
| if cebra_cache and cebra_cache.exists(): | ||
| log(f"[{model_tag}] Loading CEBRA cache from {cebra_cache}") | ||
| with open(cebra_cache, "rb") as f: | ||
| cache = pickle.load(f) | ||
| cebra_seqs = cache["cebra_seqs"] | ||
| pca_seqs = cache["pca_seqs"] | ||
| labels = cache["labels"] | ||
| ar_r2 = cache["ar_r2"] |
There was a problem hiding this comment.
--cebra-cache is caller-controlled and reused on cebra_cache.exists() alone — pickle.load() runs without validating provenance, features_path, limit_problems, or payload shape, so stale/malicious data can flow into shuffle_sequence_lists()/fit_and_evaluate(). Should we validate cache metadata against current args before deserializing, recompute on mismatch, and consider replacing pickle with npz+JSON to avoid arbitrary code execution on load?
Finding types: Basic Security Patterns Logical Bugs Type Inconsistency | Severity: 🔴 High
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/time_shuffle/run_time_shuffle.py around lines 159-197 inside
`execute_ablation()`: 1. **Security**: The code loads a user-supplied `--cebra-cache`
path via `pickle.load()` as soon as it exists, which is an unsafe arbitrary-pickle
deserialization path. Consider replacing pickle-based caching with a safe format (e.g.,
`np.save`/`npz` for arrays and JSON for metadata) so the cache loader never executes
code during deserialization. If pickle must be kept, restrict loading to caches within
an allowed/expected directory. 2. **Cache invalidation**: Refactor the cache-loading
branch to validate that the stored metadata (at minimum `features_path`,
`limit_problems`, and ideally `cebra_mode`) matches the current function arguments
before reusing the cache. If validation fails (missing keys, wrong types, or mismatched
values), log a clear message, recompute CEBRA, and overwrite the cache rather than
silently using stale trajectories. 3. **Payload validation**: After loading (before the
"Real temporal order" loop), verify that `cebra_seqs`, `pca_seqs`, and `labels` are
present, are lists of the same length, and that each trajectory's embedding/label
lengths are compatible with downstream consumers. Raise a clear error or force
recomputation if any check fails. When saving the cache (around lines 182-196), include
a script-specific identifier/version/hash alongside the metadata so it can be verified
on reload.
| log(f"[{model_tag}] Loading features: {features_path}") | ||
| all_f, triplets = load_and_prepare_cebra( | ||
| str(features_path), | ||
| mode=cebra_mode, | ||
| limit_problems=limit_problems, | ||
| ) | ||
| log(f"[{model_tag}] Training CEBRA (real temporal order)...") | ||
| cebra_seqs, pca_seqs, labels = train_cebra_projection(all_f, triplets) |
There was a problem hiding this comment.
--features-path passes a caller-controlled file to load_and_prepare_cebra (which calls pickle.load) without path validation or schema checks — should we restrict to trusted filenames/directories and validate required keys like problem_id and hidden_state_last before unpickling?
Finding types: Basic Security Patterns Type Inconsistency | Severity: 🔴 High
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
In ablate_cot_trace_expms/time_shuffle/run_time_shuffle.py around lines 168-175 inside
`execute_ablation`: 1. Validate `features_path` before passing to
`load_and_prepare_cebra`: reject paths that don't match the expected filename (e.g.,
`all_sentences_features.pkl`) and/or aren't under a known trusted directory; consider
requiring an explicit `--allow-unsafe-pickle` flag for untrusted inputs. 2. After
loading (or before calling `load_and_prepare_cebra`), add a schema validation step that
checks required top-level keys (`problem_id`, `hidden_state_last`, etc.) and basic
shapes/types, raising a clear `ValueError` on mismatch. 3. Apply the same key validation
to the cached path branch (`cebra_cache`) to ensure `cebra_seqs/pca_seqs/labels/ar_r2`
are present before training. 4. In `exploration/cebra_EM.py`, consider switching the
`load_and_prepare_cebra` pickle load to a safer format (npz/json) or adding integrity
verification (e.g., SHA256 allowlist) before any unpickling.
| if metric == "delta_r2": | ||
| passed = gap_real > gap_shuf + 0.01 | ||
| else: | ||
| passed = gap_real >= min_gap_shrink_ratio * (gap_shuf + epsilon) |
There was a problem hiding this comment.
gap_real >= min_gap_shrink_ratio * (gap_shuf + epsilon) doesn't match the documented ordering check when gap_shuf < 0, so a still-negative gap_real can pass; should we switch this branch to gap_real > gap_shuf or normalize the sign first?
Finding type: Logical Bugs | Severity: 🔴 High
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/time_shuffle/run_compare_base_rft.py around lines 103-123 inside
`evaluate_pair`, fix the gap check that currently uses `gap_real >= min_gap_shrink_ratio
* (gap_shuf + epsilon)` for `metric != "delta_r2"`. This ratio-based threshold is
incorrect for negative `gap_shuf` because it can make the acceptance threshold more
negative, allowing cases where `(RFT-Base)_real` is not actually greater than
`(RFT-Base)_shuffled` to pass. Refactor the `else:` branch to implement the documented
ordering check directly: require `gap_real > gap_shuf` (or `gap_real >= gap_shuf +
epsilon` for numerical stability), and only use `min_gap_shrink_ratio` if you normalize
it in a sign-safe way; otherwise remove it from this comparison. Make sure the
`add_check` detail message remains accurate and that the strictness is consistent with
the docstring (gap_real must be strictly larger unless you intentionally define
inclusive behavior).
| pids = resolve_subset_pids( | ||
| subset, | ||
| manifest=manifest, | ||
| limit_problems=limit, | ||
| base_manifest=base_manifest, | ||
| rft_manifest=rft_manifest, | ||
| reference_manifest=manifest, | ||
| ) |
There was a problem hiding this comment.
_run_model_subsets() passes reference_manifest=manifest to resolve_subset_pids(), but the callee doesn't accept it, so this hits TypeError on the first subset — should we drop the extra kwarg or add it to the signature?
Finding type: Logical Bugs | Severity: 🔴 High
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/correct_only/run_compare_base_rft.py around lines 78-85 inside
`_run_model_subsets()`, remove the `reference_manifest=manifest` keyword argument from
the call to `resolve_subset_pids()` (or update `_run_model_subsets` to only pass the
parameters that `resolve_subset_pids` actually supports). Then verify the script runs
through the subset loop without raising TypeError and that subset PID resolution still
behaves as intended. If `reference_manifest` is truly required for paired/within_correct
logic, instead add `reference_manifest` to
`ablate_cot_trace_expms/correct_only/subsets.py::resolve_subset_pids()` signature and
thread it through the internal logic, keeping defaults so existing calls remain
compatible.
| from .grading import DATASET_CFG, _extract_answer_from_cot, is_correct_answer, load_benchmark_rows | ||
|
|
There was a problem hiding this comment.
manifest.py uses package-relative imports, but the documented entrypoints load it as a top-level module, so ImportError happens before CLI logic — should we switch to package imports / python -m ... execution, or make manifest.py import-safe as a script module?
Finding type: Logical Bugs | Severity: 🔴 High
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/correct_only/manifest.py around lines 13-14 and 256 (imports from
`.grading` and `.subsets`), the code uses package-relative imports that break when
`manifest.py` is loaded as a top-level module (e.g., after adding `correct_only/` to
`sys.path`). Refactor the imports to be import-safe by attempting the relative import
first and falling back to absolute imports when `__package__` is empty/None (so
top-level execution works). Apply the same strategy to the local import inside
`get_or_build_paired_index` rather than leaving it as `from .subsets ...` so CLI
entrypoints don’t crash before running.
| cot_data_path = cot_data_path.resolve() | ||
| with open(cot_data_path, "rb") as f: | ||
| cot_data: dict = pickle.load(f) |
There was a problem hiding this comment.
Unsafe Deserialization (CWE-502): build_manifest() passes externally sourced cot_data.pkl (fetched via ensure_model_artifacts()) into pickle.load(f) — should we replace this with a non-executable format (JSON/MsgPack) or at minimum gate loading on a SHA-256 integrity check against a pinned digest?
Finding type: Basic Security Patterns | Severity: 🔴 High
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
In ablate_cot_trace_expms/correct_only/manifest.py around lines 108-110 inside
`build_manifest()`, remove the unsafe `pickle.load(f)` of `cot_data.pkl` sourced from
external HF downloads via `ensure_model_artifacts()`. Preferred fix: refactor `cot_data`
to be stored/loaded in a non-code-executing format (e.g., JSON/MsgPack) and update all
callers/writers accordingly. If pickle must be kept temporarily: (1) compute the file's
SHA-256 and compare against a pinned digest from HF artifact metadata/revision before
unpickling, aborting with a clear error on mismatch; (2) require the local file to
already exist (no downloads during manifest construction); (3) optionally use a
restricted unpickler/whitelist that only allows primitive containers/types to prevent
arbitrary object construction.
| if not cot.is_file(): | ||
| cot = ensure_hf_file(repo, f"{prefix}/cot_data.pkl", local_dir=root) | ||
| if not features.is_file(): | ||
| features = ensure_hf_file(repo, f"{prefix}/all_sentences_features.pkl", local_dir=root) |
There was a problem hiding this comment.
ensure_model_artifacts() now pulls all_sentences_features.pkl from Hugging Face during normal runs, and load_and_prepare_cebra() pickle.load(f)s it with no pin/checksum, so should we store this in a non-code format or verify the artifact before unpickling?
Finding type: Basic Security Patterns | Severity: 🔴 High
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/correct_only/hf_utils.py around lines 24-44, in
ensure_model_artifacts(), the function downloads all_sentences_features.pkl from a
Hugging Face dataset repo when missing locally, but the downstream pipeline uses
pickle.load() on it; this is unsafe because the artifact is not revision-pinned or
integrity-verified. Refactor so artifacts are either (a) stored in a non-executable,
non-pickle format (e.g., .npz/.hdf5/parquet) and loaded with a safe reader, or (b) if
you must keep pickle, download a specific pinned revision and verify an expected sha256
checksum before returning/allowing it to be loaded. Concretely, change
ensure_model_artifacts() to require a repo revision and expected hash (e.g., passed via
paths/config), download that exact revision, compute/validate the hash of the local
file, and fail closed if it doesn’t match; then update the artifact filename/path used
by ensure_model_artifacts and the corresponding loader to match the safe format.
| gap_all = metric_at_k(rft_summaries["all"], k_focus, metric) - metric_at_k( | ||
| base_summaries["all"], k_focus, metric | ||
| ) | ||
| gap_paired = metric_at_k(rft_summaries["paired_correct"], k_focus, metric) - metric_at_k( | ||
| base_summaries["paired_correct"], k_focus, metric |
There was a problem hiding this comment.
gap_all / gap_paired subtract metric_at_k() results unconditionally; should we mirror the earlier None check before the gap arithmetic, otherwise missing K can raise TypeError and abort evaluation?
Finding type: Logical Bugs | Severity: 🟢 Low
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/correct_only/evaluate.py around lines 48-67 inside
evaluate_compare_summaries’ gap-retention logic for metrics (“persistence”,
“delta_r2”), mirror the defensive handling used above: after computing gap_all and
gap_paired, first check whether metric_at_k(...) for both base and rft returned None
(e.g., missing K=k_focus) and in that case add a failed check entry (include subset/all
vs paired_correct and the missing K) and skip the subtraction/arithmetic. Then only
compute gaps and the min_paired_gap_fraction comparison when all required values are
non-None, so evaluation never raises TypeError when K is absent.
| MODEL_PAIRS: dict[str, tuple[str, str]] = { | ||
| "llama8": ("llama_8B_base", "llama_8b_reasoning"), | ||
| "qwen14": ("qwen_14B_base", "qwen_14b_reasoning"), | ||
| "qwen1.5": ("qwen_1.5B_base", "qwen1.5b_reasoning"), |
There was a problem hiding this comment.
MODEL_PAIRS hardcodes one (base_folder, rft_folder) per family, but the HF artifact folders vary by dataset, so resolve_paths()/ensure_model_artifacts() can request nonexistent filenames for some dataset×family combos — should we make the mapping dataset-aware or add per-dataset overrides?
Finding type: Breaking Changes | Severity: 🔴 High
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/correct_only/config.py around lines 17-21 where `MODEL_PAIRS` is
defined and in `resolve_paths()` around lines 60-82, the code assumes a single
(base_folder, rft_folder) per `model_family`, but HF artifact folder names vary by
`dataset_key`, causing nonexistent `<folder>/layer_<n>/cot_data.pkl` and
`all_sentences_features.pkl` requests. Refactor by making the folder mapping
dataset-aware (e.g., change `MODEL_PAIRS` to a nested mapping keyed by dataset_key then
model_family, or introduce a per-dataset override dict) and update `resolve_paths()` to
derive `folder` from both `dataset_key` and `model_family` for the requested `variant`
('base' vs 'rft'). Add explicit error messages when a dataset×family override is
missing so failures are actionable.
| cache = None | ||
| if cebra_cache and cebra_cache.exists(): | ||
| log(f"[{model_tag}] Loading CEBRA cache from {cebra_cache}") | ||
| with open(cebra_cache, "rb") as f: | ||
| loaded = pickle.load(f) |
There was a problem hiding this comment.
execute_correct_subset duplicates time_shuffle/run_time_shuffle.py::execute_ablation's cache+CEBRA training+SDS-summary loop; should we reuse execute_ablation or extract a shared helper so cache keys, logging, and summary format stay in sync?
Finding type: Code Dedup and Conventions | Severity: 🟢 Low
Want Baz to fix this for you? Activate Fixer
| "gap_persist": metric_check("paired_correct", "persistence") | ||
| and checks.get("paired_retains_gap_persistence", {}).get("passed"), |
There was a problem hiding this comment.
gap_persist doesn’t match the other PASS/FAIL/None summary columns — should we map paired_retains_gap_persistence to a status label instead of writing the raw boolean?
Finding type: Logical Bugs | Severity: 🟢 Low
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/correct_only/summarize_sweep.py around lines 68-69 inside
`main()`, fix the `gap_persist` field construction in the `rows.append({...})` dict.
Right now `gap_persist` is computed with `metric_check(...) and
checks.get(...).get('passed')`, but since `metric_check()` returns the strings
'PASS'/'FAIL', Python’s `and` returns the second operand (a boolean) instead of a
status label, so the CSV mixes `str|None` and `bool|None`. Refactor `gap_persist` to
follow the exact same PASS/FAIL/None contract as `paired_*` by either calling
`metric_check` with the correct check key for the gap persistence metric, or explicitly
mapping the boolean `passed` value to 'PASS'/'FAIL' (and returning None when the check
is missing).
| RUN_NAME="${RUN_NAME:-middle_layer_sweep}" | ||
| HF_CACHE="${HF_CACHE:-$MI_COT_ROOT/hf_cache}" | ||
| RESULTS="$CORRECT_ONLY/results/compare/$RUN_NAME" |
There was a problem hiding this comment.
RESULTS and HF_CACHE come from env vars and later feed find ... -delete / rm -rf; should we resolve both paths and assert they stay under the intended workspace before cleanup?
Finding type: Basic Security Patterns | Severity: 🔴 High
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/correct_only/run_remaining_sweep.sh around lines 9-11
(definitions of RUN_NAME, HF_CACHE, and RESULTS) and the cleanup logic at lines 35 and
64-69, harden the script against environment-variable path traversal. Resolve RUN_NAME
and HF_CACHE to safe, absolute paths (use basename/sanity checks to reject any RUN_NAME
containing “/” or “..”, and realpath/readlink to compute canonical absolute
paths), then assert that RESULTS is under “$CORRECT_ONLY/results/compare” and that
HF_CACHE is under “$MI_COT_ROOT/hf_cache” (or the repo’s intended cache root). If
either check fails, print an error and exit before any find -delete or rm -rf is
executed, so cleanup never targets outside the workspace.
| # HF activation pickles for this combo's folders | ||
| local -a folders=() | ||
| case "$dataset" in | ||
| math500|svamp) | ||
| case "$family" in | ||
| qwen14) folders=(Qwen_14B_base Qwen_14B_reasoning) ;; | ||
| qwen1.5) folders=(Qwen_1_5B_base Qwen_1_5B_reasoning) ;; | ||
| llama8) folders=(Llama_8B_base Llama_8B_reasoning) ;; | ||
| esac | ||
| ;; | ||
| gsm8k) | ||
| case "$family" in |
There was a problem hiding this comment.
Lines 37-65 duplicate the dataset/model↔folder mapping from MODEL_PAIRS_BY_REPO in config.py; can we derive the folder list from there instead of hardcoding the same cases twice?
Finding type: Code Dedup and Conventions | Severity: 🟢 Low
Want Baz to fix this for you? Activate Fixer
| log( | ||
| f"[{model_tag}] SKIP subset {subset_name}: " | ||
| f"{len(all_f)} sentence features < PCA_DIM={PCA_DIM}" | ||
| ) | ||
| return { |
There was a problem hiding this comment.
The len(all_f) < PCA_DIM skip path returns a different summary and bypasses out_dir/summary.json, so skipped subsets miss per_run and metric_at_k() can crash on `summary[
Finding type: Breaking Changes | Severity: 🔴 High
Want Baz to fix this for you? Activate Fixer
Other fix methods
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
ablate_cot_trace_expms/correct_only/pipeline.py around lines 90-94 (the `len(all_f) <
PCA_DIM` early-return inside the subset processing logic), the skip path returns a
non-standard summary that omits the `per_run` field and then exits before the normal
`out_dir/summary.json` write. Refactor this branch so it produces a summary that matches
the same schema as the non-skipped path (at minimum include `per_run: []`, and any other
required keys like `subset`/`skipped`/`reason`), and ensure the code still writes
`out_dir/summary.json` for skipped subsets rather than returning early before artifact
creation. After the change, `metric_at_k()` and `evaluate_compare_summaries()` should be
able to read `summary['per_run']` safely without crashing.
Generated description
Below is a concise technical summary of the changes proposed in this PR:
Implement a correct-trace ablation platform that grades Hugging Face SDS artifacts, caches correctness manifests/paired indices, and reruns CEBRA+SDS on
all,within_correct, andpaired_correctsubsets to compare base vs RFT. Document and script the end-to-end workflows (single model, dataset sweep, full analysis, sweep runner) plus a complementary time-shuffle ablation that permutes sentence-ordered trajectories before SDS and reports pass/fail outcomes and visual gaps.shuffle_sequence_lists, reruns CEBRA/SDS for real vs shuffled order, and orchestrates comparison/batch runners plus README documentation to validate whether SDS metrics collapse without temporal structure.Modified files (7)
Latest Contributors(0)
manifest_cache/results.Modified files (16)
Latest Contributors(0)
Modified files (2)
Latest Contributors(0)