Inverse design — paper-grade comparison + plot pass + per-element constraints#18
ClassificationTaskConfig gains an optional class_weights (length == num_classes); ClassificationHead registers it as a buffer and passes it to F.cross_entropy. Lets callers counter class imbalance so a dominant class doesn't collapse all predictions onto itself. Unweighted behaviour is unchanged when class_weights is None. Adds focused head tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
normalize_composition now returns the (non-reduced) pymatgen Composition.formula
("Fe2 O3") instead of a hand-padded fixed-decimal string ("Fe2.000000 O3.000000").
pymatgen already canonicalizes element order and integer-vs-decimal amounts, so
equal compositions collapse to the same readable key while absolute stoichiometry
is preserved (Fe2O3 != Fe4O6). Tests updated to the .formula form.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
material_type:
- merge the 5 fine labels into AC / QC / others (3 classes);
- balanced (inverse-frequency) class weights so it no longer collapses to the dominant "others" class;
- stratified per-dataset sampling keeps every minority (AC/QC) row under the smoke cap so the rare classes survive.
Plots:
- titles show just the property name + plotted scale; R²/accuracy moved into the axes (boxed), avoiding overlap;
- kernel-regression panels report per-composition R² (one composition per panel), with a single horizontal legend at the figure's top-left;
- confusion matrix coloured by row-normalized recall with real class names, drawn bottom-left origin and ordered others → AC → QC so the diagonal reads bottom-left → top-right;
- forgetting plot widened with the legend outside so it scales to many tasks.
Smoke config max_epochs_per_step 1 -> 5 (1 epoch was too under-trained to show the classifier diagonal or meaningful fits).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
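A minimal sketch of balanced (inverse-frequency) class weights, assuming the common `n_samples / (num_classes * count)` convention; the commit does not state the exact formula, and `balanced_class_weights` plus the label counts below are hypothetical names and data used only for illustration.

```python
from collections import Counter


def balanced_class_weights(labels, num_classes):
    """Inverse-frequency weights: n_samples / (num_classes * count[c]).

    Assumed convention; the PR only says "balanced (inverse-frequency)".
    """
    counts = Counter(labels)
    n_samples = len(labels)
    weights = []
    for c in range(num_classes):
        if counts[c] == 0:
            raise ValueError(f"class {c} has no samples; cannot weight it")
        weights.append(n_samples / (num_classes * counts[c]))
    return weights


# Hypothetical imbalanced label column: 0 = others, 1 = AC, 2 = QC.
labels = [0] * 80 + [1] * 15 + [2] * 5
weights = balanced_class_weights(labels, num_classes=3)
```

With this convention the rare QC class receives the largest weight and the weighted sample count still sums to the dataset size, so the overall loss scale stays comparable to the unweighted case.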
optimize_latent gains class_target_weight (default 1.0) to scale the classification objective relative to the regression targets, so a class-probability objective can be made primary while regression targets stay secondary. Validated > 0 when class_targets is given. Adds tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Task order: the last three tasks are fixed as formation_energy → klat → material_type so the inverse-design heads (especially the QC classifier) are freshest when inverse design runs; the first nine order is free.
- Inverse design: primary objective is raising quasicrystal probability (class_target_weight=5); secondary objectives are low formation energy and high lattice thermal conductivity. Seeds are the training compositions the model already scores highest on QC probability.
- Inverse-design plot reworked: QC probability is the primary panel (seed → decoded, toward 1.0); regression targets are secondary panels with concise property + ↑/↓ titles. Report slide leads with the QC result.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an optional latent-space regulariser λ·‖tanh(encoder(AE.decode(h))) − h‖² to optimize_latent's loss. The penalty pulls the optimized latent toward what the AE faithfully reconstructs, mitigating the decode round-trip drop (after-decode head predictions drifting back from the in-latent optimum). Default 0 (off, no behaviour change). Tests added. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ly mode
- inverse_cycle_weight config field, wired into _inverse_design (defaults 0, off).
- run() saves a final_model.pt checkpoint at end of training so inverse-design experiments can be iterated without retraining.
- run_inverse_only(ckpt) + --inverse-only <ckpt> CLI flag: rebuild the model with all task configs (same construction as the final training state), load the state_dict, and run only the inverse-design stage (~seconds per iteration).
Validated: smoke train saves final_model.pt; --inverse-only reloads and runs in <3s; cycle_weight=1.0 raises after-decode QC vs cycle_weight=0 on the same model (round-trip drift shrinks).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… comparison
Two new top-level scripts that DON'T live under continual_rehearsal_demo (per the
"evaluation is independent of rehearsal" requirement). They reuse the demo's
runner only for data loading + model reconstruction; no rehearsal loop is run.
* finetune_inverse_heads.py
Loads a final_model.pt, freezes encoder + every other head (including AE),
and runs a short fine-tune of just the inverse-design heads (defaults:
formation_energy, klat, material_type). Re-uses the model's
configure_optimizers filtering — frozen params automatically get no optimizer.
* eval_inverse_methods.py
Loads a checkpoint and compares two inverse-design methods head-to-head on
the SAME model / seed compositions / targets:
A. optimize_latent(optimize_space="latent", cycle_consistency_weight=λ)
swept over a configurable list of λ values.
B. optimize_composition(kmd_kernel) — the differentiable KMD path added
in PR #17.
Outputs eval_inverse_methods.json (per-seed, per-method) and a comparison
PNG (QC + each regression target across methods, mean ± seed std).
Both are independent of the rehearsal demo (own CLI / output dir / no
training-loop side effects) and stop after writing their artefacts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new optional knobs on optimize_composition for experimental feasibility:
* allowed_elements : 1-D bool mask or LongTensor index list of element columns the optimisation may use. Disallowed elements are masked to -inf inside the softmax, so w stays exactly 0 on them throughout and no gradient ever lifts them. Hard whitelist for "elements we can actually synthesise".
* element_step_scale : 1-D float tensor (n_components,) ≥ 0. Per-element gradient multiplier applied to logits.grad before optimizer.step. 0 freezes that element at its current value (lock the seed framework), 0.1 lets it drift slowly, 1.0 is the default. Combine with allowed_elements for both hard + soft constraints.
Implementation: a small _w_from_logits helper masks logits inside softmax; the optimisation loop scales logits.grad by element_step_scale before each Adam step. Frozen elements never accumulate momentum (g=0 → Adam doesn't move them).
eval_inverse_methods.py exposes both as symbol-list CLI flags (--allowed-elements, --locked-elements, --locked-step-scale) and resolves symbols to indices via DEFAULT_ELEMENTS.
Tests:
- allowed_elements as index list or bool mask: forbidden columns stay 0.
- allowed_elements validation: 1-D requirement, length, range, non-empty.
- element_step_scale=0 on chosen elements: their relative weight (the ratio w[i]/w[j] of equally-seeded frozen elements) is preserved exactly to 1.0.
- element_step_scale validation: length and non-negative values.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
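The two knobs can be sketched without any tensor library. This is a framework-free illustration under stated assumptions, not the PR's `_w_from_logits`: `masked_softmax` and `scale_gradient` are hypothetical helper names, and plain Python lists stand in for tensors.

```python
import math


def masked_softmax(logits, allowed):
    """Softmax over `logits` with disallowed positions forced to weight 0."""
    if not any(allowed):
        raise ValueError("allowed mask must keep at least one element")
    # Disallowed logits become -inf, so exp(...) is exactly 0.0 for them.
    masked = [z if ok else -math.inf for z, ok in zip(logits, allowed)]
    top = max(masked)  # finite because at least one entry is allowed
    exps = [math.exp(z - top) for z in masked]
    total = sum(exps)
    return [e / total for e in exps]


def scale_gradient(grad, step_scale):
    """Per-element gradient multiplier (0 = no update, 1 = unchanged)."""
    return [g * s for g, s in zip(grad, step_scale)]


logits = [0.5, 2.0, -1.0, 0.0]
allowed = [True, False, True, True]
w = masked_softmax(logits, allowed)

# Hypothetical logit gradient; element 2 is frozen, element 3 drifts slowly.
scaled = scale_gradient([0.3, -0.2, 0.1, 0.4], [1.0, 1.0, 0.0, 0.1])
```

The mask is a hard constraint (the weight is exactly zero), while the step scale only affects how fast a logit moves.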
…mposition
Replace the raw-tensor API for allowed_elements / element_step_scale with a
symbol-based one that reads naturally from a user's chemical intent.
* allowed_elements : str | list[str], default "all".
- "all" (default): no constraint.
- list[str]: non-empty list of element symbols (validated against
DEFAULT_ELEMENTS); kernel must have n_components == len(DEFAULT_ELEMENTS).
Any other value (empty list, unknown symbol, wrong type, etc.) raises with
a clear message — no silent acceptance.
* element_step_scale : float | Mapping[str, float], default 1.0.
- scalar: uniform per-element multiplier (default 1.0 = no scaling).
- mapping {symbol -> float}: override specific elements at 1.0 otherwise;
{"Mg": 0.0, "Al": 0.0} freezes the seed framework while the rest is free.
Tests rewritten to use symbols; introduces a tiny aligned-model helper so
symbol-based tests run on the bundled DEFAULT_ELEMENTS registry. Eval script
passes symbols straight through (no manual tensor conversion).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
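A minimal sketch of how a symbol-based API like the one described above could resolve user input into a per-element mask and step-scale list. The five-element `ELEMENTS` list is a hypothetical stand-in for DEFAULT_ELEMENTS, and `resolve_allowed` / `resolve_step_scale` are illustrative names, not functions from the PR.

```python
from collections.abc import Mapping

ELEMENTS = ["Mg", "Al", "Cu", "Zn", "Pd"]  # hypothetical stand-in registry


def resolve_allowed(allowed_elements, elements=ELEMENTS):
    """Return a bool mask; "all" means no constraint."""
    if isinstance(allowed_elements, str):
        if allowed_elements == "all":
            return [True] * len(elements)
        raise ValueError("allowed_elements string must be 'all'")
    if not isinstance(allowed_elements, (list, tuple)) or not allowed_elements:
        raise ValueError("allowed_elements must be 'all' or a non-empty list")
    unknown = [s for s in allowed_elements if s not in elements]
    if unknown:
        raise ValueError(f"unknown element symbol(s): {unknown}")
    chosen = set(allowed_elements)
    return [s in chosen for s in elements]


def resolve_step_scale(element_step_scale, elements=ELEMENTS):
    """Scalar -> uniform list; mapping -> overrides on top of 1.0."""
    if isinstance(element_step_scale, Mapping):
        unknown = [s for s in element_step_scale if s not in elements]
        if unknown:
            raise ValueError(f"unknown element symbol(s): {unknown}")
        scales = [float(element_step_scale.get(s, 1.0)) for s in elements]
    elif isinstance(element_step_scale, (int, float)) and not isinstance(element_step_scale, bool):
        scales = [float(element_step_scale)] * len(elements)
    else:
        raise TypeError("element_step_scale must be a float or a mapping")
    if any(v < 0 for v in scales):
        raise ValueError("element_step_scale values must be >= 0")
    return scales


mask = resolve_allowed(["Mg", "Pd"])
scales = resolve_step_scale({"Mg": 0.0, "Al": 0.0})
```

Every malformed input raises rather than falling back to a default, which matches the "no silent acceptance" intent stated in the commit.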
Adds the orchestrator that produces the paper materials for the latent-vs-KMD
inverse-design comparison, plus the cheaper baseline TOML used by the
finetune/eval/paper scripts as the training config.
- samples/continual_rehearsal_demo_config_inverse_baseline.toml
Drops the two heavy kernel-regression tasks (dos_density, power_factor) from
the demo sequence; keeps 7 other regression tasks for encoder diversity. Last
three are formation_energy -> klat -> material_type so the inverse-design
heads stay freshest at the end of the continual sequence. Saves final_model.pt
so inverse-design experiments iterate without retraining.
- src/foundation_model/scripts/paper_inverse_comparison.py
Runs latent (cycle-weight sweep {0, 0.1, 0.5, 1, 2, 5}) and composition
(4 configs: unconstrained, alloy palette, alloy+sparsity, alloy+soft step=0.5)
on the same trained checkpoint with shared seeds + targets. Writes
final_model.pt copy, seeds.json, results.json, comparison.png and a README
summary table into one output folder (paper-ready).
Companion artefacts in artifacts/paper_inverse_design/ are .gitignored.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR extends the inverse-design workflow to support paper-grade method comparisons and adds new inverse-design controls (cycle-consistency in latent space; per-element constraints in composition space), while also tightening the continual rehearsal demo’s plotting and inverse-design objective/seed selection to match the intended paper setup.
Changes:
- Added a paper-materials orchestrator (paper_inverse_comparison.py) plus standalone fine-tune/eval scripts for inverse heads (finetune_inverse_heads.py, eval_inverse_methods.py).
- Updated continual rehearsal demo to (a) use merged 3-class material_type, (b) save final_model.pt and support --inverse-only, and (c) improve plots + inverse-design configuration knobs.
- Added per-class cross-entropy weights for classification heads and expanded inverse-design APIs/tests (optimize_latent weights + cycle penalty; optimize_composition per-element constraints).
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/foundation_model/scripts/paper_inverse_comparison.py | New end-to-end “paper folder” orchestrator for latent-vs-composition inverse-design sweeps. |
| src/foundation_model/scripts/finetune_inverse_heads.py | New script to freeze most of the model and fine-tune only inverse-design heads. |
| src/foundation_model/scripts/eval_inverse_methods.py | New script to compare latent cycle-sweep vs differentiable KMD composition optimization. |
| src/foundation_model/scripts/continual_rehearsal_demo.py | Updates demo task ordering/targets/seeding, plotting pass, checkpoint saving, and inverse-only mode. |
| src/foundation_model/models/task_head/classification.py | Adds optional class_weights buffer and applies it in cross-entropy loss. |
| src/foundation_model/models/task_head/classification_test.py | New tests covering weighted/unweighted CE and validation. |
| src/foundation_model/models/model_config.py | Adds class_weights to ClassificationTaskConfig. |
| src/foundation_model/models/flexible_multi_task_model.py | Extends optimize_latent (class weight + cycle penalty) and optimize_composition (allowed elements + per-element step scaling). |
| src/foundation_model/models/flexible_multi_task_model_test.py | Adds tests for new optimize_latent/optimize_composition behaviors and validations. |
| src/foundation_model/data/composition_sources.py | Switches canonical composition key to pymatgen Composition.formula. |
| src/foundation_model/data/composition_sources_test.py | Updates tests to match the new canonical formula key. |
| samples/continual_rehearsal_demo_config.toml | Updates demo defaults for task order + inverse-design settings/seeding. |
| samples/continual_rehearsal_demo_config_smoke.toml | Updates smoke config task order and inverse-design defaults. |
| samples/continual_rehearsal_demo_config_inverse_baseline.toml | Adds a cheaper baseline config for inverse-design experiments. |

    # Candidate pool: the chosen split of the material_type frame, with a valid descriptor.
    frame = self.task_frames["material_type"]
    index = (
        frame.index if cfg.inverse_seed_split == "all" else frame.index[frame["split"] == cfg.inverse_seed_split]
    )
    pool = [c for c in index if c in self._desc_cache or not self.descriptor_fn([c]).empty]

    def __post_init__(self) -> None:
        unknown = [t for t in self.task_sequence if t not in TASK_SPECS]
        if unknown:
            raise ValueError(f"Unknown task(s) {unknown}. Available: {sorted(TASK_SPECS)}")
        if not 0.0 <= self.replay_ratio <= 1.0:
            raise ValueError("replay_ratio must be in [0, 1] (0 = no rehearsal).")
        if len(self.inverse_reg_tasks) != len(self.inverse_reg_targets):
            raise ValueError("inverse_reg_tasks and inverse_reg_targets must have equal length.")
        if self.inverse_seed_strategy not in {"top_qc", "random", "explicit"}:
            raise ValueError("inverse_seed_strategy must be 'top_qc', 'random', or 'explicit'.")
        if self.inverse_seed_split not in {"train", "val", "test", "all"}:
            raise ValueError("inverse_seed_split must be 'train', 'val', 'test', or 'all'.")
        if self.inverse_seed_strategy == "explicit" and not self.inverse_seed_compositions:
            raise ValueError("inverse_seed_strategy='explicit' requires inverse_seed_compositions.")

    # Load the trained model exactly as we built it during training (same task_sequence).
    model = runner._build_full_model()
    state = torch.load(ckpt_path, map_location="cpu", weights_only=True)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5f20951716
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

    for p in model.encoder.parameters():
        p.requires_grad_(False)
    for head_name, head in model.task_heads.items():
        train = head_name in keep
        for p in head.parameters():
            p.requires_grad_(train)
Freeze task_log_sigmas during inverse-head fine-tune
freeze_except only freezes the encoder and task heads, so task_log_sigmas remain trainable when enable_learnable_loss_balancer is on (the default), and configure_optimizers will still optimize them. That means this script is not actually “head-only” fine-tuning: loss-balancing coefficients change during training and can materially alter the outcome of the inverse-head comparison. Freeze those parameters (or disable the learnable balancer) in this flow.

    if class_weights is not None:
        weights = torch.as_tensor(class_weights, dtype=torch.float)
        if weights.numel() != num_classes:
            raise ValueError(f"class_weights length ({weights.numel()}) must equal num_classes ({num_classes}).")
        self.register_buffer("class_weights", weights)
    else:
        self.class_weights = None
Register class_weights buffer consistently for checkpoint loads
The module shape depends on whether class_weights is set: weighted heads register a buffer, but unweighted heads assign a plain attribute. This makes state_dict keys differ between runs, so strict checkpoint loads in the inverse scripts can fail when loading checkpoints produced under a different class-weight setting (including older checkpoints). Keep the buffer contract consistent across both modes (or relax strict loading) to avoid brittle checkpoint compatibility.
…ent-system seed dedup
Three coupled improvements based on review of the previous paper-materials run, where
the composition method produced 0/16 novel element sets — i.e. it could only rebalance
the seed's existing elements, never recruit new ones.
**Param renames** (two clearly orthogonal regularisers that I had previously conflated):
- 'cycle_consistency_weight' -> 'ae_cycle_weight' in optimize_latent. Lives in latent
space; penalises the AE decode-encode round-trip drift (|h - encode(decode(h))|^2).
- 'sparsity_weight' -> 'entropy_weight' in optimize_composition. Lives in composition
space; the implementation is Shannon entropy of w, which is not literal L1 sparsity
(entropy biases toward peaky w, L1 would push small weights to zero). New name is
truthful about the mechanism.
Docstrings now state space / penalty form / problem solved explicitly so the two
penalties cannot be confused.
**Scheme B — 'seed_blend' for optimize_composition** (default 0.95):
Old behaviour clamped non-seed-element weights to log(1e-12) ~= -27.6, where the
softmax Jacobian dL/dlogit_i is proportional to w_i and therefore ~= 1e-12. Adam
cannot lift those logits within a few hundred steps, so the support set is frozen
to the seed's nonzero elements — the root cause of zero novelty.
w0 <- seed_blend * seed + (1 - seed_blend) * uniform_over_allowed lifts non-seed
logits to log(0.05 / |allowed|) ~= -7.6, which IS reachable. seed_blend=1.0
reproduces the old strict behaviour for callers who want it.
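The logit values quoted for Scheme B can be checked numerically. The sketch below assumes 100 allowed elements, which is a hypothetical count chosen because it reproduces the quoted ≈ -7.6; the PR does not state the real size of the element registry. `blended_init` is an illustrative name.

```python
import math


def blended_init(seed, allowed, seed_blend):
    """w0 = seed_blend * seed + (1 - seed_blend) * uniform over allowed elements."""
    if not 0.0 <= seed_blend <= 1.0:
        raise ValueError("seed_blend must be in [0, 1]")
    n_allowed = sum(allowed)
    uniform = [1.0 / n_allowed if ok else 0.0 for ok in allowed]
    return [seed_blend * s + (1.0 - seed_blend) * u for s, u in zip(seed, uniform)]


n_components = 100  # assumption, not taken from the PR
seed = [0.5, 0.3, 0.2] + [0.0] * (n_components - 3)
allowed = [True] * n_components

w0 = blended_init(seed, allowed, seed_blend=0.95)
floor_logit = math.log(w0[-1])   # logit of a non-seed element after blending
clamped_logit = math.log(1e-12)  # old behaviour: clamp non-seed weights

print(round(floor_logit, 2), round(clamped_logit, 2))
```

Because the softmax gradient with respect to a logit scales with the corresponding weight, raising a non-seed weight from 1e-12 to roughly 5e-4 increases the usable gradient on that logit by many orders of magnitude, which is the mechanism the commit message describes.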
**Scheme D — random-init control in paper comparison**:
Drops the seed entirely (initial_weights=None, n_starts=B). With Scheme B, this
should converge to a similar attractor as the blended-seed path, confirming the
seed was binding the support set.
**Element-system seed dedup in _select_seeds**:
The top-QC seed list collapsed to many near-duplicates of the same alloy family
(e.g. {Mg, Al, Ag} appeared 3 times in the previous 16). Now keep one best-scoring
representative per element set, so 16 seeds == 16 distinct alloy families.
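A minimal sketch of element-system dedup: keep the best-scoring composition per set of elements. The regex-based symbol parsing is a simplification assumed here for a dependency-free example, and the formulas and scores are hypothetical.

```python
import re


def element_system(formula):
    """Set of element symbols in a formula such as 'Mg30 Al50 Ag20'."""
    return frozenset(re.findall(r"[A-Z][a-z]?", formula))


def dedup_by_element_system(scored, n_seeds):
    """Keep the highest-scoring (formula, score) pair per element system."""
    best = {}
    for formula, score in sorted(scored, key=lambda fs: fs[1], reverse=True):
        # First hit per system is the best one because of the descending sort.
        best.setdefault(element_system(formula), (formula, score))
        if len(best) == n_seeds:
            break
    return list(best.values())


candidates = [
    ("Mg30 Al50 Ag20", 0.91),
    ("Mg32 Al48 Ag20", 0.95),
    ("Al70 Pd20 Mn10", 0.88),
    ("Mg40 Al40 Ag20", 0.80),
    ("Al65 Cu20 Fe15", 0.70),
]
seeds = dedup_by_element_system(candidates, n_seeds=3)
```

Here the three {Mg, Al, Ag} candidates collapse to the single best one, so three requested seeds yield three distinct alloy families.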
New tests:
- ae_cycle / entropy_weight validation + smoke
- seed_blend range validation
- seed_blend=1.0 freezes support set (strict reproduces old behaviour)
- seed_blend<1 recruits non-seed elements
- random-init respects allowed_elements
Re-run paper_inverse_comparison results land where the analysis predicted:
- strict seed: 0/16 novel, per-seed refinement (Mg/Al/Cu/Zn)
- blended seed: 16/16 novel, converges to a Ti/Pu/F/Mn attractor (model bias)
- alloy palette + blended: 16/16 outputs are Mg-Pd-Al-Ni-Ga mixtures — real-world
QC alloys; Pd discovered (not in any seed) confirms the support-set freeze was
the bottleneck.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codex review

    # Candidate pool: the chosen split of the material_type frame, with a valid descriptor.
    frame = self.task_frames["material_type"]
    index = (
        frame.index if cfg.inverse_seed_split == "all" else frame.index[frame["split"] == cfg.inverse_seed_split]
    )
    pool = [c for c in index if c in self._desc_cache or not self.descriptor_fn([c]).empty]
    if not pool:

    class_targets: Mapping[str, int | Sequence[int]] | None = None,
    class_target_weight: float = 1.0,
    sparsity_weight: float = 0.0,
    entropy_weight: float = 0.0,
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6004432299

    # Soft per-element constraint: scale each element's logit gradient (0 = frozen).
    logits.grad.mul_(step_scale)
Enforce true element freezing for zero step scale
element_step_scale is documented and exposed as a way to freeze specific element weights, but this implementation only zeros the logit gradient. Because w = softmax(logits), updating other logits still renormalizes the row and changes the supposedly locked element weights, so constrained inverse-design runs (e.g. element_step_scale={"Mg": 0.0} with other elements free) can drift on locked elements and violate the intended experiment constraint.
…e freeze + class_weights buffer)
Three reviewer-flagged correctness issues:
**P1 (codex) - element_step_scale=0 didn't actually freeze the locked weights**
The docstring promised 'freezes those elements at their seed values', but the
implementation only zeroed the locked logit's gradient. Because w = softmax(logits)
renormalises across all logits, the locked element's *weight* would still drift
whenever other (unlocked) logits moved — the softmax denominator changes and so
does w_locked. The previous test only checked the ratio w[Mg]/w[Al] (which stays
at 1.0 even if both drift in lockstep), so the bug went unnoticed.
Fix: in _w_from_logits, after the softmax, paste the un-blended seed values back
at locked positions and renormalise the unlocked positions to fill the remaining
1 - Σ_locked seed mass per row. Fully differentiable; the lock branch is a constant
so its gradient is naturally zero (we no longer rely on the .grad.mul_(step_scale)
zeroing for the hard-lock case). Validates that the lock requires initial_weights
(no seed -> nothing to lock to) and that locked elements are in allowed_elements
if a whitelist is set.
Test changes:
- Strengthened test_optimize_composition_element_step_scale_locks_symbols to check
absolute w values (asymmetric seed 0.30/0.20 so ratio-only checks don't suffice).
- Added test_optimize_composition_element_step_scale_locks_with_unlocked_drift
(Mg locked at 0.40, Cu/Ni free to redistribute the remaining 0.60).
- Added two validation tests for the new error paths.
The renormalisation is differentiable and the rerun of paper_inverse_comparison
produces identical numbers (no config uses step_scale=0; the soft path is
unchanged).
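The drift and the fix can be reproduced with a few lines of plain Python. This is an illustrative sketch of the idea described above, not the PR's `_w_from_logits`; `locked_weights` is a hypothetical name and the Mg/Cu/Ni numbers mirror the test description only loosely.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def locked_weights(logits, seed, locked):
    """Pin locked positions to their seed values; rescale the rest to fill 1 - locked mass."""
    locked_mass = sum(s for s, is_locked in zip(seed, locked) if is_locked)
    w = softmax(logits)
    free_mass = sum(v for v, is_locked in zip(w, locked) if not is_locked)
    if free_mass == 0.0:
        raise ValueError("at least one element must stay unlocked")
    scale = (1.0 - locked_mass) / free_mass
    return [s if is_locked else v * scale for v, s, is_locked in zip(w, seed, locked)]


seed = [0.40, 0.35, 0.25]      # Mg, Cu, Ni
locked = [True, False, False]  # lock Mg only

# Pretend the optimiser moved the two free logits but left the Mg logit untouched.
logits = [math.log(0.40), math.log(0.35) + 1.5, math.log(0.25) - 0.5]

naive = softmax(logits)                      # Mg weight drifts despite a frozen logit
fixed = locked_weights(logits, seed, locked)  # Mg weight stays at its seed value
```

In the naive case the Mg logit never changes, yet its weight falls well below 0.40 because the softmax denominator grew; pinning the seed value and renormalising the free positions keeps the row summing to one.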
**P2 (codex) - finetune_inverse_heads.py was not actually head-only**
freeze_except() froze the encoder and the non-inverse heads but left
model.task_log_sigmas (the learnable loss-balancer scalars, one per task) trainable.
configure_optimizers() picks them up, so they move during the 'head-only' fine-tune
and silently change the relative weighting of the inverse-design objectives —
making the comparison apples-to-oranges. We now freeze every task_log_sigma scalar
in freeze_except (no-op when the balancer is disabled).
**P2 (copilot) - class_weights buffer key was inconsistent between configs**
A weighted head registered a class_weights buffer; an unweighted head left it as a
plain attribute. The state_dict therefore had the key only sometimes, so strict-loading
a checkpoint saved under one config into a head built under the other would fail.
Fix: always register_buffer('class_weights', tensor). When no per-class weights are
configured we register torch.ones(num_classes), which is the identity for
F.cross_entropy(..., weight=w) and for the per-sample 'sum / N' reduction the head
uses — unweighted behaviour is unchanged. Added test_class_weights_state_dict_key_present_when_unset
asserting strict-load works in both directions.
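Why all-ones weights leave the loss unchanged can be shown with a small stdlib-only sketch. It mirrors the weighted-mean form of cross-entropy (sum of weighted per-sample losses divided by the sum of the weights of the targets) rather than calling PyTorch; the function names and the toy logits are hypothetical.

```python
import math


def nll(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    top = max(logits)
    lse = top + math.log(sum(math.exp(z - top) for z in logits))
    return lse - logits[target]


def weighted_ce(batch_logits, targets, class_weights):
    """Weighted mean: sum_i w[y_i] * nll_i / sum_i w[y_i]."""
    num = sum(class_weights[y] * nll(z, y) for z, y in zip(batch_logits, targets))
    den = sum(class_weights[y] for y in targets)
    return num / den


def plain_ce(batch_logits, targets):
    """Unweighted 'sum / N' reduction."""
    return sum(nll(z, y) for z, y in zip(batch_logits, targets)) / len(targets)


batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [-0.5, 1.5, 0.0]]
targets = [0, 2, 1]

unweighted = plain_ce(batch_logits, targets)
with_ones = weighted_ce(batch_logits, targets, [1.0, 1.0, 1.0])
with_balanced = weighted_ce(batch_logits, targets, [0.4, 2.2, 6.7])
```

With weights of one, the numerator is the plain sum of losses and the denominator is N, so both reductions agree; any non-uniform weight vector changes the value.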
Also: paper_inverse_comparison skips the checkpoint copy when source and destination
resolve to the same file (idempotent reruns no longer raise SameFileError).
All 236 tests pass (3 new lock-tests + 1 new class_weights cross-config-load test).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rsity_scale) + raw-data dumps + plan doc
Two parameter renames + a sign/range realignment so users don't have to read code to know which direction is which. Plus raw arrays in the JSON output so future replots don't need to re-run the optimisation. Plus the planning doc for the continual_rehearsal_full workstream that consumes both knobs.
**1) ae_cycle_weight -> ae_align_scale (range [0, 1], default 0.5)**
Same penalty mechanism (latent decode-encode roundtrip MSE), but now the user sees a [0, 1] dial where 0 = no penalty (the AE-roundtrip failure mode, QC drops to 0.39 in the PR #18 paper run) and 1 = strong alignment. Default 0.5 is the empirical sweet spot from #18. No more guessing whether the value should be in [0, 1] or unbounded.
**2) entropy_weight -> diversity_scale (range [0, 1], default 1.0, sign flipped)**
Old: entropy_weight=0.5 minimised entropy -> peaky outputs (counter-intuitive - bigger value made the recipe simpler).
New: diversity_scale=1.0 means no entropy penalty (default, most flexible); diversity_scale=0.0 means max penalty -> peaky few-element recipes. Internally the term added to the loss is (1 - diversity_scale) * H(w). Larger value -> more diversity / multi-element per output, matching the name. Default flipped from 'penalise entropy mildly' to 'no penalty' because the user's default expectation is 'let the optimiser pick its natural element count'.
**3) Raw-data dumps**
Both _run_latent_method and _run_composition_config now include the per-seed optimized_descriptor (B, x_dim) and optimized_weights (B, n_components) in their results dict. Future replots (per-element bar charts, similarity matrices, t-SNE, etc.) no longer need to re-run optimize_*. results.json grows from ~50 KB to ~3 MB for the 11-row paper run - still well under any sensible limit.
**4) docs/continual_rehearsal_full_PLAN.md (new)**
Planning doc for the next workstream (the 'full' / formal continual-rehearsal runner). Spells out:
- the four-path evaluation matrix (1 latent + 3 composition);
- the 17+3 seed scheme (top-QC dedup + three explicit Au-Ga-{Gd,Tb,Dy});
- the 41-element ALLOY_PALETTE (covers classical i-QC ternaries + Au-Ga-RE + group 13/14 + most TMs + accessible lanthanides);
- expected-baseline table from the #18 paper run as a sanity check;
- the systematic blended-unconstrained vs random-init ablation (kept as a positive finding about the architecture's expressive power + the necessity of constraints, not dismissed as redundant);
- per-scenario success criteria;
- the explicit pairwise_l1 definition;
- the project-level narrative arc from problem statement to the future agent-based AI4S workbench.
Tests:
- 54 of 54 model tests pass.
- New tests cover [0, 1] range validation for both knobs and the qualitative direction of diversity_scale (peaky vs spread).
- Old strict-seed and unlocked-drift tests still pass.
Reproducibility check: rerunning paper_inverse_comparison after the renames gives the same numbers as PR #18 for every row that maps cleanly between the old and new semantics (latent alpha sweep + strict-seed + alloy palette + random-init). The 'comp (alloy + peaky)' config now uses diversity_scale=0 (was entropy_weight=0.5) and lands on essentially the same Al-Pd-Mg peak collapse as #18 - just with the user-facing knob at a cleaner end of [0, 1].
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
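The direction of the diversity_scale knob can be sketched with the penalty term quoted above, (1 - diversity_scale) * H(w). This is a stdlib illustration with hypothetical weight vectors; `diversity_penalty` is an illustrative name, not the PR's function.

```python
import math


def shannon_entropy(w):
    """H(w) = -sum_i w_i * log(w_i), skipping zero weights."""
    return -sum(p * math.log(p) for p in w if p > 0.0)


def diversity_penalty(w, diversity_scale):
    """Loss term (1 - diversity_scale) * H(w); 1.0 disables the penalty."""
    if not 0.0 <= diversity_scale <= 1.0:
        raise ValueError("diversity_scale must be in [0, 1]")
    return (1.0 - diversity_scale) * shannon_entropy(w)


peaky = [0.94, 0.03, 0.03]
spread = [1 / 3, 1 / 3, 1 / 3]

no_penalty = diversity_penalty(spread, diversity_scale=1.0)
max_penalty_spread = diversity_penalty(spread, diversity_scale=0.0)
max_penalty_peaky = diversity_penalty(peaky, diversity_scale=0.0)
```

At diversity_scale=0 the spread composition pays a larger penalty than the peaky one, so minimising the loss favours few-element recipes; at 1.0 the term vanishes.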
Four changes to support the preview run that produced artifacts/paper_inverse_design/
(handoff materials for slide / paper-figure creation outside Claude Code).
1. samples/continual_rehearsal_demo_config_inverse_baseline.toml
- Add 'dos_density' KR task between 'kp' and 'formation_energy' (11 total tasks now).
- inverse_n_seeds = 20 (was 16); plan §5 spec of '17 top-QC dedup + 3 explicit'.
- inverse_seed_explicit_append = ['Au65 Ga20 Gd15', 'Au65 Ga20 Tb15', 'Au65 Ga20 Dy15'].
- inverse_ae_align_scale = 0.5 (kept).
2. continual_rehearsal_demo.py / _select_seeds
- Add inverse_seed_explicit_append config field (default empty list).
- _select_seeds now combines (n - len(explicit)) top-QC dedup seeds with the explicit
appends; explicit entries are validated against descriptor_fn (fail-fast on bad input)
and deduplicated by element system. Total length stays at inverse_n_seeds.
- Element-system dedup also runs across the strategy + explicit boundary so the same
family is never double-listed.
3. finetune_inverse_heads.py
- Disable every non-inverse head with model.disable_task(...) before Trainer.fit so the
validation step doesn't try to forward KR heads (dos_density needs t_sequences that
the inverse-only DataModule does not provide). Re-enable them with model.enable_task(...)
before saving so the state_dict layout matches what paper_inverse_comparison rebuilds.
- freeze_except now also freezes model.task_log_sigmas (the loss-balancer scalars). Without
this the 'head-only' fine-tune wasn't actually head-only (caught by codex P2 review on #18).
4. paper_inverse_comparison.py
- DEFAULT_ALLOY_PALETTE switched to the 41-element palette spec (was 12). Asserted at import.
- The auto-generated compact summary table now lands at SUMMARY.md, leaving README.md free
for a human-written index of the whole folder (figures, raw arrays, narrative, story
handles per slide). The previous design clobbered README.md on every rerun.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ified run folder
Two structural improvements driven by the preview-run hand-off discussion, plus a plan-doc update.
1. Per-step persistence in continual_rehearsal_demo.py
Until now each training step saved a plot for the *new* task only (parity / confusion / KR
sequences) and a single forgetting trajectory at the end. That left no raw-data record for
old-task predictions and no way to recover the encoder/heads at an intermediate step. Now
every step dumps, for **every active head**:
stepNN_<task>/
<name>_pred.parquet # raw (composition, true, pred) — regression/clf; or long-form
# (composition, t, true, pred) for kernel regression
<name>_metrics.json # R²/accuracy/MAE/samples
checkpoint.pt # full model state_dict at the end of step N
Any plot can be redrawn from these parquets without retraining. Any intermediate stage of
the encoder is recoverable for downstream analyses (per-task probes, t-SNE of latent
evolution, etc.).
2. Unified run folder
The pipeline (continual_rehearsal_demo → finetune_inverse_heads → paper_inverse_comparison)
now writes into one parent directory by convention:
artifacts/inverse_design_run/
training/ (continual_rehearsal_demo --output-dir)
finetune/ (finetune_inverse_heads --output-dir)
inverse_design/ (paper_inverse_comparison --output-dir)
README.md / ANALYSIS.md / SLIDE_PREP.md (top-level docs)
The baseline TOML's output_dir now points at artifacts/inverse_design_run/training;
finetune and paper_inverse_comparison are invoked with --output-dir sibling subfolders.
One pipeline run = one folder. The hand-off doc (SLIDE_PREP.md) lives at the top so the
slide author finds it without digging.
3. PLAN.md §6 updated to reflect the no-PPT / SLIDE_PREP route (2026-05-23 decision):
stop generating summary.pptx and report.html as deliverables; produce SLIDE_PREP.md +
standard plots + raw arrays instead, let an external slide author make the actual deck.
Also flagged python-pptx as a dep with no consumer (clean up in rebase).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…design
Full-scale sibling of the continual_rehearsal_demo: 24 supervised tasks (16 reg + 7 kr + 1 clf) over 4 inorganic datasets, tiered rehearsal (5 % / 10 % for the inverse-design tail), EarlyStopping on val_final_loss, and 3 inverse-design scenarios. Each scenario walks the same 8 configs as the demo's paper_inverse_comparison (3 ae_align_scale points + 5 composition configs) on a shared 20-seed set (17 top-QC element-system dedup + 3 explicit Au-Ga-Ln formers).
Run is organised under training/ (per-step pred parquet + per-task metrics.json + per-step checkpoint.pt + forgetting trajectory + final_model.pt) and inverse_design/<scenario>/ (8-config boxplot comparison + element-frequency heatmap + per-config result.json + targets.json + summary.json + seeds.json).
Slide-prep deliverables (no auto PPT / HTML): SLIDE_PREP.md (9-section handoff with auto-computed element-discovery list, smoke vs full-run disclaimer, plan §5 expected baselines, slide-author freedom/locked section, raw-data cheat-sheet), ANALYSIS.md, README.md, inverse_design/SUMMARY.md.
Includes --inverse-only CKPT flow so the inverse-design stage can iterate on a saved final_model.pt without retraining. CPU smoke (800 rows / 2 epochs, --accelerator cpu) verified end-to-end; 22 co-located tests.
Plan: docs/continual_rehearsal_full_PLAN.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The plan §5 specifies three independent inverse-design scenarios; the preview pipeline so far
only ran scenario 3 (FE+klat). This commit adds an orchestrator that loops over a TOML
[[inverse_scenarios]] array and writes each scenario's full paper-comparison output into a
sibling subfolder under the unified run directory.
1. samples/continual_rehearsal_demo_config_inverse_baseline.toml
Append three [[inverse_scenarios]] tables — FE down + magnetization up; FE down + tc up +
magnetization up; FE down + klat up. The plan uses 'magnetic_moment' as the magnetic target;
the current 11-task base model has 'magnetization' instead (sibling NEMAD-magnetic column).
The substitution preserves the 'maximise magnetic strength' intent without forcing a
base-model retrain; the discrepancy is loudly documented in the run folder's ANALYSIS.md.
2. src/foundation_model/scripts/paper_inverse_3scenarios.py (new)
Thin orchestrator around paper_inverse_comparison. Reads the [[inverse_scenarios]] array
from the same TOML, replaces inverse_reg_tasks / inverse_reg_targets / output_dir per
scenario via dataclasses.replace, and calls paper_inverse_comparison.run() once per
scenario. Each scenario writes into <output-dir>/<scenario.name>/ as a self-contained
mini paper-comparison run (final_model.pt, seeds.json, results.json, comparison.png,
SUMMARY.md, scenario.json).
Output layout per the plan:
<output-dir>/
scenario1_fe_down_magnetic_up/
scenario2_fe_down_tc_up_magnetic_up/
scenario3_fe_down_klat_up/
Usage:
python -m foundation_model.scripts.paper_inverse_3scenarios \
--config-file samples/continual_rehearsal_demo_config_inverse_baseline.toml \
--checkpoint artifacts/inverse_design_run/finetune/final_model.pt \
--output-dir artifacts/inverse_design_run/inverse_design
The accompanying artefacts (figures, raw arrays, analysis, slide-prep document) for one
end-to-end preview run live under artifacts/inverse_design_run/ (gitignored). 237 tests pass.
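The per-scenario override described above can be sketched roughly as follows. This is a minimal illustration, not the orchestrator's actual code: `ComparisonConfig` is a hypothetical stand-in for the real config dataclass, the scenario dicts mimic what a parsed `[[inverse_scenarios]]` array might look like, and the target values are placeholder z-scores rather than the ones used in the run.

```python
from dataclasses import dataclass, replace
from pathlib import Path


@dataclass(frozen=True)
class ComparisonConfig:
    """Hypothetical stand-in for the paper-comparison config."""

    inverse_reg_tasks: tuple[str, ...]
    inverse_reg_targets: tuple[float, ...]
    output_dir: Path


# Assumed shape of the parsed [[inverse_scenarios]] tables.
# Target values are placeholders, NOT the values used in the real run.
SCENARIOS = [
    {
        "name": "scenario1_fe_down_magnetic_up",
        "reg_tasks": ["formation_energy", "magnetization"],
        "reg_targets": [-2.0, 2.0],
    },
    {
        "name": "scenario2_fe_down_tc_up_magnetic_up",
        "reg_tasks": ["formation_energy", "tc", "magnetization"],
        "reg_targets": [-2.0, 2.0, 2.0],
    },
    {
        "name": "scenario3_fe_down_klat_up",
        "reg_tasks": ["formation_energy", "klat"],
        "reg_targets": [-2.0, 2.0],
    },
]


def per_scenario_configs(base, scenarios, output_root):
    """Return one config per scenario; the base config is never mutated."""
    configs = []
    for scenario in scenarios:
        configs.append(
            replace(
                base,
                inverse_reg_tasks=tuple(scenario["reg_tasks"]),
                inverse_reg_targets=tuple(scenario["reg_targets"]),
                output_dir=Path(output_root) / scenario["name"],
            )
        )
    return configs


base = ComparisonConfig((), (), Path("unused"))
configs = per_scenario_configs(
    base, SCENARIOS, "artifacts/inverse_design_run/inverse_design"
)
for cfg in configs:
    # In the real orchestrator this is where run() would be called once per scenario.
    print(cfg.output_dir.name, cfg.inverse_reg_tasks)
```

Using `dataclasses.replace` keeps every other field of the base config identical across scenarios, so only the targets and the output folder differ between runs.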
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nge, drop underline Underlining new (non-seed) elements on the x-axis of the element-frequency heatmap was visually noisy with tight tick labels. Drop the underline; keep the bold; switch the colour from green / dark-blue to **#E67E22** (deep orange). Rationale for orange: - High contrast against the Blues colormap of the heatmap itself. - Visually distinct from the project's existing palette: #2563EB (composition bars), #55A868 (latent bars), #C44E52 (target lines). Adds a 4th unambiguous accent that doesn't collide with anything readers have already 'learned'. - Bold + a single non-palette colour reads at a glance without the underline glyph clutter. Touches: - src/foundation_model/scripts/continual_rehearsal_full.py _element_frequency_heatmap: change colour, drop the (commented-out) underline plan, update the docstring + title caption. Every doc-emitter further down the file (the cross-scenario README writer, SLIDE_PREP, ANALYSIS) that called the markers 'bold green' is updated to 'bold orange' for consistency. The standalone post-hoc heatmap regeneration scripts in artifacts/ are also updated. Preview artefacts (artifacts/inverse_design_run/inverse_design/scenario*/element_frequency_heatmap.png) regenerated with the new style. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…side run() The element-frequency heatmap (per-method x top-25 elements; bold orange x-tick labels mark elements not in any seed) was previously generated by a throwaway post-hoc script — only continual_rehearsal_full.py and that ad-hoc script knew how to make it. Anyone re-running the preview pipeline via paper_inverse_3scenarios got no heatmap. Fold the rendering into paper_inverse_comparison.run() so every paper-comparison output now also writes element_frequency_heatmap.png to its output_dir. paper_inverse_3scenarios calls run() per scenario, so per-scenario heatmaps appear automatically as part of the standard pipeline output — no separate script needed. - Add _plot_element_frequency_heatmap() helper in paper_inverse_comparison; mirrors the corresponding helper in continual_rehearsal_full._element_frequency_heatmap. - Bold + #E67E22 orange for discovered elements (synced colour); no underline. - Wire the call into run() right after the comparison.png write. - Verified: deleting existing heatmaps and rerunning paper_inverse_3scenarios reproduces all three scenarios' heatmaps automatically. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
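As a rough illustration of the bookkeeping behind the heatmap (not the helper's actual implementation), the sketch below counts in how many formulas each element appears and flags elements that occur in no seed. The function names, the presence-based counting, and the example formulas are all assumptions made for illustration.

```python
import re
from collections import Counter

# One capital letter plus an optional lowercase letter matches an element symbol.
_ELEMENT = re.compile(r"[A-Z][a-z]?")


def elements_in(formula):
    """Set of element symbols present in a formula string."""
    return set(_ELEMENT.findall(formula))


def element_frequency(formulas, top_k=25):
    """Count how many formulas contain each element; keep the top_k."""
    counts = Counter()
    for formula in formulas:
        counts.update(elements_in(formula))
    return counts.most_common(top_k)


def discovered_elements(seed_formulas, optimised_formulas):
    """Elements that appear in optimised outputs but in no seed."""
    seed_elements = set()
    for formula in seed_formulas:
        seed_elements |= elements_in(formula)
    optimised_elements = set()
    for formula in optimised_formulas:
        optimised_elements |= elements_in(formula)
    return optimised_elements - seed_elements


# Illustrative formulas only; not taken from any run output.
seeds = ["Au65 Ga20 Gd15", "Al70 Pd20 Mn10"]
optimised = ["Al0.6 Pd0.2 Pt0.2", "Au0.55 Ga0.30 Gd0.15", "Al0.5 Pd0.3 Pt0.2"]

frequency = dict(element_frequency(optimised))
discovered = sorted(discovered_elements(seeds, optimised))
print(frequency)
print(discovered)
```

A plotting helper would then bold and recolour the tick labels whose symbols fall in the discovered set.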
…tmap The shared demo style (continual_rehearsal_demo._apply_plot_style) sets axes.grid = True globally so every figure picks up rcParams gridlines. On an imshow heatmap, gridlines render at major-tick positions which coincide with each cell's center, drawing lines *through* the cells instead of between them. continual_rehearsal_full._element_frequency_heatmap already calls ax.grid(False) to suppress this; paper_inverse_comparison._plot_element_frequency_heatmap was missing the same line. Adds the call right before tight_layout/savefig, with a comment pointing at the shared style so future drift is obvious. Verified: rerunning paper_inverse_3scenarios regenerates all three scenarios' heatmaps with clean cell boundaries (no centre lines). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR-#18 code review surfaced one latent correctness bug, two typos, a brittle truthy-list guard, two doc-drift items, a missing test file, and a ~400-line copy-paste between the demo and full runners that had already caused one regression. All seven items fixed here.
**Correctness fixes**
- continual_rehearsal_demo.py _plot_kr_sequences: would have raised NameError if len(comps) == 0 (e.g. a KR task whose test split happened to be empty; magnetic_susceptibility is small enough for this to bite). Early-return with a warning; legend now wrapped in is not None guard. The same bug had already been silently fixed in full a couple of PRs ago; the refactor below ensures it can't drift again.
- continual_rehearsal_full.py seeds.json metadata: key "strategy_strategy" → "strategy" (typo; downstream readers expect the unrepeated name).
- continual_rehearsal_full.py scenario QC summary: list and np.mean(list):.3f was a clever but fragile non-empty guard: an empty list returns [], then :.3f formats a list and raises TypeError. Replaced with an explicit _qc_mean helper that returns nan when the list is empty, keeping the summary string uniform.
- continual_rehearsal_full.py SLIDE_PREP table: fixed_tail[0..4] hard-coded an index range that crashes if a smaller-scale config has < 5 tail tasks. Now " → ".join(fixed_tail).
**Documentation drift**
- continual_rehearsal_full.py module docstring claimed four PR #18 paths per scenario; the script actually runs eight (3 latent α sweep + 5 composition configs). Updated docstring + 3 other inline comments / SLIDE_PREP strings that propagated the same number.
**Test coverage**
- New continual_rehearsal_demo_test.py (14 tests): config __post_init__ validation, the material-type 5→3 merge invariant, element-system seed dedup logic (top-QC + explicit Au-Ga-RE append), and the plot_kr_sequences empty-comps regression.
- New continual_rehearsal_common_test.py (10 tests): the pure dumpers + plotters that now live in the new module.
**Refactor: shared helpers in continual_rehearsal_common.py**
- New module collects the 6 truly-pure helpers that demo and full previously each copied: dump_predictions / dump_kr_predictions / dump_metrics / plot_parity / plot_confusion / plot_kr_sequences. Also moves SCATTER_COLOR, MATERIAL_TYPE_CLASSES, and MATERIAL_TYPE_DISPLAY_ORDER to the same place.
- Demo and full now import these as functions and call them inline in _evaluate_task, passing the per-task title via the new title argument so the runner-specific _title() / _display() vocabulary stays in each home file. Bound-method versions deleted from both runners (~250 lines removed; full goes from 2726 → ~2510 lines).
- Backward compat: demo re-exports the constants and _SCATTER_COLOR so existing from continual_rehearsal_demo import _SCATTER_COLOR paths keep working.
- Runner-specific plotters (_plot_forgetting uses self._task_colors; _plot_inverse_design / _plot_inverse_scenario differ in layout) stay as bound methods.
All 282 tests pass (45 in the three new / extended script-test files + the existing 237).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
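The truthy-list guard failure mentioned under the correctness fixes is easy to reproduce in isolation. The sketch below uses the standard library's `fmean` in place of `np.mean`, and `qc_mean` is an illustrative stand-in for the `_qc_mean` helper, whose real signature is not shown in this commit.

```python
import math
from statistics import fmean


def qc_mean(values):
    """Mean of the values, or NaN when the list is empty (stand-in helper)."""
    return fmean(values) if values else math.nan


def fragile_summary(values):
    # `values and fmean(values)` short-circuits to [] for an empty list,
    # and formatting a list with :.3f raises TypeError.
    return f"{values and fmean(values):.3f}"


def robust_summary(values):
    # NaN formats cleanly, so the summary string keeps a uniform shape.
    return f"{qc_mean(values):.3f}"


print(robust_summary([0.2, 0.4]))
print(robust_summary([]))
try:
    fragile_summary([])
except TypeError as exc:
    print("fragile guard failed:", exc)
```

Returning NaN rather than raising keeps the summary line printable for scenarios that happen to produce no rows.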
…plot Adds a per-seed 1:1 visualisation that complements the aggregated element-frequency heatmap. The heatmap shows column-level element concentration per method; this new plot shows, for one chosen composition method, *which seed ended up where* — left column is the seed formula, arrow, right column is the optimiser's decoded output. Both sides are normalised to fractions and rendered as percent (so seed '"Au65 Ga20 Gd15"' and decoded '"Au0.55 Ga0.30 Gd0.15"' both appear as percent-scale numbers). Element *symbols* on both sides are coloured by their appearance count in the **optimised** pool (plasma cmap, dark purple = absent / rare → bright yellow = ubiquitous); a colorbar on the right makes the scale explicit. The amount digits stay in plain black so the formulas remain readable at a glance. Elements that appear in a seed but in 0 optimised outputs render in gray to mark them as 'dropped by the optimiser'. One figure is emitted per seed-based composition config (4 figures per scenario): seed_to_optimized__comp_seed.png seed_to_optimized__comp_seed_5_all.png seed_to_optimized__comp_seed_5_all_element_list.png seed_to_optimized__comp_seed_5_all_element_list_low_diversity.png comp (random) is excluded because its seeds field holds random_start_N placeholders rather than real compositions — there is no per-row correspondence to draw. Helper sits next to the existing comparison + heatmap plotters in paper_inverse_comparison.py and is called from run() so every paper-comparison output now includes the mapping figures automatically. paper_inverse_3scenarios therefore picks them up for free. Tests: - 7 new in paper_inverse_comparison_test.py: - 4 for _parse_formula_to_fractions (raw amounts / pre-fractional / bare-element / empty) - 2 for _plot_seed_to_optimized_mapping (writes PNG; skips on length mismatch / empty) - 1 for round-trip parsing of decoded formulas All 289 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
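The normalisation step (raw amounts and pre-fractional amounts both rendered on a percent scale) could look roughly like the sketch below. It is an assumption-laden illustration: the real `_parse_formula_to_fractions` may tokenise differently, and the regex here only handles flat formulas without parentheses.

```python
import re

# Element symbol followed by an optional integer or decimal amount.
_TOKEN = re.compile(r"([A-Z][a-z]?)\s*(\d+(?:\.\d*)?|\.\d+)?")


def parse_formula_to_fractions(formula):
    """Map each element to its fraction of the total amount (sums to 1)."""
    amounts = {}
    for symbol, amount in _TOKEN.findall(formula):
        # A bare element symbol counts as amount 1.
        value = float(amount) if amount else 1.0
        amounts[symbol] = amounts.get(symbol, 0.0) + value
    total = sum(amounts.values())
    if total <= 0:
        return {}
    return {symbol: value / total for symbol, value in amounts.items()}


def as_percent_formula(formula):
    """Render a formula with percent-scale amounts, one decimal place."""
    fractions = parse_formula_to_fractions(formula)
    return " ".join(f"{s}{f * 100:.1f}" for s, f in fractions.items())


print(as_percent_formula("Au65 Ga20 Gd15"))
print(as_percent_formula("Au0.55 Ga0.30 Gd0.15"))
```

Because both sides go through the same normalisation, a seed written with raw amounts and a decoded output written with fractions end up directly comparable.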
…chNorm1d size-1 batch crash
First attempt at the full-data MPS run crashed at step 1 (`density`) with:
File "src/foundation_model/models/components/fc_layers.py", line 28, in forward
_out = self.normal(_out)
File ".../torch/nn/modules/batchnorm.py", line 193, in forward
return F.batch_norm(...)
File ".../torch/nn/functional.py", line 2815, in batch_norm
_verify_batch_size(input.size())
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 256])
(captured in logs/continual_rehearsal_full_260524_000651.log)
`BatchNorm1d` in training mode requires every channel to see more than 1 value so it can
compute a batch variance. With `shuffle=True` and the upstream default `drop_last=False`,
any train subset whose row count `mod batch_size == 1` will eventually feed a size-1 tail
batch into the encoder and crash mid-epoch — which is exactly what happened on the qc train
split (34322 rows, batch_size=256 → tail of size 34322 % 256 = 18, fine; but on the masked
subsets across continual steps the tail count varies and a size-1 batch is just a matter of
which subset happens to land at size %256 == 1).
Fix: a one-method subclass `_DropLastTrainCompoundDataModule` that overrides
`train_dataloader()` to rebuild the base loader with `drop_last=True`. Only the train loader
is touched; val/test/predict keep `drop_last=False` so every held-out row is still evaluated.
The discarded rows are at most `batch_size − 1` per epoch (~256 / ~35k qc rows ≈ 0.7 %),
well within the rehearsal mask's noise.
Verified by logs/continual_rehearsal_full_260524_001114.log — the second attempt completed
training cleanly through `max_epochs=100 reached` with no errors.
The fix is local to the runner: `CompoundDataModule` upstream is unchanged so this doesn't
affect any other consumer.
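The tail-batch arithmetic behind the crash can be checked without touching the data module. The sketch below is plain Python, not the `_DropLastTrainCompoundDataModule` subclass itself; the function names and the example row counts are illustrative.

```python
def tail_batch_size(n_rows, batch_size):
    """Size of the final partial batch (0 when n_rows divides evenly)."""
    return n_rows % batch_size


def hits_size_one_batch(n_rows, batch_size, drop_last):
    """True if an epoch over n_rows ends with a batch containing one row."""
    return (not drop_last) and tail_batch_size(n_rows, batch_size) == 1


def rows_dropped_per_epoch(n_rows, batch_size, drop_last):
    """Rows skipped each epoch when the partial tail batch is discarded."""
    return tail_batch_size(n_rows, batch_size) if drop_last else 0


# Illustrative subset sizes: 1025 = 4 * 256 + 1 leaves a single-row tail.
for n_rows in (1024, 1025, 1100):
    print(
        n_rows,
        tail_batch_size(n_rows, 256),
        hits_size_one_batch(n_rows, 256, drop_last=False),
        hits_size_one_batch(n_rows, 256, drop_last=True),
    )
```

With `drop_last=True` the single-row tail never reaches the BatchNorm layer, at the cost of at most `batch_size - 1` rows per epoch.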
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nt paths
Per user feedback on the per-seed mapping plot:
1. Font 10 → 13 (formula + parenthetical), row height 0.45 → 0.34 (smaller spacing). Seed and
optimised columns are now closer together so the eye can scan a row in one sweep.
2. Seed-side text is all-black (no element colouring). The colour story now reads as 'how the
optimiser transformed each seed' on the right side only — left is the unchanged input.
3. Colormap plasma → inferno (low end near black, high end bright yellow). Per the user's
'higher contrast at both ends, low values close to black' note; rare-but-present elements
stay near the text colour and ubiquitous elements pop.
4. Seed side gains (QC=XX.X%) parenthetical — the model's baseline P(quasicrystal) for
that seed.
5. Optimised side gains (QC=XX.X%, Δ<task>=±N.NN <arrow>, …) — QC after optimisation, plus
per-target signed deltas (after − seed) with a target-direction arrow (↑ for positive z-
target, ↓ for negative). Sign of Δ vs arrow direction makes 'did this target move correctly?'
readable at a glance. Long task names get short labels via _REG_DISPLAY_SHORT
(formation_energy → FE, magnetic_moment → mm, …) to keep the parenthetical
from pushing into the colourbar.
6. Latent paths (α ∈ {0, 0.25, 1}) now also get a mapping figure each. Previously the helper
was wired only for composition methods with init=='seed'; latent paths decode their
optimised descriptor back to a composition (via KMD.inverse) so per-seed correspondence is
well-defined. Filenames are slugged from align_scale:
seed_to_optimized__latent_align0.png
seed_to_optimized__latent_align0p25.png
seed_to_optimized__latent_align1.png
Plus one persistence change: run() now computes the per-seed *baseline* QC + reg
predictions against x_seed once and stores them in results.json under
seed_predictions so the parenthetical's QC / Δreg values are reproducible from
results.json alone — re-plotting the mapping figure does not need the model loaded again.
Per scenario the output folder now carries 7 mapping figures (3 latent α + 4 seed-based
composition configs) plus comparison.png + element_frequency_heatmap.png.
Tests:
- Updated 4 existing tests in paper_inverse_comparison_test.py for the new
seed_qc / seed_reg / optimized_qc / optimized_reg / reg_targets kwargs.
- Added 2 tests for _target_arrow (positive ⇒ ↑, ≤0 ⇒ ↓).
All 291 tests pass.
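The labelling rules from items 5 and 6 above can be sketched as small pure functions. The names below are illustrative stand-ins (the real helpers are `_target_arrow` and `_REG_DISPLAY_SHORT`, whose exact signatures and contents are not shown here), and the numbers in the example are made up.

```python
# Subset of the short-label mapping mentioned above; other entries are omitted.
REG_DISPLAY_SHORT = {"formation_energy": "FE", "magnetic_moment": "mm"}


def target_arrow(z_target):
    """Up arrow for a positive z-target, down arrow otherwise."""
    return "↑" if z_target > 0 else "↓"


def mapping_filename(align_scale):
    """Slug align_scale into a filename, e.g. 0.25 -> ...latent_align0p25.png."""
    slug = f"{align_scale:g}".replace(".", "p")
    return f"seed_to_optimized__latent_align{slug}.png"


def optimised_label(qc, seed_reg, optimised_reg, reg_targets):
    """Build the '(QC=..., Δtask=... arrow)' parenthetical for one row."""
    parts = [f"QC={qc * 100:.1f}%"]
    for task, z_target in reg_targets.items():
        delta = optimised_reg[task] - seed_reg[task]  # after minus seed
        short = REG_DISPLAY_SHORT.get(task, task)
        parts.append(f"Δ{short}={delta:+.2f} {target_arrow(z_target)}")
    return "(" + ", ".join(parts) + ")"


label = optimised_label(
    qc=0.875,
    seed_reg={"formation_energy": 0.25},
    optimised_reg={"formation_energy": -0.5},
    reg_targets={"formation_energy": -2.0},
)
print(label)
print([mapping_filename(a) for a in (0, 0.25, 1)])
```

A negative delta next to a down arrow reads as "moved in the requested direction", which is the at-a-glance check the parenthetical is meant to support.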
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a per-scenario QC-vs-reg-target scatter figure that complements the existing bar / heatmap / seed-to-optimised views by showing the per-seed output cloud directly. - Latent paths render as circles in a Greens ramp (3 alpha values). - Composition paths render as triangles in a Blues ramp (5 configs). - Green vs Blue keeps the two groups easy to tell apart at a glance (per the user's "two groups' base colors must be easily distinguishable" requirement); the stepped color within each group encodes the parameter sweep ordering so the reader can read it off the legend. - One panel per secondary regression target; each panel pins the joint target with red dashed lines at QC=1.0 and the reg-target. - A single figure-level legend at the bottom lists every method label across all panels, plus a target-line entry. Wired into run() so every scenario gets qc_vs_secondary_scatter.png written alongside the existing comparison / heatmap / seed-to-optimised figures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…thms Adds a 21:9 white-background diagram comparing optimize_latent and optimize_composition on three rows per column: - Flow diagram (top): where the optimisation variable lives and the forward path through the model. Latent shows the AE round-trip detour with the alignment-penalty return arrow in red; composition shows the straight logits-to-recipe path with 'w is the reported recipe' callout. - Loss decomposition (middle): both methods share the regression-MSE + (-log P(QC)) backbone; the third term (alpha-AE-alignment vs diversity-entropy) is highlighted in red on the side it applies to. - Tunable parameters (bottom): two-line entries (bold accent name + dim meaning) so the description column never runs off the column edge. The figure is meant to live alongside the plan in docs/ — the script is checked in so future param/loss changes regenerate the same diagram. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the earlier overview figure (the diagram was too cluttered
to be useful) with a markdown reference that lays out, for each method:
* the optimisation variable;
* the loss with each term named separately (regression / classification
backbone shared between both methods; the differentiating third term
is highlighted);
* what each term is for, in plain-English design intent;
* any enforced constraints that don't live in the loss
(simplex, allowed_elements, element_step_scale, seed_blend);
* the user-facing parameter table with range / default / meaning.
A final side-by-side summary table pins the two methods' differences
(opt variable, where the reported recipe comes from, method-specific
loss term, failure mode, method-specific knobs) in one place — written
so the formulas can be transcribed directly into slides.
Removes docs/figures/inverse_design_algorithms_overview.{py,png}; the
markdown is now the canonical reference.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…41 -> 48 elements)
Adds the full 5d transition-metal row (Hf, Ta, W, Re, Os, Ir, Pt) to the
constrained-composition path's element whitelist, inserted between Cd (end
of 5th-period TMs) and Au so the 6th-period TM block is contiguous. The
addition broadens the heavy-TM coverage of the composition search and lets
the optimiser reach refractory / noble-metal i-QC families (Hf-Pd, Ta-Ni,
Ir-based phases, ...).
Updates both DEFAULT_ALLOY_PALETTE (paper_inverse_comparison.py) and
ALLOY_PALETTE (continual_rehearsal_full.py), the length assertion, and
the palette-membership test. Hardcoded '41-element' strings in the
auto-generated SLIDE_PREP.md sections of continual_rehearsal_full.py are
made dynamic via f'{len(ALLOY_PALETTE)}-element' so future palette
extensions don't need a string sweep. The 'Plan §5 + 41-elem-smoke
baselines' line is left intact -- it's a historical attribution to the
actual smoke runs done with 41 elements.
Re-ran paper_inverse_3scenarios on the existing fine-tuned checkpoint
(no retraining). Pt is picked up systematically by the constrained
composition path in scenario1 (FE-down + mag-up, 6 of 20 outputs) and
scenario3 (FE-down + klat-up, 7 of 20 outputs), and appears as a bold
orange 'discovered' element in the heatmap; Hf and Ta are picked up
occasionally by the latent path on scenario3. The other 5d TMs (W, Re,
Os, Ir) weren't selected in this run -- having them in the palette
costs nothing and keeps the search space honest.
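The palette change amounts to splicing a block of elements in after Cd and deriving any element count from the list itself. The sketch below uses a deliberately abbreviated toy palette, not the real 41-element `ALLOY_PALETTE`, and `extend_palette` is a hypothetical helper written only for this illustration.

```python
# The 5d transition metals added in this commit.
FIVE_D_TMS = ["Hf", "Ta", "W", "Re", "Os", "Ir", "Pt"]


def extend_palette(palette, new_elements, after):
    """Insert new_elements right after `after`, skipping any already present."""
    idx = palette.index(after) + 1
    additions = [e for e in new_elements if e not in palette]
    return palette[:idx] + additions + palette[idx:]


# Abbreviated toy palette for illustration only.
toy_palette = ["Al", "Ga", "Pd", "Ag", "Cd", "Au"]
extended = extend_palette(toy_palette, FIVE_D_TMS, after="Cd")

# Derive the count from the list so doc strings never go stale.
caption = f"{len(extended)}-element palette"
print(extended)
print(caption)
```

Deriving the caption from `len(...)` is the same idea as the `f'{len(ALLOY_PALETTE)}-element'` change described above: a later palette extension then needs no string sweep.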
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
One-page summary written so each bullet maps to a slide or paper paragraph. Covers the two user-stated headlines plus six supporting points pulled from the three-scenario sweep results:
1. Multi-task foundation model + gradient inverse design is an effective recipe for multi-objective optimisation (one checkpoint, no retraining across the 3 scenarios).
2. For QC, the differentiable-KMD composition path gives more controllable and chemically meaningful results than latent-space optimisation (recipe-is-output vs AE-roundtrip, simplex by construction, etc.).
3. The two methods are complementary, not competing -- latent surfaces the model's internal attractors, composition generates recipes.
4. Element discovery is real: Pt picked up systematically in scenario 1 (6/20) and scenario 3 (7/20) despite not being in any seed, as part of an Al-Pd-Pt ternary that converges repeatedly. Pd, Hf, Ta similar.
5. The user-facing knobs (ae_align_scale, diversity_scale, seed_blend) are all in [0, 1] with intuitive meanings.
6. The 3 scenarios stress-test conflicting objectives (FE-down vs QC-up, FE-down vs klat-up); the model negotiates the trade-offs rather than collapsing to a trivial point.
7. The pipeline is end-to-end automated -- one orchestrator run produces 30 figures + 3 results.json across the scenarios; configs / seeds / checkpoints saved per run for reproducibility.
8. Honest limitations: chemistry-aware whitelist != synthesisability; targets are z-scored; latent alpha=0 is the control, not the recommendation.
Cross-links to docs/inverse_design_algorithms.md and the plan doc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y scatter Adds an optional seed-layer to _plot_qc_vs_reg_scatter: when called with seed_qc and seed_reg (the per-seed baseline predictions already saved in results.json under seed_predictions), the 20 seeds are drawn as orange ★ stars beneath the optimised clouds. The reader can now read seed→optimized as a 'where did the optimiser push each seed' picture rather than just an absolute scatter. Design: - marker ★ (star) is distinct from ○ (latent) and △ (composition); - color reuses the project's discovered-element orange (#E67E22) so it sits in a third color family, not Greens/Blues/red-target-lines; - s=110 (slightly bigger than the 64 used for optimised markers) so the seed cloud reads as the anchor; - drawn first (zorder=2) so optimised clouds overplot, then the seed cloud peeks out only where no optimised point covers it. The legend gains a single 'seed (baseline)' entry at the start (before the latent paths and composition paths) so the legend order matches the visual story: seeds → latent → composition → target. run() now passes seed_qc / seed_reg through; the existing tests for the helper without seeds still pass (kwargs are optional). One new test covers the seed-layer path. Existing scenario artefacts under artifacts/inverse_design_run/inverse_design/ were regenerated from the existing results.json files (the seed_predictions field is already persisted there, so this didn't require re-running the optimisation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o_optimized figures The runner used to emit only comparison.png (boxplot) and element_frequency_heatmap.png per scenario; the per-seed scatter and 1:1 seed-to-optimised mapping figures lived only in the demo's paper_inverse_comparison.py. Imports those two helpers and wires them into the per-scenario loop right after the existing plot calls. No training-loop changes: both helpers consume the same per-path 'paths' dict and the per-scenario 'before_qc' / 'before_reg' arrays that are already computed for the existing plots, so no extra forward passes and no checkpoint touch-up. Plots produced per scenario (new in this commit): - qc_vs_secondary_scatter.png — per-seed cloud, latent ○ Greens vs composition △ Blues, seeds rendered as orange ★ stars (the demo's newest seed-baseline layer carries through). - seed_to_optimized__<path_key>.png × 7 — one per non-random path (3 latent + 4 comp; comp_random skipped because its seeds field is a random_start_N placeholder, no per-row correspondence with the seeds list). Reusing the demo's helpers (rather than re-implementing in the runner) keeps the two surfaces from drifting on plot style or legend ordering — this was the same fix pattern as the PR #18 K=0 NameError that shipped in demo for several PRs because the plot helpers had drifted between the two scripts. The new test_demo_inverse_plot_helpers_imported test pins the import wiring so a future rename or relocation breaks the test loudly rather than silently losing two figure groups (which the runner's training-loop smoke test would never catch on its own). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ft, hardcoded strings)
Code-review pass; all fixes are correctness/style touch-ups, no behaviour
changes to the training loop or the inverse-design objectives.
Critical:
- optimize_latent: snapshot every parameter's requires_grad alongside
was_training and restore on exit. Previously only training mode was
restored, so a subsequent model.fit() in the same session silently
found every encoder/head parameter frozen ('training stopped moving
the weights'). Mirrors the pattern already used by optimize_composition.
Pinned by test_optimize_latent_restores_requires_grad_after_call.
Should-fix:
- continual_rehearsal_full.py: the 'Restricts the support set to the 41
feasible alloy elements' string in the SLIDE_PREP.md generator now reads
from len(ALLOY_PALETTE), matching the 2026-05 Hf-Pt 5d TM bump (48).
- samples/continual_rehearsal_full_config.toml: alloy palette comment and
list both updated to the 48-element set (was the 41-element list, so
loading this TOML would have silently downgraded the search space).
- _dedupe_by_element_system in continual_rehearsal_full.py now matches
demo's empty-key guard ('if not key or key in seen: continue') -- a
malformed composition was a crash in full but a silent skip in demo;
drift removed.
- ContinualRehearsalConfig.__post_init__: reject inverse_n_seeds <= 0
(was silently returning only the explicit_append entries) and reject
inverse_ae_align_scale not in [0, 1] (was caught much later inside the
model -- the message now points at the TOML field instead).
- continual_rehearsal_demo --inverse-only: drop the duplicate
'Done. Outputs in ...' log line.
Test coverage:
- New test_optimize_latent_restores_requires_grad_after_call.
- New tests pinning the two new config validators.
- Extracted _finalise's strategy+explicit merge logic into a classmethod
_merge_strategy_and_explicit, plus three tests covering: explicit-append
drops overlapping strategy seeds, n_strategy is post-dedup cap, empty
appended is a no-op.
- New finetune_inverse_heads_test.py: 4 tests for freeze_except's contract
(encoder frozen, kept heads trainable, task_log_sigmas frozen even when
the learnable balancer is enabled, unknown keep_head silently freezes
everything).
All 307 existing+new tests pass (model + scripts + data); no other
behaviour changes.
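The two new config validators can be sketched as a small dataclass. `InverseSettingsSketch` is a hypothetical stand-in carrying only the two fields named above; the real `ContinualRehearsalConfig` has many more fields, and its defaults and error messages may differ.

```python
from dataclasses import dataclass


@dataclass
class InverseSettingsSketch:
    """Hypothetical two-field stand-in for the real config class."""

    inverse_n_seeds: int = 20
    inverse_ae_align_scale: float = 0.25

    def __post_init__(self):
        # Fail at load time so the message points at the config field,
        # not at a later failure deep inside the model.
        if self.inverse_n_seeds <= 0:
            raise ValueError(
                f"inverse_n_seeds must be > 0, got {self.inverse_n_seeds}"
            )
        if not 0.0 <= self.inverse_ae_align_scale <= 1.0:
            raise ValueError(
                "inverse_ae_align_scale must be in [0, 1], "
                f"got {self.inverse_ae_align_scale}"
            )


def try_build(**kwargs):
    """Return the error message for a bad config, or None if it is valid."""
    try:
        InverseSettingsSketch(**kwargs)
    except ValueError as exc:
        return str(exc)
    return None


print(try_build())
print(try_build(inverse_n_seeds=0))
print(try_build(inverse_ae_align_scale=1.5))
```

Writing the range check as `not 0.0 <= x <= 1.0` also rejects NaN, which a pair of separate `<` and `>` comparisons would let through.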
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… inverse design added)
Audit of README + ARCHITECTURE against HEAD found multiple drifted
claims; rewriting both so they describe what the code actually does.
Removed (these features were deleted by the deposit-layer cleanup
refactor or never existed in the current branch):
- 'Multi-modal fusion / structure encoder / x_structure / D_struct'
— no structure encoder in the codebase; only x_formula.
- 'Deposit Block (Linear + Tanh) / D_deposit' — removed in the
Deposit Layer Cleanup refactor (see CHANGES.md). The tanh is now
applied directly at the FlexibleMultiTaskModel level on the raw
latent_dim output of the encoder.
- 'Pre-training: contrastive, cross-reconstruction, masked-feature
(MFM), --pretrain flag' — none exist; tests assert their absence.
- '--freeze_encoder' — not a flag; the analogue is
shared_block_optimizer.freeze_parameters on the model config.
- 'shared_block_dims' as a primary architecture knob — gone; the
actual entry is encoder_config.hidden_dims (MLPEncoderConfig) or
d_model/num_layers/nhead (TransformerEncoderConfig).
- 'components/ — encoders, fusion, SSL' — only fc_layers and
foundation_encoder remain.
Added (the PR #18 inverse-design surface, undocumented until now):
- AutoEncoderHead — described in the heads table and the diagram;
explicitly called out as the prerequisite for
optimize_latent(optimize_space='latent').
- KernelRegressionHead — described alongside the other heads with
its (B, L, 1) output shape and t-sequence input.
- Per-class classification weights — described in the heads table
and the loss section.
- optimize_latent and optimize_composition — full table of which
optimisation variable, which method-specific loss term, and
which user-facing knobs (ae_align_scale / diversity_scale /
seed_blend / allowed_elements / element_step_scale /
class_target_weight) are on each side.
- End-to-end pipeline section (continual_rehearsal_demo →
finetune_inverse_heads → paper_inverse_3scenarios) with sample
commands and a pointer to the per-scenario output layout.
- Cross-links to docs/inverse_design_algorithms.md (method
reference) and docs/qc_inverse_design_summary.md (8 headline
messages from the 3-scenario sweep).
Also updated:
- README architecture diagram redrawn: no deposit, no structure,
explicit model-level tanh, AE head present.
- ARCHITECTURE diagram + dataflow table: same redraw, plus a
separate inverse-design diagram contrasting the two methods.
- Project Structure tree in ARCHITECTURE updated to match the
current src layout (components/, task_head/, scripts/ all listed
accurately).
- model_config.py:78 TransformerEncoderConfig docstring: 'before
passing them into the deposit layer' → 'before the model-level
tanh and the task heads'.
No behavior changes; 20/20 model_config tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tmap functionality
- Removed `AutoEncoderTaskConfig` and replaced it with a private `_AEConfig` in the model configuration.
- Updated `FlexibleMultiTaskModel` to support autoencoder head creation with new parameters.
- Modified the `AutoEncoderHead` to utilize the new configuration and activation functions.
- Changed the seed selection strategy in continual rehearsal to use the test split for more accurate predictions.
- Enhanced the element frequency heatmap plotting function to highlight discovered elements and improved its integration across various scripts.
- Updated documentation and configuration files to reflect changes in seed selection and autoencoder integration.
- Added tests for the new autoencoder functionality and element frequency heatmap.
…+ GIF/HTML/SVG animations
Records per-step trajectory data and surfaces it visually so the user can answer
'do the targets converge together, or does the recipe stabilise early and the
targets keep moving?' Motivated by the observation that the same seed produces
markedly different optimised compositions depending on the scenario's targets —
trajectories let us see *when* and *how* the optimisation paths diverge.
Model layer (flexible_multi_task_model.py):
- optimize_composition now accepts record_weights_trajectory: bool; when on,
it returns a per-step (steps, B, n_components) weights tensor alongside the
existing target trajectory.
- optimize_latent now accepts record_input_trajectory: bool; when on, it
snapshots the per-step input each iteration (for input-space) or the AE-decoded
input (for latent-space). Returned as (B, R, steps, input_dim).
- Both result namedtuples gained an optional trajectory field with default=None
for backwards compatibility.
Path runners (eval_inverse_methods._run_latent_method,
paper_inverse_comparison._run_composition_config):
- Both gained a record_trajectory flag. The latent runner additionally calls
KMD.inverse on each per-step decoded input so the trajectory reports per-step
compositions, not just per-step descriptors (composition runner gets weights for
free since they are the optim variable already).
- Output dict carries trajectory_targets (steps, B, T) and trajectory_weights
(steps, B, n_components) when recording is on.
paper_inverse_comparison.run() now:
- Accepts record_trajectory, per_seed_trajectories and
animation_formats kwargs (forwarded by both CLI entry points).
- Persists the per-path trajectories as compressed .npz under
<scenario>/trajectories/<path_slug>.npz instead of inlining them in
results.json (which would balloon to ~36MB/scenario otherwise). results.json
carries a trajectory_file reference per path.
- Calls into the new paper_inverse_trajectory module to emit per-path:
* trajectory__<slug>.png — static line plot, normalised progress vs step, all
targets on the same y-axis. 0 = at seed, 1 = at target. Reveals the headline
finding for the user's question: e.g. in scenario 3 (FE↓+klat↑), klat
overshoots its target by step ~100 while formation_energy crawls and only
reaches ~30 % at step 300.
* trajectory__<slug>.gif — same line plot + a per-step composition bar chart
(top-K elements by weight) of the best-per-target seed.
* trajectory__<slug>.html — self-contained single-file HTML (via to_jshtml,
embeds frames as base64 PNGs — no _frames/ side-folder).
* trajectory__<slug>.svg — handwritten SMIL-animated single-file SVG (plays
in any modern browser; PowerPoint cannot embed it directly — use the GIF).
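The normalised-progress axis described above (0 = at seed, 1 = at target, values above 1 = overshoot) can be sketched as follows. This is an illustration under stated assumptions: the function name `normalised_progress`, the `(steps, B, T)` input layout, and the NaN handling for a seed that already sits on its target are hypothetical and may differ from the helpers in `paper_inverse_trajectory`.

```python
import numpy as np


def normalised_progress(trajectory, targets, eps=1e-12):
    """Map raw target trajectories onto a shared 0..1 progress axis.

    trajectory: (steps, B, T) predicted target values per step and seed
    targets:    (T,) requested target values
    Returns an array of the same shape: 0 at the seed value, 1 at the target.
    """
    trajectory = np.asarray(trajectory, dtype=float)
    targets = np.asarray(targets, dtype=float)
    baseline = trajectory[0]                      # (B, T) per-seed starting values
    span = targets - baseline                     # distance each seed has to travel
    span = np.where(np.abs(span) < eps, np.nan, span)  # undefined if already at target
    return (trajectory - baseline) / span


# Three steps, one seed, two targets (made-up numbers for illustration only).
demo_traj = np.array([[[0.0, 1.0]], [[-1.0, 2.5]], [[-2.0, 2.5]]])
demo_targets = np.array([-2.0, 2.0])
demo_progress = normalised_progress(demo_traj, demo_targets)
mean_progress = np.nanmean(demo_progress, axis=1)  # average over seeds -> (steps, T)
```

In this toy input the second target ends at progress 1.5, the same kind of overshoot the commit reports for klat, while the first target reaches exactly 1.0.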
New CLI flags on both paper_inverse_comparison and paper_inverse_3scenarios:
- --record-trajectory / --no-record-trajectory (default on)
- --per-seed-trajectories (default off; mean across seeds is the default view)
- --animation-formats {gif,html,svg,none} [...] (multi-select; default gif)
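A sketch of how the three flags listed above could be declared with the standard-library `argparse`. The flag names and defaults come from the commit message; the parser name `build_parser` and modelling `--per-seed-trajectories` as a plain `store_true` switch are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags listed in the commit message."""
    parser = argparse.ArgumentParser(prog="paper_inverse_comparison")
    # Paired --record-trajectory / --no-record-trajectory switch, default on.
    parser.add_argument(
        "--record-trajectory",
        action=argparse.BooleanOptionalAction,  # requires Python 3.9+
        default=True,
    )
    # Default off at this commit; assumed to be a simple store_true flag.
    parser.add_argument("--per-seed-trajectories", action="store_true")
    # Multi-select: one or more formats after the flag, default gif only.
    parser.add_argument(
        "--animation-formats",
        nargs="+",
        choices=["gif", "html", "svg", "none"],
        default=["gif"],
    )
    return parser
```

For example, `build_parser().parse_args(["--no-record-trajectory", "--animation-formats", "gif", "svg"])` yields `record_trajectory=False` and `animation_formats=["gif", "svg"]`.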
Default 'mean' view: targets averaged across the 20 seeds with per-seed-baseline
normalisation; the comp panel in the animation uses the seed minimising joint
distance to all targets (best_seed_by_target_distance).
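One plausible reading of "the seed minimising joint distance to all targets" is an argmin over a per-seed Euclidean distance in target space. The sketch below assumes that reading; the function name `pick_best_seed`, the optional `scale` argument, and the use of final-step values are hypothetical and not taken from the real `best_seed_by_target_distance` helper.

```python
import numpy as np


def pick_best_seed(final_values, targets, scale=None):
    """Return the index of the seed whose final values sit closest to the targets.

    final_values: (B, T) target predictions at the last optimisation step
    targets:      (T,) requested target values
    scale:        optional (T,) per-target scale so no single unit dominates
    """
    final_values = np.asarray(final_values, dtype=float)
    targets = np.asarray(targets, dtype=float)
    diff = final_values - targets
    if scale is not None:
        diff = diff / np.asarray(scale, dtype=float)
    distance = np.linalg.norm(diff, axis=1)       # joint distance per seed, shape (B,)
    return int(np.argmin(distance))
```

Passing a `scale` (for instance the per-target standard deviation) keeps a target with large numeric range from deciding the pick on its own.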
Tests: paper_inverse_trajectory_test.py covers the pure helpers (best-seed
picker, progress normalisation) plus smoke tests for the gif/html/svg writers
and the mismatched-shape skip path. All 9 tests pass; existing 153-test suite
unaffected.
Re-ran paper_inverse_3scenarios on the existing finetune/final_model.pt;
each scenario now has trajectories/ with 8 .png + 8 .gif + 8 .html + 8 .npz
(no retraining; artefacts gitignored as usual).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ics observations
Extends the QC inverse-design study summary with the findings from the new
per-step trajectory tooling (commit 8041b71). Adds:
- New headline #8: 'Per-step optimisation trajectories explain why the same
seed → different scenarios → different recipes.' Carries a 3-row table of
cross-scenario observations on the headline comp (seed, 5% all, element list)
path:
* scenario 3 (FE↓ + klat↑): klat overshoots progress ≈ 1.5 by step ~100 and
plateaus; FE crawls to ~0.32 across all 300 steps.
* scenario 1 (FE↓ + mag↑): magnetisation is a *stuck* target (progress ~0.01);
FE crawls to ~0.26.
* scenario 2 (FE↓ + tc↑ + mag↑): FE and tc rise together to ~0.22 (cleanly
coupled); mag plateaus at ~0.08.
- Three interpretive takeaways:
(a) 'same seed → different recipe across scenarios' is the dominant target
taking over the gradient in the first 50-100 steps;
(b) inverse_steps=300 is enough headroom (most paths flatline by ~150);
(c) klat overshoot (progress > 1.0) is honest signal: the joint loss keeps
falling on the other axes.
- Renumbers the limitations section to #9.
- Extends section #7's artefact list to include
trajectories/<slug>.{png,gif,html,npz} per path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Short hook-up guide for any runner that wants the per-step trajectory artefacts.
Targeted at the continual_rehearsal_full agent (and any future runner); explains
the 3 minimal steps to wire in:
1. Add record_weights_trajectory=True to optimize_composition (or
record_input_trajectory=True to optimize_latent; note that the latent path
needs an extra KMD.inverse per step to recover the per-step element weights).
2. Persist as compressed npz (the inline-JSON alternative balloons results.json
to ~36 MB / scenario).
3. Call plot_trajectory_static + plot_trajectory_animation; the helpers handle
mean-across-seeds for the line plot and pick the best representative seed for
the comp panel.
Includes the per-step QC caveat (model trajectory only records reg targets; QC
is synthesised flat and dropped from the plot; a full QC curve requires an extra
forward pass on per-step weights). Cross-links to the reference implementation
in paper_inverse_comparison.run()._emit_trajectory_outputs and lists the CLI
flags to mirror.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
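Step 2 of the guide (persist as compressed npz and keep only a reference in results.json) could look roughly like this. The array keys `trajectory_targets` / `trajectory_weights`, the `trajectories/<path_slug>.npz` location, and the `trajectory_file` field come from the commit messages; the function names `save_trajectory` and `results_entry` are assumptions for illustration.

```python
import json
from pathlib import Path

import numpy as np


def save_trajectory(scenario_dir, path_slug, targets_traj, weights_traj):
    """Write one path's trajectories to <scenario_dir>/trajectories/<path_slug>.npz.

    Returns the file location relative to scenario_dir (POSIX style) so it can
    be stored in results.json instead of the arrays themselves.
    """
    scenario_dir = Path(scenario_dir)
    traj_dir = scenario_dir / "trajectories"
    traj_dir.mkdir(parents=True, exist_ok=True)
    npz_path = traj_dir / f"{path_slug}.npz"
    np.savez_compressed(
        npz_path,
        trajectory_targets=np.asarray(targets_traj),   # (steps, B, T)
        trajectory_weights=np.asarray(weights_traj),   # (steps, B, n_components)
    )
    return npz_path.relative_to(scenario_dir).as_posix()


def results_entry(path_slug, trajectory_file):
    """Hypothetical results.json fragment: a small reference, no inlined arrays."""
    return json.dumps({"path": path_slug, "trajectory_file": trajectory_file})
```

A reader can then reopen the arrays on demand with `np.load(scenario_dir / trajectory_file)`, which keeps results.json small regardless of the number of steps or seeds.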
…tinual_rehearsal_full example
Two additions in response to the follow-up question 'is this a common
module or written directly into the runner?':
1. New 'Where this module lives (and why)' section at the top — explicit
table of which files were touched (paper_inverse_trajectory.py NEW;
paper_inverse_comparison.py wires it in) vs untouched (the two
continual_rehearsal_* runners + the common module). Rationale: the
paper_inverse_* family is the post-training analysis layer, and
continual_rehearsal_common holds training-loop helpers — a single
consumer doesn't justify promoting to common.
2. New 'Worked example — continual_rehearsal_full.py' section with the
three concrete edits an agent has to make:
* Edit A: _run_latent_path gains a record_trajectory kwarg; passes it
through to optimize_latent's record_input_trajectory; decodes the
per-step input via self._kmd.inverse to get the per-step weights.
* Edit B: _run_composition_path mirror — trivial since
optimize_composition's weights_trajectory is already on the right
surface.
* Edit C: in the scenario loop (the existing paths dict block), save
the per-path npz under sc_dir/trajectories/, then call
plot_trajectory_static + plot_trajectory_animation, then free the
in-memory trajectory arrays. Reuses _path_slug from
paper_inverse_comparison so filenames match.
Each edit is given with the call-site line number and the minimal diff
needed (new arg, new line, new block). The agent should be able to
apply them in 10-20 minutes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…+ seed-major layout
User feedback after first round:
1. Title only showed 'seed 18', need the actual chemistry formula visible.
2. The single best-representative seed isn't enough — need all 20 per-seed
trajectories to compare how the same seed behaved across the 8 paths.
Changes:
paper_inverse_trajectory.py:
- plot_trajectory_static + plot_trajectory_animation gain an optional
seed_composition: str kwarg. Rendered as monospace under the bold main
title (e.g. 'seed: Au65 Ga20 Gd15'). Earlier draft put both in the
same y position via ax.text(y=1.02); fixed via title pad=22 + a
second text at y=1.005 so the layout no longer collides.
paper_inverse_comparison.py:
- --per-seed-trajectories now defaults ON (BooleanOptionalAction); pass
--no-per-seed-trajectories to skip the bulk.
- _emit_trajectory_outputs now uses seed-major layout:
trajectories_per_seed/seed{NN}/<path>.{png,gif,html}. Workflow win:
'compare seed X across all 8 paths' is opening one folder, not 8.
(Path-major would have been 480 PNGs in 8 folders — same total count,
but mental friction for the cross-path comparison the user actually does.)
- Both mean and per-seed plots get seed_composition wired through:
mean uses the best-seed's composition; per-seed uses each row's
seeds[i]. For comp_random the seeds[i] is the 'random_start_N'
placeholder, which is fine.
- New 'seeds' kwarg on _emit_trajectory_outputs so the caller can pass
the master seed-string list once.
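The seed-major layout above can be illustrated with a small path helper. The directory pattern `trajectories_per_seed/seed{NN}/<path>.{png,gif,html}` is from the commit message; the helper name `per_seed_output_path`, the two-digit zero padding for `NN`, and the example path slugs are assumptions.

```python
from pathlib import Path


def per_seed_output_path(scenario_dir, seed_index, path_slug, ext):
    """Build <scenario_dir>/trajectories_per_seed/seedNN/<path_slug>.<ext>."""
    seed_dir = Path(scenario_dir) / "trajectories_per_seed" / f"seed{seed_index:02d}"
    return seed_dir / f"{path_slug}.{ext}"


def files_for_seed(scenario_dir, seed_index, path_slugs, exts=("png", "gif", "html")):
    """All outputs for one seed across every path; they share a single folder."""
    return [
        per_seed_output_path(scenario_dir, seed_index, slug, ext)
        for slug in path_slugs
        for ext in exts
    ]
```

Because every file returned by `files_for_seed` has the same parent directory, comparing one seed across all paths means opening one folder, which is the workflow argument made for this layout.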
paper_inverse_3scenarios.py:
- --per-seed-trajectories mirrored to default ON (BooleanOptionalAction).
continual_rehearsal_full.py:
- (in same diff, applied by a separate agent following
docs/trajectory_integration.md) wires the same trajectory feature into
the full runner's inverse-design loop. Per-seed default kept OFF here
(the full runner is a multi-hour training pipeline; per-seed plotting
adds ~2h on top, opt-in is the right default at that scale).
paper_inverse_trajectory_test.py:
- New test_plot_trajectory_static_with_seed_composition pins the new
kwarg's contract.
docs:
- trajectory_integration.md updated with: new defaults, seed-major
layout description, the worked example for continual_rehearsal_full
now matches the actual wiring, 'per-seed title convention' note.
- qc_inverse_design_summary.md section 7 artefact list updated.
Reran paper_inverse_3scenarios on the existing checkpoint:
3 scenarios × 8 paths × 20 seeds × (png + gif + html) = 1440 per-seed
trajectory files, plus 24 mean trajectories. Total ~6.9 GB across
artifacts/inverse_design_run/inverse_design/scenario*/ (gitignored).
~130 min wall clock for the full per-seed render.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…essions)
Single-page reference for extending the inverse-design surface — written
so a future session can add capabilities (specify element count, fix
specific amounts, min-weight floor) without spelunking through 600+
lines of optimize_composition first.
Covers:
- Where the two entry points live (optimize_latent vs optimize_composition).
- What's already there on the composition path: all 7 user-facing kwargs
in one table with what they do + where they're implemented.
- The single point of leverage: _w_from_logits inside optimize_composition.
Every simplex-projection / hardening rule belongs there; gradient flows
correctly through any differentiable rewriting. Doc explains the
pattern (validate kwarg → compute one-time state → apply in
_w_from_logits).
- Three extension sketches the user has flagged:
A. 'max_elements: int' — top-K hardening (with the K-th boundary
gradient note); ~10 lines in _w_from_logits.
B. 'fixed_amounts: {symbol: fraction}' — reuses the existing
locked_mask infrastructure; no _w_from_logits change needed.
C. 'min_nonzero_weight: float' — floor + renormalise in _w_from_logits.
- Code-location map (docstring / arg-block / one-time setup / per-step /
loss-term / trajectory / tests).
- Pre-merge checklist: keep surgical-edits pattern, one kwarg per PR,
pin the contract with at least one end-to-end test (mention the two
existing reference tests to mimic).
The latent path is more rigid (variable is h, not simplex); new
constraint features generally only make sense for optimize_composition.
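Extension sketches A and C can be pictured as extra projection steps applied after the softmax. The version below is a NumPy illustration only: the real `_w_from_logits` operates on torch tensors inside `optimize_composition`, and the function name `w_from_logits`, the tie handling, and the reading of "floor" as "drop weights below the floor, then renormalise" are assumptions.

```python
import numpy as np


def w_from_logits(logits, max_elements=None, min_nonzero_weight=None):
    """Softmax onto the simplex, then optional hardening rules.

    logits:             (B, n_components)
    max_elements:       keep only the K largest weights per row (sketch A);
                        assumed to satisfy 1 <= K <= n_components
    min_nonzero_weight: drop weights below this floor (sketch C)
    Each rule renormalises so rows still sum to 1.
    """
    logits = np.asarray(logits, dtype=float)
    z = logits - logits.max(axis=-1, keepdims=True)
    w = np.exp(z)
    w = w / w.sum(axis=-1, keepdims=True)

    if max_elements is not None:
        # K-th largest weight per row; exact ties at the boundary are all kept.
        kth = np.sort(w, axis=-1)[..., -max_elements][..., None]
        w = np.where(w >= kth, w, 0.0)
        w = w / w.sum(axis=-1, keepdims=True)

    if min_nonzero_weight is not None:
        # Always keep the row maximum so a row can never become all zeros.
        top = w.max(axis=-1, keepdims=True)
        keep = (w >= min_nonzero_weight) | (w == top)
        w = np.where(keep, w, 0.0)
        w = w / w.sum(axis=-1, keepdims=True)

    return w
```

In a torch version the same masks would multiply the softmax output, so gradients reach only the surviving elements; the entry sitting exactly at the K-th boundary is the case the commit's gradient note refers to.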
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Inverse design — paper-grade comparison + plot pass + per-element constraints
This branch is the consolidated set of improvements that grew out of the `artifacts/continual_rehearsal_full` triage. It (a) fixes everything you flagged on the demo plots, (b) reshapes `material_type` into the three classes used by the paper (AC / QC / others), (c) adds the two inverse-design knobs we needed to compare cleanly (cycle-consistency loss in latent space, differentiable KMD in composition space), and (d) packages the comparison into a self-contained paper-materials folder.
The branch is rebased on master after #17 merged, so the differentiable KMD lives upstream and is consumed here without any local fork of `kmd_plus`.
What's in it
Roughly in dependency order:
- `Composition.formula` is the canonical key (83db854). Drops the `formula` column on the qc parquet; the only join key for cross-source alignment is now `composition`.
- … (62b8b0f); used to keep the QC minority alive when training `material_type`.
- … (38b84b3):
  * `material_type`: merged IAC+DAC → AC, IQC+DQC → QC (3 classes). Confusion is row-normalized percentages, axes ordered `others → AC → QC` from bottom-left.
  * … `#2563EB` α=0.55. R² baked into the panel. Title is property + unit only.
  * … (`dos_density`, `power_factor`): per-composition R², figure-level legend top-left aligned to the first panel's left edge, no overlap with the data.
  * `savefig(dpi=150)` everywhere; no more "(new)" suffix.
- … (b630ae5):
  * … `formation_energy → klat → material_type` regardless of what precedes them.
- `class_target_weight` in `optimize_latent` (7117b2b).
- … (9fd3969): `λ · ‖tanh(encoder(AE.decode(h))) − h‖²` to keep `h` on the decode-encode fixed set.
- … (2fe7198):
  * … `final_model.pt`: non-negotiable.
  * `--inverse-only <ckpt>` skips training and reruns just the inverse-design stage.
  * `inverse_cycle_weight` exposed via TOML.
- … (6778be9):
  * `finetune_inverse_heads.py`: freeze encoder + non-inverse heads, retrain only the three inverse heads on top of a trained checkpoint.
  * `eval_inverse_methods.py`: head-to-head latent (cycle weight sweep) vs composition method on the same model + seeds + targets, with its own JSON + PNG output.
- … `optimize_composition` (5b6a4a6, 92318e2):
  * `allowed_elements`: default `"all"` or a non-empty `list[str]` of symbols → hard whitelist.
  * `element_step_scale`: default `1.0` or a `dict[str, float]` → soft per-element step scale (gradients are multiplied per-element; setting `0.0` freezes an element exactly).
  * `class_targets`, `class_target_weight`, `sparsity_weight`, `n_starts`, `steps`, `lr` all surfaced.
  * … `KMD` numpy API; the torch path is purely additive and never mutates the numpy kernel.
- … (5f20951):
  * `samples/continual_rehearsal_demo_config_inverse_baseline.toml`: cheaper baseline that drops the two KR tasks and saves the checkpoint for downstream experiments.
  * `scripts/paper_inverse_comparison.py`: runs latent (λ ∈ {0, 0.1, 0.5, 1, 2, 5}) and composition (4 configs: unconstrained / alloy palette / alloy+sparsity / alloy+soft step=0.5) on the same checkpoint and writes a self-contained paper folder.
Reproducing the paper materials
Headline result (artifacts/paper_inverse_design/)
Mean ± std across 16 top-QC seed compositions. Targets: QC → 1.0, formation_energy → −2.0, klat → +2.0.
Takeaways for the paper:
Out of scope
The `continual_rehearsal_full.py` workstream and its `python-pptx` deliverables (planning doc, config, runner, runner test) are intentionally not part of this PR; they're a separate workstream and are still WIP locally. Similarly, the modified `data/scripts/phonix-db.ipynb`, `pyproject.toml` and `uv.lock` belong to that workstream and are untouched here.
🤖 Generated with Claude Code