
Commit 7d435d2

Combine adaptive quantization with span-wide loop embeddings
The adaptive-clip training-recovery lane is currently the strongest fully compliant direction we have, but its novelty story still leans heavily on the open openai#1586 quantization recipe. This variant adds one of our own zero-byte architecture tweaks on top: instead of injecting the pass embedding only at the loop-start layer, it applies the same pass embedding across the whole repeated span. The goal is to see whether the stronger W18 quantization path and the W14-style span-wide loop signal reinforce each other without paying any additional artifact cost.

Constraint: We need a stronger candidate that is not just a thinner repackaging of the open adaptive-clip line, and the next change should not consume more bytes.
Rejected: Submit the plain W18 lane immediately | Strong and compliant, but its novelty story is still too close to the open openai#1586 recipe.
Rejected: Return to broader TTT or chunk/context sweeps | Those knobs already underperformed on this family.
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: If this zero-byte architecture add-on does not improve W18, stop treating loop-embedding placement as a likely differentiator for the adaptive-clip family.
Tested: python3 -m py_compile evaluate.py train_gpt.py; bundle code-size estimate remains ~24.2 KB
Not-tested: Full Lepton run for adaptive clip + span-wide loop embeddings
1 parent c0c2d68 commit 7d435d2

File tree

2 files changed: +11 −4 lines changed


evaluate.py

Lines changed: 7 additions & 1 deletion
@@ -60,7 +60,13 @@ def _load_env():
 # ---------------------------------------------------------------------------

 def _run(cmd, check=False, timeout=30):
-    r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
+    try:
+        r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
+    except subprocess.TimeoutExpired as e:
+        stdout = e.stdout if isinstance(e.stdout, str) else (e.stdout or b"").decode("utf-8", "replace")
+        stderr = e.stderr if isinstance(e.stderr, str) else (e.stderr or b"").decode("utf-8", "replace")
+        stderr = (stderr + f"\nTIMEOUT after {timeout}s").strip()
+        r = subprocess.CompletedProcess(cmd, 124, stdout=stdout, stderr=stderr)
     if check and r.returncode != 0:
         raise RuntimeError(f"Command failed: {cmd}\n{r.stderr}")
     return r

train_gpt.py

Lines changed: 4 additions & 3 deletions
@@ -813,11 +813,12 @@ def _loop_pass_embedding(self, layer_idx, loop_counts, x):
         if (
             not self.looping_active
             or self.loop_embed is None
-            or layer_idx != self.loop_start
+            or layer_idx < self.loop_start
+            or layer_idx > self.loop_end
         ):
             return x
-        pass_idx = loop_counts.get(layer_idx, 0)
-        loop_counts[layer_idx] = pass_idx + 1
+        loop_span = self.loop_end - self.loop_start + 1
+        pass_idx = loop_counts.get("_lv", 0) // loop_span
         if pass_idx >= self.num_loop_passes:
             return x
         emb = self.loop_embed.weight[pass_idx].to(dtype=x.dtype)
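The indexing scheme above can be sketched without the model: a single shared visit counter (`"_lv"`, as in the diff) advances once per layer visit, and integer-dividing it by the span length yields a pass index that is identical for every layer inside the repeated span. The diff does not show where `"_lv"` is incremented, so the increment site here is an assumption, and `pass_index` is a hypothetical stand-in for the real forward pass.

```python
def pass_index(loop_counts, loop_start, loop_end):
    """Shared pass index for any layer visit inside the span [loop_start, loop_end]."""
    loop_span = loop_end - loop_start + 1
    idx = loop_counts.get("_lv", 0) // loop_span
    # Assumed: the caller consumes one visit per layer per pass.
    loop_counts["_lv"] = loop_counts.get("_lv", 0) + 1
    return idx

# Two passes over a 3-layer span (layers 4..6): all layers in a pass share one index,
# so the same pass embedding is applied span-wide rather than only at loop_start.
counts = {}
print([pass_index(counts, 4, 6) for _ in range(6)])  # [0, 0, 0, 1, 1, 1]
```

Keying the counter on the sentinel string `"_lv"` rather than on `layer_idx` is what makes the tweak zero-byte-neutral: the old per-layer dict entries are simply replaced by one shared entry.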
