Add tweet sidenote for CC Web prompt in TPU blog post

vorushin · claude · vorushin · commit 33fdd0294fda · 2026-02-24T09:15:15.000+01:00
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
diff --git a/_posts/2026-02-22-llm-pretraining-tpu-v6e-50usd.md b/_posts/2026-02-22-llm-pretraining-tpu-v6e-50usd.md
@@ -121,9 +121,10 @@ I started creating this setup while on vacation - I had little snippets of compu
 
 Nevertheless first versions of the training only reached 25% MXU usage. I pushed Opus to dig hard and investigate, but without the ability to run experiments on Colab TPUs and get the measurements it relied on online reports where other people struggled to reach MXU usage over 25%[^nanogpt_jax]. After a while it declared that our model is too small to get to a decent MXU usage on such a modern hardware. I knew for sure that it wasn't true, but didn't have enough time to rewrite everything profiling piece by piece.
 
+[^cc_tweet]: [The prompt I used](https://x.com/vorushin/status/2024040663214588124).
 [^nanogpt_jax]: E.g., [The modded nanogpt speedrun, but in JAX and on TPUs](https://nor-blog.pages.dev/posts/2025-08-21-modded-nanogpt-jax/) reports 23% MFU on TPU v6e-8, constrained by HBM bandwidth.
 
-When I was waiting for a plane, I came up with the following idea: let Claude Code (via Claude Code Web) build [a Colab notebook with a thorough set of TPU performance tests](https://github.com/vorushin/tpuchat/blob/master/05_tpu_perf.ipynb), building the transformer block by block, and measure the MFU of different parts, in different sizes and in various combinations. Start from the pure matmuls, then, implement and profile individual components, then a single layer, multiple layers, forward and backward pass, the optimizer implementation, each phase independently runnable. Even though the first implementation had a lot of issues, it helped me to start seeing MFU north of 50% and I was eventually able to dissect the slow parts and replace them with the faster implementations.
+When I was waiting for a plane, I came up with the following idea: let Claude Code (via Claude Code Web) build [a Colab notebook with a thorough set of TPU performance tests](https://github.com/vorushin/tpuchat/blob/master/05_tpu_perf.ipynb)[^cc_tweet], building the transformer block by block, and measure the MFU of different parts, in different sizes and in various combinations. Start from the pure matmuls, then, implement and profile individual components, then a single layer, multiple layers, forward and backward pass, the optimizer implementation, each phase independently runnable. Even though the first implementation had a lot of issues, it helped me to start seeing MFU north of 50% and I was eventually able to dissect the slow parts and replace them with the faster implementations.
 
 <figure>
 <img src="/img/tpu_ablations/tpu_perf.png" alt="05_tpu_perf notebook screenshot">