Commit 33fdd02

vorushin and claude committed
Add tweet sidenote for CC Web prompt in TPU blog post
Co-Authored-By: Claude Opus 4.6 <[email protected]>
1 parent 0d84fd5 commit 33fdd02

File tree

1 file changed: +2 −1 lines changed

_posts/2026-02-22-llm-pretraining-tpu-v6e-50usd.md

Lines changed: 2 additions & 1 deletion
@@ -121,9 +121,10 @@ I started creating this setup while on vacation - I had little snippets of compu
 
 Nevertheless first versions of the training only reached 25% MXU usage. I pushed Opus to dig hard and investigate, but without the ability to run experiments on Colab TPUs and get the measurements it relied on online reports where other people struggled to reach MXU usage over 25%[^nanogpt_jax]. After a while it declared that our model is too small to get to a decent MXU usage on such a modern hardware. I knew for sure that it wasn't true, but didn't have enough time to rewrite everything profiling piece by piece.
 
+[^cc_tweet]: [The prompt I used](https://x.com/vorushin/status/2024040663214588124).
 [^nanogpt_jax]: E.g., [The modded nanogpt speedrun, but in JAX and on TPUs](https://nor-blog.pages.dev/posts/2025-08-21-modded-nanogpt-jax/) reports 23% MFU on TPU v6e-8, constrained by HBM bandwidth.
 
-When I was waiting for a plane, I came up with the following idea: let Claude Code (via Claude Code Web) build [a Colab notebook with a thorough set of TPU performance tests](https://github.com/vorushin/tpuchat/blob/master/05_tpu_perf.ipynb), building the transformer block by block, and measure the MFU of different parts, in different sizes and in various combinations. Start from the pure matmuls, then, implement and profile individual components, then a single layer, multiple layers, forward and backward pass, the optimizer implementation, each phase independently runnable. Even though the first implementation had a lot of issues, it helped me to start seeing MFU north of 50% and I was eventually able to dissect the slow parts and replace them with the faster implementations.
+When I was waiting for a plane, I came up with the following idea: let Claude Code (via Claude Code Web) build [a Colab notebook with a thorough set of TPU performance tests](https://github.com/vorushin/tpuchat/blob/master/05_tpu_perf.ipynb)[^cc_tweet], building the transformer block by block, and measure the MFU of different parts, in different sizes and in various combinations. Start from the pure matmuls, then, implement and profile individual components, then a single layer, multiple layers, forward and backward pass, the optimizer implementation, each phase independently runnable. Even though the first implementation had a lot of issues, it helped me to start seeing MFU north of 50% and I was eventually able to dissect the slow parts and replace them with the faster implementations.
 
 <figure>
 <img src="/img/tpu_ablations/tpu_perf.png" alt="05_tpu_perf notebook screenshot">
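The paragraph edited in this diff describes profiling the transformer phase by phase, starting from pure matmuls, and measuring the MFU of each part. A minimal sketch of that first phase in JAX is below; the function name, matrix sizes, and the peak-FLOPs figure are illustrative assumptions, not taken from the actual notebook:

```python
import time

import jax
import jax.numpy as jnp


def matmul_mfu(m, n, k, peak_flops_per_s, dtype=jnp.bfloat16, iters=10):
    """Time a jitted matmul and return achieved FLOP/s as a fraction of peak.

    peak_flops_per_s is the chip's advertised peak for this dtype
    (roughly 9.2e14 for bf16 on a single TPU v6e chip; treat that
    figure as approximate and substitute your own hardware's number).
    """
    ka, kb = jax.random.split(jax.random.PRNGKey(0))
    a = jax.random.normal(ka, (m, k), dtype)
    b = jax.random.normal(kb, (k, n), dtype)
    f = jax.jit(lambda x, y: x @ y)
    f(a, b).block_until_ready()      # warm-up: keep compilation out of the timing
    t0 = time.perf_counter()
    for _ in range(iters):
        out = f(a, b)
    out.block_until_ready()          # wait for async dispatch before stopping the clock
    dt = (time.perf_counter() - t0) / iters
    flops = 2 * m * n * k            # one multiply and one add per (m, n, k) triple
    return flops / dt / peak_flops_per_s


# Example: matmul_mfu(4096, 4096, 4096, peak_flops_per_s=9.2e14)
```

Sweeping (m, n, k) here, then reusing the same timing harness for individual components, a full layer, and the forward/backward pass is one way to realize the "each phase independently runnable" structure the paragraph describes.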

0 commit comments
