-While waiting for a plane, I came up with the following idea: let Claude Code (via Claude Code Web) build [a Colab notebook with a thorough set of TPU performance tests](https://github.com/vorushin/tpuchat/blob/master/05_tpu_perf.ipynb) that builds the transformer block by block and measures the MFU of its parts at different sizes and in various combinations. Start with pure matmuls, then implement and profile individual components, then a single layer, multiple layers, the forward and backward pass, and the optimizer implementation, with each phase independently runnable. Even though the first implementation had a lot of issues, it helped me start seeing MFU north of 50%, and I was eventually able to dissect the slow parts and replace them with faster implementations.
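To make the "start from pure matmuls" step concrete, here is a minimal sketch of the bookkeeping behind an MFU number. It is not the code from the linked notebook: the helper names (`matmul_flops`, `mfu`, `time_fn`) are hypothetical, and the peak throughput in the worked example is a made-up placeholder rather than the spec of any real TPU. It uses only the standard library so that the arithmetic is easy to check.

```python
import time


def matmul_flops(m, k, n):
    """FLOPs for an (m, k) @ (k, n) matmul: one multiply and one add per term."""
    return 2 * m * k * n


def mfu(flops, seconds, peak_flops_per_s):
    """Model FLOPs utilization: achieved FLOP/s divided by the hardware peak."""
    if seconds <= 0 or peak_flops_per_s <= 0:
        raise ValueError("seconds and peak_flops_per_s must be positive")
    return flops / seconds / peak_flops_per_s


def time_fn(fn, warmup=2, iters=5):
    """Median wall-clock seconds per call, measured after a few warmup calls."""
    for _ in range(warmup):
        fn()  # warmup absorbs one-off costs such as compilation
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]


# Worked example with placeholder numbers (NOT measurements):
# a 4096 x 4096 x 4096 matmul that took 1 ms on a chip whose
# assumed peak is 200 TFLOP/s.
flops = matmul_flops(4096, 4096, 4096)
example_mfu = mfu(flops, seconds=0.001, peak_flops_per_s=200e12)
print(f"{flops} FLOPs, MFU = {example_mfu:.1%}")
```

The same three helpers can be reused for each phase by swapping in the FLOP count of the component under test. One caveat if `fn` wraps a JAX computation: dispatch is asynchronous, so the timed function should call `block_until_ready()` on its result, otherwise the timer stops before the device has finished.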