implement qat and weight sharing #1593
Open
kiratoyoshihara wants to merge 1 commit into openai:main from
Conversation
Note: This is a draft PR.
Approach Overview
To achieve extreme efficiency under the 16MB artifact constraint and the 10-minute training limit on 8xH100, I am introducing a custom architecture that modifies the baseline train_gpt.py.

1. Depth Recurrence (Parameter Tying)
Standard transformer architectures can only hold a few million parameters within 16MB. To overcome this, I have replaced the standard nn.ModuleList of blocks with a single shared Transformer Block. By applying this block recursively across all layers, we simulate a much deeper network and maximize representational capacity without increasing the actual file size on disk.
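A minimal sketch of the idea, assuming a GPT-style block interface like the baseline's (the class and parameter names here are illustrative, not the PR's actual code):

```python
import torch
import torch.nn as nn

class DepthRecurrentModel(nn.Module):
    """One shared transformer block applied n_layer times.

    Only a single copy of the block's weights is stored, so the
    serialized artifact stays small while the effective depth at
    run time is n_layer."""
    def __init__(self, block: nn.Module, n_layer: int):
        super().__init__()
        self.block = block      # single shared set of weights
        self.n_layer = n_layer  # number of recurrent applications

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.n_layer):
            x = self.block(x)   # same parameters at every "layer"
        return x

# Usage sketch with a stand-in block (the PR shares the baseline's own Block):
layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
model = DepthRecurrentModel(layer, n_layer=12)
y = model(torch.randn(2, 64, 256))  # 12 applications of one block's weights
```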
2. Quantization-Aware Training (QAT)

To cram the maximum effective parameter count into the footprint, I implemented a fake-quantization step (symmetric INT8 via a straight-through estimator) inside CastedLinear. The forward/backward passes are kept in high precision (bfloat16) for H100 throughput, but the weights are regularized to be robust to extreme post-training quantization.
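A minimal sketch of symmetric INT8 fake quantization with a straight-through estimator; the actual CastedLinear change may differ, and the per-tensor scale and clamp choices below are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant_int8(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor INT8 fake quantization.

    Forward: weights are rounded onto the grid [-127, 127] * scale.
    Backward: the straight-through estimator treats the round-trip
    as identity, so gradients flow to the underlying fp weights."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    w_q = torch.clamp(torch.round(w / scale), -127, 127) * scale
    return w + (w_q - w).detach()  # STE: value of w_q, gradient of w

class QuantAwareLinear(nn.Linear):
    """Linear layer that trains against its own quantized weights
    (illustrative stand-in for the modified CastedLinear)."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, fake_quant_int8(self.weight), self.bias)
```

At export time, the same round-and-clamp step can emit a true INT8 tensor plus one floating-point scale per weight matrix, which is what would let the checkpoint fit the 16MB budget.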
Current Status

All changes are contained in train_gpt.py. I will update this PR with training logs and the final artifact once the large-scale runs are complete.