BESE v5.3: Novel 288-token tokenizer (non-record 16MB)#1621
Open
mrbese wants to merge 1 commit into openai:main from
Conversation
Two-layer tokenizer (40 base + 248 BPE merges = 288 vocab) shrinks INT6 embedding table by ~276KB vs SP1024, funding extra transformer depth. Mean val_bpb 1.1531 across 3 runs (seeds 1337, 42, 314).
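The ~276KB figure is consistent with a simple back-of-the-envelope check. As a hedged sketch (the PR does not state the model dimension; 512 is an assumption here):

```python
# Sanity-check of the ~276KB embedding-table saving claimed above,
# assuming a model dimension of 512 (NOT stated in the PR) and
# 6-bit (INT6) embedding weights.
vocab_sp1024, vocab_bese = 1024, 288
d_model = 512                      # assumed value, for illustration only
bits_per_weight = 6
saved_bytes = (vocab_sp1024 - vocab_bese) * d_model * bits_per_weight / 8
print(saved_bytes / 1024)          # 276.0 KB
```

Under these assumptions the saving works out to exactly 276KB, matching the claim.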
Summary
Novel two-layer tokenizer submission (BESE: Base-Efficient Subword Encoding) for the non-record 16MB track.
Results
Mean val_bpb 1.1531 across 3 runs (seeds 1337, 42, 314).
Tokenizer Design
BESE encodes text in two layers:
Layer 1 — 40-token base alphabet: The 11 most frequent English letters (e,t,a,o,i,n,s,r,h,d,l) get single-token encodings. The remaining 15 letters use 2-token group+position codes. Space, punctuation, digits, and non-ASCII bytes each have dedicated tokens. Every token's byte count exactly matches the UTF-8 byte count of the character(s) it represents.
Layer 2 — 248 BPE merges: Standard byte-pair encoding trained on 50K FineWeb documents using the base token sequences. Each merge token's byte count is the recursive sum of its constituents.
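The Layer-2 procedure can be sketched as a toy BPE trainer (this is not the PR's implementation; it illustrates the mechanics, including the recursive byte-count rule, and assumes 1-byte base tokens for simplicity):

```python
from collections import Counter

def train_bpe(sequences, num_merges):
    """Toy BPE training sketch: repeatedly merge the most frequent adjacent
    token pair, giving each new token a byte count equal to the sum of its
    constituents' byte counts (the recursive rule described above)."""
    # byte_count: token id -> number of UTF-8 bytes it represents
    byte_count = {t: 1 for seq in sequences for t in seq}  # assume 1-byte base tokens
    merges = []
    next_id = max(t for seq in sequences for t in seq) + 1
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        byte_count[next_id] = byte_count[a] + byte_count[b]  # recursive byte sum
        merges.append((a, b, next_id))
        new_seqs = []
        for seq in sequences:            # apply the merge to every sequence
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(next_id)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        sequences = new_seqs
        next_id += 1
    return merges, byte_count, sequences
```

Because each merge token's byte count is defined as the sum of its parts, the total byte count of any sequence is invariant under merging, which is what makes the BPB accounting in the next point work.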
BPB correctness: The README includes a complete proof that
sum(bytes_per_token[t] for t in encoding) == len(text.encode('utf-8'))
for all inputs, plus self-test verification on diverse Unicode strings.
Architecture Highlights
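The invariant being verified can be sketched minimally. A plain byte-level tokenizer stands in for BESE here (one token per UTF-8 byte), since the property itself is tokenizer-agnostic:

```python
# Minimal sketch of the BPB invariant: summed per-token byte counts must
# equal the UTF-8 byte length of the input. A byte-level tokenizer is a
# stand-in for BESE (each stand-in token covers exactly 1 byte).
def encode_bytes(text):
    return list(text.encode("utf-8"))             # one token per UTF-8 byte

BYTES_PER_TOKEN = {t: 1 for t in range(256)}      # stand-in byte-count table

def check_bpb_invariant(text):
    toks = encode_bytes(text)
    return sum(BYTES_PER_TOKEN[t] for t in toks) == len(text.encode("utf-8"))

# Self-test on diverse Unicode strings, as the README's verification does
for s in ["hello", "naïve", "日本語", "🙂 mixed ascii"]:
    assert check_bpb_invariant(s)
```

BESE's table assigns larger byte counts to merge tokens, but the check is the same equality over the whole encoding.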
Files
train_gpt.py — self-contained training script
bese_constants.py + bese_fast_bpe.py — tokenizer implementation
tokenizer.json — pre-trained BPE merges
submission.json — metadata
train_log_run{1,2,3}.txt — full training logs for all 3 runs
README.md — detailed tokenizer design, BPB correctness proof, architecture, reproduction instructions