
BESE v5.3: Novel 288-token tokenizer (non-record 16MB)#1621

Open
mrbese wants to merge 1 commit into openai:main from mrbese:bese-v53-novel-tokenizer

Conversation


@mrbese mrbese commented Apr 14, 2026

Summary

Novel two-layer tokenizer submission (BESE: Base-Efficient Subword Encoding) for the non-record 16MB track.

  • 288-token vocabulary (40 structured base tokens + 248 BPE merges) replaces SentencePiece, shrinking the INT6 embedding table by ~276 KB vs. SP1024
  • The freed budget funds 11 transformer layers with depth recurrence (layers 3-5, 3 loops), parallel residuals, and an n-gram tilt applied at eval time
  • Mean val_bpb: 1.1531 across 3 independent runs (seeds 1337, 42, 314)

Results

| Run  | Seed | Sliding Window BPB | INT6 Roundtrip BPB |
|------|------|--------------------|--------------------|
| 1    | 1337 | 1.1554             | 1.1817             |
| 2    | 42   | 1.1539             | 1.1803             |
| 3    | 314  | 1.1499             | 1.1765             |
| Mean |      | 1.1531             | 1.1795             |
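The reported means follow directly from the per-run numbers in the table:

```python
# Recompute the reported means from the three per-run BPB values above.
sliding = [1.1554, 1.1539, 1.1499]
roundtrip = [1.1817, 1.1803, 1.1765]

mean_sliding = round(sum(sliding) / len(sliding), 4)
mean_roundtrip = round(sum(roundtrip) / len(roundtrip), 4)

print(mean_sliding)    # 1.1531
print(mean_roundtrip)  # 1.1795
```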

Tokenizer Design

BESE encodes text in two layers:

  1. Layer 1 — 40-token base alphabet: The 11 most frequent English letters (e,t,a,o,i,n,s,r,h,d,l) get single-token encodings. The remaining 15 letters use 2-token group+position codes. Space, punctuation, digits, and non-ASCII bytes each have dedicated tokens. Every token's byte count exactly matches the UTF-8 byte count of the character(s) it represents.

  2. Layer 2 — 248 BPE merges: Standard byte-pair encoding trained on 50K FineWeb documents using the base token sequences. Each merge token's byte count is the recursive sum of its constituents.
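The two layers can be sketched as follows. This is an illustrative stand-in, not the actual BESE tables: the token-ID layout, group assignment, and the single toy merge are all hypothetical.

```python
# Layer 1: 11 frequent letters get single tokens; the 15 remaining letters
# use 2-token group+position codes (hypothetical ID layout).
FREQUENT = "etaoinsrhdl"                 # single-token letters
RARE = list("bcfgjkmpquvwxyz")           # 2-token group+position letters

BASE_ID = {c: i for i, c in enumerate(FREQUENT)}  # ids 0..10
SPACE_ID = 11
GROUP_IDS = (12, 13, 14)                 # 3 groups of 5 rare letters
POS_IDS = (15, 16, 17, 18, 19)           # position within a group
FALLBACK_ID = 20                         # stand-in for per-byte tokens

def base_encode(text):
    """Layer 1: map text to base tokens."""
    out = []
    for ch in text:
        if ch in BASE_ID:
            out.append(BASE_ID[ch])
        elif ch == " ":
            out.append(SPACE_ID)
        elif ch in RARE:
            k = RARE.index(ch)
            out += [GROUP_IDS[k // 5], POS_IDS[k % 5]]
        else:
            out.append(FALLBACK_ID)      # simplified byte fallback
    return out

# Layer 2: merges learned over base-token sequences. Real BPE applies
# merges by training priority; a single greedy pass suffices here.
MERGES = {(BASE_ID["t"], BASE_ID["h"]): 21}   # hypothetical merge "th"

def apply_merges(tokens):
    out, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) in MERGES:
            out.append(MERGES[tuple(tokens[i:i + 2])]); i += 2
        else:
            out.append(tokens[i]); i += 1
    return out

print(apply_merges(base_encode("the")))  # [21, 0]  ("th" merged, then "e")
```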

BPB correctness: The README includes a complete proof that sum(bytes_per_token[t] for t in encoding) == len(text.encode('utf-8')) for all inputs, plus self-test verification on diverse Unicode strings.
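The invariant can be demonstrated on a stand-in tokenizer (plain per-byte base tokens plus one hypothetical merge, with the merge token's byte count defined as the sum of its constituents, as in Layer 2 above):

```python
# Byte-count invariant: the summed byte counts of an encoding must equal
# the UTF-8 length of the input, for every input.
bytes_per_token = {b: 1 for b in range(256)}   # base tokens: 1 byte each
MERGE_TH = 256                                  # hypothetical merge of b"th"
bytes_per_token[MERGE_TH] = bytes_per_token[ord("t")] + bytes_per_token[ord("h")]

def encode(text):
    data, out, i = text.encode("utf-8"), [], 0
    while i < len(data):
        if data[i:i + 2] == b"th":
            out.append(MERGE_TH); i += 2
        else:
            out.append(data[i]); i += 1
    return out

# Self-test on diverse Unicode strings, mirroring the README's check.
for s in ["the theory", "naïve", "日本語", ""]:
    assert sum(bytes_per_token[t] for t in encode(s)) == len(s.encode("utf-8"))
```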

Architecture Highlights

  • 11 layers, dim=512, GQA (8/4 heads), LeakyReLU(0.5)²
  • Depth recurrence: layers 3-5 repeated 3× (activated at 35% training progress)
  • Parallel residuals (GPT-J style, from layer 7)
  • Compiled Newton-Schulz orthogonalization + batched EMA
  • N-gram tilt: pre-computed bigram/trigram table applied as eval-time logit bias
  • INT6 quantization + LZMA compression → 12.7 MB artifact (3.3 MB margin)
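The depth-recurrence schedule above can be sketched as the sequence of layer indices executed per forward pass. The layer indices and loop count come from this PR; the 0-based indexing and everything else here are assumptions:

```python
# Depth recurrence: an 11-layer stack in which layers 3-5 are looped 3x
# once the schedule activates (at 35% training progress in the PR).
def make_schedule(n_layers=11, loop_start=3, loop_end=5, loops=3,
                  recurrence_on=True):
    """Return the sequence of layer indices run in one forward pass."""
    order = []
    for i in range(n_layers):
        order.append(i)
        if recurrence_on and i == loop_end:
            # repeat the recurrent span (loops - 1) extra times
            order.extend(list(range(loop_start, loop_end + 1)) * (loops - 1))
    return order

print(len(make_schedule(recurrence_on=False)))  # 11 (before activation)
print(len(make_schedule()))                     # 17 (11 + 2 extra passes of 3 layers)
```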

Files

  • train_gpt.py — self-contained training script
  • bese_constants.py + bese_fast_bpe.py — tokenizer implementation
  • tokenizer.json — pre-trained BPE merges
  • submission.json — metadata
  • train_log_run{1,2,3}.txt — full training logs for all 3 runs
  • README.md — detailed tokenizer design, BPB correctness proof, architecture, reproduction instructions

