BESE v5.3: Novel 288-token tokenizer (non-record 16MB)#1621
Open
mrbese wants to merge 1 commit into openai:main from
Conversation
Two-layer tokenizer (40 base + 248 BPE merges = 288 vocab) shrinks INT6 embedding table by ~276KB vs SP1024, funding extra transformer depth. Mean val_bpb 1.1531 across 3 runs (seeds 1337, 42, 314).
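The ~276KB figure is consistent with a simple back-of-the-envelope check. As a hedged sketch (the PR does not state the model dimension; 512 is an assumption here):

```python
# Sanity-check of the ~276KB embedding-table saving claimed above,
# assuming a model dimension of 512 (NOT stated in the PR) and
# 6-bit (INT6) embedding weights.
vocab_sp1024, vocab_bese = 1024, 288
d_model = 512                      # assumed value, for illustration only
bits_per_weight = 6
saved_bytes = (vocab_sp1024 - vocab_bese) * d_model * bits_per_weight / 8
print(saved_bytes / 1024)          # 276.0 KB
```

Under these assumptions the saving works out to exactly 276KB, matching the claim.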
Summary
Novel two-layer tokenizer submission (BESE: Base-Efficient Subword Encoding) for the non-record 16MB track.
Results
Mean val_bpb 1.1531 across 3 runs (seeds 1337, 42, 314).
Tokenizer Design
BESE encodes text in two layers:
Layer 1 — 40-token base alphabet: The 11 most frequent English letters (e,t,a,o,i,n,s,r,h,d,l) get single-token encodings. The remaining 15 letters use 2-token group+position codes. Space, punctuation, digits, and non-ASCII bytes each have dedicated tokens. Every token's byte count exactly matches the UTF-8 byte count of the character(s) it represents.
Layer 2 — 248 BPE merges: Standard byte-pair encoding trained on 50K FineWeb documents using the base token sequences. Each merge token's byte count is the recursive sum of its constituents.
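The Layer-2 procedure can be sketched as a toy BPE trainer (this is not the PR's implementation; it illustrates the mechanics, including the recursive byte-count rule, and assumes 1-byte base tokens for simplicity):

```python
from collections import Counter

def train_bpe(sequences, num_merges):
    """Toy BPE training sketch: repeatedly merge the most frequent adjacent
    token pair, giving each new token a byte count equal to the sum of its
    constituents' byte counts (the recursive rule described above)."""
    # byte_count: token id -> number of UTF-8 bytes it represents
    byte_count = {t: 1 for seq in sequences for t in seq}  # assume 1-byte base tokens
    merges = []
    next_id = max(t for seq in sequences for t in seq) + 1
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        byte_count[next_id] = byte_count[a] + byte_count[b]  # recursive byte sum
        merges.append((a, b, next_id))
        new_seqs = []
        for seq in sequences:            # apply the merge to every sequence
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(next_id)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        sequences = new_seqs
        next_id += 1
    return merges, byte_count, sequences
```

Because each merge token's byte count is defined as the sum of its parts, the total byte count of any sequence is invariant under merging, which is what makes the BPB accounting in the next point work.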
BPB correctness: The README includes a complete proof that
sum(bytes_per_token[t] for t in encoding) == len(text.encode('utf-8'))
for all inputs, plus self-test verification on diverse Unicode strings.
Architecture Highlights
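The invariant being verified can be sketched minimally. A plain byte-level tokenizer stands in for BESE here (one token per UTF-8 byte), since the property itself is tokenizer-agnostic:

```python
# Minimal sketch of the BPB invariant: summed per-token byte counts must
# equal the UTF-8 byte length of the input. A byte-level tokenizer is a
# stand-in for BESE (each stand-in token covers exactly 1 byte).
def encode_bytes(text):
    return list(text.encode("utf-8"))             # one token per UTF-8 byte

BYTES_PER_TOKEN = {t: 1 for t in range(256)}      # stand-in byte-count table

def check_bpb_invariant(text):
    toks = encode_bytes(text)
    return sum(BYTES_PER_TOKEN[t] for t in toks) == len(text.encode("utf-8"))

# Self-test on diverse Unicode strings, as the README's verification does
for s in ["hello", "naïve", "日本語", "🙂 mixed ascii"]:
    assert check_bpb_invariant(s)
```

BESE's table assigns larger byte counts to merge tokens, but the check is the same equality over the whole encoding.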
Files
train_gpt.py — self-contained training script
bese_constants.py + bese_fast_bpe.py — tokenizer implementation
tokenizer.json — pre-trained BPE merges
submission.json — metadata
train_log_run{1,2,3}.txt — full training logs for all 3 runs
README.md — detailed tokenizer design, BPB correctness proof, architecture, reproduction instructions