
Valerie

Valerie is a Large Language Model written completely from scratch in pure C.

Features

  • UTF-8 grapheme support
  • Byte-Pair Encoding (BPE) tokenizer
  • Model weights: Q8 (inference), BF16 (training); see the format sketch after this list
  • File serialization & validation
  • Completions engine
  • Chat completions engine
  • Training engine
  • Fine-tuning engine
  • Headless CPU support (OpenMP)
  • Headless GPU support (Vulkan)
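
As a rough sketch of the two weight formats: Q8 stores each weight as a signed byte plus a shared float scale, while BF16 keeps the top 16 bits of an IEEE-754 float32. The helpers below are illustrative, assuming symmetric per-block Q8 scaling and round-to-nearest-even BF16 truncation; the names are not Valerie's actual API.

#include <math.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Q8: symmetric quantization; one float scale shared by the block. */
float q8_quantize(const float* x, int8_t* q, size_t n) {
    float amax = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    float scale = amax / 127.0f;
    float inv = scale > 0.0f ? 1.0f / scale : 0.0f;
    for (size_t i = 0; i < n; i++) {
        q[i] = (int8_t) lroundf(x[i] * inv);  /* dequantize: x[i] ~= q[i] * scale */
    }
    return scale;
}

/* BF16: keep the sign, exponent, and top 7 mantissa bits of a float32. */
uint16_t bf16_from_f32(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits += 0x7FFF + ((bits >> 16) & 1);  /* round to nearest even (NaN handling omitted) */
    return (uint16_t) (bits >> 16);
}

float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t) h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}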

Setup

git clone https://github.com/teleprint-me/valerie.c valerie
cd valerie
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build -j $(nproc)

Tokenizer

Valerie includes an ASCII-only Byte-Pair Encoding (BPE) tokenizer designed for transparency and ease of extension. Unicode (UTF-8 grapheme) support is planned.
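
At its core, BPE is a short loop: count adjacent symbol pairs, merge the most frequent pair into a new token, and repeat for a fixed number of merge steps. Below is a minimal sketch of the merge pass over a token-id sequence; it is illustrative only, and Valerie's internal data structures may differ.

#include <stddef.h>

/* Replace every occurrence of the pair (a, b) in `ids` with the new id `ab`.
   Returns the new sequence length. One such pass runs per BPE merge step,
   after picking (a, b) as the most frequent adjacent pair in the corpus. */
size_t bpe_merge_pair(int* ids, size_t n, int a, int b, int ab) {
    size_t w = 0;  /* write cursor */
    for (size_t r = 0; r < n; r++) {
        if (r + 1 < n && ids[r] == a && ids[r + 1] == b) {
            ids[w++] = ab;
            r++;  /* skip the second half of the merged pair */
        } else {
            ids[w++] = ids[r];
        }
    }
    return w;
}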

Workflow

  1. Train: build and serialize a BPE tokenizer model from a plaintext corpus.
  2. Predict: encode and decode text with the trained tokenizer model.

Commands

Train

Build and save a tokenizer model:

./build/examples/tokenizer/train --input S --output S [--merges N] [--verbose]

(S denotes a string argument, N an integer.)

  • --input, -i Path to input plaintext corpus (required)
  • --output, -o Directory to save the tokenizer model (required)
  • --merges, -m Number of BPE merge steps (default: 10)
  • --verbose, -v Enable debug output

Predict

Encode and decode text with a trained model:

./build/examples/tokenizer/predict --model S --prompt S [options]
  • --model, -m Path to tokenizer model file (required)
  • --prompt, -p Input text to encode and decode (required)
  • --add-bos, -b Add BOS marker
  • --add-eos, -e Add EOS marker
  • --verbose, -v Enable debug output

Example

Train:

./build/examples/tokenizer/train -i samples/simple.txt -o models -m 10

Predict:

./build/examples/tokenizer/predict -m models/tokenizer.model -p 'Hello, world!'

Typical output:

  • Prints tokens, frequencies, and merge steps when training.
  • Lists vocabulary and encodings when predicting.

Planned:

  • Unicode grapheme support
  • Model extensibility and validation

Model

What Is Valerie?

Valerie is a decoder-only transformer inspired by architectures like GPT, Llama, Mistral, and Qwen. Its design closely follows Adrian Cable’s Qwen3 C implementation, which provided an excellent reference for inference behavior. Valerie extends this concept beyond inference, toward a complete training and fine-tuning framework.

At its core, Valerie is an experiment in understanding and re-implementing large language model mechanics from first principles: every layer, tensor operation, and gradient is written manually, with full transparency and zero abstraction bloat.
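
As an example of what "written manually" means in practice, here is a matrix-vector product together with its hand-derived backward pass, the kind of primitive a transformer is assembled from. This is a minimal sketch, not Valerie's actual kernel.

#include <stddef.h>

/* Forward: y = W x, with W stored row-major as [rows][cols]. */
void matvec(const float* W, const float* x, float* y, size_t rows, size_t cols) {
    for (size_t i = 0; i < rows; i++) {
        float s = 0.0f;
        for (size_t j = 0; j < cols; j++) s += W[i * cols + j] * x[j];
        y[i] = s;
    }
}

/* Backward: given dL/dy, accumulate dL/dW and dL/dx by the chain rule:
   dW[i][j] += dy[i] * x[j];  dx[j] += W[i][j] * dy[i]. */
void matvec_backward(const float* W, const float* x, const float* dy,
                     float* dW, float* dx, size_t rows, size_t cols) {
    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            dW[i * cols + j] += dy[i] * x[j];
            dx[j] += W[i * cols + j] * dy[i];
        }
    }
}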

Why Build From Scratch?

I wanted to understand how a transformer truly works, not just use one. That meant rebuilding every component from the ground up: tokenizer, model, optimizer, and serialization. Valerie depends only on minimal, transparent libraries like PCRE2, OpenMP, and (eventually) Vulkan, keeping the codebase small, portable, and easy to inspect.

Transformers are intricate systems grounded in algebra, geometry, calculus, and statistics. Each layer (attention, feed-forward, normalization) is a self-contained “computable block.” Valerie exposes these blocks directly, allowing the entire forward and backward pipeline to be followed line-by-line.
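
For instance, the attention block reduces to a handful of loops: score the query against the cached keys, apply softmax, and mix the values. The single-head sketch below handles one query position against t cached rows; it assumes row-major key/value caches and is not Valerie's actual implementation.

#include <math.h>

/* Single-head attention for one query vector over t cached key/value rows.
   q: [d], k: [t][d] flattened, v: [t][d] flattened, att: [t] scratch, out: [d]. */
void attention_head(const float* q, const float* k, const float* v,
                    float* att, float* out, int t, int d) {
    /* scores: att[i] = (q . k_i) / sqrt(d) */
    for (int i = 0; i < t; i++) {
        float s = 0.0f;
        for (int j = 0; j < d; j++) s += q[j] * k[i * d + j];
        att[i] = s / sqrtf((float) d);
    }
    /* numerically stable softmax over the scores */
    float max = att[0];
    for (int i = 1; i < t; i++) if (att[i] > max) max = att[i];
    float sum = 0.0f;
    for (int i = 0; i < t; i++) { att[i] = expf(att[i] - max); sum += att[i]; }
    for (int i = 0; i < t; i++) att[i] /= sum;
    /* weighted sum of values: out = sum_i att[i] * v_i */
    for (int j = 0; j < d; j++) out[j] = 0.0f;
    for (int i = 0; i < t; i++) {
        for (int j = 0; j < d; j++) out[j] += att[i] * v[i * d + j];
    }
}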

Why Not PyTorch?

PyTorch is powerful but highly abstracted and optimized for NVIDIA hardware. Its heavy CUDA focus, dependency footprint, and dynamic graph system hide too much of what I want to see, especially for low-level experimentation. While I appreciate Python’s flexibility, it isn’t well-suited for understanding the mechanics of transformers at the memory or numeric level.

By contrast, Valerie’s C implementation is explicit and predictable, running close to the metal and relying on a small, disciplined build.

Why Not GGML?

GGML is an excellent inference framework supporting many architectures. However, its computation-graph-based design (a Directed Acyclic Graph or DAG) makes it difficult to trace the fundamental operations without stepping through layers of abstraction. Valerie takes the opposite approach: a linear, transparent, and manually written execution path that prioritizes understanding over optimization.

Why in C?

C offers the right balance of simplicity, speed, and control. There’s no hidden allocation, no garbage collector, and no surprise abstractions, just raw access to the system. That control comes at a cost: safety and patience are required. Valerie relies on AddressSanitizer (ASAN) during development to catch common memory issues early, but careful engineering discipline remains essential.
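
For reference, ASAN can typically be enabled in a CMake build as follows; this is a general recipe, and the exact flags in Valerie's build files may differ.

cmake -B build -DCMAKE_BUILD_TYPE=Debug \
  -DCMAKE_C_FLAGS="-fsanitize=address -fno-omit-frame-pointer" \
  -DCMAKE_EXE_LINKER_FLAGS="-fsanitize=address"
cmake --build build -j $(nproc)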

C has been my language of choice for years, and Valerie reflects my belief that with care, C can still serve as a foundation for modern, high-performance machine learning research.

Current Status

Valerie is a work in progress. The architecture, forward pass, and training loop are nearly complete, but issues remain in three main areas:

  • Initialization – potential scaling or variance imbalance
  • Gradient accumulation – instability during backpropagation
  • Buffer management – cleanup and consistency for precision variants

Currently, gradients tend to explode, preventing the model from converging or generalizing. Single-precision (FP32) debugging is the primary focus before expanding to mixed and quantized formats.
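
One standard mitigation for exploding gradients is global L2-norm clipping, sketched below for context. This is a common technique in training loops generally, not necessarily something Valerie implements.

#include <math.h>
#include <stddef.h>

/* Scale the gradient vector down so its L2 norm is at most max_norm. */
void clip_grad_norm(float* g, size_t n, float max_norm) {
    float sq = 0.0f;
    for (size_t i = 0; i < n; i++) sq += g[i] * g[i];
    float norm = sqrtf(sq);
    if (norm > max_norm) {
        float scale = max_norm / (norm + 1e-6f);
        for (size_t i = 0; i < n; i++) g[i] *= scale;
    }
}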

You can view the model implementation here: examples/model/v.c

The code runs, but training stability is still under investigation.

Contributions

Contributions are welcome; clarity and simplicity are the guiding principles. Before optimizing for performance, Valerie aims to work correctly, read clearly, and explain itself.

License

Valerie is licensed under the AGPL to ensure end-user freedom.
