A Flutter application demonstrating on-device LLM inference using llama.cpp and Dart FFI.
- Zero-Cloud Inference: Run GGUF models locally on your Android or iOS device.
- Real-time Streaming: Tokens are rendered as they are generated for a premium chat experience.
- Persistent Threading: High-performance backend worker isolate ensures the UI stays at 60fps.
- Performance Benchmarks: Real-time tokens-per-second tracking displayed in-app.
The app handles local GGUF models in the application's documents directory.
- Directory:
app_documents_dir/models/ - Atomic Downloads: Models are downloaded with a
.partextension and renamed only upon success. - Storage: Persistence via
SharedPreferences.
Powered by a custom C++ bridge to llama.cpp.
- Prerequisites: Android NDK (r25+ recommended) and CMake.
- Arch:
arm64-v8a. - Performance: Uses
mmapfor efficient model loading.
- CMake: The project is configured to generate an iOS Framework.
- Integration: To build, open
ios/Runner.xcworkspaceand ensure thellama_wrappertarget is added as a dependency/framework. - FFI: Uses
DynamicLibrary.process()for seamless symbol lookup.
The high-level LlamaWorker handles all native interactions:
final worker = LlamaWorker();
await worker.start();
worker.responses.listen((response) {
print("Token: ${response.token}");
if (response.isDone) print("TPS: ${response.tokensPerSecond}");
});
worker.generate(LlamaRequest(
modelPath: path,
prompt: prompt,
requestId: 'example-request',
));
```dart
final result = ModerationParser.parse(rawJson);
final report = ModerationPolicy.apply(result);
print(ModerationPolicy.getDisplayString(report));ChatScreen,ModelSetupScreen, andSettingsScreenprovide the UX for downloading GGUFs, configuring the active model, and chatting. They never call native code directly; instead they communicate throughLlamaWorker.ChatScreenstreams tokens into the UI, preloads models, and tracks request IDs so a late token from the worker cannot corrupt the current conversation.
LlamaWorkerspins up a dedicated Dart isolate for inference. The UI isolate sendsLlamaRequestobjects, and the worker isolate returnsLlamaStreamResponseevents (tokens, errors, TPS stats, warmup notifications).- The isolate model keeps the UI responsive because all long-running, synchronous native calls happen off the main thread.
LlamaServiceis the high-level Dart façade that mirrors the llama lifecycle:init→preparePrompt/getNextToken→dispose.LlamaFfiowns theDynamicLibraryhandle (libllama_wrapper.soon Android, process image on iOS). It declares native signatures (typedef ...Native) and looks up the exported C symbols (wrapper_init,wrapper_prepare_prompt, etc.), exposing them as type-safe Dart functions.- FFI (Foreign Function Interface) is what lets Dart call into the native
llama_wrapperlibrary without a platform channel. Each call marshals data (UTF-8 pointers, ints, bools) across the boundary.
llama_wrapper.cpp/.himplement the exported C API that Dart binds to. They:- Load GGUF weights via llama.cpp and configure context/thread counts for mobile.
- Prepare prompts by resetting KV cache, batching tokens, and guarding against context overflow.
- Stream tokens with sampler chains, explicit stop-sequence checks, and safety limits.
llama_wrapperlinks directly against the vendorednative/llama_cpp/src(the official llama.cpp sources) using CMake for Android and CocoaPods for iOS.- The resulting shared library/framework is bundled with the Flutter app and located via
DynamicLibrary.open('libllama_wrapper.so')(Android) orDynamicLibrary.process()(iOS).
- User taps “Send” in
ChatScreen. The UI pushes aLlamaRequest(with prompt, message history, and a unique requestId) into the worker isolate. - The worker ensures the correct model is loaded, prepares the prompt via
LlamaService, and continuously callsgetNextTokenthrough FFI. - Each token is reported back to the UI as a
LlamaStreamResponse, whereChatScreenappends it to the chat and monitors tokens-per-second. - When generation finishes (EOS, stop sequence, cancellation, or timeout), the worker sends a final
isDoneresponse. The UI then resets its state and awaits the next input.
This layered design keeps concerns separated: Flutter widgets handle UX/state, the isolate manages concurrency and prompt budgeting, the Dart FFI layer encapsulates native bindings, and the C++ wrapper deals with llama.cpp internals optimized for on-device inference.
I implemented this fully on-device LLM setup in a Flutter app, running locally on an Android phone (Pixel 7). The goal was straightforward: integrate llama.cpp, load a quantized model (~600 MB), and see whether a general chat experience could be delivered reliably without any cloud inference.
From a purely technical perspective, everything worked. The model loaded, prompts were processed, tokens streamed back, and the system behaved correctly under ideal conditions. From a product and engineering perspective, however, the conclusion was unavoidable: this is not production-viable today for general chat on mobile hardware.
Despite aggressive optimization—reduced context windows, limited history, careful batching, memory mapping, and conservative threading—the experience remained highly inconsistent. Most requests timed out or stalled; occasional responses succeeded, but not with the latency or reliability users would tolerate. This wasn’t a configuration mistake or tooling gap. It was a clear illustration of current mobile CPU and memory constraints when running even small LLMs interactively.
The important nuance is that this does not mean on-device inference is pointless. It can work well for tightly bounded tasks such as moderation, classification, or short structured outputs, especially where privacy is critical. What does not work today is trying to replicate a cloud-grade conversational assistant entirely offline on a phone.
The experiment was still a success—because it replaced assumptions with evidence. Sometimes the most valuable engineering outcome is a well-reasoned “not yet,” backed by real implementation experience rather than theory.





