LoRAServe is a lightweight, educational side project exploring how modern LLM inference systems (like vLLM and SGLang) work under the hood -- from dynamic batching and adapter hot-loading to streaming responses and KV-cache reuse. It's not a production service -- the focus is on clarity, modularity, and learning-by-doing.
- Dynamic batching (tenant queues + policies)
- HFEngine integration
- LoRA adapter manager (hot-load / swap; see the sketch after this list)
- `/v1/generate` (non-stream)
- Streaming with TextIteratorStreamer
- Prometheus `/metrics` exporter
- Request latency (p50/p95/p99)
- Batch size histogram
- Queue wait histogram
- Token generation counters
- KVCacheManager (tracking only)
- Per-request KV usage
- KV-aware batching (cost-based)
- Prompt prefix reuse (mini-PagedAttention)
- `/v1/chat` endpoint
- Chat streaming (SSE)
- Usage metadata (prompt_tokens, completion_tokens)
- OpenAI-compatible schemas
- Speculative decoding (draft = target)
- Speculative decoding (small draft)
- Adapter prefetching endpoint
- Background adapter warming
- `torch.compile` / CUDA graphs
- Pinned memory for H2D copies
- Triton RMSNorm kernel
- LoRA-fused matmul (Triton)
- Benchmark harness (TTFT / throughput)
- Dockerfile
- Kubernetes manifests
- Horizontal Pod Autoscaler (HPA)
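
To make the hot-load / swap item concrete, here is a minimal sketch of what an adapter manager could look like on top of PEFT. The class name matches `core/adapters.py` in the layout below, but the method names, the LRU eviction policy, and the `max_adapters` limit are illustrative assumptions, not the repo's actual API.

```python
# Hypothetical sketch only: hot-loading/swapping LoRA adapters via PEFT.
from collections import OrderedDict

from peft import PeftModel


class LoRAAdapterManager:
    def __init__(self, base_model, max_adapters: int = 4):
        self.base_model = base_model
        self.max_adapters = max_adapters
        self.model = None                 # becomes a PeftModel after the first load
        self._loaded = OrderedDict()      # adapter name -> path, in LRU order

    def ensure_loaded(self, name: str, path: str) -> None:
        """Load the adapter if it is not resident, evicting the LRU one if full."""
        if self.model is None:
            # The first adapter wraps the base model in a PeftModel.
            self.model = PeftModel.from_pretrained(self.base_model, path, adapter_name=name)
        elif name not in self._loaded:
            if len(self._loaded) >= self.max_adapters:
                evicted, _ = self._loaded.popitem(last=False)
                self.model.delete_adapter(evicted)      # drop the least recently used adapter
            self.model.load_adapter(path, adapter_name=name)
        self._loaded[name] = path
        self._loaded.move_to_end(name)                  # mark as most recently used

    def activate(self, name: str) -> None:
        """Route subsequent forward passes through the named adapter."""
        self.model.set_adapter(name)
```

A request handler would call `ensure_loaded(...)` while the request waits in its queue and `activate(...)` right before `generate()`.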
```
/lora_serve
├── api/ # REST API endpoints (FastAPI)
│ ├── routes.py # /v1/generate, /v1/chat, /v1/generate/stream
│ └── schemas.py # Request/response models (pydantic)
├── core/
│ ├── engines/ # HFEngine wrapper over transformers/PEFT
│ ├── adapters.py # LoRAAdapterManager (load, cache, evict)
│ ├── config.py # BaseSettings (env + .env overrides)
│ └── logging.py # Thread/colorized logging helpers
├── scheduler/
│ ├── queue.py # Tenant queues (per-tenant request queues)
│ ├── policies.py # Batching/fairness strategies
│ └── batcher.py # DynamicBatcher main loop
├── kv_cache/
│ └── manager.py # Placeholder for KV-cache reuse & stats
└── tests/               # pytest-based functional/unit tests
```
Each layer is intentionally minimal and commented so you can trace the full path:
```
/v1/generate → router → queue → DynamicBatcher → HFEngine → model.generate()
```
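
To make that loop concrete, here is a minimal, hedged sketch of a dynamic batching loop in asyncio. Names such as `batcher_loop`, `engine.generate_batch`, and the batching limits are assumptions for illustration, not the actual `DynamicBatcher` code.

```python
# Illustrative sketch, not the repo's DynamicBatcher: collect queued requests
# for a short window, run one batched generate call, then resolve the Futures.
import asyncio

MAX_BATCH_SIZE = 8   # assumed limit
MAX_WAIT_S = 0.02    # how long to linger collecting extra requests


async def batcher_loop(queue: asyncio.Queue, engine) -> None:
    loop = asyncio.get_running_loop()
    while True:
        # Block until at least one (request, future) pair arrives.
        batch = [await queue.get()]
        deadline = loop.time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break

        requests = [req for req, _ in batch]
        # engine.generate_batch stands in for the blocking HF generate call;
        # run it off the event loop so new requests can keep queueing up.
        outputs = await asyncio.to_thread(engine.generate_batch, requests)
        for (_, future), output in zip(batch, outputs):
            if not future.done():
                future.set_result(output)
```

A route handler would enqueue `(payload, loop.create_future())` and simply `await` the future to receive its slice of the batch result.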
```bash
# Prepare virtual env
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Install dependencies
pip install -r requirements.txt

# Configure environment (see .env.example)
cp .env.example .env

# Launch server
# - Prepend LORASERVE_LOGLEVEL=DEBUG for DEBUG-level logging
uvicorn lora_serve.app:app --host 0.0.0.0 --port 8000 --reload
```

```bash
# Send a single request
curl -s -H "Content-Type: application/json" \
  -d '{"prompt":"Hello world","max_tokens":16}' \
  http://localhost:8000/v1/generate
```
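
The JSON payload above maps onto a pydantic request model. As a hedged guess at what `api/schemas.py` might define (field names follow the curl examples; defaults and constraints are assumptions):

```python
# Sketch of a request schema matching the curl payloads; not the repo's exact model.
from typing import Optional

from pydantic import BaseModel, Field


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = Field(default=64, ge=1)   # assumed default and lower bound
    stream: bool = False                        # used by /v1/generate/stream
    adapter: Optional[str] = None               # LoRA adapter name, e.g. "demo-adapter"
```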
```bash
# Keep a steady trickle of requests going (Ctrl-C to stop)
while true; do
  curl -s -H "Content-Type: application/json" \
    -d '{"prompt":"Hello","max_tokens":16}' \
    http://localhost:8000/v1/generate >/dev/null && sleep 0.5
done
```

```bash
# Fire 5 concurrent requests to exercise dynamic batching
for i in {1..5}; do
  curl -s -H "Content-Type: application/json" \
    -d "{\"prompt\":\"Hello $i\",\"max_tokens\":16}" \
    http://localhost:8000/v1/generate >/dev/null &
done; wait
```

```bash
# Stream tokens over SSE
curl -N -H "Content-Type: application/json" \
  -d '{"prompt":"Explain LoRA in one line","max_tokens":16,"stream":true}' \
  http://localhost:8000/v1/generate/stream
```

- See how to implement a vLLM-like batching loop from scratch.
- Experiment with LoRA hot-loading on a single GPU (Titan/A100).
- Explore asyncio patterns (async/await, queues, Futures).
- Add metrics (Prometheus or OpenTelemetry) and visualize throughput (see the metrics sketch after this list).
- Extend to multi-GPU or Kubernetes deployments later.
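
As a starting point for the metrics item above, here is a minimal sketch using `prometheus_client`; the metric names and bucket choices are made up for illustration, not taken from the repo.

```python
# Hypothetical metric definitions for the /metrics exporter (names are assumptions).
from prometheus_client import Counter, Histogram, make_asgi_app

REQUEST_LATENCY = Histogram(
    "loraserve_request_latency_seconds",
    "End-to-end request latency (feeds p50/p95/p99 panels)",
)
QUEUE_WAIT = Histogram(
    "loraserve_queue_wait_seconds",
    "Time a request spends waiting in its tenant queue",
)
BATCH_SIZE = Histogram(
    "loraserve_batch_size",
    "Requests packed into each dynamic batch",
    buckets=(1, 2, 4, 8, 16, 32),
)
GENERATED_TOKENS = Counter(
    "loraserve_generated_tokens_total",
    "Total tokens generated across all requests",
)

# The exporter can then be mounted on the FastAPI app, e.g.:
#   app.mount("/metrics", make_asgi_app())
```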
- FastAPI / Uvicorn: REST API layer
- Transformers: model/tokenizer backend
- PEFT: LoRA adapter handling
- torch: inference runtime
- sse-starlette: streaming (Server-Sent Events; see the streaming sketch below)
- pydantic: configuration and schema validation
- vLLM: inspiration for batching and memory management
- SGLang: reference for adapter/runtime design
- PEFT: LoRA support
- Transformers: base modeling toolkit
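
To show how the streaming pieces fit together (`TextIteratorStreamer` feeding an SSE response via sse-starlette), here is a hedged, self-contained sketch; the route shape, body handling, and model choice are assumptions, not the repo's actual `/v1/generate/stream` implementation.

```python
# Illustrative only: token streaming with TextIteratorStreamer + sse-starlette.
from threading import Thread

from fastapi import FastAPI
from sse_starlette.sse import EventSourceResponse
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model for the sketch
model = AutoModelForCausalLM.from_pretrained("gpt2")


@app.post("/v1/generate/stream")
async def generate_stream(body: dict):
    inputs = tokenizer(body["prompt"], return_tensors="pt")
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # model.generate blocks, so run it in a worker thread and consume the streamer here.
    Thread(
        target=model.generate,
        kwargs={**inputs, "max_new_tokens": body.get("max_tokens", 16), "streamer": streamer},
        daemon=True,
    ).start()

    async def events():
        # Note: iterating the streamer blocks the event loop; a production server
        # would bridge the tokens through an asyncio.Queue instead.
        for chunk in streamer:
            yield {"data": chunk}
        yield {"data": "[DONE]"}

    return EventSourceResponse(events())
```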
```bash
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env
uvicorn lora_serve.app:app --host 0.0.0.0 --port 8000 --reload
```

Then:

```bash
python examples/client_generate.py --prompt "Hello" --adapter demo-adapter
```

or