This repo is a starter DS sandbox for a personalized news stack. It’s intentionally modular so you can test each module independently via a separate Streamlit app.
It includes:
- A minimal data model (Pydantic) for stories, facts, clusters, user profiles
- A C-LLM extraction schema (stubbed by default) + canonicalization + fact IDs
- A simple clustering baseline + delta computation
- A basic user knowledge store + preference updates
- A planner (P-LLM) and realizer (R-LLM) scaffolding (stubbed)
- An evaluation playground (coverage, novelty, redundancy, faithfulness placeholder)
- Streamlit apps for each module under `apps/`
⚠️ By default, LLM calls are stubbed so you can run everything with no API keys. You can later wire in your preferred LLM provider by editing `src/llm/providers.py`.
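To make the data model concrete, here is a minimal sketch of what the Pydantic models in `src/data_models.py` might look like. The field names below are illustrative assumptions, not the repo's actual schema:

```python
from typing import List, Optional
from pydantic import BaseModel

class Story(BaseModel):
    """A normalized news article (field names are assumptions)."""
    story_id: str
    title: str
    body: str
    source: str
    cluster_id: Optional[str] = None  # assigned later by the clustering step

class Fact(BaseModel):
    """An atomic, grounded claim extracted from a story."""
    fact_id: str   # stable ID produced by canonicalization
    story_id: str  # provenance: which story this fact came from
    text: str      # canonicalized fact statement

class UserProfile(BaseModel):
    """Per-user preferences and fact memory."""
    user_id: str
    topic_weights: dict = {}       # topic -> preference score
    seen_fact_ids: List[str] = []  # facts already shown to this user
```

The real models live in `src/data_models.py`; treat this only as a mental model of how stories, facts, and users relate.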
```bash
python -m venv .venv
# mac/linux
source .venv/bin/activate
# windows
# .venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env
```

Fill in any keys if you want live providers (optional). You can run fully offline.
Each module is a standalone app:
```bash
streamlit run apps/01_ingest_and_cluster.py
streamlit run apps/02_c_llm_extract.py
streamlit run apps/03_canonicalize_and_dedup.py
streamlit run apps/04_user_model.py
streamlit run apps/05_planner.py
streamlit run apps/06_realizer.py
streamlit run apps/07_evaluation.py
```

If you want a simple launcher:
```bash
python tools/run_app.py 2
```

```
apps/                # Streamlit entrypoints (one per module)
src/
  config.py          # env + paths
  data_models.py     # Pydantic models (Story, Fact, Cluster, User)
  storage/           # lightweight local storage (jsonl)
  ingest/            # ingest stubs + parsers
  clustering/        # clustering + delta logic
  extraction/        # C-LLM schema + canonicalization + fact IDs
  user/              # user profile, knowledge & preference updates
  planning/          # P-LLM plan schema + stub planner
  realization/       # R-LLM renderer schema + stub realizer
  eval/              # metrics + eval helpers
  llm/               # provider interface (stub, optional live)
tools/
  run_app.py         # quick launcher
data/                # local runtime data (created automatically)
artifacts/           # cached outputs (created automatically)
```
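The launcher in `tools/run_app.py` could be as simple as the sketch below: map a module number to its Streamlit entrypoint and shell out to `streamlit run`. This is a hypothetical reconstruction, not the repo's actual script:

```python
# Hypothetical sketch of tools/run_app.py: pick an app by number and launch it.
import subprocess
import sys
from pathlib import Path

APPS = {
    1: "01_ingest_and_cluster.py",
    2: "02_c_llm_extract.py",
    3: "03_canonicalize_and_dedup.py",
    4: "04_user_model.py",
    5: "05_planner.py",
    6: "06_realizer.py",
    7: "07_evaluation.py",
}

def main() -> None:
    # Default to app 1 when no number is given on the command line.
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 1
    app = Path("apps") / APPS[n]
    subprocess.run(
        [sys.executable, "-m", "streamlit", "run", str(app)],
        check=True,
    )
```

Call `main()` under an `if __name__ == "__main__":` guard to use it as a script, e.g. `python tools/run_app.py 2` launches the extraction app.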
- Ingest: load example articles → normalize → store `Story`
- Cluster: assign `cluster_id` and maintain `ClusterState`
- Extract: run C-LLM extractor (stub) → produce grounded `Fact` objects
- Canonicalize: normalize facts → generate `fact_id` → dedup across stories
- User model: update preferences + per-cluster fact memory (seen facts)
- Plan: create a per-user `ContentPlan` (what to include/omit/emphasize)
- Realize: generate swipe cards + extended modules (stub)
- Evaluate: compute novelty/redundancy/coverage + compare variants
Edit:
- `src/llm/providers.py` (implement `LLMProvider.generate_json(...)`)
- Set env keys in `.env`

Then, in the apps, switch from the stub provider to your live provider.
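The stub-vs-live split might look like the sketch below. The interface name and signature are taken from the note above; everything else (the stub's canned payload, the `client.complete` call) is a placeholder for whatever SDK you wire in:

```python
import json

class LLMProvider:
    """Interface assumed from src/llm/providers.py (signature is a guess)."""
    def generate_json(self, prompt: str, schema: dict) -> dict:
        raise NotImplementedError

class StubProvider(LLMProvider):
    """Offline default: returns a canned response so apps run with no API keys."""
    def __init__(self, canned: dict):
        self.canned = canned

    def generate_json(self, prompt: str, schema: dict) -> dict:
        return self.canned

class LiveProvider(LLMProvider):
    """Skeleton for a real provider: call your LLM client, parse the text as JSON.
    `client.complete` is a hypothetical method; substitute your SDK's call."""
    def __init__(self, client):
        self.client = client

    def generate_json(self, prompt: str, schema: dict) -> dict:
        raw = self.client.complete(prompt)  # hypothetical client call
        return json.loads(raw)
```

Keeping both behind one interface means each Streamlit app only needs a one-line swap to go live.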
Cursor will work best if you:
- open the repo folder
- use `.env` for secrets
- run Streamlit in an integrated terminal
- keep generated data under `data/` and cached artifacts under `artifacts/`
- Add a real EventRegistry connector in `src/ingest/eventregistry.py`
- Replace the stub C-LLM with your actual extraction prompt + grounding
- Add a vector DB (FAISS / pgvector) for fast retrieval
- Implement a verifier: ensure R-LLM outputs only use selected fact IDs
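The verifier idea in the last bullet can be sketched in a few lines, assuming (hypothetically) that the R-LLM cites facts inline as `[fact_<hex>]` tags. The citation format is an assumption for illustration, not the repo's convention:

```python
import re
from typing import List, Set

def extract_cited_ids(card_text: str) -> Set[str]:
    """Pull fact-ID citations out of rendered text. Assumes an inline
    [fact_abc123] citation convention (an assumption, not the repo's format)."""
    return set(re.findall(r"\[(fact_[0-9a-f]+)\]", card_text))

def verify_card(card_text: str, selected_fact_ids: List[str]) -> bool:
    """A card passes only if every fact it cites was in the planner's selection,
    i.e. the R-LLM introduced nothing the P-LLM did not approve."""
    return extract_cited_ids(card_text) <= set(selected_fact_ids)
```

Cards that fail the check can be regenerated or dropped, giving a cheap guardrail against the realizer hallucinating ungrounded claims.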