AmritSe/distil-perso

News Personalization — Data Science Base Project (Streamlit Modules)

This repo is a starter data-science sandbox for a personalized news stack. It is intentionally modular: each module can be tested independently via its own Streamlit app.

It includes:

  • A minimal data model (Pydantic) for stories, facts, clusters, user profiles
  • A C-LLM extraction schema (stubbed by default) + canonicalization + fact IDs
  • A simple clustering baseline + delta computation
  • A basic user knowledge store + preference updates
  • A planner (P-LLM) and realizer (R-LLM) scaffolding (stubbed)
  • An evaluation playground (coverage, novelty, redundancy, faithfulness placeholder)
  • Streamlit apps for each module under apps/
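The real Pydantic models live in src/data_models.py; the sketch below illustrates the rough shape of Story, Fact, and user-profile records using only stdlib dataclasses, so it runs without dependencies. Field names here are illustrative assumptions, not the repo's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sketch of the data model; real definitions (Pydantic) live in src/data_models.py.
@dataclass
class Fact:
    fact_id: str          # stable ID produced by canonicalization
    text: str
    source_story_id: str  # grounding: which story this fact came from

@dataclass
class Story:
    story_id: str
    title: str
    body: str
    cluster_id: Optional[str] = None  # assigned later by the clustering step

@dataclass
class UserProfile:
    user_id: str
    seen_fact_ids: set = field(default_factory=set)   # per-cluster fact memory
    topic_weights: dict = field(default_factory=dict)  # preference updates land here

story = Story(story_id="s1", title="Example headline", body="...")
assert story.cluster_id is None  # unclustered until module 01 runs
```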

⚠️ By default, LLM calls are stubbed so you can run everything with no API keys. You can later wire your preferred LLM provider by editing src/llm/providers.py.


1) Setup (local)

1. Create venv + install

python -m venv .venv
# mac/linux
source .venv/bin/activate
# windows
# .venv\Scripts\activate

pip install -r requirements.txt

2. Configure environment

cp .env.example .env

Fill in API keys only if you want live providers (optional); everything runs fully offline by default.


2) Running Streamlit apps

Each module is a standalone app:

streamlit run apps/01_ingest_and_cluster.py
streamlit run apps/02_c_llm_extract.py
streamlit run apps/03_canonicalize_and_dedup.py
streamlit run apps/04_user_model.py
streamlit run apps/05_planner.py
streamlit run apps/06_realizer.py
streamlit run apps/07_evaluation.py

If you want a simple launcher:

python tools/run_app.py 2
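The launcher in tools/run_app.py maps a 1-based index to the numbered app files and hands off to `streamlit run`. A minimal sketch of how such a launcher might look (the actual implementation in the repo may differ):

```python
import subprocess
import sys
from pathlib import Path

# Discover the numbered apps, e.g. apps/01_ingest_and_cluster.py
APPS = sorted(Path("apps").glob("[0-9][0-9]_*.py"))

def app_for_index(idx: int, apps=None):
    """Return the app for a 1-based index, as in `python tools/run_app.py 2`."""
    apps = APPS if apps is None else apps
    return apps[idx - 1]

def launch(idx: int) -> None:
    """Invoke `streamlit run` on the selected app (requires streamlit on PATH)."""
    subprocess.run(["streamlit", "run", str(app_for_index(idx))], check=True)

if __name__ == "__main__" and len(sys.argv) > 1:
    launch(int(sys.argv[1]))
```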

3) Project layout

apps/                       # Streamlit entrypoints (one per module)
src/
  config.py                 # env + paths
  data_models.py            # Pydantic models (Story, Fact, Cluster, User)
  storage/                  # lightweight local storage (jsonl)
  ingest/                   # ingest stubs + parsers
  clustering/               # clustering + delta logic
  extraction/               # C-LLM schema + canonicalization + fact IDs
  user/                     # user profile, knowledge & preference updates
  planning/                 # P-LLM plan schema + stub planner
  realization/              # R-LLM renderer schema + stub realizer
  eval/                     # metrics + eval helpers
  llm/                      # provider interface (stub, optional live)
tools/
  run_app.py                # quick launcher
data/                       # local runtime data (created automatically)
artifacts/                  # cached outputs (created automatically)

4) Data flow in this sandbox

  1. Ingest: load example articles → normalize → store Story
  2. Cluster: assign cluster_id and maintain ClusterState
  3. Extract: run C-LLM extractor (stub) → produce grounded Fact objects
  4. Canonicalize: normalize facts → generate fact_id → dedup across stories
  5. User model: update preferences + per-cluster fact memory (seen facts)
  6. Plan: create a per-user ContentPlan (what to include/omit/emphasize)
  7. Realize: generate swipe cards + extended modules (stub)
  8. Evaluate: compute coverage, novelty, and redundancy metrics; compare variants
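Step 4 is the linchpin of dedup: facts from different stories must hash to the same fact_id when they say the same thing. One common approach, sketched here under the assumption that canonicalization is text normalization plus a content hash (the repo's actual logic in src/extraction/ may be richer):

```python
import hashlib
import re

def canonicalize(text: str) -> str:
    """Lowercase, collapse internal whitespace, strip edge punctuation."""
    text = re.sub(r"\s+", " ", text.strip().lower())
    return text.strip(".,;: ")

def fact_id(text: str) -> str:
    """Stable ID from the canonical form, so duplicates across stories collide."""
    return hashlib.sha256(canonicalize(text).encode("utf-8")).hexdigest()[:16]

# Minor surface variation produces the same ID:
assert fact_id("The Fed raised rates.") == fact_id("the fed  raised rates")
```

Hash-based IDs only catch near-exact duplicates; paraphrases would need embedding similarity on top, which is where the vector-DB extension below comes in.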

5) Wiring a real LLM (optional)

Edit:

  • src/llm/providers.py (implement LLMProvider.generate_json(...))
  • Set env keys in .env

Then, in the apps, switch from the stub provider to your live provider.
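A sketch of what the provider interface might look like, assuming `generate_json` takes a prompt and a JSON-Schema-style dict (check src/llm/providers.py for the actual signature before wiring a live provider):

```python
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    """Hypothetical shape of the interface in src/llm/providers.py."""

    @abstractmethod
    def generate_json(self, prompt: str, schema: dict) -> dict:
        """Return a dict conforming to `schema`."""

class StubProvider(LLMProvider):
    """Deterministic offline stub: emits a None for each schema property."""

    def generate_json(self, prompt: str, schema: dict) -> dict:
        return {key: None for key in schema.get("properties", {})}

stub = StubProvider()
out = stub.generate_json("extract facts", {"properties": {"facts": {}, "summary": {}}})
assert set(out) == {"facts", "summary"}
```

Keeping the stub and live providers behind one abstract method is what lets every app run with no API keys and switch providers in one place.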


6) Notes for Cursor

Cursor will work best if you:

  • open the repo folder
  • use .env for secrets
  • run Streamlit in an integrated terminal
  • keep generated data under data/ and cached artifacts under artifacts/

7) Next steps you can build on

  • Add a real EventRegistry connector in src/ingest/eventregistry.py
  • Replace the stub C-LLM with your actual extraction prompt + grounding
  • Add a vector DB (FAISS / pgvector) for fast retrieval
  • Implement a verifier: ensure R-LLM outputs only use selected fact IDs
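The verifier idea in the last bullet can be sketched as a set-difference check. This assumes a hypothetical inline citation marker like `[fact:<id>]` in the R-LLM output; the real marker format would come from your realizer prompt:

```python
import re

def unauthorized_facts(rendered: str, allowed_ids: set) -> list:
    """Return fact IDs cited in the output that the planner did not select."""
    cited = set(re.findall(r"\[fact:([0-9a-f]+)\]", rendered))
    return sorted(cited - allowed_ids)

# A clean output cites only planned facts:
assert unauthorized_facts("Rates rose [fact:a1b2].", {"a1b2"}) == []
# A hallucinated citation is flagged:
assert unauthorized_facts("Rates rose [fact:ffff].", {"a1b2"}) == ["ffff"]
```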

About

News extraction and personalisation
