BenjaminFont/project-knowledge-base

Project Knowledge Base

Semantic search system for company knowledge using Neo4j as a hybrid graph + vector database, E5 embeddings, and GPT-4.1 LLM re-ranking. Ask natural language questions about projects, skills, and expertise.

Features

  • Hybrid Search - Combines keyword matching, vector similarity, and LLM re-ranking for 98% accuracy
  • Natural Language Q&A - Conversational answers powered by GPT-4.1 (~6s response time)
  • Expert Finding - "Who knows Spring Boot?" returns ranked developers with skill levels
  • Skill Tracking - Visualize technology distribution and identify skill gaps
  • 30+ Sample CVs - Realistic test data for immediate experimentation

Architecture

Neo4j as Hybrid Database:

  • Graph Layer - Stores relationships (Person→Project→Technology) for structured queries
  • Vector Layer - Stores 1024-dim embeddings in node properties for semantic search
  • Both layers are queried in parallel, then merged for optimal results

3-Stage Hybrid Search:

  1. Graph Queries - Exact technology matches via relationship traversal
  2. Vector Search - Semantic similarity via Neo4j vector indexes
  3. LLM Re-Ranking - GPT-4.1 scores merged results 0-100% with reasoning
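
How the three stages might fit together, as a minimal sketch: merge graph and vector hits by id, then keep only candidates the LLM scores at or above the threshold. Function and field names (`merge_results`, `vector_score`, `llm_score`) are illustrative, not the project's actual API.

```python
def merge_results(keyword_hits, vector_hits):
    """Merge keyword and vector hits by id, annotating overlaps with the vector score."""
    merged = {hit["id"]: dict(hit, source="keyword") for hit in keyword_hits}
    for hit in vector_hits:
        if hit["id"] in merged:
            merged[hit["id"]]["vector_score"] = hit.get("vector_score", 0.0)
        else:
            merged[hit["id"]] = dict(hit, source="vector")
    return list(merged.values())

def filter_by_llm_score(candidates, llm_scores, threshold=70):
    """Keep candidates whose LLM relevance score (0-100) meets the threshold."""
    kept = [dict(c, llm_score=llm_scores[c["id"]])
            for c in candidates
            if llm_scores.get(c["id"], 0) >= threshold]
    return sorted(kept, key=lambda c: c["llm_score"], reverse=True)
```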

Tech Stack: Neo4j 5.23+ (Graph + Vector) • E5-large Embeddings • GPT-4.1 • Python 3.10+ • Streamlit • Docker

Data Model:

# Graph structure (relationships)
(Person)-[:WORKED_ON]->(Project)-[:USES_TECHNOLOGY]->(Technology)
(Person)-[:HAS_SKILL {level, years}]->(Technology)

# Vector properties (embeddings stored in nodes)
Person.bio_embedding: [1024 dimensions]
Project.description_embedding: [1024 dimensions]
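
The vector layer depends on Neo4j vector indexes over these embedding properties. A sketch of creating them with the Neo4j 5.x `CREATE VECTOR INDEX` syntax; the index names and connection details are assumptions, not the project's actual setup:

```python
def vector_index_cypher(label: str, prop: str, name: str, dims: int = 1024) -> str:
    """Build a Neo4j 5.x CREATE VECTOR INDEX statement for a node embedding property."""
    return (
        f"CREATE VECTOR INDEX {name} IF NOT EXISTS "
        f"FOR (n:{label}) ON (n.{prop}) "
        "OPTIONS {indexConfig: {"
        f"`vector.dimensions`: {dims}, "
        "`vector.similarity_function`: 'cosine'}}"
    )

if __name__ == "__main__":
    from neo4j import GraphDatabase  # pip install neo4j
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "test1234"))
    with driver.session() as session:
        session.run(vector_index_cypher("Project", "description_embedding", "project_desc_idx"))
        session.run(vector_index_cypher("Person", "bio_embedding", "person_bio_idx"))
```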

Quick Start

# 1. Setup
git clone <repo>
cd project-knowledge-base
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env

# 2. Configure .env
PORTKEY_API_KEY=your_key_here
PORTKEY_RANKING_MODEL=@azure-openai-foundry/gpt-4.1
PORTKEY_CHAT_MODEL=@azure-openai-foundry/gpt-4.1

# 3. Start Neo4j
docker-compose up -d

# 4. Seed database (imports 30 CVs + generates 20 projects)
python seed_data.py

# 5. Run app
streamlit run app.py

Open http://localhost:8501

Usage Examples

Hybrid Search Tab

"Python machine learning projects" → 20 ML projects ranked by relevance
"banking applications" → Financial services projects  
"React developers" → Frontend engineers with React skills

Chat Q&A Tab

"Who are the best Python ML experts for a new project?"
→ "Based on the knowledge base, Marcus Weber and Olivia Chen are the strongest fits..."

"What React projects did Sarah work on?"
→ "Sarah Chen led 3 React projects: Digital Banking Platform, E-Commerce Marketplace..."

Filter Pipelines

Hybrid Project Search:

Query → Keyword (20) + Vector (40) → Merge (56) → Pre-filter (30) → LLM Re-rank → Filter ≥70% → Results
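
The pre-filter step (cutting the merged set from 56 down to 30 before the LLM call) could be a simple top-k cut by vector score, keeping the re-ranking request cheap. A sketch with an assumed `vector_score` field:

```python
def prefilter(candidates, cap=30):
    """Keep the top `cap` candidates by vector score before the LLM re-rank."""
    return sorted(candidates, key=lambda c: c.get("vector_score", 0.0), reverse=True)[:cap]
```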

Chat Q&A:

Question → Projects Search (parallel) + People Search (parallel) → Context (15 each) → GPT-4.1 Answer → Response
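
The two context searches can run concurrently with a standard thread pool before the answer is generated. A sketch; the search callables stand in for the real service methods:

```python
from concurrent.futures import ThreadPoolExecutor

def gather_context(question, search_projects, search_people, limit=15):
    """Run the project and people searches in parallel, capping each at `limit` hits."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        proj_future = pool.submit(search_projects, question)
        people_future = pool.submit(search_people, question)
        return proj_future.result()[:limit], people_future.result()[:limit]
```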

Performance

| Operation | Time | Notes |
|---|---|---|
| Hybrid Search | ~2s | Keyword + Vector + LLM (30 candidates) |
| Chat Q&A | ~6s | 2 parallel searches + answer generation |
| LLM Re-ranking | ~1.7s | GPT-4.1 for 30 projects |
| Answer Generation | ~3.9s | GPT-4.1 conversational response |

Tested on: MacBook Pro M1, 16GB RAM

Development

Add Data

from src.neo4j_service import get_neo4j_service
service = get_neo4j_service()

service.add_person({
    'id': 'p100', 'name': 'Dev Name', 'email': 'dev@co.com',
    'position': 'Engineer', 'bio': 'Experienced in...',
    'skills': [{'technology': 'Python', 'level': 4, 'years': 3}]
})

service.add_project({
    'id': 'proj100', 'name': 'Project', 'description': '...',
    'start_date': '2023-01-01', 'end_date': '2023-12-31',
    'technologies': ['Python', 'React'], 'client': 'Client'
})

Custom Queries

Add to src/neo4j_service.py:

def get_experts_by_level(self, tech: str, min_level: int = 4):
    with self.driver.session() as session:
        return session.run("""
            MATCH (p:Person)-[s:HAS_SKILL]->(t:Technology {name: $tech})
            WHERE s.level >= $min_level
            RETURN p.name, s.level, s.years
            ORDER BY s.level DESC, s.years DESC
        """, tech=tech, min_level=min_level).data()

Configuration

Environment Variables:

# Neo4j
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=test1234

# Embeddings
EMBEDDING_MODEL=intfloat/multilingual-e5-large
CACHE_EMBEDDINGS=true

# Portkey AI
PORTKEY_API_KEY=your_key
PORTKEY_RANKING_MODEL=@azure-openai-foundry/gpt-4.1
PORTKEY_CHAT_MODEL=@azure-openai-foundry/gpt-4.1
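
Note that E5 models are trained with role prefixes ("query: " for questions, "passage: " for documents, per the intfloat model cards), so inputs should be prefixed before encoding. A sketch; the helper name is illustrative:

```python
def e5_prefix(text: str, kind: str = "passage") -> str:
    """E5 embedding models expect a role prefix on every input string."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return f"{kind}: {text}"

if __name__ == "__main__":
    from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
    model = SentenceTransformer("intfloat/multilingual-e5-large")
    vec = model.encode(e5_prefix("Who knows Spring Boot?", kind="query"),
                       normalize_embeddings=True)
    print(len(vec))  # 1024-dim, matching the vector index dimensions
```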

Testing

# All tests
pytest tests/ -v

# LLM speed test
python test_llm_speed.py

# Performance test
python test_parallel_performance.py

Future Enhancements

  1. Document Upload - PDF/Word CV parsing, auto-embeddings, batch import
  2. Export & Reporting - CSV/JSON exports, skills dashboards, analytics
  3. Advanced Analytics - Skill gaps, team recommendations, technology trends
  4. Authentication - SSO, RBAC, audit logs, encryption
  5. Multi-Tenancy - Multiple orgs, isolated data, shared taxonomy
  6. Integrations - Slack/Teams Q&A, webhooks, email notifications

Project Structure

src/
  ├── embedding_service.py      # E5 embeddings with caching
  ├── neo4j_service.py           # Graph DB operations + hybrid search
  ├── portkey_service.py         # LLM re-ranking via Portkey
  ├── conversational_service.py  # Natural language Q&A
  └── cv_parser.py               # CV markdown parser
data/cvs/                        # 30+ sample CVs
app.py                           # Streamlit UI
seed_data.py                     # Database seeding

License

MIT
