glennliew/startup-scanner

Startup Research Assistant

An automated research assistant that intelligently gathers, analyzes, and serves startup information through REST APIs and natural language chat interfaces.

Python 3.11+ FastAPI PostgreSQL Docker

🚀 Features

Core Capabilities

  • Concurrent Multi-Source Research: Automatically scrapes and queries multiple data sources in parallel

    • Company websites (homepage, about, team pages)
    • Web Search + LLM (financial data extraction from press releases and articles) 🆕
    • NewsAPI (recent news articles)
    • GitHub API (open source repositories, tech stack)
  • Semantic Search: Powered by OpenAI embeddings and pgvector for similarity-based queries

  • Natural Language Interface: Chat-style endpoint that understands questions like:

    • "Which AI startups have raised over $10M?"
    • "Show me fintech companies in San Francisco"
    • "Find startups using Python and PostgreSQL"
  • Analytics Dashboard: Aggregated insights including:

    • Funding distribution
    • Industry breakdown
    • Tech stack analysis
    • Success metrics

Technical Highlights

  • Async/Await Architecture: Built with asyncio for high-performance concurrent operations
  • Rate Limiting & Throttling: Per-domain semaphores and token bucket rate limiting
  • Structured Storage: PostgreSQL with pgvector for embeddings and full-text search
  • Docker-Native: Full docker-compose setup for reproducible local deployment
  • Type-Safe: Comprehensive Pydantic schemas and SQLAlchemy models
  • Test Coverage: Unit and integration tests with pytest

🏁 Quick Start

Prerequisites

  • Docker and Docker Compose
  • OpenAI API key (for embeddings and chat)

Installation

  1. Clone the repository
cd startup-scanner
  2. Create environment file
cp .env.example .env

Edit .env and add your API keys:

# Required for embeddings and chat
OPENAI_API_KEY=sk-your-key-here

# Optional external APIs
NEWSAPI_KEY=your-key
GITHUB_TOKEN=your-token
  3. Start the services
docker compose up --build

Wait for the services to start. You should see:

✅ Database ready on postgres://localhost:5432
✅ API ready on http://localhost:8000
  4. Run database migrations
docker compose exec app alembic upgrade head
  5. Access the API

First Research Task

Research your first startup:

curl -X POST "http://localhost:8000/research" \
  -H "Content-Type: application/json" \
  -d '{
    "startups": [
      {"name": "Stripe", "domain": "stripe.com"}
    ],
    "run_mode": "sync"
  }'

Query with natural language:

curl -X POST "http://localhost:8000/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Tell me about Stripe",
    "k": 1,
    "use_llm": true
  }'

📖 Complete Usage Guide

How to Use This System

This research assistant has three main ways to interact with startup data:

1️⃣ Research Startups (Gather Data)

Collect comprehensive data from multiple sources:

# Research a single startup
curl -X POST "http://localhost:8000/research" \
  -H "Content-Type: application/json" \
  -d '{
    "startups": [{"name": "Stripe", "domain": "stripe.com"}],
    "run_mode": "sync"
  }'

# Research multiple startups (async mode recommended for 4+)
curl -X POST "http://localhost:8000/research" \
  -H "Content-Type: application/json" \
  -d '{
    "startups": [
      {"name": "Stripe", "domain": "stripe.com"},
      {"name": "Anthropic", "domain": "anthropic.com"},
      {"name": "Scale AI", "domain": "scale.com"}
    ],
    "run_mode": "async"
  }'
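
For scripted use, the same request body can be built and sent from Python. This is a minimal standard-library sketch: the helper names `build_research_payload` and `post_json` are illustrative and not part of the project, the `run_mode` check only reflects the two values shown in the examples above, and nothing is assumed about the response shape.

```python
import json
import urllib.request


def build_research_payload(startups, run_mode="sync"):
    """Build the JSON body for POST /research from (name, domain) pairs."""
    # Only the two modes shown in the curl examples are accepted here.
    if run_mode not in ("sync", "async"):
        raise ValueError("run_mode must be 'sync' or 'async'")
    return {
        "startups": [{"name": name, "domain": domain} for name, domain in startups],
        "run_mode": run_mode,
    }


def post_json(url, payload, timeout=60):
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


body = build_research_payload(
    [("Stripe", "stripe.com"), ("Anthropic", "anthropic.com")],
    run_mode="async",
)
print(json.dumps(body, indent=2))
```

With the stack running, `post_json("http://localhost:8000/research", body)` sends the request. Building the payload in a separate function keeps it testable without a live server.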

What gets collected:

  • Company descriptions, logos, websites
  • Funding data (total raised, rounds, investors), now with web search + LLM 🆕
  • Team information (founders, employee count)
  • Valuation and traction metrics, extracted from press releases 🆕
  • Tech stack (programming languages, frameworks)
  • Recent news articles
  • GitHub repositories
  • Social media profiles
  • Embeddings for semantic search (requires OpenAI API key)

2️⃣ Chat with Natural Language (Ask Questions)

Ask questions in plain English about your startups:

# Using curl
curl -X POST "http://localhost:8000/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Which startups work with AI?",
    "k": 5,
    "use_llm": true
  }'

# Using the helper script (easier!)
./scripts/chat.sh "Tell me about Stripe"
./scripts/chat.sh 'Which startups have raised over $100M?'  # single quotes stop the shell expanding $1
./scripts/chat.sh "Find fintech companies in San Francisco"
./scripts/chat.sh "Compare Anthropic and Scale AI"

Example questions:

  • "What does [Company] do?"
  • "Which startups are in [industry]?"
  • "Compare [Company A] and [Company B]"
  • "Find startups using [technology]"
  • "Which companies are based in [location]?"

3️⃣ Query & Analyze (Explore Data)

Via REST API:

# List all startups
curl "http://localhost:8000/startups?limit=10"

# Search by name
curl "http://localhost:8000/startups?search=stripe"

# Filter by industry/tags
curl "http://localhost:8000/startups?industry=AI&tags=B2B"

# Get analytics overview
curl "http://localhost:8000/analytics"
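
When filters come from user input, it is safer to let a library encode the query string than to concatenate it by hand. A small sketch using the parameter names that appear in this README (`limit`, `search`, `industry`, `tags`, `min_funding`); the `startups_url` helper itself is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"


def startups_url(**filters):
    """Return a /startups URL, skipping filters whose value is None."""
    params = {key: value for key, value in filters.items() if value is not None}
    query = urlencode(params)  # percent-encodes values; spaces become '+'
    return f"{BASE_URL}/startups" + (f"?{query}" if query else "")


print(startups_url(limit=10))
print(startups_url(industry="AI", tags="B2B"))
print(startups_url(search="scale ai", min_funding=None))
```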

Via Database:

# Quick stats helper script
./scripts/view_db.sh

# Direct psql access
docker compose exec db psql -U startup_user -d startup_research

# SQL query example
docker compose exec db psql -U startup_user -d startup_research \
  -c "SELECT name, funding_total_usd FROM startups ORDER BY funding_total_usd DESC LIMIT 10;"

Via GUI Tools (Recommended):

Connection Details:

Host: localhost
Port: 5432
Database: startup_research
User: startup_user
Password: startup_password

Important Setup Notes

✅ Your .env File Location:

The .env file must be in the project root (same directory as docker-compose.yml):

/your-path/startup-scanner/.env

Required for Chat/Embeddings:

OPENAI_API_KEY=sk-your-key-here

Optional (for richer data):

GOOGLE_API_KEY=your-key              # Web search for financial data 🆕
GOOGLE_SEARCH_ENGINE_ID=your-cx-id   # Google Custom Search Engine ID 🆕
NEWSAPI_KEY=your-key                 # News articles
GITHUB_TOKEN=your-pat-token          # Repositories

After adding/changing API keys:

docker compose restart app

Troubleshooting

Chat returns "no results"?

  • Startups need embeddings → Re-research them with OpenAI key enabled
  • Check embeddings: docker compose exec db psql -U startup_user -d startup_research -c "SELECT name, embedding IS NOT NULL FROM startups;"

Research fails?

  • Check logs: docker compose logs app | tail -50
  • Verify domain is correct
  • Some sites block scrapers (use API keys when possible)

Embeddings not enabled?

  • Verify OpenAI key: grep OPENAI_API_KEY .env
  • Restart app: docker compose restart app
  • Check health: curl http://localhost:8000/health (should show embeddings_enabled: true)
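
The health check can also be scripted. The sketch below assumes the `/health` response is a JSON object with a top-level boolean `embeddings_enabled`, as the hint above suggests; the `status` field in the sample is made up for illustration.

```python
import json


def embeddings_enabled(health_json: str) -> bool:
    """Return True only if the health payload reports embeddings as enabled."""
    payload = json.loads(health_json)
    return payload.get("embeddings_enabled") is True


# Sample body; the real response may carry different or additional fields.
sample = '{"status": "ok", "embeddings_enabled": true}'
print(embeddings_enabled(sample))
```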

📚 Documentation Index

This project has comprehensive documentation to help you get started and master all features:

🌐 Interactive Documentation


🏗️ Architecture

High-Level Overview

┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       ▼
┌─────────────────────────────────────┐
│         FastAPI Application         │
│  ┌────────────────────────────────┐ │
│  │  REST Endpoints                │ │
│  │  /research /startups           │ │
│  │  /analytics /chat              │ │
│  └────────────────────────────────┘ │
│  ┌────────────────────────────────┐ │
│  │  Research Engine               │ │
│  │  • Concurrent scraping         │ │
│  │  • Rate limiting               │ │
│  │  • Source aggregation          │ │
│  └────────────────────────────────┘ │
│  ┌────────────────────────────────┐ │
│  │  NLP & Embeddings              │ │
│  │  • OpenAI embeddings           │ │
│  │  • Semantic search             │ │
│  │  • Chat synthesis              │ │
│  └────────────────────────────────┘ │
└─────────────┬───────────────────────┘
              │
              ▼
    ┌─────────────────────┐
    │   PostgreSQL        │
    │   + pgvector        │
    │   + pg_trgm         │
    └─────────────────────┘

Key Components

1. Research Engine (app/research/)

  • Runner (runner.py): Orchestrates concurrent research tasks
  • Scrapers (sources/): Modular scrapers for each data source
    • website.py: Company homepage, about, team pages
    • newsapi.py: Recent news articles
    • github.py: Repository and tech stack info

2. Database Layer (app/models.py, app/db.py)

  • Startups Table: Comprehensive company profiles with 30+ fields
  • Scrape Jobs Table: Granular tracking of each source scrape
  • Indexes: GIN indexes for arrays, trigram for fuzzy search, HNSW for vectors

3. NLP Module (app/nlp/)

  • Embeddings Service (embeddings.py): Vector generation and semantic search
  • Chat Service (chat.py): Natural language query processing and LLM synthesis
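
To make the semantic search step concrete, the sketch below ranks items by cosine similarity in memory. It illustrates the idea only: in this project the ranking is delegated to pgvector inside PostgreSQL, and the toy 3-dimensional vectors and company labels are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, items, k=5):
    """Return the k (name, score) pairs most similar to query_vec."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors; real embeddings have far more dimensions.
catalog = [
    ("fintech-co", [0.9, 0.1, 0.0]),
    ("ai-lab", [0.1, 0.9, 0.2]),
    ("devtools-co", [0.2, 0.3, 0.9]),
]
for name, score in top_k([0.0, 1.0, 0.1], catalog, k=2):
    print(f"{name}: {score:.3f}")
```

The `k` argument plays the same role as the `k` field in the `/chat` request: it bounds how many nearest matches are returned.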

4. API Layer (app/api/)

  • Research (research.py): Start research tasks (async/sync modes)
  • Startups (startups.py): CRUD operations with filtering
  • Analytics (analytics.py): Aggregated insights
  • Chat (chat.py): Natural language interface

⚙️ Configuration

Environment Variables

Variable                 Required     Default  Description
POSTGRES_USER            Yes          -        PostgreSQL username
POSTGRES_PASSWORD        Yes          -        PostgreSQL password
POSTGRES_DB              Yes          -        Database name
OPENAI_API_KEY           Recommended  -        OpenAI API key for embeddings/chat
NEWSAPI_KEY              No           -        NewsAPI key for articles
GITHUB_TOKEN             No           -        GitHub API token
MAX_CONCURRENT_REQUESTS  No           50       Global request concurrency
PER_DOMAIN_LIMIT         No           3        Concurrent requests per domain
ENABLE_EMBEDDINGS        No           true     Enable vector embeddings
ENABLE_RATE_LIMITING     No           true     Enable rate limiting
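
As an illustration of how the optional variables and their defaults could be read, here is a standalone sketch. It is not the contents of app/config.py; the `Settings` class, `env_bool`, and `load_settings` are hypothetical names, and the accepted boolean spellings are an assumption.

```python
import os
from dataclasses import dataclass


def env_bool(name: str, default: bool) -> bool:
    """Treat 1/true/yes/on (case-insensitive) as True, anything else as False."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    max_concurrent_requests: int
    per_domain_limit: int
    enable_embeddings: bool
    enable_rate_limiting: bool


def load_settings() -> Settings:
    """Build Settings from the environment, falling back to the documented defaults."""
    return Settings(
        max_concurrent_requests=int(os.environ.get("MAX_CONCURRENT_REQUESTS", "50")),
        per_domain_limit=int(os.environ.get("PER_DOMAIN_LIMIT", "3")),
        enable_embeddings=env_bool("ENABLE_EMBEDDINGS", True),
        enable_rate_limiting=env_bool("ENABLE_RATE_LIMITING", True),
    )


print(load_settings())
```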

Scraping Strategy

The system follows these principles:

  1. API-First: Prefer official APIs when available (NewsAPI, GitHub)
  2. Polite Crawling: Respect robots.txt, implement per-domain rate limits
  3. Fallback Strategy: If API unavailable, gracefully fall back to scraping
  4. Partial Success: Store partial data if some sources fail
  5. Caching: Store raw responses to avoid re-scraping
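
The "Partial Success" principle maps naturally onto `asyncio.gather(..., return_exceptions=True)`, which returns failures as values instead of aborting the whole batch. The sketch below uses stand-in coroutines (`fetch_website`, `fetch_news`) that are not the project's scrapers, and the result layout is only one possible choice.

```python
import asyncio


async def fetch_website(name):
    # Stand-in for a source that succeeds.
    return {"description": f"{name} homepage summary"}


async def fetch_news(name):
    # Stand-in for a source that fails.
    raise TimeoutError("news source timed out")


async def research(name):
    """Query all sources concurrently and keep whatever succeeded."""
    sources = {"website": fetch_website, "news": fetch_news}
    results = await asyncio.gather(
        *(fetch(name) for fetch in sources.values()),
        return_exceptions=True,  # exceptions come back as results
    )
    data, errors = {}, {}
    for source, result in zip(sources, results):
        if isinstance(result, Exception):
            errors[source] = repr(result)
        else:
            data[source] = result
    return {"name": name, "data": data, "errors": errors}


print(asyncio.run(research("Stripe")))
```

Recording errors per source alongside the data keeps a failed source visible without discarding the rest of the profile.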

Rate Limiting

  • Global: 50 concurrent requests (configurable)
  • Per-Domain: 3 concurrent requests (prevents overwhelming targets)
  • Token Bucket: 100 requests/minute (respects provider limits)
  • Retry Logic: Exponential backoff for 429/5xx errors
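
The limits above can be combined roughly as follows. This is a sketch under stated assumptions, not the project's http_client: `TokenBucket`, `Throttle`, and `backoff_delay` are hypothetical names, the defaults mirror the numbers listed above, and the backoff base and cap are arbitrary.

```python
import asyncio
import time


class TokenBucket:
    """Permit roughly `rate` acquisitions per `per` seconds."""

    def __init__(self, rate: int, per: float = 60.0):
        self.capacity = float(rate)
        self.tokens = float(rate)
        self.refill_per_sec = rate / per
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            while True:
                now = time.monotonic()
                elapsed = now - self.updated
                self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                # Sleep just long enough for one token to refill.
                await asyncio.sleep((1 - self.tokens) / self.refill_per_sec)


class Throttle:
    """Combine a global cap, per-domain caps, and a token bucket."""

    def __init__(self, global_limit=50, per_domain_limit=3, rate_per_minute=100):
        self._global = asyncio.Semaphore(global_limit)
        self._per_domain_limit = per_domain_limit
        self._domains = {}
        self._bucket = TokenBucket(rate_per_minute)

    def _domain_semaphore(self, domain):
        if domain not in self._domains:
            self._domains[domain] = asyncio.Semaphore(self._per_domain_limit)
        return self._domains[domain]

    async def run(self, domain, request_fn):
        await self._bucket.acquire()
        async with self._global, self._domain_semaphore(domain):
            return await request_fn()


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay before retry `attempt` (0-based): base * 2**attempt, capped."""
    return min(cap, base * (2 ** attempt))


async def demo() -> int:
    throttle = Throttle()  # constructed inside the running event loop
    active = peak = 0

    async def fake_request():
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1

    await asyncio.gather(*(throttle.run("example.com", fake_request) for _ in range(10)))
    return peak  # never exceeds per_domain_limit


print("peak concurrency for example.com:", asyncio.run(demo()))
print("retry delays:", [backoff_delay(n) for n in range(4)])
```

A production retry loop would typically add random jitter to the delays so that many clients do not retry in lockstep.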

📚 API Documentation

Endpoints Overview

Endpoint        Method  Description
/research       POST    Start research for startups
/startups       GET     List startups with filters
/startups/{id}  GET     Get single startup
/analytics      GET     Aggregated statistics
/chat           POST    Natural language queries
/health         GET     Health check

Examples

See EXAMPLES.md for comprehensive API usage examples.

Start Research

curl -X POST "http://localhost:8000/research" \
  -H "Content-Type: application/json" \
  -d '{
    "startups": [
      {"name": "Notion", "domain": "notion.so"},
      {"name": "Vercel", "domain": "vercel.com"}
    ],
    "run_mode": "async"
  }'

List Startups with Filters

curl "http://localhost:8000/startups?industry=Technology&min_funding=10000000&limit=20"

Get Analytics

curl "http://localhost:8000/analytics?type=by_industry"

Chat Query

curl -X POST "http://localhost:8000/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Which startups are AI-focused and have raised over $50M?",
    "k": 5,
    "use_llm": true
  }'

🛠️ Development

Project Structure

startup-scanner/
├── app/
│   ├── api/              # API endpoints
│   │   ├── research.py
│   │   ├── startups.py
│   │   ├── analytics.py
│   │   └── chat.py
│   ├── nlp/              # NLP and embeddings
│   │   ├── embeddings.py
│   │   └── chat.py
│   ├── research/         # Research engine
│   │   ├── runner.py
│   │   └── sources/      # Data source scrapers
│   ├── utils/            # Utilities
│   │   ├── http_client.py
│   │   └── parse_html.py
│   ├── config.py         # Configuration
│   ├── db.py             # Database setup
│   ├── models.py         # SQLAlchemy models
│   ├── schemas.py        # Pydantic schemas
│   └── main.py           # FastAPI app
├── alembic/              # Database migrations
├── docker/               # Docker init scripts
├── tests/                # Test suite
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
└── README.md

Local Development

  1. Install dependencies locally (optional, for IDE support):
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows
pip install -r requirements.txt
  2. Run database only:
docker compose up db -d
  3. Run app locally:
set -a; source .env; set +a  # export every variable in .env; tolerates comment lines
uvicorn app.main:app --reload

Using Make Commands

make help        # Show available commands
make build       # Build Docker images
make up          # Start all services
make down        # Stop all services
make logs        # View logs
make shell       # Open app shell
make db-shell    # Open PostgreSQL shell
make migrate     # Run migrations
make test        # Run tests
make clean       # Clean up containers and volumes

Adding a New Data Source

  1. Create a new scraper in app/research/sources/:
# app/research/sources/my_source.py
from typing import Any, Dict, Optional

from app.research.sources.base import BaseScraper


class MySourceScraper(BaseScraper):
    @property
    def source_name(self) -> str:
        return "my_source"

    async def scrape(self, startup_name: str, domain: Optional[str] = None) -> Dict[str, Any]:
        # Implement scraping logic
        return {"field": "value"}
  2. Add to runner in app/research/runner.py:
from app.research.sources.my_source import MySourceScraper

self.scrapers = [
    # ... existing scrapers
    MySourceScraper(self.http_client),
]
  3. Add API key configuration if needed in app/config.py.

🧪 Testing

Run All Tests

docker compose exec app pytest

Run with Coverage

docker compose exec app pytest --cov=app --cov-report=html

Run Specific Tests

# Test models only
docker compose exec app pytest tests/test_models.py

# Test API endpoints
docker compose exec app pytest tests/test_api.py

# Test utilities
docker compose exec app pytest tests/test_utils.py

Test Structure

  • tests/conftest.py: Fixtures and test configuration
  • tests/test_models.py: Database model tests
  • tests/test_api.py: API endpoint tests
  • tests/test_utils.py: Utility function tests

🚢 Deployment

Production Considerations

  1. Environment:

    • Set ENVIRONMENT=production
    • Use strong passwords
    • Enable HTTPS (reverse proxy like Nginx)
  2. Database:

    • Use managed PostgreSQL (AWS RDS, Google Cloud SQL)
    • Enable automated backups
    • Create vector index after initial data load
  3. API Keys:

    • Store securely (AWS Secrets Manager, HashiCorp Vault)
    • Rotate regularly
    • Monitor usage and costs
  4. Monitoring:

    • Add application performance monitoring (APM)
    • Set up error tracking (Sentry)
    • Monitor rate limits and quotas
  5. Scaling:

    • Use job queue (Celery + Redis) for async research
    • Scale horizontally with multiple app instances
    • Add caching layer (Redis) for frequently accessed data

Docker Deployment

# Build production image
docker build -t startup-research-assistant:latest .

# Run with production settings
docker run -d \
  --name startup-app \
  -e ENVIRONMENT=production \
  --env-file .env.production \
  -p 8000:8000 \
  startup-research-assistant:latest

Creating Vector Index (After Data Load)

# Connect to the database (shell command, run from the project root)
docker compose exec db psql -U startup_user -d startup_research

-- Then, inside psql, create an HNSW index for fast cosine-similarity search
CREATE INDEX idx_startups_embedding_hnsw
ON startups
USING hnsw (embedding vector_cosine_ops);

📄 License

This project is provided as-is for demonstration and educational purposes.

📧 Contact

For questions or feedback, please reach out via the project repository.


Built with ❤️ using FastAPI, PostgreSQL, pgvector, and OpenAI
