Startup Research Assistant

An automated research assistant that intelligently gathers, analyzes, and serves startup information through REST APIs and natural language chat interfaces.

🚀 Features

Core Capabilities

- Concurrent Multi-Source Research: Automatically scrapes and queries multiple data sources in parallel
  - Company websites (homepage, about, team pages)
  - Web Search + LLM (financial data extraction from press releases and articles) 🆕
  - NewsAPI (recent news articles)
  - GitHub API (open source repositories, tech stack)
- Semantic Search: Powered by OpenAI embeddings and pgvector for similarity-based queries
- Natural Language Interface: Chat-style endpoint that understands questions like:
  - "Which AI startups have raised over $10M?"
  - "Show me fintech companies in San Francisco"
  - "Find startups using Python and PostgreSQL"
- Analytics Dashboard: Aggregated insights including:
  - Funding distribution
  - Industry breakdown
  - Tech stack analysis
  - Success metrics

Technical Highlights

Async/Await Architecture: Built with asyncio for high-performance concurrent operations
Rate Limiting & Throttling: Per-domain semaphores and token bucket rate limiting
Structured Storage: PostgreSQL with pgvector for embeddings and full-text search
Docker-Native: Full docker-compose setup for reproducible local deployment
Type-Safe: Comprehensive Pydantic schemas and SQLAlchemy models
Test Coverage: Unit and integration tests with pytest

🏁 Quick Start

Prerequisites

Docker and Docker Compose
OpenAI API key (for embeddings and chat)

Installation

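Clone the repository and change into the project directory (substitute the URL of your copy of the repository)

git clone <repository-url>
cd startup-scanner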

Create environment file

cp .env.example .env

Edit .env and add your API keys:

# Required for embeddings and chat
OPENAI_API_KEY=sk-your-key-here

# Optional external APIs
NEWSAPI_KEY=your-key
GITHUB_TOKEN=your-token

Start the services

docker compose up --build

Wait for the services to start. You should see:

✅ Database ready on postgres://localhost:5432
✅ API ready on http://localhost:8000

Run database migrations

docker compose exec app alembic upgrade head

Access the API

Interactive Docs: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc
Health Check: http://localhost:8000/health

First Research Task

Research your first startup:

curl -X POST "http://localhost:8000/research" \
  -H "Content-Type: application/json" \
  -d '{
    "startups": [
      {"name": "Stripe", "domain": "stripe.com"}
    ],
    "run_mode": "sync"
  }'

Query with natural language:

curl -X POST "http://localhost:8000/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Tell me about Stripe",
    "k": 1,
    "use_llm": true
  }'

📖 Complete Usage Guide

How to Use This System

This research assistant has three main ways to interact with startup data:

1️⃣ Research Startups (Gather Data)

Collect comprehensive data from multiple sources:

# Research a single startup
curl -X POST "http://localhost:8000/research" \
  -H "Content-Type: application/json" \
  -d '{
    "startups": [{"name": "Stripe", "domain": "stripe.com"}],
    "run_mode": "sync"
  }'

# Research multiple startups (async mode recommended for 4+)
curl -X POST "http://localhost:8000/research" \
  -H "Content-Type: application/json" \
  -d '{
    "startups": [
      {"name": "Stripe", "domain": "stripe.com"},
      {"name": "Anthropic", "domain": "anthropic.com"},
      {"name": "Scale AI", "domain": "scale.com"}
    ],
    "run_mode": "async"
  }'

What gets collected:

Company descriptions, logos, websites
Funding data (total raised, rounds, investors) - now with web search + LLM! 🆕
Team information (founders, employee count)
Valuation, traction metrics - extracted from press releases 🆕
Tech stack (programming languages, frameworks)
Recent news articles
GitHub repositories
Social media profiles
Embeddings for semantic search (requires OpenAI API key)
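
The research requests shown earlier in this section can also be built from Python. The following is a minimal sketch using only the standard library; `build_research_request`, `send`, and `ASYNC_THRESHOLD` are hypothetical names introduced here, and the threshold of four simply mirrors the "async mode recommended for 4+" hint, not a rule enforced by the server.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # default address from Quick Start
ASYNC_THRESHOLD = 4  # assumption: mirrors the "async for 4+" recommendation


def build_research_request(startups, base_url=API_BASE):
    """Build (but do not send) a POST /research request."""
    run_mode = "async" if len(startups) >= ASYNC_THRESHOLD else "sync"
    payload = {"startups": startups, "run_mode": run_mode}
    return urllib.request.Request(
        f"{base_url}/research",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON body (needs the API running)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the services running, `send(build_research_request([{"name": "Stripe", "domain": "stripe.com"}]))` issues the same call as the first curl example.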

2️⃣ Chat with Natural Language (Ask Questions)

Ask questions in plain English about your startups:

# Using curl
curl -X POST "http://localhost:8000/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Which startups work with AI?",
    "k": 5,
    "use_llm": true
  }'

# Using the helper script (easier!)
./scripts/chat.sh "Tell me about Stripe"
./scripts/chat.sh 'Which startups have raised over $100M?'  # single quotes stop the shell from expanding $1
./scripts/chat.sh "Find fintech companies in San Francisco"
./scripts/chat.sh "Compare Anthropic and Scale AI"

Example questions:

"What does [Company] do?"
"Which startups are in [industry]?"
"Compare [Company A] and [Company B]"
"Find startups using [technology]"
"Which companies are based in [location]?"

3️⃣ Query & Analyze (Explore Data)

Via REST API:

# List all startups
curl "http://localhost:8000/startups?limit=10"

# Search by name
curl "http://localhost:8000/startups?search=stripe"

# Filter by industry/tags
curl "http://localhost:8000/startups?industry=AI&tags=B2B"

# Get analytics overview
curl "http://localhost:8000/analytics"
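
When filters are assembled programmatically, it is safer to let a library encode the query string than to concatenate it by hand. A small sketch, assuming the filter names shown in this README (`search`, `industry`, `tags`, `limit`); `startups_url` is a hypothetical helper, not part of the project:

```python
from urllib.parse import urlencode

API_BASE = "http://localhost:8000"  # default address from Quick Start


def startups_url(base_url=API_BASE, **filters):
    """Return a /startups URL, skipping filters whose value is None."""
    params = {key: value for key, value in filters.items() if value is not None}
    query = urlencode(params)  # percent-encodes spaces and special characters
    return f"{base_url}/startups" + (f"?{query}" if query else "")
```

For example, `startups_url(industry="AI", tags="B2B")` reproduces the URL used in the filter example.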

Via Database:

# Quick stats helper script
./scripts/view_db.sh

# Direct psql access
docker compose exec db psql -U startup_user -d startup_research

# SQL query example
docker compose exec db psql -U startup_user -d startup_research \
  -c "SELECT name, funding_total_usd FROM startups ORDER BY funding_total_usd DESC LIMIT 10;"

Via GUI Tools (Recommended):

TablePlus: https://tableplus.com (Beautiful UI, fast)
DBeaver: https://dbeaver.io (Free, powerful)
pgAdmin: https://www.pgadmin.org (PostgreSQL official)

Connection Details:

Host: localhost
Port: 5432
Database: startup_research
User: startup_user
Password: startup_password

Important Setup Notes

✅ Your .env File Location:

The .env file must be in the project root (same directory as docker-compose.yml):

/your-path/startup-scanner/.env

Required for Chat/Embeddings:

OPENAI_API_KEY=sk-your-key-here

Optional (for richer data):

GOOGLE_API_KEY=your-key              # Web search for financial data 🆕
GOOGLE_SEARCH_ENGINE_ID=your-cx-id  # Google Custom Search Engine ID 🆕
NEWSAPI_KEY=your-key                 # News articles
GITHUB_TOKEN=your-pat-token          # Repositories

After adding/changing API keys:

docker compose restart app

Troubleshooting

Chat returns "no results"?

Chat finds startups through their embeddings, so startups researched without an OpenAI key cannot be matched → Re-research them with OPENAI_API_KEY set
Check embeddings: docker compose exec db psql -U startup_user -d startup_research -c "SELECT name, embedding IS NOT NULL FROM startups;"

Research fails?

Check logs: docker compose logs app | tail -50
Verify domain is correct
Some sites block scrapers (use API keys when possible)

Embeddings not enabled?

Verify OpenAI key: grep OPENAI_API_KEY .env
Restart app: docker compose restart app
Check health: curl http://localhost:8000/health (should show embeddings_enabled: true)

📚 Documentation Index

This project has comprehensive documentation to help you get started and master all features:

🌐 Interactive Documentation

API Docs: http://localhost:8000/docs (when running)
ReDoc: http://localhost:8000/redoc (when running)

🏗 Architecture

High-Level Overview

┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       ▼
┌─────────────────────────────────────┐
│         FastAPI Application         │
│  ┌────────────────────────────────┐ │
│  │  REST Endpoints                │ │
│  │  /research /startups           │ │
│  │  /analytics /chat              │ │
│  └────────────────────────────────┘ │
│  ┌────────────────────────────────┐ │
│  │  Research Engine               │ │
│  │  • Concurrent scraping         │ │
│  │  • Rate limiting               │ │
│  │  • Source aggregation          │ │
│  └────────────────────────────────┘ │
│  ┌────────────────────────────────┐ │
│  │  NLP & Embeddings              │ │
│  │  • OpenAI embeddings           │ │
│  │  • Semantic search             │ │
│  │  • Chat synthesis              │ │
│  └────────────────────────────────┘ │
└─────────────┬───────────────────────┘
              │
              ▼
    ┌─────────────────────┐
    │   PostgreSQL        │
    │   + pgvector        │
    │   + pg_trgm         │
    └─────────────────────┘

Key Components

1. Research Engine (`app/research/`)

Runner (runner.py): Orchestrates concurrent research tasks
Scrapers (sources/): Modular scrapers for each data source
- website.py: Company homepage, about, team pages
- newsapi.py: Recent news articles
- github.py: Repository and tech stack info

2. Database Layer (`app/models.py`, `app/db.py`)

Startups Table: Comprehensive company profiles with 30+ fields
Scrape Jobs Table: Granular tracking of each source scrape
Indexes: GIN indexes for arrays, trigram for fuzzy search, HNSW for vectors

3. NLP Module (`app/nlp/`)

Embeddings Service (embeddings.py): Vector generation and semantic search
Chat Service (chat.py): Natural language query processing and LLM synthesis
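
To make "semantic search" concrete: each startup is stored with an embedding vector, and a query is answered by ranking stored vectors by similarity to the query's vector. In this project that ranking is done by pgvector inside PostgreSQL; the pure-Python sketch below only illustrates the idea with cosine similarity on toy vectors. The function names are hypothetical, and `k` is assumed to play the same role as the `k` field of the chat request.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float],
    embeddings: Dict[str, Sequence[float]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k names whose vectors are most similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in embeddings.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```

Real embeddings have hundreds or thousands of dimensions, which is why an approximate index such as HNSW is used in the database instead of a full scan like this one.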

4. API Layer (`app/api/`)

Research (research.py): Start research tasks (async/sync modes)
Startups (startups.py): CRUD operations with filtering
Analytics (analytics.py): Aggregated insights
Chat (chat.py): Natural language interface

⚙️ Configuration

Environment Variables

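| Variable | Required | Default | Description |
|---|---|---|---|
| `POSTGRES_USER` | Yes | - | PostgreSQL username |
| `POSTGRES_PASSWORD` | Yes | - | PostgreSQL password |
| `POSTGRES_DB` | Yes | - | Database name |
| `OPENAI_API_KEY` | Recommended | - | OpenAI API key for embeddings/chat |
| `GOOGLE_API_KEY` | No | - | Google API key for web search (financial data) |
| `GOOGLE_SEARCH_ENGINE_ID` | No | - | Google Custom Search Engine ID |
| `NEWSAPI_KEY` | No | - | NewsAPI key for news articles |
| `GITHUB_TOKEN` | No | - | GitHub API token |
| `MAX_CONCURRENT_REQUESTS` | No | 50 | Global request concurrency |
| `PER_DOMAIN_LIMIT` | No | 3 | Concurrent requests per domain |
| `ENABLE_EMBEDDINGS` | No | true | Enable vector embeddings |
| `ENABLE_RATE_LIMITING` | No | true | Enable rate limiting |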
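
The sketch below shows one way the documented defaults could be read from the environment. It is an illustration only: the project's actual loader lives in app/config.py and may work differently, and the `Settings` class, `load_settings`, and the accepted boolean spellings are assumptions made for this example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Assumption: these spellings count as true; anything else is false."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    max_concurrent_requests: int = 50
    per_domain_limit: int = 3
    enable_embeddings: bool = True
    enable_rate_limiting: bool = True
    openai_api_key: Optional[str] = None


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from `env` (defaults to os.environ), applying defaults."""
    env = os.environ if env is None else env
    return Settings(
        max_concurrent_requests=int(env.get("MAX_CONCURRENT_REQUESTS", "50")),
        per_domain_limit=int(env.get("PER_DOMAIN_LIMIT", "3")),
        enable_embeddings=_as_bool(env.get("ENABLE_EMBEDDINGS", "true")),
        enable_rate_limiting=_as_bool(env.get("ENABLE_RATE_LIMITING", "true")),
        openai_api_key=env.get("OPENAI_API_KEY") or None,
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests.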

Scraping Strategy

The system follows these principles:

API-First: Prefer official APIs when available (NewsAPI, GitHub)
Polite Crawling: Respect robots.txt, implement per-domain rate limits
Fallback Strategy: If API unavailable, gracefully fall back to scraping
Partial Success: Store partial data if some sources fail
Caching: Store raw responses to avoid re-scraping

Rate Limiting

Global: 50 concurrent requests (configurable)
Per-Domain: 3 concurrent requests (prevents overwhelming targets)
Token Bucket: 100 requests/minute (respects provider limits)
Retry Logic: Exponential backoff for 429/5xx errors
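
The three mechanisms above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the project's http_client: the class and function names are hypothetical, and the backoff base (0.5 s) and cap (30 s) are example values that the README does not specify.

```python
import asyncio
import time
from collections import defaultdict
from contextlib import asynccontextmanager
from typing import AsyncIterator, Callable, DefaultDict


class TokenBucket:
    """Allows `rate` requests per `per` seconds, refilled continuously."""

    def __init__(self, rate: int = 100, per: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = float(rate)
        self.refill_per_sec = rate / per
        self.tokens = float(rate)  # start full
        self.clock = clock
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        now = self.clock()
        refill = (now - self.updated) * self.refill_per_sec
        self.tokens = min(self.capacity, self.tokens + refill)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class DomainLimiter:
    """A global semaphore plus one semaphore per domain.

    Create instances inside a running event loop.
    """

    def __init__(self, global_limit: int = 50, per_domain: int = 3) -> None:
        self._global = asyncio.Semaphore(global_limit)
        self._domains: DefaultDict[str, asyncio.Semaphore] = defaultdict(
            lambda: asyncio.Semaphore(per_domain)
        )

    @asynccontextmanager
    async def slot(self, domain: str) -> AsyncIterator[None]:
        """Hold one global slot and one slot for `domain` while the body runs."""
        async with self._global, self._domains[domain]:
            yield


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay before retry `attempt` (0-based): base * 2**attempt, capped."""
    return min(cap, base * (2 ** attempt))
```

A caller would wait until `try_acquire()` succeeds, perform the request inside `async with limiter.slot(domain)`, and on a 429 or 5xx response sleep for `backoff_delay(attempt)` before retrying.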

📚 API Documentation

Endpoints Overview

| Endpoint | Method | Description |
|---|---|---|
| `/research` | POST | Start research for startups |
| `/startups` | GET | List startups with filters |
| `/startups/{id}` | GET | Get single startup |
| `/analytics` | GET | Aggregated statistics |
| `/chat` | POST | Natural language queries |
| `/health` | GET | Health check |

Examples

See EXAMPLES.md for comprehensive API usage examples.

Start Research

curl -X POST "http://localhost:8000/research" \
  -H "Content-Type: application/json" \
  -d '{
    "startups": [
      {"name": "Notion", "domain": "notion.so"},
      {"name": "Vercel", "domain": "vercel.com"}
    ],
    "run_mode": "async"
  }'

List Startups with Filters

curl "http://localhost:8000/startups?industry=Technology&min_funding=10000000&limit=20"

Get Analytics

curl "http://localhost:8000/analytics?type=by_industry"
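
As an illustration of the kind of aggregation an industry breakdown involves, the sketch below groups startup records by industry. The response format of `/analytics` is not documented here, so the output shape is an assumption; `industry_breakdown` is a hypothetical helper, while the field names `industry` and `funding_total_usd` are taken from the filter and SQL examples in this README.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


def industry_breakdown(
    startups: Iterable[Dict[str, Any]],
) -> Dict[str, Dict[str, float]]:
    """Group startups by industry, counting them and summing known funding."""
    summary: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"count": 0, "total_funding_usd": 0.0}
    )
    for startup in startups:
        industry = startup.get("industry") or "Unknown"
        summary[industry]["count"] += 1
        # Missing funding is treated as 0 rather than dropping the startup.
        summary[industry]["total_funding_usd"] += startup.get("funding_total_usd") or 0.0
    return dict(summary)
```

In the running system this grouping would more naturally be done in SQL; the Python version is only meant to show the logic.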

Chat Query

curl -X POST "http://localhost:8000/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Which startups are AI-focused and have raised over $50M?",
    "k": 5,
    "use_llm": true
  }'

🛠 Development

Project Structure

startup-scanner/
├── app/
│   ├── api/              # API endpoints
│   │   ├── research.py
│   │   ├── startups.py
│   │   ├── analytics.py
│   │   └── chat.py
│   ├── nlp/              # NLP and embeddings
│   │   ├── embeddings.py
│   │   └── chat.py
│   ├── research/         # Research engine
│   │   ├── runner.py
│   │   └── sources/      # Data source scrapers
│   ├── utils/            # Utilities
│   │   ├── http_client.py
│   │   └── parse_html.py
│   ├── config.py         # Configuration
│   ├── db.py             # Database setup
│   ├── models.py         # SQLAlchemy models
│   ├── schemas.py        # Pydantic schemas
│   └── main.py           # FastAPI app
├── alembic/              # Database migrations
├── docker/               # Docker init scripts
├── tests/                # Test suite
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
└── README.md

Local Development

Install dependencies locally (optional, for IDE support):

python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows
pip install -r requirements.txt

Run database only:

docker compose up db -d

Run app locally:

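# Export every variable defined in .env (tolerates comment lines, unlike `export $(cat .env | xargs)`)
set -a; source .env; set +a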
uvicorn app.main:app --reload

Using Make Commands

make help        # Show available commands
make build       # Build Docker images
make up          # Start all services
make down        # Stop all services
make logs        # View logs
make shell       # Open app shell
make db-shell    # Open PostgreSQL shell
make migrate     # Run migrations
make test        # Run tests
make clean       # Clean up containers and volumes

Adding a New Data Source

Create a new scraper in app/research/sources/:

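from typing import Any, Dict, Optional

from app.research.sources.base import BaseScraper


class MySourceScraper(BaseScraper):
    """Example scraper; replace the body of scrape() with real logic."""

    @property
    def source_name(self) -> str:
        # Identifier for this source
        return "my_source"

    async def scrape(self, startup_name: str, domain: Optional[str] = None) -> Dict[str, Any]:
        # Implement scraping logic and return the extracted fields
        return {"field": "value"}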

Add to runner in app/research/runner.py:

from app.research.sources.my_source import MySourceScraper

self.scrapers = [
    # ... existing scrapers
    MySourceScraper(self.http_client),
]

Add API key configuration if needed in app/config.py.

🧪 Testing

Run All Tests

docker compose exec app pytest

Run with Coverage

docker compose exec app pytest --cov=app --cov-report=html

Run Specific Tests

# Test models only
docker compose exec app pytest tests/test_models.py

# Test API endpoints
docker compose exec app pytest tests/test_api.py

# Test utilities
docker compose exec app pytest tests/test_utils.py

Test Structure

tests/conftest.py: Fixtures and test configuration
tests/test_models.py: Database model tests
tests/test_api.py: API endpoint tests
tests/test_utils.py: Utility function tests

🚢 Deployment

Production Considerations

Environment:
- Set ENVIRONMENT=production
- Use strong passwords
- Enable HTTPS (reverse proxy like Nginx)
Database:
- Use managed PostgreSQL (AWS RDS, Google Cloud SQL)
- Enable automated backups
- Create vector index after initial data load
API Keys:
- Store securely (AWS Secrets Manager, HashiCorp Vault)
- Rotate regularly
- Monitor usage and costs
Monitoring:
- Add application performance monitoring (APM)
- Set up error tracking (Sentry)
- Monitor rate limits and quotas
Scaling:
- Use job queue (Celery + Redis) for async research
- Scale horizontally with multiple app instances
- Add caching layer (Redis) for frequently accessed data

Docker Deployment

# Build production image
docker build -t startup-research-assistant:latest .

# Run with production settings
docker run -d \
  --name startup-app \
  -e ENVIRONMENT=production \
  --env-file .env.production \
  -p 8000:8000 \
  startup-research-assistant:latest

Creating Vector Index (After Data Load)

# Open a psql session in the database container (shell)
docker compose exec db psql -U startup_user -d startup_research

-- Then, inside psql, create an HNSW index for fast cosine-similarity search
CREATE INDEX idx_startups_embedding_hnsw
ON startups
USING hnsw (embedding vector_cosine_ops);

📄 License

This project is provided as-is for demonstration and educational purposes.

📧 Contact

For questions or feedback, please reach out via the project repository.

Built with ❤️ using FastAPI, PostgreSQL, pgvector, and OpenAI
