An automated research assistant that intelligently gathers, analyzes, and serves startup information through REST APIs and natural language chat interfaces.
- Concurrent Multi-Source Research: Automatically scrapes and queries multiple data sources in parallel
  - Company websites (homepage, about, team pages)
  - Web Search + LLM (financial data extraction from press releases and articles)
  - NewsAPI (recent news articles)
  - GitHub API (open source repositories, tech stack)
- Semantic Search: Powered by OpenAI embeddings and pgvector for similarity-based queries
- Natural Language Interface: Chat-style endpoint that understands questions like:
  - "Which AI startups have raised over $10M?"
  - "Show me fintech companies in San Francisco"
  - "Find startups using Python and PostgreSQL"
- Analytics Dashboard: Aggregated insights including:
  - Funding distribution
  - Industry breakdown
  - Tech stack analysis
  - Success metrics
- Async/Await Architecture: Built with asyncio for high-performance concurrent operations
- Rate Limiting & Throttling: Per-domain semaphores and token bucket rate limiting
- Structured Storage: PostgreSQL with pgvector for embeddings and full-text search
- Docker-Native: Full docker-compose setup for reproducible local deployment
- Type-Safe: Comprehensive Pydantic schemas and SQLAlchemy models
- Test Coverage: Unit and integration tests with pytest
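The per-domain semaphore idea listed above can be illustrated with a minimal sketch. The `ConcurrencyLimiter` class and the `fake_fetch` coroutine are hypothetical names introduced for illustration, not the project's actual code; only the limits (50 global, 3 per domain) come from this README.

```python
import asyncio
from typing import Awaitable, Callable, Dict
from urllib.parse import urlparse


class ConcurrencyLimiter:
    """Hypothetical helper: one global semaphore plus one semaphore per domain."""

    def __init__(self, global_limit: int = 50, per_domain_limit: int = 3) -> None:
        self._global = asyncio.Semaphore(global_limit)
        self._per_domain_limit = per_domain_limit
        self._domains: Dict[str, asyncio.Semaphore] = {}

    def _domain_semaphore(self, url: str) -> asyncio.Semaphore:
        # Lazily create a semaphore the first time a domain is seen.
        domain = urlparse(url).netloc
        if domain not in self._domains:
            self._domains[domain] = asyncio.Semaphore(self._per_domain_limit)
        return self._domains[domain]

    async def run(self, url: str, fetch: Callable[[str], Awaitable[str]]) -> str:
        # Hold a global slot and a domain slot for the duration of the fetch.
        async with self._global, self._domain_semaphore(url):
            return await fetch(url)


async def demo() -> int:
    """Fire 10 requests at one domain and return the peak number in flight."""
    limiter = ConcurrencyLimiter(global_limit=50, per_domain_limit=3)
    active = 0
    peak = 0

    async def fake_fetch(url: str) -> str:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stand-in for network latency
        active -= 1
        return url

    urls = [f"https://example.com/page{i}" for i in range(10)]
    await asyncio.gather(*(limiter.run(u, fake_fetch) for u in urls))
    return peak
```

Calling `asyncio.run(demo())` shows that no more than three requests to the same domain are in flight at once, even though all ten tasks start together.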
- Quick Start
- Usage Guide
- Documentation Index
- Architecture
- Configuration
- API Documentation
- Development
- Testing
- Deployment
- Reflection Questions
- Docker and Docker Compose
- OpenAI API key (for embeddings and chat)
- Clone the repository
cd startup-scanner
- Create environment file
cp .env.example .env
Edit .env and add your API keys:
# Required for embeddings and chat
OPENAI_API_KEY=sk-your-key-here
# Optional external APIs
NEWSAPI_KEY=your-key
GITHUB_TOKEN=your-token
- Start the services
docker compose up --build
Wait for the services to start. You should see:
Database ready on postgres://localhost:5432
API ready on http://localhost:8000
- Run database migrations
docker compose exec app alembic upgrade head
- Access the API
- Interactive Docs: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- Health Check: http://localhost:8000/health
Research your first startup:
curl -X POST "http://localhost:8000/research" \
-H "Content-Type: application/json" \
-d '{
"startups": [
{"name": "Stripe", "domain": "stripe.com"}
],
"run_mode": "sync"
}'
Query with natural language:
curl -X POST "http://localhost:8000/chat" \
-H "Content-Type: application/json" \
-d '{
"query": "Tell me about Stripe",
"k": 1,
"use_llm": true
}'
This research assistant has three main ways to interact with startup data:
Collect comprehensive data from multiple sources:
# Research a single startup
curl -X POST "http://localhost:8000/research" \
-H "Content-Type: application/json" \
-d '{
"startups": [{"name": "Stripe", "domain": "stripe.com"}],
"run_mode": "sync"
}'
# Research multiple startups (async mode recommended for 4+)
curl -X POST "http://localhost:8000/research" \
-H "Content-Type: application/json" \
-d '{
"startups": [
{"name": "Stripe", "domain": "stripe.com"},
{"name": "Anthropic", "domain": "anthropic.com"},
{"name": "Scale AI", "domain": "scale.com"}
],
"run_mode": "async"
}'
What gets collected:
- Company descriptions, logos, websites
- Funding data (total raised, rounds, investors), extracted via web search + LLM
- Team information (founders, employee count)
- Valuation and traction metrics, extracted from press releases
- Tech stack (programming languages, frameworks)
- Recent news articles
- GitHub repositories
- Social media profiles
- Embeddings for semantic search (requires OpenAI API key)
Ask questions in plain English about your startups:
# Using curl
curl -X POST "http://localhost:8000/chat" \
-H "Content-Type: application/json" \
-d '{
"query": "Which startups work with AI?",
"k": 5,
"use_llm": true
}'
# Using the helper script (easier!)
./scripts/chat.sh "Tell me about Stripe"
# Single quotes stop the shell from expanding $1 inside the question
./scripts/chat.sh 'Which startups have raised over $100M?'
./scripts/chat.sh "Find fintech companies in San Francisco"
./scripts/chat.sh "Compare Anthropic and Scale AI"
Example questions:
- "What does [Company] do?"
- "Which startups are in [industry]?"
- "Compare [Company A] and [Company B]"
- "Find startups using [technology]"
- "Which companies are based in [location]?"
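For callers who prefer Python over curl, the chat request shown above can be sketched with the standard library alone. The helper names `build_chat_request` and `ask` are hypothetical, and the response is returned as a plain dictionary because its exact shape is not documented here. Building the JSON in Python also avoids shell quoting issues with characters such as `$`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default local address from the Quick Start


def build_chat_request(query: str, k: int = 5, use_llm: bool = True) -> urllib.request.Request:
    """Build (but do not send) a POST /chat request with the documented fields."""
    payload = {"query": query, "k": k, "use_llm": use_llm}
    return urllib.request.Request(
        url=f"{BASE_URL}/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query: str, k: int = 5, use_llm: bool = True) -> dict:
    """Send the request and decode the JSON body (requires the API to be running)."""
    request = build_chat_request(query, k, use_llm)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

With the services up, `ask("Which startups have raised over $100M?", k=3)` would send the same payload as the curl example.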
Via REST API:
# List all startups
curl "http://localhost:8000/startups?limit=10"
# Search by name
curl "http://localhost:8000/startups?search=stripe"
# Filter by industry/tags
curl "http://localhost:8000/startups?industry=AI&tags=B2B"
# Get analytics overview
curl "http://localhost:8000/analytics"
Via Database:
# Quick stats helper script
./scripts/view_db.sh
# Direct psql access
docker compose exec db psql -U startup_user -d startup_research
# SQL query example
docker compose exec db psql -U startup_user -d startup_research \
-c "SELECT name, funding_total_usd FROM startups ORDER BY funding_total_usd DESC NULLS LAST LIMIT 10;"
Via GUI Tools (Recommended):
- TablePlus: https://tableplus.com (Beautiful UI, fast)
- DBeaver: https://dbeaver.io (Free, powerful)
- pgAdmin: https://www.pgadmin.org (PostgreSQL official)
Connection Details:
Host: localhost
Port: 5432
Database: startup_research
User: startup_user
Password: startup_password
Your .env File Location:
The .env file must be in the project root (same directory as docker-compose.yml):
/your-path/startup-scanner/.env
Required for Chat/Embeddings:
OPENAI_API_KEY=sk-your-key-here
Optional (for richer data):
GOOGLE_API_KEY=your-key # Web search for financial data
GOOGLE_SEARCH_ENGINE_ID=your-cx-id # Google Custom Search Engine ID
NEWSAPI_KEY=your-key # News articles
GITHUB_TOKEN=your-pat-token # Repositories
After adding/changing API keys:
docker compose restart app
Chat returns "no results"?
- Startups need embeddings: re-research them with the OpenAI key enabled
- Check embeddings:
docker compose exec db psql -U startup_user -d startup_research -c "SELECT name, embedding IS NOT NULL FROM startups;"
Research fails?
- Check logs:
docker compose logs app | tail -50
- Verify domain is correct
- Some sites block scrapers (use API keys when possible)
Embeddings not enabled?
- Verify OpenAI key:
grep OPENAI_API_KEY .env
- Restart app:
docker compose restart app
- Check health:
curl http://localhost:8000/health (should show embeddings_enabled: true)
This project has comprehensive documentation to help you get started and master all features:
- API Docs: http://localhost:8000/docs (when running)
- ReDoc: http://localhost:8000/redoc (when running)
┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       ▼
┌─────────────────────────────────────┐
│       FastAPI Application           │
│  ┌────────────────────────────────┐ │
│  │  REST Endpoints                │ │
│  │  /research  /startups          │ │
│  │  /analytics /chat              │ │
│  └────────────────────────────────┘ │
│  ┌────────────────────────────────┐ │
│  │  Research Engine               │ │
│  │  • Concurrent scraping         │ │
│  │  • Rate limiting               │ │
│  │  • Source aggregation          │ │
│  └────────────────────────────────┘ │
│  ┌────────────────────────────────┐ │
│  │  NLP & Embeddings              │ │
│  │  • OpenAI embeddings           │ │
│  │  • Semantic search             │ │
│  │  • Chat synthesis              │ │
│  └────────────────────────────────┘ │
└─────────────┬───────────────────────┘
              │
              ▼
┌─────────────────────┐
│    PostgreSQL       │
│    + pgvector       │
│    + pg_trgm        │
└─────────────────────┘
- Runner (`runner.py`): Orchestrates concurrent research tasks
- Scrapers (`sources/`): Modular scrapers for each data source
  - `website.py`: Company homepage, about, team pages
  - `newsapi.py`: Recent news articles
  - `github.py`: Repository and tech stack info
- Startups Table: Comprehensive company profiles with 30+ fields
- Scrape Jobs Table: Granular tracking of each source scrape
- Indexes: GIN indexes for arrays, trigram for fuzzy search, HNSW for vectors
- Embeddings Service (`embeddings.py`): Vector generation and semantic search
- Chat Service (`chat.py`): Natural language query processing and LLM synthesis
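As a rough illustration of what similarity-based retrieval computes, the sketch below ranks stored vectors by cosine similarity in plain Python. In the running system this work is done inside PostgreSQL by pgvector; the function names and toy two-dimensional vectors here are assumptions for illustration only.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], corpus: Dict[str, List[float]], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k corpus entries most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vector)) for name, vector in corpus.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy data: real embeddings have hundreds or thousands of dimensions.
toy_corpus = {
    "payments": [1.0, 0.0],
    "gaming": [0.0, 1.0],
    "fintech": [1.0, 1.0],
}
```

The `k` argument plays the same role as the `k` field of the `/chat` request: it caps how many nearest matches are passed on for answer synthesis.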
- Research (`research.py`): Start research tasks (async/sync modes)
- Startups (`startups.py`): CRUD operations with filtering
- Analytics (`analytics.py`): Aggregated insights
- Chat (`chat.py`): Natural language interface
| Variable | Required | Default | Description |
|---|---|---|---|
| `POSTGRES_USER` | Yes | - | PostgreSQL username |
| `POSTGRES_PASSWORD` | Yes | - | PostgreSQL password |
| `POSTGRES_DB` | Yes | - | Database name |
| `OPENAI_API_KEY` | Recommended | - | OpenAI API key for embeddings/chat |
| `NEWSAPI_KEY` | No | - | NewsAPI for articles |
| `GITHUB_TOKEN` | No | - | GitHub API token |
| `MAX_CONCURRENT_REQUESTS` | No | 50 | Global request concurrency |
| `PER_DOMAIN_LIMIT` | No | 3 | Concurrent requests per domain |
| `ENABLE_EMBEDDINGS` | No | true | Enable vector embeddings |
| `ENABLE_RATE_LIMITING` | No | true | Enable rate limiting |
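To make the required/default semantics of the table concrete, here is a standard-library sketch of loading these variables. The `Settings` dataclass and `load_settings` function are hypothetical; the project's own `app/config.py` may be structured differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED = ("POSTGRES_USER", "POSTGRES_PASSWORD", "POSTGRES_DB")


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    postgres_user: str
    postgres_password: str
    postgres_db: str
    openai_api_key: Optional[str] = None
    max_concurrent_requests: int = 50
    per_domain_limit: int = 3
    enable_embeddings: bool = True
    enable_rate_limiting: bool = True


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Validate required variables and apply the documented defaults."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(f"Missing required variables: {', '.join(missing)}")
    return Settings(
        postgres_user=env["POSTGRES_USER"],
        postgres_password=env["POSTGRES_PASSWORD"],
        postgres_db=env["POSTGRES_DB"],
        openai_api_key=env.get("OPENAI_API_KEY") or None,
        max_concurrent_requests=int(env.get("MAX_CONCURRENT_REQUESTS", "50")),
        per_domain_limit=int(env.get("PER_DOMAIN_LIMIT", "3")),
        enable_embeddings=_as_bool(env.get("ENABLE_EMBEDDINGS", "true")),
        enable_rate_limiting=_as_bool(env.get("ENABLE_RATE_LIMITING", "true")),
    )
```

Passing a mapping instead of reading `os.environ` directly keeps the loader easy to test without touching the real environment.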
The system follows these principles:
- API-First: Prefer official APIs when available (NewsAPI, GitHub)
- Polite Crawling: Respect robots.txt, implement per-domain rate limits
- Fallback Strategy: If API unavailable, gracefully fall back to scraping
- Partial Success: Store partial data if some sources fail
- Caching: Store raw responses to avoid re-scraping
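The Partial Success principle can be sketched with `asyncio.gather(..., return_exceptions=True)`, which lets one failing source be recorded without discarding the others. The `research_startup` function and the fake scrapers below are hypothetical and are not the project's runner.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Tuple

Scraper = Callable[[str], Awaitable[Dict[str, Any]]]


async def research_startup(
    name: str, scrapers: Dict[str, Scraper]
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Run every scraper concurrently; return (merged data, errors by source)."""
    results = await asyncio.gather(
        *(scrape(name) for scrape in scrapers.values()),
        return_exceptions=True,  # failures come back as values instead of raising
    )
    merged: Dict[str, Any] = {}
    errors: Dict[str, str] = {}
    for source, result in zip(scrapers, results):
        if isinstance(result, Exception):
            errors[source] = repr(result)  # record the failure and keep going
        else:
            merged.update(result)  # later sources overwrite duplicate keys
    return merged, errors


# Fake sources standing in for real scrapers.
async def fake_website(name: str) -> Dict[str, Any]:
    return {"description": f"{name} homepage text"}


async def fake_news(name: str) -> Dict[str, Any]:
    raise TimeoutError("news source timed out")


async def fake_github(name: str) -> Dict[str, Any]:
    return {"tech_stack": ["Python"]}
```

Here the news source fails, yet the website and GitHub fields are still available to store, and the error map says which source to retry.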
- Global: 50 concurrent requests (configurable)
- Per-Domain: 3 concurrent requests (prevents overwhelming targets)
- Token Bucket: 100 requests/minute (respects provider limits)
- Retry Logic: Exponential backoff for 429/5xx errors
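The token bucket and retry rules above can be sketched as follows. This is an illustrative implementation under stated assumptions: the class and function names, the 0.5 s base delay, and the 30 s cap are hypothetical, while the 100 requests/minute rate and the 429/5xx retry condition come from the list above.

```python
import time
from typing import Callable, Optional


class TokenBucket:
    """Refills continuously at rate_per_minute; each request spends one token."""

    def __init__(
        self,
        rate_per_minute: float = 100.0,
        capacity: Optional[float] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate_per_second = rate_per_minute / 60.0
        self.capacity = capacity if capacity is not None else rate_per_minute
        self.tokens = self.capacity  # start full
        self._clock = clock
        self._last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller must wait."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_second)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def should_retry(status_code: int) -> bool:
    """Retry on 429 (rate limited) and on any 5xx server error."""
    return status_code == 429 or 500 <= status_code < 600


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), doubling each time."""
    return min(cap, base * (2 ** attempt))
```

Injecting the clock makes the bucket deterministic in tests. A production version would usually add random jitter to the backoff so that many clients do not retry in lockstep.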
| Endpoint | Method | Description |
|---|---|---|
| `/research` | POST | Start research for startups |
| `/startups` | GET | List startups with filters |
| `/startups/{id}` | GET | Get single startup |
| `/analytics` | GET | Aggregated statistics |
| `/chat` | POST | Natural language queries |
| `/health` | GET | Health check |
See EXAMPLES.md for comprehensive API usage examples.
curl -X POST "http://localhost:8000/research" \
-H "Content-Type: application/json" \
-d '{
"startups": [
{"name": "Notion", "domain": "notion.so"},
{"name": "Vercel", "domain": "vercel.com"}
],
"run_mode": "async"
}'
curl "http://localhost:8000/startups?industry=Technology&min_funding=10000000&limit=20"
curl "http://localhost:8000/analytics?type=by_industry"
curl -X POST "http://localhost:8000/chat" \
-H "Content-Type: application/json" \
-d '{
"query": "Which startups are AI-focused and have raised over $50M?",
"k": 5,
"use_llm": true
}'
startup-scanner/
├── app/
│   ├── api/              # API endpoints
│   │   ├── research.py
│   │   ├── startups.py
│   │   ├── analytics.py
│   │   └── chat.py
│   ├── nlp/              # NLP and embeddings
│   │   ├── embeddings.py
│   │   └── chat.py
│   ├── research/         # Research engine
│   │   ├── runner.py
│   │   └── sources/      # Data source scrapers
│   ├── utils/            # Utilities
│   │   ├── http_client.py
│   │   └── parse_html.py
│   ├── config.py         # Configuration
│   ├── db.py             # Database setup
│   ├── models.py         # SQLAlchemy models
│   ├── schemas.py        # Pydantic schemas
│   └── main.py           # FastAPI app
├── alembic/              # Database migrations
├── docker/               # Docker init scripts
├── tests/                # Test suite
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
└── README.md
- Install dependencies locally (optional, for IDE support):
python -m venv venv
source venv/bin/activate # or `venv\Scripts\activate` on Windows
pip install -r requirements.txt
- Run database only:
docker compose up db -d
- Run app locally:
set -a; source .env; set +a # Export every variable in .env (tolerates comment lines)
uvicorn app.main:app --reload
make help # Show available commands
make build # Build Docker images
make up # Start all services
make down # Stop all services
make logs # View logs
make shell # Open app shell
make db-shell # Open PostgreSQL shell
make migrate # Run migrations
make test # Run tests
make clean # Clean up containers and volumes
- Create a new scraper in `app/research/sources/`:
from typing import Any, Dict, Optional

from app.research.sources.base import BaseScraper


class MySourceScraper(BaseScraper):
    @property
    def source_name(self) -> str:
        return "my_source"

    async def scrape(self, startup_name: str, domain: Optional[str] = None) -> Dict[str, Any]:
        # Implement scraping logic
        return {"field": "value"}
- Add to runner in `app/research/runner.py`:
from app.research.sources.my_source import MySourceScraper

# In the runner class, where self.scrapers is built:
self.scrapers = [
    # ... existing scrapers
    MySourceScraper(self.http_client),
]
- Add API key configuration if needed in `app/config.py`.
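A new scraper can be exercised without network access by handing it a fake HTTP client. The sketch below is self-contained, so it defines its own stand-in `BaseScraper`. That class, the `FakeHttpClient`, and the assumption that the client exposes an async `get(url)` returning text are all hypothetical and may not match `app/research/sources/base.py`.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class BaseScraper(ABC):
    """Stand-in for the project's base class (assumed shape, for illustration)."""

    def __init__(self, http_client: Any) -> None:
        self.http_client = http_client

    @property
    @abstractmethod
    def source_name(self) -> str:
        """Short identifier for this source."""

    @abstractmethod
    async def scrape(self, startup_name: str, domain: Optional[str] = None) -> Dict[str, Any]:
        """Return whatever fields this source can provide."""


class MySourceScraper(BaseScraper):
    @property
    def source_name(self) -> str:
        return "my_source"

    async def scrape(self, startup_name: str, domain: Optional[str] = None) -> Dict[str, Any]:
        if domain is None:
            return {}  # nothing to fetch without a domain
        page = await self.http_client.get(f"https://{domain}/")
        return {"homepage_length": len(page)}


class FakeHttpClient:
    """Test double: returns canned HTML instead of touching the network."""

    async def get(self, url: str) -> str:
        return "<html>hello</html>"


result = asyncio.run(MySourceScraper(FakeHttpClient()).scrape("Stripe", "stripe.com"))
```

The same pattern fits a pytest test: construct the scraper with the fake client, await `scrape`, and assert on the returned dictionary.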
# Run all tests
docker compose exec app pytest
# Run with an HTML coverage report
docker compose exec app pytest --cov=app --cov-report=html
# Test models only
docker compose exec app pytest tests/test_models.py
# Test API endpoints
docker compose exec app pytest tests/test_api.py
# Test utilities
docker compose exec app pytest tests/test_utils.py
- `tests/conftest.py`: Fixtures and test configuration
- `tests/test_models.py`: Database model tests
- `tests/test_api.py`: API endpoint tests
- `tests/test_utils.py`: Utility function tests
- Environment:
  - Set `ENVIRONMENT=production`
  - Use strong passwords
  - Enable HTTPS (reverse proxy like Nginx)
- Database:
  - Use managed PostgreSQL (AWS RDS, Google Cloud SQL)
  - Enable automated backups
  - Create vector index after initial data load
- API Keys:
  - Store securely (AWS Secrets Manager, HashiCorp Vault)
  - Rotate regularly
  - Monitor usage and costs
- Monitoring:
  - Add application performance monitoring (APM)
  - Set up error tracking (Sentry)
  - Monitor rate limits and quotas
- Scaling:
  - Use job queue (Celery + Redis) for async research
  - Scale horizontally with multiple app instances
  - Add caching layer (Redis) for frequently accessed data
# Build production image
docker build -t startup-research-assistant:latest .
# Run with production settings
docker run -d \
--name startup-app \
-e ENVIRONMENT=production \
--env-file .env.production \
-p 8000:8000 \
startup-research-assistant:latest
-- Connect to database
docker compose exec db psql -U startup_user -d startup_research
-- Create HNSW index for fast similarity search
CREATE INDEX idx_startups_embedding_hnsw
ON startups
USING hnsw (embedding vector_cosine_ops);
This project is provided as-is for demonstration and educational purposes.
For questions or feedback, please reach out via the project repository.
Built with ❤️ using FastAPI, PostgreSQL, pgvector, and OpenAI