This repository contains scripts for downloading, processing, and analyzing documents using vector embeddings.
- Python 3.11
- pip (Python package installer)
- Clone this repository:
git clone https://github.com/chrishonselaar/inzicht-document-collection
cd inzicht-document-collection
- Create and activate a virtual environment (optional but recommended):
python -m venv venv
source venv/bin/activate # On Windows use: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Set up Milvus:
  - Create and run the Milvus container using the Docker Compose file:
docker-compose -f milvus/milvus-standalone-docker-compose.yml up -d
- Configure the environment:
  - Copy .env.example to .env
  - Fill in your API keys in .env
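Before running the scripts, it can help to check that no key in .env was left blank. The following is a minimal sketch, assuming .env uses plain KEY=VALUE lines; load_env and missing_keys are hypothetical helpers, not part of this repository, and VOYAGE_API_KEY is only an example name (use the names listed in .env.example).

```python
from pathlib import Path


def load_env(path=".env"):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and comments."""
    values = {}
    env_file = Path(path)
    if not env_file.exists():
        return values
    for raw_line in env_file.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def missing_keys(values, required):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


# Example (hypothetical key name):
#   missing_keys(load_env(".env"), ["VOYAGE_API_KEY"])
```

An empty list means every required key has a value.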
- Run the main document collection (Groningen):
python bulk-downloads/groningen/scrape-and-download-groningen-main.py
- Collect additional council reports (optional):
python bulk-downloads/groningen/download-raadsverslagen.py
Note: These should already be included in the main collection, but this provides an alternative approach specifically for council reports published through the calendar.
- Scrape document URLs (Amsterdam):
python bulk-downloads/amsterdam/scrape_notubiz.py
This stores the URLs in document_urls-notubiz.txt.
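To inspect the scraped list before downloading, a short sketch like the one below can count and de-duplicate the entries. It assumes the file holds one URL per line, which is not confirmed by the scripts themselves; read_urls is a hypothetical helper.

```python
from pathlib import Path


def read_urls(path="document_urls-notubiz.txt"):
    """Return unique, non-empty URLs from the file, in first-seen order."""
    seen = set()
    urls = []
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        url = raw_line.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Example:
#   urls = read_urls()
#   print(len(urls), "unique URLs")
```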
- Download the documents:
python bulk-downloads/amsterdam/download_documents.py
- Extract text from HTML files:
python extract-text/html-text-extractor.py
- Extract text from PDF files:
python extract-text/pdf-text-extractor.py
- Process images in PDFs:
python pdf-ocr/process_empty_files.py
Note: This handles images in PDF files that the PDF text extractor did not process.
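To see which documents are likely candidates for this step, one option is to list extracted text files that are empty or nearly empty. This is only a sketch under assumptions: it supposes the extractor writes one .txt file per PDF into a directory, and both find_empty_text_files and the min_chars threshold are hypothetical, not taken from process_empty_files.py.

```python
from pathlib import Path


def find_empty_text_files(text_dir, min_chars=20):
    """Return .txt files whose stripped content is shorter than min_chars."""
    empty = []
    for txt_file in sorted(Path(text_dir).rglob("*.txt")):
        content = txt_file.read_text(encoding="utf-8", errors="ignore")
        if len(content.strip()) < min_chars:
            empty.append(txt_file)
    return empty
```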
- Generate embeddings:
python embedding/generate_voyage_all.py
This uses the voyage-3-large model to generate vector embeddings for all documents.
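Embedding a large collection usually means sending texts to the API in batches. The sketch below shows that pattern with the API call passed in as a function, so it does not claim to mirror generate_voyage_all.py; embed_all, batched, and the batch_size default are assumptions for illustration.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(texts, embed_fn, batch_size=64):
    """Embed texts batch by batch; embed_fn maps a list of texts to a list of vectors."""
    vectors = []
    for batch in batched(texts, batch_size):
        batch_vectors = embed_fn(batch)
        if len(batch_vectors) != len(batch):
            raise ValueError("embed_fn returned a different number of vectors than texts")
        vectors.extend(batch_vectors)
    return vectors


# With the voyageai package, embed_fn could look like this (an assumption, untested here):
#   import voyageai
#   client = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
#   embed_fn = lambda batch: client.embed(
#       batch, model="voyage-3-large", input_type="document").embeddings
```

Passing the embedding call in as a parameter keeps the batching logic testable without an API key.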
- Verify embeddings:
python vector-queries/process_queries_voy_alldocs.py
This confirms that the embeddings were generated and indexed correctly.
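A simple way to sanity-check embeddings by hand is to rank documents by cosine similarity to a query vector and see whether the top hits look relevant. The following sketch illustrates that idea in plain Python; it is not the logic of process_queries_voy_alldocs.py, and cosine_similarity and top_k are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, documents, k=3):
    """Return the k (doc_id, score) pairs most similar to query.

    documents maps a document id to its embedding vector.
    """
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in documents.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```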
- Import to Milvus:
python milvus/import_embeddings.py
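For orientation, an import into Milvus can be sketched with the pymilvus MilvusClient as shown below. This is an illustration under assumptions rather than what import_embeddings.py does: the collection name "documents", the "source" field, and the helpers build_rows and import_rows are hypothetical, and the URI assumes a standalone Milvus listening on the default port 19530.

```python
def build_rows(doc_ids, vectors, sources):
    """Combine ids, vectors, and source names into row dicts for insertion."""
    if not (len(doc_ids) == len(vectors) == len(sources)):
        raise ValueError("doc_ids, vectors, and sources must have the same length")
    return [
        {"id": doc_id, "vector": list(vector), "source": source}
        for doc_id, vector, source in zip(doc_ids, vectors, sources)
    ]


def import_rows(rows, collection_name="documents", uri="http://localhost:19530"):
    """Insert rows into Milvus, creating the collection on first use."""
    # Imported lazily so build_rows can be used without pymilvus installed.
    from pymilvus import MilvusClient

    client = MilvusClient(uri=uri)
    if not client.has_collection(collection_name):
        # Quick-setup collection: integer "id" primary key plus a "vector" field.
        client.create_collection(
            collection_name=collection_name,
            dimension=len(rows[0]["vector"]),
        )
    return client.insert(collection_name=collection_name, data=rows)
```

The vector dimension is taken from the first row, so it follows whatever the embedding step produced.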
- Verify the index:
python milvus/query_embeddings.py