Document Processing Pipeline

This repository contains scripts for downloading, processing, and analyzing documents using vector embeddings.

🚀 Getting started

Prerequisites

Python 3.11
pip (Python package installer)

Clone this repository:

```
git clone https://github.com/chrishonselaar/inzicht-document-collection
cd inzicht-document-collection
```

Create and activate a virtual environment (optional but recommended):

```
python -m venv venv
source venv/bin/activate # On Windows use: venv\Scripts\activate
```

Install Dependencies
```
pip install -r requirements.txt
```
Set up Milvus
- Create and start the Milvus container using the Docker Compose file:
```
docker-compose -f milvus/milvus-standalone-docker-compose.yml up -d
```
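
Before moving on, it can help to confirm that the container is accepting connections. The following is a minimal sketch, assuming Milvus is exposed on its default port 19530 on localhost; the compose file in this repository may map a different port, and `is_port_open` is a hypothetical helper, not part of the repository.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example (assumes the default Milvus port; adjust to match the compose file):
# print("Milvus reachable:", is_port_open("localhost", 19530))
```

An open port only shows that something is listening; it does not prove that Milvus has finished starting up.
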
Configure Environment
- Copy .env.example to .env
- Fill in your API keys in .env
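
A quick way to spot keys that were not filled in is to compare .env against .env.example. This is a minimal sketch, assuming both files use plain `KEY=VALUE` lines; `read_env_file` and `missing_keys` are hypothetical helpers, and the scripts in this repository may load their configuration differently.

```python
from pathlib import Path


def read_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(env: dict[str, str], required: list[str]) -> list[str]:
    """Return the required keys that are absent or empty in env."""
    return [key for key in required if not env.get(key)]


# Example: treat every key named in .env.example as required.
# required = list(read_env_file(".env.example"))
# print("Missing or empty:", missing_keys(read_env_file(".env"), required))
```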

📥 Document Collection

Groningen Documents

Run main document collection

```
python bulk-downloads/groningen/scrape-and-download-groningen-main.py
```

Collect additional council reports (Optional)
```
python bulk-downloads/groningen/download-raadsverslagen.py
```
Note: These reports should already be included in the main collection. This script is an alternative way to collect the council reports that are published through the calendar.

Amsterdam Documents

Scrape document URLs
```
python bulk-downloads/amsterdam/scrape_notubiz.py
```
This stores the scraped URLs in document_urls-notubiz.txt.

Download documents

```
python bulk-downloads/amsterdam/download_documents.py
```
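
To inspect the scraped URL list before or after a download run, something like the following may be useful. It is a sketch under the assumption that document_urls-notubiz.txt holds one URL per line, which the source does not state; `load_urls` and `local_name` are hypothetical helpers and do not describe how download_documents.py actually works.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def load_urls(path: str) -> list[str]:
    """Read one URL per line, dropping blanks and duplicates while keeping order."""
    seen: set[str] = set()
    urls: list[str] = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def local_name(url: str, index: int) -> str:
    """Derive a file name from the URL path, falling back to a numbered name."""
    name = PurePosixPath(urlparse(url).path).name
    return name or f"document-{index}"


# Example:
# for i, url in enumerate(load_urls("document_urls-notubiz.txt")):
#     print(local_name(url, i), url)
```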

📝 Text Extraction

Extract text from HTML files

```
python extract-text/html-text-extractor.py
```

Extract text from PDF files

```
python extract-text/pdf-text-extractor.py
```

Process images in PDFs
```
python pdf-ocr/process_empty_files.py
```
Note: This handles images in PDF files that the PDF text extractor did not process.
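
To see which documents are likely candidates for this step, one option is to list the text outputs that came out empty. This is only a sketch: it assumes the extractor writes one .txt file per document into a directory you pass in, and `find_empty_text_files` is a hypothetical helper, not the logic used by process_empty_files.py.

```python
from pathlib import Path


def find_empty_text_files(text_dir: str, min_chars: int = 1) -> list[Path]:
    """Return .txt files whose stripped content is shorter than min_chars."""
    empty: list[Path] = []
    for path in sorted(Path(text_dir).rglob("*.txt")):
        content = path.read_text(encoding="utf-8", errors="ignore").strip()
        if len(content) < min_chars:
            empty.append(path)
    return empty


# Example (the directory name is a placeholder):
# for path in find_empty_text_files("extracted-text"):
#     print(path)
```

Raising `min_chars` also catches files that contain only a few stray characters, which can happen with scanned PDFs.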

🔍 Vector Embeddings

Generate embeddings
```
python embedding/generate_voyage_all.py
```
This uses voyage-3-large to generate vector embeddings for all documents.

Verify embeddings
```
python vector-queries/process_queries_voy_alldocs.py
```
This confirms that the embeddings were generated and indexed correctly.

Import to Milvus
```
python milvus/import_embeddings.py
```
Verify index
```
python milvus/query_embeddings.py
```
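
As a rough illustration of what a verification query checks, the sketch below ranks a handful of vectors against a query vector by brute force. It assumes cosine similarity as the metric, which the source does not state, and the function names are hypothetical; it is not the code in query_embeddings.py and does not talk to Milvus.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy example with 2-dimensional vectors; real embeddings have far more dimensions.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.0], toy_vectors, k=2))
```

Comparing a brute-force ranking like this on a small sample with the results returned by the index is one way to catch import or indexing mistakes.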
