Document Processing Pipeline

This repository contains scripts for downloading, processing, and analyzing documents using vector embeddings.

🚀 Getting started

Prerequisites

  • Python 3.11
  • pip (Python package installer)
  1. Clone this repository

    git clone https://github.com/chrishonselaar/inzicht-document-collection
    cd inzicht-document-collection
  2. Create and activate a virtual environment (optional but recommended)

    python -m venv venv
    source venv/bin/activate # On Windows use: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Set up Milvus

    • Create and run the Milvus container using the Docker Compose file:
      docker-compose -f milvus/milvus-standalone-docker-compose.yml up -d
  5. Configure the environment

    • Copy .env.example to .env
    • Fill in your API keys in .env
📥 Document Collection

Groningen Documents

  1. Run main document collection

    python bulk-downloads/groningen/scrape-and-download-groningen-main.py
  2. Collect additional council reports (Optional)

    python bulk-downloads/groningen/download-raadsverslagen.py

    Note: These should already be included in the main collection, but this provides an alternative approach specifically for council reports published through the calendar.

Amsterdam Documents

  1. Scrape document URLs

    python bulk-downloads/amsterdam/scrape_notubiz.py

    This stores the scraped document URLs in document_urls-notubiz.txt.

  2. Download documents

    python bulk-downloads/amsterdam/download_documents.py
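
The two-step flow above (collect URLs first, download afterwards) makes the download step easy to resume. As a rough sketch, assuming the URL file holds one URL per line and that each document is saved under the last path segment of its URL, the pending work could be computed as follows. The function names are hypothetical and the actual logic in download_documents.py may differ.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse


def read_url_list(path):
    """Return unique, non-empty URLs from a text file, preserving order."""
    seen = set()
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def target_name(url, fallback="document"):
    """Derive a local file name from the last segment of the URL path."""
    name = unquote(PurePosixPath(urlparse(url).path).name)
    return name or fallback


def pending_downloads(urls, out_dir):
    """List (url, destination) pairs for files not yet present in out_dir."""
    out = Path(out_dir)
    pending = []
    for url in urls:
        destination = out / target_name(url)
        if not destination.exists():
            pending.append((url, destination))
    return pending
```

Skipping files that already exist means an interrupted run can be restarted without downloading everything again.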

📝 Text Extraction

  1. Extract text from HTML files

    python extract-text/html-text-extractor.py
  2. Extract text from PDF files

    python extract-text/pdf-text-extractor.py
  3. Process images in PDFs

    python pdf-ocr/process_empty_files.py

    Note: This handles images in PDF files that the PDF text extractor did not process.
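
One way to decide which PDFs need the third step is to look for extraction results that came back empty. The sketch below assumes each PDF has a matching .txt file with the same stem in a separate text directory; that layout and the function name pdfs_needing_ocr are assumptions for illustration, not a description of what process_empty_files.py actually does.

```python
from pathlib import Path


def pdfs_needing_ocr(pdf_dir, text_dir, min_chars=1):
    """Return PDFs whose extracted text file is missing or effectively empty."""
    needs_ocr = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        txt = Path(text_dir) / (pdf.stem + ".txt")
        # A missing text file is treated the same as an empty one.
        text = txt.read_text(encoding="utf-8", errors="ignore") if txt.exists() else ""
        if len(text.strip()) < min_chars:
            needs_ocr.append(pdf)
    return needs_ocr
```

Raising min_chars would also catch PDFs where the extractor recovered only a few stray characters from an otherwise scanned document.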

🔍 Vector Embeddings

  1. Generate embeddings

    python embedding/generate_voyage_all.py

    This uses the voyage-3-large model to generate vector embeddings for all documents.

  2. Verify embeddings

    python vector-queries/process_queries_voy_alldocs.py

    This confirms that the embeddings were generated and indexed correctly.

  3. Import to Milvus

    python milvus/import_embeddings.py
  4. Verify index

    python milvus/query_embeddings.py
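
The verification steps above amount to running a query vector against the stored embeddings and checking that sensible documents come back. A brute-force version of that check can be sketched in plain Python as shown below. Cosine similarity is used here as an assumption, since the metric configured in the Milvus import is not stated in this README, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, embeddings, k=3):
    """Return the k (doc_id, score) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in embeddings.items()]
    # Highest similarity first.
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]
```

Comparing the results of such an exhaustive search on a small sample with what the Milvus index returns for the same query is a simple way to spot import or indexing problems.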
