This repository contains the solution for the Adobe India Hackathon 2025, including the persona-driven document intelligence system for Challenge 1b.
This solution is a Python-based application that analyzes a collection of PDF documents and extracts the most relevant sections based on a given persona and job-to-be-done.
- Persona-Based Analysis: Ranks document sections based on their relevance to a user's persona and task.
- Extractive Summarization: Provides concise summaries of the most important sections.
- Dockerized Solution: The application is containerized for optional deployment and execution.
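To make the idea of persona-based ranking concrete, here is a minimal sketch that scores sections against a persona and task using bag-of-words cosine similarity. This is an illustration only and is not taken from src/ranking.py, whose actual method may differ; the function names and the sample sections are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sections(sections, persona, job):
    """Return (score, title) pairs sorted from most to least relevant.

    `sections` is a list of (title, text) tuples. The persona and the
    job-to-be-done are concatenated into a single query (an assumption
    made for this sketch).
    """
    query = Counter(tokenize(f"{persona} {job}"))
    scored = [
        (cosine_similarity(query, Counter(tokenize(f"{title} {text}"))), title)
        for title, text in sections
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical sections, for illustration only.
    example_sections = [
        ("Local history", "the town was founded in the twelfth century"),
        ("Budget hotels", "cheap hotels and hostels for a group trip"),
    ]
    ranked = rank_sections(
        example_sections,
        persona="travel planner",
        job="plan a group trip with cheap hotels",
    )
    for score, title in ranked:
        print(f"{score:.3f}  {title}")
```

A plain term-overlap score like this ignores synonyms and word importance, so it is best read as a baseline for understanding the feature rather than a description of how the application ranks sections.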
.
├── Dockerfile
├── README.md
├── approach_explanation.md
├── requirements.txt
├── src/
│   ├── main.py
│   ├── pdf_parser.py
│   ├── ranking.py
│   └── summarizer.py
└── Challenge_1b/
    └── [Collection Name]/
        ├── PDFs/
        │   ├── document1.pdf
        │   └── ...
        └── challenge1b_input.json
You can run the solution with or without Docker.
This solution is fully functional on CPU and does not require an internet connection once dependencies are installed.
- Install Dependencies (Manually):
  Important: Installing spacy inside Docker can be time-consuming. To avoid long build times, it is highly recommended to install all packages separately on your system before execution.
  Run the following command in the root directory:
  pip install -r requirements.txt
- Download Required spaCy Model (only once):
  If you're online during setup, run:
  python -m spacy download en_core_web_sm
  💡 Skip this if the model is already installed. The download needs an internet connection, so complete it before running offline.
- Prepare Input Data:
  - Inside the Challenge_1b directory, create a new folder for your collection.
  - Add a PDFs/ subdirectory with all your input PDFs.
  - Add a challenge1b_input.json file at the root of that collection folder (use examples as reference).
- Run the Application:
  From the root project directory:
  python src/main.py
  This will process all collections under the Challenge_1b directory and generate a corresponding challenge1b_output.json file inside each collection folder.
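The steps above can be checked with a small preflight script. The sketch below is not part of the repository; its function names are hypothetical. It assumes the spaCy model is installed as a regular Python package (which is how `python -m spacy download` installs it) and that collections follow the layout described above. It makes no assumption about the fields inside challenge1b_input.json and only checks that the file parses as JSON.

```python
import importlib.util
import json
from pathlib import Path


def is_model_installed(package_name="en_core_web_sm"):
    """Return True if the named package can be found without importing it."""
    return importlib.util.find_spec(package_name) is not None


def check_collection(collection_dir):
    """Return a list of problems found in one collection folder (empty if none)."""
    collection_dir = Path(collection_dir)
    problems = []

    pdf_dir = collection_dir / "PDFs"
    # Note: the "*.pdf" pattern is case-sensitive on most Linux filesystems.
    if not pdf_dir.is_dir() or not any(pdf_dir.glob("*.pdf")):
        problems.append("PDFs/ is missing or contains no .pdf files")

    input_file = collection_dir / "challenge1b_input.json"
    if not input_file.is_file():
        problems.append("challenge1b_input.json is missing")
    else:
        try:
            json.loads(input_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"challenge1b_input.json is not valid JSON: {exc}")
    return problems


def report(root="Challenge_1b"):
    """Map each collection name to its problems and whether an output file exists."""
    root = Path(root)
    results = {}
    if not root.is_dir():
        return results
    for collection in sorted(p for p in root.iterdir() if p.is_dir()):
        results[collection.name] = {
            "problems": check_collection(collection),
            "has_output": (collection / "challenge1b_output.json").is_file(),
        }
    return results


if __name__ == "__main__":
    print("spaCy model installed:", is_model_installed())
    for name, status in report().items():
        print(name, status)
```

Run it from the root project directory before `python src/main.py` to catch layout problems, and again afterwards to see which collections received a challenge1b_output.json file.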
- Build Docker Image:
  ⚠️ Note: Building the Docker image may take time due to the spaCy installation.
  docker build --platform linux/amd64 -t document-intelligence .
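- Run the Docker Container:
  docker run --rm -v "%cd%/Challenge_1b":/app/Challenge_1b document-intelligence
  The %cd% variable is Windows Command Prompt syntax for the current directory; on Linux or macOS, use $(pwd) in its place.
  This will process all collections and write the output files into the same mounted Challenge_1b directory.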
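Because the volume path syntax differs between shells, one option is to build the `docker run` command from Python, which resolves the current directory the same way on every platform. The helper below is a hypothetical sketch, not part of the repository; it assumes the image was built with the `document-intelligence` tag shown above and that Docker is available on the PATH.

```python
import subprocess
from pathlib import Path


def build_docker_run_command(project_root=None, image="document-intelligence"):
    """Build the docker run argument list with an absolute host path for the mount."""
    root = Path(project_root) if project_root else Path.cwd()
    host_dir = (root / "Challenge_1b").resolve()
    # Mount the host's Challenge_1b folder at /app/Challenge_1b in the container.
    return ["docker", "run", "--rm", "-v", f"{host_dir}:/app/Challenge_1b", image]


if __name__ == "__main__":
    command = build_docker_run_command()
    print("Running:", " ".join(command))
    # check=True raises CalledProcessError if the container exits with an error.
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting issues when the project path contains spaces.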
Refer to approach_explanation.md for a full explanation of the methodology, design, and component structure.