🍳 Recipe Extraction Challenge

A Python-based solution for extracting recipe data from PDFs using GPT-5, LangChain, and PyMuPDF.

🚀 Running the Project

Install dependencies:
```
pip install -r requirements.txt
```
Set up your OpenAI API key:
- Copy env_example.txt to .env
- Replace the placeholder with your actual OpenAI API key:
```
OPENAI_API_KEY=your_actual_api_key_here
```
- Get your API key from: https://platform.openai.com/api-keys
Run the extraction:
```
cd src
python extract.py
```
Check results:
- JSON files are saved to data/output/
- Each recipe generates a structured JSON file
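
To inspect the results programmatically, a minimal sketch along these lines could load every file in the output directory. It assumes only that each output file contains valid JSON; the helper name `load_recipes` is hypothetical and not part of the project, and the default path assumes the script is run from the repository root.

```python
import json
from pathlib import Path


def load_recipes(output_dir="data/output"):
    """Return {file name: parsed JSON} for every .json file in output_dir."""
    recipes = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            recipes[path.name] = json.load(f)
    return recipes
```

Calling `load_recipes()` after a run gives a dictionary keyed by file name, which is a convenient starting point for spot checks.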

📁 Project Structure

recipe-extraction-challenge/
├── data/
│   ├── input/          # PDF files
│   └── output/         # Generated JSON files
├── src/
│   └── extract.py      # Main extraction script
├── schema/
│   └── schema.json     # JSON schema definition
├── prompts/
│   └── recipe_extraction_prompt.txt
├── requirements.txt     # Dependencies
├── env_example.txt     # Environment template
└── README.md          # Documentation

🛠️ Tool Selection & Flow

Python

Fast development with a rich AI and data science ecosystem

Processing Flow

Extract text from PDF using PyMuPDF
- PyMuPDF (fitz): Fast, reliable PDF text extraction with complex layout handling
  - Benchmarks well across text extraction speed and quality
Parse recipe text using GPT-5 via LangChain
- LangChain: Provides error handling, built-in JSON parsing, and easy prompt engineering
  - Orchestration layer for future multi-step parsing:
    - e.g. could connect LlamaParse for PDFs that contain figures
    - e.g. could connect a custom multilingual model for handwritten text
    - e.g. could connect a database of nutritional information
  - Models are swappable via LangChain config (Claude, Gemini, GPT, etc.)
- GPT-5: Strong understanding of recipe structure and cooking terminology
Structure output as JSON using LangChain's JsonOutputParser
Validate against schema and save to file
- JSON Schema: Ensures consistent output structure and validation
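
The final step of the flow, validating and saving, can be sketched with the standard library alone. This is a deliberately simplified stand-in that checks only the schema's top-level `required` keys and basic `type` keywords; a full validator such as the `jsonschema` package would cover the rest of the JSON Schema specification. The function names are hypothetical, and nothing here is taken from the project's actual `extract.py`.

```python
import json
from pathlib import Path

# Maps JSON Schema type names to Python types (top-level checks only).
# Note: Python treats bool as a subclass of int; this sketch ignores that.
_JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check_top_level(recipe, schema):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for key in schema.get("required", []):
        if key not in recipe:
            problems.append(f"missing required field: {key}")
    for key, rules in schema.get("properties", {}).items():
        type_name = rules.get("type")
        # Only simple string type names are checked; unions are skipped.
        expected = _JSON_TYPES.get(type_name) if isinstance(type_name, str) else None
        if key in recipe and expected and not isinstance(recipe[key], expected):
            problems.append(f"{key}: expected {type_name}")
    return problems


def save_if_valid(recipe, schema, out_path):
    """Write recipe as JSON to out_path, or raise ValueError listing the problems."""
    problems = check_top_level(recipe, schema)
    if problems:
        raise ValueError("; ".join(problems))
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(
        json.dumps(recipe, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return out_path
```

Raising before anything is written keeps invalid files out of data/output/, so downstream consumers only ever see output that passed the checks.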

Data Management

python-dotenv: Keeps the API key in a local .env file instead of in source code
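
As an illustration of this setup, the sketch below loads `.env` when python-dotenv is installed and fails early with a readable message when the key is missing. The helper names are hypothetical, and the placeholder string is assumed from the setup snippet in this README rather than read from env_example.txt.

```python
import os

# Placeholder value assumed from the README's setup snippet.
PLACEHOLDER = "your_actual_api_key_here"


def load_dotenv_if_available():
    """Load .env via python-dotenv when installed; return True if it ran."""
    try:
        from dotenv import load_dotenv
    except ImportError:
        return False  # fall back to variables already exported in the shell
    load_dotenv()
    return True


def get_api_key(env=None):
    """Return the OpenAI key from env, or raise with a hint on how to set it."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Copy env_example.txt to .env and add your key."
        )
    return key
```

Calling `load_dotenv_if_available()` and then `get_api_key()` at startup surfaces a missing key before any PDF is processed.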

Other Tools Considered

LlamaIndex (future use, could cross reference with an index of similar recipes)
LlamaParse (future use, could introduce parsing for PDFs with figures)
Vanilla API calls instead of frameworks for RAG

Prompting Approach

Tradeoff: the prompt favors structure (with fallback logic) over flexibility, because consistent structure matters more for prepped meals
Schema-based prompting: guardrails are in place so that the LLM knows what it's working toward
Future improvement: few-shot prompting, which gives the LLM examples of effective parsing
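
One way schema-based prompting with optional few-shot examples could be assembled is sketched below. The wording of the instructions and the function name `build_prompt` are assumptions for illustration; the project's real prompt lives in its prompt file and may look quite different.

```python
import json


def build_prompt(recipe_text, schema, examples=()):
    """Assemble an extraction prompt: instructions, schema, optional examples, input."""
    parts = [
        "Extract the recipe below into JSON that conforms to this schema.",
        "Schema:",
        json.dumps(schema, indent=2),
    ]
    # Each example is a (source text, expected JSON) pair shown before the real input.
    for i, (sample_text, sample_json) in enumerate(examples, start=1):
        parts += [
            f"Example {i} input:",
            sample_text,
            f"Example {i} output:",
            json.dumps(sample_json),
        ]
    parts += ["Recipe text:", recipe_text, "JSON:"]
    return "\n\n".join(parts)
```

With an empty `examples` argument this reduces to plain schema-based prompting, so few-shot examples can be added later without changing the call sites.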

🔧 Assumptions & Accuracy

Assumptions Made

Missing data: Intelligently estimated based on cooking methods and component types
Chef names: Only extracted from explicit mentions (no hallucination)
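
The chef-name assumption could also be enforced after the model responds, as in the sketch below: the name is kept only if it literally appears in the extracted PDF text. The field name `chef_name` and the helper names are hypothetical, and a simple substring match like this would miss abbreviations or reordered names.

```python
import re


def _normalize(text):
    """Lowercase and collapse whitespace so PDF line breaks do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def keep_explicit_chef(recipe, source_text, field="chef_name"):
    """Return a copy of recipe whose chef field is None unless found in source_text."""
    cleaned = dict(recipe)
    name = cleaned.get(field)
    explicit = (
        isinstance(name, str)
        and name.strip() != ""
        and _normalize(name) in _normalize(source_text)
    )
    if not explicit:
        cleaned[field] = None
    return cleaned
```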

What Works Well

Structured Prompts: help make sense of ambiguous / missing data
JSON Parsing: adheres to schema after multiple iterations

What Needs Improvement

Support for handwritten or poorly scanned PDFs and unconventional recipe formats

🚀 Future Improvements

What I wish I'd done differently:

Leverage LangChain better (e.g. use its PyMuPDFLoader)
Batch process PDFs to see if it improves processing speed
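
A batch run might look like the following sketch, which uses a thread pool because the LLM call is network-bound. Whether it actually improves speed would depend on API rate limits, so this is an experiment to measure rather than a guaranteed win. `process_one` is a hypothetical stand-in for the project's per-file extraction function.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def process_batch(pdf_paths, process_one, max_workers=4):
    """Run process_one on each path concurrently.

    Returns (results, errors), both keyed by file name.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, p): Path(p).name for p in pdf_paths}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # one bad PDF should not stop the batch
                errors[name] = str(exc)
    return results, errors
```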

Given 2 Hours...

I would do an in-depth validation of JSON Outputs with unit tests. I would also validate allergen outputs against a known allergens list.
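
A first pass at that allergen check, with unit tests, could look like the sketch below. The allergen list is illustrative only and the `allergens` field name is an assumption about the schema; a real check would load an authoritative list.

```python
import unittest

# Illustrative list only; not an authoritative allergen reference.
KNOWN_ALLERGENS = {
    "milk", "eggs", "fish", "shellfish", "tree nuts",
    "peanuts", "wheat", "soy", "sesame",
}


def unknown_allergens(recipe, known=KNOWN_ALLERGENS, field="allergens"):
    """Return the recipe's allergen values that are not in the known list, sorted."""
    values = recipe.get(field) or []
    return sorted({str(v).strip().lower() for v in values} - set(known))


class AllergenTests(unittest.TestCase):
    def test_flags_unknown_value(self):
        recipe = {"allergens": ["Milk", "dragonfruit"]}
        self.assertEqual(unknown_allergens(recipe), ["dragonfruit"])

    def test_missing_field_is_not_an_error(self):
        self.assertEqual(unknown_allergens({}), [])
```

Saved as a test module, these cases can be run with `python -m unittest`.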

Given 2 Weeks...

I would fine-tune a custom GPT on a wider variety of recipe data (pairs of recipe data + JSON output) and add a front-end review UI for upload + review + correction.

To Scale...

I would implement a human-in-the-loop review system, as well as an analytics dashboard to show per-field parsing confidence.
- Per-Field Confidence Scores could be derived from...
  1. Heuristics: (e.g. whether portion sizes are explicitly mentioned, mathematically derived, or best guesses)
  2. Agent-as-judge: An AI Agent can explain how confident it is about certain fields
  3. Human review: A human can sanity check fields for confidence by annotating JSON output
- Proposed Tool: LangSmith is an observability and evaluation platform for AI models. It works well with the LangChain ecosystem and lets us avoid the hassle of building and configuring a UI
- Other key metrics: extraction speed, per-field confidence, overall confidence, error rate
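
The heuristic option could be sketched as follows: each field carries a provenance label, and labels map to scores. The labels, the numeric scores, and the unweighted mean are all hypothetical choices for illustration and would need tuning against human-reviewed outputs.

```python
from statistics import mean

# Hypothetical scores per provenance label; tune against reviewed data.
PROVENANCE_SCORE = {"explicit": 1.0, "derived": 0.7, "guess": 0.3}


def field_confidence(provenance):
    """Map {field: provenance label} to {field: score}; unknown labels score 0.0."""
    return {
        field: PROVENANCE_SCORE.get(label, 0.0)
        for field, label in provenance.items()
    }


def overall_confidence(scores):
    """Unweighted mean of the per-field scores; 0.0 when there are no fields."""
    return mean(scores.values()) if scores else 0.0
```

Per-field scores like these could feed the dashboard directly, with low-scoring fields routed to human review first.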
