A Python-based solution for extracting recipe data from PDFs using GPT-5, LangChain, and PyMuPDF.
- Install dependencies:
  pip install -r requirements.txt
- Set up your OpenAI API key:
  - Copy env_example.txt to .env
  - Replace the placeholder with your actual OpenAI API key:
    OPENAI_API_KEY=your_actual_api_key_here
  - Get your API key from: https://platform.openai.com/api-keys
- Run the extraction:
  cd src
  python extract.py
- Check results:
  - JSON files are saved to data/output/
  - Each recipe generates a structured JSON file
recipe-extraction-challenge/
├── data/
│   ├── input/           # PDF files
│   └── output/          # Generated JSON files
├── src/
│   └── extract.py       # Main extraction script
├── schema/
│   └── schema.json      # JSON schema definition
├── prompt/
│   └── recipe_extraction_prompt.txt
├── requirements.txt     # Dependencies
├── env_example.txt      # Environment template
└── README.md            # Documentation
- Python: Fast development with a rich AI / data science ecosystem
- Extract text from PDF using PyMuPDF
  - PyMuPDF (fitz): Fast, reliable PDF text extraction with complex layout handling
    - Benchmarks well across text extraction speed and quality
- Parse recipe text using GPT-5 via LangChain
  - LangChain: Provides error handling, built-in JSON parsing, and easy prompt engineering
    - Orchestration layer for future use (multi-step parsing):
      - e.g. could connect LlamaParse for PDF parsing when PDFs have figures
      - e.g. could connect to a custom multilingual model for handwritten text
      - e.g. could connect to a database of nutritional information
    - Models are swappable via LangChain config (Claude, Gemini, GPT, etc.)
  - GPT-5: Superior understanding of recipe structure and cooking terminology
- Structure output as JSON using LangChain's JsonOutputParser
- Validate against schema and save to file
  - JSON Schema: Ensures consistent output structure and validation
  - Python-dotenv: Secure API key management
- LlamaIndex (future use, could cross reference with an index of similar recipes)
- LlamaParse (future use, could introduce parsing for PDFs with figures)
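The four steps above (extract, parse, structure, validate and save) could be wired together roughly as follows. This is a minimal sketch and not the contents of extract.py: the function names are hypothetical, validation via the `jsonschema` package is an assumption, and the third-party imports are deferred into the functions that need them so the small path helper can be used on its own.

```python
import json
from pathlib import Path


def extract_text(pdf_path: Path) -> str:
    """Step 1: concatenate the text of every page using PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(str(pdf_path)) as doc:
        return "\n".join(page.get_text() for page in doc)


def build_chain(model_name: str = "gpt-5"):
    """Steps 2-3: chat model piped into LangChain's JsonOutputParser."""
    from langchain_core.output_parsers import JsonOutputParser
    from langchain_openai import ChatOpenAI  # reads OPENAI_API_KEY from the environment

    return ChatOpenAI(model=model_name) | JsonOutputParser()


def output_path_for(pdf_path: Path, output_dir: Path) -> Path:
    """Map data/input/<name>.pdf to <output_dir>/<name>.json."""
    return output_dir / f"{pdf_path.stem}.json"


def process_pdf(pdf_path: Path, prompt_text: str, schema: dict, output_dir: Path) -> Path:
    """Run all four steps for one PDF and return the path of the saved JSON file."""
    import jsonschema  # assumption: schema validation is done with this package

    recipe = build_chain().invoke(f"{prompt_text}\n\n{extract_text(pdf_path)}")
    jsonschema.validate(instance=recipe, schema=schema)  # Step 4: raises on mismatch
    out_path = output_path_for(pdf_path, output_dir)
    out_path.write_text(json.dumps(recipe, indent=2), encoding="utf-8")
    return out_path
```

In this sketch, `prompt_text` would be read from prompt/recipe_extraction_prompt.txt and `schema` loaded from schema/schema.json before calling `process_pdf` for each file in data/input/.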
- Vanilla API calls instead of frameworks for RAG
- Tradeoff: fallback logic, less flexibility, more structure (picked because structure is more essential for prepped meals)
- Schema-based prompting: guardrails in place so that the LLM knows what it's working toward
- Future improvement: few-shot prompting, which gives the LLM examples of effective parsing
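Schema-based prompting with optional few-shot examples could look something like the sketch below. The function name `build_prompt`, the prompt wording, and the example format are all hypothetical; the real prompt lives in prompt/recipe_extraction_prompt.txt and may be structured differently.

```python
import json


def build_prompt(schema: dict, recipe_text: str, examples=()) -> str:
    """Embed the JSON schema, then any (text, json) example pairs, then the recipe."""
    parts = [
        "Extract the recipe below into JSON that conforms to this JSON Schema.",
        "Return only the JSON object.",
        "Schema:",
        json.dumps(schema, indent=2),
    ]
    # Few-shot examples: each pair shows an input and the JSON we expect for it.
    for i, (example_text, example_json) in enumerate(examples, start=1):
        parts += [
            f"Example {i} input:",
            example_text,
            f"Example {i} output:",
            json.dumps(example_json, indent=2),
        ]
    parts += ["Recipe text:", recipe_text, "JSON output:"]
    return "\n\n".join(parts)
```

With an empty `examples` tuple this reduces to plain schema-based prompting, so few-shot examples can be added later without changing the call site.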
- Missing data: Intelligently estimated based on cooking methods and component types
- Chef names: Only extracted from explicit mentions (no hallucination)
- Structured Prompts: help make sense of ambiguous / missing data
- JSON Parsing: adheres to schema after multiple iterations
- Support for handwritten or poorly scanned PDFs and unconventional recipe formats
What I wish I'd done differently:
- Leverage LangChain better (e.g. use its PyMuPDFLoader)
- Batch process PDFs to see if it improves processing speed
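One way to try batch processing is a thread pool, since each PDF spends most of its time waiting on the LLM call. This is a sketch only: `process_batch` and its `process_one` callback are hypothetical names, and whether it actually improves throughput would need to be measured (API rate limits may cap the gain).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def process_batch(pdf_paths, process_one, max_workers=4):
    """Call process_one(path) for each PDF concurrently.

    Returns (results, errors): two dicts keyed by path, so one failing PDF
    does not stop the rest of the batch.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, path): path for path in pdf_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[path] = exc
    return results, errors
```

Timing a run with `max_workers=1` against a run with a larger pool (for example using `time.perf_counter()`) would give a direct comparison with the current sequential behavior.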
Given 2 Hours...
- I would do an in-depth validation of JSON Outputs with unit tests. I would also validate allergen outputs against a known allergens list.
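The allergen check could start as small as the sketch below. The `allergens` field name and the contents of `KNOWN_ALLERGENS` are placeholders chosen for illustration; the real field name comes from schema/schema.json and the list should come from whichever allergen standard applies.

```python
# Placeholder list for illustration; replace with the authoritative list in use.
KNOWN_ALLERGENS = frozenset({
    "milk", "eggs", "fish", "shellfish", "tree nuts",
    "peanuts", "wheat", "soybeans", "sesame",
})


def unknown_allergens(recipe: dict, known=KNOWN_ALLERGENS) -> list[str]:
    """Return allergen values that are not on the known list (case-insensitive)."""
    values = recipe.get("allergens") or []  # assumed field name
    return sorted({a for a in values if a.strip().lower() not in known})


def test_flags_only_unrecognized_values():
    """Example unit test, runnable with pytest or called directly."""
    recipe = {"allergens": ["Milk", "gluten", " Sesame "]}
    assert unknown_allergens(recipe) == ["gluten"]
```

A non-empty return value would mark the JSON output for review instead of rejecting it outright, since an unrecognized value may be a synonym and not an error.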
Given 2 Weeks...
- I would fine-tune a custom GPT on a wider variety of recipe data (pairs of recipe data + JSON output) and add a front-end review UI for upload + review + correction.
To Scale...
- I would implement a human-in-the-loop review system, as well as an analytics dashboard to show per-field parsing confidence.
  - Per-field confidence scores could be derived from:
    - Heuristics (e.g. whether portion sizes are explicitly mentioned, mathematically derived, or best guesses)
    - Agent-as-judge: An AI agent can explain how confident it is about certain fields
    - Human review: A human can sanity check fields for confidence by annotating JSON output
  - Proposed tool: LangSmith is an observability and evaluation platform for AI models. It works well with the LangChain ecosystem and lets us avoid the hassle of building and configuring a UI
  - Other key metrics: extraction speed, per-field confidence, overall confidence, error rate
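The heuristic option could be sketched as a lookup from how each field was obtained to a score. The provenance tags, the score values, and the function names below are all assumptions for illustration; in practice the extraction step would have to record a tag per field.

```python
# Hypothetical scores: explicit mention > mathematically derived > best guess.
PROVENANCE_SCORES = {"explicit": 1.0, "derived": 0.7, "estimated": 0.3}


def field_confidence(provenance: dict[str, str]) -> dict[str, float]:
    """Map each field's provenance tag to a score; unknown tags score 0.0."""
    return {field: PROVENANCE_SCORES.get(tag, 0.0) for field, tag in provenance.items()}


def overall_confidence(scores: dict[str, float]) -> float:
    """Unweighted mean of the per-field scores (0.0 when there are no fields)."""
    return sum(scores.values()) / len(scores) if scores else 0.0
```

For example, `field_confidence({"portion_size": "derived"})` gives `{"portion_size": 0.7}`. Fields below a chosen threshold could then be routed to the human review step.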
- Ensure you have a .env file with your API key
- Check that python-dotenv is installed
- A JSON parsing error usually means GPT returned malformed JSON
- Check your internet connection and API key validity
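Two of the checks above can be automated with the standard library alone. This is a sketch with hypothetical helper names, not code from extract.py: the first helper confirms the key is set and is not the template placeholder, and the second recovers a JSON object when the model wraps it in extra text.

```python
import json
import os


def check_api_key(env=os.environ) -> bool:
    """True if OPENAI_API_KEY is set and is not the env_example.txt placeholder."""
    key = env.get("OPENAI_API_KEY", "")
    return bool(key) and key != "your_actual_api_key_here"


def parse_model_json(raw: str) -> dict:
    """Parse the outermost {...} span of the model output.

    Raises ValueError if no object is found or the span is not valid JSON
    (json.JSONDecodeError is a subclass of ValueError).
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start:end + 1])
```

Note that `check_api_key` only tells you the key is present; an invalid or expired key will still fail at request time.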