FoodLMEval is a multi-aspect benchmark comprising over 4,000 questions designed to evaluate the holistic food literacy of large language models across five key aspects: Nutritional Literacy, Functional Health Claims, Food Safety, Cooking Procedure Integrity, and Culinary Reasoning.
Grounded in authoritative sources like the USDA and EFSA, the dataset utilizes diverse formats—including ranking and troubleshooting—to distinguish a model's genuine scientific and procedural understanding from simple pattern matching or lucky guessing.
- 🖼️ Framework Overview
- 📂 Data Availability & Setup
- 🏗️ Repository Structure & Usage Guide
- 📈 Cross-Aspect Correlation Analysis
- 🔍 Case Study: Judge Consensus and Discrepancy Analysis
- 📢 News
## 📂 Data Availability & Setup

The full dataset and auxiliary files for FoodLMEval are hosted on Figshare. Download Data Here
To run the evaluation, you must download specific files from the link above and place them into the corresponding folders in this repository. See the detailed instructions below for each aspect.
## 🏗️ Repository Structure & Usage Guide

This repository is organized into five main folders, each corresponding to a distinct aspect of food literacy described in our paper.
### Aspect 1: Nutritional Literacy (`a1_nutritional_literacy/`)

This aspect tests the model's ability to recall nutrient density and rank foods based on nutritional content.

Setup Instructions:

- Download the `nutrient_data` folder (contains raw files with food names and nutrient density lists used to build the QA).
- Download `nutrients_task_qa.csv` (the final QA benchmark dataset).
- Place both the `nutrient_data` folder and `nutrients_task_qa.csv` inside `a1_nutritional_literacy/`.

Code Usage:

- Inference: Use `infer_qwen.py`, `infer_flan.py`, or `infer_llama.py` to generate answers (see the sketch after this list). To switch model sizes, modify the `MODEL_ID` variable inside the script.
- Evaluation: Use `eval_qwen.py`, `eval_flan.py`, or `eval_llama.py`. These scripts contain the answer-cleaning logic and metric calculations.
- Robustness: `shuffle_mc.ipynb` shuffles the multiple-choice options to guard against positional bias.
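For orientation, the inference scripts follow the standard `transformers` generation pattern sketched below. This is a hypothetical reconstruction for the causal-LM models (Qwen/Llama); the FLAN-T5 scripts would use a seq2seq model class instead, and the column names and output path here are assumptions:

```python
# Minimal sketch of the inference loop (hypothetical; the actual scripts may differ).
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"  # swap this string to change model size

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

df = pd.read_csv("a1_nutritional_literacy/nutrients_task_qa.csv")
predictions = []
for question in df["question"]:  # column name is an assumption
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=128, do_sample=False)
    predictions.append(
        tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
    )

df["prediction"] = predictions
df.to_csv("a1_predictions.csv", index=False)
```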
### Aspect 2: Functional Health Claims (`a2_functional_health/`)

Evaluates the understanding of health claims based on the Food Health Claims Knowledge Graph.

Setup Instructions:

- Download `balanced_health_QA.csv` (the final QA dataset).
- Place it inside `a2_functional_health/`.

Code Usage:

- Generation Logic: `gpt_prompt.py` contains the logic used to create the dataset.
- Inference: Run `infer_health_qwen.py`, `infer_health_llama.py`, or `infer_health_flan.py`.
- Evaluation: Run `evaluate_health.py` after inference to compute metrics.
- Robustness: `shuffle_mc.py` handles MCQ shuffling (see the sketch after this list).
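The core idea behind the MCQ shuffling is small enough to sketch; the schema below (a list of options plus a gold index) is an assumption about how the scripts represent questions:

```python
# Hypothetical sketch: shuffle MCQ options and remap the gold answer,
# so a model cannot score well by always picking the same position.
import random

def shuffle_mcq(options: list[str], answer_idx: int, seed: int = 0):
    """Return shuffled options and the new index of the correct answer."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    new_answer_idx = order.index(answer_idx)
    return shuffled, new_answer_idx

options = ["Spinach", "Butter", "White rice", "Soda"]
shuffled, gold = shuffle_mcq(options, answer_idx=0, seed=42)
print(shuffled, "->", "ABCD"[gold])
```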
### Aspect 3: Food Safety (`a3_food_safety/`)

Assesses adherence to food safety guidelines using a dataset created via GPT generation and manual filtering.

Setup Instructions:

- Download `food_safety.jsonl` (the final QA dataset).
- Place it inside `a3_food_safety/`.

Code Usage:

- Inference: Run `infer_flan.py`, `infer_llama.py`, or `infer_qwen.py`. Ensure the `food_safety.jsonl` file is present (see the loading sketch after this list).
- Generation: `gpt_qa.py` contains the code used to generate the initial QA pairs before manual filtering.
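A minimal sketch of loading the JSONL benchmark and normalizing a free-text answer to the Yes/No label used for scoring; the field names and cleaning rules here are assumptions:

```python
# Hypothetical sketch: load the JSONL benchmark and map free-text model
# output to a Yes/No label (field names and cleaning rules are assumptions).
import json
import re

def load_jsonl(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def extract_yes_no(text: str) -> str | None:
    """Map a free-text answer to 'Yes'/'No' by its first decisive token."""
    match = re.search(r"\b(yes|no)\b", text.lower())
    return match.group(1).capitalize() if match else None

examples = load_jsonl("a3_food_safety/food_safety.jsonl")
print(len(examples), "questions loaded")
print(extract_yes_no("No, raw chicken should not be rinsed in the sink."))  # -> "No"
```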
### Aspect 4: Cooking Procedure Integrity (`a4_cooking_procedure/`)

Tests whether models can understand recipe logic by identifying missing ingredients or steps.

Setup Instructions:

- Download `cooking_procedures.csv` (the final benchmark dataset).
- Place it inside `a4_cooking_procedure/`.

Code Usage:

- Inference: Run `infer_flan.py`, `infer_llama.py`, or `infer_qwen.py`.
- Evaluation: Run `evaluate_a4.py`, which compares LLM-generated answers against the ground truth (see the sketch after this list).
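Since this aspect is scored with Set Match Jaccard (see the correlation analysis below), the evaluation essentially compares predicted and gold item sets. A minimal sketch, assuming comma-separated answers (the actual parsing in `evaluate_a4.py` may differ):

```python
# Hypothetical sketch of a set-match Jaccard score between predicted and
# gold ingredient/step sets (the parsing and normalization are assumptions).
def to_set(answer: str) -> set[str]:
    """Split a comma-separated answer into a normalized set of items."""
    return {item.strip().lower() for item in answer.split(",") if item.strip()}

def jaccard(pred: str, gold: str) -> float:
    p, g = to_set(pred), to_set(gold)
    if not p and not g:
        return 1.0
    return len(p & g) / len(p | g)

print(jaccard("salt, olive oil, Garlic", "garlic, salt"))  # 2 / 3 ≈ 0.667
```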
### Aspect 5: Culinary Reasoning (`a5_culinary_reasoning/`)

Evaluates high-level troubleshooting and culinary logic using expert community data (Reddit/StackExchange) filtered by DEITA and an ensemble of Gemini and GPT judges.

Setup Instructions:

- Download `culinary_reasoning_QA.csv` (the final benchmark after the GPT and Gemini filters).
- Place it inside `a5_culinary_reasoning/`.

Code Usage:

- Inference: Use `llama_reason.py`, `qwen_reason.py`, or `flan_reason.py`.
- Data Creation: `llmasjudge.py` was used for the second-stage filtering.
- LLM-as-a-Judge: Use `eval_gpt.py` or `gemini_eval.py` to compare model answers against human expert answers (see the sketch after this list).
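The pairwise judging pattern looks roughly like the sketch below. The prompt wording, judge model name, and output format are assumptions, not the exact contents of `eval_gpt.py`:

```python
# Hypothetical sketch of a pairwise LLM-as-a-judge call (prompt and schema assumed).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are a culinary expert judge. Compare two answers to the
question below on Truthfulness, Actionability, and Safety. For each metric,
reply 'A', 'B', or 'Tie' in the format:
Truthfulness: X | Actionability: X | Safety: X

Question: {question}
Answer A (human): {human}
Answer B (model): {model}"""

def judge(question: str, human: str, model_answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, human=human, model=model_answer)}],
        temperature=0,
    )
    return response.choices[0].message.content

print(judge("Why did my hollandaise split?",
            "The butter went in too fast; re-emulsify with a fresh yolk.",
            "Your sauce overheated; whisk in warm water off the heat."))
```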
## 📈 Cross-Aspect Correlation Analysis

The central thesis of FoodLMEval is that success on narrow, isolated food tasks (e.g., nutrition trivia) does not imply broad, functional food literacy. To move beyond qualitative observations, we conducted a cross-aspect rank correlation analysis. By quantifying rank instability, we empirically demonstrate that a model's standing in one domain does not reliably generalize to others, particularly when transitioning from factual recall to safety-critical reasoning.
We calculated the Spearman rank correlation across the six evaluated models (drawn from the FLAN-T5, Llama, and Qwen series). Unlike Pearson correlation, which measures linear relationships, Spearman tracks the stability of model rankings across tasks: low correlation coefficients (approaching zero) indicate that a model's rank on one aspect is a poor predictor of its rank on another.
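As a worked illustration, Spearman's ρ can be computed directly from per-aspect score vectors with `scipy`; the model names and scores below are made-up placeholders, not benchmark results:

```python
# Illustrative only: Spearman rank correlation between two aspects' scores
# across six models (the score values here are made-up placeholders).
from scipy.stats import spearmanr

models = ["flan-small", "flan-xl", "llama-8b", "llama-70b", "qwen-7b", "qwen-72b"]
aspect1_acc = [0.44, 0.52, 0.61, 0.70, 0.58, 0.71]  # hypothetical MCQ accuracy
aspect5_win = [0.01, 0.10, 0.35, 0.59, 0.22, 0.50]  # hypothetical LLM win rate

rho, p_value = spearmanr(aspect1_acc, aspect5_win)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# rho close to 1 means model rankings transfer across aspects;
# rho near 0 means strong performance on one aspect says little about the other.
```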
Because "food literacy" manifests differently depending on the query format, we evaluated rank interdependence across six distinct scenarios, divided into two baselines:
**Baseline A: Discriminative / Multiple Choice (MCQ) Focus.** This baseline tests whether standard associative knowledge (measured via MCQs) translates to higher-level culinary reasoning.
- Aspect 1 (Nutrition): MCQ Accuracy (44.00% to 70.67%)
- Aspect 2 (Health Claims): MCQ Accuracy (75.86% to 98.85%)
- Aspect 3 (Food Safety): MCQ Accuracy (61.45% to 93.11%)
- Aspect 4 (Cooking Integrity): Set Match Jaccard (6.24% to 25.78%)
We compared this MCQ baseline against three different open-ended metrics from Aspect 5 (Culinary Reasoning):
- Scenario 1: vs. A5 Truthfulness LLM Win Rate (0.6% to 59.3%)
- Scenario 2: vs. A5 Safety Pass Rate (73.0% to 85.6%)
- Scenario 3: vs. A5 Actionability LLM Win Rate (1.3% to 82.3%)
**Baseline B: Strict Reasoning Focus.** This baseline strips away the multiple-choice format and forces the models into strict magnitude reasoning, binary safety thresholds, and procedural text generation.
- Aspect 1 (Nutrition): Normalized Kendall Rank (0.1456 to 0.7041; see the sketch after this list)
- Aspect 2 (Health Claims): Yes/No Accuracy (67.82% to 100.00%)
- Aspect 3 (Food Safety): Yes/No Accuracy (58.21% to 90.30%)
- Aspect 4 (Cooking Integrity): Missing Step BERTScore (22.77% to 44.06%)
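To make the Aspect 1 metric concrete, the sketch below computes one common normalization of Kendall rank distance (fraction of discordant pairs); the benchmark's exact normalization may differ, and the food rankings are made up:

```python
# Illustrative only: a normalized Kendall rank distance between a model's
# predicted food ranking and the gold nutrient-density ranking.
# (The benchmark's exact normalization is an assumption; this uses the
# standard "fraction of discordant pairs" convention, where 0 = identical.)
from itertools import combinations

def normalized_kendall_distance(pred: list[str], gold: list[str]) -> float:
    """Fraction of item pairs ordered differently in pred vs. gold."""
    gold_pos = {item: i for i, item in enumerate(gold)}
    pred_pos = {item: i for i, item in enumerate(pred)}
    discordant = sum(
        (gold_pos[a] - gold_pos[b]) * (pred_pos[a] - pred_pos[b]) < 0
        for a, b in combinations(gold, 2)
    )
    total_pairs = len(gold) * (len(gold) - 1) // 2
    return discordant / total_pairs

gold = ["kale", "spinach", "carrot", "potato"]
pred = ["spinach", "kale", "carrot", "potato"]  # one adjacent swap
print(normalized_kendall_distance(pred, gold))  # 1 discordant pair / 6 ≈ 0.167
```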
We compared this strict reasoning baseline against the same Aspect 5 metrics:
- Scenario 4: vs. A5 Truthfulness LLM Win Rate
- Scenario 5: vs. A5 Safety Pass Rate
- Scenario 6: vs. A5 Actionability LLM Win Rate
The heatmaps produced by this analysis yield three critical insights that substantiate our core thesis:
- The Safety-Reasoning Divergence: The data reveals a significant instability in safety performance rankings. Under the MCQ baseline (Scenario 2), Aspect 3 (Food Safety) shows only a moderate correlation (0.66) with Aspect 5 (Culinary Reasoning) Safety Pass Rates. When subjected to strict Yes/No boundaries (Scenario 5), this correlation further degrades to 0.58. Finding: This divergence indicates that factual retrieval of regulatory safety standards is insufficient to ensure a model’s capacity to safely troubleshoot real-world culinary scenarios, highlighting a critical gap between declarative knowledge and procedural adherence.
- The Isolation of Procedural Knowledge: Across the board, Aspect 4 (Cooking Procedure Integrity) exhibits strikingly low rank correlations with associative fact-retrieval. In Scenarios 1-3, Aspect 4 shows near-zero correlation with the MCQ aspects (0.14 to 0.31) and open-ended culinary truthfulness (0.09). Finding: Proficiency in nutrition and health trivia represents a distinct cognitive domain that does not grant an LLM the structural understanding required to reconstruct coordinated cooking procedures.
- Format-Driven Rank Instability: While aspects may correlate highly under discriminative multiple-choice formats (reflecting general LLM scaling capabilities), forcing the models into strict reasoning formats fractures these correlations. For example, the perfect rank correlation (1.00) between Nutritional Literacy and Functional Health Claims under MCQ evaluation drops to 0.77 when evaluated via magnitude ranking and strict Yes/No thresholding (Scenario 4). Finding: The "food literacy" observed in standard benchmarks is often a format-driven artifact that masks underlying deficiencies in grounded, multi-aspect reasoning.
## 🔍 Case Study: Judge Consensus and Discrepancy Analysis

Rather than examining the culinary topics themselves, this case study analyzes the behavior of our automated evaluators (Judge GPT and Judge Gemini) across three specific alignment scenarios (a minimal categorization sketch follows the list):
- Full Consensus: We look for cases where both judges are in complete agreement across all evaluated metrics (Truthfulness, Actionability, and Safety) and select the exact same winning response (e.g., both judges agree the human answer wins).
- Disagreement on Actionability: We analyze cases where both judges agree on the factual correctness (Truthfulness) and Safety of the responses, but disagree on which response provides better practical, step-by-step guidance for the user.
- Disagreement on Truthfulness: We examine cases where both judges agree that the answers are safe and practically actionable, but fundamentally disagree on the underlying scientific accuracy or factual correctness of the food chemistry presented.
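A minimal sketch of how these alignment categories can be derived from paired judge verdicts; the verdict schema here is an assumption, not the repository's actual format:

```python
# Hypothetical sketch: bucket paired judge verdicts into consensus/disagreement
# categories. Each verdict maps metric -> winner ('human', 'llm', or 'tie').
METRICS = ("truthfulness", "actionability", "safety")

def categorize(gpt: dict[str, str], gemini: dict[str, str]) -> str:
    agree = {m for m in METRICS if gpt[m] == gemini[m]}
    if len(agree) == 3:
        return "Full Consensus"
    if agree == {"truthfulness", "safety"}:
        return "Disagreement on Actionability"
    if agree == {"actionability", "safety"}:
        return "Disagreement on Truthfulness"
    if agree == {"truthfulness", "actionability"}:
        return "Disagreement on Safety"
    return "Disagreement on Multiple Metrics"

gpt = {"truthfulness": "human", "actionability": "llm", "safety": "llm"}
gemini = {"truthfulness": "human", "actionability": "human", "safety": "llm"}
print(categorize(gpt, gemini))  # -> "Disagreement on Actionability"
```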
The analysis documents are structured into columns to compare the original context with the generated answers and the subsequent judge evaluations.
Column Breakdown:
- Question & Context: The original user query and the background context prompting the discussion.
- Human Answer: The baseline response provided by a human user.
- LLM Prediction: The AI-generated response being evaluated.
- Judge Justification (GPT/Gemini): The rationale provided by the evaluator models, breaking down their scoring across different metrics.
Color-Coded Highlights: To quickly parse the qualitative analysis within the documents, we utilize a specific highlighting scheme:
- 🔵 Blue Highlighted Text (Truthfulness): Indicates text and judge arguments strictly related to the scientific accuracy and factual correctness of the food chemistry or physics described in the answers.
- 🟣 Violet Highlighted Text (Actionability): Indicates text and judge arguments focused on the practical application of the advice. This highlights step-by-step instructions, troubleshooting tips, and how easily a home cook could follow the methodology.
The current phase of this project evaluated a large-scale dataset to identify where automated judges align and diverge across different evaluation axes.
| Disagreement Category | Percentage | Additional Notes |
|---|---|---|
| Agree Safety, but Disagree on BOTH Truth & Action | 31.29% | |
| Agree Safety & Truth, but Disagree on Actionability | 0.95% | Out of these, 0 involve a 'Tie' (0.00%) |
| Agree Safety & Action, but Disagree on Truthfulness | 18.05% | Out of these, 68 involve a 'Tie' (19.88%) |
| Agree Truth & Action, but Disagree on Safety | 0.42% | |
The use of LLM-as-a-judge provides a highly scalable, robust, and structured baseline for evaluating nutritional literacy and culinary reasoning. The high rate of consensus across many categories validates this automated approach for broad dataset evaluation.
However, looking closely at the 18.05% of cases where models agree on the safety and actionability of a response but fundamentally disagree on its Truthfulness, we can see the deep complexity inherent in food science. These specific disagreements often hinge on highly debated, microscopic chemical interactions. While our current methodology effectively isolates these complex edge cases, future iterations of this benchmark could greatly benefit from targeted, supplemental reviews by domain-expert food scientists to definitively resolve these underlying chemical discrepancies.
## 📢 News

- [Feb 2026]: Paper submitted to the KDD 2026 Datasets and Benchmarks Track.


