FoodLMEval is a multi-aspect benchmark comprising over 4,000 questions designed to evaluate the holistic food literacy of large language models across five key aspects: Nutritional Literacy, Functional Health Claims, Food Safety, Cooking Procedure Integrity, and Culinary Reasoning.
Grounded in authoritative sources like the USDA and EFSA, the dataset utilizes diverse formats—including ranking and troubleshooting—to distinguish a model's genuine scientific and procedural understanding from simple pattern matching or lucky guessing.
- 🖼️ Framework Overview
- 📂 Data Availability & Setup
- 🏗️ Repository Structure & Usage Guide
- 📈 Cross-Aspect Correlation Analysis
- 🔍 Case Study: Judge Consensus and Discrepancy Analysis
- 📢 News
## 📂 Data Availability & Setup

The full dataset and auxiliary files for FoodLMEval are hosted on Figshare. Download Data Here
To run the evaluation, you must download specific files from the link above and place them into the corresponding folders in this repository. See the detailed instructions below for each aspect.
## 🏗️ Repository Structure & Usage Guide

This repository is organized into five main folders, each corresponding to a distinct aspect of food literacy described in our paper.
### Aspect 1: Nutritional Literacy (`a1_nutritional_literacy/`)

This aspect tests the model's ability to recall nutrient density and rank foods based on nutritional content.

Setup Instructions:

- Download the `nutrient_data` folder (contains raw files with food names and nutrient density lists used to build the QA).
- Download `nutrients_task_qa.csv` (the final QA benchmark dataset).
- Place both the `nutrient_data` folder and `nutrients_task_qa.csv` inside `a1_nutritional_literacy/`.

Code Usage:

- Inference: Use `infer_qwen.py`, `infer_flan.py`, or `infer_llama.py` to generate answers (see the sketch after this list). To switch model sizes, modify the `MODEL_ID` variable inside the script.
- Evaluation: Use `eval_qwen.py`, `eval_flan.py`, or `eval_llama.py`. These scripts contain the answer-cleaning logic and metric calculations.
- Robustness: `shuffle_mc.ipynb` shuffles the multiple-choice options to guard against positional bias.
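For orientation, the inference scripts follow the standard `transformers` generation pattern sketched below. This is a hypothetical reconstruction for the causal-LM models (Qwen/Llama); the FLAN-T5 scripts would use a seq2seq model class instead, and the column names and output path here are assumptions:

```python
# Minimal sketch of the inference loop (hypothetical; the actual scripts may differ).
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"  # swap this string to change model size

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

df = pd.read_csv("a1_nutritional_literacy/nutrients_task_qa.csv")
predictions = []
for question in df["question"]:  # column name is an assumption
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=128, do_sample=False)
    predictions.append(
        tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
    )

df["prediction"] = predictions
df.to_csv("a1_predictions.csv", index=False)
```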
### Aspect 2: Functional Health Claims (`a2_functional_health/`)

Evaluates the understanding of health claims based on the Food Health Claims Knowledge Graph.

Setup Instructions:

- Download `balanced_health_QA.csv` (the final QA dataset).
- Place it inside `a2_functional_health/`.

Code Usage:

- Generation Logic: `gpt_prompt.py` contains the logic used to create the dataset.
- Inference: Run `infer_health_qwen.py`, `infer_health_llama.py`, or `infer_health_flan.py`.
- Evaluation: Run `evaluate_health.py` after inference to compute metrics.
- Robustness: `shuffle_mc.py` handles MCQ shuffling (see the sketch after this list).
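The core idea behind the MCQ shuffling is small enough to sketch; the schema below (a list of options plus a gold index) is an assumption about how the scripts represent questions:

```python
# Hypothetical sketch: shuffle MCQ options and remap the gold answer,
# so a model cannot score well by always picking the same position.
import random

def shuffle_mcq(options: list[str], answer_idx: int, seed: int = 0):
    """Return shuffled options and the new index of the correct answer."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    new_answer_idx = order.index(answer_idx)
    return shuffled, new_answer_idx

options = ["Spinach", "Butter", "White rice", "Soda"]
shuffled, gold = shuffle_mcq(options, answer_idx=0, seed=42)
print(shuffled, "->", "ABCD"[gold])
```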
### Aspect 3: Food Safety (`a3_food_safety/`)

Assesses adherence to food safety guidelines using a dataset created via GPT generation and manual filtering.

Setup Instructions:

- Download `food_safety.jsonl` (the final QA dataset).
- Place it inside `a3_food_safety/`.

Code Usage:

- Inference: Run `infer_flan.py`, `infer_llama.py`, or `infer_qwen.py`. Ensure the `food_safety.jsonl` file is present (see the loading sketch after this list).
- Generation: `gpt_qa.py` contains the code used to generate the initial QA pairs before manual filtering.
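A minimal sketch of loading the JSONL benchmark and normalizing a free-text answer to the Yes/No label used for scoring; the field names and cleaning rules here are assumptions:

```python
# Hypothetical sketch: load the JSONL benchmark and map free-text model
# output to a Yes/No label (field names and cleaning rules are assumptions).
import json
import re

def load_jsonl(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def extract_yes_no(text: str) -> str | None:
    """Map a free-text answer to 'Yes'/'No' by its first decisive token."""
    match = re.search(r"\b(yes|no)\b", text.lower())
    return match.group(1).capitalize() if match else None

examples = load_jsonl("a3_food_safety/food_safety.jsonl")
print(len(examples), "questions loaded")
print(extract_yes_no("No, raw chicken should not be rinsed in the sink."))  # -> "No"
```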
### Aspect 4: Cooking Procedure Integrity (`a4_cooking_procedure/`)

Tests whether models can understand recipe logic by identifying missing ingredients or steps.

Setup Instructions:

- Download `cooking_procedures.csv` (the final benchmark dataset).
- Place it inside `a4_cooking_procedure/`.

Code Usage:

- Inference: Run `infer_flan.py`, `infer_llama.py`, or `infer_qwen.py`.
- Evaluation: Run `evaluate_a4.py`, which compares LLM-generated answers against the ground truth (see the sketch after this list).
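Since this aspect is scored with Set Match Jaccard (see the correlation analysis below), the evaluation essentially compares predicted and gold item sets. A minimal sketch, assuming comma-separated answers (the actual parsing in `evaluate_a4.py` may differ):

```python
# Hypothetical sketch of a set-match Jaccard score between predicted and
# gold ingredient/step sets (the parsing and normalization are assumptions).
def to_set(answer: str) -> set[str]:
    """Split a comma-separated answer into a normalized set of items."""
    return {item.strip().lower() for item in answer.split(",") if item.strip()}

def jaccard(pred: str, gold: str) -> float:
    p, g = to_set(pred), to_set(gold)
    if not p and not g:
        return 1.0
    return len(p & g) / len(p | g)

print(jaccard("salt, olive oil, Garlic", "garlic, salt"))  # 2 / 3 ≈ 0.667
```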
### Aspect 5: Culinary Reasoning (`a5_culinary_reasoning/`)

Evaluates high-level troubleshooting and culinary logic using expert community data (Reddit/StackExchange) filtered by DEITA and an ensemble of Gemini and GPT judges.

Setup Instructions:

- Download `culinary_reasoning_QA.csv` (the final benchmark after the GPT and Gemini filters).
- Place it inside `a5_culinary_reasoning/`.

Code Usage:

- Inference: Use `llama_reason.py`, `qwen_reason.py`, or `flan_reason.py`.
- Data Creation: `llmasjudge.py` was used for the second-stage filtering.
- LLM-as-a-Judge: Use `eval_gpt.py` or `gemini_eval.py` to compare model answers against human expert answers (see the sketch after this list).
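The pairwise judging pattern looks roughly like the sketch below. The prompt wording, judge model name, and output format are assumptions, not the exact contents of `eval_gpt.py`:

```python
# Hypothetical sketch of a pairwise LLM-as-a-judge call (prompt and schema assumed).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are a culinary expert judge. Compare two answers to the
question below on Truthfulness, Actionability, and Safety. For each metric,
reply 'A', 'B', or 'Tie' in the format:
Truthfulness: X | Actionability: X | Safety: X

Question: {question}
Answer A (human): {human}
Answer B (model): {model}"""

def judge(question: str, human: str, model_answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, human=human, model=model_answer)}],
        temperature=0,
    )
    return response.choices[0].message.content

print(judge("Why did my hollandaise split?",
            "The butter went in too fast; re-emulsify with a fresh yolk.",
            "Your sauce overheated; whisk in warm water off the heat."))
```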
## 📈 Cross-Aspect Correlation Analysis

The central thesis of FoodLMEval is that success on narrow, isolated food tasks (e.g., nutrition trivia) does not imply broad, functional food literacy. To move beyond qualitative observations, we conducted a cross-aspect rank correlation analysis. By quantifying rank instability, we empirically demonstrate that a model's standing in one domain does not reliably generalize to others, particularly when transitioning from factual recall to safety-critical reasoning.
We calculated the Spearman rank correlation across the six evaluated models (drawn from the FLAN-T5, Llama, and Qwen series). Unlike Pearson correlation, which measures linear relationships, Spearman tracks the stability of model rankings across tasks: low correlation coefficients (approaching zero) indicate that a model's rank on one aspect is a poor predictor of its rank on another.
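As a worked illustration, Spearman's ρ can be computed directly from per-aspect score vectors with `scipy`; the model names and scores below are made-up placeholders, not benchmark results:

```python
# Illustrative only: Spearman rank correlation between two aspects' scores
# across six models (the score values here are made-up placeholders).
from scipy.stats import spearmanr

models = ["flan-small", "flan-xl", "llama-8b", "llama-70b", "qwen-7b", "qwen-72b"]
aspect1_acc = [0.44, 0.52, 0.61, 0.70, 0.58, 0.71]  # hypothetical MCQ accuracy
aspect5_win = [0.01, 0.10, 0.35, 0.59, 0.22, 0.50]  # hypothetical LLM win rate

rho, p_value = spearmanr(aspect1_acc, aspect5_win)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# rho close to 1 means model rankings transfer across aspects;
# rho near 0 means strong performance on one aspect says little about the other.
```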
Because "food literacy" manifests differently depending on the query format, we evaluated rank interdependence across six distinct scenarios, divided into two baselines:
**Baseline A: Discriminative / Multiple Choice (MCQ) Focus.** This baseline tests whether standard associative knowledge (measured via MCQs) translates to higher-level culinary reasoning.
- Aspect 1 (Nutrition): MCQ Accuracy (44.00% to 70.67%)
- Aspect 2 (Health Claims): MCQ Accuracy (75.86% to 98.85%)
- Aspect 3 (Food Safety): MCQ Accuracy (61.45% to 93.11%)
- Aspect 4 (Cooking Integrity): Set Match Jaccard (6.24% to 25.78%)
We compared this MCQ baseline against three different open-ended metrics from Aspect 5 (Culinary Reasoning):
- Scenario 1: vs. A5 Truthfulness LLM Win Rate (0.6% to 59.3%)
- Scenario 2: vs. A5 Safety Pass Rate (73.0% to 85.6%)
- Scenario 3: vs. A5 Actionability LLM Win Rate (1.3% to 82.3%)
**Baseline B: Strict Reasoning Focus.** This baseline strips away the multiple-choice format and forces the models into strict magnitude reasoning, binary safety thresholds, and procedural text generation.
- Aspect 1 (Nutrition): Normalized Kendall Rank (0.1456 to 0.7041; see the sketch after this list)
- Aspect 2 (Health Claims): Yes/No Accuracy (67.82% to 100.00%)
- Aspect 3 (Food Safety): Yes/No Accuracy (58.21% to 90.30%)
- Aspect 4 (Cooking Integrity): Missing Step BERTScore (22.77% to 44.06%)
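To make the Aspect 1 metric concrete, the sketch below computes one common normalization of Kendall rank distance (fraction of discordant pairs); the benchmark's exact normalization may differ, and the food rankings are made up:

```python
# Illustrative only: a normalized Kendall rank distance between a model's
# predicted food ranking and the gold nutrient-density ranking.
# (The benchmark's exact normalization is an assumption; this uses the
# standard "fraction of discordant pairs" convention, where 0 = identical.)
from itertools import combinations

def normalized_kendall_distance(pred: list[str], gold: list[str]) -> float:
    """Fraction of item pairs ordered differently in pred vs. gold."""
    gold_pos = {item: i for i, item in enumerate(gold)}
    pred_pos = {item: i for i, item in enumerate(pred)}
    discordant = sum(
        (gold_pos[a] - gold_pos[b]) * (pred_pos[a] - pred_pos[b]) < 0
        for a, b in combinations(gold, 2)
    )
    total_pairs = len(gold) * (len(gold) - 1) // 2
    return discordant / total_pairs

gold = ["kale", "spinach", "carrot", "potato"]
pred = ["spinach", "kale", "carrot", "potato"]  # one adjacent swap
print(normalized_kendall_distance(pred, gold))  # 1 discordant pair / 6 ≈ 0.167
```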
We compared this strict reasoning baseline against the same Aspect 5 metrics:
- Scenario 4: vs. A5 Truthfulness LLM Win Rate
- Scenario 5: vs. A5 Safety Pass Rate
- Scenario 6: vs. A5 Actionability LLM Win Rate
The heatmaps produced by this analysis yield three critical insights that substantiate our core thesis:
- The Safety-Reasoning Divergence: The data reveals a significant instability in safety performance rankings. Under the MCQ baseline (Scenario 2), Aspect 3 (Food Safety) shows only a moderate correlation (0.66) with Aspect 5 (Culinary Reasoning) Safety Pass Rates. When subjected to strict Yes/No boundaries (Scenario 5), this correlation further degrades to 0.58. Finding: This divergence indicates that factual retrieval of regulatory safety standards is insufficient to ensure a model’s capacity to safely troubleshoot real-world culinary scenarios, highlighting a critical gap between declarative knowledge and procedural adherence.
- The Isolation of Procedural Knowledge: Across the board, Aspect 4 (Cooking Procedure Integrity) exhibits strikingly low rank correlations with associative fact-retrieval. In Scenarios 1-3, Aspect 4 shows near-zero correlation with the MCQ aspects (0.14 to 0.31) and open-ended culinary truthfulness (0.09). Finding: Proficiency in nutrition and health trivia represents a distinct cognitive domain that does not grant an LLM the structural understanding required to reconstruct coordinated cooking procedures.
- Format-Driven Rank Instability: While aspects may correlate highly under discriminative multiple-choice formats (reflecting general LLM scaling capabilities), forcing the models into strict reasoning formats fractures these correlations. For example, the perfect rank correlation (1.00) between Nutritional Literacy and Functional Health Claims under MCQ evaluation drops to 0.77 when evaluated via magnitude ranking and strict Yes/No thresholding (Scenario 4). Finding: The "food literacy" observed in standard benchmarks is often a format-driven artifact that masks underlying deficiencies in grounded, multi-aspect reasoning.
## 🔍 Case Study: Judge Consensus and Discrepancy Analysis

Rather than examining the culinary topics themselves, this case study analyzes the behavior of our automated evaluators (Judge GPT and Judge Gemini) across three specific alignment scenarios (a minimal categorization sketch follows the list):
- Full Consensus: We look for cases where both judges are in complete agreement across all evaluated metrics (Truthfulness, Actionability, and Safety) and select the exact same winning response (e.g., both judges agree the human answer wins).
- Disagreement on Actionability: We analyze cases where both judges agree on the factual correctness (Truthfulness) and Safety of the responses, but disagree on which response provides better practical, step-by-step guidance for the user.
- Disagreement on Truthfulness: We examine cases where both judges agree that the answers are safe and practically actionable, but fundamentally disagree on the underlying scientific accuracy or factual correctness of the food chemistry presented.
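A minimal sketch of how these alignment categories can be derived from paired judge verdicts; the verdict schema here is an assumption, not the repository's actual format:

```python
# Hypothetical sketch: bucket paired judge verdicts into consensus/disagreement
# categories. Each verdict maps metric -> winner ('human', 'llm', or 'tie').
METRICS = ("truthfulness", "actionability", "safety")

def categorize(gpt: dict[str, str], gemini: dict[str, str]) -> str:
    agree = {m for m in METRICS if gpt[m] == gemini[m]}
    if len(agree) == 3:
        return "Full Consensus"
    if agree == {"truthfulness", "safety"}:
        return "Disagreement on Actionability"
    if agree == {"actionability", "safety"}:
        return "Disagreement on Truthfulness"
    if agree == {"truthfulness", "actionability"}:
        return "Disagreement on Safety"
    return "Disagreement on Multiple Metrics"

gpt = {"truthfulness": "human", "actionability": "llm", "safety": "llm"}
gemini = {"truthfulness": "human", "actionability": "human", "safety": "llm"}
print(categorize(gpt, gemini))  # -> "Disagreement on Actionability"
```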
The analysis documents are structured into columns to compare the original context with the generated answers and the subsequent judge evaluations.
Column Breakdown:
- Question & Context: The original user query and the background context prompting the discussion.
- Human Answer: The baseline response provided by a human user.
- LLM Prediction: The AI-generated response being evaluated.
- Judge Justification (GPT/Gemini): The rationale provided by the evaluator models, breaking down their scoring across different metrics.
Color-Coded Highlights: To quickly parse the qualitative analysis within the documents, we utilize a specific highlighting scheme:
- 🔵 Blue Highlighted Text (Truthfulness): Indicates text and judge arguments strictly related to the scientific accuracy and factual correctness of the food chemistry or physics described in the answers.
- 🟣 Violet Highlighted Text (Actionability): Indicates text and judge arguments focused on the practical application of the advice. This highlights step-by-step instructions, troubleshooting tips, and how easily a home cook could follow the methodology.
The current phase of this project evaluated a large-scale dataset to identify where automated judges align and diverge across different evaluation axes.
| Disagreement Category | Percentage | Additional Notes |
|---|---|---|
| Agree Safety, but Disagree on BOTH Truth & Action | 31.29% | |
| Agree Safety & Truth, but Disagree on Actionability | 0.95% | Out of these, 0 involve a 'Tie' (0.00%) |
| Agree Safety & Action, but Disagree on Truthfulness | 18.05% | Out of these, 68 involve a 'Tie' (19.88%) |
| Agree Truth & Action, but Disagree on Safety | 0.42% | |
The use of LLM-as-a-judge provides a highly scalable, robust, and structured baseline for evaluating nutritional literacy and culinary reasoning. The high rate of consensus across many categories validates this automated approach for broad dataset evaluation.
However, looking closely at the 18.05% of cases where models agree on the safety and actionability of a response but fundamentally disagree on its Truthfulness, we can see the deep complexity inherent in food science. These specific disagreements often hinge on highly debated, microscopic chemical interactions. While our current methodology effectively isolates these complex edge cases, future iterations of this benchmark could greatly benefit from targeted, supplemental reviews by domain-expert food scientists to definitively resolve these underlying chemical discrepancies.
## 📢 News

- [Feb 2026]: Paper submitted to the KDD 2026 Datasets and Benchmarks Track.


