NutriBench

A Dataset for Evaluating Large Language Models in Carbohydrate Estimation from Meal Descriptions

Andong Hua1*, Mehak Preet Dhaliwal1*, Ryan Burke, Yao Qin1

1University of California, Santa Barbara, *Equal contribution

ABOUT THE DATASET

NutriBench is the first publicly available nutrition benchmark based on natural-language meal descriptions.

The dataset consists of 5,000 human-verified meal descriptions with macronutrient labels, including carbohydrates, proteins, fats, and calories. NutriBench can be used to evaluate and benchmark Large Language Models (LLMs) on the task of nutrition estimation.
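For illustration, a NutriBench-style entry might look like the sketch below; the field names and values are assumptions for illustration, not the dataset's actual schema.

    # A hypothetical NutriBench-style record (field names and values are
    # illustrative, not the released schema). Macronutrients in grams,
    # energy in kcal.
    example_entry = {
        "meal_description": "I had a cup of cooked white rice and a grilled chicken breast.",
        "carb": 44.5,       # grams of carbohydrates
        "protein": 35.2,    # grams of protein
        "fat": 4.1,         # grams of fat
        "calories": 370.0,  # kcal
    }

    # The benchmark task: given only the meal description, predict the
    # macronutrient values (this page focuses on carbohydrate estimation).
    print(example_entry["meal_description"], "->", example_entry["carb"], "g carbs")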

This video shows examples of NutriBench queries and responses by GPT-3.5 with Chain-of-Thought (CoT) prompting for carbohydrate estimation.

NutriBench is based on USDA data and is human-verified.

NutriBench is constructed from FoodData Central (FDC), the US Department of Agriculture's (USDA) food composition database. We use GPT-3.5 to generate natural language meal descriptions from food items sampled from a cleaned and filtered version of FDC. We also conduct two rounds of human verification, manually inspecting and editing the queries, to ensure the quality of NutriBench.
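As a rough Python sketch of the generation step (the prompt wording and sampled items below are assumptions, not the authors' actual prompt or code), GPT-3.5 can be asked to turn sampled FDC items into a meal description via the OpenAI chat API:

    # Minimal sketch of query generation, assuming the OpenAI Python client
    # (reads OPENAI_API_KEY from the environment).
    from openai import OpenAI

    client = OpenAI()

    # Food items and serving sizes sampled from a cleaned, filtered FDC dump
    # (illustrative values).
    items = [("white rice, cooked", "1 cup"), ("chicken breast, grilled", "120 g")]
    item_text = "; ".join(f"{qty} of {name}" for name, qty in items)

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Write one natural-sounding sentence describing a meal "
                       f"containing exactly these items and serving sizes: {item_text}",
        }],
    )
    print(response.choices[0].message.content)

Each generated query then goes through the human-verification passes described above.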

NutriBench Construction Pipeline

This image shows the end-to-end construction pipeline of NutriBench.

NutriBench is built for real-world complexity.

NutriBench is divided into 15 subsets of varying meal-description complexity. The subsets differ in the number of food items (1-3), the type of serving (single or multiple), the serving-size description (natural, such as '1 cup', or metric, such as '50g'), and the popularity of the food items. This ensures that the benchmark reflects the complexity of real-world nutrition estimation.
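As a sketch, each query can be assigned to its subset from per-query metadata along these axes; the field names and label format below are assumptions, not the released schema:

    # Hypothetical per-query metadata used to define complexity subsets
    # (field names and values are assumptions, not the released schema).
    def subset_label(meta: dict) -> str:
        return (f"{meta['num_items']}-item_"
                f"{meta['serving_type']}_"   # 'single' or 'multiple' servings
                f"{meta['unit_style']}_"     # 'natural' ('1 cup') or 'metric' ('50g')
                f"{meta['popularity']}")     # e.g. 'popular' or 'rare' food items

    print(subset_label({"num_items": 2, "serving_type": "single",
                        "unit_style": "metric", "popularity": "popular"}))
    # -> 2-item_single_metric_popular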

ABOUT THE EVALUATION

We benchmark seven state-of-the-art LLMs on the task of carbohydrate estimation with NutriBench.

We extensively evaluate and analyze the performance of GPT-3.5, Llama2-7B/70B, Llama3-8B/70B, Alpaca-7B, and MedAlpaca-7B using four prompting strategies (illustrated in the sketch after the list):

  • Baseline instructional prompting
  • Chain-of-Thought (CoT)
  • Retrieval Augmented Generation (RAG)
  • RAG+CoT
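Below is a minimal sketch of what the four prompt styles might look like; the wording and the retrieve helper are assumptions, not the paper's exact prompts:

    # Illustrative prompt templates for the four strategies (wording is an
    # assumption, not reproduced from the paper).
    QUERY = "I ate two slices of whole-wheat toast with a tablespoon of peanut butter."

    BASE = f"How many grams of carbohydrates are in this meal? Meal: {QUERY}"

    COT = BASE + "\nLet's think step by step, then give a final answer in grams."

    # For RAG, `retrieve` stands in for a nutrition-database lookup
    # (hypothetical helper, not a real library call).
    def retrieve(query: str) -> str:
        return ("whole-wheat bread, 1 slice: ~12 g carbs; "
                "peanut butter, 1 tbsp: ~3 g carbs")

    RAG = f"Context: {retrieve(QUERY)}\n{BASE}"
    RAG_COT = f"Context: {retrieve(QUERY)}\n{COT}"

    for name, prompt in [("Base", BASE), ("CoT", COT),
                         ("RAG", RAG), ("RAG+CoT", RAG_COT)]:
        print(f"--- {name} ---\n{prompt}\n")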
Example

This image shows the output by GPT-3.5 for the four prompting strategies for a NutriBench query.

On NutriBench, LLMs are faster and more accurate at carbohydrate estimation than human nutritionists.

Across all models, prompting methods, and data splits, we conducted 300 experiments to provide comprehensive insight into the current capabilities of LLMs in nutrition estimation. We also conducted a human study involving expert and non-expert participants and found that LLMs provide faster and more accurate predictions across a range of complex queries. GPT-3.5 with CoT prompting achieves the highest accuracy of 51.48%, with an answer rate of 89.80%.

Example

This image summarizes the results of our experiments, plotting the accuracy (absolute error < 7g) and answer rate for all the methods on NutriBench.
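One plausible way to compute these two metrics is sketched below; whether accuracy is normalized over answered queries or over all queries is an assumption here, and the paper's exact convention may differ:

    # Sketch of the evaluation metrics: a prediction counts as accurate when
    # its absolute carbohydrate error is under 7 g; unanswered queries (None)
    # lower the answer rate. Normalizing accuracy by answered queries is an
    # assumption, not necessarily the paper's convention.
    def evaluate(preds, labels, threshold_g=7.0):
        answered = [(p, y) for p, y in zip(preds, labels) if p is not None]
        answer_rate = len(answered) / len(labels)
        accurate = sum(abs(p - y) < threshold_g for p, y in answered)
        accuracy = accurate / len(answered) if answered else 0.0
        return accuracy, answer_rate

    acc, ans = evaluate([40.0, None, 52.0], [44.5, 30.0, 70.0])
    print(f"accuracy={acc:.2%}, answer_rate={ans:.2%}")  # 50.00%, 66.67%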

Contact

For any questions, please contact the authors at {dongx1997,mdhaliwal}@ucsb.edu