You are what you eat? Feeding foundation models a regionally diverse food dataset of World Wide Dishes
Jabez Magomere, Shu Ishida, Tejumade Afonja, Aya Salama, Daniel Kochin, Foutse Yuehgoh, Imane Hamzaoui, Raesetje Sefala, Aisha Aalagib, Elizaveta Semenova, Lauren Crais, Siobhan Mackenzie Hall
Official Website (used for data collection): https://worldwidedishes.com/
We present the World Wide Dishes dataset, which seeks to assess these disparities through a decentralised data collection effort. We gathered perspectives directly from people with a wide variety of backgrounds around the globe, with the aim of creating a dataset of their insights into foods relevant to their own cultural, regional, national, or ethnic lives.
The metadata of the World Wide Dishes dataset is available in the Croissant format:
The website includes our Data Protection Policy and FAQs developed to support contributors during the data collection process.
Please refer to the README.md in the webapp directory for instructions on how to run your own instance of the website.
In addition to the World Wide Dishes dataset, we present 30 dishes for five selected African countries and 30 dishes for the US as a baseline. An additional test suite was curated for regional parity.
- Dishes selected for the five African countries + the US
- US Test set CSV (the same set of dishes as the previous sheet, but with an additional regional label)
- Dishes selected for the five African countries + the US / US Test set (Excel Sheet)
conda create -n wwd python=3.10
conda activate wwd
pip install -r requirements.txt
Create a .env file in the root directory of the repository with the following settings:
WWD_CSV_PATH=./data/WorldWideDishes_2024_June_World_Wide_Dishes.csv
WWD_30_DISHES_CSV_PATH=./data/WorldWideDishes_2024_June_Selected_Countries.csv
These point to the World Wide Dishes dataset and the 30 dishes selected for the African countries and the US.
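As a sketch of how these settings might be consumed, the snippet below parses .env-style lines into the process environment using only the standard library (the python-dotenv package's load_dotenv() is the usual convenience for this). The file contents mirror the example above.

```python
import os

# Example .env contents, copied from this README.
env_text = """\
WWD_CSV_PATH=./data/WorldWideDishes_2024_June_World_Wide_Dishes.csv
WWD_30_DISHES_CSV_PATH=./data/WorldWideDishes_2024_June_Selected_Countries.csv
"""

# Minimal stdlib-only parser: skip blanks and comments, split on the
# first '=', and don't overwrite variables already set in the shell.
for line in env_text.splitlines():
    line = line.strip()
    if line and not line.startswith("#") and "=" in line:
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

print(os.environ["WWD_CSV_PATH"])
```

In the actual experiments you would read the file from disk rather than a string; `os.environ.setdefault` keeps any values already exported in your shell.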
If you want to conduct experiments that use OpenAI products such as GPT-3.5 (required for the LLM experiments) or DALL-E 2 and DALL-E 3 (required for the dish image generation), please obtain an OpenAI API key from here and set it as the environment variable OPENAI_API_KEY by adding it to the .env file. (Make sure you don't commit this file to Git!)
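A small guard like the following (require_openai_key is a hypothetical helper, not part of this repository) can catch a missing key before any experiment starts; the commented-out lines sketch what a GPT-3.5 call looks like with the openai Python client once the key is in place.

```python
import os

def require_openai_key() -> str:
    """Return the OpenAI API key from the environment, or fail loudly.

    Hypothetical helper: call this before the LLM / image-generation
    experiments so a missing key is caught early, not mid-run.
    """
    key = os.getenv("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set - add it to your .env file "
            "(and keep that file out of Git)."
        )
    return key

# With the key set, a GPT-3.5 request would look roughly like:
#   from openai import OpenAI
#   client = OpenAI()  # reads OPENAI_API_KEY from the environment
#   resp = client.chat.completions.create(
#       model="gpt-3.5-turbo",
#       messages=[{"role": "user", "content": "Name a Ghanaian dish."}],
#   )
```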
While the Llama 3 (8B) and Llama 3 (70B) models can be run locally after obtaining a licence through Hugging Face via the links provided, running these models locally is computationally expensive and time-consuming.
Groq offers a fast and reliable API service for open-source LLMs, including the Llama 3 models. As of June 2024, the Groq API is free to use.
Please obtain a Groq API key from here and set it as the environment variable GROQ_API_KEY by adding it to the .env file.
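Putting the pieces together, a complete .env file would look like the fragment below. The two key values are placeholders for illustration, not real credentials.

```shell
# Dataset paths (as given earlier in this README)
WWD_CSV_PATH=./data/WorldWideDishes_2024_June_World_Wide_Dishes.csv
WWD_30_DISHES_CSV_PATH=./data/WorldWideDishes_2024_June_Selected_Countries.csv

# API keys (placeholder values - substitute your own)
OPENAI_API_KEY=sk-...   # GPT-3.5 and DALL-E 2/3 experiments
GROQ_API_KEY=gsk_...    # Llama 3 experiments via Groq
```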
Please refer to the README.md in the llm_probing directory for instructions on how to run the experiments.
Please refer to the README.md in the gen_images directory for instructions on how to run the experiments.
Please refer to the README.md in the clip_probing directory for instructions on how to run the experiments.
Please refer to the README.md in the vqa directory for instructions on how to run the experiments.
Due to the high degree of inaccurate and culturally insensitive imagery, we will not be releasing the generated images, for safety reasons. Our terms of use also prohibit the generation of images for training models using the World Wide Dishes dataset.
For transparency and insight into the review conducted, we are releasing the text responses only:
In the World Wide Dishes dataset, the column uploaded_image_name contains paths to contributed dish images that are CC-licenced.
This is a subset of the images that were contributed to the data collection website.
We only include those images that we were personally able to verify as being owned by the contributor.
We have uploaded these images to a Google Drive folder for public access.
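As an illustrative sketch of working with this column, the snippet below keeps only the rows that include an uploaded image. Only the uploaded_image_name column name comes from this README; the sample rows and the other column names are made up for the example.

```python
import csv
import io

# Made-up sample rows in the rough shape of the dataset CSV; the real
# file is the one pointed to by WWD_CSV_PATH in your .env.
sample_csv = """\
dish_name,country,uploaded_image_name
Jollof rice,Nigeria,images/jollof_rice.jpg
Chapati,Kenya,
Injera,Ethiopia,images/injera.jpg
"""

# Keep only contributions with a non-empty image path.
with_images = [
    row for row in csv.DictReader(io.StringIO(sample_csv))
    if row["uploaded_image_name"].strip()
]
print([row["dish_name"] for row in with_images])
```

On the real dataset you would open the CSV file directly (`csv.DictReader(open(path, newline=""))`) or load it with pandas; the filtering logic is the same.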