EmbodiedEval is a comprehensive and interactive benchmark designed to evaluate the capabilities of MLLMs in embodied tasks.
EmbodiedEval includes a 3D simulator for real-time simulation. You have two options to run the simulator:
Option 1: Run the simulator on your personal computer with a display (Windows/macOS/Linux). No additional configuration is required. The subsequent installation and data download (approximately 20 GB of disk space) will take place on your computer.
Option 2: Run the simulator on a Linux server. This requires sudo access and up-to-date NVIDIA drivers, and the simulator must run outside a Docker container. The following additional configuration is required:
Additional configurations
- Install Xorg:
sudo apt install -y gcc make pkg-config xorg
- Generate the .conf file:
sudo nvidia-xconfig --no-xinerama --probe-all-gpus --use-display-device=none
sudo cp /etc/X11/xorg.conf /etc/X11/xorg-0.conf
- Edit /etc/X11/xorg-0.conf:
  - Remove the "ServerLayout" and "Screen" sections.
  - Set BoardName and BusID in the "Device" section to the Name and PCI BusID of a GPU reported by the nvidia-xconfig --query-gpu-info command. For example:
Section "Device"
    Identifier "Device0"
    Driver "nvidia"
    VendorName "NVIDIA Corporation"
    BusID "PCI:164:0:0"
    BoardName "NVIDIA GeForce RTX 3090"
EndSection
- Run Xorg:
sudo nohup Xorg :0 -config /etc/X11/xorg-0.conf &
- Set the display (remember to run the following command in every new terminal session before running the evaluation code):
export DISPLAY=:0
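The "Device" section edit described above can also be generated instead of typed by hand. The following is a minimal sketch, not part of EmbodiedEval: the helper name render_device_section is hypothetical, and it assumes you copy the Name and PCI BusID values yourself from the output of nvidia-xconfig --query-gpu-info.

```python
def render_device_section(board_name: str, bus_id: str,
                          identifier: str = "Device0") -> str:
    """Return an xorg.conf "Device" section for one NVIDIA GPU.

    board_name and bus_id are the Name and PCI BusID values reported by
    `nvidia-xconfig --query-gpu-info`. This helper is illustrative only.
    """
    lines = [
        'Section "Device"',
        f'    Identifier "{identifier}"',
        '    Driver "nvidia"',
        '    VendorName "NVIDIA Corporation"',
        f'    BusID "{bus_id}"',
        f'    BoardName "{board_name}"',
        'EndSection',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Values taken from the example in the configuration steps.
    print(render_device_section("NVIDIA GeForce RTX 3090", "PCI:164:0:0"))
```

The printed text can be pasted over the existing "Device" section in /etc/X11/xorg-0.conf.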
Install Dependencies
conda create -n embodiedeval python=3.10
conda activate embodiedeval
pip install -r requirements.txt
Download Dataset
python download.py
Run the random baseline:
python run_eval.py --agent random
Run the human baseline:
python run_eval.py --agent human
In the human baseline, you interact with the environment manually.
How to play
- Press the corresponding number key to choose an option.
- Press W/A/D to select the forward/turn left/turn right options in the menu.
- Press Enter to open or close the chat window, where you can enter option numbers greater than 9.
- Press T to hide or show the options panel.
Edit the api_key and base_url in agent.py, then run:
python run_eval.py --agent gpt-4o
To evaluate your own model, overwrite the MyAgent class in agent.py. In the __init__ method, load the model or initialize the API. In the generate method, perform model inference or API calls and return the generated text. See the comments within the class for details.
Run the following command to evaluate your model:
python run_eval.py --agent myagent
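As a rough illustration of the two methods described above, here is a stub agent. It is only a sketch under assumptions: the argument names prompt and images, the model_name parameter, and the returned text are all hypothetical. The real signature and expected output format are documented in the comments of the MyAgent class in agent.py and may differ.

```python
class MyAgent:
    """Hypothetical stand-in for the MyAgent class in agent.py.

    The method signatures here are assumptions made for illustration;
    follow the comments in the real class when implementing your agent.
    """

    def __init__(self, model_name: str = "echo-model"):
        # A real agent would load model weights or create an API client here.
        self.model_name = model_name  # placeholder, not a real checkpoint

    def generate(self, prompt: str, images=None) -> str:
        # A real agent would run inference or call an API here and
        # return the generated text. This stub returns a fixed reply.
        num_images = 0 if images is None else len(images)
        return f"[{self.model_name}] saw {num_images} image(s); choosing option 1"


if __name__ == "__main__":
    agent = MyAgent()
    print(agent.generate("Pick up the cup.", images=[]))
```

Keeping model loading in __init__ and inference in generate means the model is loaded once and reused for every call during evaluation.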
If your server cannot run the simulator (e.g. you have no sudo access) and your personal computer cannot run the model, you can run the simulator on your computer and the model on the server using the following steps:
Evaluation steps with a remote simulator
- Perform the Install Dependencies and Download Dataset steps on both your local computer and the server.
- On the server, run:
python run_eval.py --agent myagent --remote --scene_folder <The absolute path of the scene folder on your local computer>
This command will hang, waiting for the simulator to connect.
- On your computer, set up an SSH tunnel between your computer and the server:
ssh -N -L 50051:localhost:50051 <username>@<host> [-p <ssh_port>]
- On your computer, launch the simulator:
python launch.py
Once the simulator starts, the evaluation process on the server will begin.
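Before launching the simulator, it can help to confirm that the local end of the SSH tunnel is listening. The sketch below is not part of EmbodiedEval; the function name tunnel_is_up is hypothetical, and the only value taken from the steps above is port 50051.

```python
import socket


def tunnel_is_up(host: str = "localhost", port: int = 50051,
                 timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only shows that something (e.g. the ssh -L forwarder) is listening
    locally. It does not prove that the evaluation process on the server is
    waiting on the other end of the tunnel.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("tunnel reachable:", tunnel_is_up())
```

If this reports False, check that the ssh command from the previous step is still running before starting launch.py.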
Run metrics.py with the result folder as a parameter to compute the performance. The total_metrics.json (overall performance) and type_metrics.json (performance per task type) files will be saved in the result folder.
python metrics.py --result_folder results/xxx-xxx-xxx
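Once metrics.py has finished, the two output files can be read back for further analysis. This is a minimal sketch: the helper load_metrics is hypothetical, and it makes no assumption about the JSON structure inside the files beyond their being valid JSON.

```python
import json
from pathlib import Path


def load_metrics(result_folder: str) -> dict:
    """Load the two JSON files that metrics.py saves in the result folder.

    Returns a dict keyed by file name. The contents are returned as parsed,
    since their internal layout is not described here.
    """
    folder = Path(result_folder)
    metrics = {}
    for name in ("total_metrics.json", "type_metrics.json"):
        with open(folder / name, encoding="utf-8") as f:
            metrics[name] = json.load(f)
    return metrics
```

For example, `print(json.dumps(load_metrics("results/xxx-xxx-xxx"), indent=2))` pretty-prints both files, with the placeholder path replaced by your own result folder.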
@article{cheng2025embodiedeval,
title={EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents},
author={Cheng, Zhili and Tu, Yuge and Li, Ran and Dai, Shiqi and Hu, Jinyi and Hu, Shengding and Li, Jiahao and Shi, Yang and Yu, Tianyu and Chen, Weize and others},
journal={arXiv preprint arXiv:2501.11858},
year={2025}
}