diff --git a/index.html b/index.html index 856a64c..0c316da 100644 --- a/index.html +++ b/index.html @@ -2,7 +2,7 @@ - Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making + SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory @@ -113,74 +113,40 @@

- Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making + 🌊 SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

-

Evaluate your model with a single line of code!

+
- Manling Li1, 2†, + Cheng-Yen Yang1, - Shiyu Zhao1,†, + Hsiang-Wei Huang1, - Qineng Wang1, 2†, + Wenhao Chai1, - Kangrui Wang1, 2†, + Zhongyu Jiang1, - Yu Zhou1,†, + Jenq-Neng Hwang1
+ -
- - Sanjana Srivastava1, - - - Cem Gokmen1, - - - Tony Lee1, - - - Li Erran Li3, - - - Ruohan Zhang1, - - - Weiyu Liu1, - -
- -
- - Percy Liang1, - - - Li Fei-Fei1, - - - Jiayuan Mao4, - - - Jiajun Wu1 - -
- 1Stanford University, - 2Northwestern University, + 1University of Washington +
-
+ +
@@ -273,77 +198,13 @@

-
-
-
-
-

Goal Interpretation

- -
-
-

Subgoal Decomposition

- -
-
-

Action Sequencing

- -
-
-

Transition Modeling

- -
-
-
-
-
- - - - -
- -
-
-
-
-
-
+
@@ -353,19 +214,13 @@

Transition Modeling

Abstract

- Problem: We aim to evaluate Large Language Models (LLMs) for embodied decision making. While a significant body of work has been leveraging LLMs for decision making in embodied environments, we still lack a systematic understanding of their performances, because they are usually applied in different domains for different purposes, and built based on different inputs and outputs. Furthermore, existing evaluations tend to rely solely on a final success rate, making it difficult to pinpoint what ability is missing in LLMs and where the problem lies, which in turn, blocks embodied agents from leveraging LLMs effectively and selectively. -

-

- Method: To address these limitations, we propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks and input-output specifications of LLM-based modules. Specifically, it allows us to unify 1) a broad set of embodied decision making tasks involving both state and temporally extended goals, 2) four commonly-used LLM-based modules for decision making: goal interpretation, subgoal decomposition, action sequencing, and transition modeling, and 3) a collection of fine-grained metrics which break down evaluation into various types of errors, such as hallucination errors, affordance errors, various types of planning errors, etc. -

-

- Conclusion: Overall, our benchmark offers a comprehensive and systematic assessment of LLMs' performance for different subtasks, pinpointing the strengths and weaknesses in LLM-powered embodied AI systems, and providing insights for effective and selective use of LLMs in embodied decision making. + The Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks but faces challenges in visual object tracking, particularly when managing crowded scenes with fast-moving or self-occluding objects. Furthermore, the fixed-window memory approach in the original model does not consider the quality of memories selected to condition the image features for the next frame, leading to error propagation in videos. This paper introduces SAMURAI, an enhanced adaptation of SAM 2 specifically designed for visual object tracking. By incorporating temporal motion cues with the proposed motion-aware memory selection mechanism, SAMURAI effectively predicts object motion and refines mask selection, achieving robust, accurate tracking without the need for retraining or fine-tuning. SAMURAI operates in real-time and demonstrates strong zero-shot performance across diverse benchmark datasets, showcasing its ability to generalize without fine-tuning. In evaluations, SAMURAI achieves significant improvements in success rate and precision over existing trackers, with a 7.1% AUC gain on LaSOT-ext and a 3.5% AO gain on GOT-10k. Moreover, it achieves competitive results compared to fully supervised methods on LaSOT, underscoring its robustness in complex tracking scenarios and its potential for real-world applications in dynamic environments.

SAMURAI visual object tracker overview.
- Figure 1: Embodied Agent Interface unifies a broad set of tasks involving both state and temporally extended goals and four LLM-based modules for decision making. + Figure 1: The overview of our SAMURAI visual object tracker.
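The motion-aware idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: a constant-velocity step stands in for SAMURAI's Kalman-filter prediction, and the `alpha` blend, `capacity`, and `min_score` values are assumed hyperparameters for the sketch.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def select_candidate(candidates, prev_box, velocity, alpha=0.5):
    """Pick the (box, confidence) candidate that best agrees with the
    motion-predicted box; a constant-velocity step stands in for the
    Kalman-filter prediction used by SAMURAI."""
    predicted = tuple(c + v for c, v in zip(prev_box, velocity))
    score = lambda cand: alpha * iou(cand[0], predicted) + (1 - alpha) * cand[1]
    return max(candidates, key=score)

def select_memory(history, capacity=7, min_score=0.5):
    """Motion-aware memory selection sketch: keep only high-quality past
    frames (by a combined quality score) instead of a fixed recency window."""
    good = [frame for frame in history if frame["score"] >= min_score]
    return good[-capacity:]
```

A fast-moving object's correct candidate can score lower on appearance confidence alone but wins once agreement with the predicted motion is blended in.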
@@ -373,3963 +228,41 @@

Abstract

- -
-
-
-

Embodied Agent Interface

-
-
-

- In our Embodied Agent Interface, we propose a set of ability modules to evaluate LLMs for embodied decision making. The four ability modules are: Goal Interpretation, Subgoal Decomposition, Action Sequencing, and Transition Modeling. We provide a detailed description of each module below. -

-

Ability Module 1: Goal Interpretation

- -

- Goal Interpretation aims to ground the natural language instruction to the environment representations of objects, states, relations, and actions. For example, the task instruction "Use the rag to clean the trays, the bowl, and the refrigerator. When you are done, leave the rag next to the sink..." can be grounded to specific objects with IDs, such as fridge (ID: 97), tray (ID: 1), bowl (ID: 1), rag (ID: 0), and sink (ID: 82). Note that a simple natural language description can be grounded into a set of multiple goal conditions (object state and relation). -
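The grounding step above can be made concrete with a small sketch. The predicate names and object IDs follow the rag/fridge example; the tuple-set representation itself is an illustrative assumption, not the benchmark's actual format.

```python
# Grounded goal conditions as (predicate, object, ...) tuples, following the
# rag/fridge example; this representation is illustrative.
goal_conditions = {
    ("next_to", "rag.0", "sink.82"),
    ("not_stained", "fridge.97"),
    ("not_stained", "tray.1"),
    ("not_stained", "bowl.1"),
}

def goals_satisfied(state, goals):
    """A grounded goal is met when every condition appears in the state."""
    return goals <= state
```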

- -

Ability Module 2: Subgoal Decomposition

- -

- Subgoal Decomposition generates a sequence of states, where each state can be a set of objects and their states. Here, we highlight the key intermediate states in the sequence: next_to(rag.0, sink.82), toggled_on(sink.82), soaked(rag.0), toggled_off(sink.82), open(fridge.97), not_stained(fridge.97). To achieve these state transitions, we can use a high-level planner such as BFS to search for the Action Sequences that achieve them. We obtain the following action sequence: RIGHT_GRASP(rag.0), RIGHT_PLACE_NEXTTO(sink.82), TOGGLE_ON(sink.82), SOAK(rag.0), TOGGLE_OFF(sink.82), OPEN(fridge.97), CLEAN(fridge.97). Note that multiple actions may be required to achieve a single state transition. For example, to perform the state transition next_to(rag.0, sink.82) → toggled_on(sink.82), we need two actions: RIGHT_GRASP(rag.0), RIGHT_PLACE_NEXTTO(sink.82). See Figure 2 for the input and output formulation. -
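The BFS search mentioned above can be sketched as follows. The tiny three-action domain in the test is a made-up stand-in for the simulator's action model, and effects are treated as purely additive for simplicity.

```python
from collections import deque

def bfs_plan(start, goal, actions):
    """Breadth-first search over symbolic states. `actions` maps an action
    name to (preconditions, effects), both sets of state facts; effects are
    additive in this simplified sketch."""
    start, goal = frozenset(start), frozenset(goal)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, plan = queue.popleft()
        if goal <= state:
            return plan
        for name, (pre, eff) in actions.items():
            if pre <= state:
                nxt = state | frozenset(eff)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, plan + [name]))
    return None  # no plan reaches the goal
```

Because BFS expands states level by level, the first plan found is among the shortest, which matches its use here as a high-level planner over subgoal states.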

+
-
- Embodied agent interface taxonomy example. -
- Figure 2: The input and output formulation of four ability modules for Embodied Agent Interface. +
+
+ +
+
+

Results

+
+ SAMURAI zero-shot tracking results. -
+ Table 1: Zero-shot tracking results on LaSOT, LaSOT-ext, and GOT-10k.
- -

Ability Module 3: Action Sequencing

- -

- Action Sequences are essential to achieve the state transitions identified in Subgoal Decomposition. For example, a successful execution of the action sequence RIGHT_GRASP(rag.0), RIGHT_PLACE_NEXTTO(sink.82), TOGGLE_ON(sink.82), SOAK(rag.0), TOGGLE_OFF(sink.82), OPEN(fridge.97), CLEAN(fridge.97) is shown in Figure 3. -
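Executing such a sequence amounts to checking each action's preconditions against the current symbolic state and applying its effects. A minimal executor sketch, with a made-up two-action model fragment in the test (the add/delete effect split is an assumption for illustration):

```python
def execute(sequence, state, model):
    """Run an action sequence against a symbolic state. `model` maps action
    name -> (preconditions, add_effects, del_effects). Returns the final
    state, or raises on an infeasible step (a runtime error in the
    benchmark's terms)."""
    state = set(state)
    for action in sequence:
        pre, add, delete = model[action]
        if not pre <= state:
            raise RuntimeError(f"precondition failed at {action}")
        state = (state - delete) | add
    return state
```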

- -

Ability Module 4: Transition Modeling

- -

- Transition Modeling serves as the low-level controller that guides the simulator in performing state transitions from preconditions to post-effects. For example, in the cleaning task, the input is the operator name soak, and the preconditions are three states: holding (?obj1), next_to (?sink ?agent), and toggled_on (?sink). The post-effect of executing SOAK is soaked (?obj1). -
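The soak operator above can be written as a lifted precondition/effect pair and grounded with a variable binding. The dictionary-and-tuple encoding is an assumption for illustration, not the benchmark's PDDL format.

```python
# Lifted SOAK operator from the example; variables start with '?'.
SOAK = {
    "preconditions": [("holding", "?obj1"),
                      ("next_to", "?sink", "?agent"),
                      ("toggled_on", "?sink")],
    "effects": [("soaked", "?obj1")],
}

def ground_literal(literal, binding):
    """Substitute bound objects for variables in one literal."""
    name, *args = literal
    return (name, *(binding.get(a, a) for a in args))

def apply_operator(state, operator, binding):
    """Check the grounded preconditions, then add the grounded post-effects."""
    pre = {ground_literal(p, binding) for p in operator["preconditions"]}
    if not pre <= state:
        raise ValueError("preconditions not satisfied")
    return state | {ground_literal(e, binding) for e in operator["effects"]}
```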

- - - -
- Example of successful execution in Embodied Agent Interface. -
- Figure 3: An example of successful execution in Embodied Agent Interface. +
+ SAMURAI zero-shot tracking results on additional benchmarks. -
+ Table 2: Zero-shot tracking results on additional benchmarks: TrackingNet, NFS, and OTB-100. +
+
+
+ Comparison between SAMURAI and the baseline SAM 2. -
+ Table 3: Comparison between proposed SAMURAI and the baseline SAM 2 on LaSOT and LaSOT-ext.
-
- -

Dataset Viewer

- -
-
-

Leaderboard

- -
-
-
-
-
-
-
-
-
- -

Empirical Findings

-
-
    -
  1. Goal Interpretation:
    • LLMs struggle to translate natural language instructions into grounded states.
    • Common errors include generating intermediate goals and omitting spatial relationship goals.
    • Gemini 1.5 Pro has the highest goal interpretation performance, while Claude-3 Opus excels in goal retrieval rate.
    • Proprietary LLMs make fewer grammar errors compared to open-source LLMs.
    - Table: All goal evaluation results (%) for goal interpretation — per-model precision, recall, and F1 for state, spatial, action, and overall goals, under VirtualHome (V) and BEHAVIOR (B).
  2. Action Sequencing:
    • Reasoning ability is crucial for LLMs; trajectory feasibility errors are common (41.2%).
    • o1-preview has the highest task (81.0%) and execution success rates (91.0%) in BEHAVIOR. Mistral Large (73.4%) and Gemini 1.5 Pro (73.1%) outperform it in VirtualHome.
    • SOTA LLMs make fewer grammar errors. For example, Claude-3 Opus makes no errors, while GPT-3.5-turbo has a 4.0% error rate in BEHAVIOR.
    • Common runtime errors include missing steps and wrong order. In BEHAVIOR, GPT-4o encounters 36.0% missing step errors and 9.0% wrong order errors.
    • LLMs perform better with state goals than relation goals, but struggle with complex action goals. GPT-4o achieves 82.0% success in state goals and 67.8% in relation goals in VirtualHome.
    • Task complexity, such as the number of goals and action sequence length, lowers success rates. In BEHAVIOR, tasks with more than 10 goals have a success rate below 40%.
    - Table: Trajectory evaluation results (%) for action sequencing — per-model goal success rate, execution success rate, grammar errors (parsing, hallucination, action-argument number), and runtime errors (wrong order, missing step, affordance, additional step), under VirtualHome (V) and BEHAVIOR (B).
    - Table: All goal success results (%) for action sequencing and subgoal decomposition — per-model success on state, relation, action, and total goals, under VirtualHome (V) and BEHAVIOR (B).
  3. Subgoal Decomposition:
    • Subgoal decomposition is not easier than action sequencing in abstract action spaces.
    • o1-preview shows superior performance in VirtualHome (89.4%) and BEHAVIOR (57.0%). Gemini 1.5 Flash also performs well in VirtualHome (89.1%).
    • SOTA models avoid grammar errors but can hallucinate actions (e.g., GPT-4o adds "POUR" in VirtualHome).
    • Common runtime errors: extra steps in VirtualHome, missing steps in BEHAVIOR.
    • LLMs like o1-preview are more accurate in action goals in VirtualHome; state and relation goals in BEHAVIOR are more difficult due to stricter precondition checks.
    • Performance is lower in BEHAVIOR due to complex task representations with quantifiers like "forall" and "forpairs."
    - Table: All trajectory evaluation results (%) for subgoal decomposition — per-model goal success rate, execution success rate, grammar errors, and runtime errors, under VirtualHome (V) and BEHAVIOR (B).
  4. Transition Modeling:
    • Models excel in specific categories like object states and orientation.
    • Non-spatial relations consistently pose a challenge.
    • Planning effectiveness relies on consistency in predicted action space.
    - Table: Full results of logic form accuracy for transition modeling in VirtualHome — per-model precision, recall, and F1 over object states, object orientation, object affordance, spatial relations, and non-spatial relations.
    - Table: Full results of logic form accuracy for transition modeling in BEHAVIOR — per-model precision, recall, and F1 over object states, spatial relations, and non-spatial relations.
    - Table: Full results of planner success rate for transition modeling (%) — per-model success over object states, object orientation, object affordance, spatial relations, and non-spatial relations, under VirtualHome (V) and BEHAVIOR (B).
  5. Sensitivity Analysis:
    • Actions like "plug_in" and "walk_towards" show low success rates.
    • Complex interactions like "slice_carvingknife" and "place_inside" present challenges.
    • Training regimens may not fully capture real-world interaction diversity.
  6. Pipeline-Based vs. Modularized:
    • Similar trajectory executable rates for both methods.
    • Pipeline-based methods suffer from error accumulation.
    • SOTA LLMs avoid grammar errors; less advanced models do not.
    • All LLMs are prone to runtime errors, missing necessary steps.
    - -
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    - Table: Pipeline-based evaluation results for (1) \(\mathcal{G}+\mathcal{Q}\) and (2) \(\mathcal{G}+\Phi\)$ in BEHAVIOR. \(\mathcal{G}\): Goal Interpretation. \(\mathcal{Q}\): Action Sequencing. \(\Phi\): Subgoal Decomposition. In this table, M means 'modularized', whereas P means 'pipeline-based'. -
    Model NameGoal EvaluationTrajectory Evaluation
    Goal SRExecution SRGrammar Error (↓)Runtime Error (↓)
    ParsingHallucinationAction-Arg NumWrong OrderMissing StepAffordanceAdditional Step
    MPMPMPMPMPMPMPMPMP
    Goal Interpretation + Action Sequencing
    Claude-3 Haiku26.021.032.029.00.00.06.06.00.00.07.06.054.052.01.07.01.017.0
    Claude-3 Sonnet44.041.057.053.00.00.01.03.00.00.011.014.019.021.011.09.02.012.0
    Claude-3 Opus51.046.059.054.00.01.00.01.00.00.03.06.035.035.03.03.02.04.0
    Gemini 1.0 Pro27.026.032.035.07.05.03.03.06.06.013.014.035.038.04.02.04.011.0
    Gemini 1.5 Flash40.035.052.049.00.00.00.02.00.00.05.010.042.041.01.00.02.07.0
    Gemini 1.5 Pro42.037.054.055.00.01.00.01.00.00.06.07.039.035.01.01.02.00.0
    GPT-3.5-turbo16.014.020.032.04.01.07.03.023.015.01.05.036.039.08.06.01.03.0
    GPT-4-turbo38.032.045.047.00.01.00.01.00.00.07.09.047.041.01.01.00.00.0
    GPT-4o47.042.053.055.00.00.01.03.00.00.09.06.036.035.01.01.00.04.0
    Cohere Command R16.05.019.09.05.03.013.038.00.01.08.08.043.031.012.012.04.08.0
    Cohere Command R+27.015.035.029.00.00.01.08.015.014.010.030.039.031.00.02.015.022.0
    Mistral Large33.031.050.038.00.00.00.03.00.00.08.014.035.037.06.08.07.05.0
    Mixtral 8x22B MoE30.026.040.036.03.03.06.013.00.00.010.014.032.021.09.013.02.015.0
    Llama3 8B10.00.016.05.00.02.015.025.09.06.06.011.044.034.09.017.05.014.0
    Llama3 70B34.026.042.040.00.01.02.03.00.00.015.018.038.035.03.05.06.09.0
    Goal Interpretation + Subgoal Decomposition
    Claude-3 Haiku      29.0 21.0 35.0 40.0  0.0  0.0  1.0  5.0  0.0  0.0  2.0  2.0 59.0 46.0  3.0  7.0  3.0 16.0
    Claude-3 Sonnet     38.0 31.0 43.0 45.0  0.0  0.0  2.0  3.0  0.0  0.0  3.0  2.0 51.0 47.0  1.0  3.0  3.0 18.0
    Claude-3 Opus       39.0 35.0 47.0 45.0  0.0  0.0  3.0  8.0  0.0  0.0  5.0  4.0 45.0 42.0  0.0  1.0  5.0  7.0
    Gemini 1.0 Pro      23.0 14.0 33.0 30.0  2.0  0.0  4.0 10.0  0.0  1.0  3.0  1.0 51.0 45.0  7.0 13.0  3.0 17.0
    Gemini 1.5 Flash    34.0 32.0 42.0 44.0  2.0  1.0  1.0  3.0  0.0  0.0  2.0  2.0 53.0 48.0  0.0  2.0  3.0  7.0
    Gemini 1.5 Pro      31.0 26.0 37.0 38.0  0.0  1.0  1.0  3.0  0.0  0.0  3.0  2.0 59.0 56.0  0.0  0.0  2.0  1.0
    GPT-3.5-turbo       24.0 14.0 36.0 27.0  2.0  0.0  3.0 12.0  0.0 22.0  3.0  1.0 52.0 32.0  4.0  6.0  3.0  5.0
    GPT-4-turbo         37.0 37.0 47.0 49.0  0.0  0.0  3.0  4.0  0.0  0.0  9.0  8.0 40.0 37.0  1.0  2.0  6.0  6.0
    GPT-4o              48.0 38.0 55.0 52.0  0.0  0.0  3.0  4.0  0.0  0.0  5.0  6.0 37.0 35.0  0.0  3.0  5.0  9.0
    Cohere Command R    15.0  8.0 25.0 15.0 21.0 13.0 11.0 32.0  0.0  1.0  0.0  1.0 38.0 32.0  4.0  6.0  4.0 12.0
    Cohere Command R+   24.0 17.0 37.0 31.0  2.0  6.0  4.0 10.0  0.0  2.0  5.0  7.0 51.0 40.0  1.0  4.0  6.0 14.0
    Mistral Large       30.0 22.0 38.0 29.0  1.0  1.0  3.0 12.0  0.0  1.0  4.0  5.0 52.0 50.0  2.0  2.0  1.0  5.0
    Mixtral 8x22B MoE   27.0 22.0 33.0 29.0  0.0  0.0  4.0  9.0  0.0  2.0  2.0  2.0 59.0 45.0  2.0 13.0  0.0 17.0
    Llama3 8B           21.0  3.0 29.0 14.0  2.0  7.0 11.0 29.0  0.0  2.0  6.0  3.0 44.0 30.0  8.0 15.0  7.0  7.0
    Llama3 70B          20.0 19.0 30.0 31.0  1.0  1.0  5.0 22.0  1.0  1.0  8.0  7.0 51.0 35.0  4.0  3.0  4.0  7.0
  13. Replanning and Feedback:
    • Replanning based on feedback significantly improves performance.
    • Replanning can result in over-generation of actions.
    Table: Replanning evaluation results (%) for action sequencing. Goal Evaluation covers Goal SR and Execution SR; Trajectory Evaluation covers Grammar Error (↓): Parsing, Hallucination, Action-Arg Num, and Runtime Error (↓): Wrong Order, Missing Step, Affordance, Additional Step.

    Model Name           | Goal SR | Execution SR | Parsing | Hallucination | Action-Arg Num | Wrong Order | Missing Step | Affordance | Additional Step
    GPT-4o               | 65.2    | 71.8         | 0.0     | 1.3           | 0.7            | 0.0         | 25.3         | 1.0        | 0.3
    GPT-4o w/ replanning | 77.4    | 83.3         | 0.0     | 1.3           | 0.0            | 0.0         | 14.1         | 0.3        | 0.7

Evaluation Setup

- -
- -
-

- We evaluate the performance of LLMs for embodied decision making using the Embodied Agent Interface. Below is a detailed description of the evaluation setup. -


Dataset Description


Focusing on complex long-horizon tasks, we select VirtualHome (V) and BEHAVIOR (B) as our evaluation simulators based on their task length and scene complexity. Table 1 shows our annotations. Apart from the goal and trajectory annotations, we introduce the Goal Action annotation to capture necessary actions that have no post effects, such as the goal action touch in the task “pet the cat”. In the subset of VirtualHome tasks we work on, \(80.7\%\) of task categories include instructions with more than \(10\) action steps, and \(33\%\) of all instructions are more than \(10\) steps long.


We select BEHAVIOR as another simulator for our evaluation due to its task complexity. BEHAVIOR BDDL goals may contain quantifiers, such as (forpairs (?jar ?apple) (inside ?apple ?jar)), which need to be translated into grounded goals containing only atomic propositions, e.g., and ((inside apple_1 jar_1) (inside apple_2 jar_2)). Different grounded goals can satisfy the same BDDL goal, such as ((inside apple_2 jar_1) (inside apple_1 jar_2)); we call these goal options. In general, one BDDL goal corresponds to many goal options. The average number of grounded goals for each task is \(6.7\), and there are \(4,164.4\) goal options for each task on average.
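To make the quantifier-to-goal-option translation concrete, the sketch below enumerates groundings of a forpairs quantifier. This is an illustrative reconstruction, not the benchmark's translation code; the helper name and the exact pairing semantics of forpairs are assumptions.

```python
from itertools import permutations

def ground_forpairs(jars, apples):
    """Enumerate goal options for (forpairs (?jar ?apple) (inside ?apple ?jar)).

    Each goal option is a conjunction of atomic propositions pairing every
    jar with a distinct apple (assumed semantics, for illustration only).
    """
    options = []
    for perm in permutations(apples, len(jars)):
        options.append([("inside", apple, jar) for jar, apple in zip(jars, perm)])
    return options

# With two jars and two apples there are exactly two goal options:
opts = ground_forpairs(["jar_1", "jar_2"], ["apple_1", "apple_2"])
# [("inside", "apple_1", "jar_1"), ("inside", "apple_2", "jar_2")]
# [("inside", "apple_2", "jar_1"), ("inside", "apple_1", "jar_2")]
```

The combinatorics of such pairings is what drives the large average number of goal options per task reported above.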

Table 1: Simulator dataset statistics. New annotations collected in this paper are highlighted in color.

                         VirtualHome    BEHAVIOR
#task name                        26         100
#task instruction                338         100
#goal                            801         673
   - #state                      340         153
   - #relation                   299         520
   - #action                     162           -
#trajectory                      338         100
   - #step                      2960        1460
   - avg. step                  8.76        14.6
#transition model                 33          30
   - #precondition                99          84
   - #effect                      57          51
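The counts in Table 1 compose consistently: the per-type goal counts sum to the goal totals, and the average step lengths follow from the step and trajectory totals. A quick arithmetic check, with values copied from the table:

```python
# Goal counts in Table 1 decompose into state, relation, and action goals.
assert 340 + 299 + 162 == 801   # VirtualHome #goal
assert 153 + 520 == 673          # BEHAVIOR #goal (no action goals annotated)

# avg. step = #step / #trajectory
assert round(2960 / 338, 2) == 8.76   # VirtualHome
assert 1460 / 100 == 14.6             # BEHAVIOR
```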

Each instance in the dataset represents a task goal. Specifically, each task contains the following data:

  • Natural language task name
  • Natural language task instruction
  • Symbolic goal definition (including its LTL form)
  • Symbolic action trajectory
  • The transition models involved in the task

For tasks in the BEHAVIOR environment, the dataset also includes accompanying VR human demonstration videos that showcase the execution of the ground truth action trajectories.

Figure 4: VirtualHome dataset structure example.

Figure 5: BEHAVIOR dataset structure example.

Please find our JSON data format at the following link: Dataset JSON Format
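As an illustration of the fields listed above, a single instance might be shaped as in the sketch below. All field names and values here are hypothetical placeholders; the linked Dataset JSON Format is the authoritative schema.

```python
# Hypothetical instance layout; field names are illustrative assumptions,
# not the actual schema (see the linked Dataset JSON Format).
task_instance = {
    "task_name": "pet the cat",
    "task_instruction": "Walk over to the cat and gently pet it.",
    "goal": {
        "state": [],              # state goals
        "relation": [],           # relational goals
        "action": ["touch"],      # goal actions without post effects
        "ltl": "F touch(cat)",   # the goal's LTL form
    },
    "action_trajectory": ["WALK <cat>", "TOUCH <cat>"],
    "transition_models": {
        "TOUCH": {"preconditions": ["next_to(agent, obj)"], "effects": []},
    },
}
```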


LLM Implementations


We integrated our evaluation pipeline into the HELM code base for easy and reproducible LLM inference. Users can set up their environment following the instructions here. We standardized decoding parameters across all models, using temperature zero for \(\operatorname*{arg\,max}\) sampling. Evaluating all models on our benchmark required \(180\) runs. Detailed model information is provided in the table below.
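Temperature-zero decoding makes generation deterministic: at each step the single most likely token is selected. The sketch below illustrates this; the configuration keys and helper are illustrative assumptions, not HELM's actual configuration schema.

```python
# Hypothetical decoding configuration shared by all evaluated models;
# key names are illustrative assumptions, not HELM's config schema.
DECODING_PARAMS = {
    "temperature": 0.0,     # arg max sampling: deterministic decoding
    "num_completions": 1,   # a single completion per prompt
}

def decode_deterministically(logits):
    """Temperature zero reduces sampling to picking the arg max token index."""
    return max(range(len(logits)), key=lambda i: logits[i])

# e.g. decode_deterministically([0.1, 2.3, -1.0]) -> 1
```

Deterministic decoding keeps the benchmark reproducible: rerunning a model on the same prompt yields the same output.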

Table 2: Model Cards for All Evaluated Large Language Models

Model Name           Creator     Complete Model ID               Release    Hosting
Claude-3 Haiku       Anthropic   claude-3-haiku-20240307         03/07/24   Anthropic
Claude-3 Sonnet      Anthropic   claude-3-sonnet-20240229        02/29/24   Anthropic
Claude-3 Opus        Anthropic   claude-3-opus-20240229          02/29/24   Anthropic
Claude-3.5 Sonnet    Anthropic   claude-3-5-sonnet-20240620      06/20/24   Anthropic
Cohere Command R     Cohere      command-r                       03/11/24   Cohere
Cohere Command R+    Cohere      command-r-plus                  04/04/24   Cohere
Gemini 1.0 Pro       Google      gemini-pro                      12/13/23   GCP Vertex
Gemini 1.5 Flash     Google      gemini-1.5-flash-preview-0514   05/14/24   GCP Vertex
Gemini 1.5 Pro       Google      gemini-1.5-pro-preview-0409     04/09/24   GCP Vertex
GPT-3.5-turbo        OpenAI      gpt-3.5-turbo-0125              01/25/24   OpenAI
GPT-4-turbo          OpenAI      gpt-4-turbo-2024-04-09          04/09/24   OpenAI
GPT-4o               OpenAI      gpt-4o-2024-05-13               05/13/24   OpenAI
Llama3 8B Instruct   Meta        meta-llama-3-8b-instruct        04/18/24   TogetherAI
Llama3 70B Instruct  Meta        meta-llama-3-70b-instruct       04/18/24   TogetherAI
Mistral Large        MistralAI   mistral-large-2402              02/26/24   MistralAI
Mixtral 8x22B MoE    MistralAI   mixtral-8x22b-instruct-v0.1     04/17/24   TogetherAI
o1-mini              OpenAI      o1-mini-2024-09-12              09/12/24   OpenAI
o1-preview           OpenAI      o1-preview-2024-09-12           09/12/24   OpenAI