Conference on Robot Learning 2024
1,2,3Mengda Xu, 1,2Zhenjia Xu, 1Yinghao Xu, 1,2Cheng Chi, 1Gordon Wetzstein, 3,4Manuela Veloso, 1,2Shuran Song
1Stanford University, 2Columbia University, 3JP Morgan AI Research, 4CMU
This repository contains code for training and evaluating Im2Flow2Act in both simulation and real-world settings.
Follow these steps to install Im2Flow2Act:
- Create and activate the conda environment:
cd im2flow2act
conda env create -f environment.yml
conda activate im2flow2act
- Set the DEV_PATH and export python path in your bashrc or zshrc.
export DEV_PATH="/parent/directory/of/im2flow2act"
export PYTHONPATH="$PYTHONPATH:$DEV_PATH/im2flow2act"
- Download the pretrained weights for StableDiffusion 1.5, either from the official repo or from our website, and put them under im2flow2act/pretrain_weights.
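As a quick sanity check of the installation steps above, something like the following could be run before training. It is only a sketch: the function name check_setup is hypothetical and not part of the repository, and it assumes the repository lives at $DEV_PATH/im2flow2act, as the export lines suggest.

```python
import os
from pathlib import Path


def check_setup(env=None):
    """Return a list of human-readable problems with the local setup.

    Hypothetical helper, not part of the repository. Assumes the repo
    is located at $DEV_PATH/im2flow2act.
    """
    env = os.environ if env is None else env

    dev_path = env.get("DEV_PATH")
    if not dev_path:
        return ["DEV_PATH is not set"]

    problems = []
    repo = Path(dev_path) / "im2flow2act"
    if not repo.is_dir():
        problems.append(f"repository not found at {repo}")

    # PYTHONPATH entries are separated by os.pathsep (":" on Linux/macOS).
    # Normalizing through Path removes duplicate or trailing slashes.
    raw_entries = env.get("PYTHONPATH", "").split(os.pathsep)
    entries = [str(Path(entry)) for entry in raw_entries if entry]
    if str(repo) not in entries:
        problems.append(f"{repo} is not on PYTHONPATH")

    if not (repo / "pretrain_weights").is_dir():
        problems.append(f"missing {repo / 'pretrain_weights'}")

    return problems
```

Calling check_setup() with no arguments inspects the current shell environment; an empty list means none of the checked items are missing.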
Download the dataset with:
mkdir data
cd data
wget https://real.stanford.edu/im2flow2act/data/simulation_evaluation.zip # evaluation dataset
wget https://real.stanford.edu/im2flow2act/data/simulated_play/articulated.zip # policy articulated object training data
wget https://real.stanford.edu/im2flow2act/data/simulated_play/deformable.zip # policy deformable object training data
wget https://real.stanford.edu/im2flow2act/data/simulated_play/rigid.zip # policy rigid object training data
wget https://real.stanford.edu/im2flow2act/data/simulation_sphere_demonstration.zip # simulated sphere demonstration
wget https://real.stanford.edu/im2flow2act/data/realworld_human_demonstration.zip # real-world human demonstration
The dataset contains several components. The simulated_play dataset contains the play data for rigid, articulated, and deformable objects. The simulation_sphere_demonstration dataset contains the sphere agent’s demonstrations on specific tasks, i.e., pick&place, pouring, and drawer opening. The realworld_human_demonstration dataset contains human demonstrations of the same tasks in the real world. The simulation_evaluation dataset is used to evaluate both the manipulation policy and the flow generation model. The downloaded dataset already contains the bounding boxes, SAM masks, and tracked flows. You can find more information at Dataset Details.
.
├── realworld_human_demonstration
├── simulated_play
├── simulation_evaluation
└── simulation_sphere_demonstration
To reproduce the simulation results reported in the paper, you may download the checkpoints for both the flow generation model and the manipulation policy.
wget https://real.stanford.edu/im2flow2act/checkpoints.zip # include checkpoints for both policy and flow generation
Once downloaded, please refer to Evaluation for running the model.
The folder structure should be as follows once you complete the above steps:
.
├── checkpoints
├── config
├── data
├── data_local
├── environment.yml
├── im2flow2act
├── LICENSE
├── pretrain_weights
├── README.md
├── scripts
└── tapnet
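The listing above can be checked programmatically. The sketch below is illustrative only: missing_entries is a hypothetical helper, and the entry names are copied from the listing rather than from any file in the repository.

```python
from pathlib import Path

# Top-level entries copied from the folder listing above.
EXPECTED_DIRS = [
    "checkpoints", "config", "data", "data_local",
    "im2flow2act", "pretrain_weights", "scripts", "tapnet",
]
EXPECTED_FILES = ["environment.yml", "LICENSE", "README.md"]


def missing_entries(repo_root):
    """Return the sorted names of expected entries absent from repo_root."""
    root = Path(repo_root)
    missing = [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return sorted(missing)
```

For example, missing_entries(".") run from the repository root returns an empty list when every listed entry is present.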
You can visualize the flows by navigating to scripts/data_script and running:
python viz_all.py
You might need to change the dataset path in viz_pathes. To visualize the simulated play dataset, use --viz_sam. You can change the minimum distance a keypoint must travel in image space by specifying viz_thresholds; only keypoints that move more than the threshold are visualized. To visualize the real-world task demonstrations, use --viz_bbox and set viz_thresholds to 0. Please check here for details.
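To make the thresholding concrete, here is a minimal sketch of one way such a filter could work. It is not the code in viz_all.py: the function name moving_keypoints is hypothetical, and it assumes "distance travelled" means the largest displacement from a keypoint's first position, measured in pixels. The script may define it differently (for example, as path length).

```python
import math


def moving_keypoints(tracks, threshold):
    """Return indices of tracks that move more than `threshold` pixels.

    tracks: list of keypoint tracks, each a list of (x, y) pixel positions
    over time. Movement is measured as the maximum Euclidean displacement
    from the first position (an assumption of this sketch).
    """
    selected = []
    for index, track in enumerate(tracks):
        x0, y0 = track[0]
        max_displacement = max(math.hypot(x - x0, y - y0) for x, y in track)
        if max_displacement > threshold:
            selected.append(index)
    return selected


tracks = [
    [(0, 0), (3, 4)],  # moves 5 pixels
    [(1, 1), (1, 2)],  # moves 1 pixel
]
print(moving_keypoints(tracks, 2.0))  # [0]
```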
The flow generation model and the flow-conditioned policy are trained independently, so you can train and evaluate each component separately. However, to evaluate the complete im2flow2act system, please refer to Evaluation. We use Accelerate for multi-GPU training and set mixed_precision='fp16'.
The scripts for training flow generation are located at scripts/flow_generation. You can either use Simulated task demonstration or Real-world task demonstration to train the model. However, to evaluate the complete system in simulation, you need to train with the simulated task demonstration dataset.
Finetune the decoder from StableDiffusion:
accelerate launch finetune_decoder.py
Train the flow generation model based on Animatediff:
accelerate launch train_flow_generation.py
The model is evaluated every 200 epochs and the results are logged to Weights & Biases. Additionally, we log the generated flow and ground-truth flow under experiment/flow_generation/yyyy-mm-dd-ss/evaluations/epoch_x:
dataset_0
├── generated_flow_0.gif
├── gt_flow_0.gif
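To compare generated and ground-truth flows side by side, the GIFs in one of these folders can be paired by index. The sketch below relies only on the file naming shown in the listing above; paired_flow_gifs is a hypothetical helper, not part of the repository.

```python
import re
from pathlib import Path


def paired_flow_gifs(dataset_dir):
    """Return (index, generated_gif, gt_gif) tuples, sorted by index.

    Only indices for which both generated_flow_<i>.gif and
    gt_flow_<i>.gif exist in dataset_dir are returned.
    """
    dataset_dir = Path(dataset_dir)
    pattern = re.compile(r"generated_flow_(\d+)\.gif")
    pairs = []
    for generated in dataset_dir.glob("generated_flow_*.gif"):
        match = pattern.fullmatch(generated.name)
        if match is None:
            continue
        index = int(match.group(1))
        ground_truth = dataset_dir / f"gt_flow_{index}.gif"
        if ground_truth.is_file():
            pairs.append((index, generated, ground_truth))
    return sorted(pairs, key=lambda pair: pair[0])
```

Each returned tuple can then be passed to an image viewer or a plotting routine of your choice.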
The scripts for training the flow-conditioned policy are located at scripts/control.
To train the policy:
accelerate launch train_flow_conditioned_diffusion_policy.py
During training, the policy is evaluated every 100 epochs with the ground-truth flow. You can change the frequency by modifying training.ckpt_frequency in the config file. You will need a GPU with at least 24GB of memory to run online point tracking and policy inference at the same time. The evaluation results are saved in the policy folder:
.
├── episode_0
│ ├── action
│ ├── camera_0
│ ├── camera_0.mp4
│ ├── info
│ ├── proprioception
│ ├── qpos
│ └── qvel
├── episode_0_debug_pts.ply
├── episode_0_online_point_tracking_sequence.npy
├── episode_0_online_tracking.gif
├── episode_0_vis_pts.ply
- episode_x_vis_pts.ply: Contains the mesh for the initial scene. You can visualize it with software such as MeshLab.
- episode_x_vis_pts.ply: Contains the mesh for the selected object keypoints.
- episode_x_online_tracking: Online tracking of the selected object keypoints at inference time.
- episode_x_online_point_tracking_sequence.npy: The numeric values of the online tracking at inference time.
You can directly evaluate the trained policy with:
python evalute_flow_diffusion_policy.py
The quantitative results are stored in success_count.json. Note that for cloth folding, you need to manually inspect the results.
To evaluate the complete system of im2flow2act, we begin by generating task flow from an initial image and a task description. You need the object bounding box to start, which we have already provided in the downloaded dataset. You can generate it yourself by going to scripts/data and running
python get_bbox.py
You might need to change the prompt and buffer path for Grounding DINO in config/data/get_bbox.yaml. For a drawer, you can use “red drawer”. For PickNplace and Pouring tasks, you can use “red mug”.
Once that is done, replace the model_path and model_ckpt at config/flow_generation/inference.yaml with the trained flow generation model path. Change the realworld_dataset_path to one of the tasks provided in the generation_model_eval_dataset, e.g., pickNplace. Change the directory to scripts/flow_generation and run
python inference.py
After finishing, the generated flow will be stored under the evaluation dataset folder. The numeric results are stored under the generated_flows folder for each episode. You can also find the gif for generated flows inside the dataset folder.
With the generated flow stored, you can evaluate the policy with the generated flow by navigating to scripts/control and running
python evaluate_from_flow_generation.py
You might need to modify config/diffusion_policy/evaluate_from_flow_generation.yaml by replacing model_path and ckpt with those of your trained flow-conditioned policy. You also need to specify the evaluation dataset folder. Make sure you have already generated the flow for the dataset you pass in. You can find the evaluation results under the experiment folder. The generated_flow.gif contains the processed generated flow animation.
.
├── episode_0
│ ├── action
│ ├── camera_0
│ ├── camera_0.mp4
│ ├── info
│ ├── proprioception
│ ├── qpos
│ └── qvel
├── episode_0_debug_pts.ply
├── episode_0_generated_flow.gif
├── episode_0_online_tracking.gif
├── episode_0_vis_pts.ply
All datasets are stored in the zarr format. The downloaded dataset already contains the processed flows. If you would like to process your own dataset, please refer to real-world data and simulation data for details. An episode from the simulated data contains the following items:
.
├── action
├── camera_0
├── camera_0.mp4
├── info
├── moving_mask
├── point_tracking_sequence
├── proprioception
├── qpos
├── qvel
├── rgb_arr
├── robot_mask
├── sam_mask
├── sam_moving_mask
├── sample_indices
├── sam_point_tracking_sequence
└── task_description
- point_tracking_sequence: contains the flows obtained by tracking keypoints sampled on a uniform grid using TAPIR.
- sam_point_tracking_sequence: contains the object-centric flows iteratively generated by applying Segment Anything and point tracking.
- moving_mask: the binary mask over point_tracking_sequence, indicating whether each keypoint has moved a certain distance on the image.
- robot_mask: the binary mask over sam_point_tracking_sequence, indicating whether a keypoint is located on the robot.
- sam_moving_mask: similar to moving_mask, but over sam_point_tracking_sequence.
- sam_mask: the segmentation mask obtained by running Segment Anything on the initial scene.
- qpos and qvel: the vectors used to restore the initial state for simulated data. You can also use them to re-render the data if you change the camera view.
- rgb_arr: contains the visual observations from camera_0, downsampled by a factor of 2. It is passed to the point tracking algorithm. The main reason for the downsampling is to let a single 24GB GPU run the point tracking algorithm on long-horizon play data.
- task_description: a text description of the episode.
Note that during training, we downsample the robot action and proprioception by a factor of 2 to align them with the tracked flows. You can train on the original dataset by setting dataset.downsample_rate=1 in config/train_flow_conditioned_diffusion_policy. In this case, you also need to re-generate the flows for the manipulation policy training data yourself using scripts/data/data_pipeline with downsample_ratio=1. You might need to use scripts/clean to remove the existing rgb_arr and corresponding flows first.
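The alignment described above can be illustrated with plain slicing. This is a sketch under assumptions, not the repository's data loader: the helper names are hypothetical, and it assumes that downsampling keeps every second step starting from the first one.

```python
def downsample(sequence, rate):
    """Keep every `rate`-th element, starting from the first."""
    if rate < 1:
        raise ValueError("rate must be a positive integer")
    return sequence[::rate]


def align_to_flows(actions, proprioception, num_flow_steps, rate=2):
    """Downsample both streams and trim them to the number of flow steps."""
    actions_ds = downsample(actions, rate)[:num_flow_steps]
    proprio_ds = downsample(proprioception, rate)[:num_flow_steps]
    return actions_ds, proprio_ds


actions = list(range(10))               # 10 control steps
proprio = [f"p{i}" for i in range(10)]  # matching proprioception entries
a, p = align_to_flows(actions, proprio, num_flow_steps=5)
print(a)  # [0, 2, 4, 6, 8]
print(p)  # ['p0', 'p2', 'p4', 'p6', 'p8']
```

With rate=1 both streams pass through unchanged, which corresponds to training on the original, non-downsampled data.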
An episode in the real-world dataset contains similar items, but the point tracking sequence used to train the flow generation model is stored at bbox_point_tracking_sequence.
- bbox: The bounding box of the object of interest, obtained by Grounding DINO.
- bbox_point_tracking_sequence: The flow generated by tracking keypoints inside the bounding box. To re-generate the dataset, you can use scripts/data/flow_generation_data_pipeline.
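As an illustration of what "keypoints inside the bounding box" could mean, the sketch below places a uniform grid of query points inside a box. It is not the repository's pipeline: grid_points_in_bbox is a hypothetical helper, and the (x_min, y_min, x_max, y_max) pixel format and the grid density are assumptions, since the stored bbox layout is not described here.

```python
def grid_points_in_bbox(bbox, points_per_side):
    """Return a uniform grid of (x, y) points covering the bounding box.

    bbox: (x_min, y_min, x_max, y_max) in pixels (assumed format).
    points_per_side: number of grid points along each axis, at least 2.
    Points are returned row by row, including the box corners.
    """
    if points_per_side < 2:
        raise ValueError("points_per_side must be at least 2")
    x_min, y_min, x_max, y_max = bbox
    step_x = (x_max - x_min) / (points_per_side - 1)
    step_y = (y_max - y_min) / (points_per_side - 1)
    return [
        (x_min + col * step_x, y_min + row * step_y)
        for row in range(points_per_side)
        for col in range(points_per_side)
    ]


points = grid_points_in_bbox((0, 0, 10, 20), 3)
print(len(points))  # 9
```

The resulting points would then serve as the initial query locations for a point tracker such as TAPIR.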
@inproceedings{
xu2024flow,
title={Flow as the Cross-domain Manipulation Interface},
author={Mengda Xu and Zhenjia Xu and Yinghao Xu and Cheng Chi and Gordon Wetzstein and Manuela Veloso and Shuran Song},
booktitle={8th Annual Conference on Robot Learning},
year={2024},
url={https://openreview.net/forum?id=cNI0ZkK1yC}
}
This repository is released under the MIT license.
We would like to thank Yifan Hou, Zeyi Liu, Huy Ha, Mandi Zhao, Chuer Pan, Xiaomeng Xu, Yihuai Gao, Austin Patel, Haochen Shi, John So, Yuwei Guo, Haoyu Xiong, Litian Liang, Dominik Bauer, and Samir Yitzhak Gadre for their helpful feedback and fruitful discussions.
- We use TAPIR to track flow for both dataset generation and online point tracking. The flow generation code is based on Animatediff. The tapnet code and the Animatediff code are copied directly from the official repos to make the code base more self-contained.
- Diffusion Policy is adapted from Diffusion Policy.
- Simulation environments are adapted from Scaling Up and Distilling Down.
- Simulation assets are obtained from Mujoco Menagerie and Kevin Zakka's mujoco_scanned_objects.