
Commit

Add DriveDreamer
patrick-llgc committed Mar 1, 2024
1 parent 8a83064 commit 054058b
Showing 4 changed files with 45 additions and 6 deletions.
5 changes: 3 additions & 2 deletions README.md
@@ -36,15 +36,16 @@ I regularly update [my blog in Toward Data Science](https://medium.com/@patrickl
- [Multimodal Regression](https://towardsdatascience.com/anchors-and-multi-bin-loss-for-multi-modal-target-regression-647ea1974617)
- [Paper Reading in 2019](https://towardsdatascience.com/the-200-deep-learning-papers-i-read-in-2019-7fb7034f05f7?source=friends_link&sk=7628c5be39f876b2c05e43c13d0b48a3)

## 2024-03 (1)
## 2024-03 (2)
- [Genie: Generative Interactive Environments](https://arxiv.org/abs/2402.15391) [[Notes](paper_notes/genie.md)] [DeepMind]
- [DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving](https://arxiv.org/abs/2309.09777) [[Notes](paper_notes/drive_dreamer.md)] [Jiwen Lu]
- [WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens](https://arxiv.org/abs/2401.09985) [Jiwen Lu]
- [GenAD: Generative End-to-End Autonomous Driving](https://arxiv.org/abs/2402.11502)
- [TCP: Trajectory-guided Control Prediction for End-to-end Autonomous Driving: A Simple yet Strong Baseline](https://arxiv.org/abs/2206.08129) <kbd>NeurIPS 2022</kbd> [E2E planning, Hongyang]
- [Transfuser: Multi-Modal Fusion Transformer for End-to-End Autonomous Driving](https://arxiv.org/abs/2104.09224) <kbd>CVPR 2021</kbd> [E2E planning, Geiger]
- [Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving](https://arxiv.org/abs/2310.01957) [Wayve, LLM + AD]
- [LingoQA: Video Question Answering for Autonomous Driving](https://arxiv.org/abs/2312.14115) [Wayve, LLM + AD]
- [World Model on Million-Length Video And Language With RingAttention](https://arxiv.org/abs/2402.08268) [Pieter Abbeel]
- [DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving](https://arxiv.org/abs/2309.09777) [Jiwen Lu]
- [Panacea: Panoramic and Controllable Video Generation for Autonomous Driving](https://arxiv.org/abs/2311.16813) <kbd>CVPR 2024</kbd> [Megvii]
- [PlanT: Explainable Planning Transformers via Object-Level Representations](https://arxiv.org/abs/2210.14222) <kbd>CoRL 2022</kbd>
- [Scene as Occupancy](https://arxiv.org/abs/2306.02851) <kbd>ICCV 2023</kbd>
35 changes: 35 additions & 0 deletions paper_notes/drive_dreamer.md
@@ -0,0 +1,35 @@
# [DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving](https://arxiv.org/abs/2309.09777)

_March 2024_

tl;dr: World model for autonomous driving, conditioned on structured traffic constraints.

#### Overall impression
First real-world-driven world model, contemporary with [GAIA-1](gaia_1.md). Yet the controllability of its dynamics is quite different.

Typically the controllability of a world model is only qualitative, as it is hard to achieve (close to) pixel-accurate generation with diffusion models. DriveDreamer alleviates this problem and reaches near pixel-accurate control with structured traffic constraints (vectorized wireframes of perception results, or `perception vectors` for short).
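
As a rough sketch (the container and field names below are hypothetical, not from the paper), these `perception vectors` can be thought of as a per-frame bundle of vectorized HD-map wireframes and 3D boxes that conditions the diffusion model:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PerceptionVectors:
    """Hypothetical per-frame container for the structured traffic constraints."""
    hd_map_polylines: np.ndarray  # (num_polylines, num_points, 2) lane/road wireframes in BEV
    boxes_3d: np.ndarray          # (num_agents, 7) x, y, z, l, w, h, yaw for each agent
    box_classes: np.ndarray       # (num_agents,) integer class ids

    def flatten(self) -> np.ndarray:
        # One flat conditioning vector per frame, e.g. for cross-attention keys/values.
        return np.concatenate([
            self.hd_map_polylines.reshape(-1).astype(np.float32),
            self.boxes_3d.reshape(-1).astype(np.float32),
            self.box_classes.astype(np.float32),
        ])
```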

The model takes in video, text, actions, and perception vectors, and rolls out future videos and actions. It can be seen as a world model since the video generation is conditioned on actions.

The dynamics of the world model are actually controlled by a simple RNN model, the ActionFormer, operating in the latent space of the `perception vectors`. This is quite different from GAIA-1 and Genie, where the dynamics are learned by compressing large amounts of video data.
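
A minimal sketch of this idea (module and tensor names are assumptions, not the authors' implementation): a GRU rolls the perception-vector latent forward conditioned on the driving action, and the video diffusion model is then conditioned on the predicted latent.

```python
import torch
import torch.nn as nn

class ActionFormerSketch(nn.Module):
    """Toy GRU that autoregressively predicts future perception-vector latents;
    in this reading, the world model's dynamics live in this small recurrent model."""
    def __init__(self, latent_dim: int = 256, action_dim: int = 2, hidden_dim: int = 512):
        super().__init__()
        self.gru = nn.GRU(latent_dim + action_dim, hidden_dim, batch_first=True)
        self.to_latent = nn.Linear(hidden_dim, latent_dim)

    def forward(self, latents: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # latents: (B, T, latent_dim) encoded perception vectors
        # actions: (B, T, action_dim), e.g. speed and steering
        h, _ = self.gru(torch.cat([latents, actions], dim=-1))
        return self.to_latent(h)  # (B, T, latent_dim) predicted next-step latents

# The predicted latents would then serve as conditioning for the video diffusion model.
```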

The model mainly focuses on the single-camera scenario, but the authors demonstrate in the appendix that it can be easily extended to the multi-camera setting. --> The first solid multi-camera work is [Drive WM (Drive into the Future)](drive_wm.md).

#### Key ideas
- Training is multi-stage. --> Seems that this is the norm for all world models, like GAIA-1.
  - Stage 1: AutoDM (Autonomous driving diffusion model)
    - Train the image diffusion model first
    - Then train the video diffusion model
    - Text conditioning via cross attention
  - Stage 2: Add action conditioning (interaction) and action prediction.
    - **ActionFormer** is an RNN (GRU) that autoregressively predicts future road structural features in the latent space. **ActionFormer models the dynamics of the world model.**
- Eval
  - Image/video quality: FID and FVD (the standard definitions are recalled after this list)
  - Perception boosting: mAP of a model trained on a mixture of real and virtual data.
  - Open-loop planning: not very useful.
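
For reference (standard metric definitions, not something specific to DriveDreamer), FID compares Gaussian fits of real and generated feature distributions, and FVD is the same quantity computed on features from a video network (I3D):

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
             + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
```

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the feature means and covariances of real and generated samples.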

#### Technical details
- Summary of technical details

#### Notes
- Questions and notes on how to improve/revise the current work
2 changes: 1 addition & 1 deletion paper_notes/drive_wm.md
@@ -5,7 +5,7 @@ _February 2024_
tl;dr: First consistent, controllable, multiview video generation for autonomous driving.

#### Overall impression
The main contribution of the paper is **multiview**-consistent video generation, and the application of this world model to planning through **a tree search** and **OOD planning recovery**.
The main contribution of the paper is **multiview**-consistent video generation, and the application of this world model to planning through **a tree search** and **OOD planning recovery**. --> [DriveDreamer](drive_dreamer.md) also briefly discusses this topic in its appendix.

Drive-WM generates future videos conditioned on past videos, text, actions, and vectorized perception results, x_t+1 ~ f(x_t, a_t). It does NOT predict actions. In this way, it is very similar to [GAIA-1](gaia_1.md), but extends GAIA-1 with multi-camera video generation. It is also conditioned on vectorized perception output, like [DriveDreamer](drive_dreamer.md).
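
In symbols (a paraphrase of the note above, not the paper's exact notation), the rollout is a conditional video model with no action head:

```latex
x_{t+1} \sim p_\theta\!\left(x_{t+1} \mid x_{\le t},\; \mathrm{text},\; a_t,\; v_t\right)
```

where $v_t$ denotes the vectorized perception output; the actions $a_t$ are inputs only and are never predicted.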

9 changes: 6 additions & 3 deletions paper_notes/genie.md
@@ -11,6 +11,8 @@ The tech report is very concisely written, as other reports from DeepMind, and t

The model differs from [GAIA-1](gaia_1.md) in that GAIA-1 still uses video data with action and text annotation. Architecture-wise, GAIA-1 uses a dedicated video decoder based on a diffusion model, but Genie reuses the decoder of the tokenizer. --> Maybe this can explain the poor image quality.

The LAM is more general than the IDM in VPT, where a small amount of data is labeled first and the action predictor is then used to pseudo-label large sets of unlabeled data. --> Yet in a narrow domain such as autonomous driving, this approach may also be viable.

A world model enables next-frame prediction that is conditioned on action inputs.
Genie is a foundation world model, and can be used for training generalist agents without direct environment experience at agent training time.

@@ -26,7 +28,7 @@ Genie is a foundation world model, and can be used for training generalist agent
- ST-transformer is less prone to overfitting (and thus higher performance) compared with full-blown spatiotemporal attention.
- Latent Action Model (LAM)
- 8 unique codes in code book.
- Can infer latent action between each pair of frames
- Can infer latent action between each pair of frames. It is similar to the inverse dynamics model (IDM), which aims to uncover the underlying action between timesteps given observations of past and future timesteps, as in [Video Pretraining, VPT](vpt.md).
- VQ-VAE, to map continuous actions to a small discrete set of codes.
- At inference time, only the VQ code book is retained, and the entire LAM is discarded.
- Dynamics model
@@ -53,8 +55,6 @@ Genie is a foundation world model, and can be used for training generalist agent
- An agent is trained to predict the next latent action that the expert would take.
- Then the latent action is mapped to a real action (a toy sketch of the retained codebook and this mapping follows below).
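
A toy sketch of what survives at inference time (module and action names are assumptions for illustration, not from the report): a small VQ codebook with 8 learned codes, used to discretize a continuous latent action, plus a mapping from code index to a real controller action.

```python
import torch
import torch.nn as nn

class LatentActionCodebook(nn.Module):
    """Toy stand-in for the part of the LAM kept at inference time:
    only the VQ codebook; the LAM encoder/decoder are discarded."""
    def __init__(self, num_codes: int = 8, dim: int = 32):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_codes, dim))

    def quantize(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, dim) continuous latent action -> (B,) discrete code index
        dists = torch.cdist(z, self.codes)  # (B, num_codes) pairwise distances
        return dists.argmin(dim=-1)

# Hypothetical code-to-controller mapping for a platformer-style action space.
REAL_ACTIONS = {0: "noop", 1: "left", 2: "right", 3: "up",
                4: "down", 5: "jump", 6: "jump_left", 7: "jump_right"}
```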

- Scaling law

#### Technical details
- GAIA-1 and Genie have similar model sizes. GAIA-1 has a total of about 10B parameters (0.3B image tokenizer + 6.5B world model + 2.6B video diffusion model). Genie has a total of about 11B parameters (0.2B image tokenizer + 10.7B dynamics model).
- Why train LAM on pixels, instead of tokens?
@@ -63,6 +63,9 @@ Genie is a foundation world model, and can be used for training generalist agent
- The mapping between latent action and real action
- Unclear at first how each latent action will impact the next frame generation.
- But each action's effect remained **consistent** across different inputs, making it a similar experience to learning the buttons on a new controller.
- Scaling law: Genie demonstrates a clean scaling law w.r.t. model size, measured by cross-entropy (CE) loss during training (a generic power-law form is recalled below).
- Data quality: 10% high-quality data can do better than average-quality data for training foundation models.
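
As a reference point (a generic parameterization from the scaling-law literature, not a formula given in the Genie report), such trends are usually summarized by a power-law fit of training loss against parameter count $N$:

```latex
L(N) \approx a\,N^{-\alpha} + L_{\infty}
```

where $a$, $\alpha$, and $L_{\infty}$ are fitted constants and $L_{\infty}$ is the irreducible loss.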


#### Notes
- Questions and notes on how to improve/revise the current work
