From 791cbfdb5c997dfbb67207352ae578e35006b06e Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 3 Oct 2024 17:19:12 +0900 Subject: [PATCH 01/36] docs: unit7/video-processing/transformers-based-models.mdx --- .../transformers-based-models.mdx | 153 ++++++++++++++++++ 1 file changed, 153 insertions(+) create mode 100644 chapters/en/unit7/video-processing/transformers-based-models.mdx diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx new file mode 100644 index 000000000..c700a625d --- /dev/null +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -0,0 +1,153 @@ +# Transformers in Video Processing (Part 1)[[transformers-in-video-processing-part-1]] + +## Introduction[[introduction]] + +In this chapter, we will cover how the Transformers model is utilized in video Processing. In particular, we will introduce the Vision Transformer, a successful application of the Transformers model in the field of Vision. We will then explain the additional considerations made for the Video Vision Transformer (ViViT) model used in video, as opposed to the Vision Transformer model used in images. Finally, we will briefly discuss about the TimeSFormer model. + +**Materials that would be helpful to review before reading this document**: + +- [computer vision course / unit3 / vision transformers for image classification](https://huggingface.co/learn/computer-vision-course/unit3/vision-transformers/vision-transformers-for-image-classification) +- [transformers / model documentation : ViT](https://huggingface.co/docs/transformers/main/en/model_doc/vit) + +## Recap about ViT[[recap-about-vit]] + +First, let's take a quick look at Vision Transformers: [An image is worth 16x16 words: Transformers for image recognition at scale](https://arxiv.org/abs/2010.11929), the most basic of the successful applications of Transformers to vision. + +The abstract from the paper is as follows; + +*Inspired by the Transformer scaling successes in NLP, we experiment with applying a standard Transformer directly to images, with the fewest possible modifications. To do so, we split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer. Image patches are treated the same way as tokens (words) in an NLP application. We train the model on image classification in supervised fashion.* + +
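The core operation described in the abstract, splitting an image into patches and feeding the sequence of linear patch embeddings to a Transformer, can be sketched in a few lines. The snippet below is a simplified illustration using typical ViT-Base sizes (16x16 patches, 768-dimensional embeddings); the reference model additionally prepends a learnable classification token and passes the tokens through a full Transformer encoder.

```python
import torch
import torch.nn as nn

# Turn a 224x224 image into a sequence of patch embeddings (ViT-style sketch).
image = torch.randn(1, 3, 224, 224)          # (batch, channels, height, width)
patch_size, embed_dim = 16, 768

# A strided convolution implements "split into patches + linear projection" in one step.
to_patches = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
tokens = to_patches(image)                   # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768): 196 patch tokens

# Learnable positional embeddings let the Transformer recognize the order of the patches.
pos_embed = nn.Parameter(torch.zeros(1, tokens.shape[1], embed_dim))
tokens = tokens + pos_embed                  # ready to be fed to a Transformer encoder

print(tokens.shape)                          # torch.Size([1, 196, 768])
```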
+ Vision transformer architecture +
+ViT architecture. Taken from the original paper. + +The key techniques proposed in the ViT paper are as follows: + +- Images are divided into small patches, and each patch is used as input to a Transformer model, replacing CNNs with a Transformer-based approach. + +- Each image patch is linearly mapped, and positional embeddings are added to allow the Transformer to recognize the order of the patches. + +- The model is pre-trained on large-scale datasets and fine-tuned for downstream vision tasks, achieving high performance. + +### Performance & Limitation[[performance-limitation]] + +
+ Vision transformer performance +
+ Comparision with SOTA models. Taken from the original paper. + +Although ViT outperformed other state-of-the-art models, training the ViT model required a large amount of computational power. Training the ViT model took 2,500 days on TPU-v3. Assuming a TPU-v3 core costs approximately $2 per hour (you can find more detailed pricing information [here](https://cloud.google.com/tpu/pricing)), it would cost $2 x 24 hours x 2,500 days = $120,000 to train the model once. + +## Video Vision Transformer (ViViT)[[video-vision-transformer-vivit]] + +As mentioned earlier, the important issue for ViViT, which extending the image processing of ViT to video classification task, was how to train the model more quickly and efficiently. Also, unlike images, video contains not only spatial information, but also temporal information, and how to handle this “temporal information” is a key consideration and exploration. + +The abstract from the [paper](https://arxiv.org/abs/2103.15691) is the as follows: + +*We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatiotemporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several, efficient variants of our model which factorise the spatial- and temporal-dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we release code at https://github.com/google-research/scenic.* + +
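Before looking at the architecture itself, a quick back-of-the-envelope count shows why the temporal dimension makes efficiency such a pressing concern. The clip length, resolution, patch size, and tubelet depth below are illustrative assumptions rather than the paper's exact configuration:

```python
# Back-of-the-envelope token counts (all sizes are illustrative assumptions).
def num_patches(height, width, patch):
    return (height // patch) * (width // patch)

image_tokens = num_patches(224, 224, 16)        # a single 224x224 image: 14 * 14 = 196 tokens
frames = 32                                     # an assumed clip length
video_tokens = frames * image_tokens            # uniform frame sampling: every sampled frame is patched
tubelet_tokens = (frames // 2) * image_tokens   # tubelets spanning 2 frames halve the token count

# Self-attention compares every token with every other token, so cost grows quadratically.
for name, n in [("image", image_tokens), ("video", video_tokens), ("tubelets", tubelet_tokens)]:
    print(f"{name:8s}: {n:5d} tokens -> ~{n * n:,} attention pairs")
```

Because self-attention cost grows quadratically with the number of tokens, even a short clip multiplies the work by several orders of magnitude compared to a single image, which is exactly what the factorised variants described below try to tame.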
+ ViViT architecture +
+ViViT architecture. Taken from the original paper. + +### Embedding video clips[[embedding-video-clips]] + +#### Uniform frame sampling[[uniform-frame-sampling]] + +
+ Uniform frame sampling +
+Uniform frame sampling. Taken from the original paper. + +In this mapping method, the model uniformly samples some frames across the time domain, +e. g. one frame per every 2 frames. + +#### Tubelet embedding[[tubelet-embedding]] + +
+ Tubelet embedding +
+Tubelet embedding. Taken from the original paper. + +An alternate method, extracting spatio-temporal "tubes" from the input volume and linearly projecting this. This method fuses spatio-temporal information during tokenization. + +The previously introduced methods, such as Uniform Frame Sampling and Tubelet Embedding, are effective but relatively simple approaches. The upcoming methods to be introduced are more advanced. + +### Transformer models for video in ViViT[[transformer-models-for-video-in-vivit]] +#### Model 1 : Spatio-temporal attention[[model-1-spatio-temporal-attention]] + +Simply forwards all spatio-temporal tokens extracted from the video, throught the transformer encoder. Each frame is split n_w x n_h image patches, so total n_t x n_w x n_h pathces contextualize from each other using transformer encoder. (n_h : # of rows, n_w : # of columns, n_t : # of frames) + +**complexity : O(n_h^2 x n_w^2 x n_t^2)** +#### Model 2 : Factorised encoder[[model-2-factorised-encoder]] + +The approach in Model 1 was somewhat inefficient, as it contextualized all patches simultaneously. To improve upon this, Model 2 separates the spatial and temporal encoders sequentially. + +
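A rough sketch of this two-stage design is shown below. The sizes are made up for illustration, and where ViViT summarises each frame with a CLS token, the sketch simply averages the patch tokens:

```python
import torch
import torch.nn as nn

batch, frames, patches, dim = 2, 8, 196, 192   # illustrative sizes, not the paper's configuration

spatial_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2
)
temporal_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2
)

tokens = torch.randn(batch, frames, patches, dim)           # patch tokens for each frame

# Step 1: the spatial encoder contextualizes patches within each frame independently.
spatial_out = spatial_encoder(tokens.reshape(batch * frames, patches, dim))

# Step 2: pool each frame into one embedding, then model interactions across time.
frame_embeddings = spatial_out.mean(dim=1).reshape(batch, frames, dim)
video_embedding = temporal_encoder(frame_embeddings)        # (batch, frames, dim)

print(video_embedding.shape)                                # torch.Size([2, 8, 192])
```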
+ ViViT model 2 +
+Factorised encoder (Model 2). Taken from the original paper. + +First, only spatial interactions are contextualized through Spatial Transformer Encoder (=ViT). Then, each frame is encoded to a single embedding, fed into the Temporal Transformer Encoder(=general transformer). + +**complexity : O(n_h^2 x n_w^2 + n_t^2)** + +#### Model 3 : Factorised self-attention[[model-3-factorised-self-attention]] + +
+ ViViT model 3 +
+Factorised self-attention (Model 3). Taken from the original paper. + +In model 3, instead of computing multi-headed self-attention accross all pairs of tokens, first only compute self-attention spatially (among all tokens extracted from the same temporal index). Then compute self-attention temporally(among all tokens extracted from the same spatial index). Because of the ambiguities no CLS(classification) token is used. + +**complexity : same as model 2** +#### Model 4 : Factorized dot-product attention[[model-4-factorized-dot-product-attention]] + +
+ ViViT model 4 +
+Factorised dot-product attention (Model 4). Taken from the original paper. + +In model 4, half of the attention heads are designed to operate with keys and values from spatial indices, the other half operate with keys and values from same temporal indices. + +**complexity : same as model 2, 3** +### Experiments and Discussion[[experiments-and-discussion]] + +
+ ViViT model performance +
+Comparison of model architectures (Top 1 accuracy). Taken from the original paper. + +After comparing Models 1, 2, 3, and 4, it is evident that Model 1 achieved the best performance but required the longest training time. In contrast, Model 2 demonstrated relatively high performance with shorter training times compared to Models 3 and 4, making it the most efficient model overall. + + The ViViT model fundamentally faces the issue of dataset sparsity. Like the Vision Transformer (ViT), ViViT requires an extremely large dataset to achieve good performance. However, such a scale of dataset is often unavailable for videos. Given that the learning task is more complex, the approach is to first pre-train on a large image dataset using ViT to initialize the model. + +## TimeSFormer[[timesformer]] + +TimeSFormer is a concurrent work with ViViT, applying Transformer on video classification. The following sections are explanations of each type of attention. + +
+ TimeSFormer model +
+Visualization of the five space-time self-attention schemes. Taken from the original paper. + +- **Sparse Attention** is the same as ViT; the blue patch is the query and contextualizes other patches within one frame. +- **Joint Space-Time Attention** is the same as ViViT Model 1; the blue patch is the query and contextualizes other patches across multiple frames. +- **Divided Space-Time Attention** is similar to ViViT Model 3; the blue patch first contextualizes temporally with the green patches at the same position, and then spatially contextualizes with other image patches at the same time index. +- **Sparse Local Global Attention**: selectively combines local and global information. +- **Axial Attention**: processes spatial and temporal dimensions seperately along their axes. + +### Performance Discussion[[performance-discussion]] + +The **Divided Space-Time Attention** mechanism shows the most effective performance, providing the best balance of parameter efficiency and accuracy on both K400 and SSv2 datasets. + +## Conclusion[[conclusion]] + +ViViT expanded upon the ViT model to handle video data more effectively by introducing various models such as the Factorized Encoder, Factorized Self-Attention, and Factorized Dot-Product Attention, all aimed at managing the space-time dimensions efficiently. Similarly, TimeSFormer evolved from the ViT architecture and utilized diverse attention mechanisms to handle space-time dimensions, much like ViViT. A key takeaway from this progression is the focus on reducing the significant computational costs associated with applying transformer architectures to video analysis. By leveraging different optimization techniques, these models improve efficiency and enable learning with fewer computational resources. 
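If you would like to try these models directly, pretrained video-classification checkpoints are available in the Hugging Face `transformers` library. The minimal inference sketch below assumes the `google/vivit-b-16x2-kinetics400` checkpoint and a 32-frame clip, which match the model card at the time of writing; double-check both before running it.

```python
import numpy as np
import torch
from transformers import VivitImageProcessor, VivitForVideoClassification

# A dummy clip of 32 RGB frames; in practice, decode frames from a real video file.
video = list(np.random.randint(0, 256, (32, 224, 224, 3), dtype=np.uint8))

checkpoint = "google/vivit-b-16x2-kinetics400"   # assumed checkpoint name, check the Hub
processor = VivitImageProcessor.from_pretrained(checkpoint)
model = VivitForVideoClassification.from_pretrained(checkpoint)

inputs = processor(video, return_tensors="pt")   # resizes, normalizes, and stacks the frames
with torch.no_grad():
    logits = model(**inputs).logits              # (1, 400) class logits for Kinetics-400

print(model.config.id2label[logits.argmax(-1).item()])
```

TimeSFormer can be used in the same way through `TimesformerForVideoClassification`, for example with the `facebook/timesformer-base-finetuned-k400` checkpoint (again, verify the exact name on the Hub).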
+ +## Additional Resources[[additional-resources]] + +- [Video Transformers: A Survey](https://arxiv.org/abs/2201.05991) \ No newline at end of file From e8b8c65baf3424ce4848ba2d3c5b6bc546984890 Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 3 Oct 2024 17:21:10 +0900 Subject: [PATCH 02/36] _toctree.yml modification --- chapters/en/_toctree.yml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/chapters/en/_toctree.yml b/chapters/en/_toctree.yml index 8869b603f..a590c4adf 100644 --- a/chapters/en/_toctree.yml +++ b/chapters/en/_toctree.yml @@ -126,6 +126,8 @@ local: "unit7/video-processing/video-processing-basics" - title: Overview of the previous SOTA models local: "unit7/video-processing/overview-of-previous-sota-models" + - title: Transformers based models + local: "unit7/video-processing/transformers-based-models" - title: Unit 8 - 3D Vision, Scene Rendering and Reconstruction sections: - title: Introduction From 1670cf46681ebeda82f690e7c74bcfaa94af047a Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 3 Oct 2024 17:23:36 +0900 Subject: [PATCH 03/36] name added to welcome.mdx --- chapters/en/unit0/welcome/welcome.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit0/welcome/welcome.mdx b/chapters/en/unit0/welcome/welcome.mdx index 1cfb65734..123017a23 100644 --- a/chapters/en/unit0/welcome/welcome.mdx +++ b/chapters/en/unit0/welcome/welcome.mdx @@ -126,7 +126,7 @@ Our goal was to create a computer vision course that is beginner-friendly and th **Unit 7 - Video and Video Processing** - Reviewers: [Ameed Taylor](https://github.com/atayloraerospace), [Isabella Bicalho-Frazeto](https://github.com/bellabf) -- Writers: [Diwakar Basnet](https://github.com/DiwakarBasnet), [Chulhwa Han](https://github.com/cjfghk5697) +- Writers: [Diwakar Basnet](https://github.com/DiwakarBasnet), [Chulhwa Han](https://github.com/cjfghk5697), [Jiwook Han](https://github.com/mreraser) **Unit 8 - 3D Vision, Scene Rendering, and Reconstruction** From 5373b2a0bf2a3152a0ae14a8a3f41f77aba6e395 Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Sat, 5 Oct 2024 17:07:08 +0900 Subject: [PATCH 04/36] Co-authored-by: seoulsky-field --- .../unit7/video-processing/transformers-based-models.mdx | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index c700a625d..c66adb9c2 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -41,7 +41,7 @@ Although ViT outperformed other state-of-the-art models, training the ViT model ## Video Vision Transformer (ViViT)[[video-vision-transformer-vivit]] -As mentioned earlier, the important issue for ViViT, which extending the image processing of ViT to video classification task, was how to train the model more quickly and efficiently. Also, unlike images, video contains not only spatial information, but also temporal information, and how to handle this “temporal information” is a key consideration and exploration. +As mentioned earlier, the important issue for ViViT, which extends the image processing of ViT to video classification task, was how to train the model more quickly and efficiently. 
Also, unlike images, video contains not only spatial information, but also temporal information, and how to handle this “temporal information” is a key consideration and exploration. The abstract from the [paper](https://arxiv.org/abs/2103.15691) is the as follows: @@ -78,7 +78,7 @@ The previously introduced methods, such as Uniform Frame Sampling and Tubelet Em ### Transformer models for video in ViViT[[transformer-models-for-video-in-vivit]] #### Model 1 : Spatio-temporal attention[[model-1-spatio-temporal-attention]] -Simply forwards all spatio-temporal tokens extracted from the video, throught the transformer encoder. Each frame is split n_w x n_h image patches, so total n_t x n_w x n_h pathces contextualize from each other using transformer encoder. (n_h : # of rows, n_w : # of columns, n_t : # of frames) +Simply forwards all spatio-temporal tokens extracted from the video, through the transformer encoder. Each frame is split n_w x n_h image patches, so total n_t x n_w x n_h patches contextualize from each other using transformer encoder. (n_h : # of rows, n_w : # of columns, n_t : # of frames) **complexity : O(n_h^2 x n_w^2 x n_t^2)** #### Model 2 : Factorised encoder[[model-2-factorised-encoder]] @@ -101,7 +101,7 @@ First, only spatial interactions are contextualized through Spatial Transformer Factorised self-attention (Model 3). Taken from the original paper. -In model 3, instead of computing multi-headed self-attention accross all pairs of tokens, first only compute self-attention spatially (among all tokens extracted from the same temporal index). Then compute self-attention temporally(among all tokens extracted from the same spatial index). Because of the ambiguities no CLS(classification) token is used. +In model 3, instead of computing multi-headed self-attention across all pairs of tokens, first only compute self-attention spatially (among all tokens extracted from the same temporal index). Then compute self-attention temporally(among all tokens extracted from the same spatial index). Because of the ambiguities no CLS(classification) token is used. **complexity : same as model 2** #### Model 4 : Factorized dot-product attention[[model-4-factorized-dot-product-attention]] @@ -138,7 +138,7 @@ TimeSFormer is a concurrent work with ViViT, applying Transformer on video class - **Joint Space-Time Attention** is the same as ViViT Model 1; the blue patch is the query and contextualizes other patches across multiple frames. - **Divided Space-Time Attention** is similar to ViViT Model 3; the blue patch first contextualizes temporally with the green patches at the same position, and then spatially contextualizes with other image patches at the same time index. - **Sparse Local Global Attention**: selectively combines local and global information. -- **Axial Attention**: processes spatial and temporal dimensions seperately along their axes. +- **Axial Attention**: processes spatial and temporal dimensions separately along their axes. 
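The Divided Space-Time Attention scheme listed above boils down to two attention steps applied one after the other: first along time (tokens at the same spatial position across frames), then within each frame. The sketch below uses made-up sizes and leaves out the residual connections, layer norms, and MLP of a real TimeSformer block:

```python
import torch
import torch.nn as nn

batch, frames, patches, dim = 2, 8, 196, 192            # illustrative sizes

temporal_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
spatial_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

x = torch.randn(batch, frames, patches, dim)             # patch tokens of a clip

# Temporal step: tokens at the same spatial position attend to each other across frames.
t_in = x.permute(0, 2, 1, 3).reshape(batch * patches, frames, dim)
t_out, _ = temporal_attn(t_in, t_in, t_in)
x = t_out.reshape(batch, patches, frames, dim).permute(0, 2, 1, 3)

# Spatial step: tokens within the same frame attend to each other.
s_in = x.reshape(batch * frames, patches, dim)
s_out, _ = spatial_attn(s_in, s_in, s_in)
x = s_out.reshape(batch, frames, patches, dim)

print(x.shape)                                            # torch.Size([2, 8, 196, 192])
```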
### Performance Discussion[[performance-discussion]] From 7f7e8117241541d12d37ef738aa3ef7956549b26 Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Tue, 8 Oct 2024 15:16:59 +0900 Subject: [PATCH 05/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx Co-authored-by: Woojun Jung <46880056+jungnerd@users.noreply.github.com> --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index c66adb9c2..70b60b0f7 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -2,7 +2,7 @@ ## Introduction[[introduction]] -In this chapter, we will cover how the Transformers model is utilized in video Processing. In particular, we will introduce the Vision Transformer, a successful application of the Transformers model in the field of Vision. We will then explain the additional considerations made for the Video Vision Transformer (ViViT) model used in video, as opposed to the Vision Transformer model used in images. Finally, we will briefly discuss about the TimeSFormer model. +In this chapter, we will cover how the Transformers model is utilized in video processing. In particular, we will introduce the Vision Transformer, a successful application of the Transformers model in the field of vision. We will then explain the additional considerations made for the Video Vision Transformer (ViViT) model used in video, as opposed to the Vision Transformer model used in images. Finally, we will briefly discuss about the TimeSFormer model. **Materials that would be helpful to review before reading this document**: From c313920fb435b3496ad242528a3c6256cc42f0e6 Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Tue, 8 Oct 2024 15:17:05 +0900 Subject: [PATCH 06/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx Co-authored-by: Woojun Jung <46880056+jungnerd@users.noreply.github.com> --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index 70b60b0f7..2e707dfc8 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -30,7 +30,7 @@ The key techniques proposed in the ViT paper are as follows: - The model is pre-trained on large-scale datasets and fine-tuned for downstream vision tasks, achieving high performance. -### Performance & Limitation[[performance-limitation]] +### Performance & Limitation[[performance-and-limitation]]
Vision transformer performance From 8faa70586516c74dd9c31dfc4dfec6d1d838adb7 Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Tue, 8 Oct 2024 15:54:24 +0900 Subject: [PATCH 07/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx Co-authored-by: Woojun Jung <46880056+jungnerd@users.noreply.github.com> --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index 2e707dfc8..1204dd41b 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -101,7 +101,7 @@ First, only spatial interactions are contextualized through Spatial Transformer
Factorised self-attention (Model 3). Taken from the original paper. -In model 3, instead of computing multi-headed self-attention across all pairs of tokens, first only compute self-attention spatially (among all tokens extracted from the same temporal index). Then compute self-attention temporally(among all tokens extracted from the same spatial index). Because of the ambiguities no CLS(classification) token is used. +In model 3, instead of computing multi-headed self-attention across all pairs of tokens, first only compute self-attention spatially(among all tokens extracted from the same temporal index). Then compute self-attention temporally(among all tokens extracted from the same spatial index). Because of the ambiguities, no CLS(classification) token is used. **complexity : same as model 2** #### Model 4 : Factorized dot-product attention[[model-4-factorized-dot-product-attention]] From 48f754302bdd0f93942a1f4f9ae790a68471ad8d Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Tue, 8 Oct 2024 15:54:29 +0900 Subject: [PATCH 08/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx Co-authored-by: Woojun Jung <46880056+jungnerd@users.noreply.github.com> --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index 1204dd41b..1f98c2c0d 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -90,7 +90,7 @@ The approach in Model 1 was somewhat inefficient, as it contextualized all patch Factorised encoder (Model 2). Taken from the original paper. -First, only spatial interactions are contextualized through Spatial Transformer Encoder (=ViT). Then, each frame is encoded to a single embedding, fed into the Temporal Transformer Encoder(=general transformer). +First, only spatial interactions are contextualized through Spatial Transformer Encoder(=ViT). Then, each frame is encoded to a single embedding, fed into the Temporal Transformer Encoder(=general transformer). **complexity : O(n_h^2 x n_w^2 + n_t^2)** From 60ca8edb64ff70ae3c8f07f4812cb9d9f4e421bf Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Tue, 8 Oct 2024 15:54:39 +0900 Subject: [PATCH 09/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx Co-authored-by: Woojun Jung <46880056+jungnerd@users.noreply.github.com> --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index 1f98c2c0d..90bc6d219 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -123,7 +123,7 @@ In model 4, half of the attention heads are designed to operate with keys and va After comparing Models 1, 2, 3, and 4, it is evident that Model 1 achieved the best performance but required the longest training time. In contrast, Model 2 demonstrated relatively high performance with shorter training times compared to Models 3 and 4, making it the most efficient model overall. - The ViViT model fundamentally faces the issue of dataset sparsity. 
Like the Vision Transformer (ViT), ViViT requires an extremely large dataset to achieve good performance. However, such a scale of dataset is often unavailable for videos. Given that the learning task is more complex, the approach is to first pre-train on a large image dataset using ViT to initialize the model. + The ViViT model fundamentally faces the issue of dataset sparsity. Like the Vision Transformer(ViT), ViViT requires an extremely large dataset to achieve good performance. However, such a scale of dataset is often unavailable for videos. Given that the learning task is more complex, the approach is to first pre-train on a large image dataset using ViT to initialize the model. ## TimeSFormer[[timesformer]] From 6f6e127e19eef6bbc3133065d3c834acbe44dffb Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 24 Oct 2024 15:31:08 +0900 Subject: [PATCH 10/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx Co-authored-by: Johannes Kolbe <2843485+johko@users.noreply.github.com> --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index 90bc6d219..a37e25c89 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -7,7 +7,7 @@ In this chapter, we will cover how the Transformers model is utilized in video p **Materials that would be helpful to review before reading this document**: - [computer vision course / unit3 / vision transformers for image classification](https://huggingface.co/learn/computer-vision-course/unit3/vision-transformers/vision-transformers-for-image-classification) -- [transformers / model documentation : ViT](https://huggingface.co/docs/transformers/main/en/model_doc/vit) +- [transformers / model documentation: ViT](https://huggingface.co/docs/transformers/main/en/model_doc/vit) ## Recap about ViT[[recap-about-vit]] From 86de76ce18b7a313f996802c4870a25b687ebfe1 Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 24 Oct 2024 15:37:04 +0900 Subject: [PATCH 11/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx Co-authored-by: Johannes Kolbe <2843485+johko@users.noreply.github.com> --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index a37e25c89..a468b328c 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -43,7 +43,7 @@ Although ViT outperformed other state-of-the-art models, training the ViT model As mentioned earlier, the important issue for ViViT, which extends the image processing of ViT to video classification task, was how to train the model more quickly and efficiently. Also, unlike images, video contains not only spatial information, but also temporal information, and how to handle this “temporal information” is a key consideration and exploration. 
-The abstract from the [paper](https://arxiv.org/abs/2103.15691) is the as follows: +The abstract from the [paper](https://arxiv.org/abs/2103.15691) is as follows: *We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatiotemporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several, efficient variants of our model which factorise the spatial- and temporal-dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we release code at https://github.com/google-research/scenic.* From 1e4229f99758f3083101a6be10b37bfbe3e8497d Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 24 Oct 2024 15:40:19 +0900 Subject: [PATCH 12/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx Co-authored-by: Johannes Kolbe <2843485+johko@users.noreply.github.com> --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index a468b328c..dd1306d0f 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -109,7 +109,7 @@ In model 3, instead of computing multi-headed self-attention across all pairs of
ViViT model 4
-Factorised dot-product attention (Model 4). Taken from the original paper. +Factorised Dot-Product Attention (Model 4). Taken from the original paper. In model 4, half of the attention heads are designed to operate with keys and values from spatial indices, the other half operate with keys and values from same temporal indices. From eb9beb66faefc73c68eebcb48cb993ec429f75fd Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 24 Oct 2024 15:40:42 +0900 Subject: [PATCH 13/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx Co-authored-by: Johannes Kolbe <2843485+johko@users.noreply.github.com> --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index dd1306d0f..a828f8b61 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -54,7 +54,7 @@ The abstract from the [paper](https://arxiv.org/abs/2103.15691) is as follows: ### Embedding video clips[[embedding-video-clips]] -#### Uniform frame sampling[[uniform-frame-sampling]] +#### Uniform Frame Sampling[[uniform-frame-sampling]]
Uniform frame sampling From 06b67c04a53a53c4dabfd91ff4d43131807c1ab9 Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 24 Oct 2024 15:41:06 +0900 Subject: [PATCH 14/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx Co-authored-by: Johannes Kolbe <2843485+johko@users.noreply.github.com> --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index a828f8b61..75d0db9a2 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -101,7 +101,7 @@ First, only spatial interactions are contextualized through Spatial Transformer
Factorised self-attention (Model 3). Taken from the original paper. -In model 3, instead of computing multi-headed self-attention across all pairs of tokens, first only compute self-attention spatially(among all tokens extracted from the same temporal index). Then compute self-attention temporally(among all tokens extracted from the same spatial index). Because of the ambiguities, no CLS(classification) token is used. +In model 3, instead of computing multi-headed self-attention across all pairs of tokens, we first only compute self-attention spatially (among all tokens extracted from the same temporal index). Next, we compute self-attention temporally (among all tokens extracted from the same spatial index). Because of the ambiguities, no CLS (classification) token is used. **complexity : same as model 2** #### Model 4 : Factorized dot-product attention[[model-4-factorized-dot-product-attention]] From 782be93a8481992e3ba4ad64beacb84f34f914f1 Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 24 Oct 2024 15:41:20 +0900 Subject: [PATCH 15/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx Co-authored-by: Johannes Kolbe <2843485+johko@users.noreply.github.com> --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index 75d0db9a2..60eae3ab2 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -99,7 +99,7 @@ First, only spatial interactions are contextualized through Spatial Transformer
ViViT model 3
-Factorised self-attention (Model 3). Taken from the original paper. +Factorised Self-Attention (Model 3). Taken from the original paper. In model 3, instead of computing multi-headed self-attention across all pairs of tokens, we first only compute self-attention spatially (among all tokens extracted from the same temporal index). Next, we compute self-attention temporally (among all tokens extracted from the same spatial index). Because of the ambiguities, no CLS (classification) token is used. From 6a34b5dfd368ea900d2c290884c3338879725d84 Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 24 Oct 2024 15:41:43 +0900 Subject: [PATCH 16/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx Co-authored-by: Johannes Kolbe <2843485+johko@users.noreply.github.com> --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index 60eae3ab2..7cadfa80c 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -52,7 +52,7 @@ The abstract from the [paper](https://arxiv.org/abs/2103.15691) is as follows: ViViT architecture. Taken from the original paper. -### Embedding video clips[[embedding-video-clips]] +### Embedding Video Clips[[embedding-video-clips]] #### Uniform Frame Sampling[[uniform-frame-sampling]] From ace8cfe41c67c7d8972afefc2aadcebd17f12f71 Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 24 Oct 2024 15:41:56 +0900 Subject: [PATCH 17/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx Co-authored-by: Johannes Kolbe <2843485+johko@users.noreply.github.com> --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index 7cadfa80c..745608a8e 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -59,7 +59,7 @@ The abstract from the [paper](https://arxiv.org/abs/2103.15691) is as follows:
Uniform frame sampling
-Uniform frame sampling. Taken from the original paper. +Uniform Frame Sampling. Taken from the original paper. In this mapping method, the model uniformly samples some frames across the time domain, e. g. one frame per every 2 frames. From 0e53c5fa7d4f5ce5f3aa97bd4ec952be90c93370 Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 24 Oct 2024 15:42:07 +0900 Subject: [PATCH 18/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx Co-authored-by: Johannes Kolbe <2843485+johko@users.noreply.github.com> --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index 745608a8e..e6d77ae1f 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -64,7 +64,7 @@ The abstract from the [paper](https://arxiv.org/abs/2103.15691) is as follows: In this mapping method, the model uniformly samples some frames across the time domain, e. g. one frame per every 2 frames. -#### Tubelet embedding[[tubelet-embedding]] +#### Tubelet Embedding[[tubelet-embedding]]
Tubelet embedding From 42764abbb846f6122174c551e3ea314a59717305 Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 24 Oct 2024 15:42:22 +0900 Subject: [PATCH 19/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx Co-authored-by: Johannes Kolbe <2843485+johko@users.noreply.github.com> --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index e6d77ae1f..2bc65612a 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -75,7 +75,7 @@ An alternate method, extracting spatio-temporal "tubes" from the input volume an The previously introduced methods, such as Uniform Frame Sampling and Tubelet Embedding, are effective but relatively simple approaches. The upcoming methods to be introduced are more advanced. -### Transformer models for video in ViViT[[transformer-models-for-video-in-vivit]] +### Transformer Models for Video in ViViT[[transformer-models-for-video-in-vivit]] #### Model 1 : Spatio-temporal attention[[model-1-spatio-temporal-attention]] Simply forwards all spatio-temporal tokens extracted from the video, through the transformer encoder. Each frame is split n_w x n_h image patches, so total n_t x n_w x n_h patches contextualize from each other using transformer encoder. (n_h : # of rows, n_w : # of columns, n_t : # of frames) From 6dfc0a50eaa0fa598824ddb969451b12c75f8aea Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 24 Oct 2024 15:42:39 +0900 Subject: [PATCH 20/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx Co-authored-by: Johannes Kolbe <2843485+johko@users.noreply.github.com> --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index 2bc65612a..16321f21c 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -76,7 +76,7 @@ An alternate method, extracting spatio-temporal "tubes" from the input volume an The previously introduced methods, such as Uniform Frame Sampling and Tubelet Embedding, are effective but relatively simple approaches. The upcoming methods to be introduced are more advanced. ### Transformer Models for Video in ViViT[[transformer-models-for-video-in-vivit]] -#### Model 1 : Spatio-temporal attention[[model-1-spatio-temporal-attention]] +#### Model 1 : Spatio-Temporal Attention[[model-1-spatio-temporal-attention]] Simply forwards all spatio-temporal tokens extracted from the video, through the transformer encoder. Each frame is split n_w x n_h image patches, so total n_t x n_w x n_h patches contextualize from each other using transformer encoder. 
(n_h : # of rows, n_w : # of columns, n_t : # of frames) From 0e147ec1f22f8c00f1da1c39075bda26d7f5cded Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 24 Oct 2024 15:43:02 +0900 Subject: [PATCH 21/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx Co-authored-by: Johannes Kolbe <2843485+johko@users.noreply.github.com> --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index 16321f21c..fbeb7fc0f 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -90,7 +90,7 @@ The approach in Model 1 was somewhat inefficient, as it contextualized all patch
Factorised encoder (Model 2). Taken from the original paper. -First, only spatial interactions are contextualized through Spatial Transformer Encoder(=ViT). Then, each frame is encoded to a single embedding, fed into the Temporal Transformer Encoder(=general transformer). +First, only spatial interactions are contextualized through a Spatial Transformer Encoder(=ViT). Then, each frame is encoded to a single embedding and fed into the Temporal Transformer Encoder(=general transformer). **complexity : O(n_h^2 x n_w^2 + n_t^2)** From 8ee6de7335714be5808762d9e04112f7562e2f29 Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 24 Oct 2024 15:43:20 +0900 Subject: [PATCH 22/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx Co-authored-by: Johannes Kolbe <2843485+johko@users.noreply.github.com> --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index fbeb7fc0f..716e2ccca 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -94,7 +94,7 @@ First, only spatial interactions are contextualized through a Spatial Transforme **complexity : O(n_h^2 x n_w^2 + n_t^2)** -#### Model 3 : Factorised self-attention[[model-3-factorised-self-attention]] +#### Model 3 : Factorised Self-Attention[[model-3-factorised-self-attention]]
ViViT model 3 From bc980d59a7d4dbf6975b4d5194db2e32cee3d1cc Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 24 Oct 2024 18:27:20 +0900 Subject: [PATCH 23/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx Co-authored-by: Johannes Kolbe <2843485+johko@users.noreply.github.com> --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index 716e2ccca..a46898cab 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -1,4 +1,4 @@ -# Transformers in Video Processing (Part 1)[[transformers-in-video-processing-part-1]] +# Transformers in Video Processing (Part 1) ## Introduction[[introduction]] From 0c1fbe73c1161c523956162ae7d2ce9b69c49e0c Mon Sep 17 00:00:00 2001 From: Jiwook Han Date: Thu, 24 Oct 2024 18:33:49 +0900 Subject: [PATCH 24/36] removed all anchors --- .../transformers-based-models.mdx | 34 +++++++++---------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index a46898cab..1412adddb 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -1,6 +1,6 @@ # Transformers in Video Processing (Part 1) -## Introduction[[introduction]] +## Introduction In this chapter, we will cover how the Transformers model is utilized in video processing. In particular, we will introduce the Vision Transformer, a successful application of the Transformers model in the field of vision. We will then explain the additional considerations made for the Video Vision Transformer (ViViT) model used in video, as opposed to the Vision Transformer model used in images. Finally, we will briefly discuss about the TimeSFormer model. @@ -9,7 +9,7 @@ In this chapter, we will cover how the Transformers model is utilized in video p - [computer vision course / unit3 / vision transformers for image classification](https://huggingface.co/learn/computer-vision-course/unit3/vision-transformers/vision-transformers-for-image-classification) - [transformers / model documentation: ViT](https://huggingface.co/docs/transformers/main/en/model_doc/vit) -## Recap about ViT[[recap-about-vit]] +## Recap about ViT First, let's take a quick look at Vision Transformers: [An image is worth 16x16 words: Transformers for image recognition at scale](https://arxiv.org/abs/2010.11929), the most basic of the successful applications of Transformers to vision. @@ -30,7 +30,7 @@ The key techniques proposed in the ViT paper are as follows: - The model is pre-trained on large-scale datasets and fine-tuned for downstream vision tasks, achieving high performance. -### Performance & Limitation[[performance-and-limitation]] +### Performance & Limitation
Vision transformer performance @@ -39,7 +39,7 @@ The key techniques proposed in the ViT paper are as follows: Although ViT outperformed other state-of-the-art models, training the ViT model required a large amount of computational power. Training the ViT model took 2,500 days on TPU-v3. Assuming a TPU-v3 core costs approximately $2 per hour (you can find more detailed pricing information [here](https://cloud.google.com/tpu/pricing)), it would cost $2 x 24 hours x 2,500 days = $120,000 to train the model once. -## Video Vision Transformer (ViViT)[[video-vision-transformer-vivit]] +## Video Vision Transformer (ViViT) As mentioned earlier, the important issue for ViViT, which extends the image processing of ViT to video classification task, was how to train the model more quickly and efficiently. Also, unlike images, video contains not only spatial information, but also temporal information, and how to handle this “temporal information” is a key consideration and exploration. @@ -52,9 +52,9 @@ The abstract from the [paper](https://arxiv.org/abs/2103.15691) is as follows:
ViViT architecture. Taken from the original paper. -### Embedding Video Clips[[embedding-video-clips]] +### Embedding Video Clips -#### Uniform Frame Sampling[[uniform-frame-sampling]] +#### Uniform Frame Sampling
Uniform frame sampling @@ -64,7 +64,7 @@ The abstract from the [paper](https://arxiv.org/abs/2103.15691) is as follows: In this mapping method, the model uniformly samples some frames across the time domain, e. g. one frame per every 2 frames. -#### Tubelet Embedding[[tubelet-embedding]] +#### Tubelet Embedding
Tubelet embedding @@ -75,13 +75,13 @@ An alternate method, extracting spatio-temporal "tubes" from the input volume an The previously introduced methods, such as Uniform Frame Sampling and Tubelet Embedding, are effective but relatively simple approaches. The upcoming methods to be introduced are more advanced. -### Transformer Models for Video in ViViT[[transformer-models-for-video-in-vivit]] -#### Model 1 : Spatio-Temporal Attention[[model-1-spatio-temporal-attention]] +### Transformer Models for Video in ViViT +#### Model 1 : Spatio-Temporal Attention Simply forwards all spatio-temporal tokens extracted from the video, through the transformer encoder. Each frame is split n_w x n_h image patches, so total n_t x n_w x n_h patches contextualize from each other using transformer encoder. (n_h : # of rows, n_w : # of columns, n_t : # of frames) **complexity : O(n_h^2 x n_w^2 x n_t^2)** -#### Model 2 : Factorised encoder[[model-2-factorised-encoder]] +#### Model 2 : Factorised encoder The approach in Model 1 was somewhat inefficient, as it contextualized all patches simultaneously. To improve upon this, Model 2 separates the spatial and temporal encoders sequentially. @@ -94,7 +94,7 @@ First, only spatial interactions are contextualized through a Spatial Transforme **complexity : O(n_h^2 x n_w^2 + n_t^2)** -#### Model 3 : Factorised Self-Attention[[model-3-factorised-self-attention]] +#### Model 3 : Factorised Self-Attention
ViViT model 3 @@ -104,7 +104,7 @@ First, only spatial interactions are contextualized through a Spatial Transforme In model 3, instead of computing multi-headed self-attention across all pairs of tokens, we first only compute self-attention spatially (among all tokens extracted from the same temporal index). Next, we compute self-attention temporally (among all tokens extracted from the same spatial index). Because of the ambiguities, no CLS (classification) token is used. **complexity : same as model 2** -#### Model 4 : Factorized dot-product attention[[model-4-factorized-dot-product-attention]] +#### Model 4 : Factorized dot-product attention
ViViT model 4 @@ -114,7 +114,7 @@ In model 3, instead of computing multi-headed self-attention across all pairs of In model 4, half of the attention heads are designed to operate with keys and values from spatial indices, the other half operate with keys and values from same temporal indices. **complexity : same as model 2, 3** -### Experiments and Discussion[[experiments-and-discussion]] +### Experiments and Discussion
ViViT model performance @@ -125,7 +125,7 @@ After comparing Models 1, 2, 3, and 4, it is evident that Model 1 achieved the b The ViViT model fundamentally faces the issue of dataset sparsity. Like the Vision Transformer(ViT), ViViT requires an extremely large dataset to achieve good performance. However, such a scale of dataset is often unavailable for videos. Given that the learning task is more complex, the approach is to first pre-train on a large image dataset using ViT to initialize the model. -## TimeSFormer[[timesformer]] +## TimeSFormer TimeSFormer is a concurrent work with ViViT, applying Transformer on video classification. The following sections are explanations of each type of attention. @@ -140,14 +140,14 @@ TimeSFormer is a concurrent work with ViViT, applying Transformer on video class - **Sparse Local Global Attention**: selectively combines local and global information. - **Axial Attention**: processes spatial and temporal dimensions separately along their axes. -### Performance Discussion[[performance-discussion]] +### Performance Discussion The **Divided Space-Time Attention** mechanism shows the most effective performance, providing the best balance of parameter efficiency and accuracy on both K400 and SSv2 datasets. -## Conclusion[[conclusion]] +## Conclusion ViViT expanded upon the ViT model to handle video data more effectively by introducing various models such as the Factorized Encoder, Factorized Self-Attention, and Factorized Dot-Product Attention, all aimed at managing the space-time dimensions efficiently. Similarly, TimeSFormer evolved from the ViT architecture and utilized diverse attention mechanisms to handle space-time dimensions, much like ViViT. A key takeaway from this progression is the focus on reducing the significant computational costs associated with applying transformer architectures to video analysis. By leveraging different optimization techniques, these models improve efficiency and enable learning with fewer computational resources. -## Additional Resources[[additional-resources]] +## Additional Resources - [Video Transformers: A Survey](https://arxiv.org/abs/2201.05991) \ No newline at end of file From 9c9e2194482dab349082f09fa680df6cdd34d740 Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Wed, 13 Nov 2024 14:01:54 +0900 Subject: [PATCH 25/36] Add explanations about embedding, why that matters, and why we should learn about uniform frame sampling, tubelet embedding --- .../transformers-based-models.mdx | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index 1412adddb..4053d0952 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -54,6 +54,23 @@ The abstract from the [paper](https://arxiv.org/abs/2103.15691) is as follows: ### Embedding Video Clips +#### What is embedding? +Before diving into specific techniques, it's important to understand what embeddings are. In machine learning, embeddings are dense vector representations that capture meaningful features of input data in a format that neural networks can process. For videos, we need to convert the raw pixel data into these mathematical representations while preserving both spatial information (what's in each frame) and temporal information (how things change over time). 
+ +#### Why Video Embeddings Matter +Processing videos is computationally intensive due to their size and complexity. Good embedding techniques help by: + +- Reducing dimensionality while preserving important features +- Capturing temporal relationships between frames +- Making it feasible for neural networks to process video data efficiently + +#### Why Focus on Uniform Frame Sampling and Tubelet Embeddings? +These two techniques represent fundamental approaches in video processing that have become building blocks for more advanced methods: + +1. They balance computational efficiency with information preservation, offering a range of options for different video processing tasks. +2. They serve as baseline methods, providing a comparison point against which newer techniques can demonstrate improvement. +3. Learning these approaches establishes a strong foundation in spatio-temporal processing, which is crucial for grasping more advanced video embedding methods. + #### Uniform Frame Sampling
From 6bac1668cd26eaad10cd061fcbc5ef92362acceb Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 14 Nov 2024 01:46:38 +0900 Subject: [PATCH 26/36] Add explanations about 'spatio-temporal token', 'contextualize', and explain the meaning of n_w, n_h, n_t earlier. --- .../video-processing/transformers-based-models.mdx | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index 4053d0952..ff6f43ef1 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -93,11 +93,17 @@ An alternate method, extracting spatio-temporal "tubes" from the input volume an The previously introduced methods, such as Uniform Frame Sampling and Tubelet Embedding, are effective but relatively simple approaches. The upcoming methods to be introduced are more advanced. ### Transformer Models for Video in ViViT + +The original ViViT paper proposes multiple transformer-based architectures, which we will now explore sequentially. + #### Model 1 : Spatio-Temporal Attention -Simply forwards all spatio-temporal tokens extracted from the video, through the transformer encoder. Each frame is split n_w x n_h image patches, so total n_t x n_w x n_h patches contextualize from each other using transformer encoder. (n_h : # of rows, n_w : # of columns, n_t : # of frames) +The first model naturally extends the idea of ViT to the video classification task. Each frame in the video is split into n_w(number of columns) x n_h(number of rows) image patches, resulting in a total of n_t(number of frames) x n_w x n_h patches. Each of these patches is then embedded as a “spatio-temporal token”—essentially a small unit representing both spatial(image) and temporal(video sequence) information. The model forwards all spatio-temporal tokens extracted from the video through the transformer encoder. This means each patch, or token, is processed to understand not only its individual features but also its relationship with other patches across time and space. Through this process, called “contextualizing,” the encoder learns how each patch relates to others by capturing patterns in position, color, and movement, thus building a rich, comprehensive understanding of the video’s overall context. **complexity : O(n_h^2 x n_w^2 x n_t^2)** + +However, using attention on all spatio-temporal tokens can lead to heavy computational costs. To make this process more efficient, methods like Uniform Frame Sampling and Tubelet Embedding, as explained earlier, are used to help reduce these costs. + #### Model 2 : Factorised encoder The approach in Model 1 was somewhat inefficient, as it contextualized all patches simultaneously. To improve upon this, Model 2 separates the spatial and temporal encoders sequentially. @@ -121,6 +127,7 @@ First, only spatial interactions are contextualized through a Spatial Transforme In model 3, instead of computing multi-headed self-attention across all pairs of tokens, we first only compute self-attention spatially (among all tokens extracted from the same temporal index). Next, we compute self-attention temporally (among all tokens extracted from the same spatial index). Because of the ambiguities, no CLS (classification) token is used. **complexity : same as model 2** + #### Model 4 : Factorized dot-product attention
@@ -131,6 +138,7 @@ In model 3, instead of computing multi-headed self-attention across all pairs of
 In model 4, half of the attention heads operate with keys and values drawn from the same spatial index (attending over the temporal dimension), while the other half operate with keys and values drawn from the same temporal index (attending over the spatial dimension).
 
 **complexity : same as model 2, 3**
+
 ### Experiments and Discussion
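Before turning to the experiments, the factorisation idea shared by Models 2, 3, and 4 can be made concrete with a short sketch. The block below is a simplified reading of Model 3 (factorised self-attention) in PyTorch, not the authors' Scenic implementation; residual connections, layer normalization, and the MLP are omitted, and the shapes and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class FactorisedSelfAttentionBlock(nn.Module):
    """Illustrative sketch of Model 3-style factorised self-attention:
    attention is applied first over space (tokens within the same frame),
    then over time (tokens sharing the same spatial position)."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, n_t, n_s, dim), where n_s = n_h * n_w
        b, n_t, n_s, d = tokens.shape

        # Spatial attention: fold time into the batch so each frame attends only to itself.
        x = tokens.reshape(b * n_t, n_s, d)
        x, _ = self.spatial_attn(x, x, x)
        x = x.reshape(b, n_t, n_s, d)

        # Temporal attention: fold space into the batch so each spatial position
        # attends only across frames.
        x = x.permute(0, 2, 1, 3).reshape(b * n_s, n_t, d)
        x, _ = self.temporal_attn(x, x, x)
        x = x.reshape(b, n_s, n_t, d).permute(0, 2, 1, 3)
        return x

block = FactorisedSelfAttentionBlock()
out = block(torch.randn(2, 8, 196, 768))   # e.g. 8 frames of 14x14 patches
print(out.shape)                           # torch.Size([2, 8, 196, 768])
```

Folding one axis into the batch dimension replaces a single quadratic attention pass over all n_t x n_h x n_w tokens with two much smaller passes, which is where the complexity savings quoted above come from.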
From c0c6617fd26abcb1b1975a2a902a997164ebebad Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 14 Nov 2024 17:35:03 +0900 Subject: [PATCH 27/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index ff6f43ef1..bfb258bf2 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -18,7 +18,7 @@ The abstract from the paper is as follows; *Inspired by the Transformer scaling successes in NLP, we experiment with applying a standard Transformer directly to images, with the fewest possible modifications. To do so, we split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer. Image patches are treated the same way as tokens (words) in an NLP application. We train the model on image classification in supervised fashion.*
- Vision transformer architecture +Vision transformer architecture
ViT architecture. Taken from the original paper. From fff1703677f99c50737e056c32aef7f7a0a20af6 Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 14 Nov 2024 17:35:13 +0900 Subject: [PATCH 28/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index bfb258bf2..08d568e0f 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -33,7 +33,7 @@ The key techniques proposed in the ViT paper are as follows: ### Performance & Limitation
- Vision transformer performance +Vision transformer performance
Comparison with SOTA models. Taken from the original paper.

From 59ed64d07c3016e8fa211228d363c8afa10dc591 Mon Sep 17 00:00:00 2001
From: Jiwook Han <33192762+mreraser@users.noreply.github.com>
Date: Thu, 14 Nov 2024 17:35:21 +0900
Subject: [PATCH 29/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx

---
 .../en/unit7/video-processing/transformers-based-models.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx
index 08d568e0f..42f0fe6c6 100644
--- a/chapters/en/unit7/video-processing/transformers-based-models.mdx
+++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx
@@ -48,7 +48,7 @@ The abstract from the [paper](https://arxiv.org/abs/2103.15691) is as follows:
 *We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatiotemporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several, efficient variants of our model which factorise the spatial- and temporal-dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we release code at https://github.com/google-research/scenic.*
 
 <div class="flex justify-center">
- ViViT architecture +ViViT architecture
ViViT architecture. Taken from the original paper. From bcb5e8f35533631d18116bf9db6a8633885d22cb Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 14 Nov 2024 17:35:30 +0900 Subject: [PATCH 30/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index 42f0fe6c6..7b98cc333 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -74,7 +74,7 @@ These two techniques represent fundamental approaches in video processing that h #### Uniform Frame Sampling
- Uniform frame sampling +Uniform frame sampling
Uniform Frame Sampling. Taken from the original paper. From 38e2da25ff0f6e53e69c335501122d357f54a4bb Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 14 Nov 2024 17:35:39 +0900 Subject: [PATCH 31/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index 7b98cc333..54cd9fe95 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -109,7 +109,7 @@ However, using attention on all spatio-temporal tokens can lead to heavy computa The approach in Model 1 was somewhat inefficient, as it contextualized all patches simultaneously. To improve upon this, Model 2 separates the spatial and temporal encoders sequentially.
- ViViT model 2 +ViViT model 2
Factorised encoder (Model 2). Taken from the original paper. From 8bc4722f8a7ebea206799cab614a9e7487e6c4db Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 14 Nov 2024 17:35:46 +0900 Subject: [PATCH 32/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index 54cd9fe95..383668b36 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -120,7 +120,7 @@ First, only spatial interactions are contextualized through a Spatial Transforme #### Model 3 : Factorised Self-Attention
- ViViT model 3 +ViViT model 3
Factorised Self-Attention (Model 3). Taken from the original paper. From 2fbe7f97ffcb45beed6891eb3da932b6fa39c5a9 Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 14 Nov 2024 17:35:52 +0900 Subject: [PATCH 33/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index 383668b36..459ae5d44 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -131,7 +131,7 @@ In model 3, instead of computing multi-headed self-attention across all pairs of #### Model 4 : Factorized dot-product attention
- ViViT model 4 +ViViT model 4
Factorised Dot-Product Attention (Model 4). Taken from the original paper. From c6d8a3e5c2df85bdfb60e72b6b03081ed951fa95 Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 14 Nov 2024 17:35:59 +0900 Subject: [PATCH 34/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index 459ae5d44..fc138991a 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -142,7 +142,7 @@ In model 4, half of the attention heads are designed to operate with keys and va ### Experiments and Discussion
- ViViT model performance +ViViT model performance
Comparison of model architectures (Top 1 accuracy). Taken from the original paper.

From dab2328d46b6af973da1083549150f2b1473a644 Mon Sep 17 00:00:00 2001
From: Jiwook Han <33192762+mreraser@users.noreply.github.com>
Date: Thu, 14 Nov 2024 17:36:05 +0900
Subject: [PATCH 35/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx

---
 .../en/unit7/video-processing/transformers-based-models.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx
index fc138991a..bae5ae26b 100644
--- a/chapters/en/unit7/video-processing/transformers-based-models.mdx
+++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx
@@ -155,7 +155,7 @@ After comparing Models 1, 2, 3, and 4, it is evident that Model 1 achieved the b
 TimeSFormer is concurrent work with ViViT that also applies Transformers to video classification. The following sections explain each type of attention.
 
 <div class="flex justify-center">
- TimeSFormer model +TimeSFormer model
Visualization of the five space-time self-attention schemes. Taken from the original paper. From 487a44211f415c24320e23ad2b438d0666fcc52b Mon Sep 17 00:00:00 2001 From: Jiwook Han <33192762+mreraser@users.noreply.github.com> Date: Thu, 14 Nov 2024 17:36:13 +0900 Subject: [PATCH 36/36] Update chapters/en/unit7/video-processing/transformers-based-models.mdx --- .../en/unit7/video-processing/transformers-based-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx index bae5ae26b..a5c9e7d6f 100644 --- a/chapters/en/unit7/video-processing/transformers-based-models.mdx +++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx @@ -84,7 +84,7 @@ e. g. one frame per every 2 frames. #### Tubelet Embedding
- Tubelet embedding +Tubelet embedding
Tubelet embedding. Taken from the original paper.
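To round off the tokenization methods, here is a minimal tubelet-embedding sketch in PyTorch. It illustrates the idea rather than reproducing the official ViViT code; the clip shape, the tubelet size `(2, 16, 16)`, and `embed_dim` are assumed values.

```python
import torch
import torch.nn as nn

# Toy clip: (batch, channels, frames, height, width), an illustrative shape.
video = torch.randn(1, 3, 16, 224, 224)

tubelet = (2, 16, 16)   # (t, h, w): each token covers 2 frames x a 16x16 patch
embed_dim = 768

# Tubelet embedding: a single 3D convolution whose stride equals the tubelet size
# extracts non-overlapping spatio-temporal "tubes" and linearly projects each one.
tubelet_embed = nn.Conv3d(3, embed_dim, kernel_size=tubelet, stride=tubelet)

tokens = tubelet_embed(video)               # (1, 768, 8, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 8 * 14 * 14, 768) = (1, 1568, 768)
print(tokens.shape)
```

In contrast to uniform frame sampling, where fusing temporal information is left entirely to the transformer, each tubelet token already mixes information from neighboring frames at the embedding stage.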