swiss-ai · markus583 · Jul 8, 2024 · Jul 11, 2024 · Jul 11, 2024 · Jul 15, 2024
diff --git a/.gitignore b/.gitignore
@@ -10,4 +10,13 @@ wandb/
 *.DS_Store
 tokenizer_ckpts/
 *pkl
-*.egg-info
+*.egg-info
+*.tar
+*.mp4
+*.npy
+*.m4a
+*.json
+*.jsonl
+*.safetensors
+build/**
+slurm_cache/**
diff --git a/commands.txt b/commands.txt
@@ -0,0 +1,4 @@
+watch -n 1 "date '+%Y-%m-%d %H:%M:%S' >> ram.txt && free -h | tee -a ram.txt"
+
+OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 save_vq_tokens_vid.py --data_root=/store/swissai/a08/data/raw/howto100m/v2d_1000/ --every_nth_frame 1000 --num_workers 2 --world_size 4
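The `torchrun` command above combines frame subsampling (`--every_nth_frame 1000`) with a four-process launch (`--world_size 4`). `save_vq_tokens_vid.py` is not part of this diff, so the following is only a minimal sketch of how such flags are commonly interpreted. The helper names `frame_indices` and `shard_for_rank` are hypothetical, and the round-robin split is an assumption, not the script's confirmed behavior.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def frame_indices(num_frames: int, every_nth_frame: int) -> List[int]:
    """Return the indices kept when taking one frame out of every `every_nth_frame`.

    Assumption: sampling starts at frame 0.
    """
    if every_nth_frame < 1:
        raise ValueError("every_nth_frame must be >= 1")
    return list(range(0, num_frames, every_nth_frame))


def shard_for_rank(items: Sequence[T], rank: int, world_size: int) -> List[T]:
    """Round-robin split so that each rank gets a disjoint subset of `items`.

    Assumption: work is divided by index modulo `world_size`.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return list(items[rank::world_size])
```

Under these assumptions, a 2500-frame video sampled with `every_nth_frame=1000` keeps frames 0, 1000, and 2000, and with `world_size=4` each process handles roughly a quarter of the video list.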
diff --git a/pseudolabeling/DATA.md b/pseudolabeling/DATA.md
@@ -0,0 +1,124 @@
+# Data
+## Instruction Training Data
+<!-- > *originated from [Videochat2](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2)* -->
+
+
+For training, we leveraged the video instruction tuning data from [Videochat2](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2). 
+
+#### 1. Download the JSON annotation files from Hugging Face.
+[![Dataset meta](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-VideoChat2%20IT-blue)](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT) 
+
+<!-- > ![images](./assert/data.png) -->
+
+#### 2. Download the raw videos from the following links.
+The video directories can be found in tasks/train/instruction_data.py. You can also change them to your own saved paths.
+
+- [VideoChat](https://github.com/OpenGVLab/InternVideo/tree/main/Data/instruction_data): Based on [InternVid](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid), download the processed version directly [here](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/videochat2/data/videochat2_conversation_videos.zip)
+- [VideoChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT/tree/main/data)
+- [Kinetics-710](https://github.com/OpenGVLab/UniFormerV2/blob/main/DATASET.md), download Kinetics 400/600/700 [here](https://openxlab.org.cn/datasets?keywords=kinetics).
+- [SthSthV2](https://developer.qualcomm.com/software/ai-datasets/something-something): Option candidates were generated from [UMT](https://github.com/OpenGVLab/unmasked_teacher) top-20 predictions.
+- [NExTQA](https://github.com/doc-doc/NExT-QA)
+- [CLEVRER](https://clevrer.csail.mit.edu/)
+- [WebVid](https://maxbain.com/webvid-dataset/)
+- [YouCook2](https://youcook2.eecs.umich.edu/), download the processed version [here](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/videochat2/data/youcook_split_videos.zip).
+- [TextVR](https://github.com/callsys/textvr)
+- [TGIF](https://github.com/YunseokJANG/tgif-qa)
+- [EgoQA](https://ego4d-data.org/), download the processed version [here](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/videochat2/data/egoqa_split_videos.zip).
+
+#### 3. We also provide our processed json annotation files here.
+
+[![Dataset meta](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-magic%5Fjsons-blue)](https://huggingface.co/datasets/cathyxl/magic_jsons) 
+
+
+<!-- We leveraged the training data from [Videochat2](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2). We only used the video part for video instruct tuning. -->
+
+## Evaluation Data & Others
+Follow this section to obtain the open evaluation resources.
+
+### VCGBench
+
+We refer to the VideoChatGPT video question answering evaluation as VCGBench in this repo. We followed the original [repo](https://github.com/mbzuai-oryx/Video-ChatGPT/tree/main) to prepare the evaluation data.
+
+### MVBench
+We follow the original [Videochat2 repo](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2) in setting up the MVBench evaluation. You can also find helpful resources at their [Hugging Face repo](https://huggingface.co/datasets/OpenGVLab/MVBench).
+
+
+### Videoqabench
+We refer to all other video question answering benchmarks as videoqabench in this repo. They are mainly prepared following the original repos, listed below:
+1. [MSVD](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/) & [MSRVTT](https://github.com/xudejing/video-question-answering)
+2. [Activity Net](https://github.com/MILVLG/activitynet-qa/tree/master)
+3. [TGIF](https://github.com/raingo/TGIF-Release/tree/master)
+
+Other repos that integrate these benchmarks are also helpful when setting up the evaluation data:
+- [VideoChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT/tree/main)
+- [VideoLlava](https://github.com/PKU-YuanGroup/Video-LLaVA/tree/main/videollava)
+- [IG-VLM](https://github.com/imagegridworth/IG-VLM/tree/main)
+
+
+
+### Recaptioning
+#### Inter4k
+
+This dataset contains 1000 high-resolution video samples. We prepare the data following the instructions on the [official website](https://alexandrosstergiou.github.io/datasets/Inter4K/index.html).
+
+#### Extending Recaptioning
+The recaptioning part is designed to be extendable.
+
+The inference script [tasks/eval/recaption/pllava_recaption.py](tasks/eval/recaption/pllava_recaption.py) uses the dataset class [RecaptionDataset](tasks/eval/recaption/__init__.py#L197). The dataset details are kept in its `data_list_info` attribute:
+```python
+data_list_info = OrderedDict({
+        # "Panda70M": OrderedDict(
+        #     json_relpath="Panda70M/annotations.json", 
+        #     prefix="DATAS/Recaption/Panda70M/videos", 
+        #     data_type="video", 
+        #     bound=False,
+        #     key_rename_map={
+        #         # 'caption': 'hint',
+        #     },
+        #     name_key='video_name',
+        #     postfix=('mp4', 'mkv', 'webm'),
+        #     recaption_type=RecaptionSample,
+        # ), # has no start & end
+        "Inter4K": OrderedDict(
+            json_relpath="Inter4K/annotations.json", 
+            prefix="DATAS/Recaption/Inter4K/60fps/UHD", 
+            data_type="video", 
+            bound=False,
+            key_rename_map={
+                # 'caption': 'hint',
+            },
+            name_key='video_name',
+            postfix=('mp4', 'mkv', 'webm'),
+            recaption_type=CaptionSample,
+        ), # has no start & end
+    })
+```
+Each entry gives the path to an annotation JSON file. The file holds a list in which every item is a sample to be captioned. For example, `Inter4K/annotations.json` looks like:
+```json
+[
+    {
+        "video_name": "973"
+    },
+    ...
+]
+```
+and the directory DATAS/Recaption/Inter4K/60fps/UHD would look like:
+```
+$ ls DATAS/Recaption/Inter4K/60fps/UHD
+1.mp4 134.mp4  170.mp4 ....
+```
+
+When captioning directly, only the video itself is needed, so the annotation file only has to list the name of each video under the `prefix` directory.
+
+Extending a dataset for captioning consists of the following steps:
+1. Download all the videos.
+2. Construct an annotations.json file in the format shown above.
+3. Configure the recaption dataset [here](tasks/eval/recaption/__init__.py#L197), where you need to set:
+    - json_relpath: the annotation relative path
+    - prefix: root directory for videos
+    - postfix: a list containing all the file extensions for these videos
+
+The other options are experimental, so stick with the default settings used for Inter4K. The recommended video length is around 5-20 seconds.
+
+P.S. `bound` is meant to ensure that the video passed to the model contains no scene transitions. This part has not been tested, so set `bound` to `False` and make sure each original video file is a single clip. Feel free to explore and contribute to PLLaVA!
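The annotation step described in DATA.md can be sketched as follows. This is a minimal sketch, assuming that each `video_name` is the file name without its extension (as the `"973"` example and the `1.mp4 134.mp4` listing suggest) and that the loader resolves the extension from `postfix`. The helper names `build_annotations` and `write_annotations` are hypothetical and not part of PLLaVA.

```python
import json
from pathlib import Path
from typing import Dict, List, Sequence


def build_annotations(
    video_dir: str,
    postfix: Sequence[str] = ("mp4", "mkv", "webm"),
    name_key: str = "video_name",
) -> List[Dict[str, str]]:
    """List every video under `video_dir` as {name_key: <file name without extension>}."""
    suffixes = {"." + ext.lower() for ext in postfix}
    names = sorted(
        path.stem
        for path in Path(video_dir).iterdir()
        if path.is_file() and path.suffix.lower() in suffixes
    )
    return [{name_key: name} for name in names]


def write_annotations(video_dir: str, json_path: str) -> int:
    """Write the annotation list to `json_path` and return the number of samples."""
    annotations = build_annotations(video_dir)
    out_path = Path(json_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        json.dump(annotations, f, indent=4)
    return len(annotations)
```

For the Inter4K layout, `video_dir` would be the `prefix` directory and `json_path` the file that `json_relpath` points to. Files whose extension is not in `postfix` are skipped, and names are sorted as strings.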
diff --git a/pseudolabeling/MODELS/pllava-13b/.gitattributes b/pseudolabeling/MODELS/pllava-13b/.gitattributes
@@ -0,0 +1,35 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
diff --git a/pseudolabeling/MODELS/pllava-13b/README.md b/pseudolabeling/MODELS/pllava-13b/README.md
@@ -0,0 +1,39 @@
+---
+license: apache-2.0
+tags:
+- video LLM
+datasets:
+- OpenGVLab/VideoChat2-IT
+---
+
+
+# PLLaVA Model Card
+## Model details
+**Model type:** 
+PLLaVA-13B is an open-source video-language chatbot trained by fine-tuning an image LLM on video instruction-following data. It is an auto-regressive language model, based on the transformer architecture. Base LLM: llava-hf/llava-v1.6-vicuna-13b-hf
+
+**Model date:**
+PLLaVA-13B was trained in April 2024.
+
+**Paper or resources for more information:**
+- github repo: https://github.com/magic-research/PLLaVA
+- project page: https://pllava.github.io/
+- paper link: https://arxiv.org/abs/2404.16994
+
+## License
+llava-hf/llava-v1.6-vicuna-13b-hf license.
+
+**Where to send questions or comments about the model:**
+https://github.com/magic-research/PLLaVA/issues
+
+## Intended use
+**Primary intended uses:**
+The primary use of PLLaVA is research on large multimodal models and chatbots.
+
+**Primary intended users:**
+The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
+
+## Training dataset
+Video-Instruct-Tuning data of OpenGVLab/VideoChat2-IT
+## Evaluation dataset
+A collection of 6 benchmarks, including 5 VQA benchmarks and 1 recent benchmark specifically proposed for Video-LMMs.
diff --git a/pseudolabeling/MODELS/pllava-13b/tokenizer.model b/pseudolabeling/MODELS/pllava-13b/tokenizer.model
diff --git a/pseudolabeling/MODELS/pllava-7b/.gitattributes b/pseudolabeling/MODELS/pllava-7b/.gitattributes
@@ -0,0 +1,35 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
diff --git a/pseudolabeling/MODELS/pllava-7b/README.md b/pseudolabeling/MODELS/pllava-7b/README.md
@@ -0,0 +1,39 @@
+---
+license: apache-2.0
+tags:
+- video LLM
+datasets:
+- OpenGVLab/VideoChat2-IT
+---
+
+
+# PLLaVA Model Card
+## Model details
+**Model type:** 
+PLLaVA-7B is an open-source video-language chatbot trained by fine-tuning an image LLM on video instruction-following data. It is an auto-regressive language model, based on the transformer architecture. Base LLM: llava-hf/llava-v1.6-vicuna-7b-hf
+
+**Model date:**
+PLLaVA-7B was trained in April 2024.
+
+**Paper or resources for more information:**
+- github repo: https://github.com/magic-research/PLLaVA
+- project page: https://pllava.github.io/
+- paper link: https://arxiv.org/abs/2404.16994
+
+## License
+llava-hf/llava-v1.6-vicuna-7b-hf license.
+
+**Where to send questions or comments about the model:**
+https://github.com/magic-research/PLLaVA/issues
+
+## Intended use
+**Primary intended uses:**
+The primary use of PLLaVA is research on large multimodal models and chatbots.
+
+**Primary intended users:**
+The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
+
+## Training dataset
+Video-Instruct-Tuning data of OpenGVLab/VideoChat2-IT
+## Evaluation dataset
+A collection of 6 benchmarks, including 5 VQA benchmarks and 1 recent benchmark specifically proposed for Video-LMMs.
diff --git a/pseudolabeling/MODELS/pllava-7b/tokenizer.model b/pseudolabeling/MODELS/pllava-7b/tokenizer.model
diff --git a/pseudolabeling/PLLaVA b/pseudolabeling/PLLaVA