add save_vq_tokens_vid.py #14
base: main
Conversation
For some early testing, this already passes using 3 shards, each containing a different number of (different) videos:

```python
import tarfile
import numpy as np
from io import BytesIO
import os


def read_tarfile(file_path):
    """Print the shape and dtype of every .npy array stored in a tar shard."""
    with tarfile.open(file_path, "r") as tar:
        for member in tar.getmembers():
            if member.isfile():
                file_obj = tar.extractfile(member)
                if file_obj:
                    bio = BytesIO(file_obj.read())
                    bio.seek(0)
                    array = np.load(bio)
                    print(
                        f"File: {member.name}, Shape: {array.shape}, Dtype: {array.dtype}"
                    )


def process_directory(directory):
    """Walk a directory tree and inspect every .tar shard found in it."""
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".tar"):
                file_path = os.path.join(root, file)
                print(f"Processing file: {file_path}")
                read_tarfile(file_path)


# tar dir
directory_path = "/store/swissai/a08/data/4m-data/train/video_rgb_tok/video_rgb"
process_directory(directory_path)
```
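(The script streams each .npy directly from the tar member's bytes via BytesIO, so shards never have to be unpacked to disk.)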
Tokenizing 6 videos (~40k frames/images in total) takes ~20 minutes on todi (CPU only, not GPU!), plus some 5 minutes to start up the dataloader (L322). @garjania does this seem reasonable to you, or too slow?
TODOs before merging:
re 2 (from @garjania):
--> Ignore masking for now; keep as is.
@markus583 How many frames do these videos have in total?
@garjania 40k frames in total
With 40k frames the start-up time makes sense, but I guess it's still a lot if we want to scale up the number of videos. For the tokenization time I don't have an estimate for H100, but on 8 A100s it should take around 10 minutes.
On a side note, the original 4M script expects an additional split subdirectory in the path.
Open issues:
TODO @kdu4108: switch to smaller shards (something really small, like 10), try it out at scale with tokenization (and maybe other stuff), and check efficiency.
Using 10 workers (each processing 1 shard of ~90 videos in parallel --> ~900 videos) and taking every 10th frame, this takes ~22 minutes (BUT: only using 1 GPU & RAM not maxed out, so this is more like a lower bound w.r.t. the time needed).
TODO: check if we can store already pre-processed images in fp16 (via .half()) before feeding them into the model (why? saves almost 50% of RAM --> more shards processed in parallel --> faster overall tokenization); see the sketch after this list.
TODO: check using all 4 GPUs once 1) we have smaller + more + well-defined shards and 2) Tödi is back.
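A minimal sketch of the fp16 idea above (the `preprocess_frames` helper is hypothetical; the real pre-processing lives in save_vq_tokens_vid.py):

```python
import numpy as np
import torch


def preprocess_frames(frames: np.ndarray) -> torch.Tensor:
    # Hypothetical stand-in for the real pre-processing:
    # uint8 THWC frames -> normalized float TCHW tensor.
    return torch.from_numpy(frames).permute(0, 3, 1, 2).float() / 255.0


# Dummy clip: 16 frames of 224x224 RGB.
frames = np.zeros((16, 224, 224, 3), dtype=np.uint8)

# Keep the pre-processed frames in fp16 while they wait in RAM:
# 2 bytes/element instead of 4, i.e. ~50% less memory per shard,
# so roughly twice as many shards can be held in parallel.
x = preprocess_frames(frames).half()

# Cast back (or run the model in fp16 on GPU) just before the forward pass.
x = x.float()
```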
Updates:
Tested on both captioning and tokenization. JSON/NPY filenames are now the same as in the source tarfile. Open issues:
Tokenization output is verified and works as intended. FP16 intermediate saving is also just fine. The prompt to get clip descriptions should be tested once we have a model. This warrants some qualitative and quantitative (model performance / [GROOViST](https://aclanthology.org/2023.emnlp-main.202/)) evals.
See #9
Remaining issues (see XXX/TODO/FIXME in the code):
Note: Currently, the script does not use the webdataset library.
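For comparison, a minimal sketch of what iterating over the same shards with webdataset could look like (the shard-name pattern is an assumption; the script currently walks the directory with os.walk + tarfile instead):

```python
import io
import numpy as np
import webdataset as wds

# Assumed shard naming; brace expansion enumerates the .tar files.
shards = (
    "/store/swissai/a08/data/4m-data/train/video_rgb_tok/video_rgb/"
    "shard-{000000..000009}.tar"
)

# Without .decode(), each sample is a dict mapping file extensions
# inside the tar to raw bytes, plus a "__key__" entry.
for sample in wds.WebDataset(shards):
    for key, value in sample.items():
        if key.endswith("npy"):
            array = np.load(io.BytesIO(value))
            print(sample["__key__"], array.shape, array.dtype)
```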