add save_vq_tokens_vid.py #14 (Open)

wants to merge 16 commits into main
Conversation

@markus583 commented Jul 8, 2024

See #9

Remaining issues (see XXX/TODO/FIXME in the code):

  • Augmentations (especially masking)
  • Check/standardize paths
  • Is RGB conversion necessary? @kdu4108 (L132)
  • Speed/efficiency of the dataloader

Note: Currently, the script does not use the webdataset library.
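For reference, a minimal sketch of how such shards could later be consumed via webdataset (the brace-expanded shard pattern is a placeholder; untested against this PR's output):

import io

import numpy as np
import webdataset as wds

# Hedged sketch, not part of this PR: iterating the .npy shards via webdataset.
# Without .decode(), each sample is a dict of raw bytes keyed by file extension,
# plus the "__key__" entry.
dataset = wds.WebDataset("video_rgb/shard-{00000..00002}.tar")  # placeholder pattern
for sample in dataset:
    array = np.load(io.BytesIO(sample["npy"]))
    print(sample["__key__"], array.shape, array.dtype)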

@markus583 (Author) commented Jul 8, 2024

For some early testing, this already passes using 3 shards with differing numbers of videos (and no video shared between shards):

import os
import tarfile
from io import BytesIO

import numpy as np


def read_tarfile(file_path):
    """Print shape and dtype of every .npy array stored in a tar shard."""
    with tarfile.open(file_path, "r") as tar:
        for member in tar.getmembers():
            if member.isfile():
                file_obj = tar.extractfile(member)
                if file_obj:
                    bio = BytesIO(file_obj.read())
                    bio.seek(0)
                    array = np.load(bio)
                    print(
                        f"File: {member.name}, Shape: {array.shape}, Dtype: {array.dtype}"
                    )


def process_directory(directory):
    """Walk `directory` recursively and verify every .tar shard in it."""
    for root, _dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".tar"):
                file_path = os.path.join(root, file)
                print(f"Processing file: {file_path}")
                read_tarfile(file_path)


# Directory containing the tokenized video shards
directory_path = "/store/swissai/a08/data/4m-data/train/video_rgb_tok/video_rgb"

process_directory(directory_path)

@markus583 (Author)

Tokenizing 6 videos (40k frames/images in total) takes ~20 minutes on todi (CPU only, not GPU!), plus some 5 minutes to start up the dataloader (L322: for imgs_batch, tokens_paths in data_loader:). Especially the data loading seems a bit inefficient (TODO: check via profiling; otherwise maybe just use a lower batch_size_dataloader?)

@garjania does this seem reasonable to you or too slow?
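For the profiling TODO, a minimal sketch (data_loader is assumed to be the DataLoader from save_vq_tokens_vid.py; the timing wrapper is illustrative):

import cProfile
import pstats
import time

# Illustrative profiling sketch for the loop at L322; `data_loader` comes
# from save_vq_tokens_vid.py.
profiler = cProfile.Profile()
profiler.enable()

t0 = time.perf_counter()
for imgs_batch, tokens_paths in data_loader:
    t1 = time.perf_counter()
    print(f"waited {t1 - t0:.2f}s for this batch")  # time spent in data loading
    t0 = time.perf_counter()

profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)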

@kdu4108 (Collaborator) commented Jul 8, 2024

TODOs before merging:

  1. cleanup (e.g. comments + naming fixes)
  2. figure out the masking value thing/if it's relevant (blocked - dependent on insights from @garjania)
  3. speed up with profiling if necessary
  4. try running with videos from /store/swissai/a08/data/raw/hdvila/hd_vila_v2d and howto100m
  5. spot-check augmentations

@kdu4108 (Collaborator) commented Jul 8, 2024

  1. Write sbatch/slurm scripts to run this at scale

@markus583 (Author)

re 2 (from @garjania):

> You can ignore it for RGB tokenization. It's used for depth and surface normals from synthetic datasets that can potentially have holes (undefined depth or surface-normal values); the mask specifies which pixels contain such holes.

--> Ignore masking for now; keep as is.
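For illustration only (hypothetical arrays): the hole mask simply flags which pixels carry defined values:

import numpy as np

# Hypothetical example of the hole mask described above: synthetic depth maps
# may contain undefined pixels ("holes"); the mask flags the valid ones.
depth = np.random.rand(256, 256).astype(np.float32)
depth[100:120, 50:80] = 0.0            # pretend this region has no defined depth
valid_mask = depth > 0                 # True where depth is defined
mean_depth = depth[valid_mask].mean()  # statistics over valid pixels only
print(f"{valid_mask.mean():.1%} of pixels are valid, mean depth {mean_depth:.3f}")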

@garjania commented Jul 8, 2024

> Tokenizing 6 videos (40k frames/images in total) takes ~20 minutes on todi (CPU only, not GPU!), plus some 5 minutes to start up the dataloader. Especially the data loading seems a bit inefficient.
>
> @garjania does this seem reasonable to you or too slow?

@markus583 How many frames do these videos have in total?

@markus583 (Author)

@garjania 40k frames in total

@garjania commented Jul 8, 2024

With 40k frames, the start-up time makes sense, but I guess it's still a lot if we want to scale up the number of videos. For the tokenization time I don't have an estimate for H100, but on 8 A100s it should take around 10 minutes.

@markus583 (Author) commented Jul 9, 2024

On a side note, the original 4M script expects an additional split subdirectory (e.g., train --> data_root/train).
I guess we should also adopt this structure? Or get rid of it altogether? @kdu4108

The path would become e.g. /store/swissai/a08/data/raw/hdvila/v2d_backup/train/video_rgb, where data_root is only /store/swissai/a08/data/raw/hdvila/v2d_backup/.
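For illustration, the layout would then look like this (shard names are placeholders):

/store/swissai/a08/data/raw/hdvila/v2d_backup/   <- data_root
└── train/                                       <- split subdirectory expected by 4M
    └── video_rgb/                               <- modality directory
        ├── 00000.tar                            <- placeholder shard name
        └── ...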

@markus583 (Author)

Status of the TODOs above:

  1. Cleanup (comments + naming fixes): done.
  2. Masking: see above; ignore for RGB tokenization.
  3. Speed: seems OK (only memory issues remain, to be figured out, but they appear specific to todi).
  4. Running with videos from /store/swissai/a08/data/raw/hdvila/hd_vila_v2d and howto100m: tried, works! But it is recommended to downsample frames and use shards with few videos to avoid memory issues.
  5. Spot-checking augmentations: done, works too! NOTE: this is only reasonable if we do not strongly crop in #10 (Transform from v2d format into video_rgb format and save in video_rgb/ directory).

@markus583 (Author) commented Jul 11, 2024

Open issues:

  1. Memory usage on todi (premature OOM errors). If this persists, re-write the script to process individual videos in parallel instead of video shards (but that is more time-consuming...)
  2. Write sbatch/slurm scripts to run this at scale
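For item 2, a minimal sbatch sketch (all resource values and script arguments are placeholder assumptions; adapt to the actual interface of save_vq_tokens_vid.py):

#!/bin/bash
#SBATCH --job-name=save_vq_tokens_vid
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --time=04:00:00

# Placeholder invocation; one torchrun process group per node, 4 GPUs each.
OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 save_vq_tokens_vid.py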

@markus583 (Author)

TODO @kdu4108: switch to smaller shards (something really small, like 10 videos per shard), try it out at scale with tokenization (and maybe other tasks), and check efficiency.

@markus583 (Author)

Using 10 workers (each processing 1 shard of ~90 videos in parallel --> ~900 videos total) and taking every 10th frame, this takes ~22 minutes. (BUT: this uses only 1 GPU and does not max out RAM, so it is more of a lower bound w.r.t. the time needed.)
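Roughly, the shard-parallel setup described above looks like this (process_shard is a stand-in for the per-shard logic in save_vq_tokens_vid.py; shard names are placeholders):

from multiprocessing import Pool

FRAME_STRIDE = 10  # keep every 10th frame


def process_shard(shard_path: str) -> str:
    # Stand-in for the real per-shard work: decode the videos in the shard,
    # keep every FRAME_STRIDE-th frame, tokenize, and write the outputs.
    return shard_path


if __name__ == "__main__":
    shard_paths = [f"{i:05d}.tar" for i in range(10)]  # ~90 videos per shard
    with Pool(processes=10) as pool:  # one worker per shard
        pool.map(process_shard, shard_paths)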

@markus583 (Author)

TODO: check if we can store the already pre-processed images in fp16 (via .half()) before feeding them into the model.

(Why? It saves almost 50% of RAM --> more shards processed in parallel --> faster overall tokenization.)
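A minimal sketch of the idea (imgs_batch and its shape stand in for the pre-processed frames coming out of the dataloader):

import torch

# Hypothetical sketch: buffer pre-processed frames in fp16 to roughly halve
# their RAM footprint; `imgs_batch` stands in for the dataloader output.
imgs_batch = torch.randn(32, 3, 224, 224)  # fp32 frames after preprocessing
imgs_fp16 = imgs_batch.half()              # ~50% of the fp32 memory

# Cast back to the dtype the tokenizer expects right before the forward pass,
# so the model itself is unaffected by the intermediate storage format.
model_input = imgs_fp16.float()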

@markus583 (Author)

TODO: check using all 4 GPUs once we 1) have smaller + more + well-defined shards and 2) Tödi is back

@markus583 (Author)

Updates:

  • Running on 4 GPUs works. Run with: OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 pseudolabeling/get_video_captions.py
  • Changed/adapted the file path from video_rgb to filtered_raw.
  • Fixed intra-JSON file paths in both the tokenization and captioning scripts.
  • Some cleanup.

Tested on both captioning and tokenization. JSON/NPY filenames are now the same as in the source tarfile.

Open issues:

  • Verify tokenization output via decoding; see check_half.py (the name comes from the fp16 intermediate saving, which also must be checked!)
  • Prompt for video clip descriptions

@markus583 (Author)

Tokenization output is verified and works as intended. FP16 intermediate saving is also just fine.

The prompt to get clip descriptions should be tested once we have a model. This warrants some qualitative and quantitative evals (model performance / [Groovist](https://aclanthology.org/2023.emnlp-main.202/)).
For now, it should be OK.
