add save_vq_tokens_vid.py #14 (Open)

wants to merge 16 commits into main
Conversation

@markus583 commented Jul 8, 2024

See #9

Remaining issues (see XXX/TODO/FIXME in the code):

  • Augmentations (especially masking)
  • Check/standardize paths
  • Is RGB conversion necessary? @kdu4108 (L132)
  • Speed/efficiency of the dataloader

Note: Currently, the script does not use the webdataset library.
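For reference, a minimal sketch of how such shards could later be consumed via webdataset (the brace-expanded shard pattern is a placeholder; untested against this PR's output):

import io

import numpy as np
import webdataset as wds

# Hedged sketch, not part of this PR: iterating the .npy shards via webdataset.
# Without .decode(), each sample is a dict of raw bytes keyed by file extension,
# plus the "__key__" entry.
dataset = wds.WebDataset("video_rgb/shard-{00000..00002}.tar")  # placeholder pattern
for sample in dataset:
    array = np.load(io.BytesIO(sample["npy"]))
    print(sample["__key__"], array.shape, array.dtype)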

@markus583 (Author) commented Jul 8, 2024

For some early testing, this already passes using 3 shards with differing numbers of videos (and no video shared between shards):

import os
import tarfile
from io import BytesIO

import numpy as np


def read_tarfile(file_path):
    """Print shape and dtype of every .npy array stored in a tar shard."""
    with tarfile.open(file_path, "r") as tar:
        for member in tar.getmembers():
            if member.isfile():
                file_obj = tar.extractfile(member)
                if file_obj:
                    bio = BytesIO(file_obj.read())
                    bio.seek(0)
                    array = np.load(bio)
                    print(
                        f"File: {member.name}, Shape: {array.shape}, Dtype: {array.dtype}"
                    )


def process_directory(directory):
    """Walk `directory` recursively and verify every .tar shard in it."""
    for root, _dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".tar"):
                file_path = os.path.join(root, file)
                print(f"Processing file: {file_path}")
                read_tarfile(file_path)


# Directory containing the tokenized video shards
directory_path = "/store/swissai/a08/data/4m-data/train/video_rgb_tok/video_rgb"

process_directory(directory_path)

@markus583 (Author)

Tokenizing 6 videos (40k frames/images in total) takes ~20 minutes on todi (CPU only, not GPU!), plus some 5 minutes to start up the dataloader (L322: for imgs_batch, tokens_paths in data_loader:). Especially the data loading seems a bit inefficient (TODO: check via profiling; otherwise maybe just use a lower batch_size_dataloader?)

@garjania does this seem reasonable to you or too slow?
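For the profiling TODO, a minimal sketch (data_loader is assumed to be the DataLoader from save_vq_tokens_vid.py; the timing wrapper is illustrative):

import cProfile
import pstats
import time

# Illustrative profiling sketch for the loop at L322; `data_loader` comes
# from save_vq_tokens_vid.py.
profiler = cProfile.Profile()
profiler.enable()

t0 = time.perf_counter()
for imgs_batch, tokens_paths in data_loader:
    t1 = time.perf_counter()
    print(f"waited {t1 - t0:.2f}s for this batch")  # time spent in data loading
    t0 = time.perf_counter()

profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)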

@kdu4108 (Collaborator) commented Jul 8, 2024

TODOs before merging:

  1. cleanup (e.g. comments + naming fixes)
  2. figure out the masking value thing/if it's relevant (blocked - dependent on insights from @garjania)
  3. speed up with profiling if necessary
  4. try running with videos from /store/swissai/a08/data/raw/hdvila/hd_vila_v2d and howto100m
  5. spot-check augmentations

@kdu4108 (Collaborator) commented Jul 8, 2024

  1. Write sbatch/slurm scripts to run this at scale

@markus583 (Author)

re 2 (from @garjania):

> You can ignore it for RGB tokenization. It's used for depth and surface normals from synthetic datasets that can potentially have holes (undefined depth or surface-normal values); the mask specifies which pixels contain such holes.

--> Ignore masking for now; keep as is.
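For illustration only (hypothetical arrays): the hole mask simply flags which pixels carry defined values:

import numpy as np

# Hypothetical example of the hole mask described above: synthetic depth maps
# may contain undefined pixels ("holes"); the mask flags the valid ones.
depth = np.random.rand(256, 256).astype(np.float32)
depth[100:120, 50:80] = 0.0            # pretend this region has no defined depth
valid_mask = depth > 0                 # True where depth is defined
mean_depth = depth[valid_mask].mean()  # statistics over valid pixels only
print(f"{valid_mask.mean():.1%} of pixels are valid, mean depth {mean_depth:.3f}")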

@garjania commented Jul 8, 2024

> Tokenizing 6 videos (40k frames/images in total) takes ~20 minutes on todi (CPU only, not GPU!), plus some 5 minutes to start up the dataloader. Especially the data loading seems a bit inefficient.
>
> @garjania does this seem reasonable to you or too slow?

@markus583 How many frames do these videos have in total?

@markus583 (Author)

@garjania 40k frames in total

@garjania commented Jul 8, 2024

With 40k frames, the start-up time makes sense, but I guess it's still a lot if we want to scale up the number of videos. For the tokenization time I don't have an estimate for H100, but on 8 A100s it should take around 10 minutes.

@markus583 (Author) commented Jul 9, 2024

On a side note, the original 4M script expects an additional split subdirectory (e.g., train --> data_root/train).
I guess we should also adopt this structure? Or get rid of it altogether? @kdu4108

The path would become e.g. /store/swissai/a08/data/raw/hdvila/v2d_backup/train/video_rgb, where data_root is only /store/swissai/a08/data/raw/hdvila/v2d_backup/.
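For illustration, the layout would then look like this (shard names are placeholders):

/store/swissai/a08/data/raw/hdvila/v2d_backup/   <- data_root
└── train/                                       <- split subdirectory expected by 4M
    └── video_rgb/                               <- modality directory
        ├── 00000.tar                            <- placeholder shard name
        └── ...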

@markus583 (Author)

Status of the TODOs above:

  1. Cleanup (comments + naming fixes): done.
  2. Masking: see above; ignore for RGB tokenization.
  3. Speed: seems OK (only memory issues remain, to be figured out, but they appear specific to todi).
  4. Running with videos from /store/swissai/a08/data/raw/hdvila/hd_vila_v2d and howto100m: tried, works! But it is recommended to downsample frames and use shards with few videos to avoid memory issues.
  5. Spot-checking augmentations: done, works too! NOTE: this is only reasonable if we do not strongly crop in #10 (Transform from v2d format into video_rgb format and save in video_rgb/ directory).

@markus583 (Author) commented Jul 11, 2024

Open issues:

  1. Memory usage on todi (premature OOM errors). If this persists, re-write the script to process individual videos in parallel instead of video shards (but that is more time-consuming...)
  2. Write sbatch/slurm scripts to run this at scale
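For item 2, a minimal sbatch sketch (all resource values and script arguments are placeholder assumptions; adapt to the actual interface of save_vq_tokens_vid.py):

#!/bin/bash
#SBATCH --job-name=save_vq_tokens_vid
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --time=04:00:00

# Placeholder invocation; one torchrun process group per node, 4 GPUs each.
OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 save_vq_tokens_vid.py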

@markus583 (Author)

TODO @kdu4108: switch to smaller shards (something really small, like 10 videos per shard), try it out at scale with tokenization (and maybe other tasks), and check efficiency.

@markus583 (Author)

Using 10 workers (each processing 1 shard of ~90 videos in parallel --> ~900 videos total) and taking every 10th frame, this takes ~22 minutes. (BUT: this uses only 1 GPU and does not max out RAM, so it is more of a lower bound w.r.t. the time needed.)
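Roughly, the shard-parallel setup described above looks like this (process_shard is a stand-in for the per-shard logic in save_vq_tokens_vid.py; shard names are placeholders):

from multiprocessing import Pool

FRAME_STRIDE = 10  # keep every 10th frame


def process_shard(shard_path: str) -> str:
    # Stand-in for the real per-shard work: decode the videos in the shard,
    # keep every FRAME_STRIDE-th frame, tokenize, and write the outputs.
    return shard_path


if __name__ == "__main__":
    shard_paths = [f"{i:05d}.tar" for i in range(10)]  # ~90 videos per shard
    with Pool(processes=10) as pool:  # one worker per shard
        pool.map(process_shard, shard_paths)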

@markus583 (Author)

TODO: check if we can store the already pre-processed images in fp16 (via .half()) before feeding them into the model.

(Why? It saves almost 50% of RAM --> more shards processed in parallel --> faster overall tokenization.)
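A minimal sketch of the idea (imgs_batch and its shape stand in for the pre-processed frames coming out of the dataloader):

import torch

# Hypothetical sketch: buffer pre-processed frames in fp16 to roughly halve
# their RAM footprint; `imgs_batch` stands in for the dataloader output.
imgs_batch = torch.randn(32, 3, 224, 224)  # fp32 frames after preprocessing
imgs_fp16 = imgs_batch.half()              # ~50% of the fp32 memory

# Cast back to the dtype the tokenizer expects right before the forward pass,
# so the model itself is unaffected by the intermediate storage format.
model_input = imgs_fp16.float()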

@markus583 (Author)

TODO: check using all 4 GPUs once we 1) have smaller + more + well-defined shards and 2) Tödi is back

@markus583 (Author)

Updates:

  • Running on 4 GPUs works. Run with: OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 pseudolabeling/get_video_captions.py
  • Changed/adapted the file path from video_rgb to filtered_raw.
  • Fixed intra-JSON file paths in both the tokenization and captioning scripts.
  • Some cleanup.

Tested on both captioning and tokenization. JSON/NPY filenames are now the same as in the source tarfile.

Open issues:

  • Verify tokenization output via decoding; see check_half.py (the name comes from the fp16 intermediate saving, which also must be checked!)
  • Prompt for video clip descriptions

@markus583 (Author)

Tokenization output is verified and works as intended. FP16 intermediate saving is also just fine.

The prompt to get clip descriptions should be tested once we have a model. This warrants some qualitative and quantitative evals (model performance / [Groovist](https://aclanthology.org/2023.emnlp-main.202/)).
For now, it should be OK.
