Add transcript + metadata processing #15
base: main
Conversation
Works fine on todi using
Can you share the error? Likely it's that the video is now private or no longer exists and so can't be downloaded.
How often does this happen?
We intentionally decided to ignore non-English videos for now to keep the scope smaller.
The timestamp range should be left inclusive, right exclusive. So if a clip is from timestamp 1m30 to 1m39 and frame A is at 1m29.9, frame B is at 1m30.0, ..., frame Y is at 1m39.9, and frame Z is at 1m40.0, then we would want frames B through Y (with A and Z excluded).
What is a bit more common is that there is no
Any idea when/why this occurs @kdu4108?
This not only contains transcripts but also other info.
Seems sensible. Let's hold off on the implementation until we have metadata containing transcripts with proper timestamps. --> TODO: adapt frame calculation
v2d_to_metadata.py
if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Process tarfiles contati JSONs and convert to structured JSONL format."
contati?
v2d_to_transcript.py
hrs, mins, secs = map(float, timestamp.split(":"))
total_seconds = timedelta(hours=hrs, minutes=mins, seconds=secs).total_seconds()
# TODO: is round the right way of doing this? Most transcripts are assigned to only 1-2 frames...
return round(total_seconds * fps)
We should be left inclusive and right exclusive.
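One way to get left-inclusive, right-exclusive behaviour is to floor the timestamp instead of rounding it. A minimal sketch, assuming the same colon-separated timestamp format as the snippet above (the function name is illustrative, not what the PR actually uses):

```python
import math
from datetime import timedelta

def timestamp_to_frame(timestamp: str, fps: float) -> int:
    """Map an HH:MM:SS(.fff) timestamp to a frame index via floor.

    A clip [start, end) then covers frames
    timestamp_to_frame(start, fps) .. timestamp_to_frame(end, fps) - 1,
    i.e. the start frame is included and the end frame is excluded.
    """
    hrs, mins, secs = map(float, timestamp.split(":"))
    total_seconds = timedelta(hours=hrs, minutes=mins, seconds=secs).total_seconds()
    return math.floor(total_seconds * fps)
```

For example, at 10 fps a clip from 00:01:30 to 00:01:40 would cover frames 900 through 999, matching the left-inclusive, right-exclusive convention.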
Looking very clean overall! It might be helpful to include here an example of the JSONs that are being converted into our format, for both metadata and transcripts.
Examples
Metadata:
Transcripts:
Thanks Markus, looks great! Two nits: (1) is there a reason it's called merge_data.py instead of train_val_test_split.py or something like that? And (2) can you add an example command showing how we would run that script in a comment? In particular, I want to clarify: does that splitting script take as input a modality folder like video_rgb or video_det, as opposed to the raw video folder?
Sure, renamed the script. I get this as output:
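As an aside on the split question above: one common way to implement a train/val/test split that stays consistent across modality folders (e.g. video_rgb vs. video_det) is to hash the video id rather than shuffle. This is only a rough illustration; the function name, ratios, and hashing scheme are assumptions, not what the renamed script actually does:

```python
import hashlib

def assign_split(video_id: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Deterministically assign a video id to train/val/test.

    Hashing the id (rather than shuffling) keeps the assignment stable
    across runs and modality folders, so the same video never leaks
    across splits.
    """
    h = int(hashlib.md5(video_id.encode("utf-8")).hexdigest(), 16)
    u = (h % 10_000) / 10_000  # pseudo-uniform value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"
```

Because the assignment depends only on the id, running the split independently on each modality folder yields the same partition.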
…ed because we don't have the container working yet to use whisper (cry)
6421dd5 to 1168fa0
Implements #16 and #12.
Transcript:
Output format (as jsonl for each video, with multiple videos in a tarfile):
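The layout described above (one JSONL per video, many videos per tarfile) could be written like this. A sketch only: the file name, field names, and helper are illustrative assumptions, not the PR's actual schema:

```python
import io
import json
import tarfile

def write_transcripts_tar(tar_path: str, videos: dict) -> None:
    """Write one JSONL member per video into a single tarfile.

    `videos` maps a video id to its list of transcript records;
    each record becomes one JSON line in `<video_id>.jsonl`.
    """
    with tarfile.open(tar_path, "w") as tar:
        for video_id, records in videos.items():
            payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")
            info = tarfile.TarInfo(name=f"{video_id}.jsonl")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

# Illustrative records; real field names may differ.
write_transcripts_tar(
    "transcripts.tar",
    {"vid001": [{"start": "00:00:01", "end": "00:00:04", "text": "hello"}]},
)
```

Reading a member back is the mirror image: open the tarfile, extract `<video_id>.jsonl`, and parse each line with `json.loads`.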
Open issues (also see TODO/FIXME in the code):
(Outdated now; see discussion below)
Metadata:
Very similar to the transcript structure, but saved to JSON instead of JSONL.
TODOs:
metadata/ directory. (#16 (comment))