
Video metadata processing no longer writes temp files #293

Open · wants to merge 47 commits into main from video_metadata_no_io

Conversation

@MattUnderscoreZhang
Contributor

No description provided.

@MattUnderscoreZhang
Contributor Author

Should be merged after #288.

Metadata-finding subsamplers (FFProbeSubsampler and CutDetectionSubsampler) no longer take byte streams, write to a temp file, and then operate on that temp file. Instead, we can pass a filepath directly to these subsamplers, and they extract the metadata without performing any additional I/O.
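
To illustrate the shape of the change, here is a minimal sketch (not the actual subsampler code; probe_video is an illustrative name) of extracting metadata straight from a filepath with the ffprobe CLI:

import json
import subprocess

def probe_video(filepath: str) -> dict:
    """Read stream/format metadata directly from a filepath, with no temp-file round trip."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", filepath],
        capture_output=True,
        check=True,  # raise if ffprobe cannot read the file
    )
    return json.loads(result.stdout)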

I will do the same for the video processing subsamplers in the next pull request.

This pull request has been tested with my usual workflow and reproduces the expected results.

@rom1504
Collaborator

rom1504 commented Jan 27, 2024

Can you rebase please?

@rom1504
Collaborator

rom1504 commented Jan 27, 2024

@MattUnderscoreZhang what speed difference do you observe? Sharing wandb links would be helpful.

@MattUnderscoreZhang
Contributor Author

I'm showing my noob status here, but I've never really used wandb. Could we maybe schedule a call sometime where you show me how to generate the links you're looking for?

Also, I don't think there should be much speedup at this stage anyway. I still need to perform a temporary write at the beginning of sample processing, since I haven't touched the dataloaders yet. That is, it would be nice if the dataloaders passed filepaths rather than byte streams, but that will have to be a later change. As it stands, I think I'm currently only saving a single read/write. The major savings should come in the next pull request, when I update the actual video processing subsamplers.
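
For reference, the remaining temp write is roughly the following (a hypothetical sketch; bytes_to_path is an illustrative name, not the actual dataloader code):

import tempfile

def bytes_to_path(video_bytes: bytes, suffix: str = ".mp4") -> str:
    """Spill an in-memory byte stream to disk once and return a filepath for the subsamplers."""
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as f:
        f.write(video_bytes)
    return f.name  # caller is responsible for deleting the temp file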

@rom1504
Collaborator

rom1504 commented Jan 28, 2024

I think it's quite important to check the speed for this kind of major change.
It could get worse.

@MattUnderscoreZhang
Contributor Author

MattUnderscoreZhang commented Jan 28, 2024

Here's a check of this branch vs. main on WandB.
https://wandb.ai/bucketoffish/video2dataset?workspace=user-bucketoffish
good-cherry-2 is this branch, which you can see is slightly faster (11.8% more vid_per_sec).
This test used 100 webvid videos, with 5 processes and 10 threads, 10 samples per shard, on my 2019 MacBook Pro.
[screenshot: WandB run comparison]

@MattUnderscoreZhang
Contributor Author

MattUnderscoreZhang commented Feb 3, 2024

Here's a comparison between this branch and the threading_fix branch, this time with a larger dataset of around 5k samples. The results are reversed here, with this branch being slightly slower, by about 4%. Each branch took about 2.5 hours to run, and I'm not sure how significant these results are given the variance. The difference between the last test and this one may be due to the threading fix, which has not been merged into the video_metadata_no_io branch yet.
[screenshot: WandB run comparison]

@rom1504
Collaborator

rom1504 commented Feb 3, 2024

@iejMac do you have some numbers for how many vid/s you reached on webvid?

@MattUnderscoreZhang how many workers are you using?

@MattUnderscoreZhang
Contributor Author

MattUnderscoreZhang commented Feb 3, 2024

I'm using 5 processes with 2 threads each. 100 samples per shard.

subsampling:
    FrameSubsampler:
        args:
            frame_rate: 8
    ResolutionSubsampler:
        args:
            width: 128
            height: 224
            resize_mode: "scale,crop,pad"
    CutDetectionSubsampler:
        cuts_are_clips: True
        args:
            cut_detection_mode: "all"
            framerates: null
            threshold: 27
            min_scene_len: 15
    ClippingSubsampler:
        args:
            min_length: 2.125
            max_length: 2.125
            max_length_strategy: "all"
            precision: "exact"

reading:
    yt_args:
        download_size: 360
        download_audio_rate: 44100
        yt_metadata_args: null
    timeout: 60
    sampler: null

storage:
    number_sample_per_shard: 100
    oom_shard_count: 5
    captions_are_subtitles: False

distribution:
    processes_count: 5
    thread_count: 2
    subjob_size: 1000
    distributor: "multiprocessing"
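
For anyone reproducing this run, driving video2dataset with a config like the one above would look roughly like this (a sketch: the url_list path is illustrative, and passing a YAML path via config= is my assumption based on video2dataset's documented usage):

from video2dataset import video2dataset

video2dataset(
    url_list="webvid_urls.csv",   # illustrative input list
    output_folder="output",
    config="config.yaml",         # the YAML shown above
)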

@iejMac
Owner

iejMac commented Feb 4, 2024

@rom1504 it says at the bottom of this: https://github.com/iejMac/video2dataset/blob/main/dataset_examples/WebVid.md

230 video/s (14.4 videos/s/core) or 420 Mb/s

@MattUnderscoreZhang
Contributor Author

That seems to be for a download config with no video processing. My changes would not have any effect in that use case.

@rom1504
Collaborator

rom1504 commented Feb 4, 2024

@MattUnderscoreZhang OK, let's try running with the same settings as in that example.

Also, it would be helpful to increase the number of processes and threads; 5 and 2 are too low to catch problems.

@MattUnderscoreZhang
Contributor Author

I tried replicating a run with the exact config used in the linked example: 16 processes with 16 threads each, and 1000 samples per shard. I ran on a vast.ai instance with the webvid results_2M_val dataset.

It's good we ran this test because, unfortunately, this branch looks definitely buggy. Something about the threading and multiprocessing is causing the run to freeze.

Looking at old commits and comparing against commit c6f3ed2 (the one right before my first commit), I see that the download worker refactor commit also has a threading problem (even with the threading_fix branch applied). The commit right before it, e1b5d89, is fine. The speed comparison for that commit matches the older one:
[screenshot: WandB speed comparison between commits]

For now I recommend rolling back the download worker refactor commit. Fixing the threading issue will take some debugging, and I don't think I have the capacity for it right now. You can close this pull request if you want, and I'll come back and revisit it later.

@rom1504
Collaborator

rom1504 commented Feb 24, 2024

This would need a rebase.
