
Video metadata processing no longer writes temp files #293

Open · wants to merge 47 commits into main from video_metadata_no_io

Conversation

@MattUnderscoreZhang
Contributor

No description provided.

@MattUnderscoreZhang
Contributor Author

Should be merged after #288.

Metadata-finding subsamplers (FFProbeSubsampler and CutDetectionSubsampler) no longer take byte streams, write to a temp file, and then operate on that temp file. Instead, we can pass a filepath directly to these subsamplers, and they extract the metadata without performing any additional I/O.
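
To illustrate the shape of the change, here is a minimal sketch (not the actual subsampler code; probe_video is an illustrative name) of extracting metadata straight from a filepath with the ffprobe CLI:

import json
import subprocess

def probe_video(filepath: str) -> dict:
    """Read stream/format metadata directly from a filepath, with no temp-file round trip."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", filepath],
        capture_output=True,
        check=True,  # raise if ffprobe cannot read the file
    )
    return json.loads(result.stdout)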

I will do the same for the video processing subsamplers in the next pull request.

This pull request has been tested with my usual workflow and reproduces the expected results.

@rom1504
Collaborator

rom1504 commented Jan 27, 2024

Can you rebase please?

@rom1504
Collaborator

rom1504 commented Jan 27, 2024

@MattUnderscoreZhang what speed difference do you observe? Sharing wandb links would be helpful.

@MattUnderscoreZhang
Contributor Author

I'm showing my noob status here, but I've never really used wandb. Could we maybe schedule a call sometime where you show me how to generate the links you're looking for?

Also, I don't think there should be much speedup at this stage anyway. I still need to perform a temporary write at the beginning of sample processing, since I haven't touched the dataloaders yet. That is, it would be nice if the dataloaders passed filepaths rather than byte streams, but that will have to be a later change. As it stands, I think I'm currently only saving a single read/write. The major savings should come in the next pull request, when I update the actual video processing subsamplers.
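
For reference, the remaining temp write is roughly the following (a hypothetical sketch; bytes_to_path is an illustrative name, not the actual dataloader code):

import tempfile

def bytes_to_path(video_bytes: bytes, suffix: str = ".mp4") -> str:
    """Spill an in-memory byte stream to disk once and return a filepath for the subsamplers."""
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as f:
        f.write(video_bytes)
    return f.name  # caller is responsible for deleting the temp file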

@rom1504
Collaborator

rom1504 commented Jan 28, 2024

I think it's quite important to check the speed for this kind of major change.
It could get worse.

@MattUnderscoreZhang
Contributor Author

MattUnderscoreZhang commented Jan 28, 2024

Here's a check of this branch vs. main on WandB.
https://wandb.ai/bucketoffish/video2dataset?workspace=user-bucketoffish
good-cherry-2 is this branch, which you can see is slightly faster (11.8% more vid_per_sec).
This test used 100 webvid videos, with 5 processes and 10 threads, 10 samples per shard, on my 2019 MacBook Pro.
[screenshot: WandB run comparison]

@MattUnderscoreZhang
Contributor Author

MattUnderscoreZhang commented Feb 3, 2024

Here's a comparison between this branch and the threading_fix branch, this time with a larger dataset of around 5k samples. The results are reversed here, with this branch being slightly slower, by about 4%. Each branch took about 2.5 hours to run, and I'm not sure how significant these results are given the variance. The difference between the last test and this one may be due to the threading fix, which has not been merged into the video_metadata_no_io branch yet.
[screenshot: WandB run comparison]

@rom1504
Collaborator

rom1504 commented Feb 3, 2024

@iejMac do you have some numbers for how many vid/s you reached on webvid?

@MattUnderscoreZhang how many workers are you using?

@MattUnderscoreZhang
Contributor Author

MattUnderscoreZhang commented Feb 3, 2024

I'm using 5 processes with 2 threads each. 100 samples per shard.

subsampling:
    FrameSubsampler:
        args:
            frame_rate: 8
    ResolutionSubsampler:
        args:
            width: 128
            height: 224
            resize_mode: "scale,crop,pad"
    CutDetectionSubsampler:
        cuts_are_clips: True
        args:
            cut_detection_mode: "all"
            framerates: null
            threshold: 27
            min_scene_len: 15
    ClippingSubsampler:
        args:
            min_length: 2.125
            max_length: 2.125
            max_length_strategy: "all"
            precision: "exact"

reading:
    yt_args:
        download_size: 360
        download_audio_rate: 44100
        yt_metadata_args: null
    timeout: 60
    sampler: null

storage:
    number_sample_per_shard: 100
    oom_shard_count: 5
    captions_are_subtitles: False

distribution:
    processes_count: 5
    thread_count: 2
    subjob_size: 1000
    distributor: "multiprocessing"
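
For anyone reproducing this run, driving video2dataset with a config like the one above would look roughly like this (a sketch: the url_list path is illustrative, and passing a YAML path via config= is my assumption based on video2dataset's documented usage):

from video2dataset import video2dataset

video2dataset(
    url_list="webvid_urls.csv",   # illustrative input list
    output_folder="output",
    config="config.yaml",         # the YAML shown above
)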

@iejMac
Owner

iejMac commented Feb 4, 2024

@rom1504 it says at the bottom of this: https://github.com/iejMac/video2dataset/blob/main/dataset_examples/WebVid.md

230 video/s (14.4 videos/s/core) or 420 Mb/s

@MattUnderscoreZhang
Contributor Author

That seems to be for a download config with no video processing. My changes would not have any effect in that use case.

@rom1504
Collaborator

rom1504 commented Feb 4, 2024

@MattUnderscoreZhang OK, let's try running with the same settings as in that example.

Also, it would be helpful to increase the number of processes and threads; 5 and 2 are too low to catch problems.

@MattUnderscoreZhang
Contributor Author

I tried replicating a run with the exact config used in the linked example: 16 processes with 16 threads each, and 1000 samples per shard. I ran on a vast.ai instance with the webvid results_2M_val dataset.

It's good we ran this test because, unfortunately, this branch looks definitely buggy. Something about the threading and multiprocessing is causing the run to freeze.

Looking at old commits and comparing against commit c6f3ed2 (the one right before my first commit), I see that the download worker refactor commit also has a threading problem (even with the threading_fix branch applied). The commit right before it, e1b5d89, is fine. The speed comparison for that commit matches the older one:
[screenshot: WandB speed comparison between commits]

For now I recommend rolling back the download worker refactor commit. Fixing the threading issue will take some debugging, and I don't think I have the capacity for it right now. You can close this pull request if you want, and I'll come back and revisit it later.

@rom1504
Collaborator

rom1504 commented Feb 24, 2024

This would need a rebase.
