YouTube metadata is not saved #319

Open
libeanim opened this issue Feb 20, 2024 · 3 comments

@libeanim
Contributor

Issue

When using video2dataset (1.3.0) to download YouTube videos, I set the following entry in the config to retrieve metadata:

reading:
    yt_args:
        download_size: 360
        download_audio_rate: 44100
        yt_metadata_args:
            writesubtitles: 'all'
            subtitleslangs: ['en', 'de', 'es', 'fr', 'it', 'nl', 'pl', 'ru']
            writeautomaticsub: True
            get_info: True
    timeout: 60
    sampler: null

But in the resulting JSON files, the entry "yt_meta_dict": {} is empty even though get_info: True is set in the config.

How to reproduce

For example, take this link: https://www.youtube.com/embed/JFUsP1coIKM
When I download it with yt-dlp:

yt-dlp -N 2 \
       --write-subs --convert-subs srt \
       --write-info-json --embed-subs --embed-chapters --embed-metadata \
       --no-progress -q \
       --format 'b[height<=360][ext=mp4]' \
       --output './demo.mp4' \
       https://www.youtube.com/embed/JFUsP1coIKM

I get YouTube metadata such as "categories": ["Entertainment"], "tags": ["Deutsche", "Welle", "Made", "in", "Germany", "Bio", "Lettland", "Getreide"]
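
For comparison, the same fields can be read through yt-dlp's Python API instead of the CLI. This is only a minimal sketch, not how video2dataset calls yt-dlp internally: `select_fields` and `fetch_info` are hypothetical helper names, and the `title`, `categories`, and `tags` keys are assumed to be present in the info dict for this video.

```python
from typing import Any, Dict, Iterable


def select_fields(
    info: Dict[str, Any],
    fields: Iterable[str] = ("title", "categories", "tags"),
) -> Dict[str, Any]:
    """Return only the requested keys that are present in a yt-dlp info dict."""
    return {name: info[name] for name in fields if name in info}


def fetch_info(url: str) -> Dict[str, Any]:
    """Fetch metadata for one video without downloading the media."""
    # Imported lazily so select_fields can be used without yt-dlp installed.
    import yt_dlp

    with yt_dlp.YoutubeDL({"quiet": True, "skip_download": True}) as ydl:
        info = ydl.extract_info(url, download=False)
        # sanitize_info makes the result JSON-serializable.
        return ydl.sanitize_info(info)
```

Calling `select_fields(fetch_info("https://www.youtube.com/embed/JFUsP1coIKM"))` needs network access and should return the categories and tags if yt-dlp can extract them.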

But with video2dataset it looks like this:

    "caption": "\"Volles Korn voran\" 28. November 2008 Beitrag \u00fcber den \u00f6kologischen Teil des Ackerbaus von german",
    "url": "https://www.youtube.com/embed/JFUsP1coIKM",
    "key": "0000000",
    "status": "success",
    "error_message": null,
    "yt_meta_dict": {},
    "video_metadata": {...

@pabl0

pabl0 commented Mar 7, 2024

Are you getting empty yt_meta_dict for just some videos or all of them?

What I am seeing is that for every 300 videos, I get roughly 100 with yt_meta_dict populated and 200 with yt_meta_dict = {}, which is quite strange.
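
A count like this can be reproduced with a short script. The sketch below assumes the per-sample JSON files are written as loose files under one output directory (not packed into tar shards); `count_yt_meta` is a hypothetical helper, and JSON files without a `yt_meta_dict` key, such as per-shard stats files, are skipped.

```python
import json
from collections import Counter
from pathlib import Path


def count_yt_meta(output_dir):
    """Count sample JSON files with a populated vs. empty yt_meta_dict."""
    counts = Counter()
    for path in sorted(Path(output_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as fh:
            record = json.load(fh)
        # Skip JSON files that are not per-sample metadata.
        if not isinstance(record, dict) or "yt_meta_dict" not in record:
            continue
        counts["populated" if record["yt_meta_dict"] else "empty"] += 1
    return counts
```

Running `count_yt_meta("path/to/output")` returns a `Counter` with `populated` and `empty` entries, which makes it easy to check whether the ratio matches the number of clips per video.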

What exactly does ignoring errors in yt_dlp mean? Even if you have retries, does it give up on the first try?

yt_metadata_args["ignoreerrors"] = True

Other yt_dlp codepaths don't seem to set this.

@pabl0

pabl0 commented Mar 7, 2024

Ahh! Now I understand what happens: with multiple clips, only the first one (_00000.json) will have yt_meta_dict populated, not the following clips.

It seems this change was introduced by the clipping subsampler refactoring (#275). Did it behave differently in v1.2.0?

# remove redundant metadata from clips after the first
for m_clips in metadata_clips[1:]:
    m_clips["yt_meta_dict"] = {}

I am not sure if this is a good idea. Depending on your processing pipeline, you might want to have the same metadata available on all the clips.
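
Until that is decided, the metadata can be copied back onto the later clips as a post-processing step. This is a rough workaround sketch, not a change to video2dataset itself: `backfill_yt_meta` is a hypothetical helper, and it assumes loose JSON files named `<key>_<clip index>.json` (as in `_00000.json` above), with all clips of one video in the same directory.

```python
import json
from pathlib import Path


def backfill_yt_meta(output_dir):
    """Copy yt_meta_dict from a video's first populated clip to its later clips.

    Returns the number of JSON files that were rewritten.
    """
    source_meta = {}  # (directory, video key) -> first non-empty yt_meta_dict
    updated = 0
    # Sorted order visits <key>_00000.json before <key>_00001.json, and so on.
    for path in sorted(Path(output_dir).rglob("*_*.json")):
        key, _, clip_index = path.stem.rpartition("_")
        if not clip_index.isdigit():
            continue  # not a clip file under the assumed naming scheme
        record = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(record, dict) or "yt_meta_dict" not in record:
            continue
        group = (path.parent, key)
        if record["yt_meta_dict"]:
            source_meta.setdefault(group, record["yt_meta_dict"])
        elif group in source_meta:
            record["yt_meta_dict"] = source_meta[group]
            path.write_text(json.dumps(record), encoding="utf-8")
            updated += 1
    return updated
```

Clips whose video has no populated metadata at all are left untouched, so a genuinely failed metadata fetch stays visible as an empty dict.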

@rom1504
Collaborator

rom1504 commented Mar 7, 2024 via email
