warc-indexer. video mp4 file classified as "other" #289
Comments
Ah, the code was not falling back on the original served content type when format ID returned
Is it possible for you to verify it's fixed when running on real data?
Thanks, I will try testing with the latest master branch.
@anjackson Sorry, but the bug is still here. I built the latest version of master with your fix. Here is a small WARC that has the video (and a few other resources):
Hm, weird. Just indexed that WARC and got:
I mean, there were other problems, but that bit seemed to work.
I tried again and still got the same result. See the Solr reply below. Can you assign this to Toke? He will also try to test it (probably tomorrow).
The video is still available at this live URL:
https://sommansiger.nu/img/SomManSiger_full.mp4
Here are some of the fields from Solr. The last two should have been 'video' instead.
content_type_served : "video/mp4"
content_type_full : "application/octet-stream"
content_type_ext : "mp4"
type : "Other"
content_type_norm : "other"
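To make the expected behaviour concrete, here is a minimal Python sketch of the kind of fallback discussed in this thread: when format identification only yields a generic type such as application/octet-stream, use the served content type before normalising. The function names and the category table are hypothetical and are not taken from the warc-indexer source, which may well do this differently.

```python
# Hypothetical sketch: the names and the category table are illustrative,
# not taken from the warc-indexer source.
GENERIC_TYPES = {"application/octet-stream", ""}

# Illustrative mapping from top-level MIME type to a normalised category.
TOP_LEVEL_CATEGORIES = {
    "video": "video",
    "audio": "audio",
    "image": "image",
    "text": "text",
}


def base_type(mime):
    """Strip parameters such as '; charset=utf-8' and lowercase the type."""
    return (mime or "").split(";", 1)[0].strip().lower()


def effective_type(identified, served):
    """Prefer the identified type, but fall back to the served type
    when identification only produced a generic answer."""
    identified = base_type(identified)
    if identified in GENERIC_TYPES:
        return base_type(served) or identified
    return identified


def normalise(identified, served):
    """Map the effective MIME type to a coarse category, defaulting to 'other'."""
    top_level = effective_type(identified, served).split("/", 1)[0]
    return TOP_LEVEL_CATEGORIES.get(top_level, "other")
```

With the values from this record, `normalise("application/octet-stream", "video/mp4")` gives `"video"` under these assumptions, which is what the last two fields would be expected to reflect.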
It seems about 14% of mp4 videos are classified wrongly. From the Danish archive, using this query:
content_type_ext:mp4 AND content_type_norm:(other OR video)
gives:
Video: 2,860,146
Other: 462,163
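For anyone wanting to repeat the check on their own index, the sketch below builds the request parameters and recomputes the misclassified share from the two counts. It assumes the counts came from faceting on content_type_norm, which the issue does not state; the parameters (`q`, `rows`, `facet`, `facet.field`) are standard Solr ones, and the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Standard Solr request parameters; rows=0 asks for counts only, no documents.
# Assumption: the counts above were obtained by faceting on content_type_norm.
params = {
    "q": "content_type_ext:mp4 AND content_type_norm:(other OR video)",
    "rows": 0,
    "facet": "true",
    "facet.field": "content_type_norm",
}
# Append this to the /select URL of your own Solr collection.
query_string = urlencode(params)


def misclassified_share(other, video):
    """Fraction of matching documents that ended up in the 'other' bucket."""
    return other / (other + video)


# Counts reported in this issue.
share = misclassified_share(other=462_163, video=2_860_146)
print(f"{share:.1%}")  # prints 13.9%
```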