warc-indexer. video mp4 file classified as "other" #289

Open
thomasegense opened this issue May 30, 2022 · 5 comments

@thomasegense
Contributor

thomasegense commented May 30, 2022

The video is still available at this live URL:

https://sommansiger.nu/img/SomManSiger_full.mp4

Here are some of the fields from Solr. The last two should have been 'video' instead.

content_type_served : "video/mp4"
content_type_full : "application/octet-stream"
content_type_ext : "mp4"
type : "Other"
content_type_norm : "other"

It seems about 14% of mp4 videos are misclassified. Running this query against the Danish archive:

content_type_ext:mp4 AND content_type_norm:(other OR video)

gives:
Video: 2,860,146
Other: 462,163
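
For anyone who wants to reproduce the numbers, here is a rough sketch of how the two counts could be fetched as a facet and turned into a share. The endpoint URL and collection name are placeholders I made up, not our real setup; the Solr parameters (`q`, `rows`, `facet`, `facet.field`, `wt`) are the standard ones, and the counts are the ones quoted above.

```python
from urllib.parse import urlencode

# Hypothetical Solr select endpoint; replace with the real host and collection.
SOLR_SELECT_URL = "http://localhost:8983/solr/netarchive/select"


def build_facet_url(base_url: str = SOLR_SELECT_URL) -> str:
    """Build a facet query that counts .mp4 documents per content_type_norm."""
    params = {
        "q": "content_type_ext:mp4 AND content_type_norm:(other OR video)",
        "rows": 0,  # only the facet counts are needed, not the documents
        "facet": "true",
        "facet.field": "content_type_norm",
        "wt": "json",
    }
    return f"{base_url}?{urlencode(params)}"


def misclassified_share(video_count: int, other_count: int) -> float:
    """Return the fraction of matched documents normalised as 'other'."""
    total = video_count + other_count
    if total == 0:
        raise ValueError("no documents matched the query")
    return other_count / total


url = build_facet_url()
# Counts copied from the query result reported in this comment.
share = misclassified_share(video_count=2_860_146, other_count=462_163)
print(url)
print(f"{share:.1%} of .mp4 documents are normalised as 'other'")
```

The share is taken over all matched documents (video plus other), not as a ratio of other to video.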

@anjackson anjackson self-assigned this Aug 2, 2022
@anjackson
Contributor

Ah, the code was not falling back on the original served content type when format ID returned application/octet-stream (only when format ID explicitly failed and returned an empty string). I've added a test that I think reproduced the behaviour, and modified the code to fix the issue.
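
To make the intended behaviour concrete, here is a minimal sketch of the fallback rule in Python. This is not the actual warc-indexer code (which is Java); the function name and the set of "uninformative" types are assumptions made for illustration only.

```python
from typing import Optional

# Results from format identification that carry no real information.
# Treating exactly these two values as uninformative is an assumption.
UNINFORMATIVE_TYPES = {"", "application/octet-stream"}


def resolve_content_type(identified: Optional[str], served: Optional[str]) -> str:
    """Prefer the identified type, but fall back on the served type when
    identification failed (empty) or only returned the generic binary type."""
    candidate = (identified or "").strip().lower()
    if candidate not in UNINFORMATIVE_TYPES:
        return candidate
    fallback = (served or "").strip().lower()
    # If the server gave nothing useful either, stay with the generic type.
    return fallback or "application/octet-stream"


# The case from this issue: identification returns the generic binary type.
print(resolve_content_type("application/octet-stream", "video/mp4"))
# The case that was already handled: identification returns an empty string.
print(resolve_content_type("", "video/mp4"))
# A specific identification result still wins over the served type.
print(resolve_content_type("video/mp4", "text/html"))
```

With this rule, the first two calls yield `video/mp4` and the third keeps the identified `video/mp4`.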

Is it possible for you to verify it's fixed when running on real data?

@thomasegense
Contributor Author

Thanks, I will try testing with the latest master branch.

@thomasegense
Contributor Author

@anjackson Sorry, but the bug is still there. I built the latest version of master with your fix.

Here is a small WARC that has the video (and a few other resources):
https://drive.google.com/file/d/1s7NUo0BntJgThdnwh953KLfUIKzs6zcw/view?usp=sharing

@anjackson
Contributor

Hm, weird. Just indexed that WARC and got:

{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"content_type_served:\"video/mp4\"",
      "indent":"on",
      "fl":"url,content_type*",
      "rows":"100",
      "wt":"json",
      "_":"1659521636272"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "content_type_ext":"mp4",
        "content_type_served":"video/mp4",
        "content_type":["video/mp4"],
        "content_type_droid":"application/mp4",
        "content_type_tika":"video/mp4",
        "content_type_full":"video/mp4",
        "content_type_norm":"video",
        "url":"https://sommansiger.nu/img/SomManSiger_full.mp4"}]
  }}

I mean, there were other problems, but that bit seemed to work.

@thomasegense
Contributor Author

I tried again and still got the same result. See the Solr reply below.
I am using this commit:
commit 81acb31 (HEAD -> master, origin/master, origin/HEAD)
Add test for #289 and fall-back on the served content type when format ID fails.

Can you assign this to Toke? He will also try to test it (probably tomorrow).

{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"content_type_served:\"video/mp4\"",
      "fl":"id,content_type_norm,url",
      "_":"1659522744689"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "content_type_norm":"other",
        "id":"20220803065503/lSIHRjv/b3vWGR6zGjQLMw==",
        "url":"https://sommansiger.nu/img/SomManSiger_full.mp4"}]
  }}
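
To compare runs without reading the JSON by eye, a small check like the following could be used. It is only a sketch: the helper name is made up, and it assumes the response comes from a query on content_type_served:"video/mp4" with `content_type_norm` and `url` in the field list, as in the reply above.

```python
import json


def misclassified_urls(solr_response: dict) -> list:
    """Return URLs of documents whose content_type_norm is not 'video'.

    Assumes every document in the response was served as video/mp4,
    i.e. the query already filtered on content_type_served.
    """
    docs = solr_response.get("response", {}).get("docs", [])
    return [
        doc.get("url", "<no url>")
        for doc in docs
        if doc.get("content_type_norm") != "video"
    ]


# Trimmed copy of the reply above (only the fields the check needs).
reply = json.loads("""
{
  "response": {"numFound": 1, "start": 0, "docs": [
    {"content_type_norm": "other",
     "id": "20220803065503/lSIHRjv/b3vWGR6zGjQLMw==",
     "url": "https://sommansiger.nu/img/SomManSiger_full.mp4"}]}
}
""")

bad = misclassified_urls(reply)
print(f"{len(bad)} misclassified document(s)")
for url in bad:
    print(url)
```

An empty list would mean every served video/mp4 document was normalised as video.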
