config defaults for url_norm and PDF validation #158

Open
tokee opened this issue Mar 19, 2018 · 1 comment

tokee commented Mar 19, 2018

The field url_norm is essential for looking up URLs entered by humans, but it is disabled by default in reference.conf, and enabling it is buried as a side effect of enabling warc.index.extract.linked.normalise. It should be enabled by default and have a dedicated entry, such as warc.index.extract.normalise_url, which could also provide the default value for warc.index.extract.linked.normalise.

Besides plain search, another case for enabling normalisation by default for the Solr fields url_norm, links and links_images is graph queries. Normalising raises the number of false positives, but this should be weighed against the large number of false negatives without normalisation, caused by http/https and www.foo.com/foo.com differences.
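
To illustrate the trade-off, here is a minimal Python sketch of this kind of normalisation. It is not the normaliser used by the indexer; the function name `normalise_url` and its rules (drop the scheme, drop a leading `www.`, lowercase the host, strip trailing slashes) are assumptions chosen only to show why http/https and www.foo.com/foo.com variants collapse to one key.

```python
from urllib.parse import urlsplit


def normalise_url(url: str) -> str:
    """Hypothetical, simplified normaliser (NOT the indexer's implementation).

    Collapses scheme and leading-"www." differences so that variants a human
    might type map to the same lookup key.
    """
    # Human-entered URLs often lack a scheme; add one so urlsplit finds the host.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    # .hostname is already lowercased and excludes any port or credentials.
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    # Treat "/foo/" and "/foo" as the same path.
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


if __name__ == "__main__":
    variants = [
        "http://www.foo.com/",
        "https://foo.com",
        "www.foo.com",
    ]
    # All three variants share one key, so a lookup on any of them matches.
    print({normalise_url(v) for v in variants})
```

Without such a step the three variants are distinct strings and only an exact match is found (false negatives). With it, two genuinely different sites that differ only by scheme or a `www.` prefix would also share a key (false positives).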

Another property is warc.index.extract.content.extractApachePreflightErrors, which validates PDFs and adds validation errors to the index. It is turned on by default. This is a heavy indexing step and was the primary cause of timeouts for web archive indexing at the Royal Danish Library until it was turned off. We recommend that it be turned off by default.

@anjackson

Ah, looks like #285 was a dupe of this!
