The field `url_norm` is essential for looking up URLs entered by humans, but it is disabled by default in `reference.conf`, and enabling it is buried as a side effect of enabling `warc.index.extract.linked.normalise`. URL normalisation should default to `true` and have a dedicated entry, such as `warc.index.extract.normalise_url`, which could also provide the default value for `warc.index.extract.linked.normalise`.
Besides plain search, another case for enabling normalisation by default for the Solr fields `url_norm`, `links` and `links_images` is graph queries. Normalising raises the number of false positives, but this must be weighed against the large number of false negatives without normalisation, caused by `http`/`https` and `www.foo.com`/`foo.com` differences.
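To illustrate the false-negative problem, here is a minimal Python sketch. The `normalise_url` function is hypothetical and is not the normaliser used by the indexer; it only assumes that normalisation drops the scheme, a leading `www.` and a trailing slash. Under that assumption, the variants a human might type all collapse to a single lookup key, whereas an exact match on the raw URL would find only one of them.

```python
from urllib.parse import urlsplit


def normalise_url(url: str) -> str:
    """Hypothetical normaliser for illustration only (not the indexer's own)."""
    parts = urlsplit(url.strip())
    # hostname is already lowercased by urlsplit and excludes any port
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    # Treat "/foo/" and "/foo" as the same resource
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    # The scheme is dropped on purpose so http and https compare equal
    return f"{host}{path}{query}"


# Variants of the same page as they might be typed or harvested
variants = [
    "http://www.foo.com/",
    "https://www.foo.com",
    "http://foo.com",
    "https://foo.com/",
]

# All variants share one normalised key, so a lookup on any of them matches
keys = {normalise_url(u) for u in variants}
```

The same collapsing is what produces the false positives: two URLs that differ only in scheme or `www.` prefix are treated as one, even in the rare case where they serve different content.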
Another property is `warc.index.extract.content.extractApachePreflightErrors`, which validates PDFs and adds validation errors to the index. It is turned on by default. This is a heavy indexing step and was the primary cause of timeouts for web archive indexing at the Royal Danish Library until it was turned off. We recommend that it be turned off by default.