"Bad IRI" and "Illegal character in IRI" across latest-core collection. #723

Open
donpellegrino opened this issue Jan 28, 2022 · 4 comments

Comments

@donpellegrino

The latest-core collection at https://databus.dbpedia.org/dbpedia/collections/latest-core as downloaded on January 28, 2022 has many "Bad IRI" and "Illegal character in IRI" issues across the data as reported by Apache Jena's riot --validate command. For example:

article-templates_lang=en.ttl.bz2 : 474.07 sec : 50,428,351 Triples : 106,372.54 per second : 0 errors : 28,718 warnings

It would be more robust to ensure the published triples pass all syntax checks.

References:

https://jena.apache.org/documentation/io/

@kurzum
Member

kurzum commented Jan 29, 2022

@donpellegrino I am transferring this issue to https://github.com/dbpedia/extraction-framework/issues

@kurzum kurzum transferred this issue from dbpedia/databus-maven-plugin Jan 29, 2022
@kurzum
Member

kurzum commented Jan 29, 2022

Hi @donpellegrino,

"It would be more robust to ensure the published triples pass all syntax checks."

There is a lot of variation here, and there is no such thing as "all" syntax checks. About a year ago, we built this parser: https://github.com/dbpedia/databus-derive, which uses Jena 3.13.1.
It is highly parallelized and should be one of the fastest parsers out there, if not the fastest. It also does more than parsing: it writes quite detailed parselogs and logs all malformed triples.

We also publish the parselogs here: http://dbpedia-mappings.tib.eu/parse-reports/generic/article-templates/
Back then, we filed a bug report about a warning in Jena, and they updated their parser in version 3.13.1 specifically for us.

I think this is the exact parser profile we use to configure Jena: https://github.com/dbpedia/databus-derive/blob/master/src/main/java/org/dbpedia/databus/derive/io/rdf/NoErrorProfile.java

I looked at http://dbpedia-mappings.tib.eu/parse-reports/generic/article-templates/2021.09.01/article-templates_lang=en_debug.txt.bz2 and it seems that we need to update ICU, the Unicode library. Most of the problems are caused by new emojis.
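
As a rough illustration of why an outdated Unicode library matters, the sketch below uses Python's standard `unicodedata` module (not ICU, and not what databus-derive actually runs) to list characters that the runtime's Unicode database reports as unassigned. The function name `unassigned_chars` and the sample IRI are made up for illustration; which characters get flagged depends on the Unicode version bundled with the interpreter.

```python
import unicodedata


def unassigned_chars(text):
    """Return the characters whose general category is Cn.

    Cn covers code points that the runtime's Unicode database does not
    know as assigned characters. A character added in a newer Unicode
    version than the database ships with will show up here.
    """
    return [ch for ch in text if unicodedata.category(ch) == "Cn"]


# Unicode version of the database bundled with this Python build.
print("Unicode database version:", unicodedata.unidata_version)

# Hypothetical IRI containing an emoji (assumption, not taken from the dump).
sample = "http://dbpedia.org/resource/\U0001F600"
for ch in unassigned_chars(sample):
    print("unknown to this Unicode database: U+%04X" % ord(ch))
```

If the same input is flagged by one runtime and accepted by a newer one, that would point to the Unicode data version rather than to the triples themselves.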

Also, the result of riot --validate depends heavily on the Jena version you are using. I tested with rapper/libraptor, and it found no errors in 2021.12.01:

rapper -i ntriples article-templates_lang\=en.ttl -c 
rapper: Parsing URI file:///home/kurzum/Downloads/article-templates_lang=en.ttl with parser ntriples
rapper: Parsing returned 50428351 triples

Looking at "0 errors : 28,718 warnings", this seems to be the NFKC Unicode-related Jena warning that was fixed. @donpellegrino, could you post the Jena version and, if possible, more detailed information?
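
For readers unfamiliar with NFKC, here is a minimal sketch of what such a check looks like in general, using Python's standard `unicodedata` module. It does not reproduce Jena's actual check; the helper names and the example IRI (with a "fi" ligature, U+FB01) are assumptions chosen for illustration.

```python
import unicodedata


def is_nfkc(text):
    """True if text is unchanged by NFKC normalization."""
    return unicodedata.normalize("NFKC", text) == text


def chars_changed_by_nfkc(text):
    """Characters that NFKC would rewrite when normalized on their own."""
    return [ch for ch in text if unicodedata.normalize("NFKC", ch) != ch]


# Hypothetical IRI: U+FB01 (LATIN SMALL LIGATURE FI) normalizes to "fi".
iri = "http://dbpedia.org/resource/\ufb01sh"
print(is_nfkc(iri))
print(["U+%04X" % ord(ch) for ch in chars_changed_by_nfkc(iri)])
```

A validator that warns on non-NFKC IRIs would flag the example above even though it is syntactically a legal IRI, which is one way warning counts can differ between tools.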

@Vehnem parselogs after 09.2021 are missing: http://dbpedia-mappings.tib.eu/parse-reports/generic/article-templates/

@donpellegrino
Author

I used Jena version 3.17.0:

> riot --version
Jena:       VERSION: 3.17.0
Jena:       BUILD_DATE: 2020-11-25T19:40:23+0000

For the Unicode interpretation, I am not sure if that comes from Jena directly or would depend on the underlying Java implementation. For my original report, I was running it with Oracle Java 1.8.0_291-b10:

> java -version
java version "1.8.0_291"
Java(TM) SE Runtime Environment (build 1.8.0_291-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.291-b10, mixed mode)

The locale is UTF-8:

> locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Switching to OpenJDK 11.0.13:

> java -version
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment (build 11.0.13+8-suse-3.65.1-x8664)
OpenJDK 64-Bit Server VM (build 11.0.13+8-suse-3.65.1-x8664, mixed mode)

OpenJDK 11.0.13 also gives the warnings:

> riot --validate --time article-templates_lang\=en.ttl.bz2
<snip>
09:02:11 WARN  riot            :: [line: 50422047, col: 35] Illegal character in IRI (Not a ucschar: 0xD834): <http://dbpedia.org/resource/𝅘𝅥[U+D834]...>
09:02:11 WARN  riot            :: [line: 50422047, col: 36] Illegal character in IRI (Not a ucschar: 0xDD72): <http://dbpedia.org/resource/𝅘𝅥?[U+DD72]...>
09:02:11 WARN  riot            :: [line: 50422048, col: 31] Illegal character in IRI (Not a ucschar: 0xD834): <http://dbpedia.org/resource/[U+D834]...>
09:02:11 WARN  riot            :: [line: 50422048, col: 32] Illegal character in IRI (Not a ucschar: 0xDDBA): <http://dbpedia.org/resource/?[U+DDBA]...>
09:02:11 WARN  riot            :: [line: 50422048, col: 33] Illegal character in IRI (Not a ucschar: 0xD834): <http://dbpedia.org/resource/𝆺[U+D834]...>
09:02:11 WARN  riot            :: [line: 50422048, col: 34] Illegal character in IRI (Not a ucschar: 0xDD65): <http://dbpedia.org/resource/𝆺?[U+DD65]...>
09:02:11 WARN  riot            :: [line: 50422048, col: 35] Illegal character in IRI (Not a ucschar: 0xD834): <http://dbpedia.org/resource/𝆺𝅥[U+D834]...>
09:02:11 WARN  riot            :: [line: 50422048, col: 36] Illegal character in IRI (Not a ucschar: 0xDD6F): <http://dbpedia.org/resource/𝆺𝅥?[U+DD6F]...>
article-templates_lang=en.ttl.bz2 : 593.94 sec : 50,428,351 Triples : 84,904.79 per second : 0 errors : 28,718 warnings
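
The values in these warnings (0xD834, 0xDD72, ...) fall in the UTF-16 surrogate range, so each pair of consecutive warnings may describe the two halves of a single supplementary-plane character. The sketch below is one way to check that: it pulls the hex values out of warning lines, joins high and low surrogates into code points, and tests the result against the `ucschar` ranges of RFC 3987. That riot reports one warning per UTF-16 code unit is an assumption drawn from the log above, not from Jena's source; the function names are made up, and the IRI text in the sample lines is elided.

```python
import re

# Extracts the hex value from "(Not a ucschar: 0xD834)".
NOT_UCSCHAR = re.compile(r"Not a ucschar: 0x([0-9A-Fa-f]{4,6})")

# ucschar ranges from RFC 3987, section 2.2.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
UCSCHAR_RANGES += [(p << 16, (p << 16) + 0xFFFD) for p in range(0x1, 0xE)]
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))


def is_ucschar(code_point):
    return any(lo <= code_point <= hi for lo, hi in UCSCHAR_RANGES)


def combine_surrogates(high, low):
    """Join a UTF-16 high/low surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def decode_reported_units(log_lines):
    """Turn reported values into code points, pairing adjacent surrogates.

    Assumes consecutive warnings refer to adjacent code units.
    """
    units = [int(m.group(1), 16)
             for line in log_lines
             for m in NOT_UCSCHAR.finditer(line)]
    code_points = []
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            code_points.append(combine_surrogates(units[i], units[i + 1]))
            i += 2
        else:
            code_points.append(units[i])
            i += 1
    return code_points


# Two warnings from the log above, with the IRI text elided.
lines = [
    "[line: 50422047, col: 35] Illegal character in IRI (Not a ucschar: 0xD834): <...>",
    "[line: 50422047, col: 36] Illegal character in IRI (Not a ucschar: 0xDD72): <...>",
]
for cp in decode_reported_units(lines):
    print("U+%04X ucschar=%s" % (cp, is_ucschar(cp)))
```

For the pair 0xD834/0xDD72 this yields U+1D172, which lies inside the `ucschar` range 0x10000-0x1FFFD. If the other pairs behave the same way, the warnings would stem from checking surrogates individually rather than from characters that are actually illegal in an IRI.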

@Vehnem
Collaborator

Vehnem commented Feb 7, 2022

Hi, I will check it this week.
The issue seems valid. The RDF pruning/validation process seems to have failed (or was not working correctly).
