Removal of specialized HTML literal handling? #2946

Open
ashleysommer opened this issue Oct 23, 2024 · 6 comments

@ashleysommer
Contributor

ashleysommer commented Oct 23, 2024

Possible easy solution for #2935 and #2945

The reason we forked html5lib to make html5lib-modern was because there is no new replacement for html5lib that provides the same XML-based HTML-tokenizing functionality that html5lib does. There's no alternative to move to.

Beautifulsoup4 is the logical replacement, but it includes html5lib in its dependency tree, so switching to it would defeat the whole point.

But what if we just dropped that feature entirely? Why does RDFLib even want to be able to tokenize HTML Literals? The feature was added for a reason, but do we need to keep it?

Can we simply drop that feature, and treat HTML the same as any other string literal, and remove html5lib from our dependencies entirely?

@floresbakker

I would say yes! It is difficult to understand what that library is doing and for what reason though.

At the moment, for every HTML element that I process within RDFLib, I get an error message resembling the following (example for a Doctype node):

File "C:<path>\Python\Python312\Lib\site-packages\html5lib\html5parser.py", line 247, in mainLoop
new_token = phase.processDoctype(new_token)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:<path>\Python\Python312\Lib\site-packages\html5lib\html5parser.py", line 417, in processDoctype
self.parser.parseError("unexpected-doctype")
File "C:<path>\Python\Python312\Lib\site-packages\html5lib\html5parser.py", line 322, in parseError
raise ParseError(E[errorcode] % datavars)
html5lib.html5parser.ParseError: Unexpected DOCTYPE. Ignored.

This seriously delays processing of any HTML document, as every element has to undergo this treatment. I am trying to finish my work on the HTML vocabulary (see https://www.w3.org/community/htmlvoc/), and a proper open-source implementation of the HTML vocabulary using RDFLib/PyShacl has been number one on my list for more than a year. It would be awesome if you could fix this permanently. From your post I gather that you also do not expect any undesired effects from removing html5lib. I trust we can keep using the datatype rdf:HTML for HTML literals in our RDF/SPARQL?

@ashleysommer
Contributor Author

ashleysommer commented Oct 23, 2024

@floresbakker Thanks for your input on this.

After a quick meeting with some other rdflib maintainers yesterday, this is the plan we came up with:

  1. Take my six-less fork of html5lib (called html5lib-modern), which is causing these new packaging errors, rename it to html5rdf, change its module name from html5lib to html5rdf to avoid aliasing, and bring it under the rdflib org umbrella.
  2. Change usages of html5lib in rdflib to use html5rdf.
  3. Make it an optional dependency again, gated behind the [html] extra.
  4. Find the code paths that throw errors when an HTML Literal value is not a DOMFragment, and fix those so they work when html5rdf is not installed.
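
Steps 3 and 4 could look roughly like the sketch below. This is not RDFLib's actual code: the helper names `load_optional_parser` and `html_literal_value` are hypothetical, and the sketch assumes that html5rdf, like html5lib, exposes a top-level `parseFragment` function. When the optional module is missing, the literal's value simply stays the lexical string.

```python
import importlib
from typing import Any, Callable, Optional


def load_optional_parser(module_name: str = "html5rdf") -> Optional[Callable[[str], Any]]:
    """Return the module's parseFragment function, or None if unavailable.

    Assumption: the optional module exposes `parseFragment` at top level,
    as html5lib does. The helper name itself is hypothetical.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The optional [html] extra is not installed.
        return None
    return getattr(module, "parseFragment", None)


def html_literal_value(lexical: str, parser: Optional[Callable[[str], Any]] = None) -> Any:
    """Map an rdf:HTML lexical form to a value.

    Without a parser, or if parsing fails, fall back to the plain string
    so callers never depend on the value being a DOM fragment.
    """
    if parser is None:
        return lexical
    try:
        return parser(lexical)
    except Exception:
        # Deliberately broad for this sketch: any parse failure falls
        # back to the lexical string instead of propagating.
        return lexical


# Usage: works the same whether or not the optional module is installed.
value = html_literal_value("<p>hello</p>", load_optional_parser())
```

Passing the parser in as an argument keeps the fallback path easy to test without installing or uninstalling anything.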

@ashleysommer
Contributor Author

> I trust we can keep using the datatype rdf:HTML for html literals in our RDF/SPARQL?

Yes, when html5rdf support is disabled, or even if we remove the feature entirely, then rdf:HTML literals will be simply treated as a typed string literal, like any other typed string literal.
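
As a rough illustration of what "treated as a typed string literal" means, here is a minimal standalone sketch. It does not use RDFLib, and the `TypedLiteral` class is hypothetical: the literal is just a lexical string plus a datatype IRI, compared and serialized without any HTML parsing.

```python
from dataclasses import dataclass

RDF_HTML = "http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


@dataclass(frozen=True)
class TypedLiteral:
    """Hypothetical stand-in for a literal: lexical form plus datatype IRI."""

    lexical: str
    datatype: str = XSD_STRING

    def n3(self) -> str:
        # Escape backslashes first, then quotes and line breaks.
        escaped = (
            self.lexical.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        return f'"{escaped}"^^<{self.datatype}>'


# Equality compares the lexical form and datatype only; no DOM is built.
a = TypedLiteral("<p>hi</p>", RDF_HTML)
b = TypedLiteral("<p>hi</p>", RDF_HTML)
c = TypedLiteral("<p >hi</p>", RDF_HTML)
print(a == b, a == c)
print(a.n3())
```

One consequence of this approach is that two rdf:HTML literals whose markup differs only in insignificant whitespace compare as different terms, since nothing normalizes them.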

@floresbakker

> @floresbakker Thanks for your input on this.
>
> After a quick meeting with some other rdflib maintainers yesterday, this is the plan we came up with:
>
>   1. Take my six-less fork of html5lib (that's called html5lib-modern) that's causing these new packaging errors, rename it to html5rdf, change its module name from html5lib to html5rdf to avoid aliasing, bring it under the rdflib org umbrella.
>   2. Change usages of html5lib in rdflib to use html5rdf
>   3. Make it an optional dependency again, gated behind the [html] extra.
>   4. Find the code paths that throw errors when an HTML Literal value is not a DOMFragment, and fix those so they work when html4rdf is not installed.

It seems that in this plan, html5 is not really deprecated but adopted into rdflib and fixed? For extra information: the error message that I reported above was already present in the original html5lib before you made the html5lib-modern. Perhaps this helps in understanding the cause. I trust the html4rdf reference is a typo and should be html5rdf?

@ashleysommer
Contributor Author

> It seems that in this plan, html5 is not really deprecated but adopted into rdflib and fixed?

Not necessarily. You can think of html5rdf as a new project, forked from html5lib specifically for use in the lexical-to-value mapping of rdf:HTML Literals as described in https://www.w3.org/TR/rdf11-concepts/#h3_section-html (it converts strings into DOM nodes, aka DocumentFragment objects in Python).

It will be maintained by the RDFLib team for that purpose, for the use in RDFLib only.

As for the issue you described in your original post, I'm not seeing those errors in my testing. Are you able to send an example RDF file that reproduces them?

@floresbakker

> > It seems that in this plan, html5 is not really deprecated but adopted into rdflib and fixed?
>
> Not necessarily. You can think of html5rdf as a new project, forked from html5lib specifically for use in the lexical-to-value mapping of rdf:HTML Literals as described in https://www.w3.org/TR/rdf11-concepts/#h3_section-html (it converts strings into DOM nodes, aka DocumentFragment objects in Python).
>
> It will be maintained by the RDFLib team for that purpose, for the use in RDFLib only.
>
> As for the issue you described in your original post, I'm not seeing those in my testing, are you able to send an example RDF file that reproduces those errors?

I tried reproducing the errors on the newest release, 7.1.1 from yesterday, but to my surprise I was unable to do so. That is good news for the htmlvoc project. I think I have only one remaining issue (unrelated to this discussion): I am unable to process TriG files in RDFLib/PyShacl, for which I will work out a minimal working example. Thanks Ashley! There is a lot of movement within RDFLib/PyShacl, which is greatly appreciated.
