Strip all tags from fields with mixed content #507

Closed
helrond opened this issue May 19, 2022 · 5 comments · Fixed by #511

Comments

@helrond
Member

helrond commented May 19, 2022

Is your feature request related to a problem? Please describe.

Text containing XML or HTML tags is displayed with the tags shown as literal strings rather than rendered as markup.

Describe the solution you'd like

Strip all HTML and XML tags from text before indexing.

Describe alternatives you've considered

This is a short-term solution. A more permanent solution will be articulated in another issue.

Additional context

Tags are most likely to be encountered in note content but may also be present in other fields, for example titles.

We will need to accommodate the presence of angle brackets which are not tags, for example mathematical content such as:

Proof that a < b in a field where c > d

@ctgraham
Contributor

Perhaps something like:

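import re
import xml.etree.ElementTree as ET

try:
  # Parse as XML and keep only the text nodes.
  xmldoc = ET.fromstring(usercontent)
  textcontent = ''.join(xmldoc.itertext())
except ET.ParseError:
  # Not well-formed XML: fall back to removing anything that looks like a tag.
  tagregxp = re.compile(r'<[/\w][^>]+>')
  textcontent = tagregxp.sub('', usercontent)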

@helrond
Member Author

helrond commented May 20, 2022

I wonder if we can just get away with the regex find/replace? The string that would be usercontent above will never (I am pretty sure) be a valid XML document; it will always just be a string with some tags. So I'm thinking it will never be properly parsed...

@ctgraham
Contributor

Good point, though it may be more common than you would expect. One of our primary cases would be where a unittitle contains a title tag, for example.

Wrapping the usercontent string in a dummy tag might coerce the majority of content into syntactically valid XML, and give us a known top-level element for a future XSLT transform.
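
A rough sketch of that wrapping idea, combined with the regex fallback proposed earlier in this thread. The helper name `strip_tags` and the `wrapper` element name are made up for illustration, and nothing here is tested against real ASpace output:

```python
import re
import xml.etree.ElementTree as ET

# Same pattern as proposed earlier in this thread.
TAG_PATTERN = re.compile(r'<[/\w][^>]+>')


def strip_tags(usercontent):
    """Return usercontent with markup removed (hypothetical helper)."""
    try:
        # "wrapper" is an arbitrary, made-up element name that gives the
        # fragment a single root so the parser can accept mixed content.
        doc = ET.fromstring("<wrapper>{}</wrapper>".format(usercontent))
        return ''.join(doc.itertext())
    except ET.ParseError:
        # Still not well-formed when wrapped, e.g. a bare "<" in math content.
        return TAG_PATTERN.sub('', usercontent)
```

With this, a title containing text on either side of a tag parses cleanly, while the math example from the issue description falls through to the regex and comes back unchanged.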

This makes me wonder, though, how ASpace deals with namespaces in user-entered content. Can an archivist really enter HTML or EAD tags interchangeably in ASpace and expect correct function?

@helrond
Member Author

helrond commented May 20, 2022

I think the short answer to your question is no. From a display perspective entering <title render="italic">A Collection</title> and <i>A Collection</i> in the title field result in the same thing. But if you start mixing tags...I think all bets are off.

Part of what we need to remember is that we're transforming the JSON response. In the use case that you provided earlier, what comes back in the JSON is <title render=\"italic\">A Collection</title>. It is possible to turn that into parsable XML by wrapping it in a base tag:

>>> import xml.etree.ElementTree as ET
>>> doc = ET.fromstring("<xml><title render=\"italic\">A Collection</title></xml>")
>>> ''.join(doc.itertext())
'A Collection'

Which is what we want. HOWEVER, the regex approach produces the same result, so I'm not seeing the benefit of introducing additional complexity with XML parsing, unless I'm missing something?

>>> import re
>>> tagregxp = re.compile(r'<[/\w][^>]+>')
>>> tagregxp.sub('', "<title render=\"italic\">A Collection</title>")
'A Collection'

@ctgraham
Contributor

No objection to just using a regexp; I only proposed the xml library as forward looking to #508 and because someone else has done a lot more work in parsing XML than the proposed coverage of a trivial regexp. I'm sure the regexp has unconsidered false positives and missed edge cases.
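
To make the edge cases concrete, here is a small check of the pattern as posted. The `variant` pattern and the `strip_with` helper are assumptions added for comparison, not something agreed on in this thread:

```python
import re

# Pattern as posted in this thread.
posted = re.compile(r'<[/\w][^>]+>')
# Hypothetical variant: "*" instead of "+", so one-letter tags such as <i> match.
variant = re.compile(r'<[/\w][^>]*>')


def strip_with(pattern, text):
    """Remove everything the given pattern matches (illustrative helper)."""
    return pattern.sub('', text)


# Missed edge case: "+" needs at least one character after the first name
# character, so the opening <i> survives while the closing </i> is removed.
print(strip_with(posted, '<i>A Collection</i>'))   # <i>A Collection
print(strip_with(variant, '<i>A Collection</i>'))  # A Collection

# The math example from the issue is left alone because "<" is followed by a space.
print(strip_with(posted, 'Proof that a < b in a field where c > d'))

# False positive: without the spaces, the comparison looks like a tag to both patterns.
print(strip_with(posted, 'Proof that a <b in a field where c> d'))  # Proof that a  d
```

The variant only addresses the one-letter tag case; the false positive on unspaced comparisons affects both patterns.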
