Strip all tags from fields with mixed content #507

helrond · 2022-05-19T16:10:30Z

Is your feature request related to a problem? Please describe.

When text contains XML or HTML tags, the tags are displayed as literal strings rather than being interpreted as markup.

Describe the solution you'd like

Strip all HTML and XML tags from text before indexing.

Describe alternatives you've considered

This is a short-term solution. A more permanent solution will be articulated in another issue.

Additional context

Tags are most likely to be encountered in note content but may also be present in other fields, for example titles.

We will need to accommodate the presence of angle brackets which are not tags, for example mathematical content such as:

Proof that a < b in a field where c > d

ctgraham · 2022-05-20T13:39:28Z

Perhaps something like:

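import re
import xml.etree.ElementTree as ET

try:
  # Parse as XML and keep only the text nodes.
  xmldoc = ET.fromstring(usercontent)
  textcontent = ''.join(xmldoc.itertext())
except ET.ParseError:
  # Not well-formed XML: fall back to stripping anything tag-like.
  tagregxp = re.compile(r'<[/\w][^>]+>')
  textcontent = tagregxp.sub('', usercontent)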

helrond · 2022-05-20T14:08:58Z

I wonder if we can just get away with the regex find/replace? The string that would be usercontent above will never (I am pretty sure) be a valid XML document; it will always just be a string with some tags. So I'm thinking it will never be properly parsed...

ctgraham · 2022-05-20T15:11:47Z

Good point, though it may be more common than you would expect. One of our primary cases would be where a unittitle contains a title tag, for example.

Wrapping the usercontent string in a dummy tag might coerce the majority of content into syntactically valid XML, and would give us a known top-level element for a future XSLT transform.
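A minimal sketch of that wrapping idea, combined with the regex fallback proposed earlier in the thread. The helper name strip_tags and the wrapper element name are hypothetical, chosen only for illustration; this is not the code that was eventually merged.

```python
import re
import xml.etree.ElementTree as ET

# Same pattern as proposed earlier in the thread.
TAG_RE = re.compile(r'<[/\w][^>]+>')


def strip_tags(usercontent):
    """Return usercontent with markup removed (hypothetical helper)."""
    try:
        # Wrap the fragment in a dummy root so it can parse as a document.
        root = ET.fromstring('<wrapper>' + usercontent + '</wrapper>')
        return ''.join(root.itertext())
    except ET.ParseError:
        # Still not well-formed when wrapped (for example a bare "<"),
        # so fall back to the regular expression.
        return TAG_RE.sub('', usercontent)


print(strip_tags('<title render="italic">A Collection</title>'))
print(strip_tags('Proof that a < b in a field where c > d'))
```

In this sketch the mathematical example fails to parse as XML because of the bare "<", so it takes the fallback path, where the regex leaves it unchanged.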

This makes me wonder, though, how ASpace deals with namespaces in user-entered content. Can an archivist really enter HTML or EAD tags interchangeably in ASpace and expect correct function?

helrond · 2022-05-20T19:10:24Z

I think the short answer to your question is no. From a display perspective entering <title render="italic">A Collection</title> and <i>A Collection</i> in the title field result in the same thing. But if you start mixing tags...I think all bets are off.

Part of what we need to remember is that we're transforming the JSON response. In the use case that you provided earlier, what comes back in the JSON is <title render=\"italic\">A Collection</title>. It is possible to turn that into parsable XML by wrapping it in a base tag:

>>> import xml.etree.ElementTree as ET
>>> doc = ET.fromstring("<xml><title render=\"italic\">A Collection</title></xml>")
>>> ''.join(doc.itertext())
'A Collection'

Which is what we want. HOWEVER, the regex approach produces the same result, so I'm not seeing the benefit of introducing additional complexity with XML parsing, unless I'm missing something?

>>> import re
>>> tagregxp = re.compile(r'<[/\w][^>]+>')
>>> tagregxp.sub('', "<title render=\"italic\">A Collection</title>")
'A Collection'
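For comparison, a small sketch that decodes the value from JSON first and then runs both approaches side by side. The JSON fragment and its "title" key are illustrative assumptions, not a real ASpace response, and the function names are hypothetical. The second input is a made-up case with an XML entity, included only to show one place where the two approaches would differ if such content occurred.

```python
import json
import re
import xml.etree.ElementTree as ET

TAG_RE = re.compile(r'<[/\w][^>]+>')


def strip_with_xml(text):
    # Wrap in a base tag, as in the example above, then join the text nodes.
    root = ET.fromstring('<xml>' + text + '</xml>')
    return ''.join(root.itertext())


def strip_with_regex(text):
    return TAG_RE.sub('', text)


# Hypothetical JSON fragment; the escaped quotes disappear on decoding.
raw = r'{"title": "<title render=\"italic\">A Collection</title>"}'
title = json.loads(raw)['title']

print(strip_with_xml(title))    # 'A Collection'
print(strip_with_regex(title))  # 'A Collection'

# Assumed input containing an entity: the parser decodes it, the regex does not.
with_entity = 'Letters &amp; <title render="italic">Diaries</title>'
print(strip_with_xml(with_entity))    # 'Letters & Diaries'
print(strip_with_regex(with_entity))  # 'Letters &amp; Diaries'
```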

ctgraham · 2022-05-23T13:59:57Z

No objection to just using a regexp; I only proposed the xml library as forward looking to #508 and because someone else has done a lot more work in parsing XML than the proposed coverage of a trivial regexp. I'm sure the regexp has unconsidered false positives and missed edge cases.
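Two such edge cases can be reproduced directly with the pattern from this thread. The inputs below are made up for illustration, and the relaxed pattern at the end is only one possible adjustment, not something the thread settled on.

```python
import re

tagregxp = re.compile(r'<[/\w][^>]+>')

# Missed tag: "[^>]+" needs at least one more character after the first,
# so a one-letter opening tag such as <i> is left in place.
missed = tagregxp.sub('', '<i>A Collection</i>')
print(repr(missed))  # '<i>A Collection'

# False positive: brackets that hug their operands look like a tag.
false_positive = tagregxp.sub('', 'Proof that a <b and c> d')
print(repr(false_positive))  # 'Proof that a  d'

# The spaced form from the issue description is left alone.
untouched = tagregxp.sub('', 'Proof that a < b in a field where c > d')
print(repr(untouched))

# One possible adjustment (an assumption): "*" instead of "+" also
# matches one-letter tags. It does not fix the false positive.
relaxed = re.compile(r'<[/\w][^>]*>')
fixed = relaxed.sub('', '<i>A Collection</i>')
print(repr(fixed))  # 'A Collection'
```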

helrond mentioned this issue May 19, 2022

Normalize text with HTML or XML tags into standardized HTML output #508

Open

helrond mentioned this issue Jun 6, 2022

Changes from development #511

Merged

helrond closed this as completed in #511 Jun 6, 2022

ctgraham mentioned this issue Aug 19, 2022

Remove HTML / XML from Aeon POSTs RockefellerArchiveCenter/request_broker#221

Closed
