Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removes control characters in the input/output of content #319

Open
wants to merge 1 commit into
base: 7.x
Choose a base branch
from

Conversation

DonRichards
Copy link
Member

@DonRichards DonRichards commented Apr 3, 2019

JIRA Ticket: https://jira.duraspace.org/browse/ISLANDORA-2412

Broken MODS Error

What does this Pull Request do?

Sanitize MODS data into Google Scholar submodule

This does not change the data being stored, it should only impact the data as it processes for Google Scholar meta tags.

What's new?

String replace for ASCII control characters

How should this be tested?

Please use the example in the JIRA ticket for a proof of concept and MODS file with a Control Character embedded into the abstract.

Try ingesting prior to testing this PR.

  • View the error during ingest and/or viewing the object

Pull this PR

  • Navigate to previous submission
    • Error should be gone
  • Ingest a 2nd ETD to test the ingest process works without errors now.

Additional Notes:

N/A

Interested parties

@Islandora/7-x-1-x-committers

@bryjbrown
Copy link
Member

@DonRichards First things first, theres someting about the JIRA issue you linked to that makes me unable to see it (says "You can't view this issue. It may have been deleted or you don't have permission to view it."), so I don't know the backstory to this.

Second, unless I'm mistaken these control characters are not allowed in XML to begin with, so any MODS record with control characters is not well-formed and shouldn't have made it into Fedora to begin with. If that line of thinking is true, then it seems like sanitizing the output for Google Scholar is slapping a band-aid on the problem. I'd argue that anything that reduces the pain of using ill-formed XML is encouraging bad practice.

@manez
Copy link
Member

manez commented Apr 3, 2019

@bryjbrown are you logged in to JIRA? I think the ticket should be visible to you, but not to anon right now.

@bondjimbond
Copy link
Contributor

@manez @DonRichards I'm logged in and also getting the "You can't view this" message.

@jordandukart
Copy link
Member

@DonRichards First things first, theres someting about the JIRA issue you linked to that makes me unable to see it (says "You can't view this issue. It may have been deleted or you don't have permission to view it."), so I don't know the backstory to this.

Second, unless I'm mistaken these control characters are not allowed in XML to begin with, so any MODS record with control characters is not well-formed and shouldn't have made it into Fedora to begin with. If that line of thinking is true, then it seems like sanitizing the output for Google Scholar is slapping a band-aid on the problem. I'd argue that anything that reduces the pain of using ill-formed XML is encouraging bad practice.

If this is the case would this helper be more useful: https://github.com/Islandora/islandora/blob/7.x/includes/utilities.inc#L878

@bondjimbond
Copy link
Contributor

Second, unless I'm mistaken these control characters are not allowed in XML to begin with, so any MODS record with control characters is not well-formed and shouldn't have made it into Fedora to begin with. If that line of thinking is true, then it seems like sanitizing the output for Google Scholar is slapping a band-aid on the problem. I'd argue that anything that reduces the pain of using ill-formed XML is encouraging bad practice.

So would these bad characters get into the MODS record in the first place if they were ingested via a form? (Can't see the ticket, so can't access the example to test.)

IF the form strips it, then the bad characters are perhaps getting in via batch ingest?

In general, it would be a good idea to stop bad XML by validating it at the start of the ingest process.

@dannylamb
Copy link

dannylamb commented Apr 3, 2019

@bondjimbond If you can discern that the datastream you're ingesting in a batch is XML, it could conceivably get sanitized there. Still wouldn't stop someone who is manually jamminng in bad XML through Fedora's API, though.

@DonRichards
Copy link
Member Author

@bryjbrown We allow students to submit directly through the GUI and this comes up about twice a semester for us. And you should be able to see the Jira ticket if you're logged in and are on the security response group (you should be).
@jordandukart Thanks for that, that might be a good long term solution to integrate.
@bondjimbond The form isn't stripping this out currently.

@DonRichards
Copy link
Member Author

Assuming Github doesn't strip this stuff out of postings I'm going to paste a string that breaks it here.
The control character is between the two "nn". Paste this into the title or abstract.

total Lagrangiannite

To check if Github filtered go to https://www.textmagic.com/free-tools/unicode-detector and paste it in. It should say "Character not present in GSM charset, forces to use Unicode encoding"

@bryjbrown
Copy link
Member

@DonRichards

And you should be able to see the Jira ticket if you're logged in and are on the security response group (you should be).

I am logged in, but I'm not part of the security response group. Is that what could be causing me not to be able to see it?

@DonRichards
Copy link
Member Author

If you paste that directly into a input field this is the results.
Islandora Vagrant

@DonRichards
Copy link
Member Author

@bryjbrown Yes, I set it to sensitive because it was a security response ticket. I've basically showed most of what was in the ticket in the comments here.

@DonRichards
Copy link
Member Author

The "islandora_sanitize_input_for_valid_xml" @jordandukart suggested is currently firing for Full_text generation but I don't see it being fired prior to original file is ingested.

@manez
Copy link
Member

manez commented Apr 3, 2019

Apologies for the confusion @bryjbrown . The message on the ticket implies that folks in the Developers group should be able to see, but I forgot that our permissions are customized, so that's not accurate under the hood. My bad 😞

@fuel37
Copy link

fuel37 commented Apr 5, 2019

@bryjbrown Yes, I set it to sensitive because it was a security response ticket. I've basically showed most of what was in the ticket in the comments here.

Out of curiosity, if this is a sensitive security issue being handled by the Islandora Security Group, are there not protocols in place to keep things private until a fix has been identified and released? I mean otherwise, having the exploit posted publically as it is here and on Jira puts the entire community at risk, no? I'm not privy to the protocols for handling security concerns but, if this is standard practice, I'd suggest the protocols be reviewed asap.

@dannylamb
Copy link

@fuel37 There are protocols in place. They just weren't followed. I'm trying to pull the conversation out of this and into private channels, although I see nothing wrong with taking the time to review how we handle these sorts of things.

@DonRichards
Copy link
Member Author

@fuel37 Through a few conversations it became apparent to me that this may cause headaches for viewing, it isn't providing any obvious platform for a malicious user to leverage an exploit. This causes functionality to not render completely but because it originates from a user submission it was originally set as sensitive.

I worried when I submitted it that this might have a further reaching issue. It turns out that Google Scholar submodule is using the fields as strings to generate embedded tags and it doesn't handle them well. Drupal gives too many options for filtering and cleaning these strings.

@DonRichards
Copy link
Member Author

ALthough we have a PR coming in to filter everything, this should be good to filter existing material.

@@ -24,6 +24,8 @@ function islandora_google_scholar_create_meta_tags($object) {
else {
$tags = array();
$tags['citation_author'] = array();
// remove breaking control characters.
$object['MODS']->content = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $object['MODS']->content);
Copy link
Contributor

@adam-vessey adam-vessey Apr 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of issues here:

  • creating a new version of the datastream every time this runs
  • restricting to ASCII... nothing of UTF-8 is allowed? Actually... not sure about this one...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not creating a new MODS datastream, it's just re-assigning it when islandora_google_scholar_create_meta_tags is fired (every page load). This is what I got after ingesting and reloading the page several times. Are you getting something different?

screenshot-localhost-8000-2019 04 11-15-59-45

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The content property of datastreams is "magic", which is to say: Setting it should result in another version of the datastream being created.

Indeed, it does appear to... You appear to be running that on something that would not be touched by the code changed here... a large image? The given module only runs the given function for objects of the ir:citationCModel or ir:thesisCModel models.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing
$object['MODS']->content = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $object['MODS']->content);
to something like
$filtered_mods = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $object['MODS']->content);
should be enough to get around this issue, right @adam-vessey?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bryjbrown: Yes, assuming of course that it's this hypothetical $filtered_mods variable that gets used instead of the content pulled from the MODS datastream on the next line.

Another alternative to that undertaken here which should additionally have very little chance of messing with character encodings would be to more gracefully handle the error when parsing the original content... constructing the new SimpleXMLElement will throw an exception if it fails to parse... we could catch and log it (watchdog(_exception), drupal_set_message, more?)... so that people would have a more direct way of discovering that their XML is invalid for whatever reason. Similarly, warning/error reporting could be taken over while constructing the object (libxml_use_internal_errors() and libxml_get_errors(), kind of thing), so we can control how/if they propagate... my concerns with the present implementation... I'm always leery of these kind of runtime filtering approaches, and how they essentially just mask underlying issues. Additionally, if the MODS does happen to be parseable, there's no need to filter it... less processing is usually better, especially in code for generating displays.

... as for what to do when handling an error... my first thought is just to return FALSE; (as we do elsewhere in this function where we have nothing to embed)...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adam-vessey You first comment ir:citationCModel or ir:thesisCModel model is the only one going to be impacted because this is sanitizing MODS data going through the Google Scholar submodule for Google Scholar tag display. Not other cmodels should ever see this error. There was a security patch in Islandora for forms to filter this type of stuff. This PR should only be useful for those currently in the repository (new content will get filtered during ingest). I can push a quick adjustment from that feedback.
@bryjbrown Thanks for the suggestion.
What do you guys think about changing the $mods line to keep it simple?
$mods = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $object['MODS']->content);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DonRichards I think you and @adam-vessey are in agreement. Only ir:citationCModels and ir:thesisCModels are affected, yet you posted a screenshot showing what appears to be a large image object. Since large images aren't affected by this function, they aren't a valid way to test whether the function creates new versions of the MODS datastream on page load or not. You'd only be able to tell by testing an ir:citationCModel and ir:thesisCModel object.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bryjbrown Oh man, yep I did that and yes it's creating multiple versions. Changing it to $mods = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $object['MODS']->content); corrects that is does filter this appropriately. @adam-vessey Would you be good with this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DonRichards: It should address the issue of unintentionally creating versions; however, as mentioned, doesn't really facilitate the discovery of the invalid data such that it might eventually be fixed.

Looks like the same issue with parsing pops up, in the "Coins" display... and really, could pop up anywhere we attempt to parse the datastream? The issue could even go so far as Solr documents being incomplete due to exceptions in GSearch when parsing? Almost... seeming like a GIGO kind of thing... invalid data causing errors, so... really, whatever ingest mechanism allowing the invalid data should be fixed? Made to validate data, at least?... maybe a batch to fix things (though identifying what needs to be fixed could be a chore)...

In any case... threw together the exception-catching business, more as an example of what it might look like: 7.x...adam-vessey:7.x-alternative-error-handling

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I got it worked out. There is one place in addition I see needs attention in the islandora_xml_forms/api/XMLDocument.inc L144 $document->loadXML(islandora_sanitize_input_for_valid_xml($xml)) : for this to work.

@DonRichards
Copy link
Member Author

I'm sorry but this isn't isolated to Scholar and should probably live further upstream. I just did a large image ingest with the same problem.

@DonRichards
Copy link
Member Author

Maybe this shouldn't be closed. Not sure.

@DonRichards DonRichards reopened this Apr 18, 2019
@bondjimbond
Copy link
Contributor

@DonRichards Did you find a place "further upstream" to fix the display issue?

@DonRichards
Copy link
Member Author

@bondjimbond No, I'm not sure where it would be.

@DonRichards DonRichards reopened this Feb 18, 2020
@DonRichards
Copy link
Member Author

I believe this this needs to be merged here and not upstream at this point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants