Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore invalid jsonld elements on the page source. #189

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

naveen17797
Copy link

@naveen17797 naveen17797 commented Jan 9, 2022

This PR alters the behaviour such that If there are invalid jsonld elements with valid elements on the page source, it returns only the valid jsonld elements.

Copy link
Member

@lopuhin lopuhin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @naveen17797 , the feature makes sense, I left some minor comments regarding the code, however I also have a more general question - previously, if we had invalid JSON, we'd crash or log it (depending on the error setting). Now, it would be silently ignored. What do you think about still having some logging if our last attempt at parsing the JSON fails, similar to this?

logger.exception('Failed to extract {}, raises {}'

try:
return json.loads(script, strict=False)
except Exception:
return False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor, but to me it would make more sense to return None here, what do you think? Both None and False can be valid results of JSON parsing, but None looks like a more "neutral" value to use for error case.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

# TODO: `strict=False` can be configurable if needed
data = json.loads(script, strict=False)
except ValueError:
# sometimes JSON-decoding errors are due to leading HTML or JavaScript comments
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not remove this comment, I think it's useful

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the comment back to the code.

data = self._may_be_get_json(script)
# check if valid json.
if not data:
script = jstyleson.dispose( HTML_OR_JS_COMMENTLINE.sub('', script))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick - let's remove a an extra space here

Suggested change
script = jstyleson.dispose( HTML_OR_JS_COMMENTLINE.sub('', script))
script = jstyleson.dispose(HTML_OR_JS_COMMENTLINE.sub('', script))

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

data = self._may_be_get_json(script)
# After processing check if json is still valid.
if not data:
return False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick - we don't use return value, as this is a generator, so return would make more sense

Suggested change
return False
return

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.


<head>
<script type="application/ld+json">
{ foo: bar}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the test! From what I understand, it would fail previously, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it should ignore this jsonld element since it is not valid.

@naveen17797
Copy link
Author

previously, if we had invalid JSON, we'd crash or log it (depending on the error setting). Now, it would be silently ignored. What do you think about still having some logging if our last attempt at parsing the JSON fails, similar to this?

i have added the log statement

logging.exception('Invalid jsonld element detected %s', script)

@naveen17797
Copy link
Author

naveen17797 commented Jan 11, 2022

i understand if we merge this PR this would remove the ability to stop parsing the page with invalid jsonld elements, previously extruct will raise an exception and fail for jsonld, i am not sure how could i retain this behaviour.

may be i can pass errors argument from extract() function to jsonld extractor and determine if we need to raise an exception or just return all valid elements @lopuhin ?

@argaurav
Copy link

argaurav commented Aug 3, 2023

Is there a timeline on this issue. It would be good to get a fix for this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants