Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ePrint: add support for non-PDF attachments #3337

Merged
merged 2 commits into from
Jul 22, 2024

Conversation

alex-ter
Copy link
Contributor

This adds general logic to support non-PDF attachment formats for ePrint, plus fixes a bug in case there are no attachments at all. I don't think there's anything beyond PS in addition to PDF, but it's easily extendable in the allowlist fashion if needed.

Relevant tests added (no attachments at all, just PS, both PDF and PS, PDF-only is already tested by existing tests). Tested using Scaffold in Z6/7.

Note that there seems to be a Scaffold-related issue where the text contents of Mathjax elements is no longer captured properly when the test is run, that affects some of the old and new tests. When running the doWeb() from Scaffold's Browser tab, it works as previously/expected, and when it's run from the Tests tab, it fails with delta indicating lack of Mathjax-contained text (mostly in abstracts). This is somewhat unexpected, so I'll report this as a separate issue against Scaffold shortly.

@alex-ter alex-ter force-pushed the eprint-add-ps-attachment-support branch from cee6090 to 63a29a2 Compare July 14, 2024 14:42
@alex-ter
Copy link
Contributor Author

Reported on the dev mail list: https://groups.google.com/g/zotero-dev/c/fHwgGUICINE

@AbeJellinek
Copy link
Member

OK, it looks like the issue here is that the site uses MathJax lazy typesetting. I tried to see if we could force rendering by deleting window.IntersectionObserver as early as possible when loading test pages, but that didn't work. This is a kind of niche case, so I think it'd be best just to work around the issue in this translator. Try something like this in scrape():

if (doc.querySelector('mjx-lazy')) {
	abstr = attr(doc, 'meta[property="og:description"]', 'content');
}

@alex-ter
Copy link
Contributor Author

Thanks Abe, good point. Let me take a look.

Running tests for the PS format addition change revealed that ePrint pages
seem to have switched to using MathJax lazy typesetting. That leads to
differences in expected vs scraped abstract text when run under Scaffold,
but not when running manually from browser. Fix by pulling the abstract
from the OG tag if MathJax elements are not typeset yet.

Kudos to @AbeJellinek who found the root cause and suggested the workaround.

Signed-off-by: Alex T. <[email protected]>
Co-authored-by: Abe Jellinek <[email protected]>
@alex-ter
Copy link
Contributor Author

@AbeJellinek, thanks again - I've just tested the suggested workaround and indeed looks like you're absolutely right, the reason is the lazy typesetting, and the fix does work. I have added a commit with the fix + updated tests.

@alex-ter
Copy link
Contributor Author

alex-ter commented Jul 21, 2024

[EDIT: never mind, I've just realized you've got an issue for that and the fact they are being run now is probably due to your intermediate fix listed there: https://github.com//issues/3343. Feel free to disregard the below and apologies for the noise :)]

BTW (I see you're working on some CI updates in a WIP PR), it seems like the previous update to this PR did not get properly tested, and in this update it actually worked fine. As there have been no commits to this repo in the meanwhile changing anything in the CI space, I guess it's unexpected, so just wanted to draw your attention to that.

This update, tests were executed as expected: Actions log link.
Previous update, tests seem to have been silently skipped, but with a "success" status: Actions log link.

@AbeJellinek
Copy link
Member

As there have been no commits to this repo in the meanwhile changing anything in the CI space, I guess it's unexpected, so just wanted to draw your attention to that.

CI always pulls the latest zotero/zotero-connectors master, and zotero/translate@ddd90b8 (in the translate submodule) fixed the issue.

@AbeJellinek AbeJellinek merged commit 5908f0f into zotero:master Jul 22, 2024
1 check passed
@AbeJellinek
Copy link
Member

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants