Generic webpage translator #1092
Comments
How about the other idea of extending the EM translator for this case? It looks to me as though the function addLowQualityMetadata in the EM translator is similar to what you want to achieve. Thus, it might already be enough to extend the |
+1 to @zuphilip 's question. Also, I don't understand why this must be limited to Zotero 5+ -- what am I missing? |
It's what I say in the other thread: "even in the single-save-button era I still think there's value in setting different expectations for EM and <title/>".
Without client changes, the color icon would appear on every page, even for title/URL/accessDate, and there'd be a confusingly redundant set of options in the context menu (which hard-codes web-page saving right now). |
Well, you explained that the different colors serve some purpose. But if the EM translator can extract some data on a page (and it will save also a snapshot of that page), then I cannot think of any use for a lower quality website translator on this page. I guess, that we can also somehow color the icon for EM differently if we are in some low quality data case, maybe just if |
We can just have the generic translator not show up in the context menu when the EM translator triggers. The point is that we can't distinguish between EM and generic data within a single translator, so they have to be separate. |
It's true that the stuff in So, some options:
|
@simonster points out that the Here's I think what (3) would involve:
One potential future complication: when we support JSON-LD, and specifically multiple JSON-LD blocks on the page, the translator would return |
Since EM would still be able to detect non-webpage content, I'm not sure renaming it to Webpage makes sense. |
Hmm. That's fair, though it's the translator name, so it's saying that it's saving using the Webpage translator (i.e., extracting generic data from the webpage), not that it's saving as a webpage (which the icon indicates). But maybe overly confusing. "Embedded Metadata" is a bit technical to show on all pages, though. Best option might be to just not show anything in parentheses for this translator, since it'll be the default saving mode (and the default icon as well). |
I think your option 3) is a good choice!
Yes, I agree with that. Therefore, IMO it makes sense to show the gray icon for such low-quality data, which may be just useful enough for saving some URLs for later reading (bookmark functionality), but usually one has to cite more reliable sources than just webpages.
Well, we can see this clearer if we have some ideas of a JSON-LD translator. In general I think it is a good idea to have the possibility to use Zotero also as a bookmark tool and therefore any handy one-click option to capture the website (as one item) is appreciated.
I.e. simply |
See also #686, which suggests that DOI should go in this too. zotero/zotero#1110 is an interesting test case. |
So I'll be working on this, as per @dstillman's comment
noting the following:
There have been suggestions to incorporate COInS and DOI into EM, but I would like to leave that up to someone else as there are additional considerations, like what happens with the translators (if any) that use both COInS and EM for initial metadata. |
Ok, so a problem with the above approach is that if EM always returns at least I understand that we always return |
Yep, so at least for some of the DOI test cases the select dialog contains multiple entries, with only one of them corresponding to the actual article being saved. Potential options:
|
I think 2. is the way to go. The cases where you do want to use DOIs as multiples are often for fairly sophisticated use (e.g. importing all references from an article you're looking at in html) -- but as that example shows, it's also a really useful feature. |
Agree with @adam3smith, and the example http://libguides.csuchico.edu/citingbusiness shows that we are already preferring EM over DOI in "sparse" examples. (Technically, I guess it would also be possible to call the DOI translator from the EM translator if this case happens, but this might be more fragile code...) |
I thought that was the idea of combining ? For single-DOI cases, call DOI in EM with some heuristic for making sure we're looking at the same item, then merge data. Same for COinS, which can also have multiples. |
I'm a bit confused about the argument for (2). DOI being the only available translator is fairly common, so we wouldn't want to start preferring a generic webpage in that case. Even if we kept it separate for multi-DOI cases but integrated it into EM for single-DOI cases, a search results page with multiple DOIs and no other real metadata would start offering a generic webpage as the main option, which is worse than the current behavior. I think the only real solution is to integrate DOI (and COinS, and JSON-LD eventually) into EM and decide what to do based on what's available.

So this is a bit radical, but working through some scenarios and optimal behavior, it seems we need to allow a single translator to provide multiple save options. This is how the EM translator could pass webpage options, including snapshot/no-snapshot, with or without color and before or after its other save options as appropriate. (We could still alter the display order of the snapshot options based on the client pref, but we wouldn't need to do most other special-casing for the EM translator.)

There are also various scenarios where the EM translator could intelligently decide which options to offer, whereas relying on multiple translators based on priority is much more limited, would result in redundant, confusing, inferior secondary options (e.g., a "DOI" menu option that only used CrossRef when the save button was already combining data from the page and from CrossRef), and would require special-casing for the placement of various options (e.g., putting the generic webpage options last).

We could allow returning an object (instead of a string) to specify a different label, including an empty one, which, among other things, would avoid the need to special-case the EM translator to remove the label and let us instead intelligently label based on how it was actually doing the save (since "DOI" or "Embedded Metadata" or "COinS" would sometimes be nice to show).
Finally, this could obviate the need for various translator hidden preferences and make those options much more accessible (e.g., So for EM,
which would result in a button with
With that in mind, some example scenarios:

Page has a non-generic translator, embedded metadata for non-
Item type icon via non-generic translator, EM item type in menu, EM gray webpage options in menu

Page has a non-generic translator, embedded metadata for
Item type icon via non-generic translator, EM color webpage options in menu

Page has single non-
Item type icon, DOI selection in menu, gray webpage options in menu, all from the EM translator. As a single translator,

Page has no embedded metadata but multiple DOIs
Folder icon via EM translator, gray webpage options in menu. In doWeb(), resolve first DOI, if DOI seems to match page, just treat as regular DOI list. Otherwise, first entry in select dialog is current page using generic info (title, URL, access date) and DOI for the rest of results. As a single translator, it allows saving of the generic page info (potential improvement over status quo) and avoids showing a gray webpage icon even though there might be a DOI for the main item on the page (which would be a regression from status quo).

Page has single-item embedded metadata that returns
Folder icon via EM translator, color webpage options in menu. In doWeb(), resolve DOI, if DOI seems to match embedded metadata, combine (which probably means using only CrossRef). Otherwise display select list with first entry from embedded metadata and resolved DOI as second entry. (For the first case, a little weird to save straight from a folder, but why show two entries when we know one is worse and why show one entry if we're sure it matches the current page?) As a single translator, it avoids saving a

Page has single-item embedded metadata that returns something other than
Same as previous, but optimistically show an item type icon from the embedded metadata. Combining metadata (when resolved DOI matches embedded metadata) might just mean adding an abstract from the embedded metadata to supplement CrossRef data.

Page has single-item embedded metadata that returns
Color webpage icon, color webpage options (snapshot/no-snapshot) in menu, no gray options |
Another thing we could do: ISBN detection that only ever showed as a folder in the menu and was never offered as a primary method, for the reasons @adam3smith explains in that thread. |
I'm convinced by that rundown. The only one that's a bit wonky, (no metadata, multiple DOIs) is a bit weird currently, too, and the proposed solution is a slight improvement. COinS should likely work exactly the same way. |
Do you have any suggestions for how the "seems to match" check would be performed in JS, considering we only have very low-quality metadata before the DOI lookup? Some sort of fuzzy matching is needed, but that would mean pulling in a third-party library, and showing false positives first would be a rather bad experience. In general, one of the reasons we wanted a generic translator (and why I specifically decided to work on this now) was to remove the special-casing in the Zotero, connector and translation-server codebases for pages that lack translators, and to leverage the existing code to provide generic saving in all instances. However, the plan outlined actually works against the simplification goal: it will take a non-trivial amount of time and effort to implement and roll out within the translators and the translate software, and it will make translation-server client handling more complicated too. Having the above working would be great, but I wouldn't want to commit to a change this big. Having said that, I propose a less elegant and less efficient, but much simpler solution:
If DOIs are present, EM will not overshadow the DOI translator; otherwise it will take over. If both rich EM data and a DOI are present, then both can co-exist. This way we can avoid any changes or special translation handling within Zotero and translation-server and have a translator for every page. It sacrifices code clarity in the intermediate term, but is a workable solution for the short term until someone has the time and spirit to commit to the bigger change. |
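One possible reading of the interim rule above, written as a small decision function. This is only a sketch: `interimDetection`, `hasRichEM` and `doiCount` are hypothetical names used for illustration and are not part of any Zotero API.

```javascript
// Hypothetical helper, not part of any Zotero API: decides which of the two
// generic translators should claim a page under the interim plan.
function interimDetection(page) {
  const hasDOIs = page.doiCount > 0;
  return {
    // The DOI translator behaves as it does today whenever DOIs are found.
    doi: hasDOIs,
    // EM claims the page if it found rich metadata, or if nothing else would
    // (no DOIs), in which case it falls back to title/URL/access date.
    embeddedMetadata: page.hasRichEM || !hasDOIs,
    // True when EM would only be saving the generic fallback.
    genericFallback: !page.hasRichEM && !hasDOIs,
  };
}

// A page with DOIs but no rich embedded metadata is left to the DOI translator.
const searchResults = interimDetection({ hasRichEM: false, doiCount: 12 });
```

The point of keeping this as a per-translator decision is that no client or translation-server changes are needed; each translator only has to decide whether to return a value during detection.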
I think yours is a good interim plan. Mine will let us remove almost all special-casing and also provide better results, but it will definitely take some work to get there, and might make sense as part of a larger reworking of the translator architecture (e.g., to use promises everywhere). I'll probably work on that at some point.
So the user-facing changes here will be that 1) you'll see the blue webpage icon much less often and 2) the gray icon and webpage menu options will start showing more data. And translation-server will be able to save all webpages. |
It might be useful to look more closely at these dependent translators: https://github.com/zotero/translators/search?q=951c027d-74ac-47d4-a107-9c3069ab7b48&unscoped_q=951c027d-74ac-47d4-a107-9c3069ab7b48 . I just clicked on a few and two things became clear (I also suggest looking closely at all of these dependent translators and not just a sample):
Thus, EM is currently used as a generic way of extracting bibliographic data from a website (possibly tweaked a little by a site-specific translator).
No, actually I would expect in your scenario that most/some of the dependent translators would then need to be based on this new ultimate/merging translator. This would be then the same as the current situation, but with another intermediate step.
Okay, that sounds fine and can possibly fix some currently problematic cases.
This OJS instance is not giving much more machine-readable/guessable information, although OJS in general makes this very easy: there are mandatory plugins for OpenURL, Dublin Core and MODS, and I guess they just have not enabled the Dublin Core Indexing Plugin. One main drawback of EM, IMO, is currently the lack of JSON-LD and other variants of schema.org. I tried to work on these, but currently my time does not permit me to continue here... As for the order given above:
|
But why would we want to base other translators on this new combined translator? I think the combined translator should only be used when a site-specific translator fails, and never as a dependency (except maybe in the case of multi-domain OJS). I imagine it would be different from what we normally call translators: more like logic that decides what to do with the web page if there is no site-specific translator or it failed.
I don't think we need to fix metadata problems in site-specific translators by using the combined translator. If a site-specific translator is implemented, it should be better by default, because translator authors should know what they are doing and find the best way to extract metadata, even if they need to additionally get metadata by DOI. And they can use all the same methods that are used in the combined translator. Also, the output of site-specific translators can be controlled with tests, and problems should be fixed within the same translator. Therefore I agree with @adam3smith that making a few translators stricter could be a solution.
The combined translator would successfully extract the correct metadata from that URL. But let's imagine that someone decides to make a translator for that URL. That translator would then have to combine metadata from EM and DOI to get metadata of the same quality. Basing it on the combined translator wouldn't be a good idea, because the combined translator is going to do too much magic. Therefore, if the translator author sees that EM returned an item with a missing ISSN or an imprecise date, those should be extracted either from the page or by DOI. Also, we are discussing adding MODS, MARCXML and JSON-LD to the EM translator, but what if the page has multiple items? EM is single-item only. |
I would think that retrieval based on <link> would go in a separate Linked Metadata translator called from the combined translator, similar to unAPI, not in EM. But in-page JSON-LD might go in EM, in which case it would need to possibly handle multiple items. Do you mean that it'd be a problem in terms of EM being called from other translators that expect a single item? |
EM is currently designed to only return single item and all dependent translators also expect single item. And yeah, I'm thinking how that influences other translators. Also if we start advising people to use MODS/MARCXML, we should expect translators that are wrapping that Linked Metadata translator and improving some fields, just like EM is used now in other translators. |
That's not true — it's just a callback on |
I was thinking about cases like this where translator actually trusts that it gets a single item, because otherwise it would add the same abstract to all items, which wouldn't make sense. Of course if a website for which the site-specific translator was implemented has only one item, why it should ever return multiple items. Anyway, if we are adding JSON-LD which can return multiple items, logically, we should add COinS too, which can also return multiple items. But again, I am trying to understand what will be the consequences of making EM as a multi-items translator. Also adding JSON-LD and COinS to EM means there must be a logic in EM translator that combines metadata if multiple methods exist. Also what if RDF returns single item and JSON-LD or COinS returns multiple? |
I'm thinking that But let's imagine what would happen if So I think all translators should be separated:
A few more reasons why not to merge any other translators into the EM translator:
So my suggestion is to keep all translators separate, use them separately in site-specific parent translators, and then introduce a combined last-resort translator that intelligently uses all the previously listed separate translators. The combined translator wouldn't be used in any other translator, except maybe in multi-domain translators: they can't control their output quality with tests, yet it would be dangerous for them to block the combined translator. In that case the combined translator could be invoked with the already extracted metadata, which it would make use of too. |
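As a rough illustration of "invoked with the already extracted metadata", the sketch below fills gaps in a base item from further candidates without overwriting anything the first source supplied. `combineItems` and `isEmpty` are hypothetical helpers operating on plain objects; the real combination logic would need to be smarter (for example about creators and dates).

```javascript
// Hypothetical helpers for illustration only; items are plain objects here.
function isEmpty(value) {
  return value === undefined || value === null || value === "" ||
    (Array.isArray(value) && value.length === 0);
}

// Keeps every non-empty field of `base` and fills the remaining gaps from
// each candidate in order, so earlier sources take precedence.
function combineItems(base, ...candidates) {
  const combined = { ...base };
  for (const candidate of candidates) {
    for (const [field, value] of Object.entries(candidate)) {
      if (isEmpty(combined[field]) && !isEmpty(value)) {
        combined[field] = value;
      }
    }
  }
  return combined;
}

// Metadata already extracted by a multi-domain translator, then a DOI lookup.
const fromSite = { itemType: "journalArticle", title: "Example title", ISSN: "" };
const fromDOI = { title: "Example Title.", ISSN: "1234-5678", date: "2018-05-01" };
const merged = combineItems(fromSite, fromDOI);
```

Giving precedence to the first argument is a design choice made for this sketch; a source-quality ranking per field would be an equally plausible rule.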
I think that makes a lot of sense. Only somewhat related, but one general concern I have is that, traditionally, we've been pretty complacent about the data available on a given site — we've mostly just accepted that what's there is the best we can do, even if some fields are missing. It would be nice to figure out ways to make sure we're getting as much data as possible, even if it means using other services. I don't think it's realistic to solve that purely by convention and tests (e.g., by using the DOI translator as a dependency more liberally, though we can do that too), and I still think we may want to consider certain thresholds or rules that trigger automatic supplementation of the data when possible. |
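A threshold of the kind mentioned above could look roughly like the following. Everything here is an assumption made for illustration: the `EXPECTED_FIELDS` list, the `maxMissing` default and the function name `shouldSupplement` are invented, not taken from Zotero.

```javascript
// Hypothetical rule set: fields we would expect a complete item to carry.
const EXPECTED_FIELDS = {
  journalArticle: ["title", "creators", "date", "publicationTitle", "volume", "pages", "ISSN"],
};

// Returns true when an item has an identifier to look up (a DOI) and is
// missing more expected fields than the threshold allows.
function shouldSupplement(item, maxMissing = 1) {
  const expected = EXPECTED_FIELDS[item.itemType];
  if (!expected || !item.DOI) return false; // no rules, or nothing to look up
  const missing = expected.filter((field) => {
    const value = item[field];
    return value === undefined || value === "" ||
      (Array.isArray(value) && value.length === 0);
  });
  return missing.length > maxMissing;
}
```

A rule like this only decides whether to fetch more data; how the fetched data is combined with the existing item is a separate question.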
Well, that sounds similar to what we are trying to do with zotero/zotero#1582. If we trust that translators are already doing their best to extract metadata from the page, there is no need to perform any additional generic translation for the page. So the only thing that is left is to utilize identifiers to retrieve metadata from additional sources, which is what we are doing in zotero/zotero#1582:
And actually the combined translator will have some similarities with the metadata update logic in the client, i.e. it gets metadata by an identifier (DOI) and combines metadata. I'm a little bit concerned about duplicated operations in some situations. For example, if a user manually triggers a metadata update in the Zotero client and the combined translator takes over, the metadata will be fetched from the DOI RA and combined twice: once by the combined translator and once by the metadata update logic in the client. It would be nice to somehow converge the two. We previously discussed automatically triggering the metadata update logic when saving items via the Zotero client lookup dialog or the connector, but I think the conclusion was to proceed with manually triggered metadata updating and see how it performs. We had concerns about leaking our usage stats and querying some identifier APIs too often. I'm also concerned about the Zotero connector/bookmarklet and cross-origin requests. What are our limitations here? |
I'm open to any suggestions on how we could improve the generic metadata extraction, but if no one objects I'll start implementing the roadmap below. And of course everyone is welcome to work on any part too.
The improvements will be made in steps, and to begin with we basically just want to wrap the DOI and EM translators with the new combined translator. As soon as the combined translator wraps other translators, I will use its output to collect and compare metadata from all URLs in the translator tests. This will allow us to review how metadata differs between various translators and should give a better idea of how to combine metadata from different translators. |
I agree that it is cleaner to have separate translators and one combining translator. However, I cannot say which part of the EM translator (meta tags, microdata, low-quality data, ...) the currently 100+ dependent translators depend on, nor what changing this would mean in the future. Maybe you can help me to answer some questions around that aspect:

Can we do the same things we can do currently in dependent translators?

For a dependent translator I would then still be able to call any of the new separate translators or possibly more than just one. However, you said that I should usually not call the merged translator, but the

Can we do the same things in a dependent translator with some easy code?

I could imagine that for a website I would need a specific translator for the multiples, and for most of the metadata I could then use a mixture of JSON-LD, meta tags, and Microdata. Then I possibly need to call all three translators, e.g. in a nested way:

function scrape(doc, url) {
    var translatorEM = Zotero.loadTranslator('web');
    translatorEM.setTranslator('951c027d-74ac-47d4-a107-9c3069ab7b48'); // Embedded Metadata
    translatorEM.setDocument(doc);
    translatorEM.setHandler("itemDone", function(obj, itemEM) {
        var translatorJSONLD = Zotero.loadTranslator('web');
        // Hypothetical ID for a separate JSON-LD translator
        translatorJSONLD.setTranslator('951c027d-74ac-47d4-a107-9c3069ab7b48-jsonld');
        translatorJSONLD.setDocument(doc);
        translatorJSONLD.setHandler("itemDone", function(obj, itemJSONLD) {
            var translatorMICRODATA = Zotero.loadTranslator('web');
            // Hypothetical ID for a separate Microdata translator
            translatorMICRODATA.setTranslator('951c027d-74ac-47d4-a107-9c3069ab7b48-microdata');
            translatorMICRODATA.setDocument(doc);
            translatorMICRODATA.setHandler("itemDone", function(obj, itemMICRODATA) {
                /*
                combine here itemEM, itemJSONLD, itemMICRODATA
                and/or add some site-specific data
                */
                itemMICRODATA.complete();
            });
            translatorMICRODATA.translate();
            itemJSONLD.complete();
        });
        translatorJSONLD.translate();
        // Was item.complete(), but no variable named item exists in this scope
        itemEM.complete();
    });
    translatorEM.getTranslatorObject(function(trans) {
        trans.itemType = "newspaperArticle";
        trans.doWeb(doc, url);
    });
}

Or is there a much easier way to do the same? Does all this nesting work? I remember some problems with EM being called from other translators (Sandboxing hell??), but maybe they are solved. Apart from feasibility, this code is IMO quite difficult to work with. Could we possibly do some helper functions maybe in

(I hope it is okay that I play the devil's advocate here with my questions. If you think that is not helpful, then you can also let me know.) |
Some vocabularies like Dublin Core or schema.org can be written either as meta tags, microdata or JSON-LD. The syntax differs, which could be handled by separate translators, but the semantics are the same (e.g. assign DC.title to the title field in Zotero) and should be reused. |
Yeah, I'm not sure removing Re: nesting, we're developing this on a branch where we have |
Specifically, they would all just forward to RDF.js, like EM does now. We discussed this previously in the context of JSON-LD. |
The combined translator (actually named "Generic.js") is functioning, and I am currently testing it with journal articles from 3.5k unique publishers. So the goal is to make this translator intelligent enough to automatically decide whether it's returning single or multiple items. But it's quite challenging to do in a generic way. In the past, Zotero automatically used DOIs from the page, but the decision was made to change that because the translator never knows whether a DOI belongs to the current article, search results, references or the next article in the journal. But actually the same problem applies to JSON-LD, COinS, unAPI and Microdata. You are never sure whether the metadata describes the item on the current page or something else. The following ways are used to detect whether the current web page represents a single item:
Simply put, all metadata in HEAD represents a single item (except JSON-LD and unAPI), and all metadata in BODY can represent single or multiple items, but you never know which. So not only are the items from extracted DOIs matched against the page title, but so is all the other metadata that we can't assume undeniably represents a single item. The combined translator then cross-matches, deduplicates and combines item metadata from different translators.

JSON-LD

I know we are considering recommending that people expose metadata in this format, but I see huge problems with it:
So I think we shouldn't recommend this format. It's a little bit "Wild West". Something that isn't so mainstream and is more targeted at bibliography would be a better choice. |
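The HEAD/BODY rule stated in this comment can be written down as a tiny classifier. This is a sketch over plain descriptors rather than a real DOM; `representsCurrentPage`, the `format` strings and the `location` values are hypothetical names chosen for illustration.

```javascript
// Formats that may describe several items even when found in HEAD,
// following the exception named in the comment above.
const MULTI_ITEM_FORMATS = new Set(["json-ld", "unapi"]);

// Returns "yes" when the metadata can be assumed to describe the current
// page, and "unknown" when a further check (e.g. title matching) is needed.
function representsCurrentPage(source) {
  if (source.location === "head" && !MULTI_ITEM_FORMATS.has(source.format)) {
    return "yes";
  }
  return "unknown";
}
```

Anything classified as "unknown" would then go through the title matching described above before being treated as the page's own item.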
For JSON-LD, I wouldn’t let the size issue affect our decision too much. We can put a library in a separate utility file without putting it in the translator itself. We would probably want to avoid injecting it into every page, but if we can do detection without the library we might be able to inject it dynamically for saving when necessary. |
Good to know, but the missing |
A conservative approach as you describe seems fine for me. I would restrict point 5 to cases where the H1, H2, H3 is unique within the page. Some unnecessary
It is also possible to think about switching completely from RDF to JSON-LD as our main format to support, i.e. replace
Interesting that you say mainstream has some disadvantages. - There is AFAIK only COinS which is a dedicated bibliographic format which can be embedded within a website. Every other bibliographic format has to be linked from a website with |
Well, unAPI has basically been dead for years, and we never supported I don't see the quality of existing JSON-LD as particularly relevant to whether it's our recommendation for exposing metadata. We need to support it either way, and it's at least possible for people to expose high-quality metadata for multiple items with it. It seems like the main alternative would be RDF (via a |
Actually, now I'm matching with header tags only when:
Title matching seems to work quite reliably in this way. Although the title is not always in header tags.
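For illustration, a title check of this kind can be done without a third-party library by normalizing both strings and testing for containment. This is not the actual Generic.js logic: `normalizeTitle`, `titleMatchesPage` and the `minLength` threshold are assumptions, and exact containment is a deliberately stricter stand-in for fuzzy matching.

```javascript
// Lowercases, strips diacritics and collapses punctuation/whitespace so that
// cosmetic differences between two renderings of a title disappear.
function normalizeTitle(text) {
  return text
    .toLowerCase()
    .normalize("NFKD")
    .replace(/\p{M}+/gu, "")            // drop combining marks left by NFKD
    .replace(/[^\p{L}\p{N}]+/gu, " ")   // anything that is not a letter or digit
    .trim();
}

// True when some header text (e.g. from H1-H3) contains the candidate title.
// Very short titles are rejected because they match too easily.
function titleMatchesPage(itemTitle, headerTexts, minLength = 10) {
  const wanted = normalizeTitle(itemTitle);
  if (wanted.length < minLength) return false;
  return headerTexts.some((text) => normalizeTitle(text).includes(wanted));
}

const matches = titleMatchesPage(
  "Deep learning for citations: a survey",
  ["Journal of Examples", "Deep Learning for Citations – A Survey"]
);
```

Because the check requires containment rather than similarity, it errs on the side of false negatives, which fits the concern raised earlier in the thread that showing false positives first would be a bad experience.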
Mainstream means people are defining multiple item types in multiple ways, and will do so even more in the future. At first glance, more metadata seems like a good thing, unless you can't distinguish which of the items is actually relevant to you. The problem is that for some websites we are extracting many different items, but we can't distinguish if they are:
And we just get a flat list of all JSON-LD items. Unless we would process everything according to the meaning of the specific vocabulary and its sophisticated relations, which would be very difficult. Also, Zotero itself supports many different item types. If we suddenly encounter a page with many different item types, which one should we choose? It's worth investigating COinS more, but there is the same problem of detecting whether a specific COinS record represents the current article or just a related article. We can't trust any metadata that is in BODY. If making an additional request wouldn't be a problem, linked metadata could be a better choice, because when there is single item |
Sure, we should definitely support JSON-LD, but we should also accept that it will result in more false results than, for example, Embedded Metadata. I already set JSON-LD as the last method to extract metadata in the combined translator because of the low-quality results. |
The next translator on the list is the Linked Metadata translator which represents #77. There were discussions about making it a multi item translator by extracting But there are some issues with this approach. itemprop: And according to the property: This can exist freely in any part of the BODY, but still it's the Another problem with So I started to think that maybe it would be better to restrict the Linked Metadata extraction to a single item in the HEAD. And then to add |
That all sounds good to me. Supporting |
I pushed all the code behind this huge generic translation update. The code still needs some work and more thorough testing, but this is roughly what it looks like. Multiple PRs must be merged to support the new translators. First the async support must be merged (zotero/zotero#1609), then the new utilities, and then the translators. The most important file is |
As suggested on zotero/translation-server#32, and further bolstered by zotero/zotero#1059, we should create a translator that saves the basic data (title, URL, access date) on all webpages.
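A minimal sketch of the three fields such a translator would save, written as a plain function so it can run outside Zotero. `buildGenericWebpageItem` is a hypothetical name, the `doc` argument only needs a `title` property here, and falling back to the URL when the page has no title is an assumption of this sketch; in a real translator these values would be set on a Zotero item inside doWeb().

```javascript
// Builds the generic fields (title, URL, access date) as a plain object.
// Wiring this into detectWeb()/doWeb() is intentionally left out.
function buildGenericWebpageItem(doc, url, now = new Date()) {
  return {
    itemType: "webpage",
    // Assumption: fall back to the URL if the page has no usable <title>.
    title: (doc.title || "").trim() || url,
    url: url,
    accessDate: now.toISOString(),
  };
}

const item = buildGenericWebpageItem(
  { title: "  Example Domain " },
  "https://example.com/",
  new Date("2017-01-02T03:04:05Z")
);
```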
Some follow-up work will be needed in the client to show the gray icon for this translator ID, and probably some other things.
To allow this to be rolled out to 4.0 clients without causing trouble, we should figure out a way to return a value from detectWeb only in 5.0. Not sure if we make the Zotero version available now, but if we want to avoid that (e.g., for other consumers of translators), we could do some sort of feature detection. (Ideally we could just use a minVersion here, but as far as I know the client won't ignore translators with later minVersions when running detection, which would seem to make a lot of sense.)