capture TextPositionSelector and/or RangeSelector #29

judell · 2019-06-06T20:02:01Z

As per hypothesis/product-backlog#1022, Hypothesis fails to distinguish among targets that share a common prefix and exact but differ in suffix. Annotations for multiple such targets pile up on a single highlight, preventing human curators from navigating to, and responding to, each target.
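To make the collision concrete, here is a minimal sketch in Python. The sample sentence, the `occurrences` helper, and the 10-character context window are all made up for illustration; nothing here reproduces the Hypothesis anchoring code. It only shows that two targets can share prefix and exact, differ in suffix, and still be told apart by their character offsets.

```python
def occurrences(text, exact, context=10):
    """Return start offset, prefix, and suffix for every occurrence of `exact`.

    `context` is an illustrative window size, not a Hypothesis setting.
    """
    results = []
    start = text.find(exact)
    while start != -1:
        end = start + len(exact)
        results.append({
            "start": start,
            "prefix": text[max(0, start - context):start],
            "suffix": text[end:end + context],
        })
        start = text.find(exact, start + 1)
    return results


# Hypothetical passage in which the same RRID is cited twice.
text = "antibody (RRID:AB_123456) at 1:500; antibody (RRID:AB_123456) at 1:1000"
hits = occurrences(text, "RRID:AB_123456")

# Both hits share prefix and exact; only suffix and start offset differ.
same_prefix = hits[0]["prefix"] == hits[1]["prefix"]
different_suffix = hits[0]["suffix"] != hits[1]["suffix"]
different_start = hits[0]["start"] != hits[1]["start"]
```

A position-based selector carries the `start` offset, which is the piece of information that separates the two targets here.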

One solution would be to run SciBot in the web page, where it would have DOM access and could reuse the Hypothesis anchoring libraries. In the near term that would require a rewrite to JavaScript, which makes it a nonstarter. In the longer term it's possible that WebAssembly will enable packaging the existing Python-based code into a form usable in the browser, and that's worth bearing in mind.

The other solution would be to replicate, in the Python-based SciBot code, the selectors produced by the Hypothesis JS-based anchoring machinery. There are two possibilities here: match the TextPositionSelector that Hypothesis produces, or match the RangeSelector (XPath) that Hypothesis produces. I'd be willing to investigate the feasibility of these strategies.

tgbugs · 2019-06-06T20:45:45Z

Huh, 1022 explains a lot.

How much infrastructure would we need to have the bookmarklet load a helper script from a static URL, so that the bookmarklet stays the same but we can add functionality like this? I'm thinking a single additional endpoint? Any known CORS issues with loading a remote script from a bookmarklet? Also, do we need the full rendered DOM to be able to get the XPaths or can we extract them from document.body.innerHTML? A problem I see with that approach would be mapping the ids found in the inner text back onto the innerHTML in cases where some markup splits an id (which is now quite frequent due to journals having completely whiffed on the typesetting ...).

WebAssembly is on my radar, though taking a look around I found https://github.com/iodide-project/pyodide which is ... not reassuring with regard to the current complexity of the setup required. We would have to evaluate the time tradeoffs between working on that and a complete rewrite.

judell · 2019-06-06T20:59:26Z

> How much infrastructure would we need to have the bookmarklet load a helper script from a static url, so that the bookmarklet stays the same but we can add functionality like this?

That's a separate question to which the answer I think is "just do it" :-) There are only a handful of curators who have installed the bookmarklet, right? A one-time upgrade to a bookmarklet that's a stub pointing to malleable code is a pretty small intervention.

> Any known CORS issues with loading a remote script from a bookmarklet?

The possible issue is CSP (Content Security Policy). I'm not aware that any of our target sites enforce CSP. If any do, the fallback would be to package the thing in a dirt-simple Chrome extension.

> do we need the full rendered DOM to be able to get the XPaths or can we extract them from document.body.innerHTML?

It's ideal to operate in DOM context using the same code Hypothesis (and compatible clients) use, all based on the common anchoring libraries.

That said, it may be easy to match TextPositionSelector by stripping markup from the innerText you get and marking positions in the stream of characters. In principle it seems possible to easily match the TextPositionSelectors that the Hypothesis client produces. In practice we'll just have to try and see what happens.
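A minimal sketch of the "strip markup and count characters" idea, assuming SciBot has the page's HTML source on the Python side. The class and function names are hypothetical, and the standard-library `html.parser` only approximates what a browser returns for `document.body.textContent`: a real DOM can differ from the source (parser fix-ups, script-inserted nodes), so offsets would need checking against the client.

```python
from html.parser import HTMLParser


class BodyTextCollector(HTMLParser):
    """Concatenate text inside <body> in document order.

    Approximation of body.textContent: comments are skipped, character
    references are decoded, and script/style text is kept.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.chunks.append(data)


def text_stream(html):
    """Return the markup-free character stream for an HTML document."""
    parser = BodyTextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# An id split by inline markup becomes contiguous in the stream.
sample = "<html><head><title>T</title></head><body><p>See RRID:<b>AB</b>_123 here.</p></body></html>"
stream = text_stream(sample)
offset = stream.find("RRID:AB_123")
```

Because the markup is gone, an id that a journal's typesetting split across elements can be matched as one string, and its offset in the stream is a candidate for `TextPositionSelector.start`.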

judell · 2019-06-07T21:46:43Z

It looks like the following will work.

1. Send document.body.textContent instead of document.body.innerText
2. Use the start of the RRID match in the textContent stream as TextPosition.start
3. Use TextPosition.start + length of RRID match as TextPosition.end
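The steps above can be sketched in Python as follows. `RRID_PATTERN` is a simplified, hypothetical pattern and not SciBot's actual matching logic, and the function name is made up. One caveat worth testing: Python string offsets count code points, while JavaScript string offsets count UTF-16 code units, so pages containing characters outside the Basic Multilingual Plane could yield different numbers.

```python
import re

# Simplified stand-in for RRID matching; SciBot's real pattern may differ.
RRID_PATTERN = re.compile(r"RRID:[A-Za-z]+_[A-Za-z0-9]+")


def text_position_selectors(text_content):
    """Build one TextPositionSelector per RRID match in the text stream."""
    selectors = []
    for match in RRID_PATTERN.finditer(text_content):
        start = match.start()                # start of the RRID match
        end = start + len(match.group(0))    # start + length of the match
        selectors.append(
            {"type": "TextPositionSelector", "start": start, "end": end}
        )
    return selectors


text_content = "Mouse (RRID:AB_123456) and rat (RRID:AB_654321)."
selectors = text_position_selectors(text_content)
first = selectors[0]
matched = text_content[first["start"]:first["end"]]
```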

I have verified that:

a) with Range (XPath) anchoring turned off, the Hypothesis client will anchor a case like hypothesis/product-backlog#1022 when it has both TextQuote and TextPosition

b) the start of an RRID match in the textContent stream does match the TextPosition.start created by the Hypothesis client

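It would, of course, be a major change for SciBot to be looking at document.body.textContent (the text of every node, including script and style contents and hidden elements, with no rendering-based whitespace handling) vs document.body.innerText (just the rendered text), so this would require some testing and sanity-checking.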

I'll take a crack at making a demo that illustrates how, given the textContent of a web page, to create Hypothesis-compatible selectors for both TextQuote and TextPosition.
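As a rough idea of what such a demo might produce, here is a sketch that builds both selectors from a text stream and a pair of offsets. The field names follow the W3C Web Annotation Data Model; the `build_selectors` name and the 32-character context default are assumptions, not something confirmed in this thread.

```python
CONTEXT_CHARS = 32  # assumed context length for prefix/suffix


def build_selectors(text_content, start, end, context=CONTEXT_CHARS):
    """Return TextQuote and TextPosition selectors for text_content[start:end]."""
    return [
        {
            "type": "TextQuoteSelector",
            "exact": text_content[start:end],
            "prefix": text_content[max(0, start - context):start],
            "suffix": text_content[end:end + context],
        },
        {"type": "TextPositionSelector", "start": start, "end": end},
    ]


text_content = "aaa RRID:AB_1 bbb"
start = text_content.index("RRID:AB_1")
end = start + len("RRID:AB_1")
quote, position = build_selectors(text_content, start, end, context=4)
```

The resulting list would go in the `selector` field of an annotation target, so the client has both the quote and the position to anchor with.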
