Feature request: recursive import of a corpus #30

Conal-Tuohy · 2023-06-26T04:00:07Z

At present Voyant allows you to import a corpus in an XML format from a single URL or from a list of URLs.

I'd like to suggest adding the ability to ingest from a "linked list" of URLs, where the user provides a single URL and the remaining URLs are retrieved recursively: the resource which Voyant retrieves from the first URL contains a link to the second "page" of text, which contains a link to a third page, and so on, until the final resource contains no further links.

The user would need to be able to provide one additional XPath parameter (called e.g. Next or similar) when importing the corpus, to identify an element or attribute in the XML data which contains a link to the next page. For example, in a corpus of TEI elements contained in a teiCorpus wrapper element, the teiCorpus element can bear a next attribute whose semantics are defined in this way. So the default XPath expression for a TEI import could be //*[local-name()='teiCorpus']/@next.

This kind of approach would work for other XML formats such as Atom, which has link elements for this purpose e.g. <link rel="next" href="http://example.org/index.atom?page=2"/>
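To make the idea concrete, here is a rough Python sketch of the retrieval loop, not a description of how Voyant or Trombone would implement it. The names collect_pages, tei_next, atom_next and max_pages are hypothetical. Because the Python standard library's ElementTree supports only a subset of XPath (it cannot evaluate local-name() or select attributes), the sketch stands in for the XPath parameter with small finder functions that do the equivalent lookup. It also follows the links in a loop rather than by actual recursion, and guards against cycles, which is my own assumption about what a real importer would want.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin
from urllib.request import urlopen

ATOM_NS = "http://www.w3.org/2005/Atom"


def fetch(url):
    """Retrieve the raw bytes at url (hypothetical default fetcher)."""
    with urlopen(url) as response:
        return response.read()


def tei_next(root):
    """Stand-in for //*[local-name()='teiCorpus']/@next."""
    for element in root.iter():
        # Strip any "{namespace}" prefix to compare local names only.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "teiCorpus" and element.get("next"):
            return element.get("next")
    return None


def atom_next(root):
    """Return the href of the first Atom <link rel="next">, if any."""
    for link in root.iter(f"{{{ATOM_NS}}}link"):
        if link.get("rel") == "next":
            return link.get("href")
    return None


def collect_pages(start_url, find_next, fetch=fetch, max_pages=1000):
    """Follow "next" links from start_url until a page has none.

    Returns a list of (url, raw_bytes) pairs in retrieval order.
    Stops early on a repeated URL or after max_pages pages
    (both limits are assumptions, not part of the request).
    """
    pages = []
    seen = set()
    url = start_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        data = fetch(url)
        pages.append((url, data))
        link = find_next(ET.fromstring(data))
        # Resolve relative links against the page they came from.
        url = urljoin(url, link) if link else None
    return pages
```

With something like this, a TEI import would call collect_pages(first_url, tei_next) and an Atom import would call collect_pages(first_url, atom_next); a real implementation with a full XPath engine could accept the user's expression directly instead of a finder function.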


Conal-Tuohy · 2023-06-26T05:48:12Z

Maybe this issue belongs on the Trombone repo? Apologies if so.

ajmacdonald added the enhancement label Jun 26, 2023

ajmacdonald transferred this issue from voyanttools/VoyantServer Jun 26, 2023
