-
Thanks for contributing and for these interesting questions, @bmschmidt. From the technical point of view, a Hugging Face Dataset is backed by an Apache Arrow table. Normally end users do not need to access these inner details (it is the library's job to hide all that complexity). For example, the Arrow table schema (column names and data types) is defined with an abstraction called Features.
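As a purely illustrative sketch (my own, not code from this thread; the field names are hypothetical), a `Features` object declares column names and types, and the library derives the underlying Arrow schema from it:

```python
from datasets import Features, Value, ClassLabel, Sequence

# Hypothetical schema for a newspaper-style text dataset with Dublin Core-ish fields.
features = Features(
    {
        "text": Value("string"),
        "dc:language": ClassLabel(names=["de", "en", "fr"]),  # could also stay a plain Value("string")
        "dc:issued": Value("date32"),                         # or Value("string") holding ISO 8601 dates
        "dc:subject": Sequence(Value("string")),
    }
)

# The Arrow schema the library builds from this declaration:
print(features.arrow_schema)
```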
Please also note that a Dataset is generated from source data files, and there is no limit on their size.
Your domain-specific questions about LAM datasets may be better answered by some other specialist. Some technical answers:
-
Thanks for asking the question -- the tl;dr is that there probably isn't a single answer for this at the moment. I think your blog post gets at one of the main motivations of this effort, i.e. making these datasets easier to work with for 'computational stuff' (and also making it possible to do this without having to rent VMs just to look at the data...). The issue is that there are different potential variants of this 'computational stuff', so the information we want to retain from what can be verbose upstream data formats varies quite a bit.

As you mention in your blog post, the data shared by LAMs is often in some flavour of XML which contains a level of information we often don't care about, i.e. the coordinates of each predicted OCR word/token. This information might be nice to have for some use cases but likely isn't relevant for many machine learning/text-mining activities. The information we do often want to retain is the other metadata associated with items. Things like the year of publication may be very useful for potential users of the dataset to filter on, but it's difficult to be exhaustive about this. For some work we did to make some BL data more accessible (https://github.com/davanstrien/digitised-books-ocr-and-metadata) we landed on focusing on the text content at a page level and pulling in the majority of metadata fields related to the item/source (but not all of the verbose metadata about the processing steps that were taken to produce that ALTO). In your case I would probably keep the majority of the fields you mentioned and map them onto dataset columns.

In terms of the practicalities, there are different approaches that could be used to share these datasets. One of the issues you mention in your blog is the speed of parsing XML files (and also the challenge of doing this in a streaming manner when information sometimes needs to be combined across multiple files). My own feeling is that for these larger datasets it makes sense to do this heavy conversion process only once. Whilst it's possible to write a dataset that loads from the XML files directly, this often takes a very long time and doesn't make for a nice end-user experience. There are a few approaches I think can work to avoid this (sketched in the example at the end of this comment):

- Convert the source datasets into a nicer format for streaming. This is what we did in https://github.com/davanstrien/digitised-books-ocr-and-metadata, which converts XML to compressed JSONL files that are much easier to stream. This JSONL data is then used as the source data for https://huggingface.co/datasets/blbooks, and it can also serve as the source data for a loading script (which is much quicker than the equivalent script would be if it had to parse the original XML). This might be something LAMs consider doing to make their data more accessible for computational work.
- Push the parsed dataset directly to the hub. In the case of your blog, it is probably possible to either directly push the arrow data you created to the hub, or alternatively, what I sometimes do is write a loading script that does all the XML parsing and then use `push_to_hub`. The main reason I can see this second option working well sometimes is that the underlying source data may not be static, and this offers a method for easily pushing an updated version of the dataset to the Hugging Face hub.

Sorry, that was a slightly longer answer.
If you're interested, I'd be happy to think a bit more about the best way to do this with the Austrian National Library newspapers.
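Here is a rough sketch of the two approaches described above. It assumes ALTO-style XML; the paths, tag handling and repo id are hypothetical, and real ALTO/METS parsing would need more care than this:

```python
import gzip
import json
import pathlib
import xml.etree.ElementTree as ET

from datasets import load_dataset


def xml_to_jsonl(xml_dir: str, out_path: str) -> None:
    """One-off heavy conversion: ALTO-ish XML -> compressed JSONL that streams easily."""
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for xml_file in sorted(pathlib.Path(xml_dir).glob("*.xml")):
            root = ET.parse(xml_file).getroot()
            # Keep page-level text (and whatever item metadata you need); drop
            # per-word coordinates and the verbose processing metadata.
            text = " ".join(
                el.attrib.get("CONTENT", "")
                for el in root.iter()
                if el.tag.endswith("String")
            )
            out.write(json.dumps({"id": xml_file.stem, "text": text}, ensure_ascii=False) + "\n")


# Do the conversion once, then push the resulting dataset straight to the Hub.
xml_to_jsonl("alto_xml/", "pages.jsonl.gz")
ds = load_dataset("json", data_files="pages.jsonl.gz", split="train")
ds.push_to_hub("your-username/newspaper-pages")  # hypothetical repo id
```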
-
What would that script do, do you think?
-
Hi folks,
I'm intrigued by this project because I've put a lot of time over the last couple of years into thinking about Apache Arrow-based formats for distributing text collections with metadata. (For example, here's a blog post about ingesting a 60GB set of newspapers from the Austrian National Library.) But I feel like I'm missing something a little basic here--is there a description of what "dataset" means in the huggingface sense? Like, I have the sense that it's a set of arrow tables... But most library datasets I know are pretty deeply structured, and include at least Dublin Core terms if not full-on linked open data definitions.
Maybe one way to ask this is just sharing the Arrow schema for one of the tables in that Austrian newspaper set (a rough sketch of what I mean follows the list below). I'm sure I could contribute this pretty easily. But are there standards or best practices for, e.g.:
- whether 'dc:language' should be a category or a string?
- whether 'dc:issued' is an ISO 8601 string, an arrow date, or just who cares?
- whether the 'dc' prefix is a good thing, a bad thing, or 'who cares'?
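For concreteness, here is one purely hypothetical way such a schema could look in pyarrow -- the column names and type choices are just guesses, which is exactly what I'd like guidance on:

```python
import pyarrow as pa

# One of many possible answers, just to make the trade-offs concrete.
schema = pa.schema(
    [
        ("text", pa.string()),
        ("dc:language", pa.dictionary(pa.int32(), pa.string())),  # category-style encoding vs. plain string
        ("dc:issued", pa.date32()),                                # Arrow date vs. ISO 8601 string
        ("dc:title", pa.string()),
    ]
)
print(schema)
```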
And how big can these things be? Like this one is only about 1GB. But the Hathi Trust Extracted Features from the Illinois libraries is like 2-4TB--I assume that's too big to just jam into huggingface?