-
Thanks for contributing and for these interesting questions, @bmschmidt. From the technical point of view, a Hugging Face Dataset is backed by an Apache Arrow table. Normally end users do not need to access these inner details (it is the library's job to hide all that complexity). For example, the Arrow table schema (column names and data types) is defined with an abstraction called Features.
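As a purely illustrative sketch (my own, not code from this thread; the field names are hypothetical), a `Features` object declares column names and types, and the library derives the underlying Arrow schema from it:

```python
from datasets import Features, Value, ClassLabel, Sequence

# Hypothetical schema for a newspaper-style text dataset with Dublin Core-ish fields.
features = Features(
    {
        "text": Value("string"),
        "dc:language": ClassLabel(names=["de", "en", "fr"]),  # could also stay a plain Value("string")
        "dc:issued": Value("date32"),                         # or Value("string") holding ISO 8601 dates
        "dc:subject": Sequence(Value("string")),
    }
)

# The Arrow schema the library builds from this declaration:
print(features.arrow_schema)
```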
Please also note that a Dataset is generated from source data files, and there is no limit on their size.
Your domain-specific questions about LAM datasets may be better answered by some other specialist. Some technical answers:
-
Thanks for asking the question -- the tl;dr is that there probably isn't a single answer for this at the moment. I think your blog post gets at one of the main motivations of this effort, i.e. making these datasets easier to work with for 'computational stuff' (and also making it possible to do this without having to rent VMs just to look at the data...). The issue is that there are different potential variants of this 'computational stuff', so the information we want to retain from what can be verbose upstream data formats varies quite a bit.

As you mention in your blog post, the data shared by LAMs is often in some flavour of XML which contains a level of information we often don't care about, i.e. the coordinates of each predicted OCR word/token. This information might be nice to have for some use cases but likely isn't relevant for many machine learning/text-mining activities. The information we do often want to retain is the other metadata associated with items. Things like the year of publication may be very useful for potential users of the dataset to filter on, but it's difficult to be exhaustive about this. For some work we did to make some BL data more accessible (https://github.com/davanstrien/digitised-books-ocr-and-metadata) we landed on focusing on the text content at a page level and pulling in the majority of metadata fields related to the item/source (but not all of the verbose metadata about the processing steps that were taken to produce that ALTO). In your case I would probably keep the majority of the fields you mentioned and map them onto dataset columns.

In terms of the practicalities, there are different approaches that could be used to share these datasets. One of the issues you mention in your blog is the speed of parsing XML files (and also the challenge of doing this in a streaming manner when information sometimes needs to be combined across multiple files). My own feeling is that for these larger datasets it makes sense to do this heavy conversion process only once. Whilst it's possible to write a dataset that loads from the XML files directly, this often takes a very long time and doesn't make for a nice end-user experience. There are a few approaches I think can work to avoid this (sketched in the example at the end of this comment):

- Convert the source datasets into a nicer format for streaming. This is what we did in https://github.com/davanstrien/digitised-books-ocr-and-metadata, which converts XML to compressed JSONL files that are much easier to stream. This JSONL data is then used as the source data for https://huggingface.co/datasets/blbooks, and it can also serve as the source data for a loading script (which is much quicker than the equivalent script would be if it had to parse the original XML). This might be something LAMs consider doing to make their data more accessible for computational work.
- Push the parsed dataset directly to the hub. In the case of your blog, it is probably possible to either directly push the arrow data you created to the hub, or alternatively, what I sometimes do is write a loading script that does all the XML parsing and then use `push_to_hub`. The main reason I can see this second option working well sometimes is that the underlying source data may not be static, and this offers a method for easily pushing an updated version of the dataset to the Hugging Face hub.

Sorry, that was a slightly longer answer.
If you're interested, I'd be happy to think a bit more about the best way to do this with the Austrian National Library newspapers.
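Here is a rough sketch of the two approaches described above. It assumes ALTO-style XML; the paths, tag handling and repo id are hypothetical, and real ALTO/METS parsing would need more care than this:

```python
import gzip
import json
import pathlib
import xml.etree.ElementTree as ET

from datasets import load_dataset


def xml_to_jsonl(xml_dir: str, out_path: str) -> None:
    """One-off heavy conversion: ALTO-ish XML -> compressed JSONL that streams easily."""
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for xml_file in sorted(pathlib.Path(xml_dir).glob("*.xml")):
            root = ET.parse(xml_file).getroot()
            # Keep page-level text (and whatever item metadata you need); drop
            # per-word coordinates and the verbose processing metadata.
            text = " ".join(
                el.attrib.get("CONTENT", "")
                for el in root.iter()
                if el.tag.endswith("String")
            )
            out.write(json.dumps({"id": xml_file.stem, "text": text}, ensure_ascii=False) + "\n")


# Do the conversion once, then push the resulting dataset straight to the Hub.
xml_to_jsonl("alto_xml/", "pages.jsonl.gz")
ds = load_dataset("json", data_files="pages.jsonl.gz", split="train")
ds.push_to_hub("your-username/newspaper-pages")  # hypothetical repo id
```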
-
What would that script do, do you think?
-
Hi folks,
I'm intrigued by this project because I've put a lot of time over the last couple of years into thinking about Apache Arrow-based formats for distributing text collections with metadata. (For example, here's a blog post about ingesting a 60GB set of newspapers from the Austrian National Library.) But I feel like I'm missing something a little basic here--is there a description of what "dataset" means in the huggingface sense? Like, I have the sense that it's a set of arrow tables... But most library datasets I know are pretty deeply structured, and include at least Dublin Core terms if not full-on linked open data definitions.
Maybe one way to ask this is just sharing the Arrow schema for one of the tables in that Austrian newspaper set (a rough sketch of what I mean follows the list below). I'm sure I could contribute this pretty easily. But are there standards or best practices for, e.g.:
- whether 'dc:language' should be a category or a string?
- whether 'dc:issued' is an ISO 8601 string, an arrow date, or just who cares?
- whether the 'dc' prefix is a good thing, a bad thing, or 'who cares'?
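For concreteness, here is one purely hypothetical way such a schema could look in pyarrow -- the column names and type choices are just guesses, which is exactly what I'd like guidance on:

```python
import pyarrow as pa

# One of many possible answers, just to make the trade-offs concrete.
schema = pa.schema(
    [
        ("text", pa.string()),
        ("dc:language", pa.dictionary(pa.int32(), pa.string())),  # category-style encoding vs. plain string
        ("dc:issued", pa.date32()),                                # Arrow date vs. ISO 8601 string
        ("dc:title", pa.string()),
    ]
)
print(schema)
```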
And how big can these things be? Like this one is only about 1GB. But the Hathi Trust Extracted Features from the Illinois libraries is like 2-4TB--I assume that's too big to just jam into huggingface?