🤗 Hugging Face x 🌸 BigScience initiative to create an open-source, community resource of LAM (Libraries, Archives, and Museums) datasets.
BigScience 🌸 is an open scientific collaboration of nearly 600 researchers from 50 countries and 250 institutions who work together on various projects within the natural language processing (NLP) space to broaden the accessibility of language datasets while tackling challenging scientific questions around training language models.
We are running a datasets hackathon focused on making data from Libraries, Archives, and Museums (LAMs) with potential machine learning applications accessible via the Hugging Face Hub. You might also know this field as 'GLAM': galleries, libraries, archives, and museums.
We are doing this to help make these datasets more discoverable, open them up to new audiences, and help ensure that machine learning datasets more closely reflect the richness of human culture.
We aim to enable easy discovery and programmatic access to these datasets using Hugging Face's 🤗 Datasets Hub. As part of this, we want to:
- Identify datasets that would be useful to have more easily accessible
- Make these datasets available via the Datasets Hub
- Document these datasets
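As a minimal sketch of the kind of programmatic access we have in mind: once a dataset is on the Hub, it can be loaded with a couple of lines of Python using the 🤗 Datasets library (the repository id `biglam/example-dataset` below is a hypothetical placeholder):

```python
from datasets import load_dataset

# Load a LAM dataset hosted on the Hugging Face Hub.
# "biglam/example-dataset" is a hypothetical placeholder id.
dataset = load_dataset("biglam/example-dataset", split="train")

# Inspect the schema and a first record to get a feel for the data.
print(dataset.features)
print(dataset[0])
```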
Some of the reasons we think that this effort is important:
- There is a growing interest in using machine learning with LAM materials[^1]. The limited availability of datasets is one of the barriers to this effort; making existing datasets suitable for machine learning more discoverable and easily accessible[^2] will help reduce this barrier.
- LAMs hold interesting data that we believe is currently underutilized by the broader machine learning ecosystem.
- LAMs have the potential to play a positive role in making the development, sharing, and preservation of machine learning datasets more responsible (see Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning). We want this hackathon to help develop practices that we believe can positively impact the machine learning ecosystem.
There is a growing interest in using language models with historical texts[^3]. Although we are not only focused on collecting datasets for this purpose, we hope that some of the materials we gather as part of this sprint will be helpful in efforts to train language models on historical text data.
There are a few ways to contribute to the hackathon:
- ✨ Suggesting datasets that might be of interest (see the Wiki for guidance on the kinds of data we're interested in)
- 🤗 Making those datasets available via the Hugging Face Hub (see the sketch after this list)
- 🤳🏾 Inviting institutions with open datasets to join the hackathon
- 📝 Documenting datasets by adding additional metadata and working on the data cards for those datasets
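As a rough sketch of what making a dataset available can look like with the 🤗 Datasets library (the CSV file name and repository id below are hypothetical placeholders), assuming you are logged in via `huggingface-cli login`:

```python
from datasets import load_dataset

# Load a local export (here a hypothetical CSV file from a LAM catalogue).
dataset = load_dataset("csv", data_files="lam_catalogue_export.csv")

# Push the dataset to a (hypothetical) repository under the BigLAM organization.
dataset.push_to_hub("biglam/example-dataset")
```

The data card documenting the dataset can then be edited directly in the repository's README.md on the Hub.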
To join the hackathon, start by introducing yourself in discussion #19 on our GitHub discussion board.
Once you have said hi on the discussion board, you should request to join the BigLAM Hugging Face organization.
For guidance, please check out the Wiki.
If you have questions:
- first, check out the FAQs
- if you don't find the answer in the FAQs, please ask on the discussion board
Initially, we plan to run the hackathon until ~~August 19th 2022~~ the end of October 2022.
[^1]: See, for example, https://sites.google.com/view/ai4lam
[^2]: R. Cordell, 'Machine Learning + Libraries', LC Labs. Accessed: Mar. 28, 2021. [Online]. Available: https://labs.loc.gov/static/labs/work/reports/Cordell-LOC-ML-report.pdf, p. 34
[^3]: Schweter, S., März, L., Schmid, K., & Çano, E. (2022). hmBERT: Historical Multilingual Language Models for Named Entity Recognition. arXiv, abs/2205.15575; Manjavacas, E., & Fonteyn, L. (2022). Adapting vs. Pre-training Language Models for Historical Languages. Journal of Data Mining & Digital Humanities.