🌸Joining the BigLAM hackathon🌸 #19
-
This is great! LAMs have so much good (curated) data with open licenses but not always simple APIs or UIs, so it would be great to have them on the Hub! I have only recently discovered the power of museum data while adding this for a hackathon: https://huggingface.co/datasets/ceyda/smithsonian_butterflies. I will do a better job with the dataset card tomorrow.
-
Hi! I'm Shamik Bose. I got my PhD in making deep learning systems more transparent and explainable. I have been involved with the BigScience workshop for data creation and also with the biomedical hackathon by contributing a number of datasets and bug fixes. I would love to contribute to this effort as well. My Hugging Face Hub username is 'shamikbose89'.
-
Hi 👋 My name is Marianna. I am a Machine Learning Engineer and I would love to help you with this!
-
Hi everyone!
-
Hi everyone, I am Semih. I have several years of ML experience, but this field moves so fast that I still feel like a beginner, and I have always been a LAM nerd :D Would love to join and contribute! My HF username is skorkmaz88.
-
Hello, my name is Andy Janco and I am a research software engineer at the University of Pennsylvania. I'm a big fan of 🤗 and look forward to the hackathon!
-
Hi All, I'm recruiting ML researchers for related projects, so if this sounds interesting, get in touch! My Hugging Face profile is https://huggingface.co/arihers
-
Hey, I am Zaid, a PhD student at KFUPM in Saudi Arabia. I co-founded ARBML, an effort to democratize Arabic NLP, and co-authored Masader, a catalogue of Arabic NLP datasets, at BigScience; we can use its metadata to extract datasets related to LAM. I thought this effort might be relevant to us at ARBML. We have a community on Discord, so I will share this there. My HF name is Zaid.
-
I would love to join as an opportunity to get more involved in machine learning + LAM. Currently I am a Digital Humanities Research Software Engineer at the British Library and The Alan Turing Institute, working on the Living with Machines research project, which combines large-scale digitised historical collections with data science and computational methods. I'm not very active on Hugging Face, but here I am :) https://huggingface.co/kallewesterling
-
Hello! I'm currently the Digital Curator for Western Heritage Collections at the British Library, and a Co-Investigator on the Living with Machines project. I tend to work on access to digitised collections, including UX and audience research, with a focus on participatory digital access (basically, crowdsourcing for cultural heritage). According to the edit dates on http://museum-api.pbworks.com/w/page/21933420/Museum%C2%A0APIs, I maintained a list of 'museum, gallery, library, archive, archaeology and assorted sources for machine-readable data' for 12 or so years, and I'd love to know how many of those resources might be suitable, and how best to contact the institutions to let them know about Hugging Face and what it means for them if their data is available on it. This seems like a great way for institutions to have some involvement with machine learning without the overhead of figuring it out for themselves.
-
Hi,
-
Hi everyone, this is Ayesha, Data Science Manager @CancerClarity USA. I have been working in the AI field for more than 5 years and have several international publications. Would love to contribute. Thanks
-
Hello! I am @epoz, the creator and maintainer of https://iconclass.org/. The current Iconclass search interface also supports visual similarity searches using CLIP. I am still working on adding more multi-modal searches, but I am very excited about the results already. Looking forward to hearing more from others interested in AI, Art & Iconography.
-
Hi all, I'm Ben Schmidt; I work at NYU doing history and digital humanities. (At least for the time being! Let me know if anyone's looking for a hired gun...) I've worked a lot with large textual corpora in the US, especially at the HathiTrust and Internet Archive. I spent some time last year sketching out a format for sharing texts and metadata, especially for nonconsumptive research, that I'm looking to develop. From what I see there are a lot of homologies between that and the format used on HF, so there are a number of text collections that I might be able to ingest quickly. I'm especially interested in sharing feature representations (token counts, learned embeddings, and random projections) for data that can't be shared publicly for copyright or other reasons. I'm also, FWIW, obsessed with the possibilities for visualization of historical texts with metadata. I just set up a shiny new HF username at https://huggingface.co/benmschmidt.
-
Hi All,
-
Hello, my name is Nabeel Siddiqui, and I am an Assistant Professor of Digital Media at Susquehanna University. I work on a variety of topics around the history of computing and cultural analytics. This summer is fairly busy for me, but I would love to be involved and help where I can. I have already sent a request to the BigLAM Hugging Face group with the same username I use on GitHub. I am fairly good at cleaning up numerical datasets, joining datasets together, etc. I usually work in R. I would also be happy to run a 4CAT instance if someone has an idea for social media data that might be useful, or perhaps image datasets online. In short, I am happy to do the work on any dataset if someone comes up with the idea lol. Finally, if anyone is interested, I would love to collaborate to publish a dataset for the Journal of Open Humanities Data. I really appreciate their initiative and am using some of their datasets for a book I am working on. This seems like a good opportunity to contribute to that project since we are already creating datasets. Looking forward to working with everyone.
-
Hi all! My name is Théo Gigant (@gigant on Hugging Face), and I just started my PhD in multimodal deep learning. I previously contributed the WikiArt dataset to the Hub as part of the HugGAN sprint, and I also contributed to BigScience via the BigBio hackathon; I look forward to working on open-source projects again! I haven't really checked the datasets yet, but if it's relevant for the scope, I would like to add the illustrations from https://www.oldbookillustrations.com/ to the Hub. The task is already halfway done since I have written a scraping script to get all the info + images for the illustrations.
-
Hi everyone, my name is Clemens Neudecker and I lead a small research team working on AI/ML for digital cultural heritage at the Berlin State Library. We have created a number of AI/ML-based tools and corresponding datasets/models in our Qurator project; see an overview here: https://ravius.sbb.berlin. I am already on Hugging Face as https://huggingface.co/cneud and have experimented with (but not completed) distributing some of our project outputs via the Hugging Face Hub (see https://huggingface.co/SBB). Additionally, I would like to look into the possibility of sharing some datasets from our SBB-LAB or the EuropeanaNewspapers project.
-
Hi All, I know @davanstrien from the AI4LAM organization, which I encourage everyone here to check out. Let me know if you have any questions about that!
-
My name is Eric Morgan (@ericleasemorgan). I work at the University of Notre Dame in a digital scholarship center. My most recent research interest surrounds distant reading, and thus I support a Python-based library used to create data sets from (almost) arbitrarily large sets of textual corpora. The library is called the Distant Reader Toolbox. Put almost any number of files of any type into a directory, point the Toolbox at the directory, and allow the Toolbox to do its work. The result is a platform- and network-independent file system filled with: 1) the original data, 2) plain text versions of the original data, 3) tabular representations of extracted features, and 4) a relational database file. I call the resulting data sets "study carrels" -- you know, those little spaces in libraries where one can bring their items of interest, arrange them on shelves, and use them for study.

Given a study carrel -- a data set which is easy to compute against -- the command-line interface to the Toolbox supports all sorts of different modeling processes: ngram analysis, concordancing, collocations, a few different clustering techniques, full-text indexing, semantic indexing, part-of-speech analysis, named-entity analysis, and information extraction through the use of grammars. In the end, the Toolbox is intended to allow students, researchers, or scholars to articulate a research question, assemble a set of texts which might answer the question, and then compute against the texts to address the question. For example, a research question might be "How did Ralph Waldo Emerson define what it means to be a man?"

Finally, I have created a set of more than 3,000 study carrels, and they are available as a sort of collection at http://library.distantreader.org. Many of the study carrels surround topics from the humanities -- love, honor, justice, etc. -- and much of their content comes from Project Gutenberg, with other portions from the HathiTrust and so on. Another large portion of the study carrels surrounds the topic of COVID-19 and was originally disseminated through a project called CORD-19. In short, I have a whole lot of pre-computed datasets, and they are amenable to additional computation by just about any other platform. To contribute, what should I do next?
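(To make the "tabular representations of extracted features" idea concrete, here is a rough, purely illustrative sketch of the simplest such table -- token counts per plain-text file. This is not the Toolbox's own code, and the directory and file names are hypothetical.)

```python
# Illustrative sketch only, not the Distant Reader Toolbox's own code:
# build one very simple "extracted feature" table (token counts per file)
# of the kind a study carrel bundles alongside the plain-text files.
import csv
import pathlib
import re
from collections import Counter


def token_counts(carrel_txt_dir: str, out_csv: str) -> None:
    """Write a (file, token, count) table for every .txt file in a directory."""
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "token", "count"])
        for path in sorted(pathlib.Path(carrel_txt_dir).glob("*.txt")):
            tokens = re.findall(r"[a-z]+", path.read_text(encoding="utf-8").lower())
            for token, count in Counter(tokens).most_common():
                writer.writerow([path.name, token, count])


# Example use (paths are hypothetical):
# token_counts("homer/txt", "homer/features/tokens.csv")
```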
-
Hi! I'm Sarah Ciston (@sarahciston on Hugging Face), a PhD Candidate at USC Media Arts + Practice researching how to bring intersectional feminist, anti-racist, queer approaches to AI/ML and how to work critically with datasets. I've been collaborating with USC Libraries and researchers here on more inclusive collections-as-datasets. I'm excited to see the BigLAM initiative, and I'd love to be involved!
-
On Jul 29, 2022, at 12:20 PM, Daniel van Strien wrote:
Distant Reader sounds like a super exciting project! I will look closely at the links you sent, but I think there are a few potential ways you could share this material. Since you already have a library of material you have prepared, it might make the most sense to consider creating a dataset loading script [1] that can load any of the study carrels in the library. This would then allow someone to load a dataset based on one of these libraries:
ds = load_dataset('load_distant_reader', name='homer')
I think the main question to consider is what overlap (if any) you want with the existing behaviour of the Python library. The datasets library (https://huggingface.co/docs/datasets/) is primarily focused on supporting machine learning workflows, so there are likely to be some things that your library does that datasets won't do well. Essentially, when you write a loading script, you get the data to fit some set of features (https://huggingface.co/docs/datasets/about_dataset_features) you specify; under the hood, these map to Apache Arrow datatypes.
It might already be useful to write a loading script that can load any of the library's items, returning the previously cleaned texts along with metadata about each text?
[1] https://huggingface.co/docs/datasets/dataset_script
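(For illustration: a rough sketch of the kind of loading script described above might look like the following. This is not the Distant Reader's actual script -- the download URL, file layout, and features are hypothetical, and a single train split is used since no train/test division is required.)

```python
# Illustrative sketch only -- the dataset name, download URL, file layout,
# and features below are hypothetical, not the Distant Reader's real ones.
import pathlib

import datasets


class DistantReaderCarrel(datasets.GeneratorBasedBuilder):
    """Hypothetical loader for one study carrel's cleaned plain-text files."""

    def _info(self):
        # Declare the features (columns) each example will have.
        return datasets.DatasetInfo(
            description="Plain texts plus minimal metadata from one study carrel.",
            features=datasets.Features(
                {
                    "id": datasets.Value("string"),
                    "text": datasets.Value("string"),
                    "title": datasets.Value("string"),
                }
            ),
        )

    def _split_generators(self, dl_manager):
        # A single split is fine; no train/test division is required.
        data_dir = dl_manager.download_and_extract("https://example.org/carrels/homer.zip")
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"data_dir": data_dir},
            )
        ]

    def _generate_examples(self, data_dir):
        # Walk the extracted carrel and yield one example per plain-text file.
        for i, path in enumerate(sorted(pathlib.Path(data_dir).glob("txt/*.txt"))):
            yield i, {
                "id": path.stem,
                "text": path.read_text(encoding="utf-8"),
                "title": path.stem,  # real metadata would come from the carrel's database
            }
```

With a script like that on the Hub, load_dataset(...) returns the cleaned texts as an Arrow-backed dataset; supporting name='homer' versus other carrels would mean adding one BuilderConfig per carrel, or a single config parameterised by the carrel name.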
Daniel, thank you for the prompt reply, and likewise, very interesting.
Just to reiterate, the next step is for me to write a little Python script with a specific name and in a specific location. The script is really a set of classes and objects denoting the features of my data sets. I then run another script that looks for the script I just wrote, and it will output data files of some sort, and those data files will go to 🌸? Finally, somebody will then be able to run something like the above (ds = load_dataset('load_distant_reader', name='homer')) to actually load my data sets. Correct?
If so, then many of the examples in [1] allude to training sets, splits, and testing. To what degree am I required to denote these? All of my data (plain text) files are saved in a single directory, and they are not divided into training and testing sets.
…--
Eric Morgan
-
Hi, my name is Yves Maurer and I'm the deputy head of IT at the National Library of Luxembourg. At our library we have released a few open datasets based on historical newspapers at data.bnl.lu, and we'd like to make more available. I've seen that Daniel van Strien has already added one of our datasets for text processing to the BigLAM project, but I think the ground truth packs of manually hand-corrected OCR should maybe also be added. They are smaller than the article fulltexts, but since they are hand-corrected (double-keyed), they definitely have higher quality. The datasets in question are:
And my Hugging Face username is ymaurer.
-
Hi, my name is Alex Wermer-Colan and I'm the Academic Director at Temple University's Scholars Studio. I support a lot of faculty and student projects that may be relevant to this project, such as researchers who are building large Reddit datasets on specific subjects. Our department's focus is on digitizing and developing copyrighted and open-access datasets with a focus on literary texts, especially science fiction and underrepresented writers and genres. We've ingested materials into the HathiTrust Digital Library, and I'm looking for other ways to make restricted data useful to machine learning researchers outside our institution, for example through extracted features or by pre-generating models without sharing the original datasets. We have also scraped lesser-known public domain works from relatively unknown digital archives that could be incorporated into existing literary datasets such as those available on Project Gutenberg.
-
Hi @hawc2 great to have you join!
This all sounds exciting. For this hackathon, we're focused on datasets that aren't in copyright or have an open licence. It would be great to explore possible datasets with extracted features. I'm hoping to do more work later in the hackathon on using tools to visualize/describe the datasets being included in this sprint, and these datasets could fit nicely into that. It would also be great to train models on some of this data; a GPT-esque model trained on science fiction sounds like a lot of fun! I'm also pinging @meg-huggingface and @yjernite, who are co-chairing the BigScience Data Governance working group. Part of the focus of this group is to explore models for data governance for data that can't be shared freely but might be possible to share in particular situations. In particular, the goal is to establish models for 'data providers' and 'data hosts' to responsibly host and share (under particular access controls) datasets that can't be freely shared. If this sounds of interest, this might be something to discuss a bit further offline.
-
Hi, I am Kiymet. I am a computer engineering and mathematics student at Bogazici University. Currently, I am a trainee at the KNAW Humanities Cluster, Digital Humanities Lab. I would like to contribute by adding the Odeuropa benchmark, which is based on historical texts in 7 languages.
-
Hi all, in the event I am not too late--as it is now Nov 1st--I'd like to throw my proverbial hat into the ring with regard to making an ML-appropriate #DigitalHumanities dataset available to the #GLAM research community. I am a 71-year-old cancer-surviving indie #CitizenScientist and, unfortunately, #DisabledDeveloper (due to a 2020 spinal cord injury) with a dataset I would like to make more widely available to the digitization research community.

My dataset consists of over 7,000 bounding-box dimensions for all the advertisements in the 48 issues of Softalk magazine in the digital collection at the Internet Archive. Beyond the raw dimensions of these ads, which document their size, shape, and position within the 2- and 3-column page grids of their pages, the ad specifications are contextualized by my development of the #MAGAZINEgts ground-truth storage format, based on an ontological stack of CIDOC-CRM, FRBRoo, PRESSoo, and PAGExml. This format provides an integrated complex document structure and content depiction model using a metamodel subgraph design pattern. My dataset is currently found here on GitHub: https://github.com/SoftalkAppleProject/datasets_ml_all_ads_1M. The README/home page of this repository has additional information about the dataset, the MAGAZINEgts format, associated DATeCH posters, and relevant links to articles, etc. The project description is a bit dated and has been affected by my recent disabling injury. The README page does not currently include a link to this relevant article: https://bit.ly/pressoo-magazines.

I could use a helping hand to determine the best way to format, document, and contribute this dataset for inclusion in the #BigLAM Hackathon project/community. I can most easily be reached here in reply or via Twitter at: https://twitter.com/Jim_Salmons
-
We're excited you want to get involved!
To join the hackathon: