This repository contains static data used by the other Dialect Map components 💬.
Jargon terms are grouped to allow one-to-one comparison when they share the same meaning but the term used to describe it varies from science to science. These groups are later used by a range of data-ingestion pipelines to generate NLP metrics on the ArXiv papers dataset, so that the terms can be compared within the Dialect Map UI.
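As a rough illustration of why grouping helps, the sketch below enumerates the one-to-one comparisons available within a single group. The group layout (a `terms` list of objects with `name` and `archive` keys), the placeholder values, and the `comparison_pairs` helper are all assumptions made for this example; they are not taken from the repository's actual schema or pipelines.

```python
from itertools import combinations

# Hypothetical jargon group: the "terms", "name" and "archive" keys and the
# placeholder values are assumptions for illustration, not the real schema.
group = {
    "terms": [
        {"name": "term-a", "archive": "archive-x"},
        {"name": "term-b", "archive": "archive-y"},
        {"name": "term-c", "archive": "archive-z"},
    ]
}


def comparison_pairs(group):
    """Return every unordered pair of term names within one group."""
    names = [term["name"] for term in group["terms"]]
    return list(combinations(names, 2))


pairs = comparison_pairs(group)
for left, right in pairs:
    print(f"compare {left!r} with {right!r}")
```

A group of three terms yields three pairs, so a downstream pipeline would only need to compare terms inside the same group rather than across the whole list.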
The project uses AJV-CLI to validate both the JSON schemas and the jargon list. Install it by running:
npm install --no-optional
To validate the JSON-Schema syntax:
make validate
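For readers who want a feel for the kind of structural checks a schema validation performs, here is a minimal standard-library sketch. It does not use AJV or the repository's real schema: the expected layout (a list of groups, each with a `terms` list of at least two objects carrying non-empty `name` and `archive` strings) and the `find_errors` helper are assumptions for illustration only.

```python
import json

# Hypothetical jargon list. The layout and key names are assumptions made
# for this sketch; the real rules live in the repository's JSON schemas.
SAMPLE = json.loads("""
[
  {"terms": [
    {"name": "term-a", "archive": "archive-x"},
    {"name": "term-b", "archive": "archive-y"}
  ]}
]
""")


def find_errors(groups):
    """Return a list of human-readable problems found in the jargon groups."""
    if not isinstance(groups, list):
        return ["top-level value must be a list of groups"]
    errors = []
    for i, group in enumerate(groups):
        terms = group.get("terms") if isinstance(group, dict) else None
        if not isinstance(terms, list) or len(terms) < 2:
            errors.append(f"group {i}: needs a 'terms' list with at least two entries")
            continue
        for j, term in enumerate(terms):
            if not isinstance(term, dict):
                errors.append(f"group {i}, term {j}: must be an object")
                continue
            for key in ("name", "archive"):
                value = term.get(key)
                if not isinstance(value, str) or not value.strip():
                    errors.append(f"group {i}, term {j}: missing or empty '{key}'")
    return errors


errors = find_errors(SAMPLE)
print("valid" if not errors else "\n".join(errors))
```

Collecting all problems in a list, instead of stopping at the first one, mirrors how validators usually report every violation in a single run.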
The full corpus of ArXiv categories comprises both current and legacy ones.
- Current ones have been copied from the official ArXiv category taxonomy.
- Legacy ones have been inferred from the public ArXiv metadata dataset.
The initial set of jargon groups was collected through a Google form set up by Kyle Cranmer on Twitter, with responses from the scientific community gathered from December 1 to December 31, 2020.
New terms can be added by opening a Pull Request (PR). The Dialect Map team then reviews each PR to ensure that the resulting JSON is well formatted.
- For information about how to add new terms, check the contributing documentation.
- For information about how changes are propagated, check the computing documentation.