Work plan for creating data with typos #1

Open
4 of 5 tasks
BERENZ opened this issue Jan 9, 2025 · 4 comments

BERENZ commented Jan 9, 2025

Create datasets with at least 50 examples of given names (male and female) and surnames, separately for each language: Ukrainian, Russian, Belarusian, Vietnamese and Polish. An example dataset:

  ukrainski poprawnie   blednie1  blednie2 blednie3
1     Олена     Ołena      Elena    Helene    Alona
2  Катерина  Kateryna Jekaterina Catherine    Katya
3     Ірина     Iryna      Irina     Irena      Ira
4    Тетяна   Tetiana    Tatiana     Tania  Tatyana
5    Оксана    Oksana      Oxana    Oxanna  Oksanka
6   Наталія   Natalia    Natalya   Natasha  Natalja
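
A dataset in this shape could be checked against the 50-example minimum with a short script. The following is only a minimal sketch: it assumes the data is stored as CSV with the column names from the example (`poprawnie` is the correct form, `blednie1` to `blednie3` are the incorrect variants), and the helper names `load_rows` and `check_dataset` are hypothetical.

```python
import csv
import io

# Assumption: the dataset is kept as CSV with the same columns as the
# example table. Only three of the example rows are reproduced here.
SAMPLE_CSV = """ukrainski,poprawnie,blednie1,blednie2,blednie3
Олена,Ołena,Elena,Helene,Alona
Катерина,Kateryna,Jekaterina,Catherine,Katya
Ірина,Iryna,Irina,Irena,Ira
"""

MIN_EXAMPLES = 50  # minimum number of examples requested in the work plan


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def check_dataset(rows, min_examples=MIN_EXAMPLES):
    """Return a list of human-readable problems found in the dataset."""
    problems = []
    if len(rows) < min_examples:
        problems.append(f"only {len(rows)} rows, expected at least {min_examples}")
    for i, row in enumerate(rows, start=1):
        # Flag cells that are missing or contain only whitespace.
        empty = [col for col, value in row.items() if not (value or "").strip()]
        if empty:
            problems.append(f"row {i}: empty columns {empty}")
    return problems


rows = load_rows(SAMPLE_CSV)
problems = check_dataset(rows)
for problem in problems:
    print(problem)
```

On the three-row sample the check reports only that the row count is below the threshold, which is expected for an excerpt.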

List of languages

  • Ukrainian
  • Russian
  • Belarusian
  • Vietnamese
  • Polish

Data from dane.gov.pl may be useful here.

cypskaj commented Jan 12, 2025

Datasets for Ukrainian and Russian are done.
Potential issue to resolve: different original forms can have identical Polish and/or English equivalents.
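
One way to surface such collisions is to index every variant and report those that point back to more than one original form. This is a rough sketch, not the approach used in the repository: the row layout and the `find_collisions` helper are hypothetical, and the Russian row is an illustrative entry rather than one taken from the actual files.

```python
from collections import defaultdict

# Hypothetical layout: one dict per original form with its Latin-script variants.
rows = [
    {"original": "Наталія", "variants": ["Natalia", "Natalya", "Natasha", "Natalja"]},
    {"original": "Олена", "variants": ["Ołena", "Elena", "Helene", "Alona"]},
    # Illustrative Russian entry, not taken from the actual dataset.
    {"original": "Наталья", "variants": ["Natalia", "Natalja"]},
]


def find_collisions(rows):
    """Map each variant (case-insensitively) to the originals that share it."""
    originals_by_variant = defaultdict(set)
    for row in rows:
        for variant in row["variants"]:
            originals_by_variant[variant.casefold()].add(row["original"])
    # Keep only variants that belong to more than one original form.
    return {
        variant: sorted(originals)
        for variant, originals in originals_by_variant.items()
        if len(originals) > 1
    }


collisions = find_collisions(rows)
for variant, originals in collisions.items():
    print(variant, "->", ", ".join(originals))
```

Whether a collision is a problem depends on how the data is used later, so the sketch only reports collisions and does not try to resolve them.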

BERENZ commented Jan 12, 2025

Can you add a couple of examples to the README, just as a glimpse of the files? You could also add some basic statistics, such as the number of names and surnames.
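
Such statistics could be generated rather than counted by hand. The sketch below assumes a hypothetical record layout with `language` and `kind` fields; the surname entry is illustrative and not taken from the actual files.

```python
from collections import Counter

# Hypothetical record layout; the real files may be organised differently.
records = [
    {"language": "Ukrainian", "kind": "female name", "original": "Олена"},
    {"language": "Ukrainian", "kind": "female name", "original": "Ірина"},
    {"language": "Ukrainian", "kind": "female name", "original": "Оксана"},
    {"language": "Ukrainian", "kind": "surname", "original": "Шевченко"},  # illustrative
]


def count_by_language_and_kind(records):
    """Count entries for every (language, kind) pair."""
    return Counter((r["language"], r["kind"]) for r in records)


def to_markdown_table(counts):
    """Render the counts as a Markdown table suitable for a README."""
    lines = ["| Language | Kind | Count |", "| --- | --- | --- |"]
    for (language, kind), n in sorted(counts.items()):
        lines.append(f"| {language} | {kind} | {n} |")
    return "\n".join(lines)


counts = count_by_language_and_kind(records)
readme_table = to_markdown_table(counts)
print(readme_table)
```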

cypskaj commented Jan 12, 2025

Done, I think. Next, I will proceed with Belarusian and look into occurrence frequencies.

cypskaj commented Jan 17, 2025

Finished the foreign languages. In my free time I will probably go through them once more to make sure everything looks okay, add sources of information where applicable, and look into the issue of frequencies.
Regarding Polish: am I supposed to provide possible typos/variants as well?
