Deficiencies in the Mahābhārata Dataset #6

Open
VedantMadane opened this issue Jan 16, 2025 · 2 comments


@VedantMadane
Collaborator

You have done commendable work in curating this repo.

However, a deficiency in the original sacred-texts.com Sanskrit has inadvertently crept in here as well.
The error affects the last letter of each line, or ślokārdha: the final halanta (्) is missing.
नरॊत्तमम should be नरॊत्तमम्
उदीरयेत should be उदीरयेत्
् is missing everywhere.

We have two ways to remedy this problem:

  1. Look at the last character of each line and, if it is a-kārānta (अ-कारान्त, ending in the inherent vowel a), append a halanta (्). Exceptions: valid a-kārānta words such as मम, च, etc.
  2. Use an alternate data source such as https://bombay.indology.info/mahabharata/text/UD/MBh01.txt and others.
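
Option 1 could be sketched roughly as follows. This is only a minimal illustration, not the repo's actual tooling: the function name `add_final_halanta`, the exception set, and the handling of trailing daṇḍas and verse numbers are all assumptions. The exception list in particular holds only the two words named above; a usable one would need to be far larger, since many legitimate line-final words end in a.

```python
import re

VIRAMA = "\u094D"  # the halanta sign ्

# Bare Devanagari consonants (क..ह plus the nukta forms). A line whose last
# letter is one of these ends in the inherent vowel "a".
CONSONANT = re.compile(r"[\u0915-\u0939\u0958-\u095F]")

# Trailing material to skip: whitespace, daṇḍas (। ॥), ASCII and Devanagari digits.
TRAILING = re.compile(r"[\s\u0964\u09650-9\u0966-\u096F]*$")

# Hypothetical and deliberately incomplete list of valid a-final words.
A_FINAL_EXCEPTIONS = {"मम", "च"}


def add_final_halanta(line, exceptions=A_FINAL_EXCEPTIONS):
    """Append a halanta to the last word of a line if it ends in a bare consonant."""
    match = TRAILING.search(line)
    body, tail = line[:match.start()], line[match.start():]
    if not body or not CONSONANT.fullmatch(body[-1]):
        return line  # empty, or already ends in a vowel sign, halanta, etc.
    if body.split()[-1] in exceptions:
        return line  # a known word that legitimately ends in "a"
    return body + VIRAMA + tail
```

Because the rule cannot tell a missing halanta from a genuine a-final word, its output would still need checking, for example against the alternate source in option 2.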

If you could upload to the repo the programs you used for scraping the websites, scanning Sanskrit text along with accent markers and OCR, turning PDFs into JSON, etc., I can create a pull request with the additional data sources.

@bhavykhatri
Owner

You have a valid point, and I have merged the critical edition PR. Before making the critical edition the default Mahābhārata source, though, we could update its JSON structure so that it follows the more familiar format with book and shloka keys.
