This repository is outdated. A more comprehensive, more precisely aligned dataset is available here: https://github.com/sebastian-nehrdich/sanstib
This is a collection of sentence-level aligned Sanskrit-Tibetan etexts. The Tibetan etexts have been taken from ACIP (https://asianclassics.org/), the Sanskrit etexts from GRETIL (http://gretil.sub.uni-goettingen.de). The HTML versions of the aligned files are found in the html folder. The org folder contains .org files (which are just plain text files). The matrices folder contains pictures of the alignment matrices of the different files. These can be useful to see to what extent the alignment has been successful.
The alignment quality is good: the overall average is somewhere around 97%, provided that the etexts are not noisy and no larger chunks are missing in either language. However, be prepared to find occasional mistakes. Some compression occurs because the aligner produces one-to-many and many-to-one alignments.
Feel free to open an issue or send me a mail if you have your own etexts that you want to have aligned, or if you have any further questions or suggestions!
The aligned etexts have been created by a combination of a classifier using a convolutional neural network and the YASA sentence aligner (http://rali.iro.umontreal.ca/rali/?q=en/yasa).
Author / Name | Quality | % of sentences | date added | HTML | TXT | Alignment Matrix | Remarks |
---|---|---|---|---|---|---|---|
Abhidharmakośabhāṣyam | >97% | 75% | 10.6.18 | HTML | TXT | PNG | Very high alignment quality |
Abhidharmakośavyākhyā | 90% | 50% | 10.6.18 | HTML | TXT | | High alignment quality; occasional disagreement between the Sanskrit etext and the Tibetan translation accounts for a certain number of errors. Also note that both etexts contain a fair amount of noise. |
Abhidharmasamuccaya | >97% | 50% | 10.6.18 | HTML | TXT | PNG | Very high alignment quality |
Abhidharmasamuccayabhāṣya | >97% | 50% | 10.6.18 | HTML | TXT | PNG | Very high alignment quality |
Madhyāntavibhāgabhāṣyam | >97% | 50% | 10.6.18 | HTML | TXT | PNG | Very high alignment quality |
Prasannapadā | 90% | 50% | 10.6.18 | HTML | TXT | PNG | High alignment quality; occasional disagreement between the Sanskrit etext and the Tibetan translation accounts for a certain number of errors. Here both the Sanskrit etext and the Tibetan translation are rather noisy. |
Triṃśikavijñaptibhāṣyam | >97% | 50% | 10.6.18 | HTML | TXT | PNG | Very high alignment quality |
Each point on the y-axis represents a Sanskrit sentence; each point on the x-axis represents a Tibetan sentence. The images can be very useful for getting an impression of the quality of the translations and of whether longer sections have been lost.
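
To make the layout of these matrices concrete, here is a minimal sketch in Python. It is not the code that produced the PNG files; the alignment representation (pairs of Sanskrit and Tibetan sentence index lists) and the function names `alignment_matrix` and `render` are assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical representation: one alignment links a group of Sanskrit
# sentence indices to a group of Tibetan sentence indices.
Alignment = Tuple[List[int], List[int]]


def alignment_matrix(alignments: List[Alignment], n_skt: int, n_tib: int) -> List[List[int]]:
    """Build a 0/1 matrix with Sanskrit sentences as rows (y-axis)
    and Tibetan sentences as columns (x-axis)."""
    matrix = [[0] * n_tib for _ in range(n_skt)]
    for skt_ids, tib_ids in alignments:
        for s in skt_ids:
            for t in tib_ids:
                matrix[s][t] = 1
    return matrix


def render(matrix: List[List[int]]) -> str:
    """Draw the matrix as text: '#' marks an aligned pair, '.' an empty cell."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in matrix)


# Toy example: one 1-to-1, one 1-to-2 and one 2-to-1 alignment.
example = [([0], [0]), ([1], [1, 2]), ([2, 3], [3])]
print(render(alignment_matrix(example, n_skt=4, n_tib=4)))
```

In such a picture a clean alignment shows up as a roughly diagonal band. A horizontal or vertical run of marks corresponds to a one-to-many or many-to-one alignment.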
- The aligner is not able to cut sentences into smaller units, but it can do one-to-many and many-to-one alignments.
- Errors most likely occur at the end of sentences, when smaller units get aligned to the wrong corresponding sentence; this is because the algorithm does not reliably detect units that are shorter than 3 tokens.
- If a longer part is missing in either of the two languages, the algorithm might lose track and therefore produce a couple of misalignments; such instances have to be located manually and fixed.
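
Since units shorter than 3 tokens are the most likely source of errors, a simple filter can narrow down where to look when checking an alignment by hand. The following Python sketch is not part of the aligner. It assumes that aligned pairs are available as plain strings and that tokens are separated by whitespace; the function name `flag_short_units` and the placeholder tokens are hypothetical.

```python
from typing import List, Tuple


def flag_short_units(pairs: List[Tuple[str, str]], min_tokens: int = 3) -> List[int]:
    """Return the indices of aligned pairs in which either side has fewer
    than `min_tokens` whitespace-separated tokens (an assumed tokenization)."""
    flagged = []
    for index, (skt, tib) in enumerate(pairs):
        if len(skt.split()) < min_tokens or len(tib.split()) < min_tokens:
            flagged.append(index)
    return flagged


# Placeholder tokens stand in for real Sanskrit and Tibetan text.
pairs = [
    ("s1 s2 s3 s4", "t1 t2 t3 t4 t5"),  # both sides long enough
    ("s5", "t6 t7 t8"),                 # Sanskrit side too short
    ("s6 s7 s8", "t9 t10"),             # Tibetan side too short
]
print(flag_short_units(pairs))
```

The flagged pairs are only candidates for manual review; a short unit can still be aligned correctly.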