Support for Hindi/Gujarati #577

Open
Docbroke opened this issue Jul 18, 2024 · 12 comments
Labels
languages Dictionary or language related issues

Comments

@Docbroke

Hi,

I am interested in working on these two Indic languages (Hindi and Gujarati). I have created a phonetic keyboard layout for Gujarati, which is now available in xkeyboard. I have a few ideas regarding this.

  1. Can we create a Hindi/Gujarati keyboard for TT9 without dictionary support? It would be ABC mode only in the beginning.
  2. What I am planning is to create a yml file similar to those of other languages and add the Unicode characters that phonetically correspond to the English characters on the respective key, so that one can guess which characters are on which key. This is different from feature phone keypads supporting Hindi, where the characters are in sequence instead of in a phonetic arrangement.
  3. Obviously, this means there will be multiple characters per key, almost double what there is in English. So can we think of having another layer (like a Shift/Alt key layer) that shows different characters when selected? So instead of abc/ABC/ENGLISH there would be कखग/ा,ि,ी etc.

Any suggestions or ideas ?

@sspanak
Owner

sspanak commented Jul 19, 2024

I was starting to wonder how no one had asked for an Indian language yet. First off, let's implement Hindi and then, in another issue, take care of Gujarati, too.

Here are my thoughts.

Implementing Hindi is going to be very similar to implementing Chinese and Japanese. I am willing to make the necessary changes for these two languages, so including Hindi should be very easy after that. My plan is also to go for phonetic typing. This will open the door for other syllable-based languages, such as Armenian, Georgian, some African languages and even a potential "Emoji" language, as discussed in #573. In all cases, I am going to use Gboard as an example of how typing should work. So, typing "h-i-n-d-i" in Gboard, or "4-4-6-3-4" in TT9, will yield "हिन्दी". Currently, there is no way to add this kind of typing to TT9, so you cannot create valid YAML or dictionary files.

And now, the questions.

What is the layout you have in mind? If you could list what letter goes on what key, as well as the Latin approximations of each letter, that would be great.

I will need you to help me understand how Devanagari works. Do I always need to combine a consonant and a vowel to form a single character? For example, I can see in Gboard, when I attempt to type "Hindi", "Hi" is one character. I can add another vowel to "h" and turn it into a different one. However, "hi" and "he" are the same. And probably, there is more than this simple example, I just don't know what to ask. I need to understand the language a bit, before I can implement it in Java.

@sspanak sspanak added the languages Dictionary or language related issues label Jul 19, 2024
@Docbroke
Author

Docbroke commented Jul 21, 2024

Let's see how Devanagari works.
When we pronounce any consonant, there is always some vowel pronounced at the end. That means there is always a vowel joined with the consonant. Let's take an example.
h = ह + ् = ह् ( ् marks a half character: without the "a" at the end, it cannot be pronounced alone. It is used to create joined consonants, e.g. "ndi" in "hindi" is न् + द + ी = न्दी, while न + दी = नदी is pronounced "nadi".)
ha = ह् + अ = ह
haa = ह + ा = हा
hi = ह + ि = हि
he = ह + े = हे
hee = ह + ी = ही
hu = ह + ु = हु
hoo = ह + ू = हू
hun/hum (as in Hungry) = ह + ं = हं
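
The composition rules above can be checked at the code point level. The following is a minimal Python sketch, not TT9 code: it builds "hi", "ndi" and "nadi" from individual code points and prints their Unicode names. The helper name `describe` is made up for illustration. Note that Unicode names the ् sign "VIRAMA".

```python
import unicodedata

HA = "\u0939"       # ह
NA = "\u0928"       # न
DA = "\u0926"       # द
HALANT = "\u094D"   # ् (named VIRAMA in the Unicode standard)
SIGN_I = "\u093F"   # ि
SIGN_II = "\u0940"  # ी

def describe(text):
    """Return the Unicode name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]

hi = HA + SIGN_I                   # हि
ndi = NA + HALANT + DA + SIGN_II   # न्दी (joined consonants)
nadi = NA + DA + SIGN_II           # नदी (no halant, so no joining)

for label, text in [("hi", hi), ("ndi", ndi), ("nadi", nadi)]:
    print(label, text, len(text), describe(text))
```

The strings are plain sequences of code points; drawing them as joined glyphs is left to the text renderer and the font.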

As all this joining of characters is handled by unicode fonts, we just need to put corresponding characters in yaml file and it should work. I am attaching the Hindi Unicode standard, an example phonetic keyboard layout, and the phonetic layout used in xkbmap.
Hindi Unicode.pdf
dev-kagapa
hindi-kagapa_xkbmap.txt

@Docbroke
Author

Docbroke commented Jul 21, 2024

Here is the yml file for Hindi, uploaded as a text file because yml attachments are not supported.
Hindi.txt

With this typing हिन्दी shall require ह + ि + न + ् + द + ी, which will be 446034. Multiple key presses will be required in T9 mode.

@sspanak
Owner

sspanak commented Jul 23, 2024

> With this typing हिन्दी shall require ह + ि + न + ् + द + ी, which will be 446034. Multiple key presses will be required in T9 mode.

Here comes the tricky part. 446034 in an ABC-like mode will not result in "hindi" only. It will result in all possible variants, like: "gindi", "ghndi", "gimbi", "himbi" and so on... To get "hindi", we need to use a dictionary for filtering out only the valid words.
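
To make the ambiguity concrete, here is a rough Python sketch of how one digit sequence expands into every letter combination, and how a word list narrows it down. It assumes the standard Latin phone keypad as a stand-in (the Devanagari layout from Hindi.txt is not reproduced in this thread), and `WORDS` is a tiny made-up dictionary.

```python
from itertools import product

# Standard Latin keypad, used here only as a stand-in for the Hindi layout.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}

def candidates(digits):
    """Every letter combination the digit sequence could stand for."""
    return ["".join(combo) for combo in product(*(KEYPAD[d] for d in digits))]

def predict(digits, dictionary):
    """Keep only the combinations that are valid dictionary words."""
    return [word for word in candidates(digits) if word in dictionary]

WORDS = {"hindi", "india", "nadi"}  # hypothetical word list

all_variants = candidates("44634")
print(len(all_variants))             # 3 letters per key, 5 keys -> 243
print("gindi" in all_variants)       # True: an unwanted variant
print(predict("44634", WORDS))       # ['hindi']
```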

I have some more questions.

From what I understand, it must be possible to combine any consonant with any other consonant or vowel, correct? The number of possible combinations is huge, which is why I was thinking of having only a dictionary mode. It would cause TT9 to suggest only words. Otherwise, it would be a pain to type, in my opinion.

Will you be able to find a word list for that?

> As all this joining of characters is handled by unicode fonts, we just need to put corresponding characters in yaml file and it should work.

Yes, but how will Android know if you wanted to type "ndi" or "nadi"? I am unsure if it knows how and when to join two characters into one or if this will require additional processing. Are the characters on 0-key and 1-key some kind of hints for joining? If so, it may work, indeed.

@Docbroke
Author

Docbroke commented Jul 23, 2024

https://dict.hinkhoj.com/hindi-words/lista.php There is a huge word list here; can it be extracted? It has about 53k words starting with "अ" alone. There is also a shorter word list at https://en.wiktionary.org/wiki/Appendix:Common_Hindi_words.
There are plenty of words in Indian languages. Many Sanskrit words are also used in Hindi, and both Hindi and Sanskrit use the same Devanagari lipi (script). So this keyboard will work for typing both Hindi and Sanskrit. A dictionary-only mode will not be enough.

Indeed, 446034 in ABC mode will not directly result in "Hindi"; it will require multiple presses to select the correct character on each key.

Only half characters join with the next vowel/consonant. " ्" on the 0 key marks a character as half. So न + ् + द + ी = न्दी, while न + द + ी = नदी.

@sspanak
Owner

sspanak commented Jul 30, 2024

I have many more questions.

Typing

I see. Your approach is based entirely on Devanagari, which probably makes sense to you.

My original idea was to base typing on the Latin letters. In other words, emulate this online keyboard, but using the 10 number keys. This way when you type "634" for "ndi" TT9 will automatically produce "न्दी". And there will be no need to think about the combining characters. Does this make sense from the average Rajiv's perspective, or would he find it easier to type using combining characters, despite the extra key presses per word?

From a technical perspective, my idea will require a dictionary containing only the Latin transcriptions of each letter (or conjunct letter, for that matter). From what I've read on Wikipedia, there are lossless methods of conversion, so it should be possible.

There are two potential problems though.

  1. When I tried typing "सिवोऽहम्", I still had to manually put "ऽ" in the middle and visarga at the end. This probably means we have to keep the combining characters, but put them on the 1-key or something.
  2. If you want to type "odi" instead of "ndi", you will have to select "o", press OK, then type "di". In other words: "ndi" = "634", "odi" = "6" + OK or space + "34". "O" will usually be at the beginning, so scrolling will not be necessary. This is the standard in Japanese, btw, so I am thinking it may be usable for Hindi/Sanskrit too, given that in all these languages one always types consonant+vowel or a standalone vowel repeatedly to get a word.

And, of course, besides "ndi", my method will also produce all the alternatives, such as "mdi", "nbi", "mbi", ... but I guess we can't avoid this given the nature of T9 typing.

I hope I didn't bore you with all these explanations. I am just trying to find out what the best way of typing is.

Numbers

We haven't discussed numbers. In Arabic, it is possible to type their own numbers by holding the respective key in ABC or Predictive mode. However, in 123 mode, TT9 produces Western numbers. It is for compatibility. I am sure there are plenty of apps and websites that understand only 0-9, but not the Asian alternatives.

I suppose you need the same in all Indian languages, right?

Rupee sign

Currency signs are grouped in one list. You can access it by pressing 0 + #. There is no need to add it on the 1-key.

Punctuation

I will add the extra punctuation characters to the Java code. This way it will be possible to order them optimally. Feel free to suggest a different order for all characters on the 1-key.

@Docbroke
Author

> This way when you type "634" for "ndi" TT9 will automatically produce "न्दी". And there will be no need to think about the combining characters.

It will be easier, but there is a problem with this approach. The keyboard will have to guess whether the user wants a half/joined character or a full character. "नदी" is also 634, and so is "मेह" (MEH). The number of possible combinations will rise with longer words, where every consonant can be full or half/joined. Also, with Sanskrit in the mix, no dictionary will be enough.
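
The collision mentioned here can be checked mechanically. Below is a small Python sketch that maps each Devanagari character to a key and groups words by the resulting digit sequence. The key assignments are an assumption read off the examples in this thread (ह, ि, ी on 4; न, म on 6; द, े on 3), not the actual contents of Hindi.txt.

```python
from collections import defaultdict

# Assumed key assignments, inferred from the examples in this thread.
KEY_OF = {
    "\u0939": "4",  # ह
    "\u093F": "4",  # ि
    "\u0940": "4",  # ी
    "\u0928": "6",  # न
    "\u092E": "6",  # म
    "\u0926": "3",  # द
    "\u0947": "3",  # े
}

def key_sequence(word):
    """Digit sequence for a word, one key press per code point."""
    return "".join(KEY_OF[ch] for ch in word)

def group_by_sequence(words):
    """Group words that share the same digit sequence."""
    groups = defaultdict(list)
    for word in words:
        groups[key_sequence(word)].append(word)
    return dict(groups)

NADI = "\u0928\u0926\u0940"  # नदी
MEH = "\u092E\u0947\u0939"   # मेह

groups = group_by_sequence([NADI, MEH])
print(groups)  # both words end up under "634"
```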

> In Arabic, it is possible to type their own numbers by holding the respective key in ABC or Predictive mode.

This should be fine with Hindi. Most people use western numbers only.

@sspanak
Owner

sspanak commented Aug 20, 2024

Coming back to this again.

First, did you mean to put virama on the 0-key or the 1-key? In the .yml you uploaded, it is on the 1-key, which contradicts your description of typing "Hindi" using 446034. I think it makes more sense for it to be on the 1-key, too.

Second. Maybe I didn't explain my idea well enough. My point is the dictionary will contain only Devanagari letters as "words", not entire real words. This means typing "Hindi" in semi-Predictive mode will require:

  1. Type 44. This will produce only two letter options that end with "i" or long "i", because these are the only vowels on 4-key.
  2. Optionally scroll to "हि", if it is not the first one
  3. OK
  4. Type 61. This will produce only the single consonants on the 6-key.
  5. Optionally scroll to "न्", if needed.
  6. OK
  7. Type 34
  8. Scroll to "दी", if needed
  9. OK
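
The steps above can be sketched in Python as a dictionary whose "words" are single Devanagari letters, indexed by digit sequence. This is only an illustration of the idea, not TT9's implementation: the character subset, its key assignments and its ordering are hypothetical, and halant is assumed to be on the 1-key.

```python
from collections import defaultdict

# Hypothetical subset of the layout: character -> key.
CONSONANTS = {
    "\u0917": "4", "\u0918": "4", "\u0939": "4",  # ग घ ह
    "\u0928": "6", "\u092E": "6",                 # न म
    "\u0926": "3",                                # द
}
VOWEL_SIGNS = {"\u093F": "4", "\u0940": "4", "\u0947": "3"}  # ि ी े
HALANT, HALANT_KEY = "\u094D", "1"

def build_index():
    """Map digit sequences to the letters they can produce."""
    index = defaultdict(list)
    for cons, cons_key in CONSONANTS.items():
        index[cons_key].append(cons)                        # full consonant
        index[cons_key + HALANT_KEY].append(cons + HALANT)  # half consonant
        for sign, sign_key in VOWEL_SIGNS.items():
            index[cons_key + sign_key].append(cons + sign)  # consonant + vowel
    return index

INDEX = build_index()

def type_word(steps):
    """steps: (digits, scroll position) pairs, each confirmed with OK."""
    return "".join(INDEX[digits][position] for digits, position in steps)

print(INDEX["44"])                                     # options ending in short or long "i"
print(type_word([("44", 4), ("61", 0), ("34", 1)]))    # हिन्दी
print(type_word([("63", 1), ("4", 2)]))                # मेह
```

With the real layout, the suggestion lists would be longer and the scroll positions different, but the lookup stays the same: a short digit sequence, a short list of letters, then OK.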

There will be no confusion between "नदी" and "मेह", because there is no letter "meh". Instead you would type it this way:

  1. Type 63
  2. Optionally scroll to "मे" and/or OK
  3. Type 4
  4. Optionally scroll to "ह" and/or OK

(Btw, isn't this "meha"? If it is supposed to be "meh" with virama at the end, then on step 3, one would have to type 41, instead of only 4. This is just a technical detail, not so important.)

The advantage is you can type just any word in both Hindi and Sanskrit.

In comparison, in normal Predictive mode, where suggestions are entire words (probably, what you were thinking of), typing will be more straightforward. "Hindi" would be simply "446134" + OK or space. And "meh" would be "634" + OK or space. I guess "ndi" means nothing on its own, so it will not be present in the dictionary, hence there will be no confusion again.

However, the big problem is finding a dictionary that contains enough words for regular everyday typing. The one you proposed is likely not enough. As a comparison, English, a language with no inflections, word genders, or many verb tenses, has about 170k words. I can quickly write a webpage crawler to extract the words from that website, but I really doubt 53k words cover both Hindi and Sanskrit.

In "ABC" mode, both alternatives will work the same:
"Hindi":

  1. Press 4, scroll to "ha"
  2. Press 4, scroll to "i"
  3. Press 6, scroll to "n"
  4. Press 1, scroll to virama
  5. Press 3, scroll to "da"
  6. Press 4, scroll to "i"
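
For comparison, the ABC sequence above can be simulated with a short Python sketch that also counts key presses. The per-key character order is hypothetical, and the pause or confirmation needed between two characters on the same key is not modelled.

```python
# Hypothetical per-key character order; the real order comes from the yml file.
ABC_KEYS = {
    "1": ["\u094D"],                      # ् (halant)
    "3": ["\u0926", "\u0947"],            # द े
    "4": ["\u0939", "\u093F", "\u0940"],  # ह ि ी
    "6": ["\u0928", "\u092E"],            # न म
}

def abc_type(steps):
    """steps: (key, position) pairs; reaching position p costs p + 1 presses."""
    chars = []
    presses = 0
    for key, position in steps:
        chars.append(ABC_KEYS[key][position])
        presses += position + 1
    return "".join(chars), presses

steps = [("4", 0), ("4", 1), ("6", 0), ("1", 0), ("3", 0), ("4", 2)]
word, presses = abc_type(steps)
print(word, presses)
```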

So, maybe, we can include a sort of "ABC" mode and sort of "Predictive" mode as I described it? Does that make sense?

@Docbroke
Author

Regarding " ् " it is not "virama" or punctuation mark. It is called हल or हलंत , https://www.hindi.co/naagaree/halant.html
I have put it in "1" in uploaded yml, but I think it can be put on "0" , considering it is not a punctuation mark. In that case it should be at the beginning as it will be one of the most commonly used character on 0.

Yes, this semi-predictive mode with only letters sounds fine. You are correct, I thought it would be the normal predictive mode with words. So having an ABC mode and a predictive mode makes perfect sense for a start. I will try to find a dictionary.

@sspanak
Owner

sspanak commented Aug 24, 2024

> Regarding " ् " it is not "virama" or punctuation mark. It is called हल or हलंत , https://www.hindi.co/naagaree/halant.html
> I have put it in "1" in uploaded yml, but I think it can be put on "0" , considering it is not a punctuation mark. In that case it should be at the beginning as it will be one of the most commonly used character on 0.

I guess I don't know what it is properly called, but we are talking about the same thing. From the website you posted:

> Hal was encoded in ISCII (mentioned as halant) and subsequently in Unicode (mentioned as virama and halant)... IMHO, Unicode standard somewhat erroneously used a word 'Virāma' to denote hal!

Anyway, I'm a bit against putting letters or letter-like characters on the 0-key.

In T9 layout the 0-key is the Spacebar. It is similar to computer keyboards, where the Spacebar is in the middle. For consistency with all other languages, I would like to preserve the same experience in Indic ones too. In my opinion, the space is the most important character and it should be the most easily accessible one.

My second concern is, in TT9, the 0-key is really meant for special and mathematical characters, while the 1-key is for the characters you would normally use to type text. For this reason, I prefer putting and on the 1-key. In some languages like Ukrainian and Hebrew, the apostrophe, a punctuation mark in English, has a phonetic meaning - it modifies the adjacent letters. So, typing the respective Devanagari characters using 1 makes much more sense.

As for "ज्ञ", it is just a regular letter, isn't it? I think it makes sense to put it on one of the "phonetic" keys. I'll give another foreign language example. In Spanish, they have N and Ñ. They represent similar sounds and they both reside on the 6-key, where N is located. Doesn't it make sense to put "ज्ञ" there too? Or, if it is more similar to "J", maybe put it on 5-key? I only read about it, but I am not sure what is the phonetic value, so I can't propose the best key. But either way, the characters on 1-key do not "make sounds", so I don't think it belongs there.

And my final concern is, in Predictive mode, the space key can not contain letters. Pressing it always terminates the current word and resets the typing status. It will be very, very difficult for me to make TT9 guess correctly if you want to type a word with a letter on 0, or you want to end the current word.

This is my final take on the discussion. Once we clear the layout, we should be all good and I should be able to implement Hindi/Sanskrit.

@Docbroke
Author

I understand, and I agree with you on not putting characters on the 0-key.
ज्ञ is a joined character. In Unicode, it is created by typing ज + ् + ञ, but it is pronounced more like "Gna" or "Gya". Therefore, I wasn't sure where to put it. On a full-size computer keyboard with Unicode fonts, it is typed by joining three characters. However, since it is pronounced differently from the characters used to type it, I thought we needed to put it somewhere, otherwise it would be difficult to type. I think putting it on either 4 or 5 should be fine.

@sspanak
Owner

sspanak commented Aug 25, 2024

Excellent!

Here is a table of all conjunct letters. It may be useful when trying to make it work. Hopefully, just pushing the correct characters to any text field will be enough to form conjunct letters, just like Android magically converts Arabic letters to the correct forms. We'll see.
