Support for Hindi/Gujarati #577
I was starting to wonder how no one could have asked for an Indian language yet. First off, let's implement Hindi and then, in another issue, take care of Gujarati, too. Here are my thoughts.

Implementing Hindi is going to be very similar to Chinese and Japanese. I am willing to make the necessary changes for these two languages, so including Hindi should be very easy after that. My plan is also to go for phonetic typing. This will open the door for other syllable-based languages, such as Armenian, Georgian, some African languages and even a potential "Emoji" language, as discussed in #573.

In all cases, I am going to use Gboard as an example of how typing should work. So, typing "h-i-n-d-i" in Gboard, or "4-4-6-3-4" in TT9 will yield "हिन्दी". Currently, there is no way to add this kind of typing to TT9, so you cannot create valid YAML or dictionary files.

And now, the questions.

What is the layout you have in mind? If you could list which letter goes on which key, as well as the Latin approximations of each letter, that would be great.

I will need you to help me understand how Devanagari works. Do I always need to combine a consonant and a vowel to form a single character? For example, I can see in Gboard, when I attempt to type "Hindi", "Hi" is one character. I can add another vowel to "h" and turn it into a different one. However, "hi" and "he" are the same. And probably, there is more than this simple example, I just don't know what to ask. I need to understand the language a bit before I can implement it in Java.
Let's see how Devanagari works. As all this joining of characters is handled by Unicode fonts, we just need to put the corresponding characters in the YAML file and it should work. I am attaching the Hindi Unicode standard, an example phonetic keyboard layout, as well as the phonetic layout used in xkbmap.
Here is the YAML file for Hindi, uploaded as a text file, since yml attachments are not supported. With this layout, typing हिन्दी requires ह + ि + न + ् + द + ी, which is 446034. Multiple key presses will be required in T9 mode.
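To make the key sequence above concrete, here is a minimal Java sketch that turns a Devanagari string into T9 digits. The class name `HindiKeySequence` and the partial layout are assumptions made for illustration: only the six characters needed for "हिन्दी" are mapped, with the halant on the 0-key as in this proposal. It is not the attached layout file and not TT9 code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the TT9 code base.
class HindiKeySequence {
    // Assumed partial layout: only the characters needed for "हिन्दी".
    private static final Map<Integer, Character> KEY_OF = new HashMap<>();

    static {
        KEY_OF.put(0x0939, '4'); // ह (ha)
        KEY_OF.put(0x093F, '4'); // ि (vowel sign i)
        KEY_OF.put(0x0928, '6'); // न (na)
        KEY_OF.put(0x094D, '0'); // ् (halant), on the 0-key in this proposal
        KEY_OF.put(0x0926, '3'); // द (da)
        KEY_OF.put(0x0940, '4'); // ी (vowel sign ii)
    }

    // Returns the digit for each code point, in typing order.
    static String toDigits(String word) {
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < word.length(); ) {
            int codePoint = word.codePointAt(i);
            Character key = KEY_OF.get(codePoint);
            if (key == null) {
                throw new IllegalArgumentException(
                    "No key assigned to U+" + Integer.toHexString(codePoint).toUpperCase());
            }
            digits.append(key.charValue());
            i += Character.charCount(codePoint);
        }
        return digits.toString();
    }
}
```

With this mapping, `HindiKeySequence.toDigits("हिन्दी")` gives `446034`, one digit per Unicode character rather than per visible glyph.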
Here comes the tricky part. 446034 in an ABC-like mode will not result in "hindi" only. It will result in all possible variants, like: "gindi", "ghndi", "gimbi", "himbi" and so on... To get "hindi", we need to use a dictionary to filter out everything but the valid words. I have some more questions. From what I understand, it must be possible to combine any consonant with any other consonant or vowel, correct? The number of possible combinations is huge, which is why I was thinking of having only a dictionary mode. It will cause TT9 to suggest only words. Otherwise, it would be a pain to type, in my opinion. Will you be able to find a word list for that?
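The combinatorial problem described above can be sketched in a few lines of Java. This is an illustration only: `T9Candidates` is a hypothetical class, the key map is the standard Latin phone keypad rather than any Hindi layout, and the two-word dictionary is made up for the example. It does not reflect how TT9 stores or queries its dictionaries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of why a dictionary is needed; not TT9 code.
class T9Candidates {
    // Standard Latin phone keypad, used here only as an example layout.
    static final Map<Character, String> LETTERS = Map.of(
        '2', "abc", '3', "def", '4', "ghi", '5', "jkl",
        '6', "mno", '7', "pqrs", '8', "tuv", '9', "wxyz");

    // Expands a digit sequence into every possible letter combination.
    static List<String> expand(String digits) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (char digit : digits.toCharArray()) {
            String letters = LETTERS.get(digit);
            if (letters == null) {
                throw new IllegalArgumentException("No letters on key " + digit);
            }
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                for (char letter : letters.toCharArray()) {
                    next.add(prefix + letter);
                }
            }
            results = next;
        }
        return results;
    }

    // Keeps only the candidates that are present in the dictionary.
    static List<String> filter(List<String> candidates, Set<String> dictionary) {
        List<String> valid = new ArrayList<>();
        for (String candidate : candidates) {
            if (dictionary.contains(candidate)) {
                valid.add(candidate);
            }
        }
        return valid;
    }
}
```

For the five-key sequence "44634", `expand` produces 3 × 3 × 3 × 3 × 3 = 243 candidates, including "gindi" and "ghndi". Filtering against a dictionary that contains "hindi" leaves just that one word.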
Yes, but how will Android know if you wanted to type "ndi" or "nadi"? I am unsure if it knows how and when to join two characters into one or if this will require additional processing. Are the characters on 0-key and 1-key some kind of hints for joining? If so, it may work, indeed.
https://dict.hinkhoj.com/hindi-words/lista.php
There is a huge word list here, can it be extracted? About 53k words starting with "अ" alone. A shorter word list is also available here:
https://en.wiktionary.org/wiki/Appendix:Common_Hindi_words

Indeed, 446034 in ABC mode will not result in Hindi directly; it will require multiple presses to select the correct character.

Only half characters join with the next vowel/consonant. " ्" on the 0-key marks a character as half. So न + ् + द + ी = न्दी, while न + द + ी = नदी.
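The difference between the two spellings above comes down to a single code point. The following Java sketch (the class name `HalantDemo` is hypothetical) builds both strings from their Unicode characters. It only shows what an input method would emit; whether the result is displayed as a joined conjunct is left to the font and text renderer, as mentioned earlier in the thread.

```java
// Hypothetical illustration; shows the code points only, not any rendering logic.
class HalantDemo {
    static final char NA = '\u0928';      // न
    static final char DA = '\u0926';      // द
    static final char II_SIGN = '\u0940'; // ी (vowel sign ii)
    static final char HALANT = '\u094D';  // ् (named VIRAMA in Unicode)

    // न + ् + द + ी: the halant makes न a half character that joins with द.
    static String withHalant() {
        return "" + NA + HALANT + DA + II_SIGN;
    }

    // न + द + ी: no halant, so न stays a full character.
    static String withoutHalant() {
        return "" + NA + DA + II_SIGN;
    }
}
```

`withHalant()` is four characters long and `withoutHalant()` is three, so the two words are distinct strings even before any font shaping happens.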
I have many more questions.

Typing

I see. Your approach is based entirely on Devanagari, which probably makes sense to you. My original idea was to base typing on the Latin letters. In other words, emulate this online keyboard, but using the 10 number keys. This way, when you type "634" for "ndi", TT9 will automatically produce "न्दी", and there will be no need to think about the combining characters. Does this make sense from the average Rajiv's perspective, or would he find it easier to type using combining characters, despite the extra key presses per word?

From a technical perspective, my idea will require a dictionary containing only the Latin transcriptions of each letter (or conjunct letter, for that matter). From what I've read on Wikipedia, there are lossless methods of conversion, so it should be possible. There are two potential problems though.
And, of course, besides "ndi", my method will also produce all the alternatives, such as "mdi", "nbi", "mbi", ... but I guess we can't avoid this given the nature of T9 typing. I hope I didn't bore you with all these explanations. I am just trying to find out the best way of typing.

Numbers

We haven't discussed numbers. In Arabic, it is possible to type the native numerals by holding the respective key in ABC or Predictive mode. However, in 123 mode, TT9 produces Western numbers. This is for compatibility. I am sure there are plenty of apps and websites that understand only 0-9, but not the Asian alternatives. I suppose you need the same in all Indian languages, right?

Rupee sign

Currency signs are grouped in one list. You can access it by pressing 0 + #. There is no need to add it on the 1-key.

Punctuation

I will add the extra punctuation characters to the Java code. This way it will be possible to order them optimally. Feel free to suggest a different order for all characters on the 1-key.
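As a small illustration of the numbers question above: Devanagari digits occupy a contiguous Unicode range starting at U+0966, so converting between Western and Devanagari digits is a fixed offset. The class `DevanagariDigits` below is a hypothetical sketch, not how TT9 handles the hold-key behavior.

```java
// Hypothetical illustration of the Western-to-Devanagari digit mapping; not TT9 code.
class DevanagariDigits {
    private static final char DEVANAGARI_ZERO = '\u0966'; // ०

    // Replaces each Western digit 0-9 with its Devanagari counterpart
    // and leaves every other character unchanged.
    static String fromWestern(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            if (c >= '0' && c <= '9') {
                out.append((char) (DEVANAGARI_ZERO + (c - '0')));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For example, `DevanagariDigits.fromWestern("2024")` returns "२०२४", while 123 mode would keep emitting the plain "2024" for compatibility.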
It will be easier, but there is a problem with this approach. The keyboard will have to guess whether the user wants a half/joined character or a full character. "नदी" is also 634, and so is "मेह" (MEH). The number of possible combinations will grow with longer words, where every consonant can be full or half/joined. Also, with Sanskrit in the mix, no dictionary will be enough.
This should be fine for Hindi. Most people use Western numbers only.
Coming back to this again. First, did you mean to put virama on the 0-key or the 1-key? In the Second. Maybe I didn't explain my idea well enough. My point is the dictionary will contain only Devanagari letters as "words", not entire real words. This means typing "Hindi" in semi-Predictive mode will require:
There will be no confusion between "नदी" and "मेह", because there is no letter "meh". Instead you would type it this way:
(Btw, isn't this "meha"? If it is supposed to be "meh" with virama at the end, then on step 3, one would have to type 41, instead of only 4. This is just a technical detail, not so important.) The advantage is you can type just any word in both Hindi and Sanskrit.

In comparison, in normal Predictive mode, where suggestions are entire words (probably, what you were thinking of), typing will be more straightforward. "Hindi" would be simply "446134" + OK or space. And "meh" would be "634" + OK or space. I guess "ndi" means nothing on its own, so it will not be present in the dictionary, hence there will be no confusion again.

However, the big problem is finding a dictionary that contains enough words for regular everyday typing. The one you proposed is likely not enough. As a comparison, English, a language with no inflections, word genders, or many verb tenses, has about 170k words. I can quickly write a webpage crawler to extract the words from that website, but I really doubt 53k words cover both Hindi and Sanskrit.

In "ABC" mode, both alternatives will work the same:
So, maybe, we can include a sort of "ABC" mode and sort of "Predictive" mode as I described it? Does that make sense?
Regarding " ् ": it is not a "virama" or a punctuation mark. It is called हल or हलंत, see https://www.hindi.co/naagaree/halant.html

Yes, this semi-predictive mode with only letters sounds fine. You are correct, I thought it would be the normal predictive mode with words. So having an ABC mode and a predictive mode makes perfect sense for a start. As for the dictionary, I will try to find one.
I guess I don't get what it is called, but we are talking about the same thing. From the website you posted:
Anyway, I'm a bit against putting letters or letter-like characters on the 0-key. In the T9 layout, the 0-key is the Spacebar. It is similar to computer keyboards, where the Spacebar is in the middle. For consistency with all other languages, I would like to preserve the same experience in Indic ones too. In my opinion, the space is the most important character and it should be the most easily accessible one.

My second concern is, in TT9, the 0-key is really meant for special and mathematical characters, while the 1-key is for the characters you would normally use to type text. For this reason, I prefer putting

As for "ज्ञ", it is just a regular letter, isn't it? I think it makes sense to put it on one of the "phonetic" keys. I'll give another foreign language example. In Spanish, they have

And my final concern is, in Predictive mode, the space key cannot contain letters. Pressing it always terminates the current word and resets the typing status. It will be very, very difficult for me to make TT9 guess correctly whether you want to type a word with a letter on 0, or you want to end the current word.

This is my final take on the discussion. Once we clear the layout, we should be all good and I should be able to implement Hindi/Sanskrit.
I understand, and agree with you on not putting characters on the 0-key.
Excellent! Here is a table of all conjunct letters. It may be useful when trying to make it work. Hopefully, just pushing the correct characters to any text field will be enough to form conjunct letters, just like Android magically converts Arabic letters to the correct forms. We'll see.
Hi,
I am interested in working on these two Indic languages (Hindi and Gujarati). I have created a phonetic keyboard for Gujarati, which is now available in xkeyboard. I have a few ideas regarding this.
Any suggestions or ideas?