-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add dart implementation #210
feat: add dart implementation #210
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great start! I have a few questions/thoughts.
In addition, I would highly recommend you to refactor the tests to get data from the json files in the <root>/test
folder. You check the readme there or see how the python/javascript test files did it. This will make it much easier to maintain test data across all the languages. Also it will make the logic of testing simpler (create wrapper functions, pull data, and loop through tests).
int _gurmukhiDiacriticComparator(int a, int b) { | ||
return _getGurmukhiDiacriticOrder(a) - _getGurmukhiDiacriticOrder(b); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering the other libs use an array elements position to do the same thing, is there a reason you opted for a verbose switch here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand you correctly, but python library uses sorting function to sort the diacritics:
https://github.com/ShabadOS/gurmukhi-utils/blob/e7d7527bcfb2a98b160a7957079a7c67694d7221/python/gurmukhiutils/unicode.py#L345C1-L354C22
In place if sorted(key=...)
, I have used sort(_gurmukhiDiacriticComparator)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically instead of using a switch (which is using up many lines of code), we can just use indexOf in Dart. So you could have an array called gurmukhiDiacriticOrder for example and check indexOf the two characters to get the sorting order.
final gurmukhiDiacriticOrder = <String>['਼', 'ੑ', 'ੵ', '੍', 'ਹ', 'ਰ', 'ਵ', 'ਟ', 'ਤ', 'ਨ', 'ਚ', 'ਿ', 'ੇ', 'ੈ', 'ੋ', 'ੌ', 'ੁ', 'ੂ', 'ਾ', 'ੀ', 'ਁ', 'ੱ', 'ਂ', 'ੰ', 'ਃ', ];
Effectively you're doing the same thing with the switch (iterating case by case until you find a match). This would do the same thing but with a single line of code.
The only thing you have to check is if it messes up the _isGurmukhiDiacritic
function used in splitGurmukhi
. As far as I can reason, it should not be a problem.
This let's you also control which characters can be used after a virama.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to avoid String.indexOf() method, it will be relatively more expensive with worst case complexity O(n), what if we the user wants to convert a very long String?
I can use HashMap instead, it will reduce the number of lines, look simpler, and with O(1) lookup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, the performance improvement is worth the extra lines imo.
Last question: How are you going to handle consonants that are not supposed to be subscript letters? For example any letter that's not ਹ, ਰ, ਵ, ਟ, ਤ, ਨ, or ਚ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, this is what I think we should do (since we have a few things going on now), please let me know your thoughts:
- Allow any VIRAMA + CONSONANT sequence (this includes things like ੍ + [ੲ, ਅ, ੳ] which is quite odd!)
- Include a linter/checker that warns odd/unusual sequences (like the virama + vowel above) and errors for malformed sequences (e.g. Halant/Udaat + Virama on the same character). This can be done in a separate PR. For the linter, I suggest the following:
- normal (no warning/error): ਹ, ਰ, ਵ, ਯ
- warn (unsupported by Android, we can term them "archaic/historical"): ਟ, ਤ, ਨ, ਚ, ਠ, ਗ, ਦ, ਚ, ਮ, ਥ, ਖ
- error: any other character
- Normalize Gurmukhi function can take a flag to normalize ਡ + ਼ = ਡ਼ (would like a specific name for the flag of sorts if possible). This also can be done in a new PR if desired. (Edit: it might also be nice if a linter warned that this may be intended to be a ਡ਼ when it comes across the ਡ + ਼ sequence)
Just providing some insight here: The reason I would like to know about any potential incorrect sequences is for example when we generate an API that could be consumed by a hukamnama bot or such (used in places like Discord, WhatsApp, Telegram, etc.). We already know what to do with yayya, I'm really not sure what we do about the unsupported pairin-akhar, but at the least we might be able to provide a note to end users that the text can't be accurately displayed and to use the website/app to see it properly (using Sant Lipi).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some points:
- How do you want handle invalid input from user? I have seen in some tests ascii+unicode mixed string is provided, which is in invalid input. What if the user inputs an Ascii letter not having any mapping? What would you like to with a string of random gurmukhi letters?
- If you check the behaviour of the text render engine, for following two letters: ਕ੍ਰ ਕ੍ੳ, if you try to move cursor left/right on them, it behaves as they are a single letter.
- For linter/checker: It definitely needs a different PR and proper discussion. For virama part, we can give the user an optional parameter so that list of valid subscript letters can be set. It can be similar as in English, red underlines for spelling mistakes (invalid sequence of chars), and green underlines for grammar mistakes (valid sequence but meaningless).
- Normalization flag can be named such as - nukta, nuktaNormalization, etc. I have checked on Google fonts, I believe ਡ + ਼ = ਡ਼ is more of font specific, rather than platform specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Running toUnicode should convert any ascii that is meant to be converted to gurmukhi, and then normalize all the gurmukhi as best as it can. It's up to the user to provide the correct string. Similarly, going to ascii should only convert unicode characters and ignore the existing ascii, and then normalize all ascii sequences as best as it can (very important for the user to know how the functions work). In either function, any characters that are not meant to be changed should be left untouched. (Edit: but of course, normalization can mess things up especially in the toAscii function, that's just up to the user to understand, and we can also let them do toAscii without normalization by running individual functions or adding flags (Edit Edit: it should also be noted that we really need to focus more on moving people towards unicode, I dont even know what the toascii function is for tbh, it is something we can expose in an API for people making apps that don't use unicode, but it's not going to be something we use in our first-party apps).
- That is interesting! I did not know that
- I like the idea of adding some linter configuration to control which errors/warnings to show/hide
- Yes, this is a font choice. It happens on android keyboards as well as iOS keyboards (because it's built into Noto Gurmukhi as well as Mukta Mahee). The flag
nuktaNormalization
makes me think it will fix normal nukta letters like ਸ + ਼ = ਸ਼, it doesn't make me think it's going to take a non-normal nukta character and change it into a ੜ. I think we need a better phrase/term.
Just sharing an image showing the font features for Noto Gurmukhi:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The virama part, which is in sort diacritics, will only happen if user inputs a string with a virama between invalid pair of characters (I wouldn't call it a mess up from our side). As for current case, for asciiToGurmukhi()
, this part is not a problem (ASCII fonts don't have virama mapping).
What would you actually normalize ਡ + ਼
into? I checked ਡ਼
isn't any Unicode character. We can try to normalize ਡ + ਼
=> ਡ + ZWNJ + ਼
, but that again works only for some fonts (Works on Noto Sans but not on Noto Serif).
I think normalization and linter should be discussed in a separate issue, they need more research and input from others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to check ਡ + ਼ in the linter only. The Sant Lipi font only shows it as ਡ਼ because of the following rule:
As you can see, we can even have characters like ਫ਼਼ (faffa with nukta, or I guess phappha with two nuktas). So it's kind of up to us how we want to render the code (just following other font conventions for now)
I have refactored the tests to get data from the Json files. I have 2 assertions failing (for functions unicode3, and unisant2):
What I understand is you want to sort diacritics with virama for ਹ, ਰ, ਵ, ਟ, ਤ, ਨ, and ਚ but with the exception of ਯ? The expected strings As Unicode Standard doesn't have ਯ variants as of now, I think we should not try to work around in these cases. We should handle them with Sant-Lipi modifiers. |
Readme: Add reference to SantLipi
I would love to show as accurately as possible. This would require some testing on different platforms / font rendering engines. Very curious what this shows: ਭ੍ਯਿੋ As for the other example (where the preceding consonant has a right-side vowel/modifier) of ਕੀ੍ਯੋ, I don't think that will work in any platform/font rendering engine. I'd be very impressed if it did work. We had to create Sant Lipi almost purely for this real issue. Related shabados/mobile#194 |
੍ਯ is not present in SGGS, and SDG has variants of ਯ. So, if the user wants to handle ਯ variants, he has to use Sant-Lipi. I strongly believe that unicode, santlipi, and unisant test functions should not be expected to give same output in the case of ਯ. It will also let the user to choose the conversion type he wants. For example:
|
Sorry for the delay. Some thoughts on next actions / clarity:
Some actions for you please:
This will help us simplify the scope of this PR |
Add Addha Yayya Exception
I also think I have also added So, all the tests are passing finally! Please update |
Awesome! We are pretty close to finishing up this PR (I still have to do a final review) Please do the following:
I will open an issue to investigate ੍ਯ patterns. |
Update WIP in global README
Thank you very much for that! Would you agree that this PR is ready for a final review? Is there anything else you'd like to do? |
Thanks! I would add more features in later PRs. This PR is more than enough for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests might be weakly implemented for the future, but it's more than enough for today. We've already reviewed the rest of the PR and I've enjoyed discussing the finer points of gurmukhi-utils with you. Everything looks great! Congratulations on your first PR! Thank you for all the hard work, and looking forward to future contributions :)
Summary
Init Dart Package.
Add Features:
Test
PASS