Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add dart implementation #210

Merged
merged 21 commits into from
Jul 29, 2023
Merged

feat: add dart implementation #210

merged 21 commits into from
Jul 29, 2023

Conversation

akshdeep-singh
Copy link
Collaborator

Summary

Init Dart Package.

Add Features:

  • Conversion: Ascii to Gurmukhi
  • Gurmukhi Utils: Remove vishraams, First Letters, Split, Normalization

Test

PASS

Copy link
Member

@bhajneet bhajneet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great start! I have a few questions/thoughts.

In addition, I would highly recommend you to refactor the tests to get data from the json files in the <root>/test folder. You check the readme there or see how the python/javascript test files did it. This will make it much easier to maintain test data across all the languages. Also it will make the logic of testing simpler (create wrapper functions, pull data, and loop through tests).

dart/LICENSE Outdated Show resolved Hide resolved
dart/CHANGELOG.md Outdated Show resolved Hide resolved
dart/README.md Outdated Show resolved Hide resolved
dart/lib/src/gurmukhi.dart Outdated Show resolved Hide resolved
Comment on lines +148 to +150
int _gurmukhiDiacriticComparator(int a, int b) {
return _getGurmukhiDiacriticOrder(a) - _getGurmukhiDiacriticOrder(b);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering the other libs use an array elements position to do the same thing, is there a reason you opted for a verbose switch here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand you correctly, but python library uses sorting function to sort the diacritics:
https://github.com/ShabadOS/gurmukhi-utils/blob/e7d7527bcfb2a98b160a7957079a7c67694d7221/python/gurmukhiutils/unicode.py#L345C1-L354C22

In place if sorted(key=...), I have used sort(_gurmukhiDiacriticComparator).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically instead of using a switch (which is using up many lines of code), we can just use indexOf in Dart. So you could have an array called gurmukhiDiacriticOrder for example and check indexOf the two characters to get the sorting order.

final gurmukhiDiacriticOrder = <String>['਼', 'ੑ', 'ੵ', '੍', 'ਹ', 'ਰ', 'ਵ', 'ਟ', 'ਤ', 'ਨ', 'ਚ', 'ਿ', 'ੇ', 'ੈ', 'ੋ', 'ੌ', 'ੁ', 'ੂ', 'ਾ', 'ੀ', 'ਁ', 'ੱ', 'ਂ', 'ੰ', 'ਃ', ];

Effectively you're doing the same thing with the switch (iterating case by case until you find a match). This would do the same thing but with a single line of code.

The only thing you have to check is if it messes up the _isGurmukhiDiacritic function used in splitGurmukhi. As far as I can reason, it should not be a problem.

This let's you also control which characters can be used after a virama.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to avoid String.indexOf() method, it will be relatively more expensive with worst case complexity O(n), what if we the user wants to convert a very long String?

I can use HashMap instead, it will reduce the number of lines, look simpler, and with O(1) lookup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, the performance improvement is worth the extra lines imo.

Last question: How are you going to handle consonants that are not supposed to be subscript letters? For example any letter that's not ਹ, ਰ, ਵ, ਟ, ਤ, ਨ, or ਚ?

Copy link
Member

@bhajneet bhajneet Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this is what I think we should do (since we have a few things going on now), please let me know your thoughts:

  • Allow any VIRAMA + CONSONANT sequence (this includes things like ੍ + [ੲ, ਅ, ੳ] which is quite odd!)
  • Include a linter/checker that warns odd/unusual sequences (like the virama + vowel above) and errors for malformed sequences (e.g. Halant/Udaat + Virama on the same character). This can be done in a separate PR. For the linter, I suggest the following:
    • normal (no warning/error): ਹ, ਰ, ਵ, ਯ
    • warn (unsupported by Android, we can term them "archaic/historical"): ਟ, ਤ, ਨ, ਚ, ਠ, ਗ, ਦ, ਚ, ਮ, ਥ, ਖ
    • error: any other character
  • Normalize Gurmukhi function can take a flag to normalize ਡ + ਼ = ਡ਼ (would like a specific name for the flag of sorts if possible). This also can be done in a new PR if desired. (Edit: it might also be nice if a linter warned that this may be intended to be a ਡ਼ when it comes across the ਡ + ਼ sequence)

Just providing some insight here: The reason I would like to know about any potential incorrect sequences is for example when we generate an API that could be consumed by a hukamnama bot or such (used in places like Discord, WhatsApp, Telegram, etc.). We already know what to do with yayya, I'm really not sure what we do about the unsupported pairin-akhar, but at the least we might be able to provide a note to end users that the text can't be accurately displayed and to use the website/app to see it properly (using Sant Lipi).

Copy link
Collaborator Author

@akshdeep-singh akshdeep-singh Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some points:

  1. How do you want handle invalid input from user? I have seen in some tests ascii+unicode mixed string is provided, which is in invalid input. What if the user inputs an Ascii letter not having any mapping? What would you like to with a string of random gurmukhi letters?
  2. If you check the behaviour of the text render engine, for following two letters: ਕ੍ਰ ਕ੍ੳ, if you try to move cursor left/right on them, it behaves as they are a single letter.
  3. For linter/checker: It definitely needs a different PR and proper discussion. For virama part, we can give the user an optional parameter so that list of valid subscript letters can be set. It can be similar as in English, red underlines for spelling mistakes (invalid sequence of chars), and green underlines for grammar mistakes (valid sequence but meaningless).
  4. Normalization flag can be named such as - nukta, nuktaNormalization, etc. I have checked on Google fonts, I believe ਡ + ਼ = ਡ਼ is more of font specific, rather than platform specific.

Copy link
Member

@bhajneet bhajneet Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Running toUnicode should convert any ascii that is meant to be converted to gurmukhi, and then normalize all the gurmukhi as best as it can. It's up to the user to provide the correct string. Similarly, going to ascii should only convert unicode characters and ignore the existing ascii, and then normalize all ascii sequences as best as it can (very important for the user to know how the functions work). In either function, any characters that are not meant to be changed should be left untouched. (Edit: but of course, normalization can mess things up especially in the toAscii function, that's just up to the user to understand, and we can also let them do toAscii without normalization by running individual functions or adding flags (Edit Edit: it should also be noted that we really need to focus more on moving people towards unicode, I dont even know what the toascii function is for tbh, it is something we can expose in an API for people making apps that don't use unicode, but it's not going to be something we use in our first-party apps).
  2. That is interesting! I did not know that
  3. I like the idea of adding some linter configuration to control which errors/warnings to show/hide
  4. Yes, this is a font choice. It happens on android keyboards as well as iOS keyboards (because it's built into Noto Gurmukhi as well as Mukta Mahee). The flag nuktaNormalization makes me think it will fix normal nukta letters like ਸ + ਼ = ਸ਼, it doesn't make me think it's going to take a non-normal nukta character and change it into a ੜ. I think we need a better phrase/term.

Just sharing an image showing the font features for Noto Gurmukhi:

image

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The virama part, which is in sort diacritics, will only happen if user inputs a string with a virama between invalid pair of characters (I wouldn't call it a mess up from our side). As for current case, for asciiToGurmukhi(), this part is not a problem (ASCII fonts don't have virama mapping).

What would you actually normalize ਡ + ਼ into? I checked ਡ਼ isn't any Unicode character. We can try to normalize ਡ + ਼ => ਡ + ZWNJ + ਼, but that again works only for some fonts (Works on Noto Sans but not on Noto Serif).

I think normalization and linter should be discussed in a separate issue, they need more research and input from others.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to check ਡ + ਼ in the linter only. The Sant Lipi font only shows it as ਡ਼ because of the following rule:

image

As you can see, we can even have characters like ਫ਼਼ (faffa with nukta, or I guess phappha with two nuktas). So it's kind of up to us how we want to render the code (just following other font conventions for now)

image

dart/pubspec.yaml Outdated Show resolved Hide resolved
dart/pubspec.yaml Outdated Show resolved Hide resolved
@bhajneet bhajneet changed the title [dart] feat: ascii to gurmukhi conversion, and gurmukhi utils feat: add dart implementation Jul 20, 2023
@akshdeep-singh
Copy link
Collaborator Author

I have refactored the tests to get data from the Json files.

I have 2 assertions failing (for functions unicode3, and unisant2):

"iBÎo": "ਭਿ੍ਯੋ"
"kIÎo": "ਕੀ੍ਯੋ"

What I understand is you want to sort diacritics with virama for ਹ, ਰ, ਵ, ਟ, ਤ, ਨ, and ਚ but with the exception of ਯ?

The expected strings ਭਿ੍ਯੋ and ਕੀ੍ਯੋ are themselves invalid Unicode sequence.
It's better if 1st is converted into ਭ੍ਯਿੋ, for the 2nd I don't know.

As Unicode Standard doesn't have ਯ variants as of now, I think we should not try to work around in these cases. We should handle them with Sant-Lipi modifiers.

@akshdeep-singh akshdeep-singh added the .dart Dart label Jul 22, 2023
@bhajneet
Copy link
Member

I would love to show as accurately as possible. This would require some testing on different platforms / font rendering engines. Very curious what this shows:

ਭ੍ਯਿੋ

As for the other example (where the preceding consonant has a right-side vowel/modifier) of ਕੀ੍ਯੋ, I don't think that will work in any platform/font rendering engine. I'd be very impressed if it did work. We had to create Sant Lipi almost purely for this real issue.

Related shabados/mobile#194

@akshdeep-singh
Copy link
Collaborator Author

akshdeep-singh commented Jul 23, 2023

੍ਯ is not present in SGGS, and SDG has variants of ਯ. So, if the user wants to handle ਯ variants, he has to use Sant-Lipi.

I strongly believe that unicode, santlipi, and unisant test functions should not be expected to give same output in the case of ਯ. It will also let the user to choose the conversion type he wants.

For example:

ascii unicode santlipi unisant
iBÎo ਭ੍ਯਿੋ ਭਿ꠳ਯੋ ਭਿ੍ਯੋ
kIÎo ਕ੍ਯੋੀ ਕੀ꠳ਯੋ ਕੀ੍ਯੋ

image looks correct on my devices (Windows, Android).

@bhajneet
Copy link
Member

Sorry for the delay. Some thoughts on next actions / clarity:

  • Unicode and unisant functions should return exactly the same output. The whole point of unisant function is to confirm sant lipi markup can be properly converted to unicode
  • We have to decide on what our unicode standard is basically. This depends on how well all platforms/engines can render the words. For example ਭ੍ਯਿੋ seems to be working, and we can create a rule that says if there is only left-hand vowels on the preceding consonant and none on the half-yayya following then to render it properly in unicode. This is up to us as a team, our parameter is simply to ensure it can be displayed accurately (or with good information) to the end user.

Some actions for you please:

  • Please create an issue (task) to implement a linter/checker and add some of the info from this PR into there for tracking purposes (or at the least reference this PR link to add reference)
  • Same thing for normalization / ਡ + ਼ = ੜ and such

This will help us simplify the scope of this PR

@akshdeep-singh
Copy link
Collaborator Author

I also think ੍ਯ should be considered as exception for sorting diacritics in case of Virama.

I have also added ਿ as a further exception in this case as it looks better.
You can see my change for reference.

So, all the tests are passing finally!
Except ਭਿ੍ਯੋ, my result is ਭ੍ਯਿੋ.

Please update ਭ੍ਯਿੋ in the tests.

@bhajneet
Copy link
Member

Awesome! We are pretty close to finishing up this PR (I still have to do a final review)

Please do the following:

  • Update global test for ਭ੍ਯਿੋ
  • Update global readme with Dart support (there is a progress chart under ## WIP)

I will open an issue to investigate ੍ਯ patterns.

Update WIP in global README
@bhajneet
Copy link
Member

Thank you very much for that! Would you agree that this PR is ready for a final review? Is there anything else you'd like to do?

@akshdeep-singh
Copy link
Collaborator Author

Thanks! I would add more features in later PRs.

This PR is more than enough for now.

Copy link
Member

@bhajneet bhajneet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests might be weakly implemented for the future, but it's more than enough for today. We've already reviewed the rest of the PR and I've enjoyed discussing the finer points of gurmukhi-utils with you. Everything looks great! Congratulations on your first PR! Thank you for all the hard work, and looking forward to future contributions :)

@bhajneet bhajneet merged commit 31a3544 into shabados:main Jul 29, 2023
2 checks passed
@akshdeep-singh akshdeep-singh deleted the dart_ascii_to_gurmukhi branch July 29, 2023 12:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
.dart Dart
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants