Add normalizer type C to text cleaners #85

Merged
merged 4 commits into from
Oct 2, 2024

@shavit shavit commented Sep 28, 2024

There is some duplication among the cleaners. Should the normalizer be added inside each of the other cleaners, or be applied to all text?
https://github.com/idiap/coqui-ai-TTS/blob/dev/TTS/tts/utils/text/tokenizer.py#L110

/Closes #63

Member

@eginhard eginhard left a comment

Thank you for the PR and the tests, it looks good! I suggest renaming the function to make its name more intuitive. I'd call it at the start of every cleaner except no_cleaners().
You can run make clean && make lint to make sure your code passes the style check.
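
A minimal sketch of the pattern suggested above, assuming the normalizer is a thin wrapper around the standard library's `unicodedata.normalize` with the NFC form named in the PR title. The two cleaners shown (`basic_cleaners_sketch`, `no_cleaners_sketch`) are hypothetical stand-ins for illustration, not the project's actual cleaner implementations:

```python
import re
import unicodedata


def normalize_unicode(text: str) -> str:
    """Normalize Unicode characters to NFC (assumed form, per the PR title)."""
    return unicodedata.normalize("NFC", text)


def basic_cleaners_sketch(text: str) -> str:
    """Hypothetical cleaner: normalize first, then lowercase and collapse whitespace."""
    text = normalize_unicode(text)  # called at the very start, as suggested
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


def no_cleaners_sketch(text: str) -> str:
    """Hypothetical pass-through: deliberately leaves the text untouched."""
    return text
```

Normalizing before any other step means every later transformation sees a single canonical representation of each accented character.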

Comment on lines 192 to 193
def normalize_nfc(text: str) -> str:
    """Canonical decomposition followed by canonical composition"""
Suggested change
- def normalize_nfc(text: str) -> str:
-     """Canonical decomposition followed by canonical composition"""
+ def normalize_unicode(text: str) -> str:
+     """Normalize Unicode characters."""
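
For context on what the original docstring describes, here is a small illustration of NFC using only Python's standard `unicodedata` module. The variable names are illustrative and not taken from the PR; the point is that two visually identical strings can differ at the code point level until they are normalized, which presumably matters for any tokenizer that maps individual characters to IDs:

```python
import unicodedata

# Two ways to write "é": they render the same but are different strings.
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT (2 code points)
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE (1 code point)

differs_before = decomposed != precomposed

# NFC: canonical decomposition followed by canonical composition.
composed = unicodedata.normalize("NFC", decomposed)
equal_after = composed == precomposed
```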

@shavit shavit marked this pull request as ready for review September 30, 2024 15:23
@shavit shavit requested a review from eginhard September 30, 2024 15:24
Member

@eginhard eginhard left a comment

Thanks a lot, this is a very useful contribution!

@eginhard eginhard merged commit 36611a7 into idiap:dev Oct 2, 2024
49 checks passed
eginhard pushed a commit that referenced this pull request Oct 4, 2024
* Add normalizer type C to text cleaners

* Linter recommendations

* Add unicode normalize to every cleaner

* Format test_text_cleaners.py
Development

Successfully merging this pull request may close these issues.

[Feature request] Text for synthesis needs to be normalized for languages with diacritics