Character Repetition Correction #268

Open
onatyap opened this issue Apr 28, 2021 · 3 comments · May be fixed by #277
Labels: enhancement (New feature or request), question (Further information is requested)

Comments

@onatyap
Contributor

onatyap commented Apr 28, 2021

The Hotel Review Dataset contains reviews in which characters are repeated to emphasize a word, for example:
"Süppeeeer", "Berbaaat", "Muhteşeemmmm"

Since the tokenizer cannot tokenize these words correctly, I wrote a regular expression that detects repeated characters and corrects the words by removing the duplicates. This preprocessing step increased the 3-fold cross-validation score on the Hotel Review Dataset by 2%.

The outputs for the examples above are as follows:
"Süper", "Berbat", "Muhteşem"
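
The regular expression itself is not shown in this issue. As a minimal sketch of the idea, assuming the simplest possible rule (collapse every run of a repeated character to a single character), something like the following reproduces the three outputs above. The function name `collapse_repetitions` and the pattern are illustrative assumptions, not the actual implementation and not part of the sadedegel API.

```python
import re

# Assumed pattern (not taken from the issue): any character followed by
# one or more copies of itself.
_REPEAT = re.compile(r"(.)\1+")


def collapse_repetitions(text: str) -> str:
    """Collapse every run of an identical character to one occurrence."""
    return _REPEAT.sub(r"\1", text)


for word in ["Süppeeeer", "Berbaaat", "Muhteşeemmmm"]:
    print(word, "->", collapse_repetitions(word))
```

A rule this blunt also shortens legitimately doubled letters (for instance, it turns "saat" into "sat"), so it would need a vocabulary check or a stricter pattern before being used as a general normalizer.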

Since such repetitions are common, this method could be useful as a preprocessing step in sadedegel. What is your opinion on this?

onatyap added the enhancement and question labels on Apr 28, 2021
@husnusensoy
Contributor

Note that sadedegel is a library, so we should be picky about adding "features". The problem you solved is part of a more general problem called normalization. Here is the roadmap for adding such a feature:

  1. Generate your test data (sentences). Define your experiment well and make sure you include counterexamples: saat, menfaat, and faal are all valid words, so no correction should be performed on them.
  2. Apply your normalization technique and report:
  • Performance
  • Accuracy (false positives and false negatives, of course)
  3. Prove that your technique improves several tasks when enabled.
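
The first two roadmap steps could be sketched roughly as below. This is only an illustration under stated assumptions: `collapse_all` is a hypothetical stand-in for the normalizer under test, `evaluate` is a hypothetical helper, and the test pairs are limited to the words quoted in this thread. None of these names belong to sadedegel.

```python
import re
import time
from typing import Callable, Dict, Iterable, Tuple


def collapse_all(word: str) -> str:
    """Hypothetical normalizer: collapse every repeated-character run."""
    return re.sub(r"(.)\1+", r"\1", word)


def evaluate(normalize: Callable[[str], str],
             cases: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Count false positives/negatives and time the normalizer."""
    correct = false_pos = false_neg = 0
    start = time.perf_counter()
    for raw, expected in cases:
        out = normalize(raw)
        if out == expected:
            correct += 1
        elif raw == expected:
            false_pos += 1  # a valid word was altered
        else:
            false_neg += 1  # a noisy word was not restored correctly
    elapsed = time.perf_counter() - start
    total = correct + false_pos + false_neg
    return {
        "accuracy": correct / total if total else 0.0,
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "seconds": elapsed,
    }


# (input, expected output); the last three are counterexamples that
# must stay unchanged.
CASES = [
    ("Süppeeeer", "Süper"),
    ("Berbaaat", "Berbat"),
    ("Muhteşeemmmm", "Muhteşem"),
    ("saat", "saat"),
    ("menfaat", "menfaat"),
    ("faal", "faal"),
]

report = evaluate(collapse_all, CASES)
print(report)
```

With this naive normalizer, all three counterexamples are reported as false positives, which is the kind of failure the counterexamples in step 1 are meant to expose. A realistic test set would need far more than six words.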

@askarbozcan
Member

See #190

onatyap self-assigned this on May 20, 2021
@ertugrul-dmr
Contributor

I've tested the repetition correction for you @onatyap as you asked. Here are the results:

| Prebuilt Model | Original Result | Preprocessed Result |
| --- | --- | --- |
| Tweet Sentiment Classification | 3-Fold F-1: 0.8640, 5-Fold F-1: 0.8669 | 3-Fold F-1: 0.8587, 5-Fold F-1: 0.8640 |
| Movie Review Sentiment Classification | F-1: 0.8258 | F-1: 0.8242 |
| Telco Tweet Sentiment Classification | F-1: 0.6871, Accuracy: 0.6925 | F-1: 0.696, Accuracy: 0.691 |
| Turkish Customer Reviews Classification | F-1: 0.851 | F-1: 0.852 |


4 participants