Character Repetition Correction #268

onatyap · 2021-04-28T15:04:30Z

The Hotel Review Dataset contains reviews in which characters are repeated to emphasize a word, as in:
"Süppeeeer", "Berbaaat", "Muhteşeemmmm"

Since the tokenizer cannot tokenize these words correctly, I wrote a regular expression that detects repeated characters and corrects the words by collapsing the repetitions. This preprocessing step increased the 3-fold cross-validation score on the Hotel Review Dataset by 2%.

The outputs for the examples above are as follows:
"Süper", "Berbat", "Muhteşem"

Considering that repetitions are common, this method can be useful as a preprocessing step in sadedegel. What is your opinion on this?
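
The exact regular expression is not shown in this issue. A minimal sketch of the idea, assuming the simplest possible rule (collapse every run of the same letter to a single letter), could look like the following. The function name `collapse_repeats` and the pattern are illustrative assumptions, not sadedegel API.

```python
import re

# Assumed pattern (not the one from the issue): a single letter
# (no digits, no underscore) followed by one or more copies of itself.
_REPEATED_LETTER = re.compile(r"([^\W\d_])\1+")


def collapse_repeats(text: str) -> str:
    """Collapse every run of an identical letter to one occurrence."""
    return _REPEATED_LETTER.sub(r"\1", text)


for word in ["Süppeeeer", "Berbaaat", "Muhteşeemmmm"]:
    print(word, "->", collapse_repeats(word))
```

This rule reproduces the three outputs above, but it is deliberately naive: any valid word that contains a doubled letter would be shortened as well.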

husnusensoy · 2021-04-28T17:45:29Z

Note that sadedegel is a library, so we should be picky about adding "features". The problem you solved is part of a more general problem called normalization. Here is the roadmap for adding such a feature:

1. Generate your test data (sentences). Define your experiment well and make sure you include counterexamples: saat, menfaat, and faal are all valid words, so no correction should be applied to them.
2. Apply your normalization technique and report:
   - Performance
   - Accuracy (false positives and false negatives, of course)
3. Prove that your technique improves several tasks when enabled.
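
The accuracy step of this roadmap can be sketched as a small harness that counts false positives (valid words that get altered) and false negatives (noisy words left untouched). Everything here is an assumption for illustration: `collapse_repeats` is a naive stand-in normalizer, `evaluate` is a hypothetical helper, and the test cases reuse only the words mentioned in this thread.

```python
import re
from typing import Callable, Dict, Iterable, Tuple


def collapse_repeats(text: str) -> str:
    """Naive stand-in normalizer: collapse runs of the same letter."""
    return re.sub(r"([^\W\d_])\1+", r"\1", text)


def evaluate(normalize: Callable[[str], str],
             cases: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count outcomes for (raw, expected) pairs."""
    counts = {"correct": 0, "false_positive": 0,
              "false_negative": 0, "wrong_fix": 0}
    for raw, expected in cases:
        got = normalize(raw)
        if got == expected:
            counts["correct"] += 1
        elif raw == expected:
            counts["false_positive"] += 1  # a valid word was altered
        elif got == raw:
            counts["false_negative"] += 1  # a noisy word was left as-is
        else:
            counts["wrong_fix"] += 1       # altered, but not to the target
    return counts


CASES = [
    ("Süppeeeer", "Süper"),
    ("Berbaaat", "Berbat"),
    ("Muhteşeemmmm", "Muhteşem"),
    # Counterexamples: valid words that must stay unchanged.
    ("saat", "saat"),
    ("menfaat", "menfaat"),
    ("faal", "faal"),
]

print(evaluate(collapse_repeats, CASES))
```

With the naive rule, the three noisy words are fixed but all three counterexamples are altered, which is the kind of failure the counterexamples are meant to expose.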

askarbozcan · 2021-05-06T12:26:51Z

See #190

ertugrul-dmr · 2021-05-31T12:56:25Z

I've tested the repetition correction for you @onatyap as you asked. Here are the results:

| Prebuilt Model | Original Result | Preprocessed Result |
| --- | --- | --- |
| Tweet Sentiment Classification | 3-Fold F-1: 0.8640, 5-Fold F-1: 0.8669 | 3-Fold F-1: 0.8587, 5-Fold F-1: 0.8640 |
| Movie Review Sentiment Classification | F-1: 0.8258 | F-1: 0.8242 |
| Telco Tweet Sentiment Classification | F-1: 0.6871, Accuracy: 0.6925 | F-1: 0.696, Accuracy: 0.691 |
| Turkish Customer Reviews Classification | F-1: 0.851 | F-1: 0.852 |

onatyap added the enhancement (New feature or request) and question (Further information is requested) labels Apr 28, 2021

onatyap self-assigned this May 20, 2021

ertugrul-dmr linked a pull request Jun 4, 2021 that will close this issue

Implement CharNGram Based HashVectorizer #265

Open

ertugrul-dmr removed a link to a pull request Jun 4, 2021

Implement CharNGram Based HashVectorizer #265

Open

ertugrul-dmr linked a pull request Jun 4, 2021 that will close this issue

Implement Character Repetition Correction [resolves #268] #277

Open

onatyap linked a pull request Jun 10, 2021 that will close this issue

Implement Character Repetition Correction [resolves #268] #277

Open
