The Hotel Review Dataset contains reviews in which characters are repeated to emphasize a word, as in: "Süppeeeer", "Berbaaat", "Muhteşeemmmm".
Since the tokenizer cannot tokenize these words correctly, I wrote a regular expression that detects such repetitions and corrects them by removing the repeated characters. This preprocessing step increased the 3-fold cross-validation score on the Hotel Review Dataset by 2%.
The outputs for the examples above are: "Süper", "Berbat", "Muhteşem".
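The regular expression itself is not included in this issue. A minimal sketch of one way to get the outputs above, assuming every run of identical characters is collapsed to a single character (the name `collapse_repeats` is hypothetical and not part of sadedegel):

```python
import re

# Assumption: any run of two or more identical word characters is collapsed to one.
# The backreference \1 matches the same character captured by (\w).
_REPEAT = re.compile(r"(\w)\1+")


def collapse_repeats(text: str) -> str:
    """Collapse each run of identical characters to a single character."""
    return _REPEAT.sub(r"\1", text)


for word in ["Süppeeeer", "Berbaaat", "Muhteşeemmmm"]:
    print(word, "->", collapse_repeats(word))
```

This naive rule also shortens legitimate double letters (for example, "saat" becomes "sat"), so a real implementation would probably need a vocabulary check or a similar guard.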
Considering that such repetitions are common, this method could be useful as a preprocessing step in sadedegel. What is your opinion on this?
Note that sadedegel is a library, and we should be picky about adding "features". The problem you solved is part of a more general problem called normalization. Here is the roadmap for adding such a feature:
1. Generate your test data (sentences). Define your experiment well and make sure to include counter examples: saat, menfaat, and faal are all valid words, so no correction should be applied to them.
2. Apply your normalization technique and report its
   - Performance
   - Accuracy (false positives and false negatives, of course)
3. Prove that your technique improves several tasks when enabled.
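As a rough illustration of the accuracy part of this roadmap, the sketch below counts false positives and false negatives for a candidate normalizer. The normalizer shown is a hypothetical stand-in that collapses every repeated-character run; it is not the regular expression from the issue, and none of these names exist in sadedegel.

```python
import re


def collapse_repeats(text: str) -> str:
    # Hypothetical normalizer under test: collapse every repeated-character run.
    return re.sub(r"(\w)\1+", r"\1", text)


# (input, expected output) pairs taken from the examples in this issue.
elongated = [
    ("Süppeeeer", "Süper"),
    ("Berbaaat", "Berbat"),
    ("Muhteşeemmmm", "Muhteşem"),
]
# Counter examples: valid words that must be left unchanged.
valid = ["saat", "menfaat", "faal"]


def evaluate(normalize, elongated, valid):
    """Return (false_positives, false_negatives) for a normalizer."""
    # False negative: an elongated word that was not restored to its expected form.
    false_negatives = [w for w, want in elongated if normalize(w) != want]
    # False positive: a valid word that the normalizer changed.
    false_positives = [w for w in valid if normalize(w) != w]
    return false_positives, false_negatives


fp, fn = evaluate(collapse_repeats, elongated, valid)
print("false positives:", fp)
print("false negatives:", fn)
```

With this stand-in, all three counter examples come back as false positives, which is the kind of failure the test data is meant to surface.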