Better preprocessing #4

kinoute · 2021-12-31T19:32:42Z

Hello,

I was wondering whether the preprocess function could be improved. Right now, it strips any punctuation attached to usernames/URLs. Was that done on purpose? I couldn't find a justification for it in your paper.

Right now, the preprocess function below would convert:

I love you @louisia!!!!

to

I love you @user

# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

It seems to me that punctuation could help the model predict the sentiment of a tweet a little better if it were kept. Another example: some users on Twitter start their tweets with a dot, like this:

.@rudy is really bad. What a shame.

They do that to avoid the reply system while still mentioning a username. With the current preprocessing function, "@rudy" doesn't get replaced because the token starts with a dot rather than with the @.

Is there a particular reason why the preprocessing function was written this way, or could we make it more flexible on our end by keeping the punctuation next to usernames and URLs?
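
For illustration, here is a minimal sketch of what a more flexible version might look like. It is not the function from the model card: the name `preprocess_keep_punct`, the two regular expressions, and the set of trailing characters treated as punctuation are all my own assumptions, and it only recognises links that start with `http://` or `https://`, whereas the original replaces any token starting with `http`.

```python
import re

# Hypothetical alternative to the preprocess function quoted above.
# Assumption: a username is "@" followed by word characters, and it must not
# be preceded by a word character (so "me@example.com" is left alone).
USER_RE = re.compile(r"(?<!\w)@\w+")
# Assumption: a link is "http://" or "https://" followed by non-space characters.
URL_RE = re.compile(r"https?://\S+")
# Assumption: these trailing characters are punctuation, not part of the URL.
TRAILING_PUNCT = ".,!?;:)\"'"


def preprocess_keep_punct(text):
    """Replace usernames and links with placeholders, keeping nearby punctuation."""

    def replace_url(match):
        url = match.group(0)
        stripped = url.rstrip(TRAILING_PUNCT)
        # Keep whatever punctuation followed the URL.
        return "http" + url[len(stripped):]

    # Links first, so an "@" inside a URL is never treated as a username.
    text = URL_RE.sub(replace_url, text)
    return USER_RE.sub("@user", text)


print(preprocess_keep_punct("I love you @louisia!!!!"))
# I love you @user!!!!
print(preprocess_keep_punct(".@rudy is really bad. What a shame."))
# .@user is really bad. What a shame.
```

Whether the model actually benefits from this is an open question, since it would see inputs formatted differently from the ones it was trained on.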

Thank you!
