Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better preprocessing #4

Open
kinoute opened this issue Dec 31, 2021 · 0 comments
Open

Better preprocessing #4

kinoute opened this issue Dec 31, 2021 · 0 comments

Comments

@kinoute
Copy link

kinoute commented Dec 31, 2021

Hello,

I was wondering if the preprocess function could be enhanced as right now, it strips punctuations before and after usernames/URLs. Or was it done on purpose? I couldn't find a justification of this in your paper.

Right now, the preprocess function below would convert:

I love you @louisia!!!!

to

I love you @user

# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

It seems to me that punctuations could help the model predict the sentiment of a tweet a little better if it was available to it. Another example: some users on twitter, start their tweets with a dot like this:

.@rudy is really bad. What a shame.

They do that to avoid the reply system while still quoting a username. With the actual pre-processing function, "@rudy" doesn't get replaced because there is a dot right before the @.

Is there any particular reason why the preprocessing function was done this way or we could try to make it more flexible in our end by keeping the punctuations next to usernames or URLs?

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant