Duplicates in dictionary: double entries with different sentiment values #122

Open
chris31415926535 opened this issue Jan 1, 2021 · 1 comment

Comments

@chris31415926535

Thanks for making VADER. I'm working on another port and am having a blast.

There are instances of words/emojis that have two entries with different sentiment values in the most recent version of vader_lexicon.txt. This is a potential source of bugs and inconsistencies between ports. I've included the list below with the line number in vader_lexicon.txt, the words, and the sentiment values.

It looks like the Python version of VADER takes the last value it finds. For example, "lol" has two sentiment values: +2.9 at line 305 and +1.8 at line 4406. To reproduce the output in test sentence 13 from the main Readme (copied below), I need to assign "lol" a sentiment of 1.8.

Today only kinda sux! But I'll get by, lol----------------------- {'pos': 0.317, 'compound': 0.5249, 'neu': 0.556, 'neg': 0.127}
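
To illustrate how "last value wins" can arise, here is a minimal sketch. It assumes the lexicon is read line by line into a plain dict keyed by token, with the token and its score in the first two tab-separated columns. `load_lexicon` and the two-line sample are illustrative only and are not taken from the VADER source.

```python
def load_lexicon(text):
    """Parse 'token<TAB>score...' lines into a dict.

    A later line with the same token silently overwrites the earlier one.
    """
    lexicon = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token, score = line.split("\t")[:2]
        lexicon[token] = float(score)
    return lexicon


# Two entries for "lol", in the same order as lines 305 and 4406 above.
sample = "lol\t2.9\nlol\t1.8\n"
lexicon = load_lexicon(sample)
print(lexicon["lol"])  # the second entry replaced the first
```

Any port that stops at the first match, or stores entries in a structure that keeps the first insertion, would score "lol" differently from this sketch.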

I see three main options:

  1. Leave it as-is. This seems least desirable, since it leads to unpredictable and potentially inconsistent behaviour across instantiations.
  2. Update the dictionary to match the current behaviour by removing the earlier of the two entries for each of the 14 words below. This would be easy, but the potential downside is that some of the differences are big: e.g. "d:" has a positive entry and a negative entry, and the larger-magnitude value for "sob" is more than double the smaller one.
  3. Update the dictionary to match your intuition. A case-by-case approach wouldn't take long since there are only 14 instances, and a standard approach (e.g. averaging the two values) would also be simple.
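
The options above differ only in which value survives for a duplicated token. A rough sketch of that choice, assuming the entries are available as `(token, score)` pairs in file order; `resolve_duplicates` and its `strategy` names are hypothetical helpers, not part of VADER:

```python
from statistics import mean


def resolve_duplicates(entries, strategy="last"):
    """Collapse duplicate tokens to a single score.

    entries:  iterable of (token, score) pairs in file order.
    strategy: "first", "last" (what a plain dict load does), or "mean".
    """
    pickers = {
        "first": lambda scores: scores[0],
        "last": lambda scores: scores[-1],
        "mean": mean,
    }
    pick = pickers[strategy]

    grouped = {}
    for token, score in entries:
        grouped.setdefault(token, []).append(score)
    return {token: pick(scores) for token, scores in grouped.items()}


# Values taken from the list in this issue.
entries = [("d:", -2.9), ("sob", -2.8), ("d:", 1.2), ("sob", -1.0)]

for strategy in ("first", "last", "mean"):
    print(strategy, resolve_duplicates(entries, strategy))
```

For "d:" the three strategies give a negative, a positive, and a mildly negative score, which is why a case-by-case review may be safer than a blanket rule.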

Obviously it's your call, but I didn't see this in any other Issues or Pull Requests so I wanted to surface it. I'm happy to chat or help in any way I can.

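| line number | word | sentiment |
|---|---|---|
| 120 | :-p | 1.2 |
| 124 | :-p | 1.5 |
| 227 | d: | -2.9 |
| 1740 | d: | 1.2 |
| 230 | d= | -3 |
| 1741 | d= | 1.5 |
| 234 | fav | 2.4 |
| 2831 | fav | 2 |
| 301 | lmao | 2 |
| 4399 | lmao | 2.9 |
| 305 | lol | 2.9 |
| 4406 | lol | 1.8 |
| 320 | muah | 2.8 |
| 4730 | muah | 2.3 |
| 342 | o.o | -0.6 |
| 4853 | o.o | -0.8 |
| 352 | ok | 1.6 |
| 4895 | ok | 1.2 |
| 385 | sob | -2.8 |
| 6188 | sob | -1 |
| 411 | x-d | 2.7 |
| 7489 | x-d | 2.6 |
| 412 | x-p | 1.8 |
| 7490 | x-p | 1.7 |
| 413 | xd | 2.7 |
| 7491 | xd | 2.8 |
| 417 | xp | 1.2 |
| 7492 | xp | 1.6 |
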
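
For anyone who wants to regenerate a list like this, here is a minimal sketch. It assumes a tab-separated file with the token in the first column and the score in the second; `find_conflicting_entries` is a hypothetical helper, and the sample lines (including the made-up `token_a` row) stand in for the real file, so the line numbers it reports are positions in the sample only.

```python
from collections import defaultdict


def find_conflicting_entries(lines):
    """Return {token: [(line_number, score), ...]} for every token
    that appears with more than one distinct score."""
    seen = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2 or not parts[0]:
            continue  # skip blank or malformed lines
        seen[parts[0]].append((number, float(parts[1])))
    return {
        token: hits
        for token, hits in seen.items()
        if len({score for _, score in hits}) > 1
    }


sample_lines = [
    "lol\t2.9\n",
    "ok\t1.6\n",
    "token_a\t0.5\n",  # made-up row with no duplicate
    "lol\t1.8\n",
    "ok\t1.2\n",
]
conflicts = find_conflicting_entries(sample_lines)
for token, hits in conflicts.items():
    print(token, hits)
```

To run it against the real lexicon, pass an open file handle instead of `sample_lines`, for example one opened on `vader_lexicon.txt` with UTF-8 encoding (the encoding is an assumption here).
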
@TjallingO

This issue stumped me as well during the development of my own port. There are even more duplicates, like

| line no. | element | sentiment |
|---|---|---|
| 342 | o.o | -0.6 |
| 4853 | o.o | -0.8 |

I worked around the issue by letting later entries overwrite existing mappings at load time, which keeps the original lexicon file intact. However, as you mentioned, this does not seem like a sustainable solution. I would really appreciate a follow-up from @cjhutto or any of the other co-authors as to what would be the most appropriate permanent option.
