Duplicates in dictionary: double entries with different sentiment values #122

chris31415926535 · 2021-01-01T19:06:38Z

Thanks for making VADER. I'm working on another port and am having a blast.

There are instances of words/emojis that have two entries with different sentiment values in the most recent version of vader_lexicon.txt. This is a potential source of bugs and inconsistencies between ports. I've included the list below with the line number in vader_lexicon.txt, the words, and the sentiment values.

It looks like the Python version of VADER takes the last value it finds. For example, "lol" has two sentiment values: +2.9 at line 305 and +1.8 at line 4406. To reproduce the output of test sentence 13 from the main Readme (copied below), I need to assign "lol" a sentiment of 1.8.

Today only kinda sux! But I'll get by, lol----------------------- {'pos': 0.317, 'compound': 0.5249, 'neu': 0.556, 'neg': 0.127}
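
For anyone writing a port, here is a minimal sketch of how a plain dictionary load produces the "last value wins" behaviour described above. It is not VADER's actual loader: the function name `load_lexicon` and the `SAMPLE` string are made up for illustration, and it assumes each line is tab-separated with the token first and the score second (any further columns are ignored).

```python
def load_lexicon(text):
    """Build a token -> score dict; a later line overwrites an earlier one."""
    lexicon = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Only the first two tab-separated fields are used (assumption).
        token, score = line.split("\t")[:2]
        lexicon[token] = float(score)  # silently replaces any earlier entry
    return lexicon


# Hypothetical sample built from two of the duplicated entries in this issue.
SAMPLE = "lol\t2.9\nok\t1.6\nlol\t1.8\nok\t1.2\n"

lexicon = load_lexicon(SAMPLE)
print(lexicon["lol"])  # 1.8, the later of the two "lol" entries
```

A port that keeps the first entry instead (for example by checking `if token not in lexicon`) would score "lol" as 2.9 and diverge from the Readme output.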

I see three main options:

1. Leave it as-is. This seems least desirable, since it leads to unpredictable and potentially inconsistent behaviour across instantiations.
2. Update the dictionary to match the current behaviour by removing the earlier entry for each of the 14 words below, so that the value the Python version already uses is the one that remains. This would be easy, but the potential downside is that some of the differences are big: e.g. "d:" has a positive instance and a negative instance, and the magnitude of "sob"'s larger value is more than double that of the smaller value.
3. Update the dictionary to match your intuition. A case-by-case approach wouldn't take long since there are only 14 instances, and a standard approach (e.g. averaging the two values) would also be simple.
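
To make the options above concrete, here is a rough sketch of how the mechanical policies (keep the first entry, keep the last entry, or average the two) could be compared. The function `resolve` and the `ENTRIES` list are hypothetical; the values are copied from the table below. A case-by-case decision obviously can't be automated this way.

```python
from collections import defaultdict
from statistics import mean


def resolve(entries, policy="last"):
    """Collapse duplicate tokens using a simple policy: first, last, or mean."""
    grouped = defaultdict(list)
    for token, score in entries:
        grouped[token].append(score)  # scores stay in file order
    pickers = {
        "first": lambda scores: scores[0],
        "last": lambda scores: scores[-1],
        "mean": mean,
    }
    pick = pickers[policy]
    return {token: pick(scores) for token, scores in grouped.items()}


# (token, score) pairs in file order, taken from the table in this issue.
ENTRIES = [("d:", -2.9), ("sob", -2.8), ("d:", 1.2), ("sob", -1.0)]

for policy in ("first", "last", "mean"):
    print(policy, resolve(ENTRIES, policy))
```

For "d:" the three policies give a clearly negative, a positive, and a mildly negative score respectively, which is why the choice matters for these entries.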

Obviously it's your call, but I didn't see this in any other Issues or Pull Requests so I wanted to surface it. I'm happy to chat or help in any way I can.

line number	word	sentiment
120	:-p	1.2
124	:-p	1.5
227	d:	-2.9
1740	d:	1.2
230	d=	-3
1741	d=	1.5
234	fav	2.4
2831	fav	2
301	lmao	2
4399	lmao	2.9
305	lol	2.9
4406	lol	1.8
320	muah	2.8
4730	muah	2.3
342	o.o	-0.6
4853	o.o	-0.8
352	ok	1.6
4895	ok	1.2
385	sob	-2.8
6188	sob	-1
411	x-d	2.7
7489	x-d	2.6
412	x-p	1.8
7490	x-p	1.7
413	xd	2.7
7491	xd	2.8
417	xp	1.2
7492	xp	1.6
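
In case it helps other port authors check their copy of the lexicon, a list like the one above could be produced with something along these lines. This is a sketch, not the script used for this issue: `find_conflicts` and `SAMPLE_LINES` are made-up names, and it assumes tab-separated lines with the token in the first field and the score in the second.

```python
from collections import defaultdict


def find_conflicts(lines):
    """Return {token: [(line_no, score), ...]} for tokens whose scores differ."""
    seen = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        seen[fields[0]].append((line_no, float(fields[1])))
    # Keep only tokens that occur with more than one distinct score.
    return {
        token: hits
        for token, hits in seen.items()
        if len({score for _, score in hits}) > 1
    }


# Hypothetical four-line excerpt; line numbers here are positions in this list.
SAMPLE_LINES = ["fav\t2.4", "xd\t2.7", "fav\t2.0", "ok\t1.6"]

for token, hits in find_conflicts(SAMPLE_LINES).items():
    for line_no, score in hits:
        print(line_no, token, score, sep="\t")
```

To run it against the real file, pass an open file handle instead of the sample list, e.g. `find_conflicts(open("vader_lexicon.txt", encoding="utf-8"))`.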

TjallingO · 2021-05-04T23:40:09Z

This issue stumped me as well during the development of my own port. There are even more duplicates, like

line no.	element	sentiment
342	o.o	-0.6
4853	o.o	-0.8

I worked around the issue by letting subsequent entries replace existing mappings at load time, which keeps the original lexicon file intact. However, as you mentioned, this does not seem like a sustainable solution. I would really appreciate a follow-up from @cjhutto or any of the other co-authors on what the most appropriate permanent option would be.
