Inconsistent confidence score #3
Hello,

I'm using spacy-langdetect with Python 3.7.6 on Fedora 31, with an Intel(R) Core(TM) i7-7500U CPU. I'm using it to detect the language of the sentences in a text (which contains both French and English sentences), and I've run into a strange problem when executing it.

For most of the results the two values are quite similar (for a float), and I did not show it in this excerpt, but you can see that for some of the values it's quite different:

Any ideas on what could cause this?

Have a great day

EDIT: some languages change as well, but it's quite rare (IMO it's just that there are fewer possible languages, so it's not as visible, but both are affected by the same problem).

EDIT2: I tried to print it 3 times just to be sure, and it's often different in all of them:

EDIT3: Just for information, my text is mostly composed of notes taken quickly, which obviously doesn't help detection but does help highlight the problem, I think. I've tried with normal English text and the scores do change a little, but it's not as visible as with mine.
Comments
I have also noted inconsistent scores. The code I was using went like this:
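The snippet itself didn't survive in this copy of the thread. What follows is a minimal sketch of the setup being described, assuming the usual spacy-langdetect pattern under spaCy 3 (which requires registering the component as a factory); `sentences` is a placeholder for the input data:

```python
import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector

# spaCy 3 requires pipeline components to be registered as factories
@Language.factory("language_detector")
def create_language_detector(nlp, name):
    return LanguageDetector()

eng_nlp = spacy.load("en_core_web_sm")
eng_nlp.add_pipe("language_detector", last=True)

sentences = ["A stand-in sentence for the abstract text."]  # placeholder input
for sentence in sentences:
    doc = eng_nlp(sentence)
    print(doc._.language)  # a dict like {'language': 'en', 'score': 0.97}
```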
I also do the same thing for the paragraph, using " ".join(sentences). There is only trouble with the score in this one, but it's all over the map:
I understand that the outcome may be determined by a random start when fitting a model, but this business of getting a different answer every time I check the result is more than a little counterintuitive.

spaCy is right to be confused about 134435, since half the abstract is in English and the other half is in German. However, in this case the "sentence" as defined in the input data was the complete "paragraph". Both were just over 200 words. I would have expected a little more consistency. It doesn't matter here because the score is relatively low, but I wish it were more consistent in cases that do matter.

The abstract for 55295 is only 38 words, and while it is in English, there are a lot of medical terms. I could understand a somewhat low score, but the score here varied between 0.71 and 0.999996 the 5 times I read it. This was with spaCy 3.0.6 and spacy-langdetect 0.1.2.

As I was writing this, I had a sudden thought. This is being done in a multiprocessing environment. The creation of eng_nlp is done after each individual process has started up. Is it possible that something is shared when it shouldn't be? It doesn't seem like it, but I thought it worth mentioning. The processes are not spawned by langdetect, for whatever that's worth.
The problem is that this package is mostly a wrapper around the langdetect package. So yes, the seed is not the same all the time. But you can set the seed on the DetectorFactory of langdetect.
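For the record, the seed is a class attribute on langdetect's DetectorFactory, so it can be pinned before any detection runs:

```python
from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # fix the seed so repeated detections agree
print(detect_langs("This is some English text."))  # e.g. [en:0.99999...]
```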
To make it simpler, I've released a package based on a fork of this project.
Unfortunately, setting the seed only solves half the problem. Once you set the seed, you get the same score every time for the same text. However, if you change the seed, you can get a very different score and language for the same text. If you make the central loop in my example something like this (yes, I know there was a stray period in the original):
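The loop itself is missing from this copy of the thread; a plausible reconstruction of what's being described, re-running detection with a different seed on each pass (`eng_nlp` and `sentences` as in the earlier sketch):

```python
from langdetect import DetectorFactory

text = " ".join(sentences)
for seed in range(5):
    DetectorFactory.seed = seed   # a different fixed seed per pass
    doc = eng_nlp(text)
    print(seed, doc._.language)   # the score (and sometimes language) shifts
```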
you can still get answers that are very different. Setting the seed just makes them the same every time you run the code. (Doing this also eliminates the "it changes every time I look at it" problem you get if you don't set the seed.)

I have finally settled on asking for N different results using different seeds (N=5, mostly). If the languages differ, I throw the text away as possibly multi-lingual. If the languages are all the same but the scores are too low (look at the min or the median), I also throw the text away. If the languages are all the same and the scores are high enough, I keep the text.

This isn't foolproof, but it's not horrible. It gets fooled by paragraphs that are mostly recitations of the names of organic compounds, and by short English paragraphs that contain the names/addresses of hospitals in non-English-speaking countries.
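A sketch of that screening procedure, assuming the pipeline from the earlier snippet; the function name and the threshold are illustrative, not from the original:

```python
import statistics
from langdetect import DetectorFactory

def screen_text(nlp, text, n=5, min_median_score=0.9):
    """Hypothetical helper: detect under n seeds and vote on the results."""
    results = []
    for seed in range(n):
        DetectorFactory.seed = seed
        results.append(nlp(text)._.language)
    if len({r["language"] for r in results}) > 1:
        return None  # languages disagree: possibly multi-lingual, discard
    if statistics.median(r["score"] for r in results) < min_median_score:
        return None  # same language but confidence too low, discard
    return results[0]["language"]  # keep the text; report its language
```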
Do you have a reason to do the execution 5 times? Sure, since it is a statistical model, a different seed will give different results each time. But if you fix the seed for ALL the executions, then it will always be the same. Also, I don't think
5 is just a fairly arbitrary small number. 3 is just a little too small. Why multiple tries? Setting the seed just hides the inconsistent behavior. If I use two different seeds, I can sometimes get two very different scores, e.g. 0.75xxxx and 0.9999xxxx. Assuming langdetect gives a consistent answer for the language, the score is a random variable if the seed is chosen at random. Think about using the median to summarize that distribution. (The median is better than the mean given the skewness of the distribution.) A sample mean is a good estimate of the distribution mean. You can use this to do a test of how well the text matches the principal language. I'm not sure how easy it would be to make probability statements. |
OK, I got it; it was to take a mean. Yeah, I agree with you that