Files with duplicate entries #348
@warwickmm wow, that's a lot more than I expected. There's no good reason I can think of, but let me dig in a little bit before you go further.
I ran the test against all the repos, and most of them have files with duplicate entries (GA, LA, MN, RI, TN, US, and VA are the only ones that do not). TX appears to be one of the more problematic ones. For example, CA has 4 bad files, FL has 6 bad files, and NY has 51 bad files.
I spot checked a few of these:
and the duplicate entries were present in the initial commit of the files. Several of them have the same type of issue (duplicate over/under vote entries), which may point to an issue with the parsing code that was used at the time.
@warwickmm makes sense, yeah. For the NY files, do the duplicate rows include party (NY allows candidates to run on multiple party lines)?
If the value for the
Let me know if I should go ahead and start testing/fixing these.
Looking into this example
I see in this source file
In the data file, I see the following entries for the Geddes 20 precinct:
There are no entries for Cuomo/Hochul for the Lafayette precincts. The duplicates in this case appear to be due to incorrect precincts.
I looked into one of the CA duplicates:
The corresponding section from the data file has
Looking at the order of precincts in the source file it seems that the second
Given the above examples, I think it makes sense to start testing for duplicate entries in all the repos. I don't think we should simply remove the duplicates, as some of them may be parsing errors (e.g., wrong precincts). Since resolving the duplicates is going to be a tedious manual process, perhaps we can create an issue in each repo asking for volunteers to help dig into each one. One idea is to create a new suite of tests ("data checks" that are separate from our "format tests") that will check for duplicates (and perhaps in the future compare aggregated precinct results with county-level results). This will allow us to avoid polluting the format test logs with errors that will likely exist for quite a while. One can argue that duplicate entries are not a formatting issue. @dwillis, what do you think?
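To make the "don't simply remove the duplicates" point concrete, here is a rough sketch of a data check that separates exact duplicates (same key, same vote count) from conflicting ones (same key, different vote counts), since the latter are more likely to be parsing errors such as wrong precincts. The column names in `KEY_COLUMNS`, the `votes` column, the function name, and the sample rows are all assumptions for illustration, not taken from the actual test suite. Including `party` in the key keeps NY-style multiple party lines from being flagged.

```python
import csv
import io
from collections import defaultdict

# Assumed column names; adjust to whatever the data files actually use.
KEY_COLUMNS = ("county", "precinct", "office", "district", "party", "candidate")


def classify_duplicates(csv_file, key_columns=KEY_COLUMNS, votes_column="votes"):
    """Group rows by key and split repeated keys into exact vs. conflicting."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        key = tuple(row.get(col) or "" for col in key_columns)
        groups[key].append(row.get(votes_column) or "")

    exact, conflicting = {}, {}
    for key, votes in groups.items():
        if len(votes) < 2:
            continue  # key appears once, nothing to report
        if len(set(votes)) == 1:
            exact[key] = votes  # identical rows, probably safe to collapse
        else:
            conflicting[key] = votes  # needs a manual look at the source
    return exact, conflicting


# Made-up sample data, not from any real results file.
sample = io.StringIO(
    "county,precinct,office,district,party,candidate,votes\n"
    "County X,Precinct 1,Governor,,DEM,Candidate A,100\n"
    "County X,Precinct 1,Governor,,DEM,Candidate A,100\n"
    "County X,Precinct 2,Governor,,DEM,Candidate A,80\n"
    "County X,Precinct 2,Governor,,DEM,Candidate A,95\n"
    "County X,Precinct 2,Governor,,WOR,Candidate A,7\n"
)
exact, conflicting = classify_duplicates(sample)
```

In this sample, the Precinct 1 rows land in `exact` and the two Precinct 2 DEM rows land in `conflicting`, while the WOR row is left alone because its party differs.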
@warwickmm I think that's right, unfortunately.
No problem. Can you create a new repo named
Thanks @dwillis. Would it be possible to get write access to the new repo?
I'm testing some code that detects duplicate entries within a file. If it's correct, there are almost 200 files that contain duplicate rows. Some have a huge number of duplicates (possibly due to improper merge conflict resolution), while others have just a few.
@dwillis, is there any reason for these duplicates to exist? If not, I can add the new test to our test suite and work on removing the duplicates as well.
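For reference, a check for fully identical rows within a single file could be sketched roughly as below. This is not the code being tested in this issue; the function name, the column layout, and the sample rows are hypothetical, and it assumes each file is a CSV with one header row.

```python
import csv
import io
from collections import Counter


def find_duplicate_rows(csv_file):
    """Return {row_tuple: count} for every data row that appears more than once."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row (assumed to be present)
    counts = Counter(tuple(row) for row in reader if row)  # ignore blank lines
    return {row: n for row, n in counts.items() if n > 1}


# Made-up sample data, not from any real results file.
sample = io.StringIO(
    "county,precinct,office,party,candidate,votes\n"
    "County X,Precinct 1,Governor,DEM,Candidate A,100\n"
    "County X,Precinct 1,Governor,DEM,Candidate A,100\n"
    "County X,Precinct 1,Governor,WOR,Candidate A,12\n"
)
duplicates = find_duplicate_rows(sample)
```

Because whole rows are compared, a candidate listed on two party lines is not reported unless every field, including the vote count, matches.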