-
Notifications
You must be signed in to change notification settings - Fork 467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYSTEMDS-3153] Missing value imputation using KNN #1879
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @regaleo605 for your contribution. This is a good start but needs a partially rework with regard to testing (no test data generation inside the builtin function, and proper result comparisons). Could you please address the following detailed comments and let me know (via a comment) once it's ready for a second review round. Thanks.
|
||
@Test | ||
public void test()throws IOException{ | ||
runImputeKNN(Types.ExecType.CP); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a second test with exectype Spark
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also replicate the CP/Spark test for each method.
String HOME = SCRIPT_DIR + TEST_DIR; | ||
fullDMLScriptName = HOME + TEST_NAME + ".dml"; | ||
programArgs = new String[] {}; // | ||
runTest(true, false, null, -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try to generate meaningful inputs and compare the results. You could for example check that the sum of the output matrix is roughly the same for all three methods. (computed with the brute force dist method)
Thank you @mboehm7 for the helpful feedbacks. However, I have a question regarding the test. So in the newest version I tried to write the output to 2 dml files and then compare the matrices using the test utils compare matrices with the tolerance. Do I understand correctly that it should not have any output or did I? Furthermore from what I have observed, the results from 3 different methods resulted in three different resutls. Perhaps either the test data is too small or because method3 contains randomly selected subset of rows which is why there are some random factors. What could be a good tolerance numbers? |
I would recommend to invoke a single test script with generated data (see other tests), pass the method as an argument, and check for example, that the sum of imputed missing values comes close to the expected value (which should be true for all methods and could can play around with the epsilon how far they can differ). |
Let me know once the PR is ready for another full review. |
@mboehm7 I have already addressed the previous feedbacks and did some test. As a result, the tests were unsatisfactory between dist and dist_missing/dist_sample, after looking through the code, I found that the problem is with the method of rowMins/RowIndexMin, Could you please help clarify why I am getting number that is not even from the calculation of euclidean distances? Interestingly these numbers are all roughly the same and as such the indeces were the same. |
Sure, but could you please clarify what you mean by "why I am getting number that is not even from the calculation of euclidean distances"? Also for the tests, another good approach would be to generate random data in the java test, and replicate the a few times so you know what the top-1 nearest neighbor (and the sum of imputed values) is. |
After calculating the euclidean distance, I got a matrix of n rows, where n are the number of missing values. Each row contains the euclidean distances. For example the first row contains (-50,-3, 0 ,1 ,-20) but when I call the rowMins I got a number that is not in the example say -100 instead of -50. |
Hmm, that would be a bug and I would fix it immediately. For which intermediate (please indicate the line number) did you observe this behavior, right now there is not a single rowMins in there (while rowIndexMin returns the position of the minimum value per row). |
Line 107-108 and line 148-149. I noticed the problem comes from rowMins where I got an index that is greater than the dataset. Then I added print(toString(rowMins(t(D)))) and print(toString(rowIndexMin(t(D)))) after line 108/149 in my local version to check and the mentioned problem occured. I believe that rowIndexMin leverage rowMin method. |
OK, so here is what happened: toString() prints by default only the first 100 rows and columns, leading to the rowMins not showing up in the output and rowIndexMax giving you an index larger than the truncated printed version. You can parameterize toString(t(D), rows=1000, cols=1000) and continue with your debugging. We do give a warning whenever we truncate such outputs, but unfortunately our tests run in log level ERROR which omits this crucial information. @Baunsgaard: since we recently discussed a second issue where this caused problems, you might want to change the log level back to WARN. |
well, this is up for debate, since some tests would write many many warning messages if we set the default to warning. |
Thank you for clearing up my confusion. It seems that I needed to test with other datasets to check whether this is only a coincidence or there is something wrong with the euclidean distance calculation. I will notify you if I figure something out or have some other problem. |
I fixed the euclidean distance, Could you please do a code review? |
LGTM - thanks @regaleo605. During the merge, I fixed remaining warnings, simplified the test script (which also fixes the spark test), update the docs, and added TODOs for future improvements regarding imputation for all columns with missing values, and proper randomization (where the passed seed is a root for all required seeds). |
This patch enables systemds to be able to impute missing value using KNN-algorithm. We calculate the similairy or distance between all pairs of records using the euclidean distances. The first method brute forces the imputation using the dist() method. However, this could leads to an expensive computation. Therefore, we proposed 2 other methods, the second method split the number of records (potentially large) and compute the distances with missing records(hopefully small). The third method is similar to the second method. However, we create a subset from the number of records to compute with missing records(large).