[SYSTEMDS-3153] Missing value imputation using KNN #1879

msanyoto · 2023-08-09T10:58:00Z

This patch enables SystemDS to impute missing values using the KNN algorithm. We calculate the similarity (distance) between all pairs of records using Euclidean distances. The first method brute-forces the imputation using the dist() method. However, this can lead to expensive computation. Therefore, we propose two other methods. The second method splits the records (potentially large) and computes the distances only against the records with missing values (hopefully small). The third method is similar to the second, but it creates a subset of the records to compare against the records with missing values (large).
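To make the brute-force idea concrete, here is a minimal sketch in plain Java rather than DML. It is not the code from this patch: the class and method names are hypothetical, it assumes k = 1, and it assumes missing values are encoded as NaN. Each incomplete row takes its missing values from the closest complete row, measured by squared Euclidean distance over the columns the incomplete row actually has.

```java
import java.util.Arrays;

public class KnnImputeSketch {
    // Squared Euclidean distance over the columns that are observed (non-NaN) in row a.
    public static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int j = 0; j < a.length; j++) {
            if (!Double.isNaN(a[j])) {
                double d = a[j] - b[j];
                sum += d * d;
            }
        }
        return sum;
    }

    // True if the row contains no NaN.
    public static boolean isComplete(double[] row) {
        for (double v : row)
            if (Double.isNaN(v)) return false;
        return true;
    }

    // Returns a copy of X in which each NaN is replaced by the value
    // of the nearest complete row (k = 1, hypothetical helper).
    public static double[][] imputeNearest(double[][] X) {
        double[][] out = new double[X.length][];
        for (int i = 0; i < X.length; i++) {
            out[i] = X[i].clone();
            if (isComplete(X[i])) continue;
            int best = -1;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int r = 0; r < X.length; r++) {
                if (r == i || !isComplete(X[r])) continue;
                double d = squaredDistance(X[i], X[r]);
                if (d < bestDist) { bestDist = d; best = r; }
            }
            if (best < 0) continue; // no complete donor row available
            for (int j = 0; j < X[i].length; j++)
                if (Double.isNaN(out[i][j])) out[i][j] = X[best][j];
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] X = {
            {1.0, 2.0, 3.0},
            {10.0, 20.0, 30.0},
            {1.1, Double.NaN, 3.1}
        };
        System.out.println(Arrays.deepToString(imputeNearest(X)));
    }
}
```

The nested loop over all row pairs is what makes the brute-force variant expensive; the other two methods described above reduce the number of pairs that are compared.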

mboehm7

Thanks @regaleo605 for your contribution. This is a good start but needs a partial rework with regard to testing (no test data generation inside the builtin function, and proper result comparisons). Could you please address the following detailed comments and let me know (via a comment) once it's ready for a second review round? Thanks.

scripts/builtin/imputeByKNN.dml

src/main/java/org/apache/sysds/common/Builtins.java

mboehm7 · 2023-08-10T16:21:41Z

src/test/java/org/apache/sysds/test/functions/builtin/part1/BuiltinImputeKNNTest.java

+
+    @Test
+    public void test()throws IOException{
+        runImputeKNN(Types.ExecType.CP);


Add a second test with ExecType Spark.

Also replicate the CP/Spark test for each method.

mboehm7 · 2023-08-10T16:23:33Z

src/test/java/org/apache/sysds/test/functions/builtin/part1/BuiltinImputeKNNTest.java

+            String HOME = SCRIPT_DIR + TEST_DIR;
+            fullDMLScriptName = HOME + TEST_NAME + ".dml";
+            programArgs = new String[] {}; //
+            runTest(true, false, null, -1);


Try to generate meaningful inputs and compare the results. You could, for example, check that the sum of the output matrix is roughly the same for all three methods (using the result of the brute-force dist method as the reference).

scripts/builtin/imputeByKNN.dml

msanyoto · 2023-08-12T10:57:40Z

Thank you @mboehm7 for the helpful feedback. However, I have a question regarding the test. In the newest version I tried to write the output to 2 dml files and then compare the matrices using the test utils' matrix comparison with a tolerance. Do I understand correctly that it should not have any output, or did I misunderstand? Furthermore, from what I have observed, the three methods produce three different results. Perhaps the test data is too small, or method 3 introduces some randomness because it uses a randomly selected subset of rows. What would be a good tolerance?

mboehm7 · 2023-08-12T12:27:49Z

I would recommend invoking a single test script with generated data (see other tests), passing the method as an argument, and checking, for example, that the sum of imputed missing values comes close to the expected value (which should be true for all methods, and you can play around with the epsilon to see how far they can differ).

mboehm7 · 2023-08-12T12:28:10Z

Let me know once the PR is ready for another full review.

msanyoto · 2023-08-13T14:10:38Z

@mboehm7 I have addressed the previous feedback and ran some tests. The results of dist and dist_missing/dist_sample did not match satisfactorily. After looking through the code, I found that the problem is with rowMins/rowIndexMin. Could you please help clarify why I am getting number that is not even from the calculation of euclidean distances? Interestingly, these numbers are all roughly the same, and as such the indices were the same.

mboehm7 · 2023-08-13T15:32:33Z

Sure, but could you please clarify what you mean by "why I am getting number that is not even from the calculation of euclidean distances"?

Also for the tests, another good approach would be to generate random data in the Java test and replicate it a few times, so you know what the top-1 nearest neighbor (and the sum of imputed values) is.
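The replication idea can be sketched as follows. This is a standalone illustration in plain Java under stated assumptions, not the SystemDS test harness: the class name, helper names, seed, matrix size, and epsilon are all hypothetical, and missing values are assumed to be NaN in a single column. Because every incomplete row has an exact complete twin, the top-1 neighbor and therefore the expected sum of imputed values are known in advance.

```java
import java.util.Random;

public class ReplicatedDataCheck {
    // Stacks 'base' on top of a copy of itself and blanks column 'col' in every
    // second row of the copy, so each incomplete row has an exact complete twin.
    public static double[][] replicateWithMissing(double[][] base, int col) {
        int n = base.length;
        double[][] X = new double[2 * n][];
        for (int i = 0; i < n; i++) {
            X[i] = base[i].clone();
            X[n + i] = base[i].clone();
            if (i % 2 == 0) X[n + i][col] = Double.NaN;
        }
        return X;
    }

    // Sum of the values a 1-nearest-neighbor imputation would write into column 'col'.
    public static double imputedSum(double[][] X, int col) {
        double sum = 0;
        for (double[] target : X) {
            if (!Double.isNaN(target[col])) continue;
            double bestDist = Double.POSITIVE_INFINITY, bestValue = Double.NaN;
            for (double[] donor : X) {
                if (Double.isNaN(donor[col])) continue; // donors must have the value
                double d = 0;
                for (int j = 0; j < target.length; j++)
                    if (j != col) d += (target[j] - donor[j]) * (target[j] - donor[j]);
                if (d < bestDist) { bestDist = d; bestValue = donor[col]; }
            }
            sum += bestValue;
        }
        return sum;
    }

    public static void main(String[] args) {
        Random rnd = new Random(7); // fixed seed so the run is reproducible
        double[][] base = new double[20][4];
        for (double[] row : base)
            for (int j = 0; j < row.length; j++) row[j] = rnd.nextDouble();
        int col = 2;
        // Expected sum: the original values of exactly the entries that get blanked.
        double expected = 0;
        for (int i = 0; i < base.length; i += 2) expected += base[i][col];
        double actual = imputedSum(replicateWithMissing(base, col), col);
        double eps = 1e-9;
        System.out.println(Math.abs(actual - expected) <= eps ? "sums match" : "sums differ");
    }
}
```

For the exact method a tight epsilon should be enough; a sampling-based method may miss the twin row, so a looser epsilon would likely be needed there.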

msanyoto · 2023-08-13T15:47:28Z

After calculating the Euclidean distance, I got a matrix of n rows, where n is the number of missing values. Each row contains the Euclidean distances. For example, the first row contains (-50, -3, 0, 1, -20), but when I call rowMins I get a number that is not in the row, say -100 instead of -50.

mboehm7 · 2023-08-13T15:57:25Z

Hmm, that would be a bug and I would fix it immediately. For which intermediate (please indicate the line number) did you observe this behavior? Right now there is not a single rowMins in there (while rowIndexMin returns the position of the minimum value per row).

msanyoto · 2023-08-13T16:03:15Z

Lines 107-108 and lines 148-149. I noticed the problem comes from rowMins, where I got an index that is greater than the dataset. I then added print(toString(rowMins(t(D)))) and print(toString(rowIndexMin(t(D)))) after line 108/149 in my local version to check, and the mentioned problem occurred. I believe that rowIndexMin leverages the rowMins method.

mboehm7 · 2023-08-13T16:37:27Z

OK, so here is what happened: toString() prints by default only the first 100 rows and columns, leading to the rowMins not showing up in the output and rowIndexMin giving you an index larger than the truncated printed version. You can parameterize toString(t(D), rows=1000, cols=1000) and continue with your debugging.

We do give a warning whenever we truncate such outputs, but unfortunately our tests run in log level ERROR which omits this crucial information. @Baunsgaard: since we recently discussed a second issue where this caused problems, you might want to change the log level back to WARN.
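The effect of the truncated printout can be reproduced outside SystemDS with a small plain-Java sketch. It only mimics the situation and is not how toString() is implemented: the class and helper names are hypothetical, the limit of 100 is taken from the explanation above, and positions are assumed to be 1-based as in DML.

```java
import java.util.Arrays;

public class TruncatedPrintSketch {
    // 1-based position of the smallest entry (assumption: positions are 1-based, as in DML).
    public static int indexOfMin(double[] row) {
        int best = 0;
        for (int j = 1; j < row.length; j++)
            if (row[j] < row[best]) best = j;
        return best + 1;
    }

    // Formats only the first 'limit' entries, a stand-in for a truncated printout.
    public static String formatFirst(double[] row, int limit) {
        StringBuilder sb = new StringBuilder();
        for (int j = 0; j < Math.min(limit, row.length); j++) {
            if (j > 0) sb.append(' ');
            sb.append(row[j]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        double[] row = new double[150];
        Arrays.fill(row, 5.0);
        row[129] = -100.0; // the true minimum sits beyond the first 100 columns
        System.out.println(formatFirst(row, 100));              // the minimum is not visible here
        System.out.println("index of min: " + indexOfMin(row)); // larger than the 100 printed columns
    }
}
```

The minimum and its position are computed on the full row, while the printout only covers the first 100 entries, so both values look wrong when compared against the printed data alone.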

Baunsgaard · 2023-08-13T17:38:52Z

Well, this is up for debate, since some tests would write a very large number of warning messages if we set the default to WARN.
A middle ground is to set WARN as the default and then use a more restrictive level in GitHub Actions.

msanyoto · 2023-08-14T05:59:48Z

Thank you for clearing up my confusion. It seems that I needed to test with other datasets to check whether this is only a coincidence or there is something wrong with the euclidean distance calculation. I will notify you if I figure something out or have some other problem.

msanyoto · 2023-08-15T11:15:49Z

I fixed the Euclidean distance. Could you please do a code review?

mboehm7 · 2023-08-15T18:45:17Z

LGTM - thanks @regaleo605. During the merge, I fixed remaining warnings, simplified the test script (which also fixes the Spark test), updated the docs, and added TODOs for future improvements regarding imputation for all columns with missing values, and proper randomization (where the passed seed is a root for all required seeds).

msanyoto added 4 commits August 6, 2023 17:48

add files for KNN Imputation

560523f

Merge branch 'main' of https://github.com/regaleo605/systemds into main

ce11a7f

[SYSTEMDS-3153] missing value imputation using KNN temp

d0c285f

[SYSTEMDS-3153] added license in BuiltinImputeKNNTest.java

a844d22

mboehm7 reviewed Aug 10, 2023

msanyoto added 2 commits August 12, 2023 12:14

[SYSTEMDS-3153] Addressed some of the comments and fixes

a11a7ca

[SYSTEMDS-3153] Added a test (temp) to compare different methods

b3c3f54

rename the method's name and add test

f6c4122

[SYSTEMDS-3153] Fixed the euclidean distance calculation and add test

666b4f3

mboehm7 closed this in d1bc4eb Aug 15, 2023
