Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYSTEMDS-3153] Missing value imputation using KNN #1879

Closed
wants to merge 8 commits into from

Conversation

msanyoto
Copy link
Contributor

@msanyoto msanyoto commented Aug 9, 2023

This patch enables systemds to be able to impute missing value using KNN-algorithm. We calculate the similairy or distance between all pairs of records using the euclidean distances. The first method brute forces the imputation using the dist() method. However, this could leads to an expensive computation. Therefore, we proposed 2 other methods, the second method split the number of records (potentially large) and compute the distances with missing records(hopefully small). The third method is similar to the second method. However, we create a subset from the number of records to compute with missing records(large).

Copy link
Contributor

@mboehm7 mboehm7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @regaleo605 for your contribution. This is a good start but needs a partially rework with regard to testing (no test data generation inside the builtin function, and proper result comparisons). Could you please address the following detailed comments and let me know (via a comment) once it's ready for a second review round. Thanks.

scripts/builtin/imputeByKNN.dml Outdated Show resolved Hide resolved
scripts/builtin/imputeByKNN.dml Outdated Show resolved Hide resolved
scripts/builtin/imputeByKNN.dml Outdated Show resolved Hide resolved
scripts/builtin/imputeByKNN.dml Outdated Show resolved Hide resolved
scripts/builtin/imputeByKNN.dml Outdated Show resolved Hide resolved
scripts/builtin/imputeByKNN.dml Outdated Show resolved Hide resolved
src/main/java/org/apache/sysds/common/Builtins.java Outdated Show resolved Hide resolved

@Test
public void test()throws IOException{
runImputeKNN(Types.ExecType.CP);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a second test with exectype Spark

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also replicate the CP/Spark test for each method.

Comment on lines 51 to 54
String HOME = SCRIPT_DIR + TEST_DIR;
fullDMLScriptName = HOME + TEST_NAME + ".dml";
programArgs = new String[] {}; //
runTest(true, false, null, -1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try to generate meaningful inputs and compare the results. You could for example check that the sum of the output matrix is roughly the same for all three methods. (computed with the brute force dist method)

scripts/builtin/imputeByKNN.dml Outdated Show resolved Hide resolved
@msanyoto
Copy link
Contributor Author

Thank you @mboehm7 for the helpful feedbacks. However, I have a question regarding the test. So in the newest version I tried to write the output to 2 dml files and then compare the matrices using the test utils compare matrices with the tolerance. Do I understand correctly that it should not have any output or did I? Furthermore from what I have observed, the results from 3 different methods resulted in three different resutls. Perhaps either the test data is too small or because method3 contains randomly selected subset of rows which is why there are some random factors. What could be a good tolerance numbers?

@mboehm7
Copy link
Contributor

mboehm7 commented Aug 12, 2023

I would recommend to invoke a single test script with generated data (see other tests), pass the method as an argument, and check for example, that the sum of imputed missing values comes close to the expected value (which should be true for all methods and could can play around with the epsilon how far they can differ).

@mboehm7
Copy link
Contributor

mboehm7 commented Aug 12, 2023

Let me know once the PR is ready for another full review.

@msanyoto
Copy link
Contributor Author

@mboehm7 I have already addressed the previous feedbacks and did some test. As a result, the tests were unsatisfactory between dist and dist_missing/dist_sample, after looking through the code, I found that the problem is with the method of rowMins/RowIndexMin, Could you please help clarify why I am getting number that is not even from the calculation of euclidean distances? Interestingly these numbers are all roughly the same and as such the indeces were the same.

@mboehm7
Copy link
Contributor

mboehm7 commented Aug 13, 2023

Sure, but could you please clarify what you mean by "why I am getting number that is not even from the calculation of euclidean distances"?

Also for the tests, another good approach would be to generate random data in the java test, and replicate the a few times so you know what the top-1 nearest neighbor (and the sum of imputed values) is.

@msanyoto
Copy link
Contributor Author

After calculating the euclidean distance, I got a matrix of n rows, where n are the number of missing values. Each row contains the euclidean distances. For example the first row contains (-50,-3, 0 ,1 ,-20) but when I call the rowMins I got a number that is not in the example say -100 instead of -50.

@mboehm7
Copy link
Contributor

mboehm7 commented Aug 13, 2023

Hmm, that would be a bug and I would fix it immediately. For which intermediate (please indicate the line number) did you observe this behavior, right now there is not a single rowMins in there (while rowIndexMin returns the position of the minimum value per row).

@msanyoto
Copy link
Contributor Author

msanyoto commented Aug 13, 2023

Line 107-108 and line 148-149. I noticed the problem comes from rowMins where I got an index that is greater than the dataset. Then I added print(toString(rowMins(t(D)))) and print(toString(rowIndexMin(t(D)))) after line 108/149 in my local version to check and the mentioned problem occured. I believe that rowIndexMin leverage rowMin method.

@mboehm7
Copy link
Contributor

mboehm7 commented Aug 13, 2023

OK, so here is what happened: toString() prints by default only the first 100 rows and columns, leading to the rowMins not showing up in the output and rowIndexMax giving you an index larger than the truncated printed version. You can parameterize toString(t(D), rows=1000, cols=1000) and continue with your debugging.

We do give a warning whenever we truncate such outputs, but unfortunately our tests run in log level ERROR which omits this crucial information. @Baunsgaard: since we recently discussed a second issue where this caused problems, you might want to change the log level back to WARN.

@Baunsgaard
Copy link
Contributor

OK, so here is what happened: toString() prints by default only the first 100 rows and columns, leading to the rowMins not showing up in the output and rowIndexMax giving you an index larger than the truncated printed version. You can parameterize toString(t(D), rows=1000, cols=1000) and continue with your debugging.

We do give a warning whenever we truncate such outputs, but unfortunately our tests run in log level ERROR which omits this crucial information. @Baunsgaard: since we recently discussed a second issue where this caused problems, you might want to change the log level back to WARN.

well, this is up for debate, since some tests would write many many warning messages if we set the default to warning.
A middle ground is to set to warning as default, and we then in GitHub actions set it to a more restrictive version.

@msanyoto
Copy link
Contributor Author

Thank you for clearing up my confusion. It seems that I needed to test with other datasets to check whether this is only a coincidence or there is something wrong with the euclidean distance calculation. I will notify you if I figure something out or have some other problem.

@msanyoto
Copy link
Contributor Author

I fixed the euclidean distance, Could you please do a code review?

@mboehm7
Copy link
Contributor

mboehm7 commented Aug 15, 2023

LGTM - thanks @regaleo605. During the merge, I fixed remaining warnings, simplified the test script (which also fixes the spark test), update the docs, and added TODOs for future improvements regarding imputation for all columns with missing values, and proper randomization (where the passed seed is a root for all required seeds).

@mboehm7 mboehm7 closed this in d1bc4eb Aug 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants