[SYSTEMDS-3153] Imputation for all missing columns and seed handling #1888

msanyoto · 2023-08-23T07:19:03Z

This patch enables imputation for all columns with missing values and adds seed management for the 'dist_sample' method to ensure randomization.

mboehm7

Thanks @regaleo605 for the follow-up. Imputing all columns with missing values is essential, and this patch solves that issue. However, two major things are still missing: (1) proper sampling of records, and (2) if possible, vectorization of the imputation, since the top-1 nearest neighbor of a record is the same no matter how many of its cells are imputed.

mboehm7 · 2023-08-25T17:28:59Z

scripts/builtin/imputeByKNN.dml

+    parfor(i in 1:nrow(missing_col_index), check = 0){
+      #Position of missing values in per row in which column
+      position = masked[,as.scalar(missing_col_index[i,1])]
+      position = position * minimum_index


For a record with multiple missing values, the top-1 nearest neighbor (computed over all features) is the same for all columns with missing values. Hence, we should be able to vectorize this imputation by (1) getting the index of the nearest neighbor, (2) constructing a permutation matrix (via table) based on these indexes, (3) matrix-multiplying the permutation matrix with the data to obtain the entire neighbor records A, and then (4) imputing via X * (Mask==0) + A * Mask. If you don't manage to do this vectorization, leave the parfor and I'll improve it afterwards.

msanyoto · 2023-08-26

I think I can manage the vectorization, but it seems that table() only accepts non-zero index values. If I call table(index, seq(1, nrow(X))) with an index vector that contains 0, it returns an error. I think my previous approach of locating the rows with missing values and multiplying with A to get the new mask would still work. Or is there a workaround to ignore 0, or to treat 0 as a row 0?
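The four steps proposed in the review can be sketched as follows. This is a plain-Python illustration and not the DML that was merged: `impute_top1`, `matmul`, and `nn_index` are hypothetical names, indexes are 0-based, missing cells are assumed to be stored as 0, and neighbors are assumed to be complete records. Letting complete rows point at themselves is one possible way to avoid zero indexes; it is an assumption here, not something the thread confirms.

```python
def matmul(A, B):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def impute_top1(X, mask, nn_index):
    """Impute the masked cells of X from each row's top-1 nearest neighbor.

    X        : n x m data, missing cells already set to 0 (assumption)
    mask     : n x m, 1 where a cell is missing, else 0
    nn_index : hypothetical 0-based neighbor row per record (step 1);
               rows without missing values point at themselves
    """
    n, m = len(X), len(X[0])
    # Step 2: selection matrix with P[i][nn_index[i]] = 1 (what table() builds).
    P = [[1 if j == nn_index[i] else 0 for j in range(n)] for i in range(n)]
    # Step 3: multiply from the left so A[i] is the full neighbor record of row i.
    A = matmul(P, X)
    # Step 4: X * (Mask == 0) + A * Mask, cell by cell.
    return [[X[i][j] * (1 - mask[i][j]) + A[i][j] * mask[i][j]
             for j in range(m)] for i in range(n)]


X = [[1.0, 2.0, 3.0],
     [0.0, 0.0, 3.1],   # two missing cells, both filled from the same neighbor
     [5.0, 6.0, 7.0]]
mask = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
nn_index = [0, 0, 2]    # row 1's nearest neighbor is row 0
imputed = impute_top1(X, mask, nn_index)
```

Because the neighbor is looked up once per record, both missing cells of row 1 are filled in a single pass, which is the point of replacing the per-column parfor.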

mboehm7 · 2023-08-25T17:32:22Z

scripts/builtin/imputeByKNN.dml

+    if(seed == -1){
+      random_matrix = ceiling(rand(rows = nrow(M3), cols = 1, min = 0, max = 1, sparsity = sparsity))
+    } else {
+      random_matrix = ceiling(rand(rows = nrow(M3), cols = 1, min = 0, max = 1, sparsity = sparsity, seed = seed))
+    }


I think we might have misunderstood each other when talking about randomization: my comment was to create a random matrix from the passed root seed and generate as many seeds as you need for subsequent rand/sample calls. Right now we create a random matrix here, but don't actually perform sampling of records. Please select records by either calling sample and then creating a permutation matrix (e.g., table(seq(), sample) %*% X) or via thresholding (e.g., rand()<sample_frac followed by removeEmpty).

msanyoto · 2023-08-26

To make sure I understand: if I call the dist_sample method with the default seed -1, I might get rows 1, 3, 4, and if I call it again I still get rows 1, 3, 4 because the seed doesn't change. What we want instead is that each call yields a different sample, e.g., rows 1, 3, 4 the first time, rows 2, 4, 6 the second time, and so on. Do I understand correctly?
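The two requests in this thread, deriving per-call seeds from one root seed and actually selecting records, can be illustrated with a small stdlib-only Python sketch. The function names are hypothetical, and treating -1 as "no fixed seed" is an assumption about the intended semantics rather than a description of the merged DML.

```python
import random


def derive_seeds(root_seed, count):
    """Derive `count` child seeds from one root seed.

    Assumption: root_seed == -1 means "not fixed", so every call differs;
    any other value makes the derived seeds reproducible.
    """
    rng = random.Random() if root_seed == -1 else random.Random(root_seed)
    return [rng.randrange(1, 2**31 - 1) for _ in range(count)]


def sample_rows_threshold(X, sample_frac, seed):
    """Thresholding: keep a row when rand() < sample_frac, then drop the
    unselected rows (the 'remove empty' step). The sample size varies."""
    rng = random.Random(seed)
    keep = [rng.random() < sample_frac for _ in X]
    return [row for row, k in zip(X, keep) if k]


def sample_rows_exact(X, size, seed):
    """Draw exactly `size` distinct rows, analogous to calling sample and
    then selecting those rows with a permutation matrix."""
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(len(X)), size))
    return [X[i] for i in idx]


X = [[i, i * 10] for i in range(1, 9)]
seed_a, seed_b = derive_seeds(42, 2)       # one child seed per rand/sample call
subset_exact = sample_rows_exact(X, 3, seed_a)
subset_thresh = sample_rows_threshold(X, 0.5, seed_b)
```

With a fixed root seed the whole pipeline is reproducible, while each individual rand/sample call still receives its own seed.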

msanyoto · 2023-08-26T15:19:52Z

@mboehm7 Thank you for the feedback. Could you please review the code for the vectorization and the seed handling? I am unsure whether the seed handling is what you expected.

One more question: if I want to run some experiments with SystemDS, do I need to package my local version into an executable jar via Maven, or is it enough to rebuild SystemDS locally so that it recognizes the newly added builtin function, and then run a DML script that contains the experiment?

mboehm7 · 2023-08-30T18:20:44Z

LGTM, thanks for the additional patch @regaleo605. The approaches to vectorization and sampling were generally good. However, the sampling multiplied the permutation matrix from the right, which selects columns; we need to select rows, so it has to be multiplied from the left. I fixed it, but it's generally an indicator that this method is not tested yet. But don't worry, I'll take it from here. Furthermore, I simplified the sampling and randomization. Also, in the future, please always create a new branch and rebase your changes; continuing the previous PR and merging in other changes causes painful conflicts during the merge.
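The difference between multiplying a selection matrix from the left and from the right can be seen on a tiny example. This is a plain-Python illustration with hypothetical helper names, not code from the patch.

```python
def matmul(A, B):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    """Transpose a list-of-lists matrix."""
    return [list(col) for col in zip(*A)]


X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

# 2 x 3 selection matrix that picks index 0 and index 2 (0-based).
S = [[1, 0, 0],
     [0, 0, 1]]

picked_rows = matmul(S, X)             # left multiply: rows 0 and 2 of X
picked_cols = matmul(X, transpose(S))  # right multiply: columns 0 and 2 of X
```

Sampling records therefore needs the left multiplication; the right multiplication keeps every record and drops features instead.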

For the experiments, I would recommend building SystemDS via mvn package and then running your experiments via something like this (where test.dml is your invocation of the builtin function, and -stats gives you important details of where time was spent):
java -Xmx8g -Xms8g -Xmn800m -cp lib/*:SystemDS.jar org.apache.sysds.api.DMLScript -f test.dml -exec singlenode -explain -stats

msanyoto added 11 commits August 6, 2023 17:48

add files for KNN Imputation

560523f

Merge branch 'main' of https://github.com/regaleo605/systemds into main

ce11a7f

[SYSTEMDS-3153] missing value imputation using KNN temp

d0c285f

[SYSTEMDS-3153] added license in BuiltinImputeKNNTest.java

a844d22

[SYSTEMDS-3153] Addressed some of the comments and fixes

a11a7ca

[SYSTEMDS-3153] Added a test (temp) to compare different methods

b3c3f54

rename the method's name and add test

f6c4122

[SYSTEMDS-3153] Fixed the euclidean distance calculation and add test

666b4f3

Resolving conflict forked repository

279c763

fixed some spacing

8390ed7

[SYSTEMDS-3153] imputation for all columns with missing values and se…

cde8523

…ed handling

mboehm7 reviewed Aug 25, 2023

[SYSTEMDS-3153] Vectorization of top-1 and proper seed handling

a2e5c3a

[SYSTEMDS-3153] fixed minor issues

7fd952b

mboehm7 closed this in 73555e9 Aug 30, 2023
