-
Notifications
You must be signed in to change notification settings - Fork 467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYSTEMDS-3153] Imputation for all missing columns and seed handling #1888
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @regaleo605 for the follow-up. Imputing all columns with missing value is essential and this patch solves the issue. However, there are two major things missing: (1) proper sampling of records, and (2) if possible, vectorize the imputation as the top-1 nearest neighbor for a record is the same, no matter how many cells are imputed.
scripts/builtin/imputeByKNN.dml
Outdated
parfor(i in 1:nrow(missing_col_index), check = 0){ | ||
#Position of missing values in per row in which column | ||
position = masked[,as.scalar(missing_col_index[i,1])] | ||
position = position * minimum_index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a record with multiple missing values, the top-1 nearest neighbor (computed over all features) is the same for all columns with missing values. Hence, we should be able to vectorize this imputation by (1) getting the index of the nearest neighbor, (2) construct a permutation matrix (via table) based on these indexes, (3) matrix multiply the permutation matrix with the data to obtain the entire records A, and then (4) impute via X * (Mask==0) + A * Mask. If you don't manage to do this vectorization, leave the parfor and I'll improve it afterwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I can manage to do the vectorization, but it seems the table() can only be done using value not 0. If I try use (index,seq(1, nrow(X)) with index containing 0 it will return an error, but I think provided with my previous code of locating the missing rows and multiplying with A to get the new mask is possible.or is there a workaround to ignore 0 or let 0 be a row 0?
scripts/builtin/imputeByKNN.dml
Outdated
if(seed == -1){ | ||
random_matrix = ceiling(rand(rows = nrow(M3), cols = 1, min = 0, max = 1, sparsity = sparsity)) | ||
} else { | ||
random_matrix = ceiling(rand(rows = nrow(M3), cols = 1, min = 0, max = 1, sparsity = sparsity, seed = seed)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might have misunderstood us when talking about randomization: my comment was to create a random matrix from the passed root seed and generate as many seeds as you need for subsequent rand/sample calls. Right now we create a random matrix here, but don't actually perform sampling of records. Please select records by either calling sample and then creating a permutation matrix (e.g., table(seq(), sample) %*% X) or via thesholding (e.g., rand()<sample_frac followed by remove empty)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make it clear, does it means for example if I call the dist_sample method with the default seed -1, I can get rows 1 3 4 but If I call dist_sample method again, since the seed doesn't change I still get 1 3 4, but we want whenever we call dist_sample. I can get rows 1 3 4 for the first time, second 2 4 6, and so on? Do I understand correctly?
@mboehm7 Thank you for the feedback. Could you please do a code review regarding the vectorization and proper seed handling. I am unsure if the seed handling is as what is expected. Another note, if I want to do some experiments with systemds, do I need to package my local version into executable jar via maven or just rebuild the systemds locally so systemds will recognize the newest builtin function added and run a dml script that contains the experiment? |
LGTM - thanks for the additional patch @regaleo605. The approaches of vectorization and sampling were generally good. Just the sampling multiplied the permutation matrix from the right - this selects columns, instead we need to select rows so multiply from the left. I fixed it but it's generally a indicator that this method is not tested yet. But don't worry, I'll take it from here. Furthermore, I simplified the sampling and randomization. Also in the future, please always create a new branch and rebase your changes, the continuation of the previous PR and merge of other changes causes painful conflicts during the merge. For the experiments, I would recommend to build SystemDS via |
This patch enables imputation for all columns with missing values and seed management for 'dist_sample' method to ensure randomization.