Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYSTEMDS-3153] Imputation for all missing columns and seed handling #1888

Closed
wants to merge 13 commits into from

Conversation

msanyoto
Copy link
Contributor

This patch enables imputation for all columns with missing values and seed management for 'dist_sample' method to ensure randomization.

Copy link
Contributor

@mboehm7 mboehm7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @regaleo605 for the follow-up. Imputing all columns with missing value is essential and this patch solves the issue. However, there are two major things missing: (1) proper sampling of records, and (2) if possible, vectorize the imputation as the top-1 nearest neighbor for a record is the same, no matter how many cells are imputed.

scripts/builtin/imputeByKNN.dml Outdated Show resolved Hide resolved
scripts/builtin/imputeByKNN.dml Outdated Show resolved Hide resolved
Comment on lines 83 to 86
parfor(i in 1:nrow(missing_col_index), check = 0){
#Position of missing values in per row in which column
position = masked[,as.scalar(missing_col_index[i,1])]
position = position * minimum_index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a record with multiple missing values, the top-1 nearest neighbor (computed over all features) is the same for all columns with missing values. Hence, we should be able to vectorize this imputation by (1) getting the index of the nearest neighbor, (2) construct a permutation matrix (via table) based on these indexes, (3) matrix multiply the permutation matrix with the data to obtain the entire records A, and then (4) impute via X * (Mask==0) + A * Mask. If you don't manage to do this vectorization, leave the parfor and I'll improve it afterwards.

Copy link
Contributor Author

@msanyoto msanyoto Aug 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I can manage to do the vectorization, but it seems the table() can only be done using value not 0. If I try use (index,seq(1, nrow(X)) with index containing 0 it will return an error, but I think provided with my previous code of locating the missing rows and multiplying with A to get the new mask is possible.or is there a workaround to ignore 0 or let 0 be a row 0?

Comment on lines 159 to 163
if(seed == -1){
random_matrix = ceiling(rand(rows = nrow(M3), cols = 1, min = 0, max = 1, sparsity = sparsity))
} else {
random_matrix = ceiling(rand(rows = nrow(M3), cols = 1, min = 0, max = 1, sparsity = sparsity, seed = seed))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might have misunderstood us when talking about randomization: my comment was to create a random matrix from the passed root seed and generate as many seeds as you need for subsequent rand/sample calls. Right now we create a random matrix here, but don't actually perform sampling of records. Please select records by either calling sample and then creating a permutation matrix (e.g., table(seq(), sample) %*% X) or via thesholding (e.g., rand()<sample_frac followed by remove empty)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it clear, does it means for example if I call the dist_sample method with the default seed -1, I can get rows 1 3 4 but If I call dist_sample method again, since the seed doesn't change I still get 1 3 4, but we want whenever we call dist_sample. I can get rows 1 3 4 for the first time, second 2 4 6, and so on? Do I understand correctly?

@msanyoto
Copy link
Contributor Author

@mboehm7 Thank you for the feedback. Could you please do a code review regarding the vectorization and proper seed handling. I am unsure if the seed handling is as what is expected.

Another note, if I want to do some experiments with systemds, do I need to package my local version into executable jar via maven or just rebuild the systemds locally so systemds will recognize the newest builtin function added and run a dml script that contains the experiment?

@mboehm7
Copy link
Contributor

mboehm7 commented Aug 30, 2023

LGTM - thanks for the additional patch @regaleo605. The approaches of vectorization and sampling were generally good. Just the sampling multiplied the permutation matrix from the right - this selects columns, instead we need to select rows so multiply from the left. I fixed it but it's generally a indicator that this method is not tested yet. But don't worry, I'll take it from here. Furthermore, I simplified the sampling and randomization. Also in the future, please always create a new branch and rebase your changes, the continuation of the previous PR and merge of other changes causes painful conflicts during the merge.

For the experiments, I would recommend to build SystemDS via mvn package and then run your experiments via something like this (where test.dml is your invocation of the builtin function, and -stats gives you important details of where time was spent)
java -Xmx8g -Xms8g -Xmn800m -cp lib/*:SystemDS.jar org.apache.sysds.api.DMLScript -f test.dml -exec singlenode -explain -stats

@mboehm7 mboehm7 closed this in 73555e9 Aug 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants