NAs in metadata$Corr_matrix #4

Open
hopkinsjj9 opened this issue Sep 20, 2019 · 8 comments

Comments


hopkinsjj9 commented Sep 20, 2019

Thank you for putting together a great package!

I'm getting an "infinite or missing values in 'x'" error when I run the following data through the workflow:
https://www.kaggle.com/pradeeptripathi/predicting-house-prices-using-r/data

library(dplyr)  # provides %>% and mutate_if()

train <- data.frame(readr::read_csv('../data/train.csv'))
str(train)
train <- train %>% mutate_if(is.character, as.factor)
str(train)

cleaned <- missCompare::clean(train,
                              var_removal_threshold = 0.5,
                              ind_removal_threshold = 0.8,
                              missingness_coding = -9)

# make sure: run clean() a second time on the cleaned data
cleaned <- missCompare::clean(cleaned,
                              var_removal_threshold = 0.5,
                              ind_removal_threshold = 0.8,
                              missingness_coding = -9)

metadata <- missCompare::get_data(cleaned,
                                  matrixplot_sort = TRUE,
                                  plot_transform = TRUE)
Warning message:
In stats::cor(X, use = "pairwise.complete.obs", method = "pearson") :
the standard deviation is zero

simulated <- missCompare::simulate(rownum = metadata$Rows,
                                   colnum = metadata$Columns,
                                   cormat = metadata$Corr_matrix,
                                   meanval = 0,
                                   sdval = 1)
Error in eigen(if (doDykstra) R else Y, symmetric = TRUE) :
infinite or missing values in 'x'

I found two NAs in metadata$Corr_matrix, both for the Utilities/LotFrontage pair.
Not knowing exactly how to handle this, I just set them to zero (a hack):

colnames(metadata$Corr_matrix)[colSums(is.na(metadata$Corr_matrix)) > 0]
metadata$Corr_matrix[is.na(metadata$Corr_matrix)] <- 0

I can now restart at the simulate step.
But there's got to be a better way.
Shouldn't clean() or get_data() take care of this somehow?

Thanks again
Jack Hopkins
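
For readers hitting the same error, here is a minimal sketch of how to list exactly which variable pairs are NA in a correlation matrix before it is passed to simulate(). It uses a made-up toy data frame (`toy`, with hypothetical columns `a`, `b`, `c`) rather than the Kaggle data, and only base R. The `cor()` call mirrors the one shown in the warning message above. Whether missCompare does anything else internally is not assumed here.

```r
# Toy data (hypothetical): `b` is constant on every row where `a` is
# observed, mimicking the Utilities/LotFrontage situation.
toy <- data.frame(
  a = c(1.0, 2.5, 3.1, 4.7, NA),
  b = c(1, 1, 1, 1, 2),
  c = c(2.0, 1.0, 4.0, 3.0, 5.0)
)

# Same call that appears in the warning message from get_data().
cm <- suppressWarnings(
  stats::cor(toy, use = "pairwise.complete.obs", method = "pearson")
)

# Row/column indices of NA cells; the upper triangle keeps each pair once.
na_idx <- which(is.na(cm) & upper.tri(cm), arr.ind = TRUE)
na_pairs <- data.frame(
  var1 = rownames(cm)[na_idx[, "row"]],
  var2 = colnames(cm)[na_idx[, "col"]]
)
print(na_pairs)

# eigen() rejects matrices with missing entries, which is the kind of
# error surfaced at the simulate() step.
eig_msg <- tryCatch(
  {
    eigen(cm, symmetric = TRUE)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(eig_msg)
```

Listing the pairs this way shows which variable to inspect or drop, rather than overwriting the NA cells with zero.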

Owner

Tirgit commented Sep 20, 2019

Hi - I will look into this issue early next week. Indeed it sounds like this is a bug and this should be handled inside one of the functions. Thanks for the heads up!

Owner

Tirgit commented Sep 25, 2019

Hi - I checked your problem. When the correlation matrix is calculated, the pair of features Utilities and LotFrontage produces NAs. The reason is that Utilities has very small variance in this sample (of the 1460 observations, Utilities takes the value 1 in 1459 instances and the value 2 in only 1 instance). I don't have a quick fix for you within missCompare, but for now you can solve this by removing the Utilities column from the data. This cleaning step should, of course, be done before the get_data() step.
Perhaps in the next version I can include an option for such cases in the clean() function.
Good luck with your analysis!
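
As background for the suggestion above: with `use = "pairwise.complete.obs"`, `cor()` computes each correlation only on the rows where both variables are observed, so a variable that is constant on that subset has zero standard deviation and the result is NA. A minimal sketch of a pre-cleaning step that drops near-constant columns before get_data() could look like the following. `drop_near_constant()` and its `max_share` threshold are hypothetical helpers written for illustration, not part of missCompare, and the toy data frame only imitates the counts reported in this thread.

```r
# Hypothetical helper (not a missCompare function): drop columns whose
# most frequent non-missing value covers more than `max_share` of the
# non-missing observations.
drop_near_constant <- function(df, max_share = 0.99) {
  keep <- vapply(df, function(col) {
    vals <- col[!is.na(col)]
    if (length(vals) == 0) return(FALSE)  # all-missing column: drop
    max(table(vals)) / length(vals) <= max_share
  }, logical(1))
  df[, keep, drop = FALSE]
}

# Toy stand-in for the reported case: 1459 ones and a single two.
toy <- data.frame(
  Utilities = c(rep(1, 1459), 2),
  LotArea   = seq_len(1460)
)

filtered <- drop_near_constant(toy)
print(names(filtered))
```

The filtered data frame would then go through clean() and get_data() as in the original post. The 0.99 threshold is an arbitrary choice for this sketch and should be adjusted to the data.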

Author

hopkinsjj9 commented Sep 25, 2019 via email

Owner

Tirgit commented Oct 1, 2019

Hi Jack - could you clarify the statement "post_imp_diag performs a T-test which will break if a column only contains 1 NA y variable." and include an example? Does the problem occur when there is only 1 NA among the values of a variable? I'm having trouble understanding the "1 NA y variable" part.
Thanks,
Tibor

@alsaberACS

Hello Tirgit,

I have a question: I am trying to run "impute_simulated", but I don't want to run all 16 MI methods. Can I choose only some of them?

Thanks,

Ahmad

Owner

Tirgit commented Dec 10, 2019

Hi Ahmad,

This is currently not possible; this function always runs all 16 methods. The next version of the package will make this an option.

For now, though, you can do this using impute_data().

Best,
Tibor


alsaberACS commented Dec 10, 2019 via email

@skanthan95

Thank you for making this package! I have data (N ~ 13000) that is highly missing, monotone, and MNAR (for Gender (~10%) and Ethnicity (~80%)). I converted all chr features to fct, created the cleaned and metadata objects, and everything worked fine, but when I tried to create the simulated object, I got the error written in the header. I'm a little confused because other users here attribute that error to having NAs in their data, but I thought that the 'simulated' step removes the NA values for you and basically normalizes your initial dataset. Have I misunderstood?
