NAs in metadata$Corr_matrix #4
Comments
Hi - I will look into this issue early next week. It does sound like a bug that should be handled inside one of the functions. Thanks for the heads up!
Hi - I checked your problem. When the correlation matrix is calculated, two features (Utilities and LotFrontage) produce NAs. The reason is that Utilities has very small variance in this sample: of the 1460 observations, it takes the value 1 in 1459 instances and the value 2 in only 1. I don't have a quick fix within missCompare, but for now you can solve this by removing the Utilities column from the data. This is a cleaning step that should, of course, be done before the get_data() step.
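The removal step suggested above can be sketched in base R. This is only an illustration on synthetic data: `dominant_share()` and `find_near_constant()` are hypothetical helpers, not part of missCompare, and the 0.99 threshold is an arbitrary assumption that should be tuned to the data at hand.

```r
# Hypothetical helpers (not part of missCompare): flag columns whose most
# frequent non-missing value makes up at least `threshold` of the values.
dominant_share <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_real_)
  max(table(x)) / length(x)
}

find_near_constant <- function(df, threshold = 0.99) {
  shares <- vapply(df, dominant_share, numeric(1))
  names(shares)[!is.na(shares) & shares >= threshold]
}

# Synthetic stand-in for the housing data: Utilities is 1 in all rows but one.
set.seed(1)
toy <- data.frame(
  Utilities   = c(rep(1, 1459), 2),
  LotArea     = rnorm(1460, mean = 10000, sd = 2000),
  OverallQual = sample(1:10, 1460, replace = TRUE)
)

# Identify the near-constant columns and drop them before further processing.
drop_cols <- find_near_constant(toy)
print(drop_cols)
toy_reduced <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]
```

The reduced data frame would then be the input to clean() and get_data() in place of the original one.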
Thank you for looking into this.
I started using another dataset which was able to get past this problem,
only to run into another one.
post_imp_diag performs a T-test which will break if a column only contains
1 NA y variable.
My solution ( Oh no! ) was to just use the median in those cases. It
allowed me to check out the diagrams coming out of post_imp_diag with a
minimum of impact. (I hope).
I appreciate your feedback and hope to see more great packages.
Jack Hopkins
On Wed, Sep 25, 2019 at 7:10 AM Tibor V. Varga wrote:
Perhaps in the next version I can include some command for such cases in
the clean() function.
Good luck with your analysis!
|
Hi Jack - could you clarify the statement "post_imp_diag performs a T-test which will break if a column only contains 1 NA y variable." and include an example? Does the problem occur when there is only 1 NA among the values of a variable? I'm having trouble understanding the "1 NA y variable" part.
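One possible reading of the statement quoted above is that the second sample passed to the t-test contains only a single value. That is an assumption about what was meant, not something confirmed in this thread, and the internals of post_imp_diag are not shown here. The sketch below reproduces that situation with base R's `stats::t.test()` on synthetic data; `safe_t_pvalue()` is a hypothetical guard, not a missCompare function.

```r
set.seed(42)
x <- rnorm(20)  # synthetic sample with enough observations
y <- 3.1        # synthetic sample with a single observation

# The default (Welch) t-test needs at least two observations per sample,
# so this call stops with an error, which is captured here as a message.
result <- tryCatch(
  stats::t.test(x, y),
  error = function(e) conditionMessage(e)
)
print(result)

# Hypothetical guard: return NA instead of stopping when either sample
# has fewer than two non-missing values.
safe_t_pvalue <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  if (length(x) < 2 || length(y) < 2) return(NA_real_)
  stats::t.test(x, y)$p.value
}

print(safe_t_pvalue(x, y))
```

Returning NA keeps the diagnostic visible as "not computable" rather than substituting a value such as the median.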
Hello Tirgit, I have a question. I am trying to run "impute_simulated", but I don't want to run all 16 MI methods; I want to choose only some of them. Can I do that? Thanks, Ahmad
Hi Ahmad, This is currently not possible: the function always runs all 16 methods. The next version of the package will make this an available option. For now, though, you can do this using impute_data(). Best,
Thanks for your reply; I will wait for the next version then :)
Kind regards,
Ahmed R. Al-Saber Ph.D. Candidate (University of Strathclyde)
On Dec 10, 2019, at 3:01 PM, Tibor V. Varga wrote:
Hi Ahmad,
This is currently not possible, you have to do all the 16 methods when you are running this function. The next version of the package will make this an available option.
For now, though, you can do this using impute_data().
Best,
Tibor
|
Thank you for making this package! I have data (N ~ 13000) with a high proportion of missing values, a monotone missingness pattern, and MNAR missingness (for Gender (~10%) and Ethnicity (~80%)). I converted all chr features to fct, created the cleaned and metadata objects, and everything worked fine, but when I tried to create the simulated object, I got the error in the title of this issue. I'm a little confused, because other users here attribute that error to having NAs in their data, but I thought that the 'simulated' step removes the NA values for you and basically normalizes your initial dataset. Have I misunderstood?
Thank you for putting together a great package!
I'm getting "infinite or missing values in 'x'" errors when I run the following data through the process:
https://www.kaggle.com/pradeeptripathi/predicting-house-prices-using-r/data
library(dplyr)  # provides %>% and mutate_if()
train <- data.frame(readr::read_csv('../data/train.csv'))
str(train)
# convert character columns to factors
train <- train %>% mutate_if(is.character, as.factor)
str(train)
cleaned <- missCompare::clean(train,
                              var_removal_threshold = 0.5,
                              ind_removal_threshold = 0.8,
                              missingness_coding = -9)
# make sure: run clean() a second time on the cleaned data
cleaned <- missCompare::clean(cleaned,
                              var_removal_threshold = 0.5,
                              ind_removal_threshold = 0.8,
                              missingness_coding = -9)
metadata <- missCompare::get_data(cleaned,
                                  matrixplot_sort = TRUE,
                                  plot_transform = TRUE)
Warning message:
In stats::cor(X, use = "pairwise.complete.obs", method = "pearson") :
the standard deviation is zero
simulated <- missCompare::simulate(rownum = metadata$Rows,
                                   colnum = metadata$Columns,
                                   cormat = metadata$Corr_matrix,
                                   meanval = 0,
                                   sdval = 1)
Error in eigen(if (doDykstra) R else Y, symmetric = TRUE) :
infinite or missing values in 'x'
I found two NAs in metadata$Corr_matrix, both for the Utilities/LotFrontage pair.
Not knowing exactly how to handle this, I just set them to zero (a hack):
colnames(metadata$Corr_matrix)[colSums(is.na(metadata$Corr_matrix)) > 0]
metadata$Corr_matrix[is.na(metadata$Corr_matrix)] <- 0
I can now restart at the simulate step,
but there has got to be a better way.
Shouldn't clean or get_data take care of this somehow?
Thanks again
Jack Hopkins
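For anyone trying to see where such NAs come from, here is a small base R sketch on synthetic data. It assumes, without having checked the real dataset, that the only row in which the near-constant column differs is also missing in the other column, so the pairwise-complete observations for that pair have zero standard deviation. The column names and data are made up for illustration.

```r
# Synthetic data: `utilities` differs from 1 in a single row, and that same
# row is missing in `lot_frontage`, so the complete pairs have no variation.
set.seed(7)
n <- 50
toy <- data.frame(
  utilities    = c(rep(1, n - 1), 2),
  lot_frontage = c(rnorm(n - 1, mean = 70, sd = 10), NA),
  lot_area     = rnorm(n, mean = 10000, sd = 2000)
)

# Same call as in the warning above; the zero-SD warning is suppressed here.
cm <- suppressWarnings(
  stats::cor(toy, use = "pairwise.complete.obs", method = "pearson")
)

# List each variable pair whose correlation could not be computed.
na_idx <- which(is.na(cm) & upper.tri(cm), arr.ind = TRUE)
na_pairs <- data.frame(
  var1 = rownames(cm)[na_idx[, "row"]],
  var2 = colnames(cm)[na_idx[, "col"]]
)
print(na_pairs)
```

Listing the affected pairs this way shows which column to inspect or drop, as an alternative to overwriting the NA entries with zero.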