Goodness of fit of a spatial error model #45
Please provide a minimal reproducible example of the problems you are seeing.

Much worse is trying to use CV with spatial data. Because of spatial dependence, information leaks from observation to observation, so the CV permutations are not protected against contagion. For CV to work as expected, the observations cannot leak. Despite a good deal of usage by consultants hyping ML, random permutation when spatial dependence between observations may be present is very insecure and not adequately studied. Model outcomes and explanations will be biased to an unknown degree by unmodelled spatial processes. See also papers on ML/CV challenges, e.g. from 2012: https://ieeexplore.ieee.org/document/6352393

I would never use R^2 under any circumstances. In this case it is a log-likelihood based Nagelkerke measure, but nobody knows how it performs. It is not appropriate to use y-hat, I think, except where the model is just a regular linear model with no spatial components and no other mis-specification problems. Likelihood ratio tests between nested models should be OK, but comparing non-nested models is not easy; maybe comparison of AIC or BIC/DIC.
It would help if you linked this issue to tidymodels/spatialsample#157 if they are related - I'm assuming they are, given how close in time they were created. I'd also prefer a good deal more context about your project, most helpfully as a minimal reproducible example.
@rsbivand thank you for your reply and all the explanation :) I am trying to explore whether I can use these models, and I still have to learn all the details and requirements behind them.
Thank you for your support
Please provide a minimal reproducible example. Hand-waving (just concepts) does not work.
Okay, I cannot provide you with the entire example, but I will try to make my questions clearer.

Background on the model building process: I started by building a classical OLS model on the whole dataset, and selected the model structure in terms of covariates based on a stepwise selection considering the AIC. The model I kept working on was the one with the lowest AIC. Before building a spatial regression I looked into Moran's I for different distances.
Based on Moran's I, I identified the groups of neighbours up to a distance of 50 km and built the neighbour weight matrix. With this matrix I then performed the Lagrange multiplier diagnostics for spatial dependence to see whether it makes sense to build a spatial error model or a spatial lag model. Both tests were significant, but the spatial error model had the higher LMErr value.
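A minimal sketch of this step might look like the following; note that the object names (`coords`, a matrix of point coordinates in decimal degrees, and `mod_ols`, a previously fitted `lm()` model) are placeholders, not taken from the thread:

```r
## Sketch only: `coords` and `mod_ols` are assumed to exist already.
library(spdep)

nb <- dnearneigh(coords, d1 = 0, d2 = 50, longlat = TRUE)  # neighbours within 50 km
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)        # row-standardised weights

## Lagrange multiplier diagnostics for spatial dependence on the OLS fit
lm.LMtests(mod_ols, listw = lw, test = c("LMerr", "LMlag", "RLMerr", "RLMlag"))
```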
Given that the spatial error model had the lowest AIC I kept this one and, following your suggestion, compared it with the Spatial Durbin Error Model (the nested model),
and compared it with the spatial error model using the likelihood ratio test.
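A hedged sketch of that comparison (the formula and data names are placeholders; the helper is `LR.Sarlm` in recent spatialreg versions, `LR.sarlm` in older ones):

```r
## Sketch only: `dat` and `lw` (a listw weights object) are assumed.
library(spatialreg)

sem  <- errorsarlm(y ~ x1 + x2, data = dat, listw = lw)                 # SEM
sdem <- errorsarlm(y ~ x1 + x2, data = dat, listw = lw, Durbin = TRUE)  # SDEM

## Likelihood ratio test: the SEM is nested in the SDEM
LR.Sarlm(sdem, sem)
```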
Based on these results it seems I should keep the spatial error model, and I am quite happy with that.

1. I extracted the residuals from the spatial error model to check for normality
and heteroskedasticity.
So here comes my first question: do the assumptions of standard OLS hold for a spatial regression model? E.g. according to https://www.emilyburchfield.org/courses/gsa/spatial_regression_lab, I guess that I should not have worried about the residuals (please confirm that).
Cross validation and RMSE

Consider three spatial clusters based on the points' spatial coordinates, using the k-means clustering algorithm.
Transform the spatial clustering object into a dataframe to apply the function.
Try out the function on one fold.
This is just an example, but I want to create k = 10 folds and compute the RMSE for every fold. What do you think about RMSE as a metric to validate the spatial error model?
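A sketch of the loop described above; every object name here is a placeholder (`dat` with response `y`, covariates, and coordinate columns `lon`/`lat`), and the caveats raised elsewhere in this thread apply: CV under spatial dependence is questionable, and SEM prediction carries no spatial term out of sample.

```r
## Sketch only; nothing here is taken verbatim from the thread.
set.seed(1)
coords <- as.matrix(dat[, c("lon", "lat")])
folds  <- kmeans(coords, centers = 10)$cluster  # spatial folds via k-means

rmse <- sapply(1:10, function(k) {
  train <- dat[folds != k, ]
  test  <- dat[folds == k, ]
  ## the neighbour list must be rebuilt for the training subset
  nb_tr <- spdep::dnearneigh(coords[folds != k, ], 0, 50, longlat = TRUE)
  lw_tr <- spdep::nb2listw(nb_tr, style = "W", zero.policy = TRUE)
  fit   <- spatialreg::errorsarlm(y ~ x1 + x2, data = train, listw = lw_tr)
  pred  <- predict(fit, newdata = test, listw = lw_tr)
  sqrt(mean((test$y - as.numeric(pred))^2))  # fold RMSE
})
mean(rmse)
```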
Thank you for this discussion
What are your data: count of observations and covariates? What is the support of the data, polygon or point? And what are you trying to model? In most cases, the relative fit of a model is less important than the coefficient estimates and the standard errors of those coefficients. One would effectively never try to cherry-pick independent variables: if the underlying model specifies them, they should remain. Prediction is not even a minor goal. In a SEM, the spatial dependence does not enter the prediction. No standard errors or intervals of prediction are available - see the article.
Hi @rsbivand, thanks for the rapid response. With regard to your answer, I am really sorry but I am afraid I do not follow you. Could you please clarify, addressing my two doubts about diagnostics (whether the residual assumptions should be checked) and validation (using the Nagelkerke pseudo-R-squared for the variance explained, and computing the RMSE with a spatial cross-validation approach)? Additionally, I think I am even more confused now :/ What do you mean by "In a SEM, the spatial dependency does not enter the prediction"?
Most of the purpose of spatial econometrics as originally conceived was to correct the standard error estimates of regression coefficients, which are biased when spatial autocorrelation is present but not modelled. Prediction was not discussed in any articles at all pre-2000, and only studied in any depth in https://doi.org/10.1080/17421772.2017.1300679. Haining (1990) does touch briefly on prediction, and also on diagnostics (influence plots, outliers), but there are no implementations in his work. Without software, no applications.

The general practice has been either to fit using GMM/STSLS or ML/Bayesian methods. The LM tests came as a supplement to Moran's I for regression residuals, and all the tests respond to any mis-specification, not just spatial autocorrelation. I inquired in https://doi.org/10.1007/s101090300096 about how prediction might be fashioned. The only trace that followed some time later was the realisation that impacts in the spatial lag and spatial Durbin models are not the same as in OLS: the change in y_i from a unit change in x_ij is not \beta_j but a function of \beta_j and the autoregressive coefficient \rho. In the spatial error model, there is no such feedback, and dy/dx_j is \beta_j.

If \beta_j differs between OLS and SEM, a Hausman test will show whether the difference is caused by further mis-specification: https://doi.org/10.1016/j.econlet.2008.09.003. LeSage & Pace propose that the SDM or SDEM would be better specified than the SEM if the null hypothesis of no difference between the OLS and SEM \beta is not accepted.

It has never been the usual practice to try to make fitted models fit better, because almost always the model is fitted to all the data. This may seem unlike other areas, especially of non-spatial statistics. In spatial statistics, only geostatistics has been concerned with prediction, but there not for model fitting as such.
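The Hausman test mentioned above is available directly on a fitted SEM in spatialreg; a hedged sketch, with all object names as placeholders:

```r
## Sketch only: `dat` and `lw` are assumed. If the SEM and OLS coefficient
## vectors differ significantly, further mis-specification is indicated,
## pointing towards an SDEM/SDM specification instead.
library(spatialreg)
sem <- errorsarlm(y ~ x1 + x2, data = dat, listw = lw)
Hausman.test(sem)
```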
It is possible to predict using spatial econometrics type models using INLA, but for your purposes it is not even clear why using spatial econometrics makes any sense. The link to an old class is out of date and does not reflect current practice; for moderately current practice see https://journal.srsa.org/ojs/index.php/RRS/article/view/44.1.2. See also https://doi.org/10.1111/jors.12188. Here too one would assume that the data entering the model satisfy all the assumptions of OLS, with the main exception of residual spatial autocorrelation.

I asked how many observations you have and how many variables are included in the model; I understand that the observations have point support and that the points are expressed in decimal degrees and cover the globe (figure above). There seem to be very many points, but you use the default estimation method for ML fitting, which is very likely highly sub-optimal.

Without knowing what you are trying to do, I fear that you are using the wrong tool completely. If you have global point support data, and want to find a model that predicts best for moderately large data sets, but have no theoretical model informing your choice of covariates, then spatial econometrics is probably inappropriate. I have no idea what the response is. I thought the data might be real estate prices, but they would not be global. Please tell me what you are trying to do, and I can see whether I can explain why your expectations of spatial econometrics models are inappropriate. This is not a problem with the software, which behaves as specified, but a problem with using spatial error models for purposes for which they were never intended.
Thank you for this comprehensive explanation.
Thank you for your feedback |
ML or Bayesian fitting may be OK if you choose a different method= argument value than the default; ML or Bayesian methods permit the use of goodness-of-fit comparisons like the likelihood ratio test, while GMM/STSLS do not.
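For example (a sketch; object names are placeholders), the default method = "eigen" computes all the eigenvalues of the weights matrix and scales badly with the number of observations, so for large point datasets a sparse log-determinant method is usually preferable:

```r
## Sparse-matrix log-determinant instead of the dense "eigen" default
sem <- spatialreg::errorsarlm(y ~ x1 + x2, data = dat, listw = lw,
                              method = "Matrix")  # alternatives include "LU"
```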
Thank you for your rapid response, |
I never rescale any variables; there is no reasonable motivation other than possibly avoiding numerical problems. If the natural scales of covariates or response lead to numerically small or large coefficients, I multiply or divide by a power of 10 (say, convert metres to kilometres). That keeps the coefficient values numerically stable and directly interpretable.
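A one-line illustration of the point (the column names are invented): dividing a covariate by 1000 simply multiplies its coefficient by 1000 and leaves the fit itself unchanged.

```r
## Same model, interpretable units: metres -> kilometres.
## `dat` with columns y and dist_m is a placeholder.
dat$dist_km <- dat$dist_m / 1000
coef(lm(y ~ dist_km, data = dat))["dist_km"]  # 1000 times the dist_m coefficient
```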
Thanks for your response and sharing your thoughts on this .. |
BTW, these articles predict using spatial econometrics models: https://www.nature.com/articles/s41598-017-08254-w. Also, re-reading the book you pointed me to, they actually use the predict function: https://r-spatial.org/book/17-Econometrics.html#sec-spateconpred. I apologize, but it's still not clear to me why it is not possible to predict using these models :/ :(
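For reference, a hedged sketch of what predict() returns for a fitted SEM (object names are placeholders): in-sample predictions can include a signal correction based on the fitted error process, while out-of-sample predictions are essentially the trend X*beta alone, and no standard errors or prediction intervals are reported.

```r
## Sketch only: `sem`, `newdat` and `lw` are assumed to exist.
p_in  <- predict(sem)                                # in-sample prediction
p_out <- predict(sem, newdata = newdat, listw = lw)  # out of sample: trend only
```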
You can fit Poisson and negative binomial models using many of the disease mapping packages (unsure about negative binomial). Most are Bayesian (CARBayes, etc.), but hglm can also be used. The big difference from spatial econometrics models (from Ord 1975, via Anselin 1988 to LeSage & Pace 2009 and onwards) is that SE models do not admit a spatially structured random effect in addition to the residuals. SE starts from y = Xb + e; disease mapping adds a spatially structured random effect u: y = Xb + u + e. In practice there isn't much difference, but the two traditions separated in the 1970s (Ord or Besag). You can see this in Dumelle et al.'s equation 2. spatialreg provides spatial econometrics-type model fitting functions, not because they are better, but because that is what was needed for teaching spatial econometrics. Lots is easier with the disease mapping-type models.
Yes, |
Thank you for your rapid response 😀 The predict issue is clearer now, and I was able to use it in the cross-validation and compute the RMSE as I posted above. One last doubt remains with regard to the pseudo R-squared: why do you say that it's just a number?
Yes, the pseudo R squared is a measure of improvement over the null model, but its scaling probably isn't comparable with the linear model coefficient of determination. And comparing across models isn't obvious. |
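One common formulation of the Nagelkerke measure, recomputed here from log-likelihoods as a hedged sketch (`sem` and the intercept-only null fit are placeholders; spatialreg reports a similar figure via summary(sem, Nagelkerke = TRUE)):

```r
## Sketch only: `sem` (fitted SEM) and `dat` are assumed.
n   <- length(residuals(sem))
ll1 <- as.numeric(logLik(sem))                    # fitted model
ll0 <- as.numeric(logLik(lm(y ~ 1, data = dat)))  # null (intercept-only) model

r2_cs <- 1 - exp(-(2 / n) * (ll1 - ll0))  # Cox-Snell improvement over the null
r2_n  <- r2_cs / (1 - exp((2 / n) * ll0)) # Nagelkerke rescaling to a [0, 1] range
```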
Got it .. |
Hi
I am looking into a way to validate a spatial error model I built.
On the one hand I would like to perform a cross-validation, but I can't predict on a dataset with different row IDs using the predict function, and I guess it is because of the neighbourhood matrix.
Is there a way to validate the model on a different dataset?
Also, is the R-squared that is computed a good measure of model fit in the case of a spatial error model?
Or do you suggest using other measures?
Thank you
Angela