diff --git a/DESCRIPTION b/DESCRIPTION index f1323ee..c2cf41b 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -5,9 +5,9 @@ Version: 1.0.0 Authors@R: c(person("Changwoo", "Lee", role=c("aut", "cre"), email="c.lee@stat.tamu.edu"), person("Eun Sug", "Park", role = c("aut"))) Author: Changwoo Lee[aut, cre], Eun Sug Park[aut] Maintainer: Changwoo Lee -Description: Scalable methods for fitting Bayesian linear and generalized linear models in the presence of spatial exposure measurement error, represented as a multivariate normal prior distribution. - These models typically arises from a two-stage Bayesian analysis of environmental exposures and health outcomes. - From a first-stage model, predictions of the covariate of interest (exposure) and their uncertainty information (typically contained in MCMC samples) are used to form a multivariate normal prior distribution for exposure in a second-stage regression model. +Description: Scalable methods for fitting Bayesian linear and generalized linear models in the presence of spatial exposure measurement error. + These models typically arise from a two-stage Bayesian analysis of environmental exposures and health outcomes. + From a first-stage model, predictions of the covariate of interest ("exposure") and their uncertainty information (typically contained in MCMC samples) are used to form a multivariate normal prior distribution for exposure in a second-stage regression model. The package provides implementation of the methods used in Lee et al. (2024) . License: GPL (>= 3) Encoding: UTF-8 diff --git a/LICENSE b/LICENSE deleted file mode 100644 index 0a1bdea..0000000 --- a/LICENSE +++ /dev/null @@ -1,2 +0,0 @@ -YEAR: 2024 -COPYRIGHT HOLDER: bspme authors diff --git a/R/bglm_me.R b/R/bglm_me.R index 1b43011..c69c380 100644 --- a/R/bglm_me.R +++ b/R/bglm_me.R @@ -1,6 +1,6 @@ #' Bayesian generalized linear models with spatial exposure measurement error. #' -#' This function fits a Bayesian generalized linear model in the presence of spatial exposure measurement error for covariate(s) \eqn{X}, represented as a multivariate normal prior. +#' This function fits a Bayesian generalized linear model in the presence of spatial exposure measurement error for covariate(s) \eqn{X}. #' One of the most important features of this function is that it allows a sparse matrix input for the prior precision matrix of \eqn{X} for scalable computation. #' As of version 1.0.0, only Bayesian logistic regression model is supported among GLMs, and #' function \code{bglm_me()} runs a Gibbs sampler to carry out posterior inference using Polya-Gamma augmentation (Polson et al., 2013). @@ -109,7 +109,7 @@ #' family = binomial(link = "logit"), #' nburn = 5000, #' nsave = 5000, -#' nthin = 5) +#' nthin = 1) #' fit$cputime #' summary(fit$posterior) #' library(bayesplot) diff --git a/R/blm_me.R b/R/blm_me.R index 1ecea05..935b2b9 100644 --- a/R/blm_me.R +++ b/R/blm_me.R @@ -1,30 +1,31 @@ #' Bayesian linear regression models with spatial exposure measurement error. #' -#' This function fits Bayesian linear regression model in the presence of spatial exposure measurement error of covariate(s) \eqn{X}, represented as a multivariate normal prior. -#' One of the most important features of this function is that it allows sparse matrix input for the prior precision matrix of \eqn{X} for scalable computation. -#' Function \code{blm_me()} runs a Gibbs sampler to carry out posterior inference; see below "Details" section for the model description, and Lee et al. (2024) for an application example in environmental epidemiology. +#' This function fits a Bayesian linear regression model in the presence of spatial exposure measurement error for covariate(s) \eqn{X}. +#' One of the most important features of this function is that it allows a sparse matrix input for the prior precision matrix of \eqn{X} for scalable computation. +#' Function \code{blm_me()} runs a Gibbs sampler to carry out posterior inference; see below "Details" section for the model description, and Lee et al. (2024) for an application example in environmental epidemiology. #' -#' Denote \eqn{Y_i} be a binary response, \eqn{X_i} be a \eqn{q\times 1} covariate that is subject to spatial exposure measurement error, -#' and \eqn{Z_i} be a \eqn{p\times 1} covariate without measurement error. -#' Consider normal linear regression model, +#' Denote \eqn{Y_i} be a continuous response, \eqn{X_i} be a \eqn{q\times 1} covariate vector that is subject to spatial exposure measurement error, +#' and \eqn{Z_i} be a \eqn{p\times 1} covariate vector without measurement error. +#' Consider a normal linear regression model, #' \deqn{Y_i = \beta_0 + X_i^\top \beta_X + Z_i^\top \beta_Z + \epsilon_i,\quad \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2_Y), \quad i=1,\dots,n,} -#' and spatial exposure measurement error of \eqn{X_i} for \eqn{i=1,\dots,n} is incorporated into the model as a multivariate normal prior. -#' For example when \eqn{q=1}, we have \eqn{n-}dimensional multivariate normal prior of \eqn{X = (X_1,\dots,X_n)^\top}, +#' where spatial exposure measurement error of \eqn{X_i} for \eqn{i=1,\dots,n} is incorporated into the model using a multivariate normal prior. +#' For example when \eqn{q=1}, we have an \eqn{n-}dimensional multivariate normal prior of \eqn{X = (X_1,\dots,X_n)^\top}, #' \deqn{(X_1,\dots,X_n)\sim N_n(\mu_X, Q_X^{-1}).} -#' Most importantly, it allows sparse matrix input for the prior precision matrix \eqn{Q_X} for scalable computation, which could be obtained by Vecchia approximation. +#' Most importantly, it allows a sparse matrix input for the prior precision matrix \eqn{Q_X} for scalable computation, which could be obtained by Vecchia approximation. +#' When \eqn{q>1}, \eqn{q} independent \eqn{n-}dimensional multivariate normal priors are assumed. #' #' We consider semiconjugate priors for regression coefficients and error variance, #' \deqn{\beta_0 \sim N(0, V_\beta), \quad \beta_{X,j} \stackrel{iid}{\sim} N(0, V_\beta), \quad \beta_{Z,k} \stackrel{iid}{\sim} N(0, V_\beta), \quad \sigma_Y^2 \sim IG(a_Y, b_Y).} -#' where \code{var_beta} corresponds to \eqn{V_\beta} and \code{a_Y}, \code{b_Y} corresponds to hyperparameters for \eqn{\sigma^2_Y}. +#' where \code{var_beta} corresponds to \eqn{V_\beta}, and \code{a_Y} and \code{b_Y} corresponds to hyperparameters of an inverse gamma prior for \eqn{\sigma^2_Y}. #' #' @param Y *vector<int>*, n by 1 continuous response vector. -#' @param X_mean *matrix<num>*, n by 1 prior mean matrix \eqn{\mu_X}. When there are q multiple exposures subject to measurement error, it can be a length q list of n by 1 matrices. -#' @param X_prec *matrix<num>*, n by 1 prior precision matrix \eqn{Q_X}, which allows sparse format from \link{Matrix} package. When there are q multiple exposures subject to measurement error, it can be a length q list of n by n matrices. -#' @param Z *matrix<num>*, n by p covariates that are not subject to measurement error. -#' @param nburn *integer*, burn-in iteration, default 1000. -#' @param nsave *integer*, number of posterior samples, default 1000. Total number of MCMC iteration is \code{nburn + nsave * nthin}. -#' @param nthin *integer*, thin-in rate, default 1. -#' @param prior *list*, list of prior parameters of regression model. Default is \code{list(var_beta = 100, a_Y = 0.01, b_Y = 0.01)}. +#' @param X_mean *vector<num>*, n by 1 prior mean vector \eqn{\mu_X}. When there are q multiple exposures subject to measurement error, it can be a length q list of n by 1 vectors. +#' @param X_prec *matrix<num>*, n by n prior precision matrix \eqn{Q_X}, which allows sparse format from \link{Matrix} package. When there are q multiple exposures subject to measurement error, it can be a length q list of n by n matrices. +#' @param Z *matrix<num>*, n by p matrix containing covariates that are not subject to measurement error. +#' @param nburn *integer*, number of burn-in iteration (default 5000). +#' @param nsave *integer*, number of posterior samples (default 5000). Total number of MCMC iteration is \code{nburn + nsave * nthin}. +#' @param nthin *integer*, thin-in rate (default 1). +#' @param prior *list*, list of prior parameters of the regression model. Default is \code{list(var_beta = 100, a_Y = 0.01, b_Y = 0.01)}. #' @param saveX *logical*, default FALSE, whether save posterior samples of X (exposure). #' #' @return list of the following: @@ -46,9 +47,9 @@ #' data(health_sim) #' library(fields) #' library(maps) -#' # Obtain the predicted exposure mean and covariance at health subject locations based on Jan 10, 2012 data -#' # using Gaussian process prior with mean zero and exponential covariance kernel -#' # with fixed range 5 (in miles) and stdev 0.5 (in ppbv) +#' # Obtain the predicted exposure mean and covariance at simulated health subject locations based on Jan 10, 2012 NO2 data +#' # using a Gaussian process prior with mean zero and exponential covariance kernel +#' # with fixed range 8 (in km) and standard deviation 1. #' #' # exposure data #' data_jan10 = NO2_Jan2012[NO2_Jan2012$date == as.POSIXct("2012-01-10"),] @@ -57,36 +58,36 @@ #' # health data #' coords_health = cbind(health_sim$lon, health_sim$lat) #' -#' distmat_xx <- rdist.earth(coords_monitor) -#' distmat_xy <- rdist.earth(coords_monitor, coords_health) -#' distmat_yy <- rdist.earth(coords_health) +#' distmat_xx <- rdist.earth(coords_monitor, miles = F) +#' distmat_xy <- rdist.earth(coords_monitor, coords_health, miles = F) +#' distmat_yy <- rdist.earth(coords_health, miles = F) #' -#' a = 5; sigma = 1; # assume known +#' a = 8; sigma = 1; # assume known #' #' Sigmaxx = fields::Matern(distmat_xx, smoothness = 0.5, range = a, phi = sigma^2) #' Sigmaxy = fields::Matern(distmat_xy, smoothness = 0.5, range = a, phi = sigma^2) #' Sigmayy = fields::Matern(distmat_yy, smoothness = 0.5, range = a, phi = sigma^2) #' -#' # posterior predictive mean and covariance of exposure at health data +#' # posterior predictive mean and covariance of exposure at health subject locations #' X_mean <- t(Sigmaxy) %*% solve(Sigmaxx, data_jan10$lnNO2) #' X_cov <- Sigmayy - t(Sigmaxy) %*% solve(Sigmaxx,Sigmaxy) # n_y by n_y #' #' # visualize #' # monitoring station exposure data #' quilt.plot(cbind(data_jan10$lon, data_jan10$lat), -#' data_jan10$lnNO2, main = "NO2 exposures (in log) at 21 stations", +#' data_jan10$lnNO2, main = "NO2 exposures (in log) at 21 monitoring stations", #' xlab = "longitude", ylab= "latitude", xlim = c(-96.5, -94.5), ylim = c(29, 30.5)) #' maps::map("county", "Texas", add = T) #' -#' # posterior predictive mean of exposure at health data +#' # posterior predictive mean of exposure at health subject locations #' quilt.plot(cbind(health_sim$lon, health_sim$lat), -#' X_mean, main = "posterior predictive mean of exposure at health data locations", +#' X_mean, main = "posterior predictive mean of exposure at health subject locations", #' xlab = "longitude", ylab= "latitude", xlim = c(-96.5, -94.5), ylim = c(29, 30.5)) #' maps::map("county", "Texas", add = T) #' -#' # posterior predictive sd of exposure at health data +#' # posterior predictive sd of exposure at health subject locations #' quilt.plot(cbind(health_sim$lon, health_sim$lat), -#' sqrt(diag(X_cov)), main = "posterior predictive sd of exposure at health data locations", +#' sqrt(diag(X_cov)), main = "posterior predictive sd of exposure at health subject locations", #' xlab = "longitude", ylab= "latitude", xlim = c(-96.5, -94.5), ylim = c(29, 30.5)) #' maps::map("county", "Texas", add = T) #' @@ -102,9 +103,9 @@ #' X_mean = X_mean, #' X_prec = Q_sparse, # sparse precision matrix #' Z = health_sim$Z, -#' nburn = 1000, -#' nsave = 1000, -#' nthin = 5) +#' nburn = 5000, +#' nsave = 5000, +#' nthin = 1) #' fit$cputime #' summary(fit$posterior) #' library(bayesplot) @@ -115,8 +116,8 @@ blm_me <- function(Y, X_mean, X_prec, Z, - nburn = 1000, - nsave = 1000, + nburn = 5000, + nsave = 5000, nthin = 1, prior = NULL, saveX = FALSE){ diff --git a/R/bspme-package.R b/R/bspme-package.R index e36fc40..74b3256 100644 --- a/R/bspme-package.R +++ b/R/bspme-package.R @@ -7,20 +7,24 @@ NULL #' @name NO2_Jan2012 -#' @title Daily average NO2 (nitrogen dioxide) measurements at Houston area +#' @title Daily average NO2 concentrations in the Harris County, Texas, in Jan 2012 #' #' @description -#' This dataset is +#' This dataset contains daily average NO2 (nitrogen dioxide) concentrations measured at 21 monitoring sites around the Harris County, Texas, in January 2012. +#' The raw data were obtained from the Texas Commission on Environmental Quality (TCEQ); see, e.g., . +#' From the hourly average NO2 raw data, the daily average NO2 concentrations are calculated by taking the average of 24 hourly averages. +#' If there are any missing data, the daily average NO2 concentrations are imputed from the average of 3-nearest neighboring monitoring station that are non-missing. The imputation is done separately for each day. +#' Finally, the natural logarithm of daily average NO2 concentrations are calculated. #' #' @usage data(NO2_Jan2012) #' -#' @format A data frame with 5963 rows and 6 variables: +#' @format A data frame with 651 (21 sites x 31 days) rows and 5 variables: #' \describe{ -#' \item{date}{POSIXct, date} -#' \item{site_name}{character, monitoring station name} -#' \item{lon}{numeric, longitude of monitoring station} -#' \item{lat}{numeric, latitude of monitoring station} -#' \item{lnNO2}{numeric, natural logarithm of daily average NO2 concentrations, in parts per billion by volume (ppbv)} +#' \item{date}{date in POSIXct format} +#' \item{site_name}{monitoring station name} +#' \item{lon}{monitoring station longitude} +#' \item{lat}{monitoring station latitude} +#' \item{lnNO2}{natural logarithm of daily average NO2 concentrations, in parts per billion by volume (ppbv)} #' } NULL @@ -33,13 +37,13 @@ NULL #' #' @usage data(health_sim) #' -#' @format A data frame with n = 3000 rows and 4 variables: +#' @format A data frame with n = 2000 rows and 6 variables: #' \describe{ -#' \item{Y}{n by 1 matrix, numeric, simulated continuous health outcome} -#' \item{Ybinary}{n by 1 matrix, numeric, simulated binary health outcome} -#' \item{lon}{n by 1 matrix, numeric, simulated health subject longitude} -#' \item{lat}{n by 1 matrix, numeric, simulated health subject latitude} -#' \item{Z}{n by 1 matrix, numeric, covariate} -#' \item{X_true}{n by 1 matrix, numeric, true ozone exposure used for simulation} +#' \item{Y}{simulated continuous health outcome} +#' \item{Ybinary}{simulated binary health outcome} +#' \item{lon}{simulated health subject longitude} +#' \item{lat}{simulated health subject latitude} +#' \item{Z}{simulated covariate (p=1) that is not subject to measurement error} +#' \item{X_true}{true ln(NO2) exposure used for simulating health outcome} #' } NULL diff --git a/R/vecchia_cov.R b/R/vecchia_cov.R index b83bb5e..c8412c1 100644 --- a/R/vecchia_cov.R +++ b/R/vecchia_cov.R @@ -2,7 +2,7 @@ #' #' Given a multivariate normal (MVN) distribution with covariance matrix \eqn{\Sigma}, #' this function finds a sparse precision matrix (inverse covariance) \eqn{Q} based on the Vecchia approximation (Vecchia 1988, Katzfuss and Guinness 2021), -#' where \eqn{N(\mu, Q^{-1})} is the sparse MVN that approximates original MVN \eqn{N(\mu, \Sigma)}. +#' where \eqn{N(\mu, Q^{-1})} is the sparse MVN that approximates the original MVN \eqn{N(\mu, \Sigma)}. #' The algorithm is based on the pseudocode 2 of Finley et al. (2019). #' #' @param Sigma *matrix<num>*, n by n covariance matrix @@ -16,7 +16,7 @@ #' \item{Q}{The n by n sparse precision matrix in \link{Matrix} format} #' \item{ord}{The ordering used for Vecchia approximation} #' \item{cputime}{time taken to run Vecchia approximation} -#' \item{KLdiv}{ (if \code{KLdiv = TRUE}) KL divergence \eqn{D_{KL}(p || \tilde{p})} where \eqn{p} is multivariate normal with original covariance matrix and \eqn{\tilde{p}} is the approximated multivariate normal with sparse precision matrix.} +#' \item{KLdiv}{ (if \code{KLdiv = TRUE}) KL divergence \eqn{D_{KL}(p || \tilde{p})} where \eqn{p} is the multivariate normal with original covariance matrix and \eqn{\tilde{p}} is the approximated multivariate normal with a sparse precision matrix.} #' } #' #' @references Vecchia, A. V. (1988). Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society Series B: Statistical Methodology, 50(2), 297-312. diff --git a/README.Rmd b/README.Rmd index cdea924..7df4732 100644 --- a/README.Rmd +++ b/README.Rmd @@ -74,8 +74,8 @@ To see function description in R environment, run the following lines: | Dataset call | Description | | ---------------------- | -------------------------------------------------------------------------| -| `data("NO2_Jan2012")` | January 2012 daily average NO2 exposure data at Houston, Texas | -| `data("health_sim")` | Simulated health data corresponding to ozone exposure data | +| `data("NO2_Jan2012")` | January 2012 daily average NO2 exposure data at Houston, Texas | +| `data("health_sim")` | Simulated health data associated with log(NO2) concentration on Jan 10, 2012 | ## Examples diff --git a/README.md b/README.md index 33abc8f..965022d 100644 --- a/README.md +++ b/README.md @@ -76,10 +76,10 @@ To see function description in R environment, run the following lines: ## datasets -| Dataset call | Description | -|-----------------------|----------------------------------------------------------------| -| `data("NO2_Jan2012")` | January 2012 daily average NO2 exposure data at Houston, Texas | -| `data("health_sim")` | Simulated health data corresponding to ozone exposure data | +| Dataset call | Description | +|-----------------------|------------------------------------------------------------------------------| +| `data("NO2_Jan2012")` | January 2012 daily average NO2 exposure data at Houston, Texas | +| `data("health_sim")` | Simulated health data associated with log(NO2) concentration on Jan 10, 2012 | ## Examples diff --git a/man/NO2_Jan2012.Rd b/man/NO2_Jan2012.Rd index da7b770..a51573f 100644 --- a/man/NO2_Jan2012.Rd +++ b/man/NO2_Jan2012.Rd @@ -2,20 +2,24 @@ % Please edit documentation in R/bspme-package.R \name{NO2_Jan2012} \alias{NO2_Jan2012} -\title{Daily average NO2 (nitrogen dioxide) measurements at Houston area} +\title{Daily average NO2 concentrations in the Harris County, Texas, in Jan 2012} \format{ -A data frame with 5963 rows and 6 variables: +A data frame with 651 (21 sites x 31 days) rows and 5 variables: \describe{ -\item{date}{POSIXct, date} -\item{site_name}{character, monitoring station name} -\item{lon}{numeric, longitude of monitoring station} -\item{lat}{numeric, latitude of monitoring station} -\item{lnNO2}{numeric, natural logarithm of daily average NO2 concentrations, in parts per billion by volume (ppbv)} +\item{date}{date in POSIXct format} +\item{site_name}{monitoring station name} +\item{lon}{monitoring station longitude} +\item{lat}{monitoring station latitude} +\item{lnNO2}{natural logarithm of daily average NO2 concentrations, in parts per billion by volume (ppbv)} } } \usage{ data(NO2_Jan2012) } \description{ -This dataset is +This dataset contains daily average NO2 (nitrogen dioxide) concentrations measured at 21 monitoring sites around the Harris County, Texas, in January 2012. +The raw data were obtained from the Texas Commission on Environmental Quality (TCEQ); see, e.g., \url{https://www.tceq.texas.gov/cgi-bin/compliance/monops/daily_summary.pl}. +From the hourly average NO2 raw data, the daily average NO2 concentrations are calculated by taking the average of 24 hourly averages. +If there are any missing data, the daily average NO2 concentrations are imputed from the average of 3-nearest neighboring monitoring station that are non-missing. The imputation is done separately for each day. +Finally, the natural logarithm of daily average NO2 concentrations are calculated. } diff --git a/man/bglm_me.Rd b/man/bglm_me.Rd index 131968b..dcff301 100644 --- a/man/bglm_me.Rd +++ b/man/bglm_me.Rd @@ -10,8 +10,8 @@ bglm_me( X_prec, Z, family = binomial(link = "logit"), - nburn = 1000, - nsave = 1000, + nburn = 5000, + nsave = 5000, nthin = 1, prior = NULL, saveX = FALSE @@ -20,21 +20,21 @@ bglm_me( \arguments{ \item{Y}{\emph{vector}, n by 1 binary response vector.} -\item{X_mean}{\emph{matrix}, n by 1 prior mean matrix \eqn{\mu_X}. When there are q multiple exposures subject to measurement error, it can be a length q list of n by 1 matrices.} +\item{X_mean}{\emph{vector}, n by 1 prior mean vector \eqn{\mu_X}. When there are q multiple exposures subject to measurement error, it can be a length q list of n by 1 vectors.} -\item{X_prec}{\emph{matrix}, n by 1 prior precision matrix \eqn{Q_X}, which allows sparse format from \link{Matrix} package. When there are q multiple exposures subject to measurement error, it can be a length q list of n by n matrices.} +\item{X_prec}{\emph{matrix}, n by n prior precision matrix \eqn{Q_X}, which allows sparse format from \link{Matrix} package. When there are q multiple exposures subject to measurement error, it can be a length q list of n by n matrices.} -\item{Z}{\emph{matrix}, n by p covariates that are not subject to measurement error.} +\item{Z}{\emph{matrix}, n by p matrix containing covariates that are not subject to measurement error.} \item{family}{\emph{class \link[stats]{family}}, a description of the error distribution and link function to be used in the model. Currently only supports binomial(link = "logit").} -\item{nburn}{\emph{integer}, burn-in iteration, default 1000.} +\item{nburn}{\emph{integer}, number of burn-in iteration (default 5000).} -\item{nsave}{\emph{integer}, number of posterior samples, default 1000. Total number of MCMC iteration is \code{nburn + nsave * nthin}.} +\item{nsave}{\emph{integer}, number of posterior samples (default 5000). Total number of MCMC iteration is \code{nburn + nsave * nthin}.} -\item{nthin}{\emph{integer}, thin-in rate, default 1.} +\item{nthin}{\emph{integer}, thin-in rate (default 1).} -\item{prior}{\emph{list}, list of prior parameters of regression model. Default is \code{list(var_beta = 100)}.} +\item{prior}{\emph{list}, list of prior parameters of the regression model. Default is \code{list(var_beta = 100)}.} \item{saveX}{\emph{logical}, default FALSE, whether save posterior samples of X (exposure).} } @@ -47,21 +47,22 @@ List of the following: } } \description{ -This function fits Bayesian generalized linear model in the presence of spatial exposure measurement error of covariate(s) \eqn{X}, represented as a multivariate normal prior. -One of the most important features of this function is that it allows sparse matrix input for the prior precision matrix of \eqn{X} for scalable computation. +This function fits a Bayesian generalized linear model in the presence of spatial exposure measurement error for covariate(s) \eqn{X}. +One of the most important features of this function is that it allows a sparse matrix input for the prior precision matrix of \eqn{X} for scalable computation. As of version 1.0.0, only Bayesian logistic regression model is supported among GLMs, and function \code{bglm_me()} runs a Gibbs sampler to carry out posterior inference using Polya-Gamma augmentation (Polson et al., 2013). -See below "Details" section for the model description, and Lee et al. (2024) for an application example in environmental epidemiology. +See below "Details" section for the model description and Lee et al. (2024) for an application example in environmental epidemiology. } \details{ -Denote \eqn{Y_i} be a binary response, \eqn{X_i} be a \eqn{q\times 1} covariate that is subject to spatial exposure measurement error, -and \eqn{Z_i} be a \eqn{p\times 1} covariate without measurement error. -Consider logistic regression model, independently for each \eqn{i=1,\dots,n,} +Denote \eqn{Y_i} be a binary response, \eqn{X_i} be a \eqn{q\times 1} covariate vector that is subject to spatial exposure measurement error, +and \eqn{Z_i} be a \eqn{p\times 1} covariate vector without measurement error. +Consider a logistic regression model, independently for each \eqn{i=1,\dots,n,} \deqn{\log(Pr(Y_i = 1)/Pr(Y_i = 0)) = \beta_0 + X_i^\top \beta_X + Z_i^\top \beta_Z,} -and spatial exposure measurement error of \eqn{X_i} for \eqn{i=1,\dots,n} is incorporated into the model as a multivariate normal prior. -For example when \eqn{q=1}, we have \eqn{n-}dimensional multivariate normal prior of \eqn{X = (X_1,\dots,X_n)^\top}, +and spatial exposure measurement error of \eqn{X_i} for \eqn{i=1,\dots,n} is incorporated into the model using a multivariate normal prior. +For example when \eqn{q=1}, we have an \eqn{n-}dimensional multivariate normal prior of \eqn{X = (X_1,\dots,X_n)^\top}, \deqn{(X_1,\dots,X_n)\sim N_n(\mu_X, Q_X^{-1}).} -Most importantly, it allows sparse matrix input for the prior precision matrix \eqn{Q_X} for scalable computation, which could be obtained by Vecchia approximation. +Most importantly, it allows a sparse matrix input for the prior precision matrix \eqn{Q_X} for scalable computation, which could be obtained by Vecchia approximation. +When \eqn{q>1}, \eqn{q} independent \eqn{n-}dimensional multivariate normal priors are assumed. We consider normal priors for regression coefficients, \deqn{\beta_0 \sim N(0, V_\beta), \quad \beta_{X,j} \stackrel{iid}{\sim} N(0, V_\beta), \quad \beta_{Z,k} \stackrel{iid}{\sim} N(0, V_\beta)} @@ -74,9 +75,9 @@ data(NO2_Jan2012) data(health_sim) library(fields) library(maps) -# Obtain the predicted exposure mean and covariance at health subject locations based on Jan 10, 2012 data -# using Gaussian process prior with mean zero and exponential covariance kernel -# with fixed range 5 (in miles) and stdev 0.5 (in ppbv) +# Obtain the predicted exposure mean and covariance at simulated health subject locations based on Jan 10, 2012 NO2 data +# using a Gaussian process prior with mean zero and exponential covariance kernel +# with fixed range 8 (in km) and standard deviation 1. # exposure data data_jan10 = NO2_Jan2012[NO2_Jan2012$date == as.POSIXct("2012-01-10"),] @@ -85,36 +86,36 @@ coords_monitor = cbind(data_jan10$lon, data_jan10$lat) # health data coords_health = cbind(health_sim$lon, health_sim$lat) -distmat_xx <- rdist.earth(coords_monitor) -distmat_xy <- rdist.earth(coords_monitor, coords_health) -distmat_yy <- rdist.earth(coords_health) +distmat_xx <- rdist.earth(coords_monitor, miles = F) +distmat_xy <- rdist.earth(coords_monitor, coords_health, miles = F) +distmat_yy <- rdist.earth(coords_health, miles = F) -a = 5; sigma = 1; # assume known +a = 8; sigma = 1; # assume known Sigmaxx = fields::Matern(distmat_xx, smoothness = 0.5, range = a, phi = sigma^2) Sigmaxy = fields::Matern(distmat_xy, smoothness = 0.5, range = a, phi = sigma^2) Sigmayy = fields::Matern(distmat_yy, smoothness = 0.5, range = a, phi = sigma^2) -# posterior predictive mean and covariance of exposure at health data +# posterior predictive mean and covariance of exposure at health subject locations X_mean <- t(Sigmaxy) \%*\% solve(Sigmaxx, data_jan10$lnNO2) X_cov <- Sigmayy - t(Sigmaxy) \%*\% solve(Sigmaxx,Sigmaxy) # n_y by n_y # visualize # monitoring station exposure data quilt.plot(cbind(data_jan10$lon, data_jan10$lat), - data_jan10$lnNO2, main = "NO2 exposures (in log) at 21 stations", + data_jan10$lnNO2, main = "NO2 exposures (in log) at 21 monitoring stations", xlab = "longitude", ylab= "latitude", xlim = c(-96.5, -94.5), ylim = c(29, 30.5)) maps::map("county", "Texas", add = T) -# posterior predictive mean of exposure at health data +# posterior predictive mean of exposure at health subject locations quilt.plot(cbind(health_sim$lon, health_sim$lat), - X_mean, main = "posterior predictive mean of exposure at health data locations", + X_mean, main = "posterior predictive mean of exposure at health subject locations", xlab = "longitude", ylab= "latitude", xlim = c(-96.5, -94.5), ylim = c(29, 30.5)) maps::map("county", "Texas", add = T) -# posterior predictive sd of exposure at health data +# posterior predictive sd of exposure at health subject locations quilt.plot(cbind(health_sim$lon, health_sim$lat), - sqrt(diag(X_cov)), main = "posterior predictive sd of exposure at health data locations", + sqrt(diag(X_cov)), main = "posterior predictive sd of exposure at health subject locations", xlab = "longitude", ylab= "latitude", xlim = c(-96.5, -94.5), ylim = c(29, 30.5)) maps::map("county", "Texas", add = T) @@ -125,15 +126,15 @@ run_vecchia = vecchia_cov(X_cov, coords = cbind(health_sim$lon, health_sim$lat), Q_sparse = run_vecchia$Q run_vecchia$cputime -# fit the model, continuous data +# fit the model, binary response fit = bglm_me(Y = health_sim$Ybinary, X_mean = X_mean, X_prec = Q_sparse, # sparse precision matrix Z = health_sim$Z, family = binomial(link = "logit"), - nburn = 1000, - nsave = 1000, - nthin = 5) + nburn = 5000, + nsave = 5000, + nthin = 1) fit$cputime summary(fit$posterior) library(bayesplot) diff --git a/man/blm_me.Rd b/man/blm_me.Rd index b99ba3e..d9d9729 100644 --- a/man/blm_me.Rd +++ b/man/blm_me.Rd @@ -9,8 +9,8 @@ blm_me( X_mean, X_prec, Z, - nburn = 1000, - nsave = 1000, + nburn = 5000, + nsave = 5000, nthin = 1, prior = NULL, saveX = FALSE @@ -19,19 +19,19 @@ blm_me( \arguments{ \item{Y}{\emph{vector}, n by 1 continuous response vector.} -\item{X_mean}{\emph{matrix}, n by 1 prior mean matrix \eqn{\mu_X}. When there are q multiple exposures subject to measurement error, it can be a length q list of n by 1 matrices.} +\item{X_mean}{\emph{vector}, n by 1 prior mean vector \eqn{\mu_X}. When there are q multiple exposures subject to measurement error, it can be a length q list of n by 1 vectors.} -\item{X_prec}{\emph{matrix}, n by 1 prior precision matrix \eqn{Q_X}, which allows sparse format from \link{Matrix} package. When there are q multiple exposures subject to measurement error, it can be a length q list of n by n matrices.} +\item{X_prec}{\emph{matrix}, n by n prior precision matrix \eqn{Q_X}, which allows sparse format from \link{Matrix} package. When there are q multiple exposures subject to measurement error, it can be a length q list of n by n matrices.} -\item{Z}{\emph{matrix}, n by p covariates that are not subject to measurement error.} +\item{Z}{\emph{matrix}, n by p matrix containing covariates that are not subject to measurement error.} -\item{nburn}{\emph{integer}, burn-in iteration, default 1000.} +\item{nburn}{\emph{integer}, number of burn-in iteration (default 5000).} -\item{nsave}{\emph{integer}, number of posterior samples, default 1000. Total number of MCMC iteration is \code{nburn + nsave * nthin}.} +\item{nsave}{\emph{integer}, number of posterior samples (default 5000). Total number of MCMC iteration is \code{nburn + nsave * nthin}.} -\item{nthin}{\emph{integer}, thin-in rate, default 1.} +\item{nthin}{\emph{integer}, thin-in rate (default 1).} -\item{prior}{\emph{list}, list of prior parameters of regression model. Default is \code{list(var_beta = 100, a_Y = 0.01, b_Y = 0.01)}.} +\item{prior}{\emph{list}, list of prior parameters of the regression model. Default is \code{list(var_beta = 100, a_Y = 0.01, b_Y = 0.01)}.} \item{saveX}{\emph{logical}, default FALSE, whether save posterior samples of X (exposure).} } @@ -44,23 +44,24 @@ list of the following: } } \description{ -This function fits Bayesian linear regression model in the presence of spatial exposure measurement error of covariate(s) \eqn{X}, represented as a multivariate normal prior. -One of the most important features of this function is that it allows sparse matrix input for the prior precision matrix of \eqn{X} for scalable computation. -Function \code{blm_me()} runs a Gibbs sampler to carry out posterior inference; see below "Details" section for the model description, and Lee et al. (2024) for an application example in environmental epidemiology. +This function fits a Bayesian linear regression model in the presence of spatial exposure measurement error for covariate(s) \eqn{X}. +One of the most important features of this function is that it allows a sparse matrix input for the prior precision matrix of \eqn{X} for scalable computation. +Function \code{blm_me()} runs a Gibbs sampler to carry out posterior inference; see below "Details" section for the model description, and Lee et al. (2024) for an application example in environmental epidemiology. } \details{ -Denote \eqn{Y_i} be a binary response, \eqn{X_i} be a \eqn{q\times 1} covariate that is subject to spatial exposure measurement error, -and \eqn{Z_i} be a \eqn{p\times 1} covariate without measurement error. -Consider normal linear regression model, +Denote \eqn{Y_i} be a continuous response, \eqn{X_i} be a \eqn{q\times 1} covariate vector that is subject to spatial exposure measurement error, +and \eqn{Z_i} be a \eqn{p\times 1} covariate vector without measurement error. +Consider a normal linear regression model, \deqn{Y_i = \beta_0 + X_i^\top \beta_X + Z_i^\top \beta_Z + \epsilon_i,\quad \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2_Y), \quad i=1,\dots,n,} -and spatial exposure measurement error of \eqn{X_i} for \eqn{i=1,\dots,n} is incorporated into the model as a multivariate normal prior. -For example when \eqn{q=1}, we have \eqn{n-}dimensional multivariate normal prior of \eqn{X = (X_1,\dots,X_n)^\top}, +where spatial exposure measurement error of \eqn{X_i} for \eqn{i=1,\dots,n} is incorporated into the model using a multivariate normal prior. +For example when \eqn{q=1}, we have an \eqn{n-}dimensional multivariate normal prior of \eqn{X = (X_1,\dots,X_n)^\top}, \deqn{(X_1,\dots,X_n)\sim N_n(\mu_X, Q_X^{-1}).} -Most importantly, it allows sparse matrix input for the prior precision matrix \eqn{Q_X} for scalable computation, which could be obtained by Vecchia approximation. +Most importantly, it allows a sparse matrix input for the prior precision matrix \eqn{Q_X} for scalable computation, which could be obtained by Vecchia approximation. +When \eqn{q>1}, \eqn{q} independent \eqn{n-}dimensional multivariate normal priors are assumed. We consider semiconjugate priors for regression coefficients and error variance, \deqn{\beta_0 \sim N(0, V_\beta), \quad \beta_{X,j} \stackrel{iid}{\sim} N(0, V_\beta), \quad \beta_{Z,k} \stackrel{iid}{\sim} N(0, V_\beta), \quad \sigma_Y^2 \sim IG(a_Y, b_Y).} -where \code{var_beta} corresponds to \eqn{V_\beta} and \code{a_Y}, \code{b_Y} corresponds to hyperparameters for \eqn{\sigma^2_Y}. +where \code{var_beta} corresponds to \eqn{V_\beta}, and \code{a_Y} and \code{b_Y} corresponds to hyperparameters of an inverse gamma prior for \eqn{\sigma^2_Y}. } \examples{ \dontrun{ @@ -69,9 +70,9 @@ data(NO2_Jan2012) data(health_sim) library(fields) library(maps) -# Obtain the predicted exposure mean and covariance at health subject locations based on Jan 10, 2012 data -# using Gaussian process prior with mean zero and exponential covariance kernel -# with fixed range 5 (in miles) and stdev 0.5 (in ppbv) +# Obtain the predicted exposure mean and covariance at simulated health subject locations based on Jan 10, 2012 NO2 data +# using a Gaussian process prior with mean zero and exponential covariance kernel +# with fixed range 8 (in km) and standard deviation 1. # exposure data data_jan10 = NO2_Jan2012[NO2_Jan2012$date == as.POSIXct("2012-01-10"),] @@ -80,36 +81,36 @@ coords_monitor = cbind(data_jan10$lon, data_jan10$lat) # health data coords_health = cbind(health_sim$lon, health_sim$lat) -distmat_xx <- rdist.earth(coords_monitor) -distmat_xy <- rdist.earth(coords_monitor, coords_health) -distmat_yy <- rdist.earth(coords_health) +distmat_xx <- rdist.earth(coords_monitor, miles = F) +distmat_xy <- rdist.earth(coords_monitor, coords_health, miles = F) +distmat_yy <- rdist.earth(coords_health, miles = F) -a = 5; sigma = 1; # assume known +a = 8; sigma = 1; # assume known Sigmaxx = fields::Matern(distmat_xx, smoothness = 0.5, range = a, phi = sigma^2) Sigmaxy = fields::Matern(distmat_xy, smoothness = 0.5, range = a, phi = sigma^2) Sigmayy = fields::Matern(distmat_yy, smoothness = 0.5, range = a, phi = sigma^2) -# posterior predictive mean and covariance of exposure at health data +# posterior predictive mean and covariance of exposure at health subject locations X_mean <- t(Sigmaxy) \%*\% solve(Sigmaxx, data_jan10$lnNO2) X_cov <- Sigmayy - t(Sigmaxy) \%*\% solve(Sigmaxx,Sigmaxy) # n_y by n_y # visualize # monitoring station exposure data quilt.plot(cbind(data_jan10$lon, data_jan10$lat), - data_jan10$lnNO2, main = "NO2 exposures (in log) at 21 stations", + data_jan10$lnNO2, main = "NO2 exposures (in log) at 21 monitoring stations", xlab = "longitude", ylab= "latitude", xlim = c(-96.5, -94.5), ylim = c(29, 30.5)) maps::map("county", "Texas", add = T) -# posterior predictive mean of exposure at health data +# posterior predictive mean of exposure at health subject locations quilt.plot(cbind(health_sim$lon, health_sim$lat), - X_mean, main = "posterior predictive mean of exposure at health data locations", + X_mean, main = "posterior predictive mean of exposure at health subject locations", xlab = "longitude", ylab= "latitude", xlim = c(-96.5, -94.5), ylim = c(29, 30.5)) maps::map("county", "Texas", add = T) -# posterior predictive sd of exposure at health data +# posterior predictive sd of exposure at health subject locations quilt.plot(cbind(health_sim$lon, health_sim$lat), - sqrt(diag(X_cov)), main = "posterior predictive sd of exposure at health data locations", + sqrt(diag(X_cov)), main = "posterior predictive sd of exposure at health subject locations", xlab = "longitude", ylab= "latitude", xlim = c(-96.5, -94.5), ylim = c(29, 30.5)) maps::map("county", "Texas", add = T) @@ -125,9 +126,9 @@ fit = blm_me(Y = health_sim$Y, X_mean = X_mean, X_prec = Q_sparse, # sparse precision matrix Z = health_sim$Z, - nburn = 1000, - nsave = 1000, - nthin = 5) + nburn = 5000, + nsave = 5000, + nthin = 1) fit$cputime summary(fit$posterior) library(bayesplot) diff --git a/man/bspme-package.Rd b/man/bspme-package.Rd index 0dcb06a..3c17eea 100644 --- a/man/bspme-package.Rd +++ b/man/bspme-package.Rd @@ -6,7 +6,7 @@ \alias{bspme-package} \title{bspme: Bayesian Spatial Measurement Error Models} \description{ -Scalable methods for fitting Bayesian linear and generalized linear models in the presence of spatial exposure measurement error, represented as a multivariate normal prior distribution. These models typically arises from two-stage Bayesian analysis of environmental exposures and health outcomes, where predictions of covariate of interest (exposure) from a first-stage model and their uncertainty information are used in a second-stage regression model. Related articles include Gryparis et al. (2009) (\url{https://doi.org/10.1093/biostatistics/kxn033}), Peng and Bell (2010) (\url{https://doi.org/10.1093/biostatistics/kxq017}), Chang, Peng, Dominici (2011) (\url{https://doi.org/10.1093/biostatistics/kxr002}), and Lee et al. (2024) (\url{https://arxiv.org/abs/2401.00634}). +Scalable methods for fitting Bayesian linear and generalized linear models in the presence of spatial exposure measurement error. These models typically arise from a two-stage Bayesian analysis of environmental exposures and health outcomes. From a first-stage model, predictions of the covariate of interest ("exposure") and their uncertainty information (typically contained in MCMC samples) are used to form a multivariate normal prior distribution for exposure in a second-stage regression model. The package provides implementation of the methods used in Lee et al. (2024) \url{https://arxiv.org/abs/2401.00634}. } \seealso{ Useful links: diff --git a/man/health_sim.Rd b/man/health_sim.Rd index 6c66177..4b89adb 100644 --- a/man/health_sim.Rd +++ b/man/health_sim.Rd @@ -4,14 +4,14 @@ \alias{health_sim} \title{Simulated health data} \format{ -A data frame with n = 3000 rows and 4 variables: +A data frame with n = 2000 rows and 6 variables: \describe{ -\item{Y}{n by 1 matrix, numeric, simulated continuous health outcome} -\item{Ybinary}{n by 1 matrix, numeric, simulated binary health outcome} -\item{lon}{n by 1 matrix, numeric, simulated health subject longitude} -\item{lat}{n by 1 matrix, numeric, simulated health subject latitude} -\item{Z}{n by 1 matrix, numeric, covariate} -\item{X_true}{n by 1 matrix, numeric, true ozone exposure used for simulation} +\item{Y}{simulated continuous health outcome} +\item{Ybinary}{simulated binary health outcome} +\item{lon}{simulated health subject longitude} +\item{lat}{simulated health subject latitude} +\item{Z}{simulated covariate (p=1) that is not subject to measurement error} +\item{X_true}{true ln(NO2) exposure used for simulating health outcome} } } \usage{ diff --git a/man/vecchia_cov.Rd b/man/vecchia_cov.Rd index 8cc0982..2a8f03b 100644 --- a/man/vecchia_cov.Rd +++ b/man/vecchia_cov.Rd @@ -23,13 +23,13 @@ list of the following: \item{Q}{The n by n sparse precision matrix in \link{Matrix} format} \item{ord}{The ordering used for Vecchia approximation} \item{cputime}{time taken to run Vecchia approximation} -\item{KLdiv}{ (if \code{KLdiv = TRUE}) KL divergence \eqn{D_{KL}(p || \tilde{p})} where \eqn{p} is multivariate normal with original covariance matrix and \eqn{\tilde{p}} is the approximated multivariate normal with sparse precision matrix.} +\item{KLdiv}{ (if \code{KLdiv = TRUE}) KL divergence \eqn{D_{KL}(p || \tilde{p})} where \eqn{p} is the multivariate normal with original covariance matrix and \eqn{\tilde{p}} is the approximated multivariate normal with a sparse precision matrix.} } } \description{ Given a multivariate normal (MVN) distribution with covariance matrix \eqn{\Sigma}, this function finds a sparse precision matrix (inverse covariance) \eqn{Q} based on the Vecchia approximation (Vecchia 1988, Katzfuss and Guinness 2021), -where \eqn{N(\mu, Q^{-1})} is the sparse MVN that approximates original MVN \eqn{N(\mu, \Sigma)}. +where \eqn{N(\mu, Q^{-1})} is the sparse MVN that approximates the original MVN \eqn{N(\mu, \Sigma)}. The algorithm is based on the pseudocode 2 of Finley et al. (2019). } \examples{