manual update

changwoo-lee · Jan 12, 2024 · 6043eab · 6043eab
1 parent 8319c1e
commit 6043eab
Show file tree

Hide file tree

Showing 14 changed files with 162 additions and 153 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -5,9 +5,9 @@ Version: 1.0.0
 Authors@R: c(person("Changwoo", "Lee", role=c("aut", "cre"), email="[email protected]"), person("Eun Sug", "Park", role = c("aut")))
 Author: Changwoo Lee[aut, cre], Eun Sug Park[aut]
 Maintainer: Changwoo Lee <[email protected]>
-Description: Scalable methods for fitting Bayesian linear and generalized linear models in the presence of spatial exposure measurement error, represented as a multivariate normal prior distribution.
-             These models typically arises from a two-stage Bayesian analysis of environmental exposures and health outcomes.
-             From a first-stage model, predictions of the covariate of interest (exposure) and their uncertainty information (typically contained in MCMC samples) are used to form a multivariate normal prior distribution for exposure in a second-stage regression model. 
+Description: Scalable methods for fitting Bayesian linear and generalized linear models in the presence of spatial exposure measurement error.
+             These models typically arise from a two-stage Bayesian analysis of environmental exposures and health outcomes.
+             From a first-stage model, predictions of the covariate of interest ("exposure") and their uncertainty information (typically contained in MCMC samples) are used to form a multivariate normal prior distribution for exposure in a second-stage regression model. 
              The package provides implementation of the methods used in Lee et al. (2024) <https://arxiv.org/abs/2401.00634>. 
 License: GPL (>= 3)
 Encoding: UTF-8

diff --git a/LICENSE b/LICENSE
diff --git a/R/bglm_me.R b/R/bglm_me.R
@@ -1,6 +1,6 @@
 #' Bayesian generalized linear models with spatial exposure measurement error.
 #'
-#' This function fits a Bayesian generalized linear model in the presence of spatial exposure measurement error for covariate(s) \eqn{X}, represented as a multivariate normal prior.
+#' This function fits a Bayesian generalized linear model in the presence of spatial exposure measurement error for covariate(s) \eqn{X}.
 #' One of the most important features of this function is that it allows a sparse matrix input for the prior precision matrix of \eqn{X} for scalable computation.
 #' As of version 1.0.0, only Bayesian logistic regression model is supported among GLMs, and
 #' function \code{bglm_me()} runs a Gibbs sampler to carry out posterior inference using Polya-Gamma augmentation (Polson et al., 2013).
@@ -109,7 +109,7 @@
 #'                 family = binomial(link = "logit"),
 #'                 nburn = 5000,
 #'                 nsave = 5000,
-#'                 nthin = 5)
+#'                 nthin = 1)
 #' fit$cputime
 #' summary(fit$posterior)
 #' library(bayesplot)

diff --git a/R/blm_me.R b/R/blm_me.R
@@ -1,30 +1,31 @@
 #' Bayesian linear regression models with spatial exposure measurement error.
 #'
-#' This function fits Bayesian linear regression model in the presence of spatial exposure measurement error of covariate(s) \eqn{X}, represented as a multivariate normal prior.
-#' One of the most important features of this function is that it allows sparse matrix input for the prior precision matrix of \eqn{X} for scalable computation.
-#' Function \code{blm_me()} runs a Gibbs sampler to carry out posterior inference; see below "Details" section for the model description, and Lee et al. (2024) for an application example in environmental epidemiology.
+#' This function fits a Bayesian linear regression model in the presence of spatial exposure measurement error for covariate(s) \eqn{X}.
+#' One of the most important features of this function is that it allows a sparse matrix input for the prior precision matrix of \eqn{X} for scalable computation.
+#' Function \code{blm_me()} runs a Gibbs sampler to carry out posterior inference;  see below "Details" section for the model description, and Lee et al. (2024) for an application example in environmental epidemiology.
 #'
-#' Denote \eqn{Y_i} be a binary response, \eqn{X_i} be a \eqn{q\times 1} covariate that is subject to spatial exposure measurement error,
-#' and \eqn{Z_i} be a \eqn{p\times 1} covariate without measurement error.
-#' Consider normal linear regression model,
+#' Denote \eqn{Y_i} be a continuous response, \eqn{X_i} be a \eqn{q\times 1} covariate vector that is subject to spatial exposure measurement error,
+#' and \eqn{Z_i} be a \eqn{p\times 1} covariate vector without measurement error.
+#' Consider a normal linear regression model,
 #' \deqn{Y_i = \beta_0 + X_i^\top \beta_X +  Z_i^\top \beta_Z + \epsilon_i,\quad  \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2_Y), \quad i=1,\dots,n,}
-#' and spatial exposure measurement error of \eqn{X_i} for \eqn{i=1,\dots,n} is incorporated into the model as a multivariate normal prior.
-#' For example when \eqn{q=1}, we have \eqn{n-}dimensional multivariate normal prior of \eqn{X = (X_1,\dots,X_n)^\top},
+#' where spatial exposure measurement error of \eqn{X_i} for \eqn{i=1,\dots,n} is incorporated into the model using a multivariate normal prior.
+#' For example when \eqn{q=1}, we have an \eqn{n-}dimensional multivariate normal prior of \eqn{X = (X_1,\dots,X_n)^\top},
 #' \deqn{(X_1,\dots,X_n)\sim N_n(\mu_X, Q_X^{-1}).}
-#' Most importantly, it allows sparse matrix input for the prior precision matrix \eqn{Q_X} for scalable computation, which could be obtained by Vecchia approximation.
+#' Most importantly, it allows a sparse matrix input for the prior precision matrix \eqn{Q_X} for scalable computation, which could be obtained by Vecchia approximation.
+#' When \eqn{q>1}, \eqn{q} independent \eqn{n-}dimensional multivariate normal priors are assumed.
 #'
 #' We consider semiconjugate priors for regression coefficients and error variance,
 #' \deqn{\beta_0 \sim N(0, V_\beta), \quad \beta_{X,j} \stackrel{iid}{\sim} N(0, V_\beta), \quad \beta_{Z,k} \stackrel{iid}{\sim} N(0, V_\beta), \quad \sigma_Y^2 \sim IG(a_Y, b_Y).}
-#' where \code{var_beta} corresponds to \eqn{V_\beta} and \code{a_Y}, \code{b_Y} corresponds to hyperparameters for \eqn{\sigma^2_Y}.
+#' where \code{var_beta} corresponds to \eqn{V_\beta}, and \code{a_Y} and \code{b_Y} corresponds to hyperparameters of an inverse gamma prior for \eqn{\sigma^2_Y}.
 #'
 #' @param Y *vector&lt;int&gt;*, n by 1 continuous response vector.
-#' @param X_mean *matrix&lt;num&gt;*, n by 1 prior mean matrix \eqn{\mu_X}. When there are q multiple exposures subject to measurement error, it can be a length q list of n by 1 matrices.
-#' @param X_prec *matrix&lt;num&gt;*, n by 1 prior precision matrix \eqn{Q_X}, which allows sparse format from \link{Matrix} package. When there are q multiple exposures subject to measurement error, it can be a length q list of n by n matrices.
-#' @param Z *matrix&lt;num&gt;*, n by p covariates that are not subject to measurement error.
-#' @param nburn *integer*, burn-in iteration, default 1000.
-#' @param nsave *integer*, number of posterior samples, default 1000. Total number of MCMC iteration is \code{nburn + nsave * nthin}.
-#' @param nthin *integer*, thin-in rate, default 1.
-#' @param prior *list*, list of prior parameters of regression model. Default is \code{list(var_beta = 100, a_Y = 0.01, b_Y = 0.01)}.
+#' @param X_mean *vector&lt;num&gt;*, n by 1 prior mean vector \eqn{\mu_X}. When there are q multiple exposures subject to measurement error, it can be a length q list of n by 1 vectors.
+#' @param X_prec *matrix&lt;num&gt;*, n by n prior precision matrix \eqn{Q_X}, which allows sparse format from \link{Matrix} package. When there are q multiple exposures subject to measurement error, it can be a length q list of n by n matrices.
+#' @param Z *matrix&lt;num&gt;*, n by p matrix containing covariates that are not subject to measurement error.
+#' @param nburn *integer*, number of burn-in iteration (default 5000).
+#' @param nsave *integer*, number of posterior samples (default 5000). Total number of MCMC iteration is \code{nburn + nsave * nthin}.
+#' @param nthin *integer*, thin-in rate (default 1).
+#' @param prior *list*, list of prior parameters of the regression model. Default is \code{list(var_beta = 100, a_Y = 0.01, b_Y = 0.01)}.
 #' @param saveX *logical*, default FALSE, whether save posterior samples of X (exposure).
 #'
 #' @return list of the following:
@@ -46,9 +47,9 @@
 #' data(health_sim)
 #' library(fields)
 #' library(maps)
-#' # Obtain the predicted exposure mean and covariance at health subject locations based on Jan 10, 2012 data
-#' # using Gaussian process prior with mean zero and exponential covariance kernel
-#' # with fixed range 5 (in miles) and stdev 0.5 (in ppbv)
+#' # Obtain the predicted exposure mean and covariance at simulated health subject locations based on Jan 10, 2012 NO2 data
+#' # using a Gaussian process prior with mean zero and exponential covariance kernel
+#' # with fixed range 8 (in km) and standard deviation 1.
 #'
 #' # exposure data
 #' data_jan10 = NO2_Jan2012[NO2_Jan2012$date == as.POSIXct("2012-01-10"),]
@@ -57,36 +58,36 @@
 #' # health data
 #' coords_health = cbind(health_sim$lon, health_sim$lat)
 #'
-#' distmat_xx <- rdist.earth(coords_monitor)
-#' distmat_xy <- rdist.earth(coords_monitor, coords_health)
-#' distmat_yy <- rdist.earth(coords_health)
+#' distmat_xx <- rdist.earth(coords_monitor, miles = F)
+#' distmat_xy <- rdist.earth(coords_monitor, coords_health, miles = F)
+#' distmat_yy <- rdist.earth(coords_health, miles = F)
 #'
-#' a = 5; sigma = 1; # assume known
+#' a = 8; sigma = 1; # assume known
 #'
 #' Sigmaxx = fields::Matern(distmat_xx, smoothness = 0.5, range = a, phi = sigma^2)
 #' Sigmaxy = fields::Matern(distmat_xy, smoothness = 0.5, range = a, phi = sigma^2)
 #' Sigmayy = fields::Matern(distmat_yy, smoothness = 0.5, range = a, phi = sigma^2)
 #'
-#' # posterior predictive mean and covariance of exposure at health data
+#' # posterior predictive mean and covariance of exposure at health subject locations
 #' X_mean <- t(Sigmaxy) %*% solve(Sigmaxx, data_jan10$lnNO2)
 #' X_cov <- Sigmayy - t(Sigmaxy) %*% solve(Sigmaxx,Sigmaxy) # n_y by n_y
 #'
 #' # visualize
 #' # monitoring station exposure data
 #' quilt.plot(cbind(data_jan10$lon, data_jan10$lat),
-#'            data_jan10$lnNO2, main = "NO2 exposures (in log) at 21 stations",
+#'            data_jan10$lnNO2, main = "NO2 exposures (in log) at 21 monitoring stations",
 #'            xlab = "longitude", ylab= "latitude", xlim = c(-96.5, -94.5), ylim = c(29, 30.5))
 #' maps::map("county", "Texas", add = T)
 #'
-#' # posterior predictive mean of exposure at health data
+#' # posterior predictive mean of exposure at health subject locations
 #' quilt.plot(cbind(health_sim$lon, health_sim$lat),
-#'            X_mean, main = "posterior predictive mean of exposure at health data locations",
+#'            X_mean, main = "posterior predictive mean of exposure at health subject locations",
 #'            xlab = "longitude", ylab= "latitude", xlim = c(-96.5, -94.5), ylim = c(29, 30.5))
 #' maps::map("county", "Texas", add = T)
 #'
-#' # posterior predictive sd of exposure at health data
+#' # posterior predictive sd of exposure at health subject locations
 #' quilt.plot(cbind(health_sim$lon, health_sim$lat),
-#'            sqrt(diag(X_cov)), main = "posterior predictive sd of exposure at health data locations",
+#'            sqrt(diag(X_cov)), main = "posterior predictive sd of exposure at health subject locations",
 #'            xlab = "longitude", ylab= "latitude", xlim = c(-96.5, -94.5), ylim = c(29, 30.5))
 #' maps::map("county", "Texas", add = T)
 #'
@@ -102,9 +103,9 @@
 #'                 X_mean = X_mean,
 #'                 X_prec = Q_sparse, # sparse precision matrix
 #'                 Z = health_sim$Z,
-#'                 nburn = 1000,
-#'                 nsave = 1000,
-#'                 nthin = 5)
+#'                 nburn = 5000,
+#'                 nsave = 5000,
+#'                 nthin = 1)
 #' fit$cputime
 #' summary(fit$posterior)
 #' library(bayesplot)
@@ -115,8 +116,8 @@ blm_me <- function(Y,
                       X_mean,
                       X_prec,
                       Z,
-                      nburn = 1000,
-                      nsave = 1000,
+                      nburn = 5000,
+                      nsave = 5000,
                       nthin = 1,
                       prior = NULL,
                       saveX = FALSE){

diff --git a/R/bspme-package.R b/R/bspme-package.R
@@ -7,20 +7,24 @@ NULL
 
 
 #' @name NO2_Jan2012
-#' @title Daily average NO2 (nitrogen dioxide) measurements at Houston area
+#' @title Daily average NO2 concentrations in the Harris County, Texas, in Jan 2012
 #'
 #' @description
-#' This dataset is
+#' This dataset contains daily average NO2 (nitrogen dioxide) concentrations measured at 21 monitoring sites around the Harris County, Texas, in January 2012.
+#' The raw data were obtained from the Texas Commission on Environmental Quality (TCEQ); see, e.g., <https://www.tceq.texas.gov/cgi-bin/compliance/monops/daily_summary.pl>.
+#' From the hourly average NO2 raw data, the daily average NO2 concentrations are calculated by taking the average of 24 hourly averages.
+#' If there are any missing data, the daily average NO2 concentrations are imputed from the average of 3-nearest neighboring monitoring station that are non-missing. The imputation is done separately for each day.
+#' Finally, the natural logarithm of daily average NO2 concentrations are calculated.
 #'
 #' @usage data(NO2_Jan2012)
 #'
-#' @format A data frame with 5963 rows and 6 variables:
+#' @format A data frame with 651 (21 sites x 31 days) rows and 5 variables:
 #' \describe{
-#'   \item{date}{POSIXct, date}
-#'   \item{site_name}{character, monitoring station name}
-#'   \item{lon}{numeric, longitude of monitoring station}
-#'   \item{lat}{numeric, latitude of monitoring station}
-#'   \item{lnNO2}{numeric, natural logarithm of daily average NO2 concentrations, in parts per billion by volume (ppbv)}
+#'   \item{date}{date in POSIXct format}
+#'   \item{site_name}{monitoring station name}
+#'   \item{lon}{monitoring station longitude}
+#'   \item{lat}{monitoring station latitude}
+#'   \item{lnNO2}{natural logarithm of daily average NO2 concentrations, in parts per billion by volume (ppbv)}
 #' }
 NULL
 
@@ -33,13 +37,13 @@ NULL
 #'
 #' @usage data(health_sim)
 #'
-#' @format A data frame with n = 3000 rows and 4 variables:
+#' @format A data frame with n = 2000 rows and 6 variables:
 #' \describe{
-#'   \item{Y}{n by 1 matrix, numeric, simulated continuous health outcome}
-#'   \item{Ybinary}{n by 1 matrix, numeric, simulated binary health outcome}
-#'   \item{lon}{n by 1 matrix, numeric, simulated health subject longitude}
-#'   \item{lat}{n by 1 matrix, numeric, simulated health subject latitude}
-#'   \item{Z}{n by 1 matrix, numeric, covariate}
-#'   \item{X_true}{n by 1 matrix, numeric, true ozone exposure used for simulation}
+#'   \item{Y}{simulated continuous health outcome}
+#'   \item{Ybinary}{simulated binary health outcome}
+#'   \item{lon}{simulated health subject longitude}
+#'   \item{lat}{simulated health subject latitude}
+#'   \item{Z}{simulated covariate (p=1) that is not subject to measurement error}
+#'   \item{X_true}{true ln(NO2) exposure used for simulating health outcome}
 #' }
 NULL
diff --git a/R/vecchia_cov.R b/R/vecchia_cov.R
@@ -2,7 +2,7 @@
 #'
 #' Given a multivariate normal (MVN) distribution with covariance matrix \eqn{\Sigma},
 #' this function finds a sparse precision matrix (inverse covariance) \eqn{Q} based on the Vecchia approximation (Vecchia 1988, Katzfuss and Guinness 2021),
-#' where \eqn{N(\mu, Q^{-1})} is the sparse MVN that approximates original MVN \eqn{N(\mu, \Sigma)}.
+#' where \eqn{N(\mu, Q^{-1})} is the sparse MVN that approximates the original MVN \eqn{N(\mu, \Sigma)}.
 #' The algorithm is based on the pseudocode 2 of Finley et al. (2019).
 #'
 #' @param Sigma *matrix&lt;num&gt;*, n by n covariance matrix
@@ -16,7 +16,7 @@
 #'   \item{Q}{The n by n sparse precision matrix in \link{Matrix} format}
 #'   \item{ord}{The ordering used for Vecchia approximation}
 #'   \item{cputime}{time taken to run Vecchia approximation}
-#'   \item{KLdiv}{ (if \code{KLdiv = TRUE}) KL divergence \eqn{D_{KL}(p || \tilde{p})} where \eqn{p} is multivariate normal with original covariance matrix and \eqn{\tilde{p}} is the approximated multivariate normal with sparse precision matrix.}
+#'   \item{KLdiv}{ (if \code{KLdiv = TRUE}) KL divergence \eqn{D_{KL}(p || \tilde{p})} where \eqn{p} is the multivariate normal with original covariance matrix and \eqn{\tilde{p}} is the approximated multivariate normal with a sparse precision matrix.}
 #' }
 #'
 #' @references Vecchia, A. V. (1988). Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society Series B: Statistical Methodology, 50(2), 297-312.

diff --git a/README.Rmd b/README.Rmd
@@ -74,8 +74,8 @@ To see function description in R environment, run the following lines:
 
 | Dataset call | Description                                                              |
 | ---------------------- | -------------------------------------------------------------------------|
-| `data("NO2_Jan2012")`              | January 2012 daily average NO2 exposure data at Houston, Texas |
-| `data("health_sim")`        | Simulated health data corresponding to ozone exposure data |
+| `data("NO2_Jan2012")`       | January 2012 daily average NO2 exposure data at Houston, Texas |
+| `data("health_sim")`        | Simulated health data associated with log(NO2) concentration on Jan 10, 2012 |
 
 ## Examples
 

diff --git a/README.md b/README.md
@@ -76,10 +76,10 @@ To see function description in R environment, run the following lines:
 
 ## datasets
 
-| Dataset call          | Description                                                    |
-|-----------------------|----------------------------------------------------------------|
-| `data("NO2_Jan2012")` | January 2012 daily average NO2 exposure data at Houston, Texas |
-| `data("health_sim")`  | Simulated health data corresponding to ozone exposure data     |
+| Dataset call          | Description                                                                  |
+|-----------------------|------------------------------------------------------------------------------|
+| `data("NO2_Jan2012")` | January 2012 daily average NO2 exposure data at Houston, Texas               |
+| `data("health_sim")`  | Simulated health data associated with log(NO2) concentration on Jan 10, 2012 |
 
 ## Examples
 

diff --git a/man/NO2_Jan2012.Rd b/man/NO2_Jan2012.Rd