Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scale invariance #42

Open
nlapier2 opened this issue Jan 22, 2024 · 3 comments
Open

Scale invariance #42

nlapier2 opened this issue Jan 22, 2024 · 3 comments

Comments

@nlapier2
Copy link

This is essentially the same as the following issue from mr.ash: stephenslab/mr.ash.alpha#7 (comment)

Using the same input data, I get very different results for

as.numeric(varbvs(X=x_betas, y=y_betas, Z=NULL, family='gaussian')$pip)

versus

as.numeric(varbvs(X=x_betas*0.01, y=y_betas*0.01, Z=NULL, family='gaussian')$pip)

@nlapier2 nlapier2 changed the title VarBVS -- scale invariance Scale invariance Jan 22, 2024
@pcarbo
Copy link
Owner

pcarbo commented Jan 23, 2024

@nlapier2 It is clear this isn't scale invariant with the default settings, but perhaps with the right settings it will be scale invariant. Could you send me x_betas and y_betas to use as a test?

@nlapier2
Copy link
Author

I would suggest the ones that are generated using the code in the linked issue. If that doesn't work, let me know and I can generate some new ones.

@pcarbo
Copy link
Owner

pcarbo commented Jan 24, 2024

@nlapier2 Try this:

set.seed(1)
eff_size <- 0.1
N <- 1e4
M <- 100
K <- 30
theta_gx <- matrix(rnorm(M*K),M,K)
theta_xy <- c(rep(eff_size,4),rep(0,K-4))
G <- matrix(rnorm(N*M),N,M)
X <- G %*% theta_gx + matrix(rnorm(N*K),N,K)
Y <- X %*% theta_xy + rnorm(N)

get_betas_stderrs <- function(G, X, Y) {
  num_snps = dim(G)[2]
  num_exposures = dim(X)[2]
  y_betas = integer(num_snps)
  y_stderrs = integer(num_snps)
  x_betas = matrix(0, num_exposures, num_snps)
  x_stderrs = matrix(0, num_exposures, num_snps)
  for (i in 1:num_snps) {
    Gi = G[, i]
    Gi = (Gi - mean(Gi)) / sd(Gi)  # standardize genotypes
    res_gy = summary(lm(Y ~ Gi))
    y_betas[i] = res_gy$coefficients[2,1]
    y_stderrs[i] = as.numeric(coef(res_gy)[, "Std. Error"][2])
    for (j in 1:num_exposures) {
      res_gx = summary(lm(X[, j] ~ Gi))
      x_betas[j,i] = res_gx$coefficients[2,1]
      x_stderrs[j,i] = as.numeric(coef(res_gx)[, "Std. Error"][2])
    }
  }
  x_betas = t(x_betas)
  x_stderrs = t(x_stderrs)
  return(list("x_betas" = x_betas, "x_stderrs" = x_stderrs, "y_betas" = y_betas,
              "y_stderrs" = y_stderrs))
}

set.seed(1)
fit1 <- varbvs(X = x_betas,y = y_betas,Z = NULL,family = "gaussian",
               sa = 1,update.sa = TRUE,update.sigma = TRUE,tol = 1e-8)
fit2 <- varbvs(X = 0.1*x_betas,y = 0.1*y_betas,Z = NULL,
               family = "gaussian",sa = 1,update.sa = TRUE,
               update.sigma = TRUE,tol = 1e-8)
plot(fit1$pip,fit2$pip,pch = 20,xlim = c(0,1),ylim = c(0,1))
abline(a = 0,b = 1,col = "magenta",lty = "dotted")

The results are more consistent but there are still clearly differences.

This particular data set seems to represent an "edge case" where varbvs, perhaps due to the way the model is parameterized, has difficult with the optimization of the model parameters.

My general recommendation would be to help varbvs when you call it by rescaling the data X, y so that, say, the column variances of X are closer to 1.

You should also compare to mr.ash to see if it does a better job in this case handling the rescaling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants