Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NA values in results of assocTestSingle #116

Open
yu-zhang-oYo opened this issue Oct 10, 2024 · 7 comments
Open

NA values in results of assocTestSingle #116

yu-zhang-oYo opened this issue Oct 10, 2024 · 7 comments

Comments

@yu-zhang-oYo
Copy link

after running code like this, the result of some SNPs are NA, here I included 10 PCs here, and include interaction between each SNPs with those 10 PCs. When I only include interaction of SNP with 2 PCs, there is no NA in the results. Could you please tell me why there is NA values and how can I avoid such problem. It seems that people always include 10 PCs in their analysis. Thanks

geno <- GdsGenotypeReader(filename = "chr22_genotype.gds")
# create a GenotypeData class object
genoData <- GenotypeData(geno)
# Run SNP-Phenotype Association Tests
genoIterator <- GenotypeBlockIterator(genoData, snpBlock = 2500)
assoc <- assocTestSingle(genoIterator, null.model = nullmod, GxE=paste0("PC", 1:10), BPPARAM = parallel_param)
> head(assoc)
    variant.id chr      pos n.obs      freq  MAC       Est.G Est.G:PC1
1 rs1396993216  22 11132695  8745 0.3932533 6878          NA        NA
2 rs1380626301  22 11253125  8745 0.2624357 4590          NA        NA
3 rs1335082528  22 11712790  8745 0.3410520 5965 -0.02862652  5.071690
4   rs10796494  22 12212485  8745 0.3372784 5899          NA        NA
5    rs9919307  22 12212558  8745 0.4124643 7214          NA        NA
6    rs9919306  22 12212570  8745 0.4471698 7821 -0.02395003  3.067779
  Est.G:PC10 Est.G:PC2  Est.G:PC3 Est.G:PC4   Est.G:PC5 Est.G:PC6 Est.G:PC7
1         NA        NA         NA        NA          NA        NA        NA
2         NA        NA         NA        NA          NA        NA        NA
3      0.875 0.5234312  0.3921560 -2.646185 -0.05574509 -3.027269 -1.359695
4         NA        NA         NA        NA          NA        NA        NA
5         NA        NA         NA        NA          NA        NA        NA
6     -2.000 2.5824847 -0.5428938 -3.332613  0.52354286 -3.282530 -1.988575
   Est.G:PC8 Est.G:PC9 Est.G:PC10       SE.G SE.G:PC1 SE.G:PC10 SE.G:PC2
1         NA        NA         NA         NA       NA        NA       NA
2         NA        NA         NA         NA       NA        NA       NA
3 -1.6218355 -2.803695      0.375 0.01837284 1.747629  77461421 1.657923
4         NA        NA         NA         NA       NA        NA       NA
5         NA        NA         NA         NA       NA        NA       NA
6 -0.3259146 -2.084208      0.000 0.01808976 1.845379 134164579 1.713065
  SE.G:PC3 SE.G:PC4 SE.G:PC5 SE.G:PC6 SE.G:PC7 SE.G:PC8 SE.G:PC9 SE.G:PC10
1       NA       NA       NA       NA       NA       NA       NA        NA
2       NA       NA       NA       NA       NA       NA       NA        NA
3 1.673819 1.763730 1.812998 1.768862 1.779349 1.822356 1.849141  77461421
4       NA       NA       NA       NA       NA       NA       NA        NA
5       NA       NA       NA       NA       NA       NA       NA        NA
6 1.739536 1.679154 1.768389 1.685683 1.693145 1.696496 1.742174 134164579
  GxE.Stat Joint.Stat   GxE.pval Joint.pval
1       NA         NA         NA         NA
2       NA         NA         NA         NA
3 4.067414   4.309488 0.12211078 0.09940260
4       NA         NA         NA         NA
5       NA         NA         NA         NA
6 4.266968   4.391234 0.07689974 0.08192533
@yu-zhang-oYo
Copy link
Author

Besides, I tried to calculate the P value of GxE.Stat (using 3rd row in my above result) which is > pchisq(4.067414^2, df = 10, lower.tail = FALSE) [1] 0.08508653 , it is not the same as the 0.12211078 in the result, I am a little confused about this.

@mconomos
Copy link
Contributor

It looks like there are two terms named "PC10" in your data -- did you accidentally include the same variable twice in your phenotype file? The collinearity of including the same term twice might be leading to the NA issue.

Since there are actually 11 interaction terms in your model (because PC10 is in there twice), the p-value calculation is using 11 d.f., which returns the same value as the function output.

Lastly - while people often use 10 PCs in their analyses by default, I would suggest looking at your PCs to determine how many reflect population structure in your data and using that many PCs in your analysis.

@yu-zhang-oYo
Copy link
Author

yu-zhang-oYo commented Oct 11, 2024

It looks like there are two terms named "PC10" in your data -- did you accidentally include the same variable twice in your phenotype file? The collinearity of including the same term twice might be leading to the NA issue.

Since there are actually 11 interaction terms in your model (because PC10 is in there twice), the p-value calculation is using 11 d.f., which returns the same value as the function output.

Lastly - while people often use 10 PCs in their analyses by default, I would suggest looking at your PCs to determine how many reflect population structure in your data and using that many PCs in your analysis.

Thanks. I checked my phenotype file as followed, it seems there is no PC10 for twice. So I am a little confused why PC10 appear for two times. If collinearity of including the same term twice leads to the NA issue, it is weird that some SNPs can get the results. `> pc.df <- as.data.frame(mypcair$vectors)

names(pc.df) <- paste0("PC", 1:ncol(mypcair$vectors))
pc.df$IID <- row.names(mypcair$vectors)
View(pc.df)
pheno_data <- left_join(pc.df, pheno_data, by = "IID")
colnames(pheno_data)[colnames(pheno_data) == "IID"] <- "scanID"
colnames(pheno_data)
[1] "PC1" "PC2" "PC3" "PC4" "PC5" "PC6"
[7] "PC7" "PC8" "PC9" "PC10" "PC11" "PC12"
[13] "PC13" "PC14" "PC15" "PC16" "PC17" "PC18"
[19] "PC19" "PC20" "PC21" "PC22" "PC23" "PC24"
[25] "PC25" "PC26" "PC27" "PC28" "PC29" "PC30"
[31] "PC31" "PC32" "scanID" "FID" "CRace" "Ancestry_est"
[37] "Age_at_V1" "BMI" "wgt_pp" "wgt_v1" "wgt_v2" "wgt_v3"
[43] "wgt_de" "gwg_v1" "gwg_v2" "gwg_v3" "gwg_de" "z_gwg_v1"
[49] "z_gwg_v2" "z_gwg_v3" "z_gwg_de"`

Besides, I also checked the results when I only include two PCs, it is weird that the result also include the interaction between SNP and PC10, I have no idea if the GENESIS include PC10 wrongly when it try to find PC1

># read in GDS data
>geno <- GdsGenotypeReader(filename = "chr22_genotype.gds")
># create a GenotypeData class object
>genoData <- GenotypeData(geno)
># Run SNP-Phenotype Association Tests
>genoIterator <- GenotypeBlockIterator(genoData, snpBlock = 2500)
>assoc <- assocTestSingle(genoIterator, null.model = nullmod, GxE=paste0("PC", 1:2), BPPARAM = parallel_param)
> head(assoc)
    variant.id chr      pos n.obs      freq  MAC        Est.G   Est.G:PC1
1 rs1396993216  22 11132695  8745 0.3932533 6878 -0.009988122 -0.08924979
2 rs1380626301  22 11253125  8745 0.2624357 4590 -0.002678291 -0.30953132
3 rs1335082528  22 11712790  8745 0.3410520 5965 -0.027474692  5.10850953
4   rs10796494  22 12212485  8745 0.3372784 5899 -0.036892212  4.37341867
5    rs9919307  22 12212558  8745 0.4124643 7214 -0.024669100  4.07719406
6    rs9919306  22 12212570  8745 0.4471698 7821 -0.024019906  3.06230321
   Est.G:PC10  Est.G:PC2       SE.G SE.G:PC1 SE.G:PC10 SE.G:PC2 GxE.Stat
1 -1.34224297  2.6554089 0.01734591 1.727708  1.849182 1.679834 1.738722
2  0.27726606 -3.0203350 0.02030146 2.205803  2.520503 2.529090 1.198602
3  0.52706920  0.3407333 0.01833332 1.746901  1.816277 1.651542 2.944934
4 -0.09359257  1.0404014 0.01834260 1.708225  1.820440 1.685637 2.644967
5 -1.75264142  1.7371339 0.01784019 1.721466  1.785623 1.682665 2.773087
6 -2.06248754  2.4664881 0.01807972 1.845440  1.727778 1.706506 2.471437
  Joint.Stat   GxE.pval Joint.pval
1   1.827476 0.38806888 0.50267200
2   1.238812 0.69696768 0.82048571
3   3.271428 0.03397541 0.03012230
4   3.256274 0.07203014 0.03140311
5   3.110190 0.05287203 0.04630592
6   2.879768 0.10647243 0.08141397

@smgogarten
Copy link
Collaborator

We've identified the problem. If your input phenotypes contain columns where the name of one column is the same as the name of another column pus additional characters, AND you supply the shorter column name as one of the values to test for GxE, you will get this error. It comes from this function to retrieve the GxE column names from the model matrix: https://github.com/UW-GAC/GENESIS/blob/devel/R/utils.R#L252

The reason we do this instead of matching the column names directly, is that the model.matrix function in R decomposes factor variables such that each factor value is appended to the column name. For example:

> dd <- data.frame(a = gl(3,2), b = gl(3,1,6), bb = gl(3,2,6))
> dd
  a b bb
1 1 1  1
2 1 2  1
3 2 3  2
4 2 1  2
5 3 2  3
6 3 3  3
> model.matrix(~ a + b + bb, dd)
  (Intercept) a2 a3 b2 b3 bb2 bb3
1           1  0  0  0  0   0   0
2           1  0  0  1  0   0   0
3           1  1  0  0  1   1   0
4           1  1  0  0  0   1   0
5           1  0  1  1  0   0   1
6           1  0  1  0  1   0   1

We find the columns corresponding to b by searching for strings that begin with b, which in this case would also return columns for bb.

There isn't a quick fix in the code for this, but an easy way for you to get around it when you set up your analysis is to change your column names so the "PC1" prefix is not repeated twice. If you change "PC10" to "pc10", for example, it should work.

@yu-zhang-oYo
Copy link
Author

Thanks, it works. I appreciate your help.

I have one last question, if I have microarray data, GENESIS with GWASTools or with SeqArray can analyze the microarray data with assocTestSingle function. I would like to know the difference between the two, if no, can I use either one? Thanks.

@smgogarten
Copy link
Collaborator

You can use either GWASTools or SeqArray/SeqVarTools and get the same results. SeqArray is newer and more flexible; it can import all variants represented in a VCF file. GWASTools was developed for microarray data so can only store biallelic variants.

@yu-zhang-oYo
Copy link
Author

Got it, I appreciate it so much.

You can use either GWASTools or SeqArray/SeqVarTools and get the same results. SeqArray is newer and more flexible; it can import all variants represented in a VCF file. GWASTools was developed for microarray data so can only store biallelic variants.

Thanks. it helps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants