Tejaas is a command line tool to find trans-eQTLs from eQTL data. It is released under the GNU General Public License version 3.
Tejaas is based on the hypothesis that a trans-eQTL should regulate the expression levels of multiple genes. In brief, it implements two statistical methods to find trans-eQTLs:
- RR-score (Reverse Regression): It performs a multiple linear regression with L2-regularization using expression levels of all genes to explain the genotype of a candidate SNP. In contrast to conventional methods, the direction of the regression is reversed, with the gene expressions as explanatory variables. RR-score is a statistic which estimates whether more genes are required to explain the allele counts of a SNP than expected by chance.
- JPA-score (Joint P-value Analysis): It evaluates the distribution of p-values of the pairwise linear association of a candidate SNP with all available gene expression levels. Any null SNP (no trans-effect) will have a uniform distribution of p-values, while a trans-eQTL will be associated with more genes than expected by chance, leading to overdispersion near zero. The JPA-score is a statistic which estimates whether the distribution of p-values is significantly overdispersed near zero.
Additionally, it also implements a non-linear unsupervised confounder correction using k-nearest neighbors called KNN correction.
- Python version 3.6 or higher
- C compiler
- Any linear algebra routine, e.g. Intel MKL, OpenBLAS, etc.
- Any flavor of Message Passing Interface (MPI), e.g. OpenMPI, MPICH, etc.
- Other Python libraries:
- NumPy / array operations
- SciPy / optimization and other special functions
- statsmodel / used for ECDF calculation in JPA-score
- Pygtrie / used for reading MAF file in RR-score / maf null
- mpi4py / linked to MPI and MKL for python parallelization
- scikit-learn / used for PCA decomposition in KNN correction
See the wiki for detailed installation instructions. Here is a quick start guide:
Installation of Tejaas depends on other dependencies. If you are using conda
, then they can be installed using
conda install numpy scipy statsmodels scikit-learn
pip install mpi4py
2a. Tejaas can be installed directly from the PyPI repository:
pip install tejaas
2b. You can also download this repository and build Tejaas:
git clone [email protected]:soedinglab/tejaas.git
cd tejaas
pip install -e .
tejaas [OPTIONS]
See below for valid options or try tejaas --help
.
An example script example/run_example.sh
is provided to check the installation.
cd tejaas/example
./run_example.sh <outdir> <ncpu>
The script will download some example input files in <outdir>/data
and run Tejaas on <ncpu>
cores.
The output will be created in <outdir>/data
.
Note that the data
subdirectory is automatically created within the <outdir>
specified by the user.
Check if the output is correct:
python compare_with_gold.py --outdir <outdir>
This will check if the output in <outdir>/data
(again note that data
subdirectory is automatically appended and need not be specified by the user)
matches with the results provided in the example/gold
subdirectory.
- Gene expression file
- Genotype file
- VCF
- Oxford
- Dosage
- GENCODE file
- Population minor allele frequency
Option | Argument | Description | Priority | Default value |
---|---|---|---|---|
--vcf |
FILEPATH |
Input genotype file in vcf.gz format | Required (vcf or oxf) | -- |
--oxf |
FILEPATH |
Input genotype file in Oxford format | Required (vcf or oxf) | -- |
--dosage |
Flag for reading dosage files. The file is specified with the --oxf option, e.g. --oxf FILEPATH --dosage |
Optional | False |
|
--fam |
FILEPATH |
Input fam file for samples names of Oxford genotype | Optional | -- |
--chrom |
INT |
Chromosome number of the genotype file | Required | -- |
--include-SNPs |
START:END |
Colon-separated index of SNPs to be included | Optional | -- |
--gx |
FILEPATH |
Input gene expression file for trans-eQTL discovery | Required | -- |
--gxcorr |
FILEPATH |
Input gene expression file for target gene discovery | Optional | --gx file |
--gxfmt |
OPTION |
Input gene expression file format (see format details below). Supported options: gtex , cardiogenics , geuvadis |
Optional | gtex |
--gtf |
FILEPATH |
Input GTF file from GENCODE to read gene Ensembl IDs. Used for selecting biotypes and getting genomic locations. | Required | -- |
--trim |
Flag to trim version number from GENCODE Ensembl IDs | Optional | False |
|
--biotype |
OPTION |
Which biotypes to select from the GTF file. Supported options: protein_coding , lncRNA . |
Optional | protein_coding lncRNA |
--outprefix |
STRING |
Full path to output file names. The extensions are generated by Tejaas. | Optional | out |
--method |
OPTION |
Name of method to run. Supported options: jpa or rr |
Optional | rr |
--null |
OPTION |
Null model to use for RR-score. Supported options: perm or maf |
Optional | perm |
--cismask |
Flag to mask cis-Genes within a window for each candidate SNP. Gene positions are obtained from the GENCODE annotation file. | Optional | False |
|
--window |
FLOAT |
Window (number of base pairs) used for masking cis genes | Optional | 1e6 |
--prior-sigma |
FLOAT |
Standard deviation of the normal prior for reverse multiple linear regression | Optional | 0.1 |
--knn |
INT |
Number of neighbours for KNN (use 0 if you do not want to use KNN) | Optional | 0 |
--psnpthres |
FLOAT |
Target genes will be reported only for trans-eQTLs below this threshold p-value for RR/JPA-score | Optional | 0.0001 |
--pgenethres |
FLOAT |
Target genes will be reported only if their association with trans-eQTLs are below this threshold p-value | Optional | 0.05 |
--jpanull |
FILEPATH |
File containing list of null model JPA-scores | Optional | -- |
--maf-file |
FILEPATH |
Read minor allele frequency (MAF) of SNPs from this file, e.g. to read population MAF for maf null (see documentation for file format) |
Optional | -- |
--shuffle |
Flag to randomly shuffle the genotypes to obtain a null distribution | Optional | False |
|
--shuffle-with |
FILEPATH |
Shuffle the genotypes in the same order of donor IDs specified in FILEPATH |
Optional | -- |
- For quick start or installation check, run Tejaas with all default options:
tejaas --vcf ${VCFFILE} --chrom ${CHRM} --gx ${GXFILE} --gtf ${GTFFILE} --cismask --outprefix ${OUTPREFIX}
This will create RR-scores at γ=0.1 and masking all genes within 1Mb of each SNP. The p-values will be computed from the permuted null model. Default format for the gene expression is the same as the GTEx format, and default gtf file is the GENCODE v26 release. For target gene discovery, it will use the same file as used for trans-eQTL discovery.
- Example of running Tejaas RR-score.
We recommend using the
perm
null model for calcuting p-values from the RR-score and a separate confounder-corrected gene expression file for target gene discovery. In this example, RR-score is calculated for first 1000 SNPs excluding the first 20--include-SNPs 21:1000
. KNN correction is performed with 20 nearest neighbors--knn 20
. All cis-genes within +2MB and -2Mb are masked during analysis--cismask --window 2e6
. RR-score calculation uses a prior normal distribution with standard deviation of 0.05--prior-sigma 0.05
. The output reports target genes only for SNPs with p-value < 1e-6--psnpthres 0.000001
. Here,GXFILE
is the raw gene expression file,GXCORRFILE
is the confounder-corrected gene expression file,VCFFILE
is the genotype file in.vcf.gz
format andGTFFILE
is the GENCODE annotation file.
mpirun -n 8 tejaas --vcf ${VCFFILE} --chrom ${CHRM} --include-SNPs 21:1000 --gx ${GXFILE} --gxcorr ${GXCORRFILE} \
--gxfmt gtex --gtf ${GTFFILE} --trim --outprefix ${OUTPREFIX} \
--cismask --window 2e6 --psnpthres 0.000001 \
--knn 20 --method rr --null perm --prior-sigma 0.05
- Example of running JPA-score with no KNN correction.
Empirical p-values are calculated from the null scores loaded from
NULLFILE
specified by the--jpanull
option. IfNULLFILE
does not exist, then it will create100000
null scores and write them in theNULLFILE
before calculating JPA-scores. If--jpanull
option is not used, then p-values for the JPA-scores are calculated from an analytical construction of null model.
mpirun -n 8 tejaas --vcf ${VCFFILE} --chrom ${CHRM} --include-SNPs 1:100 \
--gx ${GXFILE} --gxfmt gtex --gtf ${GTFFILE} --outprefix ${OUTPREFIX} \
--knn 0 --method jpa --jpanull ${NULLFILE}
- Example of parallelizing job submission.
NMAX=20000 # number of SNPs per job
for CHRM in $( seq 1 22 ); do
VCFFILE="file_path_here_${CHRM}.vcf.gz"
NTOT=$( calculate_no_of_SNPs_in_this_chromosome )
NJOB=$( echo $(( (NTOT + NMAX - 1)/NMAX )) )
for (( i=0; i < ${NJOB}; i++ )); do
STARTSNP=$(( NMAX * i + 1 ))
ENDSNP=$(( NMAX * (i + 1) ))
if [ ${ENDSNP} -gt ${NTOT} ]; then
ENDSNP=${NTOT}
fi
mpirun -n 8 tejaas --vcf ${VCFFILE} --chrom ${CHRM} --include-SNPs ${STARTSNP}:${ENDSNP} --gx ${GXFILE} --gxcorr ${GXCORRFILE} \
--gxfmt gtex --gtf ${GTFFILE} --trim --outprefix ${OUTPREFIX} \
--cismask --psnpthres 0.000001 --knn 20 --method rr --null perm --prior-sigma 0.05
done
done
1.0.2 Bump release version