######################################################
###################################################### Peter A. Combs and Hunter B. Fraser
Department of Biology, Stanford University
This repository contains the analysis code for Combs and Fraser 2017 (bioRxiv preprint; currently submitted for review). Raw and processed data files available from the Gene Expression Omnibus.
Very briefly, the data analyses in these scripts do two things:
-
Process RNA-seq data from cryosliced D. melanogaster x simulans hybrid embryos. We are looking for genes with spatially varying allele-specific expression (svASE) is different in one part of the embryo compared to the other.
-
Perform modeling of the cis-regulatory input functions of genes using a modeling approach inspired in large part by Ilsley, et al (2013). Using genome alignments and motif searches, we can then make inferences about which cis-regulatory changes actually produced the spatially varying ASE that we observed in part 1.
Almost all of the code is written in either Python 3 or Snakemake. Known dependencies include:
- SRA Tools
- Snakemake
- Bowtie2 (Reference gDNA alignment)
- STAR (RNA-seq alignment)
- Cufflinks (RNA-seq quantification)
- Bedtools
- Samtools
- ASEr (https://github.com/thefraserlab/aser)
- Hornet (https://github.com/thefraserlab/hornet)
- Various python modules
- pysam
- progressbar
- numpy/scipy/pandas/matplotlib
- svgwrite
- pyemd
- BioPython
- statsmodels
You should be able to go from raw reads to summary data tables by doing:
snakemake
Though you probably want to run this on a pretty high-powered machine---or, ideally, a compute cluster. The basic steps for a single sample are:
- Download reads from SRA
- Map reads to the D. melanogaster genome (the resulting file is called
assigned_dmelR.bam
for various historical reasons) - Calculate absolute abundances using Cufflinks
- Filter out potentially ambiguous/mismapped reads using Hornet to implement the WASP pipeline.
- Count allele-specific reads for each gene using ASEr
Then, all of the expresion and ASE data are combined into summary tables.
We found that fitting either a logistic or gaussian function to the data found all of the genes whose allele-specific expression had spatial patterns.