
๐Ÿ—’๏ธ Diary and guide to a single cell RNA sequencing analysis project using data from a publicly available project: PRJNA597786


AlicenJoyHenning/honours


SINGLE CELL RNA SEQUENCING WORKFLOW

1. Download and assess the data

10X BAM to FASTQ converter

For the dataset used in this analysis, FASTQ files are not directly available. Instead, BAM files have been made available, which can be downloaded and converted into FASTQ files with bamtofastq_linux using the following steps (download data).
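The conversion step can be sketched in Python as below. This is a minimal, hedged example: the helper name bamtofastq_command, the file names, and the thread count are assumptions for illustration, and the executable is assumed to be reachable as bamtofastq_linux (adjust the path if it sits in the current directory).

```python
import shlex


def bamtofastq_command(bam_path, out_dir, threads=4, executable="bamtofastq_linux"):
    """Build the argument list for one BAM-to-FASTQ conversion.

    The positional order (BAM file first, output directory second) follows
    the bamtofastq usage line; --nthreads sets the number of worker threads.
    """
    return [executable, f"--nthreads={threads}", str(bam_path), str(out_dir)]


# Hypothetical file names, shown as a dry run only.
cmd = bamtofastq_command("sample1.bam", "fastq/sample1")
print(shlex.join(cmd))

# To actually run the conversion:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list keeps it easy to loop over several BAM files without worrying about shell quoting.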

Before running the downstream analysis, quality control checks are needed to confirm that the raw data has no underlying problems, or to flag any problems it does have. FastQC produces a QC report for each file, and MultiQC summarises these reports. FastQC can be run in a non-interactive mode, which makes it suitable for integrating into a larger analysis pipeline (guide; & code here).
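A non-interactive QC run might look like the sketch below. The function names, directory names, and FASTQ file names are hypothetical; the fastqc and multiqc options used (--outdir, --threads) are standard, but this is not necessarily the exact invocation used in this project.

```python
import subprocess
from pathlib import Path


def fastqc_command(fastq_files, out_dir, threads=2):
    """Argument list for a non-interactive FastQC run over several files."""
    return ["fastqc", "--outdir", str(out_dir), "--threads", str(threads),
            *[str(f) for f in fastq_files]]


def multiqc_command(report_dir, out_dir):
    """Argument list for MultiQC, which summarises the FastQC reports."""
    return ["multiqc", str(report_dir), "--outdir", str(out_dir)]


def run_qc(fastq_files, fastqc_dir="qc/fastqc", multiqc_dir="qc/multiqc"):
    """Run FastQC, then MultiQC (requires both tools on the PATH)."""
    # Create the FastQC output directory up front so the run does not fail on a missing path.
    Path(fastqc_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(fastqc_command(fastq_files, fastqc_dir), check=True)
    subprocess.run(multiqc_command(fastqc_dir, multiqc_dir), check=True)


# Dry run with hypothetical file names: print the commands without executing them.
print(fastqc_command(["sample1_R1.fastq.gz", "sample1_R2.fastq.gz"], "qc/fastqc"))
print(multiqc_command("qc/fastqc", "qc/multiqc"))
```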

STUDY OUTCOME 1 | Initial quality assessment

Across the investigated datasets, the quality of sequencing data was deemed acceptable by FastQC metrics. Although a subset of sequences failed to reach certain quality thresholds set by FastQC, this could be attributed to the nature of the 10X Genomics Chromium scRNA-seq output. Specifically, an output known as the index file contains intentionally identical sequences to allow the Illumina sequencing technology to differentiate between adjacent read pairs. Consequently, these sequences are highly over-represented in the read files and exhibit a non-normal distribution of bases and GC content, as was detected by FastQC. However, a portion of the failed sequences, less than 15%, arose from the true sequencing reads. This was observed in the IFN-treated samples and, while not serious enough to warrant discarding the datasets, was kept in mind throughout downstream analysis.


2. Pre-pseudoalignment processes

Building the reference transcriptome input for Kallisto to be used in the SASCRiP pipeline

(i) Install and download the necessary dependencies

To run the pseudoalignment tool (Kallisto), an index of the reference transcriptome is needed. Although the SASCRiP function kallisto_bustools can build this automatically if some parameters are changed, I needed to know how to do it manually.
For this process, Python must be installed along with its package manager, pip. Download the latest version of Python (PYTHON) and the get-pip.py file (PIP). From the cmd, confirm that Python is installed with python --version, install pip with python get-pip.py, and verify the installation with pip --version. Next, install JupyterLab from the cmd with pip install jupyterlab and open it with jupyter lab. This opens JupyterLab in your default web browser, which is where the second portion of this tutorial needs to be completed.
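Once everything is installed, a quick check that each command is visible on the PATH can save time later. The sketch below uses only the Python standard library; the helper name locate_tools is hypothetical.

```python
import shutil


def locate_tools(names):
    """Map each command name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


# Commands used in this section of the guide.
for tool, path in locate_tools(["python", "pip", "jupyter", "kallisto"]).items():
    print(f"{tool}: {path or 'not found on PATH'}")
```

Any tool reported as not found either still needs installing or lives in a directory that has not been added to the PATH.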

This process also requires Kallisto to be installed (Kallisto). From the options below, choose and install the one compatible with your device. To use it, change to the directory where it is saved, for example cd kallisto_windows-v0-50.0\kallisto :

  • kallisto_linux-v0.50.0.tar.gz
  • kallisto_mac-v0.50.0.tar.gz
  • kallisto_mac_m1-v0.50.0.tar.gz
  • kallisto_windows-v0.50.0.zip

From this point, the full human transcriptome from Ensembl (the file ending in cdna.all.fa.gz) must be downloaded. It is available under cDNA on the Ensembl website. To download it before building the index, execute the following in the command prompt.
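The download can also be scripted. The Python sketch below assumes Ensembl release 110 (matching the GTF file named later in this guide) and the usual layout of the Ensembl FTP site; the function names are hypothetical, and the release number should be changed to whichever release you are using.

```python
import urllib.request
from pathlib import Path

ENSEMBL_FTP = "https://ftp.ensembl.org/pub"


def cdna_url(release=110, species="homo_sapiens", assembly="GRCh38"):
    """URL of the full cDNA FASTA for one Ensembl release (assumed FTP layout)."""
    filename = f"{species.capitalize()}.{assembly}.cdna.all.fa.gz"
    return f"{ENSEMBL_FTP}/release-{release}/fasta/{species}/cdna/{filename}"


def download(url, dest_dir="."):
    """Download url into dest_dir unless the file is already present."""
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


# Print the URL only; call download(cdna_url()) to fetch the file.
print(cdna_url())
```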

STUDY OUTCOME 2 | Successful download of the full transcriptome

NOTE: you could, of course, do this by clicking the download button when you travel to the website.

(ii) Build the index file

After downloading the full transcriptome file, you now need to build the index itself. This was done in the command prompt using Kallisto as follows.
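A minimal sketch of the index-building call is shown below, wrapped in Python so it can be run from the same notebook. The index file name and the helper name are assumptions; kallisto index with -i naming the output index is the standard invocation.

```python
import shlex


def kallisto_index_command(fasta, index, kallisto="kallisto"):
    """Argument list for building a kallisto index from a transcriptome FASTA."""
    return [kallisto, "index", "-i", str(index), str(fasta)]


# Hypothetical index name; the FASTA is the Ensembl cDNA file downloaded earlier.
cmd = kallisto_index_command("Homo_sapiens.GRCh38.cdna.all.fa.gz",
                             "Homo_sapiens.GRCh38.cdna.all.idx")
print(shlex.join(cmd))

# To build the index (this can take several minutes):
#   import subprocess
#   subprocess.run(cmd, check=True)
```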

STUDY OUTCOME 3 | Successful building of the pseudoalignment index

(iii) Build the transcripts to gene file

Once the index is created, the transcripts-to-genes text file must also be compiled. This can be done using a function from kb_python called create_t2g_from_gtf. It takes a GTF (gene transfer format) file as input, which must be downloaded first. To download the GTF file, go to the Ensembl website > human > latest genome assembly > GRCh38 (or latest version) > access the GTF file (Homo_sapiens.GRCh38.110.gtf.gz). Once downloaded, store the GTF file in a specific directory and complete the following.
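To show what a transcripts-to-genes table contains, here is a standard-library stand-in that parses the GTF directly. It is not kb_python's create_t2g_from_gtf, and the function names and three-column layout (transcript ID, gene ID, gene name) are assumptions for illustration. Version suffixes are appended because the transcript names in the Ensembl cDNA FASTA carry them.

```python
import gzip
import re

# Matches GTF attributes of the form: key "value"
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def transcripts_to_genes(gtf_lines):
    """Return (transcript_id, gene_id, gene_name) for every transcript record."""
    rows = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # keep only transcript features
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        transcript = attrs["transcript_id"]
        if "transcript_version" in attrs:
            transcript += "." + attrs["transcript_version"]
        gene = attrs["gene_id"]
        if "gene_version" in attrs:
            gene += "." + attrs["gene_version"]
        rows.append((transcript, gene, attrs.get("gene_name", "")))
    return rows


def write_t2g(gtf_path, t2g_path):
    """Write a tab-separated transcripts-to-genes file from a GTF file."""
    with open_text(gtf_path) as gtf, open(t2g_path, "w") as out:
        for row in transcripts_to_genes(gtf):
            out.write("\t".join(row) + "\n")


# Example call (file names are placeholders for wherever the GTF was stored):
#   write_t2g("Homo_sapiens.GRCh38.110.gtf.gz", "transcripts_to_genes.txt")
```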

(iv) Adjust the transcripts to gene file for Kallisto

3. SASCRiP: pseudoalignment and quantification pipeline

4. Cell and data quality control

5. Integration of study samples

6. Dimensionality reduction and clustering of integrated dataset

7. Differential gene expression analyses
