For the dataset used in this analysis, FASTQ files are not directly available. Instead, BAM files have been made available; these can be downloaded (download data) and converted into FASTQ files using the following steps with bamtofastq_linux.
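As a rough illustration of the conversion step, the call to the 10x Genomics bamtofastq executable can be wrapped in Python. This is a minimal sketch: the helper names, the thread count, and the file names sample.bam and fastq_out are placeholders and not paths from the original analysis.

```python
import subprocess


def bamtofastq_command(bam_path, out_dir, threads=4, binary="bamtofastq_linux"):
    """Build the argument list for the 10x Genomics bamtofastq tool.

    `binary` is the name or path of the downloaded executable. All paths
    passed in are placeholders chosen by the caller.
    """
    return [binary, f"--nthreads={threads}", str(bam_path), str(out_dir)]


def convert_bam(bam_path, out_dir, **kwargs):
    """Run the conversion and raise CalledProcessError if the tool fails."""
    subprocess.run(bamtofastq_command(bam_path, out_dir, **kwargs), check=True)


# Example (hypothetical file names):
# convert_bam("sample.bam", "fastq_out", threads=8)
```

Building the argument list separately from running it makes it easy to print and inspect the exact command before any data is touched.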
Before running the downstream analysis, some quality control checks need to be done to ensure the raw data has no underlying problems, or to flag any problems that are present. FastQC provides a QC report for each file, which can be summarised with MultiQC. FastQC can also be run in a non-interactive mode, which makes it suitable for integration into a larger analysis pipeline (guide and code here).
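The non-interactive QC step could be scripted along the following lines. This is a sketch under assumptions: it presumes the fastqc and multiqc executables are on the PATH, and the function names, thread count, and directory names are illustrative only.

```python
import subprocess
from pathlib import Path


def fastqc_command(fastq_files, out_dir, threads=2):
    """FastQC runs non-interactively when files are given on the command line."""
    return ["fastqc", "--outdir", str(out_dir), "--threads", str(threads),
            *[str(f) for f in fastq_files]]


def multiqc_command(report_dir, out_dir):
    """MultiQC scans `report_dir` for FastQC output and writes one summary report."""
    return ["multiqc", str(report_dir), "--outdir", str(out_dir)]


def run_qc(fastq_files, out_dir="qc_reports"):
    """Run FastQC on every file, then summarise the reports with MultiQC."""
    # FastQC expects the output directory to exist already.
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(fastqc_command(fastq_files, out_dir), check=True)
    subprocess.run(multiqc_command(out_dir, out_dir), check=True)


# Example (hypothetical file names):
# run_qc(["sample_R1.fastq.gz", "sample_R2.fastq.gz"])
```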
STUDY OUTCOME 1 | Initial quality assessment
Across the investigated datasets, the quality of the sequencing data was deemed acceptable by FastQC metrics. Although a subset of sequences failed to reach certain quality thresholds set by FastQC, this could be attributed to the nature of the 10X Genomics Chromium scRNA-seq output. Specifically, an output known as the index file contains intentionally identical sequences to allow the Illumina sequencing technology to differentiate between adjacent read pairs. Consequently, these sequences are highly over-represented in the read files and exhibit a non-normal distribution of bases and GC content, as was detected by FastQC. However, a portion of the failed sequences, less than 15%, arose from the true sequencing reads. This was observed in the IFN-treated samples and, while not serious enough to justify discarding the datasets, was kept in mind throughout downstream analysis.
To run the pseudoalignment tool (Kallisto), an index of the reference transcriptome is needed. Although the SASCRiP function kallisto_bustools can build this automatically when certain parameters are changed, I needed to know how to do it manually.
For this process, Python needs to be installed along with its package manager, pip. Download the latest version of Python (PYTHON) and download the pip installer file here: (PIP). From the command prompt, confirm that Python is installed by running python --version before installing pip by running python get-pip.py, which can be verified afterwards by running pip --version. Next, JupyterLab must be installed, which can now be done from the command prompt by running pip install jupyterlab, and then opened by running jupyter lab. Note that this opens a browser page (Google Chrome here) with Jupyter ready to be used; this is where the second portion of this tutorial needs to be completed.
This process also requires Kallisto to be installed (Kallisto). From the options below, choose and install the one compatible with your device:
- kallisto_linux-v0.50.0.tar.gz
- kallisto_mac-v0.50.0.tar.gz
- kallisto_mac_m1-v0.50.0.tar.gz
- kallisto_windows-v0.50.0.zip
To use it, change to the directory where it is saved, for example cd kallisto_windows-v0-50.0\kallisto
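Before moving on, it can help to confirm from a Jupyter cell which of the required tools are actually reachable. The snippet below is only a convenience sketch; the function name tool_report and the default tool list are assumptions. It will report Kallisto as missing if its folder has not been added to the PATH, even when the executable itself is installed.

```python
import shutil
import sys


def tool_report(tools=("kallisto", "pip", "jupyter")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in tools}


print("Python", sys.version.split()[0])
for name, path in tool_report().items():
    print(f"{name}: {path or 'not found on PATH'}")
```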
Next, the full human transcriptome from Ensembl (the file ending in cdna.all.fa.gz) must be downloaded, as it is the input for building the transcriptome index. It is available under cDNA on the Ensembl website and can be downloaded by executing the following in the command prompt.
STUDY OUTCOME 2 | Successful download of the full transcriptome
NOTE: you could, of course, do this by clicking the download button when you travel to the website.
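As a scripted alternative, the download can also be done from Python. This sketch assumes Ensembl release 110 (chosen only because it matches the GTF file name mentioned later in this tutorial) and the usual layout of the Ensembl FTP site; the helper names are illustrative, and the release number should be adjusted to whichever version you need.

```python
import urllib.request
from pathlib import Path

ENSEMBL_FTP = "https://ftp.ensembl.org/pub"


def cdna_url(release=110, species="homo_sapiens", assembly="GRCh38"):
    """Build the URL of the full cDNA FASTA (release number is an assumption)."""
    file_name = f"{species.capitalize()}.{assembly}.cdna.all.fa.gz"
    return f"{ENSEMBL_FTP}/release-{release}/fasta/{species}/cdna/{file_name}"


def download(url, dest_dir="."):
    """Download `url` into `dest_dir` unless the file is already there."""
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


# Example:
# fasta_path = download(cdna_url())
```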
After downloading the full transcriptome file, you now need to build the index itself. This was done in the command prompt using Kallisto as follows.
STUDY OUTCOME 3 | Successful building of the pseudoalignment index
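For reference, the index build boils down to a single kallisto index call, which can be wrapped in Python as sketched below. The index file name and the helper names are assumptions; if Kallisto is not on the PATH, pass the full path to the executable via the kallisto argument.

```python
import subprocess


def kallisto_index_command(fasta, index="Homo_sapiens.GRCh38.cdna.all.idx",
                           kallisto="kallisto"):
    """Build the `kallisto index` call; -i names the index file to create."""
    return [kallisto, "index", "-i", str(index), str(fasta)]


def build_index(fasta, **kwargs):
    """Run kallisto and raise CalledProcessError if indexing fails."""
    subprocess.run(kallisto_index_command(fasta, **kwargs), check=True)


# Example (the gzipped FASTA can be passed directly):
# build_index("Homo_sapiens.GRCh38.cdna.all.fa.gz")
```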
Once the index is created, the transcripts-to-genes text file must also be compiled. This can be done using a function from kb_python called create_t2g_from_gtf. This function requires a GTF (Gene Transfer Format) file as input, which must be downloaded first. To download the GTF file, go to the Ensembl website > human > latest genome assembly > GRCh38 (or latest version) > access the GTF file (Homo_sapiens.GRCh38.110.gtf.gz). Once downloaded, store the GTF file in a specific directory and complete the following.
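To make clear what a transcripts-to-genes file contains, here is a small stand-alone sketch that derives one directly from an Ensembl GTF. It is not kb_python's create_t2g_from_gtf and makes no claim to match its output exactly; the three-column layout (transcript, gene, gene name) and the appending of version numbers to the IDs are assumptions made so that the IDs line up with the versioned headers in the Ensembl cDNA FASTA.

```python
import gzip
import re

# Matches GTF attributes of the form: key "value"
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def t2g_rows(gtf_lines):
    """Yield (transcript, gene, gene_name) for every transcript feature."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        transcript, gene = attrs["transcript_id"], attrs["gene_id"]
        # Append version numbers when the GTF provides them.
        if "transcript_version" in attrs:
            transcript += "." + attrs["transcript_version"]
        if "gene_version" in attrs:
            gene += "." + attrs["gene_version"]
        yield transcript, gene, attrs.get("gene_name", "")


def write_t2g(gtf_path, t2g_path):
    """Write one tab-separated row per transcript to `t2g_path`."""
    with open_text(gtf_path) as gtf, open(t2g_path, "w") as out:
        for row in t2g_rows(gtf):
            out.write("\t".join(row) + "\n")


# Example:
# write_t2g("Homo_sapiens.GRCh38.110.gtf.gz", "transcripts_to_genes.txt")
```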