---
title: "FlowExample"
author: "Max Chao"
date: "06/26/2023"
output: html_document
---
# Running Flow
This is an example pipeline that runs two tumor/normal pairs through the hg19 JaBbA
pipeline using Flow. There are a few steps to setting up and running these jobs.
After getting through the pipeline, we will run a post-JaBbA job and work towards
visualizing some of the plots.
The pipeline that we will run through is covered here:
```{r Jabpipeline, echo=FALSE}
knitr::include_graphics("mc_jabba_workflow.png")
```
In the full workflow we align FASTQ files into BAMs for tumor/normal paired WGS samples. Next,
we run a series of jobs whose outputs are the required inputs for the JaBbA pipeline.
However, since aligning WGS samples can take upwards of a week, we will start from a couple of already-aligned hg19 BAM files and work forward from there.
## Our Pairs Table
This is our pairs table in its simplest form, holding the BAM files for the associated
tumor and normal sample of each pair. We will run these samples through the pipeline:
```{r ptable}
pairs = readRDS("db/pairs.rds")
pairs
```
There is **intentionally** something wrong with one of these samples: try to figure
out what!
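One quick sanity check worth sketching (the column names `tumor_bam`/`normal_bam` are assumptions about this table; adjust to your actual colnames):
```{r bamcheck, eval=FALSE}
# Do the listed BAM paths actually exist on disk?
file.exists(pairs$tumor_bam)
file.exists(pairs$normal_bam)
```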
## BAM jobs
Once we have our bams, we are going to run a series of jobs on them to generate
the prerequisites for running JaBbA. These jobs all need to be run on the bams, and
they can all be run in tandem. So let's try to get one of them going.
The first one is the HetPileups caller, which generates a pileup of allele fractions
at heterozygous sites for a sample.
```{bash hetpileup}
cat ~/tasks/hg19/core_dryclean/HetPileups.task
```
Here we see a few arguments that will need to be provided, and there are a couple of
columns in our pairs table that may need to be renamed. The task's text table format is
roughly as follows:

|column|meaning|
|---|---|
|col1|`input` or `output`|
|col2|value name inside the Job class|
|col3|input label / column name of the file in the pairs table|
|col4|class expected of the input|
|col5|any default argument|
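A purely hypothetical row, just to illustrate the five columns (not copied from any real task file):
```
input   tumor_bam   Tumor_BAM   file   NA
```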
For more information on how to set up jobs, the [wiki](http://mskiweb/101/flow.html) on Flow is a good
source to follow.
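As a concrete illustration, here is a minimal sketch of launching the HetPileups task through Flow. The `Job()` constructor and the `srun()`/`update()` calls are assumptions based on the Flow 101 above; verify the exact API there before running:
```{r hetjob, eval=FALSE}
library(Flow)
task  = Task("~/tasks/hg19/core_dryclean/HetPileups.task")
pairs = readRDS("db/pairs.rds")   # our pairs table
job   = Job(task, pairs)          # assumption: one job per row of the pairs table
srun(job)                         # assumption: submit the jobs to the cluster
update(job)                       # assumption: poll job status until completion
```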
These tasks all run on the bams and can be launched in tandem. If all goes well,
expect the whole set to finish within a few hours.
Here's a table of the BAM jobs to run:
|tool|task location|description|
|---|---|---|
|Svaba|`~/tasks/hg19/core_dryclean/Svaba.task`|SV caller|
|Strelka|`~/tasks/hg19/core_dryclean/Strelka2`|SNV caller|
|fragcounter (on tumor)|`~/tasks/hg19/core_dryclean/fragCounter.task`|fragment counting across the genome with GC and mappability corrections|
|fragcounter (on normal)|`~/tasks/hg19/core_dryclean/fragCounter.normal.task`|same as above but for normal|
|HetPileup|`~/tasks/hg19/core_dryclean/HetPileups.task`|Get allele fractions for het sites across genome|
**After all of these steps have run and been processed, remember to merge their outputs together and
save them into our pairs table.**
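A hedged sketch of that merge, assuming each finished job exposes an outputs table keyed by the same pair id (`outputs()` and the `pair` key column are assumptions; check the Flow 101 for the real accessor):
```{r mergeout, eval=FALSE}
library(data.table)
pairs = setDT(readRDS("db/pairs.rds"))
hets  = outputs(het_job)                  # hypothetical accessor for job output paths
pairs = merge(pairs, hets, by = "pair")   # the "pair" id column is an assumption
saveRDS(pairs, "db/pairs.rds")            # persist the merged table
```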
Alternatively, one can also use [GRIDSS/GRIPSS](https://github.com/PapenfussLab/gridss) as a
SV and junction breakpoint detector.
## Working with the outputs
If all is working well, we can begin processing some of the BAM-job outputs in
preparation for JaBbA. One of these steps is dryclean, which further processes the fragCounter
outputs, using robust PCA (rPCA) against a panel of normals (PON) to separate signal from background noise.
### Dryclean
Run dryclean on the fragCounter outputs for both tumor and normal.
The tasks associated with dryclean:

- dryclean normal: `~/tasks/hg19/core_dryclean/Dryclean.normal.task`
- dryclean tumor: `~/tasks/hg19/core_dryclean/Dryclean.task`
Then we use CBS (circular binary segmentation) to get our segments ready for JaBbA. The task expects fragCounter-style inputs, so we'll need a temporary pairs table in which the dryclean output columns are renamed to match what the CBS task expects.
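A minimal sketch of that temporary table (all column names here are assumptions; inspect the CBS task with `Flow::Task()` to see the labels it actually expects):
```{r cbstmp, eval=FALSE}
library(data.table)
tmp = copy(setDT(readRDS("db/pairs.rds")))
# Point the coverage columns at the dryclean outputs instead of raw fragCounter
tmp[, tumor_cov  := tumor_dryclean_cov]
tmp[, normal_cov := normal_dryclean_cov]
saveRDS(tmp, "db/pairs.cbs.rds")   # hand this table to the CBS job
```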
Once the temporary table is ready, run CovCBS: `~/tasks/hg19/core_dryclean/CBS.task`
**MERGE ALL OF THIS INTO YOUR PAIRS TABLE BEFORE PROCEEDING.**
### Optional job: ASCAT
ASCAT purity/ploidy estimation (among other estimates) can be done separately here.
This job is not strictly necessary, as JaBbA will do its own estimation, but it can help
with a few fringe cases.
The task is: `~/tasks/ascat_seg.task`
## Running JaBbA
Now that we have all the right inputs, it's time to run JaBbA.
The task is located here: `~/tasks/hg19/core_dryclean/JaBbA.task`
and looking into the task:
```{r taskjab}
Flow::Task("~/tasks/hg19/core_dryclean/JaBbA.task")
```
We have a bunch of values that will need to be filled in. Here are some of the inputs
that will need to be set:
1. `tumor_dryclean_cov` should point to the `tumor_dryclean_cov` column merged into the pairs table.
2. `svaba_somatic_vcf` should point to the `svaba_somatic_vcf` column merged into the pairs table.
3. `field` should be `foreground` if you are following the blog file.
4. `cbs_seg_rds` should match the `cbs` seg outputs from the merge.
5. `cbs_nseg_rds` should match the `cbs` nseg outputs from the merge.
6. `het_pileups_wgs` should be the HetPileups `sites.txt` output for that pair.
*Purity and ploidy values can be precalculated via the ASCAT job and provided here; otherwise
the task and JaBbA will estimate them for you.*
There are a few other fit parameters that can be tuned, and in your free time I would
recommend experimenting with them to see how the various arguments affect the solver.
From here, run the job, and we should be good to go. Expect each pair to take at most
a few hours. There will occasionally be cases where the JaBbA solver fails to converge,
and for those it becomes important to adjust the parameters until you get a usable fit.
### Running post-JaBbA jobs
Most post-JaBbA tasks relate either to the sample as a whole or to the gGraph
structure generated by the JaBbA job. Here we will walk through a simple post-JaBbA task,
"event calling" via gGnome. We run the task `~/tasks/Events.task`, which calls
the event caller from gGnome. The caller uses set thresholds and criteria
to identify the classes of SV events found within each sample.
Most of these jobs, like Fusions, Alterations, and Events, are quick, so we can expect
a reasonably fast turnaround.
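Interactively, the same event calling looks roughly like this in gGnome (the input file name is an assumption; point it at your actual JaBbA output):
```{r eventsketch, eval=FALSE}
library(gGnome)
gg = gG(jabba = "jabba.simple.rds")  # build a gGraph from a JaBbA output
gg = events(gg)                      # annotate SV event classes on the graph
gg$meta$events                       # table of called events per class
```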
### Plotting
Finally, plotting is done via gGnome. Let's try to plot some of our JaBbA outputs.
Use the gGnome tutorial for reference: [Link](http://mskilab.com/gGnome/tutorial.html#Classifying_SV_events)
Using the complex JaBbA output from the Events job:
1. Plot a basic SV event via gGraph (see the sketch below).
2. Widen the plotted range of the event by 10 kbp on each side.
3. Generate a gWalk of one of these SV events.
4. Plot and save these plots via `ppng` or `ppdf` on mskiweb.
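As a starting point for steps 1, 2, and 4, here is a sketch under the same assumptions (the file name and window coordinates are placeholders, and `ppng` comes from skitools):
```{r plotsketch, eval=FALSE}
library(gGnome)
library(GenomicRanges)
gg  = gG(jabba = "jabba.simple.rds")    # assumption: your JaBbA output path
win = GRanges("1:10000000-12000000")    # placeholder window around an event
plot(gg$gt, win + 1e4)                  # gTrack plot, widened by 10 kbp each side
# Save to a browsable file on mskiweb, e.g.:
# ppng(plot(gg$gt, win + 1e4), filename = "sv_event.png")
```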
## Resources and More Reading
Read into these in your free time!
101 on Flow Usage: http://mskiweb/101/flow.html
Links to Tools within the pipeline:
1. [Svaba](https://github.com/walaj/svaba)
2. [Strelka2](https://github.com/Illumina/strelka)
3. [fragcounter](https://github.com/mskilab/fragCounter)
4. HetPileups is an internal tool; the script is found here: `~/modules/Pileup/Pileup.R`
5. [dryclean](https://github.com/mskilab/dryclean)
6. [CBS](https://cran.r-project.org/web/packages/PSCBS/vignettes/CBS.pdf)
7. [ascat](https://github.com/VanLoo-lab/ascat)
8. [JaBbA](https://github.com/mskilab/JaBbA)
9. [gGnome](https://github.com/mskilab/gGnome)