3.1-Module3lab_gprofiler.Rmd

# Module 3 Lab: g:profiler Visualization {#gprofiler_mod3}

**This work is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/deed.en_US). This means that you are able to copy, share and modify the work, as long as the result is distributed under the same license.**

By Gary Bader, Ruth Isserlin, Chaitra Sarathy, Veronique Voisin

## Goal of the exercise

**Create an enrichment map and navigate through the network**

During this exercise, you will learn how to create an enrichment map from gene-set enrichment results. The enrichment results chosen for this exercise are generated using [g:Profiler](https://biit.cs.ut.ee/gprofiler/gost) but an enrichment map can be created directly from output from [GSEA](http://software.broadinstitute.org/gsea/index.jsp), 
[g:Profiler](https://biit.cs.ut.ee/gprofiler/gost),
[GREAT](http://great.stanford.edu/public/html/),
[BinGo](http://apps.cytoscape.org/apps/bingo), [Enrichr](https://amp.pharm.mssm.edu/Enrichr/) or alternately from any gene-set tool using the generic enrichment results (GEM) format.


## Data

The data used in this exercise is a list of frequently mutated genes that we used in [previous exercise](#gprofiler-lab). 
Pathway enrichment analysis has been run using g:Profiler and the results have been downloaded as a GEM format.


## EnrichmentMap

*	A circle (node) is a gene-set (pathway) enriched in genes that we used as input in g:Profiler (frequently mutated genes).

*	edges (lines) represent genes in common between 2 pathways (nodes).

*	A cluster of nodes represent overlapping and related pathways and may represent a common biological process.

*	Clicking on a node will display the genes included in each pathway.

<img src="./Module3/gprofiler/images/example_cluster.png"  />


## Description of this exercise

We will run the saved g:Profiler results (from [Module 2 - gprofiler lab](#gprofiler-lab)) using different parameters. 
An enrichment map represents the result of enrichment analysis as a network where significantly enriched gene-sets that share a lot of genes in common will form identifiable clusters. The visualization of the results as these biological themes will ease the interpretation of the results. 

The goal of this exercise is to learn how to:

  1. Upload g:Profiler results into Cytoscape EnrichmentMap to create a map.
  1. Upload several g:Profiler results at the same time to create one map and learn how to distinguish and compare the results.
  1. To compare the differences resulting from the use of different g:Profiler parameters at the enrichment map level. 


## Start the exercise

To start the lab practical section, first create a gprofiler_files directory on your computer and download the files below.

```{block, type="rmd-datadownload"}
Right click on link below and select "Save Link As...".

Place it in the corresponding module directory of your CBW work directory.
```

Five files are needed for this exercise:

  1. Enrichment result 1:   [gProfiler_hsapiens_lab2_results_GEM_termmin3_max10000.gem.txt](./Module3/gprofiler/data/gProfiler_hsapiens_lab2_results_GEM_termmin3_max10000.gem.txt) 
  * In g:Profiler, the parameters that we used to generate this file were:
    * GO_BP no electronic annotation, 
    * Reactome, 
    * WikiPathways,
    * Benjamini-Hochberg FDR 0.05
    * The results were filtered using the *Term size* slidebar.  Only the enriched gene-sets containing more than 3 and less than or equal to  10000 genes per gene-set were included in the result file. 
  2. Enrichment result 2: [gProfiler_hsapiens_lab2_results_GEM_termmin3_max250.gem.txt](./Module3/gprofiler/data/gProfiler_hsapiens_lab2_results_GEM_termmin3_max250.gem.txt) 
  * In g:Profiler, the parameters that we used were: 
    * GO_BP no electronic annotation, 
    * Reactome, 
    * WikiPathways,
    * Benjamini-HochBerg FDR 0.05. 
    * The results were filtered using the *Term size* slidebar. Only the enriched gene-sets that contain more than 3 and less than or equal to 250 genes per gene-set were included in the result file.
  3. Enrichment result 3: [gProfiler_hsapiens_Baderlab_max250.gem.txt](./Module3/gprofiler/data/gProfiler_hsapiens_Baderlab_max250.gem.txt)
  4. Pathway database 1: [gprofiler_full_hsapiens.name.gmt](./Module3/gprofiler/data/gprofiler_full_hsapiens.name.gmt)
  * This file can be downloaded directly or can be been created by concatenating the hsapiens.GO/BP.name.gmt, hsapiens.WP.namt.gmt and the hsapiens.REAC.name.gmt files contained in the g:Profiler gprofiler_hsapiens.name folder. 
  5. Pathway database 2: [Human_GOBP_AllPathways_no_GO_iea_April_02_2023_symbol_max250.gmt](./Module3/gprofiler/data/Human_GOBP_AllPathways_no_GO_iea_April_02_2023_symbol_max250.gmt)

## Exercise 1a - compare different gprofiler geneset size results

### Step 1

Launch Cytoscape and open the EnrichmentMap App

1a. Double click on Cytoscape icon

1b. Open EnrichmentMap App

*	In the Cytoscape top menu bar:

  *	Click on Apps -> EnrichmentMap

<img src="./Module3/gprofiler/images/EM1.png"  />

 * A 'Create Enrichment Map' window is now opened.

### Step 2

Create an enrichment map from 2 datasets and with a gmt file.

2a. In the '**Create Enrichment Map**' window, drag and drop the 2 enrichment files *gProfiler_hsapiens_lab2_results_GEM_termmin3_max10000.gem.txt* and 
*gProfiler_hsapiens_lab2_results_GEM_termmin3_max250.gem.txt*.

<img src="./Module3/gprofiler/images/gem0.png" alt="workflow" width="100%" />

2b. In the white box, click on "*gProfiler_hsapiens_lab2_results_GEM_termmin3_max250 (Generic/gProfiler)*" 

2c. On the right side, go to the **GMT** field, click on the 3 radio button (...) and locate the file *gprofiler_full_hsapiens.name.gmt* that you have saved on your computer to upload it.

<img src="./Module3/gprofiler/images/gem1.png" alt="workflow" width="100%" />

2d. In the white box, click on "*gProfiler_hsapiens_lab2_results_GEM_termmin3_max10000 (Generic/gProfiler)*" 

2e. On the right side, go to the **GMT** field, click on the 3 radio button (...) and locate the file *gprofiler_full_hsapiens.name.gmt* that you have saved on your computer to upload it.

2f. Locate the **FDR q-value cutoff** field and set the value to 0.001

2g. Select the **Connectivity** slide bar to **sparse**. 

<img src="./Module3/gprofiler/images/gem2.png" alt="workflow" width="100%" />

```{block, type="rmd-tip"}
Intstead of specifying the gmt file for each dataset separately, if all the dataasets in your analysis use the same gmt file, you can specify a common gmt file to be used by all datasets.  

  * Click on *Common Files (included in all datasets)* 
  * On the right side, go to the *GMT file* field, click on the 3 radio button (...) and locate the file *gprofiler_full_hsapiens.name.gmt* that you have saved on your computer to upload it.

<img src="./Module3/gprofiler/images/common_gmt.png" alt="workflow" width="100%" />

  This can also be done for a shared expression file.

```


2h. Click on *Build*.

* A status bar should pop up showing progress of the Enrichment map build.

<p align="center">
  <img src="./Module3/gprofiler/images/gem3.png" alt="workflow" width="60%"/>
</p>

### Step3: Explore the results:

In the EnrichmentMap control panel located at the left:

  * Select the 2 Data Sets (checked by default)
  * Set Chart Data o *Color by Data Set*
  * Select *Publication Ready* to remove gene-set label to have a global view of the map. 
  
```{block, type="rmd-tip"}
un-select *Publication Ready* when you explore the map in more detail to see the gene-set names. 
```

<p align="center">
   <img src="./Module3/gprofiler/images/gem3a.png" alt="workflow" width="70%" />
</p>

On the map, a node that is coloured both green and blue is a gene-set that is found in the both of the 2 gProfiler result sets that we have been uploaded. 

* A node that is blue is a gene-set that is found only in the file *gProfiler_hsapiens_lab2_results_GEM_termmin3_max10000* . 
* A node that is green is a gene-set that is found only in the file *gProfiler_hsapiens_lab2_results_GEM_termmin3_max250* .
* A blue edge represents genes that overlap between gene-sets found in the file *gProfiler_hsapiens_lab2_results_GEM_termmin3_max10000*.
* A green edge represents genes that overlap between gene-sets found in the file *gProfiler_hsapiens_lab2_results_GEM_termmin3_max250.gem*. 

 <img src="./Module3/gprofiler/images/gem6.png" alt="workflow" width="100%" />
 
 We can see clusters of blue nodes. All these nodes contain gene-sets that have more than 250 genes. Explore the detailed view (see below) to see if this cluster corresponds to informative terms. 
 
```{block, type="rmd-question"}
Would you have lost information by filtering gene-sets larger than 250 genes?
```
### Explore Detailed results 

  * In the Cytoscape menu bar, select 'View" and 'Show Graphic Details' to display node labels. 
  
```{block, type="rmd-caution"}
Make sure you have unselected "Publication Ready" in the EnrichmentMap control panel.
```

  * Zoom in to be able to read the labels and navigate the network using the bird eye view (blue rectangle).

  * Select a node and visualize the *Table Panel*
    * Click on a node
   
    * For this example the node *"Signaling by Notch"* has been selected. 
    
```{block, type="rmd-tip"}
you can type it in the search bar, quotes are important.   
```

   <img src="./Module3/gprofiler/images/gem8.png" alt="workflow" width="50%" />

When the node is selected, it is highlighted in <font color="yellow">yellow</font>.


In table panel, we can see the genes included in the gene-set. 

A green colored box indicates that the gene is in the gene-set(pathway) and in our gene list. 

A gray colored box indicated that the gene is in the gene-set but not in our gene list.

  <img src="./Module3/gprofiler/images/gem8a.png" alt="workflow" width="100%" />

## Exercise 1b - Is specifying the gmt file important?

Create an enrichment map without a gmt file to compare the results with Exercise 1a.

  * Go to Control Panel and select the EnrichmentMap tab. 
  * Click on the "+" sign to re-open the *Create Enrichment Map* window.
  <p align="center">
   <img src="./Module3/gprofiler/images/gem7.png" alt="workflow" width="50%" />
  </p> 
  
  * In the white box, select the "*gProfiler_hsapiens_lab2_results_GEM_termmin3_max250.gem (Generic/gProfiler)*" file 
  * Locate the GMT field and delete the file name, leaving it blank.
  * In the white box, select the "*gProfiler_hsapiens_lab2_results_GEM_termmin3_max10000 (Generic/gProfiler)*" file 
  * Locate the GMT field and delete the file name , leaving it blank.
  * Use same parameters as in [exercise 1a](#exercise-1a): FDR q-value cutoff of 0.001 and Connectivity to sparse.
  * Click on *Build*
 
  <img src="./Module3/gprofiler/images/gem5.png" alt="workflow" width="100%" />
 
 
 Explore the results:
 
 In the EnrichmentMap control panel located at the left:
 
  * Select the 2 Data Sets (selecteded by default)
  * Set Chart Data o *Color by Data Set*
  * Select *Publication Ready* to remove gene-set label to have a global view of the map. 
  
```{block, type="rmd-tip"}
Uncheck this box when you explore the map in details to see the gene-set names. 
```

 <p align="center">
   <img src="./Module3/gprofiler/images/gem3a.png" alt="workflow" width="70%" />
  </p>

On the map, a node that is coloured both green and blue is a gene-set that is found in the both of the 2 gProfiler result sets that we have been uploaded. 

  * A node that is blue is a gene-set that is found only in the file *gProfiler_hsapiens_lab2_results_GEM_termmin3_max10000* . 
    * A node that is green is a gene-set that is found only in the file *gProfiler_hsapiens_lab2_results_GEM_termmin3_max250* .
  * A blue edge represents genes that overlap between gene-sets found in the file *gProfiler_hsapiens_lab2_results_GEM_termmin3_max10000*.
  * A green edge represents genes that overlap between gene-sets found in the file *gProfiler_hsapiens_lab2_results_GEM_termmin3_max250.gem*. 

 
 <img src="./Module3/gprofiler/images/gem4.png" alt="workflow" width="100%" />
 

**Conclusion of exercises 1 a and 1b:**

Loading a gmt file to create an enrichment map from g:Profiler result is optional. However, there are 2 main beneficial aspects to uploading a gmt file:

  1. The map will be less condensed and easier to read and interpret.  
  1. Clicking on a node will display all genes in the gene-set and not only genes included in our query list. 


## Exercise 1c - create EM from results using Baderlab genesets
 
 Create an enrichment map from the results of g:Profiler generated using the custom Baderlab gene-set file. <br>
 To get a map that is easy to read and that does not display too many gene-sets, one option is to focus the analysis on gene-sets (pathways) that contain 250 genes or less. We prefiltered our pathway database prior to upload it into g:Profiler so that FDR is calculated only on these gene-sets (as opposed to exercise 1a where the FDR was calculated on all gene-sets and then some gene-sets > 250 genes were excluded from the result file). For this exercise, we will use:
 
  * Filtered gmt file: [Human_GOBP_AllPathways_no_GO_iea_April_02_2023_symbol_max250.gmt](./Module3/gprofiler/data/Human_GOBP_AllPathways_no_GO_iea_April_02_2023_symbol_max250.gmt). 
  
  * We have uploaded this file as a custom gmt file in g:Profiler and run the query. (in Module 2 lab)
  
  * To create an enrichment map of these results:
   * Go to Control Panel and select the EnrichmentMap tab. 
 * Click on the "+" sign to re-open the *Create Enrichment Map* window.
 <p align="center">
   <img src="./Module3/gprofiler/images/gem7.png" alt="workflow" width="50%" />
  </p> 
 * Drag the file that we created in Module 2 lab [gProfiler_hsapiens_Baderlab_max250.gem.txt](./Module3/gprofiler/data/gProfiler_hsapiens_Baderlab_max250.gem.txt) and the filtered gmt file ([Human_GOBP_AllPathways_no_GO_iea_April_02_2023_symbol_max250.gmt](./Module3/gprofiler/data/Human_GOBP_AllPathways_no_GO_iea_April_02_2023_symbol_max250.gmt) into the Datasets box on Enrichment map panel.
 * In the white box, select the "*gProfiler_hsapiens_Baderlab_max250.gem.txt (Generic/gProfiler)*" file 
 * Locate the GMT field and upload the file "*Human_GOBP_AllPathways_no_GO_iea_April_02_2023_symbol_max250.gmt*".
 * Set the **FDR q-value cutoff** to 0.001 and set the **Connectivity** slide bar to second level. 
  
  <img src="./Module3/gprofiler/images/gem9.png" alt="workflow" width="100%" />
  
 Explore the results:
 
  <img src="./Module3/gprofiler/images/gem10.png" alt="workflow" width="100%" />
 

```{block, type="rmd-caution"}
SAVE YOUR CYTOSCAPE SESSION (.cys) FILE ! 
```

## Exercise 1d (optional) - investigate individual pathways in GeneMANIA or String

Each node in the Enrichment map represents a biological process or pathway.  It consists of a collection of genes.  Often we want to know how the genes in that group interact.  There are many different ways you can investigate the underlying interactions for the given group.  Some involve searching online databases and others are directly integrated into cytoscape.

* [GeneMANIA](https://genemania.org/) - an integrative database of gene connections including co-expression, protein interactions, genetic interactions, pathways and more. **Cytoscape App**
* [String](https://string-db.org/) - an integrative database of gene connections including co-expression, protein interactions, genetic interactions, pathways and more. **Cytoscape App**
* [Pathway Commons](https://www.pathwaycommons.org/) - a intergrative database of pathways.  (There is a beta feature in EM to show your pathway in the painter app, a pathway common web page that overlays your expression data on the given pathway.  Still in beta testing and requires expression data to work correctly so won't work for this example)

### GeneMANIA

* Navigate to the enrichment map that you created using the Baderlab genesets 
    * Click on Network Tab and navigate to the third network (it should be the third network if you followed the above examples - name: gProfiler_hsapiens_Baderlab_max250_gem)
    * or in the Enrichment map panel in the top drop down select the network named gProfiler_hsapiens_Baderlab_max250_gem
* In the cytoscape search bar enter *"Signaling by Notch"* 

```{block, type="rmd-tip"}
If you can't see the selected nodes, click on "Fit Selected" to focus on the selected node.<br>
<img src="./Module3/gprofiler/images/gem11a.png" alt="workflow" width="50%" />
```


* Right click on the node *"Signaling by Notch"* and Select *Apps* --> *Enrichmemt Map - Show in GeneMANIA*

  <img src="./Module3/gprofiler/images/gem11.png" alt="workflow" width="75%" />

* A GeneMANIA Query Panel will pop up.
* Select *Select genes with expression* to reduce the query set to just the genes in the given pathway that was in your original dataset (for example we search for a set of 127 genes in g:profiler but the given pathway has 233 genes associated with it of which only 10 genes are found in our original query set )
* Click on *OK*

  <img src="./Module3/gprofiler/images/gem12.png" alt="workflow" width="75%" />

* A GeneMANIA network will show up with the connections between the genes found in your query set and the pathway "Signaling by Notch"

  <img src="./Module3/gprofiler/images/gem13.png" alt="workflow" width="75%" />

* We will go more in depth into [GeneMANIA in module 5](#genemania_cytoscape)

### String
* Navigate to the enrichment map that you created using the Baderlab genesets 
    * Click on Network Tab and navigate to the third network (it should be the third network if you followed the above examples - name: gProfiler_hsapiens_Baderlab_max250_gem)
    * or in the Enrichment map panel in the top drop down select the network named gProfiler_hsapiens_Baderlab_max250_gem
* In the cytoscape search bar enter *"Signaling by Notch"* 

```{block, type="rmd-tip"}
If you can't see the selected nodes, click on "Fit Selected" to focus on the selected node.<br>
<img src="./Module3/gprofiler/images/gem11a.png" alt="workflow" width="50%" />
```

* Right click on the node *"Signaling by Notch"* and Select *Apps* --> *Enrichmemt Map - Show in String*

  <img src="./Module3/gprofiler/images/gem14.png" alt="workflow" width="100%" />

* A String Query Panel will pop up.
* Select *Select genes with expression* to reduce the query set to just the genes in the given that pathway that was in your original dataset (for example we search for a set of 127 genes in g:profiler but the given pathway has 233 genes associated with it of which only 10 genes are found in our original query set )
* Click on *OK*

  <img src="./Module3/gprofiler/images/gem15.png" alt="workflow" width="75%" />

* A String network will show up with the connections between the genes found in your query set and the pathway "Signaling by Notch"

  <img src="./Module3/gprofiler/images/gem16.png" alt="workflow" width="75%" />
  
```{block, type="rmd-question"}
Explore the features and data of each Cytoscape app.<br>What sort of information does each tell you? <br> What is the main difference between the two resulting networks?
```

___

## Bonus - Automation. 

Run analysis directly from R for easy integration into existing pipelines.

```{block, type="rmd-bonus"}
Instead of creating an Enrichment map manually through the user interface you can create an enrichment map directly using the [RCy3 bioconductor package](https://www.bioconductor.org/packages/release/bioc/html/RCy3.html) or through direct rest calls with [Cytoscape cyrest](https://apps.cytoscape.org/apps/cyrest).  

Follow the step by step instructions on how to run from R here - https://risserlin.github.io/CBW_pathways_workshop_R_notebooks/create-enrichment-map-from-r-with-gprofiler-results.html

First, make sure your environment is set up correctly by following there instructions - https://risserlin.github.io/CBW_pathways_workshop_R_notebooks/setup.html
```