migration complete

TuftsBCB · Mar 27, 2024 · f1931f9 · f1931f9
1 parent 8fd1661
commit f1931f9
Show file tree

Hide file tree

Showing 2 changed files with 89 additions and 19 deletions.
diff --git a/.DS_Store b/.DS_Store
diff --git a/docs/quick_tour.md b/docs/quick_tour.md
@@ -9,8 +9,8 @@ In this tutorial, we provide an example of running GRN inference using RegDiffus
 We will need the python package `regdiffusion` for GRN inference. For accelerated inference speed, you may want to run `regdiffusion` on GPU devices with the latest CUDA installation.
 
 ```
-import regdiffusion as rd
-import numpy as numpy
+>>> import regdiffusion as rd
+>>> import numpy as numpy
 ```
 
 ## Data Loading
@@ -23,23 +23,29 @@ Here we use the `mESC` data from the BEELINE benchmark. The `mESC` data comes fr
 
 If you want to see the inference on a larger network with 14,000+ genes and 8,000+ cells, check out the other example. 
 
-```python
-bl_dt, bl_gt = rd.data.load_beeline(
-    benchmark_data='mESC', benchmark_setting='1000_STRING'
-)
+```
+>>> bl_dt, bl_gt = rd.data.load_beeline(
+        benchmark_data='mESC', benchmark_setting='1000_STRING'
+    )
 ```
 
 Here, `load_beeline` gives you a tuple, where the first element is an anndata of the single cell experession data and the second element is an array of all the ground truth links (based on the STRING network in this case). 
 
-```md
 ```python
-bl_dt
-```
-
-```{code-output}
+>>> bl_dt
 AnnData object with n_obs × n_vars = 421 × 1620
     obs: 'cell_type', 'cell_type_index'
 ```
+
+```python
+>>> bl_gt
+array([['KLF6', 'JUN'],
+       ['JUN', 'KLF6'],
+       ['KLF6', 'ATF3'],
+       ...,
+       ['SIN3A', 'TET1'],
+       ['MEF2C', 'TCF12'],
+       ['TCF12', 'MEF2C']], dtype=object)
 ```
 
 ## GRN Inference
@@ -48,21 +54,85 @@ You are recommended to use the provided trainer to train a RegDiffusion Model. Y
 
 During the training process, the training loss and the average amount of change on the adjacency matrix are provided on the progress bar. The model converges when the step change n the adjacency matrix is near-zero. By default, the `train` method will train the model for 1,000 iterations. It should be sufficient in most cases. If you want to keep training the model afterwards, you can simply call the `train` methods again with the desired number of iterations. 
 
+```python
+>>> rd_trainer = rd.RegDiffusionTrainer(bl_dt.X)
+>>> rd_trainer.train()
+Training loss: 0.304, Change on Adj: -0.000: 100%|██████████| 1000/1000 [00:06<00:00, 143.53it/s]
 ```
-rd_trainer = rd.RegDiffusionTrainer(bl_dt.X)
-rd_trainer.train()
-```
+
 
 We run this experiment on an A100 card and the inference finishes within 8 seconds. 
 
 When ground truth links are avaiable, you can test the inference performance by setting up an evaluator. You need to provide both the ground truth links and the gene names. Note that the order of the provided gene names here should be the same as the column order in the expression table (and the inferred adjacency matrix). 
 
-```
-evaluator = rd.evaluator.GRNEvaluator(bl_gt, bl_dt.var_names)
-inferred_adj = rd_trainer.get_adj()
-evaluator.evaluate(inferred_adj)
+```python
+>>> evaluator = rd.evaluator.GRNEvaluator(bl_gt, bl_dt.var_names)
+>>> inferred_adj = rd_trainer.get_adj()
+>>> evaluator.evaluate(inferred_adj)
+{'AUROC': 0.6114549433214784,
+ 'AUPR': 0.051973534439936624,
+ 'AUPRR': 2.4439084980230183,
+ 'EP': 755,
+ 'EPR': 4.1870197171394254}
 ```
 
 ## GRN object
 
-In order to facilitate the downstream analyses on GRN, we defined an `GRN` object in the `regdiffusion` package. You need to provide the gene names in the same order as in your expression table.
+In order to facilitate the downstream analyses on GRN, we defined an `GRN` object in the `regdiffusion` package. You need to provide the gene names in the same order as in your expression table.
+
+```python
+>>> grn = rd_trainer.get_grn(bl_dt.var_names)
+>>> grn
+Inferred GRN: 1,620 TFs x 1,620 Target Genes
+```
+
+You can easily export the GRN object as a HDF5 file. Right now, HDF5 is the only supported export format but more formats will be added in the future.
+
+```python
+>>> grn.to_hdf5('demo_mESC_grn.hdf5')
+```
+
+Saving the GRN object as hdf5 makes it easy to distribute your inferred networks (or simply move it from a cloud GPU machine to your local desktop for more analysis. You can read the exported file using `read_hdf5()`.
+
+```python
+>>> recovered_grn = rd.read_hdf5('demo_mESC_grn.hdf5')
+```
+
+If size is an issue, you may consider to transform some small numbers in the adjacency matrix to zero using `top_gene_percentile` and then in the `to_hdf5` function, export `as_sparse`.
+
+## Inspecting the local network around particular genes
+
+In this example, we run GRN inference on one of the BEELINE benchmark single cell datasets. The provided ground truth makes it possible to validate through standard statistical metrics. However, such ground truth is in fact very noisy and incomplete. 
+
+In our paper, we proposed a method to visualize the local 2-hop to 3-hop neighborhood around selected genes. We find that genes with similar function will be topologically bounded together and form obvious functional groups. Inspecting these local networks gives us confidence that the inferred networks are biologically meaningful. Here we show an example of using these inferred networks to discover novel findings. 
+
+### Step 1. Discover target genes
+
+There are many ways to discover target genes to study the local networks. For example, you can put your lens on the most varied genes or the top genes that are up/down regulated, using any methods you prefer. Here, we simply pick the gene that has the strongest single regulation based on the inferred adjacency matrix. 
+
+```python 
+>>> grn.gene_names[np.argmax(grn.adj_matrix.max(1))]
+'HIST1H1D'
+```
+
+### Step 2. Visualize the local network around the selected gene
+
+The `visualize_local_neighborhood` method of an `GRN` object extracts the 2-hop top-k neighborhood around a selected gene and visualize it using `pyvis`/`vis.js`. The default `k` here is 20. However, in cases when the regulatory relationships are strong and bidirectional, `k=20` only gives a very simple network. You may increase the magnitude of `k` to find some meaningful results to you. 
+
+
+```python
+>>> g = grn.visualize_local_neighborhood(['HIST1H1D', 'MCM3'], k=40)
+>>> g.show('view.html')
+```
+
+![](https://raw.githubusercontent.com/TuftsBCB/RegDiffusion/master/resources/mecs.png)
+
+### Result Interpretation 
+
+In the figure below, we clearly see two clusters. Most of the genes on the right side are obviously histone related since they all start with `HIST`. Genes on the left side are not that obvious. Therefore, we did a GO enrichment analysis on this gene set using [shinyGo 0.80](http://bioinformatics.sdstate.edu/go/) and found that they are closely related to DNA replication and double strand break repair. 
+
+Recall that the `mESC` data comes from mouse embryonic stem cells, whose core functionality is replication. We believe this region of GRN represents the interaction between the core gene set of embryonic cells and histone related genes. 
+
+An interesting finding is that gene `BRCA1` sits right in the middle of the core gene group and the histone group. It suggests that `BRCA1` might play a role between histone and replication. In fact, we found a [2023 publication](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10292663/) to support this hypothesis.
+
+![Go Enrichment analysis on one cluster](https://raw.githubusercontent.com/TuftsBCB/RegDiffusion/master/resources/shinygo.png)