Skip to content

Commit

Permalink
migration complete
Browse files Browse the repository at this point in the history
  • Loading branch information
haozhu233 committed Mar 27, 2024
1 parent 8fd1661 commit f1931f9
Show file tree
Hide file tree
Showing 2 changed files with 89 additions and 19 deletions.
Binary file modified .DS_Store
Binary file not shown.
108 changes: 89 additions & 19 deletions docs/quick_tour.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,8 +9,8 @@ In this tutorial, we provide an example of running GRN inference using RegDiffus
We will need the python package `regdiffusion` for GRN inference. For accelerated inference speed, you may want to run `regdiffusion` on GPU devices with the latest CUDA installation.

```
import regdiffusion as rd
import numpy as numpy
>>> import regdiffusion as rd
>>> import numpy as numpy
```

## Data Loading
Expand All @@ -23,23 +23,29 @@ Here we use the `mESC` data from the BEELINE benchmark. The `mESC` data comes fr

If you want to see the inference on a larger network with 14,000+ genes and 8,000+ cells, check out the other example.

```python
bl_dt, bl_gt = rd.data.load_beeline(
benchmark_data='mESC', benchmark_setting='1000_STRING'
)
```
>>> bl_dt, bl_gt = rd.data.load_beeline(
benchmark_data='mESC', benchmark_setting='1000_STRING'
)
```

Here, `load_beeline` gives you a tuple, where the first element is an anndata of the single cell experession data and the second element is an array of all the ground truth links (based on the STRING network in this case).

```md
```python
bl_dt
```

```{code-output}
>>> bl_dt
AnnData object with n_obs × n_vars = 421 × 1620
obs: 'cell_type', 'cell_type_index'
```

```python
>>> bl_gt
array([['KLF6', 'JUN'],
['JUN', 'KLF6'],
['KLF6', 'ATF3'],
...,
['SIN3A', 'TET1'],
['MEF2C', 'TCF12'],
['TCF12', 'MEF2C']], dtype=object)
```

## GRN Inference
Expand All @@ -48,21 +54,85 @@ You are recommended to use the provided trainer to train a RegDiffusion Model. Y

During the training process, the training loss and the average amount of change on the adjacency matrix are provided on the progress bar. The model converges when the step change n the adjacency matrix is near-zero. By default, the `train` method will train the model for 1,000 iterations. It should be sufficient in most cases. If you want to keep training the model afterwards, you can simply call the `train` methods again with the desired number of iterations.

```python
>>> rd_trainer = rd.RegDiffusionTrainer(bl_dt.X)
>>> rd_trainer.train()
Training loss: 0.304, Change on Adj: -0.000: 100%|██████████| 1000/1000 [00:06<00:00, 143.53it/s]
```
rd_trainer = rd.RegDiffusionTrainer(bl_dt.X)
rd_trainer.train()
```


We run this experiment on an A100 card and the inference finishes within 8 seconds.

When ground truth links are avaiable, you can test the inference performance by setting up an evaluator. You need to provide both the ground truth links and the gene names. Note that the order of the provided gene names here should be the same as the column order in the expression table (and the inferred adjacency matrix).

```
evaluator = rd.evaluator.GRNEvaluator(bl_gt, bl_dt.var_names)
inferred_adj = rd_trainer.get_adj()
evaluator.evaluate(inferred_adj)
```python
>>> evaluator = rd.evaluator.GRNEvaluator(bl_gt, bl_dt.var_names)
>>> inferred_adj = rd_trainer.get_adj()
>>> evaluator.evaluate(inferred_adj)
{'AUROC': 0.6114549433214784,
'AUPR': 0.051973534439936624,
'AUPRR': 2.4439084980230183,
'EP': 755,
'EPR': 4.1870197171394254}
```

## GRN object

In order to facilitate the downstream analyses on GRN, we defined an `GRN` object in the `regdiffusion` package. You need to provide the gene names in the same order as in your expression table.
In order to facilitate the downstream analyses on GRN, we defined an `GRN` object in the `regdiffusion` package. You need to provide the gene names in the same order as in your expression table.

```python
>>> grn = rd_trainer.get_grn(bl_dt.var_names)
>>> grn
Inferred GRN: 1,620 TFs x 1,620 Target Genes
```

You can easily export the GRN object as a HDF5 file. Right now, HDF5 is the only supported export format but more formats will be added in the future.

```python
>>> grn.to_hdf5('demo_mESC_grn.hdf5')
```

Saving the GRN object as hdf5 makes it easy to distribute your inferred networks (or simply move it from a cloud GPU machine to your local desktop for more analysis. You can read the exported file using `read_hdf5()`.

```python
>>> recovered_grn = rd.read_hdf5('demo_mESC_grn.hdf5')
```

If size is an issue, you may consider to transform some small numbers in the adjacency matrix to zero using `top_gene_percentile` and then in the `to_hdf5` function, export `as_sparse`.

## Inspecting the local network around particular genes

In this example, we run GRN inference on one of the BEELINE benchmark single cell datasets. The provided ground truth makes it possible to validate through standard statistical metrics. However, such ground truth is in fact very noisy and incomplete.

In our paper, we proposed a method to visualize the local 2-hop to 3-hop neighborhood around selected genes. We find that genes with similar function will be topologically bounded together and form obvious functional groups. Inspecting these local networks gives us confidence that the inferred networks are biologically meaningful. Here we show an example of using these inferred networks to discover novel findings.

### Step 1. Discover target genes

There are many ways to discover target genes to study the local networks. For example, you can put your lens on the most varied genes or the top genes that are up/down regulated, using any methods you prefer. Here, we simply pick the gene that has the strongest single regulation based on the inferred adjacency matrix.

```python
>>> grn.gene_names[np.argmax(grn.adj_matrix.max(1))]
'HIST1H1D'
```

### Step 2. Visualize the local network around the selected gene

The `visualize_local_neighborhood` method of an `GRN` object extracts the 2-hop top-k neighborhood around a selected gene and visualize it using `pyvis`/`vis.js`. The default `k` here is 20. However, in cases when the regulatory relationships are strong and bidirectional, `k=20` only gives a very simple network. You may increase the magnitude of `k` to find some meaningful results to you.


```python
>>> g = grn.visualize_local_neighborhood(['HIST1H1D', 'MCM3'], k=40)
>>> g.show('view.html')
```

![](https://raw.githubusercontent.com/TuftsBCB/RegDiffusion/master/resources/mecs.png)

### Result Interpretation

In the figure below, we clearly see two clusters. Most of the genes on the right side are obviously histone related since they all start with `HIST`. Genes on the left side are not that obvious. Therefore, we did a GO enrichment analysis on this gene set using [shinyGo 0.80](http://bioinformatics.sdstate.edu/go/) and found that they are closely related to DNA replication and double strand break repair.

Recall that the `mESC` data comes from mouse embryonic stem cells, whose core functionality is replication. We believe this region of GRN represents the interaction between the core gene set of embryonic cells and histone related genes.

An interesting finding is that gene `BRCA1` sits right in the middle of the core gene group and the histone group. It suggests that `BRCA1` might play a role between histone and replication. In fact, we found a [2023 publication](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10292663/) to support this hypothesis.

![Go Enrichment analysis on one cluster](https://raw.githubusercontent.com/TuftsBCB/RegDiffusion/master/resources/shinygo.png)

0 comments on commit f1931f9

Please sign in to comment.