This program designs padlock probes for spatial transcriptomics or multiplex RNA FISH. Given a list of genes of interest, it retrieves the corresponding mRNA sequence for each gene and, using the search strategy described below, looks for binding sites of the length you specify.
Because the package used for RNA secondary structure detection is Linux-only, this code is recommended to run on Linux, macOS, or Windows Subsystem for Linux.
Clone the repo to your app/repo dir:
git clone --depth 1 https://github.com/tangmc0210/probe_designer
Create the environment using conda:
conda env create --file environment.yml
Create a directory for your task, like exmple_dataset. Prepare a .xlsx file with a "gene_name" column that lists the target genes, such as Exmple_dataset/gene_list.xlsx.
gene_name | ... |
---|---|
CD3D | ... |
CD4 | ... |
CD8A | ... |
... | ... |
Run binding_search_ensmbl.ipynb step by step. You need to adjust the work path and the organism type, and you may also need to revise the gene_list loading part to match the content of your input file. Note that because NCBIWWW BLAST is unstable when uploading large files, you may need to run BLAST on the NCBI BLAST web page using the file generated in your result path, results/datetime/total_bds_candidate_blast.fasta, download the results in XML format, and save them as results/datetime/blast_results.
In this step, the program identifies genes from the provided list that were not found in the binding site search results. It reads the gene names from marker_gene_list.xlsx, merges the results from the probes_wanted.xlsx files, and standardizes the gene names for comparison. It then filters for any genes that do not appear in the results and records them in to_search.txt, so that any unanalyzed genes are documented for follow-up.
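The comparison described above can be sketched in plain Python. This is only an illustration: the function name `find_missing_genes` is hypothetical, and standardizing names by stripping whitespace and uppercasing is an assumption, not necessarily the rule the notebook applies.

```python
def find_missing_genes(requested, found):
    """Return requested gene names that are absent from the results, in input order."""
    # Assumed standardization: strip whitespace and uppercase both sides.
    found_keys = {name.strip().upper() for name in found}
    missing = []
    seen = set()
    for name in requested:
        key = name.strip().upper()
        if key and key not in found_keys and key not in seen:
            seen.add(key)
            missing.append(name.strip())
    return missing


requested = ["CD3D", "CD4", "CD8A", "Sox10"]   # e.g. the gene_name column
found = ["cd3d", "CD8A "]                      # e.g. genes present in probes_wanted.xlsx
print("\n".join(find_missing_genes(requested, found)))
```

The printed lines are what would be written to to_search.txt under these assumptions.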
For the genes that still need to be searched, use the binding_searcher files for Ensembl or NCBI: you can loosen the conditions or manually select suitable entries in the NCBI web database. For example, go to the NCBI web page and search the 'nucleotide' database with a query like: Apod, mouse, mRNA. Then record the corresponding entry's GI in the binding_searcher_NCBI.ipynb file tree.
To develop
workdir = '~/probe_designer/dataset'
Run the code in the proper environment:
python main.py
Then prepare the files following the commands.
The binding site search strategy has been improved by optimizing the search for binding positions that are maximally distant from each other. The program scans every possible point along the mRNA sequence and selects binding positions based on specific criteria, enhancing the accuracy of the search process.
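One way to read "maximally distant" is to pick a fixed number of candidate positions so that the smallest spacing between neighbours is as large as possible. The sketch below implements that reading with a binary search over the spacing; the function names and the integer start coordinates are assumptions made for illustration, and the program's own selection may differ.

```python
def pick_with_gap(positions, k, gap):
    """Greedily pick up to k sorted positions that are at least `gap` apart."""
    chosen = [positions[0]]
    for pos in positions[1:]:
        if len(chosen) == k:
            break
        if pos - chosen[-1] >= gap:
            chosen.append(pos)
    return chosen


def select_spread_sites(positions, k):
    """Choose k positions whose smallest neighbour spacing is as large as possible."""
    positions = sorted(set(positions))
    if k <= 0 or not positions:
        return []
    if k >= len(positions):
        return positions
    lo, hi = 1, positions[-1] - positions[0]
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if len(pick_with_gap(positions, k, mid)) == k:
            lo = mid        # spacing `mid` is achievable, try a larger one
        else:
            hi = mid - 1    # spacing `mid` is too ambitious
    return pick_with_gap(positions, k, lo)


# Start coordinates of candidate sites that already passed the filters.
print(select_spread_sites([0, 10, 20, 30, 40, 100], 3))
```

The greedy check is exact for points on a line, so the binary search returns an optimal spacing under this definition.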
The selection of binding sites follows two sets of rules: pre-BLAST rules, which filter sequences before conducting the BLAST search, and post-BLAST rules, which refine the selection based on sequence specificity after BLAST analysis. Pre-BLAST rules include criteria such as G content and melting temperature, while post-BLAST rules ensure the binding sites are specific to the target gene in the relevant organisms.
RUN_ID
└──results
└──20240910_134726_ensembl
│ sequence_of_all.json
│ shortest_isoforms.json
│ bds_candidate_num.json
│ total_bds_candidate.fasta
│ total_bds_candidate_blast.fasta
│ blast_results.xml # manual
│ probes_candidates.xlsx
│ probes_wanted.xlsx
│
└───bds_candidate
*.fasta
The program starts by retrieving gene information from the Ensembl database based on a provided list of gene names. It utilizes the Ensembl API to gather relevant sequences, ensuring the sequences correspond to the specified organism. Note that gene names may require formatting adjustments depending on the species (e.g., uppercase for human genes).
This step generates the file results/datetime/sequence_of_all.json, which contains all retrieved sequences for further analysis.
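As a rough illustration of the name formatting mentioned above, the sketch below normalizes a gene symbol per species and builds a lookup URL. It assumes the public Ensembl REST `lookup/symbol` endpoint is used, which this README does not confirm; the function names are hypothetical, and the mouse rule (first letter uppercase, rest lowercase) is the usual convention rather than something taken from the notebook.

```python
from urllib.parse import quote

ENSEMBL_REST = "https://rest.ensembl.org"  # assumption: public REST server


def format_gene_name(name, species):
    """Adjust the case of a gene symbol for the given species."""
    name = name.strip()
    if species == "homo_sapiens":
        return name.upper()                      # e.g. cd4 -> CD4
    if species == "mus_musculus":
        return name[:1].upper() + name[1:].lower()  # e.g. APOD -> Apod
    return name                                  # other species: leave unchanged


def lookup_url(name, species):
    """Build the symbol lookup URL; expand=1 asks for transcripts as well."""
    symbol = quote(format_gene_name(name, species))
    return (f"{ENSEMBL_REST}/lookup/symbol/{species}/{symbol}"
            "?content-type=application/json;expand=1")


print(lookup_url("cd4", "homo_sapiens"))
```

No request is sent here; the URL can be passed to any HTTP client.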
Using the gene information obtained, the program fetches the mRNA sequences from Ensembl, retrying up to three times if a retrieval attempt fails. It then keeps the shortest isoform of each gene for the binding site search.
This step produces the file results/datetime/shortest_isoforms.json, which lists the shortest isoforms for all genes of interest.
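The retry and shortest-isoform logic can be sketched as follows. The helper names, the `fetch` callable, and the dictionary layout (transcript ID to sequence) are assumptions for illustration; the actual notebook may structure this differently.

```python
import time


def fetch_with_retry(fetch, gene, attempts=3, delay=1.0):
    """Call fetch(gene), retrying on any exception up to `attempts` times."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(gene)
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay)  # wait before the next attempt
    raise RuntimeError(f"{gene}: failed after {attempts} attempts") from last_error


def shortest_isoform(isoforms):
    """Return (transcript_id, sequence) of the shortest sequence; ties break by ID."""
    return min(isoforms.items(), key=lambda item: (len(item[1]), item[0]))


# Demo with a stand-in fetcher that fails twice before succeeding.
calls = {"n": 0}

def flaky_fetch(gene):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return {"T1": "AUGGCCAAG", "T2": "AUGGCC"}

isoforms = fetch_with_retry(flaky_fetch, "CD4", delay=0)
print(shortest_isoform(isoforms))
```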
As outlined in the Binding Site Search Strategy, the binding site search locates binding positions that are maximally distant from each other. The program scans the mRNA sequences and identifies potential binding sites based on parameters such as binding site length and G content.
This step is intended to improve hybridization performance. It saves the candidate FASTA files as results/datetime/bds_candidate/*.fasta. The main output file, total_bds_candidate.fasta, is prepared for subsequent BLAST analysis, and a summary of the valid sequences is saved in results/datetime/bds_candidate_num.json.
After obtaining the candidate binding sites, you can perform a BLAST search on total_bds_candidate_blast.fasta to produce results/datetime/blast_results.xml. This file contains the alignment results for further interpretation. You may run a local BLAST or use the NCBI BLAST web service for this task.
- Tip: The web BLAST has a limit of about 1200 sequences per submission; consider splitting your input if you have many target genes.
- Local BLAST Setup: For local BLAST, refer to Download BLAST Software and Databases.
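If the candidate file exceeds the web limit mentioned in the tip, it can be split into smaller submissions. The following is a minimal sketch using only the standard library; the function names are hypothetical and the 1200 figure is the approximate limit quoted above, not an official value.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_records(records, max_per_file=1200):
    """Split records into consecutive batches of at most max_per_file."""
    return [records[i:i + max_per_file] for i in range(0, len(records), max_per_file)]


def to_fasta(records):
    """Serialize (header, sequence) tuples back to FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


demo = [(f"site_{i}", "ACGT") for i in range(2500)]
batches = split_records(read_fasta(to_fasta(demo)))
print([len(batch) for batch in batches])
```

Each batch can then be written to its own file and uploaded separately.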
The program selects binding sites in two stages: pre-BLAST rules filter out unsuitable sequences before the BLAST analysis, and post-BLAST rules refine the selection based on the alignment results.
Pre-BLAST rules:
- G content is between 40% and 70%.
- No run of 5 or more consecutive Gs.
- Melting temperature (Tm) of both arms of the binding site is between 45 and 65 °C.
- Maximum free energy of RNA secondary structure is above -10 kcal/mol.
- The sequence type must be mRNA and coding sequence (as determined by the GenBank file).
Post-BLAST rules:
- Binding site sequences must be specific to the target gene in relevant organisms (determined by BLAST results).
- Binding site sequences must be plus/minus to the target gene.
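The two G-based pre-BLAST rules are simple enough to sketch in plain Python. The function names are hypothetical; the Tm and free-energy rules are left out because they depend on external packages that this README does not name.

```python
import re


def g_fraction(seq):
    """Fraction of bases in seq that are G."""
    seq = seq.upper()
    return seq.count("G") / len(seq) if seq else 0.0


def longest_g_run(seq):
    """Length of the longest stretch of consecutive Gs."""
    runs = re.findall(r"G+", seq.upper())
    return max((len(run) for run in runs), default=0)


def passes_g_rules(seq, min_g=0.40, max_g=0.70, max_run=4):
    """True if G content is within [min_g, max_g] and no G run exceeds max_run."""
    return min_g <= g_fraction(seq) <= max_g and longest_g_run(seq) <= max_run


for candidate in ["GATGGCGTAG", "GGGGGATGCA", "ATATGCATAT"]:
    print(candidate, passes_g_rules(candidate))
```

The thresholds are exposed as parameters so the conditions can be loosened for genes that yield too few candidates.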
Run merge.ipynb to consolidate the results obtained from the above steps. This notebook compiles the data into an Excel file detailing the selected binding sites for each gene, along with a record of any genes missing from the pipeline, saved as results/_tosearch.txt.
RUN_ID
└──results
└───20240910_140515_NCBI
│ gene_name_list.txt # manual
│ gene_id_list.txt # manual
│ gene_seq_in_file.gb
│ bds_candidate_num.json
│ total_bds_candidate.fasta
│ total_bds_candidate_blast.fasta
│ blast_results.xml # manual
│ probes_candidates.xlsx
│ probes_wanted.xlsx
│
└───bds_candidate
SOX10_bds_candidate.fasta
Given the gene name list, the program searches for the target sequences using Entrez, provided by NCBI. However, the search results are often mismatched, so I recommend searching on the website and choosing the target GI yourself for the next step.
This step returns the file results/datetime/gene_id_list.txt.
The program downloads the RNA sequences from NCBI according to the GI numbers given in the tmp file "1_id_list.txt". If the automatic search in the previous step gives poor results, prepare the id_list file yourself by searching on the web.
This step returns the file results/datetime/gene_seq_in_file.gb.
As mentioned in Binding site search strategy, the current search strategy is brute force. The program begins in the middle of an mRNA sequence and extends to both sides with a fixed step length (the length of the mRNA sequence divided by the binding site length).
This part is the most likely to benefit from improvement if you want better hybridization performance.
This step returns a series of .fasta files as results/datetime/bds_candidate/*.fasta, of which total_bds_candidates.fasta is the one to BLAST. It also returns a results/datetime/tmp/pre_binding_num.json file, which shows the valid sequences obtained from the box searcher.
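The middle-out scan can be sketched as a generator of window start positions. This is one possible reading of the description: the function name is hypothetical, integer division is assumed for the step, and both the centering of the first window and the order in which the two sides are visited are assumptions.

```python
def middle_out_starts(seq_len, site_len):
    """Yield window starts, beginning at the middle and stepping out to both sides."""
    last_start = seq_len - site_len
    if site_len <= 0 or last_start < 0:
        return  # sequence is shorter than one binding site
    step = max(1, seq_len // site_len)  # step length as described above
    middle = last_start // 2            # start of the centered window
    yield middle
    offset = step
    while middle - offset >= 0 or middle + offset <= last_start:
        if middle + offset <= last_start:
            yield middle + offset       # extend toward the 3' side
        if middle - offset >= 0:
            yield middle - offset       # extend toward the 5' side
        offset += step


mrna = "AUGGCCAAGCUUGGAUCCGA"  # toy 20-nt sequence
for start in middle_out_starts(len(mrna), 5):
    print(start, mrna[start:start + 5])
```

Each yielded window would then be checked against the pre-BLAST rules before being kept as a candidate.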
You can run a local BLAST or a web BLAST on total_bds_candidate_blast.fasta to get the results/datetime/blast_results.xml file for decoding. To run BLAST from within this code, you need to set up a local BLAST installation. Otherwise, upload the file to the BLAST website and download the results as results/datetime/blast_results.xml.
- Tip1: web BLAST is limited to about 1200 sequences per submission, so split the gene name list if the number of target genes is too large.
- How to build local blast: Download BLAST Software and Databases
Current rules are divided into two kinds: pre_blast rules and post_blast rules. Pre_blast rules are applied before BLASTing the potential binding site sequences and usually exclude most of them, while post_blast rules are applied after BLAST and are intended to capture sequence specificity information.
Current pre-blast rules:
- G percentage is between 40% and 70%;
- No run of 5 or more consecutive Gs;
- Tm of both arms of the bds is between 45 and 65 °C;
- Max free energy of RNA secondary structure is above -10 kcal/mol;
- Sequence type is mRNA and coding sequence (judged from the GenBank file);
Current post-blast rules:
- Binding site sequence is specific to this gene in the organisms of interest (judged from the BLAST results);
- Binding site sequence is plus/minus to the target gene;
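The post-blast rules can be checked directly against a BLAST XML file with the standard library. The sketch below is illustrative only: it assumes hit titles carry the gene symbol as "(SYMBOL)", that strand is encoded in the query and hit frame fields as in blastn XML output, and that a query passes only if every hit satisfies both rules. The function name, the `target_gene` mapping, and the sample records are made up for the example.

```python
import xml.etree.ElementTree as ET


def check_queries(xml_text, target_gene):
    """Return {query title: bool} telling whether each query passes both rules."""
    verdicts = {}
    for iteration in ET.fromstring(xml_text.strip()).iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        marker = f"({target_gene[query].upper()})"
        hits = list(iteration.iter("Hit"))
        passed = bool(hits)  # a query without any hit is rejected
        for hit in hits:
            # Specificity: every hit title must name the target gene.
            if marker not in hit.findtext("Hit_def", default="").upper():
                passed = False
            for hsp in hit.iter("Hsp"):
                # Orientation: query frame 1 and hit frame -1 means plus/minus.
                frames = (hsp.findtext("Hsp_query-frame"), hsp.findtext("Hsp_hit-frame"))
                if frames != ("1", "-1"):
                    passed = False
        verdicts[query] = passed
    return verdicts


SAMPLE_XML = """
<BlastOutput><BlastOutput_iterations>
<Iteration>
  <Iteration_query-def>CD4_bds_1</Iteration_query-def>
  <Iteration_hits><Hit>
    <Hit_def>Homo sapiens CD4 molecule (CD4), mRNA</Hit_def>
    <Hit_hsps><Hsp>
      <Hsp_query-frame>1</Hsp_query-frame><Hsp_hit-frame>-1</Hsp_hit-frame>
    </Hsp></Hit_hsps>
  </Hit></Iteration_hits>
</Iteration>
<Iteration>
  <Iteration_query-def>CD4_bds_2</Iteration_query-def>
  <Iteration_hits><Hit>
    <Hit_def>Homo sapiens CD40 molecule (CD40), mRNA</Hit_def>
    <Hit_hsps><Hsp>
      <Hsp_query-frame>1</Hsp_query-frame><Hsp_hit-frame>-1</Hsp_hit-frame>
    </Hsp></Hit_hsps>
  </Hit></Iteration_hits>
</Iteration>
</BlastOutput_iterations></BlastOutput>
"""

print(check_queries(SAMPLE_XML, {"CD4_bds_1": "CD4", "CD4_bds_2": "CD4"}))
```

Matching on "(SYMBOL)" rather than the bare symbol avoids treating CD40 as a CD4 hit, but real titles vary, so this check should be adapted to the database being searched.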
Run merge.ipynb to merge the results obtained above and return an xlsx file of the wanted binding sites for each gene. It also returns the genes missing from the pipeline as results/_tosearch.txt.
- RNA secondary structure detection. 2024.6 done.
- Most distant bds: the search strategy is optimized by searching every possible point and selecting the binding positions that are farthest apart. 2023.6 done.