Target illumination clinical trials analytics with cheminformatics

unmtransinfo/TICTAC

TICTAC - Target illumination clinical trials analytics with cheminformatics

Mining ClinicalTrials.gov via AACT-CTTI-db for target hypotheses, with strong cheminformatics and medical terms text mining, powered by NextMove LeadMine and JensenLab Tagger. This project is supported by the NIH Illuminating the Druggable Genome (IDG) Program.

Dependencies

About AACT:

  • AACT-CTTI database from Duke.
  • According to the AACT website (accessed June 2022), data is refreshed daily.
  • AACT structure changed in November 2021, reflecting the newer ClinicalTrials.gov API.
  • Identify drugs by intervention ID, since there may be multiple drugs per trial (NCT_ID).
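
To illustrate the last point, here is a minimal sketch of keying drugs by intervention ID rather than by trial. The rows are placeholder data, and the column names (`id`, `nct_id`, `intervention_type`, `name`) are assumptions about the AACT `interventions` table that should be checked against the current schema.

```python
from collections import defaultdict

# Placeholder rows shaped like AACT `interventions` records (assumed columns).
# IDs and names are illustrative only, not real trial data.
interventions = [
    {"id": 101, "nct_id": "NCT00000001", "intervention_type": "Drug", "name": "drug A"},
    {"id": 102, "nct_id": "NCT00000001", "intervention_type": "Drug", "name": "drug B"},
    {"id": 103, "nct_id": "NCT00000001", "intervention_type": "Device", "name": "device C"},
    {"id": 104, "nct_id": "NCT00000002", "intervention_type": "DRUG", "name": "drug A"},
]


def drugs_by_trial(rows):
    """Map each NCT_ID to its drug interventions, keyed by intervention ID."""
    trials = defaultdict(dict)
    for row in rows:
        # Case-insensitive match: the casing of type values may differ
        # between AACT versions (an assumption, not confirmed here).
        if row["intervention_type"].lower() == "drug":
            trials[row["nct_id"]][row["id"]] = row["name"]
    return dict(trials)


trial_drugs = drugs_by_trial(interventions)
for nct_id, drugs in sorted(trial_drugs.items()):
    print(nct_id, drugs)
```

Keeping the intervention ID as the key means two drugs in the same trial stay distinct, and the same drug name appearing in two trials is still traceable to each trial's own record.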

References:

Text mining, aka Named Entity Recognition (NER)

Purpose:

  • Associate drugs with diseases/phenotypes.
  • Associate drugs with protein targets.
  • Associate protein targets with diseases/phenotypes (via drugs).
  • Predict and score disease-target associations.

Drugs may be experimental candidates.
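
The idea of linking protein targets to diseases via drugs can be sketched as a simple join. This is a minimal illustration under stated assumptions: the identifiers are placeholders, and the score (the number of distinct drugs linking a disease to a target) is an illustrative stand-in, not the project's scoring method.

```python
from collections import defaultdict

# Hypothetical associations with placeholder identifiers.
drug_disease = {("drugA", "disease1"), ("drugB", "disease1"), ("drugB", "disease2")}
drug_target = {("drugA", "targetX"), ("drugB", "targetX"), ("drugB", "targetY")}


def disease_target_support(drug_disease, drug_target):
    """Return {(disease, target): set of drugs that link the two}."""
    targets_of = defaultdict(set)
    for drug, target in drug_target:
        targets_of[drug].add(target)
    support = defaultdict(set)
    for drug, disease in drug_disease:
        for target in targets_of[drug]:
            support[(disease, target)].add(drug)
    return dict(support)


support = disease_target_support(drug_disease, drug_target)
# Illustrative score only: count of distinct linking drugs (an assumption).
scores = {pair: len(drugs) for pair, drugs in support.items()}
for pair, score in sorted(scores.items()):
    print(pair, score)
```

Retaining the set of supporting drugs, rather than only the count, keeps each disease-target pair traceable back to the trials that produced it.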

AACT tables of interest:

| Table | Notes |
|---|---|
| studies | titles |
| keywords | Reported; multiple vocabularies. |
| brief_summaries | (max 5000 chars) |
| detailed_descriptions | (max 32000 chars) |
| conditions | diseases/phenotypes |
| browse_conditions | MeSH links |
| interventions | Our focus is drugs only among several types. |
| browse_interventions | MeSH links |
| intervention_other_names | synonyms |
| study_references | PubMed links |
| reported_events | including adverse events |

Overall workflow:

See the top-level script Go_tictac_Workflow.sh.

  1. Data:
       • Go_aact_GetData.sh - Fetch data from AACT db.
       • Go_jensenlab_GetData.sh - Fetch dictionary data from JensenLab.
       • Go_pubmed-aact_GetData.sh - Fetch referenced records from PubMed API.
  2. Cross-references:
       • Go_pubchem_GetXrefs.sh - PubChem IDs via APIs.
       • Go_chembl_GetXrefs.sh - ChEMBL IDs via APIs.
  3. LeadMine (chemical NER):
       • Go_aact_NER_leadmine_chem.sh - LeadMine NER, CT descriptions.
       • Go_pubmed-aact_NER_leadmine_chem.sh - LeadMine NER, referenced PubMed abstracts.
  4. Tagger (disease NER):
       • Go_aact_NER_tagger_disease.sh - Tagger NER, CT descriptions.
       • Go_pubmed-aact_NER_tagger_disease.sh - Tagger NER, referenced PubMed abstracts.
  5. Results, analysis:
       • tictac.Rmd - Results described and analyzed.

Association semantics:

  • keywords, conditions, studies and summaries: reported terms and free text which may be text mined for intended associations.
  • descriptions: may be text mined for both the intended and other conditions, symptoms and phenotypic traits, which may be non-obvious from the study design.
  • study_references: via PubMed, text mining of titles and abstracts can associate diseases/phenotypes, protein targets, chemical entities and more. The "results_reference" type may include findings not anticipated in the design/protocol.
  • interventions include drug names which can be recognized and mapped to standard IDs, a task for which NextMove LeadMine is particularly suited.
  • LeadMine chemical NER also resolves entities to structures via SMILES, enabling downstream cheminformatics such as aggregation by chemical substructure and similarity.

NextMove LeadMine

Running NextMove LeadMine NER via nextmove-tools.

$ java -jar ${LIBDIR}/unm_biocomp_nextmove-0.0.1-SNAPSHOT-jar-with-dependencies.jar
usage: LeadMine_Utils [-config <CFILE>] [-h] -i <IFILE> [-idcol <IDCOL>]
       [-lbd <LBD>] [-max_corr_dist <MAX_CORR_DIST>] [-min_corr_entity_len
       <MIN_CE_LEN>] [-min_entity_len <MIN_E_LEN>] [-o <OFILE>]
       [-spellcorrect] [-textcol <TEXTCOL>] [-unquote] [-v]
LeadMine_Utils: NextMove LeadMine chemical entity recognition
 -config <CFILE>                     Input configuration file
 -h,--help                           Show this help.
 -i <IFILE>                          Input file
 -idcol <IDCOL>                      # of ID input column
 -lbd <LBD>                          LeadMine look-behind depth
 -max_corr_dist <MAX_CORR_DIST>      LeadMine Max correction (Levenshtein)
                                     distance
 -min_corr_entity_len <MIN_CE_LEN>   LeadMine Min corrected entity length
 -min_entity_len <MIN_E_LEN>         LeadMine Min entity length
 -o <OFILE>                          Output file
 -spellcorrect                       LeadMine spelling correction
 -textcol <TEXTCOL>                  # of text/document input column
 -unquote                            unquote quoted column
 -v,--verbose                        Verbose.
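
Based on the options listed above, a run needs a delimited input file with an ID column and a text column. The sketch below prepares such a file and assembles a command line without executing it. It is a hedged illustration: the tab-separated layout, the 1-based column numbering, the fallback `lib` directory and the sample records are all assumptions, and only flags shown in the help text are used.

```python
import os
import tempfile

# Hypothetical (id, text) records; the IDs and text are placeholders.
records = [
    ("NCT00000001", "Placeholder trial description mentioning drug A."),
    ("NCT00000002", "Another placeholder description\twith a stray tab."),
]


def write_tsv(path, rows):
    """Write rows as tab-separated lines, flattening whitespace inside the text."""
    with open(path, "w", encoding="utf-8") as f:
        for doc_id, text in rows:
            clean = " ".join(text.split())  # collapse tabs/newlines to single spaces
            f.write(f"{doc_id}\t{clean}\n")


def leadmine_command(jar, ifile, ofile, idcol=1, textcol=2):
    """Assemble the argument list; column numbers assumed to be 1-based."""
    return ["java", "-jar", jar, "-i", ifile, "-o", ofile,
            "-idcol", str(idcol), "-textcol", str(textcol), "-v"]


workdir = tempfile.mkdtemp()
ifile = os.path.join(workdir, "descriptions.tsv")
ofile = os.path.join(workdir, "leadmine_out.tsv")
write_tsv(ifile, records)

# LIBDIR mirrors the variable in the usage line above; "lib" is a made-up default.
jar = os.path.join(os.environ.get("LIBDIR", "lib"),
                   "unm_biocomp_nextmove-0.0.1-SNAPSHOT-jar-with-dependencies.jar")
cmd = leadmine_command(jar, ifile, ofile)
print(" ".join(cmd))  # pass cmd to subprocess.run() once the jar is available
```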

JensenLab Tagger

$ tagcorpus
Usage: tagcorpus [OPTIONS]
Required Arguments
	--types=filename
	--entities=filename
	--names=filename
Optional Arguments
	--documents=filename	Read input from file instead of from STDIN
	--groups=filename
	--type-pairs=filename	Types of pairs that are allowed
	--stopwords=filename
	--local-stopwords=filename
	--autodetect Turn autodetect on
	--tokenize-characters Turn single-character tokenization on
	--document-weight=1.00
	--paragraph-weight=2.00
	--sentence-weight=0.20
	--normalization-factor=0.60
	--threads=1
	--out-matches=filename
	--out-pairs=filename
	--out-segments=filename
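
The required and optional arguments above can be assembled programmatically. The following is a minimal sketch that builds a `tagcorpus` argument list from the three required dictionary files plus a few of the listed options; all file names are hypothetical, and the command is constructed but not run.

```python
def tagcorpus_command(types, entities, names, documents=None,
                      out_matches=None, threads=1):
    """Build a tagcorpus argument list using only options shown in the usage text."""
    for label, value in (("types", types), ("entities", entities), ("names", names)):
        if not value:
            raise ValueError(f"--{label} is required")
    cmd = ["tagcorpus", f"--types={types}", f"--entities={entities}",
           f"--names={names}"]
    if documents:
        # Without --documents, tagcorpus reads input from STDIN (per the usage text).
        cmd.append(f"--documents={documents}")
    if out_matches:
        cmd.append(f"--out-matches={out_matches}")
    cmd.append(f"--threads={threads}")
    return cmd


# Hypothetical file names, for illustration only.
tagger_cmd = tagcorpus_command(
    types="disease_types.tsv",
    entities="disease_entities.tsv",
    names="disease_names.tsv",
    documents="descriptions.tsv",
    out_matches="disease_matches.tsv",
    threads=4,
)
print(" ".join(tagger_cmd))
```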
