Graph Builder

Building a knowledge graph requires integrating several public databases. This repository provides a framework for doing that. To integrate a new database, you only need to write a parser for it; the framework then parses the database automatically and generates a file of formatted relations. You can use the formatted relations and their related entities to build a knowledge graph.

Installation

mkdir graph_database
cd graph_database

virtualenv -p python3 .env
source .env/bin/activate

pip install git+https://github.com/yjcyxky/graph-builder.git

Usage

$ graph-builder --help
Usage: graph-builder [OPTIONS]

  Parse databases and make the related graph files.

Options:
  -d, --db-file-or-dir TEXT       If it's a directory, it will be saved the downloaded database files. 
                                  Otherwise, it will be used as a custom database file in the BioMedGPS format. 
                                  In this case, you should specify the --database customdb option.
  -o, --output-dir TEXT           The directory which saved the graph files.
                                  [required]
  --database [drkg|ctd|hsdn|primekg|customdb]
                                  Which databases (you can specify the
                                  --database argument multiple times)?
                                  [required]
  -f, --ontology-file FILE        The ontology file which saved the formatted
                                  entities. We will use this file to format
                                  the relations in your database.  [required]
  -n, --n-jobs INTEGER            Hom many jobs?
  --download / --no-download      Whether download the source file(s)?
  --skip / --no-skip              Whether skip the existing file(s)?
  -l, --log-file TEXT             The log file.
  --debug / --no-debug            Whether enable the debug mode?
  --help                          Show this message and exit.
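
Because --database can be repeated, it can be convenient to assemble the argument list programmatically. The following is a minimal sketch that uses only the flags listed in the help output above; the helper name build_command and the placeholder paths (downloads, output, entities.tsv) are assumptions made for illustration and are not part of graph-builder.

```python
import shlex


def build_command(databases, output_dir, ontology_file, db_dir=None,
                  n_jobs=None, download=False, skip=False, log_file=None):
    """Assemble a graph-builder argument list (hypothetical helper)."""
    argv = ["graph-builder"]
    for name in databases:
        # --database may be specified multiple times, once per database.
        argv += ["--database", name]
    if db_dir is not None:
        argv += ["-d", db_dir]
    # -o and -f are marked [required] in the help output.
    argv += ["-o", output_dir, "-f", ontology_file]
    if n_jobs is not None:
        argv += ["-n", str(n_jobs)]
    if download:
        argv.append("--download")
    if skip:
        argv.append("--skip")
    if log_file is not None:
        argv += ["-l", log_file]
    return argv


# Placeholder paths, for illustration only.
cmd = build_command(["ctd", "hsdn"], "output", "entities.tsv",
                    db_dir="downloads", n_jobs=4, download=True)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once graph-builder is installed in the active environment.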

Example

Download the HSDN database and build the graph files. We assume that you want to save the results in the ~/Downloads/Development/biomedgps_output/hsdn directory and the downloaded files in the ~/Downloads/Development/biomedgps/hsdn directory. You can then run the following command:

graph-builder --database hsdn -d ~/Downloads/Development/biomedgps -o ~/Downloads/Development/biomedgps_output -f /Volumes/ProjectData/Current/Datasets/biomedgps/graph_data/entities.tsv -n 20 --download --skip -l ~/Downloads/Development/biomedgps_output/hsdn/log.txt --debug

In the command, you need to specify an entity file which contains the formatted entities. We will use this file to format the relations in your database. You can build the entity file by using the ontology-matcher and biomedgps-data repositories. In most cases, you don't need to build the entity file yourself; you can download it from the biomedgps-data repository.

After running the command, you can find the log file at ~/Downloads/Development/biomedgps_output/hsdn/log.txt and the graph files in the ~/Downloads/Development/biomedgps_output/hsdn directory. You may get four files: duplicated_hsdn.tsv, formatted_hsdn.tsv, hsdn.entity_id_map.json, and invalid_hsdn.tsv. The formatted_hsdn.tsv file contains the formatted relations, and the hsdn.entity_id_map.json file contains the mapping between the original entity id and the formatted entity id. The invalid_hsdn.tsv file contains the invalid relations which cannot be formatted by the entity file. The duplicated_hsdn.tsv file contains the duplicated relations which have the same relation type, source id, target id, and pmid.
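
The split into formatted, invalid, and duplicated relations described in this example can be illustrated with a small sketch. It is not the code graph-builder runs: the function name classify_relations, the field names (relation_type, source_id, target_id, pmid), and the sample ids are all assumptions chosen for illustration, and the choice to keep the first occurrence of a duplicate as the formatted one is likewise an assumption.

```python
def classify_relations(relations, entity_id_map):
    """Split raw relations into formatted, invalid, and duplicated lists.

    Hypothetical illustration; field names are assumed, not taken from
    the graph-builder output files.
    """
    formatted, invalid, duplicated = [], [], []
    seen = set()
    for rel in relations:
        source = entity_id_map.get(rel["source_id"])
        target = entity_id_map.get(rel["target_id"])
        if source is None or target is None:
            # An id with no formatted counterpart makes the relation invalid.
            invalid.append(rel)
            continue
        row = dict(rel, source_id=source, target_id=target)
        # Duplicates share relation type, source id, target id, and pmid.
        key = (row["relation_type"], row["source_id"],
               row["target_id"], row["pmid"])
        if key in seen:
            duplicated.append(row)
        else:
            seen.add(key)
            formatted.append(row)
    return formatted, invalid, duplicated


# Made-up ids and relations, for illustration only.
id_map = {"raw:A": "FMT:A", "raw:B": "FMT:B"}
raw = [
    {"relation_type": "r1", "source_id": "raw:A", "target_id": "raw:B", "pmid": "1"},
    {"relation_type": "r1", "source_id": "raw:A", "target_id": "raw:B", "pmid": "1"},
    {"relation_type": "r1", "source_id": "raw:A", "target_id": "raw:C", "pmid": "2"},
]
formatted, invalid, duplicated = classify_relations(raw, id_map)
print(len(formatted), len(invalid), len(duplicated))
```

Keying duplicates on the formatted ids rather than the original ones means two raw ids that map to the same formatted entity are treated as the same relation.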

Statistics

Now, we have integrated four public databases: DRKG, CTD, HSDN, and PrimeKG. The following table shows the statistics of these databases.

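| Database | Entities  | Relations   | PMIDs     |
|----------|-----------|-------------|-----------|
| DRKG     | 1,000,000 | 100,000,000 | 1,000,000 |
| CTD      | 1,000,000 | 100,000,000 | 1,000,000 |
| HSDN     | 1,000,000 | 100,000,000 | 1,000,000 |
| PrimeKG  | 1,000,000 | 100,000,000 | 1,000,000 |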

More databases will be integrated in the future. If there is a database you would like to see integrated, please feel free to open an issue or pull request on BioMedical Knowledgebases or this repository.