Inductive Reasoning with Text - Models
We highly recommend using miniconda for managing your Python version. All requirements (including the self-installation of this package) are defined in requirements.txt:
conda create --name irtm python=3.9
conda activate irtm
pip install -r requirements.txt
There is now a command-line client installed: irtm. It handles the entry points for both pykeen-based closed-world knowledge graph completion (kgc) and open-world kgc using a huggingface BERT transformer trained with pytorch-lightning. Each entry point is defined in the respective module's __init__.py. If you do not want to use the CLI, you can look there to find the associated API entry point (e.g. for irtm kgc train, irtm.kgc.__init__.py invokes irtm.kgc.trainer.train_from_kwargs, which in turn calls irtm.kgc.trainer.train). The whole project follows this convention.
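If you prefer to bypass the CLI, the same entry point can be reached directly from Python. The following is only a sketch: the keyword arguments are assumed to mirror the CLI options of irtm kgc train and are not taken from the actual function signature, so check irtm/kgc/__init__.py and irtm/kgc/trainer.py for the exact interface.
# sketch only: argument names are assumed to mirror the CLI options of
# `irtm kgc train`, not taken from the actual function signature
from irtm.kgc.trainer import train_from_kwargs

train_from_kwargs(
    config='conf/kgc/irt.cde.distmult-sweep.yml',  # mirrors --config (assumed)
    dataset='../data/irt/irt.cde',                 # mirrors --dataset (assumed)
    out='data/kgc/irt-cde/distmult.sweep',         # mirrors --out (assumed)
)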
> irtm --help
Usage: irtm [OPTIONS] COMMAND [ARGS]...
IRTM - working with texts and graphs
Options:
--help Show this message and exit.
Commands:
kgc Closed-world knowledge graph completion
text Open-world knowledge graph completion using free text
A log file is written to data/irtm.log when using the CLI. You can configure the logger using conf/logging.conf.
You can see the different validation/test results in the spreadsheet. For more training insights see the Weights&Biases result trackers for closed-world KGC and Mapper training. You can find a selection of these models in the legacy download section below (they use the pre-refactoring code).
The following models have been trained with the new code and are ready for use (W&B board):
| Version | Text   | Mapper | Contexts | Download | hits@10 |
|---------|--------|--------|----------|----------|---------|
| IRT-CDE | masked | multi  | 30       | Link     | 42.41   |
| IRT-FB  | masked | multi  | 30       | Link     | 36.00   |
To load a model, a few steps are required. First, download the required dataset here and extract it somewhere. Then download one of the above models and put it somewhere else:
mkdir data
pushd data
# download files
wget http://lavis.cs.hs-rm.de/storage/irt/cde.tgz
wget http://lavis.cs.hs-rm.de/storage/irt/mapper.cde.30.multi-cls.masked.tgz
# extract files
tar xzf cde.tgz
tar xzf mapper.cde.30.multi-cls.masked.tgz
popd
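After extraction, the data directory should look roughly like this (directory and file names are taken from the configuration and checkpoint paths used below; the archives may contain additional files):
data/
├── irt.cde/                                  # the IRT-CDE dataset
└── 2021.06.28-18.31.20/                      # the downloaded mapper model
    ├── config.yml
    ├── distmult/                             # the closed-world KGC model
    └── weights/irtm-text/2xwchpsl/checkpoints/
        └── epoch=53-step=61559.ckpt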
You need to provide a small configuration file which points to the directories of the data you downloaded, and put it somewhere (e.g. data/config.yml):
dataset: data/irt.cde
out: data/2021.06.28-18.31.20
kgc_model: data/2021.06.28-18.31.20/distmult
Now you can load the model:
import pathlib
from irtm.text import mapper
from irtm.text.config import Config
# this overwrites the original file paths
config = Config.create(['data/2021.06.28-18.31.20/config.yml', 'data/config.yml'])
checkpoint = pathlib.Path(config.out) / 'weights/irtm-text/2xwchpsl/checkpoints/epoch=53-step=61559.ckpt'
# irtmod is the pytorch datamodule
# irtmc are the model components (upstream etc.)
irtmod, irtmc = mapper.load_from_config(config)
model = mapper.Mapper.load_from_checkpoint(str(checkpoint), irtmod=irtmod, irtmc=irtmc)
Et voilà!
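As a quick sanity check, you can treat the loaded mapper like any other pytorch module, for example to put it into evaluation mode and count its trainable parameters (standard torch calls only, nothing IRTM-specific):
# the mapper is a pytorch-lightning module, so standard torch calls apply
model.eval()  # disable dropout etc. for inference

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'loaded mapper with {n_params:,} trainable parameters')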
The two-step approach is outlined in the following sections.
The irtm.kgc module offers kgc functionality on top of pykeen.
> irtm kgc train --help
Usage: irtm kgc train [OPTIONS]
Train a knowledge graph completion model
Options:
--config TEXT yaml (see conf/kgc/*yml) [required]
--dataset TEXT path to irt.Dataset folder [required]
--participate / --create for multi-process optimization
--help Show this message and exit.
You need an IRT dataset (see irt.Dataset) and a configuration file (see conf/kgc/*.yml). Models are trained by simply providing these two arguments:
irtm kgc train \
--config conf/kgc/irt.cde.distmult-sweep.yml \
--dataset ../data/irt/irt.cde \
--out data/kgc/irt-cde/distmult.sweep
This particular configuration starts a hyperparameter sweep (defining ranges/sets for the parameter space). If you want multiple instances (i.e. on multiple GPUs) to train in parallel for the same sweep, simply invoke the same command with the additional --participate flag:
irtm kgc train \
--config conf/kgc/irt.cde.distmult-sweep.yml \
--dataset ../irt/data/irt/irt.cde \
--out data/kgc/irt-cde/distmult.sweep \
--participate
To employ the hyperparameter configuration used for the model
described in the paper, use the associated *-best.yml
files.
To evaluate a trained model, use the irtm kgc evaluate command. This expects one or more directories containing trained models (e.g. all models of a sweep), runs an evaluation on one of the dataset's splits (e.g. "validation"), and saves the results to a file:
irtm kgc evaluate \
--dataset ../irt/data/irt/irt.cde \
--out data/kgc/irt-cde/distmult.sweep \
data/kgc/irt-cde/distmult.sweep/trial-*
The irtm.text module offers training for the text projector. You need to have a closed-world KGC model trained with the irtm.kgc module as described here.
irtm text --help
Usage: irtm text [OPTIONS] COMMAND [ARGS]...
Open-world knowledge graph completion using free text
Options:
--help Show this message and exit.
Commands:
cli Open an interactive python shell dataset: path to...
evaluate Evaluate a mapper on the test split
evaluate-all Run evaluations for all saved checkpoints
evaluate-csv Run evaluations based on a csv file
resume Resume training of a mapper
train Train a mapper to align embeddings
If you just want to play around a little bit and understand the
datamodule, you can spawn an interactive ipython shell with the cli
command:
irtm text cli --dataset ../irt/data/irt/irt.cde --model bert-base-cased [--mode masked]
IRT dataset:
IRT graph: [irt-cde] (17050 entities)
IRT split: closed_world=137388 | open_world-valid=41240 | open_world-test=27577
irt text: ~24.71 text contexts per entity
keen dataset: [irt-cde]: closed world=137388 | open world validation=41240 | open world testing=27577
--------------------
IRTM KEEN CLIENT
--------------------
variables in scope:
ids: irt.Dataset
kow: irt.KeenOpenworld
tdm: irt.TorchModule
you can now play around, e.g.:
[1] dl = tdm.train_dataloader()
[2] gen = iter(dl)
[3] next(gen)
Training a mapper requires some configuration. The configuration options are documented extensively in conf/text/defaults.yml. The configuration used for the experiments documented in the paper is composed of the files in conf/text/irt*. You can pass an arbitrary number of yml files via the -c parameter; the final configuration is created from this sequence, with later configurations overwriting former ones. Single options can also be set directly via command-line flags:
irtm text train --help
Usage: irtm text train [OPTIONS]
Train a mapper to align embeddings
Options:
--debug only test a model and do not log
-c, --config TEXT one or more configuration files
--valid-split FLOAT
--wandb-args--project TEXT
--wandb-args--log-model BOOLEAN
--trainer-args--gpus INTEGER
--trainer-args--max-epochs INTEGER
--trainer-args--fast-dev-run BOOLEAN
(...) etc
We leave the configurations we used for the experiments as they are, for documentation. You certainly do not need this level of flexibility and can of course provide a single configuration file. But, for example, to train a 30-sentence multi-context mapper with early stopping on a GPU with 24GB of memory on IRT-CDE while overwriting the learning rate, you can combine the configuration files like this:
irtm text train \
-c conf/text/irt/defaults.yml \
-c conf/text/irt/early_stopping.30.yml \
-c conf/text/irt/cde.gpu.24g.train.30.yml \
-c conf/text/irt/cde.yml \
-c conf/text/irt/exp.m02.yml \
--optimizer-args--lr 0.00005 \
--mode masked
To evaluate the trained model, run any of the irtm text evaluate* commands. For example, to evaluate a single checkpoint, irtm text evaluate requires the following parameters:
irtm text evaluate --help
Usage: irtm text evaluate [OPTIONS]
Evaluate a mapper on the test split
Options:
--path TEXT path to model directory [required]
--checkpoint TEXT path to model checkpoint [required]
-c, --config TEXT one or more configuration files
--debug run everything fast, do not write anything
--help Show this message and exit.
So, for a trained model that is inside folder $dir:
irtm text evaluate \
--path $dir \
--checkpoint $dir/weights/.../epoch=...ckpt \
-c $dir/config.yml \
--debug
This writes the evaluation results to a yaml file whose name is derived from the provided checkpoint. For example:
grep -E 'transductive|inductive|test|both.realistic.hits_at_10' $dir/report/evaluation.epoch=53-step=61559.ckpt.yml
inductive:
both.realistic.hits_at_10: 0.4268671193016489
test:
both.realistic.hits_at_10: 0.42410341951626357
transductive:
both.realistic.hits_at_10: 0.37879945846798846
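To post-process such reports programmatically, here is a small sketch. It assumes the report nests the flattened metric keys one level below each split, as the grep output above suggests; the file name is the one produced in this example.
import yaml

# report written by `irtm text evaluate` (see above); adjust $dir accordingly
path = 'report/evaluation.epoch=53-step=61559.ckpt.yml'

with open(path) as fd:
    report = yaml.safe_load(fd)

for split in ('transductive', 'inductive', 'test'):
    hits = report[split]['both.realistic.hits_at_10']
    print(f'{split}: hits@10 = {hits:.4f}')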
Selection of original models. You need the legacy datasets that can be found in the IRT repository. The code version required to load this data and these models is here. If you need other models (see the spreadsheet), just drop me a message and I will extend this table:
IRT-CDE
Trained KGC-Model: Link
| Version | Text   | Mapper | Contexts | Download |
|---------|--------|--------|----------|----------|
| IRT-CDE | masked | single | 1        | Link     |
| IRT-CDE | masked | multi  | 30       | Link     |
| IRT-CDE | masked | single | 30       | Link     |
IRT-FB
Trained KGC-Model: Link
| Version | Text   | Mapper | Contexts | Download |
|---------|--------|--------|----------|----------|
| IRT-FB  | masked | single | 1        | Link     |
| IRT-FB  | masked | multi  | 30       | Link     |
| IRT-FB  | masked | single | 30       | Link     |
If this is useful to you, please consider a citation:
@inproceedings{hamann2021open,
title={Open-World Knowledge Graph Completion Benchmarks for Knowledge Discovery},
author={Hamann, Felix and Ulges, Adrian and Krechel, Dirk and Bergmann, Ralph},
booktitle={International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems},
pages={252--264},
year={2021},
organization={Springer}
}