This repository is the implementation of the paper:
Deep Adaptive Multi-Intention Inverse Reinforcement Learning
Ariyan Bighashdel,
Panagiotis Meletis,
Pavol Jancura,
Gijs Dubbelman
Accepted for presentation at ECML PKDD 2021
The paper develops two algorithms, "SEM-MIIRL" and "MCEM-MIIRL", which can learn an a priori unknown number of nonlinear reward functions from unlabeled expert demonstrations. The algorithms are evaluated on two proposed environments, "M-ObjectWorld" and "M-BinaryWorld". Both the algorithms and the environments are implemented in this repository.
If you find this code useful in your research, please cite:
@inproceedings{bighashdel2021deep,
  title={Deep Adaptive Multi-Intention Inverse Reinforcement Learning},
  author={Bighashdel, Ariyan and Meletis, Panagiotis and Jancura, Pavol and Dubbelman, Gijs},
  booktitle={Lecture Notes in Computer Science (LNCS), Springer},
  year={2021}
}
The code was developed and tested on Ubuntu 18.04 with Python 3.6 and PyTorch 1.9.
You can install the dependencies by running:
pip install -r requirements.txt # Install dependencies
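After installation, an optional sanity check (not part of the repository) can confirm that the tested versions are available:

```python
# Optional sanity check: verify the tested versions (Python 3.6, PyTorch 1.9).
import sys
import torch

print(f"Python:  {sys.version.split()[0]}")     # tested with 3.6
print(f"PyTorch: {torch.__version__}")          # tested with 1.9
print(f"CUDA available: {torch.cuda.is_available()}")
```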
A simple experiment with the default set of parameters can be run with:
python3 main.py
The following parameters are defined in main.py and can be set for various experiments:
- miirl_type: the main algorithm, either 'SEM' (SEM-MIIRL) or 'MCEM' (MCEM-MIIRL)
- game_type: the environment, either 'ow' (M-ObjectWorld) or 'bw' (M-BinaryWorld)
- sample_length: the length of each demonstration sample
- alpha: the concentration parameter, which controls how readily a new intention is instantiated (see the sketch after this list)
- sample_size: the number of demonstrations for each reward/intention
- rewards_types: the intention/reward types; there are six in total: ['A','B','C','D','E','F']
- mirl_maxiter: the maximum number of iterations
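The role of the concentration parameter is easiest to see through a Chinese-restaurant-process style prior over intentions, where alpha trades off reusing an existing intention against opening a new one. The sketch below is a generic illustration of such a prior, not code from this repository; the actual algorithms additionally weigh how well each reward function explains a demonstration.

```python
import numpy as np

def crp_assignment_probs(cluster_counts, alpha):
    """Prior probability of assigning a new demonstration to each existing
    intention cluster, or to a brand-new one (last entry). Generic
    illustration only; not the repository's implementation."""
    counts = np.asarray(cluster_counts, dtype=float)
    weights = np.append(counts, alpha)   # existing clusters + new-cluster mass
    return weights / weights.sum()

# Example: 3 intentions currently holding 10, 5 and 1 demonstrations.
print(crp_assignment_probs([10, 5, 1], alpha=1.0))
# -> [0.588..., 0.294..., 0.058..., 0.058...]
# A larger alpha shifts probability mass toward opening a new intention.
```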
We conduct an experiment by setting the parameters as follows (a hypothetical transcription of this setup is shown after the list):
- miirl_type = 'SEM'
- game_type = 'ow'
- sample_length = 8
- alpha = 1
- sample_size = 16
- rewards_types = ['A','B']
- mirl_maxiter = 200
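Assuming main.py exposes these settings as plain Python variables (the exact names and mechanism should be checked against the file), the setup above corresponds to:

```python
# Hypothetical transcription of the experiment settings listed above;
# adjust to match how main.py actually defines its parameters.
miirl_type = 'SEM'              # SEM-MIIRL
game_type = 'ow'                # M-ObjectWorld
sample_length = 8               # length of each demonstration sample
alpha = 1                       # concentration parameter
sample_size = 16                # demonstrations per reward/intention
rewards_types = ['A', 'B']      # two of the six available intention types
mirl_maxiter = 200              # maximum number of iterations
```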
The following figure shows the true and predicted rewards: