This document provides an overview of the garage
codebase for users interested in trying out the code or extending it for their own research. Specifically:
- To see a high-level overview of the code structure, see the Code Structure section.
- To see recommendations for adapting this repository to new environments and experiments, see the Extensions section.
- To see recommendations for hyperparameter tuning, see the Hyperparameter Search section.
The critical components of the codebase are discussed below, listed in the order we believe is most intuitive for first-time readers.
Note
The repository is structured such that all experiments can be run through `main.py` by merely changing the config files that are passed in.
This directory contains the main training loops for HyPE and HyPER in `model_free_irl.py` and `model_based_irl.py`, respectively. However, both files are written to be as general as possible to any model-free or model-based inverse RL algorithm. In both files, we take inspiration from the tricks in FastIRL to stabilize training; namely, we add a gradient penalty to the discriminator, decay the learning rate, and use Optimistic Adam as our optimizer. For more information, please see here.
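For reference, the gradient penalty is typically computed on interpolations between expert and policy samples (WGAN-GP style). The sketch below illustrates that idea under this assumption; the function name and signature are illustrative and are not the exact helper in `nn_utils.py`.

```python
import torch

def gradient_penalty(discriminator, expert_batch, policy_batch, coef=10.0):
    """Illustrative WGAN-GP-style penalty; not the repository's exact API."""
    # Interpolate between expert and policy samples with random weights.
    alpha = torch.rand(expert_batch.size(0), 1, device=expert_batch.device)
    interp = (alpha * expert_batch + (1 - alpha) * policy_batch).requires_grad_(True)
    scores = discriminator(interp)
    # Gradient of the discriminator output with respect to the interpolations.
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=interp, create_graph=True)[0]
    # Penalize deviation of the gradient norm from 1.
    return coef * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
```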
We provide a detailed breakdown of each of the files in this directory.

- `common.py`: contains logging and path constants, as well as helper functions for seed setting, agent rollouts, and plotting.
- `ema_wrapper.py`: adapted from ema-pytorch, a wrapper around any `nn.Module` to allow for inference with the exponential moving average of the model's weights.
- `fetch_demos.py`: fetch demonstrations from `experts/<env_name>`.
- `gym_wrappers.py`: adapted from FastIRL, containing all `gym` wrappers used in experiments. Please reference the section here for more information regarding our `ResetWrapper`.
- `logger.py`: for logging training results to the terminal and `garage/experiment_logs`.
- `nn_utils.py`: gradient penalties, learning rate decay, etc.
- `oadam.py`: Optimistic Adam, extended from optimistic-adam.
- `replay_buffer.py`: for replay buffers that can sample from both policy and expert data (a minimal sketch of this mixing follows the list).
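As referenced in the `replay_buffer.py` entry above, here is a hedged sketch of how sampling might mix policy and expert data; the class and method names are illustrative, not the repository's actual API.

```python
import numpy as np

class MixedReplayBuffer:
    """Illustrative buffer mixing policy and expert transitions; not the actual class."""

    def __init__(self, policy_transitions, expert_transitions):
        self.policy_transitions = policy_transitions  # e.g. list of (s, a, r, s', done)
        self.expert_transitions = expert_transitions

    def sample(self, batch_size, expert_frac=0.5):
        # Draw a fraction of the batch from expert demonstrations and the
        # remainder from on-policy experience.
        n_expert = int(batch_size * expert_frac)
        n_policy = batch_size - n_expert
        expert_idx = np.random.randint(len(self.expert_transitions), size=n_expert)
        policy_idx = np.random.randint(len(self.policy_transitions), size=n_policy)
        return ([self.expert_transitions[i] for i in expert_idx]
                + [self.policy_transitions[i] for i in policy_idx])
```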
`model_based_irl.py` builds off the `mbpo.py` file in mbrl-lib, and thus calls on helper functions stored in `garage/mbrl`. There are two main directories of interest: `garage/mbrl/util` and `garage/mbrl/third_party`. General utility functions, such as initializing and training the model and filling the replay buffer, can be found in `garage/mbrl/util/common.py`. We use the same SAC optimizer as mbrl-lib, which is different from the StableBaselines3 optimizer used in the model-free experiments; its implementation can be found in `garage/mbrl/third_party/pytorch_sac_pranz24`.
The three model architectures in this directory are as follows:

- `discrminator.py`: implements both a single discriminator and an ensemble of discriminators (a rough ensemble sketch follows this list).
- `sac.py`: used in all MuJoCo experiments.
- `td3_bc.py`: used in all `antmaze` experiments.
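As a rough illustration only (the actual architecture, names, and forward signature in the repository may differ), an ensemble of MLP discriminators over state-action pairs could look like the following.

```python
import torch
import torch.nn as nn

class DiscriminatorEnsemble(nn.Module):
    """Illustrative ensemble of MLP discriminators; not the repository's exact class."""

    def __init__(self, obs_dim, act_dim, ensemble_size=5, hidden=256):
        super().__init__()
        self.members = nn.ModuleList([
            nn.Sequential(
                nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(ensemble_size)
        ])

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        # Stack per-member scores and average them; with ensemble_size=1 this
        # reduces to a single discriminator.
        scores = torch.stack([member(x) for member in self.members], dim=0)
        return scores.mean(dim=0)
```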
We detail how and where one can make modifications to this codebase. In general, for all new experiments in simulation, we recommend starting first with `model_free_irl.py`, since it is quicker to train, and then using those hyperparameters as a starting point for `model_based_irl.py`.
In practice, we find that certain actor networks work better than others for certain environments (e.g., TD3-BC was more performant than SAC on `antmaze` environments for our set of hyperparameters). To experiment with new models or to modify existing ones, simply add a new file under `garage/models` and update the config files under `garage/config` with a set of instantiation keywords accordingly.
Note
To ensure compatibility with existing function calls, it is important to add functions for `reset()`, `act()`, `predict()`, and `learn()`. An example can be seen in `garage/models/td3_bc.py`.
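As a hedged starting point, a new model file could expose that interface roughly as follows; the constructor arguments and method signatures here are assumptions, so consult `garage/models/td3_bc.py` for the authoritative pattern.

```python
class MyCustomActor:
    """Hypothetical skeleton for a new model under garage/models."""

    def __init__(self, env, **kwargs):
        self.env = env

    def reset(self):
        # Clear any per-episode state (e.g. recurrent hidden states).
        pass

    def act(self, obs, deterministic=False):
        # Return an action for a single observation; random placeholder here.
        return self.env.action_space.sample()

    def predict(self, obs, deterministic=False):
        # SB3-style convention: return (action, state).
        return self.act(obs, deterministic=deterministic), None

    def learn(self, total_timesteps, **kwargs):
        # Run this actor's training loop for the given number of steps.
        pass
```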
To train and collect demonstrations for new environments, please see the details outlined on the main page, linked here.
There are a number of parameters we found to be particularly important in stabilizing HyPE and HyPER performance across various environments, which we highlight below using Hydra command-line syntax:
For all experiments (a small sweep sketch follows this list):

- `overrides.discriminator.lr`: initial learning rate of the discriminator. We recommend searching over `[1e-3, 8e-3, 1e-4, 8e-4]` as an initial starting point.
- `overrides.discriminator.train_every`: after how many actor steps to update the discriminator. We recommend searching over `[2000, 5000, 10000]` as an initial starting point.
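For example, such a search could be scripted by invoking `main.py` with Hydra-style overrides, as in the sketch below; the exact entry-point arguments and config names accepted by `main.py` are assumptions, so adapt them to your files under `garage/config`.

```python
import itertools
import subprocess

# Hypothetical grid search over the two discriminator parameters above,
# launched via Hydra-style command-line overrides to main.py.
lrs = ["1e-3", "8e-3", "1e-4", "8e-4"]
train_everys = ["2000", "5000", "10000"]

for lr, train_every in itertools.product(lrs, train_everys):
    subprocess.run(
        [
            "python", "main.py",
            f"overrides.discriminator.lr={lr}",
            f"overrides.discriminator.train_every={train_every}",
        ],
        check=True,  # stop the sweep if a run fails
    )
```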
For model-free experiments specifically:

- `overrides.sampling_schedule`: what percentage of expert samples to use in the shared buffer update. We recommend experimenting with both a constant and a decaying percentage.
For model-based experiments specifically:

- `overrides.model_hid_size`: the size of the model. For some environments, such as `Humanoid`, we find that a larger model size is important for quicker convergence.
- `overrides.policy_updates_every_steps`: how many policy updates to perform per model step. In practice, we find a number between 2 and 5 to work best for our environments.
- `overrides.freq_train_model`: how frequently to update the model. Especially in the case where the model is pretrained (as in `antmaze`), this value can be rather large (~1k).
- `overrides.ema_agent`: whether to use the EMA of the policy weights during inference.
- `overrides.schedule_model_lr`: whether to decay the learning rate of the model.
- `overrides.schedule_actor_lr`: whether to decay the learning rate of the actor.
- `overrides.sac_automatic_entropy_tuning`: we find automatic entropy tuning to be helpful for some environments and detrimental to others.
- `overrides.decay_horizon`: the decay horizon of the actor's learning rate. For some environments, this number should match `overrides.discriminator.train_every`, while others benefited from a longer horizon.
- `overrides.model_clip_output`: whether to clip the output of the model.
- `overrides.discriminator.clip_output`: whether to clip the output of the discriminator.
- `overrides.discriminator.weight_decay`: regularization on the weights of the learner.
- `overrides.discriminator.ensemble_size`: while some environments benefited from an ensemble of discriminators, others (such as `antmaze` environments) did better without.
We found that exactly replaying expert actions after `env.set_state(qpos, qvel)` leads to compounding divergence in the states. This is likely due to the warmstart acceleration in MuJoCo. One way to fix this is by creating a copy of the environment XMLs and adding the following:
```xml
<option timestep=".005">
  <flag warmstart="disable"/>
</option>
```
Then, create the environment via `env = gym.make("EnvName-v3", xml_file="path/to/modified/xml")`.
However, we found that making this change to the XML led to worse-performing experts when using the default SB3 hyperparameters from RLZoo. Therefore, we elected to perform resets to the t-th timestep of a trajectory by deterministically resetting to the start state via `env.reset(seed=seed)`, then rolling out the first t-1 actions in the expert demonstration. This implementation can be found here.
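For illustration, that reset strategy could be sketched as follows, assuming the newer gym/gymnasium `reset`/`step` return values (adjust the unpacking for older gym versions); the repository's actual `ResetWrapper` may differ.

```python
def reset_to_timestep(env, expert_actions, t, seed):
    """Illustrative sketch: deterministically reset, then replay the first t-1
    expert actions to reach the t-th timestep of the demonstration."""
    obs, _ = env.reset(seed=seed)            # deterministic start state
    for action in expert_actions[: t - 1]:   # roll forward to timestep t
        obs, _, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            break                            # episode ended early; stop replaying
    return obs
```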