Muesli (LunarLander-v2)

Introduction

This is a simple implementation of the Muesli algorithm. Muesli matches MuZero's performance and uses the same network architecture, but it is trained without MCTS lookahead search, using only a one-step lookahead. This significantly reduces computational cost compared to MuZero.

Paper : Muesli: Combining Improvements in Policy Optimization, Hessel et al., 2021 (v2 version)
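
To make the one-step lookahead concrete, the sketch below estimates a value for every action with a single model step, q(s, a) = r + discount * v(s'). It is a minimal illustration and not this repository's code: `one_step_action_values`, `toy_model_step`, and `toy_value` are hypothetical names, and the toy functions only stand in for the learned reward, dynamics, and value networks.

```python
from typing import Callable, List, Tuple

State = Tuple[float, ...]


def one_step_action_values(
    state: State,
    num_actions: int,
    model_step: Callable[[State, int], Tuple[float, State]],
    value_fn: Callable[[State], float],
    discount: float = 0.99,
) -> List[float]:
    """Estimate q(s, a) for every action using exactly one model step."""
    q_values = []
    for action in range(num_actions):
        # One call to the model per action; no tree search is involved.
        reward, next_state = model_step(state, action)
        q_values.append(reward + discount * value_fn(next_state))
    return q_values


def toy_model_step(state: State, action: int) -> Tuple[float, State]:
    """Hypothetical model: only action 1 is rewarded; the state shifts by the action."""
    reward = 1.0 if action == 1 else 0.0
    next_state = tuple(x + action for x in state)
    return reward, next_state


def toy_value(state: State) -> float:
    """Hypothetical value function: a scaled sum of the state entries."""
    return 0.1 * sum(state)


q = one_step_action_values((0.0, 1.0), 3, toy_model_step, toy_value, discount=0.5)
best_action = max(range(len(q)), key=q.__getitem__)
```

The cost grows linearly with the number of actions, whereas a tree search has to call the model many times per decision.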

Objective

This repository is being developed as part of collaborative research with UdeM. Thanks for making this a great experience; I hope this work will be useful for further progress. This codebase needs the help of many talented contributors. Please feel free to contribute and get in touch!

The goal is to build a distributed Muesli algorithm for large-scale training that can be integrated with the following projects:

https://github.com/AGI-Collective/mini_ada

https://github.com/AGI-Collective/u3

https://github.com/Farama-Foundation/Minigrid

We are also considering https://github.com/kakaobrain/brain-agent for distributed reinforcement learning.

How to use

Installation

  1. Install Docker
  2. Download the Dockerfile
  3. Build the Docker image:
     docker build --build-arg git_config_name="your_git_name" --build-arg git_config_email="your_git_email" --build-arg CACHEBUST=$(date +%s) -t muesli_image .
  4. Run the Docker image (adjust the options for your device configuration):
     docker run --gpus all -p 8888:8888 -p 8080:8080 -p 6006:6006 -p 6007:6007 -p 6008:6008 --name mu --rm -it muesli_image
  5. Copy the JupyterLab token. (To leave the container running in the background, press Ctrl + P, then Ctrl + Q.)
  6. Log in to JupyterLab through a browser or JupyterLab Desktop at http://your_local_or_server_ip:8888 using the token
  7. Launch the HPO experiment with NNI (in the JupyterLab terminal):
     nnictl create -f --config config.yml
  8. Access NNI through a browser at http://your_local_or_server_ip:8080
  9. Launch TensorBoard in a second JupyterLab bash terminal to see every experiment's TensorBoard logs on one page:
     tensorboard --logdir ./nni-experiments/_latest/trials --bind_all
  10. Access TensorBoard through a browser at http://your_local_or_server_ip:6006

Develop with JupyterLab

  • jupyterlab-git and jupyter-collaboration are installed.
  • The code is cloned into the container at build time and is removed when the container is closed.
  • To use a bash shell in JupyterLab, type ‘bash’ in the default terminal and press Enter.

Monitor experiment progress in MS NNI

  • You can see experiments on the ‘Trials detail’ tab and view their hyperparameters with the Add/Remove columns button.
  • NOTE: the hyperparameters displayed on the NNI page do not match the actual experiments, so the TensorBoard HPARAMS tab is recommended instead.
  • (The NNI log_dir was changed to fix an issue with launching TensorBoard.)

TensorBoard

  • Launch TensorBoard through MS NNI: tick the checkbox to the left of the trial number, then click the TensorBoard button.
  • Or run tensorboard --logdir ./nni-experiments/_latest/trials --bind_all to check every experiment.
  • About TensorBoard image slider precision
    • TensorBoard uses reservoir sampling, so some images in an episode may be skipped. To step through the rendered images more precisely, launch TensorBoard manually from the directory nni-experiments/_latest/trials/your_trial_ID/output/tensorboard (the path is shown in the terminal output) with the command tensorboard --logdir . --samples_per_plugin images=100 --bind_all
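
The effect of reservoir sampling on the image slider can be illustrated with the textbook algorithm below. This is a sketch under assumptions, not TensorBoard's own implementation: `reservoir_sample` is a hypothetical helper, and the capacities 10 and 100 are chosen only to contrast a small reservoir with one as large as the episode.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], capacity: int, seed: int = 0) -> List[T]:
    """Keep a uniform random sample of at most `capacity` items from a stream."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for index, item in enumerate(stream):
        if index < capacity:
            # The reservoir is not full yet, so every item is kept.
            reservoir.append(item)
        else:
            # Keep the new item with probability capacity / (index + 1)
            # by overwriting a uniformly chosen slot.
            slot = rng.randint(0, index)  # both bounds are inclusive
            if slot < capacity:
                reservoir[slot] = item
    return reservoir


frames = list(range(100))  # stand-in for 100 rendered frames of one episode
kept_small = reservoir_sample(frames, capacity=10)
kept_large = reservoir_sample(frames, capacity=100)
```

With a capacity of 10, only 10 of the 100 frames survive, which is why the slider appears to skip frames. With a capacity at least as large as the episode, every frame is kept.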

Debug with pdb

  • python -m pdb Muesli_code.py --debug

Wiki page

Previous README.md

You can run this code via the Colab demo link to train the agent, monitor it with TensorBoard, and play the LunarLander-v2 environment with the trained network. The agent can solve LunarLander-v2 within 1 to 2 hours on the Google Colab CPU backend, reaching an average score above 250.

Implemented

  • MuZero network
  • 5 step unroll
  • L_pg+cmpo
  • L_v
  • L_r
  • L_m (5 step)
  • Stacking 8 observations
  • Mini-batch update
  • Hidden state scaled within [-1,1]
  • Gradient clipping by value [-1,1]
  • Dynamics network gradient scale 1/2
  • Target network (prior parameters) moving average update
  • Categorical representation (value, reward model)
  • Normalized advantage
  • Tensorboard monitoring
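
A few of the items above can be sketched in plain Python. This is an illustration under assumptions and not the repository's code: all function and class names are hypothetical, min-max scaling is assumed for the hidden state, a two-hot encoding is assumed for the categorical representation, and the decay value, the support, and the padding of the first observation are arbitrary choices.

```python
from collections import deque
from typing import Deque, List, Sequence


def scale_hidden_state(hidden: Sequence[float]) -> List[float]:
    """Min-max scale a hidden state vector into [-1, 1] (assumed scaling scheme)."""
    low, high = min(hidden), max(hidden)
    if high == low:
        return [0.0 for _ in hidden]
    return [2.0 * (h - low) / (high - low) - 1.0 for h in hidden]


def clip_by_value(grads: Sequence[float], limit: float = 1.0) -> List[float]:
    """Clip every gradient entry into [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


def ema_update(target: Sequence[float], online: Sequence[float], decay: float = 0.9) -> List[float]:
    """Move the target parameters towards the online parameters (decay is an assumption)."""
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]


def two_hot(value: float, support: Sequence[float]) -> List[float]:
    """Spread a scalar over its two neighbouring support points (assumed encoding)."""
    value = max(support[0], min(support[-1], value))
    probs = [0.0] * len(support)
    for i in range(len(support) - 1):
        lo, hi = support[i], support[i + 1]
        if lo <= value <= hi:
            weight_hi = (value - lo) / (hi - lo)
            probs[i] = 1.0 - weight_hi
            probs[i + 1] = weight_hi
            break
    return probs


class ObservationStack:
    """Concatenate the most recent `size` observations into one flat input."""

    def __init__(self, size: int = 8) -> None:
        self.size = size
        self.frames: Deque[List[float]] = deque(maxlen=size)

    def push(self, obs: Sequence[float]) -> List[float]:
        if not self.frames:
            # Assumption: pad the stack with copies of the first observation.
            self.frames.extend([list(obs)] * self.size)
        else:
            self.frames.append(list(obs))
        return [x for frame in self.frames for x in frame]


stack = ObservationStack(size=8)
first_input = stack.push([0.0] * 8)   # a LunarLander-v2 observation has 8 entries
second_input = stack.push([1.0] * 8)
```

The scalar can be recovered from a two-hot vector as the probability-weighted sum of the support points, which is what makes the encoding usable as a training target for the value and reward heads.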

Todo

  • Retrace estimator
  • CNN representation network
  • LSTM dynamics network
  • Atari environment

Differences from paper

  • Self-play uses the agent network (the paper uses the target network)

Self-play

Flow of self-play: [Figure: selfplay3]

Unroll structure

Target network 1-step unroll: used when calculating v_pi_prior(s) and the second term of L_pg+cmpo.

Agent network 5-step unroll: the agent network is unrolled for 5 steps and optimized.

Target network 1-step unrolls for L_m: used when calculating pi_cmpo for L_m.

[Figure: Unroll]
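
The pi_cmpo target mentioned above can be sketched as follows, using the form given in the Muesli paper: pi_cmpo(a|s) is proportional to pi_prior(a|s) * exp(clip(adv(s, a), -c, c)). This is a minimal illustration and not the repository's code: `cmpo_policy` and `normalize_advantages` are hypothetical names, and the clip value of 1, the variance of 1.0, and the small `eps` offset are assumptions made for the example.

```python
import math
from typing import List, Sequence


def normalize_advantages(advantages: Sequence[float], variance: float, eps: float = 1e-12) -> List[float]:
    """Divide advantages by the square root of a variance estimate (eps is an assumption)."""
    scale = math.sqrt(variance + eps)
    return [a / scale for a in advantages]


def cmpo_policy(prior_probs: Sequence[float], advantages: Sequence[float], clip: float = 1.0) -> List[float]:
    """Reweight the prior policy by exponentiated, clipped advantages and renormalize."""
    weights = [
        p * math.exp(max(-clip, min(clip, adv)))
        for p, adv in zip(prior_probs, advantages)
    ]
    total = sum(weights)
    return [w / total for w in weights]


prior = [0.25, 0.25, 0.25, 0.25]   # LunarLander-v2 has 4 discrete actions
raw_adv = [-3.0, 0.0, 0.5, 3.0]    # illustrative advantages from 1-step unrolls
pi_cmpo = cmpo_policy(prior, normalize_advantages(raw_adv, variance=1.0))
```

Because of the clipping, an action with a very large advantage gains at most a factor of e over an action with zero advantage, which keeps the target policy close to the prior.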

Results

Score graph: [Figure: score]

Loss graph: [Figure: loss]

LunarLander play length and last rewards: [Figure: lastframe_lastreward]

Variance variables of advantage normalization: [Figure: var]

Comment

We need your help! Contributions, advice, and questions are all welcome.

Contact : [email protected] (Available languages : English, Korean)

Links

Author's presentation : https://icml.cc/virtual/2021/poster/10769

LunarLander-v2 environment documentation : https://www.gymlibrary.dev/environments/box2d/lunar_lander/

Colab demo link (main branch)

Colab demo link (develop branch)