# Model ARCHitecture experiments

- toc
{:toc}
Research conducted under Prof. Kurt Keutzer at Berkeley Artificial Intelligence Research (BAIR).
```bash
git clone https://github.com/bri25yu/march
cd march
conda env create --file environment.yml
conda activate march
deepspeed run.py
```
All of the following experiments use a constant data budget, number of model parameters, and amount of compute unless noted otherwise. The data budget is determined by the number of steps taken and the number of tokens per step, for a total number of tokens seen over training. The number of model parameters is the total count of trainable parameters in a model prior to training. Compute is approximated by wall-clock run time. All experiments are run on a single node of 8 NVIDIA A5000 GPUs.
We train models for 1000 steps, enough for the models to start learning and for their behavior and performance to become distinguishable from other models. Every step, the model sees 1M tokens, so every experiment sees 1000 steps * 1M tokens per step = 1B tokens. We use the Wikipedia dataset.
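As a rough illustration of this budget accounting (a minimal sketch: the split of the ~1M tokens per step into batch size and sequence length is assumed for illustration, and the parameter-counting helper is a generic PyTorch idiom rather than a function from this repo):

```python
import torch.nn as nn

def count_trainable_parameters(model: nn.Module) -> int:
    """Number of trainable parameters, counted before training starts."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Data budget: steps taken * tokens per step = total tokens seen over training.
num_steps = 1000
tokens_per_step = 1024 * 1024                # ~1M tokens, e.g. 1024 sequences of length 1024 (illustrative split)
total_tokens = num_steps * tokens_per_step   # ~1B tokens
```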
The baseline model has 220M parameters to match T5-Base, and by default every subsequent model matches this budget. Specifically, the baseline model has an encoder-decoder architecture, absolute position embeddings for the position encoding, 12 layers each in the encoder and decoder (24 layers total), a model dimension of 768, a query-key-value dimension of 64 (equivalently, 12 attention heads), and a feedforward dimension of 768 * 4 = 3072.
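Summarized as a configuration object (a sketch only; the class and field names are illustrative and not the repo's actual config):

```python
from dataclasses import dataclass

@dataclass
class BaselineConfig:
    # Hypothetical config mirroring the ~220M-parameter baseline described above.
    num_encoder_layers: int = 12
    num_decoder_layers: int = 12   # 24 layers total
    dim_model: int = 768           # model (hidden) dimension
    dim_qkv: int = 64              # per-head query/key/value dimension
    num_heads: int = 12            # dim_model / dim_qkv = 768 / 64
    dim_feedforward: int = 3072    # 4 * dim_model
    position_encoding: str = "absolute"
```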
The models are optimized using AdamW, keeping 90% of the old gradient in the gradient exponential moving average (EMA) and 95% of the old Hessian approximation in the Hessian approximation EMA (equivalently, 10% new gradient and 5% new Hessian approximation, i.e. beta1 = 0.9 and beta2 = 0.95). We use a constant learning rate schedule with a learning rate of 1e-4.
The models are trained in BF16.
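In PyTorch terms, this corresponds to AdamW with betas of (0.9, 0.95), a fixed learning rate of 1e-4, and bfloat16 compute. Below is a minimal sketch of one training step; in this repo training is launched through DeepSpeed, so the actual wiring differs, and `model` and `batch` are assumed to come from the surrounding training loop:

```python
import torch

# beta1 = 0.9 keeps 90% of the old gradient EMA; beta2 = 0.95 keeps 95% of the
# old Hessian-approximation (squared-gradient) EMA. Constant LR of 1e-4, no decay.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.95))

# BF16 forward pass via autocast (illustrative; DeepSpeed configures BF16 itself).
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(**batch).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```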
We follow the scaling law fitting approach of Kaplan et al., 2020 (https://arxiv.org/pdf/2001.08361.pdf).
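Concretely, this means fitting a power law of the form L(x) = (x_c / x)^alpha to the observed losses, which is linear in log-log space. A minimal sketch with placeholder numbers (not our actual results):

```python
import numpy as np

# Placeholder data: x is e.g. tokens seen (or parameters, or compute), y the eval loss.
x = np.array([1e8, 2e8, 5e8, 1e9])
y = np.array([5.1, 4.6, 4.1, 3.8])

# L(x) = (x_c / x) ** alpha  =>  log L = alpha * log(x_c) - alpha * log(x)
slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
alpha = -slope
x_c = np.exp(intercept / alpha)
```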
<iframe style="width: 100%; height: 500px; resize: vertical; overflow: auto;" src="readme_resources/baseline_t5.pdf"></iframe>
We compare our reimplementation with the original T5 implementation of Raffel et al., Oct 2019.
This is a successful replication of the gated linear unit (GLU) feedforward variants of Shazeer, Feb 2020.
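For reference, these variants replace the first feedforward projection with an elementwise product of a linear "value" projection and an activated "gate" projection. A minimal GEGLU-style sketch in PyTorch (dimensions follow the baseline above; this is not necessarily the repo's exact module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFeedforward(nn.Module):
    """GEGLU-style feedforward: FFN(x) = (GELU(x W_gate) * (x W_up)) W_down."""

    def __init__(self, dim_model: int = 768, dim_feedforward: int = 3072) -> None:
        super().__init__()
        self.w_gate = nn.Linear(dim_model, dim_feedforward, bias=False)
        self.w_up = nn.Linear(dim_model, dim_feedforward, bias=False)
        self.w_down = nn.Linear(dim_feedforward, dim_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.gelu(self.w_gate(x)) * self.w_up(x))
```

To keep the parameter count matched despite the extra weight matrix, GLU variants typically shrink the feedforward dimension to roughly two thirds of the original.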
<iframe style="width: 100%; height: 500px; resize: vertical; overflow: auto;" src="readme_resources/gated_linear_units.pdf"></iframe>

To set up a separate conda environment for a different branch, first modify the `name` parameter in the `environment.yml` file, then:

```bash
git checkout /my/branch/path
conda env create --file environment.yml --prefix /path/to/new/conda
conda activate /path/to/new/conda
```