Some update to tr10 config #20

Open
wants to merge 5 commits into base: master
Changes from 4 commits
11 changes: 6 additions & 5 deletions train/tr10-13B-ml/tr10-13B.slurm
@@ -46,7 +46,7 @@ GLOBAL_BATCH_SIZE=2048

NLAYERS=40
NHIDDEN=5120
NHEADS=32
Member Author:
I don't know why we chose 32. We seem to have updated the NHIDDEN value to 5120 because it is divisible by 128, and 5120 // 128 = 40.

https://huggingface.slack.com/archives/C01NHER1JLS/p1627034738272600?thread_ts=1626827659.189400&cid=C01NHER1JLS

cc @VictorSanh @stas00 @mryab (People who were involved in the original post)

Contributor:
FWIW, 530B training used:

NLAYERS=105
NHIDDEN=20480
NHEADS=128

So that's the same proportion as NHEADS=32 with NHIDDEN=5120 (a head dimension of 160 in both cases).

@stas00 (Contributor), Nov 23, 2021:
Also, @TevenLeScao shared elsewhere a research paper showing that many heads were found to be quite redundant anyway.

I'm not sure whether there is research comparing performance as a function of head size vs. number of heads.
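As a quick aside, a minimal shell sketch of the head-dimension arithmetic behind this thread (the values are taken from the comments above; the slurm script itself does not compute this):

# attention head dimension = NHIDDEN / NHEADS
echo $((5120 / 32))    # 160 -> the current proportion (NHEADS=32)
echo $((20480 / 128))  # 160 -> the 530B config quoted above, same head dimension
echo $((5120 / 40))    # 128 -> this PR's proposal (NHEADS=40)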

NHEADS=40
SEQ_LEN=2048
VOCAB_SIZE=150000

@@ -57,13 +57,14 @@ OPTIMIZER_ARGS=" \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--adam-eps 1e-8 \
--lr 6e-5 \
--lr 1e-4 \
Member Author:
The GPT-3 paper suggests a higher learning rate. Is there a reason why we would use 6e-5?
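For reference, a hedged sketch of the corresponding GPT-3 13B values as recalled from the paper (an assumption to double-check, not something taken from this repo): peak LR 1.0e-4 with cosine decay down to 10% of peak. Written in the style of the flags above, that would be roughly:

--lr 1e-4       # peak LR, matching the value proposed in this diff
--min-lr 1e-5   # 10% of peak; the script just below keeps 6e-6, i.e. 10% of the old 6e-5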

--min-lr 6e-6 \
--lr-decay-style cosine \
--lr-decay-samples 126_953_125 \
Contributor:
You removed this one without any commentary?

Contributor:
The original tr1-13B said:

We need lr-decay in samples, so tokens2samples = 260B / 2048 = 126_953_125

Member Author:
I was looking at setting it by default to the entire number of samples we have:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/fd1e1da967c74e598acfc011031474663ef5845e/megatron/training.py#L341

We have been using this in arch/scaling.

However, I've just re-read the GPT-3 paper and they do it for 260B tokens ... so I'm not sure here. cc @TevenLeScao

Contributor:
Thank you for the note, Thomas - it's crucial that we leave a note trail, otherwise we have no idea why some config was added or removed.
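A minimal shell sketch of the sample-count arithmetic behind this thread, assuming SEQ_LEN=2048 (illustration only, not part of the slurm script):

echo $((260000000000 / 2048))       # 126953125 -> the tr1-13B --lr-decay-samples quoted above
echo $((300000000000 / 2048 + 1))   # 146484376 -> the --train-samples value computed later in this diff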

--lr-warmup-samples 216_320 \
--clip-grad 1.0 \
--weight-decay 1e-1 \
--hidden-dropout 0.0 \
--attention-dropout 0.0 \
Comment on lines +69 to +70
Member Author:
https://arxiv.org/abs/2010.11934 showed a strong performance loss when using dropout (Table 4). Though that was an enc/dec architecture, there's probably no reason dropout would benefit our dec-only arch. We are currently evaluating this at the 1B3 scale: https://huggingface.co/bigscience/tr3o-1B3-pile-no-dropout-logs

"

EXIT_OPTS=" \
@@ -80,7 +81,7 @@ GPT_ARGS=" \
--micro-batch-size $MICRO_BATCH_SIZE \
--rampup-batch-size 16 16 6_000_000 \
--global-batch-size $GLOBAL_BATCH_SIZE \
--train-samples 300_000_000 \
--train-samples $((300_000_000_000 / $SEQ_LEN + 1)) \
stas00 marked this conversation as resolved.
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path $TOKENIZER_NAME \
--loss-scale 12 \
@@ -165,7 +166,7 @@ export CMD=" \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--data-impl mmap \
--split 900,100,0 \
--split 950,50,0 \
Contributor:
currently using a small dataset, so I had to give valid a larger chunk. But for the real training this needs to be restored to the above split.
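For clarity, a small sketch of how the two --split settings divide the data, assuming Megatron normalizes the comma-separated values as relative train/valid/test weights (illustration only, not part of the script):

# 900,100,0 -> 90% train / 10% valid / 0% test
# 950,50,0  -> 95% train /  5% valid / 0% test
echo $((100 * 50 / (950 + 50 + 0)))   # 5 -> percent of documents used for validation under the new split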

--distributed-backend nccl \
$DEEPSPEED_ARGS \
"