Update and restructure some of the content for large-pretraining-transformers.md #2561

Open · wants to merge 2 commits into base: master
@@ -322,15 +322,21 @@ when there is no, one, and a few task-specific input--output examples (:numref:`

These three settings were tested in GPT-3 :cite:`brown2020language`,
whose largest version uses data and model size
about two orders of magnitude larger than those in GPT-2 and is pretrained on 300 billion tokens.
GPT-3 uses the same Transformer decoder architecture
as its direct predecessor GPT-2
except that attention patterns
(at the right in :numref:`fig_gpt-decoder-only`)
are sparser at alternating layers.
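
As a rough sketch of what sparser attention at alternating layers can look like, the code below alternates a dense causal mask with a locally banded one; the banded pattern, window size, and layer layout are simplified assumptions and not GPT-3's exact sparse attention configuration.

```python
import torch

def causal_mask(seq_len, window=None):
    """Boolean attention mask: True where attention is allowed.
    With window=None the mask is dense causal; otherwise each token
    may only attend to the previous `window` tokens (locally banded)."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    mask = j <= i                           # causality
    if window is not None:
        mask &= (i - j) < window            # locality
    return mask

# Alternate dense and locally banded sparse attention across layers.
seq_len, num_layers = 8, 4
masks = [causal_mask(seq_len, window=None if layer % 2 == 0 else 3)
         for layer in range(num_layers)]
```
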
Across all 42 accuracy-denominated benchmarks,
GPT-3's performance increases steadily with model size,
with few-shot performance increasing most rapidly,
demonstrating that larger models are more proficient at in-context learning (:numref:`fig_gpt3-xshot-scaling`).
Also worth noting is how GPT-3 compares with fine-tuned models (e.g., fine-tuned BERT-Large).
On the SuperGLUE benchmark, with just one random example per task,
few-shot GPT-3 achieves a score comparable to that of a BERT-Large model
fine-tuned on the full 125K-example SuperGLUE training set.
When the number of examples per task is scaled up to 30,
GPT-3 needs fewer than eight examples per task
to outperform fine-tuned BERT-Large on the overall SuperGLUE score.
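
Since none of these settings update the model parameters, they differ only in how the input text is assembled. Below is a minimal sketch for an assumed toy sentiment task; the `make_prompt` helper and its demonstrations are hypothetical, not drawn from the GPT-3 evaluation suite.

```python
def make_prompt(task_description, demonstrations, query):
    """Concatenate a task description, k demonstrations (k = 0, 1, or a few),
    and the final query into a single text prompt."""
    pieces = [task_description]
    for review, sentiment in demonstrations:
        pieces.append(f"Review: {review}\nSentiment: {sentiment}")
    pieces.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(pieces)

demonstrations = [
    ("I loved every minute of this movie!", "positive"),
    ("The plot made no sense at all.", "negative"),
]
query = "What a waste of two hours."

zero_shot = make_prompt("Classify the sentiment of the review.", [], query)
one_shot = make_prompt("Classify the sentiment of the review.", demonstrations[:1], query)
few_shot = make_prompt("Classify the sentiment of the review.", demonstrations, query)
# In all three settings the model weights stay frozen: any "learning"
# happens in the forward pass, conditioned on the prompt.
```
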

The report for the subsequent GPT-4 model did not fully disclose its technical details :cite:`openai2023gpt4`.
By contrast with its predecessors, GPT-4
@@ -396,8 +402,31 @@ the open-sourced Llama 1 :cite:`touvron2023llama` outperformed much larger model



:citet:`wei2022emergent` discussed emergent abilities of large language models: abilities that are present in larger models but absent in smaller ones.
However, simply increasing model size does not inherently make models follow human instructions better.



## Prompting

Large language models offer an exciting prospect
of formulating text input to induce models to perform desired tasks via in-context learning,
which is also known as *prompting*.
Besides the few-shot in-context learning with the standard "question, answer" demonstrations mentioned in the previous sections,
*chain-of-thought prompting* :cite:`wei2022chain`,
an augmented in-context learning method
with few-shot "question, intermediate reasoning steps, answer" demonstrations,
elicits the complex reasoning capabilities of
large language models
in order to solve mathematical, commonsense, and symbolic reasoning tasks.
Sampling multiple reasoning paths :cite:`wang2023self`, diversifying few-shot demonstrations :cite:`zhang2023automatic`,
and reducing complex problems to sub-problems :cite:`zhou2023least`
can all improve reasoning accuracy. In fact, with a simple prompt like "Let's think step by step" just before each answer,
large language models can even perform *zero-shot*
chain-of-thought reasoning with decent accuracy :cite:`kojima2022large`.
Even for multimodal inputs consisting of both text and images,
language models can perform multimodal chain-of-thought reasoning with higher accuracy than using text input only :cite:`zhang2023multicot`.
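
To make these ideas concrete, the sketch below assembles a standard few-shot prompt, a chain-of-thought prompt, and a zero-shot chain-of-thought prompt for a toy arithmetic question, and hints at self-consistency with a simple majority vote over sampled completions; the demonstrations and the `majority_vote` helper are illustrative assumptions rather than the exact prompts from these papers.

```python
from collections import Counter

question = ("A juggler has 16 balls. Half of the balls are golf balls, "
            "and half of the golf balls are blue. How many blue golf balls are there?")

# Standard demonstration: the question is followed directly by the answer.
standard_demo = ("Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls. "
                 "How many tennis balls does he have now?\nA: The answer is 11.")

# Chain-of-thought demonstration: intermediate reasoning steps come first.
cot_demo = ("Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls. "
            "How many tennis balls does he have now?\n"
            "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
            "5 + 6 = 11. The answer is 11.")

standard_prompt = f"{standard_demo}\n\nQ: {question}\nA:"
cot_prompt = f"{cot_demo}\n\nQ: {question}\nA:"
# Zero-shot chain of thought: no demonstrations, just a trigger phrase.
zero_shot_cot_prompt = f"Q: {question}\nA: Let's think step by step."

def majority_vote(completions):
    """Self-consistency: sample several reasoning paths (e.g., at temperature > 0)
    and return the most frequent final answer."""
    finals = [c.rsplit("The answer is", 1)[-1].strip(" .") for c in completions]
    return Counter(finals).most_common(1)[0][0]
```
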

Another growing area of interest is enabling models to perform unseen tasks
simply by following the task instructions in the prompt.
:citet:`wei2021finetuned,sanh2021multitask` have found that fine-tuning large language models
on a range of datasets described via *instructions*
can improve zero-shot performance on held-out tasks.
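
As a hypothetical illustration of what such instruction data can look like, the sketch below casts one natural language inference example into multiple natural-language templates; the templates are simplified stand-ins for the many per-task templates used in this line of work, not the originals.

```python
# One labeled example from a natural language inference task.
example = {
    "premise": "The dog is sleeping on the porch.",
    "hypothesis": "An animal is resting.",
    "label": "yes",
}

# Several natural-language instructions describing the same task.
templates = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer yes or no.",
    "{premise}\nBased on the sentence above, is it true that \"{hypothesis}\"? "
    "Answer yes or no.",
]

# Instruction tuning fine-tunes on (instruction-formatted input, target) pairs
# pooled over many tasks; held-out tasks are then described by new instructions.
training_pairs = [(t.format(**example), example["label"]) for t in templates]
```
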
@@ -415,24 +444,18 @@ tasks zero-shot :cite:`qin2023chatgpt`.
:citet:`bai2022constitutional` replaced human inputs (e.g., human-labeled data) with model outputs
to partially automate the instruction tuning process, which is also known as *reinforcement learning from AI feedback*.

:citet:`wei2022emergent` also discussed emergent abilities of large language models
that are present in larger models, but not in smaller models,
including few-shot prompting abilities and other augmented prompting abilities
such as multi-step reasoning, instruction following, and model calibration.
Such behavior is observed across a wide range of benchmark datasets
and shows the importance of scale for improving large language model performance.
:citet:`wei2022emergent` also pointed out that for certain tasks
there may be natural intuitions for why emergence requires a model larger than a particular threshold scale
(e.g., a multi-step reasoning task that requires X steps of sequential computation
may only be possible with a model depth of at least a certain number of layers),
while for other tasks the evaluation metric might disguise compounding incremental improvements as emergence
(e.g., a metric that does not give credit to partial answers).
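
The latter point admits a back-of-the-envelope illustration: if a task requires several sequential steps and the metric only rewards fully correct answers, smooth per-step improvement can appear as an abrupt jump. The per-step accuracies below are illustrative assumptions, not measured results.

```python
# An all-or-nothing metric can turn smooth per-step improvement into an
# apparent discontinuity. Illustrative numbers, not measured results.
num_steps = 8  # a task requiring 8 sequential reasoning steps
for p in [0.5, 0.7, 0.9, 0.95, 0.99]:  # per-step accuracy, improving smoothly
    exact_match = p ** num_steps       # credit only for fully correct answers
    print(f"per-step accuracy {p:.2f} -> exact-match accuracy {exact_match:.3f}")
```
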



