From 1745edea4a79ed6c0b3e6f84057ee5fa78ac4331 Mon Sep 17 00:00:00 2001
From: Steven L <147808067+stv-lin@users.noreply.github.com>
Date: Fri, 13 Oct 2023 21:29:36 -0700
Subject: [PATCH 1/2] Minor updates for the GPT-3 section in
 large-pretraining-transformers.md

To include key GPT-3 highlights and comparisons.
---
 .../large-pretraining-transformers.md | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/chapter_attention-mechanisms-and-transformers/large-pretraining-transformers.md b/chapter_attention-mechanisms-and-transformers/large-pretraining-transformers.md
index 2c4e05f29c..6aadd67a18 100644
--- a/chapter_attention-mechanisms-and-transformers/large-pretraining-transformers.md
+++ b/chapter_attention-mechanisms-and-transformers/large-pretraining-transformers.md
@@ -322,15 +322,21 @@ when there is no, one, and a few task-specific input--output examples (:numref:`
 These three settings were tested in GPT-3 :cite:`brown2020language`,
 whose largest version uses data and model size
-about two orders of magnitude larger than those in GPT-2.
+about two orders of magnitude larger than those in GPT-2 and is pretrained with 300 billion tokens.
 GPT-3 uses the same Transformer decoder architecture
 as its direct predecessor GPT-2
 except that
 attention patterns (at the right in :numref:`fig_gpt-decoder-only`)
 are sparser at alternating layers.
-Pretrained with 300 billion tokens,
-GPT-3 performs better with larger model size,
-where few-shot performance increases most rapidly (:numref:`fig_gpt3-xshot-scaling`).
+Across all 42 accuracy-denominated benchmarks, GPT-3's performance increases steadily with model size,
+with its few-shot performance increasing most rapidly, demonstrating that larger models are
+more proficient at in-context learning (:numref:`fig_gpt3-xshot-scaling`).
+Also worth noting is how GPT-3 compares with fine-tuned models
+(e.g., a fine-tuned BERT-Large). On the SuperGLUE benchmark,
+with just one random example per task, few-shot GPT-3 achieves a score comparable to that of
+a BERT-Large model fine-tuned on the full 125K-example SuperGLUE training set.
+When the number of few-shot examples per task is scaled up (to 32 examples per context),
+GPT-3 requires fewer than eight examples per task to outperform the fine-tuned BERT-Large on the overall SuperGLUE score.
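+To make this setup concrete, below is a minimal sketch of how a few-shot
+prompt can be assembled purely as text, with no gradient update to the model.
+The sentiment demonstrations, the prompt format, and the `query_llm` function
+are hypothetical stand-ins for a real task and language model API.
+
+```python
+# Few-shot in-context learning: task demonstrations are concatenated into
+# the prompt and the pretrained model weights stay frozen.
+def build_few_shot_prompt(demonstrations, query):
+    """Join "question, answer" demonstrations, then append the new question."""
+    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in demonstrations]
+    blocks.append(f"Question: {query}\nAnswer:")  # The model completes this line
+    return "\n\n".join(blocks)
+
+demonstrations = [  # Hypothetical sentiment-classification demonstrations
+    ("Is 'A wonderful film' positive or negative?", "positive"),
+    ("Is 'A waste of two hours' positive or negative?", "negative"),
+]
+prompt = build_few_shot_prompt(
+    demonstrations, "Is 'Gripping from start to finish' positive or negative?")
+# completion = query_llm(prompt)  # Hypothetical API call; expected: "positive"
+```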
 
 The subsequent GPT-4 model did not fully disclose technical details in its report :cite:`openai2023gpt4`.
 By contrast with its predecessors, GPT-4

From 1bf11bcc59dd6c4412c68c7c8365bd3db6d8a0f5 Mon Sep 17 00:00:00 2001
From: Steven L <147808067+stv-lin@users.noreply.github.com>
Date: Fri, 13 Oct 2023 21:38:17 -0700
Subject: [PATCH 2/2] Major updates for the LLM chapter in
 large-pretraining-transformers.md

* To introduce a new chapter for prompting with a major revision of the LLM chapter
---
 .../large-pretraining-transformers.md | 55 ++++++++++++-------
 1 file changed, 36 insertions(+), 19 deletions(-)

diff --git a/chapter_attention-mechanisms-and-transformers/large-pretraining-transformers.md b/chapter_attention-mechanisms-and-transformers/large-pretraining-transformers.md
index 6aadd67a18..1b20de6645 100644
--- a/chapter_attention-mechanisms-and-transformers/large-pretraining-transformers.md
+++ b/chapter_attention-mechanisms-and-transformers/large-pretraining-transformers.md
@@ -402,8 +402,31 @@ the open-sourced Llama 1 :cite:`touvron2023llama` outperformed much larger model
-:citet:`wei2022emergent` discussed emergent abilities of large language models that are present in larger models, but not in smaller models.
-However, simply increasing model size does not inherently make models follow human instructions better.
+
+
+## Prompting
+
+Large language models offer an exciting prospect
+of formulating text input to induce models to perform desired tasks via in-context learning,
+which is also known as *prompting*.
+Besides few-shot in-context learning with the standard "question, answer" demonstrations
+mentioned in the previous sections,
+*chain-of-thought prompting* :cite:`wei2022chain`,
+an augmented in-context learning method
+with few-shot "question, intermediate reasoning steps, answer" demonstrations,
+elicits the complex reasoning capabilities of
+large language models
+in order to solve mathematical, commonsense, and symbolic reasoning tasks.
+Sampling multiple reasoning paths :cite:`wang2023self`, diversifying few-shot demonstrations :cite:`zhang2023automatic`,
+and reducing complex problems to sub-problems :cite:`zhou2023least`
+can all improve reasoning accuracy. In fact, with simple prompts like "Let's think step by step" just before each answer,
+large language models can even perform *zero-shot*
+chain-of-thought reasoning with decent accuracy :cite:`kojima2022large`.
+Even for multimodal inputs consisting of both text and images,
+language models can perform multimodal chain-of-thought reasoning with higher accuracy than using text input only :cite:`zhang2023multicot`.
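+The following sketch contrasts these prompt formats side by side;
+only the way the prompt text is assembled differs between the variants.
+The arithmetic demonstration and the `query_llm` and `parse_answer` functions
+are hypothetical stand-ins for a real language model API.
+
+```python
+question = "A juggler has 16 balls and half of them are golf balls. How many golf balls are there?"
+
+# Standard few-shot demonstration: "question, answer".
+standard_demo = "Q: Roger has 3 boxes with 4 pens each. How many pens?\nA: The answer is 12."
+# Chain-of-thought demonstration: "question, intermediate reasoning steps, answer".
+cot_demo = ("Q: Roger has 3 boxes with 4 pens each. How many pens?\n"
+            "A: There are 3 boxes and each box has 4 pens, so 3 * 4 = 12. The answer is 12.")
+
+few_shot_prompt = f"{standard_demo}\n\nQ: {question}\nA:"
+cot_prompt = f"{cot_demo}\n\nQ: {question}\nA:"
+# Zero-shot chain of thought: no demonstrations, only a trigger phrase.
+zero_shot_cot_prompt = f"Q: {question}\nA: Let's think step by step."
+
+# Self-consistency: sample several reasoning paths at a nonzero temperature
+# and take a majority vote over the extracted answers, e.g.:
+# from collections import Counter
+# answers = [parse_answer(query_llm(cot_prompt, temperature=0.7)) for _ in range(10)]
+# majority_answer = Counter(answers).most_common(1)[0][0]
+```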
+
+Another growing area of interest is enabling the model to perform unseen tasks
+simply by following the task instructions in the prompt.
 
 :citet:`wei2021finetuned,sanh2021multitask` have found
 that fine-tuning large language models
 on a range of datasets described via *instructions*
 can improve zero-shot performance on held-out tasks.
@@ -421,24 +444,18 @@ tasks zero-shot :cite:`qin2023chatgpt`.
 :citet:`bai2022constitutional` replaced human inputs (e.g., human-labeled data)
 with model outputs to partially automate the instruction tuning process,
 which is also known as *reinforcement learning from AI feedback*.
+:citet:`wei2022emergent` also discussed emergent abilities of large language models
+that are present in larger models, but not in smaller models,
+including few-shot prompting and other augmented prompting abilities
+such as multi-step reasoning, instruction following, and model calibration.
+Such behavior has been observed across a wide range of benchmark datasets and demonstrates the importance of scale for improving large language model performance.
+:citet:`wei2022emergent` also pointed out that for certain tasks
+there are natural intuitions for why emergence requires a model larger than a particular threshold scale
+(e.g., a multi-step reasoning task requiring X steps of sequential computation
+may only be solvable by models with at least a certain number of layers),
+while for other tasks the evaluation metric may disguise compounding incremental improvements as emergence
+(e.g., a metric that gives no credit to partial answers), as the sketch below illustrates.
 
-Large language models offer an exciting prospect
-of formulating text input to induce models to perform desired tasks via in-context learning,
-which is also known as *prompting*.
-Notably,
-*chain-of-thought prompting* :cite:`wei2022chain`,
-an in-context learning method
-with few-shot "question, intermediate reasoning steps, answer" demonstrations,
-elicits the complex reasoning capabilities of
-large language models
-in order to solve mathematical, commonsense, and symbolic reasoning tasks.
-Sampling multiple reasoning paths :cite:`wang2023self`, diversifying few-shot demonstrations :cite:`zhang2023automatic`,
-and reducing complex problems to sub-problems :cite:`zhou2023least`
-can all improve the reasoning accuracy. In fact, with simple prompts like "Let's think step by step" just before each answer,
-large language models can even perform *zero-shot*
-chain-of-thought reasoning with decent accuracy :cite:`kojima2022large`.
-Even for multimodal inputs consisting of both text and images,
-language models can perform multimodal chain-of-thought reasoning with higher accuracy than using text input only :cite:`zhang2023multicot`.
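+One way to build intuition for the metric point above is the following sketch:
+if a task requires several sequential steps and the metric only rewards a fully
+correct final answer, smooth per-step gains look like a sudden jump. The numbers
+below are purely illustrative, not measurements from any model.
+
+```python
+# Illustrative only: an all-or-nothing metric can make smooth progress look
+# emergent. Suppose a task requires 10 sequential reasoning steps and the
+# answer counts as correct only if every step is correct.
+num_steps = 10
+for per_step_accuracy in [0.5, 0.7, 0.9, 0.95, 0.99]:  # improves smoothly with scale
+    exact_match = per_step_accuracy ** num_steps  # no credit for partial answers
+    print(f"per-step accuracy {per_step_accuracy:.2f} -> exact match {exact_match:.3f}")
+# per-step accuracy 0.50 -> exact match 0.001
+# per-step accuracy 0.70 -> exact match 0.028
+# per-step accuracy 0.90 -> exact match 0.349
+# per-step accuracy 0.95 -> exact match 0.599
+# per-step accuracy 0.99 -> exact match 0.904
+```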