diff --git a/chapter_appendix-tools-for-deep-learning/aws.md b/chapter_appendix-tools-for-deep-learning/aws.md
index 218a4ba..64624c9 100644
--- a/chapter_appendix-tools-for-deep-learning/aws.md
+++ b/chapter_appendix-tools-for-deep-learning/aws.md
@@ -1,17 +1,17 @@
 # AWS EC2 インスタンスの使用
 :label:`sec_aws`
 
-このセクションでは、すべてのライブラリを未加工の Linux マシンにインストールする方法を説明します。:numref:`sec_sagemaker` では Amazon SageMaker の使用方法について説明しましたが、AWS では自分でインスタンスを構築するほうがコストが低くなります。このウォークスルーには、いくつかの手順が含まれます。 
+このセクションでは、すべてのライブラリを raw Linux マシンにインストールする方法を説明します。:numref:`sec_sagemaker` で Amazon SageMaker の使用方法について説明しましたが、AWS では自分でインスタンスを構築するほうがコストが安くなることを思い出してください。このチュートリアルには、次の 3 つの手順が含まれます。 
 
 1. AWS EC2 から GPU Linux インスタンスをリクエストします。
-1. 必要に応じて、CUDA をインストールするか、CUDA がプリインストールされた AMI を使用します。
-1. 対応する MXNet GPU バージョンを設定します。
+1. CUDA をインストールします (または CUDA がプリインストールされた Amazon マシンイメージを使用します)。
+1. 本のコードを実行するためのディープラーニングフレームワークとその他のライブラリをインストールします。
 
 このプロセスは、多少の変更はありますが、他のインスタンス (および他のクラウド) にも適用されます。先に進む前に、AWS アカウントを作成する必要があります。詳細については :numref:`sec_sagemaker` を参照してください。 
 
-## EC2 インスタンスを作成して実行する
+## EC2 インスタンスの作成と実行
 
-AWS アカウントにログインしたら、[EC2](:numref:`fig_aws` の赤いボックスでマーク) をクリックして [EC2] パネルに移動します。 
+AWS アカウントにログインした後、「EC2」(:numref:`fig_aws` の赤いボックスでマーク) をクリックして EC2 パネルに移動します。 
 
 ![Open the EC2 console.](../img/aws.png)
 :width:`400px`
@@ -23,23 +23,31 @@ AWS アカウントにログインしたら、[EC2](:numref:`fig_aws` の赤い
 :width:`700px`
 :label:`fig_ec2`
 
-### 場所の事前設定「Oregon」（:numref:`fig_ec2`の右上にある赤いボックスでマーク）など、レイテンシーを短縮するために近くのデータセンターを選択します。中国にお住まいの場合は、ソウルや東京など、近くのアジア太平洋地域を選択できます。データセンターによっては GPU インスタンスが存在しない場合があることに注意してください。 
+### ロケーションの事前設定
+
+レイテンシを減らすために、近くのデータセンターを選択します。例:「オレゴン」(:numref:`fig_ec2`の右上にある赤いボックスでマーク)。中国にお住まいの場合は、ソウルや東京など、近くのアジア太平洋地域を選択できます。一部のデータセンターには GPU インスタンスがない場合があることに注意してください。
 
-### 制限の引き上げインスタンスを選択する前に、:numref:`fig_ec2` のように、左側のバーの「Limits」ラベルをクリックして、数量制限があるかどうかを確認してください。:numref:`fig_limits` はそのような制限の例です。現在、このアカウントはリージョンごとに「p2.xlarge」インスタンスを開くことができません。1 つ以上のインスタンスを開く必要がある場合は、[制限の引き上げをリクエスト] リンクをクリックして、インスタンスクォータの引き上げを申請します。通常、申請の処理には 1 営業日かかります。 
+### 上限を増やす
+
+インスタンスを選択する前に、:numref:`fig_ec2`に示すように、左側のバーの「Limits」ラベルをクリックして、数量制限があるかどうかを確認してください。:numref:`fig_limits`はそのような制限の例を示しています。アカウントは現在、リージョンごとに「p2.xlarge」インスタンスを開くことができません。1 つ以上のインスタンスを開く必要がある場合は、[Request limit increase] リンクをクリックして、より高いインスタンスクォータを申請します。通常、申請の処理には1営業日かかります。 
 
 ![Instance quantity restrictions.](../img/limits.png)
 :width:`700px`
 :label:`fig_limits`
 
-### インスタンスの起動次に、:numref:`fig_ec2` の赤い枠で囲まれた「インスタンスの起動」ボタンをクリックしてインスタンスを起動します。 
+### インスタンスを起動する
+
+次に、:numref:`fig_ec2` の赤いボックスでマークされている [Launch Instance] ボタンをクリックして、インスタンスを起動します。 
 
-まず、適切な AMI (AWS マシンイメージ) を選択します。検索ボックスに「Ubuntu」と入力します（:numref:`fig_ubuntu` では赤いボックスでマークされています）。 
+まず、適切な Amazon マシンイメージ (AMI) を選択します。検索ボックスに「Ubuntu」と入力します（:numref:`fig_ubuntu`の赤いボックスでマークされています）。 
 
-![Choose an operating system.](../img/ubuntu-new.png)
+![Choose an AMI.](../img/ubuntu-new.png)
 :width:`700px`
 :label:`fig_ubuntu`
 
-EC2 にはさまざまなインスタンス設定が用意されており、その中から選択できます。これは初心者には圧倒されることがあります。適切なマシンの表は次のとおりです。 
+EC2 には、選択できるさまざまなインスタンス構成が用意されています。初心者は圧倒されてしまうかもしれません。:numref:`tab_ec2` に、適したマシンをいくつか示します。
+
+:さまざまな EC2 インスタンスタイプ 
 
 | Name | GPU         | Notes                         |
 |------|-------------|-------------------------------|
@@ -48,28 +54,29 @@ EC2 にはさまざまなインスタンス設定が用意されており、そ
 | g3   | Maxwell M60 | good trade-off                |
 | p3   | Volta V100  | high performance for FP16     |
 | g4   | Turing T4   | inference optimized FP16/INT8 |
+:label:`tab_ec2`
 
-上記のすべてのサーバーには、使用されている GPU の数を示す複数のフレーバーがあります。たとえば、p2.xlarge には 1 GPU があり、p2.16xlarge には 16 個の GPU とより多くのメモリがあります。詳細については、[AWS EC2 documentation](https732293614) を参照してください。 
-
-**注:** 適切なドライバーと GPU 対応バージョンの MXNet を備えた GPU 対応インスタンスを使用する必要があります。そうしないと、GPU を使用しても何のメリットも得られません。
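+これらのサーバーにはすべて、搭載されている GPU の数に応じた複数の種類があります。たとえば、p2.xlarge には GPU が 1 つ、p2.16xlarge には GPU が 16 個とより多くのメモリがあります。詳細については、[AWS EC2 documentation](https://aws.amazon.com/ec2/instance-types/) または [summary page](https://www.ec2instances.info) を参照してください。説明のためには、p2.xlarge で十分です (:numref:`fig_p2x` の赤いボックスでマーク)。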
 
 ![Choose an instance.](../img/p2x.png)
 :width:`700px`
 :label:`fig_p2x`
 
-ここまでは、:numref:`fig_disk` の冒頭に示したように、EC2 インスタンスを起動するための 7 つのステップのうち最初の 2 つは終了しました。この例では、手順「3.インスタンスの設定」、「5.タグを追加」、「6.セキュリティグループの設定」を参照してください。「4」をタップします。ストレージの追加」をクリックし、デフォルトのハードディスクサイズを 64 GB（:numref:`fig_disk` の赤いボックスでマーク）に増やします。CUDA自体はすでに4 GBを占有していることに注意してください。 
+適切なドライバーと GPU 対応のディープラーニングフレームワークを備えた GPU 対応インスタンスを使用する必要があります。そうしないと、GPU を使用しても何のメリットも得られません。 
+
+ここまでで、:numref:`fig_disk` の上部に示されているように、EC2 インスタンスを起動するための 7 つのステップのうち最初の 2 つを完了しました。この例では、ステップ「3. Configure Instance」、「5. Add Tags」、「6. Configure Security Group」はデフォルト設定のままにします。「4. Add Storage」をクリックし、デフォルトのハードディスクサイズを 64 GB に増やします (:numref:`fig_disk` の赤いボックスでマーク)。CUDA だけですでに 4 GB を占めることに注意してください。
 
-![Modify instance hard disk size.](../img/disk.png)
+![Modify the hard disk size.](../img/disk.png)
 :width:`700px`
 :label:`fig_disk`
 
-最後に、「7.」を確認し、「Launch」をクリックして、設定したインスタンスを起動します。インスタンスへのアクセスに使用するキーペアを選択するよう求めるプロンプトが表示されます。キーペアがない場合は、:numref:`fig_keypair` の最初のドロップダウンメニューで [Create a new key pair] を選択してキーペアを生成します。その後、このメニューで [既存のキーペアを選択] を選択し、以前に生成したキーペアを選択できます。[Launch Instances] をクリックして、作成したインスタンスを起動します。 
+最後に、「7. Review」に進み、「Launch」をクリックして、設定したインスタンスを起動します。すると、インスタンスへのアクセスに使用するキーペアを選択するよう求められます。キーペアがない場合は、:numref:`fig_keypair` の最初のドロップダウンメニューで「Create a new key pair」を選択してキーペアを生成します。その後は、このメニューで「Choose an existing key pair」を選択し、以前に生成したキーペアを選択できます。「Launch Instances」をクリックして、作成したインスタンスを起動します。
 
 ![Select a key pair.](../img/keypair.png)
 :width:`500px`
 :label:`fig_keypair`
 
-新しいキーペアを生成した場合は、必ずキーペアをダウンロードして安全な場所に保管してください。これがサーバーに SSH で接続する唯一の方法です。:numref:`fig_launching` に表示されているインスタンス ID をクリックして、このインスタンスのステータスを表示します。 
+新しいキーペアを生成した場合は、必ずキーペアをダウンロードして安全な場所に保存してください。これは、サーバーに SSH 接続する唯一の方法です。:numref:`fig_launching`に表示されているインスタンスIDをクリックして、このインスタンスのステータスを表示します。 
 
 ![Click the instance ID.](../img/launching.png)
 :width:`700px`
@@ -77,16 +84,15 @@ EC2 にはさまざまなインスタンス設定が用意されており、そ
 
 ### インスタンスに接続する
 
-:numref:`fig_connect` に示すように、インスタンスの状態が緑色に変わったら、インスタンスを右クリックして `Connect` を選択し、インスタンスのアクセス方法を表示します。 
+:numref:`fig_connect`に示すように、インスタンスの状態が緑色に変わったら、インスタンスを右クリックして `Connect` を選択し、インスタンスのアクセス方法を表示します。 
 
-![View instance access and startup method.](../img/connect.png)
+![View instance access method.](../img/connect.png)
 :width:`700px`
 :label:`fig_connect`
 
-これが新しいキーである場合は、SSH が機能するために公開されてはいけません。`D2L_key.pem` を保存するフォルダ (Downloads フォルダなど) に移動し、キーが一般公開されていないことを確認します。
+新しいキーの場合、SSH が機能するには、キーが他のユーザーから閲覧できない状態でなければなりません。`D2L_key.pem` を保存したフォルダに移動し、次のコマンドを実行してキーを非公開にします。
 
 ```bash
-cd /Downloads  ## if D2L_key.pem is stored in Downloads folder
 chmod 400 D2L_key.pem
 ```
 
@@ -94,13 +100,13 @@ chmod 400 D2L_key.pem
 :width:`400px`
 :label:`fig_chmod`
 
-ここで、:numref:`fig_chmod` の下の赤いボックスに ssh コマンドをコピーして、コマンドラインに貼り付けます。
+次に、:numref:`fig_chmod` の下側の赤いボックス内の ssh コマンドをコピーし、コマンドラインに貼り付けます。
 
 ```bash
 ssh -i "D2L_key.pem" ubuntu@ec2-xx-xxx-xxx-xxx.y.compute.amazonaws.com
 ```
 
-コマンドラインに「接続を続けますか (はい/いいえ)」というプロンプトが表示されたら、「yes」と入力し、Enter キーを押してインスタンスにログインします。 
+コマンドラインに「Are you sure you want to continue connecting (yes/no)」というプロンプトが表示されたら、「yes」と入力して Enter キーを押し、インスタンスにログインします。
 
 これでサーバーの準備が整いました。 
 
@@ -112,16 +118,16 @@ CUDA をインストールする前に、必ず最新のドライバーでイン
 sudo apt-get update && sudo apt-get install -y build-essential git libgfortran3
 ```
 
-ここでCUDA 10.1をダウンロードします。NVIDIA の [公式リポジトリ](https://developer.nvidia.com/cuda-downloads) to find the download link of CUDA 10.1 as shown in :numref:`fig_cuda`) にアクセスしてください。 
+ここでは CUDA 10.1 をダウンロードします。NVIDIA の [公式リポジトリ](https://developer.nvidia.com/cuda-toolkit-archive) にアクセスして、:numref:`fig_cuda` に示すようにダウンロードリンクを見つけてください。
 
 ![Find the CUDA 10.1 download address.](../img/cuda101.png)
 :width:`500px`
 :label:`fig_cuda`
 
-指示をコピーしてターミナルに貼り付け、CUDA 10.1 をインストールします。
+表示されたコマンドをコピーしてターミナルに貼り付け、CUDA 10.1 をインストールします。
 
 ```bash
-## Paste the copied link from CUDA website
+# The link and file name are subject to change
 wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
 sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
 wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
@@ -131,7 +137,7 @@ sudo apt-get update
 sudo apt-get -y install cuda
 ```
 
-プログラムをインストールしたら、次のコマンドを実行して GPU を表示します。
+プログラムのインストール後、次のコマンドを実行して GPU を表示します。
 
 ```bash
 nvidia-smi
@@ -143,98 +149,56 @@ nvidia-smi
 echo "export LD_LIBRARY_PATH=\${LD_LIBRARY_PATH}:/usr/local/cuda/lib64" >> ~/.bashrc
 ```
 
-## MXNet のインストールと D2L ノートブックのダウンロード
+## コードを実行するためのライブラリのインストール
 
-まず、インストールを簡略化するために、Linux 用 [Miniconda](https://conda.io/en/latest/miniconda.html) をインストールする必要があります。ダウンロードリンクとファイル名は変更される場合がありますので、Miniconda の Web サイトにアクセスし、:numref:`fig_miniconda` のように「リンクアドレスをコピー」をクリックしてください。 
+この本のコードを実行するには、EC2 インスタンスで Linux ユーザー向け :ref:`chap_installation` の手順を実行し、リモート Linux サーバーでの作業に関する次のヒントを使用します。 
 
-![Download Miniconda.](../img/miniconda.png)
-:width:`700px`
-:label:`fig_miniconda`
+* Minicondaのインストールページでbashスクリプトをダウンロードするには、ダウンロードリンクを右クリックして「リンクアドレスをコピー」を選択し、`wget [copied link address]`を実行します。
+* `~/miniconda3/bin/conda init` を実行した後は、現在のシェルを閉じて再度開く代わりに `source ~/.bashrc` を実行できます。
 
-```bash
-# The link and file name are subject to changes
-wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
-sh Miniconda3-latest-Linux-x86_64.sh -b
-```
+## Jupyter ノートブックをリモートで実行する
 
-Miniconda をインストールしたら、次のコマンドを実行して CUDA と conda をアクティベートします。
+Jupyter Notebook をリモートで実行するには、SSH ポート転送を使用する必要があります。結局のところ、クラウド内のサーバーにはモニターやキーボードがありません。そのためには、次のようにデスクトップ (またはラップトップ) からサーバーにログインします。
 
-```bash
-~/miniconda3/bin/conda init
-source ~/.bashrc
-```
-
-次に、この本のコードをダウンロードします。
-
-```bash
-sudo apt-get install unzip
-mkdir d2l-en && cd d2l-en
-curl https://d2l.ai/d2l-en.zip -o d2l-en.zip
-unzip d2l-en.zip && rm d2l-en.zip
-```
-
-次に、conda `d2l` 環境を作成し、`y` と入力してインストールを続行します。
-
-```bash
-conda create --name d2l -y
 ```
-
-`d2l` 環境を作成したら、その環境をアクティブ化して `pip` をインストールします。
-
-```bash
-conda activate d2l
-conda install python=3.7 pip -y
-```
-
-最後に、MXNet と `d2l` パッケージをインストールします。接尾辞 `cu101` は、これが CUDA 10.1 バリアントであることを意味します。CUDA 10.0 のみなど、バージョンが異なる場合は、代わりに `cu100` を選択します。
-
-```bash
-pip install mxnet-cu101==1.7.0
-pip install git+https://github.com/d2l-ai/d2l-en
-```
-
-次のように、すべてがうまくいったかどうかをすばやくテストできます。
-
-```
-$ python
->>> from mxnet import np, npx
->>> np.zeros((1024, 1024), ctx=npx.gpu())
+# This command must be run in the local command line
+ssh -i "/path/to/key.pem" ubuntu@ec2-xx-xxx-xxx-xxx.y.compute.amazonaws.com -L 8889:localhost:8888
 ```
 
-## Jupyter を実行中
-
-Jupyter をリモートで実行するには、SSH ポートフォワーディングを使用する必要があります。結局のところ、クラウド内のサーバーにはモニターやキーボードがありません。そのためには、デスクトップ (またはラップトップ) から以下のようにサーバーにログインします。
+次に、EC2 インスタンス上のこの本のダウンロード済みコードの場所に移動して、以下を実行します。
 
 ```
-# This command must be run in the local command line
-ssh -i "/path/to/key.pem" ubuntu@ec2-xx-xxx-xxx-xxx.y.compute.amazonaws.com -L 8889:localhost:8888
 conda activate d2l
 jupyter notebook
 ```
 
-:numref:`fig_jupyter` は、Jupyter ノートブックを実行した後の出力を示しています。最後の行はポート 8888 の URL です。 
+:numref:`fig_jupyter`は、Jupyter Notebookを実行した後の出力を示しています。最後の行はポート 8888 の URL です。 
 
-![Output after running Jupyter Notebook. The last row is the URL for port 8888.](../img/jupyter.png)
+![Output after running the Jupyter Notebook. The last row is the URL for port 8888.](../img/jupyter.png)
 :width:`700px`
 :label:`fig_jupyter`
 
-ポート 8889 へのポート転送を使用したため、ローカルブラウザで URL を開くときに、ポート番号を置き換えて、Jupyter から提供されたシークレットを使用する必要があります。 
+ポート 8889 へのポート転送を使用したので、:numref:`fig_jupyter` の赤いボックスの最後の行をコピーし、URL の「8888」を「8889」に置き換えて、ローカルブラウザで開きます。 
 
 ## 未使用のインスタンスを閉じる
 
-クラウドサービスは使用時間単位で課金されるため、使用されていないインスタンスを閉じる必要があります。代替手段があることに注意してください。インスタンスを「停止」すると、インスタンスを再起動できるようになります。これは、通常のサーバーの電源を切るようなものです。ただし、停止したインスタンスには、保持されたハードディスク容量に対して少額の請求が発生します。「Terminate」は、関連付けられているすべてのデータを削除します。これにはディスクも含まれるため、再度起動することはできません。将来必要ないことがわかっている場合にのみ、これを実行してください。 
+クラウドサービスは使用時間によって請求されるため、使用されていないインスタンスを閉じる必要があります。代替案があることに注意してください。 
+
+* インスタンスを「停止」すると、再び起動できるようになります。これは、通常のサーバーの電源を切るようなものです。ただし、停止したインスタンスには、保持されているハードディスク容量に対して少額の料金が請求されます。 
+* インスタンスを「終了」すると、そのインスタンスに関連付けられているすべてのデータが削除されます。これにはディスクも含まれるため、再度起動することはできません。これは、将来必要ないことがわかっている場合にのみ行ってください。
 
-インスタンスをさらに多くのインスタンスのテンプレートとして使用する場合は、:numref:`fig_connect` の例を右クリックし、"Image」$\rightarrow$「Create」を選択してインスタンスのイメージを作成します。これが完了したら、[インスタンスの状態] $\rightarrow$ [Terminate] を選択してインスタンスを終了します。次回このインスタンスを使用するときは、このセクションで説明する EC2 インスタンスを作成して実行する手順に従って、保存したイメージに基づいてインスタンスを作成できます。唯一の違いは、「1.:numref:`fig_ubuntu` に示されている「AMI を選択」を選択すると、保存したイメージを選択するには左側の [My AMI] オプションを使用する必要があります。作成されたインスタンスは、イメージハードディスクに保存された情報を保持します。たとえば、CUDA やその他のランタイム環境を再インストールする必要はありません。 
+インスタンスをさらに多くのインスタンスのテンプレートとして使用する場合は、:numref:`fig_connect` の例を右クリックし、「Image」$\rightarrow$「Create」を選択してインスタンスのイメージを作成します。これが完了したら、「Instance State」$\rightarrow$「Terminate」を選択してインスタンスを終了します。次回このインスタンスを使用するときは、このセクションの手順に従って、保存したイメージに基づいてインスタンスを作成できます。唯一の違いは、:numref:`fig_ubuntu` に示されている「1. Choose AMI」で、左側の「My AMIs」オプションを使用して保存したイメージを選択する必要があることです。作成されたインスタンスは、イメージのハードディスクに保存された情報を保持します。たとえば、CUDA やその他のランタイム環境を再インストールする必要はありません。
 
-## [概要
+## まとめ
 
-* 自分でコンピューターを購入して構築しなくても、オンデマンドでインスタンスを起動および停止できます。
-* 適切な GPU ドライバーを使用するには、事前にインストールする必要があります。
+* 独自のコンピューターを購入して構築しなくても、オンデマンドでインスタンスを起動および停止できます。
+* GPU 対応のディープラーニングフレームワークを使用する前に CUDA をインストールする必要があります。
+* ポート転送を使用して、Jupyter Notebook をリモートサーバーで実行できます。
 
 ## 演習
 
-1. クラウドは便利ですが、安くはありません。[spot instances](https://aws.amazon.com/ec2/spot/) のローンチ方法を見て、価格を下げる方法をご覧ください。
+1. クラウドは便利ですが、安くはありません。[spot instances](https://aws.amazon.com/ec2/spot/)の起動方法を確認して、コストを削減する方法をご覧ください。
 1. さまざまな GPU サーバーを試してみてください。彼らはどれくらい速いですか？
-1. マルチ GPU サーバーを試してみてください。どれだけうまくスケールアップできるか？
+1. マルチ GPU サーバーを試してみてください。どの程度うまくスケールアップできますか？
 
 [Discussions](https://discuss.d2l.ai/t/423)
diff --git a/chapter_appendix-tools-for-deep-learning/aws_origin.md b/chapter_appendix-tools-for-deep-learning/aws_origin.md
index 0df8c76..91d42d6 100644
--- a/chapter_appendix-tools-for-deep-learning/aws_origin.md
+++ b/chapter_appendix-tools-for-deep-learning/aws_origin.md
@@ -1,11 +1,11 @@
 # Using AWS EC2 Instances
 :label:`sec_aws`
 
-In this section, we will show you how to install all libraries on a raw Linux machine. Remember that in :numref:`sec_sagemaker` we discussed how to use Amazon SageMaker, while building an instance by yourself costs less on AWS. The walkthrough includes a number of steps:
+In this section, we will show you how to install all libraries on a raw Linux machine. Recall that in :numref:`sec_sagemaker` we discussed how to use Amazon SageMaker, while building an instance by yourself costs less on AWS. The walkthrough includes three steps:
 
 1. Request for a GPU Linux instance from AWS EC2.
-1. Optionally: install CUDA or use an AMI with CUDA preinstalled.
-1. Set up the corresponding MXNet GPU version.
+1. Install CUDA (or use an Amazon Machine Image with preinstalled CUDA).
+1. Install the deep learning framework and other libraries for running the code of the book.
 
 This process applies to other instances (and other clouds), too, albeit with some minor modifications. Before going forward, you need to create an AWS account, see :numref:`sec_sagemaker` for more details.
 
@@ -29,30 +29,38 @@ Select a nearby data center to reduce latency, e.g., "Oregon" (marked by the red
 you can select a nearby Asia Pacific region, such as Seoul or Tokyo. Please note
 that some data centers may not have GPU instances.
 
+
 ### Increasing Limits
+
 Before choosing an instance, check if there are quantity
 restrictions by clicking the "Limits" label in the bar on the left as shown in
-:numref:`fig_ec2`. :numref:`fig_limits` shows an example of such a
+:numref:`fig_ec2`. 
+:numref:`fig_limits` shows an example of such a
 limitation. The account currently cannot open "p2.xlarge" instance per region. If
 you need to open one or more instances, click on the "Request limit increase" link to
-apply for a higher instance quota. Generally, it takes one business day to
+apply for a higher instance quota.
+Generally, it takes one business day to
 process an application.
 
 ![Instance quantity restrictions.](../img/limits.png)
 :width:`700px`
 :label:`fig_limits`
 
-### Launching Instance
+
+### Launching an Instance
+
 Next, click the "Launch Instance" button marked by the red box in :numref:`fig_ec2` to launch your instance.
 
-We begin by selecting a suitable AMI (AWS Machine Image). Enter "Ubuntu" in the search box (marked by the red box in :numref:`fig_ubuntu`).
+We begin by selecting a suitable Amazon Machine Image (AMI). Enter "Ubuntu" in the search box (marked by the red box in :numref:`fig_ubuntu`).
 
 
-![Choose an operating system.](../img/ubuntu-new.png)
+![Choose an AMI.](../img/ubuntu-new.png)
 :width:`700px`
 :label:`fig_ubuntu`
 
-EC2 provides many different instance configurations to choose from. This can sometimes feel overwhelming to a beginner. Here's a table of suitable machines:
+EC2 provides many different instance configurations to choose from. This can sometimes feel overwhelming to a beginner. :numref:`tab_ec2` lists different suitable machines.
+
+:Different EC2 instance types
 
 | Name | GPU         | Notes                         |
 |------|-------------|-------------------------------|
@@ -61,21 +69,24 @@ EC2 provides many different instance configurations to choose from. This can som
 | g3   | Maxwell M60 | good trade-off                |
 | p3   | Volta V100  | high performance for FP16     |
 | g4   | Turing T4   | inference optimized FP16/INT8 |
+:label:`tab_ec2`
 
-All the above servers come in multiple flavors indicating the number of GPUs used. For example, a p2.xlarge has 1 GPU and a p2.16xlarge has 16 GPUs and more memory. For more details, see the [AWS EC2 documentation](https://aws.amazon.com/ec2/instance-types/) or a [summary page](https://www.ec2instances.info). For the purpose of illustration, a p2.xlarge will suffice (marked in red box of :numref:`fig_p2x`).
-
-**Note:** you must use a GPU enabled instance with suitable drivers and a version of MXNet that is GPU enabled. Otherwise you will not see any benefit from using GPUs.
+All these servers come in multiple flavors indicating the number of GPUs used. For example, a p2.xlarge has 1 GPU and a p2.16xlarge has 16 GPUs and more memory. For more details, see the [AWS EC2 documentation](https://aws.amazon.com/ec2/instance-types/) or a [summary page](https://www.ec2instances.info). For the purpose of illustration, a p2.xlarge will suffice (marked in the red box of :numref:`fig_p2x`).
 
 ![Choose an instance.](../img/p2x.png)
 :width:`700px`
 :label:`fig_p2x`
 
-So far, we have finished the first two of seven steps for launching an EC2 instance, as shown on the top of :numref:`fig_disk`. In this example, we keep the default configurations for the steps "3. Configure Instance", "5. Add Tags", and "6. Configure Security Group". Tap on "4. Add Storage" and increase the default hard disk size to 64 GB (marked in red box of :numref:`fig_disk`). Note that CUDA by itself already takes up 4 GB.
+Note that you should use a GPU-enabled instance with suitable drivers and a GPU-enabled deep learning framework. Otherwise you will not see any benefit from using GPUs.
 
-![Modify instance hard disk size.](../img/disk.png)
+So far, we have finished the first two of seven steps for launching an EC2 instance, as shown on the top of :numref:`fig_disk`. In this example, we keep the default configurations for the steps "3. Configure Instance", "5. Add Tags", and "6. Configure Security Group". Tap on "4. Add Storage" and increase the default hard disk size to 64 GB (marked in the red box of :numref:`fig_disk`). Note that CUDA by itself already takes up 4 GB.
+
+![Modify the hard disk size.](../img/disk.png)
 :width:`700px`
 :label:`fig_disk`
 
+
+
 Finally, go to "7. Review" and click "Launch" to launch the configured
 instance. The system will now prompt you to select the key pair used to access
 the instance. If you do not have a key pair, select "Create a new key pair" in
@@ -100,14 +111,15 @@ instance ID shown in :numref:`fig_launching` to view the status of this instance
 
 As shown in :numref:`fig_connect`, after the instance state turns green, right-click the instance and select `Connect` to view the instance access method.
 
-![View instance access and startup method.](../img/connect.png)
+![View instance access method.](../img/connect.png)
 :width:`700px`
 :label:`fig_connect`
 
-If this is a new key, it must not be publicly viewable for SSH to work. Go to the folder where you store `D2L_key.pem` (e.g., the Downloads folder) and make sure that the key is not publicly viewable.
+If this is a new key, it must not be publicly viewable for SSH to work. Go to the folder where you store `D2L_key.pem` and 
+execute the following command 
+to make the key not publicly viewable:
 
 ```bash
-cd /Downloads  ## if D2L_key.pem is stored in Downloads folder
 chmod 400 D2L_key.pem
 ```
 
@@ -138,17 +150,16 @@ sudo apt-get update && sudo apt-get install -y build-essential git libgfortran3
 ```
 
 
-Here we download CUDA 10.1. Visit NVIDIA's [official repository](https://developer.nvidia.com/cuda-downloads) to find the download link of CUDA 10.1 as shown in :numref:`fig_cuda`.
+Here we download CUDA 10.1. Visit NVIDIA's [official repository](https://developer.nvidia.com/cuda-toolkit-archive) to find the download link as shown in :numref:`fig_cuda`.
 
 ![Find the CUDA 10.1 download address.](../img/cuda101.png)
 :width:`500px`
 :label:`fig_cuda`
 
-Copy the instructions and paste them into the terminal to install
-CUDA 10.1.
+Copy the instructions and paste them into the terminal to install CUDA 10.1.
 
 ```bash
-## Paste the copied link from CUDA website
+# The link and file name are subject to change
 wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
 sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
 wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
@@ -159,7 +170,7 @@ sudo apt-get -y install cuda
 ```
 
 
-After installing the program, run the following command to view the GPUs.
+After installing the program, run the following command to view the GPUs:
 
 ```bash
 nvidia-smi
@@ -173,103 +184,64 @@ echo "export LD_LIBRARY_PATH=\${LD_LIBRARY_PATH}:/usr/local/cuda/lib64" >> ~/.ba
 ```
 
 
-## Installing MXNet and Downloading the D2L Notebooks
-
-First, to simplify the installation, you need to install [Miniconda](https://conda.io/en/latest/miniconda.html) for Linux. The download link and file name are subject to changes, so please go the Miniconda website and click "Copy Link Address" as shown in :numref:`fig_miniconda`.
-
-![Download Miniconda.](../img/miniconda.png)
-:width:`700px`
-:label:`fig_miniconda`
-
-```bash
-# The link and file name are subject to changes
-wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
-sh Miniconda3-latest-Linux-x86_64.sh -b
-```
+## Installing Libraries for Running the Code
 
+To run the code of this book,
+just follow the steps in :ref:`chap_installation`
+for Linux users on the EC2 instance
+and use the following tips 
+for working on a remote Linux server:
 
-After the Miniconda installation, run the following command to activate CUDA and conda.
+* To download the bash script on the Miniconda installation page, right-click the download link and select "Copy Link Address", then execute `wget [copied link address]`.
+* After running `~/miniconda3/bin/conda init`, you may execute `source ~/.bashrc` instead of closing and reopening your current shell.
 
-```bash
-~/miniconda3/bin/conda init
-source ~/.bashrc
-```
 
+## Running the Jupyter Notebook Remotely
 
-Next, download the code for this book.
+To run the Jupyter Notebook remotely you need to use SSH port forwarding. After all, the server in the cloud does not have a monitor or keyboard. For this, log into your server from your desktop (or laptop) as follows:
 
-```bash
-sudo apt-get install unzip
-mkdir d2l-en && cd d2l-en
-curl https://d2l.ai/d2l-en.zip -o d2l-en.zip
-unzip d2l-en.zip && rm d2l-en.zip
 ```
-
-
-Then create the conda `d2l` environment and enter `y` to proceed with the installation.
-
-```bash
-conda create --name d2l -y
-```
-
-
-After creating the `d2l` environment, activate it and install `pip`.
-
-```bash
-conda activate d2l
-conda install python=3.7 pip -y
-```
-
-
-Finally, install MXNet and the `d2l` package. The postfix `cu101` means that this is the CUDA 10.1 variant. For different versions, say only CUDA 10.0, you would want to choose `cu100` instead.
-
-```bash
-pip install mxnet-cu101==1.7.0
-pip install git+https://github.com/d2l-ai/d2l-en
-
-```
-
-
-You can quickly test whether everything went well as follows:
-
-```
-$ python
->>> from mxnet import np, npx
->>> np.zeros((1024, 1024), ctx=npx.gpu())
+# This command must be run in the local command line
+ssh -i "/path/to/key.pem" ubuntu@ec2-xx-xxx-xxx-xxx.y.compute.amazonaws.com -L 8889:localhost:8888
 ```
 
 
-## Running Jupyter
-
-To run Jupyter remotely you need to use SSH port forwarding. After all, the server in the cloud does not have a monitor or keyboard. For this, log into your server from your desktop (or laptop) as follows.
+Next, go to the location 
+of the downloaded code of this book
+on the EC2 instance,
+then run:
 
 ```
-# This command must be run in the local command line
-ssh -i "/path/to/key.pem" ubuntu@ec2-xx-xxx-xxx-xxx.y.compute.amazonaws.com -L 8889:localhost:8888
 conda activate d2l
 jupyter notebook
 ```
 
 
-:numref:`fig_jupyter` shows the possible output after you run Jupyter Notebook. The last row is the URL for port 8888.
+:numref:`fig_jupyter` shows the possible output after you run the Jupyter Notebook. The last row is the URL for port 8888.
 
-![Output after running Jupyter Notebook. The last row is the URL for port 8888.](../img/jupyter.png)
+![Output after running the Jupyter Notebook. The last row is the URL for port 8888.](../img/jupyter.png)
 :width:`700px`
 :label:`fig_jupyter`
 
-Since you used port forwarding to port 8889 you will need to replace the port number and use the secret as given by Jupyter when opening the URL in your local browser.
+Since you used port forwarding to port 8889,
+copy the last row in the red box of :numref:`fig_jupyter`,
+replace "8888" with "8889" in the URL,
+and open it in your local browser.
 
 
 ## Closing Unused Instances
 
-As cloud services are billed by the time of use, you should close instances that are not being used. Note that there are alternatives: "stopping" an instance means that you will be able to start it again. This is akin to switching off the power for your regular server. However, stopped instances will still be billed a small amount for the hard disk space retained. "Terminate" deletes all data associated with it. This includes the disk, hence you cannot start it again. Only do this if you know that you will not need it in the future.
+As cloud services are billed by the time of use, you should close instances that are not being used. Note that there are alternatives:
+
+* "Stopping" an instance means that you will be able to start it again. This is akin to switching off the power for your regular server. However, stopped instances will still be billed a small amount for the hard disk space retained. 
+* "Terminating" an instance will delete all data associated with it. This includes the disk, hence you cannot start it again. Only do this if you know that you will not need it in the future.
 
 If you want to use the instance as a template for many more instances,
 right-click on the example in :numref:`fig_connect` and select "Image" $\rightarrow$
 "Create" to create an image of the instance. Once this is complete, select
 "Instance State" $\rightarrow$ "Terminate" to terminate the instance. The next
-time you want to use this instance, you can follow the steps for creating and
-running an EC2 instance described in this section to create an instance based on
+time you want to use this instance, you can follow the steps in this section 
+to create an instance based on
 the saved image. The only difference is that, in "1. Choose AMI" shown in
 :numref:`fig_ubuntu`, you must use the "My AMIs" option on the left to select your saved
 image. The created instance will retain the information stored on the image hard
@@ -279,13 +251,14 @@ environments.
 
 ## Summary
 
-* You can launch and stop instances on demand without having to buy and build your own computer.
-* You need to install suitable GPU drivers before you can use them.
+* We can launch and stop instances on demand without having to buy and build our own computer.
+* We need to install CUDA before using the GPU-enabled deep learning framework.
+* We can use port forwarding to run the Jupyter Notebook on a remote server.
 
 
 ## Exercises
 
-1. The cloud offers convenience, but it does not come cheap. Find out how to launch [spot instances](https://aws.amazon.com/ec2/spot/) to see how to reduce prices.
+1. The cloud offers convenience, but it does not come cheap. Find out how to launch [spot instances](https://aws.amazon.com/ec2/spot/) to see how to reduce costs.
 1. Experiment with different GPU servers. How fast are they?
 1. Experiment with multi-GPU servers. How well can you scale things up?
 
diff --git a/chapter_appendix-tools-for-deep-learning/colab.md b/chapter_appendix-tools-for-deep-learning/colab.md
index 490d9d1..24531b3 100644
--- a/chapter_appendix-tools-for-deep-learning/colab.md
+++ b/chapter_appendix-tools-for-deep-learning/colab.md
@@ -1,28 +1,30 @@
-# グーグル・コラボレーションを使う
+# Google Colab の使用
 :label:`sec_colab`
 
-:numref:`sec_sagemaker` と :numref:`sec_aws` で AWS でこの本を実行する方法を紹介しました。もう 1 つの選択肢として、この本を [Google Colab](https://colab.research.google.com/) で実行する方法があります。Google アカウントをお持ちの場合は、無料の GPU が提供されます。 
+:numref:`sec_sagemaker` と :numref:`sec_aws` で AWS でこの本を実行する方法を紹介しました。別のオプションは、Googleアカウントを持っている場合、この本を[Google Colab](https://colab.research.google.com/)で実行することです。 
 
-Colab でセクションを実行するには、:numref:`fig_colab` のように、そのセクションのタイトルの右側にある `Colab` ボタンをクリックするだけです。  
+Colabでセクションのコードを実行するには、:numref:`fig_colab`に示すように、`Colab`ボタンをクリックします。  
 
-![Open a section on Colab](../img/colab.png)
+![Run the code of a section on Colab](../img/colab.png)
 :width:`300px`
 :label:`fig_colab`
 
-コードセルを初めて実行すると、:numref:`fig_colab2` に示すような警告メッセージが表示されます。「RUN ANYWAY」をクリックして無視してもかまいません。 
+コードセルを初めて実行する場合は、:numref:`fig_colab2` に示すような警告メッセージが表示されます。無視するには、「RUN ANYWAY」をクリックするだけです。
 
-![The warning message for running a section on Colab](../img/colab-2.png)
+![Ignore the warning message by clicking "RUN ANYWAY".](../img/colab-2.png)
 :width:`300px`
 :label:`fig_colab2`
 
-次に、Colab がこのノートブックを実行するインスタンスに接続します。具体的には、`d2l.try_gpu()` 関数を呼び出すときなど、GPU が必要な場合、GPU インスタンスに自動的に接続するように Colab にリクエストします。 
+次に、Colab は、このセクションのコードを実行するインスタンスに接続します。具体的には、GPU が必要な場合、GPU インスタンスへの接続が Colab に自動的に要求されます。
 
-## [概要
+## まとめ
 
-* Google Colab を使用して、この本の各セクションを GPU で実行できます。
+* Google Colab を使用して、この本の各セクションのコードを実行できます。
+* 本書のいずれかのセクションで GPU が必要な場合、GPU インスタンスへの接続が Colab に要求されます。
 
 ## 演習
 
-1. Google Colab を使用して、この本のコードを編集して実行してみてください。
+1. Google Colab を使用して、この本の任意のセクションを開きます。
+1. Google Colab を使用して GPU を必要とするセクションを編集して実行します。
 
 [Discussions](https://discuss.d2l.ai/t/424)
diff --git a/chapter_appendix-tools-for-deep-learning/colab_origin.md b/chapter_appendix-tools-for-deep-learning/colab_origin.md
index 0372a08..db38dfb 100644
--- a/chapter_appendix-tools-for-deep-learning/colab_origin.md
+++ b/chapter_appendix-tools-for-deep-learning/colab_origin.md
@@ -1,32 +1,40 @@
 # Using Google Colab
 :label:`sec_colab`
 
-We introduced how to run this book on AWS in :numref:`sec_sagemaker` and :numref:`sec_aws`. Another option is running this book on [Google Colab](https://colab.research.google.com/), which provides free GPU if you have a Google account.
+We introduced how to run this book on AWS in :numref:`sec_sagemaker` and :numref:`sec_aws`. Another option is running this book on [Google Colab](https://colab.research.google.com/)
+if you have a Google account.
 
-To run a section on Colab, you can simply click the `Colab` button to the right of the title of that section, such as in :numref:`fig_colab`. 
+To run the code of a section on Colab, simply click the `Colab` button as shown in :numref:`fig_colab`. 
 
-![Open a section on Colab](../img/colab.png)
+![Run the code of a section on Colab](../img/colab.png)
 :width:`300px`
 :label:`fig_colab`
 
 
-When it is the first time you execute a code cell, you will receive a warning message as shown in :numref:`fig_colab2`. You may click "RUN ANYWAY" to ignore it.
+If this is the first time you run a code cell,
+you will receive a warning message as shown in :numref:`fig_colab2`.
+Just click "RUN ANYWAY" to ignore it.
 
-![The warning message for running a section on Colab](../img/colab-2.png)
+![Ignore the warning message by clicking "RUN ANYWAY".](../img/colab-2.png)
 :width:`300px`
 :label:`fig_colab2`
 
-Next, Colab will connect you to an instance to run this notebook. Specifically, if GPU is needed, such as when invoking the `d2l.try_gpu()` function, we will request Colab to connect to a GPU instance automatically.
+Next, Colab will connect you to an instance to run the code of this section.
+Specifically, if a GPU is needed,
+Colab will automatically be asked
+to connect to a GPU instance.
 
 
 ## Summary
 
-* You can use Google Colab to run each section of this book with GPUs.
+* You can use Google Colab to run each section's code in this book.
+* Colab will be requested to connect to a GPU instance if a GPU is needed in any section of this book.
 
 
 ## Exercises
 
-1. Try to edit and run the code in this book using Google Colab.
+1. Open any section of this book using Google Colab.
+1. Edit and run any section that requires a GPU using Google Colab.
 
 
 [Discussions](https://discuss.d2l.ai/t/424)
diff --git a/chapter_appendix-tools-for-deep-learning/contributing.md b/chapter_appendix-tools-for-deep-learning/contributing.md
index be71c1c..642b257 100644
--- a/chapter_appendix-tools-for-deep-learning/contributing.md
+++ b/chapter_appendix-tools-for-deep-learning/contributing.md
@@ -1,82 +1,58 @@
-# この本に寄稿する
+# この本への貢献
 :label:`sec_how_to_contribute`
 
-[readers](https://github.com/d2l-ai/d2l-en/graphs/contributors) による貢献は、この本の向上に役立っています。タイプミス、古いリンク、引用を見逃したと思われるもの、コードがエレガントに見えない、説明が不明なものを見つけた場合は、貢献して読者を助けてください。通常の本では、印刷間隔 (および誤字訂正間) の遅延は年単位で測定できますが、この本に改善点を組み込むには通常数時間から数日かかります。これはすべて、バージョン管理と継続的インテグレーションテストにより可能です。そのためには、[pull request](https://github.com/d2l-ai/d2l-en/pulls) を GitHub リポジトリにサブミットする必要があります。作成者がプルリクエストをコードリポジトリにマージすると、コントリビューターになります。 
+[readers](https://github.com/d2l-ai/d2l-en/graphs/contributors) による貢献は、この本の改善に役立ちます。タイプミス、古いリンク、引用が抜けていると思われる箇所、コードがエレガントに見えない箇所、説明がわかりにくい箇所を見つけた場合は、ぜひ貢献して、私たちが読者を支援できるよう力を貸してください。通常の本では、増刷の間隔 (したがってタイプミスの修正の間隔) は年単位になることがありますが、この本では通常、改善を取り込むのに数時間から数日しかかかりません。これはすべて、バージョン管理と継続的インテグレーション (CI) テストのおかげです。貢献するには、GitHub リポジトリに [pull request](https://github.com/d2l-ai/d2l-en/pulls) を送信する必要があります。あなたのプルリクエストが著者によってコードリポジトリにマージされると、あなたはコントリビューターになります。 
 
-## テキストの軽微な変更
+## 軽微な変更の提出
 
-最も一般的な貢献は、一文の編集やタイプミスの修正です。[github repo](https732293614) でソースファイルを探して、マークダウンファイルであるソースファイルを見つけることをお勧めします。次に、右上隅の「このファイルを編集」ボタンをクリックして、マークダウンファイルに変更を加えます。 
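+最も一般的な貢献は、1 つの文の編集やタイプミスの修正です。[GitHub repository](https://github.com/d2l-ai/d2l-en) でソースファイルを見つけて、直接編集することをお勧めします。たとえば、[Find file](https://github.com/d2l-ai/d2l-en/find/master) ボタン (:numref:`fig_edit_file`) でファイルを検索して、ソースファイル (マークダウンファイル) を見つけることができます。次に、右上隅の「このファイルを編集」ボタンをクリックして、マークダウンファイルに変更を加えます。 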
 
 ![Edit the file on Github.](../img/edit-file.png)
 :width:`300px`
 :label:`fig_edit_file`
 
-完了したら、ページ下部の [ファイル変更の提案] パネルに変更内容を入力し、[ファイル変更の提案] ボタンをクリックします。変更を確認するための新しいページにリダイレクトされます (:numref:`fig_git_createpr`)。すべて問題なければ、「Create pull request」ボタンをクリックしてプルリクエストを送信できます。 
+完了したら、ページ下部の「ファイル変更の提案」パネルに変更の説明を入力し、「ファイル変更の提案」ボタンをクリックします。変更を確認するための新しいページにリダイレクトされます (:numref:`fig_git_createpr`)。すべてが良ければ、「Create pull request」ボタンをクリックしてプルリクエストを送信できます。 
 
-## 大きな変革を提案する
+## 大きな変更を提案する
 
-テキストやコードの大部分を更新する予定がある場合は、この本で使用されている形式についてもう少し詳しく知る必要があります。ソースファイルは [markdown format](https://daringfireball.net/projects/markdown/syntax) をベースにしており、数式、画像、章、引用を参照するなど、[d2lbook](http://book.d2l.ai/user/markdown.html) パッケージを通じて一連の拡張子が付けられています。任意の Markdown エディタを使用してこれらのファイルを開き、変更を加えることができます。 
+テキストやコードの大部分を更新する予定がある場合は、この本が使用している形式についてもう少し知っておく必要があります。ソースファイルは [markdown format](https://daringfireball.net/projects/markdown/syntax) に基づいており、数式、画像、章、引用の参照など、[d2lbook](http://book.d2l.ai/user/markdown.html) パッケージによる一連の拡張機能が追加されています。任意のマークダウンエディタを使用してこれらのファイルを開き、変更を加えることができます。 
 
-コードを変更したい場合は、:numref:`sec_jupyter` で説明されているように Jupyter を使用してこれらの Markdown ファイルを開くことをお勧めします。これにより、変更を実行してテストできます。変更を送信する前に、必ずすべての出力をクリアしてください。更新したセクションが CI システムによって実行され、出力が生成されます。 
+コードを変更したい場合は、:numref:`sec_jupyter`で説明されているように、Jupyter Notebook を使用してこれらのマークダウンファイルを開くことをお勧めします。これにより、変更を実行してテストできます。変更を送信する前に、すべての出力をクリアすることを忘れないでください。CI システムが、更新したセクションを実行して出力を生成します。 
 
-セクションによっては複数のフレームワーク実装をサポートしている場合があり、`d2lbook` を使用して特定のフレームワークをアクティブ化できます。そのため、他のフレームワーク実装は Markdown コードブロックになり、Jupyter で「すべて実行」を実行しても実行されません。つまり、まず `d2lbook` を次のコマンドでインストールします。
+一部のセクションでは、複数のフレームワーク実装をサポートしている場合があります。デフォルトの実装 (MXNet) 以外の新しいコードブロックを追加する場合は、ブロックの先頭行に `#@tab` を付けてマークしてください。たとえば、PyTorch のコードブロックには `#@tab pytorch`、TensorFlow のコードブロックには `#@tab tensorflow`、すべての実装で共有するコードブロックには `#@tab all` を使用します。詳細については、[`d2lbook`](http://book.d2l.ai/user/code_tabs.html) パッケージを参照してください。 
 
-```bash
-pip install git+https://github.com/d2l-ai/d2l-book
-```
-
-`d2l-en` のルートディレクトリで、次のいずれかのコマンドを実行して特定の実装をアクティブ化できます。
-
-```bash
-d2lbook activate mxnet chapter_multilayer-perceptrons/mlp-scratch.md
-d2lbook activate pytorch chapter_multilayer-perceptrons/mlp-scratch.md
-d2lbook activate tensorflow chapter_multilayer-perceptrons/mlp-scratch.md
-```
-
-変更を送信する前に、すべてのコードブロック出力をクリアし、次の方法ですべてをアクティブ化してください。
-
-```bash
-d2lbook activate all chapter_multilayer-perceptrons/mlp-scratch.md
-```
+## 大きな変更の提出
 
-デフォルトの実装ではない MXNet という新しいコードブロックを追加する場合は `# @tab` to mark this block on the beginning line. For example, ` # @tab pytorch` for a PyTorch code block, `# @tab tensorflow` for a TensorFlow code block, or `# @tab all` すべての実装で共有されるコードブロック。詳細については [d2lbook](http://book.d2l.ai/user/code_tabs.html) を参照してください。 
-
-## 新しいセクションまたは新しいフレームワーク実装の追加
-
-強化学習などの新しい章を作成したり、TensorFlow などの新しいフレームワークの実装を追加したりする場合は、電子メールまたは [github issues](https://github.com/d2l-ai/d2l-en/issues) を使用して、最初に作成者に連絡してください。 
-
-## メジャーチェンジの提出
-
-大きな変更を送信するには、標準の `git` プロセスを使用することをお勧めします。簡単に言うと、このプロセスは :numref:`fig_contribute` で説明されているとおりに機能します。 
+大きな変更を送信するには、標準の Git プロセスを使用することをお勧めします。簡単に言うと、このプロセスは:numref:`fig_contribute`で説明されているように機能します。 
 
 ![Contributing to the book.](../img/contribute.svg)
 :label:`fig_contribute`
 
-手順を詳しく説明します。すでに Git に慣れている場合は、このセクションをスキップしてもかまいません。具体的に言うと、コントリビューターのユーザー名は「astonzhang」と仮定します。 
+手順を詳しく説明します。既に Git に慣れている場合は、このセクションをスキップできます。具体的に言うと、コントリビューターのユーザー名は「astonzhang」と仮定します。 
 
 ### Git をインストールする
 
-Git オープンソースブックには [how to install Git](https://git-scm.com/book/en/v2) が記載されています。これは通常、Ubuntu Linux では `apt install git` 経由で、macOS に Xcode 開発者ツールをインストールするか、GitHub の [desktop client](https://desktop.github.com) を使用して動作します。GitHub アカウントを持っていない場合は、サインアップする必要があります。 
+Git のオープンソース書籍には [how to install Git](https://git-scm.com/book/en/v2) が記載されています。通常は、Ubuntu Linux では `apt install git` を実行する、macOS では Xcode 開発者ツールをインストールする、または GitHub の [desktop client](https://desktop.github.com) を使用する、のいずれかの方法でインストールできます。GitHub アカウントを持っていない場合は、サインアップする必要があります。 
 
 ### GitHub にログインする
 
-ブックのコードリポジトリの [address](https://github.com/d2l-ai/d2l-en/) をブラウザに入力します。:numref:`fig_git_fork` の右上にある赤いボックスの `Fork` ボタンをクリックして、この本のリポジトリのコピーを作成します。これが*あなたのコピー*になり、好きなように変更できます。 
+ブラウザに本のコードリポジトリの [address](https://github.com/d2l-ai/d2l-en/) を入力します。:numref:`fig_git_fork`の右上にある赤いボックス内の`Fork`ボタンをクリックして、この本のリポジトリのコピーを作成します。これは*あなたのコピー*になり、好きなように変更することができます。 
 
 ![The code repository page.](../img/git-fork.png)
 :width:`700px`
 :label:`fig_git_fork`
 
-これで、この本のコードリポジトリが、スクリーンショット :numref:`fig_git_forked` の左上に表示されている `astonzhang/d2l-en` のように、ユーザー名にフォーク (コピー) されます。 
+これで、この本のコードリポジトリは、:numref:`fig_git_forked`の左上に表示されている`astonzhang/d2l-en`のように、あなたのユーザー名にフォーク（つまり、コピー）されます。 
 
-![Fork the code repository.](../img/git-forked.png)
+![The forked code repository.](../img/git-forked.png)
 :width:`700px`
 :label:`fig_git_forked`
 
 ### リポジトリのクローンを作成する
 
-リポジトリをクローンする (ローカルコピーを作成する) には、リポジトリのアドレスを取得する必要があります。:numref:`fig_git_clone` の緑色のボタンはこれを表示します。このフォークを長期間保持する場合は、ローカルコピーがメインリポジトリで最新であることを確認してください。今のところ、:ref:`chap_installation` の指示に従って作業を開始してください。主な違いは、リポジトリの「自分のフォーク」をダウンロードしていることです。 
+リポジトリをクローンする (つまり、ローカルコピーを作成する) には、リポジトリのアドレスを取得する必要があります。:numref:`fig_git_clone`の緑色のボタンで、このアドレスが表示されます。このフォークを長く保持する場合は、ローカルコピーをメインリポジトリの最新の状態に合わせておいてください。とりあえずは、:ref:`chap_installation`の指示に従って始めてください。主な違いは、リポジトリの*自分のフォーク*をダウンロードすることです。 
 
-![Git clone.](../img/git-clone.png)
+![Cloning the repository.](../img/git-clone.png)
 :width:`700px`
 :label:`fig_git_clone`
 
@@ -85,11 +61,11 @@ Git オープンソースブックには [how to install Git](https://git-scm.co
 git clone https://github.com/your_github_username/d2l-en.git
 ```
 
-### ブックとプッシュを編集する
+### 編集とプッシュ
 
-今度は本を編集する時です。:numref:`sec_jupyter` の指示に従って、Jupyter でノートブックを編集することをお勧めします。変更を加え、問題がないことを確認します。`~/d2l-en/chapter_appendix_tools/how-to-conttribute.md` ファイルのタイプミスを修正したと仮定します。その後、どのファイルを変更したかを確認できます。 
+今度は本を編集する時です。:numref:`sec_jupyter`の指示に従って、Jupyter ノートブックで編集するのが最善です。変更を加えて、問題ないことを確認します。ファイル `~/d2l-en/chapter_appendix_tools/how-to-contribute.md` のタイプミスを修正したと仮定します。その後、変更したファイルを確認できます。 
 
-この時点で、Git は `chapter_appendix_tools/how-to-contribute.md` ファイルが変更されたことを知らせるメッセージを表示します。
+この時点で、Git は `chapter_appendix_tools/how-to-contribute.md` ファイルが変更されたことを知らせます。
 
 ```
 mylaptop:d2l-en me$ git status
@@ -103,7 +79,7 @@ Changes not staged for commit:
 	modified:   chapter_appendix_tools/how-to-contribute.md
 ```
 
-これが目的であることを確認したら、以下のコマンドを実行します。
+これが目的であることを確認したら、次のコマンドを実行します。
 
 ```
 git add chapter_appendix_tools/how-to-contribute.md
@@ -111,38 +87,33 @@ git commit -m 'fix typo in git documentation'
 git push
 ```
 
-変更したコードは、リポジトリの個人用フォークに保存されます。変更の追加をリクエストするには、本の公式リポジトリに対するプルリクエストを作成する必要があります。 
+変更したコードは、リポジトリの個人用フォークに保存されます。変更の追加をリクエストするには、本の公式リポジトリに対するプルリクエストを作成する必要があります。 
 
-### プルリクエスト
+### プルリクエストを送信する
 
-:numref:`fig_git_newpr` に示すように、GitHub 上のリポジトリのフォークに移動し、「新しいプルリクエスト」を選択します。これにより、編集とブックのメインリポジトリの現在の変更点を示す画面が開きます。 
+:numref:`fig_git_newpr`に示すように、GitHubのリポジトリのフォークに移動し、「新しいプルリクエスト」を選択します。これにより、編集内容と本のメインリポジトリの最新版との間の変更点を示す画面が開きます。 
 
-![Pull Request.](../img/git-newpr.png)
+![New pull request.](../img/git-newpr.png)
 :width:`700px`
 :label:`fig_git_newpr`
 
-### プルリクエストをサブミットする
-
-最後に、:numref:`fig_git_createpr` に示すように、ボタンをクリックしてプルリクエストを送信します。プルリクエストで行った変更内容を必ず説明してください。これにより、著者が本をレビューし、本と統合しやすくなります。変更によっては、すぐに承認されたり、却下されたり、変更に関するフィードバックが得られる可能性が高くなります。それらを組み込んだら、準備は完了です。 
+最後に、:numref:`fig_git_createpr`に示すようにボタンをクリックしてプルリクエストを送信します。プルリクエストで行った変更を必ず説明してください。これにより、著者がレビューして本にマージしやすくなります。変更の内容によっては、すぐに承認されることも、拒否されることもありますが、多くの場合は変更に関するフィードバックが返ってきます。フィードバックを反映したら、準備完了です。 
 
-![Create Pull Request.](../img/git-createpr.png)
+![Create pull request.](../img/git-createpr.png)
 :width:`700px`
 :label:`fig_git_createpr`
 
-プルリクエストは、メインリポジトリのリクエストリストに表示されます。迅速に処理できるよう全力を尽くします。 
-
-## [概要
+## まとめ
 
 * GitHub を使ってこの本に貢献できます。
-* GitHub でファイルを直接編集して、軽微な変更を加えることができます。
-* 大きな変更については、リポジトリをフォークしてローカルで編集し、準備ができたらコントリビューションし直してください。
-* プルリクエストは、コントリビューションがまとめられている方法です。巨大なプルリクエストを送信しないようにしてください。これは理解し組み込むのが難しくなるからです。小さいものをいくつか送ってください。
+* GitHub のファイルを直接編集して、小さな変更を加えることができます。
+* 大きな変更については、リポジトリをフォークし、ローカルで編集し、準備ができてから貢献してください。
+* プルリクエストは、コントリビューションをまとめて送る手段です。巨大なプルリクエストは理解して取り込むのが難しくなるため、送信しないようにしてください。小さなプルリクエストを複数送るほうがよいでしょう。
 
 ## 演習
 
-1. `d2l-en` リポジトリにスターを付けてフォークします。
-1. 改善が必要なコードをいくつか見つけて、プルリクエストを送信してください。
-1. 見逃していた参照を見つけてプルリクエストを送信してください。
-1. 通常は、新しいブランチを使用してプルリクエストを作成する方が良い方法です。[Git branching](https://git-scm.com/book/en/v2/Git-Branching-Branches-in-a-Nutshell) でそれを行う方法を学んでください。
+1. `d2l-ai/d2l-en` リポジトリにスターを付けてフォークします。
+1. 改善が必要なもの (参照がないなど) を見つけたら、プルリクエストを送信します。 
+1. 通常は、新しいブランチを使用してプルリクエストを作成する方が良い方法です。[Git branching](https://git-scm.com/book/en/v2/Git-Branching-Branches-in-a-Nutshell)でそれを行う方法を学んでください。
 
 [Discussions](https://discuss.d2l.ai/t/426)
diff --git a/chapter_appendix-tools-for-deep-learning/contributing_origin.md b/chapter_appendix-tools-for-deep-learning/contributing_origin.md
index 3b7c691..dda6168 100644
--- a/chapter_appendix-tools-for-deep-learning/contributing_origin.md
+++ b/chapter_appendix-tools-for-deep-learning/contributing_origin.md
@@ -1,11 +1,11 @@
 # Contributing to This Book
 :label:`sec_how_to_contribute`
 
-Contributions by [readers](https://github.com/d2l-ai/d2l-en/graphs/contributors) help us improve this book. If you find a typo, an outdated link, something where you think we missed a citation, where the code does not look elegant or where an explanation is unclear, please contribute back and help us help our readers. While in regular books the delay between print runs (and thus between typo corrections) can be measured in years, it typically takes hours to days to incorporate an improvement in this book. This is all possible due to version control and continuous integration testing. To do so you need to submit a [pull request](https://github.com/d2l-ai/d2l-en/pulls) to the GitHub repository. When your pull request is merged into the code repository by the author, you will become a contributor.
+Contributions by [readers](https://github.com/d2l-ai/d2l-en/graphs/contributors) help us improve this book. If you find a typo, an outdated link, something where you think we missed a citation, where the code does not look elegant or where an explanation is unclear, please contribute back and help us help our readers. While in regular books the delay between print runs (and thus between typo corrections) can be measured in years, it typically takes hours to days to incorporate an improvement in this book. This is all possible due to version control and continuous integration (CI) testing. To do so you need to submit a [pull request](https://github.com/d2l-ai/d2l-en/pulls) to the GitHub repository. When your pull request is merged into the code repository by the authors, you will become a contributor.
 
-## Minor Text Changes
+## Submitting Minor Changes
 
-The most common contributions are editing one sentence or fixing typos. We recommend you to find the source file in the [github repo](https://github.com/d2l-ai/d2l-en) and edit the file directly. For example, you can search the file through the [Find file](https://github.com/d2l-ai/d2l-en/find/master) button (:numref:`fig_edit_file`) to locate the source file, which is a markdown file. Then you click the "Edit this file" button on the upper-right corner to make your changes in the markdown file.
+The most common contributions are editing one sentence or fixing typos. We recommend that you find the source file in the [GitHub repository](https://github.com/d2l-ai/d2l-en) and edit the file directly. For example, you can search for the file through the [Find file](https://github.com/d2l-ai/d2l-en/find/master) button (:numref:`fig_edit_file`) to locate the source file (a markdown file). Then click the "Edit this file" button in the upper-right corner to make your changes in the markdown file.
 
 ![Edit the file on Github.](../img/edit-file.png)
 :width:`300px`
@@ -13,44 +13,18 @@ The most common contributions are editing one sentence or fixing typos. We recom
 
 After you are done, fill in your change descriptions in the "Propose file change" panel on the page bottom and then click the "Propose file change" button. It will redirect you to a new page to review your changes (:numref:`fig_git_createpr`). If everything is good, you can submit a pull request by clicking the "Create pull request" button.
 
-## Propose a Major Change
+## Proposing Major Changes
 
-If you plan to update a large portion of text or code, then you need to know a little bit more about the format this book is using. The source file is based on the [markdown format](https://daringfireball.net/projects/markdown/syntax) with a set of extensions through the [d2lbook](http://book.d2l.ai/user/markdown.html) package such as referring to equations, images, chapters, and citations. You can use any Markdown editors to open these files and make your changes.
+If you plan to update a large portion of text or code, then you need to know a little bit more about the format this book is using. The source file is based on the [markdown format](https://daringfireball.net/projects/markdown/syntax) with a set of extensions through the [d2lbook](http://book.d2l.ai/user/markdown.html) package such as referring to equations, images, chapters, and citations. You can use any markdown editor to open these files and make your changes.
 
-If you would like to change the code, we recommend you to use Jupyter to open these Markdown files as described in :numref:`sec_jupyter`. So that you can run and test your changes. Please remember to clear all outputs before submitting your changes, our CI system will execute the sections you updated to generate outputs.
+If you would like to change the code, we recommend that you use the Jupyter Notebook to open these markdown files as described in :numref:`sec_jupyter`, so that you can run and test your changes. Please remember to clear all outputs before submitting your changes; our CI system will execute the sections you updated to generate outputs.
 
-Some sections may support multiple framework implementations, you can use `d2lbook` to activate a particular framework, so other framework implementations become Markdown code blocks and will not be executed when you "Run All" in Jupyter. In other words, first install `d2lbook` by running
-
-```bash
-pip install git+https://github.com/d2l-ai/d2l-book
-```
-
-
-Then in the root directory of `d2l-en`, you can activate a particular implementation by running one of the following commands:
-
-```bash
-d2lbook activate mxnet chapter_multilayer-perceptrons/mlp-scratch.md
-d2lbook activate pytorch chapter_multilayer-perceptrons/mlp-scratch.md
-d2lbook activate tensorflow chapter_multilayer-perceptrons/mlp-scratch.md
-```
+Some sections may support multiple framework implementations.
+If you add a new code block that is not for the default implementation, which is MXNet, please use `#@tab` to mark this block on the beginning line. For example, use `#@tab pytorch` for a PyTorch code block, `#@tab tensorflow` for a TensorFlow code block, or `#@tab all` for a code block shared by all implementations. You may refer to the [`d2lbook`](http://book.d2l.ai/user/code_tabs.html) package for more information.
 
+## Submitting Major Changes
 
-Before submitting your changes, please clear all code block outputs and activate all by
-
-```bash
-d2lbook activate all chapter_multilayer-perceptrons/mlp-scratch.md
-```
-
-If you add a new code block not for the default implementation, which is MXNet, please use `#@tab` to mark this block on the beginning line. For example, `#@tab pytorch` for a PyTorch code block, `#@tab tensorflow` for a TensorFlow code block, or `#@tab all` a shared code block for all implementations. You may refer to [d2lbook](http://book.d2l.ai/user/code_tabs.html) for more information.
-
-
-## Adding a New Section or a New Framework Implementation
-
-If you want to create a new chapter, e.g. reinforcement learning, or add implementations of new frameworks, such as TensorFlow, please contact the authors first, either by emailing or using [github issues](https://github.com/d2l-ai/d2l-en/issues).
-
-## Submitting a Major Change
-
-We suggest you to use the standard `git` process to submit a major change. In a nutshell the process works as described in :numref:`fig_contribute`.
+We suggest that you use the standard Git process to submit a major change. In a nutshell, the process works as described in :numref:`fig_contribute`.
 
 ![Contributing to the book.](../img/contribute.svg)
 :label:`fig_contribute`
@@ -70,9 +44,9 @@ Enter the [address](https://github.com/d2l-ai/d2l-en/) of the book's code reposi
 :label:`fig_git_fork`
 
 
-Now, the code repository of this book will be forked (i.e., copied) to your username, such as `astonzhang/d2l-en` shown at the upper-left of the screenshot :numref:`fig_git_forked`.
+Now, the code repository of this book will be forked (i.e., copied) to your username, such as `astonzhang/d2l-en` shown at the upper-left of :numref:`fig_git_forked`.
 
-![Fork the code repository.](../img/git-forked.png)
+![The forked code repository.](../img/git-forked.png)
 :width:`700px`
 :label:`fig_git_forked`
 
@@ -80,7 +54,7 @@ Now, the code repository of this book will be forked (i.e., copied) to your user
 
 To clone the repository (i.e., to make a local copy) we need to get its repository address. The green button in :numref:`fig_git_clone` displays this. Make sure that your local copy is up to date with the main repository if you decide to keep this fork around for longer. For now simply follow the instructions in :ref:`chap_installation` to get started. The main difference is that you are now downloading *your own fork* of the repository.
 
-![Git clone.](../img/git-clone.png)
+![Cloning the repository.](../img/git-clone.png)
 :width:`700px`
 :label:`fig_git_clone`
 
@@ -90,10 +64,10 @@ git clone https://github.com/your_github_username/d2l-en.git
 ```
 
 
-### Editing the Book and Push
+### Editing and Pushing
 
-Now it is time to edit the book. It is best to edit the notebooks in Jupyter following instructions in :numref:`sec_jupyter`. Make the changes and check that they are OK. Assume we have modified a typo in the file `~/d2l-en/chapter_appendix_tools/how-to-contribute.md`.
-You can then check which files you have changed:
+Now it is time to edit the book. It is best to edit it in the Jupyter Notebook following instructions in :numref:`sec_jupyter`. Make the changes and check that they are OK. Assume that we have modified a typo in the file `~/d2l-en/chapter_appendix_tools/how-to-contribute.md`.
+You can then check which files you have changed.
 
 At this point Git will prompt that the `chapter_appendix_tools/how-to-contribute.md` file has been modified.
 
@@ -121,38 +95,35 @@ git push
 
 The changed code will then be in your personal fork of the repository. To request the addition of your change, you have to create a pull request for the official repository of the book.
 
-### Pull Request
+### Submitting Pull Requests
 
 As shown in :numref:`fig_git_newpr`, go to your fork of the repository on GitHub and select "New pull request". This will open up a screen that shows you the changes between your edits and what is current in the main repository of the book.
 
-![Pull Request.](../img/git-newpr.png)
+![New pull request.](../img/git-newpr.png)
 :width:`700px`
 :label:`fig_git_newpr`
 
 
-### Submitting Pull Request
-
-Finally, submit a pull request by clicking the button as shown in :numref:`fig_git_createpr`. Make sure to describe the changes you have made in the pull request. This will make it easier for the authors to review it and to merge it with the book. Depending on the changes, this might get accepted right away, rejected, or more likely, you will get some feedback on the changes. Once you have incorporated them, you are good to go.
+Finally, submit a pull request by clicking the button as shown in :numref:`fig_git_createpr`. Make sure to describe the changes you have made in the pull request.
+This will make it easier for the authors to review it and to merge it with the book. Depending on the changes, this might get accepted right away, rejected, or more likely, you will get some feedback on the changes. Once you have incorporated them, you are good to go.
 
-![Create Pull Request.](../img/git-createpr.png)
+![Create pull request.](../img/git-createpr.png)
 :width:`700px`
 :label:`fig_git_createpr`
 
-Your pull request will appear among the list of requests in the main repository. We will make every effort to process it quickly.
 
 ## Summary
 
 * You can use GitHub to contribute to this book.
 * You can edit the file on GitHub directly for minor changes.
-* For a major change, please fork the repository, edit things locally and only contribute back once you are ready.
+* For a major change, please fork the repository, edit things locally, and only contribute back once you are ready.
 * Pull requests are how contributions are being bundled up. Try not to submit huge pull requests since this makes them hard to understand and incorporate. Better send several smaller ones.
 
 
 ## Exercises
 
-1. Star and fork the `d2l-en` repository.
-1. Find some code that needs improvement and submit a pull request.
-1. Find a reference that we missed and submit a pull request.
+1. Star and fork the `d2l-ai/d2l-en` repository.
+1. If you spot anything that needs improvement (e.g., missing a reference), submit a pull request. 
 1. It is usually a better practice to create a pull request using a new branch. Learn how to do it with [Git branching](https://git-scm.com/book/en/v2/Git-Branching-Branches-in-a-Nutshell).
 
 [Discussions](https://discuss.d2l.ai/t/426)
diff --git a/chapter_appendix-tools-for-deep-learning/d2l.md b/chapter_appendix-tools-for-deep-learning/d2l.md
index e72bd20..069e22c 100644
--- a/chapter_appendix-tools-for-deep-learning/d2l.md
+++ b/chapter_appendix-tools-for-deep-learning/d2l.md
@@ -1,25 +1,79 @@
 # `d2l` API ドキュメント
 :label:`sec_d2l`
 
-`d2l` パッケージの以下のメンバの実装と、それらが定義され説明されているセクションは [source file](https://github.com/d2l-ai/d2l-en/tree/master/d2l) にあります。
+`d2l` パッケージの以下のメンバーの実装と、それらが定義され説明されているセクションは、[source file](https://github.com/d2l-ai/d2l-en/tree/master/d2l) にあります。
 
 :begin_tab:`mxnet`
 ```eval_rst
-.. automodule:: d2l.mxnet
-   :members:
+.. currentmodule:: d2l.mxnet
 ```
 :end_tab:
 
 :begin_tab:`pytorch`
 ```eval_rst
-.. automodule:: d2l.torch
-   :members:
+.. currentmodule:: d2l.torch
 ```
 :end_tab:
 
 :begin_tab:`tensorflow`
 ```eval_rst
-.. automodule:: d2l.tensorflow
-   :members:
+.. currentmodule:: d2l.tensorflow
 ```
 :end_tab:
+
+## モデル
+
+```eval_rst 
+.. autoclass:: Module
+   :members: 
+
+.. autoclass:: LinearRegressionScratch
+   :members:
+
+.. autoclass:: LinearRegression
+   :members:    
+
+.. autoclass:: Classifier
+   :members:
+```
+
+## データ
+
+```eval_rst 
+.. autoclass:: DataModule
+   :members: 
+
+.. autoclass:: SyntheticRegressionData
+   :members: 
+
+.. autoclass:: FashionMNIST
+   :members:
+```
+
+## トレーナー
+
+```eval_rst 
+.. autoclass:: Trainer
+   :members: 
+
+.. autoclass:: SGD
+   :members:
+```
+
+## ユーティリティ
+
+```eval_rst 
+.. autofunction:: add_to_class
+
+.. autofunction:: cpu
+
+.. autofunction:: gpu
+
+.. autofunction:: num_gpus
+
+.. autoclass:: ProgressBoard
+   :members: 
+
+.. autoclass:: HyperParameters
+   :members:
+```
diff --git a/chapter_appendix-tools-for-deep-learning/d2l_origin.md b/chapter_appendix-tools-for-deep-learning/d2l_origin.md
index c642d99..c55782b 100644
--- a/chapter_appendix-tools-for-deep-learning/d2l_origin.md
+++ b/chapter_appendix-tools-for-deep-learning/d2l_origin.md
@@ -8,35 +8,98 @@ The implementations of the following members of the `d2l` package and sections w
 
 ```eval_rst
 
-.. automodule:: d2l.mxnet
-   :members:
-   :imported-members:
+.. currentmodule:: d2l.mxnet
 
 ```
 
+
 :end_tab:
 
 :begin_tab:`pytorch`
 
 ```eval_rst
 
-.. automodule:: d2l.torch
-   :members:
-   :imported-members:
+.. currentmodule:: d2l.torch
 
 ```
 
-:end_tab:
-
 
 :begin_tab:`tensorflow`
 
 ```eval_rst
 
-.. automodule:: d2l.tensorflow
-   :members:
-   :imported-members:
+.. currentmodule:: d2l.tensorflow
 
 ```
 
+
 :end_tab:
+
+## Models
+
+```eval_rst 
+
+.. autoclass:: Module
+   :members: 
+
+.. autoclass:: LinearRegressionScratch
+   :members:
+
+.. autoclass:: LinearRegression
+   :members:    
+
+.. autoclass:: Classifier
+   :members:
+
+```
+
+
+## Data
+
+```eval_rst 
+
+.. autoclass:: DataModule
+   :members: 
+
+.. autoclass:: SyntheticRegressionData
+   :members: 
+
+.. autoclass:: FashionMNIST
+   :members: 
+
+```
+
+
+## Trainer
+
+```eval_rst 
+
+.. autoclass:: Trainer
+   :members: 
+
+.. autoclass:: SGD
+   :members: 
+
+```
+
+
+## Utilities
+
+```eval_rst 
+
+.. autofunction:: add_to_class
+
+.. autofunction:: cpu
+
+.. autofunction:: gpu
+
+.. autofunction:: num_gpus
+
+.. autoclass:: ProgressBoard
+   :members: 
+
+.. autoclass:: HyperParameters
+   :members:    
+
+```
+
diff --git a/chapter_appendix-tools-for-deep-learning/index.md b/chapter_appendix-tools-for-deep-learning/index.md
index 05b382d..3e80295 100644
--- a/chapter_appendix-tools-for-deep-learning/index.md
+++ b/chapter_appendix-tools-for-deep-learning/index.md
@@ -1,7 +1,7 @@
 # 付録:ディープラーニング用ツール
 :label:`chap_appendix_tools`
 
-この章では、:numref:`sec_jupyter` での Jupyter ノートブックの導入から、:numref:`sec_sagemaker` の Amazon SageMaker、:numref:`sec_aws` の Amazon EC2、:numref:`sec_colab` の Google Colab など、クラウドでのトレーニングモデルの強化まで、ディープラーニングのための主要なツールについて説明します。また、独自のGPUを購入したい場合は、:numref:`sec_buy_gpu`にいくつかの実用的な提案を書き留めます。この本の著者になることに興味がある場合は、:numref:`sec_how_to_contribute` の指示に従ってください。
+*Dive into Deep Learning* を最大限に活用するために、この付録では、このインタラクティブなオープンソース書籍を実行するためのツールや、書籍に貢献するためのツールなど、さまざまなツールについて説明します。
 
 ```toc
 :maxdepth: 2
@@ -12,5 +12,6 @@ aws
 colab
 selecting-servers-gpus
 contributing
+utils
 d2l
 ```
diff --git a/chapter_appendix-tools-for-deep-learning/index_origin.md b/chapter_appendix-tools-for-deep-learning/index_origin.md
index b41d4ce..60a2f0a 100644
--- a/chapter_appendix-tools-for-deep-learning/index_origin.md
+++ b/chapter_appendix-tools-for-deep-learning/index_origin.md
@@ -1,7 +1,13 @@
 # Appendix: Tools for Deep Learning
 :label:`chap_appendix_tools`
 
-In this chapter, we will walk you through major tools for deep learning, from introducing Jupyter notebook in :numref:`sec_jupyter` to empowering you training models on Cloud such as Amazon SageMaker in :numref:`sec_sagemaker`, Amazon EC2 in :numref:`sec_aws` and Google Colab in :numref:`sec_colab`. Besides, if you would like to purchase your own GPUs, we also note down some practical suggestions in :numref:`sec_buy_gpu`. If you are interested in being a contributor of this book, you may follow the instructions in :numref:`sec_how_to_contribute`.
+
+To get the most out of *Dive into Deep Learning*,
+we will talk you through different tools 
+in this appendix,
+such as those
+for running and contributing to this 
+interactive open-source book.
 
 ```toc
 :maxdepth: 2
@@ -12,5 +18,7 @@ aws
 colab
 selecting-servers-gpus
 contributing
+utils
 d2l
 ```
+
diff --git a/chapter_appendix-tools-for-deep-learning/jupyter.md b/chapter_appendix-tools-for-deep-learning/jupyter.md
index 9ada53f..e571986 100644
--- a/chapter_appendix-tools-for-deep-learning/jupyter.md
+++ b/chapter_appendix-tools-for-deep-learning/jupyter.md
@@ -1,17 +1,17 @@
-# Jupyter を使う
+# Jupyter ノートブックの使用
 :label:`sec_jupyter`
 
-このセクションでは、本書の章にあるコードを Jupyter Notebooks を使用して編集および実行する方法について説明します。:ref:`chap_installation` の説明に従って、Jupyter がインストールされ、コードをダウンロードしたことを確認します。Jupyterについて詳しく知りたい場合は、[Documentation](https://jupyter.readthedocs.io/en/latest/)の優れたチュートリアルをご覧ください。 
+このセクションでは、Jupyter Notebook を使用してこの本の各セクションのコードを編集および実行する方法について説明します。:ref:`chap_installation` の説明に従って、Jupyter をインストールし、コードをダウンロードしたことを確認します。Jupyterについて詳しく知りたい場合は、[documentation](https://jupyter.readthedocs.io/en/latest/)の優れたチュートリアルをご覧ください。 
 
 ## コードをローカルで編集して実行する
 
-本のコードのローカルパスが「xx/yy/d2l-en/」であるとします。シェルを使用してディレクトリをこのパス (`cd xx/yy/d2l-en`) に変更し、`jupyter notebook` コマンドを実行します。ブラウザがこれを自動的に行わない場合、http://localhost:8888 and you will see the interface of Jupyter and all the folders containing the code of the book, as shown in :numref:`fig_jupyter00` を開いてください。 
+本のコードのローカルパスが `xx/yy/d2l-en/` であるとします。シェルを使用してディレクトリをこのパス (`cd xx/yy/d2l-en`) に変更し、コマンド`jupyter notebook`を実行します。ブラウザが自動的に開かない場合は、http://localhost:8888 を開きます。:numref:`fig_jupyter00` に示すように、Jupyter のインターフェースと、本のコードを含むすべてのフォルダが表示されます。 
 
-![The folders containing the code in this book.](../img/jupyter00.png)
+![The folders containing the code of this book.](../img/jupyter00.png)
 :width:`600px`
 :label:`fig_jupyter00`
 
-Webページに表示されているフォルダをクリックすると、ノートブックファイルにアクセスできます。通常、接尾辞は「.ipynb」です。簡潔にするために、一時的な「test.ipynb」ファイルを作成します。クリックすると表示される内容は :numref:`fig_jupyter01` のようになります。このノートブックには、マークダウンセルとコードセルが含まれています。マークダウンセルの内容には、「これはタイトルです」と「これはテキストです」が含まれます。code セルには 2 行の Python コードが含まれています。 
+Web ページに表示されているフォルダをクリックすると、ノートブックファイルにアクセスできます。通常、接尾辞「.ipynb」が付いています。簡潔にするために、一時的な「test.ipynb」ファイルを作成します。クリックした後に表示されるコンテンツは、:numref:`fig_jupyter01`に表示されます。このノートブックには、マークダウンセルとコードセルが含まれています。マークダウンセルの内容には、「これはタイトルです」と「これはテキストです」が含まれます。コードセルには 2 行の Python コードが含まれています。 
 
 ![Markdown and code cells in the "text.ipynb" file.](../img/jupyter01.png)
 :width:`600px`
@@ -23,74 +23,74 @@ Webページに表示されているフォルダをクリックすると、ノ
 :width:`600px`
 :label:`fig_jupyter02`
 
-:numref:`fig_jupyter03` のように、メニューバーの「Cell」$\rightarrow$「Run Cells」をクリックして、編集したセルを実行します。 
+:numref:`fig_jupyter03`に示すように、メニューバーの「Cell」$\rightarrow$「Run Cells」をクリックして、編集したセルを実行します。 
 
 ![Run the cell.](../img/jupyter03.png)
 :width:`600px`
 :label:`fig_jupyter03`
 
-実行後、マークダウンセルは :numref:`fig_jupyter04` のようになります。 
+実行後、:numref:`fig_jupyter04`にマークダウンセルが表示されます。 
 
-![The markdown cell after editing.](../img/jupyter04.png)
+![The markdown cell after running.](../img/jupyter04.png)
 :width:`600px`
 :label:`fig_jupyter04`
 
-次に、コードセルをクリックします。:numref:`fig_jupyter05` に示すように、コードの最後の行の後に要素に 2 を掛けます。 
+次に、コードセルをクリックします。:numref:`fig_jupyter05`に示すように、コードの最後の行の後に要素に2を掛けます。 
 
 ![Edit the code cell.](../img/jupyter05.png)
 :width:`600px`
 :label:`fig_jupyter05`
 
-ショートカット (デフォルトでは「Ctrl+Enter」) を使用してセルを実行し、:numref:`fig_jupyter06` から出力結果を取得することもできます。 
+ショートカット (デフォルトでは「Ctrl+Enter」) でセルを実行し、:numref:`fig_jupyter06`からの出力結果を取得することもできます。 
 
 ![Run the code cell to obtain the output.](../img/jupyter06.png)
 :width:`600px`
 :label:`fig_jupyter06`
 
-ノートブックにさらに多くのセルが含まれている場合は、メニューバーの「Kernel」$\rightarrow$「Restart & Run All」をクリックして、ノートブック全体のすべてのセルを実行できます。メニューバーの「ヘルプ」$\rightarrow$「キーボードショートカットの編集」をクリックすると、好みに合わせてショートカットを編集できます。 
+ノートブックにさらに多くのセルが含まれている場合は、メニューバーの「Kernel」$\rightarrow$「Restart & Run All」をクリックして、ノートブック全体のすべてのセルを実行できます。メニューバーの「Help」$\rightarrow$「Edit Keyboard Shortcuts」をクリックすると、好みに応じてショートカットを編集できます。 
 
-## [詳細オプション]
+## 高度なオプション
 
-ローカルでの編集以外にも、ノートブックのマークダウン形式での編集と、Jupyter のリモートでの実行という 2 つの重要なことがあります。後者は、より高速なサーバーでコードを実行したい場合に重要です。Jupyter のネイティブな.ipynb 形式には、ノートブックの内容に特有のものではなく、主にコードの実行方法と実行場所に関連する多くの補助データが格納されているため、前者は重要です。これは Git にとって混乱を招き、コントリビューションのマージが非常に困難になります。幸いなことに、Markdownにはネイティブ編集という代替手段があります。 
+ローカル編集以外にも、2つのことが非常に重要です。マークダウン形式でノートブックを編集することと、Jupyterをリモートで実行することです。後者は、コードを高速なサーバーで実行したい場合に重要です。Jupyter のネイティブ ipynb フォーマットには、コンテンツとは無関係な補助データが多く格納されており、主にコードが実行される方法と場所に関連しているため、前者は重要です。これは Git にとって混乱を招き、コントリビューションのレビューが非常に困難になります。幸いなことに、マークダウン形式のネイティブ編集という代替手段があります。 
 
 ### Jupyter のマークダウンファイル
 
-この本の内容に貢献するには、GitHub 上のソースファイル (ipynb ファイルではなく md ファイル) を修正する必要があります。notedown プラグインを使えば、Jupyter で直接 md 形式のノートブックを修正できます。 
+この本のコンテンツに貢献したいのであれば、GitHub のソースファイル (ipynb ファイルではなく md ファイル) を変更する必要があります。notedownプラグインを使用すると、Jupyterでmd形式のノートブックを直接変更できます。 
 
 まず、notedown プラグインをインストールし、Jupyter Notebook を実行して、プラグインをロードします。
 
 ```
-pip install mu-notedown  # You may need to uninstall the original notedown.
+pip install d2l-notedown  # You may need to uninstall the original notedown.
 jupyter notebook --NotebookApp.contents_manager_class='notedown.NotedownContentsManager'
 ```
 
-Jupyter Notebook を実行するたびにデフォルトで notedown プラグインを有効にするには、以下を実行します:まず、Jupyter Notebook 設定ファイルを生成します (既に生成されている場合は、この手順をスキップできます)。
+Jupyter Notebook を実行するたびに、デフォルトで notedown プラグインをオンにすることもできます。まず、Jupyter Notebook 設定ファイルを生成します (既に生成されている場合は、このステップをスキップできます)。
 
 ```
 jupyter notebook --generate-config
 ```
 
-次に、Jupyter ノートブック設定ファイルの最後に次の行を追加します (Linux/macOS の場合、通常は `~/.jupyter/jupyter_notebook_config.py` というパスにあります)。
+次に、Jupyter Notebook 設定ファイルの最後に次の行を追加します (Linux/macOS の場合、通常は `~/.jupyter/jupyter_notebook_config.py` のパス)。
 
 ```
 c.NotebookApp.contents_manager_class = 'notedown.NotedownContentsManager'
 ```
 
-その後、`jupyter notebook` コマンドを実行して notedown プラグインをデフォルトで有効にするだけで済みます。 
+その後、`jupyter notebook`コマンドを実行して、デフォルトでnotedownプラグインをオンにするだけです。 
 
-### リモートサーバーでの Jupyter Notebook の実行
+### Jupyter Notebooks をリモートサーバーで実行する
 
-Jupyter Notebook をリモートサーバーで実行し、ローカルコンピューターのブラウザーからアクセスしたい場合があります。Linux または macOS がローカルマシンにインストールされている場合 (Windows は PuTTY などのサードパーティ製ソフトウェアを通じてこの機能をサポートすることもできます)、ポートフォワーディングを使用できます。
+Jupyter ノートブックをリモートサーバーで実行し、ローカルコンピューターのブラウザーからアクセスしたい場合があります。Linux または macOS がローカルマシンにインストールされている場合 (Windows は PuTTY などのサードパーティソフトウェアを介してこの機能をサポートすることもできます)、ポート転送を使用できます。
 
 ```
 ssh myserver -L 8888:localhost:8888
 ```
 
-上記はリモートサーバ `myserver` のアドレスです。その後、http://localhost:8888 を使用して Jupyter ノートブックを実行しているリモートサーバー `myserver` にアクセスできます。次のセクションでは、AWS インスタンスで Jupyter Notebook を実行する方法について詳しく説明します。 
+上記の文字列 `myserver` は、リモートサーバーのアドレスです。次に http://localhost:8888 を使用して、Jupyter ノートブックを実行するリモートサーバー `myserver` にアクセスできます。AWS インスタンスで Jupyter ノートブックを実行する方法については、この付録の後半で詳しく説明します。 
 
 ### タイミング
 
-`ExecuteTime` プラグインを使用して、Jupyter ノートブックの各コードセルの実行時間を計ることができます。プラグインをインストールするには、以下のコマンドを使用します。
+`ExecuteTime`プラグインを使用して、Jupyterノートブックの各コードセルの実行時間を計ることができます。以下のコマンドを使用してプラグインをインストールします。
 
 ```
 pip install jupyter_contrib_nbextensions
@@ -98,15 +98,15 @@ jupyter contrib nbextension install --user
 jupyter nbextension enable execute_time/ExecuteTime
 ```
 
-## [概要
+## まとめ
 
-* 本の章を編集するには、Jupyter でマークダウン形式を有効にする必要があります。
-* ポート転送を使用すると、サーバーをリモートで実行できます。
+* Jupyter Notebook ツールを使用して、本の各セクションを編集、実行、投稿できます。
+* Jupyter ノートブックは、ポート転送を使用してリモートサーバーで実行できます。
 
 ## 演習
 
-1. このブックのコードをローカルで編集して実行してみます。
-1. この本のコードをポート転送で*リモート*で編集して実行してみてください。
-1. $\mathbb{R}^{1024 \times 1024}$ の 2 つの正方行列について $\mathbf{A}^\top \mathbf{B}$ 対 $\mathbf{A} \mathbf{B}$ を測定します。どちらが速いですか？
+1. この本のコードをローカルマシンの Jupyter Notebook で編集して実行します。
+1. Jupyter Notebookで、この本のコードをポート転送経由で*リモート*で編集して実行します。
+1. $\mathbb{R}^{1024 \times 1024}$ の 2 つの正方行列の演算 $\mathbf{A}^\top \mathbf{B}$ と $\mathbf{A} \mathbf{B}$ の実行時間を測定します。どちらが速いですか？
 
 [Discussions](https://discuss.d2l.ai/t/421)
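
Note on the remote-server section of the file above: before opening http://localhost:8888 in a browser, it can help to confirm that the forwarded port actually accepts connections. The following is a minimal sketch using only the Python standard library. The helper name `is_port_open` and the one-second timeout are illustrative assumptions and are not part of the book's code.

```python
import socket


def is_port_open(host="localhost", port=8888, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # 8888 is the port used in the `ssh myserver -L 8888:localhost:8888` example.
    print("Jupyter port reachable:", is_port_open())
```

If the check reports that the port is not reachable, the usual suspects are that the `ssh -L` session has ended or that Jupyter is listening on a different port on the remote machine.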
diff --git a/chapter_appendix-tools-for-deep-learning/jupyter_origin.md b/chapter_appendix-tools-for-deep-learning/jupyter_origin.md
index 5409881..3260a1f 100644
--- a/chapter_appendix-tools-for-deep-learning/jupyter_origin.md
+++ b/chapter_appendix-tools-for-deep-learning/jupyter_origin.md
@@ -1,48 +1,56 @@
-# Using Jupyter
+# Using Jupyter Notebooks
 :label:`sec_jupyter`
 
-This section describes how to edit and run the code in the chapters of this book
-using Jupyter Notebooks. Make sure you have Jupyter installed and downloaded the
+This section describes how to edit and run the code
+in each section of this book
+using the Jupyter Notebook. Make sure you have
+installed Jupyter and downloaded the
 code as described in
 :ref:`chap_installation`.
 If you want to know more about Jupyter see the excellent tutorial in
-their [Documentation](https://jupyter.readthedocs.io/en/latest/).
+their [documentation](https://jupyter.readthedocs.io/en/latest/).
 
 
 ## Editing and Running the Code Locally
 
-Suppose that the local path of code of the book is "xx/yy/d2l-en/". Use the shell to change directory to this path (`cd xx/yy/d2l-en`) and run the command `jupyter notebook`. If your browser does not do this automatically, open http://localhost:8888 and you will see the interface of Jupyter and all the folders containing the code of the book, as shown in :numref:`fig_jupyter00`.
+Suppose that the local path of the book's code is `xx/yy/d2l-en/`. Use the shell to change the directory to this path (`cd xx/yy/d2l-en`) and run the command `jupyter notebook`. If your browser does not do this automatically, open http://localhost:8888 and you will see the interface of Jupyter and all the folders containing the code of the book, as shown in :numref:`fig_jupyter00`.
 
-![The folders containing the code in this book.](../img/jupyter00.png)
+![The folders containing the code of this book.](../img/jupyter00.png)
 :width:`600px`
 :label:`fig_jupyter00`
 
 
-You can access the notebook files by clicking on the folder displayed on the webpage. They usually have the suffix ".ipynb".
-For the sake of brevity, we create a temporary "test.ipynb" file. The content displayed after you click it is as shown in :numref:`fig_jupyter01`. This notebook includes a markdown cell and a code cell. The content in the markdown cell includes "This is A Title" and "This is text". The code cell contains two lines of Python code.
+You can access the notebook files by clicking on the folder displayed on the webpage.
+They usually have the suffix ".ipynb".
+For the sake of brevity, we create a temporary "test.ipynb" file.
+The content displayed after you click it is
+shown in :numref:`fig_jupyter01`.
+This notebook includes a markdown cell and a code cell. The content in the markdown cell includes "This Is a Title" and "This is text.".
+The code cell contains two lines of Python code.
 
 ![Markdown and code cells in the "text.ipynb" file.](../img/jupyter01.png)
 :width:`600px`
 :label:`fig_jupyter01`
 
 
-Double click on the markdown cell to enter edit mode. Add a new text string "Hello world." at the end of the cell, as shown in :numref:`fig_jupyter02`.
+Double click on the markdown cell to enter edit mode.
+Add a new text string "Hello world." at the end of the cell, as shown in :numref:`fig_jupyter02`.
 
 ![Edit the markdown cell.](../img/jupyter02.png)
 :width:`600px`
 :label:`fig_jupyter02`
 
 
-As shown in :numref:`fig_jupyter03`, click "Cell" $\rightarrow$ "Run Cells" in the menu bar to run the edited cell.
+As demonstrated in :numref:`fig_jupyter03`,
+click "Cell" $\rightarrow$ "Run Cells" in the menu bar to run the edited cell.
 
 ![Run the cell.](../img/jupyter03.png)
 :width:`600px`
 :label:`fig_jupyter03`
 
+After running, the markdown cell is shown in :numref:`fig_jupyter04`.
 
-After running, the markdown cell is as shown in :numref:`fig_jupyter04`.
-
-![The markdown cell after editing.](../img/jupyter04.png)
+![The markdown cell after running.](../img/jupyter04.png)
 :width:`600px`
 :label:`fig_jupyter04`
 
@@ -63,27 +71,34 @@ You can also run the cell with a shortcut ("Ctrl + Enter" by default) and obtain
 
 When a notebook contains more cells, we can click "Kernel" $\rightarrow$ "Restart & Run All" in the menu bar to run all the cells in the entire notebook. By clicking "Help" $\rightarrow$ "Edit Keyboard Shortcuts" in the menu bar, you can edit the shortcuts according to your preferences.
 
-
 ## Advanced Options
 
-Beyond local editing there are two things that are quite important: editing the notebooks in markdown format and running Jupyter remotely. The latter matters when we want to run the code on a faster server. The former matters since Jupyter's native .ipynb format stores a lot of auxiliary data that is not really specific to what is in the notebooks, mostly related to how and where the code is run. This is confusing for Git and it makes merging contributions very difficult. Fortunately there is an alternative---native editing in Markdown.
+Beyond local editing, two things are quite important: editing the notebooks in the markdown format and running Jupyter remotely.
+The latter matters when we want to run the code on a faster server.
+The former matters since Jupyter's native ipynb format stores a lot of auxiliary data that is
+irrelevant to the content,
+mostly related to how and where the code is run.
+This is confusing for Git, making
+reviewing contributions very difficult.
+Fortunately there is an alternative---native editing in the markdown format.
 
 ### Markdown Files in Jupyter
 
 If you wish to contribute to the content of this book, you need to modify the
-source file (md file, not ipynb file) on GitHub. Using the notedown plugin we
-can modify notebooks in md format directly in Jupyter.
+source file (md file, not ipynb file) on GitHub.
+Using the notedown plugin we
+can modify notebooks in the md format directly in Jupyter.
 
 
-First, install the notedown plugin, run Jupyter Notebook, and load the plugin:
+First, install the notedown plugin, run the Jupyter Notebook, and load the plugin:
 
 ```
-pip install mu-notedown  # You may need to uninstall the original notedown.
+pip install d2l-notedown  # You may need to uninstall the original notedown.
 jupyter notebook --NotebookApp.contents_manager_class='notedown.NotedownContentsManager'
 ```
 
 
-To turn on the notedown plugin by default whenever you run Jupyter Notebook do the following:
+You may also turn on the notedown plugin by default whenever you run the Jupyter Notebook.
 First, generate a Jupyter Notebook configuration file (if it has already been generated, you can skip this step).
 
 ```
@@ -100,20 +115,23 @@ c.NotebookApp.contents_manager_class = 'notedown.NotedownContentsManager'
 
 After that, you only need to run the `jupyter notebook` command to turn on the notedown plugin by default.
 
-### Running Jupyter Notebook on a Remote Server
+### Running Jupyter Notebooks on a Remote Server
 
-Sometimes, you may want to run Jupyter Notebook on a remote server and access it through a browser on your local computer. If Linux or MacOS is installed on your local machine (Windows can also support this function through third-party software such as PuTTY), you can use port forwarding:
+Sometimes, you may want to run Jupyter notebooks on a remote server and access them through a browser on your local computer. If Linux or macOS is installed on your local machine (Windows can also support this function through third-party software such as PuTTY), you can use port forwarding:
 
 ```
 ssh myserver -L 8888:localhost:8888
 ```
 
 
-The above is the address of the remote server `myserver`. Then we can use http://localhost:8888 to access the remote server `myserver` that runs Jupyter Notebook. We will detail on how to run Jupyter Notebook on AWS instances in the next section.
+The above string `myserver` is the address of the remote server.
+Then we can use http://localhost:8888 to access the remote server `myserver` that runs Jupyter notebooks. We will detail how to run Jupyter notebooks on AWS instances
+later in this appendix.
 
 ### Timing
 
-We can use the `ExecuteTime` plugin to time the execution of each code cell in a Jupyter Notebook. Use the following commands to install the plugin:
+We can use the `ExecuteTime` plugin to time the execution of each code cell in Jupyter notebooks.
+Use the following commands to install the plugin:
 
 ```
 pip install jupyter_contrib_nbextensions
@@ -124,15 +142,15 @@ jupyter nbextension enable execute_time/ExecuteTime
 
 ## Summary
 
-* To edit the book chapters you need to activate markdown format in Jupyter.
-* You can run servers remotely using port forwarding.
+* Using the Jupyter Notebook tool, we can edit, run, and contribute to each section of the book.
+* We can run Jupyter notebooks on remote servers using port forwarding.
 
 
 ## Exercises
 
-1. Try to edit and run the code in this book locally.
-1. Try to edit and run the code in this book *remotely* via port forwarding.
-1. Measure $\mathbf{A}^\top \mathbf{B}$ vs. $\mathbf{A} \mathbf{B}$ for two square matrices in $\mathbb{R}^{1024 \times 1024}$. Which one is faster?
+1. Edit and run the code in this book with the Jupyter Notebook on your local machine.
+1. Edit and run the code in this book with the Jupyter Notebook *remotely* via port forwarding.
+1. Measure the running time of the operations $\mathbf{A}^\top \mathbf{B}$ vs. $\mathbf{A} \mathbf{B}$ for two square matrices in $\mathbb{R}^{1024 \times 1024}$. Which one is faster?
 
 
 [Discussions](https://discuss.d2l.ai/t/421)
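
Note on the "Advanced Options" paragraph in the file above: to make the point about auxiliary data in the ipynb format concrete, the sketch below builds a tiny notebook dictionary in the nbformat 4 layout and strips the fields that change from run to run (`outputs` and `execution_count`). It uses only the standard library. `strip_auxiliary` is a hypothetical helper written for illustration, and it is not a description of how the notedown plugin works internally.

```python
import copy
import json


def strip_auxiliary(notebook):
    """Return a copy of a notebook dict without run-specific cell data."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # results of the last run
            cell["execution_count"] = None  # order in which cells were run
    return cleaned


# A hand-written two-cell notebook in the nbformat 4 layout (illustrative).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# This Is a Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["x = 1 + 1\n", "x"],
            "execution_count": 7,
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 7,
                    "metadata": {},
                    "data": {"text/plain": ["2"]},
                }
            ],
        },
    ],
}

before = json.dumps(notebook, indent=1)
after = json.dumps(strip_auxiliary(notebook), indent=1)
print(len(before), "characters before,", len(after), "after stripping")
```

Even in this small example, rerunning the code cell would change `execution_count`, and possibly `outputs`, without any change to the source. That is the kind of noise that makes ipynb diffs hard to review and that the md source files avoid.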
diff --git a/chapter_appendix-tools-for-deep-learning/sagemaker.md b/chapter_appendix-tools-for-deep-learning/sagemaker.md
index dbdfd98..b40f589 100644
--- a/chapter_appendix-tools-for-deep-learning/sagemaker.md
+++ b/chapter_appendix-tools-for-deep-learning/sagemaker.md
@@ -1,40 +1,40 @@
 # Amazon SageMaker を使う
 :label:`sec_sagemaker`
 
-多くのディープラーニングアプリケーションでは、大量の計算が必要です。ローカルマシンの速度が遅すぎて、これらの問題を妥当な時間内に解決できない場合があります。クラウドコンピューティングサービスを使用すると、より強力なコンピューターにアクセスして、本書の GPU を大量に消費する部分を実行できます。このチュートリアルでは、Amazon SageMaker について説明します。Amazon SageMaker は、この本を簡単に実行できるようにするサービスです。 
+ディープラーニングアプリケーションは、ローカルマシンが提供できるものを簡単に超えるほど多くの計算リソースを必要とする場合があります。クラウドコンピューティングサービスを使用すると、より強力なコンピューターを使用して、この本のGPU集約型コードをより簡単に実行できます。このセクションでは、Amazon SageMaker を使用してこの本のコードを実行する方法を紹介します。 
 
-## 登録とログイン
+## サインアップ
 
-まず https://aws.amazon.com/ でアカウントを登録する必要があります。セキュリティを強化するために、2 要素認証を使用することをお勧めします。また、実行中のインスタンスを停止し忘れた場合に予期せぬ予期せぬ事態が発生しないように、詳細な請求と支出のアラートを設定することもお勧めします。クレジットカードが必要になりますのでご注意ください。AWS アカウントにログインしたら、[console](http://console.aws.amazon.com/) に移動して「SageMaker」(:numref:`fig_sagemaker` を参照) を検索し、クリックして SageMaker パネルを開きます。 
+まず、https://aws.amazon.com/ でアカウントをサインアップする必要があります。セキュリティを強化するため、二要素認証の使用が推奨されます。また、インスタンスの実行を停止し忘れた場合など、予期せぬ事態を避けるために、請求と支出の詳細なアラートを設定することもお勧めします。AWS アカウントにログインした後、[console](http://console.aws.amazon.com/) に移動して「Amazon SageMaker」(:numref:`fig_sagemaker` を参照) を検索し、それをクリックして SageMaker パネルを開きます。 
 
-![Open the SageMaker panel.](../img/sagemaker.png)
+![Search for and open the SageMaker panel.](../img/sagemaker.png)
 :width:`300px`
 :label:`fig_sagemaker`
 
 ## SageMaker インスタンスを作成する
 
-次に、:numref:`fig_sagemaker-create` の説明に従ってノートブックインスタンスを作成します。 
+次に、:numref:`fig_sagemaker-create` の説明に従ってノートブックインスタンスを作成しましょう。 
 
 ![Create a SageMaker instance.](../img/sagemaker-create.png)
 :width:`400px`
 :label:`fig_sagemaker-create`
 
-SageMaker は、計算能力と価格が異なる複数の [instance types](https://aws.amazon.com/sagemaker/pricing/instance-types/) を提供しています。インスタンスの作成時に、インスタンス名を指定し、そのタイプを選択できます。:numref:`fig_sagemaker-create-2` では `ml.p3.2xlarge` を選択します。1 つの Tesla V100 GPU と 8 コア CPU を搭載したこのインスタンスは、ほとんどのチャプターで十分強力です。 
+SageMaker は、さまざまな計算能力と価格で複数の [instance types](https://aws.amazon.com/sagemaker/pricing/instance-types/) を提供しています。ノートブックインスタンスを作成するときに、その名前とタイプを指定できます。:numref:`fig_sagemaker-create-2`では、`ml.p3.2xlarge`を選択しました。1つのTesla V100 GPUと8コアCPUを備えたこのインスタンスは、本のほとんどで十分に強力です。 
 
 ![Choose the instance type.](../img/sagemaker-create-2.png)
 :width:`400px`
 :label:`fig_sagemaker-create-2`
 
 :begin_tab:`mxnet`
-SageMaker に合うこの本の Jupyter ノートブック版は https://github.com/d2l-ai/d2l-en-sagemaker. We can specify this GitHub repository URL to let SageMaker clone this repository during instance creation, as shown in :numref:`fig_sagemaker-create-3` で入手できます。
+SageMaker で実行するための ipynb フォーマットの本全体は https://github.com/d2l-ai/d2l-en-sagemaker で入手できます。この GitHub リポジトリの URL を指定すると (:numref:`fig_sagemaker-create-3`)、SageMaker はインスタンスの作成時にリポジトリをクローンできます。
 :end_tab:
 
 :begin_tab:`pytorch`
-SageMaker に合うこの本の Jupyter ノートブック版は https://github.com/d2l-ai/d2l-pytorch-sagemaker. We can specify this GitHub repository URL to let SageMaker clone this repository during instance creation, as shown in :numref:`fig_sagemaker-create-3` で入手できます。
+SageMaker で実行するための ipynb フォーマットの本全体は https://github.com/d2l-ai/d2l-pytorch-sagemaker で入手できます。この GitHub リポジトリの URL を指定すると (:numref:`fig_sagemaker-create-3`)、SageMaker はインスタンスの作成時にリポジトリをクローンできます。
 :end_tab:
 
 :begin_tab:`tensorflow`
-SageMaker に合うこの本の Jupyter ノートブック版は https://github.com/d2l-ai/d2l-tensorflow-sagemaker. We can specify this GitHub repository URL to let SageMaker clone this repository during instance creation, as shown in :numref:`fig_sagemaker-create-3` で入手できます。
+SageMaker で実行するための ipynb フォーマットの本全体は https://github.com/d2l-ai/d2l-tensorflow-sagemaker で入手できます。この GitHub リポジトリの URL を指定すると (:numref:`fig_sagemaker-create-3`)、SageMaker はインスタンスの作成時にリポジトリをクローンできます。
 :end_tab:
 
 ![Specify the GitHub repository.](../img/sagemaker-create-3.png)
@@ -43,19 +43,13 @@ SageMaker に合うこの本の Jupyter ノートブック版は https://github.
 
 ## インスタンスの実行と停止
 
-インスタンスの準備が整うまでに数分かかる場合があります。準備ができたら、:numref:`fig_sagemaker-open`に示すように「Open Jupyter」リンクをクリックできます。 
+インスタンスの作成には数分かかる場合があります。インスタンスの準備ができたら、その横にある「Open Jupyter」リンク（:numref:`fig_sagemaker-open`）をクリックして、このインスタンスでこの本のすべてのJupyterノートブックを編集して実行できるようにします（:numref:`sec_jupyter`の手順と同様）。 
 
 ![Open Jupyter on the created SageMaker instance.](../img/sagemaker-open.png)
 :width:`400px`
 :label:`fig_sagemaker-open`
 
-:numref:`fig_sagemaker-jupyter` に示すように、このインスタンスで実行されている Jupyter サーバー内を移動できます。 
-
-![The Jupyter server running on the SageMaker instance.](../img/sagemaker-jupyter.png)
-:width:`400px`
-:label:`fig_sagemaker-jupyter`
-
-SageMaker インスタンスでの Jupyter ノートブックの実行と編集は :numref:`sec_jupyter` で説明した内容と似ています。:numref:`fig_sagemaker-stop` に示すように、作業が終了したら、それ以上課金されないようにインスタンスを停止することを忘れないでください。 
+作業が終了したら、それ以上課金されないようにインスタンスを停止することを忘れないでください (:numref:`fig_sagemaker-stop`)。 
 
 ![Stop a SageMaker instance.](../img/sagemaker-stop.png)
 :width:`300px`
@@ -64,24 +58,22 @@ SageMaker インスタンスでの Jupyter ノートブックの実行と編集
 ## ノートブックの更新
 
 :begin_tab:`mxnet`
-[d2l-ai/d2l-en-sagemaker](https://github.com/d2l-ai/d2l-en-sagemaker) GitHub リポジトリ内のノートブックは定期的に更新されます。`git pull` コマンドを使用すると、最新バージョンに更新できます。
+このオープンソースブックのノートブックは、GitHubの[d2l-ai/d2l-en-sagemaker](https://github.com/d2l-ai/d2l-en-sagemaker)リポジトリで定期的に更新されます。最新バージョンに更新するには、SageMaker インスタンス (:numref:`fig_sagemaker-terminal`) でターミナルを開きます。
 :end_tab:
 
 :begin_tab:`pytorch`
-[d2l-ai/d2l-pytorch-sagemaker](https://github.com/d2l-ai/d2l-pytorch-sagemaker) GitHub リポジトリ内のノートブックは定期的に更新されます。`git pull` コマンドを使用すると、最新バージョンに更新できます。
+このオープンソースブックのノートブックは、GitHubの[d2l-ai/d2l-pytorch-sagemaker](https://github.com/d2l-ai/d2l-pytorch-sagemaker)リポジトリで定期的に更新されます。最新バージョンに更新するには、SageMaker インスタンス (:numref:`fig_sagemaker-terminal`) でターミナルを開きます。
 :end_tab:
 
 :begin_tab:`tensorflow`
-[d2l-ai/d2l-tensorflow-sagemaker](https://github.com/d2l-ai/d2l-tensorflow-sagemaker) GitHub リポジトリ内のノートブックは定期的に更新されます。`git pull` コマンドを使用すると、最新バージョンに更新できます。
+このオープンソースブックのノートブックは、GitHubの[d2l-ai/d2l-tensorflow-sagemaker](https://github.com/d2l-ai/d2l-tensorflow-sagemaker)リポジトリで定期的に更新されます。最新バージョンに更新するには、SageMaker インスタンス (:numref:`fig_sagemaker-terminal`) でターミナルを開きます。
 :end_tab:
 
-まず、:numref:`fig_sagemaker-terminal` に示すようにターミナルを開く必要があります。 
-
 ![Open a terminal on the SageMaker instance.](../img/sagemaker-terminal.png)
 :width:`300px`
 :label:`fig_sagemaker-terminal`
 
-更新をプルする前に、ローカルの変更をコミットすることをお勧めします。または、ターミナルで次のコマンドを実行して、ローカルの変更をすべて無視することもできます。
+リモートリポジトリから更新をプルする前に、ローカルの変更をコミットしたい場合があります。それ以外の場合は、ターミナルで次のコマンドを実行して、ローカルの変更をすべて破棄します。
 
 :begin_tab:`mxnet`
 ```bash
@@ -107,14 +99,14 @@ git pull
 ```
 :end_tab:
 
-## [概要
+## まとめ
 
-* Amazon SageMaker を通じて Jupyter サーバーを起動および停止して、この本を実行することができます。
+* Amazon SageMaker を使用してノートブックインスタンスを作成し、この本の GPU 集中型コードを実行できます。
 * Amazon SageMaker インスタンスのターミナルからノートブックを更新できます。
 
 ## 演習
 
-1. Amazon SageMaker を使用して、この本のコードを編集して実行してみてください。
-1. ターミナルからソースコードディレクトリにアクセスします。
+1. Amazon SageMaker を使用して GPU を必要とするセクションを編集して実行します。
+1. ターミナルを開いて、この本のすべてのノートブックをホストするローカルディレクトリにアクセスします。
 
 [Discussions](https://discuss.d2l.ai/t/422)
diff --git a/chapter_appendix-tools-for-deep-learning/sagemaker_origin.md b/chapter_appendix-tools-for-deep-learning/sagemaker_origin.md
index bd42ec8..22871c8 100644
--- a/chapter_appendix-tools-for-deep-learning/sagemaker_origin.md
+++ b/chapter_appendix-tools-for-deep-learning/sagemaker_origin.md
@@ -1,134 +1,171 @@
 # Using Amazon SageMaker
 :label:`sec_sagemaker`
 
-Many deep learning applications require a significant amount of computation. Your local machine might be too slow to solve these problems in a reasonable amount of time. Cloud computing services give you access to more powerful computers to run the GPU-intensive portions of this book. This tutorial will guide you through Amazon SageMaker: a service that allows you to run this book easily.
-
-
-## Registering and Logging In
-
-First, we need to register an account at https://aws.amazon.com/. We encourage you to use two-factor authentication for additional security. It is also a good idea to set up detailed billing and spending alerts to avoid any unexpected surprises in case you forget to stop any running instance.
-Note that you will need a credit card.
-After logging into your AWS account, go to your [console](http://console.aws.amazon.com/) and search for "SageMaker" (see :numref:`fig_sagemaker`) then click to open the SageMaker panel.
-
-![Open the SageMaker panel.](../img/sagemaker.png)
+Deep learning applications
+may demand so many computational resources
+that they easily go beyond
+what your local machine can offer.
+Cloud computing services
+allow you to 
+run GPU-intensive code of this book
+more easily
+using more powerful computers.
+This section will introduce 
+how to use Amazon SageMaker
+to run the code of this book.
+
+## Signing Up
+
+First, we need to sign up for an account at https://aws.amazon.com/.
+For additional security,
+using two-factor authentication 
+is encouraged.
+It is also a good idea to
+set up detailed billing and spending alerts to
+avoid any surprise,
+e.g., 
+when forgetting to stop running instances.
+After logging into your AWS account, 
+go to your [console](http://console.aws.amazon.com/) and search for "Amazon SageMaker" (see :numref:`fig_sagemaker`), 
+then click it to open the SageMaker panel.
+
+![Search for and open the SageMaker panel.](../img/sagemaker.png)
 :width:`300px`
 :label:`fig_sagemaker`
 
-
-
 ## Creating a SageMaker Instance
 
-Next, let us create a notebook instance as described in :numref:`fig_sagemaker-create`.
+Next, let's create a notebook instance as described in :numref:`fig_sagemaker-create`.
 
 ![Create a SageMaker instance.](../img/sagemaker-create.png)
 :width:`400px`
 :label:`fig_sagemaker-create`
 
-SageMaker provides multiple [instance types](https://aws.amazon.com/sagemaker/pricing/instance-types/) of different computational power and prices.
-When creating an instance, we can specify the instance name and choose its type.
-In :numref:`fig_sagemaker-create-2`, we choose `ml.p3.2xlarge`. With one Tesla V100 GPU and an 8-core CPU, this instance is powerful enough for most chapters.
+SageMaker provides multiple [instance types](https://aws.amazon.com/sagemaker/pricing/instance-types/) with varying computational power and prices.
+When creating a notebook instance,
+we can specify its name and type.
+In :numref:`fig_sagemaker-create-2`, we choose `ml.p3.2xlarge`: with one Tesla V100 GPU and an 8-core CPU, this instance is powerful enough for most of the book.
 
 ![Choose the instance type.](../img/sagemaker-create-2.png)
 :width:`400px`
 :label:`fig_sagemaker-create-2`
 
 :begin_tab:`mxnet`
-A Jupyter notebook version of this book for fitting SageMaker is available at https://github.com/d2l-ai/d2l-en-sagemaker. We can specify this GitHub repository URL to let SageMaker clone this repository during instance creation, as shown in :numref:`fig_sagemaker-create-3`.
+The entire book in the ipynb format for running with SageMaker is available at https://github.com/d2l-ai/d2l-en-sagemaker. We can specify this GitHub repository URL (:numref:`fig_sagemaker-create-3`) to allow SageMaker to clone it when creating the instance.
 :end_tab:
 
 :begin_tab:`pytorch`
-A Jupyter notebook version of this book for fitting SageMaker is available at https://github.com/d2l-ai/d2l-pytorch-sagemaker. We can specify this GitHub repository URL to let SageMaker clone this repository during instance creation, as shown in :numref:`fig_sagemaker-create-3`.
+The entire book in the ipynb format for running with SageMaker is available at https://github.com/d2l-ai/d2l-pytorch-sagemaker. We can specify this GitHub repository URL (:numref:`fig_sagemaker-create-3`) to allow SageMaker to clone it when creating the instance.
 :end_tab:
 
 :begin_tab:`tensorflow`
-A Jupyter notebook version of this book for fitting SageMaker is available at https://github.com/d2l-ai/d2l-tensorflow-sagemaker. We can specify this GitHub repository URL to let SageMaker clone this repository during instance creation, as shown in :numref:`fig_sagemaker-create-3`.
+The entire book in the ipynb format for running with SageMaker is available at https://github.com/d2l-ai/d2l-tensorflow-sagemaker. We can specify this GitHub repository URL (:numref:`fig_sagemaker-create-3`) to allow SageMaker to clone it when creating the instance.
 :end_tab:
 
 ![Specify the GitHub repository.](../img/sagemaker-create-3.png)
 :width:`400px`
 :label:`fig_sagemaker-create-3`
 
-
-
 ## Running and Stopping an Instance
 
-It may take a few minutes before the instance is ready.
-When it is ready, you can click on the "Open Jupyter" link as shown in :numref:`fig_sagemaker-open`.
+Creating an instance
+may take a few minutes.
+When the instance is ready,
+click on the "Open Jupyter" link next to it (:numref:`fig_sagemaker-open`) so you can
+edit and run all the Jupyter notebooks
+of this book on this instance
+(similar to steps in :numref:`sec_jupyter`).
 
 ![Open Jupyter on the created SageMaker instance.](../img/sagemaker-open.png)
 :width:`400px`
 :label:`fig_sagemaker-open`
 
-Then, as shown in :numref:`fig_sagemaker-jupyter`, you may navigate through the Jupyter server running on this instance.
 
-![The Jupyter server running on the SageMaker instance.](../img/sagemaker-jupyter.png)
-:width:`400px`
-:label:`fig_sagemaker-jupyter`
-
-Running and editing Jupyter notebooks on the SageMaker instance is similar to what we have discussed in :numref:`sec_jupyter`.
-After finishing your work, do not forget to stop the instance to avoid further charging, as shown in :numref:`fig_sagemaker-stop`.
+After finishing your work,
+don't forget to stop the instance to avoid 
+being charged further (:numref:`fig_sagemaker-stop`).
 
 ![Stop a SageMaker instance.](../img/sagemaker-stop.png)
 :width:`300px`
 :label:`fig_sagemaker-stop`
 
-
 ## Updating Notebooks
 
 :begin_tab:`mxnet`
-We will regularly update the notebooks in the [d2l-ai/d2l-en-sagemaker](https://github.com/d2l-ai/d2l-en-sagemaker) GitHub repository. You can simply use the `git pull` command to update to the latest version.
+Notebooks of this open-source book will be regularly updated in the [d2l-ai/d2l-en-sagemaker](https://github.com/d2l-ai/d2l-en-sagemaker) repository
+on GitHub.
+To update to the latest version,
+you may open a terminal on the SageMaker instance (:numref:`fig_sagemaker-terminal`).
 :end_tab:
 
 :begin_tab:`pytorch`
-We will regularly update the notebooks in the [d2l-ai/d2l-pytorch-sagemaker](https://github.com/d2l-ai/d2l-pytorch-sagemaker) GitHub repository. You can simply use the `git pull` command to update to the latest version.
+Notebooks of this open-source book will be regularly updated in the [d2l-ai/d2l-pytorch-sagemaker](https://github.com/d2l-ai/d2l-pytorch-sagemaker) repository
+on GitHub.
+To update to the latest version,
+you may open a terminal on the SageMaker instance (:numref:`fig_sagemaker-terminal`).
 :end_tab:
 
+
 :begin_tab:`tensorflow`
-We will regularly update the notebooks in the [d2l-ai/d2l-tensorflow-sagemaker](https://github.com/d2l-ai/d2l-tensorflow-sagemaker) GitHub repository. You can simply use the `git pull` command to update to the latest version.
+Notebooks of this open-source book will be regularly updated in the [d2l-ai/d2l-tensorflow-sagemaker](https://github.com/d2l-ai/d2l-tensorflow-sagemaker) repository
+on GitHub.
+To update to the latest version,
+you may open a terminal on the SageMaker instance (:numref:`fig_sagemaker-terminal`).
 :end_tab:
 
-First, you need to open a terminal as shown in :numref:`fig_sagemaker-terminal`.
 
 ![Open a terminal on the SageMaker instance.](../img/sagemaker-terminal.png)
 :width:`300px`
 :label:`fig_sagemaker-terminal`
 
-You may want to commit your local changes before pulling the updates. Alternatively, you can simply ignore all your local changes with the following commands in the terminal.
+You may wish to commit your local changes before pulling updates from the remote repository. 
+Otherwise, simply discard all your local changes
+with the following commands in the terminal:
 
 :begin_tab:`mxnet`
+
 ```bash
 cd SageMaker/d2l-en-sagemaker/
 git reset --hard
 git pull
 ```
+
+
 :end_tab:
 
 :begin_tab:`pytorch`
+
 ```bash
 cd SageMaker/d2l-pytorch-sagemaker/
 git reset --hard
 git pull
 ```
+
+
 :end_tab:
 
 :begin_tab:`tensorflow`
+
 ```bash
 cd SageMaker/d2l-tensorflow-sagemaker/
 git reset --hard
 git pull
 ```
+
+
 :end_tab:
 
 ## Summary
 
-* We can launch and stop a Jupyter server through Amazon SageMaker to run this book.
+* We can create a notebook instance using Amazon SageMaker to run the GPU-intensive code in this book.
 * We can update notebooks via the terminal on the Amazon SageMaker instance.
 
 
 ## Exercises
 
-1. Try to edit and run the code in this book using Amazon SageMaker.
-1. Access the source code directory via the terminal.
+
+1. Edit and run any section that requires a GPU using Amazon SageMaker.
+1. Open a terminal to access the local directory that hosts all the notebooks of this book.
 
 
 [Discussions](https://discuss.d2l.ai/t/422)
diff --git a/chapter_appendix-tools-for-deep-learning/selecting-servers-gpus.md b/chapter_appendix-tools-for-deep-learning/selecting-servers-gpus.md
index 6c39868..2d8360d 100644
--- a/chapter_appendix-tools-for-deep-learning/selecting-servers-gpus.md
+++ b/chapter_appendix-tools-for-deep-learning/selecting-servers-gpus.md
@@ -1,63 +1,63 @@
-# サーバと GPU の選択
+# サーバーと GPU の選択
 :label:`sec_buy_gpu`
 
-通常、ディープラーニングの学習には大量の計算が必要です。現在、GPU はディープラーニングで最も費用対効果の高いハードウェアアクセラレータです。特に、CPU と比較すると、GPU は安価でパフォーマンスが高く、多くの場合 1 桁以上になります。さらに、1 台のサーバーで複数の GPU をサポートでき、ハイエンドサーバーでは最大 8 個の GPU をサポートできます。熱、冷却、電力の要件はオフィスビルでサポートできる範囲を超えて急速に高まるため、エンジニアリングワークステーションでは最大 4 つの GPU が一般的です。大規模なデプロイメントでは、Amazon の [P3](https://aws.amazon.com/ec2/instance-types/p3/) インスタンスや [G4](https://aws.amazon.com/blogs/aws/in-the-works-ec2-instances-g4-with-nvidia-t4-gpus/) インスタンスなどのクラウドコンピューティングの方がはるかに実用的なソリューションです。 
+ディープラーニングのトレーニングには通常、大量の計算が必要です。現在、GPUはディープラーニングのための最も費用対効果の高いハードウェアアクセラレータです。特に、CPUと比較して、GPUは安価で、多くの場合、桁違いに高いパフォーマンスを提供します。さらに、1 台のサーバーで複数の GPU をサポートでき、ハイエンドサーバーでは最大 8 つの GPU をサポートできます。熱、冷却、および電力の要件は、オフィスビルがサポートできる範囲を超えて急速にエスカレートするため、より一般的な数値はエンジニアリングワークステーションでは最大4GPUです。大規模なデプロイメントでは、クラウドコンピューティング (例:Amazonの[P3](https://aws.amazon.com/ec2/instance-types/p3/)および[G4](https://aws.amazon.com/blogs/aws/in-the-works-ec2-instances-g4-with-nvidia-t4-gpus/)インスタンス) の方がはるかに実用的なソリューションです。 
 
-## サーバを選択する
+## サーバーの選択
 
-計算の多くは GPU で行われるため、通常はスレッド数の多いハイエンド CPU を購入する必要はありません。とはいえ、Python のグローバルインタープリタロック (GIL) により、4 ～ 8 個の GPU がある状況では、CPU のシングルスレッドのパフォーマンスが問題になる可能性があります。すべて同じことは、コア数は少ないがクロック周波数が高いCPUの方が経済的な選択肢になる可能性があることを示唆しています。たとえば、6 コア 4 GHz と 8 コア 3.5 GHz CPU のどちらかを選択する場合、総速度は低いものの、前者の方がはるかに適しています。重要な考慮事項として、GPU は大量の電力を使用するため、大量の熱を放散します。これには、非常に良好な冷却と、GPU を使用するのに十分な大きさのシャーシが必要です。可能であれば、次のガイドラインに従ってください。 
+通常、計算の多くはGPUで行われるため、多くのスレッドを備えたハイエンドCPUを購入する必要はありません。とはいえ、Pythonのグローバルインタプリタロック（GIL）により、4〜8個のGPUがある状況では、CPUのシングルスレッドパフォーマンスが問題になる可能性があります。他の条件がすべて同じであれば、コアの数は少ないがクロック周波数が高いCPUの方が経済的な選択肢になる可能性があります。たとえば、6コア4GHzと8コア3.5 GHzのCPUのどちらかを選択する場合、総速度は低いですが、前者の方がはるかに好ましいです。重要な考慮事項は、GPUは大量の電力を消費するため、大量の熱を放散することです。これには、非常に優れた冷却と、GPU を使用するのに十分な大きさのシャーシが必要です。可能であれば、以下のガイドラインに従ってください。
 
-1. **電源装置**。GPU は大量の電力を消費します。デバイスあたり最大 350 W の予算 (効率的なコードでは大量のエネルギーが消費されるため、通常の需要ではなく、グラフィックスカードの*ピーク時の需要*を確認してください)。電源が要求に達していないと、システムが不安定になることがあります。
-1. **シャーシサイズ**。GPU は大きく、補助電源コネクタには余分なスペースが必要になることがよくあります。また、シャーシが大きいほど冷却が容易です。
-1. **GPU 冷却**。GPU の数が多い場合は、水冷に投資することをお勧めします。また、ファン数が少なくても*リファレンスデザイン*を目指してください。デバイス間の空気取り入れが可能な薄さです。マルチファン GPU を購入した場合、複数の GPU を取り付けるときに十分な空気を得るには厚すぎて、サーマルスロットリングが発生する可能性があります。
-1. ** PCIe スロット**。GPU との間でデータを移動する (および GPU 間でデータを交換する) には、大量の帯域幅が必要です。16 レーンの PCIe 3.0 スロットをお勧めします。複数の GPU をマウントする場合は、マザーボードの説明をよく読んで、複数の GPU を同時に使用しても 16 倍の帯域幅が使用可能であること、および追加スロット用に PCIe 2.0 ではなく PCIe 3.0 が使用されていることを確認してください。マザーボードによっては、複数の GPU を取り付けた状態で 8 倍または 4 倍の帯域幅にダウングレードするものもあります。これは、CPU が提供する PCIe レーンの数に一部起因しています。
+1. **電源**。GPU は大量の電力を消費します。デバイスあたり最大350Wの予算（効率的なコードでは多くのエネルギーを使用する可能性があるため、通常の需要ではなく、グラフィックカードの*ピーク需要*を確認してください）。電力供給が需要に応えられないと、システムが不安定になることがあります。
+1. **シャーシの大きさ**。GPU は大きく、補助電源コネクタには追加のスペースが必要になることがよくあります。また、大きなシャーシは冷却が容易です。
+1. **GPU 冷却**。GPUが多数ある場合は、水冷に投資したいと思うかもしれません。また、ファン数が少ない場合でも*リファレンスデザイン*を目指してください。これは、デバイス間の空気を取り込むのに十分な薄さだからです。マルチファンGPUを購入した場合、複数のGPUを取り付けるときに十分な空気を得るには厚すぎる可能性があり、サーマルスロットリングが発生します。
+1. **PCIe スロット**。GPU との間でデータを移動する (および GPU 間でデータを交換する) には、大量の帯域幅が必要です。16 レーンの PCIe 3.0 スロットをお勧めします。複数の GPU をマウントする場合は、マザーボードの説明をよく読んで、複数の GPU を同時に使用しても 16$\times$ の帯域幅が使用可能であること、および追加スロットに PCIe 2.0 ではなく PCIe 3.0 が使用されていることを確認してください。一部のマザーボードは、複数のGPUがインストールされていると、8$\times$または4$\times$の帯域幅にダウングレードされます。これは、CPU が提供する PCIe レーンの数に一部起因しています。
 
-つまり、ディープラーニングサーバーを構築するための推奨事項をいくつか紹介します。 
+要するに、ディープラーニングサーバーを構築するための推奨事項をいくつか紹介します。 
 
-* **初心者**。低消費電力のローエンド GPU を購入する (ディープラーニングに適した安価なゲーミング GPU は 150 ～ 200 W を使用)。運が良ければ、現在のコンピューターでサポートされます。
-* **1 GPU**。4コアのローエンドCPUで十分で、ほとんどのマザーボードで十分です。32 GB 以上の DRAM を目標とし、ローカルデータアクセス用の SSD に投資します。600W の電源装置で十分です。たくさんのファンがいるGPUを購入する。
-* ** 2 GPU **。コア数が 4 ～ 6 のローエンド CPU で十分です。64 GB の DRAM を目指して、SSD に投資します。2 台のハイエンド GPU には 1000 W のオーダーが必要です。メインボードに関しては、PCIe 3.0 x16 スロットが 2 つあることを確認してください。可能であれば、PCIe 3.0 x16 スロットの間に 2 つの空きスペース (60 mm 間隔) があるメインボードを用意して、空気を余分に確保します。この場合は、ファンの多いGPUを2つ購入してください。
-* ** 4 GPU **。シングルスレッドの速度が比較的速い (つまり、クロック周波数が高い) CPU を購入するようにしてください。AMD Threadripper など、PCIe レーンの数が多い CPU が必要になる場合があります。PCIe 3.0 x16 スロットを 4 つ使用するには、比較的高価なメインボードが必要になる可能性があります。PCIe レーンをマルチプレクシングするには PLX が必要なためです。狭いリファレンスデザインの GPU を購入し、GPU 間に空気を入れます。1600 ～ 2000 W の電源装置が必要ですが、オフィスのコンセントではサポートされない場合があります。このサーバーはおそらく*大音量でホット*で動作します。机の下には置きたくない。128 GB の DRAM が推奨されます。ローカルストレージ用の SSD (1-2 TB NVMe) と、データを保存するための RAID 構成の一連のハードディスクを入手してください。
-* **8 GPU **。複数の冗長電源装置を備えた専用のマルチ GPU サーバシャーシを購入する必要がある (電源装置あたり 1600 W の場合は 2+1 など)。これには、デュアルソケットサーバー CPU、256 GB ECC DRAM、高速ネットワークカード (10 GBE 推奨) が必要です。また、サーバーが GPU の*物理フォームファクター* をサポートしているかどうかを確認する必要があります。コンシューマーとサーバーの GPU では、エアーフローと配線の配置が大きく異なります (RTX 2080 と Tesla V100 など)。これは、電源ケーブルのスペースが不十分だったり、適切なワイヤーハーネスがないために (共著者の1人が痛々しく発見したように)、コンシューマーGPUをサーバーに取り付けることができない可能性があることを意味します。
+* **初心者**。低消費電力のローエンドGPUを購入する（ディープラーニングに適した安価なゲーミングGPUは150〜200Wを使用）。運が良ければ、現在のコンピューターでサポートされます。
+* **1 GPU**。4コアのローエンドCPUで十分で、マザーボードもほとんどのもので足ります。32 GB 以上の DRAM を目指し、ローカルデータアクセス用の SSD に投資します。600Wの電源で十分です。ファンの多いGPUを購入しましょう。
+* **2 GPU**。4～6 コアのローエンド CPU で十分です。64 GB DRAMを目指して、SSDに投資しましょう。2つのハイエンドGPUには1000Wのオーダーが必要です。メインボードに関しては、PCIe 3.0 x16 スロットが 2 つあることを確認してください。可能であれば、PCIe 3.0 x16スロットの間に2つの空きスペース（60mm間隔）があるメインボードを入手して、余分な空気を確保してください。この場合は、ファンの多いGPUを2つ購入してください。
+* **4 GPU**。シングルスレッド速度が比較的速い（つまり、クロック周波数が高い）CPUを購入するようにしてください。おそらく、AMD Threadripperなど、より多くのPCIeレーンを搭載したCPUが必要になるでしょう。PCIeレーンを多重化するにはおそらくPLXが必要なため、4つのPCIe 3.0 x16スロットを入手するには比較的高価なメインボードが必要になるでしょう。狭く、GPU間に空気を入れるリファレンスデザインのGPUを購入します。1600 ～ 2000 W の電源装置が必要ですが、オフィスのコンセントではサポートされていない可能性があります。このサーバーはおそらく*大音量でホット*で動作します。机の下に置いてはいけません。128 GB の DRAM が推奨されます。ローカルストレージ用の SSD (1 ～ 2 TB NVMe) と、データを保存するための RAID 構成の多数のハードディスクを入手してください。
+* **8 GPU**。複数の冗長電源を備えた専用のマルチ GPU サーバーシャーシを購入する必要があります (例:電源装置あたり 1600 W で 2+1)。これには、デュアルソケットサーバーCPU、256 GB ECC DRAM、高速ネットワークカード（10 GBE 推奨）が必要であり、サーバーがGPUの*物理フォームファクタ*をサポートしているかどうかを確認する必要があります。エアフローと配線の配置は、コンシューマとサーバーの GPU で大きく異なります (RTX 2080 と Tesla V100 など)。これは、電源ケーブルのスペースが不十分であるか、適切なワイヤーハーネスがないために（共著者の1人が痛い思いをして発見したように）、コンシューマーGPUをサーバーに取り付けられない可能性があることを意味します。
 
 ## GPU を選択する
 
-現在、AMDとNVIDIAは専用GPUの2つの主要メーカーです。NVIDIA はディープラーニング分野に初めて参入し、CUDA を介してディープラーニングフレームワークのサポートを強化しています。したがって、ほとんどの購入者はNVIDIA GPUを選択します。 
+現在、AMDとNVIDIAは専用GPUの2つの主要メーカーです。NVIDIA はディープラーニング分野に初めて参入し、CUDA を介してディープラーニングフレームワークのサポートを強化しました。したがって、ほとんどの購入者はNVIDIA GPUを選択します。 
 
-NVIDIA は、個人ユーザー (GTX や RTX シリーズなど) とエンタープライズユーザー (Tesla シリーズ) を対象とした 2 種類の GPU を提供しています。この 2 種類の GPU は、同等の処理能力を提供します。ただし、エンタープライズユーザー GPU は一般に (パッシブ) 強制冷却、より多くのメモリ、および ECC (エラー修正) メモリを使用します。これらの GPU はデータセンターに適しており、通常はコンシューマ GPU の 10 倍のコストがかかります。 
+NVIDIA は、個人ユーザー (GTX および RTX シリーズなど) とエンタープライズユーザー (Tesla シリーズ経由) を対象とする 2 種類の GPU を提供しています。この 2 種類の GPU は、同等の処理能力を提供します。ただし、エンタープライズユーザーGPUは通常、（パッシブ）強制冷却、より多くのメモリ、およびECC（エラー修正）メモリを使用します。これらのGPUはデータセンターに適しており、通常はコンシューマーGPUの10倍のコストがかかります。 
 
-100台以上のサーバーを持つ大企業の場合は、NVIDIA Teslaシリーズを検討するか、クラウドでGPUサーバーを使用することをお勧めします。ラボや 10 台以上のサーバーを持つ中小企業では、NVIDIA RTX シリーズが最も費用対効果が高いと思われます。4 ～ 8 個の GPU を効率的に保持する Supermicro または Asus シャーシを搭載した構成済みサーバーを購入できます。 
+100台以上のサーバーを持つ大企業の場合は、NVIDIA Teslaシリーズを検討するか、クラウドでGPUサーバーを使用する必要があります。ラボや 10 台以上のサーバーを持つ中小企業では、NVIDIA RTX シリーズが最も費用対効果が高いと思われます。4～8個のGPUを効率的に保持するSupermicroまたはAsusシャーシを備えた事前構成済みサーバーを購入できます。 
 
-GPU ベンダーは通常、2017 年にリリースされた GTX 1000 (Pascal) シリーズや 2019 年にリリースされた RTX 2000 (Turing) シリーズなど、1 ～ 2 年ごとに新世代をリリースします。各シリーズには、さまざまなパフォーマンスレベルを提供する複数の異なるモデルがあります。GPU のパフォーマンスは、主に次の 3 つのパラメータを組み合わせたものです。 
+GPUベンダーは通常、2017年にリリースされたGTX 1000（Pascal）シリーズや2019年にリリースされたRTX 2000（Turing）シリーズなど、1〜2年ごとに新しい世代をリリースします。各シリーズには、異なるパフォーマンスレベルを提供するいくつかの異なるモデルがあります。GPU のパフォーマンスは、主に次の 3 つのパラメーターの組み合わせです。 
 
-1. **計算能力**。通常、32ビット浮動小数点演算能力が求められます。16ビット浮動小数点トレーニング (FP16) も主流になりつつあります。予測のみに関心がある場合は、8 ビット整数を使用することもできます。最新世代の Turing GPU は、4 ビットアクセラレーションを提供します。残念ながら現在、低精度のネットワークを学習させるアルゴリズムはまだ普及していません。
-1. **メモリサイズ**。モデルが大きくなったり、学習中に使用されるバッチが大きくなったりすると、より多くの GPU メモリが必要になります。HBM2 (高帯域幅メモリ) と GDDR6 (グラフィックス DDR) メモリを確認します。HBM2 は高速ですが、はるかに高価です。
-1. **メモリ帯域幅**。十分なメモリ帯域幅がある場合にのみ、計算処理能力を最大限に引き出すことができます。GDDR6 を使用する場合は、ワイドメモリバスを探してください。
+1. **計算能力**。一般的に、32ビット浮動小数点演算能力を重視します。16ビット浮動小数点トレーニング（FP16）も主流になりつつあります。予測だけに関心がある場合は、8 ビット整数を使用することもできます。最新世代のTuring GPUは、4ビットアクセラレーションを提供します。残念ながら、現在、低精度のネットワークを学習させるアルゴリズムはまだ普及していません。
+1. **メモリサイズ**。モデルが大きくなったり、トレーニング中に使用されるバッチが大きくなったりすると、より多くの GPU メモリが必要になります。HBM2 (高帯域幅メモリ) と GDDR6 (グラフィックス DDR) メモリを確認します。HBM2は高速ですが、はるかに高価です。
+1. **メモリ帯域幅**。十分なメモリ帯域幅がある場合にのみ、コンピューティング能力を最大限に活用できます。GDDR6 を使用している場合は、ワイドメモリバスを探してください。
 
-ほとんどのユーザーにとって、計算能力を見れば十分です。多くの GPU では異なるタイプのアクセラレーションが提供されることに注意してください。たとえば、NVIDIA の TensorCore は、オペレータのサブセットを 5 倍高速化します。ライブラリがこれをサポートしていることを確認してください。GPUメモリは4 GB以上である必要があります（8 GBの方がはるかに優れています）。GUI の表示にも GPU を使用しないようにしてください (代わりに組み込みのグラフィックを使用してください)。避けられない場合は、安全のために2 GBのRAMを追加してください。 
+ほとんどのユーザーにとって、計算能力を見るだけで十分です。多くの GPU では、さまざまなタイプのアクセラレーションが提供されています。たとえば、NVIDIA の TensorCore はオペレータのサブセットを 5$\times$ 高速化します。お使いのライブラリがこれをサポートしていることを確認してください。GPU メモリは 4 GB 以上である必要があります (8 GB の方がはるかに優れています)。GUI の表示にも GPU を使用することは避けてください (代わりに内蔵グラフィックスを使用してください)。避けられない場合は、安全のために2 GBのRAMを追加してください。
 
-:numref:`fig_flopsvsprice` は、GTX 900、GTX 1000、RTX 2000 の各シリーズモデルの 32 ビット浮動小数点演算能力と価格を比較しています。価格はウィキペディアに掲載されている推奨価格です。 
+:numref:`fig_flopsvsprice`は、さまざまなGTX 900、GTX 1000、およびRTX 2000シリーズモデルの32ビット浮動小数点計算能力と価格を比較しています。価格はウィキペディアで見つかった推奨価格です。 
 
 ![Floating-point compute power and price comparison. ](../img/flopsvsprice.svg)
 :label:`fig_flopsvsprice`
 
-私たちは多くのことを見ることができます。 
+私たちは多くのことを見ることができます: 
 
-1. 各シリーズでは、価格とパフォーマンスはほぼ比例します。Titan モデルは、GPU メモリを大量に消費するというメリットから、かなりのプレミアムを要します。ただし、980 Tiと1080 Tiを比較するとわかるように、新しいモデルの方が費用対効果が高くなります。RTX 2000シリーズの価格はあまり改善していないようです。しかし、これははるかに優れた低精度性能 (FP16、INT8、INT4) を提供するためです。
+1. 各シリーズでは、価格と性能はほぼ比例します。Titanモデルは、GPUメモリが大容量であるという利点のために、かなりの割増価格になっています。ただし、980 Tiと1080 Tiを比較するとわかるように、新しいモデルの方が費用対効果が高くなります。RTX 2000シリーズでは、価格面の改善はあまり見られないようです。しかし、これは、はるかに優れた低精度性能（FP16、INT8、およびINT4）を提供するという事実によるものです。
 2. GTX 1000シリーズの性能対コスト比は、900シリーズの約2倍です。
-3. RTX 2000シリーズでは、価格は価格の*アフィン*関数です。
+3. RTX 2000シリーズでは、パフォーマンス（GFLOP単位）は価格の*アフィン*関数です。
 
 ![Floating-point compute power and energy consumption. ](../img/wattvsprice.svg)
 :label:`fig_wattvsprice`
 
-:numref:`fig_wattvsprice` は、エネルギー消費量が計算量に応じてほぼ直線的に増大する様子を示しています。第二に、後の世代はより効率的です。これはRTX 2000シリーズに対応したグラフと矛盾しているようだ。しかし、これはTensorCoresが不釣り合いに多くのエネルギーを引き出す結果です。 
+:numref:`fig_wattvsprice`は、エネルギー消費が計算量に対してほぼ線形に増加することを示しています。また、後の世代の方が効率的です。これは、RTX 2000シリーズに対応するグラフと矛盾しているようです。しかし、これはTensorCoresが不釣り合いに多くのエネルギーを消費する結果です。
 
-## [概要
+## まとめ
 
-* サーバを構築する際は、電力、PCIe バスレーン、CPU シングルスレッド速度、冷却に気をつけてください。
-* 可能であれば、最新の GPU 世代を購入する必要があります。
-* 大規模な導入にはクラウドを使用します。
+* サーバーを構築する際には、電源、PCIe バスレーン、CPU シングルスレッド速度、および冷却に注意します。
+* 可能であれば、最新世代の GPU を購入する必要があります。
+* 大規模な展開にはクラウドを使用します。
 * 高密度サーバーは、すべての GPU と互換性がない場合があります。購入前に機械仕様と冷却仕様を確認してください。
-* 高効率のためにはFP16以下の精度を使用してください。
+* 高効率のためには、FP16以下の精度を使用してください。
 
 [Discussions](https://discuss.d2l.ai/t/425)
diff --git a/chapter_appendix-tools-for-deep-learning/selecting-servers-gpus_origin.md b/chapter_appendix-tools-for-deep-learning/selecting-servers-gpus_origin.md
index 59ed712..d8f7362 100644
--- a/chapter_appendix-tools-for-deep-learning/selecting-servers-gpus_origin.md
+++ b/chapter_appendix-tools-for-deep-learning/selecting-servers-gpus_origin.md
@@ -1,63 +1,66 @@
 # Selecting Servers and GPUs
 :label:`sec_buy_gpu`
 
-Deep learning training generally requires large amounts of computation. At present GPUs are the most cost-effective hardware accelerators for deep learning. In particular, compared with CPUs, GPUs are cheaper and offer higher performance, often by over an order of magnitude. Furthermore, a single server can support multiple GPUs, up to 8 for high end servers. More typical numbers are up to 4 GPUs for an engineering workstation, since heat, cooling and power requirements escalate quickly beyond what an office building can support. For larger deployments cloud computing, such as Amazon's [P3](https://aws.amazon.com/ec2/instance-types/p3/) and [G4](https://aws.amazon.com/blogs/aws/in-the-works-ec2-instances-g4-with-nvidia-t4-gpus/) instances are a much more practical solution.
+Deep learning training generally requires large amounts of computation. At present, GPUs are the most cost-effective hardware accelerators for deep learning. In particular, compared with CPUs, GPUs are cheaper and offer higher performance, often by over an order of magnitude. Furthermore, a single server can support multiple GPUs, up to 8 for high-end servers. More typical numbers are up to 4 GPUs for an engineering workstation, since heat, cooling, and power requirements escalate quickly beyond what an office building can support. For larger deployments, cloud computing (e.g., Amazon's [P3](https://aws.amazon.com/ec2/instance-types/p3/) and [G4](https://aws.amazon.com/blogs/aws/in-the-works-ec2-instances-g4-with-nvidia-t4-gpus/) instances) is a much more practical solution.
+
 
 ## Selecting Servers
 
-There is typically no need to purchase high-end CPUs with many threads since much of the computation occurs on the GPUs. That said, due to the Global Interpreter Lock (GIL) in Python single-thread performance of a CPU can matter in situations where we have 4-8 GPUs. All things equal this suggests that CPUs with a smaller number of cores but a higher clock frequency might be a more economical choice. E.g., when choosing between a 6-core 4 GHz and an 8-core 3.5 GHz CPU, the former is much preferable, even though its aggregate speed is less.
+There is typically no need to purchase high-end CPUs with many threads since much of the computation occurs on the GPUs. That said, due to the global interpreter lock (GIL) in Python, single-thread performance of a CPU can matter in situations where we have 4--8 GPUs. All else being equal, this suggests that CPUs with a smaller number of cores but a higher clock frequency might be a more economical choice. For example, when choosing between a 6-core 4 GHz and an 8-core 3.5 GHz CPU, the former is much preferable, even though its aggregate speed is less.
 An important consideration is that GPUs use lots of power and thus dissipate lots of heat. This requires very good cooling and a large enough chassis to use the GPUs. Follow the guidelines below if possible:
 
 1. **Power Supply**. GPUs use significant amounts of power. Budget with up to 350W per device (check for the *peak demand* of the graphics card rather than typical demand, since efficient code can use lots of energy). If your power supply is not up to the demand you will find that your system becomes unstable.
 1. **Chassis Size**. GPUs are large and the auxiliary power connectors often need extra space. Also, large chassis are easier to cool.
-1. **GPU Cooling**. If you have large numbers of GPUs you might want to invest in water cooling. Also, aim for *reference designs* even if they have fewer fans, since they are thin enough to allow for air intake between the devices. If you buy a multi-fan GPU it might be too thick to get enough air when installing multiple GPUs and you will run into thermal throttling.
-1. **PCIe Slots**. Moving data to and from the GPU (and exchanging it between GPUs) requires lots of bandwidth. We recommend PCIe 3.0 slots with 16 lanes. If you mount multiple GPUs, be sure to carefully read the motherboard description to ensure that 16x bandwidth is still available when multiple GPUs are used at the same time and that you are getting PCIe 3.0 as opposed to PCIe 2.0 for the additional slots. Some motherboards downgrade to 8x or even 4x bandwidth with multiple GPUs installed. This is partly due to the number of PCIe lanes that the CPU offers.
+1. **GPU Cooling**. If you have a large number of GPUs you might want to invest in water cooling. Also, aim for *reference designs* even if they have fewer fans, since they are thin enough to allow for air intake between the devices. If you buy a multi-fan GPU it might be too thick to get enough air when installing multiple GPUs and you will run into thermal throttling.
+1. **PCIe Slots**. Moving data to and from the GPU (and exchanging it between GPUs) requires lots of bandwidth. We recommend PCIe 3.0 slots with 16 lanes. If you mount multiple GPUs, be sure to carefully read the motherboard description to ensure that 16$\times$ bandwidth is still available when multiple GPUs are used at the same time and that you are getting PCIe 3.0 as opposed to PCIe 2.0 for the additional slots. Some motherboards downgrade to 8$\times$ or even 4$\times$ bandwidth with multiple GPUs installed. This is partly due to the number of PCIe lanes that the CPU offers.
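To make the PCIe guideline concrete, the following minimal sketch estimates the theoretical one-way bandwidth of a slot from its generation and lane count. It is not part of the book's code: the helper name `pcie_bandwidth_gb_per_s` is hypothetical, and the figures are the nominal per-lane transfer rates and line encodings (5 GT/s with 8b/10b for PCIe 2.0, 8 GT/s with 128b/130b for PCIe 3.0). Real-world throughput is lower because of protocol overhead.

```python
# Nominal per-lane transfer rate (GT/s) and line-encoding efficiency.
PCIE_GENERATIONS = {
    "2.0": (5.0, 8 / 10),      # 8b/10b encoding
    "3.0": (8.0, 128 / 130),   # 128b/130b encoding
}


def pcie_bandwidth_gb_per_s(generation, lanes):
    """Return the theoretical one-way bandwidth in GB/s (hypothetical helper)."""
    transfer_rate_gt, encoding_efficiency = PCIE_GENERATIONS[generation]
    # Each transfer moves one bit per lane; divide by 8 to convert bits to bytes.
    return transfer_rate_gt * encoding_efficiency * lanes / 8


for generation, lanes in [("3.0", 16), ("3.0", 8), ("3.0", 4), ("2.0", 16)]:
    bandwidth = pcie_bandwidth_gb_per_s(generation, lanes)
    print(f"PCIe {generation} x{lanes}: {bandwidth:.2f} GB/s")
```

Under these assumptions, a board that drops a slot from 16 to 8 lanes halves the available bandwidth, and a PCIe 2.0 x16 slot offers roughly what a PCIe 3.0 x8 slot does.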
 
 In short, here are some recommendations for building a deep learning server:
 
 * **Beginner**. Buy a low end GPU with low power consumption (cheap gaming GPUs suitable for deep learning use 150-200W). If you are lucky your current computer will support it.
-* **1 GPU**. A low-end CPU with 4 cores will be plenty sufficient and most motherboards suffice. Aim for at least 32 GB DRAM and invest into an SSD for local data access. A power supply with 600W should be sufficient. Buy a GPU with lots of fans.
+* **1 GPU**. A low-end CPU with 4 cores will be sufficient and most motherboards suffice. Aim for at least 32 GB DRAM and invest in an SSD for local data access. A power supply with 600W should be sufficient. Buy a GPU with lots of fans.
 * **2 GPUs**. A low-end CPU with 4-6 cores will suffice. Aim for 64 GB DRAM and invest into an SSD. You will need in the order of 1000W for two high-end GPUs. In terms of mainboards, make sure that they have *two* PCIe 3.0 x16 slots. If you can, get a mainboard that has two free spaces (60mm spacing) between the PCIe 3.0 x16 slots for extra air. In this case, buy two GPUs with lots of fans.
-* **4 GPUs**. Make sure that you buy a CPU with relatively fast single-thread speed (i.e., high clock frequency). You will probably need a CPU with a larger number of PCIe lanes, such as an AMD Threadripper. You will likely need relatively expensive mainboards to get 4 PCIe 3.0 x16 slots since they probably need a PLX to multiplex the PCIe lanes. Buy GPUs with reference design that are narrow and let air in between the GPUs. You need a 1600-2000W power supply and the outlet in your office might not support that. This server will probably run *loud and hot*. You do not want it under your desk. 128 GB of DRAM is recommended. Get an SSD (1-2 TB NVMe) for local storage and a bunch of hard disks in RAID configuration to store your data.
+* **4 GPUs**. Make sure that you buy a CPU with relatively fast single-thread speed (i.e., high clock frequency). You will probably need a CPU with a larger number of PCIe lanes, such as an AMD Threadripper. You will likely need relatively expensive mainboards to get 4 PCIe 3.0 x16 slots since they probably need a PLX to multiplex the PCIe lanes. Buy GPUs with reference design that are narrow and let air in between the GPUs. You need a 1600--2000W power supply and the outlet in your office might not support that. This server will probably run *loud and hot*. You do not want it under your desk. 128 GB of DRAM is recommended. Get an SSD (1--2 TB NVMe) for local storage and a bunch of hard disks in RAID configuration to store your data.
 * **8 GPUs**. You need to buy a dedicated multi-GPU server chassis with multiple redundant power supplies (e.g., 2+1 for 1600W per power supply). This will require dual socket server CPUs, 256 GB ECC DRAM, a fast network card (10 GBE recommended), and you will need to check whether the servers support the *physical form factor* of the GPUs. Airflow and wiring placement differ significantly between consumer and server GPUs (e.g., RTX 2080 vs. Tesla V100). This means that you might not be able to install the consumer GPU in a server due to insufficient clearance for the power cable or lack of a suitable wiring harness (as one of the coauthors painfully discovered).
 
+
 ## Selecting GPUs
 
 At present, AMD and NVIDIA are the two main manufacturers of dedicated GPUs. NVIDIA was the first to enter the deep learning field and provides better support for deep learning frameworks via CUDA. Therefore, most buyers choose NVIDIA GPUs.
 
 NVIDIA provides two types of GPUs, targeting individual users (e.g., via the GTX and RTX series) and enterprise users (via its Tesla series). The two types of GPUs provide comparable compute power. However, the enterprise user GPUs generally use (passive) forced cooling, more memory, and ECC (error correcting) memory. These GPUs are more suitable for data centers and usually cost ten times more than consumer GPUs.
 
-If you are a large company with 100+ servers you should consider the NVIDIA Tesla series or alternatively use GPU servers in the cloud. For a lab or a small to medium company with 10+ servers the NVIDIA RTX series is likely most cost effective. You can buy preconfigured servers with Supermicro or Asus chassis that hold 4-8 GPUs efficiently.
+If you are a large company with 100+ servers you should consider the NVIDIA Tesla series or alternatively use GPU servers in the cloud. For a lab or a small to medium company with 10+ servers the NVIDIA RTX series is likely most cost effective. You can buy preconfigured servers with Supermicro or Asus chassis that hold 4--8 GPUs efficiently.
 
-GPU vendors typically release a new generation every 1-2 years, such as the GTX 1000 (Pascal) series released in 2017 and the RTX 2000 (Turing) series released in 2019. Each series offers several different models that provide different performance levels. GPU performance is primarily a combination of the following three parameters:
+GPU vendors typically release a new generation every one to two years, such as the GTX 1000 (Pascal) series released in 2017 and the RTX 2000 (Turing) series released in 2019. Each series offers several different models that provide different performance levels. GPU performance is primarily a combination of the following three parameters:
 
-1. **Compute power**. Generally we look for 32-bit floating-point compute power. 16-bit floating point training (FP16) is also entering the mainstream. If you are only interested in prediction, you can also use 8-bit integer. The latest generation of Turing GPUs offers 4-bit acceleration. Unfortunately at present the algorithms to train low-precision networks are not widespread yet.
-1. **Memory size**. As your models become larger or the batches used during training grow bigger, you will need more GPU memory. Check for HBM2 (High Bandwidth Memory) vs. GDDR6 (Graphics DDR) memory. HBM2 is faster but much more expensive.
-1. **Memory bandwidth**. You can only get the most out of your compute power when you have sufficient memory bandwidth. Look for wide memory buses if using GDDR6.
+1. **Compute Power**. Generally we look for 32-bit floating-point compute power. 16-bit floating-point training (FP16) is also entering the mainstream. If you are only interested in prediction, you can also use 8-bit integers. The latest generation of Turing GPUs offers 4-bit acceleration. Unfortunately, algorithms for training low-precision networks are not yet widespread.
+1. **Memory Size**. As your models become larger or the batches used during training grow bigger, you will need more GPU memory. Check for HBM2 (High Bandwidth Memory) vs. GDDR6 (Graphics DDR) memory. HBM2 is faster but much more expensive.
+1. **Memory Bandwidth**. You can only get the most out of your compute power when you have sufficient memory bandwidth. Look for wide memory buses if using GDDR6.
 
-For most users, it is enough to look at compute power. Note that many GPUs offer different types of acceleration. E.g., NVIDIA's TensorCores accelerate a subset of operators by 5x. Ensure that your libraries support this. The GPU memory should be no less than 4 GB (8 GB is much better). Try to avoid using the GPU also for displaying a GUI (use the built-in graphics instead). If you cannot avoid it, add an extra 2 GB of RAM for safety.
+For most users, it is enough to look at compute power. Note that many GPUs offer different types of acceleration. For example, NVIDIA's TensorCores accelerate a subset of operators by 5$\times$. Ensure that your libraries support this. The GPU memory should be no less than 4 GB (8 GB is much better). Try to avoid using the GPU also for displaying a GUI (use the built-in graphics instead). If you cannot avoid it, add an extra 2 GB of RAM for safety.
 
 :numref:`fig_flopsvsprice` compares the 32-bit floating-point compute power and price of the various GTX 900, GTX 1000 and RTX 2000 series models. The prices are the suggested prices found on Wikipedia.
 
 ![Floating-point compute power and price comparison. ](../img/flopsvsprice.svg)
 :label:`fig_flopsvsprice`
 
+
 We can see a number of things:
 
-1. Within each series, price and performance are roughly proportional. Titan models command a significant premium for the benefit of larger amounts of GPU memory. However, the newer models offer better cost effectiveness, as can be seen by comparing the 980 Ti and 1080 Ti. The price does not appear to improve much for the RTX 2000 series. However, this is due to the fact that they offer far superior low precision performance (FP16, INT8 and INT4).
+1. Within each series, price and performance are roughly proportional. Titan models command a significant premium for the benefit of larger amounts of GPU memory. However, the newer models offer better cost effectiveness, as can be seen by comparing the 980 Ti and 1080 Ti. The price does not appear to improve much for the RTX 2000 series. However, this is due to the fact that they offer far superior low precision performance (FP16, INT8, and INT4).
 2. The performance-to-cost ratio of the GTX 1000 series is about two times greater than the 900 series.
-3. For the RTX 2000 series the price is an *affine* function of the price.
+3. For the RTX 2000 series the performance (in GFLOPs) is an *affine* function of the price.
 
 ![Floating-point compute power and energy consumption. ](../img/wattvsprice.svg)
 :label:`fig_wattvsprice`
 
 
-:numref:`fig_wattvsprice` shows how energy consumption scales mostly linearly with the amount of computation. Second, later generations are more efficient. This seems to be contradicted by the graph corresponding to the RTX 2000 series. However, this is a consequence of the TensorCores which draw disproportionately much energy.
+:numref:`fig_wattvsprice` shows two things. First, energy consumption scales mostly linearly with the amount of computation. Second, later generations are more efficient. This seems to be contradicted by the graph corresponding to the RTX 2000 series. However, this is a consequence of the TensorCores, which draw a disproportionate amount of energy.
 
 
 ## Summary
 
-* Watch out for power, PCIe bus lanes, CPU single thread speed and cooling when building a server.
+* Watch out for power, PCIe bus lanes, CPU single-thread speed, and cooling when building a server.
 * You should purchase the latest GPU generation if possible.
 * Use the cloud for large deployments.
 * High density servers may not be compatible with all GPUs. Check the mechanical and cooling specifications before you buy.
diff --git a/chapter_appendix-tools-for-deep-learning/utils.md b/chapter_appendix-tools-for-deep-learning/utils.md
new file mode 100644
index 0000000..9fee8f1
--- /dev/null
+++ b/chapter_appendix-tools-for-deep-learning/utils.md
@@ -0,0 +1,985 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# Utility Functions and Classes
+:label:`sec_utils`
+
+This section contains the implementations of the utility functions and classes used in this book.
+
+```{.python .input}
+%%tab mxnet
+import inspect
+import collections
+from d2l import mxnet as d2l
+from IPython import display
+from mxnet import autograd, gluon, init, np, npx
+from mxnet.gluon import nn
+import random
+import sys
+npx.set_np()
+```
+
+```{.python .input  n=1}
+%%tab pytorch
+import inspect
+import collections
+from d2l import torch as d2l
+from IPython import display
+from torch import nn
+```
+
+```{.python .input}
+%%tab tensorflow
+import inspect
+from IPython import display
+import collections
+from d2l import tensorflow as d2l
+import tensorflow as tf
+```
+
+Hyperparameters.
+
+```{.python .input}
+%%tab all
+@d2l.add_to_class(d2l.HyperParameters)  #@save
+def save_hyperparameters(self, ignore=[]):
+    """Save function arguments into class attributes."""
+    frame = inspect.currentframe().f_back
+    _, _, _, local_vars = inspect.getargvalues(frame)
+    self.hparams = {k:v for k, v in local_vars.items()
+                    if k not in set(ignore+['self']) and not k.startswith('_')}
+    for k, v in self.hparams.items():
+        setattr(self, k, v)
+```
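The frame-inspection trick used by `save_hyperparameters` can be tried without the `d2l` package. The sketch below is a self-contained approximation: `HyperParametersSketch` and `ToyModel` are hypothetical stand-ins, not classes from the book, and the method body mirrors the one above with only the default for `ignore` changed to an immutable tuple.

```python
import inspect


class HyperParametersSketch:
    """Hypothetical stand-in for d2l.HyperParameters."""

    def save_hyperparameters(self, ignore=()):
        # f_back is the frame of the caller, i.e. the __init__ that invoked us.
        frame = inspect.currentframe().f_back
        _, _, _, local_vars = inspect.getargvalues(frame)
        skipped = set(ignore) | {'self'}
        self.hparams = {k: v for k, v in local_vars.items()
                        if k not in skipped and not k.startswith('_')}
        # Expose every saved hyperparameter as an attribute as well.
        for k, v in self.hparams.items():
            setattr(self, k, v)


class ToyModel(HyperParametersSketch):
    def __init__(self, num_hiddens, lr=0.1, dropout=0.5):
        self.save_hyperparameters(ignore=['dropout'])


model = ToyModel(num_hiddens=256)
print(model.hparams)
print(model.num_hiddens, model.lr)
```

Because the method reads the caller's local variables, it should be called at the top of `__init__`, before other locals are created; any local defined earlier would otherwise be saved as a hyperparameter too.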
+
+Progress board.
+
+```{.python .input  n=22}
+%%tab all
+@d2l.add_to_class(d2l.ProgressBoard)  #@save
+def draw(self, x, y, label, every_n=1):
+    Point = collections.namedtuple('Point', ['x', 'y'])
+    if not hasattr(self, 'raw_points'):
+        self.raw_points = collections.OrderedDict()
+        self.data = collections.OrderedDict()
+    if label not in self.raw_points:
+        self.raw_points[label] = []
+        self.data[label] = []    
+    points = self.raw_points[label]
+    line = self.data[label]
+    points.append(Point(x, y))
+    if len(points) != every_n:
+        return    
+    mean = lambda x: sum(x) / len(x)
+    line.append(Point(mean([p.x for p in points]), 
+                      mean([p.y for p in points])))
+    points.clear()
+    if not self.display: 
+        return
+    d2l.use_svg_display()
+    if self.fig is None:
+        self.fig = d2l.plt.figure(figsize=self.figsize)
+    plt_lines, labels = [], []
+    for (k, v), ls, color in zip(self.data.items(), self.ls, self.colors):        
+        plt_lines.append(d2l.plt.plot([p.x for p in v], [p.y for p in v], 
+                                      linestyle=ls, color=color)[0])
+        labels.append(k)        
+    axes = self.axes if self.axes else d2l.plt.gca()
+    if self.xlim: axes.set_xlim(self.xlim)
+    if self.ylim: axes.set_ylim(self.ylim)
+    if not self.xlabel: self.xlabel = self.x    
+    axes.set_xlabel(self.xlabel)
+    axes.set_ylabel(self.ylabel)
+    axes.set_xscale(self.xscale)
+    axes.set_yscale(self.yscale)
+    axes.legend(plt_lines, labels)    
+    display.display(self.fig)
+    display.clear_output(wait=True)
+```
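The smoothing step in `draw` (buffer raw points, then emit one averaged point every `every_n` calls) can be isolated from the plotting code. The following sketch assumes nothing beyond the standard library; `SmoothedSeries` is a hypothetical helper introduced only for illustration and handles a single label.

```python
import collections

Point = collections.namedtuple('Point', ['x', 'y'])


class SmoothedSeries:
    """Hypothetical helper: averages every `every_n` raw points into one."""

    def __init__(self, every_n=1):
        self.every_n = every_n
        self.raw_points = []  # Points buffered since the last emitted average
        self.line = []        # Averaged points that would be plotted

    def add(self, x, y):
        self.raw_points.append(Point(x, y))
        if len(self.raw_points) != self.every_n:
            return  # Not enough points buffered yet
        mean = lambda values: sum(values) / len(values)
        self.line.append(Point(mean([p.x for p in self.raw_points]),
                               mean([p.y for p in self.raw_points])))
        self.raw_points.clear()


series = SmoothedSeries(every_n=2)
for step, loss in enumerate([4.0, 2.0, 1.0, 0.5]):
    series.add(step, loss)
print(series.line)
```

With `every_n=2`, four raw points collapse into two plotted points, each placed at the mean of the two steps it summarizes. This keeps the curve readable when values are logged every minibatch.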
+
+Trainer.
+
+A collection of functions that will be deprecated:
+
+```{.python .input}
+%%tab mxnet
+def load_array(data_arrays, batch_size, is_train=True):  #@save
+    """Construct a Gluon data iterator."""
+    dataset = gluon.data.ArrayDataset(*data_arrays)
+    return gluon.data.DataLoader(dataset, batch_size, shuffle=is_train)
+
+def synthetic_data(w, b, num_examples):  #@save
+    """Generate y = Xw + b + noise."""
+    X = d2l.normal(0, 1, (num_examples, len(w)))
+    y = d2l.matmul(X, w) + b
+    y += d2l.normal(0, 0.01, y.shape)
+    return X, d2l.reshape(y, (-1, 1))
+
+def sgd(params, lr, batch_size):  #@save
+    """Minibatch stochastic gradient descent."""
+    for param in params:
+        param[:] = param - lr * param.grad / batch_size
+
+def get_dataloader_workers():  #@save
+    """Use 4 processes to read the data except for Windows."""
+    return 0 if sys.platform.startswith('win') else 4
+
+def load_data_fashion_mnist(batch_size, resize=None):  #@save
+    """Download the Fashion-MNIST dataset and then load it into memory."""
+    dataset = gluon.data.vision
+    trans = [dataset.transforms.ToTensor()]
+    if resize:
+        trans.insert(0, dataset.transforms.Resize(resize))
+    trans = dataset.transforms.Compose(trans)
+    mnist_train = dataset.FashionMNIST(train=True).transform_first(trans)
+    mnist_test = dataset.FashionMNIST(train=False).transform_first(trans)
+    return (gluon.data.DataLoader(mnist_train, batch_size, shuffle=True,
+                                  num_workers=get_dataloader_workers()),
+            gluon.data.DataLoader(mnist_test, batch_size, shuffle=False,
+                                  num_workers=get_dataloader_workers()))
+
+def evaluate_accuracy_gpu(net, data_iter, device=None):  #@save
+    """Compute the accuracy for a model on a dataset using a GPU."""
+    if not device:  # Query the first device where the first parameter is on
+        device = list(net.collect_params().values())[0].list_ctx()[0]
+    # No. of correct predictions, no. of predictions
+    metric = d2l.Accumulator(2)
+    for X, y in data_iter:
+        X, y = X.as_in_ctx(device), y.as_in_ctx(device)
+        metric.add(d2l.accuracy(net(X), y), d2l.size(y))
+    return metric[0] / metric[1]
+
+#@save
+def train_ch6(net, train_iter, test_iter, num_epochs, lr, device):
+    """Train a model with a GPU (defined in Chapter 6)."""
+    net.initialize(force_reinit=True, ctx=device, init=init.Xavier())
+    loss = gluon.loss.SoftmaxCrossEntropyLoss()
+    trainer = gluon.Trainer(net.collect_params(),
+                            'sgd', {'learning_rate': lr})
+    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
+                            legend=['train loss', 'train acc', 'test acc'])
+    timer, num_batches = d2l.Timer(), len(train_iter)
+    for epoch in range(num_epochs):
+        # Sum of training loss, sum of training accuracy, no. of examples
+        metric = d2l.Accumulator(3)
+        for i, (X, y) in enumerate(train_iter):
+            timer.start()
+            # Here is the major difference from `d2l.train_epoch_ch3`
+            X, y = X.as_in_ctx(device), y.as_in_ctx(device)
+            with autograd.record():
+                y_hat = net(X)
+                l = loss(y_hat, y)
+            l.backward()
+            trainer.step(X.shape[0])
+            metric.add(l.sum(), d2l.accuracy(y_hat, y), X.shape[0])
+            timer.stop()
+            train_l = metric[0] / metric[2]
+            train_acc = metric[1] / metric[2]
+            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
+                animator.add(epoch + (i + 1) / num_batches,
+                             (train_l, train_acc, None))
+        test_acc = evaluate_accuracy_gpu(net, test_iter)
+        animator.add(epoch + 1, (None, None, test_acc))
+    print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
+          f'test acc {test_acc:.3f}')
+    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
+          f'on {str(device)}')
+    
+def grad_clipping(net, theta):  #@save
+    """Clip the gradient."""
+    if isinstance(net, gluon.Block):
+        params = [p.data() for p in net.collect_params().values()]
+    else:
+        params = net.params
+    norm = math.sqrt(sum((p.grad ** 2).sum() for p in params))
+    if norm > theta:
+        for param in params:
+            param.grad[:] *= theta / norm
+```
+
+```{.python .input}
+%%tab pytorch
+
+def load_array(data_arrays, batch_size, is_train=True):  #@save
+    """Construct a PyTorch data iterator."""
+    dataset = data.TensorDataset(*data_arrays)
+    return data.DataLoader(dataset, batch_size, shuffle=is_train)
+
+def synthetic_data(w, b, num_examples):  #@save
+    """Generate y = Xw + b + noise."""
+    X = d2l.normal(0, 1, (num_examples, len(w)))
+    y = d2l.matmul(X, w) + b
+    y += d2l.normal(0, 0.01, y.shape)
+    return X, d2l.reshape(y, (-1, 1))
+
+def sgd(params, lr, batch_size): #@save
+    """Minibatch stochastic gradient descent."""
+    with torch.no_grad():
+        for param in params:
+            param -= lr * param.grad / batch_size
+            param.grad.zero_()
+
+def get_dataloader_workers():  #@save
+    """Use 4 processes to read the data."""
+    return 4
+
+def load_data_fashion_mnist(batch_size, resize=None):  #@save
+    """Download the Fashion-MNIST dataset and then load it into memory."""
+    trans = [transforms.ToTensor()]
+    if resize:
+        trans.insert(0, transforms.Resize(resize))
+    trans = transforms.Compose(trans)
+    mnist_train = torchvision.datasets.FashionMNIST(
+        root="../data", train=True, transform=trans, download=True)
+    mnist_test = torchvision.datasets.FashionMNIST(
+        root="../data", train=False, transform=trans, download=True)
+    return (data.DataLoader(mnist_train, batch_size, shuffle=True,
+                            num_workers=get_dataloader_workers()),
+            data.DataLoader(mnist_test, batch_size, shuffle=False,
+                            num_workers=get_dataloader_workers()))
+
+def evaluate_accuracy_gpu(net, data_iter, device=None): #@save
+    """Compute the accuracy for a model on a dataset using a GPU."""
+    if isinstance(net, nn.Module):
+        net.eval()  # Set the model to evaluation mode
+        if not device:
+            device = next(iter(net.parameters())).device
+    # No. of correct predictions, no. of predictions
+    metric = d2l.Accumulator(2)
+
+    with torch.no_grad():
+        for X, y in data_iter:
+            if isinstance(X, list):
+                # Required for BERT Fine-tuning (to be covered later)
+                X = [x.to(device) for x in X]
+            else:
+                X = X.to(device)
+            y = y.to(device)
+            metric.add(d2l.accuracy(net(X), y), d2l.size(y))
+    return metric[0] / metric[1]
+
+
+#@save
+def train_ch6(net, train_iter, test_iter, num_epochs, lr, device):
+    """Train a model with a GPU (defined in Chapter 6)."""
+    def init_weights(m):
+        if type(m) == nn.Linear or type(m) == nn.Conv2d:
+            nn.init.xavier_uniform_(m.weight)
+    net.apply(init_weights)
+    print('training on', device)
+    net.to(device)
+    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
+    loss = nn.CrossEntropyLoss()
+    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
+                            legend=['train loss', 'train acc', 'test acc'])
+    timer, num_batches = d2l.Timer(), len(train_iter)
+    for epoch in range(num_epochs):
+        # Sum of training loss, sum of training accuracy, no. of examples
+        metric = d2l.Accumulator(3)
+        net.train()
+        for i, (X, y) in enumerate(train_iter):
+            timer.start()
+            optimizer.zero_grad()
+            X, y = X.to(device), y.to(device)
+            y_hat = net(X)
+            l = loss(y_hat, y)
+            l.backward()
+            optimizer.step()
+            with torch.no_grad():
+                metric.add(l * X.shape[0], d2l.accuracy(y_hat, y), X.shape[0])
+            timer.stop()
+            train_l = metric[0] / metric[2]
+            train_acc = metric[1] / metric[2]
+            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
+                animator.add(epoch + (i + 1) / num_batches,
+                             (train_l, train_acc, None))
+        test_acc = evaluate_accuracy_gpu(net, test_iter)
+        animator.add(epoch + 1, (None, None, test_acc))
+    print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
+          f'test acc {test_acc:.3f}')
+    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
+          f'on {str(device)}')
+```
+
+```{.python .input}
+%%tab tensorflow
+
+def load_array(data_arrays, batch_size, is_train=True):  #@save
+    """Construct a TensorFlow data iterator."""
+    dataset = tf.data.Dataset.from_tensor_slices(data_arrays)
+    if is_train:
+        dataset = dataset.shuffle(buffer_size=1000)
+    dataset = dataset.batch(batch_size)
+    return dataset
+
+def synthetic_data(w, b, num_examples):  #@save
+    """Generate y = Xw + b + noise."""
+    X = tf.zeros((num_examples, w.shape[0]))
+    X += tf.random.normal(shape=X.shape)
+    y = tf.matmul(X, tf.reshape(w, (-1, 1))) + b
+    y += tf.random.normal(shape=y.shape, stddev=0.01)
+    y = tf.reshape(y, (-1, 1))
+    return X, y
+
+
+def sgd(params, grads, lr, batch_size):  #@save
+    """Minibatch stochastic gradient descent."""
+    for param, grad in zip(params, grads):
+        param.assign_sub(lr * grad / batch_size)
+
+def load_data_fashion_mnist(batch_size, resize=None):   #@save
+    """Download the Fashion-MNIST dataset and then load it into memory."""
+    mnist_train, mnist_test = tf.keras.datasets.fashion_mnist.load_data()
+    # Divide all numbers by 255 so that all pixel values are between
+    # 0 and 1, add a channel dimension at the end, and cast labels to int32
+    process = lambda X, y: (tf.expand_dims(X, axis=3) / 255,
+                            tf.cast(y, dtype='int32'))
+    resize_fn = lambda X, y: (
+        tf.image.resize_with_pad(X, resize, resize) if resize else X, y)
+    return (
+        tf.data.Dataset.from_tensor_slices(process(*mnist_train)).batch(
+            batch_size).shuffle(len(mnist_train[0])).map(resize_fn),
+        tf.data.Dataset.from_tensor_slices(process(*mnist_test)).batch(
+            batch_size).map(resize_fn))
+
+class TrainCallback(tf.keras.callbacks.Callback):  #@save
+    """A callback to visualize the training progress."""
+    def __init__(self, net, train_iter, test_iter, num_epochs, device_name):
+        self.timer = d2l.Timer()
+        self.animator = d2l.Animator(
+            xlabel='epoch', xlim=[1, num_epochs], legend=[
+                'train loss', 'train acc', 'test acc'])
+        self.net = net
+        self.train_iter = train_iter
+        self.test_iter = test_iter
+        self.num_epochs = num_epochs
+        self.device_name = device_name
+    def on_epoch_begin(self, epoch, logs=None):
+        self.timer.start()
+    def on_epoch_end(self, epoch, logs):
+        self.timer.stop()
+        test_acc = self.net.evaluate(
+            self.test_iter, verbose=0, return_dict=True)['accuracy']
+        metrics = (logs['loss'], logs['accuracy'], test_acc)
+        self.animator.add(epoch + 1, metrics)
+        if epoch == self.num_epochs - 1:
+            batch_size = next(iter(self.train_iter))[0].shape[0]
+            num_examples = batch_size * tf.data.experimental.cardinality(
+                self.train_iter).numpy()
+            print(f'loss {metrics[0]:.3f}, train acc {metrics[1]:.3f}, '
+                  f'test acc {metrics[2]:.3f}')
+            print(f'{num_examples / self.timer.avg():.1f} examples/sec on '
+                  f'{str(self.device_name)}')
+
+#@save
+def train_ch6(net_fn, train_iter, test_iter, num_epochs, lr, device):
+    """Train a model with a GPU (defined in Chapter 6)."""
+    device_name = device._device_name
+    strategy = tf.distribute.OneDeviceStrategy(device_name)
+    with strategy.scope():
+        optimizer = tf.keras.optimizers.SGD(learning_rate=lr)
+        loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
+        net = net_fn()
+        net.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
+    callback = TrainCallback(net, train_iter, test_iter, num_epochs,
+                             device_name)
+    net.fit(train_iter, epochs=num_epochs, verbose=0, callbacks=[callback])
+    return net
+```
+
+```{.python .input}
+%%tab mxnet, tensorflow
+def evaluate_accuracy(net, data_iter):  #@save
+    """Compute the accuracy for a model on a dataset."""
+    metric = Accumulator(2)  # No. of correct predictions, no. of predictions
+    for X, y in data_iter:
+        metric.add(accuracy(net(X), y), d2l.size(y))
+    return metric[0] / metric[1]
+```
+
+```{.python .input}
+%%tab all
+
+def linreg(X, w, b):  #@save
+    """The linear regression model."""
+    return d2l.matmul(X, w) + b
+
+def squared_loss(y_hat, y):  #@save
+    """Squared loss."""
+    return (y_hat - d2l.reshape(y, y_hat.shape)) ** 2 / 2
+
+def get_fashion_mnist_labels(labels):  #@save
+    """Return text labels for the Fashion-MNIST dataset."""
+    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
+                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
+    return [text_labels[int(i)] for i in labels]
+
+def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):  #@save
+    """Plot a list of images."""
+    figsize = (num_cols * scale, num_rows * scale)
+    _, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize)
+    axes = axes.flatten()
+    for i, (ax, img) in enumerate(zip(axes, imgs)):
+        try:
+            img = d2l.numpy(img)
+        except Exception:
+            pass
+        ax.imshow(img)
+        ax.axes.get_xaxis().set_visible(False)
+        ax.axes.get_yaxis().set_visible(False)
+        if titles:
+            ax.set_title(titles[i])
+    return axes
+
+#@tab all
+class Animator:  #@save
+    """For plotting data in animation."""
+    def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
+                 ylim=None, xscale='linear', yscale='linear',
+                 fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
+                 figsize=(3.5, 2.5)):
+        # Incrementally plot multiple lines
+        if legend is None:
+            legend = []
+        d2l.use_svg_display()
+        self.fig, self.axes = d2l.plt.subplots(nrows, ncols, figsize=figsize)
+        if nrows * ncols == 1:
+            self.axes = [self.axes, ]
+        # Use a lambda function to capture arguments
+        self.config_axes = lambda: d2l.set_axes(
+            self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
+        self.X, self.Y, self.fmts = None, None, fmts
+
+    def add(self, x, y):
+        # Add multiple data points into the figure
+        if not hasattr(y, "__len__"):
+            y = [y]
+        n = len(y)
+        if not hasattr(x, "__len__"):
+            x = [x] * n
+        if not self.X:
+            self.X = [[] for _ in range(n)]
+        if not self.Y:
+            self.Y = [[] for _ in range(n)]
+        for i, (a, b) in enumerate(zip(x, y)):
+            if a is not None and b is not None:
+                self.X[i].append(a)
+                self.Y[i].append(b)
+        self.axes[0].cla()
+        for x, y, fmt in zip(self.X, self.Y, self.fmts):
+            self.axes[0].plot(x, y, fmt)
+        self.config_axes()
+        display.display(self.fig)
+        display.clear_output(wait=True)
+        
+#@tab all
+class Accumulator:  #@save
+    """For accumulating sums over `n` variables."""
+    def __init__(self, n):
+        self.data = [0.0] * n
+
+    def add(self, *args):
+        self.data = [a + float(b) for a, b in zip(self.data, args)]
+
+    def reset(self):
+        self.data = [0.0] * len(self.data)
+
+    def __getitem__(self, idx):
+        return self.data[idx]        
+    
+    
+#@tab all
+def accuracy(y_hat, y):  #@save
+    """Compute the number of correct predictions."""
+    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
+        y_hat = d2l.argmax(y_hat, axis=1)
+    cmp = d2l.astype(y_hat, y.dtype) == y
+    return float(d2l.reduce_sum(d2l.astype(cmp, y.dtype)))
+```
+
+```{.python .input}
+%%tab all
+
+import os
+import requests
+import zipfile
+import tarfile
+import hashlib
+
+def download(url, folder='../data', sha1_hash=None):  #@save
+    """Download a file to folder and return the local filepath."""
+    if not url.startswith('http'):
+        # For backward compatibility
+        url, sha1_hash = DATA_HUB[url]
+    os.makedirs(folder, exist_ok=True)
+    fname = os.path.join(folder, url.split('/')[-1])
+    # Check if hit cache
+    if os.path.exists(fname) and sha1_hash:
+        sha1 = hashlib.sha1()
+        with open(fname, 'rb') as f:
+            while True:
+                data = f.read(1048576)
+                if not data:
+                    break
+                sha1.update(data)
+        if sha1.hexdigest() == sha1_hash:
+            return fname
+    # Download
+    print(f'Downloading {fname} from {url}...')
+    r = requests.get(url, stream=True, verify=True)
+    with open(fname, 'wb') as f:
+        f.write(r.content)
+    return fname
+
+def extract(filename, folder=None):  #@save
+    """Extract a zip/tar file into folder."""
+    base_dir = os.path.dirname(filename)
+    _, ext = os.path.splitext(filename)
+    assert ext in ('.zip', '.tar', '.gz'), 'Only support zip/tar files.'
+    if ext == '.zip':
+        fp = zipfile.ZipFile(filename, 'r')
+    else:
+        fp = tarfile.open(filename, 'r')
+    if folder is None:
+        folder = base_dir
+    fp.extractall(folder)
+```
+
+```{.python .input}
+%%tab all
+
+def download_extract(name, folder=None):  #@save
+    """Download and extract a zip/tar file."""
+    fname = download(name)
+    base_dir = os.path.dirname(fname)
+    data_dir, ext = os.path.splitext(fname)
+    if ext == '.zip':
+        fp = zipfile.ZipFile(fname, 'r')
+    elif ext in ('.tar', '.gz'):
+        fp = tarfile.open(fname, 'r')
+    else:
+        assert False, 'Only zip/tar files can be extracted.'
+    fp.extractall(base_dir)
+    return os.path.join(base_dir, folder) if folder else data_dir
+
+
+def tokenize(lines, token='word'):  #@save
+    """Split text lines into word or character tokens."""
+    assert token in ('word', 'char'), 'Unknown token type: ' + token
+    return [line.split() if token == 'word' else list(line) for line in lines]
+```
+
+```{.python .input}
+%%tab pytorch
+
+def evaluate_loss(net, data_iter, loss):  #@save
+    """Evaluate the loss of a model on the given dataset."""
+    metric = d2l.Accumulator(2)  # Sum of losses, no. of examples
+    for X, y in data_iter:
+        out = net(X)
+        y = d2l.reshape(y, out.shape)
+        l = loss(out, y)
+        metric.add(d2l.reduce_sum(l), d2l.size(l))
+    return metric[0] / metric[1]
+```
+
+```{.python .input}
+%%tab mxnet, tensorflow
+def evaluate_loss(net, data_iter, loss):  #@save
+    """Evaluate the loss of a model on the given dataset."""
+    metric = d2l.Accumulator(2)  # Sum of losses, no. of examples
+    for X, y in data_iter:
+        l = loss(net(X), y)
+        metric.add(d2l.reduce_sum(l), d2l.size(l))
+    return metric[0] / metric[1]
+```
+
+```{.python .input}
+%%tab pytorch
+def grad_clipping(net, theta):  #@save
+    """Clip the gradient."""
+    if isinstance(net, nn.Module):
+        params = [p for p in net.parameters() if p.requires_grad]
+    else:
+        params = net.params
+    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
+    if norm > theta:
+        for param in params:
+            param.grad[:] *= theta / norm
+```
+
+```{.python .input}
+%%tab tensorflow
+def grad_clipping(grads, theta):  #@save
+    """Clip the gradient."""
+    theta = tf.constant(theta, dtype=tf.float32)
+    new_grad = []
+    for grad in grads:
+        if isinstance(grad, tf.IndexedSlices):
+            new_grad.append(tf.convert_to_tensor(grad))
+        else:
+            new_grad.append(grad)
+    norm = tf.math.sqrt(sum((tf.reduce_sum(grad ** 2)).numpy()
+                        for grad in new_grad))
+    norm = tf.cast(norm, tf.float32)
+    if tf.greater(norm, theta):
+        for i, grad in enumerate(new_grad):
+            new_grad[i] = grad * theta / norm
+    return new_grad
+```
+
+More for the attention chapter.
+
+```{.python .input}
+%%tab all
+#@save
+d2l.DATA_HUB['fra-eng'] = (d2l.DATA_URL + 'fra-eng.zip',
+                           '94646ad1522d915e7b0f9296181140edcf86a4f5')
+
+#@save
+def read_data_nmt():
+    """Load the English-French dataset."""
+    data_dir = d2l.download_extract('fra-eng')
+    with open(os.path.join(data_dir, 'fra.txt'), 'r') as f:
+        return f.read()
+
+#@save
+def preprocess_nmt(text):
+    """Preprocess the English-French dataset."""
+    def no_space(char, prev_char):
+        return char in set(',.!?') and prev_char != ' '
+
+    # Replace non-breaking space with space, and convert uppercase letters to
+    # lowercase ones
+    text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()
+    # Insert space between words and punctuation marks
+    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
+           for i, char in enumerate(text)]
+    return ''.join(out)
+
+#@save
+def tokenize_nmt(text, num_examples=None):
+    """Tokenize the English-French dataset."""
+    source, target = [], []
+    for i, line in enumerate(text.split('\n')):
+        if num_examples and i > num_examples:
+            break
+        parts = line.split('\t')
+        if len(parts) == 2:
+            source.append(parts[0].split(' '))
+            target.append(parts[1].split(' '))
+    return source, target
+
+    
+#@save
+def truncate_pad(line, num_steps, padding_token):
+    """Truncate or pad sequences."""
+    if len(line) > num_steps:
+        return line[:num_steps]  # Truncate
+    return line + [padding_token] * (num_steps - len(line))  # Pad
+
+
+#@save
+def build_array_nmt(lines, vocab, num_steps):
+    """Transform text sequences of machine translation into minibatches."""
+    lines = [vocab[l] for l in lines]
+    lines = [l + [vocab['<eos>']] for l in lines]
+    array = d2l.tensor([truncate_pad(
+        l, num_steps, vocab['<pad>']) for l in lines])
+    valid_len = d2l.reduce_sum(
+        d2l.astype(array != vocab['<pad>'], d2l.int32), 1)
+    return array, valid_len
+
+
+#@save
+def load_data_nmt(batch_size, num_steps, num_examples=600):
+    """Return the iterator and the vocabularies of the translation dataset."""
+    text = preprocess_nmt(read_data_nmt())
+    source, target = tokenize_nmt(text, num_examples)
+    src_vocab = d2l.Vocab(source, min_freq=2,
+                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
+    tgt_vocab = d2l.Vocab(target, min_freq=2,
+                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
+    src_array, src_valid_len = build_array_nmt(source, src_vocab, num_steps)
+    tgt_array, tgt_valid_len = build_array_nmt(target, tgt_vocab, num_steps)
+    data_arrays = (src_array, src_valid_len, tgt_array, tgt_valid_len)
+    data_iter = d2l.load_array(data_arrays, batch_size)
+    return data_iter, src_vocab, tgt_vocab
+```
+
+```{.python .input}
+%%tab mxnet
+    
+#@save
+class MaskedSoftmaxCELoss(gluon.loss.SoftmaxCELoss):
+    """The softmax cross-entropy loss with masks."""
+    # `pred` shape: (`batch_size`, `num_steps`, `vocab_size`)
+    # `label` shape: (`batch_size`, `num_steps`)
+    # `valid_len` shape: (`batch_size`,)
+    def forward(self, pred, label, valid_len):
+        # `weights` shape: (`batch_size`, `num_steps`, 1)
+        weights = np.expand_dims(np.ones_like(label), axis=-1)
+        weights = npx.sequence_mask(weights, valid_len, True, axis=1)
+        return super(MaskedSoftmaxCELoss, self).forward(pred, label, weights)
+
+#@save
+def train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device):
+    """Train a model for sequence to sequence."""
+    net.initialize(init.Xavier(), force_reinit=True, ctx=device)
+    trainer = gluon.Trainer(net.collect_params(), 'adam',
+                            {'learning_rate': lr})
+    loss = MaskedSoftmaxCELoss()
+    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
+                            xlim=[10, num_epochs])
+    for epoch in range(num_epochs):
+        timer = d2l.Timer()
+        metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
+        for batch in data_iter:
+            X, X_valid_len, Y, Y_valid_len = [
+                x.as_in_ctx(device) for x in batch]
+            bos = np.array(
+                [tgt_vocab['<bos>']] * Y.shape[0], ctx=device).reshape(-1, 1)
+            dec_input = d2l.concat([bos, Y[:, :-1]], 1)  # Teacher forcing
+            with autograd.record():
+                Y_hat, _ = net(X, dec_input, X_valid_len)
+                l = loss(Y_hat, Y, Y_valid_len)
+            l.backward()
+            d2l.grad_clipping(net, 1)
+            num_tokens = Y_valid_len.sum()
+            trainer.step(num_tokens)
+            metric.add(l.sum(), num_tokens)
+        if (epoch + 1) % 10 == 0:
+            animator.add(epoch + 1, (metric[0] / metric[1],))
+    print(f'loss {metric[0] / metric[1]:.3f}, {metric[1] / timer.stop():.1f} '
+          f'tokens/sec on {str(device)}')
+
+#@save
+def predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps,
+                    device, save_attention_weights=False):
+    """Predict for sequence to sequence."""
+    src_tokens = src_vocab[src_sentence.lower().split(' ')] + [
+        src_vocab['<eos>']]
+    enc_valid_len = np.array([len(src_tokens)], ctx=device)
+    src_tokens = d2l.truncate_pad(src_tokens, num_steps, src_vocab['<pad>'])
+    # Add the batch axis
+    enc_X = np.expand_dims(np.array(src_tokens, ctx=device), axis=0)
+    enc_outputs = net.encoder(enc_X, enc_valid_len)
+    dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)
+    # Add the batch axis
+    dec_X = np.expand_dims(np.array([tgt_vocab['<bos>']], ctx=device), axis=0)
+    output_seq, attention_weight_seq = [], []
+    for _ in range(num_steps):
+        Y, dec_state = net.decoder(dec_X, dec_state)
+        # We use the token with the highest prediction likelihood as input
+        # of the decoder at the next time step
+        dec_X = Y.argmax(axis=2)
+        pred = dec_X.squeeze(axis=0).astype('int32').item()
+        # Save attention weights (to be covered later)
+        if save_attention_weights:
+            attention_weight_seq.append(net.decoder.attention_weights)
+        # Once the end-of-sequence token is predicted, the generation of the
+        # output sequence is complete
+        if pred == tgt_vocab['<eos>']:
+            break
+        output_seq.append(pred)
+    return ' '.join(tgt_vocab.to_tokens(output_seq)), attention_weight_seq
+```
+
+```{.python .input}
+%%tab pytorch
+#@save
+def sequence_mask(X, valid_len, value=0):
+    """Mask irrelevant entries in sequences."""
+    maxlen = X.size(1)
+    mask = torch.arange((maxlen), dtype=torch.float32,
+                        device=X.device)[None, :] < valid_len[:, None]
+    X[~mask] = value
+    return X
+
+    
+#@save
+class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
+    """The softmax cross-entropy loss with masks."""
+    # `pred` shape: (`batch_size`, `num_steps`, `vocab_size`)
+    # `label` shape: (`batch_size`, `num_steps`)
+    # `valid_len` shape: (`batch_size`,)
+    def forward(self, pred, label, valid_len):
+        weights = torch.ones_like(label)
+        weights = sequence_mask(weights, valid_len)
+        self.reduction = 'none'
+        unweighted_loss = super(MaskedSoftmaxCELoss, self).forward(
+            pred.permute(0, 2, 1), label)
+        weighted_loss = (unweighted_loss * weights).mean(dim=1)
+        return weighted_loss
+    
+#@save
+def train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device):
+    """Train a model for sequence to sequence."""
+    def xavier_init_weights(m):
+        if type(m) == nn.Linear:
+            nn.init.xavier_uniform_(m.weight)
+        if type(m) == nn.GRU:
+            for param in m._flat_weights_names:
+                if "weight" in param:
+                    nn.init.xavier_uniform_(m._parameters[param])
+    net.apply(xavier_init_weights)
+    net.to(device)
+    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
+    loss = MaskedSoftmaxCELoss()
+    net.train()
+    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
+                            xlim=[10, num_epochs])
+    for epoch in range(num_epochs):
+        timer = d2l.Timer()
+        metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
+        for batch in data_iter:
+            optimizer.zero_grad()
+            X, X_valid_len, Y, Y_valid_len = [x.to(device) for x in batch]
+            bos = torch.tensor([tgt_vocab['<bos>']] * Y.shape[0],
+                               device=device).reshape(-1, 1)
+            dec_input = d2l.concat([bos, Y[:, :-1]], 1)  # Teacher forcing
+            Y_hat, _ = net(X, dec_input, X_valid_len)
+            l = loss(Y_hat, Y, Y_valid_len)
+            l.sum().backward()  # Make the loss scalar for `backward`
+            d2l.grad_clipping(net, 1)
+            num_tokens = Y_valid_len.sum()
+            optimizer.step()
+            with torch.no_grad():
+                metric.add(l.sum(), num_tokens)
+        if (epoch + 1) % 10 == 0:
+            animator.add(epoch + 1, (metric[0] / metric[1],))
+    print(f'loss {metric[0] / metric[1]:.3f}, {metric[1] / timer.stop():.1f} '
+          f'tokens/sec on {str(device)}')
+    
+
+#@save
+def predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps,
+                    device, save_attention_weights=False):
+    """Predict for sequence to sequence."""
+    # Set `net` to eval mode for inference
+    net.eval()
+    src_tokens = src_vocab[src_sentence.lower().split(' ')] + [
+        src_vocab['<eos>']]
+    enc_valid_len = torch.tensor([len(src_tokens)], device=device)
+    src_tokens = d2l.truncate_pad(src_tokens, num_steps, src_vocab['<pad>'])
+    # Add the batch axis
+    enc_X = torch.unsqueeze(
+        torch.tensor(src_tokens, dtype=torch.long, device=device), dim=0)
+    enc_outputs = net.encoder(enc_X, enc_valid_len)
+    dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)
+    # Add the batch axis
+    dec_X = torch.unsqueeze(torch.tensor(
+        [tgt_vocab['<bos>']], dtype=torch.long, device=device), dim=0)
+    output_seq, attention_weight_seq = [], []
+    for _ in range(num_steps):
+        Y, dec_state = net.decoder(dec_X, dec_state)
+        # We use the token with the highest prediction likelihood as input
+        # of the decoder at the next time step
+        dec_X = Y.argmax(dim=2)
+        pred = dec_X.squeeze(dim=0).type(torch.int32).item()
+        # Save attention weights (to be covered later)
+        if save_attention_weights:
+            attention_weight_seq.append(net.decoder.attention_weights)
+        # Once the end-of-sequence token is predicted, the generation of the
+        # output sequence is complete
+        if pred == tgt_vocab['<eos>']:
+            break
+        output_seq.append(pred)
+    return ' '.join(tgt_vocab.to_tokens(output_seq)), attention_weight_seq
+```
+
+```{.python .input}
+%%tab tensorflow
+#@save
+def sequence_mask(X, valid_len, value=0):
+    """Mask irrelevant entries in sequences."""
+    maxlen = X.shape[1]
+    mask = tf.range(start=0, limit=maxlen, dtype=tf.float32)[
+        None, :] < tf.cast(valid_len[:, None], dtype=tf.float32)
+    
+    if len(X.shape) == 3:
+        return tf.where(tf.expand_dims(mask, axis=-1), X, value)
+    else:
+        return tf.where(mask, X, value)
+
+    
+#@save
+class MaskedSoftmaxCELoss(tf.keras.losses.Loss):
+    """The softmax cross-entropy loss with masks."""
+    def __init__(self, valid_len):
+        super().__init__(reduction='none')
+        self.valid_len = valid_len
+    
+    # `pred` shape: (`batch_size`, `num_steps`, `vocab_size`)
+    # `label` shape: (`batch_size`, `num_steps`)
+    # `valid_len` shape: (`batch_size`,)
+    def call(self, label, pred):
+        weights = tf.ones_like(label, dtype=tf.float32)
+        weights = sequence_mask(weights, self.valid_len)
+        label_one_hot = tf.one_hot(label, depth=pred.shape[-1])
+        unweighted_loss = tf.keras.losses.CategoricalCrossentropy(
+            from_logits=True, reduction='none')(label_one_hot, pred)
+        weighted_loss = tf.reduce_mean(unweighted_loss * weights, axis=1)
+        return weighted_loss
+    
+#@save
+def train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device):
+    """Train a model for sequence to sequence."""
+    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
+    animator = d2l.Animator(xlabel="epoch", ylabel="loss",
+                            xlim=[10, num_epochs])
+    for epoch in range(num_epochs):
+        timer = d2l.Timer()
+        metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
+        for batch in data_iter:
+            X, X_valid_len, Y, Y_valid_len = [x for x in batch]
+            bos = tf.reshape(tf.constant([tgt_vocab['<bos>']] * Y.shape[0]),
+                             shape=(-1, 1))
+            dec_input = tf.concat([bos, Y[:, :-1]], 1)  # Teacher forcing
+            with tf.GradientTape() as tape:
+                Y_hat, _ = net(X, dec_input, X_valid_len, training=True)
+                l = MaskedSoftmaxCELoss(Y_valid_len)(Y, Y_hat)
+            gradients = tape.gradient(l, net.trainable_variables)
+            gradients = d2l.grad_clipping(gradients, 1)
+            optimizer.apply_gradients(zip(gradients, net.trainable_variables))
+            num_tokens = tf.reduce_sum(Y_valid_len).numpy()
+            metric.add(tf.reduce_sum(l), num_tokens)
+        if (epoch + 1) % 10 == 0:
+            animator.add(epoch + 1, (metric[0] / metric[1],))
+    print(f'loss {metric[0] / metric[1]:.3f}, {metric[1] / timer.stop():.1f} '
+          f'tokens/sec on {str(device._device_name)}')
+    
+#@save
+def predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps,
+                    save_attention_weights=False):
+    """Predict for sequence to sequence."""
+    src_tokens = src_vocab[src_sentence.lower().split(' ')] + [
+        src_vocab['<eos>']]
+    enc_valid_len = tf.constant([len(src_tokens)])
+    src_tokens = d2l.truncate_pad(src_tokens, num_steps, src_vocab['<pad>'])
+    # Add the batch axis
+    enc_X = tf.expand_dims(src_tokens, axis=0)
+    enc_outputs = net.encoder(enc_X, enc_valid_len, training=False)
+    dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)
+    # Add the batch axis
+    dec_X = tf.expand_dims(tf.constant([tgt_vocab['<bos>']]), axis=0)
+    output_seq, attention_weight_seq = [], []
+    for _ in range(num_steps):
+        Y, dec_state = net.decoder(dec_X, dec_state, training=False)
+        # We use the token with the highest prediction likelihood as input
+        # of the decoder at the next time step
+        dec_X = tf.argmax(Y, axis=2)
+        pred = tf.squeeze(dec_X, axis=0)
+        # Save attention weights
+        if save_attention_weights:
+            attention_weight_seq.append(net.decoder.attention_weights)
+        # Once the end-of-sequence token is predicted, the generation of the
+        # output sequence is complete
+        if pred == tgt_vocab['<eos>']:
+            break
+        output_seq.append(pred.numpy())
+    output_tokens = tf.reshape(output_seq, shape=-1).numpy().tolist()
+    return ' '.join(tgt_vocab.to_tokens(output_tokens)), attention_weight_seq
+```
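The masking idea behind `sequence_mask` and `MaskedSoftmaxCELoss` can be illustrated without any framework. The following is a minimal pure-Python sketch, not the book's implementation: it works on nested lists, and the per-token loss values are made up for illustration.

```python
def sequence_mask(rows, valid_len, value=0):
    """Replace entries at positions >= valid_len[i] in row i with `value`."""
    return [[x if t < n else value for t, x in enumerate(row)]
            for row, n in zip(rows, valid_len)]

X = [[1, 2, 3], [4, 5, 6]]
masked = sequence_mask(X, valid_len=[1, 2])
print(masked)  # [[1, 0, 0], [4, 5, 0]]

# Hypothetical per-token losses. Padded positions contribute zero, and, as in
# the TensorFlow loss above, the mean is still taken over all time steps.
token_losses = [[0.5, 0.7, 0.9], [0.2, 0.4, 0.6]]
weighted = sequence_mask(token_losses, valid_len=[1, 2])
per_sequence_loss = [sum(row) / len(row) for row in weighted]
print(per_sequence_loss)
```

Because the average runs over `num_steps` rather than over the valid length, shorter sequences yield smaller per-sequence losses; the training loop compensates by dividing the summed loss by the total number of valid tokens.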
diff --git a/chapter_appendix-tools-for-deep-learning/utils_origin.md b/chapter_appendix-tools-for-deep-learning/utils_origin.md
new file mode 100644
index 0000000..97905ea
--- /dev/null
+++ b/chapter_appendix-tools-for-deep-learning/utils_origin.md
@@ -0,0 +1,987 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# Utility Functions and Classes
+:label:`sec_utils`
+
+
+This section contains the implementations of utility functions and classes used in this book.
+
+```{.python .input}
+%%tab mxnet
+import inspect
+import collections
+import math
+import sys
+from d2l import mxnet as d2l
+from IPython import display
+from mxnet import autograd, gluon, init, np, npx
+from mxnet.gluon import nn
+import random
+npx.set_np()
+```
+
+```{.python .input  n=1}
+%%tab pytorch
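+import inspect
+import collections
+from d2l import torch as d2l
+from IPython import display
+import torch
+import torchvision
+from torch import nn
+from torch.utils import data
+from torchvision import transforms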
+```
+
+```{.python .input}
+%%tab tensorflow
+import inspect
+from IPython import display
+import collections
+from d2l import tensorflow as d2l
+import tensorflow as tf
+```
+
+Hyperparameters.
+
+```{.python .input}
+%%tab all
+@d2l.add_to_class(d2l.HyperParameters)  #@save
+def save_hyperparameters(self, ignore=[]):
+    """Save function arguments into class attributes."""
+    frame = inspect.currentframe().f_back
+    _, _, _, local_vars = inspect.getargvalues(frame)
+    self.hparams = {k:v for k, v in local_vars.items()
+                    if k not in set(ignore+['self']) and not k.startswith('_')}
+    for k, v in self.hparams.items():
+        setattr(self, k, v)
+```
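To see what `save_hyperparameters` does, here is a self-contained sketch that reproduces the same frame-inspection logic on a stand-in base class. `HyperParameters` and `ToyModel` below are hypothetical stand-ins defined only for this example, not the `d2l` classes.

```python
import inspect

class HyperParameters:
    """Stand-in base class (an assumption for this sketch)."""
    def save_hyperparameters(self, ignore=()):
        # Look at the caller's frame, i.e. the __init__ that invoked us
        frame = inspect.currentframe().f_back
        _, _, _, local_vars = inspect.getargvalues(frame)
        skipped = set(ignore) | {'self'}
        self.hparams = {k: v for k, v in local_vars.items()
                        if k not in skipped and not k.startswith('_')}
        for k, v in self.hparams.items():
            setattr(self, k, v)

class ToyModel(HyperParameters):
    def __init__(self, num_hidden, lr=0.1, _private=None):
        self.save_hyperparameters(ignore=['lr'])

model = ToyModel(num_hidden=8)
print(model.hparams)     # {'num_hidden': 8}
print(model.num_hidden)  # 8
```

Arguments listed in `ignore`, names starting with an underscore, and `self` are skipped; everything else becomes both an entry in `hparams` and an attribute.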
+
+Progress bar.
+
+```{.python .input  n=22}
+%%tab all
+@d2l.add_to_class(d2l.ProgressBoard)  #@save
+def draw(self, x, y, label, every_n=1):
+    Point = collections.namedtuple('Point', ['x', 'y'])
+    if not hasattr(self, 'raw_points'):
+        self.raw_points = collections.OrderedDict()
+        self.data = collections.OrderedDict()
+    if label not in self.raw_points:
+        self.raw_points[label] = []
+        self.data[label] = []    
+    points = self.raw_points[label]
+    line = self.data[label]
+    points.append(Point(x, y))
+    if len(points) != every_n:
+        return    
+    mean = lambda x: sum(x) / len(x)
+    line.append(Point(mean([p.x for p in points]), 
+                      mean([p.y for p in points])))
+    points.clear()
+    if not self.display: 
+        return
+    d2l.use_svg_display()
+    if self.fig is None:
+        self.fig = d2l.plt.figure(figsize=self.figsize)
+    plt_lines, labels = [], []
+    for (k, v), ls, color in zip(self.data.items(), self.ls, self.colors):        
+        plt_lines.append(d2l.plt.plot([p.x for p in v], [p.y for p in v], 
+                                      linestyle=ls, color=color)[0])
+        labels.append(k)        
+    axes = self.axes if self.axes else d2l.plt.gca()
+    if self.xlim: axes.set_xlim(self.xlim)
+    if self.ylim: axes.set_ylim(self.ylim)
+    if not self.xlabel: self.xlabel = self.x    
+    axes.set_xlabel(self.xlabel)
+    axes.set_ylabel(self.ylabel)
+    axes.set_xscale(self.xscale)
+    axes.set_yscale(self.yscale)
+    axes.legend(plt_lines, labels)    
+    display.display(self.fig)
+    display.clear_output(wait=True)
+```
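The `every_n` argument of `draw` smooths the curve by buffering raw points and emitting one averaged point per `every_n` calls. A minimal sketch of just that buffering step, with plotting left out and a hypothetical helper name `smooth`, could look like this:

```python
import collections

Point = collections.namedtuple('Point', ['x', 'y'])

def smooth(raw_points, every_n):
    """Average every `every_n` consecutive points into a single point."""
    buffer, line = [], []
    for x, y in raw_points:
        buffer.append(Point(x, y))
        if len(buffer) != every_n:
            continue  # Keep buffering until `every_n` points are available
        mean = lambda vals: sum(vals) / len(vals)
        line.append(Point(mean([p.x for p in buffer]),
                          mean([p.y for p in buffer])))
        buffer.clear()
    return line

smoothed = smooth([(0, 1.0), (1, 3.0), (2, 5.0), (3, 7.0), (4, 9.0)],
                  every_n=2)
print(smoothed)  # [Point(x=0.5, y=2.0), Point(x=2.5, y=6.0)]
```

The trailing point `(4, 9.0)` stays in the buffer and is not plotted until another point arrives, which matches the early `return` in `draw`.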
+
+Trainer
+
+The following functions will be deprecated:
+
+```{.python .input}
+%%tab mxnet
+def load_array(data_arrays, batch_size, is_train=True):  #@save
+    """Construct a Gluon data iterator."""
+    dataset = gluon.data.ArrayDataset(*data_arrays)
+    return gluon.data.DataLoader(dataset, batch_size, shuffle=is_train)
+
+def synthetic_data(w, b, num_examples):  #@save
+    """Generate y = Xw + b + noise."""
+    X = d2l.normal(0, 1, (num_examples, len(w)))
+    y = d2l.matmul(X, w) + b
+    y += d2l.normal(0, 0.01, y.shape)
+    return X, d2l.reshape(y, (-1, 1))
+
+def sgd(params, lr, batch_size):  #@save
+    """Minibatch stochastic gradient descent."""
+    for param in params:
+        param[:] = param - lr * param.grad / batch_size
+
+def get_dataloader_workers():  #@save
+    """Use 4 processes to read the data except for Windows."""
+    return 0 if sys.platform.startswith('win') else 4
+
+def load_data_fashion_mnist(batch_size, resize=None):  #@save
+    """Download the Fashion-MNIST dataset and then load it into memory."""
+    dataset = gluon.data.vision
+    trans = [dataset.transforms.ToTensor()]
+    if resize:
+        trans.insert(0, dataset.transforms.Resize(resize))
+    trans = dataset.transforms.Compose(trans)
+    mnist_train = dataset.FashionMNIST(train=True).transform_first(trans)
+    mnist_test = dataset.FashionMNIST(train=False).transform_first(trans)
+    return (gluon.data.DataLoader(mnist_train, batch_size, shuffle=True,
+                                  num_workers=get_dataloader_workers()),
+            gluon.data.DataLoader(mnist_test, batch_size, shuffle=False,
+                                  num_workers=get_dataloader_workers()))
+
+def evaluate_accuracy_gpu(net, data_iter, device=None):  #@save
+    """Compute the accuracy for a model on a dataset using a GPU."""
+    if not device:  # Query the first device where the first parameter is on
+        device = list(net.collect_params().values())[0].list_ctx()[0]
+    # No. of correct predictions, no. of predictions
+    metric = d2l.Accumulator(2)
+    for X, y in data_iter:
+        X, y = X.as_in_ctx(device), y.as_in_ctx(device)
+        metric.add(d2l.accuracy(net(X), y), d2l.size(y))
+    return metric[0] / metric[1]
+
+#@save
+def train_ch6(net, train_iter, test_iter, num_epochs, lr, device):
+    """Train a model with a GPU (defined in Chapter 6)."""
+    net.initialize(force_reinit=True, ctx=device, init=init.Xavier())
+    loss = gluon.loss.SoftmaxCrossEntropyLoss()
+    trainer = gluon.Trainer(net.collect_params(),
+                            'sgd', {'learning_rate': lr})
+    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
+                            legend=['train loss', 'train acc', 'test acc'])
+    timer, num_batches = d2l.Timer(), len(train_iter)
+    for epoch in range(num_epochs):
+        # Sum of training loss, sum of training accuracy, no. of examples
+        metric = d2l.Accumulator(3)
+        for i, (X, y) in enumerate(train_iter):
+            timer.start()
+            # Here is the major difference from `d2l.train_epoch_ch3`
+            X, y = X.as_in_ctx(device), y.as_in_ctx(device)
+            with autograd.record():
+                y_hat = net(X)
+                l = loss(y_hat, y)
+            l.backward()
+            trainer.step(X.shape[0])
+            metric.add(l.sum(), d2l.accuracy(y_hat, y), X.shape[0])
+            timer.stop()
+            train_l = metric[0] / metric[2]
+            train_acc = metric[1] / metric[2]
+            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
+                animator.add(epoch + (i + 1) / num_batches,
+                             (train_l, train_acc, None))
+        test_acc = evaluate_accuracy_gpu(net, test_iter)
+        animator.add(epoch + 1, (None, None, test_acc))
+    print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
+          f'test acc {test_acc:.3f}')
+    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
+          f'on {str(device)}')
+    
+def grad_clipping(net, theta):  #@save
+    """Clip the gradient."""
+    if isinstance(net, gluon.Block):
+        params = [p.data() for p in net.collect_params().values()]
+    else:
+        params = net.params
+    norm = math.sqrt(sum((p.grad ** 2).sum() for p in params))
+    if norm > theta:
+        for param in params:
+            param.grad[:] *= theta / norm
+```
+
+```{.python .input}
+%%tab pytorch
+
+def load_array(data_arrays, batch_size, is_train=True):  #@save
+    """Construct a PyTorch data iterator."""
+    dataset = data.TensorDataset(*data_arrays)
+    return data.DataLoader(dataset, batch_size, shuffle=is_train)
+
+def synthetic_data(w, b, num_examples):  #@save
+    """Generate y = Xw + b + noise."""
+    X = d2l.normal(0, 1, (num_examples, len(w)))
+    y = d2l.matmul(X, w) + b
+    y += d2l.normal(0, 0.01, y.shape)
+    return X, d2l.reshape(y, (-1, 1))
+
+def sgd(params, lr, batch_size): #@save
+    """Minibatch stochastic gradient descent."""
+    with torch.no_grad():
+        for param in params:
+            param -= lr * param.grad / batch_size
+            param.grad.zero_()
+
+def get_dataloader_workers():  #@save
+    """Use 4 processes to read the data."""
+    return 4
+
+def load_data_fashion_mnist(batch_size, resize=None):  #@save
+    """Download the Fashion-MNIST dataset and then load it into memory."""
+    trans = [transforms.ToTensor()]
+    if resize:
+        trans.insert(0, transforms.Resize(resize))
+    trans = transforms.Compose(trans)
+    mnist_train = torchvision.datasets.FashionMNIST(
+        root="../data", train=True, transform=trans, download=True)
+    mnist_test = torchvision.datasets.FashionMNIST(
+        root="../data", train=False, transform=trans, download=True)
+    return (data.DataLoader(mnist_train, batch_size, shuffle=True,
+                            num_workers=get_dataloader_workers()),
+            data.DataLoader(mnist_test, batch_size, shuffle=False,
+                            num_workers=get_dataloader_workers()))
+
+def evaluate_accuracy_gpu(net, data_iter, device=None): #@save
+    """Compute the accuracy for a model on a dataset using a GPU."""
+    if isinstance(net, nn.Module):
+        net.eval()  # Set the model to evaluation mode
+        if not device:
+            device = next(iter(net.parameters())).device
+    # No. of correct predictions, no. of predictions
+    metric = d2l.Accumulator(2)
+
+    with torch.no_grad():
+        for X, y in data_iter:
+            if isinstance(X, list):
+                # Required for BERT Fine-tuning (to be covered later)
+                X = [x.to(device) for x in X]
+            else:
+                X = X.to(device)
+            y = y.to(device)
+            metric.add(d2l.accuracy(net(X), y), d2l.size(y))
+    return metric[0] / metric[1]
+
+
+#@save
+def train_ch6(net, train_iter, test_iter, num_epochs, lr, device):
+    """Train a model with a GPU (defined in Chapter 6)."""
+    def init_weights(m):
+        if type(m) == nn.Linear or type(m) == nn.Conv2d:
+            nn.init.xavier_uniform_(m.weight)
+    net.apply(init_weights)
+    print('training on', device)
+    net.to(device)
+    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
+    loss = nn.CrossEntropyLoss()
+    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
+                            legend=['train loss', 'train acc', 'test acc'])
+    timer, num_batches = d2l.Timer(), len(train_iter)
+    for epoch in range(num_epochs):
+        # Sum of training loss, sum of training accuracy, no. of examples
+        metric = d2l.Accumulator(3)
+        net.train()
+        for i, (X, y) in enumerate(train_iter):
+            timer.start()
+            optimizer.zero_grad()
+            X, y = X.to(device), y.to(device)
+            y_hat = net(X)
+            l = loss(y_hat, y)
+            l.backward()
+            optimizer.step()
+            with torch.no_grad():
+                metric.add(l * X.shape[0], d2l.accuracy(y_hat, y), X.shape[0])
+            timer.stop()
+            train_l = metric[0] / metric[2]
+            train_acc = metric[1] / metric[2]
+            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
+                animator.add(epoch + (i + 1) / num_batches,
+                             (train_l, train_acc, None))
+        test_acc = evaluate_accuracy_gpu(net, test_iter)
+        animator.add(epoch + 1, (None, None, test_acc))
+    print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
+          f'test acc {test_acc:.3f}')
+    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
+          f'on {str(device)}')
+```
+
+```{.python .input}
+%%tab tensorflow
+
+def load_array(data_arrays, batch_size, is_train=True):  #@save
+    """Construct a TensorFlow data iterator."""
+    dataset = tf.data.Dataset.from_tensor_slices(data_arrays)
+    if is_train:
+        dataset = dataset.shuffle(buffer_size=1000)
+    dataset = dataset.batch(batch_size)
+    return dataset
+
+def synthetic_data(w, b, num_examples):  #@save
+    """Generate y = Xw + b + noise."""
+    X = tf.zeros((num_examples, w.shape[0]))
+    X += tf.random.normal(shape=X.shape)
+    y = tf.matmul(X, tf.reshape(w, (-1, 1))) + b
+    y += tf.random.normal(shape=y.shape, stddev=0.01)
+    y = tf.reshape(y, (-1, 1))
+    return X, y
+
+
+def sgd(params, grads, lr, batch_size):  #@save
+    """Minibatch stochastic gradient descent."""
+    for param, grad in zip(params, grads):
+        param.assign_sub(lr * grad / batch_size)
+
+def load_data_fashion_mnist(batch_size, resize=None):   #@save
+    """Download the Fashion-MNIST dataset and then load it into memory."""
+    mnist_train, mnist_test = tf.keras.datasets.fashion_mnist.load_data()
+    # Divide all numbers by 255 so that all pixel values are between 0 and
+    # 1, add a channel dimension at the end, and cast labels to int32
+    process = lambda X, y: (tf.expand_dims(X, axis=3) / 255,
+                            tf.cast(y, dtype='int32'))
+    resize_fn = lambda X, y: (
+        tf.image.resize_with_pad(X, resize, resize) if resize else X, y)
+    return (
+        tf.data.Dataset.from_tensor_slices(process(*mnist_train)).batch(
+            batch_size).shuffle(len(mnist_train[0])).map(resize_fn),
+        tf.data.Dataset.from_tensor_slices(process(*mnist_test)).batch(
+            batch_size).map(resize_fn))
+
+class TrainCallback(tf.keras.callbacks.Callback):  #@save
+    """A callback to visualize the training progress."""
+    def __init__(self, net, train_iter, test_iter, num_epochs, device_name):
+        self.timer = d2l.Timer()
+        self.animator = d2l.Animator(
+            xlabel='epoch', xlim=[1, num_epochs], legend=[
+                'train loss', 'train acc', 'test acc'])
+        self.net = net
+        self.train_iter = train_iter
+        self.test_iter = test_iter
+        self.num_epochs = num_epochs
+        self.device_name = device_name
+    def on_epoch_begin(self, epoch, logs=None):
+        self.timer.start()
+    def on_epoch_end(self, epoch, logs):
+        self.timer.stop()
+        test_acc = self.net.evaluate(
+            self.test_iter, verbose=0, return_dict=True)['accuracy']
+        metrics = (logs['loss'], logs['accuracy'], test_acc)
+        self.animator.add(epoch + 1, metrics)
+        if epoch == self.num_epochs - 1:
+            batch_size = next(iter(self.train_iter))[0].shape[0]
+            num_examples = batch_size * tf.data.experimental.cardinality(
+                self.train_iter).numpy()
+            print(f'loss {metrics[0]:.3f}, train acc {metrics[1]:.3f}, '
+                  f'test acc {metrics[2]:.3f}')
+            print(f'{num_examples / self.timer.avg():.1f} examples/sec on '
+                  f'{str(self.device_name)}')
+
+#@save
+def train_ch6(net_fn, train_iter, test_iter, num_epochs, lr, device):
+    """Train a model with a GPU (defined in Chapter 6)."""
+    device_name = device._device_name
+    strategy = tf.distribute.OneDeviceStrategy(device_name)
+    with strategy.scope():
+        optimizer = tf.keras.optimizers.SGD(learning_rate=lr)
+        loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
+        net = net_fn()
+        net.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
+    callback = TrainCallback(net, train_iter, test_iter, num_epochs,
+                             device_name)
+    net.fit(train_iter, epochs=num_epochs, verbose=0, callbacks=[callback])
+    return net
+```
+
+```{.python .input}
+%%tab mxnet, tensorflow
+def evaluate_accuracy(net, data_iter):  #@save
+    """Compute the accuracy for a model on a dataset."""
+    # No. of correct predictions, no. of predictions
+    metric = d2l.Accumulator(2)
+    for X, y in data_iter:
+        metric.add(d2l.accuracy(net(X), y), d2l.size(y))
+    return metric[0] / metric[1]
+```
+
+```{.python .input}
+%%tab all
+
+def linreg(X, w, b):  #@save
+    """The linear regression model."""
+    return d2l.matmul(X, w) + b
+
+def squared_loss(y_hat, y):  #@save
+    """Squared loss."""
+    return (y_hat - d2l.reshape(y, y_hat.shape)) ** 2 / 2
+
+def get_fashion_mnist_labels(labels):  #@save
+    """Return text labels for the Fashion-MNIST dataset."""
+    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
+                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
+    return [text_labels[int(i)] for i in labels]
+
+def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):  #@save
+    """Plot a list of images."""
+    figsize = (num_cols * scale, num_rows * scale)
+    _, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize)
+    axes = axes.flatten()
+    for i, (ax, img) in enumerate(zip(axes, imgs)):
+        try:
+            img = d2l.numpy(img)
+        except:
+            pass
+        ax.imshow(img)
+        ax.axes.get_xaxis().set_visible(False)
+        ax.axes.get_yaxis().set_visible(False)
+        if titles:
+            ax.set_title(titles[i])
+    return axes
+
+class Animator:  #@save
+    """For plotting data in animation."""
+    def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
+                 ylim=None, xscale='linear', yscale='linear',
+                 fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
+                 figsize=(3.5, 2.5)):
+        # Incrementally plot multiple lines
+        if legend is None:
+            legend = []
+        d2l.use_svg_display()
+        self.fig, self.axes = d2l.plt.subplots(nrows, ncols, figsize=figsize)
+        if nrows * ncols == 1:
+            self.axes = [self.axes, ]
+        # Use a lambda function to capture arguments
+        self.config_axes = lambda: d2l.set_axes(
+            self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
+        self.X, self.Y, self.fmts = None, None, fmts
+
+    def add(self, x, y):
+        # Add multiple data points into the figure
+        if not hasattr(y, "__len__"):
+            y = [y]
+        n = len(y)
+        if not hasattr(x, "__len__"):
+            x = [x] * n
+        if not self.X:
+            self.X = [[] for _ in range(n)]
+        if not self.Y:
+            self.Y = [[] for _ in range(n)]
+        for i, (a, b) in enumerate(zip(x, y)):
+            if a is not None and b is not None:
+                self.X[i].append(a)
+                self.Y[i].append(b)
+        self.axes[0].cla()
+        for x, y, fmt in zip(self.X, self.Y, self.fmts):
+            self.axes[0].plot(x, y, fmt)
+        self.config_axes()
+        display.display(self.fig)
+        display.clear_output(wait=True)
+        
+class Accumulator:  #@save
+    """For accumulating sums over `n` variables."""
+    def __init__(self, n):
+        self.data = [0.0] * n
+
+    def add(self, *args):
+        self.data = [a + float(b) for a, b in zip(self.data, args)]
+
+    def reset(self):
+        self.data = [0.0] * len(self.data)
+
+    def __getitem__(self, idx):
+        return self.data[idx]        
+    
+    
+def accuracy(y_hat, y):  #@save
+    """Compute the number of correct predictions."""
+    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
+        y_hat = d2l.argmax(y_hat, axis=1)
+    cmp = d2l.astype(y_hat, y.dtype) == y
+    return float(d2l.reduce_sum(d2l.astype(cmp, y.dtype)))
+```
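As a usage illustration, the sketch below shows how an accumulator of this kind combines per-batch counts into an overall accuracy. It redefines a minimal `Accumulator` so that it runs on its own; `num_correct` and the toy batches are assumptions made for the example and use plain Python lists instead of tensors.

```python
class Accumulator:
    """Minimal copy of the accumulator pattern, for a standalone example."""
    def __init__(self, n):
        self.data = [0.0] * n

    def add(self, *args):
        self.data = [a + float(b) for a, b in zip(self.data, args)]

    def __getitem__(self, idx):
        return self.data[idx]

def num_correct(pred_labels, true_labels):
    """Count matching labels (hypothetical stand-in for `accuracy`)."""
    return sum(int(p == t) for p, t in zip(pred_labels, true_labels))

# Two toy minibatches of (predicted labels, true labels)
batches = [([0, 1, 1], [0, 1, 0]), ([2, 2], [2, 2])]
metric = Accumulator(2)  # No. of correct predictions, no. of predictions
for preds, labels in batches:
    metric.add(num_correct(preds, labels), len(labels))
overall_accuracy = metric[0] / metric[1]
print(overall_accuracy)  # 0.8
```

Accumulating counts rather than per-batch accuracies keeps the result correct when the last minibatch is smaller than the others.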
+
+```{.python .input}
+%%tab all
+
+import os
+import requests
+import zipfile
+import tarfile
+import hashlib
+
+def download(url, folder='../data', sha1_hash=None):  #@save
+    """Download a file to folder and return the local filepath."""
+    if not url.startswith('http'):
+        # For backward compatibility
+        url, sha1_hash = d2l.DATA_HUB[url]
+    os.makedirs(folder, exist_ok=True)
+    fname = os.path.join(folder, url.split('/')[-1])
+    # Check if hit cache
+    if os.path.exists(fname) and sha1_hash:
+        sha1 = hashlib.sha1()
+        with open(fname, 'rb') as f:
+            while True:
+                data = f.read(1048576)
+                if not data:
+                    break
+                sha1.update(data)
+        if sha1.hexdigest() == sha1_hash:
+            return fname
+    # Download
+    print(f'Downloading {fname} from {url}...')
+    r = requests.get(url, stream=True, verify=True)
+    with open(fname, 'wb') as f:
+        f.write(r.content)
+    return fname
+
+def extract(filename, folder=None):  #@save
+    """Extract a zip/tar file into folder."""
+    base_dir = os.path.dirname(filename)
+    _, ext = os.path.splitext(filename)
+    assert ext in ('.zip', '.tar', '.gz'), 'Only support zip/tar files.'
+    if ext == '.zip':
+        fp = zipfile.ZipFile(filename, 'r')
+    else:
+        fp = tarfile.open(filename, 'r')
+    if folder is None:
+        folder = base_dir
+    fp.extractall(folder)
+```
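The cache check in `download` hashes the local file in 1 MiB chunks and compares the digest with the expected SHA-1. The sketch below isolates that step; `sha1_of_file` is a hypothetical helper name, and the file is a temporary one created only for the demonstration.

```python
import hashlib
import os
import tempfile

def sha1_of_file(fname, chunk_size=1048576):
    """Hash a file incrementally so large files need not fit in memory."""
    sha1 = hashlib.sha1()
    with open(fname, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            sha1.update(chunk)
    return sha1.hexdigest()

payload = b'hello d2l' * 1000
with tempfile.TemporaryDirectory() as tmp_dir:
    fname = os.path.join(tmp_dir, 'example.bin')
    with open(fname, 'wb') as f:
        f.write(payload)
    expected = hashlib.sha1(payload).hexdigest()
    # A matching digest means the cached file can be reused
    cache_hit = sha1_of_file(fname, chunk_size=1024) == expected
    # A mismatching digest means the file has to be downloaded again
    cache_miss = sha1_of_file(fname) == '0' * 40
print(cache_hit, cache_miss)
```

The chunk size affects only memory use, not the resulting digest.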
+
+```{.python .input}
+%%tab all
+
+def download_extract(name, folder=None):  #@save
+    """Download and extract a zip/tar file."""
+    fname = download(name)
+    base_dir = os.path.dirname(fname)
+    data_dir, ext = os.path.splitext(fname)
+    if ext == '.zip':
+        fp = zipfile.ZipFile(fname, 'r')
+    elif ext in ('.tar', '.gz'):
+        fp = tarfile.open(fname, 'r')
+    else:
+        assert False, 'Only zip/tar files can be extracted.'
+    fp.extractall(base_dir)
+    return os.path.join(base_dir, folder) if folder else data_dir
+
+
+def tokenize(lines, token='word'):  #@save
+    """Split text lines into word or character tokens."""
+    assert token in ('word', 'char'), 'Unknown token type: ' + token
+    return [line.split() if token == 'word' else list(line) for line in lines]
+
+```
+
+```{.python .input}
+%%tab pytorch
+
+def evaluate_loss(net, data_iter, loss):  #@save
+    """Evaluate the loss of a model on the given dataset."""
+    metric = d2l.Accumulator(2)  # Sum of losses, no. of examples
+    for X, y in data_iter:
+        out = net(X)
+        y = d2l.reshape(y, out.shape)
+        l = loss(out, y)
+        metric.add(d2l.reduce_sum(l), d2l.size(l))
+    return metric[0] / metric[1]
+```
+
+```{.python .input}
+%%tab mxnet, tensorflow
+def evaluate_loss(net, data_iter, loss):  #@save
+    """Evaluate the loss of a model on the given dataset."""
+    metric = d2l.Accumulator(2)  # Sum of losses, no. of examples
+    for X, y in data_iter:
+        l = loss(net(X), y)
+        metric.add(d2l.reduce_sum(l), d2l.size(l))
+    return metric[0] / metric[1]
+```
+
+```{.python .input}
+%%tab pytorch
+def grad_clipping(net, theta):  #@save
+    """Clip the gradient."""
+    if isinstance(net, nn.Module):
+        params = [p for p in net.parameters() if p.requires_grad]
+    else:
+        params = net.params
+    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
+    if norm > theta:
+        for param in params:
+            param.grad[:] *= theta / norm
+```
+
+```{.python .input}
+%%tab tensorflow
+def grad_clipping(grads, theta):  #@save
+    """Clip the gradient."""
+    theta = tf.constant(theta, dtype=tf.float32)
+    new_grad = []
+    for grad in grads:
+        if isinstance(grad, tf.IndexedSlices):
+            new_grad.append(tf.convert_to_tensor(grad))
+        else:
+            new_grad.append(grad)
+    norm = tf.math.sqrt(sum((tf.reduce_sum(grad ** 2)).numpy()
+                        for grad in new_grad))
+    norm = tf.cast(norm, tf.float32)
+    if tf.greater(norm, theta):
+        for i, grad in enumerate(new_grad):
+            new_grad[i] = grad * theta / norm
+    return new_grad
+```
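All three `grad_clipping` variants rescale the gradients so that their joint L2 norm does not exceed `theta`. The arithmetic can be checked with a small framework-free sketch; `clip_by_global_norm` is a hypothetical name, and the gradients are plain Python lists chosen so that the norm is exactly 5.

```python
import math

def clip_by_global_norm(grads, theta):
    """Scale all gradients by theta / norm when the global norm exceeds theta."""
    norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if norm > theta:
        return [[g * theta / norm for g in grad] for grad in grads], norm
    return [list(grad) for grad in grads], norm

grads = [[3.0, 0.0], [0.0, 4.0]]  # Global norm: sqrt(9 + 16) = 5
clipped, norm = clip_by_global_norm(grads, theta=1.0)
clipped_norm = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm, clipped, clipped_norm)
```

The direction of the overall gradient is preserved, since every component is multiplied by the same factor `theta / norm`.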
+
+More for the attention chapter.
+
+```{.python .input}
+%%tab all
+#@save
+d2l.DATA_HUB['fra-eng'] = (d2l.DATA_URL + 'fra-eng.zip',
+                           '94646ad1522d915e7b0f9296181140edcf86a4f5')
+
+#@save
+def read_data_nmt():
+    """Load the English-French dataset."""
+    data_dir = d2l.download_extract('fra-eng')
+    with open(os.path.join(data_dir, 'fra.txt'), 'r') as f:
+        return f.read()
+
+#@save
+def preprocess_nmt(text):
+    """Preprocess the English-French dataset."""
+    def no_space(char, prev_char):
+        return char in set(',.!?') and prev_char != ' '
+
+    # Replace non-breaking space with space, and convert uppercase letters to
+    # lowercase ones
+    text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()
+    # Insert space between words and punctuation marks
+    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
+           for i, char in enumerate(text)]
+    return ''.join(out)
+
+#@save
+def tokenize_nmt(text, num_examples=None):
+    """Tokenize the English-French dataset."""
+    source, target = [], []
+    for i, line in enumerate(text.split('\n')):
+        if num_examples and i > num_examples:
+            break
+        parts = line.split('\t')
+        if len(parts) == 2:
+            source.append(parts[0].split(' '))
+            target.append(parts[1].split(' '))
+    return source, target
+
+    
+#@save
+def truncate_pad(line, num_steps, padding_token):
+    """Truncate or pad sequences."""
+    if len(line) > num_steps:
+        return line[:num_steps]  # Truncate
+    return line + [padding_token] * (num_steps - len(line))  # Pad
+
+
+#@save
+def build_array_nmt(lines, vocab, num_steps):
+    """Transform text sequences of machine translation into minibatches."""
+    lines = [vocab[l] for l in lines]
+    lines = [l + [vocab['<eos>']] for l in lines]
+    array = d2l.tensor([truncate_pad(
+        l, num_steps, vocab['<pad>']) for l in lines])
+    valid_len = d2l.reduce_sum(
+        d2l.astype(array != vocab['<pad>'], d2l.int32), 1)
+    return array, valid_len
+
+
+#@save
+def load_data_nmt(batch_size, num_steps, num_examples=600):
+    """Return the iterator and the vocabularies of the translation dataset."""
+    text = preprocess_nmt(read_data_nmt())
+    source, target = tokenize_nmt(text, num_examples)
+    src_vocab = d2l.Vocab(source, min_freq=2,
+                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
+    tgt_vocab = d2l.Vocab(target, min_freq=2,
+                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
+    src_array, src_valid_len = build_array_nmt(source, src_vocab, num_steps)
+    tgt_array, tgt_valid_len = build_array_nmt(target, tgt_vocab, num_steps)
+    data_arrays = (src_array, src_valid_len, tgt_array, tgt_valid_len)
+    data_iter = d2l.load_array(data_arrays, batch_size)
+    return data_iter, src_vocab, tgt_vocab
+```
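A short standalone example may help to show how `truncate_pad` and the valid-length computation in `build_array_nmt` interact. The token ids below (`pad = 0`, `eos = 2`) are hypothetical values picked for illustration, and the function is repeated so that the snippet runs on its own.

```python
def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad a sequence to exactly `num_steps` tokens."""
    if len(line) > num_steps:
        return line[:num_steps]  # Truncate
    return line + [padding_token] * (num_steps - len(line))  # Pad

pad, eos = 0, 2  # Hypothetical vocabulary indices
short = truncate_pad([5, 6, eos], 5, pad)
long = truncate_pad([5, 6, 7, 8, 9, eos], 5, pad)
print(short)  # [5, 6, 2, 0, 0]
print(long)   # [5, 6, 7, 8, 9]

# Valid length = number of non-padding tokens in each sequence
valid_lens = [sum(tok != pad for tok in seq) for seq in (short, long)]
print(valid_lens)  # [3, 5]
```

Note that truncation drops the `<eos>` token of the longer sequence, so its valid length equals `num_steps`.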
+
+```{.python .input}
+%%tab mxnet
+    
+#@save
+class MaskedSoftmaxCELoss(gluon.loss.SoftmaxCELoss):
+    """The softmax cross-entropy loss with masks."""
+    # `pred` shape: (`batch_size`, `num_steps`, `vocab_size`)
+    # `label` shape: (`batch_size`, `num_steps`)
+    # `valid_len` shape: (`batch_size`,)
+    def forward(self, pred, label, valid_len):
+        # `weights` shape: (`batch_size`, `num_steps`, 1)
+        weights = np.expand_dims(np.ones_like(label), axis=-1)
+        weights = npx.sequence_mask(weights, valid_len, True, axis=1)
+        return super(MaskedSoftmaxCELoss, self).forward(pred, label, weights)
+
+#@save
+def train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device):
+    """Train a model for sequence to sequence."""
+    net.initialize(init.Xavier(), force_reinit=True, ctx=device)
+    trainer = gluon.Trainer(net.collect_params(), 'adam',
+                            {'learning_rate': lr})
+    loss = MaskedSoftmaxCELoss()
+    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
+                            xlim=[10, num_epochs])
+    for epoch in range(num_epochs):
+        timer = d2l.Timer()
+        metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
+        for batch in data_iter:
+            X, X_valid_len, Y, Y_valid_len = [
+                x.as_in_ctx(device) for x in batch]
+            bos = np.array(
+                [tgt_vocab['<bos>']] * Y.shape[0], ctx=device).reshape(-1, 1)
+            dec_input = d2l.concat([bos, Y[:, :-1]], 1)  # Teacher forcing
+            with autograd.record():
+                Y_hat, _ = net(X, dec_input, X_valid_len)
+                l = loss(Y_hat, Y, Y_valid_len)
+            l.backward()
+            d2l.grad_clipping(net, 1)
+            num_tokens = Y_valid_len.sum()
+            trainer.step(num_tokens)
+            metric.add(l.sum(), num_tokens)
+        if (epoch + 1) % 10 == 0:
+            animator.add(epoch + 1, (metric[0] / metric[1],))
+    print(f'loss {metric[0] / metric[1]:.3f}, {metric[1] / timer.stop():.1f} '
+          f'tokens/sec on {str(device)}')
+
+#@save
+def predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps,
+                    device, save_attention_weights=False):
+    """Predict for sequence to sequence."""
+    src_tokens = src_vocab[src_sentence.lower().split(' ')] + [
+        src_vocab['<eos>']]
+    enc_valid_len = np.array([len(src_tokens)], ctx=device)
+    src_tokens = d2l.truncate_pad(src_tokens, num_steps, src_vocab['<pad>'])
+    # Add the batch axis
+    enc_X = np.expand_dims(np.array(src_tokens, ctx=device), axis=0)
+    enc_outputs = net.encoder(enc_X, enc_valid_len)
+    dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)
+    # Add the batch axis
+    dec_X = np.expand_dims(np.array([tgt_vocab['<bos>']], ctx=device), axis=0)
+    output_seq, attention_weight_seq = [], []
+    for _ in range(num_steps):
+        Y, dec_state = net.decoder(dec_X, dec_state)
+        # We use the token with the highest prediction likelihood as input
+        # of the decoder at the next time step
+        dec_X = Y.argmax(axis=2)
+        pred = dec_X.squeeze(axis=0).astype('int32').item()
+        # Save attention weights (to be covered later)
+        if save_attention_weights:
+            attention_weight_seq.append(net.decoder.attention_weights)
+        # Once the end-of-sequence token is predicted, the generation of the
+        # output sequence is complete
+        if pred == tgt_vocab['<eos>']:
+            break
+        output_seq.append(pred)
+    return ' '.join(tgt_vocab.to_tokens(output_seq)), attention_weight_seq
+```
+
+```{.python .input}
+%%tab pytorch
+#@save
+def sequence_mask(X, valid_len, value=0):
+    """Mask irrelevant entries in sequences."""
+    maxlen = X.size(1)
+    mask = torch.arange((maxlen), dtype=torch.float32,
+                        device=X.device)[None, :] < valid_len[:, None]
+    X[~mask] = value
+    return X
+
+    
+#@save
+class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
+    """The softmax cross-entropy loss with masks."""
+    # `pred` shape: (`batch_size`, `num_steps`, `vocab_size`)
+    # `label` shape: (`batch_size`, `num_steps`)
+    # `valid_len` shape: (`batch_size`,)
+    def forward(self, pred, label, valid_len):
+        weights = torch.ones_like(label)
+        weights = sequence_mask(weights, valid_len)
+        self.reduction='none'
+        unweighted_loss = super(MaskedSoftmaxCELoss, self).forward(
+            pred.permute(0, 2, 1), label)
+        weighted_loss = (unweighted_loss * weights).mean(dim=1)
+        return weighted_loss
+    
+#@save
+def train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device):
+    """Train a model for sequence to sequence."""
+    def xavier_init_weights(m):
+        if type(m) == nn.Linear:
+            nn.init.xavier_uniform_(m.weight)
+        if type(m) == nn.GRU:
+            for param in m._flat_weights_names:
+                if "weight" in param:
+                    nn.init.xavier_uniform_(m._parameters[param])
+    net.apply(xavier_init_weights)
+    net.to(device)
+    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
+    loss = MaskedSoftmaxCELoss()
+    net.train()
+    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
+                            xlim=[10, num_epochs])
+    for epoch in range(num_epochs):
+        timer = d2l.Timer()
+        metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
+        for batch in data_iter:
+            optimizer.zero_grad()
+            X, X_valid_len, Y, Y_valid_len = [x.to(device) for x in batch]
+            bos = torch.tensor([tgt_vocab['<bos>']] * Y.shape[0],
+                               device=device).reshape(-1, 1)
+            dec_input = d2l.concat([bos, Y[:, :-1]], 1)  # Teacher forcing
+            Y_hat, _ = net(X, dec_input, X_valid_len)
+            l = loss(Y_hat, Y, Y_valid_len)
+            l.sum().backward()  # Make the loss scalar for `backward`
+            d2l.grad_clipping(net, 1)
+            num_tokens = Y_valid_len.sum()
+            optimizer.step()
+            with torch.no_grad():
+                metric.add(l.sum(), num_tokens)
+        if (epoch + 1) % 10 == 0:
+            animator.add(epoch + 1, (metric[0] / metric[1],))
+    print(f'loss {metric[0] / metric[1]:.3f}, {metric[1] / timer.stop():.1f} '
+          f'tokens/sec on {str(device)}')
+    
+
+#@save
+def predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps,
+                    device, save_attention_weights=False):
+    """Predict for sequence to sequence."""
+    # Set `net` to eval mode for inference
+    net.eval()
+    src_tokens = src_vocab[src_sentence.lower().split(' ')] + [
+        src_vocab['<eos>']]
+    enc_valid_len = torch.tensor([len(src_tokens)], device=device)
+    src_tokens = d2l.truncate_pad(src_tokens, num_steps, src_vocab['<pad>'])
+    # Add the batch axis
+    enc_X = torch.unsqueeze(
+        torch.tensor(src_tokens, dtype=torch.long, device=device), dim=0)
+    enc_outputs = net.encoder(enc_X, enc_valid_len)
+    dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)
+    # Add the batch axis
+    dec_X = torch.unsqueeze(torch.tensor(
+        [tgt_vocab['<bos>']], dtype=torch.long, device=device), dim=0)
+    output_seq, attention_weight_seq = [], []
+    for _ in range(num_steps):
+        Y, dec_state = net.decoder(dec_X, dec_state)
+        # We use the token with the highest prediction likelihood as input
+        # of the decoder at the next time step
+        dec_X = Y.argmax(dim=2)
+        pred = dec_X.squeeze(dim=0).type(torch.int32).item()
+        # Save attention weights (to be covered later)
+        if save_attention_weights:
+            attention_weight_seq.append(net.decoder.attention_weights)
+        # Once the end-of-sequence token is predicted, the generation of the
+        # output sequence is complete
+        if pred == tgt_vocab['<eos>']:
+            break
+        output_seq.append(pred)
+    return ' '.join(tgt_vocab.to_tokens(output_seq)), attention_weight_seq
+```
+
+```{.python .input}
+%%tab tensorflow
+#@save
+def sequence_mask(X, valid_len, value=0):
+    """Mask irrelevant entries in sequences."""
+    maxlen = X.shape[1]
+    mask = tf.range(start=0, limit=maxlen, dtype=tf.float32)[
+        None, :] < tf.cast(valid_len[:, None], dtype=tf.float32)
+    
+    if len(X.shape) == 3:
+        return tf.where(tf.expand_dims(mask, axis=-1), X, value)
+    else:
+        return tf.where(mask, X, value)
+
+    
+#@save
+class MaskedSoftmaxCELoss(tf.keras.losses.Loss):
+    """The softmax cross-entropy loss with masks."""
+    def __init__(self, valid_len):
+        super().__init__(reduction='none')
+        self.valid_len = valid_len
+    
+    # `pred` shape: (`batch_size`, `num_steps`, `vocab_size`)
+    # `label` shape: (`batch_size`, `num_steps`)
+    # `valid_len` shape: (`batch_size`,)
+    def call(self, label, pred):
+        weights = tf.ones_like(label, dtype=tf.float32)
+        weights = sequence_mask(weights, self.valid_len)
+        label_one_hot = tf.one_hot(label, depth=pred.shape[-1])
+        unweighted_loss = tf.keras.losses.CategoricalCrossentropy(
+            from_logits=True, reduction='none')(label_one_hot, pred)
+        weighted_loss = tf.reduce_mean((unweighted_loss*weights), axis=1)
+        return weighted_loss
+    
+#@save
+def train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device):
+    """Train a model for sequence to sequence."""
+    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
+    animator = d2l.Animator(xlabel="epoch", ylabel="loss",
+                            xlim=[10, num_epochs])
+    for epoch in range(num_epochs):
+        timer = d2l.Timer()
+        metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
+        for batch in data_iter:
+            X, X_valid_len, Y, Y_valid_len = [x for x in batch]
+            bos = tf.reshape(tf.constant([tgt_vocab['<bos>']] * Y.shape[0]),
+                             shape=(-1, 1))
+            dec_input = tf.concat([bos, Y[:, :-1]], 1)  # Teacher forcing
+            with tf.GradientTape() as tape:
+                Y_hat, _ = net(X, dec_input, X_valid_len, training=True)
+                l = MaskedSoftmaxCELoss(Y_valid_len)(Y, Y_hat)
+            gradients = tape.gradient(l, net.trainable_variables)
+            gradients = d2l.grad_clipping(gradients, 1)
+            optimizer.apply_gradients(zip(gradients, net.trainable_variables))
+            num_tokens = tf.reduce_sum(Y_valid_len).numpy()
+            metric.add(tf.reduce_sum(l), num_tokens)
+        if (epoch + 1) % 10 == 0:
+            animator.add(epoch + 1, (metric[0] / metric[1],))
+    print(f'loss {metric[0] / metric[1]:.3f}, {metric[1] / timer.stop():.1f} '
+          f'tokens/sec on {str(device._device_name)}')
+    
+#@save
+def predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps,
+                    save_attention_weights=False):
+    """Predict for sequence to sequence."""
+    src_tokens = src_vocab[src_sentence.lower().split(' ')] + [
+        src_vocab['<eos>']]
+    enc_valid_len = tf.constant([len(src_tokens)])
+    src_tokens = d2l.truncate_pad(src_tokens, num_steps, src_vocab['<pad>'])
+    # Add the batch axis
+    enc_X = tf.expand_dims(src_tokens, axis=0)
+    enc_outputs = net.encoder(enc_X, enc_valid_len, training=False)
+    dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)
+    # Add the batch axis
+    dec_X = tf.expand_dims(tf.constant([tgt_vocab['<bos>']]), axis=0)
+    output_seq, attention_weight_seq = [], []
+    for _ in range(num_steps):
+        Y, dec_state = net.decoder(dec_X, dec_state, training=False)
+        # We use the token with the highest prediction likelihood as input
+        # of the decoder at the next time step
+        dec_X = tf.argmax(Y, axis=2)
+        pred = tf.squeeze(dec_X, axis=0)
+        # Save attention weights
+        if save_attention_weights:
+            attention_weight_seq.append(net.decoder.attention_weights)
+        # Once the end-of-sequence token is predicted, the generation of the
+        # output sequence is complete
+        if pred == tgt_vocab['<eos>']:
+            break
+        output_seq.append(pred.numpy())
+    return ' '.join(tgt_vocab.to_tokens(tf.reshape(output_seq, shape=-1).numpy().tolist())), attention_weight_seq
+```
diff --git a/chapter_builders-guide/custom-layer.md b/chapter_builders-guide/custom-layer.md
new file mode 100644
index 0000000..bfed3b7
--- /dev/null
+++ b/chapter_builders-guide/custom-layer.md
@@ -0,0 +1,240 @@
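+# Custom Layers
+
+One factor behind deep learning's success is the availability of a wide range of layers that can be composed in creative ways to design architectures suitable for a wide variety of tasks. For instance, researchers have invented layers specifically for handling images, text, looping over sequential data, and performing dynamic programming. Sooner or later, you will encounter or invent a layer that does not exist yet in the deep learning framework. In these cases, you must build a custom layer. In this section, we show you how.
+
+## (**Layers without Parameters**)
+
+To start, we construct a custom layer that does not have any parameters of its own. This should look familiar if you recall our introduction to modules in :numref:`sec_model_construction`. The following `CenteredLayer` class simply subtracts the mean from its input. To build it, we simply need to inherit from the base layer class and implement the forward propagation function.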
+
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+```{.python .input}
+%%tab mxnet
+from d2l import mxnet as d2l
+from mxnet import np, npx
+from mxnet.gluon import nn
+npx.set_np()
+
+class CenteredLayer(nn.Block):
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+
+    def forward(self, X):
+        return X - X.mean()
+```
+
+```{.python .input}
+%%tab pytorch
+from d2l import torch as d2l
+import torch
+from torch import nn
+from torch.nn import functional as F
+
+class CenteredLayer(nn.Module):
+    def __init__(self):
+        super().__init__()
+
+    def forward(self, X):
+        return X - X.mean()
+```
+
+```{.python .input}
+%%tab tensorflow
+from d2l import tensorflow as d2l
+import tensorflow as tf
+
+class CenteredLayer(tf.keras.Model):
+    def __init__(self):
+        super().__init__()
+
+    def call(self, inputs):
+        return inputs - tf.reduce_mean(inputs)
+```
+
+Let's verify that our layer works as intended by feeding some data through it.
+
+```{.python .input}
+%%tab all
+layer = CenteredLayer()
+layer(d2l.tensor([1.0, 2, 3, 4, 5]))
+```
+
+We can now [**incorporate our layer as a component in constructing more complex models.**]
+
+```{.python .input}
+%%tab mxnet
+net = nn.Sequential()
+net.add(nn.Dense(128), CenteredLayer())
+net.initialize()
+```
+
+```{.python .input}
+%%tab pytorch
+net = nn.Sequential(nn.LazyLinear(128), CenteredLayer())
+```
+
+```{.python .input}
+%%tab tensorflow
+net = tf.keras.Sequential([tf.keras.layers.Dense(128), CenteredLayer()])
+```
+
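+As an extra sanity check, we can send random data through the network and check that the mean is in fact 0. Because we are dealing with floating point numbers, we may still see a very small nonzero number due to quantization (floating-point rounding).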
+
+```{.python .input}
+%%tab pytorch, mxnet
+Y = net(d2l.rand(4, 8))
+Y.mean()
+```
+
+```{.python .input}
+%%tab tensorflow
+Y = net(tf.random.uniform((4, 8)))
+tf.reduce_mean(Y)
+```
+
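+## [**Layers with Parameters**]
+
+Now that we know how to define simple layers, let's move on to defining layers with parameters that can be adjusted through training. We can use built-in functions to create parameters, which provide some basic housekeeping functionality. In particular, they govern access, initialization, sharing, saving, and loading model parameters. This way, among other benefits, we will not need to write custom serialization routines for every custom layer.
+
+Now let's implement our own version of the fully connected layer. Recall that this layer requires two parameters, one to represent the weight and the other for the bias. In this implementation, we bake in the ReLU activation as a default. This layer requires two input arguments: `in_units` and `units`, which denote the number of inputs and outputs, respectively.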
+
+```{.python .input}
+%%tab mxnet
+class MyDense(nn.Block):
+    def __init__(self, units, in_units, **kwargs):
+        super().__init__(**kwargs)
+        self.weight = self.params.get('weight', shape=(in_units, units))
+        self.bias = self.params.get('bias', shape=(units,))
+
+    def forward(self, x):
+        linear = np.dot(x, self.weight.data(ctx=x.ctx)) + self.bias.data(
+            ctx=x.ctx)
+        return npx.relu(linear)
+```
+
+```{.python .input}
+%%tab pytorch
+class MyLinear(nn.Module):
+    def __init__(self, in_units, units):
+        super().__init__()
+        self.weight = nn.Parameter(torch.randn(in_units, units))
+        self.bias = nn.Parameter(torch.randn(units,))
+        
+    def forward(self, X):
+        linear = torch.matmul(X, self.weight) + self.bias
+        return F.relu(linear)
+```
+
+```{.python .input}
+%%tab tensorflow
+class MyDense(tf.keras.Model):
+    def __init__(self, units):
+        super().__init__()
+        self.units = units
+
+    def build(self, X_shape):
+        self.weight = self.add_weight(name='weight',
+            shape=[X_shape[-1], self.units],
+            initializer=tf.random_normal_initializer())
+        self.bias = self.add_weight(
+            name='bias', shape=[self.units],
+            initializer=tf.zeros_initializer())
+
+    def call(self, X):
+        linear = tf.matmul(X, self.weight) + self.bias
+        return tf.nn.relu(linear)
+```
+
+:begin_tab:`mxnet, tensorflow`
+Next, we instantiate the `MyDense` class and access its model parameters.
+:end_tab:
+
+:begin_tab:`pytorch`
+Next, we instantiate the `MyLinear` class and access its model parameters.
+:end_tab:
+
+```{.python .input}
+%%tab mxnet
+dense = MyDense(units=3, in_units=5)
+dense.params
+```
+
+```{.python .input}
+%%tab pytorch
+linear = MyLinear(5, 3)
+linear.weight
+```
+
+```{.python .input}
+%%tab tensorflow
+dense = MyDense(3)
+dense(tf.random.uniform((2, 5)))
+dense.get_weights()
+```
+
+We can [**directly carry out forward propagation calculations using custom layers.**]
+
+```{.python .input}
+%%tab mxnet
+dense.initialize()
+dense(np.random.uniform(size=(2, 5)))
+```
+
+```{.python .input}
+%%tab pytorch
+linear(torch.rand(2, 5))
+```
+
+```{.python .input}
+%%tab tensorflow
+dense(tf.random.uniform((2, 5)))
+```
+
+We can also (**construct models using custom layers.**) Once we have that we can use it just like the built-in fully connected layer.
+
+```{.python .input}
+%%tab mxnet
+net = nn.Sequential()
+net.add(MyDense(8, in_units=64),
+        MyDense(1, in_units=8))
+net.initialize()
+net(np.random.uniform(size=(2, 64)))
+```
+
+```{.python .input}
+%%tab pytorch
+net = nn.Sequential(MyLinear(64, 8), MyLinear(8, 1))
+net(torch.rand(2, 64))
+```
+
+```{.python .input}
+%%tab tensorflow
+net = tf.keras.models.Sequential([MyDense(8), MyDense(1)])
+net(tf.random.uniform((2, 64)))
+```
+
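+## Summary
+
+* We can design custom layers via the base layer class. This allows us to define flexible new layers that behave differently from any existing layers in the library.
+* Once defined, custom layers can be invoked in arbitrary contexts and architectures.
+* Layers can have local parameters, which can be created through built-in functions.
+
+## Exercises
+
+1. Design a layer that takes an input and computes a tensor reduction, i.e., it returns $y_k = \sum_{i, j} W_{ijk} x_i x_j$.
+1. Design a layer that returns the leading half of the Fourier coefficients of the data.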
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/58)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/59)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/279)
+:end_tab:
diff --git a/chapter_deep-learning-computation/custom-layer_origin.md b/chapter_builders-guide/custom-layer_origin.md
similarity index 86%
rename from chapter_deep-learning-computation/custom-layer_origin.md
rename to chapter_builders-guide/custom-layer_origin.md
index 7256abb..651e759 100644
--- a/chapter_deep-learning-computation/custom-layer_origin.md
+++ b/chapter_builders-guide/custom-layer_origin.md
@@ -20,13 +20,20 @@ In this section, we show you how.
 To start, we construct a custom layer
 that does not have any parameters of its own.
 This should look familiar if you recall our
-introduction to block in :numref:`sec_model_construction`.
+introduction to module in :numref:`sec_model_construction`.
 The following `CenteredLayer` class simply
 subtracts the mean from its input.
 To build it, we simply need to inherit
 from the base layer class and implement the forward propagation function.
 
 ```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+```{.python .input}
+%%tab mxnet
+from d2l import mxnet as d2l
 from mxnet import np, npx
 from mxnet.gluon import nn
 npx.set_np()
@@ -40,7 +47,8 @@ class CenteredLayer(nn.Block):
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
+from d2l import torch as d2l
 import torch
 from torch import nn
 from torch.nn import functional as F
@@ -54,7 +62,8 @@ class CenteredLayer(nn.Module):
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
+from d2l import tensorflow as d2l
 import tensorflow as tf
 
 class CenteredLayer(tf.keras.Model):
@@ -65,41 +74,31 @@ class CenteredLayer(tf.keras.Model):
         return inputs - tf.reduce_mean(inputs)
 ```
 
-Let us verify that our layer works as intended by feeding some data through it.
-
-```{.python .input}
-layer = CenteredLayer()
-layer(np.array([1, 2, 3, 4, 5]))
-```
+Let's verify that our layer works as intended by feeding some data through it.
 
 ```{.python .input}
-#@tab pytorch
+%%tab all
 layer = CenteredLayer()
-layer(torch.FloatTensor([1, 2, 3, 4, 5]))
-```
-
-```{.python .input}
-#@tab tensorflow
-layer = CenteredLayer()
-layer(tf.constant([1, 2, 3, 4, 5]))
+layer(d2l.tensor([1.0, 2, 3, 4, 5]))
 ```
 
 We can now [**incorporate our layer as a component
 in constructing more complex models.**]
 
 ```{.python .input}
+%%tab mxnet
 net = nn.Sequential()
 net.add(nn.Dense(128), CenteredLayer())
 net.initialize()
 ```
 
 ```{.python .input}
-#@tab pytorch
-net = nn.Sequential(nn.Linear(8, 128), CenteredLayer())
+%%tab pytorch
+net = nn.Sequential(nn.LazyLinear(128), CenteredLayer())
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 net = tf.keras.Sequential([tf.keras.layers.Dense(128), CenteredLayer()])
 ```
 
@@ -110,18 +109,13 @@ we may still see a very small nonzero number
 due to quantization.
 
 ```{.python .input}
-Y = net(np.random.uniform(size=(4, 8)))
-Y.mean()
-```
-
-```{.python .input}
-#@tab pytorch
-Y = net(torch.rand(4, 8))
+%%tab pytorch, mxnet
+Y = net(d2l.rand(4, 8))
 Y.mean()
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 Y = net(tf.random.uniform((4, 8)))
 tf.reduce_mean(Y)
 ```
@@ -129,7 +123,7 @@ tf.reduce_mean(Y)
 ## [**Layers with Parameters**]
 
 Now that we know how to define simple layers,
-let us move on to defining layers with parameters
+let's move on to defining layers with parameters
 that can be adjusted through training.
 We can use built-in functions to create parameters, which
 provide some basic housekeeping functionality.
@@ -138,14 +132,15 @@ sharing, saving, and loading model parameters.
 This way, among other benefits, we will not need to write
 custom serialization routines for every custom layer.
 
-Now let us implement our own version of the  fully-connected layer.
+Now let's implement our own version of the  fully connected layer.
 Recall that this layer requires two parameters,
 one to represent the weight and the other for the bias.
 In this implementation, we bake in the ReLU activation as a default.
-This layer requires to input arguments: `in_units` and `units`, which
+This layer requires two input arguments: `in_units` and `units`, which
 denote the number of inputs and outputs, respectively.
 
 ```{.python .input}
+%%tab mxnet
 class MyDense(nn.Block):
     def __init__(self, units, in_units, **kwargs):
         super().__init__(**kwargs)
@@ -159,19 +154,20 @@ class MyDense(nn.Block):
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 class MyLinear(nn.Module):
     def __init__(self, in_units, units):
         super().__init__()
         self.weight = nn.Parameter(torch.randn(in_units, units))
         self.bias = nn.Parameter(torch.randn(units,))
+        
     def forward(self, X):
         linear = torch.matmul(X, self.weight.data) + self.bias.data
         return F.relu(linear)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 class MyDense(tf.keras.Model):
     def __init__(self, units):
         super().__init__()
@@ -201,18 +197,19 @@ and access its model parameters.
 :end_tab:
 
 ```{.python .input}
+%%tab mxnet
 dense = MyDense(units=3, in_units=5)
 dense.params
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 linear = MyLinear(5, 3)
 linear.weight
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 dense = MyDense(3)
 dense(tf.random.uniform((2, 5)))
 dense.get_weights()
@@ -221,24 +218,26 @@ dense.get_weights()
 We can [**directly carry out forward propagation calculations using custom layers.**]
 
 ```{.python .input}
+%%tab mxnet
 dense.initialize()
 dense(np.random.uniform(size=(2, 5)))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 linear(torch.rand(2, 5))
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 dense(tf.random.uniform((2, 5)))
 ```
 
 We can also (**construct models using custom layers.**)
-Once we have that we can use it just like the built-in fully-connected layer.
+Once we have that we can use it just like the built-in fully connected layer.
 
 ```{.python .input}
+%%tab mxnet
 net = nn.Sequential()
 net.add(MyDense(8, in_units=64),
         MyDense(1, in_units=8))
@@ -247,13 +246,13 @@ net(np.random.uniform(size=(2, 64)))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 net = nn.Sequential(MyLinear(64, 8), MyLinear(8, 1))
 net(torch.rand(2, 64))
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 net = tf.keras.models.Sequential([MyDense(8), MyDense(1)])
 net(tf.random.uniform((2, 64)))
 ```
diff --git a/chapter_builders-guide/index.md b/chapter_builders-guide/index.md
new file mode 100644
index 0000000..6621722
--- /dev/null
+++ b/chapter_builders-guide/index.md
@@ -0,0 +1,18 @@
+# Builders' Guide
+:label:`chap_computation`
+
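+Alongside giant datasets and powerful hardware, great software tools have played an indispensable role in the rapid progress of deep learning. Starting with the pathbreaking Theano library released in 2007, flexible open-source tools have enabled researchers to rapidly prototype models, avoiding repetitive work when recycling standard components while still maintaining the ability to make low-level modifications. Over time, deep learning's libraries have evolved to offer increasingly coarse abstractions. Just as semiconductor designers went from specifying transistors to logical circuits to writing code, neural networks researchers have moved from thinking about the behavior of individual artificial neurons to conceiving of networks in terms of whole layers, and now often design architectures with far coarser *blocks* in mind.
+
+So far, we have introduced some basic machine learning concepts, ramping up to fully functional deep learning models. In the previous chapter, we implemented each component of an MLP from scratch and even showed how to leverage high-level APIs to roll out the same models effortlessly. To get you that far that fast, we *called upon* the libraries, but skipped over more advanced details about *how they work*. In this chapter, we will peel back the curtain, digging deeper into the key components of deep learning computation, namely model construction, parameter access and initialization, designing custom layers and blocks, reading and writing models to disk, and leveraging GPUs to achieve dramatic speedups. These insights will move you from *end user* to *power user*, giving you the tools needed to reap the benefits of a mature deep learning library while retaining the flexibility to implement more complex models, including those you invent yourself. While this chapter does not introduce any new models or datasets, the advanced modeling chapters that follow rely heavily on these techniques.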
+
+```toc
+:maxdepth: 2
+
+model-construction
+parameters
+init-param
+lazy-init
+custom-layer
+read-write
+use-gpu
+```
diff --git a/chapter_deep-learning-computation/index_origin.md b/chapter_builders-guide/index_origin.md
similarity index 98%
rename from chapter_deep-learning-computation/index_origin.md
rename to chapter_builders-guide/index_origin.md
index 376821e..5bf9e15 100644
--- a/chapter_deep-learning-computation/index_origin.md
+++ b/chapter_builders-guide/index_origin.md
@@ -1,4 +1,4 @@
-# Deep Learning Computation
+# Builders' Guide
 :label:`chap_computation`
 
 Alongside giant datasets and powerful hardware,
@@ -44,7 +44,8 @@ the advanced modeling chapters that follow rely heavily on these techniques.
 
 model-construction
 parameters
-deferred-init
+init-param
+lazy-init
 custom-layer
 read-write
 use-gpu
diff --git a/chapter_builders-guide/init-param.md b/chapter_builders-guide/init-param.md
new file mode 100644
index 0000000..26e10af
--- /dev/null
+++ b/chapter_builders-guide/init-param.md
@@ -0,0 +1,286 @@
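+# Parameter Initialization
+
+Now that we know how to access the parameters, let's look at how to initialize them properly. We discussed the need for proper initialization in :numref:`sec_numerical_stability`. The deep learning framework provides default random initializations to its layers. However, we often want to initialize our weights according to various other protocols. The framework provides the most commonly used protocols, and also lets us create custom initializers.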
+
+:begin_tab:`mxnet`
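+By default, MXNet initializes weight parameters by randomly drawing from a uniform distribution $U(-0.07, 0.07)$, clearing bias parameters to zero. MXNet's `init` module provides a variety of preset initialization methods.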
+:end_tab:
+
+:begin_tab:`pytorch`
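+By default, PyTorch initializes weight and bias matrices uniformly by drawing from a range that is computed according to the input and output dimension. PyTorch's `nn.init` module provides a variety of preset initialization methods.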
+:end_tab:
+
+:begin_tab:`tensorflow`
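+By default, Keras initializes weight matrices uniformly by drawing from a range that is computed according to the input and output dimension, and the bias parameters are all set to zero. TensorFlow provides a variety of initialization methods both in the root module and the `keras.initializers` module.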
+:end_tab:
+
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+```{.python .input  n=2}
+%%tab mxnet
+from mxnet import init, np, npx
+from mxnet.gluon import nn
+npx.set_np()
+
+net = nn.Sequential()
+net.add(nn.Dense(8, activation='relu'))
+net.add(nn.Dense(1))
+net.initialize()  # Use the default initialization method
+
+X = np.random.uniform(size=(2, 4))
+net(X).shape
+```
+
+```{.python .input  n=3}
+%%tab pytorch
+import torch
+from torch import nn
+
+net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
+X = torch.rand(size=(2, 4))
+net(X).shape
+```
+
+```{.python .input  n=4}
+%%tab tensorflow
+import tensorflow as tf
+
+net = tf.keras.models.Sequential([
+    tf.keras.layers.Flatten(),
+    tf.keras.layers.Dense(4, activation=tf.nn.relu),
+    tf.keras.layers.Dense(1),
+])
+
+X = tf.random.uniform((2, 4))
+net(X).shape
+```
+
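+## [**Built-in Initialization**]
+
+Let's begin by calling on built-in initializers. The code below initializes all weight parameters as Gaussian random variables with standard deviation 0.01, while bias parameters are cleared to zero.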
+
+```{.python .input  n=5}
+%%tab mxnet
+# Here `force_reinit` ensures that parameters are freshly initialized even if
+# they were already initialized previously
+net.initialize(init=init.Normal(sigma=0.01), force_reinit=True)
+net[0].weight.data()[0]
+```
+
+```{.python .input  n=6}
+%%tab pytorch
+def init_normal(module):
+    if type(module) == nn.Linear:
+        nn.init.normal_(module.weight, mean=0, std=0.01)
+        nn.init.zeros_(module.bias)
+net.apply(init_normal)
+net[0].weight.data[0], net[0].bias.data[0]
+```
+
+```{.python .input  n=7}
+%%tab tensorflow
+net = tf.keras.models.Sequential([
+    tf.keras.layers.Flatten(),
+    tf.keras.layers.Dense(
+        4, activation=tf.nn.relu,
+        kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.01),
+        bias_initializer=tf.zeros_initializer()),
+    tf.keras.layers.Dense(1)])
+
+net(X)
+net.weights[0], net.weights[1]
+```
+
+We can also initialize all the parameters to a given constant value (say, 1).
+
+```{.python .input  n=8}
+%%tab mxnet
+net.initialize(init=init.Constant(1), force_reinit=True)
+net[0].weight.data()[0]
+```
+
+```{.python .input  n=9}
+%%tab pytorch
+def init_constant(module):
+    if type(module) == nn.Linear:
+        nn.init.constant_(module.weight, 1)
+        nn.init.zeros_(module.bias)
+net.apply(init_constant)
+net[0].weight.data[0], net[0].bias.data[0]
+```
+
+```{.python .input  n=10}
+%%tab tensorflow
+net = tf.keras.models.Sequential([
+    tf.keras.layers.Flatten(),
+    tf.keras.layers.Dense(
+        4, activation=tf.nn.relu,
+        kernel_initializer=tf.keras.initializers.Constant(1),
+        bias_initializer=tf.zeros_initializer()),
+    tf.keras.layers.Dense(1),
+])
+
+net(X)
+net.weights[0], net.weights[1]
+```
+
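+[**We can also apply different initializers for certain blocks.**] For example, below we initialize the first layer with the Xavier initializer and initialize the second layer to a constant value of 42.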
+
+```{.python .input  n=11}
+%%tab mxnet
+net[0].weight.initialize(init=init.Xavier(), force_reinit=True)
+net[1].initialize(init=init.Constant(42), force_reinit=True)
+print(net[0].weight.data()[0])
+print(net[1].weight.data())
+```
+
+```{.python .input  n=12}
+%%tab pytorch
+def init_xavier(module):
+    if type(module) == nn.Linear:
+        nn.init.xavier_uniform_(module.weight)
+def init_42(module):
+    if type(module) == nn.Linear:
+        nn.init.constant_(module.weight, 42)
+
+net[0].apply(init_xavier)
+net[2].apply(init_42)
+print(net[0].weight.data[0])
+print(net[2].weight.data)
+```
+
+```{.python .input  n=13}
+%%tab tensorflow
+net = tf.keras.models.Sequential([
+    tf.keras.layers.Flatten(),
+    tf.keras.layers.Dense(
+        4,
+        activation=tf.nn.relu,
+        kernel_initializer=tf.keras.initializers.GlorotUniform()),
+    tf.keras.layers.Dense(
+        1, kernel_initializer=tf.keras.initializers.Constant(42)),
+])
+
+net(X)
+print(net.layers[1].weights[0])
+print(net.layers[2].weights[0])
+```
+
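+### [**Custom Initialization**]
+
+Sometimes, the initialization methods we need are not provided by the deep learning framework. In the example below, we define an initializer for any weight parameter $w$ using the following strange distribution: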
+
+$$
+\begin{aligned}
+    w \sim \begin{cases}
+        U(5, 10) & \text{ with probability } \frac{1}{4} \\
+            0    & \text{ with probability } \frac{1}{2} \\
+        U(-10, -5) & \text{ with probability } \frac{1}{4}
+    \end{cases}
+\end{aligned}
+$$
+
+:begin_tab:`mxnet`
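+Here we define a subclass of the `Initializer` class. Usually, we only need to implement the `_init_weight` function, which takes a tensor argument (`data`) and assigns to it the desired initialized values.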
+:end_tab:
+
+:begin_tab:`pytorch`
+Again, we implement a `my_init` function to apply to `net`.
+:end_tab:
+
+:begin_tab:`tensorflow`
+Here we define a subclass of `Initializer` and implement the `__call__` function that returns a desired tensor given the shape and data type.
+:end_tab:
+
+```{.python .input  n=14}
+%%tab mxnet
+class MyInit(init.Initializer):
+    def _init_weight(self, name, data):
+        print('Init', name, data.shape)
+        data[:] = np.random.uniform(-10, 10, data.shape)
+        data *= np.abs(data) >= 5
+
+net.initialize(MyInit(), force_reinit=True)
+net[0].weight.data()[:2]
+```
+
+```{.python .input  n=15}
+%%tab pytorch
+def my_init(module):
+    if type(module) == nn.Linear:
+        print("Init", *[(name, param.shape)
+                        for name, param in module.named_parameters()][0])
+        nn.init.uniform_(module.weight, -10, 10)
+        module.weight.data *= module.weight.data.abs() >= 5
+
+net.apply(my_init)
+net[0].weight[:2]
+```
+
+```{.python .input  n=16}
+%%tab tensorflow
+class MyInit(tf.keras.initializers.Initializer):
+    def __call__(self, shape, dtype=None):
+        data = tf.random.uniform(shape, -10, 10, dtype=dtype)
+        factor = (tf.abs(data) >= 5)
+        factor = tf.cast(factor, tf.float32)
+        return data * factor
+
+net = tf.keras.models.Sequential([
+    tf.keras.layers.Flatten(),
+    tf.keras.layers.Dense(
+        4,
+        activation=tf.nn.relu,
+        kernel_initializer=MyInit()),
+    tf.keras.layers.Dense(1),
+])
+
+net(X)
+print(net.layers[1].weights[0])
+```
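+
+To see why the three implementations above sample from this distribution, here is a short derivation. It assumes, as the code does, that each entry is first drawn as $u \sim U(-10, 10)$ and then multiplied by the indicator $\mathbf{1}[|u| \geq 5]$; the symbol $u$ is introduced here only for illustration.
+
+$$
+\begin{aligned}
+    P(w = 0) &= P(|u| < 5) = \frac{5 - (-5)}{10 - (-10)} = \frac{1}{2}, \\
+    P(5 \leq w \leq 10) &= P(5 \leq u \leq 10) = \frac{10 - 5}{20} = \frac{1}{4}, \\
+    P(-10 \leq w \leq -5) &= P(-10 \leq u \leq -5) = \frac{-5 - (-10)}{20} = \frac{1}{4}.
+\end{aligned}
+$$
+
+Conditioned on $u \geq 5$ (or on $u \leq -5$), $u$ is still uniform on that interval, so the nonzero entries follow $U(5, 10)$ and $U(-10, -5)$ as required.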
+
+Note that we always have the option of setting parameters directly.
+
+```{.python .input  n=17}
+%%tab mxnet
+net[0].weight.data()[:] += 1
+net[0].weight.data()[0, 0] = 42
+net[0].weight.data()[0]
+```
+
+```{.python .input  n=18}
+%%tab pytorch
+net[0].weight.data[:] += 1
+net[0].weight.data[0, 0] = 42
+net[0].weight.data[0]
+```
+
+```{.python .input  n=19}
+%%tab tensorflow
+net.layers[1].weights[0][:].assign(net.layers[1].weights[0] + 1)
+net.layers[1].weights[0][0, 0].assign(42)
+net.layers[1].weights[0]
+```
+
+## Summary
+
+We can initialize parameters using both built-in and custom initializers.
+
+## Exercises
+
+Look up the online documentation for more built-in initializers.
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/8089)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/8090)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/8091)
+:end_tab:
diff --git a/chapter_builders-guide/init-param_origin.md b/chapter_builders-guide/init-param_origin.md
new file mode 100644
index 0000000..9c0c6f3
--- /dev/null
+++ b/chapter_builders-guide/init-param_origin.md
@@ -0,0 +1,315 @@
+# Parameter Initialization
+
+Now that we know how to access the parameters,
+let's look at how to initialize them properly.
+We discussed the need for proper initialization in :numref:`sec_numerical_stability`.
+The deep learning framework provides default random initializations to its layers.
+However, we often want to initialize our weights
+according to various other protocols. The framework provides the most commonly
+used protocols and also lets us create custom initializers.
+
+:begin_tab:`mxnet`
+By default, MXNet initializes weight parameters by randomly drawing from a uniform distribution $U(-0.07, 0.07)$,
+clearing bias parameters to zero.
+MXNet's `init` module provides a variety
+of preset initialization methods.
+:end_tab:
+
+:begin_tab:`pytorch`
+By default, PyTorch initializes weight and bias matrices
+uniformly by drawing from a range that is computed according to the input and output dimension.
+PyTorch's `nn.init` module provides a variety
+of preset initialization methods.
+:end_tab:
+
+:begin_tab:`tensorflow`
+By default, Keras initializes weight matrices uniformly by drawing from a range that is computed according to the input and output dimension, and the bias parameters are all set to zero.
+TensorFlow provides a variety of initialization methods both in the root module and the `keras.initializers` module.
+:end_tab:
+
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+```{.python .input  n=2}
+%%tab mxnet
+from mxnet import init, np, npx
+from mxnet.gluon import nn
+npx.set_np()
+
+net = nn.Sequential()
+net.add(nn.Dense(8, activation='relu'))
+net.add(nn.Dense(1))
+net.initialize()  # Use the default initialization method
+
+X = np.random.uniform(size=(2, 4))
+net(X).shape
+```
+
+```{.python .input  n=3}
+%%tab pytorch
+import torch
+from torch import nn
+
+net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
+X = torch.rand(size=(2, 4))
+net(X).shape
+```
+
+```{.python .input  n=4}
+%%tab tensorflow
+import tensorflow as tf
+
+net = tf.keras.models.Sequential([
+    tf.keras.layers.Flatten(),
+    tf.keras.layers.Dense(4, activation=tf.nn.relu),
+    tf.keras.layers.Dense(1),
+])
+
+X = tf.random.uniform((2, 4))
+net(X).shape
+```
+
+## [**Built-in Initialization**]
+
+Let's begin by calling on built-in initializers.
+The code below initializes all weight parameters
+as Gaussian random variables
+with standard deviation 0.01, while bias parameters are cleared to zero.
+
+```{.python .input  n=5}
+%%tab mxnet
+# Here `force_reinit` ensures that parameters are freshly initialized even if
+# they were already initialized previously
+net.initialize(init=init.Normal(sigma=0.01), force_reinit=True)
+net[0].weight.data()[0]
+```
+
+```{.python .input  n=6}
+%%tab pytorch
+def init_normal(module):
+    if type(module) == nn.Linear:
+        nn.init.normal_(module.weight, mean=0, std=0.01)
+        nn.init.zeros_(module.bias)
+net.apply(init_normal)
+net[0].weight.data[0], net[0].bias.data[0]
+```
+
+```{.python .input  n=7}
+%%tab tensorflow
+net = tf.keras.models.Sequential([
+    tf.keras.layers.Flatten(),
+    tf.keras.layers.Dense(
+        4, activation=tf.nn.relu,
+        kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.01),
+        bias_initializer=tf.zeros_initializer()),
+    tf.keras.layers.Dense(1)])
+
+net(X)
+net.weights[0], net.weights[1]
+```
+
+We can also initialize all the parameters
+to a given constant value (say, 1).
+
+```{.python .input  n=8}
+%%tab mxnet
+net.initialize(init=init.Constant(1), force_reinit=True)
+net[0].weight.data()[0]
+```
+
+```{.python .input  n=9}
+%%tab pytorch
+def init_constant(module):
+    if type(module) == nn.Linear:
+        nn.init.constant_(module.weight, 1)
+        nn.init.zeros_(module.bias)
+net.apply(init_constant)
+net[0].weight.data[0], net[0].bias.data[0]
+```
+
+```{.python .input  n=10}
+%%tab tensorflow
+net = tf.keras.models.Sequential([
+    tf.keras.layers.Flatten(),
+    tf.keras.layers.Dense(
+        4, activation=tf.nn.relu,
+        kernel_initializer=tf.keras.initializers.Constant(1),
+        bias_initializer=tf.zeros_initializer()),
+    tf.keras.layers.Dense(1),
+])
+
+net(X)
+net.weights[0], net.weights[1]
+```
+
+[**We can also apply different initializers for certain blocks.**]
+For example, below we initialize the first layer
+with the Xavier initializer
+and initialize the second layer
+to a constant value of 42.
+
+```{.python .input  n=11}
+%%tab mxnet
+net[0].weight.initialize(init=init.Xavier(), force_reinit=True)
+net[1].initialize(init=init.Constant(42), force_reinit=True)
+print(net[0].weight.data()[0])
+print(net[1].weight.data())
+```
+
+```{.python .input  n=12}
+%%tab pytorch
+def init_xavier(module):
+    if type(module) == nn.Linear:
+        nn.init.xavier_uniform_(module.weight)
+def init_42(module):
+    if type(module) == nn.Linear:
+        nn.init.constant_(module.weight, 42)
+
+net[0].apply(init_xavier)
+net[2].apply(init_42)
+print(net[0].weight.data[0])
+print(net[2].weight.data)
+```
+
+```{.python .input  n=13}
+%%tab tensorflow
+net = tf.keras.models.Sequential([
+    tf.keras.layers.Flatten(),
+    tf.keras.layers.Dense(
+        4,
+        activation=tf.nn.relu,
+        kernel_initializer=tf.keras.initializers.GlorotUniform()),
+    tf.keras.layers.Dense(
+        1, kernel_initializer=tf.keras.initializers.Constant(42)),
+])
+
+net(X)
+print(net.layers[1].weights[0])
+print(net.layers[2].weights[0])
+```
+
+### [**Custom Initialization**]
+
+Sometimes, the initialization methods we need
+are not provided by the deep learning framework.
+In the example below, we define an initializer
+for any weight parameter $w$ using the following strange distribution:
+
+$$
+\begin{aligned}
+    w \sim \begin{cases}
+        U(5, 10) & \text{ with probability } \frac{1}{4} \\
+            0    & \text{ with probability } \frac{1}{2} \\
+        U(-10, -5) & \text{ with probability } \frac{1}{4}
+    \end{cases}
+\end{aligned}
+$$
+
+:begin_tab:`mxnet`
+Here we define a subclass of the `Initializer` class.
+Usually, we only need to implement the `_init_weight` function
+which takes a tensor argument (`data`)
+and assigns to it the desired initialized values.
+:end_tab:
+
+:begin_tab:`pytorch`
+Again, we implement a `my_init` function to apply to `net`.
+:end_tab:
+
+:begin_tab:`tensorflow`
+Here we define a subclass of `Initializer` and implement the `__call__`
+function, which returns the desired tensor given the shape and data type.
+:end_tab:
+
+```{.python .input  n=14}
+%%tab mxnet
+class MyInit(init.Initializer):
+    def _init_weight(self, name, data):
+        print('Init', name, data.shape)
+        data[:] = np.random.uniform(-10, 10, data.shape)
+        data *= np.abs(data) >= 5
+
+net.initialize(MyInit(), force_reinit=True)
+net[0].weight.data()[:2]
+```
+
+```{.python .input  n=15}
+%%tab pytorch
+def my_init(module):
+    if type(module) == nn.Linear:
+        print("Init", *[(name, param.shape)
+                        for name, param in module.named_parameters()][0])
+        nn.init.uniform_(module.weight, -10, 10)
+        module.weight.data *= module.weight.data.abs() >= 5
+
+net.apply(my_init)
+net[0].weight[:2]
+```
+
+```{.python .input  n=16}
+%%tab tensorflow
+class MyInit(tf.keras.initializers.Initializer):
+    def __call__(self, shape, dtype=None):
+        data = tf.random.uniform(shape, -10, 10, dtype=dtype)
+        # Keep entries with |w| >= 5 and zero out the rest
+        factor = tf.cast(tf.abs(data) >= 5, data.dtype)
+        return data * factor
+
+net = tf.keras.models.Sequential([
+    tf.keras.layers.Flatten(),
+    tf.keras.layers.Dense(
+        4,
+        activation=tf.nn.relu,
+        kernel_initializer=MyInit()),
+    tf.keras.layers.Dense(1),
+])
+
+net(X)
+print(net.layers[1].weights[0])
+```
+
+Note that we always have the option
+of setting parameters directly.
+
+```{.python .input  n=17}
+%%tab mxnet
+net[0].weight.data()[:] += 1
+net[0].weight.data()[0, 0] = 42
+net[0].weight.data()[0]
+```
+
+```{.python .input  n=18}
+%%tab pytorch
+net[0].weight.data[:] += 1
+net[0].weight.data[0, 0] = 42
+net[0].weight.data[0]
+```
+
+```{.python .input  n=19}
+%%tab tensorflow
+net.layers[1].weights[0][:].assign(net.layers[1].weights[0] + 1)
+net.layers[1].weights[0][0, 0].assign(42)
+net.layers[1].weights[0]
+```
+
+## Summary
+
+We can initialize parameters using built-in and custom initializers.
+
+## Exercises
+
+Look up the online documentation for more built-in initializers.
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/8089)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/8090)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/8091)
+:end_tab:
diff --git a/chapter_builders-guide/lazy-init.md b/chapter_builders-guide/lazy-init.md
new file mode 100644
index 0000000..54feff5
--- /dev/null
+++ b/chapter_builders-guide/lazy-init.md
@@ -0,0 +1,148 @@
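+# Lazy Initialization
+:label:`sec_lazy_init`
+
+So far, it might seem that we got away with being sloppy in setting up our networks. Specifically, we did the following unintuitive things, which might not seem like they should work:
+
+* We defined the network architectures without specifying the input dimensionality.
+* We added layers without specifying the output dimension of the previous layer.
+* We even "initialized" these parameters before providing enough information to determine how many parameters our models should contain.
+
+You might be surprised that our code runs at all. After all, there is no way the deep learning framework could tell what the input dimensionality of a network would be. The trick here is that the framework *defers* initialization, waiting until the first time we pass data through the model to infer the size of each layer on the fly.
+
+Later on, when working with convolutional neural networks, this technique will become even more convenient, since the input dimensionality (i.e., the resolution of an image) affects the dimensionality of each subsequent layer. Hence, the ability to set parameters without needing to know, at the time of writing the code, what the dimensionality is can greatly simplify the task of specifying and subsequently modifying our models. Next, we go deeper into the mechanics of initialization.
+
+To begin, let's instantiate an MLP.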
+
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+```{.python .input}
+%%tab mxnet
+from mxnet import np, npx
+from mxnet.gluon import nn
+npx.set_np()
+
+net = nn.Sequential()
+net.add(nn.Dense(256, activation='relu'))
+net.add(nn.Dense(10))
+```
+
+```{.python .input}
+%%tab pytorch
+from d2l import torch as d2l
+import torch
+from torch import nn
+
+net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
+```
+
+```{.python .input}
+%%tab tensorflow
+import tensorflow as tf
+
+net = tf.keras.models.Sequential([
+    tf.keras.layers.Dense(256, activation=tf.nn.relu),
+    tf.keras.layers.Dense(10),
+])
+```
+
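+At this point, the network cannot possibly know the dimensions of the input layer's weights because the input dimension remains unknown. Consequently, the framework has not yet initialized any parameters. We confirm this by attempting to access the parameters below.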
+
+```{.python .input}
+%%tab mxnet
+print(net.collect_params)
+print(net.collect_params())
+```
+
+```{.python .input}
+%%tab pytorch
+net[0].weight
+```
+
+```{.python .input}
+%%tab tensorflow
+[net.layers[i].get_weights() for i in range(len(net.layers))]
+```
+
+:begin_tab:`mxnet`
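+Note that while the parameter objects exist, the input dimension to each layer is listed as -1. MXNet uses the special value -1 to indicate that the parameter dimension remains unknown. At this point, attempts to access `net[0].weight.data()` would trigger a runtime error stating that the network must be initialized before the parameters can be accessed. Now let's see what happens when we attempt to initialize parameters via the `initialize` method.
+:end_tab:
+
+:begin_tab:`tensorflow`
+Note that each layer object exists but the weights are empty. Using `net.get_weights()` would throw an error since the weights have not been initialized yet.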
+:end_tab:
+
+```{.python .input}
+%%tab mxnet
+net.initialize()
+net.collect_params()
+```
+
+:begin_tab:`mxnet`
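+As we can see, nothing has changed. When input dimensions are unknown, calls to `initialize` do not truly initialize the parameters. Instead, this call registers with MXNet that we wish to initialize the parameters (and, optionally, according to which distribution).
+:end_tab:
+
+Next, let's pass data through the network to make the framework finally initialize the parameters.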
+
+```{.python .input}
+%%tab mxnet
+X = np.random.uniform(size=(2, 20))
+net(X)
+
+net.collect_params()
+```
+
+```{.python .input}
+%%tab pytorch
+X = torch.rand(2, 20)
+net(X)
+
+net[0].weight.shape
+```
+
+```{.python .input}
+%%tab tensorflow
+X = tf.random.uniform((2, 20))
+net(X)
+[w.shape for w in net.get_weights()]
+```
+
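+As soon as we know the input dimensionality, 20, the framework can identify the shape of the first layer's weight matrix by plugging in the value of 20. Having recognized the first layer's shape, the framework proceeds to the second layer, and so on through the computational graph until all shapes are known. Note that in this case only the first layer requires lazy initialization, but the framework initializes sequentially. Once all parameter shapes are known, the framework can finally initialize the parameters.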
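+
+As a worked example of this shape inference, consider the MLP above with a minibatch $\mathbf{X} \in \mathbb{R}^{2 \times 20}$. The symbols $\mathbf{W}_i$ and $\mathbf{b}_i$ are introduced here only for illustration, and whether a framework stores $\mathbf{W}_i$ or its transpose is an implementation detail.
+
+$$
+\begin{aligned}
+    \mathbf{H} &= \mathrm{ReLU}(\mathbf{X} \mathbf{W}_1 + \mathbf{b}_1), & \mathbf{W}_1 &\in \mathbb{R}^{20 \times 256}, & \mathbf{b}_1 &\in \mathbb{R}^{256}, \\
+    \mathbf{O} &= \mathbf{H} \mathbf{W}_2 + \mathbf{b}_2, & \mathbf{W}_2 &\in \mathbb{R}^{256 \times 10}, & \mathbf{b}_2 &\in \mathbb{R}^{10}.
+\end{aligned}
+$$
+
+Only the $20$ in $\mathbf{W}_1$ has to be read off the data; the remaining sizes were fixed when the layers were declared. In total the model holds $20 \cdot 256 + 256 + 256 \cdot 10 + 10 = 7946$ parameters.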
+
+:begin_tab:`pytorch`
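+The following method passes dummy inputs through the network for a dry run to infer all parameter shapes and subsequently initializes the parameters. It will be used later when the default random initializations are not desired.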
+:end_tab:
+
+```{.python .input}
+%%tab pytorch
+@d2l.add_to_class(d2l.Module)  #@save
+def apply_init(self, inputs, init=None):
+    self.forward(*inputs)
+    if init is not None:
+        self.net.apply(init)
+```
+
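+## Summary
+
+* Lazy initialization can be convenient: it lets the framework infer parameter shapes automatically, which makes it easy to modify architectures and eliminates one common source of errors.
+* We can pass data through the model to make the framework finally initialize the parameters.
+
+## Exercises
+
+1. What happens if you specify the input dimensions to the first layer but not to subsequent layers? Do you get immediate initialization?
+1. What happens if you specify mismatching dimensions?
+1. What would you need to do if you have inputs of varying dimensionality? Hint: look at parameter tying.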
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/280)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/8092)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/281)
+:end_tab:
diff --git a/chapter_deep-learning-computation/deferred-init_origin.md b/chapter_builders-guide/lazy-init_origin.md
similarity index 72%
rename from chapter_deep-learning-computation/deferred-init_origin.md
rename to chapter_builders-guide/lazy-init_origin.md
index 74b3480..6e7863f 100644
--- a/chapter_deep-learning-computation/deferred-init_origin.md
+++ b/chapter_builders-guide/lazy-init_origin.md
@@ -1,5 +1,5 @@
-# Deferred Initialization
-:label:`sec_deferred_init`
+# Lazy Initialization
+:label:`sec_lazy_init`
 
 So far, it might seem that we got away
 with being sloppy in setting up our networks.
@@ -37,26 +37,35 @@ and subsequently modifying our models.
 Next, we go deeper into the mechanics of initialization.
 
 
-## Instantiating a Network
+To begin, let's instantiate an MLP.
 
-To begin, let us instantiate an MLP.
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
 
 ```{.python .input}
+%%tab mxnet
 from mxnet import np, npx
 from mxnet.gluon import nn
 npx.set_np()
 
-def get_net():
-    net = nn.Sequential()
-    net.add(nn.Dense(256, activation='relu'))
-    net.add(nn.Dense(10))
-    return net
+net = nn.Sequential()
+net.add(nn.Dense(256, activation='relu'))
+net.add(nn.Dense(10))
+```
+
+```{.python .input}
+%%tab pytorch
+from d2l import torch as d2l
+import torch
+from torch import nn
 
-net = get_net()
+net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 import tensorflow as tf
 
 net = tf.keras.models.Sequential([
@@ -72,12 +81,18 @@ Consequently the framework has not yet initialized any parameters.
 We confirm by attempting to access the parameters below.
 
 ```{.python .input}
+%%tab mxnet
 print(net.collect_params)
 print(net.collect_params())
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab pytorch
+net[0].weight
+```
+
+```{.python .input}
+%%tab tensorflow
 [net.layers[i].get_weights() for i in range(len(net.layers))]
 ```
 
@@ -89,8 +104,8 @@ that the parameter dimension remains unknown.
 At this point, attempts to access `net[0].weight.data()`
 would trigger a runtime error stating that the network
 must be initialized before the parameters can be accessed.
-Now let us see what happens when we attempt to initialize
-parameters via the `initialize` function.
+Now let's see what happens when we attempt to initialize
+parameters via the `initialize` method.
 :end_tab:
 
 :begin_tab:`tensorflow`
@@ -100,6 +115,7 @@ have not been initialized yet.
 :end_tab:
 
 ```{.python .input}
+%%tab mxnet
 net.initialize()
 net.collect_params()
 ```
@@ -113,10 +129,11 @@ Instead, this call registers to MXNet that we wish
 to initialize the parameters.
 :end_tab:
 
-Next let us pass data through the network
+Next let's pass data through the network
 to make the framework finally initialize parameters.
 
 ```{.python .input}
+%%tab mxnet
 X = np.random.uniform(size=(2, 20))
 net(X)
 
@@ -124,7 +141,15 @@ net.collect_params()
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab pytorch
+X = torch.rand(2, 20)
+net(X)
+
+net[0].weight.shape
+```
+
+```{.python .input}
+%%tab tensorflow
 X = tf.random.uniform((2, 20))
 net(X)
 [w.shape for w in net.get_weights()]
@@ -138,14 +163,33 @@ to the second layer,
 and so on through the computational graph
 until all shapes are known.
 Note that in this case,
-only the first layer requires deferred initialization,
+only the first layer requires lazy initialization,
 but the framework initializes sequentially.
 Once all parameter shapes are known,
 the framework can finally initialize the parameters.
 
+:begin_tab:`pytorch`
+The following method
+passes in dummy inputs
+through the network
+for a dry run
+to infer all parameter shapes
+and subsequently initializes the parameters.
+It will be used later when default random initializations are not desired.
+:end_tab:
+
+```{.python .input}
+%%tab pytorch
+@d2l.add_to_class(d2l.Module)  #@save
+def apply_init(self, inputs, init=None):
+    self.forward(*inputs)
+    if init is not None:
+        self.net.apply(init)
+```
+
 ## Summary
 
-* Deferred initialization can be convenient, allowing the framework to infer parameter shapes automatically, making it easy to modify architectures and eliminating one common source of errors.
+* Lazy initialization can be convenient, allowing the framework to infer parameter shapes automatically, making it easy to modify architectures and eliminating one common source of errors.
 * We can pass data through the model to make the framework finally initialize parameters.
 
 
@@ -159,6 +203,10 @@ the framework can finally initialize the parameters.
 [Discussions](https://discuss.d2l.ai/t/280)
 :end_tab:
 
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/8092)
+:end_tab:
+
 :begin_tab:`tensorflow`
 [Discussions](https://discuss.d2l.ai/t/281)
 :end_tab:
diff --git a/chapter_builders-guide/model-construction.md b/chapter_builders-guide/model-construction.md
new file mode 100644
index 0000000..10e959d
--- /dev/null
+++ b/chapter_builders-guide/model-construction.md
@@ -0,0 +1,411 @@
+# Layers and Modules
+:label:`sec_model_construction`
+
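+When we first introduced neural networks, we focused on linear models with a single output. Here, the entire model consists of just a single neuron. Note that a single neuron (i) takes some set of inputs; (ii) generates a corresponding scalar output; and (iii) has a set of associated parameters that can be updated to optimize some objective function of interest. Then, once we started thinking about networks with multiple outputs, we leveraged vectorized arithmetic to characterize an entire layer of neurons. Just like individual neurons, layers (i) take a set of inputs, (ii) generate corresponding outputs, and (iii) are described by a set of tunable parameters. When we worked through softmax regression, a single layer was itself the model. However, even when we subsequently introduced MLPs, we could still think of the model as retaining this same basic structure.
+
+Interestingly, for MLPs, both the entire model and its constituent layers share this structure. The entire model takes in raw inputs (the features), generates outputs (the predictions), and possesses parameters (the combined parameters from all constituent layers). Likewise, each individual layer ingests inputs (supplied by the previous layer), generates outputs (the inputs to the subsequent layer), and possesses a set of tunable parameters that are updated according to the signal that flows backwards from the subsequent layer.
+
+While you might think that neurons, layers, and models give us enough abstractions to go about our business, it turns out that we often find it convenient to speak about components that are larger than an individual layer but smaller than the entire model. For example, the ResNet-152 architecture, which is wildly popular in computer vision, possesses hundreds of layers. These layers consist of repeating patterns of *groups of layers*. Implementing such a network one layer at a time can grow tedious. This concern is not just hypothetical: such design patterns are common in practice. The ResNet architecture mentioned above won the 2015 ImageNet and COCO computer vision competitions for both recognition and detection and remains a go-to architecture for many vision tasks. Similar architectures in which layers are arranged in various repeating patterns are now ubiquitous in other domains, including natural language processing and speech.
+
+To implement these complex networks, we introduce the concept of a neural network *module*. A module could describe a single layer, a component consisting of multiple layers, or the entire model itself. One benefit of working with the module abstraction is that modules can be combined into larger artifacts, often recursively. This is illustrated in :numref:`fig_blocks`. By defining code to generate modules of arbitrary complexity on demand, we can write surprisingly compact code and still implement complex neural networks.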
+
+![Multiple layers are combined into modules, forming repeating patterns of larger models.](../img/blocks.svg)
+:label:`fig_blocks`
+
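+From a programming standpoint, a module is represented by a *class*. Any subclass of it must define a forward propagation method that transforms its input into output and must store any necessary parameters. Note that some modules do not require any parameters at all. Finally, a module must possess a backpropagation method, for purposes of calculating gradients. Fortunately, thanks to some behind-the-scenes magic supplied by automatic differentiation (introduced in :numref:`sec_autograd`), when defining our own module we only need to worry about parameters and the forward propagation method.
+
+[**To begin, we revisit the code that we used to implement MLPs**] (:numref:`sec_mlp`). The following code generates a network with one fully connected hidden layer with 256 units and ReLU activation, followed by a fully connected output layer with 10 units (no activation function).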
+
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+```{.python .input  n=2}
+%%tab mxnet
+from mxnet import np, npx
+from mxnet.gluon import nn
+npx.set_np()
+
+net = nn.Sequential()
+net.add(nn.Dense(256, activation='relu'))
+net.add(nn.Dense(10))
+net.initialize()
+
+X = np.random.uniform(size=(2, 20))
+net(X).shape
+```
+
+```{.python .input  n=3}
+%%tab pytorch
+import torch
+from torch import nn
+from torch.nn import functional as F
+
+net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
+
+X = torch.rand(2, 20)
+net(X).shape
+```
+
+```{.python .input  n=4}
+%%tab tensorflow
+import tensorflow as tf
+
+net = tf.keras.models.Sequential([
+    tf.keras.layers.Dense(256, activation=tf.nn.relu),
+    tf.keras.layers.Dense(10),
+])
+
+X = tf.random.uniform((2, 20))
+net(X).shape
+```
+
+:begin_tab:`mxnet`
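+In this example, we constructed our model by instantiating an `nn.Sequential`, assigning the returned object to the `net` variable. Next, we repeatedly call its `add` method, appending layers in the order that they should be executed. In short, `nn.Sequential` defines a special kind of `Block`, the class that presents a *module* in Gluon. It maintains an ordered list of constituent `Block`s. The `add` method simply facilitates the addition of each successive `Block` to the list. Note that each layer is an instance of the `Dense` class, which is itself a subclass of `Block`. The forward propagation (`forward`) method is also remarkably simple: it chains each `Block` in the list together, passing the output of each as input to the next. Note that until now, we have been invoking our models via the construction `net(X)` to obtain their outputs. This is actually just shorthand for `net.forward(X)`, a slick Python trick achieved via the `Block` class's `__call__` method.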
+:end_tab:
+
+:begin_tab:`pytorch`
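+In this example, we constructed our model by instantiating an `nn.Sequential`, with layers in the order that they should be executed passed as arguments. In short, (**`nn.Sequential` defines a special kind of `Module`**), the class that presents a module in PyTorch. It maintains an ordered list of constituent `Module`s. Note that each of the two fully connected layers is an instance of the `Linear` class, which is itself a subclass of `Module`. The forward propagation (`forward`) method is also remarkably simple: it chains each module in the list together, passing the output of each as input to the next. Note that until now, we have been invoking our models via the construction `net(X)` to obtain their outputs. This is actually just shorthand for `net.__call__(X)`.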
+:end_tab:
+
+:begin_tab:`tensorflow`
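+In this example, we constructed our model by instantiating a `keras.models.Sequential`, with layers in the order that they should be executed passed as arguments. In short, `Sequential` defines a special kind of `keras.Model`, the class that presents a module in Keras. It maintains an ordered list of constituent layers. Note that each of the two fully connected layers is an instance of the `Dense` class, which is itself a subclass of `keras.layers.Layer` (as is `keras.Model`). The forward propagation (`call`) method is also remarkably simple: it chains each module in the list together, passing the output of each as input to the next. Note that until now, we have been invoking our models via the construction `net(X)` to obtain their outputs. This invokes `net.call(X)` through the module class's `__call__` method, a slick Python trick.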
+:end_tab:
+
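+## [**A Custom Module**]
+
+Perhaps the easiest way to develop intuition about how a module works is to implement one ourselves. Before we implement our own custom module, we briefly summarize the basic functionality that each module must provide:
+
+1. Ingest input data as arguments to its forward propagation method.
+1. Generate an output by having the forward propagation method return a value. Note that the output may have a different shape from the input. For example, the first fully connected layer in our model above ingests an input of arbitrary dimension but returns an output of dimension 256.
+1. Calculate the gradient of its output with respect to its input, which can be accessed via its backpropagation method. Typically this happens automatically.
+1. Store and provide access to those parameters necessary to execute the forward propagation computation.
+1. Initialize model parameters as needed.
+
+In the following snippet, we code up a module from scratch corresponding to an MLP with one hidden layer with 256 hidden units and a 10-dimensional output layer. Note that the `MLP` class below inherits from the class that represents a module. We will rely heavily on the parent class's methods, supplying only our own constructor (the `__init__` method in Python) and the forward propagation method.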
+
+```{.python .input  n=5}
+%%tab mxnet
+class MLP(nn.Block):
+    def __init__(self):
+        # Call the constructor of the MLP parent class nn.Block to perform
+        # the necessary initialization
+        super().__init__()
+        self.hidden = nn.Dense(256, activation='relu')
+        self.out = nn.Dense(10)
+
+    # Define the forward propagation of the model, that is, how to return the
+    # required model output based on the input X
+    def forward(self, X):
+        return self.out(self.hidden(X))
+```
+
+```{.python .input  n=6}
+%%tab pytorch
+class MLP(nn.Module):
+    def __init__(self):
+        # Call the constructor of the parent class nn.Module to perform
+        # the necessary initialization
+        super().__init__()
+        self.hidden = nn.LazyLinear(256)
+        self.out = nn.LazyLinear(10)
+
+    # Define the forward propagation of the model, that is, how to return the
+    # required model output based on the input X
+    def forward(self, X):
+        return self.out(F.relu(self.hidden(X)))
+```
+
+```{.python .input  n=7}
+%%tab tensorflow
+class MLP(tf.keras.Model):
+    def __init__(self):
+        # Call the constructor of the parent class tf.keras.Model to perform
+        # the necessary initialization
+        super().__init__()
+        self.hidden = tf.keras.layers.Dense(units=256, activation=tf.nn.relu)
+        self.out = tf.keras.layers.Dense(units=10)
+
+    # Define the forward propagation of the model, that is, how to return the
+    # required model output based on the input X
+    def call(self, X):
+        return self.out(self.hidden(X))
+```
+
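+Let's first focus on the forward propagation method. Note that it takes `X` as input, calculates the hidden representation with the activation function applied, and outputs its logits. In this `MLP` implementation, both layers are instance variables. To see why this is reasonable, imagine instantiating two MLPs, `net1` and `net2`, and training them on different data. Naturally, we would expect them to represent two different learned models.
+
+We [**instantiate the MLP's layers**] in the constructor (**and subsequently invoke these layers**) on each call to the forward propagation method. Note a few key details. First, our customized `__init__` method invokes the parent class's `__init__` method via `super().__init__()`, sparing us the pain of restating boilerplate code applicable to most modules. We then instantiate our two fully connected layers, assigning them to `self.hidden` and `self.out`. Note that unless we implement a new layer, we need not worry about the backpropagation method or parameter initialization. The system will generate these methods automatically. Let's try this out.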
+
+```{.python .input  n=8}
+%%tab all
+net = MLP()
+if tab.selected('mxnet'):
+    net.initialize()
+net(X).shape
+```
+
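+A key virtue of the module abstraction is its versatility. We can subclass a module to create layers (such as the fully connected layer class), entire models (such as the `MLP` class above), or various components of intermediate complexity. We exploit this versatility throughout the following chapters, such as when addressing convolutional neural networks.
+
+## [**The Sequential Module**]
+
+We can now take a closer look at how the `Sequential` class works. Recall that `Sequential` was designed to daisy-chain other modules together. To build our own simplified `MySequential`, we just need to define two key methods:
+1. A method to append modules one by one to a list.
+2. A forward propagation method to pass an input through the chain of modules, in the same order as they were appended.
+
+The following `MySequential` class delivers the same functionality as the default `Sequential` class.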
+
+```{.python .input  n=10}
+%%tab mxnet
+class MySequential(nn.Block):
+    def add(self, block):
+        # Here, block is an instance of a Block subclass, and we assume that
+        # it has a unique name. We save it in the member variable _children of
+        # the Block class, and its type is OrderedDict. When the MySequential
+        # instance calls the initialize method, the system automatically
+        # initializes all members of _children
+        self._children[block.name] = block
+
+    def forward(self, X):
+        # OrderedDict guarantees that members will be traversed in the order
+        # they were added
+        for block in self._children.values():
+            X = block(X)
+        return X
+```
+
+```{.python .input  n=11}
+%%tab pytorch
+class MySequential(nn.Module):
+    def __init__(self, *args):
+        super().__init__()
+        for idx, module in enumerate(args):
+            self.add_module(str(idx), module)
+
+    def forward(self, X):
+        for module in self.children():
+            X = module(X)
+        return X
+```
+
+```{.python .input  n=12}
+%%tab tensorflow
+class MySequential(tf.keras.Model):
+    def __init__(self, *args):
+        super().__init__()
+        self.modules = args
+
+    def call(self, X):
+        for module in self.modules:
+            X = module(X)
+        return X
+```
+
+:begin_tab:`mxnet`
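+The `add` method adds a single block to the ordered dictionary `_children`. You might wonder why every Gluon `Block` possesses a `_children` attribute and why we used it rather than just defining a Python list ourselves. In short, the chief advantage of `_children` is that during our block's parameter initialization, Gluon knows to look inside the `_children` dictionary to find sub-blocks whose parameters also need to be initialized.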
+:end_tab:
+
+:begin_tab:`pytorch`
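+In the `__init__` method, we add every module by calling the `add_module` method. These modules can be accessed by the `children` method later. In this way the system knows the added modules, and it will properly initialize each module's parameters.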
+:end_tab:
+
+When our `MySequential`'s forward propagation method is invoked, each added module is executed in the order in which it was added. We can now reimplement an MLP using our `MySequential` class.
+
+```{.python .input  n=13}
+%%tab mxnet
+net = MySequential()
+net.add(nn.Dense(256, activation='relu'))
+net.add(nn.Dense(10))
+net.initialize()
+net(X).shape
+```
+
+```{.python .input  n=14}
+%%tab pytorch
+net = MySequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
+net(X).shape
+```
+
+```{.python .input  n=15}
+%%tab tensorflow
+net = MySequential(
+    tf.keras.layers.Dense(units=256, activation=tf.nn.relu),
+    tf.keras.layers.Dense(10))
+net(X).shape
+```
+
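+Note that this use of `MySequential` is identical to the code we previously wrote for the `Sequential` class (as described in :numref:`sec_mlp`).
+
+## [**Executing Code in the Forward Propagation Method**]
+
+The `Sequential` class makes model construction easy, allowing us to assemble new architectures without having to define our own class. However, not all architectures are simple daisy chains. When greater flexibility is required, we will want to define our own blocks. For example, we might want to execute Python's control flow within the forward propagation method. Moreover, we might want to perform arbitrary mathematical operations, not simply relying on predefined neural network layers.
+
+You might have noticed that until now, all of the operations in our networks have acted upon our network's activations and its parameters. Sometimes, however, we might want to incorporate terms that are neither the result of previous layers nor updatable parameters. We call these *constant parameters*. Say for example that we want a layer that calculates the function $f(\mathbf{x},\mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}$, where $\mathbf{x}$ is the input, $\mathbf{w}$ is our parameter, and $c$ is some specified constant that is not updated during optimization. So we implement a `FixedHiddenMLP` class as follows.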
+
+```{.python .input  n=16}
+%%tab mxnet
+class FixedHiddenMLP(nn.Block):
+    def __init__(self):
+        super().__init__()
+        # Random weight parameters created with the `get_constant` method
+        # are not updated during training (i.e., constant parameters)
+        self.rand_weight = self.params.get_constant(
+            'rand_weight', np.random.uniform(size=(20, 20)))
+        self.dense = nn.Dense(20, activation='relu')
+
+    def forward(self, X):
+        X = self.dense(X)
+        # Use the created constant parameters, as well as the `relu` and `dot`
+        # functions
+        X = npx.relu(np.dot(X, self.rand_weight.data()) + 1)
+        # Reuse the fully connected layer. This is equivalent to sharing
+        # parameters with two fully connected layers
+        X = self.dense(X)
+        # Control flow
+        while np.abs(X).sum() > 1:
+            X /= 2
+        return X.sum()
+```
+
+```{.python .input}
+%%tab pytorch
+class FixedHiddenMLP(nn.Module):
+    def __init__(self):
+        super().__init__()
+        # Random weight parameters that will not compute gradients and
+        # therefore keep constant during training
+        self.rand_weight = torch.rand((20, 20))
+        self.linear = nn.LazyLinear(20)
+
+    def forward(self, X):
+        X = self.linear(X)
+        X = F.relu(X @ self.rand_weight + 1)
+        # Reuse the fully connected layer. This is equivalent to sharing
+        # parameters with two fully connected layers
+        X = self.linear(X)
+        # Control flow
+        while X.abs().sum() > 1:
+            X /= 2
+        return X.sum()
+```
+
+```{.python .input}
+%%tab tensorflow
+class FixedHiddenMLP(tf.keras.Model):
+    def __init__(self):
+        super().__init__()
+        self.flatten = tf.keras.layers.Flatten()
+        # Random weight parameters created with `tf.constant` are not updated
+        # during training (i.e., constant parameters)
+        self.rand_weight = tf.constant(tf.random.uniform((20, 20)))
+        self.dense = tf.keras.layers.Dense(20, activation=tf.nn.relu)
+
+    def call(self, inputs):
+        X = self.flatten(inputs)
+        # Use the created constant parameters, as well as the `relu` and
+        # `matmul` functions
+        X = tf.nn.relu(tf.matmul(X, self.rand_weight) + 1)
+        # Reuse the fully connected layer. This is equivalent to sharing
+        # parameters with two fully connected layers
+        X = self.dense(X)
+        # Control flow
+        while tf.reduce_sum(tf.math.abs(X)) > 1:
+            X /= 2
+        return tf.reduce_sum(X)
+```
+
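+In this `FixedHiddenMLP` model, we implement a hidden layer whose weights (`self.rand_weight`) are initialized randomly at instantiation and are thereafter constant. This weight is not a model parameter and thus it is never updated by backpropagation. The network then passes the output of this "fixed" layer through a fully connected layer.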
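+
+To make the distinction concrete, here is a small worked sketch for the toy function $f(\mathbf{x},\mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}$ introduced earlier. The loss $L$ and learning rate $\eta$ are our own assumptions for illustration and do not appear in the code above. The constant $c$ scales the gradient that reaches $\mathbf{w}$, but $c$ itself receives no update:
+
+$$\frac{\partial f}{\partial \mathbf{w}} = c\,\mathbf{x}, \qquad \mathbf{w} \leftarrow \mathbf{w} - \eta\, \frac{\partial L}{\partial f}\, c\, \mathbf{x}, \qquad c \leftarrow c.$$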
+
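+Note that before returning the output, our model did something unusual. We ran a while-loop that tests whether the $\ell_1$ norm of the output is larger than $1$ and divides the output vector by $2$ until that condition no longer holds. Finally, we returned the sum of the entries in `X`. To our knowledge, no standard neural network performs this operation. Note that this particular operation may not be useful in any real-world task. Our point is only to show how to integrate arbitrary code into the flow of your neural network computations.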
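+
+As a quick check on this loop (a back-of-the-envelope note of our own, not part of the original text), write $s > 0$ for the sum of absolute values of the entries of `X` just before the loop. Each pass halves every entry, so after $k$ passes the sum is $s / 2^k$, and the loop exits at the first $k$ for which this is at most $1$:
+
+$$k = \max\left(0, \lceil \log_2 s \rceil\right), \qquad \text{output} = \frac{1}{2^k} \sum_{i,j} X_{ij}.$$
+
+For instance, $s = 3$ gives $3 \to 1.5 \to 0.75$, so $k = 2$. Because $k$ depends on the data, the number of operations executed can differ from one input to the next.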
+
+```{.python .input}
+%%tab all
+net = FixedHiddenMLP()
+if tab.selected('mxnet'):
+    net.initialize()
+net(X)
+```
+
+[**We can mix and match various ways of assembling modules.**] In the following example, we nest modules in some creative ways.
+
+```{.python .input}
+%%tab mxnet
+class NestMLP(nn.Block):
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+        self.net = nn.Sequential()
+        self.net.add(nn.Dense(64, activation='relu'),
+                     nn.Dense(32, activation='relu'))
+        self.dense = nn.Dense(16, activation='relu')
+
+    def forward(self, X):
+        return self.dense(self.net(X))
+
+chimera = nn.Sequential()
+chimera.add(NestMLP(), nn.Dense(20), FixedHiddenMLP())
+chimera.initialize()
+chimera(X)
+```
+
+```{.python .input}
+%%tab pytorch
+class NestMLP(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.net = nn.Sequential(nn.LazyLinear(64), nn.ReLU(),
+                                 nn.LazyLinear(32), nn.ReLU())
+        self.linear = nn.LazyLinear(16)
+
+    def forward(self, X):
+        return self.linear(self.net(X))
+
+chimera = nn.Sequential(NestMLP(), nn.LazyLinear(20), FixedHiddenMLP())
+chimera(X)
+```
+
+```{.python .input}
+%%tab tensorflow
+class NestMLP(tf.keras.Model):
+    def __init__(self):
+        super().__init__()
+        self.net = tf.keras.Sequential()
+        self.net.add(tf.keras.layers.Dense(64, activation=tf.nn.relu))
+        self.net.add(tf.keras.layers.Dense(32, activation=tf.nn.relu))
+        self.dense = tf.keras.layers.Dense(16, activation=tf.nn.relu)
+
+    def call(self, inputs):
+        return self.dense(self.net(inputs))
+
+chimera = tf.keras.Sequential()
+chimera.add(NestMLP())
+chimera.add(tf.keras.layers.Dense(20))
+chimera.add(FixedHiddenMLP())
+chimera(X)
+```
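+
+To follow what the nested model does, it can help to trace shapes through it. The sketch below assumes the minibatch `X` of shape $(2, 20)$ used throughout this section. The intermediate shapes are read off the layer widths in the code above rather than taken from any printed output:
+
+$$\underbrace{(2, 20) \to (2, 64) \to (2, 32) \to (2, 16)}_{\texttt{NestMLP}} \;\to\; \underbrace{(2, 20)}_{\text{20-unit layer}} \;\to\; \underbrace{\text{scalar}}_{\texttt{FixedHiddenMLP}}$$
+
+The 20-unit layer in the middle is what makes the pieces fit: the fixed $20 \times 20$ weight inside `FixedHiddenMLP` can only multiply features of width $20$.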
+
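+## Summary
+
+* Layers are modules.
+* Many layers can comprise a module.
+* Many modules can comprise a module.
+* A module can contain code.
+* Modules take care of lots of housekeeping, including parameter initialization and backpropagation.
+* Sequential concatenations of layers and modules are handled by the `Sequential` module.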
+
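+## Exercises
+
+1. What kinds of problems will occur if you change `MySequential` to store modules in a Python list?
+1. Implement a module that takes two modules as arguments, say `net1` and `net2`, and returns the concatenated output of both networks in the forward propagation. This is also called a parallel module.
+1. Assume that you want to concatenate multiple instances of the same network. Implement a factory function that generates multiple instances of the same module and build a larger network from it.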
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/54)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/55)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/264)
+:end_tab:
diff --git a/chapter_deep-learning-computation/model-construction_origin.md b/chapter_builders-guide/model-construction_origin.md
similarity index 54%
rename from chapter_deep-learning-computation/model-construction_origin.md
rename to chapter_builders-guide/model-construction_origin.md
index b698dfd..e971427 100644
--- a/chapter_deep-learning-computation/model-construction_origin.md
+++ b/chapter_builders-guide/model-construction_origin.md
@@ -1,4 +1,4 @@
-# Layers and Blocks
+# Layers and Modules
 :label:`sec_model_construction`
 
 When we first introduced neural networks,
@@ -60,45 +60,51 @@ are now ubiquitous in other domains,
 including natural language processing and speech.
 
 To implement these complex networks,
-we introduce the concept of a neural network *block*.
-A block could describe a single layer,
+we introduce the concept of a neural network *module*.
+A module could describe a single layer,
 a component consisting of multiple layers,
 or the entire model itself!
-One benefit of working with the block abstraction
+One benefit of working with the module abstraction
 is that they can be combined into larger artifacts,
-often recursively. This is illustrated in :numref:`fig_blocks`. By defining code to generate blocks
+often recursively. This is illustrated in :numref:`fig_blocks`. By defining code to generate modules
 of arbitrary complexity on demand,
 we can write surprisingly compact code
 and still implement complex neural networks.
 
-![Multiple layers are combined into blocks, forming repeating patterns of larger models.](../img/blocks.svg)
+![Multiple layers are combined into modules, forming repeating patterns of larger models.](../img/blocks.svg)
 :label:`fig_blocks`
 
 
-From a programing standpoint, a block is represented by a *class*.
-Any subclass of it must define a forward propagation function
+From a programming standpoint, a module is represented by a *class*.
+Any subclass of it must define a forward propagation method
 that transforms its input into output
 and must store any necessary parameters.
-Note that some blocks do not require any parameters at all.
-Finally a block must possess a backpropagation function,
+Note that some modules do not require any parameters at all.
+Finally, a module must possess a backpropagation method,
 for purposes of calculating gradients.
 Fortunately, due to some behind-the-scenes magic
 supplied by the auto differentiation
 (introduced in :numref:`sec_autograd`)
-when defining our own block,
+when defining our own module,
 we only need to worry about parameters
-and the forward propagation function.
+and the forward propagation method.
 
 [**To begin, we revisit the code
 that we used to implement MLPs**]
-(:numref:`sec_mlp_concise`).
+(:numref:`sec_mlp`).
 The following code generates a network
-with one fully-connected hidden layer
+with one fully connected hidden layer
 with 256 units and ReLU activation,
-followed by a fully-connected output layer
+followed by a fully connected output layer
 with 10 units (no activation function).
 
-```{.python .input}
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+```{.python .input  n=2}
+%%tab mxnet
 from mxnet import np, npx
 from mxnet.gluon import nn
 npx.set_np()
@@ -109,23 +115,23 @@ net.add(nn.Dense(10))
 net.initialize()
 
 X = np.random.uniform(size=(2, 20))
-net(X)
+net(X).shape
 ```
 
-```{.python .input}
-#@tab pytorch
+```{.python .input  n=3}
+%%tab pytorch
 import torch
 from torch import nn
 from torch.nn import functional as F
 
-net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
+net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
 
 X = torch.rand(2, 20)
-net(X)
+net(X).shape
 ```
 
-```{.python .input}
-#@tab tensorflow
+```{.python .input  n=4}
+%%tab tensorflow
 import tensorflow as tf
 
 net = tf.keras.models.Sequential([
@@ -134,31 +140,31 @@ net = tf.keras.models.Sequential([
 ])
 
 X = tf.random.uniform((2, 20))
-net(X)
+net(X).shape
 ```
 
 :begin_tab:`mxnet`
 In this example, we constructed
 our model by instantiating an `nn.Sequential`,
 assigning the returned object to the `net` variable.
-Next, we repeatedly call its `add` function,
+Next, we repeatedly call its `add` method,
 appending layers in the order
 that they should be executed.
 In short, `nn.Sequential` defines a special kind of `Block`,
-the class that presents a block in Gluon.
+the class that represents a *module* in Gluon.
 It maintains an ordered list of constituent `Block`s.
-The `add` function simply facilitates
+The `add` method simply facilitates
 the addition of each successive `Block` to the list.
 Note that each layer is an instance of the `Dense` class
 which is itself a subclass of `Block`.
-The forward propagation (`forward`) function is also remarkably simple:
+The forward propagation (`forward`) method is also remarkably simple:
 it chains each `Block` in the list together,
-passing the output of each as the input to the next.
+passing the output of each as input to the next.
 Note that until now, we have been invoking our models
 via the construction `net(X)` to obtain their outputs.
 This is actually just shorthand for `net.forward(X)`,
 a slick Python trick achieved via
-the `Block` class's `__call__` function.
+the `Block` class's `__call__` method.
 :end_tab:
 
 :begin_tab:`pytorch`
@@ -166,13 +172,13 @@ In this example, we constructed
 our model by instantiating an `nn.Sequential`, with layers in the order
 that they should be executed passed as arguments.
 In short, (**`nn.Sequential` defines a special kind of `Module`**),
-the class that presents a block in PyTorch.
+the class that represents a module in PyTorch.
 It maintains an ordered list of constituent `Module`s.
-Note that each of the two fully-connected layers is an instance of the `Linear` class
+Note that each of the two fully connected layers is an instance of the `Linear` class
 which is itself a subclass of `Module`.
-The forward propagation (`forward`) function is also remarkably simple:
-it chains each block in the list together,
-passing the output of each as the input to the next.
+The forward propagation (`forward`) method is also remarkably simple:
+it chains each module in the list together,
+passing the output of each as input to the next.
 Note that until now, we have been invoking our models
 via the construction `net(X)` to obtain their outputs.
 This is actually just shorthand for `net.__call__(X)`.
@@ -183,125 +189,97 @@ In this example, we constructed
 our model by instantiating an `keras.models.Sequential`, with layers in the order
 that they should be executed passed as arguments.
 In short, `Sequential` defines a special kind of `keras.Model`,
-the class that presents a block in Keras.
+the class that represents a module in Keras.
 It maintains an ordered list of constituent `Model`s.
-Note that each of the two fully-connected layers is an instance of the `Dense` class
+Note that each of the two fully connected layers is an instance of the `Dense` class
 which is itself a subclass of `Model`.
-The forward propagation (`call`) function is also remarkably simple:
-it chains each block in the list together,
-passing the output of each as the input to the next.
+The forward propagation (`call`) method is also remarkably simple:
+it chains each module in the list together,
+passing the output of each as input to the next.
 Note that until now, we have been invoking our models
 via the construction `net(X)` to obtain their outputs.
 This is actually just shorthand for `net.call(X)`,
 a slick Python trick achieved via
-the Block class's `__call__` function.
+the module class's `__call__` method.
 :end_tab:
 
-## [**A Custom Block**]
+## [**A Custom Module**]
 
 Perhaps the easiest way to develop intuition
-about how a block works
+about how a module works
 is to implement one ourselves.
-Before we implement our own custom block,
+Before we implement our own custom module,
 we briefly summarize the basic functionality
-that each block must provide:
-
-:begin_tab:`mxnet, tensorflow`
-
-1. Ingest input data as arguments to its forward propagation function.
-1. Generate an output by having the forward propagation function return a value. Note that the output may have a different shape from the input. For example, the first fully-connected layer in our model above ingests an input of arbitrary dimension but returns an output of dimension 256.
-1. Calculate the gradient of its output with respect to its input, which can be accessed via its backpropagation function. Typically this happens automatically.
-1. Store and provide access to those parameters necessary
-   to execute the forward propagation computation.
-1. Initialize model parameters as needed.
-
-:end_tab:
+that each module must provide:
 
-:begin_tab:`pytorch`
 
-1. Ingest input data as arguments to its forward propagation function.
-1. Generate an output by having the forward propagation function return a value. Note that the output may have a different shape from the input. For example, the first fully-connected layer in our model above ingests an input of dimension 20 but returns an output of dimension 256.
-1. Calculate the gradient of its output with respect to its input, which can be accessed via its backpropagation function. Typically this happens automatically.
+1. Ingest input data as arguments to its forward propagation method.
+1. Generate an output by having the forward propagation method return a value. Note that the output may have a different shape from the input. For example, the first fully connected layer in our model above ingests an input of arbitrary dimension but returns an output of dimension 256.
+1. Calculate the gradient of its output with respect to its input, which can be accessed via its backpropagation method. Typically this happens automatically.
 1. Store and provide access to those parameters necessary
    to execute the forward propagation computation.
 1. Initialize model parameters as needed.
 
-:end_tab:
-
 
 In the following snippet,
-we code up a block from scratch
+we code up a module from scratch
 corresponding to an MLP
 with one hidden layer with 256 hidden units,
 and a 10-dimensional output layer.
-Note that the `MLP` class below inherits the class that represents a block.
-We will heavily rely on the parent class's functions,
-supplying only our own constructor (the `__init__` function in Python) and the forward propagation function.
+Note that the `MLP` class below inherits the class that represents a module.
+We will heavily rely on the parent class's methods,
+supplying only our own constructor (the `__init__` method in Python) and the forward propagation method.
 
-```{.python .input}
+```{.python .input  n=5}
+%%tab mxnet
 class MLP(nn.Block):
-    # Declare a layer with model parameters. Here, we declare two
-    # fully-connected layers
-    def __init__(self, **kwargs):
-        # Call the constructor of the `MLP` parent class `Block` to perform
-        # the necessary initialization. In this way, other function arguments
-        # can also be specified during class instantiation, such as the model
-        # parameters, `params` (to be described later)
-        super().__init__(**kwargs)
-        self.hidden = nn.Dense(256, activation='relu')  # Hidden layer
-        self.out = nn.Dense(10)  # Output layer
+    def __init__(self):
+        # Call the constructor of the MLP parent class nn.Block to perform
+        # the necessary initialization
+        super().__init__()
+        self.hidden = nn.Dense(256, activation='relu')
+        self.out = nn.Dense(10)
 
     # Define the forward propagation of the model, that is, how to return the
-    # required model output based on the input `X`
+    # required model output based on the input X
     def forward(self, X):
         return self.out(self.hidden(X))
 ```
 
-```{.python .input}
-#@tab pytorch
+```{.python .input  n=6}
+%%tab pytorch
 class MLP(nn.Module):
-    # Declare a layer with model parameters. Here, we declare two fully
-    # connected layers
     def __init__(self):
-        # Call the constructor of the `MLP` parent class `Module` to perform
-        # the necessary initialization. In this way, other function arguments
-        # can also be specified during class instantiation, such as the model
-        # parameters, `params` (to be described later)
+        # Call the constructor of the parent class nn.Module to perform
+        # the necessary initialization
         super().__init__()
-        self.hidden = nn.Linear(20, 256)  # Hidden layer
-        self.out = nn.Linear(256, 10)  # Output layer
+        self.hidden = nn.LazyLinear(256)
+        self.out = nn.LazyLinear(10)
 
     # Define the forward propagation of the model, that is, how to return the
-    # required model output based on the input `X`
+    # required model output based on the input X
     def forward(self, X):
-        # Note here we use the funtional version of ReLU defined in the
-        # nn.functional module.
         return self.out(F.relu(self.hidden(X)))
 ```
 
-```{.python .input}
-#@tab tensorflow
+```{.python .input  n=7}
+%%tab tensorflow
 class MLP(tf.keras.Model):
-    # Declare a layer with model parameters. Here, we declare two fully
-    # connected layers
     def __init__(self):
-        # Call the constructor of the `MLP` parent class `Model` to perform
-        # the necessary initialization. In this way, other function arguments
-        # can also be specified during class instantiation, such as the model
-        # parameters, `params` (to be described later)
+        # Call the constructor of the parent class tf.keras.Model to perform
+        # the necessary initialization
         super().__init__()
-        # Hidden layer
         self.hidden = tf.keras.layers.Dense(units=256, activation=tf.nn.relu)
-        self.out = tf.keras.layers.Dense(units=10)  # Output layer
+        self.out = tf.keras.layers.Dense(units=10)
 
     # Define the forward propagation of the model, that is, how to return the
-    # required model output based on the input `X`
+    # required model output based on the input X
     def call(self, X):
         return self.out(self.hidden((X)))
 ```
 
-Let us first focus on the forward propagation function.
-Note that it takes `X` as the input,
+Let's first focus on the forward propagation method.
+Note that it takes `X` as input,
 calculates the hidden representation
 with the activation function applied,
 and outputs its logits.
@@ -316,42 +294,32 @@ to represent two different learned models.
 We [**instantiate the MLP's layers**]
 in the constructor
 (**and subsequently invoke these layers**)
-on each call to the forward propagation function.
+on each call to the forward propagation method.
 Note a few key details.
-First, our customized `__init__` function
-invokes the parent class's `__init__` function
+First, our customized `__init__` method
+invokes the parent class's `__init__` method
 via `super().__init__()`
 sparing us the pain of restating
-boilerplate code applicable to most blocks.
-We then instantiate our two fully-connected layers,
+boilerplate code applicable to most modules.
+We then instantiate our two fully connected layers,
 assigning them to `self.hidden` and `self.out`.
-Note that unless we implement a new operator,
-we need not worry about the backpropagation function
+Note that unless we implement a new layer,
+we need not worry about the backpropagation method
 or parameter initialization.
-The system will generate these functions automatically.
-Let us try this out.
+The system will generate these methods automatically.
+Let's try this out.
 
-```{.python .input}
+```{.python .input  n=8}
+%%tab all
 net = MLP()
-net.initialize()
-net(X)
+if tab.selected('mxnet'):
+    net.initialize()
+net(X).shape
 ```
 
-```{.python .input}
-#@tab pytorch
-net = MLP()
-net(X)
-```
-
-```{.python .input}
-#@tab tensorflow
-net = MLP()
-net(X)
-```
-
-A key virtue of the block abstraction is its versatility.
-We can subclass a block to create layers
-(such as the fully-connected layer class),
+A key virtue of the module abstraction is its versatility.
+We can subclass a module to create layers
+(such as the fully connected layer class),
 entire models (such as the `MLP` class above),
 or various components of intermediate complexity.
 We exploit this versatility
@@ -360,28 +328,29 @@ such as when addressing
 convolutional neural networks.
 
 
-## [**The Sequential Block**]
+## [**The Sequential Module**]
 
 We can now take a closer look
 at how the `Sequential` class works.
 Recall that `Sequential` was designed
-to daisy-chain other blocks together.
+to daisy-chain other modules together.
 To build our own simplified `MySequential`,
-we just need to define two key function:
-1. A function to append blocks one by one to a list.
-2. A forward propagation function to pass an input through the chain of blocks, in the same order as they were appended.
+we just need to define two key methods:
+1. A method to append modules one by one to a list.
+2. A forward propagation method to pass an input through the chain of modules, in the same order as they were appended.
 
 The following `MySequential` class delivers the same
 functionality of the default `Sequential` class.
 
-```{.python .input}
+```{.python .input  n=10}
+%%tab mxnet
 class MySequential(nn.Block):
     def add(self, block):
-        # Here, `block` is an instance of a `Block` subclass, and we assume 
-        # that it has a unique name. We save it in the member variable
-        # `_children` of the `Block` class, and its type is OrderedDict. When
-        # the `MySequential` instance calls the `initialize` function, the
-        # system automatically initializes all members of `_children`
+        # Here, block is an instance of a Block subclass, and we assume that
+        # it has a unique name. We save it in the member variable _children of
+        # the Block class, and its type is OrderedDict. When the MySequential
+        # instance calls the initialize method, the system automatically
+        # initializes all members of _children
         self._children[block.name] = block
 
     def forward(self, X):
@@ -392,35 +361,26 @@ class MySequential(nn.Block):
         return X
 ```
 
-```{.python .input}
-#@tab pytorch
+```{.python .input  n=11}
+%%tab pytorch
 class MySequential(nn.Module):
     def __init__(self, *args):
         super().__init__()
         for idx, module in enumerate(args):
-            # Here, `module` is an instance of a `Module` subclass. We save it
-            # in the member variable `_modules` of the `Module` class, and its
-            # type is OrderedDict
-            self._modules[str(idx)] = module
+            self.add_module(str(idx), module)
 
     def forward(self, X):
-        # OrderedDict guarantees that members will be traversed in the order
-        # they were added
-        for block in self._modules.values():
-            X = block(X)
+        for module in self.children():
+            X = module(X)
         return X
 ```
 
-```{.python .input}
-#@tab tensorflow
+```{.python .input  n=12}
+%%tab tensorflow
 class MySequential(tf.keras.Model):
     def __init__(self, *args):
         super().__init__()
-        self.modules = []
-        for block in args:
-            # Here, `block` is an instance of a `tf.keras.layers.Layer`
-            # subclass
-            self.modules.append(block)
+        self.modules = args
 
     def call(self, X):
         for module in self.modules:
@@ -429,7 +389,7 @@ class MySequential(tf.keras.Model):
 ```
 
 :begin_tab:`mxnet`
-The `add` function adds a single block
+The `add` method adds a single block
 to the ordered dictionary `_children`.
 You might wonder why every Gluon `Block`
 possesses a `_children` attribute
@@ -444,53 +404,47 @@ parameters also need to be initialized.
 
 :begin_tab:`pytorch`
 In the `__init__` method, we add every module
-to the ordered dictionary `_modules` one by one.
-You might wonder why every `Module`
-possesses a `_modules` attribute
-and why we used it rather than just
-define a Python list ourselves.
-In short the chief advantage of `_modules`
-is that during our module's parameter initialization,
-the system knows to look inside the `_modules`
-dictionary to find sub-modules whose
-parameters also need to be initialized.
+by calling the `add_module` method. These modules can be accessed by the `children` method later.
+In this way the system knows the added modules,
+and it will properly initialize each module's parameters.
 :end_tab:
 
-When our `MySequential`'s forward propagation function is invoked,
-each added block is executed
+When our `MySequential`'s forward propagation method is invoked,
+each added module is executed
 in the order in which they were added.
 We can now reimplement an MLP
 using our `MySequential` class.
 
-```{.python .input}
+```{.python .input  n=13}
+%%tab mxnet
 net = MySequential()
 net.add(nn.Dense(256, activation='relu'))
 net.add(nn.Dense(10))
 net.initialize()
-net(X)
+net(X).shape
 ```
 
-```{.python .input}
-#@tab pytorch
-net = MySequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
-net(X)
+```{.python .input  n=14}
+%%tab pytorch
+net = MySequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
+net(X).shape
 ```
 
-```{.python .input}
-#@tab tensorflow
+```{.python .input  n=15}
+%%tab tensorflow
 net = MySequential(
     tf.keras.layers.Dense(units=256, activation=tf.nn.relu),
     tf.keras.layers.Dense(10))
-net(X)
+net(X).shape
 ```
 
 Note that this use of `MySequential`
 is identical to the code we previously wrote
 for the `Sequential` class
-(as described in :numref:`sec_mlp_concise`).
+(as described in :numref:`sec_mlp`).
 
 
-## [**Executing Code in the Forward Propagation Function**]
+## [**Executing Code in the Forward Propagation Method**]
 
 The `Sequential` class makes model construction easy,
 allowing us to assemble new architectures
@@ -499,7 +453,7 @@ However, not all architectures are simple daisy chains.
 When greater flexibility is required,
 we will want to define our own blocks.
 For example, we might want to execute
-Python's control flow within the forward propagation function.
+Python's control flow within the forward propagation method.
 Moreover, we might want to perform
 arbitrary mathematical operations,
 not simply relying on predefined neural network layers.
@@ -521,11 +475,12 @@ and $c$ is some specified constant
 that is not updated during optimization.
 So we implement a `FixedHiddenMLP` class as follows.
 
-```{.python .input}
+```{.python .input  n=16}
+%%tab mxnet
 class FixedHiddenMLP(nn.Block):
-    def __init__(self, **kwargs):
-        super().__init__(**kwargs)
-        # Random weight parameters created with the `get_constant` function
+    def __init__(self):
+        super().__init__()
+        # Random weight parameters created with the `get_constant` method
         # are not updated during training (i.e., constant parameters)
         self.rand_weight = self.params.get_constant(
             'rand_weight', np.random.uniform(size=(20, 20)))
@@ -536,8 +491,8 @@ class FixedHiddenMLP(nn.Block):
         # Use the created constant parameters, as well as the `relu` and `dot`
         # functions
         X = npx.relu(np.dot(X, self.rand_weight.data()) + 1)
-        # Reuse the fully-connected layer. This is equivalent to sharing
-        # parameters with two fully-connected layers
+        # Reuse the fully connected layer. This is equivalent to sharing
+        # parameters with two fully connected layers
         X = self.dense(X)
         # Control flow
         while np.abs(X).sum() > 1:
@@ -546,22 +501,20 @@ class FixedHiddenMLP(nn.Block):
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 class FixedHiddenMLP(nn.Module):
     def __init__(self):
         super().__init__()
         # Random weight parameters that will not compute gradients and
         # therefore keep constant during training
-        self.rand_weight = torch.rand((20, 20), requires_grad=False)
-        self.linear = nn.Linear(20, 20)
+        self.rand_weight = torch.rand((20, 20))
+        self.linear = nn.LazyLinear(20)
 
     def forward(self, X):
-        X = self.linear(X)
-        # Use the created constant parameters, as well as the `relu` and `mm`
-        # functions
-        X = F.relu(torch.mm(X, self.rand_weight) + 1)
-        # Reuse the fully-connected layer. This is equivalent to sharing
-        # parameters with two fully-connected layers
+        X = self.linear(X)
+        X = F.relu(X @ self.rand_weight + 1)
+        # Reuse the fully connected layer. This is equivalent to sharing
+        # parameters with two fully connected layers
         X = self.linear(X)
         # Control flow
         while X.abs().sum() > 1:
@@ -570,7 +523,7 @@ class FixedHiddenMLP(nn.Module):
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 class FixedHiddenMLP(tf.keras.Model):
     def __init__(self):
         super().__init__()
@@ -585,8 +538,8 @@ class FixedHiddenMLP(tf.keras.Model):
         # Use the created constant parameters, as well as the `relu` and
         # `matmul` functions
         X = tf.nn.relu(tf.matmul(X, self.rand_weight) + 1)
-        # Reuse the fully-connected layer. This is equivalent to sharing
-        # parameters with two fully-connected layers
+        # Reuse the fully connected layer. This is equivalent to sharing
+        # parameters with two fully connected layers
         X = self.dense(X)
         # Control flow
         while tf.reduce_sum(tf.math.abs(X)) > 1:
@@ -601,12 +554,12 @@ at instantiation and are thereafter constant.
 This weight is not a model parameter
 and thus it is never updated by backpropagation.
 The network then passes the output of this "fixed" layer
-through a fully-connected layer.
+through a fully connected layer.
 
 Note that before returning the output,
 our model did something unusual.
 We ran a while-loop, testing
-on the condition its $L_1$ norm is larger than $1$,
+on the condition its $\ell_1$ norm is larger than $1$,
 and dividing our output vector by $2$
 until it satisfied the condition.
 Finally, we returned the sum of the entries in `X`.
@@ -619,23 +572,20 @@ arbitrary code into the flow of your
 neural network computations.
 
 ```{.python .input}
+%%tab all
 net = FixedHiddenMLP()
-net.initialize()
-net(X)
-```
-
-```{.python .input}
-#@tab pytorch, tensorflow
-net = FixedHiddenMLP()
+if tab.selected('mxnet'):
+    net.initialize()
 net(X)
 ```
 
 We can [**mix and match various
-ways of assembling blocks together.**]
-In the following example, we nest blocks
+ways of assembling modules together.**]
+In the following example, we nest modules
 in some creative ways.
 
 ```{.python .input}
+%%tab mxnet
 class NestMLP(nn.Block):
     def __init__(self, **kwargs):
         super().__init__(**kwargs)
@@ -654,23 +604,23 @@ chimera(X)
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 class NestMLP(nn.Module):
     def __init__(self):
         super().__init__()
-        self.net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
-                                 nn.Linear(64, 32), nn.ReLU())
-        self.linear = nn.Linear(32, 16)
+        self.net = nn.Sequential(nn.LazyLinear(64), nn.ReLU(),
+                                 nn.LazyLinear(32), nn.ReLU())
+        self.linear = nn.LazyLinear(16)
 
     def forward(self, X):
         return self.linear(self.net(X))
 
-chimera = nn.Sequential(NestMLP(), nn.Linear(16, 20), FixedHiddenMLP())
+chimera = nn.Sequential(NestMLP(), nn.LazyLinear(20), FixedHiddenMLP())
 chimera(X)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 class NestMLP(tf.keras.Model):
     def __init__(self):
         super().__init__()
@@ -689,80 +639,21 @@ chimera.add(FixedHiddenMLP())
 chimera(X)
 ```
 
-## Efficiency
-
-:begin_tab:`mxnet`
-The avid reader might start to worry
-about the efficiency of some of these operations.
-After all, we have lots of dictionary lookups,
-code execution, and lots of other Pythonic things
-taking place in what is supposed to be
-a high-performance deep learning library.
-The problems of Python's [global interpreter lock](https://wiki.python.org/moin/GlobalInterpreterLock) are well known. 
-In the context of deep learning,
-we may worry that our extremely fast GPU(s)
-might have to wait until a puny CPU
-runs Python code before it gets another job to run.
-The best way to speed up Python is by avoiding it altogether.
-
-One way that Gluon does this is by allowing for
-*hybridization*, which will be described later.
-Here, the Python interpreter executes a block
-the first time it is invoked.
-The Gluon runtime records what is happening
-and the next time around it short-circuits calls to Python.
-This can accelerate things considerably in some cases
-but care needs to be taken when control flow (as above)
-leads down different branches on different passes through the net.
-We recommend that the interested reader checks out
-the hybridization section (:numref:`sec_hybridize`)
-to learn about compilation after finishing the current chapter.
-:end_tab:
-
-:begin_tab:`pytorch`
-The avid reader might start to worry
-about the efficiency of some of these operations.
-After all, we have lots of dictionary lookups,
-code execution, and lots of other Pythonic things
-taking place in what is supposed to be
-a high-performance deep learning library.
-The problems of Python's [global interpreter lock](https://wiki.python.org/moin/GlobalInterpreterLock) are well known. 
-In the context of deep learning,
-we may worry that our extremely fast GPU(s)
-might have to wait until a puny CPU
-runs Python code before it gets another job to run.
-:end_tab:
-
-:begin_tab:`tensorflow`
-The avid reader might start to worry
-about the efficiency of some of these operations.
-After all, we have lots of dictionary lookups,
-code execution, and lots of other Pythonic things
-taking place in what is supposed to be
-a high-performance deep learning library.
-The problems of Python's [global interpreter lock](https://wiki.python.org/moin/GlobalInterpreterLock) are well known. 
-In the context of deep learning,
-we may worry that our extremely fast GPU(s)
-might have to wait until a puny CPU
-runs Python code before it gets another job to run.
-The best way to speed up Python is by avoiding it altogether.
-:end_tab:
-
 ## Summary
 
-* Layers are blocks.
-* Many layers can comprise a block.
-* Many blocks can comprise a block.
-* A block can contain code.
-* Blocks take care of lots of housekeeping, including parameter initialization and backpropagation.
-* Sequential concatenations of layers and blocks are handled by the `Sequential` block.
+* Layers are modules.
+* Many layers can comprise a module.
+* Many modules can comprise a module.
+* A module can contain code.
+* Modules take care of lots of housekeeping, including parameter initialization and backpropagation.
+* Sequential concatenations of layers and modules are handled by the `Sequential` module.
 
 
 ## Exercises
 
-1. What kinds of problems will occur if you change `MySequential` to store blocks in a Python list?
-1. Implement a block that takes two blocks as an argument, say `net1` and `net2` and returns the concatenated output of both networks in the forward propagation. This is also called a parallel block.
-1. Assume that you want to concatenate multiple instances of the same network. Implement a factory function that generates multiple instances of the same block and build a larger network from it.
+1. What kinds of problems will occur if you change `MySequential` to store modules in a Python list?
+1. Implement a module that takes two modules as an argument, say `net1` and `net2` and returns the concatenated output of both networks in the forward propagation. This is also called a parallel module.
+1. Assume that you want to concatenate multiple instances of the same network. Implement a factory function that generates multiple instances of the same module and build a larger network from it.
 
 :begin_tab:`mxnet`
 [Discussions](https://discuss.d2l.ai/t/54)
diff --git a/chapter_builders-guide/parameters.md b/chapter_builders-guide/parameters.md
new file mode 100644
index 0000000..4b8459c
--- /dev/null
+++ b/chapter_builders-guide/parameters.md
@@ -0,0 +1,214 @@
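+# Parameter Management
+
+Once we have chosen an architecture and set our hyperparameters, we proceed to the training loop, where our goal is to find parameter values that minimize our loss function. After training, we need these parameters to make future predictions. We will also sometimes want to extract the parameters, whether to reuse them in another context, to save the model to disk so that it can be run by other software, or to examine them in the hope of gaining scientific understanding.
+
+Most of the time, we can ignore the nitty-gritty details of how parameters are declared and manipulated and rely on the deep learning framework to do the heavy lifting. However, when we move away from stacked architectures with standard layers, we sometimes need to get into the weeds of declaring and manipulating parameters. In this section, we cover the following:
+
+* Accessing parameters for debugging, diagnostics, and visualizations.
+* Sharing parameters across different model components.
+
+(**We start by focusing on an MLP with one hidden layer.**)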
+
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+```{.python .input}
+%%tab mxnet
+from mxnet import init, np, npx
+from mxnet.gluon import nn
+npx.set_np()
+
+net = nn.Sequential()
+net.add(nn.Dense(8, activation='relu'))
+net.add(nn.Dense(1))
+net.initialize()  # Use the default initialization method
+
+X = np.random.uniform(size=(2, 4))
+net(X).shape
+```
+
+```{.python .input}
+%%tab pytorch
+import torch
+from torch import nn
+
+net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
+X = torch.rand(size=(2, 4))
+net(X).shape
+```
+
+```{.python .input}
+%%tab tensorflow
+import tensorflow as tf
+
+net = tf.keras.models.Sequential([
+    tf.keras.layers.Flatten(),
+    tf.keras.layers.Dense(4, activation=tf.nn.relu),
+    tf.keras.layers.Dense(1),
+])
+
+X = tf.random.uniform((2, 4))
+net(X).shape
+```
+
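+## [**Parameter Access**]
+
+Let's start with how to access parameters from the models that you already know. When a model is defined via the `Sequential` class, we can first access any layer by indexing into the model as though it were a list. Each layer's parameters are conveniently exposed as attributes of that layer. We can inspect the parameters of the second fully connected layer as follows.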
+
+```{.python .input}
+%%tab mxnet
+net[1].params
+```
+
+```{.python .input}
+%%tab pytorch
+net[2].state_dict()
+```
+
+```{.python .input}
+%%tab tensorflow
+net.layers[2].weights
+```
+
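+We can see that this fully connected layer contains two parameters, corresponding to that layer's weights and biases, respectively.
+
+### [**Targeted Parameters**]
+
+Note that each parameter is represented as an instance of the parameter class. To do anything useful with the parameters, we first need to access the underlying numerical values. There are several ways to do this. Some are simpler while others are more general. The following code extracts the bias from the second neural network layer, which returns a parameter class instance, and then accesses that parameter's value.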
+
+```{.python .input}
+%%tab mxnet
+type(net[1].bias), net[1].bias.data()
+```
+
+```{.python .input}
+%%tab pytorch
+type(net[2].bias), net[2].bias.data
+```
+
+```{.python .input}
+%%tab tensorflow
+type(net.layers[2].weights[1]), tf.convert_to_tensor(net.layers[2].weights[1])
+```
+
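+:begin_tab:`mxnet,pytorch`
+Parameters are complex objects, containing values, gradients, and additional information. That is why we need to request the value explicitly.
+
+In addition to the value, each parameter also allows us to access the gradient. Because we have not invoked backpropagation for this network yet, the gradient is still in its initial state.
+:end_tab: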
+
+```{.python .input}
+%%tab mxnet
+net[1].weight.grad()
+```
+
+```{.python .input}
+%%tab pytorch
+net[2].weight.grad is None
+```
+
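+### [**All Parameters at Once**]
+
+When we need to perform operations on all parameters, accessing them one by one can grow tedious. The situation becomes especially unwieldy with more complex modules (e.g., nested modules), since we would need to recurse through the entire tree to extract each sub-module's parameters. Below we demonstrate accessing the parameters of all layers.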
+
+```{.python .input}
+%%tab mxnet
+net.collect_params()
+```
+
+```{.python .input}
+%%tab pytorch
+[(name, param.shape) for name, param in net.named_parameters()]
+```
+
+```{.python .input}
+%%tab tensorflow
+net.get_weights()
+```
+
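+## [**Tied Parameters**]
+
+Often, we want to share parameters across multiple layers. Let's see how to do this elegantly. In the following we allocate a fully connected layer and then use its parameters to set those of another layer. Here we need to run the forward propagation `net(X)` before accessing the parameters.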
+
+```{.python .input}
+%%tab mxnet
+net = nn.Sequential()
+# We need to give the shared layer a name so that we can refer to its
+# parameters
+shared = nn.Dense(8, activation='relu')
+net.add(nn.Dense(8, activation='relu'),
+        shared,
+        nn.Dense(8, activation='relu', params=shared.params),
+        nn.Dense(10))
+net.initialize()
+
+X = np.random.uniform(size=(2, 20))
+net(X)
+
+# Check whether the parameters are the same
+print(net[1].weight.data()[0] == net[2].weight.data()[0])
+net[1].weight.data()[0, 0] = 100
+# Make sure that they are actually the same object rather than just having the
+# same value
+print(net[1].weight.data()[0] == net[2].weight.data()[0])
+```
+
+```{.python .input}
+%%tab pytorch
+# We need to give the shared layer a name so that we can refer to its
+# parameters
+shared = nn.LazyLinear(8)
+net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(),
+                    shared, nn.ReLU(),
+                    shared, nn.ReLU(),
+                    nn.LazyLinear(1))
+net(X)
+# Check whether the parameters are the same
+print(net[2].weight.data[0] == net[4].weight.data[0])
+net[2].weight.data[0, 0] = 100
+# Make sure that they are actually the same object rather than just having the
+# same value
+print(net[2].weight.data[0] == net[4].weight.data[0])
+```
+
+```{.python .input}
+%%tab tensorflow
+# tf.keras behaves a bit differently. It removes the duplicate layer
+# automatically
+shared = tf.keras.layers.Dense(4, activation=tf.nn.relu)
+net = tf.keras.models.Sequential([
+    tf.keras.layers.Flatten(),
+    shared,
+    shared,
+    tf.keras.layers.Dense(1),
+])
+net(X)
+# The shared layer is counted once, so the model reports 3 layers, not 4
+print(len(net.layers) == 3)
+```
+
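+This example shows that the parameters of the second and third layers are tied. They are not just equal: they are represented by the exact same tensor. Thus, if we change one of the parameters, the other one changes, too. You might wonder what happens to the gradients when parameters are tied. Because the tied layers share one parameter object, the gradients computed for the second and third hidden layers are added together during backpropagation.
+
+## Summary
+
+We have several ways to access and tie model parameters.
+
+## Exercises
+
+1. Use the `NestMLP` model defined in :numref:`sec_model_construction` and access the parameters of the various layers.
+1. Construct an MLP containing a shared parameter layer and train it. During the training process, observe the model parameters and gradients of each layer.
+1. Why is sharing parameters a good idea?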
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/56)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/57)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/269)
+:end_tab:
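
The recursion described under "All Parameters at Once" can be sketched without any framework. What follows is only a minimal illustration, not how MXNet, PyTorch, or TensorFlow implement it: `ToyModule`, its `params` and `children` dictionaries, and the free function `named_parameters` are hypothetical names, chosen so that the output resembles dotted parameter names such as `0.weight`.

```python
class ToyModule:
    """Hypothetical stand-in for a framework module (not a real API)."""

    def __init__(self, params=None, children=None):
        self.params = dict(params or {})      # name -> value owned by this module
        self.children = dict(children or {})  # name -> nested ToyModule


def named_parameters(module, prefix=''):
    """Yield (dotted_name, value) pairs by walking the module tree depth-first."""
    for name, value in module.params.items():
        yield prefix + name, value
    for child_name, child in module.children.items():
        yield from named_parameters(child, prefix + child_name + '.')


# A nested model: a two-layer block wrapped in an outer model with a head.
block = ToyModule(children={
    '0': ToyModule(params={'weight': [[0.1, 0.2]], 'bias': [0.0]}),
    '1': ToyModule(params={'weight': [[0.3]], 'bias': [0.5]}),
})
model = ToyModule(children={
    'net': block,
    'linear': ToyModule(params={'weight': [[1.0]], 'bias': [0.0]}),
})

names = [name for name, _ in named_parameters(model)]
print(names)
```

The point of the sketch is that one recursive generator visits every sub-module, so callers never have to index into the tree by hand.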
diff --git a/chapter_builders-guide/parameters_origin.md b/chapter_builders-guide/parameters_origin.md
new file mode 100644
index 0000000..a41fe59
--- /dev/null
+++ b/chapter_builders-guide/parameters_origin.md
@@ -0,0 +1,283 @@
+# Parameter Management
+
+Once we have chosen an architecture
+and set our hyperparameters,
+we proceed to the training loop,
+where our goal is to find parameter values
+that minimize our loss function.
+After training, we will need these parameters
+in order to make future predictions.
+Additionally, we will sometimes wish
+to extract the parameters
+either to reuse them in some other context,
+to save our model to disk so that
+it may be executed in other software,
+or for examination in the hope of
+gaining scientific understanding.
+
+Most of the time, we will be able
+to ignore the nitty-gritty details
+of how parameters are declared
+and manipulated, relying on deep learning frameworks
+to do the heavy lifting.
+However, when we move away from
+stacked architectures with standard layers,
+we will sometimes need to get into the weeds
+of declaring and manipulating parameters.
+In this section, we cover the following:
+
+* Accessing parameters for debugging, diagnostics, and visualizations.
+* Sharing parameters across different model components.
+
+(**We start by focusing on an MLP with one hidden layer.**)
+
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+```{.python .input}
+%%tab mxnet
+from mxnet import init, np, npx
+from mxnet.gluon import nn
+npx.set_np()
+
+net = nn.Sequential()
+net.add(nn.Dense(8, activation='relu'))
+net.add(nn.Dense(1))
+net.initialize()  # Use the default initialization method
+
+X = np.random.uniform(size=(2, 4))
+net(X).shape
+```
+
+```{.python .input}
+%%tab pytorch
+import torch
+from torch import nn
+
+net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
+X = torch.rand(size=(2, 4))
+net(X).shape
+```
+
+```{.python .input}
+%%tab tensorflow
+import tensorflow as tf
+
+net = tf.keras.models.Sequential([
+    tf.keras.layers.Flatten(),
+    tf.keras.layers.Dense(4, activation=tf.nn.relu),
+    tf.keras.layers.Dense(1),
+])
+
+X = tf.random.uniform((2, 4))
+net(X).shape
+```
+
+## [**Parameter Access**]
+
+Let's start with how to access parameters
+from the models that you already know.
+When a model is defined via the `Sequential` class,
+we can first access any layer by indexing
+into the model as though it were a list.
+Each layer's parameters are conveniently
+located in its attribute.
+We can inspect the parameters of the second fully connected layer as follows.
+
+```{.python .input}
+%%tab mxnet
+net[1].params
+```
+
+```{.python .input}
+%%tab pytorch
+net[2].state_dict()
+```
+
+```{.python .input}
+%%tab tensorflow
+net.layers[2].weights
+```
+
+We can see that this fully connected layer
+contains two parameters,
+corresponding to that layer's
+weights and biases, respectively.
+
+
+### [**Targeted Parameters**]
+
+Note that each parameter is represented
+as an instance of the parameter class.
+To do anything useful with the parameters,
+we first need to access the underlying numerical values.
+There are several ways to do this.
+Some are simpler while others are more general.
+The following code extracts the bias
+from the second neural network layer, which returns a parameter class instance, and
+further accesses that parameter's value.
+
+```{.python .input}
+%%tab mxnet
+type(net[1].bias), net[1].bias.data()
+```
+
+```{.python .input}
+%%tab pytorch
+type(net[2].bias), net[2].bias.data
+```
+
+```{.python .input}
+%%tab tensorflow
+type(net.layers[2].weights[1]), tf.convert_to_tensor(net.layers[2].weights[1])
+```
+
+:begin_tab:`mxnet,pytorch`
+Parameters are complex objects,
+containing values, gradients,
+and additional information.
+That's why we need to request the value explicitly.
+
+In addition to the value, each parameter also allows us to access the gradient. Because we have not invoked backpropagation for this network yet, it is in its initial state.
+:end_tab:
+
+```{.python .input}
+%%tab mxnet
+net[1].weight.grad()
+```
+
+```{.python .input}
+%%tab pytorch
+net[2].weight.grad is None
+```
+
+### [**All Parameters at Once**]
+
+When we need to perform operations on all parameters,
+accessing them one-by-one can grow tedious.
+The situation can grow especially unwieldy
+when we work with more complex modules (e.g., nested modules),
+since we would need to recurse
+through the entire tree to extract
+each sub-module's parameters. Below we demonstrate accessing the parameters of all layers.
+
+```{.python .input}
+%%tab mxnet
+net.collect_params()
+```
+
+```{.python .input}
+%%tab pytorch
+[(name, param.shape) for name, param in net.named_parameters()]
+```
+
+```{.python .input}
+%%tab tensorflow
+net.get_weights()
+```
+
+## [**Tied Parameters**]
+
+Often, we want to share parameters across multiple layers.
+Let's see how to do this elegantly.
+In the following we allocate a fully connected layer
+and then use its parameters specifically
+to set those of another layer.
+Here we need to run the forward propagation
+`net(X)` before accessing the parameters.
+
+```{.python .input}
+%%tab mxnet
+net = nn.Sequential()
+# We need to give the shared layer a name so that we can refer to its
+# parameters
+shared = nn.Dense(8, activation='relu')
+net.add(nn.Dense(8, activation='relu'),
+        shared,
+        nn.Dense(8, activation='relu', params=shared.params),
+        nn.Dense(10))
+net.initialize()
+
+X = np.random.uniform(size=(2, 20))
+net(X)
+
+# Check whether the parameters are the same
+print(net[1].weight.data()[0] == net[2].weight.data()[0])
+net[1].weight.data()[0, 0] = 100
+# Make sure that they are actually the same object rather than just having the
+# same value
+print(net[1].weight.data()[0] == net[2].weight.data()[0])
+```
+
+```{.python .input}
+%%tab pytorch
+# We need to give the shared layer a name so that we can refer to its
+# parameters
+shared = nn.LazyLinear(8)
+net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(),
+                    shared, nn.ReLU(),
+                    shared, nn.ReLU(),
+                    nn.LazyLinear(1))
+net(X)
+# Check whether the parameters are the same
+print(net[2].weight.data[0] == net[4].weight.data[0])
+net[2].weight.data[0, 0] = 100
+# Make sure that they are actually the same object rather than just having the
+# same value
+print(net[2].weight.data[0] == net[4].weight.data[0])
+```
+
+```{.python .input}
+%%tab tensorflow
+# tf.keras behaves a bit differently. It removes the duplicate layer
+# automatically
+shared = tf.keras.layers.Dense(4, activation=tf.nn.relu)
+net = tf.keras.models.Sequential([
+    tf.keras.layers.Flatten(),
+    shared,
+    shared,
+    tf.keras.layers.Dense(1),
+])
+net(X)
+# The shared layer is counted once, so the model reports 3 layers, not 4
+print(len(net.layers) == 3)
+```
+
+This example shows that the parameters
+of the second and third layer are tied.
+They are not just equal, they are
+represented by the same exact tensor.
+Thus, if we change one of the parameters,
+the other one changes, too.
+You might wonder,
+when parameters are tied
+what happens to the gradients?
+Since the model parameters contain gradients,
+the gradients of the second hidden layer
+and the third hidden layer are added together
+during backpropagation.
+
+## Summary
+
+We have several ways to access and tie model parameters.
+
+
+## Exercises
+
+1. Use the `NestMLP` model defined in :numref:`sec_model_construction` and access the parameters of the various layers.
+1. Construct an MLP containing a shared parameter layer and train it. During the training process, observe the model parameters and gradients of each layer.
+1. Why is sharing parameters a good idea?
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/56)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/57)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/269)
+:end_tab:
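
The claim that gradients of tied layers are added together can be checked with plain arithmetic. The sketch below is an assumption-laden toy, not code from the chapter: it replaces the second and third layers by scalar multiplications without activations, and the names `forward`, `grads`, and `tied_grad` are made up for the illustration.

```python
def forward(w2, w3, x):
    """Two stacked scalar 'layers' without activation: y = w3 * (w2 * x)."""
    return w3 * (w2 * x)


def grads(w2, w3, x):
    """Analytic partial derivatives of y with respect to each layer's weight."""
    dy_dw2 = w3 * x          # contribution through the second layer
    dy_dw3 = w2 * x          # contribution through the third layer
    return dy_dw2, dy_dw3


w, x = 1.5, 2.0              # one shared weight used by both layers
g2, g3 = grads(w, w, x)
tied_grad = g2 + g3          # contributions are summed for the shared weight

# Central finite-difference check on y(w) = w * (w * x).
eps = 1e-6
numeric = (forward(w + eps, w + eps, x) - forward(w - eps, w - eps, x)) / (2 * eps)
print(tied_grad, numeric)
```

With a shared weight the model computes $y = w^2 x$, whose derivative $2wx$ is exactly the sum of the two per-layer contributions, which is what the finite-difference value approximates.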
diff --git a/chapter_deep-learning-computation/read-write.md b/chapter_builders-guide/read-write.md
similarity index 51%
rename from chapter_deep-learning-computation/read-write.md
rename to chapter_builders-guide/read-write.md
index 99415ed..7a4383e 100644
--- a/chapter_deep-learning-computation/read-write.md
+++ b/chapter_builders-guide/read-write.md
@@ -1,12 +1,18 @@
 # ファイル I/O
 
-ここまでは、データを処理する方法と、ディープラーニングモデルを構築、トレーニング、テストする方法について説明しました。しかし、ある時点で、学習したモデルに十分満足して、後でさまざまなコンテキストで使用できるように結果を保存したいと考えています (おそらく展開の予測を行うためにも)。さらに、長時間のトレーニングプロセスを実行する場合、サーバーの電源コードにつまずいた場合に数日分の計算が失われないように、中間結果 (チェックポイント) を定期的に保存することがベストプラクティスです。そこで、個々の重みベクトルとモデル全体の両方をロードして保存する方法を学習します。このセクションでは、両方の問題について説明します。 
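+So far we have discussed how to process data and how to build, train, and test deep learning models. At some point, however, we will hopefully be happy enough with the learned models that we will want to save the results for later use in various contexts (perhaps even to make predictions in deployment). Additionally, when running a long training process, the best practice is to periodically save intermediate results (checkpointing) so that we do not lose several days' worth of computation if we trip over the power cord of our server. Thus it is time to learn how to load and store both individual weight vectors and entire models. This section addresses both issues.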
 
 ## (**テンソルの読み込みと保存**)
 
-個々のテンソルに対して `load` 関数と `save` 関数を直接呼び出して、それぞれ読み書きすることができます。どちらの関数も名前を指定する必要があり、`save` は入力として変数を保存する必要があります。
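+For individual tensors, we can directly invoke the `load` and `save` functions to read and write them respectively. Both functions require that we supply a name, and `save` requires as input the variable to be saved.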
 
 ```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+```{.python .input}
+%%tab mxnet
 from mxnet import np, npx
 from mxnet.gluon import nn
 npx.set_np()
@@ -16,7 +22,7 @@ npx.save('x-file', x)
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 import torch
 from torch import nn
 from torch.nn import functional as F
@@ -26,7 +32,7 @@ torch.save(x, 'x-file')
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 import tensorflow as tf
 import numpy as np
 
@@ -34,28 +40,30 @@ x = tf.range(4)
 np.save('x-file.npy', x)
 ```
 
-これで、保存されたファイルからデータをメモリに読み戻すことができます。
+We can now read the data from the stored file back into memory.
 
 ```{.python .input}
+%%tab mxnet
 x2 = npx.load('x-file')
 x2
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 x2 = torch.load('x-file')
 x2
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 x2 = np.load('x-file.npy', allow_pickle=True)
 x2
 ```
 
-[**テンソルのリストを保存してメモリに読み戻す**]
+We can [**store a list of tensors and read them back into memory.**]
 
 ```{.python .input}
+%%tab mxnet
 y = np.zeros(4)
 npx.save('x-files', [x, y])
 x2, y2 = npx.load('x-files')
@@ -63,7 +71,7 @@ x2, y2 = npx.load('x-files')
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 y = torch.zeros(4)
 torch.save([x, y],'x-files')
 x2, y2 = torch.load('x-files')
@@ -71,16 +79,17 @@ x2, y2 = torch.load('x-files')
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 y = tf.zeros(4)
 np.save('xy-files.npy', [x, y])
 x2, y2 = np.load('xy-files.npy', allow_pickle=True)
 (x2, y2)
 ```
 
-[**文字列からテンソルにマッピングする辞書を書いたり読んだりすることもできます**] これは、モデル内のすべての重みを読み書きする場合に便利です。
+We can even [**write and read a dictionary that maps from strings to tensors.**] This is convenient when we want to read or write all the weights in a model.
 
 ```{.python .input}
+%%tab mxnet
 mydict = {'x': x, 'y': y}
 npx.save('mydict', mydict)
 mydict2 = npx.load('mydict')
@@ -88,7 +97,7 @@ mydict2
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 mydict = {'x': x, 'y': y}
 torch.save(mydict, 'mydict')
 mydict2 = torch.load('mydict')
@@ -96,18 +105,19 @@ mydict2
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 mydict = {'x': x, 'y': y}
 np.save('mydict.npy', mydict)
 mydict2 = np.load('mydict.npy', allow_pickle=True)
 mydict2
 ```
 
-## [**モデルパラメーターの読み込みと保存**]
+## [**Loading and Saving Model Parameters**]
 
-個々のウェイトベクトル (または他のテンソル) を保存すると便利ですが、モデル全体を保存 (および後でロード) する場合は非常に面倒です。結局のところ、何百ものパラメータグループが散在しているかもしれません。このため、ディープラーニングフレームワークには、ネットワーク全体の読み込みと保存を行うための機能が組み込まれています。注意すべき重要な点は、これによりモデル全体ではなくモデル*パラメータ*が保存されるということです。たとえば、3 層の MLP がある場合、アーキテクチャを個別に指定する必要があります。これは、モデル自体に任意のコードが含まれている可能性があるため、自然にシリアル化できないためです。したがって、モデルを復元するには、アーキテクチャをコードで生成し、ディスクからパラメーターをロードする必要があります。(**おなじみのMLPから始めましょう**)
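+Saving individual weight vectors (or other tensors) is useful, but it gets very tedious if we want to save (and later load) an entire model. After all, we might have hundreds of parameter groups sprinkled throughout. For this reason the deep learning framework provides built-in functionality to load and save entire networks. An important detail to note is that this saves model *parameters* and not the entire model. For example, if we have a 3-layer MLP, we need to specify the architecture separately. The reason is that the models themselves can contain arbitrary code, hence they cannot be serialized as naturally. Thus, in order to reinstate a model, we need to generate the architecture in code and then load the parameters from disk. (**Let's start with our familiar MLP.**)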
 
 ```{.python .input}
+%%tab mxnet
 class MLP(nn.Block):
     def __init__(self, **kwargs):
         super(MLP, self).__init__(**kwargs)
@@ -124,12 +134,12 @@ Y = net(X)
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 class MLP(nn.Module):
     def __init__(self):
         super().__init__()
-        self.hidden = nn.Linear(20, 256)
-        self.output = nn.Linear(256, 10)
+        self.hidden = nn.LazyLinear(256)
+        self.output = nn.LazyLinear(10)
 
     def forward(self, x):
         return self.output(F.relu(self.hidden(x)))
@@ -140,7 +150,7 @@ Y = net(X)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 class MLP(tf.keras.Model):
     def __init__(self):
         super().__init__()
@@ -158,38 +168,40 @@ X = tf.random.uniform((2, 20))
 Y = net(X)
 ```
 
-次に、「mlp.params」という名前で [**モデルのパラメータをファイルとして保存**] します。
+Next, we [**store the parameters of the model as a file**] with the name "mlp.params".
 
 ```{.python .input}
+%%tab mxnet
 net.save_parameters('mlp.params')
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 torch.save(net.state_dict(), 'mlp.params')
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 net.save_weights('mlp.params')
 ```
 
-モデルを復元するために、元の MLP モデルのクローンをインスタンス化します。モデルパラメーターをランダムに初期化する代わりに、[**ファイルに保存されているパラメーターを直接読み取る**]。
+To recover the model, we instantiate a clone of the original MLP model. Instead of randomly initializing the model parameters, we [**read the parameters stored in the file directly**].
 
 ```{.python .input}
+%%tab mxnet
 clone = MLP()
 clone.load_parameters('mlp.params')
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 clone = MLP()
 clone.load_state_dict(torch.load('mlp.params'))
 clone.eval()
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 clone = MLP()
 clone.load_weights('mlp.params')
 ```
@@ -197,33 +209,22 @@ clone.load_weights('mlp.params')
 両方のインスタンスが同じモデルパラメーターをもつため、同じ入力 `X` の計算結果は同じになるはずです。これを確認しましょう。
 
 ```{.python .input}
+%%tab all
 Y_clone = clone(X)
 Y_clone == Y
 ```
 
-```{.python .input}
-#@tab pytorch
-Y_clone = clone(X)
-Y_clone == Y
-```
-
-```{.python .input}
-#@tab tensorflow
-Y_clone = clone(X)
-Y_clone == Y
-```
-
-## [概要
+## Summary
 
-* `save` 関数と `load` 関数を使用して、テンソルオブジェクトのファイル I/O を実行できます。
-* パラメータディクショナリを使用して、ネットワークのパラメータセット全体を保存およびロードできます。
+* The `save` and `load` functions can be used to perform file I/O for tensor objects.
+* We can save and load the entire set of parameters for a network via a parameter dictionary.
 * アーキテクチャの保存は、パラメータではなくコードで行う必要があります。
 
 ## 演習
 
-1. トレーニング済みのモデルを別のデバイスに展開する必要がない場合でも、モデルパラメーターを格納することの実際的な利点は何ですか。
-1. ネットワークの一部だけを再利用して、異なるアーキテクチャのネットワークに組み込むと仮定します。たとえば、前のネットワークの最初の2つのレイヤーを新しいネットワークでどのように使用しますか？
-1. ネットワークアーキテクチャとパラメータをどのように保存しますか？アーキテクチャにはどのような制限を課しますか？
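+1. Even if there is no need to deploy trained models to a different device, what are the practical benefits of storing model parameters?
+1. Assume that we want to reuse only parts of a network to be incorporated into a network of a different architecture. How would you go about using, say, the first two layers from a previous network in a new network?
+1. How would you go about saving the network architecture and parameters? What restrictions would you impose on the architecture?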
 
 :begin_tab:`mxnet`
 [Discussions](https://discuss.d2l.ai/t/60)
diff --git a/chapter_deep-learning-computation/read-write_origin.md b/chapter_builders-guide/read-write_origin.md
similarity index 92%
rename from chapter_deep-learning-computation/read-write_origin.md
rename to chapter_builders-guide/read-write_origin.md
index 43815cb..637301f 100644
--- a/chapter_deep-learning-computation/read-write_origin.md
+++ b/chapter_builders-guide/read-write_origin.md
@@ -23,6 +23,12 @@ Both functions require that we supply a name,
 and `save` requires as input the variable to be saved.
 
 ```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+```{.python .input}
+%%tab mxnet
 from mxnet import np, npx
 from mxnet.gluon import nn
 npx.set_np()
@@ -32,7 +38,7 @@ npx.save('x-file', x)
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 import torch
 from torch import nn
 from torch.nn import functional as F
@@ -42,7 +48,7 @@ torch.save(x, 'x-file')
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 import tensorflow as tf
 import numpy as np
 
@@ -53,18 +59,19 @@ np.save('x-file.npy', x)
 We can now read the data from the stored file back into memory.
 
 ```{.python .input}
+%%tab mxnet
 x2 = npx.load('x-file')
 x2
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 x2 = torch.load('x-file')
 x2
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 x2 = np.load('x-file.npy', allow_pickle=True)
 x2
 ```
@@ -72,6 +79,7 @@ x2
 We can [**store a list of tensors and read them back into memory.**]
 
 ```{.python .input}
+%%tab mxnet
 y = np.zeros(4)
 npx.save('x-files', [x, y])
 x2, y2 = npx.load('x-files')
@@ -79,7 +87,7 @@ x2, y2 = npx.load('x-files')
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 y = torch.zeros(4)
 torch.save([x, y],'x-files')
 x2, y2 = torch.load('x-files')
@@ -87,7 +95,7 @@ x2, y2 = torch.load('x-files')
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 y = tf.zeros(4)
 np.save('xy-files.npy', [x, y])
 x2, y2 = np.load('xy-files.npy', allow_pickle=True)
@@ -100,6 +108,7 @@ This is convenient when we want
 to read or write all the weights in a model.
 
 ```{.python .input}
+%%tab mxnet
 mydict = {'x': x, 'y': y}
 npx.save('mydict', mydict)
 mydict2 = npx.load('mydict')
@@ -107,7 +116,7 @@ mydict2
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 mydict = {'x': x, 'y': y}
 torch.save(mydict, 'mydict')
 mydict2 = torch.load('mydict')
@@ -115,7 +124,7 @@ mydict2
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 mydict = {'x': x, 'y': y}
 np.save('mydict.npy', mydict)
 mydict2 = np.load('mydict.npy', allow_pickle=True)
@@ -140,9 +149,10 @@ hence they cannot be serialized as naturally.
 Thus, in order to reinstate a model, we need
 to generate the architecture in code
 and then load the parameters from disk.
-(**Let us start with our familiar MLP.**)
+(**Let's start with our familiar MLP.**)
 
 ```{.python .input}
+%%tab mxnet
 class MLP(nn.Block):
     def __init__(self, **kwargs):
         super(MLP, self).__init__(**kwargs)
@@ -159,12 +169,12 @@ Y = net(X)
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 class MLP(nn.Module):
     def __init__(self):
         super().__init__()
-        self.hidden = nn.Linear(20, 256)
-        self.output = nn.Linear(256, 10)
+        self.hidden = nn.LazyLinear(256)
+        self.output = nn.LazyLinear(10)
 
     def forward(self, x):
         return self.output(F.relu(self.hidden(x)))
@@ -175,7 +185,7 @@ Y = net(X)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 class MLP(tf.keras.Model):
     def __init__(self):
         super().__init__()
@@ -196,16 +206,17 @@ Y = net(X)
 Next, we [**store the parameters of the model as a file**] with the name "mlp.params".
 
 ```{.python .input}
+%%tab mxnet
 net.save_parameters('mlp.params')
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 torch.save(net.state_dict(), 'mlp.params')
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 net.save_weights('mlp.params')
 ```
 
@@ -215,40 +226,30 @@ Instead of randomly initializing the model parameters,
 we [**read the parameters stored in the file directly**].
 
 ```{.python .input}
+%%tab mxnet
 clone = MLP()
 clone.load_parameters('mlp.params')
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 clone = MLP()
 clone.load_state_dict(torch.load('mlp.params'))
 clone.eval()
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 clone = MLP()
 clone.load_weights('mlp.params')
 ```
 
 Since both instances have the same model parameters,
 the computational result of the same input `X` should be the same.
-Let us verify this.
-
-```{.python .input}
-Y_clone = clone(X)
-Y_clone == Y
-```
-
-```{.python .input}
-#@tab pytorch
-Y_clone = clone(X)
-Y_clone == Y
-```
+Let's verify this.
 
 ```{.python .input}
-#@tab tensorflow
+%%tab all
 Y_clone = clone(X)
 Y_clone == Y
 ```
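
The save-parameters-then-rebuild workflow above can also be illustrated with the Python standard library alone. This is a hedged sketch rather than what any of the three frameworks do internally: `TinyMLP` and its `params` dictionary are hypothetical, and `pickle` stands in for the framework-specific serializers.

```python
import os
import pickle
import tempfile


class TinyMLP:
    """Hypothetical toy model: the architecture lives in code, not in the file."""

    def __init__(self):
        self.params = {'hidden': [0.0, 0.0], 'output': [0.0]}

    def __call__(self, x):
        h = [max(0.0, w * x) for w in self.params['hidden']]   # ReLU units
        return self.params['output'][0] * sum(h)


net = TinyMLP()
net.params = {'hidden': [0.5, -1.0], 'output': [2.0]}  # pretend these were trained
y = net(3.0)

path = os.path.join(tempfile.mkdtemp(), 'mlp.params')
with open(path, 'wb') as f:
    pickle.dump(net.params, f)        # only the parameter dictionary is stored

clone = TinyMLP()                     # rebuild the architecture from code
with open(path, 'rb') as f:
    clone.params = pickle.load(f)     # then restore the parameters from disk

print(clone(3.0) == y)
```

Only the dictionary reaches the file, so a reader needs the `TinyMLP` class definition to make sense of it. Note also that `pickle.load` can execute arbitrary code, so it should only be used on files from a trusted source.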
diff --git a/chapter_builders-guide/use-gpu.md b/chapter_builders-guide/use-gpu.md
new file mode 100644
index 0000000..8ce4258
--- /dev/null
+++ b/chapter_builders-guide/use-gpu.md
@@ -0,0 +1,384 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# GPU
+:label:`sec_use_gpu`
+
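+In :numref:`tab_intro_decade`, we discussed the rapid growth of computation over the past two decades. In a nutshell, GPU performance has increased by a factor of 1000 every decade since 2000. This offers great opportunities, but it also suggests a significant need to provide such performance.
+
+In this section, we begin to discuss how to harness this computational performance for your research. We start with a single GPU and later discuss how to use multiple GPUs and multiple servers (with multiple GPUs).
+
+Specifically, we will discuss how to use a single NVIDIA GPU for calculations. First, make sure you have at least one NVIDIA GPU installed. Then, download the [NVIDIA driver and CUDA](https://developer.nvidia.com/cuda-downloads) and follow the prompts to set the appropriate path. Once these preparations are complete, the `nvidia-smi` command can be used to (**view the graphics card information**).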
+
+```{.python .input}
+%%tab all
+!nvidia-smi
+```
+
+:begin_tab:`mxnet`
+お気づきかもしれませんが、MXNet テンソルは NumPy `ndarray` とほとんど同じに見えます。しかし、いくつかの重要な違いがあります。MXNet と NumPy を区別する重要な機能の 1 つは、多様なハードウェアデバイスのサポートです。 
+
+MXNet では、すべての配列にコンテキストがあります。これまでは、デフォルトで、すべての変数と関連する計算が CPU に割り当てられています。通常、他のコンテキストはさまざまな GPU です。複数のサーバーにジョブを展開すると、事態はさらに困難になる可能性があります。アレイをコンテキストにインテリジェントに割り当てることで、デバイス間のデータ転送にかかる時間を最小限に抑えることができます。たとえば、GPU を備えたサーバーでニューラルネットワークをトレーニングする場合、通常、モデルのパラメーターは GPU 上に存在することを好みます。 
+
+次に、MXNet の GPU バージョンがインストールされていることを確認する必要があります。CPU バージョンの MXNet が既にインストールされている場合は、まずそれをアンインストールする必要があります。たとえば、`pip uninstall mxnet` コマンドを使用して、使用している CUDA のバージョンに応じて、対応する MXNet バージョンをインストールします。CUDA 10.0 がインストールされていると仮定すると、`pip install mxnet-cu100` を介して CUDA 10.0 をサポートする MXNet バージョンをインストールできます。
+:end_tab:
+
+:begin_tab:`pytorch`
+PyTorchでは、すべての配列にデバイスがあり、私たちはしばしばそれをコンテキストと呼びます。これまでは、デフォルトで、すべての変数と関連する計算が CPU に割り当てられています。通常、他のコンテキストはさまざまな GPU です。複数のサーバーにジョブを展開すると、事態はさらに困難になる可能性があります。アレイをコンテキストにインテリジェントに割り当てることで、デバイス間のデータ転送にかかる時間を最小限に抑えることができます。たとえば、GPU を備えたサーバーでニューラルネットワークをトレーニングする場合、通常、モデルのパラメーターは GPU 上に存在することを好みます。
+:end_tab:
+
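+To run the programs in this section, you need at least two GPUs. This might be extravagant for most desktop computers, but it is easily available in the cloud, e.g., by using AWS EC2 multi-GPU instances. Almost all other sections do *not* require multiple GPUs. Here, multiple GPUs serve only to illustrate how data flows between different devices. 
+
+## [**Computing Devices**]
+
+We can specify devices, such as CPUs and GPUs, for storage and calculation. By default, tensors are created in the main memory and the CPU is used to compute with them.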
+
+:begin_tab:`mxnet`
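+In MXNet, the CPU and GPU can be indicated by `cpu()` and `gpu()`. It should be noted that `cpu()` (or any integer in the parentheses) means all physical CPUs and memory. This means that MXNet's calculations will try to use all CPU cores. However, `gpu()` only represents one card and the corresponding memory. If there are multiple GPUs, we use `gpu(i)` to represent the $i^\mathrm{th}$ GPU ($i$ starts from 0). Also, `gpu(0)` and `gpu()` are equivalent.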
+:end_tab:
+
+:begin_tab:`pytorch`
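+In PyTorch, the CPU and GPU can be indicated by `torch.device('cpu')` and `torch.device('cuda')`. It should be noted that the `cpu` device means all physical CPUs and memory. This means that PyTorch's calculations will try to use all CPU cores. However, a `cuda` device only represents one card and the corresponding memory. If there are multiple GPUs, we use `torch.device(f'cuda:{i}')` to represent the $i^\mathrm{th}$ GPU ($i$ starts from 0). Also, `cuda:0` and `cuda` are equivalent.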
+:end_tab:
+
+```{.python .input}
+%%tab mxnet
+from d2l import mxnet as d2l
+from mxnet import autograd, np, npx
+from mxnet.gluon import nn
+npx.set_np()
+```
+
+```{.python .input}
+%%tab pytorch
+from d2l import torch as d2l
+import torch
+from torch import nn
+```
+
+```{.python .input}
+%%tab tensorflow
+from d2l import tensorflow as d2l
+import tensorflow as tf
+```
+
+```{.python .input}
+%%tab all
+def cpu():  #@save
+    if tab.selected('mxnet'):
+        return npx.cpu()
+    if tab.selected('pytorch'):
+        return torch.device('cpu')
+    if tab.selected('tensorflow'):
+        return tf.device('/CPU:0')
+
+def gpu(i=0):  #@save
+    if tab.selected('mxnet'):
+        return npx.gpu(i)
+    if tab.selected('pytorch'):
+        return torch.device(f'cuda:{i}')
+    if tab.selected('tensorflow'):
+        return tf.device(f'/GPU:{i}')
+
+cpu(), gpu(), gpu(1)
+```
+
+We can (**query the number of available GPUs.**)
+
+```{.python .input}
+%%tab all
+def num_gpus():  #@save
+    if tab.selected('mxnet'):
+        return npx.num_gpus()
+    if tab.selected('pytorch'):
+        return torch.cuda.device_count()
+    if tab.selected('tensorflow'):
+        return len(tf.config.experimental.list_physical_devices('GPU'))
+
+num_gpus()
+```
+
+Now we [**define two convenient functions that allow us to run code even if the requested GPUs do not exist.**]
+
+```{.python .input}
+%%tab all
+def try_gpu(i=0):  #@save
+    """Return gpu(i) if exists, otherwise return cpu()."""
+    if num_gpus() >= i + 1:
+        return gpu(i)
+    return cpu()
+
+def try_all_gpus():  #@save
+    """Return all available GPUs, or [cpu(),] if no GPU exists."""
+    return [gpu(i) for i in range(num_gpus())] or [cpu()]
+
+try_gpu(), try_gpu(10), try_all_gpus()
+```
+
+## Tensors and GPUs
+
+By default, tensors are created on the CPU. We can [**query the device where the tensor is located.**]
+
+```{.python .input}
+%%tab mxnet
+x = np.array([1, 2, 3])
+x.ctx
+```
+
+```{.python .input}
+%%tab pytorch
+x = torch.tensor([1, 2, 3])
+x.device
+```
+
+```{.python .input}
+%%tab tensorflow
+x = tf.constant([1, 2, 3])
+x.device
+```
+
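+It is important to note that whenever we want to operate on multiple terms, they need to be on the same device. For instance, if we sum two tensors, we need to make sure that both arguments live on the same device. Otherwise the framework would not know where to store the result or even how to decide where to perform the computation. 
+
+### Storage on the GPU
+
+There are several ways to [**store a tensor on the GPU.**] For example, we can specify a storage device when creating a tensor. Next, we create the tensor variable `X` on the first `gpu`. A tensor created on a GPU only consumes the memory of that GPU. We can use the `nvidia-smi` command to view GPU memory usage. In general, we need to make sure that we do not create data that exceeds the GPU memory limit.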
+
+```{.python .input}
+%%tab mxnet
+X = np.ones((2, 3), ctx=try_gpu())
+X
+```
+
+```{.python .input}
+%%tab pytorch
+X = torch.ones(2, 3, device=try_gpu())
+X
+```
+
+```{.python .input}
+%%tab tensorflow
+with try_gpu():
+    X = tf.ones((2, 3))
+X
+```
+
+Assuming that you have at least two GPUs, the following code will (**create a random tensor on the second GPU.**)
+
+```{.python .input}
+%%tab mxnet
+Y = np.random.uniform(size=(2, 3), ctx=try_gpu(1))
+Y
+```
+
+```{.python .input}
+%%tab pytorch
+Y = torch.rand(2, 3, device=try_gpu(1))
+Y
+```
+
+```{.python .input}
+%%tab tensorflow
+with try_gpu(1):
+    Y = tf.random.uniform((2, 3))
+Y
+```
+
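+### Copying
+
+[**If we want to compute `X + Y`, we need to decide where to perform this operation.**] For instance, as shown in :numref:`fig_copyto`, we can transfer `X` to the second GPU and perform the operation there.
+*Do not* simply add `X` and `Y`,
+since this will result in an exception. The runtime engine would not know what to do: it cannot find the data on the same device, so it fails. Since `Y` lives on the second GPU, we need to move `X` there before we can add the two. 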
+
+![Copy data to perform an operation on the same device.](../img/copyto.svg)
+:label:`fig_copyto`
+
+```{.python .input}
+%%tab mxnet
+Z = X.copyto(try_gpu(1))
+print(X)
+print(Z)
+```
+
+```{.python .input}
+%%tab pytorch
+Z = X.cuda(1)
+print(X)
+print(Z)
+```
+
+```{.python .input}
+%%tab tensorflow
+with try_gpu(1):
+    Z = X
+print(X)
+print(Z)
+```
+
+Now that [**the data is on the same GPU (both `Z` and `Y` are), we can add them up.**]
+
+```{.python .input}
+%%tab all
+Y + Z
+```
+
+:begin_tab:`mxnet`
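+Imagine that your variable `Z` already lives on your second GPU. What happens if we still call `Z.copyto(gpu(1))`? It will make a copy and allocate new memory, even though the variable already lives on the desired device. Depending on the environment our code is running in, two variables may already live on the same device, so we want to make a copy only if the variables currently live on different devices. In these cases, we can call `as_in_ctx`. If the variable already lives on the specified device, this is a no-op. Unless you specifically want to make a copy, `as_in_ctx` is the method of choice.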
+:end_tab:
+
+:begin_tab:`pytorch`
+Imagine that your variable `Z` already lives on your second GPU. What happens if we still call `Z.cuda(1)`? It will return `Z` instead of making a copy and allocating new memory.
+:end_tab:
+
+:begin_tab:`tensorflow`
+Imagine that your variable `Z` already lives on your second GPU. What happens if we still assign `Z2 = Z` under the same device scope? It will return `Z` instead of making a copy and allocating new memory.
+:end_tab:
+
+```{.python .input}
+%%tab mxnet
+Z.as_in_ctx(try_gpu(1)) is Z
+```
+
+```{.python .input}
+%%tab pytorch
+Z.cuda(1) is Z
+```
+
+```{.python .input}
+%%tab tensorflow
+with try_gpu(1):
+    Z2 = Z
+Z2 is Z
+```
+
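+### Side Notes
+
+People use GPUs for machine learning because they expect them to be fast. But transferring variables between devices is slow. So we want you to be 100% certain that you want to do something slow before we let you do it. If the deep learning framework just did the copy automatically without crashing, you might not realize that you had written some slow code. 
+
+Also, transferring data between devices (CPU, GPUs, and other machines) is much slower than computation. It also makes parallelization a lot more difficult, since we have to wait for data to be sent (or rather to be received) before we can proceed with more operations. This is why copy operations should be handled with great care. As a rule of thumb, many small operations are much worse than one big operation. Moreover, unless you know what you are doing, several operations at a time are much better than many single operations interspersed in the code. This is because such operations can block if one device has to wait for the other before it can do something else. It is a bit like ordering your coffee in a queue rather than pre-ordering it by phone and finding out that it is ready when you arrive. 
+
+Last, when we print tensors or convert tensors to the NumPy format, if the data is not in the main memory, the framework will copy it to the main memory first, resulting in additional transmission overhead. Even worse, it is now subject to the dreaded global interpreter lock that makes everything wait for Python to complete. 
+
+## [**Neural Networks and GPUs**]
+
+Similarly, a neural network model can specify devices. The following code puts the model parameters on the GPU.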
+
+```{.python .input}
+%%tab mxnet
+net = nn.Sequential()
+net.add(nn.Dense(1))
+net.initialize(ctx=try_gpu())
+```
+
+```{.python .input}
+%%tab pytorch
+net = nn.Sequential(nn.LazyLinear(1))
+net = net.to(device=try_gpu())
+```
+
+```{.python .input}
+%%tab tensorflow
+strategy = tf.distribute.MirroredStrategy()
+with strategy.scope():
+    net = tf.keras.models.Sequential([
+        tf.keras.layers.Dense(1)])
+```
+
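+We will see many more examples of how to run models on GPUs in the following chapters, simply because the models will become somewhat more computationally intensive. 
+
+When the input is a tensor on the GPU, the model will calculate the result on the same GPU.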
+
+```{.python .input}
+%%tab all
+net(X)
+```
+
+Let's (**confirm that the model parameters are stored on the same GPU.**)
+
+```{.python .input}
+%%tab mxnet
+net[0].weight.data().ctx
+```
+
+```{.python .input}
+%%tab pytorch
+net[0].weight.data.device
+```
+
+```{.python .input}
+%%tab tensorflow
+net.layers[0].weights[0].device, net.layers[0].weights[1].device
+```
+
+Next, we let the trainer support GPUs.
+
+```{.python .input}
+%%tab mxnet
+@d2l.add_to_class(d2l.Module)  #@save
+def set_scratch_params_device(self, device):
+    for attr in dir(self):
+        a = getattr(self, attr)
+        if isinstance(a, np.ndarray):
+            with autograd.record():
+                setattr(self, attr, a.as_in_ctx(device))
+            getattr(self, attr).attach_grad()
+        if isinstance(a, d2l.Module):
+            a.set_scratch_params_device(device)
+        if isinstance(a, list):
+            for elem in a:
+                elem.set_scratch_params_device(device)
+```
+
+```{.python .input}
+%%tab mxnet, pytorch
+@d2l.add_to_class(d2l.Trainer)  #@save
+def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
+    self.save_hyperparameters()
+    self.gpus = [d2l.gpu(i) for i in range(min(num_gpus, d2l.num_gpus()))]
+
+@d2l.add_to_class(d2l.Trainer)  #@save
+def prepare_batch(self, batch):
+    if self.gpus:
+        batch = [d2l.to(a, self.gpus[0]) for a in batch]
+    return batch
+
+@d2l.add_to_class(d2l.Trainer)  #@save
+def prepare_model(self, model):
+    model.trainer = self
+    model.board.xlim = [0, self.max_epochs]
+    if self.gpus:
+        if tab.selected('mxnet'):
+            model.collect_params().reset_ctx(self.gpus[0])
+            model.set_scratch_params_device(self.gpus[0])
+        if tab.selected('pytorch'):
+            model.to(self.gpus[0])
+    self.model = model
+```
+
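+In short, as long as all data and parameters are on the same device, we can learn models efficiently. In the following chapters we will see several such examples. 
+
+## Summary
+
+* We can specify devices for storage and calculation, such as the CPU or GPU. By default, data is created in the main memory and the CPU is used for calculations.
+* The deep learning framework requires all input data for a calculation to be on the same device, be it the CPU or the same GPU.
+* You can lose significant performance by moving data without care. A typical mistake is as follows: computing the loss for every minibatch on the GPU and reporting it back to the user on the command line (or logging it in a NumPy `ndarray`) will trigger a global interpreter lock which stalls all GPUs. It is much better to allocate memory for logging inside the GPU and only move larger logs.
+
+## Exercises
+
+1. Try a larger computation task, such as the multiplication of large matrices, and see the difference in speed between the CPU and GPU. What about a task with a small amount of calculations?
+1. How should we read and write model parameters on the GPU?
+1. Measure the time it takes to compute 1000 matrix-matrix multiplications of $100 \times 100$ matrices and log the Frobenius norm of the output matrix one result at a time. Compare this with keeping a log on the GPU and transferring only the final result.
+1. Measure how much time it takes to perform two matrix-matrix multiplications on two GPUs at the same time, compared with performing them in sequence on one GPU. Hint: you should see almost linear scaling.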
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/62)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/63)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/270)
+:end_tab:
diff --git a/chapter_deep-learning-computation/use-gpu_origin.md b/chapter_builders-guide/use-gpu_origin.md
similarity index 81%
rename from chapter_deep-learning-computation/use-gpu_origin.md
rename to chapter_builders-guide/use-gpu_origin.md
index 71a1f85..8fb5d04 100644
--- a/chapter_deep-learning-computation/use-gpu_origin.md
+++ b/chapter_builders-guide/use-gpu_origin.md
@@ -1,3 +1,8 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
 # GPUs
 :label:`sec_use_gpu`
 
@@ -24,7 +29,7 @@ the `nvidia-smi` command can be used
 to (**view the graphics card information**).
 
 ```{.python .input}
-#@tab all
+%%tab all
 !nvidia-smi
 ```
 
@@ -73,17 +78,6 @@ we can minimize the time spent
 transferring data between devices.
 For example, when training neural networks on a server with a GPU,
 we typically prefer for the model's parameters to live on the GPU.
-
-Next, we need to confirm that
-the GPU version of PyTorch is installed.
-If a CPU version of PyTorch is already installed,
-we need to uninstall it first.
-For example, use the `pip uninstall torch` command,
-then install the corresponding PyTorch version
-according to your CUDA version.
-Assuming you have CUDA 10.0 installed,
-you can install the PyTorch version
-that supports CUDA 10.0 via `pip install torch-cu100`.
 :end_tab:
 
 To run the programs in this section,
@@ -130,90 +124,76 @@ Also, `gpu:0` and `gpu` are equivalent.
 :end_tab:
 
 ```{.python .input}
+%%tab mxnet
+from d2l import mxnet as d2l
 from mxnet import np, npx
 from mxnet.gluon import nn
 npx.set_np()
-
-npx.cpu(), npx.gpu(), npx.gpu(1)
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
+from d2l import torch as d2l
 import torch
 from torch import nn
-
-torch.device('cpu'), torch.device('cuda'), torch.device('cuda:1')
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
+from d2l import tensorflow as d2l
 import tensorflow as tf
-
-tf.device('/CPU:0'), tf.device('/GPU:0'), tf.device('/GPU:1')
 ```
 
-We can (**query the number of available GPUs.**)
-
 ```{.python .input}
-npx.num_gpus()
-```
+%%tab all
+def cpu():  #@save
+    if tab.selected('mxnet'):
+        return npx.cpu()
+    if tab.selected('pytorch'):
+        return torch.device('cpu')
+    if tab.selected('tensorflow'):
+        return tf.device('/CPU:0')
 
-```{.python .input}
-#@tab pytorch
-torch.cuda.device_count()
-```
+def gpu(i=0):  #@save
+    if tab.selected('mxnet'):
+        return npx.gpu(i)
+    if tab.selected('pytorch'):
+        return torch.device(f'cuda:{i}')
+    if tab.selected('tensorflow'):
+        return tf.device(f'/GPU:{i}')
 
-```{.python .input}
-#@tab tensorflow
-len(tf.config.experimental.list_physical_devices('GPU'))
+cpu(), gpu(), gpu(1)
 ```
 
-Now we [**define two convenient functions that allow us
-to run code even if the requested GPUs do not exist.**]
+We can (**query the number of available GPUs.**)
 
 ```{.python .input}
-def try_gpu(i=0):  #@save
-    """Return gpu(i) if exists, otherwise return cpu()."""
-    return npx.gpu(i) if npx.num_gpus() >= i + 1 else npx.cpu()
+%%tab all
+def num_gpus():  #@save
+    if tab.selected('mxnet'):
+        return npx.num_gpus()
+    if tab.selected('pytorch'):
+        return torch.cuda.device_count()
+    if tab.selected('tensorflow'):
+        return len(tf.config.experimental.list_physical_devices('GPU'))
 
-def try_all_gpus():  #@save
-    """Return all available GPUs, or [cpu()] if no GPU exists."""
-    devices = [npx.gpu(i) for i in range(npx.num_gpus())]
-    return devices if devices else [npx.cpu()]
-
-try_gpu(), try_gpu(10), try_all_gpus()
+num_gpus()
 ```
 
-```{.python .input}
-#@tab pytorch
-def try_gpu(i=0):  #@save
-    """Return gpu(i) if exists, otherwise return cpu()."""
-    if torch.cuda.device_count() >= i + 1:
-        return torch.device(f'cuda:{i}')
-    return torch.device('cpu')
-
-def try_all_gpus():  #@save
-    """Return all available GPUs, or [cpu(),] if no GPU exists."""
-    devices = [torch.device(f'cuda:{i}')
-             for i in range(torch.cuda.device_count())]
-    return devices if devices else [torch.device('cpu')]
-
-try_gpu(), try_gpu(10), try_all_gpus()
-```
+Now we [**define two convenient functions that allow us
+to run code even if the requested GPUs do not exist.**]
 
 ```{.python .input}
-#@tab tensorflow
+%%tab all
 def try_gpu(i=0):  #@save
     """Return gpu(i) if exists, otherwise return cpu()."""
-    if len(tf.config.experimental.list_physical_devices('GPU')) >= i + 1:
-        return tf.device(f'/GPU:{i}')
-    return tf.device('/CPU:0')
+    if num_gpus() >= i + 1:
+        return gpu(i)
+    return cpu()
 
 def try_all_gpus():  #@save
     """Return all available GPUs, or [cpu(),] if no GPU exists."""
-    num_gpus = len(tf.config.experimental.list_physical_devices('GPU'))
-    devices = [tf.device(f'/GPU:{i}') for i in range(num_gpus)]
-    return devices if devices else [tf.device('/CPU:0')]
+    return [gpu(i) for i in range(num_gpus())]
 
 try_gpu(), try_gpu(10), try_all_gpus()
 ```
@@ -224,18 +204,19 @@ By default, tensors are created on the CPU.
 We can [**query the device where the tensor is located.**]
 
 ```{.python .input}
+%%tab mxnet
 x = np.array([1, 2, 3])
 x.ctx
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 x = torch.tensor([1, 2, 3])
 x.device
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 x = tf.constant([1, 2, 3])
 x.device
 ```
@@ -256,21 +237,22 @@ For example, we can specify a storage device when creating a tensor.
 Next, we create the tensor variable `X` on the first `gpu`.
 The tensor created on a GPU only consumes the memory of this GPU.
 We can use the `nvidia-smi` command to view GPU memory usage.
-In general, we need to make sure that we do not create data that exceed the GPU memory limit.
+In general, we need to make sure that we do not create data that exceeds the GPU memory limit.
 
 ```{.python .input}
+%%tab mxnet
 X = np.ones((2, 3), ctx=try_gpu())
 X
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 X = torch.ones(2, 3, device=try_gpu())
 X
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 with try_gpu():
     X = tf.ones((2, 3))
 X
@@ -279,18 +261,19 @@ X
 Assuming that you have at least two GPUs, the following code will (**create a random tensor on the second GPU.**)
 
 ```{.python .input}
+%%tab mxnet
 Y = np.random.uniform(size=(2, 3), ctx=try_gpu(1))
 Y
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 Y = torch.rand(2, 3, device=try_gpu(1))
 Y
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 with try_gpu(1):
     Y = tf.random.uniform((2, 3))
 Y
@@ -313,35 +296,34 @@ we need to move `X` there before we can add the two.
 ![Copy data to perform an operation on the same device.](../img/copyto.svg)
 :label:`fig_copyto`
 
-
-
 ```{.python .input}
+%%tab mxnet
 Z = X.copyto(try_gpu(1))
 print(X)
 print(Z)
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 Z = X.cuda(1)
 print(X)
 print(Z)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 with try_gpu(1):
     Z = X
 print(X)
 print(Z)
 ```
 
-Now that [**the data are on the same GPU
+Now that [**the data is on the same GPU
 (both `Z` and `Y` are),
 we can add them up.**]
 
 ```{.python .input}
-#@tab all
+%%tab all
 Y + Z
 ```
 
@@ -374,16 +356,17 @@ It will return `Z` instead of making a copy and allocating new memory.
 :end_tab:
 
 ```{.python .input}
+%%tab mxnet
 Z.as_in_ctx(try_gpu(1)) is Z
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 Z.cuda(1) is Z
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 with try_gpu(1):
     Z2 = Z
 Z2 is Z
@@ -431,19 +414,20 @@ Similarly, a neural network model can specify devices.
 The following code puts the model parameters on the GPU.
 
 ```{.python .input}
+%%tab mxnet
 net = nn.Sequential()
 net.add(nn.Dense(1))
 net.initialize(ctx=try_gpu())
 ```
 
 ```{.python .input}
-#@tab pytorch
-net = nn.Sequential(nn.Linear(3, 1))
+%%tab pytorch
+net = nn.Sequential(nn.LazyLinear(1))
 net = net.to(device=try_gpu())
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 strategy = tf.distribute.MirroredStrategy()
 with strategy.scope():
     net = tf.keras.models.Sequential([
@@ -457,33 +441,79 @@ simply since they will become somewhat more computationally intensive.
 When the input is a tensor on the GPU, the model will calculate the result on the same GPU.
 
 ```{.python .input}
-#@tab all
+%%tab all
 net(X)
 ```
 
-Let us (**confirm that the model parameters are stored on the same GPU.**)
+Let's (**confirm that the model parameters are stored on the same GPU.**)
 
 ```{.python .input}
+%%tab mxnet
 net[0].weight.data().ctx
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 net[0].weight.data.device
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 net.layers[0].weights[0].device, net.layers[0].weights[1].device
 ```
 
+Let the trainer support GPU.
+
+```{.python .input}
+%%tab mxnet
+@d2l.add_to_class(d2l.Module)  #@save
+def set_scratch_params_device(self, device):
+    for attr in dir(self):
+        a = getattr(self, attr)
+        if isinstance(a, np.ndarray):
+            with autograd.record():
+                setattr(self, attr, a.as_in_ctx(device))
+            getattr(self, attr).attach_grad()
+        if isinstance(a, d2l.Module):
+            a.set_scratch_params_device(device)
+        if isinstance(a, list):
+            for elem in a:
+                elem.set_scratch_params_device(device)
+```
+
+```{.python .input}
+%%tab mxnet, pytorch
+@d2l.add_to_class(d2l.Trainer)  #@save
+def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
+    self.save_hyperparameters()
+    self.gpus = [d2l.gpu(i) for i in range(min(num_gpus, d2l.num_gpus()))]
+
+@d2l.add_to_class(d2l.Trainer)  #@save
+def prepare_batch(self, batch):
+    if self.gpus:
+        batch = [d2l.to(a, self.gpus[0]) for a in batch]
+    return batch
+
+@d2l.add_to_class(d2l.Trainer)  #@save
+def prepare_model(self, model):
+    model.trainer = self
+    model.board.xlim = [0, self.max_epochs]
+    if self.gpus:
+        if tab.selected('mxnet'):
+            model.collect_params().reset_ctx(self.gpus[0])
+            model.set_scratch_params_device(self.gpus[0])
+        if tab.selected('pytorch'):
+            model.to(self.gpus[0])
+    self.model = model
+```
+
 In short, as long as all data and parameters are on the same device, we can learn models efficiently. In the following chapters we will see several such examples.
 
 ## Summary
 
 * We can specify devices for storage and calculation, such as the CPU or GPU.
-  By default, data are created in the main memory
-  and then use the CPU for calculations.
+  By default, data is created in the main memory
+  and then uses the CPU for calculations.
 * The deep learning framework requires all input data for calculation
   to be on the same device,
   be it CPU or the same GPU.
@@ -520,3 +550,5 @@ In short, as long as all data and parameters are on the same device, we can lear
 :begin_tab:`tensorflow`
 [Discussions](https://discuss.d2l.ai/t/270)
 :end_tab:
+
+
diff --git a/chapter_deep-learning-computation/custom-layer.md b/chapter_deep-learning-computation/custom-layer.md
deleted file mode 100644
index 531326f..0000000
--- a/chapter_deep-learning-computation/custom-layer.md
+++ /dev/null
@@ -1,241 +0,0 @@
-# カスタムレイヤ
-
-ディープラーニングの成功の要因の 1 つは、さまざまなタスクに適したアーキテクチャを設計するために、クリエイティブな方法で構成できる幅広いレイヤーを利用できることです。たとえば、研究者は、画像、テキストの処理、シーケンシャルデータのループ、動的プログラミングの実行に特化したレイヤーを考案しました。遅かれ早かれ、ディープラーニングフレームワークにまだ存在しない層に出会ったり、考案したりするでしょう。このような場合は、カスタム Layer を構築する必要があります。このセクションでは、その方法を説明します。 
-
-## (**パラメータのないレイヤ**)
-
-まず、独自のパラメーターを持たないカスタム Layer を作成します。:numref:`sec_model_construction` の block の導入を思い出せば、これはおなじみのように思えるでしょう。次の `CenteredLayer` クラスは、単純に入力から平均を減算します。それを構築するには、基本レイヤークラスから継承し、順伝播関数を実装するだけです。
-
-```{.python .input}
-from mxnet import np, npx
-from mxnet.gluon import nn
-npx.set_np()
-
-class CenteredLayer(nn.Block):
-    def __init__(self, **kwargs):
-        super().__init__(**kwargs)
-
-    def forward(self, X):
-        return X - X.mean()
-```
-
-```{.python .input}
-#@tab pytorch
-import torch
-from torch import nn
-from torch.nn import functional as F
-
-class CenteredLayer(nn.Module):
-    def __init__(self):
-        super().__init__()
-
-    def forward(self, X):
-        return X - X.mean()
-```
-
-```{.python .input}
-#@tab tensorflow
-import tensorflow as tf
-
-class CenteredLayer(tf.keras.Model):
-    def __init__(self):
-        super().__init__()
-
-    def call(self, inputs):
-        return inputs - tf.reduce_mean(inputs)
-```
-
-レイヤーにデータを入力して、レイヤーが意図したとおりに機能することを確認しましょう。
-
-```{.python .input}
-layer = CenteredLayer()
-layer(np.array([1, 2, 3, 4, 5]))
-```
-
-```{.python .input}
-#@tab pytorch
-layer = CenteredLayer()
-layer(torch.FloatTensor([1, 2, 3, 4, 5]))
-```
-
-```{.python .input}
-#@tab tensorflow
-layer = CenteredLayer()
-layer(tf.constant([1, 2, 3, 4, 5]))
-```
-
-これで、[**レイヤーをコンポーネントとして組み込んで、より複雑なモデルを構築できるようになりました**]
-
-```{.python .input}
-net = nn.Sequential()
-net.add(nn.Dense(128), CenteredLayer())
-net.initialize()
-```
-
-```{.python .input}
-#@tab pytorch
-net = nn.Sequential(nn.Linear(8, 128), CenteredLayer())
-```
-
-```{.python .input}
-#@tab tensorflow
-net = tf.keras.Sequential([tf.keras.layers.Dense(128), CenteredLayer()])
-```
-
-追加のサニティチェックとして、ランダムデータをネットワーク経由で送信し、平均値が実際に 0 であることを確認できます。ここでは浮動小数点数を扱っているため、量子化によってゼロ以外の非常に小さい数値が表示されることがあります。
-
-```{.python .input}
-Y = net(np.random.uniform(size=(4, 8)))
-Y.mean()
-```
-
-```{.python .input}
-#@tab pytorch
-Y = net(torch.rand(4, 8))
-Y.mean()
-```
-
-```{.python .input}
-#@tab tensorflow
-Y = net(tf.random.uniform((4, 8)))
-tf.reduce_mean(Y)
-```
-
-## [**パラメータ付きの画層**]
-
-単純層の定義方法がわかったところで、学習によって調整できるパラメーターを持つ層の定義に移りましょう。組み込み関数を使用して、基本的なハウスキーピング機能を提供するパラメーターを作成できます。特に、モデルパラメーターのアクセス、初期化、共有、保存、読み込みを制御します。これにより、他の利点の中でも、すべてのカスタム Layer に対してカスタムのシリアル化ルーチンを記述する必要がなくなります。 
-
-それでは、完全接続されたレイヤーの独自のバージョンを実装しましょう。このレイヤーには 2 つのパラメーターが必要であることを思い出してください。1 つはウェイトを表し、もう 1 つはバイアス用です。この実装では、デフォルトとして ReLU アクティベーションをベイクインします。この層には、`in_units` と `units` の 2 つの入力引数が必要です。これらの引数は、それぞれ入力と出力の数を表します。
-
-```{.python .input}
-class MyDense(nn.Block):
-    def __init__(self, units, in_units, **kwargs):
-        super().__init__(**kwargs)
-        self.weight = self.params.get('weight', shape=(in_units, units))
-        self.bias = self.params.get('bias', shape=(units,))
-
-    def forward(self, x):
-        linear = np.dot(x, self.weight.data(ctx=x.ctx)) + self.bias.data(
-            ctx=x.ctx)
-        return npx.relu(linear)
-```
-
-```{.python .input}
-#@tab pytorch
-class MyLinear(nn.Module):
-    def __init__(self, in_units, units):
-        super().__init__()
-        self.weight = nn.Parameter(torch.randn(in_units, units))
-        self.bias = nn.Parameter(torch.randn(units,))
-    def forward(self, X):
-        linear = torch.matmul(X, self.weight.data) + self.bias.data
-        return F.relu(linear)
-```
-
-```{.python .input}
-#@tab tensorflow
-class MyDense(tf.keras.Model):
-    def __init__(self, units):
-        super().__init__()
-        self.units = units
-
-    def build(self, X_shape):
-        self.weight = self.add_weight(name='weight',
-            shape=[X_shape[-1], self.units],
-            initializer=tf.random_normal_initializer())
-        self.bias = self.add_weight(
-            name='bias', shape=[self.units],
-            initializer=tf.zeros_initializer())
-
-    def call(self, X):
-        linear = tf.matmul(X, self.weight) + self.bias
-        return tf.nn.relu(linear)
-```
-
-:begin_tab:`mxnet, tensorflow`
-次に、`MyDense` クラスをインスタンス化し、そのモデルパラメーターにアクセスします。
-:end_tab:
-
-:begin_tab:`pytorch`
-次に、`MyLinear` クラスをインスタンス化し、そのモデルパラメーターにアクセスします。
-:end_tab:
-
-```{.python .input}
-dense = MyDense(units=3, in_units=5)
-dense.params
-```
-
-```{.python .input}
-#@tab pytorch
-linear = MyLinear(5, 3)
-linear.weight
-```
-
-```{.python .input}
-#@tab tensorflow
-dense = MyDense(3)
-dense(tf.random.uniform((2, 5)))
-dense.get_weights()
-```
-
-[**カスタム層を使用して順方向伝播計算を直接実行できます**]
-
-```{.python .input}
-dense.initialize()
-dense(np.random.uniform(size=(2, 5)))
-```
-
-```{.python .input}
-#@tab pytorch
-linear(torch.rand(2, 5))
-```
-
-```{.python .input}
-#@tab tensorflow
-dense(tf.random.uniform((2, 5)))
-```
-
-また、(**カスタムレイヤーを使用してモデルを構築**) できれば、組み込みの完全接続レイヤーと同じように使用できます。
-
-```{.python .input}
-net = nn.Sequential()
-net.add(MyDense(8, in_units=64),
-        MyDense(1, in_units=8))
-net.initialize()
-net(np.random.uniform(size=(2, 64)))
-```
-
-```{.python .input}
-#@tab pytorch
-net = nn.Sequential(MyLinear(64, 8), MyLinear(8, 1))
-net(torch.rand(2, 64))
-```
-
-```{.python .input}
-#@tab tensorflow
-net = tf.keras.models.Sequential([MyDense(8), MyDense(1)])
-net(tf.random.uniform((2, 64)))
-```
-
-## [概要
-
-* 基本レイヤクラスを介してカスタムレイヤを設計できます。これにより、ライブラリ内の既存のレイヤーとは異なる動作をする柔軟性のある新しいレイヤーを定義できます。
-* カスタム Layer を定義すると、任意のコンテキストやアーキテクチャでカスタム Layer を呼び出すことができます。
-* レイヤには、組み込み関数を使用して作成できるローカルパラメータを含めることができます。
-
-## 演習
-
-1. 入力を受け取り、テンソルリダクションを計算する、つまり $y_k = \sum_{i, j} W_{ijk} x_i x_j$ を返す層を設計します。
-1. データのフーリエ係数の前半を返す層を設計します。
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/58)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/59)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/279)
-:end_tab:
diff --git a/chapter_deep-learning-computation/deferred-init.md b/chapter_deep-learning-computation/deferred-init.md
deleted file mode 100644
index c36bc62..0000000
--- a/chapter_deep-learning-computation/deferred-init.md
+++ /dev/null
@@ -1,106 +0,0 @@
-# 遅延初期化
-:label:`sec_deferred_init`
-
-これまでのところ、ネットワークの設定がだらしないことで逃げたように見えるかもしれません。具体的には、次のような直感的でないことを行いましたが、動作するようには思えないかもしれません。 
-
-* 入力次元を指定せずにネットワークアーキテクチャを定義しました。
-* 前のレイヤーの出力次元を指定せずにレイヤーを追加しました。
-* モデルに含めるべきパラメータの数を決定するのに十分な情報を提供する前に、これらのパラメータを「初期化」しました。
-
-私たちのコードがまったく実行されていることに驚くかもしれません。結局のところ、ディープラーニングフレームワークがネットワークの入力次元を判断する方法はありません。ここでの秘訣は、フレームワークが初期化を*延期*し、最初にデータをモデルに渡すまで待って、各レイヤーのサイズをその場で推測することです。 
-
-その後、畳み込みニューラルネットワークで作業する場合、入力次元 (画像の解像度) が後続の各層の次元性に影響を与えるため、この手法はさらに便利になります。したがって、コードの記述時に次元が何であるかを知る必要なくパラメータを設定できるため、モデルを指定して後で変更するタスクが大幅に簡略化されます。次に、初期化の仕組みについて詳しく説明します。 
-
-## ネットワークのインスタンス化
-
-はじめに、MLP をインスタンス化してみましょう。
-
-```{.python .input}
-from mxnet import np, npx
-from mxnet.gluon import nn
-npx.set_np()
-
-def get_net():
-    net = nn.Sequential()
-    net.add(nn.Dense(256, activation='relu'))
-    net.add(nn.Dense(10))
-    return net
-
-net = get_net()
-```
-
-```{.python .input}
-#@tab tensorflow
-import tensorflow as tf
-
-net = tf.keras.models.Sequential([
-    tf.keras.layers.Dense(256, activation=tf.nn.relu),
-    tf.keras.layers.Dense(10),
-])
-```
-
-この時点では、入力ディメンションが不明のままであるため、ネットワークは入力レイヤのウェイトのディメンションを認識できない可能性があります。そのため、フレームワークはまだパラメータを初期化していません。以下のパラメータにアクセスして確認します。
-
-```{.python .input}
-print(net.collect_params)
-print(net.collect_params())
-```
-
-```{.python .input}
-#@tab tensorflow
-[net.layers[i].get_weights() for i in range(len(net.layers))]
-```
-
-:begin_tab:`mxnet`
-パラメーターオブジェクトが存在する間は、各レイヤーへの入力次元は -1 としてリストされることに注意してください。MXNet は、パラメーターの次元が不明のままであることを示すために、特別な値 -1 を使用します。この時点で `net[0].weight.data()` にアクセスしようとすると、パラメータにアクセスする前にネットワークを初期化する必要があることを示すランタイムエラーが発生します。ここで、`initialize` 関数でパラメーターを初期化しようとするとどうなるか見てみましょう。
-:end_tab:
-
-:begin_tab:`tensorflow`
-各レイヤオブジェクトは存在しますが、ウェイトは空です。`net.get_weights()` を使用すると、ウェイトがまだ初期化されていないため、エラーがスローされます。
-:end_tab:
-
-```{.python .input}
-net.initialize()
-net.collect_params()
-```
-
-:begin_tab:`mxnet`
-ご覧のとおり、何も変わっていません。入力次元が不明な場合、initialize を呼び出してもパラメーターは正しく初期化されません。代わりに、この呼び出しは MXNet に登録します。MXNet は、パラメーターの初期化を希望する (オプションで、どのディストリビューションに応じて)。
-:end_tab:
-
-次に、ネットワークを介してデータを渡し、フレームワークが最終的にパラメータを初期化するようにします。
-
-```{.python .input}
-X = np.random.uniform(size=(2, 20))
-net(X)
-
-net.collect_params()
-```
-
-```{.python .input}
-#@tab tensorflow
-X = tf.random.uniform((2, 20))
-net(X)
-[w.shape for w in net.get_weights()]
-```
-
-入力の次元 20 がわかるとすぐに、フレームワークは値 20 を差し込むことで第 1 層の重み行列の形状を識別できます。最初のレイヤーの形状を認識すると、フレームワークは2番目のレイヤーに進み、すべての形状がわかるまで計算グラフを通じて続きます。この場合、遅延初期化が必要なのは第1層のみですが、フレームワークは順次初期化されます。すべてのパラメーターの形状がわかれば、フレームワークは最終的にパラメーターを初期化できます。 
-
-## [概要
-
-* 遅延初期化は便利で、フレームワークがパラメーターの形状を自動的に推測できるため、アーキテクチャーの変更が容易になり、一般的なエラーの原因を 1 つ排除できます。
-* モデルを介してデータを渡し、フレームワークが最終的にパラメーターを初期化するようにできます。
-
-## 演習
-
-1. 入力次元を最初のレイヤーに指定し、後続のレイヤーには指定しないとどうなりますか？すぐに初期化できますか？
-1. 不一致の寸法を指定した場合はどうなりますか。
-1. さまざまな次元の入力があるとしたら、何をする必要がありますか？ヒント:パラメータ同点を見てください。
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/280)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/281)
-:end_tab:
diff --git a/chapter_deep-learning-computation/index.md b/chapter_deep-learning-computation/index.md
deleted file mode 100644
index b7ce237..0000000
--- a/chapter_deep-learning-computation/index.md
+++ /dev/null
@@ -1,17 +0,0 @@
-# ディープラーニング計算
-:label:`chap_computation`
-
-ディープラーニングの急速な進歩には、巨大なデータセットや強力なハードウェアと並んで、優れたソフトウェアツールが不可欠な役割を果たしてきました。2007年にリリースされた画期的なTheanoライブラリを皮切りに、柔軟なオープンソースツールにより、研究者はモデルのプロトタイプを迅速に作成できるようになり、標準コンポーネントのリサイクル時に繰り返される作業を回避しながら、低レベルの修正も可能になりました。ディープラーニングのライブラリは時が経つにつれて進化し、抽象化がますます粗くなってきました。半導体設計者がトランジスタの指定から論理回路、コードの記述へと移行したように、ニューラルネットワークの研究者たちは、個々の人工ニューロンの振る舞いを考えることから、層全体でネットワークを考えるようになり、今でははるかに粗いアーキテクチャを設計することが多くなっています*ブロック*を念頭に置いてください。 
-
-ここまで、機械学習の基本的な概念をいくつか紹介し、完全に機能するディープラーニングモデルにまで拡張しました。前章では、MLP の各コンポーネントをゼロから実装し、高レベル API を活用して同じモデルを簡単に展開する方法についても説明しました。そこまで早くあなたを導くために、私たちは図書館を「呼びかけ」ましたが、*それらがどのように機能するかについてのより高度な詳細はスキップしました。この章では、モデルの構築、パラメーターのアクセスと初期化、カスタムレイヤーとブロックの設計、ディスクへのモデルの読み取りと書き込み、GPU の活用による大幅な高速化など、ディープラーニング計算の主要コンポーネントについて深く掘り下げます。これらの洞察により、*エンドユーザー*から*パワーユーザー*へと移行し、成熟したディープラーニングライブラリのメリットを享受するために必要なツールを提供しながら、自分で考案したモデルも含め、より複雑なモデルを柔軟に実装できます。この章では新しいモデルやデータセットについては説明しませんが、後続のアドバンスモデリングの章ではこれらの手法に大きく依存しています。
-
-```toc
-:maxdepth: 2
-
-model-construction
-parameters
-deferred-init
-custom-layer
-read-write
-use-gpu
-```
diff --git a/chapter_deep-learning-computation/model-construction.md b/chapter_deep-learning-computation/model-construction.md
deleted file mode 100644
index 729184e..0000000
--- a/chapter_deep-learning-computation/model-construction.md
+++ /dev/null
@@ -1,468 +0,0 @@
-# Layers and Blocks
-:label:`sec_model_construction`
-
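-When we first introduced neural networks, we focused on linear models with a single output. Here, the entire model consists of just a single neuron. Note that a single neuron (i) takes some set of inputs; (ii) generates a corresponding scalar output; and (iii) has a set of associated parameters that can be updated to optimize some objective function of interest. Then, once we started thinking about networks with multiple outputs, we leveraged vectorized arithmetic to characterize an entire layer of neurons. Just like individual neurons, layers (i) take a set of inputs, (ii) generate corresponding outputs, and (iii) are described by a set of tunable parameters. When we worked through softmax regression, a single layer was itself the model. Even when we subsequently introduced MLPs, we could still think of the model as retaining this same basic structure.
-
-Interestingly, for MLPs, both the entire model and its constituent layers share this structure. The entire model takes in raw inputs (the features), generates outputs (the predictions), and possesses parameters (the combined parameters from all constituent layers). Likewise, each individual layer ingests inputs (supplied by the previous layer), generates outputs (the inputs to the subsequent layer), and possesses a set of tunable parameters that are updated according to the signal that flows backwards from the subsequent layer.
-
-While you might think that neurons, layers, and models give us enough abstractions to go about our business, it is often convenient to speak about components that are larger than an individual layer but smaller than the entire model. For example, the ResNet-152 architecture, which is wildly popular in computer vision, possesses hundreds of layers. These layers consist of repeating patterns of *groups of layers*. Implementing such a network one layer at a time can grow tedious. This concern is not just hypothetical: such design patterns are common in practice. The ResNet architecture mentioned above won the 2015 ImageNet and COCO computer vision competitions for both recognition and detection and remains a go-to architecture for many vision tasks. Similar architectures in which layers are arranged in various repeating patterns are now ubiquitous in other domains, including natural language processing and speech.
-
-To implement these complex networks, we introduce the concept of a neural network *block*. A block could describe a single layer, a component consisting of multiple layers, or the entire model itself. One benefit of working with the block abstraction is that blocks can be combined into larger artifacts, often recursively. This is illustrated in :numref:`fig_blocks`. By defining code to generate blocks of arbitrary complexity on demand, we can write surprisingly compact code and still implement complex neural networks.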
-
-![Multiple layers are combined into blocks, forming repeating patterns of larger models.](../img/blocks.svg)
-:label:`fig_blocks`
-
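-From a programming standpoint, a block is represented by a *class*. Any subclass of it must define a forward propagation function that transforms its input into output and must store any necessary parameters. Note that some blocks do not require any parameters at all. Finally, a block must possess a backpropagation function for the purpose of calculating gradients. Fortunately, thanks to the behind-the-scenes work done by automatic differentiation (introduced in :numref:`sec_autograd`), when defining our own block we only need to worry about parameters and the forward propagation function.
-
-[**To begin, we revisit the code that we used to implement MLPs**] (:numref:`sec_mlp_concise`). The following code generates a network with one fully-connected hidden layer with 256 units and ReLU activation, followed by a fully-connected output layer with 10 units (no activation function).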
-
-```{.python .input}
-from mxnet import np, npx
-from mxnet.gluon import nn
-npx.set_np()
-
-net = nn.Sequential()
-net.add(nn.Dense(256, activation='relu'))
-net.add(nn.Dense(10))
-net.initialize()
-
-X = np.random.uniform(size=(2, 20))
-net(X)
-```
-
-```{.python .input}
-#@tab pytorch
-import torch
-from torch import nn
-from torch.nn import functional as F
-
-net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
-
-X = torch.rand(2, 20)
-net(X)
-```
-
-```{.python .input}
-#@tab tensorflow
-import tensorflow as tf
-
-net = tf.keras.models.Sequential([
-    tf.keras.layers.Dense(256, activation=tf.nn.relu),
-    tf.keras.layers.Dense(10),
-])
-
-X = tf.random.uniform((2, 20))
-net(X)
-```
-
-:begin_tab:`mxnet`
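-In this example, we constructed our model by instantiating an `nn.Sequential` and assigning the returned object to the `net` variable. Next, we repeatedly call its `add` function, appending layers in the order that they should be executed. In short, `nn.Sequential` defines a special kind of `Block`, the class that represents a block in Gluon. It maintains an ordered list of constituent `Block`s. The `add` function simply facilitates the addition of each successive `Block` to the list. Note that each layer is an instance of the `Dense` class, which is itself a subclass of `Block`. The forward propagation (`forward`) function is also remarkably simple: it chains each `Block` in the list together, passing the output of each as the input to the next. Note that until now, we have been invoking our models via the construction `net(X)` to obtain their outputs. This is actually just shorthand for `net.forward(X)`, a slick Python trick achieved via the `Block` class's `__call__` function.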
-:end_tab:
-
-:begin_tab:`pytorch`
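-In this example, we constructed our model by instantiating an `nn.Sequential`, with layers passed as arguments in the order that they should be executed. In short, (**`nn.Sequential` defines a special kind of `Module`**), the class that represents a block in PyTorch. It maintains an ordered list of constituent `Module`s. Note that each of the two fully-connected layers is an instance of the `Linear` class, which is itself a subclass of `Module`. The forward propagation (`forward`) function is also remarkably simple: it chains each block in the list together, passing the output of each as the input to the next. Note that until now, we have been invoking our models via the construction `net(X)` to obtain their outputs. This is actually just shorthand for `net.__call__(X)`.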
-:end_tab:
-
-:begin_tab:`tensorflow`
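-In this example, we constructed our model by instantiating a `keras.models.Sequential`, with layers passed as arguments in the order that they should be executed. In short, `Sequential` defines a special kind of `keras.Model`, the class that represents a block in Keras. It maintains an ordered list of constituent layers. Note that each of the two fully-connected layers is an instance of the `Dense` class, which is a subclass of `tf.keras.layers.Layer` (the class that `keras.Model` itself extends). The forward propagation (`call`) function is also remarkably simple: it chains each block in the list together, passing the output of each as the input to the next. Note that until now, we have been invoking our models via the construction `net(X)` to obtain their outputs. This is actually just shorthand for `net.call(X)`, a slick Python trick achieved via the base class's `__call__` function.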
-:end_tab:
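The relationship between `net(X)`, `__call__`, and the forward propagation function can be illustrated without any framework. The following is a minimal sketch in plain Python; the class names `Block`, `Scale`, and `Chain` are hypothetical stand-ins and not part of MXNet, PyTorch, or Keras. Real frameworks do additional bookkeeping inside `__call__`, which this sketch omits.

```python
class Block:
    """Hypothetical stand-in for a framework's block base class."""

    def __call__(self, *args):
        # Real frameworks also run hooks and bookkeeping here; this sketch
        # only dispatches to the forward propagation function.
        return self.forward(*args)

    def forward(self, *args):
        raise NotImplementedError


class Scale(Block):
    """A toy 'layer' that multiplies every input entry by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def forward(self, x):
        return [self.factor * v for v in x]


class Chain(Block):
    """Feeds the output of each block into the next, in order."""

    def __init__(self, *blocks):
        self.blocks = list(blocks)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


net = Chain(Scale(2), Scale(3))
out = net([1.0, 2.0])  # same as net.__call__([1.0, 2.0])
print(out)
```

Because `Chain` is itself a `Block`, it can be nested inside another `Chain`, which is the recursive composition that the block abstraction is meant to enable.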
-
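-## [**A Custom Block**]
-
-Perhaps the easiest way to develop intuition about how a block works is to implement one ourselves. Before we implement our own custom block, we briefly summarize the basic functionality that each block must provide: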
-
-:begin_tab:`mxnet, tensorflow`
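-1. Ingest input data as arguments to its forward propagation function.
-1. Generate an output by having the forward propagation function return a value. Note that the output may have a different shape from the input. For example, the first fully-connected layer in our model above ingests an input of arbitrary dimension but returns an output of dimension 256.
-1. Calculate the gradient of its output with respect to its input, which can be accessed via its backpropagation function. Typically this happens automatically.
-1. Store and provide access to those parameters necessary to execute the forward propagation computation.
-1. Initialize model parameters as needed.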
-:end_tab:
-
-:begin_tab:`pytorch`
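-1. Ingest input data as arguments to its forward propagation function.
-1. Generate an output by having the forward propagation function return a value. Note that the output may have a different shape from the input. For example, the first fully-connected layer in our model above ingests an input of dimension 20 but returns an output of dimension 256.
-1. Calculate the gradient of its output with respect to its input, which can be accessed via its backpropagation function. Typically this happens automatically.
-1. Store and provide access to those parameters necessary to execute the forward propagation computation.
-1. Initialize model parameters as needed.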
-:end_tab:
-
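-In the following snippet, we code up a block from scratch corresponding to an MLP with one hidden layer with 256 hidden units and a 10-dimensional output layer. Note that the `MLP` class below inherits from the class that represents a block. We rely heavily on the parent class's functions, supplying only our own constructor (the `__init__` function in Python) and the forward propagation function.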
-
-```{.python .input}
-class MLP(nn.Block):
-    # Declare a layer with model parameters. Here, we declare two
-    # fully-connected layers
-    def __init__(self, **kwargs):
-        # Call the constructor of the `MLP` parent class `Block` to perform
-        # the necessary initialization. In this way, other function arguments
-        # can also be specified during class instantiation, such as the model
-        # parameters, `params` (to be described later)
-        super().__init__(**kwargs)
-        self.hidden = nn.Dense(256, activation='relu')  # Hidden layer
-        self.out = nn.Dense(10)  # Output layer
-
-    # Define the forward propagation of the model, that is, how to return the
-    # required model output based on the input `X`
-    def forward(self, X):
-        return self.out(self.hidden(X))
-```
-
-```{.python .input}
-#@tab pytorch
-class MLP(nn.Module):
-    # Declare a layer with model parameters. Here, we declare two fully
-    # connected layers
-    def __init__(self):
-        # Call the constructor of the `MLP` parent class `Module` to perform
-        # the necessary initialization. In this way, other function arguments
-        # can also be specified during class instantiation, such as the model
-        # parameters, `params` (to be described later)
-        super().__init__()
-        self.hidden = nn.Linear(20, 256)  # Hidden layer
-        self.out = nn.Linear(256, 10)  # Output layer
-
-    # Define the forward propagation of the model, that is, how to return the
-    # required model output based on the input `X`
-    def forward(self, X):
-        # Note here we use the functional version of ReLU defined in the
-        # nn.functional module.
-        return self.out(F.relu(self.hidden(X)))
-```
-
-```{.python .input}
-#@tab tensorflow
-class MLP(tf.keras.Model):
-    # Declare a layer with model parameters. Here, we declare two fully
-    # connected layers
-    def __init__(self):
-        # Call the constructor of the `MLP` parent class `Model` to perform
-        # the necessary initialization. In this way, other function arguments
-        # can also be specified during class instantiation, such as the model
-        # parameters, `params` (to be described later)
-        super().__init__()
-        # Hidden layer
-        self.hidden = tf.keras.layers.Dense(units=256, activation=tf.nn.relu)
-        self.out = tf.keras.layers.Dense(units=10)  # Output layer
-
-    # Define the forward propagation of the model, that is, how to return the
-    # required model output based on the input `X`
-    def call(self, X):
-        return self.out(self.hidden(X))
-```
-
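-Let us first focus on the forward propagation function. Note that it takes `X` as the input, calculates the hidden representation with the activation function applied, and outputs its logits. In this `MLP` implementation, both layers are instance variables. To see why this is reasonable, imagine instantiating two MLPs, `net1` and `net2`, and training them on different data. Naturally, we would expect them to represent two different learned models.
-
-We [**instantiate the MLP's layers**] in the constructor (**and subsequently invoke these layers**) on each call to the forward propagation function. Note a few key details. First, our customized `__init__` function invokes the parent class's `__init__` function via `super().__init__()`, sparing us the pain of restating boilerplate code applicable to most blocks. We then instantiate our two fully-connected layers, assigning them to `self.hidden` and `self.out`. Note that unless we implement a new operator, we need not worry about the backpropagation function or parameter initialization. The system will generate these functions automatically. Let us try this out.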
-
-```{.python .input}
-net = MLP()
-net.initialize()
-net(X)
-```
-
-```{.python .input}
-#@tab pytorch
-net = MLP()
-net(X)
-```
-
-```{.python .input}
-#@tab tensorflow
-net = MLP()
-net(X)
-```
-
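-A key virtue of the block abstraction is its versatility. We can subclass a block to create layers (such as the fully-connected layer class), entire models (such as the `MLP` class above), or various components of intermediate complexity. We exploit this versatility throughout the following chapters, such as when addressing convolutional neural networks.
-
-## [**The Sequential Block**]
-
-We can now take a closer look at how the `Sequential` class works. Recall that `Sequential` was designed to daisy-chain other blocks together. To build our own simplified `MySequential`, we just need to define two key functions:
-1. A function to append blocks one by one to a list.
-2. A forward propagation function to pass an input through the chain of blocks, in the same order as they were appended.
-
-The following `MySequential` class delivers the same functionality as the default `Sequential` class.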
-
-```{.python .input}
-class MySequential(nn.Block):
-    def add(self, block):
-        # Here, `block` is an instance of a `Block` subclass, and we assume 
-        # that it has a unique name. We save it in the member variable
-        # `_children` of the `Block` class, and its type is OrderedDict. When
-        # the `MySequential` instance calls the `initialize` function, the
-        # system automatically initializes all members of `_children`
-        self._children[block.name] = block
-
-    def forward(self, X):
-        # OrderedDict guarantees that members will be traversed in the order
-        # they were added
-        for block in self._children.values():
-            X = block(X)
-        return X
-```
-
-```{.python .input}
-#@tab pytorch
-class MySequential(nn.Module):
-    def __init__(self, *args):
-        super().__init__()
-        for idx, module in enumerate(args):
-            # Here, `module` is an instance of a `Module` subclass. We save it
-            # in the member variable `_modules` of the `Module` class, and its
-            # type is OrderedDict
-            self._modules[str(idx)] = module
-
-    def forward(self, X):
-        # OrderedDict guarantees that members will be traversed in the order
-        # they were added
-        for block in self._modules.values():
-            X = block(X)
-        return X
-```
-
-```{.python .input}
-#@tab tensorflow
-class MySequential(tf.keras.Model):
-    def __init__(self, *args):
-        super().__init__()
-        self.modules = []
-        for block in args:
-            # Here, `block` is an instance of a `tf.keras.layers.Layer`
-            # subclass
-            self.modules.append(block)
-
-    def call(self, X):
-        for module in self.modules:
-            X = module(X)
-        return X
-```
-
-:begin_tab:`mxnet`
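-The `add` function adds a single block to the ordered dictionary `_children`. You might wonder why every Gluon `Block` possesses a `_children` attribute and why we used it rather than just defining a Python list ourselves. In short, the chief advantage of `_children` is that during our block's parameter initialization, Gluon knows to look inside the `_children` dictionary to find sub-blocks whose parameters also need to be initialized.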
-:end_tab:
-
-:begin_tab:`pytorch`
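-In the `__init__` method, we add every module to the ordered dictionary `_modules` one by one. You might wonder why every `Module` possesses a `_modules` attribute and why we used it rather than just defining a Python list ourselves. In short, the chief advantage of `_modules` is that during our module's parameter initialization, the system knows to look inside the `_modules` dictionary to find sub-modules whose parameters also need to be initialized.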
-:end_tab:
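To see why a registry of sub-blocks matters, here is a framework-free sketch. All names in it (`Block`, `Dense`, `Registered`, `HiddenInList`, `own_parameters`) are hypothetical and only mimic the idea; they are not the actual Gluon or PyTorch implementation. The base class records every sub-block assigned as an attribute, and parameter collection walks that registry recursively. Sub-blocks tucked away in a plain Python list are never recorded, so their parameters are invisible to the parent.

```python
class Block:
    """Hypothetical minimal block that tracks sub-blocks in a registry."""

    def __init__(self):
        # Bypass our own __setattr__ while creating the registry itself.
        object.__setattr__(self, "_children", {})

    def __setattr__(self, name, value):
        # Any sub-block assigned as an attribute is recorded in the registry.
        if isinstance(value, Block):
            self._children[name] = value
        object.__setattr__(self, name, value)

    def own_parameters(self):
        return []

    def parameters(self):
        # Collect this block's parameters, then recurse into the registry.
        params = list(self.own_parameters())
        for child in self._children.values():
            params.extend(child.parameters())
        return params


class Dense(Block):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.weight = [[0.0] * n_in for _ in range(n_out)]
        self.bias = [0.0] * n_out

    def own_parameters(self):
        return [self.weight, self.bias]


class Registered(Block):
    def __init__(self):
        super().__init__()
        self.first = Dense(4, 3)
        self.second = Dense(3, 2)


class HiddenInList(Block):
    def __init__(self):
        super().__init__()
        # A plain list is not a Block, so its contents are never registered.
        self.layers = [Dense(4, 3), Dense(3, 2)]


n_registered = len(Registered().parameters())
n_hidden = len(HiddenInList().parameters())
print(n_registered, n_hidden)
```

In this sketch, the block that stores its layers in a list reports no parameters at all, so anything that relies on parameter discovery (initialization, optimization, saving) would silently skip those layers.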
-
-When our `MySequential`'s forward propagation function is invoked, each added block is executed in the order in which it was added. We can now reimplement an MLP using our `MySequential` class.
-
-```{.python .input}
-net = MySequential()
-net.add(nn.Dense(256, activation='relu'))
-net.add(nn.Dense(10))
-net.initialize()
-net(X)
-```
-
-```{.python .input}
-#@tab pytorch
-net = MySequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
-net(X)
-```
-
-```{.python .input}
-#@tab tensorflow
-net = MySequential(
-    tf.keras.layers.Dense(units=256, activation=tf.nn.relu),
-    tf.keras.layers.Dense(10))
-net(X)
-```
-
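-Note that this use of `MySequential` is identical to the code we previously wrote for the `Sequential` class (as described in :numref:`sec_mlp_concise`).
-
-## [**Executing Code in the Forward Propagation Function**]
-
-The `Sequential` class makes model construction easy, allowing us to assemble new architectures without having to define our own class. However, not all architectures are simple daisy chains. When greater flexibility is required, we will want to define our own blocks. For example, we might want to execute Python's control flow within the forward propagation function. Moreover, we might want to perform arbitrary mathematical operations, not simply relying on predefined neural network layers.
-
-You might have noticed that until now, all of the operations in our networks have acted upon our network's activations and its parameters. Sometimes, however, we might want to incorporate terms that are neither the result of previous layers nor updatable parameters. We call these *constant parameters*. Say for example that we want a layer that calculates the function $f(\mathbf{x},\mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}$, where $\mathbf{x}$ is the input, $\mathbf{w}$ is our parameter, and $c$ is some specified constant that is not updated during optimization. So we implement a `FixedHiddenMLP` class as follows.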
-
-```{.python .input}
-class FixedHiddenMLP(nn.Block):
-    def __init__(self, **kwargs):
-        super().__init__(**kwargs)
-        # Random weight parameters created with the `get_constant` function
-        # are not updated during training (i.e., constant parameters)
-        self.rand_weight = self.params.get_constant(
-            'rand_weight', np.random.uniform(size=(20, 20)))
-        self.dense = nn.Dense(20, activation='relu')
-
-    def forward(self, X):
-        X = self.dense(X)
-        # Use the created constant parameters, as well as the `relu` and `dot`
-        # functions
-        X = npx.relu(np.dot(X, self.rand_weight.data()) + 1)
-        # Reuse the fully-connected layer. This is equivalent to sharing
-        # parameters with two fully-connected layers
-        X = self.dense(X)
-        # Control flow
-        while np.abs(X).sum() > 1:
-            X /= 2
-        return X.sum()
-```
-
-```{.python .input}
-#@tab pytorch
-class FixedHiddenMLP(nn.Module):
-    def __init__(self):
-        super().__init__()
-        # Random weight parameters that will not compute gradients and
-        # therefore keep constant during training
-        self.rand_weight = torch.rand((20, 20), requires_grad=False)
-        self.linear = nn.Linear(20, 20)
-
-    def forward(self, X):
-        X = self.linear(X)
-        # Use the created constant parameters, as well as the `relu` and `mm`
-        # functions
-        X = F.relu(torch.mm(X, self.rand_weight) + 1)
-        # Reuse the fully-connected layer. This is equivalent to sharing
-        # parameters with two fully-connected layers
-        X = self.linear(X)
-        # Control flow
-        while X.abs().sum() > 1:
-            X /= 2
-        return X.sum()
-```
-
-```{.python .input}
-#@tab tensorflow
-class FixedHiddenMLP(tf.keras.Model):
-    def __init__(self):
-        super().__init__()
-        self.flatten = tf.keras.layers.Flatten()
-        # Random weight parameters created with `tf.constant` are not updated
-        # during training (i.e., constant parameters)
-        self.rand_weight = tf.constant(tf.random.uniform((20, 20)))
-        self.dense = tf.keras.layers.Dense(20, activation=tf.nn.relu)
-
-    def call(self, inputs):
-        X = self.flatten(inputs)
-        # Use the created constant parameters, as well as the `relu` and
-        # `matmul` functions
-        X = tf.nn.relu(tf.matmul(X, self.rand_weight) + 1)
-        # Reuse the fully-connected layer. This is equivalent to sharing
-        # parameters with two fully-connected layers
-        X = self.dense(X)
-        # Control flow
-        while tf.reduce_sum(tf.math.abs(X)) > 1:
-            X /= 2
-        return tf.reduce_sum(X)
-```
-
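-In this `FixedHiddenMLP` model, we implement a hidden layer whose weights (`self.rand_weight`) are initialized randomly at instantiation and are thereafter constant. This weight is not a model parameter and thus it is never updated by backpropagation. The network then passes the output of this "fixed" layer through a fully-connected layer.
-
-Note that before returning the output, our model did something unusual. We ran a while-loop that tests whether the $L_1$ norm of the output is larger than $1$ and divides the output vector by $2$ until the norm is at most $1$. Finally, we returned the sum of the entries in `X`. To our knowledge, no standard neural network performs this operation. Note that this particular operation may not be useful in any real-world task. Our point is only to show how to integrate arbitrary code into the flow of your neural network computations.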
-
-```{.python .input}
-net = FixedHiddenMLP()
-net.initialize()
-net(X)
-```
-
-```{.python .input}
-#@tab pytorch, tensorflow
-net = FixedHiddenMLP()
-net(X)
-```
-
-We can [**mix and match various ways of assembling blocks together.**] In the following example, we nest blocks in some creative ways.
-
-```{.python .input}
-class NestMLP(nn.Block):
-    def __init__(self, **kwargs):
-        super().__init__(**kwargs)
-        self.net = nn.Sequential()
-        self.net.add(nn.Dense(64, activation='relu'),
-                     nn.Dense(32, activation='relu'))
-        self.dense = nn.Dense(16, activation='relu')
-
-    def forward(self, X):
-        return self.dense(self.net(X))
-
-chimera = nn.Sequential()
-chimera.add(NestMLP(), nn.Dense(20), FixedHiddenMLP())
-chimera.initialize()
-chimera(X)
-```
-
-```{.python .input}
-#@tab pytorch
-class NestMLP(nn.Module):
-    def __init__(self):
-        super().__init__()
-        self.net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
-                                 nn.Linear(64, 32), nn.ReLU())
-        self.linear = nn.Linear(32, 16)
-
-    def forward(self, X):
-        return self.linear(self.net(X))
-
-chimera = nn.Sequential(NestMLP(), nn.Linear(16, 20), FixedHiddenMLP())
-chimera(X)
-```
-
-```{.python .input}
-#@tab tensorflow
-class NestMLP(tf.keras.Model):
-    def __init__(self):
-        super().__init__()
-        self.net = tf.keras.Sequential()
-        self.net.add(tf.keras.layers.Dense(64, activation=tf.nn.relu))
-        self.net.add(tf.keras.layers.Dense(32, activation=tf.nn.relu))
-        self.dense = tf.keras.layers.Dense(16, activation=tf.nn.relu)
-
-    def call(self, inputs):
-        return self.dense(self.net(inputs))
-
-chimera = tf.keras.Sequential()
-chimera.add(NestMLP())
-chimera.add(tf.keras.layers.Dense(20))
-chimera.add(FixedHiddenMLP())
-chimera(X)
-```
-
-## Efficiency
-
-:begin_tab:`mxnet`
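-The avid reader might start to worry about the efficiency of some of these operations. After all, we have lots of dictionary lookups, code execution, and lots of other Pythonic things taking place in what is supposed to be a high-performance deep learning library. The problems of Python's [global interpreter lock](https://wiki.python.org/moin/GlobalInterpreterLock) are well known. In the context of deep learning, we may worry that our extremely fast GPU(s) might have to wait until a puny CPU runs Python code before it gets another job to run. The best way to speed up Python is by avoiding it altogether.
-
-One way that Gluon does this is by allowing for
-*hybridization*, which will be described later.
-Here, the Python interpreter executes a block the first time it is invoked. The Gluon runtime records what is happening, and the next time around it short-circuits calls to Python. This can accelerate things considerably in some cases, but care needs to be taken when control flow (as above) leads down different branches on different passes through the net. We recommend that the interested reader check out the hybridization section (:numref:`sec_hybridize`) to learn about compilation after finishing the current chapter.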
-:end_tab:
-
-:begin_tab:`pytorch`
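-The avid reader might start to worry about the efficiency of some of these operations. After all, we have lots of dictionary lookups, code execution, and lots of other Pythonic things taking place in what is supposed to be a high-performance deep learning library. The problems of Python's [global interpreter lock](https://wiki.python.org/moin/GlobalInterpreterLock) are well known. In the context of deep learning, we may worry that our extremely fast GPU(s) might have to wait until a puny CPU runs Python code before it gets another job to run.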
-:end_tab:
-
-:begin_tab:`tensorflow`
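-The avid reader might start to worry about the efficiency of some of these operations. After all, we have lots of dictionary lookups, code execution, and lots of other Pythonic things taking place in what is supposed to be a high-performance deep learning library. The problems of Python's [global interpreter lock](https://wiki.python.org/moin/GlobalInterpreterLock) are well known. In the context of deep learning, we may worry that our extremely fast GPU(s) might have to wait until a puny CPU runs Python code before it gets another job to run. The best way to speed up Python is by avoiding it altogether.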
-:end_tab:
-
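-## Summary
-
-* Layers are blocks.
-* Many layers can comprise a block.
-* Many blocks can comprise a block.
-* A block can contain code.
-* Blocks take care of lots of housekeeping, including parameter initialization and backpropagation.
-* Sequential concatenations of layers and blocks are handled by the `Sequential` block.
-
-## Exercises
-
-1. What kinds of problems will occur if you change `MySequential` to store blocks in a Python list?
-1. Implement a block that takes two blocks as arguments, say `net1` and `net2`, and returns the concatenated output of both networks in the forward propagation. This is also called a parallel block.
-1. Assume that you want to concatenate multiple instances of the same network. Implement a factory function that generates multiple instances of the same block and builds a larger network from them.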
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/54)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/55)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/264)
-:end_tab:
diff --git a/chapter_deep-learning-computation/parameters.md b/chapter_deep-learning-computation/parameters.md
deleted file mode 100644
index 6fe8af6..0000000
--- a/chapter_deep-learning-computation/parameters.md
+++ /dev/null
@@ -1,552 +0,0 @@
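-# Parameter Management
-
-Once we have chosen an architecture and set our hyperparameters, we proceed to the training loop, where our goal is to find parameter values that minimize our loss function. After training, we will need these parameters in order to make future predictions. Additionally, we will sometimes wish to extract the parameters, either to reuse them in some other context, to save our model to disk so that it may be executed in other software, or to examine them in the hope of gaining scientific understanding.
-
-Most of the time, we can ignore the nitty-gritty details of how parameters are declared and manipulated, relying on deep learning frameworks to do the heavy lifting. However, when we move away from stacked architectures with standard layers, we will sometimes need to get into the weeds of declaring and manipulating parameters. In this section, we cover the following:
-
-* Accessing parameters for debugging, diagnostics, and visualizations.
-* Parameter initialization.
-* Sharing parameters across different model components.
-
-(**We start by focusing on an MLP with one hidden layer.**)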
-
-```{.python .input}
-from mxnet import init, np, npx
-from mxnet.gluon import nn
-npx.set_np()
-
-net = nn.Sequential()
-net.add(nn.Dense(8, activation='relu'))
-net.add(nn.Dense(1))
-net.initialize()  # Use the default initialization method
-
-X = np.random.uniform(size=(2, 4))
-net(X)  # Forward computation
-```
-
-```{.python .input}
-#@tab pytorch
-import torch
-from torch import nn
-
-net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
-X = torch.rand(size=(2, 4))
-net(X)
-```
-
-```{.python .input}
-#@tab tensorflow
-import tensorflow as tf
-
-net = tf.keras.models.Sequential([
-    tf.keras.layers.Flatten(),
-    tf.keras.layers.Dense(4, activation=tf.nn.relu),
-    tf.keras.layers.Dense(1),
-])
-
-X = tf.random.uniform((2, 4))
-net(X)
-```
-
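-## [**Parameter Access**]
-
-Let us start with how to access parameters from the models that you already know. When a model is defined via the `Sequential` class, we can first access any layer by indexing into the model as though it were a list. Each layer's parameters are conveniently located in its attribute. We can inspect the parameters of the second fully-connected layer as follows.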
-
-```{.python .input}
-print(net[1].params)
-```
-
-```{.python .input}
-#@tab pytorch
-print(net[2].state_dict())
-```
-
-```{.python .input}
-#@tab tensorflow
-print(net.layers[2].weights)
-```
-
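-The output tells us a few important things. First, this fully-connected layer contains two parameters, corresponding to that layer's weights and biases, respectively. Both are stored as single-precision floats (float32). Note that the names of the parameters allow us to uniquely identify each layer's parameters, even in a network containing hundreds of layers.
-
-### [**Targeted Parameters**]
-
-Note that each parameter is represented as an instance of the parameter class. To do anything useful with the parameters, we first need to access the underlying numerical values. There are several ways to do this. Some are simpler, while others are more general. The following code extracts the bias from the second neural network layer, which returns a parameter class instance, and further accesses that parameter's value.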
-
-```{.python .input}
-print(type(net[1].bias))
-print(net[1].bias)
-print(net[1].bias.data())
-```
-
-```{.python .input}
-#@tab pytorch
-print(type(net[2].bias))
-print(net[2].bias)
-print(net[2].bias.data)
-```
-
-```{.python .input}
-#@tab tensorflow
-print(type(net.layers[2].weights[1]))
-print(net.layers[2].weights[1])
-print(tf.convert_to_tensor(net.layers[2].weights[1]))
-```
-
-:begin_tab:`mxnet,pytorch`
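-Parameters are complex objects, containing values, gradients, and additional information. That is why we need to request the value explicitly.
-
-In addition to the value, each parameter also allows us to access the gradient. Because we have not invoked backpropagation for this network yet, it is in its initial state.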
-:end_tab:
-
-```{.python .input}
-net[1].weight.grad()
-```
-
-```{.python .input}
-#@tab pytorch
-net[2].weight.grad is None
-```
-
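-### [**All Parameters at Once**]
-
-When we need to perform operations on all parameters, accessing them one by one can grow tedious. The situation can grow especially unwieldy when we work with more complex blocks (e.g., nested blocks), since we would need to recurse through the entire tree to extract each sub-block's parameters. Below we demonstrate accessing the parameters of the first fully-connected layer versus accessing all layers.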
-
-```{.python .input}
-print(net[0].collect_params())
-print(net.collect_params())
-```
-
-```{.python .input}
-#@tab pytorch
-print(*[(name, param.shape) for name, param in net[0].named_parameters()])
-print(*[(name, param.shape) for name, param in net.named_parameters()])
-```
-
-```{.python .input}
-#@tab tensorflow
-print(net.layers[1].weights)
-print(net.get_weights())
-```
-
-This provides us with another way of accessing the parameters of the network, as follows.
-
-```{.python .input}
-net.collect_params()['dense1_bias'].data()
-```
-
-```{.python .input}
-#@tab pytorch
-net.state_dict()['2.bias'].data
-```
-
-```{.python .input}
-#@tab tensorflow
-net.get_weights()[1]
-```
-
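-### [**Collecting Parameters from Nested Blocks**]
-
-Let us see how the parameter naming conventions work if we nest multiple blocks inside each other. For that, we first define a function that produces blocks (a block factory, so to speak) and then combine these inside yet larger blocks.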
-
-```{.python .input}
-def block1():
-    net = nn.Sequential()
-    net.add(nn.Dense(32, activation='relu'))
-    net.add(nn.Dense(16, activation='relu'))
-    return net
-
-def block2():
-    net = nn.Sequential()
-    for _ in range(4):
-        # Nested here
-        net.add(block1())
-    return net
-
-rgnet = nn.Sequential()
-rgnet.add(block2())
-rgnet.add(nn.Dense(10))
-rgnet.initialize()
-rgnet(X)
-```
-
-```{.python .input}
-#@tab pytorch
-def block1():
-    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
-                         nn.Linear(8, 4), nn.ReLU())
-
-def block2():
-    net = nn.Sequential()
-    for i in range(4):
-        # Nested here
-        net.add_module(f'block {i}', block1())
-    return net
-
-rgnet = nn.Sequential(block2(), nn.Linear(4, 1))
-rgnet(X)
-```
-
-```{.python .input}
-#@tab tensorflow
-def block1(name):
-    return tf.keras.Sequential([
-        tf.keras.layers.Flatten(),
-        tf.keras.layers.Dense(4, activation=tf.nn.relu)],
-        name=name)
-
-def block2():
-    net = tf.keras.Sequential()
-    for i in range(4):
-        # Nested here
-        net.add(block1(name=f'block-{i}'))
-    return net
-
-rgnet = tf.keras.Sequential()
-rgnet.add(block2())
-rgnet.add(tf.keras.layers.Dense(1))
-rgnet(X)
-```
-
-[**Now that we have designed the network, let us see how it is organized.**]
-
-```{.python .input}
-print(rgnet.collect_params)
-print(rgnet.collect_params())
-```
-
-```{.python .input}
-#@tab pytorch
-print(rgnet)
-```
-
-```{.python .input}
-#@tab tensorflow
-print(rgnet.summary())
-```
-
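-Since the layers are hierarchically nested, we can also access them as though indexing through nested lists. For instance, we can access the first major block, within it the second sub-block, and within that the bias of the first layer, as follows.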
-
-```{.python .input}
-rgnet[0][1][0].bias.data()
-```
-
-```{.python .input}
-#@tab pytorch
-rgnet[0][1][0].bias.data
-```
-
-```{.python .input}
-#@tab tensorflow
-rgnet.layers[0].layers[1].layers[1].weights[1]
-```
-
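-## Parameter Initialization
-
-Now that we know how to access the parameters, let us look at how to initialize them properly. We discussed the need for proper initialization in :numref:`sec_numerical_stability`. The deep learning framework provides default random initializations for its layers. However, we often want to initialize our weights according to various other protocols. The framework provides the most commonly used protocols and also allows us to create a custom initializer.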
-
-:begin_tab:`mxnet`
-By default, MXNet initializes weight parameters by randomly drawing from a uniform distribution $U(-0.07, 0.07)$, clearing bias parameters to zero. MXNet's `init` module provides a variety of preset initialization methods.
-:end_tab:
-
-:begin_tab:`pytorch`
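-By default, PyTorch initializes weight and bias matrices uniformly by drawing from a range that is computed according to the input and output dimension. PyTorch's `nn.init` module provides a variety of preset initialization methods.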
-:end_tab:
-
-:begin_tab:`tensorflow`
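-By default, Keras initializes weight matrices uniformly by drawing from a range that is computed according to the input and output dimension, and the bias parameters are all set to zero. TensorFlow provides a variety of initialization methods both in the root module and in the `keras.initializers` module.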
-:end_tab:
-
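-### [**Built-in Initialization**]
-
-Let us begin by calling on built-in initializers. The code below initializes all weight parameters as Gaussian random variables with standard deviation 0.01, while bias parameters are cleared to zero.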
-
-```{.python .input}
-# Here `force_reinit` ensures that parameters are freshly initialized even if
-# they were already initialized previously
-net.initialize(init=init.Normal(sigma=0.01), force_reinit=True)
-net[0].weight.data()[0]
-```
-
-```{.python .input}
-#@tab pytorch
-def init_normal(m):
-    if type(m) == nn.Linear:
-        nn.init.normal_(m.weight, mean=0, std=0.01)
-        nn.init.zeros_(m.bias)
-net.apply(init_normal)
-net[0].weight.data[0], net[0].bias.data[0]
-```
-
-```{.python .input}
-#@tab tensorflow
-net = tf.keras.models.Sequential([
-    tf.keras.layers.Flatten(),
-    tf.keras.layers.Dense(
-        4, activation=tf.nn.relu,
-        kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.01),
-        bias_initializer=tf.zeros_initializer()),
-    tf.keras.layers.Dense(1)])
-
-net(X)
-net.weights[0], net.weights[1]
-```
-
-We can also initialize all the parameters to a given constant value (say, 1).
-
-```{.python .input}
-net.initialize(init=init.Constant(1), force_reinit=True)
-net[0].weight.data()[0]
-```
-
-```{.python .input}
-#@tab pytorch
-def init_constant(m):
-    if type(m) == nn.Linear:
-        nn.init.constant_(m.weight, 1)
-        nn.init.zeros_(m.bias)
-net.apply(init_constant)
-net[0].weight.data[0], net[0].bias.data[0]
-```
-
-```{.python .input}
-#@tab tensorflow
-net = tf.keras.models.Sequential([
-    tf.keras.layers.Flatten(),
-    tf.keras.layers.Dense(
-        4, activation=tf.nn.relu,
-        kernel_initializer=tf.keras.initializers.Constant(1),
-        bias_initializer=tf.zeros_initializer()),
-    tf.keras.layers.Dense(1),
-])
-
-net(X)
-net.weights[0], net.weights[1]
-```
-
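-[**We can also apply different initializers for certain blocks.**] For example, below we initialize the first layer with the Xavier initializer and initialize the second layer to a constant value of 42.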
-
-```{.python .input}
-net[0].weight.initialize(init=init.Xavier(), force_reinit=True)
-net[1].initialize(init=init.Constant(42), force_reinit=True)
-print(net[0].weight.data()[0])
-print(net[1].weight.data())
-```
-
-```{.python .input}
-#@tab pytorch
-def xavier(m):
-    if type(m) == nn.Linear:
-        nn.init.xavier_uniform_(m.weight)
-def init_42(m):
-    if type(m) == nn.Linear:
-        nn.init.constant_(m.weight, 42)
-
-net[0].apply(xavier)
-net[2].apply(init_42)
-print(net[0].weight.data[0])
-print(net[2].weight.data)
-```
-
-```{.python .input}
-#@tab tensorflow
-net = tf.keras.models.Sequential([
-    tf.keras.layers.Flatten(),
-    tf.keras.layers.Dense(
-        4,
-        activation=tf.nn.relu,
-        kernel_initializer=tf.keras.initializers.GlorotUniform()),
-    tf.keras.layers.Dense(
-        1, kernel_initializer=tf.keras.initializers.Constant(42)),
-])
-
-net(X)
-print(net.layers[1].weights[0])
-print(net.layers[2].weights[0])
-```
-
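-### [**Custom Initialization**]
-
-Sometimes, the initialization methods we need are not provided by the deep learning framework. In the example below, we define an initializer for any weight parameter $w$ using the following strange distribution: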
-
-$$
-\begin{aligned}
-    w \sim \begin{cases}
-        U(5, 10) & \text{ with probability } \frac{1}{4} \\
-            0    & \text{ with probability } \frac{1}{2} \\
-        U(-10, -5) & \text{ with probability } \frac{1}{4}
-    \end{cases}
-\end{aligned}
-$$
-
-:begin_tab:`mxnet`
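-Here we define a subclass of the `Initializer` class. Usually, we only need to implement the `_init_weight` function, which takes a tensor argument (`data`) and assigns to it the desired initialized values.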
-:end_tab:
-
-:begin_tab:`pytorch`
-Again, we implement a `my_init` function to apply to `net`.
-:end_tab:
-
-:begin_tab:`tensorflow`
-Here we define a subclass of `Initializer` and implement the `__call__` function, which returns the desired tensor given the shape and data type.
-:end_tab:
-
-```{.python .input}
-class MyInit(init.Initializer):
-    def _init_weight(self, name, data):
-        print('Init', name, data.shape)
-        data[:] = np.random.uniform(-10, 10, data.shape)
-        data *= np.abs(data) >= 5
-
-net.initialize(MyInit(), force_reinit=True)
-net[0].weight.data()[:2]
-```
-
-```{.python .input}
-#@tab pytorch
-def my_init(m):
-    if type(m) == nn.Linear:
-        print("Init", *[(name, param.shape) 
-                        for name, param in m.named_parameters()][0])
-        nn.init.uniform_(m.weight, -10, 10)
-        m.weight.data *= m.weight.data.abs() >= 5
-
-net.apply(my_init)
-net[0].weight[:2]
-```
-
-```{.python .input}
-#@tab tensorflow
-class MyInit(tf.keras.initializers.Initializer):
-    def __call__(self, shape, dtype=None):
-        data = tf.random.uniform(shape, -10, 10, dtype=dtype)
-        factor = (tf.abs(data) >= 5)
-        factor = tf.cast(factor, tf.float32)
-        return data * factor
-
-net = tf.keras.models.Sequential([
-    tf.keras.layers.Flatten(),
-    tf.keras.layers.Dense(
-        4,
-        activation=tf.nn.relu,
-        kernel_initializer=MyInit()),
-    tf.keras.layers.Dense(1),
-])
-
-net(X)
-print(net.layers[1].weights[0])
-```
-
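The initializers above all rely on the same trick: draw from $U(-10, 10)$ and zero out every value whose magnitude is below 5. The following sketch checks empirically, using only the Python standard library, that this procedure matches the target distribution (zero with probability 1/2, and each of the two outer intervals with probability 1/4). The helper name `sample_weight`, the seed, and the sample count are illustrative assumptions, not part of any framework.

```python
import random


def sample_weight(rng):
    """Draw from U(-10, 10) and zero out values with magnitude below 5."""
    w = rng.uniform(-10, 10)
    return w if abs(w) >= 5 else 0.0


rng = random.Random(0)  # fixed seed so that the run is repeatable
samples = [sample_weight(rng) for _ in range(100_000)]

# Empirical share of samples in each of the three branches of the distribution.
frac_zero = sum(w == 0.0 for w in samples) / len(samples)
frac_pos = sum(w >= 5 for w in samples) / len(samples)
frac_neg = sum(w <= -5 for w in samples) / len(samples)
print(frac_zero, frac_pos, frac_neg)
```

The three fractions should come out close to 0.5, 0.25, and 0.25. The values in $[-5, 5)$ make up half the length of $[-10, 10]$, which is why masking them yields the point mass at zero.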
-Note that we always have the option of setting parameters directly.
-
-```{.python .input}
-net[0].weight.data()[:] += 1
-net[0].weight.data()[0, 0] = 42
-net[0].weight.data()[0]
-```
-
-```{.python .input}
-#@tab pytorch
-net[0].weight.data[:] += 1
-net[0].weight.data[0, 0] = 42
-net[0].weight.data[0]
-```
-
-```{.python .input}
-#@tab tensorflow
-net.layers[1].weights[0][:].assign(net.layers[1].weights[0] + 1)
-net.layers[1].weights[0][0, 0].assign(42)
-net.layers[1].weights[0]
-```
-
-:begin_tab:`mxnet`
-A note for advanced users: if you want to adjust parameters within an `autograd` scope, you need to use `set_data` to avoid confusing the automatic differentiation mechanics.
-:end_tab:
-
-## [**Tied Parameters**]
-
-Often, we want to share parameters across multiple layers. Let us see how to do this elegantly. In the following, we allocate a dense layer and then use its parameters to set those of another layer.
-
-```{.python .input}
-net = nn.Sequential()
-# We need to give the shared layer a name so that we can refer to its
-# parameters
-shared = nn.Dense(8, activation='relu')
-net.add(nn.Dense(8, activation='relu'),
-        shared,
-        nn.Dense(8, activation='relu', params=shared.params),
-        nn.Dense(10))
-net.initialize()
-
-X = np.random.uniform(size=(2, 20))
-net(X)
-
-# Check whether the parameters are the same
-print(net[1].weight.data()[0] == net[2].weight.data()[0])
-net[1].weight.data()[0, 0] = 100
-# Make sure that they are actually the same object rather than just having the
-# same value
-print(net[1].weight.data()[0] == net[2].weight.data()[0])
-```
-
-```{.python .input}
-#@tab pytorch
-# We need to give the shared layer a name so that we can refer to its
-# parameters
-shared = nn.Linear(8, 8)
-net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
-                    shared, nn.ReLU(),
-                    shared, nn.ReLU(),
-                    nn.Linear(8, 1))
-net(X)
-# Check whether the parameters are the same
-print(net[2].weight.data[0] == net[4].weight.data[0])
-net[2].weight.data[0, 0] = 100
-# Make sure that they are actually the same object rather than just having the
-# same value
-print(net[2].weight.data[0] == net[4].weight.data[0])
-```
-
-```{.python .input}
-#@tab tensorflow
-# tf.keras behaves a bit differently. It removes the duplicate layer
-# automatically
-shared = tf.keras.layers.Dense(4, activation=tf.nn.relu)
-net = tf.keras.models.Sequential([
-    tf.keras.layers.Flatten(),
-    shared,
-    shared,
-    tf.keras.layers.Dense(1),
-])
-
-net(X)
-# The shared layer is listed only once, so `net.layers` has three entries
-print(len(net.layers) == 3)
-```
-
-:begin_tab:`mxnet,pytorch`
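-This example shows that the parameters of the second and third layers are tied. They are not just equal; they are represented by the exact same tensor. Thus, if we change one of the parameters, the other one changes, too. You might wonder what happens to the gradients when parameters are tied. Since the model parameters contain gradients, the gradients of the second hidden layer and the third hidden layer are added together during backpropagation.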
-:end_tab:
-
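-## Summary
-
-* We have several ways to access, initialize, and tie model parameters.
-* We can use custom initialization.
-
-## Exercises
-
-1. Use the `FancyMLP` model defined in :numref:`sec_model_construction` and access the parameters of the various layers.
-1. Look at the documentation of the initialization module to explore different initializers.
-1. Construct an MLP containing a shared parameter layer and train it. During the training process, observe the model parameters and gradients of each layer.
-1. Why is sharing parameters a good idea?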
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/56)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/57)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/269)
-:end_tab:
diff --git a/chapter_deep-learning-computation/parameters_origin.md b/chapter_deep-learning-computation/parameters_origin.md
deleted file mode 100644
index dc72f8d..0000000
--- a/chapter_deep-learning-computation/parameters_origin.md
+++ /dev/null
@@ -1,668 +0,0 @@
-# Parameter Management
-
-Once we have chosen an architecture
-and set our hyperparameters,
-we proceed to the training loop,
-where our goal is to find parameter values
-that minimize our loss function.
-After training, we will need these parameters
-in order to make future predictions.
-Additionally, we will sometimes wish
-to extract the parameters
-either to reuse them in some other context,
-to save our model to disk so that
-it may be executed in other software,
-or for examination in the hope of
-gaining scientific understanding.
-
-Most of the time, we will be able
-to ignore the nitty-gritty details
-of how parameters are declared
-and manipulated, relying on deep learning frameworks
-to do the heavy lifting.
-However, when we move away from
-stacked architectures with standard layers,
-we will sometimes need to get into the weeds
-of declaring and manipulating parameters.
-In this section, we cover the following:
-
-* Accessing parameters for debugging, diagnostics, and visualizations.
-* Parameter initialization.
-* Sharing parameters across different model components.
-
-(**We start by focusing on an MLP with one hidden layer.**)
-
-```{.python .input}
-from mxnet import init, np, npx
-from mxnet.gluon import nn
-npx.set_np()
-
-net = nn.Sequential()
-net.add(nn.Dense(8, activation='relu'))
-net.add(nn.Dense(1))
-net.initialize()  # Use the default initialization method
-
-X = np.random.uniform(size=(2, 4))
-net(X)  # Forward computation
-```
-
-```{.python .input}
-#@tab pytorch
-import torch
-from torch import nn
-
-net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
-X = torch.rand(size=(2, 4))
-net(X)
-```
-
-```{.python .input}
-#@tab tensorflow
-import tensorflow as tf
-
-net = tf.keras.models.Sequential([
-    tf.keras.layers.Flatten(),
-    tf.keras.layers.Dense(4, activation=tf.nn.relu),
-    tf.keras.layers.Dense(1),
-])
-
-X = tf.random.uniform((2, 4))
-net(X)
-```
-
-## [**Parameter Access**]
-
-Let us start with how to access parameters
-from the models that you already know.
-When a model is defined via the `Sequential` class,
-we can first access any layer by indexing
-into the model as though it were a list.
-Each layer's parameters are conveniently
-located in its attribute.
-We can inspect the parameters of the second fully-connected layer as follows.
-
-```{.python .input}
-print(net[1].params)
-```
-
-```{.python .input}
-#@tab pytorch
-print(net[2].state_dict())
-```
-
-```{.python .input}
-#@tab tensorflow
-print(net.layers[2].weights)
-```
-
-The output tells us a few important things.
-First, this fully-connected layer
-contains two parameters,
-corresponding to that layer's
-weights and biases, respectively.
-Both are stored as single precision floats (float32).
-Note that the names of the parameters
-allow us to uniquely identify
-each layer's parameters,
-even in a network containing hundreds of layers.
-
-
-### [**Targeted Parameters**]
-
-Note that each parameter is represented
-as an instance of the parameter class.
-To do anything useful with the parameters,
-we first need to access the underlying numerical values.
-There are several ways to do this.
-Some are simpler while others are more general.
-The following code extracts the bias
-from the second neural network layer, which returns a parameter class instance, and 
-further accesses that parameter's value.
-
-```{.python .input}
-print(type(net[1].bias))
-print(net[1].bias)
-print(net[1].bias.data())
-```
-
-```{.python .input}
-#@tab pytorch
-print(type(net[2].bias))
-print(net[2].bias)
-print(net[2].bias.data)
-```
-
-```{.python .input}
-#@tab tensorflow
-print(type(net.layers[2].weights[1]))
-print(net.layers[2].weights[1])
-print(tf.convert_to_tensor(net.layers[2].weights[1]))
-```
-
-:begin_tab:`mxnet,pytorch`
-Parameters are complex objects,
-containing values, gradients,
-and additional information.
-That's why we need to request the value explicitly.
-
-In addition to the value, each parameter also allows us to access the gradient. Because we have not invoked backpropagation for this network yet, it is in its initial state.
-:end_tab:
-
-```{.python .input}
-net[1].weight.grad()
-```
-
-```{.python .input}
-#@tab pytorch
-net[2].weight.grad is None
-```
-
-### [**All Parameters at Once**]
-
-When we need to perform operations on all parameters,
-accessing them one-by-one can grow tedious.
-The situation can grow especially unwieldy
-when we work with more complex blocks (e.g., nested blocks),
-since we would need to recurse
-through the entire tree to extract
-each sub-block's parameters. Below we demonstrate accessing the parameters of the first fully-connected layer vs. accessing all layers.
-
-```{.python .input}
-print(net[0].collect_params())
-print(net.collect_params())
-```
-
-```{.python .input}
-#@tab pytorch
-print(*[(name, param.shape) for name, param in net[0].named_parameters()])
-print(*[(name, param.shape) for name, param in net.named_parameters()])
-```
-
-```{.python .input}
-#@tab tensorflow
-print(net.layers[1].weights)
-print(net.get_weights())
-```
-
-This provides us with another way of accessing the parameters of the network as follows.
-
-```{.python .input}
-net.collect_params()['dense1_bias'].data()
-```
-
-```{.python .input}
-#@tab pytorch
-net.state_dict()['2.bias'].data
-```
-
-```{.python .input}
-#@tab tensorflow
-net.get_weights()[1]
-```
-
-### [**Collecting Parameters from Nested Blocks**]
-
-Let us see how the parameter naming conventions work
-if we nest multiple blocks inside each other.
-For that we first define a function that produces blocks
-(a block factory, so to speak) and then
-combine these inside yet larger blocks.
-
-```{.python .input}
-def block1():
-    net = nn.Sequential()
-    net.add(nn.Dense(32, activation='relu'))
-    net.add(nn.Dense(16, activation='relu'))
-    return net
-
-def block2():
-    net = nn.Sequential()
-    for _ in range(4):
-        # Nested here
-        net.add(block1())
-    return net
-
-rgnet = nn.Sequential()
-rgnet.add(block2())
-rgnet.add(nn.Dense(10))
-rgnet.initialize()
-rgnet(X)
-```
-
-```{.python .input}
-#@tab pytorch
-def block1():
-    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
-                         nn.Linear(8, 4), nn.ReLU())
-
-def block2():
-    net = nn.Sequential()
-    for i in range(4):
-        # Nested here
-        net.add_module(f'block {i}', block1())
-    return net
-
-rgnet = nn.Sequential(block2(), nn.Linear(4, 1))
-rgnet(X)
-```
-
-```{.python .input}
-#@tab tensorflow
-def block1(name):
-    return tf.keras.Sequential([
-        tf.keras.layers.Flatten(),
-        tf.keras.layers.Dense(4, activation=tf.nn.relu)],
-        name=name)
-
-def block2():
-    net = tf.keras.Sequential()
-    for i in range(4):
-        # Nested here
-        net.add(block1(name=f'block-{i}'))
-    return net
-
-rgnet = tf.keras.Sequential()
-rgnet.add(block2())
-rgnet.add(tf.keras.layers.Dense(1))
-rgnet(X)
-```
-
-Now that [**we have designed the network,
-let us see how it is organized.**]
-
-```{.python .input}
-print(rgnet.collect_params)
-print(rgnet.collect_params())
-```
-
-```{.python .input}
-#@tab pytorch
-print(rgnet)
-```
-
-```{.python .input}
-#@tab tensorflow
-print(rgnet.summary())
-```
-
-Since the layers are hierarchically nested,
-we can also access them as though
-indexing through nested lists.
-For instance, we can access the first major block,
-within it the second sub-block,
-and within that the bias of the first layer,
-as follows.
-
-```{.python .input}
-rgnet[0][1][0].bias.data()
-```
-
-```{.python .input}
-#@tab pytorch
-rgnet[0][1][0].bias.data
-```
-
-```{.python .input}
-#@tab tensorflow
-rgnet.layers[0].layers[1].layers[1].weights[1]
-```
-
-## Parameter Initialization
-
-Now that we know how to access the parameters,
-let us look at how to initialize them properly.
-We discussed the need for proper initialization in :numref:`sec_numerical_stability`.
-The deep learning framework provides default random initializations to its layers.
-However, we often want to initialize our weights
-according to various other protocols. The framework provides the most commonly
-used protocols and also allows us to create a custom initializer.
-
-:begin_tab:`mxnet`
-By default, MXNet initializes weight parameters by randomly drawing from a uniform distribution $U(-0.07, 0.07)$,
-clearing bias parameters to zero.
-MXNet's `init` module provides a variety
-of preset initialization methods.
-:end_tab:
-
-:begin_tab:`pytorch`
-By default, PyTorch initializes weight and bias matrices
-uniformly by drawing from a range that is computed according to the input and output dimension.
-PyTorch's `nn.init` module provides a variety
-of preset initialization methods.
-:end_tab:
-
-:begin_tab:`tensorflow`
-By default, Keras initializes weight matrices uniformly by drawing from a range that is computed according to the input and output dimension, and the bias parameters are all set to zero.
-TensorFlow provides a variety of initialization methods both in the root module and the `keras.initializers` module.
-:end_tab:
-
-### [**Built-in Initialization**]
-
-Let us begin by calling on built-in initializers.
-The code below initializes all weight parameters
-as Gaussian random variables
-with standard deviation 0.01, while bias parameters are cleared to zero.
-
-```{.python .input}
-# Here `force_reinit` ensures that parameters are freshly initialized even if
-# they were already initialized previously
-net.initialize(init=init.Normal(sigma=0.01), force_reinit=True)
-net[0].weight.data()[0]
-```
-
-```{.python .input}
-#@tab pytorch
-def init_normal(m):
-    if type(m) == nn.Linear:
-        nn.init.normal_(m.weight, mean=0, std=0.01)
-        nn.init.zeros_(m.bias)
-net.apply(init_normal)
-net[0].weight.data[0], net[0].bias.data[0]
-```
-
-```{.python .input}
-#@tab tensorflow
-net = tf.keras.models.Sequential([
-    tf.keras.layers.Flatten(),
-    tf.keras.layers.Dense(
-        4, activation=tf.nn.relu,
-        kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.01),
-        bias_initializer=tf.zeros_initializer()),
-    tf.keras.layers.Dense(1)])
-
-net(X)
-net.weights[0], net.weights[1]
-```
-
-We can also initialize all the parameters
-to a given constant value (say, 1).
-
-```{.python .input}
-net.initialize(init=init.Constant(1), force_reinit=True)
-net[0].weight.data()[0]
-```
-
-```{.python .input}
-#@tab pytorch
-def init_constant(m):
-    if type(m) == nn.Linear:
-        nn.init.constant_(m.weight, 1)
-        nn.init.zeros_(m.bias)
-net.apply(init_constant)
-net[0].weight.data[0], net[0].bias.data[0]
-```
-
-```{.python .input}
-#@tab tensorflow
-net = tf.keras.models.Sequential([
-    tf.keras.layers.Flatten(),
-    tf.keras.layers.Dense(
-        4, activation=tf.nn.relu,
-        kernel_initializer=tf.keras.initializers.Constant(1),
-        bias_initializer=tf.zeros_initializer()),
-    tf.keras.layers.Dense(1),
-])
-
-net(X)
-net.weights[0], net.weights[1]
-```
-
-[**We can also apply different initializers for certain blocks.**]
-For example, below we initialize the first layer
-with the Xavier initializer
-and initialize the second layer
-to a constant value of 42.
-
-```{.python .input}
-net[0].weight.initialize(init=init.Xavier(), force_reinit=True)
-net[1].initialize(init=init.Constant(42), force_reinit=True)
-print(net[0].weight.data()[0])
-print(net[1].weight.data())
-```
-
-```{.python .input}
-#@tab pytorch
-def xavier(m):
-    if type(m) == nn.Linear:
-        nn.init.xavier_uniform_(m.weight)
-def init_42(m):
-    if type(m) == nn.Linear:
-        nn.init.constant_(m.weight, 42)
-
-net[0].apply(xavier)
-net[2].apply(init_42)
-print(net[0].weight.data[0])
-print(net[2].weight.data)
-```
-
-```{.python .input}
-#@tab tensorflow
-net = tf.keras.models.Sequential([
-    tf.keras.layers.Flatten(),
-    tf.keras.layers.Dense(
-        4,
-        activation=tf.nn.relu,
-        kernel_initializer=tf.keras.initializers.GlorotUniform()),
-    tf.keras.layers.Dense(
-        1, kernel_initializer=tf.keras.initializers.Constant(42)),
-])
-
-net(X)
-print(net.layers[1].weights[0])
-print(net.layers[2].weights[0])
-```
-
-### [**Custom Initialization**]
-
-Sometimes, the initialization methods we need
-are not provided by the deep learning framework.
-In the example below, we define an initializer
-for any weight parameter $w$ using the following strange distribution:
-
-$$
-\begin{aligned}
-    w \sim \begin{cases}
-        U(5, 10) & \text{ with probability } \frac{1}{4} \\
-            0    & \text{ with probability } \frac{1}{2} \\
-        U(-10, -5) & \text{ with probability } \frac{1}{4}
-    \end{cases}
-\end{aligned}
-$$
-
-:begin_tab:`mxnet`
-Here we define a subclass of the `Initializer` class.
-Usually, we only need to implement the `_init_weight` function
-which takes a tensor argument (`data`)
-and assigns to it the desired initialized values.
-:end_tab:
-
-:begin_tab:`pytorch`
-Again, we implement a `my_init` function to apply to `net`.
-:end_tab:
-
-:begin_tab:`tensorflow`
-Here we define a subclass of `Initializer` and implement the `__call__`
-function that returns a desired tensor given the shape and data type.
-:end_tab:
-
-```{.python .input}
-class MyInit(init.Initializer):
-    def _init_weight(self, name, data):
-        print('Init', name, data.shape)
-        data[:] = np.random.uniform(-10, 10, data.shape)
-        data *= np.abs(data) >= 5
-
-net.initialize(MyInit(), force_reinit=True)
-net[0].weight.data()[:2]
-```
-
-```{.python .input}
-#@tab pytorch
-def my_init(m):
-    if type(m) == nn.Linear:
-        print("Init", *[(name, param.shape) 
-                        for name, param in m.named_parameters()][0])
-        nn.init.uniform_(m.weight, -10, 10)
-        m.weight.data *= m.weight.data.abs() >= 5
-
-net.apply(my_init)
-net[0].weight[:2]
-```
-
-```{.python .input}
-#@tab tensorflow
-class MyInit(tf.keras.initializers.Initializer):
-    def __call__(self, shape, dtype=None):
-        data = tf.random.uniform(shape, -10, 10, dtype=dtype)
-        factor = (tf.abs(data) >= 5)
-        factor = tf.cast(factor, data.dtype)
-        return data * factor
-
-net = tf.keras.models.Sequential([
-    tf.keras.layers.Flatten(),
-    tf.keras.layers.Dense(
-        4,
-        activation=tf.nn.relu,
-        kernel_initializer=MyInit()),
-    tf.keras.layers.Dense(1),
-])
-
-net(X)
-print(net.layers[1].weights[0])
-```
-
-Note that we always have the option
-of setting parameters directly.
-
-```{.python .input}
-net[0].weight.data()[:] += 1
-net[0].weight.data()[0, 0] = 42
-net[0].weight.data()[0]
-```
-
-```{.python .input}
-#@tab pytorch
-net[0].weight.data[:] += 1
-net[0].weight.data[0, 0] = 42
-net[0].weight.data[0]
-```
-
-```{.python .input}
-#@tab tensorflow
-net.layers[1].weights[0][:].assign(net.layers[1].weights[0] + 1)
-net.layers[1].weights[0][0, 0].assign(42)
-net.layers[1].weights[0]
-```
-
-:begin_tab:`mxnet`
-A note for advanced users:
-if you want to adjust parameters within an `autograd` scope,
-you need to use `set_data` to avoid confusing
-the automatic differentiation mechanics.
-:end_tab:
-
-## [**Tied Parameters**]
-
-Often, we want to share parameters across multiple layers.
-Let us see how to do this elegantly.
-In the following we allocate a dense layer
-and then use its parameters specifically
-to set those of another layer.
-
-```{.python .input}
-net = nn.Sequential()
-# We need to give the shared layer a name so that we can refer to its
-# parameters
-shared = nn.Dense(8, activation='relu')
-net.add(nn.Dense(8, activation='relu'),
-        shared,
-        nn.Dense(8, activation='relu', params=shared.params),
-        nn.Dense(10))
-net.initialize()
-
-X = np.random.uniform(size=(2, 20))
-net(X)
-
-# Check whether the parameters are the same
-print(net[1].weight.data()[0] == net[2].weight.data()[0])
-net[1].weight.data()[0, 0] = 100
-# Make sure that they are actually the same object rather than just having the
-# same value
-print(net[1].weight.data()[0] == net[2].weight.data()[0])
-```
-
-```{.python .input}
-#@tab pytorch
-# We need to give the shared layer a name so that we can refer to its
-# parameters
-shared = nn.Linear(8, 8)
-net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
-                    shared, nn.ReLU(),
-                    shared, nn.ReLU(),
-                    nn.Linear(8, 1))
-net(X)
-# Check whether the parameters are the same
-print(net[2].weight.data[0] == net[4].weight.data[0])
-net[2].weight.data[0, 0] = 100
-# Make sure that they are actually the same object rather than just having the
-# same value
-print(net[2].weight.data[0] == net[4].weight.data[0])
-```
-
-```{.python .input}
-#@tab tensorflow
-# tf.keras behaves a bit differently. It removes the duplicate layer
-# automatically
-shared = tf.keras.layers.Dense(4, activation=tf.nn.relu)
-net = tf.keras.models.Sequential([
-    tf.keras.layers.Flatten(),
-    shared,
-    shared,
-    tf.keras.layers.Dense(1),
-])
-
-net(X)
-# The shared layer is listed only once, so `net.layers` has three entries
-print(len(net.layers) == 3)
-```
-
-:begin_tab:`mxnet,pytorch`
-This example shows that the parameters
-of the second and third layer are tied.
-They are not just equal, they are
-represented by the same exact tensor.
-Thus, if we change one of the parameters,
-the other one changes, too.
-You might wonder,
-when parameters are tied
-what happens to the gradients?
-Since the model parameters contain gradients,
-the gradients of the second hidden layer
-and the third hidden layer are added together
-during backpropagation.
-:end_tab:
-
-## Summary
-
-* We have several ways to access, initialize, and tie model parameters.
-* We can use custom initialization.
-
-
-## Exercises
-
-1. Use the `FancyMLP` model defined in :numref:`sec_model_construction` and access the parameters of the various layers.
-1. Look at the documentation of the initialization module to explore different initializers.
-1. Construct an MLP containing a shared parameter layer and train it. During the training process, observe the model parameters and gradients of each layer.
-1. Why is sharing parameters a good idea?
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/56)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/57)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/269)
-:end_tab:
diff --git a/chapter_deep-learning-computation/use-gpu.md b/chapter_deep-learning-computation/use-gpu.md
deleted file mode 100644
index 50a3894..0000000
--- a/chapter_deep-learning-computation/use-gpu.md
+++ /dev/null
@@ -1,343 +0,0 @@
-# GPU
-:label:`sec_use_gpu`
-
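-In :numref:`tab_intro_decade`, we discussed the rapid growth of computation over the past two decades. In a nutshell, GPU performance has increased by a factor of 1000 every decade since 2000. This offers great opportunities, but it also suggests a significant need to provide such performance.
-
-In this section, we begin to discuss how to harness this computational performance for your research. We start with a single GPU and later cover how to use multiple GPUs and multiple servers (with multiple GPUs).
-
-Specifically, we will discuss how to use a single NVIDIA GPU for calculations. First, make sure you have at least one NVIDIA GPU installed. Then, download the [NVIDIA driver and CUDA](https://developer.nvidia.com/cuda-downloads) and follow the prompts to set the appropriate path. Once these preparations are complete, the `nvidia-smi` command can be used to (**view the graphics card information**).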
-
-```{.python .input}
-#@tab all
-!nvidia-smi
-```
-
-:begin_tab:`mxnet`
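-You might have noticed that an MXNet tensor looks almost identical to a NumPy `ndarray`. But there are a few crucial differences. One of the key features that distinguishes MXNet from NumPy is its support for diverse hardware devices.
-
-In MXNet, every array has a context. So far, by default, all variables and associated computation have been assigned to the CPU. Typically, other contexts might be various GPUs. Things can get even hairier when we deploy jobs across multiple servers. By assigning arrays to contexts intelligently, we can minimize the time spent transferring data between devices. For example, when training neural networks on a server with a GPU, we typically prefer the model's parameters to live on the GPU.
-
-Next, we need to confirm that the GPU version of MXNet is installed. If a CPU version of MXNet is already installed, we need to uninstall it first. For example, use the `pip uninstall mxnet` command, then install the MXNet version that corresponds to your CUDA version. Assuming you have CUDA 10.0 installed, you can install the MXNet version that supports CUDA 10.0 via `pip install mxnet-cu100`.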
-:end_tab:
-
-:begin_tab:`pytorch`
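-In PyTorch, every array has a device, which we often refer to as a context. So far, by default, all variables and associated computation have been assigned to the CPU. Typically, other contexts might be various GPUs. Things can get even hairier when we deploy jobs across multiple servers. By assigning arrays to contexts intelligently, we can minimize the time spent transferring data between devices. For example, when training neural networks on a server with a GPU, we typically prefer the model's parameters to live on the GPU.
-
-Next, we need to confirm that the GPU version of PyTorch is installed. If a CPU version of PyTorch is already installed, we need to uninstall it first, for example with the `pip uninstall torch` command, and then install the PyTorch build that matches the installed CUDA version by following the installation instructions on the PyTorch website.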
-:end_tab:
-
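-To run the programs in this section, you need at least two GPUs. This might be extravagant for most desktop computers, but it is easily available in the cloud, e.g., by using AWS EC2 multi-GPU instances. Almost all other sections do not require multiple GPUs; they are needed here only to illustrate how data flow between different devices.
-
-## [**Computing Devices**]
-
-We can specify devices, such as CPUs and GPUs, for storage and calculation. By default, tensors are created in the main memory and the CPU is used to compute with them.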
-
-:begin_tab:`mxnet`
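-In MXNet, the CPU and GPU can be indicated by `cpu()` and `gpu()`. It should be noted that `cpu()` (or any integer in the parentheses) means all physical CPUs and memory. This means that MXNet's calculations will try to use all CPU cores. However, `gpu()` only represents one card and the corresponding memory. If there are multiple GPUs, we use `gpu(i)` to represent the $i^\mathrm{th}$ GPU ($i$ starts from 0). Also, `gpu(0)` and `gpu()` are equivalent.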
-:end_tab:
-
-:begin_tab:`pytorch`
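-In PyTorch, the CPU and GPU can be indicated by `torch.device('cpu')` and `torch.device('cuda')`. It should be noted that the `cpu` device means all physical CPUs and memory. This means that PyTorch's calculations will try to use all CPU cores. However, a `cuda` device only represents one card and the corresponding memory. If there are multiple GPUs, we use `torch.device(f'cuda:{i}')` to represent the $i^\mathrm{th}$ GPU ($i$ starts from 0). Also, `cuda` without an index refers to the current GPU, which is `cuda:0` by default.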
-:end_tab:
-
-```{.python .input}
-from mxnet import np, npx
-from mxnet.gluon import nn
-npx.set_np()
-
-npx.cpu(), npx.gpu(), npx.gpu(1)
-```
-
-```{.python .input}
-#@tab pytorch
-import torch
-from torch import nn
-
-torch.device('cpu'), torch.device('cuda'), torch.device('cuda:1')
-```
-
-```{.python .input}
-#@tab tensorflow
-import tensorflow as tf
-
-tf.device('/CPU:0'), tf.device('/GPU:0'), tf.device('/GPU:1')
-```
-
-We can (**query the number of available GPUs.**)
-
-```{.python .input}
-npx.num_gpus()
-```
-
-```{.python .input}
-#@tab pytorch
-torch.cuda.device_count()
-```
-
-```{.python .input}
-#@tab tensorflow
-len(tf.config.experimental.list_physical_devices('GPU'))
-```
-
-Now we [**define two convenient functions that allow us to run code even if the requested GPUs do not exist.**]
-
-```{.python .input}
-def try_gpu(i=0):  #@save
-    """Return gpu(i) if exists, otherwise return cpu()."""
-    return npx.gpu(i) if npx.num_gpus() >= i + 1 else npx.cpu()
-
-def try_all_gpus():  #@save
-    """Return all available GPUs, or [cpu()] if no GPU exists."""
-    devices = [npx.gpu(i) for i in range(npx.num_gpus())]
-    return devices if devices else [npx.cpu()]
-
-try_gpu(), try_gpu(10), try_all_gpus()
-```
-
-```{.python .input}
-#@tab pytorch
-def try_gpu(i=0):  #@save
-    """Return gpu(i) if exists, otherwise return cpu()."""
-    if torch.cuda.device_count() >= i + 1:
-        return torch.device(f'cuda:{i}')
-    return torch.device('cpu')
-
-def try_all_gpus():  #@save
-    """Return all available GPUs, or [cpu(),] if no GPU exists."""
-    devices = [torch.device(f'cuda:{i}')
-             for i in range(torch.cuda.device_count())]
-    return devices if devices else [torch.device('cpu')]
-
-try_gpu(), try_gpu(10), try_all_gpus()
-```
-
-```{.python .input}
-#@tab tensorflow
-def try_gpu(i=0):  #@save
-    """Return gpu(i) if exists, otherwise return cpu()."""
-    if len(tf.config.experimental.list_physical_devices('GPU')) >= i + 1:
-        return tf.device(f'/GPU:{i}')
-    return tf.device('/CPU:0')
-
-def try_all_gpus():  #@save
-    """Return all available GPUs, or [cpu(),] if no GPU exists."""
-    num_gpus = len(tf.config.experimental.list_physical_devices('GPU'))
-    devices = [tf.device(f'/GPU:{i}') for i in range(num_gpus)]
-    return devices if devices else [tf.device('/CPU:0')]
-
-try_gpu(), try_gpu(10), try_all_gpus()
-```
-
-## Tensors and GPUs
-
-By default, tensors are created on the CPU. We can [**query the device where the tensor is located.**]
-
-```{.python .input}
-x = np.array([1, 2, 3])
-x.ctx
-```
-
-```{.python .input}
-#@tab pytorch
-x = torch.tensor([1, 2, 3])
-x.device
-```
-
-```{.python .input}
-#@tab tensorflow
-x = tf.constant([1, 2, 3])
-x.device
-```
-
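-It is important to note that whenever we want to operate on multiple terms, they need to be on the same device. For instance, if we sum two tensors, we need to make sure that both arguments live on the same device; otherwise the framework would not know where to store the result or even how to decide where to perform the computation.
-
-### Storage on the GPU
-
-There are several ways to [**store a tensor on the GPU.**] For example, we can specify a storage device when creating a tensor. Next, we create the tensor variable `X` on the first `gpu`. A tensor created on a GPU only consumes the memory of that GPU. We can use the `nvidia-smi` command to view GPU memory usage. In general, we need to make sure that we do not create data that exceeds the GPU memory limit.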
-
-```{.python .input}
-X = np.ones((2, 3), ctx=try_gpu())
-X
-```
-
-```{.python .input}
-#@tab pytorch
-X = torch.ones(2, 3, device=try_gpu())
-X
-```
-
-```{.python .input}
-#@tab tensorflow
-with try_gpu():
-    X = tf.ones((2, 3))
-X
-```
-
-Assuming that you have at least two GPUs, the following code will (**create a random tensor on the second GPU.**)
-
-```{.python .input}
-Y = np.random.uniform(size=(2, 3), ctx=try_gpu(1))
-Y
-```
-
-```{.python .input}
-#@tab pytorch
-Y = torch.rand(2, 3, device=try_gpu(1))
-Y
-```
-
-```{.python .input}
-#@tab tensorflow
-with try_gpu(1):
-    Y = tf.random.uniform((2, 3))
-Y
-```
-
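-### Copying
-
-[**If we want to compute `X + Y`, we need to decide where to perform this operation.**] For instance, as shown in :numref:`fig_copyto`, we can transfer `X` to the second GPU and perform the operation there.
-*Do not* simply add `X` and `Y`,
-since this will result in an exception. The runtime engine would not know what to do: it cannot find the data on the same device, so it fails. Since `Y` lives on the second GPU, we need to move `X` there before we can add the two.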
-
-![Copy data to perform an operation on the same device.](../img/copyto.svg)
-:label:`fig_copyto`
-
-```{.python .input}
-Z = X.copyto(try_gpu(1))
-print(X)
-print(Z)
-```
-
-```{.python .input}
-#@tab pytorch
-Z = X.cuda(1)
-print(X)
-print(Z)
-```
-
-```{.python .input}
-#@tab tensorflow
-with try_gpu(1):
-    Z = X
-print(X)
-print(Z)
-```
-
-[**Now that the data are on the same GPU (both `Z` and `Y` are), we can add them up.**]
-
-```{.python .input}
-#@tab all
-Y + Z
-```
-
-:begin_tab:`mxnet`
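-Imagine that your variable `Z` already lives on your second GPU. What happens if we still call `Z.copyto(gpu(1))`? It will make a copy and allocate new memory, even though that variable already lives on the desired device. Depending on the environment our code is running in, two variables may already live on the same device, so we want to make a copy only if the variables currently live on different devices. In these cases, we can call `as_in_ctx`. If the variable already lives on the specified device, this is a no-op. Unless you specifically want to make a copy, `as_in_ctx` is the method of choice.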
-:end_tab:
-
-:begin_tab:`pytorch`
-Imagine that your variable `Z` already lives on your second GPU. What happens if we still call `Z.cuda(1)`? It will return `Z` instead of making a copy and allocating new memory.
-:end_tab:
-
-:begin_tab:`tensorflow`
-Imagine that your variable `Z` already lives on your second GPU. What happens if we still assign `Z2 = Z` under the same device scope? It will return `Z` instead of making a copy and allocating new memory.
-:end_tab:
-
-```{.python .input}
-Z.as_in_ctx(try_gpu(1)) is Z
-```
-
-```{.python .input}
-#@tab pytorch
-Z.cuda(1) is Z
-```
-
-```{.python .input}
-#@tab tensorflow
-with try_gpu(1):
-    Z2 = Z
-Z2 is Z
-```
-
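-### Side Notes
-
-People use GPUs for machine learning because they expect them to be fast. But transferring variables between devices is slow. So we want you to be 100% certain that you want to do something slow before we let you do it. If the deep learning framework just did the copy automatically without crashing, you might not realize that you had written some slow code.
-
-Also, transferring data between devices (CPU, GPUs, and other machines) is much slower than computation. It also makes parallelization a lot more difficult, since we have to wait for data to be sent (or rather to be received) before we can proceed with more operations. This is why copy operations should be handled with great care. As a rule of thumb, many small operations are much worse than one big operation. Moreover, several operations at a time are much better than many single operations interspersed in the code, unless you know what you are doing. This is because such operations can block if one device has to wait for the other before it can do something else. It is a bit like ordering your coffee in a queue rather than pre-ordering it by phone and finding out that it is ready when you arrive.
-
-Last, when we print tensors or convert tensors to the NumPy format, if the data is not in the main memory, the framework will copy it to the main memory first, resulting in additional transmission overhead. Even worse, it is now subject to the dreaded global interpreter lock that makes everything wait for Python to complete.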
-
-## [**Neural Networks and GPUs**]
-
-Similarly, a neural network model can specify devices. The following code puts the model parameters on the GPU.
-
-```{.python .input}
-net = nn.Sequential()
-net.add(nn.Dense(1))
-net.initialize(ctx=try_gpu())
-```
-
-```{.python .input}
-#@tab pytorch
-net = nn.Sequential(nn.Linear(3, 1))
-net = net.to(device=try_gpu())
-```
-
-```{.python .input}
-#@tab tensorflow
-strategy = tf.distribute.MirroredStrategy()
-with strategy.scope():
-    net = tf.keras.models.Sequential([
-        tf.keras.layers.Dense(1)])
-```
-
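-We will see many more examples of how to run models on GPUs in the following chapters, simply because the models will become somewhat more computationally intensive.
-
-When the input is a tensor on the GPU, the model will calculate the result on the same GPU.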
-
-```{.python .input}
-#@tab all
-net(X)
-```
-
-Let us (**confirm that the model parameters are stored on the same GPU.**)
-
-```{.python .input}
-net[0].weight.data().ctx
-```
-
-```{.python .input}
-#@tab pytorch
-net[0].weight.data.device
-```
-
-```{.python .input}
-#@tab tensorflow
-net.layers[0].weights[0].device, net.layers[0].weights[1].device
-```
-
-In short, as long as all data and parameters are on the same device, we can learn models efficiently. In the following chapters we will see several such examples.
-
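-## Summary
-
-* We can specify devices for storage and calculation, such as the CPU or GPU. By default, data are created in the main memory and the CPU is used for calculations.
-* The deep learning framework requires all input data for a calculation to be on the same device, be it the CPU or the same GPU.
-* You can lose significant performance by moving data without care. A typical mistake is as follows: computing the loss for every minibatch on the GPU and reporting it back to the user on the command line (or logging it in a NumPy `ndarray`) will trigger a global interpreter lock which stalls all GPUs. It is much better to allocate memory for logging inside the GPU and only move larger logs.
-
-## Exercises
-
-1. Try a larger computation task, such as the multiplication of large matrices, and see the difference in speed between the CPU and GPU. What about a task with a small amount of calculation?
-1. How should we read and write model parameters on the GPU?
-1. Measure the time it takes to compute 1000 matrix-matrix multiplications of $100 \times 100$ matrices while logging the Frobenius norm of the output matrix one result at a time, and compare it with keeping the log on the GPU and transferring only the final result.
-1. Measure how much time it takes to perform two matrix-matrix multiplications on two GPUs at the same time vs. in sequence on one GPU. Hint: you should see almost linear scaling.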
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/62)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/63)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/270)
-:end_tab:
diff --git a/chapter_installation/index.md b/chapter_installation/index.md
index 84424ec..82ce5c9 100644
--- a/chapter_installation/index.md
+++ b/chapter_installation/index.md
@@ -1,123 +1,128 @@
-# Mounting
+# Installation
 :label:`chap_installation`
 
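-To give you a hands-on learning experience, we need to set you up with an environment for running Python, Jupyter notebooks, the relevant libraries, and the code needed to run the book itself.
+In order to get up and running, we will need an environment for running Python, the Jupyter Notebook, the relevant libraries, and the code needed to run the book itself.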
 
 ## Installing Miniconda
 
-The simplest way to get going is to install [Miniconda](https://conda.io/en/latest/miniconda.html). The Python 3.x version is required. You can skip the following steps if your machine already has conda installed.
+Your simplest option is to install [Miniconda](https://conda.io/en/latest/miniconda.html). Note that the Python 3.x version is required. You can skip the following steps if your machine already has conda installed.
 
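-Visit the Miniconda website and determine the appropriate version for your system based on your Python 3.x version and machine architecture. For example, if you are using macOS and Python 3.x, you would download the bash script whose name contains the strings "Miniconda3" and "MacOSX", navigate to the download location, and execute the installation as follows:
+Visit the Miniconda website and determine the appropriate version for your system based on your Python 3.x version and machine architecture. Suppose that your Python version is 3.9 (our tested version). If you are using macOS, you would download the bash script whose name contains the string "MacOSX", navigate to the download location, and execute the installation as follows (taking Intel Macs as an example):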
 
 ```bash
-sh Miniconda3-latest-MacOSX-x86_64.sh -b
+# The file name is subject to changes
+sh Miniconda3-py39_4.12.0-MacOSX-x86_64.sh -b
 ```
 
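-A Linux user with Python 3.x would download the file whose name contains the strings "Miniconda3" and "Linux" and execute the following at the download location:
+A Linux user would download the file whose name contains the string "Linux" and execute the following at the download location: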
 
 ```bash
-sh Miniconda3-latest-Linux-x86_64.sh -b
+# The file name is subject to changes
+sh Miniconda3-py39_4.12.0-Linux-x86_64.sh -b
 ```
 
-Next, initialize the shell so that we can run `conda` directly.
+Next, initialize the shell so we can run `conda` directly.
 
 ```bash
 ~/miniconda3/bin/conda init
 ```
 
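-Now close and reopen your current shell. You should be able to create a new environment as follows:
+Then close and reopen your current shell. You should be able to create a new environment as follows: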
 
 ```bash
-conda create --name d2l python=3.8 -y
+conda create --name d2l python=3.9 -y
 ```
 
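-## Downloading the D2L Notebooks
-
-Next, we need to download the code of this book. You can click the "All Notebooks" tab on the top of any HTML page to download and unzip the code. Alternatively, if you have `unzip` available (otherwise run `sudo apt install unzip`):
+Now we can activate the `d2l` environment: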
 
 ```bash
-mkdir d2l-en && cd d2l-en
-curl https://d2l.ai/d2l-en.zip -o d2l-en.zip
-unzip d2l-en.zip && rm d2l-en.zip
+conda activate d2l
 ```
 
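-Now we can activate the `d2l` environment:
+## Installing the Deep Learning Framework and the `d2l` Package
+
+Before installing any deep learning framework, please first check whether or not you have proper GPUs on your machine (the GPUs that power the display on a standard laptop are not relevant for our purposes). For example, if your computer has NVIDIA GPUs and has installed [CUDA](https://developer.nvidia.com/cuda-downloads), then you are all set. If your machine does not house any GPU, there is no need to worry just yet. Your CPU provides more than enough horsepower to get you through the first few chapters. Just remember that you will want to access GPUs before running larger models.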
+
+:begin_tab:`mxnet`
+To install a GPU-enabled version of MXNet, we need to find out what version of CUDA you have installed. You can check this by running `nvcc --version` or `cat /usr/local/cuda/version.txt`. Assuming that you have installed CUDA 10.2, execute the following command:
 
 ```bash
-conda activate d2l
-```
+# For macOS and Linux users
+pip install mxnet-cu102==1.7.0
 
-## Installing the Framework and the `d2l` Package
+# For Windows users
+pip install mxnet-cu102==1.7.0 -f https://dist.mxnet.io/python
+```
 
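-Before installing any deep learning framework, please first check whether or not you have proper GPUs on your machine (the GPUs that power the display on a standard laptop are not relevant for our purposes). If you are working on a GPU server, proceed to :ref:`subsec_gpu` for instructions on how to install GPU-friendly versions of the relevant libraries.
+You may change the last digits according to your CUDA version, e.g., `cu101` for CUDA 10.1 and `cu90` for CUDA 9.0.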
 
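-If your machine does not house any GPUs, there is no need to worry just yet. Your CPU provides more than enough horsepower to get you through the first few chapters. Just remember that you will want to access GPUs before running larger models. To install the CPU version, execute the following command.
+If your machine has no NVIDIA GPUs or CUDA, you can install the CPU version as follows: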
 
-:begin_tab:`mxnet`
 ```bash
 pip install mxnet==1.7.0.post1
 ```
 :end_tab:
 
 :begin_tab:`pytorch`
+You can install PyTorch with either CPU or GPU support as follows:
+
 ```bash
 pip install torch torchvision
 ```
 :end_tab:
 
 :begin_tab:`tensorflow`
-You can install TensorFlow with both CPU and GPU support as follows:
+You can install TensorFlow with either CPU or GPU support as follows:
 
 ```bash
 pip install tensorflow tensorflow-probability
 ```
 :end_tab:
 
-Our next step is to install the `d2l` package that we developed in order to encapsulate frequently used functions and classes found throughout this book.
-
-```bash
-# -U: Upgrade all packages to the newest available version
-pip install -U d2l
-```
-
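-Once you have completed these installation steps, you can start the Jupyter notebook server by running:
+Our next step is to install the `d2l` package that we developed in order to encapsulate frequently used functions and classes found throughout this book: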
 
 ```bash
-jupyter notebook
+pip install d2l==1.0.0a1.post0
 ```
 
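-At this point, you can open http://localhost:8888 (it may have already opened automatically) in your Web browser. Then we can run the code for each section of the book. Please always execute `conda activate d2l` to activate the runtime environment before running the code of the book or updating the deep learning framework or the `d2l` package. To exit the environment, run `conda deactivate`.
+## Downloading and Running the Code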
 
-## GPU Support
-:label:`subsec_gpu`
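+Next, you will want to download the notebooks so that you can run each of the book's code blocks. Simply click on the "Notebooks" tab at the top of any HTML page on [the D2L.ai website](https://d2l.ai/) to download the code and then unzip it. Alternatively, you can fetch the notebooks from the command line as follows: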
 
 :begin_tab:`mxnet`
-By default, MXNet is installed without GPU support to ensure that it will run on any computer (including most laptops). Part of this book requires or recommends running with GPU. If your computer has NVIDIA graphics cards and has installed [CUDA](https://developer.nvidia.com/cuda-downloads), then you should install a GPU-enabled version. If you have installed the CPU-only version, you may need to remove it first by running:
-
 ```bash
-pip uninstall mxnet
+mkdir d2l-en && cd d2l-en
+curl https://d2l.ai/d2l-en.zip -o d2l-en.zip
+unzip d2l-en.zip && rm d2l-en.zip
+cd mxnet
 ```
+:end_tab:
 
-We now need to find out what version of CUDA you have installed. You can check this by running `nvcc --version` or `cat /usr/local/cuda/version.txt`. Assuming that you have installed CUDA 10.1, you can install with the following command:
-
+:begin_tab:`pytorch`
 ```bash
-# For Windows users
-pip install mxnet-cu101==1.7.0 -f https://dist.mxnet.io/python
-
-# For Linux and macOS users
-pip install mxnet-cu101==1.7.0
+mkdir d2l-en && cd d2l-en
+curl https://d2l.ai/d2l-en.zip -o d2l-en.zip
+unzip d2l-en.zip && rm d2l-en.zip
+cd pytorch
 ```
-
-You may change the last digits according to your CUDA version, e.g., `cu100` for CUDA 10.0 and `cu90` for CUDA 9.0.
 :end_tab:
 
-:begin_tab:`pytorch,tensorflow`
-By default, the deep learning framework is installed with GPU support. If your computer has NVIDIA GPUs and has installed [CUDA](https://developer.nvidia.com/cuda-downloads), then you are all set.
+:begin_tab:`tensorflow`
+```bash
+mkdir d2l-en && cd d2l-en
+curl https://d2l.ai/d2l-en.zip -o d2l-en.zip
+unzip d2l-en.zip && rm d2l-en.zip
+cd tensorflow
+```
 :end_tab:
 
-## Exercises
+If you don't already have `unzip` installed, first run `sudo apt-get install unzip`. Now we can start the Jupyter Notebook server by running:
+
+```bash
+jupyter notebook
+```
 
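-1. Download the code for the book and install the runtime environment.
+At this point, you can open http://localhost:8888 (it may have already opened automatically) in your Web browser. Then we can run the code for each section of the book. Whenever you open a new command line window, you will need to execute `conda activate d2l` to activate the runtime environment before running the D2L notebooks or updating your packages (either the deep learning framework or the `d2l` package). To exit the environment, run `conda deactivate`.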
 
 :begin_tab:`mxnet`
 [Discussions](https://discuss.d2l.ai/t/23)
diff --git a/chapter_installation/index_origin.md b/chapter_installation/index_origin.md
index 428abc6..abc2e58 100644
--- a/chapter_installation/index_origin.md
+++ b/chapter_installation/index_origin.md
@@ -1,40 +1,45 @@
 # Installation
 :label:`chap_installation`
 
-In order to get you up and running for hands-on learning experience,
-we need to set you up with an environment 
-for running Python, Jupyter notebooks, the relevant libraries, 
+In order to get up and running,
+we will need an environment for running Python,
+the Jupyter Notebook, the relevant libraries,
 and the code needed to run the book itself.
 
 ## Installing Miniconda
 
-The simplest way to get going will be to install
-[Miniconda](https://conda.io/en/latest/miniconda.html). 
-The Python 3.x version is required. 
-You can skip the following steps 
+Your simplest option is to install
+[Miniconda](https://conda.io/en/latest/miniconda.html).
+Note that the Python 3.x version is required.
+You can skip the following steps
 if your machine already has conda installed.
 
-Visit the Miniconda website and determine 
+Visit the Miniconda website and determine
 the appropriate version for your system
 based on your Python 3.x version and machine architecture.
-For example, if you are using macOS and Python 3.x 
-you would download the bash script 
-whose name contains the strings "Miniconda3" and "MacOSX",
+Suppose that your Python version is 3.9
+(our tested version).
+If you are using macOS,
+you would download the bash script
+whose name contains the string "MacOSX",
 navigate to the download location,
-and execute the installation as follows:
+and execute the installation as follows
+(taking Intel Macs as an example):
 
 ```bash
-sh Miniconda3-latest-MacOSX-x86_64.sh -b
+# The file name is subject to changes
+sh Miniconda3-py39_4.12.0-MacOSX-x86_64.sh -b
 ```
 
 
-A Linux user with Python 3.x 
+A Linux user
 would download the file
-whose name contains the strings "Miniconda3" and "Linux" 
+whose name contains the string "Linux"
 and execute the following at the download location:
 
 ```bash
-sh Miniconda3-latest-Linux-x86_64.sh -b
+# The file name is subject to changes
+sh Miniconda3-py39_4.12.0-Linux-x86_64.sh -b
 ```
 
 
@@ -45,28 +50,12 @@ Next, initialize the shell so we can run `conda` directly.
 ```
 
 
-Now close and reopen your current shell. 
-You should be able to create 
+Then close and reopen your current shell.
+You should be able to create
 a new environment as follows:
 
 ```bash
-conda create --name d2l python=3.8 -y
-```
-
-
-## Downloading the D2L Notebooks
-
-Next, we need to download the code of this book. 
-You can click the "All Notebooks" tab 
-on the top of any HTML page 
-to download and unzip the code.
-Alternatively, if you have `unzip` 
-(otherwise run `sudo apt install unzip`) available:
-
-```bash
-mkdir d2l-en && cd d2l-en
-curl https://d2l.ai/d2l-en.zip -o d2l-en.zip
-unzip d2l-en.zip && rm d2l-en.zip
+conda create --name d2l python=3.9 -y
 ```
 
 
@@ -77,31 +66,51 @@ conda activate d2l
 ```
 
 
-## Installing the Framework and the `d2l` Package
+## Installing the Deep Learning Framework and the `d2l` Package
 
-Before installing any deep learning framework, 
-please first check whether or not 
+Before installing any deep learning framework,
+please first check whether or not
 you have proper GPUs on your machine
-(the GPUs that power the display 
+(the GPUs that power the display
 on a standard laptop are not relevant for our purposes).
-If you are working on a GPU server,
-proceed to :ref:`subsec_gpu` 
-for instructions on how 
-to install GPU-friendly versions
-of the relevant libraries.
-
-If your machine does not house any GPUs,
+For example,
+if your computer has NVIDIA GPUs and has installed [CUDA](https://developer.nvidia.com/cuda-downloads),
+then you are all set.
+If your machine does not house any GPU,
 there is no need to worry just yet.
-Your CPU provides more than enough horsepower 
+Your CPU provides more than enough horsepower
 to get you through the first few chapters.
-Just remember that you will want to access GPUs 
+Just remember that you will want to access GPUs
 before running larger models.
-To install the the CPU version,
-execute the following command.
 
 
 :begin_tab:`mxnet`
 
+To install a GPU-enabled version of MXNet,
+we need to find out what version of CUDA you have installed.
+You can check this by running `nvcc --version`
+or `cat /usr/local/cuda/version.txt`.
+Assume that you have installed CUDA 10.2,
+then execute the following command:
+
+```bash
+# For macOS and Linux users
+pip install mxnet-cu102==1.7.0
+
+# For Windows users
+pip install mxnet-cu102==1.7.0 -f https://dist.mxnet.io/python
+```
+
+
+You may change the last digits according to your CUDA version, e.g., `cu101` for
+CUDA 10.1 and `cu90` for CUDA 9.0.
+
+
+If your machine has no NVIDIA GPUs
+or CUDA,
+you can install the CPU version
+as follows:
+
 ```bash
 pip install mxnet==1.7.0.post1
 ```
@@ -112,6 +121,8 @@ pip install mxnet==1.7.0.post1
 
 :begin_tab:`pytorch`
 
+You can install PyTorch with either CPU or GPU support as follows:
+
 ```bash
 pip install torch torchvision
 ```
@@ -120,7 +131,7 @@ pip install torch torchvision
 :end_tab:
 
 :begin_tab:`tensorflow`
-You can install TensorFlow with both CPU and GPU support as follows:
+You can install TensorFlow with either CPU or GPU support as follows:
 
 ```bash
 pip install tensorflow tensorflow-probability
@@ -130,82 +141,86 @@ pip install tensorflow tensorflow-probability
 :end_tab:
 
 
-Our next step is to install 
-the `d2l` package that we developed 
+Our next step is to install
+the `d2l` package that we developed
 in order to encapsulate
 frequently used functions and classes
-found throughout this book.
+found throughout this book:
 
 ```bash
-# -U: Upgrade all packages to the newest available version
-pip install -U d2l
+pip install d2l==1.0.0a1.post0
 ```
 
 
-Once you have completed these installation steps, we can the Jupyter notebook server by running:
+## Downloading and Running the Code
+
+Next, you will want to download the notebooks
+so that you can run each of the book's code blocks.
+Simply click on the "Notebooks" tab at the top
+of any HTML page on [the D2L.ai website](https://d2l.ai/)
+to download the code and then unzip it.
+Alternatively, you can fetch the notebooks
+from the command line as follows:
+
+:begin_tab:`mxnet`
 
 ```bash
-jupyter notebook
+mkdir d2l-en && cd d2l-en
+curl https://d2l.ai/d2l-en.zip -o d2l-en.zip
+unzip d2l-en.zip && rm d2l-en.zip
+cd mxnet
 ```
 
 
-At this point, you can open http://localhost:8888 
-(it may have already opened automatically) in your Web browser. 
-Then we can run the code for each section of the book.
-Please always execute `conda activate d2l` 
-to activate the runtime environment
-before running the code of the book 
-or updating the deep learning framework or the `d2l` package.
-To exit the environment, 
-run `conda deactivate`.
-
+:end_tab:
 
-## GPU Support
-:label:`subsec_gpu`
 
-:begin_tab:`mxnet`
-By default, MXNet is installed without GPU support
-to ensure that it will run on any computer (including most laptops).
-Part of this book requires or recommends running with GPU.
-If your computer has NVIDIA graphics cards and has installed [CUDA](https://developer.nvidia.com/cuda-downloads),
-then you should install a GPU-enabled version.
-If you have installed the CPU-only version,
-you may need to remove it first by running:
+:begin_tab:`pytorch`
 
 ```bash
-pip uninstall mxnet
+mkdir d2l-en && cd d2l-en
+curl https://d2l.ai/d2l-en.zip -o d2l-en.zip
+unzip d2l-en.zip && rm d2l-en.zip
+cd pytorch
 ```
 
 
-We now need to find out what version of CUDA you have installed.
-You can check this by running `nvcc --version` 
-or `cat /usr/local/cuda/version.txt`.
-Assume that you have installed CUDA 10.1,
-then you can install with the following command:
+:end_tab:
 
-```bash
-# For Windows users
-pip install mxnet-cu101==1.7.0 -f https://dist.mxnet.io/python
+:begin_tab:`tensorflow`
 
-# For Linux and macOS users
-pip install mxnet-cu101==1.7.0
+```bash
+mkdir d2l-en && cd d2l-en
+curl https://d2l.ai/d2l-en.zip -o d2l-en.zip
+unzip d2l-en.zip && rm d2l-en.zip
+cd tensorflow
 ```
 
 
-You may change the last digits according to your CUDA version, e.g., `cu100` for
-CUDA 10.0 and `cu90` for CUDA 9.0.
 :end_tab:
 
+If you don't already have `unzip` installed, first run `sudo apt-get install unzip`.
+Now we can start the Jupyter Notebook server by running:
 
-:begin_tab:`pytorch,tensorflow`
-By default, the deep learning framework is installed with GPU support.
-If your computer has NVIDIA GPUs and has installed [CUDA](https://developer.nvidia.com/cuda-downloads),
-then you are all set.
-:end_tab:
+```bash
+jupyter notebook
+```
+
+
+At this point, you can open http://localhost:8888
+(it may have already opened automatically) in your Web browser.
+Then we can run the code for each section of the book.
+Whenever you open a new command line window,
+you will need to execute `conda activate d2l`
+to activate the runtime environment
+before running the D2L notebooks,
+or updating your packages
+(either the deep learning framework
+or the `d2l` package).
+To exit the environment,
+run `conda deactivate`.
 
-## Exercises
 
-1. Download the code for the book and install the runtime environment.
 
 :begin_tab:`mxnet`
 [Discussions](https://discuss.d2l.ai/t/23)
diff --git a/chapter_introduction/index.md b/chapter_introduction/index.md
index e83c3cb..f1a7d5e 100644
--- a/chapter_introduction/index.md
+++ b/chapter_introduction/index.md
@@ -1,229 +1,218 @@
 # Introduction
 :label:`chap_introduction`
 
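-Until recently, nearly every computer program that we interact with daily was coded by software developers from first principles. Say that we wanted to write an application to manage an e-commerce platform. After huddling around a whiteboard for a few hours to ponder the problem, we would probably come up with the broad strokes of a working solution that might look something like this: (i) users interact with the application through an interface running in a web browser or mobile application; (ii) our application interacts with a commercial-grade database engine to keep track of each user's state and maintain records of historical transactions; and (iii) at the heart of our application, the *business logic* (you might say, the *brains*) of our application spells out in methodical detail the appropriate action that our program should take in every conceivable circumstance.
+Until recently, nearly every computer program that you might interact with on an ordinary day was coded up as a rigid set of rules specifying precisely how it should behave. Say that we wanted to write an application to manage an e-commerce platform. After huddling around a whiteboard for a few hours to ponder the problem, we might settle on the broad strokes of a working solution, for example: (i) users interact with the application through an interface running in a web browser or mobile application; (ii) our application interacts with a commercial-grade database engine to keep track of each user's state and maintain records of historical transactions; and (iii) at the heart of our application, the *business logic* (you might say, the *brains*) of our application spells out a set of rules that map every conceivable circumstance to the corresponding action that our program should take.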
 
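-To build the brains of our application, we would have to step through every possible corner case that we anticipate encountering, devising appropriate rules. Each time a customer clicks to add an item to their shopping cart, we add an entry to the shopping cart database table, associating that user's ID with the requested product's ID. While few developers ever get it completely right the first time (it might take some test runs to work out the kinks), for the most part, we could write such a program from first principles and confidently launch it
+To build the brains of our application, we might enumerate all the common events that our program should handle. For example, whenever a customer clicks to add an item to their shopping cart, our program should add an entry to the shopping cart database table, associating that user's ID with the requested product's ID. We might then attempt to step through every possible corner case, testing the appropriateness of our rules and making any necessary modifications. What happens if a user initiates a purchase with an empty cart? While few developers ever get it completely right the first time (it might take some test runs to work out the kinks), for the most part, we can write such programs and confidently launch them
 *before* ever seeing a real customer.
-Our ability to design automated systems from first principles that drive functioning products and systems, often in novel situations, is a remarkable cognitive feat. And when you are able to devise solutions that work $100\%$ of the time, you should not be using machine learning.
+Our ability to manually design automated systems that drive functioning products and systems, often in novel situations, is a remarkable cognitive feat. And when you are able to devise solutions that work $100\%$ of the time, you typically should not be worrying about machine learning.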
 
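-Fortunately for the community of machine learning scientists, many tasks that we would like to automate do not bend so easily to human ingenuity. Imagine huddling around the whiteboard with the smartest minds you know, but this time you are tackling one of the following problems:
+Fortunately for the growing community of machine learning scientists, many tasks that we would like to automate do not bend so easily to human ingenuity. Imagine huddling around the whiteboard with the smartest minds you know, but this time you are tackling one of the following problems: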
 
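-* Write a program that predicts tomorrow's weather given geographic information, satellite images, and a trailing window of past weather.
-* Write a program that takes in a question, expressed in free-form text, and answers it correctly.
-* Write a program that, given an image, can identify all the people it contains, drawing outlines around each.
-* Write a program that presents users with products that they are likely to enjoy but unlikely, in the natural course of browsing, to encounter.
+* Write a program that predicts tomorrow's weather given geographic information, satellite images, and a trailing window of past weather.
+* Write a program that takes in a factoid question, expressed in free-form text, and answers it correctly.
+* Write a program that, given an image, identifies all of the people depicted in it and draws outlines around each.
+* Write a program that presents users with products that they are likely to enjoy but unlikely, in the natural course of browsing, to encounter.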
 
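-In each of these cases, even elite programmers are incapable of coding up solutions from scratch. The reasons for this can vary. Sometimes the program that we are looking for follows a pattern that changes over time, and we need our programs to adapt. In other cases, the relationship (say between pixels and abstract categories) may be too complicated, requiring thousands or millions of computations that are beyond our conscious understanding even if our eyes manage the task effortlessly.
-*Machine learning* is the study of powerful
-techniques that can learn from experience. As a machine learning algorithm accumulates more experience, typically in the form of observational data or interactions with an environment, its performance improves. Contrast this with our deterministic e-commerce platform, which performs according to the same business logic, no matter how much experience accrues, until the developers themselves learn and decide that it is time to update the software. In this book, we will teach you the fundamentals of machine learning, and focus in particular on *deep learning*, a powerful set of techniques driving innovations in areas as diverse as computer vision, natural language processing, healthcare, and genomics.
+For these problems, even elite programmers would struggle to code up solutions from scratch. The reasons can vary. Sometimes the program that we are looking for follows a pattern that changes over time, so there is no fixed right answer! In such cases, any successful solution must adapt gracefully to a changing world. At other times, the relationship (say between pixels and abstract categories) may be too complicated, requiring thousands or millions of computations and following unknown principles. In the case of image recognition, the precise steps required to perform the task lie beyond our conscious understanding, even though our subconscious cognitive processes execute the task effortlessly.
+
+*Machine learning* is the study of algorithms
+that can learn from experience. As a machine learning algorithm accumulates more experience, typically in the form of observational data or interactions with an environment, its performance improves. Contrast this with our deterministic e-commerce platform, which follows the same business logic, no matter how much experience accrues, until the developers themselves learn and decide that it is time to update the software. In this book, we will teach you the fundamentals of machine learning, focusing in particular on *deep learning*, a powerful set of techniques driving innovations in areas as diverse as computer vision, natural language processing, healthcare, and genomics.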
 
 ## A Motivating Example
 
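-Before beginning writing, the authors of this book, like much of the work force, had to become caffeinated. We hopped in the car and started driving. Using an iPhone, Alex called out "Hey Siri", awakening the phone's voice recognition system. Then Mu commanded "directions to Blue Bottle coffee shop". The phone quickly displayed the transcription of his command. It also recognized that we were asking for directions and launched the Maps application (app) to fulfill our request. Once launched, the Maps app identified a number of routes. Next to each route, a predicted transit time was displayed. While this story was fabricated for pedagogical convenience, it demonstrates that in the span of just a few seconds, our everyday interactions with a smart phone can engage several machine learning models.
+Before beginning writing, the authors of this book, like much of the work force, had to become caffeinated. We hopped in the car and started driving. Using an iPhone, Alex called out "Hey Siri", awakening the phone's voice recognition system. Then Mu commanded "directions to Blue Bottle coffee shop". The phone quickly displayed the transcription of his command. It also recognized that we were asking for directions and launched the Maps application (app) to fulfill our request. Once launched, the Maps app identified a number of routes. Next to each route, the phone displayed a predicted transit time. While we fabricated this story for pedagogical convenience, it demonstrates that in the span of just a few seconds, our everyday interactions with a smart phone can engage several machine learning models.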
 
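-Imagine just writing a program to respond to a *wake word* such as "Alexa", "OK Google", and "Hey Siri". Try coding it up in a room by yourself with nothing but a computer and a code editor, as illustrated in :numref:`fig_wake_word`. How would you write such a program from first principles? Think about it... the problem is hard. The microphone will collect roughly 44000 samples every second. Each sample is a measurement of the amplitude of the sound wave. What rule could map reliably from a snippet of raw audio to confident predictions $\{\text{yes}, \text{no}\}$ about whether the snippet contains the wake word? If you are stuck, do not worry. We do not know how to write such a program from scratch either. That is why we use machine learning.
+Imagine just writing a program to respond to a *wake word* such as "Alexa", "OK Google", and "Hey Siri". Try coding it up in a room by yourself with nothing but a computer and a code editor, as illustrated in :numref:`fig_wake_word`. How would you write such a program from first principles? Think about it... the problem is hard. Every second, the microphone will collect roughly 44000 samples. Each sample is a measurement of the amplitude of the sound wave. What rule could map reliably from a snippet of raw audio to confident predictions $\{\text{yes}, \text{no}\}$ about whether the snippet contains the wake word? If you are stuck, do not worry. We do not know how to write such a program from scratch either. That is why we use machine learning.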
 
 ![Identify a wake word.](../img/wake-word.svg)
 :label:`fig_wake_word`
 
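-Here is the trick. Often, even when we do not know how to tell a computer explicitly how to map from inputs to outputs, we are nonetheless capable of performing the cognitive feat ourselves. In other words, even if you do not know how to program a computer to recognize the word "Alexa", you yourself are able to recognize it. Armed with this ability, we can collect a huge *dataset* containing examples of audio and label those that do and that do not contain the wake word. In the machine learning approach, we do not attempt to design a system
-*explicitly* to recognize wake words.
-Instead, we define a flexible program whose behavior is determined by a number of *parameters*. Then we use the dataset to determine the best possible set of parameters, those that improve the performance of our program with respect to some measure of performance on the task of interest.
+Here is the trick. Often, even when we do not know how to tell a computer explicitly how to map from inputs to outputs, we are nonetheless capable of performing the cognitive feat ourselves. In other words, even if you do not know how to program a computer to recognize the word "Alexa", you yourself are able to recognize it. Armed with this ability, we can collect a huge *dataset* containing examples of audio snippets and associated labels, indicating which snippets contain the wake word. In the dominant approach to machine learning, we do not attempt to design a system
+*explicitly* to recognize wake words.
+Instead, we define a flexible program whose behavior is determined by a number of *parameters*. Then we use the dataset to determine the best possible parameter values, i.e., those that improve the performance of our program with respect to a chosen performance measure.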
 
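-You can think of the parameters as knobs that we can turn, manipulating the behavior of the program. Fixing the parameters, we call the program a *model*. The set of all distinct programs (input-output mappings) that we can produce just by manipulating the parameters is called a *family* of models. And the meta-program that uses our dataset to choose the parameters is called a *learning algorithm*.
+You can think of the parameters as knobs that we can turn, manipulating the behavior of the program. Fixing the parameters, we call the program a *model*. The set of all distinct programs (input-output mappings) that we can produce just by manipulating the parameters is called a *family* of models. And the meta-program that uses our dataset to choose the parameters is called a *learning algorithm*.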
 
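-Before we can go ahead and use the learning algorithm, we have to define the problem precisely, pinning down the exact nature of the inputs and outputs, and choosing an appropriate model family. In this case, our model receives a snippet of audio as *input*, and the model generates a selection among $\{\text{yes}, \text{no}\}$ as *output*. If all goes according to plan, the model's guesses will be correct as to whether the snippet contains the wake word.
+Before we can go ahead and engage the learning algorithm, we have to define the problem precisely, pinning down the exact nature of the inputs and outputs, and choosing an appropriate model family. In this case, our model receives a snippet of audio as *input*, and the model generates a selection among $\{\text{yes}, \text{no}\}$ as *output*. If all goes according to plan, the model's guesses will typically be correct as to whether the snippet contains the wake word.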
 
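-If we choose the right family of models, there should exist one setting of the knobs such that the model fires "yes" every time it hears the word "Alexa". Because the exact choice of the wake word is arbitrary, we will probably need a model family sufficiently rich that, via another setting of the knobs, it could fire "yes" only upon hearing the word "Apricot". We expect that the same model family should be suitable for "Alexa" recognition and "Apricot" recognition because they seem, intuitively, to be similar tasks. However, we might need a different family of models entirely if we want to deal with fundamentally different inputs or outputs, say if we wanted to map from images to captions, or from English sentences to Chinese sentences.
+If we choose the right family of models, there should exist one setting of the knobs such that the model fires "yes" every time it hears the word "Alexa". Because the exact choice of the wake word is arbitrary, we will probably need a model family sufficiently rich that, via another setting of the knobs, it could fire "yes" only upon hearing the word "Apricot". We expect that the same model family should be suitable for "Alexa" recognition and "Apricot" recognition because they seem, intuitively, to be similar tasks. However, we might need a different family of models entirely if we want to deal with fundamentally different inputs or outputs, say if we wanted to map from images to captions, or from English sentences to Chinese sentences.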
 
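-As you might guess, if we just set all of the knobs randomly, it is unlikely that our model will recognize "Alexa", "Apricot", or any other English word. In machine learning, the *learning* is the process by which we discover the right setting of the knobs coercing the desired behavior from our model. In other words, we *train* our model with data. As shown in :numref:`fig_ml_loop`, the training process usually looks like the following:
+As you might guess, if we just set all of the knobs randomly, it is unlikely that our model will recognize "Alexa", "Apricot", or any other English word. In machine learning, the *learning* is the process by which we discover the right setting of the knobs coercing the desired behavior from our model. In other words, we *train* our model with data. As shown in :numref:`fig_ml_loop`, the training process usually looks like the following: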
 
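-1. Start off with a randomly initialized model that cannot do anything useful.
-1. Grab some of your data (e.g., audio snippets and corresponding $\{\text{yes}, \text{no}\}$ labels).
-1. Tweak the knobs so the model sucks less with respect to those examples.
-1. Repeat Steps 2 and 3 until the model is awesome.
+1. Start off with a randomly initialized model that cannot do anything useful.
+1. Grab some of your data (e.g., audio snippets and corresponding $\{\text{yes}, \text{no}\}$ labels).
+1. Tweak the knobs to make the model perform better as assessed on those examples.
+1. Repeat Steps 2 and 3 until the model is awesome.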
 
 ![A typical training process.](../img/ml-loop.svg)
 :label:`fig_ml_loop`
 
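-To summarize, rather than code up a wake word recognizer, we code up a program that can *learn* to recognize wake words, if presented with a large labeled dataset. You can think of this act of determining a program's behavior by presenting it with a dataset as *programming with data*. That is to say, we can "program" a cat detector by providing our machine learning system with many examples of cats and dogs. This way the detector will eventually learn to emit a very large positive number if it is a cat, a very large negative number if it is a dog, and something closer to zero if it is not sure. This barely scratches the surface of what machine learning can do. Deep learning, which we will explain in greater detail later, is just one among many popular methods for solving machine learning problems.
+To summarize, rather than code up a wake word recognizer, we code up a program that can *learn* to recognize wake words, if presented with a large labeled dataset. You can think of this act of determining a program's behavior by presenting it with a dataset as *programming with data*. That is to say, we can "program" a cat detector by providing our machine learning system with many examples of cats and dogs. This way the detector will eventually learn to emit a very large positive number if it is a cat, a very large negative number if it is a dog, and something closer to zero if it is not sure. This barely scratches the surface of what machine learning can do. Deep learning, which we will explain in greater detail later, is just one among many popular methods for solving machine learning problems.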
 
 ## Key Components
 
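-In our wake word example, we described a dataset consisting of audio snippets and binary labels, and we gave a hand-wavy sense of how we might train a model to approximate a mapping from snippets to classifications. This sort of problem, where we try to predict a designated unknown label based on known inputs given a dataset consisting of examples for which the labels are known, is called *supervised learning*. This is just one among many kinds of machine learning problems. Later we will take a deep dive into different machine learning problems. First, we would like to shed more light on some core components that will follow us around, no matter what kind of machine learning problem we take on:
+In our wake word example, we described a dataset consisting of audio snippets and binary labels, and we gave a hand-wavy sense of how we might train a model to approximate a mapping from snippets to classifications. This sort of problem, where we try to predict a designated unknown label based on known inputs given a dataset consisting of examples for which the labels are known, is called *supervised learning*. This is just one among many kinds of machine learning problems. Before we explore other varieties, we would like to shed more light on some core components that will follow us around, no matter what kind of machine learning problem we take on: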
 
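-1. The *data* that we can learn from.
-1. A *model* of how to transform the data.
-1. An *objective function* that quantifies how well (or badly) the model is doing.
+1. The *data* that we can learn from.
+1. A *model* of how to transform the data.
+1. An *objective function* that quantifies how well (or badly) the model is doing.
 1. An *algorithm* to adjust the model's parameters to optimize the objective function.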
 
 ### Data
 
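-It might go without saying that you cannot do data science without data. We could lose hundreds of pages pondering what precisely constitutes data, but for now, we will err on the practical side and focus on the key properties to be concerned with. Generally, we are concerned with a collection of examples. In order to work with data usefully, we typically need to come up with a suitable numerical representation. Each *example* (or *data point*, *data instance*, *sample*) typically consists of a set of attributes called *features* (or *covariates*), from which the model must make its predictions. In the supervised learning problems above, the thing to predict is a special attribute that is designated as the *label* (or *target*).
+It might go without saying that you cannot do data science without data. We could lose hundreds of pages pondering what precisely data *is*, but for now, we will focus on the key properties of the datasets that we will be concerned with. Generally, we are concerned with a collection of examples. In order to work with data usefully, we typically need to come up with a suitable numerical representation. Each *example* (or *data point*, *data instance*, *sample*) typically consists of a set of attributes called *features* (sometimes called *covariates* or *inputs*), based on which the model must make its predictions. In supervised learning problems, our goal is to predict the value of a special attribute, called the *label* (or *target*), that is not part of the model's input.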
 
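-If we were working with image data, each individual photograph might constitute an example, represented by an ordered list of numerical values corresponding to the brightness of each pixel. A $200\times 200$ color photograph would consist of $200\times200\times3=120000$ numerical values, corresponding to the brightness of the red, green, and blue channels for each spatial location. In another traditional task, we might try to predict whether or not a patient will survive, given a standard set of features such as age, vital signs, and diagnoses.
+If we were working with image data, each example might consist of an individual photograph (the features) and a number indicating the category to which the photograph belongs (the label). The photograph would be represented numerically as three grids of numerical values representing the brightness of red, green, and blue light at each pixel location. For example, a $200\times 200$ color photograph would consist of $200\times200\times3=120000$ numerical values.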
 
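-When every example is characterized by the same number of numerical values, we say that the data consist of fixed-length vectors and we describe the constant length of the vectors as the *dimensionality* of the data. As you might imagine, fixed length can be a convenient property. If we wanted to train a model to recognize cancer in microscopy images, fixed-length inputs mean we have one less thing to worry about.
+Alternatively, we might work with electronic health record data and tackle the task of predicting the likelihood that a given patient will survive the next 30 days. Here, our features might consist of a collection of readily available attributes and frequently recorded measurements, including age, vital signs, comorbidities, current medications, and recent procedures. The label available for training would be a binary value indicating whether each patient in the historical data survived within the 30-day window.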
 
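-However, not all data can easily be represented as
-*fixed-length* vectors.
-While we might expect microscope images to come from standard equipment, we cannot expect images mined from the Internet to all show up with the same resolution or shape. For images, we might consider cropping them all to a standard size, but that strategy only gets us so far. We risk losing information in the cropped out portions. Moreover, text data resists fixed-length representations even more stubbornly. Consider the customer reviews left on e-commerce sites such as Amazon, IMDB, and TripAdvisor. Some are short: "it stinks!". Others ramble for pages. One major advantage of deep learning over traditional methods is the comparative grace with which modern models can handle *varying-length* data.
+In such cases, when every example is characterized by the same number of numerical features, we say that the inputs are fixed-length vectors and we call the (constant) length of the vectors the *dimensionality* of the data. As you might imagine, fixed-length inputs can be convenient, giving us one less complication to worry about. However, not all data can easily be represented as *fixed-length* vectors. While we might expect microscope images to come from standard equipment, we cannot expect images mined from the Internet to all show up with the same resolution or shape. For images, we might consider cropping them all to a standard size, but that strategy only gets us so far. We risk losing information in the cropped out portions. Moreover, text data resists fixed-length representations even more stubbornly. Consider the customer reviews left on e-commerce sites such as Amazon, IMDb, and TripAdvisor. Some are short: "it stinks!". Others ramble for pages. One major advantage of deep learning over traditional methods is the comparative grace with which modern models can handle *varying-length* data.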
 
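-Generally, the more data we have, the easier our job becomes. When we have more data, we can train more powerful models and rely less heavily on preconceived assumptions. The regime change from (comparatively) small to big data is a major contributor to the success of modern deep learning. Many of the most exciting models in deep learning do not work without large datasets. Some others work in the small data regime, but are no better than traditional approaches.
+Generally, the more data we have, the easier our job becomes. When we have more data, we can train more powerful models and rely less heavily on preconceived assumptions. The regime change from (comparatively) small to big data is a major contributor to the success of modern deep learning. To drive the point home, many of the most exciting models in deep learning do not work without large datasets. Some others work in the small data regime, but are no better than traditional approaches.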
 
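-Finally, it is not enough to have lots of data and to process it cleverly. We need the *right* data. If the data is full of mistakes, or if the chosen features are not predictive of the target quantity of interest, learning is going to fail. The situation is captured well by the cliché:
+Finally, it is not enough to have lots of data and to process it cleverly. We need the *right* data. If the data is full of mistakes, or if the chosen features are not predictive of the target quantity of interest, learning is going to fail. The situation is captured well by the cliché:
 *garbage in, garbage out*.
-Moreover, poor predictive performance is not the only potential consequence. In sensitive applications of machine learning, like predictive policing, resume screening, and risk models used for lending, we must be especially alert to the consequences of garbage data. One common failure mode occurs in datasets where some groups of people are unrepresented in the training data. Imagine applying a skin cancer recognition system in the wild that had never seen black skin before. Failure can also occur when the data does not merely under-represent some groups but reflects societal prejudices. For example, if past hiring decisions are used to train a predictive model that will be used to screen resumes, then machine learning models could inadvertently capture and automate historical injustices. Note that this can all happen without the data scientist actively conspiring, or even being aware.
+Moreover, poor predictive performance is not the only potential consequence. In sensitive applications of machine learning, like predictive policing, resume screening, and risk models used for lending, we must be especially alert to the consequences of garbage data. One common failure mode occurs in datasets where some groups of people are unrepresented in the training data. Imagine applying a skin cancer recognition system in the wild that had never seen black skin before. Failure can also occur when the data does not merely under-represent some groups but reflects societal prejudices. For example, if past hiring decisions are used to train a predictive model that will be used to screen resumes, then machine learning models could inadvertently capture and automate historical injustices. Note that this can all happen without the data scientist actively conspiring, or even being aware.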
 
 ### Models
 
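-Most machine learning involves transforming the data in some sense. We might want to build a system that ingests photos and predicts smiley-ness. Alternatively, we might want to ingest a set of sensor readings and predict whether the readings are normal or anomalous. By *model*, we denote the computational machinery for ingesting data of one type, and spitting out predictions of a possibly different type. In particular, we are interested in statistical models that can be estimated from data. While simple models are perfectly capable of addressing appropriately simple problems, the problems that we focus on in this book stretch the limits of classical methods. Deep learning is differentiated from classical approaches principally by the set of powerful models that it focuses on. These models consist of many successive transformations of the data that are chained together top to bottom, thus the name *deep learning*. On our way to discussing deep models, we will also discuss some more traditional methods.
+Most machine learning involves transforming the data in some sense. We might want to build a system that ingests photos and predicts smiley-ness. Alternatively, we might want to ingest a set of sensor readings and predict how normal vs. anomalous the readings are. By *model*, we denote the computational machinery for ingesting data of one type, and spitting out predictions of a possibly different type. In particular, we are interested in statistical models that can be estimated from data. While simple models are perfectly capable of addressing appropriately simple problems, the problems that we focus on in this book stretch the limits of classical methods. Deep learning is differentiated from classical approaches principally by the set of powerful models that it focuses on. These models consist of many successive transformations of the data that are chained together top to bottom, thus the name *deep learning*. On our way to discussing deep models, we will also discuss some more traditional methods.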
 
-### Objective Functions
+### Objective Functions
 
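-Earlier, we introduced machine learning as learning from experience. By *learning* here, we mean improving at some task over time. But who is to say what constitutes an improvement? You might imagine that we could propose to update our model, and some people might disagree on whether the proposed update constituted an improvement or a decline.
+Earlier, we introduced machine learning as learning from experience. By *learning* here, we mean improving at some task over time. But who is to say what constitutes an improvement? You might imagine that we could propose an update to our model, and some people might disagree on whether the proposed update constituted an improvement or a decline.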
 
-学習機械の形式的な数学システムを開発するためには、モデルがどれだけ良い（または悪い）かを形式的に測定する必要があります。機械学習、およびより一般的な最適化では、これらを「目的関数」と呼んでいます。慣例により、通常、目的関数は小さいほど良いように定義します。これは単なる慣習です。符号を反転させることで、高いほど良い関数を取り、質的には同じであるが、低いほど良い新しい関数に変えることができます。低いほど良いので、これらの関数は時々呼ばれます。
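+In order to develop a formal mathematical system of learning machines, we need formal measures of how good (or bad) our models are. In machine learning, and optimization more generally, we call these *objective functions*. By convention, we usually define objective functions so that lower is better. This is merely a convention. You can take any function for which higher is better and turn it into a new function that is qualitatively identical but for which lower is better by flipping the sign. Because lower is better, these functions are sometimes called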
 *損失関数*。
 
-数値を予測する場合、最も一般的な損失関数は*二乗誤差*、つまり予測とグラウンドトゥルースの差の二乗です。分類の最も一般的な目的は、誤り率、つまり、予測がグラウンドトゥルースと一致しない例の割合を最小化することです。一部の目的 (二乗誤差など) は簡単に最適化できます。その他（誤り率など）は、微分不可能性やその他の複雑さのために、直接最適化が困難です。このような場合、*代理目的*を最適化するのが一般的です。 
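+When trying to predict numerical values, the most common loss function is *squared error*, i.e., the square of the difference between the prediction and the ground truth target. For classification, the most common objective is to minimize the error rate, i.e., the fraction of examples on which our predictions disagree with the ground truth. Some objectives (e.g., squared error) are easy to optimize, while others (e.g., error rate) are difficult to optimize directly, owing to non-differentiability or other complications. In these cases, it is common to optimize a *surrogate objective* instead.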
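
The two objectives just described can be written down in a few lines of Python. This is only a minimal sketch: the function names `squared_error` and `error_rate` and the toy values are assumptions made for illustration, not code from the book.

```python
def squared_error(prediction, target):
    """Squared difference between one numeric prediction and its target."""
    return (prediction - target) ** 2


def error_rate(predicted_labels, true_labels):
    """Fraction of examples whose predicted label differs from the truth."""
    mistakes = sum(p != t for p, t in zip(predicted_labels, true_labels))
    return mistakes / len(true_labels)


# Hypothetical toy values, chosen only for illustration.
regression_loss = squared_error(prediction=2.5, target=3.0)
classification_error = error_rate(
    ["cat", "dog", "dog", "cat"],  # predictions
    ["cat", "dog", "cat", "cat"],  # ground truth
)
```

Note that `squared_error` changes smoothly as the prediction moves, whereas `error_rate` only changes when a predicted label flips, which is one way to see why the latter is hard to optimize directly.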
 
-通常、損失関数はモデルのパラメータを基準に定義され、データセットによって異なります。トレーニング用に収集されたいくつかの例で構成されるセットで発生する損失を最小化することで、モデルのパラメーターの最適な値を学習します。ただし、トレーニングデータをうまく処理しても、目に見えないデータでもうまくいくとは限りません。そのため、通常、使用可能なデータを 2 つのパーティションに分割します。*training データセット* (または、モデルパラメーターの近似用の *training セット*) と、*test データセット* (または、評価のために保留される*test set*) で、両方でモデルがどのように機能するかを報告します。トレーニングのパフォーマンスは、実際の期末試験の準備に使用される模擬試験での学生の得点のようなものと考えることができます。結果が有望であっても、期末試験の合格を保証するものではありません。つまり、テストのパフォーマンスはトレーニングパフォーマンスから大きく逸脱する可能性があります。モデルがトレーニングセットではうまく機能しても、目に見えないデータに一般化できない場合、「過適合*」と言います。実生活で言えば、これは模擬試験でうまくやっているにもかかわらず、実際の試験をばかにするようなものです。 
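+During optimization, we think of the loss as a function of the model's parameters, and treat the training dataset as a constant. We learn the best values of our model's parameters by minimizing the loss incurred on a set consisting of some number of examples collected for training. However, doing well on the training data does not guarantee that we will do well on unseen data. So we typically split the available data into two partitions: the *training dataset* (or *training set*), for learning model parameters; and the *test dataset* (or *test set*), which is held out for evaluation. At the end of the day, we typically report how our models perform on both partitions. You could think of training performance as analogous to the scores that a student achieves on the practice exams used to prepare for some real final exam. Even if the results are encouraging, that does not guarantee success on the final exam. Over the course of studying, the student might begin to memorize the practice questions, appearing to master the topic but faltering when faced with previously unseen questions on the actual final exam. When a model performs well on the training set but fails to generalize to unseen data, we say that it is *overfitting* to the training data.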
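
The gap between training and test performance can be illustrated with a deliberately bad "model" that only memorizes its training examples. The sketch below is an assumption-laden toy: the dataset, the 80/20 split, and the names `memorizer` and `error_rate` are all hypothetical.

```python
import random

random.seed(0)  # fixed seed so the split is reproducible

# Hypothetical dataset: the label is 1 when the feature is at least 50.
data = [(x, int(x >= 50)) for x in range(100)]
random.shuffle(data)
train_set, test_set = data[:80], data[80:]

# A "model" that memorizes training examples, like a student who
# memorizes practice questions instead of learning the topic.
memory = dict(train_set)


def memorizer(x):
    return memory.get(x, 0)  # unseen inputs fall back to a fixed guess


def error_rate(model, dataset):
    return sum(model(x) != y for x, y in dataset) / len(dataset)


train_error = error_rate(memorizer, train_set)
test_error = error_rate(memorizer, test_set)
```

The training error is zero by construction, while the test error equals the fraction of held-out examples whose true label is 1, because the memorizer has never seen them.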
 
 ### 最適化アルゴリズム
 
-データソースと表現、モデル、明確に定義された目的関数が得られたら、損失関数を最小化するための最良のパラメーターを検索できるアルゴリズムが必要です。ディープラーニングの一般的な最適化アルゴリズムは、*勾配降下法* と呼ばれるアプローチに基づいています。つまり、このメソッドは、各ステップで、そのパラメーターを少しだけ摂動させた場合にトレーニングセットの損失がどのように動くかをパラメーターごとにチェックします。その後、損失が減少する方向にパラメータが更新されます。 
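+Once we have got some data source and representation, a model, and a well-defined objective function, we need an algorithm capable of searching for the best possible parameters for minimizing the loss function. Popular optimization algorithms for deep learning are based on an approach called *gradient descent*. In short, at each step, this method checks, for each parameter, which way the training set loss would move if that parameter were perturbed by a small amount. It then updates the parameter in the direction that lowers the loss.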
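
The perturb-and-update idea can be sketched directly. The example below is a minimal illustration, assuming a linear model with a mean squared error loss, a finite-difference estimate of the gradient, and a hand-picked learning rate; practical deep learning libraries compute gradients by automatic differentiation instead.

```python
def loss(w, b, data):
    """Mean squared error of the line w * x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def numerical_gradient(params, data, eps=1e-6):
    """Perturb each parameter slightly and measure how the loss moves."""
    grads = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += eps
        grads.append((loss(*bumped, data) - loss(*params, data)) / eps)
    return grads


# Hypothetical training data generated by y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
params = [0.0, 0.0]   # initial guess for (w, b)
learning_rate = 0.05  # assumed step size

for _ in range(2000):
    grads = numerical_gradient(params, data)
    # Move each parameter in the direction that lowers the loss.
    params = [p - learning_rate * g for p, g in zip(params, grads)]

w, b = params
```

After the loop, `w` and `b` should be close to the values 2 and 1 that generated the toy data.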
 
 ## 機械学習の問題の種類
 
-私たちのモチベーションを高める例のウェイクワード問題は、機械学習が対処できる多くの問題の1つにすぎません。本書全体でさらに多くの問題について話すときに、読者のモチベーションを高め、共通の言葉を提供するために、以下に機械学習の問題のサンプルを挙げます。データ、モデル、トレーニングテクニックなど、前述の概念を常に参照します。 
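+The wake word problem in our motivating example is just one among many problems that machine learning can tackle. To motivate the reader further and provide some common language that we will follow throughout the book, we now give a broad overview of machine learning problem formulations.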
 
 ### 教師あり学習
 
-教師あり学習は、入力特徴量からラベルを予測するタスクに対処します。フィーチャとラベルの各ペアを例と呼びます。文脈が明確であれば、対応するラベルが不明な場合でも、*examples* という用語を使用して入力の集合を指すことがあります。私たちの目標は、あらゆる入力をラベル予測にマッピングするモデルを作成することです。 
-
-この説明を具体例にまとめると、私たちが医療に携わっていたら、患者が心臓発作を起こすかどうかを予測したいと思うかもしれません。この観察、「心臓発作」または「心臓発作なし」が私たちのラベルになります。入力フィーチャには、心拍数、拡張期血圧、収縮期血圧などのバイタルサインがあります。 
-
-監視が重要になるのは、パラメーターを選択するために、私たち (スーパーバイザー) がラベル付きの例で構成されるデータセットをモデルに提供し、各例がグラウンドトゥルースラベルと照合されるためです。確率論的に言えば、通常、入力フィーチャが与えられた場合のラベルの条件付き確率を推定することに関心があります。これは機械学習におけるいくつかのパラダイムの 1 つにすぎませんが、教師あり学習は産業界で成功している機械学習の応用の大部分を占めています。その理由の1つは、多くの重要なタスクが、利用可能な特定のデータセットから未知のものの確率を推定することとしてはっきりと説明できるためです。 
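+Supervised learning describes tasks where we are given a dataset containing both features and labels and are asked to produce a model that predicts the labels from input features. Each pair of features and label is called an example. Sometimes, when the context is clear, we may use the term *examples* to refer to a collection of inputs, even when the corresponding labels are unknown. The supervision comes into play because, for choosing the parameters, we (the supervisors) provide the model with a dataset consisting of labeled examples. In probabilistic terms, we typically are interested in estimating the conditional probability of a label given input features. While it is just one among several paradigms within machine learning, supervised learning accounts for the majority of successful applications of machine learning in industry. Partly, that is because many important tasks can be described crisply as estimating the probability of something unknown given a particular set of available data: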
 
-* コンピューター断層撮影の画像から、がんとがんではないかを予測します。
-* 英語の文章があれば、フランス語の正しい翻訳を予測します。
+* Predict cancer vs. not cancer, given a computed tomography image.
+* Predict the correct translation in French, given a sentence in English.
 * 今月の財務報告データに基づいて、来月の株価を予測します。
 
-「入力特徴量からラベルを予測する」という単純な説明があっても、教師あり学習には非常に多くの形式があり、(他の考慮事項の中でも) タイプ、サイズ、入力と出力の数に応じて、非常に多くのモデリング決定が必要になります。たとえば、任意の長さのシーケンスを処理したり、固定長のベクトル表現を処理したりするために、さまざまなモデルを使用します。この本全体を通して、これらの問題の多くを詳しく見ていきます。 
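+While all supervised learning problems are captured by the simple description "predicting the labels given input features", supervised learning can take diverse forms and require a great many modeling decisions, depending on (among other considerations) the type, size, and quantity of the inputs and outputs. For example, we use different models to process sequences of arbitrary lengths and to process fixed-length vector representations. We will visit many of these problems in depth throughout this book.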
 
-非公式には、学習プロセスは次のようになります。まず、特徴が既知である例の大規模なコレクションを取得し、そこからランダムなサブセットを選択し、それぞれのグラウンドトゥルースラベルを取得します。これらのラベルは、すでに収集された利用可能なデータである場合があります（たとえば、患者が翌年に死亡したか？）また、データにラベルを付けるために人間のアノテーターを使用する必要がある場合もあります（たとえば、画像をカテゴリに割り当てるなど）。これらの入力と対応するラベルが一緒になって学習セットを構成します。トレーニングデータセットを教師あり学習アルゴリズムにフィードします。教師あり学習アルゴリズムは、データセットを入力として受け取り、別の関数、つまり学習済みモデルを出力する関数です。最後に、その出力を対応するラベルの予測として使用して、これまで見られなかった入力を学習済みモデルにフィードできます。完全なプロセスは :numref:`fig_supervised_learning` で描かれています。 
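+Informally, the learning process looks something like the following. First, grab a big collection of examples for which the features are known and select from them a random subset, acquiring the ground truth labels for each. Sometimes these labels might be available data that have already been collected (e.g., did a patient die within the following year?) and other times we might need to employ human annotators to label the data (e.g., assigning images to categories). Together, these inputs and corresponding labels comprise the training set. We feed the training dataset into a supervised learning algorithm, a function that takes as input a dataset and outputs another function: the learned model. Finally, we can feed previously unseen inputs to the learned model, using its outputs as predictions of the corresponding label. The full process is drawn in :numref:`fig_supervised_learning`.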
 
 ![Supervised learning.](../img/supervised-learning.svg)
 :label:`fig_supervised_learning`
 
 #### リグレッション
 
-おそらく、頭を包み込む最も簡単な教師あり学習タスクは、*回帰*でしょう。たとえば、住宅販売のデータベースから収集された一連のデータを考えてみましょう。各行が異なる家に対応し、各列が家の面積、寝室の数、トイレの数、町の中心部までの距離 (徒歩) など、関連する属性に対応するテーブルを作成できます。このデータセットでは、各例は特定の家屋で、対応する特徴ベクトルはテーブルの 1 行になります。ニューヨークまたはサンフランシスコに住んでいて、Amazon、Google、Microsoft、または Facebook の CEO ではない場合、自宅の (平方フィート、寝室の数、トイレ数、徒歩距離) フィーチャベクトルは $[600, 1, 1, 60]$ のようになります。ただし、ピッツバーグに住んでいる場合は $[3000, 4, 3, 10]$ のように見えるかもしれません。このような特徴ベクトルは、ほとんどの古典的な機械学習アルゴリズムにとって不可欠です。 
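+Perhaps the simplest supervised learning task to wrap your head around is *regression*. Consider, for example, a set of data harvested from a database of home sales. We might construct a table, in which each row corresponds to a different house, and each column corresponds to some relevant attribute, such as the square footage of a house, the number of bedrooms, the number of bathrooms, and the walking time to the center of town. In this dataset, each example would be a specific house, and the corresponding feature vector would be one row in the table. If you live in New York or San Francisco, and you are not the CEO of Amazon, Google, Microsoft, or Facebook, the (square footage, number of bedrooms, number of bathrooms, walking distance) feature vector for your home might look something like $[600, 1, 1, 60]$. However, if you live in Pittsburgh, it might look more like $[3000, 4, 3, 10]$. Fixed-length feature vectors like this are essential for most classic machine learning algorithms.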
 
-問題を回帰にするのは、実際にはアウトプットです。あなたが新しい家を求めて市場にいるとしましょう。上記のようないくつかの特徴を考慮して、住宅の公正市場価値を見積もることができます。販売価格を表すラベルは数値です。ラベルが任意の数値を取る場合、これを*回帰* 問題と呼びます。私たちの目標は、予測が実際のラベル値に近似するモデルを作成することです。 
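+What makes a problem a regression is actually the form of the target. Say that you are in the market for a new home. You might want to estimate the fair market value of a house, given some features such as those above. The data here might consist of historical home listings and the labels might be the observed sales prices. When labels take on arbitrary numerical values (even within some interval), we call this a *regression* problem. The goal is to produce a model whose predictions closely approximate the actual label values.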
 
-実際的な問題の多くは、よく説明された回帰問題です。ユーザーが映画に割り当てるレーティングを予測することは、回帰の問題と考えることができます。2009 年にこの偉業を達成するために優れたアルゴリズムを設計したなら、[1-million-dollar Netflix prize](https://en.wikipedia.org/wiki/Netflix_Prize) で優勝したかもしれません。入院中の患者の在留期間を予測することも、回帰の問題です。良い経験則は、どれくらいの量ですか？* または *いくつですか？* 問題は次のような回帰を示唆するはずです: 
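+Lots of practical problems are easily described as regression problems. Predicting the rating that a user will assign to a movie can be thought of as a regression problem, and if you had designed a great algorithm to accomplish this feat in 2009, you might have won the [1-million-dollar Netflix prize](https://en.wikipedia.org/wiki/Netflix_Prize). Predicting the length of stay for patients in the hospital is also a regression problem. A good rule of thumb is that any *how much?* or *how many?* problem should suggest regression, for example: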
 
-* この手術には何時間かかりますか？
-* この町は今後六時間でどれくらい降るでしょうか。
+* How many hours will this surgery take?
+* How much rainfall will this town have in the next six hours?
 
-機械学習を使ったことがなくても、おそらく回帰問題を非公式に解決したことがあるでしょう。たとえば、排水管を修理してもらい、請負業者が下水管からガンクを取り除くのに3時間を費やしたとします。それから彼はあなたに350ドルの請求書を送った。あなたの友人が同じ請負業者を2時間雇い、250ドルの請求書を受け取ったと想像してください。その後、誰かが今後のガンク除去請求書にどれだけ期待するかを尋ねた場合、労働時間が増えるとより多くの費用がかかるなど、いくつかの合理的な仮定をするかもしれません。また、ある程度の基本料金がかかり、請負業者が時間ごとに請求することを想定することもできます。これらの前提が成り立っていれば、この 2 つのデータ例を考えると、請負業者の料金体系をすでに特定できます。1 時間あたり 100 ドルと 50 ドルが自宅に現れます。それだけ従えば、線形回帰の背後にある高レベルの考え方をすでに理解していることでしょう。 
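+Even if you have never worked with machine learning before, you have probably worked through a regression problem informally. Imagine, for example, that you had your drains repaired and that your contractor spent 3 hours removing gunk from your sewage pipes. Then he sent you a bill of 350 dollars. Now imagine that your friend hired the same contractor for 2 hours and received a bill of 250 dollars. If someone then asked you how much to expect on their upcoming gunk-removal invoice, you might make some reasonable assumptions, such as more hours worked costs more dollars. You might also assume that there is some base charge and that the contractor then charges per hour. If these assumptions held true, then given these two data examples, you could already identify the contractor's pricing structure: 100 dollars per hour plus 50 dollars to show up at your house. If you followed that much, then you already understand the high-level idea behind linear regression.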
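
The contractor example amounts to solving two linear equations in two unknowns. A minimal sketch, assuming the pricing really is `bill = hourly_rate * hours + base_fee` (the variable and function names are illustrative):

```python
# Two observed invoices as (hours worked, amount billed in dollars).
(h1, bill1), (h2, bill2) = (3, 350), (2, 250)

# Subtracting the two equations eliminates the base fee.
hourly_rate = (bill1 - bill2) / (h1 - h2)
base_fee = bill1 - hourly_rate * h1


def predict_bill(hours):
    """Predicted invoice under the assumed linear pricing."""
    return hourly_rate * hours + base_fee
```

With these two invoices the sketch recovers the rate of 100 dollars per hour and the 50 dollar base fee stated above, so a 4-hour job would be predicted to cost `predict_bill(4)` dollars.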
 
-この場合、請負業者の価格と正確に一致するパラメータを生成できます。2 つの特徴量以外にもいくつかの要因による分散がある場合など、これが不可能な場合があります。このような場合、予測値と観測値の間の距離を最小にするモデルを学習します。ほとんどの章では、二乗誤差損失関数の最小化に焦点を当てます。後で説明するように、この損失は、データがガウスノイズによって破壊されたという仮定に相当します。 
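+In this case, we could produce the parameters that exactly matched the contractor's prices. Sometimes this is not possible, e.g., if some of the variation arises from factors beyond your two features. In these cases, we will try to learn models that minimize the distance between our predictions and the observed values. In most of our chapters, we will focus on minimizing the squared error loss function. As we will see later, this loss corresponds to the assumption that our data were corrupted by Gaussian noise.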
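
When no line passes through every observation, we can still pick the line with the smallest total squared error. The sketch below uses the standard closed-form least-squares solution for a single feature; the third invoice is a hypothetical value added only so that the points no longer lie on one line.

```python
# Hypothetical invoices as (hours worked, amount billed in dollars).
invoices = [(3, 350), (2, 250), (4, 470)]

n = len(invoices)
mean_hours = sum(h for h, _ in invoices) / n
mean_bill = sum(b for _, b in invoices) / n

# Closed-form least-squares fit of bill = hourly_rate * hours + base_fee.
hourly_rate = (
    sum((h - mean_hours) * (b - mean_bill) for h, b in invoices)
    / sum((h - mean_hours) ** 2 for h, _ in invoices)
)
base_fee = mean_bill - hourly_rate * mean_hours


def squared_error_total(rate, fee):
    """Sum of squared differences between predicted and observed bills."""
    return sum((rate * h + fee - b) ** 2 for h, b in invoices)
```

The fitted line does not match any invoice exactly, but its total squared error is lower than that of the line with rate 100 and fee 50, which fits the first two invoices perfectly and misses the third.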
 
 #### 分類
 
-回帰モデルは*いくつですか？*質問、多くの問題は、このテンプレートに快適に曲がっていません。たとえば、銀行がモバイルアプリに小切手スキャンを追加したいとします。これには、顧客がスマートフォンのカメラで小切手の写真を撮り、アプリが画像に表示された文字を自動的に認識できるようにする必要があります。具体的には、手書き文字を既知の文字の 1 つにマッピングするなど、手書き文字をさらに堅牢に理解する必要があります。こういうの、どっち？* 問題は*分類*と呼ばれます。多くの手法が引き継がれますが、回帰に使用されるアルゴリズムとは異なるアルゴリズムセットで扱われます。 
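+While regression models are great for addressing *how many?* questions, lots of problems do not fit comfortably in this template. Consider, for example, a bank that wants to develop a check scanning feature for its mobile app. Ideally, the customer would simply snap a photo of a check and the app would automatically recognize the text from the image. Assuming that we had some ability to segment out image patches corresponding to each handwritten character, then the primary remaining task would be to determine which character among some known set is depicted in each image patch. These kinds of *which one?* problems are called *classification* and require a different set of tools from those used for regression, although many techniques will carry over.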
 
-*classification* では、モデルが画像内のピクセル値などの特徴を調べ、いくつかの離散的なオプションセットの中で、どの*カテゴリ* (正式には*class*) が属するかを予測します。手書きの数字の場合、0 ～ 9 の数字に対応する 10 個のクラスがあります。分類の最も単純な形式は、クラスが2つしかない場合です。この問題を*バイナリ分類*と呼んでいます。たとえば、データセットは動物の画像で構成され、ラベルはクラス $\mathrm{\{cat, dog\}}$ とすることができます。回帰では数値を出力するリグレッサーを探しましたが、分類では予測されたクラス割り当てを出力する分類器を探します。 
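+In *classification*, we want our model to look at features, e.g., the pixel values in an image, and then predict to which *category* (sometimes called a *class*) among some discrete set of options an example belongs. For handwritten digits, we might have ten classes, corresponding to the digits 0 through 9. The simplest form of classification is when there are only two classes, a problem which we call *binary classification*. For example, our dataset could consist of images of animals and our labels might be the classes $\mathrm{\{cat, dog\}}$. Whereas in regression we sought a regressor to output a numerical value, in classification we seek a classifier, whose output is the predicted class assignment.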
 
-本書がより技術的になるにつれて説明する理由から、「cat」や「dog」など、ハードなカテゴリ割り当てのみを出力できるモデルを最適化するのは難しい場合があります。このような場合、通常、モデルを確率の言語で表現する方がはるかに簡単です。例の特性を考えると、このモデルは可能な各クラスに確率を割り当てます。クラスが $\mathrm{\{cat, dog\}}$ である動物分類の例に戻ると、分類器は画像を見て、その画像が猫である確率を 0.9 として出力することがあります。この数値は、分類器が画像が猫を描写していることを 90\% 確信していると解釈できます。予測されるクラスの確率の大きさは、不確実性の 1 つの概念を伝えます。それは不確実性の唯一の概念ではなく、より高度な章で他の概念について議論します。 
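+For reasons that we will get into as the book gets more technical, it can be difficult to optimize a model that can only output a hard categorical assignment, e.g., either "cat" or "dog". In these cases, it is usually much easier to express our model in the language of probabilities instead. Given features of an example, our model assigns a probability to each possible class. Returning to our animal classification example where the classes are $\mathrm{\{cat, dog\}}$, a classifier might see an image and output the probability that the image is a cat as 0.9. We can interpret this number by saying that the classifier is 90\% sure that the image depicts a cat. The magnitude of the probability for the predicted class conveys one notion of uncertainty. It is not the only one available, and we will discuss others in chapters dealing with more advanced topics.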
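
One common way to obtain such class probabilities is to pass real-valued scores through a softmax function. The passage above does not name this technique, so treat the sketch below as an assumption about how the probabilities might be produced; the scores are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # improves numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["cat", "dog"]
scores = [2.0, -0.2]  # hypothetical raw model outputs
probabilities = dict(zip(classes, softmax(scores)))

# A hard assignment can still be recovered by taking the most likely class.
predicted_class = max(probabilities, key=probabilities.get)
```

For these scores the probability assigned to "cat" comes out at roughly 0.9, matching the example in the text.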
 
-可能なクラスが3つ以上ある場合は、この問題を*マルチクラス分類*と呼びます。一般的な例としては、手書き文字認識 $\mathrm{\{0, 1, 2, ... 9, a, b, c, ...\}}$ などがあります。二乗誤差損失関数を最小化しようとして回帰問題を攻撃しましたが、分類問題に共通する損失関数は*cross-entropy* と呼ばれ、この名前は後続の章で情報理論の紹介を通してわかりやすく説明できます。 
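+When we have more than two possible classes, we call the problem *multiclass classification*. Common examples include handwritten character recognition $\mathrm{\{0, 1, 2, ... 9, a, b, c, ...\}}$. While we attacked regression problems by trying to minimize the squared error loss function, the common loss function for classification problems is called *cross-entropy*, whose name will be demystified when we introduce information theory in later chapters.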
 
-最も可能性の高いクラスは、必ずしも決定に使用するクラスではないことに注意してください。:numref:`fig_death_cap` のように、裏庭で美しいキノコを見つけたとします。 
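+Note that the most likely class is not necessarily the one that you are going to use for your decision. Assume that you find a beautiful mushroom in your backyard as shown in :numref:`fig_death_cap`.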
 
-![Death cap---do not eat!](../img/death-cap.jpg)
+![Death cap - do not eat!](../img/death-cap.jpg)
 :width:`200px`
 :label:`fig_death_cap`
 
-ここで、分類器を作成し、写真に基づいてキノコに毒があるかどうかを予測するように分類器をトレーニングしたとします。毒検出分類器が :numref:`fig_death_cap` にデスキャップが含まれる確率は 0.2 であると出力したとします。言い換えれば、分類器はキノコがデスキャップではないことを80％確信しています。それでも、それを食べるには馬鹿でなければならないでしょう。それは、おいしい夕食の特定の利益は、それで死ぬリスクを20％も受ける価値がないからです。言い換えれば、不確実なリスクの影響は利益をはるかに上回ります。したがって、損失関数として被る予想リスクを計算する必要があります。つまり、結果の確率にそれに関連する利益 (または害) を掛ける必要があります。この場合、キノコを食べることによる損失は$0.2 \times \infty + 0.8 \times 0 = \infty$になる可能性がありますが、廃棄の損失は$0.2 \times 0 + 0.8 \times 1 = 0.8$です。私たちの注意は正当化されました。真菌学者が言うように、:numref:`fig_death_cap`のキノコは実際には死のキャップです。 
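+Now, assume that you built a classifier and trained it to predict whether a mushroom is poisonous based on a photograph. Say our poison-detection classifier outputs that the probability that :numref:`fig_death_cap` shows a death cap is 0.2. In other words, the classifier is 80\% sure that our mushroom is not a death cap. Still, you would have to be a fool to eat it. That is because the certain benefit of a delicious dinner is not worth a 20\% risk of dying from it. In other words, the effect of the uncertain risk outweighs the benefit by far. Thus, in order to make a decision about whether to eat the mushroom, we need to compute the expected detriment associated with each action, which depends both on the likely outcomes and on the benefits or harms associated with each. In this case, the detriment incurred by eating the mushroom might be $0.2 \times \infty + 0.8 \times 0 = \infty$, whereas the loss of discarding it is $0.2 \times 0 + 0.8 \times 1 = 0.8$. Our caution was justified: as any mycologist would tell us, the mushroom in :numref:`fig_death_cap` is actually a death cap.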
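
The expected-loss calculation in the mushroom example can be reproduced directly. This is a minimal sketch; the dictionary layout and the name `expected_loss` are illustrative choices, while the probabilities and loss values are the ones used in the text.

```python
import math

p_death_cap = 0.2  # classifier's probability that the mushroom is a death cap

# Loss of each action under each outcome: (if death cap, if edible).
losses = {
    "eat": (math.inf, 0.0),  # dying is treated as an infinite loss
    "discard": (0.0, 1.0),   # throwing away an edible mushroom costs 1
}


def expected_loss(action):
    loss_if_poisonous, loss_if_edible = losses[action]
    return p_death_cap * loss_if_poisonous + (1 - p_death_cap) * loss_if_edible


# Choose the action with the smallest expected loss.
best_action = min(losses, key=expected_loss)
```

Even though "edible" is the more likely class, the action that minimizes expected loss is to discard the mushroom.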
 
-分類は、バイナリ分類、マルチクラス分類、マルチラベル分類よりもはるかに複雑になることがあります。たとえば、階層のアドレス指定には、分類のバリエーションがいくつかあります。階層は、多数のクラス間に何らかの関係が存在することを前提としています。したがって、すべての誤差が等しいわけではありません。誤りを犯す必要がある場合は、遠いクラスではなく、関連するクラスに誤分類したほうがよいでしょう。通常、これを*階層分類* と呼びます。初期の例の1つは、動物を階層的に編成した[Linnaeus](https://en.wikipedia.org/wiki/Carl_Linnaeus)によるものです。 
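+Classification can get much more complicated than just binary or multiclass classification. For instance, there are some variants of classification addressing hierarchically structured classes. In such cases not all errors are equal: if we must err, we might prefer to misclassify to a related class rather than a distant class. Usually, this is referred to as *hierarchical classification*. For inspiration, you might think of [Linnaeus](https://en.wikipedia.org/wiki/Carl_Linnaeus), who organized the animals in a hierarchy.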
 
-動物分類の場合、プードル（犬種）をシュナウザー（別の犬種）と間違えてもそれほど悪くないかもしれませんが、私たちのモデルはプードルを恐竜と混同すると大きなペナルティを払います。どの階層が関係するかは、モデルをどのように使用するかによって異なる場合があります。たとえば、ガラガラヘビとガーターヘビは系統樹の近くにいるかもしれませんが、ガラガラをガーターと間違えると致命的になる可能性があります。 
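+In the case of animal classification, it might not be so bad to mistake a poodle for a schnauzer, but our model would pay a huge penalty if it confused a poodle with a dinosaur. Which hierarchy is relevant might depend on how you plan to use the model. For example, rattlesnakes and garter snakes might be close on the phylogenetic tree, but mistaking a rattler for a garter could be fatal.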
 
-#### タギング
+#### Tagging
 
-一部の分類問題は、バイナリまたはマルチクラスの分類設定にきちんと適合します。たとえば、猫と犬を区別するために、通常のバイナリ分類器を学習させることができます。コンピュータビジョンの現状を考えると、市販のツールを使って簡単にこれを行うことができます。それでも、モデルがどれほど正確であっても、分類器が :numref:`fig_stackedanimals` に登場する4匹の動物が登場する人気のドイツのおとぎ話、*Town Musicians of Bremen* のイメージに遭遇すると、問題が発生する可能性があります。 
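+Some classification problems fit neatly into the binary or multiclass classification setups. For example, we could train a normal binary classifier to distinguish cats from dogs. Given the current state of computer vision, we can do this easily, with off-the-shelf tools. Nonetheless, no matter how accurate our model gets, we might find ourselves in trouble when the classifier encounters an image of the *Town Musicians of Bremen*, a popular German fairy tale featuring four animals (:numref:`fig_stackedanimals`).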
 
 ![A donkey, a dog, a cat, and a rooster.](../img/stackedanimals.png)
 :width:`300px`
 :label:`fig_stackedanimals`
 
-ご覧のとおり、:numref:`fig_stackedanimals`には猫がいて、オンドリ、犬、ロバがいて、木が背景にあります。最終的にモデルで何をしたいのかによって、これを二項分類問題として扱うのはあまり意味がないかもしれません。代わりに、画像が猫、犬、ロバを描いていると言うオプションをモデルに与えたいと思うかもしれません。
-*と*オンドリ。
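+As you can see, the photo features a cat, a rooster, a dog, and a donkey, with some trees in the background. If we anticipate encountering such images, multiclass classification might not be the right problem formulation. Instead, we might want to give the model the option of saying the image depicts a cat, a dog, a donkey,
+*and* a rooster.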
 
-相互に排他的でないクラスの予測を学習する問題は、*マルチラベル分類* と呼ばれます。自動タグ付けの問題は、通常、マルチラベル分類の問題として最もよく説明されます。「機械学習」、「テクノロジー」、「ガジェット」、「プログラミング言語」、「Linux」、「クラウドコンピューティング」、「AWS」など、技術ブログの投稿に適用される可能性のあるタグを考えてみてください。一般的な記事には、5 ～ 10 個のタグが適用されている場合があります。これは、これらの概念が相互に関連しているためです。「クラウドコンピューティング」に関する記事には「AWS」と記載されることが多く、「機械学習」に関する投稿では「プログラミング言語」も扱われる可能性があります。 
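+The problem of learning to predict classes that are not mutually exclusive is called *multi-label classification*. Auto-tagging problems are typically best described as multi-label classification problems. Think of the tags people might apply to posts on a technical blog, e.g., "machine learning", "technology", "gadgets", "programming languages", "Linux", "cloud computing", "AWS". A typical article might have 5 to 10 tags applied. Typically, tags will exhibit some correlation structure. Posts about "cloud computing" are likely to mention "AWS" and posts about "machine learning" are likely to mention "GPUs".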
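
Because the tags are not mutually exclusive, a multi-label predictor can accept or reject each tag on its own. The sketch below assumes a model that outputs one independent probability per tag and applies a fixed threshold; both the probabilities and the 0.5 threshold are hypothetical.

```python
# Hypothetical per-tag probabilities produced by some trained model.
tag_probabilities = {
    "machine learning": 0.92,
    "GPU": 0.81,
    "cloud computing": 0.40,
    "AWS": 0.35,
    "Linux": 0.07,
}


def predict_tags(probabilities, threshold=0.5):
    """Keep every tag whose probability clears the threshold."""
    return sorted(tag for tag, p in probabilities.items() if p >= threshold)


tags = predict_tags(tag_probabilities)
```

Unlike multiclass classification, the probabilities here need not sum to 1, and any number of tags (including none) can be returned.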
 
-生物医学文献を扱う際には、このような問題にも対処しなければなりません。研究者が文献を網羅的にレビューすることができるため、論文に正しくタグ付けすることが重要となります。国立医学図書館では、PubMedで索引付けされた各記事を、約28000個のタグのコレクションであるMeSHの関連用語に関連付けるために、多くのプロの注釈者が調べています。これは時間のかかるプロセスであり、アノテータには通常、アーカイブとタグ付けの間隔が 1 年あります。ここでは、機械学習を使用して、各記事が適切に手動でレビューされるまで暫定的なタグを提供できます。実際、数年間、BioASQ組織はこれを正確に行うための[hosted competitions](http://bioasq.org/)を持っています。 
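+Sometimes such tagging problems involve enormous label sets. The National Library of Medicine employs many professional annotators who associate each article to be indexed in PubMed with a set of tags drawn from the Medical Subject Headings (MeSH) ontology, a collection of roughly 28,000 tags. Correctly tagging articles is important because it allows researchers to conduct exhaustive reviews of the literature. This is a time-consuming process and the annotators typically have a one-year lag between archiving and tagging. Machine learning can provide provisional tags until each article can have a proper manual review. Indeed, for several years, the BioASQ organization has [hosted competitions](http://bioasq.org/) for this task.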
 
-#### サーチ 
+#### Search
 
-各例をバケットや実際の値に割り当てるだけではない場合もあります。情報検索の分野では、一連の項目にランキングを課したいと考えています。ウェブ検索を例に挙げてみましょう。目標は、特定のページがクエリに関連しているかどうかを判断することではなく、大量の検索結果の中で、特定のユーザーに最も関連のあるページを特定することです。私たちは関連する検索結果の順序を重視しており、学習アルゴリズムはより大きなセットから順序付けられた要素のサブセットを生成する必要があります。つまり、アルファベットから最初の5文字を生成するように求められた場合、「A B C D E」と「C A B E D」を返すことには違いがあります。結果セットが同じであっても、セット内の順序付けは重要です。 
-
-この問題の解決策の 1 つは、まずセット内のすべての要素に対応する関連性スコアを割り当て、次に評価の高い要素を取得することです。[PageRank](https://en.wikipedia.org/wiki/PageRank)、Google検索エンジンの背後にある元の秘密のソースは、このようなスコアリングシステムの初期の例でしたが、それがそうであったという点で独特でした実際のクエリには依存しません。ここでは、単純な関連性フィルターを使用して関連アイテムのセットを特定し、PageRank を使用してクエリ用語を含む結果を並べ替えました。現在、検索エンジンは機械学習と行動モデルを使用して、クエリに依存する関連性スコアを取得しています。このテーマに特化した学会全体があります。 
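+In the field of information retrieval, we often impose ranks on sets of items. Take web search for example. The goal is less to determine *whether* a particular page is relevant for a query, and more to determine which, among a set of relevant results, should be shown most prominently to a particular user. One possible solution might be to first assign a score to every element in the set and then to retrieve the top-rated elements. [PageRank](https://en.wikipedia.org/wiki/PageRank), the original secret sauce behind the Google search engine, was an early example of such a scoring system. Peculiarly, the scoring provided by PageRank did not depend on the actual query. Instead, Google relied on a simple relevance filter to identify the set of relevant candidates and then used PageRank to prioritize the more authoritative pages. Nowadays, search engines use machine learning and behavioral models to obtain query-dependent relevance scores. There are entire academic conferences devoted to this subject.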
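
The score-then-retrieve approach can be sketched in a few lines. The page names and relevance scores below are hypothetical, and the helper `rank` is an illustrative name; how the scores are produced is left open.

```python
def rank(pages, scores, k=3):
    """Return the k pages with the highest scores, best first."""
    return sorted(pages, key=lambda page: scores[page], reverse=True)[:k]


# Hypothetical query-dependent relevance scores.
relevance = {"page_a": 0.2, "page_b": 0.9, "page_c": 0.5, "page_d": 0.7}
top_results = rank(list(relevance), relevance, k=3)
```

The output is an ordered list, so two result sets with the same pages in a different order count as different rankings.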
 
 #### レコメンダーシステム
 :label:`subsec_recommender_systems`
 
-レコメンダーシステムは、検索とランキングに関連するもう 1 つの問題設定です。関連する一連の項目をユーザーに表示することが目的である限り、問題は同様です。主な違いは、
-*パーソナライゼーション*
-レコメンダーシステムのコンテキストで特定のユーザーに。たとえば、映画のおすすめの場合、SFファンの結果ページとピーターセラーズのコメディーの愛好家の結果ページは大きく異なる場合があります。小売商品、音楽、ニュースレコメンデーションなど、他のレコメンデーション設定でも同様の問題がポップアップします。 
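+Recommender systems are another problem setting that is related to search and ranking. The problems are similar insofar as the goal is to display a set of relevant items to the user. The main difference is the emphasis on *personalization* to specific users in the context of recommender systems. For instance, for movie recommendations, the results page for a science fiction fan and the results page for a connoisseur of Peter Sellers comedies might differ significantly. Similar problems pop up in other recommendation settings, e.g., for retail products, music, and news recommendation.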
 
-場合によっては、購入者が特定の商品がどの程度気に入ったかを伝える明示的なフィードバック（Amazon、IMDb、Goodreadsでの商品評価やレビューなど）を提供することがあります。また、プレイリストのタイトルをスキップするなど、暗黙的なフィードバックを提供する場合もあります。これは不満を示しているかもしれませんが、その曲が文脈上不適切であることを示している可能性があります。最も単純な定式化では、これらのシステムは、ユーザーとアイテムが与えられた場合に、推定評価や購入確率などのスコアを推定するようにトレーニングされています。 
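+In some cases, customers provide explicit feedback, communicating how much they liked a particular product (e.g., the product ratings and reviews on Amazon, IMDb, and Goodreads). In other cases, they provide implicit feedback, e.g., by skipping titles on a playlist, which might indicate dissatisfaction or might just indicate that the song was inappropriate in context. In the simplest formulations, these systems are trained to estimate some score, such as an expected star rating or the probability that a given user will purchase a particular item.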
 
-このようなモデルがあれば、どのユーザーに対しても、スコアが最も高いオブジェクトのセットを取得して、それをユーザーに推奨することができます。プロダクションシステムはかなり高度で、このようなスコアを計算する際には、詳細なユーザーアクティビティとアイテム特性が考慮されます。:numref:`fig_deeplearning_amazon` は、好みに合わせて調整されたパーソナライゼーションアルゴリズムに基づいて Amazon が推奨するディープラーニングブックの例です。 
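+Given such a model, for any given user, we could retrieve the set of objects with the largest scores, which could then be recommended to the user. Production systems are considerably more advanced and take detailed user activity and item characteristics into account when computing such scores. :numref:`fig_deeplearning_amazon` displays the deep learning books recommended by Amazon based on personalization algorithms tuned to capture Aston's preferences.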
 
 ![Deep learning books recommended by Amazon.](../img/deeplearning-amazon.jpg)
 :label:`fig_deeplearning_amazon`
 
-その莫大な経済的価値にもかかわらず、予測モデルの上に単純に構築されたレコメンデーションシステムには、いくつかの重大な概念上の欠陥があります。まず、*検閲されたフィードバック*のみを観察します。ユーザーは自分が強く感じている映画を優先的に評価します。たとえば、5 段階評価では、項目に 5 つ星と 1 つ星の評価が多く、3 つ星の評価が著しく少ないことに気付く場合があります。さらに、現在の購入習慣は、現在導入されているレコメンデーションアルゴリズムの結果であることが多いですが、学習アルゴリズムでは必ずしもこの詳細が考慮されるわけではありません。したがって、レコメンダーシステムが優先的にアイテムをプッシュし、（購入数が多いために）より良くなるようになり、ひいてはより頻繁にレコメンデーションされるというフィードバックループが形成される可能性があります。打ち切り、インセンティブ、フィードバックループへの対処方法に関するこれらの問題の多くは、未解決の重要な研究課題です。 
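+Despite their tremendous economic value, recommender systems naively built on top of predictive models suffer some serious conceptual flaws. To start, we only observe *censored feedback*: users preferentially rate movies that they feel strongly about. For example, on a five-point scale, you might notice that items receive many one- and five-star ratings but that there are conspicuously few three-star ratings. Moreover, current purchase habits are often a result of the recommendation algorithm currently in place, but learning algorithms do not always take this detail into account. Thus it is possible for feedback loops to form where a recommender system preferentially pushes an item that is then taken to be better (due to greater purchases) and in turn is recommended even more frequently. Many of these problems, about how to deal with censoring, incentives, and feedback loops, are important open research questions.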
 
 #### シーケンス学習
 
-これまで、入力数が固定され、出力数が固定されている問題を見てきました。たとえば、平方フィート、寝室の数、バスルームの数、市街地までの徒歩時間など、固定されたフィーチャセットから住宅価格を予測することを検討しました。また、(固定次元の) 画像から、固定数のクラスに属する予測確率へのマッピング、またはユーザー ID と製品 ID の取得と星評価の予測についても説明しました。このような場合、固定長の入力をモデルに入力して出力を生成すると、モデルは直ちに見たものを忘れてしまいます。 
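+So far, we have looked at problems where we have some fixed number of inputs and produce a fixed number of outputs. For example, we considered predicting house prices given a fixed set of features: square footage, number of bedrooms, number of bathrooms, and the transit time to downtown. We also discussed mapping from an image (of fixed dimension) to the predicted probabilities that it belongs to each among a fixed number of classes, and predicting star ratings associated with purchases based on the user ID and product ID alone. In these cases, once our model is trained, after each test example is fed into our model, it is immediately forgotten. We assumed that successive observations were independent and thus there was no need to hold on to this context.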
 
-これは、入力が本当にすべて同じ次元を持ち、連続する入力が本当に関係がない場合は問題ないかもしれません。しかし、ビデオスニペットをどのように扱うのでしょうか？この場合、各スニペットは異なるフレーム数で構成されることがあります。また、前のフレームまたは次のフレームを考慮すると、各フレームで何が起こっているかを推測する方がはるかに強くなる可能性があります。言語についても同じことが言えます。ディープラーニングの一般的な問題の 1 つに機械翻訳があります。機械翻訳とは、あるソース言語の文章を取り込み、別の言語での翻訳を予測する作業です。 
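+But how should we deal with video snippets? In this case, each snippet might consist of a different number of frames. And our guess of what is going on in each frame might be much stronger if we take into account the previous or succeeding frames. The same goes for language. For example, one popular deep learning problem is machine translation: the task of ingesting sentences in some source language and predicting their translations in another language.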
 
-これらの問題は医学でも起こります。集中治療室の患者を監視し、24 時間以内に患者が死亡するリスクがある閾値を超えた場合にアラートを発するモデルが必要になる場合があります。私たちは、このモデルが患者の病歴について知っているすべてのものを1時間ごとに捨てて、最新の測定値に基づいて予測することを絶対に望んでいないでしょう。 
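+Such problems also occur in medicine. We might want a model to monitor patients in the intensive care unit and to fire off alerts whenever their risk of dying in the next 24 hours exceeds some threshold. Here, we would not want to throw away everything that we know about the patient's history every hour and make predictions based only on the most recent measurements.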
 
-これらの問題は、機械学習の最も興味深い応用例であり、*シーケンス学習*の例です。入力シーケンスを取り込むか、出力シーケンスを出力する (あるいはその両方) モデルを必要とします。具体的には、
-*シーケンスからシーケンスへの学習*は問題を考慮する
-ここで、入力と出力はどちらも可変長のシーケンスで、機械翻訳や話し言葉からのテキストの文字起こしなどです。すべてのタイプのシーケンス変換を考慮することは不可能ですが、以下の特殊なケースについて言及する価値があります。 
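+These problems are among the most exciting applications of machine learning and they are instances of *sequence learning*. They require a model either to ingest sequences of inputs or to emit sequences of outputs (or both). Specifically, *sequence-to-sequence learning* considers problems where both inputs and outputs consist of variable-length sequences. Examples include machine translation and speech-to-text transcription. While it is impossible to consider all types of sequence transformations, the following special cases are worth mentioning.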
 
-**タグ付けと構文解析**。これには、テキストシーケンスに属性による注釈を付けることが含まれます。
-つまり、入力と出力の数は本質的に同じです。例えば、動詞と主語がどこにあるのか知りたいかもしれません。あるいは、どの単語が名前付き実体であるかを知りたいかもしれません。一般的には、構造的および文法的な仮定に基づいてテキストを分解して注釈を付けて、何らかの注釈を得ることが目的です。これは実際よりも複雑に聞こえます。以下は、どの単語が名前付き実体 (「Ent」とタグ付けされている) を参照しているかを示すタグで文に注釈を付ける非常に簡単な例です。
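+**Tagging and Parsing**.
+This involves annotating a text sequence with attributes. Here, the inputs and outputs are *aligned*, i.e., they are of the same number and occur in a corresponding order. For instance, in *part-of-speech (PoS) tagging*, we annotate every word in a sentence with the corresponding part of speech, e.g., "noun" or "direct object". Alternatively, we might want to know which groups of contiguous words refer to named entities, like *people*, *places*, or *organizations*. In the cartoonishly simple example below, we might just want to indicate, for every word in a sentence, whether it is part of a named entity (tagged as "Ent").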
 
 ```text
 Tom has dinner in Washington with Sally
 Ent  -    -    -     Ent      -    Ent
 ```
 
-**自動音声認識**。音声認識では、入力シーケンス
-は話者の音声録音 (:numref:`fig_speech`) で、出力は発言者の発言をテキストで記録したものです。問題は、テキストよりも多くのオーディオフレーム (サウンドは通常 8kHz または 16kHz でサンプリング) があることです。つまり、何千ものサンプルが 1 つの話し言葉に相当する可能性があるため、オーディオとテキストの間に 1:1 の対応関係がないことです。これらは、出力が入力よりもはるかに短い、シーケンス間学習の問題です。 
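+**Automatic Speech Recognition**.
+With speech recognition, the input sequence is an audio recording of a speaker (:numref:`fig_speech`), and the output is a transcript of what the speaker said. The challenge is that there are many more audio frames (sound is typically sampled at 8kHz or 16kHz) than text, i.e., there is no 1:1 correspondence between audio and text, since thousands of samples may correspond to a single spoken word. These are sequence-to-sequence learning problems, where the output is much shorter than the input.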
 
 ![`-D-e-e-p- L-ea-r-ni-ng-` in an audio recording.](../img/speech.png)
 :width:`700px`
 :label:`fig_speech`
 
-**テキスト読み上げ**。これは自動音声認識の逆です。
-つまり、入力はテキストで、出力はオーディオファイルです。この場合、出力は入力よりもずっと長くなります。人間が悪いオーディオファイルを認識するのは簡単ですが、これはコンピュータにとってそれほど些細なことではありません。 
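+**Text to Speech**.
+This is the inverse of automatic speech recognition. Here, the input is text and the output is an audio file. In this case, the output is much longer than the input. While humans are remarkably good at recognizing speech, even from low-quality audio, getting computers to perform the same feat is a formidable challenge.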
 
-**機械翻訳**。音声認識の場合とは異なり、対応する場合
-入力と出力は同じ順序 (アライメント後) で行われるため、機械翻訳では順序の反転が不可欠です。つまり、あるシーケンスを別のシーケンスに変換している間は、入力と出力の数も、対応するデータ例の順序も同じであるとは想定されません。ドイツ人が動詞を文末に置くという独特の傾向を示す次の例を考えてみましょう。
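+**Machine Translation**.
+Unlike the case of speech recognition, where corresponding inputs and outputs occur in the same order, in machine translation, unaligned data poses a new challenge. Here the input and output sequences can have different lengths, and the corresponding regions of the respective sequences may appear in a different order. Consider the following illustrative example of the peculiar tendency of Germans to place the verbs at the end of sentences: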
 
 ```text
 German:           Haben Sie sich schon dieses grossartige Lehrwerk angeschaut?
@@ -231,181 +220,171 @@ English:          Did you already check out this excellent tutorial?
 Wrong alignment:  Did you yourself already this excellent tutorial looked-at?
 ```
 
-関連する多くの問題が他の学習タスクに現れます。たとえば、ユーザーが Web ページを読む順序を決定することは、2 次元のレイアウト解析の問題です。対話の問題は、あらゆる種類の追加の複雑さを示します。次に何を言うべきかを決定するには、現実世界の知識と長い時間的距離にわたる会話の以前の状態を考慮する必要があります。これらは活発な研究分野です。 
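+Many related problems pop up in other learning tasks. For instance, determining the order in which a user reads a webpage is a two-dimensional layout analysis problem. Dialogue problems exhibit all kinds of additional complications, where determining what to say next requires taking into account real-world knowledge and the prior state of the conversation across long temporal distances. Such topics are active areas of research.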
 
 ### 教師なし学習と自己教師あり学習
 
-これまでの例はすべて、教師あり学習、つまり、特徴量と対応するラベル値の両方を含む巨大なデータセットをモデルに供給する状況に関連していました。教師付き学習者は、非常に専門的な仕事と非常に平凡な上司を持っていると考えることができます。上司はあなたの肩の上に立ち、状況から行動へのマッピングを学ぶまで、あらゆる状況で何をすべきかを正確に伝えます。そのような上司のために働くことはかなり下手に聞こえます。一方、この上司を喜ばせるのは簡単です。できるだけ早くパターンを認識し、その行動を模倣するだけです。 
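+The previous examples focused on supervised learning, where we feed the model a giant dataset containing both the features and corresponding label values. You could think of the supervised learner as having an extremely specialized job and an extremely dictatorial boss. The boss stands over the learner's shoulder and tells them exactly what to do in every situation until they learn to map from situations to actions. Working for such a boss sounds pretty lame. On the other hand, pleasing such a boss is pretty easy. You just recognize the pattern as quickly as possible and imitate the boss's actions.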
 
-まったく逆に、あなたが何をしてほしいのか分からない上司のために働くのはイライラするかもしれません。ただし、データサイエンティストになる予定がある場合は、慣れたほうがよいでしょう。上司は巨大なデータを手渡して、それを使ってデータサイエンスをやるように言うかもしれません！* これは曖昧に聞こえます。私たちはこの種の問題を「教師なし学習」と呼んでおり、私たちが尋ねることができる質問の種類と数は、私たちの創造性によってのみ制限されます。教師なし学習手法については、後の章で説明します。今のあなたの食欲を刺激するために、私たちはあなたが尋ねるかもしれない以下の質問のいくつかを説明します。 
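+Considering the opposite situation, it could be frustrating to work for a boss who has no idea what they want you to do. However, if you plan to be a data scientist, you had better get used to it. The boss might just hand you a giant dump of data and tell you to *do some data science with it!* This sounds vague. We call this class of problems *unsupervised learning*, and the type and number of questions we can ask is limited only by our creativity. We will address unsupervised learning techniques in later chapters. To whet your appetite for now, we describe a few of the questions you might ask.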
 
 * 少数のプロトタイプを見つけることはできますか
-データを正確に要約したのか？写真のセットがあれば、風景写真、犬、赤ちゃん、猫、山頂の写真にグループ化できますか？同様に、ユーザーのブラウジングアクティビティを集めた場合、同様の行動を持つユーザーにグループ化できますか？この問題は、通常「*クラスタリング*」と呼ばれます。
-* 少数のパラメータを見つけられますか
-データの関連特性を正確に捉えているのですか？ボールの軌跡は、ボールの速度、直径、質量によって非常によく表されます。テーラーは、衣服のフィッティングを目的として、人体の形状をかなり正確に記述する少数のパラメータを開発しました。これらの問題を*部分空間推定* と呼びます。依存性が線形の場合は、*主成分分析* と呼ばれます。
-* (任意に構造化された) オブジェクトの表現はありますか
-ユークリッド空間でシンボリック特性がよく一致するようにしますか？これは、「ローマ」$-$「イタリア」$+$「フランス」$=$「パリ」のように、エンティティとその関係を記述するために使用できます。
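+that accurately summarize the data? Given a set of photos, can we group them into landscape photos, pictures of dogs, babies, cats, and mountain peaks? Likewise, given a collection of users' browsing activities, can we group them into users with similar behavior? This problem is typically known as *clustering*.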
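+* Can we find a small number of parameters
+that accurately capture the relevant properties of the data? The trajectories of a ball are well described by the velocity, diameter, and mass of the ball. Tailors have developed a small number of parameters that describe human body shape fairly accurately for the purpose of fitting clothes. These problems are referred to as *subspace estimation*. If the dependence is linear, it is called *principal component analysis*.
+* Is there a representation of (arbitrarily structured) objects
+in Euclidean space such that symbolic properties can be well matched? This can be used to describe entities and their relations, such as "Rome" $-$ "Italy" $+$ "France" $=$ "Paris".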
 * 根本原因の説明はありますか
-私たちが観察したデータの多くの？たとえば、住宅価格、汚染、犯罪、場所、教育、給与に関する人口統計データがある場合、経験的データに基づいてそれらがどのように関連しているかを知ることはできますか？*causality* と*確率的グラフィカルモデル* に関係するフィールドは、この問題に対処します。
-* 教師なし学習におけるもう一つの重要でエキサイティングな最近の進展
-*生成的敵対的ネットワーク*の出現です。これにより、画像やオーディオなどの複雑な構造化データも含めて、手続き型の方法でデータを合成できます。基礎となる統計メカニズムは、実データと偽データが同じかどうかを調べる検定です。 
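+of much of the data that we observe? For instance, if we have demographic data about house prices, pollution, crime, location, education, and salaries, can we discover how they are related simply based on empirical data? The fields concerned with *causality* and
+*probabilistic graphical models* tackle such questions.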
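+* Another important and exciting recent development in unsupervised learning
+is the advent of *deep generative models*. These models estimate the density of the data $p(\mathbf{x})$, either explicitly or *implicitly*. Once trained, we can use a generative model either to score examples according to how likely they are, or to sample synthetic examples from the learned distribution. Early deep learning breakthroughs in generative modeling came with the invention of *variational autoencoders* :cite:`Kingma.Welling.2014` and continued with the development of *generative adversarial networks* :cite:`Goodfellow.Pouget-Abadie.Mirza.ea.2014`. More recent advances include normalizing flows, diffusion models, and score-based models.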
 
-教師なし学習の一形態として、
-*自己教師あり学習*
-は、ラベル付けされていないデータを活用して、他の部分を使用してデータの一部の保留部分を予測するなど、トレーニングの監視を提供します。テキストについては、ラベル付けの手間をかけずに、ビッグコーパスで周囲の単語 (コンテキスト) を使用してランダムにマスクされた単語を予測することで、「空白を埋める」ようにモデルをトレーニングできます。:cite:`Devlin.Chang.Lee.ea.2018`!イメージの場合、同じイメージ :cite:`Doersch.Gupta.Efros.2015` の 2 つの切り取られた領域間の相対的な位置を知るようにモデルをトレーニングすることがあります。これら 2 つの自己教師あり学習の例では、考えられる単語と相対位置を予測する学習モデルはどちらも (教師あり学習による) 分類タスクです。 
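+A major development in unsupervised learning has been the rise of *self-supervised learning*, techniques that leverage some aspect of the unlabeled data to provide supervision. For text, we can train models to "fill in the blanks" by predicting randomly masked words using their surrounding words (contexts) in big corpora without any labeling effort :cite:`Devlin.Chang.Lee.ea.2018`! For images, we may train models to tell the relative position between two cropped regions of the same image :cite:`Doersch.Gupta.Efros.2015`, to predict an occluded part of an image based on the remaining portions of the image, or to predict whether two examples are perturbed versions of the same underlying image. Self-supervised models often learn representations that are subsequently leveraged by fine-tuning the resulting models on some downstream task of interest.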
 
 ### 環境とのやりとり
 
-これまでのところ、データが実際にどこから来たのか、機械学習モデルが出力を生成すると実際に何が起こるのかについては説明していません。これは、教師あり学習と教師なし学習ではこれらの問題にあまり洗練された方法で対処できないためです。いずれにせよ、私たちは大量のデータを前もって取得し、環境と二度と相互作用することなくパターン認識マシンを動かします。すべての学習はアルゴリズムが環境から切り離された後に行われるため、「オフライン学習」と呼ばれることもあります。教師あり学習の場合、環境からのデータ収集を考慮したプロセスは :numref:`fig_data_collection` のようになります。 
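+So far, we have not discussed where data actually comes from, or what actually happens when a machine learning model generates an output. That is because supervised learning and unsupervised learning do not address these issues in a very sophisticated way. In each case, we grab a big pile of data upfront, then set our pattern recognition machines in motion without ever interacting with the environment again. Because all of the learning takes place after the algorithm is disconnected from the environment, this is sometimes called *offline learning*. For example, supervised learning assumes the simple interaction pattern depicted in :numref:`fig_data_collection`.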
 
 ![Collecting data for supervised learning from an environment.](../img/data-collection.svg)
 :label:`fig_data_collection`
 
-オフライン学習のシンプルさには魅力があります。利点は、これらの他の問題から気を散らすことなく、パターン認識を単独で心配できることです。しかし、欠点は、問題の定式化が非常に限定的であることです。あなたがもっと野心的であるか、アシモフのロボットシリーズを読んで育ったなら、予測を行うだけでなく、世界で行動を起こすことができる人工知能ボットを想像するかもしれません。予測モデルだけでなく、インテリジェントな「エージェント」についても考えたいと考えています。つまり、予測をするだけではなく、*actions*を選ぶことを考える必要があるということです。さらに、予測とは異なり、行動は実際には環境に影響を与えます。インテリジェントエージェントをトレーニングする場合、そのアクションがエージェントの将来の観測にどのように影響するかを考慮する必要があります。 
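+This simplicity of offline learning has its charms. The upside is that we can worry about pattern recognition in isolation, with no concern about complications arising from interactions with a dynamic environment. But this problem formulation is limiting. If you grew up reading Asimov's Robot novels, then you probably picture artificially intelligent agents capable not only of making predictions, but also of taking actions in the world. We want to think about intelligent *agents*, not just predictive models. This means that we need to think about choosing *actions*, not just making predictions. In contrast to mere predictions, actions actually impact the environment. If we want to train an intelligent agent, we must account for the way its actions might impact the future observations of the agent.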
 
-環境との相互作用を考慮すると、モデリングに関する新しい疑問が生まれます。以下はほんの一例です。 
+Considering the interaction with an environment opens a whole set of new modeling questions. The following are just a few examples.
 
-* 環境は私たちが以前に行ったことを記憶していますか？
-* ユーザーが音声認識機能にテキストを読み込むなど、環境は私たちを助けたいと思っていますか？
-* 環境は私たちを打ち負かしたいですか？つまり、スパムフィルタリング（スパマーに対する）やゲーム（対戦相手に対して）をプレイするような敵対的な設定ですか？
-* 環境は気にしないのですか？
-* 環境には変化するダイナミクスがありますか？たとえば、将来のデータは常に過去と似ているのか、それともパターンが時間とともに自然に変化するのか、それとも自動化ツールに応じて変化するのか？
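+* Does the environment remember what we did previously?
+* Does the environment want to help us, e.g., a user reading text into a speech recognizer?
+* Does the environment want to beat us, e.g., spammers altering their emails to evade spam filters?
+* Does the environment have shifting dynamics? For example, does future data always resemble the past, or do the patterns change over time, either naturally or in response to our automated tools?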
 
-この最後の質問は、学習データとテストデータが異なる場合、*分布シフト*という問題を提起します。それは私たちのほとんどが講師が書いた試験を受けるときに経験した問題ですが、宿題は彼のティーチングアシスタントによって構成されていました。次に、環境との相互作用を明示的に考慮した設定である強化学習について簡単に説明します。 
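+These questions raise the problem of *distribution shift*, where training and test data are different. Most of us have experienced this problem when taking exams written by a lecturer, while the homework was composed by their teaching assistants. Next, we briefly describe reinforcement learning, a rich framework for posing learning problems in which an agent interacts with an environment.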
 
 ### 強化学習
 
-機械学習を使用して、環境と対話してアクションを実行するエージェントを開発することに興味がある場合は、おそらく*強化学習*に集中することになるでしょう。これには、ロボット工学、対話システム、ビデオゲーム用の人工知能 (AI) の開発への応用も含まれます。
-*深層強化学習*、適用される
-ディープラーニングから強化学習問題まで、人気が急上昇しています。視覚入力のみでアタリの試合で人間を打ち負かした画期的なディープQネットワークと、ボードゲームGoで世界チャンピオンを失ったAlphaGoプログラムなどが代表的な例だ。 
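+If you are interested in using machine learning to develop an agent that interacts with an environment and takes actions, then you are probably going to wind up focusing on *reinforcement learning*. This might include applications to robotics, to dialogue systems, and even to developing artificial intelligence (AI) for video games.
+*Deep reinforcement learning*, which applies
+deep learning to reinforcement learning problems, has surged in popularity. The breakthrough deep Q-network, which beat humans at Atari games using only the visual input :cite:`mnih2015human`, and the AlphaGo program, which dethroned the world champion at the board game Go :cite:`Silver.Huang.Maddison.ea.2016`, are two prominent examples.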
 
-強化学習は、エージェントが一連のタイムステップで環境と対話するという、非常に一般的な問題の説明を提供します。各タイムステップで、エージェントは環境から何らかの*観測*を受け取り、*アクション*を選択しなければならず、その後、何らかのメカニズム (アクチュエータとも呼ばれる) を介して環境に送り返されます。最後に、エージェントは環境から報酬を受け取ります。このプロセスは :numref:`fig_rl-environment` で説明されています。その後、エージェントは後続の観測値を受け取り、後続のアクションを選択します。強化学習エージェントの動作はポリシーによって管理されます。つまり、*policy* は、環境の観察から行動にマッピングする関数にすぎません。強化学習の目標は、良い政策を生み出すことです。 
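+Reinforcement learning gives a very general statement of a problem in which an agent interacts with an environment over a series of time steps. At each time step, the agent receives some *observation* from the environment and must choose an *action* that is subsequently transmitted back to the environment via some mechanism (sometimes called an *actuator*). Finally, the agent receives a reward from the environment. This process is illustrated in :numref:`fig_rl-environment`. The agent then receives a subsequent observation and chooses a subsequent action, and so on. The behavior of a reinforcement learning agent is governed by a *policy*. In brief, a *policy* is just a function that maps from observations of the environment to actions. The goal of reinforcement learning is to produce good policies.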
 
 ![The interaction between reinforcement learning and an environment.](../img/rl-environment.svg)
 :label:`fig_rl-environment`
 
-強化学習の枠組みの一般性を誇張するのは難しい。たとえば、教師あり学習の問題を強化学習問題としてキャストできます。分類の問題があったとしましょう。各クラスに対応する 1 つのアクションを持つ強化学習エージェントを作成できました。そこで、元の教師あり学習問題の損失関数とまったく同じ報酬を与える環境を作ることができました。 
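+It is hard to overstate the generality of the reinforcement learning framework. For example, supervised learning can be recast as reinforcement learning. Say we had a classification problem. We could create a reinforcement learning agent with one action corresponding to each class. We could then create an environment which gave a reward that was exactly equal to the loss function from the original supervised learning problem.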
 
-そうは言っても、強化学習は教師あり学習では不可能な多くの問題にも対処できます。たとえば、教師あり学習では、学習入力が正しいラベルに関連付けられていることが常に想定されます。しかし、強化学習では、観測ごとに環境が最適な行動を教えてくれるとは想定していません。一般的に、私たちはいくらかの報酬を得るだけです。さらに、環境はどの行動が報酬につながったのかさえ教えてくれないかもしれません。 
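+That said, reinforcement learning can also address many problems that supervised learning cannot. For example, in supervised learning, we always expect that the training input comes associated with the correct label. But in reinforcement learning, we do not assume that, for each observation, the environment tells us the optimal action. In general, we just get some reward. Moreover, the environment may not even tell us which actions led to the reward.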
 
-たとえば、チェスのゲームを考えてみましょう。唯一の本当の報酬信号は、ゲームの終わりに勝ったときに報酬1を割り当てるか、負けたときに報酬-1を割り当てることができます。したがって、強化学習者は*クレジット割り当て*の問題に対処する必要があります。つまり、どのアクションをクレジットするか、または結果に責任を負わせるかを決定するということです。10月11日に昇進した従業員にも同じことが言えます。このプロモーションは、前年に比べて厳選された多数のアクションを反映している可能性があります。将来的にプロモーションを増やすには、そのプロモーションにつながった行動を把握する必要があります。 
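+Consider the game of chess. The only real reward signal comes at the end of the game, when we either win, earning a reward of, say, 1, or lose, receiving a reward of, say, -1. Reinforcement learners must therefore deal with the *credit assignment* problem: determining which actions to credit or blame for an outcome. The same goes for an employee who gets a promotion on October 11. That promotion likely reflects a large number of well-chosen actions over the previous year. Getting more promotions in the future requires figuring out which actions along the way led to the promotion.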
 
-強化学習では、部分可観測性の問題にも対処しなければならない場合があります。つまり、現在の観測では、現在の状態に関するすべてがわかるとは限りません。掃除ロボットが家の中の同じクローゼットの一つに閉じ込められていることに気付いたとしましょう。ロボットの正確な位置 (および状態) を推測するには、クローゼットに入る前にロボットの以前の観測を考慮する必要がある場合があります。 
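+Reinforcement learners may also have to deal with the problem of partial observability. That is, the current observation might not tell you everything about your current state. Say a cleaning robot found itself trapped in one of many identical closets in a house. Inferring the precise location of the robot might require considering its previous observations before it entered the closet.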
 
-最後に、強化学習者はいつでも良いポリシーを1つ知っているかもしれませんが、エージェントが試したことのない優れたポリシーが他にもたくさんあるかもしれません。強化学習者は、現在知られている最も優れた戦略を政策として「活用」するか、戦略の空間を「探索する」かを常に選択しなければならず、知識と引き換えに短期的な報酬を放棄する可能性がある。 
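+Finally, at any given point, a reinforcement learner might know of one good policy, but there might be many better policies that the agent has never tried. The reinforcement learner must constantly choose whether to *exploit* the best (currently) known strategy as a policy, or to *explore* the space of strategies, potentially giving up some short-run reward in exchange for knowledge.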
 
-一般的な強化学習問題は非常に一般的な設定です。アクションは後続の観測に影響します。報酬は、選択したアクションにのみ対応して観察されます。環境は完全に観察されることも部分的に観察されることもあります。この複雑さを一度に説明すると、あまりにも多くの研究者に尋ねるかもしれません。さらに、すべての実際的な問題がこのような複雑さを示すわけではありません。その結果、研究者は強化学習の問題の特殊なケースを数多く研究してきました。 
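+The general reinforcement learning problem is a very general setting. Actions affect subsequent observations. Rewards are observed only for the actions that were actually chosen. The environment may be either fully or partially observed. Accounting for all this complexity at once may be asking too much of researchers. Moreover, not every practical problem exhibits all this complexity. As a result, researchers have studied a number of special cases of reinforcement learning problems.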
 
-環境が十分に観測されると、強化学習問題を*マルコフ決定過程*と呼びます。状態が前のアクションに依存しない場合、この問題を*コンテキストバンディット問題*と呼びます。状態がなく、最初は報酬が不明な一連のアクションしかない場合、この問題は古典的な*マルチアームバンディット問題*です。 
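+When the environment is fully observed, we call the reinforcement learning problem a *Markov decision process*. When the state does not depend on the previous actions, we call the problem a *contextual bandit problem*. When there is no state, just a set of available actions with initially unknown rewards, the problem is the classic *multi-armed bandit problem*.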
 
 ## ルーツ
 
-ここでは、機械学習が対処できる問題のほんの一部を確認しました。さまざまな機械学習の問題に対して、ディープラーニングはそれらを解決するための強力なツールを提供します。多くのディープラーニング手法は最近の発明ですが、データとニューラルネットワーク (多くのディープラーニングモデルの名前) を使ったプログラミングの核となるアイデアは何世紀にもわたって研究されてきました。実際、人間は長い間データを分析し、将来の結果を予測したいという願望を抱いており、自然科学の多くはこれに根ざしています。例えば、ベルヌーイ分布は [Jacob Bernoulli (1655—1705)](https://en.wikipedia.org/wiki/Jacob_Bernoulli) にちなんで名付けられ、ガウス分布は [カール・フリードリヒ・ガウス (1777—1855)](https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss) によって発見されました。たとえば、彼は最小平均二乗アルゴリズムを発明しました。このアルゴリズムは、保険の計算から医療診断まで、今日でも数え切れないほどの問題に使用されています。これらのツールは自然科学における実験的アプローチを生み出しました。例えば、抵抗器の電流と電圧に関するオームの法則は、線形モデルによって完全に記述されています。 
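+We have just reviewed a small subset of the problems that machine learning can address. For a diverse set of machine learning problems, deep learning provides powerful tools for solving them. Although many deep learning methods are recent inventions, the core ideas behind learning from data have been studied for centuries. In fact, humans have long held the desire to analyze data and to predict future outcomes, and much of natural science has its roots in this. For instance, the Bernoulli distribution is named after [Jacob Bernoulli (1655–1705)](https://en.wikipedia.org/wiki/Jacob_Bernoulli), and the Gaussian distribution was discovered by [Carl Friedrich Gauss (1777–1855)](https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss). Gauss invented, for instance, the least mean squares algorithm, which is still used today for countless problems from insurance calculations to medical diagnostics. These tools gave rise to an experimental approach in the natural sciences. For instance, Ohm's law relating current and voltage in a resistor is perfectly described by a linear model.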
 
-中世になっても、数学者は推計について鋭い直感を持っていました。たとえば、[Jacob Köbel (1460—1533)](https://www.maa.org/press/periodicals/convergence/mathematical-treasures-jacob-kobels-geometry) のジオメトリブックでは、16 人の成人男性の足の長さを平均して平均的な足の長さを求めることが示されています。 
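+Even in the Middle Ages, mathematicians had a keen intuition for estimation. For instance, the geometry book of [Jacob Köbel (1460–1533)](https://www.maa.org/press/periodicals/convergence/mathematical-treasures-jacob-kobels-geometry) illustrates averaging the length of 16 adult men's feet to estimate the average foot length in the population (:numref:`fig_koebel`).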
 
 ![Estimating the length of a foot.](../img/koebel.jpg)
 :width:`500px`
 :label:`fig_koebel`
 
-:numref:`fig_koebel` は、この推定量がどのように機能するかを示しています。16人の成人男性は、教会を去るときに一列に並ぶように頼まれました。次に、それらの総長を16で割って、現在1フィートになるものの推定値を取得しました。この「アルゴリズム」は、後に奇形の足に対処するために改善されました。足が最も短く、最も長い足を持つ2人の男性が送り出され、残りの部分でのみ平均化されました。これは、トリム平均推定の最も初期の例の 1 つです。 
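+As a group of individuals exited a church, 16 adult men were asked to line up in a row and have their feet measured. The sum of these measurements was then divided by 16 to obtain an estimate for what now amounts to 1 foot. This "algorithm" was later improved to deal with misshapen feet: the two men with the shortest and longest feet were sent away, and the average was taken only over the remainder. This is among the earliest examples of a trimmed mean estimate.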
 
-統計は、データの収集と可用性によって実際に始まりました。その巨人のひとつ [ロナルド・フィッシャー (1890—1962)](https://en.wikipedia.org/wiki/Ronald_Fisher) は、その理論と遺伝学への応用にも大きく貢献した。彼のアルゴリズム (線形判別分析など) や数式 (フィッシャー情報行列など) の多くは、現在でも頻繁に使用されています。実際、1936年にFisherが発表したIrisデータセットでさえ、機械学習アルゴリズムを説明するために今でも時々使用されています。彼は優生学の支持者でもあり、道徳的に疑わしいデータサイエンスの使用は、産業界や自然科学における生産的な使用と同じくらい長く永続的な歴史があることを思い起こさせるはずです。 
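+Statistics really took off with the collection and availability of data. One of its pioneers, [Ronald Fisher (1890–1962)](https://en.wikipedia.org/wiki/Ronald_Fisher), contributed significantly to its theory and also to its applications in genetics. Many of his algorithms (such as linear discriminant analysis) and formulas (such as the Fisher information matrix) still hold a prominent place in the foundations of modern statistics. Even his data resources had a lasting impact. The Iris dataset that Fisher released in 1936 is still sometimes used to demonstrate machine learning algorithms. Fisher was also a proponent of eugenics, which should remind us that the morally dubious use of data science has as long and enduring a history as its productive use in industry and the natural sciences.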
 
-機械学習の2つ目の影響は、[Claude Shannon (1916—2001)](https://en.wikipedia.org/wiki/Claude_Shannon) による情報理論と [Alan Turing (1912—1954)](https://en.wikipedia.org/wiki/Alan_Turing) による計算論からもたらされました。チューリングは「機械は考えることができるか？」彼の有名な論文「コンピューティング機械と知能」:cite:`Turing.1950`に掲載されています。彼がチューリングテストとして説明したところでは、人間の評価者がテキストによる相互作用に基づいて機械と人間からの応答を区別するのが難しい場合、機械は*インテリジェント*であると考えることができます。 
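+A second influence on machine learning came from information theory, developed by [Claude Shannon (1916–2001)](https://en.wikipedia.org/wiki/Claude_Shannon), and from the theory of computation via [Alan Turing (1912–1954)](https://en.wikipedia.org/wiki/Alan_Turing). Turing posed the question "can machines think?" in his famous paper *Computing Machinery and Intelligence* :cite:`Turing.1950`. In what he described as the Turing test, a machine can be considered *intelligent* if it is difficult for a human evaluator to distinguish between the replies of a machine and those of a human based on textual interactions.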
 
-神経科学と心理学には別の影響があります。結局のところ、人間は明らかに知的な行動を示します。したがって、この能力を説明し、場合によってはリバースエンジニアリングできるかどうかを尋ねるのが妥当です。この様式にインスパイアされた最も古いアルゴリズムの一つは、[Donald Hebb (1904—1985)](https://en.wikipedia.org/wiki/Donald_O._Hebb) によって策定されました。彼の画期的な著書『行動の組織化』:cite:`Hebb.Hebb.1949`で、ニューロンはポジティブな強化によって学習すると仮定している。これはHebbian学習ルールとして知られるようになりました。これはRosenblattのパーセプトロン学習アルゴリズムのプロトタイプであり、今日のディープラーニングを支える多くの確率的勾配降下アルゴリズムの基礎を築きました。望ましい振る舞いを強化し、望ましくない振る舞いを減らして、ニューラルネットワークのパラメーターを適切に設定します。 
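+Another influence can be found in neuroscience and psychology. After all, humans clearly exhibit intelligent behavior. Many scholars have asked whether one could explain and possibly reverse engineer this capacity. One of the oldest biologically inspired algorithms was formulated by [Donald Hebb (1904–1985)](https://en.wikipedia.org/wiki/Donald_O._Hebb). In his groundbreaking book *The Organization of Behavior* :cite:`Hebb.Hebb.1949`, he posited that neurons learn by positive reinforcement. This became known as the Hebbian learning rule. These ideas inspired later works such as Rosenblatt's perceptron learning algorithm and laid the foundations of many stochastic gradient descent algorithms that underpin deep learning today: reinforce desirable behavior and diminish undesirable behavior to obtain good settings of the parameters in a neural network.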
 
-生物学的インスピレーションは、*ニューラルネットワーク*にその名を与えたものです。1世紀以上にわたって（1873年のアレクサンダーベインと1890年のジェームズシェリントンのモデルにさかのぼります）、研究者たちは相互作用するニューロンのネットワークに似た計算回路を組み立てようとしました。時間が経つにつれて、生物学の解釈は文字通りではなくなりましたが、その名前は固まりました。その中心には、今日のほとんどのネットワークに見られるいくつかの重要な原則があります。 
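+Biological inspiration is what gave *neural networks* their name. For over a century (dating back to the models of Alexander Bain, 1873, and Charles Sherrington, 1890), researchers have tried to assemble computational circuits that resemble networks of interacting neurons. Over time, the interpretation of biology has become less literal, but the name stuck. At its heart lie a few key principles that can be found in most networks today: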
 
-* 線形処理単位と非線形処理単位を交互に使用したもので、「*layers*」と呼ばれることもあります。
-* チェーンルール (*backpropagation* とも呼ばれる) を使用して、ネットワーク全体のパラメーターを一度に調整します。
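+* The alternation of linear and nonlinear processing units, often referred to as *layers*.
+* The use of the chain rule (also known as *backpropagation*) for adjusting the parameters of the entire network at once.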
 
-初期の急速な進歩の後、ニューラルネットワークの研究は1995年頃から2005年にかけて衰退しました。これは主に2つの理由によるものです。まず、ネットワークの学習は計算上非常にコストがかかります。前世紀の終わりにはランダムアクセスメモリが豊富でしたが、計算能力は乏しかったです。第二に、データセットは比較的小さかった。実際、1932年のFisher's Irisデータセットは、アルゴリズムの有効性をテストするための一般的なツールでした。60000 桁の手書きの数字を持つ MNIST データセットは巨大と見なされていました。 
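+After initial rapid progress, research in neural networks languished from around 1995 until 2005. This was mainly due to two reasons. First, training a network is computationally very expensive. While random-access memory was plentiful at the end of the past century, computational power was scarce. Second, datasets were relatively small. In fact, Fisher's Iris dataset from 1936 was a popular tool for testing the efficacy of algorithms. The MNIST dataset with its 60000 handwritten digits was considered huge.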
 
-データと計算が不足していることを考えると、カーネル法、決定木、グラフィカルモデルなどの強力な統計ツールが経験的に優れていることが証明されました。ニューラルネットワークとは異なり、トレーニングに数週間もかからず、強力な理論的保証で予測可能な結果が得られました。 
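+Given the scarcity of data and computation, strong statistical tools such as kernel methods, decision trees, and graphical models proved empirically superior in many applications. Moreover, unlike neural networks, they did not require weeks to train and provided predictable results with strong theoretical guarantees.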
 
 ## ディープラーニングへの道
 
-ワールドワイドウェブ、オンラインで何億人ものユーザーにサービスを提供する企業の出現、安価で高品質のセンサーの普及、安価なデータストレージ（クライダーの法則）、安価な計算（ムーアの法則）により、大量のデータがすぐに利用できるようになったことで、その多くが変わりました。もともとコンピュータゲーム用に設計されたGPUの形式。突然、計算上実行不可能と思われるアルゴリズムとモデルが関連するようになりました（逆も同様）。これは :numref:`tab_intro_decade` で最もよく説明されています。 
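+Much of this changed with the availability of large amounts of data, due to the World Wide Web, the advent of companies serving hundreds of millions of users online, the spread of cheap, high-quality sensors, cheap data storage (Kryder's law), and cheap computation (Moore's law). In particular, the landscape of computation in deep learning was revolutionized by advances in GPUs, which were originally engineered for computer gaming. Suddenly, algorithms and models that had seemed computationally infeasible became relevant (and vice versa). This is best illustrated in :numref:`tab_intro_decade`.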
 
-:データセットとコンピュータメモリと計算能力の比較 
+:Dataset vs. computer memory and computational power
 
 |Decade|Dataset|Memory|Floating point calculations per second|
 |:--|:-|:-|:-|
 |1970|100 (Iris)|1 KB|100 KF (Intel 8080)|
-|1980|1 K (House prices in Boston)|100 KB|1 MF (Intel 80186)|
+|1980|1 K (house prices in Boston)|100 KB|1 MF (Intel 80186)|
 |1990|10 K (optical character recognition)|10 MB|10 MF (Intel 80486)|
 |2000|10 M (web pages)|100 MB|1 GF (Intel Core)|
 |2010|10 G (advertising)|1 GB|1 TF (Nvidia C2050)|
 |2020|1 T (social network)|100 GB|1 PF (Nvidia DGX-2)|
 :label:`tab_intro_decade`
 
-ランダム・アクセス・メモリがデータの増加に対応していないことは明らかです。同時に、計算能力の向上は、利用可能なデータの増加を上回っています。つまり、統計モデルはメモリ効率を高め (通常は非線形性を加えることで実現)、同時に計算量の増大により、これらのパラメーターの最適化により多くの時間を費やすことができるようになる必要があります。その結果、機械学習と統計学のスイートスポットは (一般化された) 線形モデルやカーネル法からディープニューラルネットワークへと移行しました。これは、多層パーセプトロン:cite:`McCulloch.Pitts.1943`、畳み込みニューラルネットワーク :cite:`LeCun.Bottou.Bengio.ea.1998`、長期短期記憶 :cite:`Hochreiter.Schmidhuber.1997`、Q-Learning :cite:`Watkins.Dayan.1992`など、ディープラーニングの主力の多くが過去10年間に本質的に「再発見」された理由の1つでもあります。かなりの時間。 
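+Note that random-access memory has not kept pace with the growth in data. At the same time, increases in computational power have outpaced the growth in datasets. This means that statistical models need to become more memory efficient, and that they are free to spend more computer cycles optimizing parameters, thanks to the increased compute budget. Consequently, the sweet spot in machine learning and statistics moved from (generalized) linear models and kernel methods to deep neural networks. This is also one of the reasons why many of the mainstays of deep learning, such as multilayer perceptrons :cite:`McCulloch.Pitts.1943`, convolutional neural networks :cite:`LeCun.Bottou.Bengio.ea.1998`, long short-term memory :cite:`Hochreiter.Schmidhuber.1997`, and Q-learning :cite:`Watkins.Dayan.1992`, were essentially "rediscovered" in the past decade, after lying comparatively dormant for a considerable time.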
 
-統計モデル、アプリケーション、アルゴリズムの最近の進歩は、種の進化が急速に進歩する瞬間であるカンブリア紀の爆発に例えられることがある。実際、最先端の技術は、数十年前のアルゴリズムに適用された、利用可能なリソースの単なる結果ではありません。以下のリストは、研究者が過去10年間で驚異的な進歩を遂げるのを助けてきたアイデアの表面をほとんど傷つけていないことに注意してください。 
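+The recent progress in statistical models, applications, and algorithms has sometimes been likened to the Cambrian explosion: a moment of rapid progress in the evolution of species. Indeed, the state of the art is not a mere consequence of available resources applied to decades-old algorithms. Note that the list below barely scratches the surface of the ideas that have helped researchers achieve tremendous progress over the past decade.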
 
-* *dropout* :cite:`Srivastava.Hinton.Krizhevsky.ea.2014` などの新しい容量制御方法により、過適合の危険性が軽減されました。これは、ニューラルネットワーク全体にノイズインジェクション :cite:`Bishop.1995` を適用し、学習目的で重みを確率変数に置き換えることで実現しました。
-* アテンションメカニズムは、1世紀以上にわたって統計を悩ませてきた2つ目の問題を解決しました。学習可能なパラメータの数を増やすことなく、システムのメモリと複雑さを増大させる方法です。研究者は、学習可能なポインター構造体 :cite:`Bahdanau.Cho.Bengio.2014` としか見なされないものを使用して、洗練された解法を発見しました。固定次元表現での機械翻訳など、テキストシーケンス全体を覚えておく必要はなく、保存する必要があるのは翻訳プロセスの中間状態へのポインタだけでした。これにより、新しいシーケンスの生成を開始する前にモデルがシーケンス全体を記憶する必要がなくなったため、長いシーケンスの精度が大幅に向上しました。
-* メモリネットワーク :cite:`Sukhbaatar.Weston.Fergus.ea.2015` やニューラルプログラマーインタープリター :cite:`Reed.De-Freitas.2015` などを介した多段階設計により、統計モデラーは推論への反復アプローチを記述することができました。これらのツールを使用すると、ディープニューラルネットワークの内部状態を繰り返し変更できるため、プロセッサが計算のためにメモリを変更するのと同様に、一連の推論で後続のステップを実行できます。
-* もう1つの重要な進展は、敵対的生成ネットワーク:cite:`Goodfellow.Pouget-Abadie.Mirza.ea.2014`の発明でした。従来、密度推定と生成モデルの統計的手法は、適切な確率分布と、それらからサンプリングするための (しばしば近似的な) アルゴリズムを見つけることに重点を置いていました。その結果、これらのアルゴリズムは、統計モデルに内在する柔軟性の欠如によって大きく制限されていました。敵対的生成ネットワークにおける重要な革新は、サンプラーを微分可能なパラメーターを持つ任意のアルゴリズムに置き換えることでした。その後、弁別器 (事実上 2 サンプル検定) がフェイクデータと実データを区別できないように調整されます。任意のアルゴリズムを使用してデータを生成できるため、密度推定をさまざまな手法にまで広げました。ギャロッピングするシマウマ :cite:`Zhu.Park.Isola.ea.2017` と偽の有名人の顔 :cite:`Karras.Aila.Laine.ea.2017` の例は、どちらもこの進歩の証です。アマチュアのいたずら書きをする人でも、シーンのレイアウトが :cite:`Park.Liu.Wang.ea.2019` のように描かれたスケッチだけに基づいて、フォトリアリスティックなイメージを生成できます。
-* 多くの場合、1 つの GPU では学習に使用できる大量のデータを処理するには不十分です。過去 10 年間で、並列および分散学習アルゴリズムを構築する能力が大幅に向上しました。スケーラブルなアルゴリズムの設計における重要な課題の 1 つは、ディープラーニング最適化の主力製品である確率的勾配降下法が、処理されるデータの比較的小さなバッチに依存していることです。同時に、バッチが小さいとGPUの効率が制限されます。したがって、たとえば、バッチあたり 32 イメージのミニバッチサイズの 1024 GPU での学習は、約 32000 イメージの集約ミニバッチになります。Li :cite:`Li.2017`、続いて :cite:`You.Gitman.Ginsburg.2017` と :cite:`Jia.Song.He.ea.2018` による最近の研究により、観測サイズは最大 64000 個にまで拡大され、ImageNet データセットの ResNet-50 モデルのトレーニング時間が 7 分未満に短縮されました。比較のため、当初はトレーニング時間は日数オーダーで測定されました。
-* また、計算を並列化できることは、少なくともシミュレーションが選択肢である場合は常に、強化学習の進歩にきわめて重要な貢献をしてきました。これにより、Go、Atariゲーム、Starcraft、物理シミュレーション（MujoCOの使用など）において、コンピューターが超人的なパフォーマンスを達成する上で大きな進歩を遂げました。AlphaGo でこれを実現する方法については、例えば :cite:`Silver.Huang.Maddison.ea.2016` を参照してください。一言で言えば、強化学習はたくさんの (状態、行動、報酬) トリプルが利用できる場合、つまり、それらが互いにどのように関係しているかを学ぶために多くのことを試すことができる場合に最も効果的です。シミュレーションはそのような手段を提供します。
-* ディープラーニングフレームワークは、アイデアを広める上で重要な役割を果たしてきました。モデリングを容易にする第1世代のフレームワークには、[Caffe](https://github.com/BVLC/caffe)、[Torch](https://github.com/torch)、[Theano](https://github.com/Theano/Theano)が含まれていました。これらのツールを使って多くの独創的な論文が書かれました。現在では、[TensorFlow](https://github.com/tensorflow/tensorflow) (高レベルの API [Keras](https://github.com/keras-team/keras) でよく使用される)、[CNTK](https://github.com/Microsoft/CNTK)、[Caffe 2](https://github.com/caffe2/caffe2)、および [Apache MXNet](https://github.com/apache/incubator-mxnet) に取って代わられています。第3世代のツール、つまりディープラーニングのための命令型ツールは、モデルを記述するために Python NumPy に似た構文を使用した [Chainer](https://github.com/chainer/chainer) が主導したことは間違いありません。このアイデアは、[PyTorch](https://github.com/pytorch/pytorch)、MXNet の [Gluon API](https://github.com/apache/incubator-mxnet)、および [Jax](https://github.com/google/jax) の両方によって採用されました。
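+* Novel methods for capacity control, such as *dropout* :cite:`Srivastava.Hinton.Krizhevsky.ea.2014`, have helped to mitigate overfitting. Here, noise is injected :cite:`Bishop.1995` throughout the neural network during training.
+* Attention mechanisms solved a second problem that had plagued statistics for over a century: how to increase the memory and complexity of a system without increasing the number of learnable parameters. Researchers found an elegant solution by using what can only be viewed as a learnable pointer structure :cite:`Bahdanau.Cho.Bengio.2014`. Rather than having to remember an entire text sequence, e.g., for machine translation in a fixed-dimensional representation, all that needed to be stored was a pointer to the intermediate state of the translation process. This allowed for significantly increased accuracy on long sequences, since the model no longer needed to remember the entire sequence before commencing the generation of a new sequence. Built solely on attention mechanisms, the Transformer architecture :cite:`Vaswani.Shazeer.Parmar.ea.2017` has demonstrated compelling success in a wide range of areas. For example, a single Transformer pretrained on modalities as diverse as text, images, joint torques, and button presses can play Atari, caption images, chat, and control a robot :cite:`reed2022generalist`.
+* Multi-stage designs, e.g., via memory networks :cite:`Sukhbaatar.Weston.Fergus.ea.2015` and the neural programmer-interpreter :cite:`Reed.De-Freitas.2015`, allowed statistical modelers to describe iterative approaches to reasoning. These tools allow the internal state of the deep neural network to be modified repeatedly, thus carrying out subsequent steps in a chain of reasoning, similar to how a processor can modify memory for a computation.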
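+* Another key development was the invention of generative adversarial networks :cite:`Goodfellow.Pouget-Abadie.Mirza.ea.2014`. Traditionally, statistical methods for density estimation and generative models focused on finding proper probability distributions and (often approximate) algorithms for sampling from them. As a result, these algorithms were largely limited by the lack of flexibility inherent in the statistical models. The crucial innovation in generative adversarial networks was to replace the sampler by an arbitrary algorithm with differentiable parameters. These are then adjusted in such a way that the discriminator (effectively a two-sample test) cannot distinguish fake from real data. The ability to use arbitrary algorithms to generate data opened up density estimation to a wide variety of techniques. Examples of galloping zebras :cite:`Zhu.Park.Isola.ea.2017` and of fake celebrity faces :cite:`Karras.Aila.Laine.ea.2017` are both testimony to this progress. Even amateur doodlers can produce photorealistic images based on just sketches that describe what the layout of a scene looks like :cite:`Park.Liu.Wang.ea.2019`.
+* In many cases, a single GPU is insufficient to process the large amounts of data available for training. Over the past decade, the ability to build parallel and distributed training algorithms has improved significantly. One of the key challenges in designing scalable algorithms is that the workhorse of deep learning optimization, stochastic gradient descent, relies on relatively small minibatches of data to be processed. At the same time, small batches limit the efficiency of GPUs. Hence, training on 1024 GPUs with a minibatch size of, say, 32 images per GPU amounts to an aggregate minibatch of about 32000 images (1024 × 32 = 32768). Recent work, first by :citet:`Li.2017`, and subsequently by :citet:`You.Gitman.Ginsburg.2017` and :citet:`Jia.Song.He.ea.2018`, pushed the size up to 64000 observations, reducing training time for the ResNet-50 model on the ImageNet dataset to less than 7 minutes. For comparison, training times were initially measured in the order of days.
+* The ability to parallelize computation has also contributed to progress in reinforcement learning. This has led to significant progress in computers achieving superhuman performance on tasks like Go, Atari games, and Starcraft, and in physics simulations (e.g., using MuJoCo), where environment simulators are available. See, e.g., :citet:`Silver.Huang.Maddison.ea.2016` for a description of how to achieve this in AlphaGo. In a nutshell, reinforcement learning works best if plenty of (state, action, reward) tuples are available. Simulation provides such an avenue.
+* Deep learning frameworks have played a crucial role in disseminating ideas. The first generation of open-source frameworks for neural network modeling consisted of [Caffe](https://github.com/BVLC/caffe), [Torch](https://github.com/torch), and [Theano](https://github.com/Theano/Theano). Many seminal papers were written using these tools. By now, they have been superseded by [TensorFlow](https://github.com/tensorflow/tensorflow) (often used via its high-level API [Keras](https://github.com/keras-team/keras)), [CNTK](https://github.com/Microsoft/CNTK), [Caffe 2](https://github.com/caffe2/caffe2), and [Apache MXNet](https://github.com/apache/incubator-mxnet). The third generation of tools consists of so-called *imperative* tools for deep learning, a trend that was arguably ignited by [Chainer](https://github.com/chainer/chainer), which used a syntax similar to Python NumPy to describe models. This idea was adopted by [PyTorch](https://github.com/pytorch/pytorch), the [Gluon API](https://github.com/apache/incubator-mxnet) of MXNet, and [Jax](https://github.com/google/jax).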
 
-より優れたツールを構築するシステム研究者とより優れたニューラルネットワークを構築する統計モデラーの分業により、物事は大幅に簡素化されました。たとえば、線形ロジスティック回帰モデルをトレーニングすることは、自明ではない宿題の問題であり、2014年にカーネギーメロン大学の新しい機械学習博士課程の学生に与える価値がありました。今では、このタスクは10行未満のコードで達成でき、プログラマーにしっかりと把握できます。 
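+The division of labor between system researchers building better tools and statistical modelers building better neural networks has greatly simplified things. For instance, training a linear logistic regression model used to be a nontrivial homework problem, worthy of being given to new machine learning Ph.D. students at Carnegie Mellon University in 2014. By now, this task can be accomplished in fewer than 10 lines of code, putting it firmly within the grasp of programmers.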
 
 ## 成功事例
 
-AIには長い歴史があり、そうでなければ達成するのは難しい結果をもたらしてきました。例えば、光学式文字認識を用いた郵便物仕分けシステムは、1990年代から導入されてきた。結局のところ、これは手書き数字の有名なMNISTデータセットのソースです。同じことが、銀行預金の読書小切手と申請者の信用力の採点にも当てはまります。金融取引は自動的に不正チェックされます。これは、PayPal、Stripe、AliPay、WeChat、Apple、Visa、MasterCardなど、多くの電子商取引決済システムのバックボーンを形成しています。チェスのコンピュータプログラムは何十年もの間競争力がありました。機械学習は、検索、レコメンデーション、パーソナライズ、ランキングをインターネット上でフィードします。言い換えれば、機械学習は広く普及していますが、視界からは隠されていることがよくあります。 
-
-AIが脚光を浴びているのはごく最近のことです。その主な理由は、以前は手に負えないと考えられていた、消費者に直接関係する問題の解決策によるものです。このような進歩の多くは、ディープラーニングによるものです。 
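+AI has a long history of delivering results that would be difficult to accomplish otherwise. For instance, mail sorting systems using optical character recognition have been deployed since the 1990s. This is, after all, the source of the famous MNIST dataset of handwritten digits. The same applies to reading checks for bank deposits and scoring the creditworthiness of applicants. Financial transactions are checked for fraud automatically. This forms the backbone of many e-commerce payment systems, such as PayPal, Stripe, AliPay, WeChat, Apple, Visa, and MasterCard. Computer programs for chess have been competitive for decades. Machine learning feeds search, recommendation, personalization, and ranking on the Internet. In other words, machine learning is pervasive, albeit often hidden from sight.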
 
-* AppleのSiri、AmazonのAlexa、Googleのアシスタントなどのインテリジェントアシスタントは、話された質問に妥当な精度で答えることができます。これには、ライトスイッチをオンにする（身体障害者への恩恵）、理髪店の予約をする、電話サポートダイアログを提供するなどの簡単な作業が含まれます。これは、AIが私たちの生活に影響を与えていることを示す最も顕著な兆候です。
-* デジタル・アシスタントの重要な要素は、音声を正確に認識する能力です。このようなシステムの精度は次第に向上し、特定のアプリケーション :cite:`Xiong.Wu.Alleva.ea.2018` では人間の同等性に達するようになりました。
-* 物体認識も同様に長い道のりを歩んできました。2010年には、写真に写っている物体の推定はかなり困難な作業でした。ImageNet ベンチマークでは、NEC Labs とイリノイ大学アーバナ・シャンペーン校の研究者がトップ 5 のエラー率 28% :cite:`Lin.Lv.Zhu.ea.2010` を達成しました。2017 年までに、このエラー率は 2.25% に減少しました :cite:`Hu.Shen.Sun.2018`。同様に、鳥類の特定や皮膚がんの診断においても、驚くべき結果が得られています。
-* ゲームはかつて人間の知性の要塞でした。TD-Gammonを皮切りに、時間差強化学習、アルゴリズム、計算の進歩を利用してバックギャモンをプレイするプログラムが、幅広い応用のためのアルゴリズムを生み出してきました。バックギャモンとは異なり、チェスははるかに複雑な状態空間と一連のアクションを持っています。DeepBlueは、大規模な並列処理、専用ハードウェア、ゲームツリー:cite:`Campbell.Hoane-Jr.Hsu.2002`による効率的な検索を使用して、Garry Kasparovを打ち負かしました。その巨大な状態空間のために、行くことはさらに困難です。AlphaGo は、ディープラーニングとモンテカルロ木のサンプリング :cite:`Silver.Huang.Maddison.ea.2016` を組み合わせて使用し、2015 年に人間の平等に達しました。ポーカーでの課題は、ステートスペースが広く、完全に観察されていない（対戦相手のカードがわからない）ことでした。Libratus は、効率的に構造化されたストラテジーを使用して、ポーカーで人間のパフォーマンスを上回りました :cite:`Brown.Sandholm.2017`。これは、ゲームの目覚ましい進歩と、高度なアルゴリズムがゲームに重要な役割を果たしたという事実を示しています。
-* AIの進歩を示すもう1つの兆候は、自動運転車やトラックの登場です。完全な自律性はまだ十分ではありませんが、テスラ、NVIDIA、Waymoなどの企業が少なくとも部分的な自律性を可能にする製品を出荷することで、この方向で素晴らしい進歩が見られました。完全な自律性が非常に難しいのは、適切な運転には、認識し、推論し、ルールをシステムに組み込む能力が必要であるということです。現在、ディープラーニングはこれらの問題のコンピュータビジョンの側面で主に使用されています。残りはエンジニアによって厳しく調整されています。
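+It is only recently that AI has been in the limelight, mostly due to solutions to problems that were previously considered intractable and that are directly relevant to consumers. Many of these advances are attributed to deep learning.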
 
-繰り返しになりますが、上記のリストは、機械学習が実際のアプリケーションに影響を与えた箇所をほとんど示していません。たとえば、ロボット工学、ロジスティクス、計算生物学、素粒子物理学、天文学は、少なくとも部分的に機械学習による最近の最も印象的な進歩のいくつかを負っています。機械学習は、エンジニアや科学者にとってユビキタスなツールになりつつあります。 
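+* Intelligent assistants, such as Apple's Siri, Amazon's Alexa, and Google's assistant, are able to answer spoken questions with a reasonable degree of accuracy. This includes menial tasks, like turning on light switches, and more complex tasks, like arranging barber's appointments and offering phone support dialog. This is likely the most noticeable sign that AI is affecting our lives.
+* A key ingredient in digital assistants is the ability to recognize speech accurately. Gradually, the accuracy of such systems has increased to the point of achieving human parity for certain applications :cite:`Xiong.Wu.Alleva.ea.2018`.
+* Object recognition has likewise come a long way. Estimating the object in a picture was a fairly challenging task in 2010. On the ImageNet benchmark, researchers from NEC Labs and the University of Illinois at Urbana-Champaign achieved a top-5 error rate of 28% :cite:`Lin.Lv.Zhu.ea.2010`. By 2017, this error rate was reduced to 2.25% :cite:`Hu.Shen.Sun.2018`. Similarly, stunning results have been achieved for identifying birds and for diagnosing skin cancer.
+* Prowess in games used to provide a measuring stick for human intelligence. Starting from TD-Gammon, a program for playing backgammon using temporal difference reinforcement learning, algorithmic and computational progress has led to algorithms for a wide range of applications. In contrast with backgammon, chess has a much more complex state space and set of actions. DeepBlue beat Garry Kasparov using massive parallelism, special-purpose hardware, and efficient search through the game tree :cite:`Campbell.Hoane-Jr.Hsu.2002`. Go is more difficult still, due to its huge state space. AlphaGo reached human parity in 2015, using deep learning combined with Monte Carlo tree sampling :cite:`Silver.Huang.Maddison.ea.2016`. The challenge in poker was that the state space is large and only partially observed (we do not know the opponents' cards). Libratus exceeded human performance in poker using efficiently structured strategies :cite:`Brown.Sandholm.2017`.
+* Another indication of progress in AI is the advent of self-driving cars and trucks. While full autonomy is not quite within reach, excellent progress has been made in this direction, with companies such as Tesla, NVIDIA, and Waymo shipping products that enable at least partial autonomy. What makes full autonomy so challenging is that proper driving requires the ability to perceive, to reason, and to incorporate rules into a system. At present, deep learning is used primarily in the computer vision aspect of these problems. The rest is heavily tuned by engineers.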
 
-AIに関する非技術的な記事では、AIの黙示録、またはAIの特異点の問題が頻繁に提起されています。恐れているのは、機械学習システムが何らかの形で知覚力を持ち、プログラマー（およびマスター）から独立して、人間の生活に直接影響するものについて決定することになるということです。AIはすでにある程度人間の生計に即座に影響を与えています。信用力は自動的に評価され、オートパイロットは主に車両をナビゲートし、保釈を許可するかどうかの決定は統計データを入力として使用します。もっと軽率に、Alexaにコーヒーマシンの電源を入れるように頼むことができます。 
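+This barely scratches the surface of the impactful applications of machine learning. For instance, robotics, logistics, computational biology, particle physics, and astronomy owe some of their most impressive recent advances at least in part to machine learning. Machine learning is thus becoming a ubiquitous tool for engineers and scientists.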
 
-幸いなことに、私たちは、人間のクリエイターを操作する（またはコーヒーを燃やす）準備ができている、知覚力のあるAIシステムにはほど遠いです。まず、AI システムは、特定の目標指向の方法で設計、トレーニング、展開されます。それらの振る舞いは一般的な知能の錯覚を与えるかもしれませんが、デザインの根底にあるのはルール、ヒューリスティック、統計モデルの組み合わせです。第二に、*人工一般知能 (AI) のためのツールは、自分自身を向上させ、自分自身について推論することができ、一般的な課題を解決しようとしながら独自のアーキテクチャを変更、拡張、改善することができる、単に存在しない。 
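+Questions about a coming AI apocalypse and the plausibility of a *singularity* are frequently raised in non-technical articles on AI. The fear is that machine learning systems will somehow become sentient and make decisions, independently of their programmers, that directly impact the lives of humans. To some extent, AI already affects the livelihood of humans in direct ways: creditworthiness is assessed automatically, autopilots mostly navigate vehicles, and decisions about whether to grant bail use statistical data as input. More frivolously, we can ask Alexa to switch on the coffee machine.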
 
-もっと差し迫った懸念は、AIが私たちの日常生活でどのように使われているかということです。トラックの運転手や店員が行う多くの卑劣なタスクは自動化でき、自動化される可能性が高い。農業ロボットは有機農業のコストを削減する可能性が高いですが、収穫作業の自動化も可能になります。トラック運転手や店員は多くの国で最も一般的な仕事の一部であるため、産業革命のこの段階は社会の広い範囲に大きな影響を与える可能性があります。さらに、統計モデルを注意せずに適用すると、人種、性別、または年齢の偏見につながり、必然的な決定を推進するために自動化されている場合、手続き上の公平性について合理的な懸念を引き起こす可能性がある。これらのアルゴリズムは慎重に使用することが重要です。今日私たちが知っていることから、これは人類を破壊する悪意のある超知性の可能性よりもはるかに差し迫った懸念を私たちに与えます。 
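+Fortunately, we are far from a sentient AI system that could deliberately manipulate its human creators. First, AI systems are engineered, trained, and deployed in a specific, goal-oriented manner. While their behavior might give the illusion of general intelligence, it is a combination of rules, heuristics, and statistical models that underlies the design. Second, at present there simply are no tools for *artificial general intelligence* that are able to improve themselves, reason about themselves, and modify, extend, and improve their own architecture while trying to solve general tasks.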
 
-## 特性
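+A much more pressing concern is how AI is being used in our daily lives. It is likely that many menial tasks currently carried out by truck drivers and shop assistants can and will be automated. Farm robots will likely reduce the cost of organic farming, but they will also automate harvesting operations. This phase of the industrial revolution may have profound consequences for large swaths of society, since truck drivers and shop assistants are among the most common jobs in many countries. Furthermore, statistical models, when applied without care, can lead to racial, gender, or age bias and raise reasonable concerns about procedural fairness if automated to drive consequential decisions. It is important to ensure that these algorithms are used with care. With what we know today, this strikes us as a much more pressing concern than the potential of a malevolent superintelligence to destroy humanity.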
 
-これまで、AIの一分野であると同時にAIへのアプローチでもある機械学習について幅広く話してきました。ディープラーニングは機械学習のサブセットですが、目まぐるしいアルゴリズムとアプリケーションのセットにより、ディープラーニングの成分を具体的に評価することは困難です。これは、ほとんどすべての成分が代替可能であるため、ピザに必要な材料を突き止めるのと同じくらい困難です。 
+## The Essence of Deep Learning
 
-すでに説明したように、機械学習ではデータを使用して、音声認識で音声をテキストに変換するなど、入力と出力の間の変換を学習できます。その際、そのような表現を出力に変換するアルゴリズムに適した方法でデータを表現することがしばしば必要になります。
-*ディープラーニング*はまさにその意味で*ディープ*です
-モデルが多くの「レイヤー」の変換を学習し、各レイヤーが1つのレベルで表現を提供するということです。たとえば、入力に近いレイヤーはデータの低レベルの詳細を表し、分類出力に近いレイヤーは識別に使用されるより抽象的な概念を表す場合があります。*表現学習*は表現そのものを見つけることを目的としているので、ディープラーニングはマルチレベル表現学習と言えます。 
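+Thus far, we have talked about machine learning broadly. Deep learning is the subset of machine learning concerned with models based on many-layered neural networks. It is *deep* in precisely the sense that its models learn many *layers* of transformations. While this might sound narrow, deep learning has given rise to a dizzying array of models, techniques, problem formulations, and applications. Many intuitions have been developed to explain the benefits of depth. Arguably, all machine learning has many layers of computation, the first consisting of feature processing steps. What differentiates deep learning is that the operations at each of the many layers of representation are learned jointly from data.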
 
-生の音声信号、画像の生のピクセル値からの学習、または任意の長さの文とそれに対応する外国語でのマッピングなど、これまで議論してきた問題は、ディープラーニングが優れている問題や、従来の機械学習手法が行き詰まっている問題です。これらの多層モデルは、従来のツールでは不可能だった方法で低レベルの知覚データに対処できることが判明しました。ディープラーニング手法における最も重要な共通点は、間違いなく*エンドツーエンドのトレーニング*の使用です。つまり、個別にチューニングされたコンポーネントをベースにシステムを組み立てるのではなく、システムを構築し、そのパフォーマンスを共同でチューニングします。たとえば、コンピュータービジョンでは、科学者は機械学習モデルを構築するプロセスから「特徴量工学」のプロセスを切り離していました。キャニーエッジ検出器 :cite:`Canny.1987` と Lowe の SIFT 特徴抽出器 :cite:`Lowe.2004` は、イメージを特徴ベクトルにマッピングするアルゴリズムとして、10 年以上にわたって最高峰の地位を占めていました。昔、機械学習をこれらの問題に適用するうえで重要なのは、データを浅いモデルに適した形式に変換する手作業で設計された方法を考え出すことでした。残念ながら、アルゴリズムによって自動的に実行される何百万もの選択肢に対して一貫した評価と比較して、人間が創意工夫によって達成できるものはごくわずかです。ディープラーニングが引き継がれると、これらの特徴抽出器は自動調整フィルターに置き換えられ、優れた精度が得られました。 
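+The problems that we have discussed so far, such as learning from the raw audio signal, from the raw pixel values of images, or mapping between sentences of arbitrary length and their counterparts in foreign languages, are those where deep learning excels and traditional methods falter. It turns out that these many-layered models are capable of addressing low-level perceptual data in a way that previous tools could not. Arguably the most significant commonality in deep learning methods is *end-to-end training*. That is, rather than assembling a system from components that are individually tuned, one builds the system and then tunes its performance jointly. For instance, in computer vision scientists used to separate the process of *feature engineering* from the process of building machine learning models. The Canny edge detector :cite:`Canny.1987` and Lowe's SIFT feature extractor :cite:`Lowe.2004` reigned supreme for over a decade as algorithms for mapping images into feature vectors. In bygone days, the crucial part of applying machine learning to these problems consisted of coming up with manually engineered ways of transforming the data into a form amenable to shallow models. Unfortunately, there is only so much that humans can accomplish by ingenuity in comparison with a consistent evaluation over millions of choices carried out automatically by an algorithm. When deep learning took over, these feature extractors were replaced by automatically tuned filters, yielding superior accuracy.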
 
-したがって、ディープラーニングの主な利点の 1 つは、従来の学習パイプラインの最後の浅いモデルだけでなく、労働集約的な特徴量エンジニアリングのプロセスにも取って代わることです。さらに、ディープラーニングは、ドメイン固有の前処理の多くを置き換えることで、これまでコンピュータービジョン、音声認識、自然言語処理、医療情報学などの応用分野を分断していた多くの境界を排除し、多様性に対処するための統一されたツールセットを提供しました。問題。 
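+Thus, one key advantage of deep learning is that it replaces not only the shallow models at the end of traditional learning pipelines, but also the labor-intensive process of feature engineering. Moreover, by replacing much of the domain-specific preprocessing, deep learning has eliminated many of the boundaries that previously separated computer vision, speech recognition, natural language processing, medical informatics, and other application areas, offering a unified set of tools for tackling diverse problems.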
 
-エンドツーエンドのトレーニング以外にも、パラメトリック統計記述から完全ノンパラメトリックモデルへの移行が進んでいます。データが不足している場合、有用なモデルを得るためには、現実に関する仮定を単純化することに頼る必要があります。データが豊富な場合は、現実により正確に適合するノンパラメトリックモデルに置き換えることができます。これは、前世紀半ばにコンピュータが利用可能になったことで物理学が経験した進歩をある程度反映しています。電子がどのように振る舞うかのパラメトリック近似を手で解くのではなく、関連する偏微分方程式の数値シミュレーションに頼ることができるようになりました。これにより、説明可能性を犠牲にすることが多いとはいえ、はるかに正確なモデルが生まれました。 
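+Beyond end-to-end training, we are experiencing a transition from parametric statistical descriptions to fully nonparametric models. When data is scarce, one needs to rely on simplifying assumptions about reality in order to obtain useful models. When data is abundant, these can be replaced by nonparametric models that fit the data better. To some extent, this mirrors the progress that physics experienced in the middle of the previous century with the availability of computers. Rather than solving parametric approximations of how electrons behave by hand, one can now resort to numerical simulations of the associated partial differential equations. This has led to much more accurate models, albeit often at the expense of explainability.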
 
-以前の研究とのもう1つの違いは、最適ではない解を受け入れること、非凸非線形最適化問題を扱うこと、そしてそれを証明する前に物事を試す意欲があることです。統計的問題への対処におけるこの新たな経験主義は、急速な才能の流入と相まって、実用的なアルゴリズムの急速な進歩をもたらしましたが、多くの場合、何十年も前から存在していたツールの修正と再発明を犠牲にしています。 
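+Another difference from previous work is the acceptance of suboptimal solutions, the handling of nonconvex nonlinear optimization problems, and the willingness to try things before proving them. This newfound empiricism in dealing with statistical problems, combined with a rapid influx of talent, has led to rapid progress in practical algorithms, albeit in many cases at the expense of modifying and reinventing tools that had existed for decades.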
 
-結局、ディープラーニングコミュニティは、学問や企業の境界を越えてツールを共有し、多くの優れたライブラリ、統計モデル、トレーニングされたネットワークをオープンソースとして公開することに誇りを持っています。この精神に基づき、この本を構成するノートブックは自由に配布および使用できるようになっています。私たちは、誰もがディープラーニングについて学ぶためのアクセスの障壁を下げるために懸命に取り組んできました。読者がディープラーニングの恩恵を受けることを願っています。 
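+In the end, the deep learning community prides itself on sharing tools across academic and corporate boundaries, releasing many excellent libraries, statistical models, and trained networks as open source. It is in this spirit that the notebooks forming this book are freely available for distribution and use. We have worked hard to lower the barriers of access for everyone to learn about deep learning, and we hope that our readers will benefit from this.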
 
-## [概要
+## Summary
 
-* 機械学習では、コンピューターシステムが経験 (多くの場合はデータ) を活用して特定のタスクのパフォーマンスを向上させる方法を学習します。統計、データマイニング、最適化のアイデアを組み合わせたものです。多くの場合、AIソリューションを実装する手段として使用されます。
-* 機械学習のクラスである表現学習は、データを適切に表現する方法を自動的に見つける方法に重点を置いています。ディープラーニングは、多層の変換を学習することによるマルチレベルの表現学習です。
-* ディープラーニングは、従来の機械学習パイプラインの終わりにあった浅いモデルだけでなく、労働集約的な特徴量エンジニアリングのプロセスにも取って代わります。 
-* 最近のディープラーニングの進歩の多くは、安価なセンサーやインターネット規模のアプリケーションから生じる豊富なデータと、主にGPUによる計算の大幅な進歩によって引き起こされています。
-* システム全体の最適化は、高いパフォーマンスを得るための重要な要素です。効率的なディープラーニングフレームワークを利用できるようになったことで、このフレームワークの設計と実装が非常に容易になりました。
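+Machine learning studies how computer systems can leverage experience (often data) to improve performance at specific tasks. It combines ideas from statistics, data mining, and optimization. Often, it is used as a means of implementing AI solutions. As a class of machine learning, representation learning focuses on how to automatically find an appropriate way to represent data. As multi-level representation learning through many learned layers of transformations, deep learning replaces not only the shallow models at the end of traditional machine learning pipelines, but also the labor-intensive process of feature engineering. Much of the recent progress in deep learning has been triggered by an abundance of data arising from cheap sensors and Internet-scale applications, and by significant progress in computation, mostly through GPUs. In addition, the availability of efficient deep learning frameworks has made the design and implementation of whole-system optimization significantly easier, which is a key component in obtaining high performance.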
 
 ## 演習
 
-1. 現在書いているコードのどの部分を「学習」できるか、つまり、コード内でなされた設計の選択を学習して自動的に決定することで改善できるでしょうか。コードにヒューリスティックデザインの選択肢が含まれていますか？
-1. あなたが遭遇した問題には、解決方法の例がたくさんありますが、それらを自動化するための具体的な方法はありません。これらは、ディープラーニングを使用する第一の候補となる可能性があります。
-1. AIの発展を新たな産業革命と捉え、アルゴリズムとデータの関係性について教えてください。蒸気機関や石炭と似ていますか？根本的な違いは何ですか？
-1. :numref:`fig_ml_loop`、物理学、工学、計量経済学など、エンドツーエンドのトレーニングアプローチは他にどこに適用できますか？
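+1. Which parts of the code that you are currently writing could be "learned", i.e., improved by learning and automatically determining design choices that are made in your code? Does your code include heuristic design choices? What data might you need to learn the desired behavior?
+1. Which problems that you encounter have many examples of how to solve them, yet no specific way to automate them? These may be prime candidates for using deep learning.
+1. Describe the relationships between algorithms, data, and computation. How do the characteristics of the data and the currently available computational resources influence the appropriateness of various algorithms?
+1. Name some settings where end-to-end training is not currently the default approach but might be useful.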
 
 [Discussions](https://discuss.d2l.ai/t/22)
diff --git a/chapter_introduction/index_origin.md b/chapter_introduction/index_origin.md
index 871a121..6551fe8 100644
--- a/chapter_introduction/index_origin.md
+++ b/chapter_introduction/index_origin.md
@@ -1,72 +1,104 @@
 # Introduction
 :label:`chap_introduction`
 
-Until recently, nearly every computer program that we interact with daily
-was coded by software developers from first principles.
-Say that we wanted to write an application to manage an e-commerce platform.
-After huddling around a whiteboard for a few hours to ponder the problem,
-we would come up with the broad strokes of a working solution that might probably look something like this:
+Until recently, nearly every computer program
+that you might interact with on an ordinary day
+was coded up as a rigid set of rules
+specifying precisely how it should behave.
+Say that we wanted to write an application
+to manage an e-commerce platform.
+After huddling around a whiteboard
+for a few hours to ponder the problem,
+we might settle on the broad strokes
+of a working solution, for example:
 (i) users interact with the application through an interface
 running in a web browser or mobile application;
 (ii) our application interacts with a commercial-grade database engine
 to keep track of each user's state and maintain records
-of historical transactions; 
+of historical transactions;
 and (iii) at the heart of our application,
 the *business logic* (you might say, the *brains*) of our application
-spells out in methodical detail the appropriate action
-that our program should take in every conceivable circumstance.
+spells out a set of rules that map every conceivable circumstance
+to the corresponding action that our program should take.
 
 To build the brains of our application,
-we would have to step through every possible corner case
-that we anticipate encountering, devising appropriate rules.
-Each time a customer clicks to add an item to their shopping cart,
-we add an entry to the shopping cart database table,
-associating that user's ID with the requested product's ID.
-While few developers ever get it completely right the first time
+we might enumerate all the common events
+that our program should handle.
+For example, whenever a customer clicks
+to add an item to their shopping cart,
+our program should add an entry
+to the shopping cart database table,
+associating that user's ID
+with the requested product's ID.
+We might then attempt to step through
+every possible corner case,
+testing the appropriateness of our rules
+and making any necessary modifications.
+What happens if a user
+initiates a purchase with an empty cart?
+While few developers ever get it
+completely right the first time
 (it might take some test runs to work out the kinks),
-for the most part, we could write such a program from first principles
-and confidently launch it 
+for the most part, we can write such programs
+and confidently launch them
 *before* ever seeing a real customer.
-Our ability to design automated systems from first principles
-that drive functioning products and systems, 
+Our ability to manually design automated systems
+that drive functioning products and systems,
 often in novel situations,
 is a remarkable cognitive feat.
-And when you are able to devise solutions that work $100\%$ of the time,
-you should not be using machine learning.
+And when you are able to devise solutions
+that work $100\%$ of the time,
+you typically should not be
+worrying about machine learning.
 
-Fortunately for the growing community of machine learning scientists,
+Fortunately for the growing community
+of machine learning scientists,
 many tasks that we would like to automate
 do not bend so easily to human ingenuity.
-Imagine huddling around the whiteboard with the smartest minds you know,
-but this time you are tackling one of the following problems:
+Imagine huddling around the whiteboard
+with the smartest minds you know,
+but this time you are tackling
+one of the following problems:
 
 * Write a program that predicts tomorrow's weather given geographic information, satellite images, and a trailing window of past weather.
-* Write a program that takes in a question, expressed in free-form text, and  answers it correctly.
-* Write a program that given an image can identify all the people it contains,  drawing outlines around each.
-* Write a program that presents users with products that they are likely to   enjoy but unlikely, in the natural course of browsing, to encounter.
-
-In each of these cases, even elite programmers
-are incapable of coding up solutions from scratch.
-The reasons for this can vary. Sometimes the program
-that we are looking for follows a pattern that changes over time,
-and we need our programs to adapt.
-In other cases, the relationship (say between pixels,
+* Write a program that takes in a factoid question, expressed in free-form text, and answers it correctly.
+* Write a program that, given an image, identifies all of the people depicted in it and draws outlines around each.
+* Write a program that presents users with products that they are likely to enjoy but unlikely, in the natural course of browsing, to encounter.
+
+For these problems,
+even elite programmers would struggle
+to code up solutions from scratch.
+The reasons can vary.
+Sometimes the program that we are looking for
+follows a pattern that changes over time,
+so there is no fixed right answer!
+In such cases, any successful solution
+must adapt gracefully to a changing world.
+At other times, the relationship (say between pixels,
 and abstract categories) may be too complicated,
 requiring thousands or millions of computations
-that are beyond our conscious understanding
-even if our eyes manage the task effortlessly.
-*Machine learning* is the study of powerful
-techniques that can learn from experience.
+and following unknown principles.
+In the case of image recognition,
+the precise steps required to perform the task
+lie beyond our conscious understanding,
+even though our subconscious cognitive processes
+execute the task effortlessly.
+
+
+*Machine learning* is the study of algorithms
+that can learn from experience.
 As a machine learning algorithm accumulates more experience,
-typically in the form of observational data or
-interactions with an environment, its performance improves.
+typically in the form of observational data
+or interactions with an environment,
+its performance improves.
 Contrast this with our deterministic e-commerce platform,
-which performs according to the same business logic,
+which follows the same business logic,
 no matter how much experience accrues,
 until the developers themselves learn and decide
 that it is time to update the software.
-In this book, we will teach you the fundamentals of machine learning,
-and focus in particular on *deep learning*, 
+In this book, we will teach you
+the fundamentals of machine learning,
+focusing in particular on *deep learning*,
 a powerful set of techniques
 driving innovations in areas as diverse as computer vision,
 natural language processing, healthcare, and genomics.
@@ -98,10 +130,10 @@ with nothing but a computer and a code editor,
 as illustrated in :numref:`fig_wake_word`.
 How would you write such a program from first principles?
 Think about it... the problem is hard.
-Every second, the microphone will collect roughly 
+Every second, the microphone will collect roughly
 44000 samples.
 Each sample is a measurement of the amplitude of the sound wave.
-What rule could map reliably from a snippet of raw audio to confident predictions 
+What rule could map reliably from a snippet of raw audio to confident predictions
 $\{\text{yes}, \text{no}\}$
 on whether the snippet contains the wake word?
 If you are stuck, do not worry.
@@ -120,17 +152,16 @@ In other words, even if you do not know
 how to program a computer to recognize the word "Alexa",
 you yourself are able to recognize it.
 Armed with this ability, we can collect a huge *dataset*
-containing examples of audio 
-and label those that do
-and that do not contain the wake word.
-In the machine learning approach, 
+containing examples of audio snippets and associated labels,
+indicating which snippets contain the wake word.
+In the dominant approach to machine learning,
 we do not attempt to design a system
 *explicitly* to recognize wake words.
 Instead, we define a flexible program
 whose behavior is determined by a number of *parameters*.
-Then we use the dataset to determine the best possible set of parameters, 
-those that improve the performance of our program
-with respect to some measure of performance on the task of interest.
+Then we use the dataset to determine the best possible parameter values,
+i.e., those that make our program perform best
+with respect to a chosen performance measure.
 
 You can think of the parameters as knobs that we can turn,
 manipulating the behavior of the program.
@@ -145,14 +176,14 @@ Before we can go ahead and engage the learning algorithm,
 we have to define the problem precisely,
 pinning down the exact nature of the inputs and outputs,
 and choosing an appropriate model family.
-In this case, 
+In this case,
 our model receives a snippet of audio as *input*,
-and the model 
-generates a selection among 
+and the model
+generates a selection among
 $\{\text{yes}, \text{no}\}$ as *output*.
-If all goes according to plan 
+If all goes according to plan
 the model's guesses will
-typically be correct as to 
+typically be correct as to
 whether the snippet contains the wake word.
 
 If we choose the right family of models,
@@ -173,7 +204,7 @@ or from English sentences to Chinese sentences.
 As you might guess, if we just set all of the knobs randomly,
 it is unlikely that our model will recognize "Alexa",
 "Apricot", or any other English word.
-In machine learning, 
+In machine learning,
 the *learning* is the process
 by which we discover the right setting of the knobs
 coercing the desired behavior from our model.
@@ -183,45 +214,47 @@ As shown in :numref:`fig_ml_loop`, the training process usually looks like the f
 
 1. Start off with a randomly initialized model that cannot do anything useful.
 1. Grab some of your data (e.g., audio snippets and corresponding $\{\text{yes}, \text{no}\}$ labels).
-1. Tweak the knobs so the model sucks less with respect to those examples.
-1. Repeat Step 2 and 3 until the model is awesome.
+1. Tweak the knobs to make the model perform better as assessed on those examples.
+1. Repeat Steps 2 and 3 until the model is awesome.
 
 ![A typical training process.](../img/ml-loop.svg)
 :label:`fig_ml_loop`
 
 To summarize, rather than code up a wake word recognizer,
 we code up a program that can *learn* to recognize wake words,
-if we present it with a large labeled dataset.
+if presented with a large labeled dataset.
 You can think of this act of determining a program's behavior
 by presenting it with a dataset as *programming with data*.
-That is to say,
-we can "program" a cat detector by providing our machine learning system
+That is to say, we can "program" a cat detector
+by providing our machine learning system
 with many examples of cats and dogs.
-This way the detector will eventually learn to emit a very large positive number if it is a cat, a very large negative number if it is a dog,
-and something closer to zero if it is not sure,
-and this barely scratches the surface of what machine learning can do.
-Deep learning,
-which we will explain in greater detail later,
+This way the detector will eventually learn to emit
+a very large positive number if it is a cat,
+a very large negative number if it is a dog,
+and something closer to zero if it is not sure.
+This barely scratches the surface of what machine learning can do.
+Deep learning, which we will explain in greater detail later,
 is just one among many popular methods
 for solving machine learning problems.
 
+
 ## Key Components
 
 In our wake word example, we described a dataset
-consisting of audio snippets and binary labels, 
-and we
-gave a hand-wavy sense of how we might train
+consisting of audio snippets and binary labels,
+and we gave a hand-wavy sense of how we might train
 a model to approximate a mapping from snippets to classifications.
-This sort of problem, 
+This sort of problem,
 where we try to predict a designated unknown label
 based on known inputs
 given a dataset consisting of examples
-for which the labels are known, 
+for which the labels are known,
 is called *supervised learning*.
 This is just one among many kinds of machine learning problems.
-Later we will take a deep dive into different machine learning problems.
-First, we would like to shed more light on some core components
-that will follow us around, no matter what kind of machine learning problem we take on:
+Before we explore other varieties,
+we would like to shed more light
+on some core components that will follow us around,
+no matter what kind of machine learning problem we take on:
 
 1. The *data* that we can learn from.
 1. A *model* of how to transform the data.
@@ -231,55 +264,67 @@ that will follow us around, no matter what kind of machine learning problem we t
 ### Data
 
 It might go without saying that you cannot do data science without data.
-We could lose hundreds of pages pondering what precisely constitutes data,
-but for now, we will err on the practical side
-and focus on the key properties to be concerned with.
+We could lose hundreds of pages pondering what precisely data *is*,
+but for now, we will focus on the key properties
+of the datasets that we will be concerned with.
 Generally, we are concerned with a collection of examples.
-In order to work with data usefully, 
-we typically
+In order to work with data usefully, we typically
 need to come up with a suitable numerical representation.
-Each *example* (or *data point*, *data instance*, *sample*) typically consists of a set
-of attributes called *features* (or *covariates*),
-from which the model must make its predictions.
-In the supervised learning problems above,
-the thing to predict
-is a special attribute 
-that is designated as
-the *label* (or *target*).
-
+Each *example* (or *data point*, *data instance*, *sample*)
+typically consists of a set of attributes
+called *features* (also known as *covariates* or *inputs*),
+based on which the model must make its predictions.
+In supervised learning problems,
+our goal is to predict the value of a special attribute,
+called the *label* (or *target*),
+that is not part of the model's input.
 
 If we were working with image data,
-each individual photograph might constitute an example,
-each represented by an ordered list of numerical values
-corresponding to the brightness of each pixel.
-A $200\times 200$ color photograph would consist of $200\times200\times3=120000$
-numerical values, corresponding to the brightness
-of the red, green, and blue channels for each spatial location.
-In another traditional task, we might try to predict
-whether or not a patient will survive,
-given a standard set of features such as
-age, vital signs, and diagnoses.
-
-When every example is characterized by the same number of numerical values,
-we say that the data consist of fixed-length vectors
-and we describe the constant length of the vectors
-as the *dimensionality* of the data.
-As you might imagine, fixed-length can be a convenient property.
-If we wanted to train a model to recognize cancer in microscopy images,
-fixed-length inputs mean we have one less thing to worry about.
-
-However, not all data can easily be represented as 
-*fixed-length* vectors.
-While we might expect microscope images to come from standard equipment,
+each example might consist of an
+individual photograph (the features)
+and a number indicating the category
+to which the photograph belongs (the label).
+The photograph would be represented numerically
+as three grids of numerical values representing
+the brightness of red, green, and blue light
+at each pixel location.
+For example, a $200\times 200$ color photograph
+would consist of $200\times200\times3=120000$ numerical values.
+
+Alternatively, we might work with electronic health record data
+and tackle the task of predicting the likelihood
+that a given patient will survive the next 30 days.
+Here, our features might consist of a collection
+of readily available attributes
+and frequently recorded measurements,
+including age, vital signs, comorbidities,
+current medications, and recent procedures.
+The label available for training would be a binary value
+indicating whether each patient in the historical data
+survived the 30-day window.
+
+In such cases, when every example is characterized
+by the same number of numerical features,
+we say that the inputs are fixed-length vectors
+and we call the (constant) length of the vectors
+the *dimensionality* of the data.
+As you might imagine, fixed-length inputs can be convenient,
+giving us one less complication to worry about.
+However, not all data can easily
+be represented as *fixed-length* vectors.
+While we might expect microscope images
+to come from standard equipment,
 we cannot expect images mined from the Internet
 to all show up with the same resolution or shape.
-For images, we might consider cropping them all to a standard size,
+For images, we might consider
+cropping them all to a standard size,
 but that strategy only gets us so far.
 We risk losing information in the cropped out portions.
-Moreover, text data resist fixed-length representations even more stubbornly.
-Consider the customer reviews left on e-commerce sites
-such as Amazon, IMDB, and TripAdvisor.
-Some are short: "it stinks!". 
+Moreover, text data resists fixed-length
+representations even more stubbornly.
+Consider the customer reviews left
+on e-commerce sites such as Amazon, IMDb, and TripAdvisor.
+Some are short: "it stinks!".
 Others ramble for pages.
 One major advantage of deep learning over traditional methods
 is the comparative grace with which modern models
@@ -287,34 +332,41 @@ can handle *varying-length* data.
 
 Generally, the more data we have, the easier our job becomes.
 When we have more data, we can train more powerful models
-and rely less heavily on pre-conceived assumptions.
+and rely less heavily on preconceived assumptions.
 The regime change from (comparatively) small to big data
 is a major contributor to the success of modern deep learning.
-To drive the point home, many of the most exciting models in deep learning do not work without large datasets.
+To drive the point home, many of
+the most exciting models in deep learning
+do not work without large datasets.
 Some others work in the small data regime,
 but are no better than traditional approaches.
 
-Finally, it is not enough to have lots of data and to process it cleverly.
-We need the *right* data. 
-If the data are full of mistakes,
+Finally, it is not enough to have lots of data
+and to process it cleverly.
+We need the *right* data.
+If the data is full of mistakes,
 or if the chosen features are not predictive
-of the target quantity of interest, 
+of the target quantity of interest,
 learning is going to fail.
 The situation is captured well by the cliché:
 *garbage in, garbage out*.
-Moreover, poor predictive performance is not the only potential consequence.
+Moreover, poor predictive performance
+is not the only potential consequence.
 In sensitive applications of machine learning,
-like predictive policing, resume screening, and risk models used for lending,
-we must be especially alert to the consequences of garbage data.
-One common failure mode occurs in datasets where some groups of people
-are unrepresented in the training data.
+like predictive policing, resume screening,
+and risk models used for lending,
+we must be especially alert
+to the consequences of garbage data.
+One common failure mode occurs in datasets
+where some groups of people are unrepresented
+in the training data.
 Imagine applying a skin cancer recognition system in the wild
 that had never seen black skin before.
 Failure can also occur when the data
-do not merely under-represent some groups
-but reflect societal prejudices.
-For example, 
-if past hiring decisions are used to train a predictive model
+does not merely under-represent some groups
+but reflects societal prejudices.
+For example, if past hiring decisions
+are used to train a predictive model
 that will be used to screen resumes,
 then machine learning models could inadvertently
 capture and automate historical injustices.
@@ -330,7 +382,7 @@ Alternatively,
 we might want to ingest a set of sensor readings
 and predict how normal vs. anomalous the readings are.
 By *model*, we denote the computational machinery for ingesting data
-of one type, 
+of one type,
 and spitting out predictions of a possibly different type.
 In particular, we are interested in statistical models
 that can be estimated from data.
@@ -361,7 +413,7 @@ In machine learning, and optimization more generally,
 we call these *objective functions*.
 By convention, we usually define objective functions
 so that lower is better.
-This is merely a convention. 
+This is merely a convention.
 You can take any function
 for which higher is better, and turn it into a new function
 that is qualitatively identical but for which lower is better
@@ -371,18 +423,20 @@ Because lower is better, these functions are sometimes called
 
 When trying to predict numerical values,
 the most common loss function is *squared error*,
-i.e., the square of the difference between the prediction and the ground-truth.
-For classification, the most common objective is to minimize error rate,
+i.e., the square of the difference between
+the prediction and the ground truth target.
+For classification, the most common objective
+is to minimize error rate,
 i.e., the fraction of examples on which
 our predictions disagree with the ground truth.
-Some objectives (e.g., squared error) are easy to optimize.
-Others (e.g., error rate) are difficult to optimize directly,
+Some objectives (e.g., squared error) are easy to optimize,
+while others (e.g., error rate) are difficult to optimize directly,
 owing to non-differentiability or other complications.
 In these cases, it is common to optimize a *surrogate objective*.
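+
+As a compact restatement of the two objectives above
+(the notation is ours, introduced only for illustration:
+$y_i$ is the ground truth target for example $i$,
+$\hat{y}_i$ is the model's prediction,
+$n$ is the number of examples,
+and $\mathbb{1}[\cdot]$ equals 1 when its argument is true and 0 otherwise),
+the squared error on one example and the error rate over a dataset
+can be written as
+
+$$\ell_i = (\hat{y}_i - y_i)^2 \quad \text{and} \quad \text{error rate} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[\hat{y}_i \neq y_i].$$
+
+The indicator is piecewise constant in the model's parameters,
+which is one way to see why the error rate
+offers no useful gradient to follow.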
 
-Typically, the loss function is defined
-with respect to the model's parameters
-and depends upon the dataset.
+During optimization, we think of the loss
+as a function of the model's parameters,
+and treat the training dataset as a constant.
 We learn
 the best values of our model's parameters
 by minimizing the loss incurred on a set
@@ -390,21 +444,23 @@ consisting of some number of examples collected for training.
 However, doing well on the training data
 does not guarantee that we will do well on unseen data.
 So we will typically want to split the available data into two partitions:
-the *training dataset* (or *training set*, for fitting model parameters)
-and the *test dataset* (or *test set*, which is held out for evaluation),
-reporting how the model performs on both of them.
-You could think of training performance as being like
-a student's scores on practice exams
-used to prepare for some real final exam.
+the *training dataset* (or *training set*), for learning model parameters;
+and the *test dataset* (or *test set*), which is held out for evaluation.
+At the end of the day, we typically report
+how our models perform on both partitions.
+You could think of training performance
+as analogous to the scores that a student achieves
+on the practice exams used to prepare for some real final exam.
 Even if the results are encouraging,
 that does not guarantee success on the final exam.
-In other words,
-the test performance can deviate significantly from the training performance. 
+Over the course of studying, the student
+might begin to memorize the practice questions,
+appearing to master the topic but faltering
+when faced with previously unseen questions
+on the actual final exam.
 When a model performs well on the training set
 but fails to generalize to unseen data,
-we say that it is *overfitting*.
-In real-life terms, this is like flunking the real exam
-despite doing well on practice exams.
+we say that it is *overfitting* to the training data.
 
 
 ### Optimization Algorithms
@@ -415,51 +471,41 @@ we need an algorithm capable of searching
 for the best possible parameters for minimizing the loss function.
 Popular optimization algorithms for deep learning
 are based on an approach called *gradient descent*.
-In short, at each step, this method 
+In short, at each step, this method
 checks to see, for each parameter,
 which way the training set loss would move
 if you perturbed that parameter just a small amount.
-It then updates
-the parameter in the direction that may reduce the loss.
+It then updates the parameter
+in the direction that lowers the loss.
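+
+In symbols, one such step might be sketched as follows,
+where the notation is an assumption made for illustration:
+$\theta$ collects the parameters, $L(\theta)$ is the training set loss,
+and $\eta > 0$ is a small step size (often called the *learning rate*):
+
+$$\theta \leftarrow \theta - \eta \nabla_{\theta} L(\theta).$$
+
+Each component of the gradient $\nabla_{\theta} L(\theta)$
+records how the loss responds to a small change in one parameter,
+so stepping against it moves every parameter
+in the direction that lowers the loss, provided $\eta$ is small enough.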
+
 
 ## Kinds of Machine Learning Problems
 
 The wake word problem in our motivating example
-is just one among
-many problems that machine learning can tackle.
+is just one among many problems
+that machine learning can tackle.
 To motivate the reader further
-and provide us with some common language when we talk about more problems throughout the book,
-in the following we 
-list a sampling of machine learning problems.
-We will constantly refer to
-our aforementioned concepts 
-such as data, models, and training techniques.
+and provide us with some common language
+that will follow us throughout the book,
+we now provide a broad overview of the landscape
+of machine learning problem formulations.
 
 ### Supervised Learning
 
-Supervised learning addresses the task of
-predicting labels given input features.
+Supervised learning describes tasks
+where we are given a dataset
+containing both features and labels
+and asked to produce a model
+that predicts the labels given input features.
 Each feature--label pair is called an example.
-Sometimes, when the context is clear, we may use the term *examples*
+Sometimes, when the context is clear,
+we may use the term *examples*
 to refer to a collection of inputs,
 even when the corresponding labels are unknown.
-Our goal is to produce a model
-that maps any input to a label prediction.
-
-
-To ground this description in a concrete example,
-if we were working in healthcare,
-then we might want to predict whether or not
-a patient would have a heart attack.
-This observation, "heart attack" or "no heart attack",
-would be our label.
-The input features might be vital signs
-such as heart rate, diastolic blood pressure, 
-and systolic blood pressure.
-
-The supervision comes into play because for choosing the parameters, we (the supervisors) provide the model with a dataset
-consisting of labeled examples,
-where each example is matched with the ground-truth label.
+The supervision comes into play
+because for choosing the parameters,
+we (the supervisors) provide the model
+with a dataset consisting of labeled examples.
 In probabilistic terms, we typically are interested in estimating
 the conditional probability of a label given input features.
 While it is just one among several paradigms within machine learning,
@@ -473,16 +519,18 @@ of something unknown given a particular set of available data:
 * Predict the correct translation in French, given a sentence in English.
 * Predict the price of a stock next month based on this month's financial reporting data.
 
-Even with the simple description
-"predicting labels given input features"
-supervised learning can take a great many forms
-and require a great many modeling decisions,
-depending on (among other considerations) the type, size,
-and the number of inputs and outputs.
-For example, we use different models to process sequences of arbitrary lengths
+While all supervised learning problems
+are captured by the simple description
+"predicting the labels given input features",
+supervised learning can take diverse forms
+and require tons of modeling decisions,
+depending on (among other considerations)
+the type, size, and quantity of the inputs and outputs.
+For example, we use different models
+to process sequences of arbitrary lengths
-and for processing fixed-length vector representations.
+and to process fixed-length vector representations.
-We will visit many of these problems in depth
-throughout this book.
+We will visit many of these problems
+in depth throughout this book.
 
 Informally, the learning process looks something like the following.
 First, grab a big collection of examples for which the features are known
@@ -509,11 +557,12 @@ Perhaps the simplest supervised learning task
 to wrap your head around is *regression*.
 Consider, for example, a set of data harvested
 from a database of home sales.
-We might construct a table, 
+We might construct a table,
 where each row corresponds to a different house,
 and each column corresponds to some relevant attribute,
-such as the square footage of a house, 
-the number of bedrooms, the number of bathrooms, and the number of minutes (walking) to the center of town.
+such as the square footage of a house,
+the number of bedrooms, the number of bathrooms,
+and the number of minutes (walking) to the center of town.
 In this dataset, each example would be a specific house,
 and the corresponding feature vector would be one row in the table.
 If you live in New York or San Francisco,
@@ -521,30 +570,33 @@ and you are not the CEO of Amazon, Google, Microsoft, or Facebook,
 the (sq. footage, no. of bedrooms, no. of bathrooms, walking distance)
 feature vector for your home might look something like: $[600, 1, 1, 60]$.
 However, if you live in Pittsburgh, it might look more like $[3000, 4, 3, 10]$.
-Feature vectors like this are essential
+Fixed-length feature vectors like this are essential
 for most classic machine learning algorithms.
 
-What makes a problem a regression is actually the output.
+What makes a problem a regression is actually
+the form of the target.
 Say that you are in the market for a new home.
 You might want to estimate the fair market value of a house,
 given some features like above.
-The label, the price of sale, is a numerical value.
-When labels take on arbitrary numerical values,
+The data here might consist of historical home listings
+and the labels might be the observed sales prices.
+When labels take on arbitrary numerical values
+(even within some interval),
 we call this a *regression* problem.
-Our goal is to produce a model whose predictions
+The goal is to produce a model whose predictions
 closely approximate the actual label values.
 
 
-Lots of practical problems are well-described regression problems.
+Lots of practical problems are easily described as regression problems.
 Predicting the rating that a user will assign to a movie
 can be thought of as a regression problem
-and if you designed a great algorithm to accomplish this feat in 2009,
+and if you designed a great algorithm
+to accomplish this feat in 2009,
 you might have won the [1-million-dollar Netflix prize](https://en.wikipedia.org/wiki/Netflix_Prize).
 Predicting the length of stay for patients in the hospital
 is also a regression problem.
 A good rule of thumb is that any *how much?* or *how many?* problem
-should suggest regression,
-such as:
+should suggest regression, for example:
 
 * How many hours will this surgery take?
 * How much rainfall will this town have in the next six hours?
@@ -567,44 +619,48 @@ and that the contractor then charges per hour.
 If these assumptions held true, then given these two data examples,
 you could already identify the contractor's pricing structure:
 100 dollars per hour plus 50 dollars to show up at your house.
-If you followed that much then you already understand
+If you followed that much, then you already understand
 the high-level idea behind linear regression.
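+
+To make the arithmetic concrete, here is a worked sketch.
+The two bills are assumed figures chosen to match the pricing above
+(350 dollars for 3 hours and 250 dollars for 2 hours),
+and the symbols $w$ (hourly rate) and $b$ (call-out fee) are our own notation:
+
+$$3w + b = 350, \quad 2w + b = 250 \quad \Rightarrow \quad w = 350 - 250 = 100, \quad b = 350 - 3 \cdot 100 = 50.$$
+
+Subtracting the second equation from the first eliminates $b$,
+and substituting $w$ back recovers the fixed fee.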
 
 In this case, we could produce the parameters
 that exactly matched the contractor's prices.
-Sometimes this is not possible, 
-e.g., if some of
-the variance owes to a few factors 
+Sometimes this is not possible,
+e.g., if some of the variance
+is due to a few factors
 besides your two features.
 In these cases, we will try to learn models
 that minimize the distance between our predictions and the observed values.
-In most of our chapters, we will focus on 
+In most of our chapters, we will focus on
 minimizing the squared error loss function.
 As we will see later, this loss corresponds to the assumption
 that our data were corrupted by Gaussian noise.
 
 #### Classification
 
-While regression models are great for addressing *how many?* questions,
+While regression models are great
+for addressing *how many?* questions,
 lots of problems do not bend comfortably to this template.
-For example,
-a bank wants to add check scanning to its mobile app.
-This would involve the customer snapping a photo of a check
-with their smart phone's camera
-and the app would need to be able
-to automatically understand text seen in the image.
-Specifically,
-it would also need to understand handwritten text to be even more robust,
-such as mapping a handwritten character
-to one of the known characters.
-This kind of *which one?* problem is called *classification*.
-It is treated with a different set of algorithms
-than those used for regression although many techniques will carry over.
+Consider, for example, a bank that wants
+to develop a check scanning feature for its mobile app.
+Ideally, the customer would simply snap a photo of a check
+and the app would automatically recognize the text from the image.
+Assuming that we had some ability
+to segment out image patches
+corresponding to each handwritten character,
+then the primary remaining task would be
+to determine which character among some known set
+is depicted in each image patch.
+These kinds of *which one?* problems are called *classification*
+and require a different set of tools
+than those used for regression,
+although many techniques will carry over.
 
 In *classification*, we want our model to look at features,
 e.g., the pixel values in an image,
-and then predict which *category* (formally called *class*),
-among some discrete set of options, an example belongs.
+and then predict to which *category*
+(sometimes called a *class*),
+among some discrete set of options,
+an example belongs.
 For handwritten digits, we might have ten classes,
 corresponding to the digits 0 through 9.
 The simplest form of classification is when there are only two classes,
@@ -612,17 +668,18 @@ a problem which we call *binary classification*.
 For example, our dataset could consist of images of animals
 and our labels  might be the classes $\mathrm{\{cat, dog\}}$.
 While in regression, we sought a regressor to output a numerical value,
-in classification, we seek a classifier, whose output is the predicted class assignment.
+in classification, we seek a classifier,
+whose output is the predicted class assignment.
 
 For reasons that we will get into as the book gets more technical,
 it can be hard to optimize a model that can only output
-a hard categorical assignment, 
+a hard categorical assignment,
 e.g., either "cat" or "dog".
 In these cases, it is usually much easier to instead express
 our model in the language of probabilities.
-Given features of an example, 
+Given features of an example,
 our model assigns a probability
-to each possible class. 
+to each possible class.
 Returning to our animal classification example
 where the classes are $\mathrm{\{cat, dog\}}$,
 a classifier might see an image and output the probability
@@ -641,7 +698,7 @@ $\mathrm{\{0, 1, 2, ... 9, a, b, c, ...\}}$.
 While we attacked regression problems by trying
 to minimize the squared error loss function,
 the common loss function for classification problems is called *cross-entropy*,
-whose name can be demystified 
+whose name can be demystified
 via an introduction to information theory in subsequent chapters.
 
 Note that the most likely class is not necessarily
@@ -649,12 +706,12 @@ the one that you are going to use for your decision.
 Assume that you find a beautiful mushroom in your backyard
 as shown in :numref:`fig_death_cap`.
 
-![Death cap---do not eat!](../img/death-cap.jpg)
+![Death cap - do not eat!](../img/death-cap.jpg)
 :width:`200px`
 :label:`fig_death_cap`
 
 Now, assume that you built a classifier and trained it
-to predict if a mushroom is poisonous based on a photograph.
+to predict whether a mushroom is poisonous based on a photograph.
 Say our poison-detection classifier outputs
 that the probability that
 :numref:`fig_death_cap` contains a death cap is 0.2.
@@ -665,36 +722,40 @@ That is because the certain benefit of a delicious dinner
 is not worth a 20\% risk of dying from it.
 In other words, the effect of the uncertain risk
 outweighs the benefit by far.
-Thus, we need to compute the expected risk that we incur as the loss function,
-i.e., we need to multiply the probability of the outcome
-with the benefit (or harm) associated with it.
-In this case,
-the loss incurred by eating the mushroom
-can be $0.2 \times \infty + 0.8 \times 0 = \infty$,
-whereas the loss of discarding it is
-$0.2 \times 0 + 0.8 \times 1 = 0.8$.
+Thus, in order to make a decision about whether to eat the mushroom,
+we need to compute the expected disutility
+associated with each action,
+which depends both on the likely outcomes
+and the benefits or harms associated with each.
+In this case, the disutility incurred
+by eating the mushroom
+might be $0.2 \times \infty + 0.8 \times 0 = \infty$,
+whereas the disutility of discarding it
+is $0.2 \times 0 + 0.8 \times 1 = 0.8$.
 Our caution was justified:
 as any mycologist would tell us,
-the mushroom in :numref:`fig_death_cap` actually
-is a death cap.
+the mushroom in :numref:`fig_death_cap`
+is actually a death cap.
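
The arithmetic in this paragraph can be sketched in a few lines of Python. The table below only encodes the numbers quoted in the text (an infinite harm for eating a death cap, a harm of 1 for throwing away an edible mushroom). The names `p_death_cap`, `harm`, and `expected_harm` are illustrative assumptions and do not come from the book's code.

```python
import math

# Probability that the mushroom is a death cap, as output by the classifier.
p_death_cap = 0.2

# harm[action][outcome]: harm incurred by each action under each outcome.
# The values mirror the numbers in the text; they are not measured quantities.
harm = {
    "eat":     {"death_cap": math.inf, "edible": 0.0},
    "discard": {"death_cap": 0.0,      "edible": 1.0},
}

def expected_harm(action, p_poison):
    """Probability-weighted average of the harm over both outcomes."""
    return (p_poison * harm[action]["death_cap"]
            + (1 - p_poison) * harm[action]["edible"])

# Evaluate every action and pick the one with the smallest expected harm.
risks = {action: expected_harm(action, p_death_cap) for action in harm}
best_action = min(risks, key=risks.get)
print(risks, best_action)
```

Note that the decision is driven by the harm table as much as by the probability: with a finite, small harm for eating a poisonous mushroom, the same 0.2 probability could lead to the opposite choice.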
 
 Classification can get much more complicated than just
-binary, multiclass, or even multi-label classification.
+binary or multiclass classification.
 For instance, there are some variants of classification
-for addressing hierarchies.
-Hierarchies assume that there exist some relationships among the many classes.
-So not all errors are equal---if we must err, we would prefer
-to misclassify to a related class rather than to a distant class.
+addressing hierarchically structured classes.
+In such cases not all errors are equal---if
+we must err, we might prefer to misclassify
+to a related class rather than a distant class.
 Usually, this is referred to as *hierarchical classification*.
-One early example is due to [Linnaeus](https://en.wikipedia.org/wiki/Carl_Linnaeus), who organized the animals in a hierarchy.
+For inspiration, you might think of [Linnaeus](https://en.wikipedia.org/wiki/Carl_Linnaeus),
+who organized the animals in a hierarchy.
 
 In the case of animal classification,
-it might not be so bad to mistake a poodle (a dog breed) for a schnauzer (another dog breed),
+it might not be so bad to mistake
+a poodle for a schnauzer,
 but our model would pay a huge penalty
 if it confused a poodle for a dinosaur.
 Which hierarchy is relevant might depend
 on how you plan to use the model.
-For example, rattle snakes and garter snakes
+For example, rattlesnakes and garter snakes
 might be close on the phylogenetic tree,
 but mistaking a rattler for a garter could be deadly.
 
@@ -710,18 +771,18 @@ Nonetheless, no matter how accurate our model gets,
 we might find ourselves in trouble when the classifier
 encounters an image of the *Town Musicians of Bremen*,
 a popular German fairy tale featuring four animals
-in :numref:`fig_stackedanimals`.
+(:numref:`fig_stackedanimals`).
 
 ![A donkey, a dog, a cat, and a rooster.](../img/stackedanimals.png)
 :width:`300px`
 :label:`fig_stackedanimals`
 
-As you can see, there is a cat in :numref:`fig_stackedanimals`,
-and a rooster, a dog, and a donkey,
+As you can see, the photo features a cat,
+a rooster, a dog, and a donkey,
 with some trees in the background.
-Depending on what we want to do with our model
-ultimately, treating this as a binary classification problem
-might not make a lot of sense.
+When we anticipate encountering such images,
+multiclass classification might not be
+the right problem formulation.
 Instead, we might want to give the model the option of
 saying the image depicts a cat, a dog, a donkey,
 *and* a rooster.
@@ -730,59 +791,57 @@ The problem of learning to predict classes that are
 not mutually exclusive is called *multi-label classification*.
 Auto-tagging problems are typically best described
 as multi-label classification problems.
-Think of the tags people might apply to posts on a technical blog,
+Think of the tags people might apply
+to posts on a technical blog,
 e.g., "machine learning", "technology", "gadgets",
 "programming languages", "Linux", "cloud computing", "AWS".
-A typical article might have 5--10 tags applied
-because these concepts are correlated.
+A typical article might have 5--10 tags applied.
+Typically, tags will exhibit some correlation structure.
 Posts about "cloud computing" are likely to mention "AWS"
-and posts about "machine learning" could also deal
-with "programming languages".
-
-We also have to deal with this kind of problem when dealing
-with the biomedical literature, where correctly tagging articles is important
-because it allows researchers to do exhaustive reviews of the literature.
-At the National Library of Medicine, a number of professional annotators
-go over each article that gets indexed in PubMed
-to associate it with the relevant terms from MeSH,
+and posts about "machine learning" are likely to mention "GPUs".
+
+Sometimes such tagging problems
+draw on enormous label sets.
+The National Library of Medicine
+employs many professional annotators
+who associate each article to be indexed in PubMed
+with a set of tags drawn from the
+Medical Subject Headings (MeSH) ontology,
 a collection of roughly 28000 tags.
+Correctly tagging articles is important
+because it allows researchers to conduct
+exhaustive reviews of the literature.
 This is a time-consuming process and the
 annotators typically have a one-year lag between archiving and tagging.
-Machine learning can be used here to provide provisional tags
+Machine learning can provide provisional tags
 until each article can have a proper manual review.
 Indeed, for several years, the BioASQ organization
-has [hosted competitions](http://bioasq.org/) to do precisely this.
+has [hosted competitions](http://bioasq.org/)
+for this task.
 
-#### Search 
+#### Search
 
-Sometimes we do not just want to assign each example to a bucket
-or to a real value. In the field of information retrieval,
-we want to impose a ranking on a set of items.
-Take web search for an example. 
-The goal is less to determine whether
+In the field of information retrieval,
+we often impose rankings over sets of items.
+Take web search for example.
+The goal is less to determine *whether*
 a particular page is relevant for a query, but rather,
-which one of the plethora of search results is
-most relevant
-for a particular user.
-We really care about the ordering of the relevant search results
-and our learning algorithm needs to produce ordered subsets
-of elements from a larger set.
-In other words, if we are asked to produce the first 5 letters from the alphabet, there is a difference
-between returning "A B C D E" and "C A B E D".
-Even if the result set is the same,
-the ordering within the set matters.
-
-One possible solution to this problem is to first assign
-to every element in the set a corresponding relevance score
+which, among a set of relevant results,
+should be shown most prominently
+to a particular user.
+One possible solution might be
+to first assign a score
+to every element in the set
 and then to retrieve the top-rated elements.
 [PageRank](https://en.wikipedia.org/wiki/PageRank),
-the original secret sauce behind the Google search engine
-was an early example of such a scoring system but it was
-peculiar in that it did not depend on the actual query.
-Here they relied on a simple relevance filter
-to identify the set of relevant items
-and then on PageRank to order those results
-that contained the query term.
+the original secret sauce behind the Google search engine,
+was an early example of such a scoring system.
+Peculiarly, the scoring provided by PageRank
+did not depend on the actual query.
+Instead, the search engine relied on a simple relevance filter
+to identify the set of relevant candidates
+and then used PageRank to prioritize
+the more authoritative pages.
 Nowadays, search engines use machine learning and behavioral models
 to obtain query-dependent relevance scores.
 There are entire academic conferences devoted to this subject.
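
The two-stage scheme described above (filter for relevance, then order by a score that ignores the query) might be sketched as follows. This is not PageRank itself: the `authority` numbers, the URLs, and the `search` helper are all made up for illustration.

```python
# Hypothetical corpus: page text plus a query-independent authority score.
pages = [
    {"url": "a.example", "text": "deep learning tutorial", "authority": 0.30},
    {"url": "b.example", "text": "cooking with mushrooms", "authority": 0.90},
    {"url": "c.example", "text": "learning to rank", "authority": 0.55},
]

def search(query, pages, k=2):
    # Step 1: simple relevance filter, keep pages containing the query term.
    candidates = [p for p in pages if query in p["text"].split()]
    # Step 2: order the candidates by a score that does not depend on the query.
    ranked = sorted(candidates, key=lambda p: p["authority"], reverse=True)
    # Return the top-k results.
    return [p["url"] for p in ranked[:k]]

results = search("learning", pages)
print(results)
```

The page with the highest authority score is never returned here because it fails the relevance filter, which is the sense in which the scoring and the query are decoupled.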
@@ -794,37 +853,41 @@ Recommender systems are another problem setting
 that is related to search and ranking.
 The problems are similar insofar as the goal
 is to display a set of relevant items to the user.
-The main difference is the emphasis on
-*personalization*
+The main difference is the emphasis on *personalization*
 to specific users in the context of recommender systems.
 For instance, for movie recommendations,
 the results page for a science fiction fan
 and the results page
-for a connoisseur of Peter Sellers comedies might differ significantly.
+for a connoisseur of Peter Sellers comedies
+might differ significantly.
 Similar problems pop up in other recommendation settings,
 e.g., for retail products, music, and news recommendation.
 
-In some cases, customers provide explicit feedback communicating
-how much they liked a particular product
-(e.g., the product ratings and reviews on Amazon, IMDb, and Goodreads).
-In some other cases, they provide implicit feedback,
+In some cases, customers provide explicit feedback,
+communicating how much they liked a particular product
+(e.g., the product ratings and reviews
+on Amazon, IMDb, and Goodreads).
+In other cases, they provide implicit feedback,
 e.g., by skipping titles on a playlist,
-which might indicate dissatisfaction but might just indicate
+which might indicate dissatisfaction,
+or might just indicate
 that the song was inappropriate in context.
-In the simplest formulations, these systems are trained
+In the simplest formulations,
+these systems are trained
 to estimate some score,
-such as an estimated rating
-or the probability of purchase,
-given a user and an item.
+such as an expected star rating
+or the probability that a given user
+will purchase a particular item.
 
-Given such a model, 
-for any given user,
+Given such a model, for any given user,
 we could retrieve the set of objects with the largest scores,
 which could then be recommended to the user.
-Production systems are considerably more advanced and take
-detailed user activity and item characteristics into account
-when computing such scores. :numref:`fig_deeplearning_amazon` is an example
-of deep learning books recommended by Amazon based on personalization algorithms tuned to capture one's preferences.
+Production systems are considerably more advanced
+and take detailed user activity and item characteristics
+into account when computing such scores.
+:numref:`fig_deeplearning_amazon` displays the deep learning books
+recommended by Amazon based on personalization algorithms
+tuned to capture Aston's preferences.
 
 ![Deep learning books recommended by Amazon.](../img/deeplearning-amazon.jpg)
 :label:`fig_deeplearning_amazon`
@@ -834,10 +897,11 @@ recommendation systems
 naively built on top of predictive models
 suffer some serious conceptual flaws.
 To start, we only observe *censored feedback*:
-users preferentially rate movies that they feel strongly about.
-For example, 
-on a five-point scale,
-you might notice that items receive many five and one star ratings
+users preferentially rate movies
+that they feel strongly about.
+For example, on a five-point scale,
+you might notice that items receive
+many one- and five-star ratings
 but that there are conspicuously few three-star ratings.
 Moreover, current purchase habits are often a result
 of the recommendation algorithm currently in place,
@@ -846,62 +910,77 @@ Thus it is possible for feedback loops to form
 where a recommender system preferentially pushes an item
 that is then taken to be better (due to greater purchases)
 and in turn is recommended even more frequently.
-Many of these problems about how to deal with censoring,
-incentives, and feedback loops, are important open research questions.
+Many of these problems,
+such as how to deal with censoring,
+incentives, and feedback loops,
+are important open research questions.
 
 #### Sequence Learning
 
 So far, we have looked at problems where we have
 some fixed number of inputs and produce a fixed number of outputs.
-For example,
-we considered predicting house prices from a fixed set of features: square footage, number of bedrooms,
-number of bathrooms, walking time to downtown.
+For example, we considered predicting house prices
+given a fixed set of features:
+square footage, number of bedrooms,
+number of bathrooms, and the transit time to downtown.
 We also discussed mapping from an image (of fixed dimension)
-to the predicted probabilities that it belongs to each
-of a fixed number of classes, or taking a user ID and a product ID,
-and predicting a star rating. In these cases,
-once we feed our fixed-length input
-into the model to generate an output,
-the model immediately forgets what it just saw.
-
-This might be fine if our inputs truly all have the same dimensions
-and if successive inputs truly have nothing to do with each other.
-But how would we deal with video snippets?
+to the predicted probabilities that it belongs
+to each of a fixed number of classes,
+and predicting star ratings associated with purchases
+based on the user ID and product ID alone.
+In these cases, once our model is trained,
+each test example is forgotten
+as soon as the model has processed it.
+We assumed that successive observations were independent
+and thus there was no need to hold on to this context.
+
+But how should we deal with video snippets?
 In this case, each snippet might consist of a different number of frames.
 And our guess of what is going on in each frame might be much stronger
 if we take into account the previous or succeeding frames.
-Same goes for language. One popular deep learning problem
-is machine translation: the task of ingesting sentences
-in some source language and predicting their translation in another language.
+Same goes for language.
+One popular deep learning problem is machine translation:
+the task of ingesting sentences in some source language
+and predicting their translations in another language.
 
 These problems also occur in medicine.
 We might want a model to monitor patients in the intensive care unit
-and to fire off alerts if their risk of death
-in the next 24 hours exceeds some threshold.
-We definitely would not want this model to throw away
-everything it knows about the patient history each hour
-and just make its predictions based on the most recent measurements.
-
-These problems are among the most exciting applications of machine learning
+and to fire off alerts whenever their risk of dying in the next 24 hours
+exceeds some threshold.
+Here, we would not want to throw away everything
+that we know about the patient history every hour,
+making predictions based only
+on the most recent measurements.
+
+These problems are among the most
+exciting applications of machine learning
 and they are instances of *sequence learning*.
 They require a model to either ingest sequences of inputs
 or to emit sequences of outputs (or both).
-Specifically,
-*sequence to sequence learning* considers problems
-where input and output are both variable-length sequences,
-such as machine translation and transcribing text from the spoken speech.
-While it is impossible to consider all types of sequence transformations,
+Specifically, *sequence-to-sequence learning* considers problems
+where inputs and outputs both consist of variable-length sequences.
+Examples include machine translation
+and speech-to-text transcription.
+While it is impossible to consider
+all types of sequence transformations,
 the following special cases are worth mentioning.
 
-**Tagging and Parsing**. This involves annotating a text sequence with attributes.
-In other words, the number of inputs and outputs is essentially the same.
-For instance, we might want to know where the verbs and subjects are.
-Alternatively, we might want to know which words are the named entities.
-In general, the goal is to decompose and annotate text based on structural
-and grammatical assumptions to get some annotation.
-This sounds more complex than it actually is.
-Below is a very simple example of annotating a sentence
-with tags indicating which words refer to named entities (tagged as "Ent").
+**Tagging and Parsing**.
+This involves annotating a text sequence with attributes.
+Here, the inputs and outputs are *aligned*,
+i.e., they are equal in number
+and occur in a corresponding order.
+For instance, in *part-of-speech (PoS) tagging*,
+we annotate every word in a sentence
+with the corresponding part of speech,
+e.g., "noun" or "verb".
+Alternatively, we might want to know
+which groups of contiguous words refer to named entities,
+like *people*, *places*, or *organizations*.
+In the cartoonishly simple example below,
+we might just want to indicate,
+for every word in a sentence,
+whether it is part of a named entity (tagged as "Ent").
 
 ```text
 Tom has dinner in Washington with Sally
@@ -909,36 +988,45 @@ Ent  -    -    -     Ent      -    Ent
 ```
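
To make the notion of aligned inputs and outputs concrete, here is a minimal sketch that reproduces the annotation above. The set `known_entities` is a hypothetical lookup table standing in for a trained tagger; a real model would predict the tags rather than look them up.

```python
sentence = "Tom has dinner in Washington with Sally".split()

# Hypothetical lookup standing in for a trained tagger.
known_entities = {"Tom", "Washington", "Sally"}

# One tag per token: "Ent" for named entities, "-" otherwise.
tags = ["Ent" if word in known_entities else "-" for word in sentence]

# Aligned: exactly one output per input token, in the same order.
assert len(tags) == len(sentence)
for word, tag in zip(sentence, tags):
    print(f"{word:<12}{tag}")
```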
 
 
-**Automatic Speech Recognition**. With speech recognition, the input sequence
-is an audio recording of a speaker (shown in :numref:`fig_speech`), and the output 
-is the textual transcript of what the speaker said.
+**Automatic Speech Recognition**.
+With speech recognition, the input sequence
+is an audio recording of a speaker (:numref:`fig_speech`),
+and the output is a transcript of what the speaker said.
 The challenge is that there are many more audio frames
 (sound is typically sampled at 8kHz or 16kHz)
 than text, i.e., there is no 1:1 correspondence between audio and text,
 since thousands of samples may
 correspond to a single spoken word.
-These are sequence to sequence learning problems where the output is much shorter than the input.
+These are sequence-to-sequence learning problems,
+where the output is much shorter than the input.
 
 ![`-D-e-e-p- L-ea-r-ni-ng-` in an audio recording.](../img/speech.png)
 :width:`700px`
 :label:`fig_speech`
 
-**Text to Speech**. This is the inverse of automatic speech recognition.
-In other words, the input is text
-and the output is an audio file.
+**Text to Speech**.
+This is the inverse of automatic speech recognition.
+Here, the input is text and the output is an audio file.
 In this case, the output is much longer than the input.
-While it is easy for humans to recognize a bad audio file,
-this is not quite so trivial for computers.
-
-**Machine Translation**. Unlike the case of speech recognition, where corresponding
-inputs and outputs occur in the same order (after alignment),
-in machine translation, order inversion can be vital.
-In other words, while we are still converting one sequence into another,
-neither the number of inputs and outputs nor the order
-of corresponding data examples are assumed to be the same.
+While humans are remarkably good at recognizing speech,
+even from low-quality audio,
+getting computers to perform the feat
+is a formidable challenge.
+
+**Machine Translation**.
+Unlike the case of speech recognition,
+where corresponding inputs and outputs
+occur in the same order,
+in machine translation,
+unaligned data poses a new challenge.
+Here the input and output sequences
+can have different lengths,
+and the corresponding regions
+of the respective sequences
+may appear in different orders.
 Consider the following illustrative example
 of the peculiar tendency of Germans
-to place the verbs at the end of sentences.
+to place the verbs at the end of sentences:
 
 ```text
 German:           Haben Sie sich schon dieses grossartige Lehrwerk angeschaut?
@@ -958,28 +1046,31 @@ These are active areas of research.
 
 ### Unsupervised and Self-Supervised Learning
 
-All the examples so far were related to supervised learning,
-i.e., situations where we feed the model a giant dataset
+The previous examples focused on supervised learning,
+where we feed the model a giant dataset
 containing both the features and corresponding label values.
 You could think of the supervised learner as having
-an extremely specialized job and an extremely banal boss.
-The boss stands over your shoulder and tells you exactly what to do
+an extremely specialized job and an extremely dictatorial boss.
+The boss stands over your shoulder and tells you exactly what to do
 in every situation until you learn to map from situations to actions.
 Working for such a boss sounds pretty lame.
-On the other hand, it is easy to please this boss.
+On the other hand, pleasing such a boss is pretty easy.
 You just recognize the pattern as quickly as possible
 and imitate their actions.
 
-In a completely opposite way, it could be frustrating
-to work for a boss who has no idea what they want you to do.
-However, if you plan to be a data scientist, you had better get used to it.
-The boss might just hand you a giant dump of data and tell you to *do some data science with it!* 
+Considering the opposite situation,
+it could be frustrating to work for a boss
+who has no idea what they want you to do.
+However, if you plan to be a data scientist,
+you had better get used to it.
+The boss might just hand you a giant dump of data
+and tell you to *do some data science with it!*
 This sounds vague because it is.
 We call this class of problems *unsupervised learning*,
 and the type and number of questions we could ask
 is limited only by our creativity.
 We will address unsupervised learning techniques
-in later chapters. 
+in later chapters.
 To whet your appetite for now,
 we describe a few of the following questions you might ask.
 
@@ -992,7 +1083,7 @@ can we group them into users with similar behavior?
 This problem is typically known as *clustering*.
 * Can we find a small number of parameters
 that accurately capture the relevant properties of the data?
-The trajectories of a ball are quite well described
+The trajectories of a ball are well described
 by velocity, diameter, and mass of the ball.
 Tailors have developed a small number of parameters
 that describe human body shape fairly accurately
@@ -1000,7 +1091,7 @@ for the purpose of fitting clothes.
 These problems are referred to as *subspace estimation*.
 If the dependence is linear, it is called *principal component analysis*.
 * Is there a representation of (arbitrarily structured) objects
-in Euclidean space 
+in Euclidean space
 such that symbolic properties can be well matched?
 This can be used to describe entities and their relations,
 such as "Rome" $-$ "Italy" $+$ "France" $=$ "Paris".
@@ -1010,44 +1101,49 @@ For instance, if we have demographic data
 about house prices, pollution, crime, location,
 education, and salaries, can we discover
 how they are related simply based on empirical data?
-The fields concerned with *causality* and *probabilistic graphical models* address this problem.
+The fields concerned with *causality* and
+*probabilistic graphical models* tackle such questions.
 * Another important and exciting recent development in unsupervised learning
-is the advent of *generative adversarial networks*.
-These give us a procedural way to synthesize data,
-even complicated structured data like images and audio.
-The underlying statistical mechanisms are tests
-to check whether real and fake data are the same.
-
-As a form of unsupervised learning,
-*self-supervised learning*
-leverages unlabeled data 
-to provide supervision in training,
-such as by
-predicting some withheld part of the data
-using other parts.
-For text,
-we can train models 
+is the advent of deep generative models.
+These models estimate the density of the data $p(\mathbf{x})$,
+either explicitly or *implicitly*.
+Once trained, we can use a generative model
+either to score examples according to how likely they are,
+or to sample synthetic examples from the learned distribution.
+Early deep learning breakthroughs in generative modeling
+came with the invention of *variational autoencoders* :cite:`Kingma.Welling.2014`
+and continued with the development of *generative adversarial networks* :cite:`Goodfellow.Pouget-Abadie.Mirza.ea.2014`.
+More recent advances include normalizing flows,
+diffusion models, and score-based models.
+
+
+
+A major development in unsupervised learning
+has been the rise of *self-supervised learning*,
+techniques that leverage some aspect of the unlabeled data
+to provide supervision.
+For text, we can train models
 to "fill in the blanks"
 by predicting randomly masked words
 using their surrounding words (contexts)
 in big corpora without any labeling effort :cite:`Devlin.Chang.Lee.ea.2018`!
-For images,
-we may train models
+For images, we may train models
 to tell the relative position
 between two cropped regions
-of the same image :cite:`Doersch.Gupta.Efros.2015`.
-In these two examples of self-supervised learning,
-training models to predict
-possible words and relative positions
-are both classification tasks
-(from supervised learning).
-
+of the same image :cite:`Doersch.Gupta.Efros.2015`,
+to predict an occluded part of an image
+based on the remaining portions of the image,
+or to predict whether two examples
+are perturbed versions of the same underlying image.
+Self-supervised models often learn representations
+that are subsequently leveraged
+by fine-tuning the resulting models
+on some downstream task of interest.
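
As a rough sketch of how unlabeled text can supply its own labels, the snippet below hides one token of a sentence and keeps the hidden token as the prediction target. The helper `make_masked_example` and the `[MASK]` placeholder are assumptions made for illustration; the masking schemes used in practice are more involved than this.

```python
import random

def make_masked_example(tokens, rng, mask_token="[MASK]"):
    """Hide one randomly chosen token; the hidden token becomes the label."""
    position = rng.randrange(len(tokens))
    masked = list(tokens)          # copy so the original sentence is untouched
    label = masked[position]
    masked[position] = mask_token
    return masked, position, label

rng = random.Random(0)             # fixed seed so the run is repeatable
tokens = "the cat sat on the mat".split()
masked, position, label = make_masked_example(tokens, rng)
print(masked, "->", label)
```

No human annotation is involved: every sentence in a corpus yields (input, label) pairs of this kind, and training a model to recover `label` from `masked` is an ordinary classification problem over the vocabulary.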
 
 
 ### Interacting with an Environment
 
-So far, we have not discussed where data actually
-come from,
+So far, we have not discussed where data actually comes from,
 or what actually happens when a machine learning model generates an output.
 That is because supervised learning and unsupervised learning
 do not address these issues in a very sophisticated way.
@@ -1057,26 +1153,28 @@ without ever interacting with the environment again.
 Because all of the learning takes place
 after the algorithm is disconnected from the environment,
 this is sometimes called *offline learning*.
-For supervised learning,
-the process by considering data collection from an environment looks like :numref:`fig_data_collection`.
+For example, supervised learning assumes
+the simple interaction pattern
+depicted in :numref:`fig_data_collection`.
 
 ![Collecting data for supervised learning from an environment.](../img/data-collection.svg)
 :label:`fig_data_collection`
 
 This simplicity of offline learning has its charms.
-The upside is that
-we can worry about pattern recognition
-in isolation, without any distraction from these other problems.
-But the downside is that the problem formulation is quite limiting.
-If you are more ambitious, or if you grew up reading Asimov's Robot series,
-then you might imagine artificially intelligent bots capable
-not only of making predictions, but also 
-of taking actions in the world.
-We want to think about intelligent *agents*, not just predictive models.
-This means that
-we need to think about choosing *actions*,
+The upside is that we can focus
+on pattern recognition in isolation,
+without worrying about complications arising
+from interactions with a dynamic environment.
+But this problem formulation is limiting.
+If you grew up reading Asimov's Robot novels,
+then you might imagine artificially intelligent agents
+capable not only of making predictions,
+but also of taking actions in the world.
+We want to think about intelligent *agents*,
+not just predictive models.
+This means that we need to think about choosing *actions*,
 not just making predictions.
-Moreover, unlike predictions,
+Unlike mere predictions,
 actions actually impact the environment.
 If we want to train an intelligent agent,
 we must account for the way its actions might
@@ -1088,17 +1186,18 @@ The following are just a few examples.
 
 * Does the environment remember what we did previously?
 * Does the environment want to help us, e.g., a user reading text into a speech recognizer?
-* Does the environment want to beat us, i.e., an adversarial setting like spam filtering (against spammers) or playing a game (vs. an opponent)?
-* Does the environment not care?
+* Does the environment want to beat us, e.g., spammers altering their emails to evade spam filters?
 * Does the environment have shifting dynamics? For example, does future data always resemble the past or do the patterns change over time, either naturally or in response to our automated tools?
 
-This last question raises the problem of *distribution shift*,
-when training and test data are different.
-It is a problem that most of us have experienced
+These questions raise the problem of *distribution shift*,
+where training and test data are different.
+Most of us have experienced this problem
 when taking exams written by a lecturer,
-while the homework was composed by his teaching assistants.
-Next, we will briefly describe reinforcement learning,
-a setting that explicitly considers interactions with an environment.
+while the homework was composed by their teaching assistants.
+Next, we briefly describe reinforcement learning,
+a rich framework for posing learning problems in which
+an agent interacts with an environment.
+
 
 ### Reinforcement Learning
 
@@ -1107,58 +1206,69 @@ to develop an agent that interacts with an environment
 and takes actions, then you are probably going to wind up
 focusing on *reinforcement learning*.
 This might include applications to robotics,
-to dialogue systems, 
+to dialogue systems,
 and even to developing artificial intelligence (AI)
 for video games.
 *Deep reinforcement learning*, which applies
 deep learning to reinforcement learning problems,
 has surged in popularity.
-The breakthrough deep Q-network that beat humans at Atari games using only the visual input,
-and the AlphaGo program that dethroned the world champion at the board game Go are two prominent examples.
+The breakthrough deep Q-network that beat humans
+at Atari games using only the visual input :cite:`mnih2015human`,
+and the AlphaGo program that dethroned the world champion
+at the board game Go :cite:`Silver.Huang.Maddison.ea.2016`
+are two prominent examples.
 
 Reinforcement learning gives a very general statement of a problem,
 in which an agent interacts with an environment over a series of time steps.
-At each time step, 
-the agent receives some *observation* 
+At each time step, the agent receives some *observation*
 from the environment and must choose an *action*
 that is subsequently transmitted back to the environment
-via some mechanism (sometimes called an actuator).
+via some mechanism (sometimes called an *actuator*).
 Finally, the agent receives a reward from the environment.
 This process is illustrated in :numref:`fig_rl-environment`.
 The agent then receives a subsequent observation,
 and chooses a subsequent action, and so on.
-The behavior of an reinforcement learning agent is governed by a policy.
+The behavior of a reinforcement learning agent is governed by a *policy*.
 In short, a *policy* is just a function that maps
 from observations of the environment to actions.
-The goal of reinforcement learning is to produce a good policy.
+The goal of reinforcement learning is to produce good policies.
 
 ![The interaction between reinforcement learning and an environment.](../img/rl-environment.svg)
 :label:`fig_rl-environment`
 
-It is hard to overstate the generality of the reinforcement learning framework.
-For example, we can cast any supervised learning problem as a reinforcement learning problem.
+It is hard to overstate the generality
+of the reinforcement learning framework.
+For example, we can cast supervised learning problems
+as reinforcement learning problems.
 Say we had a classification problem.
-We could create a reinforcement learning agent with one action corresponding to each class.
+We could create a reinforcement learning agent
+with one action corresponding to each class.
 We could then create an environment which gave a reward
 that was exactly equal to the loss function
 from the original supervised learning problem.
 
-That being said, reinforcement learning can also address many problems
+That being said, reinforcement learning
+can also address many problems
 that supervised learning cannot.
-For example, in supervised learning we always expect
-that the training input comes associated with the correct label.
-But in reinforcement learning, we do not assume that for each observation 
+For example, in supervised learning,
+we always expect that the training input
+comes associated with the correct label.
+But in reinforcement learning,
+we do not assume that for each observation
 the environment tells us the optimal action.
 In general, we just get some reward.
-Moreover, the environment may not even tell us which actions led to the reward.
+Moreover, the environment may not even tell us
+which actions led to the reward.
 
-Consider for example the game of chess.
+Consider the game of chess.
 The only real reward signal comes at the end of the game
-when we either win, which we might assign a reward of 1,
-or when we lose, which we could assign a reward of -1.
-So reinforcement learners must deal with the *credit assignment* problem:
+when we either win, earning a reward of, say, 1,
+or when we lose, receiving a reward of, say, -1.
+So reinforcement learners must deal
+with the *credit assignment* problem:
 determining which actions to credit or blame for an outcome.
-The same goes for an employee who gets a promotion on October 11.
+The same goes for an employee
+who gets a promotion on October 11.
 That promotion likely reflects a large number
 of well-chosen actions over the previous year.
 Getting more promotions in the future requires figuring out
@@ -1170,17 +1280,19 @@ That is, the current observation might not
 tell you everything about your current state.
 Say a cleaning robot found itself trapped
 in one of many identical closets in a house.
-Inferring the precise location (and thus state) of the robot
-might require considering its previous observations before entering the closet.
+Inferring the precise location of the robot
+might require considering its previous observations
+before entering the closet.
 
 Finally, at any given point, reinforcement learners
 might know of one good policy,
 but there might be many other better policies
 that the agent has never tried.
 The reinforcement learner must constantly choose
-whether to *exploit* the best currently-known strategy as a policy,
+whether to *exploit* the best (currently) known strategy as a policy,
 or to *explore* the space of strategies,
-potentially giving up some short-run reward in exchange for knowledge.
+potentially giving up some short-run reward
+in exchange for knowledge.
 
 The general reinforcement learning problem
 is a very general setting.
@@ -1202,21 +1314,19 @@ is the classic *multi-armed bandit problem*.
 
 ## Roots
 
-We have just reviewed
-a small subset of problems that machine learning 
-can address.
+We have just reviewed a small subset of problems
+that machine learning can address.
 For a diverse set of machine learning problems,
 deep learning provides powerful tools for solving them.
-Although many deep learning methods
-are recent inventions,
-the core idea of programming with data and neural networks (names of many deep learning models)
-has been studied for centuries.
-In fact,
-humans have held the desire to analyze data
+Although many deep learning methods are recent inventions,
+the core ideas behind learning from data
+have been studied for centuries.
+In fact, humans have held the desire to analyze data
 and to predict future outcomes for long
 and much of natural science has its roots in this.
 For instance, the Bernoulli distribution is named after
-[Jacob Bernoulli (1655--1705)](https://en.wikipedia.org/wiki/Jacob_Bernoulli), and the Gaussian distribution was discovered
+[Jacob Bernoulli (1655--1705)](https://en.wikipedia.org/wiki/Jacob_Bernoulli),
+and the Gaussian distribution was discovered
 by [Carl Friedrich Gauss (1777--1855)](https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss).
 He invented, for instance, the least mean squares algorithm,
 which is still used today for countless problems
@@ -1226,40 +1336,48 @@ in the natural sciences---for instance, Ohm's law
 relating current and voltage in a resistor
 is perfectly described by a linear model.
 
-Even in the middle ages, mathematicians had a keen intuition of estimates.
-For instance, the geometry book of [Jacob Köbel (1460--1533)](https://www.maa.org/press/periodicals/convergence/mathematical-treasures-jacob-kobels-geometry) illustrates
-averaging the length of 16 adult men's feet to obtain the average foot length.
+Even in the middle ages, mathematicians
+had a keen intuition of estimates.
+For instance, the geometry book of [Jacob Köbel (1460--1533)](https://www.maa.org/press/periodicals/convergence/mathematical-treasures-jacob-kobels-geometry)
+illustrates averaging the length of 16 adult men's feet
+to estimate the average foot length in the population (:numref:`fig_koebel`).
 
 ![Estimating the length of a foot.](../img/koebel.jpg)
 :width:`500px`
 :label:`fig_koebel`
 
-:numref:`fig_koebel` illustrates how this estimator works.
-The 16 adult men were asked to line up in a row, when leaving the church.
-Their aggregate length was then divided by 16
+
+As a group of individuals exited a church,
+16 adult men were asked to line up in a row
+and have their feet measured.
+The sum of these measurements was then divided by 16
 to obtain an estimate for what now amounts to 1 foot.
-This "algorithm" was later improved to deal with misshapen feet---the
-2 men with the shortest and longest feet respectively were sent away,
+This "algorithm" was later improved
+to deal with misshapen feet;
+the 2 men with the shortest and longest feet were sent away,
 averaging only over the remainder.
-This is one of the earliest examples of the trimmed mean estimate.
+This is among the earliest examples
+of a trimmed mean estimate.
 
 Statistics really took off with the collection and availability of data.
-One of its titans, [Ronald Fisher (1890--1962)](https://en.wikipedia.org/wiki/Ronald_Fisher),
+One of its pioneers, [Ronald Fisher (1890--1962)](https://en.wikipedia.org/wiki/Ronald_Fisher),
 contributed significantly to its theory
 and also its applications in genetics.
 Many of his algorithms (such as linear discriminant analysis)
-and formula (such as the Fisher information matrix)
-are still in frequent use today. 
-In fact,
-even the Iris dataset
-that Fisher released in 1936 is still used sometimes
-to illustrate machine learning algorithms.
-He was also a proponent of eugenics,
+and formulas (such as the Fisher information matrix)
+still hold a prominent place
+in the foundations of modern statistics.
+Even his data resources had a lasting impact.
+The Iris dataset that Fisher released in 1936
+is still used sometimes to demonstrate
+machine learning algorithms.
+Fisher was also a proponent of eugenics,
 which should remind us that the morally dubious use of data science
 has as long and enduring a history as its productive use
 in industry and the natural sciences.
 
-A second influence for machine learning came from information theory by
+A second influence for machine learning
+came from information theory by
 [Claude Shannon (1916--2001)](https://en.wikipedia.org/wiki/Claude_Shannon) and the theory of computation via [Alan Turing (1912--1954)](https://en.wikipedia.org/wiki/Alan_Turing).
 Turing posed the question "can machines think?”
 in his famous paper *Computing Machinery and Intelligence* :cite:`Turing.1950`.
@@ -1270,24 +1388,25 @@ from a machine and a human based on textual interactions.
 
 Another influence can be found in neuroscience and psychology.
 After all, humans clearly exhibit intelligent behavior.
-It is thus only reasonable to ask whether one could explain
+Many scholars have asked whether one could explain
 and possibly reverse engineer this capacity.
-One of the oldest algorithms inspired in this fashion
+One of the oldest biologically inspired algorithms
 was formulated by [Donald Hebb (1904--1985)](https://en.wikipedia.org/wiki/Donald_O._Hebb).
 In his groundbreaking book *The Organization of Behavior* :cite:`Hebb.Hebb.1949`,
 he posited that neurons learn by positive reinforcement.
 This became known as the Hebbian learning rule.
-It is the prototype of Rosenblatt's perceptron learning algorithm
-and it laid the foundations of many stochastic gradient descent algorithms
-that underpin deep learning today: reinforce desirable behavior
-and diminish undesirable behavior to obtain good settings
-of the parameters in a neural network.
+These ideas inspired later works like
+Rosenblatt's perceptron learning algorithm
+and laid the foundations of many stochastic gradient descent algorithms
+that underpin deep learning today:
+reinforce desirable behavior and diminish undesirable behavior
+to obtain good settings of the parameters in a neural network.
 
 Biological inspiration is what gave *neural networks* their name.
 For over a century (dating back to the models of Alexander Bain, 1873
 and James Sherrington, 1890), researchers have tried to assemble
 computational circuits that resemble networks of interacting neurons.
-Over time, the interpretation of biology has become less literal
+Over time, the interpretation of biology has become less literal,
 but the name stuck. At its heart, lie a few key principles
 that can be found in most networks today:
 
@@ -1307,22 +1426,29 @@ The MNIST dataset with its 60000 handwritten digits was considered huge.
 
 Given the scarcity of data and computation,
 strong statistical tools such as kernel methods,
-decision trees and graphical models proved empirically superior.
-Unlike neural networks, they did not require weeks to train
-and provided predictable results with strong theoretical guarantees.
+decision trees, and graphical models
+proved empirically superior in many applications.
+Moreover, unlike neural networks,
+they did not require weeks to train
+and provided predictable results
+with strong theoretical guarantees.
 
 
 ## The Road to Deep Learning
 
-Much of this changed with 
-the ready availability of large amounts of data,
-due to the World Wide Web, 
+Much of this changed with the availability
+of large amounts of data,
+due to the World Wide Web,
 the advent of companies serving
-hundreds of millions of users online, 
-a dissemination of cheap, high-quality sensors, 
+hundreds of millions of users online,
+a dissemination of cheap, high-quality sensors,
 cheap data storage (Kryder's law),
-and cheap computation (Moore's law), in particular in the form of GPUs, originally engineered for computer gaming.
-Suddenly algorithms and models that seemed computationally infeasible
+and cheap computation (Moore's law).
+In particular, the landscape of computation in deep learning
+was revolutionized by advances in GPUs,
+which were originally engineered for computer gaming.
+Suddenly algorithms and models
+that seemed computationally infeasible
 became relevant (and vice versa).
 This is best illustrated in :numref:`tab_intro_decade`.
 
@@ -1331,22 +1457,24 @@ This is best illustrated in :numref:`tab_intro_decade`.
 |Decade|Dataset|Memory|Floating point calculations per second|
 |:--|:-|:-|:-|
 |1970|100 (Iris)|1 KB|100 KF (Intel 8080)|
-|1980|1 K (House prices in Boston)|100 KB|1 MF (Intel 80186)|
+|1980|1 K (house prices in Boston)|100 KB|1 MF (Intel 80186)|
 |1990|10 K (optical character recognition)|10 MB|10 MF (Intel 80486)|
 |2000|10 M (web pages)|100 MB|1 GF (Intel Core)|
 |2010|10 G (advertising)|1 GB|1 TF (Nvidia C2050)|
 |2020|1 T (social network)|100 GB|1 PF (Nvidia DGX-2)|
 :label:`tab_intro_decade`
 
-It is evident that random-access memory has not kept pace with the growth in data.
-At the same time, the increase in computational power
-has outpaced that of the data available.
-This means that statistical models need to become more memory efficient
-(this is typically achieved by adding nonlinearities)
-while simultaneously being able to spend more time
-on optimizing these parameters, due to an increased computational budget.
+Note that random-access memory has not kept pace with the growth in data.
+At the same time, increases in computational power
+have outpaced the growth in datasets.
+This means that statistical models
+need to become more memory efficient,
+and are free to spend more computer cycles
+optimizing parameters, due to
+the increased compute budget.
 Consequently, the sweet spot in machine learning and statistics
-moved from (generalized) linear models and kernel methods to deep neural networks.
+moved from (generalized) linear models and kernel methods
+to deep neural networks.
 This is also one of the reasons why many of the mainstays
 of deep learning, such as multilayer perceptrons
 :cite:`McCulloch.Pitts.1943`, convolutional neural networks
@@ -1368,16 +1496,16 @@ over the past decade.
 
 * Novel methods for capacity control, such as *dropout*
   :cite:`Srivastava.Hinton.Krizhevsky.ea.2014`,
-  have helped to mitigate the danger of overfitting.
-  This was achieved by applying noise injection :cite:`Bishop.1995`
-  throughout the neural network, replacing weights by random variables
-  for training purposes.
+  have helped to mitigate overfitting.
+  Here, noise is injected :cite:`Bishop.1995`
+  throughout the neural network during training.
 * Attention mechanisms solved a second problem
   that had plagued statistics for over a century:
   how to increase the memory and complexity of a system without
   increasing the number of learnable parameters.
   Researchers found an elegant solution
-  by using what can only be viewed as a learnable pointer structure :cite:`Bahdanau.Cho.Bengio.2014`.
+  by using what can only be viewed as
+  a learnable pointer structure :cite:`Bahdanau.Cho.Bengio.2014`.
   Rather than having to remember an entire text sequence, e.g.,
   for machine translation in a fixed-dimensional representation,
   all that needed to be stored was a pointer to the intermediate state
@@ -1385,13 +1513,24 @@ over the past decade.
   increased accuracy for long sequences, since the model
   no longer needed to remember the entire sequence before
   commencing the generation of a new sequence.
-* Multi-stage designs, e.g., via the memory networks 
-  :cite:`Sukhbaatar.Weston.Fergus.ea.2015` and the neural programmer-interpreter :cite:`Reed.De-Freitas.2015`
-  allowed statistical modelers to describe iterative approaches to reasoning. These tools allow for an internal state of the deep neural network
-  to be modified repeatedly, thus carrying out subsequent steps
+  Built solely on attention mechanisms,
+  the transformer architecture :cite:`Vaswani.Shazeer.Parmar.ea.2017`
+  has demonstrated compelling success in a wide range of areas.
+  For example, a single transformer pretrained on modalities
+  as diverse as text, images, joint torques, and button presses
+  can play Atari, caption images, chat,
+  and control a robot :cite:`reed2022generalist`.
+* Multi-stage designs, e.g., via the memory networks
+  :cite:`Sukhbaatar.Weston.Fergus.ea.2015`
+  and the neural programmer-interpreter :cite:`Reed.De-Freitas.2015`
+  allowed statistical modelers to describe iterative approaches to reasoning.
+  These tools allow for an internal state of the deep neural network
+  to be modified repeatedly,
+  thus carrying out subsequent steps
   in a chain of reasoning, similar to how a processor
   can modify memory for a computation.
-* Another key development was the invention of generative adversarial networks
+* Another key development was the invention
+  of generative adversarial networks
   :cite:`Goodfellow.Pouget-Abadie.Mirza.ea.2014`.
   Traditionally, statistical methods for density estimation
   and generative models focused on finding proper probability distributions
@@ -1421,33 +1560,40 @@ over the past decade.
   At the same time, small batches limit the efficiency of GPUs.
   Hence, training on 1024 GPUs with a minibatch size of,
   say 32 images per batch amounts to an aggregate minibatch
-  of about 32000 images. Recent work, first by Li :cite:`Li.2017`,
-  and subsequently by :cite:`You.Gitman.Ginsburg.2017`
-  and :cite:`Jia.Song.He.ea.2018` pushed the size up to 64000 observations,
-  reducing training time for the ResNet-50 model on the ImageNet dataset to less than 7 minutes.
+  of about 32000 images. Recent work, first by :citet:`Li.2017`,
+  and subsequently by :citet:`You.Gitman.Ginsburg.2017`
+  and :citet:`Jia.Song.He.ea.2018` pushed the size up to 64000 observations,
+  reducing training time for the ResNet-50 model
+  on the ImageNet dataset to less than 7 minutes.
   For comparison---initially training times were measured in the order of days.
-* The ability to parallelize computation has also contributed quite crucially
-  to progress in reinforcement learning, at least whenever simulation is an
-  option. This has led to significant progress in computers achieving
-  superhuman performance in Go, Atari games, Starcraft, and in physics
-  simulations (e.g., using MuJoCo). See e.g.,
-  :cite:`Silver.Huang.Maddison.ea.2016` for a description
+* The ability to parallelize computation
+  has also contributed to progress in reinforcement learning.
+  This has led to significant progress in computers achieving
+  superhuman performance on tasks like Go, Atari games,
+  Starcraft, and in physics simulations (e.g., using MuJoCo),
+  where environment simulators are available.
+  See, e.g., :citet:`Silver.Huang.Maddison.ea.2016` for a description
   of how to achieve this in AlphaGo. In a nutshell,
-  reinforcement learning works best if plenty of (state, action, reward) triples are available, i.e., whenever it is possible to try out lots of things to learn how they relate to each
-  other. Simulation provides such an avenue.
+  reinforcement learning works best
+  if plenty of (state, action, reward) tuples are available.
+  Simulation provides such an avenue.
 * Deep learning frameworks have played a crucial role
-  in disseminating ideas. The first generation of frameworks
-  allowing for easy modeling encompassed
+  in disseminating ideas.
+  The first generation of open-source frameworks
+  for neural network modeling consisted of
   [Caffe](https://github.com/BVLC/caffe),
   [Torch](https://github.com/torch), and
   [Theano](https://github.com/Theano/Theano).
   Many seminal papers were written using these tools.
   By now, they have been superseded by
-  [TensorFlow](https://github.com/tensorflow/tensorflow) (often used via its high level API [Keras](https://github.com/keras-team/keras)), [CNTK](https://github.com/Microsoft/CNTK), [Caffe 2](https://github.com/caffe2/caffe2), and [Apache MXNet](https://github.com/apache/incubator-mxnet). The third generation of tools, namely imperative tools for deep learning,
-  was arguably spearheaded by [Chainer](https://github.com/chainer/chainer),
+  [TensorFlow](https://github.com/tensorflow/tensorflow) (often used via its high level API [Keras](https://github.com/keras-team/keras)), [CNTK](https://github.com/Microsoft/CNTK), [Caffe 2](https://github.com/caffe2/caffe2), and [Apache MXNet](https://github.com/apache/incubator-mxnet).
+  The third generation of tools consists
+  of so-called *imperative* tools for deep learning,
+  a trend that was arguably ignited by [Chainer](https://github.com/chainer/chainer),
   which used a syntax similar to Python NumPy to describe models.
   This idea was adopted by both [PyTorch](https://github.com/pytorch/pytorch),
-  the [Gluon API](https://github.com/apache/incubator-mxnet) of MXNet, and [Jax](https://github.com/google/jax).
+  the [Gluon API](https://github.com/apache/incubator-mxnet) of MXNet,
+  and [Jax](https://github.com/google/jax).
 
 
 The division of labor between system researchers building better tools
@@ -1457,18 +1603,21 @@ training a linear logistic regression model
 used to be a nontrivial homework problem,
 worthy to give to new machine learning
 Ph.D. students at Carnegie Mellon University in 2014.
-By now, this task can be accomplished with less than 10 lines of code,
+By now, this task can be accomplished
+with fewer than 10 lines of code,
 putting it firmly into the grasp of programmers.
 
+
 ## Success Stories
 
 AI has a long history of delivering results
 that would be difficult to accomplish otherwise.
-For instance, 
-the mail sorting systems
+For instance, the mail sorting systems
 using optical character recognition
 have been deployed since the 1990s.
-This is, after all, the source of the famous MNIST dataset  of handwritten digits.
+This is, after all, the source
+of the famous MNIST dataset
+of handwritten digits.
 The same applies to reading checks for bank deposits and scoring
 creditworthiness of applicants.
 Financial transactions are checked for fraud automatically.
@@ -1486,62 +1635,95 @@ that were considered intractable previously
 and that are directly related to consumers.
 Many of such advances are attributed to deep learning.
 
-* Intelligent assistants, such as Apple's Siri, Amazon's Alexa, and Google's
-  assistant, are able to answer spoken questions with a reasonable degree of
-  accuracy. This includes menial tasks such as turning on light switches (a boon to the disabled) up to making barber's appointments and offering phone support dialog. This is likely the most noticeable sign that AI is affecting our lives.
-* A key ingredient in digital assistants is the ability to recognize speech
-  accurately. Gradually the accuracy of such systems has increased to the point
-  where they reach human parity for certain
-  applications :cite:`Xiong.Wu.Alleva.ea.2018`.
-* Object recognition likewise has come a long way. Estimating the object in a
-  picture was a fairly challenging task in 2010. On the ImageNet benchmark researchers from NEC Labs and University of Illinois at Urbana-Champaign achieved a top-5 error rate of 28% :cite:`Lin.Lv.Zhu.ea.2010`. By 2017,
-  this error rate was reduced to 2.25% :cite:`Hu.Shen.Sun.2018`. Similarly, stunning
-  results have been achieved for identifying birds or diagnosing skin cancer.
-* Games used to be a bastion of human intelligence.
-  Starting from TD-Gammon, a program for playing backgammon using temporal difference reinforcement learning, algorithmic and computational progress has led to algorithms
-  for a wide range of applications. Unlike backgammon,
-  chess has a much more complex state space and set of actions.
+* Intelligent assistants, such as Apple's Siri,
+  Amazon's Alexa, and Google's assistant,
+  are able to answer spoken questions
+  with a reasonable degree of accuracy.
+  This includes menial tasks, like turning on light switches,
+  and more complex tasks, like arranging barber's appointments
+  and offering phone support dialog.
+  This is likely the most noticeable sign
+  that AI is affecting our lives.
+* A key ingredient in digital assistants
+  is the ability to recognize speech accurately.
+  Gradually, the accuracy of such systems
+  has increased to the point
+  of achieving human parity
+  for certain applications :cite:`Xiong.Wu.Alleva.ea.2018`.
+* Object recognition has likewise come a long way.
+  Estimating the object in a picture
+  was a fairly challenging task in 2010.
+  On the ImageNet benchmark, researchers from NEC Labs
+  and the University of Illinois at Urbana-Champaign
+  achieved a top-5 error rate of 28% :cite:`Lin.Lv.Zhu.ea.2010`.
+  By 2017, this error rate was reduced to 2.25% :cite:`Hu.Shen.Sun.2018`.
+  Similarly, stunning results have been achieved
+  for identifying birds and for diagnosing skin cancer.
+* Prowess in games used to provide
+  a measuring stick for human intelligence.
+  Starting from TD-Gammon, a program for playing backgammon
+  using temporal difference reinforcement learning,
+  algorithmic and computational progress
+  has led to algorithms for a wide range of applications.
+  Unlike backgammon, chess has
+  a much more complex state space and set of actions.
   DeepBlue beat Garry Kasparov using massive parallelism,
-  special-purpose hardware and efficient search through the game tree :cite:`Campbell.Hoane-Jr.Hsu.2002`.
+  special-purpose hardware and efficient search
+  through the game tree :cite:`Campbell.Hoane-Jr.Hsu.2002`.
   Go is more difficult still, due to its huge state space.
-  AlphaGo reached human parity in 2015, using deep learning combined with Monte Carlo tree sampling :cite:`Silver.Huang.Maddison.ea.2016`.
-  The challenge in Poker was that the state space is
-  large and it is not fully observed (we do not know the opponents'
-  cards). Libratus exceeded human performance in Poker using efficiently
-  structured strategies :cite:`Brown.Sandholm.2017`.
-  This illustrates the impressive progress in games
-  and the fact that advanced algorithms played a crucial part in them.
-* Another indication of progress in AI is the advent of self-driving cars
-  and trucks. While full autonomy is not quite within reach yet,
+  AlphaGo reached human parity in 2015,
+  using deep learning combined with Monte Carlo tree search :cite:`Silver.Huang.Maddison.ea.2016`.
+  The challenge in Poker was that the state space is large
+  and only partially observed
+  (we do not know the opponents' cards).
+  Libratus exceeded human performance in Poker
+  using efficiently structured strategies :cite:`Brown.Sandholm.2017`.
+* Another indication of progress in AI
+  is the advent of self-driving cars and trucks.
+  While full autonomy is not quite within reach,
   excellent progress has been made in this direction,
   with companies such as Tesla, NVIDIA,
-  and Waymo shipping products that enable at least partial autonomy.
-  What makes full autonomy so challenging is that proper driving
-  requires the ability to perceive, to reason and to incorporate rules
-  into a system. At present, deep learning is used primarily
+  and Waymo shipping products
+  that enable at least partial autonomy.
+  What makes full autonomy so challenging
+  is that proper driving requires
+  the ability to perceive, to reason
+  and to incorporate rules into a system.
+  At present, deep learning is used primarily
   in the computer vision aspect of these problems.
   The rest is heavily tuned by engineers.
 
 
 
-Again, the above list barely scratches the surface of where machine learning has impacted practical applications. For instance, robotics, logistics, computational biology, particle physics, and astronomy owe some of their most impressive recent advances at least in parts to machine learning. Machine learning is thus becoming a ubiquitous tool for engineers and scientists.
+This barely scratches the surface
+for impactful applications of machine learning.
+For instance, robotics, logistics, computational biology,
+particle physics, and astronomy
+owe some of their most impressive recent advances
+at least in parts to machine learning.
+Machine learning is thus becoming
+a ubiquitous tool for engineers and scientists.
 
-Frequently, the question of the AI apocalypse, or the AI singularity
-has been raised in non-technical articles on AI.
+Frequently, questions about a coming AI apocalypse
+and the plausibility of a *singularity*
+have been raised in non-technical articles on AI.
 The fear is that somehow machine learning systems
-will become sentient and decide independently from their programmers
-(and masters) about things that directly affect the livelihood of humans.
-To some extent, AI already affects the livelihood of humans
-in an immediate way:
+will become sentient and make decisions,
+independently of their programmers,
+that directly impact the lives of humans.
+To some extent, AI already affects
+the livelihood of humans in direct ways:
 creditworthiness is assessed automatically,
 autopilots mostly navigate vehicles, decisions about
 whether to grant bail use statistical data as input.
 More frivolously, we can ask Alexa to switch on the coffee machine.
 
 Fortunately, we are far from a sentient AI system
-that is ready to manipulate its human creators (or burn their coffee).
-First, AI systems are engineered, trained and deployed in a specific,
-goal-oriented manner. While their behavior might give the illusion
+that could deliberately manipulate its human creators.
+First, AI systems are engineered,
+trained, and deployed
+in a specific, goal-oriented manner.
+While their behavior might give the illusion
 of general intelligence, it is a combination of rules, heuristics
 and statistical models that underlie the design.
 Second, at present tools for *artificial general intelligence*
@@ -1568,88 +1750,135 @@ With what we know today, this strikes us a much more pressing concern
 than the potential of malevolent superintelligence to destroy humanity.
 
 
-## Characteristics
-
-Thus far, we have talked about machine learning broadly, which is both a branch of AI and an approach to AI.
-Though deep learning is a subset of machine learning,
-the dizzying set of algorithms and applications makes it difficult to assess what specifically the ingredients for deep learning might be. 
-This is as difficult as trying to pin down required ingredients for pizza since almost every component is substitutable.
-
-As we have described, machine learning can
-use data to learn transformations between inputs and outputs,
-such as transforming audio into text in speech recognition.
-In doing so, it is often necessary to represent data in a way suitable for algorithms to transform such representations into the output.
-*Deep learning* is *deep* in precisely the sense
-that its models
-learn many *layers* of transformations,
-where each layer offers the representation
-at one level.
-For example,
-layers near the input may represent 
-low-level details of the data,
-while layers closer to the classification output
-may represent more abstract concepts used for discrimination.
-Since *representation learning* aims at
-finding the representation itself,
-deep learning can be referred to as multi-level
-representation learning.
-
-The problems that we have discussed so far, such as learning
-from the raw audio signal, 
+## The Essence of Deep Learning
+
+Thus far, we have talked about machine learning broadly.
+Deep learning is the subset of machine learning
+concerned with models based on many-layered neural networks.
+It is *deep* in precisely the sense that its models
+learn many *layers* of transformations.
+While this might sound narrow,
+deep learning has given rise
+to a dizzying array of models, techniques,
+problem formulations, and applications.
+Many intuitions have been developed
+to explain the benefits of depth.
+Arguably, all machine learning
+has many layers of computation,
+the first consisting of feature processing steps.
+What differentiates deep learning is that
+the operations learned at each of the many layers
+of representations are learned jointly from data.
+
+The problems that we have discussed so far,
+such as learning from the raw audio signal,
 the raw pixel values of images,
 or mapping between sentences of arbitrary lengths and
 their counterparts in foreign languages,
-are those
-where deep learning excels and where traditional 
-machine learning
-methods falter.
+are those where deep learning excels
+and traditional methods falter.
 It turns out that these many-layered models
 are capable of addressing low-level perceptual data
 in a way that previous tools could not.
-Arguably the most significant commonality in deep learning methods is the use of *end-to-end training*. 
-That is, rather than assembling a system based on components that are individually tuned, one builds the system and then tunes their performance jointly.
-For instance, in computer vision scientists used to separate the process of *feature engineering* from the process of building machine learning models. The Canny edge detector :cite:`Canny.1987` and Lowe's SIFT feature extractor :cite:`Lowe.2004` reigned supreme for over a decade as algorithms for mapping images into feature vectors.
+Arguably the most significant commonality
+in deep learning methods is *end-to-end training*.
+That is, rather than assembling a system
+based on components that are individually tuned,
+one builds the system and then tunes the performance of its components jointly.
+For instance, in computer vision, scientists
+used to separate the process of *feature engineering*
+from the process of building machine learning models.
+The Canny edge detector :cite:`Canny.1987`
+and Lowe's SIFT feature extractor :cite:`Lowe.2004`
+reigned supreme for over a decade as algorithms
+for mapping images into feature vectors.
 In bygone days, the crucial part of applying machine learning to these problems
 consisted of coming up with manually-engineered ways
 of transforming the data into some form amenable to shallow models.
-Unfortunately, there is only so little that humans can accomplish by ingenuity in comparison with a consistent evaluation over millions of choices carried out automatically by an algorithm.
+Unfortunately, there is only so much that humans can accomplish
+by ingenuity compared with a consistent evaluation
+over millions of choices carried out automatically by an algorithm.
 When deep learning took over,
-these feature extractors were replaced by automatically tuned filters, yielding superior accuracy.
+these feature extractors were replaced
+by automatically tuned filters, yielding superior accuracy.
 
-Thus,
-one key advantage of deep learning is that it replaces not
-only the shallow models at the end of traditional learning pipelines,
-but also the labor-intensive process of 
-feature engineering.
+Thus, one key advantage of deep learning is that it replaces
+not only the shallow models at the end of traditional learning pipelines,
+but also the labor-intensive process of feature engineering.
 Moreover, by replacing much of the domain-specific preprocessing,
 deep learning has eliminated many of the boundaries
 that previously separated computer vision, speech recognition,
 natural language processing, medical informatics, and other application areas,
 offering a unified set of tools for tackling diverse problems.
 
-Beyond end-to-end training, 
-we are experiencing a transition from parametric statistical descriptions to fully nonparametric models. When data are scarce, one needs to rely on simplifying assumptions about reality in order to obtain useful models. When data are abundant, this can be replaced by nonparametric models that fit reality more accurately. To some extent, this mirrors the progress that physics experienced in the middle of the previous century with the availability of computers. Rather than solving parametric approximations of how electrons behave by hand, one can now resort to numerical simulations of the associated partial differential equations. This has led to much more accurate models, albeit often at the expense of explainability.
-
-Another difference to previous work is the acceptance of suboptimal solutions, dealing with nonconvex nonlinear optimization problems, and the willingness to try things before proving them. This newfound empiricism in dealing with statistical problems, combined with a rapid influx of talent has led to rapid progress of practical algorithms, albeit in many cases at the expense of modifying and re-inventing tools that existed for decades.
-
-In the end, the deep learning community prides itself on sharing tools across academic and corporate boundaries, releasing many excellent libraries, statistical models, and trained networks as open source.
-It is in this spirit that the notebooks forming this book are freely available for distribution and use. We have worked hard to lower the barriers of access for everyone to learn about deep learning and we hope that our readers will benefit from this.
-
+Beyond end-to-end training, we are experiencing a transition
+from parametric statistical descriptions to fully nonparametric models.
+When data is scarce, one needs to rely on simplifying assumptions about reality
+in order to obtain useful models.
+When data is abundant, these can be replaced
+by nonparametric models that better fit the data.
+To some extent, this mirrors the progress
+that physics experienced in the middle of the previous century
+with the availability of computers.
+Rather than solving parametric approximations of how electrons behave by hand,
+one can now resort to numerical simulations of the associated partial differential equations.
+This has led to much more accurate models,
+albeit often at the expense of explainability.
+
+Another difference from previous work is the acceptance of suboptimal solutions,
+dealing with nonconvex nonlinear optimization problems,
+and the willingness to try things before proving them.
+This newfound empiricism in dealing with statistical problems,
+combined with a rapid influx of talent, has led
+to rapid progress in practical algorithms,
+albeit in many cases at the expense of modifying
+and re-inventing tools that existed for decades.
+
+In the end, the deep learning community prides itself
+on sharing tools across academic and corporate boundaries,
+releasing many excellent libraries, statistical models,
+and trained networks as open source.
+It is in this spirit that the notebooks forming this book
+are freely available for distribution and use.
+We have worked hard to lower the barriers to access
+for everyone to learn about deep learning,
+and we hope that our readers will benefit from this.
 
 
 ## Summary
 
-* Machine learning studies how computer systems can leverage experience (often data) to improve performance at specific tasks. It combines ideas from statistics, data mining, and optimization. Often, it is used as a means of implementing AI solutions.
-* As a class of machine learning, representational learning focuses on how to automatically find the appropriate way to represent data. Deep learning is multi-level representation learning through learning many layers of transformations.
-* Deep learning replaces not only the shallow models at the end of traditional machine learning pipelines, but also the labor-intensive process of feature engineering. 
-* Much of the recent progress in deep learning has been triggered by an abundance of data arising from cheap sensors and Internet-scale applications, and by significant progress in computation, mostly through GPUs.
-* Whole system optimization is a key component in obtaining high performance. The availability of efficient deep learning frameworks has made design and implementation of this significantly easier.
+Machine learning studies how computer systems
+can leverage experience (often data)
+to improve performance at specific tasks.
+It combines ideas from statistics, data mining, and optimization.
+Often, it is used as a means of implementing AI solutions.
+As a class of machine learning, representation learning
+focuses on how to automatically find
+the appropriate way to represent data.
+As multi-level representation learning
+through learning many layers of transformations,
+deep learning replaces not only the shallow models
+at the end of traditional machine learning pipelines,
+but also the labor-intensive process of feature engineering.
+Much of the recent progress in deep learning
+has been triggered by an abundance of data
+arising from cheap sensors and Internet-scale applications,
+and by significant progress in computation, mostly through GPUs.
+Moreover, efficient deep learning frameworks
+have made it significantly easier to design and implement whole-system optimization,
+which is a key component in obtaining high performance.
 
 ## Exercises
 
-1. Which parts of code that you are currently writing could be "learned", i.e., improved by learning and automatically determining design choices that are made in your code? Does your code include heuristic design choices?
-1. Which problems that you encounter have many examples for how to solve them, yet no specific way to automate them? These may be prime candidates for using deep learning.
-1. Viewing the development of AI as a new industrial revolution, what is the relationship between algorithms and data? Is it similar to steam engines and coal? What is the fundamental difference?
-1. Where else can you apply the end-to-end training approach, such as in :numref:`fig_ml_loop`, physics, engineering, and econometrics?
+1. Which parts of code that you are currently writing could be "learned",
+   i.e., improved by learning and automatically determining design choices
+   that are made in your code?
+   Does your code include heuristic design choices?
+   What data might you need to learn the desired behavior?
+1. Which problems that you encounter have many examples for how to solve them,
+   yet no specific way to automate them?
+   These may be prime candidates for using deep learning.
+1. Describe the relationships between algorithms, data, and computation. How do characteristics of the data and the currently available computational resources influence the appropriateness of various algorithms?
+1. Name some settings where end-to-end training is not currently the default approach but might be useful.
 
 [Discussions](https://discuss.d2l.ai/t/22)
diff --git a/chapter_linear-classification/classification.md b/chapter_linear-classification/classification.md
new file mode 100644
index 0000000..356165b
--- /dev/null
+++ b/chapter_linear-classification/classification.md
@@ -0,0 +1,130 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# The Base Classification Model
+:label:`sec_classification`
+
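+You may have noticed that, in the case of regression, the implementation from scratch and the concise implementation using framework functionality were quite similar. The same is true for classification. Since a great many models in this book deal with classification, it is worth adding some functionality to support this setting specifically. This section provides a base class for classification models to simplify future code.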
+
+```{.python .input  n=2}
+%%tab mxnet
+from d2l import mxnet as d2l
+from mxnet import autograd, np, npx, gluon
+npx.set_np()
+```
+
+```{.python .input  n=3}
+%%tab pytorch
+from d2l import torch as d2l
+import torch
+```
+
+```{.python .input  n=4}
+%%tab tensorflow
+from d2l import tensorflow as d2l
+import tensorflow as tf
+from IPython import display
+```
+
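+## The `Classifier` Class
+
+We define the `Classifier` class below. In `validation_step` we report both the loss value and the classification accuracy on a validation batch. We draw an update for every `num_val_batches` batches. This has the benefit of generating the averaged loss and accuracy on the whole validation data. These averages are not exactly correct if the last batch contains fewer examples, but we ignore this minor difference to keep the code simple.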
+
+```{.python .input  n=5}
+%%tab all
+class Classifier(d2l.Module):  #@save
+    def validation_step(self, batch):
+        Y_hat = self(*batch[:-1])
+        self.plot('loss', self.loss(Y_hat, batch[-1]), train=False)
+        self.plot('acc', self.accuracy(Y_hat, batch[-1]), train=False)
+```
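The effect of a smaller last batch on the averaged validation loss can be illustrated with a small standalone sketch. It does not use `d2l`; the per-example losses, the batch size, and the helper names `mean` and `batches` are all made up for illustration.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def batches(values, batch_size):
    """Split a list into consecutive batches; the last one may be shorter."""
    return [values[i:i + batch_size] for i in range(0, len(values), batch_size)]


# Hypothetical per-example validation losses: 10 examples with batch size 4,
# so the last batch holds only 2 examples.
losses = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 4.0, 4.0]
batch_means = [mean(batch) for batch in batches(losses, 4)]

quick_estimate = mean(batch_means)  # average of the per-batch averages
exact_mean = mean(losses)           # average over all examples

print(batch_means, quick_estimate, exact_mean)
```

Here the short last batch is weighted as heavily as the full ones, so the averaged per-batch value (2.0) differs from the exact mean (1.6). With many batches of similar size the gap is usually small, which is why ignoring it keeps the code simple at little cost.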
+
+By default we use a stochastic gradient descent optimizer, operating on minibatches, just as we did in the context of linear regression.
+
+```{.python .input  n=6}
+%%tab mxnet
+@d2l.add_to_class(d2l.Module)  #@save
+def configure_optimizers(self):
+    params = self.parameters()
+    if isinstance(params, list):
+        return d2l.SGD(params, self.lr)
+    return gluon.Trainer(params, 'sgd', {'learning_rate': self.lr})
+```
+
+```{.python .input  n=7}
+%%tab pytorch
+@d2l.add_to_class(d2l.Module)  #@save
+def configure_optimizers(self):
+    return torch.optim.SGD(self.parameters(), lr=self.lr)
+```
+
+```{.python .input  n=8}
+%%tab tensorflow
+@d2l.add_to_class(d2l.Module)  #@save
+def configure_optimizers(self):
+    return tf.keras.optimizers.SGD(self.lr)
+```
+
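+## Accuracy
+
+Given a predicted probability distribution `y_hat`, we typically choose the class with the highest predicted probability whenever we must output a hard prediction. Indeed, many applications require that we make a choice. For instance, Gmail must categorize an email into "Primary", "Social", "Updates", "Forums", or "Spam". It might estimate probabilities internally, but at the end of the day it has to choose one among the classes.
+
+When predictions are consistent with the label class `y`, they are correct. The classification accuracy is the fraction of all predictions that are correct. Although it can be difficult to optimize accuracy directly (it is not differentiable), it is often the performance measure that we care about the most. It is often *the* relevant quantity in benchmarks. As such, we will nearly always report it when training classifiers.
+
+Accuracy is computed as follows. First, if `y_hat` is a matrix, we assume that the second dimension stores prediction scores for each class. We use `argmax` to obtain the predicted class by the index for the largest entry in each row. Then we [**compare the predicted class with the ground-truth `y` elementwise.**] Since the equality operator `==` is sensitive to data types, we convert the data type of the predictions to match that of `y`. The result is a tensor containing entries of 0 (false) and 1 (true). Taking the mean yields the fraction of correct predictions.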
+
+```{.python .input  n=9}
+%%tab all
+@d2l.add_to_class(Classifier)  #@save
+def accuracy(self, Y_hat, Y, averaged=True):
+    """Compute the fraction of correct predictions (per-example hits if averaged=False)."""
+    Y_hat = d2l.reshape(Y_hat, (-1, Y_hat.shape[-1]))
+    preds = d2l.astype(d2l.argmax(Y_hat, axis=1), Y.dtype)
+    compare = d2l.astype(preds == d2l.reshape(Y, -1), d2l.float32)
+    return d2l.reduce_mean(compare) if averaged else compare
+```
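To make the argmax-and-compare steps concrete, here is a framework-free sketch on two hand-written examples. It mirrors the logic described above but is not the `d2l` implementation; the helper `argmax` and the toy scores are assumptions for illustration.

```python
def argmax(row):
    """Index of the largest entry in a list of scores."""
    return max(range(len(row)), key=lambda j: row[j])


def accuracy(y_hat, y, averaged=True):
    """Fraction of correct predictions, or per-example hits if averaged=False."""
    preds = [argmax(row) for row in y_hat]
    compare = [1.0 if p == t else 0.0 for p, t in zip(preds, y)]
    return sum(compare) / len(compare) if averaged else compare


# Two examples, three classes. The predicted classes are 2 and 2.
y_hat = [[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]]
y = [0, 2]

print(accuracy(y_hat, y, averaged=False))  # per-example hits
print(accuracy(y_hat, y))                  # fraction correct
```

The first example is predicted as class 2 but labeled 0, the second is predicted and labeled 2, so one of two predictions is correct and the accuracy is 0.5.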
+
+```{.python .input  n=10}
+%%tab mxnet
+
+@d2l.add_to_class(d2l.Module)  #@save
+def get_scratch_params(self):
+    params = []
+    for attr in dir(self):
+        a = getattr(self, attr)
+        if isinstance(a, np.ndarray):
+            params.append(a)
+        if isinstance(a, d2l.Module):
+            params.extend(a.get_scratch_params())
+    return params
+
+@d2l.add_to_class(d2l.Module)  #@save
+def parameters(self):
+    params = self.collect_params()
+    return params if isinstance(params, gluon.parameter.ParameterDict) and len(
+        params.keys()) else self.get_scratch_params()
+```
+
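+## Summary
+
+Classification is a sufficiently common problem that it warrants its own convenience functions. Of central importance in classification is the *accuracy* of the classifier. Note that while we often care primarily about accuracy, we train classifiers to optimize a variety of other objectives for statistical and computational reasons. However, regardless of which loss function was minimized during training, it is useful to have a convenience method for assessing the accuracy of our classifier empirically.
+
+## Exercises
+
+1. Denote by $L_v$ the validation loss, and let $L_v^q$ be its quick and dirty estimate computed by the loss function averaging in this section. Lastly, denote by $l_v^b$ the loss on the last minibatch. Express $L_v$ in terms of $L_v^q$, $l_v^b$, and the sample and minibatch sizes.
+1. Show that the quick and dirty estimate $L_v^q$ is unbiased. That is, show that $E[L_v] = E[L_v^q]$. Why would you still want to use $L_v$ instead?
+1. Given a multiclass classification loss, denoting by $l(y,y')$ the penalty of estimating $y'$ when we see $y$ and given a probability $p(y \mid x)$, formulate the rule for an optimal selection of $y'$. Hint: express the expected loss, using $l$ and $p(y \mid x)$.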
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/6808)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/6809)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/6810)
+:end_tab:
diff --git a/chapter_linear-classification/classification_origin.md b/chapter_linear-classification/classification_origin.md
new file mode 100644
index 0000000..78cca39
--- /dev/null
+++ b/chapter_linear-classification/classification_origin.md
@@ -0,0 +1,149 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# The Base Classification Model
+:label:`sec_classification`
+
+You may have noticed that the implementations from scratch and the concise implementation using framework functionality were quite similar in the case of regression. The same is true for classification. Since a great many models in this book deal with classification, it is worth adding some functionality to support this setting specifically. This section provides a base class for classification models to simplify future code.
+
+```{.python .input  n=2}
+%%tab mxnet
+from d2l import mxnet as d2l
+from mxnet import autograd, np, npx, gluon
+npx.set_np()
+```
+
+```{.python .input  n=3}
+%%tab pytorch
+from d2l import torch as d2l
+import torch
+```
+
+```{.python .input  n=4}
+%%tab tensorflow
+from d2l import tensorflow as d2l
+import tensorflow as tf
+from IPython import display
+```
+
+## The `Classifier` Class
+
+We define the `Classifier` class below. In the `validation_step` we report both the loss value and the classification accuracy on a validation batch. We draw an update for every `num_val_batches` batches. This has the benefit of generating the averaged loss and accuracy on the whole validation data. These average numbers are not exactly correct if the last batch contains fewer examples, but we ignore this minor difference to keep the code simple.
+
+```{.python .input  n=5}
+%%tab all
+class Classifier(d2l.Module):  #@save
+    def validation_step(self, batch):
+        Y_hat = self(*batch[:-1])
+        self.plot('loss', self.loss(Y_hat, batch[-1]), train=False)
+        self.plot('acc', self.accuracy(Y_hat, batch[-1]), train=False)
+```
+
+By default we use a stochastic gradient descent optimizer, operating on minibatches, just as we did in the context of linear regression.
+
+```{.python .input  n=6}
+%%tab mxnet
+@d2l.add_to_class(d2l.Module)  #@save
+def configure_optimizers(self):
+    params = self.parameters()
+    if isinstance(params, list):
+        return d2l.SGD(params, self.lr)
+    return gluon.Trainer(params, 'sgd', {'learning_rate': self.lr})
+```
+
+```{.python .input  n=7}
+%%tab pytorch
+@d2l.add_to_class(d2l.Module)  #@save
+def configure_optimizers(self):
+    return torch.optim.SGD(self.parameters(), lr=self.lr)
+```
+
+```{.python .input  n=8}
+%%tab tensorflow
+@d2l.add_to_class(d2l.Module)  #@save
+def configure_optimizers(self):
+    return tf.keras.optimizers.SGD(self.lr)
+```
+
+## Accuracy
+
+Given the predicted probability distribution `y_hat`,
+we typically choose the class with the highest predicted probability
+whenever we must output a hard prediction.
+Indeed, many applications require that we make a choice.
+For instance, Gmail must categorize an email into "Primary", "Social", "Updates", "Forums", or "Spam".
+It might estimate probabilities internally,
+but at the end of the day it has to choose one among the classes.
+
+When predictions are consistent with the label class `y`, they are correct.
+The classification accuracy is the fraction of all predictions that are correct.
+Although it can be difficult to optimize accuracy directly (it is not differentiable),
+it is often the performance measure that we care about the most. It is often *the*
+relevant quantity in benchmarks. As such, we will nearly always report it when training classifiers.
+
+Accuracy is computed as follows.
+First, if `y_hat` is a matrix,
+we assume that the second dimension stores prediction scores for each class.
+We use `argmax` to obtain the predicted class by the index for the largest entry in each row.
+Then we [**compare the predicted class with the ground-truth `y` elementwise.**]
+Since the equality operator `==` is sensitive to data types,
+we convert `y_hat`'s data type to match that of `y`.
+The result is a tensor containing entries of 0 (false) and 1 (true).
+Taking the mean yields the fraction of correct predictions.
+
+```{.python .input  n=9}
+%%tab all
+@d2l.add_to_class(Classifier)  #@save
+def accuracy(self, Y_hat, Y, averaged=True):
+    """Compute the fraction of correct predictions (per-example hits if averaged=False)."""
+    Y_hat = d2l.reshape(Y_hat, (-1, Y_hat.shape[-1]))
+    preds = d2l.astype(d2l.argmax(Y_hat, axis=1), Y.dtype)
+    compare = d2l.astype(preds == d2l.reshape(Y, -1), d2l.float32)
+    return d2l.reduce_mean(compare) if averaged else compare
+```
+
+```{.python .input  n=10}
+%%tab mxnet
+
+@d2l.add_to_class(d2l.Module)  #@save
+def get_scratch_params(self):
+    params = []
+    for attr in dir(self):
+        a = getattr(self, attr)
+        if isinstance(a, np.ndarray):
+            params.append(a)
+        if isinstance(a, d2l.Module):
+            params.extend(a.get_scratch_params())
+    return params
+
+@d2l.add_to_class(d2l.Module)  #@save
+def parameters(self):
+    params = self.collect_params()
+    return params if isinstance(params, gluon.parameter.ParameterDict) and len(
+        params.keys()) else self.get_scratch_params()
+```
+
+## Summary
+
+Classification is a sufficiently common problem that it warrants its own convenience functions. Of central importance in classification is the *accuracy* of the classifier. Note that while we often care primarily about accuracy, we train classifiers to optimize a variety of other objectives for statistical and computational reasons. However, regardless of which loss function was minimized during training, it's useful to have a convenience method for assessing the accuracy of our classifier empirically. 
+
+
+## Exercises
+
+1. Denote by $L_v$ the validation loss, and let $L_v^q$ be its quick and dirty estimate computed by the loss function averaging in this section. Lastly, denote by $l_v^b$ the loss on the last minibatch. Express $L_v$ in terms of $L_v^q$, $l_v^b$, and the sample and minibatch sizes.
+1. Show that the quick and dirty estimate $L_v^q$ is unbiased. That is, show that $E[L_v] = E[L_v^q]$. Why would you still want to use $L_v$ instead?
+1. Given a multiclass classification loss, denoting by $l(y,y')$ the penalty of estimating $y'$ when we see $y$ and given a probability $p(y \mid x)$, formulate the rule for an optimal selection of $y'$. Hint: express the expected loss, using $l$ and $p(y \mid x)$.
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/6808)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/6809)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/6810)
+:end_tab:
diff --git a/chapter_linear-classification/environment-and-distribution-shift.md b/chapter_linear-classification/environment-and-distribution-shift.md
new file mode 100644
index 0000000..733994d
--- /dev/null
+++ b/chapter_linear-classification/environment-and-distribution-shift.md
@@ -0,0 +1,252 @@
+# Environment and Distribution Shift
+:label:`sec_environment-and-distribution-shift`
+
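+In the previous sections, we worked through a number of hands-on applications of machine learning, fitting models to a variety of datasets. And yet, we never stopped to contemplate either where the data comes from in the first place or what we ultimately plan to do with the outputs of our models. Too often, machine learning developers in possession of data rush to develop models without pausing to consider these fundamental issues.
+
+Many failed machine learning deployments can be traced back to this pattern. Sometimes models appear to perform marvelously as measured by test set accuracy but fail catastrophically in deployment when the distribution of data suddenly shifts. More insidiously, sometimes the very deployment of a model can be the catalyst that perturbs the data distribution. Say, for example, that we trained a model to predict who will repay and who will default on a loan, finding that an applicant's choice of footwear was associated with the risk of default (Oxfords indicate repayment, sneakers indicate default). We might be inclined thereafter to grant loans to all applicants wearing Oxfords and to deny all applicants wearing sneakers.
+
+In this case, our ill-considered leap from pattern recognition to decision-making and our failure to critically consider the environment might have disastrous consequences. For starters, as soon as we began making decisions based on footwear, customers would catch on and change their behavior. Before long, all applicants would be wearing Oxfords, without any accompanying improvement in creditworthiness. Take a minute to digest this, because similar issues abound in many applications of machine learning: by introducing our model-based decisions to the environment, we might break the model.
+
+While we cannot possibly give these topics a complete treatment in one section, we aim here to expose some common concerns and to stimulate the critical thinking required to detect these situations early, mitigate damage, and use machine learning responsibly. Some of the solutions are simple (ask for the "right" data), some are technically difficult (implement a reinforcement learning system), and others require that we step outside the realm of statistical prediction altogether and grapple with difficult philosophical questions concerning the ethical application of algorithms.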
+
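+## Types of Distribution Shift
+
+To begin, we stick with the passive prediction setting, considering the various ways that data distributions might shift and what might be done to salvage model performance. In one classic setup, we assume that our training data was sampled from some distribution $p_S(\mathbf{x},y)$ but that our test data will consist of unlabeled examples drawn from some different distribution $p_T(\mathbf{x},y)$. Already, we must confront a sobering reality. Absent any assumptions on how $p_S$ and $p_T$ relate to each other, learning a robust classifier is impossible.
+
+Consider a binary classification problem, where we wish to distinguish between dogs and cats. If the distribution can shift in arbitrary ways, then our setup permits the pathological case in which the distribution over inputs remains constant, $p_S(\mathbf{x}) = p_T(\mathbf{x})$, but the labels are all flipped: $p_S(y \mid \mathbf{x}) = 1 - p_T(y \mid \mathbf{x})$. In other words, if God can suddenly decide that in the future all "cats" are now dogs and what we previously called "dogs" are now cats, without any change in the distribution of inputs $p(\mathbf{x})$, then we cannot possibly distinguish this setting from one in which the distribution did not change at all.
+
+Fortunately, under some restricted assumptions on the ways our data might change in the future, principled algorithms can detect shift and sometimes even adapt on the fly, improving on the accuracy of the original classifier.
+
+### Covariate Shift
+
+Among categories of distribution shift, covariate shift may be the most widely studied. Here, we assume that while the distribution of inputs may change over time, the labeling function, i.e., the conditional distribution $P(y \mid \mathbf{x})$, does not change. Statisticians call this *covariate shift* because the problem arises due to a shift in the distribution of the covariates (features). While we can sometimes reason about distribution shift without invoking causality, we note that covariate shift is the natural assumption to invoke in settings where we believe that $\mathbf{x}$ causes $y$.
+
+Consider the challenge of distinguishing cats and dogs. Our training data might consist of images of the kind in :numref:`fig_cat-dog-train`.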
+
+![Training data for distinguishing cats and dogs.](../img/cat-dog-train.svg)
+:label:`fig_cat-dog-train`
+
+At test time we are asked to classify the images in :numref:`fig_cat-dog-test`.
+
+![Test data for distinguishing cats and dogs.](../img/cat-dog-test.svg)
+:label:`fig_cat-dog-test`
+
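+The training set consists of photos, while the test set contains only cartoons. Training on a dataset with substantially different characteristics from the test set can spell trouble absent a coherent plan for how to adapt to the new domain.
+
+### Label Shift
+
+*Label shift* describes the converse problem.
+Here, we assume that the label marginal $P(y)$ can change but the class-conditional distribution $P(\mathbf{x} \mid y)$ remains fixed across domains. Label shift is a reasonable assumption to make when we believe that $y$ causes $\mathbf{x}$. For example, we may want to predict diagnoses given their symptoms (or other manifestations), even as the relative prevalence of diagnoses changes over time. Label shift is the appropriate assumption here because diseases cause symptoms. In some degenerate cases the label shift and covariate shift assumptions can hold simultaneously. For example, when the label is deterministic, the covariate shift assumption will be satisfied, even when $y$ causes $\mathbf{x}$. Interestingly, in these cases, it is often advantageous to work with methods that flow from the label shift assumption. That is because these methods tend to involve manipulating objects that look like labels (often low-dimensional), as opposed to objects that look like inputs, which tend to be high-dimensional in deep learning.
+
+### Concept Shift
+
+We may also encounter the related problem of *concept shift*, which arises when the very definitions of labels can change. This sounds weird: a *cat* is a *cat*, no? However, other categories are subject to changes in usage over time. Diagnostic criteria for mental illness, what passes for fashionable, and job titles are all subject to considerable amounts of concept shift. It turns out that if we navigate around the United States, shifting the source of our data by geography, we will find considerable concept shift regarding the distribution of names for *soft drinks*, as shown in :numref:`fig_popvssoda`.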
+
+![Concept shift on soft drink names in the United States.](../img/popvssoda.png)
+:width:`400px`
+:label:`fig_popvssoda`
+
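+If we were to build a machine translation system, the distribution $P(y \mid \mathbf{x})$ might be different depending on our location. This problem can be tricky to spot. We might hope to exploit the knowledge that shift only takes place gradually, either in a temporal or a geographic sense.
+
+## Examples of Distribution Shift
+
+Before delving into formalism and algorithms, we can discuss some concrete situations where covariate or concept shift might not be obvious.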
+
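+### Medical Diagnostics
+
+Imagine that you want to design an algorithm to detect cancer. You collect data from healthy and sick people and you train your algorithm. It works fine, giving you high accuracy, and you conclude that you are ready for a successful career in medical diagnostics.
+*Not so fast.*
+
+The distributions that gave rise to the training data and those you will encounter in the wild might differ considerably. This happened to an unfortunate startup that some of us (the authors) worked with years ago. They were developing a blood test for a disease that predominantly affects older men and hoped to study it using blood samples that they had collected from patients. However, it is considerably more difficult to obtain blood samples from healthy men than from sick patients already in the system. To compensate, the startup solicited blood donations from students on a university campus to serve as healthy controls in developing their test. Then they asked whether we could help them build a classifier for detecting the disease.
+
+As we explained to them, it would indeed be easy to distinguish between the healthy and sick cohorts with near-perfect accuracy. However, that is because the test subjects differed in age, hormone levels, physical activity, diet, alcohol consumption, and many more factors unrelated to the disease. This was unlikely to be the case with real patients. Due to their sampling procedure, we could expect to encounter extreme covariate shift. Moreover, this case was unlikely to be correctable via conventional methods. In short, they wasted a significant sum of money.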
+
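+### Self-Driving Cars
+
+Say a company wanted to leverage machine learning for developing self-driving cars. One key component here is a roadside detector. Since real annotated data is expensive to get, they had the (smart and questionable) idea to use synthetic data from a game rendering engine as additional training data. This worked really well on "test data" drawn from the rendering engine. Alas, inside a real car it was a disaster. As it turned out, the roadside had been rendered with a very simplistic texture. More importantly, *all* the roadside had been rendered with the *same* texture, and the roadside detector learned about this "feature" very quickly.
+
+A similar thing happened to the US Army when they first tried to detect tanks in the forest. They took aerial photographs of the forest without tanks, then drove the tanks into the forest and took another set of pictures. The classifier appeared to work *perfectly*. Unfortunately, it had merely learned how to distinguish trees with shadows from trees without shadows: the first set of pictures was taken in the early morning, the second set at noon.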
+
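+### Nonstationary Distributions
+
+A much more subtle situation arises when the distribution changes slowly (also known as a *nonstationary distribution*) and the model is not updated adequately. Below are some typical cases.
+
+* We train a computational advertising model and then fail to update it frequently (e.g., we forget to incorporate that an obscure new device called an iPad was just launched).
+* We build a spam filter. It works well at detecting all spam that we have seen so far. But then the spammers wise up and craft new messages that look unlike anything we have seen before.
+* We build a product recommendation system. It works throughout the winter but then continues to recommend Santa hats long after Christmas.
+
+### More Anecdotes
+
+* We build a face detector. It works well on all benchmarks. Unfortunately it fails on test data: the offending examples are close-ups where the face fills the entire image (no such data was in the training set).
+* We build a Web search engine for the US market and want to deploy it in the UK.
+* We train an image classifier by compiling a large dataset where each among a large set of classes is equally represented in the dataset, say 1000 categories, represented by 1000 images each. Then we deploy the system in the real world, where the actual label distribution of photographs is decidedly non-uniform.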
+
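+## Correction of Distribution Shift
+
+As we have discussed, there are many cases where the training and test distributions $P(\mathbf{x}, y)$ are different. In some cases, we get lucky and the models work despite covariate, label, or concept shift. In other cases, we can do better by employing principled strategies to cope with the shift. The remainder of this section grows considerably more technical. The impatient reader could continue on to the next section, as this material is not a prerequisite for subsequent concepts.
+
+### Empirical Risk and Risk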
+:label:`subsec_empirical-risk-and-risk`
+
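+Let's first reflect on what exactly is happening during model training: we iterate over features and associated labels of training data $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ and update the parameters of a model $f$ after every minibatch. For simplicity we do not consider regularization, so we largely minimize the loss on the training data: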
+
+$$\mathop{\mathrm{minimize}}_f \frac{1}{n} \sum_{i=1}^n l(f(\mathbf{x}_i), y_i),$$
+:eqlabel:`eq_empirical-risk-min`
+
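+where $l$ is the loss function measuring "how bad" the prediction $f(\mathbf{x}_i)$ is given the associated label $y_i$. Statisticians call the term in :eqref:`eq_empirical-risk-min` the *empirical risk*. The *empirical risk* is an average loss over the training data that approximates the *risk*, which is the expectation of the loss over the entire population of data drawn from their true distribution $p(\mathbf{x},y)$: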
+
+$$E_{p(\mathbf{x}, y)} [l(f(\mathbf{x}), y)] = \int\int l(f(\mathbf{x}), y) p(\mathbf{x}, y) \;d\mathbf{x}dy.$$
+:eqlabel:`eq_true-risk`
+
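+However, in practice we typically cannot obtain the entire population of data. Thus, *empirical risk minimization*, which minimizes the empirical risk in :eqref:`eq_empirical-risk-min`, is a practical strategy for machine learning, with the hope of approximately minimizing the risk.
+
+### Covariate Shift Correction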
+:label:`subsec_covariate-shift-correction`
+
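+Assume that we want to estimate some dependency $P(y \mid \mathbf{x})$ for which we have labeled data $(\mathbf{x}_i, y_i)$. Unfortunately, the observations $\mathbf{x}_i$ are drawn from some *source distribution* $q(\mathbf{x})$ rather than the *target distribution* $p(\mathbf{x})$. Fortunately, the dependency assumption means that the conditional distribution does not change: $p(y \mid \mathbf{x}) = q(y \mid \mathbf{x})$. If the source distribution $q(\mathbf{x})$ is "wrong", we can correct for that by using the following simple identity in the risk: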
+
+$$
+\begin{aligned}
+\int\int l(f(\mathbf{x}), y) p(y \mid \mathbf{x})p(\mathbf{x}) \;d\mathbf{x}dy =
+\int\int l(f(\mathbf{x}), y) q(y \mid \mathbf{x})q(\mathbf{x})\frac{p(\mathbf{x})}{q(\mathbf{x})} \;d\mathbf{x}dy.
+\end{aligned}
+$$
+
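+In other words, we need to reweigh each data example by the ratio of the probability that it would have been drawn from the correct distribution to the probability that it would have been drawn from the wrong one: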
+
+$$\beta_i \stackrel{\mathrm{def}}{=} \frac{p(\mathbf{x}_i)}{q(\mathbf{x}_i)}.$$
+
+Plugging in the weight $\beta_i$ for each data example $(\mathbf{x}_i, y_i)$, we can train our model using
+*weighted empirical risk minimization*:
+
+$$\mathop{\mathrm{minimize}}_f \frac{1}{n} \sum_{i=1}^n \beta_i l(f(\mathbf{x}_i), y_i).$$
+:eqlabel:`eq_weighted-empirical-risk-min`
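The reweighting identity can be checked numerically in a toy setting. The sketch below assumes, purely for illustration, that both densities are known one-dimensional Gaussians ($q = \mathcal{N}(0, 1)$ as the source, $p = \mathcal{N}(1, 1)$ as the target) and uses $x^2$ as a stand-in for the loss $l(f(\mathbf{x}), y)$; none of these choices come from the text.

```python
import math
import random


def normal_pdf(x, mu, sigma=1.0):
    """Density of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def loss(x):
    """Stand-in for l(f(x), y); chosen only so the expectations are known."""
    return x * x


random.seed(0)
n = 200_000

# Samples from the source distribution q = N(0, 1).
xs = [random.gauss(0.0, 1.0) for _ in range(n)]

# Importance weights beta_i = p(x_i) / q(x_i), with target p = N(1, 1).
betas = [normal_pdf(x, 1.0) / normal_pdf(x, 0.0) for x in xs]

unweighted = sum(loss(x) for x in xs) / n                      # estimates E_q[x^2] = 1
weighted = sum(b * loss(x) for b, x in zip(betas, xs)) / n     # estimates E_p[x^2] = 2

print(unweighted, weighted)
```

Although every sample comes from $q$, the weighted average approaches the expectation under $p$ (here $1 + 1^2 = 2$), while the unweighted average approaches the expectation under $q$ (here 1). In realistic problems the densities are not available in closed form, so the ratio has to be estimated.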
+
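+Alas, we do not know that ratio, so before we can do anything useful we need to estimate it. Many methods are available, including some fancy operator-theoretic approaches that attempt to recalibrate the expectation operator directly using a minimum-norm or a maximum entropy principle. Note that for any such approach, we need samples drawn from both distributions: the "true" $p$, e.g., by access to test data, and the one used for generating the training set $q$ (the latter is trivially available). Note, however, that we only need features $\mathbf{x} \sim p(\mathbf{x})$; we do not need to access labels $y \sim p(y)$.
+
+In this case, there exists a very effective approach that will give almost as good results as the original: logistic regression, which is a special case of softmax regression (see :numref:`sec_softmax`) for binary classification. This is all that is needed to compute estimated probability ratios. We learn a classifier to distinguish between data drawn from $p(\mathbf{x})$ and data drawn from $q(\mathbf{x})$. If it is impossible to distinguish between the two distributions, then the associated instances are equally likely to come from either one of the two distributions. On the other hand, any instances that can be well discriminated should be significantly overweighted or underweighted accordingly.
+
+For simplicity's sake assume that we have an equal number of instances from both distributions $p(\mathbf{x})$ and $q(\mathbf{x})$, respectively. Now denote by $z$ labels that are $1$ for data drawn from $p$ and $-1$ for data drawn from $q$. Then the probability in a mixed dataset is given by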
+
+$$P(z=1 \mid \mathbf{x}) = \frac{p(\mathbf{x})}{p(\mathbf{x})+q(\mathbf{x})} \text{ and hence } \frac{P(z=1 \mid \mathbf{x})}{P(z=-1 \mid \mathbf{x})} = \frac{p(\mathbf{x})}{q(\mathbf{x})}.$$
+
+Thus, if we use a logistic regression approach, where $P(z=1 \mid \mathbf{x})=\frac{1}{1+\exp(-h(\mathbf{x}))}$ ($h$ is a parameterized function), it follows that
+
+$$
+\beta_i = \frac{1/(1 + \exp(-h(\mathbf{x}_i)))}{\exp(-h(\mathbf{x}_i))/(1 + \exp(-h(\mathbf{x}_i)))} = \exp(h(\mathbf{x}_i)).
+$$
+
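+As a result, we need to solve two problems: first, distinguishing between data drawn from the two distributions, and then a weighted empirical risk minimization problem in :eqref:`eq_weighted-empirical-risk-min`, where we weigh terms by $\beta_i$.
+
+Now we are ready to describe a correction algorithm. Suppose that we have a training set $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ and an unlabeled test set $\{\mathbf{u}_1, \ldots, \mathbf{u}_m\}$. For covariate shift, we assume that $\mathbf{x}_i$ for all $1 \leq i \leq n$ are drawn from some source distribution and $\mathbf{u}_i$ for all $1 \leq i \leq m$ are drawn from the target distribution. Here is a prototypical algorithm for correcting covariate shift:
+
+1. Create a binary-classification training set: $\{(\mathbf{x}_1, -1), \ldots, (\mathbf{x}_n, -1), (\mathbf{u}_1, 1), \ldots, (\mathbf{u}_m, 1)\}$.
+1. Train a binary classifier using logistic regression to get the function $h$.
+1. Weigh training data using $\beta_i = \exp(h(\mathbf{x}_i))$ or better $\beta_i = \min(\exp(h(\mathbf{x}_i)), c)$ for some constant $c$.
+1. Use weights $\beta_i$ for training on $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ in :eqref:`eq_weighted-empirical-risk-min`.
+
+Note that the above algorithm relies on a crucial assumption. For this scheme to work, each data example in the target (e.g., test time) distribution must have had nonzero probability of occurring at training time. If we find a point where $p(\mathbf{x}) > 0$ but $q(\mathbf{x}) = 0$, then the corresponding importance weight should be infinity.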
+
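+### Label Shift Correction
+
+Assume that we are dealing with a classification task with $k$ categories. Using the same notation as in :numref:`subsec_covariate-shift-correction`, $q$ and $p$ are the source distribution (e.g., training time) and the target distribution (e.g., test time), respectively. Assume that the distribution of labels shifts over time, $q(y) \neq p(y)$, but the class-conditional distribution stays the same: $q(\mathbf{x} \mid y)=p(\mathbf{x} \mid y)$. If the source distribution $q(y)$ is "wrong", we can correct for that according to the following identity in the risk as defined in :eqref:`eq_true-risk`: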
+
+$$
+\begin{aligned}
+\int\int l(f(\mathbf{x}), y) p(\mathbf{x} \mid y)p(y) \;d\mathbf{x}dy =
+\int\int l(f(\mathbf{x}), y) q(\mathbf{x} \mid y)q(y)\frac{p(y)}{q(y)} \;d\mathbf{x}dy.
+\end{aligned}
+$$
+
+Here, our importance weights will correspond to the label likelihood ratios:
+
+$$\beta_i \stackrel{\mathrm{def}}{=} \frac{p(y_i)}{q(y_i)}.$$
+
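+One nice thing about label shift is that if we have a reasonably good model on the source distribution, then we can get consistent estimates of these weights without ever having to deal with the ambient dimension. In deep learning, the inputs tend to be high-dimensional objects like images, while the labels are often simpler objects like categories.
+
+To estimate the target label distribution, we first take our reasonably good off-the-shelf classifier (typically trained on the training data) and compute its confusion matrix using the validation set (also from the training distribution). The *confusion matrix* $\mathbf{C}$ is simply a $k \times k$ matrix, where each column corresponds to the label category (ground truth) and each row corresponds to our model's predicted category. Each cell's value $c_{ij}$ is the fraction of total predictions on the validation set where the true label was $j$ and our model predicted $i$.
+
+Now, we cannot calculate the confusion matrix on the target data directly, because we do not get to see the labels for the examples that we see in the wild, unless we invest in a complex real-time annotation pipeline. What we can do, however, is average all of our model's predictions at test time together, yielding the mean model outputs $\mu(\hat{\mathbf{y}}) \in \mathbb{R}^k$, whose $i^\mathrm{th}$ element $\mu(\hat{y}_i)$ is the fraction of total predictions on the test set where our model predicted $i$.
+
+It turns out that under some mild conditions (if our classifier was reasonably accurate in the first place, if the target data contains only categories that we have seen before, and if the label shift assumption holds in the first place, which is the strongest assumption here), we can estimate the test set label distribution by solving a simple linear system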
+
+$$\mathbf{C} p(\mathbf{y}) = \mu(\hat{\mathbf{y}}),$$
+
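+because as an estimate $\sum_{j=1}^k c_{ij} p(y_j) = \mu(\hat{y}_i)$ holds for all $1 \leq i \leq k$, where $p(y_j)$ is the $j^\mathrm{th}$ element of the $k$-dimensional label distribution vector $p(\mathbf{y})$. If our classifier is sufficiently accurate to begin with, then the confusion matrix $\mathbf{C}$ will be invertible, and we get a solution $p(\mathbf{y}) = \mathbf{C}^{-1} \mu(\hat{\mathbf{y}})$.
+
+Because we observe the labels on the source data, it is easy to estimate the distribution $q(y)$. Then for any training example $i$ with label $y_i$, we can take the ratio of our estimated $p(y_i)/q(y_i)$ to calculate the weight $\beta_i$, and plug this into weighted empirical risk minimization in :eqref:`eq_weighted-empirical-risk-min`.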
+
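+### Concept Shift Correction
+
+Concept shift is much harder to fix in a principled manner. For instance, in a situation where the problem suddenly changes from distinguishing cats from dogs to distinguishing white from black animals, it would be unreasonable to assume that we can do much better than just collecting new labels and training from scratch. Fortunately, in practice, such extreme shifts are rare. Instead, what usually happens is that the task keeps on changing slowly. To make things more concrete, here are some examples:
+
+* In computational advertising, new products are launched and
+old products become less popular. This means that the distribution over ads and their popularity changes gradually, and any click-through rate predictor needs to change gradually with it.
+* Traffic camera lenses degrade gradually due to environmental wear, affecting image quality progressively.
+* News content changes gradually (i.e., most of the news remains unchanged but new stories appear).
+
+In such cases, we can use the same approach that we used for training networks to make them adapt to the change in the data. In other words, we use the existing network weights and simply perform a few update steps with the new data rather than training from scratch.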
+
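+## A Taxonomy of Learning Problems
+
+Armed with knowledge about how to deal with changes in distributions, we can now consider some other aspects of machine learning problem formulation.
+
+### Batch Learning
+
+In *batch learning*, we have access to training features and labels $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, which we use to train a model $f(\mathbf{x})$. Later on, we deploy this model to score new data $(\mathbf{x}, y)$ drawn from the same distribution. This is the default assumption for any of the problems that we discuss here. For instance, we might train a cat detector based on lots of pictures of cats and dogs. Once we have trained it, we ship it as part of a smart catdoor computer vision system that lets only cats in. This is then installed in a customer's home and is never updated again (barring extreme circumstances).
+
+### Online Learning
+
+Now imagine that the data $(\mathbf{x}_i, y_i)$ arrives one sample at a time. More specifically, assume that we first observe $\mathbf{x}_i$, then we need to come up with an estimate $f(\mathbf{x}_i)$, and only once we have done this do we observe $y_i$ and, with it, receive a reward or incur a loss, given our decision. Many real problems fall into this category. For example, we need to predict tomorrow's stock price, which allows us to trade based on that estimate, and at the end of the day we find out whether our estimate allowed us to make a profit. In other words, in *online learning*, we have the following cycle, in which we continuously improve our model given new observations: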
+
+$$
+\mathrm{model} ~ f_t \longrightarrow
+\mathrm{data} ~ \mathbf{x}_t \longrightarrow
+\mathrm{estimate} ~ f_t(\mathbf{x}_t) \longrightarrow
+\mathrm{observation} ~ y_t \longrightarrow
+\mathrm{loss} ~ l(y_t, f_t(\mathbf{x}_t)) \longrightarrow
+\mathrm{model} ~ f_{t+1}
+$$
+
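+### Bandits
+
+*Bandits* are a special case of the problem above. While in most learning problems we have a continuously parametrized function $f$ whose parameters we want to learn (e.g., a deep network), in a *bandit* problem we only have a finite number of arms that we can pull, i.e., a finite number of actions that we can take. It is not very surprising that for this simpler problem stronger theoretical guarantees in terms of optimality can be obtained. We list it mainly because this problem is often (confusingly) treated as if it were a distinct learning setting.
+
+### Control
+
+In many cases the environment remembers what we did. Not necessarily in an adversarial manner, but it will just remember, and the response will depend on what happened before. For instance, a coffee boiler controller will observe different temperatures depending on whether it was heating the boiler previously. PID (proportional-integral-derivative) controller algorithms are a popular choice there. Likewise, a user's behavior on a news site will depend on what we showed them previously (e.g., they will read most news only once). Many such algorithms form a model of the environment in which they act so as to make their decisions appear less random. Recently, control theory (e.g., PID variants) has also been used to automatically tune hyperparameters to achieve better disentangling and reconstruction quality, and to improve the diversity of generated text and the reconstruction quality of generated images :cite:`Shao.Yao.Sun.ea.2020`.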
+
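+### Reinforcement Learning
+
+In the more general case of an environment with memory, we may encounter situations where the environment is trying to cooperate with us (cooperative games, in particular non-zero-sum games), or others where the environment will try to win. Chess, Go, Backgammon, and StarCraft are some of the cases in *reinforcement learning*. Likewise, we might want to build a good controller for autonomous cars. The other cars are likely to respond to the autonomous car's driving style in nontrivial ways, e.g., trying to avoid it, trying to cause an accident, or trying to cooperate with it.
+
+### Considering the Environment
+
+One key distinction between the different situations above is that a strategy that works throughout in a stationary environment might not work throughout when the environment can adapt. For instance, an arbitrage opportunity discovered by a trader is likely to disappear once the trader starts exploiting it. The speed and manner in which the environment changes determines to a large extent the type of algorithms that we can bring to bear. For instance, if we know that things may only change slowly, we can force any estimate to change only slowly, too. If we know that the environment might change instantaneously, but only very infrequently, we can make allowances for that. These types of knowledge are crucial for the aspiring data scientist in dealing with concept shift, i.e., when the problem being solved changes over time.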
+
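+## Fairness, Accountability, and Transparency in Machine Learning
+
+Finally, it is important to remember that when you deploy machine learning systems you are not merely optimizing a predictive model; you are typically providing a tool that will be used to (partially or fully) automate decisions. These technical systems can impact the lives of individuals subject to the resulting decisions. The leap from considering predictions to making decisions raises not only new technical questions, but also a slew of ethical questions that must be carefully considered. If we are deploying a medical diagnostic system, we need to know for which populations it may work and for which it may not. Overlooking foreseeable risks to the welfare of a subpopulation could cause us to administer inferior care. Moreover, once we contemplate decision-making systems, we must step back and reconsider how we evaluate our technology. Among other consequences of this change of scope, we will find that *accuracy* is seldom the right measure. For instance, when translating predictions into actions, we will often want to take into account the potential cost sensitivity of erring in various ways. If one way of misclassifying an image could be perceived as a racial sleight of hand, while misclassification to a different category would be harmless, then we might want to adjust our thresholds accordingly, accounting for societal values in designing the decision-making protocol. We also want to be careful about how prediction systems can lead to feedback loops. For example, consider predictive policing systems, which allocate patrol officers to areas with high forecasted crime. It is easy to see how a worrying pattern can emerge:
+
+ 1. Neighborhoods with more crime get more patrols.
+ 1. Consequently, more crimes are discovered in these neighborhoods, entering the training data available for future iterations.
+ 1. Exposed to more positives, the model predicts yet more crime in these neighborhoods.
+ 1. In the next iteration, the updated model targets the same neighborhood even more heavily, leading to yet more crimes discovered, and so on.
+
+Often, the various mechanisms by which a model's predictions become coupled to its training data are unaccounted for in the modeling process. This can lead to what researchers call *runaway feedback loops*. Additionally, we want to be careful about whether we are addressing the right problem in the first place. Predictive algorithms now play an outsize role in mediating the dissemination of information. Should the news that an individual encounters be determined by the set of Facebook pages they have *Liked*? These are just a few among the many pressing ethical dilemmas that you might encounter in a career in machine learning.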
+
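+## Summary
+
+* In many cases training and test sets do not come from the same distribution. This is called distribution shift.
+* The risk is the expectation of the loss over the entire population of data drawn from the true distribution. However, this entire population is usually unavailable. Empirical risk is an average loss over the training data used to approximate the risk. In practice, we perform empirical risk minimization.
+* Under the corresponding assumptions, covariate and label shift can be detected and corrected for at test time. Failure to account for this bias can become problematic at test time.
+* In some cases, the environment may remember automated actions and respond in surprising ways. We must account for this possibility when building models and continue to monitor live systems, staying open to the possibility that our models and the environment will become entangled in unanticipated ways.
+
+## Exercises
+
+1. What could happen when we change the behavior of a search engine? What might the users do? What about the advertisers?
+1. Implement a covariate shift detector. Hint: build a classifier.
+1. Implement a covariate shift corrector.
+1. Besides distribution shift, what else could affect how the empirical risk approximates the risk?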
+
+[Discussions](https://discuss.d2l.ai/t/105)
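The covariate shift correction steps in the section above (build a source-versus-target dataset, fit $h$ by logistic regression, clip $\exp(h)$, reweight the loss) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the book's implementation: the helper names `fit_logistic`, `importance_weights`, and `weighted_empirical_risk`, the linear form of $h$, the plain gradient-descent optimizer, the clipping constant `c=10.0`, and the synthetic Gaussian data are all hypothetical choices made for illustration. Like the derivation, it assumes equally many source and target examples.

```python
import numpy as np

def fit_logistic(X, z, lr=0.1, steps=2000):
    """Fit a linear h(x) = X @ w + b by gradient descent on the
    logistic loss log(1 + exp(-z * h)), with labels z in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        margin = z * (X @ w + b)
        # Derivative of the logistic loss with respect to h.
        g = -z / (1.0 + np.exp(margin))
        w -= lr * (X.T @ g) / n
        b -= lr * g.mean()
    return w, b

def importance_weights(X_train, X_test, c=10.0):
    """Steps 1-3: label source data -1 and target data +1, fit h,
    and return beta_i = min(exp(h(x_i)), c) for the training inputs."""
    X = np.vstack([X_train, X_test])
    z = np.concatenate([-np.ones(len(X_train)), np.ones(len(X_test))])
    w, b = fit_logistic(X, z)
    return np.minimum(np.exp(X_train @ w + b), c)

def weighted_empirical_risk(losses, beta):
    """Step 4: average of per-example losses weighted by beta."""
    return np.mean(beta * losses)

# Synthetic illustration (assumed data): source N(0, 1), target N(1, 1).
rng = np.random.default_rng(0)
X_src = rng.normal(0.0, 1.0, size=(500, 1))
X_tgt = rng.normal(1.0, 1.0, size=(500, 1))
beta = importance_weights(X_src, X_tgt)

# Hypothetical per-example losses, only to show how the weights are used.
losses = (X_src[:, 0] - 0.5) ** 2
risk = weighted_empirical_risk(losses, beta)
```

In this toy setting the target distribution sits to the right of the source, so training points with larger $x$ receive larger weights, and the clipping constant keeps any single example from dominating the weighted risk.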
diff --git a/chapter_multilayer-perceptrons/environment_origin.md b/chapter_linear-classification/environment-and-distribution-shift_origin.md
similarity index 98%
rename from chapter_multilayer-perceptrons/environment_origin.md
rename to chapter_linear-classification/environment-and-distribution-shift_origin.md
index 09256a3..3a7b589 100644
--- a/chapter_multilayer-perceptrons/environment_origin.md
+++ b/chapter_linear-classification/environment-and-distribution-shift_origin.md
@@ -1,10 +1,11 @@
 # Environment and Distribution Shift
+:label:`sec_environment-and-distribution-shift`
 
 In the previous sections, we worked through
 a number of hands-on applications of machine learning,
 fitting models to a variety of datasets.
 And yet, we never stopped to contemplate
-either where data come from in the first place
+either where data comes from in the first place
 or what we plan to ultimately do
 with the outputs from our models.
 Too often, machine learning developers
@@ -63,7 +64,7 @@ To begin, we stick with the passive prediction setting
 considering the various ways that data distributions might shift
 and what might be done to salvage model performance.
 In one classic setup, we assume that our training data
-were sampled from some distribution $p_S(\mathbf{x},y)$
+was sampled from some distribution $p_S(\mathbf{x},y)$
 but that our test data will consist
 of unlabeled examples drawn from
 some different distribution $p_T(\mathbf{x},y)$.
@@ -79,7 +80,7 @@ then our setup permits the pathological case
 in which the distribution over inputs remains
 constant: $p_S(\mathbf{x}) = p_T(\mathbf{x})$,
 but the labels are all flipped:
-$p_S(y | \mathbf{x}) = 1 - p_T(y | \mathbf{x})$.
+$p_S(y \mid \mathbf{x}) = 1 - p_T(y \mid \mathbf{x})$.
 In other words, if God can suddenly decide
 that in the future all "cats" are now dogs
 and what we previously called "dogs" are now cats---without
@@ -240,7 +241,7 @@ In short, they wasted a significant sum of money.
 Say a company wanted to leverage machine learning
 for developing self-driving cars.
 One key component here is a roadside detector.
-Since real annotated data are expensive to get,
+Since real annotated data is expensive to get,
 they had the (smart and questionable) idea
 to use synthetic data from a game rendering engine
 as additional training data.
@@ -279,7 +280,7 @@ Below are some typical cases.
 
 ### More Anecdotes
 
-* We build a face detector. It works well on all benchmarks. Unfortunately it fails on test data---the offending examples are close-ups where the face fills the entire image (no such data were in the training set).
+* We build a face detector. It works well on all benchmarks. Unfortunately it fails on test data---the offending examples are close-ups where the face fills the entire image (no such data was in the training set).
 * We build a Web search engine for the US market and want to deploy it in the UK.
 * We train an image classifier by compiling a large dataset where each among a large set of classes is equally represented in the dataset, say 1000 categories, represented by 1000 images each. Then we deploy the system in the real world, where the actual label distribution of photographs is decidedly non-uniform.
 
@@ -304,7 +305,7 @@ as this material is not prerequisite to subsequent concepts.
 ### Empirical Risk and  Risk
 :label:`subsec_empirical-risk-and-risk`
 
-Let us first reflect about what exactly
+Let's first reflect about what exactly
 is happening during model training:
 we iterate over features and associated labels
 of training data
@@ -521,7 +522,7 @@ where our model predicted $i$.
 
 It turns out that under some mild conditions---if
 our classifier was reasonably accurate in the first place,
-and if the target data contain only categories
+and if the target data contains only categories
 that we have seen before,
 and if the label shift assumption holds in the first place
 (the strongest assumption here),
@@ -602,7 +603,7 @@ Likewise, a user's behavior on a news site will depend on what we showed him pre
 Recently,
 control theory (e.g., PID variants) has also been used
 to automatically tune hyperparameters
-to achive better disentangling and reconstruction quality,
+to achieve better disentangling and reconstruction quality,
 and improve the diversity of generated text and the reconstruction quality of generated images :cite:`Shao.Yao.Sun.ea.2020`.
 
 
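The label shift correction described in this chapter (estimate the target label distribution by solving $\mathbf{C} p(\mathbf{y}) = \mu(\hat{\mathbf{y}})$, then weight each training example by $p(y_i)/q(y_i)$) can be sketched as below. This is a hedged illustration, not the book's code: the function names and the tiny hand-built arrays are hypothetical. It also assumes each column of $\mathbf{C}$ is normalized to sum to one, so that $c_{ij}$ estimates the probability of predicting $i$ given true label $j$; under that reading the solution of the linear system is itself a probability vector. Every class is assumed to appear in the validation set.

```python
import numpy as np

def conditional_confusion(y_true, y_pred, k):
    """k x k matrix whose column j holds the distribution of predictions
    for validation examples with true label j (columns sum to one)."""
    C = np.zeros((k, k))
    np.add.at(C, (y_pred, y_true), 1.0)  # rows: predicted, columns: true
    return C / C.sum(axis=0, keepdims=True)

def label_shift_weights(y_val, yhat_val, yhat_test, y_train, k):
    """Return the estimated target label distribution p and the
    per-class importance weights p(y) / q(y)."""
    C = conditional_confusion(y_val, yhat_val, k)
    # Mean model output on the unlabeled test set.
    mu = np.bincount(yhat_test, minlength=k) / len(yhat_test)
    p = np.linalg.solve(C, mu)
    # Guard against small negative entries caused by sampling noise.
    p = np.clip(p, 0.0, None)
    p = p / p.sum()
    # Source label distribution, estimated from the training labels.
    q = np.bincount(y_train, minlength=k) / len(y_train)
    return p, p / q

# Hand-built example (assumed data) with k = 2 classes.
y_val = np.array([0] * 50 + [1] * 50)
yhat_val = np.array([0] * 45 + [1] * 5 + [0] * 10 + [1] * 40)
yhat_test = np.array([0] * 41 + [1] * 59)
y_train = y_val

p_hat, class_weights = label_shift_weights(y_val, yhat_val, yhat_test, y_train, k=2)
beta = class_weights[y_train]  # one weight per training example
```

The resulting `beta` can be plugged into the weighted empirical risk in the same way as the covariate shift weights.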
diff --git a/chapter_linear-classification/generalization-classification.md b/chapter_linear-classification/generalization-classification.md
new file mode 100644
index 0000000..2f72f38
--- /dev/null
+++ b/chapter_linear-classification/generalization-classification.md
@@ -0,0 +1,89 @@
+# Generalization in Classification
+
+:label:`chap_classification_generalization` 
+
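+So far, we have focused on how to tackle multiclass classification problems by training (linear) neural networks with multiple outputs and softmax functions. Interpreting our model's outputs as probabilistic predictions, we motivated and derived the cross-entropy loss function, which calculates the negative log likelihood that our model (for a fixed set of parameters) assigns to the actual labels. And finally, we put these tools into practice by fitting our model to the training set. However, as always, our goal is to learn *general patterns*, as assessed empirically on previously unseen data (the test set). High accuracy on the training set means nothing. Whenever each of our inputs is unique (and indeed this is true for most high-dimensional datasets), we can attain perfect accuracy on the training set by just memorizing the dataset on the first training epoch, and subsequently looking up the label whenever we see a new image. And yet, memorizing the exact labels associated with the exact training examples does not tell us how to classify new examples. Absent further guidance, we might have to fall back on random guessing whenever we encounter new examples.
+
+A number of burning questions demand immediate attention:
+1. How many test examples do we need to precisely estimate the accuracy of our classifiers on the underlying population?
+1. What happens if we keep evaluating models on the same test set repeatedly?
+1. Why should we expect that fitting our linear models to the training set should fare any better than our naive memorization scheme?
+
+While :numref:`sec_generalization_basics` introduced the basics of overfitting and generalization in the context of linear regression, this chapter will go a little deeper, introducing some of the foundational ideas of statistical learning theory. It turns out that we often can guarantee generalization *a priori*: for many models, and for any desired upper bound on the generalization gap $\epsilon$, we can often determine some required number of samples $n$ such that if our training set contains at least $n$ samples, then our empirical error will lie within $\epsilon$ of the true error,
+*for any data generating distribution*.
+Unfortunately, it also turns out that while these sorts of guarantees provide a profound set of intellectual building blocks, they are of limited practical utility to the deep learning practitioner. In short, these guarantees suggest that ensuring generalization of deep neural networks *a priori* requires an absurd number of examples (perhaps trillions or more), even when we find that, on the tasks we care about, deep neural networks typically generalize remarkably well with far fewer examples (thousands). Thus deep learning practitioners often forgo a priori guarantees altogether, instead employing methods on the basis that they have generalized well on similar problems in the past, and certifying generalization *post hoc* through empirical evaluations. When we get to :numref:`chap_perceptrons`, we will revisit generalization and provide a light introduction to the vast scientific literature that has sprung up in attempts to explain why deep neural networks generalize in practice.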
+
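+## The Test Set
+
+Since we have already begun to rely on test sets as the gold standard method for assessing generalization error, let's get started by discussing the properties of such error estimates. Let's focus on a fixed classifier $f$, without worrying about how it was obtained. Moreover, suppose that we possess a *fresh* dataset of examples $\mathcal{D} = \{(\mathbf{x}^{(i)},y^{(i)})\}_{i=1}^n$ that were not used to train the classifier $f$. The *empirical error* of our classifier $f$ on $\mathcal{D}$ is simply the fraction of instances for which the prediction $f(\mathbf{x}^{(i)})$ disagrees with the true label $y^{(i)}$, and is given by the following expression: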
+
+$$\epsilon_\mathcal{D}(f) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(f(\mathbf{x}^{(i)}) \neq y^{(i)}).$$
+
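+By contrast, the *population error* is the *expected* fraction of examples in the underlying population (some distribution $P(X,Y)$ characterized by probability density function $p(\mathbf{x},y)$) for which our classifier disagrees with the true label: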
+
+$$\epsilon(f) =  E_{(\mathbf{x}, y) \sim P} \mathbf{1}(f(\mathbf{x}) \neq y) =
+\int\int \mathbf{1}(f(\mathbf{x}) \neq y) p(\mathbf{x}, y) \;d\mathbf{x} dy.$$
+
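+While $\epsilon(f)$ is the quantity that we actually care about, we cannot observe it directly, just as we cannot directly observe the average height in a large population without measuring every single person. We can only estimate this quantity based on samples. Because our test set $\mathcal{D}$ is statistically representative of the underlying population, we can view $\epsilon_\mathcal{D}(f)$ as a statistical estimator of the population error $\epsilon(f)$. Moreover, because our quantity of interest $\epsilon(f)$ is an expectation (of the random variable $\mathbf{1}(f(X) \neq Y)$) and the corresponding estimator $\epsilon_\mathcal{D}(f)$ is the sample average, estimating the population error is simply the classic problem of mean estimation, which you may recall from :numref:`sec_prob`.
+
+An important classical result from probability theory called the *central limit theorem* guarantees that whenever we possess $n$ random samples $a_1, ..., a_n$ drawn from any distribution with mean $\mu$ and standard deviation $\sigma$, as the number of samples $n$ approaches infinity, the sample average $\hat{\mu}$ approximately tends towards a normal distribution centered at the true mean and with standard deviation $\sigma/\sqrt{n}$. Already, this tells us something important: as the number of examples grows large, our test error $\epsilon_\mathcal{D}(f)$ should approach the true error $\epsilon(f)$ at a rate of $\mathcal{O}(1/\sqrt{n})$. Thus, to estimate our test error twice as precisely, we must collect four times as large a test set. To estimate it one hundred times as precisely, we must collect ten thousand times as large a test set. In general, such a rate of $\mathcal{O}(1/\sqrt{n})$ is often the best we can hope for in statistics.
+
+Now that we know something about the asymptotic rate at which our test error $\epsilon_\mathcal{D}(f)$ converges to the true error $\epsilon(f)$, we can zoom in on some important details. Recall that the random variable of interest $\mathbf{1}(f(X) \neq Y)$ can only take values $0$ and $1$ and thus is a Bernoulli random variable, characterized by a parameter indicating the probability that it takes value $1$. Here, $1$ means that our classifier made an error, so the parameter of our random variable is actually the true error rate $\epsilon(f)$. The variance $\sigma^2$ of a Bernoulli depends on its parameter (here, $\epsilon(f)$) according to the expression $\epsilon(f)(1-\epsilon(f))$. While $\epsilon(f)$ is initially unknown, we know that it cannot be greater than $1$. A little investigation of this function reveals that our variance is highest when the true error rate is close to $0.5$ and can be far lower when it is close to $0$ or close to $1$. This tells us that the asymptotic standard deviation of our estimate $\epsilon_\mathcal{D}(f)$ of the error $\epsilon(f)$ (over the choice of the $n$ test samples) cannot be any greater than $\sqrt{0.25/n}$.
+
+If we ignore the fact that this rate characterizes behavior as the test set size approaches infinity rather than when we possess finite samples, this tells us that if we want our test error $\epsilon_\mathcal{D}(f)$ to approximate the population error $\epsilon(f)$ such that one standard deviation corresponds to an interval of $\pm 0.01$, then we should collect roughly 2500 samples. If we want to fit two standard deviations in that range and thus be 95% confident that $\epsilon_\mathcal{D}(f) \in \epsilon(f) \pm 0.01$, then we will need 10000 samples!
+
+This turns out to be the size of the test sets for many popular benchmarks in machine learning. You might be surprised to find out that thousands of applied deep learning papers get published every year making a big deal out of error rate improvements of $0.01$ or less. Of course, when the error rates are much closer to $0$, then an improvement of $0.01$ can indeed be a big deal.
+
+One pesky feature of our analysis thus far is that it really only tells us about asymptotics, i.e., how the relationship between $\epsilon_\mathcal{D}$ and $\epsilon$ evolves as our sample size goes to infinity. Fortunately, because our random variable is bounded, we can obtain valid finite sample bounds by applying an inequality due to Hoeffding (1963):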
+
+$$P(\epsilon_\mathcal{D}(f) - \epsilon(f) \geq t) < \exp\left( - 2n t^2 \right).$$
+
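+Solving for the smallest dataset size that would allow us to conclude with 95% confidence that the distance $t$ between our estimate $\epsilon_\mathcal{D}(f)$ and the true error rate $\epsilon(f)$ does not exceed $0.01$, you will find that roughly $15000$ examples are required, as compared to the $10000$ examples suggested by the asymptotic analysis above. If you go deeper into statistics you will find that this trend holds generally. Guarantees that hold even in finite samples are typically slightly more conservative. Note that in the scheme of things, these numbers are not so far apart, reflecting the general usefulness of asymptotic analysis for giving us ballpark figures even if they are not guarantees we can take to court.
+
+## Test Set Reuse
+
+In some sense, you are now set up to succeed at conducting empirical machine learning research. Nearly all practical models are developed and validated based on test set performance, and you are now a master of the test set. For any fixed classifier $f$, you know how to evaluate its test error $\epsilon_\mathcal{D}(f)$, and know precisely what can (and cannot) be said about its population error $\epsilon(f)$.
+
+So let's say that you take this knowledge and prepare to train your first model $f_1$. Knowing just how confident you need to be in the performance of your classifier's error rate, you apply our analysis above to determine an appropriate number of examples to set aside for the test set. Moreover, let's assume that you took the lessons from :numref:`sec_generalization_basics` to heart and made sure to preserve the sanctity of the test set by conducting all of your preliminary analysis, hyperparameter tuning, and even selection among multiple competing model architectures on a validation set. Finally, you evaluate your model $f_1$ on the test set and report an unbiased estimate of the population error with an associated confidence interval.
+
+So far everything seems to be going well. However, that night you wake up at 3am with a brilliant idea for a new modeling approach. The next day, you code up your new model, tune its hyperparameters on the validation set, and not only are you getting your new model $f_2$ to work but its error rate appears to be much lower than that of $f_1$. However, the thrill of discovery suddenly fades as you prepare for the final evaluation. You do not have a test set!
+
+Even though the original test set $\mathcal{D}$ is still sitting on your server, you now face two formidable problems. First, when you collected your test set, you determined the required level of precision under the assumption that you were evaluating a single classifier $f$. However, if you get into the business of evaluating multiple classifiers $f_1, ..., f_k$ on the same test set, you must consider the problem of false discovery. Before, you might have been 95% sure that $\epsilon_\mathcal{D}(f) \in \epsilon(f) \pm 0.01$ for a single classifier $f$, and thus the probability of a misleading result was a mere 5%. With $k$ classifiers in the mix, it can be hard to guarantee that there is not even one among them whose test set performance is misleading. With 20 classifiers under consideration, you might have no power at all to rule out the possibility that at least one among them received a misleading score. This problem relates to multiple hypothesis testing, which, despite a vast literature in statistics, remains a persistent problem plaguing scientific research.
+
+If that is not enough to worry you, there is a special reason to distrust the results that you get on subsequent evaluations. Recall that our analysis of test set performance rested on the assumption that the classifier was chosen absent any contact with the test set, and thus we could view the test set as drawn randomly from the underlying population. Here, not only are you testing multiple functions, but the subsequent function $f_2$ was chosen after you observed the test set performance of $f_1$. Once information from the test set has leaked to the modeler, it can never be a true test set again in the strictest sense. This problem is called *adaptive overfitting* and has recently emerged as a topic of intense interest to learning theorists and statisticians :cite:`dwork2015preserving`. Fortunately, while it is possible to leak all information out of a holdout set, and the theoretical worst case scenarios are bleak, these analyses may be too conservative. In practice, take care to create real test sets, to consult them as infrequently as possible, to account for multiple hypothesis testing when reporting confidence intervals, and to dial up your vigilance more aggressively when the stakes are high and your dataset size is small. When running a series of benchmark challenges, it is often good practice to maintain several test sets so that after each round, the old test set can be demoted to a validation set.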
+
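+## Statistical Learning Theory
+
+At once, *test sets are all that we really have*, and yet this fact seems strangely unsatisfying. First, we seldom possess a *true test set*: unless we are the ones creating the dataset, someone else has probably already evaluated their own classifier on our ostensible "test set". And even when we get first dibs, we soon find ourselves frustrated, wishing we could evaluate our subsequent modeling attempts without the gnawing feeling that we cannot trust our numbers. Moreover, even a true test set can only tell us *post hoc* whether a classifier has in fact generalized to the population, not whether we have any reason to expect *a priori* that it should generalize.
+
+With these misgivings in mind, you might now be sufficiently primed to see the appeal of *statistical learning theory*, the mathematical subfield of machine learning whose practitioners aim to elucidate the fundamental principles that explain why and when models trained on empirical data can and will generalize to unseen data. One of the primary aims of statistical learning researchers over several decades has been to bound the generalization gap in terms of the properties of the model class and the number of samples in the dataset.
+
+Learning theorists aim to bound the difference between the *empirical error* $\epsilon_\mathcal{S}(f_\mathcal{S})$ of a learned classifier $f_\mathcal{S}$, both trained and evaluated on the training set $\mathcal{S}$, and the true error $\epsilon(f_\mathcal{S})$ of that same classifier on the underlying population. This might look similar to the evaluation problem that we just addressed, but there is a major difference. Before, the classifier $f$ was fixed and we only needed a dataset for evaluative purposes. And indeed, any fixed classifier does generalize: its error on a (previously unseen) dataset is an unbiased estimate of the population error. But what can we say when a classifier is trained and evaluated on the same dataset? Can we ever be confident that the training error will be close to the testing error?
+
+Suppose that our learned classifier $f_\mathcal{S}$ must be chosen among some pre-specified set of functions $\mathcal{F}$. Recall from our discussion of test sets that while it is easy to estimate the error of a single classifier, things get hairy when we begin to consider collections of classifiers. Even if the empirical error of any one (fixed) classifier will be close to its true error with high probability, once we consider a collection of classifiers, we need to worry about the possibility that *just one* classifier in the set will receive a badly misestimated error. The worry is that if just one classifier in our collection receives a misleadingly low error, then we might pick it and thereby grossly underestimate the population error. Moreover, even for linear models, because their parameters are continuously valued, we are typically choosing among an infinite class of functions ($|\mathcal{F}| = \infty$).
+
+One ambitious solution to the problem is to develop analytic tools for proving uniform convergence, i.e., that with high probability, the empirical error rate for every classifier in the class $f\in\mathcal{F}$ will *simultaneously* converge to its true error rate. In other words, we seek a theoretical principle that would allow us to state that with probability at least $1-\delta$ (for some small $\delta$) no classifier's error rate $\epsilon(f)$ (among all classifiers in the class $\mathcal{F}$) will be misestimated by more than some small amount $\alpha$. Clearly, we cannot make such statements for all model classes $\mathcal{F}$. Recall the class of memorization machines that always achieve empirical error $0$ but never outperform random guessing on the underlying population.
+
+In a sense the class of memorizers is too flexible. No such uniform convergence result could possibly hold. On the other hand, a fixed classifier is useless: it generalizes perfectly, but fits neither the training data nor the test data. The central question of learning has thus historically been framed as a tradeoff between more flexible (higher variance) model classes that better fit the training data but risk overfitting, versus more rigid (higher bias) model classes that generalize well but risk underfitting. A central question in learning theory has been to develop the appropriate mathematical analysis to quantify where a model sits along this spectrum, and to provide the associated guarantees.
+
+In a series of seminal papers, Vapnik and Chervonenkis extended the theory on the convergence of relative frequencies to more general classes of functions :cite:`VapChe64,VapChe68,VapChe71,VapChe74b,VapChe81,VapChe91`. One of the key contributions of this line of work is the Vapnik-Chervonenkis (VC) dimension, which measures (one notion of) the complexity (flexibility) of a model class. Moreover, one of their key results bounds the difference between the empirical error and the population error as a function of the VC dimension and the number of samples: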
+
+$$P\left(R[p, f] - R_\mathrm{emp}[\mathbf{X}, \mathbf{Y}, f] < \alpha\right) \geq 1-\delta
+\ \text{ for }\ \alpha \geq c \sqrt{(\mathrm{VC} - \log \delta)/n}.$$
+
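+Here $\delta > 0$ is the probability that the bound is violated, $\alpha$ is the upper bound on the generalization gap, and $n$ is the dataset size. Lastly, $c > 0$ is a constant that depends only on the scale of the loss that can be incurred. One use of the bound might be to plug in desired values of $\delta$ and $\alpha$ to determine how many samples to collect. The VC dimension quantifies the largest number of data points for which we can assign any arbitrary (binary) labeling and for each find some model $f$ in the class that agrees with that labeling. For example, linear models on $d$-dimensional inputs have VC dimension $d+1$. It is easy to see that a line can assign any possible labeling to three points in two dimensions, but not to four. Unfortunately, the theory tends to be overly pessimistic for more complex models, and obtaining this guarantee typically requires far more examples than are actually needed to achieve the desired error rate. Note also that, fixing the model class and $\delta$, our error rate again decays with the usual $\mathcal{O}(1/\sqrt{n})$ rate. It seems unlikely that we could do better in terms of $n$. However, as we vary the model class, VC dimension can present a pessimistic picture of the generalization gap.
+
+## Summary
+
+The most straightforward way to evaluate a model is to consult a test set comprised of previously unseen data. Test set evaluations provide an unbiased estimate of the true error and converge at the desired $\mathcal{O}(1/\sqrt{n})$ rate as the test set grows. We can provide approximate confidence intervals based on exact asymptotic distributions, or valid finite sample confidence intervals based on (more conservative) finite sample guarantees. Indeed, test set evaluation is the bedrock of modern machine learning research. However, test sets are seldom true test sets (they are used by multiple researchers again and again). Once the same test set is used to evaluate multiple models, controlling for false discovery can be difficult. This can cause huge problems in theory. In practice, the significance of the problem depends on the size of the holdout sets in question and on whether they are merely being used to choose hyperparameters or are leaking information more directly. Nevertheless, it is good practice to curate real test sets (or multiple) and to be as conservative as possible about how often they are used.
+
+Hoping to provide a more satisfying solution, statistical learning theorists have developed methods for guaranteeing uniform convergence over a model class. If indeed every model's empirical error converges to its true error simultaneously, then we are free to choose the model that performs best, minimizing the training error, knowing that it too will perform similarly well on the holdout data. Crucially, any such result must depend on some property of the model class. Vladimir Vapnik and Alexey Chervonenkis introduced the VC dimension, presenting uniform convergence results that hold for all models in a VC class. The training errors for all models in the class are (simultaneously) guaranteed to be close to their true errors, and guaranteed to grow closer at $\mathcal{O}(1/\sqrt{n})$ rates. Following the revolutionary discovery of VC dimension, numerous alternative complexity measures have been proposed, each facilitating an analogous generalization guarantee. See :citet:`boucheron2005theory` for a detailed discussion of several advanced ways of measuring function complexity. Unfortunately, while these complexity measures have become broadly useful tools in statistical theory, they turn out to be powerless (as straightforwardly applied) for explaining why deep neural networks generalize. Deep neural networks often have millions of parameters (or more), and can easily assign random labels to large collections of points. Nevertheless, they generalize well on practical problems and, surprisingly, they often generalize better when they are larger and deeper, despite incurring larger VC dimensions. In the next chapter, we will revisit generalization in the context of deep learning.
+
+## Exercises
+
+1. If we wish to estimate the error of a fixed model $f$ to within $0.0001$ with probability greater than 99.9%, how many samples do we need?
+1. Suppose that somebody else possesses a labeled test set $\mathcal{D}$ and only makes available the unlabeled inputs (features). Now suppose that you can only access the test set labels by running a model $f$ (no restrictions placed on the model class) on each of the unlabeled inputs and receiving the corresponding error $\epsilon_\mathcal{D}(f)$. How many models would you need to evaluate before you leak the entire test set and thus could appear to have error $0$, regardless of your true error?
+1. What is the VC dimension of the class of $5^\mathrm{th}$-order polynomials?
+1. What is the VC dimension of axis-aligned rectangles on two-dimensional data?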
+
+[Discussions](https://discuss.d2l.ai/t/6829)
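The test-set sizes quoted in "The Test Set" section above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names `n_asymptotic` and `n_hoeffding` are hypothetical, the first uses the worst-case Bernoulli standard deviation $\sqrt{0.25/n}$ from the central limit theorem argument, and the second inverts the one-sided Hoeffding bound $\exp(-2nt^2) \leq \delta$ exactly as it is stated in the text.

```python
import math

def n_asymptotic(t, num_std):
    """Test-set size such that num_std standard deviations fit inside
    +/- t, using the worst-case standard deviation sqrt(0.25 / n)."""
    return math.ceil(0.25 * (num_std / t) ** 2)

def n_hoeffding(t, delta):
    """Smallest n with exp(-2 * n * t**2) <= delta, i.e. the finite-sample
    size implied by the one-sided Hoeffding bound."""
    return math.ceil(math.log(1.0 / delta) / (2.0 * t ** 2))

one_sigma = n_asymptotic(0.01, 1)       # one standard deviation within 0.01
two_sigma = n_asymptotic(0.01, 2)       # two standard deviations within 0.01
finite_sample = n_hoeffding(0.01, 0.05) # 95% confidence via Hoeffding

print(one_sigma, two_sigma, finite_sample)
```

The three values correspond to the roughly 2500, 10000, and 15000 examples discussed in the text, and make concrete how the finite-sample guarantee is somewhat more conservative than the asymptotic estimate.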
diff --git a/chapter_linear-classification/generalization-classification_origin.md b/chapter_linear-classification/generalization-classification_origin.md
new file mode 100644
index 0000000..11a4d49
--- /dev/null
+++ b/chapter_linear-classification/generalization-classification_origin.md
@@ -0,0 +1,590 @@
+# Generalization in Classification
+
+:label:`chap_classification_generalization`
+
+
+
+So far, we have focused on how to tackle multiclass classification problems
+by training (linear) neural networks with multiple outputs and softmax functions.
+Interpreting our model's outputs as probabilistic predictions,
+we motivated and derived the cross-entropy loss function,
+which calculates the negative log likelihood
+that our model (for a fixed set of parameters)
+assigns to the actual labels.
+And finally, we put these tools into practice
+by fitting our model to the training set.
+However, as always, our goal is to learn *general patterns*,
+as assessed empirically on previously unseen data (the test set).
+High accuracy on the training set means nothing.
+Whenever each of our inputs is unique
+(and indeed this is true for most high-dimensional datasets),
+we can attain perfect accuracy on the training set
+by just memorizing the dataset on the first training epoch,
+and subsequently looking up the label whenever we see a new image.
+And yet, memorizing the exact labels
+associated with the exact training examples
+does not tell us how to classify new examples.
+Absent further guidance, we might have to fall back
+on random guessing whenever we encounter new examples.
+
+A number of burning questions demand immediate attention:
+1. How many test examples do we need to precisely estimate
+   the accuracy of our classifiers on the underlying population?
+1. What happens if we keep evaluating models on the same test set repeatedly?
+1. Why should we expect that fitting our linear models to the training set
+   should fare any better than our naive memorization scheme?
+
+
+While :numref:`sec_generalization_basics` introduced
+the basics of overfitting and generalization
+in the context of linear regression,
+this chapter will go a little deeper,
+introducing some of the foundational ideas
+of statistical learning theory.
+It turns out that we often can guarantee generalization *a priori*:
+for many models,
+and for any desired upper bound
+on the generalization gap $\epsilon$,
+we can often determine some required number of samples $n$
+such that if our training set contains at least $n$
+samples, then our empirical error will lie
+within $\epsilon$ of the true error,
+*for any data generating distribution*.
+Unfortunately, it also turns out
+that while these sorts of guarantees provide
+a profound set of intellectual building blocks,
+they are of limited practical utility
+to the deep learning practitioner.
+In short, these guarantees suggest
+that ensuring generalization
+of deep neural networks *a priori*
+requires an absurd number of examples
+(perhaps trillions or more),
+even when we find that, on the tasks we care about,
+deep neural networks typically generalize
+remarkably well with far fewer examples (thousands).
+Thus deep learning practitioners often forgo
+a priori guarantees altogether,
+instead employing methods on the basis
+that they have generalized well
+on similar problems in the past,
+and certifying generalization *post hoc*
+through empirical evaluations.
+When we get to :numref:`chap_perceptrons`,
+we will revisit generalization
+and provide a light introduction
+to the vast scientific literature
+that has sprung up in attempts
+to explain why deep neural networks generalize in practice.
+
+## The Test Set
+
+Since we have already begun to rely on test sets
+as the gold standard method
+for assessing generalization error,
+let's get started by discussing
+the properties of such error estimates.
+Let's focus on a fixed classifier $f$,
+without worrying about how it was obtained.
+Moreover suppose that we possess
+a *fresh* dataset of examples $\mathcal{D} = \{(\mathbf{x}^{(i)},y^{(i)})\}_{i=1}^n$
+that were not used to train the classifier $f$.
+The *empirical error* of our classifier $f$ on $\mathcal{D}$
+is simply the fraction of instances
+for which the prediction $f(\mathbf{x}^{(i)})$
+disagrees with the true label $y^{(i)}$,
+and is given by the following expression:
+
+$$\epsilon_\mathcal{D}(f) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(f(\mathbf{x}^{(i)}) \neq y^{(i)}).$$
+
+By contrast, the *population error*
+is the *expected* fraction
+of examples in the underlying population
+(some distribution $P(X,Y)$ characterized
+by probability density function $p(\mathbf{x},y)$)
+for which our classifier disagrees
+with the true label:
+
+$$\epsilon(f) =  E_{(\mathbf{x}, y) \sim P} \mathbf{1}(f(\mathbf{x}) \neq y) =
+\int\int \mathbf{1}(f(\mathbf{x}) \neq y) p(\mathbf{x}, y) \;d\mathbf{x} dy.$$
+
+While $\epsilon(f)$ is the quantity that we actually care about,
+we cannot observe it directly,
+just as we cannot directly
+observe the average height in a large population
+without measuring every single person.
+We can only estimate this quantity based on samples.
+Because our test set $\mathcal{D}$
+is statistically representative
+of the underlying population,
+we can view $\epsilon_\mathcal{D}(f)$ as a statistical
+estimator of the population error $\epsilon(f)$.
+Moreover, because our quantity of interest $\epsilon(f)$
+is an expectation (of the random variable $\mathbf{1}(f(X) \neq Y)$)
+and the corresponding estimator $\epsilon_\mathcal{D}(f)$
+is the sample average,
+estimating the population error
+is simply the classic problem of mean estimation,
+which you may recall from :numref:`sec_prob`.
+
+An important classical result from probability theory
+called the *central limit theorem* guarantees
+that whenever we possess $n$ random samples $a_1, ..., a_n$
+drawn from any distribution with mean $\mu$ and standard deviation $\sigma$,
+as the number of samples $n$ approaches infinity,
+the sample average $\hat{\mu}$ approximately
+tends towards a normal distribution centered
+at the true mean and with standard deviation $\sigma/\sqrt{n}$.
+Already, this tells us something important:
+as the number of examples grows large,
+our test error $\epsilon_\mathcal{D}(f)$
+should approach the true error $\epsilon(f)$
+at a rate of $\mathcal{O}(1/\sqrt{n})$.
+Thus, to estimate our test error twice as precisely,
+we must collect four times as large a test set.
+To estimate our test error one hundred times as precisely,
+we must collect ten thousand times as large a test set.
+In general, such a rate of $\mathcal{O}(1/\sqrt{n})$
+is often the best we can hope for in statistics.
+
+Now that we know something about the asymptotic rate
+at which our test error $\epsilon_\mathcal{D}(f)$ converges to the true error $\epsilon(f)$,
+we can zoom in on some important details.
+Recall that the random variable of interest
+$\mathbf{1}(f(X) \neq Y)$
+can only take values $0$ and $1$
+and thus is a Bernoulli random variable,
+characterized by a parameter
+indicating the probability that it takes value $1$.
+Here, $1$ means that our classifier made an error,
+so the parameter of our random variable
+is actually the true error rate $\epsilon(f)$.
+The variance $\sigma^2$ of a Bernoulli
+depends on its parameter (here, $\epsilon(f)$)
+according to the expression $\epsilon(f)(1-\epsilon(f))$.
+While $\epsilon(f)$ is initially unknown,
+we know that it cannot be greater than $1$.
+A little investigation of this function
+reveals that our variance is highest
+when the true error rate is close to $0.5$
+and can be far lower when it is
+close to $0$ or close to $1$.
+This tells us that the asymptotic standard deviation
+of our estimate $\epsilon_\mathcal{D}(f)$ of the error $\epsilon(f)$
+(over the choice of the $n$ test samples)
+cannot be any greater than $\sqrt{0.25/n}$.
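+
+The bound comes from maximizing the Bernoulli variance
+over the unknown error rate.
+As a quick check, writing $p$ as shorthand for $\epsilon(f)$
+(a symbol introduced only for this step):
+
+$$\frac{d}{dp}\, p(1-p) = 1 - 2p = 0 \iff p = \tfrac{1}{2},
+\quad \text{so } \sigma^2 = p(1-p) \leq \tfrac{1}{4} \ \text{ and } \ \frac{\sigma}{\sqrt{n}} \leq \sqrt{\frac{0.25}{n}} = \frac{1}{2\sqrt{n}}.$$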
+
+If we ignore the fact that this rate characterizes
+behavior as the test set size approaches infinity
+rather than when we possess finite samples,
+this tells us that if we want our test error $\epsilon_\mathcal{D}(f)$
+to approximate the population error $\epsilon(f)$
+such that one standard deviation corresponds
+to an interval of $\pm 0.01$,
+then we should collect roughly 2500 samples.
+If we want to fit two standard deviations
+in that range and thus be 95% confident
+that $\epsilon_\mathcal{D}(f) \in \epsilon(f) \pm 0.01$,
+then we will need 10000 samples!
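+
+Both figures can be recovered by plugging the worst-case
+standard deviation $\sqrt{0.25/n}$ into the target half-width
+(a sketch under the same asymptotic assumption as the text):
+
+$$\sqrt{\frac{0.25}{n}} = 0.01 \implies n = \frac{0.25}{0.01^2} = 2500,
+\qquad 2\sqrt{\frac{0.25}{n}} = 0.01 \implies n = \frac{4 \cdot 0.25}{0.01^2} = 10000.$$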
+
+This turns out to be the size of the test sets
+for many popular benchmarks in machine learning.
+You might be surprised to find out that thousands
+of applied deep learning papers get published every year
+making a big deal out of error rate improvements of $0.01$ or less.
+Of course, when the error rates are much closer to $0$,
+then an improvement of $0.01$ can indeed be a big deal.
+
+
+One pesky feature of our analysis thus far
+is that it really only tells us about asymptotics,
+i.e., how the relationship between $\epsilon_\mathcal{D}$ and $\epsilon$
+evolves as our sample size goes to infinity.
+Fortunately, because our random variable is bounded,
+we can obtain valid finite sample bounds
+by applying an inequality due to Hoeffding (1963):
+
+$$P(\epsilon_\mathcal{D}(f) - \epsilon(f) \geq t) < \exp\left( - 2n t^2 \right).$$
+
+Solving for the smallest dataset size
+that would allow us to conclude
+with 95% confidence that the distance $t$
+between our estimate $\epsilon_\mathcal{D}(f)$
+and the true error rate $\epsilon(f)$
+does not exceed $0.01$,
+you will find that roughly $15000$ examples are required
+as compared to the $10000$ examples suggested
+by the asymptotic analysis above.
+If you go deeper into statistics
+you will find that this trend holds generally.
+Guarantees that hold even in finite samples
+are typically slightly more conservative.
+Note that in the scheme of things,
+these numbers are not so far apart,
+reflecting the general usefulness
+of asymptotic analysis for giving
+us ballpark figures even if not
+guarantees we can take to court.
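+
+For reference, the figure of roughly $15000$ can be reproduced
+from the one-sided bound displayed above
+by setting its right-hand side to a failure probability $\delta$
+(a symbol introduced here only for this calculation)
+and solving for $n$:
+
+$$\exp\left(-2nt^2\right) = \delta \implies n = \frac{\log(1/\delta)}{2t^2},
+\quad \text{and with } \delta = 0.05,\ t = 0.01: \quad n = \frac{\log 20}{2 \cdot 10^{-4}} \approx \frac{2.996}{0.0002} \approx 14979.$$
+
+If one instead controls deviations in both directions,
+the standard two-sided form of Hoeffding's inequality
+carries an extra factor of $2$ on the right-hand side,
+which would give $n = \log(40)/(2 \cdot 10^{-4}) \approx 18444$.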
+
+## Test Set Reuse
+
+In some sense, you are now set up to succeed
+at conducting empirical machine learning research.
+Nearly all practical models are developed
+and validated based on test set performance
+and you are now a master of the test set.
+For any fixed classifier $f$,
+you know to evaluate its test error $\epsilon_\mathcal{D}(f)$,
+and know precisely what can (and can't)
+be said about its population error $\epsilon(f)$.
+
+So let's say that you take this knowledge
+and prepare to train your first model $f_1$.
+Knowing just how confident you need to be
+in the performance of your classifier's error rate
+you apply our analysis above to determine
+an appropriate number of examples
+to set aside for the test set.
+Moreover, let's assume that you took the lessons from
+:numref:`sec_generalization_basics` to heart
+and made sure to preserve the sanctity of the test set
+by conducting all of your preliminary analysis,
+hyperparameter tuning, and even selection
+among multiple competing model architectures
+on a validation set.
+Finally you evaluate your model $f_1$
+on the test set and report an unbiased
+estimate of the population error
+with an associated confidence interval.
+
+So far everything seems to be going well.
+However, that night you wake up at 3am
+with a brilliant idea for a new modeling approach.
+The next day, you code up your new model,
+tune its hyperparameters on the validation set
+and not only are you getting your new model $f_2$ to work
+but its error rate appears to be much lower than $f_1$'s.
+However, the thrill of discovery suddenly fades
+as you prepare for the final evaluation.
+You don't have a test set!
+
+Even though the original test set $\mathcal{D}$
+is still sitting on your server,
+you now face two formidable problems.
+First, when you collected your test set,
+you determined the required level of precision
+under the assumption that you were evaluating
+a single classifier $f$.
+However, if you get into the business
+of evaluating multiple classifiers $f_1, ..., f_k$
+on the same test set,
+you must consider the problem of false discovery.
+Before, you might have been 95% sure
+that $\epsilon_\mathcal{D}(f) \in \epsilon(f) \pm 0.01$
+for a single classifier $f$
+and thus the probability of a misleading result
+was a mere 5%.
+With $k$ classifiers in the mix,
+it can be hard to guarantee
+that there is not even one among them
+whose test set performance is misleading.
+With 20 classifiers under consideration,
+you might have no power at all
+to rule out the possibility
+that at least one among them
+received a misleading score.
+This problem relates to multiple hypothesis testing,
+which, despite a vast literature in statistics,
+remains a persistent problem plaguing scientific research.
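+
+A rough calculation illustrates the issue.
+If each of $k$ classifiers has a 5% chance of a misleading score,
+the union bound controls the chance that at least one is misleading,
+and it becomes vacuous at $k = 20$.
+If we additionally assume, purely for illustration,
+that the $k$ events are independent
+(they generally are not when the same test set is reused),
+the probability can be computed exactly:
+
+$$P(\text{at least one misleading}) \leq k \cdot 0.05 \quad (= 1 \text{ for } k = 20),
+\qquad 1 - 0.95^{20} \approx 0.64 \ \text{ under independence}.$$
+
+One standard remedy, the Bonferroni correction,
+evaluates each classifier at level $0.05/k$
+so that the union bound again yields an overall 5% guarantee.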
+
+
+If that's not enough to worry you,
+there's a special reason to distrust
+the results that you get on subsequent evaluations.
+Recall that our analysis of test set performance
+rested on the assumption that the classifier
+was chosen absent any contact with the test set
+and thus we could view the test set
+as drawn randomly from the underlying population.
+Here, not only are you testing multiple functions,
+the subsequent function $f_2$ was chosen
+after you observed the test set performance of $f_1$.
+Once information from the test set has leaked to the modeler,
+it can never be a true test set again in the strictest sense.
+This problem is called *adaptive overfitting* and has recently emerged
+as a topic of intense interest to learning theorists and statisticians
+:cite:`dwork2015preserving`.
+Fortunately, while it is possible
+to leak all information out of a holdout set,
+and the theoretical worst case scenarios are bleak,
+these analyses may be too conservative.
+In practice, take care to create real test sets,
+to consult them as infrequently as possible,
+to account for multiple hypothesis testing
+when reporting confidence intervals,
+and to dial up your vigilance more aggressively
+when the stakes are high and your dataset size is small.
+When running a series of benchmark challenges,
+it's often good practice to maintain
+several test sets so that after each round,
+the old test set can be demoted to a validation set.
+
+
+
+
+
+## Statistical Learning Theory
+
+At once, *test sets are all that we really have*,
+and yet this fact seems strangely unsatisfying.
+First, we seldom possess a *true test set*---unless
+we are the ones creating the dataset,
+someone else has probably already evaluated
+their own classifier on our ostensible "test set".
+And even when we get first dibs,
+we soon find ourselves frustrated, wishing we could
+evaluate our subsequent modeling attempts
+without the gnawing feeling
+that we cannot trust our numbers.
+Moreover, even a true test set can only tell us *post hoc*
+whether a classifier has in fact generalized to the population,
+not whether we have any reason to expect *a priori*
+that it should generalize.
+
+With these misgivings in mind,
+you might now be sufficiently primed
+to see the appeal of *statistical learning theory*,
+the mathematical subfield of machine learning
+whose practitioners aim to elucidate the
+fundamental principles that explain
+why/when models trained on empirical data
+can/will generalize to unseen data.
+For several decades, one of the primary aims
+of statistical learning researchers
+has been to bound the generalization gap,
+relating it to the properties of the model class
+and the number of samples in the dataset.
+
+Learning theorists aim to bound the difference
+between the *empirical error* $\epsilon_\mathcal{S}(f_\mathcal{S})$
+of a learned classifier $f_\mathcal{S}$,
+both trained and evaluated
+on the training set $\mathcal{S}$,
+and the true error $\epsilon(f_\mathcal{S})$
+of that same classifier on the underlying population.
+This might look similar to the evaluation problem
+that we just addressed but there's a major difference.
+Before, the classifier $f$ was fixed
+and we only needed a dataset
+for evaluative purposes.
+And indeed, any fixed classifier does generalize:
+its error on a (previously unseen) dataset
+is an unbiased estimate of the population error.
+But what can we say when a classifier
+is trained and evaluated on the same dataset?
+Can we ever be confident that the training error
+will be close to the testing error?
+
+
+Suppose that our learned classifier $f_\mathcal{S}$ must be chosen
+among some pre-specified set of functions $\mathcal{F}$.
+Recall from our discussion of test sets
+that while it's easy to estimate
+the error of a single classifier,
+things get hairy when we begin
+to consider collections of classifiers.
+Even if the empirical error
+of any one (fixed) classifier
+will be close to its true error
+with high probability,
+once we consider a collection of classifiers,
+we need to worry about the possibility
+that *just one* classifier in the set
+will receive a badly misestimated error.
+The worry is that if just one classifier
+in our collection receives
+a misleadingly low error
+then we might pick it
+and thereby grossly underestimate
+the population error.
+Moreover, even for linear models,
+because their parameters are continuously valued,
+we are typically choosing among
+an infinite class of functions ($|\mathcal{F}| = \infty$).
+
+One ambitious solution to the problem
+is to develop analytic tools
+for proving uniform convergence, i.e.,
+that with high probability,
+the empirical error rate for every classifier in the class $f\in\mathcal{F}$
+will *simultaneously* converge to its true error rate.
+In other words, we seek a theoretical principle
+that would allow us to state that
+with probability at least $1-\delta$
+(for some small $\delta$)
+no classifier's error rate $\epsilon(f)$
+(among all classifiers in the class $\mathcal{F}$)
+will be misestimated by more
+than some small amount $\alpha$.
+Clearly, we cannot make such statements
+for all model classes $\mathcal{F}$.
+Recall the class of memorization machines
+that always achieve empirical error $0$
+but never outperform random guessing
+on the underlying population.
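+
+In symbols, the uniform convergence statement sought above
+can be sketched as follows
+(writing $\epsilon_\mathcal{S}(f)$ for the empirical error
+of $f$ on the sample $\mathcal{S}$, as in the preceding paragraphs):
+
+$$P\left(\sup_{f \in \mathcal{F}} \left|\epsilon_\mathcal{S}(f) - \epsilon(f)\right| \leq \alpha\right) \geq 1 - \delta.$$
+
+For a finite class, combining Hoeffding's inequality
+with the union bound gives one such guarantee,
+since the failure probability is at most
+$2|\mathcal{F}|\exp(-2n\alpha^2)$.
+This simple argument does not extend to infinite classes.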
+
+In a sense the class of memorizers is too flexible.
+No such uniform convergence result could possibly hold.
+On the other hand, a fixed classifier is useless---it
+generalizes perfectly, but fits neither
+the training data nor the test data.
+The central question of learning
+has thus historically been framed as a tradeoff
+between more flexible (higher variance) model classes
+that better fit the training data but risk overfitting,
+versus more rigid (higher bias) model classes
+that generalize well but risk underfitting.
+A central question in learning theory
+has been to develop the appropriate
+mathematical analysis to quantify
+where a model sits along this spectrum,
+and to provide the associated guarantees.
+
+In a series of seminal papers,
+Vapnik and Chervonenkis extended
+the theory on the convergence
+of relative frequencies
+to more general classes of functions
+:cite:`VapChe64,VapChe68,VapChe71,VapChe74b,VapChe81,VapChe91`.
+One of the key contributions of this line of work
+is the Vapnik-Chervonenkis (VC) dimension,
+which measures (one notion of)
+the complexity (flexibility) of a model class.
+Moreover, one of their key results bounds
+the difference between the empirical error
+and the population error as a function
+of the VC dimension and the number of samples:
+
+$$P\left(R[p, f] - R_\mathrm{emp}[\mathbf{X}, \mathbf{Y}, f] < \alpha\right) \geq 1-\delta
+\ \text{ for }\ \alpha \geq c \sqrt{(\mathrm{VC} - \log \delta)/n}.$$
+
+Here $\delta > 0$ is the probability that the bound is violated,
+$\alpha$ is the upper bound on the generalization gap,
+and $n$ is the dataset size.
+Lastly, $c > 0$ is a constant that depends
+only on the scale of the loss that can be incurred.
+One use of the bound might be to plug in desired
+values of $\delta$ and $\alpha$
+to determine how many samples to collect.
+The VC dimension quantifies the largest
+number of data points for which we can assign
+any arbitrary (binary) labeling
+and for each find some model $f$ in the class
+that agrees with that labeling.
+For example, linear models on $d$-dimensional inputs
+have VC dimension $d+1$.
+It's easy to see that a line can assign
+any possible labeling to three points in two dimensions,
+but not to four.
+Unfortunately, the theory tends to be
+overly pessimistic for more complex models
+and obtaining this guarantee typically requires
+far more examples than are actually required
+to achieve the desired error rate.
+Note also that fixing the model class and $\delta$,
+our error rate again decays
+with the usual $\mathcal{O}(1/\sqrt{n})$ rate.
+It seems unlikely that we could do better in terms of $n$.
+However, as we vary the model class,
+VC dimension can present
+a pessimistic picture
+of the generalization gap.
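+
+As an illustration of the sample-size use mentioned above,
+rearranging the condition on $\alpha$ gives the following
+(treating $c$ as a known constant, whose value the text does not specify):
+
+$$\alpha \geq c\sqrt{\frac{\mathrm{VC} - \log\delta}{n}}
+\iff n \geq \frac{c^2\,(\mathrm{VC} - \log\delta)}{\alpha^2}.$$
+
+For linear models on $d$-dimensional inputs
+this reads $n \geq c^2 (d + 1 - \log\delta)/\alpha^2$,
+so halving $\alpha$ quadruples the required number of samples.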
+
+
+
+
+
+## Summary
+
+The most straightforward way to evaluate a model
+is to consult a test set comprised of previously unseen data.
+Test set evaluations provide an unbiased estimate of the true error
+and converge at the desired $\mathcal{O}(1/\sqrt{n})$ rate as the test set grows.
+We can provide approximate confidence intervals
+based on exact asymptotic distributions
+or valid finite sample confidence intervals
+based on (more conservative) finite sample guarantees.
+Indeed test set evaluation is the bedrock
+of modern machine learning research.
+However, test sets are seldom true test sets
+(they are typically used by multiple researchers again and again).
+Once the same test set is used
+to evaluate multiple models,
+controlling for false discovery can be difficult.
+This can cause huge problems in theory.
+In practice, the significance of the problem
+depends on the size of the holdout sets in question
+and whether they are merely being used to choose hyperparameters
+or if they are leaking information more directly.
+Nevertheless, it's good practice to curate real test sets (or multiple)
+and to be as conservative as possible about how often they are used.
+
+
+Hoping to provide a more satisfying solution,
+statistical learning theorists have developed methods
+for guaranteeing uniform convergence over a model class.
+If indeed every model's empirical error
+converges to its true error simultaneously,
+then we are free to choose the model that performs
+best, minimizing the training error,
+knowing that it too will perform similarly well
+on the holdout data.
+Crucially, any such result must depend
+on some property of the model class.
+Vladimir Vapnik and Alexey Chervonenkis
+introduced the VC dimension,
+presenting uniform convergence results
+that hold for all models in a VC class.
+The training errors for all models in the class
+are (simultaneously) guaranteed
+to be close to their true errors,
+and guaranteed to grow closer
+at $\mathcal{O}(1/\sqrt{n})$ rates.
+Following the revolutionary discovery of VC dimension,
+numerous alternative complexity measures have been proposed,
+each facilitating an analogous generalization guarantee.
+See :citet:`boucheron2005theory` for a detailed discussion
+of several advanced ways of measuring function complexity.
+Unfortunately, while these complexity measures
+have become broadly useful tools in statistical theory,
+they turn out to be powerless
+(as straightforwardly applied)
+for explaining why deep neural networks generalize.
+Deep neural networks often have millions of parameters (or more),
+and can easily assign random labels to large collections of points.
+Nevertheless, they generalize well on practical problems
+and, surprisingly, they often generalize better
+when they are larger and deeper,
+despite incurring larger VC dimensions.
+In the next chapter, we will revisit generalization
+in the context of deep learning.
+
+## Exercises
+
+1. If we wish to estimate the error of a fixed model $f$
+   to within $0.0001$ with probability greater than 99.9%,
+   how many samples do we need?
+1. Suppose that somebody else possesses a labeled test set
+   $\mathcal{D}$ and only makes available the unlabeled inputs (features).
+   Now suppose that you can only access the test set labels
+   by running a model $f$ (no restrictions placed on the model class)
+   on each of the unlabeled inputs
+   and receiving the corresponding error $\epsilon_\mathcal{D}(f)$.
+   How many models would you need to evaluate
+   before you leak the entire test set
+   and thus could appear to have error $0$,
+   regardless of your true error?
+1. What is the VC dimension of the class of $5^\mathrm{th}$-order polynomials?
+1. What is the VC dimension of axis-aligned rectangles on two-dimensional data?
+
+[Discussions](https://discuss.d2l.ai/t/6829)
diff --git a/chapter_linear-classification/image-classification-dataset.md b/chapter_linear-classification/image-classification-dataset.md
new file mode 100644
index 0000000..a3e012a
--- /dev/null
+++ b/chapter_linear-classification/image-classification-dataset.md
@@ -0,0 +1,231 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
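+# The Image Classification Dataset
+:label:`sec_fashion_mnist`
+
+(~~The MNIST dataset is one of the most widely used datasets for image classification, but it is too simple as a benchmark dataset. We will use the similar but more complex Fashion-MNIST dataset.~~)
+
+One of the most widely used datasets for image classification is the [MNIST dataset](https://en.wikipedia.org/wiki/MNIST_database) :cite:`LeCun.Bottou.Bengio.ea.1998` of handwritten digits. At the time of its release in the 1990s it posed a formidable challenge to most machine learning algorithms, consisting of 60,000 images of $28 \times 28$ pixels resolution (plus a test dataset of 10,000 images). To put things into perspective, a Sun SPARCStation 5 with a whopping 64MB of RAM and a blistering 5 MFLOPs was considered state-of-the-art equipment for machine learning at AT&T Bell Laboratories in 1995. Achieving high accuracy on digit recognition was a key component in automating letter sorting for the USPS in the 1990s. Deep networks such as LeNet-5 :cite:`LeCun.Jackel.Bottou.ea.1995`, support vector machines with invariances :cite:`Scholkopf.Burges.Vapnik.1996`, and tangent distance classifiers :cite:`Simard.LeCun.Denker.ea.1998` all achieved error rates below 1%.
+
+For over a decade, MNIST served as *the* point of reference for comparing machine learning algorithms. While it had a good run as a benchmark dataset, even simple models by today's standards achieve classification accuracy over 95%, making it unsuitable for distinguishing between stronger models and weaker ones. Even more so, the dataset allows for *very* high levels of accuracy, not typically seen in many classification problems. This skewed algorithmic development towards specific families of algorithms that can take advantage of clean datasets, such as active set methods and boundary-seeking active set algorithms. Today, MNIST serves more as a sanity check than as a benchmark. ImageNet :cite:`Deng.Dong.Socher.ea.2009` poses a much more relevant challenge. Unfortunately, ImageNet is too large for many of the examples and illustrations in this book, as it would take too long to train to make the examples interactive. As a substitute we will focus our discussion in the coming sections on the qualitatively similar, but much smaller Fashion-MNIST dataset :cite:`Xiao.Rasul.Vollgraf.2017`, which was released in 2017. It contains images of 10 categories of clothing at $28 \times 28$ pixels resolution.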
+
+```{.python .input  n=2}
+%%tab mxnet
+%matplotlib inline
+import time
+from d2l import mxnet as d2l
+from mxnet import gluon, npx
+from mxnet.gluon.data.vision import transforms
+npx.set_np()
+
+d2l.use_svg_display()
+```
+
+```{.python .input  n=3}
+%%tab pytorch
+%matplotlib inline
+import time
+from d2l import torch as d2l
+import torch
+import torchvision
+from torchvision import transforms
+
+d2l.use_svg_display()
+```
+
+```{.python .input  n=4}
+%%tab tensorflow
+%matplotlib inline
+import time
+from d2l import tensorflow as d2l
+import tensorflow as tf
+
+d2l.use_svg_display()
+```
+
+## Loading the Dataset
+
+Since it is such a frequently used dataset, all major frameworks provide preprocessed versions of it. We can [**download and read the Fashion-MNIST dataset into memory using built-in framework functions.**]
+
+```{.python .input  n=5}
+%%tab mxnet
+class FashionMNIST(d2l.DataModule):  #@save
+    def __init__(self, batch_size=64, resize=(28, 28)):
+        super().__init__()
+        self.save_hyperparameters()
+        trans = transforms.Compose([transforms.Resize(resize),
+                                    transforms.ToTensor()])
+        self.train = gluon.data.vision.FashionMNIST(
+            train=True).transform_first(trans)
+        self.val = gluon.data.vision.FashionMNIST(
+            train=False).transform_first(trans)
+```
+
+```{.python .input  n=6}
+%%tab pytorch
+class FashionMNIST(d2l.DataModule):  #@save
+    def __init__(self, batch_size=64, resize=(28, 28)):
+        super().__init__()
+        self.save_hyperparameters()
+        trans = transforms.Compose([transforms.Resize(resize),
+                                    transforms.ToTensor()])
+        self.train = torchvision.datasets.FashionMNIST(
+            root=self.root, train=True, transform=trans, download=True)
+        self.val = torchvision.datasets.FashionMNIST(
+            root=self.root, train=False, transform=trans, download=True)
+```
+
+```{.python .input  n=7}
+%%tab tensorflow
+class FashionMNIST(d2l.DataModule):  #@save
+    def __init__(self, batch_size=64, resize=(28, 28)):
+        super().__init__()
+        self.save_hyperparameters()
+        self.train, self.val = tf.keras.datasets.fashion_mnist.load_data()
+```
+
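+Fashion-MNIST consists of images from 10 categories, each represented by 6,000 images in the training dataset and by 1,000 in the test dataset. A *test dataset* is used for evaluating model performance (it must not be used for training). Consequently the training set and the test set contain 60,000 and 10,000 images, respectively.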
+
+```{.python .input  n=8}
+%%tab mxnet, pytorch
+data = FashionMNIST(resize=(32, 32))
+len(data.train), len(data.val)
+```
+
+```{.python .input  n=9}
+%%tab tensorflow
+data = FashionMNIST(resize=(32, 32))
+len(data.train[0]), len(data.val[0])
+```
+
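+The images are grayscale and upscaled to $32 \times 32$ pixels in resolution above. This is similar to the original MNIST dataset, which consisted of (binary) black and white images. Note, though, that most modern image data has 3 channels (red, green, blue) and that hyperspectral images can have in excess of 100 channels (the HyMap sensor has 126 channels). By convention we store an image as a $c \times h \times w$ tensor, where $c$ is the number of color channels, $h$ is the height and $w$ is the width.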
+
+```{.python .input  n=10}
+%%tab all
+data.train[0][0].shape
+```
+
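+[~~Two utility functions to visualize the dataset~~]
+
+The categories of Fashion-MNIST have human-understandable names. The following convenience function converts between numeric labels and their names.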
+
+```{.python .input  n=11}
+%%tab all
+@d2l.add_to_class(FashionMNIST)  #@save
+def text_labels(self, indices):
+    """Return text labels."""
+    labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
+              'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
+    return [labels[int(i)] for i in indices]
+```
+
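+## Reading a Minibatch
+
+To make our life easier when reading from the training and test sets, we use the built-in data iterator rather than creating one from scratch. Recall that at each iteration, a data iterator [**reads a minibatch of data with size `batch_size`.**] We also randomly shuffle the examples for the training data iterator.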
+
+```{.python .input  n=12}
+%%tab mxnet
+@d2l.add_to_class(FashionMNIST)  #@save
+def get_dataloader(self, train):
+    data = self.train if train else self.val
+    return gluon.data.DataLoader(data, self.batch_size, shuffle=train,
+                                 num_workers=self.num_workers)
+```
+
+```{.python .input  n=13}
+%%tab pytorch
+@d2l.add_to_class(FashionMNIST)  #@save
+def get_dataloader(self, train):
+    data = self.train if train else self.val
+    return torch.utils.data.DataLoader(data, self.batch_size, shuffle=train,
+                                       num_workers=self.num_workers)
+```
+
+```{.python .input  n=14}
+%%tab tensorflow
+@d2l.add_to_class(FashionMNIST)  #@save
+def get_dataloader(self, train):
+    data = self.train if train else self.val
+    process = lambda X, y: (tf.expand_dims(X, axis=3) / 255,
+                            tf.cast(y, dtype='int32'))
+    resize_fn = lambda X, y: (tf.image.resize_with_pad(X, *self.resize), y)
+    shuffle_buf = len(data[0]) if train else 1
+    return tf.data.Dataset.from_tensor_slices(process(*data)).batch(
+        self.batch_size).map(resize_fn).shuffle(shuffle_buf)
+```
+
+To see how this works, let's load a minibatch of images by invoking the newly added `train_dataloader` method. It contains 64 images.
+
+```{.python .input  n=15}
+%%tab all
+X, y = next(iter(data.train_dataloader()))
+print(X.shape, X.dtype, y.shape, y.dtype)
+```
+
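+Let's look at the time it takes to read the images. Even though it is a built-in loader, it is not blazingly fast. Nonetheless, this is sufficient since processing images with a deep network takes quite a bit longer. Hence it is good enough that training a network will not be IO constrained.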
+
+```{.python .input  n=16}
+%%tab all
+tic = time.time()
+for X, y in data.train_dataloader():
+    continue
+f'{time.time() - tic:.2f} sec'
+```
+
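+## Visualization
+
+We will be using the Fashion-MNIST dataset quite frequently. A convenience function `show_images` can be used to visualize the images and the associated labels. Details of its implementation are deferred to the appendix.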
+
+```{.python .input  n=17}
+%%tab all
+def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):  #@save
+    """Plot a list of images."""
+    raise NotImplementedError
+```
+
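+Let's put it to good use. In general, it is a good idea to visualize and inspect data that you are training on. Humans are very good at spotting unusual aspects and as such, visualization serves as an additional safeguard against mistakes and errors in the design of experiments. Here are [**the images and their corresponding labels**] (in text) for the first few examples in the training dataset.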
+
+```{.python .input  n=18}
+%%tab all
+@d2l.add_to_class(FashionMNIST)  #@save
+def visualize(self, batch, nrows=1, ncols=8, labels=[]):
+    X, y = batch
+    if not labels:
+        labels = self.text_labels(y)
+    if tab.selected('mxnet') or tab.selected('pytorch'):
+        d2l.show_images(X.squeeze(1), nrows, ncols, titles=labels)
+    if tab.selected('tensorflow'):
+        d2l.show_images(tf.squeeze(X), nrows, ncols, titles=labels)
+
+batch = next(iter(data.val_dataloader()))
+data.visualize(batch)
+```
+
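+We are now ready to work with the Fashion-MNIST dataset in the sections that follow.
+
+## Summary
+
+We now have a slightly more realistic dataset to use for classification. Fashion-MNIST is an apparel classification dataset consisting of images representing 10 categories. We will use this dataset in subsequent sections and chapters to evaluate various network designs, from a simple linear model to advanced residual networks. As we commonly do with images, we read them as a tensor of shape (batch size, number of channels, height, width). For now, we only have one channel as the images are grayscale (the visualization above uses a false color palette for improved visibility).
+
+Lastly, data iterators are a key component for efficient performance. For instance, we might use GPUs for efficient image decompression, video transcoding, or other preprocessing. Whenever possible, you should rely on well-implemented data iterators that exploit high-performance computing to avoid slowing down your training loop.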
+
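+## Exercises
+
+1. Does reducing the `batch_size` (for instance, to 1) affect the reading performance?
+1. The data iterator performance is important. Do you think the current implementation is fast enough? Explore various options to improve it. Use a system profiler to find out where the bottlenecks are.
+1. Check out the framework's online API documentation. Which other datasets are available?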
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/48)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/49)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/224)
+:end_tab:
diff --git a/chapter_linear-classification/image-classification-dataset_origin.md b/chapter_linear-classification/image-classification-dataset_origin.md
new file mode 100644
index 0000000..0044f5a
--- /dev/null
+++ b/chapter_linear-classification/image-classification-dataset_origin.md
@@ -0,0 +1,250 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# The Image Classification Dataset
+:label:`sec_fashion_mnist`
+
+(~~The MNIST dataset is one of the most widely used datasets for image classification, but it is too simple as a benchmark dataset. We will use the similar but more complex Fashion-MNIST dataset.~~)
+
+One of the most widely used datasets for image classification is the [MNIST dataset](https://en.wikipedia.org/wiki/MNIST_database) :cite:`LeCun.Bottou.Bengio.ea.1998` of handwritten digits. At the time of its release in the 1990s it posed a formidable challenge to most machine learning algorithms, consisting of 60,000 images of $28 \times 28$ pixels resolution (plus a test dataset of 10,000 images). To put things into perspective, a Sun SPARCStation 5 with a whopping 64MB of RAM and a blistering 5 MFLOPs was considered state-of-the-art equipment for machine learning at AT&T Bell Laboratories in 1995. Achieving high accuracy on digit recognition was a key component in automating letter sorting for the USPS in the 1990s. Deep networks such as LeNet-5 :cite:`LeCun.Jackel.Bottou.ea.1995`, support vector machines with invariances :cite:`Scholkopf.Burges.Vapnik.1996`, and tangent distance classifiers :cite:`Simard.LeCun.Denker.ea.1998` all achieved error rates below 1%.
+
+For over a decade, MNIST served as *the* point of reference for comparing machine learning algorithms. 
+While it had a good run as a benchmark dataset,
+even simple models by today's standards achieve classification accuracy over 95%,
+making it unsuitable for distinguishing between stronger models and weaker ones. Even more so, the dataset allows for *very* high levels of accuracy, not typically seen in many classification problems. This skewed algorithmic development towards specific families of algorithms that can take advantage of clean datasets, such as active set methods and boundary-seeking active set algorithms.
+Today, MNIST serves more as a sanity check than as a benchmark. ImageNet :cite:`Deng.Dong.Socher.ea.2009` poses a much
+more relevant challenge. Unfortunately, ImageNet is too large for many of the examples and illustrations in this book, as it would take too long to train to make the examples interactive. As a substitute we will focus our discussion in the coming sections on the qualitatively similar, but much smaller Fashion-MNIST
+dataset :cite:`Xiao.Rasul.Vollgraf.2017`, which was released in 2017. It contains images of 10 categories of clothing at $28 \times 28$ pixels resolution.
+
+```{.python .input  n=2}
+%%tab mxnet
+%matplotlib inline
+import time
+from d2l import mxnet as d2l
+from mxnet import gluon, npx
+from mxnet.gluon.data.vision import transforms
+npx.set_np()
+
+d2l.use_svg_display()
+```
+
+```{.python .input  n=3}
+%%tab pytorch
+%matplotlib inline
+import time
+from d2l import torch as d2l
+import torch
+import torchvision
+from torchvision import transforms
+
+d2l.use_svg_display()
+```
+
+```{.python .input  n=4}
+%%tab tensorflow
+%matplotlib inline
+import time
+from d2l import tensorflow as d2l
+import tensorflow as tf
+
+d2l.use_svg_display()
+```
+
+## Loading the Dataset
+
+Since it is such a frequently used dataset, all major frameworks provide preprocessed versions of it. We can [**download and read the Fashion-MNIST dataset into memory using built-in framework functions.**]
+
+```{.python .input  n=5}
+%%tab mxnet
+class FashionMNIST(d2l.DataModule):  #@save
+    def __init__(self, batch_size=64, resize=(28, 28)):
+        super().__init__()
+        self.save_hyperparameters()
+        trans = transforms.Compose([transforms.Resize(resize),
+                                    transforms.ToTensor()])
+        self.train = gluon.data.vision.FashionMNIST(
+            train=True).transform_first(trans)
+        self.val = gluon.data.vision.FashionMNIST(
+            train=False).transform_first(trans)
+```
+
+```{.python .input  n=6}
+%%tab pytorch
+class FashionMNIST(d2l.DataModule):  #@save
+    def __init__(self, batch_size=64, resize=(28, 28)):
+        super().__init__()
+        self.save_hyperparameters()
+        trans = transforms.Compose([transforms.Resize(resize),
+                                    transforms.ToTensor()])
+        self.train = torchvision.datasets.FashionMNIST(
+            root=self.root, train=True, transform=trans, download=True)
+        self.val = torchvision.datasets.FashionMNIST(
+            root=self.root, train=False, transform=trans, download=True)
+```
+
+```{.python .input  n=7}
+%%tab tensorflow
+class FashionMNIST(d2l.DataModule):  #@save
+    def __init__(self, batch_size=64, resize=(28, 28)):
+        super().__init__()
+        self.save_hyperparameters()
+        self.train, self.val = tf.keras.datasets.fashion_mnist.load_data()
+```
+
+Fashion-MNIST consists of images from 10 categories, each represented
+by 6,000 images in the training dataset and by 1,000 in the test dataset.
+A *test dataset* is used for evaluating model performance (it must not be used for training).
+Consequently the training set and the test set
+contain 60,000 and 10,000 images, respectively.
+
+```{.python .input  n=8}
+%%tab mxnet, pytorch
+data = FashionMNIST(resize=(32, 32))
+len(data.train), len(data.val)
+```
+
+```{.python .input  n=9}
+%%tab tensorflow
+data = FashionMNIST(resize=(32, 32))
+len(data.train[0]), len(data.val[0])
+```
+
+The images are grayscale and upscaled to $32 \times 32$ pixels in resolution above. This is similar to the original MNIST dataset, which consisted of (binary) black and white images. Note, though, that most modern image data has 3 channels (red, green, blue) and that hyperspectral images can have in excess of 100 channels (the HyMap sensor has 126 channels).
+By convention we store an image as a $c \times h \times w$ tensor, where $c$ is the number of color channels, $h$ is the height and $w$ is the width.
+
+```{.python .input  n=10}
+%%tab all
+data.train[0][0].shape
+```
+
+[~~Two utility functions to visualize the dataset~~]
+
+The categories of Fashion-MNIST have human-understandable names. 
+The following convenience function converts between numeric labels and their names.
+
+```{.python .input  n=11}
+%%tab all
+@d2l.add_to_class(FashionMNIST)  #@save
+def text_labels(self, indices):
+    """Return text labels."""
+    labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
+              'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
+    return [labels[int(i)] for i in indices]
+```
+
+## Reading a Minibatch
+
+To make our life easier when reading from the training and test sets,
+we use the built-in data iterator rather than creating one from scratch.
+Recall that at each iteration, a data iterator
+[**reads a minibatch of data with size `batch_size`.**]
+We also randomly shuffle the examples for the training data iterator.
+
+```{.python .input  n=12}
+%%tab mxnet
+@d2l.add_to_class(FashionMNIST)  #@save
+def get_dataloader(self, train):
+    data = self.train if train else self.val
+    return gluon.data.DataLoader(data, self.batch_size, shuffle=train,
+                                 num_workers=self.num_workers)
+```
+
+```{.python .input  n=13}
+%%tab pytorch
+@d2l.add_to_class(FashionMNIST)  #@save
+def get_dataloader(self, train):
+    data = self.train if train else self.val
+    return torch.utils.data.DataLoader(data, self.batch_size, shuffle=train,
+                                       num_workers=self.num_workers)
+```
+
+```{.python .input  n=14}
+%%tab tensorflow
+@d2l.add_to_class(FashionMNIST)  #@save
+def get_dataloader(self, train):
+    data = self.train if train else self.val
+    process = lambda X, y: (tf.expand_dims(X, axis=3) / 255,
+                            tf.cast(y, dtype='int32'))
+    resize_fn = lambda X, y: (tf.image.resize_with_pad(X, *self.resize), y)
+    shuffle_buf = len(data[0]) if train else 1
+    return tf.data.Dataset.from_tensor_slices(process(*data)).batch(
+        self.batch_size).map(resize_fn).shuffle(shuffle_buf)
+```
+
+To see how this works, let's load a minibatch of images by invoking the `train_dataloader` method, which relies on the `get_dataloader` method we just added. It contains 64 images.
+
+```{.python .input  n=15}
+%%tab all
+X, y = next(iter(data.train_dataloader()))
+print(X.shape, X.dtype, y.shape, y.dtype)
+```
+
+Let's look at the time it takes to read the images. Even though it's a built-in loader, it isn't blazingly fast. Nonetheless, this is sufficient since processing images with a deep network takes quite a bit longer. Hence it's good enough that training a network won't be IO constrained.
+
+```{.python .input  n=16}
+%%tab all
+tic = time.time()
+for X, y in data.train_dataloader():
+    continue
+f'{time.time() - tic:.2f} sec'
+```
+
+## Visualization
+
+We'll be using the Fashion-MNIST dataset quite frequently. A convenience function `show_images` can be used to visualize the images and the associated labels. Details of its implementation are deferred to the appendix.
+
+```{.python .input  n=17}
+%%tab all
+def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):  #@save
+    """Plot a list of images."""
+    raise NotImplementedError
+```
+
+Let's put it to good use. In general, it is a good idea to visualize and inspect data that you're training on. 
+Humans are very good at spotting unusual aspects and as such, visualization serves as an additional safeguard against mistakes and errors in the design of experiments. Here are [**the images and their corresponding labels**] (in text)
+for the first few examples in the training dataset.
+
+```{.python .input  n=18}
+%%tab all
+@d2l.add_to_class(FashionMNIST)  #@save
+def visualize(self, batch, nrows=1, ncols=8, labels=[]):
+    X, y = batch
+    if not labels:
+        labels = self.text_labels(y)
+    if tab.selected('mxnet') or tab.selected('pytorch'):
+        d2l.show_images(X.squeeze(1), nrows, ncols, titles=labels)
+    if tab.selected('tensorflow'):
+        d2l.show_images(tf.squeeze(X), nrows, ncols, titles=labels)
+
+batch = next(iter(data.val_dataloader()))
+data.visualize(batch)
+```
+
+We are now ready to work with the Fashion-MNIST dataset in the sections that follow.
+
+## Summary
+
+We now have a slightly more realistic dataset to use for classification. Fashion-MNIST is an apparel classification dataset consisting of images representing 10 categories. We will use this dataset in subsequent sections and chapters to evaluate various network designs, from a simple linear model to advanced residual networks. As we commonly do with images, we read them as a tensor of shape (batch size, number of channels, height, width). For now, we only have one channel as the images are grayscale (the visualization above uses a false color palette for improved visibility).
+
+Lastly, data iterators are a key component for efficient performance. For instance, we might use GPUs for efficient image decompression, video transcoding, or other preprocessing. Whenever possible, you should rely on well-implemented data iterators that exploit high-performance computing to avoid slowing down your training loop.
+
+
+## Exercises
+
+1. Does reducing the `batch_size` (for instance, to 1) affect the reading performance?
+1. The data iterator performance is important. Do you think the current implementation is fast enough? Explore various options to improve it. Use a system profiler to find out where the bottlenecks are.
+1. Check out the framework's online API documentation. Which other datasets are available?
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/48)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/49)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/224)
+:end_tab:
diff --git a/chapter_linear-classification/index.md b/chapter_linear-classification/index.md
new file mode 100644
index 0000000..189a336
--- /dev/null
+++ b/chapter_linear-classification/index.md
@@ -0,0 +1,16 @@
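+# Linear Neural Networks for Classification
+:label:`chap_classification`
+
+Now that you have worked through all of the mechanics, you are ready to apply these skills to broader kinds of tasks. Even as we pivot towards classification, most of the plumbing remains the same: loading the data, passing it through the model, generating output, calculating the loss, taking gradients with respect to weights, and updating the model. However, the precise form of the targets, the parameterization of the output layer, and the choice of loss function will adapt to suit the *classification* setting.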
+
+```toc
+:maxdepth: 2
+
+softmax-regression
+image-classification-dataset
+classification
+softmax-regression-scratch
+softmax-regression-concise
+generalization-classification
+environment-and-distribution-shift
+```
diff --git a/chapter_linear-classification/index_origin.md b/chapter_linear-classification/index_origin.md
new file mode 100644
index 0000000..4c35f09
--- /dev/null
+++ b/chapter_linear-classification/index_origin.md
@@ -0,0 +1,28 @@
+# Linear Neural Networks for Classification
+:label:`chap_classification`
+
+Now that you have worked through all of the mechanics
+you are ready to apply these skills to broader kinds of tasks.
+Even as we pivot towards classification,
+most of the plumbing remains the same:
+loading the data, passing it through the model,
+generating output, calculating the loss,
+taking gradients with respect to weights,
+and updating the model.
+However, the precise form of the targets,
+the parameterization of the output layer,
+and the choice of loss function will adapt
+to suit the *classification* setting.
+
+```toc
+:maxdepth: 2
+
+softmax-regression
+image-classification-dataset
+classification
+softmax-regression-scratch
+softmax-regression-concise
+generalization-classification
+environment-and-distribution-shift
+```
+
diff --git a/chapter_linear-classification/softmax-regression-concise.md b/chapter_linear-classification/softmax-regression-concise.md
new file mode 100644
index 0000000..57c0e42
--- /dev/null
+++ b/chapter_linear-classification/softmax-regression-concise.md
@@ -0,0 +1,151 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
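+# Concise Implementation of Softmax Regression
+:label:`sec_softmax_concise`
+
+Just as high-level deep learning frameworks made it easier to implement linear regression (see :numref:`sec_linear_concise`), they are similarly convenient here.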
+
+```{.python .input}
+%%tab mxnet
+from d2l import mxnet as d2l
+from mxnet import gluon, init, npx
+from mxnet.gluon import nn
+npx.set_np()
+```
+
+```{.python .input}
+%%tab pytorch
+from d2l import torch as d2l
+import torch
+from torch import nn
+from torch.nn import functional as F
+```
+
+```{.python .input}
+%%tab tensorflow
+from d2l import tensorflow as d2l
+import tensorflow as tf
+```
+
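+## Defining the Model
+
+As in :numref:`sec_linear_concise`, we construct our fully connected layer using the built-in layer. The built-in `__call__` method then invokes `forward` whenever we need to apply the network to some input.
+
+:begin_tab:`mxnet`
+Even though the input `X` is a 4th order tensor, the built-in `Dense` layer will automatically convert `X` into a 2nd order tensor by keeping the dimensionality along the first axis unchanged.
+:end_tab:
+
+:begin_tab:`pytorch`
+We use a `Flatten` layer to convert the 4th order tensor `X` to 2nd order by keeping the dimensionality along the first axis unchanged.
+:end_tab:
+
+:begin_tab:`tensorflow`
+We use a `Flatten` layer to convert the 4th order tensor `X` to 2nd order by keeping the dimensionality along the first axis unchanged.
+:end_tab: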
+
+```{.python .input}
+%%tab all
+class SoftmaxRegression(d2l.Classifier):
+    def __init__(self, num_outputs, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        if tab.selected('mxnet'):
+            self.net = nn.Dense(num_outputs)
+            self.net.initialize()
+        if tab.selected('pytorch'):
+            self.net = nn.Sequential(nn.Flatten(),
+                                     nn.LazyLinear(num_outputs))
+        if tab.selected('tensorflow'):
+            self.net = tf.keras.models.Sequential()
+            self.net.add(tf.keras.layers.Flatten())
+            self.net.add(tf.keras.layers.Dense(num_outputs))
+
+    def forward(self, X):
+        return self.net(X)
+```
+
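+## Softmax Revisited
+:label:`subsec_softmax-implementation-revisited`
+
+In :numref:`sec_softmax_scratch` we calculated our model's output and applied the cross-entropy loss. While this is perfectly reasonable mathematically, it is risky computationally, due to numerical underflow and overflow in the exponentiation.
+
+Recall that the softmax function computes probabilities via $\hat y_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}$. If some of the $o_k$ are very large, i.e., very positive, then $\exp(o_k)$ might be larger than the largest number we can have for certain data types. This is called *overflow*. Likewise, if all arguments are very negative, we will get *underflow*. For instance, single precision floating point numbers approximately cover the range of $10^{-38}$ to $10^{38}$. As such, if the largest term in $\mathbf{o}$ lies outside the interval $[-90, 90]$, the result will not be stable. A solution to this problem is to subtract $\bar{o} \stackrel{\mathrm{def}}{=} \max_k o_k$ from all entries: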
+
+$$
+\hat y_j = \frac{\exp o_j}{\sum_k \exp o_k} =
+\frac{\exp(o_j - \bar{o}) \exp \bar{o}}{\sum_k \exp (o_k - \bar{o}) \exp \bar{o}} =
+\frac{\exp(o_j - \bar{o})}{\sum_k \exp (o_k - \bar{o})}.
+$$
+
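+By construction we know that $o_j - \bar{o} \leq 0$ for all $j$. As such, for a $q$-class classification problem, the denominator is contained in the interval $[1, q]$. Moreover, the numerator never exceeds $1$, thus preventing numerical overflow. Numerical underflow only occurs when $\exp(o_j - \bar{o})$ numerically evaluates as $0$. Nonetheless, a few steps down the road we might find ourselves in trouble when we want to compute $\log \hat{y}_j$ as $\log 0$. In particular, in backpropagation, we might find ourselves faced with a screenful of the dreaded `NaN` (Not a Number) results.
+
+Fortunately, we are saved by the fact that even though we are computing exponential functions, we ultimately intend to take their log (when calculating the cross-entropy loss). By combining softmax and cross-entropy, we can escape the numerical stability issues altogether. We have: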
+
+$$
+\log \hat{y}_j =
+\log \frac{\exp(o_j - \bar{o})}{\sum_k \exp (o_k - \bar{o})} =
+o_j - \bar{o} - \log \sum_k \exp (o_k - \bar{o}).
+$$
+
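+This avoids both overflow and underflow. We will want to keep the conventional softmax function handy in case we ever want to evaluate the output probabilities by our model. But instead of passing softmax probabilities into our new loss function, we just [**pass the logits and compute the softmax and its log all at once inside the cross-entropy loss function,**] which does smart things like the ["LogSumExp trick"](https://en.wikipedia.org/wiki/LogSumExp).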
+
+```{.python .input  n=3}
+%%tab all
+@d2l.add_to_class(d2l.Classifier)  #@save
+def loss(self, Y_hat, Y, averaged=True):
+    Y_hat = d2l.reshape(Y_hat, (-1, Y_hat.shape[-1]))
+    Y = d2l.reshape(Y, (-1,))
+    if tab.selected('mxnet'):
+        fn = gluon.loss.SoftmaxCrossEntropyLoss()
+        l = fn(Y_hat, Y)
+        return l.mean() if averaged else l
+    if tab.selected('pytorch'):
+        return F.cross_entropy(
+            Y_hat, Y, reduction='mean' if averaged else 'none')
+    if tab.selected('tensorflow'):
+        fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
+        return fn(Y, Y_hat)
+```
+
+## Training
+
+Next we train our model. As before, we use Fashion-MNIST images, flattened to 784-dimensional feature vectors.
+
+```{.python .input}
+%%tab all
+data = d2l.FashionMNIST(batch_size=256)
+model = SoftmaxRegression(num_outputs=10, lr=0.1)
+trainer = d2l.Trainer(max_epochs=10)
+trainer.fit(model, data)
+```
+
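+As before, this algorithm converges to a solution that achieves a decent accuracy, albeit this time with fewer lines of code than before.
+
+## Summary
+
+High-level APIs are very convenient at hiding potentially dangerous aspects from their user, such as numerical stability. Moreover, they allow users to design models concisely with very few lines of code. This is both a blessing and a curse. The obvious benefit is that it makes things highly accessible, even to engineers who never took a single class of statistics in their life (in fact, this is one of the target audiences of the book). But hiding the sharp edges also comes with a price: a disincentive to add new and different components on your own, since there's little muscle memory for doing it. Moreover, it makes it more difficult to *fix* things whenever the protective padding of a framework fails to cover all the corner cases entirely. Again, this is due to lack of familiarity.
+
+As such, we strongly urge you to review *both* the bare bones and the elegant versions of many of the implementations that follow. While we emphasize ease of understanding, the implementations are nonetheless usually quite performant (convolutions are the big exception here). It is our intention to allow you to build on these when you invent something new that no framework can give you.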
+
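+## Exercises
+
+1. Deep learning uses many different number formats, including FP64 double precision (used extremely rarely), FP32 single precision, BFLOAT16 (good for compressed representations), FP16 (very unstable), TF32 (a new format from NVIDIA), and INT8. Compute the smallest and largest argument of the exponential function for which the result does not lead to a numerical underflow or overflow.
+1. INT8 is a very limited format with nonzero numbers from $1$ to $255$. How could you extend its dynamic range without using more bits? Do standard multiplication and addition still work?
+1. Increase the number of epochs for training. Why might the validation accuracy decrease after a while? How could we fix this?
+1. What happens as you increase the learning rate? Compare the loss curves for several learning rates. Which one works better? When?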
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/52)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/53)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/260)
+:end_tab:
diff --git a/chapter_linear-classification/softmax-regression-concise_origin.md b/chapter_linear-classification/softmax-regression-concise_origin.md
new file mode 100644
index 0000000..aad6136
--- /dev/null
+++ b/chapter_linear-classification/softmax-regression-concise_origin.md
@@ -0,0 +1,203 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# Concise Implementation of Softmax Regression
+:label:`sec_softmax_concise`
+
+
+
+Just as high-level deep learning frameworks
+made it easier to implement linear regression
+(see :numref:`sec_linear_concise`),
+they are similarly convenient here.
+
+```{.python .input}
+%%tab mxnet
+from d2l import mxnet as d2l
+from mxnet import gluon, init, npx
+from mxnet.gluon import nn
+npx.set_np()
+```
+
+```{.python .input}
+%%tab pytorch
+from d2l import torch as d2l
+import torch
+from torch import nn
+from torch.nn import functional as F
+```
+
+```{.python .input}
+%%tab tensorflow
+from d2l import tensorflow as d2l
+import tensorflow as tf
+```
+
+## Defining the Model
+
+As in :numref:`sec_linear_concise`, 
+we construct our fully connected layer 
+using the built-in layer. 
+The built-in `__call__` method then invokes `forward` 
+whenever we need to apply the network to some input.
+
+:begin_tab:`mxnet`
+Even though the input `X` is a 4th order tensor, 
+the built-in `Dense` layer 
+will automatically convert `X` into a 2nd order tensor 
+by keeping the dimensionality along the first axis unchanged.
+:end_tab:
+
+:begin_tab:`pytorch`
+We use a `Flatten` layer to convert the 4th order tensor `X` to 2nd order 
+by keeping the dimensionality along the first axis unchanged.
+
+:end_tab:
+
+:begin_tab:`tensorflow`
+We use a `Flatten` layer to convert the 4th order tensor `X` to 2nd order
+by keeping the dimension along the first axis unchanged.
+:end_tab:
+
+```{.python .input}
+%%tab all
+class SoftmaxRegression(d2l.Classifier):
+    def __init__(self, num_outputs, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        if tab.selected('mxnet'):
+            self.net = nn.Dense(num_outputs)
+            self.net.initialize()
+        if tab.selected('pytorch'):
+            self.net = nn.Sequential(nn.Flatten(),
+                                     nn.LazyLinear(num_outputs))
+        if tab.selected('tensorflow'):
+            self.net = tf.keras.models.Sequential()
+            self.net.add(tf.keras.layers.Flatten())
+            self.net.add(tf.keras.layers.Dense(num_outputs))
+
+    def forward(self, X):
+        return self.net(X)
+```
+
+## Softmax Revisited
+:label:`subsec_softmax-implementation-revisited`
+
+In :numref:`sec_softmax_scratch` we calculated our model's output
+and applied the cross-entropy loss. While this is perfectly
+reasonable mathematically, it is risky computationally, due to
+numerical underflow and overflow in the exponentiation.
+
+Recall that the softmax function computes probabilities via
+$\hat y_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}$.
+If some of the $o_k$ are very large, i.e., very positive,
+then $\exp(o_k)$ might be larger than the largest number
+we can have for certain data types. This is called *overflow*. Likewise,
+if all arguments are very negative, we will get *underflow*.
+For instance, single precision floating point numbers approximately
+cover the range of $10^{-38}$ to $10^{38}$. As such, if the largest term in $\mathbf{o}$
+lies outside the interval $[-90, 90]$, the result will not be stable.
+A solution to this problem is to subtract $\bar{o} \stackrel{\mathrm{def}}{=} \max_k o_k$ from
+all entries:
+
+$$
+\hat y_j = \frac{\exp o_j}{\sum_k \exp o_k} =
+\frac{\exp(o_j - \bar{o}) \exp \bar{o}}{\sum_k \exp (o_k - \bar{o}) \exp \bar{o}} =
+\frac{\exp(o_j - \bar{o})}{\sum_k \exp (o_k - \bar{o})}.
+$$
+
+By construction we know that $o_j - \bar{o} \leq 0$ for all $j$. As such, for a $q$-class
+classification problem, the denominator is contained in the interval $[1, q]$. Moreover, the
+numerator never exceeds $1$, thus preventing numerical overflow. Numerical underflow only
+occurs when $\exp(o_j - \bar{o})$ numerically evaluates as $0$. Nonetheless, a few steps down
+the road we might find ourselves in trouble when we want to compute $\log \hat{y}_j$ as $\log 0$.
+In particular, in backpropagation,
+we might find ourselves faced with a screenful
+of the dreaded `NaN` (Not a Number) results.
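+
+As a rough worked example (our own illustration, with hypothetical logits), the largest finite single precision number is about $3.4 \times 10^{38}$, so $\exp(o)$ overflows once $o > \log(3.4 \times 10^{38}) \approx 88.7$. Take $\mathbf{o} = (100, 101, 102)$: every $\exp(o_k)$ overflows in single precision, yet after subtracting $\bar{o} = 102$ all terms are well behaved (values rounded to four decimals):
+
+$$
+\begin{aligned}
+\exp(\mathbf{o} - \bar{o}) &= (e^{-2}, e^{-1}, e^{0}) \approx (0.1353, 0.3679, 1), \\
+\sum_k \exp(o_k - \bar{o}) &\approx 1.5032, \\
+\hat{\mathbf{y}} &\approx (0.0900, 0.2447, 0.6652).
+\end{aligned}
+$$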
+
+Fortunately, we are saved by the fact that
+even though we are computing exponential functions,
+we ultimately intend to take their log
+(when calculating the cross-entropy loss).
+By combining softmax and cross-entropy,
+we can escape the numerical stability issues altogether. We have:
+
+$$
+\log \hat{y}_j =
+\log \frac{\exp(o_j - \bar{o})}{\sum_k \exp (o_k - \bar{o})} =
+o_j - \bar{o} - \log \sum_k \exp (o_k - \bar{o}).
+$$
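+
+To illustrate the identity, consider a hypothetical two-class example with $\mathbf{o} = (0, -200)$ (numbers chosen purely for illustration). Here $\bar{o} = 0$ and $\exp(-200) \approx 1.4 \times 10^{-87}$, which rounds to $0$ in single precision, so computing $\hat{y}_2$ first and then taking its logarithm would yield $\log 0$. The combined expression stays finite:
+
+$$
+\log \hat{y}_2 = o_2 - \bar{o} - \log \sum_k \exp(o_k - \bar{o})
+= -200 - 0 - \log\left(1 + \exp(-200)\right) \approx -200.
+$$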
+
+This avoids both overflow and underflow.
+We will want to keep the conventional softmax function handy
+in case we ever want to evaluate the output probabilities by our model.
+But instead of passing softmax probabilities into our new loss function,
+we just
+[**pass the logits and compute the softmax and its log
+all at once inside the cross-entropy loss function,**]
+which does smart things like the ["LogSumExp trick"](https://en.wikipedia.org/wiki/LogSumExp).
+
+```{.python .input  n=3}
+%%tab all
+@d2l.add_to_class(d2l.Classifier)  #@save
+def loss(self, Y_hat, Y, averaged=True):
+    Y_hat = d2l.reshape(Y_hat, (-1, Y_hat.shape[-1]))
+    Y = d2l.reshape(Y, (-1,))
+    if tab.selected('mxnet'):
+        fn = gluon.loss.SoftmaxCrossEntropyLoss()
+        l = fn(Y_hat, Y)
+        return l.mean() if averaged else l
+    if tab.selected('pytorch'):
+        return F.cross_entropy(
+            Y_hat, Y, reduction='mean' if averaged else 'none')
+    if tab.selected('tensorflow'):
+        fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
+        return fn(Y, Y_hat)
+```
+
+## Training
+
+Next we train our model. As before, we use Fashion-MNIST images, flattened to 784-dimensional feature vectors.
+
+```{.python .input}
+%%tab all
+data = d2l.FashionMNIST(batch_size=256)
+model = SoftmaxRegression(num_outputs=10, lr=0.1)
+trainer = d2l.Trainer(max_epochs=10)
+trainer.fit(model, data)
+```
+
+As before, this algorithm converges to a solution
+that achieves a decent accuracy,
+albeit this time with fewer lines of code than before.
+
+
+## Summary
+
+High-level APIs are very convenient at hiding potentially dangerous aspects from their user, such as numerical stability. Moreover, they allow users to design models concisely with very few lines of code. This is both a blessing and a curse. The obvious benefit is that it makes things highly accessible, even to engineers who never took a single class of statistics in their life (in fact, this is one of the target audiences of the book). But hiding the sharp edges also comes with a price: a disincentive to add new and different components on your own, since there's little muscle memory for doing it. Moreover, it makes it more difficult to *fix* things whenever the protective padding of
+a framework fails to cover all the corner cases entirely. Again, this is due to lack of familiarity.
+
+As such, we strongly urge you to review *both* the bare bones and the elegant versions of many of the implementations that follow. While we emphasize ease of understanding, the implementations are nonetheless usually quite performant (convolutions are the big exception here). It is our intention to allow you to build on these when you invent something new that no framework can give you.
+
+
+## Exercises
+
+1. Deep learning uses many different number formats, including FP64 double precision (used extremely rarely),
+FP32 single precision, BFLOAT16 (good for compressed representations), FP16 (very unstable), TF32 (a new format from NVIDIA), and INT8. Compute the smallest and largest argument of the exponential function for which the result does not lead to a numerical underflow or overflow.
+1. INT8 is a very limited format with nonzero numbers from $1$ to $255$. How could you extend its dynamic range without using more bits? Do standard multiplication and addition still work?
+1. Increase the number of epochs for training. Why might the validation accuracy decrease after a while? How could we fix this?
+1. What happens as you increase the learning rate? Compare the loss curves for several learning rates. Which one works better? When?
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/52)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/53)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/260)
+:end_tab:
diff --git a/chapter_linear-classification/softmax-regression-scratch.md b/chapter_linear-classification/softmax-regression-scratch.md
new file mode 100644
index 0000000..20fb453
--- /dev/null
+++ b/chapter_linear-classification/softmax-regression-scratch.md
@@ -0,0 +1,241 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
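+# Softmax Regression Implementation from Scratch
+:label:`sec_softmax_scratch`
+
+Because softmax regression is so fundamental, we believe that you ought to know how to implement it yourself. Here, we limit ourselves to defining the softmax-specific aspects of the model and reuse the other components from our linear regression section, including the training loop.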
+
+```{.python .input}
+%%tab mxnet
+from d2l import mxnet as d2l
+from mxnet import autograd, np, npx, gluon
+npx.set_np()
+```
+
+```{.python .input}
+%%tab pytorch
+from d2l import torch as d2l
+import torch
+```
+
+```{.python .input}
+%%tab tensorflow
+from d2l import tensorflow as d2l
+import tensorflow as tf
+```
+
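+## The Softmax
+
+Let's begin with the most important part: the mapping from scalars to probabilities. For a refresher, recall the operation of the sum operator along specific dimensions in a tensor, as discussed in :numref:`subsec_lin-alg-reduction` and :numref:`subsec_lin-alg-non-reduction`. [**Given a matrix `X` we can sum over all elements (by default) or only over elements in the same axis.**] The `axis` variable lets us compute row and column sums: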
+
+```{.python .input}
+%%tab all
+X = d2l.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
+d2l.reduce_sum(X, 0, keepdims=True), d2l.reduce_sum(X, 1, keepdims=True)
+```
+
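+Computing the softmax requires three steps: (i) exponentiation of each term; (ii) a sum over each row to compute the normalization constant for each example; (iii) division of each row by its normalization constant, ensuring that the result sums to 1.
+
+(**
+$$\mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(\mathbf{X}_{ij})}{\sum_k \exp(\mathbf{X}_{ik})}.$$
+**)
+
+The (logarithm of the) denominator is called the (log) *partition function*. It was introduced in [statistical physics](https://en.wikipedia.org/wiki/Partition_function_(statistical_mechanics)) to sum over all possible states in a thermodynamic ensemble. The implementation is straightforward: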
+
+```{.python .input}
+%%tab all
+def softmax(X):
+    X_exp = d2l.exp(X)
+    partition = d2l.reduce_sum(X_exp, 1, keepdims=True)
+    return X_exp / partition  # The broadcasting mechanism is applied here
+```
+
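+For any input `X`, [**we turn each element into a non-negative number. Each row sums up to 1,**] as is required for a probability. Caution: the code above is *not* robust against very large or very small arguments. While this is sufficient to illustrate what is happening, you should *not* use this code verbatim for any serious purpose. Deep learning frameworks have such protections built-in and we will be using the built-in softmax going forward.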
+
+```{.python .input}
+%%tab mxnet
+X = d2l.rand(2, 5)
+X_prob = softmax(X)
+X_prob, d2l.reduce_sum(X_prob, 1)
+```
+
+```{.python .input}
+%%tab tensorflow, pytorch
+X = d2l.rand((2, 5))
+X_prob = softmax(X)
+X_prob, d2l.reduce_sum(X_prob, 1)
+```
+
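+## The Model
+
+We now have everything that we need to implement [**the softmax regression model.**] As in our linear regression example, each instance will be represented by a fixed-length vector. Since the raw data here consists of $28 \times 28$ pixel images, [**we flatten each image, treating them as vectors of length 784.**] In later chapters, we will introduce convolutional neural networks, which exploit the spatial structure in a more satisfying way.
+
+In softmax regression, the number of outputs from our network should be equal to the number of classes. (**Since our dataset has 10 classes, our network has an output dimension of 10.**) Consequently, our weights constitute a $784 \times 10$ matrix plus a $1 \times 10$ dimensional row vector for the biases. As with linear regression, we initialize the weights `W` with Gaussian noise. The biases are initialized as zeros.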
+
+```{.python .input}
+%%tab mxnet
+class SoftmaxRegressionScratch(d2l.Classifier):
+    def __init__(self, num_inputs, num_outputs, lr, sigma=0.01):
+        super().__init__()
+        self.save_hyperparameters()
+        self.W = np.random.normal(0, sigma, (num_inputs, num_outputs))
+        self.b = np.zeros(num_outputs)
+        self.W.attach_grad()
+        self.b.attach_grad()
+
+    def collect_params(self):
+        return [self.W, self.b]
+```
+
+```{.python .input}
+%%tab pytorch
+class SoftmaxRegressionScratch(d2l.Classifier):
+    def __init__(self, num_inputs, num_outputs, lr, sigma=0.01):
+        super().__init__()
+        self.save_hyperparameters()
+        self.W = torch.normal(0, sigma, size=(num_inputs, num_outputs),
+                              requires_grad=True)
+        self.b = torch.zeros(num_outputs, requires_grad=True)
+
+    def parameters(self):
+        return [self.W, self.b]
+```
+
+```{.python .input}
+%%tab tensorflow
+class SoftmaxRegressionScratch(d2l.Classifier):
+    def __init__(self, num_inputs, num_outputs, lr, sigma=0.01):
+        super().__init__()
+        self.save_hyperparameters()
+        self.W = tf.random.normal((num_inputs, num_outputs), 0, sigma)
+        self.b = tf.zeros(num_outputs)
+        self.W = tf.Variable(self.W)
+        self.b = tf.Variable(self.b)
+```
+
+The code below defines how the network maps each input to an output. Note that we flatten each $28 \times 28$ pixel image in the batch into a vector using `reshape` before passing the data through our model.
+
+```{.python .input}
+%%tab all
+@d2l.add_to_class(SoftmaxRegressionScratch)
+def forward(self, X):
+    return softmax(d2l.matmul(d2l.reshape(
+        X, (-1, self.W.shape[0])), self.W) + self.b)
+```
+
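+## The Cross-Entropy Loss
+
+Next we need to implement the cross-entropy loss function (introduced in :numref:`subsec_softmax-regression-loss-func`). This may be the most common loss function in all of deep learning. At the moment, applications of deep learning easily cast as classification problems far outnumber those better treated as regression problems.
+
+Recall that cross-entropy takes the negative log-likelihood of the predicted probability assigned to the true label. For efficiency we avoid Python for-loops and use indexing instead. In particular, the one-hot encoding in $\mathbf{y}$ allows us to select the matching terms in $\hat{\mathbf{y}}$.
+
+To see this in action we [**create sample data `y_hat` with 2 examples of predicted probabilities over 3 classes and their corresponding labels `y`.**] The correct labels are $0$ and $2$ respectively (i.e., the first and third class). [**Using `y` as the indices of the probabilities in `y_hat`,**] we can pick out terms efficiently.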
+
+```{.python .input}
+%%tab mxnet, pytorch
+y = d2l.tensor([0, 2])
+y_hat = d2l.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
+y_hat[[0, 1], y]
+```
+
+```{.python .input}
+%%tab tensorflow
+y_hat = tf.constant([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
+y = tf.constant([0, 2])
+tf.boolean_mask(y_hat, tf.one_hot(y, depth=y_hat.shape[-1]))
+```
+
+Now we can (**implement the cross-entropy loss function**) by averaging the negative logarithms of the selected probabilities.
+
+```{.python .input}
+%%tab mxnet, pytorch
+def cross_entropy(y_hat, y):
+    return - d2l.reduce_mean(d2l.log(y_hat[range(len(y_hat)), y]))
+
+cross_entropy(y_hat, y)
+```
+
+```{.python .input}
+%%tab tensorflow
+def cross_entropy(y_hat, y):
+    return - tf.reduce_mean(tf.math.log(tf.boolean_mask(
+        y_hat, tf.one_hot(y, depth=y_hat.shape[-1]))))
+
+cross_entropy(y_hat, y)
+```
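+
+As a check on the arithmetic (our own worked example, using natural logarithms and rounding to four decimals), the probabilities selected above are $0.1$ and $0.5$, so the loss should come out as
+
+$$
+-\frac{1}{2}\left(\log 0.1 + \log 0.5\right) \approx \frac{2.3026 + 0.6931}{2} \approx 1.4979.
+$$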
+
+```{.python .input}
+%%tab all
+@d2l.add_to_class(SoftmaxRegressionScratch)
+def loss(self, y_hat, y):
+    return cross_entropy(y_hat, y)
+```
+
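+## Training
+
+We reuse the `fit` method defined in :numref:`sec_linear_scratch` to [**train the model with 10 epochs.**] Note that the number of epochs (`max_epochs`), the minibatch size (`batch_size`), and the learning rate (`lr`) are all adjustable hyperparameters. That means that while these values are not learned during our primary training loop, they still influence the performance of our model, both in training and in generalization. In practice you will want to choose these values based on the *validation* split of the data and then ultimately evaluate your final model on the *test* split. As discussed in :numref:`subsec_generalization-model-selection`, we will treat the test data of Fashion-MNIST as the validation set, thus reporting validation loss and validation accuracy on this split.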
+
+```{.python .input}
+%%tab all
+data = d2l.FashionMNIST(batch_size=256)
+model = SoftmaxRegressionScratch(num_inputs=784, num_outputs=10, lr=0.1)
+trainer = d2l.Trainer(max_epochs=10)
+trainer.fit(model, data)
+```
+
+## Prediction
+
+Now that training is complete, our model is ready to [**classify some images.**]
+
+```{.python .input}
+%%tab all
+X, y = next(iter(data.val_dataloader()))
+preds = d2l.argmax(model(X), axis=1)
+preds.shape
+```
+
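+We are more interested in the images we label *incorrectly*. We visualize them by comparing their actual labels (first line of text output) with the predictions from the model (second line of text output).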
+
+```{.python .input}
+%%tab all
+wrong = d2l.astype(preds, y.dtype) != y
+X, y, preds = X[wrong], y[wrong], preds[wrong]
+labels = [a+'\n'+b for a, b in zip(
+    data.text_labels(y), data.text_labels(preds))]
+data.visualize([X, y], labels=labels)
+```
+
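+## Summary
+
+By now we are starting to get some experience with solving linear regression and classification problems. With it, we have reached what would arguably be the state of the art of 1960-1970s statistical modeling. In the next section, we'll show you how to leverage deep learning frameworks to implement this model much more efficiently.
+
+## Exercises
+
+1. In this section, we directly implemented the softmax function based on the mathematical definition of the softmax operation. As discussed in :numref:`sec_softmax` this can cause numerical instabilities.
+    1. Test whether `softmax` still works correctly if an input has a value of $100$.
+    1. Test whether `softmax` still works correctly if the largest of all inputs is smaller than $-100$.
+    1. Implement a fix by looking at the value relative to the largest entry in the argument.
+1. Implement a `cross_entropy` function that follows the definition of the cross-entropy loss function $-\sum_i y_i \log \hat{y}_i$.
+    1. Try it out in the code example above.
+    1. Why do you think it runs more slowly?
+    1. Should you use it? In which cases would it make sense?
+    1. What do you need to be careful of? Hint: consider the domain of the logarithm.
+1. Is it always a good idea to return the most likely label? For example, would you do this for medical diagnosis? How would you try to address this?
+1. Assume that we want to use softmax regression to predict the next word based on some features. What are some problems that might arise from a large vocabulary?
+1. Experiment with the hyperparameters of the code above. In particular:
+    1. Plot how the validation loss changes as you change the learning rate.
+    1. Do the validation and training loss change as you change the minibatch size? How large or small do you need to go before you see an effect?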
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/50)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/51)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/225)
+:end_tab:
diff --git a/chapter_linear-classification/softmax-regression-scratch_origin.md b/chapter_linear-classification/softmax-regression-scratch_origin.md
new file mode 100644
index 0000000..6eb32df
--- /dev/null
+++ b/chapter_linear-classification/softmax-regression-scratch_origin.md
@@ -0,0 +1,340 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# Softmax Regression Implementation from Scratch
+:label:`sec_softmax_scratch`
+
+Because softmax regression is so fundamental,
+we believe that you ought to know
+how to implement it yourself.
+Here, we limit ourselves to defining the
+softmax-specific aspects of the model
+and reuse the other components
+from our linear regression section,
+including the training loop.
+
+```{.python .input}
+%%tab mxnet
+from d2l import mxnet as d2l
+from mxnet import autograd, np, npx, gluon
+npx.set_np()
+```
+
+```{.python .input}
+%%tab pytorch
+from d2l import torch as d2l
+import torch
+```
+
+```{.python .input}
+%%tab tensorflow
+from d2l import tensorflow as d2l
+import tensorflow as tf
+```
+
+## The Softmax
+
+Let's begin with the most important part:
+the mapping from scalars to probabilities.
+For a refresher, recall the operation of the sum operator
+along specific dimensions in a tensor,
+as discussed in :numref:`subsec_lin-alg-reduction`
+and :numref:`subsec_lin-alg-non-reduction`.
+[**Given a matrix `X` we can sum over all elements (by default) or only
+over elements in the same axis.**]
+The `axis` variable lets us compute row and column sums:
+
+```{.python .input}
+%%tab all
+X = d2l.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
+d2l.reduce_sum(X, 0, keepdims=True), d2l.reduce_sum(X, 1, keepdims=True)
+```
+
+Computing the softmax requires three steps:
+(i) exponentiation of each term;
+(ii) a sum over each row to compute the normalization constant for each example;
+(iii) division of each row by its normalization constant,
+ensuring that the result sums to 1.
+
+(**
+$$\mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(\mathbf{X}_{ij})}{\sum_k \exp(\mathbf{X}_{ik})}.$$
+**)
+
+The (logarithm of the) denominator
+is called the (log) *partition function*.
+It was introduced in [statistical physics](https://en.wikipedia.org/wiki/Partition_function_(statistical_mechanics))
+to sum over all possible states in a thermodynamic ensemble.
+The implementation is straightforward:
+
+```{.python .input}
+%%tab all
+def softmax(X):
+    X_exp = d2l.exp(X)
+    partition = d2l.reduce_sum(X_exp, 1, keepdims=True)
+    return X_exp / partition  # The broadcasting mechanism is applied here
+```
+
+For any input `X`, [**we turn each element
+into a non-negative number.
+Each row sums up to 1,**]
+as is required for a probability. Caution: the code above is *not* robust against very large or very small arguments. While this is sufficient to illustrate what is happening, you should *not* use this code verbatim for any serious purpose. Deep learning frameworks have such protections built-in and we will be using the built-in softmax going forward.
+
+```{.python .input}
+%%tab mxnet
+X = d2l.rand(2, 5)
+X_prob = softmax(X)
+X_prob, d2l.reduce_sum(X_prob, 1)
+```
+
+```{.python .input}
+%%tab tensorflow, pytorch
+X = d2l.rand((2, 5))
+X_prob = softmax(X)
+X_prob, d2l.reduce_sum(X_prob, 1)
+```
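
The caution above about very large or very small arguments can be made concrete. The following is a minimal framework-free sketch, not the implementation that deep learning frameworks use: the helper names `naive_softmax` and `stable_softmax` are hypothetical, and the code works on a single row given as a Python list. It relies on the fact that subtracting the row maximum from every entry leaves the softmax unchanged while keeping every exponent at or below zero.

```python
import math

def naive_softmax(row):
    """Softmax straight from the definition; overflows for large inputs."""
    exps = [math.exp(v) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(row):
    """Softmax after subtracting the row maximum (mathematically identical)."""
    shift = max(row)
    exps = [math.exp(v - shift) for v in row]  # every exponent is <= 0
    total = sum(exps)  # >= 1, because the maximum contributes exp(0) = 1
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]  # same pairwise differences as `small`

# math.exp raises OverflowError once the result exceeds the float range.
try:
    naive_softmax(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

probs_small = stable_softmax(small)
probs_large = stable_softmax(large)
```

Since `small` and `large` differ only by a constant shift, the stable version returns the same probabilities for both, whereas the naive version fails on `large`.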
+
+## The Model
+
+We now have everything that we need
+to implement [**the softmax regression model.**]
+As in our linear regression example,
+each instance will be represented
+by a fixed-length vector.
+Since the raw data here consists
+of $28 \times 28$ pixel images,
+[**we flatten each image,
+treating them as vectors of length 784.**]
+In later chapters, we will introduce
+convolutional neural networks,
+which exploit the spatial structure
+in a more satisfying way.
+
+
+In softmax regression,
+the number of outputs from our network
+should be equal to the number of classes.
+(**Since our dataset has 10 classes,
+our network has an output dimension of 10.**)
+Consequently, our weights constitute a $784 \times 10$ matrix
+plus a $1 \times 10$ dimensional row vector for the biases.
+As with linear regression,
+we initialize the weights `W`
+with Gaussian noise.
+The biases are initialized as zeros.
+
+```{.python .input}
+%%tab mxnet
+class SoftmaxRegressionScratch(d2l.Classifier):
+    def __init__(self, num_inputs, num_outputs, lr, sigma=0.01):
+        super().__init__()
+        self.save_hyperparameters()
+        self.W = np.random.normal(0, sigma, (num_inputs, num_outputs))
+        self.b = np.zeros(num_outputs)
+        self.W.attach_grad()
+        self.b.attach_grad()
+
+    def collect_params(self):
+        return [self.W, self.b]
+```
+
+```{.python .input}
+%%tab pytorch
+class SoftmaxRegressionScratch(d2l.Classifier):
+    def __init__(self, num_inputs, num_outputs, lr, sigma=0.01):
+        super().__init__()
+        self.save_hyperparameters()
+        self.W = torch.normal(0, sigma, size=(num_inputs, num_outputs),
+                              requires_grad=True)
+        self.b = torch.zeros(num_outputs, requires_grad=True)
+
+    def parameters(self):
+        return [self.W, self.b]
+```
+
+```{.python .input}
+%%tab tensorflow
+class SoftmaxRegressionScratch(d2l.Classifier):
+    def __init__(self, num_inputs, num_outputs, lr, sigma=0.01):
+        super().__init__()
+        self.save_hyperparameters()
+        self.W = tf.random.normal((num_inputs, num_outputs), 0, sigma)
+        self.b = tf.zeros(num_outputs)
+        self.W = tf.Variable(self.W)
+        self.b = tf.Variable(self.b)
+```
+
+The code below defines how the network
+maps each input to an output.
+Note that we flatten each $28 \times 28$ pixel image in the batch
+into a vector using `reshape`
+before passing the data through our model.
+
+```{.python .input}
+%%tab all
+@d2l.add_to_class(SoftmaxRegressionScratch)
+def forward(self, X):
+    return softmax(d2l.matmul(d2l.reshape(
+        X, (-1, self.W.shape[0])), self.W) + self.b)
+```
+
+## The Cross-Entropy Loss
+
+Next we need to implement the cross-entropy loss function
+(introduced in :numref:`subsec_softmax-regression-loss-func`).
+This may be the most common loss function
+in all of deep learning.
+At the moment, applications of deep learning
+easily cast as classification problems
+far outnumber those better treated as regression problems.
+
+Recall that cross-entropy takes the negative log-likelihood
+of the predicted probability assigned to the true label.
+For efficiency we avoid Python for-loops and use indexing instead.
+In particular, the one-hot encoding in $\mathbf{y}$
+allows us to select the matching terms in $\hat{\mathbf{y}}$.
+
+To see this in action we [**create sample data `y_hat`
+with 2 examples of predicted probabilities over 3 classes and their corresponding labels `y`.**]
+The correct labels are $0$ and $2$ respectively.
+[**Using `y` as the indices of the probabilities in `y_hat`,**]
+we can pick out terms efficiently.
+
+```{.python .input}
+%%tab mxnet, pytorch
+y = d2l.tensor([0, 2])
+y_hat = d2l.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
+y_hat[[0, 1], y]
+```
+
+```{.python .input}
+%%tab tensorflow
+y_hat = tf.constant([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
+y = tf.constant([0, 2])
+tf.boolean_mask(y_hat, tf.one_hot(y, depth=y_hat.shape[-1]))
+```
+
+Now we can (**implement the cross-entropy loss function**) by averaging over the logarithms of the selected probabilities.
+
+```{.python .input}
+%%tab mxnet, pytorch
+def cross_entropy(y_hat, y):
+    return - d2l.reduce_mean(d2l.log(y_hat[range(len(y_hat)), y]))
+
+cross_entropy(y_hat, y)
+```
+
+```{.python .input}
+%%tab tensorflow
+def cross_entropy(y_hat, y):
+    return - tf.reduce_mean(tf.math.log(tf.boolean_mask(
+        y_hat, tf.one_hot(y, depth=y_hat.shape[-1]))))
+
+cross_entropy(y_hat, y)
+```
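
To make the arithmetic explicit, the sketch below recomputes the loss for the sample data above in plain Python, without any framework. It is only an illustration under the assumption that predictions are nested lists and labels are integer class indices; the function mirrors `cross_entropy` but is not the version used in the rest of this section.

```python
import math

def cross_entropy(y_hat, y):
    """Mean negative log-probability assigned to the true classes."""
    picked = [row[label] for row, label in zip(y_hat, y)]
    return -sum(math.log(p) for p in picked) / len(picked)

y = [0, 2]
y_hat = [[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]]

# The selected probabilities are 0.1 and 0.5, so the loss is
# -(log 0.1 + log 0.5) / 2 = -log(0.05) / 2.
loss = cross_entropy(y_hat, y)
```

The first example contributes most of the loss, because the model assigned only probability $0.1$ to its true class.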
+
+```{.python .input}
+%%tab all
+@d2l.add_to_class(SoftmaxRegressionScratch)
+def loss(self, y_hat, y):
+    return cross_entropy(y_hat, y)
+```
+
+## Training
+
+We reuse the `fit` method defined in :numref:`sec_linear_scratch` to [**train the model with 10 epochs.**]
+Note that the number of epochs (`max_epochs`),
+the minibatch size (`batch_size`),
+and the learning rate (`lr`)
+are all adjustable hyperparameters.
+That means that while these values are not
+learned during our primary training loop,
+they still influence the performance
+of our model, both vis-a-vis training
+and generalization performance.
+In practice you will want to choose these values
+based on the *validation* split of the data
+and then to ultimately evaluate your final model
+on the *test* split.
+As discussed in :numref:`subsec_generalization-model-selection`,
+we will treat the test data of Fashion-MNIST
+as the validation set, thus
+reporting validation loss and validation accuracy
+on this split.
+
+```{.python .input}
+%%tab all
+data = d2l.FashionMNIST(batch_size=256)
+model = SoftmaxRegressionScratch(num_inputs=784, num_outputs=10, lr=0.1)
+trainer = d2l.Trainer(max_epochs=10)
+trainer.fit(model, data)
+```
+
+## Prediction
+
+Now that training is complete,
+our model is ready to [**classify some images.**]
+
+```{.python .input}
+%%tab all
+X, y = next(iter(data.val_dataloader()))
+preds = d2l.argmax(model(X), axis=1)
+preds.shape
+```
+
+We are more interested in the images we label *incorrectly*. We visualize them by
+comparing their actual labels
+(first line of text output)
+with the predictions from the model
+(second line of text output).
+
+```{.python .input}
+%%tab all
+wrong = d2l.astype(preds, y.dtype) != y
+X, y, preds = X[wrong], y[wrong], preds[wrong]
+labels = [a+'\n'+b for a, b in zip(
+    data.text_labels(y), data.text_labels(preds))]
+data.visualize([X, y], labels=labels)
+```
+
+## Summary
+
+By now we are starting to get some experience
+with solving linear regression
+and classification problems.
+With it, we have reached what would arguably be
+the state of the art of 1960s-1970s statistical modeling.
+In the next section, we'll show you how to leverage
+deep learning frameworks to implement this model
+much more efficiently.
+
+## Exercises
+
+1. In this section, we directly implemented the softmax function based on the mathematical definition of the softmax operation. As discussed in :numref:`sec_softmax` this can cause numerical instabilities.
+    1. Test whether `softmax` still works correctly if an input has a value of $100$.
+    1. Test whether `softmax` still works correctly if the largest of all inputs is smaller than $-100$.
+    1. Implement a fix by looking at the value relative to the largest entry in the argument.
+1. Implement a `cross_entropy` function that follows the definition of the cross-entropy loss function $-\sum_i y_i \log \hat{y}_i$.
+    1. Try it out in the code example above.
+    1. Why do you think it runs more slowly?
+    1. Should you use it? In which cases would it make sense?
+    1. What do you need to be careful of? Hint: consider the domain of the logarithm.
+1. Is it always a good idea to return the most likely label? For example, would you do this for medical diagnosis? How would you try to address this?
+1. Assume that we want to use softmax regression to predict the next word based on some features. What are some problems that might arise from a large vocabulary?
+1. Experiment with the hyperparameters of the code above. In particular:
+    1. Plot how the validation loss changes as you change the learning rate.
+    1. Do the validation and training loss change as you change the minibatch size? How large or small do you need to go before you see an effect?
+
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/50)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/51)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/225)
+:end_tab:
diff --git a/chapter_linear-classification/softmax-regression.md b/chapter_linear-classification/softmax-regression.md
new file mode 100644
index 0000000..30691db
--- /dev/null
+++ b/chapter_linear-classification/softmax-regression.md
@@ -0,0 +1,200 @@
+# Softmax Regression
+:label:`sec_softmax`
+
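+In :numref:`sec_linear_regression`, we introduced linear regression, working through implementations from scratch in :numref:`sec_linear_scratch` and again using high-level APIs of a deep learning framework in :numref:`sec_linear_concise` to do the heavy lifting.
+
+Regression is the hammer we reach for when we want to answer *how much?* or *how many?* questions. If you want to predict the number of dollars (price) at which a house will be sold, or the number of wins a baseball team might have, or the number of days that a patient will remain hospitalized before being discharged, then you are probably looking for a regression model. However, even within regression models, there are important distinctions. For instance, the price of a house will never be negative and changes might often be *relative* to its baseline price. As such, it might be more effective to regress on the logarithm of the price. Likewise, the number of days a patient spends in hospital is a *discrete nonnegative* random variable. As such, least mean squares might not be an ideal approach either. This sort of time-to-event modeling comes with a host of other complications that are dealt with in a specialized subfield called *survival modeling*.
+
+The point here is not to overwhelm you but just to let you know that there is a lot more to estimation than simply minimizing squared errors. And more broadly, there is a lot more to supervised learning than regression. In this section, we focus on *classification* problems where we put aside *how much?* questions and instead focus on *which category?* questions.
+
+* Does this email belong in the spam folder or the inbox?
+* Is this customer more likely to sign up or not to sign up for a subscription service?
+* Does this image depict a donkey, a dog, a cat, or a rooster?
+* Which movie is Aston most likely to watch next?
+* Which section of the book are you going to read next?
+
+Colloquially, machine learning practitioners overload the word *classification* to describe two subtly different problems: (i) those where we are interested only in hard assignments of examples to categories (classes); and (ii) those where we wish to make soft assignments, i.e., to assess the probability that each category applies. The distinction tends to get blurred, in part, because often, even when we only care about hard assignments, we still use models that make soft assignments.
+
+Even more, there are cases where more than one label might be true. For instance, a news article might simultaneously cover the topics of entertainment, business, and space flight, but not the topics of medicine or sports. Thus, categorizing it into one of the above categories on its own would not be very useful. This problem is commonly known as [multi-label classification](https://en.wikipedia.org/wiki/Multi-label_classification). See :citet:`Tsoumakas.Katakis.2007` for an overview and :citet:`Huang.Xu.Yu.2015` for an effective algorithm when tagging images.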
+
+## Classification
+:label:`subsec_classification-problem`
+
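+To get our feet wet, let's start with a simple image classification problem. Here, each input consists of a $2\times2$ grayscale image. We can represent each pixel value with a single scalar, giving us four features $x_1, x_2, x_3, x_4$. Further, let's assume that each image belongs to one among the categories "cat", "chicken", and "dog".
+
+Next, we have to choose how to represent the labels. We have two obvious choices. Perhaps the most natural impulse would be to choose $y \in \{1, 2, 3\}$, where the integers represent $\{\text{dog}, \text{cat}, \text{chicken}\}$ respectively. This is a great way of *storing* such information on a computer. If the categories had some natural ordering among them, say if we were trying to predict $\{\text{baby}, \text{toddler}, \text{adolescent}, \text{young adult}, \text{adult}, \text{geriatric}\}$, then it might even make sense to cast this as an [ordinal regression](https://en.wikipedia.org/wiki/Ordinal_regression) problem and keep the labels in this format. See :citet:`Moon.Smola.Chang.ea.2010` for an overview of different types of ranking loss functions and :citet:`Beutel.Murray.Faloutsos.ea.2014` for a Bayesian approach that addresses responses with more than one mode.
+
+In general, classification problems do not come with natural orderings among the classes. Fortunately, statisticians long ago invented a simple way to represent categorical data: the *one-hot encoding*. A one-hot encoding is a vector with as many components as we have categories. The component corresponding to a particular instance's category is set to 1 and all other components are set to 0. In our case, a label $y$ would be a three-dimensional vector, with $(1, 0, 0)$ corresponding to "cat", $(0, 1, 0)$ to "chicken", and $(0, 0, 1)$ to "dog":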
+
+$$y \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}.$$
+
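+### Linear Model
+
+In order to estimate the conditional probabilities associated with all the possible classes, we need a model with multiple outputs, one per class. To address classification with linear models, we will need as many affine functions as we have outputs. Strictly speaking, we only need one fewer, since the last category has to be the difference between $1$ and the sum of the other categories, but for reasons of symmetry we use a slightly redundant parametrization. Each output corresponds to its own affine function. In our case, since we have 4 features and 3 possible output categories, we need 12 scalars to represent the weights ($w$ with subscripts), and 3 scalars to represent the biases ($b$ with subscripts). This yields: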
+
+$$
+\begin{aligned}
+o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1,\\
+o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2,\\
+o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3.
+\end{aligned}
+$$
+
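+The corresponding neural network diagram is shown in :numref:`fig_softmaxreg`. Just as in linear regression, we use a single-layer neural network. And since the calculation of each output, $o_1, o_2$, and $o_3$, depends on all inputs, $x_1$, $x_2$, $x_3$, and $x_4$, the output layer can also be described as a *fully connected layer*.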
+
+![Softmax regression is a single-layer neural network.](../img/softmaxreg.svg)
+:label:`fig_softmaxreg`
+
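+For a more concise notation we use vectors and matrices: $\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$ is much better suited for mathematics and code. Note that we have gathered all of our weights into a $3 \times 4$ matrix and all biases into a vector $\mathbf{b} \in \mathbb{R}^3$.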
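
As a small illustration of $\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$, the sketch below evaluates the three affine functions for one input in plain Python. The helper name `affine` and all numerical values are hypothetical, chosen only so that the arithmetic is easy to check by hand; a practical implementation would use a tensor library's matrix product instead.

```python
def affine(W, x, b):
    """Compute o = W x + b for a weight matrix given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

# Hypothetical 3 x 4 weights, 3 biases, and one input with 4 features.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 1.0, 0.0],
     [0.5, 0.5, 0.5, 0.5]]
b = [0.0, -1.0, 0.5]
x = [1.0, 2.0, 3.0, 4.0]

o = affine(W, x, b)  # one output (logit) per class
```

With these values the outputs are $o_1 = 1$, $o_2 = 2 + 3 - 1 = 4$, and $o_3 = 0.5 \cdot 10 + 0.5 = 5.5$.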
+
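+### The Softmax
+:label:`subsec_softmax_operation`
+
+Assuming a suitable loss function, we could try, directly, to minimize the difference between $\mathbf{o}$ and the labels $\mathbf{y}$. While it turns out that treating classification as a vector-valued regression problem works surprisingly well, it is nonetheless lacking in the following ways:
+
+* There is no guarantee that the outputs $o_i$ sum up to $1$ in the way we expect probabilities to behave.
+* There is no guarantee that the outputs $o_i$ are nonnegative, even if they sum up to $1$, or that they do not exceed $1$.
+
+Both aspects render the estimation problem difficult to solve and the solution very brittle to outliers. For instance, if we assume that there is a positive linear dependency between the number of bedrooms and the likelihood that someone will buy a house, the probability might exceed $1$ when it comes to buying a mansion! As such, we need a mechanism to "squish" the outputs.
+
+There are many ways we might accomplish this goal. For instance, we could assume that the outputs $\mathbf{o}$ are corrupted versions of $\mathbf{y}$, where the corruption occurs by means of adding noise $\mathbf{\epsilon}$ drawn from a normal distribution. In other words, $\mathbf{y} = \mathbf{o} + \mathbf{\epsilon}$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. This is the so-called [probit model](https://en.wikipedia.org/wiki/Probit_model), first introduced by :citet:`Fechner.1860`. While appealing, it does not work quite as well as the softmax, nor does it lead to a particularly nice optimization problem.
+
+Another way to accomplish this goal (and to ensure nonnegativity) is to use an exponential function $P(y = i) \propto \exp o_i$. This does indeed satisfy the requirement that the conditional class probability increases with increasing $o_i$, it is monotonic, and all probabilities are nonnegative. We can then transform these values so that they add up to $1$ by dividing each by their sum. This process is called *normalization*. Putting these two pieces together gives us the *softmax* function: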
+
+$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \quad \text{where}\quad \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}.$$
+:eqlabel:`eq_softmax_y_and_o`
+
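+Note that the largest coordinate of $\mathbf{o}$ corresponds to the most likely class according to $\hat{\mathbf{y}}$. Moreover, because the softmax operation preserves the ordering among its arguments, we do not need to compute the softmax to determine which class has been assigned the highest probability: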
+
+$$
+\operatorname*{argmax}_j \hat y_j = \operatorname*{argmax}_j o_j.
+$$
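
The order-preserving property can be checked on a small example. The sketch below is plain Python with hypothetical helper names (`softmax`, `argmax`) and arbitrary logits; it is meant as an illustration, not as a reference implementation.

```python
import math

def softmax(o):
    """Softmax of a list of logits, shifted by the maximum for stability."""
    shift = max(o)
    exps = [math.exp(v - shift) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [0.2, -1.5, 3.0]  # arbitrary example values
probs = softmax(logits)

# Sorting indices by logit or by probability gives the same ranking.
rank_by_logit = sorted(range(len(logits)), key=lambda i: logits[i])
rank_by_prob = sorted(range(len(probs)), key=lambda i: probs[i])
```

Because $\exp$ is strictly increasing and the normalization divides every entry by the same positive constant, the ranking of the logits carries over to the probabilities, so `argmax(logits)` and `argmax(probs)` agree.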
+
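+The idea of a softmax dates back to Gibbs, who adapted ideas from physics :cite:`Gibbs.1902`. Dating even further back, Boltzmann, the father of modern thermodynamics, used this trick to model a distribution over energy states in gas molecules. In particular, he discovered that the prevalence of a state of energy in a thermodynamic ensemble, such as the molecules in a gas, is proportional to $\exp(-E/kT)$. Here, $E$ is the energy of a state, $T$ is the temperature, and $k$ is the Boltzmann constant. When statisticians talk about increasing or decreasing the "temperature" of a statistical system, they refer to changing $T$ in order to favor lower or higher energy states. Following Gibbs' idea, energy equates to error. Energy-based models :cite:`Ranzato.Boureau.Chopra.ea.2007` use this point of view when describing problems in deep learning.
+
+### Vectorization
+:label:`subsec_softmax_vectorization`
+
+To improve computational efficiency, we vectorize calculations in minibatches of data. Assume that we are given a minibatch $\mathbf{X} \in \mathbb{R}^{n \times d}$ of $n$ examples with dimensionality (number of inputs) $d$. Moreover, assume that we have $q$ categories in the output. Then the weights satisfy $\mathbf{W} \in \mathbb{R}^{d \times q}$ and the bias satisfies $\mathbf{b} \in \mathbb{R}^{1\times q}$.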
+
+$$ \begin{aligned} \mathbf{O} &= \mathbf{X} \mathbf{W} + \mathbf{b}, \\ \hat{\mathbf{Y}} & = \mathrm{softmax}(\mathbf{O}). \end{aligned} $$
+:eqlabel:`eq_minibatch_softmax_reg`
+
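+This expresses the dominant operation as a matrix-matrix product $\mathbf{X} \mathbf{W}$, which accelerates it. Moreover, since each row in $\mathbf{X}$ represents a data example, the softmax operation itself can be computed *rowwise*: for each row of $\mathbf{O}$, exponentiate all entries and then normalize them by the sum. Note, though, that care must be taken to avoid exponentiating and taking logarithms of large numbers, since this can cause numerical overflow or underflow. Deep learning frameworks take care of this automatically.
+
+## Loss Function
+:label:`subsec_softmax-regression-loss-func`
+
+Now that we have a mapping from features $\mathbf{x}$ to probabilities $\mathbf{\hat{y}}$, we need a way to optimize the accuracy of this mapping. We will rely on maximum likelihood estimation, the very same concept that we encountered when providing a probabilistic justification for the mean squared error loss in :numref:`subsec_normal_distribution_and_squared_loss`.
+
+### Log-Likelihood
+
+The softmax function gives us a vector $\hat{\mathbf{y}}$, which we can interpret as (estimated) conditional probabilities of each class, given any input $\mathbf{x}$, such as $\hat{y}_1$ = $P(y=\text{cat} \mid \mathbf{x})$. In the following we assume that for a dataset with features $\mathbf{X}$ the labels $\mathbf{Y}$ are represented using one-hot encoded label vectors. We can compare the estimates with reality by checking how probable the actual classes are according to our model, given the features: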
+
+$$
+P(\mathbf{Y} \mid \mathbf{X}) = \prod_{i=1}^n P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}).
+$$
+
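+We are allowed to use the factorization since we assume that each label is drawn independently from its respective distribution $P(\mathbf{y}\mid\mathbf{x}^{(i)})$. Since maximizing the product of terms is awkward, we take the negative logarithm to obtain the equivalent problem of minimizing the negative log-likelihood: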
+
+$$
+-\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^n -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})
+= \sum_{i=1}^n l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}),
+$$
+
+where for any pair of label $\mathbf{y}$ and model prediction $\hat{\mathbf{y}}$ over $q$ classes, the loss function $l$ is
+
+$$ l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^q y_j \log \hat{y}_j. $$
+:eqlabel:`eq_l_cross_entropy`
+
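+For reasons explained later on, the loss function in :eqref:`eq_l_cross_entropy` is commonly called the *cross-entropy loss*. Since $\mathbf{y}$ is a one-hot vector of length $q$, the sum over all its coordinates $j$ vanishes for all but one term. Note that the loss $l(\mathbf{y}, \hat{\mathbf{y}})$ is bounded from below by $0$ whenever $\hat{\mathbf{y}}$ is a probability vector: no single entry is larger than $1$, hence its negative logarithm cannot be lower than $0$; $l(\mathbf{y}, \hat{\mathbf{y}}) = 0$ only if we predict the actual label with *certainty*. This can never happen for any finite setting of the weights because taking a softmax output towards $1$ requires taking the corresponding input $o_i$ to infinity (or all other outputs $o_j$ for $j \neq i$ to negative infinity). Even if our model could assign an output probability of $0$, any error made when assigning such high confidence would incur infinite loss ($-\log 0 = \infty$).
+
+### Softmax and Cross-Entropy Loss
+:label:`subsec_softmax_and_derivatives`
+
+Since the softmax function and the corresponding cross-entropy loss are so common, it is worth understanding a bit better how they are computed. Plugging :eqref:`eq_softmax_y_and_o` into the definition of the loss in :eqref:`eq_l_cross_entropy` and using the definition of the softmax we obtain: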
+
+$$
+\begin{aligned}
+l(\mathbf{y}, \hat{\mathbf{y}}) &=  - \sum_{j=1}^q y_j \log \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} \\
+&= \sum_{j=1}^q y_j \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j \\
+&= \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j.
+\end{aligned}
+$$
+
+To understand a bit better what is going on, consider the derivative with respect to any logit $o_j$. We get
+
+$$
+\partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j.
+$$
+
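+In other words, the derivative is the difference between the probability assigned by our model, as expressed by the softmax operation, and what actually happened, as expressed by elements in the one-hot label vector. In this sense, it is very similar to what we saw in regression, where the gradient was the difference between the observation $y$ and estimate $\hat{y}$. This is no coincidence. In any exponential family model, the gradients of the log-likelihood are given by precisely this term. This fact makes computing gradients easy in practice.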
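
The identity $\partial_{o_j} l = \mathrm{softmax}(\mathbf{o})_j - y_j$ can be checked numerically. The sketch below, in plain Python with hypothetical helper names (`softmax`, `loss`, `numerical_grad`) and arbitrary example values, compares the analytic expression with central finite differences of the loss $\log \sum_k \exp(o_k) - \sum_j y_j o_j$. It is a sanity check under these assumptions, not how frameworks compute gradients.

```python
import math

def softmax(o):
    """Softmax of a list of logits, shifted by the maximum for stability."""
    shift = max(o)
    exps = [math.exp(v - shift) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def loss(o, y):
    """Cross-entropy written in terms of logits: log-sum-exp minus y . o."""
    shift = max(o)
    log_partition = shift + math.log(sum(math.exp(v - shift) for v in o))
    return log_partition - sum(y_j * o_j for y_j, o_j in zip(y, o))

def numerical_grad(o, y, eps=1e-6):
    """Central finite differences of the loss with respect to each logit."""
    grads = []
    for j in range(len(o)):
        up, down = list(o), list(o)
        up[j] += eps
        down[j] -= eps
        grads.append((loss(up, y) - loss(down, y)) / (2 * eps))
    return grads

o = [1.0, -0.5, 2.0]  # arbitrary example logits
y = [0.0, 0.0, 1.0]   # one-hot label for the third class
analytic = [p - t for p, t in zip(softmax(o), y)]
numeric = numerical_grad(o, y)
```

The gradient entry for the true class is negative (increasing its logit lowers the loss), the other entries are positive, and all entries sum to zero because both the probabilities and the one-hot label sum to one.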
+
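+Now consider the case where we observe not just a single outcome but an entire distribution over outcomes. We can use the same representation as before for the label $\mathbf{y}$. The only difference is that rather than a vector containing only binary entries, say $(0, 0, 1)$, we now have a generic probability vector, say $(0.1, 0.2, 0.7)$. The math that we used previously to define the loss $l$ in :eqref:`eq_l_cross_entropy` still works fine, just that the interpretation is slightly more general. It is the expected value of the loss for a distribution over labels. This loss is called the *cross-entropy loss* and it is one of the most commonly used losses for classification problems. We can demystify the name by introducing just the basics of information theory. In a nutshell, it measures the number of bits needed to encode what we see, $\mathbf{y}$, relative to what we predict should happen, $\hat{\mathbf{y}}$. We provide a very basic explanation in the following. For further details on information theory see :cite:`Cover.Thomas.1999` or :cite:`mackay2003information`.
+
+## Information Theory Basics
+:label:`subsec_info_theory_basics`
+
+Many deep learning papers use intuition and terms from information theory. To make sense of them, we need some common language. This is a survival guide. *Information theory* deals with the problem of encoding, decoding, transmitting, and manipulating information (also known as data).
+
+### Entropy
+
+The central idea in information theory is to quantify the amount of information contained in data. This places a limit on our ability to compress data. For a distribution $P$ its *entropy* is defined as: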
+
+$$H[P] = \sum_j - P(j) \log P(j).$$
+:eqlabel:`eq_softmax_reg_entropy`
+
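+One of the fundamental theorems of information theory states that in order to encode data drawn randomly from the distribution $P$, we need at least $H[P]$ "nats" to encode it :cite:`Shannon.1948`. If you wonder what a "nat" is, it is the equivalent of a bit but when using a code with base $e$ rather than one with base 2. Thus, one nat is $\frac{1}{\log(2)} \approx 1.44$ bits.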
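
As a quick numerical illustration of the entropy formula and of the conversion between nats and bits, consider the following plain-Python sketch. The helper name `entropy` and the example distributions are assumptions made for illustration; terms with zero probability are skipped, following the usual convention $0 \log 0 = 0$.

```python
import math

def entropy(p):
    """Entropy in nats; terms with zero probability contribute nothing."""
    return sum(-p_j * math.log(p_j) for p_j in p if p_j > 0)

uniform = [0.25] * 4
h_nats = entropy(uniform)        # log(4) nats
h_bits = h_nats / math.log(2)    # the same quantity expressed in bits

certain = entropy([1.0, 0.0, 0.0])  # a deterministic outcome
bits_per_nat = 1 / math.log(2)      # roughly 1.44
```

A uniform distribution over four outcomes needs $\log 4 \approx 1.386$ nats, that is exactly 2 bits, while a deterministic outcome has zero entropy.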
+
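+### Surprisal
+
+You might be wondering what compression has to do with prediction. Imagine that we have a stream of data that we want to compress. If it is always easy for us to predict the next token, then this data is easy to compress. Take the extreme example where every token in the stream always takes the same value. That is a very boring data stream! And not only is it boring, but it is also easy to predict. Because the tokens are always the same, we do not have to transmit any information to communicate the contents of the stream. Easy to predict, easy to compress.
+
+However, if we cannot perfectly predict every event, then we might sometimes be surprised. Our surprise is greater when we assigned an event lower probability. Claude Shannon settled on $\log \frac{1}{P(j)} = -\log P(j)$ to quantify one's *surprisal* at observing an event $j$ having assigned it a (subjective) probability $P(j)$. The entropy defined in :eqref:`eq_softmax_reg_entropy` is then the *expected surprisal* when one assigned the correct probabilities that truly match the data-generating process.
+
+### Cross-Entropy Revisited
+
+So if entropy is the level of surprise experienced by someone who knows the true probability, then you might be wondering, what is cross-entropy? The cross-entropy *from* $P$ *to* $Q$, denoted $H(P, Q)$, is the expected surprisal of an observer with subjective probabilities $Q$ upon seeing data that was actually generated according to probabilities $P$. This is given by $H(P, Q) \stackrel{\mathrm{def}}{=} \sum_j - P(j) \log Q(j)$. The lowest possible cross-entropy is achieved when $P=Q$. In this case, the cross-entropy from $P$ to $Q$ is $H(P, P)= H(P)$.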
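
The claim that the cross-entropy is smallest when $Q = P$ can be illustrated numerically. In the plain-Python sketch below, the helper name `cross_entropy` and both distributions are hypothetical example choices; the code evaluates $H(P, Q) = \sum_j -P(j) \log Q(j)$ in nats.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) in nats; terms with P(j) = 0 contribute nothing."""
    return sum(-p_j * math.log(q_j) for p_j, q_j in zip(p, q) if p_j > 0)

p = [0.7, 0.2, 0.1]  # probabilities that actually generate the data
q = [0.4, 0.4, 0.2]  # subjective probabilities of a mismatched observer

h_p = cross_entropy(p, p)   # equals the entropy H(P)
h_pq = cross_entropy(p, q)  # expected surprisal of the mismatched observer
excess = h_pq - h_p         # extra nats paid for using Q instead of P
```

Here $H(P) \approx 0.80$ nats while $H(P, Q) \approx 0.99$ nats, so the mismatched observer is more surprised on average.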
+
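+In short, we can think of the cross-entropy classification objective in two ways: (i) as maximizing the likelihood of the observed data; and (ii) as minimizing our surprisal (and thus the number of bits) required to communicate the labels.
+
+## Summary and Discussion
+
+In this section, we encountered the first nontrivial loss function, allowing us to optimize over *discrete* output spaces. Key in its design was that we took a probabilistic approach, treating discrete categories as instances of draws from a probability distribution. As a side effect, we encountered the softmax, a convenient activation function that transforms outputs of an ordinary neural network layer into valid discrete probability distributions. We saw that the derivative of the cross-entropy loss when combined with softmax behaves very similarly to the derivative of squared error, namely by taking the difference between the expected behavior and its prediction. And, while we were only able to scratch the very surface of it, we encountered exciting connections to statistical physics and information theory.
+
+While this is enough to get you on your way, and hopefully enough to whet your appetite, we hardly dived deep here. Among other things, we skipped over computational considerations. Specifically, for any fully connected layer with $d$ inputs and $q$ outputs, the parametrization and computational cost is $\mathcal{O}(dq)$, which can be prohibitively high in practice. Fortunately, this cost of transforming $d$ inputs into $q$ outputs can be reduced through approximation and compression. For instance, Deep Fried Convnets :cite:`Yang.Moczulski.Denil.ea.2015` uses a combination of permutations, Fourier transforms, and scaling to reduce the cost from quadratic to log-linear. Similar techniques work for more advanced structural matrix approximations :cite:`sindhwani2015structured`. Lastly, we can use quaternion-like decompositions to reduce the cost to $\mathcal{O}(\frac{dq}{n})$, again if we are willing to trade off a small amount of accuracy for computational and storage cost :cite:`Zhang.Tay.Zhang.ea.2021` based on a compression factor $n$. This is an active area of research. What makes it challenging is that we do not necessarily strive for the most compact representation or the smallest number of floating point operations but rather for the solution that can be executed most efficiently on modern GPUs.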
+
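+## Exercises
+
+1. We can explore the connection between exponential families and the softmax in some more depth.
+    1. Compute the second derivative of the cross-entropy loss $l(\mathbf{y},\hat{\mathbf{y}})$ for the softmax.
+    1. Compute the variance of the distribution given by $\mathrm{softmax}(\mathbf{o})$ and show that it matches the second derivative computed above.
+1. Assume that we have three classes which occur with equal probability, i.e., the probability vector is $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$.
+    1. What is the problem if we try to design a binary code for it?
+    1. Can you design a better code? Hint: what happens if we try to encode two independent observations? What if we encode $n$ observations jointly?
+1. When encoding signals transmitted over a physical wire, engineers do not always use binary codes. For instance, [PAM-3](https://en.wikipedia.org/wiki/Ternary_signal) uses three signal levels $\{-1, 0, 1\}$ as opposed to two levels $\{0, 1\}$. How many ternary units do you need to transmit an integer in the range $\{0, \ldots, 7\}$? Why might this be a better idea in terms of electronics?
+1. The [Bradley-Terry model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) uses a logistic model to capture preferences. For a user to choose between apples and oranges one assumes scores $o_{\mathrm{apple}}$ and $o_{\mathrm{orange}}$. Our requirements are that larger scores should lead to a higher likelihood in choosing the associated item and that the item with the largest score is the most likely one to be chosen :cite:`Bradley.Terry.1952`.
+    1. Prove that the softmax satisfies this requirement.
+    1. What happens if you want to allow for a default option of choosing neither apples nor oranges? Hint: now the user has 3 choices.
+1. Softmax derives its name from the following mapping: $\mathrm{RealSoftMax}(a, b) = \log (\exp(a) + \exp(b))$.
+    1. Prove that $\mathrm{RealSoftMax}(a, b) > \mathrm{max}(a, b)$.
+    1. How small can you make the difference between both functions? Hint: without loss of generality you can set $b = 0$ and $a \geq b$.
+    1. Prove that this holds for $\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b)$, provided that $\lambda > 0$.
+    1. Show that for $\lambda \to \infty$ we have $\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b) \to \mathrm{max}(a, b)$.
+    1. What does the soft-min look like?
+    1. Extend this to more than two numbers.
+1. The function $g(\mathbf{x}) \stackrel{\mathrm{def}}{=} \log \sum_i \exp x_i$ is sometimes also referred to as the [log-partition function](https://en.wikipedia.org/wiki/Partition_function_(mathematics)).
+    1. Prove that the function is convex. Hint: to do so, use the fact that the first derivative amounts to the probabilities from the softmax function and show that the second derivative is the variance.
+    1. Show that shifting every coordinate by a constant $b$ shifts $g$ by the same amount, i.e., $g(\mathbf{x} + b) = g(\mathbf{x}) + b$.
+    1. What happens if some of the coordinates $x_i$ are very large? What happens if they are all very small?
+    1. Show that if we choose $b = \mathrm{max}_i x_i$ we end up with a numerically stable implementation.
+1. Assume that we have some probability distribution $P$. Suppose we pick another distribution $Q$ with $Q(i) \propto P(i)^\alpha$ for $\alpha > 0$.
+    1. Which choice of $\alpha$ corresponds to doubling the temperature? Which choice corresponds to halving it?
+    1. What happens if we let the temperature converge to $0$?
+    1. What happens if we let the temperature converge to $\infty$?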
+[Discussions](https://discuss.d2l.ai/t/46)
diff --git a/chapter_linear-classification/softmax-regression_origin.md b/chapter_linear-classification/softmax-regression_origin.md
new file mode 100644
index 0000000..47c03ef
--- /dev/null
+++ b/chapter_linear-classification/softmax-regression_origin.md
@@ -0,0 +1,549 @@
+# Softmax Regression
+:label:`sec_softmax`
+
+In :numref:`sec_linear_regression`, we introduced linear regression,
+working through implementations from scratch in :numref:`sec_linear_scratch`
+and again using high-level APIs of a deep learning framework
+in :numref:`sec_linear_concise` to do the heavy lifting.
+
+Regression is the hammer we reach for when
+we want to answer *how much?* or *how many?* questions.
+If you want to predict the number of dollars (price)
+at which a house will be sold,
+or the number of wins a baseball team might have,
+or the number of days that a patient
+will remain hospitalized before being discharged,
+then you are probably looking for a regression model.
+However, even within regression models,
+there are important distinctions.
+For instance, the price of a house
+will never be negative and changes might often be *relative* to its baseline price.
+As such, it might be more effective to regress
+on the logarithm of the price.
+Likewise, the number of days a patient spends in hospital
+is a *discrete nonnegative* random variable.
+As such, least mean squares might not be an ideal approach either.
+This sort of time-to-event modeling
+comes with a host of other complications that are dealt with
+in a specialized subfield called *survival modeling*.
+
+The point here is not to overwhelm you but just
+to let you know that there is a lot more to estimation
+than simply minimizing squared errors.
+And more broadly, there's a lot more to supervised learning than regression.
+In this section, we focus on *classification* problems
+where we put aside *how much?* questions
+and instead focus on *which category?* questions.
+
+
+
+* Does this email belong in the spam folder or the inbox?
+* Is this customer more likely to sign up
+  or not to sign up for a subscription service?
+* Does this image depict a donkey, a dog, a cat, or a rooster?
+* Which movie is Aston most likely to watch next?
+* Which section of the book are you going to read next?
+
+Colloquially, machine learning practitioners
+overload the word *classification*
+to describe two subtly different problems:
+(i) those where we are interested only in
+hard assignments of examples to categories (classes);
+and (ii) those where we wish to make soft assignments,
+i.e., to assess the probability that each category applies.
+The distinction tends to get blurred, in part,
+because often, even when we only care about hard assignments,
+we still use models that make soft assignments.
+
+Even more, there are cases where more than one label might be true.
+For instance, a news article might simultaneously cover
+the topics of entertainment, business, and space flight,
+but not the topics of medicine or sports.
+Thus, categorizing it into one of the above categories
+on its own would not be very useful.
+This problem is commonly known as [multi-label classification](https://en.wikipedia.org/wiki/Multi-label_classification).
+See :citet:`Tsoumakas.Katakis.2007` for an overview
+and :citet:`Huang.Xu.Yu.2015`
+for an effective algorithm when tagging images.
+
+## Classification
+:label:`subsec_classification-problem`
+
+To get our feet wet, let's start with
+a simple image classification problem.
+Here, each input consists of a $2\times2$ grayscale image.
+We can represent each pixel value with a single scalar,
+giving us four features $x_1, x_2, x_3, x_4$.
+Further, let's assume that each image belongs to one
+among the categories "cat", "chicken", and "dog".
+
+Next, we have to choose how to represent the labels.
+We have two obvious choices.
+Perhaps the most natural impulse would be
+to choose $y \in \{1, 2, 3\}$,
+where the integers represent
+$\{\text{dog}, \text{cat}, \text{chicken}\}$ respectively.
+This is a great way of *storing* such information on a computer.
+If the categories had some natural ordering among them,
+say if we were trying to predict
+$\{\text{baby}, \text{toddler}, \text{adolescent}, \text{young adult}, \text{adult}, \text{geriatric}\}$,
+then it might even make sense to cast this as
+an [ordinal regression](https://en.wikipedia.org/wiki/Ordinal_regression) problem
+and keep the labels in this format.
+See :citet:`Moon.Smola.Chang.ea.2010` for an overview
+of different types of ranking loss functions
+and :citet:`Beutel.Murray.Faloutsos.ea.2014` for a Bayesian approach
+that addresses responses with more than one mode.
+
+In general, classification problems do not come
+with natural orderings among the classes.
+Fortunately, statisticians long ago invented a simple way
+to represent categorical data: the *one-hot encoding*.
+A one-hot encoding is a vector
+with as many components as we have categories.
+The component corresponding to a particular instance's category is set to 1
+and all other components are set to 0.
+In our case, a label $y$ would be a three-dimensional vector,
+with $(1, 0, 0)$ corresponding to "cat", $(0, 1, 0)$ to "chicken",
+and $(0, 0, 1)$ to "dog":
+
+$$y \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}.$$
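
The encoding above is easy to sketch in plain Python. The helper name `one_hot` and the constant `CLASSES` are hypothetical; the only assumption taken from the text is the class order "cat", "chicken", "dog".

```python
CLASSES = ["cat", "chicken", "dog"]  # order fixes which component is set to 1

def one_hot(label, classes=CLASSES):
    """Return a tuple with a 1 at the position of `label` and 0 elsewhere."""
    if label not in classes:
        raise ValueError(f"unknown class: {label!r}")
    return tuple(1 if c == label else 0 for c in classes)

encoded = [one_hot(name) for name in CLASSES]

# Decoding picks the class whose component is largest.
decoded = [CLASSES[max(range(len(v)), key=lambda i: v[i])] for v in encoded]
```

Unlike the integer labels $\{1, 2, 3\}$, these vectors do not suggest any ordering or distance between the categories.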
+
+### Linear Model
+
+In order to estimate the conditional probabilities
+associated with all the possible classes,
+we need a model with multiple outputs, one per class.
+To address classification with linear models,
+we will need as many affine functions as we have outputs.
+Strictly speaking, we only need one fewer,
+since the last category has to be the difference
+between $1$ and the sum of the other categories
+but for reasons of symmetry
+we use a slightly redundant parametrization.
+Each output corresponds to its own affine function.
+In our case, since we have 4 features and 3 possible output categories,
+we need 12 scalars to represent the weights ($w$ with subscripts),
+and 3 scalars to represent the biases ($b$ with subscripts). This yields:
+
+$$
+\begin{aligned}
+o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1,\\
+o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2,\\
+o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3.
+\end{aligned}
+$$
+
+The corresponding neural network diagram
+is shown in :numref:`fig_softmaxreg`.
+Just as in linear regression,
+we use a single-layer neural network.
+And since the calculation of each output, $o_1, o_2$, and $o_3$,
+depends on all inputs, $x_1$, $x_2$, $x_3$, and $x_4$,
+the output layer can also be described as a *fully connected layer*.
+
+![Softmax regression is a single-layer neural network.](../img/softmaxreg.svg)
+:label:`fig_softmaxreg`
+
+For a more concise notation we use vectors and matrices:
+$\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$ is
+much better suited for mathematics and code.
+Note that we have gathered all of our weights into a $3 \times 4$ matrix $\mathbf{W}$
+and all biases into a vector $\mathbf{b} \in \mathbb{R}^3$.
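
To make the matrix form concrete, here is a small framework-free sketch of $\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$ with 4 features and 3 outputs. The helper name `affine` and all numbers are made up for illustration; in practice a deep learning framework would perform this with tensors.

```python
def affine(W, x, b):
    """Compute o = W x + b, where W is a list of q rows with d entries each."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

# Made-up parameters: 3 outputs (rows) and 4 features (columns).
W = [[0.1, 0.2, 0.3, 0.4],
     [0.0, -0.1, 0.5, 0.2],
     [1.0, 0.0, 0.0, -1.0]]
b = [0.5, 0.0, -0.5]
x = [1.0, 2.0, 3.0, 4.0]

o = affine(W, x, b)  # one logit per class
print(o)
```

Each row of `W` holds the weights of one affine function, which matches the three scalar equations written out above.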
+
+### The Softmax
+:label:`subsec_softmax_operation`
+
+Assuming a suitable loss function,
+we could try, directly, to minimize the difference
+between $\mathbf{o}$ and the labels $\mathbf{y}$.
+While it turns out that treating classification
+as a vector-valued regression problem works surprisingly well,
+it is nonetheless lacking in the following ways:
+
+* There is no guarantee that the outputs $o_i$ sum up to $1$ in the way we expect probabilities to behave.
+* There is no guarantee that the outputs $o_i$ are even nonnegative, even if they sum up to $1$, or that they do not exceed $1$.
+
+Both aspects render the estimation problem difficult to solve
+and the solution very brittle to outliers.
+For instance, if we assume that there
+is a positive linear dependency
+between the number of bedrooms and the likelihood
+that someone will buy a house,
+the probability might exceed $1$
+when it comes to buying a mansion!
+As such, we need a mechanism to "squish" the outputs.
+
+There are many ways we might accomplish this goal.
+For instance, we could assume that the outputs
+$\mathbf{o}$ are corrupted versions of $\mathbf{y}$,
+where the corruption occurs by means of adding noise $\mathbf{\epsilon}$
+drawn from a normal distribution.
+In other words, $\mathbf{y} = \mathbf{o} + \mathbf{\epsilon}$,
+where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.
+This is the so-called [probit model](https://en.wikipedia.org/wiki/Probit_model),
+first introduced by :citet:`Fechner.1860`.
+While appealing, it does not work quite as well,
+nor does it lead to a particularly nice optimization problem,
+when compared to the softmax.
+
+Another way to accomplish this goal
+(and to ensure nonnegativity) is to use
+an exponential function $P(y = i) \propto \exp o_i$.
+This does indeed satisfy the requirements
+that the conditional class probability
+increases monotonically with $o_i$
+and that all probabilities are nonnegative.
+We can then transform these values so that they add up to $1$
+by dividing each by their sum.
+This process is called *normalization*.
+Putting these two pieces together
+gives us the *softmax* function:
+
+$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \quad \text{where}\quad \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}.$$
+:eqlabel:`eq_softmax_y_and_o`
+
+Note that the largest coordinate of $\mathbf{o}$
+corresponds to the most likely class according to $\hat{\mathbf{y}}$.
+Moreover, because the softmax operation
+preserves the ordering among its arguments,
+we do not need to compute the softmax
+to determine which class has been assigned the highest probability.
+
+$$
+\operatorname*{argmax}_j \hat y_j = \operatorname*{argmax}_j o_j.
+$$
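
The following sketch implements the softmax exactly as in :eqref:`eq_softmax_y_and_o` and checks that the ordering of the logits is preserved. The helper names `softmax` and `argmax` and the example logits are illustrative; this naive version is meant for small inputs only and does not guard against overflow.

```python
import math

def softmax(o):
    """Exponentiate each logit and normalize by the sum of exponentials."""
    exps = [math.exp(v) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

o = [2.0, -1.0, 0.5]  # made-up logits
y_hat = softmax(o)

print(y_hat)                     # nonnegative entries that sum to 1
print(argmax(o), argmax(y_hat))  # both indices agree
```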
+
+
+The idea of a softmax dates back to Gibbs,
+who adapted ideas from physics :cite:`Gibbs.1902`.
+Dating even further back, Boltzmann,
+the father of modern thermodynamics,
+used this trick to model a distribution
+over energy states in gas molecules.
+In particular, he discovered that the prevalence
+of a state of energy in a thermodynamic ensemble,
+such as the molecules in a gas,
+is proportional to $\exp(-E/kT)$.
+Here, $E$ is the energy of a state,
+$T$ is the temperature, and $k$ is the Boltzmann constant.
+When statisticians talk about increasing or decreasing
+the "temperature" of a statistical system,
+they refer to changing $T$
+in order to favor lower or higher energy states.
+Following Gibbs' idea, energy equates to error.
+Energy-based models :cite:`Ranzato.Boureau.Chopra.ea.2007`
+use this point of view when describing
+problems in deep learning.
+
+### Vectorization
+:label:`subsec_softmax_vectorization`
+
+To improve computational efficiency,
+we vectorize calculations in minibatches of data.
+Assume that we are given a minibatch $\mathbf{X} \in \mathbb{R}^{n \times d}$
+of $n$ examples with dimensionality (number of inputs) $d$.
+Moreover, assume that we have $q$ categories in the output.
+Then the weights satisfy $\mathbf{W} \in \mathbb{R}^{d \times q}$
+and the bias satisfies $\mathbf{b} \in \mathbb{R}^{1\times q}$.
+
+$$ \begin{aligned} \mathbf{O} &= \mathbf{X} \mathbf{W} + \mathbf{b}, \\ \hat{\mathbf{Y}} & = \mathrm{softmax}(\mathbf{O}). \end{aligned} $$
+:eqlabel:`eq_minibatch_softmax_reg`
+
+This reduces the dominant operation to
+a matrix-matrix product $\mathbf{X} \mathbf{W}$.
+Moreover, since each row in $\mathbf{X}$ represents a data example,
+the softmax operation itself can be computed *rowwise*:
+for each row of $\mathbf{O}$, exponentiate all entries
+and then normalize them by the sum.
+Note, though, that care must be taken
+to avoid exponentiating and taking logarithms of large numbers,
+since this can cause numerical overflow or underflow.
+Deep learning frameworks take care of this automatically.
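
One common way to avoid overflow, shown here only as a sketch and not as the method any particular framework uses, is to subtract the largest entry of each row before exponentiating. This leaves the result unchanged because the common factor cancels in the ratio. The function name `softmax_rows` and the inputs are illustrative.

```python
import math

def softmax_rows(O):
    """Apply a numerically stable softmax to each row of a list of rows."""
    result = []
    for row in O:
        m = max(row)                           # largest logit in this row
        exps = [math.exp(v - m) for v in row]  # all exponents are <= 0
        total = sum(exps)
        result.append([e / total for e in exps])
    return result

# The second row is the first row shifted by 999, so both rows
# describe the same distribution. Calling math.exp(1000.0) directly
# would raise OverflowError.
O = [[1.0, 2.0, 3.0],
     [1000.0, 1001.0, 1002.0]]
Y_hat = softmax_rows(O)
print(Y_hat)
```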
+
+## Loss Function
+:label:`subsec_softmax-regression-loss-func`
+
+Now that we have a mapping from features $\mathbf{x}$
+to probabilities $\mathbf{\hat{y}}$,
+we need a way to optimize the accuracy of this mapping.
+We will rely on maximum likelihood estimation,
+the very same concept that we encountered
+when providing a probabilistic justification
+for the mean squared error loss in
+:numref:`subsec_normal_distribution_and_squared_loss`.
+
+### Log-Likelihood
+
+The softmax function gives us a vector $\hat{\mathbf{y}}$,
+which we can interpret as (estimated) conditional probabilities
+of each class, given any input $\mathbf{x}$,
+such as $\hat{y}_1 = P(y=\text{cat} \mid \mathbf{x})$.
+In the following we assume that for a dataset
+with features $\mathbf{X}$ the labels $\mathbf{Y}$
+are represented using one-hot encoded label vectors.
+We can compare the estimates with reality
+by checking how probable the actual classes are
+according to our model, given the features:
+
+$$
+P(\mathbf{Y} \mid \mathbf{X}) = \prod_{i=1}^n P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}).
+$$
+
+We are allowed to use the factorization
+since we assume that each label is drawn independently
+from its respective distribution $P(\mathbf{y}\mid\mathbf{x}^{(i)})$.
+Since maximizing the product of terms is awkward,
+we take the negative logarithm to obtain the equivalent problem
+of minimizing the negative log-likelihood:
+
+$$
+-\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^n -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})
+= \sum_{i=1}^n l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}),
+$$
+
+where for any pair of label $\mathbf{y}$
+and model prediction $\hat{\mathbf{y}}$
+over $q$ classes, the loss function $l$ is
+
+$$ l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^q y_j \log \hat{y}_j. $$
+:eqlabel:`eq_l_cross_entropy`
+
+For reasons explained later on,
+the loss function in :eqref:`eq_l_cross_entropy`
+is commonly called the *cross-entropy loss*.
+Since $\mathbf{y}$ is a one-hot vector of length $q$,
+the sum over all its coordinates $j$ vanishes for all but one term.
+Note that the loss $l(\mathbf{y}, \hat{\mathbf{y}})$
+is bounded from below by $0$
+whenever $\hat{\mathbf{y}}$ is a probability vector:
+no single entry is larger than $1$,
+hence its negative logarithm cannot be lower than $0$;
+$l(\mathbf{y}, \hat{\mathbf{y}}) = 0$ only if we predict
+the actual label with *certainty*.
+This can never happen for any finite setting of the weights
+because taking a softmax output towards $1$
+requires taking the corresponding input $o_i$ to infinity
+(or all other outputs $o_j$ for $j \neq i$ to negative infinity).
+Even if our model could assign an output probability of $0$,
+any error made when assigning such high confidence
+would incur infinite loss ($-\log 0 = \infty$).
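
As a small illustration of :eqref:`eq_l_cross_entropy`, the sketch below evaluates the loss for a one-hot label against two made-up predictions. With a one-hot label, only the term of the true class survives, so the loss is the negative log of the probability assigned to that class. The function name and the numbers are illustrative.

```python
import math

def cross_entropy(y, y_hat):
    """l(y, y_hat) = -sum_j y_j * log(y_hat_j); terms with y_j = 0 are skipped."""
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat) if y_j > 0)

y = [0, 1, 0]                   # true class is the second one
confident = [0.05, 0.90, 0.05]  # made-up prediction, mostly correct
unsure = [0.30, 0.40, 0.30]     # made-up prediction, hedging

print(cross_entropy(y, confident))  # equals -log(0.9)
print(cross_entropy(y, unsure))     # equals -log(0.4), a larger loss
```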
+
+
+### Softmax and Cross-Entropy Loss
+:label:`subsec_softmax_and_derivatives`
+
+Since the softmax function
+and the corresponding cross-entropy loss are so common,
+it is worth understanding a bit better how they are computed.
+Plugging :eqref:`eq_softmax_y_and_o` into the definition of the loss
+in :eqref:`eq_l_cross_entropy`
+and using the definition of the softmax we obtain:
+
+$$
+\begin{aligned}
+l(\mathbf{y}, \hat{\mathbf{y}}) &=  - \sum_{j=1}^q y_j \log \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} \\
+&= \sum_{j=1}^q y_j \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j \\
+&= \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j.
+\end{aligned}
+$$
+
+To understand a bit better what is going on,
+consider the derivative with respect to any logit $o_j$. We get
+
+$$
+\partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j.
+$$
+
+In other words, the derivative is the difference
+between the probability assigned by our model,
+as expressed by the softmax operation,
+and what actually happened, as expressed
+by elements in the one-hot label vector.
+In this sense, it is very similar
+to what we saw in regression,
+where the gradient was the difference
+between the observation $y$ and estimate $\hat{y}$.
+This is no coincidence.
+In any exponential family model,
+the gradients of the log-likelihood are given by precisely this term.
+This fact makes computing gradients easy in practice.
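
The gradient formula can be checked numerically. The sketch below compares $\mathrm{softmax}(\mathbf{o})_j - y_j$ with a central finite-difference estimate of the loss written as log-sum-exp minus $\sum_j y_j o_j$. All helper names, the step size `eps`, and the logits are assumptions chosen for illustration.

```python
import math

def loss(o, y):
    """Cross-entropy of softmax(o) against y: log-sum-exp(o) - sum_j y_j o_j.
    Assumes the entries of y sum to 1."""
    m = max(o)
    lse = m + math.log(sum(math.exp(v - m) for v in o))
    return lse - sum(y_j * o_j for y_j, o_j in zip(y, o))

def softmax(o):
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def numerical_grad(o, y, eps=1e-6):
    """Central finite differences with respect to each logit."""
    grads = []
    for j in range(len(o)):
        up = o[:j] + [o[j] + eps] + o[j + 1:]
        down = o[:j] + [o[j] - eps] + o[j + 1:]
        grads.append((loss(up, y) - loss(down, y)) / (2 * eps))
    return grads

o = [0.3, -1.2, 2.0]  # made-up logits
y = [0, 0, 1]         # one-hot label

analytic = [p - t for p, t in zip(softmax(o), y)]
numeric = numerical_grad(o, y)
print(analytic)
print(numeric)
```

The two printed vectors should agree up to the finite-difference error, and the analytic gradient sums to zero because both the probabilities and the label sum to one.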
+
+Now consider the case where we observe not just a single outcome
+but an entire distribution over outcomes.
+We can use the same representation as before for the label $\mathbf{y}$.
+The only difference is that rather
+than a vector containing only binary entries,
+say $(0, 0, 1)$, we now have a generic probability vector,
+say $(0.1, 0.2, 0.7)$.
+The math that we used previously to define the loss $l$
+in :eqref:`eq_l_cross_entropy`
+still works out fine,
+just that the interpretation is slightly more general.
+It is the expected value of the loss for a distribution over labels.
+This loss is called the *cross-entropy loss* and it is
+one of the most commonly used losses for classification problems.
+We can demystify the name by introducing just the basics of information theory.
+In a nutshell, it measures the number of bits needed to encode what we see, $\mathbf{y}$,
+relative to what we predict should happen, $\hat{\mathbf{y}}$.
+We provide a very basic explanation in the following. For further
+details on information theory see
+:cite:`Cover.Thomas.1999` or :cite:`mackay2003information`.
+
+
+
+## Information Theory Basics
+:label:`subsec_info_theory_basics`
+
+Many deep learning papers use intuition and terms from information theory.
+To make sense of them, we need some common language.
+This is a survival guide.
+*Information theory* deals with the problem
+of encoding, decoding, transmitting,
+and manipulating information (also known as data).
+
+### Entropy
+
+The central idea in information theory is to quantify the
+amount of information contained in data.
+This places a limit on our ability to compress data.
+For a distribution $P$ its *entropy* is defined as:
+
+$$H[P] = \sum_j - P(j) \log P(j).$$
+:eqlabel:`eq_softmax_reg_entropy`
+
+One of the fundamental theorems of information theory states
+that in order to encode data drawn randomly from the distribution $P$,
+we need at least $H[P]$ "nats" to encode it :cite:`Shannon.1948`.
+If you wonder what a "nat" is, it is the equivalent of a bit
+but when using a code with base $e$ rather than one with base 2.
+Thus, one nat is $\frac{1}{\log(2)} \approx 1.44$ bits.
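
A short sketch of :eqref:`eq_softmax_reg_entropy`, including the conversion from nats to bits by dividing by $\log 2$. The function name `entropy` and the example distribution are illustrative; terms with zero probability are skipped, following the usual convention that $0 \log 0 = 0$.

```python
import math

def entropy(p):
    """H[P] = -sum_j P(j) * log P(j), measured in nats."""
    return -sum(p_j * math.log(p_j) for p_j in p if p_j > 0)

fair_coin = [0.5, 0.5]
h_nats = entropy(fair_coin)    # log(2) nats
h_bits = h_nats / math.log(2)  # 1 bit

print(h_nats, h_bits)
```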
+
+
+### Surprisal
+
+You might be wondering what compression has to do with prediction.
+Imagine that we have a stream of data that we want to compress.
+If it is always easy for us to predict the next token,
+then this data is easy to compress.
+Take the extreme example where every token in the stream
+always takes the same value.
+That is a very boring data stream!
+And not only is it boring, but it is also easy to predict.
+Because the tokens are always the same,
+we do not have to transmit any information
+to communicate the contents of the stream.
+Easy to predict, easy to compress.
+
+However, if we cannot perfectly predict every event,
+then we might sometimes be surprised.
+Our surprise is greater when we assigned an event lower probability.
+Claude Shannon settled on $\log \frac{1}{P(j)} = -\log P(j)$
+to quantify one's *surprisal* at observing an event $j$
+having assigned it a (subjective) probability $P(j)$.
+The entropy defined in :eqref:`eq_softmax_reg_entropy`
+is then the *expected surprisal*
+when one has assigned the correct probabilities
+that truly match the data-generating process.
+
+
+### Cross-Entropy Revisited
+
+So if entropy is the level of surprise experienced
+by someone who knows the true probability,
+then you might be wondering, what is cross-entropy?
+The cross-entropy *from* $P$ *to* $Q$, denoted $H(P, Q)$,
+is the expected surprisal of an observer with subjective probabilities $Q$
+upon seeing data that was actually generated according to probabilities $P$.
+This is given by $H(P, Q) \stackrel{\mathrm{def}}{=} \sum_j - P(j) \log Q(j)$.
+The lowest possible cross-entropy is achieved when $P=Q$.
+In this case, the cross-entropy from $P$ to $Q$ is $H(P, P)= H(P)$.
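
To illustrate that the cross-entropy is smallest when the observer's probabilities match the true ones, the sketch below compares $H(P, P)$ with $H(P, Q)$ for a uniform $Q$. The function name and both distributions are made up for illustration.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum_j P(j) * log Q(j), measured in nats."""
    return -sum(p_j * math.log(q_j) for p_j, q_j in zip(p, q) if p_j > 0)

p = [0.1, 0.2, 0.7]  # data-generating distribution (made up)
q = [1/3, 1/3, 1/3]  # observer who believes all outcomes are equally likely

h_pp = cross_entropy(p, p)  # the entropy H(P)
h_pq = cross_entropy(p, q)  # log(3), since Q is uniform over three outcomes

print(h_pp, h_pq)
```

The gap `h_pq - h_pp` is the extra surprisal the observer pays for holding the wrong beliefs.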
+
+In short, we can think of the cross-entropy classification objective
+in two ways: (i) as maximizing the likelihood of the observed data;
+and (ii) as minimizing our surprisal (and thus the number of bits
+required to communicate the labels).
+
+## Summary and Discussion
+
+In this section, we encountered the first nontrivial loss function,
+allowing us to optimize over *discrete* output spaces.
+Key in its design was that we took a probabilistic approach,
+treating discrete categories as instances of draws from a probability distribution.
+As a side effect, we encountered the softmax,
+a convenient activation function that transforms
+outputs of an ordinary neural network layer
+into valid discrete probability distributions.
+We saw that the derivative of the cross-entropy loss
+when combined with softmax
+behaves very similarly
+to the derivative of squared error,
+namely by taking the difference between
+the expected behavior and its prediction.
+And, while we were only able to
+scratch the very surface of it,
+we encountered exciting connections
+to statistical physics and information theory.
+
+While this is enough to get you on your way,
+and hopefully enough to whet your appetite,
+we have hardly gone into depth here.
+Among other things, we skipped over computational considerations.
+Specifically, for any fully connected layer with $d$ inputs and $q$ outputs,
+the parameterization and computational cost is $\mathcal{O}(dq)$,
+which can be prohibitively high in practice.
+Fortunately, this cost of transforming $d$ inputs into $q$ outputs
+can be reduced through approximation and compression.
+For instance, Deep Fried Convnets :cite:`Yang.Moczulski.Denil.ea.2015`
+uses a combination of permutations,
+Fourier transforms, and scaling
+to reduce the cost from quadratic to log-linear.
+Similar techniques work for more advanced
+structural matrix approximations :cite:`sindhwani2015structured`.
+Lastly, we can use Quaternion-like decompositions
+to reduce the cost to $\mathcal{O}(\frac{dq}{n})$,
+again if we are willing to trade off a small amount of accuracy
+for computational and storage cost :cite:`Zhang.Tay.Zhang.ea.2021`
+based on a compression factor $n$.
+This is an active area of research.
+What makes it challenging is that
+we do not necessarily strive
+for the most compact representation
+or the smallest number of floating point operations
+but rather for the solution
+that can be executed most efficiently on modern GPUs.
+
+## Exercises
+
+1. We can explore the connection between exponential families and the softmax in some more depth.
+    1. Compute the second derivative of the cross-entropy loss $l(\mathbf{y},\hat{\mathbf{y}})$ for the softmax.
+    1. Compute the variance of the distribution given by $\mathrm{softmax}(\mathbf{o})$ and show that it matches the second derivative computed above.
+1. Assume that we have three classes which occur with equal probability, i.e., the probability vector is $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$.
+    1. What is the problem if we try to design a binary code for it?
+    1. Can you design a better code? Hint: what happens if we try to encode two independent observations? What if we encode $n$ observations jointly?
+1. When encoding signals transmitted over a physical wire, engineers don't always use binary codes. For instance, [PAM-3](https://en.wikipedia.org/wiki/Ternary_signal) uses three signal levels $\{-1, 0, 1\}$ as opposed to two levels $\{0, 1\}$. How many ternary units do you need to transmit an integer in the range $\{0, \ldots, 7\}$? Why might this be a better idea in terms of electronics?
+1. The [Bradley-Terry model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) uses
+a logistic model to capture preferences. For a user to choose between apples and oranges one
+assumes scores $o_{\mathrm{apple}}$ and $o_{\mathrm{orange}}$. Our requirements are that larger scores should lead to a higher likelihood in choosing the associated item and that
+the item with the largest score is the most likely one to be chosen :cite:`Bradley.Terry.1952`.
+    1. Prove that the softmax satisfies this requirement.
+    1. What happens if you want to allow for a default option of choosing neither apples nor oranges? Hint: now the user has 3 choices.
+1. Softmax derives its name from the following mapping: $\mathrm{RealSoftMax}(a, b) = \log (\exp(a) + \exp(b))$.
+    1. Prove that $\mathrm{RealSoftMax}(a, b) > \mathrm{max}(a, b)$.
+    1. How small can you make the difference between both functions? Hint: without loss of
+    generality you can set $b = 0$ and $a \geq b$.
+    1. Prove that this holds for $\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b)$, provided that $\lambda > 0$.
+    1. Show that for $\lambda \to \infty$ we have $\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b) \to \mathrm{max}(a, b)$.
+    1. What does the soft-min look like?
+    1. Extend this to more than two numbers.
+1. The function $g(\mathbf{x}) \stackrel{\mathrm{def}}{=} \log \sum_i \exp x_i$ is sometimes also referred to as the [log-partition function](https://en.wikipedia.org/wiki/Partition_function_(mathematics)).
+    1. Prove that the function is convex. Hint: to do so, use the fact that the first derivative amounts to the probabilities from the softmax function and show that the second derivative is the variance.
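+    1. Show that $g$ is translation equivariant, i.e., $g(\mathbf{x} + b) = g(\mathbf{x}) + b$ for any scalar $b$ added to every coordinate (equivalently, $g(\mathbf{x}) = b + g(\mathbf{x} - b)$).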
+    1. What happens if some of the coordinates $x_i$ are very large? What happens if they're all very small?
+    1. Show that if we choose $b = \mathrm{max}_i x_i$ we end up with a numerically stable implementation.
+1. Assume that we have some probability distribution $P$. Suppose we pick another distribution $Q$ with $Q(i) \propto P(i)^\alpha$ for $\alpha > 0$.
+    1. Which choice of $\alpha$ corresponds to doubling the temperature? Which choice corresponds to halving it?
+    1. What happens if we let the temperature converge to $0$?
+    1. What happens if we let the temperature converge to $\infty$?
+
+[Discussions](https://discuss.d2l.ai/t/46)
diff --git a/chapter_linear-networks/image-classification-dataset.md b/chapter_linear-networks/image-classification-dataset.md
deleted file mode 100644
index a20b4f0..0000000
--- a/chapter_linear-networks/image-classification-dataset.md
+++ /dev/null
@@ -1,295 +0,0 @@
-# 画像分類データセット
-:label:`sec_fashion_mnist`
-
-(~~MNIST データセットは画像分類に広く使われているデータセットの一つですが、ベンチマークデータセットとしてはシンプルすぎます。似ているがもっと複雑な Fashion-MNIST データセットを使います~~) 
-
-画像分類に広く使用されているデータセットの 1 つに、MNIST データセット :cite:`LeCun.Bottou.Bengio.ea.1998` があります。ベンチマークデータセットとしては好調でしたが、今日の標準では単純なモデルでも 95% を超える分類精度が得られ、強いモデルと弱いモデルの区別には不向きです。現在、MNISTはベンチマークというよりもサニティチェックの役割を果たしています。少し前置きにするために、2017年にリリースされた、質的に類似しているが比較的複雑なファッションMNISTデータセット:cite:`Xiao.Rasul.Vollgraf.2017`に関する次のセクションで議論を集中します。
-
-```{.python .input}
-%matplotlib inline
-from d2l import mxnet as d2l
-from mxnet import gluon
-import sys
-
-d2l.use_svg_display()
-```
-
-```{.python .input}
-#@tab pytorch
-%matplotlib inline
-from d2l import torch as d2l
-import torch
-import torchvision
-from torchvision import transforms
-from torch.utils import data
-
-d2l.use_svg_display()
-```
-
-```{.python .input}
-#@tab tensorflow
-%matplotlib inline
-from d2l import tensorflow as d2l
-import tensorflow as tf
-
-d2l.use_svg_display()
-```
-
-## データセットの読み取り
-
-[**フレームワークの組み込み関数を使用して、Fashion-MNIST データセットをダウンロードしてメモリに読み込む**]
-
-```{.python .input}
-mnist_train = gluon.data.vision.FashionMNIST(train=True)
-mnist_test = gluon.data.vision.FashionMNIST(train=False)
-```
-
-```{.python .input}
-#@tab pytorch
-# `ToTensor` converts the image data from PIL type to 32-bit floating point
-# tensors. It divides all numbers by 255 so that all pixel values are between
-# 0 and 1
-trans = transforms.ToTensor()
-mnist_train = torchvision.datasets.FashionMNIST(
-    root="../data", train=True, transform=trans, download=True)
-mnist_test = torchvision.datasets.FashionMNIST(
-    root="../data", train=False, transform=trans, download=True)
-```
-
-```{.python .input}
-#@tab tensorflow
-mnist_train, mnist_test = tf.keras.datasets.fashion_mnist.load_data()
-```
-
-Fashion-MNIST は 10 個のカテゴリの画像で構成され、それぞれがトレーニングデータセットでは 6000 個、テストデータセットでは 1000 個の画像で表されます。*test dataset* (または*test set*) は、トレーニングではなくモデルの性能を評価するために使用されます。したがって、トレーニングセットとテストセットにはそれぞれ 60000 と 10000 のイメージが含まれます。
-
-```{.python .input}
-#@tab mxnet, pytorch
-len(mnist_train), len(mnist_test)
-```
-
-```{.python .input}
-#@tab tensorflow
-len(mnist_train[0]), len(mnist_test[0])
-```
-
-各入力イメージの高さと幅はどちらも 28 ピクセルです。データセットは、チャンネル数が 1 のグレースケールイメージで構成されていることに注意してください。簡潔にするために、本書では、高さ $h$、幅 $w$ ピクセルの画像の形状を $h \times w$ または ($h$, $w$) として保存しています。
-
-```{.python .input}
-#@tab all
-mnist_train[0][0].shape
-```
-
-[~~データセットを可視化する2つのユーティリティ関数~~] 
-
-Fashion-mnistの画像は次のカテゴリに関連付けられています：Tシャツ、ズボン、プルオーバー、ドレス、コート、サンダル、シャツ、スニーカー、バッグ、アンクルブーツ。次の関数は、数値ラベルインデックスとテキスト内の名前との間で変換を行います。
-
-```{.python .input}
-#@tab all
-def get_fashion_mnist_labels(labels):  #@save
-    """Return text labels for the Fashion-MNIST dataset."""
-    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
-                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
-    return [text_labels[int(i)] for i in labels]
-```
-
-これで、これらの例を可視化する関数を作成できるようになりました。
-
-```{.python .input}
-#@tab mxnet, tensorflow
-def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):  #@save
-    """Plot a list of images."""
-    figsize = (num_cols * scale, num_rows * scale)
-    _, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize)
-    axes = axes.flatten()
-    for i, (ax, img) in enumerate(zip(axes, imgs)):
-        ax.imshow(d2l.numpy(img))
-        ax.axes.get_xaxis().set_visible(False)
-        ax.axes.get_yaxis().set_visible(False)
-        if titles:
-            ax.set_title(titles[i])
-    return axes
-```
-
-```{.python .input}
-#@tab pytorch
-def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):  #@save
-    """Plot a list of images."""
-    figsize = (num_cols * scale, num_rows * scale)
-    _, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize)
-    axes = axes.flatten()
-    for i, (ax, img) in enumerate(zip(axes, imgs)):
-        if torch.is_tensor(img):
-            # Tensor Image
-            ax.imshow(img.numpy())
-        else:
-            # PIL Image
-            ax.imshow(img)
-        ax.axes.get_xaxis().set_visible(False)
-        ax.axes.get_yaxis().set_visible(False)
-        if titles:
-            ax.set_title(titles[i])
-    return axes
-```
-
-トレーニングデータセットの最初のいくつかの例の [**画像とそれに対応するラベル**](本文) を以下に示します。
-
-```{.python .input}
-X, y = mnist_train[:18]
-
-print(X.shape)
-show_images(X.squeeze(axis=-1), 2, 9, titles=get_fashion_mnist_labels(y));
-```
-
-```{.python .input}
-#@tab pytorch
-X, y = next(iter(data.DataLoader(mnist_train, batch_size=18)))
-show_images(X.reshape(18, 28, 28), 2, 9, titles=get_fashion_mnist_labels(y));
-```
-
-```{.python .input}
-#@tab tensorflow
-X = tf.constant(mnist_train[0][:18])
-y = tf.constant(mnist_train[1][:18])
-show_images(X, 2, 9, titles=get_fashion_mnist_labels(y));
-```
-
-## ミニバッチの読み方
-
-トレーニングセットとテストセットから読みやすくなるように、ゼロからデータイテレーターを作成するのではなく、組み込みのデータイテレーターを使用します。繰り返しのたびに、データイテレータ [**サイズが `batch_size` のデータのミニバッチを毎回読み取ります。**] また、学習データイテレータの例をランダムにシャッフルします。
-
-```{.python .input}
-batch_size = 256
-
-def get_dataloader_workers():  #@save
-    """Use 4 processes to read the data except for Windows."""
-    return 0 if sys.platform.startswith('win') else 4
-
-# `ToTensor` converts the image data from uint8 to 32-bit floating point. It
-# divides all numbers by 255 so that all pixel values are between 0 and 1
-transformer = gluon.data.vision.transforms.ToTensor()
-train_iter = gluon.data.DataLoader(mnist_train.transform_first(transformer),
-                                   batch_size, shuffle=True,
-                                   num_workers=get_dataloader_workers())
-```
-
-```{.python .input}
-#@tab pytorch
-batch_size = 256
-
-def get_dataloader_workers():  #@save
-    """Use 4 processes to read the data."""
-    return 4
-
-train_iter = data.DataLoader(mnist_train, batch_size, shuffle=True,
-                             num_workers=get_dataloader_workers())
-```
-
-```{.python .input}
-#@tab tensorflow
-batch_size = 256
-train_iter = tf.data.Dataset.from_tensor_slices(
-    mnist_train).batch(batch_size).shuffle(len(mnist_train[0]))
-```
-
-トレーニングデータを読み取るのにかかる時間を見てみましょう。
-
-```{.python .input}
-#@tab all
-timer = d2l.Timer()
-for X, y in train_iter:
-    continue
-f'{timer.stop():.2f} sec'
-```
-
-## すべてのものをまとめる
-
-ここで、[**Fashion-MNIST データセットを取得して読み込む `load_data_fashion_mnist` 関数**] を定義します。この関数は、トレーニングセットと検証セットの両方のデータイテレータを返します。また、イメージを別のシェイプにサイズ変更するオプションの引数も使用できます。
-
-```{.python .input}
-def load_data_fashion_mnist(batch_size, resize=None):  #@save
-    """Download the Fashion-MNIST dataset and then load it into memory."""
-    dataset = gluon.data.vision
-    trans = [dataset.transforms.ToTensor()]
-    if resize:
-        trans.insert(0, dataset.transforms.Resize(resize))
-    trans = dataset.transforms.Compose(trans)
-    mnist_train = dataset.FashionMNIST(train=True).transform_first(trans)
-    mnist_test = dataset.FashionMNIST(train=False).transform_first(trans)
-    return (gluon.data.DataLoader(mnist_train, batch_size, shuffle=True,
-                                  num_workers=get_dataloader_workers()),
-            gluon.data.DataLoader(mnist_test, batch_size, shuffle=False,
-                                  num_workers=get_dataloader_workers()))
-```
-
-```{.python .input}
-#@tab pytorch
-def load_data_fashion_mnist(batch_size, resize=None):  #@save
-    """Download the Fashion-MNIST dataset and then load it into memory."""
-    trans = [transforms.ToTensor()]
-    if resize:
-        trans.insert(0, transforms.Resize(resize))
-    trans = transforms.Compose(trans)
-    mnist_train = torchvision.datasets.FashionMNIST(
-        root="../data", train=True, transform=trans, download=True)
-    mnist_test = torchvision.datasets.FashionMNIST(
-        root="../data", train=False, transform=trans, download=True)
-    return (data.DataLoader(mnist_train, batch_size, shuffle=True,
-                            num_workers=get_dataloader_workers()),
-            data.DataLoader(mnist_test, batch_size, shuffle=False,
-                            num_workers=get_dataloader_workers()))
-```
-
-```{.python .input}
-#@tab tensorflow
-def load_data_fashion_mnist(batch_size, resize=None):   #@save
-    """Download the Fashion-MNIST dataset and then load it into memory."""
-    mnist_train, mnist_test = tf.keras.datasets.fashion_mnist.load_data()
-    # Divide all numbers by 255 so that all pixel values are between
-    # 0 and 1, add a batch dimension at the last. And cast label to int32
-    process = lambda X, y: (tf.expand_dims(X, axis=3) / 255,
-                            tf.cast(y, dtype='int32'))
-    resize_fn = lambda X, y: (
-        tf.image.resize_with_pad(X, resize, resize) if resize else X, y)
-    return (
-        tf.data.Dataset.from_tensor_slices(process(*mnist_train)).batch(
-            batch_size).shuffle(len(mnist_train[0])).map(resize_fn),
-        tf.data.Dataset.from_tensor_slices(process(*mnist_test)).batch(
-            batch_size).map(resize_fn))
-```
-
-以下では、`resize` 引数を指定して `load_data_fashion_mnist` 関数のイメージサイズ変更機能をテストします。
-
-```{.python .input}
-#@tab all
-train_iter, test_iter = load_data_fashion_mnist(32, resize=64)
-for X, y in train_iter:
-    print(X.shape, X.dtype, y.shape, y.dtype)
-    break
-```
-
-これで、以降のセクションで Fashion-MNIST データセットを操作する準備が整いました。 
-
-## [概要
-
-* Fashion-MNIST は、10 種類のカテゴリを表す画像で構成されるアパレル分類データセットです。このデータセットを以降のセクションと章で使用して、さまざまな分類アルゴリズムを評価します。
-* 高さが$h$、幅が $w$ ピクセルのイメージのシェイプは、$h \times w$ または ($h$, $w$) として格納されます。
-* データイテレータは、パフォーマンスを効率化するための重要な要素です。トレーニングループの速度を低下させないように、ハイパフォーマンスコンピューティングを利用する、適切に実装されたデータイテレーターを利用してください。
-
-## 演習
-
-1. `batch_size` (たとえば 1) を減らすと、読み取りのパフォーマンスに影響しますか?
-1. データイテレータのパフォーマンスは重要です。現在の実装は十分速いと思いますか？それを改善するためのさまざまなオプションを探る。
-1. フレームワークのオンライン API ドキュメントを確認してください。他にどのようなデータセットがありますか？
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/48)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/49)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/224)
-:end_tab:
diff --git a/chapter_linear-networks/image-classification-dataset_origin.md b/chapter_linear-networks/image-classification-dataset_origin.md
deleted file mode 100644
index fcc1937..0000000
--- a/chapter_linear-networks/image-classification-dataset_origin.md
+++ /dev/null
@@ -1,321 +0,0 @@
-# The Image Classification Dataset
-:label:`sec_fashion_mnist`
-
-(~~The MNIST dataset is one of the widely used dataset for image classification, while it's too simple as a benchmark dataset. We will use the similar, but more complex Fashion-MNIST dataset~~)
-
-One of the widely used dataset for image classification is the  MNIST dataset :cite:`LeCun.Bottou.Bengio.ea.1998`.
-While it had a good run as a benchmark dataset,
-even simple models by today's standards achieve classification accuracy over 95%,
-making it unsuitable for distinguishing between stronger models and weaker ones.
-Today, MNIST serves as more of sanity checks than as a benchmark.
-To up the ante just a bit, we will focus our discussion in the coming sections
-on the qualitatively similar, but comparatively complex Fashion-MNIST
-dataset :cite:`Xiao.Rasul.Vollgraf.2017`, which was released in 2017.
-
-```{.python .input}
-%matplotlib inline
-from d2l import mxnet as d2l
-from mxnet import gluon
-import sys
-
-d2l.use_svg_display()
-```
-
-```{.python .input}
-#@tab pytorch
-%matplotlib inline
-from d2l import torch as d2l
-import torch
-import torchvision
-from torchvision import transforms
-from torch.utils import data
-
-d2l.use_svg_display()
-```
-
-```{.python .input}
-#@tab tensorflow
-%matplotlib inline
-from d2l import tensorflow as d2l
-import tensorflow as tf
-
-d2l.use_svg_display()
-```
-
-## Reading the Dataset
-
-We can [**download and read the Fashion-MNIST dataset into memory via the build-in functions in the framework.**]
-
-```{.python .input}
-mnist_train = gluon.data.vision.FashionMNIST(train=True)
-mnist_test = gluon.data.vision.FashionMNIST(train=False)
-```
-
-```{.python .input}
-#@tab pytorch
-# `ToTensor` converts the image data from PIL type to 32-bit floating point
-# tensors. It divides all numbers by 255 so that all pixel values are between
-# 0 and 1
-trans = transforms.ToTensor()
-mnist_train = torchvision.datasets.FashionMNIST(
-    root="../data", train=True, transform=trans, download=True)
-mnist_test = torchvision.datasets.FashionMNIST(
-    root="../data", train=False, transform=trans, download=True)
-```
-
-```{.python .input}
-#@tab tensorflow
-mnist_train, mnist_test = tf.keras.datasets.fashion_mnist.load_data()
-```
-
-Fashion-MNIST consists of images from 10 categories, each represented
-by 6000 images in the training dataset and by 1000 in the test dataset.
-A *test dataset* (or *test set*) is used for evaluating model performance and not for training.
-Consequently, the training set and the test set
-contain 60000 and 10000 images, respectively.
-
-```{.python .input}
-#@tab mxnet, pytorch
-len(mnist_train), len(mnist_test)
-```
-
-```{.python .input}
-#@tab tensorflow
-len(mnist_train[0]), len(mnist_test[0])
-```
-
-The height and width of each input image are both 28 pixels.
-Note that the dataset consists of grayscale images, whose number of channels is 1.
-For brevity, throughout this book
-we store the shape of any image with height $h$ and width $w$ pixels as $h \times w$ or ($h$, $w$).
-
-```{.python .input}
-#@tab all
-mnist_train[0][0].shape
-```
-
-[~~Two utility functions to visualize the dataset~~]
-
-The images in Fashion-MNIST are associated with the following categories:
-t-shirt, trousers, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot.
-The following function converts between numeric label indices and their names in text.
-
-```{.python .input}
-#@tab all
-def get_fashion_mnist_labels(labels):  #@save
-    """Return text labels for the Fashion-MNIST dataset."""
-    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
-                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
-    return [text_labels[int(i)] for i in labels]
-```
-
-We can now create a function to visualize these examples.
-
-```{.python .input}
-#@tab mxnet, tensorflow
-def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):  #@save
-    """Plot a list of images."""
-    figsize = (num_cols * scale, num_rows * scale)
-    _, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize)
-    axes = axes.flatten()
-    for i, (ax, img) in enumerate(zip(axes, imgs)):
-        ax.imshow(d2l.numpy(img))
-        ax.axes.get_xaxis().set_visible(False)
-        ax.axes.get_yaxis().set_visible(False)
-        if titles:
-            ax.set_title(titles[i])
-    return axes
-```
-
-```{.python .input}
-#@tab pytorch
-def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):  #@save
-    """Plot a list of images."""
-    figsize = (num_cols * scale, num_rows * scale)
-    _, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize)
-    axes = axes.flatten()
-    for i, (ax, img) in enumerate(zip(axes, imgs)):
-        if torch.is_tensor(img):
-            # Tensor Image
-            ax.imshow(img.numpy())
-        else:
-            # PIL Image
-            ax.imshow(img)
-        ax.axes.get_xaxis().set_visible(False)
-        ax.axes.get_yaxis().set_visible(False)
-        if titles:
-            ax.set_title(titles[i])
-    return axes
-```
-
-Here are [**the images and their corresponding labels**] (in text)
-for the first few examples in the training dataset.
-
-```{.python .input}
-X, y = mnist_train[:18]
-
-print(X.shape)
-show_images(X.squeeze(axis=-1), 2, 9, titles=get_fashion_mnist_labels(y));
-```
-
-```{.python .input}
-#@tab pytorch
-X, y = next(iter(data.DataLoader(mnist_train, batch_size=18)))
-show_images(X.reshape(18, 28, 28), 2, 9, titles=get_fashion_mnist_labels(y));
-```
-
-```{.python .input}
-#@tab tensorflow
-X = tf.constant(mnist_train[0][:18])
-y = tf.constant(mnist_train[1][:18])
-show_images(X, 2, 9, titles=get_fashion_mnist_labels(y));
-```
-
-## Reading a Minibatch
-
-To make our life easier when reading from the training and test sets,
-we use the built-in data iterator rather than creating one from scratch.
-Recall that at each iteration, a data iterator
-[**reads a minibatch of data with size `batch_size` each time.**]
-We also randomly shuffle the examples for the training data iterator.
-
-```{.python .input}
-batch_size = 256
-
-def get_dataloader_workers():  #@save
-    """Use 4 processes to read the data except for Windows."""
-    return 0 if sys.platform.startswith('win') else 4
-
-# `ToTensor` converts the image data from uint8 to 32-bit floating point. It
-# divides all numbers by 255 so that all pixel values are between 0 and 1
-transformer = gluon.data.vision.transforms.ToTensor()
-train_iter = gluon.data.DataLoader(mnist_train.transform_first(transformer),
-                                   batch_size, shuffle=True,
-                                   num_workers=get_dataloader_workers())
-```
-
-```{.python .input}
-#@tab pytorch
-batch_size = 256
-
-def get_dataloader_workers():  #@save
-    """Use 4 processes to read the data."""
-    return 4
-
-train_iter = data.DataLoader(mnist_train, batch_size, shuffle=True,
-                             num_workers=get_dataloader_workers())
-```
-
-```{.python .input}
-#@tab tensorflow
-batch_size = 256
-train_iter = tf.data.Dataset.from_tensor_slices(
-    mnist_train).shuffle(len(mnist_train[0])).batch(batch_size)
-```
-
-Let us look at the time it takes to read the training data.
-
-```{.python .input}
-#@tab all
-timer = d2l.Timer()
-for X, y in train_iter:
-    continue
-f'{timer.stop():.2f} sec'
-```
-
-## Putting All Things Together
-
-Now we define [**the `load_data_fashion_mnist` function
-that obtains and reads the Fashion-MNIST dataset.**]
-It returns the data iterators for both the training set and validation set.
-In addition, it accepts an optional argument to resize images to another shape.
-
-```{.python .input}
-def load_data_fashion_mnist(batch_size, resize=None):  #@save
-    """Download the Fashion-MNIST dataset and then load it into memory."""
-    dataset = gluon.data.vision
-    trans = [dataset.transforms.ToTensor()]
-    if resize:
-        trans.insert(0, dataset.transforms.Resize(resize))
-    trans = dataset.transforms.Compose(trans)
-    mnist_train = dataset.FashionMNIST(train=True).transform_first(trans)
-    mnist_test = dataset.FashionMNIST(train=False).transform_first(trans)
-    return (gluon.data.DataLoader(mnist_train, batch_size, shuffle=True,
-                                  num_workers=get_dataloader_workers()),
-            gluon.data.DataLoader(mnist_test, batch_size, shuffle=False,
-                                  num_workers=get_dataloader_workers()))
-```
-
-```{.python .input}
-#@tab pytorch
-def load_data_fashion_mnist(batch_size, resize=None):  #@save
-    """Download the Fashion-MNIST dataset and then load it into memory."""
-    trans = [transforms.ToTensor()]
-    if resize:
-        trans.insert(0, transforms.Resize(resize))
-    trans = transforms.Compose(trans)
-    mnist_train = torchvision.datasets.FashionMNIST(
-        root="../data", train=True, transform=trans, download=True)
-    mnist_test = torchvision.datasets.FashionMNIST(
-        root="../data", train=False, transform=trans, download=True)
-    return (data.DataLoader(mnist_train, batch_size, shuffle=True,
-                            num_workers=get_dataloader_workers()),
-            data.DataLoader(mnist_test, batch_size, shuffle=False,
-                            num_workers=get_dataloader_workers()))
-```
-
-```{.python .input}
-#@tab tensorflow
-def load_data_fashion_mnist(batch_size, resize=None):   #@save
-    """Download the Fashion-MNIST dataset and then load it into memory."""
-    mnist_train, mnist_test = tf.keras.datasets.fashion_mnist.load_data()
-    # Divide all numbers by 255 so that all pixel values are between
-    # 0 and 1, add a channel dimension at the end, and cast labels to int32
-    process = lambda X, y: (tf.expand_dims(X, axis=3) / 255,
-                            tf.cast(y, dtype='int32'))
-    resize_fn = lambda X, y: (
-        tf.image.resize_with_pad(X, resize, resize) if resize else X, y)
-    return (
-        tf.data.Dataset.from_tensor_slices(process(*mnist_train)).shuffle(
-            len(mnist_train[0])).batch(batch_size).map(resize_fn),
-        tf.data.Dataset.from_tensor_slices(process(*mnist_test)).batch(
-            batch_size).map(resize_fn))
-```
-
-Below we test the image resizing feature of the `load_data_fashion_mnist` function
-by specifying the `resize` argument.
-
-```{.python .input}
-#@tab all
-train_iter, test_iter = load_data_fashion_mnist(32, resize=64)
-for X, y in train_iter:
-    print(X.shape, X.dtype, y.shape, y.dtype)
-    break
-```
-
-We are now ready to work with the Fashion-MNIST dataset in the sections that follow.
-
-## Summary
-
-* Fashion-MNIST is an apparel classification dataset consisting of images representing 10 categories. We will use this dataset in subsequent sections and chapters to evaluate various classification algorithms.
-* We store the shape of any image with height $h$ and width $w$ pixels as $h \times w$ or ($h$, $w$).
-* Data iterators are a key component for efficient performance. Rely on well-implemented data iterators that exploit high-performance computing to avoid slowing down your training loop.
-
-
-## Exercises
-
-1. Does reducing the `batch_size` (for instance, to 1) affect the reading performance?
-1. The data iterator performance is important. Do you think the current implementation is fast enough? Explore various options to improve it.
-1. Check out the framework's online API documentation. Which other datasets are available?
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/48)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/49)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/224)
-:end_tab:
diff --git a/chapter_linear-networks/index.md b/chapter_linear-networks/index.md
deleted file mode 100644
index ef6f830..0000000
--- a/chapter_linear-networks/index.md
+++ /dev/null
@@ -1,16 +0,0 @@
-# 線形ニューラルネットワーク
-:label:`chap_linear`
-
-ディープニューラルネットワークの詳細に入る前に、ニューラルネットワークの学習の基本について説明する必要があります。この章では、単純なニューラルネットワークアーキテクチャの定義、データの処理、損失関数の指定、モデルのトレーニングなど、トレーニングプロセス全体について説明します。物事を把握しやすくするために、最も単純な概念から始めます。幸いなことに、線形回帰やソフトマックス回帰などの古典的な統計的学習手法は、*線形* ニューラルネットワークとして捉えることができます。これらの古典的なアルゴリズムを出発点として基本を紹介し、本書の残りの部分で扱うより複雑な手法の基礎を固めます。
-
-```toc
-:maxdepth: 2
-
-linear-regression
-linear-regression-scratch
-linear-regression-concise
-softmax-regression
-image-classification-dataset
-softmax-regression-scratch
-softmax-regression-concise
-```
diff --git a/chapter_linear-networks/index_origin.md b/chapter_linear-networks/index_origin.md
deleted file mode 100644
index 130ffb4..0000000
--- a/chapter_linear-networks/index_origin.md
+++ /dev/null
@@ -1,25 +0,0 @@
-# Linear Neural Networks
-:label:`chap_linear`
-
-Before we get into the details of deep neural networks,
-we need to cover the basics of neural network training.
-In this chapter, we will cover the entire training process,
-including defining simple neural network architectures, handling data, specifying a loss function, and training the model. 
-In order to make things easier to grasp, we begin with the simplest concepts.
-Fortunately, classic statistical learning techniques such as linear and softmax regression
-can be cast as *linear* neural networks.
-Starting from these classic algorithms, we will introduce you to the basics,
-providing the basis for more complex techniques in the rest of the book.
-
-```toc
-:maxdepth: 2
-
-linear-regression
-linear-regression-scratch
-linear-regression-concise
-softmax-regression
-image-classification-dataset
-softmax-regression-scratch
-softmax-regression-concise
-```
-
diff --git a/chapter_linear-networks/linear-regression-concise.md b/chapter_linear-networks/linear-regression-concise.md
deleted file mode 100644
index ac93e8f..0000000
--- a/chapter_linear-networks/linear-regression-concise.md
+++ /dev/null
@@ -1,350 +0,0 @@
-# 線形回帰の簡潔な実装
-:label:`sec_linear_concise`
-
-過去数年間、ディープラーニングに幅広く関心が高まっているため、企業、学者、愛好家は、勾配ベースの学習アルゴリズムを実装する反復作業を自動化するためのさまざまな成熟したオープンソースフレームワークを開発するようになりました。:numref:`sec_linear_scratch` では、(i) データストレージと線形代数にはテンソル、(ii) 勾配の計算には自動微分のみに依存していました。実際には、データイテレータ、損失関数、オプティマイザ、ニューラルネットワーク層が非常に一般的であるため、現代のライブラリでもこれらのコンポーネントが実装されています。 
-
-このセクションでは、:numref:`sec_linear_scratch` の (**線形回帰モデルを**)、ディープラーニングフレームワークの (**高レベル API を使用して簡潔に実装する方法**) を紹介します。
-
-## データセットの生成
-
-まず、:numref:`sec_linear_scratch` と同じデータセットを生成します。
-
-```{.python .input}
-from d2l import mxnet as d2l
-from mxnet import autograd, gluon, np, npx
-npx.set_np()
-```
-
-```{.python .input}
-#@tab pytorch
-from d2l import torch as d2l
-import numpy as np
-import torch
-from torch.utils import data
-```
-
-```{.python .input}
-#@tab tensorflow
-from d2l import tensorflow as d2l
-import numpy as np
-import tensorflow as tf
-```
-
-```{.python .input}
-#@tab all
-true_w = d2l.tensor([2, -3.4])
-true_b = 4.2
-features, labels = d2l.synthetic_data(true_w, true_b, 1000)
-```
-
-## データセットの読み取り
-
-独自のイテレータを自作するのではなく、[**フレームワーク内の既存の API を呼び出してデータを読み込む**] ことができます。データイテレータオブジェクトをインスタンス化するときに、`features` と `labels` を引数として渡し、`batch_size` を指定します。また、ブール値 `is_train` は、データイテレータオブジェクトが各エポック (データセットを 1 回通過すること) でデータをシャッフルするかどうかを示します。
-
-```{.python .input}
-def load_array(data_arrays, batch_size, is_train=True):  #@save
-    """Construct a Gluon data iterator."""
-    dataset = gluon.data.ArrayDataset(*data_arrays)
-    return gluon.data.DataLoader(dataset, batch_size, shuffle=is_train)
-```
-
-```{.python .input}
-#@tab pytorch
-def load_array(data_arrays, batch_size, is_train=True):  #@save
-    """Construct a PyTorch data iterator."""
-    dataset = data.TensorDataset(*data_arrays)
-    return data.DataLoader(dataset, batch_size, shuffle=is_train)
-```
-
-```{.python .input}
-#@tab tensorflow
-def load_array(data_arrays, batch_size, is_train=True):  #@save
-    """Construct a TensorFlow data iterator."""
-    dataset = tf.data.Dataset.from_tensor_slices(data_arrays)
-    if is_train:
-        dataset = dataset.shuffle(buffer_size=1000)
-    dataset = dataset.batch(batch_size)
-    return dataset
-```
-
-```{.python .input}
-#@tab all
-batch_size = 10
-data_iter = load_array((features, labels), batch_size)
-```
-
-:numref:`sec_linear_scratch` で `data_iter` 関数を呼び出したのとほぼ同じ方法で `data_iter` を使うことができます。正しく動作していることを確認するために、サンプルの最初のミニバッチを読み込んで出力します。:numref:`sec_linear_scratch` とは異なり、ここでは `iter` を使用して Python イテレータを構築し、`next` を使用してイテレータから最初の項目を取得します。
-
-```{.python .input}
-#@tab all
-next(iter(data_iter))
-```
-
-## モデルを定義する
-
-:numref:`sec_linear_scratch` で線形回帰をゼロから実装したとき、モデルパラメーターを明示的に定義し、基本的な線形代数演算を使用して出力を生成する計算をコーディングしました。この方法は知っておく*べき*です。しかし、モデルがより複雑になり、これをほぼ毎日行う必要が出てくると、こうした支援をありがたく感じるでしょう。この状況は、自分のブログをゼロからコーディングするのと似ています。1 回か 2 回行うことはやりがいがあり有益ですが、ブログが必要になるたびに車輪の再発明に 1 か月を費やしていたら、お粗末な Web 開発者になってしまうでしょう。
-
-標準的な操作では、[**フレームワークの定義済みレイヤーを使用**] できます。これにより、実装に集中するのではなく、特にモデルの構築に使用されるレイヤーに集中できます。最初に `Sequential` クラスのインスタンスを参照するモデル変数 `net` を定義します。`Sequential` クラスは、連鎖される複数のレイヤーのコンテナーを定義します。入力データが与えられると、`Sequential` インスタンスはそのデータを第 1 レイヤーに渡し、出力を 2 番目のレイヤーの入力として渡します。次の例では、モデルは 1 つのレイヤーのみで構成されているため、`Sequential` は実際には必要ありません。しかし、今後のモデルのほとんどすべてに複数のレイヤーが含まれるため、最も標準的なワークフローに慣れるためだけに使用します。 
-
-:numref:`fig_single_neuron` に示された単層ネットワークのアーキテクチャを思い出してください。各入力が行列ベクトル乗算によって各出力に接続されているため、この層は *全結合* であると言われます。
-
-:begin_tab:`mxnet`
-Gluon では、全結合層は `Dense` クラスで定義されています。1 つのスカラー出力のみを生成したいので、その数を 1 に設定します。
-
-便宜上、Gluon では各レイヤーの入力形状を指定する必要がないことに注意してください。したがって、ここでは、この線形層に入る入力の数を Gluon に伝える必要はありません。最初にモデルにデータを渡そうとしたとき、例えば後で `net(X)` を実行したときに、Gluon は各レイヤーへの入力数を自動的に推測します。この仕組みについては、後ほど詳しく説明します。
-:end_tab:
-
-:begin_tab:`pytorch`
-PyTorch では、全結合層は `Linear` クラスで定義されています。`nn.Linear` に 2 つの引数を渡すことに注意してください。1 つ目は入力特徴量の次元 (2) を指定し、2 つ目は出力特徴量の次元 (単一のスカラー、つまり 1) を指定します。
-:end_tab:
-
-:begin_tab:`tensorflow`
-Keras では、全結合層は `Dense` クラスで定義されています。1 つのスカラー出力のみを生成したいので、その数を 1 に設定します。
-
-便宜上、Kerasでは各レイヤーの入力形状を指定する必要がないことに注意してください。したがって、ここでは、この線形層に入る入力の数をKerasに伝える必要はありません。最初にモデルにデータを渡そうとしたとき、例えば `net(X)` を後で実行すると、Keras は各レイヤーへの入力数を自動的に推測します。この仕組みについては、後ほど詳しく説明します。
-:end_tab:
-
-```{.python .input}
-# `nn` is an abbreviation for neural networks
-from mxnet.gluon import nn
-net = nn.Sequential()
-net.add(nn.Dense(1))
-```
-
-```{.python .input}
-#@tab pytorch
-# `nn` is an abbreviation for neural networks
-from torch import nn
-net = nn.Sequential(nn.Linear(2, 1))
-```
-
-```{.python .input}
-#@tab tensorflow
-# `keras` is the high-level API for TensorFlow
-net = tf.keras.Sequential()
-net.add(tf.keras.layers.Dense(1))
-```
-
-## モデルパラメーターの初期化
-
-`net` を使用する前に、線形回帰モデルの重みやバイアスなどの (**モデルパラメーターを初期化**) する必要があります。ディープラーニングフレームワークには、パラメーターの初期化方法が事前に定義されていることがよくあります。ここでは、平均が 0、標準偏差が 0.01 の正規分布から各重みパラメータをランダムにサンプリングするように指定します。バイアスパラメータは 0 に初期化されます。
-
-:begin_tab:`mxnet`
-`initializer` モジュールを MXNet からインポートします。このモジュールは、モデルパラメーターを初期化するためのさまざまなメソッドを提供します。Gluon は `init` を `initializer` パッケージにアクセスするためのショートカット (略称) として使用できるようにしています。重みの初期化方法を指定するのは `init.Normal(sigma=0.01)` を呼び出すことだけです。バイアスパラメータはデフォルトで 0 に初期化されます。
-:end_tab:
-
-:begin_tab:`pytorch`
-`nn.Linear` を構築する際に入力次元と出力次元を指定したので、パラメータに直接アクセスして初期値を指定できます。まず、ネットワーク内の最初の層である `net[0]` で層を特定し、`weight.data` および `bias.data` 属性を使用してパラメーターにアクセスします。次に、インプレースメソッド `normal_` と `fill_` を使用してパラメーター値を上書きします。
-:end_tab:
-
-:begin_tab:`tensorflow`
-TensorFlow の `initializers` モジュールは、モデルパラメーターの初期化にさまざまな方法を提供します。Keras で初期化方法を指定する最も簡単な方法は、`kernel_initializer` を指定してレイヤーを作成するときです。ここで `net` をもう一度作り直します。
-:end_tab:
-
-```{.python .input}
-from mxnet import init
-net.initialize(init.Normal(sigma=0.01))
-```
-
-```{.python .input}
-#@tab pytorch
-net[0].weight.data.normal_(0, 0.01)
-net[0].bias.data.fill_(0)
-```
-
-```{.python .input}
-#@tab tensorflow
-initializer = tf.initializers.RandomNormal(stddev=0.01)
-net = tf.keras.Sequential()
-net.add(tf.keras.layers.Dense(1, kernel_initializer=initializer))
-```
-
-:begin_tab:`mxnet`
-上記のコードは単純に見えるかもしれませんが、ここで何か奇妙なことが起きていることに注意してください。Gluon は入力の次元数をまだ把握していませんが、ネットワークのパラメーターを初期化しています。この例のように2になるか、2000になるかもしれません。Gluonは、舞台裏で初期化が実際には*延期*されるため、これを回避できます。実際の初期化は、初めてネットワークを介してデータを渡そうとしたときにのみ行われます。パラメータはまだ初期化されていないため、パラメータにアクセスしたり操作したりすることはできないことに注意してください。
-:end_tab:
-
-:begin_tab:`pytorch`
-
-:end_tab:
-
-:begin_tab:`tensorflow`
-上記のコードは単純に見えるかもしれませんが、ここで何か奇妙なことが起きていることに注意してください。Keras は入力の次元数をまだ把握していませんが、ネットワークのパラメーターを初期化しています。この例のように2になるか、2000になるかもしれません。Kerasはこれを回避することができます。なぜなら、舞台裏では初期化が実際には*延期*されるからです。実際の初期化は、初めてネットワークを介してデータを渡そうとしたときにのみ行われます。パラメータはまだ初期化されていないため、パラメータにアクセスしたり操作したりすることはできないことに注意してください。
-:end_tab:
-
-## 損失関数の定義
-
-:begin_tab:`mxnet`
-Gluon では、`loss` モジュールがさまざまな損失関数を定義しています。この例では、二乗損失の Gluon 実装 (`L2Loss`) を使用します。
-:end_tab:
-
-:begin_tab:`pytorch`
-[**`MSELoss` クラスは平均二乗誤差を計算します (:eqref:`eq_mse` の係数 $1/2$ を除く)。**] デフォルトでは、例に対する平均損失が返されます。
-:end_tab:
-
-:begin_tab:`tensorflow`
-`MeanSquaredError` クラスは平均二乗誤差を計算します (:eqref:`eq_mse` の係数 $1/2$ を除く)。デフォルトでは、サンプル全体にわたる平均損失が返されます。
-:end_tab:
-
-```{.python .input}
-loss = gluon.loss.L2Loss()
-```
-
-```{.python .input}
-#@tab pytorch
-loss = nn.MSELoss()
-```
-
-```{.python .input}
-#@tab tensorflow
-loss = tf.keras.losses.MeanSquaredError()
-```
-
-## 最適化アルゴリズムの定義
-
-:begin_tab:`mxnet`
-ミニバッチ確率的勾配降下法はニューラルネットワークを最適化するための標準ツールであり、Gluon は `Trainer` クラスを通じてこのアルゴリズムのさまざまなバリエーションと共にこれをサポートしています。`Trainer` をインスタンス化するときは、最適化するパラメーター (`net.collect_params()` 経由でモデル `net` から取得可能)、使用する最適化アルゴリズム (`sgd`)、および最適化アルゴリズムに必要なハイパーパラメーターのディクショナリを指定します。ミニバッチ確率的勾配降下法では、値 `learning_rate` を設定するだけで、ここでは 0.03 に設定されます。
-:end_tab:
-
-:begin_tab:`pytorch`
-ミニバッチ確率的勾配降下法はニューラルネットワークを最適化するための標準ツールであり、PyTorch は `optim` モジュールのこのアルゴリズムのさまざまなバリエーションと共にこれをサポートしています。(**`SGD` インスタンスをインスタンス化**) する際には、最適化アルゴリズムに必要なハイパーパラメーターのディクショナリを使用して、最適化するパラメーター (`net.parameters()` 経由でネットから取得可能) を指定します。ミニバッチ確率的勾配降下法では、値 `lr` を設定するだけで、ここでは 0.03 に設定されます。
-:end_tab:
-
-:begin_tab:`tensorflow`
-ミニバッチ確率的勾配降下法はニューラルネットワークを最適化するための標準ツールであり、Keras は `optimizers` モジュールのこのアルゴリズムのさまざまなバリエーションと共にこれをサポートしています。ミニバッチ確率的勾配降下法では、値 `learning_rate` を設定するだけで、ここでは 0.03 に設定されます。
-:end_tab:
-
-```{.python .input}
-from mxnet import gluon
-trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.03})
-```
-
-```{.python .input}
-#@tab pytorch
-trainer = torch.optim.SGD(net.parameters(), lr=0.03)
-```
-
-```{.python .input}
-#@tab tensorflow
-trainer = tf.keras.optimizers.SGD(learning_rate=0.03)
-```
-
-## 訓練
-
-ディープラーニングフレームワークの高レベル API を使用してモデルを表現するには、比較的少ないコード行しか必要としないことに気づいたかもしれません。パラメーターを個別に割り当てたり、損失関数を定義したり、ミニバッチ確率的勾配降下法を実装したりする必要はありませんでした。いったん複雑なモデルを扱うようになれば、高レベル API の利点はかなり大きくなるでしょう。しかし、いったん基本的な要素がすべて揃ったら、[**トレーニングループ自体は、すべてをゼロから実装したときと非常に似ています。**] 
-
-記憶を呼び起こしておきましょう。ある回数のエポックにわたって、データセット (`train_data`) 全体を一巡し、入力のミニバッチと対応するグラウンドトゥルースラベルを繰り返し取得します。ミニバッチごとに、次の儀式を行います。
-
-* `net(X)` を呼び出して予測を生成し、損失 `l` (順伝播) を計算します。
-* バックプロパゲーションを実行して勾配を計算します。
-* オプティマイザーを呼び出してモデルパラメーターを更新します。
-
-念のため、各エポック後に損失を計算し、それを出力して進行状況を監視します。
-
-```{.python .input}
-num_epochs = 3
-for epoch in range(num_epochs):
-    for X, y in data_iter:
-        with autograd.record():
-            l = loss(net(X), y)
-        l.backward()
-        trainer.step(batch_size)
-    l = loss(net(features), labels)
-    print(f'epoch {epoch + 1}, loss {l.mean().asnumpy():f}')
-```
-
-```{.python .input}
-#@tab pytorch
-num_epochs = 3
-for epoch in range(num_epochs):
-    for X, y in data_iter:
-        l = loss(net(X), y)
-        trainer.zero_grad()
-        l.backward()
-        trainer.step()
-    l = loss(net(features), labels)
-    print(f'epoch {epoch + 1}, loss {l:f}')
-```
-
-```{.python .input}
-#@tab tensorflow
-num_epochs = 3
-for epoch in range(num_epochs):
-    for X, y in data_iter:
-        with tf.GradientTape() as tape:
-            l = loss(net(X, training=True), y)
-        grads = tape.gradient(l, net.trainable_variables)
-        trainer.apply_gradients(zip(grads, net.trainable_variables))
-    l = loss(net(features), labels)
-    print(f'epoch {epoch + 1}, loss {l:f}')
-```
-
-以下では、[**有限のデータで学習したモデルパラメータと、データセットの生成に用いた真のパラメータを比較**] します。パラメーターにアクセスするには、まず `net` から必要な層にアクセスし、その層の重みとバイアスにアクセスします。ゼロからの実装と同様に、推定されたパラメーターは対応するグラウンドトゥルースに近いことに注意してください。
-
-```{.python .input}
-w = net[0].weight.data()
-print(f'error in estimating w: {true_w - d2l.reshape(w, true_w.shape)}')
-b = net[0].bias.data()
-print(f'error in estimating b: {true_b - b}')
-```
-
-```{.python .input}
-#@tab pytorch
-w = net[0].weight.data
-print('error in estimating w:', true_w - d2l.reshape(w, true_w.shape))
-b = net[0].bias.data
-print('error in estimating b:', true_b - b)
-```
-
-```{.python .input}
-#@tab tensorflow
-w = net.get_weights()[0]
-print('error in estimating w', true_w - d2l.reshape(w, true_w.shape))
-b = net.get_weights()[1]
-print('error in estimating b', true_b - b)
-```
-
-## 概要
-
-:begin_tab:`mxnet`
-* Gluon を使えば、モデルをより簡潔に実装できます。
-* Gluon では、`data` モジュールはデータ処理用のツールを提供し、`nn` モジュールは多数のニューラルネットワーク層を定義し、`loss` モジュールは多くの一般的な損失関数を定義します。
-* MXNet のモジュール `initializer` は、モデルパラメーターの初期化にさまざまなメソッドを提供します。
-* 次元と記憶域は自動的に推定されますが、初期化される前にパラメータにアクセスしようとしないよう注意してください。
-:end_tab:
-
-:begin_tab:`pytorch`
-* PyTorch の高レベル API を使えば、モデルをより簡潔に実装できます。
-* PyTorch では `data` モジュールはデータ処理用のツールを提供し、`nn` モジュールは多数のニューラルネットワーク層と共通の損失関数を定義します。
-* `_` で終わるメソッドを使ってパラメータの値を置き換えることで、パラメータを初期化できます。
-:end_tab:
-
-:begin_tab:`tensorflow`
-* TensorFlow の高レベル API を使用することで、モデルをより簡潔に実装できます。
-* TensorFlow では、`data` モジュールはデータ処理用のツールを提供し、`keras` モジュールは多数のニューラルネットワーク層と一般的な損失関数を定義します。
-* TensorFlow のモジュール `initializers` は、モデルパラメーターの初期化のためのさまざまなメソッドを提供します。
-* 次元と記憶域は自動的に推論されます (ただし、初期化される前にパラメーターにアクセスしようとしないよう注意してください)。
-:end_tab:
-
-## 演習
-
-:begin_tab:`mxnet`
-1. `l = loss(output, y)` を `l = loss(output, y).mean()` に置き換える場合、コードが同じように動作するように `trainer.step(batch_size)` を `trainer.step(1)` に変更する必要があります。なぜ？
-1. モジュール `gluon.loss` および `init` で提供されている損失関数と初期化方法については、MXNet のドキュメントを参照してください。損失をフーバーの損失で置き換えます。
-1. `dense.weight` の勾配にはどうやってアクセスしますか？
-
-[Discussions](https://discuss.d2l.ai/t/44)
-:end_tab:
-
-:begin_tab:`pytorch`
-1. `nn.MSELoss(reduction='sum')` を `nn.MSELoss()` に置き換えた場合、コードが同じように動作するには学習率をどのように変更すればよいでしょうか。それはなぜですか？
-1. PyTorch のドキュメントを参照して、提供されている損失関数と初期化メソッドを確認してください。損失をフーバーの損失で置き換えます。
-1. `net[0].weight` の勾配にはどうやってアクセスしますか？
-
-[Discussions](https://discuss.d2l.ai/t/45)
-:end_tab:
-
-:begin_tab:`tensorflow`
-1. TensorFlow のドキュメントを参照して、どのような損失関数と初期化方法が提供されているかを確認してください。損失をフーバーの損失で置き換えます。
-
-[Discussions](https://discuss.d2l.ai/t/204)
-:end_tab:
diff --git a/chapter_linear-networks/linear-regression-concise_origin.md b/chapter_linear-networks/linear-regression-concise_origin.md
deleted file mode 100644
index aa73730..0000000
--- a/chapter_linear-networks/linear-regression-concise_origin.md
+++ /dev/null
@@ -1,513 +0,0 @@
-# Concise Implementation of Linear Regression
-:label:`sec_linear_concise`
-
-Broad and intense interest in deep learning for the past several years
-has inspired companies, academics, and hobbyists
-to develop a variety of mature open source frameworks
-for automating the repetitive work of implementing
-gradient-based learning algorithms.
-In :numref:`sec_linear_scratch`, we relied only on
-(i) tensors for data storage and linear algebra;
-and (ii) automatic differentiation for calculating gradients.
-In practice, because data iterators, loss functions, optimizers,
-and neural network layers
-are so common, modern libraries implement these components for us as well.
-
-In this section, (**we will show you how to implement
-the linear regression model**) from :numref:`sec_linear_scratch`
-(**concisely by using high-level APIs**) of deep learning frameworks.
-
-
-## Generating the Dataset
-
-To start, we will generate the same dataset as in :numref:`sec_linear_scratch`.
-
-```{.python .input}
-from d2l import mxnet as d2l
-from mxnet import autograd, gluon, np, npx
-npx.set_np()
-```
-
-```{.python .input}
-#@tab pytorch
-from d2l import torch as d2l
-import numpy as np
-import torch
-from torch.utils import data
-```
-
-```{.python .input}
-#@tab tensorflow
-from d2l import tensorflow as d2l
-import numpy as np
-import tensorflow as tf
-```
-
-```{.python .input}
-#@tab all
-true_w = d2l.tensor([2, -3.4])
-true_b = 4.2
-features, labels = d2l.synthetic_data(true_w, true_b, 1000)
-```
-
-## Reading the Dataset
-
-Rather than rolling our own iterator,
-we can [**call upon the existing API in a framework to read data.**]
-We pass in `features` and `labels` as arguments and specify `batch_size`
-when instantiating a data iterator object.
-In addition, the Boolean value `is_train`
-indicates whether or not
-we want the data iterator object to shuffle the data
-on each epoch (pass through the dataset).
-
-```{.python .input}
-def load_array(data_arrays, batch_size, is_train=True):  #@save
-    """Construct a Gluon data iterator."""
-    dataset = gluon.data.ArrayDataset(*data_arrays)
-    return gluon.data.DataLoader(dataset, batch_size, shuffle=is_train)
-```
-
-```{.python .input}
-#@tab pytorch
-def load_array(data_arrays, batch_size, is_train=True):  #@save
-    """Construct a PyTorch data iterator."""
-    dataset = data.TensorDataset(*data_arrays)
-    return data.DataLoader(dataset, batch_size, shuffle=is_train)
-```
-
-```{.python .input}
-#@tab tensorflow
-def load_array(data_arrays, batch_size, is_train=True):  #@save
-    """Construct a TensorFlow data iterator."""
-    dataset = tf.data.Dataset.from_tensor_slices(data_arrays)
-    if is_train:
-        dataset = dataset.shuffle(buffer_size=1000)
-    dataset = dataset.batch(batch_size)
-    return dataset
-```
-
-```{.python .input}
-#@tab all
-batch_size = 10
-data_iter = load_array((features, labels), batch_size)
-```
-
-Now we can use `data_iter` in much the same way as we called
-the `data_iter` function in :numref:`sec_linear_scratch`.
-To verify that it is working, we can read and print
-the first minibatch of examples.
-In contrast to :numref:`sec_linear_scratch`,
-here we use `iter` to construct a Python iterator and use `next` to obtain the first item from the iterator.
-
-```{.python .input}
-#@tab all
-next(iter(data_iter))
-```
-
-## Defining the Model
-
-When we implemented linear regression from scratch
-in :numref:`sec_linear_scratch`,
-we defined our model parameters explicitly
-and coded up the calculations to produce output
-using basic linear algebra operations.
-You *should* know how to do this.
-But once your models get more complex,
-and once you have to do this nearly every day,
-you will be glad for the assistance.
-The situation is similar to coding up your own blog from scratch.
-Doing it once or twice is rewarding and instructive,
-but you would be a lousy web developer
-if every time you needed a blog you spent a month
-reinventing the wheel.
-
-For standard operations, we can [**use a framework's predefined layers,**]
-which allow us to focus especially
-on the layers used to construct the model
-rather than having to focus on the implementation.
-We will first define a model variable `net`,
-which will refer to an instance of the `Sequential` class.
-The `Sequential` class defines a container
-for several layers that will be chained together.
-Given input data, a `Sequential` instance passes it through
-the first layer, in turn passing the output
-as the second layer's input and so forth.
-In the following example, our model consists of only one layer,
-so we do not really need `Sequential`.
-But since nearly all of our future models
-will involve multiple layers,
-we will use it anyway just to familiarize you
-with the most standard workflow.
-
-Recall the architecture of a single-layer network as shown in :numref:`fig_single_neuron`.
-The layer is said to be *fully-connected*
-because each of its inputs is connected to each of its outputs
-by means of a matrix-vector multiplication.
-
-:begin_tab:`mxnet`
-In Gluon, the fully-connected layer is defined in the `Dense` class.
-Since we only want to generate a single scalar output,
-we set that number to 1.
-
-It is worth noting that, for convenience,
-Gluon does not require us to specify
-the input shape for each layer.
-So here, we do not need to tell Gluon
-how many inputs go into this linear layer.
-When we first try to pass data through our model,
-e.g., when we execute `net(X)` later,
-Gluon will automatically infer the number of inputs to each layer.
-We will describe how this works in more detail later.
-:end_tab:
-
-:begin_tab:`pytorch`
-In PyTorch, the fully-connected layer is defined in the `Linear` class. Note that we passed two arguments into `nn.Linear`. The first one specifies the input feature dimension, which is 2, and the second one is the output feature dimension, which is a single scalar and therefore 1.
-:end_tab:
-
-:begin_tab:`tensorflow`
-In Keras, the fully-connected layer is defined in the `Dense` class. Since we only want to generate a single scalar output, we set that number to 1.
-
-It is worth noting that, for convenience,
-Keras does not require us to specify
-the input shape for each layer.
-So here, we do not need to tell Keras
-how many inputs go into this linear layer.
-When we first try to pass data through our model,
-e.g., when we execute `net(X)` later,
-Keras will automatically infer the number of inputs to each layer.
-We will describe how this works in more detail later.
-:end_tab:
-
-```{.python .input}
-# `nn` is an abbreviation for neural networks
-from mxnet.gluon import nn
-net = nn.Sequential()
-net.add(nn.Dense(1))
-```
-
-```{.python .input}
-#@tab pytorch
-# `nn` is an abbreviation for neural networks
-from torch import nn
-net = nn.Sequential(nn.Linear(2, 1))
-```
-
-```{.python .input}
-#@tab tensorflow
-# `keras` is the high-level API for TensorFlow
-net = tf.keras.Sequential()
-net.add(tf.keras.layers.Dense(1))
-```
-
-## Initializing Model Parameters
-
-Before using `net`, we need to (**initialize the model parameters,**)
-such as the weights and bias in the linear regression model.
-Deep learning frameworks often have a predefined way to initialize the parameters.
-Here we specify that each weight parameter
-should be randomly sampled from a normal distribution
-with mean 0 and standard deviation 0.01.
-The bias parameter will be initialized to zero.
-
-:begin_tab:`mxnet`
-We will import the `initializer` module from MXNet.
-This module provides various methods for model parameter initialization.
-Gluon makes `init` available as a shortcut (abbreviation)
-to access the `initializer` package.
-We only specify how to initialize the weight by calling `init.Normal(sigma=0.01)`.
-Bias parameters are initialized to zero by default.
-:end_tab:
-
-:begin_tab:`pytorch`
-As we have specified the input and output dimensions when constructing `nn.Linear`,
-now we can access the parameters directly to specify their initial values.
-We first locate the layer by `net[0]`, which is the first layer in the network,
-and then use the `weight.data` and `bias.data` attributes to access the parameters.
-Next we use the in-place methods `normal_` and `fill_` to overwrite parameter values.
-:end_tab:
-
-:begin_tab:`tensorflow`
-The `initializers` module in TensorFlow provides various methods for model parameter initialization. The easiest way to specify the initialization method in Keras is to pass `kernel_initializer` when creating the layer. Here we recreate `net`.
-:end_tab:
-
-```{.python .input}
-from mxnet import init
-net.initialize(init.Normal(sigma=0.01))
-```
-
-```{.python .input}
-#@tab pytorch
-net[0].weight.data.normal_(0, 0.01)
-net[0].bias.data.fill_(0)
-```
-
-```{.python .input}
-#@tab tensorflow
-initializer = tf.initializers.RandomNormal(stddev=0.01)
-net = tf.keras.Sequential()
-net.add(tf.keras.layers.Dense(1, kernel_initializer=initializer))
-```
-
-:begin_tab:`mxnet`
-The code above may look straightforward but you should note
-that something strange is happening here.
-We are initializing parameters for a network
-even though Gluon does not yet know
-how many dimensions the input will have!
-It might be 2 as in our example or it might be 2000.
-Gluon lets us get away with this because behind the scenes,
-the initialization is actually *deferred*.
-The real initialization will take place only
-when we first attempt to pass data through the network.
-Just be careful to remember that since the parameters
-have not been initialized yet,
-we cannot access or manipulate them.
-:end_tab:
-
-:begin_tab:`pytorch`
-
-:end_tab:
-
-:begin_tab:`tensorflow`
-The code above may look straightforward but you should note
-that something strange is happening here.
-We are initializing parameters for a network
-even though Keras does not yet know
-how many dimensions the input will have!
-It might be 2 as in our example or it might be 2000.
-Keras lets us get away with this because behind the scenes,
-the initialization is actually *deferred*.
-The real initialization will take place only
-when we first attempt to pass data through the network.
-Just be careful to remember that since the parameters
-have not been initialized yet,
-we cannot access or manipulate them.
-:end_tab:
-
-## Defining the Loss Function
-
-:begin_tab:`mxnet`
-In Gluon, the `loss` module defines various loss functions.
-In this example, we will use the Gluon
-implementation of squared loss (`L2Loss`).
-:end_tab:
-
-:begin_tab:`pytorch`
-[**The `MSELoss` class computes the mean squared error (without the $1/2$ factor in :eqref:`eq_mse`).**]
-By default it returns the average loss over examples.
-:end_tab:
-
-:begin_tab:`tensorflow`
-The `MeanSquaredError` class computes the mean squared error (without the $1/2$ factor in :eqref:`eq_mse`).
-By default it returns the average loss over examples.
-:end_tab:
-
-```{.python .input}
-loss = gluon.loss.L2Loss()
-```
-
-```{.python .input}
-#@tab pytorch
-loss = nn.MSELoss()
-```
-
-```{.python .input}
-#@tab tensorflow
-loss = tf.keras.losses.MeanSquaredError()
-```
-
-## Defining the Optimization Algorithm
-
-:begin_tab:`mxnet`
-Minibatch stochastic gradient descent is a standard tool
-for optimizing neural networks
-and thus Gluon supports it alongside a number of
-variations on this algorithm through its `Trainer` class.
-When we instantiate `Trainer`,
-we will specify the parameters to optimize over
-(obtainable from our model `net` via `net.collect_params()`),
-the optimization algorithm we wish to use (`sgd`),
-and a dictionary of hyperparameters
-required by our optimization algorithm.
-Minibatch stochastic gradient descent just requires that
-we set the value `learning_rate`, which is set to 0.03 here.
-:end_tab:
-
-:begin_tab:`pytorch`
-Minibatch stochastic gradient descent is a standard tool
-for optimizing neural networks
-and thus PyTorch supports it alongside a number of
-variations on this algorithm in the `optim` module.
-When we (**instantiate an `SGD` instance,**)
-we will specify the parameters to optimize over
-(obtainable from our net via `net.parameters()`), with a dictionary of hyperparameters
-required by our optimization algorithm.
-Minibatch stochastic gradient descent just requires that
-we set the value `lr`, which is set to 0.03 here.
-:end_tab:
-
-:begin_tab:`tensorflow`
-Minibatch stochastic gradient descent is a standard tool
-for optimizing neural networks
-and thus Keras supports it alongside a number of
-variations on this algorithm in the `optimizers` module.
-Minibatch stochastic gradient descent just requires that
-we set the value `learning_rate`, which is set to 0.03 here.
-:end_tab:
-
-```{.python .input}
-from mxnet import gluon
-trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.03})
-```
-
-```{.python .input}
-#@tab pytorch
-trainer = torch.optim.SGD(net.parameters(), lr=0.03)
-```
-
-```{.python .input}
-#@tab tensorflow
-trainer = tf.keras.optimizers.SGD(learning_rate=0.03)
-```
-
-## Training
-
-You might have noticed that expressing our model through
-high-level APIs of a deep learning framework
-requires comparatively few lines of code.
-We did not have to individually allocate parameters,
-define our loss function, or implement minibatch stochastic gradient descent.
-Once we start working with much more complex models,
-advantages of high-level APIs will grow considerably.
-However, once we have all the basic pieces in place,
-[**the training loop itself is strikingly similar
-to what we did when implementing everything from scratch.**]
-
-To refresh your memory: for some number of epochs,
-we will make a complete pass over the dataset (`train_data`),
-iteratively grabbing one minibatch of inputs
-and the corresponding ground-truth labels.
-For each minibatch, we go through the following ritual:
-
-* Generate predictions by calling `net(X)` and calculate the loss `l` (the forward propagation).
-* Calculate gradients by running the backpropagation.
-* Update the model parameters by invoking our optimizer.
-
-For good measure, we compute the loss after each epoch and print it to monitor progress.
-
-```{.python .input}
-num_epochs = 3
-for epoch in range(num_epochs):
-    for X, y in data_iter:
-        with autograd.record():
-            l = loss(net(X), y)
-        l.backward()
-        trainer.step(batch_size)
-    l = loss(net(features), labels)
-    print(f'epoch {epoch + 1}, loss {l.mean().asnumpy():f}')
-```
-
-```{.python .input}
-#@tab pytorch
-num_epochs = 3
-for epoch in range(num_epochs):
-    for X, y in data_iter:
-        l = loss(net(X), y)
-        trainer.zero_grad()
-        l.backward()
-        trainer.step()
-    l = loss(net(features), labels)
-    print(f'epoch {epoch + 1}, loss {l:f}')
-```
-
-```{.python .input}
-#@tab tensorflow
-num_epochs = 3
-for epoch in range(num_epochs):
-    for X, y in data_iter:
-        with tf.GradientTape() as tape:
-            l = loss(net(X, training=True), y)
-        grads = tape.gradient(l, net.trainable_variables)
-        trainer.apply_gradients(zip(grads, net.trainable_variables))
-    l = loss(net(features), labels)
-    print(f'epoch {epoch + 1}, loss {l:f}')
-```
-
-Below, we [**compare the model parameters learned by training on finite data
-and the actual parameters**] that generated our dataset.
-To access parameters,
-we first access the layer that we need from `net`
-and then access that layer's weights and bias.
-As in our from-scratch implementation,
-note that our estimated parameters are
-close to their ground-truth counterparts.
-
-```{.python .input}
-w = net[0].weight.data()
-print(f'error in estimating w: {true_w - d2l.reshape(w, true_w.shape)}')
-b = net[0].bias.data()
-print(f'error in estimating b: {true_b - b}')
-```
-
-```{.python .input}
-#@tab pytorch
-w = net[0].weight.data
-print('error in estimating w:', true_w - d2l.reshape(w, true_w.shape))
-b = net[0].bias.data
-print('error in estimating b:', true_b - b)
-```
-
-```{.python .input}
-#@tab tensorflow
-w = net.get_weights()[0]
-print('error in estimating w', true_w - d2l.reshape(w, true_w.shape))
-b = net.get_weights()[1]
-print('error in estimating b', true_b - b)
-```
-
-## Summary
-
-:begin_tab:`mxnet`
-* Using Gluon, we can implement models much more concisely.
-* In Gluon, the `data` module provides tools for data processing, the `nn` module defines a large number of neural network layers, and the `loss` module defines many common loss functions.
-* MXNet's module `initializer` provides various methods for model parameter initialization.
-* Dimensionality and storage are automatically inferred, but be careful not to attempt to access parameters before they have been initialized.
-:end_tab:
-
-:begin_tab:`pytorch`
-* Using PyTorch's high-level APIs, we can implement models much more concisely.
-* In PyTorch, the `data` module provides tools for data processing, the `nn` module defines a large number of neural network layers and common loss functions.
-* We can initialize the parameters by replacing their values with methods ending with `_`.
-:end_tab:
-
-:begin_tab:`tensorflow`
-* Using TensorFlow's high-level APIs, we can implement models much more concisely.
-* In TensorFlow, the `data` module provides tools for data processing, the `keras` module defines a large number of neural network layers and common loss functions.
-* TensorFlow's module `initializers` provides various methods for model parameter initialization.
-* Dimensionality and storage are automatically inferred (but be careful not to attempt to access parameters before they have been initialized).
-:end_tab:
-
-## Exercises
-
-:begin_tab:`mxnet`
-1. If we replace `l = loss(output, y)` with `l = loss(output, y).mean()`, we need to change `trainer.step(batch_size)` to `trainer.step(1)` for the code to behave identically. Why?
-1. Review the MXNet documentation to see what loss functions and initialization methods are provided in the modules `gluon.loss` and `init`. Replace the loss by Huber's loss.
-1. How do you access the gradient of `dense.weight`?
-
-[Discussions](https://discuss.d2l.ai/t/44)
-:end_tab:
-
-:begin_tab:`pytorch`
-1. If we replace `nn.MSELoss(reduction='sum')` with `nn.MSELoss()`, how can we change the learning rate for the code to behave identically? Why?
-1. Review the PyTorch documentation to see what loss functions and initialization methods are provided. Replace the loss by Huber's loss.
-1. How do you access the gradient of `net[0].weight`?
-
-[Discussions](https://discuss.d2l.ai/t/45)
-:end_tab:
-
-:begin_tab:`tensorflow`
-1. Review the TensorFlow documentation to see what loss functions and initialization methods are provided. Replace the loss by Huber's loss.
-
-[Discussions](https://discuss.d2l.ai/t/204)
-:end_tab:
diff --git a/chapter_linear-networks/linear-regression-scratch.md b/chapter_linear-networks/linear-regression-scratch.md
deleted file mode 100644
index a66aaa3..0000000
--- a/chapter_linear-networks/linear-regression-scratch.md
+++ /dev/null
@@ -1,311 +0,0 @@
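-# Linear Regression Implementation from Scratch
-:label:`sec_linear_scratch`
-
-Now that you understand the key ideas behind linear regression, we can begin to work through a hands-on implementation in code. In this section, (**we will implement the entire method from scratch, including the data pipeline, the model, the loss function, and the minibatch stochastic gradient descent optimizer.**) While modern deep learning frameworks can automate nearly all of this work, implementing things from scratch is the only way to make sure that you really know what you are doing. Moreover, when it comes time to customize models, defining our own layers or loss functions, understanding how things work under the hood will prove handy. In this section, we will rely only on tensors and automatic differentiation. Afterwards, we will introduce a more concise implementation, taking advantage of the bells and whistles of deep learning frameworks.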
-
-```{.python .input}
-%matplotlib inline
-from d2l import mxnet as d2l
-from mxnet import autograd, np, npx
-import random
-npx.set_np()
-```
-
-```{.python .input}
-#@tab pytorch
-%matplotlib inline
-from d2l import torch as d2l
-import torch
-import random
-```
-
-```{.python .input}
-#@tab tensorflow
-%matplotlib inline
-from d2l import tensorflow as d2l
-import tensorflow as tf
-import random
-```
-
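-## Generating the Dataset
-
-To keep things simple, we will [**construct an artificial dataset according to a linear model with additive noise.**] Our task will be to recover this model's parameters using the finite set of examples contained in our dataset. We will keep the data low-dimensional so we can visualize it easily. In the following code snippet, we generate a dataset containing 1000 examples, each consisting of 2 features sampled from a standard normal distribution. Thus our synthetic dataset will be a matrix $\mathbf{X}\in \mathbb{R}^{1000 \times 2}$.
-
-(**The true parameters generating our dataset will be $\mathbf{w} = [2, -3.4]^\top$ and $b = 4.2$, and**) our synthetic labels will be assigned according to the following linear model with the noise term $\epsilon$:
-
-(**$$\mathbf{y}= \mathbf{X} \mathbf{w} + b + \mathbf\epsilon.$$**)
-
-You could think of $\epsilon$ as capturing potential measurement errors on the features and labels. We will assume that the standard assumptions hold and thus that $\epsilon$ obeys a normal distribution with mean 0. To make our problem easy, we will set its standard deviation to 0.01. The following code generates our synthetic dataset.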
-
-```{.python .input}
-#@tab mxnet, pytorch
-def synthetic_data(w, b, num_examples):  #@save
-    """Generate y = Xw + b + noise."""
-    X = d2l.normal(0, 1, (num_examples, len(w)))
-    y = d2l.matmul(X, w) + b
-    y += d2l.normal(0, 0.01, y.shape)
-    return X, d2l.reshape(y, (-1, 1))
-```
-
-```{.python .input}
-#@tab tensorflow
-def synthetic_data(w, b, num_examples):  #@save
-    """Generate y = Xw + b + noise."""
-    X = d2l.zeros((num_examples, w.shape[0]))
-    X += tf.random.normal(shape=X.shape)
-    y = d2l.matmul(X, tf.reshape(w, (-1, 1))) + b
-    y += tf.random.normal(shape=y.shape, stddev=0.01)
-    y = d2l.reshape(y, (-1, 1))
-    return X, y
-```
-
-```{.python .input}
-#@tab all
-true_w = d2l.tensor([2, -3.4])
-true_b = 4.2
-features, labels = synthetic_data(true_w, true_b, 1000)
-```
-
-Note that [**each row in `features` consists of a 2-dimensional data example and that each row in `labels` consists of a 1-dimensional label value (a scalar).**]
-
-```{.python .input}
-#@tab all
-print('features:', features[0],'\nlabel:', labels[0])
-```
-
-By generating a scatter plot using the second feature `features[:, 1]` and `labels`, we can clearly observe the linear correlation between the two.
-
-```{.python .input}
-#@tab all
-d2l.set_figsize()
-# The semicolon is for displaying the plot only
-d2l.plt.scatter(d2l.numpy(features[:, 1]), d2l.numpy(labels), 1);
-```
-
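-## Reading the Dataset
-
-Recall that training models consists of making multiple passes over the dataset, grabbing one minibatch of examples at a time, and using them to update our model. Since this process is so fundamental to training machine learning algorithms, it is worth defining a utility function to shuffle the dataset and access it in minibatches.
-
-In the following code, we [**define the `data_iter` function**] (~~that~~) to demonstrate one possible implementation of this functionality. The function (**takes a batch size, a matrix of features, and a vector of labels, yielding minibatches of the size `batch_size`.**) Each minibatch consists of a tuple of features and labels.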
-
-```{.python .input}
-#@tab mxnet, pytorch
-def data_iter(batch_size, features, labels):
-    num_examples = len(features)
-    indices = list(range(num_examples))
-    # The examples are read at random, in no particular order
-    random.shuffle(indices)
-    for i in range(0, num_examples, batch_size):
-        batch_indices = d2l.tensor(
-            indices[i: min(i + batch_size, num_examples)])
-        yield features[batch_indices], labels[batch_indices]
-```
-
-```{.python .input}
-#@tab tensorflow
-def data_iter(batch_size, features, labels):
-    num_examples = len(features)
-    indices = list(range(num_examples))
-    # The examples are read at random, in no particular order
-    random.shuffle(indices)
-    for i in range(0, num_examples, batch_size):
-        j = tf.constant(indices[i: min(i + batch_size, num_examples)])
-        yield tf.gather(features, j), tf.gather(labels, j)
-```
-
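-In general, note that we want to use reasonably sized minibatches to take advantage of the GPU hardware, which excels at parallelizing operations. Because each example can be fed through our models in parallel and the gradient of the loss function for each example can also be taken in parallel, GPUs allow us to process hundreds of examples in scarcely more time than it might take to process just a single example.
-
-To build some intuition, let us read and print the first small batch of data examples. The shape of the features in each minibatch tells us both the minibatch size and the number of input features. Likewise, our minibatch of labels will have a shape given by `batch_size`.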
-
-```{.python .input}
-#@tab all
-batch_size = 10
-
-for X, y in data_iter(batch_size, features, labels):
-    print(X, '\n', y)
-    break
-```
-
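-As we run the iteration, we obtain distinct minibatches successively until the entire dataset has been exhausted (try this). While the iteration implemented above is good for didactic purposes, it is inefficient in ways that might get us in trouble on real problems. For example, it requires that we load all the data in memory and that we perform lots of random memory access. The built-in iterators implemented in a deep learning framework are considerably more efficient and they can deal with both data stored in files and data fed via data streams.
-
-## Initializing Model Parameters
-
-[**Before we can begin optimizing our model's parameters**] by minibatch stochastic gradient descent, (**we need to have some parameters in the first place.**) In the following code, we initialize weights by sampling random numbers from a normal distribution with mean 0 and a standard deviation of 0.01, and setting the bias to 0.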
-
-```{.python .input}
-w = np.random.normal(0, 0.01, (2, 1))
-b = np.zeros(1)
-w.attach_grad()
-b.attach_grad()
-```
-
-```{.python .input}
-#@tab pytorch
-w = torch.normal(0, 0.01, size=(2,1), requires_grad=True)
-b = torch.zeros(1, requires_grad=True)
-```
-
-```{.python .input}
-#@tab tensorflow
-w = tf.Variable(tf.random.normal(shape=(2, 1), mean=0, stddev=0.01),
-                trainable=True)
-b = tf.Variable(tf.zeros(1), trainable=True)
-```
-
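-After initializing our parameters, our next task is to update them until they fit our data sufficiently well. Each update requires taking the gradient of our loss function with respect to the parameters. Given this gradient, we can update each parameter in the direction that may reduce the loss.
-
-Since nobody wants to compute gradients explicitly (this is tedious and error prone), we use automatic differentiation, as introduced in :numref:`sec_autograd`, to compute the gradient.
-
-## Defining the Model
-
-Next, we must [**define our model, relating its inputs and parameters to its outputs.**] Recall that to calculate the output of the linear model, we simply take the matrix-vector dot product of the input features $\mathbf{X}$ and the model weights $\mathbf{w}$, and add the offset $b$ to each example. Note that below $\mathbf{Xw}$ is a vector and $b$ is a scalar. Recall the broadcasting mechanism as described in :numref:`subsec_broadcasting`. When we add a vector and a scalar, the scalar is added to each component of the vector.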
-
-```{.python .input}
-#@tab all
-def linreg(X, w, b):  #@save
-    """The linear regression model."""
-    return d2l.matmul(X, w) + b
-```
-
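-## Defining the Loss Function
-
-Since [**updating our model requires taking the gradient of our loss function,**] we ought to (**define the loss function first.**) Here we will use the squared loss function as described in :numref:`sec_linear_regression`. In the implementation, we need to transform the true value `y` into the predicted value's shape `y_hat`. The result returned by the following function will also have the same shape as `y_hat`.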
-
-```{.python .input}
-#@tab all
-def squared_loss(y_hat, y):  #@save
-    """Squared loss."""
-    return (y_hat - d2l.reshape(y, y_hat.shape)) ** 2 / 2
-```
-
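-## Defining the Optimization Algorithm
-
-As we discussed in :numref:`sec_linear_regression`, linear regression has a closed-form solution. However, this is not a book about linear regression: it is a book about deep learning. Since none of the other models that this book introduces can be solved analytically, we will take this opportunity to introduce your first working example of minibatch stochastic gradient descent. [~~Although linear regression has a closed-form solution, other models in this book do not. Here we introduce minibatch stochastic gradient descent.~~]
-
-At each step, using one minibatch randomly drawn from our dataset, we will estimate the gradient of the loss with respect to our parameters. Next, we will update our parameters in the direction that may reduce the loss. The following code applies the minibatch stochastic gradient descent update, given a set of parameters, a learning rate, and a batch size. The size of the update step is determined by the learning rate `lr`. Because our loss is calculated as a sum over the minibatch of examples, we normalize our step size by the batch size (`batch_size`), so that the magnitude of a typical step size does not depend heavily on our choice of the batch size.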
-
-```{.python .input}
-def sgd(params, lr, batch_size):  #@save
-    """Minibatch stochastic gradient descent."""
-    for param in params:
-        param[:] = param - lr * param.grad / batch_size
-```
-
-```{.python .input}
-#@tab pytorch
-def sgd(params, lr, batch_size):  #@save
-    """Minibatch stochastic gradient descent."""
-    with torch.no_grad():
-        for param in params:
-            param -= lr * param.grad / batch_size
-            param.grad.zero_()
-```
-
-```{.python .input}
-#@tab tensorflow
-def sgd(params, grads, lr, batch_size):  #@save
-    """Minibatch stochastic gradient descent."""
-    for param, grad in zip(params, grads):
-        param.assign_sub(lr*grad/batch_size)
-```
-
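-## Training
-
-Now that we have all of the parts in place, we are ready to [**implement the main training loop.**] It is crucial that you understand this code because you will see nearly identical training loops over and over again throughout your career in deep learning.
-
-In each iteration, we will grab a minibatch of training examples and pass them through our model to obtain a set of predictions. After calculating the loss, we initiate the backward pass through the network, storing the gradients with respect to each parameter. Finally, we will call the optimization algorithm `sgd` to update the model parameters.
-
-In summary, we will execute the following loop:
-
-* Initialize parameters $(\mathbf{w}, b)$
-* Repeat until done
-    * Compute gradient $\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)$
-    * Update parameters $(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}$
-
-In each *epoch*, we will iterate through the entire dataset (using the `data_iter` function), passing once through every example in the training dataset (assuming that the number of examples is divisible by the batch size). The number of epochs `num_epochs` and the learning rate `lr` are both hyperparameters, which we set here to 3 and 0.03, respectively. Unfortunately, setting hyperparameters is tricky and requires some adjustment by trial and error. We elide these details for now but revisit them later in :numref:`chap_optimization`.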
-
-```{.python .input}
-#@tab all
-lr = 0.03
-num_epochs = 3
-net = linreg
-loss = squared_loss
-```
-
-```{.python .input}
-for epoch in range(num_epochs):
-    for X, y in data_iter(batch_size, features, labels):
-        with autograd.record():
-            l = loss(net(X, w, b), y)  # Minibatch loss in `X` and `y`
-        # Because `l` has a shape (`batch_size`, 1) and is not a scalar
-        # variable, the elements in `l` are added together to obtain a new
-        # variable, on which gradients with respect to [`w`, `b`] are computed
-        l.backward()
-        sgd([w, b], lr, batch_size)  # Update parameters using their gradient
-    train_l = loss(net(features, w, b), labels)
-    print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')
-```
-
-```{.python .input}
-#@tab pytorch
-for epoch in range(num_epochs):
-    for X, y in data_iter(batch_size, features, labels):
-        l = loss(net(X, w, b), y)  # Minibatch loss in `X` and `y`
-        # Compute gradient on `l` with respect to [`w`, `b`]
-        l.sum().backward()
-        sgd([w, b], lr, batch_size)  # Update parameters using their gradient
-    with torch.no_grad():
-        train_l = loss(net(features, w, b), labels)
-        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')
-```
-
-```{.python .input}
-#@tab tensorflow
-for epoch in range(num_epochs):
-    for X, y in data_iter(batch_size, features, labels):
-        with tf.GradientTape() as g:
-            l = loss(net(X, w, b), y)  # Minibatch loss in `X` and `y`
-        # Compute gradient on l with respect to [`w`, `b`]
-        dw, db = g.gradient(l, [w, b])
-        # Update parameters using their gradient
-        sgd([w, b], [dw, db], lr, batch_size)
-    train_l = loss(net(features, w, b), labels)
-    print(f'epoch {epoch + 1}, loss {float(tf.reduce_mean(train_l)):f}')
-```
-
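-In this case, because we synthesized the dataset ourselves, we know precisely what the true parameters are. Thus, we can [**evaluate our success in training by comparing the true parameters with those that we learned**] through our training loop. Indeed, they turn out to be very close to each other.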
-
-```{.python .input}
-#@tab all
-print(f'error in estimating w: {true_w - d2l.reshape(w, true_w.shape)}')
-print(f'error in estimating b: {true_b - b}')
-```
-
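-Note that we should not take it for granted that we are able to recover the parameters perfectly. However, in machine learning, we are typically less concerned with recovering true underlying parameters, and more concerned with parameters that lead to highly accurate prediction. Fortunately, even on difficult optimization problems, stochastic gradient descent can often find remarkably good solutions, owing partly to the fact that, for deep networks, there exist many configurations of the parameters that lead to highly accurate prediction.
-
-## Summary
-
-* We saw how a deep network can be implemented and optimized from scratch, using just tensors and automatic differentiation, without any need for defining layers or fancy optimizers.
-* This section only scratches the surface of what is possible. In the following sections, we will describe additional models based on the concepts that we have just introduced and learn how to implement them more concisely.
-
-## Exercises
-
-1. What would happen if we were to initialize the weights to zero? Would the algorithm still work?
-1. Assume that you are [Georg Simon Ohm](https://en.wikipedia.org/wiki/Georg_Ohm) trying to come up with a model between voltage and current. Can you use automatic differentiation to learn the parameters of your model?
-1. Can you use [Planck's Law](https://en.wikipedia.org/wiki/Planck%27s_law) to determine the temperature of an object using spectral energy density?
-1. What are the problems you might encounter if you wanted to compute the second derivatives? How would you fix them?
-1. Why is the `reshape` function needed in the `squared_loss` function?
-1. Experiment using different learning rates to find out how fast the loss function value drops.
-1. If the number of examples cannot be divided by the batch size, what happens to the `data_iter` function's behavior?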
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/42)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/43)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/201)
-:end_tab:
diff --git a/chapter_linear-networks/linear-regression-scratch_origin.md b/chapter_linear-networks/linear-regression-scratch_origin.md
deleted file mode 100644
index 733cdcd..0000000
--- a/chapter_linear-networks/linear-regression-scratch_origin.md
+++ /dev/null
@@ -1,459 +0,0 @@
-# Linear Regression Implementation from Scratch
-:label:`sec_linear_scratch`
-
-Now that you understand the key ideas behind linear regression,
-we can begin to work through a hands-on implementation in code.
-In this section, (**we will implement the entire method from scratch,
-including the data pipeline, the model,
-the loss function, and the minibatch stochastic gradient descent optimizer.**)
-While modern deep learning frameworks can automate nearly all of this work,
-implementing things from scratch is the only way
-to make sure that you really know what you are doing.
-Moreover, when it comes time to customize models,
-defining our own layers or loss functions,
-understanding how things work under the hood will prove handy.
-In this section, we will rely only on tensors and automatic differentiation.
-Afterwards, we will introduce a more concise implementation,
-taking advantage of bells and whistles of deep learning frameworks.
-
-```{.python .input}
-%matplotlib inline
-from d2l import mxnet as d2l
-from mxnet import autograd, np, npx
-import random
-npx.set_np()
-```
-
-```{.python .input}
-#@tab pytorch
-%matplotlib inline
-from d2l import torch as d2l
-import torch
-import random
-```
-
-```{.python .input}
-#@tab tensorflow
-%matplotlib inline
-from d2l import tensorflow as d2l
-import tensorflow as tf
-import random
-```
-
-## Generating the Dataset
-
-To keep things simple, we will [**construct an artificial dataset
-according to a linear model with additive noise.**]
-Our task will be to recover this model's parameters
-using the finite set of examples contained in our dataset.
-We will keep the data low-dimensional so we can visualize it easily.
-In the following code snippet, we generate a dataset
-containing 1000 examples, each consisting of 2 features
-sampled from a standard normal distribution.
-Thus our synthetic dataset will be a matrix
-$\mathbf{X}\in \mathbb{R}^{1000 \times 2}$.
-
-(**The true parameters generating our dataset will be
-$\mathbf{w} = [2, -3.4]^\top$ and $b = 4.2$,
-and**) our synthetic labels will be assigned according
-to the following linear model with the noise term $\epsilon$:
-
-(**$$\mathbf{y}= \mathbf{X} \mathbf{w} + b + \mathbf\epsilon.$$**)
-
-You could think of $\epsilon$ as capturing potential
-measurement errors on the features and labels.
-We will assume that the standard assumptions hold and thus
-that $\epsilon$ obeys a normal distribution with mean of 0.
-To make our problem easy, we will set its standard deviation to 0.01.
-The following code generates our synthetic dataset.
-
-```{.python .input}
-#@tab mxnet, pytorch
-def synthetic_data(w, b, num_examples):  #@save
-    """Generate y = Xw + b + noise."""
-    X = d2l.normal(0, 1, (num_examples, len(w)))
-    y = d2l.matmul(X, w) + b
-    y += d2l.normal(0, 0.01, y.shape)
-    return X, d2l.reshape(y, (-1, 1))
-```
-
-```{.python .input}
-#@tab tensorflow
-def synthetic_data(w, b, num_examples):  #@save
-    """Generate y = Xw + b + noise."""
-    X = d2l.zeros((num_examples, w.shape[0]))
-    X += tf.random.normal(shape=X.shape)
-    y = d2l.matmul(X, tf.reshape(w, (-1, 1))) + b
-    y += tf.random.normal(shape=y.shape, stddev=0.01)
-    y = d2l.reshape(y, (-1, 1))
-    return X, y
-```
-
-```{.python .input}
-#@tab all
-true_w = d2l.tensor([2, -3.4])
-true_b = 4.2
-features, labels = synthetic_data(true_w, true_b, 1000)
-```
-
-Note that [**each row in `features` consists of a 2-dimensional data example
-and that each row in `labels` consists of a 1-dimensional label value (a scalar).**]
-
-```{.python .input}
-#@tab all
-print('features:', features[0],'\nlabel:', labels[0])
-```
-
-By generating a scatter plot using the second feature `features[:, 1]` and `labels`,
-we can clearly observe the linear correlation between the two.
-
-```{.python .input}
-#@tab all
-d2l.set_figsize()
-# The semicolon is for displaying the plot only
-d2l.plt.scatter(d2l.numpy(features[:, 1]), d2l.numpy(labels), 1);
-```
-
-## Reading the Dataset
-
-Recall that training models consists of
-making multiple passes over the dataset,
-grabbing one minibatch of examples at a time,
-and using them to update our model.
-Since this process is so fundamental
-to training machine learning algorithms,
-it is worth defining a utility function
-to shuffle the dataset and access it in minibatches.
-
-In the following code, we [**define the `data_iter` function**] (~~that~~)
-to demonstrate one possible implementation of this functionality.
-The function (**takes a batch size, a matrix of features,
-and a vector of labels, yielding minibatches of the size `batch_size`.**)
-Each minibatch consists of a tuple of features and labels.
-
-```{.python .input}
-#@tab mxnet, pytorch
-def data_iter(batch_size, features, labels):
-    num_examples = len(features)
-    indices = list(range(num_examples))
-    # The examples are read at random, in no particular order
-    random.shuffle(indices)
-    for i in range(0, num_examples, batch_size):
-        batch_indices = d2l.tensor(
-            indices[i: min(i + batch_size, num_examples)])
-        yield features[batch_indices], labels[batch_indices]
-```
-
-```{.python .input}
-#@tab tensorflow
-def data_iter(batch_size, features, labels):
-    num_examples = len(features)
-    indices = list(range(num_examples))
-    # The examples are read at random, in no particular order
-    random.shuffle(indices)
-    for i in range(0, num_examples, batch_size):
-        j = tf.constant(indices[i: min(i + batch_size, num_examples)])
-        yield tf.gather(features, j), tf.gather(labels, j)
-```
-
-In general, note that we want to use reasonably sized minibatches
-to take advantage of the GPU hardware,
-which excels at parallelizing operations.
-Because each example can be fed through our models in parallel
-and the gradient of the loss function for each example can also be taken in parallel,
-GPUs allow us to process hundreds of examples in scarcely more time
-than it might take to process just a single example.
-
-To build some intuition, let us read and print
-the first small batch of data examples.
-The shape of the features in each minibatch tells us
-both the minibatch size and the number of input features.
-Likewise, our minibatch of labels will have a shape given by `batch_size`.
-
-```{.python .input}
-#@tab all
-batch_size = 10
-
-for X, y in data_iter(batch_size, features, labels):
-    print(X, '\n', y)
-    break
-```
-
-As we run the iteration, we obtain distinct minibatches
-successively until the entire dataset has been exhausted (try this).
-While the iteration implemented above is good for didactic purposes,
-it is inefficient in ways that might get us in trouble on real problems.
-For example, it requires that we load all the data in memory
-and that we perform lots of random memory access.
-The built-in iterators implemented in a deep learning framework
-are considerably more efficient and they can deal
-with both data stored in files and data fed via data streams.
-
-
-## Initializing Model Parameters
-
-[**Before we can begin optimizing our model's parameters**] by minibatch stochastic gradient descent,
-(**we need to have some parameters in the first place.**)
-In the following code, we initialize weights by sampling
-random numbers from a normal distribution with mean 0
-and a standard deviation of 0.01, and setting the bias to 0.
-
-```{.python .input}
-w = np.random.normal(0, 0.01, (2, 1))
-b = np.zeros(1)
-w.attach_grad()
-b.attach_grad()
-```
-
-```{.python .input}
-#@tab pytorch
-w = torch.normal(0, 0.01, size=(2,1), requires_grad=True)
-b = torch.zeros(1, requires_grad=True)
-```
-
-```{.python .input}
-#@tab tensorflow
-w = tf.Variable(tf.random.normal(shape=(2, 1), mean=0, stddev=0.01),
-                trainable=True)
-b = tf.Variable(tf.zeros(1), trainable=True)
-```
-
-After initializing our parameters,
-our next task is to update them until
-they fit our data sufficiently well.
-Each update requires taking the gradient
-of our loss function with respect to the parameters.
-Given this gradient, we can update each parameter
-in the direction that may reduce the loss.
-
-Since nobody wants to compute gradients explicitly
-(this is tedious and error prone),
-we use automatic differentiation,
-as introduced in :numref:`sec_autograd`, to compute the gradient.
-
-
-## Defining the Model
-
-Next, we must [**define our model,
-relating its inputs and parameters to its outputs.**]
-Recall that to calculate the output of the linear model,
-we simply take the matrix-vector product
-of the input features $\mathbf{X}$ and the model weights $\mathbf{w}$,
-and add the offset $b$ to each example.
-Note that below $\mathbf{Xw}$ is a vector and $b$ is a scalar.
-Recall the broadcasting mechanism as described in :numref:`subsec_broadcasting`.
-When we add a vector and a scalar,
-the scalar is added to each component of the vector.
-
-```{.python .input}
-#@tab all
-def linreg(X, w, b):  #@save
-    """The linear regression model."""
-    return d2l.matmul(X, w) + b
-```
-
-## Defining the Loss Function
-
-Since [**updating our model requires taking
-the gradient of our loss function,**]
-we ought to (**define the loss function first.**)
-Here we will use the squared loss function
-as described in :numref:`sec_linear_regression`.
-In the implementation, we need to reshape the true value `y`
-to the shape of the predicted value `y_hat`.
-The result returned by the following function
-will also have the same shape as `y_hat`.
-
-```{.python .input}
-#@tab all
-def squared_loss(y_hat, y):  #@save
-    """Squared loss."""
-    return (y_hat - d2l.reshape(y, y_hat.shape)) ** 2 / 2
-```
-
-## Defining the Optimization Algorithm
-
-As we discussed in :numref:`sec_linear_regression`,
-linear regression has a closed-form solution.
-However, this is not a book about linear regression:
-it is a book about deep learning.
-Since none of the other models that this book introduces
-can be solved analytically, we will take this opportunity to introduce your first working example of
-minibatch stochastic gradient descent.
-[~~Although linear regression has a closed-form solution, other models in this book do not. Here we introduce minibatch stochastic gradient descent.~~]
-
-At each step, using one minibatch randomly drawn from our dataset,
-we will estimate the gradient of the loss with respect to our parameters.
-Next, we will update our parameters
-in the direction that may reduce the loss.
-The following code applies the minibatch stochastic gradient descent update,
-given a set of parameters, a learning rate, and a batch size.
-The size of the update step is determined by the learning rate `lr`.
-Because our loss is calculated as a sum over the minibatch of examples,
-we normalize our step size by the batch size (`batch_size`),
-so that the magnitude of a typical step size
-does not depend heavily on our choice of the batch size.
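-
-As a brief aside, the effect of this normalization can be written out explicitly (a sketch in the notation of :numref:`sec_linear_regression`, with $l^{(i)}$ denoting the loss on example $i$, $\mathcal{B}$ the minibatch, and $\eta$ the learning rate `lr`). Dividing the gradient of the summed minibatch loss by $|\mathcal{B}|$ is the same as taking a step of size $\eta$ along the gradient of the average minibatch loss:
-
-$$\frac{\eta}{|\mathcal{B}|} \, \partial_{(\mathbf{w},b)} \sum_{i \in \mathcal{B}} l^{(i)}(\mathbf{w}, b) = \eta \, \partial_{(\mathbf{w},b)} \left( \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l^{(i)}(\mathbf{w}, b) \right).$$
-
-Without the division, doubling the batch size would roughly double the length of a typical step for the same `lr`.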
-
-```{.python .input}
-def sgd(params, lr, batch_size):  #@save
-    """Minibatch stochastic gradient descent."""
-    for param in params:
-        param[:] = param - lr * param.grad / batch_size
-```
-
-```{.python .input}
-#@tab pytorch
-def sgd(params, lr, batch_size):  #@save
-    """Minibatch stochastic gradient descent."""
-    with torch.no_grad():
-        for param in params:
-            param -= lr * param.grad / batch_size
-            param.grad.zero_()
-```
-
-```{.python .input}
-#@tab tensorflow
-def sgd(params, grads, lr, batch_size):  #@save
-    """Minibatch stochastic gradient descent."""
-    for param, grad in zip(params, grads):
-        param.assign_sub(lr * grad / batch_size)
-```
-
-## Training
-
-Now that we have all of the parts in place,
-we are ready to [**implement the main training loop.**]
-It is crucial that you understand this code
-because you will see nearly identical training loops
-over and over again throughout your career in deep learning.
-
-In each iteration, we will grab a minibatch of training examples,
-and pass them through our model to obtain a set of predictions.
-After calculating the loss, we initiate the backward pass through the network,
-storing the gradients with respect to each parameter.
-Finally, we will call the optimization algorithm `sgd`
-to update the model parameters.
-
-In summary, we will execute the following loop:
-
-* Initialize parameters $(\mathbf{w}, b)$
-* Repeat until done
-    * Compute gradient $\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)$
-    * Update parameters $(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}$
-
-In each *epoch*,
-we will iterate through the entire dataset
-(using the `data_iter` function) once,
-passing through every example in the training dataset
-(assuming that the number of examples is divisible by the batch size).
-The number of epochs `num_epochs` and the learning rate `lr` are both hyperparameters,
-which we set here to 3 and 0.03, respectively.
-Unfortunately, setting hyperparameters is tricky
-and requires some adjustment by trial and error.
-We elide these details for now but revisit them
-later in
-:numref:`chap_optimization`.
-
-```{.python .input}
-#@tab all
-lr = 0.03
-num_epochs = 3
-net = linreg
-loss = squared_loss
-```
-
-```{.python .input}
-for epoch in range(num_epochs):
-    for X, y in data_iter(batch_size, features, labels):
-        with autograd.record():
-            l = loss(net(X, w, b), y)  # Minibatch loss in `X` and `y`
-        # Because `l` has a shape (`batch_size`, 1) and is not a scalar
-        # variable, the elements in `l` are added together to obtain a new
-        # variable, on which gradients with respect to [`w`, `b`] are computed
-        l.backward()
-        sgd([w, b], lr, batch_size)  # Update parameters using their gradient
-    train_l = loss(net(features, w, b), labels)
-    print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')
-```
-
-```{.python .input}
-#@tab pytorch
-for epoch in range(num_epochs):
-    for X, y in data_iter(batch_size, features, labels):
-        l = loss(net(X, w, b), y)  # Minibatch loss in `X` and `y`
-        # Compute gradient on `l` with respect to [`w`, `b`]
-        l.sum().backward()
-        sgd([w, b], lr, batch_size)  # Update parameters using their gradient
-    with torch.no_grad():
-        train_l = loss(net(features, w, b), labels)
-        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')
-```
-
-```{.python .input}
-#@tab tensorflow
-for epoch in range(num_epochs):
-    for X, y in data_iter(batch_size, features, labels):
-        with tf.GradientTape() as g:
-            l = loss(net(X, w, b), y)  # Minibatch loss in `X` and `y`
-        # Compute gradient on l with respect to [`w`, `b`]
-        dw, db = g.gradient(l, [w, b])
-        # Update parameters using their gradient
-        sgd([w, b], [dw, db], lr, batch_size)
-    train_l = loss(net(features, w, b), labels)
-    print(f'epoch {epoch + 1}, loss {float(tf.reduce_mean(train_l)):f}')
-```
-
-In this case, because we synthesized the dataset ourselves,
-we know precisely what the true parameters are.
-Thus, we can [**evaluate our success in training
-by comparing the true parameters
-with those that we learned**] through our training loop.
-Indeed, they turn out to be very close to each other.
-
-```{.python .input}
-#@tab all
-print(f'error in estimating w: {true_w - d2l.reshape(w, true_w.shape)}')
-print(f'error in estimating b: {true_b - b}')
-```
-
-Note that we should not take it for granted
-that we are able to recover the parameters perfectly.
-However, in machine learning, we are typically less concerned
-with recovering true underlying parameters,
-and more concerned with parameters that lead to highly accurate prediction.
-Fortunately, even on difficult optimization problems,
-stochastic gradient descent can often find remarkably good solutions,
-owing partly to the fact that, for deep networks,
-there exist many configurations of the parameters
-that lead to highly accurate prediction.
-
-
-## Summary
-
-* We saw how a deep network can be implemented and optimized from scratch, using just tensors and automatic differentiation, without any need for defining layers or fancy optimizers.
-* This section only scratches the surface of what is possible. In the following sections, we will describe additional models based on the concepts that we have just introduced and learn how to implement them more concisely.
-
-
-## Exercises
-
-1. What would happen if we were to initialize the weights to zero? Would the algorithm still work?
-1. Assume that you are
-   [Georg Simon Ohm](https://en.wikipedia.org/wiki/Georg_Ohm) trying to come up
-   with a model relating voltage and current. Can you use automatic differentiation to learn the parameters of your model?
-1. Can you use [Planck's Law](https://en.wikipedia.org/wiki/Planck%27s_law) to determine the temperature of an object using spectral energy density?
-1. What are the problems you might encounter if you wanted to compute the second derivatives? How would you fix them?
-1. Why is the `reshape` function needed in the `squared_loss` function?
-1. Experiment using different learning rates to find out how fast the loss function value drops.
-1. If the number of examples cannot be divided by the batch size, what happens to the `data_iter` function's behavior?
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/42)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/43)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/201)
-:end_tab:
diff --git a/chapter_linear-networks/linear-regression.md b/chapter_linear-networks/linear-regression.md
deleted file mode 100644
index 34521f1..0000000
--- a/chapter_linear-networks/linear-regression.md
+++ /dev/null
@@ -1,334 +0,0 @@
-# Linear Regression
-:label:`sec_linear_regression`
-
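-*Regression* refers to a set of methods for modeling
-the relationship between one or more independent variables and a dependent variable. In the natural sciences and social sciences, the purpose of regression is most often to
-*characterize* the relationship between the inputs and outputs.
-Machine learning, on the other hand, is most often concerned with *prediction*.
-
-Regression problems pop up whenever we want to predict a numerical value. Common examples include predicting prices (of homes, stocks, etc.), predicting length of stay (for patients in the hospital), and demand forecasting (for retail sales). Not every prediction problem is a classic regression problem. In subsequent sections, we will introduce classification problems, where the goal is to predict membership among a set of categories.
-
-## Basic Elements of Linear Regression
-
-*Linear regression* may be both the simplest
-and most popular among the standard tools for regression. Dating back to the dawn of the 19th century, linear regression flows from a few simple assumptions. First, we assume that the relationship between the independent variables $\mathbf{x}$ and the dependent variable $y$ is linear, i.e., that $y$ can be expressed as a weighted sum of the elements in $\mathbf{x}$, given some noise on the observations. Second, we assume that any noise is well-behaved (following a Gaussian distribution).
-
-To motivate the approach, let us start with a running example. Suppose that we wish to estimate the prices of houses (in dollars) based on their area (in square feet) and age (in years). To actually develop a model for predicting house prices, we would need to get our hands on a dataset consisting of sales for which we know the sale price, area, and age for each home. In the terminology of machine learning, the dataset is called a *training dataset* or *training set*, and each row (here the data corresponding to one sale) is called an *example* (or *data point*, *data instance*, *sample*). The thing we are trying to predict (price) is called a *label* (or *target*). The independent variables (age and area) upon which the predictions are based are called *features* (or *covariates*).
-
-Typically, we will use $n$ to denote the number of examples in our dataset. We index the data examples by $i$, denoting each input as $\mathbf{x}^{(i)} = [x_1^{(i)}, x_2^{(i)}]^\top$ and the corresponding label as $y^{(i)}$.
-
-### Linear Model
-:label:`subsec_linear_model`
-
-The linearity assumption just says that the target (price) can be expressed as a weighted sum of the features (area and age):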
-
-$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b.$$
-:eqlabel:`eq_price-area`
-
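-In :eqref:`eq_price-area`, $w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called *weights*, and $b$ is called a *bias* (also called an *offset* or *intercept*). The weights determine the influence of each feature on our prediction, and the bias just says what value the predicted price should take when all of the features take value 0. Even if we will never see any homes with zero area, or that are precisely zero years old, we still need the bias or else we will limit the expressivity of our model. Strictly speaking, :eqref:`eq_price-area` is an *affine transformation* of input features, which is characterized by a *linear transformation* of features via weighted sum, combined with a *translation* via the added bias.
-
-Given a dataset, our goal is to choose the weights $\mathbf{w}$ and the bias $b$ such that on average, the predictions made according to our model best fit the true prices observed in the data. Models whose output prediction is determined by the affine transformation of input features are *linear models*, where the affine transformation is specified by the chosen weights and bias.
-
-In disciplines where it is common to focus on datasets with just a few features, explicitly expressing models long-form like this is common. In machine learning, we usually work with high-dimensional datasets, so it is more convenient to employ linear algebra notation. When our inputs consist of $d$ features, we express our prediction $\hat{y}$ (in general the "hat" symbol denotes estimates) as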
-
-$$\hat{y} = w_1  x_1 + ... + w_d  x_d + b.$$
-
-Collecting all features into a vector $\mathbf{x} \in \mathbb{R}^d$ and all weights into a vector $\mathbf{w} \in \mathbb{R}^d$, we can express our model compactly using a dot product:
-
-$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b.$$
-:eqlabel:`eq_linreg-y`
-
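-In :eqref:`eq_linreg-y`, the vector $\mathbf{x}$ corresponds to features of a single data example. We will often find it convenient to refer to features of our entire dataset of $n$ examples via the *design matrix* $\mathbf{X} \in \mathbb{R}^{n \times d}$. Here, $\mathbf{X}$ contains one row for every example and one column for every feature.
-
-For a collection of features $\mathbf{X}$, the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed via the matrix-vector product: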
-
-$${\hat{\mathbf{y}}} = \mathbf{X} \mathbf{w} + b,$$
-
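-where broadcasting (see :numref:`subsec_broadcasting`) is applied during the summation. Given features of a training dataset $\mathbf{X}$ and corresponding (known) labels $\mathbf{y}$, the goal of linear regression is to find the weight vector $\mathbf{w}$ and the bias term $b$ such that, given features of a new data example sampled from the same distribution as $\mathbf{X}$, the new example's label will (in expectation) be predicted with the lowest error.
-
-Even if we believe that the best model for predicting $y$ given $\mathbf{x}$ is linear, we would not expect to find a real-world dataset of $n$ examples where $y^{(i)}$ exactly equals $\mathbf{w}^\top \mathbf{x}^{(i)}+b$ for all $1 \leq i \leq n$. For example, whatever instruments we use to observe the features $\mathbf{X}$ and labels $\mathbf{y}$ might suffer a small amount of measurement error. Thus, even when we are confident that the underlying relationship is linear, we will incorporate a noise term to account for such errors.
-
-Before we can go about searching for the best *parameters* (or *model parameters*) $\mathbf{w}$ and $b$, we will need two more things: (i) a quality measure for some given model; and (ii) a procedure for updating the model to improve its quality.
-
-### Loss Function
-
-Before we start thinking about how to *fit* data with our model, we need to determine a measure of *fitness*. The *loss function* quantifies the distance between the *real* and *predicted* value of the target. The loss will usually be a non-negative number where smaller values are better and perfect predictions incur a loss of 0. The most popular loss function in regression problems is the squared error. When our prediction for an example $i$ is $\hat{y}^{(i)}$ and the corresponding true label is $y^{(i)}$, the squared error is given by: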
-
-$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2.$$
-:eqlabel:`eq_mse`
-
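-The constant $\frac{1}{2}$ makes no real difference but will prove notationally convenient, canceling out when we take the derivative of the loss. Since the training dataset is given to us, and thus out of our control, the empirical error is only a function of the model parameters. To make things more concrete, consider the example below where we plot a regression problem for a one-dimensional case as shown in :numref:`fig_fit_linreg`.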
-
-![Fit data with a linear model.](../img/fit-linreg.svg)
-:label:`fig_fit_linreg`
-
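-Note that large differences between estimates $\hat{y}^{(i)}$ and observations $y^{(i)}$ lead to even larger contributions to the loss, due to the quadratic dependence. To measure the quality of a model on the entire dataset of $n$ examples, we simply average (or equivalently, sum) the losses on the training set.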
-
-$$L(\mathbf{w}, b) =\frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) =\frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$
-
-When training the model, we want to find parameters ($\mathbf{w}^*, b^*$) that minimize the total loss across all training examples:
-
-$$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\  L(\mathbf{w}, b).$$
-
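-### Analytic Solution
-
-Linear regression happens to be an unusually simple optimization problem. Unlike most other models that we will encounter in this book, linear regression can be solved analytically by applying a simple formula. To start, we can subsume the bias $b$ into the parameter $\mathbf{w}$ by appending a column to the design matrix consisting of all ones. Then our prediction problem is to minimize $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$. There is just one critical point on the loss surface and it corresponds to the minimum of the loss over the entire domain. Taking the derivative of the loss with respect to $\mathbf{w}$ and setting it equal to zero yields the analytic (closed-form) solution: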
-
-$$\mathbf{w}^* = (\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf{y}.$$
-
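-While simple problems like linear regression may admit analytic solutions, you should not get used to such good fortune. Although analytic solutions allow for nice mathematical analysis, the requirement of an analytic solution is so restrictive that it would exclude all of deep learning.
-
-### Minibatch Stochastic Gradient Descent
-
-Even in cases where we cannot solve the models analytically, it turns out that we can still train models effectively in practice. Moreover, for many tasks, those difficult-to-optimize models turn out to be so much better that figuring out how to train them ends up being well worth the trouble.
-
-The key technique for optimizing nearly any deep learning model consists of iteratively reducing the error by updating the parameters in the direction that incrementally lowers the loss function. This algorithm is called *gradient descent*.
-
-The most naive application of gradient descent consists of taking the derivative of the loss function, which is an average of the losses computed on every single example in the dataset. In practice, this can be extremely slow: we must pass over the entire dataset before making a single update. Thus, we will often settle for sampling a random minibatch of examples every time we need to compute the update, a variant called *minibatch stochastic gradient descent*.
-
-In each iteration, we first randomly sample a minibatch $\mathcal{B}$ consisting of a fixed number of training examples. We then compute the derivative (gradient) of the average loss on the minibatch with regard to the model parameters. Finally, we multiply the gradient by a predetermined positive value $\eta$ and subtract the resulting term from the current parameter values.
-
-We can express the update mathematically as follows ($\partial$ denotes the partial derivative):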
-
-$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b).$$
-
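-To summarize, the steps of the algorithm are the following: (i) we initialize the values of the model parameters, typically at random; (ii) we iteratively sample random minibatches from the data, updating the parameters in the direction of the negative gradient. For quadratic losses and affine transformations, we can write this out explicitly as follows: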
-
-$$\begin{aligned} \mathbf{w} &\leftarrow \mathbf{w} -   \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right),\\ b &\leftarrow b -  \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b)  = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned}$$
-:eqlabel:`eq_linreg_batch_update`
-
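-Note that $\mathbf{w}$ and $\mathbf{x}$ are vectors in :eqref:`eq_linreg_batch_update`. Here, the more elegant vector notation makes the math much more readable than expressing things in terms of coefficients, say $w_1, w_2, \ldots, w_d$. The set cardinality $|\mathcal{B}|$ represents the number of examples in each minibatch (the *batch size*) and $\eta$ denotes the *learning rate*. We emphasize that the values of the batch size and learning rate are manually pre-specified and not typically learned through model training. These parameters that are tunable but not updated in the training loop are called *hyperparameters*.
-*Hyperparameter tuning* is the process by which hyperparameters are chosen,
-and typically requires that we adjust them based on the results of the training loop as assessed on a separate *validation dataset* (or *validation set*).
-
-After training for some predetermined number of iterations (or until some other stopping criteria are met), we record the estimated model parameters, denoted $\hat{\mathbf{w}}, \hat{b}$. Note that even if our function is truly linear and noiseless, these parameters will not be the exact minimizers of the loss because, although the algorithm converges slowly towards the minimizers, it cannot achieve them exactly in a finite number of steps.
-
-Linear regression happens to be a learning problem where there is only one minimum over the entire domain. However, for more complicated models, like deep networks, the loss surfaces contain many minima. Fortunately, for reasons that are not yet fully understood, deep learning practitioners seldom struggle to find parameters that minimize the loss *on training sets*. The more formidable task is to find parameters that will achieve low loss on data that we have not seen before, a challenge called *generalization*. We return to these topics throughout the book.
-
-### Making Predictions with the Learned Model
-
-Given the learned linear regression model $\hat{\mathbf{w}}^\top \mathbf{x} + \hat{b}$, we can now estimate the price of a new house (not contained in the training data) given its area $x_1$ and age $x_2$. Estimating targets given features is commonly called *prediction* or *inference*.
-
-We will try to stick with *prediction* because calling this step *inference*, despite emerging as standard jargon in deep learning, is somewhat of a misnomer. In statistics, *inference* more often denotes estimating parameters based on a dataset. This misuse of terminology is a common source of confusion when deep learning practitioners talk to statisticians.
-
-## Vectorization for Speed
-
-When training our models, we typically want to process whole minibatches of examples simultaneously. Doing this efficiently requires that (**we**) (~~should~~) (**vectorize the calculations and leverage fast linear algebra libraries rather than writing costly for-loops in Python.**)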
-
-```{.python .input}
-%matplotlib inline
-from d2l import mxnet as d2l
-import math
-import numpy as np
-import time
-```
-
-```{.python .input}
-#@tab pytorch
-%matplotlib inline
-from d2l import torch as d2l
-import math
-import torch
-import numpy as np
-import time
-```
-
-```{.python .input}
-#@tab tensorflow
-%matplotlib inline
-from d2l import tensorflow as d2l
-import math
-import tensorflow as tf
-import numpy as np
-import time
-```
-
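-To illustrate why this matters so much, we can (**consider two methods for adding vectors.**) To start, we instantiate two 10000-dimensional vectors containing all ones. In one method, we loop over the vectors with a Python for-loop. In the other method, we rely on a single call to `+`.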
-
-```{.python .input}
-#@tab all
-n = 10000
-a = d2l.ones(n)
-b = d2l.ones(n)
-```
-
-Since we will benchmark the running time frequently in this book, [**let us define a timer**].
-
-```{.python .input}
-#@tab all
-class Timer:  #@save
-    """Record multiple running times."""
-    def __init__(self):
-        self.times = []
-        self.start()
-
-    def start(self):
-        """Start the timer."""
-        self.tik = time.time()
-
-    def stop(self):
-        """Stop the timer and record the time in a list."""
-        self.times.append(time.time() - self.tik)
-        return self.times[-1]
-
-    def avg(self):
-        """Return the average time."""
-        return sum(self.times) / len(self.times)
-
-    def sum(self):
-        """Return the sum of time."""
-        return sum(self.times)
-
-    def cumsum(self):
-        """Return the accumulated time."""
-        return np.array(self.times).cumsum().tolist()
-```
-
-Now we can benchmark the workloads. First, [**we add them, one coordinate at a time, using a for-loop.**]
-
-```{.python .input}
-#@tab mxnet, pytorch
-c = d2l.zeros(n)
-timer = Timer()
-for i in range(n):
-    c[i] = a[i] + b[i]
-f'{timer.stop():.5f} sec'
-```
-
-```{.python .input}
-#@tab tensorflow
-c = tf.Variable(d2l.zeros(n))
-timer = Timer()
-for i in range(n):
-    c[i].assign(a[i] + b[i])
-f'{timer.stop():.5f} sec'
-```
-
-(**Alternatively, we rely on the overloaded `+` operator to compute the elementwise sum.**)
-
-```{.python .input}
-#@tab all
-timer.start()
-d = a + b
-f'{timer.stop():.5f} sec'
-```
-
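-You probably noticed that the second method is dramatically faster than the first. Vectorizing code often yields order-of-magnitude speedups. Moreover, we push more of the mathematics to the library and need not write as many calculations ourselves, reducing the potential for errors.
-
-## The Normal Distribution and Squared Loss
-:label:`subsec_normal_distribution_and_squared_loss`
-
-While you can already get your hands dirty using only the information above, in the following we can more formally motivate the squared loss objective via assumptions about the distribution of noise.
-
-Linear regression was invented by Gauss in 1795, who also discovered the normal distribution (also called the *Gaussian*). It turns out that the connection between the normal distribution and linear regression runs deeper than common parentage. To refresh your memory, the probability density of a normal distribution with mean $\mu$ and variance $\sigma^2$ (standard deviation $\sigma$) is given as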
-
-$$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (x - \mu)^2\right).$$
-
-Below [**we define a Python function to compute the normal distribution**].
-
-```{.python .input}
-#@tab all
-def normal(x, mu, sigma):
-    p = 1 / math.sqrt(2 * math.pi * sigma**2)
-    return p * np.exp(-0.5 / sigma**2 * (x - mu)**2)
-```
-
-We can now (**visualize the normal distributions**).
-
-```{.python .input}
-#@tab all
-# Use numpy again for visualization
-x = np.arange(-7, 7, 0.01)
-
-# Mean and standard deviation pairs
-params = [(0, 1), (0, 2), (3, 1)]
-d2l.plot(x, [normal(x, mu, sigma) for mu, sigma in params], xlabel='x',
-         ylabel='p(x)', figsize=(4.5, 2.5),
-         legend=[f'mean {mu}, std {sigma}' for mu, sigma in params])
-```
-
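-As we can see, changing the mean corresponds to a shift along the $x$-axis, and increasing the variance spreads the distribution out, lowering its peak.
-
-One way to motivate linear regression with the mean squared error loss function (or simply squared loss) is to formally assume that observations arise from noisy measurements, where the noise is normally distributed as follows: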
-
-$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon \text{ where } \epsilon \sim \mathcal{N}(0, \sigma^2).$$
-
-Thus, we can now write out the *likelihood* of seeing a particular $y$ for a given $\mathbf{x}$ via
-
-$$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (y - \mathbf{w}^\top \mathbf{x} - b)^2\right).$$
-
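-Now, according to the principle of maximum likelihood, the best values of parameters $\mathbf{w}$ and $b$ are those that maximize the *likelihood* of the entire dataset: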
-
-$$P(\mathbf y \mid \mathbf X) = \prod_{i=1}^{n} p(y^{(i)}|\mathbf{x}^{(i)}).$$
-
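-Estimators chosen according to the principle of maximum likelihood are called *maximum likelihood estimators*. While maximizing the product of many exponential functions might look difficult, we can simplify things significantly, without changing the objective, by maximizing the log of the likelihood instead. For historical reasons, optimizations are more often expressed as minimization rather than maximization. So, without changing anything, we can minimize the *negative log-likelihood* $-\log P(\mathbf y \mid \mathbf X)$. Working out the mathematics gives us: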
-
-$$-\log P(\mathbf y \mid \mathbf X) = \sum_{i=1}^n \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2.$$
-
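-Now we just need one more assumption: that $\sigma$ is some fixed constant. Thus we can ignore the first term because it does not depend on $\mathbf{w}$ or $b$. Now the second term is identical to the squared error loss introduced earlier, except for the multiplicative constant $\frac{1}{\sigma^2}$. Fortunately, the solution does not depend on $\sigma$. It follows that minimizing the mean squared error is equivalent to maximum likelihood estimation of a linear model under the assumption of additive Gaussian noise.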
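-
-To spell out this last step (a sketch, assuming as above that $\sigma$ is a fixed constant that does not depend on $\mathbf{w}$ or $b$), dropping the constant first term and the positive factor $\frac{1}{\sigma^2}$ leaves the minimizer unchanged:
-
-$$\operatorname*{argmin}_{\mathbf{w}, b} \left(-\log P(\mathbf y \mid \mathbf X)\right) = \operatorname*{argmin}_{\mathbf{w}, b} \sum_{i=1}^n \frac{1}{2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2 = \operatorname*{argmin}_{\mathbf{w}, b} L(\mathbf{w}, b),$$
-
-where $L$ is the average squared loss defined earlier; the final equality holds because dividing by $n$ does not change the minimizer either.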
-
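-## From Linear Regression to Deep Networks
-
-So far we have only talked about linear models. While neural networks cover a much richer family of models, we can begin thinking of the linear model as a neural network by expressing it in the language of neural networks. To begin, let us start by rewriting things in a "layer" notation.
-
-### Neural Network Diagram
-
-Deep learning practitioners like to draw diagrams to visualize what is happening in their models. In :numref:`fig_single_neuron`, we depict our linear regression model as a neural network. Note that these diagrams highlight the connectivity pattern, such as how each input is connected to the output, but not the values taken by the weights or biases.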
-
-![Linear regression is a single-layer neural network.](../img/singleneuron.svg)
-:label:`fig_single_neuron`
-
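-For the neural network shown in :numref:`fig_single_neuron`, the inputs are $x_1, \ldots, x_d$, so the *number of inputs* (or *feature dimensionality*) in the input layer is $d$. The output of the network in :numref:`fig_single_neuron` is $o_1$, so the *number of outputs* in the output layer is 1. Note that the input values are all *given* and there is just a single *computed* neuron. Focusing on where computation takes place, conventionally we do not consider the input layer when counting layers. That is to say, the *number of layers* for the neural network in :numref:`fig_single_neuron` is 1. We can think of linear regression models as neural networks consisting of just a single artificial neuron, or as single-layer neural networks.
-
-Since for linear regression, every input is connected to every output (in this case there is only one output), we can regard this transformation (the output layer in :numref:`fig_single_neuron`) as a *fully-connected layer* or *dense layer*. We will talk a lot more about networks composed of such layers in the next chapter.
-
-### Biology
-
-Since linear regression (invented in 1795) predates computational neuroscience, it might seem anachronistic to describe linear regression as a neural network. To see why linear models were a natural place to begin when the cyberneticists/neurophysiologists Warren McCulloch and Walter Pitts began to develop models of artificial neurons, consider the cartoonish picture of a biological neuron in :numref:`fig_Neuron`, consisting of
-*dendrites* (input terminals),
-the *nucleus* (CPU), the *axon* (output wire), and the *axon terminals* (output terminals), enabling connections to other neurons via *synapses*.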
-
-![The real neuron.](../img/neuron.svg)
-:label:`fig_Neuron`
-
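-Information $x_i$ arriving from other neurons (or environmental sensors such as the retina) is received in the dendrites. In particular, that information is weighted by *synaptic weights* $w_i$ determining the effect of the inputs (e.g., activation or inhibition via the product $x_i w_i$). The weighted inputs arriving from multiple sources are aggregated in the nucleus as a weighted sum $y = \sum_i x_i w_i + b$, and this information is then sent for further processing in the axon $y$, typically after some nonlinear processing via $\sigma(y)$. From there it either reaches its destination (e.g., a muscle) or is fed into another neuron via its dendrites.
-
-Certainly, the high-level idea that many such units could be cobbled together with the right connectivity and right learning algorithm, to produce far more interesting and complex behavior than any one neuron alone could express, owes to our study of real biological neural systems.
-
-At the same time, most research in deep learning today draws little direct inspiration from neuroscience. We invoke Stuart Russell and Peter Norvig who, in their classic AI textbook
-*Artificial Intelligence: A Modern Approach* :cite:`Russell.Norvig.2016`,
-pointed out that although airplanes might have been *inspired* by birds, ornithology has not been the primary driver of aeronautics innovation for some centuries. Likewise, inspiration in deep learning these days comes in equal or greater measure from mathematics, statistics, and computer science.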
-
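-## Summary
-
-* Key ingredients in a machine learning model are training data, a loss function, an optimization algorithm, and quite obviously, the model itself.
-* Vectorizing makes everything better (mostly math) and faster (mostly code).
-* Minimizing an objective function and performing maximum likelihood estimation can mean the same thing.
-* Linear regression models are neural networks, too.
-
-## Exercises
-
-1. Assume that we have some data $x_1, \ldots, x_n \in \mathbb{R}$. Our goal is to find a constant $b$ such that $\sum_i (x_i - b)^2$ is minimized.
-    1. Find an analytic solution for the optimal value of $b$.
-    1. How does this problem and its solution relate to the normal distribution?
-1. Derive the analytic solution to the optimization problem for linear regression with squared error. To keep things simple, you can omit the bias $b$ from the problem (we can do this in principled fashion by adding one column to $\mathbf X$ consisting of all ones).
-    1. Write out the optimization problem in matrix and vector notation (treat all the data as a single matrix, and all the target values as a single vector).
-    1. Compute the gradient of the loss with respect to $w$.
-    1. Find the analytic solution by setting the gradient equal to zero and solving the matrix equation.
-    1. When might this be better than using stochastic gradient descent? When might this method break?
-1. Assume that the noise model governing the additive noise $\epsilon$ is the exponential distribution. That is, $p(\epsilon) = \frac{1}{2} \exp(-|\epsilon|)$.
-    1. Write out the negative log-likelihood of the data under the model $-\log P(\mathbf y \mid \mathbf X)$.
-    1. Can you find a closed form solution?
-    1. Suggest a stochastic gradient descent algorithm to solve this problem. What could possibly go wrong (hint: what happens near the stationary point as we keep on updating the parameters)? Can you fix this?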
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/40)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/258)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/259)
-:end_tab:
diff --git a/chapter_linear-networks/linear-regression_origin.md b/chapter_linear-networks/linear-regression_origin.md
deleted file mode 100644
index c3b7830..0000000
--- a/chapter_linear-networks/linear-regression_origin.md
+++ /dev/null
@@ -1,668 +0,0 @@
-# Linear Regression
-:label:`sec_linear_regression`
-
-*Regression* refers to a set of methods for modeling
-the relationship between one or more independent variables
-and a dependent variable.
-In the natural sciences and social sciences,
-the purpose of regression is most often to
-*characterize* the relationship between the inputs and outputs.
-Machine learning, on the other hand,
-is most often concerned with *prediction*.
-
-Regression problems pop up whenever we want to predict a numerical value.
-Common examples include predicting prices (of homes, stocks, etc.),
-predicting length of stay (for patients in the hospital),
-demand forecasting (for retail sales), among countless others.
-Not every prediction problem is a classic regression problem.
-In subsequent sections, we will introduce classification problems,
-where the goal is to predict membership among a set of categories.
-
-
-## Basic Elements of Linear Regression
-
-*Linear regression* may be both the simplest
-and most popular among the standard tools for regression.
-Dating back to the dawn of the 19th century,
-linear regression flows from a few simple assumptions.
-First, we assume that the relationship between
-the independent variables $\mathbf{x}$ and the dependent variable $y$ is linear,
-i.e., that $y$ can be expressed as a weighted sum
-of the elements in $\mathbf{x}$,
-given some noise on the observations.
-Second, we assume that any noise is well-behaved
-(following a Gaussian distribution).
-
-To motivate the approach, let us start with a running example.
-Suppose that we wish to estimate the prices of houses (in dollars)
-based on their area (in square feet) and age (in years).
-To actually develop a model for predicting house prices,
-we would need to get our hands on a dataset
-consisting of sales for which we know
-the sale price, area, and age for each home.
-In the terminology of machine learning,
-the dataset is called a *training dataset* or *training set*,
-and each row (here the data corresponding to one sale)
-is called an *example* (or *data point*, *data instance*, *sample*).
-The thing we are trying to predict (price)
-is called a *label* (or *target*).
-The independent variables (age and area)
-upon which the predictions are based
-are called *features* (or *covariates*).
-
-Typically, we will use $n$ to denote
-the number of examples in our dataset.
-We index the data examples by $i$, denoting each input
-as $\mathbf{x}^{(i)} = [x_1^{(i)}, x_2^{(i)}]^\top$
-and the corresponding label as $y^{(i)}$.
-
-
-### Linear Model
-:label:`subsec_linear_model`
-
-The linearity assumption just says that the target (price)
-can be expressed as a weighted sum of the features (area and age):
-
-$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b.$$
-:eqlabel:`eq_price-area`
-
-In :eqref:`eq_price-area`, $w_{\mathrm{area}}$ and $w_{\mathrm{age}}$
-are called *weights*, and $b$ is called a *bias*
-(also called an *offset* or *intercept*).
-The weights determine the influence of each feature
-on our prediction and the bias just says
-what value the predicted price should take
-when all of the features take value 0.
-Even if we will never see any homes with zero area,
-or that are precisely zero years old,
-we still need the bias or else we will
-limit the expressivity of our model.
-Strictly speaking, :eqref:`eq_price-area` is an *affine transformation*
-of input features,
-which is characterized by
-a *linear transformation* of features via weighted sum, combined with
-a *translation* via the added bias.
-
-Given a dataset, our goal is to choose
-the weights $\mathbf{w}$ and the bias $b$ such that on average,
-the predictions made according to our model
-best fit the true prices observed in the data.
-Models whose output prediction
-is determined by the affine transformation of input features
-are *linear models*,
-where the affine transformation is specified by the chosen weights and bias.
-
-
-In disciplines where it is common to focus
-on datasets with just a few features,
-explicitly expressing models long-form like this is common.
-In machine learning, we usually work with high-dimensional datasets,
-so it is more convenient to employ linear algebra notation.
-When our inputs consist of $d$ features,
-we express our prediction $\hat{y}$ (in general the "hat" symbol denotes estimates) as
-
-$$\hat{y} = w_1  x_1 + ... + w_d  x_d + b.$$
-
-Collecting all features into a vector $\mathbf{x} \in \mathbb{R}^d$
-and all weights into a vector $\mathbf{w} \in \mathbb{R}^d$,
-we can express our model compactly using a dot product:
-
-$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b.$$
-:eqlabel:`eq_linreg-y`
-
-In :eqref:`eq_linreg-y`, the vector $\mathbf{x}$ corresponds to features of a single data example.
-We will often find it convenient
-to refer to features of our entire dataset of $n$ examples
-via the *design matrix* $\mathbf{X} \in \mathbb{R}^{n \times d}$.
-Here, $\mathbf{X}$ contains one row for every example
-and one column for every feature.
-
-For a collection of features $\mathbf{X}$,
-the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$
-can be expressed via the matrix-vector product:
-
-$${\hat{\mathbf{y}}} = \mathbf{X} \mathbf{w} + b,$$
-
-where broadcasting (see :numref:`subsec_broadcasting`) is applied during the summation.
-Given features of a training dataset $\mathbf{X}$
-and corresponding (known) labels $\mathbf{y}$,
-the goal of linear regression is to find
-the weight vector $\mathbf{w}$ and the bias term $b$
-such that, given features of a new data example
-sampled from the same distribution as $\mathbf{X}$,
-the new example's label will (in expectation) be predicted with the lowest error.
-
-
-Even if we believe that the best model for
-predicting $y$ given $\mathbf{x}$ is linear,
-we would not expect to find a real-world dataset of $n$ examples where
-$y^{(i)}$ exactly equals $\mathbf{w}^\top \mathbf{x}^{(i)}+b$
-for all $1 \leq i \leq n$.
-For example, whatever instruments we use to observe
-the features $\mathbf{X}$ and labels $\mathbf{y}$
-might suffer a small amount of measurement error.
-Thus, even when we are confident
-that the underlying relationship is linear,
-we will incorporate a noise term to account for such errors.
-
-Before we can go about searching for the best *parameters* (or *model parameters*) $\mathbf{w}$ and $b$,
-we will need two more things:
-(i) a quality measure for some given model;
-and (ii) a procedure for updating the model to improve its quality.
-
-
-### Loss Function
-
-Before we start thinking about how to *fit* data with our model,
-we need to determine a measure of *fitness*.
-The *loss function* quantifies the distance
-between the *real* and *predicted* values of the target.
-The loss will usually be a non-negative number
-where smaller values are better
-and perfect predictions incur a loss of 0.
-The most popular loss function in regression problems
-is the squared error.
-When our prediction for an example $i$ is $\hat{y}^{(i)}$
-and the corresponding true label is $y^{(i)}$,
-the squared error is given by:
-
-$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2.$$
-:eqlabel:`eq_mse`
-
-The constant $\frac{1}{2}$ makes no real difference
-but will prove notationally convenient,
-canceling out when we take the derivative of the loss.
-Since the training dataset is given to us, and thus out of our control,
-the empirical error is only a function of the model parameters.
-To make things more concrete, consider the example below
-where we plot a regression problem for a one-dimensional case
-as shown in :numref:`fig_fit_linreg`.
-
-![Fit data with a linear model.](../img/fit-linreg.svg)
-:label:`fig_fit_linreg`
-
-Note that large differences between
-estimates $\hat{y}^{(i)}$ and observations $y^{(i)}$
-lead to even larger contributions to the loss,
-due to the quadratic dependence.
-To measure the quality of a model on the entire dataset of $n$ examples,
-we simply average (or equivalently, sum)
-the losses on the training set.
-
-$$L(\mathbf{w}, b) =\frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) =\frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$
-
-When training the model, we want to find parameters ($\mathbf{w}^*, b^*$)
-that minimize the total loss across all training examples:
-
-$$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\  L(\mathbf{w}, b).$$
-
-
-### Analytic Solution
-
-Linear regression happens to be an unusually simple optimization problem.
-Unlike most other models that we will encounter in this book,
-linear regression can be solved analytically by applying a simple formula.
-To start, we can subsume the bias $b$ into the parameter $\mathbf{w}$
-by appending a column to the design matrix consisting of all ones.
-Then our prediction problem is to minimize $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$.
-Provided that $\mathbf{X}^\top \mathbf{X}$ is invertible, there is just one critical point on the loss surface
-and it corresponds to the minimum of the loss over the entire domain.
-Taking the derivative of the loss with respect to $\mathbf{w}$
-and setting it equal to zero yields the analytic (closed-form) solution:
-
-$$\mathbf{w}^* = (\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf{y}.$$
-
-While simple problems like linear regression
-may admit analytic solutions,
-you should not get used to such good fortune.
-Although analytic solutions allow for nice mathematical analysis,
-the requirement of an analytic solution is so restrictive
-that it would exclude all of deep learning.
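-
-As a small worked example (the three data points are hypothetical,
-chosen only so that the arithmetic is easy to follow),
-take a single feature with examples $(1, 1)$, $(2, 2)$, and $(3, 2)$.
-After appending the column of ones, the formula above gives
-
-$$\mathbf{X} = \begin{bmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}, \quad \mathbf{X}^\top \mathbf{X} = \begin{bmatrix} 14 & 6 \\ 6 & 3 \end{bmatrix}, \quad \mathbf{X}^\top \mathbf{y} = \begin{bmatrix} 11 \\ 5 \end{bmatrix},$$
-
-$$\mathbf{w}^* = \frac{1}{6} \begin{bmatrix} 3 & -6 \\ -6 & 14 \end{bmatrix} \begin{bmatrix} 11 \\ 5 \end{bmatrix} = \begin{bmatrix} 1/2 \\ 2/3 \end{bmatrix},$$
-
-that is, a slope of $1/2$ and a bias of $2/3$.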
-
-
-### Minibatch Stochastic Gradient Descent
-
-Even in cases where we cannot solve the models analytically,
-it turns out that we can still train models effectively in practice.
-Moreover, for many tasks, those difficult-to-optimize models
-turn out to be so much better that figuring out how to train them
-ends up being well worth the trouble.
-
-The key technique for optimizing nearly any deep learning model,
-and which we will call upon throughout this book,
-consists of iteratively reducing the error
-by updating the parameters in the direction
-that incrementally lowers the loss function.
-This algorithm is called *gradient descent*.
-
-The most naive application of gradient descent
-consists of taking the derivative of the loss function,
-which is an average of the losses computed
-on every single example in the dataset.
-In practice, this can be extremely slow:
-we must pass over the entire dataset before making a single update.
-Thus, we will often settle for sampling a random minibatch of examples
-every time we need to compute the update,
-a variant called *minibatch stochastic gradient descent*.
-
-In each iteration, we first randomly sample a minibatch $\mathcal{B}$
-consisting of a fixed number of training examples.
-We then compute the derivative (gradient) of the average loss
-on the minibatch with respect to the model parameters.
-Finally, we multiply the gradient by a predetermined positive value $\eta$
-and subtract the resulting term from the current parameter values.
-
-We can express the update mathematically as follows
-($\partial$ denotes the partial derivative):
-
-$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b).$$
-
-
-To summarize, the steps of the algorithm are as follows:
-(i) we initialize the values of the model parameters, typically at random;
-(ii) we iteratively sample random minibatches from the data,
-updating the parameters in the direction of the negative gradient.
-For quadratic losses and affine transformations,
-we can write this out explicitly as follows:
-
-$$\begin{aligned} \mathbf{w} &\leftarrow \mathbf{w} -   \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right),\\ b &\leftarrow b -  \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b)  = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned}$$
-:eqlabel:`eq_linreg_batch_update`
-
-
-Note that $\mathbf{w}$ and $\mathbf{x}$ are vectors in :eqref:`eq_linreg_batch_update`.
-Here, the more elegant vector notation makes the math
-much more readable than expressing things in terms of coefficients,
-say $w_1, w_2, \ldots, w_d$.
-The set cardinality
-$|\mathcal{B}|$ represents
-the number of examples in each minibatch (the *batch size*)
-and $\eta$ denotes the *learning rate*.
-We emphasize that the values of the batch size and learning rate
-are manually pre-specified and not typically learned through model training.
-These parameters that are tunable but not updated
-in the training loop are called *hyperparameters*.
-*Hyperparameter tuning* is the process by which hyperparameters are chosen,
-and typically requires that we adjust them
-based on the results of the training loop
-as assessed on a separate *validation dataset* (or *validation set*).
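-
-To make :eqref:`eq_linreg_batch_update` concrete, here is one update
-worked by hand. The numbers are hypothetical: a single feature,
-parameters initialized at $w = 0$ and $b = 0$, learning rate $\eta = 0.1$,
-and a minibatch of two examples $(x, y) = (1, 2)$ and $(2, 4)$.
-The prediction errors $w x^{(i)} + b - y^{(i)}$ are $-2$ and $-4$, so
-
-$$\begin{aligned} w &\leftarrow 0 - \frac{0.1}{2} \left(1 \cdot (-2) + 2 \cdot (-4)\right) = 0.5,\\ b &\leftarrow 0 - \frac{0.1}{2} \left((-2) + (-4)\right) = 0.3. \end{aligned}$$
-
-The average loss on this minibatch drops from $5$ to about $2.18$ after this single step.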
-
-After training for some predetermined number of iterations
-(or until some other stopping criteria are met),
-we record the estimated model parameters,
-denoted $\hat{\mathbf{w}}, \hat{b}$.
-Note that even if our function is truly linear and noiseless,
-these parameters will not be the exact minimizers of the loss
-because, although the algorithm converges slowly towards the minimizers,
-it cannot reach them exactly in a finite number of steps.
-
-Linear regression happens to be a learning problem where there is only one minimum
-over the entire domain.
-However, for more complicated models, like deep networks,
-the loss surfaces contain many minima.
-Fortunately, for reasons that are not yet fully understood,
-deep learning practitioners seldom struggle to find parameters
-that minimize the loss *on training sets*.
-The more formidable task is to find parameters
-that will achieve low loss on data
-that we have not seen before,
-a challenge called *generalization*.
-We return to these topics throughout the book.
-
-
-### Making Predictions with the Learned Model
-
-
-Given the learned linear regression model
-$\hat{\mathbf{w}}^\top \mathbf{x} + \hat{b}$,
-we can now estimate the price of a new house
-(not contained in the training data)
-given its area $x_1$ and age $x_2$.
-Estimating targets given features is
-commonly called *prediction* or *inference*.
-
-We will try to stick with *prediction* because
-calling this step *inference*,
-despite emerging as standard jargon in deep learning,
-is somewhat of a misnomer.
-In statistics, *inference* more often denotes
-estimating parameters based on a dataset.
-This misuse of terminology is a common source of confusion
-when deep learning practitioners talk to statisticians.
-
-
-## Vectorization for Speed
-
-When training our models, we typically want to process
-whole minibatches of examples simultaneously.
-Doing this efficiently requires that (**we**) (~~should~~) (**vectorize the calculations
-and leverage fast linear algebra libraries
-rather than writing costly for-loops in Python.**)
-
-```{.python .input}
-%matplotlib inline
-from d2l import mxnet as d2l
-import math
-from mxnet import np
-import time
-```
-
-```{.python .input}
-#@tab pytorch
-%matplotlib inline
-from d2l import torch as d2l
-import math
-import torch
-import numpy as np
-import time
-```
-
-```{.python .input}
-#@tab tensorflow
-%matplotlib inline
-from d2l import tensorflow as d2l
-import math
-import tensorflow as tf
-import numpy as np
-import time
-```
-
-To illustrate why this matters so much,
-we can (**consider two methods for adding vectors.**)
-To start we instantiate two 10000-dimensional vectors
-containing all ones.
-In one method we will loop over the vectors with a Python for-loop.
-In the other method we will rely on a single call to `+`.
-
-```{.python .input}
-#@tab all
-n = 10000
-a = d2l.ones(n)
-b = d2l.ones(n)
-```
-
-Since we will benchmark the running time frequently in this book,
-[**let us define a timer**].
-
-```{.python .input}
-#@tab all
-class Timer:  #@save
-    """Record multiple running times."""
-    def __init__(self):
-        self.times = []
-        self.start()
-
-    def start(self):
-        """Start the timer."""
-        self.tik = time.time()
-
-    def stop(self):
-        """Stop the timer and record the time in a list."""
-        self.times.append(time.time() - self.tik)
-        return self.times[-1]
-
-    def avg(self):
-        """Return the average time."""
-        return sum(self.times) / len(self.times)
-
-    def sum(self):
-        """Return the sum of time."""
-        return sum(self.times)
-
-    def cumsum(self):
-        """Return the accumulated time."""
-        return np.array(self.times).cumsum().tolist()
-```
-
-Now we can benchmark the workloads.
-First, [**we add them, one coordinate at a time,
-using a for-loop.**]
-
-```{.python .input}
-#@tab mxnet, pytorch
-c = d2l.zeros(n)
-timer = Timer()
-for i in range(n):
-    c[i] = a[i] + b[i]
-f'{timer.stop():.5f} sec'
-```
-
-```{.python .input}
-#@tab tensorflow
-c = tf.Variable(d2l.zeros(n))
-timer = Timer()
-for i in range(n):
-    c[i].assign(a[i] + b[i])
-f'{timer.stop():.5f} sec'
-```
-
-(**Alternatively, we rely on the overloaded `+` operator to compute the elementwise sum.**)
-
-```{.python .input}
-#@tab all
-timer.start()
-d = a + b
-f'{timer.stop():.5f} sec'
-```
-
-You probably noticed that the second method
-is dramatically faster than the first.
-Vectorizing code often yields order-of-magnitude speedups.
-Moreover, we push more of the mathematics to the library
-and need not write as many calculations ourselves,
-reducing the potential for errors.
-
-## The Normal Distribution and Squared Loss
-:label:`subsec_normal_distribution_and_squared_loss`
-
-While you can already get your hands dirty using only the information above,
-in the following we can more formally motivate the squared loss objective
-via assumptions about the distribution of noise.
-
-Gauss, who invented linear regression in 1795,
-also discovered the normal distribution (also called the *Gaussian*).
-It turns out that the connection between
-the normal distribution and linear regression
-runs deeper than common parentage.
-To refresh your memory, the probability density
-of a normal distribution with mean $\mu$ and variance $\sigma^2$ (standard deviation $\sigma$)
-is given as
-
-$$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (x - \mu)^2\right).$$
-
-Below [**we define a Python function to compute the normal distribution**].
-
-```{.python .input}
-#@tab all
-def normal(x, mu, sigma):
-    p = 1 / math.sqrt(2 * math.pi * sigma**2)
-    return p * np.exp(-0.5 / sigma**2 * (x - mu)**2)
-```
-
-We can now (**visualize the normal distributions**).
-
-```{.python .input}
-#@tab all
-# Use numpy again for visualization
-x = np.arange(-7, 7, 0.01)
-
-# Mean and standard deviation pairs
-params = [(0, 1), (0, 2), (3, 1)]
-d2l.plot(x, [normal(x, mu, sigma) for mu, sigma in params], xlabel='x',
-         ylabel='p(x)', figsize=(4.5, 2.5),
-         legend=[f'mean {mu}, std {sigma}' for mu, sigma in params])
-```
-
-As we can see, changing the mean corresponds to a shift along the $x$-axis,
-and increasing the variance spreads the distribution out, lowering its peak.
-
-One way to motivate linear regression with the mean squared error loss function (or simply squared loss)
-is to formally assume that observations arise from noisy measurements,
-where the noise is normally distributed as follows:
-
-$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon \text{ where } \epsilon \sim \mathcal{N}(0, \sigma^2).$$
-
-Thus, we can now write out the *likelihood*
-of seeing a particular $y$ for a given $\mathbf{x}$ via
-
-$$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (y - \mathbf{w}^\top \mathbf{x} - b)^2\right).$$
-
-Now, according to the principle of maximum likelihood,
-the best values of parameters $\mathbf{w}$ and $b$ are those
-that maximize the *likelihood* of the entire dataset:
-
-$$P(\mathbf y \mid \mathbf X) = \prod_{i=1}^{n} P(y^{(i)} \mid \mathbf{x}^{(i)}).$$
-
-Estimators chosen according to the principle of maximum likelihood
-are called *maximum likelihood estimators*.
-While maximizing the product of many exponential functions
-might look difficult,
-we can simplify things significantly, without changing the objective,
-by maximizing the log of the likelihood instead.
-For historical reasons, optimizations are more often expressed
-as minimization rather than maximization.
-So, without changing anything we can minimize the *negative log-likelihood*
-$-\log P(\mathbf y \mid \mathbf X)$.
-Working out the mathematics gives us:
-
-$$-\log P(\mathbf y \mid \mathbf X) = \sum_{i=1}^n \left( \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2 \right).$$
-
-Now we just need one more assumption: that $\sigma$ is some fixed constant.
-Thus we can ignore the first term because
-it does not depend on $\mathbf{w}$ or $b$.
-Now the second term is identical to the squared error loss introduced earlier,
-except for the multiplicative constant $\frac{1}{\sigma^2}$.
-Fortunately, the solution does not depend on $\sigma$.
-It follows that minimizing the mean squared error
-is equivalent to maximum likelihood estimation
-of a linear model under the assumption of additive Gaussian noise.
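-
-To spell out this last step (a sketch that only uses quantities defined above),
-note that the logarithm turns the product into a sum, and that each squared term is $2\, l^{(i)}(\mathbf{w}, b)$ from :eqref:`eq_mse`:
-
-$$-\log P(\mathbf y \mid \mathbf X) = \frac{n}{2} \log(2 \pi \sigma^2) + \frac{1}{\sigma^2} \sum_{i=1}^n l^{(i)}(\mathbf{w}, b) = \frac{n}{2} \log(2 \pi \sigma^2) + \frac{n}{\sigma^2} L(\mathbf{w}, b).$$
-
-For fixed $\sigma$, the negative log-likelihood is thus an increasing affine function of $L(\mathbf{w}, b)$, so both are minimized by the same $\mathbf{w}$ and $b$.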
-
-## From Linear Regression to Deep Networks
-
-So far we have only talked about linear models.
-While neural networks cover a much richer family of models,
-we can begin thinking of the linear model
-as a neural network by expressing it in the language of neural networks.
-To begin, let us start by rewriting things in a "layer" notation.
-
-### Neural Network Diagram
-
-Deep learning practitioners like to draw diagrams
-to visualize what is happening in their models.
-In :numref:`fig_single_neuron`,
-we depict our linear regression model as a neural network.
-Note that these diagrams highlight the connectivity pattern
-such as how each input is connected to the output,
-but not the values taken by the weights or biases.
-
-![Linear regression is a single-layer neural network.](../img/singleneuron.svg)
-:label:`fig_single_neuron`
-
-For the neural network shown in :numref:`fig_single_neuron`,
-the inputs are $x_1, \ldots, x_d$,
-so the *number of inputs* (or *feature dimensionality*) in the input layer is $d$.
-The output of the network in :numref:`fig_single_neuron` is $o_1$,
-so the *number of outputs* in the output layer is 1.
-Note that the input values are all *given*
-and there is just a single *computed* neuron.
-Focusing on where computation takes place,
-conventionally we do not consider the input layer when counting layers.
-That is to say,
-the *number of layers* for the neural network in :numref:`fig_single_neuron` is 1.
-We can think of linear regression models as neural networks
-consisting of just a single artificial neuron,
-or as single-layer neural networks.
-
-Since for linear regression, every input is connected
-to every output (in this case there is only one output),
-we can regard this transformation (the output layer in :numref:`fig_single_neuron`)
-as a *fully-connected layer* or *dense layer*.
-We will talk a lot more about networks composed of such layers
-in the next chapter.
-
-
-### Biology
-
-Since linear regression (invented in 1795)
-predates computational neuroscience,
-it might seem anachronistic to describe
-linear regression as a neural network.
-To see why linear models were a natural place to begin
-when the cyberneticists/neurophysiologists
-Warren McCulloch and Walter Pitts began to develop
-models of artificial neurons,
-consider the cartoonish picture
-of a biological neuron in :numref:`fig_Neuron`, consisting of
-*dendrites* (input terminals),
-the *nucleus* (CPU), the *axon* (output wire),
-and the *axon terminals* (output terminals),
-enabling connections to other neurons via *synapses*.
-
-![The real neuron.](../img/neuron.svg)
-:label:`fig_Neuron`
-
-Information $x_i$ arriving from other neurons
-(or environmental sensors such as the retina)
-is received in the dendrites.
-In particular, that information is weighted by *synaptic weights* $w_i$
-determining the effect of the inputs
-(e.g., activation or inhibition via the product $x_i w_i$).
-The weighted inputs arriving from multiple sources
-are aggregated in the nucleus as a weighted sum $y = \sum_i x_i w_i + b$,
-and this information is then sent for further processing in the axon $y$,
-typically after some nonlinear processing via $\sigma(y)$.
-From there it either reaches its destination (e.g., a muscle)
-or is fed into another neuron via its dendrites.
-
-Certainly, the high-level idea that many such units
-could be cobbled together with the right connectivity
-and right learning algorithm,
-to produce far more interesting and complex behavior
-than any one neuron alone could express
-owes to our study of real biological neural systems.
-
-At the same time, most research in deep learning today
-draws little direct inspiration from neuroscience.
-We invoke Stuart Russell and Peter Norvig who,
-in their classic AI textbook
-*Artificial Intelligence: A Modern Approach* :cite:`Russell.Norvig.2016`,
-pointed out that although airplanes might have been *inspired* by birds,
-ornithology has not been the primary driver
-of aeronautics innovation for some centuries.
-Likewise, inspiration in deep learning these days
-comes in equal or greater measure from mathematics,
-statistics, and computer science.
-
-## Summary
-
-* Key ingredients in a machine learning model are training data, a loss function, an optimization algorithm, and quite obviously, the model itself.
-* Vectorizing makes everything better (mostly math) and faster (mostly code).
-* Minimizing an objective function and performing maximum likelihood estimation can mean the same thing.
-* Linear regression models are neural networks, too.
-
-
-## Exercises
-
-1. Assume that we have some data $x_1, \ldots, x_n \in \mathbb{R}$. Our goal is to find a constant $b$ such that $\sum_i (x_i - b)^2$ is minimized.
-    1. Find an analytic solution for the optimal value of $b$.
-    1. How does this problem and its solution relate to the normal distribution?
-1. Derive the analytic solution to the optimization problem for linear regression with squared error. To keep things simple, you can omit the bias $b$ from the problem (we can do this in principled fashion by adding one column to $\mathbf X$ consisting of all ones).
-    1. Write out the optimization problem in matrix and vector notation (treat all the data as a single matrix, and all the target values as a single vector).
-    1. Compute the gradient of the loss with respect to $\mathbf{w}$.
-    1. Find the analytic solution by setting the gradient equal to zero and solving the matrix equation.
-    1. When might this be better than using stochastic gradient descent? When might this method break?
-1. Assume that the noise model governing the additive noise $\epsilon$ is the Laplace (double exponential) distribution. That is, $p(\epsilon) = \frac{1}{2} \exp(-|\epsilon|)$.
-    1. Write out the negative log-likelihood of the data under the model $-\log P(\mathbf y \mid \mathbf X)$.
-    1. Can you find a closed form solution?
-    1. Suggest a stochastic gradient descent algorithm to solve this problem. What could possibly go wrong (hint: what happens near the stationary point as we keep on updating the parameters)? Can you fix this?
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/40)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/258)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/259)
-:end_tab:
diff --git a/chapter_linear-networks/softmax-regression-concise.md b/chapter_linear-networks/softmax-regression-concise.md
deleted file mode 100644
index 053e431..0000000
--- a/chapter_linear-networks/softmax-regression-concise.md
+++ /dev/null
@@ -1,157 +0,0 @@
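-# Concise Implementation of Softmax Regression
-:label:`sec_softmax_concise`
-
-(**Just as high-level APIs**) of deep learning frameworks (**made it much easier to implement linear regression**) in :numref:`sec_linear_concise`, (**we will find them similarly**) (~~here~~) (or possibly more) convenient for implementing classification models. Let us stick with the Fashion-MNIST dataset and keep the batch size at 256, as in :numref:`sec_softmax_scratch`.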
-
-```{.python .input}
-from d2l import mxnet as d2l
-from mxnet import gluon, init, npx
-from mxnet.gluon import nn
-npx.set_np()
-```
-
-```{.python .input}
-#@tab pytorch
-from d2l import torch as d2l
-import torch
-from torch import nn
-```
-
-```{.python .input}
-#@tab tensorflow
-from d2l import tensorflow as d2l
-import tensorflow as tf
-```
-
-```{.python .input}
-#@tab all
-batch_size = 256
-train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
-```
-
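-## Initializing Model Parameters
-
-As mentioned in :numref:`sec_softmax`, [**the output layer of softmax regression is a fully-connected layer.**] Therefore, to implement our model, we just need to add one fully-connected layer with 10 outputs to our `Sequential`. Here too, the `Sequential` is not strictly necessary, but we might as well form the habit, since it will be ubiquitous when implementing deep models. As before, we initialize the weights at random with zero mean and standard deviation 0.01.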
-
-```{.python .input}
-net = nn.Sequential()
-net.add(nn.Dense(10))
-net.initialize(init.Normal(sigma=0.01))
-```
-
-```{.python .input}
-#@tab pytorch
-# PyTorch does not implicitly reshape the inputs. Thus we define the flatten
-# layer to reshape the inputs before the linear layer in our network
-net = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
-
-def init_weights(m):
-    if type(m) == nn.Linear:
-        nn.init.normal_(m.weight, std=0.01)
-
-net.apply(init_weights);
-```
-
-```{.python .input}
-#@tab tensorflow
-net = tf.keras.models.Sequential()
-net.add(tf.keras.layers.Flatten(input_shape=(28, 28)))
-weight_initializer = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01)
-net.add(tf.keras.layers.Dense(10, kernel_initializer=weight_initializer))
-```
-
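-## Softmax Implementation Revisited
-:label:`subsec_softmax-implementation-revisited`
-
-In the previous example of :numref:`sec_softmax_scratch`, we calculated our model's output and then ran this output through the cross-entropy loss. Mathematically, that is a perfectly reasonable thing to do. However, from a computational perspective, exponentiation can be a source of numerical stability issues.
-
-Recall that the softmax function calculates $\hat y_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}$, where $\hat y_j$ is the $j^\mathrm{th}$ element of the predicted probability distribution $\hat{\mathbf{y}}$ and $o_j$ is the $j^\mathrm{th}$ element of the logits $\mathbf{o}$. If some of the $o_k$ are very large (i.e., very positive), then $\exp(o_k)$ might be larger than the largest number we can have for certain data types (i.e., *overflow*). This would make the denominator (and/or numerator) `inf` (infinity), and we wind up encountering either 0, `inf`, or `nan` (not a number) for $\hat y_j$. In these situations we do not get a well-defined return value for cross-entropy.
-
-One trick to get around this is to first subtract $\max(o_k)$ from all $o_k$ before proceeding with the softmax calculation. Shifting every $o_k$ by the same constant does not change the return value of softmax: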
-
-$$
-\begin{aligned}
-\hat y_j & =  \frac{\exp(o_j - \max(o_k))\exp(\max(o_k))}{\sum_k \exp(o_k - \max(o_k))\exp(\max(o_k))} \\
-& = \frac{\exp(o_j - \max(o_k))}{\sum_k \exp(o_k - \max(o_k))}.
-\end{aligned}
-$$
-
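-After the subtraction and normalization step, some $o_j - \max(o_k)$ may have large negative values, so the corresponding $\exp(o_j - \max(o_k))$ take values close to zero. These might be rounded to zero due to finite precision (i.e., *underflow*), making $\hat y_j$ zero and giving us `-inf` for $\log(\hat y_j)$. A few steps down the road in backpropagation, we might find ourselves faced with a screenful of the dreaded `nan` results.
-
-Fortunately, we are saved by the fact that even though we are computing exponential functions, we ultimately intend to take their log (when calculating the cross-entropy loss). By combining the two operators, softmax and cross-entropy, we can escape the numerical stability issues that might otherwise plague us during backpropagation. As shown in the equation below, we avoid calculating $\exp(o_j - \max(o_k))$ and can instead use $o_j - \max(o_k)$ directly, owing to the cancellation in $\log(\exp(\cdot))$: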
-
-$$
-\begin{aligned}
-\log{(\hat y_j)} & = \log\left( \frac{\exp(o_j - \max(o_k))}{\sum_k \exp(o_k - \max(o_k))}\right) \\
-& = \log{(\exp(o_j - \max(o_k)))}-\log{\left( \sum_k \exp(o_k - \max(o_k)) \right)} \\
-& = o_j - \max(o_k) -\log{\left( \sum_k \exp(o_k - \max(o_k)) \right)}.
-\end{aligned}
-$$
-
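-We will want to keep the conventional softmax function handy in case we ever want to evaluate the output probabilities produced by our model. But instead of passing softmax probabilities into our new loss function, we will just [**pass the logits and compute the softmax and its log all at once inside the cross-entropy loss function,**] which does smart things like the ["LogSumExp trick"](https://en.wikipedia.org/wiki/LogSumExp).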
-
-```{.python .input}
-loss = gluon.loss.SoftmaxCrossEntropyLoss()
-```
-
-```{.python .input}
-#@tab pytorch
-loss = nn.CrossEntropyLoss()
-```
-
-```{.python .input}
-#@tab tensorflow
-loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
-```
-
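-## Optimization Algorithm
-
-Here, we (**use minibatch stochastic gradient descent**) with a learning rate of 0.1 as the optimization algorithm. Note that this is the same as we applied in the linear regression example, which illustrates the general applicability of the optimizers.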
-
-```{.python .input}
-trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
-```
-
-```{.python .input}
-#@tab pytorch
-trainer = torch.optim.SGD(net.parameters(), lr=0.1)
-```
-
-```{.python .input}
-#@tab tensorflow
-trainer = tf.keras.optimizers.SGD(learning_rate=.1)
-```
-
-## Training
-
-Next we [**call the training function defined**] (~~earlier~~) in :numref:`sec_softmax_scratch` to train the model.
-
-```{.python .input}
-#@tab all
-num_epochs = 10
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
-```
-
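-As before, this algorithm converges to a solution that achieves a decent accuracy, this time with fewer lines of code than before.
-
-## Summary
-
-* Using high-level APIs, we can implement softmax regression much more concisely.
-* From a computational perspective, implementing softmax regression has intricacies. Note that in many cases, a deep learning framework takes additional precautions beyond these most well-known tricks to ensure numerical stability, saving us from even more pitfalls that we would encounter if we tried to code all of our models from scratch in practice.
-
-## Exercises
-
-1. Try adjusting the hyperparameters, such as the batch size, number of epochs, and learning rate, to see what the results are.
-1. Increase the number of epochs for training. Why might the test accuracy decrease after a while? How could we fix this?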
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/52)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/53)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/260)
-:end_tab:
diff --git a/chapter_linear-networks/softmax-regression-concise_origin.md b/chapter_linear-networks/softmax-regression-concise_origin.md
deleted file mode 100644
index 7d528b7..0000000
--- a/chapter_linear-networks/softmax-regression-concise_origin.md
+++ /dev/null
@@ -1,225 +0,0 @@
-# Concise Implementation of Softmax Regression
-:label:`sec_softmax_concise`
-
-
-
-(**Just as high-level APIs**)
-of deep learning frameworks
-(**made it much easier to implement linear regression**)
-in :numref:`sec_linear_concise`,
-(**we will find it similarly**) (~~here~~) (or possibly more)
-convenient for implementing classification models. Let us stick with the Fashion-MNIST dataset
-and keep the batch size at 256 as in :numref:`sec_softmax_scratch`.
-
-```{.python .input}
-from d2l import mxnet as d2l
-from mxnet import gluon, init, npx
-from mxnet.gluon import nn
-npx.set_np()
-```
-
-```{.python .input}
-#@tab pytorch
-from d2l import torch as d2l
-import torch
-from torch import nn
-```
-
-```{.python .input}
-#@tab tensorflow
-from d2l import tensorflow as d2l
-import tensorflow as tf
-```
-
-```{.python .input}
-#@tab all
-batch_size = 256
-train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
-```
-
-## Initializing Model Parameters
-
-As mentioned in :numref:`sec_softmax`,
-[**the output layer of softmax regression
-is a fully-connected layer.**]
-Therefore, to implement our model,
-we just need to add one fully-connected layer
-with 10 outputs to our `Sequential`.
-Again, here, the `Sequential` is not really necessary,
-but we might as well form the habit since it will be ubiquitous
-when implementing deep models.
-Again, we initialize the weights at random
-with zero mean and standard deviation 0.01.
-
-```{.python .input}
-net = nn.Sequential()
-net.add(nn.Dense(10))
-net.initialize(init.Normal(sigma=0.01))
-```
-
-```{.python .input}
-#@tab pytorch
-# PyTorch does not implicitly reshape the inputs. Thus we define the flatten
-# layer to reshape the inputs before the linear layer in our network
-net = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
-
-def init_weights(m):
-    if type(m) == nn.Linear:
-        nn.init.normal_(m.weight, std=0.01)
-
-net.apply(init_weights);
-```
-
-```{.python .input}
-#@tab tensorflow
-net = tf.keras.models.Sequential()
-net.add(tf.keras.layers.Flatten(input_shape=(28, 28)))
-weight_initializer = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01)
-net.add(tf.keras.layers.Dense(10, kernel_initializer=weight_initializer))
-```
-
-## Softmax Implementation Revisited
-:label:`subsec_softmax-implementation-revisited`
-
-In the previous example of :numref:`sec_softmax_scratch`,
-we calculated our model's output
-and then ran this output through the cross-entropy loss.
-Mathematically, that is a perfectly reasonable thing to do.
-However, from a computational perspective,
-exponentiation can be a source of numerical stability issues.
-
-Recall that the softmax function calculates
-$\hat y_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}$,
-where $\hat y_j$ is the $j^\mathrm{th}$ element of
-the predicted probability distribution $\hat{\mathbf{y}}$
-and $o_j$ is the $j^\mathrm{th}$ element of the logits
-$\mathbf{o}$.
-If some of the $o_k$ are very large (i.e., very positive),
-then $\exp(o_k)$ might be larger than the largest number
-we can have for certain data types (i.e., *overflow*).
-This would make the denominator (and/or numerator) `inf` (infinity)
-and we wind up encountering either 0, `inf`, or `nan` (not a number) for $\hat y_j$.
-In these situations we do not get a well-defined
-return value for cross-entropy.
-
-
-One trick to get around this is to first subtract $\max(o_k)$
-from all $o_k$ before proceeding with the softmax calculation.
-You can see that shifting every $o_k$ by the same constant
-does not change the return value of softmax:
-
-$$
-\begin{aligned}
-\hat y_j & =  \frac{\exp(o_j - \max(o_k))\exp(\max(o_k))}{\sum_k \exp(o_k - \max(o_k))\exp(\max(o_k))} \\
-& = \frac{\exp(o_j - \max(o_k))}{\sum_k \exp(o_k - \max(o_k))}.
-\end{aligned}
-$$
-
-
-After the subtraction and normalization step,
-it might be possible that some $o_j - \max(o_k)$ have large negative values
-and thus that the corresponding $\exp(o_j - \max(o_k))$ will take values close to zero.
-These might be rounded to zero due to finite precision (i.e., *underflow*),
-making $\hat y_j$ zero and giving us `-inf` for $\log(\hat y_j)$.
-A few steps down the road in backpropagation,
-we might find ourselves faced with a screenful
-of the dreaded `nan` results.
-
-Fortunately, we are saved by the fact that
-even though we are computing exponential functions,
-we ultimately intend to take their log
-(when calculating the cross-entropy loss).
-By combining these two operators
-softmax and cross-entropy together,
-we can escape the numerical stability issues
-that might otherwise plague us during backpropagation.
-As shown in the equation below, we avoid calculating $\exp(o_j - \max(o_k))$
-and can use instead $o_j - \max(o_k)$ directly due to the canceling in $\log(\exp(\cdot))$:
-
-$$
-\begin{aligned}
-\log{(\hat y_j)} & = \log\left( \frac{\exp(o_j - \max(o_k))}{\sum_k \exp(o_k - \max(o_k))}\right) \\
-& = \log{(\exp(o_j - \max(o_k)))}-\log{\left( \sum_k \exp(o_k - \max(o_k)) \right)} \\
-& = o_j - \max(o_k) -\log{\left( \sum_k \exp(o_k - \max(o_k)) \right)}.
-\end{aligned}
-$$
-
-We will want to keep the conventional softmax function handy
-in case we ever want to evaluate the output probabilities by our model.
-But instead of passing softmax probabilities into our new loss function,
-we will just
-[**pass the logits and compute the softmax and its log
-all at once inside the cross-entropy loss function,**]
-which does smart things like the ["LogSumExp trick"](https://en.wikipedia.org/wiki/LogSumExp).
-
-```{.python .input}
-loss = gluon.loss.SoftmaxCrossEntropyLoss()
-```
-
-```{.python .input}
-#@tab pytorch
-loss = nn.CrossEntropyLoss()
-```
-
-```{.python .input}
-#@tab tensorflow
-loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
-```
-
-## Optimization Algorithm
-
-Here, we (**use minibatch stochastic gradient descent**)
-with a learning rate of 0.1 as the optimization algorithm.
-Note that this is the same as we applied in the linear regression example
-and it illustrates the general applicability of the optimizers.
-
-```{.python .input}
-trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
-```
-
-```{.python .input}
-#@tab pytorch
-trainer = torch.optim.SGD(net.parameters(), lr=0.1)
-```
-
-```{.python .input}
-#@tab tensorflow
-trainer = tf.keras.optimizers.SGD(learning_rate=.1)
-```
-
-## Training
-
-Next we [**call the training function defined**] (~~earlier~~) in :numref:`sec_softmax_scratch` to train the model.
-
-```{.python .input}
-#@tab all
-num_epochs = 10
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
-```
-
-As before, this algorithm converges to a solution
-that achieves a decent accuracy,
-albeit this time with fewer lines of code than before.
-
-
-## Summary
-
-* Using high-level APIs, we can implement softmax regression much more concisely.
-* From a computational perspective, implementing softmax regression has intricacies. Note that in many cases, a deep learning framework takes additional precautions beyond these most well-known tricks to ensure numerical stability, saving us from even more pitfalls that we would encounter if we tried to code all of our models from scratch in practice.
-
-## Exercises
-
-1. Try adjusting the hyperparameters, such as the batch size, number of epochs, and learning rate, to see what the results are.
-1. Increase the number of epochs for training. Why might the test accuracy decrease after a while? How could we fix this?
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/52)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/53)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/260)
-:end_tab:
diff --git a/chapter_linear-networks/softmax-regression-scratch.md b/chapter_linear-networks/softmax-regression-scratch.md
deleted file mode 100644
index 885fb05..0000000
--- a/chapter_linear-networks/softmax-regression-scratch.md
+++ /dev/null
@@ -1,469 +0,0 @@
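-# Implementation of Softmax Regression from Scratch
-:label:`sec_softmax_scratch`
-
-(**Just as we implemented linear regression from scratch, we believe that**) softmax regression is similarly fundamental and (**you ought to know the gory details of**) (~~softmax regression~~) and how to implement it yourself. We will work with the Fashion-MNIST dataset, just introduced in :numref:`sec_fashion_mnist`, setting up a data iterator with batch size 256.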
-
-```{.python .input}
-from d2l import mxnet as d2l
-from mxnet import autograd, np, npx, gluon
-from IPython import display
-npx.set_np()
-```
-
-```{.python .input}
-#@tab pytorch
-from d2l import torch as d2l
-import torch
-from IPython import display
-```
-
-```{.python .input}
-#@tab tensorflow
-from d2l import tensorflow as d2l
-import tensorflow as tf
-from IPython import display
-```
-
-```{.python .input}
-#@tab all
-batch_size = 256
-train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
-```
-
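-## Initializing Model Parameters
-
-As in our linear regression example, each example here will be represented by a fixed-length vector. Each example in the raw dataset is a $28 \times 28$ image. In this section, [**we will flatten each image, treating it as a vector of length 784.**] In the future, we will talk about more sophisticated strategies for exploiting the spatial structure in images, but for now we treat each pixel location as just another feature.
-
-Recall that in softmax regression, we have as many outputs as there are classes. (**Because our dataset has 10 classes, our network will have an output dimension of 10.**) Consequently, our weights will constitute a $784 \times 10$ matrix and the biases will constitute a $1 \times 10$ row vector. As with linear regression, we will initialize our weights `W` with Gaussian noise and our biases `b` to zero.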
-
-```{.python .input}
-num_inputs = 784
-num_outputs = 10
-
-W = np.random.normal(0, 0.01, (num_inputs, num_outputs))
-b = np.zeros(num_outputs)
-W.attach_grad()
-b.attach_grad()
-```
-
-```{.python .input}
-#@tab pytorch
-num_inputs = 784
-num_outputs = 10
-
-W = torch.normal(0, 0.01, size=(num_inputs, num_outputs), requires_grad=True)
-b = torch.zeros(num_outputs, requires_grad=True)
-```
-
-```{.python .input}
-#@tab tensorflow
-num_inputs = 784
-num_outputs = 10
-
-W = tf.Variable(tf.random.normal(shape=(num_inputs, num_outputs),
-                                 mean=0, stddev=0.01))
-b = tf.Variable(tf.zeros(num_outputs))
-```
-
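-## Defining the Softmax Operation
-
-Before implementing the softmax regression model, let us briefly review how the sum operator works along specific dimensions in a tensor, as discussed in :numref:`subseq_lin-alg-reduction` and :numref:`subseq_lin-alg-non-reduction`. [**Given a matrix `X` we can sum over all elements (by default) or only over elements in the same axis,**] i.e., the same column (axis 0) or the same row (axis 1). Note that if `X` is a tensor with shape (2, 3) and we sum over the columns, the result will be a vector with shape (3,). When invoking the sum operator, we can specify to keep the number of axes in the original tensor, rather than collapsing out the dimension that we summed over. This will result in a two-dimensional tensor with shape (1, 3).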
-
-```{.python .input}
-#@tab pytorch
-X = d2l.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
-d2l.reduce_sum(X, 0, keepdim=True), d2l.reduce_sum(X, 1, keepdim=True)
-```
-
-```{.python .input}
-#@tab mxnet, tensorflow
-X = d2l.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
-d2l.reduce_sum(X, 0, keepdims=True), d2l.reduce_sum(X, 1, keepdims=True)
-```
-
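-We are now ready to (**implement the softmax operation**). Recall that softmax consists of three steps: (i) we exponentiate each term (using `exp`); (ii) we sum over each row (we have one row per example in the batch) to get the normalization constant for each example; (iii) we divide each row by its normalization constant, ensuring that the result sums to 1. Before looking at the code, let us recall how this looks expressed as an equation:
-
-(** $$\mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(\mathbf{X}_{ij})}{\sum_k \exp(\mathbf{X}_{ik})}.$$
-**)
-
-The denominator, or normalization constant, is also sometimes called the *partition function* (and its logarithm is called the log-partition function). The origins of that name are in [statistical physics](https://en.wikipedia.org/wiki/Partition_function_(statistical_mechanics)) where a related equation models the distribution over an ensemble of particles.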
-
-```{.python .input}
-#@tab mxnet, tensorflow
-def softmax(X):
-    X_exp = d2l.exp(X)
-    partition = d2l.reduce_sum(X_exp, 1, keepdims=True)
-    return X_exp / partition  # The broadcasting mechanism is applied here
-```
-
-```{.python .input}
-#@tab pytorch
-def softmax(X):
-    X_exp = d2l.exp(X)
-    partition = d2l.reduce_sum(X_exp, 1, keepdim=True)
-    return X_exp / partition  # The broadcasting mechanism is applied here
-```
-
-As you can see, for any random input, [**we turn each element into a non-negative number. Moreover, each row sums up to 1,**] as is required for a probability.
-
-```{.python .input}
-#@tab mxnet, pytorch
-X = d2l.normal(0, 1, (2, 5))
-X_prob = softmax(X)
-X_prob, d2l.reduce_sum(X_prob, 1)
-```
-
-```{.python .input}
-#@tab tensorflow
-X = tf.random.normal((2, 5), 0, 1)
-X_prob = softmax(X)
-X_prob, tf.reduce_sum(X_prob, 1)
-```
-
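-Note that while this looks correct mathematically, we were a bit sloppy in our implementation because we failed to take precautions against numerical overflow or underflow due to large or very small elements of the matrix.
-
-## Defining the Model
-
-Now that we have defined the softmax operation, we can [**implement the softmax regression model.**] The code below defines how the input is mapped to the output through the network. Note that we flatten each original image in the batch into a vector using the `reshape` function before passing the data through our model.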
-
-```{.python .input}
-#@tab all
-def net(X):
-    return softmax(d2l.matmul(d2l.reshape(X, (-1, W.shape[0])), W) + b)
-```
-
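-## Defining the Loss Function
-
-Next, we need to implement the cross-entropy loss function, as introduced in :numref:`sec_softmax`. This may be the most common loss function in all of deep learning because, at the moment, classification problems far outnumber regression problems.
-
-Recall that cross-entropy takes the negative log-likelihood of the predicted probability assigned to the true label. Rather than iterating over the predictions with a Python for-loop (which tends to be inefficient), we can pick all elements by a single operator. Below, we [**create sample data `y_hat` with 2 examples of predicted probabilities over 3 classes and their corresponding labels `y`.**] With `y` we know that in the first example the first class is the correct prediction and in the second example the third class is the ground-truth. [**Using `y` as the indices of the probabilities in `y_hat`,**] we pick the probability of the first class in the first example and the probability of the third class in the second example.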
-
-```{.python .input}
-#@tab mxnet, pytorch
-y = d2l.tensor([0, 2])
-y_hat = d2l.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
-y_hat[[0, 1], y]
-```
-
-```{.python .input}
-#@tab tensorflow
-y_hat = tf.constant([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
-y = tf.constant([0, 2])
-tf.boolean_mask(y_hat, tf.one_hot(y, depth=y_hat.shape[-1]))
-```
-
-Now we can (**implement the cross-entropy loss function**) efficiently with just one line of code.
-
-```{.python .input}
-#@tab mxnet, pytorch
-def cross_entropy(y_hat, y):
-    return - d2l.log(y_hat[range(len(y_hat)), y])
-
-cross_entropy(y_hat, y)
-```
-
-```{.python .input}
-#@tab tensorflow
-def cross_entropy(y_hat, y):
-    return -tf.math.log(tf.boolean_mask(
-        y_hat, tf.one_hot(y, depth=y_hat.shape[-1])))
-
-cross_entropy(y_hat, y)
-```
-
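-## Classification Accuracy
-
-Given the predicted probability distribution `y_hat`, we typically choose the class with the highest predicted probability whenever we must output a hard prediction. Indeed, many applications require that we make a choice. Gmail must categorize an email into "Primary", "Social", "Updates", or "Forums". It might estimate probabilities internally, but at the end of the day it has to choose one among the classes.
-
-When predictions are consistent with the label class `y`, they are correct. The classification accuracy is the fraction of all predictions that are correct. Although it can be difficult to optimize accuracy directly (it is not differentiable), it is often the performance measure that we care most about, and we will nearly always report it when training classifiers.
-
-To compute accuracy we do the following. First, if `y_hat` is a matrix, we assume that the second dimension stores prediction scores for each class. We use `argmax` to obtain the predicted class by the index for the largest entry in each row. Then we [**compare the predicted class with the ground-truth `y` elementwise.**] Since the equality operator `==` is sensitive to data types, we convert `y_hat`'s data type to match that of `y`. The result is a tensor containing entries of 0 (false) and 1 (true). Taking the sum yields the number of correct predictions.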
-
-```{.python .input}
-#@tab all
-def accuracy(y_hat, y):  #@save
-    """Compute the number of correct predictions."""
-    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
-        y_hat = d2l.argmax(y_hat, axis=1)
-    cmp = d2l.astype(y_hat, y.dtype) == y
-    return float(d2l.reduce_sum(d2l.astype(cmp, y.dtype)))
-```
-
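-We will continue to use the variables `y_hat` and `y` defined before as the predicted probability distributions and labels, respectively. We can see that the first example's prediction class is 2 (the largest element of the row is 0.6 with the index 2), which is inconsistent with the actual label, 0. The second example's prediction class is 2 (the largest element of the row is 0.5 with the index 2), which is consistent with the actual label, 2. Therefore, the classification accuracy rate for these two examples is 0.5.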
-
-```{.python .input}
-#@tab all
-accuracy(y_hat, y) / len(y)
-```
-
-[**Similarly, we can evaluate the accuracy for any model `net` on a dataset**] that is accessed via the data iterator `data_iter`.
-
-```{.python .input}
-#@tab mxnet, tensorflow
-def evaluate_accuracy(net, data_iter):  #@save
-    """Compute the accuracy for a model on a dataset."""
-    metric = Accumulator(2)  # No. of correct predictions, no. of predictions
-    for X, y in data_iter:
-        metric.add(accuracy(net(X), y), d2l.size(y))
-    return metric[0] / metric[1]
-```
-
-```{.python .input}
-#@tab pytorch
-def evaluate_accuracy(net, data_iter):  #@save
-    """Compute the accuracy for a model on a dataset."""
-    if isinstance(net, torch.nn.Module):
-        net.eval()  # Set the model to evaluation mode
-    metric = Accumulator(2)  # No. of correct predictions, no. of predictions
-
-    with torch.no_grad():
-        for X, y in data_iter:
-            metric.add(accuracy(net(X), y), d2l.size(y))
-    return metric[0] / metric[1]
-```
-
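-Here `Accumulator` is a utility class to accumulate sums over multiple variables. In the above `evaluate_accuracy` function, we create 2 variables in the `Accumulator` instance for storing both the number of correct predictions and the number of predictions, respectively. Both will be accumulated over time as we iterate over the dataset.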
-
-```{.python .input}
-#@tab all
-class Accumulator:  #@save
-    """For accumulating sums over `n` variables."""
-    def __init__(self, n):
-        self.data = [0.0] * n
-
-    def add(self, *args):
-        self.data = [a + float(b) for a, b in zip(self.data, args)]
-
-    def reset(self):
-        self.data = [0.0] * len(self.data)
-
-    def __getitem__(self, idx):
-        return self.data[idx]
-```
-
-[**Because we initialized the `net` model with random weights, the accuracy of this model should be close to random guessing,**] i.e., 0.1 for 10 classes.
-
-```{.python .input}
-#@tab all
-evaluate_accuracy(net, test_iter)
-```
-
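-## Training
-
-[**The training loop**] for softmax regression should look strikingly familiar if you read through our implementation of linear regression in :numref:`sec_linear_scratch`. Here we refactor the implementation to make it reusable. First, we define a function to train for one epoch. Note that `updater` is a general function to update the model parameters, which accepts the batch size as an argument. It can be either a wrapper of the `d2l.sgd` function or a framework's built-in optimization function.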
-
-```{.python .input}
-def train_epoch_ch3(net, train_iter, loss, updater):  #@save
-    """Train a model within one epoch (defined in Chapter 3)."""
-    # Sum of training loss, sum of training accuracy, no. of examples
-    metric = Accumulator(3)
-    if isinstance(updater, gluon.Trainer):
-        updater = updater.step
-    for X, y in train_iter:
-        # Compute gradients and update parameters
-        with autograd.record():
-            y_hat = net(X)
-            l = loss(y_hat, y)
-        l.backward()
-        updater(X.shape[0])
-        metric.add(float(l.sum()), accuracy(y_hat, y), y.size)
-    # Return training loss and training accuracy
-    return metric[0] / metric[2], metric[1] / metric[2]
-```
-
-```{.python .input}
-#@tab pytorch
-def train_epoch_ch3(net, train_iter, loss, updater):  #@save
-    """The training loop defined in Chapter 3."""
-    # Set the model to training mode
-    if isinstance(net, torch.nn.Module):
-        net.train()
-    # Sum of training loss, sum of training accuracy, no. of examples
-    metric = Accumulator(3)
-    for X, y in train_iter:
-        # Compute gradients and update parameters
-        y_hat = net(X)
-        l = loss(y_hat, y)
-        if isinstance(updater, torch.optim.Optimizer):
-            # Using PyTorch in-built optimizer & loss criterion
-            updater.zero_grad()
-            l.sum().backward()
-            updater.step()
-        else:
-            # Using custom built optimizer & loss criterion
-            l.sum().backward()
-            updater(X.shape[0])
-        metric.add(float(l.sum()), accuracy(y_hat, y), y.numel())
-    # Return training loss and training accuracy
-    return metric[0] / metric[2], metric[1] / metric[2]
-```
-
-```{.python .input}
-#@tab tensorflow
-def train_epoch_ch3(net, train_iter, loss, updater):  #@save
-    """The training loop defined in Chapter 3."""
-    # Sum of training loss, sum of training accuracy, no. of examples
-    metric = Accumulator(3)
-    for X, y in train_iter:
-        # Compute gradients and update parameters
-        with tf.GradientTape() as tape:
-            y_hat = net(X)
-            # Keras implementations for loss takes (labels, predictions)
-            # instead of (predictions, labels) that users might implement
-            # in this book, e.g. `cross_entropy` that we implemented above
-            if isinstance(loss, tf.keras.losses.Loss):
-                l = loss(y, y_hat)
-            else:
-                l = loss(y_hat, y)
-        if isinstance(updater, tf.keras.optimizers.Optimizer):
-            params = net.trainable_variables
-            grads = tape.gradient(l, params)
-            updater.apply_gradients(zip(grads, params))
-        else:
-            updater(X.shape[0], tape.gradient(l, updater.params))
-        # Keras loss by default returns the average loss in a batch
-        l_sum = l * float(tf.size(y)) if isinstance(
-            loss, tf.keras.losses.Loss) else tf.reduce_sum(l)
-        metric.add(l_sum, accuracy(y_hat, y), tf.size(y))
-    # Return training loss and training accuracy
-    return metric[0] / metric[2], metric[1] / metric[2]
-```
-
-Before showing the implementation of the training function, we define [**a utility class that plots data in animation.**] This class aims to simplify code in the rest of the book.
-
-```{.python .input}
-#@tab all
-class Animator:  #@save
-    """For plotting data in animation."""
-    def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
-                 ylim=None, xscale='linear', yscale='linear',
-                 fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
-                 figsize=(3.5, 2.5)):
-        # Incrementally plot multiple lines
-        if legend is None:
-            legend = []
-        d2l.use_svg_display()
-        self.fig, self.axes = d2l.plt.subplots(nrows, ncols, figsize=figsize)
-        if nrows * ncols == 1:
-            self.axes = [self.axes, ]
-        # Use a lambda function to capture arguments
-        self.config_axes = lambda: d2l.set_axes(
-            self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
-        self.X, self.Y, self.fmts = None, None, fmts
-
-    def add(self, x, y):
-        # Add multiple data points into the figure
-        if not hasattr(y, "__len__"):
-            y = [y]
-        n = len(y)
-        if not hasattr(x, "__len__"):
-            x = [x] * n
-        if not self.X:
-            self.X = [[] for _ in range(n)]
-        if not self.Y:
-            self.Y = [[] for _ in range(n)]
-        for i, (a, b) in enumerate(zip(x, y)):
-            if a is not None and b is not None:
-                self.X[i].append(a)
-                self.Y[i].append(b)
-        self.axes[0].cla()
-        for x, y, fmt in zip(self.X, self.Y, self.fmts):
-            self.axes[0].plot(x, y, fmt)
-        self.config_axes()
-        display.display(self.fig)
-        display.clear_output(wait=True)
-```
-
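-[~~The training function~~] The following training function trains a model `net` on a training dataset accessed via `train_iter` for multiple epochs, the number of which is specified by `num_epochs`. At the end of each epoch, the model is evaluated on a testing dataset accessed via `test_iter`. We leverage the `Animator` class to visualize the training progress.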
-
-```{.python .input}
-#@tab all
-def train_ch3(net, train_iter, test_iter, loss, num_epochs, updater):  #@save
-    """Train a model (defined in Chapter 3)."""
-    animator = Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0.3, 0.9],
-                        legend=['train loss', 'train acc', 'test acc'])
-    for epoch in range(num_epochs):
-        train_metrics = train_epoch_ch3(net, train_iter, loss, updater)
-        test_acc = evaluate_accuracy(net, test_iter)
-        animator.add(epoch + 1, train_metrics + (test_acc,))
-    train_loss, train_acc = train_metrics
-    assert train_loss < 0.5, train_loss
-    assert train_acc <= 1 and train_acc > 0.7, train_acc
-    assert test_acc <= 1 and test_acc > 0.7, test_acc
-```
-
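-As an implementation from scratch, we [**use the minibatch stochastic gradient descent**] defined in :numref:`sec_linear_scratch` to optimize the loss function of the model with a learning rate of 0.1.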
-
-```{.python .input}
-#@tab mxnet, pytorch
-lr = 0.1
-
-def updater(batch_size):
-    return d2l.sgd([W, b], lr, batch_size)
-```
-
-```{.python .input}
-#@tab tensorflow
-class Updater():  #@save
-    """For updating parameters using minibatch stochastic gradient descent."""
-    def __init__(self, params, lr):
-        self.params = params
-        self.lr = lr
-
-    def __call__(self, batch_size, grads):
-        d2l.sgd(self.params, grads, self.lr, batch_size)
-
-updater = Updater([W, b], lr=0.1)
-```
-
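-Now we [**train the model for 10 epochs.**] Note that both the number of epochs (`num_epochs`) and the learning rate (`lr`) are adjustable hyperparameters. By changing their values, we may be able to increase the classification accuracy of the model.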
-
-```{.python .input}
-#@tab all
-num_epochs = 10
-train_ch3(net, train_iter, test_iter, cross_entropy, num_epochs, updater)
-```
-
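-## Prediction
-
-Now that training is complete, our model is ready to [**classify some images.**] Given a series of images, we compare their actual labels (first line of text output) with the predictions from the model (second line of text output).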
-
-```{.python .input}
-#@tab all
-def predict_ch3(net, test_iter, n=6):  #@save
-    """Predict labels (defined in Chapter 3)."""
-    for X, y in test_iter:
-        break
-    trues = d2l.get_fashion_mnist_labels(y)
-    preds = d2l.get_fashion_mnist_labels(d2l.argmax(net(X), axis=1))
-    titles = [true +'\n' + pred for true, pred in zip(trues, preds)]
-    d2l.show_images(
-        d2l.reshape(X[0:n], (n, 28, 28)), 1, n, titles=titles[0:n])
-
-predict_ch3(net, test_iter)
-```
-
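-## Summary
-
-* With softmax regression, we can train models for multiclass classification.
-* The training loop of softmax regression is very similar to that of linear regression: retrieve and read data, define models and loss functions, then train models using optimization algorithms. As you will soon find out, most common deep learning models have similar training procedures.
-
-## Exercises
-
-1. In this section, we directly implemented the softmax function based on the mathematical definition of the softmax operation. What problems might this cause? Hint: try to calculate the size of $\exp(50)$.
-1. The function `cross_entropy` in this section was implemented according to the definition of the cross-entropy loss function. What could be the problem with this implementation? Hint: consider the domain of the logarithm.
-1. What solutions can you think of to fix the two problems above?
-1. Is it always a good idea to return the most likely label? For example, would you do this for medical diagnosis?
-1. Assume that we want to use softmax regression to predict the next word based on some features. What are some problems that might arise from a large vocabulary?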
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/50)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/51)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/225)
-:end_tab:
diff --git a/chapter_linear-networks/softmax-regression-scratch_origin.md b/chapter_linear-networks/softmax-regression-scratch_origin.md
deleted file mode 100644
index 0847431..0000000
--- a/chapter_linear-networks/softmax-regression-scratch_origin.md
+++ /dev/null
@@ -1,605 +0,0 @@
-# Implementation of Softmax Regression from Scratch
-:label:`sec_softmax_scratch`
-
-(**Just as we implemented linear regression from scratch, we believe that**)
-softmax regression
-is similarly fundamental and
-(**you ought to know the gory details of**) (~~softmax regression~~) and how to implement it yourself.
-We will work with the Fashion-MNIST dataset, just introduced in :numref:`sec_fashion_mnist`,
-setting up a data iterator with batch size 256.
-
-```{.python .input}
-from d2l import mxnet as d2l
-from mxnet import autograd, np, npx, gluon
-from IPython import display
-npx.set_np()
-```
-
-```{.python .input}
-#@tab pytorch
-from d2l import torch as d2l
-import torch
-from IPython import display
-```
-
-```{.python .input}
-#@tab tensorflow
-from d2l import tensorflow as d2l
-import tensorflow as tf
-from IPython import display
-```
-
-```{.python .input}
-#@tab all
-batch_size = 256
-train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
-```
-
-## Initializing Model Parameters
-
-As in our linear regression example,
-each example here will be represented by a fixed-length vector.
-Each example in the raw dataset is a $28 \times 28$ image.
-In this section, [**we will flatten each image,
-treating them as vectors of length 784.**]
-In the future, we will talk about more sophisticated strategies
-for exploiting the spatial structure in images,
-but for now we treat each pixel location as just another feature.
-
-Recall that in softmax regression,
-we have as many outputs as there are classes.
-(**Because our dataset has 10 classes,
-our network will have an output dimension of 10.**)
-Consequently, our weights will constitute a $784 \times 10$ matrix
-and the biases will constitute a $1 \times 10$ row vector.
-As with linear regression, we will initialize our weights `W`
-with Gaussian noise and our biases to take the initial value 0.
-
-```{.python .input}
-num_inputs = 784
-num_outputs = 10
-
-W = np.random.normal(0, 0.01, (num_inputs, num_outputs))
-b = np.zeros(num_outputs)
-W.attach_grad()
-b.attach_grad()
-```
-
-```{.python .input}
-#@tab pytorch
-num_inputs = 784
-num_outputs = 10
-
-W = torch.normal(0, 0.01, size=(num_inputs, num_outputs), requires_grad=True)
-b = torch.zeros(num_outputs, requires_grad=True)
-```
-
-```{.python .input}
-#@tab tensorflow
-num_inputs = 784
-num_outputs = 10
-
-W = tf.Variable(tf.random.normal(shape=(num_inputs, num_outputs),
-                                 mean=0, stddev=0.01))
-b = tf.Variable(tf.zeros(num_outputs))
-```
-
-## Defining the Softmax Operation
-
-Before implementing the softmax regression model,
-let us briefly review how the sum operator works
-along specific dimensions in a tensor,
-as discussed in :numref:`subseq_lin-alg-reduction` and :numref:`subseq_lin-alg-non-reduction`.
-[**Given a matrix `X` we can sum over all elements (by default) or only
-over elements in the same axis,**]
-i.e., the same column (axis 0) or the same row (axis 1).
-Note that if `X` is a tensor with shape (2, 3)
-and we sum over the columns,
-the result will be a vector with shape (3,).
-When invoking the sum operator,
-we can specify to keep the number of axes in the original tensor,
-rather than collapsing out the dimension that we summed over.
-This will result in a two-dimensional tensor with shape (1, 3).
-
-```{.python .input}
-#@tab pytorch
-X = d2l.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
-d2l.reduce_sum(X, 0, keepdim=True), d2l.reduce_sum(X, 1, keepdim=True)
-```
-
-```{.python .input}
-#@tab mxnet, tensorflow
-X = d2l.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
-d2l.reduce_sum(X, 0, keepdims=True), d2l.reduce_sum(X, 1, keepdims=True)
-```
-
-We are now ready to (**implement the softmax operation**).
-Recall that softmax consists of three steps:
-(i) we exponentiate each term (using `exp`);
-(ii) we sum over each row (we have one row per example in the batch)
-to get the normalization constant for each example;
-(iii) we divide each row by its normalization constant,
-ensuring that the result sums to 1.
-Before looking at the code, let us recall
-how this looks expressed as an equation:
-
-(**
-$$\mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(\mathbf{X}_{ij})}{\sum_k \exp(\mathbf{X}_{ik})}.$$
-**)
-
-The denominator, or normalization constant,
-is also sometimes called the *partition function*
-(and its logarithm is called the log-partition function).
-The origins of that name are in [statistical physics](https://en.wikipedia.org/wiki/Partition_function_(statistical_mechanics))
-where a related equation models the distribution
-over an ensemble of particles.
-
-```{.python .input}
-#@tab mxnet, tensorflow
-def softmax(X):
-    X_exp = d2l.exp(X)
-    partition = d2l.reduce_sum(X_exp, 1, keepdims=True)
-    return X_exp / partition  # The broadcasting mechanism is applied here
-```
-
-```{.python .input}
-#@tab pytorch
-def softmax(X):
-    X_exp = d2l.exp(X)
-    partition = d2l.reduce_sum(X_exp, 1, keepdim=True)
-    return X_exp / partition  # The broadcasting mechanism is applied here
-```
-
-As you can see, for any random input,
-[**we turn each element into a non-negative number.
-Moreover, each row sums up to 1,**]
-as is required for a probability.
-
-```{.python .input}
-#@tab mxnet, pytorch
-X = d2l.normal(0, 1, (2, 5))
-X_prob = softmax(X)
-X_prob, d2l.reduce_sum(X_prob, 1)
-```
-
-```{.python .input}
-#@tab tensorflow
-X = tf.random.normal((2, 5), 0, 1)
-X_prob = softmax(X)
-X_prob, tf.reduce_sum(X_prob, 1)
-```
-
-Note that while this looks correct mathematically,
-we were a bit sloppy in our implementation
-because we failed to take precautions against numerical overflow or underflow
-due to large or very small elements of the matrix.
-
-## Defining the Model
-
-Now that we have defined the softmax operation,
-we can [**implement the softmax regression model.**]
-The code below defines how the input is mapped to the output through the network.
-Note that we flatten each original image in the batch
-into a vector using the `reshape` function
-before passing the data through our model.
-
-```{.python .input}
-#@tab all
-def net(X):
-    return softmax(d2l.matmul(d2l.reshape(X, (-1, W.shape[0])), W) + b)
-```
-
-## Defining the Loss Function
-
-Next, we need to implement the cross-entropy loss function,
-as introduced in :numref:`sec_softmax`.
-This may be the most common loss function
-in all of deep learning because, at the moment,
-classification problems far outnumber regression problems.
-
-Recall that cross-entropy takes the negative log-likelihood
-of the predicted probability assigned to the true label.
-Rather than iterating over the predictions with a Python for-loop
-(which tends to be inefficient),
-we can pick all elements by a single operator.
-Below, we [**create sample data `y_hat`
-with 2 examples of predicted probabilities over 3 classes and their corresponding labels `y`.**]
-With `y` we know that in the first example the first class is the correct prediction and
-in the second example the third class is the ground-truth.
-[**Using `y` as the indices of the probabilities in `y_hat`,**]
-we pick the probability of the first class in the first example
-and the probability of the third class in the second example.
-
-```{.python .input}
-#@tab mxnet, pytorch
-y = d2l.tensor([0, 2])
-y_hat = d2l.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
-y_hat[[0, 1], y]
-```
-
-```{.python .input}
-#@tab tensorflow
-y_hat = tf.constant([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
-y = tf.constant([0, 2])
-tf.boolean_mask(y_hat, tf.one_hot(y, depth=y_hat.shape[-1]))
-```
-
-Now we can (**implement the cross-entropy loss function**) efficiently with just one line of code.
-
-```{.python .input}
-#@tab mxnet, pytorch
-def cross_entropy(y_hat, y):
-    return - d2l.log(y_hat[range(len(y_hat)), y])
-
-cross_entropy(y_hat, y)
-```
-
-```{.python .input}
-#@tab tensorflow
-def cross_entropy(y_hat, y):
-    return -tf.math.log(tf.boolean_mask(
-        y_hat, tf.one_hot(y, depth=y_hat.shape[-1])))
-
-cross_entropy(y_hat, y)
-```
-
-## Classification Accuracy
-
-Given the predicted probability distribution `y_hat`,
-we typically choose the class with the highest predicted probability
-whenever we must output a hard prediction.
-Indeed, many applications require that we make a choice.
-Gmail must categorize an email into "Primary", "Social", "Updates", or "Forums".
-It might estimate probabilities internally,
-but at the end of the day it has to choose one among the classes.
-
-When predictions are consistent with the label class `y`, they are correct.
-The classification accuracy is the fraction of all predictions that are correct.
-Although it can be difficult to optimize accuracy directly (it is not differentiable),
-it is often the performance measure that we care most about,
-and we will nearly always report it when training classifiers.
-
-To compute accuracy we do the following.
-First, if `y_hat` is a matrix,
-we assume that the second dimension stores prediction scores for each class.
-We use `argmax` to obtain the predicted class by the index for the largest entry in each row.
-Then we [**compare the predicted class with the ground-truth `y` elementwise.**]
-Since the equality operator `==` is sensitive to data types,
-we convert `y_hat`'s data type to match that of `y`.
-The result is a tensor containing entries of 0 (false) and 1 (true).
-Taking the sum yields the number of correct predictions.
-
-```{.python .input}
-#@tab all
-def accuracy(y_hat, y):  #@save
-    """Compute the number of correct predictions."""
-    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
-        y_hat = d2l.argmax(y_hat, axis=1)
-    cmp = d2l.astype(y_hat, y.dtype) == y
-    return float(d2l.reduce_sum(d2l.astype(cmp, y.dtype)))
-```
-
-We will continue to use the variables `y_hat` and `y`
-defined before
-as the predicted probability distributions and labels, respectively.
-We can see that the first example's prediction class is 2
-(the largest element of the row is 0.6 with the index 2),
-which is inconsistent with the actual label, 0.
-The second example's prediction class is 2
-(the largest element of the row is 0.5 with the index of 2),
-which is consistent with the actual label, 2.
-Therefore, the classification accuracy rate for these two examples is 0.5.
-
-```{.python .input}
-#@tab all
-accuracy(y_hat, y) / len(y)
-```
-
-[**Similarly, we can evaluate the accuracy for any model `net` on a dataset**]
-that is accessed via the data iterator `data_iter`.
-
-```{.python .input}
-#@tab mxnet, tensorflow
-def evaluate_accuracy(net, data_iter):  #@save
-    """Compute the accuracy for a model on a dataset."""
-    metric = Accumulator(2)  # No. of correct predictions, no. of predictions
-    for X, y in data_iter:
-        metric.add(accuracy(net(X), y), d2l.size(y))
-    return metric[0] / metric[1]
-```
-
-```{.python .input}
-#@tab pytorch
-def evaluate_accuracy(net, data_iter):  #@save
-    """Compute the accuracy for a model on a dataset."""
-    if isinstance(net, torch.nn.Module):
-        net.eval()  # Set the model to evaluation mode
-    metric = Accumulator(2)  # No. of correct predictions, no. of predictions
-
-    with torch.no_grad():
-        for X, y in data_iter:
-            metric.add(accuracy(net(X), y), d2l.size(y))
-    return metric[0] / metric[1]
-```
-
-Here `Accumulator` is a utility class to accumulate sums over multiple variables.
-In the above `evaluate_accuracy` function,
-we create 2 variables in the `Accumulator` instance for storing both
-the number of correct predictions and the number of predictions, respectively.
-Both will be accumulated over time as we iterate over the dataset.
-
-```{.python .input}
-#@tab all
-class Accumulator:  #@save
-    """For accumulating sums over `n` variables."""
-    def __init__(self, n):
-        self.data = [0.0] * n
-
-    def add(self, *args):
-        self.data = [a + float(b) for a, b in zip(self.data, args)]
-
-    def reset(self):
-        self.data = [0.0] * len(self.data)
-
-    def __getitem__(self, idx):
-        return self.data[idx]
-```
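A short usage sketch may make the accumulation pattern concrete. The class is repeated (without `reset`) so that the snippet runs on its own, and the per-batch counts are made-up numbers, not results from any model.

```python
class Accumulator:
    """Accumulate running sums over n variables (copy of the class above)."""
    def __init__(self, n):
        self.data = [0.0] * n

    def add(self, *args):
        self.data = [a + float(b) for a, b in zip(self.data, args)]

    def __getitem__(self, idx):
        return self.data[idx]

metric = Accumulator(2)  # correct predictions, total predictions
# Hypothetical per-batch results: (number correct, batch size)
for num_correct, batch_size in [(3, 4), (2, 4), (1, 2)]:
    metric.add(num_correct, batch_size)
running_accuracy = metric[0] / metric[1]
print(running_accuracy)  # 6 correct out of 10 predictions
```

Keeping the two sums separate and dividing only at the end gives the accuracy over all examples, even when the last batch is smaller than the others.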
-
-[**Because we initialized the `net` model with random weights,
-the accuracy of this model should be close to random guessing,**]
-i.e., 0.1 for 10 classes.
-
-```{.python .input}
-#@tab all
-evaluate_accuracy(net, test_iter)
-```
-
-## Training
-
-[**The training loop**]
-for softmax regression should look strikingly familiar
-if you read through our implementation
-of linear regression in :numref:`sec_linear_scratch`.
-Here we refactor the implementation to make it reusable.
-First, we define a function to train for one epoch.
-Note that `updater` is a general function to update the model parameters,
-which accepts the batch size as an argument.
-It can be either a wrapper of the `d2l.sgd` function
-or a framework's built-in optimization function.
-
-```{.python .input}
-def train_epoch_ch3(net, train_iter, loss, updater):  #@save
-    """Train a model within one epoch (defined in Chapter 3)."""
-    # Sum of training loss, sum of training accuracy, no. of examples
-    metric = Accumulator(3)
-    if isinstance(updater, gluon.Trainer):
-        updater = updater.step
-    for X, y in train_iter:
-        # Compute gradients and update parameters
-        with autograd.record():
-            y_hat = net(X)
-            l = loss(y_hat, y)
-        l.backward()
-        updater(X.shape[0])
-        metric.add(float(l.sum()), accuracy(y_hat, y), y.size)
-    # Return training loss and training accuracy
-    return metric[0] / metric[2], metric[1] / metric[2]
-```
-
-```{.python .input}
-#@tab pytorch
-def train_epoch_ch3(net, train_iter, loss, updater):  #@save
-    """The training loop defined in Chapter 3."""
-    # Set the model to training mode
-    if isinstance(net, torch.nn.Module):
-        net.train()
-    # Sum of training loss, sum of training accuracy, no. of examples
-    metric = Accumulator(3)
-    for X, y in train_iter:
-        # Compute gradients and update parameters
-        y_hat = net(X)
-        l = loss(y_hat, y)
-        if isinstance(updater, torch.optim.Optimizer):
-            # Using PyTorch in-built optimizer & loss criterion
-            updater.zero_grad()
-            l.sum().backward()
-            updater.step()
-        else:
-            # Using custom built optimizer & loss criterion
-            l.sum().backward()
-            updater(X.shape[0])
-        metric.add(float(l.sum()), accuracy(y_hat, y), y.numel())
-    # Return training loss and training accuracy
-    return metric[0] / metric[2], metric[1] / metric[2]
-```
-
-```{.python .input}
-#@tab tensorflow
-def train_epoch_ch3(net, train_iter, loss, updater):  #@save
-    """The training loop defined in Chapter 3."""
-    # Sum of training loss, sum of training accuracy, no. of examples
-    metric = Accumulator(3)
-    for X, y in train_iter:
-        # Compute gradients and update parameters
-        with tf.GradientTape() as tape:
-            y_hat = net(X)
-            # Keras loss implementations take (labels, predictions)
-            # instead of the (predictions, labels) order used by losses
-            # implemented in this book, e.g., `cross_entropy` above
-            if isinstance(loss, tf.keras.losses.Loss):
-                l = loss(y, y_hat)
-            else:
-                l = loss(y_hat, y)
-        if isinstance(updater, tf.keras.optimizers.Optimizer):
-            params = net.trainable_variables
-            grads = tape.gradient(l, params)
-            updater.apply_gradients(zip(grads, params))
-        else:
-            updater(X.shape[0], tape.gradient(l, updater.params))
-        # Keras loss by default returns the average loss in a batch
-        l_sum = l * float(tf.size(y)) if isinstance(
-            loss, tf.keras.losses.Loss) else tf.reduce_sum(l)
-        metric.add(l_sum, accuracy(y_hat, y), tf.size(y))
-    # Return training loss and training accuracy
-    return metric[0] / metric[2], metric[1] / metric[2]
-```
-
-Before showing the implementation of the training function,
-we define [**a utility class that plots data in animation.**]
-Again, it aims to simplify code in the rest of the book.
-
-```{.python .input}
-#@tab all
-class Animator:  #@save
-    """For plotting data in animation."""
-    def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
-                 ylim=None, xscale='linear', yscale='linear',
-                 fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
-                 figsize=(3.5, 2.5)):
-        # Incrementally plot multiple lines
-        if legend is None:
-            legend = []
-        d2l.use_svg_display()
-        self.fig, self.axes = d2l.plt.subplots(nrows, ncols, figsize=figsize)
-        if nrows * ncols == 1:
-            self.axes = [self.axes, ]
-        # Use a lambda function to capture arguments
-        self.config_axes = lambda: d2l.set_axes(
-            self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
-        self.X, self.Y, self.fmts = None, None, fmts
-
-    def add(self, x, y):
-        # Add multiple data points into the figure
-        if not hasattr(y, "__len__"):
-            y = [y]
-        n = len(y)
-        if not hasattr(x, "__len__"):
-            x = [x] * n
-        if not self.X:
-            self.X = [[] for _ in range(n)]
-        if not self.Y:
-            self.Y = [[] for _ in range(n)]
-        for i, (a, b) in enumerate(zip(x, y)):
-            if a is not None and b is not None:
-                self.X[i].append(a)
-                self.Y[i].append(b)
-        self.axes[0].cla()
-        for x, y, fmt in zip(self.X, self.Y, self.fmts):
-            self.axes[0].plot(x, y, fmt)
-        self.config_axes()
-        display.display(self.fig)
-        display.clear_output(wait=True)
-```
-
-[~~The training function~~]
-The following training function then
-trains a model `net` on a training dataset accessed via `train_iter`
-for multiple epochs, the number of which is specified by `num_epochs`.
-At the end of each epoch,
-the model is evaluated on a testing dataset accessed via `test_iter`.
-We will leverage the `Animator` class to visualize
-the training progress.
-
-```{.python .input}
-#@tab all
-def train_ch3(net, train_iter, test_iter, loss, num_epochs, updater):  #@save
-    """Train a model (defined in Chapter 3)."""
-    animator = Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0.3, 0.9],
-                        legend=['train loss', 'train acc', 'test acc'])
-    for epoch in range(num_epochs):
-        train_metrics = train_epoch_ch3(net, train_iter, loss, updater)
-        test_acc = evaluate_accuracy(net, test_iter)
-        animator.add(epoch + 1, train_metrics + (test_acc,))
-    train_loss, train_acc = train_metrics
-    assert train_loss < 0.5, train_loss
-    assert train_acc <= 1 and train_acc > 0.7, train_acc
-    assert test_acc <= 1 and test_acc > 0.7, test_acc
-```
-
-As an implementation from scratch,
-we [**use the minibatch stochastic gradient descent**] defined in :numref:`sec_linear_scratch`
-to optimize the loss function of the model with a learning rate 0.1.
-
-```{.python .input}
-#@tab mxnet, pytorch
-lr = 0.1
-
-def updater(batch_size):
-    return d2l.sgd([W, b], lr, batch_size)
-```
-
-```{.python .input}
-#@tab tensorflow
-class Updater():  #@save
-    """For updating parameters using minibatch stochastic gradient descent."""
-    def __init__(self, params, lr):
-        self.params = params
-        self.lr = lr
-
-    def __call__(self, batch_size, grads):
-        d2l.sgd(self.params, grads, self.lr, batch_size)
-
-updater = Updater([W, b], lr=0.1)
-```
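To illustrate the `updater(batch_size)` call shape without a framework, here is a minimal sketch on plain Python lists. The `sgd` and `make_updater` helpers are hypothetical stand-ins, not the `d2l` functions, and the sketch assumes the update divides the summed gradient by the batch size.

```python
def sgd(params, grads, lr, batch_size):
    """In-place minibatch SGD step on plain Python lists (illustrative only)."""
    for i, g in enumerate(grads):
        params[i] -= lr * g / batch_size

def make_updater(params, grads, lr):
    """Return a closure that matches the `updater(batch_size)` call shape."""
    def updater(batch_size):
        sgd(params, grads, lr, batch_size)
    return updater

params = [1.0, -2.0]   # hypothetical parameter values
grads = [4.0, -8.0]    # hypothetical gradients summed over a batch of 4
updater = make_updater(params, grads, lr=0.1)
updater(4)
print(params)  # each parameter moved against its averaged gradient
```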
-
-Now we [**train the model for 10 epochs.**]
-Note that both the number of epochs (`num_epochs`)
-and the learning rate (`lr`) are adjustable hyperparameters.
-By changing their values, we may be able
-to increase the classification accuracy of the model.
-
-```{.python .input}
-#@tab all
-num_epochs = 10
-train_ch3(net, train_iter, test_iter, cross_entropy, num_epochs, updater)
-```
-
-## Prediction
-
-Now that training is complete,
-our model is ready to [**classify some images.**]
-Given a series of images,
-we will compare their actual labels
-(first line of text output)
-and the predictions from the model
-(second line of text output).
-
-```{.python .input}
-#@tab all
-def predict_ch3(net, test_iter, n=6):  #@save
-    """Predict labels (defined in Chapter 3)."""
-    for X, y in test_iter:
-        break
-    trues = d2l.get_fashion_mnist_labels(y)
-    preds = d2l.get_fashion_mnist_labels(d2l.argmax(net(X), axis=1))
-    titles = [true + '\n' + pred for true, pred in zip(trues, preds)]
-    d2l.show_images(
-        d2l.reshape(X[0:n], (n, 28, 28)), 1, n, titles=titles[0:n])
-
-predict_ch3(net, test_iter)
-```
-
-## Summary
-
-* With softmax regression, we can train models for multiclass classification.
-* The training loop of softmax regression is very similar to that in linear regression: retrieve and read data, define models and loss functions, then train models using optimization algorithms. As you will soon find out, most common deep learning models have similar training procedures.
-
-## Exercises
-
-1. In this section, we directly implemented the softmax function based on the mathematical definition of the softmax operation. What problems might this cause? Hint: try to calculate the size of $\exp(50)$.
-1. The function `cross_entropy` in this section was implemented according to the definition of the cross-entropy loss function.  What could be the problem with this implementation? Hint: consider the domain of the logarithm.
-1. What solutions can you think of to fix the two problems above?
-1. Is it always a good idea to return the most likely label? For example, would you do this for medical diagnosis?
-1. Assume that we want to use softmax regression to predict the next word based on some features. What are some problems that might arise from a large vocabulary?
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/50)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/51)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/225)
-:end_tab:
diff --git a/chapter_linear-networks/softmax-regression.md b/chapter_linear-networks/softmax-regression.md
deleted file mode 100644
index c21e89b..0000000
--- a/chapter_linear-networks/softmax-regression.md
+++ /dev/null
@@ -1,189 +0,0 @@
-# Softmax Regression
-:label:`sec_softmax`
-
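-In :numref:`sec_linear_regression`, we introduced linear regression, working through an implementation from scratch in :numref:`sec_linear_scratch` and then using the high-level APIs of a deep learning framework in :numref:`sec_linear_concise` to do the heavy lifting.
-
-Regression is the hammer we reach for when we want to answer *how much?* or *how many?* questions. If you want to predict the number of dollars (price) at which a house will be sold, the number of wins a baseball team might have, or the number of days that a patient will remain hospitalized before being discharged, then you are probably looking for a regression model.
-
-In practice, we are more often interested in *classification*: asking not "how much" but "which one":
-
-* Does this email belong in the spam folder or the inbox?
-* Is this customer more likely *to sign up* or *not to sign up* for a subscription service?
-* Does this image depict a donkey, a dog, a cat, or a rooster?
-* Which movie is Aston most likely to watch next?
-
-Colloquially, machine learning practitioners overload the word *classification* to describe two subtly different problems: (i) those where we are interested only in hard assignments of examples to categories (classes); and (ii) those where we wish to make soft assignments, i.e., to assess the probability that each category applies. The distinction tends to get blurred, in part because even when we only care about hard assignments, we often still use models that make soft assignments.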
-
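-## Classification Problem
-:label:`subsec_classification-problem`
-
-To get our feet wet, let us start with a simple image classification problem. Here, each input consists of a $2\times2$ grayscale image. We can represent each pixel value with a single scalar, giving us four features $x_1, x_2, x_3, x_4$. Further, let us assume that each image belongs to one of the categories "cat", "chicken", and "dog".
-
-Next, we have to choose how to represent the labels. We have two obvious choices. Perhaps the most natural impulse would be to choose $y \in \{1, 2, 3\}$, where the integers represent $\{\text{dog}, \text{cat}, \text{chicken}\}$ respectively. This is a great way of *storing* such information on a computer. If the categories had some natural ordering among them, say if we were trying to predict $\{\text{baby}, \text{toddler}, \text{adolescent}, \text{young adult}, \text{adult}, \text{geriatric}\}$, then it might even make sense to cast this problem as regression and keep the labels in this format.
-
-But general classification problems do not come with natural orderings among the classes. Fortunately, statisticians long ago invented a simple way to represent categorical data: the *one-hot encoding*. A one-hot encoding is a vector with as many components as we have categories. The component corresponding to a particular instance's category is set to 1 and all other components are set to 0. In our case, a label $y$ would be a three-dimensional vector, with $(1, 0, 0)$ corresponding to "cat", $(0, 1, 0)$ to "chicken", and $(0, 0, 1)$ to "dog":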
-
-$$y \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}.$$
-
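-## Network Architecture
-
-In order to estimate the conditional probabilities associated with all the possible classes, we need a model with multiple outputs, one per class. To address classification with linear models, we will need as many affine functions as we have outputs. Each output will correspond to its own affine function. In our case, since we have 4 features and 3 possible output categories, we will need 12 scalars to represent the weights ($w$ with subscripts), and 3 scalars to represent the biases ($b$ with subscripts). We compute these three *logits*, $o_1, o_2$, and $o_3$, for each input: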
-
-$$
-\begin{aligned}
-o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1,\\
-o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2,\\
-o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3.
-\end{aligned}
-$$
-
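-We can depict this calculation with the neural network diagram shown in :numref:`fig_softmaxreg`. Just as in linear regression, softmax regression is also a single-layer neural network. And since the calculation of each output, $o_1, o_2$, and $o_3$, depends on all inputs, $x_1$, $x_2$, $x_3$, and $x_4$, the output layer of softmax regression can also be described as a fully-connected layer.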
-
-![Softmax regression is a single-layer neural network.](../img/softmaxreg.svg)
-:label:`fig_softmaxreg`
-
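-To express the model more compactly, we can use linear algebra notation. In vector form, we arrive at $\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$, a form better suited both for mathematics and for writing code. Note that we have gathered all of our weights into a $3 \times 4$ matrix and that for features of a given data example $\mathbf{x}$, our outputs are given by a matrix-vector product of our weights by our input features plus our biases $\mathbf{b}$.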
-
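-## Parameterization Cost of Fully-Connected Layers
-:label:`subsec_parameterization-cost-fc-layers`
-
-As we will see in subsequent chapters, fully-connected layers are ubiquitous in deep learning. However, as the name suggests, fully-connected layers are *fully* connected with potentially many learnable parameters. Specifically, for any fully-connected layer with $d$ inputs and $q$ outputs, the parameterization cost is $\mathcal{O}(dq)$, which can be prohibitively high in practice. Fortunately, this cost of transforming $d$ inputs into $q$ outputs can be reduced to $\mathcal{O}(\frac{dq}{n})$, where the hyperparameter $n$ can be flexibly specified by us to balance between parameter saving and model effectiveness in real-world applications :cite:`Zhang.Tay.Zhang.ea.2021`.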
-
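-## Softmax Operation
-:label:`subsec_softmax_operation`
-
-The main approach that we are going to take here is to interpret the outputs of our model as probabilities. We will optimize our parameters to produce probabilities that maximize the likelihood of the observed data. Then, to generate predictions, we will set a threshold, for example, choosing the label with the maximum predicted probability.
-
-Put formally, we would like any output $\hat{y}_j$ to be interpreted as the probability that a given item belongs to class $j$. Then we can choose the class with the largest output value as our prediction $\operatorname*{argmax}_j \hat{y}_j$. For example, if $\hat{y}_1$, $\hat{y}_2$, and $\hat{y}_3$ are 0.1, 0.8, and 0.1, respectively, then we predict category 2, which (in our example) represents "chicken".
-
-You might be tempted to suggest that we interpret the logits $o$ directly as our outputs of interest. However, there are some problems with directly interpreting the output of the linear layer as a probability. On one hand, nothing constrains these numbers to sum to 1. On the other hand, depending on the inputs, they can take negative values. These violate basic axioms of probability presented in :numref:`sec_prob`.
-
-To interpret our outputs as probabilities, we must guarantee that (even on new data) they will be nonnegative and sum up to 1. Moreover, we need a training objective that encourages the model to estimate probabilities faithfully. Of all instances when a classifier outputs 0.5, we hope that half of those examples will actually belong to the predicted class. This is a property called *calibration*.
-
-The *softmax function*, invented in 1959 by the social scientist R. Duncan Luce in the context of *choice models*, does precisely this. To transform our logits such that they become nonnegative and sum to 1, while requiring that the model remains differentiable, we first exponentiate each logit (ensuring non-negativity) and then divide by their sum (ensuring that they sum to 1):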
-
-$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o})\quad \text{where}\quad \hat{y}_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}. $$
-:eqlabel:`eq_softmax_y_and_o`
-
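-It is easy to see that $\hat{y}_1 + \hat{y}_2 + \hat{y}_3 = 1$ with $0 \leq \hat{y}_j \leq 1$ for all $j$. Thus, $\hat{\mathbf{y}}$ is a proper probability distribution whose element values can be interpreted accordingly. Note that the softmax operation does not change the ordering among the logits $\mathbf{o}$, which are simply the pre-softmax values that determine the probabilities assigned to each class. Therefore, during prediction we can still pick out the most likely class by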
-
-$$
-\operatorname*{argmax}_j \hat y_j = \operatorname*{argmax}_j o_j.
-$$
-
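-Although softmax is a nonlinear function, the outputs of softmax regression are still *determined* by an affine transformation of input features; thus, softmax regression is a linear model.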
-
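-## Vectorization for Minibatches
-:label:`subsec_softmax_vectorization`
-
-To improve computational efficiency and take advantage of GPUs, we typically carry out vector calculations for minibatches of data. Assume that we are given a minibatch $\mathbf{X}$ of examples with feature dimensionality (number of inputs) $d$ and batch size $n$. Moreover, assume that we have $q$ categories in the output. Then the minibatch features $\mathbf{X}$ are in $\mathbb{R}^{n \times d}$, weights $\mathbf{W} \in \mathbb{R}^{d \times q}$, and the bias satisfies $\mathbf{b} \in \mathbb{R}^{1\times q}$.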
-
-$$ \begin{aligned} \mathbf{O} &= \mathbf{X} \mathbf{W} + \mathbf{b}, \\ \hat{\mathbf{Y}} & = \mathrm{softmax}(\mathbf{O}). \end{aligned} $$
-:eqlabel:`eq_minibatch_softmax_reg`
-
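-This turns the dominant operation into a matrix-matrix product $\mathbf{X} \mathbf{W}$, which is faster than the matrix-vector products we would be executing if we processed one example at a time. Since each row in $\mathbf{X}$ represents a data example, the softmax operation itself can be computed *rowwise*: for each row of $\mathbf{O}$, exponentiate all entries and then normalize them by the sum. With broadcasting triggered during the summation $\mathbf{X} \mathbf{W} + \mathbf{b}$ in :eqref:`eq_minibatch_softmax_reg`, both the minibatch logits $\mathbf{O}$ and output probabilities $\hat{\mathbf{Y}}$ are $n \times q$ matrices.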
-
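-## Loss Function
-
-Next, we need a loss function to measure the quality of our predicted probabilities. We will rely on maximum likelihood estimation, the very same concept that we encountered when providing a probabilistic justification for the mean squared error objective in linear regression (:numref:`subsec_normal_distribution_and_squared_loss`).
-
-### Log-Likelihood
-
-The softmax function gives us a vector $\hat{\mathbf{y}}$, which we can interpret as estimated conditional probabilities of each class given any input $\mathbf{x}$, e.g., $\hat{y}_1$ = $P(y=\text{cat} \mid \mathbf{x})$. Suppose that the entire dataset $\{\mathbf{X}, \mathbf{Y}\}$ has $n$ examples, where the example indexed by $i$ consists of a feature vector $\mathbf{x}^{(i)}$ and a one-hot label vector $\mathbf{y}^{(i)}$. We can compare the estimates with reality by checking how probable the actual classes are according to our model, given the features: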
-
-$$
-P(\mathbf{Y} \mid \mathbf{X}) = \prod_{i=1}^n P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}).
-$$
-
-According to maximum likelihood estimation, we maximize $P(\mathbf{Y} \mid \mathbf{X})$, which is equivalent to minimizing the negative log-likelihood:
-
-$$
--\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^n -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})
-= \sum_{i=1}^n l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}),
-$$
-
-where for any pair of label $\mathbf{y}$ and model prediction $\hat{\mathbf{y}}$ over $q$ classes, the loss function $l$ is
-
-$$ l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^q y_j \log \hat{y}_j. $$
-:eqlabel:`eq_l_cross_entropy`
-
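-For reasons explained later on, the loss function in :eqref:`eq_l_cross_entropy` is commonly called the *cross-entropy loss*. Since $\mathbf{y}$ is a one-hot vector of length $q$, the sum over all its coordinates $j$ vanishes for all but one term. Since all $\hat{y}_j$ are predicted probabilities, their logarithm is never larger than $0$. Consequently, the loss function cannot be minimized any further if we correctly predict the actual label with *certainty*, i.e., if the predicted probability $P(\mathbf{y} \mid \mathbf{x}) = 1$ for the actual label $\mathbf{y}$. Note that this is often impossible. For example, there might be label noise in the dataset (some examples may be mislabeled). It may also be impossible when the input features are not sufficiently informative to classify every example perfectly.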
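As a small numerical check of the loss in :eqref:`eq_l_cross_entropy`, the sketch below evaluates it in plain Python for a one-hot label. The probabilities reuse the 0.1, 0.8, 0.1 example from the softmax discussion; the function name and the choice to skip zero-weight terms are assumptions of this sketch.

```python
import math

def cross_entropy(y, y_hat):
    """l(y, y_hat) = -sum_j y_j * log(y_hat_j); zero-weight terms are skipped."""
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat) if y_j > 0)

y = [0, 1, 0]              # one-hot label for the second class
y_hat = [0.1, 0.8, 0.1]    # predicted probabilities
loss = cross_entropy(y, y_hat)
print(loss, -math.log(0.8))  # only the term for the true class survives
```

Skipping the terms with $y_j = 0$ also avoids evaluating $\log 0$ for classes that the model assigns zero probability but that are not the true label.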
-
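-### Softmax and Derivatives
-:label:`subsec_softmax_and_derivatives`
-
-Since the softmax and the corresponding loss are so common, it is worth understanding a bit better how they are computed. Plugging :eqref:`eq_softmax_y_and_o` into the definition of the loss in :eqref:`eq_l_cross_entropy` and using the definition of the softmax, we obtain: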
-
-$$
-\begin{aligned}
-l(\mathbf{y}, \hat{\mathbf{y}}) &=  - \sum_{j=1}^q y_j \log \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} \\
-&= \sum_{j=1}^q y_j \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j\\
-&= \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j.
-\end{aligned}
-$$
-
-To understand a bit better what is going on, consider the derivative with respect to any logit $o_j$. We get
-
-$$
-\partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j.
-$$
-
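-In other words, the derivative is the difference between the probability assigned by our model, as expressed by the softmax operation, and what actually happened, as expressed by elements in the one-hot label vector. In this sense, it is very similar to what we saw in regression, where the gradient was the difference between the observation $y$ and estimate $\hat{y}$. This is not a coincidence. In any exponential family model (see the [online appendix on distributions](https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/distributions.html)), the gradients of the log-likelihood are given by precisely this term. This fact makes computing gradients easy in practice.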
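The identity $\partial_{o_j} l = \mathrm{softmax}(\mathbf{o})_j - y_j$ can be checked numerically. The following is a minimal sketch that compares it against central finite differences of the loss written in terms of logits; the logits, label, and step size `eps` are arbitrary illustrative choices.

```python
import math

def softmax(o):
    """Exponentiate each logit and normalize by the sum."""
    exps = [math.exp(v) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def loss(o, y):
    """Cross-entropy in terms of logits: log sum_k exp(o_k) - sum_j y_j o_j."""
    log_sum_exp = math.log(sum(math.exp(v) for v in o))
    return log_sum_exp - sum(y_j * o_j for y_j, o_j in zip(y, o))

o = [0.5, -1.0, 2.0]   # hypothetical logits
y = [0, 0, 1]          # one-hot label
analytic = [p - y_j for p, y_j in zip(softmax(o), y)]

eps = 1e-6
numeric = []
for j in range(len(o)):
    plus = [v + eps if k == j else v for k, v in enumerate(o)]
    minus = [v - eps if k == j else v for k, v in enumerate(o)]
    numeric.append((loss(plus, y) - loss(minus, y)) / (2 * eps))

max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_gap)  # small, up to finite-difference and rounding error
```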
-
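-### Cross-Entropy Loss
-
-Now consider the case where we observe not just a single outcome but an entire distribution over outcomes. We can use the same representation as before for the label $\mathbf{y}$. The only difference is that rather than a vector containing only binary entries, say $(0, 0, 1)$, we now have a generic probability vector, say $(0.1, 0.2, 0.7)$. The math that we used previously to define the loss $l$ in :eqref:`eq_l_cross_entropy` still works out fine; only the interpretation is slightly more general. It is the expected value of the loss for a distribution over labels. This loss is called the *cross-entropy loss* and it is one of the most commonly used losses for classification problems. We can demystify the name by introducing just the basics of information theory. If you wish to understand more details of information theory, you may further refer to the [online appendix on information theory](https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/information-theory.html).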
-
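-## Information Theory Basics
-:label:`subsec_info_theory_basics`
-
-*Information theory* deals with the problem of encoding, decoding, transmitting,
-and manipulating information (also known as data) in as concise a form as possible.
-
-### Entropy
-
-The central idea in information theory is to quantify the information content in data. This quantity places a hard limit on our ability to compress the data. In information theory, this quantity is called the *entropy* of a distribution $P$, and it is captured by the following equation: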
-
-$$H[P] = \sum_j - P(j) \log P(j).$$
-:eqlabel:`eq_softmax_reg_entropy`
-
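-One of the fundamental theorems of information theory states that in order to encode data drawn randomly from the distribution $P$, we need at least $H[P]$ "nats" to encode it. If you wonder what a "nat" is, it is the equivalent of a bit but when using a code with base $e$ rather than one with base 2. Thus, one nat is $\frac{1}{\log(2)} \approx 1.44$ bits.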
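The entropy in :eqref:`eq_softmax_reg_entropy` and the nat-to-bit conversion are easy to evaluate directly. This is a minimal sketch in plain Python; the fair-coin distribution is an illustrative choice, and terms with zero probability are treated as contributing 0.

```python
import math

def entropy(p):
    """H[P] = sum_j -P(j) log P(j), in nats; zero-probability terms add 0."""
    return -sum(p_j * math.log(p_j) for p_j in p if p_j > 0)

fair_coin = [0.5, 0.5]
h_nats = entropy(fair_coin)
h_bits = h_nats / math.log(2)   # one nat is 1 / log(2) bits
print(h_nats, h_bits)           # about 0.693 nats, i.e. one bit per coin flip
```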
-
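-### Surprisal
-
-You might be wondering what compression has to do with prediction. Imagine that we have a stream of data that we want to compress. If it is always easy for us to predict the next token, then this data is easy to compress. Take the extreme example where every token in the stream always takes the same value. That is a very boring data stream! And not only is it boring, it is also easy to predict. Because the tokens are always the same, we do not have to transmit any information to communicate the contents of the stream. Easy to predict, easy to compress.
-
-However, if we cannot perfectly predict every event, then we might sometimes be surprised. Our surprise is greater when we assigned an event lower probability. Claude Shannon settled on $\log \frac{1}{P(j)} = -\log P(j)$ to quantify one's *surprisal* at observing an event $j$ having assigned it a (subjective) probability $P(j)$. The entropy defined in :eqref:`eq_softmax_reg_entropy` is then the *expected surprisal* when one assigned the correct probabilities that truly match the data-generating process.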
-
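-### Cross-Entropy Revisited
-
-So if entropy is the level of surprise experienced by someone who knows the true probability, then you might be wondering, what is cross-entropy? The cross-entropy *from* $P$ *to* $Q$, denoted $H(P, Q)$, is the expected surprisal of an observer with subjective probabilities $Q$ upon seeing data that were actually generated according to probabilities $P$. The lowest possible cross-entropy is achieved when $P=Q$. In this case, the cross-entropy from $P$ to $Q$ is $H(P, P)= H(P)$.
-
-In short, we can think of the cross-entropy classification objective in two ways: (i) as maximizing the likelihood of the observed data; and (ii) as minimizing our surprisal (and thus the number of bits) required to communicate the labels.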
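To see that $H(P, Q)$ is smallest when $Q = P$, one can compare the two quantities for a concrete pair of distributions. In this minimal sketch, `p` reuses the probability vector $(0.1, 0.2, 0.7)$ mentioned earlier, while the uniform `q` is a hypothetical observer's belief.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = sum_j -P(j) log Q(j): expected surprisal under Q for data from P."""
    return -sum(p_j * math.log(q_j) for p_j, q_j in zip(p, q) if p_j > 0)

p = [0.1, 0.2, 0.7]          # distribution that generates the data
q = [1 / 3, 1 / 3, 1 / 3]    # hypothetical observer's beliefs
h_p = cross_entropy(p, p)    # equals the entropy H(P)
h_pq = cross_entropy(p, q)   # equals log(3) for a uniform Q
print(h_p, h_pq)             # the mismatched observer is more surprised
```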
-
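-## Model Prediction and Evaluation
-
-After training the softmax regression model, given any example features, we can predict the probability of each output class. Normally, we use the class with the highest predicted probability as the output class. The prediction is correct if it is consistent with the actual class (label). In the next part of the experiment, we will use *accuracy* to evaluate the model's performance. This is equal to the ratio between the number of correct predictions and the total number of predictions.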
-
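-## Summary
-
-* The softmax operation takes a vector and maps it into probabilities.
-* Softmax regression applies to classification problems. It uses the probability distribution of the output class in the softmax operation.
-* Cross-entropy is a good measure of the difference between two probability distributions. It measures the number of bits needed to encode the data given our model.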
-
-## Exercises
-
-1. We can explore the connection between exponential families and the softmax in some more depth.
-    1. Compute the second derivative of the cross-entropy loss $l(\mathbf{y},\hat{\mathbf{y}})$ for the softmax.
-    1. Compute the variance of the distribution given by $\mathrm{softmax}(\mathbf{o})$ and show that it matches the second derivative computed above.
-1. Assume that we have three classes which occur with equal probability, i.e., the probability vector is $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$.
-    1. What is the problem if we try to design a binary code for it?
-    1. Can you design a better code? Hint: what happens if we try to encode two independent observations? What if we encode $n$ observations jointly?
-1. Softmax is a misnomer for the mapping introduced above (but everyone in deep learning uses it). The real softmax is defined as $\mathrm{RealSoftMax}(a, b) = \log (\exp(a) + \exp(b))$.
-    1. Prove that $\mathrm{RealSoftMax}(a, b) > \mathrm{max}(a, b)$.
-    1. Prove that this holds for $\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b)$, provided that $\lambda > 0$.
-    1. Show that for $\lambda \to \infty$ we have $\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b) \to \mathrm{max}(a, b)$.
-    1. What does the soft-min look like?
-    1. Extend this to more than two numbers.
-
-[Discussions](https://discuss.d2l.ai/t/46)
diff --git a/chapter_linear-networks/softmax-regression_origin.md b/chapter_linear-networks/softmax-regression_origin.md
deleted file mode 100644
index e6cb866..0000000
--- a/chapter_linear-networks/softmax-regression_origin.md
+++ /dev/null
@@ -1,443 +0,0 @@
-# Softmax Regression
-:label:`sec_softmax`
-
-In :numref:`sec_linear_regression`, we introduced linear regression,
-working through implementations from scratch in :numref:`sec_linear_scratch`
-and again using high-level APIs of a deep learning framework
-in :numref:`sec_linear_concise` to do the heavy lifting.
-
-Regression is the hammer we reach for when
-we want to answer *how much?* or *how many?* questions.
-If you want to predict the number of dollars (price)
-at which a house will be sold,
-or the number of wins a baseball team might have,
-or the number of days that a patient will remain hospitalized before being discharged,
-then you are probably looking for a regression model.
-
-In practice, we are more often interested in *classification*:
-asking not "how much" but "which one":
-
-* Does this email belong in the spam folder or the inbox?
-* Is this customer more likely *to sign up* or *not to sign up* for a subscription service?
-* Does this image depict a donkey, a dog, a cat, or a rooster?
-* Which movie is Aston most likely to watch next?
-
-Colloquially, machine learning practitioners
-overload the word *classification*
-to describe two subtly different problems:
-(i) those where we are interested only in
-hard assignments of examples to categories (classes);
-and (ii) those where we wish to make soft assignments,
-i.e., to assess the probability that each category applies.
-The distinction tends to get blurred, in part,
-because often, even when we only care about hard assignments,
-we still use models that make soft assignments.
-
-
-## Classification Problem
-:label:`subsec_classification-problem`
-
-To get our feet wet, let us start off with
-a simple image classification problem.
-Here, each input consists of a $2\times2$ grayscale image.
-We can represent each pixel value with a single scalar,
-giving us four features $x_1, x_2, x_3, x_4$.
-Further, let us assume that each image belongs to one
-of the categories "cat", "chicken", and "dog".
-
-Next, we have to choose how to represent the labels.
-We have two obvious choices.
-Perhaps the most natural impulse would be to choose $y \in \{1, 2, 3\}$,
-where the integers represent $\{\text{dog}, \text{cat}, \text{chicken}\}$ respectively.
-This is a great way of *storing* such information on a computer.
-If the categories had some natural ordering among them,
-say if we were trying to predict $\{\text{baby}, \text{toddler}, \text{adolescent}, \text{young adult}, \text{adult}, \text{geriatric}\}$,
-then it might even make sense to cast this problem as regression
-and keep the labels in this format.
-
-But general classification problems do not come with natural orderings among the classes.
-Fortunately, statisticians long ago invented a simple way
-to represent categorical data: the *one-hot encoding*.
-A one-hot encoding is a vector with as many components as we have categories.
-The component corresponding to a particular instance's category is set to 1
-and all other components are set to 0.
-In our case, a label $y$ would be a three-dimensional vector,
-with $(1, 0, 0)$ corresponding to "cat", $(0, 1, 0)$ to "chicken",
-and $(0, 0, 1)$ to "dog":
-
-$$y \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}.$$
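A one-hot encoder for this three-class example takes only a few lines. This is a minimal sketch in plain Python; the `one_hot` helper and the `CATEGORIES` list are illustrative names, with the category order taken from the text.

```python
CATEGORIES = ["cat", "chicken", "dog"]  # order follows the text

def one_hot(label, categories=CATEGORIES):
    """Return a list with 1 at the label's position and 0 elsewhere."""
    return [1 if c == label else 0 for c in categories]

print(one_hot("chicken"))  # [0, 1, 0]
```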
-
-## Network Architecture
-
-In order to estimate the conditional probabilities associated with all the possible classes,
-we need a model with multiple outputs, one per class.
-To address classification with linear models,
-we will need as many affine functions as we have outputs.
-Each output will correspond to its own affine function.
-In our case, since we have 4 features and 3 possible output categories,
-we will need 12 scalars to represent the weights ($w$ with subscripts),
-and 3 scalars to represent the biases ($b$ with subscripts).
-We compute these three *logits*, $o_1, o_2$, and $o_3$, for each input:
-
-$$
-\begin{aligned}
-o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1,\\
-o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2,\\
-o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3.
-\end{aligned}
-$$
-
-We can depict this calculation with the neural network diagram shown in :numref:`fig_softmaxreg`.
-Just as in linear regression, softmax regression is also a single-layer neural network.
-And since the calculation of each output, $o_1, o_2$, and $o_3$,
-depends on all inputs, $x_1$, $x_2$, $x_3$, and $x_4$,
-the output layer of softmax regression can also be described as a fully-connected layer.
-
-![Softmax regression is a single-layer neural network.](../img/softmaxreg.svg)
-:label:`fig_softmaxreg`
-
-To express the model more compactly, we can use linear algebra notation.
-In vector form, we arrive at
-$\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$,
-a form better suited both for mathematics, and for writing code.
-Note that we have gathered all of our weights into a $3 \times 4$ matrix
-and that for features of a given data example $\mathbf{x}$,
-our outputs are given by a matrix-vector product of our weights by our input features
-plus our biases $\mathbf{b}$.
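The vector form $\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$ can be spelled out with nested lists. This is a minimal sketch; the weights, biases, and pixel values are made-up numbers, and `affine` is a hypothetical helper name.

```python
def affine(W, x, b):
    """Compute o = W x + b for a matrix W stored as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

# Hypothetical values: 3 classes, 4 pixel features
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 1.0, 0.0],
     [0.5, 0.5, 0.5, 0.5]]
b = [0.0, 1.0, -1.0]
x = [1.0, 2.0, 3.0, 4.0]
o = affine(W, x, b)
print(o)  # one logit per class: [1.0, 6.0, 4.0]
```

Each row of `W` together with one entry of `b` is one of the three affine functions, which accounts for the 12 weights and 3 biases counted above.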
-
-
-## Parameterization Cost of Fully-Connected Layers
-:label:`subsec_parameterization-cost-fc-layers`
-
-As we will see in subsequent chapters,
-fully-connected layers are ubiquitous in deep learning.
-However, as the name suggests,
-fully-connected layers are *fully* connected
-with potentially many learnable parameters.
-Specifically,
-for any fully-connected layer
-with $d$ inputs and $q$ outputs,
-the parameterization cost is $\mathcal{O}(dq)$,
-which can be prohibitively high in practice.
-Fortunately,
-this cost 
-of transforming $d$ inputs into $q$ outputs
-can be reduced to $\mathcal{O}(\frac{dq}{n})$,
-where the hyperparameter $n$ can be flexibly specified
-by us to balance between parameter saving and model effectiveness in real-world applications :cite:`Zhang.Tay.Zhang.ea.2021`.
-
-
-
-
-
-## Softmax Operation
-:label:`subsec_softmax_operation`
-
-The main approach that we are going to take here
-is to interpret the outputs of our model as probabilities.
-We will optimize our parameters to produce probabilities
-that maximize the likelihood of the observed data.
-Then, to generate predictions, we will set a threshold,
-for example, choosing the label with the maximum predicted probability.
-
-Put formally, we would like any output $\hat{y}_j$
-to be interpreted as the probability
-that a given item belongs to class $j$.
-Then we can choose the class with the largest output value
-as our prediction $\operatorname*{argmax}_j \hat{y}_j$.
-For example, if $\hat{y}_1$, $\hat{y}_2$, and $\hat{y}_3$
-are 0.1, 0.8, and 0.1, respectively,
-then we predict category 2, which (in our example) represents "chicken".
-
-You might be tempted to suggest that we interpret
-the logits $o$ directly as our outputs of interest.
-However, there are some problems with directly
-interpreting the output of the linear layer as a probability.
-On one hand,
-nothing constrains these numbers to sum to 1.
-On the other hand, depending on the inputs, they can take negative values.
-These violate basic axioms of probability presented in :numref:`sec_prob`.
-
-To interpret our outputs as probabilities,
-we must guarantee that (even on new data),
-they will be nonnegative and sum up to 1.
-Moreover, we need a training objective that encourages
-the model to estimate probabilities faithfully.
-Of all instances when a classifier outputs 0.5,
-we hope that half of those examples
-will actually belong to the predicted class.
-This is a property called *calibration*.
-
-The *softmax function*, invented in 1959 by the social scientist
-R. Duncan Luce in the context of *choice models*,
-does precisely this.
-To transform our logits such that they become nonnegative and sum to 1,
-while requiring that the model remains differentiable,
-we first exponentiate each logit (ensuring non-negativity)
-and then divide by their sum (ensuring that they sum to 1):
-
-$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o})\quad \text{where}\quad \hat{y}_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}. $$
-:eqlabel:`eq_softmax_y_and_o`
-
-It is easy to see $\hat{y}_1 + \hat{y}_2 + \hat{y}_3 = 1$
-with $0 \leq \hat{y}_j \leq 1$ for all $j$.
-Thus, $\hat{\mathbf{y}}$ is a proper probability distribution
-whose element values can be interpreted accordingly.
-Note that the softmax operation does not change the ordering among the logits $\mathbf{o}$,
-which are simply the pre-softmax values
-that determine the probabilities assigned to each class.
-Therefore, during prediction we can still pick out the most likely class by
-
-$$
-\operatorname*{argmax}_j \hat y_j = \operatorname*{argmax}_j o_j.
-$$
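Both properties, a valid probability vector and an unchanged argmax, can be verified on a small example. This is a minimal sketch in plain Python with arbitrary logits; it implements the definition directly, so it is not protected against overflow for very large logits.

```python
import math

def softmax(o):
    """Exponentiate each logit and normalize by the sum."""
    exps = [math.exp(v) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(v):
    """Index of the largest entry."""
    return max(range(len(v)), key=v.__getitem__)

o = [1.0, 3.0, 2.0]  # hypothetical logits
y_hat = softmax(o)
print(y_hat, sum(y_hat))           # nonnegative entries that sum to 1
print(argmax(y_hat) == argmax(o))  # True: softmax preserves the ordering
```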
-
-Although softmax is a nonlinear function,
-the outputs of softmax regression are still *determined* by
-an affine transformation of input features;
-thus, softmax regression is a linear model.
-
-
-
-## Vectorization for Minibatches
-:label:`subsec_softmax_vectorization`
-
-To improve computational efficiency and take advantage of GPUs,
-we typically carry out vector calculations for minibatches of data.
-Assume that we are given a minibatch $\mathbf{X}$ of examples
-with feature dimensionality (number of inputs) $d$ and batch size $n$.
-Moreover, assume that we have $q$ categories in the output.
-Then the minibatch features $\mathbf{X}$ are in $\mathbb{R}^{n \times d}$,
-weights $\mathbf{W} \in \mathbb{R}^{d \times q}$,
-and the bias satisfies $\mathbf{b} \in \mathbb{R}^{1\times q}$.
-
-$$ \begin{aligned} \mathbf{O} &= \mathbf{X} \mathbf{W} + \mathbf{b}, \\ \hat{\mathbf{Y}} & = \mathrm{softmax}(\mathbf{O}). \end{aligned} $$
-:eqlabel:`eq_minibatch_softmax_reg`
-
-This accelerates the dominant operation into
-a matrix-matrix product $\mathbf{X} \mathbf{W}$
-vs. the matrix-vector products we would be executing
-if we processed one example at a time.
-Since each row in $\mathbf{X}$ represents a data example,
-the softmax operation itself can be computed *rowwise*:
-for each row of $\mathbf{O}$, exponentiate all entries and then normalize them by the sum.
-Triggering broadcasting during the summation $\mathbf{X} \mathbf{W} + \mathbf{b}$ in :eqref:`eq_minibatch_softmax_reg`,
-both the minibatch logits $\mathbf{O}$ and output probabilities $\hat{\mathbf{Y}}$
-are $n \times q$ matrices.
-
-## Loss Function
-
-Next, we need a loss function to measure
-the quality of our predicted probabilities.
-We will rely on maximum likelihood estimation,
-the very same concept that we encountered
-when providing a probabilistic justification
-for the mean squared error objective in linear regression
-(:numref:`subsec_normal_distribution_and_squared_loss`).
-
-
-### Log-Likelihood
-
-The softmax function gives us a vector $\hat{\mathbf{y}}$,
-which we can interpret as estimated conditional probabilities
-of each class given any input $\mathbf{x}$, e.g.,
-$\hat{y}_1$ = $P(y=\text{cat} \mid \mathbf{x})$.
-Suppose that the entire dataset $\{\mathbf{X}, \mathbf{Y}\}$ has $n$ examples,
-where the example indexed by $i$
-consists of a feature vector $\mathbf{x}^{(i)}$ and a one-hot label vector $\mathbf{y}^{(i)}$.
-We can compare the estimates with reality
-by checking how probable the actual classes are
-according to our model, given the features:
-
-$$
-P(\mathbf{Y} \mid \mathbf{X}) = \prod_{i=1}^n P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}).
-$$
-
-According to maximum likelihood estimation,
-we maximize $P(\mathbf{Y} \mid \mathbf{X})$,
-which is
-equivalent to minimizing the negative log-likelihood:
-
-$$
--\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^n -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})
-= \sum_{i=1}^n l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}),
-$$
-
-where for any pair of label $\mathbf{y}$ and model prediction $\hat{\mathbf{y}}$ over $q$ classes,
-the loss function $l$ is
-
-$$ l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^q y_j \log \hat{y}_j. $$
-:eqlabel:`eq_l_cross_entropy`
-
-For reasons explained later on, the loss function in :eqref:`eq_l_cross_entropy`
-is commonly called the *cross-entropy loss*.
-Since $\mathbf{y}$ is a one-hot vector of length $q$,
-the sum over all its coordinates $j$ vanishes for all but one term.
-Since all $\hat{y}_j$ are predicted probabilities,
-their logarithm is never larger than $0$.
-Consequently, the loss function cannot be minimized any further
-if we correctly predict the actual label with *certainty*,
-i.e., if the predicted probability $P(\mathbf{y} \mid \mathbf{x}) = 1$ for the actual label $\mathbf{y}$.
-Note that this is often impossible.
-For example, there might be label noise in the dataset
-(some examples may be mislabeled).
-It may also not be possible when the input features
-are not sufficiently informative
-to classify every example perfectly.
-
-### Softmax and Derivatives
-:label:`subsec_softmax_and_derivatives`
-
-Since the softmax and the corresponding loss are so common,
-it is worth understanding a bit better how it is computed.
-Plugging :eqref:`eq_softmax_y_and_o` into the definition of the loss
-in :eqref:`eq_l_cross_entropy`
-and using the definition of the softmax we obtain:
-
-$$
-\begin{aligned}
-l(\mathbf{y}, \hat{\mathbf{y}}) &=  - \sum_{j=1}^q y_j \log \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} \\
-&= \sum_{j=1}^q y_j \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j\\
-&= \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j.
-\end{aligned}
-$$
-
-To understand a bit better what is going on,
-consider the derivative with respect to any logit $o_j$. We get
-
-$$
-\partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j.
-$$
-
-In other words, the derivative is the difference
-between the probability assigned by our model,
-as expressed by the softmax operation,
-and what actually happened, as expressed by elements in the one-hot label vector.
-In this sense, it is very similar to what we saw in regression,
-where the gradient was the difference
-between the observation $y$ and estimate $\hat{y}$.
-This is not coincidence.
-In any exponential family (see the
-[online appendix on distributions](https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/distributions.html)) model,
-the gradients of the log-likelihood are given by precisely this term.
-This fact makes computing gradients easy in practice.
-
-### Cross-Entropy Loss
-
-Now consider the case where we observe not just a single outcome
-but an entire distribution over outcomes.
-We can use the same representation as before for the label $\mathbf{y}$.
-The only difference is that rather than a vector containing only binary entries,
-say $(0, 0, 1)$, we now have a generic probability vector, say $(0.1, 0.2, 0.7)$.
-The math that we used previously to define the loss $l$
-in :eqref:`eq_l_cross_entropy`
-still works out fine,
-just that the interpretation is slightly more general.
-It is the expected value of the loss for a distribution over labels.
-This loss is called the *cross-entropy loss* and it is
-one of the most commonly used losses for classification problems.
-We can demystify the name by introducing just the basics of information theory.
-If you wish to understand more details of information theory,
-you may further refer to the [online appendix on information theory](https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/information-theory.html).
-
-
-
-## Information Theory Basics
-:label:`subsec_info_theory_basics`
-
-*Information theory* deals with the problem of encoding, decoding, transmitting,
-and manipulating information (also known as data) in as concise form as possible.
-
-
-### Entropy
-
-The central idea in information theory is to quantify the information content in data.
-This quantity places a hard limit on our ability to compress the data.
-In information theory, this quantity is called the *entropy* of a distribution $P$,
-and it is captured by the following equation:
-
-$$H[P] = \sum_j - P(j) \log P(j).$$
-:eqlabel:`eq_softmax_reg_entropy`
-
-One of the fundamental theorems of information theory states
-that in order to encode data drawn randomly from the distribution $P$,
-we need at least $H[P]$ "nats" to encode it.
-If you wonder what a "nat" is, it is the equivalent of bit
-but when using a code with base $e$ rather than one with base 2.
-Thus, one nat is $\frac{1}{\log(2)} \approx 1.44$ bit.
-
-
-### Surprisal
-
-You might be wondering what compression has to do with prediction.
-Imagine that we have a stream of data that we want to compress.
-If it is always easy for us to predict the next token,
-then this data is easy to compress!
-Take the extreme example where every token in the stream always takes the same value.
-That is a very boring data stream!
-And not only it is boring, but it is also easy to predict.
-Because they are always the same, we do not have to transmit any information
-to communicate the contents of the stream.
-Easy to predict, easy to compress.
-
-However if we cannot perfectly predict every event,
-then we might sometimes be surprised.
-Our surprise is greater when we assigned an event lower probability.
-Claude Shannon settled on $\log \frac{1}{P(j)} = -\log P(j)$
-to quantify one's *surprisal* at observing an event $j$
-having assigned it a (subjective) probability $P(j)$.
-The entropy defined in :eqref:`eq_softmax_reg_entropy` is then the *expected surprisal*
-when one assigned the correct probabilities
-that truly match the data-generating process.
-
-
-### Cross-Entropy Revisited
-
-So if entropy is level of surprise experienced
-by someone who knows the true probability,
-then you might be wondering, what is cross-entropy?
-The cross-entropy *from* $P$ *to* $Q$, denoted $H(P, Q)$,
-is the expected surprisal of an observer with subjective probabilities $Q$
-upon seeing data that were actually generated according to probabilities $P$.
-The lowest possible cross-entropy is achieved when $P=Q$.
-In this case, the cross-entropy from $P$ to $Q$ is $H(P, P)= H(P)$.
-
-In short, we can think of the cross-entropy classification objective
-in two ways: (i) as maximizing the likelihood of the observed data;
-and (ii) as minimizing our surprisal (and thus the number of bits)
-required to communicate the labels.
-
-
-## Model Prediction and Evaluation
-
-After training the softmax regression model, given any example features,
-we can predict the probability of each output class.
-Normally, we use the class with the highest predicted probability as the output class.
-The prediction is correct if it is consistent with the actual class (label).
-In the next part of the experiment,
-we will use *accuracy* to evaluate the model's performance.
-This is equal to the ratio between the number of correct predictions and the total number of predictions.
-
-
-## Summary
-
-* The softmax operation takes a vector and maps it into probabilities.
-* Softmax regression applies to classification problems. It uses the probability distribution of the output class in the softmax operation.
-* Cross-entropy is a good measure of the difference between two probability distributions. It measures the number of bits needed to encode the data given our model.
-
-## Exercises
-
-1. We can explore the connection between exponential families and the softmax in some more depth.
-    1. Compute the second derivative of the cross-entropy loss $l(\mathbf{y},\hat{\mathbf{y}})$ for the softmax.
-    1. Compute the variance of the distribution given by $\mathrm{softmax}(\mathbf{o})$ and show that it matches the second derivative computed above.
-1. Assume that we have three classes which occur with equal probability, i.e., the probability vector is $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$.
-    1. What is the problem if we try to design a binary code for it?
-    1. Can you design a better code? Hint: what happens if we try to encode two independent observations? What if we encode $n$ observations jointly?
-1. Softmax is a misnomer for the mapping introduced above (but everyone in deep learning uses it). The real softmax is defined as $\mathrm{RealSoftMax}(a, b) = \log (\exp(a) + \exp(b))$.
-    1. Prove that $\mathrm{RealSoftMax}(a, b) > \mathrm{max}(a, b)$.
-    1. Prove that this holds for $\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b)$, provided that $\lambda > 0$.
-    1. Show that for $\lambda \to \infty$ we have $\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b) \to \mathrm{max}(a, b)$.
-    1. What does the soft-min look like?
-    1. Extend this to more than two numbers.
-
-[Discussions](https://discuss.d2l.ai/t/46)
diff --git a/chapter_linear-regression/generalization.md b/chapter_linear-regression/generalization.md
new file mode 100644
index 0000000..9f25b73
--- /dev/null
+++ b/chapter_linear-regression/generalization.md
@@ -0,0 +1,112 @@
+# Generalization
+:label:`sec_generalization_basics`
+
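+Consider two college students diligently preparing for their final exam. Commonly, this preparation consists of practicing and testing their abilities by taking exams administered in previous years. Nonetheless, doing well on past exams is no guarantee that they will excel when it matters. For instance, imagine a student, Elephantine Ellie, whose preparation consisted entirely of memorizing the answers to previous years' exam questions. Even if Ellie were endowed with an elephantine memory, and thus could perfectly recall the answer to any *previously seen* question, she might nevertheless freeze when faced with a new (*previously unseen*) question. By comparison, imagine another student, Inductive Irene, with comparatively poor memorization skills but a knack for picking up patterns. If the exam truly consisted of recycled questions from a previous year, Ellie would handily outperform Irene. Even if Irene's inferred patterns yielded 90% accurate predictions, they could never compete with Ellie's 100% recall. However, even if the exam consisted entirely of fresh questions, Irene might maintain her 90% average.
+
+As machine learning scientists, our goal is to discover *patterns*. But how can we be sure that we have truly discovered a *general* pattern and not simply memorized our data? Most of the time, our predictions are only useful if our model discovers such a pattern. We do not want to predict yesterday's stock prices, but tomorrow's. We do not need to recognize already diagnosed diseases for previously seen patients, but rather previously undiagnosed ailments in previously unseen patients. This problem, how to discover patterns that *generalize*, is the fundamental problem of machine learning, and arguably of all of statistics. We might cast this problem as just one slice of a far grander question that engulfs all of science: when are we ever justified in making the leap from particular observations to more general statements :cite:`popper2005logic`?
+
+In real life, we must fit our models using a finite collection of data. The typical scales of that data vary wildly across domains. For many important medical problems, we can only access a few thousand data points. When studying rare diseases, we might be lucky to access hundreds. By contrast, the largest public datasets consisting of labeled photographs (e.g., ImageNet :cite:`Deng.Dong.Socher.ea.2009`) contain millions of images. And some unlabeled image collections, such as the Flickr YFCC100M dataset, can be even larger, containing over 100 million images :cite:`thomee2016yfcc100m`. However, even at this extreme scale, the number of available data points remains infinitesimally small compared to the space of all possible images at 1 megapixel resolution. Whenever we work with finite samples, we must keep in mind the risk that we might fit our training data, only to discover that we failed to find a generalizable pattern.
+
+The phenomenon of fitting closer to our training data than to the underlying distribution is called *overfitting*, and techniques for combatting overfitting are often called *regularization* methods. While there is no substitute for a proper introduction to statistical learning theory (see :citet:`Vapnik98,boucheron2005theory`), we will give you just enough intuition to get going. We will revisit generalization in many chapters throughout the book, exploring both what is known about the principles underlying generalization in various models, and also heuristic techniques that have been found (empirically) to yield improved generalization on tasks of practical interest.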
+
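+## Training Error and Generalization Error
+
+In the standard supervised learning setting, we assume that the training data and the test data are drawn *independently* from *identical* distributions. This is commonly called the *IID assumption*. While this assumption is strong, it is worth noting that absent any such assumption we would be dead in the water. Why should we believe that training data sampled from distribution $P(X,Y)$ should tell us how to make predictions on test data generated by a *different distribution* $Q(X,Y)$? Making such leaps turns out to require strong assumptions about how $P$ and $Q$ are related. Later on we will discuss some assumptions that allow for shifts in distribution, but first we need to understand the IID case, where $P(\cdot) = Q(\cdot)$.
+
+To begin with, we need to differentiate between the *training error* $R_\text{emp}$, which is a *statistic* calculated on the training dataset, and the *generalization error* $R$, which is an *expectation* taken with respect to the underlying distribution. You can think of the generalization error as what you would see if you applied your model to an infinite stream of additional data examples drawn from the same underlying data distribution. Formally, the training error is expressed as a *sum* (with the same notation as in :numref:`sec_linear_regression`):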
+
+$$R_\text{emp}[\mathbf{X}, \mathbf{y}, f] = \frac{1}{n} \sum_{i=1}^n l(\mathbf{x}^{(i)}, y^{(i)}, f(\mathbf{x}^{(i)})),$$
+
+while the generalization error is expressed as an integral:
+
+$$R[p, f] = E_{(\mathbf{x}, y) \sim P} [l(\mathbf{x}, y, f(\mathbf{x}))] =
+\int \int l(\mathbf{x}, y, f(\mathbf{x})) p(\mathbf{x}, y) \;d\mathbf{x} dy.$$
+
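+Problematically, we can never calculate the generalization error $R$ exactly. Nobody ever tells us the precise form of the density function $p(\mathbf{x}, y)$. Moreover, we cannot sample an infinite stream of data points. Thus, in practice, we must *estimate* the generalization error by applying our model to an independent test set constituted of a random selection of examples $\mathbf{X}'$ and labels $\mathbf{y}'$ that were withheld from our training set. This consists of applying the same formula as for calculating the empirical training error, but to a test set $\mathbf{X}', \mathbf{y}'$.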
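+
+Written out, and assuming a test set of $m$ held-out examples (the symbol $m$ is introduced here for illustration and is not part of the original notation), this estimate takes the form
+
+$$R_\text{emp}[\mathbf{X}', \mathbf{y}', f] = \frac{1}{m} \sum_{i=1}^m l(\mathbf{x}'^{(i)}, y'^{(i)}, f(\mathbf{x}'^{(i)})),$$
+
+which is the training-error formula with the $n$ training pairs replaced by the $m$ test pairs. It remains a finite-sample approximation of $R$, not $R$ itself.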
+
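+Crucially, when we evaluate our classifier on the test set, we are working with a *fixed* classifier (it does not depend on the sample of the test set), and thus estimating its error is simply the problem of mean estimation. However, the same cannot be said for the training set. Note that the model we wind up with depends explicitly on the selection of the training set, and thus the training error will in general be a biased estimate of the true error on the underlying population. The central question of generalization is then: when should we expect our training error to be close to the population error (i.e., the generalization error)?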
+
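+### Model Complexity
+
+In classical theory, when we have simple models and abundant data, the training and generalization errors tend to be close. However, when we work with more complex models and/or fewer examples, we expect the training error to go down but the generalization gap to grow. This should not be surprising. Imagine a model class so expressive that for any dataset of $n$ examples, we can find a set of parameters that can perfectly fit arbitrary labels, even if randomly assigned. In this case, even if we fit our training data perfectly, how can we conclude anything about the generalization error? For all we know, our generalization error might be no better than random guessing.
+
+In general, absent any restriction on our model class, we cannot conclude based on fitting the training data alone that our model has discovered any generalizable pattern :cite:`vapnik1994measuring`. On the other hand, if our model class was not capable of fitting arbitrary labels, then it must have discovered a pattern. Learning-theoretic ideas about model complexity derived some inspiration from the ideas of Karl Popper, an influential philosopher of science, who formalized the criterion of falsifiability. According to Popper, a theory that can explain any and all observations is not a scientific theory at all! After all, what has it told us about the world if it has not ruled out any possibility? In short, what we want is a hypothesis that *could not* explain every observation we might conceivably make, and yet happens to be compatible with those observations that we *in fact* make.
+
+Now, what precisely constitutes an appropriate notion of model complexity is a complex matter. Often, models with more parameters are able to fit a greater number of arbitrarily assigned labels. However, this is not necessarily true. For instance, kernel methods operate in spaces with infinite numbers of parameters, yet their complexity is controlled by other means :cite:`scholkopf2002learning`. One notion of complexity that often proves useful is the range of values that the parameters can take. Here, a model whose parameters are permitted to take arbitrary values would be more complex. We will revisit this idea in the next section, when we introduce *weight decay*, your first practical regularization technique. Notably, it can be difficult to compare complexity among members of substantially different model classes (say, decision trees vs. neural networks).
+
+At this point, we must stress another important point that we will revisit when introducing deep neural networks. When a model is capable of fitting arbitrary labels, low training error does not necessarily imply low generalization error. *However, it does not necessarily imply high generalization error either!* All we can say confidently is that low training error alone is not enough to certify low generalization error. Deep neural networks turn out to be just such models: while they generalize well in practice, they are too powerful to allow us to conclude much on the basis of training error alone. In these cases we must rely more heavily on our holdout data to certify generalization after the fact. Error on the holdout data, i.e., the validation set, is called the *validation error*.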
+
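+## Underfitting or Overfitting?
+
+When we compare the training and validation errors, we want to be mindful of two common situations. First, we want to watch out for cases when our training error and validation error are both substantial but there is only a small gap between them. If the model is unable to reduce the training error, that could mean that our model is too simple (i.e., insufficiently expressive) to capture the pattern that we are trying to model. Moreover, since the *generalization gap* ($R - R_\text{emp}$) between our training and generalization errors is small, we have reason to believe that we could get away with a more complex model. This phenomenon is known as *underfitting*.
+
+On the other hand, as we discussed above, we want to watch out for the cases when our training error is significantly lower than our validation error, indicating severe *overfitting*. Note that overfitting is not always a bad thing. In deep learning especially, the best predictive models often perform far better on training data than on holdout data. Ultimately, we usually care about driving the generalization error lower, and only care about the gap insofar as it becomes an obstacle to that end. Note that if the training error is zero, then the generalization gap is precisely equal to the generalization error and we can make progress only by reducing the gap.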
+
+### Polynomial Curve Fitting
+:label:`subsec_polynomial-curve-fitting`
+
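+To illustrate some classical intuition about overfitting and model complexity, consider the following: given training data consisting of a single feature $x$ and a corresponding real-valued label $y$, we try to find the polynomial of degree $d$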
+
+$$\hat{y}= \sum_{i=0}^d x^i w_i$$
+
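+to estimate the label $y$. This is just a linear regression problem where our features are given by the powers of $x$, the model's weights are given by $w_i$, and the bias is given by $w_0$ since $x^0 = 1$ for all $x$. Since this is just a linear regression problem, we can use the squared error as our loss function.
+
+A higher-order polynomial function is more complex than a lower-order polynomial function, since the higher-order polynomial has more parameters and the model function's selection range is wider. Fixing the training dataset, higher-order polynomial functions should always achieve lower (at worst, equal) training error relative to lower-degree polynomials. In fact, whenever each data example has a distinct value of $x$, a polynomial function with degree equal to the number of data examples can fit the training set perfectly. We visualize the relationship between polynomial degree (model complexity) and underfitting vs. overfitting in :numref:`fig_capacity_vs_error`.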
+
+![Influence of model complexity on underfitting and overfitting](../img/capacity-vs-error.svg)
+:label:`fig_capacity_vs_error`
+
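+### Dataset Size
+
+As the above bound already indicates, another big consideration to bear in mind is dataset size. Fixing our model, the fewer samples we have in the training dataset, the more likely (and more severely) we are to encounter overfitting. As we increase the amount of training data, the generalization error typically decreases. Moreover, in general, more data never hurts. For a fixed task and data distribution, model complexity should not increase more rapidly than the amount of data. Given more data, we might attempt to fit a more complex model. Absent sufficient data, simpler models may be more difficult to beat. For many tasks, deep learning only outperforms linear models when many thousands of training examples are available. In part, the current success of deep learning owes considerably to the abundance of massive datasets arising from Internet companies, cheap storage, connected devices, and the broad digitization of the economy.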
+
+## Model Selection
+:label:`subsec_generalization-model-selection`
+
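+Typically, we select our final model only after evaluating multiple models that differ in various ways (different architectures, training objectives, selected features, data preprocessing, learning rates, etc.). Choosing among many models is aptly called *model selection*.
+
+In principle, we should not touch our test set until after we have chosen all our hyperparameters. Were we to use the test data in the model selection process, there is a risk that we might overfit the test data. Then we would be in serious trouble. If we overfit our training data, there is always the evaluation on test data to keep us honest. But if we overfit the test data, how would we ever know? See :citet:`ong2005learning` for an example of how this can lead to absurd results even for models where the complexity can be tightly controlled.
+
+Thus, we should never rely on the test data for model selection. And yet we cannot rely solely on the training data for model selection either, because we cannot estimate the generalization error on the very data that we use to train the model.
+
+In practical applications, the picture gets muddier. While ideally we would only touch the test data once, to assess the very best model or to compare a small number of models with each other, real-world test data is seldom discarded after just one use. We can seldom afford a new test set for each round of experiments. In fact, recycling benchmark data for decades can have a significant impact on the development of algorithms, e.g., for [image classification](https://paperswithcode.com/sota/image-classification-on-imagenet) and [optical character recognition](https://paperswithcode.com/sota/image-classification-on-mnist).
+
+The common practice for addressing the problem of *training on the test set* is to split our data three ways, incorporating a *validation set* in addition to the training and test datasets. The result is a murky business where the boundaries between validation and test data are worryingly ambiguous. Unless explicitly stated otherwise, in the experiments in this book we are really working with what should rightly be called training data and validation data, with no true test sets. Therefore, the accuracy reported in each experiment of the book is really the validation accuracy and not a true test set accuracy.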
+
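+### Cross-Validation
+
+When training data is scarce, we might not even be able to afford to hold out enough data to constitute a proper validation set. One popular solution to this problem is to employ $K$*-fold cross-validation*. Here, the original training data is split into $K$ non-overlapping subsets. Then model training and validation are executed $K$ times, each time training on $K-1$ subsets and validating on a different subset (the one not used for training in that round). Finally, the training and validation errors are estimated by averaging over the results from the $K$ experiments.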
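+
+The averaging step can be sketched as follows. The symbols $\mathcal{D}_k$, $f_{-k}$, and $\widehat{R}_\text{CV}$ are notation introduced here for illustration and do not appear elsewhere in this section: $\mathcal{D}_k$ denotes the $k$-th subset and $f_{-k}$ the model trained on the other $K-1$ subsets. Under these assumptions, the validation estimate is
+
+$$\widehat{R}_\text{CV} = \frac{1}{K} \sum_{k=1}^K \frac{1}{|\mathcal{D}_k|} \sum_{(\mathbf{x}, y) \in \mathcal{D}_k} l(\mathbf{x}, y, f_{-k}(\mathbf{x})).$$
+
+For instance, assuming $n = 100$ examples and $K = 5$ equally sized subsets, each round trains on $80$ examples and validates on the remaining $20$, and every example is used for validation exactly once.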
+
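+## Summary
+
+This section explored some of the underpinnings of generalization in machine learning. Some of these ideas become complicated and counterintuitive when we get to deeper models; there, models are capable of overfitting data badly, and the relevant notions of complexity can be both implicit and counterintuitive (e.g., larger architectures with more parameters generalizing better). We leave you with a few rules of thumb:
+
+1. Use validation sets (or $K$*-fold cross-validation*) for model selection.
+1. More complex models often require more data.
+1. Relevant notions of complexity include both the number of parameters and the range of values that they are allowed to take.
+1. Keeping all else equal, more data almost always leads to better generalization.
+1. This entire discussion of generalization is predicated on the IID assumption. If we relax this assumption, allowing distributions to shift between the training and testing periods, then we cannot say anything about generalization absent a further (perhaps milder) assumption.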
+
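+## Exercises
+
+1. When can you solve the problem of polynomial regression exactly?
+1. Give at least five examples where dependent random variables make treating the problem as IID data inadvisable.
+1. Can you ever expect to see zero training error? Under which circumstances would you see zero generalization error?
+1. Why is $K$-fold cross-validation very expensive to compute?
+1. Why is the $K$-fold cross-validation error estimate biased?
+1. The VC dimension is defined as the maximum number of points that can be classified with arbitrary labels $\{\pm 1\}$ by a function of a class of functions. Why might this not be a good idea for measuring how complex the class of functions is? Hint: how about the magnitude of the functions?
+1. Your manager gives you a difficult dataset on which your current algorithm does not perform so well. How would you justify to him that you need more data? Hint: you cannot increase the data but you can decrease it.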
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/96)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/97)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/234)
+:end_tab:
diff --git a/chapter_linear-regression/generalization_origin.md b/chapter_linear-regression/generalization_origin.md
new file mode 100644
index 0000000..58b82ca
--- /dev/null
+++ b/chapter_linear-regression/generalization_origin.md
@@ -0,0 +1,475 @@
+# Generalization
+:label:`sec_generalization_basics`
+
+Consider two college students diligently
+preparing for their final exam.
+Commonly, this preparation will consist
+of practicing and testing their abilities
+by taking exams administered in previous years.
+Nonetheless, doing well on past exams is no guarantee
+that they will excel when it matters.
+For instance, imagine a student, Elephantine Ellie,
+whose preparation consisted entirely
+of memorizing the answers
+to previous years' exam questions.
+Even if Ellie were endowed
+with an elephantine memory,
+and thus could perfectly recall the answer
+to any *previously seen* question,
+she might nevertheless freeze
+when faced with a new (*previously unseen*) question.
+By comparison, imagine another student,
+Inductive Irene, with comparatively poor
+memorization skills,
+but a knack for picking up patterns.
+Note that if the exam truly consisted of
+recycled questions from a previous year,
+Ellie would handily outperform Irene.
+Even if Irene's inferred patterns
+yielded 90% accurate predictions,
+they could never compete with
+Ellie's 100% recall.
+However, even if the exam consisted
+entirely of fresh questions,
+Irene might maintain her 90% average.
+
+As machine learning scientists,
+our goal is to discover *patterns*.
+But how can we be sure that we have
+truly discovered a *general* pattern
+and not simply memorized our data?
+Most of the time, our predictions are only useful
+if our model discovers such a pattern.
+We don't want to predict yesterday's stock prices, but tomorrow's.
+We don't need to recognize
+already diagnosed diseases
+for previously seen patients,
+but rather previously undiagnosed
+ailments in previously unseen patients.
+This problem---how to discover patterns that *generalize*---is
+the fundamental problem of machine learning,
+and arguably of all of statistics.
+We might cast this problem as just one slice
+of a far grander question
+that engulfs all of science:
+when are we ever justified
+in making the leap from particular observations
+to more general statements :cite:`popper2005logic`?
+
+
+In real life, we must fit our models
+using a finite collection of data.
+The typical scales of that data
+vary wildly across domains.
+For many important medical problems,
+we can only access a few thousand data points.
+When studying rare diseases,
+we might be lucky to access hundreds.
+By contrast, the largest public datasets
+consisting of labeled photographs
+(e.g., ImageNet :cite:`Deng.Dong.Socher.ea.2009`),
+contain millions of images.
+And some unlabeled image collections
+such as the Flickr YFCC100M dataset
+can be even larger, containing
+over 100 million images :cite:`thomee2016yfcc100m`.
+However, even at this extreme scale,
+the number of available data points
+remains infinitesimally small
+compared to the space of all possible images
+at 1 megapixel resolution.
+Whenever we work with finite samples,
+we must keep in mind the risk
+that we might fit our training data,
+only to discover that we failed
+to discover a generalizable pattern.
+
+The phenomenon of fitting closer to our training data
+than to the underlying distribution is called *overfitting*,
+and techniques for combatting overfitting
+are often called *regularization* methods.
+While there is no substitute for a proper introduction
+to statistical learning theory (see :citet:`Vapnik98,boucheron2005theory`),
+we will give you just enough intuition to get going.
+We will revisit generalization in many chapters
+throughout the book,
+exploring both what is known about
+the principles underlying generalization
+in various models,
+and also heuristic techniques
+that have been found (empirically)
+to yield improved generalization
+on tasks of practical interest.
+
+
+
+## Training Error and Generalization Error
+
+
+In the standard supervised learning setting,
+we assume that the training data and the test data
+are drawn *independently* from *identical* distributions.
+This is commonly called the *IID assumption*.
+While this assumption is strong,
+it's worth noting that absent any such assumption
+we would be dead in the water.
+Why should we believe that training data
+sampled from distribution $P(X,Y)$
+should tell us how to make predictions on
+test data generated by a *different distribution* $Q(X,Y)$?
+Making such leaps turns out to require
+strong assumptions about how $P$ and $Q$ are related.
+Later on we will discuss some assumptions
+that allow for shifts in distribution
+but first we need to understand the IID case,
+where $P(\cdot) = Q(\cdot)$.
+
+To begin with, we need to differentiate between
+the *training error* $R_\text{emp}$,
+which is a *statistic*
+calculated on the training dataset,
+and the *generalization error* $R$,
+which is an *expectation* taken
+with respect to the underlying distribution.
+You can think of the generalization error as
+what you would see if you applied your model
+to an infinite stream of additional data examples
+drawn from the same underlying data distribution.
+Formally, the training error is expressed as a *sum* (with the same notation as in :numref:`sec_linear_regression`):
+
+$$R_\text{emp}[\mathbf{X}, \mathbf{y}, f] = \frac{1}{n} \sum_{i=1}^n l(\mathbf{x}^{(i)}, y^{(i)}, f(\mathbf{x}^{(i)})),$$
+
+
+while the generalization error is expressed as an integral:
+
+$$R[p, f] = E_{(\mathbf{x}, y) \sim P} [l(\mathbf{x}, y, f(\mathbf{x}))] =
+\int \int l(\mathbf{x}, y, f(\mathbf{x})) p(\mathbf{x}, y) \;d\mathbf{x} dy.$$
+
+Problematically, we can never calculate
+the generalization error $R$ exactly.
+Nobody ever tells us the precise form
+of the density function $p(\mathbf{x}, y)$.
+Moreover, we cannot sample an infinite stream of data points.
+Thus, in practice, we must *estimate* the generalization error
+by applying our model to an independent test set
+constituted of a random selection of examples
+$\mathbf{X}'$ and labels $\mathbf{y}'$
+that were withheld from our training set.
+This consists of applying the same formula
+as for calculating the empirical training error
+but to a test set $\mathbf{X}', \mathbf{y}'$.
+
+
+Crucially, when we evaluate our classifier on the test set,
+we are working with a *fixed* classifier
+(it does not depend on the sample of the test set),
+and thus estimating its error
+is simply the problem of mean estimation.
+However, the same cannot be said
+for the training set.
+Note that the model we wind up with
+depends explicitly on the selection of the training set
+and thus the training error will in general
+be a biased estimate of the true error
+on the underlying population.
+The central question of generalization
+is then: when should we expect our training error
+to be close to the population error
+(i.e., the generalization error)?
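+
+To make the mean-estimation view of test error concrete, here is a small worked example under assumptions that are ours rather than the text's: the loss is the 0-1 classification loss, the $m$ test examples are drawn IID, and $\widehat{R}_\text{test}$ denotes the resulting test error (both symbols are introduced only for this illustration). Each test loss is then a Bernoulli variable with mean $R$, so $\widehat{R}_\text{test}$ is an unbiased estimate of $R$ with standard deviation
+
+$$\sqrt{\frac{R(1-R)}{m}}.$$
+
+For example, with $R = 0.1$ and $m = 10000$, this gives $\sqrt{0.09 / 10000} = 0.003$, so the measured test error would typically land within a fraction of a percentage point of the true error. No comparably simple statement holds for the training error, because the model was chosen using the training sample.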
+
+### Model Complexity
+
+In classical theory, when we have
+simple models and abundant data,
+the training and generalization errors tend to be close.
+However, when we work with
+more complex models and/or fewer examples,
+we expect the training error to go down
+but the generalization gap to grow.
+This should not be surprising.
+Imagine a model class so expressive that
+for any dataset of $n$ examples,
+we can find a set of parameters
+that can perfectly fit arbitrary labels,
+even if randomly assigned.
+In this case, even if we fit our training data perfectly,
+how can we conclude anything about the generalization error?
+For all we know, our generalization error
+might be no better than random guessing.
+
+In general, absent any restriction on our model class,
+we cannot conclude based on fitting the training data alone
+that our model has discovered any generalizable pattern :cite:`vapnik1994measuring`.
+On the other hand, if our model class
+was not capable of fitting arbitrary labels,
+then it must have discovered a pattern.
+Learning-theoretic ideas about model complexity
+derived some inspiration from the ideas
+of Karl Popper, an influential philosopher of science,
+who formalized the criterion of falsifiability.
+According to Popper, a theory
+that can explain any and all observations
+is not a scientific theory at all!
+After all, what has it told us about the world
+if it has not ruled out any possibility?
+In short, what we want is a hypothesis
+that *could not* explain every observation
+we might conceivably make
+and yet nevertheless happens to be compatible
+with those observations that we *in fact* make.
+
+Now what precisely constitutes an appropriate
+notion of model complexity is a complex matter.
+Often, models with more parameters
+are able to fit a greater number
+of arbitrarily assigned labels.
+However, this is not necessarily true.
+For instance, kernel methods operate in spaces
+with infinite numbers of parameters,
+yet their complexity is controlled
+by other means :cite:`scholkopf2002learning`.
+One notion of complexity that often proves useful
+is the range of values that the parameters can take.
+Here, a model whose parameters are permitted
+to take arbitrary values
+would be more complex.
+We will revisit this idea in the next section,
+when we introduce *weight decay*,
+your first practical regularization technique.
+Notably, it can be difficult to compare
+complexity among members of substantially different model classes
+(say, decision trees vs. neural networks).
+
+
+At this point, we must stress another important point
+that we will revisit when introducing deep neural networks.
+When a model is capable of fitting arbitrary labels,
+low training error does not necessarily
+imply low generalization error.
+*However, it does not necessarily
+imply high generalization error either!*
+All we can say confidently is that
+low training error alone is not enough
+to certify low generalization error.
+Deep neural networks turn out to be just such models:
+while they generalize well in practice,
+they are too powerful to allow us to conclude
+much on the basis of training error alone.
+In these cases we must rely more heavily
+on our holdout data to certify generalization
+after the fact.
+Error on the holdout data, i.e., validation set,
+is called the *validation error*.
+
+## Underfitting or Overfitting?
+
+When we compare the training and validation errors,
+we want to be mindful of two common situations.
+First, we want to watch out for cases
+when our training error and validation error are both substantial
+but there is only a small gap between them.
+If the model is unable to reduce the training error,
+that could mean that our model is too simple
+(i.e., insufficiently expressive)
+to capture the pattern that we are trying to model.
+Moreover, since the *generalization gap* ($R - R_\text{emp}$)
+between our training and generalization errors is small,
+we have reason to believe that we could get away with a more complex model.
+This phenomenon is known as *underfitting*.
+
+On the other hand, as we discussed above,
+we want to watch out for the cases
+when our training error is significantly lower
+than our validation error, indicating severe *overfitting*.
+Note that overfitting is not always a bad thing.
+In deep learning especially,
+the best predictive models often perform
+far better on training data than on holdout data.
+Ultimately, we usually care about
+driving the generalization error lower,
+and only care about the gap insofar
+as it becomes an obstacle to that end.
+Note that if the training error is zero,
+then the generalization gap is precisely equal to the generalization error
+and we can make progress only by reducing the gap.
+
+### Polynomial Curve Fitting
+:label:`subsec_polynomial-curve-fitting`
+
+To illustrate some classical intuition
+about overfitting and model complexity,
+consider the following:
+given training data consisting of a single feature $x$
+and a corresponding real-valued label $y$,
+we try to find the polynomial of degree $d$
+
+$$\hat{y}= \sum_{i=0}^d x^i w_i$$
+
+to estimate the label $y$.
+This is just a linear regression problem
+where our features are given by the powers of $x$,
+the model's weights are given by $w_i$,
+and the bias is given by $w_0$ since $x^0 = 1$ for all $x$.
+Since this is just a linear regression problem,
+we can use the squared error as our loss function.
+
+
+A higher-order polynomial function is more complex
+than a lower-order polynomial function,
+since the higher-order polynomial has more parameters
+and can therefore represent a wider range of functions.
+Fixing the training dataset,
+higher-order polynomial functions should always
+achieve lower (at worst, equal) training error
+relative to lower degree polynomials.
+In fact, whenever each data example
+has a distinct value of $x$,
+a polynomial function with degree
+equal to the number of data examples
+can fit the training set perfectly.
+We visualize the relationship between polynomial degree (model complexity)
+and underfitting vs. overfitting in :numref:`fig_capacity_vs_error`.
+
+![Influence of model complexity on underfitting and overfitting](../img/capacity-vs-error.svg)
+:label:`fig_capacity_vs_error`
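
The relationship between polynomial degree and training error can be checked numerically. The following is a minimal sketch, not the book's own experiment: the cubic ground truth, the noise level, the sample sizes, and the helper names `make_data` and `mse` are all assumptions chosen for illustration. It fits polynomials of increasing degree by least squares with NumPy and reports training and validation error for each.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Sample x uniformly from [-1, 1] and label it with a noisy cubic (assumed)."""
    x = rng.uniform(-1, 1, size=n)
    y = 1.0 + 2.0 * x - 3.0 * x**3 + rng.normal(0, 0.1, size=n)
    return x, y

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_data(20)    # small training set
x_val, y_val = make_data(200)       # larger holdout set

degrees = [1, 3, 9]
train_errors, val_errors = [], []
for d in degrees:
    coeffs = np.polyfit(x_train, y_train, deg=d)  # least-squares polynomial fit
    train_errors.append(mse(coeffs, x_train, y_train))
    val_errors.append(mse(coeffs, x_val, y_val))
    print(f"degree {d}: train {train_errors[-1]:.4f}, val {val_errors[-1]:.4f}")
```

Because the lower-degree polynomials are special cases of the higher-degree ones, the training error cannot increase with the degree (up to numerical error). The validation error carries no such guarantee, and its exact values depend on the random seed.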
+
+
+### Dataset Size
+
+As the above bound already indicates,
+another big consideration
+to bear in mind is dataset size.
+Fixing our model, the fewer samples
+we have in the training dataset,
+the more likely (and more severely)
+we are to encounter overfitting.
+As we increase the amount of training data,
+the generalization error typically decreases.
+Moreover, in general, more data never hurts.
+For a fixed task and data distribution,
+model complexity should not increase
+more rapidly than the amount of data.
+Given more data, we might attempt
+to fit a more complex model.
+Absent sufficient data, simpler models
+may be more difficult to beat.
+For many tasks, deep learning
+only outperforms linear models
+when many thousands of training examples are available.
+In part, the current success of deep learning
+owes considerably to the abundance of massive datasets
+arising from Internet companies, cheap storage,
+connected devices, and the broad digitization of the economy.
+
+## Model Selection
+:label:`subsec_generalization-model-selection`
+
+Typically, we select our final model
+only after evaluating multiple models
+that differ in various ways
+(different architectures, training objectives,
+selected features, data preprocessing,
+learning rates, etc.).
+Choosing among many models is aptly
+called *model selection*.
+
+In principle, we should not touch our test set
+until after we have chosen all our hyperparameters.
+Were we to use the test data in the model selection process,
+there is a risk that we might overfit the test data.
+Then we would be in serious trouble.
+If we overfit our training data,
+there is always the evaluation on test data to keep us honest.
+But if we overfit the test data, how would we ever know?
+See :citet:`ong2005learning` for an example of how
+this can lead to absurd results even for models where the complexity
+can be tightly controlled.
+
+Thus, we should never rely on the test data for model selection.
+And yet we cannot rely solely on the training data
+for model selection either because
+we cannot estimate the generalization error
+on the very data that we use to train the model.
+
+
+In practical applications, the picture gets muddier.
+While ideally we would only touch the test data once,
+to assess the very best model or to compare
+a small number of models with each other,
+real-world test data is seldom discarded after just one use.
+We can seldom afford a new test set for each round of experiments.
+In fact, recycling benchmark data for decades
+can have a significant impact on the
+development of algorithms,
+e.g., for [image classification](https://paperswithcode.com/sota/image-classification-on-imagenet)
+and [optical character recognition](https://paperswithcode.com/sota/image-classification-on-mnist).
+
+The common practice to address the problem of *training on the test set*
+is to split our data three ways,
+incorporating a *validation set*
+in addition to the training and test datasets.
+The result is a murky practice where the boundaries
+between validation and test data are worryingly ambiguous.
+Unless explicitly stated otherwise, in the experiments in this book
+we are really working with what should rightly be called
+training data and validation data, with no true test sets.
+Therefore, the accuracy reported in each experiment of the book is really
+the validation accuracy and not a true test set accuracy.
+
+### Cross-Validation
+
+When training data is scarce,
+we might not even be able to afford to hold out
+enough data to constitute a proper validation set.
+One popular solution to this problem is to employ
+$K$*-fold cross-validation*.
+Here, the original training data is split into $K$ non-overlapping subsets.
+Then model training and validation are executed $K$ times,
+each time training on $K-1$ subsets and validating
+on a different subset (the one not used for training in that round).
+Finally, the training and validation errors are estimated
+by averaging over the results from the $K$ experiments.
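
The $K$-fold procedure described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than a reference implementation: the synthetic data, the closed-form least-squares model, and the helper names `k_fold_indices` and `k_fold_mse` are hypothetical choices made for the example.

```python
import numpy as np

def k_fold_indices(n, k, rng):
    """Split a shuffled range(n) into k non-overlapping folds."""
    return np.array_split(rng.permutation(n), k)

def k_fold_mse(X, y, k, rng):
    """Average validation MSE of least-squares regression over k folds."""
    folds = k_fold_indices(len(X), k, rng)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the K-1 training folds (closed-form least squares).
        w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        # Evaluate on the single held-out fold.
        errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -3.4, 1.0]) + rng.normal(0, 0.1, size=100)
cv_error = k_fold_mse(X, y, k=5, rng=rng)
print(f"5-fold validation MSE: {cv_error:.4f}")
```

Every example is used for validation exactly once and for training $K-1$ times, which is why the model must be trained $K$ times in total.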
+
+
+
+## Summary
+
+This section explored some of the underpinnings
+of generalization in machine learning.
+Some of these ideas become complicated
+and counterintuitive when we get to deeper models.
+There, models are capable of overfitting data badly,
+and the relevant notions of complexity
+can be both implicit and counterintuitive
+(e.g., larger architectures with more parameters
+generalizing better).
+We leave you with a few rules of thumb:
+
+1. Use validation sets (or $K$*-fold cross-validation*) for model selection;
+1. More complex models often require more data;
+1. Relevant notions of complexity include both the number of parameters and the range of values that they are allowed to take;
+1. Keeping all else equal, more data almost always leads to better generalization;
+1. This entire talk of generalization is all predicated on the IID assumption. If we relax this assumption, allowing for distributions to shift between the train and testing periods, then we cannot say anything about generalization absent a further (perhaps milder) assumption.
+
+
+## Exercises
+
+1. When can you solve the problem of polynomial regression exactly?
+1. Give at least five examples where dependent random variables make treating the problem as IID data inadvisable.
+1. Can you ever expect to see zero training error? Under which circumstances would you see zero generalization error?
+1. Why is $K$-fold cross-validation very expensive to compute?
+1. Why is the $K$-fold cross-validation error estimate biased?
+1. The VC dimension is defined as the maximum number of points that can be classified with arbitrary labels $\{\pm 1\}$ by a function of a class of functions. Why might this not be a good idea to measure how complex the class of functions is? Hint: what about the magnitude of the functions?
+1. Your manager gives you a difficult dataset on which your current algorithm doesn't perform so well. How would you justify to him that you need more data? Hint: you cannot increase the data but you can decrease it.
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/96)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/97)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/234)
+:end_tab:
diff --git a/chapter_linear-regression/index.md b/chapter_linear-regression/index.md
new file mode 100644
index 0000000..c4591c7
--- /dev/null
+++ b/chapter_linear-regression/index.md
@@ -0,0 +1,16 @@
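+# Linear Neural Networks for Regression
+:label:`chap_regression`
+
+Before we worry about making our neural networks deep, it will be helpful to implement some shallow neural networks, for which the inputs connect directly to the outputs. This will prove important for a few reasons. First, rather than getting distracted by complicated architectures, we can focus on the basics of neural network training, including parameterizing the output layer, handling data, specifying a loss function, and training the model. Second, this class of shallow networks happens to comprise the set of linear models, which subsumes many classical methods for statistical prediction, including linear and softmax regression. Understanding these classical tools is pivotal because they are widely used in many contexts and we will often need to use them as baselines when justifying the use of fancier architectures. This chapter will focus narrowly on linear regression and the subsequent chapter will extend our modeling repertoire by developing linear neural networks for classification.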
+
+```toc
+:maxdepth: 2
+
+linear-regression
+oo-design
+synthetic-regression-data
+linear-regression-scratch
+linear-regression-concise
+generalization
+weight-decay
+```
diff --git a/chapter_linear-regression/index_origin.md b/chapter_linear-regression/index_origin.md
new file mode 100644
index 0000000..287d558
--- /dev/null
+++ b/chapter_linear-regression/index_origin.md
@@ -0,0 +1,35 @@
+# Linear Neural Networks for Regression
+:label:`chap_regression`
+
+Before we worry about making our neural networks deep,
+it will be helpful to implement some shallow neural networks,
+for which the inputs connect directly to the outputs.
+This will prove important for a few reasons.
+First, rather than getting distracted by complicated architectures,
+we can focus on the basics of neural network training,
+including parameterizing the output layer, handling data,
+specifying a loss function, and training the model.
+Second, this class of shallow networks happens
+to comprise the set of linear models,
+which subsumes many classical methods for statistical prediction,
+including linear and softmax regression.
+Understanding these classical tools is pivotal
+because they are widely used in many contexts
+and we will often need to use them as baselines
+when justifying the use of fancier architectures.
+This chapter will focus narrowly on linear regression
+and the subsequent chapter will extend our modeling repertoire
+by developing linear neural networks for classification.
+
+```toc
+:maxdepth: 2
+
+linear-regression
+oo-design
+synthetic-regression-data
+linear-regression-scratch
+linear-regression-concise
+generalization
+weight-decay
+```
+
diff --git a/chapter_linear-regression/linear-regression-concise.md b/chapter_linear-regression/linear-regression-concise.md
new file mode 100644
index 0000000..834bf58
--- /dev/null
+++ b/chapter_linear-regression/linear-regression-concise.md
@@ -0,0 +1,204 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
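+# Concise Implementation of Linear Regression
+:label:`sec_linear_concise`
+
+Deep learning has witnessed a Cambrian explosion of sorts over the past decade. The sheer number of techniques, applications, and algorithms by far surpasses the progress of previous decades. This is due to a fortuitous combination of multiple factors, one of which is the powerful free tools offered by a number of open source deep learning frameworks. Theano :cite:`Bergstra.Breuleux.Bastien.ea.2010`, DistBelief :cite:`Dean.Corrado.Monga.ea.2012`, and Caffe :cite:`Jia.Shelhamer.Donahue.ea.2014` arguably represent the first generation of such frameworks that found widespread adoption. In contrast to earlier (seminal) works like SN2 (Simulateur Neuristique) :cite:`Bottou.Le-Cun.1988`, which provided a Lisp-like programming experience, modern frameworks offer automatic differentiation and the convenience of Python. These frameworks allow us to automate and modularize the repetitive work of implementing gradient-based learning algorithms.
+
+In :numref:`sec_linear_scratch`, we relied only on (i) tensors for data storage and linear algebra; and (ii) automatic differentiation for calculating gradients. In practice, because data iterators, loss functions, optimizers, and neural network layers are so common, modern libraries implement these components for us as well. In this section, (**we will show you how to implement the linear regression model**) from :numref:`sec_linear_scratch` (**concisely by using high-level APIs**) of deep learning frameworks.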
+
+```{.python .input}
+%%tab mxnet
+from d2l import mxnet as d2l
+from mxnet import autograd, gluon, init, np, npx
+from mxnet.gluon import nn
+npx.set_np()
+```
+
+```{.python .input  n=1}
+%%tab pytorch
+from d2l import torch as d2l
+import numpy as np
+import torch
+from torch import nn
+```
+
+```{.python .input  n=1}
+%%tab tensorflow
+from d2l import tensorflow as d2l
+import numpy as np
+import tensorflow as tf
+```
+
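+## Defining the Model
+
+When we implemented linear regression from scratch in :numref:`sec_linear_scratch`, we defined our model parameters explicitly and coded up the calculations to produce output using basic linear algebra operations. You *should* know how to do this. But once your models get more complex, and once you have to do this nearly every day, you will be glad for the assistance. The situation is similar to coding up your own blog from scratch. Doing it once or twice is rewarding and instructive, but you would be a lousy web developer if you spent a month reinventing the wheel.
+
+For standard operations, we can [**use a framework's predefined layers,**] which allow us to focus on the layers used to construct the model rather than worrying about their implementation. Recall the architecture of a single-layer network as described in :numref:`fig_single_neuron`. The layer is called *fully connected*, since each of its inputs is connected to each of its outputs by means of a matrix-vector multiplication.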
+
+:begin_tab:`mxnet`
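+In Gluon, the fully connected layer is defined in the `Dense` class. Since we only want to generate a single scalar output, we set that number to 1. It is worth noting that, for convenience, Gluon does not require us to specify the input shape for each layer. Hence we do not need to tell Gluon how many inputs go into this linear layer. When we first pass data through our model, e.g., when we execute `net(X)` later, Gluon will automatically infer the number of inputs to each layer and thus instantiate the correct model. We will describe how this works in more detail later.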
+:end_tab:
+
+:begin_tab:`pytorch`
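+In PyTorch, the fully connected layer is defined in the `Linear` and `LazyLinear` classes (the latter available since version 1.8.0). The latter allows users to specify *only* the output dimension, while the former additionally asks for how many inputs go into this layer. Specifying input shapes is inconvenient and may require nontrivial calculations (such as in convolutional layers). Thus, for simplicity, we will use such "lazy" layers whenever we can.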
+:end_tab:
+
+:begin_tab:`tensorflow`
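+In Keras, the fully connected layer is defined in the `Dense` class. Since we only want to generate a single scalar output, we set that number to 1. It is worth noting that, for convenience, Keras does not require us to specify the input shape for each layer. We do not need to tell Keras how many inputs go into this linear layer. When we first try to pass data through our model, e.g., when we execute `net(X)` later, Keras will automatically infer the number of inputs to each layer. We will describe how this works in more detail later.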
+:end_tab:
+
+```{.python .input}
+%%tab all
+class LinearRegression(d2l.Module):  #@save
+    def __init__(self, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        if tab.selected('mxnet'):
+            self.net = nn.Dense(1)
+            self.net.initialize(init.Normal(sigma=0.01))
+        if tab.selected('tensorflow'):
+            initializer = tf.initializers.RandomNormal(stddev=0.01)
+            self.net = tf.keras.layers.Dense(1, kernel_initializer=initializer)
+        if tab.selected('pytorch'):
+            self.net = nn.LazyLinear(1)
+            self.net.weight.data.normal_(0, 0.01)
+            self.net.bias.data.fill_(0)
+```
+
+In the `forward` method, we just invoke the built-in `__call__` method of the predefined layers to compute the outputs.
+
+```{.python .input  n=3}
+%%tab all
+@d2l.add_to_class(LinearRegression)  #@save
+def forward(self, X):
+    """The linear regression model."""
+    return self.net(X)
+```
+
+## Defining the Loss Function
+
+:begin_tab:`mxnet`
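+The `loss` module defines many useful loss functions. For speed and convenience, we forgo implementing our own and choose the built-in `loss.L2Loss` instead. Because the `loss` that it returns is the squared error for each example, we use `mean` to average the loss over the minibatch.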
+:end_tab:
+
+:begin_tab:`pytorch`
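+[**The `MSELoss` class computes the mean squared error (without the $1/2$ factor in :eqref:`eq_mse`).**] By default, `MSELoss` returns the average loss over examples. It is faster (and easier to use) than implementing our own.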
+:end_tab:
+
+:begin_tab:`tensorflow`
+The `MeanSquaredError` class computes the mean squared error (without the $1/2$ factor in :eqref:`eq_mse`). By default, it returns the average loss over examples.
+:end_tab:
+
+```{.python .input  n=3}
+%%tab all
+@d2l.add_to_class(LinearRegression)  #@save
+def loss(self, y_hat, y):
+    if tab.selected('mxnet'):
+        fn = gluon.loss.L2Loss()
+        return fn(y_hat, y).mean()
+    if tab.selected('pytorch'):
+        fn = nn.MSELoss()
+        return fn(y_hat, y)
+    if tab.selected('tensorflow'):
+        fn = tf.keras.losses.MeanSquaredError()
+        return fn(y, y_hat)
+```
+
+## Defining the Optimization Algorithm
+
+:begin_tab:`mxnet`
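+Minibatch SGD is a standard tool for optimizing neural networks and thus Gluon supports it alongside a number of variations on this algorithm through its `Trainer` class. Note that Gluon's `Trainer` class stands for the optimization algorithm, while the `Trainer` class we created in :numref:`sec_oo-design` contains the training function, i.e., it repeatedly calls the optimizer to update the model parameters. When we instantiate `Trainer`, we specify the parameters to optimize over, obtainable from our model `net` via `net.collect_params()`, the optimization algorithm we wish to use (`sgd`), and a dictionary of hyperparameters required by our optimization algorithm.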
+:end_tab:
+
+:begin_tab:`pytorch`
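+Minibatch SGD is a standard tool for optimizing neural networks and thus PyTorch supports it alongside a number of variations on this algorithm in the `optim` module. When we (**instantiate an `SGD` instance,**) we specify the parameters to optimize over, obtainable from our model via `self.parameters()`, and the learning rate (`self.lr`) required by our optimization algorithm.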
+:end_tab:
+
+:begin_tab:`tensorflow`
+Minibatch SGD is a standard tool for optimizing neural networks and thus Keras supports it alongside a number of variations on this algorithm in the `optimizers` module.
+:end_tab:
+
+```{.python .input  n=5}
+%%tab all
+@d2l.add_to_class(LinearRegression)  #@save
+def configure_optimizers(self):
+    if tab.selected('mxnet'):
+        return gluon.Trainer(self.collect_params(),
+                             'sgd', {'learning_rate': self.lr})
+    if tab.selected('pytorch'):
+        return torch.optim.SGD(self.parameters(), self.lr)
+    if tab.selected('tensorflow'):
+        return tf.keras.optimizers.SGD(self.lr)
+```
+
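+## Training
+
+You might have noticed that expressing our model through high-level APIs of a deep learning framework requires fewer lines of code. We did not have to allocate parameters individually, define our loss function, or implement minibatch SGD. Once we start working with much more complex models, the advantages of the high-level API will grow considerably. Now that we have all the basic pieces in place, [**the training loop itself is the same as the one we implemented from scratch.**] So we just call the `fit` method (introduced in :numref:`oo-design-training`), which relies on the implementation of the `fit_epoch` method in :numref:`sec_linear_scratch`, to train our model.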
+
+```{.python .input}
+%%tab all
+model = LinearRegression(lr=0.03)
+data = d2l.SyntheticRegressionData(w=d2l.tensor([2, -3.4]), b=4.2)
+trainer = d2l.Trainer(max_epochs=3)
+trainer.fit(model, data)
+```
+
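+Below, we [**compare the model parameters learned by training on finite data and the actual parameters**] that generated our dataset. To access parameters, we access the weights and bias of the layer that we need. As in our implementation from scratch, note that our estimated parameters are close to their true counterparts.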
+
+```{.python .input}
+%%tab all
+@d2l.add_to_class(LinearRegression)  #@save
+def get_w_b(self):
+    if tab.selected('mxnet'):
+        return (self.net.weight.data(), self.net.bias.data())
+    if tab.selected('pytorch'):
+        return (self.net.weight.data, self.net.bias.data)
+    if tab.selected('tensorflow'):
+        return (self.get_weights()[0], self.get_weights()[1])
+
+w, b = model.get_w_b()
+print(f'error in estimating w: {data.w - d2l.reshape(w, data.w.shape)}')
+print(f'error in estimating b: {data.b - b}')
+```
+
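+## Summary
+
+This section contains the first implementation of a deep network (in this book) to tap into the conveniences afforded by modern deep learning frameworks, such as Gluon :cite:`Chen.Li.Li.ea.2015`, JAX :cite:`Frostig.Johnson.Leary.2018`, PyTorch :cite:`Paszke.Gross.Massa.ea.2019`, and TensorFlow :cite:`Abadi.Barham.Chen.ea.2016`. We used framework defaults for loading data, defining a layer, a loss function, an optimizer, and a training loop. Whenever the framework provides all necessary features, it is generally a good idea to use them, since the library implementations of these components tend to be heavily optimized for performance and properly tested for reliability. At the same time, try not to forget that these modules *can* be implemented directly. This is especially important for aspiring researchers who wish to live on the bleeding edge of model development, where you will be inventing new components that cannot possibly exist in any current library.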
+
+:begin_tab:`mxnet`
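+In Gluon, the `data` module provides tools for data processing, the `nn` module defines a large number of neural network layers, and the `loss` module defines many common loss functions. Moreover, the `initializer` gives access to many choices for parameter initialization. Conveniently for the user, dimensionality and storage are automatically inferred. A consequence of this lazy initialization is that you must not attempt to access parameters before they have been instantiated (and initialized).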
+:end_tab:
+
+:begin_tab:`pytorch`
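+In PyTorch, the `data` module provides tools for data processing, and the `nn` module defines a large number of neural network layers and common loss functions. We can initialize the parameters by replacing their values with methods ending with `_`. Note that we need to specify the input dimensions of the network. While this is trivial for now, it can have significant knock-on effects when we want to design complex networks with many layers. Careful consideration of how to parametrize these networks is needed to allow portability.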
+:end_tab:
+
+:begin_tab:`tensorflow`
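+In TensorFlow, the `data` module provides tools for data processing, and the `keras` module defines a large number of neural network layers and common loss functions. Moreover, the `initializers` module provides various methods for model parameter initialization. Dimensionality and storage for networks are automatically inferred (but be careful not to attempt to access parameters before they have been initialized).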
+:end_tab:
+
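+## Exercises
+
+1. How would you need to change the learning rate if you replace the aggregate loss over the minibatch with an average over the loss on the minibatch?
+1. Review the framework documentation to see which loss functions are provided. In particular, replace the squared loss with Huber's robust loss function. That is, use the loss function $$l(y,y') = \begin{cases}|y-y'| -\frac{\sigma}{2} & \text{ if } |y-y'| > \sigma \\ \frac{1}{2 \sigma} (y-y')^2 & \text{ otherwise}\end{cases}$$
+1. How do you access the gradient of the weights of the model?
+1. How does the solution change if you change the learning rate and the number of epochs? Does it keep on improving?
+1. How does the solution change as you change the amount of data generated?
+    1. Plot the estimation error for $\hat{\mathbf{w}} - \mathbf{w}$ and $\hat{b} - b$ as a function of the amount of data. Hint: increase the amount of data logarithmically rather than linearly, i.e., 5, 10, 20, 50, ..., 10,000 rather than 1,000, 2,000, ..., 10,000.
+    2. Why is the suggestion in the hint appropriate?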
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/44)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/45)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/204)
+:end_tab:
diff --git a/chapter_linear-regression/linear-regression-concise_origin.md b/chapter_linear-regression/linear-regression-concise_origin.md
new file mode 100644
index 0000000..dfc202b
--- /dev/null
+++ b/chapter_linear-regression/linear-regression-concise_origin.md
@@ -0,0 +1,390 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# Concise Implementation of Linear Regression
+:label:`sec_linear_concise`
+
+Deep learning has witnessed a Cambrian explosion
+of sorts over the past decade.
+The sheer number of techniques, applications and algorithms by far surpasses the
+progress of previous decades. 
+This is due to a fortuitous combination of multiple factors,
+one of which is the powerful free tools
+offered by a number of open source deep learning frameworks.
+Theano :cite:`Bergstra.Breuleux.Bastien.ea.2010`,
+DistBelief :cite:`Dean.Corrado.Monga.ea.2012`,
+and Caffe :cite:`Jia.Shelhamer.Donahue.ea.2014`
+arguably represent the
+first generation of such frameworks
+that found widespread adoption.
+In contrast to earlier (seminal) works like
+SN2 (Simulateur Neuristique) :cite:`Bottou.Le-Cun.1988`,
+which provided a Lisp-like programming experience,
+modern frameworks offer automatic differentiation
+and the convenience of Python.
+These frameworks allow us to automate and modularize
+the repetitive work of implementing gradient-based learning algorithms.
+
+In :numref:`sec_linear_scratch`, we relied only on
+(i) tensors for data storage and linear algebra;
+and (ii) automatic differentiation for calculating gradients.
+In practice, because data iterators, loss functions, optimizers,
+and neural network layers
+are so common, modern libraries implement these components for us as well.
+In this section, (**we will show you how to implement
+the linear regression model**) from :numref:`sec_linear_scratch`
+(**concisely by using high-level APIs**) of deep learning frameworks.
+
+```{.python .input}
+%%tab mxnet
+from d2l import mxnet as d2l
+from mxnet import autograd, gluon, init, np, npx
+from mxnet.gluon import nn
+npx.set_np()
+```
+
+```{.python .input  n=1}
+%%tab pytorch
+from d2l import torch as d2l
+import numpy as np
+import torch
+from torch import nn
+```
+
+```{.python .input  n=1}
+%%tab tensorflow
+from d2l import tensorflow as d2l
+import numpy as np
+import tensorflow as tf
+```
+
+## Defining the Model
+
+When we implemented linear regression from scratch
+in :numref:`sec_linear_scratch`,
+we defined our model parameters explicitly
+and coded up the calculations to produce output
+using basic linear algebra operations.
+You *should* know how to do this.
+But once your models get more complex,
+and once you have to do this nearly every day,
+you will be glad for the assistance.
+The situation is similar to coding up your own blog from scratch.
+Doing it once or twice is rewarding and instructive,
+but you would be a lousy web developer
+if you spent a month reinventing the wheel.
+
+For standard operations,
+we can [**use a framework's predefined layers,**]
+which allow us to focus
+on the layers used to construct the model
+rather than worrying about their implementation.
+Recall the architecture of a single-layer network
+as described in :numref:`fig_single_neuron`.
+The layer is called *fully connected*,
+since each of its inputs is connected
+to each of its outputs
+by means of a matrix-vector multiplication.
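
To make the "fully connected" description concrete, here is a minimal NumPy sketch of what such a layer computes. It is not how any of the frameworks implements its layer; the function name `dense` and the shapes are assumptions chosen to match the single-output regression setting of this section.

```python
import numpy as np

def dense(X, W, b):
    """Fully connected layer: every input feeds into every output."""
    return X @ W + b

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))            # minibatch of 4 examples, 2 features
W = rng.normal(0, 0.01, size=(2, 1))   # one weight per (input, output) pair
b = np.zeros(1)                        # one bias per output
y_hat = dense(X, W, b)
print(y_hat.shape)
```

For a whole minibatch the matrix-vector product becomes a matrix-matrix product, with one row of `y_hat` per example.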
+
+:begin_tab:`mxnet`
+In Gluon, the fully connected layer is defined in the `Dense` class.
+Since we only want to generate a single scalar output,
+we set that number to 1.
+It is worth noting that, for convenience,
+Gluon does not require us to specify
+the input shape for each layer.
+Hence we don't need to tell Gluon
+how many inputs go into this linear layer.
+When we first pass data through our model,
+e.g., when we execute `net(X)` later,
+Gluon will automatically infer the number of inputs to each layer and
+thus instantiate the correct model.
+We will describe how this works in more detail later.
+:end_tab:
+
+:begin_tab:`pytorch`
+In PyTorch, the fully connected layer is defined in the `Linear` and `LazyLinear` classes (the latter available since version 1.8.0).
+The latter
+allows users to *only* specify
+the output dimension,
+while the former
+additionally asks for
+how many inputs go into this layer.
+Specifying input shapes is inconvenient
+and may require nontrivial calculations
+(such as in convolutional layers).
+Thus, for simplicity we will use such "lazy" layers
+whenever we can. 
+:end_tab:
+
+:begin_tab:`tensorflow`
+In Keras, the fully connected layer is defined in the `Dense` class.
+Since we only want to generate a single scalar output,
+we set that number to 1.
+It is worth noting that, for convenience,
+Keras does not require us to specify
+the input shape for each layer.
+We don't need to tell Keras
+how many inputs go into this linear layer.
+When we first try to pass data through our model,
+e.g., when we execute `net(X)` later,
+Keras will automatically infer
+the number of inputs to each layer.
+We will describe how this works in more detail later.
+:end_tab:
+
+```{.python .input}
+%%tab all
+class LinearRegression(d2l.Module):  #@save
+    def __init__(self, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        if tab.selected('mxnet'):
+            self.net = nn.Dense(1)
+            self.net.initialize(init.Normal(sigma=0.01))
+        if tab.selected('tensorflow'):
+            initializer = tf.initializers.RandomNormal(stddev=0.01)
+            self.net = tf.keras.layers.Dense(1, kernel_initializer=initializer)
+        if tab.selected('pytorch'):
+            self.net = nn.LazyLinear(1)
+            self.net.weight.data.normal_(0, 0.01)
+            self.net.bias.data.fill_(0)
+```
+
+In the `forward` method, we just invoke the built-in `__call__` function of the predefined layers to compute the outputs.
+
+```{.python .input  n=3}
+%%tab all
+@d2l.add_to_class(LinearRegression)  #@save
+def forward(self, X):
+    """The linear regression model."""
+    return self.net(X)
+```
+
+## Defining the Loss Function
+
+:begin_tab:`mxnet`
+The `loss` module defines many useful loss functions.
+For speed and convenience, we forgo implementing our own
+and choose the built-in `loss.L2Loss` instead.
+Because the `loss` that it returns is
+the squared error for each example,
+we use `mean` to average the loss over the minibatch.
+:end_tab:
+
+:begin_tab:`pytorch`
+[**The `MSELoss` class computes the mean squared error (without the $1/2$ factor in :eqref:`eq_mse`).**]
+By default, `MSELoss` returns the average loss over examples.
+It is faster (and easier to use) than implementing our own.
+:end_tab:
+
+:begin_tab:`tensorflow`
+The `MeanSquaredError` class computes the mean squared error (without the $1/2$ factor in :eqref:`eq_mse`).
+By default, it returns the average loss over examples.
+:end_tab:
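
The remark about the $1/2$ factor can be illustrated with a small NumPy sketch. The numbers below are made up for the example, and the code does not call any framework's loss class. It only compares the two conventions directly.

```python
import numpy as np

y_hat = np.array([2.5, 0.0, 2.0])   # hypothetical predictions
y = np.array([3.0, -0.5, 2.0])      # hypothetical labels

half_squared = 0.5 * (y_hat - y) ** 2   # per-example loss with the 1/2 factor
mse = np.mean((y_hat - y) ** 2)         # mean squared error, no 1/2 factor

print(half_squared.mean(), mse)
```

The two averages differ by exactly a factor of 2, and so do their gradients. With the same learning rate, minimizing the mean squared error therefore takes steps twice as large as minimizing the averaged half squared error.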
+
+```{.python .input  n=3}
+%%tab all
+@d2l.add_to_class(LinearRegression)  #@save
+def loss(self, y_hat, y):
+    if tab.selected('mxnet'):
+        fn = gluon.loss.L2Loss()
+        return fn(y_hat, y).mean()
+    if tab.selected('pytorch'):
+        fn = nn.MSELoss()
+        return fn(y_hat, y)
+    if tab.selected('tensorflow'):
+        fn = tf.keras.losses.MeanSquaredError()
+        return fn(y, y_hat)
+```
+
+## Defining the Optimization Algorithm
+
+:begin_tab:`mxnet`
+Minibatch SGD is a standard tool
+for optimizing neural networks
+and thus Gluon supports it alongside a number of
+variations on this algorithm through its `Trainer` class.
+Note that Gluon's `Trainer` class stands
+for the optimization algorithm,
+while the `Trainer` class we created in :numref:`sec_oo-design`
+contains the training function,
+i.e., it repeatedly calls the optimizer
+to update the model parameters.
+When we instantiate `Trainer`,
+we specify the parameters to optimize over,
+obtainable from our model `net` via `net.collect_params()`,
+the optimization algorithm we wish to use (`sgd`),
+and a dictionary of hyperparameters
+required by our optimization algorithm.
+:end_tab:
+
+:begin_tab:`pytorch`
+Minibatch SGD is a standard tool
+for optimizing neural networks
+and thus PyTorch supports it alongside a number of
+variations on this algorithm in the `optim` module.
+When we (**instantiate an `SGD` instance,**)
+we specify the parameters to optimize over,
+obtainable from our model via `self.parameters()`,
+and the learning rate (`self.lr`)
+required by our optimization algorithm.
+:end_tab:
+
+:begin_tab:`tensorflow`
+Minibatch SGD is a standard tool
+for optimizing neural networks
+and thus Keras supports it alongside a number of
+variations on this algorithm in the `optimizers` module.
+:end_tab:
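
Whichever framework class is used, the update it applies for plain minibatch SGD is simple. The sketch below writes one such step by hand in NumPy for linear regression. It assumes the averaged half squared error as the loss, the helper names `sgd_step` and `half_mse` are hypothetical, and it omits features of real optimizers such as momentum or weight decay.

```python
import numpy as np

def sgd_step(w, b, X, y, lr):
    """One minibatch SGD update for linear regression with loss (1/2) * MSE."""
    err = X @ w + b - y              # residuals on the minibatch
    grad_w = X.T @ err / len(y)      # gradient with respect to the weights
    grad_b = err.mean()              # gradient with respect to the bias
    return w - lr * grad_w, b - lr * grad_b

def half_mse(w, b, X, y):
    """Averaged half squared error on the given examples."""
    return float(0.5 * np.mean((X @ w + b - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 2))                  # one minibatch of 32 examples
y = X @ np.array([2.0, -3.4]) + 4.2           # noiseless synthetic labels
w, b = np.zeros(2), 0.0

loss_before = half_mse(w, b, X, y)
w, b = sgd_step(w, b, X, y, lr=0.03)
loss_after = half_mse(w, b, X, y)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

A training loop repeats this step over successive minibatches, which is the work that the framework optimizers and the `fit` method take over for us.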
+
+```{.python .input  n=5}
+%%tab all
+@d2l.add_to_class(LinearRegression)  #@save
+def configure_optimizers(self):
+    if tab.selected('mxnet'):
+        return gluon.Trainer(self.collect_params(),
+                             'sgd', {'learning_rate': self.lr})
+    if tab.selected('pytorch'):
+        return torch.optim.SGD(self.parameters(), self.lr)
+    if tab.selected('tensorflow'):
+        return tf.keras.optimizers.SGD(self.lr)
+```
+
+## Training
+
+You might have noticed that expressing our model through
+high-level APIs of a deep learning framework
+requires fewer lines of code.
+We did not have to allocate parameters individually,
+define our loss function, or implement minibatch SGD.
+Once we start working with much more complex models,
+the advantages of the high-level API will grow considerably.
+Now that we have all the basic pieces in place,
+[**the training loop itself is the same
+as the one we implemented from scratch.**]
+So we just call the `fit` method (introduced in :numref:`oo-design-training`),
+which relies on the implementation of the `fit_epoch` method
+in :numref:`sec_linear_scratch`,
+to train our model.
+
+```{.python .input}
+%%tab all
+model = LinearRegression(lr=0.03)
+data = d2l.SyntheticRegressionData(w=d2l.tensor([2, -3.4]), b=4.2)
+trainer = d2l.Trainer(max_epochs=3)
+trainer.fit(model, data)
+```
+
+Below, we
+[**compare the model parameters learned
+by training on finite data
+and the actual parameters**]
+that generated our dataset.
+To access parameters,
+we access the weights and bias
+of the layer that we need.
+As in our implementation from scratch,
+note that our estimated parameters
+are close to their true counterparts.
+
+```{.python .input}
+%%tab all
+@d2l.add_to_class(LinearRegression)  #@save
+def get_w_b(self):
+    if tab.selected('mxnet'):
+        return (self.net.weight.data(), self.net.bias.data())
+    if tab.selected('pytorch'):
+        return (self.net.weight.data, self.net.bias.data)
+    if tab.selected('tensorflow'):
+        return (self.get_weights()[0], self.get_weights()[1])
+
+w, b = model.get_w_b()
+print(f'error in estimating w: {data.w - d2l.reshape(w, data.w.shape)}')
+print(f'error in estimating b: {data.b - b}')
+```
+
+## Summary
+
+This section contains the first
+implementation of a deep network (in this book)
+to tap into the conveniences afforded
+by modern deep learning frameworks,
+such as Gluon :cite:`Chen.Li.Li.ea.2015`,
+JAX :cite:`Frostig.Johnson.Leary.2018`,
+PyTorch :cite:`Paszke.Gross.Massa.ea.2019`,
+and TensorFlow :cite:`Abadi.Barham.Chen.ea.2016`.
+We used framework defaults for loading data, defining a layer,
+a loss function, an optimizer and a training loop.
+Whenever the framework provides all necessary features,
+it's generally a good idea to use them,
+since the library implementations of these components
+tend to be heavily optimized for performance
+and properly tested for reliability.
+At the same time, try not to forget
+that these modules *can* be implemented directly.
+This is especially important for aspiring researchers
+who wish to live on the bleeding edge of model development,
+where you will be inventing new components
+that cannot possibly exist in any current library.
+
+:begin_tab:`mxnet`
+In Gluon, the `data` module provides tools for data processing,
+the `nn` module defines a large number of neural network layers,
+and the `loss` module defines many common loss functions.
+Moreover, the `initializer` module gives access
+to many choices for parameter initialization.
+Conveniently for the user,
+dimensionality and storage are automatically inferred.
+A consequence of this lazy initialization is that
+you must not attempt to access parameters
+before they have been instantiated (and initialized).
+:end_tab:
+
+:begin_tab:`pytorch`
+In PyTorch, the `data` module provides tools for data processing,
+and the `nn` module defines a large number of neural network layers and common loss functions.
+We can initialize the parameters by replacing their values
+with methods ending with `_`.
+Note that we need to specify the input dimensions of the network.
+While this is trivial for now, it can have significant knock-on effects
+when we want to design complex networks with many layers.
+Careful consideration of how to parametrize these networks
+is needed to allow portability.
+:end_tab:
+
+:begin_tab:`tensorflow`
+In TensorFlow, the `data` module provides tools for data processing,
+and the `keras` module defines a large number of neural network layers and common loss functions.
+Moreover, the `initializers` module provides various methods for model parameter initialization.
+Dimensionality and storage for networks are automatically inferred
+(but be careful not to attempt to access parameters before they have been initialized).
+:end_tab:
+
+## Exercises
+
+1. How would you need to change the learning rate if you replace the aggregate loss over the minibatch
+   with an average over the loss on the minibatch?
+1. Review the framework documentation to see which loss functions are provided. In particular,
+   replace the squared loss with Huber's robust loss function. That is, use the loss function
+   $$l(y,y') = \begin{cases}|y-y'| -\frac{\sigma}{2} & \text{ if } |y-y'| > \sigma \\ \frac{1}{2 \sigma} (y-y')^2 & \text{ otherwise}\end{cases}$$
+1. How do you access the gradient of the weights of the model?
+1. How does the solution change if you change the learning rate and the number of epochs? Does it keep on improving?
+1. How does the solution change as you change the amount of data generated?
+    1. Plot the estimation error for $\hat{\mathbf{w}} - \mathbf{w}$ and $\hat{b} - b$ as a function of the amount of data. Hint: increase the amount of data logarithmically rather than linearly, i.e., 5, 10, 20, 50, ..., 10,000 rather than 1,000, 2,000, ..., 10,000.
+    2. Why is the suggestion in the hint appropriate?
+
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/44)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/45)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/204)
+:end_tab:
diff --git a/chapter_linear-regression/linear-regression-scratch.md b/chapter_linear-regression/linear-regression-scratch.md
new file mode 100644
index 0000000..3c90c06
--- /dev/null
+++ b/chapter_linear-regression/linear-regression-scratch.md
@@ -0,0 +1,275 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
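+# Linear Regression Implementation from Scratch
+:label:`sec_linear_scratch`
+
+We are now ready to work through a fully functioning implementation of linear regression. In this section, (**we will implement the entire method from scratch, including (i) the model; (ii) the loss function; (iii) a minibatch stochastic gradient descent optimizer; and (iv) the training function that stitches all of these pieces together.**) Finally, we will run our synthetic data generator from :numref:`sec_synthetic-regression-data` and apply our model to the resulting dataset. While modern deep learning frameworks can automate nearly all of this work, implementing things from scratch is the only way to make sure that you really know what you are doing. Moreover, when it comes time to customize models, defining our own layers or loss functions, understanding how things work under the hood will prove handy. In this section, we will rely only on tensors and automatic differentiation. Later on, we will introduce a more concise implementation, taking advantage of the bells and whistles of deep learning frameworks while retaining the structure of what follows below.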
+
+```{.python .input  n=2}
+%%tab mxnet
+%matplotlib inline
+from d2l import mxnet as d2l
+from mxnet import autograd, np, npx
+npx.set_np()
+```
+
+```{.python .input  n=3}
+%%tab pytorch
+%matplotlib inline
+from d2l import torch as d2l
+import torch
+```
+
+```{.python .input  n=4}
+%%tab tensorflow
+%matplotlib inline
+from d2l import tensorflow as d2l
+import tensorflow as tf
+```
+
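+## Defining the Model
+
+[**Before we can begin optimizing our model's parameters**] by minibatch SGD, (**we need to have some parameters in the first place.**) In the following, we initialize the weights by drawing random numbers from a normal distribution with mean 0 and standard deviation 0.01. The magic number 0.01 often works well in practice, but you can specify a different value through the argument `sigma`. Moreover, we set the bias to 0. Note that, following our object-oriented design, we add the code to the `__init__` method of a subclass of `d2l.Module` (introduced in :numref:`oo-design-models`).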
+
+```{.python .input  n=5}
+%%tab all
+class LinearRegressionScratch(d2l.Module):  #@save
+    def __init__(self, num_inputs, lr, sigma=0.01):
+        super().__init__()
+        self.save_hyperparameters()
+        if tab.selected('mxnet'):
+            self.w = d2l.normal(0, sigma, (num_inputs, 1))
+            self.b = d2l.zeros(1)
+            self.w.attach_grad()
+            self.b.attach_grad()
+        if tab.selected('pytorch'):
+            self.w = d2l.normal(0, sigma, (num_inputs, 1), requires_grad=True)
+            self.b = d2l.zeros(1, requires_grad=True)
+        if tab.selected('tensorflow'):
+            w = tf.random.normal((num_inputs, 1), mean=0, stddev=sigma)
+            b = tf.zeros(1)
+            self.w = tf.Variable(w, trainable=True)
+            self.b = tf.Variable(b, trainable=True)
+```
+
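+Next, we must [**define our model, relating its input and parameters to its output.**] For our linear model, we simply take the matrix-vector product of the input features $\mathbf{X}$ and the model weights $\mathbf{w}$, and add the offset $b$ to each example. $\mathbf{Xw}$ is a vector and $b$ is a scalar. Due to the broadcasting mechanism (see :numref:`subsec_broadcasting`), when we add a vector and a scalar, the scalar is added to each component of the vector. The resulting `forward` function is registered as a method of the `LinearRegressionScratch` class via `add_to_class` (introduced in :numref:`oo-design-utilities`).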
+
+```{.python .input  n=6}
+%%tab all
+@d2l.add_to_class(LinearRegressionScratch)  #@save
+def forward(self, X):
+    """The linear regression model."""
+    return d2l.matmul(X, self.w) + self.b
+```
+
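+## Defining the Loss Function
+
+Since [**updating our model requires taking the gradient of our loss function,**] we ought to (**define the loss function first.**) Here we use the squared loss function in :eqref:`eq_mse`. In the implementation, we need to reshape the true value `y` to the shape of the predicted value `y_hat`. The elementwise loss then has the same shape as `y_hat`, and we return its average over all examples in the minibatch.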
+
+```{.python .input  n=7}
+%%tab all
+@d2l.add_to_class(LinearRegressionScratch)  #@save
+def loss(self, y_hat, y):
+    l = (y_hat - d2l.reshape(y, y_hat.shape)) ** 2 / 2
+    return d2l.reduce_mean(l)
+```
+
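+## Defining the Optimization Algorithm
+
+As discussed in :numref:`sec_linear_regression`, linear regression has a closed-form solution. However, our goal here is to illustrate how to train more general neural networks, and that requires that we teach you how to use minibatch SGD. Hence we will take this opportunity to introduce your first working example of SGD. At each step, using a minibatch randomly drawn from our dataset, we estimate the gradient of the loss with respect to the parameters. Next, we update the parameters in the direction that may reduce the loss.
+
+The following code applies the update, given a set of parameters and a learning rate `lr`. Since our loss is computed as an average over the minibatch, we do not need to adjust the learning rate for the batch size. In later chapters we will investigate how learning rates should be adjusted for the very large minibatches that arise in distributed large-scale learning. For now, we can ignore this dependency.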
+
+:begin_tab:`mxnet`
+We define our `SGD` class, a subclass of `d2l.HyperParameters` (introduced in :numref:`oo-design-utilities`), to have a similar API to the built-in SGD optimizer. We update the parameters in the `step` method. It accepts a `batch_size` argument that can be ignored.
+:end_tab:
+
+:begin_tab:`pytorch`
+We define our `SGD` class, a subclass of `d2l.HyperParameters` (introduced in :numref:`oo-design-utilities`), to have a similar API to the built-in SGD optimizer. We update the parameters in the `step` method. The `zero_grad` method sets all gradients to 0, which must be done before a backpropagation step.
+:end_tab:
+
+:begin_tab:`tensorflow`
+We define our `SGD` class, a subclass of `d2l.HyperParameters` (introduced in :numref:`oo-design-utilities`), to have a similar API to the built-in SGD optimizer. We update the parameters in the `apply_gradients` method. It accepts a list of gradient and parameter pairs.
+:end_tab:
+
+```{.python .input  n=8}
+%%tab mxnet, pytorch
+class SGD(d2l.HyperParameters):  #@save
+    def __init__(self, params, lr):
+        """Minibatch stochastic gradient descent."""
+        self.save_hyperparameters()
+
+    if tab.selected('mxnet'):
+        def step(self, _):
+            for param in self.params:
+                param -= self.lr * param.grad
+    
+    if tab.selected('pytorch'):
+        def step(self):
+            for param in self.params:
+                param -= self.lr * param.grad
+
+        def zero_grad(self):
+            for param in self.params:
+                if param.grad is not None:
+                    param.grad.zero_()
+```
+
+```{.python .input  n=9}
+%%tab tensorflow
+class SGD(d2l.HyperParameters):  #@save
+    def __init__(self, lr):
+        """Minibatch stochastic gradient descent."""
+        self.save_hyperparameters()
+    
+    def apply_gradients(self, grads_and_vars):
+        for grad, param in grads_and_vars:
+            param.assign_sub(self.lr * grad)
+```
+
+We next define the `configure_optimizers` method, which returns an instance of the `SGD` class.
+
+```{.python .input  n=10}
+%%tab all
+@d2l.add_to_class(LinearRegressionScratch)  #@save
+def configure_optimizers(self):
+    if tab.selected('mxnet') or tab.selected('pytorch'):
+        return SGD([self.w, self.b], self.lr)
+    if tab.selected('tensorflow'):
+        return SGD(self.lr)
+```
+
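+## Training
+
+Now that we have all of the parts in place (parameters, loss function, model, and optimizer), we are ready to [**implement the main training loop.**] It is crucial that you understand this code well since you will employ similar training loops for every other deep learning model covered in this book. In each *epoch*, we iterate through the entire training dataset, passing once through every example (assuming that the number of examples is divisible by the batch size). In each iteration, we grab a minibatch of training examples and compute its loss through the model's `training_step` method. Next, we compute the gradients with respect to each parameter. Finally, we call the optimization algorithm to update the model parameters. In summary, we execute the following loop:
+
+* Initialize parameters $(\mathbf{w}, b)$
+* Repeat until done
+    * Compute gradient $\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)$
+    * Update parameters $(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}$
+
+Recall that the synthetic regression dataset that we generated in :numref:`sec_synthetic-regression-data` does not provide a validation dataset. In most cases, however, we will use a validation dataset to measure our model quality. Here we pass through the validation dataloader once in each epoch to measure the model performance. Following our object-oriented design, the `prepare_batch` and `fit_epoch` functions are registered as methods of the `d2l.Trainer` class (introduced in :numref:`oo-design-training`).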
+
+```{.python .input  n=11}
+%%tab all    
+@d2l.add_to_class(d2l.Trainer)  #@save
+def prepare_batch(self, batch):
+    return batch
+```
+
+```{.python .input  n=12}
+%%tab pytorch
+@d2l.add_to_class(d2l.Trainer)  #@save
+def fit_epoch(self):
+    self.model.train()        
+    for batch in self.train_dataloader:        
+        loss = self.model.training_step(self.prepare_batch(batch))
+        self.optim.zero_grad()
+        with torch.no_grad():
+            loss.backward()
+            if self.gradient_clip_val > 0:  # To be discussed later
+                self.clip_gradients(self.gradient_clip_val, self.model)
+            self.optim.step()
+        self.train_batch_idx += 1
+    if self.val_dataloader is None:
+        return
+    self.model.eval()
+    for batch in self.val_dataloader:
+        with torch.no_grad():            
+            self.model.validation_step(self.prepare_batch(batch))
+        self.val_batch_idx += 1
+```
+
+```{.python .input  n=13}
+%%tab mxnet
+@d2l.add_to_class(d2l.Trainer)  #@save
+def fit_epoch(self):
+    for batch in self.train_dataloader:
+        with autograd.record():
+            loss = self.model.training_step(self.prepare_batch(batch))
+        loss.backward()
+        if self.gradient_clip_val > 0:
+            self.clip_gradients(self.gradient_clip_val, self.model)
+        self.optim.step(1)
+        self.train_batch_idx += 1
+    if self.val_dataloader is None:
+        return
+    for batch in self.val_dataloader:        
+        self.model.validation_step(self.prepare_batch(batch))
+        self.val_batch_idx += 1
+```
+
+```{.python .input  n=14}
+%%tab tensorflow
+@d2l.add_to_class(d2l.Trainer)  #@save
+def fit_epoch(self):
+    self.model.training = True
+    for batch in self.train_dataloader:            
+        with tf.GradientTape() as tape:
+            loss = self.model.training_step(self.prepare_batch(batch))
+        grads = tape.gradient(loss, self.model.trainable_variables)
+        if self.gradient_clip_val > 0:
+            grads = self.clip_gradients(self.gradient_clip_val, grads)
+        self.optim.apply_gradients(zip(grads, self.model.trainable_variables))
+        self.train_batch_idx += 1
+    if self.val_dataloader is None:
+        return
+    self.model.training = False
+    for batch in self.val_dataloader:        
+        self.model.validation_step(self.prepare_batch(batch))
+        self.val_batch_idx += 1
+```
+
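+We are almost ready to train the model, but first we need some data to train on. Here we use the `SyntheticRegressionData` class and pass in some ground-truth parameters. Then, we train our model with the learning rate `lr=0.03` and set `max_epochs=3`. Note that both the number of epochs and the learning rate are hyperparameters. In general, setting hyperparameters is tricky, and we will usually want to use a 3-way split: one set for training, a second for hyperparameter selection, and a third reserved for the final evaluation. We elide these details for now but will revisit them later.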
+
+```{.python .input  n=15}
+%%tab all
+model = LinearRegressionScratch(2, lr=0.03)
+data = d2l.SyntheticRegressionData(w=d2l.tensor([2, -3.4]), b=4.2)
+trainer = d2l.Trainer(max_epochs=3)
+trainer.fit(model, data)
+```
+
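+Because we synthesized the dataset ourselves, we know precisely what the true parameters are. Thus, we can [**evaluate our success in training by comparing the true parameters with those that we learned**] through our training loop. Indeed, they turn out to be very close to each other.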
+
+```{.python .input  n=16}
+%%tab all
+print(f'error in estimating w: {data.w - d2l.reshape(model.w, data.w.shape)}')
+print(f'error in estimating b: {data.b - model.b}')
+```
+
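+We should not take the ability to exactly recover the ground-truth parameters for granted. In general, for deep models unique solutions for the parameters do not exist, and even for linear models, exactly recovering the parameters is only possible when no feature is linearly dependent on the others. However, in machine learning, we are often less concerned with recovering true underlying parameters, and more concerned with parameters that lead to highly accurate prediction :cite:`Vapnik.1992`. Fortunately, even on difficult optimization problems, stochastic gradient descent can often find remarkably good solutions, owing partly to the fact that, for deep networks, there exist many configurations of the parameters that lead to highly accurate prediction.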
+
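+## Summary
+
+In this section, we took a significant step towards designing deep learning systems by implementing a fully functional neural network model and training loop. In this process, we built a data loader, a model, a loss function, an optimization procedure, and a visualization and monitoring tool. We did this by composing a Python object that contains all relevant components for training a model. While this is not yet a professional-grade implementation, it is perfectly functional, and code like this could already help you to solve small problems quickly. In the next sections, we will see how to do this both *more concisely* (avoiding boilerplate code) and *more efficiently* (using our GPUs to their full potential).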
+
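+## Exercises
+
+1. What would happen if we were to initialize the weights to zero? Would the algorithm still work? What if we initialized the parameters with variance $1,000$ rather than $0.01$?
+1. Assume that you are [Georg Simon Ohm](https://en.wikipedia.org/wiki/Georg_Ohm) trying to come up with a model for resistors that relates voltage and current. Can you use automatic differentiation to learn the parameters of your model?
+1. Can you use [Planck's Law](https://en.wikipedia.org/wiki/Planck%27s_law) to determine the temperature of an object using spectral energy density? For reference, the spectral density $B$ of radiation emanating from a black body is $B(\lambda, T) = \frac{2 hc^2}{\lambda^5} \cdot \left(\exp \frac{h c}{\lambda k T} - 1\right)^{-1}$. Here $\lambda$ is the wavelength, $T$ is the temperature, $c$ is the speed of light, $h$ is Planck's constant, and $k$ is the Boltzmann constant. You measure the energy for different wavelengths $\lambda$ and you now need to fit the spectral density curve to Planck's law.
+1. What are the problems you might encounter if you wanted to compute the second derivatives of the loss? How would you fix them?
+1. Why is the `reshape` method needed in the `loss` function?
+1. Experiment using different learning rates to find out how quickly the loss function value drops. Can you reduce the error by increasing the number of epochs of training?
+1. If the number of examples cannot be divided by the batch size, what happens to `data_iter` at the end of an epoch?
+1. Try implementing a different loss function, such as the absolute value loss `(y_hat - d2l.reshape(y, y_hat.shape)).abs().sum()`.
+    1. Check what happens for regular data.
+    1. Check whether there is a difference in behavior if you actively perturb some entries of $\mathbf{y}$, such as $y_5 = 10,000$.
+    1. Can you think of a cheap solution for combining the best aspects of squared loss and absolute value loss? Hint: how can you avoid really large gradient values?
+1. Why do we need to reshuffle the dataset? Can you design a case where a maliciously constructed dataset would break the optimization algorithm otherwise?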
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/42)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/43)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/201)
+:end_tab:
diff --git a/chapter_linear-regression/linear-regression-scratch_origin.md b/chapter_linear-regression/linear-regression-scratch_origin.md
new file mode 100644
index 0000000..4770f5e
--- /dev/null
+++ b/chapter_linear-regression/linear-regression-scratch_origin.md
@@ -0,0 +1,467 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# Linear Regression Implementation from Scratch
+:label:`sec_linear_scratch`
+
+We're now ready to work through 
+a fully functioning implementation 
+of linear regression. 
+In this section, 
+(**we will implement the entire method from scratch,
+including (i) the model; (ii) the loss function;
+(iii) a minibatch stochastic gradient descent optimizer;
+and (iv) the training function 
+that stitches all of these pieces together.**)
+Finally, we will run our synthetic data generator
+from :numref:`sec_synthetic-regression-data`
+and apply our model
+on the resulting dataset. 
+While modern deep learning frameworks 
+can automate nearly all of this work,
+implementing things from scratch is the only way
+to make sure that you really know what you are doing.
+Moreover, when it comes time to customize models,
+defining our own layers or loss functions,
+understanding how things work under the hood will prove handy.
+In this section, we will rely only 
+on tensors and automatic differentiation.
+Later on, we will introduce a more concise implementation,
+taking advantage of bells and whistles of deep learning frameworks 
+while retaining the structure of what follows below.
+
+```{.python .input  n=2}
+%%tab mxnet
+%matplotlib inline
+from d2l import mxnet as d2l
+from mxnet import autograd, np, npx
+npx.set_np()
+```
+
+```{.python .input  n=3}
+%%tab pytorch
+%matplotlib inline
+from d2l import torch as d2l
+import torch
+```
+
+```{.python .input  n=4}
+%%tab tensorflow
+%matplotlib inline
+from d2l import tensorflow as d2l
+import tensorflow as tf
+```
+
+## Defining the Model
+
+[**Before we can begin optimizing our model's parameters**] by minibatch SGD,
+(**we need to have some parameters in the first place.**)
+In the following we initialize weights by drawing
+random numbers from a normal distribution with mean 0
+and a standard deviation of 0.01. 
+The magic number 0.01 often works well in practice, 
+but you can specify a different value 
+through the argument `sigma`.
+Moreover, we set the bias to 0.
+Note that for object-oriented design
+we add the code to the `__init__` method of a subclass of `d2l.Module` (introduced in :numref:`oo-design-models`).
+
+```{.python .input  n=5}
+%%tab all
+class LinearRegressionScratch(d2l.Module):  #@save
+    def __init__(self, num_inputs, lr, sigma=0.01):
+        super().__init__()
+        self.save_hyperparameters()
+        if tab.selected('mxnet'):
+            self.w = d2l.normal(0, sigma, (num_inputs, 1))
+            self.b = d2l.zeros(1)
+            self.w.attach_grad()
+            self.b.attach_grad()
+        if tab.selected('pytorch'):
+            self.w = d2l.normal(0, sigma, (num_inputs, 1), requires_grad=True)
+            self.b = d2l.zeros(1, requires_grad=True)
+        if tab.selected('tensorflow'):
+            w = tf.random.normal((num_inputs, 1), mean=0, stddev=sigma)
+            b = tf.zeros(1)
+            self.w = tf.Variable(w, trainable=True)
+            self.b = tf.Variable(b, trainable=True)
+```
+
+Next, we must [**define our model,
+relating its input and parameters to its output.**]
+For our linear model we simply take the matrix-vector product
+of the input features $\mathbf{X}$ 
+and the model weights $\mathbf{w}$,
+and add the offset $b$ to each example.
+$\mathbf{Xw}$ is a vector and $b$ is a scalar.
+Due to the broadcasting mechanism 
+(see :numref:`subsec_broadcasting`),
+when we add a vector and a scalar,
+the scalar is added to each component of the vector.
+The resulting `forward` function 
+is registered as a method in the `LinearRegressionScratch` class
+via `add_to_class` (introduced in :numref:`oo-design-utilities`).
+
+```{.python .input  n=6}
+%%tab all
+@d2l.add_to_class(LinearRegressionScratch)  #@save
+def forward(self, X):
+    """The linear regression model."""
+    return d2l.matmul(X, self.w) + self.b
+```
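
To see the broadcasting step in isolation, here is a minimal pure-Python sketch of the same computation. It is not the `d2l` implementation: the standalone function `forward` below, which takes `w` and `b` as explicit arguments, and the toy values are assumptions made for illustration.

```python
def forward(X, w, b):
    """Compute Xw + b for a list of feature rows X, a weight list w,
    and a scalar bias b (hypothetical standalone helper)."""
    # One dot product per example; the scalar b is added to every entry,
    # which is what broadcasting does for tensors.
    return [sum(x_j * w_j for x_j, w_j in zip(row, w)) + b for row in X]


X = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]  # three examples, two features
w = [2.0, -3.0]
b = 4.0
y_hat = forward(X, w, b)
print(y_hat)
```

Each entry of `y_hat` is the dot product of one row of `X` with `w`, shifted by the same scalar `b`.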
+
+## Defining the Loss Function
+
+Since [**updating our model requires taking
+the gradient of our loss function,**]
+we ought to (**define the loss function first.**)
+Here we use the squared loss function
+in :eqref:`eq_mse`.
+In the implementation, we need to reshape the true value `y`
+to the shape of the predicted value `y_hat`.
+The elementwise loss then has the same shape as `y_hat`,
+and we return its average
+over all examples in the minibatch.
+
+```{.python .input  n=7}
+%%tab all
+@d2l.add_to_class(LinearRegressionScratch)  #@save
+def loss(self, y_hat, y):
+    l = (y_hat - d2l.reshape(y, y_hat.shape)) ** 2 / 2
+    return d2l.reduce_mean(l)
+```
+
+## Defining the Optimization Algorithm
+
+As discussed in :numref:`sec_linear_regression`,
+linear regression has a closed-form solution.
+However, our goal here is to illustrate 
+how to train more general neural networks,
+and that requires that we teach you 
+how to use minibatch SGD.
+Hence we will take this opportunity
+to introduce your first working example of SGD.
+At each step, using a minibatch 
+randomly drawn from our dataset,
+we estimate the gradient of the loss
+with respect to the parameters.
+Next, we update the parameters
+in the direction that may reduce the loss.
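
To make this step concrete, here is a short worked derivation, assuming the averaged squared loss used in this section and writing $\hat{y}^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} + b$ (a shorthand introduced here only for brevity). For a minibatch $\mathcal{B}$,

$$L_{\mathcal{B}}(\mathbf{w}, b) = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2,$$

so that

$$\partial_{\mathbf{w}} L_{\mathcal{B}} = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\hat{y}^{(i)} - y^{(i)}\right), \qquad \partial_{b} L_{\mathcal{B}} = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\hat{y}^{(i)} - y^{(i)}\right).$$

In the implementation that follows, automatic differentiation computes these quantities for us, so we never need to write them out by hand.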
+
+The following code applies the update,
+given a set of parameters and a learning rate `lr`.
+Since our loss is computed as an average over the minibatch,
+we don't need to adjust the learning rate for the batch size.
+In later chapters we will investigate
+how learning rates should be adjusted
+for the very large minibatches that arise
+in distributed large-scale learning.
+For now, we can ignore this dependency.
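
The effect of averaging can be checked on a toy problem. The sketch below is illustrative only: the one-parameter model $\hat{y} = wx$, the helper name `grad_w`, and the data are assumptions, not part of the `d2l` code.

```python
def grad_w(w, xs, ys, reduce='mean'):
    """Gradient w.r.t. a scalar weight w of the loss 0.5 * (w * x - y)**2,
    aggregated over the minibatch by 'mean' or 'sum' (hypothetical helper)."""
    total = sum((w * x - y) * x for x, y in zip(xs, ys))
    return total / len(xs) if reduce == 'mean' else total


xs_small, ys_small = [1.0, 2.0], [2.0, 4.0]      # minibatch of 2
xs_large, ys_large = xs_small * 4, ys_small * 4  # same examples, minibatch of 8

for xs, ys in [(xs_small, ys_small), (xs_large, ys_large)]:
    print(len(xs), grad_w(0.0, xs, ys, 'mean'), grad_w(0.0, xs, ys, 'sum'))
```

With the mean, the gradient is the same for both batch sizes, so the same `lr` yields the same step. With the sum, the gradient grows in proportion to the batch size, and the learning rate would have to shrink accordingly.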
+
+ 
+
+
+:begin_tab:`mxnet`
+We define our `SGD` class, 
+a subclass of `d2l.HyperParameters` (introduced in :numref:`oo-design-utilities`),
+to have a similar API
+as the built-in SGD optimizer.
+We update the parameters in the `step` method.
+It accepts a `batch_size` argument that can be ignored.
+:end_tab:
+
+:begin_tab:`pytorch`
+We define our `SGD` class,
+a subclass of `d2l.HyperParameters` (introduced in :numref:`oo-design-utilities`),
+to have a similar API 
+as the built-in SGD optimizer.
+We update the parameters in the `step` method.
+The `zero_grad` method sets all gradients to 0,
+which must be run before a backpropagation step. 
+:end_tab:
+
+:begin_tab:`tensorflow`
+We define our `SGD` class,
+a subclass of `d2l.HyperParameters` (introduced in :numref:`oo-design-utilities`),
+to have a similar API
+as the built-in SGD optimizer.
+We update the parameters in the `apply_gradients` method.
+It accepts a list of gradient and parameter pairs.
+:end_tab:
+
+```{.python .input  n=8}
+%%tab mxnet, pytorch
+class SGD(d2l.HyperParameters):  #@save
+    def __init__(self, params, lr):
+        """Minibatch stochastic gradient descent."""
+        self.save_hyperparameters()
+
+    if tab.selected('mxnet'):
+        def step(self, _):
+            for param in self.params:
+                param -= self.lr * param.grad
+    
+    if tab.selected('pytorch'):
+        def step(self):
+            for param in self.params:
+                param -= self.lr * param.grad
+
+        def zero_grad(self):
+            for param in self.params:
+                if param.grad is not None:
+                    param.grad.zero_()
+```
+
+```{.python .input  n=9}
+%%tab tensorflow
+class SGD(d2l.HyperParameters):  #@save
+    def __init__(self, lr):
+        """Minibatch stochastic gradient descent."""
+        self.save_hyperparameters()
+    
+    def apply_gradients(self, grads_and_vars):
+        for grad, param in grads_and_vars:
+            param.assign_sub(self.lr * grad)        
+```
+
+We next define the `configure_optimizers` method, which returns an instance of the `SGD` class.
+
+```{.python .input  n=10}
+%%tab all
+@d2l.add_to_class(LinearRegressionScratch)  #@save
+def configure_optimizers(self):
+    if tab.selected('mxnet') or tab.selected('pytorch'):
+        return SGD([self.w, self.b], self.lr)
+    if tab.selected('tensorflow'):
+        return SGD(self.lr)
+```
+
+## Training
+
+Now that we have all of the parts in place
+(parameters, loss function, model, and optimizer),
+we are ready to [**implement the main training loop.**]
+It is crucial that you understand this code well
+since you will employ similar training loops
+for every other deep learning model
+covered in this book.
+In each *epoch*, we iterate through 
+the entire training dataset, 
+passing once through every example
+(assuming that the number of examples 
+is divisible by the batch size). 
+In each iteration, we grab a minibatch of training examples,
+and compute its loss through the model's `training_step` method. 
+Next, we compute the gradients with respect to each parameter. 
+Finally, we will call the optimization algorithm
+to update the model parameters. 
+In summary, we will execute the following loop:
+
+* Initialize parameters $(\mathbf{w}, b)$
+* Repeat until done
+    * Compute gradient $\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)$
+    * Update parameters $(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}$
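
The loop above can also be sketched without any framework. The following is a minimal pure-Python illustration, not the `d2l.Trainer` implementation: the function name `sgd_linear_regression`, its default arguments (including `epochs=10`), and the synthetic data are assumptions made for this example, and the gradients are written out by hand instead of being obtained by automatic differentiation.

```python
import random


def sgd_linear_regression(features, labels, lr=0.03, batch_size=32,
                          epochs=10, seed=0):
    """Fit y = w.x + b by minibatch SGD on the averaged squared loss."""
    rng = random.Random(seed)
    num_inputs = len(features[0])
    # Initialize parameters: small random weights, zero bias
    w = [rng.gauss(0, 0.01) for _ in range(num_inputs)]
    b = 0.0
    indices = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(indices)  # reshuffle the examples once per epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = [0.0] * num_inputs
            grad_b = 0.0
            # Compute the gradient of the minibatch-averaged loss
            for i in batch:
                x = features[i]
                err = sum(wj * xj for wj, xj in zip(w, x)) + b - labels[i]
                for j in range(num_inputs):
                    grad_w[j] += err * x[j] / len(batch)
                grad_b += err / len(batch)
            # Update: (w, b) <- (w, b) - lr * g
            w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
            b -= lr * grad_b
    return w, b


# Hypothetical synthetic data using the ground truth from this section
data_rng = random.Random(1)
true_w, true_b = [2.0, -3.4], 4.2
features = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(1000)]
labels = [sum(wj * xj for wj, xj in zip(true_w, x)) + true_b
          + data_rng.gauss(0, 0.01) for x in features]

w, b = sgd_linear_regression(features, labels)
print('estimated w:', w, 'estimated b:', b)
```

Dividing each contribution by `len(batch)` keeps the update correct even when the last minibatch of an epoch is smaller than `batch_size`.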
+ 
+Recall that the synthetic regression dataset 
+that we generated in :numref:`sec_synthetic-regression-data` 
+does not provide a validation dataset. 
+In most cases, however, 
+we will use a validation dataset 
+to measure our model quality. 
+Here we pass through the validation dataloader 
+once in each epoch to measure the model performance.
+Following our object-oriented design,
+the `prepare_batch` and `fit_epoch` functions
+are registered as methods of the `d2l.Trainer` class
+(introduced in :numref:`oo-design-training`).
+
+```{.python .input  n=11}
+%%tab all    
+@d2l.add_to_class(d2l.Trainer)  #@save
+def prepare_batch(self, batch):
+    return batch
+```
+
+```{.python .input  n=12}
+%%tab pytorch
+@d2l.add_to_class(d2l.Trainer)  #@save
+def fit_epoch(self):
+    self.model.train()        
+    for batch in self.train_dataloader:        
+        loss = self.model.training_step(self.prepare_batch(batch))
+        self.optim.zero_grad()
+        with torch.no_grad():
+            loss.backward()
+            if self.gradient_clip_val > 0:  # To be discussed later
+                self.clip_gradients(self.gradient_clip_val, self.model)
+            self.optim.step()
+        self.train_batch_idx += 1
+    if self.val_dataloader is None:
+        return
+    self.model.eval()
+    for batch in self.val_dataloader:
+        with torch.no_grad():            
+            self.model.validation_step(self.prepare_batch(batch))
+        self.val_batch_idx += 1
+```
+
+```{.python .input  n=13}
+%%tab mxnet
+@d2l.add_to_class(d2l.Trainer)  #@save
+def fit_epoch(self):
+    for batch in self.train_dataloader:
+        with autograd.record():
+            loss = self.model.training_step(self.prepare_batch(batch))
+        loss.backward()
+        if self.gradient_clip_val > 0:
+            self.clip_gradients(self.gradient_clip_val, self.model)
+        self.optim.step(1)
+        self.train_batch_idx += 1
+    if self.val_dataloader is None:
+        return
+    for batch in self.val_dataloader:        
+        self.model.validation_step(self.prepare_batch(batch))
+        self.val_batch_idx += 1
+```
+
+```{.python .input  n=14}
+%%tab tensorflow
+@d2l.add_to_class(d2l.Trainer)  #@save
+def fit_epoch(self):
+    self.model.training = True
+    for batch in self.train_dataloader:            
+        with tf.GradientTape() as tape:
+            loss = self.model.training_step(self.prepare_batch(batch))
+        grads = tape.gradient(loss, self.model.trainable_variables)
+        if self.gradient_clip_val > 0:
+            grads = self.clip_gradients(self.gradient_clip_val, grads)
+        self.optim.apply_gradients(zip(grads, self.model.trainable_variables))
+        self.train_batch_idx += 1
+    if self.val_dataloader is None:
+        return
+    self.model.training = False
+    for batch in self.val_dataloader:        
+        self.model.validation_step(self.prepare_batch(batch))
+        self.val_batch_idx += 1
+```
+
+We are almost ready to train the model,
+but first we need some data to train on.
+Here we use the `SyntheticRegressionData` class 
+and pass in some ground-truth parameters.
+Then, we train our model with 
+the learning rate `lr=0.03` 
+and set `max_epochs=3`. 
+Note that both the number of epochs 
+and the learning rate are hyperparameters.
+In general, setting hyperparameters is tricky,
+and we will usually want to use a 3-way split:
+one set for training, 
+a second for hyperparameter selection,
+and a third reserved for the final evaluation.
+We elide these details for now but will revisit them
+later.
+
+```{.python .input  n=15}
+%%tab all
+model = LinearRegressionScratch(2, lr=0.03)
+data = d2l.SyntheticRegressionData(w=d2l.tensor([2, -3.4]), b=4.2)
+trainer = d2l.Trainer(max_epochs=3)
+trainer.fit(model, data)
+```
+
+Because we synthesized the dataset ourselves,
+we know precisely what the true parameters are.
+Thus, we can [**evaluate our success in training
+by comparing the true parameters
+with those that we learned**] through our training loop.
+Indeed they turn out to be very close to each other.
+
+```{.python .input  n=16}
+%%tab all
+print(f'error in estimating w: {data.w - d2l.reshape(model.w, data.w.shape)}')
+print(f'error in estimating b: {data.b - model.b}')
+```
+
+We should not take the ability to exactly recover 
+the ground-truth parameters for granted.
+In general, for deep models unique solutions
+for the parameters do not exist,
+and even for linear models,
+exactly recovering the parameters
+is only possible when no feature 
+is linearly dependent on the others.
+However, in machine learning, 
+we are often less concerned
+with recovering true underlying parameters,
+and more concerned with parameters 
+that lead to highly accurate prediction :cite:`Vapnik.1992`.
+Fortunately, even on difficult optimization problems,
+stochastic gradient descent can often find remarkably good solutions,
+owing partly to the fact that, for deep networks,
+there exist many configurations of the parameters
+that lead to highly accurate prediction.
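
The remark about linearly dependent features can be illustrated with a tiny example. This is a sketch under assumed toy values: the helper `predict` and the data are hypothetical and not part of `d2l`.

```python
def predict(X, w, b):
    """Linear model predictions for rows X, weights w, bias b (hypothetical helper)."""
    return [sum(x_j * w_j for x_j, w_j in zip(row, w)) + b for row in X]


# The second feature is an exact copy of the first, so the features
# are linearly dependent.
X = [[1.0, 1.0], [2.0, 2.0], [-0.5, -0.5]]

w_first, w_second = [3.0, 0.0], [1.0, 2.0]  # two different weight vectors
print(predict(X, w_first, 0.5))
print(predict(X, w_second, 0.5))
```

Both weight vectors produce identical predictions on every example, so no amount of data of this form can tell them apart, even though either one predicts perfectly well.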
+
+
+## Summary
+
+In this section, we took a significant step 
+towards designing deep learning systems 
+by implementing a fully functional 
+neural network model and training loop.
+In this process, we built a data loader, 
+a model, a loss function, an optimization procedure,
+and a visualization and monitoring tool. 
+We did this by composing a Python object 
+that contains all relevant components for training a model. 
+While this is not yet a professional-grade implementation,
+it is perfectly functional, and code like this 
+could already help you to solve small problems quickly.
+In the next sections, we will see how to do this
+both *more concisely* (avoiding boilerplate code)
+and *more efficiently* (using our GPUs to their full potential).
+
+
+
+## Exercises
+
+1. What would happen if we were to initialize the weights to zero? Would the algorithm still work? What if we
+   initialized the parameters with variance $1,000$ rather than $0.01$?
+1. Assume that you are [Georg Simon Ohm](https://en.wikipedia.org/wiki/Georg_Ohm) trying to come up
+   with a model for resistors that relates voltage and current. Can you use automatic
+   differentiation to learn the parameters of your model?
+1. Can you use [Planck's Law](https://en.wikipedia.org/wiki/Planck%27s_law) to determine the temperature of an object
+   using spectral energy density? For reference, the spectral density $B$ of radiation emanating from a black body is
+   $B(\lambda, T) = \frac{2 hc^2}{\lambda^5} \cdot \left(\exp \frac{h c}{\lambda k T} - 1\right)^{-1}$. Here
+   $\lambda$ is the wavelength, $T$ is the temperature, $c$ is the speed of light, $h$ is Planck's constant, and $k$ is the
+   Boltzmann constant. You measure the energy for different wavelengths $\lambda$ and you now need to fit the spectral
+   density curve to Planck's law.
+1. What are the problems you might encounter if you wanted to compute the second derivatives of the loss? How would
+   you fix them?
+1. Why is the `reshape` method needed in the `loss` function?
+1. Experiment using different learning rates to find out how quickly the loss function value drops. Can you reduce the
+   error by increasing the number of epochs of training?
+1. If the number of examples cannot be divided by the batch size, what happens to `data_iter` at the end of an epoch?
+1. Try implementing a different loss function, such as the absolute value loss `(y_hat - d2l.reshape(y, y_hat.shape)).abs().sum()`.
+    1. Check what happens for regular data.
+    1. Check whether there is a difference in behavior if you actively perturb some entries of $\mathbf{y}$,
+       such as $y_5 = 10000$.
+    1. Can you think of a cheap solution for combining the best aspects of squared loss and absolute value loss?
+       Hint: how can you avoid really large gradient values?
+1. Why do we need to reshuffle the dataset? Can you design a case where a maliciously constructed dataset would break the
+   optimization algorithm otherwise?
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/42)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/43)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/201)
+:end_tab:
diff --git a/chapter_linear-regression/linear-regression.md b/chapter_linear-regression/linear-regression.md
new file mode 100644
index 0000000..10bc24d
--- /dev/null
+++ b/chapter_linear-regression/linear-regression.md
@@ -0,0 +1,327 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
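+# Linear Regression
+:label:`sec_linear_regression`
+
+*Regression* problems pop up whenever we want to predict a numerical value.
+Common examples include predicting prices (of homes, stocks, etc.), predicting the length of stay (for patients in the hospital), and forecasting demand (for retail sales), among countless others. Not every prediction problem is a classic regression problem. Later on, we will introduce classification problems, where the goal is to predict membership among a set of categories.
+
+As a running example, suppose that we wish to estimate the prices of houses (in dollars) based on their area (in square feet) and age (in years). To develop a model for predicting house prices, we need to get our hands on data consisting of sales, including the sales price, area, and age for each home. In the terminology of machine learning, the dataset is called a *training dataset* or *training set*, and each row (containing the data corresponding to one sale) is called an *example* (or *data point*, *instance*, *sample*). The thing we are trying to predict (price) is called a *label* (or *target*). The variables (age and area) upon which the predictions are based are called *features* (or *covariates*).
+
+## Basics
+
+*Linear regression* may be both the simplest
+and most popular among the standard tools for tackling regression problems. Dating back to the dawn of the 19th century :cite:`Legendre.1805,Gauss.1809`, linear regression flows from a few simple assumptions. First, we assume that the relationship between features $\mathbf{x}$ and target $y$ is approximately linear, i.e., that the conditional mean $E[Y \mid X=\mathbf{x}]$ can be expressed as a weighted sum of the features $\mathbf{x}$. This setup allows that the target value may still deviate from its expected value on account of observation noise. Next, we can impose the assumption that any such noise is well behaved, following a Gaussian distribution. Typically, we will use $n$ to denote the number of examples in our dataset. We use superscripts to enumerate samples and targets, and subscripts to index coordinates. More concretely, $\mathbf{x}^{(i)}$ denotes the $i$-th sample and $x_j^{(i)}$ denotes its $j$-th coordinate.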
+
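+### Model
+:label:`subsec_linear_model`
+
+At the heart of every solution is a model that describes how features can be transformed into an estimate of the target. The assumption of linearity means that the expected value of the target (price) can be expressed as a weighted sum of the features (area and age):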
+
+$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b.$$
+:eqlabel:`eq_price-area`
+
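+Here $w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called *weights*, and $b$ is called a *bias* (or *offset* or *intercept*). The weights determine the influence of each feature on our prediction. The bias determines the value of the estimate when all features are zero. Even though we will never see any newly built homes with precisely zero area, we still need the bias because it allows us to express all linear functions of our features (rather than restricting us to lines that pass through the origin). Strictly speaking, :eqref:`eq_price-area` is an *affine transformation* of input features, which is characterized by a *linear transformation* of features via a weighted sum, combined with a *translation* via the added bias. Given a dataset, our goal is to choose the weights $\mathbf{w}$ and the bias $b$ that, on average, make our model's predictions fit the true prices observed in the data as closely as possible.
+
+In disciplines where it is common to focus on datasets with just a few features, explicitly expressing models long-form, as in :eqref:`eq_price-area`, is common. In machine learning, we usually work with high-dimensional datasets, where it is more convenient to employ compact linear algebra notation. When our inputs consist of $d$ features, we can assign each an index (between $1$ and $d$) and express our prediction $\hat{y}$ (in general the "hat" symbol denotes an estimate) as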
+
+$$\hat{y} = w_1  x_1 + ... + w_d  x_d + b.$$
+
+Collecting all features into a vector $\mathbf{x} \in \mathbb{R}^d$ and all weights into a vector $\mathbf{w} \in \mathbb{R}^d$, we can express our model compactly via the dot product between $\mathbf{w}$ and $\mathbf{x}$:
+
+$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b.$$
+:eqlabel:`eq_linreg-y`
+
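+In :eqref:`eq_linreg-y`, the vector $\mathbf{x}$ corresponds to the features of a single example. We will often find it convenient to refer to features of our entire dataset of $n$ examples via the *design matrix* $\mathbf{X} \in \mathbb{R}^{n \times d}$. Here, $\mathbf{X}$ contains one row for every example and one column for every feature. For a collection of features $\mathbf{X}$, the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed via the matrix-vector product: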
+
+$${\hat{\mathbf{y}}} = \mathbf{X} \mathbf{w} + b,$$
+
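+where broadcasting (:numref:`subsec_broadcasting`) is applied during the summation. Given features of a training dataset $\mathbf{X}$ and corresponding (known) labels $\mathbf{y}$, the goal of linear regression is to find the weight vector $\mathbf{w}$ and the bias term $b$ such that, given the features of a new data example sampled from the same distribution as $\mathbf{X}$, the new example's label will (in expectation) be predicted with the lowest error.
+
+Even if we believe that the best model for predicting $y$ given $\mathbf{x}$ is linear, we would not expect to find a real-world dataset of $n$ examples where $y^{(i)}$ exactly equals $\mathbf{w}^\top \mathbf{x}^{(i)}+b$ for all $1 \leq i \leq n$. For example, whatever instruments we use to observe the features $\mathbf{X}$ and labels $\mathbf{y}$ might suffer from a small amount of measurement error. Thus, even when we are confident that the underlying relationship is linear, we will incorporate a noise term to account for such errors.
+
+Before we can go about searching for the best *parameters* (or *model parameters*) $\mathbf{w}$ and $b$, we will need two more things: (i) a quality measure for some given model; and (ii) a procedure for updating the model to improve its quality.
+
+### Loss Function
+:label:`subsec_linear-regression-loss-function`
+
+Naturally, fitting our model to the data requires that we agree on some measure of *fitness* (or, equivalently, of *unfitness*).
+*Loss functions* quantify the distance
+between the *real* and *predicted* values of the target. The loss will usually be a non-negative number where smaller values are better and perfect predictions incur a loss of 0. For regression problems, the most common loss function is squared error. When our prediction for an example $i$ is $\hat{y}^{(i)}$ and the corresponding true label is $y^{(i)}$, the *squared error* is given by: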
+
+$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2.$$
+:eqlabel:`eq_mse`
+
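+The constant $\frac{1}{2}$ makes no real difference but proves to be notationally convenient, since it cancels out when we take the derivative of the loss. Because the training dataset is given to us, and thus out of our control, the empirical error is only a function of the model parameters. Below, we visualize the fit of a linear regression model in a problem with one-dimensional inputs (:numref:`fig_fit_linreg`).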
+
+![Fitting a linear regression model to one-dimensional data.](../img/fit-linreg.svg)
+:label:`fig_fit_linreg`
+
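+Note that large differences between estimates $\hat{y}^{(i)}$ and targets $y^{(i)}$ lead to even larger contributions to the loss, due to the quadratic form of the loss (this can be a double-edged sword: while it encourages the model to avoid large errors, it can also lead to excessive sensitivity to anomalous data). To measure the quality of a model on the entire dataset of $n$ examples, we simply average (or equivalently, sum) the losses on the training set: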
+
+$$L(\mathbf{w}, b) =\frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) =\frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$
+
+When training the model, we want to find parameters ($\mathbf{w}^*, b^*$) that minimize the total loss across all training examples:
+
+$$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\  L(\mathbf{w}, b).$$
+
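+### Analytic Solution
+
+Unlike most of the models that we will cover, linear regression presents us with a surprisingly easy optimization problem. In particular, we can find the optimal parameters (as assessed on the training data) analytically by applying a simple formula as follows. First, we can subsume the bias $b$ into the parameter $\mathbf{w}$ by appending a column consisting of all ones to the design matrix. Then our prediction problem is to minimize $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$. So long as the design matrix $\mathbf{X}$ has full rank (no feature is linearly dependent on the others), there will be just one critical point on the loss surface, and it corresponds to the minimum of the loss over the entire domain. Taking the derivative of the loss with respect to $\mathbf{w}$ and setting it equal to zero yields: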
+
+$$\begin{aligned}
+    \partial_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 =
+    2 \mathbf{X}^\top (\mathbf{X} \mathbf{w} - \mathbf{y}) = 0
+    \text{ and hence }
+    \mathbf{X}^\top \mathbf{y} = \mathbf{X}^\top \mathbf{X} \mathbf{w}.
+\end{aligned}$$
+
+Solving for $\mathbf{w}$ provides us with the optimal solution for the optimization problem. Note that this solution
+
+$$\mathbf{w}^* = (\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf{y}$$
+
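+will only be unique when the matrix $\mathbf X^\top \mathbf X$ is invertible, i.e., when the columns of the design matrix are linearly independent :cite:`Golub.Van-Loan.1996`.
+
+While simple problems like linear regression may admit analytic solutions, you should not get used to such good fortune. Although analytic solutions allow for nice mathematical analysis, the requirement of an analytic solution is so restrictive that it would exclude almost all exciting aspects of deep learning.
+
+### Minibatch Stochastic Gradient Descent
+
+Fortunately, even in cases where we cannot solve the models analytically, we can still often train models effectively in practice. Moreover, for many tasks, those difficult-to-optimize models turn out to be so much better that figuring out how to train them ends up being well worth the trouble.
+
+The key technique for optimizing nearly any deep learning model, and which we will call upon throughout this book, consists of iteratively reducing the error by updating the parameters in the direction that incrementally lowers the loss function. This algorithm is called *gradient descent*.
+
+The most naive application of gradient descent consists of taking the derivative of the loss function, which is an average of the losses computed on every single example in the dataset. In practice, this can be extremely slow: we must pass over the entire dataset before making a single update, even if the update steps might be very powerful :cite:`Liu.Nocedal.1989`. Even worse, if there is a lot of redundancy in the training data, the benefit of a full update is even lower.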
+
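+The other extreme is to consider only a single example at a time and to take update steps based on one observation at a time. The resulting algorithm, *stochastic gradient descent* (SGD), can be an effective strategy :cite:`Bottou.2010`, even for large datasets. Unfortunately, SGD has drawbacks, both computational and statistical. One problem arises from the fact that processors are a lot faster at multiplying and adding numbers than they are at moving data from main memory to processor cache. It is up to an order of magnitude more efficient to perform a matrix-vector multiplication than a corresponding number of vector-vector operations. This means that it can take a lot longer to process one sample at a time compared to a full batch. A second problem is that some of the layers, such as batch normalization (to be described in :numref:`sec_batch_norm`), only work well when we have access to more than one observation at a time.
+
+The solution to both problems is to pick an intermediate strategy: rather than taking a full batch or only a single sample at a time, we take a *minibatch* of observations :cite:`Li.Zhang.Chen.ea.2014`. The specific choice of the size of the minibatch depends on many factors, such as the amount of memory, the number of accelerators, the choice of layers, and the total dataset size. Despite all of that, a number between 32 and 256, preferably a multiple of a large power of $2$, is a good start. This leads us to *minibatch stochastic gradient descent*.
+
+In its most basic form, in each iteration $t$, we first randomly sample a minibatch $\mathcal{B}_t$ consisting of a fixed number $|\mathcal{B}|$ of training examples. We then compute the derivative (gradient) of the average loss on the minibatch with respect to the model parameters. Finally, we multiply the gradient by a predetermined small positive value $\eta$, called the *learning rate*, and subtract the resulting term from the current parameter values. We can express the update as follows: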
+
+$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b).$$
+
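+In summary, minibatch SGD proceeds as follows: (i) initialize the values of the model parameters, typically at random; (ii) iteratively sample random minibatches from the data, updating the parameters in the direction of the negative gradient. For quadratic losses and affine transformations, this has a closed-form expansion: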
+
+$$\begin{aligned} \mathbf{w} & \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) && = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)\\ b &\leftarrow b -  \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \partial_b l^{(i)}(\mathbf{w}, b) &&  = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned}$$
+:eqlabel:`eq_linreg_batch_update`
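+
+To make the update concrete, here is one step worked by hand on hypothetical values chosen only for illustration. Take a minibatch of two examples, $\mathbf{x}^{(1)} = (1, 2)^\top$ with $y^{(1)} = 3$ and $\mathbf{x}^{(2)} = (2, 0)^\top$ with $y^{(2)} = 1$, current parameters $\mathbf{w} = (0, 0)^\top$ and $b = 0$, and learning rate $\eta = 0.1$. The residuals $\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}$ are $-3$ and $-1$, so :eqref:`eq_linreg_batch_update` gives
+
+$$\begin{aligned} \mathbf{w} &\leftarrow \begin{bmatrix} 0 \\ 0 \end{bmatrix} - \frac{0.1}{2} \left( -3 \begin{bmatrix} 1 \\ 2 \end{bmatrix} - 1 \begin{bmatrix} 2 \\ 0 \end{bmatrix} \right) = \begin{bmatrix} 0.25 \\ 0.3 \end{bmatrix}, \\ b &\leftarrow 0 - \frac{0.1}{2} \left( -3 - 1 \right) = 0.2. \end{aligned}$$
+
+The average loss on this minibatch drops from $\frac{1}{2}\left(\frac{1}{2} \cdot 3^2 + \frac{1}{2} \cdot 1^2\right) = 2.5$ to roughly $0.97$ after the step.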
+
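+Since we pick a minibatch $\mathcal{B}$, we need to normalize by its size $|\mathcal{B}|$. Frequently, minibatch size and learning rate are user-defined. Such tunable parameters that are not updated in the training loop are called *hyperparameters*. They can be tuned automatically by a number of techniques, such as Bayesian optimization :cite:`Frazier.2018`. In the end, the quality of the solution is typically assessed on a separate *validation dataset* (or *validation set*).
+
+After training for some predetermined number of iterations (or until some other stopping criterion is met), we record the estimated model parameters, denoted $\hat{\mathbf{w}}, \hat{b}$. Note that even if our function is truly linear and noiseless, these parameters will not be the exact minimizers of the loss, nor even deterministic. Although the algorithm converges slowly towards the minimizers, it typically cannot reach them exactly in a finite number of steps. Moreover, the minibatches $\mathcal{B}$ used to update the parameters are chosen at random, which breaks determinism.
+
+Linear regression happens to be a learning problem with a global minimum (whenever $\mathbf{X}$ is full rank, or equivalently, whenever $\mathbf{X}^\top \mathbf{X}$ is invertible). However, the loss surfaces for deep networks contain many saddle points and minima. Fortunately, we typically do not care about finding an exact set of parameters but merely any set of parameters that leads to accurate predictions (and thus low loss). In practice, deep learning practitioners seldom struggle to find parameters that minimize the loss *on training sets* :cite:`Izmailov.Podoprikhin.Garipov.ea.2018,Frankle.Carbin.2018`. The more formidable task is to find parameters that lead to accurate predictions on previously unseen data, a challenge called *generalization*. We return to these topics throughout the book.
+
+### Predictions
+
+Given the model $\hat{\mathbf{w}}^\top \mathbf{x} + \hat{b}$, we can now make *predictions* for a new example, e.g., to predict the sales price of a previously unseen house given its area $x_1$ and age $x_2$. Deep learning practitioners have taken to calling the prediction phase *inference*, but this is a bit of a misnomer: *inference* refers broadly to any conclusion reached on the basis of evidence, including both the values of the parameters and the likely label for an unseen instance. If anything, in the statistics literature
+*inference* more often denotes parameter inference,
+and this overloading of terminology creates unnecessary confusion when deep learning practitioners talk to statisticians. In the following, we will stick to *prediction* whenever possible.
+
+## Vectorization for Speed
+
+When training our models, we typically want to process whole minibatches of examples simultaneously. Doing this efficiently requires that (**we**) (~~should~~) (**vectorize the calculations and leverage fast linear algebra libraries rather than writing costly for-loops in Python.**)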
+
+```{.python .input  n=1}
+%%tab mxnet
+%matplotlib inline
+from d2l import mxnet as d2l
+import math
+from mxnet import np
+import time
+```
+
+```{.python .input  n=1}
+%%tab pytorch
+%matplotlib inline
+from d2l import torch as d2l
+import math
+import torch
+import numpy as np
+import time
+```
+
+```{.python .input}
+%%tab tensorflow
+%matplotlib inline
+from d2l import tensorflow as d2l
+import math
+import tensorflow as tf
+import numpy as np
+import time
+```
+
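+To illustrate why this matters so much, we can (**consider two methods for adding vectors.**) To start, we instantiate two 10,000-dimensional vectors containing all ones. In one method, we loop over the vectors with a Python for-loop. In the other method, we rely on a single call to `+`.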
+
+```{.python .input  n=2}
+%%tab all
+n = 10000
+a = d2l.ones(n)
+b = d2l.ones(n)
+```
+
+Now we can benchmark the workloads. First, [**we add them, one coordinate at a time, using a for-loop.**]
+
+```{.python .input  n=3}
+%%tab mxnet, pytorch
+c = d2l.zeros(n)
+t = time.time()
+for i in range(n):
+    c[i] = a[i] + b[i]
+f'{time.time() - t:.5f} sec'
+```
+
+```{.python .input}
+%%tab tensorflow
+c = tf.Variable(d2l.zeros(n))
+t = time.time()
+for i in range(n):
+    c[i].assign(a[i] + b[i])
+f'{time.time() - t:.5f} sec'
+```
+
+(**Alternatively, we rely on the overloaded `+` operator to compute the elementwise sum.**)
+
+```{.python .input  n=4}
+%%tab all
+t = time.time()
+d = a + b
+f'{time.time() - t:.5f} sec'
+```
+
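+The second method is dramatically faster than the first. Vectorizing code often yields order-of-magnitude speedups. Moreover, we push more of the mathematics to the library without the need to write as many calculations ourselves, reducing the potential for errors and increasing portability of the code.
+
+## The Normal Distribution and Squared Loss
+:label:`subsec_normal_distribution_and_squared_loss`
+
+So far we have given a fairly functional motivation of the squared loss objective: the optimal parameters return the conditional expectation $E[Y\mid X]$ whenever the underlying pattern is truly linear, and the loss assigns outsize penalties for outliers. We can also provide a more formal motivation for the squared loss objective by making probabilistic assumptions about the distribution of noise.
+
+Linear regression was invented at the turn of the 19th century. While it has long been debated whether Gauss or Legendre first thought up the idea, it was Gauss who also discovered the normal distribution (also called the *Gaussian*). It turns out that the normal distribution and linear regression with squared loss share a deeper connection than common parentage.
+
+To begin, recall that a normal distribution with mean $\mu$ and variance $\sigma^2$ (standard deviation $\sigma$) is given as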
+
+$$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (x - \mu)^2\right).$$
+
+Below [**we define a function to compute the normal distribution**].
+
+```{.python .input  n=3}
+%%tab all
+def normal(x, mu, sigma):
+    p = 1 / math.sqrt(2 * math.pi * sigma**2)
+    return p * np.exp(-0.5 * (x - mu)**2 / sigma**2)
+```
+
+We can now (**visualize the normal distributions**).
+
+```{.python .input  n=8}
+%%tab mxnet
+# Use numpy again for visualization
+x = np.arange(-7, 7, 0.01)
+
+# Mean and standard deviation pairs
+params = [(0, 1), (0, 2), (3, 1)]
+d2l.plot(x.asnumpy(), [normal(x, mu, sigma).asnumpy() for mu, sigma in params], xlabel='x',
+         ylabel='p(x)', figsize=(4.5, 2.5),
+         legend=[f'mean {mu}, std {sigma}' for mu, sigma in params])
+```
+
+```{.python .input  n=8}
+%%tab pytorch, tensorflow
+# Use numpy again for visualization
+x = np.arange(-7, 7, 0.01)
+
+# Mean and standard deviation pairs
+params = [(0, 1), (0, 2), (3, 1)]
+d2l.plot(x, [normal(x, mu, sigma) for mu, sigma in params], xlabel='x',
+         ylabel='p(x)', figsize=(4.5, 2.5),
+         legend=[f'mean {mu}, std {sigma}' for mu, sigma in params])
+```
+
+Note that changing the mean corresponds to a shift along the $x$-axis, and increasing the variance spreads the distribution out, lowering its peak.
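+
+The peak heights in the plot can be checked by hand: at $x = \mu$ the exponential factor equals $1$, so the density reaches its maximum value
+
+$$p(\mu) = \frac{1}{\sqrt{2 \pi \sigma^2}} \approx \frac{0.399}{\sigma}.$$
+
+For the three parameter pairs plotted above, this gives peaks of about $0.399$ for $(\mu, \sigma) = (0, 1)$ and $(3, 1)$, located at $x = 0$ and $x = 3$ respectively, and about $0.199$ for $(0, 2)$.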
+
+One way to motivate linear regression with squared loss is to assume that observations arise from noisy measurements, where the noise is normally distributed as follows:
+
+$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon \text{ where } \epsilon \sim \mathcal{N}(0, \sigma^2).$$
+
+Thus, we can now write out the *likelihood* of seeing a particular $y$ for a given $\mathbf{x}$ via
+
+$$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (y - \mathbf{w}^\top \mathbf{x} - b)^2\right).$$
+
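+As such, the likelihood factorizes. According to *the principle of maximum likelihood*, the best values of parameters $\mathbf{w}$ and $b$ are those that maximize the *likelihood* of the entire dataset: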
+
+$$P(\mathbf y \mid \mathbf X) = \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}).$$
+
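+The equality follows since all pairs $(\mathbf{x}^{(i)}, y^{(i)})$ were drawn independently of each other. Estimators chosen according to the principle of maximum likelihood are called *maximum likelihood estimators*. While maximizing the product of many exponential functions might look difficult, we can simplify things significantly, without changing the objective, by maximizing the logarithm of the likelihood instead. For historical reasons, optimizations are more often expressed as minimization rather than maximization. So, without changing anything, we can *minimize* the *negative log-likelihood*, which we can express as follows: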
+
+$$-\log P(\mathbf y \mid \mathbf X) = \sum_{i=1}^n \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2.$$
+
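+If we assume that $\sigma$ is fixed, we can ignore the first term, because it does not depend on $\mathbf{w}$ or $b$. The second term is identical to the squared error loss introduced earlier, except for the multiplicative constant $\frac{1}{\sigma^2}$. Fortunately, the solution does not depend on $\sigma$ either. It follows that minimizing the mean squared error is equivalent to maximum likelihood estimation of a linear model under the assumption of additive Gaussian noise.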
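+
+To spell out this step (a short derivation that uses only the definitions already given), note that the first term does not depend on the index $i$ and that the second term is the per-example squared error scaled by $\frac{1}{\sigma^2}$. With $L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2$, the negative log-likelihood becomes
+
+$$-\log P(\mathbf y \mid \mathbf X) = \frac{n}{2} \log(2 \pi \sigma^2) + \frac{n}{\sigma^2} L(\mathbf{w}, b).$$
+
+Because $\frac{n}{\sigma^2} > 0$ and the first term is constant in $\mathbf{w}$ and $b$, both objectives share the same minimizers: $\operatorname*{argmin}_{\mathbf{w}, b} \left(-\log P(\mathbf y \mid \mathbf X)\right) = \operatorname*{argmin}_{\mathbf{w}, b} L(\mathbf{w}, b)$ for any fixed $\sigma > 0$.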
+
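+## Linear Regression as a Neural Network
+
+While linear models are not sufficiently rich to express the many complicated neural networks that we will introduce in this book, neural networks are rich enough to subsume linear models as networks in which every feature is represented by an input neuron, all of which are connected directly to the output.
+
+:numref:`fig_single_neuron` depicts linear regression as a neural network. The diagram highlights the connectivity pattern, such as how each input is connected to the output, but not the specific values taken by the weights or biases.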
+
+![Linear regression is a single-layer neural network.](../img/singleneuron.svg)
+:label:`fig_single_neuron`
+
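+The inputs are $x_1, \ldots, x_d$. We refer to $d$ as the *number of inputs* or *feature dimensionality* in the input layer. The output of the network is $o_1$. Because we are just trying to predict a single numerical value, we have only one output neuron. Note that the input values are all *given*. There is just a single *computed* neuron. In summary, we can think of linear regression as a single-layer fully connected neural network. We will encounter networks with far more layers in future chapters.
+
+### Biology
+
+Because linear regression predates computational neuroscience, it might seem anachronistic to describe linear regression in terms of neural networks. Nonetheless, linear models were a natural place to begin when the cyberneticists and neurophysiologists Warren McCulloch and Walter Pitts began to develop models of artificial neurons. Consider the cartoonish picture of a biological neuron in :numref:`fig_Neuron`, consisting of *dendrites* (input terminals), the *nucleus* (CPU), the *axon* (output wire), and the *axon terminals* (output terminals), enabling connections to other neurons via *synapses*.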
+
+![The real neuron.](../img/neuron.svg)
+:label:`fig_Neuron`
+
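+Information $x_i$ arriving from other neurons (or environmental sensors) is received in the dendrites. In particular, that information is weighted by *synaptic weights* $w_i$, determining the effect of the inputs, e.g., activation or inhibition via the product $x_i w_i$. The weighted inputs arriving from multiple sources are aggregated in the nucleus as a weighted sum $y = \sum_i x_i w_i + b$, possibly subject to some nonlinear postprocessing via $\sigma(y)$. This information is then sent via the axon to the axon terminals, where it reaches its destination (e.g., an actuator such as a muscle) or is fed into another neuron via its dendrites.
+
+Certainly, the high-level idea that many such units could be combined with the right connectivity and the right learning algorithm to produce far more interesting and complex behavior than any one neuron alone could express owes to our study of real biological neural systems. At the same time, most research in deep learning today draws inspiration from a much wider source. We invoke Stuart Russell and Peter Norvig :cite:`Russell.Norvig.2016`, who pointed out that although airplanes might have been *inspired* by birds, ornithology has not been the primary driver of aeronautics innovation for some centuries. Likewise, inspiration in deep learning these days comes in equal or greater measure from mathematics, linguistics, psychology, statistics, computer science, and many other fields.
+
+## Summary
+
+In this section, we introduced traditional linear regression, where the parameters of a linear function are chosen to minimize squared loss on the training set. We also motivated this choice of objective both via some practical considerations and through an interpretation of linear regression as maximum likelihood estimation under an assumption of linearity and Gaussian noise. After discussing both computational considerations and connections to statistics, we showed how such linear models could be expressed as simple neural networks where the inputs are directly wired to the output(s). While we will soon move past linear models altogether, they are sufficient to introduce most of the components that all of our models require: parametric forms, differentiable objectives, optimization via minibatch stochastic gradient descent, and ultimately, evaluation on previously unseen data.
+
+## Exercises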
+
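+1. Assume that we have some data $x_1, \ldots, x_n \in \mathbb{R}$. Our goal is to find a constant $b$ such that $\sum_i (x_i - b)^2$ is minimized.
+    1. Find an analytic solution for the optimal value of $b$.
+    1. How does this problem and its solution relate to the normal distribution?
+    1. What if we change the loss from $\sum_i (x_i - b)^2$ to $\sum_i |x_i-b|$? Can you find the optimal solution for $b$?
+1. Prove that the affine functions that can be expressed by $\mathbf{x}^\top \mathbf{w} + b$ are equivalent to linear functions on $(\mathbf{x}, 1)$.
+1. Assume that you want to find quadratic functions of $\mathbf{x}$, i.e., $f(\mathbf{x}) = b + \sum_i w_i x_i + \sum_{j \leq i} w_{ij} x_{i} x_{j}$. How would you formulate this in a deep network?
+1. Recall that one of the conditions for the linear regression problem to be solvable was that the matrix $\mathbf{X}^\top \mathbf{X}$ has full rank.
+    1. What happens if this is not the case?
+    1. How could you fix it? What happens if you add a small amount of coordinate-wise independent Gaussian noise to all entries of $\mathbf{X}$?
+    1. What is the expected value of the matrix $\mathbf{X}^\top \mathbf{X}$ in this case?
+    1. What happens with stochastic gradient descent when $\mathbf{X}^\top \mathbf{X}$ does not have full rank?
+1. Assume that the noise model governing the additive noise $\epsilon$ is the exponential distribution. That is, $p(\epsilon) = \frac{1}{2} \exp(-|\epsilon|)$.
+    1. Write out the negative log-likelihood of the data under the model $-\log P(\mathbf y \mid \mathbf X)$.
+    1. Can you find a closed-form solution?
+    1. Suggest a minibatch stochastic gradient descent algorithm to solve this problem. What could possibly go wrong (hint: what happens near the stationary point as we keep on updating the parameters)? Can you fix this?
+1. Assume that we want to design a neural network with two layers by composing two linear layers. That is, the output of the first layer becomes the input of the second layer. Why would such a naive composition not work?
+1. What happens if you want to use regression for realistic price estimation of houses or stock prices?
+    1. Show that the additive Gaussian noise assumption is not appropriate. Hint: can we have negative prices? What about fluctuations?
+    1. Why would regression to the logarithm of the price be much better, i.e., $y = \log \text{price}$?
+    1. What do you need to worry about when dealing with penny stocks, i.e., stocks with very low prices? Hint: can you trade at all possible prices? Why is this a bigger problem for cheap stocks?
+    1. For more information, review the celebrated Black-Scholes model for option pricing :cite:`Black.Scholes.1973`.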
+1. Suppose we want to use regression to estimate the *number* of apples sold in a grocery store.
+    1. What are the problems with a Gaussian additive noise model? Hint: you are selling apples, not oil.
+    1. The [Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution) captures distributions over counts. It is given by $p(k \mid \lambda) = \lambda^k e^{-\lambda}/k!$. Here $\lambda$ is the rate function and $k$ is the number of events you see. Prove that $\lambda$ is the expected value of counts $k$.
+    1. Design a loss function associated with the Poisson distribution.
+    1. Design a loss function for estimating $\log \lambda$ instead.
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/40)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/258)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/259)
+:end_tab:
diff --git a/chapter_linear-regression/linear-regression_origin.md b/chapter_linear-regression/linear-regression_origin.md
new file mode 100644
index 0000000..8437660
--- /dev/null
+++ b/chapter_linear-regression/linear-regression_origin.md
@@ -0,0 +1,734 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# Linear Regression
+:label:`sec_linear_regression`
+
+*Regression* problems pop up whenever we want to predict a numerical value.
+Common examples include predicting prices (of homes, stocks, etc.),
+predicting the length of stay (for patients in the hospital),
+forecasting demand (for retail sales), among countless others.
+Not every prediction problem is a classic regression problem.
+Later on, we will introduce classification problems,
+where the goal is to predict membership among a set of categories.
+
+As a running example, suppose that we wish
+to estimate the prices of houses (in dollars)
+based on their area (in square feet) and age (in years).
+To develop a model for predicting house prices,
+we need to get our hands on data consisting of sales,
+including the sales price, area, and age for each home.
+In the terminology of machine learning,
+the dataset is called a *training dataset* or *training set*,
+and each row (containing the data corresponding to one sale)
+is called an *example* (or *data point*, *instance*, *sample*).
+The thing we are trying to predict (price)
+is called a *label* (or *target*).
+The variables (age and area)
+upon which the predictions are based
+are called *features* (or *covariates*).
+
+## Basics
+
+*Linear regression* may be both the simplest
+and most popular among the standard tools
+for tackling regression problems.
+Dating back to the dawn of the 19th century :cite:`Legendre.1805,Gauss.1809`,
+linear regression flows from a few simple assumptions.
+First, we assume that the relationship
+between features $\mathbf{x}$ and target $y$
+is approximately linear,
+i.e., that the conditional mean $E[Y \mid X=\mathbf{x}]$
+can be expressed as a weighted sum
+of the features $\mathbf{x}$.
+This setup allows that the target value
+may still deviate from its expected value
+on account of observation noise.
+Next, we can impose the assumption that any such noise
+is well-behaved, following a Gaussian distribution.
+Typically, we will use $n$ to denote
+the number of examples in our dataset.
+We use superscripts to enumerate samples and targets,
+and subscripts to index coordinates.
+More concretely,
+$\mathbf{x}^{(i)}$ denotes the $i$-th sample
+and $x_j^{(i)}$ denotes its $j$-th coordinate.
+
+### Model
+:label:`subsec_linear_model`
+
+At the heart of every solution is a model
+that describes how features can be transformed
+into an estimate of the target.
+The assumption of linearity means that
+the expected value of the target (price) can be expressed
+as a weighted sum of the features (area and age):
+
+$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b.$$
+:eqlabel:`eq_price-area`
+
+Here $w_{\mathrm{area}}$ and $w_{\mathrm{age}}$
+are called *weights*, and $b$ is called a *bias*
+(or *offset* or *intercept*).
+The weights determine the influence of each feature on our prediction.
+The bias determines the value of the estimate when all features are zero.
+Even though we will never see any newly built homes with precisely zero area,
+we still need the bias because it allows us
+to express all linear functions of our features
+(versus restricting us to lines that pass through the origin).
+Strictly speaking, :eqref:`eq_price-area` is an *affine transformation* of input features, which is characterized by a *linear transformation* of features via weighted sum, combined with a *translation* via the added bias.
+Given a dataset, our goal is to choose
+the weights $\mathbf{w}$ and the bias $b$
+that, on average, make our model's predictions
+fit the true prices observed in the data as closely as possible.
+
+
+In disciplines where it is common to focus
+on datasets with just a few features,
+explicitly expressing models long-form,
+as in :eqref:`eq_price-area`, is common.
+In machine learning, we usually work
+with high-dimensional datasets,
+where it's more convenient to employ
+compact linear algebra notation.
+When our inputs consist of $d$ features,
+we can assign each an index (between $1$ and $d$)
+and express our prediction $\hat{y}$
+(in general the "hat" symbol denotes an estimate) as
+
+$$\hat{y} = w_1  x_1 + ... + w_d  x_d + b.$$
+
+Collecting all features into a vector $\mathbf{x} \in \mathbb{R}^d$
+and all weights into a vector $\mathbf{w} \in \mathbb{R}^d$,
+we can express our model compactly via the dot product
+between $\mathbf{w}$ and $\mathbf{x}$:
+
+$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b.$$
+:eqlabel:`eq_linreg-y`
+
+In :eqref:`eq_linreg-y`, the vector $\mathbf{x}$
+corresponds to the features of a single example.
+We will often find it convenient
+to refer to features of our entire dataset of $n$ examples
+via the *design matrix* $\mathbf{X} \in \mathbb{R}^{n \times d}$.
+Here, $\mathbf{X}$ contains one row for every example
+and one column for every feature.
+For a collection of features $\mathbf{X}$,
+the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$
+can be expressed via the matrix-vector product:
+
+$${\hat{\mathbf{y}}} = \mathbf{X} \mathbf{w} + b,$$
+
+where broadcasting (:numref:`subsec_broadcasting`) is applied during the summation.
+Given features of a training dataset $\mathbf{X}$
+and corresponding (known) labels $\mathbf{y}$,
+the goal of linear regression is to find
+the weight vector $\mathbf{w}$ and the bias term $b$
+such that, given the features of a new data example
+sampled from the same distribution as $\mathbf{X}$,
+the new example's label will (in expectation)
+be predicted with the lowest error.
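+
+As a small worked example of the matrix-vector form (all numbers here are hypothetical and chosen only for illustration), take $n = 3$ examples with $d = 2$ features, weights $\mathbf{w} = (2, -1)^\top$, and bias $b = 0.5$. Broadcasting adds the scalar $b$ to every entry of $\mathbf{X} \mathbf{w}$:
+
+$$\hat{\mathbf{y}} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} \begin{bmatrix} 2 \\ -1 \end{bmatrix} + 0.5 = \begin{bmatrix} 0 \\ 2 \\ 4 \end{bmatrix} + 0.5 = \begin{bmatrix} 0.5 \\ 2.5 \\ 4.5 \end{bmatrix}.$$
+
+Each row of $\mathbf{X}$ yields one prediction, so row $i$ reproduces $\hat{y}^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} + b$ from :eqref:`eq_linreg-y`.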
+
+Even if we believe that the best model for
+predicting $y$ given $\mathbf{x}$ is linear,
+we would not expect to find a real-world dataset of $n$ examples where
+$y^{(i)}$ exactly equals $\mathbf{w}^\top \mathbf{x}^{(i)}+b$
+for all $1 \leq i \leq n$.
+For example, whatever instruments we use to observe
+the features $\mathbf{X}$ and labels $\mathbf{y}$
+might suffer from a small amount of measurement error.
+Thus, even when we are confident
+that the underlying relationship is linear,
+we will incorporate a noise term to account for such errors.
+
+Before we can go about searching for the best *parameters*
+(or *model parameters*) $\mathbf{w}$ and $b$,
+we will need two more things:
+(i) a quality measure for some given model;
+and (ii) a procedure for updating the model to improve its quality.
+
+### Loss Function
+:label:`subsec_linear-regression-loss-function`
+
+Naturally, fitting our model to the data requires
+that we agree on some measure of *fitness*
+(or, equivalently, of *unfitness*).
+*Loss functions* quantify the distance
+between the *real* and *predicted* values of the target.
+The loss will usually be a non-negative number
+where smaller values are better
+and perfect predictions incur a loss of 0.
+For regression problems, the most common loss function is squared error.
+When our prediction for an example $i$ is $\hat{y}^{(i)}$
+and the corresponding true label is $y^{(i)}$,
+the *squared error* is given by:
+
+$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2.$$
+:eqlabel:`eq_mse`
+
+The constant $\frac{1}{2}$ makes no real difference
+but proves to be notationally convenient,
+since it cancels out when we take the derivative of the loss.
+Because the training dataset is given to us,
+and thus out of our control,
+the empirical error is only a function of the model parameters.
+Below, we visualize the fit of a linear regression model
+in a problem with one-dimensional inputs (:numref:`fig_fit_linreg`).
+
+![Fitting a linear regression model to one-dimensional data.](../img/fit-linreg.svg)
+:label:`fig_fit_linreg`
+
+Note that large differences between
+estimates $\hat{y}^{(i)}$ and targets $y^{(i)}$
+lead to even larger contributions to the loss,
+due to the quadratic form of the loss
+(this can be a double-edged sword:
+while it encourages the model to avoid large errors,
+it can also lead to excessive sensitivity to anomalous data).
+To measure the quality of a model on the entire dataset of $n$ examples,
+we simply average (or equivalently, sum)
+the losses on the training set:
+
+$$L(\mathbf{w}, b) =\frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) =\frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$
+
+When training the model, we want to find parameters ($\mathbf{w}^*, b^*$)
+that minimize the total loss across all training examples:
+
+$$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\  L(\mathbf{w}, b).$$
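+
+To see how these definitions behave on numbers, here is a small worked example with hypothetical values. Suppose a model predicts $\hat{\mathbf{y}} = (0.5, 2.5, 4.5)$ for $n = 3$ examples whose labels are $\mathbf{y} = (1, 2, 5)$. Every prediction is off by $0.5$ in absolute value, so
+
+$$L(\mathbf{w}, b) = \frac{1}{3} \left( \frac{1}{2}(-0.5)^2 + \frac{1}{2}(0.5)^2 + \frac{1}{2}(-0.5)^2 \right) = \frac{1}{3} \cdot 0.375 = 0.125.$$
+
+If the third label were instead the anomalous value $8.5$, the third error would grow from $-0.5$ to $-4$ (a factor of $8$), while its loss would grow from $0.125$ to $8$ (a factor of $64$), giving $L(\mathbf{w}, b) = \frac{1}{3}(0.125 + 0.125 + 8) = 2.75$. This illustrates how strongly squared error reacts to a single outlier.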
+
+### Analytic Solution
+
+Unlike most of the models that we will cover,
+linear regression presents us with
+a surprisingly easy optimization problem.
+In particular, we can find the optimal parameters
+(as assessed on the training data)
+analytically by applying a simple formula as follows.
+First, we can subsume the bias $b$ into the parameter $\mathbf{w}$
+by appending a column consisting of all ones to the design matrix.
+Then our prediction problem is to minimize $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$.
+So long as the design matrix $\mathbf{X}$ has full rank
+(no feature is linearly dependent on the others),
+there will be just one critical point on the loss surface,
+and it corresponds to the minimum of the loss over the entire domain.
+Taking the derivative of the loss with respect to $\mathbf{w}$
+and setting it equal to zero yields:
+
+$$\begin{aligned}
+    \partial_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 =
+    2 \mathbf{X}^\top (\mathbf{X} \mathbf{w} - \mathbf{y}) = 0
+    \text{ and hence }
+    \mathbf{X}^\top \mathbf{y} = \mathbf{X}^\top \mathbf{X} \mathbf{w}.
+\end{aligned}$$
+
+Solving for $\mathbf{w}$ provides us with the optimal solution
+for the optimization problem.
+Note that this solution 
+
+$$\mathbf{w}^* = (\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf{y}$$
+
+will only be unique
+when the matrix $\mathbf X^\top \mathbf X$ is invertible,
+i.e., when the columns of the design matrix
+are linearly independent :cite:`Golub.Van-Loan.1996`.
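+
+As a sanity check on the formula, consider a small hypothetical dataset (not from the text) with a single feature, $x = (0, 1, 2)$, and noise-free labels $y = (1, 3, 5)$ generated by $y = 2x + 1$. Appending the column of ones to absorb the bias gives
+
+$$\mathbf{X} = \begin{bmatrix} 0 & 1 \\ 1 & 1 \\ 2 & 1 \end{bmatrix}, \quad \mathbf{X}^\top \mathbf{X} = \begin{bmatrix} 5 & 3 \\ 3 & 3 \end{bmatrix}, \quad \mathbf{X}^\top \mathbf{y} = \begin{bmatrix} 13 \\ 9 \end{bmatrix}.$$
+
+Since $\det(\mathbf{X}^\top \mathbf{X}) = 6 \neq 0$, the matrix is invertible and
+
+$$\mathbf{w}^* = \frac{1}{6} \begin{bmatrix} 3 & -3 \\ -3 & 5 \end{bmatrix} \begin{bmatrix} 13 \\ 9 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \end{bmatrix},$$
+
+which recovers the slope $2$ and, in the last coordinate, the bias $1$.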
+
+
+
+While simple problems like linear regression
+may admit analytic solutions,
+you should not get used to such good fortune.
+Although analytic solutions allow for nice mathematical analysis,
+the requirement of an analytic solution is so restrictive
+that it would exclude almost all exciting aspects of deep learning.
+
+### Minibatch Stochastic Gradient Descent
+
+Fortunately, even in cases where we cannot solve the models analytically,
+we can still often train models effectively in practice.
+Moreover, for many tasks, those difficult-to-optimize models
+turn out to be so much better that figuring out how to train them
+ends up being well worth the trouble.
+
+The key technique for optimizing nearly any deep learning model,
+and which we will call upon throughout this book,
+consists of iteratively reducing the error
+by updating the parameters in the direction
+that incrementally lowers the loss function.
+This algorithm is called *gradient descent*.
+
+The most naive application of gradient descent
+consists of taking the derivative of the loss function,
+which is an average of the losses computed
+on every single example in the dataset.
+In practice, this can be extremely slow:
+we must pass over the entire dataset before making a single update,
+even if the update steps might be very powerful :cite:`Liu.Nocedal.1989`.
+Even worse, if there is a lot of redundancy in the training data,
+the benefit of a full update is even lower.
+
+The other extreme is to consider only a single example at a time and to take
+update steps based on one observation at a time.
+The resulting algorithm, *stochastic gradient descent* (SGD)
+can be an effective strategy :cite:`Bottou.2010`, even for large datasets.
+Unfortunately, SGD has drawbacks, both computational and statistical.
+One problem arises from the fact that processors are a lot faster
+at multiplying and adding numbers than they are
+at moving data from main memory to processor cache.
+It is up to an order of magnitude more efficient to
+perform a matrix-vector multiplication
+than a corresponding number of vector-vector operations.
+This means that it can take a lot longer to process
+one sample at a time compared to a full batch.
+A second problem is that some of the layers,
+such as batch normalization (to be described in :numref:`sec_batch_norm`),
+only work well when we have access
+to more than one observation at a time.
+
+The solution to both problems is to pick an intermediate strategy:
+rather than taking a full batch or only a single sample at a time,
+we take a *minibatch* of observations :cite:`Li.Zhang.Chen.ea.2014`.
+The specific choice of the size of the said minibatch depends on many factors,
+such as the amount of memory, the number of accelerators,
+the choice of layers, and the total dataset size.
+Despite all of that, a number between 32 and 256,
+preferably a multiple of a large power of $2$, is a good start.
+This leads us to *minibatch stochastic gradient descent*.
+
+In its most basic form, in each iteration $t$,
+we first randomly sample a minibatch $\mathcal{B}_t$
+consisting of a fixed number $|\mathcal{B}|$ of training examples.
+We then compute the derivative (gradient) of the average loss
+on the minibatch with respect to the model parameters.
+Finally, we multiply the gradient
+by a predetermined small positive value $\eta$,
+called the *learning rate*,
+and subtract the resulting term from the current parameter values.
+We can express the update as follows:
+
+$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b).$$
+
+In summary, minibatch SGD proceeds as follows:
+(i) initialize the values of the model parameters, typically at random;
+(ii) iteratively sample random minibatches from the data,
+updating the parameters in the direction of the negative gradient.
+For quadratic losses and affine transformations,
+this has a closed-form expansion:
+
+$$\begin{aligned} \mathbf{w} & \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) && = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)\\ b &\leftarrow b -  \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \partial_b l^{(i)}(\mathbf{w}, b) &&  = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned}$$
+:eqlabel:`eq_linreg_batch_update`
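+
+The update above can be sketched in a few lines of plain Python.
+This is only an illustration under stated assumptions:
+the function name `minibatch_sgd`, the learning rate, the batch size,
+the number of steps, and the synthetic noiseless data are all hypothetical choices,
+and a real implementation would rely on a vectorized tensor library instead of loops.
+
+```python
+import random
+
+def minibatch_sgd(X, y, lr=0.05, batch_size=4, num_steps=2000, seed=0):
+    """Minibatch SGD for linear regression with squared loss."""
+    rng = random.Random(seed)
+    d = len(X[0])
+    w, b = [0.0] * d, 0.0  # (i) initialize the parameters
+    for _ in range(num_steps):
+        # (ii) sample a random minibatch of example indices
+        batch = rng.sample(range(len(X)), batch_size)
+        grad_w, grad_b = [0.0] * d, 0.0
+        for i in batch:
+            # Prediction error w^T x + b - y for one example
+            err = sum(wj * xj for wj, xj in zip(w, X[i])) + b - y[i]
+            for j in range(d):
+                grad_w[j] += X[i][j] * err
+            grad_b += err
+        # Step along the negative averaged gradient
+        w = [wj - lr / batch_size * gj for wj, gj in zip(w, grad_w)]
+        b -= lr / batch_size * grad_b
+    return w, b
+
+# Synthetic noiseless data (hypothetical ground truth)
+data_rng = random.Random(1)
+true_w, true_b = [2.0, -3.4], 4.2
+X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
+y = [true_w[0] * x[0] + true_w[1] * x[1] + true_b for x in X]
+w_hat, b_hat = minibatch_sgd(X, y)
+```
+
+On this toy problem the estimates `w_hat` and `b_hat` end up close to,
+though in general not exactly equal to, the values used to generate the data.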
+
+Since we pick a minibatch $\mathcal{B}$,
+we need to normalize by its size $|\mathcal{B}|$.
+Frequently, the minibatch size and learning rate are user-defined.
+Such tunable parameters that are not updated
+in the training loop are called *hyperparameters*.
+They can be tuned automatically by a number of techniques, such as Bayesian optimization
+:cite:`Frazier.2018`. In the end, the quality of the solution is
+typically assessed on a separate *validation dataset* (or *validation set*).
+
+After training for some predetermined number of iterations
+(or until some other stopping criterion is met),
+we record the estimated model parameters,
+denoted $\hat{\mathbf{w}}, \hat{b}$.
+Note that even if our function is truly linear and noiseless,
+these parameters will not be the exact minimizers of the loss, or even deterministic.
+Although the algorithm converges slowly towards the minimizers,
+it typically cannot reach them exactly in a finite number of steps.
+Moreover, the minibatches $\mathcal{B}$
+used to update the parameters are chosen at random.
+This breaks determinism.
+
+Linear regression happens to be a learning problem
+with a global minimum
+(whenever $\mathbf{X}$ is full rank, or equivalently,
+whenever $\mathbf{X}^\top \mathbf{X}$ is invertible).
+However, the loss surfaces for deep networks contain many saddle points and minima.
+Fortunately, we typically don't care about finding
+an exact set of parameters but merely any set of parameters
+that leads to accurate predictions (and thus low loss).
+In practice, deep learning practitioners
+seldom struggle to find parameters
+that minimize the loss *on training sets*
+:cite:`Izmailov.Podoprikhin.Garipov.ea.2018,Frankle.Carbin.2018`.
+The more formidable task is to find parameters
+that lead to accurate predictions on previously unseen data,
+a challenge called *generalization*.
+We return to these topics throughout the book.
+
+### Predictions
+
+Given the model $\hat{\mathbf{w}}^\top \mathbf{x} + \hat{b}$,
+we can now make *predictions* for a new example,
+e.g., to predict the sales price of a previously unseen house
+given its area $x_1$ and age $x_2$.
+Deep learning practitioners have taken to calling the prediction phase *inference*
+but this is a bit of a misnomer---*inference* refers broadly
+to any conclusion reached on the basis of evidence,
+including both the values of the parameters
+and the likely label for an unseen instance.
+If anything, in the statistics literature
+*inference* more often denotes parameter inference
+and this overloading of terminology creates unnecessary confusion
+when deep learning practitioners talk to statisticians.
+In the following we will stick to *prediction* whenever possible.
+
+
+
+
+## Vectorization for Speed
+
+When training our models, we typically want to process
+whole minibatches of examples simultaneously.
+Doing this efficiently requires that (**we**) (~~should~~)
+(**vectorize the calculations and leverage
+fast linear algebra libraries
+rather than writing costly for-loops in Python.**)
+
+```{.python .input  n=1}
+%%tab mxnet
+%matplotlib inline
+from d2l import mxnet as d2l
+import math
+from mxnet import np
+import time
+```
+
+```{.python .input  n=1}
+%%tab pytorch
+%matplotlib inline
+from d2l import torch as d2l
+import math
+import torch
+import numpy as np
+import time
+```
+
+```{.python .input}
+%%tab tensorflow
+%matplotlib inline
+from d2l import tensorflow as d2l
+import math
+import tensorflow as tf
+import numpy as np
+import time
+```
+
+To illustrate why this matters so much,
+we can (**consider two methods for adding vectors.**)
+To start, we instantiate two 10,000-dimensional vectors
+containing all ones.
+In one method, we loop over the vectors with a Python for-loop.
+In the other method, we rely on a single call to `+`.
+
+```{.python .input  n=2}
+%%tab all
+n = 10000
+a = d2l.ones(n)
+b = d2l.ones(n)
+```
+
+Now we can benchmark the workloads.
+First, [**we add them, one coordinate at a time,
+using a for-loop.**]
+
+```{.python .input  n=3}
+%%tab mxnet, pytorch
+c = d2l.zeros(n)
+t = time.time()
+for i in range(n):
+    c[i] = a[i] + b[i]
+f'{time.time() - t:.5f} sec'
+```
+
+```{.python .input}
+%%tab tensorflow
+c = tf.Variable(d2l.zeros(n))
+t = time.time()
+for i in range(n):
+    c[i].assign(a[i] + b[i])
+f'{time.time() - t:.5f} sec'
+```
+
+(**Alternatively, we rely on the overloaded `+` operator to compute the elementwise sum.**)
+
+```{.python .input  n=4}
+%%tab all
+t = time.time()
+d = a + b
+f'{time.time() - t:.5f} sec'
+```
+
+The second method is dramatically faster than the first.
+Vectorizing code often yields order-of-magnitude speedups.
+Moreover, we push more of the mathematics to the library
+without the need to write as many calculations ourselves,
+reducing the potential for errors and increasing portability of the code.
+
+
+## The Normal Distribution and Squared Loss
+:label:`subsec_normal_distribution_and_squared_loss`
+
+So far we've given a fairly functional motivation
+of the squared loss objective:
+the optimal parameters return the conditional expectation $E[Y\mid X]$
+whenever the underlying pattern is truly linear,
+and the loss assigns outsize penalties for outliers.
+We can also provide a more formal motivation
+for the squared loss objective
+by making probabilistic assumptions
+about the distribution of noise.
+
+Linear regression was invented at the turn of the 19th century.
+While it has long been debated whether Gauss or Legendre
+first thought up the idea,
+it was Gauss who also discovered the normal distribution
+(also called the *Gaussian*).
+It turns out that the normal distribution
+and linear regression with squared loss
+share a deeper connection than common parentage.
+
+To begin, recall that a normal distribution
+with mean $\mu$ and variance $\sigma^2$ (standard deviation $\sigma$)
+is given as
+
+$$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (x - \mu)^2\right).$$
+
+Below [**we define a function to compute the normal distribution**].
+
+```{.python .input  n=3}
+%%tab all
+def normal(x, mu, sigma):
+    p = 1 / math.sqrt(2 * math.pi * sigma**2)
+    return p * np.exp(-0.5 * (x - mu)**2 / sigma**2)
+```
+
+We can now (**visualize the normal distributions**).
+
+```{.python .input  n=8}
+%%tab mxnet
+# Use numpy again for visualization
+x = np.arange(-7, 7, 0.01)
+
+# Mean and standard deviation pairs
+params = [(0, 1), (0, 2), (3, 1)]
+d2l.plot(x.asnumpy(), [normal(x, mu, sigma).asnumpy() for mu, sigma in params], xlabel='x',
+         ylabel='p(x)', figsize=(4.5, 2.5),
+         legend=[f'mean {mu}, std {sigma}' for mu, sigma in params])
+```
+
+```{.python .input  n=8}
+%%tab pytorch, tensorflow
+# Use numpy again for visualization
+x = np.arange(-7, 7, 0.01)
+
+# Mean and standard deviation pairs
+params = [(0, 1), (0, 2), (3, 1)]
+d2l.plot(x, [normal(x, mu, sigma) for mu, sigma in params], xlabel='x',
+         ylabel='p(x)', figsize=(4.5, 2.5),
+         legend=[f'mean {mu}, std {sigma}' for mu, sigma in params])
+```
+
+Note that changing the mean corresponds
+to a shift along the $x$-axis,
+and increasing the variance
+spreads the distribution out,
+lowering its peak.
+
+One way to motivate linear regression with squared loss
+is to assume that observations arise from noisy measurements,
+where the noise is normally distributed as follows:
+
+$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon \text{ where } \epsilon \sim \mathcal{N}(0, \sigma^2).$$
+
+Thus, we can now write out the *likelihood*
+of seeing a particular $y$ for a given $\mathbf{x}$ via
+
+$$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (y - \mathbf{w}^\top \mathbf{x} - b)^2\right).$$
+
+As such, the likelihood factorizes.
+According to *the principle of maximum likelihood*,
+the best values of parameters $\mathbf{w}$ and $b$ are those
+that maximize the *likelihood* of the entire dataset:
+
+$$P(\mathbf y \mid \mathbf X) = \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}).$$
+
+The equality follows since all pairs $(\mathbf{x}^{(i)}, y^{(i)})$
+were drawn independently of each other.
+Estimators chosen according to the principle of maximum likelihood
+are called *maximum likelihood estimators*.
+While maximizing the product of many exponential functions
+might look difficult,
+we can simplify things significantly, without changing the objective,
+by maximizing the logarithm of the likelihood instead.
+For historical reasons, optimizations are more often expressed
+as minimization rather than maximization.
+So, without changing anything,
+we can *minimize* the *negative log-likelihood*,
+which we can express as follows:
+
+$$-\log P(\mathbf y \mid \mathbf X) = \sum_{i=1}^n \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2.$$
+
+If we assume that $\sigma$ is fixed,
+we can ignore the first term,
+because it does not depend on $\mathbf{w}$ or $b$.
+The second term is identical
+to the squared error loss introduced earlier,
+except for the multiplicative constant $\frac{1}{\sigma^2}$.
+Fortunately, the solution does not depend on $\sigma$ either.
+It follows that minimizing the mean squared error
+is equivalent to maximum likelihood estimation
+of a linear model under the assumption of additive Gaussian noise.
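+
+To make the last step explicit, here is a short worked derivation.
+It assumes, as in the analytic solution earlier,
+that the bias $b$ has been absorbed into $\mathbf{w}$
+by appending a constant-one feature to every $\mathbf{x}^{(i)}$.
+Setting the gradient of the negative log-likelihood to zero gives
+
+$$\begin{aligned}
+    \partial_{\mathbf{w}} \left(-\log P(\mathbf y \mid \mathbf X)\right)
+    &= -\frac{1}{\sigma^2} \sum_{i=1}^n \mathbf{x}^{(i)} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)}\right)
+    = -\frac{1}{\sigma^2} \mathbf{X}^\top \left(\mathbf{y} - \mathbf{X} \mathbf{w}\right) = \mathbf{0} \\
+    &\Longrightarrow \quad \mathbf{X}^\top \mathbf{y} = \mathbf{X}^\top \mathbf{X} \mathbf{w}.
+\end{aligned}$$
+
+The factor $\frac{1}{\sigma^2}$ multiplies the whole gradient and therefore cancels,
+so the stationarity condition is the same normal equation
+that we obtained when minimizing the squared error directly.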
+
+
+## Linear Regression as a Neural Network
+
+While linear models are not sufficiently rich
+to express the many complicated neural networks
+that we will introduce in this book,
+neural networks are rich enough
+to subsume linear models as neural networks
+in which every feature is represented by an input neuron,
+all of which are connected directly to the output.
+
+:numref:`fig_single_neuron` depicts
+linear regression as a neural network.
+The diagram highlights the connectivity pattern
+such as how each input is connected to the output,
+but not the specific values taken by the weights or biases.
+
+![Linear regression is a single-layer neural network.](../img/singleneuron.svg)
+:label:`fig_single_neuron`
+
+The inputs are $x_1, \ldots, x_d$.
+We refer to $d$ as the *number of inputs*
+or *feature dimensionality* in the input layer.
+The output of the network is $o_1$.
+Because we are just trying to predict
+a single numerical value,
+we have only one output neuron.
+Note that the input values are all *given*.
+There is just a single *computed* neuron.
+In summary, we can think of linear regression
+as a single-layer fully connected neural network.
+We will encounter networks
+with far more layers
+in future chapters.
+
+### Biology
+
+Because linear regression predates computational neuroscience,
+it might seem anachronistic to describe
+linear regression in terms of neural networks.
+Nonetheless, linear models were a natural place to start
+when the cyberneticists and neurophysiologists
+Warren McCulloch and Walter Pitts began to develop
+models of artificial neurons.
+Consider the cartoonish picture
+of a biological neuron in :numref:`fig_Neuron`,
+consisting of *dendrites* (input terminals),
+the *nucleus* (CPU), the *axon* (output wire),
+and the *axon terminals* (output terminals),
+enabling connections to other neurons via *synapses*.
+
+![The real neuron.](../img/neuron.svg)
+:label:`fig_Neuron`
+
+Information $x_i$ arriving from other neurons
+(or environmental sensors) is received in the dendrites.
+In particular, that information is weighted
+by *synaptic weights* $w_i$,
+determining the effect of the inputs,
+e.g., activation or inhibition via the product $x_i w_i$.
+The weighted inputs arriving from multiple sources
+are aggregated in the nucleus
+as a weighted sum $y = \sum_i x_i w_i + b$,
+possibly subject to some nonlinear postprocessing via $\sigma(y)$.
+This information is then sent via the axon to the axon terminals,
+where it reaches its destination
+(e.g., an actuator such as a muscle)
+or it is fed into another neuron via its dendrites.
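+
+The computation performed by such a unit can be sketched as follows.
+The function names `neuron` and `sigmoid` and the numeric values are hypothetical,
+and the sigmoid is just one possible choice for the nonlinearity $\sigma$.
+
+```python
+import math
+
+def sigmoid(y):
+    """One possible nonlinear postprocessing function."""
+    return 1.0 / (1.0 + math.exp(-y))
+
+def neuron(x, w, b, activation=None):
+    """Weighted sum of inputs plus bias, optionally passed through a nonlinearity."""
+    y = sum(xi * wi for xi, wi in zip(x, w)) + b
+    return activation(y) if activation is not None else y
+
+x = [1.0, 2.0, 3.0]    # incoming signals
+w = [0.5, -1.0, 0.25]  # synaptic weights; a negative weight acts as inhibition
+linear_out = neuron(x, w, b=0.75)
+squashed_out = neuron(x, w, b=0.75, activation=sigmoid)
+```
+
+Without the activation, this is exactly the linear regression model
+$\mathbf{w}^\top \mathbf{x} + b$ from earlier in the section.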
+
+Certainly, the high-level idea that many such units
+could be combined with the right connectivity
+and right learning algorithm,
+to produce far more interesting and complex behavior
+than any one neuron alone could express
+owes to our study of real biological neural systems.
+At the same time, most research in deep learning today
+draws inspiration from a much wider source.
+We invoke Stuart Russell and Peter Norvig :cite:`Russell.Norvig.2016`
+who pointed out that although airplanes might have been *inspired* by birds,
+ornithology has not been the primary driver
+of aeronautics innovation for some centuries.
+Likewise, inspiration in deep learning these days
+comes in equal or greater measure
+from mathematics, linguistics, psychology,
+statistics, computer science, and many other fields.
+
+## Summary
+
+In this section, we introduced
+traditional linear regression,
+where the parameters of a linear function
+are chosen to minimize squared loss on the training set.
+We also motivated this choice of objective
+both via some practical considerations
+and through an interpretation
+of linear regression as maximum likelihood estimation
+under an assumption of linearity and Gaussian noise.
+After discussing both computational considerations
+and connections to statistics,
+we showed how such linear models could be expressed
+as simple neural networks where the inputs
+are directly wired to the output(s).
+While we will soon move past linear models altogether,
+they are sufficient to introduce most of the components
+that all of our models require:
+parametric forms, differentiable objectives,
+optimization via minibatch stochastic gradient descent,
+and ultimately, evaluation on previously unseen data.
+
+
+
+## Exercises
+
+1. Assume that we have some data $x_1, \ldots, x_n \in \mathbb{R}$. Our goal is to find a constant $b$ such that $\sum_i (x_i - b)^2$ is minimized.
+    1. Find an analytic solution for the optimal value of $b$.
+    1. How does this problem and its solution relate to the normal distribution?
+    1. What if we change the loss from $\sum_i (x_i - b)^2$ to $\sum_i |x_i-b|$? Can you find the optimal solution for $b$?
+1. Prove that the affine functions that can be expressed by $\mathbf{x}^\top \mathbf{w} + b$ are equivalent to linear functions on $(\mathbf{x}, 1)$.
+1. Assume that you want to find quadratic functions of $\mathbf{x}$, i.e., $f(\mathbf{x}) = b + \sum_i w_i x_i + \sum_{j \leq i} w_{ij} x_{i} x_{j}$. How would you formulate this in a deep network?
+1. Recall that one of the conditions for the linear regression problem to be solvable was that the matrix $\mathbf{X}^\top \mathbf{X}$, formed from the design matrix $\mathbf{X}$, has full rank.
+    1. What happens if this is not the case?
+    1. How could you fix it? What happens if you add a small amount of coordinate-wise independent Gaussian noise to all entries of $\mathbf{X}$?
+    1. What is the expected value of the matrix $\mathbf{X}^\top \mathbf{X}$ in this case?
+    1. What happens with stochastic gradient descent when $\mathbf{X}^\top \mathbf{X}$ doesn't have full rank?
+1. Assume that the noise model governing the additive noise $\epsilon$ is the Laplace (double exponential) distribution. That is, $p(\epsilon) = \frac{1}{2} \exp(-|\epsilon|)$.
+    1. Write out the negative log-likelihood of the data under the model $-\log P(\mathbf y \mid \mathbf X)$.
+    1. Can you find a closed form solution?
+    1. Suggest a minibatch stochastic gradient descent algorithm to solve this problem. What could possibly go wrong (hint: what happens near the stationary point as we keep on updating the parameters)? Can you fix this?
+1. Assume that we want to design a neural network with two layers by composing two linear layers. That is, the output of the first layer becomes the input of the second layer. Why would such a naive composition not work?
+1. What happens if you want to use regression for realistic price estimation of houses or stock prices?
+    1. Show that the additive Gaussian noise assumption is not appropriate. Hint: can we have negative prices? What about fluctuations?
+    1. Why would regression to the logarithm of the price be much better, i.e., $y = \log \text{price}$?
+    1. What do you need to worry about when dealing with penny stocks, i.e., stocks with very low prices? Hint: can you trade at all possible prices? Why is this a bigger problem for cheap stocks?
+    1. For more information review the celebrated Black-Scholes model for option pricing :cite:`Black.Scholes.1973`.
+1. Suppose we want to use regression to estimate the *number* of apples sold in a grocery store.
+    1. What are the problems with a Gaussian additive noise model? Hint: you are selling apples, not oil.
+    1. The [Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution) captures distributions over counts. It is given by $p(k \mid \lambda) = \lambda^k e^{-\lambda}/k!$. Here $\lambda$ is the rate function and $k$ is the number of events you see. Prove that $\lambda$ is the expected value of counts $k$.
+    1. Design a loss function associated with the Poisson distribution.
+    1. Design a loss function for estimating $\log \lambda$ instead.
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/40)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/258)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/259)
+:end_tab:
diff --git a/chapter_linear-regression/oo-design.md b/chapter_linear-regression/oo-design.md
new file mode 100644
index 0000000..d71b92b
--- /dev/null
+++ b/chapter_linear-regression/oo-design.md
@@ -0,0 +1,281 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
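+# Object-Oriented Design for Implementation
+:label:`sec_oo-design`
+
+In our introduction to linear regression, we walked through various components including the data, the model, the loss function, and the optimization algorithm. Indeed, linear regression is one of the simplest machine learning models. Training it, however, uses many of the same components that other models in this book require. Therefore, before diving into the implementation details, it is worth designing some of the APIs used throughout this book. Treating components in deep learning as objects, we can start by defining classes for these objects and their interactions. This object-oriented design for implementation will greatly streamline the presentation, and you might even want to use it in your own projects. 
+
+Inspired by open-source libraries such as [PyTorch Lightning](https://www.pytorchlightning.ai/), at a high level we wish to have three classes: (i) `Module` contains models, losses, and optimization methods; (ii) `DataModule` provides data loaders for training and validation; (iii) both classes are combined using the `Trainer` class, which allows us to train models on a variety of hardware platforms. Most code in this book adapts `Module` and `DataModule`. We will touch upon the `Trainer` class only when we discuss GPUs, CPUs, parallel training, and optimization algorithms.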
+
+```{.python .input}
+%%tab mxnet
+import time
+import numpy as np
+from d2l import mxnet as d2l
+from mxnet.gluon import nn
+```
+
+```{.python .input}
+%%tab pytorch
+import time
+import numpy as np
+from d2l import torch as d2l
+import torch
+from torch import nn
+```
+
+```{.python .input}
+%%tab tensorflow
+import time
+import numpy as np
+from d2l import tensorflow as d2l
+import tensorflow as tf
+```
+
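+## Utilities
+:label:`oo-design-utilities`
+
+We need a few utilities to simplify object-oriented programming in Jupyter notebooks. One of the challenges is that class definitions tend to be fairly long blocks of code. Notebook readability demands short code fragments, interspersed with explanations, a requirement incompatible with the style of programming common for Python libraries. The first utility function allows us to register functions as methods in a class *after* the class has been created. In fact, we can do so even after we have created instances of the class. This allows us to split the implementation of a class into multiple code blocks.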
+
+```{.python .input}
+%%tab all
+def add_to_class(Class):  #@save
+    def wrapper(obj):
+        setattr(Class, obj.__name__, obj)
+    return wrapper
+```
+
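+Let's have a quick look at how to use it. We plan to implement a class `A` with a method `do`. Instead of having the code for both `A` and `do` in the same code block, we first declare the class `A` and create an instance `a`.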
+
+```{.python .input}
+%%tab all
+class A:
+    def __init__(self):
+        self.b = 1
+
+a = A()
+```
+
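+Next we define the method `do` as we normally would, but not in the scope of class `A`. Instead, we decorate this method with `add_to_class`, passing class `A` as its argument. In doing so, the method can access the member variables of `A`, just as we would expect had it been defined as part of the definition of `A`. Let's see what happens when we invoke it on the instance `a`.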
+
+```{.python .input}
+%%tab all
+@add_to_class(A)
+def do(self):
+    print('Class attribute "b" is', self.b)
+
+a.do()
+```
+
+The second one is a utility class that saves all arguments of a class's `__init__` method as class attributes. This allows us to extend constructor call signatures implicitly without additional code.
+
+```{.python .input}
+%%tab all
+class HyperParameters:  #@save
+    def save_hyperparameters(self, ignore=[]):
+        raise NotImplementedError
+```
+
+We defer its implementation to :numref:`sec_utils`. To use it, we define a class that inherits from `HyperParameters` and calls `save_hyperparameters` in its `__init__` method.
+
+```{.python .input}
+%%tab all
+# Call the fully implemented HyperParameters class saved in d2l
+class B(d2l.HyperParameters):
+    def __init__(self, a, b, c):
+        self.save_hyperparameters(ignore=['c'])
+        print('self.a =', self.a, 'self.b =', self.b)
+        print('There is no self.c =', not hasattr(self, 'c'))
+
+b = B(a=1, b=2, c=3)
+```
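+
+To give a rough idea of how such a method could work,
+here is a minimal sketch that inspects the caller's frame with Python's `inspect` module.
+The class name `HyperParametersSketch` is hypothetical,
+and this is an assumption about one possible approach,
+not necessarily the implementation saved in the d2l library.
+
+```python
+import inspect
+
+class HyperParametersSketch:
+    def save_hyperparameters(self, ignore=()):
+        """Copy the caller's local variables onto `self` as attributes."""
+        # The caller is expected to be `__init__`, one frame up the stack.
+        frame = inspect.currentframe().f_back
+        _, _, _, local_vars = inspect.getargvalues(frame)
+        for name, value in local_vars.items():
+            # Skip `self`, ignored names, and private names such as `__class__`.
+            if name == 'self' or name in ignore or name.startswith('_'):
+                continue
+            setattr(self, name, value)
+
+class B(HyperParametersSketch):
+    def __init__(self, a, b, c):
+        self.save_hyperparameters(ignore=['c'])
+
+b = B(a=1, b=2, c=3)
+```
+
+After construction, `b.a` and `b.b` are available as attributes, whereas `c` is skipped.
+Because the method reads whatever local variables exist at call time,
+it is best called as the first statement of `__init__`.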
+
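+The final utility allows us to plot experiment progress interactively while it is going on. In deference to the much more powerful (and complex) [TensorBoard](https://www.tensorflow.org/tensorboard), we name it `ProgressBoard`. The implementation is deferred to :numref:`sec_utils`. For now, let's simply see it in action. 
+
+The `draw` function plots a point `(x, y)` in the figure, with `label` specified in the legend. The optional `every_n` smooths the line by showing only $1/n$ of the points in the figure. Their values are averaged from the $n$ neighboring points in the original figure.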
+
+```{.python .input}
+%%tab all
+class ProgressBoard(d2l.HyperParameters):  #@save
+    """Plot data points in animation."""
+    def __init__(self, xlabel=None, ylabel=None, xlim=None,
+                 ylim=None, xscale='linear', yscale='linear',
+                 ls=['-', '--', '-.', ':'], colors=['C0', 'C1', 'C2', 'C3'],
+                 fig=None, axes=None, figsize=(3.5, 2.5), display=True):
+        self.save_hyperparameters()
+
+    def draw(self, x, y, label, every_n=1):
+        raise NotImplementedError
+```
+
+In the following example, we draw `sin` and `cos` with different smoothness. If you run this code block, you will see the lines grow in animation.
+
+```{.python .input}
+%%tab all
+board = d2l.ProgressBoard('x')
+for x in np.arange(0, 10, 0.1):
+    board.draw(x, np.sin(x), 'sin', every_n=2)
+    board.draw(x, np.cos(x), 'cos', every_n=10)
+```
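+
+The smoothing controlled by `every_n` can be illustrated with a small standalone function.
+This is a sketch under the assumption that each consecutive group of `every_n` raw points
+is replaced by its mean; the name `average_every_n` is hypothetical
+and the actual plotting code may handle leftover points differently.
+
+```python
+def average_every_n(points, every_n):
+    """Replace each full group of `every_n` consecutive points by its mean."""
+    smoothed = []
+    # Only complete groups are averaged; a trailing partial group is dropped.
+    for start in range(0, len(points) - every_n + 1, every_n):
+        group = points[start:start + every_n]
+        mean_x = sum(x for x, _ in group) / every_n
+        mean_y = sum(y for _, y in group) / every_n
+        smoothed.append((mean_x, mean_y))
+    return smoothed
+
+raw = [(0, 0.0), (1, 2.0), (2, 4.0), (3, 6.0), (4, 8.0)]
+smoothed = average_every_n(raw, every_n=2)
+```
+
+A larger `every_n` yields fewer, smoother points,
+which is why the `cos` curve above is drawn more coarsely than the `sin` curve.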
+
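+## Models
+:label:`oo-design-models`
+
+The `Module` class is the base class of all models we will implement. At a minimum we need to define three methods. The `__init__` method stores the learnable parameters, the `training_step` method accepts a data batch and returns the loss value, and the `configure_optimizers` method returns the optimization method, or a list of them, used to update the learnable parameters. Optionally we can define `validation_step` to report evaluation measures. Sometimes we put the code that computes the output into a separate `forward` method to make it more reusable.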
+
+```{.python .input}
+%%tab all
+class Module(d2l.nn_Module, d2l.HyperParameters):  #@save
+    def __init__(self, plot_train_per_epoch=2, plot_valid_per_epoch=1):
+        super().__init__()
+        self.save_hyperparameters()
+        self.board = ProgressBoard()
+        if tab.selected('tensorflow'):
+            self.training = None
+
+    def loss(self, y_hat, y):
+        raise NotImplementedError
+
+    def forward(self, X):
+        assert hasattr(self, 'net'), 'Neural network is not defined'
+        return self.net(X)
+
+    if tab.selected('tensorflow'):
+        def call(self, X, *args, **kwargs):
+            if kwargs and "training" in kwargs:
+                self.training = kwargs['training']
+            return self.forward(X, *args)
+
+    def plot(self, key, value, train):
+        """Plot a point in animation."""
+        assert hasattr(self, 'trainer'), 'Trainer is not inited'
+        self.board.xlabel = 'epoch'
+        if train:
+            x = self.trainer.train_batch_idx / \
+                self.trainer.num_train_batches
+            n = self.trainer.num_train_batches / \
+                self.plot_train_per_epoch
+        else:
+            x = self.trainer.epoch + 1
+            n = self.trainer.num_val_batches / \
+                self.plot_valid_per_epoch
+        if tab.selected('mxnet', 'tensorflow'):
+            self.board.draw(x, d2l.numpy(value), (
+                'train_' if train else 'val_') + key, every_n=int(n))
+        if tab.selected('pytorch'):
+            self.board.draw(x, d2l.numpy(d2l.to(value, d2l.cpu())),
+                            ('train_' if train else 'val_') + key,
+                            every_n=int(n))
+
+    def training_step(self, batch):
+        l = self.loss(self(*batch[:-1]), batch[-1])
+        self.plot('loss', l, train=True)
+        return l
+
+    def validation_step(self, batch):
+        l = self.loss(self(*batch[:-1]), batch[-1])
+        self.plot('loss', l, train=False)
+
+    def configure_optimizers(self):
+        raise NotImplementedError
+```
+
+:begin_tab:`mxnet`
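+You may notice that `Module` is a subclass of `nn.Block`, the base class of neural networks in Gluon. It provides convenient features for handling neural networks. For example, if we define a `forward` method, such as `forward(self, X)`, then for an instance `a` we can invoke this function by `a(X)`. This works because the built-in `__call__` method calls the `forward` method. You can find more details and examples about `nn.Block` in :numref:`sec_model_construction`.
+:end_tab:
+
+:begin_tab:`pytorch`
+You may notice that `Module` is a subclass of `nn.Module`, the base class of neural networks in PyTorch. It provides convenient features for handling neural networks. For example, if we define a `forward` method, such as `forward(self, X)`, then for an instance `a` we can invoke this function by `a(X)`. This works because the built-in `__call__` method calls the `forward` method. You can find more details and examples about `nn.Module` in :numref:`sec_model_construction`.
+:end_tab:
+
+:begin_tab:`tensorflow`
+You may notice that `Module` is a subclass of `tf.keras.Model`, the base class of neural networks in TensorFlow. It provides convenient features for handling neural networks. For example, its built-in `__call__` method invokes the `call` method. Here we redirect `call` to the `forward` function, saving the `training` argument as a class attribute. We do this to make our code more similar to the other framework implementations.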
+:end_tab:
+
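+## Data
+:label:`oo-design-data`
+
+The `DataModule` class is the base class for data. Quite frequently the `__init__` method is used to prepare the data. This includes downloading and preprocessing if needed. The `train_dataloader` method returns the data loader for the training dataset. A data loader is a (Python) generator that yields a data batch each time it is used. This batch is then fed into the `training_step` method of `Module` to compute the loss. There is an optional `val_dataloader` that returns the validation dataset loader. It behaves in the same manner, except that it yields data batches for the `validation_step` method of `Module`.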
+
+```{.python .input}
+%%tab all
+class DataModule(d2l.HyperParameters):  #@save
+    if tab.selected('mxnet', 'pytorch'):
+        def __init__(self, root='../data', num_workers=4):
+            self.save_hyperparameters()
+
+    if tab.selected('tensorflow'):
+        def __init__(self, root='../data'):
+            self.save_hyperparameters()
+
+    def get_dataloader(self, train):
+        raise NotImplementedError
+
+    def train_dataloader(self):
+        return self.get_dataloader(train=True)
+
+    def val_dataloader(self):
+        return self.get_dataloader(train=False)
+```
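+
+As an illustration of the interface, the following self-contained sketch
+implements the same three methods for a tiny synthetic dataset.
+The class `ToyDataModule`, its constructor arguments, and the data are hypothetical;
+it does not inherit from the d2l classes so that it can run on its own.
+
+```python
+import random
+
+class ToyDataModule:
+    """Standalone stand-in that mirrors the DataModule interface."""
+    def __init__(self, num_train=8, num_val=4, batch_size=4, seed=0):
+        rng = random.Random(seed)
+        n = num_train + num_val
+        # Synthetic regression data drawn from y = 2 * x + 1
+        self.X = [rng.uniform(-1, 1) for _ in range(n)]
+        self.y = [2.0 * x + 1.0 for x in self.X]
+        self.num_train = num_train
+        self.batch_size = batch_size
+
+    def get_dataloader(self, train):
+        # The first `num_train` examples are for training, the rest for validation.
+        if train:
+            indices = list(range(self.num_train))
+            random.shuffle(indices)  # shuffle only the training examples
+        else:
+            indices = list(range(self.num_train, len(self.X)))
+        for start in range(0, len(indices), self.batch_size):
+            batch = indices[start:start + self.batch_size]
+            yield [self.X[i] for i in batch], [self.y[i] for i in batch]
+
+    def train_dataloader(self):
+        return self.get_dataloader(train=True)
+
+    def val_dataloader(self):
+        return self.get_dataloader(train=False)
+
+data = ToyDataModule()
+train_batches = list(data.train_dataloader())
+val_batches = list(data.val_dataloader())
+```
+
+Each element yielded by the loader is a `(features, labels)` pair,
+matching the convention in `training_step` that the last entry of a batch holds the labels.
+Note that a plain generator does not support `len`,
+so a loader used with code that queries the number of batches would need to provide it.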
+
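+## Training
+:label:`oo-design-training`
+
+The `Trainer` class trains the learnable parameters in the `Module` class with data specified in `DataModule`. The key method is `fit`, which accepts two arguments: `model`, an instance of `Module`, and `data`, an instance of `DataModule`. It then iterates over the entire dataset `max_epochs` times to train the model. As before, we defer the implementation of this function to later chapters.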
+
+```{.python .input}
+%%tab all
+class Trainer(d2l.HyperParameters):  #@save
+    def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
+        self.save_hyperparameters()
+        assert num_gpus == 0, 'No GPU support yet'
+
+    def prepare_data(self, data):
+        self.train_dataloader = data.train_dataloader()
+        self.val_dataloader = data.val_dataloader()
+        self.num_train_batches = len(self.train_dataloader)
+        self.num_val_batches = (len(self.val_dataloader)
+                                if self.val_dataloader is not None else 0)
+
+    def prepare_model(self, model):
+        model.trainer = self
+        model.board.xlim = [0, self.max_epochs]
+        self.model = model
+
+    def fit(self, model, data):
+        self.prepare_data(data)
+        self.prepare_model(model)
+        self.optim = model.configure_optimizers()
+        self.epoch = 0
+        self.train_batch_idx = 0
+        self.val_batch_idx = 0
+        for self.epoch in range(self.max_epochs):
+            self.fit_epoch()
+
+    def fit_epoch(self):
+        raise NotImplementedError
+```
+
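+## Summary
+
+To highlight the object-oriented design for our future deep learning implementation, the above classes just show how their objects store data and interact with each other. We will keep enriching the implementations of these classes, such as via `@add_to_class`, in the rest of the book. Moreover, these fully implemented classes are saved in the [d2l library](https://github.com/d2l-ai/d2l-en/tree/master/d2l), a *lightweight toolkit* that makes structured modeling for deep learning easy. In particular, it facilitates reusing many components between projects without changing much at all. For instance, we can replace just the optimizer, just the model, or just the dataset. This degree of modularity pays dividends throughout the book in terms of conciseness and simplicity (this is why we added it), and it can do the same for your own projects.  
+
+## Exercises
+
+1. Locate the full implementations of the above classes that are saved in the [d2l library](https://github.com/d2l-ai/d2l-en/tree/master/d2l). We strongly recommend that you look at the implementation in detail once you have gained some more familiarity with deep learning modeling.
+1. Remove the `save_hyperparameters` statement in the `B` class. Can you still print `self.a` and `self.b`? Optional: if you have dived into the full implementation of the `HyperParameters` class, can you explain why?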
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/6645)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/6646)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/6647)
+:end_tab:
diff --git a/chapter_linear-regression/oo-design_origin.md b/chapter_linear-regression/oo-design_origin.md
new file mode 100644
index 0000000..21f6e1a
--- /dev/null
+++ b/chapter_linear-regression/oo-design_origin.md
@@ -0,0 +1,331 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# Object-Oriented Design for Implementation
+:label:`sec_oo-design`
+
+In our introduction to linear regression,
+we walked through various components
+including
+the data, the model, the loss function,
+and the optimization algorithm.
+Indeed,
+linear regression is
+one of the simplest machine learning models.
+Training it,
+however, uses many of the same components that other models in this book require.
+Therefore, 
+before diving into the implementation details
+it is worth 
+designing some of the APIs
+used throughout this book. 
+Treating components in deep learning
+as objects,
+we can start by
+defining classes for these objects
+and their interactions.
+This object-oriented design
+for implementation
+will greatly
+streamline the presentation and you might even want to use it in your projects.
+
+
+Inspired by open-source libraries such as [PyTorch Lightning](https://www.pytorchlightning.ai/),
+on a high level
+we wish to have three classes: 
+(i) `Module` contains models, losses, and optimization methods; 
+(ii) `DataModule` provides data loaders for training and validation; 
+(iii) both classes are combined using the `Trainer` class, which allows us to
+train models on a variety of hardware platforms. 
+Most code in this book adapts `Module` and `DataModule`. We will touch upon the `Trainer` class only when we discuss GPUs, CPUs, parallel training, and optimization algorithms.
+
+```{.python .input}
+%%tab mxnet
+import time
+import numpy as np
+from d2l import mxnet as d2l
+from mxnet.gluon import nn
+```
+
+```{.python .input}
+%%tab pytorch
+import time
+import numpy as np
+from d2l import torch as d2l
+import torch
+from torch import nn
+```
+
+```{.python .input}
+%%tab tensorflow
+import time
+import numpy as np
+from d2l import tensorflow as d2l
+import tensorflow as tf
+```
+
+## Utilities
+:label:`oo-design-utilities`
+
+We need a few utilities to simplify object-oriented programming in Jupyter notebooks. One of the challenges is that class definitions tend to be fairly long blocks of code. Notebook readability demands short code fragments, interspersed with explanations, a requirement incompatible with the style of programming common for Python libraries. The first
+utility function allows us to register functions as methods in a class *after* the class has been created. In fact, we can do so *even after* we've created instances of the class! It allows us to split the implementation of a class into multiple code blocks.
+
+```{.python .input}
+%%tab all
+def add_to_class(Class):  #@save
+    def wrapper(obj):
+        setattr(Class, obj.__name__, obj)
+    return wrapper
+```
+
+Let's have a quick look at how to use it. We plan to implement a class `A` with a method `do`. Instead of having code for both `A` and `do` in the same code block, we can first declare the class `A` and create an instance `a`.
+
+```{.python .input}
+%%tab all
+class A:
+    def __init__(self):
+        self.b = 1
+
+a = A()
+```
+
+Next we define the method `do` as we normally would, but not in class `A`'s scope. Instead, we decorate this method by `add_to_class` with class `A` as its argument. In doing so, the method is able to access the member variables of `A` as we would expect if it had been defined as part of `A`'s definition. Let's see what happens when we invoke it for the instance `a`.
+
+```{.python .input}
+%%tab all
+@add_to_class(A)
+def do(self):
+    print('Class attribute "b" is', self.b)
+
+a.do()
+```
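Why does `a.do()` work even though `a` was created before `do` existed? One way to see it, sketched here in plain Python without `d2l`: an instance does not copy its methods, it looks them up on its class at call time. The names `attach_to`, `Greeter`, and `greet` are made up for this illustration, and unlike `add_to_class` above, this variant also returns the function so that the module-level name stays bound to it.

```python
def attach_to(cls):
    """Return a decorator that registers a function as a method of cls."""
    def wrapper(func):
        setattr(cls, func.__name__, func)
        return func  # keep the module-level name bound to the function
    return wrapper


class Greeter:
    def __init__(self, name):
        self.name = name


g = Greeter('d2l')                # the instance is created first
assert not hasattr(g, 'greet')    # the method does not exist yet


@attach_to(Greeter)
def greet(self):
    return 'hello, ' + self.name


# The function is stored on the class, not on the instance ...
assert 'greet' in vars(Greeter) and 'greet' not in vars(g)
# ... yet the existing instance finds it through normal attribute lookup.
print(g.greet())
```

Because the lookup happens on every call, re-registering a function under the same name also changes the behavior of instances that already exist.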
+
+The second utility is a class that saves all arguments of a class's `__init__` method as attributes of the instance. This allows us to extend constructor call signatures implicitly without additional code.
+
+```{.python .input}
+%%tab all
+class HyperParameters:  #@save
+    def save_hyperparameters(self, ignore=[]):
+        raise NotImplementedError
+```
+
+We defer its implementation to :numref:`sec_utils`. To use it, we define our class that inherits from `HyperParameters` and calls `save_hyperparameters` in the `__init__` method.
+
+```{.python .input}
+%%tab all
+# Call the fully implemented HyperParameters class saved in d2l
+class B(d2l.HyperParameters):
+    def __init__(self, a, b, c):
+        self.save_hyperparameters(ignore=['c'])
+        print('self.a =', self.a, 'self.b =', self.b)
+        print('There is no self.c =', not hasattr(self, 'c'))
+
+b = B(a=1, b=2, c=3)
+```
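To give a feel for how such a helper can work before reaching the full implementation, here is a minimal sketch built on the standard `inspect` module. It is an assumption about one possible approach, not necessarily the code saved in `d2l`; the class names `HyperParametersSketch` and `Demo` are hypothetical.

```python
import inspect


class HyperParametersSketch:
    """Hypothetical stand-in for d2l.HyperParameters."""
    def save_hyperparameters(self, ignore=()):
        # Look at the frame of the caller, i.e. the __init__ that invoked us
        frame = inspect.currentframe().f_back
        _, _, _, local_vars = inspect.getargvalues(frame)
        for name, value in local_vars.items():
            # Skip self, ignored names, and private or dunder locals
            if name == 'self' or name in ignore or name.startswith('_'):
                continue
            setattr(self, name, value)


class Demo(HyperParametersSketch):
    def __init__(self, a, b, c):
        self.save_hyperparameters(ignore=['c'])


demo = Demo(a=1, b=2, c=3)
print('demo.a =', demo.a, 'demo.b =', demo.b)
print('demo has c:', hasattr(demo, 'c'))
```

Note that the values end up on the instance (`demo.a`), not on the class, so two objects built with different arguments keep separate hyperparameters.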
+
+The last utility allows us to plot experiment progress interactively while it is going on. In deference to the much more powerful (and complex) [TensorBoard](https://www.tensorflow.org/tensorboard) we name it `ProgressBoard`. The implementation is deferred to :numref:`sec_utils`. For now, let's simply see it in action.
+
+The `draw` function plots a point `(x, y)` in the figure, with `label` specified in the legend. The optional `every_n` smooths the line by showing only $1/n$ of the points in the figure; each displayed value is the average of $n$ neighboring points in the original data.
+
+```{.python .input}
+%%tab all
+class ProgressBoard(d2l.HyperParameters):  #@save
+    """Plot data points in animation."""
+    def __init__(self, xlabel=None, ylabel=None, xlim=None,
+                 ylim=None, xscale='linear', yscale='linear',
+                 ls=['-', '--', '-.', ':'], colors=['C0', 'C1', 'C2', 'C3'],
+                 fig=None, axes=None, figsize=(3.5, 2.5), display=True):
+        self.save_hyperparameters()
+
+    def draw(self, x, y, label, every_n=1):
+        raise NotImplementedError
+```
+
+In the following example, we draw `sin` and `cos` with a different smoothness. If you run this code block, you will see the lines grow in animation.
+
+```{.python .input}
+%%tab all
+board = d2l.ProgressBoard('x')
+for x in np.arange(0, 10, 0.1):
+    board.draw(x, np.sin(x), 'sin', every_n=2)
+    board.draw(x, np.cos(x), 'cos', every_n=10)
+```
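The effect of `every_n` can be illustrated without any plotting. The sketch below is one reading of the description above, assuming that consecutive groups of `every_n` points are replaced by their mean and that a trailing incomplete group is simply dropped; the real `ProgressBoard` may handle such details differently. The function name `average_every_n` is hypothetical.

```python
def average_every_n(xs, ys, every_n):
    """Replace each run of every_n consecutive points by its mean point."""
    smoothed = []
    # Walk over complete groups only; a short trailing group is ignored
    for start in range(0, len(xs) - every_n + 1, every_n):
        group_x = xs[start:start + every_n]
        group_y = ys[start:start + every_n]
        smoothed.append((sum(group_x) / every_n, sum(group_y) / every_n))
    return smoothed


xs = [0, 1, 2, 3, 4]
ys = [0, 2, 4, 6, 8]
print(average_every_n(xs, ys, every_n=2))
print(average_every_n(xs, ys, every_n=1))
```

With `every_n=1` every point is kept unchanged, which matches the default of `draw`; larger values trade resolution for a smoother curve.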
+
+## Models
+:label:`oo-design-models`
+
+The `Module` class is the base class of all models we will implement. At a minimum we need to define three methods: the `__init__` method stores the learnable parameters, the `training_step` method accepts a data batch and returns the loss value, and the `configure_optimizers` method returns the optimization method, or a list of them, used to update the learnable parameters. Optionally we can define `validation_step` to report the evaluation measures.
+Sometimes we put the code to compute the output into a separate `forward` method to make it more reusable.
+
+```{.python .input}
+%%tab all
+class Module(d2l.nn_Module, d2l.HyperParameters):  #@save
+    def __init__(self, plot_train_per_epoch=2, plot_valid_per_epoch=1):
+        super().__init__()
+        self.save_hyperparameters()
+        self.board = ProgressBoard()
+        if tab.selected('tensorflow'):
+            self.training = None
+
+    def loss(self, y_hat, y):
+        raise NotImplementedError
+
+    def forward(self, X):
+        assert hasattr(self, 'net'), 'Neural network is not defined'
+        return self.net(X)
+
+    if tab.selected('tensorflow'):
+        def call(self, X, *args, **kwargs):
+            if kwargs and "training" in kwargs:
+                self.training = kwargs['training']
+            return self.forward(X, *args)
+
+    def plot(self, key, value, train):
+        """Plot a point in animation."""
+        assert hasattr(self, 'trainer'), 'Trainer is not inited'
+        self.board.xlabel = 'epoch'
+        if train:
+            x = self.trainer.train_batch_idx / \
+                self.trainer.num_train_batches
+            n = self.trainer.num_train_batches / \
+                self.plot_train_per_epoch
+        else:
+            x = self.trainer.epoch + 1
+            n = self.trainer.num_val_batches / \
+                self.plot_valid_per_epoch
+        if tab.selected('mxnet', 'tensorflow'):
+            self.board.draw(x, d2l.numpy(value), (
+                'train_' if train else 'val_') + key, every_n=int(n))
+        if tab.selected('pytorch'):
+            self.board.draw(x, d2l.numpy(d2l.to(value, d2l.cpu())),
+                            ('train_' if train else 'val_') + key,
+                            every_n=int(n))
+
+    def training_step(self, batch):
+        l = self.loss(self(*batch[:-1]), batch[-1])
+        self.plot('loss', l, train=True)
+        return l
+
+    def validation_step(self, batch):
+        l = self.loss(self(*batch[:-1]), batch[-1])
+        self.plot('loss', l, train=False)
+
+    def configure_optimizers(self):
+        raise NotImplementedError
+```
+
+:begin_tab:`mxnet`
+You may notice that `Module` is a subclass of `nn.Block`, the base class of neural networks in Gluon.
+It provides convenient features to handle neural networks. For example, if we define a `forward` method, such as `forward(self, X)`, then for an instance `a` we can invoke this function by `a(X)`. This works since it calls the `forward` method in the built-in `__call__` method. You can find more details and examples about `nn.Block` in :numref:`sec_model_construction`.
+:end_tab:
+
+:begin_tab:`pytorch`
+You may notice that `Module` is a subclass of `nn.Module`, the base class of neural networks in PyTorch.
+It provides convenient features to handle neural networks. For example, if we define a `forward` method, such as `forward(self, X)`, then for an instance `a` we can invoke this function by `a(X)`. This works since it calls the `forward` method in the built-in `__call__` method. You can find more details and examples about `nn.Module` in :numref:`sec_model_construction`.
+:end_tab:
+
+:begin_tab:`tensorflow`
+You may notice that `Module` is a subclass of `tf.keras.Model`, the base class of neural networks in TensorFlow.
+It provides convenient features to handle neural networks. For example, it invokes the `call` method in the built-in `__call__` method. Here we redirect `call` to the `forward` function, saving its arguments as a class attribute. We do this to make our code more similar to other framework implementations.
+:end_tab:
+
+## Data
+:label:`oo-design-data`
+
+The `DataModule` class is the base class for data. Quite frequently the `__init__` method is used to prepare the data. This includes downloading and preprocessing if needed. The `train_dataloader` returns the data loader for the training dataset. A data loader is a (Python) generator that yields a data batch each time it is used. This batch is then fed into the `training_step` method of `Module` to compute the loss. There is an optional `val_dataloader` to return the validation dataset loader. It behaves in the same manner, except that it yields data batches for the `validation_step` method in `Module`.
+
+```{.python .input}
+%%tab all
+class DataModule(d2l.HyperParameters):  #@save
+    if tab.selected('mxnet', 'pytorch'):
+        def __init__(self, root='../data', num_workers=4):
+            self.save_hyperparameters()
+
+    if tab.selected('tensorflow'):
+        def __init__(self, root='../data'):
+            self.save_hyperparameters()
+
+    def get_dataloader(self, train):
+        raise NotImplementedError
+
+    def train_dataloader(self):
+        return self.get_dataloader(train=True)
+
+    def val_dataloader(self):
+        return self.get_dataloader(train=False)
+```
+
+## Training
+:label:`oo-design-training`
+
+The `Trainer` class trains the learnable parameters in the `Module` class with data specified in `DataModule`. The key method is `fit`, which accepts two arguments: `model`, an instance of `Module`, and `data`, an instance of `DataModule`. It then iterates over the entire dataset `max_epochs` times to train the model. As before, we will defer the implementation of this function to later chapters.
+
+```{.python .input}
+%%tab all
+class Trainer(d2l.HyperParameters):  #@save
+    def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
+        self.save_hyperparameters()
+        assert num_gpus == 0, 'No GPU support yet'
+
+    def prepare_data(self, data):
+        self.train_dataloader = data.train_dataloader()
+        self.val_dataloader = data.val_dataloader()
+        self.num_train_batches = len(self.train_dataloader)
+        self.num_val_batches = (len(self.val_dataloader)
+                                if self.val_dataloader is not None else 0)
+
+    def prepare_model(self, model):
+        model.trainer = self
+        model.board.xlim = [0, self.max_epochs]
+        self.model = model
+
+    def fit(self, model, data):
+        self.prepare_data(data)
+        self.prepare_model(model)
+        self.optim = model.configure_optimizers()
+        self.epoch = 0
+        self.train_batch_idx = 0
+        self.val_batch_idx = 0
+        for self.epoch in range(self.max_epochs):
+            self.fit_epoch()
+
+    def fit_epoch(self):
+        raise NotImplementedError
+```
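Since `fit_epoch` is left unimplemented here, it may help to see the three roles interact in a framework-free toy. The sketch below is only an illustration under simplifying assumptions: it fits a single scalar weight with hand-written gradients, it replaces `configure_optimizers` with a hypothetical `apply_gradient` method, and all class names (`ToyData`, `ToyModel`, `ToyTrainer`) are made up. The actual `fit_epoch` arrives later in the book.

```python
class ToyData:
    """Plays the role of DataModule: owns the data and hands out batches."""
    def __init__(self, true_w=2.0, num_examples=20, batch_size=5):
        self.xs = [i / num_examples for i in range(num_examples)]
        self.ys = [true_w * x for x in self.xs]
        self.batch_size = batch_size

    def train_dataloader(self):
        step = self.batch_size
        return [(self.xs[i:i + step], self.ys[i:i + step])
                for i in range(0, len(self.xs), step)]


class ToyModel:
    """Plays the role of Module: owns the parameter and the loss."""
    def __init__(self, lr=0.5):
        self.w = 0.0
        self.lr = lr

    def training_step(self, batch):
        xs, ys = batch
        errors = [self.w * x - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / len(xs)
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * e * x for e, x in zip(errors, xs)) / len(xs)
        return loss, grad

    def apply_gradient(self, grad):
        # Plain gradient descent step (stand-in for an optimizer)
        self.w -= self.lr * grad


class ToyTrainer:
    """Plays the role of Trainer: wires model and data together."""
    def __init__(self, max_epochs):
        self.max_epochs = max_epochs

    def fit(self, model, data):
        self.model = model
        self.train_dataloader = data.train_dataloader()
        for self.epoch in range(self.max_epochs):
            self.fit_epoch()

    def fit_epoch(self):
        for batch in self.train_dataloader:
            _, grad = self.model.training_step(batch)
            self.model.apply_gradient(grad)


model = ToyModel()
ToyTrainer(max_epochs=10).fit(model, ToyData())
print('learned w:', model.w)
```

The point of the split is visible even at this scale: swapping in a different dataset or a different update rule touches only one of the three classes.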
+
+## Summary
+
+To highlight the object-oriented design
+for our future deep learning implementation,
+the above classes just show how their objects 
+store data and interact with each other.
+We will keep enriching implementations of these classes,
+such as via `@add_to_class`,
+in the rest of the book.
+Moreover,
+these fully implemented classes
+are saved in the [d2l library](https://github.com/d2l-ai/d2l-en/tree/master/d2l),
+a *lightweight toolkit* that makes structured modeling for deep learning easy. 
+In particular, it facilitates reusing many components between projects without changing much at all. For instance, we can replace just the optimizer, just the model, just the dataset, etc.;
+this degree of modularity pays dividends throughout the book in terms of conciseness and simplicity (this is why we added it) and it can do the same for your own projects. 
+
+
+## Exercises
+
+1. Locate full implementations of the above classes that are saved in the [d2l library](https://github.com/d2l-ai/d2l-en/tree/master/d2l). We strongly recommend that you look at the implementation in detail once you have gained some more familiarity with deep learning modeling.
+1. Remove the `save_hyperparameters` statement in the `B` class. Can you still print `self.a` and `self.b`? Optional: if you have dived into the full implementation of the `HyperParameters` class, can you explain why?
+
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/6645)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/6646)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/6647)
+:end_tab:
diff --git a/chapter_linear-regression/synthetic-regression-data.md b/chapter_linear-regression/synthetic-regression-data.md
new file mode 100644
index 0000000..864dffd
--- /dev/null
+++ b/chapter_linear-regression/synthetic-regression-data.md
@@ -0,0 +1,180 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
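+# Synthetic Regression Data
+:label:`sec_synthetic-regression-data`
+
+Machine learning is all about extracting information from data. So you might wonder what we could possibly learn from synthetic data. While we might not care intrinsically about the patterns that we ourselves baked into an artificial data-generating model, such datasets are nevertheless useful for didactic purposes, helping us evaluate the properties of our learning algorithms and confirm that our implementations work as expected. For example, if we create data for which the correct parameters are known *a priori*, then we can verify that our model can in fact recover them.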
+
+```{.python .input}
+%%tab mxnet
+%matplotlib inline
+from d2l import mxnet as d2l
+from mxnet import np, npx, gluon
+import random
+npx.set_np()
+```
+
+```{.python .input}
+%%tab pytorch
+%matplotlib inline
+from d2l import torch as d2l
+import torch
+import random
+```
+
+```{.python .input}
+%%tab tensorflow
+%matplotlib inline
+from d2l import tensorflow as d2l
+import tensorflow as tf
+import random
+```
+
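+## Generating the Dataset
+
+For this example, we work in low dimensions for succinctness. The following code snippet generates 1000 examples with 2-dimensional features drawn from a standard normal distribution. The resulting design matrix $\mathbf{X}$ belongs to $\mathbb{R}^{1000 \times 2}$. We generate each label by applying a *ground truth* linear function and corrupting the result with additive noise $\epsilon$, drawn independently and identically for each example:
+
+(**$$\mathbf{y}= \mathbf{X} \mathbf{w} + b + \mathbf\epsilon.$$**)
+
+For convenience we assume that $\epsilon$ is drawn from a normal distribution with mean $\mu= 0$ and standard deviation $\sigma = 0.01$. Note that for object-oriented design we add the code to the `__init__` method of a subclass of `d2l.DataModule` (introduced in :numref:`oo-design-data`). It is good practice to allow setting any additional hyperparameters. We accomplish this with `save_hyperparameters()`. The `batch_size` will be determined later on.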
+
+```{.python .input}
+%%tab all
+class SyntheticRegressionData(d2l.DataModule):  #@save
+    def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000, 
+                 batch_size=32):
+        super().__init__()
+        self.save_hyperparameters()
+        n = num_train + num_val
+        if tab.selected('pytorch') or tab.selected('mxnet'):                
+            self.X = d2l.randn(n, len(w))
+            noise = d2l.randn(n, 1) * noise
+        if tab.selected('tensorflow'):
+            self.X = tf.random.normal((n, w.shape[0]))
+            noise = tf.random.normal((n, 1)) * noise            
+        self.y = d2l.matmul(self.X, d2l.reshape(w, (-1, 1))) + b + noise
+```
+
+Below, we set the true parameters to $\mathbf{w} = [2, -3.4]^\top$ and $b = 4.2$. Later, we can check our estimated parameters against these *ground truth* values.
+
+```{.python .input}
+%%tab all
+data = SyntheticRegressionData(w=d2l.tensor([2, -3.4]), b=4.2)
+```
+
+[**Each row in `features` consists of a vector in $\mathbb{R}^2$ and each row in `labels` is a scalar.**] Let's have a look at the first entry.
+
+```{.python .input}
+%%tab all
+print('features:', data.X[0],'\nlabel:', data.y[0])
+```
+
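+## Reading the Dataset
+
+Training machine learning models often requires multiple passes over a dataset, grabbing one minibatch of examples at a time. This data is then used to update the model. To illustrate how this works, we [**implement the `get_dataloader` function,**] registering it as a method in the `SyntheticRegressionData` class via `add_to_class` (introduced in :numref:`oo-design-utilities`). It (**draws on the batch size, the matrix of features, and the vector of labels stored in the object, and generates minibatches of size `batch_size`.**) As such, each minibatch consists of a tuple of features and labels. Note that we need to be mindful of whether we are in training or validation mode: in the former, we want to read the data in random order, whereas in the latter, being able to read data in a predefined order may be important for debugging purposes.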
+
+```{.python .input}
+%%tab all
+@d2l.add_to_class(SyntheticRegressionData)
+def get_dataloader(self, train):
+    if train:
+        indices = list(range(0, self.num_train))
+        # The examples are read in random order
+        random.shuffle(indices)
+    else:
+        indices = list(range(self.num_train, self.num_train+self.num_val))
+    for i in range(0, len(indices), self.batch_size):
+        if tab.selected('mxnet') or tab.selected('pytorch'):
+            batch_indices = d2l.tensor(indices[i: i+self.batch_size])
+            yield self.X[batch_indices], self.y[batch_indices]
+        if tab.selected('tensorflow'):
+            j = tf.constant(indices[i : i+self.batch_size])
+            yield tf.gather(self.X, j), tf.gather(self.y, j)
+```
+
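+To build some intuition, let's inspect the first minibatch of data. Each minibatch of features provides us with both its size and the dimensionality of the input features. Likewise, our minibatch of labels will have a matching shape given by `batch_size`.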
+
+```{.python .input}
+%%tab all
+X, y = next(iter(data.train_dataloader()))
+print('X shape:', X.shape, '\ny shape:', y.shape)
+```
+
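+While seemingly innocuous, the invocation of `iter(data.train_dataloader())` illustrates the power of Python's object-oriented design. Note that we added a method to the `SyntheticRegressionData` class *after* creating the `data` object. Nonetheless, the object benefits from the *ex post facto* addition of functionality to the class.
+
+Throughout the iteration we obtain distinct minibatches until the entire dataset has been exhausted (try this). While the iteration implemented above is good for didactic purposes, it is inefficient in ways that might get us in trouble on real problems. For example, it requires that we load all the data in memory and that we perform lots of random memory access. The built-in iterators implemented in a deep learning framework are considerably more efficient, and they can deal with sources such as data stored in files, data received via a stream, and data generated or processed on the fly. Next let's try to implement the same function using built-in iterators.
+
+## Concise Implementation of the Data Loader
+
+Rather than writing our own iterator, we can [**call the existing API in a framework to load data.**] As before, we need a dataset with features `X` and labels `y`. Beyond that, we set `batch_size` in the built-in data loader and let it take care of shuffling examples efficiently.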
+
+```{.python .input}
+%%tab all
+@d2l.add_to_class(d2l.DataModule)  #@save
+def get_tensorloader(self, tensors, train, indices=slice(0, None)):
+    tensors = tuple(a[indices] for a in tensors)
+    if tab.selected('mxnet'):
+        dataset = gluon.data.ArrayDataset(*tensors)
+        return gluon.data.DataLoader(dataset, self.batch_size,
+                                     shuffle=train)
+    if tab.selected('pytorch'):
+        dataset = torch.utils.data.TensorDataset(*tensors)
+        return torch.utils.data.DataLoader(dataset, self.batch_size,
+                                           shuffle=train)
+    if tab.selected('tensorflow'):
+        shuffle_buffer = tensors[0].shape[0] if train else 1
+        return tf.data.Dataset.from_tensor_slices(tensors).shuffle(
+            buffer_size=shuffle_buffer).batch(self.batch_size)
+
+@d2l.add_to_class(SyntheticRegressionData)  #@save
+def get_dataloader(self, train):
+    i = slice(0, self.num_train) if train else slice(self.num_train, None)
+    return self.get_tensorloader((self.X, self.y), train, i)
+```
+
+The new data loader behaves just like the previous one, except that it is more efficient and has some added functionality.
+
+```{.python .input  n=4}
+%%tab all
+X, y = next(iter(data.train_dataloader()))
+print('X shape:', X.shape, '\ny shape:', y.shape)
+```
+
+For instance, the data loader provided by the framework API supports the built-in `__len__` method, so we can query its length, i.e., the number of batches.
+
+```{.python .input}
+%%tab all
+len(data.train_dataloader())
+```
+
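+## Summary
+
+Data loaders are a convenient way of abstracting out the process of loading and manipulating data. This way the same machine learning *algorithm* is capable of processing many different types and sources of data without the need for modification. One of the nice things about data loaders is that they can be composed. For instance, we might be loading images and then have a post-processing filter that crops them or modifies them otherwise. As such, data loaders can be used to describe an entire data processing pipeline.
+
+As for the model itself, the two-dimensional linear model is about as simple a model as we might encounter. It lets us test out the accuracy of regression models without worrying about having insufficient amounts of data or an underdetermined system of equations. We will put this to good use in the next section.
+
+## Exercises
+
+1. What will happen if the number of examples cannot be divided by the batch size? How would you change this behavior by specifying a different argument in the framework's API?
+1. What if we want to generate a huge dataset, where both the size of the parameter vector `w` and the number of examples `num_examples` are large?
+    1. What happens if we cannot hold all data in memory?
+    1. How would you shuffle the data if it is held on disk? Your task is to design an *efficient* algorithm that does not require too many random reads or writes. Hint: [pseudorandom permutation generators](https://en.wikipedia.org/wiki/Pseudorandom_permutation) allow you to design a reshuffle without the need to store the permutation table explicitly :cite:`Naor.Reingold.1999`.
+1. Implement a data generator that produces new data on the fly, every time the iterator is called.
+1. How would you design a random data generator that generates *the same* data each time it is called?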
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/6662)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/6663)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/6664)
+:end_tab:
diff --git a/chapter_linear-regression/synthetic-regression-data_origin.md b/chapter_linear-regression/synthetic-regression-data_origin.md
new file mode 100644
index 0000000..fb593ac
--- /dev/null
+++ b/chapter_linear-regression/synthetic-regression-data_origin.md
@@ -0,0 +1,261 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# Synthetic Regression Data
+:label:`sec_synthetic-regression-data`
+
+
+Machine learning is all about extracting information from data.
+So you might wonder, what could we possibly learn from synthetic data?
+While we might not care intrinsically about the patterns 
+that we ourselves baked into an artificial data generating model,
+such datasets are nevertheless useful for didactic purposes,
+helping us to evaluate the properties of our learning 
+algorithms and to confirm that our implementations work as expected.
+For example, if we create data for which the correct parameters are known *a priori*,
+then we can verify that our model can in fact recover them.
+
+```{.python .input}
+%%tab mxnet
+%matplotlib inline
+from d2l import mxnet as d2l
+from mxnet import np, npx, gluon
+import random
+npx.set_np()
+```
+
+```{.python .input}
+%%tab pytorch
+%matplotlib inline
+from d2l import torch as d2l
+import torch
+import random
+```
+
+```{.python .input}
+%%tab tensorflow
+%matplotlib inline
+from d2l import tensorflow as d2l
+import tensorflow as tf
+import random
+```
+
+## Generating the Dataset
+
+For this example, we will work in low dimensions
+for succinctness.
+The following code snippet generates 1000 examples
+with 2-dimensional features drawn 
+from a standard normal distribution.
+The resulting design matrix $\mathbf{X}$
+belongs to $\mathbb{R}^{1000 \times 2}$. 
+We generate each label by applying 
+a *ground truth* linear function 
+and corrupting the result with additive noise $\epsilon$, 
+drawn independently and identically for each example:
+
+(**$$\mathbf{y}= \mathbf{X} \mathbf{w} + b + \mathbf\epsilon.$$**)
+
+For convenience we assume that $\epsilon$ is drawn 
+from a normal distribution with mean $\mu= 0$ 
+and standard deviation $\sigma = 0.01$.
+Note that for object-oriented design
+we add the code to the `__init__` method of a subclass of `d2l.DataModule` (introduced in :numref:`oo-design-data`). 
+It's good practice to allow setting any additional hyperparameters. 
+We accomplish this with `save_hyperparameters()`. 
+The `batch_size` will be determined later on.
+
+```{.python .input}
+%%tab all
+class SyntheticRegressionData(d2l.DataModule):  #@save
+    def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000, 
+                 batch_size=32):
+        super().__init__()
+        self.save_hyperparameters()
+        n = num_train + num_val
+        if tab.selected('pytorch') or tab.selected('mxnet'):                
+            self.X = d2l.randn(n, len(w))
+            noise = d2l.randn(n, 1) * noise
+        if tab.selected('tensorflow'):
+            self.X = tf.random.normal((n, w.shape[0]))
+            noise = tf.random.normal((n, 1)) * noise            
+        self.y = d2l.matmul(self.X, d2l.reshape(w, (-1, 1))) + b + noise
+```
+
+Below, we set the true parameters to $\mathbf{w} = [2, -3.4]^\top$ and $b = 4.2$.
+Later, we can check our estimated parameters against these *ground truth* values.
+
+```{.python .input}
+%%tab all
+data = SyntheticRegressionData(w=d2l.tensor([2, -3.4]), b=4.2)
+```
+
+[**Each row in `features` consists of a vector in $\mathbb{R}^2$ and each row in `labels` is a scalar.**] Let's have a look at the first entry.
+
+```{.python .input}
+%%tab all
+print('features:', data.X[0],'\nlabel:', data.y[0])
+```
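The generating equation can also be written out in plain Python, which makes each term of $\mathbf{y} = \mathbf{X}\mathbf{w} + b + \epsilon$ explicit. This is a sketch under stated assumptions, not the book's implementation: it uses only the standard `random` module, and the function name `make_synthetic_data` and its `seed` parameter are hypothetical.

```python
import random


def make_synthetic_data(w, b, noise=0.01, num_examples=1000, seed=0):
    """Draw features from N(0, 1) and labels from w.x + b + N(0, noise^2)."""
    rng = random.Random(seed)  # private generator, so results are repeatable
    features, labels = [], []
    for _ in range(num_examples):
        x = [rng.gauss(0.0, 1.0) for _ in w]
        clean = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        features.append(x)
        labels.append(clean + rng.gauss(0.0, noise))
    return features, labels


true_w, true_b = [2.0, -3.4], 4.2
features, labels = make_synthetic_data(true_w, true_b)

# Subtracting the noiseless prediction leaves only the noise term
residuals = [y - (sum(w_j * x_j for w_j, x_j in zip(true_w, x)) + true_b)
             for x, y in zip(features, labels)]
print('first example:', features[0], labels[0])
print('largest |residual|:', max(abs(r) for r in residuals))
```

Because the noise has standard deviation 0.01, the residuals stay tiny compared with the labels, which is what lets a fitted model recover `true_w` and `true_b` almost exactly.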
+
+## Reading the Dataset
+
+Training machine learning models often requires multiple passes over a dataset, 
+grabbing one minibatch of examples at a time. 
+This data is then used to update the model. 
+To illustrate how this works, we 
+[**implement the `get_dataloader` function,**] 
+registering it as a method in the `SyntheticRegressionData` class via `add_to_class` (introduced in :numref:`oo-design-utilities`).
+It (**draws on the batch size, the matrix of features,
+and the vector of labels stored in the object, and generates minibatches of size `batch_size`.**)
+As such, each minibatch consists of a tuple of features and labels. 
+Note that we need to be mindful of whether we're in training or validation mode: 
+in the former, we will want to read the data in random order, 
+whereas for the latter, being able to read data in a pre-defined order 
+may be important for debugging purposes.
+
+```{.python .input}
+%%tab all
+@d2l.add_to_class(SyntheticRegressionData)
+def get_dataloader(self, train):
+    if train:
+        indices = list(range(0, self.num_train))
+        # The examples are read in random order
+        random.shuffle(indices)
+    else:
+        indices = list(range(self.num_train, self.num_train+self.num_val))
+    for i in range(0, len(indices), self.batch_size):
+        if tab.selected('mxnet') or tab.selected('pytorch'):
+            batch_indices = d2l.tensor(indices[i: i+self.batch_size])
+            yield self.X[batch_indices], self.y[batch_indices]
+        if tab.selected('tensorflow'):
+            j = tf.constant(indices[i : i+self.batch_size])
+            yield tf.gather(self.X, j), tf.gather(self.y, j)
+```
+
+To build some intuition, let's inspect the first minibatch of
+data. Each minibatch of features provides us with both its size and the dimensionality of input features.
+Likewise, our minibatch of labels will have a matching shape given by `batch_size`.
+
+```{.python .input}
+%%tab all
+X, y = next(iter(data.train_dataloader()))
+print('X shape:', X.shape, '\ny shape:', y.shape)
+```
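Stripped of the tensor libraries, the same minibatch logic fits in a short generator. The sketch below assumes features and labels are ordinary Python lists; the function name `minibatches` and its `seed` argument are hypothetical and only serve the illustration.

```python
import random


def minibatches(features, labels, batch_size, shuffle, seed=None):
    """Yield (features, labels) minibatches, optionally in random order."""
    indices = list(range(len(features)))
    if shuffle:
        # Training mode: visit the examples in random order
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        yield [features[i] for i in batch], [labels[i] for i in batch]


features = [[float(i), float(-i)] for i in range(10)]
labels = list(range(10))

# Validation mode: fixed order, and the last batch may be smaller
for X_batch, y_batch in minibatches(features, labels, 4, shuffle=False):
    print(len(X_batch), y_batch)
```

Each example appears exactly once per pass, whether or not the order is shuffled; only the grouping into batches changes.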
+
+While seemingly innocuous, the invocation 
+of `iter(data.train_dataloader())` 
+illustrates the power of Python's object-oriented design. 
+Note that we added a method to the `SyntheticRegressionData` class
+*after* creating the `data` object. 
+Nonetheless, the object benefits from 
+the *ex post facto* addition of functionality to the class.
+
+Throughout the iteration we obtain distinct minibatches
+until the entire dataset has been exhausted (try this).
+While the iteration implemented above is good for didactic purposes,
+it is inefficient in ways that might get us in trouble on real problems.
+For example, it requires that we load all the data in memory
+and that we perform lots of random memory access.
+The built-in iterators implemented in a deep learning framework
+are considerably more efficient and they can deal
+with sources such as data stored in files, 
+data received via a stream, 
+and data generated or processed on the fly. 
+Next let's try to implement the same function using built-in iterators.
+
+## Concise Implementation of the Data Loader
+
+Rather than writing our own iterator,
+we can [**call the existing API in a framework to load data.**]
+As before, we need a dataset with features `X` and labels `y`. 
+Beyond that, we set `batch_size` in the built-in data loader 
+and let it take care of shuffling examples efficiently.
+
+```{.python .input}
+%%tab all
+@d2l.add_to_class(d2l.DataModule)  #@save
+def get_tensorloader(self, tensors, train, indices=slice(0, None)):
+    tensors = tuple(a[indices] for a in tensors)
+    if tab.selected('mxnet'):
+        dataset = gluon.data.ArrayDataset(*tensors)
+        return gluon.data.DataLoader(dataset, self.batch_size,
+                                     shuffle=train)
+    if tab.selected('pytorch'):
+        dataset = torch.utils.data.TensorDataset(*tensors)
+        return torch.utils.data.DataLoader(dataset, self.batch_size,
+                                           shuffle=train)
+    if tab.selected('tensorflow'):
+        shuffle_buffer = tensors[0].shape[0] if train else 1
+        return tf.data.Dataset.from_tensor_slices(tensors).shuffle(
+            buffer_size=shuffle_buffer).batch(self.batch_size)
+
+@d2l.add_to_class(SyntheticRegressionData)  #@save
+def get_dataloader(self, train):
+    i = slice(0, self.num_train) if train else slice(self.num_train, None)
+    return self.get_tensorloader((self.X, self.y), train, i)
+```
+
+The new data loader behaves just as the previous one, except that it is more efficient and has some added functionality.
+
+```{.python .input  n=4}
+%%tab all
+X, y = next(iter(data.train_dataloader()))
+print('X shape:', X.shape, '\ny shape:', y.shape)
+```
+
+For instance, the data loader provided by the framework API 
+supports the built-in `__len__` method, 
+so we can query its length, 
+i.e., the number of batches.
+
+```{.python .input}
+%%tab all
+len(data.train_dataloader())
+```
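The reported length can be checked by hand. The arithmetic below assumes the loader keeps the final, smaller batch rather than discarding it; that is worth confirming for whichever framework you use. The numbers are the defaults of `SyntheticRegressionData` above.

```python
import math

num_train, batch_size = 1000, 32

# Number of batches when the incomplete final batch is kept
num_batches = math.ceil(num_train / batch_size)
# Size of that final batch
last_batch_size = num_train - (num_batches - 1) * batch_size

print(num_batches)      # 32
print(last_batch_size)  # 8
```

So 31 full batches of 32 examples are followed by one batch of 8, which is also a useful hint for the first exercise below.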
+
+## Summary
+
+Data loaders are a convenient way of abstracting out 
+the process of loading and manipulating data. 
+This way the same machine learning *algorithm* 
+is capable of processing many different types and sources of data 
+without the need for modification. 
+One of the nice things about data loaders 
+is that they can be composed. 
+For instance, we might be loading images 
+and then have a post-processing filter 
+that crops them or modifies them otherwise. 
+As such, data loaders can be used 
+to describe an entire data processing pipeline. 
+
+As for the model itself, the two-dimensional linear model 
+is about as simple a model as we might encounter. 
+It lets us test out the accuracy of regression models 
+without worrying about having insufficient amounts of data 
+or an underdetermined system of equations. 
+We will put this to good use in the next section.  
+
+
+## Exercises
+
+1. What will happen if the number of examples cannot be divided by the batch size? How would you change this behavior by specifying a different argument in the framework's API?
+1. What if we want to generate a huge dataset, where both the size of the parameter vector `w` and the number of examples `num_examples` are large? 
+    1. What happens if we cannot hold all data in memory?
+    1. How would you shuffle the data if data is held on disk? Your task is to design an *efficient* algorithm that does not require too many random reads or writes. Hint: [pseudorandom permutation generators](https://en.wikipedia.org/wiki/Pseudorandom_permutation) allow you to design a reshuffle without the need to store the permutation table explicitly :cite:`Naor.Reingold.1999`. 
+1. Implement a data generator that produces new data on the fly, every time the iterator is called. 
+1. How would you design a random data generator that generates *the same* data each time it's called?
+
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/6662)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/6663)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/6664)
+:end_tab:
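The first exercise in the data-loader hunk above asks what happens when the number of examples is not a multiple of the batch size. As a minimal sketch of that behavior, the pure-Python generator below (a hypothetical `iter_batches` helper, not part of d2l or any framework) shows how the number of batches changes when the final, incomplete batch is dropped. The `drop_last` and `seed` parameter names are illustrative assumptions; a fixed seed also gives a shuffled order that is the same on every call.

```python
import random


def iter_batches(examples, batch_size, shuffle=False, drop_last=False, seed=0):
    """Yield lists of at most `batch_size` examples (hypothetical helper)."""
    indices = list(range(len(examples)))
    if shuffle:
        # A fixed seed makes the shuffled order reproducible across calls.
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch = [examples[i] for i in indices[start:start + batch_size]]
        if drop_last and len(batch) < batch_size:
            return  # skip the final, incomplete batch
        yield batch


examples = list(range(10))
kept = list(iter_batches(examples, batch_size=4))
dropped = list(iter_batches(examples, batch_size=4, drop_last=True))
print(len(kept), len(dropped))  # 3 2
```

With 10 examples and a batch size of 4, keeping the remainder yields a last batch of size 2, while dropping it leaves only the two full batches.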
diff --git a/chapter_linear-regression/weight-decay.md b/chapter_linear-regression/weight-decay.md
new file mode 100644
index 0000000..e99ebff
--- /dev/null
+++ b/chapter_linear-regression/weight-decay.md
@@ -0,0 +1,257 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# 重み減衰
+:label:`sec_weight_decay`
+
+オーバーフィットの問題を特徴づけたところで、最初の*正則化*手法を紹介します。より多くのトレーニングデータを収集することで、過適合をいつでも軽減できることを思い出してください。しかし、それはコストがかかり、時間がかかり、または完全に私たちの制御不能になる可能性があり、短期的には不可能になります。今のところ、私たちはリソースが許す限り多くの高品質のデータをすでに持っていると仮定し、データセットが与えられたものとして取られたとしても、自由に使えるツールに集中することができます。 
+
+多項式回帰の例 (:numref:`subsec_polynomial-curve-fitting`) では、近似する多項式の次数を調整することでモデルの容量を制限できたことを思い出してください。実際、特徴の数を制限することは、過適合を緩和するための一般的な手法です。しかし、単に特徴を捨てるだけでは、手段として粗すぎる場合があります。引き続き多項式回帰の例を用いて、高次元の入力で何が起こるかを考えてみましょう。多項式を多変量データへ自然に拡張したものは*単項式*と呼ばれ、単に変数のべき乗の積です。単項式の次数は、べき指数の合計です。たとえば、$x_1^2 x_2$ と $x_3 x_5^2$ は、どちらも次数 3 の単項式です。 
+
+次数 $d$ の項の数は、$d$ が大きくなるにつれて急速に増加することに注意してください。$k$ 個の変数が与えられた場合、次数 $d$ の単項式の数 (すなわち $k$ 個から重複を許して $d$ 個を選ぶ組合せの数) は ${k - 1 + d} \choose {k - 1}$ です。次数が $2$ から $3$ に変わる程度の小さな変化でも、モデルの複雑さは劇的に増大します。そのため、関数の複雑さを調整するために、よりきめ細かなツールが必要になることがよくあります。 
+
+## ノルムと重み減衰
+
+(**パラメータの数を直接操作するのではなく、
+*重み減衰*は、パラメータが取りうる値を
+制限することで機能します。**) ディープラーニング分野の外では $\ell_2$ 正則化と呼ばれることが多く、ミニバッチ確率的勾配降下法で最適化する場合、重み減衰はパラメトリックな機械学習モデルを正則化するために最も広く使用されている手法かもしれません。この手法は、すべての関数 $f$ の中で、関数 $f = 0$ (すべての入力に値 $0$ を割り当てる) がある意味で*最も単純*であり、パラメータのゼロからの距離によって関数の複雑さを測定できるという基本的な直感に基づいています。しかし、関数とゼロの間の距離は、正確にはどのように測定すべきでしょうか？唯一の正解はありません。実際、関数解析の一部やバナッハ空間の理論を含む数学の分野全体が、そのような問題に取り組んでいます。 
+
+簡単な解釈の 1 つは、線形関数 $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$ の複雑さを、その重みベクトルの何らかのノルム、たとえば $\| \mathbf{w} \|^2$ によって測定することです。$\ell_2$ ノルムと $\ell_1$ ノルムは、より一般的な $\ell_p$ ノルムの特別なケースとして :numref:`subsec_lin-algebra-norms` で導入したことを思い出してください。重みベクトルを小さく保つための最も一般的な方法は、損失を最小化する問題に、そのノルムをペナルティ項として追加することです。したがって、当初の目的である
+*訓練ラベルに対する予測損失の最小化*を、
+新しい目的である
+*予測損失とペナルティ項の和の最小化*に置き換えます。
+ここで、重みベクトルが大きくなりすぎると、学習アルゴリズムは訓練誤差の最小化よりも重みノルム $\| \mathbf{w} \|^2$ の最小化を優先する可能性があります。それがまさに私たちが望んでいることです。コードで説明するために、:numref:`sec_linear_regression` の線形回帰の例を再び取り上げます。そこでは、損失は次式で与えられました。 
+
+$$L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$
+
+$\mathbf{x}^{(i)}$ が特徴、$y^{(i)}$ が任意のデータ例 $i$ のラベル、$(\mathbf{w}, b)$ がそれぞれ重みとバイアスのパラメータであることを思い出してください。重みベクトルの大きさにペナルティを課すには、何らかの形で $\| \mathbf{w} \|^2$ を損失関数に追加する必要がありますが、モデルは標準の損失とこの新しい加法的ペナルティをどのようにトレードオフすべきでしょうか？実際には、このトレードオフを*正則化定数* $\lambda$ によって特徴付けます。これは非負のハイパーパラメータであり、検証データを使用して選択します。 
+
+$$L(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2.$$
+
+$\lambda = 0$ では、元の損失関数を回復します。$\lambda > 0$ については、$\| \mathbf{w} \|$ のサイズを制限しています。慣例により $2$ で割ります。二次関数の微分を取るとき、$2$ と $1/2$ は相殺され、更新の式が美しくシンプルに見えるようにします。鋭い読者は、なぜ標準ノルム（ユークリッド距離）ではなく二乗ノルムを扱うのか疑問に思うかもしれません。これは、計算の便宜のために行います。$\ell_2$ ノルムを二乗することにより、平方根を削除し、重みベクトルの各成分の二乗和を残します。これにより、ペナルティの微分を計算しやすくなります。導関数の合計は合計の微分と等しくなります。 
+
+さらに、そもそもなぜ $\ell_1$ ノルムなどではなく $\ell_2$ ノルムを使用するのかと疑問に思うかもしれません。実際、他の選択肢も有効であり、統計学全般で広く使われています。$\ell_2$ 正則化線形モデルは古典的な*リッジ回帰*アルゴリズムに相当し、$\ell_1$ 正則化線形回帰も統計学における同様に基本的な手法で、一般に*ラッソ回帰* (lasso regression) として知られています。$\ell_2$ ノルムを使用する理由の 1 つは、重みベクトルの大きな成分に特に大きなペナルティを課すことです。これにより、学習アルゴリズムは、より多くの特徴に重みを均等に配分するモデルに偏ります。実際には、これにより単一変数の測定誤差に対してよりロバストになる可能性があります。対照的に、$\ell_1$ ペナルティは、他の重みをゼロにすることで、少数の特徴に重みを集中させるモデルにつながります。これは*特徴選択*の効果的な手法となり、他の理由で望ましい場合があります。たとえば、モデルが少数の特徴のみに依存している場合、他の (除外された) 特徴のデータを収集、保存、または送信する必要がなくなるかもしれません。  
+
+:eqref:`eq_linreg_batch_update` で同じ表記法を使用して、$\ell_2$ 正則化回帰のミニバッチ確率勾配降下法の更新は次のとおりです。 
+
+$$\begin{aligned}
+\mathbf{w} & \leftarrow \left(1- \eta\lambda \right) \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right).
+\end{aligned}$$
+
+前と同様に、推定値が観測値と異なる量に基づいて $\mathbf{w}$ を更新します。ただし、同時に $\mathbf{w}$ の大きさをゼロに向かって縮小します。そのため、この手法は「重み減衰」と呼ばれることがあります。ペナルティ項のみを考えると、最適化アルゴリズムは学習の各ステップで重みを*減衰*させます。特徴選択とは対照的に、重み減衰は関数の複雑さを調整するための連続的なメカニズムを提供します。$\lambda$ の値が小さいほど $\mathbf{w}$ への制約は弱く、$\lambda$ の値が大きいほど $\mathbf{w}$ への制約は強くなります。対応するバイアスのペナルティ $b^2$ を含めるかどうかは実装によって異なり、ニューラルネットワークの層によっても異なる場合があります。多くの場合、バイアス項は正則化しません。また、他の最適化アルゴリズムでは $\ell_2$ 正則化が重み減衰と等価でない場合もありますが、重みの大きさを縮小することで正則化するという考え方は変わりません。 
+
+## 高次元線形回帰
+
+簡単な合成例を通して、重み減衰の利点を説明できます。
+
+```{.python .input  n=2}
+%%tab mxnet
+%matplotlib inline
+from d2l import mxnet as d2l
+from mxnet import autograd, gluon, init, np, npx
+from mxnet.gluon import nn
+npx.set_np()
+```
+
+```{.python .input  n=3}
+%%tab pytorch
+%matplotlib inline
+from d2l import torch as d2l
+import torch
+from torch import nn
+```
+
+```{.python .input  n=4}
+%%tab tensorflow
+%matplotlib inline
+from d2l import tensorflow as d2l
+import tensorflow as tf
+```
+
+まず、[**前と同じようにデータを生成する**]: 
+
+(**$$y = 0.05 + \sum_{i = 1}^d 0.01 x_i + \epsilon \text{ where } \epsilon \sim \mathcal{N}(0, 0.01^2).$$**) 
+
+この合成データセットでは、ラベルは入力の基礎となる線形関数によって与えられ、ゼロ平均、標準偏差 0.01 のガウスノイズによって破損しています。説明のために、問題の次元を$d = 200$に増やし、20例しかない小さなトレーニングセットで作業することで、オーバーフィットの影響を顕著にすることができます。
+
+```{.python .input  n=5}
+%%tab all
+class Data(d2l.DataModule):
+    def __init__(self, num_train, num_val, num_inputs, batch_size):
+        self.save_hyperparameters()                
+        n = num_train + num_val 
+        if tab.selected('mxnet') or tab.selected('pytorch'):
+            self.X = d2l.randn(n, num_inputs)
+            noise = d2l.randn(n, 1) * 0.01
+        if tab.selected('tensorflow'):
+            self.X = d2l.normal((n, num_inputs))
+            noise = d2l.normal((n, 1)) * 0.01
+        w, b = d2l.ones((num_inputs, 1)) * 0.01, 0.05
+        self.y = d2l.matmul(self.X, w) + b + noise
+
+    def get_dataloader(self, train):
+        i = slice(0, self.num_train) if train else slice(self.num_train, None)
+        return self.get_tensorloader([self.X, self.y], train, i)
+```
+
+## ゼロからの実装
+
+それでは、重み減衰をゼロから実装してみましょう。ミニバッチ確率的勾配降下法をオプティマイザとして使うので、元の損失関数に二乗 $\ell_2$ ペナルティを追加するだけで済みます。 
+
+### (**$\ell_2$ ノルムペナルティの定義**)
+
+おそらく、このペナルティを実装する最も便利な方法は、すべての項を二乗して合計することです。
+
+```{.python .input  n=6}
+%%tab all
+def l2_penalty(w):
+    return d2l.reduce_sum(w**2) / 2
+```
+
+### モデルを定義する
+
+最終的なモデルでは、線形回帰と二乗損失は :numref:`sec_linear_scratch` から変わっていないため、`d2l.LinearRegressionScratch` のサブクラスを定義するだけです。ここでの唯一の変更点は、損失にペナルティ項が含まれるようになったことです。
+
+```{.python .input  n=7}
+%%tab all
+class WeightDecayScratch(d2l.LinearRegressionScratch):
+    def __init__(self, num_inputs, lambd, lr, sigma=0.01):
+        super().__init__(num_inputs, lr, sigma)
+        self.save_hyperparameters()
+        
+    def loss(self, y_hat, y):
+        return super().loss(y_hat, y) + self.lambd * l2_penalty(self.w)
+```
+
+次のコードは、20個の例を含むトレーニングセットのモデルを適合させ、100個の例を含む検証セットで評価します。
+
+```{.python .input  n=8}
+%%tab all
+data = Data(num_train=20, num_val=100, num_inputs=200, batch_size=5)
+trainer = d2l.Trainer(max_epochs=10)
+
+def train_scratch(lambd):    
+    model = WeightDecayScratch(num_inputs=200, lambd=lambd, lr=0.01)
+    model.board.yscale='log'
+    trainer.fit(model, data)
+    print('L2 norm of w:', float(l2_penalty(model.w)))
+```
+
+### [**正則化なしの学習**]
+
+このコードを `lambd = 0` で実行し、重み減衰を無効にします。ひどく過適合しており、訓練誤差は減少しますが検証誤差は減少しないことに注意してください。これは過適合の典型例です。
+
+```{.python .input  n=9}
+%%tab all
+train_scratch(0)
+```
+
+### [**重み減衰を使用する**]
+
+以下では、強い重み減衰をかけて実行します。訓練誤差は増加しますが、検証誤差は減少することに注意してください。これはまさに正則化に期待される効果です。
+
+```{.python .input  n=10}
+%%tab all
+train_scratch(3)
+```
+
+## [**簡潔な実装**]
+
+重み減衰はニューラルネットワークの最適化で広く使われているため、ディープラーニングフレームワークでは重み減衰が最適化アルゴリズム自体に組み込まれており、任意の損失関数と組み合わせて簡単に使用できます。さらに、この統合には計算上の利点もあり、実装上の工夫によって追加の計算オーバーヘッドなしに重み減衰をアルゴリズムに加えることができます。更新の重み減衰部分は各パラメータの現在の値のみに依存し、オプティマイザはいずれにせよ各パラメータに一度はアクセスする必要があるためです。
+
+:begin_tab:`mxnet`
+次のコードでは、`Trainer`をインスタンス化するときに、`wd`を介して直接重み減衰ハイパーパラメータを指定します。既定では、Gluon は重みとバイアスの両方を同時に減衰させます。モデルパラメーターを更新すると、ハイパーパラメーター `wd` に `wd_mult` が乗算されることに注意してください。したがって、`wd_mult`をゼロに設定すると、バイアスパラメータ$b$は減衰しません。
+:end_tab:
+
+:begin_tab:`pytorch`
+次のコードでは、オプティマイザをインスタンス化するときに `weight_decay` を介して重み減衰ハイパーパラメータを直接指定します。デフォルトでは、PyTorch は重みとバイアスの両方を同時に減衰させます。ここでは `self.net.parameters()` 全体に 1 つの `weight_decay` を渡しているため、バイアスパラメータ $b$ も重みと同様に減衰します。
+:end_tab:
+
+:begin_tab:`tensorflow`
+次のコードでは、重み減衰ハイパーパラメーター `wd` を使用して $\ell_2$ 正則化器を作成し、`kernel_regularizer` 引数によって層の重みに適用します。
+:end_tab:
+
+```{.python .input  n=11}
+%%tab mxnet
+class WeightDecay(d2l.LinearRegression):
+    def __init__(self, wd, lr):
+        super().__init__(lr)
+        self.save_hyperparameters()
+        self.wd = wd
+        
+    def configure_optimizers(self):
+        self.collect_params('.*bias').setattr('wd_mult', 0)
+        return gluon.Trainer(self.collect_params(),
+                             'sgd', 
+                             {'learning_rate': self.lr, 'wd': self.wd})
+```
+
+```{.python .input  n=12}
+%%tab pytorch
+class WeightDecay(d2l.LinearRegression):
+    def __init__(self, wd, lr):
+        super().__init__(lr)
+        self.save_hyperparameters()
+        self.wd = wd
+    
+    def configure_optimizers(self):
+        return torch.optim.SGD(self.net.parameters(), 
+                               lr=self.lr, weight_decay=self.wd)
+```
+
+```{.python .input  n=13}
+%%tab tensorflow
+class WeightDecay(d2l.LinearRegression):
+    def __init__(self, wd, lr):
+        super().__init__(lr)
+        self.save_hyperparameters()
+        self.net = tf.keras.layers.Dense(
+            1, kernel_regularizer=tf.keras.regularizers.l2(wd),
+            kernel_initializer=tf.keras.initializers.RandomNormal(0, 0.01)
+        )
+        
+    def loss(self, y_hat, y):
+        return super().loss(y_hat, y) + self.net.losses
+```
+
+[**このプロットは、重み減衰をゼロから実装したときのものと似ています**]。しかし、このバージョンはより高速に動作し、実装も簡単です。より大きな問題に取り組み、この作業が日常的になるにつれて、これらの利点はより顕著になります。
+
+```{.python .input  n=14}
+%%tab all
+model = WeightDecay(wd=3, lr=0.01)
+model.board.yscale='log'
+trainer.fit(model, data)
+print('L2 norm of w:', float(l2_penalty(model.get_w_b()[0])))
+```
+
+これまでは、単純な線形関数とは何かについて、1 つの考え方に触れただけです。さらに、単純な非線形関数とは何かは、さらに複雑な問題になり得ます。たとえば、[再生核ヒルベルト空間 (RKHS)](https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space) を使用すると、線形関数に対して導入されたツールを非線形の文脈に適用できます。残念ながら、RKHS ベースのアルゴリズムは、大規模で高次元のデータに対してはスケールしにくい傾向があります。この本では、ディープネットワークのすべての層に重み減衰を適用するという一般的なヒューリスティックをしばしば採用します。 
+
+## まとめ
+
+* 正則化は、過適合に対処するための一般的な方法です。従来の正則化手法では、学習したモデルの複雑さを軽減するために (学習時に) 損失関数にペナルティ項を追加します。
+* モデルをシンプルに保つための特別な選択肢の 1 つは、$\ell_2$ ペナルティを使用することです。これにより、ミニバッチ確率的勾配降下アルゴリズムの更新ステップで重みが減衰します。
+* 重み減衰機能は、ディープラーニングフレームワークのオプティマイザーで提供されます。
+* パラメーターのセットが異なれば、同じトレーニングループ内で異なる更新動作を持つことができます。
+
+## 演習
+
+1. このセクションの推定問題で $\lambda$ の値を試します。学習と検証の精度を$\lambda$の関数としてプロットします。あなたは何を観察していますか？
+1. 検証セットを使用して、$\lambda$ の最適値を見つけます。本当に最適値なのですか？これは問題なの？
+1. $\|\mathbf{w}\|^2$の代わりに$\sum_i |w_i|$を選択したペナルティ（$\ell_1$正則化）として使用した場合、更新方程式はどのようになりますか？
+1. $\|\mathbf{w}\|^2 = \mathbf{w}^\top \mathbf{w}$ であることはわかっています。行列について同様の式を見つけられますか (:numref:`subsec_lin-algebra-norms` のフロベニウスノルムを参照)？
+1. 訓練誤差と汎化誤差の関係を復習してください。重み減衰、トレーニングの増加、適切な複雑さのモデルの使用に加えて、過適合に対処する方法として他に何が考えられますか？
+1. ベイズ統計では、事前分布と尤度の積を用いて、$P(w \mid x) \propto P(x \mid w) P(w)$ により事後分布を得ます。$P(w)$ を正則化とどのように対応付けられますか？
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/98)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/99)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/236)
+:end_tab:
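The update rule stated in the `weight-decay.md` hunk above can be checked numerically. The sketch below uses plain Python and hypothetical helper names (`squared_loss_grad`, `step_on_penalized_loss`, `step_with_decay`); it assumes the averaged squared loss from that section and a single full-batch step. It compares one gradient step on $L + \frac{\lambda}{2}\|\mathbf{w}\|^2$ with the "shrink, then step" form $(1 - \eta\lambda)\mathbf{w} - \eta \nabla_{\mathbf{w}} L$.

```python
import math


def squared_loss_grad(w, b, X, y):
    """Gradient w.r.t. w of (1/n) * sum_i 0.5 * (w.x_i + b - y_i)^2."""
    n = len(X)
    grad = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        err = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b - y_i
        for j, x_j in enumerate(x_i):
            grad[j] += err * x_j / n
    return grad


def step_on_penalized_loss(w, b, X, y, lr, lambd):
    """One gradient step on L + (lambd / 2) * ||w||^2."""
    g = squared_loss_grad(w, b, X, y)
    # The penalty contributes lambd * w_j to the j-th gradient component.
    return [w_j - lr * (g_j + lambd * w_j) for w_j, g_j in zip(w, g)]


def step_with_decay(w, b, X, y, lr, lambd):
    """Shrink w by (1 - lr * lambd), then take a plain gradient step on L."""
    g = squared_loss_grad(w, b, X, y)
    return [(1 - lr * lambd) * w_j - lr * g_j for w_j, g_j in zip(w, g)]


# Small made-up dataset, used only for illustration.
X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
y = [0.3, -0.2, 0.1]
w, b = [0.4, -0.7], 0.05

penalized = step_on_penalized_loss(w, b, X, y, lr=0.1, lambd=3.0)
decayed = step_with_decay(w, b, X, y, lr=0.1, lambd=3.0)
same = all(math.isclose(p, q, abs_tol=1e-12) for p, q in zip(penalized, decayed))
print(same)
```

The two forms agree because the gradient of the penalty is $\lambda \mathbf{w}$. With `lambd=0`, both functions reduce to an ordinary gradient step.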
diff --git a/chapter_linear-regression/weight-decay_origin.md b/chapter_linear-regression/weight-decay_origin.md
new file mode 100644
index 0000000..633c8a6
--- /dev/null
+++ b/chapter_linear-regression/weight-decay_origin.md
@@ -0,0 +1,473 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# Weight Decay
+:label:`sec_weight_decay`
+
+Now that we have characterized the problem of overfitting,
+we can introduce our first *regularization* technique.
+Recall that we can always mitigate overfitting
+by collecting more training data.
+However, that can be costly, time consuming,
+or entirely out of our control,
+making it impossible in the short run.
+For now, we can assume that we already have
+as much high-quality data as our resources permit
+and focus on the tools at our disposal
+even when the dataset is taken as a given.
+
+Recall that in our polynomial regression example
+(:numref:`subsec_polynomial-curve-fitting`)
+we could limit our model's capacity
+by tweaking the degree
+of the fitted polynomial.
+Indeed, limiting the number of features
+is a popular technique to mitigate overfitting.
+However, simply tossing aside features
+can be too blunt an instrument.
+Sticking with the polynomial regression
+example, consider what might happen
+with high-dimensional input.
+The natural extensions of polynomials
+to multivariate data are called *monomials*,
+which are simply products of powers of variables.
+The degree of a monomial is the sum of the powers.
+For example, $x_1^2 x_2$ and $x_3 x_5^2$
+are both monomials of degree 3.
+
+Note that the number of terms with degree $d$
+blows up rapidly as $d$ grows larger.
+Given $k$ variables, the number of monomials
+of degree $d$ (i.e., $k$ multichoose $d$) is ${k - 1 + d} \choose {k - 1}$.
+Even small changes in degree, say from $2$ to $3$,
+dramatically increase the complexity of our model.
+Thus we often need a more fine-grained tool
+for adjusting function complexity.
+
+## Norms and Weight Decay
+
+(**Rather than directly manipulating the number of parameters,
+*weight decay* operates by restricting the values 
+that the parameters can take.**)
+More commonly called $\ell_2$ regularization
+outside of deep learning circles
+when optimized by minibatch stochastic gradient descent,
+weight decay might be the most widely used technique
+for regularizing parametric machine learning models.
+The technique is motivated by the basic intuition
+that among all functions $f$,
+the function $f = 0$
+(assigning the value $0$ to all inputs)
+is in some sense the *simplest*,
+and that we can measure the complexity
+of a function by the distance of its parameters from zero.
+But how precisely should we measure
+the distance between a function and zero?
+There's no single right answer.
+In fact, entire branches of mathematics,
+including parts of functional analysis
+and the theory of Banach spaces,
+are devoted to addressing such issues.
+
+One simple interpretation might be
+to measure the complexity of a linear function
+$f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$
+by some norm of its weight vector, e.g., $\| \mathbf{w} \|^2$.
+Recall that we introduced the $\ell_2$ norm and $\ell_1$ norm,
+which are special cases of the more general $\ell_p$ norm
+in :numref:`subsec_lin-algebra-norms`.
+The most common method for ensuring a small weight vector
+is to add its norm as a penalty term
+to the problem of minimizing the loss.
+Thus we replace our original objective,
+*minimizing the prediction loss on the training labels*,
+with a new objective,
+*minimizing the sum of the prediction loss and the penalty term*.
+Now, if our weight vector grows too large,
+our learning algorithm might focus
+on minimizing the weight norm $\| \mathbf{w} \|^2$
+rather than on minimizing the training error.
+That is exactly what we want.
+To illustrate things in code,
+we revive our previous example
+from :numref:`sec_linear_regression` for linear regression.
+There, our loss was given by
+
+$$L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$
+
+Recall that $\mathbf{x}^{(i)}$ are the features,
+$y^{(i)}$ is the label for any data example $i$, and $(\mathbf{w}, b)$
+are the weight and bias parameters, respectively.
+To penalize the size of the weight vector,
+we must somehow add $\| \mathbf{w} \|^2$ to the loss function,
+but how should the model trade off the
+standard loss for this new additive penalty?
+In practice, we characterize this tradeoff
+via the *regularization constant* $\lambda$,
+a non-negative hyperparameter
+that we fit using validation data:
+
+$$L(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2.$$
+
+
+For $\lambda = 0$, we recover our original loss function.
+For $\lambda > 0$, we restrict the size of $\| \mathbf{w} \|$.
+We divide by $2$ by convention:
+when we take the derivative of a quadratic function,
+the $2$ and $1/2$ cancel out, ensuring that the expression
+for the update looks nice and simple.
+The astute reader might wonder why we work with the squared
+norm and not the standard norm (i.e., the Euclidean distance).
+We do this for computational convenience.
+By squaring the $\ell_2$ norm, we remove the square root,
+leaving the sum of squares of
+each component of the weight vector.
+This makes the derivative of the penalty easy to compute: 
+the sum of derivatives equals the derivative of the sum.
+
+
+Moreover, you might ask why we work with the $\ell_2$ norm
+in the first place and not, say, the $\ell_1$ norm.
+In fact, other choices are valid and
+popular throughout statistics.
+While $\ell_2$-regularized linear models constitute
+the classic *ridge regression* algorithm,
+$\ell_1$-regularized linear regression
+is a similarly fundamental method in statistics, 
+popularly known as *lasso regression*.
+One reason to work with the $\ell_2$ norm
+is that it places an outsize penalty
+on large components of the weight vector.
+This biases our learning algorithm
+towards models that distribute weight evenly
+across a larger number of features.
+In practice, this might make them more robust
+to measurement error in a single variable.
+By contrast, $\ell_1$ penalties lead to models
+that concentrate weights on a small set of features
+by clearing the other weights to zero.
+This gives us an effective method for *feature selection*,
+which may be desirable for other reasons.
+For example, if our model only relies on a few features,
+then we may not need to collect, store, or transmit data
+for the other (dropped) features. 
+
+Using the same notation as in :eqref:`eq_linreg_batch_update`,
+the minibatch stochastic gradient descent updates
+for $\ell_2$-regularized regression follow:
+
+$$\begin{aligned}
+\mathbf{w} & \leftarrow \left(1- \eta\lambda \right) \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right).
+\end{aligned}$$
+
+As before, we update $\mathbf{w}$ based on the amount
+by which our estimate differs from the observation.
+However, we also shrink the size of $\mathbf{w}$ towards zero.
+That is why the method is sometimes called "weight decay":
+given the penalty term alone,
+our optimization algorithm *decays*
+the weight at each step of training.
+In contrast to feature selection,
+weight decay offers us a continuous mechanism
+for adjusting the complexity of a function.
+Smaller values of $\lambda$ correspond
+to less constrained $\mathbf{w}$,
+whereas larger values of $\lambda$
+constrain $\mathbf{w}$ more considerably.
+Whether we include a corresponding bias penalty $b^2$ 
+can vary across implementations, 
+and may vary across layers of a neural network.
+Often, we do not regularize the bias term.
+Besides,
+although $\ell_2$ regularization may not be equivalent to weight decay for other optimization algorithms,
+the idea of regularization through
+shrinking the size of weights
+still holds true.
+
+
+
+## High-Dimensional Linear Regression
+
+We can illustrate the benefits of weight decay 
+through a simple synthetic example.
+
+```{.python .input  n=2}
+%%tab mxnet
+%matplotlib inline
+from d2l import mxnet as d2l
+from mxnet import autograd, gluon, init, np, npx
+from mxnet.gluon import nn
+npx.set_np()
+```
+
+```{.python .input  n=3}
+%%tab pytorch
+%matplotlib inline
+from d2l import torch as d2l
+import torch
+from torch import nn
+```
+
+```{.python .input  n=4}
+%%tab tensorflow
+%matplotlib inline
+from d2l import tensorflow as d2l
+import tensorflow as tf
+```
+
+First, we [**generate some data as before**]:
+
+(**$$y = 0.05 + \sum_{i = 1}^d 0.01 x_i + \epsilon \text{ where }
+\epsilon \sim \mathcal{N}(0, 0.01^2).$$**)
+
+In this synthetic dataset, our label is given 
+by an underlying linear function of our inputs,
+corrupted by Gaussian noise 
+with zero mean and standard deviation 0.01.
+For illustrative purposes, 
+we can make the effects of overfitting pronounced,
+by increasing the dimensionality of our problem to $d = 200$
+and working with a small training set with only 20 examples.
+
+```{.python .input  n=5}
+%%tab all
+class Data(d2l.DataModule):
+    def __init__(self, num_train, num_val, num_inputs, batch_size):
+        self.save_hyperparameters()                
+        n = num_train + num_val 
+        if tab.selected('mxnet') or tab.selected('pytorch'):
+            self.X = d2l.randn(n, num_inputs)
+            noise = d2l.randn(n, 1) * 0.01
+        if tab.selected('tensorflow'):
+            self.X = d2l.normal((n, num_inputs))
+            noise = d2l.normal((n, 1)) * 0.01
+        w, b = d2l.ones((num_inputs, 1)) * 0.01, 0.05
+        self.y = d2l.matmul(self.X, w) + b + noise
+
+    def get_dataloader(self, train):
+        i = slice(0, self.num_train) if train else slice(self.num_train, None)
+        return self.get_tensorloader([self.X, self.y], train, i)
+```
+
+## Implementation from Scratch
+
+Now, let's try implementing weight decay from scratch.
+Since minibatch stochastic gradient descent
+is our optimizer,
+we just need to add the squared $\ell_2$ penalty
+to the original loss function.
+
+### (**Defining $\ell_2$ Norm Penalty**)
+
+Perhaps the most convenient way to implement this penalty
+is to square all terms in place and sum them up.
+
+```{.python .input  n=6}
+%%tab all
+def l2_penalty(w):
+    return d2l.reduce_sum(w**2) / 2
+```
+
+### Defining the Model
+
+In the final model,
+the linear regression and the squared loss have not changed since :numref:`sec_linear_scratch`,
+so we will just define a subclass of `d2l.LinearRegressionScratch`. The only change here is that our loss now includes the penalty term.
+
+```{.python .input  n=7}
+%%tab all
+class WeightDecayScratch(d2l.LinearRegressionScratch):
+    def __init__(self, num_inputs, lambd, lr, sigma=0.01):
+        super().__init__(num_inputs, lr, sigma)
+        self.save_hyperparameters()
+        
+    def loss(self, y_hat, y):
+        return super().loss(y_hat, y) + self.lambd * l2_penalty(self.w)        
+```
+
+The following code fits our model on the training set with 20 examples and evaluates it on the validation set with 100 examples.
+
+```{.python .input  n=8}
+%%tab all
+data = Data(num_train=20, num_val=100, num_inputs=200, batch_size=5)
+trainer = d2l.Trainer(max_epochs=10)
+
+def train_scratch(lambd):    
+    model = WeightDecayScratch(num_inputs=200, lambd=lambd, lr=0.01)
+    model.board.yscale='log'
+    trainer.fit(model, data)
+    print('L2 norm of w:', float(l2_penalty(model.w)))
+```
+
+### [**Training without Regularization**]
+
+We now run this code with `lambd = 0`,
+disabling weight decay.
+Note that we overfit badly,
+decreasing the training error but not the
+validation error---a textbook case of overfitting.
+
+```{.python .input  n=9}
+%%tab all
+train_scratch(0)
+```
+
+### [**Using Weight Decay**]
+
+Below, we run with substantial weight decay.
+Note that the training error increases
+but the validation error decreases.
+This is precisely the effect
+we expect from regularization.
+
+```{.python .input  n=10}
+%%tab all
+train_scratch(3)
+```
+
+## [**Concise Implementation**]
+
+Because weight decay is ubiquitous
+in neural network optimization,
+the deep learning framework makes it especially convenient,
+integrating weight decay into the optimization algorithm itself
+for easy use in combination with any loss function.
+Moreover, this integration serves a computational benefit,
+allowing implementation tricks to add weight decay to the algorithm,
+without any additional computational overhead.
+Since the weight decay portion of the update
+depends only on the current value of each parameter,
+the optimizer must touch each parameter once anyway.
+
+:begin_tab:`mxnet`
+In the following code, we specify
+the weight decay hyperparameter directly
+through `wd` when instantiating our `Trainer`.
+By default, Gluon decays both
+weights and biases simultaneously.
+Note that the hyperparameter `wd`
+will be multiplied by `wd_mult`
+when updating model parameters.
+Thus, if we set `wd_mult` to zero,
+the bias parameter $b$ will not decay.
+:end_tab:
+
+:begin_tab:`pytorch`
+In the following code, we specify
+the weight decay hyperparameter directly
+through `weight_decay` when instantiating our optimizer.
+By default, PyTorch decays both
+weights and biases simultaneously.
+Here, we pass all of `self.net.parameters()` with a single `weight_decay`,
+so the bias parameter $b$ decays along with the weight.
+:end_tab:
+
+:begin_tab:`tensorflow`
+In the following code, we create an $\ell_2$ regularizer with
+the weight decay hyperparameter `wd` and apply it to the layer's weights
+through the `kernel_regularizer` argument.
+:end_tab:
+
+```{.python .input  n=11}
+%%tab mxnet
+class WeightDecay(d2l.LinearRegression):
+    def __init__(self, wd, lr):
+        super().__init__(lr)
+        self.save_hyperparameters()
+        self.wd = wd
+        
+    def configure_optimizers(self):
+        self.collect_params('.*bias').setattr('wd_mult', 0)
+        return gluon.Trainer(self.collect_params(),
+                             'sgd', 
+                             {'learning_rate': self.lr, 'wd': self.wd})
+```
+
+```{.python .input  n=12}
+%%tab pytorch
+class WeightDecay(d2l.LinearRegression):
+    def __init__(self, wd, lr):
+        super().__init__(lr)
+        self.save_hyperparameters()
+        self.wd = wd
+    
+    def configure_optimizers(self):
+        return torch.optim.SGD(self.net.parameters(), 
+                               lr=self.lr, weight_decay=self.wd)
+```
+
+```{.python .input  n=13}
+%%tab tensorflow
+class WeightDecay(d2l.LinearRegression):
+    def __init__(self, wd, lr):
+        super().__init__(lr)
+        self.save_hyperparameters()
+        self.net = tf.keras.layers.Dense(
+            1, kernel_regularizer=tf.keras.regularizers.l2(wd),
+            kernel_initializer=tf.keras.initializers.RandomNormal(0, 0.01)
+        )
+        
+    def loss(self, y_hat, y):
+        return super().loss(y_hat, y) + self.net.losses
+```
+
+[**The plot looks similar to that when
+we implemented weight decay from scratch**].
+However, this version runs faster
+and is easier to implement,
+benefits that will become more
+pronounced as you address larger problems
+and this work becomes more routine.
+
+```{.python .input  n=14}
+%%tab all
+model = WeightDecay(wd=3, lr=0.01)
+model.board.yscale='log'
+trainer.fit(model, data)
+print('L2 norm of w:', float(l2_penalty(model.get_w_b()[0])))
+```
+
+So far, we only touched upon one notion of
+what constitutes a simple linear function.
+Moreover, what constitutes a simple nonlinear function
+can be an even more complex question.
+For instance, [reproducing kernel Hilbert space (RKHS)](https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space)
+allows one to apply tools introduced
+for linear functions in a nonlinear context.
+Unfortunately, RKHS-based algorithms
+tend to scale poorly to large, high-dimensional data.
+In this book we will often adopt the common heuristic
+whereby weight decay is applied
+to all layers of a deep network.
+
+## Summary
+
+* Regularization is a common method for dealing with overfitting. Classical regularization techniques add a penalty term to the loss function (when training) to reduce the complexity of the learned model.
+* One particular choice for keeping the model simple is using an $\ell_2$ penalty. This leads to weight decay in the update steps of the minibatch stochastic gradient descent algorithm.
+* The weight decay functionality is provided in optimizers from deep learning frameworks.
+* Different sets of parameters can have different update behaviors within the same training loop.
+
+
+
+## Exercises
+
+1. Experiment with the value of $\lambda$ in the estimation problem in this section. Plot training and validation accuracy as a function of $\lambda$. What do you observe?
+1. Use a validation set to find the optimal value of $\lambda$. Is it really the optimal value? Does this matter?
+1. What would the update equations look like if instead of $\|\mathbf{w}\|^2$ we used $\sum_i |w_i|$ as our penalty of choice ($\ell_1$ regularization)?
+1. We know that $\|\mathbf{w}\|^2 = \mathbf{w}^\top \mathbf{w}$. Can you find a similar equation for matrices (see the Frobenius norm in :numref:`subsec_lin-algebra-norms`)?
+1. Review the relationship between training error and generalization error. In addition to weight decay, increased training, and the use of a model of suitable complexity, what other ways can you think of to deal with overfitting?
+1. In Bayesian statistics we use the product of prior and likelihood to arrive at a posterior via $P(w \mid x) \propto P(x \mid w) P(w)$. How can you identify $P(w)$ with regularization?
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/98)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/99)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/236)
+:end_tab:
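The monomial count quoted in the `weight-decay_origin.md` hunk above grows quickly with the degree. A minimal sketch, assuming only the Python standard library (`math.comb` requires Python 3.8 or later) and two hypothetical helper names, `num_monomials` and `enumerate_monomials`, compares the closed form with a brute-force enumeration:

```python
import math
from itertools import combinations_with_replacement


def num_monomials(k, d):
    """Number of degree-d monomials in k variables: C(k - 1 + d, k - 1)."""
    return math.comb(k - 1 + d, k - 1)


def enumerate_monomials(k, d):
    """Each monomial is a multiset of d variable indices chosen from k."""
    return list(combinations_with_replacement(range(k), d))


# The closed form matches the enumeration for small cases.
for d in (1, 2, 3, 4):
    assert num_monomials(5, d) == len(enumerate_monomials(5, d))

print(num_monomials(20, 2), num_monomials(20, 3))  # 210 1540
```

With 20 variables, moving from degree 2 to degree 3 raises the number of terms from 210 to 1540. This is the blow-up that motivates a finer-grained tool than choosing the degree.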
diff --git a/chapter_multilayer-perceptrons/backprop.md b/chapter_multilayer-perceptrons/backprop.md
index 439d3a8..490a12c 100644
--- a/chapter_multilayer-perceptrons/backprop.md
+++ b/chapter_multilayer-perceptrons/backprop.md
@@ -1,65 +1,65 @@
-# 順伝搬、逆方向伝播、計算グラフ
+# フォワードプロパゲーション、バックワードプロパゲーション、および計算グラフ
 :label:`sec_backprop`
 
-これまで、ミニバッチ確率的勾配降下法を使用してモデルをトレーニングしました。しかし、アルゴリズムを実装したときは、モデルを介した*前方伝播*に関わる計算についてのみ懸念していました。勾配を計算するときは、ディープラーニングフレームワークが提供するバックプロパゲーション関数を呼び出しました。 
+これまで、ミニバッチの確率的勾配降下法でモデルをトレーニングしてきました。しかし、アルゴリズムを実装したときは、モデルを介した*前方伝播*に関連する計算のみを懸念していました。勾配を計算するときが来たとき、ディープラーニングフレームワークによって提供されるバックプロパゲーション関数を呼び出しました。 
 
-勾配の自動計算 (自動微分) により、深層学習アルゴリズムの実装が大幅に簡素化されます。自動微分の前は、複雑なモデルを少しでも変更しても、複雑な微分を手作業で再計算する必要がありました。驚くべきことに、学術論文は更新規則を導出するために多数のページを割り当てる必要がありました。興味深い部分に焦点を当てるためには、引き続き自動微分に頼る必要がありますが、ディープラーニングの浅い理解を超えたい場合は、これらの勾配が内部でどのように計算されるかを知っておく必要があります。 
+勾配の自動計算（自動微分）により、ディープラーニングアルゴリズムの実装が大幅に簡素化されます。自動微分以前は、複雑なモデルに少しでも変更を加えるだけでも、複雑な微分を手動で再計算する必要がありました。驚くべきことに、学術論文は更新規則を導き出すために多数のページを割り当てる必要がありました。興味深い部分に集中できるように自動微分に依存し続ける必要がありますが、ディープラーニングの浅い理解を超えたい場合は、これらの勾配が内部でどのように計算されるかを知っておく必要があります。 
 
-このセクションでは、*逆方向伝播* (より一般的には*backpropagation*) の詳細を掘り下げます。テクニックとその実装に関する洞察を伝えるために、基本的な数学と計算グラフを利用しています。まず、重量減衰 ($L_2$ 正則化) をもつ一隠れ層MLPに焦点をあてて解説します。 
+このセクションでは、*逆伝播* (より一般的には*バックプロパゲーション*) の詳細を掘り下げます。手法とその実装の両方についての洞察を伝えるために、基本的な数学と計算グラフを利用します。まず、重み減衰 ($\ell_2$ 正則化、後続の章で説明します) を伴う隠れ層 1 つの MLP に焦点を当てます。 
 
-## フォワード伝播
+## フォワードプロパゲーション
 
-*フォワード伝播* (または*フォワードパス*) とは、計算と保存のことです。
-入力層から出力層への順に、ニューラルネットワークの中間変数 (出力を含む) を順に示します。ここでは、隠れ層が 1 つあるニューラルネットワークの仕組みを順を追って説明します。これは退屈に思えるかもしれませんが、ファンクの巨匠ジェームズ・ブラウンの永遠の言葉では、「ボスになるための費用を払う」必要があります。 
+*フォワードプロパゲーション* (または*フォワードパス*) とは、ニューラルネットワークの中間変数 (出力を含む) を、
+入力層から出力層の順に計算して保存することを指します。ここでは、隠れ層が 1 つあるニューラルネットワークの仕組みを段階的に説明していきます。これは退屈に思えるかもしれませんが、ファンクの巨匠ジェームス・ブラウンの不朽の言葉を借りれば、「ボスになるには代償を払う」必要があります。 
 
-簡単にするために、入力例が $\mathbf{x}\in \mathbb{R}^d$ で、隠れ層にバイアス項が含まれていないと仮定します。ここで、中間変数は次のようになります。 
+簡単にするために、入力例が$\mathbf{x}\in \mathbb{R}^d$であり、隠れ層にバイアス項が含まれていないと仮定します。ここで、中間変数は次のとおりです。 
 
 $$\mathbf{z}= \mathbf{W}^{(1)} \mathbf{x},$$
 
-$\mathbf{W}^{(1)} \in \mathbb{R}^{h \times d}$ は非表示レイヤの重みパラメータです。アクティベーション関数 $\phi$ を介して中間変数 $\mathbf{z}\in \mathbb{R}^h$ を実行すると、長さ $h$ の隠れたアクティベーションベクトルが得られます。 
+ここで、$\mathbf{W}^{(1)} \in \mathbb{R}^{h \times d}$ は隠れ層の重みパラメータです。中間変数 $\mathbf{z}\in \mathbb{R}^h$ を活性化関数 $\phi$ に通すと、長さ $h$ の隠れ層の活性化ベクトルが得られます。
 
 $$\mathbf{h}= \phi (\mathbf{z}).$$
 
-隠し変数 $\mathbf{h}$ も中間変数です。出力層のパラメーターの重みが $\mathbf{W}^{(2)} \in \mathbb{R}^{q \times h}$ であると仮定すると、長さが $q$ のベクトルをもつ出力層変数を取得できます。 
+隠れ層出力 $\mathbf{h}$ も中間変数です。出力層のパラメータが重み $\mathbf{W}^{(2)} \in \mathbb{R}^{q \times h}$ のみであると仮定すると、長さ $q$ のベクトルである出力層変数が得られます。
 
 $$\mathbf{o}= \mathbf{W}^{(2)} \mathbf{h}.$$
 
-損失関数が $l$ で、ラベルの例が $y$ であると仮定すると、1 つのデータ例に対する損失項を計算できます。 
+損失関数が$l$で、例のラベルが$y$であると仮定すると、単一のデータ例の損失項を計算できます。 
 
 $$L = l(\mathbf{o}, y).$$
 
-$L_2$ 正則化の定義によると、ハイパーパラメーター $\lambda$ が与えられた場合、正則化項は次のようになります。 
+後で紹介する $\ell_2$ 正則化の定義によれば、ハイパーパラメータ $\lambda$ が与えられたとき、正則化項は次のようになります。
 
 $$s = \frac{\lambda}{2} \left(\|\mathbf{W}^{(1)}\|_F^2 + \|\mathbf{W}^{(2)}\|_F^2\right),$$
 :eqlabel:`eq_forward-s`
 
-ここで、行列のフロベニウスノルムは、行列をベクトルに平坦化した後に適用された $L_2$ ノルムです。最後に、与えられたデータ例でのモデルの正則化損失は次のようになります。 
+ここで、行列のフロベニウスノルムは、行列をベクトルに平坦化した後に適用される $\ell_2$ ノルムです。最後に、特定のデータ例に対するモデルの正則化された損失は次のとおりです。 
 
 $$J = L + s.$$
 
-以下の説明では $J$ を*目的関数* と呼びます。 
+次の説明では、$J$を*目的関数*と呼びます。 
 
-## フォワード伝搬の計算グラフ
+## フォワードプロパゲーションの計算グラフ
 
-*計算グラフ* をプロットすると、計算における演算子と変数の依存関係を視覚化するのに役立ちます。:numref:`fig_forward` には、前述の単純なネットワークに関連するグラフが含まれ、四角は変数を表し、円は演算子を表します。左下隅が入力を表し、右上隅が出力を表します。矢印の方向 (データフローを示す) は、主に右向きと上向きであることに注意してください。 
+*計算グラフ*を描くと、計算における演算子と変数の依存関係を視覚化するのに役立ちます。:numref:`fig_forward` は、上で説明した単純なネットワークに対応するグラフで、正方形は変数を、円は演算子を表します。左下隅は入力を、右上隅は出力を表します。矢印 (データフローを示す) の方向は、主に右向きと上向きであることに注意してください。
 
 ![Computational graph of forward propagation.](../img/forward.svg)
 :label:`fig_forward`
 
 ## バックプロパゲーション
 
-*バックプロパゲーション* とは、計算の方法を指します。
-ニューラルネットワークのパラメーターの勾配。つまり、この方法は、微積分の*チェーンルール*に従って、出力層から入力層まで逆の順序でネットワークをトラバースします。このアルゴリズムは、一部のパラメーターに関する勾配を計算するときに必要な中間変数 (偏微分) を保存します。関数 $\mathsf{Y}=f(\mathsf{X})$ と $\mathsf{Z}=g(\mathsf{Y})$ があり、入力と出力 $\mathsf{X}, \mathsf{Y}, \mathsf{Z}$ が任意の形状のテンソルであると仮定します。連鎖則を使うことで、$\mathsf{X}$ に対する $\mathsf{Z}$ の微分を次のように計算できます。 
+*バックプロパゲーション* とは、
+ニューラルネットワークのパラメータの勾配を計算する方法を指します。要するに、この方法は微積分の*連鎖則*に従って、出力層から入力層へと逆の順序でネットワークをたどります。このアルゴリズムは、一部のパラメータに関する勾配を計算する際に必要となる中間変数 (偏導関数) を保存します。関数 $\mathsf{Y}=f(\mathsf{X})$ と $\mathsf{Z}=g(\mathsf{Y})$ があり、入力と出力 $\mathsf{X}, \mathsf{Y}, \mathsf{Z}$ が任意の形状のテンソルであると仮定します。連鎖則を使うと、$\mathsf{X}$ に関する $\mathsf{Z}$ の導関数を次のように計算できます。
 
 $$\frac{\partial \mathsf{Z}}{\partial \mathsf{X}} = \text{prod}\left(\frac{\partial \mathsf{Z}}{\partial \mathsf{Y}}, \frac{\partial \mathsf{Y}}{\partial \mathsf{X}}\right).$$
 
-ここでは $\text{prod}$ 演算子を使用して、転置や入力位置の入れ替えなどの必要な演算を実行した後に、その引数を乗算します。ベクトルの場合、これは単純に行列-行列の乗算です。高次元のテンソルには、対応するテンソルを使用します。演算子 $\text{prod}$ は表記法のオーバーヘッドをすべて隠します。 
+ここでは、転置や入力位置の入れ替えなどの必要な操作を行ったうえで引数を乗算するために、$\text{prod}$ 演算子を使用します。ベクトルの場合、これは単純で、行列と行列の乗算にすぎません。高次元のテンソルには、それに対応する適切な演算を使用します。演算子 $\text{prod}$ は、表記上の煩雑さをすべて隠してくれます。
 
-計算グラフが :numref:`fig_forward` にある 1 つの隠れ層をもつ単純ネットワークのパラメーターは $\mathbf{W}^{(1)}$ と $\mathbf{W}^{(2)}$ であることを思い出してください。逆伝播の目的は、勾配 $\partial J/\partial \mathbf{W}^{(1)}$ と $\partial J/\partial \mathbf{W}^{(2)}$ を計算することです。これを実現するために、チェーンルールを適用し、各中間変数とパラメーターの勾配を計算します。計算グラフの結果から始めて、パラメーターに向かって作業する必要があるため、計算の順序は順伝播で実行される順序とは逆になります。最初のステップは、損失項 $L$ と正則化項 $s$ に関する目的関数 $J=L+s$ の勾配を計算することです。 
+計算グラフが :numref:`fig_forward` に示されている、隠れ層が 1 つの単純なネットワークのパラメータは、$\mathbf{W}^{(1)}$ と $\mathbf{W}^{(2)}$ であることを思い出してください。バックプロパゲーションの目的は、勾配 $\partial J/\partial \mathbf{W}^{(1)}$ と $\partial J/\partial \mathbf{W}^{(2)}$ を計算することです。そのために、連鎖則を適用し、各中間変数とパラメータの勾配を順に計算します。計算グラフの結果から始めてパラメータに向かって進む必要があるため、計算の順序はフォワードプロパゲーションとは逆になります。最初のステップは、損失項 $L$ と正則化項 $s$ に関する目的関数 $J=L+s$ の勾配を計算することです。
 
 $$\frac{\partial J}{\partial L} = 1 \; \text{and} \; \frac{\partial J}{\partial s} = 1.$$
 
-次に、チェーンルールに従って、出力層 $\mathbf{o}$ の変数に対する目的関数の勾配を計算します。 
+次に、連鎖則に従って出力層 $\mathbf{o}$ の変数に対する目的関数の勾配を計算します。 
 
 $$
 \frac{\partial J}{\partial \mathbf{o}}
@@ -68,18 +68,18 @@ $$
 \in \mathbb{R}^q.
 $$
 
-次に、両方のパラメーターに関する正則化項の勾配を計算します。 
+次に、両方のパラメータに関する正則化項の勾配を計算します。 
 
 $$\frac{\partial s}{\partial \mathbf{W}^{(1)}} = \lambda \mathbf{W}^{(1)}
 \; \text{and} \;
 \frac{\partial s}{\partial \mathbf{W}^{(2)}} = \lambda \mathbf{W}^{(2)}.$$
 
-これで、出力レイヤーに最も近いモデルパラメーターの勾配 $\partial J/\partial \mathbf{W}^{(2)} \in \mathbb{R}^{q \times h}$ を計算できるようになりました。連鎖規則を使用すると、次の結果が得られます。 
+これで、出力層に最も近いモデルパラメータの勾配 $\partial J/\partial \mathbf{W}^{(2)} \in \mathbb{R}^{q \times h}$ を計算できます。連鎖則を使用すると、次の結果が得られます。
 
 $$\frac{\partial J}{\partial \mathbf{W}^{(2)}}= \text{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{W}^{(2)}}\right) + \text{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial \mathbf{W}^{(2)}}\right)= \frac{\partial J}{\partial \mathbf{o}} \mathbf{h}^\top + \lambda \mathbf{W}^{(2)}.$$
 :eqlabel:`eq_backprop-J-h`
 
-$\mathbf{W}^{(1)}$ に関する勾配を得るには、出力層に沿って隠れ層への逆伝播を続ける必要があります。隠れ層の出力 $\partial J/\partial \mathbf{h} \in \mathbb{R}^h$ に対する勾配は次の式で与えられます。 
+$\mathbf{W}^{(1)}$に関する勾配を得るには、出力層に沿って隠れ層への逆伝播を続ける必要があります。隠れ層出力 $\partial J/\partial \mathbf{h} \in \mathbb{R}^h$ に対する勾配は、次の式で与えられます。 
 
 $$
 \frac{\partial J}{\partial \mathbf{h}}
@@ -87,7 +87,7 @@ $$
 = {\mathbf{W}^{(2)}}^\top \frac{\partial J}{\partial \mathbf{o}}.
 $$
 
-活性化関数 $\phi$ は要素単位で適用されるため、中間変数 $\mathbf{z}$ の勾配 $\partial J/\partial \mathbf{z} \in \mathbb{R}^h$ を計算するには、$\odot$ で表される要素単位の乗算演算子を使用する必要があります。 
+活性化関数 $\phi$ は要素単位に適用されるため、中間変数 $\mathbf{z}$ の勾配 $\partial J/\partial \mathbf{z} \in \mathbb{R}^h$ を計算するには、要素単位の乗算演算子を使用する必要があります。これは $\odot$ で表します。 
 
 $$
 \frac{\partial J}{\partial \mathbf{z}}
@@ -95,7 +95,7 @@ $$
 = \frac{\partial J}{\partial \mathbf{h}} \odot \phi'\left(\mathbf{z}\right).
 $$
 
-最後に、入力層に最も近いモデルパラメーターの勾配 $\partial J/\partial \mathbf{W}^{(1)} \in \mathbb{R}^{h \times d}$ を取得できます。連鎖規則によれば、 
+最後に、入力層に最も近いモデルパラメータの勾配 $\partial J/\partial \mathbf{W}^{(1)} \in \mathbb{R}^{h \times d}$ を得ることができます。連鎖則によれば、次のようになります。
 
 $$
 \frac{\partial J}{\partial \mathbf{W}^{(1)}}
@@ -103,31 +103,31 @@ $$
 = \frac{\partial J}{\partial \mathbf{z}} \mathbf{x}^\top + \lambda \mathbf{W}^{(1)}.
 $$
 
-## ニューラルネットワークの学習
+## ニューラルネットワークのトレーニング
 
-ニューラルネットワークに学習させる場合、順伝播と逆伝播は互いに依存します。特に、順伝播では、計算グラフを依存関係の方向にトラバースし、そのパス上のすべての変数を計算します。これらは逆伝播に使用され、グラフ上の計算順序が逆になります。 
+ニューラルネットワークを学習させる場合、順伝播と逆伝播は互いに依存します。特に、順伝播では、計算グラフを依存関係の方向にトラバースし、そのパス上のすべての変数を計算します。これらは、グラフ上の計算順序が逆になるバックプロパゲーションに使用されます。 
 
-前述の単純なネットワークを例に挙げて説明します。一方では、順伝播中の正則化項 :eqref:`eq_forward-s` の計算は、モデルパラメーター $\mathbf{W}^{(1)}$ と $\mathbf{W}^{(2)}$ の現在の値に依存します。これらは、最新のイテレーションのバックプロパゲーションに従って、最適化アルゴリズムによって与えられます。一方、バックプロパゲーション中のパラメーター :eqref:`eq_backprop-J-h` の勾配計算は、フォワードプロパゲーションによって与えられる隠れ変数 $\mathbf{h}$ の現在の値に依存します。 
+説明する例として、前述の単純なネットワークを取り上げます。一方では、順伝播中の正則化項 :eqref:`eq_forward-s` の計算は、モデルパラメーター $\mathbf{W}^{(1)}$ および $\mathbf{W}^{(2)}$ の現在の値に依存します。これらは、最新の反復におけるバックプロパゲーションに従って最適化アルゴリズムによって与えられます。一方、バックプロパゲーション中のパラメーター :eqref:`eq_backprop-J-h` の勾配計算は、フォワードプロパゲーションによって与えられる隠れ層出力 $\mathbf{h}$ の現在の値に依存します。 
 
-したがって、ニューラルネットワークの学習時には、モデルパラメーターの初期化後に、順伝播と逆伝播を交互に行い、バックプロパゲーションによって与えられた勾配を使用してモデルパラメーターを更新します。バックプロパゲーションでは、計算の重複を避けるため、前方伝播の格納された中間値が再利用されることに注意してください。その結果の 1 つは、逆伝播が完了するまで中間値を保持する必要があることです。これは、トレーニングが単純な予測よりもはるかに多くのメモリを必要とする理由の 1 つでもあります。また、このような中間値のサイズは、ネットワーク層の数とバッチサイズにほぼ比例します。したがって、より大きなバッチサイズを使用してより深いネットワークに学習させると、「メモリ不足*」エラーが発生しやすくなります。 
+したがって、ニューラルネットワークをトレーニングする場合、モデルパラメーターが初期化された後、順伝播とバックプロパゲーションを交互に行い、バックプロパゲーションによって与えられる勾配を使用してモデルパラメーターを更新します。バックプロパゲーションでは、計算の重複を避けるために、格納されている中間値が順伝播から再利用されることに注意してください。その結果の 1 つは、バックプロパゲーションが完了するまで中間値を保持する必要があることです。これは、トレーニングが単純な予測よりもはるかに多くのメモリを必要とする理由の1つでもあります。また、このような中間値のサイズは、ネットワーク層の数とバッチサイズにほぼ比例します。したがって、より大きなバッチサイズを使用してより深いネットワークを学習させると、*メモリ不足* エラーが発生しやすくなります。 
 
-## [概要
+## まとめ
 
-* フォワードプロパゲーションは、ニューラルネットワークによって定義される計算グラフ内で中間変数を順次計算して保存します。入力層から出力層へと進みます。
+* フォワードプロパゲーションは、ニューラルネットワークによって定義された計算グラフ内の中間変数を順次計算して保存します。入力層から出力層へと進みます。
 * バックプロパゲーションは、ニューラルネットワーク内の中間変数とパラメーターの勾配を逆の順序で順次計算して保存します。
-* ディープラーニングモデルに学習させる場合、順伝播と逆伝播は相互に依存しています。
-* トレーニングには予測よりもはるかに多くのメモリが必要です。
+* ディープラーニングモデルをトレーニングする場合、フォワードプロパゲーションとバックプロパゲーションは相互に依存しています。
+* トレーニングには、予測よりもはるかに多くのメモリが必要です。
 
 ## 演習
 
-1. あるスカラー関数 $f$ に対する入力 $\mathbf{X}$ は $n \times m$ 行列であると仮定します。$\mathbf{X}$ に対する $f$ の勾配の次元はどれくらいですか？
+1. あるスカラー関数 $f$ への入力 $\mathbf{X}$ が $n \times m$ 行列であると仮定します。$\mathbf{X}$ に関する $f$ の勾配の次元はいくつですか？
 1. このセクションで説明するモデルの隠れ層にバイアスを追加します (正則化項にバイアスを含める必要はありません)。
     1. 対応する計算グラフを描画します。
-    1. 順伝播方程式と逆方向伝播方程式を導出する。
+    1. 順伝播と逆伝播の方程式を導出します。
 1. このセクションで説明するモデルで、学習と予測のためのメモリフットプリントを計算します。
-1. 2 次導関数を計算すると仮定します。計算グラフはどうなりますか？計算にはどれくらい時間がかかると思いますか。
+1. 二次導関数を計算したいと仮定します。計算グラフはどうなりますか？計算にはどれくらい時間がかかると思いますか？
 1. 計算グラフが GPU に対して大きすぎると仮定します。
-    1. 複数の GPU にパーティション分割できますか？
-    1. 小さいミニバッチでのトレーニングに勝るメリットとデメリットは何ですか？
+    1. それを複数の GPU に分割できますか？
+    1. より小さなミニバッチでトレーニングする場合と比べて、メリットとデメリットは何ですか？
 
 [Discussions](https://discuss.d2l.ai/t/102)
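
The gradient formulas translated in the backprop.md hunks above can be sanity-checked numerically. The sketch below is an illustration, not part of the patch: it assumes a tanh activation for $\phi$, a squared-error loss for $l$, and made-up sizes and weight-decay strength (the text leaves all of these generic), and it uses only the Python standard library so it runs without any framework. It compares the closed-form $\partial J/\partial \mathbf{W}^{(1)}$ and $\partial J/\partial \mathbf{W}^{(2)}$ with central finite differences.

```python
import math
import random

random.seed(0)
d, h, q, lam = 4, 3, 2, 0.1  # hypothetical sizes and weight-decay strength

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

x = [random.gauss(0, 1) for _ in range(d)]   # single input example
y = [random.gauss(0, 1) for _ in range(q)]   # assumed regression target
W1, W2 = rand_matrix(h, d), rand_matrix(q, h)

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def forward(W1, W2):
    """Forward pass: returns hidden output, output, and objective J = L + s."""
    z = matvec(W1, x)
    hid = [math.tanh(a) for a in z]           # phi = tanh (assumption)
    o = matvec(W2, hid)
    L = 0.5 * sum((a - b) ** 2 for a, b in zip(o, y))
    s = 0.5 * lam * sum(w * w for W in (W1, W2) for row in W for w in row)
    return hid, o, L + s

def backprop(W1, W2):
    """Closed-form gradients, following the order of the derivation."""
    hid, o, _ = forward(W1, W2)
    dJ_do = [a - b for a, b in zip(o, y)]     # dL/do for squared error
    dJ_dW2 = [[dJ_do[i] * hid[j] + lam * W2[i][j] for j in range(h)]
              for i in range(q)]
    dJ_dh = [sum(W2[i][j] * dJ_do[i] for i in range(q)) for j in range(h)]
    dJ_dz = [dJ_dh[j] * (1.0 - hid[j] ** 2) for j in range(h)]  # tanh' = 1 - tanh^2
    dJ_dW1 = [[dJ_dz[j] * x[k] + lam * W1[j][k] for k in range(d)]
              for j in range(h)]
    return dJ_dW1, dJ_dW2

def numerical_grad(W, eps=1e-6):
    """Central differences on J; W must be the global W1 or W2 (edited in place)."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for i in range(len(W)):
        for j in range(len(W[0])):
            keep = W[i][j]
            W[i][j] = keep + eps
            plus = forward(W1, W2)[2]
            W[i][j] = keep - eps
            minus = forward(W1, W2)[2]
            W[i][j] = keep
            grad[i][j] = (plus - minus) / (2 * eps)
    return grad

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

g1, g2 = backprop(W1, W2)
max_err = max(max_abs_diff(g1, numerical_grad(W1)),
              max_abs_diff(g2, numerical_grad(W2)))
print(f"max |analytic - numeric| = {max_err:.2e}")
```

If the printed difference is close to zero, the analytic expressions agree with the numerical estimate for this particular choice of activation and loss; a different $\phi$ or $l$ would need its own derivative in `backprop`.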
diff --git a/chapter_multilayer-perceptrons/backprop_origin.md b/chapter_multilayer-perceptrons/backprop_origin.md
index ed4113e..d6b4377 100644
--- a/chapter_multilayer-perceptrons/backprop_origin.md
+++ b/chapter_multilayer-perceptrons/backprop_origin.md
@@ -31,7 +31,7 @@ techniques and their implementations,
 we rely on some basic mathematics and computational graphs.
 To start, we focus our exposition on
 a one-hidden-layer MLP
-with weight decay ($L_2$ regularization).
+with weight decay ($\ell_2$ regularization, to be described in subsequent chapters).
 
 ## Forward Propagation
 
@@ -46,7 +46,7 @@ of funk virtuoso James Brown,
 you must "pay the cost to be the boss".
 
 
-For the sake of simplicity, let us assume
+For the sake of simplicity, let's assume
 that the input example is $\mathbf{x}\in \mathbb{R}^d$
 and that our hidden layer does not include a bias term.
 Here the intermediate variable is:
@@ -62,7 +62,7 @@ we obtain our hidden activation vector of length $h$,
 
 $$\mathbf{h}= \phi (\mathbf{z}).$$
 
-The hidden variable $\mathbf{h}$
+The hidden layer output $\mathbf{h}$
 is also an intermediate variable.
 Assuming that the parameters of the output layer
 only possess a weight of
@@ -79,7 +79,8 @@ for a single data example,
 
 $$L = l(\mathbf{o}, y).$$
 
-According to the definition of $L_2$ regularization,
+According to the definition of $\ell_2$ regularization
+that we will introduce later,
 given the hyperparameter $\lambda$,
 the regularization term is
 
@@ -87,7 +88,7 @@ $$s = \frac{\lambda}{2} \left(\|\mathbf{W}^{(1)}\|_F^2 + \|\mathbf{W}^{(2)}\|_F^
 :eqlabel:`eq_forward-s`
 
 where the Frobenius norm of the matrix
-is simply the $L_2$ norm applied
+is simply the $\ell_2$ norm applied
 after flattening the matrix into a vector.
 Finally, the model's regularized loss
 on a given data example is:
@@ -200,7 +201,7 @@ $$\frac{\partial J}{\partial \mathbf{W}^{(2)}}= \text{prod}\left(\frac{\partial
 To obtain the gradient with respect to $\mathbf{W}^{(1)}$
 we need to continue backpropagation
 along the output layer to the hidden layer.
-The gradient with respect to the hidden layer's outputs
+The gradient with respect to the hidden layer output
 $\partial J/\partial \mathbf{h} \in \mathbb{R}^h$ is given by
 
 
@@ -246,7 +247,7 @@ These are then used for backpropagation
 where the compute order on the graph is reversed.
 
 Take the aforementioned simple network as an example to illustrate.
-On one hand,
+On the one hand,
 computing the regularization term :eqref:`eq_forward-s`
 during forward propagation
 depends on the current values of model parameters $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$.
@@ -254,7 +255,7 @@ They are given by the optimization algorithm according to backpropagation in the
 On the other hand,
 the gradient calculation for the parameter
 :eqref:`eq_backprop-J-h` during backpropagation
-depends on the current value of the hidden variable $\mathbf{h}$,
+depends on the current value of the hidden layer output $\mathbf{h}$,
 which is given by forward propagation.
 
 
diff --git a/chapter_multilayer-perceptrons/dropout.md b/chapter_multilayer-perceptrons/dropout.md
index 2141350..cc95801 100644
--- a/chapter_multilayer-perceptrons/dropout.md
+++ b/chapter_multilayer-perceptrons/dropout.md
@@ -1,39 +1,24 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
 # ドロップアウト
 :label:`sec_dropout`
 
-:numref:`sec_weight_decay` では、重みの $L_2$ ノルムにペナルティを課すことで、統計モデルを正則化する古典的なアプローチを導入しました。確率論的に言えば、重みは平均がゼロのガウス分布から値を取るという事前の信念を前提としていたと主張することで、この手法を正当化できます。より直感的に言えば、少数の疑似的な関連性に過度に依存しすぎないように、モデルがその重みを多数の特徴量に分散させるよう奨励したと主張するかもしれません。 
-
-## オーバーフィットの再検討
-
-例よりも多くの特徴量に直面すると、線形モデルは過適合になりがちです。しかし、特徴量よりも多くの例を挙げれば、一般に線形モデルは過適合にならないと期待できます。残念ながら、線形モデルが一般化する信頼性にはコストがかかります。単純に適用すると、線形モデルは特徴間の相互作用を考慮しません。線形モデルは、すべての特徴量について、コンテキストを無視して正または負の重みを割り当てなければなりません。 
-
-従来のテキストでは、一般化可能性と柔軟性の間のこの根本的な緊張は、*バイアスと分散のトレードオフ*として説明されています。線形モデルはバイアスが高く、小さなクラスの関数しか表現できません。ただし、これらのモデルは分散が小さく、データのランダムサンプルが異なっても同様の結果が得られます。 
-
-ディープニューラルネットワークは、バイアス分散スペクトルの反対側に存在します。ニューラルネットワークは、線形モデルとは異なり、各特徴を個別に調べることに限定されません。フィーチャグループ間の相互作用を学習できます。たとえば、電子メールに「ナイジェリア」と「ウエスタンユニオン」が一緒に表示されているのはスパムを示しているが、個別にはスパムではないと推測する場合があります。 
-
-特徴量よりもはるかに多くの例がある場合でも、ディープニューラルネットワークは過適合する可能性があります。2017年、研究者のグループは、ランダムにラベル付けされた画像でディープネットをトレーニングすることで、ニューラルネットワークの非常に高い柔軟性を実証しました。入力を出力にリンクする真のパターンがないにもかかわらず、確率的勾配降下法によって最適化されたニューラルネットワークは、学習セット内のすべてのイメージに完全にラベルを付けることができることを発見しました。これが何を意味するのか考えてみてください。ラベルがランダムに一様に割り当てられ、クラスが 10 個ある場合、ホールドアウトデータの精度が 10% を超える分類器は存在しません。ここでの汎化ギャップはなんと 90% です。私たちのモデルが表現力豊かで、これがひどく過度にフィットする可能性がある場合、いつオーバーフィットしないと予想すべきですか？ 
+良い予測モデルに何を期待するかについて、簡単に考えてみましょう。私たちは、未知のデータに対してもうまく機能することを望んでいます。古典的な汎化理論は、訓練性能とテスト性能のギャップを埋めるには、単純なモデルを目指すべきだと示唆しています。単純さは、次元数の少なさという形で現れることがあります。:numref:`sec_generalization_basics` で線形モデルの単項式基底関数について議論した際に、このことを検討しました。さらに、:numref:`sec_weight_decay` で重み減衰 ($\ell_2$ 正則化) について説明したときに見たように、パラメータの (逆) ノルムも単純さの有用な尺度です。単純さのもう 1 つの有用な概念は滑らかさ、つまり、関数が入力の小さな変化に敏感であってはならないということです。たとえば、画像を分類する場合、ピクセルにランダムノイズを加えてもほとんど無害であると期待されます。
 
-ディープネットワークの不可解な汎化特性の数学的基礎は未解決の研究課題であり、理論志向の読者はこのトピックをより深く掘り下げることを奨励します。ここでは、ディープネットの一般化を実証的に改善する傾向がある実用的なツールの調査に移ります。 
+1995 年、クリストファー・ビショップは、入力ノイズを加えたトレーニングがティホノフ正則化と等価であることを証明し、この考えを定式化しました :cite:`Bishop.1995`。この研究は、関数が滑らかである (したがって単純である) という要件と、入力の摂動に対して頑健であるという要件との間に、明確な数学的つながりを示しました。
 
-## 摂動によるロバスト性
+そして 2014 年、Srivastava ら :cite:`Srivastava.Hinton.Krizhevsky.ea.2014` は、ビショップのアイデアをネットワークの内部層にも適用する巧妙な方法を考案しました。*ドロップアウト*と呼ばれる彼らのアイデアは、フォワードプロパゲーション中に各内部層を計算する際にノイズを注入するというもので、ニューラルネットワークをトレーニングするための標準的な手法になりました。この手法が*ドロップアウト*と呼ばれるのは、文字通り
+トレーニング中に一部のニューロンを*脱落* (drop out) させるからです。
+トレーニング全体を通じて、標準的なドロップアウトは、各反復で後続の層を計算する前に、各層のノードの一部をゼロにします。
 
-優れた予測モデルに期待されることについて簡単に考えてみましょう。私たちは、目に見えないデータでもうまく機能することを望んでいます。古典的な一般化理論は、訓練とテストの性能のギャップを埋めるためには、単純なモデルを目指すべきだと示唆しています。シンプルさは、少数の次元の形でもたらされます。:numref:`sec_model_selection` では、線形モデルの単項基底関数について論じるときに、このことを検討しました。さらに、:numref:`sec_weight_decay` で重みの減衰 ($L_2$ 正則化) について説明したときにわかったように、パラメーターの (逆) ノルムも簡略化の有効な尺度を表します。単純さのもう 1 つの有用な概念は、滑らかさです。つまり、関数は入力に対する小さな変化に敏感であってはならないということです。たとえば、画像を分類する場合、ピクセルにランダムノイズを追加してもほとんど無害であると予想されます。 
+誤解のないように言うと、ビショップの研究との関連付けは、私たち自身の解釈によるものです。ドロップアウトに関する元の論文は、有性生殖との意外な類推を通して直感を与えています。著者らは、ニューラルネットワークの過剰適合は、各層が前の層の特定の活性化パターンに依存する状態によって特徴付けられると主張し、この状態を*共適応*と呼んでいます。有性生殖が共適応した遺伝子を解体すると言われるのと同じように、ドロップアウトは共適応を解体する、というのが著者らの主張です。この理論の説明力には確かに議論の余地がありますが、ドロップアウトという手法自体は長く使われ続けており、さまざまな形式のドロップアウトがほとんどのディープラーニングライブラリに実装されています。
 
-1995年、クリストファー・ビショップは、入力ノイズによるトレーニングがTikhonov正則化:cite:`Bishop.1995`と同等であることを証明したときに、この考えを形式化しました。この研究により、関数が滑らかである (したがって単純である) という要件と、入力の摂動に対して弾力性があるという要件との間に明確な数学的なつながりが描かれました。 
+重要な課題は、このノイズをどのように注入するかです。1 つのアイデアは、ノイズを*偏りのない*方法で注入し、他の層を固定したときに、各層の期待値がノイズのない場合の値と等しくなるようにすることです。ビショップの研究では、線形モデルへの入力にガウスノイズを加えました。各トレーニング反復で、平均ゼロの分布からサンプリングされたノイズ $\epsilon \sim \mathcal{N}(0,\sigma^2)$ を入力 $\mathbf{x}$ に加え、摂動点 $\mathbf{x}' = \mathbf{x} + \epsilon$ を生成します。期待値では、$E[\mathbf{x}'] = \mathbf{x}$ となります。
 
-そして2014年、Srivastava et al. :cite:`Srivastava.Hinton.Krizhevsky.ea.2014` は、ビショップのアイデアをネットワークの内部層にも適用する方法について巧妙なアイデアを開発しました。つまり、学習中に次の層を計算する前に、ネットワークの各層にノイズを注入することを提案しました。彼らは、多くの層を持つ深層ネットワークに学習させる場合、ノイズを注入すると入出力マッピングだけで滑らかさが強制されることに気付きました。 
-
-*dropout* と呼ばれる彼らのアイデアは、順伝播中に各内部層を計算しながらノイズを注入することを含み、ニューラルネットワークを訓練するための標準的な手法となっています。この方法は*dropout*と呼ばれていますので、文字通り
-*トレーニング中に一部のニューロンを脱落させる。
-学習中、各反復で、標準ドロップアウトは、次の層を計算する前に、各層のノードの一部をゼロにすることで構成されます。 
-
-明確にするために、私たちはビショップへのリンクで私たち自身の物語を押し付けています。ドロップアウトに関するオリジナルの論文は、有性生殖との驚くべき類推を通して直感を提供します。著者らは、ニューラルネットワークの過適合は、各層が前の層の特定の活性化パターンに依存し、この条件を「共適応」と呼んでいる状態によって特徴付けられると主張している。ドロップアウト, 彼らが主張する, 有性生殖が共適応遺伝子を破壊すると主張されているのと同じように、共適応を崩壊させる. 
-
-ここで重要な課題は、このノイズをどのように注入するかです。1 つのアイディアは、各レイヤーの期待値が (他のレイヤーは固定しながら) ノイズがないと想定される値と等しくなるように、ノイズを*バイアスなし*の方法で注入することです。 
-
-Bishopの研究では、線形モデルへの入力にガウスノイズを加えました。学習の反復ごとに、平均 0 $\epsilon \sim \mathcal{N}(0,\sigma^2)$ の分布からサンプリングされたノイズを入力の $\mathbf{x}$ に追加し、摂動点 $\mathbf{x}' = \mathbf{x} + \epsilon$ を生成しました。予想通り、$E[\mathbf{x}'] = \mathbf{x}$。 
-
-標準のドロップアウト正則化では、保持された (ドロップアウトされていない) ノードの割合で正規化することで、各層のバイアスを除去します。つまり、*ドロップアウト確率* $p$ では、各中間活性化 $h$ は次のように確率変数 $h'$ に置き換えられます。 
+標準のドロップアウト正則化では、各層のノードの一部をゼロにしたうえで、保持された (ドロップアウトされなかった) ノードの割合で正規化することにより、各層の*バイアスを除去*します。つまり、*ドロップアウト確率* $p$ のもとで、各中間活性化 $h$ は次のように確率変数 $h'$ に置き換えられます。
 
 $$
 \begin{aligned}
@@ -45,24 +30,25 @@ h' =
 \end{aligned}
 $$
 
-設計上、期待値は変わりません、つまり $E[h'] = h$ です。 
+設計上、期待値は変わりません。つまり $E[h'] = h$ です。
 
 ## ドロップアウト・イン・プラクティス
 
-:numref:`fig_mlp` の隠れ層と 5 つの隠れユニットを持つ MLP を思い出してください。隠れ層にドロップアウトを適用し、隠れユニットを確率 $p$ でゼロにすると、その結果は元のニューロンのサブセットのみを含むネットワークと見なすことができます。:numref:`fig_dropout2` では、$h_2$ と $h_5$ は削除されています。その結果、出力の計算が $h_2$ または $h_5$ に依存しなくなり、逆伝播を実行するとそれぞれの勾配も消滅します。このように、出力層の計算が $h_1, \ldots, h_5$ の 1 つの要素に過度に依存しすぎることはありません。 
+:numref:`fig_mlp` の、隠れ層が 1 つで隠れユニットが 5 つの MLP を思い出してください。隠れ層にドロップアウトを適用し、各隠れユニットを確率 $p$ でゼロにすると、その結果は元のニューロンのサブセットのみを含むネットワークと見なすことができます。:numref:`fig_dropout2` では、$h_2$ と $h_5$ が取り除かれています。その結果、出力の計算は $h_2$ と $h_5$ に依存しなくなり、バックプロパゲーションの実行時にはそれぞれの勾配も消失します。このようにして、出力層の計算が $h_1, \ldots, h_5$ のいずれか 1 つの要素に過度に依存することがなくなります。
 
 ![MLP before and after dropout.](../img/dropout2.svg)
 :label:`fig_dropout2`
 
-通常、ドロップアウトはテスト時に無効にします。トレーニング済みのモデルと新しい例があれば、ノードをドロップアウトしないため、正規化する必要はありません。ただし、いくつかの例外があります。一部の研究者は、ニューラルネットワーク予測の*不確実性*を推定するためのヒューリスティックとしてテスト時にドロップアウトを使用します。予測が多数の異なるドロップアウトマスクで一致すれば、ネットワークの信頼性が高いと言えるかもしれません。 
+通常、テスト時にはドロップアウトを無効にします。トレーニング済みのモデルと新しい例が与えられた場合、ノードをドロップアウトしないため、正規化する必要もありません。ただし、いくつかの例外があります。一部の研究者は、ニューラルネットワークの予測の*不確実性*を推定するためのヒューリスティックとして、テスト時にドロップアウトを使用します。多数の異なるドロップアウトマスクにわたって予測が一致する場合、ネットワークの確信度が高いと言えるかもしれません。
 
 ## ゼロからの実装
 
-単一層にドロップアウト関数を実装するには、層の次元数と同じ数のサンプルをベルヌーイ (バイナリ) 確率変数から引き出さなければなりません。ここで、確率変数は $1-p$ の値 $1$ (保持) と確率 $p$ の $0$ (drop) を取ります。これを実装する簡単な方法の 1 つは、一様分布 $U[0, 1]$ から標本を抽出することです。次に、対応するサンプルが $p$ より大きいノードを保持し、残りを削除できます。 
+単一の層に対するドロップアウト関数を実装するには、層の次元数と同じ数のサンプルをベルヌーイ (2 値) 確率変数から抽出する必要があります。ここで、確率変数は確率 $1-p$ で値 $1$ (保持) を、確率 $p$ で値 $0$ (ドロップ) を取ります。これを実装する簡単な方法の 1 つは、まず一様分布 $U[0, 1]$ からサンプルを抽出することです。次に、対応するサンプルが $p$ より大きいノードを保持し、残りをドロップします。
 
-次のコードでは (** テンソル入力 `X` の要素を確率で `dropout` でドロップアウトする `dropout_layer` 関数を実装**)、上記のように余りを再スケーリングします:生存者を `1.0-dropout` で割ります。
+次のコードでは、(**テンソル入力 `X` の要素を確率 `dropout` でドロップアウトする `dropout_layer` 関数を実装します**)。残った要素は上記のとおり再スケーリングします。つまり、残った要素を `1.0-dropout` で割ります。
 
-```{.python .input}
+```{.python .input  n=5}
+%%tab mxnet
 from d2l import mxnet as d2l
 from mxnet import autograd, gluon, init, np, npx
 from mxnet.gluon import nn
@@ -70,301 +56,207 @@ npx.set_np()
 
 def dropout_layer(X, dropout):
     assert 0 <= dropout <= 1
-    # In this case, all elements are dropped out
-    if dropout == 1:
-        return np.zeros_like(X)
-    # In this case, all elements are kept
-    if dropout == 0:
-        return X
+    if dropout == 1: return np.zeros_like(X)
     mask = np.random.uniform(0, 1, X.shape) > dropout
     return mask.astype(np.float32) * X / (1.0 - dropout)
 ```
 
-```{.python .input}
-#@tab pytorch
+```{.python .input  n=7}
+%%tab pytorch
 from d2l import torch as d2l
 import torch
 from torch import nn
 
 def dropout_layer(X, dropout):
     assert 0 <= dropout <= 1
-    # In this case, all elements are dropped out
-    if dropout == 1:
-        return torch.zeros_like(X)
-    # In this case, all elements are kept
-    if dropout == 0:
-        return X
+    if dropout == 1: return torch.zeros_like(X)
     mask = (torch.rand(X.shape) > dropout).float()
     return mask * X / (1.0 - dropout)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 from d2l import tensorflow as d2l
 import tensorflow as tf
 
 def dropout_layer(X, dropout):
     assert 0 <= dropout <= 1
-    # In this case, all elements are dropped out
-    if dropout == 1:
-        return tf.zeros_like(X)
-    # In this case, all elements are kept
-    if dropout == 0:
-        return X
+    if dropout == 1: return tf.zeros_like(X)
     mask = tf.random.uniform(
         shape=tf.shape(X), minval=0, maxval=1) < 1 - dropout
     return tf.cast(mask, dtype=tf.float32) * X / (1.0 - dropout)
 ```
 
-[**`dropout_layer` 関数をいくつかの例でテストできます**]。次のコード行では、入力 `X` をドロップアウト演算にそれぞれ確率 0、0.5、1 で渡しています。
-
-```{.python .input}
-X = np.arange(16).reshape(2, 8)
-print(dropout_layer(X, 0))
-print(dropout_layer(X, 0.5))
-print(dropout_layer(X, 1))
-```
-
-```{.python .input}
-#@tab pytorch
-X= torch.arange(16, dtype = torch.float32).reshape((2, 8))
-print(X)
-print(dropout_layer(X, 0.))
-print(dropout_layer(X, 0.5))
-print(dropout_layer(X, 1.))
-```
-
-```{.python .input}
-#@tab tensorflow
-X = tf.reshape(tf.range(16, dtype=tf.float32), (2, 8))
-print(X)
-print(dropout_layer(X, 0.))
-print(dropout_layer(X, 0.5))
-print(dropout_layer(X, 1.))
-```
-
-### モデルパラメーターの定義
-
-ここでも、:numref:`sec_fashion_mnist` で導入された Fashion-MNIST データセットを使用します。[**それぞれ 256 単位を含む 2 つの隠れ層をもつ MLP を定義します**]
-
-```{.python .input}
-num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256
-
-W1 = np.random.normal(scale=0.01, size=(num_inputs, num_hiddens1))
-b1 = np.zeros(num_hiddens1)
-W2 = np.random.normal(scale=0.01, size=(num_hiddens1, num_hiddens2))
-b2 = np.zeros(num_hiddens2)
-W3 = np.random.normal(scale=0.01, size=(num_hiddens2, num_outputs))
-b3 = np.zeros(num_outputs)
-
-params = [W1, b1, W2, b2, W3, b3]
-for param in params:
-    param.attach_grad()
-```
-
-```{.python .input}
-#@tab pytorch
-num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256
-```
-
-```{.python .input}
-#@tab tensorflow
-num_outputs, num_hiddens1, num_hiddens2 = 10, 256, 256
+[**いくつかの例で `dropout_layer` 関数をテストする**] ことができます。次のコードでは、入力 `X` を確率 0、0.5、1 のそれぞれでドロップアウト演算に渡します。
+
+```{.python .input  n=6}
+%%tab all
+if tab.selected('mxnet'):
+    X = np.arange(16).reshape(2, 8)
+if tab.selected('pytorch'):
+    X = torch.arange(16, dtype = torch.float32).reshape((2, 8))
+if tab.selected('tensorflow'):
+    X = tf.reshape(tf.range(16, dtype=tf.float32), (2, 8))
+print('dropout_p = 0:', dropout_layer(X, 0))
+print('dropout_p = 0.5:', dropout_layer(X, 0.5))
+print('dropout_p = 1:', dropout_layer(X, 1))
 ```
 
 ### モデルを定義する
 
-以下のモデルは、各隠れ層の出力にドロップアウトを適用します (アクティベーション関数に従う)。各層にドロップアウト確率を個別に設定できます。一般的な傾向として、ドロップアウト確率を低く設定すると、入力レイヤーに近づきます。以下では、1 番目と 2 番目の隠れレイヤーをそれぞれ 0.2 と 0.5 に設定します。ドロップアウトはトレーニング中のみ有効になるようにしています。
+以下のモデルは、各隠れ層の出力に (活性化関数の後で) ドロップアウトを適用します。ドロップアウト確率は層ごとに個別に設定できます。一般的には、入力層に近いほどドロップアウト確率を低く設定します。ドロップアウトはトレーニング中のみ有効になるようにしています。
 
 ```{.python .input}
-dropout1, dropout2 = 0.2, 0.5
-
-def net(X):
-    X = X.reshape(-1, num_inputs)
-    H1 = npx.relu(np.dot(X, W1) + b1)
-    # Use dropout only when training the model
-    if autograd.is_training():
-        # Add a dropout layer after the first fully connected layer
-        H1 = dropout_layer(H1, dropout1)
-    H2 = npx.relu(np.dot(H1, W2) + b2)
-    if autograd.is_training():
-        # Add a dropout layer after the second fully connected layer
-        H2 = dropout_layer(H2, dropout2)
-    return np.dot(H2, W3) + b3
+%%tab mxnet
+class DropoutMLPScratch(d2l.Classifier):
+    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
+                 dropout_1, dropout_2, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        self.lin1 = nn.Dense(num_hiddens_1, activation='relu')
+        self.lin2 = nn.Dense(num_hiddens_2, activation='relu')
+        self.lin3 = nn.Dense(num_outputs)
+        self.initialize()
+
+    def forward(self, X):
+        H1 = self.lin1(X)
+        if autograd.is_training():
+            H1 = dropout_layer(H1, self.dropout_1)
+        H2 = self.lin2(H1)
+        if autograd.is_training():
+            H2 = dropout_layer(H2, self.dropout_2)
+        return self.lin3(H2)
 ```
 
 ```{.python .input}
-#@tab pytorch
-dropout1, dropout2 = 0.2, 0.5
-
-class Net(nn.Module):
-    def __init__(self, num_inputs, num_outputs, num_hiddens1, num_hiddens2,
-                 is_training = True):
-        super(Net, self).__init__()
-        self.num_inputs = num_inputs
-        self.training = is_training
-        self.lin1 = nn.Linear(num_inputs, num_hiddens1)
-        self.lin2 = nn.Linear(num_hiddens1, num_hiddens2)
-        self.lin3 = nn.Linear(num_hiddens2, num_outputs)
+%%tab pytorch
+class DropoutMLPScratch(d2l.Classifier):
+    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
+                 dropout_1, dropout_2, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        self.lin1 = nn.LazyLinear(num_hiddens_1)
+        self.lin2 = nn.LazyLinear(num_hiddens_2)
+        self.lin3 = nn.LazyLinear(num_outputs)
         self.relu = nn.ReLU()
 
     def forward(self, X):
-        H1 = self.relu(self.lin1(X.reshape((-1, self.num_inputs))))
-        # Use dropout only when training the model
-        if self.training == True:
-            # Add a dropout layer after the first fully connected layer
-            H1 = dropout_layer(H1, dropout1)
+        H1 = self.relu(self.lin1(X.reshape((X.shape[0], -1))))
+        if self.training:  
+            H1 = dropout_layer(H1, self.dropout_1)
         H2 = self.relu(self.lin2(H1))
-        if self.training == True:
-            # Add a dropout layer after the second fully connected layer
-            H2 = dropout_layer(H2, dropout2)
-        out = self.lin3(H2)
-        return out
-
-
-net = Net(num_inputs, num_outputs, num_hiddens1, num_hiddens2)
+        if self.training:
+            H2 = dropout_layer(H2, self.dropout_2)
+        return self.lin3(H2)
 ```
 
 ```{.python .input}
-#@tab tensorflow
-dropout1, dropout2 = 0.2, 0.5
-
-class Net(tf.keras.Model):
-    def __init__(self, num_outputs, num_hiddens1, num_hiddens2):
+%%tab tensorflow
+class DropoutMLPScratch(d2l.Classifier):
+    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
+                 dropout_1, dropout_2, lr):
         super().__init__()
-        self.input_layer = tf.keras.layers.Flatten()
-        self.hidden1 = tf.keras.layers.Dense(num_hiddens1, activation='relu')
-        self.hidden2 = tf.keras.layers.Dense(num_hiddens2, activation='relu')
-        self.output_layer = tf.keras.layers.Dense(num_outputs)
-
-    def call(self, inputs, training=None):
-        x = self.input_layer(inputs)
-        x = self.hidden1(x)
-        if training:
-            x = dropout_layer(x, dropout1)
-        x = self.hidden2(x)
-        if training:
-            x = dropout_layer(x, dropout2)
-        x = self.output_layer(x)
-        return x
-
-net = Net(num_outputs, num_hiddens1, num_hiddens2)
+        self.save_hyperparameters()
+        self.lin1 = tf.keras.layers.Dense(num_hiddens_1, activation='relu')
+        self.lin2 = tf.keras.layers.Dense(num_hiddens_2, activation='relu')
+        self.lin3 = tf.keras.layers.Dense(num_outputs)
+        
+    def forward(self, X):
+        H1 = self.lin1(tf.reshape(X, (X.shape[0], -1)))
+        if self.training:
+            H1 = dropout_layer(H1, self.dropout_1)
+        H2 = self.lin2(H1)
+        if self.training:
+            H2 = dropout_layer(H2, self.dropout_2)
+        return self.lin3(H2)
 ```
 
-### [**トレーニングとテスト**]
-
-これは、前に説明した MLP のトレーニングとテストと似ています。
+### [**トレーニング**]
 
-```{.python .input}
-num_epochs, lr, batch_size = 10, 0.5, 256
-loss = gluon.loss.SoftmaxCrossEntropyLoss()
-train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs,
-              lambda batch_size: d2l.sgd(params, lr, batch_size))
-```
+以下は、前に説明した MLP のトレーニングと似ています。
 
 ```{.python .input}
-#@tab pytorch
-num_epochs, lr, batch_size = 10, 0.5, 256
-loss = nn.CrossEntropyLoss()
-train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
-trainer = torch.optim.SGD(net.parameters(), lr=lr)
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
-```
-
-```{.python .input}
-#@tab tensorflow
-num_epochs, lr, batch_size = 10, 0.5, 256
-loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
-train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
-trainer = tf.keras.optimizers.SGD(learning_rate=lr)
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
+%%tab all
+hparams = {'num_outputs':10, 'num_hiddens_1':256, 'num_hiddens_2':256, 
+           'dropout_1':0.5, 'dropout_2':0.5, 'lr':0.1}
+model = DropoutMLPScratch(**hparams)
+data = d2l.FashionMNIST(batch_size=256)
+trainer = d2l.Trainer(max_epochs=10)
+trainer.fit(model, data)
 ```
 
 ## [**簡潔な実装**]
 
-高レベル API では、完全接続された各レイヤーの後に `Dropout` レイヤーを追加し、ドロップアウト確率をコンストラクターの唯一の引数として渡すだけで済みます。学習中、`Dropout` 層は、指定されたドロップアウト確率に従って、前の層の出力 (またはそれと同等に後続の層への入力) をランダムにドロップアウトします。トレーニングモードでない場合、`Dropout` 層はテスト中にデータを渡すだけです。
-
-```{.python .input}
-net = nn.Sequential()
-net.add(nn.Dense(256, activation="relu"),
-        # Add a dropout layer after the first fully connected layer
-        nn.Dropout(dropout1),
-        nn.Dense(256, activation="relu"),
-        # Add a dropout layer after the second fully connected layer
-        nn.Dropout(dropout2),
-        nn.Dense(10))
-net.initialize(init.Normal(sigma=0.01))
-```
+高レベル API では、各全結合層の後に `Dropout` 層を追加し、ドロップアウト確率をコンストラクタの唯一の引数として渡すだけで済みます。トレーニング中、`Dropout` 層は、指定されたドロップアウト確率に従って、前の層の出力 (すなわち後続の層への入力) をランダムにドロップアウトします。トレーニングモードでない場合、`Dropout` 層はテスト時にデータをそのまま通過させるだけです。
 
 ```{.python .input}
-#@tab pytorch
-net = nn.Sequential(nn.Flatten(),
-        nn.Linear(784, 256),
-        nn.ReLU(),
-        # Add a dropout layer after the first fully connected layer
-        nn.Dropout(dropout1),
-        nn.Linear(256, 256),
-        nn.ReLU(),
-        # Add a dropout layer after the second fully connected layer
-        nn.Dropout(dropout2),
-        nn.Linear(256, 10))
-
-def init_weights(m):
-    if type(m) == nn.Linear:
-        nn.init.normal_(m.weight, std=0.01)
-
-net.apply(init_weights);
+%%tab mxnet
+class DropoutMLP(d2l.Classifier):
+    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
+                 dropout_1, dropout_2, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        self.net = nn.Sequential()
+        self.net.add(nn.Dense(num_hiddens_1, activation="relu"),
+                     nn.Dropout(dropout_1),
+                     nn.Dense(num_hiddens_2, activation="relu"),
+                     nn.Dropout(dropout_2),
+                     nn.Dense(num_outputs))
+        self.net.initialize()
 ```
 
 ```{.python .input}
-#@tab tensorflow
-net = tf.keras.models.Sequential([
-    tf.keras.layers.Flatten(),
-    tf.keras.layers.Dense(256, activation=tf.nn.relu),
-    # Add a dropout layer after the first fully connected layer
-    tf.keras.layers.Dropout(dropout1),
-    tf.keras.layers.Dense(256, activation=tf.nn.relu),
-    # Add a dropout layer after the second fully connected layer
-    tf.keras.layers.Dropout(dropout2),
-    tf.keras.layers.Dense(10),
-])
+%%tab pytorch
+class DropoutMLP(d2l.Classifier):
+    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
+                 dropout_1, dropout_2, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        self.net = nn.Sequential(
+            nn.Flatten(), nn.LazyLinear(num_hiddens_1), nn.ReLU(), 
+            nn.Dropout(dropout_1), nn.LazyLinear(num_hiddens_2), nn.ReLU(), 
+            nn.Dropout(dropout_2), nn.LazyLinear(num_outputs))
 ```
 
-次に、[**モデルのトレーニングとテスト**] を行います。
-
 ```{.python .input}
-trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
+%%tab tensorflow
+class DropoutMLP(d2l.Classifier):
+    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
+                 dropout_1, dropout_2, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        self.net = tf.keras.models.Sequential([
+            tf.keras.layers.Flatten(),
+            tf.keras.layers.Dense(num_hiddens_1, activation=tf.nn.relu),
+            tf.keras.layers.Dropout(dropout_1),
+            tf.keras.layers.Dense(num_hiddens_2, activation=tf.nn.relu),
+            tf.keras.layers.Dropout(dropout_2),
+            tf.keras.layers.Dense(num_outputs)])
 ```
 
-```{.python .input}
-#@tab pytorch
-trainer = torch.optim.SGD(net.parameters(), lr=lr)
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
-```
+次に、[**モデルをトレーニングします**]。
 
 ```{.python .input}
-#@tab tensorflow
-trainer = tf.keras.optimizers.SGD(learning_rate=lr)
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
+%%tab all
+model = DropoutMLP(**hparams)
+trainer.fit(model, data)
 ```
 
-## [概要
+## まとめ
 
-* ドロップアウトは、次元数と重みベクトルのサイズを制御するだけでなく、過適合を回避するためのもう 1 つのツールです。多くの場合、それらは共同で使用されます。
-* ドロップアウトは、アクティベーション $h$ を予想値 $h$ の確率変数に置き換えます。
+* ドロップアウトは、次元数や重みベクトルの大きさを制御する手法に加えて、過剰適合を避けるためのもう 1 つのツールです。多くの場合、これらの手法は併用されます。
+* ドロップアウトは、活性化 $h$ を、期待値が $h$ である確率変数に置き換えます。
 * ドロップアウトはトレーニング中にのみ使用されます。
 
 ## 演習
 
-1. 第1層と第2層のドロップアウト確率を変更するとどうなりますか？特に、両方のレイヤーのものを切り替えるとどうなりますか？これらの質問に答える実験を計画し、結果を定量的に説明し、定性的な要点をまとめます。
-1. エポック数を増やし、dropout を使用した場合と使用しない場合の結果を比較します。
-1. ドロップアウトが適用された場合と適用されない場合の各隠れレイヤーでのアクティベーションのばらつきはどれくらいですか？プロットを描画して、この量が両方のモデルで経時的にどのように変化するかを示します。
-1. テスト時にドロップアウトが一般的に使用されないのはなぜですか？
-1. このセクションのモデルを例として使用して、ドロップアウトとウェイトディケイを使用した場合の効果を比較します。ドロップアウトとウェイトディケイを同時に使用するとどうなりますか？結果は加法性ですか？リターンの減少（またはそれより悪い）はありますか？彼らはお互いをキャンセルしますか?
-1. 活性化ではなく重みマトリックスの個々の重みにドロップアウトを適用するとどうなりますか？
-1. 各層にランダムノイズを注入する、標準のドロップアウト手法とは異なる、もう 1 つの手法を考案します。Fashion-MNIST データセット (固定アーキテクチャの場合) のドロップアウトよりも優れた方法を開発できますか?
+1. 第 1 層と第 2 層のドロップアウト確率を変更するとどうなりますか？特に、両方の層の確率を入れ替えるとどうなりますか？これらの質問に答える実験を設計し、結果を定量的に説明し、定性的な要点をまとめてください。
+1. エポック数を増やし、ドロップアウトを使用した場合と使用しない場合の結果を比較してください。
+1. ドロップアウトを適用した場合と適用しない場合とで、各隠れ層の活性化の分散はどれくらいですか？両方のモデルについて、この量が時間とともにどのように変化するかを示すプロットを描いてください。
+1. ドロップアウトが通常、テスト時に使用されないのはなぜですか？
+1. このセクションのモデルを例として、ドロップアウトと重み減衰を使用した場合の効果を比較してください。ドロップアウトと重み減衰を同時に使用するとどうなりますか？結果は加算的ですか？収穫逓減 (またはそれより悪い結果) はありますか？互いに打ち消し合いますか？
+1. 活性化ではなく、重み行列の個々の重みにドロップアウトを適用するとどうなりますか？
+1. 標準的なドロップアウト手法とは異なる、各層にランダムノイズを注入する別の手法を考案してください。Fashion-MNIST データセット (固定アーキテクチャの場合) でドロップアウトよりも優れた方法を開発できますか？
 
 :begin_tab:`mxnet`
 [Discussions](https://discuss.d2l.ai/t/100)
diff --git a/chapter_multilayer-perceptrons/dropout_origin.md b/chapter_multilayer-perceptrons/dropout_origin.md
index a6ce988..7579356 100644
--- a/chapter_multilayer-perceptrons/dropout_origin.md
+++ b/chapter_multilayer-perceptrons/dropout_origin.md
@@ -1,80 +1,13 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
 # Dropout
 :label:`sec_dropout`
 
-In :numref:`sec_weight_decay`,
-we introduced the classical approach
-to regularizing statistical models
-by penalizing the $L_2$ norm of the weights.
-In probabilistic terms, we could justify this technique
-by arguing that we have assumed a prior belief
-that weights take values from
-a Gaussian distribution with mean zero.
-More intuitively, we might argue
-that we encouraged the model to spread out its weights
-among many features rather than depending too much
-on a small number of potentially spurious associations.
-
-## Overfitting Revisited
-
-Faced with more features than examples,
-linear models tend to overfit.
-But given more examples than features,
-we can generally count on linear models not to overfit.
-Unfortunately, the reliability with which
-linear models generalize comes at a cost.
-Naively applied, linear models do not take
-into account interactions among features.
-For every feature, a linear model must assign
-either a positive or a negative weight, ignoring context.
-
-In traditional texts, this fundamental tension
-between generalizability and flexibility
-is described as the *bias-variance tradeoff*.
-Linear models have high bias: they can only represent a small class of functions.
-However, these models have low variance: they give similar results
-across different random samples of the data.
-
-Deep neural networks inhabit the opposite
-end of the bias-variance spectrum.
-Unlike linear models, neural networks
-are not confined to looking at each feature individually.
-They can learn interactions among groups of features.
-For example, they might infer that
-“Nigeria” and “Western Union” appearing
-together in an email indicates spam
-but that separately they do not.
-
-Even when we have far more examples than features,
-deep neural networks are capable of overfitting.
-In 2017, a group of researchers demonstrated
-the extreme flexibility of neural networks
-by training deep nets on randomly-labeled images.
-Despite the absence of any true pattern
-linking the inputs to the outputs,
-they found that the neural network optimized by stochastic gradient descent
-could label every image in the training set perfectly.
-Consider what this means.
-If the labels are assigned uniformly
-at random and there are 10 classes,
-then no classifier can do better
-than 10% accuracy on holdout data.
-The generalization gap here is a whopping 90%.
-If our models are so expressive that they
-can overfit this badly, then when should
-we expect them not to overfit?
-
-The mathematical foundations for
-the puzzling generalization properties
-of deep networks remain open research questions,
-and we encourage the theoretically-oriented
-reader to dig deeper into the topic.
-For now, we turn to the investigation of
-practical tools that tend to
-empirically improve the generalization of deep nets.
-
-## Robustness through Perturbations
-
-Let us think briefly about what we
+
+Let's think briefly about what we
 expect from a good predictive model.
 We want it to peform well on unseen data.
 Classical generalization theory
@@ -85,9 +18,9 @@ Simplicity can come in the form
 of a small number of dimensions.
 We explored this when discussing the
 monomial basis functions of linear models
-in :numref:`sec_model_selection`.
+in :numref:`sec_generalization_basics`.
 Additionally, as we saw when discussing weight decay
-($L_2$ regularization) in :numref:`sec_weight_decay`,
+($\ell_2$ regularization) in :numref:`sec_weight_decay`,
 the (inverse) norm of the parameters also
 represents a useful measure of simplicity.
 Another useful notion of simplicity is smoothness,
@@ -108,13 +41,6 @@ to perturbations in the input.
 Then, in 2014, Srivastava et al. :cite:`Srivastava.Hinton.Krizhevsky.ea.2014`
 developed a clever idea for how to apply Bishop's idea
 to the internal layers of a network, too.
-Namely, they proposed to inject noise
-into each layer of the network
-before calculating the subsequent layer during training.
-They realized that when training
-a deep network with many layers,
-injecting noise enforces smoothness just on the input-output mapping.
-
 Their idea, called *dropout*, involves
 injecting noise while computing
 each internal layer during forward propagation,
@@ -134,18 +60,22 @@ offers intuition through a surprising
 analogy to sexual reproduction.
 The authors argue that neural network overfitting
 is characterized by a state in which
-each layer relies on a specifc
+each layer relies on a specific
 pattern of activations in the previous layer,
 calling this condition *co-adaptation*.
-Dropout, they claim, breaks up co-adaptation
+Dropout, they claim, breaks up co-adaptation
 just as sexual reproduction is argued to
 break up co-adapted genes.
+While the explanatory power of this theory is certainly up for debate,
+the dropout technique itself has proved enduring,
+and various forms of dropout are implemented
+in most deep learning libraries. 
+
 
-The key challenge then is how to inject this noise.
+The key challenge is how to inject this noise.
 One idea is to inject the noise in an *unbiased* manner
 so that the expected value of each layer---while fixing
 the others---equals to the value it would have taken absent noise.
-
 In Bishop's work, he added Gaussian noise
 to the inputs to a linear model.
 At each training iteration, he added noise
@@ -155,7 +85,8 @@ yielding a perturbed point $\mathbf{x}' = \mathbf{x} + \epsilon$.
 In expectation, $E[\mathbf{x}'] = \mathbf{x}$.
 
 In standard dropout regularization,
-one debiases each layer by normalizing
+one zeros out some fraction of the nodes in each layer
+and then *debiases* each layer by normalizing
 by the fraction of nodes that were retained (not dropped out).
 In other words,
 with *dropout probability* $p$,
@@ -223,7 +154,8 @@ with probability `dropout`**),
 rescaling the remainder as described above:
 dividing the survivors by `1.0-dropout`.
 
-```{.python .input}
+```{.python .input  n=5}
+%%tab mxnet
 from d2l import mxnet as d2l
 from mxnet import autograd, gluon, init, np, npx
 from mxnet.gluon import nn
@@ -231,47 +163,32 @@ npx.set_np()
 
 def dropout_layer(X, dropout):
     assert 0 <= dropout <= 1
-    # In this case, all elements are dropped out
-    if dropout == 1:
-        return np.zeros_like(X)
-    # In this case, all elements are kept
-    if dropout == 0:
-        return X
+    if dropout == 1: return np.zeros_like(X)
     mask = np.random.uniform(0, 1, X.shape) > dropout
     return mask.astype(np.float32) * X / (1.0 - dropout)
 ```
 
-```{.python .input}
-#@tab pytorch
+```{.python .input  n=7}
+%%tab pytorch
 from d2l import torch as d2l
 import torch
 from torch import nn
 
 def dropout_layer(X, dropout):
     assert 0 <= dropout <= 1
-    # In this case, all elements are dropped out
-    if dropout == 1:
-        return torch.zeros_like(X)
-    # In this case, all elements are kept
-    if dropout == 0:
-        return X
+    if dropout == 1: return torch.zeros_like(X)
     mask = (torch.rand(X.shape) > dropout).float()
     return mask * X / (1.0 - dropout)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 from d2l import tensorflow as d2l
 import tensorflow as tf
 
 def dropout_layer(X, dropout):
     assert 0 <= dropout <= 1
-    # In this case, all elements are dropped out
-    if dropout == 1:
-        return tf.zeros_like(X)
-    # In this case, all elements are kept
-    if dropout == 0:
-        return X
+    if dropout == 1: return tf.zeros_like(X)
     mask = tf.random.uniform(
         shape=tf.shape(X), minval=0, maxval=1) < 1 - dropout
     return tf.cast(mask, dtype=tf.float32) * X / (1.0 - dropout)
@@ -282,61 +199,17 @@ In the following lines of code,
 we pass our input `X` through the dropout operation,
 with probabilities 0, 0.5, and 1, respectively.
 
-```{.python .input}
-X = np.arange(16).reshape(2, 8)
-print(dropout_layer(X, 0))
-print(dropout_layer(X, 0.5))
-print(dropout_layer(X, 1))
-```
-
-```{.python .input}
-#@tab pytorch
-X= torch.arange(16, dtype = torch.float32).reshape((2, 8))
-print(X)
-print(dropout_layer(X, 0.))
-print(dropout_layer(X, 0.5))
-print(dropout_layer(X, 1.))
-```
-
-```{.python .input}
-#@tab tensorflow
-X = tf.reshape(tf.range(16, dtype=tf.float32), (2, 8))
-print(X)
-print(dropout_layer(X, 0.))
-print(dropout_layer(X, 0.5))
-print(dropout_layer(X, 1.))
-```
-
-### Defining Model Parameters
-
-Again, we work with the Fashion-MNIST dataset
-introduced in :numref:`sec_fashion_mnist`.
-We [**define an MLP with
-two hidden layers containing 256 units each.**]
-
-```{.python .input}
-num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256
-
-W1 = np.random.normal(scale=0.01, size=(num_inputs, num_hiddens1))
-b1 = np.zeros(num_hiddens1)
-W2 = np.random.normal(scale=0.01, size=(num_hiddens1, num_hiddens2))
-b2 = np.zeros(num_hiddens2)
-W3 = np.random.normal(scale=0.01, size=(num_hiddens2, num_outputs))
-b3 = np.zeros(num_outputs)
-
-params = [W1, b1, W2, b2, W3, b3]
-for param in params:
-    param.attach_grad()
-```
-
-```{.python .input}
-#@tab pytorch
-num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256
-```
-
-```{.python .input}
-#@tab tensorflow
-num_outputs, num_hiddens1, num_hiddens2 = 10, 256, 256
+```{.python .input  n=6}
+%%tab all
+if tab.selected('mxnet'):
+    X = np.arange(16).reshape(2, 8)
+if tab.selected('pytorch'):
+    X = torch.arange(16, dtype=torch.float32).reshape((2, 8))
+if tab.selected('tensorflow'):
+    X = tf.reshape(tf.range(16, dtype=tf.float32), (2, 8))
+print('dropout_p = 0:', dropout_layer(X, 0))
+print('dropout_p = 0.5:', dropout_layer(X, 0.5))
+print('dropout_p = 1:', dropout_layer(X, 1))
 ```
 
 ### Defining the Model
@@ -346,119 +219,91 @@ of each hidden layer (following the activation function).
 We can set dropout probabilities for each layer separately.
 A common trend is to set
 a lower dropout probability closer to the input layer.
-Below we set it to 0.2 and 0.5 for the first
-and second hidden layers, respectively.
 We ensure that dropout is only active during training.
 
 ```{.python .input}
-dropout1, dropout2 = 0.2, 0.5
-
-def net(X):
-    X = X.reshape(-1, num_inputs)
-    H1 = npx.relu(np.dot(X, W1) + b1)
-    # Use dropout only when training the model
-    if autograd.is_training():
-        # Add a dropout layer after the first fully connected layer
-        H1 = dropout_layer(H1, dropout1)
-    H2 = npx.relu(np.dot(H1, W2) + b2)
-    if autograd.is_training():
-        # Add a dropout layer after the second fully connected layer
-        H2 = dropout_layer(H2, dropout2)
-    return np.dot(H2, W3) + b3
+%%tab mxnet
+class DropoutMLPScratch(d2l.Classifier):
+    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
+                 dropout_1, dropout_2, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        self.lin1 = nn.Dense(num_hiddens_1, activation='relu')
+        self.lin2 = nn.Dense(num_hiddens_2, activation='relu')
+        self.lin3 = nn.Dense(num_outputs)
+        self.initialize()
+
+    def forward(self, X):
+        H1 = self.lin1(X)
+        if autograd.is_training():
+            H1 = dropout_layer(H1, self.dropout_1)
+        H2 = self.lin2(H1)
+        if autograd.is_training():
+            H2 = dropout_layer(H2, self.dropout_2)
+        return self.lin3(H2)
 ```
 
 ```{.python .input}
-#@tab pytorch
-dropout1, dropout2 = 0.2, 0.5
-
-class Net(nn.Module):
-    def __init__(self, num_inputs, num_outputs, num_hiddens1, num_hiddens2,
-                 is_training = True):
-        super(Net, self).__init__()
-        self.num_inputs = num_inputs
-        self.training = is_training
-        self.lin1 = nn.Linear(num_inputs, num_hiddens1)
-        self.lin2 = nn.Linear(num_hiddens1, num_hiddens2)
-        self.lin3 = nn.Linear(num_hiddens2, num_outputs)
+%%tab pytorch
+class DropoutMLPScratch(d2l.Classifier):
+    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
+                 dropout_1, dropout_2, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        self.lin1 = nn.LazyLinear(num_hiddens_1)
+        self.lin2 = nn.LazyLinear(num_hiddens_2)
+        self.lin3 = nn.LazyLinear(num_outputs)
         self.relu = nn.ReLU()
 
     def forward(self, X):
-        H1 = self.relu(self.lin1(X.reshape((-1, self.num_inputs))))
-        # Use dropout only when training the model
-        if self.training == True:
-            # Add a dropout layer after the first fully connected layer
-            H1 = dropout_layer(H1, dropout1)
+        H1 = self.relu(self.lin1(X.reshape((X.shape[0], -1))))
+        if self.training:
+            H1 = dropout_layer(H1, self.dropout_1)
         H2 = self.relu(self.lin2(H1))
-        if self.training == True:
-            # Add a dropout layer after the second fully connected layer
-            H2 = dropout_layer(H2, dropout2)
-        out = self.lin3(H2)
-        return out
-
-
-net = Net(num_inputs, num_outputs, num_hiddens1, num_hiddens2)
+        if self.training:
+            H2 = dropout_layer(H2, self.dropout_2)
+        return self.lin3(H2)
 ```
 
 ```{.python .input}
-#@tab tensorflow
-dropout1, dropout2 = 0.2, 0.5
-
-class Net(tf.keras.Model):
-    def __init__(self, num_outputs, num_hiddens1, num_hiddens2):
+%%tab tensorflow
+class DropoutMLPScratch(d2l.Classifier):
+    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
+                 dropout_1, dropout_2, lr):
         super().__init__()
-        self.input_layer = tf.keras.layers.Flatten()
-        self.hidden1 = tf.keras.layers.Dense(num_hiddens1, activation='relu')
-        self.hidden2 = tf.keras.layers.Dense(num_hiddens2, activation='relu')
-        self.output_layer = tf.keras.layers.Dense(num_outputs)
-
-    def call(self, inputs, training=None):
-        x = self.input_layer(inputs)
-        x = self.hidden1(x)
-        if training:
-            x = dropout_layer(x, dropout1)
-        x = self.hidden2(x)
-        if training:
-            x = dropout_layer(x, dropout2)
-        x = self.output_layer(x)
-        return x
-
-net = Net(num_outputs, num_hiddens1, num_hiddens2)
+        self.save_hyperparameters()
+        self.lin1 = tf.keras.layers.Dense(num_hiddens_1, activation='relu')
+        self.lin2 = tf.keras.layers.Dense(num_hiddens_2, activation='relu')
+        self.lin3 = tf.keras.layers.Dense(num_outputs)
+
+    def forward(self, X):
+        H1 = self.lin1(tf.reshape(X, (X.shape[0], -1)))
+        if self.training:
+            H1 = dropout_layer(H1, self.dropout_1)
+        H2 = self.lin2(H1)
+        if self.training:
+            H2 = dropout_layer(H2, self.dropout_2)
+        return self.lin3(H2)
 ```
 
-### [**Training and Testing**]
+### [**Training**]
 
-This is similar to the training and testing of MLPs described previously.
-
-```{.python .input}
-num_epochs, lr, batch_size = 10, 0.5, 256
-loss = gluon.loss.SoftmaxCrossEntropyLoss()
-train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs,
-              lambda batch_size: d2l.sgd(params, lr, batch_size))
-```
+The following is similar to the training of MLPs described previously.
 
 ```{.python .input}
-#@tab pytorch
-num_epochs, lr, batch_size = 10, 0.5, 256
-loss = nn.CrossEntropyLoss()
-train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
-trainer = torch.optim.SGD(net.parameters(), lr=lr)
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
-```
-
-```{.python .input}
-#@tab tensorflow
-num_epochs, lr, batch_size = 10, 0.5, 256
-loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
-train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
-trainer = tf.keras.optimizers.SGD(learning_rate=lr)
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
+%%tab all
+hparams = {'num_outputs': 10, 'num_hiddens_1': 256, 'num_hiddens_2': 256,
+           'dropout_1': 0.5, 'dropout_2': 0.5, 'lr': 0.1}
+model = DropoutMLPScratch(**hparams)
+data = d2l.FashionMNIST(batch_size=256)
+trainer = d2l.Trainer(max_epochs=10)
+trainer.fit(model, data)
 ```
 
 ## [**Concise Implementation**]
 
 With high-level APIs, all we need to do is add a `Dropout` layer
-after each fully-connected layer,
+after each fully connected layer,
 passing in the dropout probability
 as the only argument to its constructor.
 During training, the `Dropout` layer will randomly
@@ -469,68 +314,56 @@ When not in training mode,
 the `Dropout` layer simply passes the data through during testing.
 
 ```{.python .input}
-net = nn.Sequential()
-net.add(nn.Dense(256, activation="relu"),
-        # Add a dropout layer after the first fully connected layer
-        nn.Dropout(dropout1),
-        nn.Dense(256, activation="relu"),
-        # Add a dropout layer after the second fully connected layer
-        nn.Dropout(dropout2),
-        nn.Dense(10))
-net.initialize(init.Normal(sigma=0.01))
-```
-
-```{.python .input}
-#@tab pytorch
-net = nn.Sequential(nn.Flatten(),
-        nn.Linear(784, 256),
-        nn.ReLU(),
-        # Add a dropout layer after the first fully connected layer
-        nn.Dropout(dropout1),
-        nn.Linear(256, 256),
-        nn.ReLU(),
-        # Add a dropout layer after the second fully connected layer
-        nn.Dropout(dropout2),
-        nn.Linear(256, 10))
-
-def init_weights(m):
-    if type(m) == nn.Linear:
-        nn.init.normal_(m.weight, std=0.01)
-
-net.apply(init_weights);
+%%tab mxnet
+class DropoutMLP(d2l.Classifier):
+    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
+                 dropout_1, dropout_2, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        self.net = nn.Sequential()
+        self.net.add(nn.Dense(num_hiddens_1, activation="relu"),
+                     nn.Dropout(dropout_1),
+                     nn.Dense(num_hiddens_2, activation="relu"),
+                     nn.Dropout(dropout_2),
+                     nn.Dense(num_outputs))
+        self.net.initialize()
 ```
 
 ```{.python .input}
-#@tab tensorflow
-net = tf.keras.models.Sequential([
-    tf.keras.layers.Flatten(),
-    tf.keras.layers.Dense(256, activation=tf.nn.relu),
-    # Add a dropout layer after the first fully connected layer
-    tf.keras.layers.Dropout(dropout1),
-    tf.keras.layers.Dense(256, activation=tf.nn.relu),
-    # Add a dropout layer after the second fully connected layer
-    tf.keras.layers.Dropout(dropout2),
-    tf.keras.layers.Dense(10),
-])
+%%tab pytorch
+class DropoutMLP(d2l.Classifier):
+    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
+                 dropout_1, dropout_2, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        self.net = nn.Sequential(
+            nn.Flatten(), nn.LazyLinear(num_hiddens_1), nn.ReLU(),
+            nn.Dropout(dropout_1), nn.LazyLinear(num_hiddens_2), nn.ReLU(),
+            nn.Dropout(dropout_2), nn.LazyLinear(num_outputs))
 ```
 
-Next, we [**train and test the model**].
-
 ```{.python .input}
-trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
+%%tab tensorflow
+class DropoutMLP(d2l.Classifier):
+    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
+                 dropout_1, dropout_2, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        self.net = tf.keras.models.Sequential([
+            tf.keras.layers.Flatten(),
+            tf.keras.layers.Dense(num_hiddens_1, activation=tf.nn.relu),
+            tf.keras.layers.Dropout(dropout_1),
+            tf.keras.layers.Dense(num_hiddens_2, activation=tf.nn.relu),
+            tf.keras.layers.Dropout(dropout_2),
+            tf.keras.layers.Dense(num_outputs)])
 ```
 
-```{.python .input}
-#@tab pytorch
-trainer = torch.optim.SGD(net.parameters(), lr=lr)
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
-```
+Next, we [**train the model**].
 
 ```{.python .input}
-#@tab tensorflow
-trainer = tf.keras.optimizers.SGD(learning_rate=lr)
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
+%%tab all
+model = DropoutMLP(**hparams)
+trainer.fit(model, data)
 ```
 
 ## Summary
diff --git a/chapter_multilayer-perceptrons/environment.md b/chapter_multilayer-perceptrons/environment.md
deleted file mode 100644
index c3bdbf6..0000000
--- a/chapter_multilayer-perceptrons/environment.md
+++ /dev/null
@@ -1,251 +0,0 @@
-# 環境と流通シフト
-
-これまでのセクションでは、さまざまなデータセットにモデルを適合させながら、機械学習の実践的なアプリケーションを数多く取り上げました。それでも、そもそもデータがどこから来るのか、それともモデルからの出力を最終的にどう処理するのかを考えるのをやめませんでした。データを所有している機械学習の開発者は、こうした根本的な問題を考えるために立ち止まることなくモデルの開発を急ぐことが多々あります。 
-
-失敗した機械学習の導入の多くは、このパターンにさかのぼることができます。モデルは、テストセットの精度で測定すると驚異的なパフォーマンスを発揮しているように見えますが、データの分布が突然変化すると、展開時に壊滅的に失敗することがあります。もっと狡猾に言えば、モデルの展開そのものがデータ分布を混乱させる触媒になることもあります。たとえば、ローンの返済者と債務不履行を予測するモデルをトレーニングしたところ、申請者の履物の選択が債務不履行のリスクと関連していることが判明したとします (オックスフォードは返済を示し、スニーカーはデフォルトを示します)。その後、オックスフォードを着用しているすべての応募者に融資を行い、スニーカーを着用しているすべての応募者を拒否する傾向があるかもしれません。. 
-
-この場合、パターン認識から意思決定への慎重な飛躍や、環境を批判的に考慮しなかったことは、悲惨な結果をもたらす可能性があります。手始めに、私たちが履物に基づいて意思決定を始めるとすぐに、顧客は自分の行動に追いつき、変化しました。やがて、すべての応募者はオックスフォードを着用し、信用力が同時に向上することはありません。機械学習の多くのアプリケーションには同様の問題がたくさんあるため、この点を理解してください。モデルベースの意思決定を環境に導入すると、モデルが壊れる可能性があります。 
-
-これらのトピックを1つのセクションで完全に扱うことはできませんが、ここでは共通の懸念を明らかにし、これらの状況を早期に発見し、被害を軽減し、責任を持って機械学習を使用するために必要な批判的思考を刺激することを目指しています。解決策の中には、単純なもの（「正しい」データを求める）もの、技術的に難しいもの（強化学習システムを実装すること）、統計的予測の領域から完全に踏み出して、その倫理的応用に関する難しい哲学的疑問に取り組むことが求められるものもあります。アルゴリズム。 
-
-## 流通シフトの種類
-
-まず、データ分布が変化する可能性のあるさまざまな方法と、モデルのパフォーマンスを改善するために何ができるかを考慮して、パッシブ予測の設定に固執します。ある古典的な設定では、トレーニングデータは分布 $p_S(\mathbf{x},y)$ からサンプリングされたものの、検定データは異なる分布 $p_T(\mathbf{x},y)$ から抽出されたラベルのない例で構成されると仮定します。すでに、私たちは冷静な現実に立ち向かわなければなりません。$p_S$ と $p_T$ の相互関係についての仮定がなければ、ロバストな分類器を学習することは不可能です。 
-
-犬と猫を区別したい二項分類問題を考えてみましょう。分布が任意の方法でシフトできる場合、入力に対する分布は一定のままである病理学的ケース $p_S(\mathbf{x}) = p_T(\mathbf{x})$ を許可しますが、ラベルはすべて反転します ($p_S(y | \mathbf{x}) = 1 - p_T(y | \mathbf{x})$)。言い換えれば、将来、すべての「猫」が犬になり、以前「犬」と呼ばれていたものが今や猫であると神が突然決定されるなら、入力$p(\mathbf{x})$の分布に変化がなければ、この設定を分布がまったく変化しなかった設定と区別できないでしょう。 
-
-幸いなことに、データが将来どのように変化するかについての制限された仮定の下で、原理アルゴリズムはシフトを検出し、場合によってはオンザフライで適応し、元の分類器の精度を向上させることができます。 
-
-### 共変量シフト
-
-分布シフトのカテゴリの中でも、共変量シフトが最も広く研究されている可能性があります。ここでは、入力の分布は時間とともに変化する可能性がありますが、ラベル付け関数、つまり条件付き分布 $P(y \mid \mathbf{x})$ は変化しないと仮定します。共変量 (特徴) の分布のシフトにより問題が生じるため、統計学者はこれを*共変量シフト*と呼んでいます。因果関係を呼び出さずに分布シフトについて推論できる場合もありますが、$\mathbf{x}$ が $y$ を引き起こすと考えられる設定では、共変量シフトが自然な仮定であることに注意してください。 
-
-猫と犬を区別するという課題を考えてみましょう。私たちのトレーニングデータは :numref:`fig_cat-dog-train` の種類のイメージで構成されているかもしれません。 
-
-![Training data for distinguishing cats and dogs.](../img/cat-dog-train.svg)
-:label:`fig_cat-dog-train`
-
-テスト時には :numref:`fig_cat-dog-test` で画像を分類するよう求められます。 
-
-![Test data for distinguishing cats and dogs.](../img/cat-dog-test.svg)
-:label:`fig_cat-dog-test`
-
-トレーニングセットは写真で構成され、テストセットには漫画のみが含まれます。テストセットとは大幅に異なる特性を持つデータセットでトレーニングを行うと、新しい領域への適応方法に関する一貫した計画がないと、問題を引き起こす可能性があります。 
-
-### ラベルシフト
-
-*ラベル shift* は逆問題を表します。
-ここでは、ラベル marginal $P(y)$ は変更できるが、クラス条件付き分布 $P(\mathbf{x} \mid y)$ はドメイン間で固定されたままであると仮定します。$y$ が $\mathbf{x}$ を引き起こすと考えられる場合、ラベルシフトは妥当な仮定です。たとえば、診断の相対的な有病率が時間とともに変化している場合でも、症状 (または他の症状) を考慮して診断を予測したい場合があります。病気は症状を引き起こすため、ここではラベルシフトが適切な仮定です。縮退したケースでは、ラベルシフトと共変量シフトの仮定が同時に成立することがあります。たとえば、ラベルが決定論的である場合、$y$ が $\mathbf{x}$ を引き起こしても、共変量シフトの仮定は満たされます。興味深いことに、このような場合、ラベルシフトの仮定から流れ出るメソッドを使用する方が有利な場合がよくあります。これは、ディープラーニングでは高次元になりがちな入力のように見えるオブジェクトとは対照的に、これらのメソッドではラベルのように見えるオブジェクト (通常は低次元) を操作する傾向があるためです。 
-
-### コンセプトシフト
-
-また、ラベルの定義そのものが変わる可能性がある場合に発生する「コンセプトシフト」という関連する問題に遭遇することもあります。これは奇妙に聞こえる-*猫*は*猫*だ、いや？ただし、他のカテゴリは時間の経過とともに使用量が変化することがあります。精神疾患の診断基準、ファッショナブルなもの、役職はすべて、かなりの量の概念シフトの対象となります。:numref:`fig_popvssoda`に示すように、データのソースを地理的にシフトして米国内を移動すると、*ソフトドリンク*の名前の分布に関してかなりの概念シフトが見られることが分かります。 
-
-![Concept shift on soft drink names in the United States.](../img/popvssoda.png)
-:width:`400px`
-:label:`fig_popvssoda`
-
-機械翻訳システムを構築する場合、$P(y \mid \mathbf{x})$ の配布は、所在地によって異なる可能性があります。この問題は見つけにくい場合があります。シフトは時間的または地理的な意味で徐々にしか起こらないという知識を活用したいと思うかもしれません。 
-
-## 流通シフトの例
-
-形式主義とアルゴリズムを掘り下げる前に、共変量や概念シフトが明らかではないかもしれない具体的な状況について議論することができます。 
-
-### 医療診断
-
-がんを検出するアルゴリズムを設計するとします。健康な人や病気の人からデータを収集し、アルゴリズムをトレーニングします。それはうまく機能し、高い精度を提供し、医療診断のキャリアを成功させる準備ができていると結論付けます。
-*そんなに早くない*
-
-トレーニングデータを生成した分布と、実際に遭遇する分布は大きく異なる可能性があります。これは、私たち（作家）の何人かが何年も前に働いていた不幸なスタートアップに起こりました。彼らは、主に高齢男性に影響を及ぼす病気の血液検査を開発しており、患者から採取した血液サンプルを用いて血液検査を研究したいと考えていました。しかし、健康な男性から血液サンプルを採取することは、すでにシステム内にいる病気の患者よりもかなり困難です。これを補うために、スタートアップは大学のキャンパスの学生から献血を募り、テストの開発における健康的なコントロールとして機能しました。次に、病気を検出するための分類器を構築するのを手伝ってもらえないかと尋ねました。 
-
-彼らに説明したように、健康なコホートと病気のコホートをほぼ完璧な精度で区別するのは確かに簡単です。ただし、これは、被験者の年齢、ホルモンレベル、身体活動、食事、アルコール消費量、および病気に関係のない多くの要因が異なるためです。これは実際の患者には当てはまりそうにありませんでした。それらのサンプリング手順により、極端な共変量シフトが発生することが予想されます。さらに、このケースは従来の方法で修正できる可能性は低かった。要するに、彼らはかなりの金額を浪費した。 
-
-### 自動運転車
-
-ある企業が、自動運転車の開発に機械学習を活用したいと考えていたとします。ここで重要なコンポーネントの1つは、路側検出器です。実際のアノテーション付きデータは入手にコストがかかるため、ゲームレンダリングエンジンからの合成データを追加のトレーニングデータとして使用するという (賢明で疑わしい) アイデアがありました。これは、レンダリングエンジンから引き出された「テストデータ」に対して非常にうまく機能しました。悲しいかな、実際の車の中では災害でした。結局のところ、道端は非常にシンプルなテクスチャでレンダリングされていました。さらに重要なのは、路側が*すべて*同じ*テクスチャでレンダリングされ、路側検出器がこの「特徴」について非常に迅速に認識したことです。 
-
-米軍が森林内の戦車を最初に検出しようとしたときにも同様のことが起こりました。彼らは戦車なしで森の空中写真を撮り、戦車を森に追い込み、別の写真を撮りました。分類器は*完全に*機能しているように見えました。残念ながら、影のある木と影のない木を区別する方法を学んだだけでした。最初の写真は早朝、2番目の写真は正午に撮影されました。 
-
-### 非定常分布
-
-分布の変化が遅く (*非定常分布* とも呼ばれる)、モデルが適切に更新されない場合は、さらに微妙な状況が発生します。以下は代表的なケースです。 
-
-* 私たちはコンピュテーショナル広告モデルをトレーニングし、それを頻繁に更新することに失敗しています (例えば、iPad と呼ばれるあいまいな新しいデバイスが発売されたばかりであることを組み込むのを忘れたなど)。
-* スパムフィルターを構築します。これは、これまでに見たすべてのスパムを検出するのに適しています。しかし、その後、スパマーは賢明になり、これまでに見たことのない新しいメッセージを作成します。
-* 製品レコメンデーションシステムを構築します。冬の間は機能しますが、クリスマスの後もずっとサンタの帽子を推薦し続けます。
-
-### その他の逸話
-
-* 顔検出器を作ります。すべてのベンチマークでうまく機能します。残念ながら、テストデータでは失敗します。問題のある例は、顔が画像全体を塗りつぶすクローズアップです（トレーニングセットにはそのようなデータはありませんでした）。
-* 米国市場向けのWeb検索エンジンを構築し、英国に展開したいと考えています。
-* 画像分類器を学習させるには、大規模なデータセットをコンパイルします。このデータセットでは、多数のクラスの各クラスが 1000 個のカテゴリをデータセット内で等しく表し、それぞれ 1000 個の画像で表されます。次に、写真の実際のラベル配布が明らかに不均一である現実世界にシステムを展開します。
-
-## 分配シフトの訂正
-
-すでに説明したように、学習分布と検定分布 $P(\mathbf{x}, y)$ が異なるケースが多くあります。場合によっては、共変量、ラベル、または概念のシフトにもかかわらず、ラッキーになり、モデルが機能します。それ以外の場合は、シフトに対処するための原則的な戦略を採用することで、より良い結果を得ることができます。このセクションの残りの部分では、より技術的な内容が大きくなります。この資料は後の概念の前提条件ではないため、せっかちな読者は次のセクションに進むことができます。 
-
-### 経験的リスクとリスク
-:label:`subsec_empirical-risk-and-risk`
-
-まず、モデルトレーニング中に何が起きているのかを考えてみましょう。トレーニングデータ $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ の特徴と関連ラベルを反復処理し、ミニバッチごとにモデル $f$ のパラメーターを更新します。簡単にするために、正則化は考慮しないため、トレーニングの損失を大幅に最小限に抑えます。 
-
-$$\mathop{\mathrm{minimize}}_f \frac{1}{n} \sum_{i=1}^n l(f(\mathbf{x}_i), y_i),$$
-:eqlabel:`eq_empirical-risk-min`
-
-$l$ は、予測の $f(\mathbf{x}_i)$ が関連付けられたラベル $y_i$ が「どれほど悪い」かを測定する損失関数です。統計学者は:eqref:`eq_empirical-risk-min`でこの用語を「経験的リスク」と呼んでいます。*経験的リスク* は、*risk* を近似するためのトレーニングデータの平均損失です。これは、真の分布 $p(\mathbf{x},y)$ から引き出されたデータの母集団全体における損失の予想値です。 
-
-$$E_{p(\mathbf{x}, y)} [l(f(\mathbf{x}), y)] = \int\int l(f(\mathbf{x}), y) p(\mathbf{x}, y) \;d\mathbf{x}dy.$$
-:eqlabel:`eq_true-risk`
-
-ただし、実際には、通常、データの母集団全体を取得することはできません。したがって、:eqref:`eq_empirical-risk-min` の経験的リスクを最小化する*経験的リスク最小化*は、リスクを近似的に最小化することを期待して、機械学習の実用的な戦略です。 
-
-### 共変量シフト補正
-:label:`subsec_covariate-shift-correction`
-
-データ $(\mathbf{x}_i, y_i)$ とラベル付けした依存関係 $P(y \mid \mathbf{x})$ を推定するとします。残念ながら、観測値 $\mathbf{x}_i$ は、*ターゲット分布* $p(\mathbf{x})$ ではなく、一部の*ソース分布* $q(\mathbf{x})$ から抽出されています。幸いなことに、依存関係の仮定は条件付き分布が変わらないことを意味します ($p(y \mid \mathbf{x}) = q(y \mid \mathbf{x})$)。ソースディストリビューション $q(\mathbf{x})$ が「間違っている」場合、リスクに以下の単純な ID を使用することで修正できます。 
-
-$$
-\begin{aligned}
-\int\int l(f(\mathbf{x}), y) p(y \mid \mathbf{x})p(\mathbf{x}) \;d\mathbf{x}dy =
-\int\int l(f(\mathbf{x}), y) q(y \mid \mathbf{x})q(\mathbf{x})\frac{p(\mathbf{x})}{q(\mathbf{x})} \;d\mathbf{x}dy.
-\end{aligned}
-$$
-
-言い換えると、正しい分布から導き出される確率と間違った分布から引き出される確率の比率によって、各データ例を比較検討する必要があります。 
-
-$$\beta_i \stackrel{\mathrm{def}}{=} \frac{p(\mathbf{x}_i)}{q(\mathbf{x}_i)}.$$
-
-各データ例$(\mathbf{x}_i, y_i)$の重み$\beta_i$を差し込むと、以下のようにモデルをトレーニングできます。
-*加重経験的リスク最小化*:
-
-$$\mathop{\mathrm{minimize}}_f \frac{1}{n} \sum_{i=1}^n \beta_i l(f(\mathbf{x}_i), y_i).$$
-:eqlabel:`eq_weighted-empirical-risk-min`
-
-悲しいかな、私たちはその比率を知らないので、何か役に立つことをする前に、それを推定する必要があります。最小ノルムまたは最大エントロピーの原理を使用して期待演算子を直接再調整しようとする、いくつかの空想的な演算子理論的アプローチを含む、多くの方法が利用可能です。このようなアプローチでは、テストデータへのアクセスなどによる「真の」$p$ と、トレーニングセット $q$ (後者は簡単に使用可能) の生成に使用される両方の分布から抽出された標本が必要であることに注意してください。ただし、必要なのは機能 $\mathbf{x} \sim p(\mathbf{x})$ だけであり、ラベル $y \sim p(y)$ にアクセスする必要はないことに注意してください。 
-
-この場合、元のものとほぼ同じ結果が得られる非常に効果的なアプローチが存在します。ロジスティック回帰は、バイナリ分類のソフトマックス回帰 (:numref:`sec_softmax` を参照) の特殊なケースです。推定確率比を計算するのに必要なのはこれだけです。$p(\mathbf{x})$ から引き出されたデータと $q(\mathbf{x})$ から引き出されたデータを区別する分類器を学習します。2 つのディストリビューションを区別できない場合は、関連付けられているインスタンスが 2 つのディストリビューションのいずれかに由来する可能性が等しいことを意味します。一方、十分に識別できるインスタンスは、それに応じて大幅に過大評価または過小評価する必要があります。 
-
-簡単にするために、両方のディストリビューション $p(\mathbf{x})$ と $q(\mathbf{x})$ のインスタンス数が同じであると仮定します。$z$ ラベルで表します。ラベルは $p$ から抽出されたデータでは $1$、$q$ から抽出されたデータでは $-1$ になります。次に、混合データセットの確率は次の式で与えられます。 
-
-$$P(z=1 \mid \mathbf{x}) = \frac{p(\mathbf{x})}{p(\mathbf{x})+q(\mathbf{x})} \text{ and hence } \frac{P(z=1 \mid \mathbf{x})}{P(z=-1 \mid \mathbf{x})} = \frac{p(\mathbf{x})}{q(\mathbf{x})}.$$
-
-したがって、$P(z=1 \mid \mathbf{x})=\frac{1}{1+\exp(-h(\mathbf{x}))}$ ($h$ はパラメーター化された関数) というロジスティック回帰アプローチを使用すると、次のようになります。 
-
-$$
-\beta_i = \frac{1/(1 + \exp(-h(\mathbf{x}_i)))}{\exp(-h(\mathbf{x}_i))/(1 + \exp(-h(\mathbf{x}_i)))} = \exp(h(\mathbf{x}_i)).
-$$
-
-その結果、2 つの問題を解く必要があります。1 つ目は両方の分布から抽出されたデータを区別するための問題で、次に :eqref:`eq_weighted-empirical-risk-min` の加重経験的リスク最小化問題で、項を $\beta_i$ で計量します。 
-
-これで、補正アルゴリズムを説明する準備が整いました。学習セット $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ とラベルのないテストセット $\{\mathbf{u}_1, \ldots, \mathbf{u}_m\}$ があるとします。共変量シフトでは、すべての $1 \leq i \leq n$ の $\mathbf{x}_i$ が何らかのソース分布から抽出され、$1 \leq i \leq m$ の $\mathbf{u}_i$ がターゲット分布から抽出されると仮定します。共変量シフトを補正するための典型的なアルゴリズムを次に示します。 
-
-1. バイナリ分類学習セット $\{(\mathbf{x}_1, -1), \ldots, (\mathbf{x}_n, -1), (\mathbf{u}_1, 1), \ldots, (\mathbf{u}_m, 1)\}$ を生成します。
-1. ロジスティック回帰を使用してバイナリ分類器に学習をさせ、関数 $h$ を取得します。
-1. 定数 $c$ に対して $\beta_i = \exp(h(\mathbf{x}_i))$ 以上の $\beta_i = \min(\exp(h(\mathbf{x}_i)), c)$ を使用してトレーニングデータを重み付けします。
-1. :eqref:`eq_weighted-empirical-risk-min` の $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ のトレーニングには、ウェイト $\beta_i$ を使用してください。
-
-上記のアルゴリズムは決定的な前提に基づいていることに注意してください。このスキームを機能させるには、ターゲット (検定時間など) 分布の各データ例が学習時に発生する確率がゼロでないことが必要です。$p(\mathbf{x}) > 0$ だが $q(\mathbf{x}) = 0$ の点が見つかると、対応する重要度の重みは無限大になります。 
-
-### ラベルシフト補正
-
-$k$ カテゴリの分類タスクを扱っていると仮定します。:numref:`subsec_covariate-shift-correction`、$q$、$p$ で同じ表記法を使用すると、それぞれソース分布 (トレーニング時間など) とターゲット分布 (テスト時間など) になります。ラベルの分布は時間の経過とともに変化すると仮定します。$q(y) \neq p(y)$ ですが、クラス条件付き分布は同じ $q(\mathbf{x} \mid y)=p(\mathbf{x} \mid y)$ のままです。ソースディストリビューション $q(y)$ が「間違っている」場合、:eqref:`eq_true-risk` で定義されているリスク内の次の識別情報に従って修正できます。 
-
-$$
-\begin{aligned}
-\int\int l(f(\mathbf{x}), y) p(\mathbf{x} \mid y)p(y) \;d\mathbf{x}dy =
-\int\int l(f(\mathbf{x}), y) q(\mathbf{x} \mid y)q(y)\frac{p(y)}{q(y)} \;d\mathbf{x}dy.
-\end{aligned}
-$$
-
-ここで、重要度の重みはラベルの尤度比に対応します。 
-
-$$\beta_i \stackrel{\mathrm{def}}{=} \frac{p(y_i)}{q(y_i)}.$$
-
-ラベルシフトの良い点の 1 つは、ソース分布にかなり良いモデルがあれば、周囲の次元を処理しなくてもこれらの重みを一貫して推定できることです。ディープラーニングでは、入力は画像のような高次元のオブジェクトになりがちですが、ラベルはカテゴリなどの単純なオブジェクトであることがよくあります。 
-
-ターゲットラベルの分布を推定するには、まず適度に優れた市販の分類器 (通常は学習データで学習済み) を使用し、検証セット (これも学習分布から) を使用して混同行列を計算します。*混同行列* $\mathbf{C}$ は単に $k \times k$ 行列で、各列はラベルカテゴリ (グラウンドトゥルース) に対応し、各行はモデルの予測カテゴリに対応します。各セルの値 $c_{ij}$ は、真のラベルが $j$ で、モデルが $i$ を予測した検証セットでの予測合計の比率です。 
-
-複雑なリアルタイムアノテーションパイプラインに投資しない限り、実際に見られる例のラベルを見ることができないため、ターゲットデータの混同行列を直接計算することはできません。ただし、可能なことは、テスト時にすべてのモデル予測を平均して、平均モデル出力 $\mu(\hat{\mathbf{y}}) \in \mathbb{R}^k$ を生成することです。$i^\mathrm{th}$ 要素 $\mu(\hat{y}_i)$ は、モデルが $i$ を予測したテストセットでの予測全体の比率です。 
-
-いくつかの穏やかな条件下で、分類器がそもそも適度に正確で、ターゲットデータに以前に見たカテゴリのみが含まれていて、ラベルシフトの仮定がそもそも当てはまる場合 (ここで最も強い仮定)、テストセットのラベルを推定できることがわかりました。単純な線形システムを解くことによる分布 
-
-$$\mathbf{C} p(\mathbf{y}) = \mu(\hat{\mathbf{y}}),$$
-
-$p(y_j)$ は $k$ 次元のラベル分布ベクトル $p(\mathbf{y})$ の $j^\mathrm{th}$ 要素であるため、$\sum_{j=1}^k c_{ij} p(y_j) = \mu(\hat{y}_i)$ は推定値としてすべての $1 \leq i \leq k$ に当てはまるためです。分類器が最初から十分に正確であれば、混同行列 $\mathbf{C}$ は可逆になり、解 $p(\mathbf{y}) = \mathbf{C}^{-1} \mu(\hat{\mathbf{y}})$ が得られます。 
-
-ソースデータのラベルが観察されるため、分布 $q(y)$ を推定するのは簡単です。ラベルが $y_i$ のトレーニング例 $i$ について、推定した $p(y_i)/q(y_i)$ の比率を使用して重量 $\beta_i$ を計算し、これを :eqref:`eq_weighted-empirical-risk-min` の加重経験的リスク最小化にプラグインできます。 
-
-### コンセプトシフト補正
-
-コンセプトシフトは、原則的に修正するのがはるかに困難です。例えば、猫と犬を区別することから、白と黒の区別に問題が突然変わる状況では、新しいラベルを集めてゼロから訓練するよりもずっと良いことができると考えるのは無理です。幸いなことに、実際には、このような極端なシフトはまれです。代わりに、通常、タスクがゆっくりと変化し続けることが起こります。物事をより具体的にするために、いくつかの例を挙げます。 
-
-* コンピュテーショナル広告では、新製品が発売され、
-古い製品はあまり人気がなくなります。つまり、広告の分布とその人気は徐々に変化し、クリック率の予測因子もそれに伴って徐々に変化する必要があります。
-* 交通カメラのレンズは環境摩耗により徐々に劣化し、画質に徐々に影響を与えます。
-* ニュースの内容は徐々に変化する（ニュースのほとんどは変わらないが、新しい記事が出てくる）。
-
-このような場合、ネットワークの学習に使用したのと同じアプローチを使用して、ネットワークをデータの変化に適応させることができます。つまり、ゼロから学習させるのではなく、既存のネットワークの重みを使用し、新しいデータでいくつかの更新ステップを実行するだけです。 
-
-## 学習問題の分類学
-
-分布の変化にどう対処するかについての知識をもって、機械学習の問題の定式化に関する他の側面についても検討できるようになりました。 
-
-### バッチ学習
-
-*バッチ学習* では、モデル $f(\mathbf{x})$ のトレーニングに使用するトレーニング機能とラベル $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ にアクセスできます。その後、このモデルを展開して、同じ分布から抽出された新しいデータ $(\mathbf{x}, y)$ をスコアリングします。これは、ここで説明するすべての問題に対する既定の前提です。たとえば、猫と犬の写真をたくさん使って猫検出器をトレーニングするとします。トレーニングが完了すると、猫だけが入ることができるスマートキャットドアコンピュータビジョンシステムの一部として出荷されます。その後、これは顧客の自宅に設置され、（極端な状況を除いて）再度更新されることはありません。 
-
-### オンライン学習
-
-ここで、データ $(\mathbf{x}_i, y_i)$ が一度に 1 つのサンプルに到達するとします。より具体的には、最初に$\mathbf{x}_i$を観察し、次に推定$f(\mathbf{x}_i)$を考え出す必要があると仮定します。これを行うと、$y_i$を観察し、決定を下すと報酬を受け取るか損失を被ります。多くの実際の問題がこのカテゴリに分類されます。たとえば、明日の株価を予測する必要があります。これにより、その見積もりに基づいて取引できるようになり、1日の終わりに、見積もりによって利益が得られたかどうかがわかります。言い換えれば、*オンライン学習*では、新しい観察からモデルを継続的に改善しているという次のサイクルがあります。 
-
-$$
-\mathrm{model} ~ f_t \longrightarrow
-\mathrm{data} ~ \mathbf{x}_t \longrightarrow
-\mathrm{estimate} ~ f_t(\mathbf{x}_t) \longrightarrow
-\mathrm{observation} ~ y_t \longrightarrow
-\mathrm{loss} ~ l(y_t, f_t(\mathbf{x}_t)) \longrightarrow
-\mathrm{model} ~ f_{t+1}
-$$
-
-### バンディッツ
-
-*Bandits*は上記の問題の特殊なケースです。ほとんどの学習問題には、パラメーターを学習したい連続パラメーター化関数 $f$ がありますが (例:ディープネットワーク)、*bandit* 問題では、引き出すことができる腕の数は有限です。つまり、実行できるアクションの数は有限です。この単純な問題に対して、最適性の観点からより強力な理論的保証が得られることはそれほど驚くべきことではない。この問題はしばしば（紛らわしい）明確な学習環境であるかのように扱われるため、主にリストアップします。
-
-### コントロール
-
-多くの場合、環境は私たちが行ったことを記憶しています。必ずしも敵対的な方法ではありませんが、記憶するだけで、反応は以前に起こったことに依存します。たとえば、コーヒーボイラーコントローラーは、以前にボイラーを加熱していたかどうかによって異なる温度を観測します。PID (比例-積分-微分) コントローラーアルゴリズムは一般的な選択肢です。同様に、ニュースサイトでのユーザーの行動は、以前に表示した内容によって異なります（例えば、ほとんどのニュースを一度だけ読むなど）。このようなアルゴリズムの多くは、意思決定のランダム性を低くするように動作する環境のモデルを形成しています。最近では、制御理論 (PID バリアントなど) もハイパーパラメーターを自動的に調整し、よりよいもつれと再構成品質を実現し、生成テキストの多様性と生成画像の再構成品質を向上させるためにも使用されている :cite:`Shao.Yao.Sun.ea.2020`。 
-
-### 強化学習
-
-メモリのある環境のより一般的なケースでは、その環境が私たちと協力しようとしている状況（特にゼロサム以外のゲームのための協力ゲーム）や、環境が勝とうとする状況に遭遇することがあります。*強化学習*には、チェス、ゴー、バックギャモン、スタークラフトなどがあります。同様に、自動運転車用の優れたコントローラーを構築したいと思うかもしれません。他の車は、回避しようとする、事故を起こそうとする、協力しようとする、など、自明ではない方法で自動運転車の運転スタイルに反応する可能性が高い。 
-
-### 環境を考える
-
-上記のさまざまな状況の大きな違いの 1 つは、静止した環境の場合にずっと有効であったのと同じ戦略が、環境が適応できるときには機能しない可能性があるということです。たとえば、トレーダーが発見した裁定取引機会は、その機会を悪用し始めると消滅する可能性があります。環境が変化する速度と方法によって、私たちが耐えられるアルゴリズムの種類が大きく決まります。たとえば、物事がゆっくりとしか変化しないことがわかっている場合は、見積もりをゆっくりとしか変化させないようにすることもできます。環境が瞬時に変化する可能性があるが、ごくまれにしか変化しないことがわかっている場合は、それを考慮に入れることができます。このような知識は、意欲的なデータサイエンティストがコンセプトシフト、つまり解決しようとしている問題が時間とともに変化するときに対処するために不可欠です。 
-
-## 機械学習における公平性、説明責任、透明性
-
-最後に、機械学習システムを導入する際には、単に予測モデルを最適化するだけではなく、意思決定を (部分的または完全に) 自動化するためのツールを提供することになるということを覚えておくことが重要です。これらの技術システムは、結果として生じる決定の対象となる個人の生活に影響を与える可能性があります。予測の検討から意思決定への飛躍は、新しい技術的な問題だけでなく、慎重に検討しなければならない多くの倫理的問題も提起します。医療診断システムを導入する場合、どの集団に対してそれが機能し、どの集団が機能しないかを知る必要があります。亜集団の福祉に対する予測可能なリスクを見落とすと、私たちは劣悪なケアを受ける可能性があります。さらに、いったん意思決定システムを熟考したら、一歩下がってテクノロジーの評価方法を見直さなければなりません。この範囲の変更による他の結果の中でも、*正確さ*が適切な尺度になることはめったにありません。たとえば、予測をアクションに変換する場合、エラーがもたらす潜在的なコスト感受性をさまざまな方法で考慮することがよくあります。画像を誤分類する1つの方法が人種的な手先として認識され、別のカテゴリへの誤分類が無害である場合は、意思決定プロトコルの設計における社会的価値を考慮して、それに応じてしきい値を調整する必要があります。また、予測システムがどのようにフィードバックループにつながるかについても注意が必要です。たとえば、犯罪が予測されるエリアに巡回担当者を割り当てる予測ポリシングシステムを考えてみましょう。心配なパターンがどのように現れるかは簡単にわかります。 
-
- 1. 犯罪の多い地域では、パトロールが増えます。
- 1. その結果、これらの近傍でより多くの犯罪が発見され、今後の反復に使用できるトレーニングデータが入力されます。
- 1. より多くの陽性にさらされたこのモデルは、これらの近傍でさらに多くの犯罪を予測しています。
- 1. 次のイテレーションでは、更新されたモデルが同じ近傍をさらにターゲットにしているため、さらに多くの犯罪が発見されるなどです。
-
-多くの場合、モデルの予測がトレーニングデータに結合されるさまざまなメカニズムは、モデリングプロセスでは考慮されません。これは、研究者が「暴走フィードバックループ」と呼ぶものにつながる可能性があります。また、そもそも正しい問題に取り組んでいるかどうかについても注意が必要です。現在、予測アルゴリズムは、情報の伝達を仲介する上で非常に大きな役割を果たしています。個人が遭遇するニュースは、彼らが*いいね*したFacebookページのセットによって決定されるべきですか？これらは、機械学習のキャリアで遭遇する可能性のある、差し迫った倫理的ジレンマのほんの一部です。 
-
-## [概要
-
-* 多くの場合、学習セットとテストセットは同じ分布に由来しません。これを配分シフトといいます。
-* リスクとは、真の分布から引き出されたデータの母集団全体に対する損失の予測です。ただし、この全人口は通常利用できません。経験的リスクは、リスクを近似するためのトレーニングデータに対する平均損失です。実際には、経験的リスク最小化を行っています。
-* 対応する仮定の下で、共変量とラベルシフトを検出し、検定時に補正できます。このバイアスを考慮しないと、テスト時に問題が生じる可能性があります。
-* 場合によっては、環境が自動化されたアクションを記憶し、意外な方法で反応することがあります。モデルを構築する際には、この可能性を考慮し、モデルと環境が予期せぬ方法で絡み合う可能性を考慮して、ライブシステムを監視し続ける必要があります。
-
-## 演習
-
-1. 検索エンジンの動作を変更するとどうなるでしょうか？ユーザーは何をしますか？広告主はどうですか？
-1. 共変量シフト検出器を実装します。ヒント:分類器を構築します。
-1. 共変量シフト補正器を実装します。
-1. 分配シフト以外に、経験的リスクがどのようにリスクに近似するかに影響する可能性のあるものは他にありますか？
-
-[Discussions](https://discuss.d2l.ai/t/105)
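
The label shift correction described in the removed section above (solve $\mathbf{C} p(\mathbf{y}) = \mu(\hat{\mathbf{y}})$, then weight each training example by $p(y_i)/q(y_i)$) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the confusion matrix is taken to be column-normalized (entry $c_{ij}$ is the probability of predicting $i$ given true label $j$), and the matrix, the mean test prediction, and the source label distribution are made-up numbers, not values from the book. The helper name `solve_linear_system` is hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] with float entries.
    aug = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("confusion matrix is (numerically) singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Assumed inputs (illustrative only):
# rows = predicted class, columns = true class, each column sums to 1.
confusion = [[0.9, 0.2],
             [0.1, 0.8]]
mean_test_prediction = [0.41, 0.59]   # fraction of test inputs predicted as each class
source_label_dist = [0.5, 0.5]        # q(y), estimated from training labels

# Estimated target label distribution p(y) from C p(y) = mu(y_hat).
target_label_dist = solve_linear_system(confusion, mean_test_prediction)

# Importance weight per class: beta = p(y) / q(y).
class_weights = [p / q for p, q in zip(target_label_dist, source_label_dist)]
print(target_label_dist, class_weights)
```

Each training example with label $y_i$ would then receive the weight of its class in the weighted empirical risk. If the classifier is poor, the matrix can be close to singular and the estimate unreliable, which is why the sketch raises an error instead of returning a value in that case.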
diff --git a/chapter_multilayer-perceptrons/generalization-deep.md b/chapter_multilayer-perceptrons/generalization-deep.md
new file mode 100644
index 0000000..3ff013d
--- /dev/null
+++ b/chapter_multilayer-perceptrons/generalization-deep.md
@@ -0,0 +1,61 @@
+# ディープラーニングにおける汎化
+
+:numref:`chap_regression`と:numref:`chap_classification`では、線形モデルをトレーニングデータに適合させることにより、回帰と分類の問題に取り組みました。どちらの場合も、観測されたトレーニングラベルの可能性を最大化するパラメーターを見つけるための実用的なアルゴリズムを提供しました。そして、各章の終わりに向かって、トレーニングデータのフィッティングは中間的な目標にすぎないことを思い出しました。私たちの本当の探求は、同じ母集団から引き出された新しい例でも正確な予測を行うことができる*一般的なパターン*を発見することでした。機械学習の研究者は、最適化アルゴリズムの「消費者」です。時には、新しい最適化アルゴリズムを開発しなければならないこともあります。しかし、結局のところ、最適化は単に目的を達成するための手段にすぎません。その核となるのは、機械学習は統計的な分野であり、何らかの統計的原理（既知または未知）によって結果として得られるモデルがトレーニングセットを超えて一般化される場合に限り、トレーニング損失を最適化したいと考えています。 
+
+明るい面として、確率的勾配降下法によって訓練されたディープニューラルネットワークは、コンピュータービジョン、自然言語処理、時系列データ、レコメンダーシステム、電子健康記録、タンパク質の折りたたみ、ビデオゲームやボードゲームにおける価値関数の近似、その他数え切れないほどの領域にまたがる無数の予測問題で、驚くほどうまく汎化することがわかっています。欠点として、最適化の話（なぜトレーニングデータに適合させることができるのか）または汎化の話（なぜ結果として得られるモデルが未知の例に汎化するのか）のどちらかについて簡潔な説明を求めているなら、まず飲み物を一杯用意したほうがよいかもしれません。線形モデルを最適化する手順とその解の統計的特性は、どちらも包括的な理論体系によって十分に説明されていますが、ディープラーニングに対する理解は、どちらの面でもいまだに西部開拓時代のような状態です。 
+
+ディープラーニングの理論と実践はどちらも急速に進化しています。理論家は何が起こっているのかを説明するために新しい戦略を採用し、一方で実践者は猛烈なペースで革新を続け、ディープネットワークをトレーニングするためのヒューリスティックの蓄積と、どのテクニックをどの状況に適用するかを決める指針となる直感や経験則の体系を築いています。 
+
+現時点での要約 (TL;DR) は、ディープラーニングの理論は有望な攻略の糸口と魅力的な結果を散発的に生み出してはいるものの、（i）ニューラルネットワークを最適化できる理由と、（ii）勾配降下法によって学習されたモデルが高次元のタスクでもうまく汎化できる理由の両方について、包括的な説明にはまだほど遠いように見える、ということです。しかし、実際には（i）が問題になることはほとんどなく（すべてのトレーニングデータに適合するパラメーターを常に見つけることができます）、したがって汎化を理解することのほうがはるかに大きな問題です。一方、首尾一貫した科学理論という拠り所がなくても、実務家は、実際にうまく汎化するモデルを作成するのに役立つ可能性のある多数の手法を開発してきました。簡潔な要約でディープラーニングにおける汎化という広大なトピックを正しく扱うことは到底できず、研究の全体的な状況も解決にはほど遠いものの、このセクションでは研究と実践の現状を幅広く概観したいと思います。 
+
+## オーバーフィットと正則化を再考する
+
+機械学習モデルのトレーニングに対する私たちのアプローチは、通常、（i）トレーニングデータを適合させること、および（ii）ホールドアウトデータでモデルを評価することによって*汎化誤差*（基礎となる母集団の真の誤差）を推定する2つのフェーズで構成されることを思い出してください。トレーニングデータへの適合とテストデータへの適合度の差は*汎化ギャップ*と呼ばれ、汎化ギャップが大きい場合、モデルはトレーニングデータに*オーバーフィット*します。過適合の極端なケースでは、テスト誤差が有意なままであっても、トレーニングデータを正確に近似する可能性があります。そして、古典的な見方では、私たちのモデルは複雑すぎると解釈され、特徴の数、学習された非ゼロのパラメーターの数、または定量化されたパラメーターのサイズを縮小する必要があります。:numref:`sec_generalization_basics`のモデルの複雑度対損失（:numref:`fig_capacity_vs_error`）のプロットを思い出してください。 
+
+しかし、ディープラーニングはこの状況を直観に反する形で複雑にします。まず、分類問題の場合、私たちのモデルは通常、数百万の例で構成されるデータセットであっても、すべてのトレーニング例に完全に適合するのに十分な表現力があります :cite:`zhang2021understanding`。古典的な見方では、この設定はモデルの複雑度軸の右端にあり、汎化誤差の改善は、モデルクラスの複雑さを減らすか、またはペナルティを適用してパラメーターが取りうる値の集合を厳しく制約するという、正則化によってもたらされなければならないと考えるかもしれません。しかし、ここから話が奇妙になり始めます。 
+
+奇妙なことに、多くのディープラーニングタスク（画像認識やテキスト分類など）では、通常、モデルアーキテクチャの中から選択しています。これらのアーキテクチャはすべて、任意に低いトレーニング損失（およびトレーニングエラーゼロ）を達成できます。検討中のすべてのモデルがゼロトレーニングエラーを達成するため、
+*さらなる利益を得る唯一の手段は、オーバーフィッティングを減らすことです*。
+さらに奇妙なことに、トレーニングデータに完全に適合しているにもかかわらず、レイヤーやノードを追加したり、より多くのエポックでトレーニングしたりしてモデルを*さらに表現力豊かに*することで、実際に*汎化誤差をさらに削減*できることがよくあります。もっと奇妙なことに、汎化ギャップとモデルの*複雑さ*（ネットワークの深さや幅などで捉えられるもの）との関係は単調でない場合があり、複雑さを増すと最初は悪化するものの、その後は改善に転じる、いわゆる「二重降下」パターンを示します :cite:`nakkiran2021deep`。したがって、ディープラーニングの実践者は、モデルを何らかの形で制限するように見えるものもあれば、モデルをさらに表現力豊かにするように見えるものもある、さまざまなトリックを持ち合わせており、それらはすべて、ある意味で過適合を緩和するために適用されます。 
+
+さらに複雑なことに、古典的学習理論によって提供される保証は古典的なモデルに対してさえ保守的になりえますが、そもそもディープニューラルネットワークがなぜ汎化するのかを説明するには無力に見えます。ディープニューラルネットワークは、$\ell_2$ 正則化のような使い慣れた方法を使用していても、大規模なデータセットに対してさえ任意のラベルに適合できるため、従来の複雑さに基づく汎化限界（仮説クラスのVC次元やRademacher複雑度に基づくものなど）では、ニューラルネットワークが汎化する理由を説明できません。 
+
+## ノンパラメトリクスからのインスピレーション
+
+ディープラーニングに初めて触れると、ニューラルネットワークをパラメトリックモデルと考えたくなります。結局のところ、モデルには実際に何百万ものパラメーターがあります。モデルを更新するときは、そのパラメーターを更新します。モデルを保存するときは、パラメーターをディスクに書き込みます。しかし、数学とコンピューターサイエンスには、直感に反する視点の転換や、一見異なる問題の間の驚くべき同型があふれています。ニューラルネットワークは明らかにパラメーターを*持っている*ものの、ある意味では、ノンパラメトリックモデルのように振る舞うと考えるほうが実り多い場合があります。では、モデルをノンパラメトリックにするものは正確には何でしょうか？この名前はさまざまなアプローチを含んでいますが、共通のテーマの1つは、ノンパラメトリック手法は利用可能なデータ量が増えるにつれて複雑さのレベルが増す傾向があるということです。 
+
+おそらく、ノンパラメトリックモデルの最も単純な例は、$k$-最近傍アルゴリズムです（:numref:`sec_attention-pooling`など、より多くのノンパラメトリックモデルについては後で説明します）。ここで、学習者は学習時に、データセットを単に記憶します。次に、予測時に新しい点 $\mathbf{x}$ が検出されると、学習者は $k$ 最近傍を調べます ($k$ ポイント $\mathbf{x}_i'$ はある程度の距離を最小化します $d(\mathbf{x}, \mathbf{x}_i')$)。$k=1$ の場合、このアルゴリズムは 1 最近傍と呼ばれ、アルゴリズムは常にゼロの学習誤差を達成します。しかし、それはアルゴリズムが一般化されないという意味ではありません。実際、ある穏やかな条件下では、1最近傍アルゴリズムは一貫している（最終的には最適な予測変数に収束する）ことが分かります。 
+
+1つの最近傍では、何らかの距離関数$d$を指定するか、またはそれと同等に、データを特徴付けるために何らかのベクトル値基底関数$\phi(\mathbf{x})$を指定する必要があることに注意してください。どの距離計量を選択しても、学習誤差が0になり、最終的には最適な予測変数に到達しますが、距離計量 $d$ によってさまざまな誘導バイアスがエンコードされ、利用可能なデータ量が限られていると、異なる予測変数が生成されます。距離計量 $d$ のさまざまな選択肢は、基礎となるパターンに関するさまざまな仮定を表し、さまざまな予測変数の性能は、仮定と観測データとの適合性によって異なります。 
+
+ニューラルネットワークはオーバーパラメーター化されており、トレーニングデータへの適合に必要な数よりはるかに多くのパラメーターを持つため、トレーニングデータを*補間*する（完全に適合する）傾向があり、その意味でノンパラメトリックモデルに近い振る舞いをします。最近の理論的研究により、大規模ニューラルネットワークとノンパラメトリック手法、特にカーネル法との間に深いつながりが確立されています。特に、:cite:`Jacot.Grabriel.Hongler.2018`は、ランダムに初期化された重みを持つ多層パーセプトロンの幅を無限に広げた極限において、それがニューラルタンジェントカーネルと呼ばれる特定のカーネル関数（本質的には距離関数）を用いた（ノンパラメトリックな）カーネル法と等価になることを示しました。現在のニューラルタンジェントカーネルモデルは、最新のディープネットワークの動作を完全には説明できないかもしれませんが、解析ツールとしての成功は、オーバーパラメーター化されたディープネットワークの動作を理解するうえでのノンパラメトリックモデリングの有用性を裏付けています。 
+
+## 早期停止
+
+ディープニューラルネットワークは任意のラベルを当てはめることができますが、ラベルが誤ってまたはランダムに割り当てられた場合でも :cite:`zhang2021understanding`、この能力はトレーニングの反復を何度も繰り返すことによってのみ現れます。新しい作業ライン:cite:`Rolnick.Veit.Belongie.Shavit.2017`は、ラベルノイズの設定において、ニューラルネットワークがきれいにラベル付けされたデータを最初に適合させ、その後にのみ誤ったラベル付けされたデータを補間する傾向があることを明らかにしました。さらに、この現象は一般化の保証に直接変換されることが確立されています。モデルがトレーニングセットに含まれるランダムにラベル付けされた例ではなく、きれいにラベル付けされたデータに適合する場合は常に、実際には:cite:`Garg.Balakrishnan.Kolter.Lipton.2021`を一般化しています。 
+
+これらの知見を合わせると、ディープニューラルネットワークを正則化するための古典的な手法である*早期停止*の動機付けに役立ちます。ここでは、重みの値を直接制約するのではなく、トレーニングのエポック数を制約します。停止基準を決定する最も一般的な方法は、トレーニング全体を通して検証誤差を監視し（通常は各エポックの後に1回チェックします）、検証誤差が一定のエポック数にわたって小さな値 $\epsilon$ を超えて減少しなかった時点でトレーニングを打ち切ることです。これは*忍耐基準*と呼ばれることもあります。ノイズの多いラベルの設定でより良い汎化につながる可能性があることに加えて、早期停止のもう1つの利点は時間の節約です。忍耐基準が満たされたら、トレーニングを終了できます。8つ以上のGPUで同時に数日間のトレーニングを必要とする大規模なモデルの場合、適切に調整された早期停止により、研究者の時間を何日も節約し、雇用主の費用を何千ドルも節約できます。 
+
+特に、ラベルノイズがなく、データセットが*実現可能*である場合（クラスが本当に分離可能、たとえば猫と犬を区別するなど）、早期停止は一般化の大幅な改善につながらない傾向があります。一方、ラベルノイズやラベルに固有のばらつきがある場合（患者の死亡率を予測するなど）、早期停止が重要です。ノイズの多いデータを内挿するまでモデルをトレーニングすることは、一般的に悪い考えです。 
+
+## ディープネットワークのための古典的正則化手法
+
+:numref:`chap_regression`では、モデルの複雑さを制約するためのいくつかの古典的な正則化手法について説明しました。特に、:numref:`sec_weight_decay`は、重み減衰と呼ばれる方法を導入しました。これは、損失関数に正則化項を追加して、大きな値の重みにペナルティを課すことで構成されます。どのウェイトノルムにペナルティが課されるかに応じて、この手法はリッジ正則化 ($\ell_2$ ペナルティ) またはラッソ正則化 ($\ell_1$ ペナルティの場合) として知られています。これらの正則化器の古典的解析では、モデルが任意のラベルに当てはまるのを防ぐために、重みが取ることができる値を制限すると考えられています。 
+
+ディープラーニングの実装では、重み減衰は依然として一般的なツールです。しかし、研究者は、典型的な強さの $\ell_2$ 正則化では、ネットワークがデータを補間するのを防ぐには不十分であることを指摘しており :cite:`zhang2021understanding`、したがって、正則化として解釈した場合の利点は、早期停止基準と組み合わせた場合にのみ意味をなす可能性があります。早期停止がない場合、層数やノード数（ディープラーニングの場合）または距離計量（1最近傍の場合）と同様に、これらの方法がより良い汎化につながるのは、ニューラルネットワークの能力を有意義に制約するからではなく、関心のあるデータセットに見られるパターンとより相性のよい帰納バイアスを何らかの形でエンコードするからである可能性があります。したがって、古典的な正則化手法は、その有効性の理論的根拠が根本的に異なるかもしれないとしても、ディープラーニングの実装で依然として人気があります。 
+
+特に、ディープラーニングの研究者は、モデル入力にノイズを追加するなど、古典的な正則化のコンテキストで最初に普及した手法も構築しています。次のセクションでは、ディープラーニングの有効性の理論的根拠が同様に謎のままであるにもかかわらず、ディープラーニングの主力となった有名なドロップアウト手法（:citet:`Srivastava.Hinton.Krizhevsky.ea.2014`によって発明された）を紹介します。 
+
+## まとめ
+
+例よりもパラメーターが少ない傾向がある古典的な線形モデルとは異なり、ディープネットワークはパラメーター化しすぎる傾向があり、ほとんどのタスクでトレーニングセットを完全に適合させることができます。この*補間レジーム*は、多くの難しい素早い直感に挑戦します。機能的には、ニューラルネットワークはパラメトリックモデルのように見えます。しかし、それらをノンパラメトリックモデルと考えることは、直感のより信頼できる情報源になる場合があります。検討中のすべてのディープネットワークがすべての学習ラベルを近似できることはよくあることなので、ほぼすべての利益は過適合を緩和する（*汎化のギャップ*を埋める）ことによって得られなければなりません。逆説的に、汎化ギャップを減少させる介入は、モデルの複雑さを増すように見える場合や、複雑さを軽減するように見える場合があります。しかし、これらの方法は、古典理論がディープネットワークの一般化を説明するのに十分なほど複雑さを低下させることはめったになく、*特定の選択が一般化の改善につながる理由*は、多くの優秀な研究者の協調的な努力にもかかわらず、大部分が未解決の問題のままです。 
+
+## 演習
+
+1. 従来の複雑性に基づく測定では、ディープニューラルネットワークの一般化を説明できないのはどのような意味ですか？
+1. なぜ*早期停止*が正則化手法と見なされるのでしょうか？
+1. 研究者は通常、停止基準をどのように決定しますか？
+1. 早期停止が一般化の大幅な改善につながるケースを区別する重要な要素は何ですか？
+1. 一般化を超えて、早期停止の別の利点を説明する。
+
+[Discussions](https://discuss.d2l.ai/t/7473)
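
The claim in the nonparametrics section above, that 1-nearest neighbor always reaches zero training error regardless of the labels, can be checked with a small standard-library sketch. The data here is synthetic and the labels are deliberately random; the function name `nearest_neighbor_predict` and the choice of Euclidean distance are illustrative assumptions, not something prescribed by the text.

```python
import math
import random


def nearest_neighbor_predict(train_x, train_y, query):
    """Return the label of the training point closest to `query` (Euclidean distance)."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[best]


random.seed(0)
# Synthetic 2-D inputs with purely random binary labels (no pattern to learn).
train_x = [(random.random(), random.random()) for _ in range(200)]
train_y = [random.randint(0, 1) for _ in train_x]

# Zero training error relies on the training points being distinct.
assert len(set(train_x)) == len(train_x)

# Every training point is its own nearest neighbor, so it recovers its own label.
mistakes = sum(
    nearest_neighbor_predict(train_x, train_y, x) != y
    for x, y in zip(train_x, train_y)
)
train_error = mistakes / len(train_x)
print(train_error)
```

Because the labels are random, the zero training error here says nothing about performance on new points; swapping `math.dist` for a different distance would keep the training error at zero while changing the predictions on unseen inputs, which is the inductive-bias point made in the text.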
diff --git a/chapter_multilayer-perceptrons/generalization-deep_origin.md b/chapter_multilayer-perceptrons/generalization-deep_origin.md
new file mode 100644
index 0000000..f2a298c
--- /dev/null
+++ b/chapter_multilayer-perceptrons/generalization-deep_origin.md
@@ -0,0 +1,363 @@
+# Generalization in Deep Learning
+
+
+In :numref:`chap_regression` and :numref:`chap_classification`,
+we tackled regression and classification problems
+by fitting linear models to training data.
+In both cases, we provided practical algorithms
+for finding the parameters that maximized
+the likelihood of the observed training labels.
+And then, towards the end of each chapter,
+we recalled that fitting the training data
+was only an intermediate goal.
+Our real quest all along was to discover *general patterns*
+on the basis of which we can make accurate predictions
+even on new examples drawn from the same underlying population.
+Machine learning researchers are *consumers* of optimization algorithms.
+Sometimes, we must even develop new optimization algorithms.
+But at the end of the day, optimization is merely a means to an end.
+At its core, machine learning is a statistical discipline
+and we wish to optimize training loss only insofar
+as some statistical principle (known or unknown)
+leads the resulting models to generalize beyond the training set.
+
+
+On the bright side, it turns out that deep neural networks
+trained by stochastic gradient descent generalize remarkably well
+across myriad prediction problems, spanning computer vision;
+natural language processing; time series data; recommender systems;
+electronic health records; protein folding;
+value function approximation in video games
+and board games; and countless other domains.
+On the downside, if you were looking
+for a straightforward account
+of either the optimization story
+(why we can fit them to training data)
+or the generalization story
+(why the resulting models generalize to unseen examples),
+then you might want to pour yourself a drink.
+While our procedures for optimizing linear models
+and the statistical properties of the solutions
+are both described well by a comprehensive body of theory,
+our understanding of deep learning
+still resembles the wild west on both fronts.
+
+The theory and practice of deep learning
+are rapidly evolving on both fronts,
+with theorists adopting new strategies
+to explain what's going on,
+even as practitioners continue
+to innovate at a blistering pace,
+building arsenals of heuristics for training deep networks
+and a body of intuitions and folk knowledge
+that provide guidance for deciding
+which techniques to apply in which situations.
+
+The TL;DR of the present moment is that the theory of deep learning
+has produced promising lines of attack and scattered fascinating results,
+but still appears far from a comprehensive account
+of both (i) why we are able to optimize neural networks
+and (ii) how models learned by gradient descent
+manage to generalize so well, even on high-dimensional tasks.
+However, in practice, (i) is seldom a problem
+(we can always find parameters that will fit all of our training data)
+and thus understanding generalization is by far the bigger problem.
+On the other hand, even absent the comfort of a coherent scientific theory,
+practitioners have developed a large collection of techniques
+that may help you to produce models that generalize well in practice.
+While no pithy summary can possibly do justice
+to the vast topic of generalization in deep learning,
+and while the overall state of research is far from resolved,
+we hope, in this section, to present a broad overview
+of the state of research and practice.
+
+
+## Revisiting Overfitting and Regularization
+
+Recall that our approach to training machine learning models
+typically consists of two phases: (i) fit the training data;
+and (ii) estimate the *generalization error*
+(the true error on the underlying population)
+by evaluating the model on holdout data.
+The difference between our fit on the training data
+and our fit on the test data is called the *generalization gap*
+and when the generalization gap is large,
+we say that our models *overfit* to the training data.
+In extreme cases of overfitting,
+we might exactly fit the training data,
+even when the test error remains significant.
+And in the classical view,
+the interpretation is that our models are too complex,
+requiring that we either shrink the number of features,
+the number of nonzero parameters learned,
+or the size of the parameters as quantified.
+Recall the plot of model complexity vs loss
+(:numref:`fig_capacity_vs_error`)
+from :numref:`sec_generalization_basics`.
+
+
+However, deep learning complicates this picture in counterintuitive ways.
+First, for classification problems,
+our models are typically expressive enough
+to perfectly fit every training example,
+even in datasets consisting of millions of examples
+:cite:`zhang2021understanding`.
+In the classical picture, we might think
+that this setting lies on the far right extreme
+of the model complexity axis,
+and that any improvements in generalization error
+must come by way of regularization,
+either by reducing the complexity of the model class,
+or by applying a penalty, severely constraining
+the set of values that our parameters might take.
+But that's where things start to get weird.
+
+Strangely, for many deep learning tasks
+(e.g., image recognition and text classification)
+we are typically choosing among model architectures,
+all of which can achieve arbitrarily low training loss
+(and zero training error).
+Because all models under consideration achieve zero training error,
+*the only avenue for further gains is to reduce overfitting*.
+Even stranger, it's often the case that
+despite fitting the training data perfectly,
+we can actually *reduce the generalization error*
+further by making the model *even more expressive*,
+e.g., adding layers, nodes, or training
+for a larger number of epochs.
+Stranger yet, the pattern relating the generalization gap
+to the *complexity* of the model (as captured, e.g.,
+in the depth or width of the networks)
+can be non-monotonic,
+with greater complexity hurting at first
+but subsequently helping in a so-called "double-descent" pattern
+:cite:`nakkiran2021deep`.
+Thus the deep learning practitioner possesses a bag of tricks,
+some of which seemingly restrict the model in some fashion
+and others that seemingly make it even more expressive,
+and all of which, in some sense, are applied to mitigate overfitting.
+
+Complicating things even further,
+while the guarantees provided by classical learning theory
+can be conservative even for classical models,
+they appear powerless to explain why it is
+that deep neural networks generalize in the first place.
+Because deep neural networks are capable of fitting
+arbitrary labels even for large datasets,
+and despite the use of familiar methods like $\ell_2$ regularization,
+traditional complexity-based generalization bounds,
+e.g., those based on the VC dimension
+or Rademacher complexity of a hypothesis class,
+cannot explain why neural networks generalize.
+
+## Inspiration from Nonparametrics
+
+Approaching deep learning for the first time,
+it's tempting to think of neural networks as parametric models.
+After all, the models *do* have millions of parameters.
+When we update the models, we update their parameters.
+When we save the models, we write their parameters to disk.
+However, mathematics and computer science are riddled
+with counterintuitive changes of perspective,
+and surprising isomorphisms between seemingly different problems.
+While neural networks clearly *have* parameters,
+in some ways, it can be more fruitful
+to think of them as behaving like nonparametric models.
+So what precisely makes a model nonparametric?
+While the name covers a diverse set of approaches,
+one common theme is that nonparametric methods
+tend to have a level of complexity that grows
+as the amount of available data grows.
+
+Perhaps the simplest example of a nonparametric model
+is the $k$-nearest neighbor algorithm (we will cover more nonparametric models later, such as in :numref:`sec_attention-pooling`).
+Here, at training time,
+the learner simply memorizes the dataset.
+Then, at prediction time,
+when confronted with a new point $\mathbf{x}$,
+the learner looks up the $k$ nearest neighbors
+(the $k$ points $\mathbf{x}_i'$ that minimize
+some distance $d(\mathbf{x}, \mathbf{x}_i')$).
+When $k=1$, this algorithm is called 1-nearest neighbors,
+and the algorithm will always achieve a training error of zero.
+That, however, does not mean that the algorithm will not generalize.
+In fact, it turns out that under some mild conditions,
+the 1-nearest neighbor algorithm is consistent
+(eventually converging to the optimal predictor).
+
+
+Note that 1-nearest neighbor requires that we specify
+some distance function $d$, or equivalently,
+that we specify some vector-valued basis function $\phi(\mathbf{x})$
+for featurizing our data.
+For any choice of the distance metric,
+we will achieve 0 training error
+and eventually reach an optimal predictor,
+but different distance metrics $d$
+encode different inductive biases
+and with a finite amount of available data
+will yield different predictors.
+Different choices of the distance metric $d$
+represent different assumptions about the underlying patterns
+and the performance of the different predictors
+will depend on how compatible the assumptions
+are with the observed data.
+
+In a sense, because neural networks are over-parameterized,
+possessing many more parameters than are needed to fit the training data,
+they tend to *interpolate* the training data (fitting it perfectly)
+and thus behave, in some ways, more like nonparametric models.
+More recent theoretical research has established
+deep connections between large neural networks
+and nonparametric methods, notably kernel methods.
+In particular, :cite:`Jacot.Grabriel.Hongler.2018`
+demonstrated that in the limit, as multilayer perceptrons
+with randomly initialized weights grow infinitely wide,
+they become equivalent to (nonparametric) kernel methods
+for a specific choice of the kernel function
+(essentially, a distance function),
+which they call the neural tangent kernel.
+While current neural tangent kernel models may not fully explain
+the behavior of modern deep networks,
+their success as an analytical tool
+underscores the usefulness of nonparametric modeling
+for understanding the behavior of over-parameterized deep networks.
+
+
+## Early Stopping
+
+While deep neural networks are capable of fitting arbitrary labels,
+even when labels are assigned incorrectly or randomly
+:cite:`zhang2021understanding`,
+this ability only emerges over many iterations of training.
+A new line of work :cite:`Rolnick.Veit.Belongie.Shavit.2017`
+has revealed that in the setting of label noise,
+neural networks tend to fit cleanly labeled data first
+and only subsequently to interpolate the mislabeled data.
+Moreover, it's been established that this phenomenon
+translates directly into a guarantee on generalization:
+whenever a model has fitted the cleanly labeled data
+but not randomly labeled examples included in the training set,
+it has in fact generalized :cite:`Garg.Balakrishnan.Kolter.Lipton.2021`.
+
+Together these findings help to motivate *early stopping*,
+a classic technique for regularizing deep neural networks.
+Here, rather than directly constraining the values of the weights,
+one constrains the number of epochs of training.
+The most common way to determine the stopping criterion
+is to monitor validation error throughout training
+(typically by checking once after each epoch)
+and to cut off training when the validation error
+has not decreased by more than some small amount $\epsilon$
+for some number of epochs.
+This is sometimes called a *patience criterion*.
+Besides the potential to lead to better generalization,
+in the setting of noisy labels,
+another benefit of early stopping is the time saved.
+Once the patience criterion is met, one can terminate training.
+For large models that might require days of training
+simultaneously across 8 GPUs or more,
+well-tuned early stopping can save researchers days of time
+and can save their employers many thousands of dollars.
+
+Notably, when there is no label noise and datasets are *realizable*
+(the classes are truly separable, e.g., distinguishing cats from dogs),
+early stopping tends not to lead to significant improvements in generalization.
+On the other hand, when there is label noise,
+or intrinsic variability in the label
+(e.g., predicting mortality among patients),
+early stopping is crucial.
+Training models until they interpolate noisy data is typically a bad idea.
+
+
+## Classical Regularization Methods for Deep Networks
+
+In :numref:`chap_regression`, we described
+several classical regularization techniques
+for constraining the complexity of our models.
+In particular, :numref:`sec_weight_decay`
+introduced a method called weight decay,
+which consists of adding a regularization term to the loss function
+to penalize large values of the weights.
+Depending on which weight norm is penalized
+this technique is known either as ridge regularization (for an $\ell_2$ penalty)
+or lasso regularization (for an $\ell_1$ penalty).
+In the classical analysis of these regularizers,
+they are considered to restrict the values
+that the weights can take sufficiently
+to prevent the model from fitting arbitrary labels.
+
+In deep learning implementations,
+weight decay remains a popular tool.
+However, researchers have noted
+that typical strengths of $\ell_2$ regularization
+are insufficient to prevent the networks
+from interpolating the data
+:cite:`zhang2021understanding`
+and thus the benefits, if interpreted
+as regularization, might only make sense
+in combination with an early stopping criterion.
+Absent early stopping, it's possible
+that just like the number of layers
+or number of nodes (in deep learning)
+or the distance metric (in 1-nearest neighbor),
+these methods may lead to better generalization
+not because they meaningfully constrain
+the power of the neural network
+but rather because they somehow encode inductive biases
+that are more compatible with the patterns
+found in datasets of interest.
+Thus, classical regularizers remain popular
+in deep learning implementations,
+even if the theoretical rationale
+for their efficacy may be radically different.
+
+Notably, deep learning researchers have also built
+on techniques first popularized
+in classical regularization contexts,
+such as adding noise to model inputs.
+In the next section we will introduce
+the famous dropout technique
+(invented by :citet:`Srivastava.Hinton.Krizhevsky.ea.2014`),
+which has become a mainstay of deep learning,
+even as the theoretical basis for its efficacy
+remains similarly mysterious.
+
+
+## Summary
+
+Unlike classical linear models,
+which tend to have fewer parameters than examples,
+deep networks tend to be over-parameterized,
+and for most tasks are capable
+of perfectly fitting the training set.
+This *interpolation regime* challenges
+many firmly held intuitions.
+Functionally, neural networks look like parametric models.
+But thinking of them as nonparametric models
+can sometimes be a more reliable source of intuition.
+Because it's often the case that all deep networks under consideration
+are capable of fitting all of the training labels,
+nearly all gains must come by mitigating overfitting
+(closing the *generalization gap*).
+Paradoxically, the interventions
+that reduce the generalization gap
+sometimes appear to increase model complexity
+and at other times appear to decrease complexity.
+However, these methods seldom decrease complexity
+sufficiently for classical theory
+to explain the generalization of deep networks,
+and *why certain choices lead to improved generalization*
+remains for the most part a massive open question
+despite the concerted efforts of many brilliant researchers.
+
+
+## Exercises
+
+1. In what sense do traditional complexity-based measures fail to account for generalization of deep neural networks?
+1. Why might *early stopping* be considered a regularization technique?
+1. How do researchers typically determine the stopping criteria?
+1. What important factor seems to differentiate cases when early stopping leads to big improvements in generalization?
+1. Beyond generalization, describe another benefit of early stopping.
+
+[Discussions](https://discuss.d2l.ai/t/7473)
diff --git a/chapter_multilayer-perceptrons/index.md b/chapter_multilayer-perceptrons/index.md
index a0756f3..d1bb07c 100644
--- a/chapter_multilayer-perceptrons/index.md
+++ b/chapter_multilayer-perceptrons/index.md
@@ -1,19 +1,16 @@
 # Multilayer Perceptrons
 :label:`chap_perceptrons`
 
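-In this chapter, we will introduce your first truly *deep* network. The simplest deep networks are called multilayer perceptrons, and they consist of multiple layers of neurons, each fully connected to those in the layer below (from which they receive input) and those above (which they, in turn, influence). When we train high-capacity models we run the risk of overfitting. Thus, we will need to provide your first rigorous introduction to the notions of overfitting, underfitting, and model selection. To help you combat these problems, we will introduce regularization techniques such as weight decay and dropout. We will also discuss issues relating to numerical stability and parameter initialization that are key to successfully training deep networks. Throughout, we aim to give you a firm grasp not just of the concepts but also of the practice of using deep networks. At the end of this chapter, we apply what we have introduced so far to a real case: house price prediction. We defer matters relating to the computational performance, scalability, and efficiency of our models to subsequent chapters.
+In this chapter, we will introduce your first truly *deep* network. The simplest deep networks are called *multilayer perceptrons*, and they consist of multiple layers of neurons, each fully connected to those in the layer below (from which they receive input) and those above (which they, in turn, influence). Although automatic differentiation significantly simplifies the implementation of deep learning algorithms, we will dive deep into how these gradients are calculated in deep networks. Then we will be ready to discuss issues relating to numerical stability and parameter initialization that are key to successfully training deep networks. When we train such high-capacity models we run the risk of overfitting. Thus, we will revisit regularization and generalization for deep networks. Throughout, we aim to give you a firm grasp not just of the concepts but also of the practice of using deep networks. At the end of this chapter, we apply what we have introduced so far to a real case: house price prediction. We defer matters relating to the computational performance, scalability, and efficiency of our models to subsequent chapters.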
 
 ```toc
 :maxdepth: 2
 
 mlp
-mlp-scratch
-mlp-concise
-underfit-overfit
-weight-decay
-dropout
+mlp-implementation
 backprop
 numerical-stability-and-init
-environment
+generalization-deep
+dropout
 kaggle-house-price
 ```
diff --git a/chapter_multilayer-perceptrons/index_origin.md b/chapter_multilayer-perceptrons/index_origin.md
index 295fb68..1ffc67e 100644
--- a/chapter_multilayer-perceptrons/index_origin.md
+++ b/chapter_multilayer-perceptrons/index_origin.md
@@ -2,37 +2,37 @@
 :label:`chap_perceptrons`
 
 In this chapter, we will introduce your first truly *deep* network.
-The simplest deep networks are called multilayer perceptrons,
+The simplest deep networks are called *multilayer perceptrons*,
 and they consist of multiple layers of neurons
 each fully connected to those in the layer below
 (from which they receive input)
 and those above (which they, in turn, influence).
-When we train high-capacity models we run the risk of overfitting.
-Thus, we will need to provide your first rigorous introduction
-to the notions of overfitting, underfitting, and model selection.
-To help you combat these problems,
-we will introduce regularization techniques such as weight decay and dropout.
-We will also discuss issues relating to numerical stability and parameter initialization
+Although automatic differentiation
+significantly simplifies the implementation of deep learning algorithms,
+we will dive deep into how these gradients
+are calculated in deep networks.
+Then we will
+be ready to
+discuss issues relating to numerical stability and parameter initialization
 that are key to successfully training deep networks.
-Throughout, we aim to give you a firm grasp not just of the concepts
-but also of the practice of using deep networks.
-At the end of this chapter,
-we apply what we have introduced so far to a real case: house price prediction.
-We punt matters relating to the computational performance,
-scalability, and efficiency of our models to subsequent chapters.
+When we train such high-capacity models we run the risk of overfitting. Thus, we will
+revisit regularization and generalization
+for deep networks.
+Throughout, we aim
+to give you a firm grasp not just of the concepts but also of the practice of using deep networks.
+At the end of this chapter, we apply what we have introduced so far to a real case: house price
+prediction. We punt matters relating to the computational performance, scalability, and efficiency
+of our models to subsequent chapters.
 
 ```toc
 :maxdepth: 2
 
 mlp
-mlp-scratch
-mlp-concise
-underfit-overfit
-weight-decay
-dropout
+mlp-implementation
 backprop
 numerical-stability-and-init
-environment
+generalization-deep
+dropout
 kaggle-house-price
 ```
 
diff --git a/chapter_multilayer-perceptrons/kaggle-house-price.md b/chapter_multilayer-perceptrons/kaggle-house-price.md
index 4bd6352..89303e2 100644
--- a/chapter_multilayer-perceptrons/kaggle-house-price.md
+++ b/chapter_multilayer-perceptrons/kaggle-house-price.md
@@ -1,87 +1,38 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
 # Predicting House Prices on Kaggle
 :label:`sec_kaggle_house`
 
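-Now that we have introduced some basic tools for building and training deep networks and regularizing them with techniques including weight decay and dropout, we are ready to put all this knowledge into practice by participating in a Kaggle competition. The house price prediction competition is a great place to start. The data are fairly generic and do not exhibit exotic structure that might require specialized models (as audio or video might). This dataset, collected by Bart de Cock in 2011 :cite:`De-Cock.2011`, covers house prices in Ames, IA from the period of 2006 to 2010. It is considerably larger than the famous [Boston housing dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names) of Harrison and Rubinfeld (1978), boasting both more examples and more features. 
+Now that we have introduced some basic tools for building and training deep networks and regularizing them with techniques including weight decay and dropout, we are ready to put all this knowledge into practice by participating in a Kaggle competition. The house price prediction competition is a great place to start. The data are fairly generic and do not exhibit exotic structure that might require specialized models (as audio or video might). This dataset, collected by Bart de Cock in 2011 :cite:`De-Cock.2011`, covers house prices in Ames, IA from the period of 2006 to 2010. It is considerably larger than the famous [Boston housing dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names) of Harrison and Rubinfeld (1978), boasting both more examples and more features. 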
 
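-In this section, we will walk you through the details of data preprocessing, model design, and hyperparameter selection. We hope that through a hands-on approach, you will gain some intuitions that will guide you in your career as a data scientist. 
+In this section, we will walk you through the details of data preprocessing, model design, and hyperparameter selection. We hope that through a hands-on approach, you will gain some intuitions that will guide you in your career as a data scientist. 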
 
-## Downloading and Caching Datasets
+## Downloading Data
 
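-Throughout the book, we will train and test models on various downloaded datasets. Here, we (**implement several utility functions to facilitate data downloading**). First, we maintain a dictionary `DATA_HUB` that maps a string (the *name* of the dataset) to a tuple containing both the URL to locate the dataset and the SHA-1 key that verifies the integrity of the file. All such datasets are hosted at the site whose address is `DATA_URL`.
+Throughout the book, we will train and test models on various downloaded datasets. Here, we (**implement two utility functions**) to download files and to extract zip or tar files. Again, we defer their implementations to :numref:`sec_utils`.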
 
-```{.python .input}
-#@tab all
-import os
-import requests
-import zipfile
-import tarfile
-import hashlib
-
-#@save
-DATA_HUB = dict()
-DATA_URL = 'http://d2l-data.s3-accelerate.amazonaws.com/'
-```
+```{.python .input  n=2}
+%%tab all
 
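-The following `download` function downloads a dataset, caches it in a local directory (`../data` by default), and returns the name of the downloaded file. If a file corresponding to this dataset already exists in the cache directory and its SHA-1 matches the one stored in `DATA_HUB`, our code will use the cached file to avoid clogging up your internet with redundant downloads.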
+def download(url, folder, sha1_hash=None):
+    """Download a file to folder and return the local filepath."""
 
-```{.python .input}
-#@tab all
-def download(name, cache_dir=os.path.join('..', 'data')):  #@save
-    """Download a file inserted into DATA_HUB, return the local filename."""
-    assert name in DATA_HUB, f"{name} does not exist in {DATA_HUB}."
-    url, sha1_hash = DATA_HUB[name]
-    os.makedirs(cache_dir, exist_ok=True)
-    fname = os.path.join(cache_dir, url.split('/')[-1])
-    if os.path.exists(fname):
-        sha1 = hashlib.sha1()
-        with open(fname, 'rb') as f:
-            while True:
-                data = f.read(1048576)
-                if not data:
-                    break
-                sha1.update(data)
-        if sha1.hexdigest() == sha1_hash:
-            return fname  # Hit cache
-    print(f'Downloading {fname} from {url}...')
-    r = requests.get(url, stream=True, verify=True)
-    with open(fname, 'wb') as f:
-        f.write(r.content)
-    return fname
-```
-
-We also implement two additional utility functions: one downloads and extracts a zip or tar file, and the other downloads all the datasets used in this book from `DATA_HUB` into the cache directory.
-
-```{.python .input}
-#@tab all
-def download_extract(name, folder=None):  #@save
-    """Download and extract a zip/tar file."""
-    fname = download(name)
-    base_dir = os.path.dirname(fname)
-    data_dir, ext = os.path.splitext(fname)
-    if ext == '.zip':
-        fp = zipfile.ZipFile(fname, 'r')
-    elif ext in ('.tar', '.gz'):
-        fp = tarfile.open(fname, 'r')
-    else:
-        assert False, 'Only zip/tar files can be extracted.'
-    fp.extractall(base_dir)
-    return os.path.join(base_dir, folder) if folder else data_dir
-
-def download_all():  #@save
-    """Download all files in the DATA_HUB."""
-    for name in DATA_HUB:
-        download(name)
+def extract(filename, folder):
+    """Extract a zip/tar file into folder."""
 ```
 
 ## Kaggle
 
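-[Kaggle](https://www.kaggle.com) is a popular platform that hosts machine learning competitions. Each competition centers on a dataset, and many are sponsored by stakeholders who offer prizes to the winning solutions. The platform helps users to interact via forums and shared code, fostering both collaboration and competition. While leaderboard chasing often spirals out of control, with researchers focusing myopically on preprocessing steps rather than asking fundamental questions, there is also tremendous value in the objectivity of a platform that facilitates direct quantitative comparisons among competing approaches, as well as code sharing so that everyone can learn what did and did not work. If you want to participate in a Kaggle competition, you will first need to register for an account (see :numref:`fig_kaggle`). 
+[Kaggle](https://www.kaggle.com) is a popular platform that hosts machine learning competitions. Each competition centers on a dataset, and many are sponsored by stakeholders who offer prizes to the winning solutions. The platform helps users to interact via forums and shared code, fostering both collaboration and competition. While leaderboard chasing often spirals out of control, with researchers focusing myopically on preprocessing steps rather than asking fundamental questions, there is also tremendous value in the objectivity of a platform that facilitates direct quantitative comparisons among competing approaches, as well as code sharing so that everyone can learn what did and did not work. If you want to participate in a Kaggle competition, you will first need to register for an account (see :numref:`fig_kaggle`). 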
 
 ![The Kaggle website.](../img/kaggle.png)
 :width:`400px`
 :label:`fig_kaggle`
 
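-On the house price prediction competition page, as illustrated in :numref:`fig_house_pricing`, you can find the dataset (under the "Data" tab), submit predictions, and see your ranking. The URL is right here: 
+On the house price prediction competition page, as illustrated in :numref:`fig_house_pricing`, you can find the dataset (under the "Data" tab), submit predictions, and see your ranking. The URL is right here: 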
 
 > https://www.kaggle.com/c/house-prices-advanced-regression-techniques 
 
@@ -89,16 +40,12 @@ def download_all():  #@save
 :width:`400px`
 :label:`fig_house_pricing`
 
-## Accessing and Reading the Dataset
+## Accessing and Reading the Dataset
 
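-Note that the competition data is separated into training and test sets. Each record includes the property value of the house and attributes such as street type, year of construction, roof type, and basement condition. The features consist of various data types. For example, the year of construction is represented by an integer, the roof type by discrete categorical assignments, and other features by floating-point numbers. And here is where reality complicates things: for some examples, some data is altogether missing, with the missing value marked simply as "na". The price of each house is included for the training set only (it is a competition after all). We will want to partition the training set to create a validation set, but we only get to evaluate our models on the official test set after uploading predictions to Kaggle. The "Data" tab on the competition page in :numref:`fig_house_pricing` has links to download the data. 
-
-To get started, we will [**read in and process the data using `pandas`**], which we introduced in :numref:`sec_pandas`. So, you will want to make sure that you have `pandas` installed before proceeding further. Fortunately, if you are reading in Jupyter, you can install pandas without even leaving the notebook.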
-
-```{.python .input}
-# If pandas is not installed, please uncomment the following line:
-# !pip install pandas
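+Note that the competition data is separated into training and test sets. Each record includes the property value of the house and attributes such as street type, year of construction, roof type, and basement condition. The features consist of various data types. For example, the year of construction is represented by an integer, the roof type by discrete categorical assignments, and other features by floating-point numbers. And here is where reality complicates things: for some examples, some data is altogether missing, with the missing value marked simply as "na". The price of each house is included for the training set only (it is a competition after all). We will want to partition the training set to create a validation set, but we only get to evaluate our models on the official test set after uploading predictions to Kaggle. The "Data" tab on the competition page in :numref:`fig_house_pricing` has links to download the data.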
 
+```{.python .input  n=14}
+%%tab mxnet
 %matplotlib inline
 from d2l import mxnet as d2l
 from mxnet import gluon, autograd, init, np, npx
@@ -107,11 +54,8 @@ import pandas as pd
 npx.set_np()
 ```
 
-```{.python .input}
-#@tab pytorch
-# If pandas is not installed, please uncomment the following line:
-# !pip install pandas
-
+```{.python .input  n=4}
+%%tab pytorch
 %matplotlib inline
 from d2l import torch as d2l
 import torch
@@ -121,10 +65,7 @@ import numpy as np
 ```
 
 ```{.python .input}
-#@tab tensorflow
-# If pandas is not installed, please uncomment the following line:
-# !pip install pandas
-
+%%tab tensorflow
 %matplotlib inline
 from d2l import tensorflow as d2l
 import tensorflow as tf
@@ -132,345 +73,192 @@ import pandas as pd
 import numpy as np
 ```
 
-For convenience, we can download and cache the Kaggle housing dataset using the script we defined above.
-
-```{.python .input}
-#@tab all
-DATA_HUB['kaggle_house_train'] = (  #@save
-    DATA_URL + 'kaggle_house_pred_train.csv',
-    '585e9cc93e70b39160e7921475f9bcd7d31219ce')
-
-DATA_HUB['kaggle_house_test'] = (  #@save
-    DATA_URL + 'kaggle_house_pred_test.csv',
-    'fa19780a7b011d9b009e8bff8e99922a8ee2eb90')
-```
-
-We use `pandas` to load the two csv files containing the training and test data, respectively.
-
-```{.python .input}
-#@tab all
-train_data = pd.read_csv(download('kaggle_house_train'))
-test_data = pd.read_csv(download('kaggle_house_test'))
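+To get started, we will [**read in and process the data using `pandas`**], which we introduced in :numref:`sec_pandas`. For convenience, we can download and cache the Kaggle housing dataset. If a file corresponding to this dataset already exists in the cache directory and its SHA-1 matches `sha1_hash`, our code will use the cached file to avoid clogging up your internet with redundant downloads.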
+
+```{.python .input  n=30}
+%%tab all
+class KaggleHouse(d2l.DataModule):
+    def __init__(self, batch_size, train=None, val=None):
+        super().__init__()
+        self.save_hyperparameters()
+        if self.train is None:
+            self.raw_train = pd.read_csv(d2l.download(
+                d2l.DATA_URL + 'kaggle_house_pred_train.csv', self.root,
+                sha1_hash='585e9cc93e70b39160e7921475f9bcd7d31219ce'))
+            self.raw_val = pd.read_csv(d2l.download(
+                d2l.DATA_URL + 'kaggle_house_pred_test.csv', self.root,
+                sha1_hash='fa19780a7b011d9b009e8bff8e99922a8ee2eb90'))
 ```
 
-The training dataset includes 1460 examples, 80 features, and 1 label, while the test data contains 1459 examples and 80 features.
+The training dataset includes 1460 examples, 80 features, and 1 label, while the validation data contains 1459 examples and 80 features.
 
-```{.python .input}
-#@tab all
-print(train_data.shape)
-print(test_data.shape)
-```
-
-Let's [**take a look at the first four and last two features as well as the label (SalePrice)**] from the first four examples.
-
-```{.python .input}
-#@tab all
-print(train_data.iloc[0:4, [0, 1, 2, 3, -3, -2, -1]])
-```
-
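-We can see that in each example, (**the first feature is the ID.**) This helps the model identify each training example. While this is convenient, it does not carry any information for prediction purposes. Hence, (**we remove it from the dataset**) before feeding the data into the model.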
-
-```{.python .input}
-#@tab all
-all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))
+```{.python .input  n=31}
+%%tab all
+data = KaggleHouse(batch_size=64)
+print(data.raw_train.shape)
+print(data.raw_val.shape)
 ```
 
 ## Data Preprocessing
 
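-As stated above, we have a wide variety of data types. We will need to preprocess the data before we can start modeling. Let's start with the numerical features. First, we apply a heuristic, [**replacing all missing values by the corresponding feature's mean.**] Then, to put all features on a common scale, we (***standardize* the data by rescaling features to zero mean and unit variance**): 
+Let's [**take a look at the first four and last two features as well as the label (SalePrice)**] from the first four examples.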
 
-$$x \leftarrow \frac{x - \mu}{\sigma},$$
-
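-where $\mu$ and $\sigma$ denote the mean and standard deviation, respectively. To verify that this indeed transforms our feature (variable) such that it has zero mean and unit variance, note that $E[\frac{x-\mu}{\sigma}] = \frac{\mu - \mu}{\sigma} = 0$ and that $E[(x-\mu)^2] = (\sigma^2 + \mu^2) - 2\mu^2+\mu^2 = \sigma^2$. Intuitively, we standardize the data for two reasons. First, it proves convenient for optimization. Second, because we do not know *a priori* which features will be relevant, we do not want to penalize coefficients assigned to one feature more than those assigned to any other.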
-
-```{.python .input}
-#@tab all
-# If test data were inaccessible, mean and standard deviation could be 
-# calculated from training data
-numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
-all_features[numeric_features] = all_features[numeric_features].apply(
-    lambda x: (x - x.mean()) / (x.std()))
-# After standardizing the data all means vanish, hence we can set missing
-# values to 0
-all_features[numeric_features] = all_features[numeric_features].fillna(0)
+```{.python .input  n=10}
+%%tab all
+print(data.raw_train.iloc[:4, [0, 1, 2, 3, -3, -2, -1]])
 ```
 
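-[**Next we deal with discrete values.**] This includes features such as "MSZoning". (**We replace them by a one-hot encoding**) in the same way that we previously transformed multiclass labels into vectors (see :numref:`subsec_classification-problem`). For instance, "MSZoning" assumes the values "RL" and "RM". Dropping the "MSZoning" feature, two new indicator features "MSZoning_RL" and "MSZoning_RM" are created with values being either 0 or 1. According to one-hot encoding, if the original value of "MSZoning" is "RL", then "MSZoning_RL" is 1 and "MSZoning_RM" is 0. The `pandas` package does this automatically for us.
+We can see that in each example, the first feature is the ID. This helps the model identify each training example. While this is convenient, it does not carry any information for prediction purposes. Hence, we will remove it from the dataset before feeding the data into the model. Besides, given the wide variety of data types, we will need to preprocess the data before we can start modeling. 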
 
-```{.python .input}
-#@tab all
-# `Dummy_na=True` considers "na" (missing value) as a valid feature value, and
-# creates an indicator feature for it
-all_features = pd.get_dummies(all_features, dummy_na=True)
-all_features.shape
-```
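+Let's start with the numerical features. First, we apply a heuristic, [**replacing all missing values by the corresponding feature's mean.**] Then, to put all features on a common scale, we (***standardize* the data by rescaling features to zero mean and unit variance**): 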
 
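-We can see that this conversion increases the number of features from 79 to 331. Finally, via the `values` attribute, we can [**extract the NumPy format from the `pandas` format and convert it into the tensor**] representation for training.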
+$$x \leftarrow \frac{x - \mu}{\sigma},$$
 
-```{.python .input}
-#@tab all
-n_train = train_data.shape[0]
-train_features = d2l.tensor(all_features[:n_train].values, dtype=d2l.float32)
-test_features = d2l.tensor(all_features[n_train:].values, dtype=d2l.float32)
-train_labels = d2l.tensor(
-    train_data.SalePrice.values.reshape(-1, 1), dtype=d2l.float32)
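+where $\mu$ and $\sigma$ denote the mean and standard deviation, respectively. To verify that this indeed transforms our feature (variable) such that it has zero mean and unit variance, note that $E[\frac{x-\mu}{\sigma}] = \frac{\mu - \mu}{\sigma} = 0$ and that $E[(x-\mu)^2] = (\sigma^2 + \mu^2) - 2\mu^2+\mu^2 = \sigma^2$. Intuitively, we standardize the data for two reasons. First, it proves convenient for optimization. Second, because we do not know *a priori* which features will be relevant, we do not want to penalize coefficients assigned to one feature more than those assigned to any other. 
+
+[**Next we deal with discrete values.**] This includes features such as "MSZoning". (**We replace them by a one-hot encoding**) in the same way that we previously transformed multiclass labels into vectors (see :numref:`subsec_classification-problem`). For instance, "MSZoning" assumes the values "RL" and "RM". Dropping the "MSZoning" feature, two new indicator features "MSZoning_RL" and "MSZoning_RM" are created with values being either 0 or 1. According to one-hot encoding, if the original value of "MSZoning" is "RL", then "MSZoning_RL" is 1 and "MSZoning_RM" is 0. The `pandas` package does this automatically for us.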
+
+```{.python .input  n=32}
+%%tab all
+@d2l.add_to_class(KaggleHouse)
+def preprocess(self):
+    # Remove the ID and label columns
+    label = 'SalePrice'
+    features = pd.concat(
+        (self.raw_train.drop(columns=['Id', label]),
+         self.raw_val.drop(columns=['Id'])))
+    # Standardize numerical columns
+    numeric_features = features.dtypes[features.dtypes != 'object'].index
+    features[numeric_features] = features[numeric_features].apply(
+        lambda x: (x - x.mean()) / (x.std()))
+    # Replace NAN numerical features by 0
+    features[numeric_features] = features[numeric_features].fillna(0)
+    # Replace discrete features by one-hot encoding.
+    features = pd.get_dummies(features, dummy_na=True)
+    # Save preprocessed features
+    self.train = features[:self.raw_train.shape[0]].copy()
+    self.train[label] = self.raw_train[label]
+    self.val = features[self.raw_train.shape[0]:].copy()
 ```
 
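-## [**Training**]
-
-To get started, we train a linear model with squared loss. Not surprisingly, our linear model will not lead to a competition-winning submission, but it provides a sanity check to see whether there is meaningful information in the data. If we cannot do better than random guessing here, then there is a good chance that we have a data processing bug. And if things work, the linear model will serve as a baseline, giving us some intuition about how close the simple model gets to the best reported models and a sense of how much gain we should expect from fancier models.
+We can see that this conversion increases the number of features from 79 to 331 (excluding the ID and label columns).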
 
-```{.python .input}
-loss = gluon.loss.L2Loss()
-
-def get_net():
-    net = nn.Sequential()
-    net.add(nn.Dense(1))
-    net.initialize()
-    return net
+```{.python .input  n=33}
+%%tab all
+data.preprocess()
+data.train.shape
 ```
 
-```{.python .input}
-#@tab pytorch
-loss = nn.MSELoss()
-in_features = train_features.shape[1]
+## Error Measure
 
-def get_net():
-    net = nn.Sequential(nn.Linear(in_features,1))
-    return net
-```
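+To get started, we will train a linear model with squared loss. Not surprisingly, our linear model will not lead to a competition-winning submission, but it provides a sanity check to see whether there is meaningful information in the data. If we cannot do better than random guessing here, then there is a good chance that we have a data processing bug. And if things work, the linear model will serve as a baseline, giving us some intuition about how close the simple model gets to the best reported models and a sense of how much gain we should expect from fancier models. 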
 
-```{.python .input}
-#@tab tensorflow
-loss = tf.keras.losses.MeanSquaredError()
-
-def get_net():
-    net = tf.keras.models.Sequential()
-    net.add(tf.keras.layers.Dense(
-        1, kernel_regularizer=tf.keras.regularizers.l2(weight_decay)))
-    return net
-```
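+With house prices, as with stock prices, we care about relative quantities more than absolute quantities. Thus [**we tend to care more about the relative error $\frac{y - \hat{y}}{y}$ than about the absolute error $y - \hat{y}$.**] For instance, if our prediction is off by USD 100,000 when estimating the price of a house in rural Ohio, where the value of a typical house is USD 125,000, then we are probably doing a horrible job. On the other hand, if we err by this amount in Los Altos Hills, California, this might represent a stunningly accurate prediction (there, the median house price exceeds USD 4 million). 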
 
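-With house prices, as with stock prices, we care about relative quantities more than absolute quantities. Thus [**we tend to care more about the relative error $\frac{y - \hat{y}}{y}$ than about the absolute error $y - \hat{y}$.**] For instance, if our prediction is off by USD 100,000 when estimating the price of a house in rural Ohio, where the value of a typical house is USD 125,000, then we are probably doing a horrible job. On the other hand, if we err by this amount in Los Altos Hills, California, this might represent a stunningly accurate prediction (there, the median house price exceeds USD 4 million). 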
-
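-(**One way to address this problem is to measure the discrepancy in the logarithm of the price estimates.**) In fact, this is also the official error measure used by the competition to evaluate the quality of submissions. After all, a small value $\delta$ for $|\log y - \log \hat{y}| \leq \delta$ translates into $e^{-\delta} \leq \frac{\hat{y}}{y} \leq e^\delta$. This leads to the following root-mean-squared error between the logarithm of the predicted price and the logarithm of the label price: 
+(**One way to address this problem is to measure the discrepancy in the logarithm of the price estimates.**) In fact, this is also the official error measure used by the competition to evaluate the quality of submissions. After all, a small value $\delta$ for $|\log y - \log \hat{y}| \leq \delta$ translates into $e^{-\delta} \leq \frac{\hat{y}}{y} \leq e^\delta$. This leads to the following root-mean-squared error between the logarithm of the predicted price and the logarithm of the label price: 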
 
 $$\sqrt{\frac{1}{n}\sum_{i=1}^n\left(\log y_i -\log \hat{y}_i\right)^2}.$$
 
-```{.python .input}
-def log_rmse(net, features, labels):
-    # To further stabilize the value when the logarithm is taken, set the
-    # value less than 1 as 1
-    clipped_preds = np.clip(net(features), 1, float('inf'))
-    return np.sqrt(2 * loss(np.log(clipped_preds), np.log(labels)).mean())
-```
-
-```{.python .input}
-#@tab pytorch
-def log_rmse(net, features, labels):
-    # To further stabilize the value when the logarithm is taken, set the
-    # value less than 1 as 1
-    clipped_preds = torch.clamp(net(features), 1, float('inf'))
-    rmse = torch.sqrt(loss(torch.log(clipped_preds),
-                           torch.log(labels)))
-    return rmse.item()
-```
-
-```{.python .input}
-#@tab tensorflow
-def log_rmse(y_true, y_pred):
-    # To further stabilize the value when the logarithm is taken, set the
-    # value less than 1 as 1
-    clipped_preds = tf.clip_by_value(y_pred, 1, float('inf'))
-    return tf.sqrt(tf.reduce_mean(loss(
-        tf.math.log(y_true), tf.math.log(clipped_preds))))
-```
-
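-Unlike in previous sections, [**our training functions will rely on the Adam optimizer (we will describe it in greater detail later)**]. The main appeal of this optimizer is that, despite doing no better (and sometimes worse) given unlimited resources for hyperparameter optimization, people tend to find that it is significantly less sensitive to the initial learning rate.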
-
-```{.python .input}
-def train(net, train_features, train_labels, test_features, test_labels,
-          num_epochs, learning_rate, weight_decay, batch_size):
-    train_ls, test_ls = [], []
-    train_iter = d2l.load_array((train_features, train_labels), batch_size)
-    # The Adam optimization algorithm is used here
-    trainer = gluon.Trainer(net.collect_params(), 'adam', {
-        'learning_rate': learning_rate, 'wd': weight_decay})
-    for epoch in range(num_epochs):
-        for X, y in train_iter:
-            with autograd.record():
-                l = loss(net(X), y)
-            l.backward()
-            trainer.step(batch_size)
-        train_ls.append(log_rmse(net, train_features, train_labels))
-        if test_labels is not None:
-            test_ls.append(log_rmse(net, test_features, test_labels))
-    return train_ls, test_ls
-```
-
-```{.python .input}
-#@tab pytorch
-def train(net, train_features, train_labels, test_features, test_labels,
-          num_epochs, learning_rate, weight_decay, batch_size):
-    train_ls, test_ls = [], []
-    train_iter = d2l.load_array((train_features, train_labels), batch_size)
-    # The Adam optimization algorithm is used here
-    optimizer = torch.optim.Adam(net.parameters(),
-                                 lr = learning_rate,
-                                 weight_decay = weight_decay)
-    for epoch in range(num_epochs):
-        for X, y in train_iter:
-            optimizer.zero_grad()
-            l = loss(net(X), y)
-            l.backward()
-            optimizer.step()
-        train_ls.append(log_rmse(net, train_features, train_labels))
-        if test_labels is not None:
-            test_ls.append(log_rmse(net, test_features, test_labels))
-    return train_ls, test_ls
-```
-
-```{.python .input}
-#@tab tensorflow
-def train(net, train_features, train_labels, test_features, test_labels,
-          num_epochs, learning_rate, weight_decay, batch_size):
-    train_ls, test_ls = [], []
-    train_iter = d2l.load_array((train_features, train_labels), batch_size)
-    # The Adam optimization algorithm is used here
-    optimizer = tf.keras.optimizers.Adam(learning_rate)
-    net.compile(loss=loss, optimizer=optimizer)
-    for epoch in range(num_epochs):
-        for X, y in train_iter:
-            with tf.GradientTape() as tape:
-                y_hat = net(X)
-                l = loss(y, y_hat)
-            params = net.trainable_variables
-            grads = tape.gradient(l, params)
-            optimizer.apply_gradients(zip(grads, params))
-        train_ls.append(log_rmse(train_labels, net(train_features)))
-        if test_labels is not None:
-            test_ls.append(log_rmse(test_labels, net(test_features)))
-    return train_ls, test_ls
+```{.python .input  n=60}
+%%tab all
+@d2l.add_to_class(KaggleHouse)
+def get_dataloader(self, train):
+    label = 'SalePrice'
+    data = self.train if train else self.val
+    if label not in data: return
+    get_tensor = lambda x: d2l.tensor(x.values, dtype=d2l.float32)
+    # Logarithm of prices
+    tensors = (get_tensor(data.drop(columns=[label])),  # X
+               d2l.reshape(d2l.log(get_tensor(data[label])), (-1, 1)))  # Y
+    return self.get_tensorloader(tensors, train)
 ```
 
 ## $K$-Fold Cross-Validation
 
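-You might recall that we introduced [**$K$-fold cross-validation**] in the section where we discussed how to deal with model selection (:numref:`sec_model_selection`). We will put this to good use to select the model design and to adjust the hyperparameters. We first need a function that returns the $i^\mathrm{th}$ fold of the data in a $K$-fold cross-validation procedure. It proceeds by slicing out the $i^\mathrm{th}$ segment as validation data and returning the rest as training data. Note that this is not the most efficient way of handling data, and we would definitely do something much smarter if our dataset were considerably larger. But this added complexity might obfuscate our code unnecessarily, so we can safely omit it here owing to the simplicity of our problem.
+You might recall that we introduced [**cross-validation**] in :numref:`subsec_generalization-model-selection`, where we discussed how to deal with model selection. We will put this to good use to select the model design and to adjust the hyperparameters. We first need a function that returns the $i^\mathrm{th}$ fold of the data in a $K$-fold cross-validation procedure. It proceeds by slicing out the $i^\mathrm{th}$ segment as validation data and returning the rest as training data. Note that this is not the most efficient way of handling data, and we would definitely do something much smarter if our dataset were considerably larger. But this added complexity might obfuscate our code unnecessarily, so we can safely omit it here owing to the simplicity of our problem.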
 
 ```{.python .input}
-#@tab all
-def get_k_fold_data(k, i, X, y):
-    assert k > 1
-    fold_size = X.shape[0] // k
-    X_train, y_train = None, None
+%%tab all
+def k_fold_data(data, k):
+    rets = []
+    fold_size = data.train.shape[0] // k
     for j in range(k):
-        idx = slice(j * fold_size, (j + 1) * fold_size)
-        X_part, y_part = X[idx, :], y[idx]
-        if j == i:
-            X_valid, y_valid = X_part, y_part
-        elif X_train is None:
-            X_train, y_train = X_part, y_part
-        else:
-            X_train = d2l.concat([X_train, X_part], 0)
-            y_train = d2l.concat([y_train, y_part], 0)
-    return X_train, y_train, X_valid, y_valid
+        idx = range(j * fold_size, (j + 1) * fold_size)
+        rets.append(KaggleHouse(data.batch_size, data.train.drop(index=idx),
+                                data.train.loc[idx]))
+    return rets
 ```
 
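-[**The averages of the training and validation errors are returned**] when we train $K$ times in the $K$-fold cross-validation.
+[**The average validation error is returned**] when we train $K$ times in the $K$-fold cross-validation.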
 
 ```{.python .input}
-#@tab all
-def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay,
-           batch_size):
-    train_l_sum, valid_l_sum = 0, 0
-    for i in range(k):
-        data = get_k_fold_data(k, i, X_train, y_train)
-        net = get_net()
-        train_ls, valid_ls = train(net, *data, num_epochs, learning_rate,
-                                   weight_decay, batch_size)
-        train_l_sum += train_ls[-1]
-        valid_l_sum += valid_ls[-1]
-        if i == 0:
-            d2l.plot(list(range(1, num_epochs + 1)), [train_ls, valid_ls],
-                     xlabel='epoch', ylabel='rmse', xlim=[1, num_epochs],
-                     legend=['train', 'valid'], yscale='log')
-        print(f'fold {i + 1}, train log rmse {float(train_ls[-1]):f}, '
-              f'valid log rmse {float(valid_ls[-1]):f}')
-    return train_l_sum / k, valid_l_sum / k
+%%tab all
+def k_fold(trainer, data, k, lr):
+    val_loss, models = [], []
+    for i, data_fold in enumerate(k_fold_data(data, k)):
+        model = d2l.LinearRegression(lr)
+        model.board.yscale = 'log'
+        if i != 0: model.board.display = False
+        trainer.fit(model, data_fold)
+        val_loss.append(float(model.board.data['val_loss'][-1].y))
+        models.append(model)
+    print(f'average validation log mse = {sum(val_loss)/len(val_loss)}')
+    return models
 ```
 
 ## [**Model Selection**]
 
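-In this example, we pick an untuned set of hyperparameters and leave it up to the reader to improve the model. Finding a good choice can take time, depending on how many variables one optimizes over. With a large enough dataset, and the normal sorts of hyperparameters, $K$-fold cross-validation tends to be reasonably resilient against multiple testing. However, if we try an unreasonably large number of options we might just get lucky and find that our validation performance is no longer representative of the true error.
+In this example, we pick an untuned set of hyperparameters and leave it up to the reader to improve the model. Finding a good choice can take time, depending on how many variables one optimizes over. With a large enough dataset, and the normal sorts of hyperparameters, $K$-fold cross-validation tends to be reasonably resilient against multiple testing. However, if we try an unreasonably large number of options we might just get lucky and find that our validation performance is no longer representative of the true error.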
 
 ```{.python .input}
-#@tab all
-k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
-train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr,
-                          weight_decay, batch_size)
-print(f'{k}-fold validation: avg train log rmse: {float(train_l):f}, '
-      f'avg valid log rmse: {float(valid_l):f}')
+%%tab all
+trainer = d2l.Trainer(max_epochs=10)
+models = k_fold(trainer, data, k=5, lr=0.01)
 ```
 
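-Notice that sometimes the training error for a set of hyperparameters can be very low, even as the error on $K$-fold cross-validation is considerably higher. This indicates that we are overfitting. Throughout training you will want to monitor both numbers. Less overfitting might indicate that our data can support a more powerful model. Massive overfitting might suggest that we can gain by incorporating regularization techniques. 
+Notice that sometimes the training error for a set of hyperparameters can be very low, even as the error on $K$-fold cross-validation is considerably higher. This indicates that we are overfitting. Throughout training you will want to monitor both numbers. Less overfitting might indicate that our data can support a more powerful model. Massive overfitting might suggest that we can gain by incorporating regularization techniques. 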
 
 ##  [**Submitting Predictions on Kaggle**]
 
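-Now that we know what a good choice of hyperparameters should be, we might as well use all the data to train on it (rather than just the $1-1/K$ of the data that is used in the cross-validation slices). The model that we obtain in this way can then be applied to the test set. Saving the predictions in a csv file will simplify uploading the results to Kaggle.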
-
-```{.python .input}
-#@tab all
-def train_and_pred(train_features, test_feature, train_labels, test_data,
-                   num_epochs, lr, weight_decay, batch_size):
-    net = get_net()
-    train_ls, _ = train(net, train_features, train_labels, None, None,
-                        num_epochs, lr, weight_decay, batch_size)
-    d2l.plot(np.arange(1, num_epochs + 1), [train_ls], xlabel='epoch',
-             ylabel='log rmse', xlim=[1, num_epochs], yscale='log')
-    print(f'train log rmse {float(train_ls[-1]):f}')
-    # Apply the network to the test set
-    preds = d2l.numpy(net(test_features))
-    # Reformat it to export to Kaggle
-    test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0])
-    submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
-    submission.to_csv('submission.csv', index=False)
-```
-
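-One nice sanity check is to see whether the predictions on the test set resemble those of the $K$-fold cross-validation process. If they do, it is time to upload them to Kaggle. The following code will generate a file called `submission.csv`.
+Now that we know what a good choice of hyperparameters should be, we can calculate the average of the predictions on the test set made by all $K$ models. Saving the predictions in a csv file will simplify uploading the results to Kaggle. The following code will generate a file called `submission.csv`.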
 
 ```{.python .input}
-#@tab all
-train_and_pred(train_features, test_features, train_labels, test_data,
-               num_epochs, lr, weight_decay, batch_size)
+%%tab all
+preds = [model(d2l.tensor(data.val.values, dtype=d2l.float32))
+         for model in models]
+# Taking exponentiation of predictions in the logarithm scale
+ensemble_preds = d2l.reduce_mean(d2l.exp(d2l.concat(preds, 1)), 1)
+submission = pd.DataFrame({'Id':data.raw_val.Id,
+                           'SalePrice':d2l.numpy(ensemble_preds)})
+submission.to_csv('submission.csv', index=False)
 ```
 
-次に、:numref:`fig_kaggle_submit2` で示したように、Kaggle に関する予測を送信し、テストセットの実際の住宅価格 (ラベル) とどのように比較されるかを確認できます。手順は非常に簡単です。 
+次に、:numref:`fig_kaggle_submit2`で示されているように、Kaggleで予測を送信し、テストセットの実際の住宅価格（ラベル）とどのように比較されるかを確認できます。手順は非常に簡単です。 
 
-* Kaggleのウェブサイトにログインし、住宅価格予測コンペティションページにアクセスしてください。
-* 「予測を送信」または「提出遅延」ボタンをクリックします（この記事の執筆時点では、このボタンは右側にあります）。
-* ページ下部の破線ボックスにある [提出ファイルをアップロード] ボタンをクリックし、アップロードする予測ファイルを選択します。
-* ページ下部にある「提出する」ボタンをクリックすると、結果が表示されます。
+* KaggleのWebサイトにログインし、住宅価格予測コンペのページにアクセスします。
+* 「Submit Predictions」または「Late Submission」ボタンをクリックします（この記事を書いている時点で、ボタンは右側にあります）。
+* ページ下部の破線ボックスにある「提出ファイルのアップロード」ボタンをクリックし、アップロードする予測ファイルを選択します。
+* ページの下部にある「提出する」ボタンをクリックして結果を表示します。
 
 ![Submitting data to Kaggle](../img/kaggle-submit2.png)
 :width:`400px`
 :label:`fig_kaggle_submit2`
 
-## [概要
+## まとめ
 
-* 実データにはさまざまなデータ型が混在していることが多く、前処理が必要です。
-* 実数値データをゼロ平均と単位分散に再スケーリングするのが適切な既定値です。欠損値をその平均値に置き換えることもそうです。
-* カテゴリカル特徴量をインジケーター特徴量に変換すると、ワンホットベクトルのように扱うことができます。
+* 実際のデータにはさまざまなデータ型が混在していることが多く、前処理が必要です。
+* 実数値データをゼロ平均と単位分散に再スケーリングするのが適切なデフォルトです。欠損値をその平均値に置き換えることも同様です。
+* カテゴリカル特徴を指標特徴に変換することで、それらをワンホットベクトルのように扱うことができます。
 * $K$ 分割交差検証を使用してモデルを選択し、ハイパーパラメーターを調整できます。
-* 対数は相対誤差に便利です。
+* 対数は相対誤差に役立ちます。
 
 ## 演習
 
 1. このセクションの予測を Kaggle に送信してください。あなたの予測はどれくらい良いですか？
-1. 価格の対数を直接最小化してモデルを改善できますか？価格ではなく価格の対数を予測しようとするとどうなりますか？
-1. 欠損値を平均値で置き換えるのは常に良い考えですか？ヒント:値がランダムに欠落しない状況を構築できますか?
+1. 欠損値をその平均値で置き換えるのは常に良い考えですか？ヒント:値がランダムに欠落していない状況を構築できますか?
 1. $K$ 分割交差検証によってハイパーパラメーターを調整して Kaggle のスコアを改善します。
-1. モデル (レイヤー、ウェイト減衰、ドロップアウトなど) を改善してスコアを向上させます。
-1. このセクションで行ったように連続的な数値特徴を標準化しないとどうなりますか？
+1. モデル (レイヤー、重み減衰、ドロップアウトなど) を改善してスコアを向上させます。
+1. このセクションで行ったような連続数値特徴を標準化しないとどうなりますか？
 
 :begin_tab:`mxnet`
 [Discussions](https://discuss.d2l.ai/t/106)
diff --git a/chapter_multilayer-perceptrons/kaggle-house-price_origin.md b/chapter_multilayer-perceptrons/kaggle-house-price_origin.md
index ebc44de..82c890b 100644
--- a/chapter_multilayer-perceptrons/kaggle-house-price_origin.md
+++ b/chapter_multilayer-perceptrons/kaggle-house-price_origin.md
@@ -1,3 +1,8 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
 # Predicting House Prices on Kaggle
 :label:`sec_kaggle_house`
 
@@ -9,7 +14,7 @@ we are ready to put all this knowledge into practice
 by participating in a Kaggle competition.
 The house price prediction competition
 is a great place to start.
-The data are fairly generic and do not exhibit exotic structure
+The data is fairly generic and does not exhibit exotic structure
 that might require specialized models (as audio or video might).
 This dataset, collected by Bart de Cock in 2011 :cite:`De-Cock.2011`,
 covers house prices in Ames, IA from the period of 2006--2010.
@@ -24,90 +29,22 @@ you will gain some intuitions that will guide you
 in your career as a data scientist.
 
 
-## Downloading and Caching Datasets
+## Downloading Data
 
 Throughout the book, we will train and test models
 on various downloaded datasets.
-Here, we (**implement several utility functions
-to facilitate data downloading**).
-First, we maintain a dictionary `DATA_HUB`
-that maps a string (the *name* of the dataset)
-to a tuple containing both the URL to locate the dataset
-and the SHA-1 key that verifies the integrity of the file.
-All such datasets are hosted at the site
-whose address is `DATA_URL`.
+Here, we (**implement two utility functions**)
+to download files and extract zip or tar files.
+Again, we defer their implementations to :numref:`sec_utils`.
 
-```{.python .input}
-#@tab all
-import os
-import requests
-import zipfile
-import tarfile
-import hashlib
-
-#@save
-DATA_HUB = dict()
-DATA_URL = 'http://d2l-data.s3-accelerate.amazonaws.com/'
-```
+```{.python .input  n=2}
+%%tab all
 
-The following `download` function downloads a dataset,
-caches it in a local directory (`../data` by default),
-and returns the name of the downloaded file.
-If a file corresponding to this dataset
-already exists in the cache directory
-and its SHA-1 matches the one stored in `DATA_HUB`,
-our code will use the cached file to avoid
-clogging up your internet with redundant downloads.
+def download(url, folder, sha1_hash=None):
+    """Download a file to folder and return the local filepath."""
 
-```{.python .input}
-#@tab all
-def download(name, cache_dir=os.path.join('..', 'data')):  #@save
-    """Download a file inserted into DATA_HUB, return the local filename."""
-    assert name in DATA_HUB, f"{name} does not exist in {DATA_HUB}."
-    url, sha1_hash = DATA_HUB[name]
-    os.makedirs(cache_dir, exist_ok=True)
-    fname = os.path.join(cache_dir, url.split('/')[-1])
-    if os.path.exists(fname):
-        sha1 = hashlib.sha1()
-        with open(fname, 'rb') as f:
-            while True:
-                data = f.read(1048576)
-                if not data:
-                    break
-                sha1.update(data)
-        if sha1.hexdigest() == sha1_hash:
-            return fname  # Hit cache
-    print(f'Downloading {fname} from {url}...')
-    r = requests.get(url, stream=True, verify=True)
-    with open(fname, 'wb') as f:
-        f.write(r.content)
-    return fname
-```
-
-We also implement two additional utility functions:
-one is to download and extract a zip or tar file
-and the other to download all the datasets used in this book from `DATA_HUB` into the cache directory.
-
-```{.python .input}
-#@tab all
-def download_extract(name, folder=None):  #@save
-    """Download and extract a zip/tar file."""
-    fname = download(name)
-    base_dir = os.path.dirname(fname)
-    data_dir, ext = os.path.splitext(fname)
-    if ext == '.zip':
-        fp = zipfile.ZipFile(fname, 'r')
-    elif ext in ('.tar', '.gz'):
-        fp = tarfile.open(fname, 'r')
-    else:
-        assert False, 'Only zip/tar files can be extracted.'
-    fp.extractall(base_dir)
-    return os.path.join(base_dir, folder) if folder else data_dir
-
-def download_all():  #@save
-    """Download all files in the DATA_HUB."""
-    for name in DATA_HUB:
-        download(name)
+def extract(filename, folder):
+    """Extract a zip/tar file into folder."""
 ```
 
 ## Kaggle
@@ -160,7 +97,7 @@ is represented by an integer,
 the roof type by discrete categorical assignments,
 and other features by floating point numbers.
 And here is where reality complicates things:
-for some examples, some data are altogether missing
+for some examples, some data is altogether missing
 with the missing value marked simply as "na".
 The price of each house is included
 for the training set only
@@ -173,18 +110,8 @@ The "Data" tab on the competition tab
 in :numref:`fig_house_pricing`
 has links to download the data.
 
-
-To get started, we will [**read in and process the data
-using `pandas`**], which we have introduced in :numref:`sec_pandas`.
-So, you will want to make sure that you have `pandas` installed
-before proceeding further.
-Fortunately, if you are reading in Jupyter,
-we can install pandas without even leaving the notebook.
-
-```{.python .input}
-# If pandas is not installed, please uncomment the following line:
-# !pip install pandas
-
+```{.python .input  n=14}
+%%tab mxnet
 %matplotlib inline
 from d2l import mxnet as d2l
 from mxnet import gluon, autograd, init, np, npx
@@ -193,11 +120,8 @@ import pandas as pd
 npx.set_np()
 ```
 
-```{.python .input}
-#@tab pytorch
-# If pandas is not installed, please uncomment the following line:
-# !pip install pandas
-
+```{.python .input  n=4}
+%%tab pytorch
 %matplotlib inline
 from d2l import torch as d2l
 import torch
@@ -207,10 +131,7 @@ import numpy as np
 ```
 
 ```{.python .input}
-#@tab tensorflow
-# If pandas is not installed, please uncomment the following line:
-# !pip install pandas
-
+%%tab tensorflow
 %matplotlib inline
 from d2l import tensorflow as d2l
 import tensorflow as tf
@@ -218,64 +139,59 @@ import pandas as pd
 import numpy as np
 ```
 
+To get started, we will [**read in and process the data
+using `pandas`**], which we have introduced in :numref:`sec_pandas`.
 For convenience, we can download and cache
-the Kaggle housing dataset
-using the script we defined above.
-
-```{.python .input}
-#@tab all
-DATA_HUB['kaggle_house_train'] = (  #@save
-    DATA_URL + 'kaggle_house_pred_train.csv',
-    '585e9cc93e70b39160e7921475f9bcd7d31219ce')
-
-DATA_HUB['kaggle_house_test'] = (  #@save
-    DATA_URL + 'kaggle_house_pred_test.csv',
-    'fa19780a7b011d9b009e8bff8e99922a8ee2eb90')
-```
-
-We use `pandas` to load the two csv files containing training and test data respectively.
-
-```{.python .input}
-#@tab all
-train_data = pd.read_csv(download('kaggle_house_train'))
-test_data = pd.read_csv(download('kaggle_house_test'))
+the Kaggle housing dataset.
+If a file corresponding to this dataset already exists in the cache directory and its SHA-1 matches `sha1_hash`, our code will use the cached file to avoid clogging up your internet with redundant downloads.
+
+```{.python .input  n=30}
+%%tab all
+class KaggleHouse(d2l.DataModule):
+    def __init__(self, batch_size, train=None, val=None):
+        super().__init__()
+        self.save_hyperparameters()
+        if self.train is None:
+            self.raw_train = pd.read_csv(d2l.download(
+                d2l.DATA_URL + 'kaggle_house_pred_train.csv', self.root,
+                sha1_hash='585e9cc93e70b39160e7921475f9bcd7d31219ce'))
+            self.raw_val = pd.read_csv(d2l.download(
+                d2l.DATA_URL + 'kaggle_house_pred_test.csv', self.root,
+                sha1_hash='fa19780a7b011d9b009e8bff8e99922a8ee2eb90'))
 ```
 
 The training dataset includes 1460 examples,
-80 features, and 1 label, while the test data
+80 features, and 1 label, while the validation data
 contains 1459 examples and 80 features.
 
-```{.python .input}
-#@tab all
-print(train_data.shape)
-print(test_data.shape)
+```{.python .input  n=31}
+%%tab all
+data = KaggleHouse(batch_size=64)
+print(data.raw_train.shape)
+print(data.raw_val.shape)
 ```
 
-Let us [**take a look at the first four and last two features
+## Data Preprocessing
+
+Let's [**take a look at the first four and last two features
 as well as the label (SalePrice)**] from the first four examples.
 
-```{.python .input}
-#@tab all
-print(train_data.iloc[0:4, [0, 1, 2, 3, -3, -2, -1]])
+```{.python .input  n=10}
+%%tab all
+print(data.raw_train.iloc[:4, [0, 1, 2, 3, -3, -2, -1]])
 ```
 
-We can see that in each example, (**the first feature is the ID.**)
+We can see that in each example, the first feature is the ID.
 This helps the model identify each training example.
 While this is convenient, it does not carry
 any information for prediction purposes.
-Hence, (**we remove it from the dataset**)
+Hence, we will remove it from the dataset
 before feeding the data into the model.
+Besides, given a wide variety of data types,
+we will need to preprocess the data before we can start modeling.
 
-```{.python .input}
-#@tab all
-all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))
-```
-
-## Data Preprocessing
 
-As stated above, we have a wide variety of data types.
-We will need to preprocess the data before we can start modeling.
-Let us start with the numerical features.
+Let's start with the numerical features.
 First, we apply a heuristic,
 [**replacing all missing values
 by the corresponding feature's mean.**]
@@ -298,18 +214,6 @@ which features will be relevant,
 we do not want to penalize coefficients
 assigned to one feature more than on any other.
 
-```{.python .input}
-#@tab all
-# If test data were inaccessible, mean and standard deviation could be 
-# calculated from training data
-numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
-all_features[numeric_features] = all_features[numeric_features].apply(
-    lambda x: (x - x.mean()) / (x.std()))
-# After standardizing the data all means vanish, hence we can set missing
-# values to 0
-all_features[numeric_features] = all_features[numeric_features].fillna(0)
-```
-
 [**Next we deal with discrete values.**]
 This includes features such as "MSZoning".
 (**We replace them by a one-hot encoding**)
@@ -324,75 +228,41 @@ if the original value of "MSZoning" is "RL",
 then "MSZoning_RL" is 1 and "MSZoning_RM" is 0.
 The `pandas` package does this automatically for us.
 
-```{.python .input}
-#@tab all
-# `Dummy_na=True` considers "na" (missing value) as a valid feature value, and
-# creates an indicator feature for it
-all_features = pd.get_dummies(all_features, dummy_na=True)
-all_features.shape
+```{.python .input  n=32}
+%%tab all
+@d2l.add_to_class(KaggleHouse)
+def preprocess(self):
+    # Remove the ID and label columns
+    label = 'SalePrice'
+    features = pd.concat(
+        (self.raw_train.drop(columns=['Id', label]),
+         self.raw_val.drop(columns=['Id'])))
+    # Standardize numerical columns
+    numeric_features = features.dtypes[features.dtypes != 'object'].index
+    features[numeric_features] = features[numeric_features].apply(
+        lambda x: (x - x.mean()) / (x.std()))
+    # Replace NAN numerical features by 0
+    features[numeric_features] = features[numeric_features].fillna(0)
+    # Replace discrete features by one-hot encoding.
+    features = pd.get_dummies(features, dummy_na=True)
+    # Save preprocessed features
+    self.train = features[:self.raw_train.shape[0]].copy()
+    self.train[label] = self.raw_train[label]
+    self.val = features[self.raw_train.shape[0]:].copy()
 ```
 
 You can see that this conversion increases
-the number of features from 79 to 331.
-Finally, via the `values` attribute,
-we can [**extract the NumPy format from the `pandas` format
-and convert it into the tensor**]
-representation for training.
-
-```{.python .input}
-#@tab all
-n_train = train_data.shape[0]
-train_features = d2l.tensor(all_features[:n_train].values, dtype=d2l.float32)
-test_features = d2l.tensor(all_features[n_train:].values, dtype=d2l.float32)
-train_labels = d2l.tensor(
-    train_data.SalePrice.values.reshape(-1, 1), dtype=d2l.float32)
-```
-
-## [**Training**]
-
-To get started we train a linear model with squared loss.
-Not surprisingly, our linear model will not lead
-to a competition-winning submission
-but it provides a sanity check to see whether
-there is meaningful information in the data.
-If we cannot do better than random guessing here,
-then there might be a good chance
-that we have a data processing bug.
-And if things work, the linear model will serve as a baseline
-giving us some intuition about how close the simple model
-gets to the best reported models, giving us a sense
-of how much gain we should expect from fancier models.
-
-```{.python .input}
-loss = gluon.loss.L2Loss()
+the number of features from 79 to 331 (excluding ID and label columns).
 
-def get_net():
-    net = nn.Sequential()
-    net.add(nn.Dense(1))
-    net.initialize()
-    return net
+```{.python .input  n=33}
+%%tab all
+data.preprocess()
+data.train.shape
 ```
 
-```{.python .input}
-#@tab pytorch
-loss = nn.MSELoss()
-in_features = train_features.shape[1]
-
-def get_net():
-    net = nn.Sequential(nn.Linear(in_features,1))
-    return net
-```
+## Error Measure
 
-```{.python .input}
-#@tab tensorflow
-loss = tf.keras.losses.MeanSquaredError()
-
-def get_net():
-    net = tf.keras.models.Sequential()
-    net.add(tf.keras.layers.Dense(
-        1, kernel_regularizer=tf.keras.regularizers.l2(weight_decay)))
-    return net
-```
+To get started we will train a linear model with squared loss. Not surprisingly, our linear model will not lead to a competition-winning submission but it provides a sanity check to see whether there is meaningful information in the data. If we cannot do better than random guessing here, then there might be a good chance that we have a data processing bug. And if things work, the linear model will serve as a baseline giving us some intuition about how close the simple model gets to the best reported models, giving us a sense of how much gain we should expect from fancier models.
 
 With house prices, as with stock prices,
 we care about relative quantities
@@ -419,114 +289,25 @@ This leads to the following root-mean-squared-error between the logarithm of the
 
 $$\sqrt{\frac{1}{n}\sum_{i=1}^n\left(\log y_i -\log \hat{y}_i\right)^2}.$$
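
As a quick check of this formula, the following plain-Python sketch evaluates it on made-up prices. The function name `log_rmse` and the numbers are illustrative assumptions; it is not the framework code used in this section.

```python
import math


def log_rmse(y_true, y_pred):
    """Root-mean-squared error between the logarithms of prices."""
    squared = [(math.log(y) - math.log(y_hat)) ** 2
               for y, y_hat in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# The same relative error (10% too high) on a cheap and an expensive house.
cheap = log_rmse([100_000.0], [110_000.0])
expensive = log_rmse([1_000_000.0], [1_100_000.0])
```

Both calls give the same value, $\log 1.1$, which illustrates why this measure treats errors in relative rather than absolute terms.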
 
-```{.python .input}
-def log_rmse(net, features, labels):
-    # To further stabilize the value when the logarithm is taken, set the
-    # value less than 1 as 1
-    clipped_preds = np.clip(net(features), 1, float('inf'))
-    return np.sqrt(2 * loss(np.log(clipped_preds), np.log(labels)).mean())
-```
-
-```{.python .input}
-#@tab pytorch
-def log_rmse(net, features, labels):
-    # To further stabilize the value when the logarithm is taken, set the
-    # value less than 1 as 1
-    clipped_preds = torch.clamp(net(features), 1, float('inf'))
-    rmse = torch.sqrt(loss(torch.log(clipped_preds),
-                           torch.log(labels)))
-    return rmse.item()
-```
-
-```{.python .input}
-#@tab tensorflow
-def log_rmse(y_true, y_pred):
-    # To further stabilize the value when the logarithm is taken, set the
-    # value less than 1 as 1
-    clipped_preds = tf.clip_by_value(y_pred, 1, float('inf'))
-    return tf.sqrt(tf.reduce_mean(loss(
-        tf.math.log(y_true), tf.math.log(clipped_preds))))
-```
-
-Unlike in previous sections, [**our training functions
-will rely on the Adam optimizer
-(we will describe it in greater detail later)**].
-The main appeal of this optimizer is that,
-despite doing no better (and sometimes worse)
-given unlimited resources for hyperparameter optimization,
-people tend to find that it is significantly less sensitive
-to the initial learning rate.
-
-```{.python .input}
-def train(net, train_features, train_labels, test_features, test_labels,
-          num_epochs, learning_rate, weight_decay, batch_size):
-    train_ls, test_ls = [], []
-    train_iter = d2l.load_array((train_features, train_labels), batch_size)
-    # The Adam optimization algorithm is used here
-    trainer = gluon.Trainer(net.collect_params(), 'adam', {
-        'learning_rate': learning_rate, 'wd': weight_decay})
-    for epoch in range(num_epochs):
-        for X, y in train_iter:
-            with autograd.record():
-                l = loss(net(X), y)
-            l.backward()
-            trainer.step(batch_size)
-        train_ls.append(log_rmse(net, train_features, train_labels))
-        if test_labels is not None:
-            test_ls.append(log_rmse(net, test_features, test_labels))
-    return train_ls, test_ls
-```
-
-```{.python .input}
-#@tab pytorch
-def train(net, train_features, train_labels, test_features, test_labels,
-          num_epochs, learning_rate, weight_decay, batch_size):
-    train_ls, test_ls = [], []
-    train_iter = d2l.load_array((train_features, train_labels), batch_size)
-    # The Adam optimization algorithm is used here
-    optimizer = torch.optim.Adam(net.parameters(),
-                                 lr = learning_rate,
-                                 weight_decay = weight_decay)
-    for epoch in range(num_epochs):
-        for X, y in train_iter:
-            optimizer.zero_grad()
-            l = loss(net(X), y)
-            l.backward()
-            optimizer.step()
-        train_ls.append(log_rmse(net, train_features, train_labels))
-        if test_labels is not None:
-            test_ls.append(log_rmse(net, test_features, test_labels))
-    return train_ls, test_ls
-```
-
-```{.python .input}
-#@tab tensorflow
-def train(net, train_features, train_labels, test_features, test_labels,
-          num_epochs, learning_rate, weight_decay, batch_size):
-    train_ls, test_ls = [], []
-    train_iter = d2l.load_array((train_features, train_labels), batch_size)
-    # The Adam optimization algorithm is used here
-    optimizer = tf.keras.optimizers.Adam(learning_rate)
-    net.compile(loss=loss, optimizer=optimizer)
-    for epoch in range(num_epochs):
-        for X, y in train_iter:
-            with tf.GradientTape() as tape:
-                y_hat = net(X)
-                l = loss(y, y_hat)
-            params = net.trainable_variables
-            grads = tape.gradient(l, params)
-            optimizer.apply_gradients(zip(grads, params))
-        train_ls.append(log_rmse(train_labels, net(train_features)))
-        if test_labels is not None:
-            test_ls.append(log_rmse(test_labels, net(test_features)))
-    return train_ls, test_ls
+```{.python .input  n=60}
+%%tab all
+@d2l.add_to_class(KaggleHouse)
+def get_dataloader(self, train):
+    label = 'SalePrice'
+    data = self.train if train else self.val
+    if label not in data: return
+    get_tensor = lambda x: d2l.tensor(x.values, dtype=d2l.float32)
+    # Logarithm of prices 
+    tensors = (get_tensor(data.drop(columns=[label])),  # X
+               d2l.reshape(d2l.log(get_tensor(data[label])), (-1, 1)))  # Y
+    return self.get_tensorloader(tensors, train)
 ```
 
 ## $K$-Fold Cross-Validation
 
-You might recall that we introduced [**$K$-fold cross-validation**]
-in the section where we discussed how to deal
-with model selection (:numref:`sec_model_selection`).
+You might recall that we introduced [**cross-validation**]
+in :numref:`subsec_generalization-model-selection`, where we discussed how to deal
+with model selection.
 We will put this to good use to select the model design
 and to adjust the hyperparameters.
 We first need a function that returns
@@ -541,46 +322,33 @@ But this added complexity might obfuscate our code unnecessarily
 so we can safely omit it here owing to the simplicity of our problem.
 
 ```{.python .input}
-#@tab all
-def get_k_fold_data(k, i, X, y):
-    assert k > 1
-    fold_size = X.shape[0] // k
-    X_train, y_train = None, None
+%%tab all
+def k_fold_data(data, k):
+    rets = []
+    fold_size = data.train.shape[0] // k
     for j in range(k):
-        idx = slice(j * fold_size, (j + 1) * fold_size)
-        X_part, y_part = X[idx, :], y[idx]
-        if j == i:
-            X_valid, y_valid = X_part, y_part
-        elif X_train is None:
-            X_train, y_train = X_part, y_part
-        else:
-            X_train = d2l.concat([X_train, X_part], 0)
-            y_train = d2l.concat([y_train, y_part], 0)
-    return X_train, y_train, X_valid, y_valid
+        idx = range(j * fold_size, (j+1) * fold_size)
+        rets.append(KaggleHouse(data.batch_size, data.train.drop(index=idx),  
+                                data.train.loc[idx]))    
+    return rets
 ```
 
-[**The training and verification error averages are returned**]
+[**The average validation error is reported**]
 when we train $K$ times in the $K$-fold cross-validation.
 
 ```{.python .input}
-#@tab all
-def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay,
-           batch_size):
-    train_l_sum, valid_l_sum = 0, 0
-    for i in range(k):
-        data = get_k_fold_data(k, i, X_train, y_train)
-        net = get_net()
-        train_ls, valid_ls = train(net, *data, num_epochs, learning_rate,
-                                   weight_decay, batch_size)
-        train_l_sum += train_ls[-1]
-        valid_l_sum += valid_ls[-1]
-        if i == 0:
-            d2l.plot(list(range(1, num_epochs + 1)), [train_ls, valid_ls],
-                     xlabel='epoch', ylabel='rmse', xlim=[1, num_epochs],
-                     legend=['train', 'valid'], yscale='log')
-        print(f'fold {i + 1}, train log rmse {float(train_ls[-1]):f}, '
-              f'valid log rmse {float(valid_ls[-1]):f}')
-    return train_l_sum / k, valid_l_sum / k
+%%tab all
+def k_fold(trainer, data, k, lr):
+    val_loss, models = [], []
+    for i, data_fold in enumerate(k_fold_data(data, k)):
+        model = d2l.LinearRegression(lr)
+        model.board.yscale='log'
+        if i != 0: model.board.display = False
+        trainer.fit(model, data_fold)
+        val_loss.append(float(model.board.data['val_loss'][-1].y))
+        models.append(model)
+    print(f'average validation log mse = {sum(val_loss)/len(val_loss)}')
+    return models
 ```
 
 ## [**Model Selection**]
@@ -598,12 +366,9 @@ we might just get lucky and find that our validation
 performance is no longer representative of the true error.
 
 ```{.python .input}
-#@tab all
-k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
-train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr,
-                          weight_decay, batch_size)
-print(f'{k}-fold validation: avg train log rmse: {float(train_l):f}, '
-      f'avg valid log rmse: {float(valid_l):f}')
+%%tab all
+trainer = d2l.Trainer(max_epochs=10)
+models = k_fold(trainer, data, k=5, lr=0.01)
 ```
 
 Notice that sometimes the number of training errors
@@ -619,42 +384,23 @@ by incorporating regularization techniques.
 ##  [**Submitting Predictions on Kaggle**]
 
 Now that we know what a good choice of hyperparameters should be,
-we might as well use all the data to train on it
-(rather than just $1-1/K$ of the data
-that are used in the cross-validation slices).
-The model that we obtain in this way
-can then be applied to the test set.
+we might 
+calculate the average predictions 
+on the test set
+by all the $K$ models.
 Saving the predictions in a csv file
 will simplify uploading the results to Kaggle.
-
-```{.python .input}
-#@tab all
-def train_and_pred(train_features, test_feature, train_labels, test_data,
-                   num_epochs, lr, weight_decay, batch_size):
-    net = get_net()
-    train_ls, _ = train(net, train_features, train_labels, None, None,
-                        num_epochs, lr, weight_decay, batch_size)
-    d2l.plot(np.arange(1, num_epochs + 1), [train_ls], xlabel='epoch',
-             ylabel='log rmse', xlim=[1, num_epochs], yscale='log')
-    print(f'train log rmse {float(train_ls[-1]):f}')
-    # Apply the network to the test set
-    preds = d2l.numpy(net(test_features))
-    # Reformat it to export to Kaggle
-    test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0])
-    submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
-    submission.to_csv('submission.csv', index=False)
-```
-
-One nice sanity check is to see
-whether the predictions on the test set
-resemble those of the $K$-fold cross-validation process.
-If they do, it is time to upload them to Kaggle.
 The following code will generate a file called `submission.csv`.
 
 ```{.python .input}
-#@tab all
-train_and_pred(train_features, test_features, train_labels, test_data,
-               num_epochs, lr, weight_decay, batch_size)
+%%tab all
+preds = [model(d2l.tensor(data.val.values, dtype=d2l.float32))
+         for model in models]
+# Taking exponentiation of predictions in the logarithm scale
+ensemble_preds = d2l.reduce_mean(d2l.exp(d2l.concat(preds, 1)), 1)
+submission = pd.DataFrame({'Id':data.raw_val.Id,
+                           'SalePrice':d2l.numpy(ensemble_preds)})
+submission.to_csv('submission.csv', index=False)
 ```
 
 Next, as demonstrated in :numref:`fig_kaggle_submit2`,
@@ -674,7 +420,7 @@ The steps are quite simple:
 
 ## Summary
 
-* Real data often contain a mix of different data types and need to be preprocessed.
+* Real data often contains a mix of different data types and needs to be preprocessed.
 * Rescaling real-valued data to zero mean and unit variance is a good default. So is replacing missing values with their mean.
 * Transforming categorical features into indicator features allows us to treat them like one-hot vectors.
 * We can use $K$-fold cross-validation to select the model and adjust the hyperparameters.
@@ -684,7 +430,6 @@ The steps are quite simple:
 ## Exercises
 
 1. Submit your predictions for this section to Kaggle. How good are your predictions?
-1. Can you improve your model by minimizing the logarithm of prices directly? What happens if you try to predict the logarithm of the price rather than the price?
 1. Is it always a good idea to replace missing values by their mean? Hint: can you construct a situation where the values are not missing at random?
 1. Improve the score on Kaggle by tuning the hyperparameters through $K$-fold cross-validation.
 1. Improve the score by improving the model (e.g., layers, weight decay, and dropout).
diff --git a/chapter_multilayer-perceptrons/mlp-concise.md b/chapter_multilayer-perceptrons/mlp-concise.md
deleted file mode 100644
index 28d71dd..0000000
--- a/chapter_multilayer-perceptrons/mlp-concise.md
+++ /dev/null
@@ -1,110 +0,0 @@
-# 多層パーセプトロンの簡潔な実装
-:label:`sec_mlp_concise`
-
-ご想像のとおり、(**高レベルAPIに頼ることで、MLPをより簡潔に実装できます**)
-
-```{.python .input}
-from d2l import mxnet as d2l
-from mxnet import gluon, init, npx
-from mxnet.gluon import nn
-npx.set_np()
-```
-
-```{.python .input}
-#@tab pytorch
-from d2l import torch as d2l
-import torch
-from torch import nn
-```
-
-```{.python .input}
-#@tab tensorflow
-from d2l import tensorflow as d2l
-import tensorflow as tf
-```
-
-## モデル
-
-softmax 回帰実装 (:numref:`sec_softmax_concise`) の簡潔な実装と比較すると、唯一の違いは以下を追加することです。
-*2つの* 完全に接続されたレイヤー
-(以前は、*one* を追加しました)。1つ目は [**私たちの隠れ層**] で、(**256 個の隠しユニットを含み、ReLU アクティベーション機能を適用する**)。2 つ目は出力レイヤーです。
-
-```{.python .input}
-net = nn.Sequential()
-net.add(nn.Dense(256, activation='relu'),
-        nn.Dense(10))
-net.initialize(init.Normal(sigma=0.01))
-```
-
-```{.python .input}
-#@tab pytorch
-net = nn.Sequential(nn.Flatten(),
-                    nn.Linear(784, 256),
-                    nn.ReLU(),
-                    nn.Linear(256, 10))
-
-def init_weights(m):
-    if type(m) == nn.Linear:
-        nn.init.normal_(m.weight, std=0.01)
-
-net.apply(init_weights);
-```
-
-```{.python .input}
-#@tab tensorflow
-net = tf.keras.models.Sequential([
-    tf.keras.layers.Flatten(),
-    tf.keras.layers.Dense(256, activation='relu'),
-    tf.keras.layers.Dense(10)])
-```
-
-[**トレーニングループ**] はソフトマックス回帰を実装したときとまったく同じです。このモジュール性により、モデルアーキテクチャに関する事項を直交的な考慮事項から切り離すことができます。
-
-```{.python .input}
-batch_size, lr, num_epochs = 256, 0.1, 10
-loss = gluon.loss.SoftmaxCrossEntropyLoss()
-trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
-```
-
-```{.python .input}
-#@tab pytorch
-batch_size, lr, num_epochs = 256, 0.1, 10
-loss = nn.CrossEntropyLoss()
-trainer = torch.optim.SGD(net.parameters(), lr=lr)
-```
-
-```{.python .input}
-#@tab tensorflow
-batch_size, lr, num_epochs = 256, 0.1, 10
-loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
-trainer = tf.keras.optimizers.SGD(learning_rate=lr)
-```
-
-```{.python .input}
-#@tab all
-train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
-```
-
-## [概要
-
-* 高レベル API を使用することで、MLP をより簡潔に実装できます。
-* 同じ分類問題では、MLP の実装はソフトマックス回帰の実装と同じですが、活性化関数をもつ隠れ層が追加されている点が異なります。
-
-## 演習
-
-1. 異なる数の隠れ層を追加してみてください (学習率を変更することもできます)。どの設定が最適ですか？
-1. さまざまなアクティベーション機能を試してみてください。どれが一番効果的ですか？
-1. ウェイトの初期化にはさまざまなスキームを試してください。どの方法が最適ですか？
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/94)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/95)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/262)
-:end_tab:
diff --git a/chapter_multilayer-perceptrons/mlp-concise_origin.md b/chapter_multilayer-perceptrons/mlp-concise_origin.md
deleted file mode 100644
index d772c9c..0000000
--- a/chapter_multilayer-perceptrons/mlp-concise_origin.md
+++ /dev/null
@@ -1,122 +0,0 @@
-# Concise Implementation of Multilayer Perceptrons
-:label:`sec_mlp_concise`
-
-As you might expect, by (**relying on the high-level APIs,
-we can implement MLPs even more concisely.**)
-
-```{.python .input}
-from d2l import mxnet as d2l
-from mxnet import gluon, init, npx
-from mxnet.gluon import nn
-npx.set_np()
-```
-
-```{.python .input}
-#@tab pytorch
-from d2l import torch as d2l
-import torch
-from torch import nn
-```
-
-```{.python .input}
-#@tab tensorflow
-from d2l import tensorflow as d2l
-import tensorflow as tf
-```
-
-## Model
-
-As compared with our concise implementation
-of softmax regression
-(:numref:`sec_softmax_concise`),
-the only difference is that we add
-*two* fully-connected layers
-(previously, we added *one*).
-The first is [**our hidden layer**],
-which (**contains 256 hidden units
-and applies the ReLU activation function**).
-The second is our output layer.
-
-```{.python .input}
-net = nn.Sequential()
-net.add(nn.Dense(256, activation='relu'),
-        nn.Dense(10))
-net.initialize(init.Normal(sigma=0.01))
-```
-
-```{.python .input}
-#@tab pytorch
-net = nn.Sequential(nn.Flatten(),
-                    nn.Linear(784, 256),
-                    nn.ReLU(),
-                    nn.Linear(256, 10))
-
-def init_weights(m):
-    if type(m) == nn.Linear:
-        nn.init.normal_(m.weight, std=0.01)
-
-net.apply(init_weights);
-```
-
-```{.python .input}
-#@tab tensorflow
-net = tf.keras.models.Sequential([
-    tf.keras.layers.Flatten(),
-    tf.keras.layers.Dense(256, activation='relu'),
-    tf.keras.layers.Dense(10)])
-```
-
-[**The training loop**] is exactly the same
-as when we implemented softmax regression.
-This modularity enables us to separate
-matters concerning the model architecture
-from orthogonal considerations.
-
-```{.python .input}
-batch_size, lr, num_epochs = 256, 0.1, 10
-loss = gluon.loss.SoftmaxCrossEntropyLoss()
-trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
-```
-
-```{.python .input}
-#@tab pytorch
-batch_size, lr, num_epochs = 256, 0.1, 10
-loss = nn.CrossEntropyLoss()
-trainer = torch.optim.SGD(net.parameters(), lr=lr)
-```
-
-```{.python .input}
-#@tab tensorflow
-batch_size, lr, num_epochs = 256, 0.1, 10
-loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
-trainer = tf.keras.optimizers.SGD(learning_rate=lr)
-```
-
-```{.python .input}
-#@tab all
-train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
-```
-
-## Summary
-
-* Using high-level APIs, we can implement MLPs much more concisely.
-* For the same classification problem, the implementation of an MLP is the same as that of softmax regression except for additional hidden layers with activation functions.
-
-## Exercises
-
-1. Try adding different numbers of hidden layers (you may also modify the learning rate). What setting works best?
-1. Try out different activation functions. Which one works best?
-1. Try different schemes for initializing the weights. What method works best?
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/94)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/95)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/262)
-:end_tab:
diff --git a/chapter_multilayer-perceptrons/mlp-implementation.md b/chapter_multilayer-perceptrons/mlp-implementation.md
new file mode 100644
index 0000000..677e4fb
--- /dev/null
+++ b/chapter_multilayer-perceptrons/mlp-implementation.md
@@ -0,0 +1,215 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# 多層パーセプトロンの実装
+:label:`sec_mlp-implementation`
+
+Multilayer perceptrons (MLPs) are not much more complex to implement than simple linear models. The key conceptual difference is that we now concatenate multiple layers.
+
+```{.python .input  n=2}
+%%tab mxnet
+from d2l import mxnet as d2l
+from mxnet import np, npx
+from mxnet.gluon import nn
+npx.set_np()
+```
+
+```{.python .input  n=3}
+%%tab pytorch
+from d2l import torch as d2l
+import torch
+from torch import nn
+```
+
+```{.python .input  n=4}
+%%tab tensorflow
+from d2l import tensorflow as d2l
+import tensorflow as tf
+```
+
+## ゼロからの実装
+
+このようなネットワークをゼロから実装することから始めましょう。 
+
+### モデルパラメーターの初期化
+
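+Recall that Fashion-MNIST contains 10 classes, and that each image consists of a $28 \times 28 = 784$ grid of grayscale pixel values. As before, we will disregard the spatial structure among the pixels for now, so we can think of this as a classification dataset with 784 input features and 10 classes. To begin, we will [**implement an MLP with one hidden layer and 256 hidden units.**] Both the number of layers and their width are adjustable (they are considered hyperparameters). Typically, we choose the layer widths to be divisible by larger powers of 2. This is computationally efficient due to the way memory is allocated and addressed in hardware.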
+
+ここでも、パラメータをいくつかのテンソルで表現します。*すべてのレイヤー*について、1つの重み行列と1つのバイアスベクトルを追跡しなければならないことに注意してください。いつものように、これらのパラメータに関して損失の勾配にメモリを割り当てます。
+
+```{.python .input  n=5}
+%%tab mxnet
+class MLPScratch(d2l.Classifier):
+    def __init__(self, num_inputs, num_outputs, num_hiddens, lr, sigma=0.01):
+        super().__init__()
+        self.save_hyperparameters()
+        self.W1 = np.random.randn(num_inputs, num_hiddens) * sigma
+        self.b1 = np.zeros(num_hiddens)
+        self.W2 = np.random.randn(num_hiddens, num_outputs) * sigma
+        self.b2 = np.zeros(num_outputs)
+        for param in self.get_scratch_params():
+            param.attach_grad()
+```
+
+```{.python .input  n=6}
+%%tab pytorch
+class MLPScratch(d2l.Classifier):
+    def __init__(self, num_inputs, num_outputs, num_hiddens, lr, sigma=0.01):
+        super().__init__()
+        self.save_hyperparameters()
+        self.W1 = nn.Parameter(torch.randn(num_inputs, num_hiddens) * sigma)
+        self.b1 = nn.Parameter(torch.zeros(num_hiddens))
+        self.W2 = nn.Parameter(torch.randn(num_hiddens, num_outputs) * sigma)
+        self.b2 = nn.Parameter(torch.zeros(num_outputs))
+```
+
+```{.python .input  n=7}
+%%tab tensorflow
+class MLPScratch(d2l.Classifier):
+    def __init__(self, num_inputs, num_outputs, num_hiddens, lr, sigma=0.01):
+        super().__init__()
+        self.save_hyperparameters()
+        self.W1 = tf.Variable(
+            tf.random.normal((num_inputs, num_hiddens)) * sigma)
+        self.b1 = tf.Variable(tf.zeros(num_hiddens))
+        self.W2 = tf.Variable(
+            tf.random.normal((num_hiddens, num_outputs)) * sigma)
+        self.b2 = tf.Variable(tf.zeros(num_outputs))
+```
+
+### モデル
+
+To make sure we know how everything works, we will [**implement the ReLU activation**] ourselves rather than invoking the built-in `relu` function directly.
+
+```{.python .input  n=8}
+%%tab mxnet
+def relu(X):
+    return np.maximum(X, 0)
+```
+
+```{.python .input  n=9}
+%%tab pytorch
+def relu(X):
+    a = torch.zeros_like(X)
+    return torch.max(X, a)
+```
+
+```{.python .input  n=10}
+%%tab tensorflow
+def relu(X):
+    return tf.math.maximum(X, 0)
+```
+
+空間構造を無視しているので、各2次元画像を長さ`num_inputs`のフラットベクトルに`reshape`します。最後に、ほんの数行のコードで (**モデルを実装**) します。私たちはフレームワークの組み込みオートグラードを使っているので、これだけで十分です。
+
+```{.python .input  n=11}
+%%tab all
+@d2l.add_to_class(MLPScratch)
+def forward(self, X):
+    X = d2l.reshape(X, (-1, self.num_inputs))
+    H = relu(d2l.matmul(X, self.W1) + self.b1)
+    return d2l.matmul(H, self.W2) + self.b2
+```
+
+### トレーニング
+
+幸い、[**MLPの学習ループはソフトマックス回帰とまったく同じです。**] モデル、データ、トレーナーを定義し、最後にモデルとデータに対して関数 `fit` を呼び出します。
+
+```{.python .input  n=12}
+%%tab all
+model = MLPScratch(num_inputs=784, num_outputs=10, num_hiddens=256, lr=0.1)
+data = d2l.FashionMNIST(batch_size=256)
+trainer = d2l.Trainer(max_epochs=10)
+trainer.fit(model, data)
+```
+
+## 簡潔な実装
+
+ご想像のとおり、高レベル API に頼ることで、MLP をさらに簡潔に実装できます。 
+
+### モデル
+
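+As compared with our concise implementation of softmax regression (:numref:`sec_softmax_concise`), the only difference is that we add
+*two* fully connected layers where we previously added only *one*.
+The first is [**the hidden layer**], and the second is the output layer.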
+
+```{.python .input}
+%%tab mxnet
+class MLP(d2l.Classifier):
+    def __init__(self, num_outputs, num_hiddens, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        self.net = nn.Sequential()
+        self.net.add(nn.Dense(num_hiddens, activation='relu'),
+                     nn.Dense(num_outputs))
+        self.net.initialize()
+```
+
+```{.python .input}
+%%tab pytorch
+class MLP(d2l.Classifier):
+    def __init__(self, num_outputs, num_hiddens, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(num_hiddens),
+                                 nn.ReLU(), nn.LazyLinear(num_outputs))
+```
+
+```{.python .input}
+%%tab tensorflow
+class MLP(d2l.Classifier):
+    def __init__(self, num_outputs, num_hiddens, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        self.net = tf.keras.models.Sequential([
+            tf.keras.layers.Flatten(),
+            tf.keras.layers.Dense(num_hiddens, activation='relu'),
+            tf.keras.layers.Dense(num_outputs)])
+```
+
+### トレーニング
+
+[**トレーニングループ**] は、ソフトマックス回帰を実装したときとまったく同じです。このモジュール性により、モデルアーキテクチャに関する事項を直交的な考慮事項から分離することができます。
+
+```{.python .input}
+%%tab all
+model = MLP(num_outputs=10, num_hiddens=256, lr=0.1)
+trainer.fit(model, data)
+```
+
+## まとめ
+
+ディープネットワークを設計する実践が増えた今、ディープネットワークの単一レイヤーから複数レイヤーへのステップは、もはやそれほど大きな課題にはなりません。特に、トレーニングアルゴリズムとデータローダーを再利用できます。ただし、MLP をゼロから実装するのは面倒です。モデルパラメータの名前付けと追跡を行うと、モデルの拡張が困難になります。たとえば、レイヤー 42 と 43 の間に別のレイヤーを挿入するとします。これは、順番に名前を変更する意思がない限り、レイヤー42bになる可能性があります。さらに、ネットワークをゼロから実装すると、フレームワークが有意義なパフォーマンスの最適化を実行することははるかに困難になります。 
+
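+Nonetheless, you have now reached the state of the art of the late 1980s, when fully connected deep networks were the method of choice for neural network modeling. Our next conceptual step will be to consider images. Before we do so, we need to review a number of statistical basics and details on how to compute models efficiently.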
+
+## 演習
+
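+1. Change the number of hidden units `num_hiddens` and plot how its number affects the accuracy of the model. What is the best value of this hyperparameter?
+1. Try adding a hidden layer to see how it affects the results.
+1. Why is it a bad idea to insert a hidden layer with a single neuron? What could go wrong?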
+1. 学習率を変えると結果はどう変わりますか？他のすべてのパラメータを固定した状態で、どの学習率が最も良い結果が得られますか？これはエポック数とどのように関係していますか？
+1. 学習率、エポック数、隠れ層の数、層ごとの隠れユニットの数など、すべてのハイパーパラメータを合わせて最適化しましょう。
+    1. それらすべてを最適化することで得られる最高の結果は何ですか？
+    1. 複数のハイパーパラメータを扱うのがはるかに難しいのはなぜですか？
+    1. 複数のパラメータを共同で最適化する効率的な戦略を説明する。
+1. 困難な問題について、フレームワークの速度とゼロからの実装を比較します。ネットワークの複雑さによってどのように変化しますか？
+1. 整列した行列と整列していない行列のテンソル行列の乗算の速度を測定します。たとえば、次元 1024、1025、1026、1028、および 1032 の行列をテストします。
+    1. これはGPUとCPUの間でどのように変化しますか？
+    1. CPU と GPU のメモリバス幅を決定します。
+1. Try out different activation functions. Which one works best?
+1. Is there a difference between weight initializations of the network? Does it matter?
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/92)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/93)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/227)
+:end_tab:
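
As a quick check on the shapes used by `MLPScratch` in the file above: each layer holds one weight matrix and one bias vector, so the parameter count follows directly from `num_inputs`, `num_hiddens`, and `num_outputs`. The sketch below does that arithmetic in plain Python. `mlp_param_count` is a hypothetical helper name introduced here for illustration; it is not part of `d2l`.

```python
def mlp_param_count(num_inputs, num_hiddens, num_outputs):
    """Count the weights and biases of a one-hidden-layer MLP.

    Hypothetical helper for illustration; not part of d2l.
    """
    hidden = num_inputs * num_hiddens + num_hiddens    # W1 and b1
    output = num_hiddens * num_outputs + num_outputs   # W2 and b2
    return hidden + output


# Fashion-MNIST setting from the text: 784 inputs, 256 hidden units, 10 classes.
total = mlp_param_count(784, 256, 10)
print(total)
```

Nearly all of the parameters sit in the first weight matrix (784 x 256), which is why the hidden width is the hyperparameter that most affects model size here.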
diff --git a/chapter_multilayer-perceptrons/mlp-implementation_origin.md b/chapter_multilayer-perceptrons/mlp-implementation_origin.md
new file mode 100644
index 0000000..646e69e
--- /dev/null
+++ b/chapter_multilayer-perceptrons/mlp-implementation_origin.md
@@ -0,0 +1,249 @@
+```{.python .input  n=1}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# Implementation of Multilayer Perceptrons
+:label:`sec_mlp-implementation`
+
+Multilayer perceptrons (MLPs) are not much more complex to implement than simple linear models. The key conceptual
+difference is that we now concatenate multiple layers.
+
+```{.python .input  n=2}
+%%tab mxnet
+from d2l import mxnet as d2l
+from mxnet import np, npx
+from mxnet.gluon import nn
+npx.set_np()
+```
+
+```{.python .input  n=3}
+%%tab pytorch
+from d2l import torch as d2l
+import torch
+from torch import nn
+```
+
+```{.python .input  n=4}
+%%tab tensorflow
+from d2l import tensorflow as d2l
+import tensorflow as tf
+```
+
+## Implementation from Scratch
+
+Let's begin again by implementing such a network from scratch.
+
+### Initializing Model Parameters
+
+Recall that Fashion-MNIST contains 10 classes,
+and that each image consists of a $28 \times 28 = 784$
+grid of grayscale pixel values.
+As before, we will disregard the spatial structure
+among the pixels for now,
+so we can think of this as a classification dataset
+with 784 input features and 10 classes.
+To begin, we will [**implement an MLP
+with one hidden layer and 256 hidden units.**]
+Both the number of layers and their width are adjustable
+(they are considered hyperparameters).
+Typically, we choose the layer widths to be divisible by larger powers of 2.
+This is computationally efficient due to the way
+memory is allocated and addressed in hardware.
+
+Again, we will represent our parameters with several tensors.
+Note that *for every layer*, we must keep track of
+one weight matrix and one bias vector.
+As always, we allocate memory
+for the gradients of the loss with respect to these parameters.
+
+```{.python .input  n=5}
+%%tab mxnet
+class MLPScratch(d2l.Classifier):
+    def __init__(self, num_inputs, num_outputs, num_hiddens, lr, sigma=0.01):
+        super().__init__()
+        self.save_hyperparameters()
+        self.W1 = np.random.randn(num_inputs, num_hiddens) * sigma
+        self.b1 = np.zeros(num_hiddens)
+        self.W2 = np.random.randn(num_hiddens, num_outputs) * sigma
+        self.b2 = np.zeros(num_outputs)
+        for param in self.get_scratch_params():
+            param.attach_grad()
+```
+
+```{.python .input  n=6}
+%%tab pytorch
+class MLPScratch(d2l.Classifier):
+    def __init__(self, num_inputs, num_outputs, num_hiddens, lr, sigma=0.01):
+        super().__init__()
+        self.save_hyperparameters()
+        self.W1 = nn.Parameter(torch.randn(num_inputs, num_hiddens) * sigma)
+        self.b1 = nn.Parameter(torch.zeros(num_hiddens))
+        self.W2 = nn.Parameter(torch.randn(num_hiddens, num_outputs) * sigma)
+        self.b2 = nn.Parameter(torch.zeros(num_outputs))
+```
+
+```{.python .input  n=7}
+%%tab tensorflow
+class MLPScratch(d2l.Classifier):
+    def __init__(self, num_inputs, num_outputs, num_hiddens, lr, sigma=0.01):
+        super().__init__()
+        self.save_hyperparameters()
+        self.W1 = tf.Variable(
+            tf.random.normal((num_inputs, num_hiddens)) * sigma)
+        self.b1 = tf.Variable(tf.zeros(num_hiddens))
+        self.W2 = tf.Variable(
+            tf.random.normal((num_hiddens, num_outputs)) * sigma)
+        self.b2 = tf.Variable(tf.zeros(num_outputs))
+```
+
+### Model
+
+To make sure we know how everything works,
+we will [**implement the ReLU activation**] ourselves
+rather than invoking the built-in `relu` function directly.
+
+```{.python .input  n=8}
+%%tab mxnet
+def relu(X):
+    return np.maximum(X, 0)
+```
+
+```{.python .input  n=9}
+%%tab pytorch
+def relu(X):
+    a = torch.zeros_like(X)
+    return torch.max(X, a)
+```
+
+```{.python .input  n=10}
+%%tab tensorflow
+def relu(X):
+    return tf.math.maximum(X, 0)
+```
+
+Since we are disregarding spatial structure,
+we `reshape` each two-dimensional image into
+a flat vector of length  `num_inputs`.
+Finally, we (**implement our model**)
+with just a few lines of code. Since we use the framework's built-in autograd, this is all that it takes.
+
+```{.python .input  n=11}
+%%tab all
+@d2l.add_to_class(MLPScratch)
+def forward(self, X):
+    X = d2l.reshape(X, (-1, self.num_inputs))
+    H = relu(d2l.matmul(X, self.W1) + self.b1)
+    return d2l.matmul(H, self.W2) + self.b2
+```
+
+### Training
+
+Fortunately, [**the training loop for MLPs
+is exactly the same as for softmax regression.**] We define the model, data, trainer and finally invoke the `fit` function on model and data.
+
+```{.python .input  n=12}
+%%tab all
+model = MLPScratch(num_inputs=784, num_outputs=10, num_hiddens=256, lr=0.1)
+data = d2l.FashionMNIST(batch_size=256)
+trainer = d2l.Trainer(max_epochs=10)
+trainer.fit(model, data)
+```
+
+## Concise Implementation
+
+As you might expect, by relying on the high-level APIs, we can implement MLPs even more concisely.
+
+### Model
+
+As compared with our concise implementation
+of softmax regression
+(:numref:`sec_softmax_concise`),
+the only difference is that we add
+*two* fully connected layers where we previously added only *one*.
+The first is [**the hidden layer**],
+and the second is the output layer.
+
+```{.python .input}
+%%tab mxnet
+class MLP(d2l.Classifier):
+    def __init__(self, num_outputs, num_hiddens, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        self.net = nn.Sequential()
+        self.net.add(nn.Dense(num_hiddens, activation='relu'),
+                     nn.Dense(num_outputs))
+        self.net.initialize()
+```
+
+```{.python .input}
+%%tab pytorch
+class MLP(d2l.Classifier):
+    def __init__(self, num_outputs, num_hiddens, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(num_hiddens),
+                                 nn.ReLU(), nn.LazyLinear(num_outputs))
+```
+
+```{.python .input}
+%%tab tensorflow
+class MLP(d2l.Classifier):
+    def __init__(self, num_outputs, num_hiddens, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        self.net = tf.keras.models.Sequential([
+            tf.keras.layers.Flatten(),
+            tf.keras.layers.Dense(num_hiddens, activation='relu'),
+            tf.keras.layers.Dense(num_outputs)])
+```
+
+### Training
+
+[**The training loop**] is exactly the same
+as when we implemented softmax regression.
+This modularity enables us to separate
+matters concerning the model architecture
+from orthogonal considerations.
+
+```{.python .input}
+%%tab all
+model = MLP(num_outputs=10, num_hiddens=256, lr=0.1)
+trainer.fit(model, data)
+```
+
+## Summary
+
+Now that we have more practice in designing deep networks, the step from a single to multiple layers of deep networks doesn't pose such a significant challenge any longer. In particular, we can reuse the training algorithm and data loader. Note, though, that implementing MLPs from scratch is nonetheless messy: naming and keeping track of the model parameters makes it difficult to extend models. For instance, imagine wanting to insert another layer between layers 42 and 43. This might now be layer 42b, unless we are willing to perform sequential renaming. Moreover, if we implement the network from scratch, it is much more difficult for the framework to perform meaningful performance optimizations.
+
+Nonetheless, you have now reached the state of the art of the late 1980s when fully connected deep networks were the method of choice for neural network modeling. Our next conceptual step will be to consider images. Before we do so, we need to review a number of statistical basics and details on how to compute models efficiently.
+
+
+## Exercises
+
+1. Change the number of hidden units `num_hiddens` and plot how its number affects the accuracy of the model. What is the best value of this hyperparameter?
+1. Try adding a hidden layer to see how it affects the results.
+1. Why is it a bad idea to insert a hidden layer with a single neuron? What could go wrong?
+1. How does changing the learning rate alter your results? With all other parameters fixed, which learning rate gives you the best results? How does this relate to the number of epochs?
+1. Let's optimize over all hyperparameters jointly, i.e., learning rate, number of epochs, number of hidden layers, and number of hidden units per layer.
+    1. What is the best result you can get by optimizing over all of them?
+    1. Why is it much more challenging to deal with multiple hyperparameters?
+    1. Describe an efficient strategy for optimizing over multiple parameters jointly.
+1. Compare the speed of the framework and the from-scratch implementation for a challenging problem. How does it change with the complexity of the network?
+1. Measure the speed of tensor-matrix multiplications for well-aligned and misaligned matrices. For instance, test for matrices with dimension 1024, 1025, 1026, 1028, and 1032.
+    1. How does this change between GPUs and CPUs?
+    1. Determine the memory bus width of your CPU and GPU.
+1. Try out different activation functions. Which one works best?
+1. Is there a difference between weight initializations of the network? Does it matter?
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/92)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/93)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/227)
+:end_tab:
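
The `relu` and `forward` definitions in the file above rely on framework tensors. To see the same computation without any framework, here is a minimal sketch in plain Python using nested lists. The helper names (`matmul`, `add_bias`, `mlp_forward`) and the toy weights are assumptions made for illustration; they are not part of `d2l`, and the code is far too slow for real training.

```python
def relu(X):
    """Element-wise max(x, 0) over a list of rows."""
    return [[max(x, 0.0) for x in row] for row in X]


def matmul(A, B):
    """Multiply an (n x k) matrix by a (k x m) matrix, both nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def add_bias(X, b):
    """Add the bias vector b to every row of X."""
    return [[x + bi for x, bi in zip(row, b)] for row in X]


def mlp_forward(X, W1, b1, W2, b2):
    """One hidden layer with ReLU, followed by a linear output layer."""
    H = relu(add_bias(matmul(X, W1), b1))
    return add_bias(matmul(H, W2), b2)


# Toy weights (illustrative only): 2 inputs, 2 hidden units, 1 output.
# With these values the network computes |x1 + x2| + 0.5.
W1 = [[1.0, -1.0], [1.0, -1.0]]
b1 = [0.0, 0.0]
W2 = [[1.0], [1.0]]
b2 = [0.5]

out = mlp_forward([[1.0, 2.0], [-1.0, -2.0]], W1, b1, W2, b2)
print(out)
```

The toy weights make the role of the nonlinearity visible: without `relu`, the two hidden units would cancel and the output would be the constant 0.5 for every input.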
diff --git a/chapter_multilayer-perceptrons/mlp-scratch.md b/chapter_multilayer-perceptrons/mlp-scratch.md
deleted file mode 100644
index f0a8f82..0000000
--- a/chapter_multilayer-perceptrons/mlp-scratch.md
+++ /dev/null
@@ -1,202 +0,0 @@
-# 多層パーセプトロンのゼロからの実装
-:label:`sec_mlp_scratch`
-
-多層パーセプトロン (MLP) を数学的に特徴付けたので、自分で実装してみましょう。ソフトマックス回帰 (:numref:`sec_softmax_scratch`) で達成した以前の結果と比較するために、Fashion-MNIST 画像分類データセット (:numref:`sec_fashion_mnist`) を引き続き使用します。
-
-```{.python .input}
-from d2l import mxnet as d2l
-from mxnet import gluon, np, npx
-npx.set_np()
-```
-
-```{.python .input}
-#@tab pytorch
-from d2l import torch as d2l
-import torch
-from torch import nn
-```
-
-```{.python .input}
-#@tab tensorflow
-from d2l import tensorflow as d2l
-import tensorflow as tf
-```
-
-```{.python .input}
-#@tab all
-batch_size = 256
-train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
-```
-
-## モデルパラメーターの初期化
-
-Fashion-MNist には 10 個のクラスが含まれており、各イメージはグレースケールピクセル値の $28 \times 28 = 784$ グリッドで構成されていることを思い出してください。ここでも、ピクセル間の空間構造は無視するので、784 個の入力フィーチャと 10 個のクラスを含む単純な分類データセットと考えることができます。はじめに、[**隠れ層が 1 つ、隠れ単位が 256 個の MLP を実装します。**] これら両方の量をハイパーパラメーターと見なすことができます。通常、レイヤーの幅は 2 の累乗で選択しますが、ハードウェアでのメモリの割り当て方法とアドレス指定方法により、計算効率が高くなる傾向があります。 
-
-ここでも、パラメータをいくつかのテンソルで表します。*すべての層* について、1 つの重み行列と 1 つのバイアスベクトルを追跡する必要があることに注意してください。いつものように、これらのパラメータに関して損失の勾配にメモリを割り当てます。
-
-```{.python .input}
-num_inputs, num_outputs, num_hiddens = 784, 10, 256
-
-W1 = np.random.normal(scale=0.01, size=(num_inputs, num_hiddens))
-b1 = np.zeros(num_hiddens)
-W2 = np.random.normal(scale=0.01, size=(num_hiddens, num_outputs))
-b2 = np.zeros(num_outputs)
-params = [W1, b1, W2, b2]
-
-for param in params:
-    param.attach_grad()
-```
-
-```{.python .input}
-#@tab pytorch
-num_inputs, num_outputs, num_hiddens = 784, 10, 256
-
-W1 = nn.Parameter(torch.randn(
-    num_inputs, num_hiddens, requires_grad=True) * 0.01)
-b1 = nn.Parameter(torch.zeros(num_hiddens, requires_grad=True))
-W2 = nn.Parameter(torch.randn(
-    num_hiddens, num_outputs, requires_grad=True) * 0.01)
-b2 = nn.Parameter(torch.zeros(num_outputs, requires_grad=True))
-
-params = [W1, b1, W2, b2]
-```
-
-```{.python .input}
-#@tab tensorflow
-num_inputs, num_outputs, num_hiddens = 784, 10, 256
-
-W1 = tf.Variable(tf.random.normal(
-    shape=(num_inputs, num_hiddens), mean=0, stddev=0.01))
-b1 = tf.Variable(tf.zeros(num_hiddens))
-W2 = tf.Variable(tf.random.normal(
-    shape=(num_hiddens, num_outputs), mean=0, stddev=0.01))
-b2 = tf.Variable(tf.random.normal([num_outputs], stddev=.01))
-
-params = [W1, b1, W2, b2]
-```
-
-## Activation Function
-
-すべてがどのように機能するかを確実に知るために、組み込みの `relu` 関数を直接呼び出すのではなく、max 関数を使って [**ReLU アクティベーションを実装**] します。
-
-```{.python .input}
-def relu(X):
-    return np.maximum(X, 0)
-```
-
-```{.python .input}
-#@tab pytorch
-def relu(X):
-    a = torch.zeros_like(X)
-    return torch.max(X, a)
-```
-
-```{.python .input}
-#@tab tensorflow
-def relu(X):
-    return tf.math.maximum(X, 0)
-```
-
-## モデル
-
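-Because we are disregarding spatial structure, we `reshape` each two-dimensional image into a flat vector of length `num_inputs`. Finally, we (**implement our model**) with just a few lines of code.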
-
-```{.python .input}
-def net(X):
-    X = d2l.reshape(X, (-1, num_inputs))
-    H = relu(np.dot(X, W1) + b1)
-    return np.dot(H, W2) + b2
-```
-
-```{.python .input}
-#@tab pytorch
-def net(X):
-    X = d2l.reshape(X, (-1, num_inputs))
-    H = relu(X@W1 + b1)  # Here '@' stands for matrix multiplication
-    return (H@W2 + b2)
-```
-
-```{.python .input}
-#@tab tensorflow
-def net(X):
-    X = d2l.reshape(X, (-1, num_inputs))
-    H = relu(tf.matmul(X, W1) + b1)
-    return tf.matmul(H, W2) + b2
-```
-
-## 損失関数
-
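-To ensure numerical stability, and because we already implemented the softmax function from scratch (:numref:`sec_softmax_scratch`), we leverage the integrated function from high-level APIs for calculating the softmax and cross-entropy loss. Recall our earlier discussion of these intricacies in :numref:`subsec_softmax-implementation-revisited`. We encourage the interested reader to examine the source code for the loss function to deepen their knowledge of implementation details.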
-
-```{.python .input}
-loss = gluon.loss.SoftmaxCrossEntropyLoss()
-```
-
-```{.python .input}
-#@tab pytorch
-loss = nn.CrossEntropyLoss()
-```
-
-```{.python .input}
-#@tab tensorflow
-def loss(y_hat, y):
-    return tf.losses.sparse_categorical_crossentropy(
-        y, y_hat, from_logits=True)
-```
-
-## 訓練
-
-幸いなことに、[**MLP の学習ループはソフトマックス回帰の場合とまったく同じです。**] `d2l` パッケージをもう一度活用して `train_ch3` 関数 (:numref:`sec_softmax_scratch` を参照) を呼び出し、エポック数を 10、学習率を 0.1 に設定します。
-
-```{.python .input}
-num_epochs, lr = 10, 0.1
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs,
-              lambda batch_size: d2l.sgd(params, lr, batch_size))
-```
-
-```{.python .input}
-#@tab pytorch
-num_epochs, lr = 10, 0.1
-updater = torch.optim.SGD(params, lr=lr)
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, updater)
-```
-
-```{.python .input}
-#@tab tensorflow
-num_epochs, lr = 10, 0.1
-updater = d2l.Updater([W1, W2, b1, b2], lr)
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, updater)
-```
-
-学習したモデルを評価するために、[**テストデータに適用する**]。
-
-```{.python .input}
-#@tab all
-d2l.predict_ch3(net, test_iter)
-```
-
-## Summary
-
-* 単純な MLP の実装は、手動で行う場合でも簡単であることがわかりました。
-* しかし、レイヤーの数が多いと、MLP をゼロから実装するのは面倒です (たとえば、モデルのパラメーターの命名や追跡など)。
-
-## 演習
-
-1. ハイパーパラメータ `num_hiddens` の値を変更して、このハイパーパラメータが結果にどのように影響するかを確認します。このハイパーパラメータの最適値を決定し、他の値をすべて一定に保ちます。
-1. Try adding an additional hidden layer to see how it affects the results.
-1. 学習率を変更すると、結果にどのような影響がありますか？モデルアーキテクチャとその他のハイパーパラメータ (エポック数を含む) を修正した場合、どの学習率で最良の結果が得られますか?
-1. すべてのハイパーパラメータ（学習率、エポック数、隠れ層の数、層あたりの隠れユニット数）を合わせて最適化すると、どのような結果が得られますか？
-1. 複数のハイパーパラメータを扱うのがはるかに難しい理由を説明する。
-1. 複数のハイパーパラメータに対する検索を構造化するために考えられる最も賢い戦略は何ですか？
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/92)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/93)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/227)
-:end_tab:
diff --git a/chapter_multilayer-perceptrons/mlp-scratch_origin.md b/chapter_multilayer-perceptrons/mlp-scratch_origin.md
deleted file mode 100644
index 21f1f41..0000000
--- a/chapter_multilayer-perceptrons/mlp-scratch_origin.md
+++ /dev/null
@@ -1,251 +0,0 @@
-# Implementation of Multilayer Perceptrons from Scratch
-:label:`sec_mlp_scratch`
-
-Now that we have characterized
-multilayer perceptrons (MLPs) mathematically,
-let us try to implement one ourselves. To compare against our previous results
-achieved with softmax regression
-(:numref:`sec_softmax_scratch`),
-we will continue to work with
-the Fashion-MNIST image classification dataset
-(:numref:`sec_fashion_mnist`).
-
-```{.python .input}
-from d2l import mxnet as d2l
-from mxnet import gluon, np, npx
-npx.set_np()
-```
-
-```{.python .input}
-#@tab pytorch
-from d2l import torch as d2l
-import torch
-from torch import nn
-```
-
-```{.python .input}
-#@tab tensorflow
-from d2l import tensorflow as d2l
-import tensorflow as tf
-```
-
-```{.python .input}
-#@tab all
-batch_size = 256
-train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
-```
-
-## Initializing Model Parameters
-
-Recall that Fashion-MNIST contains 10 classes,
-and that each image consists of a $28 \times 28 = 784$
-grid of grayscale pixel values.
-Again, we will disregard the spatial structure
-among the pixels for now,
-so we can think of this as simply a classification dataset
-with 784 input features and 10 classes.
-To begin, we will [**implement an MLP
-with one hidden layer and 256 hidden units.**]
-Note that we can regard both of these quantities
-as hyperparameters.
-Typically, we choose layer widths in powers of 2,
-which tend to be computationally efficient because
-of how memory is allocated and addressed in hardware.
-
-Again, we will represent our parameters with several tensors.
-Note that *for every layer*, we must keep track of
-one weight matrix and one bias vector.
-As always, we allocate memory
-for the gradients of the loss with respect to these parameters.
-
-```{.python .input}
-num_inputs, num_outputs, num_hiddens = 784, 10, 256
-
-W1 = np.random.normal(scale=0.01, size=(num_inputs, num_hiddens))
-b1 = np.zeros(num_hiddens)
-W2 = np.random.normal(scale=0.01, size=(num_hiddens, num_outputs))
-b2 = np.zeros(num_outputs)
-params = [W1, b1, W2, b2]
-
-for param in params:
-    param.attach_grad()
-```
-
-```{.python .input}
-#@tab pytorch
-num_inputs, num_outputs, num_hiddens = 784, 10, 256
-
-W1 = nn.Parameter(torch.randn(
-    num_inputs, num_hiddens, requires_grad=True) * 0.01)
-b1 = nn.Parameter(torch.zeros(num_hiddens, requires_grad=True))
-W2 = nn.Parameter(torch.randn(
-    num_hiddens, num_outputs, requires_grad=True) * 0.01)
-b2 = nn.Parameter(torch.zeros(num_outputs, requires_grad=True))
-
-params = [W1, b1, W2, b2]
-```
-
-```{.python .input}
-#@tab tensorflow
-num_inputs, num_outputs, num_hiddens = 784, 10, 256
-
-W1 = tf.Variable(tf.random.normal(
-    shape=(num_inputs, num_hiddens), mean=0, stddev=0.01))
-b1 = tf.Variable(tf.zeros(num_hiddens))
-W2 = tf.Variable(tf.random.normal(
-    shape=(num_hiddens, num_outputs), mean=0, stddev=0.01))
-b2 = tf.Variable(tf.random.normal([num_outputs], stddev=.01))
-
-params = [W1, b1, W2, b2]
-```
-
-## Activation Function
-
-To make sure we know how everything works,
-we will [**implement the ReLU activation**] ourselves
-using the maximum function rather than
-invoking the built-in `relu` function directly.
-
-```{.python .input}
-def relu(X):
-    return np.maximum(X, 0)
-```
-
-```{.python .input}
-#@tab pytorch
-def relu(X):
-    a = torch.zeros_like(X)
-    return torch.max(X, a)
-```
-
-```{.python .input}
-#@tab tensorflow
-def relu(X):
-    return tf.math.maximum(X, 0)
-```
-
-## Model
-
-Because we are disregarding spatial structure,
-we `reshape` each two-dimensional image into
-a flat vector of length  `num_inputs`.
-Finally, we (**implement our model**)
-with just a few lines of code.
-
-```{.python .input}
-def net(X):
-    X = d2l.reshape(X, (-1, num_inputs))
-    H = relu(np.dot(X, W1) + b1)
-    return np.dot(H, W2) + b2
-```
-
-```{.python .input}
-#@tab pytorch
-def net(X):
-    X = d2l.reshape(X, (-1, num_inputs))
-    H = relu(X@W1 + b1)  # Here '@' stands for matrix multiplication
-    return (H@W2 + b2)
-```
-
-```{.python .input}
-#@tab tensorflow
-def net(X):
-    X = d2l.reshape(X, (-1, num_inputs))
-    H = relu(tf.matmul(X, W1) + b1)
-    return tf.matmul(H, W2) + b2
-```
-
-## Loss Function
-
-To ensure numerical stability,
-and because we already implemented
-the softmax function from scratch
-(:numref:`sec_softmax_scratch`),
-we leverage the integrated function from high-level APIs
-for calculating the softmax and cross-entropy loss.
-Recall our earlier discussion of these intricacies
-in :numref:`subsec_softmax-implementation-revisited`.
-We encourage the interested reader
-to examine the source code for the loss function
-to deepen their knowledge of implementation details.
-
-```{.python .input}
-loss = gluon.loss.SoftmaxCrossEntropyLoss()
-```
-
-```{.python .input}
-#@tab pytorch
-loss = nn.CrossEntropyLoss()
-```
-
-```{.python .input}
-#@tab tensorflow
-def loss(y_hat, y):
-    return tf.losses.sparse_categorical_crossentropy(
-        y, y_hat, from_logits=True)
-```
-
-## Training
-
-Fortunately, [**the training loop for MLPs
-is exactly the same as for softmax regression.**]
-Leveraging the `d2l` package again,
-we call the `train_ch3` function
-(see :numref:`sec_softmax_scratch`),
-setting the number of epochs to 10
-and the learning rate to 0.1.
-
-```{.python .input}
-num_epochs, lr = 10, 0.1
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs,
-              lambda batch_size: d2l.sgd(params, lr, batch_size))
-```
-
-```{.python .input}
-#@tab pytorch
-num_epochs, lr = 10, 0.1
-updater = torch.optim.SGD(params, lr=lr)
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, updater)
-```
-
-```{.python .input}
-#@tab tensorflow
-num_epochs, lr = 10, 0.1
-updater = d2l.Updater([W1, W2, b1, b2], lr)
-d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, updater)
-```
-
-To evaluate the learned model,
-we [**apply it on some test data**].
-
-```{.python .input}
-#@tab all
-d2l.predict_ch3(net, test_iter)
-```
-
-## Summary
-
-* We saw that implementing a simple MLP is easy, even when done manually.
-* However, with a large number of layers, implementing MLPs from scratch can still get messy (e.g., naming and keeping track of our model's parameters).
-
-
-## Exercises
-
-1. Change the value of the hyperparameter `num_hiddens` and see how this hyperparameter influences your results. Determine the best value of this hyperparameter, keeping all others constant.
-1. Try adding an additional hidden layer to see how it affects the results.
-1. How does changing the learning rate alter your results? Fixing the model architecture and other hyperparameters (including number of epochs), what learning rate gives you the best results?
-1. What is the best result you can get by optimizing over all the hyperparameters (learning rate, number of epochs, number of hidden layers, number of hidden units per layer) jointly?
-1. Describe why it is much more challenging to deal with multiple hyperparameters.
-1. What is the smartest strategy you can think of for structuring a search over multiple hyperparameters?
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/92)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/93)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/227)
-:end_tab:
diff --git a/chapter_multilayer-perceptrons/mlp.md b/chapter_multilayer-perceptrons/mlp.md
index b80c770..92148b2 100644
--- a/chapter_multilayer-perceptrons/mlp.md
+++ b/chapter_multilayer-perceptrons/mlp.md
@@ -1,36 +1,43 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
 # 多層パーセプトロン
 :label:`sec_mlp`
 
-:numref:`chap_linear` では softmax 回帰 (:numref:`sec_softmax`) を導入し、アルゴリズムをゼロから実装し (:numref:`sec_softmax_scratch`)、高レベル API (:numref:`sec_softmax_concise`) を使用し、低解像度画像から 10 種類の衣類を認識するように分類器をトレーニングしました。その過程で、データをラングリングし、出力を有効な確率分布に強制し、適切な損失関数を適用し、モデルのパラメーターに関して最小化する方法を学びました。単純な線形モデルのコンテキストでこれらの力学を習得したので、この本が主に関係する比較的豊富なモデルのクラスであるディープニューラルネットワークの探索を開始できます。 
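+In :numref:`chap_classification`, we introduced softmax regression (:numref:`sec_softmax`), implementing the algorithm from scratch (:numref:`sec_softmax_scratch`) and using high-level APIs (:numref:`sec_softmax_concise`). This allowed us to train classifiers capable of recognizing 10 categories of clothing from low-resolution images. Along the way, we learned how to wrangle data, coerce our outputs into a valid probability distribution, apply an appropriate loss function, and minimize it with respect to our model's parameters. Now that we have mastered these mechanics in the context of simple linear models, we can launch our exploration of deep neural networks, the comparatively rich class of models with which this book is primarily concerned.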
 
 ## 非表示レイヤー
 
-:numref:`subsec_linear_model` では、バイアスによって加えられた線形変換であるアフィン変換について説明しました。はじめに、:numref:`fig_softmaxreg` に示されている softmax 回帰の例に対応するモデルアーキテクチャを思い出してください。このモデルは、単一のアフィン変換とそれに続くソフトマックス演算により、入力を出力に直接マッピングしました。ラベルがアフィン変換によって入力データと本当に関連しているなら、このアプローチで十分です。しかし、アフィン変換における直線性は「強い」仮定です。 
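+We described affine transformations in :numref:`subsec_linear_model` as linear transformations with added bias. To begin, recall the model architecture corresponding to our softmax regression example, illustrated in :numref:`fig_softmaxreg`. This model maps inputs directly to outputs via a single affine transformation, followed by a softmax operation. If our labels truly were related to the input data by a simple affine transformation, then this approach would be sufficient. However, linearity (in affine transformations) is a *strong* assumption.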
+
+### Limitations of Linear Models
 
-### 線形モデルが間違ってしまうことがある
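+For example, linearity implies the *weaker* assumption of *monotonicity*, i.e., that any increase in our feature must either always cause an increase in our model's output (if the corresponding weight is positive), or always cause a decrease in our model's output (if the corresponding weight is negative). Sometimes that makes sense. For example, if we were trying to predict whether an individual will repay a loan, we might reasonably assume that, all other things being equal, an applicant with a higher income would always be more likely to repay than one with a lower income. While monotonic, this relationship is likely not linearly associated with the probability of repayment. An increase in income from \\$0 to \\$50,000 likely corresponds to a bigger increase in likelihood of repayment than an increase from \\$1 million to \\$1.05 million. One way to handle this might be to post-process our outcome such that linearity becomes more plausible, by using the logistic map (and thus the logarithm of the probability of outcome).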
 
-たとえば、線形性は、*単調性*の*より弱い*仮定を意味します。特徴量が増加すると常にモデルの出力が増加するか (対応する重みが正の場合)、モデルの出力が必ず減少する (対応する重みが負の場合) 必要があります。時にはそれは理にかなっています。たとえば、個人がローンを返済するかどうかを予測しようとした場合、他のすべてを平等に保有すると、収入の高い申請者は、低所得の申請者よりも常に返済する可能性が高いと合理的に想像できます。単調ではあるが、この関係は返済の確率と直線的に関連していない可能性が高い。0万から5万への収入の増加は、100万から105万への増加よりも返済の可能性の大きい増加に相当する可能性が高い。これを処理する 1 つの方法は、たとえば収入の対数を特徴として使用して、線形性がより妥当になるようにデータを前処理することです。 
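+Note that we can easily come up with examples that violate monotonicity. Say, for example, that we want to predict health as a function of body temperature. For individuals with a body temperature above 37°C (98.6°F), higher temperatures indicate greater risk. However, for individuals with body temperatures below 37°C, lower temperatures indicate greater risk! Again, we might resolve the problem with some clever preprocessing, such as using the distance from 37°C as a feature.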
 
-単調性に違反する例を簡単に思いつくことができることに注意してください。たとえば、体温に基づいて死亡確率を予測したいとします。体温が37°C（98.6°F）を超える人の場合、体温が高いほどリスクが高くなります。ただし、体温が37°C未満の個人では、体温が高いほどリスクが低いことを示します。この場合も、巧妙な前処理で問題を解決できるかもしれません。つまり、37°Cからの距離を特徴として使用するかもしれません。 
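+But what about classifying images of cats and dogs? Should increasing the intensity of the pixel at location (13, 17) always increase (or always decrease) the likelihood that the image depicts a dog? Reliance on a linear model corresponds to the implicit assumption that the only requirement for differentiating cats from dogs is to assess the brightness of individual pixels. This approach is doomed to fail in a world where inverting an image preserves the category.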
 
-しかし、猫と犬の画像を分類するのはどうですか？位置 (13, 17) のピクセルの強度を上げると、イメージが犬を描写している可能性が常に高くなる (または常に減少する) べきですか?線形モデルへの依存は、猫と犬を区別するための唯一の要件は個々のピクセルの明るさを評価することであるという暗黙の仮定に対応しています。このアプローチは、画像の反転によってカテゴリが保持される世界では失敗する運命にあります。 
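+And yet despite the apparent absurdity of linearity here, as compared with our previous examples, it is less obvious that we could address the problem with a simple preprocessing fix. This is because the significance of any pixel depends in complex ways on its context (the values of the surrounding pixels). While there might exist a representation of our data that would take into account the relevant interactions among our features, on top of which a linear model would be suitable, we simply do not know how to calculate it by hand. With deep neural networks, we use observational data to jointly learn both a representation via hidden layers and a linear predictor that acts upon that representation.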
 
-しかし、ここでは明らかに不条理な直線性にもかかわらず、前の例と比較して、単純な前処理の修正でこの問題に対処できることはあまり明白ではありません。これは、ピクセルの重要度が、そのコンテキスト (周囲のピクセルの値) によって複雑に左右されるためです。フィーチャ間の関連する相互作用を考慮したデータの表現が存在する可能性がありますが、その上に線形モデルが適していますが、手作業で計算する方法がわかりません。ディープニューラルネットワークでは、観測データを使用して、隠れ層を介した表現と、その表現に作用する線形予測子の両方を共同で学習しました。 
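+This problem of nonlinearity has been studied for at least a century :cite:`Fisher.1928`. For instance, decision trees in their most basic form use a sequence of binary decisions to decide upon class membership :cite:`quinlan2014c4`. Likewise, kernel methods have been used for many decades to model nonlinear dependencies :cite:`Aronszajn.1950`. This has found its way, e.g., into nonparametric spline models :cite:`Wahba.1990` and kernel methods :cite:`Scholkopf.Smola.2002`. It is also something that the brain solves quite naturally. After all, neurons feed into other neurons which, in turn, feed into other neurons again :cite:`Cajal.Azoulay.1894`. Consequently, we have a sequence of relatively simple transformations.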
 
-### 非表示レイヤの組み込み
+### Incorporating Hidden Layers
 
-1 つ以上の隠れ層を組み込むことで、線形モデルのこれらの制限を克服し、より一般的なクラスの関数を処理できます。これを行う最も簡単な方法は、完全に接続された多数のレイヤを重ねて積み重ねることです。各レイヤーは、出力が生成されるまで、その上のレイヤーにフィードされます。最初の $L-1$ 層は表現、最後の層は線形予測子と考えることができます。このアーキテクチャは一般に*マルチレイヤパーセプトロン* と呼ばれ、*MLP* と略されることもあります。以下に、MLP を図式的に示します (:numref:`fig_mlp`)。 
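+We can overcome the limitations of linear models by incorporating one or more hidden layers. The easiest way to do this is to stack many fully connected layers on top of each other. Each layer feeds into the layer above it, until we generate outputs. We can think of the first $L-1$ layers as our representation and the final layer as our linear predictor. This architecture is commonly called a *multilayer perceptron*, often abbreviated as *MLP* (:numref:`fig_mlp`).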
 
 ![An MLP with a hidden layer of 5 hidden units. ](../img/mlp.svg)
 :label:`fig_mlp`
 
-この MLP には 4 つの入力、3 つの出力があり、隠れ層には 5 つの隠れユニットが含まれています。入力層には計算が含まれないため、このネットワークで出力を生成するには、隠れ層と出力層の両方の計算を実装する必要があります。したがって、この MLP の層数は 2 になります。これらのレイヤは両方とも完全に接続されていることに注意してください。すべての入力は隠れ層のすべてのニューロンに影響し、各入力は出力層のすべてのニューロンに影響を与えます。ただし、:numref:`subsec_parameterization-cost-fc-layers` で示唆されているように、レイヤが完全接続されたMLPのパラメータ化コストは非常に高くなる可能性があり、入力または出力サイズ :cite:`Zhang.Tay.Zhang.ea.2021` を変更しなくても、パラメータ節約とモデル有効性のトレードオフにつながる可能性があります。 
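+This MLP has 4 inputs, 3 outputs, and its hidden layer contains 5 hidden units. Since the input layer does not involve any calculations, producing outputs with this network requires implementing the computations for both the hidden and output layers; thus, the number of layers in this MLP is 2. Note that both layers are fully connected. Every input influences every neuron in the hidden layer, and each hidden neuron in turn influences every neuron in the output layer. Alas, we are not quite done yet.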
 
 ### 線形から非線形へ
 
-前述のように、行列 $\mathbf{X} \in \mathbb{R}^{n \times d}$ によって、各例に $d$ の入力 (特徴) がある $n$ 例のミニバッチを示します。隠れ層の隠れ単位が $h$ をもつ 1 つの隠れ層 MLP の場合、$\mathbf{H} \in \mathbb{R}^{n \times h}$ で表すと、隠れ層の出力は次のようになります。
-*非表示リプリゼンテーション*。
-数学またはコードでは、$\mathbf{H}$ は*隠れ層変数* または*隠れ変数* とも呼ばれます。隠れ層と出力層はどちらも完全に接続されているため、隠れ層の重み $\mathbf{W}^{(1)} \in \mathbb{R}^{d \times h}$、バイアス $\mathbf{b}^{(1)} \in \mathbb{R}^{1 \times h}$、出力層の重み $\mathbf{W}^{(2)} \in \mathbb{R}^{h \times q}$、バイアス $\mathbf{b}^{(2)} \in \mathbb{R}^{1 \times q}$ があります。正式には、1 隠れ層 MLP の出力 $\mathbf{O} \in \mathbb{R}^{n \times q}$ を次のように計算します。 
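+As before, we denote by the matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ a minibatch of $n$ examples where each example has $d$ inputs (features). For a one-hidden-layer MLP whose hidden layer has $h$ hidden units, we denote by $\mathbf{H} \in \mathbb{R}^{n \times h}$ the outputs of the hidden layer, which are
+*hidden representations*.
+Since the hidden and output layers are both fully connected, we have hidden-layer weights $\mathbf{W}^{(1)} \in \mathbb{R}^{d \times h}$ and biases $\mathbf{b}^{(1)} \in \mathbb{R}^{1 \times h}$ and output-layer weights $\mathbf{W}^{(2)} \in \mathbb{R}^{h \times q}$ and biases $\mathbf{b}^{(2)} \in \mathbb{R}^{1 \times q}$. This allows us to calculate the outputs $\mathbf{O} \in \mathbb{R}^{n \times q}$ of the one-hidden-layer MLP as follows: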
 
 $$
 \begin{aligned}
@@ -39,15 +46,15 @@ $$
 \end{aligned}
 $$
 
-非表示レイヤーを追加した後、モデルでは追加のパラメーターセットを追跡および更新する必要があることに注意してください。それでは、引き換えに何が得られましたか？上で定義したモデルでは、*トラブルに対して何も得られない* ということに驚かれるかもしれません。理由は明白です。上記の隠れ単位は入力のアフィン関数によって与えられ、出力 (pre-softmax) は隠れ単位のアフィン関数にすぎません。アフィン関数のアフィン関数はそれ自体がアフィン関数です。さらに、線形モデルはすでにあらゆるアフィン関数を表現できました。 
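+Note that after adding the hidden layer, our model now requires us to track and update additional sets of parameters. So what have we gained in exchange? You might be surprised to find out that, in the model defined above, *we gain nothing for our troubles*! The reason is plain. The hidden units above are given by an affine function of the inputs, and the outputs (pre-softmax) are just an affine function of the hidden units. An affine function of an affine function is itself an affine function. Moreover, our linear model was already capable of representing any affine function.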
 
-重みの任意の値について、隠れ層を折りたたむだけで、パラメータ $\mathbf{W} = \mathbf{W}^{(1)}\mathbf{W}^{(2)}$ と $\mathbf{b} = \mathbf{b}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}$ をもつ同等の単層モデルが得られることを証明することで、同等性を公式に見ることができます。 
+To see this formally, we can just collapse out the hidden layer in the above definition, yielding an equivalent single-layer model with parameters $\mathbf{W} = \mathbf{W}^{(1)}\mathbf{W}^{(2)}$ and $\mathbf{b} = \mathbf{b}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}$:
 
 $$
 \mathbf{O} = (\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)})\mathbf{W}^{(2)} + \mathbf{b}^{(2)} = \mathbf{X} \mathbf{W}^{(1)}\mathbf{W}^{(2)} + \mathbf{b}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)} = \mathbf{X} \mathbf{W} + \mathbf{b}.
 $$
 
-多層アーキテクチャの可能性を実現するためには、アフィン変換後に隠れた各ユニットに適用する非線形*活性化関数* $\sigma$ という重要な要素をもう1つ必要とします。アクティベーション関数 ($\sigma(\cdot)$ など) の出力は*activations* と呼ばれます。一般に、アクティベーション関数が配置されると、MLP を線形モデルに折りたたむことができなくなります。 
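+In order to realize the potential of multilayer architectures, we need one more key ingredient: a nonlinear *activation function* $\sigma$ to be applied to each hidden unit following the affine transformation. For instance, a popular choice is the ReLU (rectified linear unit) activation function :cite:`Nair.Hinton.2010` $\sigma(x) = \mathrm{max}(0, x)$, which operates on its arguments elementwise. The outputs of activation functions $\sigma(\cdot)$ are called *activations*. In general, with activation functions in place, it is no longer possible to collapse our MLP into a linear model: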
 
 $$
 \begin{aligned}
@@ -56,24 +63,27 @@ $$
 \end{aligned}
 $$
 
-$\mathbf{X}$ の各行は表記法の乱用を伴うミニバッチの例に対応しているため、非線形性 $\sigma$ をその入力に行単位、つまり一度に 1 つずつ適用するように定義します。:numref:`subsec_softmax_vectorization` では、softmax の表記法がローワイズ演算を表すために同じ方法で使用されたことに注意してください。多くの場合、このセクションのように、隠れ層に適用するアクティベーション関数は行単位ではなく要素単位です。つまり、レイヤーの線形部分を計算した後、他の隠れユニットがとる値を見ることなく、各アクティベーションを計算できます。これはほとんどのアクティベーション関数に当てはまります。 
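+Since each row in $\mathbf{X}$ corresponds to an example in the minibatch, with some abuse of notation, we define the nonlinearity $\sigma$ to apply to its inputs in a row-wise fashion, i.e., one example at a time. Note that we used the same notation for softmax when we denoted a row-wise operation in :numref:`subsec_softmax_vectorization`. Quite frequently the activation functions we use apply not merely row-wise but elementwise. That means that after computing the linear portion of the layer, we can calculate each activation without looking at the values taken by the other hidden units.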
 
-より一般的な MLP を構築するために、$\mathbf{H}^{(1)} = \sigma_1(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)})$ や $\mathbf{H}^{(2)} = \sigma_2(\mathbf{H}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)})$ などの隠れ層を重ねて積み重ね続け、表現力豊かなモデルを生み出すことができます。 
+To build more general MLPs, we can continue stacking such hidden layers, e.g., $\mathbf{H}^{(1)} = \sigma_1(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)})$ and $\mathbf{H}^{(2)} = \sigma_2(\mathbf{H}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)})$, one atop another, yielding ever more expressive models.
 
 ### ユニバーサル近似器
 
-MLPは、各入力の値に依存する隠れニューロンを介して、入力間の複雑な相互作用を捉えることができます。隠れノードは簡単に設計でき、例えば、一対の入力に対して基本的な論理演算など、任意の計算を実行できます。さらに、活性化関数の特定の選択肢については、MLPが普遍的近似器であることが広く知られている。単一の隠れ層ネットワークであっても、十分なノード (おそらく非常に多い) と適切な重みのセットがあれば、どの関数もモデル化できますが、実際にその関数を学習するのは難しい部分です。ニューラルネットワークは C プログラミング言語に少し似ていると考えるかもしれません。この言語は、他の現代言語と同様に、あらゆる計算可能なプログラムを表現することができます。しかし、実際にあなたの仕様に合ったプログラムを思いつくのは難しいことです。 
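+We know that the brain is capable of very sophisticated statistical analysis. As such, it is worth asking just *how powerful* a deep network could be. This question has been answered multiple times, e.g., in :citet:`Cybenko.1989` in the context of MLPs, and in :citet:`micchelli1984interpolation` in the context of reproducing kernel Hilbert spaces in a way that could be seen as radial basis function (RBF) networks with a single hidden layer. These (and related results) suggest that even with a single-hidden-layer network, given enough nodes (possibly absurdly many) and the right set of weights, we can model any function. Actually learning that function is the hard part, though. You might think of your neural network as being a bit like the C programming language. The language, like any other modern language, is capable of expressing any computable program. But actually coming up with a program that meets your specifications is the hard part.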
 
-しかも、単一隠れ層ネットワークだから
-*どんな機能も学べる*
-は、単一隠れ層ネットワークに関するすべての問題を解決しようとする必要があるという意味ではありません。実際、より深い (より広い) ネットワークを使用することで、多くの関数をよりコンパクトに近似することができます。より厳密な議論については、以降の章で触れます。 
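+Moreover, just because a single-hidden-layer network
+*can* learn any function
+does not mean that you should try to solve all of your problems with single-hidden-layer networks. In fact, in this case kernel methods are far more effective, since they are capable of solving the problem
+*exactly* even in infinite-dimensional spaces :cite:`Kimeldorf.Wahba.1971,Scholkopf.Herbrich.Smola.2001`.
+In fact, we can approximate many functions much more compactly by using deeper (rather than wider) networks :cite:`Simonyan.Zisserman.2014`. We will touch upon more rigorous arguments in subsequent chapters.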
 
-## アクティベーション関数
+## Activation Functions
 :label:`subsec_activation-functions`
 
-活性化関数は、重み付けされた和を計算し、それにバイアスを加えることによって、ニューロンを活性化すべきかどうかを決定します。入力信号を出力に変換する微分可能演算子ですが、そのほとんどは非線形性を加えます。アクティベーション関数はディープラーニングの基本であるため、(**一般的なアクティベーション関数について簡単に調べてみましょう**)。
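+Activation functions decide whether a neuron should be activated or not by calculating the weighted sum and further adding bias to it. They are differentiable operators for transforming input signals to outputs, while most of them add nonlinearity. Because activation functions are fundamental to deep learning, (**let's briefly survey some common ones**).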
 
 ```{.python .input}
+%%tab mxnet
 %matplotlib inline
 from d2l import mxnet as d2l
 from mxnet import autograd, np, npx
@@ -81,14 +91,14 @@ npx.set_np()
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 %matplotlib inline
 from d2l import torch as d2l
 import torch
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 %matplotlib inline
 from d2l import tensorflow as d2l
 import tensorflow as tf
@@ -96,13 +106,14 @@ import tensorflow as tf
 
 ### ReLU 関数
 
-実装のシンプルさとさまざまな予測タスクでの優れたパフォーマンスの両方から、最も一般的な選択肢は、*修正された線形単位* (*RELU*) です。[**relU は非常に単純な非線形変換を提供します**]。要素 $x$ が与えられた場合、関数はその要素の最大値と $0$ として定義されます。 
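+Owing both to its simplicity of implementation and its good performance on a variety of predictive tasks, the most popular choice is the *rectified linear unit* (*ReLU*) :cite:`Nair.Hinton.2010`. [**ReLU provides a very simple nonlinear transformation**]. Given an element $x$, the function is defined as the maximum of that element and $0$: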
 
 $$\operatorname{ReLU}(x) = \max(x, 0).$$
 
-非公式には、ReLU 関数は正の要素のみを保持し、対応するアクティベーションを 0 に設定することですべての負の要素を破棄します。ある程度の直感を得るために、関数をプロットすることができます。ご覧のとおり、活性化関数は区分的線形です。
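+Informally, the ReLU function retains only positive elements and discards all negative elements by setting the corresponding activations to 0. To gain some intuition, we can plot the function. As you can see, the activation function is piecewise linear.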
 
 ```{.python .input}
+%%tab mxnet
 x = np.arange(-8.0, 8.0, 0.1)
 x.attach_grad()
 with autograd.record():
@@ -111,89 +122,92 @@ d2l.plot(x, y, 'x', 'relu(x)', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
 y = torch.relu(x)
 d2l.plot(x.detach(), y.detach(), 'x', 'relu(x)', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 x = tf.Variable(tf.range(-8.0, 8.0, 0.1), dtype=tf.float32)
 y = tf.nn.relu(x)
 d2l.plot(x.numpy(), y.numpy(), 'x', 'relu(x)', figsize=(5, 2.5))
 ```
 
-入力が負の場合、ReLU 関数の微分は 0 になり、入力が正の場合、ReLU 関数の微分は 1 になります。入力が正確に 0 に等しい値を取る場合、ReLU 関数は微分できないことに注意してください。このような場合、既定では左辺の微分が使用され、入力が 0 のときに微分は 0 になります。入力が実際にはゼロになることはないので、これを回避できます。微妙な境界条件が重要であれば、私たちはおそらく工学ではなく (*実際の) 数学をやっているという古い格言があります。その常識がここに当てはまるかもしれません。以下にプロットした ReLU 関数の導関数をプロットします。
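+When the input is negative, the derivative of the ReLU function is 0, and when the input is positive, the derivative of the ReLU function is 1. Note that the ReLU function is not differentiable when the input takes a value precisely equal to 0. In these cases, we default to the left-hand-side derivative and say that the derivative is 0 when the input is 0. We can get away with this because the input may never actually be zero (mathematicians would say that it is nondifferentiable on a set of measure zero). There is an old adage that if subtle boundary conditions matter, we are probably doing (*real*) mathematics, not engineering. That conventional wisdom may apply here, or at least, the fact that we are not performing constrained optimization :cite:`Mangasarian.1965,Rockafellar.1970`. We plot the derivative of the ReLU function below.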
 
 ```{.python .input}
+%%tab mxnet
 y.backward()
 d2l.plot(x, x.grad, 'x', 'grad of relu', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 y.backward(torch.ones_like(x), retain_graph=True)
 d2l.plot(x.detach(), x.grad, 'x', 'grad of relu', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 with tf.GradientTape() as t:
     y = tf.nn.relu(x)
 d2l.plot(x.numpy(), t.gradient(y, x).numpy(), 'x', 'grad of relu',
          figsize=(5, 2.5))
 ```
 
-ReLUを使用する理由は、その派生物が特にうまく動作するためです。ReLUは消滅するか、単に議論を通過させるかのどちらかです。これにより、最適化の動作が改善され、以前のバージョンのニューラルネットワークに悩まされていた勾配の消失という十分に文書化された問題が軽減されました (これについては後で詳しく説明します)。 
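+The reason for using ReLU is that its derivatives are particularly well behaved: either they vanish or they just let the argument through. This makes optimization better behaved, and it mitigated the well-documented problem of vanishing gradients that plagued previous versions of neural networks (more on this later).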
 
-*パラメータ化された Relu* (*PreLU*) 関数 :cite:`He.Zhang.Ren.ea.2015` など、ReLU 関数には多くのバリアントがあることに注意してください。このバリエーションにより ReLU に線形項が追加されるため、引数が負の場合でも、一部の情報は引き続き通過します。 
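+Note that there are many variants of the ReLU function, including the *parametrized ReLU* (*pReLU*) function :cite:`He.Zhang.Ren.ea.2015`. This variation adds a linear term to ReLU, so some information still gets through, even when the argument is negative: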
 
 $$\operatorname{pReLU}(x) = \max(0, x) + \alpha \min(0, x).$$
 
 ### シグモイド関数
 
-[***シグモイド関数* は入力を変換します**] は $\mathbb{R}$ の範囲にある値です (**区間 (0, 1) にある出力へ**) そのため、シグモイドはしばしば「*squashing function*」と呼ばれます。この関数は範囲 (-inf, inf) の入力を (0, 1) の範囲の値に押しつぶします: 
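+[**The *sigmoid function* transforms those inputs**] whose values lie in the domain $\mathbb{R}$ (**to outputs that lie on the interval (0, 1).**) For that reason, the sigmoid is often called a *squashing function*: it squashes any input in the range (-inf, inf) to some value in the range (0, 1):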
 
 $$\operatorname{sigmoid}(x) = \frac{1}{1 + \exp(-x)}.$$
 
-初期のニューラルネットワークでは、科学者は*発火*または*発火しない*生物学的ニューロンのモデル化に興味を持っていました。したがって、この分野のパイオニアは、人工ニューロンの発明者であるマカロックとピッツにまでさかのぼり、しきい値測定ユニットに焦点を合わせました。しきい値処理のアクティブ化は、入力があるしきい値を下回る場合は値 0 を、入力がしきい値を上回った場合は値 1 を取ります。 
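+In the earliest neural networks, scientists were interested in modeling biological neurons that either *fire* or *do not fire*. Thus the pioneers of this field, going all the way back to McCulloch and Pitts, the inventors of the artificial neuron, focused on thresholding units :cite:`McCulloch.Pitts.1943`. A thresholding activation takes value 0 when its input is below some threshold and value 1 when the input exceeds the threshold.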
 
-勾配ベースの学習に注目が移ったとき、シグモイド関数はしきい値単位に対する滑らかで微分可能な近似であるため、自然な選択でした。シグモイドは、出力をバイナリ分類問題の確率として解釈する場合 (シグモイドはソフトマックスの特殊なケースと考えることができます)、出力ユニットの活性化関数として広く使用されています。しかし、シグモイドは、隠れ層でのほとんどの使用のために、より単純でトレーニングが容易なReLUに置き換えられています。リカレントニューラルネットワークに関する後の章では、シグモイドユニットを活用して時間の経過に伴う情報の流れを制御するアーキテクチャについて説明します。 
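+When attention shifted to gradient-based learning, the sigmoid function was a natural choice because it is a smooth, differentiable approximation to a thresholding unit. Sigmoids are still widely used as activation functions on the output units when we want to interpret the outputs as probabilities for binary classification problems: you can think of the sigmoid as a special case of the softmax. However, the sigmoid has largely been replaced by the simpler and more easily trainable ReLU for most use in hidden layers. Much of this has to do with the fact that the sigmoid poses challenges for optimization :cite:`LeCun.Bottou.Orr.ea.1998`, since its gradient vanishes for large positive *and* negative arguments. This can lead to plateaus that are difficult to escape from. Nonetheless, sigmoids are important. In later chapters (e.g., :numref:`sec_lstm`) on recurrent neural networks, we will describe architectures that leverage sigmoid units to control the flow of information across time.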
 
 以下に、シグモイド関数をプロットします。入力が 0 に近い場合、シグモイド関数は線形変換に近づくことに注意してください。
 
 ```{.python .input}
+%%tab mxnet
 with autograd.record():
     y = npx.sigmoid(x)
 d2l.plot(x, y, 'x', 'sigmoid(x)', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 y = torch.sigmoid(x)
 d2l.plot(x.detach(), y.detach(), 'x', 'sigmoid(x)', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 y = tf.nn.sigmoid(x)
 d2l.plot(x.numpy(), y.numpy(), 'x', 'sigmoid(x)', figsize=(5, 2.5))
 ```
 
-シグモイド関数の導関数は次の方程式で求められます。 
+The derivative of the sigmoid function is given by the following equation:
 
 $$\frac{d}{dx} \operatorname{sigmoid}(x) = \frac{\exp(-x)}{(1 + \exp(-x))^2} = \operatorname{sigmoid}(x)\left(1-\operatorname{sigmoid}(x)\right).$$
 
-シグモイド関数の導関数を以下にプロットします。入力が 0 の場合、シグモイド関数の導関数は最大 0.25 に達することに注意してください。入力が 0 からどちらかの方向に発散するにつれて、微分は 0 に近づきます。
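+The derivative of the sigmoid function is plotted below. Note that when the input is 0, the derivative of the sigmoid function reaches a maximum of 0.25. As the input diverges from 0 in either direction, the derivative approaches 0.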
 
 ```{.python .input}
+%%tab mxnet
 y.backward()
 d2l.plot(x, x.grad, 'x', 'grad of sigmoid', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 # Clear out previous gradients
 x.grad.data.zero_()
 y.backward(torch.ones_like(x),retain_graph=True)
@@ -201,52 +215,55 @@ d2l.plot(x.detach(), x.grad, 'x', 'grad of sigmoid', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 with tf.GradientTape() as t:
     y = tf.nn.sigmoid(x)
 d2l.plot(x.numpy(), t.gradient(y, x).numpy(), 'x', 'grad of sigmoid',
          figsize=(5, 2.5))
 ```
 
-### タン関数
+### Tanh Function
+:label:`subsec_tanh`
 
-シグモイド関数と同様に、[**tanh (双曲線正接) 関数も入力をスカッシュ**] し、区間 (**-1 と 1 の間**) の要素に変換します。 
+Like the sigmoid function, [**the tanh (hyperbolic tangent) function also squashes its inputs**], transforming them into elements on the interval (**between -1 and 1**):
 
 $$\operatorname{tanh}(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}.$$
 
-tanh 関数を以下にプロットします。入力が 0 に近づくにつれて、関数 tanh は線形変換に近づくことに注意してください。関数の形状はシグモイド関数の形状と似ていますが、tanh 関数は座標系の原点を中心に点対称性を示します。
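+We plot the tanh function below. Note that as the input nears 0, the tanh function approaches a linear transformation. Although the shape of the function is similar to that of the sigmoid function, the tanh function exhibits point symmetry about the origin of the coordinate system :cite:`Kalman.Kwasny.1992`.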
 
 ```{.python .input}
+%%tab mxnet
 with autograd.record():
     y = np.tanh(x)
 d2l.plot(x, y, 'x', 'tanh(x)', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 y = torch.tanh(x)
 d2l.plot(x.detach(), y.detach(), 'x', 'tanh(x)', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 y = tf.nn.tanh(x)
 d2l.plot(x.numpy(), y.numpy(), 'x', 'tanh(x)', figsize=(5, 2.5))
 ```
 
-tanh 関数の導関数は次のようになります。 
+The derivative of the tanh function is:
 
 $$\frac{d}{dx} \operatorname{tanh}(x) = 1 - \operatorname{tanh}^2(x).$$
 
-tanh 関数の導関数を以下にプロットします。入力が 0 に近づくにつれて、関数 tanh の微分は最大値の 1 に近づきます。シグモイド関数で見たように、入力が 0 からいずれかの方向に移動すると、tanh 関数の導関数は0に近づきます。
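+The derivative of the tanh function is plotted below. As the input nears 0, the derivative approaches a maximum of 1. And as we saw with the sigmoid function, as the input moves away from 0 in either direction, the derivative of the tanh function approaches 0.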
 
 ```{.python .input}
+%%tab mxnet
 y.backward()
 d2l.plot(x, x.grad, 'x', 'grad of tanh', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 # Clear out previous gradients.
 x.grad.data.zero_()
 y.backward(torch.ones_like(x),retain_graph=True)
@@ -254,26 +271,30 @@ d2l.plot(x.detach(), x.grad, 'x', 'grad of tanh', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 with tf.GradientTape() as t:
     y = tf.nn.tanh(x)
 d2l.plot(x.numpy(), t.gradient(y, x).numpy(), 'x', 'grad of tanh',
          figsize=(5, 2.5))
 ```
 
-要約すると、表現力豊かな多層ニューラルネットワークアーキテクチャを構築するために非線形性を組み込む方法がわかりました。補足として、あなたの知識はすでに1990年頃に開業医と同様のツールキットを指揮しています。強力なオープンソースのディープラーニングフレームワークを活用して、わずか数行のコードでモデルを迅速に構築できるため、1990 年代に作業する誰よりも有利な点もあります。これまで、これらのネットワークのトレーニングでは、研究者は数千行の C と Fortran をコーディングする必要がありました。 
+## Summary
 
-## [概要
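+We now know how to incorporate nonlinearities to build expressive multilayer neural network architectures. As a side note, your knowledge already puts you in command of a toolkit similar to that of a practitioner circa 1990. In some ways, you have an advantage over anyone working in the 1990s, because you can leverage powerful open-source deep learning frameworks to build models rapidly, using only a few lines of code. Previously, training these networks required researchers to code up layers and derivatives explicitly in C, Fortran, or even Lisp (in the case of LeNet).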
 
-* MLP は、出力層と入力層の間に 1 つまたは複数の完全に接続された隠れ層を追加し、活性化関数によって隠れ層の出力を変換します。
-* 一般的に使用されるアクティベーション関数には、ReLU 関数、シグモイド関数、tanh 関数があります。
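+A secondary benefit is that ReLU is significantly more amenable to optimization than the sigmoid or the tanh function. One could argue that this was one of the key innovations that helped the resurgence of deep learning over the past decade. Note, though, that research in activation functions has not stopped. For instance, the Swish activation function $\sigma(x) = x \operatorname{sigmoid}(\beta x)$ as proposed in :cite:`Ramachandran.Zoph.Le.2017` can yield better accuracy in many cases.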
 
 ## 演習
 
-1. preLU アクティベーション関数の微分を計算します。
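+1. Show that adding layers to a *linear* deep network, i.e., a network without nonlinearity $\sigma$, can never increase the expressive power of the network. Give an example where it actively reduces it.
+1. Compute the derivative of the pReLU activation function.
+1. Compute the derivative of the Swish activation function $x \operatorname{sigmoid}(\beta x)$.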
 1. ReLU (または preLU) のみを使用する MLP が連続区分的線形関数を構成することを示します。
-1. $\operatorname{tanh}(x) + 1 = 2 \operatorname{sigmoid}(2x)$を見せて。
-1. 一度に 1 つのミニバッチに適用される非線形性があると仮定します。これによってどのような問題が発生すると予想されますか？
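+1. Sigmoid and tanh are very similar.
+    1. Show that $\operatorname{tanh}(x) + 1 = 2 \operatorname{sigmoid}(2x)$.
+    1. Prove that the function classes parametrized by both nonlinearities are identical. Hint: affine layers have bias terms, too.
+1. Assume that we have a nonlinearity that applies to one minibatch at a time, such as batch normalization :cite:`Ioffe.Szegedy.2015`. What kinds of problems do you expect this to cause?
+1. Provide an example where the gradients vanish for the sigmoid activation function.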
 
 :begin_tab:`mxnet`
 [Discussions](https://discuss.d2l.ai/t/90)
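
A quick numerical check of three claims made in the mlp.md text above may help when reviewing the translated prose: that two stacked affine layers collapse into a single one with $\mathbf{W} = \mathbf{W}^{(1)}\mathbf{W}^{(2)}$ and $\mathbf{b} = \mathbf{b}^{(1)}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}$, that the sigmoid derivative peaks at 0.25 while the tanh derivative peaks at 1, and that $\tanh(x) + 1 = 2\,\operatorname{sigmoid}(2x)$. This is only a sketch in plain NumPy rather than the book's `d2l` helpers; the shapes, the random seed, and the helper names `sigmoid` and `relu` are illustrative assumptions and do not come from the source.

```python
import numpy as np


def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + exp(-x)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))


def relu(x):
    """ReLU, max(x, 0), applied elementwise."""
    return np.maximum(x, 0.0)


# Illustrative sizes (assumed): n examples, d inputs, h hidden units, q outputs.
rng = np.random.default_rng(0)
n, d, h, q = 4, 3, 5, 2
X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, h)), rng.normal(size=(1, h))
W2, b2 = rng.normal(size=(h, q)), rng.normal(size=(1, q))

# Without an activation, the two affine layers collapse into one affine map.
O_two_layer = (X @ W1 + b1) @ W2 + b2
W, b = W1 @ W2, b1 @ W2 + b2
O_collapsed = X @ W + b
linear_layers_collapse = np.allclose(O_two_layer, O_collapsed)

# With ReLU between the layers, the collapsed model no longer matches
# in general (it would only match if no pre-activation were negative).
O_relu = relu(X @ W1 + b1) @ W2 + b2
relu_matches_collapsed = np.allclose(O_relu, O_collapsed)

# Derivatives on the same grid the chapter uses for its plots.
x = np.arange(-8.0, 8.0, 0.1)
sigmoid_grad = sigmoid(x) * (1.0 - sigmoid(x))
tanh_grad = 1.0 - np.tanh(x) ** 2

# Identity from the exercises: tanh(x) + 1 = 2 * sigmoid(2x).
identity_holds = np.allclose(np.tanh(x) + 1.0, 2.0 * sigmoid(2.0 * x))

print("affine layers collapse:", linear_layers_collapse)
print("ReLU MLP equals collapsed model:", relu_matches_collapsed)
print("max sigmoid'(x):", sigmoid_grad.max())
print("max tanh'(x):", tanh_grad.max())
print("tanh(x) + 1 == 2 sigmoid(2x):", identity_holds)
```

Both derivatives shrink toward 0 at the ends of the grid, which is the saturation behavior the text gives as the reason sigmoid is harder to optimize than ReLU in hidden layers.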
diff --git a/chapter_multilayer-perceptrons/mlp_origin.md b/chapter_multilayer-perceptrons/mlp_origin.md
index de739c7..f2191d3 100644
--- a/chapter_multilayer-perceptrons/mlp_origin.md
+++ b/chapter_multilayer-perceptrons/mlp_origin.md
@@ -1,12 +1,17 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
 # Multilayer Perceptrons
 :label:`sec_mlp`
 
-In :numref:`chap_linear`, we introduced
+In :numref:`chap_classification`, we introduced
 softmax regression (:numref:`sec_softmax`),
 implementing the algorithm from scratch
 (:numref:`sec_softmax_scratch`) and using high-level APIs
-(:numref:`sec_softmax_concise`),
-and training classifiers to recognize
+(:numref:`sec_softmax_concise`). This allowed us to
+train classifiers capable of recognizing
 10 categories of clothing from low-resolution images.
 Along the way, we learned how to wrangle data,
 coerce our outputs into a valid probability distribution,
@@ -21,24 +26,24 @@ with which this book is primarily concerned.
 
 ## Hidden Layers
 
-We have described the affine transformation in
-:numref:`subsec_linear_model`,
-which is a linear transformation added by a bias.
+We described affine transformations in
+:numref:`subsec_linear_model` as
+linear transformations with added bias.
 To begin, recall the model architecture
 corresponding to our softmax regression example,
 illustrated in  :numref:`fig_softmaxreg`.
-This model mapped our inputs directly to our outputs
+This model maps inputs directly to outputs
 via a single affine transformation,
 followed by a softmax operation.
 If our labels truly were related
-to our input data by an affine transformation,
+to the input data by a simple affine transformation,
 then this approach would be sufficient.
-But linearity in affine transformations is a *strong* assumption.
+However, linearity (in affine transformations) is a *strong* assumption.
 
-### Linear Models May Go Wrong
+### Limitations of Linear Models
 
 For example, linearity implies the *weaker*
-assumption of *monotonicity*:
+assumption of *monotonicity*, i.e.,
 that any increase in our feature must
 either always cause an increase in our model's output
 (if the corresponding weight is positive),
@@ -47,33 +52,32 @@ or always cause a decrease in our model's output
 Sometimes that makes sense.
 For example, if we were trying to predict
 whether an individual will repay a loan,
-we might reasonably imagine that holding all else equal,
+we might reasonably assume that, all other things being equal,
 an applicant with a higher income
 would always be more likely to repay
 than one with a lower income.
 While monotonic, this relationship likely
 is not linearly associated with the probability of
-repayment. An increase in income from 0 to 50 thousand
+repayment. An increase in income from \\$0 to \\$50,000
 likely corresponds to a bigger increase
 in likelihood of repayment
-than an increase from 1 million to 1.05 million.
-One way to handle this might be to preprocess
-our data such that linearity becomes more plausible,
-say, by using the logarithm of income as our feature.
-
+than an increase from \\$1 million to \\$1.05 million.
+One way to handle this might be to post-process our outcome
+such that linearity becomes more plausible,
+by using the logistic map (and thus the logarithm of the probability of outcome).
 
 Note that we can easily come up with examples
 that violate monotonicity.
-Say for example that we want to predict probability
-of death based on body temperature.
+Say for example that we want to predict health as a function
+of body temperature.
 For individuals with a body temperature
 above 37°C (98.6°F),
 higher temperatures indicate greater risk.
 However, for individuals with body temperatures
-below 37° C, higher temperatures indicate lower risk!
-In this case too, we might resolve the problem
-with some clever preprocessing.
-Namely, we might use the distance from 37°C as our feature.
+below 37°C, lower temperatures indicate greater risk!
+Again, we might resolve the problem
+with some clever preprocessing, such as using the distance from 37°C
+as a feature.
 
 
 But what about classifying images of cats and dogs?
@@ -88,12 +92,11 @@ the brightness of individual pixels.
 This approach is doomed to fail in a world
 where inverting an image preserves the category.
 
-
 And yet despite the apparent absurdity of linearity here,
 as compared with our previous examples,
 it is less obvious that we could address the problem
 with a simple preprocessing fix.
-That is because the significance of any pixel
+This is because the significance of any pixel
 depends in complex ways on its context
 (the values of the surrounding pixels).
 While there might exist a representation of our data
@@ -105,14 +108,24 @@ With deep neural networks, we used observational data
 to jointly learn both a representation via hidden layers
 and a linear predictor that acts upon that representation.
 
+This problem of nonlinearity has been studied for at least a
+century :cite:`Fisher.1928`. For instance, decision trees
+in their most basic form use a sequence of binary decisions to
+decide upon class membership :cite:`quinlan2014c4`. Likewise, kernel
+methods have been used for many decades to model nonlinear dependencies
+:cite:`Aronszajn.1950`. This has found its way, e.g., into
+nonparametric spline models :cite:`Wahba.1990` and kernel methods
+:cite:`Scholkopf.Smola.2002`. It is also something that the brain solves
+quite naturally. After all, neurons feed into other neurons which,
+in turn, feed into other neurons again :cite:`Cajal.Azoulay.1894`.
+Consequently we have a sequence of relatively simple transformations.
 
 ### Incorporating Hidden Layers
 
-We can overcome these limitations of linear models
-and handle a more general class of functions
+We can overcome the limitations of linear models
 by incorporating one or more hidden layers.
 The easiest way to do this is to stack
-many fully-connected layers on top of each other.
+many fully connected layers on top of each other.
 Each layer feeds into the layer above it,
 until we generate outputs.
 We can think of the first $L-1$ layers
@@ -120,8 +133,7 @@ as our representation and the final layer
 as our linear predictor.
 This architecture is commonly called
 a *multilayer perceptron*,
-often abbreviated as *MLP*.
-Below, we depict an MLP diagrammatically (:numref:`fig_mlp`).
+often abbreviated as *MLP* (:numref:`fig_mlp`).
 
 ![An MLP with a hidden layer of 5 hidden units. ](../img/mlp.svg)
 :label:`fig_mlp`
@@ -133,33 +145,24 @@ producing outputs with this network
 requires implementing the computations
 for both the hidden and output layers;
 thus, the number of layers in this MLP is 2.
-Note that these layers are both fully connected.
+Note that both layers are fully connected.
 Every input influences every neuron in the hidden layer,
 and each of these in turn influences
-every neuron in the output layer.
-However, as suggested by :numref:`subsec_parameterization-cost-fc-layers`,
-the parameterization cost of MLPs
-with fully-connected layers
-can be prohibitively high,
-which may motivate
-tradeoff between parameter saving and model effectiveness even without changing the input or output size :cite:`Zhang.Tay.Zhang.ea.2021`.
-
-
+every neuron in the output layer. Alas, we are not quite
+done yet.
 
 ### From Linear to Nonlinear
 
-
-As before, by the matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$,
-we denote a minibatch of $n$ examples where each example has $d$ inputs (features).
+As before, we denote by the matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$
+a minibatch of $n$ examples where each example has $d$ inputs (features).
 For a one-hidden-layer MLP whose hidden layer has $h$ hidden units,
-denote by $\mathbf{H} \in \mathbb{R}^{n \times h}$
+we denote by $\mathbf{H} \in \mathbb{R}^{n \times h}$
 the outputs of the hidden layer, which are
 *hidden representations*.
-In mathematics or code, $\mathbf{H}$ is also known as a *hidden-layer variable* or a *hidden variable*.
 Since the hidden and output layers are both fully connected,
 we have hidden-layer weights $\mathbf{W}^{(1)} \in \mathbb{R}^{d \times h}$ and biases $\mathbf{b}^{(1)} \in \mathbb{R}^{1 \times h}$
 and output-layer weights $\mathbf{W}^{(2)} \in \mathbb{R}^{h \times q}$ and biases $\mathbf{b}^{(2)} \in \mathbb{R}^{1 \times q}$.
-Formally, we calculate the outputs $\mathbf{O} \in \mathbb{R}^{n \times q}$
+This allows us to calculate the outputs $\mathbf{O} \in \mathbb{R}^{n \times q}$
 of the one-hidden-layer MLP as follows:
 
 $$
@@ -169,8 +172,6 @@ $$
 \end{aligned}
 $$
 
-
-
 Note that after adding the hidden layer,
 our model now requires us to track and update
 additional sets of parameters.
@@ -188,10 +189,7 @@ is itself an affine function.
 Moreover, our linear model was already
 capable of representing any affine function.
 
-
-We can view the equivalence formally
-by proving that for any values of the weights,
-we can just collapse out the hidden layer,
+To see this formally, we can just collapse out the hidden layer in the above definition,
 yielding an equivalent single-layer model with parameters
 $\mathbf{W} = \mathbf{W}^{(1)}\mathbf{W}^{(2)}$ and $\mathbf{b} = \mathbf{b}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}$:
 
@@ -199,19 +197,18 @@ $$
 \mathbf{O} = (\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)})\mathbf{W}^{(2)} + \mathbf{b}^{(2)} = \mathbf{X} \mathbf{W}^{(1)}\mathbf{W}^{(2)} + \mathbf{b}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)} = \mathbf{X} \mathbf{W} + \mathbf{b}.
 $$
 
-
 In order to realize the potential of multilayer architectures,
 we need one more key ingredient: a
 nonlinear *activation function* $\sigma$
 to be applied to each hidden unit
-following the affine transformation.
-The outputs of activation functions
-(e.g., $\sigma(\cdot)$)
+following the affine transformation. For instance, a popular
+choice is the ReLU (Rectified Linear Unit) activation function :cite:`Nair.Hinton.2010`
+$\sigma(x) = \mathrm{max}(0, x)$ operating on its arguments element-wise.
+The outputs of activation functions $\sigma(\cdot)$
 are called *activations*.
 In general, with activation functions in place,
 it is no longer possible to collapse our MLP into a linear model:
 
-
 $$
 \begin{aligned}
     \mathbf{H} & = \sigma(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}), \\
@@ -221,18 +218,14 @@ $$
 
 Since each row in $\mathbf{X}$ corresponds to an example in the minibatch,
 with some abuse of notation, we define the nonlinearity
-$\sigma$ to apply to its inputs in a rowwise fashion,
+$\sigma$ to apply to its inputs in a row-wise fashion,
 i.e., one example at a time.
-Note that we used the notation for softmax
-in the same way to denote a rowwise operation in :numref:`subsec_softmax_vectorization`.
-Often, as in this section, the activation functions
-that we apply to hidden layers are not merely rowwise,
-but elementwise.
-That means that after computing the linear portion of the layer,
+Note that we used the same notation for softmax
+when we denoted a row-wise operation in :numref:`subsec_softmax_vectorization`.
+Quite frequently the activation functions we use apply not merely row-wise but
+element-wise. That means that after computing the linear portion of the layer,
 we can calculate each activation
 without looking at the values taken by the other hidden units.
-This is true for most activation functions.
-
 
 To build more general MLPs, we can continue stacking
 such hidden layers,
@@ -242,19 +235,16 @@ one atop another, yielding ever more expressive models.
 
 ### Universal Approximators
 
-MLPs can capture complex interactions
-among our inputs via their hidden neurons,
-which depend on the values of each of the inputs.
-We can easily design hidden nodes
-to perform arbitrary computation,
-for instance, basic logic operations on a pair of inputs.
-Moreover, for certain choices of the activation function,
-it is widely known that MLPs are universal approximators.
-Even with a single-hidden-layer network,
+We know that the brain is capable of very sophisticated statistical analysis. As such,
+it is worth asking just *how powerful* a deep network could be. This question
+has been answered multiple times, e.g., in :citet:`Cybenko.1989` in the context
+of MLPs, and in :citet:`micchelli1984interpolation` in the context of reproducing kernel
+Hilbert spaces in a way that could be seen as radial basis function (RBF) networks with a single hidden layer.
+These (and related results) suggest that even with a single-hidden-layer network,
 given enough nodes (possibly absurdly many),
 and the right set of weights,
-we can model any function,
-though actually learning that function is the hard part.
+we can model any function.
+Actually learning that function is the hard part, though.
 You might think of your neural network
 as being a bit like the C programming language.
 The language, like any other modern language,
@@ -266,9 +256,11 @@ Moreover, just because a single-hidden-layer network
 *can* learn any function
 does not mean that you should try
 to solve all of your problems
-with single-hidden-layer networks.
+with single-hidden-layer networks. In this case, kernel methods
+can be far more effective, since they are capable of solving the problem
+*exactly* even in infinite-dimensional spaces :cite:`Kimeldorf.Wahba.1971,Scholkopf.Herbrich.Smola.2001`.
 In fact, we can approximate many functions
-much more compactly by using deeper (vs. wider) networks.
+much more compactly by using deeper (vs. wider) networks :cite:`Simonyan.Zisserman.2014`.
 We will touch upon more rigorous arguments in subsequent chapters.
 
 
@@ -280,9 +272,10 @@ calculating the weighted sum and further adding bias with it.
 They are differentiable operators to transform input signals to outputs,
 while most of them add non-linearity.
 Because activation functions are fundamental to deep learning,
-(**let us briefly survey some common activation functions**).
+(**let's briefly survey some common activation functions**).
 
 ```{.python .input}
+%%tab mxnet
 %matplotlib inline
 from d2l import mxnet as d2l
 from mxnet import autograd, np, npx
@@ -290,14 +283,14 @@ npx.set_np()
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 %matplotlib inline
 from d2l import torch as d2l
 import torch
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 %matplotlib inline
 from d2l import tensorflow as d2l
 import tensorflow as tf
@@ -308,7 +301,7 @@ import tensorflow as tf
 The most popular choice,
 due to both simplicity of implementation and
 its good performance on a variety of predictive tasks,
-is the *rectified linear unit* (*ReLU*).
+is the *rectified linear unit* (*ReLU*) :cite:`Nair.Hinton.2010`.
 [**ReLU provides a very simple nonlinear transformation**].
 Given an element $x$, the function is defined
 as the maximum of that element and $0$:
@@ -322,6 +315,7 @@ To gain some intuition, we can plot the function.
 As you can see, the activation function is piecewise linear.
 
 ```{.python .input}
+%%tab mxnet
 x = np.arange(-8.0, 8.0, 0.1)
 x.attach_grad()
 with autograd.record():
@@ -330,14 +324,14 @@ d2l.plot(x, y, 'x', 'relu(x)', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
 y = torch.relu(x)
 d2l.plot(x.detach(), y.detach(), 'x', 'relu(x)', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 x = tf.Variable(tf.range(-8.0, 8.0, 0.1), dtype=tf.float32)
 y = tf.nn.relu(x)
 d2l.plot(x.numpy(), y.numpy(), 'x', 'relu(x)', figsize=(5, 2.5))
@@ -352,25 +346,28 @@ when the input takes value precisely equal to 0.
 In these cases, we default to the left-hand-side
 derivative and say that the derivative is 0 when the input is 0.
 We can get away with this because
-the input may never actually be zero.
+the input may never actually be zero (mathematicians would
+say that it's nondifferentiable on a set of measure zero).
 There is an old adage that if subtle boundary conditions matter,
 we are probably doing (*real*) mathematics, not engineering.
-That conventional wisdom may apply here.
+That conventional wisdom may apply here; at the very least, it helps that
+we are not performing constrained optimization :cite:`Mangasarian.1965,Rockafellar.1970`.
 We plot the derivative of the ReLU function plotted below.
 
 ```{.python .input}
+%%tab mxnet
 y.backward()
 d2l.plot(x, x.grad, 'x', 'grad of relu', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 y.backward(torch.ones_like(x), retain_graph=True)
 d2l.plot(x.detach(), x.grad, 'x', 'grad of relu', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 with tf.GradientTape() as t:
     y = tf.nn.relu(x)
 d2l.plot(x.numpy(), t.gradient(y, x).numpy(), 'x', 'grad of relu',
@@ -411,12 +408,11 @@ which either *fire* or *do not fire*.
 Thus the pioneers of this field,
 going all the way back to McCulloch and Pitts,
 the inventors of the artificial neuron,
-focused on thresholding units.
+focused on thresholding units :cite:`McCulloch.Pitts.1943`.
 A thresholding activation takes value 0
 when its input is below some threshold
 and value 1 when the input exceeds the threshold.
 
-
 When attention shifted to gradient based learning,
 the sigmoid function was a natural choice
 because it is a smooth, differentiable
@@ -424,12 +420,14 @@ approximation to a thresholding unit.
 Sigmoids are still widely used as
 activation functions on the output units,
 when we want to interpret the outputs as probabilities
-for binary classification problems
-(you can think of the sigmoid as a special case of the softmax).
+for binary classification problems: you can think of the sigmoid as a special case of the softmax.
 However, the sigmoid has mostly been replaced
 by the simpler and more easily trainable ReLU
-for most use in hidden layers.
-In later chapters on recurrent neural networks,
+for most use in hidden layers. Much of this has to do
+with the fact that the sigmoid poses challenges for optimization
+:cite:`LeCun.Bottou.Orr.ea.1998` since its gradient vanishes for large positive *and* negative arguments.
+This can lead to plateaus that are difficult to escape from.
+Nonetheless, sigmoids are important. In later chapters (e.g., :numref:`sec_lstm`) on recurrent neural networks,
 we will describe architectures that leverage sigmoid units
 to control the flow of information across time.
 
@@ -439,19 +437,20 @@ the sigmoid function approaches
 a linear transformation.
 
 ```{.python .input}
+%%tab mxnet
 with autograd.record():
     y = npx.sigmoid(x)
 d2l.plot(x, y, 'x', 'sigmoid(x)', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 y = torch.sigmoid(x)
 d2l.plot(x.detach(), y.detach(), 'x', 'sigmoid(x)', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 y = tf.nn.sigmoid(x)
 d2l.plot(x.numpy(), y.numpy(), 'x', 'sigmoid(x)', figsize=(5, 2.5))
 ```
@@ -469,12 +468,13 @@ As the input diverges from 0 in either direction,
 the derivative approaches 0.
 
 ```{.python .input}
+%%tab mxnet
 y.backward()
 d2l.plot(x, x.grad, 'x', 'grad of sigmoid', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 # Clear out previous gradients
 x.grad.data.zero_()
 y.backward(torch.ones_like(x),retain_graph=True)
@@ -482,7 +482,7 @@ d2l.plot(x.detach(), x.grad, 'x', 'grad of sigmoid', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 with tf.GradientTape() as t:
     y = tf.nn.sigmoid(x)
 d2l.plot(x.numpy(), t.gradient(y, x).numpy(), 'x', 'grad of sigmoid',
@@ -490,6 +490,7 @@ d2l.plot(x.numpy(), t.gradient(y, x).numpy(), 'x', 'grad of sigmoid',
 ```
 
 ### Tanh Function
+:label:`subsec_tanh`
 
 Like the sigmoid function, [**the tanh (hyperbolic tangent)
 function also squashes its inputs**],
@@ -497,23 +498,23 @@ transforming them into elements on the interval (**between -1 and 1**):
 
 $$\operatorname{tanh}(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}.$$
 
-We plot the tanh function below.
-Note that as the input nears 0, the tanh function approaches a linear transformation. Although the shape of the function is similar to that of the sigmoid function, the tanh function exhibits point symmetry about the origin of the coordinate system.
+We plot the tanh function below. Note that as the input nears 0, the tanh function approaches a linear transformation. Although the shape of the function is similar to that of the sigmoid function, the tanh function exhibits point symmetry about the origin of the coordinate system :cite:`Kalman.Kwasny.1992`.
 
 ```{.python .input}
+%%tab mxnet
 with autograd.record():
     y = np.tanh(x)
 d2l.plot(x, y, 'x', 'tanh(x)', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 y = torch.tanh(x)
 d2l.plot(x.detach(), y.detach(), 'x', 'tanh(x)', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 y = tf.nn.tanh(x)
 d2l.plot(x.numpy(), y.numpy(), 'x', 'tanh(x)', figsize=(5, 2.5))
 ```
@@ -522,20 +523,21 @@ The derivative of the tanh function is:
 
 $$\frac{d}{dx} \operatorname{tanh}(x) = 1 - \operatorname{tanh}^2(x).$$
 
-The derivative of tanh function is plotted below.
+It is plotted below.
 As the input nears 0,
 the derivative of the tanh function approaches a maximum of 1.
 And as we saw with the sigmoid function,
-as the input moves away from 0 in either direction,
+as the input moves away from 0 in either direction,
 the derivative of the tanh function approaches 0.
 
 ```{.python .input}
+%%tab mxnet
 y.backward()
 d2l.plot(x, x.grad, 'x', 'grad of tanh', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 # Clear out previous gradients.
 x.grad.data.zero_()
 y.backward(torch.ones_like(x),retain_graph=True)
@@ -543,14 +545,16 @@ d2l.plot(x.detach(), x.grad, 'x', 'grad of tanh', figsize=(5, 2.5))
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 with tf.GradientTape() as t:
     y = tf.nn.tanh(x)
 d2l.plot(x.numpy(), t.gradient(y, x).numpy(), 'x', 'grad of tanh',
          figsize=(5, 2.5))
 ```
 
-In summary, we now know how to incorporate nonlinearities
+## Summary
+
+We now know how to incorporate nonlinearities
 to build expressive multilayer neural network architectures.
 As a side note, your knowledge already
 puts you in command of a similar toolkit
@@ -561,21 +565,32 @@ because you can leverage powerful
 open-source deep learning frameworks
 to build models rapidly, using only a few lines of code.
 Previously, training these networks
-required researchers to code up
-thousands of lines of C and Fortran.
-
-## Summary
-
-* MLP adds one or multiple fully-connected hidden layers between the output and input layers and transforms the output of the hidden layer via an activation function.
-* Commonly-used activation functions include the ReLU function, the sigmoid function, and the tanh function.
-
+required researchers to code up layers and derivatives
+explicitly in C, Fortran, or even Lisp (in the case of LeNet).
+
+A secondary benefit is that ReLU is significantly more amenable to
+optimization than the sigmoid or the tanh function. One could argue
+that this was one of the key innovations that helped the resurgence
+of deep learning over the past decade. Note, though, that research on
+activation functions has not stopped. For instance, the Swish activation
+function $\sigma(x) = x \operatorname{sigmoid}(\beta x)$ as proposed in
+:cite:`Ramachandran.Zoph.Le.2017` can yield better accuracy
+in many cases.
 
 ## Exercises
 
+1. Show that adding layers to a *linear* deep network, i.e., a network without
+   nonlinearity $\sigma$, can never increase the expressive power of the network.
+   Give an example where it actively reduces it.
 1. Compute the derivative of the pReLU activation function.
-1. Show that an MLP using only ReLU (or pReLU) constructs a continuous piecewise linear function.
-1. Show that $\operatorname{tanh}(x) + 1 = 2 \operatorname{sigmoid}(2x)$.
-1. Assume that we have a nonlinearity that applies to one minibatch at a time. What kinds of problems do you expect this to cause?
+1. Compute the derivative of the Swish activation function $x \operatorname{sigmoid}(\beta x)$.
+1. Show that an MLP using only ReLU (or pReLU) constructs a
+   continuous piecewise linear function.
+1. Sigmoid and tanh are very similar.
+    1. Show that $\operatorname{tanh}(x) + 1 = 2 \operatorname{sigmoid}(2x)$.
+    1. Prove that the function classes parametrized by both nonlinearities are identical. Hint: affine layers have bias terms, too.
+1. Assume that we have a nonlinearity that applies to one minibatch at a time, such as batch normalization :cite:`Ioffe.Szegedy.2015`. What kinds of problems do you expect this to cause?
+1. Provide an example where the gradients vanish for the sigmoid activation function.
 
 :begin_tab:`mxnet`
 [Discussions](https://discuss.d2l.ai/t/90)
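
The collapse argument in the "From Linear to Nonlinear" text above can be checked numerically. The following is a minimal sketch in plain Python, assuming small hand-picked matrices; the helper names (`matmul`, `add_bias`, `relu`, `max_abs_diff`) and all numeric values are illustrative and not part of the book's code. It compares a two-layer affine network with its collapsed form $\mathbf{W} = \mathbf{W}^{(1)}\mathbf{W}^{(2)}$, $\mathbf{b} = \mathbf{b}^{(1)}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}$, and then inserts a ReLU to show that the equivalence no longer holds.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def add_bias(M, b):
    """Add the 1 x k bias row b to every row of M (broadcasting)."""
    return [[m + bj for m, bj in zip(row, b[0])] for row in M]

def relu(M):
    """Apply max(0, x) element-wise."""
    return [[max(0.0, v) for v in row] for row in M]

def max_abs_diff(A, B):
    """Largest absolute element-wise difference between A and B."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Hypothetical minibatch (n=2, d=2) and parameters (h=2, q=1).
X = [[1.0, -2.0], [0.5, 3.0]]
W1 = [[1.0, -1.0], [2.0, 0.5]]
b1 = [[0.1, -0.2]]
W2 = [[1.0], [-1.0]]
b2 = [[0.3]]

# Two stacked affine layers without an activation function.
O_linear = add_bias(matmul(add_bias(matmul(X, W1), b1), W2), b2)

# Equivalent single-layer model: W = W1 W2 and b = b1 W2 + b2.
W = matmul(W1, W2)
b = add_bias(matmul(b1, W2), b2)
O_collapsed = add_bias(matmul(X, W), b)

# The same network with a ReLU applied to the hidden layer.
O_relu = add_bias(matmul(relu(add_bias(matmul(X, W1), b1)), W2), b2)

linear_gap = max_abs_diff(O_linear, O_collapsed)  # zero up to rounding
relu_gap = max_abs_diff(O_relu, O_collapsed)      # clearly nonzero
print(linear_gap, relu_gap)
```

For the first example both hidden pre-activations are negative, so the ReLU zeroes them and the output differs from the collapsed model, while the second example passes through unchanged.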
diff --git a/chapter_multilayer-perceptrons/numerical-stability-and-init.md b/chapter_multilayer-perceptrons/numerical-stability-and-init.md
index c996b3a..206c71b 100644
--- a/chapter_multilayer-perceptrons/numerical-stability-and-init.md
+++ b/chapter_multilayer-perceptrons/numerical-stability-and-init.md
@@ -1,27 +1,33 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
 # Numerical Stability and Initialization
 :label:`sec_numerical_stability`
 
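-So far, every model that we have implemented required that we initialize its parameters according to some pre-specified distribution. Until now, we took the initialization scheme for granted and overlooked the details of how these choices are made. You might even have gotten the impression that these choices are not especially important. On the contrary, the choice of initialization scheme plays a significant role in neural network learning, and it can be crucial for maintaining numerical stability. Moreover, these choices can be tied up in interesting ways with the choice of the nonlinear activation function. Which function we choose and how we initialize parameters can determine how quickly our optimization algorithm converges. Poor choices here can cause us to encounter exploding or vanishing gradients during training. In this section, we delve into these topics in greater detail and discuss some useful heuristics that you will find useful throughout your career in deep learning.
+So far, every model that we have implemented required that we initialize its parameters according to some pre-specified distribution. Until now, we took the initialization scheme for granted, glossing over the details of how these choices are made. You might even have gotten the impression that these choices are not especially important. On the contrary, the choice of initialization scheme plays a significant role in neural network learning, and it can be crucial for maintaining numerical stability. Moreover, these choices can be tied up in interesting ways with the choice of the nonlinear activation function. Which function we choose and how we initialize parameters can determine how quickly our optimization algorithm converges. Poor choices here can cause gradients to explode or vanish during training. In this section, we delve into these topics in greater detail and discuss some useful heuristics that will serve you throughout your career in deep learning.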
 
-## Vanishing Gradients and Exploding Gradients
+## Vanishing and Exploding Gradients
 
-Consider a deep network with $L$ layers, input $\mathbf{x}$, and output $\mathbf{o}$. With each layer $l$ defined by a transformation $f_l$ parameterized by weights $\mathbf{W}^{(l)}$, whose hidden variable is $\mathbf{h}^{(l)}$ (let $\mathbf{h}^{(0)} = \mathbf{x}$), our network can be expressed as:
+Consider a deep network with $L$ layers, input $\mathbf{x}$, and output $\mathbf{o}$. With each layer $l$ defined by a transformation $f_l$ parameterized by weights $\mathbf{W}^{(l)}$, whose hidden layer output is $\mathbf{h}^{(l)}$ (let $\mathbf{h}^{(0)} = \mathbf{x}$), our network can be expressed as:
 
 $$\mathbf{h}^{(l)} = f_l (\mathbf{h}^{(l-1)}) \text{ and thus } \mathbf{o} = f_L \circ \ldots \circ f_1(\mathbf{x}).$$
 
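-If all the hidden variables and the input are vectors, we can write the gradient of $\mathbf{o}$ with respect to any set of parameters $\mathbf{W}^{(l)}$ as follows:
+If all the hidden layer outputs and the input are vectors, we can write the gradient of $\mathbf{o}$ with respect to any set of parameters $\mathbf{W}^{(l)}$ as follows: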
 
 $$\partial_{\mathbf{W}^{(l)}} \mathbf{o} = \underbrace{\partial_{\mathbf{h}^{(L-1)}} \mathbf{h}^{(L)}}_{ \mathbf{M}^{(L)} \stackrel{\mathrm{def}}{=}} \cdot \ldots \cdot \underbrace{\partial_{\mathbf{h}^{(l)}} \mathbf{h}^{(l+1)}}_{ \mathbf{M}^{(l+1)} \stackrel{\mathrm{def}}{=}} \underbrace{\partial_{\mathbf{W}^{(l)}} \mathbf{h}^{(l)}}_{ \mathbf{v}^{(l)} \stackrel{\mathrm{def}}{=}}.$$
 
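-In other words, this gradient is the product of $L-l$ matrices $\mathbf{M}^{(L)} \cdot \ldots \cdot \mathbf{M}^{(l+1)}$ and the gradient vector $\mathbf{v}^{(l)}$. Thus we are susceptible to the same problems of numerical underflow that often crop up when multiplying together too many probabilities. When dealing with probabilities, a common trick is to switch into log-space, i.e., shifting pressure from the mantissa to the exponent of the numerical representation. Unfortunately, our problem above is more serious: initially the matrices $\mathbf{M}^{(l)}$ may have a wide variety of eigenvalues. They might be small or large, and their product might be *very large* or *very small*.
+In other words, this gradient is the product of $L-l$ matrices $\mathbf{M}^{(L)} \cdot \ldots \cdot \mathbf{M}^{(l+1)}$ and the gradient vector $\mathbf{v}^{(l)}$. Thus we are susceptible to the same problems of numerical underflow that often crop up when multiplying together too many probabilities. When dealing with probabilities, a common trick is to switch into log-space, i.e., to shift pressure from the mantissa to the exponent of the numerical representation. Unfortunately, our problem above is more serious: initially the matrices $\mathbf{M}^{(l)}$ may have a wide variety of eigenvalues. They might be small or large, and their product might be *very large* or *very small*.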
 
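-The risks posed by unstable gradients go beyond numerical representation. Gradients of unpredictable magnitude also threaten the stability of our optimization algorithms. We may be facing parameter updates that are either (i) excessively large, destroying our model (the *exploding gradient* problem); or (ii) excessively small (the *vanishing gradient* problem), rendering learning impossible as parameters hardly move on each update.
+The risks posed by unstable gradients go beyond numerical representation. Gradients of unpredictable magnitude also threaten the stability of our optimization algorithms. We may be facing parameter updates that are either (i) excessively large, destroying our model (the *exploding gradient* problem); or (ii) excessively small (the *vanishing gradient* problem), rendering learning impossible because parameters hardly move on each update.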
 
-### (**Vanishing Gradients**)
+### (**Vanishing Gradients**)
 
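-One frequent culprit causing the vanishing gradient problem is the choice of the activation function $\sigma$ that is appended following each layer's linear operations. Traditionally, the sigmoid function $1/(1 + \exp(-x))$ (introduced in :numref:`sec_mlp`) was popular because it resembles a thresholding function. Since early artificial neural networks were inspired by biological neural networks, the idea of neurons that fire either *fully* or *not at all* (like biological neurons) seemed appealing. Let us take a closer look at the sigmoid to see why it can cause vanishing gradients.
+One frequent culprit causing the vanishing gradient problem is the choice of the activation function $\sigma$ that is appended following each layer's linear operations. Historically, the sigmoid function $1/(1 + \exp(-x))$ (introduced in :numref:`sec_mlp`) was popular because it resembles a thresholding function. Since early artificial neural networks were inspired by biological neural networks, the idea of neurons that fire either *fully* or *not at all* (like biological neurons) seemed appealing. Let's take a closer look at the sigmoid to see why it can cause vanishing gradients.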
 
 ```{.python .input}
+%%tab mxnet
 %matplotlib inline
 from d2l import mxnet as d2l
 from mxnet import autograd, np, npx
@@ -37,7 +43,7 @@ d2l.plot(x, [y, x.grad], legend=['sigmoid', 'gradient'], figsize=(4.5, 2.5))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 %matplotlib inline
 from d2l import torch as d2l
 import torch
@@ -51,7 +57,7 @@ d2l.plot(x.detach().numpy(), [y.detach().numpy(), x.grad.numpy()],
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 %matplotlib inline
 from d2l import tensorflow as d2l
 import tensorflow as tf
@@ -63,13 +69,14 @@ d2l.plot(x.numpy(), [y.numpy(), t.gradient(y, x).numpy()],
          legend=['sigmoid', 'gradient'], figsize=(4.5, 2.5))
 ```
 
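-As you can see, (**the sigmoid's gradient vanishes both when its inputs are large and when they are small**). Moreover, when backpropagating through many layers, unless we are in the Goldilocks zone, where the inputs to many of the sigmoids are close to zero, the gradients of the overall product may vanish. When our network has many layers, unless we are careful, the gradient will likely be cut off at some layer. Indeed, this problem used to plague deep network training. Consequently, ReLUs, which are more stable (but less neurally plausible), have emerged as the default choice for practitioners.
+As you can see, (**the sigmoid's gradient vanishes both when its inputs are large and when they are small**). Moreover, when backpropagating through many layers, unless we are in the Goldilocks zone, where the inputs to many of the sigmoids are close to zero, the gradients of the overall product may vanish. When our network boasts many layers, unless we are careful, the gradient will likely be cut off at some layer. Indeed, this problem used to plague deep network training. Consequently, ReLUs, which are more stable (but less neurally plausible), have emerged as the default choice for practitioners.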
 
-### [**Exploding Gradients**]
+### [**Exploding Gradients**]
 
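-The opposite problem, when gradients explode, can be similarly vexing. To illustrate this a bit better, we draw 100 Gaussian random matrices and multiply them with some initial matrix. For the scale that we picked (the choice of the variance $\sigma^2=1$), the matrix product explodes. When this happens due to the initialization of a deep network, we have no chance of getting a gradient descent optimizer to converge.
+The opposite problem, when gradients explode, is similarly vexing. To illustrate this a bit better, we draw 100 Gaussian random matrices and multiply them with some initial matrix. For the scale that we picked (the choice of the variance $\sigma^2=1$), the matrix product explodes. When this happens due to the initialization of a deep network, we have no chance of getting a gradient descent optimizer to converge.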
 
 ```{.python .input}
+%%tab mxnet
 M = np.random.normal(size=(4, 4))
 print('a single matrix', M)
 for i in range(100):
@@ -79,17 +86,17 @@ print('after multiplying 100 matrices', M)
 ```
 
 ```{.python .input}
-#@tab pytorch
-M = torch.normal(0, 1, size=(4,4))
+%%tab pytorch
+M = torch.normal(0, 1, size=(4, 4))
 print('a single matrix \n',M)
 for i in range(100):
-    M = torch.mm(M,torch.normal(0, 1, size=(4, 4)))
+    M = M @ torch.normal(0, 1, size=(4, 4))
 
 print('after multiplying 100 matrices\n', M)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 M = tf.random.normal((4, 4))
 print('a single matrix \n', M)
 for i in range(100):
@@ -98,30 +105,30 @@ for i in range(100):
 print('after multiplying 100 matrices\n', M.numpy())
 ```
 
-### Breaking the Symmetry
+### Breaking the Symmetry
 
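-Another problem in neural network design is the symmetry inherent in their parametrization. Assume that we have a simple MLP with one hidden layer and two units. In this case, we could permute the weights $\mathbf{W}^{(1)}$ of the first layer and likewise permute the weights of the output layer to obtain the same function. There is nothing special differentiating the first hidden unit from the second hidden unit. In other words, we have permutation symmetry among the hidden units of each layer.
+Another problem in neural network design is the symmetry inherent in their parametrization. Assume that we have a simple MLP with one hidden layer and two units. In this case, we could permute the weights $\mathbf{W}^{(1)}$ of the first layer and likewise permute the weights of the output layer to obtain the same function. There is nothing special differentiating the first hidden unit vs. the second hidden unit. In other words, we have permutation symmetry among the hidden units of each layer.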
 
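-This is more than just a theoretical nuisance. Consider the aforementioned one-hidden-layer MLP with two hidden units. For illustration, suppose that the output layer transforms the two hidden units into only one output unit. Imagine what would happen if we initialized all of the parameters of the hidden layer as $\mathbf{W}^{(1)} = c$ for some constant $c$. In this case, during forward propagation either hidden unit takes the same inputs and parameters, producing the same activation, which is fed to the output unit. During backpropagation, differentiating the output unit with respect to parameters $\mathbf{W}^{(1)}$ gives a gradient whose elements all take the same value. Thus, after gradient-based iteration (e.g., minibatch stochastic gradient descent), all the elements of $\mathbf{W}^{(1)}$ still take the same value. Such iterations would never *break the symmetry* on their own and we might never be able to realize the network's expressive power. The hidden layer would behave as if it had only a single unit. Note that while minibatch stochastic gradient descent would not break this symmetry, dropout regularization would!
+This is more than just a theoretical nuisance. Consider the aforementioned one-hidden-layer MLP with two hidden units. For illustration, suppose that the output layer transforms the two hidden units into only one output unit. Imagine what would happen if we initialized all of the parameters of the hidden layer as $\mathbf{W}^{(1)} = c$ for some constant $c$. In this case, during forward propagation either hidden unit takes the same inputs and parameters, producing the same activation, which is fed to the output unit. During backpropagation, differentiating the output unit with respect to parameters $\mathbf{W}^{(1)}$ gives a gradient whose elements all take the same value. Thus, after gradient-based iteration (e.g., minibatch stochastic gradient descent), all the elements of $\mathbf{W}^{(1)}$ still take the same value. Such iterations would never *break the symmetry* on their own and we might never be able to realize the network's expressive power. The hidden layer would behave as if it had only a single unit. Note that while minibatch stochastic gradient descent would not break this symmetry, dropout regularization (to be introduced later) would!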
 
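 ## Parameter Initialization
 
-One way of addressing (or at least mitigating) the issues raised above is through careful initialization. Additional care during optimization and suitable regularization can further enhance stability.
+One way of addressing, or at least mitigating, the issues raised above is through careful initialization. As we will see later, additional care during optimization and suitable regularization can further enhance stability.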
 
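 ### Default Initialization
 
-In the previous sections, e.g., in :numref:`sec_linear_concise`, we used a normal distribution to initialize the values of our weights. If we do not specify the initialization method, the framework will use a default random initialization method, which often works well in practice for moderate problem sizes.
+In the previous sections, e.g., in :numref:`sec_linear_concise`, we used a normal distribution to initialize the values of our weights. If we do not specify the initialization method, the framework uses a default random initialization method, which often works well in practice for moderate problem sizes.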
 
-### Xavier's Initialization
+### Xavier Initialization
 :label:`subsec_xavier`
 
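-Let us look at the scale distribution of an output (e.g., a hidden variable) $o_{i}$ for some fully-connected layer
+Let's look at the scale distribution of an output $o_{i}$ for some fully connected layer
 *without nonlinearities*.
-With $n_\mathrm{in}$ inputs $x_j$ and their associated weights $w_{ij}$ for this layer, an output is given by
+With $n_\mathrm{in}$ inputs $x_j$ and their associated weights $w_{ij}$ for this layer, an output is given by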
 
 $$o_{i} = \sum_{j=1}^{n_\mathrm{in}} w_{ij} x_j.$$
 
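-The weights $w_{ij}$ are all drawn independently from the same distribution. Furthermore, let us assume that this distribution has zero mean and variance $\sigma^2$. Note that this does not mean that the distribution has to be Gaussian, just that the mean and variance need to exist. For now, let us assume that the inputs to the layer $x_j$ also have zero mean and variance $\gamma^2$ and that they are independent of $w_{ij}$ and independent of each other. In this case, we can compute the mean and variance of $o_i$ as follows:
+The weights $w_{ij}$ are all drawn independently from the same distribution. Furthermore, let's assume that this distribution has zero mean and variance $\sigma^2$. Note that this does not mean that the distribution has to be Gaussian, just that the mean and variance need to exist. For now, let's assume that the inputs to the layer $x_j$ also have zero mean and variance $\gamma^2$ and that they are independent of $w_{ij}$ and independent of each other. In this case, we can compute the mean and variance of $o_i$ as follows: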
 
 $$
 \begin{aligned}
@@ -133,7 +140,7 @@ $$
 \end{aligned}
 $$
 
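-One way to keep the variance fixed is to set $n_\mathrm{in} \sigma^2 = 1$. Now consider backpropagation. There we face a similar problem, albeit with gradients being propagated from the layers closer to the output. Using the same reasoning as for forward propagation, we see that the gradients' variance can blow up unless $n_\mathrm{out} \sigma^2 = 1$, where $n_\mathrm{out}$ is the number of outputs of this layer. This leaves us in a dilemma: we cannot possibly satisfy both conditions simultaneously. Instead, we simply try to satisfy:
+One way to keep the variance fixed is to set $n_\mathrm{in} \sigma^2 = 1$. Now consider backpropagation. There we face a similar problem, albeit with gradients being propagated from the layers closer to the output. Using the same reasoning as for forward propagation, we see that the gradients' variance can blow up unless $n_\mathrm{out} \sigma^2 = 1$, where $n_\mathrm{out}$ is the number of outputs of this layer. This leaves us in a dilemma: we cannot possibly satisfy both conditions simultaneously. Instead, we simply try to satisfy: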
 
 $$
 \begin{aligned}
@@ -142,32 +149,32 @@ $$
 \end{aligned}
 $$
 
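-This is the reasoning underlying the now-standard and practically beneficial *Xavier initialization*, named after the first author of its creators :cite:`Glorot.Bengio.2010`. Typically, the Xavier initialization samples weights from a Gaussian distribution with zero mean and variance $\sigma^2 = \frac{2}{n_\mathrm{in} + n_\mathrm{out}}$. We can also adapt Xavier's intuition to choose the variance when sampling weights from a uniform distribution. Note that the uniform distribution $U(-a, a)$ has variance $\frac{a^2}{3}$. Plugging $\frac{a^2}{3}$ into our condition on $\sigma^2$ yields the suggestion to initialize according to
+This is the reasoning underlying the now-standard and practically beneficial *Xavier initialization*, named after the first author of its creators :cite:`Glorot.Bengio.2010`. Typically, the Xavier initialization samples weights from a Gaussian distribution with zero mean and variance $\sigma^2 = \frac{2}{n_\mathrm{in} + n_\mathrm{out}}$. We can also adapt Xavier's intuition to choose the variance when sampling weights from a uniform distribution. Note that the uniform distribution $U(-a, a)$ has variance $\frac{a^2}{3}$. Plugging $\frac{a^2}{3}$ into our condition on $\sigma^2$ yields the suggestion to initialize according to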
 
 $$U\left(-\sqrt{\frac{6}{n_\mathrm{in} + n_\mathrm{out}}}, \sqrt{\frac{6}{n_\mathrm{in} + n_\mathrm{out}}}\right).$$
 
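-Though the assumption that nonlinearities are absent in the above mathematical reasoning can be easily violated in neural networks, the Xavier initialization method turns out to work well in practice.
+Though the assumption of nonexistence of nonlinearities in the above mathematical reasoning can be easily violated in neural networks, the Xavier initialization method turns out to work well in practice.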
 
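 ### Beyond
 
-The reasoning above barely scratches the surface of modern approaches to parameter initialization. A deep learning framework often implements over a dozen different heuristics. Moreover, parameter initialization continues to be a hot area of fundamental research in deep learning. Among these are heuristics specialized for tied (shared) parameters, super-resolution, sequence models, and other situations. For instance, Xiao et al. demonstrated the possibility of training 10000-layer neural networks without architectural tricks by using a carefully designed initialization method :cite:`Xiao.Bahri.Sohl-Dickstein.ea.2018`.
+The reasoning above barely scratches the surface of modern approaches to parameter initialization. A deep learning framework often implements over a dozen different heuristics. Moreover, parameter initialization continues to be a hot area of fundamental research in deep learning. Among these are heuristics specialized for tied (shared) parameters, super-resolution, sequence models, and other situations. For instance, Xiao et al. demonstrated the possibility of training 10000-layer neural networks without architectural tricks by using a carefully designed initialization method :cite:`Xiao.Bahri.Sohl-Dickstein.ea.2018`.
 
-If the topic interests you, we suggest a deep dive into this module's contents, reading the papers that proposed and analyzed each heuristic, and then exploring the latest publications on the topic. Perhaps you will stumble across or even invent a clever idea and contribute an implementation to deep learning frameworks.
+If the topic interests you, we suggest a deep dive into this module's offerings, reading the papers that proposed and analyzed each heuristic, and then exploring the latest publications on the topic. Perhaps you will stumble across or even invent a clever idea and contribute an implementation to deep learning frameworks.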
 
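-## Summary
+## Summary
 
-* Vanishing and exploding gradients are common issues in deep networks. Great care in parameter initialization is required to ensure that gradients and parameters remain well controlled.
+* Vanishing and exploding gradients are common issues in deep networks. Great care in parameter initialization is required to ensure that gradients and parameters remain well controlled.
 * Initialization heuristics are needed to ensure that the initial gradients are neither too large nor too small.
-* ReLU activation functions mitigate the vanishing gradient problem. This can accelerate convergence.
-* Random initialization is key to ensuring that symmetry is broken before optimization.
-* Xavier initialization suggests that, for each layer, the variance of any output is not affected by the number of inputs, and the variance of any gradient is not affected by the number of outputs.
+* ReLU activation functions mitigate the vanishing gradient problem. This can accelerate convergence.
+* Random initialization is key to ensuring that symmetry is broken before optimization.
+* Xavier initialization suggests that, for each layer, the variance of any output is not affected by the number of inputs, and the variance of any gradient is not affected by the number of outputs.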
 
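 ## Exercises
 
-1. Can you design other cases where a neural network might exhibit symmetry requiring breaking besides the permutation symmetry in an MLP's layers?
+1. Can you design other cases where a neural network might exhibit symmetry requiring breaking besides the permutation symmetry in an MLP's layers?
 1. Can we initialize all weight parameters in linear regression or in softmax regression to the same value?
-1. Look up analytic bounds on the eigenvalues of the product of two matrices. What does this tell you about ensuring that gradients are well conditioned?
-1. If we know that some terms diverge, can we fix this after the fact? Look at the paper on layerwise adaptive rate scaling for inspiration :cite:`You.Gitman.Ginsburg.2017`.
+1. Look up analytic bounds on the eigenvalues of the product of two matrices. What does this tell you about ensuring that gradients are well conditioned?
+1. If we know that some terms diverge, can we fix this after the fact? Look at the paper on layerwise adaptive rate scaling for inspiration :cite:`You.Gitman.Ginsburg.2017`.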
 
 :begin_tab:`mxnet`
 [Discussions](https://discuss.d2l.ai/t/103)
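
The Xavier reasoning in the section above can be made concrete with a small experiment. The following is a minimal sketch in plain Python, assuming a single output $o_i = \sum_j w_{ij} x_j$ with standard normal inputs ($\gamma^2 = 1$); the function names `xavier_uniform_bound` and `empirical_output_variance`, the layer sizes, and the sample count are illustrative choices rather than anything taken from the book or a framework. It computes the bound $a = \sqrt{6 / (n_\mathrm{in} + n_\mathrm{out})}$, checks that $a^2/3$ matches $2 / (n_\mathrm{in} + n_\mathrm{out})$, and compares the predicted output variance $n_\mathrm{in} \sigma^2 \gamma^2$ with a sampled estimate.

```python
import math
import random

def xavier_uniform_bound(n_in, n_out):
    """Bound a of U(-a, a) whose variance a^2 / 3 equals 2 / (n_in + n_out)."""
    return math.sqrt(6.0 / (n_in + n_out))

def empirical_output_variance(n_in, n_out, num_samples=2000, seed=0):
    """Estimate Var[o_i] for o_i = sum_j w_ij x_j by sampling.

    Weights are drawn from U(-a, a) with the Xavier bound; inputs are
    standard normal (zero mean, variance gamma^2 = 1, an assumption here).
    """
    rng = random.Random(seed)
    a = xavier_uniform_bound(n_in, n_out)
    outputs = []
    for _ in range(num_samples):
        # Fresh weights and inputs for every sampled output.
        o = sum(rng.uniform(-a, a) * rng.gauss(0.0, 1.0) for _ in range(n_in))
        outputs.append(o)
    mean = sum(outputs) / num_samples
    return sum((o - mean) ** 2 for o in outputs) / num_samples

n_in, n_out = 100, 100             # hypothetical layer sizes
a = xavier_uniform_bound(n_in, n_out)
sigma2 = a ** 2 / 3                # variance of U(-a, a)
predicted = n_in * sigma2 * 1.0    # n_in * sigma^2 * gamma^2 with gamma^2 = 1
measured = empirical_output_variance(n_in, n_out)
print(f"sigma^2 = {sigma2:.4f}, predicted Var[o] = {predicted:.3f}, "
      f"measured Var[o] = {measured:.3f}")
```

With equal fan-in and fan-out the compromise $\frac{1}{2}(n_\mathrm{in} + n_\mathrm{out})\sigma^2 = 1$ coincides with $n_\mathrm{in}\sigma^2 = 1$, so the predicted variance is 1; with unequal sizes the forward variance would deviate from 1 by the ratio $2 n_\mathrm{in} / (n_\mathrm{in} + n_\mathrm{out})$.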
diff --git a/chapter_multilayer-perceptrons/numerical-stability-and-init_origin.md b/chapter_multilayer-perceptrons/numerical-stability-and-init_origin.md
index 9cb6208..08fe82c 100644
--- a/chapter_multilayer-perceptrons/numerical-stability-and-init_origin.md
+++ b/chapter_multilayer-perceptrons/numerical-stability-and-init_origin.md
@@ -1,3 +1,8 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
 # Numerical Stability and Initialization
 :label:`sec_numerical_stability`
 
@@ -30,12 +35,12 @@ Consider a deep network with $L$ layers,
 input $\mathbf{x}$ and output $\mathbf{o}$.
 With each layer $l$ defined by a transformation $f_l$
 parameterized by weights $\mathbf{W}^{(l)}$,
-whose hidden variable is $\mathbf{h}^{(l)}$ (let $\mathbf{h}^{(0)} = \mathbf{x}$),
+whose hidden layer output is $\mathbf{h}^{(l)}$ (let $\mathbf{h}^{(0)} = \mathbf{x}$),
 our network can be expressed as:
 
 $$\mathbf{h}^{(l)} = f_l (\mathbf{h}^{(l-1)}) \text{ and thus } \mathbf{o} = f_L \circ \ldots \circ f_1(\mathbf{x}).$$
 
-If all the hidden variables and the input are vectors,
+If all the hidden layer outputs and the input are vectors,
 we can write the gradient of $\mathbf{o}$ with respect to
 any set of parameters $\mathbf{W}^{(l)}$ as follows:
 
@@ -82,10 +87,11 @@ Since early artificial neural networks were inspired
 by biological neural networks,
 the idea of neurons that fire either *fully* or *not at all*
 (like biological neurons) seemed appealing.
-Let us take a closer look at the sigmoid
+Let's take a closer look at the sigmoid
 to see why it can cause vanishing gradients.
 
 ```{.python .input}
+%%tab mxnet
 %matplotlib inline
 from d2l import mxnet as d2l
 from mxnet import autograd, np, npx
@@ -101,7 +107,7 @@ d2l.plot(x, [y, x.grad], legend=['sigmoid', 'gradient'], figsize=(4.5, 2.5))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 %matplotlib inline
 from d2l import torch as d2l
 import torch
@@ -115,7 +121,7 @@ d2l.plot(x.detach().numpy(), [y.detach().numpy(), x.grad.numpy()],
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 %matplotlib inline
 from d2l import tensorflow as d2l
 import tensorflow as tf
@@ -157,6 +163,7 @@ of a deep network, we have no chance of getting
 a gradient descent optimizer to converge.
 
 ```{.python .input}
+%%tab mxnet
 M = np.random.normal(size=(4, 4))
 print('a single matrix', M)
 for i in range(100):
@@ -166,17 +173,17 @@ print('after multiplying 100 matrices', M)
 ```
 
 ```{.python .input}
-#@tab pytorch
-M = torch.normal(0, 1, size=(4,4))
+%%tab pytorch
+M = torch.normal(0, 1, size=(4, 4))
 print('a single matrix \n',M)
 for i in range(100):
-    M = torch.mm(M,torch.normal(0, 1, size=(4, 4)))
+    M = M @ torch.normal(0, 1, size=(4, 4))
 
 print('after multiplying 100 matrices\n', M)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 M = tf.random.normal((4, 4))
 print('a single matrix \n', M)
 for i in range(100):
@@ -223,14 +230,15 @@ the network's expressive power.
 The hidden layer would behave
 as if it had only a single unit.
 Note that while minibatch stochastic gradient descent would not break this symmetry,
-dropout regularization would!
+dropout regularization (to be introduced later) would!
 
 
 ## Parameter Initialization
 
 One way of addressing---or at least mitigating---the
 issues raised above is through careful initialization.
-Additional care during optimization
+As we will see later,
+additional care during optimization
 and suitable regularization can further enhance stability.
 
 
@@ -251,8 +259,8 @@ for moderate problem sizes.
 ### Xavier Initialization
 :label:`subsec_xavier`
 
-Let us look at the scale distribution of
-an output (e.g., a hidden variable) $o_{i}$ for some fully-connected layer
+Let's look at the scale distribution of
+an output $o_{i}$ for some fully connected layer
 *without nonlinearities*.
 With $n_\mathrm{in}$ inputs $x_j$
 and their associated weights $w_{ij}$ for this layer,
@@ -262,11 +270,11 @@ $$o_{i} = \sum_{j=1}^{n_\mathrm{in}} w_{ij} x_j.$$
 
 The weights $w_{ij}$ are all drawn
 independently from the same distribution.
-Furthermore, let us assume that this distribution
+Furthermore, let's assume that this distribution
 has zero mean and variance $\sigma^2$.
 Note that this does not mean that the distribution has to be Gaussian,
 just that the mean and variance need to exist.
-For now, let us assume that the inputs to the layer $x_j$
+For now, let's assume that the inputs to the layer $x_j$
 also have zero mean and variance $\gamma^2$
 and that they are independent of $w_{ij}$ and independent of each other.
 In this case, we can compute the mean and variance of $o_i$ as follows:
diff --git a/chapter_multilayer-perceptrons/underfit-overfit.md b/chapter_multilayer-perceptrons/underfit-overfit.md
deleted file mode 100644
index 01fc9b6..0000000
--- a/chapter_multilayer-perceptrons/underfit-overfit.md
+++ /dev/null
@@ -1,342 +0,0 @@
-# Model Selection, Underfitting, and Overfitting
-:label:`sec_model_selection`
-
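-As machine learning scientists, our goal is to discover *patterns*. But how can we be sure that we have truly discovered a *general* pattern and not simply memorized our data? For example, imagine that we wanted to hunt for patterns among genetic markers linking patients to their dementia status, where the labels are drawn from the set $\{\text{dementia}, \text{mild cognitive impairment}, \text{healthy}\}$. Because each person's genes identify them uniquely (ignoring identical siblings), it is possible to memorize the entire dataset.
-
-We do not want our model to say
-*"That's Bob! I remember him! He has dementia!"*
-The reason why is simple. When we deploy the model in the future, we will encounter patients that the model has never seen before. Our predictions will only be useful if our model has truly discovered a *general* pattern.
-
-To recapitulate more formally, our goal is to discover patterns that capture regularities in the underlying population from which our training set was drawn. If we are successful in this endeavor, then we could successfully assess risk even for individuals that we have never encountered before. This problem, how to discover patterns that *generalize*, is the fundamental problem of machine learning.
-
-The danger is that when we train models, we access just a small sample of data. The largest public image datasets contain roughly one million images. More often, we must learn from only thousands or tens of thousands of data examples. In a large hospital system, we might access hundreds of thousands of medical records. When working with finite samples, we run the risk that we might discover apparent associations that turn out not to hold up when we collect more data.
-
-The phenomenon of fitting our training data more closely than we fit the underlying distribution is called *overfitting*, and the techniques used to combat overfitting are called *regularization*. In the previous sections, you might have observed this effect while experimenting with the Fashion-MNIST dataset. If you altered the model structure or the hyperparameters during the experiment, you might have noticed that with enough neurons, layers, and training epochs, the model can eventually reach perfect accuracy on the training set, even as the accuracy on test data deteriorates.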
-
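-## Training Error and Generalization Error
-
-In order to discuss this phenomenon more formally, we need to differentiate between training error and generalization error. The *training error* is the error of our model as calculated on the training dataset, while the *generalization error* is the expectation of our model's error were we to apply it to an infinite stream of additional data examples drawn from the same underlying data distribution as our original sample.
-
-Problematically, we can never calculate the generalization error exactly, because the stream of infinite data is an imaginary object. In practice, we must *estimate* the generalization error by applying our model to an independent test set constituted of a random selection of data examples that were withheld from our training set.
-
-The following three thought experiments will help illustrate this situation better. Consider a college student trying to prepare for their final exam. A diligent student will strive to practice well and test their abilities using exams from previous years. Nonetheless, doing well on past exams is no guarantee that they will excel when it matters. For instance, the student might try to prepare by rote learning the answers to the exam questions. This requires the student to memorize many things. They might even remember the answers for past exams perfectly. Another student might prepare by trying to understand the reasons for giving certain answers. In most cases, the latter student will do much better.
-
-Likewise, consider a model that simply uses a lookup table to answer questions. If the set of allowable inputs is discrete and reasonably small, then perhaps after viewing *many* training examples, this approach would perform well. Still, this model has no ability to do better than random guessing when faced with examples that it has never seen before. In reality, the input spaces are far too large to memorize the answers corresponding to every conceivable input. For example, consider grayscale $28\times28$ images. If each pixel can take one among $256$ grayscale values, then there are $256^{784}$ possible images. That means that there are far more low-resolution grayscale thumbnail-sized images than there are atoms in the universe. Even if we could encounter such data, we could never afford to store the lookup table.
-
-Last, consider the problem of trying to classify the outcomes of coin tosses (class 0: heads, class 1: tails) based on some contextual features that might be available. Suppose that the coin is fair. No matter what algorithm we come up with, the generalization error will always be $\frac{1}{2}$. However, for most algorithms, we should expect our training error to be considerably lower, depending on the luck of the draw, even if we did not have any features! Consider the dataset {0, 1, 1, 1, 0, 1}. Our feature-less algorithm would have to fall back on always predicting the *majority class*, which appears from our limited sample to be *1*. In this case, the model that always predicts class 1 will incur an error of $\frac{1}{3}$, considerably better than our generalization error. As we increase the amount of data, the probability that the fraction of heads will deviate significantly from $\frac{1}{2}$ diminishes, and our training error would come to match the generalization error.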
-
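-### Statistical Learning Theory
-
-Since generalization is the fundamental problem in machine learning, you might not be surprised to learn that many mathematicians and theorists have dedicated their lives to developing formal theories to describe this phenomenon. In their [eponymous theorem](https://en.wikipedia.org/wiki/Glivenko%E2%80%93Cantelli_theorem), Glivenko and Cantelli derived the rate at which the training error converges to the generalization error. In a series of seminal papers, [Vapnik and Chervonenkis](https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_theory) extended this theory to more general classes of functions. This work laid the foundations of statistical learning theory.
-
-In the standard supervised learning setting, which we have addressed up until now and will stick with throughout most of this book, we assume that both the training data and the test data are drawn *independently* from *identical* distributions. This is commonly called the *i.i.d. assumption*, which means that the process that samples our data has no memory. In other words, the second example drawn and the third drawn are no more correlated than the second and the two-millionth sample drawn.
-
-Being a good machine learning scientist requires thinking critically, and already you should be poking holes in this assumption, coming up with common cases where the assumption fails. What if we train a mortality risk predictor on data collected from patients at UCSF Medical Center, and apply it on patients at Massachusetts General Hospital? These distributions are simply not identical. Moreover, draws might be correlated in time. What if we are classifying the topics of Tweets? The news cycle would create temporal dependencies in the topics being discussed, violating any assumptions of independence.
-
-Sometimes we can get away with minor violations of the i.i.d. assumption and our models will continue to work remarkably well. After all, nearly every real-world application involves at least some minor violation of the i.i.d. assumption, and yet we have many useful tools for various applications such as face recognition, speech recognition, and language translation.
-
-Other violations are sure to cause trouble. Imagine, for example, if we try to train a face recognition system exclusively on university students and then want to deploy it as a tool for monitoring elderly residents in a nursing home. This is unlikely to work well since college students tend to look considerably different from the elderly.
-
-In subsequent chapters, we will discuss problems arising from violations of the i.i.d. assumption. For now, even taking the i.i.d. assumption for granted, understanding generalization is a formidable problem. Moreover, elucidating the precise theoretical foundations that might explain why deep neural networks generalize as well as they do continues to vex the greatest minds in learning theory.
-
-When we train our models, we attempt to search for a function that fits the training data as well as possible. If the function is so flexible that it can catch on to spurious patterns just as easily as to true associations, then it might perform *too well* without producing a model that generalizes well to unseen data. This is precisely what we want to avoid or at least control. Many of the techniques in deep learning are heuristics and tricks aimed at guarding against overfitting.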
-
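-### Model Complexity
-
-When we have simple models and abundant data, we expect the generalization error to resemble the training error. When we work with more complex models and fewer examples, we expect the training error to go down but the generalization gap to grow. What precisely constitutes model complexity is a complex matter. Many factors govern whether a model will generalize well. For example, a model with more parameters might be considered more complex. A model whose parameters can take a wider range of values might be more complex. Often with neural networks, we think of a model that takes more training iterations as more complex, and one subject to *early stopping* (fewer training iterations) as less complex.
-
-It can be difficult to compare the complexity among members of substantially different model classes (say, decision trees vs. neural networks). For now, a simple rule of thumb is quite useful: a model that can readily explain arbitrary facts is what statisticians view as complex, whereas one that has only a limited expressive power but still manages to explain the data well is probably closer to the truth. In philosophy, this is closely related to Popper's criterion of falsifiability of a scientific theory: a theory is good if it fits data and if there are specific tests that can be used to disprove it. This is important since all statistical estimation is
-*post hoc*,
-i.e., we estimate after we observe the facts, hence vulnerable to the associated fallacy. For now, we will put the philosophy aside and stick to more tangible issues.
-
-In this section, to give you some intuition, we will focus on a few factors that tend to influence the generalizability of a model class:
-
-1. The number of tunable parameters. When the number of tunable parameters, sometimes called the *degrees of freedom*, is large, models tend to be more susceptible to overfitting.
-1. The values taken by the parameters. When weights can take a wider range of values, models can be more susceptible to overfitting.
-1. The number of training examples. It is trivially easy to overfit a dataset containing only one or two examples even if your model is simple. But overfitting a dataset with millions of examples requires an extremely flexible model.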
-
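-## Model Selection
-
-In machine learning, we usually select our final model after evaluating several candidate models. This process is called *model selection*. Sometimes the models subject to comparison are fundamentally different in nature (say, decision trees vs. linear models). At other times, we are comparing members of the same class of models that have been trained with different hyperparameter settings.
-
-With MLPs, for example, we may wish to compare models with different numbers of hidden layers, different numbers of hidden units, and various choices of the activation functions applied to each hidden layer. In order to determine the best among our candidate models, we will typically employ a validation dataset.
-
-### Validation Dataset
-
-In principle, we should not touch our test set until after we have chosen all our hyperparameters. Were we to use the test data in the model selection process, there is a risk that we might overfit the test data. Then we would be in serious trouble. If we overfit our training data, there is always the evaluation on test data to keep us honest. But if we overfit the test data, how would we ever know?
-
-Thus, we should never rely on the test data for model selection. And yet we cannot rely solely on the training data for model selection either because we cannot estimate the generalization error on the very data that we use to train the model.
-
-In practical applications, the picture gets muddier. While ideally we would only touch the test data once, to assess the very best model or to compare a small number of models to each other, real-world test data is seldom discarded after just one use. We can seldom afford a new test set for each round of experiments.
-
-The common practice to address this problem is to split our data three ways, incorporating a *validation dataset* (or *validation set*) in addition to the training and test datasets. The result is a murky practice where the boundaries between validation and test data are worryingly ambiguous. Unless explicitly stated otherwise, in the experiments in this book we are really working with what should rightly be called training data and validation data, with no true test sets. Therefore, the accuracy reported in each experiment of the book is really the validation accuracy and not a true test set accuracy.
-
-### $K$-Fold Cross-Validation
-
-When training data is scarce, we might not even be able to afford to hold out enough data to constitute a proper validation set. One popular solution to this problem is to employ $K$*-fold cross-validation*. Here, the original training data is split into $K$ non-overlapping subsets. Then model training and validation are executed $K$ times, each time training on $K-1$ subsets and validating on a different subset (the one not used for training in that round). Finally, the training and validation errors are estimated by averaging over the results from the $K$ experiments.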
-
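-## Underfitting or Overfitting?
-
-When we compare the training and validation errors, we want to be mindful of two common situations. First, we want to watch out for cases when our training error and validation error are both substantial but there is only a small gap between them. If the model is unable to reduce the training error, that could mean that our model is too simple (i.e., insufficiently expressive) to capture the pattern that we are trying to model. Moreover, since the *generalization gap* between our training and validation errors is small, we have reason to believe that we could get away with a more complex model. This phenomenon is known as *underfitting*.
-
-On the other hand, as we discussed above, we want to watch out for the cases when our training error is significantly lower than our validation error, indicating severe *overfitting*. Note that overfitting is not always a bad thing. With deep learning especially, it is well known that the best predictive models often perform far better on training data than on holdout data. Ultimately, we usually care more about the validation error than about the gap between the training and validation errors.
-
-Whether we overfit or underfit can depend both on the complexity of our model and the size of the available training datasets, two topics that we discuss below.
-
-### Model Complexity
-
-To illustrate some classical intuition about overfitting and model complexity, we give an example using polynomials. Given training data consisting of a single feature $x$ and a corresponding real-valued label $y$, we try to find the polynomial of degree $d$
-
-$$\hat{y}= \sum_{i=0}^d x^i w_i$$
-
-to estimate the labels $y$. This is just a linear regression problem where our features are given by the powers of $x$, the model's weights are given by $w_i$, and the bias is given by $w_0$ since $x^0 = 1$ for all $x$. Since this is just a linear regression problem, we can use the squared error as our loss function.
-
-A higher-order polynomial function is more complex than a lower-order polynomial function, since the higher-order polynomial has more parameters and the model function's selection range is wider. Fixing the training dataset, higher-order polynomial functions should always achieve lower (at worst, equal) training error relative to lower-degree polynomials. In fact, whenever the data examples each have a distinct value of $x$, a polynomial function with degree equal to the number of data examples can fit the training set perfectly. We visualize the relationship between polynomial degree and underfitting vs. overfitting in :numref:`fig_capacity_vs_error`.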
-
-![Influence of model complexity on underfitting and overfitting](../img/capacity-vs-error.svg)
-:label:`fig_capacity_vs_error`
-
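-### Dataset Size
-
-The other big consideration to bear in mind is the dataset size. Fixing our model, the fewer samples we have in the training dataset, the more likely (and more severely) we are to encounter overfitting. As we increase the amount of training data, the generalization error typically decreases. Moreover, in general, more data never hurts. For a fixed task and data distribution, there is typically a relationship between model complexity and dataset size. Given more data, we might profitably attempt to fit a more complex model. Absent sufficient data, simpler models may be more difficult to beat. For many tasks, deep learning only outperforms linear models when many thousands of training examples are available. In part, the current success of deep learning owes to the current abundance of massive datasets due to Internet companies, cheap storage, connected devices, and the broad digitization of the economy.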
-
-## Polynomial Regression
-
-We can now (**explore these concepts interactively by fitting polynomials to data.**)
-
-```{.python .input}
-from d2l import mxnet as d2l
-from mxnet import gluon, np, npx
-from mxnet.gluon import nn
-import math
-npx.set_np()
-```
-
-```{.python .input}
-#@tab pytorch
-from d2l import torch as d2l
-import torch
-from torch import nn
-import numpy as np
-import math
-```
-
-```{.python .input}
-#@tab tensorflow
-from d2l import tensorflow as d2l
-import tensorflow as tf
-import numpy as np
-import math
-```
-
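-### Generating the Dataset
-
-First we need data. Given $x$, we will [**use the following cubic polynomial to generate the labels**] on training and test data:
-
-(**$$y = 5 + 1.2x - 3.4\frac{x^2}{2!} + 5.6 \frac{x^3}{3!} + \epsilon \text{ where } \epsilon \sim \mathcal{N}(0, 0.1^2).$$**)
-
-The noise term $\epsilon$ obeys a normal distribution with a mean of 0 and a standard deviation of 0.1. For optimization, we typically want to avoid very large values of gradients or losses. This is why the *features* are rescaled from $x^i$ to $\frac{x^i}{i!}$. It allows us to avoid very large values for large exponents $i$. We will synthesize 100 samples each for the training set and test set.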
-
-```{.python .input}
-#@tab all
-max_degree = 20  # Maximum degree of the polynomial
-n_train, n_test = 100, 100  # Training and test dataset sizes
-true_w = np.zeros(max_degree)  # Allocate lots of empty space
-true_w[0:4] = np.array([5, 1.2, -3.4, 5.6])
-
-features = np.random.normal(size=(n_train + n_test, 1))
-np.random.shuffle(features)
-poly_features = np.power(features, np.arange(max_degree).reshape(1, -1))
-for i in range(max_degree):
-    poly_features[:, i] /= math.gamma(i + 1)  # `gamma(n)` = (n-1)!
-# Shape of `labels`: (`n_train` + `n_test`,)
-labels = np.dot(poly_features, true_w)
-labels += np.random.normal(scale=0.1, size=labels.shape)
-```
-
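-Again, monomials stored in `poly_features` are rescaled by the gamma function, where $\Gamma(n)=(n-1)!$. [**Take a look at the first 2 samples**] from the generated dataset. The value 1 is technically a feature, namely the constant feature corresponding to the bias.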
-
-```{.python .input}
-#@tab pytorch, tensorflow
-# Convert from NumPy ndarrays to tensors
-true_w, features, poly_features, labels = [d2l.tensor(x, dtype=
-    d2l.float32) for x in [true_w, features, poly_features, labels]]
-```
-
-```{.python .input}
-#@tab all
-features[:2], poly_features[:2, :], labels[:2]
-```
-
-### Training and Testing the Model
-
-Let's first [**implement a function to evaluate the loss on a given dataset**].
-
-```{.python .input}
-#@tab mxnet, tensorflow
-def evaluate_loss(net, data_iter, loss):  #@save
-    """Evaluate the loss of a model on the given dataset."""
-    metric = d2l.Accumulator(2)  # Sum of losses, no. of examples
-    for X, y in data_iter:
-        l = loss(net(X), y)
-        metric.add(d2l.reduce_sum(l), d2l.size(l))
-    return metric[0] / metric[1]
-```
-
-```{.python .input}
-#@tab pytorch
-def evaluate_loss(net, data_iter, loss):  #@save
-    """Evaluate the loss of a model on the given dataset."""
-    metric = d2l.Accumulator(2)  # Sum of losses, no. of examples
-    for X, y in data_iter:
-        out = net(X)
-        y = d2l.reshape(y, out.shape)
-        l = loss(out, y)
-        metric.add(d2l.reduce_sum(l), d2l.size(l))
-    return metric[0] / metric[1]
-```
-
-Now [**define the training function**].
-
-```{.python .input}
-def train(train_features, test_features, train_labels, test_labels,
-          num_epochs=400):
-    loss = gluon.loss.L2Loss()
-    net = nn.Sequential()
-    # Switch off the bias since we already catered for it in the polynomial
-    # features
-    net.add(nn.Dense(1, use_bias=False))
-    net.initialize()
-    batch_size = min(10, train_labels.shape[0])
-    train_iter = d2l.load_array((train_features, train_labels), batch_size)
-    test_iter = d2l.load_array((test_features, test_labels), batch_size,
-                               is_train=False)
-    trainer = gluon.Trainer(net.collect_params(), 'sgd',
-                            {'learning_rate': 0.01})
-    animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log',
-                            xlim=[1, num_epochs], ylim=[1e-3, 1e2],
-                            legend=['train', 'test'])
-    for epoch in range(num_epochs):
-        d2l.train_epoch_ch3(net, train_iter, loss, trainer)
-        if epoch == 0 or (epoch + 1) % 20 == 0:
-            animator.add(epoch + 1, (evaluate_loss(net, train_iter, loss),
-                                     evaluate_loss(net, test_iter, loss)))
-    print('weight:', net[0].weight.data().asnumpy())
-```
-
-```{.python .input}
-#@tab pytorch
-def train(train_features, test_features, train_labels, test_labels,
-          num_epochs=400):
-    loss = nn.MSELoss(reduction='none')
-    input_shape = train_features.shape[-1]
-    # Switch off the bias since we already catered for it in the polynomial
-    # features
-    net = nn.Sequential(nn.Linear(input_shape, 1, bias=False))
-    batch_size = min(10, train_labels.shape[0])
-    train_iter = d2l.load_array((train_features, train_labels.reshape(-1,1)),
-                                batch_size)
-    test_iter = d2l.load_array((test_features, test_labels.reshape(-1,1)),
-                               batch_size, is_train=False)
-    trainer = torch.optim.SGD(net.parameters(), lr=0.001)
-    animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log',
-                            xlim=[1, num_epochs], ylim=[1e-3, 1e2],
-                            legend=['train', 'test'])
-    for epoch in range(num_epochs):
-        d2l.train_epoch_ch3(net, train_iter, loss, trainer)
-        if epoch == 0 or (epoch + 1) % 20 == 0:
-            animator.add(epoch + 1, (evaluate_loss(net, train_iter, loss),
-                                     evaluate_loss(net, test_iter, loss)))
-    print('weight:', net[0].weight.data.numpy())
-```
-
-```{.python .input}
-#@tab tensorflow
-def train(train_features, test_features, train_labels, test_labels,
-          num_epochs=400):
-    loss = tf.losses.MeanSquaredError()
-    input_shape = train_features.shape[-1]
-    # Switch off the bias since we already catered for it in the polynomial
-    # features
-    net = tf.keras.Sequential()
-    net.add(tf.keras.layers.Dense(1, use_bias=False))
-    batch_size = min(10, train_labels.shape[0])
-    train_iter = d2l.load_array((train_features, train_labels), batch_size)
-    test_iter = d2l.load_array((test_features, test_labels), batch_size,
-                               is_train=False)
-    trainer = tf.keras.optimizers.SGD(learning_rate=.01)
-    animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log',
-                            xlim=[1, num_epochs], ylim=[1e-3, 1e2],
-                            legend=['train', 'test'])
-    for epoch in range(num_epochs):
-        d2l.train_epoch_ch3(net, train_iter, loss, trainer)
-        if epoch == 0 or (epoch + 1) % 20 == 0:
-            animator.add(epoch + 1, (evaluate_loss(net, train_iter, loss),
-                                     evaluate_loss(net, test_iter, loss)))
-    print('weight:', net.get_weights()[0].T)
-```
-
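-### [**Third-Order Polynomial Function Fitting (Normal)**]
-
-We will begin by using a third-order polynomial function, which is the same order as that of the data generation function. The results show that this model's training and test losses can both be effectively reduced. The learned model parameters are also close to the true values $w = [5, 1.2, -3.4, 5.6]$.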
-
-```{.python .input}
-#@tab all
-# Pick the first four dimensions, i.e., 1, x, x^2/2!, x^3/3! from the
-# polynomial features
-train(poly_features[:n_train, :4], poly_features[n_train:, :4],
-      labels[:n_train], labels[n_train:])
-```
-
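-### [**Linear Function Fitting (Underfitting)**]
-
-Let's take another look at linear function fitting. After the decline in early epochs, it becomes difficult to further decrease this model's training loss. After the last epoch iteration has been completed, the training loss is still high. When used to fit nonlinear patterns (like the third-order polynomial function here), linear models are liable to underfit.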
-
-```{.python .input}
-#@tab all
-# Pick the first two dimensions, i.e., 1, x, from the polynomial features
-train(poly_features[:n_train, :2], poly_features[n_train:, :2],
-      labels[:n_train], labels[n_train:])
-```
-
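-### [**Higher-Order Polynomial Function Fitting (Overfitting)**]
-
-Now let's try to train the model using a polynomial of too high a degree. Here, there is insufficient data to learn that the higher-degree coefficients should have values close to zero. As a result, our overly complex model is so susceptible that it is being influenced by noise in the training data. Though the training loss can be effectively reduced, the test loss is still much higher. This shows that the complex model overfits the data.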
-
-```{.python .input}
-#@tab all
-# Pick all the dimensions from the polynomial features
-train(poly_features[:n_train, :], poly_features[n_train:, :],
-      labels[:n_train], labels[n_train:], num_epochs=1500)
-```
-
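-In the subsequent sections, we will continue to discuss overfitting problems and methods for dealing with them, such as weight decay and dropout.
-
-## Summary
-
-* Since the generalization error cannot be estimated based on the training error, simply minimizing the training error will not necessarily mean a reduction in the generalization error. Machine learning models need to guard against overfitting so as to minimize the generalization error.
-* A validation set can be used for model selection, provided that it is not used too liberally.
-* Underfitting means that a model is not able to reduce the training error. When the training error is much lower than the validation error, there is overfitting.
-* We should choose an appropriately complex model and avoid using insufficient training samples.
-
-## Exercises
-
-1. Can you solve the polynomial regression problem exactly? Hint: use linear algebra.
-1. Consider model selection for polynomials:
-    1. Plot the training loss vs. model complexity (degree of the polynomial). What do you observe? What degree of polynomial do you need to reduce the training loss to 0?
-    1. Plot the test loss in this case.
-    1. Generate the same plot as a function of the amount of data.
-1. What happens if you drop the normalization ($1/i!$) of the polynomial features $x^i$? Can you fix this in some other way?
-1. Can you ever expect to see zero generalization error?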
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/96)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/97)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/234)
-:end_tab:
diff --git a/chapter_multilayer-perceptrons/underfit-overfit_origin.md b/chapter_multilayer-perceptrons/underfit-overfit_origin.md
deleted file mode 100644
index 7b15444..0000000
--- a/chapter_multilayer-perceptrons/underfit-overfit_origin.md
+++ /dev/null
@@ -1,665 +0,0 @@
-# Model Selection, Underfitting, and Overfitting
-:label:`sec_model_selection`
-
-As machine learning scientists,
-our goal is to discover *patterns*.
-But how can we be sure that we have
-truly discovered a *general* pattern
-and not simply memorized our data?
-For example, imagine that we wanted to hunt
-for patterns among genetic markers
-linking patients to their dementia status,
-where the labels are drawn from the set
-$\{\text{dementia}, \text{mild cognitive impairment}, \text{healthy}\}$.
-Because each person's genes identify them uniquely
-(ignoring identical siblings),
-it is possible to memorize the entire dataset.
-
-We do not want our model to say
-*"That's Bob! I remember him! He has dementia!"*
-The reason why is simple.
-When we deploy the model in the future,
-we will encounter patients
-that the model has never seen before.
-Our predictions will only be useful
-if our model has truly discovered a *general* pattern.
-
-To recapitulate more formally,
-our goal is to discover patterns
-that capture regularities in the underlying population
-from which our training set was drawn.
-If we are successful in this endeavor,
-then we could successfully assess risk
-even for individuals that we have never encountered before.
-This problem---how to discover patterns that *generalize*---is
-the fundamental problem of machine learning.
-
-The danger is that when we train models,
-we access just a small sample of data.
-The largest public image datasets contain
-roughly one million images.
-More often, we must learn from only thousands
-or tens of thousands of data examples.
-In a large hospital system, we might access
-hundreds of thousands of medical records.
-When working with finite samples, we run the risk
-that we might discover apparent associations
-that turn out not to hold up when we collect more data.
-
-The phenomenon of fitting our training data
-more closely than we fit the underlying distribution is called *overfitting*, and the techniques used to combat overfitting are called *regularization*.
-In the previous sections, you might have observed
-this effect while experimenting with the Fashion-MNIST dataset.
-If you altered the model structure or the hyperparameters during the experiment, you might have noticed that with enough neurons, layers, and training epochs, the model can eventually reach perfect accuracy on the training set, even as the accuracy on test data deteriorates.
-
-
-## Training Error and Generalization Error
-
-In order to discuss this phenomenon more formally,
-we need to differentiate between training error and generalization error.
-The *training error* is the error of our model
-as calculated on the training dataset,
-while *generalization error* is the expectation of our model's error
-were we to apply it to an infinite stream of additional data examples
-drawn from the same underlying data distribution as our original sample.
-
-Problematically, we can never calculate the generalization error exactly.
-That is because the stream of infinite data is an imaginary object.
-In practice, we must *estimate* the generalization error
-by applying our model to an independent test set
-constituted of a random selection of data examples
-that were withheld from our training set.
-
-The following three thought experiments
-will help illustrate this situation better.
-Consider a college student trying to prepare for their final exam.
-A diligent student will strive to practice well
-and test their abilities using exams from previous years.
-Nonetheless, doing well on past exams is no guarantee
-that they will excel when it matters.
-For instance, the student might try to prepare
-by rote learning the answers to the exam questions.
-This requires the student to memorize many things.
-They might even remember the answers for past exams perfectly.
-Another student might prepare by trying to understand
-the reasons for giving certain answers.
-In most cases, the latter student will do much better.
-
-Likewise, consider a model that simply uses a lookup table to answer questions. If the set of allowable inputs is discrete and reasonably small, then perhaps after viewing *many* training examples, this approach would perform well. Still, this model has no ability to do better than random guessing when faced with examples that it has never seen before.
-In reality, the input spaces are far too large to memorize the answers corresponding to every conceivable input. For example, consider grayscale $28\times28$ images. If each pixel can take one among $256$ grayscale values, then there are $256^{784}$ possible images. That means that there are far more low-resolution grayscale thumbnail-sized images than there are atoms in the universe. Even if we could encounter such data, we could never afford to store the lookup table.
-
-Last, consider the problem of trying
-to classify the outcomes of coin tosses (class 0: heads, class 1: tails)
-based on some contextual features that might be available.
-Suppose that the coin is fair.
-No matter what algorithm we come up with,
-the generalization error will always be $\frac{1}{2}$.
-However, for most algorithms,
-we should expect our training error to be considerably lower,
-depending on the luck of the draw,
-even if we did not have any features!
-Consider the dataset {0, 1, 1, 1, 0, 1}.
-Our feature-less algorithm would have to fall back on always predicting
-the *majority class*, which appears from our limited sample to be *1*.
-In this case, the model that always predicts class 1
-will incur an error of $\frac{1}{3}$,
-considerably better than our generalization error.
-As we increase the amount of data,
-the probability that the fraction of heads
-will deviate significantly from $\frac{1}{2}$ diminishes,
-and our training error would come to match the generalization error.
-
-### Statistical Learning Theory
-
-Since generalization is the fundamental problem in machine learning,
-you might not be surprised to learn
-that many mathematicians and theorists have dedicated their lives
-to developing formal theories to describe this phenomenon.
-In their [eponymous theorem](https://en.wikipedia.org/wiki/Glivenko%E2%80%93Cantelli_theorem), Glivenko and Cantelli
-derived the rate at which the training error
-converges to the generalization error.
-In a series of seminal papers, [Vapnik and Chervonenkis](https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_theory)
-extended this theory to more general classes of functions.
-This work laid the foundations of statistical learning theory.
-
-
-In the standard supervised learning setting, which we have addressed up until now and will stick with throughout most of this book,
-we assume that both the training data and the test data
-are drawn *independently* from *identical* distributions.
-This is commonly called the *i.i.d. assumption*,
-which means that the process that samples our data has no memory.
-In other words,
-the second example drawn and the third drawn
-are no more correlated than the second and the two-millionth sample drawn.
-
-Being a good machine learning scientist requires thinking critically,
-and already you should be poking holes in this assumption,
-coming up with common cases where the assumption fails.
-What if we train a mortality risk predictor
-on data collected from patients at UCSF Medical Center,
-and apply it on patients at Massachusetts General Hospital?
-These distributions are simply not identical.
-Moreover, draws might be correlated in time.
-What if we are classifying the topics of Tweets?
-The news cycle would create temporal dependencies
-in the topics being discussed, violating any assumptions of independence.
-
-Sometimes we can get away with minor violations of the i.i.d. assumption
-and our models will continue to work remarkably well.
-After all, nearly every real-world application
-involves at least some minor violation of the i.i.d. assumption,
-and yet we have many useful tools for
-various applications such as
-face recognition,
-speech recognition, and language translation.
-
-Other violations are sure to cause trouble.
-Imagine, for example, if we try to train
-a face recognition system by training it
-exclusively on university students
-and then want to deploy it as a tool
-for monitoring elderly residents in a nursing home.
-This is unlikely to work well since college students
-tend to look considerably different from the elderly.
-
-In subsequent chapters, we will discuss problems
-arising from violations of the i.i.d. assumption.
-For now, even taking the i.i.d. assumption for granted,
-understanding generalization is a formidable problem.
-Moreover, elucidating the precise theoretical foundations
-that might explain why deep neural networks generalize as well as they do
-continues to vex the greatest minds in learning theory.
-
-When we train our models, we attempt to search for a function
-that fits the training data as well as possible.
-If the function is so flexible that it can catch on to spurious patterns
-just as easily as to true associations,
-then it might perform *too well* without producing a model
-that generalizes well to unseen data.
-This is precisely what we want to avoid or at least control.
-Many of the techniques in deep learning are heuristics and tricks
-aimed at guarding against overfitting.
-
-### Model Complexity
-
-When we have simple models and abundant data,
-we expect the generalization error to resemble the training error.
-When we work with more complex models and fewer examples,
-we expect the training error to go down but the generalization gap to grow.
-What precisely constitutes model complexity is a complex matter.
-Many factors govern whether a model will generalize well.
-For example, a model with more parameters might be considered more complex.
-A model whose parameters can take a wider range of values
-might be more complex.
-Often with neural networks, we think of a model
-that takes more training iterations as more complex,
-and one subject to *early stopping* (fewer training iterations) as less complex.
-
-It can be difficult to compare the complexity among members
-of substantially different model classes
-(say, decision trees vs. neural networks).
-For now, a simple rule of thumb is quite useful:
-a model that can readily explain arbitrary facts
-is what statisticians view as complex,
-whereas one that has only a limited expressive power
-but still manages to explain the data well
-is probably closer to the truth.
-In philosophy, this is closely related to Popper's
-criterion of falsifiability
-of a scientific theory: a theory is good if it fits data
-and if there are specific tests that can be used to disprove it.
-This is important since all statistical estimation is
-*post hoc*,
-i.e., we estimate after we observe the facts,
-hence vulnerable to the associated fallacy.
-For now, we will put the philosophy aside and stick to more tangible issues.
-
-In this section, to give you some intuition,
-we will focus on a few factors that tend
-to influence the generalizability of a model class:
-
-1. The number of tunable parameters. When the number of tunable parameters, sometimes called the *degrees of freedom*, is large, models tend to be more susceptible to overfitting.
-1. The values taken by the parameters. When weights can take a wider range of values, models can be more susceptible to overfitting.
-1. The number of training examples. It is trivially easy to overfit a dataset containing only one or two examples even if your model is simple. But overfitting a dataset with millions of examples requires an extremely flexible model.
-
-## Model Selection
-
-In machine learning, we usually select our final model
-after evaluating several candidate models.
-This process is called *model selection*.
-Sometimes the models subject to comparison
-are fundamentally different in nature
-(say, decision trees vs. linear models).
-At other times, we are comparing
-members of the same class of models
-that have been trained with different hyperparameter settings.
-
-With MLPs, for example,
-we may wish to compare models with
-different numbers of hidden layers,
-different numbers of hidden units,
-and various choices of the activation functions
-applied to each hidden layer.
-In order to determine the best among our candidate models,
-we will typically employ a validation dataset.
-
-
-### Validation Dataset
-
-In principle we should not touch our test set
-until after we have chosen all our hyperparameters.
-Were we to use the test data in the model selection process,
-there is a risk that we might overfit the test data.
-Then we would be in serious trouble.
-If we overfit our training data,
-there is always the evaluation on test data to keep us honest.
-But if we overfit the test data, how would we ever know?
-
-
-Thus, we should never rely on the test data for model selection.
-And yet we cannot rely solely on the training data
-for model selection either because
-we cannot estimate the generalization error
-on the very data that we use to train the model.
-
-
-In practical applications, the picture gets muddier.
-While ideally we would only touch the test data once,
-to assess the very best model or to compare
-a small number of models to each other,
-real-world test data is seldom discarded after just one use.
-We can seldom afford a new test set for each round of experiments.
-
-The common practice to address this problem
-is to split our data three ways,
-incorporating a *validation dataset* (or *validation set*)
-in addition to the training and test datasets.
-The result is a murky practice where the boundaries
-between validation and test data are worryingly ambiguous.
-Unless explicitly stated otherwise, in the experiments in this book
-we are really working with what should rightly be called
-training data and validation data, with no true test sets.
-Therefore, the accuracy reported in each experiment of the book is really the validation accuracy and not a true test set accuracy.
-
-### $K$-Fold Cross-Validation
-
-When training data is scarce,
-we might not even be able to afford to hold out
-enough data to constitute a proper validation set.
-One popular solution to this problem is to employ
-$K$*-fold cross-validation*.
-Here, the original training data is split into $K$ non-overlapping subsets.
-Then model training and validation are executed $K$ times,
-each time training on $K-1$ subsets and validating
-on a different subset (the one not used for training in that round).
-Finally, the training and validation errors are estimated
-by averaging over the results from the $K$ experiments.
-
-## Underfitting or Overfitting?
-
-When we compare the training and validation errors,
-we want to be mindful of two common situations.
-First, we want to watch out for cases
-when our training error and validation error are both substantial
-but there is little gap between them.
-If the model is unable to reduce the training error,
-that could mean that our model is too simple
-(i.e., insufficiently expressive)
-to capture the pattern that we are trying to model.
-Moreover, since the *generalization gap*
-between our training and validation errors is small,
-we have reason to believe that we could get away with a more complex model.
-This phenomenon is known as *underfitting*.
-
-On the other hand, as we discussed above,
-we want to watch out for the cases
-when our training error is significantly lower
-than our validation error, indicating severe *overfitting*.
-Note that overfitting is not always a bad thing.
-With deep learning especially, it is well known
-that the best predictive models often perform
-far better on training data than on holdout data.
-Ultimately, we usually care more about the validation error
-than about the gap between the training and validation errors.
-
-Whether we overfit or underfit can depend
-both on the complexity of our model
-and the size of the available training datasets,
-two topics that we discuss below.
-
-### Model Complexity
-
-To illustrate some classical intuition
-about overfitting and model complexity,
-we give an example using polynomials.
-Given training data consisting of a single feature $x$
-and a corresponding real-valued label $y$,
-we try to find the polynomial of degree $d$
-
-$$\hat{y}= \sum_{i=0}^d x^i w_i$$
-
-to estimate the labels $y$.
-This is just a linear regression problem
-where our features are given by the powers of $x$,
-the model's weights are given by $w_i$,
-and the bias is given by $w_0$ since $x^0 = 1$ for all $x$.
-Since this is just a linear regression problem,
-we can use the squared error as our loss function.
-
-
-A higher-order polynomial function is more complex
-than a lower-order polynomial function,
-since the higher-order polynomial has more parameters
-and can therefore represent a wider range of functions.
-Fixing the training dataset,
-higher-order polynomial functions should always
-achieve lower (at worst, equal) training error
-relative to lower degree polynomials.
-In fact, whenever the data examples each have a distinct value of $x$,
-a polynomial function with degree equal to the number of data examples
-can fit the training set perfectly.
-We visualize the relationship between polynomial degree
-and underfitting vs. overfitting in :numref:`fig_capacity_vs_error`.
-
-![Influence of model complexity on underfitting and overfitting](../img/capacity-vs-error.svg)
-:label:`fig_capacity_vs_error`
-
-### Dataset Size
-
-The other big consideration to bear in mind is the dataset size.
-Fixing our model, the fewer samples we have in the training dataset,
-the more likely (and more severely) we are to encounter overfitting.
-As we increase the amount of training data,
-the generalization error typically decreases.
-Moreover, in general, more data never hurt.
-For a fixed task and data distribution,
-there is typically a relationship between model complexity and dataset size.
-Given more data, we might profitably attempt to fit a more complex model.
-Absent sufficient data, simpler models may be more difficult to beat.
-For many tasks, deep learning only outperforms linear models
-when many thousands of training examples are available.
-In part, the current success of deep learning
-owes to the current abundance of massive datasets
-due to Internet companies, cheap storage, connected devices,
-and the broad digitization of the economy.
-
-## Polynomial Regression
-
-We can now (**explore these concepts interactively
-by fitting polynomials to data.**)
-
-```{.python .input}
-from d2l import mxnet as d2l
-from mxnet import gluon, np, npx
-from mxnet.gluon import nn
-import math
-npx.set_np()
-```
-
-```{.python .input}
-#@tab pytorch
-from d2l import torch as d2l
-import torch
-from torch import nn
-import numpy as np
-import math
-```
-
-```{.python .input}
-#@tab tensorflow
-from d2l import tensorflow as d2l
-import tensorflow as tf
-import numpy as np
-import math
-```
-
-### Generating the Dataset
-
-First we need data. Given $x$, we will [**use the following cubic polynomial to generate the labels**] on training and test data:
-
-(**$$y = 5 + 1.2x - 3.4\frac{x^2}{2!} + 5.6 \frac{x^3}{3!} + \epsilon \text{ where }
-\epsilon \sim \mathcal{N}(0, 0.1^2).$$**)
-
-The noise term $\epsilon$ obeys a normal distribution
-with a mean of 0 and a standard deviation of 0.1.
-For optimization, we typically want to avoid
-very large values of gradients or losses.
-This is why the *features*
-are rescaled from $x^i$ to $\frac{x^i}{i!}$.
-It allows us to avoid very large values for large exponents $i$.
-We will synthesize 100 samples each for the training set and test set.
-
-```{.python .input}
-#@tab all
-max_degree = 20  # Maximum degree of the polynomial
-n_train, n_test = 100, 100  # Training and test dataset sizes
-true_w = np.zeros(max_degree)  # Allocate lots of empty space
-true_w[0:4] = np.array([5, 1.2, -3.4, 5.6])
-
-features = np.random.normal(size=(n_train + n_test, 1))
-np.random.shuffle(features)
-poly_features = np.power(features, np.arange(max_degree).reshape(1, -1))
-for i in range(max_degree):
-    poly_features[:, i] /= math.gamma(i + 1)  # `gamma(n)` = (n-1)!
-# Shape of `labels`: (`n_train` + `n_test`,)
-labels = np.dot(poly_features, true_w)
-labels += np.random.normal(scale=0.1, size=labels.shape)
-```
-
-Again, monomials stored in `poly_features`
-are rescaled by the gamma function,
-where $\Gamma(n)=(n-1)!$.
-[**Take a look at the first 2 samples**] from the generated dataset.
-The value 1 is technically a feature,
-namely the constant feature corresponding to the bias.
-
-```{.python .input}
-#@tab pytorch, tensorflow
-# Convert from NumPy ndarrays to tensors
-true_w, features, poly_features, labels = [d2l.tensor(x, dtype=
-    d2l.float32) for x in [true_w, features, poly_features, labels]]
-```
-
-```{.python .input}
-#@tab all
-features[:2], poly_features[:2, :], labels[:2]
-```
-
-### Training and Testing the Model
-
-Let us first [**implement a function to evaluate the loss on a given dataset**].
-
-```{.python .input}
-#@tab mxnet, tensorflow
-def evaluate_loss(net, data_iter, loss):  #@save
-    """Evaluate the loss of a model on the given dataset."""
-    metric = d2l.Accumulator(2)  # Sum of losses, no. of examples
-    for X, y in data_iter:
-        l = loss(net(X), y)
-        metric.add(d2l.reduce_sum(l), d2l.size(l))
-    return metric[0] / metric[1]
-```
-
-```{.python .input}
-#@tab pytorch
-def evaluate_loss(net, data_iter, loss):  #@save
-    """Evaluate the loss of a model on the given dataset."""
-    metric = d2l.Accumulator(2)  # Sum of losses, no. of examples
-    for X, y in data_iter:
-        out = net(X)
-        y = d2l.reshape(y, out.shape)
-        l = loss(out, y)
-        metric.add(d2l.reduce_sum(l), d2l.size(l))
-    return metric[0] / metric[1]
-```
-
-Now [**define the training function**].
-
-```{.python .input}
-def train(train_features, test_features, train_labels, test_labels,
-          num_epochs=400):
-    loss = gluon.loss.L2Loss()
-    net = nn.Sequential()
-    # Switch off the bias since we already catered for it in the polynomial
-    # features
-    net.add(nn.Dense(1, use_bias=False))
-    net.initialize()
-    batch_size = min(10, train_labels.shape[0])
-    train_iter = d2l.load_array((train_features, train_labels), batch_size)
-    test_iter = d2l.load_array((test_features, test_labels), batch_size,
-                               is_train=False)
-    trainer = gluon.Trainer(net.collect_params(), 'sgd',
-                            {'learning_rate': 0.01})
-    animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log',
-                            xlim=[1, num_epochs], ylim=[1e-3, 1e2],
-                            legend=['train', 'test'])
-    for epoch in range(num_epochs):
-        d2l.train_epoch_ch3(net, train_iter, loss, trainer)
-        if epoch == 0 or (epoch + 1) % 20 == 0:
-            animator.add(epoch + 1, (evaluate_loss(net, train_iter, loss),
-                                     evaluate_loss(net, test_iter, loss)))
-    print('weight:', net[0].weight.data().asnumpy())
-```
-
-```{.python .input}
-#@tab pytorch
-def train(train_features, test_features, train_labels, test_labels,
-          num_epochs=400):
-    loss = nn.MSELoss(reduction='none')
-    input_shape = train_features.shape[-1]
-    # Switch off the bias since we already catered for it in the polynomial
-    # features
-    net = nn.Sequential(nn.Linear(input_shape, 1, bias=False))
-    batch_size = min(10, train_labels.shape[0])
-    train_iter = d2l.load_array((train_features, train_labels.reshape(-1,1)),
-                                batch_size)
-    test_iter = d2l.load_array((test_features, test_labels.reshape(-1,1)),
-                               batch_size, is_train=False)
-    trainer = torch.optim.SGD(net.parameters(), lr=0.001)
-    animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log',
-                            xlim=[1, num_epochs], ylim=[1e-3, 1e2],
-                            legend=['train', 'test'])
-    for epoch in range(num_epochs):
-        d2l.train_epoch_ch3(net, train_iter, loss, trainer)
-        if epoch == 0 or (epoch + 1) % 20 == 0:
-            animator.add(epoch + 1, (evaluate_loss(net, train_iter, loss),
-                                     evaluate_loss(net, test_iter, loss)))
-    print('weight:', net[0].weight.data.numpy())
-```
-
-```{.python .input}
-#@tab tensorflow
-def train(train_features, test_features, train_labels, test_labels,
-          num_epochs=400):
-    loss = tf.losses.MeanSquaredError()
-    input_shape = train_features.shape[-1]
-    # Switch off the bias since we already catered for it in the polynomial
-    # features
-    net = tf.keras.Sequential()
-    net.add(tf.keras.layers.Dense(1, use_bias=False))
-    batch_size = min(10, train_labels.shape[0])
-    train_iter = d2l.load_array((train_features, train_labels), batch_size)
-    test_iter = d2l.load_array((test_features, test_labels), batch_size,
-                               is_train=False)
-    trainer = tf.keras.optimizers.SGD(learning_rate=.01)
-    animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log',
-                            xlim=[1, num_epochs], ylim=[1e-3, 1e2],
-                            legend=['train', 'test'])
-    for epoch in range(num_epochs):
-        d2l.train_epoch_ch3(net, train_iter, loss, trainer)
-        if epoch == 0 or (epoch + 1) % 20 == 0:
-            animator.add(epoch + 1, (evaluate_loss(net, train_iter, loss),
-                                     evaluate_loss(net, test_iter, loss)))
-    print('weight:', net.get_weights()[0].T)
-```
-
-### [**Third-Order Polynomial Function Fitting (Normal)**]
-
-We begin with a third-order polynomial function, which has the same order as the data generation function.
-The results show that this model's training and test losses can both be effectively reduced.
-The learned model parameters are also close
-to the true values $w = [5, 1.2, -3.4, 5.6]$.
-
-```{.python .input}
-#@tab all
-# Pick the first four dimensions, i.e., 1, x, x^2/2!, x^3/3! from the
-# polynomial features
-train(poly_features[:n_train, :4], poly_features[n_train:, :4],
-      labels[:n_train], labels[n_train:])
-```
-
-### [**Linear Function Fitting (Underfitting)**]
-
-Let us take another look at linear function fitting.
-After the decline in early epochs,
-it becomes difficult to further decrease
-this model's training loss.
-After the last epoch has been completed,
-the training loss is still high.
-When used to fit nonlinear patterns
-(like the third-order polynomial function here)
-linear models are liable to underfit.
-
-```{.python .input}
-#@tab all
-# Pick the first two dimensions, i.e., 1, x, from the polynomial features
-train(poly_features[:n_train, :2], poly_features[n_train:, :2],
-      labels[:n_train], labels[n_train:])
-```
-
-### [**Higher-Order Polynomial Function Fitting  (Overfitting)**]
-
-Now let us try to train the model
-using a polynomial whose degree is too high.
-Here, there are insufficient data to learn that
-the higher-degree coefficients should have values close to zero.
-As a result, our overly complex model
-is susceptible to being influenced
-by noise in the training data.
-Though the training loss can be effectively reduced,
-the test loss is still much higher.
-It shows that
-the complex model overfits the data.
-
-```{.python .input}
-#@tab all
-# Pick all the dimensions from the polynomial features
-train(poly_features[:n_train, :], poly_features[n_train:, :],
-      labels[:n_train], labels[n_train:], num_epochs=1500)
-```
-
-In the subsequent sections, we will continue
-to discuss overfitting problems
-and methods for dealing with them,
-such as weight decay and dropout.
-
-
-## Summary
-
-* Since the generalization error cannot be estimated based on the training error, simply minimizing the training error will not necessarily mean a reduction in the generalization error. Machine learning models need to be safeguarded against overfitting so as to minimize the generalization error.
-* A validation set can be used for model selection, provided that it is not used too liberally.
-* Underfitting means that a model is not able to reduce the training error. When training error is much lower than validation error, there is overfitting.
-* We should choose an appropriately complex model and avoid using insufficient training samples.
-
-
-## Exercises
-
-1. Can you solve the polynomial regression problem exactly? Hint: use linear algebra.
-1. Consider model selection for polynomials:
-    1. Plot the training loss vs. model complexity (degree of the polynomial). What do you observe? What degree of polynomial do you need to reduce the training loss to 0?
-    1. Plot the test loss in this case.
-    1. Generate the same plot as a function of the amount of data.
-1. What happens if you drop the normalization ($1/i!$) of the polynomial features $x^i$? Can you fix this in some other way?
-1. Can you ever expect to see zero generalization error?
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/96)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/97)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/234)
-:end_tab:
diff --git a/chapter_multilayer-perceptrons/weight-decay.md b/chapter_multilayer-perceptrons/weight-decay.md
deleted file mode 100644
index 66d8d65..0000000
--- a/chapter_multilayer-perceptrons/weight-decay.md
+++ /dev/null
@@ -1,361 +0,0 @@
-# Weight Decay
-:label:`sec_weight_decay`
-
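-Now that we have characterized the problem of overfitting, we can introduce some standard techniques for regularizing models. Recall that we can always mitigate overfitting by going out and collecting more training data. That can be costly, time consuming, or entirely out of our control, making it impossible in the short run. For now, we can assume that we already have as much high-quality data as our resources permit and focus on regularization techniques.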
-
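-Recall that in our polynomial regression example (:numref:`sec_model_selection`) we could limit our model's capacity simply by tweaking the degree of the fitted polynomial. Indeed, limiting the number of features is a popular technique to mitigate overfitting. However, simply tossing aside features can be too blunt an instrument for the job. Sticking with the polynomial regression example, consider what might happen with high-dimensional inputs. The natural extensions of polynomials to multivariate data are called *monomials*, which are simply products of powers of variables. The degree of a monomial is the sum of the powers. For example, $x_1^2 x_2$ and $x_3 x_5^2$ are both monomials of degree 3.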
-
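-Note that the number of terms with degree $d$ blows up rapidly as $d$ grows larger. Given $k$ variables, the number of monomials of degree $d$ (i.e., $k$ multichoose $d$) is ${k - 1 + d} \choose {k - 1}$. Even small changes in degree, say from $2$ to $3$, dramatically increase the complexity of our model. Thus we often need a more fine-grained tool for adjusting function complexity.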
-
-## Norms and Weight Decay
-
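-We have described both the $L_2$ norm and the $L_1$ norm, which are special cases of the more general $L_p$ norm in :numref:`subsec_lin-algebra-norms`. (***Weight decay* (commonly called $L_2$ regularization) is perhaps the most widely used technique for regularizing parametric machine learning models.**) The technique is motivated by the basic intuition that among all functions $f$, the function $f = 0$ (assigning the value $0$ to all inputs) is in some sense the *simplest*, and that we can measure the complexity of a function by its distance from zero. But how precisely should we measure the distance between a function and zero? There is no single right answer. In fact, entire branches of mathematics, including parts of functional analysis and the theory of Banach spaces, are devoted to answering this question.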
-
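-One simple interpretation might be to measure the complexity of a linear function $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$ by some norm of its weight vector, e.g., $\| \mathbf{w} \|^2$. The most common method for ensuring a small weight vector is to add its norm as a penalty term to the problem of minimizing the loss. Thus we replace our original objective,
-*minimizing the prediction loss on the training labels*,
-with a new objective,
-*minimizing the sum of the prediction loss and the penalty term*.
-Now, if our weight vector grows too large, our learning algorithm might focus on minimizing the weight norm $\| \mathbf{w} \|^2$ rather than minimizing the training error. That is exactly what we want. To illustrate things in code, let us revive our previous linear regression example from :numref:`sec_linear_regression`. There, our loss was given by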
-
-$$L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$
-
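-Recall that $\mathbf{x}^{(i)}$ are the features, $y^{(i)}$ are the labels for all data examples $i$, and $(\mathbf{w}, b)$ are the weight and bias parameters, respectively. To penalize the size of the weight vector, we must somehow add $\| \mathbf{w} \|^2$ to the loss function, but how should the model trade off the standard loss for this new additive penalty? In practice, we characterize this tradeoff via the *regularization constant* $\lambda$, a non-negative hyperparameter that we fit using validation data: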
-
-$$L(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2,$$
-
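-For $\lambda = 0$, we recover our original loss function. For $\lambda > 0$, we restrict the size of $\| \mathbf{w} \|$. We divide by $2$ by convention: when we take the derivative of a quadratic function, the $2$ and $1/2$ cancel out, ensuring that the expression for the update looks nice and simple. The astute reader might wonder why we work with the squared norm and not the standard norm (i.e., the Euclidean distance). We do this for computational convenience. By squaring the $L_2$ norm, we remove the square root, leaving the sum of squares of each component of the weight vector. This makes the derivative of the penalty easy to compute: the sum of derivatives equals the derivative of the sum.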
-
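-Moreover, you might ask why we work with the $L_2$ norm in the first place and not, say, the $L_1$ norm. In fact, other choices are valid and popular throughout statistics. While $L_2$-regularized linear models constitute the classic *ridge regression* algorithm, $L_1$-regularized linear regression is a similarly fundamental model in statistics, popularly known as *lasso regression*.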
-
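-One reason to work with the $L_2$ norm is that it places an outsize penalty on large components of the weight vector. This biases our learning algorithm towards models that distribute weight evenly across a larger number of features. In practice, this might make them more robust to measurement error in a single variable. By contrast, $L_1$ penalties lead to models that concentrate weights on a small set of features by clearing the other weights to zero. This is called *feature selection*, which may be desirable for other reasons.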
-
-Using the same notation as in :eqref:`eq_linreg_batch_update`, the minibatch stochastic gradient descent update for $L_2$-regularized regression is as follows:
-
-$$
-\begin{aligned}
-\mathbf{w} & \leftarrow \left(1- \eta\lambda \right) \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right).
-\end{aligned}
-$$
-
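-As before, we update $\mathbf{w}$ based on the amount by which our estimate differs from the observation. However, we also shrink the size of $\mathbf{w}$ towards zero. That is why the method is sometimes called "weight decay": given the penalty term alone, our optimization algorithm *decays* the weight at each step of training. In contrast to feature selection, weight decay offers us a continuous mechanism for adjusting the complexity of a function. Smaller values of $\lambda$ correspond to a less constrained $\mathbf{w}$, whereas larger values of $\lambda$ constrain $\mathbf{w}$ more considerably.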
-
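-Whether we include a corresponding bias penalty $b^2$ can vary across implementations, and may vary across layers of a neural network. Often, we do not regularize the bias term of a network's output layer.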
-
-## High-Dimensional Linear Regression
-
-We can illustrate the benefits of weight decay through a simple synthetic example.
-
-```{.python .input}
-%matplotlib inline
-from d2l import mxnet as d2l
-from mxnet import autograd, gluon, init, np, npx
-from mxnet.gluon import nn
-npx.set_np()
-```
-
-```{.python .input}
-#@tab pytorch
-%matplotlib inline
-from d2l import torch as d2l
-import torch
-from torch import nn
-```
-
-```{.python .input}
-#@tab tensorflow
-%matplotlib inline
-from d2l import tensorflow as d2l
-import tensorflow as tf
-```
-
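-First, we [**generate some data as before**]
-
-(**$$y = 0.05 + \sum_{i = 1}^d 0.01 x_i + \epsilon \text{ where } \epsilon \sim \mathcal{N}(0, 0.01^2).$$**)
-
-We choose our label to be a linear function of our inputs, corrupted by Gaussian noise with zero mean and standard deviation 0.01. To make the effects of overfitting pronounced, we can increase the dimensionality of our problem to $d = 200$ and work with a small training set containing only 20 examples.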
-
-```{.python .input}
-#@tab all
-n_train, n_test, num_inputs, batch_size = 20, 100, 200, 5
-true_w, true_b = d2l.ones((num_inputs, 1)) * 0.01, 0.05
-train_data = d2l.synthetic_data(true_w, true_b, n_train)
-train_iter = d2l.load_array(train_data, batch_size)
-test_data = d2l.synthetic_data(true_w, true_b, n_test)
-test_iter = d2l.load_array(test_data, batch_size, is_train=False)
-```
-
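-## Implementation from Scratch
-
-In the following, we will implement weight decay from scratch, simply by adding the squared $L_2$ penalty to the original target function.
-
-### [**Initializing Model Parameters**]
-
-First, we will define a function to randomly initialize our model parameters.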
-
-```{.python .input}
-def init_params():
-    w = np.random.normal(scale=1, size=(num_inputs, 1))
-    b = np.zeros(1)
-    w.attach_grad()
-    b.attach_grad()
-    return [w, b]
-```
-
-```{.python .input}
-#@tab pytorch
-def init_params():
-    w = torch.normal(0, 1, size=(num_inputs, 1), requires_grad=True)
-    b = torch.zeros(1, requires_grad=True)
-    return [w, b]
-```
-
-```{.python .input}
-#@tab tensorflow
-def init_params():
-    w = tf.Variable(tf.random.normal(mean=0, stddev=1, shape=(num_inputs, 1)))
-    b = tf.Variable(tf.zeros(shape=(1, )))
-    return [w, b]
-```
-
-### (**Defining $L_2$ Norm Penalty**)
-
-Perhaps the most convenient way to implement this penalty is to square all terms in place and sum them up.
-
-```{.python .input}
-def l2_penalty(w):
-    return (w**2).sum() / 2
-```
-
-```{.python .input}
-#@tab pytorch
-def l2_penalty(w):
-    return torch.sum(w.pow(2)) / 2
-```
-
-```{.python .input}
-#@tab tensorflow
-def l2_penalty(w):
-    return tf.reduce_sum(tf.pow(w, 2)) / 2
-```
-
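-### [**Defining the Training Loop**]
-
-The following code fits a model on the training set and evaluates it on the test set. The linear network and the squared loss have not changed since :numref:`chap_linear`, so we will just import them via `d2l.linreg` and `d2l.squared_loss`. The only change here is that our loss now includes the penalty term.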
-
-```{.python .input}
-def train(lambd):
-    w, b = init_params()
-    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
-    num_epochs, lr = 100, 0.003
-    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
-                            xlim=[5, num_epochs], legend=['train', 'test'])
-    for epoch in range(num_epochs):
-        for X, y in train_iter:
-            with autograd.record():
-                # The L2 norm penalty term has been added, and broadcasting
-                # makes `l2_penalty(w)` a vector whose length is `batch_size`
-                l = loss(net(X), y) + lambd * l2_penalty(w)
-            l.backward()
-            d2l.sgd([w, b], lr, batch_size)
-        if (epoch + 1) % 5 == 0:
-            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
-                                     d2l.evaluate_loss(net, test_iter, loss)))
-    print('L2 norm of w:', np.linalg.norm(w))
-```
-
-```{.python .input}
-#@tab pytorch
-def train(lambd):
-    w, b = init_params()
-    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
-    num_epochs, lr = 100, 0.003
-    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
-                            xlim=[5, num_epochs], legend=['train', 'test'])
-    for epoch in range(num_epochs):
-        for X, y in train_iter:
-            # The L2 norm penalty term has been added, and broadcasting
-            # makes `l2_penalty(w)` a vector whose length is `batch_size`
-            l = loss(net(X), y) + lambd * l2_penalty(w)
-            l.sum().backward()
-            d2l.sgd([w, b], lr, batch_size)
-        if (epoch + 1) % 5 == 0:
-            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
-                                     d2l.evaluate_loss(net, test_iter, loss)))
-    print('L2 norm of w:', torch.norm(w).item())
-```
-
-```{.python .input}
-#@tab tensorflow
-def train(lambd):
-    w, b = init_params()
-    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
-    num_epochs, lr = 100, 0.003
-    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
-                            xlim=[5, num_epochs], legend=['train', 'test'])
-    for epoch in range(num_epochs):
-        for X, y in train_iter:
-            with tf.GradientTape() as tape:
-                # The L2 norm penalty term has been added, and broadcasting
-                # makes `l2_penalty(w)` a vector whose length is `batch_size`
-                l = loss(net(X), y) + lambd * l2_penalty(w)
-            grads = tape.gradient(l, [w, b])
-            d2l.sgd([w, b], grads, lr, batch_size)
-        if (epoch + 1) % 5 == 0:
-            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
-                                     d2l.evaluate_loss(net, test_iter, loss)))
-    print('L2 norm of w:', tf.norm(w).numpy())
-```
-
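-### [**Training without Regularization**]
-
-We now run this code with `lambd = 0`, disabling weight decay. Note that we overfit badly: the training error decreases but the test error does not (a textbook case of overfitting).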
-
-```{.python .input}
-#@tab all
-train(lambd=0)
-```
-
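-### [**Using Weight Decay**]
-
-Below, we run with substantial weight decay. Note that the training error increases but the test error decreases. This is precisely the effect we expect from regularization.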
-
-```{.python .input}
-#@tab all
-train(lambd=3)
-```
-
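-## [**Concise Implementation**]
-
-Because weight decay is ubiquitous in neural network optimization, deep learning frameworks make it especially convenient, integrating weight decay into the optimization algorithm itself for easy use in combination with any loss function. Moreover, this integration serves a computational benefit, allowing implementation tricks to add weight decay to the algorithm without any additional computational overhead. Since the weight decay portion of the update depends only on the current value of each parameter, the optimizer must touch each parameter once anyway.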
-
-:begin_tab:`mxnet`
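-In the following code, we specify the weight decay hyperparameter directly through `wd` when instantiating our `Trainer`. By default, Gluon decays both weights and biases simultaneously. Note that the hyperparameter `wd` will be multiplied by `wd_mult` when updating model parameters. Thus, if we set `wd_mult` to zero, the bias parameter $b$ will not decay.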
-:end_tab:
-
-:begin_tab:`pytorch`
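-In the following code, we specify the weight decay hyperparameter directly through `weight_decay` when instantiating our optimizer. By default, PyTorch decays both weights and biases simultaneously. Here we only set `weight_decay` for the weight, so the bias parameter $b$ will not decay.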
-:end_tab:
-
-:begin_tab:`tensorflow`
-In the following code, we create an $L_2$ regularizer with the weight decay hyperparameter `wd` and apply it to the layer through the `kernel_regularizer` argument.
-:end_tab:
-
-```{.python .input}
-def train_concise(wd):
-    net = nn.Sequential()
-    net.add(nn.Dense(1))
-    net.initialize(init.Normal(sigma=1))
-    loss = gluon.loss.L2Loss()
-    num_epochs, lr = 100, 0.003
-    trainer = gluon.Trainer(net.collect_params(), 'sgd',
-                            {'learning_rate': lr, 'wd': wd})
-    # The bias parameter has not decayed. Bias names generally end with "bias"
-    net.collect_params('.*bias').setattr('wd_mult', 0)
-    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
-                            xlim=[5, num_epochs], legend=['train', 'test'])
-    for epoch in range(num_epochs):
-        for X, y in train_iter:
-            with autograd.record():
-                l = loss(net(X), y)
-            l.backward()
-            trainer.step(batch_size)
-        if (epoch + 1) % 5 == 0:
-            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
-                                     d2l.evaluate_loss(net, test_iter, loss)))
-    print('L2 norm of w:', np.linalg.norm(net[0].weight.data()))
-```
-
-```{.python .input}
-#@tab pytorch
-def train_concise(wd):
-    net = nn.Sequential(nn.Linear(num_inputs, 1))
-    for param in net.parameters():
-        param.data.normal_()
-    loss = nn.MSELoss(reduction='none')
-    num_epochs, lr = 100, 0.003
-    # The bias parameter has not decayed
-    trainer = torch.optim.SGD([
-        {"params":net[0].weight,'weight_decay': wd},
-        {"params":net[0].bias}], lr=lr)
-    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
-                            xlim=[5, num_epochs], legend=['train', 'test'])
-    for epoch in range(num_epochs):
-        for X, y in train_iter:
-            trainer.zero_grad()
-            l = loss(net(X), y)
-            l.sum().backward()
-            trainer.step()
-        if (epoch + 1) % 5 == 0:
-            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
-                                     d2l.evaluate_loss(net, test_iter, loss)))
-    print('L2 norm of w:', net[0].weight.norm().item())
-```
-
-```{.python .input}
-#@tab tensorflow
-def train_concise(wd):
-    net = tf.keras.models.Sequential()
-    net.add(tf.keras.layers.Dense(
-        1, kernel_regularizer=tf.keras.regularizers.l2(wd)))
-    net.build(input_shape=(1, num_inputs))
-    w, b = net.trainable_variables
-    loss = tf.keras.losses.MeanSquaredError()
-    num_epochs, lr = 100, 0.003
-    trainer = tf.keras.optimizers.SGD(learning_rate=lr)
-    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
-                            xlim=[5, num_epochs], legend=['train', 'test'])
-    for epoch in range(num_epochs):
-        for X, y in train_iter:
-            with tf.GradientTape() as tape:
-                # `tf.keras` requires retrieving and adding the losses from
-                # layers manually for custom training loop.
-                l = loss(net(X), y) + net.losses
-            grads = tape.gradient(l, net.trainable_variables)
-            trainer.apply_gradients(zip(grads, net.trainable_variables))
-        if (epoch + 1) % 5 == 0:
-            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
-                                     d2l.evaluate_loss(net, test_iter, loss)))
-    print('L2 norm of w:', tf.norm(net.get_weights()[0]).numpy())
-```
-
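-[**The plots look identical to those when we implemented weight decay from scratch**]. However, they run appreciably faster and are easier to implement, a benefit that will become more pronounced for larger problems.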
-
-```{.python .input}
-#@tab all
-train_concise(0)
-```
-
-```{.python .input}
-#@tab all
-train_concise(3)
-```
-
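-So far, we have only touched upon one notion of what constitutes a simple linear function. Moreover, what constitutes a simple nonlinear function can be an even more complex question. For instance, [reproducing kernel Hilbert space (RKHS)](https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space) allows one to apply tools introduced for linear functions in a nonlinear context. Unfortunately, RKHS-based algorithms tend to scale poorly to large, high-dimensional data. In this book we will default to the simple heuristic of applying weight decay on all layers of a deep network.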
-
-## [概要
-
-* 正則化は、過適合に対処するための一般的な方法です。学習セットの損失関数にペナルティ項を追加して、学習したモデルの複雑さを軽減します。
-* モデルを単純に保つための選択肢の 1 つは、$L_2$ ペナルティを使用した重み減衰です。これにより、学習アルゴリズムの更新ステップで重みが減衰します。
-* 重み減衰機能は、ディープラーニングフレームワークのオプティマイザーで提供されます。
-* 同じトレーニングループ内で、パラメーターのセットが異なると、更新動作が異なる場合があります。
-
-## 演習
-
-1. このセクションの推定問題で $\lambda$ の値を試してみてください。学習とテストの精度を $\lambda$ の関数としてプロットします。あなたは何を観察していますか？
-1. 検証セットを使用して $\lambda$ の最適値を求めます。それは本当に最適値でしょうか？このことは重要でしょうか？
-1. $\|\mathbf{w}\|^2$ の代わりに $\sum_i |w_i|$ をペナルティ ($L_1$ 正則化) として使用した場合、更新方程式はどのようになるでしょうか。
-1. 私たちは$\|\mathbf{w}\|^2 = \mathbf{w}^\top \mathbf{w}$ということを知っています。行列についても同様の方程式を見つけることができますか (:numref:`subsec_lin-algebra-norms` のフロベニウスノルムを参照)。
-1. 学習誤差と汎化誤差の関係を確認します。重み減衰、トレーニングの増加、適切な複雑さのモデルの使用に加えて、過適合に対処するために他にどのような方法が考えられますか？
-1. ベイズ統計では、$P(w \mid x) \propto P(x \mid w) P(w)$ を介して事後に到達する事前確率と尤度の積を使用します。正則化で $P(w)$ をどのように識別できますか？
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/98)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/99)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/236)
-:end_tab:
diff --git a/chapter_multilayer-perceptrons/weight-decay_origin.md b/chapter_multilayer-perceptrons/weight-decay_origin.md
deleted file mode 100644
index 099d19f..0000000
--- a/chapter_multilayer-perceptrons/weight-decay_origin.md
+++ /dev/null
@@ -1,560 +0,0 @@
-# Weight Decay
-:label:`sec_weight_decay`
-
-Now that we have characterized the problem of overfitting,
-we can introduce some standard techniques for regularizing models.
-Recall that we can always mitigate overfitting
-by going out and collecting more training data.
-That can be costly, time consuming,
-or entirely out of our control,
-making it impossible in the short run.
-For now, we can assume that we already have
-as much high-quality data as our resources permit
-and focus on regularization techniques.
-
-Recall that in our
-polynomial regression example
-(:numref:`sec_model_selection`)
-we could limit our model's capacity
-simply by tweaking the degree
-of the fitted polynomial.
-Indeed, limiting the number of features
-is a popular technique to mitigate overfitting.
-However, simply tossing aside features
-can be too blunt an instrument for the job.
-Sticking with the polynomial regression
-example, consider what might happen
-with high-dimensional inputs.
-The natural extensions of polynomials
-to multivariate data are called *monomials*,
-which are simply products of powers of variables.
-The degree of a monomial is the sum of the powers.
-For example, $x_1^2 x_2$, and $x_3 x_5^2$
-are both monomials of degree 3.
-
-Note that the number of terms with degree $d$
-blows up rapidly as $d$ grows larger.
-Given $k$ variables, the number of monomials
-of degree $d$ (i.e., $k$ multichoose $d$) is $\binom{k - 1 + d}{k - 1}$.
-Even small changes in degree, say from $2$ to $3$,
-dramatically increase the complexity of our model.
-Thus we often need a more fine-grained tool
-for adjusting function complexity.
-
-
-## Norms and Weight Decay
-
-We have described
-both the $L_2$ norm and the $L_1$ norm,
-which are special cases of the more general $L_p$ norm
-in :numref:`subsec_lin-algebra-norms`.
-(***Weight decay* (commonly called $L_2$ regularization)
-might be the most widely used technique
-for regularizing parametric machine learning models.**)
-The technique is motivated by the basic intuition
-that among all functions $f$,
-the function $f = 0$
-(assigning the value $0$ to all inputs)
-is in some sense the *simplest*,
-and that we can measure the complexity
-of a function by its distance from zero.
-But how precisely should we measure
-the distance between a function and zero?
-There is no single right answer.
-In fact, entire branches of mathematics,
-including parts of functional analysis
-and the theory of Banach spaces,
-are devoted to answering this question.
-
-One simple interpretation might be
-to measure the complexity of a linear function
-$f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$
-by some norm of its weight vector, e.g., $\| \mathbf{w} \|^2$.
-The most common method for ensuring a small weight vector
-is to add its norm as a penalty term
-to the problem of minimizing the loss.
-Thus we replace our original objective,
-*minimizing the prediction loss on the training labels*,
-with a new objective,
-*minimizing the sum of the prediction loss and the penalty term*.
-Now, if our weight vector grows too large,
-our learning algorithm might focus
-on minimizing the weight norm $\| \mathbf{w} \|^2$
-vs. minimizing the training error.
-That is exactly what we want.
-To illustrate things in code,
-let us revive our previous example
-from :numref:`sec_linear_regression` for linear regression.
-There, our loss was given by
-
-$$L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$
-
-Recall that $\mathbf{x}^{(i)}$ are the features,
-$y^{(i)}$ are labels for all data examples $i$, and $(\mathbf{w}, b)$
-are the weight and bias parameters, respectively.
-To penalize the size of the weight vector,
-we must somehow add $\| \mathbf{w} \|^2$ to the loss function,
-but how should the model trade off the
-standard loss for this new additive penalty?
-In practice, we characterize this tradeoff
-via the *regularization constant* $\lambda$,
-a non-negative hyperparameter
-that we fit using validation data:
-
-$$L(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2.$$
-
-For $\lambda = 0$, we recover our original loss function.
-For $\lambda > 0$, we restrict the size of $\| \mathbf{w} \|$.
-We divide by $2$ by convention:
-when we take the derivative of a quadratic function,
-the $2$ and $1/2$ cancel out, ensuring that the expression
-for the update looks nice and simple.
-The astute reader might wonder why we work with the squared
-norm and not the standard norm (i.e., the Euclidean distance).
-We do this for computational convenience.
-By squaring the $L_2$ norm, we remove the square root,
-leaving the sum of squares of
-each component of the weight vector.
-This makes the derivative of the penalty easy to compute: the sum of derivatives equals the derivative of the sum.
-
-
-Moreover, you might ask why we work with the $L_2$ norm
-in the first place and not, say, the $L_1$ norm.
-In fact, other choices are valid and
-popular throughout statistics.
-While $L_2$-regularized linear models constitute
-the classic *ridge regression* algorithm,
-$L_1$-regularized linear regression
-is a similarly fundamental model in statistics, which is popularly known as *lasso regression*.
-
-
-One reason to work with the $L_2$ norm
-is that it places an outsize penalty
-on large components of the weight vector.
-This biases our learning algorithm
-towards models that distribute weight evenly
-across a larger number of features.
-In practice, this might make them more robust
-to measurement error in a single variable.
-By contrast, $L_1$ penalties lead to models
-that concentrate weights on a small set of features by clearing the other weights to zero.
-This is called *feature selection*,
-which may be desirable for other reasons.
-
-
-Using the same notation as in :eqref:`eq_linreg_batch_update`,
-the minibatch stochastic gradient descent updates
-for $L_2$-regularized regression follow:
-
-$$
-\begin{aligned}
-\mathbf{w} & \leftarrow \left(1- \eta\lambda \right) \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right).
-\end{aligned}
-$$
-
-As before, we update $\mathbf{w}$ based on the amount
-by which our estimate differs from the observation.
-However, we also shrink the size of $\mathbf{w}$ towards zero.
-That is why the method is sometimes called "weight decay":
-given the penalty term alone,
-our optimization algorithm *decays*
-the weight at each step of training.
-In contrast to feature selection,
-weight decay offers us a continuous mechanism
-for adjusting the complexity of a function.
-Smaller values of $\lambda$ correspond
-to less constrained $\mathbf{w}$,
-whereas larger values of $\lambda$
-constrain $\mathbf{w}$ more considerably.
-
-Whether we include a corresponding bias penalty $b^2$
-can vary across implementations,
-and may vary across layers of a neural network.
-Often, we do not regularize the bias term
-of a network's output layer.
-
-## High-Dimensional Linear Regression
-
-We can illustrate the benefits of
-weight decay
-through a simple synthetic example.
-
-```{.python .input}
-%matplotlib inline
-from d2l import mxnet as d2l
-from mxnet import autograd, gluon, init, np, npx
-from mxnet.gluon import nn
-npx.set_np()
-```
-
-```{.python .input}
-#@tab pytorch
-%matplotlib inline
-from d2l import torch as d2l
-import torch
-from torch import nn
-```
-
-```{.python .input}
-#@tab tensorflow
-%matplotlib inline
-from d2l import tensorflow as d2l
-import tensorflow as tf
-```
-
-First, we [**generate some data as before**]
-
-(**$$y = 0.05 + \sum_{i = 1}^d 0.01 x_i + \epsilon \text{ where }
-\epsilon \sim \mathcal{N}(0, 0.01^2).$$**)
-
-We choose our label to be a linear function of our inputs,
-corrupted by Gaussian noise with zero mean and standard deviation 0.01.
-To make the effects of overfitting pronounced,
-we can increase the dimensionality of our problem to $d = 200$
-and work with a small training set containing only 20 examples.
-
-```{.python .input}
-#@tab all
-n_train, n_test, num_inputs, batch_size = 20, 100, 200, 5
-true_w, true_b = d2l.ones((num_inputs, 1)) * 0.01, 0.05
-train_data = d2l.synthetic_data(true_w, true_b, n_train)
-train_iter = d2l.load_array(train_data, batch_size)
-test_data = d2l.synthetic_data(true_w, true_b, n_test)
-test_iter = d2l.load_array(test_data, batch_size, is_train=False)
-```
-
-## Implementation from Scratch
-
-In the following, we will implement weight decay from scratch,
-simply by adding the squared $L_2$ penalty
-to the original target function.
-
-### [**Initializing Model Parameters**]
-
-First, we will define a function
-to randomly initialize our model parameters.
-
-```{.python .input}
-def init_params():
-    w = np.random.normal(scale=1, size=(num_inputs, 1))
-    b = np.zeros(1)
-    w.attach_grad()
-    b.attach_grad()
-    return [w, b]
-```
-
-```{.python .input}
-#@tab pytorch
-def init_params():
-    w = torch.normal(0, 1, size=(num_inputs, 1), requires_grad=True)
-    b = torch.zeros(1, requires_grad=True)
-    return [w, b]
-```
-
-```{.python .input}
-#@tab tensorflow
-def init_params():
-    w = tf.Variable(tf.random.normal(mean=0, stddev=1, shape=(num_inputs, 1)))
-    b = tf.Variable(tf.zeros(shape=(1, )))
-    return [w, b]
-```
-
-### (**Defining $L_2$ Norm Penalty**)
-
-Perhaps the most convenient way to implement this penalty
-is to square all terms in place and sum them up.
-
-```{.python .input}
-def l2_penalty(w):
-    return (w**2).sum() / 2
-```
-
-```{.python .input}
-#@tab pytorch
-def l2_penalty(w):
-    return torch.sum(w.pow(2)) / 2
-```
-
-```{.python .input}
-#@tab tensorflow
-def l2_penalty(w):
-    return tf.reduce_sum(tf.pow(w, 2)) / 2
-```
-
-### [**Defining the Training Loop**]
-
-The following code fits a model on the training set
-and evaluates it on the test set.
-The linear network and the squared loss
-have not changed since :numref:`chap_linear`,
-so we will just import them via `d2l.linreg` and `d2l.squared_loss`.
-The only change here is that our loss now includes the penalty term.
-
-```{.python .input}
-def train(lambd):
-    w, b = init_params()
-    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
-    num_epochs, lr = 100, 0.003
-    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
-                            xlim=[5, num_epochs], legend=['train', 'test'])
-    for epoch in range(num_epochs):
-        for X, y in train_iter:
-            with autograd.record():
-                # The L2 norm penalty term has been added, and broadcasting
-                # makes `l2_penalty(w)` a vector whose length is `batch_size`
-                l = loss(net(X), y) + lambd * l2_penalty(w)
-            l.backward()
-            d2l.sgd([w, b], lr, batch_size)
-        if (epoch + 1) % 5 == 0:
-            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
-                                     d2l.evaluate_loss(net, test_iter, loss)))
-    print('L2 norm of w:', np.linalg.norm(w))
-```
-
-```{.python .input}
-#@tab pytorch
-def train(lambd):
-    w, b = init_params()
-    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
-    num_epochs, lr = 100, 0.003
-    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
-                            xlim=[5, num_epochs], legend=['train', 'test'])
-    for epoch in range(num_epochs):
-        for X, y in train_iter:
-            # The L2 norm penalty term has been added, and broadcasting
-            # makes `l2_penalty(w)` a vector whose length is `batch_size`
-            l = loss(net(X), y) + lambd * l2_penalty(w)
-            l.sum().backward()
-            d2l.sgd([w, b], lr, batch_size)
-        if (epoch + 1) % 5 == 0:
-            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
-                                     d2l.evaluate_loss(net, test_iter, loss)))
-    print('L2 norm of w:', torch.norm(w).item())
-```
-
-```{.python .input}
-#@tab tensorflow
-def train(lambd):
-    w, b = init_params()
-    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
-    num_epochs, lr = 100, 0.003
-    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
-                            xlim=[5, num_epochs], legend=['train', 'test'])
-    for epoch in range(num_epochs):
-        for X, y in train_iter:
-            with tf.GradientTape() as tape:
-                # The L2 norm penalty term has been added, and broadcasting
-                # makes `l2_penalty(w)` a vector whose length is `batch_size`
-                l = loss(net(X), y) + lambd * l2_penalty(w)
-            grads = tape.gradient(l, [w, b])
-            d2l.sgd([w, b], grads, lr, batch_size)
-        if (epoch + 1) % 5 == 0:
-            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
-                                     d2l.evaluate_loss(net, test_iter, loss)))
-    print('L2 norm of w:', tf.norm(w).numpy())
-```
-
-### [**Training without Regularization**]
-
-We now run this code with `lambd = 0`,
-disabling weight decay.
-Note that we overfit badly,
-decreasing the training error but not the
-test error---a textbook case of overfitting.
-
-```{.python .input}
-#@tab all
-train(lambd=0)
-```
-
-### [**Using Weight Decay**]
-
-Below, we run with substantial weight decay.
-Note that the training error increases
-but the test error decreases.
-This is precisely the effect
-we expect from regularization.
-
-```{.python .input}
-#@tab all
-train(lambd=3)
-```
-
-## [**Concise Implementation**]
-
-Because weight decay is ubiquitous
-in neural network optimization,
-the deep learning framework makes it especially convenient,
-integrating weight decay into the optimization algorithm itself
-for easy use in combination with any loss function.
-Moreover, this integration serves a computational benefit,
-allowing implementation tricks to add weight decay to the algorithm,
-without any additional computational overhead.
-Since the weight decay portion of the update
-depends only on the current value of each parameter,
-the optimizer must touch each parameter once anyway.
-
-:begin_tab:`mxnet`
-In the following code, we specify
-the weight decay hyperparameter directly
-through `wd` when instantiating our `Trainer`.
-By default, Gluon decays both
-weights and biases simultaneously.
-Note that the hyperparameter `wd`
-will be multiplied by `wd_mult`
-when updating model parameters.
-Thus, if we set `wd_mult` to zero,
-the bias parameter $b$ will not decay.
-:end_tab:
-
-:begin_tab:`pytorch`
-In the following code, we specify
-the weight decay hyperparameter directly
-through `weight_decay` when instantiating our optimizer.
-By default, PyTorch decays both
-weights and biases simultaneously. Here we only set `weight_decay` for
-the weight, so the bias parameter $b$ will not decay.
-:end_tab:
-
-:begin_tab:`tensorflow`
-In the following code, we create an $L_2$ regularizer with
-the weight decay hyperparameter `wd` and apply it to the layer
-through the `kernel_regularizer` argument.
-:end_tab:
-
-```{.python .input}
-def train_concise(wd):
-    net = nn.Sequential()
-    net.add(nn.Dense(1))
-    net.initialize(init.Normal(sigma=1))
-    loss = gluon.loss.L2Loss()
-    num_epochs, lr = 100, 0.003
-    trainer = gluon.Trainer(net.collect_params(), 'sgd',
-                            {'learning_rate': lr, 'wd': wd})
-    # The bias parameter is not decayed. Bias names generally end with "bias"
-    net.collect_params('.*bias').setattr('wd_mult', 0)
-    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
-                            xlim=[5, num_epochs], legend=['train', 'test'])
-    for epoch in range(num_epochs):
-        for X, y in train_iter:
-            with autograd.record():
-                l = loss(net(X), y)
-            l.backward()
-            trainer.step(batch_size)
-        if (epoch + 1) % 5 == 0:
-            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
-                                     d2l.evaluate_loss(net, test_iter, loss)))
-    print('L2 norm of w:', np.linalg.norm(net[0].weight.data()))
-```
-
-```{.python .input}
-#@tab pytorch
-def train_concise(wd):
-    net = nn.Sequential(nn.Linear(num_inputs, 1))
-    for param in net.parameters():
-        param.data.normal_()
-    loss = nn.MSELoss(reduction='none')
-    num_epochs, lr = 100, 0.003
-    # The bias parameter is not decayed
-    trainer = torch.optim.SGD([
-        {"params": net[0].weight, "weight_decay": wd},
-        {"params": net[0].bias}], lr=lr)
-    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
-                            xlim=[5, num_epochs], legend=['train', 'test'])
-    for epoch in range(num_epochs):
-        for X, y in train_iter:
-            trainer.zero_grad()
-            l = loss(net(X), y)
-            l.sum().backward()
-            trainer.step()
-        if (epoch + 1) % 5 == 0:
-            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
-                                     d2l.evaluate_loss(net, test_iter, loss)))
-    print('L2 norm of w:', net[0].weight.norm().item())
-```
-
-```{.python .input}
-#@tab tensorflow
-def train_concise(wd):
-    net = tf.keras.models.Sequential()
-    net.add(tf.keras.layers.Dense(
-        1, kernel_regularizer=tf.keras.regularizers.l2(wd)))
-    net.build(input_shape=(1, num_inputs))
-    w, b = net.trainable_variables
-    loss = tf.keras.losses.MeanSquaredError()
-    num_epochs, lr = 100, 0.003
-    trainer = tf.keras.optimizers.SGD(learning_rate=lr)
-    animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
-                            xlim=[5, num_epochs], legend=['train', 'test'])
-    for epoch in range(num_epochs):
-        for X, y in train_iter:
-            with tf.GradientTape() as tape:
-                # `tf.keras` requires retrieving and adding the losses from
-                # layers manually for custom training loop.
-                l = loss(net(X), y) + tf.add_n(net.losses)
-            grads = tape.gradient(l, net.trainable_variables)
-            trainer.apply_gradients(zip(grads, net.trainable_variables))
-        if (epoch + 1) % 5 == 0:
-            animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
-                                     d2l.evaluate_loss(net, test_iter, loss)))
-    print('L2 norm of w:', tf.norm(net.get_weights()[0]).numpy())
-```
-
-[**The plots look identical to those when
-we implemented weight decay from scratch**].
-However, they run appreciably faster
-and are easier to implement,
-a benefit that will become more
-pronounced for larger problems.
-
-```{.python .input}
-#@tab all
-train_concise(0)
-```
-
-```{.python .input}
-#@tab all
-train_concise(3)
-```
-
-So far, we only touched upon one notion of
-what constitutes a simple linear function.
-Moreover, what constitutes a simple nonlinear function
-can be an even more complex question.
-For instance, [reproducing kernel Hilbert space (RKHS)](https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space)
-allows one to apply tools introduced
-for linear functions in a nonlinear context.
-Unfortunately, RKHS-based algorithms
-tend to scale poorly to large, high-dimensional data.
-In this book we will default to the simple heuristic
-of applying weight decay on all layers of a deep network.
-
-## Summary
-
-* Regularization is a common method for dealing with overfitting. It adds a penalty term to the loss function on the training set to reduce the complexity of the learned model.
-* One particular choice for keeping the model simple is weight decay using an $L_2$ penalty. This leads to weight decay in the update steps of the learning algorithm.
-* The weight decay functionality is provided in optimizers from deep learning frameworks.
-* Different sets of parameters can have different update behaviors within the same training loop.
-
-
-
-## Exercises
-
-1. Experiment with the value of $\lambda$ in the estimation problem in this section. Plot training and test accuracy as a function of $\lambda$. What do you observe?
-1. Use a validation set to find the optimal value of $\lambda$. Is it really the optimal value? Does this matter?
-1. What would the update equations look like if instead of $\|\mathbf{w}\|^2$ we used $\sum_i |w_i|$ as our penalty of choice ($L_1$ regularization)?
-1. We know that $\|\mathbf{w}\|^2 = \mathbf{w}^\top \mathbf{w}$. Can you find a similar equation for matrices (see the Frobenius norm in :numref:`subsec_lin-algebra-norms`)?
-1. Review the relationship between training error and generalization error. In addition to weight decay, increased training, and the use of a model of suitable complexity, what other ways can you think of to deal with overfitting?
-1. In Bayesian statistics we use the product of prior and likelihood to arrive at a posterior via $P(w \mid x) \propto P(x \mid w) P(w)$. How can you identify $P(w)$ with regularization?
-
-:begin_tab:`mxnet`
-[Discussions](https://discuss.d2l.ai/t/98)
-:end_tab:
-
-:begin_tab:`pytorch`
-[Discussions](https://discuss.d2l.ai/t/99)
-:end_tab:
-
-:begin_tab:`tensorflow`
-[Discussions](https://discuss.d2l.ai/t/236)
-:end_tab:
diff --git a/chapter_notation/index.md b/chapter_notation/index.md
index 035b738..2b4c86f 100644
--- a/chapter_notation/index.md
+++ b/chapter_notation/index.md
@@ -1,74 +1,74 @@
-# 表記法
+# 記法
 :label:`chap_notation`
 
-本書では、以下の表記規則を順守しています。これらの記号にはプレースホルダであるものもあれば、特定のオブジェクトを参照するものもあります。一般的な経験則として、不定冠詞「a」は、シンボルがプレースホルダであり、同じ形式のシンボルが同じタイプの他のオブジェクトを表すことができることを示します。たとえば、「$x$: スカラー」は、小文字が一般的にスカラー値を表すことを意味します。 
+本書では、以下の表記規則を順守しています。これらのシンボルの一部はプレースホルダであり、他のシンボルは特定のオブジェクトを参照します。一般的な経験則として、不定冠詞「a」は、シンボルがプレースホルダであり、同様の形式のシンボルが同じタイプの他のオブジェクトを表すことができることを示していることがよくあります。たとえば、「$x$: a scalar」は、小文字が一般にスカラー値を表すことを意味しますが、「$\mathbb{Z}$: 整数の集合」は特にシンボル $\mathbb{Z}$ を指します。 
 
 ## 数値オブジェクト
 
 * $x$: スカラー
-* $\mathbf{x}$: ベクトルです
-* $\mathbf{X}$: マトリックスです
+* $\mathbf{x}$: ベクトル
+* $\mathbf{X}$: マトリックス
 * $\mathsf{X}$: 一般的なテンソル
-* $\mathbf{I}$: 単位行列-正方形。すべての対角線エントリに $1$、すべての対角線外に $0$ をもつ
-* $x_i$、$[\mathbf{x}]_i$:$i^\mathrm{th}$ ベクトルの $i^\mathrm{th}$ エレメントです。
-* $x_{ij}$、$x_{i,j}$、$[\mathbf{X}]_{ij}$、$[\mathbf{X}]_{i,j}$: 行$i$ と列 $j$ にあるマトリックス $\mathbf{X}$ のエレメントです。
+* $\mathbf{I}$:（ある特定の次元の）単位行列、すなわち、すべての対角要素に$1$、すべての非対角要素に$0$をもつ正方行列
+* $x_i$、$[\mathbf{x}]_i$: ベクトル $\mathbf{x}$ の $i^\mathrm{th}$ 要素
+* $x_{ij}$、$x_{i,j}$、$[\mathbf{X}]_{ij}$、$[\mathbf{X}]_{i,j}$: 行$i$と列$j$の行列$\mathbf{X}$の要素。
 
-## 集合論
+## 集合理論
 
 * $\mathcal{X}$: セット
 * $\mathbb{Z}$: 整数の集合
 * $\mathbb{Z}^+$: 正の整数の集合
 * $\mathbb{R}$: 実数の集合
-* $\mathbb{R}^n$:$n$ 次元の実数ベクトルの集合
+* $\mathbb{R}^n$: 実数の$n$次元ベクトルの集合
 * $\mathbb{R}^{a\times b}$:$a$ 行と $b$ 列をもつ実数の行列の集合
-* $|\mathcal{X}|$: 集合 $\mathcal{X}$ の基数 (エレメントの数)
-* $\mathcal{A}\cup\mathcal{B}$:$\mathcal{A}$ と $\mathcal{B}$ のセットのユニオン
-* $\mathcal{A}\cap\mathcal{B}$: セット$\mathcal{A}$と$\mathcal{B}$の交差部分
-* $\mathcal{A}\setminus\mathcal{B}$:$\mathcal{A}$ から $\mathcal{B}$ の減算を設定する ($\mathcal{A}$ のうち $\mathcal{B}$ に属さない要素のみを含む)
+* $|\mathcal{X}|$: セット $\mathcal{X}$ のカーディナリティ (要素の数)
+* $\mathcal{A}\cup\mathcal{B}$: セット $\mathcal{A}$ と $\mathcal{B}$ のユニオン
+* $\mathcal{A}\cap\mathcal{B}$: セット $\mathcal{A}$ と $\mathcal{B}$ の交差
+* $\mathcal{A}\setminus\mathcal{B}$: $\mathcal{A}$ から $\mathcal{B}$ を除いた差集合 ($\mathcal{A}$ の要素のうち $\mathcal{B}$ に属さないものだけを含む)
 
 ## 関数と演算子
 
-* $f(\cdot)$: 関数です
-* $\log(\cdot)$: 自然対数 (基数 $e$)
-* $\log_2(\cdot)$: 底を底とする対数 $2$
+* $f(\cdot)$: 関数
+* $\log(\cdot)$: 自然対数 (底が$e$)
+* $\log_2(\cdot)$: 底が$2$の対数
 * $\exp(\cdot)$: 指数関数です
-* $\mathbf{1}(\cdot)$: インジケーター関数。ブール型引数が真であれば $1$、そうでなければ $0$ に評価されます。
-* $\mathbf{1}_{\mathcal{X}}(z)$: セットメンバシップインジケータ関数。エレメント $z$ がセット $\mathcal{X}$ に属していれば $1$ に評価され、そうでなければ $0$ に評価されます。
+* $\mathbf{1}(\cdot)$: インジケーター関数は、ブール引数が真の場合は$1$、そうでない場合は$0$と評価されます
+* $\mathbf{1}_{\mathcal{X}}(z)$: 集合メンバーシップ指標関数は、要素$z$が集合$\mathcal{X}$に属している場合は$1$と評価され、そうでなければ$0$と評価される
 * $\mathbf{(\cdot)}^\top$: ベクトルまたは行列の転置
-* $\mathbf{X}^{-1}$: 行列の逆行列$\mathbf{X}$
-* $\odot$: アダマール (要素単位) 積
-* $[\cdot, \cdot]$: コンカチネーション
-* $\|\cdot\|_p$:$L_p$ ノルム
-* $\|\cdot\|$:$L_2$ ノルム
+* $\mathbf{X}^{-1}$: 行列 $\mathbf{X}$ の逆行列
+* $\odot$: アダマール (要素ごとの) 積
+* $[\cdot, \cdot]$: 連結
+* $\|\cdot\|_p$:$\ell_p$ ノルム
+* $\|\cdot\|$:$\ell_2$ ノルム
 * $\langle \mathbf{x}, \mathbf{y} \rangle$: ベクトル$\mathbf{x}$と$\mathbf{y}$のドット積
-* $\sum$: 要素の集合に対する総和
-* $\prod$: 要素のコレクションに対するプロダクト
-* $\stackrel{\mathrm{def}}{=}$: 左辺のシンボルの定義として表明された等価性
+* $\sum$: 要素の集合にわたる総和
+* $\prod$: 要素の集合にわたる積
+* $\stackrel{\mathrm{def}}{=}$: 左側のシンボルの定義として表される等価性
 
 ## 微積分
 
-* $\frac{dy}{dx}$:$x$ を基準にした $y$ の微分
-* $\frac{\partial y}{\partial x}$:$x$ を基準にした $y$ の偏微分
-* $\nabla_{\mathbf{x}} y$:$\mathbf{x}$ を基準にした $y$ のグラデーション
-* $\int_a^b f(x) \;dx$:$x$ を基準にして $a$ から $b$ までの $f$ の定積分
-* $\int f(x) \;dx$:$x$ を基準にした $f$ の不定積分
+* $\frac{dy}{dx}$: $x$ に関する $y$ の導関数
+* $\frac{\partial y}{\partial x}$:$x$ に対する $y$ の偏微分
+* $\nabla_{\mathbf{x}} y$: $\mathbf{x}$ に関する $y$ の勾配
+* $\int_a^b f(x) \;dx$:$x$に対する$a$から$b$への$f$の定積分
+* $\int f(x) \;dx$:$x$ に対する $f$ の不定積分
 
-## 確率論と情報理論
+## 確率と情報理論
 
-* $X$: 確率変数です
+* $X$: 確率変数
 * $P$: 確率分布
-* $X \sim P$: 確率変数 $X$ の分布は $P$ です
-* $P(X=x)$: 確率変数 $X$ が値 $x$ を取る事象に割り当てられる確率
-* $P(X \mid Y)$:$Y$ が与えられた場合の$X$の条件付き確率分布
-* $p(\cdot)$: 分布 P に関連付けられた確率密度関数 (PDF)
-* ${E}[X]$: 確率変数の期待値 $X$
+* $X \sim P$: 確率変数 $X$ は分布 $P$ に従います
+* $P(X=x)$: 確率変数$X$が値$x$を取る事象に割り当てられる確率
+* $P(X \mid Y)$:$Y$が与えられた場合の$X$の条件付き確率分布
+* $p(\cdot)$: 分布Pに関連する確率密度関数 (PDF)
+* ${E}[X]$: 確率変数 $X$ の期待値
 * $X \perp Y$: 確率変数 $X$ と $Y$ は独立しています
-* $X \perp Y \mid Z$:$Z$ が与えられた場合、確率変数 $X$ と $Y$ は条件付きで独立しています
+* $X \perp Y \mid Z$: 確率変数$X$および$Y$は、$Z$が与えられた場合に条件付きで独立しています
 * $\sigma_X$: 確率変数 $X$ の標準偏差
-* $\mathrm{Var}(X)$: 確率変数 $X$ の分散、$\sigma^2_X$ と等しい
+* $\mathrm{Var}(X)$: 確率変数$X$の分散、$\sigma^2_X$と等しい
 * $\mathrm{Cov}(X, Y)$: 確率変数 $X$ と $Y$ の共分散
-* $\rho(X, Y)$:$X$ と $Y$ の間のピアソン相関係数は $\frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$ と等しくなります
-* $H(X)$: 確率変数のエントロピー $X$
-* $D_{\mathrm{KL}}(P\|Q)$: 分布 $Q$ から分布 $P$ への KL ダイバージェンス (または相対エントロピー)
+* $\rho(X, Y)$:$X$と$Y$の間のピアソン相関係数は、$\frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$と等しくなります
+* $H(X)$: 確率変数 $X$ のエントロピー
+* $D_{\mathrm{KL}}(P\|Q)$: 分布$Q$から分布$P$へのKLダイバージェンス（または相対エントロピー）
 
 [Discussions](https://discuss.d2l.ai/t/25)
diff --git a/chapter_notation/index_origin.md b/chapter_notation/index_origin.md
index cf80eb9..acace86 100644
--- a/chapter_notation/index_origin.md
+++ b/chapter_notation/index_origin.md
@@ -1,17 +1,20 @@
 # Notation
 :label:`chap_notation`
 
-Throughout this book, we adhere to the following notational conventions.
+Throughout this book, we adhere 
+to the following notational conventions.
 Note that some of these symbols are placeholders,
 while others refer to specific objects.
 As a general rule of thumb, 
-the indefinite article "a" indicates
+the indefinite article "a" often indicates
 that the symbol is a placeholder
 and that similarly formatted symbols
 can denote other objects of the same type.
 For example, "$x$: a scalar" means 
 that lowercased letters generally
-represent scalar values.
+represent scalar values,
+but "$\mathbb{Z}$: the set of integers"
+refers specifically to the symbol $\mathbb{Z}$.
 
 
 
@@ -21,13 +24,12 @@ represent scalar values.
 * $\mathbf{x}$: a vector
 * $\mathbf{X}$: a matrix
 * $\mathsf{X}$: a general tensor
-* $\mathbf{I}$: an identity matrix---square, with $1$ on all diagonal entries and $0$ on all off-diagonals
+* $\mathbf{I}$: the identity matrix (of some given dimension), i.e., a square matrix with $1$ on all diagonal entries and $0$ on all off-diagonals
 * $x_i$, $[\mathbf{x}]_i$: the $i^\mathrm{th}$ element of vector $\mathbf{x}$
 * $x_{ij}$, $x_{i,j}$,$[\mathbf{X}]_{ij}$, $[\mathbf{X}]_{i,j}$: the element of matrix $\mathbf{X}$ at row $i$ and column $j$.
 
 
 
-
 ## Set Theory
 
 
@@ -43,6 +45,7 @@ represent scalar values.
 * $\mathcal{A}\setminus\mathcal{B}$: set subtraction of $\mathcal{B}$ from $\mathcal{A}$ (contains only those elements of $\mathcal{A}$ that do not belong to $\mathcal{B}$)
 
 
+
 ## Functions and Operators
 
 
@@ -56,14 +59,15 @@ represent scalar values.
 * $\mathbf{X}^{-1}$: inverse of matrix $\mathbf{X}$
 * $\odot$: Hadamard (elementwise) product
 * $[\cdot, \cdot]$: concatenation
-* $\|\cdot\|_p$: $L_p$ norm
-* $\|\cdot\|$: $L_2$ norm
+* $\|\cdot\|_p$: $\ell_p$ norm
+* $\|\cdot\|$: $\ell_2$ norm
 * $\langle \mathbf{x}, \mathbf{y} \rangle$: dot product of vectors $\mathbf{x}$ and $\mathbf{y}$
 * $\sum$: summation over a collection of elements
 * $\prod$: product over a collection of elements
 * $\stackrel{\mathrm{def}}{=}$: an equality asserted as a definition of the symbol on the left-hand side
 
 
+
 ## Calculus
 
 * $\frac{dy}{dx}$: derivative of $y$ with respect to $x$
@@ -72,11 +76,13 @@ represent scalar values.
 * $\int_a^b f(x) \;dx$: definite integral of $f$ from $a$ to $b$ with respect to $x$
 * $\int f(x) \;dx$: indefinite integral of $f$ with respect to $x$
 
+
+
 ## Probability and Information Theory
 
 * $X$: a random variable
 * $P$: a probability distribution
-* $X \sim P$: the random variable $X$ has distribution $P$
+* $X \sim P$: the random variable $X$ follows distribution $P$
 * $P(X=x)$: the probability assigned to the event where random variable $X$ takes value $x$
 * $P(X \mid Y)$: the conditional probability distribution of $X$ given $Y$
 * $p(\cdot)$: a probability density function (PDF) associated with distribution P
@@ -91,4 +97,5 @@ represent scalar values.
 * $D_{\mathrm{KL}}(P\|Q)$: the KL-divergence (or relative entropy) from distribution $Q$ to distribution $P$
 
 
+
 [Discussions](https://discuss.d2l.ai/t/25)
diff --git a/chapter_preface/index.md b/chapter_preface/index.md
index bd4a220..261c7fb 100644
--- a/chapter_preface/index.md
+++ b/chapter_preface/index.md
@@ -1,74 +1,76 @@
 # 序文
 
-ほんの数年前、大手企業や新興企業でインテリジェントな製品やサービスを開発しているディープラーニングの科学者はほとんどいませんでした。私たちがこの分野に参入したとき、機械学習は日刊紙の見出しにはなりませんでした。私たちの両親は機械学習が何であるかを知りませんでした。言うまでもなく、私たちが医学や法律のキャリアよりも機械学習を好む理由は言うまでもありません。機械学習は青空の学問分野であり、その産業上の意義は、音声認識やコンピュータービジョンなど、現実世界のごく一部のアプリケーションに限られていました。さらに、これらのアプリケーションの多くは非常に多くのドメイン知識を必要とするため、機械学習が1つの小さなコンポーネントである完全に独立した領域と見なされることが多かった。当時、本書で取り上げるディープラーニング手法の前身であるニューラルネットワークは、一般的に時代遅れと見なされていました。 
+ほんの数年前まで、大手企業やスタートアップでインテリジェントな製品やサービスを開発するディープラーニングの科学者が大勢いるという状況ではありませんでした。私たちがこの分野に参入したとき、機械学習が日刊紙の見出しを飾ることはありませんでした。私たちの両親は機械学習が何であるかを知らず、ましてや私たちがなぜ医学や法律のキャリアよりも機械学習を選ぶのかは理解できませんでした。機械学習は純粋に基礎研究的な学問分野であり、その産業上の意義は、音声認識やコンピュータービジョンなど、ごく限られた実世界のアプリケーションにとどまっていました。さらに、これらのアプリケーションの多くは非常に多くのドメイン知識を必要としたため、機械学習を 1 つの小さな構成要素とする、まったく別の領域と見なされることがよくありました。当時、本書で取り上げるディープラーニング手法の前身であるニューラルネットワークは、一般的に時代遅れと見なされていました。 
 
-過去5年間で、ディープラーニングは世界を驚かせ、コンピュータビジョン、自然言語処理、自動音声認識、強化学習、生物医学情報学などの多様な分野で急速な進歩を後押ししました。さらに、実際に関心のある多くのタスクでディープラーニングが成功したことで、理論的な機械学習や統計学の発展が促進されました。これらの進歩により、かつてないほど自律性（および一部の企業が信じているよりも自律性が低い）で自走する車、最もありふれた電子メールを自動的にドラフトするスマートリプライシステム、人々が圧倒的に大きな受信トレイから掘り出すのを支援するスマートリプライシステム、およびソフトウェアを構築できるようになりました。ゴーのようなボードゲームで世界最高の人間を支配するエージェントは、かつて数十年先にあると考えられていた偉業です。すでに、これらのツールは産業や社会にますます大きな影響を及ぼし、映画の制作方法や病気の診断方法を変え、天体物理学から生物学に至る基礎科学においてますます大きな役割を果たしています。 
+ここ数年の間に、ディープラーニングは世界を驚かせ、コンピュータービジョン、自然言語処理、自動音声認識、強化学習、生物医学情報学などの多様な分野で急速な進歩をもたらしました。さらに、実用上関心の高い非常に多くのタスクでディープラーニングが成功したことは、理論的な機械学習と統計学の発展を促進することさえありました。これらの進歩により、これまで以上に高い自律性（ただし一部の企業が信じ込ませようとしているほどではない自律性）で自動運転する車、ごくありふれた電子メールを自動的に下書きして、膨大な受信トレイから人々が抜け出すのを助けるスマート返信システム、そして囲碁のようなボードゲームで世界最高の人間を圧倒するソフトウェアエージェントを構築できるようになりました。後者は、かつて数十年先のことと考えられていた偉業です。すでに、これらのツールは産業や社会にますます大きな影響を及ぼし、映画の制作方法や病気の診断方法を変え、天体物理学から生物学まで、基礎科学においてますます重要な役割を果たしています。
 
 ## この本について
 
-この本は、ディープラーニングを親しみやすくするための私たちの試みを表しており、*概念*、*コンテキスト*、*コード*を教えています。 
+この本は、ディープラーニングを親しみやすいものにするための私たちの試みを表しており、*概念*、*コンテキスト*、および*コード*を教えています。 
 
 ### コード、数学、HTMLを組み合わせた1つの媒体
 
-コンピューティングテクノロジが最大限の効果を発揮するには、そのテクノロジを十分に理解し、十分に文書化して、十分に管理された成熟したツールでサポートする必要があります。重要なアイデアは明確に抽出され、新しい開業医を最新の状態にするために必要なオンボーディング時間を最小限に抑える必要があります。成熟したライブラリは一般的なタスクを自動化するべきであり、模範的なコードは、実践者が自分のニーズに合わせて共通のアプリケーションを簡単に変更、適用、拡張できるようにすべきです。動的な Web アプリケーションを例に挙げてみましょう。Amazonのように多くの企業が1990年代にデータベース駆動型のウェブアプリケーションを開発してきたにもかかわらず、このテクノロジーがクリエイティブな起業家を支援する可能性は、過去10年間で大幅に実現されました。その一因は、強力で十分に文書化されたフレームワーク。 
+コンピューティングテクノロジーが最大限の効果を発揮するには、十分に理解され、十分に文書化され、成熟した手入れの行き届いたツールによってサポートされる必要があります。重要なアイデアを明確に抽出し、新しい実践者が最新の状況に追いつくために必要なオンボーディング時間を最小限に抑える必要があります。成熟したライブラリは一般的なタスクを自動化する必要があり、模範的なコードは、実践者が自分のニーズに合わせて一般的なアプリケーションを変更、適用、および拡張するのを容易にする必要があります。動的ウェブアプリケーションを例に挙げてみましょう。Amazonのような多数の企業が1990年代にデータベース駆動型のウェブアプリケーションの開発に成功していたにもかかわらず、このテクノロジーが創造的な起業家を支援する可能性は、強力で十分に文書化されたフレームワークが開発されたこともあって、過去10年間ではるかに大きく実現しました。
 
-ディープラーニングの可能性のテストには特有の課題があります。どのアプリケーションでもさまざまな分野が統合されるからです。ディープラーニングを適用するには、(i) 特定の方法で問題を投げかける動機、(ii) 与えられたモデルの数学的形式、(iii) モデルをデータにあてはめるための最適化アルゴリズム、(iv) モデルをいつ期待すべきかを示す統計的原理を同時に理解する必要がある目に見えないデータと、それらが実際に一般化されていることを証明するための実用的な方法に一般化する。(v) モデルを効率的にトレーニングし、数値計算の落とし穴を乗り越え、利用可能なハードウェアを最大限に活用するために必要なエンジニアリング手法。問題を定式化するために必要な批判的思考スキル、問題を解くための数学、それらのソリューションを実装するためのソフトウェアツールの両方を1か所で教えることは、非常に困難な課題です。この本での私たちの目標は、開業医になる可能性のある人をスピードアップするための統一されたリソースを提示することです。 
+ディープラーニングの可能性を試すことには、特有の課題があります。どのアプリケーションでもさまざまな分野が組み合わさるからです。ディープラーニングを適用するには、次のことを同時に理解する必要があります。（i）問題を特定の方法で定式化する動機、（ii）与えられたモデルの数学的形式、（iii）モデルをデータに適合させるための最適化アルゴリズム、（iv）モデルが未知のデータに汎化すると期待できるのはどのような場合かを示す統計的原則と、実際に汎化したことを確認するための実用的な方法、（v）モデルを効率的にトレーニングし、数値計算の落とし穴を避け、利用可能なハードウェアを最大限に活用するために必要なエンジニアリング手法。問題を定式化するために必要な批判的思考スキル、それを解くための数学、そしてそれらの解決策を実装するためのソフトウェアツールのすべてを1か所で教えることは、非常に困難な課題です。この本における私たちの目標は、実践者を目指す人がすばやく知識を身につけられる、統一されたリソースを提示することです。
 
-私たちがこの本のプロジェクトを始めたとき、（i）最新のもの、（ii）技術的な深さで現代の機械学習の全範囲を網羅したもの、（iii）魅力的な教科書に期待される品質を、クリーンで実行可能なコードでインターリーブされた説明と同時に、リソースはありませんでした。はハンズオンチュートリアルで見つかることを期待しています。特定のディープラーニングフレームワークの使い方 (TensorFlow で行列を使って基本的な数値計算を行う方法など) や、特定のテクニック (LeNet、AlexNet、ResNetsなどのコードスニペットなど) を実装するためのコード例が、さまざまなブログ投稿や GitHub リポジトリに散らばっていました。ただし、これらの例では通常
-*与えられたアプローチをどのように実装するか
+私たちがこの本のプロジェクトを始めたとき、（i）常に最新の状態に保たれ、（ii）十分な技術的深さで現代の機械学習の実践を幅広くカバーし、（iii）教科書に期待される質の解説と、実践的なチュートリアルに期待されるクリーンで実行可能なコードとを織り交ぜたリソースは存在しませんでした。特定のディープラーニングフレームワークの使い方（TensorFlow で行列を使って基本的な数値計算を行う方法など）や、特定の手法を実装するためのコード例（LeNet、AlexNet、ResNet などのコードスニペットなど）は、さまざまなブログ投稿や GitHub リポジトリに散在する形でたくさん見つかりました。ただし、これらの例は通常、
+特定のアプローチを*どのように*実装するかに焦点を当てており、
 次の点についての議論は省かれていました。
-*なぜ* 特定のアルゴリズムによる決定がなされるのか
-Web サイト [Distill](http://distill.pub) で公開された魅力的なブログ投稿や個人ブログなど、特定のトピックに対応するために一部のインタラクティブなリソースが散発的に出現しましたが、ディープラーニングでは選択されたトピックのみを取り上げ、関連するコードがないことが多々ありました。一方、ディープラーニングの基礎に関する包括的な調査を提供する :cite:`Goodfellow.Bengio.Courville.2016` など、ディープラーニングの教科書がいくつか登場しましたが、これらのリソースは記述とコード内の概念の実現を結びつけるものではなく、読者がそれらをどのように実装するのか分からないままになることがあります。さらに、商用コースプロバイダのペイウォールの背後に隠されているリソースが多すぎます。 
+*特定のアルゴリズム上の決定がなされる理由*。
+ウェブサイト[Distill](http://distill.pub)で公開された魅力的なブログ投稿や個人のブログなど、特定のトピックに対処するためにいくつかのインタラクティブリソースが散発的にポップアップしましたが、ディープラーニングの選択されたトピックのみをカバーし、関連するコードが不足していることがよくありました。一方、ディープラーニングの基礎に関する包括的な調査を提供する:cite:`Goodfellow.Bengio.Courville.2016`など、いくつかのディープラーニングの教科書が登場しましたが、これらのリソースは説明とコード内の概念の実現を結び付けていないため、読者がそれらをどのように実装するかについて無知になることがあります。さらに、商用コースプロバイダーのペイウォールの背後に隠れているリソースが多すぎます。 
 
-私たちは、(i) すべての人が自由に利用できる、(ii) 実際に機械学習の応用科学者になるための出発点を提供するのに十分な技術的深みを提供する、(iii) 読者に見せる実行可能なコードを含めることができるリソースの作成に着手しました。
+私たちは、（i）誰もが自由に利用できる、（ii）実際に機械学習の応用科学者になるための出発点を提供するのに十分な技術的深さを提供する、（iii）実行可能なコードを含めて読者に見せることができるリソースの作成に着手しました
 *実際に問題を解決する方法*
-(iv) 当社およびコミュニティ全体による迅速な更新を可能にし、(v) 技術的な詳細について対話的に議論し、質問に答えるために [forum](http://discuss.d2l.ai) を補完する。 
+(iv) 私たちとコミュニティ全体の両方による迅速な更新を可能にする。(v) 技術的な詳細の対話的な議論と質問への回答のための[forum](http://discuss.d2l.ai)によって補完される。 
 
-これらの目標はしばしば矛盾していました。方程式、定理、引用は、LaTeXで最もよく管理され、配置されます。コードは Python で最もよく記述されます。また、ウェブページはHTMLとJavaScriptでネイティブです。さらに、コンテンツは、実行可能なコード、物理的な本、ダウンロード可能なPDF、およびインターネット上で Web サイトとしてアクセスできるようにしたいと考えています。現在、これらの要求にぴったり合ったツールやワークフローは存在しないため、自社で組み立てる必要がありました。このアプローチについては :numref:`sec_how_to_contribute` で詳しく説明しています。ソースを共有し、コミュニティへの貢献を促進するために GitHub、コード、方程式、テキストを混合するための Jupyter ノートブック、複数の出力を生成するレンダリングエンジンとしてのスフィンクス、フォーラムの Discourse に取り組みました。私たちのシステムはまだ完璧ではありませんが、これらの選択は競合する懸念の中で良い妥協点を提供します。このような統合ワークフローを使って出版された本は、これが初めてかもしれないと私たちは考えています。 
+これらの目標はしばしば対立していました。方程式、定理、および引用は、LaTeXで最適に管理および配置されます。コードはPythonで記述するのが最適です。そして、ウェブページはHTMLとJavaScriptがネイティブです。さらに、コンテンツを実行可能なコードとして、物理的な本として、ダウンロード可能なPDFとして、およびWebサイトとしてインターネット上でアクセスできるようにしたいと考えています。これらの要求に適したワークフローは見当たらなかったため、独自のワークフローを組み立てることにしました (:numref:`sec_how_to_contribute`)。私たちは、ソースの共有とコミュニティからの貢献の促進には GitHub、コード、方程式、テキストの混在にはJupyterノートブック、レンダリングエンジンにはSphinx、ディスカッションプラットフォームにはDiscourseを採用しました。私たちのシステムは完璧ではありませんが、これらの選択は競合する懸念の間の妥協点となっています。*Dive into Deep Learning* は、このような統合されたワークフローを使用して出版された最初の本かもしれないと私たちは考えています。
 
-### やることによる学習
+### やって学ぼう
 
-多くの教科書は概念を連続して提示し、それぞれを網羅的に詳しく説明しています。たとえば、Chris Bishopの優れた教科書 :cite:`Bishop.2006` は、各トピックを徹底的に教えているため、線形回帰の章に入るには些細な作業が必要です。専門家はこの本を徹底的に愛していますが、真の初心者にとっては、この特性は導入テキストとしての有用性を制限します。 
+多くの教科書は、概念を順番に提示し、それぞれを網羅的に詳しく説明しています。たとえば、Chris Bishopの優れた教科書 :cite:`Bishop.2006` は、各トピックを非常に徹底的に教えているため、線形回帰の章にたどり着くまでにかなりの労力が必要です。専門家はまさにその徹底ぶりゆえにこの本を愛していますが、真の初心者にとっては、この特性が入門書としての有用性を制限しています。
 
-この本では、ほとんどの概念を*ジャストインタイム*で教えます。言い換えれば、実用的な目的を達成するために必要な概念をすぐに学ぶことができます。最初は、線形代数や確率などの基本的な予備を教えるのに少し時間がかかりますが、より難解な確率分布について心配する前に、最初のモデルをトレーニングした満足感を味わってほしい。 
+この本では、ほとんどの概念を*ジャストインタイム*で教えています。言い換えれば、実用的な目的を達成するために概念が必要になったまさにその時点で、その概念を学びます。最初は線形代数や確率などの基本的な予備知識を教えるのに少し時間をかけますが、より難解な概念について心配する前に、最初のモデルをトレーニングする満足感を味わっていただきたいと考えています。
 
-基本的な数学的背景についての短期集中コースを提供する予備的なノートブックの他に、以降の各章では、妥当な数の新しい概念と、実際のデータセットを使用した単一の自己完結型の作業例の両方を紹介します。これは組織的な課題です。モデルによっては、論理的に 1 つのノートブックにまとめられている場合があります。また、いくつかのモデルを連続して実行することで最も効果的なアイデアもあります。一方、*1つの実例、1つのノートブック*というポリシーに従うことには大きな利点があります。これにより、私たちのコードを活用して、自分の研究プロジェクトをできるだけ簡単に始めることができます。ノートブックをコピーして修正を開始するだけです。 
+基本的な数学的背景の短期集中コースを提供するいくつかの予備的なノートブックの他に、後続の各章では、妥当な数の新しい概念を紹介し、実際のデータセットを使用したいくつかの自己完結型の作業例を提供します。これは組織的な課題を提示しました。一部のモデルは、論理的に 1 つのノートブックにまとめられている場合があります。そして、いくつかのアイデアは、いくつかのモデルを連続して実行することによって最もよく教えられるかもしれません。一方、*1つの実用的な例と1つのノートブック*のポリシーに従うことには大きな利点があります。これにより、私たちのコードを活用して独自の研究プロジェクトをできるだけ簡単に開始できます。ノートブックをコピーして、修正を開始するだけです。 
 
-必要に応じて、実行可能なコードをバックグラウンドマテリアルとインターリーブします。一般的に、ツールを完全に説明する前に、ツールを利用可能にするという誤りを犯すことがよくあります (その背景については後で説明します)。例えば、*確率的勾配降下法* を使うと、なぜそれが役に立つのか、なぜ機能するのかを完全に説明することができます。これは、問題を迅速に解決するために必要な弾薬を実践者に提供するのに役立ちますが、読者に学芸員の決定を信頼してもらうことを犠牲にします。 
+全体を通して、必要に応じて実行可能なコードと背景説明を織り交ぜます。一般的に、ツールを完全に説明する前にまず使えるようにすることを優先します（背景は後から補うことがよくあります）。たとえば、*確率的勾配降下法*がなぜ有用なのかを説明したり、なぜ機能するのかについての直感を示したりする前に、それを使用することがあります。これは、キュレーション上のいくつかの判断について読者に私たちを信頼していただく必要があるという代償はあるものの、問題を迅速に解決するために必要な武器を実践者に与えるのに役立ちます。
 
-この本は、ディープラーニングの概念をゼロから教えます。ディープラーニングフレームワークの高度な抽象化によってユーザーには見えないモデルの詳細を掘り下げたい場合があります。これは特に、特定のレイヤーまたはオプティマイザーで発生するすべてのことを理解してもらいたい基本的なチュートリアルで取り上げられます。このような場合、2つのバージョンの例を挙げます。1つは、Numpyライクな機能と自動微分のみに依存してすべてをゼロから実装するバージョンと、ディープラーニングフレームワークの高レベル API を使用して簡潔なコードを記述するより実用的な例です。コンポーネントがどのように機能するかを説明したら、以降のチュートリアルで高レベル API を使用できます。 
+この本は、ディープラーニングの概念をゼロから教えています。ときには、最新のディープラーニングフレームワークによって通常はユーザーから隠されているモデルの細部を掘り下げます。これは特に、特定のレイヤーまたはオプティマイザで起こるすべてのことを理解してほしい基本的なチュートリアルで出てきます。このような場合、例の 2 つのバージョンを提示することがよくあります。1 つは、NumPyライクな機能と自動微分のみに依存してすべてをゼロから実装するバージョン、もう 1 つは、ディープラーニングフレームワークの高レベル API を使用して簡潔なコードを記述する、より実用的なバージョンです。あるコンポーネントがどのように機能するかを説明した後は、以降のチュートリアルでは高レベル API を使用します。
 
 ### コンテンツと構造
 
-本書は、予習、ディープラーニング手法、実際のシステムとアプリケーションに焦点を当てた高度なトピックに焦点を当て、大きく3つのパートに分けることができます (:numref:`fig_book_org`)。 
+この本は、予備知識、ディープラーニング技術、および実際のシステムとアプリケーションに焦点を当てた高度なトピックという、大きく3つのパートに分けることができます（:numref:`fig_book_org`）。
 
 ![Book structure](../img/book-org.svg)
 :label:`fig_book_org`
 
-* 最初のパートでは、基本と予習について説明します。
-:numref:`chap_introduction` では、ディープラーニングの概要を説明しています。その後、:numref:`chap_preliminaries` では、データを格納して操作する方法や、線形代数、微積分、確率などの基本概念に基づいたさまざまな数値演算を適用する方法など、実践的なディープラーニングに必要な前提条件について簡単に説明します。:numref:`chap_linear` と :numref:`chap_perceptrons` が最も多くカバーしています。回帰と分類、線形モデルと多層パーセプトロン、過適合と正則化など、ディープラーニングの基本的な概念と手法。 
+* **パート 1: 基本と予備知識。**
+:numref:`chap_introduction` では、ディープラーニングの概要を説明します。次に、:numref:`chap_preliminaries` では、データの保存方法と操作方法や、線形代数、微積分、確率の基本概念に基づくさまざまな数値演算の適用方法など、実践的なディープラーニングに必要な前提知識をすばやく身につけます。:numref:`chap_regression` と :numref:`chap_perceptrons` では、回帰と分類、線形モデル、多層パーセプトロン、過適合と正則化など、ディープラーニングの最も基本的な概念と手法を扱います。
 
-* 次の 5 つの章では、最新のディープラーニング手法に焦点を当てます。
-:numref:`chap_computation` は、ディープラーニングシステムの主要な計算コンポーネントについて説明し、より複雑なモデルのその後の実装の基礎を築きます。次に、:numref:`chap_cnn` と :numref:`chap_modern_cnn` では、ほとんどの最新のコンピュータービジョンシステムのバックボーンを形成する強力なツールである畳み込みニューラルネットワーク (CNN) を紹介します。同様に、:numref:`chap_rnn` と :numref:`chap_modern_rnn` ではリカレントニューラルネットワーク (RNN) が導入されています。RNN は、データのシーケンシャル (例:時間) 構造を利用するモデルで、自然言語処理や時系列予測に一般的に使用されています。:numref:`chap_attention` では、ほとんどの自然言語処理タスクの主要なアーキテクチャとして RNN に取って代わった、いわゆるアテンションメカニズムに基づく比較的新しいクラスのモデルを導入しました。これらのセクションでは、ディープラーニングの実践者が広く使用している最も強力で一般的なツールについて解説します。 
+* **パート 2: 最新のディープラーニングのテクニック。**
+:numref:`chap_computation` では、ディープラーニングシステムの主要な計算コンポーネントについて説明し、より複雑なモデルを後で実装するための基礎を築きます。次に、:numref:`chap_cnn` と :numref:`chap_modern_cnn` では、ほとんどの最新のコンピュータービジョンシステムのバックボーンを形成する強力なツールである畳み込みニューラルネットワーク（CNN）を紹介します。同様に、:numref:`chap_rnn` と :numref:`chap_modern_rnn` では、リカレントニューラルネットワーク（RNN）を紹介します。これは、データ内のシーケンシャルな（例：時間的な）構造を利用するモデルで、自然言語処理と時系列予測に一般的に使用されます。:numref:`chap_attention-and-transformers` では、ほとんどの自然言語処理タスクの主要なアーキテクチャとしてRNNに取って代わった、いわゆる*アテンションメカニズム*に基づく比較的新しいクラスのモデルを紹介します。これらのセクションにより、ディープラーニングの実践者に広く使用されている最も強力で汎用的なツールをすばやく習得できます。
 
-* 第 3 部では、スケーラビリティ、効率性、アプリケーションについて説明します。
-まず、:numref:`chap_optimization` で、ディープラーニングモデルの学習に使用される一般的な最適化アルゴリズムをいくつか取り上げます。次の章 :numref:`chap_performance` では、ディープラーニングコードの計算パフォーマンスに影響するいくつかの重要な要素について説明します。:numref:`chap_cv` では、コンピュータービジョンにおけるディープラーニングの主な応用例を示しています。:numref:`chap_nlp_pretrain` と :numref:`chap_nlp_app` では、言語表現モデルを事前トレーニングし、自然言語処理タスクに適用する方法を示します。 
+* **パート 3: スケーラビリティ、効率性、およびアプリケーション。**
+:numref:`chap_optimization` では、ディープラーニングモデルのトレーニングに使用されるいくつかの一般的な最適化アルゴリズムについて説明します。次に、:numref:`chap_performance` で、ディープラーニングコードの計算パフォーマンスに影響するいくつかの重要な要因を調べます。続いて、:numref:`chap_cv` では、コンピュータービジョンにおけるディープラーニングの主な応用例を示します。最後に、:numref:`chap_nlp_pretrain` と :numref:`chap_nlp_app` で、言語表現モデルを事前トレーニングし、自然言語処理タスクに適用する方法を示します。このパートは[オンライン](https://d2l.ai)で公開されています。
 
 ### コード
 :label:`sec_code`
 
-このマニュアルのほとんどのセクションでは、実行可能コードを取り上げています。いくつかの直感は、試行錯誤しながらコードを微調整し、結果を観察することによって最もよく発達すると信じています。理想的には、洗練された数学的理論が、望ましい結果を得るためにコードを微調整する方法を正確に教えてくれるかもしれません。しかし、今日のディープラーニングの実践者は、説得力のある理論では確固たるガイダンスを提供できない場所を踏まなければならないことがよくあります。私たちの最善の試みにもかかわらず、これらのモデルを特徴付ける数学が非常に難しい場合と、これらのトピックに関する真剣な調査が最近になってハイギアになったばかりであるため、さまざまな技術の有効性に関する正式な説明はまだ欠けています。ディープラーニングの理論が進歩するにつれて、この本の将来の版が、現在入手可能なものを凌駕する洞察を提供できることを期待しています。 
+この本のほとんどのセクションには、実行可能なコードが含まれています。直感の中には、コードを少しずつ変更して結果を観察するという試行錯誤を通じて養うのが最も良いものがあると、私たちは考えています。理想的には、洗練された数学的理論が、望ましい結果を得るためにコードをどう調整すべきかを正確に教えてくれるかもしれません。しかし、今日のディープラーニングの実践者は、確かな理論による指針がない領域に踏み込まなければならないことがよくあります。私たちの最善の努力にもかかわらず、さまざまな手法の有効性に関する正式な説明はまだ不足しています。これらのモデルを特徴付ける数学が非常に難しい場合があること、そしてこれらのトピックに関する研究がつい最近になって本格化したばかりであることが、その理由です。ディープラーニングの理論が進歩するにつれて、この本の今後の各版が、現在得られているものを凌駕する洞察を提供できることを期待しています。
 
-不要な繰り返しを避けるため、最も頻繁にインポートされ、参照される関数とクラスの一部を `d2l` パッケージにカプセル化します。関数、クラス、import ステートメントのコレクションなど、あとで `d2l` パッケージを介してアクセスされるコードブロックを示すために、`# @save `でマークします。:numref:`sec_d2l` には、これらの関数とクラスの詳細な概要が記載されています。`d2l` パッケージは軽量で、必要なのは次の依存関係のみです。
+不要な繰り返しを避けるため、最も頻繁にインポートおよび使用される関数とクラスのいくつかを `d2l` パッケージにカプセル化します。全体を通して、後で `d2l` パッケージ経由でアクセスされるコードブロック (関数、クラス、インポート文のコレクションなど) には `#@save` を付けて示します。:numref:`sec_d2l` では、これらの関数とクラスの詳細な概要を提供しています。`d2l` パッケージは軽量で、次の依存関係のみが必要です。
 
 ```{.python .input}
 #@tab all
 #@save
+import inspect
 import collections
 from collections import defaultdict
 from IPython import display
 import math
 from matplotlib import pyplot as plt
+from matplotlib_inline import backend_inline
 import os
 import pandas as pd
 import random
@@ -84,24 +86,25 @@ d2l = sys.modules[__name__]
 ```
 
 :begin_tab:`mxnet`
-この本のコードのほとんどは、深層学習のためのオープンソースフレームワークである Apache MXNet に基づいています。Apache MXNet は、AWS（アマゾンウェブサービス）だけでなく、多くの大学や企業でも好まれています。本書のすべてのコードは、最新の MXNet バージョンでのテストに合格しています。ただし、ディープラーニングの急速な発展により、MXNet の将来のバージョンでは、*Print Edition* の一部のコードが正しく動作しなくなる可能性があります。オンライン版は最新の状態に保つ予定です。問題が発生した場合は、:ref:`chap_installation` を参照してコードとランタイム環境を更新してください。 
+この本のコードのほとんどは、AWS (アマゾンウェブサービス) や多くの大学や企業で好んで使用されているディープラーニング用のオープンソースフレームワークである Apache MXNet に基づいています。この本のコードはすべて、最新の MXNet バージョンでのテストに合格しています。ただし、ディープラーニングの急速な発展により、*印刷版*の一部のコードは将来のバージョンの MXNet で正しく動作しなくなる可能性があります。オンライン版は最新の状態に保つ予定です。問題が発生した場合は、:ref:`chap_installation` を参照してコードとランタイム環境を更新してください。
 
 MXNet からモジュールをインポートする方法は次のとおりです。
 :end_tab:
 
 :begin_tab:`pytorch`
-この本のコードのほとんどは PyTorch をベースにしています。PyTorch は、ディープラーニングの研究コミュニティに熱狂的に受け入れられている非常に人気のあるオープンソースフレームワークです。本書のすべてのコードは PyTorch の最新安定版でのテストに合格しています。ただし、ディープラーニングの急速な発展により、PyTorch の将来のバージョンでは、*印刷版* の一部のコードが適切に動作しなくなる可能性があります。オンライン版は最新の状態に保つ予定です。問題が発生した場合は、:ref:`chap_installation` を参照してコードとランタイム環境を更新してください。 
+この本のコードのほとんどは、ディープラーニングの研究コミュニティに熱心に受け入れられてきた、非常に人気のあるオープンソースフレームワークであるPyTorchに基づいています。この本のすべてのコードは、PyTorchの最新の安定バージョンでのテストに合格しています。ただし、ディープラーニングの急速な発展により、*印刷版*の一部のコードは PyTorch の将来のバージョンでは正しく動作しなくなる可能性があります。オンライン版は最新の状態に保つ予定です。問題が発生した場合は、:ref:`chap_installation` を参照してコードとランタイム環境を更新してください。
 
 PyTorch からモジュールをインポートする方法は次のとおりです。
 :end_tab:
 
 :begin_tab:`tensorflow`
-本書のコードのほとんどは、業界で広く採用され、研究者の間で人気があるディープラーニング用のオープンソースフレームワークである TensorFlow をベースにしています。本書のすべてのコードは、最新の安定版 TensorFlow でのテストに合格しています。ただし、ディープラーニングの急速な発展により、*印刷版* の一部のコードは TensorFlow の将来のバージョンでは正しく動作しない可能性があります。オンライン版は最新の状態に保つ予定です。問題が発生した場合は、:ref:`chap_installation` を参照してコードとランタイム環境を更新してください。 
+この本のコードのほとんどは、業界で広く採用され、研究者の間でも人気のあるディープラーニング用のオープンソースフレームワークであるTensorFlowに基づいています。この本のコードはすべて、TensorFlowの最新の安定バージョンでのテストに合格しています。ただし、ディープラーニングの急速な発展により、*印刷版*の一部のコードは TensorFlow の将来のバージョンでは正しく動作しなくなる可能性があります。オンライン版は最新の状態に保つ予定です。問題が発生した場合は、:ref:`chap_installation` を参照してコードとランタイム環境を更新してください。
 
 TensorFlow からモジュールをインポートする方法は次のとおりです。
 :end_tab:
 
 ```{.python .input}
+#@tab mxnet
 #@save
 from mxnet import autograd, context, gluon, image, init, np, npx
 from mxnet.gluon import nn, rnn
@@ -129,31 +132,27 @@ import tensorflow as tf
 
 ### 対象読者
 
-本書は、ディープラーニングの実践的手法をしっかりと理解しようとする学生（学部生または大学院生）、エンジニア、研究者を対象としています。すべての概念をゼロから説明するため、ディープラーニングや機械学習の経験は必要ありません。ディープラーニングの方法を完全に説明するには、ある程度の数学とプログラミングが必要ですが、ここでは、ある程度の量の線形代数、微積分、確率、Python プログラミングなど、いくつかの基本を習得することを前提としています。基本を忘れた場合に備えて、付録では、この本にあるほとんどの数学について復習しています。ほとんどの場合、数学的な厳密さよりも直感とアイデアを優先します。私たちの本を理解するための前提条件を超えてこれらの基礎を拡張したい場合は、他の素晴らしいリソースを喜んでお勧めします。Bela Bollobas :cite:`Bollobas.1999`による線形解析は、線形代数と関数解析を非常に深くカバーしています。統計:cite:`Wasserman.2013`のすべては、統計のすばらしい入門書を提供します。ジョー・ブリッツスタインの確率と推論に関する[books](https://www.amazon.com/Introduction-Probability-Chapman-Statistical-Science/dp/1138369918)と[courses](https://projects.iq.harvard.edu/stat110/home)は教育学的な逸品です。もし、もし Python を前に使ったことがないなら、この [Python tutorial](http://learnpython.org/) を熟読したくなるかもしれません。 
+この本は、ディープラーニングの実践的な技術をしっかりと理解したい学生（学部生または大学院生）、エンジニア、および研究者を対象としています。すべての概念をゼロから説明するため、ディープラーニングや機械学習の経験は必要ありません。ディープラーニングの手法を完全に説明するにはある程度の数学とプログラミングが必要ですが、ここでは、適度な量の線形代数、微積分、確率、Pythonプログラミングなど、いくつかの基本を身につけていることだけを前提としています。基本を忘れた場合に備えて、付録ではこの本に出てくるほとんどの数学について復習できます。ほとんどの場合、数学的な厳密さよりも直感とアイデアを優先します。私たちの本を理解するための前提条件を超えてこれらの基礎を広げたい場合は、他の素晴らしいリソースをいくつかお勧めします。Bela Bollobasによる *Linear Analysis* :cite:`Bollobas.1999` は、線形代数と関数解析を非常に深くカバーしています。*All of Statistics* :cite:`Wasserman.2013` は、統計学のすばらしい入門書です。ジョー・ブリッツスタインによる確率と推論に関する[書籍](https://www.amazon.com/Introduction-Probability-Chapman-Statistical-Science/dp/1138369918)と[講義](https://projects.iq.harvard.edu/stat110/home)は、教育的な逸品です。Pythonを使ったことがないなら、この[Python tutorial](http://learnpython.org/)に目を通してみるとよいでしょう。
 
 ### フォーラム
 
-この本に関連して、[discuss.d2l.ai](https://discuss.d2l.ai/) にあるディスカッションフォーラムを立ち上げました。本書のいずれかのセクションについて質問がある場合は、各ノートブックの末尾に、関連するディスカッションページへのリンクがあります。 
+この本に関連して、[discuss.d2l.ai](https://discuss.d2l.ai/)にあるディスカッションフォーラムを立ち上げました。本のセクションで質問がある場合は、各ノートブックの最後にある関連するディスカッションページへのリンクを見つけることができます。 
 
 ## 謝辞
 
-英語と中国語の草稿の何百人もの貢献者に感謝しています。彼らはコンテンツの改善に役立ち、貴重なフィードバックを提供しました。具体的には、この英語ドラフトの貢献者全員が、皆のためにより良いものにしてくれたことに感謝します。GitHub の ID または名前は (順不同): alxnorden, avinashingit, bowen0701, brettkoonce, チャイタニャ・プラカシュ・バパット, cryptonaut, ダヴィデ・フィオッコ, edgarroman, gkutiel, ジョン・ミトロ, リャンプー, ラフル・アガルワル, モハメド・アリ・ジャモウイ, マイケル (ステュー)）スチュワート、マイク・ミュラー、nRauschmayr、Prakhar Srivastav、sad-、シュルミジェ、シェン・ザー、sundeepteki、topecongiro、tpdi、春雨、ヴィシャール・カプール、ヴィシェッシュ・ラヴィ・シュリマリ、YayaB、YaYab、Yuhong Chen、エフゲニー・スミルノフ、lgov、Simon Corston-Oliver、イゴール・ドレエフ、ハン・グエン、pmuens、Andrei Lukovenko、senorcinco、vdev-5、甘い、モハマド・マハディ・ラヒミ、アビシェク・グプタ、USD、domMm、リサオークリー、ボーウェン・リー、オーラッシュ・アフジャ、プラサント・ブッダレッディガリ、brianhendee、mani2106、mtn、lkevinzc、caojilin、ラクシャ、フィエテ・ルーア、スルビ・ヴィジャイヴァルギーヤ、Muhyun Kim、dennismalmgren、adursun、Anirudh Dagar、liqingnz、ペドロ・ラロイ、lgov、ati-ozgur、ジュン・ウー、マティアス・ブルーメ、リン・ユアン、geogunow、ジョシュ・ガードナー、マクシミリアン兄弟、ラキブ・イスラム、レナード・ローゼン、Abhinav Upadhyay、rongruosong、スティーブ・セドルマイヤー、ルスラン・バラトフ、ラファエル・シュラッター、liusy182、ジャンニス・パパス、ati-ozgur、abaza、dchoi77、アダム・ガーソン、Phic Le、マーク・アトウッド、クリスタベラ、vn09、海浜林、jjangga0214、リッチーチェン、ノエロ、ハンセント、ガール・ドップス、dvincent1337、whited3vil、ピーター・クリッツ、codypenta、joseppinilla、ahaurya、karolszk、heytitle、ピーター・ゲッツ、rigtorp、Tiep Vu、フィリップ、mlxd、Kale-ab Tessera、サンジャル・アディロフ、MatteoFerrara、hento、Katarzyna Biesialska、グレゴリー・ブラス、Duy—タン・ドーン、ポーローレル、グレイタウン、デュック・ファム、sl7423、ジェドン・ファン、イーダ・ワン、cys4、clhm、ジャン・カドゥール、austinmw、trebeljahr、tbaums、チョン・V・グエン、パベルコマロフ、valamal、NotAnotherSystem、J-Arun-Mani、ジャンシオ、eldarkurtic、the-great-shazbot、ドクターコロッサス、グドゥシャルム、クラウス、ダニエル・ミッチェン、ホノース、ビアージー om, abhinavsp0730, jonathanhrandall, イスラエル, ノダール・オクロシアシビリ, guurKap, ジヤン・カン, スティーブンジョーク,トマー・カフタン、liweiwp、netyster、ypandya、nishantTharani、heiligerl、sportsThu、ホア・グエン、manuel-arno-korfmann-webentwicklung、aterzis-Personal、nxby、Xiaoting He、ジョサイア・ヨーダー、数学研究、mzz2017、jroberayalas u、ghejc、bsharmi、vkramdev、simonwardjones、lakshkD、talneOran、djliden、Nikhil95、Orenバルカン、guoweis、haozhu233、pratikhack、Yue Ying、tayfununal、steinsag、charleybeller、アンドリューLumsdaine、Jiekui 
Zhang、ディーパックPathak、フロリアン・ドンハウザー、ティム・ゲイツ、アドリアン・タイセリング、ロン・メディナ、ガウラフ・サハ、ミュラ・セメルシ、レイマオ、リーバイ・マクレニー、ジョシュア・ブロイド、jake221、jonally、zyhazwraith、ブライアン・パルファー、ニックトマシノ、レファン・チャン、ホンシェン・ヤン、ヴィニー・カヴァロ、ユンタイ、ユアンシャン・チュー、アマラゾフ、パストリチャ、ベン・グリナワルド、シヴァム・ウパディー、クアンシャンゼ・ドゥ、ビスワジット・サフー、パルテ・パンディット、イシャン・クマール、ホムンクルスク、レーン・シュワルツ、バラグンジャル、ジェイソンウィーナー, アーミン・ゴラムポール, Shreshtha13, eigen-arnav, キム・ヒョンギュー, EmilYong,Bálint Mucsányi、チェイス・デュボア。 
+私たちは、英語と中国語の草稿の両方について、何百人もの貢献者に感謝しています。彼らはコンテンツの改善に役立ち、貴重なフィードバックを提供しました。具体的には、この英語ドラフトのすべての貢献者に、すべての人にとってより良いものにしてくれたことに感謝します。彼らのGitHub IDまたは名前は（順不同）です：alxnorden、avinashingit、bowen0701、brettkoonce、チャイタンニャ・プラカシュ・バパット、クリプトノート、ダビデ・フィオッコ、エドガーロマン、グクティエル、ジョン・ミトロ、リャン・プー、ラフル・アガルワル、モハメド・アリ・ジャマウイ、マイケル（Stu) スチュワート、マイク・ミュラー、nrauschmayr、Prakhar Srivastav、sad-、sfermigier、シェン・ザ、sundeepteki、topecongiro、tpdi、春雨、ヴィシャール・カプール、ヴィシェシェ・ラヴィ・シュリマリ、ヤヤブ、ユーホン・チェン、エフゲニー・スミルノフ、lgov、サイモン・コルストン＝オリバー、イゴール・ズレエフ、ハ・グエン、プムエンス、アンドレイ・ルコヴェンコ、senorcinco、vfdev-5、dsweet、モハマド・マハディ・ラヒミ、アビシェック・グプタ、米ドル、domMm、リサオークリー、ボーエン・リー、アウルシュ・アフジャ、プラサント・ブッダレディガリ、brianhendee、mani2106、mtn、lkevinzc、caojilin、ラクシャ、フィエテ・ルアー、スルビ・ヴィジェイ・ヴァルゲヤ、ムヒョン・キム、デニスマルムグレン、adursun、Anirudh Dagar、liqingnz、ペドロ・ラロイ、lgov、ati-ozgur、ジュン・ウー、マティアス・ブルーメ、リン・ユアン、geogunow、ジョシュ・ガードナー、マクシミリアンBöther, ラキブ・イスラム, レナード・ローゼン, Abhinav Upadhyay, rongruosong, スティーブ・セデルマイヤー, ルスラン・バラトフ, ラファエル・シュラッター, liusy182, ジャンニス・パパス, ati-ozgur, qbaza, dchoi77, アダム・ガーソン, フックル, マーク・アトウッド, christabella, vn09, ハイビン・リン、jjangga0214、リッチ・チェン、ノエロ、ハンセント、ギール・ドープ、dvincent1337、WhiteD3vil、ピーター・クリッツ、codypenta、joseppinilla、ahaurya、karolszk、heytitle、ピーター・ゲッツ、rigtorp、Tiep Vu、フィリップ、mlxd、Kale-Ab Tessera、サンジャール・アディロフ、マッテオ・フェラーラ、ヘネト、Katarzyna Biesialska、グレゴリー・ブラス、Duy—タン・ドアン、ポーローレル、グレイタウン、デュック・ファム、sl7423、ジェドン・ファン、イーダ・ワン、cys4、clhm、ジャン・カドゥール、austinma、trebeljahr、tbaums、チョン・V・グエン、パベルコマロフ、バラマル、notAnotherSystem、J-Arun-Mani、jancio、eldarkurtic、thegreat-shazbot、doctorcolossus、gducharme、cclauss、ダニエル・ミッチェン、hoonose、biagiag ママ、abhinavsp0730、ジョナサンランドール、イズラエル、ノダル・オクロシャシビリ、ugurKap、チヤン・カン、スティーヴン・ジョークス、Tomer Kaftan, liweiwp, netyster, ypandya, ニシャンタラニ, heiligero, sportsThu, ホア・グエン, マヌエル・アルノ・コルフマン-webentwicklung, aterzis-personal, nxby, Xiaoting He, ジョサイア・ヨーダー, 数学研究, mzz2017, jroberayalas, iluuu u, ghejc, bSharmi, vkramdev, simonwardjones, lakshkD, talneORAN, djliden, Nikhil95, オレンバルカン、guoweis、haozhu233、pratikhack、ユエ・イン、tayfununal、steinsag、charleybeller、アンドリューLumsdaine、Jiekui Zhang、Deepak 
Pathak、フロリアン・ドンハウザー、ティム・ゲイツ、アドリアン・ティイセリング、ロン・メディナ、ガウラフ・サハ、ムラート・セメルチ、リーマオ、リーバイ・マクレニー、ジョシュア・ブロイド、jake221、jonbally、zyhazwraith、ブライアン・パルファー、ニックトマシノ、レファン・チャン、ホンシェン・ヤン、ヴィニー・カヴァロ、ユンタイ、元翔朱、アマラゾフ、パスリチャ、ベン・グリーナワルド、シヴァム・ウパディヤイ、クアンシャンゼ・ドゥ、ビスワジット・サフー、パルテ・パンディット、イシャン・クマール、ホムンクルスク、レーン・シュワルツ、バラドグンジャル、ジェイソンウィーナー、アーミン・ゴラムプア、Shreshtha13、eigen-arnav、キム・ヒョンギュ、EmilYong、Bálint Mucsányi, チェイス・デュボア, Juntian Tao, Wenxiang Xu, Lifu Huang, filevich, quake2005, nils-werner, イーミン・リー, マルセル・キサムティノフ, フランチェスコ「フマ」フマガリ, ペイリン・サン, ヴィンセント・グルグル, qingfengtommy, Janmey Shuktommy ラ、モー・シャン、カーン・サンカク、レゴブ、AlexSauer、ゴパラクリシュナ・ラマチャンドラ、トビアス・エルワー、チャオ・ワン、ティアン・カオ、ニコラス・コルソーン、akash5474、kxxt、zxydi1992、ジェイコブ・ブリットン、Shuangchi He、zhmou、krahets、ジー・ハン・チェン、アティシェイ・ガーグ、マルセル・フライガール、adtygan、ニック・ヴァッセン、太字、ルイ・シュレッシンジャー、バラジ・バラタラジャン、atgctg、Kaixin Li、ビクター・バルバロス、リカルド・ムスト、エリザベス・ホー、azimjonn、Guilherme Miotto、アレッサンドロ・フィナモア、ジョジ・ジョセフ、アンソニー・ビール、Zeming Zhao。 
 
-アマゾンウェブサービス、特にスワミ・シヴァスブラマニアン、ピーター・デサンティス、アダム・セリプスキー、アンドリュー・ジャシーがこの本の執筆に寛大な支援をしてくれたことに感謝します。利用可能な時間、リソース、同僚との議論、継続的な励ましなしには、この本は実現しなかったでしょう。 
+アマゾンウェブサービス、特にスワミ・シバスブラマニアン、ピーター・デサンティス、アダム・セリプスキー、アンドリュー・ジャシーがこの本の執筆を惜しみなくサポートしてくれたことに感謝します。利用可能な時間、リソース、同僚との議論、継続的な励ましなしでは、この本は実現しなかったでしょう。 
 
-## [概要
+## まとめ
 
-* ディープラーニングはパターン認識に革命をもたらし、コンピュータービジョン、自然言語処理、自動音声認識などの幅広いテクノロジーを強化するテクノロジーを導入しました。
-* ディープラーニングをうまく適用するには、問題のキャスト方法、モデリングの数学、モデルをデータにあてはめるアルゴリズム、そしてそれをすべて実装するためのエンジニアリング手法を理解する必要があります。
-* 本書は、散文、図、数学、コードを含む包括的なリソースをすべて1か所にまとめたものです。
-* この本に関する質問に答えるには、https://discuss.d2l.ai/ のフォーラムにアクセスしてください。
-* すべてのノートブックは GitHub からダウンロードできます。
+ディープラーニングはパターン認識に革命をもたらし、コンピュータービジョン、自然言語処理、自動音声認識などの多様な分野で幅広いテクノロジーを強化するテクノロジーを導入しました。ディープラーニングをうまく適用するには、問題をキャストする方法、モデリングの基本的な数学、モデルをデータに適合させるためのアルゴリズム、およびそれをすべて実装するためのエンジニアリング手法を理解する必要があります。この本は、散文、数学、コードを含む包括的なリソースをすべて1か所にまとめています。この本に関する質問をする (または答える) には、https://discuss.d2l.ai/ のフォーラムにアクセスしてください。すべてのノートブックは、[D2L.ai website](https://d2l.ai)および[GitHub](https://github.com/d2l-ai/d2l-en)からダウンロードできます。 
 
 ## 演習
 
-1. この書籍 [discuss.d2l.ai](https://discuss.d2l.ai/) のディスカッションフォーラムにアカウントを登録してください。
-1. Python をコンピューターにインストールします。
-1. セクションの下部にあるリンクからフォーラムにアクセスしてください。フォーラムでは、著者やより広範なコミュニティに参加することで、ヘルプを探したり、本について議論したり、質問に対する答えを見つけることができます。
+1. この本のディスカッションフォーラム [discuss.d2l.ai](https://discuss.d2l.ai/) にアカウントを登録してください。
+1. コンピューターに Python をインストールします。
+1. セクションの下部にあるリンクをたどってフォーラムにアクセスしてください。フォーラムでは、著者やより広いコミュニティと交流することで、助けを求めたり、本について議論したり、質問に対する答えを見つけることができます。
 
 :begin_tab:`mxnet`
 [Discussions](https://discuss.d2l.ai/t/18)
diff --git a/chapter_preface/index_origin.md b/chapter_preface/index_origin.md
index 30377f9..90717d7 100644
--- a/chapter_preface/index_origin.md
+++ b/chapter_preface/index_origin.md
@@ -20,7 +20,7 @@ that we focus on in this book---were
 generally regarded as outmoded.
 
 
-In just the past five years, deep learning has taken the world by surprise,
+In just the past few years, deep learning has taken the world by surprise,
 driving rapid progress in such diverse fields 
 as computer vision, natural language processing, 
 automatic speech recognition, reinforcement learning, 
@@ -91,19 +91,19 @@ Our goal in this book is to present a unified resource
 to bring would-be practitioners up to speed.
 
 When we started this book project,
-there were no resources that simultaneously
-(i) were up to date; (ii) covered the full breadth
-of modern machine learning with substantial technical depth;
-and (iii) interleaved exposition 
-of the quality one expects 
-from an engaging textbook 
+there were no resources that simultaneously 
+(i) remained up to date;
+(ii) covered the breadth of modern machine learning practices 
+with sufficient technical depth;
+and (iii) interleaved exposition of 
+the quality one expects of a textbook 
 with the clean runnable code
-that one expects to find in hands-on tutorials.
+that one expects of a hands-on tutorial.
 We found plenty of code examples for
 how to use a given deep learning framework
 (e.g., how to do basic numerical computing with matrices in TensorFlow)
 or for implementing particular techniques
-(e.g., code snippets for LeNet, AlexNet, ResNets, etc.)
+(e.g., code snippets for LeNet, AlexNet, ResNet, etc.)
 scattered across various blog posts and GitHub repositories.
 However, these examples typically focused on
 *how* to implement a given approach,
@@ -148,22 +148,18 @@ And webpages are native in HTML and JavaScript.
 Furthermore, we want the content to be
 accessible both as executable code, as a physical book,
 as a downloadable PDF, and on the Internet as a website.
-At present there exist no tools and no workflow
-perfectly suited to these demands, 
-so we had to assemble our own.
-We describe our approach in detail 
-in :numref:`sec_how_to_contribute`.
+No workflows seemed suited to these demands, 
+so we decided to assemble our own (:numref:`sec_how_to_contribute`).
 We settled on GitHub to share the source 
-and to facilitate community contributions,
-Jupyter notebooks for mixing code, equations and text,
-Sphinx as a rendering engine 
-to generate multiple outputs,
-and Discourse for the forum.
-While our system is not yet perfect,
-these choices provide a good compromise 
+and to facilitate community contributions;
+Jupyter notebooks for mixing code, equations and text;
+Sphinx as a rendering engine; 
+and Discourse as a discussion platform.
+While our system is not perfect,
+these choices strike a compromise 
 among the competing concerns.
-We believe that this might be 
-the first book published
+We believe that *Dive into Deep Learning*
+might be the first book published
 using such an integrated workflow.
 
 
@@ -181,19 +177,19 @@ precisely for its thoroughness,
 for true beginners, this property limits 
 its usefulness as an introductory text.
 
-In this book, we will teach most concepts *just in time*.
+In this book, we teach most concepts *just in time*.
 In other words, you will learn concepts at the very moment
 that they are needed to accomplish some practical end.
 While we take some time at the outset to teach
 fundamental preliminaries, like linear algebra and probability,
 we want you to taste the satisfaction of training your first model
-before worrying about more esoteric probability distributions.
+before worrying about more esoteric concepts.
 
 Aside from a few preliminary notebooks that provide a crash course
 in the basic mathematical background,
 each subsequent chapter introduces both a reasonable number of new concepts
-and provides single self-contained working examples---using real datasets.
-This presents an organizational challenge.
+and provides several self-contained working examples, using real datasets.
+This presented an organizational challenge.
 Some models might logically be grouped together in a single notebook.
 And some ideas might be best taught 
 by executing several models in succession.
@@ -203,47 +199,52 @@ This makes it as easy as possible for you to
 start your own research projects by leveraging our code.
 Just copy a notebook and start modifying it.
 
-We will interleave the runnable code with background material as needed.
-In general, we will often err on the side of making tools
-available before explaining them fully (and we will follow up by
-explaining the background later).
+Throughout, we interleave the runnable code
+with background material as needed.
+In general, we err on the side of making tools
+available before explaining them fully
+(often filling in the background later).
 For instance, we might use *stochastic gradient descent*
-before fully explaining why it is useful or why it works.
+before explaining why it is useful 
+or offering intuitions for why it works.
 This helps to give practitioners the necessary
 ammunition to solve problems quickly,
 at the expense of requiring the reader
 to trust us with some curatorial decisions.
 
-This book will teach deep learning concepts from scratch.
-Sometimes, we want to delve into fine details about the models
-that would typically be hidden from the user
-by deep learning frameworks' advanced abstractions.
+This book teaches deep learning concepts from scratch.
+Sometimes, we delve into fine details about models
+that would typically be hidden from users
+by modern deep learning frameworks.
 This comes up especially in the basic tutorials,
 where we want you to understand everything
 that happens in a given layer or optimizer.
-In these cases, we will often present two versions of the example:
+In these cases, we often present 
+two versions of the example:
 one where we implement everything from scratch,
 relying only on NumPy-like functionality
 and automatic differentiation,
-and another, more practical example,
-where we write succinct code using 
-the high-level APIs of deep learning frameworks.
-Once we have taught you how some component works,
-we can just use the high-level APIs in subsequent tutorials.
+and a more practical example,
+where we write succinct code 
+using the high-level APIs of deep learning frameworks.
+After explaining how some component works,
+we rely on the high-level API in subsequent tutorials.
 
 
 ### Content and Structure
 
-The book can be roughly divided into three parts,
-focusing on preliminaries, deep learning techniques,
-and advanced topics focused on real systems 
+The book can be divided into roughly three parts,
+focusing on preliminaries, 
+deep learning techniques,
+and advanced topics
+focused on real systems
 and applications (:numref:`fig_book_org`).
 
 ![Book structure](../img/book-org.svg)
 :label:`fig_book_org`
 
 
-* The first part covers basics and preliminaries.
+* **Part 1: Basics and Preliminaries.**
 :numref:`chap_introduction` offers 
 an introduction to deep learning.
 Then, in :numref:`chap_preliminaries`,
@@ -254,20 +255,20 @@ such as how to store and manipulate data,
 and how to apply various numerical operations 
 based on basic concepts from linear algebra, 
 calculus, and probability.
-:numref:`chap_linear` and :numref:`chap_perceptrons`
+:numref:`chap_regression` and :numref:`chap_perceptrons`
 cover the most basic concepts and techniques in deep learning,
 including regression and classification;
-linear models and multilayer perceptrons;
+linear models; multilayer perceptrons;
 and overfitting and regularization.
 
-* The next five chapters focus on modern deep learning techniques.
-:numref:`chap_computation` describes 
-the key computational components 
+* **Part 2: Modern Deep Learning Techniques.**
+:numref:`chap_computation` describes
+the key computational components
 of deep learning systems
 and lays the groundwork
 for our subsequent implementations
 of more complex models.
-Next, :numref:`chap_cnn` and :numref:`chap_modern_cnn`,
+Next, :numref:`chap_cnn` and :numref:`chap_modern_cnn`
 introduce convolutional neural networks (CNNs), 
 powerful tools that form the backbone 
 of most modern computer vision systems.
@@ -277,29 +278,30 @@ models that exploit sequential (e.g., temporal)
 structure in data and are commonly used
 for natural language processing 
 and time series prediction.
-In :numref:`chap_attention`, 
+In :numref:`chap_attention-and-transformers`, 
 we introduce a relatively new class of models
-based on so-called attention mechanisms
+based on so-called *attention mechanisms*
 that has displaced RNNs as the dominant architecture
 for most natural language processing tasks.
 These sections will bring you up to speed 
 on the most powerful and general tools
 that are widely used by deep learning practitioners.
 
-* Part three discusses scalability, efficiency, and applications.
-First, in :numref:`chap_optimization`,
+* **Part 3: Scalability, Efficiency, and Applications.**
+In :numref:`chap_optimization`,
 we discuss several common optimization algorithms
 used to train deep learning models.
-The next chapter, :numref:`chap_performance`,
-examines several key factors
+Next, in :numref:`chap_performance`,
+we examine several key factors
 that influence the computational performance 
-of your deep learning code.
-In :numref:`chap_cv`,
+of deep learning code.
+Then, in :numref:`chap_cv`,
 we illustrate major applications 
 of deep learning in computer vision.
-In :numref:`chap_nlp_pretrain` and :numref:`chap_nlp_app`,
-we show how to pretrain language representation models 
+Finally, in :numref:`chap_nlp_pretrain` and :numref:`chap_nlp_app`,
+we demonstrate how to pretrain language representation models 
 and apply them to natural language processing tasks.
+This part is available [online](https://d2l.ai).
 
 
 ### Code
@@ -311,27 +313,28 @@ via trial and error,
 tweaking the code in small ways and observing the results.
 Ideally, an elegant mathematical theory might tell us
 precisely how to tweak our code to achieve a desired result.
-However, today deep learning practitioners today
-must often tread where no cogent theory 
-can provide firm guidance. 
+However, deep learning practitioners today
+must often tread where no solid theory provides guidance. 
 Despite our best attempts, formal explanations 
 for the efficacy of various techniques are still lacking,
 both because the mathematics to characterize these models
-can be so difficult and also because 
-serious inquiry on these topics
-has only just recently kicked into high gear.
+can be so difficult,
+because the explanation likely depends on properties 
+of the data that currently lack clear definitions,
+and because serious inquiry on these topics
+has just recently kicked into high gear.
 We are hopeful that as the theory of deep learning progresses,
-future editions of this book 
-can provide insights that eclipse
-those presently available.
+each future edition of this book will provide insights 
+that eclipse those presently available.
 
-To avoid unnecessary repetition, we encapsulate 
-some of our most frequently imported and referred-to 
+To avoid unnecessary repetition, we encapsulate
+some of our most frequently imported and used
 functions and classes in the `d2l` package.
-To indicate a block of code, such as a function, 
-class, or collection of import statements,
-that will be subsequently accessed via the `d2l` package, 
-we will mark it with `#@save`. 
+Throughout, we mark blocks of code
+(such as functions, classes,
+or collections of import statements) with `#@save`
+to indicate that they will be accessed later
+via the `d2l` package.
 We offer a detailed overview 
 of these functions and classes in :numref:`sec_d2l`.
 The `d2l` package is lightweight and only requires
@@ -340,11 +343,13 @@ the following dependencies:
 ```{.python .input}
 #@tab all
 #@save
+import inspect
 import collections
 from collections import defaultdict
 from IPython import display
 import math
 from matplotlib import pyplot as plt
+from matplotlib_inline import backend_inline
 import os
 import pandas as pd
 import random
@@ -384,7 +389,7 @@ an extremely popular open-source framework
 that has been enthusiastically embraced 
 by the deep learning research community.
 All of the code in this book has passed tests 
-under the latest stable verion of PyTorch.
+under the latest stable version of PyTorch.
 However, due to the rapid development of deep learning,
 some code *in the print edition* 
 may not work properly in future versions of PyTorch.
@@ -400,7 +405,7 @@ Here is how we import modules from PyTorch.
 Most of the code in this book is based on TensorFlow,
 an open-source framework for deep learning
 that is widely adopted in industry
-and popular among reserchers.
+and popular among researchers.
 All of the code in this book has passed tests 
 under the latest stable version TensorFlow.
 However, due to the rapid development of deep learning, 
@@ -415,6 +420,7 @@ Here is how we import modules from TensorFlow.
 :end_tab:
 
 ```{.python .input}
+#@tab mxnet
 #@save
 from mxnet import autograd, context, gluon, image, init, np, npx
 from mxnet.gluon import nn, rnn
@@ -521,20 +527,40 @@ Adriaan Tijsseling, Ron Medina, Gaurav Saha, Murat Semerci, Lei Mao, Levi McClen
 jake221, jonbally, zyhazwraith, Brian Pulfer, Nick Tomasino, Lefan Zhang, Hongshen Yang, Vinney Cavallo,
 yuntai, Yuanxiang Zhu, amarazov, pasricha, Ben Greenawald, Shivam Upadhyay, Quanshangze Du, Biswajit Sahoo,
 Parthe Pandit, Ishan Kumar, HomunculusK, Lane Schwartz, varadgunjal, Jason Wiener, Armin Gholampoor,
-Shreshtha13, eigen-arnav, Hyeonggyu Kim, EmilyOng, Bálint Mucsányi, Chase DuBois.
-
-We thank Amazon Web Services, especially Swami Sivasubramanian, Peter DeSantis, Adam Selipsky and Andrew Jassy for their generous support in writing this book. 
+Shreshtha13, eigen-arnav, Hyeonggyu Kim, EmilyOng, Bálint Mucsányi, Chase DuBois, Juntian Tao,
+Wenxiang Xu, Lifu Huang, filevich, quake2005, nils-werner, Yiming Li, Marsel Khisamutdinov,
+Francesco "Fuma" Fumagalli, Peilin Sun, Vincent Gurgul, qingfengtommy, Janmey Shukla, Mo Shan,
+Kaan Sancak, regob, AlexSauer, Gopalakrishna Ramachandra, Tobias Uelwer, Chao Wang, Tian Cao,
+Nicolas Corthorn, akash5474, kxxt, zxydi1992, Jacob Britton, Shuangchi He, zhmou, krahets, Jie-Han Chen,
+Atishay Garg, Marcel Flygare, adtygan, Nik Vaessen, bolded, Louis Schlessinger, Balaji Varatharajan,
+atgctg, Kaixin Li, Victor Barbaros, Riccardo Musto, Elizabeth Ho, azimjonn, Guilherme Miotto, Alessandro Finamore,
+Joji Joseph, Anthony Biel, Zeming Zhao.
+
+We thank Amazon Web Services, especially Swami Sivasubramanian, Peter DeSantis, Adam Selipsky,
+and Andrew Jassy for their generous support in writing this book. 
 Without the available time, resources, discussions with colleagues, 
 and continuous encouragement, this book would not have happened.
 
 
 ## Summary
 
-* Deep learning has revolutionized pattern recognition, introducing technology that now powers a wide range of  technologies, including computer vision, natural language processing, automatic speech recognition.
-* To successfully apply deep learning, you must understand how to cast a problem, the mathematics of modeling, the algorithms for fitting your models to data, and the engineering techniques to implement it all.
-* This book presents a comprehensive resource, including prose, figures, mathematics, and code, all in one place.
-* To answer questions related to this book, visit our forum at https://discuss.d2l.ai/.
-* All notebooks are available for download on GitHub.
+Deep learning has revolutionized pattern recognition, 
+introducing technology that now powers a wide range of technologies,
+in such diverse fields as computer vision,
+natural language processing,
+and automatic speech recognition.
+To successfully apply deep learning, 
+you must understand how to cast a problem,
+the basic mathematics of modeling,
+the algorithms for fitting your models to data,
+and the engineering techniques to implement it all.
+This book presents a comprehensive resource, 
+including prose, figures, mathematics, and code, all in one place.
+To ask (or answer) questions related to this book,
+visit our forum at https://discuss.d2l.ai/.
+All of our notebooks are available for download
+on the [D2L.ai website](https://d2l.ai)
+and on [GitHub](https://github.com/d2l-ai/d2l-en).
 
 
 ## Exercises
diff --git a/chapter_preliminaries/autograd.md b/chapter_preliminaries/autograd.md
index b095ab7..c5ba214 100644
--- a/chapter_preliminaries/autograd.md
+++ b/chapter_preliminaries/autograd.md
@@ -1,15 +1,23 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
 # Automatic Differentiation
 :label:`sec_autograd`
 
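-As we explained in :numref:`sec_calculus`, differentiation is a crucial step in nearly all deep learning optimization algorithms. While the calculations for taking these derivatives are straightforward, requiring only some basic calculus, for complex models, working out the updates by hand can be a pain (and often error-prone).
+Recall from :numref:`sec_calculus` that calculating derivatives is the crucial step in all of the optimization algorithms that we will use to train deep networks. While the calculations are straightforward, working them out by hand can be tedious and error-prone, and this problem only grows as our models become more complex.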
 
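-Deep learning frameworks expedite this work by automatically calculating derivatives, i.e., *automatic differentiation*. In practice, based on our designed model the system builds a *computational graph*, tracking which data combined through which operations to produce the output. Automatic differentiation enables the system to subsequently backpropagate gradients. Here, *backpropagate* simply means to trace through the computational graph, filling in the partial derivatives with respect to each parameter.
+Fortunately, all modern deep learning frameworks take this work off our plates by offering *automatic differentiation* (often shortened to *autograd*). As we pass data through each successive function, the framework builds a *computational graph* that tracks how each value depends on others. To calculate derivatives, automatic differentiation packages then work backwards through this graph, applying the chain rule. The computational algorithm for applying the chain rule in this fashion is called *backpropagation*.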
 
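-## A Simple Example
+While autograd libraries have become a hot topic over the past decade, they have a long history. In fact, the earliest references to autograd date back over half a century :cite:`Wengert.1964`. The core ideas behind modern backpropagation date to a PhD thesis from 1980 :cite:`Speelpenning.1980` and were further developed in the late 1980s :cite:`Griewank.1989`. While backpropagation has become the default method for computing gradients, it is not the only option. For instance, the Julia programming language employs forward propagation :cite:`Revels.Lubin.Papamarkou.2016`. Before exploring methods, let's first master the autograd package.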
 
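-As a toy example, say that we are interested in (**differentiating the function $y = 2\mathbf{x}^{\top}\mathbf{x}$ with respect to the column vector $\mathbf{x}$.**) To start, let's create the variable `x` and assign it an initial value.
+## A Simple Function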
 
-```{.python .input}
+Let's assume that we are interested in (**differentiating the function $y = 2\mathbf{x}^{\top}\mathbf{x}$ with respect to the column vector $\mathbf{x}$.**) To start, we assign `x` an initial value.
+
+```{.python .input  n=1}
+%%tab mxnet
 from mxnet import autograd, np, npx
 npx.set_np()
 
@@ -17,8 +25,8 @@ x = np.arange(4.0)
 x
 ```
 
-```{.python .input}
-#@tab pytorch
+```{.python .input  n=7}
+%%tab pytorch
 import torch
 
 x = torch.arange(4.0)
@@ -26,16 +34,17 @@ x
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 import tensorflow as tf
 
 x = tf.range(4, dtype=tf.float32)
 x
 ```
 
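-[**Before we even calculate the gradient of $y$ with respect to $\mathbf{x}$, we need a place to store it.**] It is important that we do not allocate new memory every time we take a derivative with respect to a parameter, because we will often update the same parameters thousands or millions of times and could quickly run out of memory. Note that the gradient of a scalar-valued function with respect to a vector $\mathbf{x}$ is itself vector-valued and has the same shape as $\mathbf{x}$.
+[**Before we calculate the gradient of $y$ with respect to $\mathbf{x}$, we need a place to store it.**] In general, we avoid allocating new memory every time we take a derivative, because deep learning requires successively computing derivatives with respect to the same parameters thousands or millions of times, and we might risk running out of memory. Note that the gradient of a scalar-valued function with respect to a vector $\mathbf{x}$ is vector-valued and has the same shape as $\mathbf{x}$.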
 
-```{.python .input}
+```{.python .input  n=8}
+%%tab mxnet
 # We allocate memory for a tensor's gradient by invoking `attach_grad`
 x.attach_grad()
 # After we calculate a gradient taken with respect to `x`, we will be able to
@@ -43,145 +52,170 @@ x.attach_grad()
 x.grad
 ```
 
-```{.python .input}
-#@tab pytorch
-x.requires_grad_(True)  # Same as `x = torch.arange(4.0, requires_grad=True)`
-x.grad  # The default value is None
+```{.python .input  n=9}
+%%tab pytorch
+x.requires_grad_(True)  # Better create `x = torch.arange(4.0, requires_grad=True)`
+x.grad                  # The default value is None
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 x = tf.Variable(x)
 ```
 
-(**Now let's calculate $y$.**)
+(**We now calculate our function of `x` and assign the result to `y`.**)
 
-```{.python .input}
-# Place our code inside an `autograd.record` scope to build the computational
-# graph
+```{.python .input  n=10}
+%%tab mxnet
+# Our code is inside an `autograd.record` scope to build the computational graph
 with autograd.record():
     y = 2 * np.dot(x, x)
 y
 ```
 
-```{.python .input}
-#@tab pytorch
+```{.python .input  n=11}
+%%tab pytorch
 y = 2 * torch.dot(x, x)
 y
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 # Record all computations onto a tape
 with tf.GradientTape() as t:
     y = 2 * tf.tensordot(x, x, axes=1)
 y
 ```
 
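-Since `x` is a vector of length 4, a dot product of `x` and `x` is performed, yielding the scalar output that we assign to `y`. Next, [**we can automatically calculate the gradient of `y` with respect to each component of `x`**] by calling the function for backpropagation and printing the gradient.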
+:begin_tab:`mxnet`
+[**We can now take the gradient of `y` with respect to `x`**] by calling its `backward` method. Next, we can access the gradient via `x`'s `grad` attribute.
+:end_tab:
+
+:begin_tab:`pytorch`
+[**We can now take the gradient of `y` with respect to `x`**] by calling its `backward` method. Next, we can access the gradient via `x`'s `grad` attribute.
+:end_tab:
+
+:begin_tab:`tensorflow`
+[**We can now calculate the gradient of `y` with respect to `x`**] by calling the `gradient` function.
+:end_tab:
 
 ```{.python .input}
+%%tab mxnet
 y.backward()
 x.grad
 ```
 
-```{.python .input}
-#@tab pytorch
+```{.python .input  n=12}
+%%tab pytorch
 y.backward()
 x.grad
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 x_grad = t.gradient(y, x)
 x_grad
 ```
 
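-(**The gradient of the function $y = 2\mathbf{x}^{\top}\mathbf{x}$ with respect to $\mathbf{x}$ should be $4\mathbf{x}$.**) Let's quickly verify that our desired gradient was calculated correctly.
+(**We already know that the gradient of the function $y = 2\mathbf{x}^{\top}\mathbf{x}$ with respect to $\mathbf{x}$ should be $4\mathbf{x}$.**) We can now verify that the automatic gradient computation and the expected result are identical.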
 
-```{.python .input}
+```{.python .input  n=13}
+%%tab mxnet
 x.grad == 4 * x
 ```
 
-```{.python .input}
-#@tab pytorch
+```{.python .input  n=14}
+%%tab pytorch
 x.grad == 4 * x
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 x_grad == 4 * x
 ```
 
-[**Now let's calculate another function of `x`.**]
+:begin_tab:`mxnet`
+[**Now let's calculate another function of `x` and take its gradient.**] Note that MXNet resets the gradient buffer whenever we record a new gradient.
+:end_tab:
+
+:begin_tab:`pytorch`
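+[**Now let's calculate another function of `x` and take its gradient.**] Note that PyTorch does not automatically reset the gradient buffer when we record a new gradient. Instead, the new gradient is added to the already stored gradient. This behavior comes in handy when we want to optimize the sum of multiple objective functions. To reset the gradient buffer, we can call `x.grad.zero_()` as follows: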
+:end_tab:
+
+:begin_tab:`tensorflow`
+[**Now let's calculate another function of `x` and take its gradient.**] Note that TensorFlow resets the gradient buffer whenever we record a new gradient.
+:end_tab:
 
 ```{.python .input}
+%%tab mxnet
 with autograd.record():
     y = x.sum()
 y.backward()
 x.grad  # Overwritten by the newly calculated gradient
 ```
 
-```{.python .input}
-#@tab pytorch
-# PyTorch accumulates the gradient in default, we need to clear the previous
-# values
-x.grad.zero_()
+```{.python .input  n=20}
+%%tab pytorch
+x.grad.zero_()  # Reset the gradient
 y = x.sum()
 y.backward()
 x.grad
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 with tf.GradientTape() as t:
     y = tf.reduce_sum(x)
 t.gradient(y, x)  # Overwritten by the newly calculated gradient
 ```
 
-## Backward for Non-Scalar Variables
+## Backward for Non-Scalar Variables
+
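+When `y` is a vector, the most natural interpretation of the derivative of `y` with respect to a vector `x` is a matrix called the *Jacobian* that contains the partial derivatives of each component of `y` with respect to each component of `x`. Likewise, for higher-order `y` and `x`, the differentiation result could be an even higher-order tensor.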
+
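+While Jacobians do show up in some advanced machine learning techniques, more commonly we want to sum up the gradients of each component of `y` with respect to the full vector `x`, yielding a vector of the same shape as `x`. For example, we often have a vector representing the value of our loss function calculated separately for each example among a *batch* of training examples. Here, we just want to (**sum up the gradients computed individually for each example**).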
+
+:begin_tab:`mxnet`
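+MXNet handles this problem by reducing all tensors to scalars by summing before computing a gradient. In other words, rather than returning the Jacobian $\partial_{\mathbf{x}} \mathbf{y}$, it returns the gradient of the sum $\partial_{\mathbf{x}} \sum_i y_i$.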
+:end_tab:
 
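-Technically, when `y` is not a scalar, the most natural interpretation of the derivative of a vector `y` with respect to a vector `x` is a matrix. For higher-order and higher-dimensional `y` and `x`, the differentiation result could be a high-order tensor.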
+:begin_tab:`pytorch`
+Because deep learning frameworks vary in how they interpret gradients of non-scalar tensors, PyTorch takes some steps to avoid confusion. Invoking `backward` on a non-scalar elicits an error unless we tell PyTorch how to reduce the object to a scalar. More formally, we need to provide some vector $\mathbf{v}$ such that `backward` will compute $\mathbf{v}^\top \partial_{\mathbf{x}} \mathbf{y}$ rather than $\partial_{\mathbf{x}} \mathbf{y}$. This next part may be confusing, but for reasons that will become clear later, this argument (representing $\mathbf{v}$) is named `gradient`. For a more detailed description, see Yang Zhang's [Medium post](https://zhang-yang.medium.com/the-gradient-argument-in-pytorchs-backward-function-explained-by-examples-68f266950c29).
+:end_tab:
 
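-However, while these more exotic objects do show up in advanced machine learning (including [**in deep learning**]), more often (**when we are calling backward on a vector,**) we are trying to calculate the derivatives of the loss functions for each constituent of a *batch* of training examples. Here, (**our intent is**) not to calculate the differentiation matrix but rather (**the sum of the partial derivatives computed individually for each example**) in the batch.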
+:begin_tab:`tensorflow`
+By default, TensorFlow returns the gradient of the sum. In other words, rather than returning the Jacobian $\partial_{\mathbf{x}} \mathbf{y}$, it returns the gradient of the sum $\partial_{\mathbf{x}} \sum_i y_i$.
+:end_tab:
 
 ```{.python .input}
-# When we invoke `backward` on a vector-valued variable `y` (function of `x`),
-# a new scalar variable is created by summing the elements in `y`. Then the
-# gradient of that scalar variable with respect to `x` is computed
+%%tab mxnet
 with autograd.record():
-    y = x * x  # `y` is a vector
+    y = x * x  
 y.backward()
-x.grad  # Equals to y = sum(x * x)
+x.grad  # Equals the gradient of y = sum(x * x)
 ```
 
 ```{.python .input}
-#@tab pytorch
-# Invoking `backward` on a non-scalar requires passing in a `gradient` argument
-# which specifies the gradient of the differentiated function w.r.t `self`.
-# In our case, we simply want to sum the partial derivatives, so passing
-# in a gradient of ones is appropriate
+%%tab pytorch
 x.grad.zero_()
 y = x * x
-# y.backward(torch.ones(len(x))) equivalent to the below
-y.sum().backward()
+y.backward(gradient=torch.ones(len(y)))  # Faster: y.sum().backward()
 x.grad
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 with tf.GradientTape() as t:
     y = x * x
 t.gradient(y, x)  # Same as `y = tf.reduce_sum(x * x)`
 ```
 
-## Detaching Computation
-
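-Sometimes, we wish to [**move some calculations outside of the recorded computational graph.**] For example, say that `y` was calculated as a function of `x`, and that subsequently `z` was calculated as a function of both `y` and `x`. Now, imagine that we wanted to calculate the gradient of `z` with respect to `x`, but wanted for some reason to treat `y` as a constant, and only take into account the role that `x` played after `y` was calculated.
+## Detaching Computation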
 
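-Here, we can detach `y` to return a new variable `u` that has the same value as `y` but discards any information about how `y` was computed in the computational graph. In other words, the gradient will not flow backwards through `u` to `x`. Thus, the following backpropagation function computes the partial derivative of `z = u * x` with respect to `x` while treating `u` as a constant, instead of the partial derivative of `z = x * x * x` with respect to `x`.
+Sometimes, we wish to [**move some calculations outside of the recorded computational graph.**] For example, say that we use the input to create some auxiliary intermediate terms for which we do not want to compute a gradient. In this case, we need to *detach* the respective computational influence graph from the final result. The following toy example makes this clearer: suppose we have `z = x * y` and `y = x * x` but we want to focus on the *direct* influence of `x` on `z` rather than the influence conveyed via `y`. In this case, we can create a new variable `u` that takes the same value as `y` but whose *provenance* (how it was created) has been wiped out. Thus `u` has no ancestors in the graph and gradients do not flow through `u` to `x`. For example, taking the gradient of `z = x * u` will yield the result `u` (not `3 * x * x` as you might have expected since `z = x * x * x`).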
 
 ```{.python .input}
+%%tab mxnet
 with autograd.record():
     y = x * x
     u = y.detach()
@@ -190,8 +224,8 @@ z.backward()
 x.grad == u
 ```
 
-```{.python .input}
-#@tab pytorch
+```{.python .input  n=21}
+%%tab pytorch
 x.grad.zero_()
 y = x * x
 u = y.detach()
@@ -202,8 +236,9 @@ x.grad == u
 ```
 
 ```{.python .input}
-#@tab tensorflow
-# Set `persistent=True` to run `t.gradient` more than once
+%%tab tensorflow
+# Set `persistent=True` to preserve the compute graph. 
+# This lets us run `t.gradient` more than once
 with tf.GradientTape(persistent=True) as t:
     y = x * x
     u = tf.stop_gradient(y)
@@ -213,30 +248,32 @@ x_grad = t.gradient(z, x)
 x_grad == u
 ```
 
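-Since the computation of `y` was recorded, we can subsequently invoke backpropagation on `y` to get the derivative of `y = x * x` with respect to `x`, which is `2 * x`.
+Note that while this procedure detaches `y`'s ancestors from the graph leading to `z`, the computational graph leading to `y` persists, and thus we can calculate the gradient of `y` with respect to `x`.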
 
 ```{.python .input}
+%%tab mxnet
 y.backward()
 x.grad == 2 * x
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 x.grad.zero_()
 y.sum().backward()
 x.grad == 2 * x
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 t.gradient(y, x) == 2 * x
 ```
 
-## Computing the Gradient of Python Control Flow
+## Gradients and Python Control Flow
 
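-One benefit of using automatic differentiation is that [**even if**] building the computational graph of (**a function required passing through a maze of Python control flow**) (e.g., conditionals, loops, and arbitrary function calls), (**we can still calculate the gradient of the resulting variable.**) In the following snippet, note that the number of iterations of the `while` loop and the evaluation of the `if` statement both depend on the value of the input `a`.
+So far we reviewed cases where the path from input to output was well-defined via a function such as `z = x * x * x`. Programming offers us a lot more freedom in how we compute results. For instance, we can make them depend on auxiliary variables or condition choices on intermediate results. One benefit of using automatic differentiation is that [**even if**] building the computational graph of (**a function required passing through a maze of Python control flow**) (e.g., conditionals, loops, and arbitrary function calls), (**we can still calculate the gradient of the resulting variable.**) To illustrate this, consider the following code snippet where the number of iterations of the `while` loop and the evaluation of the `if` statement both depend on the value of the input `a`.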
 
 ```{.python .input}
+%%tab mxnet
 def f(a):
     b = a * 2
     while np.linalg.norm(b) < 1000:
@@ -249,7 +286,7 @@ def f(a):
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 def f(a):
     b = a * 2
     while b.norm() < 1000:
@@ -262,7 +299,7 @@ def f(a):
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 def f(a):
     b = a * 2
     while tf.norm(b) < 1000:
@@ -274,9 +311,10 @@ def f(a):
     return c
 ```
 
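-Let's compute the gradient.
+Below, we call this function, passing in a random value as input. Since the input is a random variable, we do not know what form the computational graph will take. However, whenever we execute `f(a)` on a specific input, we realize a specific computational graph and can subsequently run `backward`.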
 
 ```{.python .input}
+%%tab mxnet
 a = np.random.normal()
 a.attach_grad()
 with autograd.record():
@@ -285,14 +323,14 @@ d.backward()
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 a = torch.randn(size=(), requires_grad=True)
 d = f(a)
 d.backward()
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 a = tf.Variable(tf.random.normal(shape=()))
 with tf.GradientTape() as t:
     d = f(a)
@@ -300,33 +338,41 @@ d_grad = t.gradient(d, a)
 d_grad
 ```
 
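-We can now analyze the `f` function defined above. Note that it is piecewise linear in its input `a`. In other words, for any `a` there exists some constant scalar `k` such that `f(a) = k * a`, where the value of `k` depends on the input `a`. Consequently, `d / a` allows us to verify that the gradient is correct.
+Even though our function `f` is a bit contrived for demonstration purposes, its dependence on the input is quite simple: it is a *linear* function of `a` with piecewise defined scale. As such, `f(a) / a` is a vector of constant entries and, moreover, `f(a) / a` needs to match the gradient of `f(a)` with respect to `a`.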
 
 ```{.python .input}
+%%tab mxnet
 a.grad == d / a
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 a.grad == d / a
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 d_grad == d / a
 ```
 
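-## Summary
+Dynamic control flow is very common in deep learning. For instance, when processing text, the computational graph depends on the length of the input. In these cases, automatic differentiation becomes vital for statistical modeling since it is impossible to compute the gradient a priori.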
+
+## Discussion
+
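+You have now gotten a taste of the power of automatic differentiation. The development of libraries for calculating derivatives both automatically and efficiently has been a massive productivity booster for deep learning practitioners, liberating them to focus on loftier concerns. Moreover, autograd permits us to design massive models for which pen-and-paper gradient computations would be prohibitively time consuming. Interestingly, while we use autograd to *optimize* models (in a statistical sense), the *optimization* of autograd libraries themselves (in a computational sense) is a rich subject of vital interest to framework designers. Here, tools from compilers and graph manipulation are leveraged to compute results in the most expedient and memory-efficient manner.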
 
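-* Deep learning frameworks can automate the calculation of derivatives. To use it, we first attach gradients to those variables with respect to which we desire partial derivatives. We then record the computation of our target value, execute its function for backpropagation, and access the resulting gradient.
+For now, try to remember these basics: (i) attach gradients to those variables with respect to which we desire derivatives; (ii) record the computation of the target value; (iii) execute the backpropagation function; and (iv) access the resulting gradient.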
 
 ## Exercises
 
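-1. Why is the second derivative more expensive to compute than the first derivative?
-1. After running the function for backpropagation, immediately run it again and see what happens.
-1. In the control flow example where we calculate the derivative of `d` with respect to `a`, what would happen if we changed the variable `a` to a random vector or matrix? At this point, the result of the calculation `f(a)` is no longer a scalar. What happens to the result? How do we analyze this?
-1. Redesign an example of finding the gradient of the control flow. Run and analyze the result.
-1. Let $f(x) = \sin(x)$. Plot $f(x)$ and $\frac{df(x)}{dx}$, where the latter is computed without exploiting that $f'(x) = \cos(x)$.
+1. Why is the second derivative much more expensive to compute than the first derivative?
+1. After running the function for backpropagation, immediately run it again and see what happens. Why?
+1. In the control flow example where we calculate the derivative of `d` with respect to `a`, what would happen if we changed the variable `a` to a random vector or a matrix? At this point, the result of the calculation `f(a)` is no longer a scalar. What happens to the result? How do we analyze this?
+1. Let $f(x) = \sin(x)$. Plot the graph of $f$ and of its derivative $f'$. Do not exploit the fact that $f'(x) = \cos(x)$ but rather use automatic differentiation to get the result.
+1. Let $f(x) = ((\log x^2) \cdot \sin x) + x^{-1}$. Write out a dependency graph tracing results from $x$ to $f(x)$.
+1. Use the chain rule to compute the derivative $\frac{df}{dx}$ of the aforementioned function, placing each term on the dependency graph that you constructed previously.
+1. Given the graph and the intermediate derivative results, you have a number of options when computing the gradient. Evaluate the result once starting from $x$ to $f$ and once from $f$ tracing back to $x$. The path from $x$ to $f$ is commonly known as *forward differentiation*, whereas the path from $f$ to $x$ is known as backward differentiation.
+1. When might you want to use forward differentiation and when backward differentiation? Hint: consider the amount of intermediate data needed, the ability to parallelize steps, and the size of matrices and vectors involved.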
 
 :begin_tab:`mxnet`
 [Discussions](https://discuss.d2l.ai/t/34)
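
Note between patches (not part of either file): the patch above describes autograd as recording a computational graph and then walking it backwards with the chain rule, with leaf gradients accumulating until they are reset and `detach` cutting a value off from its history. The following is a minimal sketch of that idea in plain Python, assuming scalar values only. The `Var` class and all of its methods are hypothetical names invented for illustration; they are not the API of MXNet, PyTorch, or TensorFlow, and real frameworks are implemented very differently.

```python
class Var:
    """Hypothetical scalar node for a toy reverse-mode autodiff sketch."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Each entry is (parent_node, local derivative d(self)/d(parent)).
        self.parents = parents

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def detach(self):
        # Same value, but the recorded history (provenance) is dropped.
        return Var(self.value)

    def backward(self):
        # Order the graph so that parents come before the nodes built from them.
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

        visit(self)
        # Walk the graph in reverse, applying the chain rule at every node.
        upstream = {id(self): 1.0}
        for node in reversed(order):
            g = upstream.get(id(node), 0.0)
            if not node.parents:
                node.grad += g  # leaf gradients accumulate until reset
            for parent, local in node.parents:
                upstream[id(parent)] = upstream.get(id(parent), 0.0) + local * g


x = Var(3.0)

# Detached: z = u * x with u treated as a constant, so dz/dx = u = 9.
u = (x * x).detach()
(u * x).backward()
grad_detached = x.grad

# Not detached: z = x * x * x, so dz/dx = 3 * x**2 = 27.
x.grad = 0.0  # reset the buffer first
(x * x * x).backward()
grad_full = x.grad

# Without a reset, a second backward pass adds to the stored gradient.
(x + x).backward()  # d(2x)/dx = 2
grad_accumulated = x.grad

print(grad_detached, grad_full, grad_accumulated)  # 9.0 27.0 29.0
```

The accumulation in the last step mirrors the PyTorch behavior described in the patch, where the buffer must be cleared explicitly; the MXNet and TensorFlow snippets in the patch overwrite the gradient instead.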
diff --git a/chapter_preliminaries/autograd_origin.md b/chapter_preliminaries/autograd_origin.md
new file mode 100644
index 0000000..333781c
--- /dev/null
+++ b/chapter_preliminaries/autograd_origin.md
@@ -0,0 +1,590 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# Automatic Differentiation
+:label:`sec_autograd`
+
+Recall from :numref:`sec_calculus` 
+that calculating derivatives is the crucial step
+in all of the optimization algorithms
+that we will use to train deep networks.
+While the calculations are straightforward,
+working them out by hand can be tedious and error-prone, 
+and this problem only grows
+as our models become more complex.
+
+Fortunately all modern deep learning frameworks
+take this work off of our plates
+by offering *automatic differentiation*
+(often shortened to *autograd*). 
+As we pass data through each successive function,
+the framework builds a *computational graph* 
+that tracks how each value depends on others.
+To calculate derivatives, 
+automatic differentiation packages 
+then work backwards through this graph
+applying the chain rule. 
+The computational algorithm for applying the chain rule
+in this fashion is called *backpropagation*.
+
+While autograd libraries have become
+a hot topic over the past decade,
+they have a long history. 
+In fact, the earliest references to autograd
+date back over half a century :cite:`Wengert.1964`.
+The core ideas behind modern backpropagation
+date to a PhD thesis from 1980 :cite:`Speelpenning.1980`
+and were further developed in the late 1980s :cite:`Griewank.1989`.
+While backpropagation has become the default method 
+for computing gradients, it's not the only option. 
+For instance, the Julia programming language employs 
+forward propagation :cite:`Revels.Lubin.Papamarkou.2016`. 
+Before exploring methods, 
+let's first master the autograd package.
+
+
+## A Simple Function
+
+Let's assume that we are interested
+in (**differentiating the function
+$y = 2\mathbf{x}^{\top}\mathbf{x}$
+with respect to the column vector $\mathbf{x}$.**)
+To start, we assign `x` an initial value.
+
+```{.python .input  n=1}
+%%tab mxnet
+from mxnet import autograd, np, npx
+npx.set_np()
+
+x = np.arange(4.0)
+x
+```
+
+```{.python .input  n=7}
+%%tab pytorch
+import torch
+
+x = torch.arange(4.0)
+x
+```
+
+```{.python .input}
+%%tab tensorflow
+import tensorflow as tf
+
+x = tf.range(4, dtype=tf.float32)
+x
+```
+
+[**Before we calculate the gradient
+of $y$ with respect to $\mathbf{x}$,
+we need a place to store it.**]
+In general, we avoid allocating new memory
+every time we take a derivative 
+because deep learning requires 
+successively computing derivatives
+with respect to the same parameters
+thousands or millions of times,
+and we might risk running out of memory.
+Note that the gradient of a scalar-valued function
+with respect to a vector $\mathbf{x}$
+is vector-valued and has 
+the same shape as $\mathbf{x}$.
+
+```{.python .input  n=8}
+%%tab mxnet
+# We allocate memory for a tensor's gradient by invoking `attach_grad`
+x.attach_grad()
+# After we calculate a gradient taken with respect to `x`, we will be able to
+# access it via the `grad` attribute, whose values are initialized with 0s
+x.grad
+```
+
+```{.python .input  n=9}
+%%tab pytorch
+x.requires_grad_(True)  # Better create `x = torch.arange(4.0, requires_grad=True)`
+x.grad                  # The default value is None
+```
+
+```{.python .input}
+%%tab tensorflow
+x = tf.Variable(x)
+```
+
+(**We now calculate our function of `x` and assign the result to `y`.**)
+
+```{.python .input  n=10}
+%%tab mxnet
+# Our code is inside an `autograd.record` scope to build the computational graph
+with autograd.record():
+    y = 2 * np.dot(x, x)
+y
+```
+
+```{.python .input  n=11}
+%%tab pytorch
+y = 2 * torch.dot(x, x)
+y
+```
+
+```{.python .input}
+%%tab tensorflow
+# Record all computations onto a tape
+with tf.GradientTape() as t:
+    y = 2 * tf.tensordot(x, x, axes=1)
+y
+```
+
+:begin_tab:`mxnet`
+[**We can now take the gradient of `y`
+with respect to `x`**] by calling 
+its `backward` method.
+Next, we can access the gradient 
+via `x`'s `grad` attribute.
+:end_tab:
+
+:begin_tab:`pytorch`
+[**We can now take the gradient of `y`
+with respect to `x`**] by calling 
+its `backward` method.
+Next, we can access the gradient 
+via `x`'s `grad` attribute.
+:end_tab:
+
+:begin_tab:`tensorflow`
+[**We can now calculate the gradient of `y`
+with respect to `x`**] by calling 
+the `gradient` function.
+:end_tab:
+
+```{.python .input}
+%%tab mxnet
+y.backward()
+x.grad
+```
+
+```{.python .input  n=12}
+%%tab pytorch
+y.backward()
+x.grad
+```
+
+```{.python .input}
+%%tab tensorflow
+x_grad = t.gradient(y, x)
+x_grad
+```
+
+(**We already know that the gradient of the function $y = 2\mathbf{x}^{\top}\mathbf{x}$
+with respect to $\mathbf{x}$ should be $4\mathbf{x}$.**)
+We can now verify that the automatic gradient computation
+and the expected result are identical.
+
+```{.python .input  n=13}
+%%tab mxnet
+x.grad == 4 * x
+```
+
+```{.python .input  n=14}
+%%tab pytorch
+x.grad == 4 * x
+```
+
+```{.python .input}
+%%tab tensorflow
+x_grad == 4 * x
+```
+
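+As a quick check of the claim above, the expected gradient follows
+from writing the function out componentwise:
+
+$$y = 2\mathbf{x}^{\top}\mathbf{x} = 2 \sum_{i} x_i^2
+\quad\Longrightarrow\quad
+\frac{\partial y}{\partial x_j} = 4 x_j
+\quad\textrm{for every } j,
+\quad\textrm{so}\quad
+\nabla_{\mathbf{x}} y = 4\mathbf{x}.$$
+
+For the value of `x` assigned earlier, with entries $0, 1, 2, 3$,
+this gives a gradient with entries $0, 4, 8, 12$.
+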
+:begin_tab:`mxnet`
+[**Now let's calculate 
+another function of `x`
+and take its gradient.**] 
+Note that MXNet resets the gradient buffer 
+whenever we record a new gradient. 
+:end_tab:
+
+:begin_tab:`pytorch`
+[**Now let's calculate 
+another function of `x`
+and take its gradient.**]
+Note that PyTorch does not automatically 
+reset the gradient buffer 
+when we record a new gradient. 
+Instead the new gradient 
+is added to the already stored gradient.
+This behavior comes in handy
+when we want to optimize the sum 
+of multiple objective functions.
+To reset the gradient buffer,
+we can call `x.grad.zero_()` as follows:
+:end_tab:
+
+:begin_tab:`tensorflow`
+[**Now let's calculate 
+another function of `x`
+and take its gradient.**]
+Note that TensorFlow resets the gradient buffer 
+whenever we record a new gradient. 
+:end_tab:
+
+```{.python .input}
+%%tab mxnet
+with autograd.record():
+    y = x.sum()
+y.backward()
+x.grad  # Overwritten by the newly calculated gradient
+```
+
+```{.python .input  n=20}
+%%tab pytorch
+x.grad.zero_()  # Reset the gradient
+y = x.sum()
+y.backward()
+x.grad
+```
+
+```{.python .input}
+%%tab tensorflow
+with tf.GradientTape() as t:
+    y = tf.reduce_sum(x)
+t.gradient(y, x)  # Overwritten by the newly calculated gradient
+```
+
+## Backward for Non-Scalar Variables
+
+When `y` is a vector, 
+the most natural interpretation 
+of the derivative of  `y`
+with respect to a vector `x` 
+is a matrix called the *Jacobian*
+that contains the partial derivatives
+of each component of `y` 
+with respect to each component of `x`.
+Likewise, for higher-order `y` and `x`,
+the differentiation result could be an even higher-order tensor.
+
+While Jacobians do show up in some
+advanced machine learning techniques,
+more commonly we want to sum up 
+the gradients of each component of `y`
+with respect to the full vector `x`,
+yielding a vector of the same shape as `x`.
+For example, we often have a vector 
+representing the value of our loss function
+calculated separately for each example among
+a *batch* of training examples.
+Here, we just want to (**sum up the gradients
+computed individually for each example**).
+
+:begin_tab:`mxnet`
+MXNet handles this problem by reducing all tensors to scalars 
+by summing before computing a gradient. 
+In other words, rather than returning the Jacobian 
+$\partial_{\mathbf{x}} \mathbf{y}$,
+it returns the gradient of the sum
+$\partial_{\mathbf{x}} \sum_i y_i$. 
+:end_tab:
+
+:begin_tab:`pytorch`
+Because deep learning frameworks vary 
+in how they interpret gradients of
+non-scalar tensors,
+PyTorch takes some steps to avoid confusion.
+Invoking `backward` on a non-scalar elicits an error 
+unless we tell PyTorch how to reduce the object to a scalar. 
+More formally, we need to provide some vector $\mathbf{v}$ 
+such that `backward` will compute 
+$\mathbf{v}^\top \partial_{\mathbf{x}} \mathbf{y}$ 
+rather than $\partial_{\mathbf{x}} \mathbf{y}$. 
+This next part may be confusing,
+but for reasons that will become clear later, 
+this argument (representing $\mathbf{v}$) is named `gradient`. 
+For a more detailed description, see Yang Zhang's 
+[Medium post](https://zhang-yang.medium.com/the-gradient-argument-in-pytorchs-backward-function-explained-by-examples-68f266950c29). 
+:end_tab:
+
+:begin_tab:`tensorflow`
+By default, TensorFlow returns the gradient of the sum.
+In other words, rather than returning 
+the Jacobian $\partial_{\mathbf{x}} \mathbf{y}$,
+it returns the gradient of the sum
+$\partial_{\mathbf{x}} \sum_i y_i$. 
+:end_tab:
+
+```{.python .input}
+%%tab mxnet
+with autograd.record():
+    y = x * x  
+y.backward()
+x.grad  # Equals the gradient of y = sum(x * x)
+```
+
+```{.python .input}
+%%tab pytorch
+x.grad.zero_()
+y = x * x
+y.backward(gradient=torch.ones(len(y)))  # Faster: y.sum().backward()
+x.grad
+```
+
+```{.python .input}
+%%tab tensorflow
+with tf.GradientTape() as t:
+    y = x * x
+t.gradient(y, x)  # Same as `y = tf.reduce_sum(x * x)`
+```
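
The relationship between the Jacobian and the "gradient of the sum" can be checked without any framework. The following is a minimal sketch in plain Python; the helper names `jacobian` and `vjp` are hypothetical and the Jacobian is estimated by central differences, which is not how the frameworks above compute it. It only illustrates that $\mathbf{v}^\top \partial_{\mathbf{x}} \mathbf{y}$ with $\mathbf{v}$ set to all ones equals the gradient of $\sum_i y_i$, here $2\mathbf{x}$ for `y = x * x`.

```python
def f(x):
    """Elementwise square, y_i = x_i * x_i."""
    return [xi * xi for xi in x]


def jacobian(func, x, eps=1e-6):
    """Central-difference estimate of J[i][j] = d y_i / d x_j."""
    m, n = len(func(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        y_plus, y_minus = func(x_plus), func(x_minus)
        for i in range(m):
            J[i][j] = (y_plus[i] - y_minus[i]) / (2 * eps)
    return J


def vjp(v, J):
    """Vector-Jacobian product v^T J, one entry per input component."""
    return [sum(v[i] * J[i][j] for i in range(len(v)))
            for j in range(len(J[0]))]


x = [0.0, 1.0, 2.0, 3.0]
J = jacobian(f, x)
# v = all ones reduces the Jacobian to the gradient of sum(y)
grad_of_sum = vjp([1.0] * len(x), J)
print([round(g, 4) for g in grad_of_sum])
```

Choosing a different `v` weights the per-component gradients differently, which is the role played by the `gradient` argument in the PyTorch tab.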
+
+## Detaching Computation
+
+Sometimes, we wish to [**move some calculations
+outside of the recorded computational graph.**]
+For example, say that we use the input 
+to create some auxiliary intermediate terms 
+for which we do not want to compute a gradient. 
+In this case, we need to *detach* 
+the respective computational graph 
+from the final result. 
+The following toy example makes this clearer: 
+suppose we have `z = x * y` and `y = x * x` 
+but we want to focus on the *direct* influence of `x` on `z` 
+rather than the influence conveyed via `y`. 
+In this case, we can create a new variable `u`
+that takes the same value as `y` 
+but whose *provenance* (how it was created)
+has been wiped out.
+Thus `u` has no ancestors in the graph
+and gradients do not flow through `u` to `x`.
+For example, taking the gradient of `z = x * u`
+will yield the result `u`,
+(not `3 * x * x` as you might have 
+expected since `z = x * x * x`).
+
+```{.python .input}
+%%tab mxnet
+with autograd.record():
+    y = x * x
+    u = y.detach()
+    z = u * x
+z.backward()
+x.grad == u
+```
+
+```{.python .input  n=21}
+%%tab pytorch
+x.grad.zero_()
+y = x * x
+u = y.detach()
+z = u * x
+
+z.sum().backward()
+x.grad == u
+```
+
+```{.python .input}
+%%tab tensorflow
+# Set `persistent=True` to preserve the compute graph. 
+# This lets us run `t.gradient` more than once
+with tf.GradientTape(persistent=True) as t:
+    y = x * x
+    u = tf.stop_gradient(y)
+    z = u * x
+
+x_grad = t.gradient(z, x)
+x_grad == u
+```
+
+Note that while this procedure
+detaches `y`'s ancestors
+from the graph leading to `z`, 
+the computational graph leading to `y` 
+persists and thus we can calculate
+the gradient of `y` with respect to `x`.
+
+```{.python .input}
+%%tab mxnet
+y.backward()
+x.grad == 2 * x
+```
+
+```{.python .input}
+%%tab pytorch
+x.grad.zero_()
+y.sum().backward()
+x.grad == 2 * x
+```
+
+```{.python .input}
+%%tab tensorflow
+t.gradient(y, x) == 2 * x
+```
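
The effect of detaching can also be mimicked numerically. The sketch below is an illustration in plain Python under the assumption that "detaching" simply means treating the value of `y` as a constant `u`; the helper names `central_diff`, `grad_detached`, and `grad_full` are hypothetical. With `u` held fixed, the derivative of `z = u * x` is `u = x * x`; without detaching, the derivative of `x * x * x` is `3 * x * x`.

```python
def central_diff(func, x, eps=1e-6):
    """Central-difference estimate of the derivative of func at x."""
    return (func(x + eps) - func(x - eps)) / (2 * eps)


def grad_detached(x):
    """dz/dx for z = u * x, where u = x * x is frozen as a constant."""
    u = x * x  # computed once; it does not change when x is perturbed
    return central_diff(lambda t: u * t, x)


def grad_full(x):
    """dz/dx for z = (x * x) * x, where y = x * x still depends on x."""
    return central_diff(lambda t: (t * t) * t, x)


for x in [0.5, 1.0, 2.0, 3.0]:
    print(x, round(grad_detached(x), 4), round(grad_full(x), 4))
```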
+
+## Gradients and Python Control Flow
+
+So far we reviewed cases where the path from input to output 
+was well-defined via a function such as `z = x * x * x`.
+Programming offers us a lot more freedom in how we compute results. 
+For instance, we can make them depend on auxiliary variables 
+or condition choices on intermediate results. 
+One benefit of using automatic differentiation
+is that [**even if**] building the computational graph of 
+(**a function required passing through a maze of Python control flow**)
+(e.g., conditionals, loops, and arbitrary function calls),
+(**we can still calculate the gradient of the resulting variable.**)
+To illustrate this, consider the following code snippet where 
+the number of iterations of the `while` loop
+and the evaluation of the `if` statement
+both depend on the value of the input `a`.
+
+```{.python .input}
+%%tab mxnet
+def f(a):
+    b = a * 2
+    while np.linalg.norm(b) < 1000:
+        b = b * 2
+    if b.sum() > 0:
+        c = b
+    else:
+        c = 100 * b
+    return c
+```
+
+```{.python .input}
+%%tab pytorch
+def f(a):
+    b = a * 2
+    while b.norm() < 1000:
+        b = b * 2
+    if b.sum() > 0:
+        c = b
+    else:
+        c = 100 * b
+    return c
+```
+
+```{.python .input}
+%%tab tensorflow
+def f(a):
+    b = a * 2
+    while tf.norm(b) < 1000:
+        b = b * 2
+    if tf.reduce_sum(b) > 0:
+        c = b
+    else:
+        c = 100 * b
+    return c
+```
+
+Below, we call this function, passing in a random value as input.
+Since the input is a random variable, 
+we do not know what form 
+the computational graph will take.
+However, whenever we execute `f(a)` 
+on a specific input, we realize 
+a specific computational graph
+and can subsequently run `backward`.
+
+```{.python .input}
+%%tab mxnet
+a = np.random.normal()
+a.attach_grad()
+with autograd.record():
+    d = f(a)
+d.backward()
+```
+
+```{.python .input}
+%%tab pytorch
+a = torch.randn(size=(), requires_grad=True)
+d = f(a)
+d.backward()
+```
+
+```{.python .input}
+%%tab tensorflow
+a = tf.Variable(tf.random.normal(shape=()))
+with tf.GradientTape() as t:
+    d = f(a)
+d_grad = t.gradient(d, a)
+d_grad
+```
+
+Even though our function `f` is a bit 
+contrived for demonstration purposes,
+its dependence on the input is quite simple: 
+it is a *linear* function of `a` 
+with piecewise defined scale. 
+As such, `f(a) / a` is a vector of constant entries 
+and, moreover, `f(a) / a` needs to match 
+the gradient of `f(a)` with respect to `a`.
+
+```{.python .input}
+%%tab mxnet
+a.grad == d / a
+```
+
+```{.python .input}
+%%tab pytorch
+a.grad == d / a
+```
+
+```{.python .input}
+%%tab tensorflow
+d_grad == d / a
+```
+
+Dynamic control flow is very common in deep learning. 
+For instance, when processing text, the computational graph
+depends on the length of the input. 
+In these cases, automatic differentiation 
+becomes vital for statistical modeling 
+since it is impossible to compute the gradient a priori. 
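
As a framework-free sanity check of the claim that `f(a) / a` matches the gradient, here is a minimal sketch in plain Python. It assumes a scalar, nonzero input (for `a = 0` the `while` loop would never terminate), uses a few fixed inputs instead of a random draw so that the run is reproducible, and estimates the slope with a central difference in place of automatic differentiation.

```python
import math


def f(a):
    """Scalar version of the control-flow example: piecewise linear in a."""
    b = a * 2
    while abs(b) < 1000:
        b = b * 2
    if b > 0:
        c = b
    else:
        c = 100 * b
    return c


def central_diff(func, x, eps=1e-6):
    """Central-difference estimate of the derivative of func at x."""
    return (func(x + eps) - func(x - eps)) / (2 * eps)


for a in [0.3, -1.7, 42.0]:
    slope = central_diff(f, a)
    # For a piecewise linear function through the origin, slope == f(a) / a
    print(a, math.isclose(slope, f(a) / a, rel_tol=1e-6))
```

Each input takes a different path through the loop and the branch, yet on every path the result is a constant multiple of `a`, which is why the ratio and the slope agree.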
+
+
+## Discussion
+
+You've now gotten a taste of the power of automatic differentiation. 
+The development of libraries for calculating derivatives
+both automatically and efficiently 
+has been a massive productivity booster
+for deep learning practitioners,
+liberating them to focus on loftier concerns.
+Moreover, autograd permits us to design massive models
+for which pen and paper gradient computations 
+would be prohibitively time consuming.
+Interestingly, while we use autograd to *optimize* models
+(in a statistical sense)
+the *optimization* of autograd libraries themselves
+(in a computational sense)
+is a rich subject
+of vital interest to framework designers.
+Here, tools from compilers and graph manipulation 
+are leveraged to compute results 
+in the most expedient and memory-efficient manner. 
+
+For now, try to remember these basics: (i) attach gradients to those variables with respect to which we desire derivatives; (ii) record the computation of the target value; (iii) execute the backpropagation function; and  (iv) access the resulting gradient.
+
+
+## Exercises
+
+1. Why is the second derivative much more expensive to compute than the first derivative?
+1. After running the function for backpropagation, immediately run it again and see what happens. Why?
+1. In the control flow example where we calculate the derivative of `d` with respect to `a`, what would happen if we changed the variable `a` to a random vector or a matrix? At this point, the result of the calculation `f(a)` is no longer a scalar. What happens to the result? How do we analyze this?
+1. Let $f(x) = \sin(x)$. Plot the graph of $f$ and of its derivative $f'$. Do not exploit the fact that $f'(x) = \cos(x)$ but rather use automatic differentiation to get the result. 
+1. Let $f(x) = ((\log x^2) \cdot \sin x) + x^{-1}$. Write out a dependency graph tracing results from $x$ to $f(x)$. 
+1. Use the chain rule to compute the derivative $\frac{df}{dx}$ of the aforementioned function, placing each term on the dependency graph that you constructed previously. 
+1. Given the graph and the intermediate derivative results, you have a number of options when computing the gradient. Evaluate the result once starting from $x$ to $f$ and once from $f$ tracing back to $x$. The path from $x$ to $f$ is commonly known as *forward differentiation*, whereas the path from $f$ to $x$ is known as *backward differentiation*. 
+1. When might you want to use forward differentiation and when backward differentiation? Hint: consider the amount of intermediate data needed, the ability to parallelize steps, and the size of matrices and vectors involved. 
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/34)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/35)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/200)
+:end_tab:
diff --git a/chapter_preliminaries/calculus.md b/chapter_preliminaries/calculus.md
index e043d03..1996be3 100644
--- a/chapter_preliminaries/calculus.md
+++ b/chapter_preliminaries/calculus.md
@@ -1,33 +1,37 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
 # Calculus
 :label:`sec_calculus`
 
-多角形の面積を見つけることは、少なくとも2500年前、古代ギリシア人が多角形を三角形に分割して面積を合計するまで不思議なままでした。円などの湾曲した形状の領域を見つけるために、古代ギリシア人はそのような形状のポリゴンを内接しました。:numref:`fig_circle_area` に示すように、辺の長さが等しい内接多角形は、円の近似がよくなります。このプロセスは「枯渇方法」とも呼ばれています。 
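+For a long time, how to calculate the area of a circle remained a mystery. Then, the ancient Greek mathematician Archimedes came up with the clever idea of inscribing a series of polygons with increasing numbers of vertices inside a circle (:numref:`fig_circle_area`). For a polygon with $n$ vertices, we obtain $n$ triangles. The height of each triangle approaches the radius $r$ as we partition the circle more finely. At the same time, its base approaches $2 \pi r/n$, since the ratio between arc and secant approaches 1 for a large number of vertices. Thus, the total area of the triangles approaches $n \cdot r \cdot \frac{1}{2} (2 \pi r/n) = \pi r^2$.  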
 
-![Find the area of a circle with the method of exhaustion.](../img/polygon-circle.svg)
+![Finding the area of a circle as a limit procedure.](../img/polygon-circle.svg)
 :label:`fig_circle_area`
 
-実際、枯渇法は*積分計算* (:numref:`sec_integral_calculus` で説明される) の由来です。2000年以上経った今後、微積分学のもうひとつの分野、*微分積分* が発明されました。微分積分の最も重要な応用例の中でも、最適化問題では「最良」なことをどう行うかが考慮されます。:numref:`subsec_norms_and_objectives` で説明したように、このような問題は深層学習では広く見られます。 
-
-ディープラーニングでは、モデルを「トレーニング」し、連続的に更新することで、見るデータが増えていくにつれてモデルがどんどん良くなるようにします。通常、より良くなるということは、「私たちのモデルがどれほど悪い*？」という質問に答えるスコアである*損失関数*を最小化することを意味します。この質問は見た目よりも微妙です。最終的に、私たちが本当に気にかけているのは、これまでに見たことのないデータに対して優れたパフォーマンスを発揮するモデルを作成することです。しかし、実際に見ることができるデータにしかモデルをあてはめられません。したがって、モデルをフィッティングするタスクを、(i) *最適化*: 観測されたデータにモデルをフィッティングするプロセス、(ii) *一般化*: 正確なデータセットを超える妥当性を持つモデルの作成方法を導く数学的原理と実践者の知恵に分解できます。彼らを訓練するのに使われた例。 
-
-後の章で最適化の問題と手法を理解しやすくするために、ここではディープラーニングで一般的に使用される微分計算について簡単に説明します。 
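+This limiting procedure leads to both 
+*differential calculus* and *integral calculus* 
+(:numref:`sec_integral_calculus`). The former can tell us how to increase or decrease a function value by manipulating its arguments. This comes in handy for the *optimization problems* that we face in deep learning, where we repeatedly update our parameters in order to decrease the loss function. Optimization addresses how to fit our models to training data, and calculus is its key prerequisite. However, don't forget that our ultimate goal is to perform well on *previously unseen* data. That problem is called *generalization* and will be a key focus of other chapters. 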
 
-## デリバティブと微分
+## Derivatives and Differentiation
 
-まず、ほとんどすべてのディープラーニング最適化アルゴリズムにおいて重要なステップである微分の計算に取り組みます。ディープラーニングでは、通常、モデルのパラメーターに関して微分可能な損失関数を選択します。簡単に言うと、各パラメータについて、そのパラメータを極小の「増加」または「減少」した場合に、損失がどれだけ急速に増減するかを判断できるということです。 
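+Put simply, a *derivative* is the rate of change in a function with respect to changes in its arguments. Derivatives can tell us how rapidly a loss function would increase or decrease were we to *increase* or *decrease* each parameter by an infinitesimally small amount. Formally, for functions $f: \mathbb{R} \rightarrow \mathbb{R}$ that map from scalars to scalars, [**the *derivative* of $f$ at a point $x$ is defined as**] 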
 
-入力と出力の両方がスカラーである関数 $f: \mathbb{R} \rightarrow \mathbb{R}$ があるとします。[**$f$ の*微分* は次のように定義されます**] 
+(**$$f'(x) = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h}.$$**)
+:eqlabel:`eq_derivative`
 
-(** $f'(x) = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h},$ドル**) :eqlabel:`eq_derivative` 
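+The term on the right-hand side is called a *limit*, and it tells us what happens to the value of an expression as a specified variable approaches a particular value. This limit tells us what the ratio between the change in the function value $f(x + h) - f(x)$ and the perturbation $h$ converges to as we shrink $h$ to zero. 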
 
-この制限が存在する場合。$f'(a)$ が存在する場合、$a$ では $f$ は*微分可能* であると言われます。$f$ が区間の数ごとに微分可能である場合、この関数はこの区間で微分可能です。:eqref:`eq_derivative` の導関数 $f'(x)$ は、$x$ に対する $f(x)$ の*瞬間的な*変化率として解釈できます。いわゆる瞬時変化率は、$x$ の $h$ の変動 $h$ に基づいており、$0$ に近づいています。 
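+When $f'(x)$ exists, $f$ is said to be *differentiable* at $x$; and when $f'(x)$ exists for all $x$ on a set, we say that $f$ is differentiable on this set. Not all functions are differentiable, including many that we wish to optimize, such as accuracy and the area under the receiver operating characteristic curve (AUC). However, because computing the derivative of the loss is a crucial step in nearly all algorithms for training deep neural networks, we often optimize a differentiable *surrogate* instead. 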
 
-導関数を説明するために、例を挙げて実験してみましょう。(** $u = f(x) = 3x^2-4x$ を定義してください**)
+We can interpret the derivative $f'(x)$ as the *instantaneous* rate of change of $f(x)$ with respect to $x$. Let's develop some intuition with an example. (**Define $u = f(x) = 3x^2-4x$.**)
 
 ```{.python .input}
+%%tab mxnet
 %matplotlib inline
 from d2l import mxnet as d2l
-from IPython import display
+from matplotlib_inline import backend_inline
 from mxnet import np, npx
 npx.set_np()
 
@@ -36,10 +40,10 @@ def f(x):
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 %matplotlib inline
 from d2l import torch as d2l
-from IPython import display
+from matplotlib_inline import backend_inline
 import numpy as np
 
 def f(x):
@@ -47,197 +51,172 @@ def f(x):
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 %matplotlib inline
 from d2l import tensorflow as d2l
-from IPython import display
+from matplotlib_inline import backend_inline
 import numpy as np
 
 def f(x):
     return 3 * x ** 2 - 4 * x
 ```
 
-[**$x=1$ を設定し $h$ を $0$ に近づけると $\frac{f(x+h) - f(x)}{h}$ の数値結果**] :eqref:`eq_derivative` (** $2$ に近づく**) この実験は数学的な証明ではありませんが、$x=1$ のときに導関数 $u'$ が $2$ であることが後でわかります。
+[**Setting $x=1$, $\frac{f(x+h) - f(x)}{h}$**] (**approaches $2$ as $h$ approaches $0$.**) While this experiment lacks the rigor of a mathematical proof, we will soon see that indeed $f'(1) = 2$.
 
 ```{.python .input}
-#@tab all
-def numerical_lim(f, x, h):
-    return (f(x + h) - f(x)) / h
-
-h = 0.1
-for i in range(5):
-    print(f'h={h:.5f}, numerical limit={numerical_lim(f, 1, h):.5f}')
-    h *= 0.1
+%%tab all
+for h in 10.0**np.arange(-1, -6, -1):
+    print(f'h={h:.5f}, numerical limit={(f(1+h)-f(1))/h:.5f}')
 ```
 
-デリバティブの同等の表記法をいくつか理解しておきましょう。$y = f(x)$ を指定すると、$x$ と $y$ はそれぞれ関数 $f$ の独立変数と従属変数です。次の式は同等です。 
+There are several equivalent notational conventions for derivatives. Given $y = f(x)$, the following expressions are equivalent: 
 
 $$f'(x) = y' = \frac{dy}{dx} = \frac{df}{dx} = \frac{d}{dx} f(x) = Df(x) = D_x f(x),$$
 
-シンボル $\frac{d}{dx}$ と $D$ は、*微分* の演算を示す*微分演算子* です。一般的な機能を区別するために、次のルールを使用できます。 
+where the symbols $\frac{d}{dx}$ and $D$ are *differentiation operators*. Below, we present the derivatives of some common functions: 
 
-* $DC = 0$ ($C$ は定数です)
-* $Dx^n = nx^{n-1}$ (*べき乗則*、$n$ は任意の実数です)
-* $De^x = e^x$,
-* $D\ln(x) = 1/x.$
+$$\begin{aligned} \frac{d}{dx} C & = 0 && \text{for any constant $C$} \\ \frac{d}{dx} x^n & = n x^{n-1} && \text{for } n \neq 0 \\ \frac{d}{dx} e^x & = e^x \\ \frac{d}{dx} \ln x & = x^{-1} \end{aligned}$$
 
-上記の共通関数のようないくつかのより単純な関数から形成される関数を区別するために、以下のルールが役に立ちます。関数 $f$ と $g$ が両方とも微分可能で、$C$ が定数であると仮定すると、*定数の倍数規則* があります。 
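+Functions composed from differentiable functions are often themselves differentiable. The following rules come in handy for working with compositions of any differentiable functions $f$ and $g$, and constant $C$. 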
 
-$$\frac{d}{dx} [Cf(x)] = C \frac{d}{dx} f(x),$$
+$$\begin{aligned} \frac{d}{dx} [C f(x)] & = C \frac{d}{dx} f(x) && \text{Constant multiple rule} \\ \frac{d}{dx} [f(x) + g(x)] & = \frac{d}{dx} f(x) + \frac{d}{dx} g(x) && \text{Sum rule} \\ \frac{d}{dx} [f(x) g(x)] & = f(x) \frac{d}{dx} g(x) + g(x) \frac{d}{dx} f(x) && \text{Product rule} \\ \frac{d}{dx} \frac{f(x)}{g(x)} & = \frac{g(x) \frac{d}{dx} f(x) - f(x) \frac{d}{dx} g(x)}{g^2(x)} && \text{Quotient rule} \end{aligned}$$
 
-*sumルール* 
+Using these rules, we can find the derivative of $3 x^2 - 4x$ via 
 
-$$\frac{d}{dx} [f(x) + g(x)] = \frac{d}{dx} f(x) + \frac{d}{dx} g(x),$$
+$$\frac{d}{dx} [3 x^2 - 4x] = 3 \frac{d}{dx} x^2 - 4 \frac{d}{dx} x = 6x - 4.$$
 
-*製品ルール* 
+Plugging in $x = 1$ shows that, indeed, the derivative is $2$ at this location. Note that derivatives tell us the *slope* of a function at a particular location.   
 
-$$\frac{d}{dx} [f(x)g(x)] = f(x) \frac{d}{dx} [g(x)] + g(x) \frac{d}{dx} [f(x)],$$
+## Visualization Utilities
 
-そして*商の法則* 
-
-$$\frac{d}{dx} \left[\frac{f(x)}{g(x)}\right] = \frac{g(x) \frac{d}{dx} [f(x)] - f(x) \frac{d}{dx} [g(x)]}{[g(x)]^2}.$$
-
-これで $u' = f'(x) = 3 \frac{d}{dx} x^2-4\frac{d}{dx}x = 6x-4$ を見つけるために、上記の規則のいくつかを適用できます。したがって、$x = 1$ を設定すると $u' = 2$ が得られます。これは、このセクションの以前の実験でサポートされており、数値結果は $2$ に近づきます。この微分は、$x = 1$ のときの曲線 $u = f(x)$ に対する接線の傾きでもあります。 
-
-[**このような微分の解釈を視覚化するために、Python でよく使われるプロットライブラリである `matplotlib`, **] を使います。`matplotlib` で生成される Figure のプロパティを設定するには、いくつかの関数を定義する必要があります。次の例では、`use_svg_display` 関数は `matplotlib` パッケージを指定して、より鮮明なイメージのために svg Figure を出力します。コメント `# @save `は、以下の関数、クラス、文を `d2l` パッケージに保存する特別なマークなので、あとで再定義することなく直接呼び出せる (`d2l.use_svg_display()` など) ことができます。
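+[**We can visualize the slopes of functions using the `matplotlib` library**]. We need to define a few functions. As its name indicates, `use_svg_display` tells `matplotlib` to output graphics in SVG format for crisper images. The comment `#@save` is a special modifier that allows us to save any function, class, or other code block to the `d2l` package so that we can invoke it later without repeating the code, e.g., via `d2l.use_svg_display()`.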
 
 ```{.python .input}
-#@tab all
+%%tab all
 def use_svg_display():  #@save
     """Use the svg format to display a plot in Jupyter."""
-    display.set_matplotlib_formats('svg')
+    backend_inline.set_matplotlib_formats('svg')
 ```
 
-`set_figsize` 関数を定義して Figure のサイズを指定します。ここでは `d2l.plt` を直接使用することに注意してください。これは、インポートステートメント `from matplotlib import pyplot as plt` が、序文で `d2l` パッケージに保存されるようマークされているためです。
+Conveniently, we can set figure sizes with `set_figsize`. Since the import statement `from matplotlib import pyplot as plt` was marked via `#@save` in the `d2l` package, we can call `d2l.plt`.
 
 ```{.python .input}
-#@tab all
+%%tab all
 def set_figsize(figsize=(3.5, 2.5)):  #@save
     """Set the figure size for matplotlib."""
     use_svg_display()
     d2l.plt.rcParams['figure.figsize'] = figsize
 ```
 
-次の `set_axes` 関数は `matplotlib` によって生成される図形座標軸のプロパティを設定します。
+The `set_axes` function can associate axes with properties, including labels, ranges, and scales.
 
 ```{.python .input}
-#@tab all
+%%tab all
 #@save
 def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):
     """Set the axes for matplotlib."""
-    axes.set_xlabel(xlabel)
-    axes.set_ylabel(ylabel)
-    axes.set_xscale(xscale)
-    axes.set_yscale(yscale)
-    axes.set_xlim(xlim)
-    axes.set_ylim(ylim)
+    axes.set_xlabel(xlabel), axes.set_ylabel(ylabel)
+    axes.set_xscale(xscale), axes.set_yscale(yscale)
+    axes.set_xlim(xlim),     axes.set_ylim(ylim)
     if legend:
         axes.legend(legend)
     axes.grid()
 ```
 
-Figure コンフィギュレーション用のこれら 3 つの関数を使用して、本書全体で多くの曲線を視覚化する必要があるため、複数の曲線を簡潔にプロットする関数 `plot` を定義します。
+With these three functions, we can define a `plot` function to overlay multiple curves. Much of the code here is just ensuring that the sizes and shapes of inputs match.
 
 ```{.python .input}
-#@tab all
+%%tab all
 #@save
-def plot(X, Y=None, xlabel=None, ylabel=None, legend=None, xlim=None,
+def plot(X, Y=None, xlabel=None, ylabel=None, legend=[], xlim=None,
          ylim=None, xscale='linear', yscale='linear',
          fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5), axes=None):
     """Plot data points."""
-    if legend is None:
-        legend = []
-
-    set_figsize(figsize)
-    axes = axes if axes else d2l.plt.gca()
 
-    # Return True if `X` (tensor or list) has 1 axis
-    def has_one_axis(X):
+    def has_one_axis(X):  # True if `X` (tensor or list) has 1 axis
         return (hasattr(X, "ndim") and X.ndim == 1 or isinstance(X, list)
                 and not hasattr(X[0], "__len__"))
-
-    if has_one_axis(X):
-        X = [X]
+    
+    if has_one_axis(X): X = [X]
     if Y is None:
         X, Y = [[]] * len(X), X
     elif has_one_axis(Y):
         Y = [Y]
     if len(X) != len(Y):
         X = X * len(Y)
+        
+    set_figsize(figsize)
+    if axes is None: axes = d2l.plt.gca()
     axes.cla()
     for x, y, fmt in zip(X, Y, fmts):
-        if len(x):
-            axes.plot(x, y, fmt)
-        else:
-            axes.plot(y, fmt)
+        axes.plot(x,y,fmt) if len(x) else axes.plot(y,fmt)
     set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
 ```
 
-これで、[**関数 $u = f(x)$ とその接線 $y = 2x - 3$ を $x=1$ にプロット**] できます。ここで、係数 $2$ は接線の傾きです。
+Now we can [**plot the function $u = f(x)$ and its tangent line $y = 2x - 3$ at $x=1$**], where the coefficient $2$ is the slope of the tangent line.
 
 ```{.python .input}
-#@tab all
+%%tab all
 x = np.arange(0, 3, 0.1)
 plot(x, [f(x), 2 * x - 3], 'x', 'f(x)', legend=['f(x)', 'Tangent line (x=1)'])
 ```
 
-## 偏微分
+## Partial Derivatives and Gradients
+:label:`subsec_calculus-grad`
 
-これまで、1つの変数の関数の微分を扱ってきました。ディープラーニングでは、関数は多くの場合、*多数* 個の変数に依存しています。したがって、微分の概念をこれらの「多変量関数」にまで拡張する必要があります。 
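+Thus far, we have been differentiating functions of just one variable. In deep learning, we also need to work with functions of *many* variables. We briefly introduce notions of the derivative that apply to such *multivariate* functions. 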
 
-$y = f(x_1, x_2, \ldots, x_n)$ を $n$ 個の変数をもつ関数とします。$i^\mathrm{th}$ パラメーター $x_i$ に対する $y$ の *偏微分* は次のようになります。 
+Let $y = f(x_1, x_2, \ldots, x_n)$ be a function with $n$ variables. The *partial derivative* of $y$ with respect to its $i^\mathrm{th}$ parameter $x_i$ is 
 
 $$ \frac{\partial y}{\partial x_i} = \lim_{h \rightarrow 0} \frac{f(x_1, \ldots, x_{i-1}, x_i+h, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}.$$
 
-$\frac{\partial y}{\partial x_i}$ を計算するには、$x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$ を定数として扱い、$x_i$ に対する $y$ の導関数を計算します。偏微分の表記法では、以下は同等です。 
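+To calculate $\frac{\partial y}{\partial x_i}$, we can treat $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$ as constants and calculate the derivative of $y$ with respect to $x_i$. The following notational conventions for partial derivatives are all common and all mean the same thing: 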
 
-$$\frac{\partial y}{\partial x_i} = \frac{\partial f}{\partial x_i} = f_{x_i} = f_i = D_i f = D_{x_i} f.$$
+$$\frac{\partial y}{\partial x_i} = \frac{\partial f}{\partial x_i} = \partial_{x_i} f = \partial_i f = f_{x_i} = f_i = D_i f = D_{x_i} f.$$
 
-## グラデーション
-:label:`subsec_calculus-grad`
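+We can concatenate partial derivatives of a multivariate function with respect to all its variables to obtain a vector that is called the *gradient* of the function. Suppose that the input of function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is an $n$-dimensional vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]^\top$ and the output is a scalar. The gradient of the function $f$ with respect to $\mathbf{x}$ is a vector of $n$ partial derivatives: 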
 
-多変量関数の偏導関数をそのすべての変数に対して連結して、関数の*gradient* ベクトルを求めることができます。関数 $f: \mathbb{R}^n \rightarrow \mathbb{R}$ の入力が $n$ 次元のベクトル $\mathbf{x} = [x_1, x_2, \ldots, x_n]^\top$ で、出力がスカラーであるとします。$\mathbf{x}$ に対する関数 $f(\mathbf{x})$ の勾配は $n$ 偏微分のベクトルです。 
+$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \left[\partial_{x_1} f(\mathbf{x}), \partial_{x_2} f(\mathbf{x}), \ldots,
+\partial_{x_n} f(\mathbf{x})\right]^\top.$$ 
 
-$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \bigg[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_n}\bigg]^\top,$$
+When there is no ambiguity, $\nabla_{\mathbf{x}} f(\mathbf{x})$ is typically replaced by $\nabla f(\mathbf{x})$. The following rules come in handy for differentiating multivariate functions: 
 
-$\nabla_{\mathbf{x}} f(\mathbf{x})$ は、あいまいさがなければ $\nabla f(\mathbf{x})$ に置き換えられることがよくあります。 
+* For all $\mathbf{A} \in \mathbb{R}^{m \times n}$ we have $\nabla_{\mathbf{x}} \mathbf{A} \mathbf{x} = \mathbf{A}^\top$ and $\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A}  = \mathbf{A}$.
+* For square matrices $\mathbf{A} \in \mathbb{R}^{n \times n}$ we have that $\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} \mathbf{x}  = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$ and in particular
+$\nabla_{\mathbf{x}} \|\mathbf{x} \|^2 = \nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{x} = 2\mathbf{x}$. 
 
-$\mathbf{x}$ を $n$ 次元のベクトルとすると、多変量関数を微分するときには次の規則がよく使われます。 
-
-* すべての$\mathbf{A} \in \mathbb{R}^{m \times n}$、$\nabla_{\mathbf{x}} \mathbf{A} \mathbf{x} = \mathbf{A}^\top$、
-* すべての$\mathbf{A} \in \mathbb{R}^{n \times m}$、$\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A}  = \mathbf{A}$、
-* すべての$\mathbf{A} \in \mathbb{R}^{n \times n}$、$\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} \mathbf{x}  = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$、
-* $\nabla_{\mathbf{x}} \|\mathbf{x} \|^2 = \nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{x} = 2\mathbf{x}$。
-
-同様に、マトリックス $\mathbf{X}$ については $\nabla_{\mathbf{X}} \|\mathbf{X} \|_F^2 = 2\mathbf{X}$ があります。後で説明するように、勾配はディープラーニングにおける最適化アルゴリズムの設計に役立ちます。 
+Similarly, for any matrix $\mathbf{X}$, we have $\nabla_{\mathbf{X}} \|\mathbf{X} \|_F^2 = 2\mathbf{X}$.  
 
 ## Chain Rule
 
-しかし、そのようなグラデーションは見つけにくい場合があります。これは、ディープラーニングの多変量関数は*合成*であることが多いため、これらの関数を区別するために前述のルールを適用しない可能性があるためです。幸いなことに、*chainルール*によって複合関数を区別することができます。 
-
-まず、単一変数の関数について考えてみましょう。関数 $y=f(u)$ と $u=g(x)$ がどちらも微分可能であると仮定すると、連鎖規則は次のようになります。 
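+In deep learning, the gradients of concern are often difficult to calculate because we are working with deeply nested functions (of functions (of functions...)). Fortunately, the *chain rule* takes care of this. Returning to functions of a single variable, suppose that $y = f(g(x))$ and that the underlying functions $y=f(u)$ and $u=g(x)$ are both differentiable. The chain rule states that  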
 
 $$\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}.$$
 
-ここで、関数が任意の数の変数をもつ、より一般的なシナリオに注目しましょう。微分可能関数 $y$ に変数 $u_1, u_2, \ldots, u_m$ があり、各微分可能関数 $u_i$ には変数 $x_1, x_2, \ldots, x_n$ があるとします。$y$ は $x_1, x_2, \ldots, x_n$ の関数であることに注意してください。そして、連鎖規則は次のようになります。 
+Turning back to multivariate functions, suppose that $y = f(\mathbf{u})$ has variables $u_1, u_2, \ldots, u_m$, where each $u_i = g_i(\mathbf{x})$ has variables $x_1, x_2, \ldots, x_n$, i.e., $\mathbf{u} = g(\mathbf{x})$. Then the chain rule states that 
+
+$$\frac{\partial y}{\partial x_{i}} = \frac{\partial y}{\partial u_{1}} \frac{\partial u_{1}}{\partial x_{i}} + \frac{\partial y}{\partial u_{2}} \frac{\partial u_{2}}{\partial x_{i}} + \ldots + \frac{\partial y}{\partial u_{m}} \frac{\partial u_{m}}{\partial x_{i}} \text{ and thus } \nabla_{\mathbf{x}} y =  \mathbf{A} \nabla_{\mathbf{u}} y,$$
 
-$$\frac{dy}{dx_i} = \frac{dy}{du_1} \frac{du_1}{dx_i} + \frac{dy}{du_2} \frac{du_2}{dx_i} + \cdots + \frac{dy}{du_m} \frac{du_m}{dx_i}$$
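+where $\mathbf{A} \in \mathbb{R}^{n \times m}$ is a *matrix* that contains the derivative of vector $\mathbf{u}$ with respect to vector $\mathbf{x}$, with entries $A_{ij} = \partial u_j / \partial x_i$. Thus, evaluating the gradient requires computing a vector-matrix product. This is one of the key reasons why linear algebra is such an integral building block in building deep learning systems.  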
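
To make the matrix form of the chain rule concrete, the following sketch checks $\nabla_{\mathbf{x}} y = \mathbf{A} \nabla_{\mathbf{u}} y$ on a small example in plain Python. The functions `g` and `f`, the hand-derived `A` and `grad_u`, and the helpers `matvec` and `numerical_grad` are all hypothetical choices made for illustration, with $n = 2$ inputs and $m = 3$ intermediate variables; the comparison against central differences is only a sanity check.

```python
def g(x):
    """u = g(x): maps n = 2 inputs to m = 3 intermediate variables."""
    x1, x2 = x
    return [x1 * x2, x1 + x2, x1 ** 2]


def f(u):
    """y = f(u): maps the m = 3 intermediate variables to a scalar."""
    u1, u2, u3 = u
    return u1 * u2 + u3


def grad_u(u):
    """Gradient of y with respect to u, derived by hand."""
    u1, u2, u3 = u
    return [u2, u1, 1.0]


def A(x):
    """A[i][j] = d u_j / d x_i, an n x m = 2 x 3 matrix derived by hand."""
    x1, x2 = x
    return [[x2, 1.0, 2 * x1],
            [x1, 1.0, 0.0]]


def matvec(M, v):
    """Matrix-vector product M v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def numerical_grad(func, x, eps=1e-6):
    """Central-difference estimate of the gradient of a scalar function."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((func(x_plus) - func(x_minus)) / (2 * eps))
    return grad


x = [1.5, -0.5]
chain = matvec(A(x), grad_u(g(x)))               # A times grad_u y
numeric = numerical_grad(lambda t: f(g(t)), x)   # direct estimate
print(chain, [round(v, 4) for v in numeric])
```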
 
-どんな$i = 1, 2, \ldots, n$にも合います。 
+## Discussion
 
-## [概要
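+While we have just scratched the surface of a deep topic, a number of concepts already come into focus: first, the composition rules for differentiation can be applied routinely, enabling us to compute gradients *automatically*. This task requires no creativity and thus we can focus our cognitive powers elsewhere. Second, computing the derivatives of vector-valued functions requires us to multiply matrices as we trace the dependency graph of variables from output to input. In particular, this graph is traversed in a *forward* direction when we evaluate a function and in a *backwards* direction when we compute gradients. Later chapters will formally introduce backpropagation, a computational procedure for applying the chain rule. 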
 
-* 微分積分学と積分微積分は微積分学の2つの分岐であり、前者は深層学習におけるユビキタス最適化問題に適用できます。
-* 微分は、その変数に対する関数の瞬間的な変化率として解釈できます。これは、関数の曲線に対する接線の傾きでもあります。
-* 勾配は、そのすべての変数に対する多変量関数の偏導関数を成分とするベクトルです。
-* 連鎖則により、複合関数を区別することができます。
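+From the viewpoint of optimization, gradients allow us to determine how to move the parameters of a model in order to lower the loss, and each step of the optimization algorithms used throughout this book will require calculating the gradient. 
 
 ## Exercises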
 
-1. $x = 1$ の場合、関数 $y = f(x) = x^3 - \frac{1}{x}$ とその接線をプロットします。
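+1. So far we took the rules for derivatives for granted. Using the definition and limits, prove the properties for (i) $f(x) = c$, (ii) $f(x) = x^n$, (iii) $f(x) = e^x$ and (iv) $f(x) = \log x$.
+1. In the same vein, prove the product, sum, and quotient rule from first principles. 
+1. Prove that the constant multiple rule follows as a special case of the product rule. 
+1. Calculate the derivative of $f(x) = x^x$. 
+1. What does it mean that $f'(x) = 0$ for some $x$? Give an example of a function $f$ and a location $x$ for which this might hold. 
+1. Plot the function $y = f(x) = x^3 - \frac{1}{x}$ and plot its tangent line at $x = 1$.
 1. Find the gradient of the function $f(\mathbf{x}) = 3x_1^2 + 5e^{x_2}$.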
-1. 関数$f(\mathbf{x}) = \|\mathbf{x}\|_2$の勾配は何ですか？
-1. $u = f(x, y, z)$ と $x = x(a, b)$、$y = y(a, b)$、$z = z(a, b)$ の場合のチェーンルールを書き出せますか？
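+1. What is the gradient of the function $f(\mathbf{x}) = \|\mathbf{x}\|_2$? What happens for $\mathbf{x} = \mathbf{0}$?
+1. Can you write out the chain rule for the case where $u = f(x, y, z)$ and $x = x(a, b)$, $y = y(a, b)$, and $z = z(a, b)$?
+1. Given a function $f(x)$ that is invertible, compute the derivative of its inverse $f^{-1}(x)$. Here we have that $f^{-1}(f(x)) = x$ and conversely $f(f^{-1}(y)) = y$. Hint: use these properties in your derivation.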
 
 :begin_tab:`mxnet`
 [Discussions](https://discuss.d2l.ai/t/32)
diff --git a/chapter_preliminaries/calculus_origin.md b/chapter_preliminaries/calculus_origin.md
new file mode 100644
index 0000000..cabaa1d
--- /dev/null
+++ b/chapter_preliminaries/calculus_origin.md
@@ -0,0 +1,424 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# Calculus
+:label:`sec_calculus`
+
+For a long time, how to calculate 
+the area of a circle remained a mystery.
+Then, the ancient Greek mathematician Archimedes
+came up with the clever idea 
+to inscribe a series of polygons 
+with increasing numbers of vertices
+on the inside of a circle
+(:numref:`fig_circle_area`). 
+For a polygon with $n$ vertices,
+we obtain $n$ triangles.
+The height of each triangle approaches the radius $r$ 
+as we partition the circle more finely. 
+At the same time, its base approaches $2 \pi r/n$, 
+since the ratio between arc and secant approaches 1 
+for a large number of vertices. 
+Thus, the total area of the triangles approaches
+$n \cdot r \cdot \frac{1}{2} (2 \pi r/n) = \pi r^2$. 
+
+![Finding the area of a circle as a limit procedure.](../img/polygon-circle.svg)
+:label:`fig_circle_area`
+
+This limiting procedure leads to both 
+*differential calculus* and *integral calculus* 
+(:numref:`sec_integral_calculus`). 
+The former can tell us how to increase
+or decrease a function value by
+manipulating its arguments. 
+This comes in handy for the *optimization problems*
+that we face in deep learning,
+where we repeatedly update our parameters 
+in order to decrease the loss function.
+Optimization addresses how to fit our models to training data,
+and calculus is its key prerequisite.
+However, don't forget that our ultimate goal
+is to perform well on *previously unseen* data.
+That problem is called *generalization*
+and will be a key focus of other chapters.
+
+
+
+## Derivatives and Differentiation
+
+Put simply, a *derivative* is the rate of change
+in a function with respect to changes in its arguments.
+Derivatives can tell us how rapidly a loss function
+would increase or decrease were we 
+to *increase* or *decrease* each parameter
+by an infinitesimally small amount.
+Formally, for functions $f: \mathbb{R} \rightarrow \mathbb{R}$,
+that map from scalars to scalars,
+[**the *derivative* of $f$ at a point $x$ is defined as**]
+
+(**$$f'(x) = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h}.$$**)
+:eqlabel:`eq_derivative`
+
+The term on the right-hand side is called a *limit* 
+and it tells us what happens 
+to the value of an expression
+as a specified variable 
+approaches a particular value.
+This limit tells us what 
+the ratio between the change in the function value 
+$f(x + h) - f(x)$ and the perturbation $h$
+converges to as we shrink $h$ to zero.
+
+When $f'(x)$ exists, $f$ is said 
+to be *differentiable* at $x$;
+and when $f'(x)$ exists for all $x$
+on a set, e.g., the interval $[a,b]$, 
+we say that $f$ is differentiable on this set.
+Not all functions are differentiable,
+including many that we wish to optimize,
+such as accuracy and the area under the
+receiver operating characteristic curve (AUC).
+However, because computing the derivative of the loss 
+is a crucial step in nearly all 
+algorithms for training deep neural networks,
+we often optimize a differentiable *surrogate* instead.
+
+
+We can interpret the derivative 
+$f'(x)$
+as the *instantaneous* rate of change 
+of $f(x)$ with respect to $x$.
+Let's develop some intuition with an example.
+(**Define $u = f(x) = 3x^2-4x$.**)
+
+```{.python .input}
+%%tab mxnet
+%matplotlib inline
+from d2l import mxnet as d2l
+from matplotlib_inline import backend_inline
+from mxnet import np, npx
+npx.set_np()
+
+def f(x):
+    return 3 * x ** 2 - 4 * x
+```
+
+```{.python .input}
+%%tab pytorch
+%matplotlib inline
+from d2l import torch as d2l
+from matplotlib_inline import backend_inline
+import numpy as np
+
+def f(x):
+    return 3 * x ** 2 - 4 * x
+```
+
+```{.python .input}
+%%tab tensorflow
+%matplotlib inline
+from d2l import tensorflow as d2l
+from matplotlib_inline import backend_inline
+import numpy as np
+
+def f(x):
+    return 3 * x ** 2 - 4 * x
+```
+
+[**Setting $x=1$, $\frac{f(x+h) - f(x)}{h}$**] (**approaches $2$
+as $h$ approaches $0$.**)
+While this experiment lacks 
+the rigor of a mathematical proof,
+we will soon see that indeed $f'(1) = 2$.
+
+```{.python .input}
+%%tab all
+for h in 10.0**np.arange(-1, -6, -1):
+    print(f'h={h:.5f}, numerical limit={(f(1+h)-f(1))/h:.5f}')
+```
+
+There are several equivalent notational conventions for derivatives.
+Given $y = f(x)$, the following expressions are equivalent:
+
+$$f'(x) = y' = \frac{dy}{dx} = \frac{df}{dx} = \frac{d}{dx} f(x) = Df(x) = D_x f(x),$$
+
+where the symbols $\frac{d}{dx}$ and $D$ are *differentiation operators*.
+Below, we present the derivatives of some common functions:
+
+$$\begin{aligned} \frac{d}{dx} C & = 0 && \text{for any constant $C$} \\ \frac{d}{dx} x^n & = n x^{n-1} && \text{for } n \neq 0 \\ \frac{d}{dx} e^x & = e^x \\ \frac{d}{dx} \ln x & = x^{-1} \end{aligned}$$
+
+Functions composed from differentiable functions 
+are often themselves differentiable.
+The following rules come in handy 
+for working with compositions 
+of any differentiable functions 
+$f$ and $g$, and constant $C$.
+
+$$\begin{aligned} \frac{d}{dx} [C f(x)] & = C \frac{d}{dx} f(x) && \text{Constant multiple rule} \\ \frac{d}{dx} [f(x) + g(x)] & = \frac{d}{dx} f(x) + \frac{d}{dx} g(x) && \text{Sum rule} \\ \frac{d}{dx} [f(x) g(x)] & = f(x) \frac{d}{dx} g(x) + g(x) \frac{d}{dx} f(x) && \text{Product rule} \\ \frac{d}{dx} \frac{f(x)}{g(x)} & = \frac{g(x) \frac{d}{dx} f(x) - f(x) \frac{d}{dx} g(x)}{g^2(x)} && \text{Quotient rule} \end{aligned}$$
+
+Using these rules, we can 
+find the derivative of $3 x^2 - 4x$ via
+
+$$\frac{d}{dx} [3 x^2 - 4x] = 3 \frac{d}{dx} x^2 - 4 \frac{d}{dx} x = 6x - 4.$$
+
+Plugging in $x = 1$ shows that, indeed, 
+the derivative is $2$ at this location. 
+Note that derivatives tell us 
+the *slope* of a function 
+at a particular location.  
+
+## Visualization Utilities
+
+[**We can visualize the slopes of functions using the `matplotlib` library**].
+We need to define a few functions. 
+As its name indicates, `use_svg_display` 
+tells `matplotlib` to output graphics 
+in SVG format for crisper images. 
+The comment `#@save` is a special modifier 
+that allows us to save any function, 
+class, or other code block to the `d2l` package 
+so that we can invoke it later 
+without repeating the code, 
+e.g., via `d2l.use_svg_display()`.
+
+```{.python .input}
+%%tab all
+def use_svg_display():  #@save
+    """Use the svg format to display a plot in Jupyter."""
+    backend_inline.set_matplotlib_formats('svg')
+```
+
+Conveniently, we can set figure sizes with `set_figsize`. 
+Since the import statement `from matplotlib import pyplot as plt` 
+was marked via `#@save` in the `d2l` package, we can call `d2l.plt`.
+
+```{.python .input}
+%%tab all
+def set_figsize(figsize=(3.5, 2.5)):  #@save
+    """Set the figure size for matplotlib."""
+    use_svg_display()
+    d2l.plt.rcParams['figure.figsize'] = figsize
+```
+
+The `set_axes` function can associate axes
+with properties, including labels, ranges,
+and scales.
+
+```{.python .input}
+%%tab all
+#@save
+def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):
+    """Set the axes for matplotlib."""
+    axes.set_xlabel(xlabel), axes.set_ylabel(ylabel)
+    axes.set_xscale(xscale), axes.set_yscale(yscale)
+    axes.set_xlim(xlim),     axes.set_ylim(ylim)
+    if legend:
+        axes.legend(legend)
+    axes.grid()
+```
+
+With these three functions, we can define a `plot` function 
+to overlay multiple curves. 
+Much of the code here is just ensuring 
+that the sizes and shapes of inputs match.
+
+```{.python .input}
+%%tab all
+#@save
+def plot(X, Y=None, xlabel=None, ylabel=None, legend=[], xlim=None,
+         ylim=None, xscale='linear', yscale='linear',
+         fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5), axes=None):
+    """Plot data points."""
+
+    def has_one_axis(X):  # True if `X` (tensor or list) has 1 axis
+        return (hasattr(X, "ndim") and X.ndim == 1 or isinstance(X, list)
+                and not hasattr(X[0], "__len__"))
+    
+    if has_one_axis(X): X = [X]
+    if Y is None:
+        X, Y = [[]] * len(X), X
+    elif has_one_axis(Y):
+        Y = [Y]
+    if len(X) != len(Y):
+        X = X * len(Y)
+        
+    set_figsize(figsize)
+    if axes is None: axes = d2l.plt.gca()
+    axes.cla()
+    for x, y, fmt in zip(X, Y, fmts):
+        axes.plot(x, y, fmt) if len(x) else axes.plot(y, fmt)
+    set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
+```
+
+Now we can [**plot the function $u = f(x)$ and its tangent line $y = 2x - 3$ at $x=1$**],
+where the coefficient $2$ is the slope of the tangent line.
+
+```{.python .input}
+%%tab all
+x = np.arange(0, 3, 0.1)
+plot(x, [f(x), 2 * x - 3], 'x', 'f(x)', legend=['f(x)', 'Tangent line (x=1)'])
+```
+
+## Partial Derivatives and Gradients
+:label:`subsec_calculus-grad`
+
+Thus far, we have been differentiating
+functions of just one variable.
+In deep learning, we also need to work
+with functions of *many* variables.
+We briefly introduce notions of the derivative
+that apply to such *multivariate* functions.
+
+
+Let $y = f(x_1, x_2, \ldots, x_n)$ be a function with $n$ variables. 
+The *partial derivative* of $y$ 
+with respect to its $i^\mathrm{th}$ parameter $x_i$ is
+
+$$ \frac{\partial y}{\partial x_i} = \lim_{h \rightarrow 0} \frac{f(x_1, \ldots, x_{i-1}, x_i+h, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}.$$
+
+
+To calculate $\frac{\partial y}{\partial x_i}$, 
+we can treat $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$ as constants 
+and calculate the derivative of $y$ with respect to $x_i$.
+The following notation conventions for partial derivatives 
+are all common and all mean the same thing:
+
+$$\frac{\partial y}{\partial x_i} = \frac{\partial f}{\partial x_i} = \partial_{x_i} f = \partial_i f = f_{x_i} = f_i = D_i f = D_{x_i} f.$$
+
+We can concatenate partial derivatives 
+of a multivariate function 
+with respect to all its variables 
+to obtain a vector that is called
+the *gradient* of the function.
+Suppose that the input of function 
+$f: \mathbb{R}^n \rightarrow \mathbb{R}$ 
+is an $n$-dimensional vector 
+$\mathbf{x} = [x_1, x_2, \ldots, x_n]^\top$ 
+and the output is a scalar. 
+The gradient of the function $f$ 
+with respect to $\mathbf{x}$ 
+is a vector of $n$ partial derivatives:
+
+$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \left[\partial_{x_1} f(\mathbf{x}), \partial_{x_2} f(\mathbf{x}), \ldots,
+\partial_{x_n} f(\mathbf{x})\right]^\top.$$ 
+
+When there is no ambiguity,
+$\nabla_{\mathbf{x}} f(\mathbf{x})$ 
+is typically replaced 
+by $\nabla f(\mathbf{x})$.
+The following rules come in handy 
+for differentiating multivariate functions:
+
+* For all $\mathbf{A} \in \mathbb{R}^{m \times n}$ we have $\nabla_{\mathbf{x}} \mathbf{A} \mathbf{x} = \mathbf{A}^\top$ and $\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A}  = \mathbf{A}$.
+* For square matrices $\mathbf{A} \in \mathbb{R}^{n \times n}$ we have that $\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} \mathbf{x}  = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$ and in particular
+$\nabla_{\mathbf{x}} \|\mathbf{x} \|^2 = \nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{x} = 2\mathbf{x}$.
+
+Similarly, for any matrix $\mathbf{X}$, 
+we have $\nabla_{\mathbf{X}} \|\mathbf{X} \|_F^2 = 2\mathbf{X}$. 
+
+
+
+## Chain Rule
+
+In deep learning, the gradients of concern
+are often difficult to calculate
+because we are working with 
+deeply nested functions 
+(of functions (of functions...)).
+Fortunately, the *chain rule* takes care of this. 
+Returning to functions of a single variable,
+suppose that $y = f(g(x))$
+and that the underlying functions 
+$y=f(u)$ and $u=g(x)$ 
+are both differentiable.
+The chain rule states that 
+
+
+$$\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}.$$
+
+
+
+Turning back to multivariate functions,
+suppose that $y = f(\mathbf{u})$ has variables
+$u_1, u_2, \ldots, u_m$, 
+where each $u_i = g_i(\mathbf{x})$ 
+has variables $x_1, x_2, \ldots, x_n$,
+i.e.,  $\mathbf{u} = g(\mathbf{x})$.
+Then the chain rule states that
+
+$$\frac{\partial y}{\partial x_{i}} = \frac{\partial y}{\partial u_{1}} \frac{\partial u_{1}}{\partial x_{i}} + \frac{\partial y}{\partial u_{2}} \frac{\partial u_{2}}{\partial x_{i}} + \ldots + \frac{\partial y}{\partial u_{m}} \frac{\partial u_{m}}{\partial x_{i}} \text{ and thus } \nabla_{\mathbf{x}} y =  \mathbf{A} \nabla_{\mathbf{u}} y,$$
+
+where $\mathbf{A} \in \mathbb{R}^{n \times m}$ is a *matrix*
+that contains the derivative of vector $\mathbf{u}$
+with respect to vector $\mathbf{x}$.
+Thus, evaluating the gradient requires 
+computing a matrix-vector product. 
+This is one of the key reasons why linear algebra 
+is such an integral building block 
+in building deep learning systems. 
+
+
+
+## Discussion
+
+While we have just scratched the surface of a deep topic,
+a number of concepts already come into focus: 
+first, the composition rules for differentiation
+can be applied mindlessly, enabling
+us to compute gradients *automatically*.
+This task requires no creativity and thus 
+we can focus our cognitive powers elsewhere.
+Second, computing the derivatives of vector-valued functions 
+requires us to multiply matrices as we trace 
+the dependency graph of variables from output to input. 
+In particular, this graph is traversed in a *forward* direction 
+when we evaluate a function 
+and in a *backwards* direction 
+when we compute gradients. 
+Later chapters will formally introduce backpropagation,
+a computational procedure for applying the chain rule.
+
+From the viewpoint of optimization, gradients allow us 
+to determine how to move the parameters of a model
+in order to lower the loss,
+and each step of the optimization algorithms used 
+throughout this book will require calculating the gradient.
+
+## Exercises
+
+1. So far we took the rules for derivatives for granted. 
+   Using the definition and limits, prove the properties 
+   for (i) $f(x) = c$, (ii) $f(x) = x^n$, (iii) $f(x) = e^x$ and (iv) $f(x) = \log x$.
+1. In the same vein, prove the product, sum, and quotient rules from first principles. 
+1. Prove that the constant multiple rule follows as a special case of the product rule. 
+1. Calculate the derivative of $f(x) = x^x$. 
+1. What does it mean that $f'(x) = 0$ for some $x$? 
+   Give an example of a function $f$ 
+   and a location $x$ for which this might hold. 
+1. Plot the function $y = f(x) = x^3 - \frac{1}{x}$ 
+   and plot its tangent line at $x = 1$.
+1. Find the gradient of the function 
+   $f(\mathbf{x}) = 3x_1^2 + 5e^{x_2}$.
+1. What is the gradient of the function 
+   $f(\mathbf{x}) = \|\mathbf{x}\|_2$? What happens for $\mathbf{x} = \mathbf{0}$?
+1. Can you write out the chain rule for the case 
+   where $u = f(x, y, z)$ and $x = x(a, b)$, $y = y(a, b)$, and $z = z(a, b)$?
+1. Given a function $f(x)$ that is invertible, 
+   compute the derivative of its inverse $f^{-1}(x)$. 
+   Here we have that $f^{-1}(f(x)) = x$ and conversely $f(f^{-1}(y)) = y$. 
+   Hint: use these properties in your derivation. 
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/32)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/33)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/197)
+:end_tab:
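
The gradient rules and the multivariate chain rule stated in the calculus section above can be sanity-checked numerically. The following is a minimal sketch using plain NumPy, not code from the book: the helper `numerical_gradient`, the random test matrices, and the choice $\mathbf{u} = \mathbf{W}\mathbf{x}$, $y = \sum_j u_j^2$ are all assumptions made for illustration.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of a scalar-valued f at x by central differences."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

rng = np.random.default_rng(0)  # fixed seed so the check is reproducible
n, m = 4, 3
A = rng.normal(size=(n, n))
x = rng.normal(size=n)

# Rule: grad_x (x^T A x) = (A + A^T) x
quad_numeric = numerical_gradient(lambda v: v @ A @ v, x)
quad_analytic = (A + A.T) @ x

# Rule: grad_x ||x||^2 = 2 x
norm_numeric = numerical_gradient(lambda v: v @ v, x)
norm_analytic = 2 * x

# Chain rule with u = g(x) = W x and y = f(u) = sum_j u_j^2.
# Here du_j/dx_i = W[j, i], so the n x m matrix in the text is W^T,
# and grad_u y = 2 u.
W = rng.normal(size=(m, n))
g = lambda v: W @ v
f_outer = lambda u: np.sum(u ** 2)
chain_analytic = W.T @ (2 * g(x))
chain_numeric = numerical_gradient(lambda v: f_outer(g(v)), x)

print(np.allclose(quad_numeric, quad_analytic, atol=1e-5))
print(np.allclose(norm_numeric, norm_analytic, atol=1e-5))
print(np.allclose(chain_numeric, chain_analytic, atol=1e-5))
```

Central differences are used here because their truncation error vanishes for quadratic functions, so any mismatch would point to a wrong analytic formula rather than to the step size.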
diff --git a/chapter_preliminaries/index.md b/chapter_preliminaries/index.md
index 322e0e4..8bdfbae 100644
--- a/chapter_preliminaries/index.md
+++ b/chapter_preliminaries/index.md
@@ -1,17 +1,10 @@
-#  Preliminary Round
+#  Preliminaries
 :label:`chap_preliminaries`
 
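-To get started with deep learning, you need to acquire a few basic skills. All machine learning is concerned with extracting information from data. We therefore begin by learning practical skills for storing, manipulating, and preprocessing data. 
+To prepare for your dive into deep learning, you will need a few survival skills: (i) techniques for storing and manipulating data; (ii) libraries for ingesting and preprocessing data from a variety of sources; (iii) knowledge of the basic linear algebraic operations that we apply to high-dimensional data elements; (iv) just enough calculus to determine which direction to adjust each parameter in order to decrease the loss function; (v) the ability to automatically compute derivatives so that you can forget much of the calculus you just learned; (vi) some basic fluency in probability, our primary language for reasoning under uncertainty; and (vii) some aptitude for finding answers in the official documentation when you get stuck. 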
 
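-Moreover, machine learning typically requires working with large datasets, which we can think of as tables whose rows correspond to examples and whose columns correspond to attributes. Linear algebra gives us a powerful set of techniques for working with tabular data. We will not go too far into the weeds, focusing instead on the basics of matrix operations and their implementation. 
-
-Additionally, deep learning is all about optimization. We have a model with some parameters and we want to find those that fit our data *best*. Determining which way to move each parameter at each step of an algorithm requires a little bit of calculus, which we briefly introduce. Fortunately, the `autograd` package automatically computes derivatives for us, and we cover it next. 
-
-Next, machine learning is concerned with making predictions: given the information we observe, what is the likely value of some unknown attribute? To reason rigorously under uncertainty, we need to invoke the language of probability. 
-
-In the end, the official documentation provides plenty of descriptions and examples that are beyond this book. To conclude the chapter, we show how to look up documentation for the information you need. 
-
-This book keeps the mathematical content to the minimum necessary for a proper understanding of deep learning. However, that does not mean the book is mathematics free. Thus, this chapter provides a rapid introduction to basic and frequently used mathematics so that anyone can understand at least *most* of the mathematical content of the book. If you wish to understand *all* of the mathematical content, further reviewing the [online appendix on mathematics](https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/index.html) should be sufficient.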
+In short, this chapter provides a rapid introduction to the basics that you will need to follow 
+*most* of the technical content in this book.
 
 ```toc
 :maxdepth: 2
diff --git a/chapter_preliminaries/index_origin.md b/chapter_preliminaries/index_origin.md
new file mode 100644
index 0000000..f8a997b
--- /dev/null
+++ b/chapter_preliminaries/index_origin.md
@@ -0,0 +1,37 @@
+#  Preliminaries
+:label:`chap_preliminaries`
+
+To prepare for your dive into deep learning,
+you will need a few survival skills:
+(i) techniques for storing and manipulating data;
+(ii) libraries for ingesting 
+and preprocessing data from a variety of sources;
+(iii) knowledge of the basic linear algebraic operations
+that we apply to high-dimensional data elements;
+(iv) just enough calculus to determine
+which direction to adjust each parameter
+in order to decrease the loss function;
+(v) the ability to automatically compute derivatives
+so that you can forget much of 
+the calculus you just learned;
+(vi) some basic fluency in probability,
+our primary language for reasoning under uncertainty;
+and (vii) some aptitude for finding answers 
+in the official documentation when you get stuck.
+
+In short, this chapter provides a rapid introduction 
+to the basics that you will need to follow 
+*most* of the technical content in this book.
+
+```toc
+:maxdepth: 2
+
+ndarray
+pandas
+linear-algebra
+calculus
+autograd
+probability
+lookup-api
+```
+
diff --git a/chapter_preliminaries/linear-algebra.md b/chapter_preliminaries/linear-algebra.md
index 7295892..d223a9c 100644
--- a/chapter_preliminaries/linear-algebra.md
+++ b/chapter_preliminaries/linear-algebra.md
@@ -1,17 +1,25 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
 # Linear Algebra
 :label:`sec_linear-algebra`
 
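-Now that you can store and manipulate data, let's briefly review the subset of basic linear algebra that you will need to understand and implement most of the models covered in this book. Below, we introduce the basic mathematical objects, arithmetic, and operations in linear algebra, expressing each of them through mathematical notation and the corresponding implementation in code. 
+By now, we can load datasets into tensors and manipulate these tensors with basic mathematical operations. To start building sophisticated models, we will also need a few tools from linear algebra. This section offers a gentle introduction to the most essential concepts, starting from scalar arithmetic and ramping up to matrix multiplication. 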
 
 ## Scalars
 
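-If you never studied linear algebra or machine learning, then your past experience with math probably consisted of thinking about one number at a time. And, if you ever balanced a checkbook or even paid for dinner at a restaurant, then you already know how to do basic things like adding and multiplying pairs of numbers. For example, the temperature in Palo Alto is $52$ degrees Fahrenheit. Formally, we call values consisting of just one numerical quantity *scalars*. If you wanted to convert this value to Celsius (the metric system's more sensible temperature scale), you would evaluate the expression $c = \frac{5}{9}(f - 32)$, setting $f$ to $52$. In this equation, each of the terms $5$, $9$, and $32$ is a scalar value. The placeholders $c$ and $f$ are called *variables* and they represent unknown scalar values. 
+Most everyday mathematics consists of manipulating numbers one at a time. Formally, we call these values *scalars*. For example, the temperature in Palo Alto is $72$ degrees Fahrenheit. If you wanted to convert the temperature to Celsius, you would evaluate the expression $c = \frac{5}{9}(f - 32)$, setting $f$ to $72$. In this equation, the values $5$, $9$, and $32$ are scalars. The variables $c$ and $f$ represent unknown scalars. 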
 
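-In this book, we adopt the mathematical notation in which scalar variables are denoted by ordinary lower-cased letters (e.g., $x$, $y$, and $z$). We denote the space of all (continuous) *real-valued* scalars by $\mathbb{R}$. For expedience, we will skip a rigorous definition of what exactly a *space* is, but just remember that the expression $x \in \mathbb{R}$ is a formal way to say that $x$ is a real-valued scalar. The symbol $\in$ can be pronounced "in" and simply denotes membership in a set. Analogously, we could write $x, y \in \{0, 1\}$ to state that $x$ and $y$ are numbers whose value can only be $0$ or $1$. 
+We denote scalars by ordinary lower-cased letters (e.g., $x$, $y$, and $z$) and the space of all (continuous) 
+*real-valued* scalars by $\mathbb{R}$.
+For expedience, we will skip past rigorous definitions of *spaces*. Just remember that the expression $x \in \mathbb{R}$ is a formal way to say that $x$ is a real-valued scalar. The symbol $\in$ (pronounced "in") denotes membership in a set. For example, $x, y \in \{0, 1\}$ indicates that $x$ and $y$ are variables that can only take values $0$ or $1$. 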
 
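-(**A scalar is represented by a tensor with just one element.**) In the next snippet, we instantiate two scalars and perform some familiar arithmetic operations with them, namely addition, multiplication, division, and exponentiation.
+(**Scalars are implemented as tensors that contain only one element.**) Below, we assign two scalars and perform the familiar addition, multiplication, division, and exponentiation operations.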
 
 ```{.python .input}
+%%tab mxnet
 from mxnet import np, npx
 npx.set_np()
 
@@ -22,7 +30,7 @@ x + y, x * y, x / y, x ** y
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 import torch
 
 x = torch.tensor(3.0)
@@ -32,7 +40,7 @@ x + y, x * y, x / y, x**y
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 import tensorflow as tf
 
 x = tf.constant(3.0)
@@ -43,119 +51,114 @@ x + y, x * y, x / y, x**y
 
 ## Vectors
 
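-[**You can think of a vector as simply a list of scalar values.**] We call these values the *elements* (*entries* or *components*) of the vector. When our vectors represent examples from our dataset, their values hold some real-world significance. For example, if we were training a model to predict the risk that a loan defaults, we might associate each applicant with a vector whose components correspond to their income, length of employment, number of previous defaults, and other factors. If we were studying the risk of heart attacks that hospital patients potentially face, we might represent each patient by a vector whose components capture their most recent vital signs, cholesterol levels, minutes of exercise per day, etc. In math notation, we usually denote vectors as bold-faced, lower-cased letters (e.g., $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$). 
+For our purposes, [**you can think of vectors as fixed-length arrays of scalars.**] As with their code counterparts, we call these values the *elements* of the vector (synonyms include *entries* and *components*). When vectors represent examples from real-world datasets, their values hold some real-world significance. For example, if we were training a model to predict the risk of a loan defaulting, we might associate each applicant with a vector whose components correspond to quantities like their income, length of employment, or number of previous defaults. If we were studying heart attack risk, each vector might represent a patient and its components might correspond to their most recent vital signs, cholesterol levels, minutes of exercise per day, etc. We denote vectors by bold lowercase letters (e.g., $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$). 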
 
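-We work with vectors via one-dimensional tensors. In general, tensors can have arbitrary lengths, subject to the memory limits of your machine.
+Vectors are implemented as $1^{\mathrm{st}}$-order tensors. In general, such tensors can have arbitrary lengths, subject to memory limitations. Caution: in Python, like in most programming languages, vector indices start at $0$, also known as *zero-based indexing*, whereas in linear algebra subscripts begin at $1$ (one-based indexing).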
 
 ```{.python .input}
-x = np.arange(4)
+%%tab mxnet
+x = np.arange(3)
 x
 ```
 
 ```{.python .input}
-#@tab pytorch
-x = torch.arange(4)
+%%tab pytorch
+x = torch.arange(3)
 x
 ```
 
 ```{.python .input}
-#@tab tensorflow
-x = tf.range(4)
+%%tab tensorflow
+x = tf.range(3)
 x
 ```
 
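-We can refer to any element of a vector by using a subscript. For example, we can refer to the $i^\mathrm{th}$ element of $\mathbf{x}$ by $x_i$. Note that the element $x_i$ is a scalar, so we do not bold-face the font when referring to it. Extensive literature considers column vectors to be the default orientation of vectors, and so does this book. In math, a vector $\mathbf{x}$ can be written as 
+We can refer to an element of a vector by using a subscript. For example, $x_2$ denotes the second element of $\mathbf{x}$. Since $x_2$ is a scalar, we do not bold it. By default, we visualize vectors by stacking their elements vertically. 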
 
-$$\mathbf{x} =\begin{bmatrix}x_{1}  \\x_{2}  \\ \vdots  \\x_{n}\end{bmatrix},$$
+$$\mathbf{x} =\begin{bmatrix}x_{1}  \\ \vdots  \\x_{n}\end{bmatrix},$$
 :eqlabel:`eq_vec_def`
 
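-where $x_1, \ldots, x_n$ are elements of the vector. In code, we (**access any element by indexing into the tensor.**)
+Here $x_1, \ldots, x_n$ are elements of the vector. Later on, we will distinguish between such *column vectors* and *row vectors* whose elements are stacked horizontally. [**We access a tensor's elements via indexing.**]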
 
 ```{.python .input}
-x[3]
+%%tab mxnet
+x[2]
 ```
 
 ```{.python .input}
-#@tab pytorch
-x[3]
+%%tab pytorch
+x[2]
 ```
 
 ```{.python .input}
-#@tab tensorflow
-x[3]
+%%tab tensorflow
+x[2]
 ```
 
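-### Length, Dimensionality, and Shape
-
-Let's revisit some concepts from :numref:`sec_ndarray`. A vector is just an array of numbers. And just as every array has a length, so does every vector. In math notation, if we want to say that a vector $\mathbf{x}$ consists of $n$ real-valued scalars, we can express this as $\mathbf{x} \in \mathbb{R}^n$. The length of a vector is commonly called the *dimension* of the vector. 
-
-As with an ordinary Python array, we [**can access the length of a tensor**] by calling Python's built-in `len()` function.
+To indicate that a vector contains $n$ elements, we write $\mathbf{x} \in \mathbb{R}^n$. Formally, we call $n$ the *dimensionality* of the vector. [**In code, this corresponds to the tensor's length**], accessible via Python's built-in `len` function.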
 
 ```{.python .input}
+%%tab mxnet
 len(x)
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 len(x)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 len(x)
 ```
 
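-When a tensor represents a vector (with precisely one axis), we can also access its length via the `.shape` attribute. The shape is a tuple that lists the length (dimensionality) along each axis of the tensor. (**For tensors with just one axis, the shape has just one element.**)
+We can also access the length via the `shape` attribute. The shape is a tuple that indicates a tensor's length along each axis. (**Tensors with just one axis have shapes with just one element.**)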
 
 ```{.python .input}
+%%tab mxnet
 x.shape
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 x.shape
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 x.shape
 ```
 
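-Note that the word "dimension" tends to get overloaded in these contexts and this tends to confuse people. To clarify, we use the dimensionality of a *vector* or an *axis* to refer to its length, i.e., the number of elements of a vector or an axis. However, we use the dimensionality of a tensor to refer to the number of axes that a tensor has. In this sense, the dimensionality of some axis of a tensor will be the length of that axis. 
-
-## Matrices
+Oftentimes, the word "dimension" gets overloaded to mean both the number of axes and the length along a particular axis. To avoid this confusion, we use *order* to refer to the number of axes and *dimensionality* exclusively to refer to the number of components. 
 
-Just as vectors generalize scalars from order zero to order one, matrices generalize vectors from order one to order two. Matrices, which we typically denote with bold-faced, capital letters (e.g., $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$), are represented in code as tensors with two axes. 
+## Matrices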
 
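-In math notation, we use $\mathbf{A} \in \mathbb{R}^{m \times n}$ to express that the matrix $\mathbf{A}$ consists of $m$ rows and $n$ columns of real-valued scalars. We can illustrate any matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ as a table, where each element $a_{ij}$ belongs to the $i^{\mathrm{th}}$ row and $j^{\mathrm{th}}$ column: 
+Just as scalars are $0^{\mathrm{th}}$-order tensors and vectors are $1^{\mathrm{st}}$-order tensors, matrices are $2^{\mathrm{nd}}$-order tensors. We denote matrices by bold capital letters (e.g., $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$), and represent them in code by tensors with two axes. The expression $\mathbf{A} \in \mathbb{R}^{m \times n}$ indicates that a matrix $\mathbf{A}$ contains $m \times n$ real-valued scalars, arranged as $m$ rows and $n$ columns. When $m = n$, we say that a matrix is *square*. Visually, we can illustrate any matrix as a table. To refer to an individual element, we subscript both the row and column indices, e.g., $a_{ij}$ is the value that belongs to $\mathbf{A}$'s $i^{\mathrm{th}}$ row and $j^{\mathrm{th}}$ column: 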
 
 $$\mathbf{A}=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \\ \end{bmatrix}.$$
 :eqlabel:`eq_matrix_def`
 
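-For any $\mathbf{A} \in \mathbb{R}^{m \times n}$, the shape of $\mathbf{A}$ is ($m$, $n$) or $m \times n$. Specifically, when a matrix has the same number of rows and columns, its shape becomes a square; thus, it is called a *square matrix*. 
-
-We can [**create an $m \times n$ matrix**] by specifying a shape with two components $m$ and $n$ when calling any of our favorite functions for instantiating a tensor.
+In code, we represent a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ by a $2^{\mathrm{nd}}$-order tensor with shape ($m$, $n$). [**We can convert any appropriately sized $m \times n$ tensor into an $m \times n$ matrix**] by passing the desired shape to `reshape`: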
 
 ```{.python .input}
-A = np.arange(20).reshape(5, 4)
+%%tab mxnet
+A = np.arange(6).reshape(3, 2)
 A
 ```
 
 ```{.python .input}
-#@tab pytorch
-A = torch.arange(20).reshape(5, 4)
+%%tab pytorch
+A = torch.arange(6).reshape(3, 2)
 A
 ```
 
 ```{.python .input}
-#@tab tensorflow
-A = tf.reshape(tf.range(20), (5, 4))
+%%tab tensorflow
+A = tf.reshape(tf.range(6), (3, 2))
 A
 ```
 
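-We can access the scalar element $a_{ij}$ of a matrix $\mathbf{A}$ in :eqref:`eq_matrix_def` by specifying the indices for the row ($i$) and column ($j$), such as $[\mathbf{A}]_{ij}$. When the scalar elements of a matrix $\mathbf{A}$, such as in :eqref:`eq_matrix_def`, are not given, we may simply use the lower-case letter of the matrix $\mathbf{A}$ with the index subscript, $a_{ij}$, to refer to $[\mathbf{A}]_{ij}$. To keep notation simple, commas are inserted to separate indices only when necessary, such as $a_{2, 3j}$ and $[\mathbf{A}]_{2i-1, 3}$. 
-
-Sometimes, we want to flip the axes. When we exchange a matrix's rows and columns, the result is called the *transpose* of the matrix. Formally, we signify a matrix $\mathbf{A}$'s transpose by $\mathbf{A}^\top$ and if $\mathbf{B} = \mathbf{A}^\top$, then $b_{ij} = a_{ji}$ for any $i$ and $j$. Thus, the transpose of $\mathbf{A}$ in :eqref:`eq_matrix_def` is an $n \times m$ matrix: 
+Sometimes we want to flip the axes. When we exchange a matrix's rows and columns, the result is called its *transpose*. Formally, we signify a matrix $\mathbf{A}$'s transpose by $\mathbf{A}^\top$ and if $\mathbf{B} = \mathbf{A}^\top$, then $b_{ij} = a_{ji}$ for all $i$ and $j$. Thus, the transpose of an $m \times n$ matrix is an $n \times m$ matrix: 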
 
 $$
 \mathbf{A}^\top =
@@ -167,107 +170,94 @@ $$
 \end{bmatrix}.
 $$
 
-Now we access a (**matrix's transpose**) in code.
+In code, we can access any (**matrix's transpose**) as follows:
 
 ```{.python .input}
+%%tab mxnet
 A.T
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 A.T
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 tf.transpose(A)
 ```
 
-As a special type of the square matrix, [**a *symmetric matrix* $\mathbf{A}$ is equal to its transpose: $\mathbf{A} = \mathbf{A}^\top$.**] Here we define a symmetric matrix `B`.
-
-```{.python .input}
-B = np.array([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
-B
-```
-
-```{.python .input}
-#@tab pytorch
-B = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
-B
-```
-
-```{.python .input}
-#@tab tensorflow
-B = tf.constant([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
-B
-```
-
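-Now we compare `B` with its transpose.
+[**Symmetric matrices are the subset of square matrices that are equal to their own transposes: $\mathbf{A} = \mathbf{A}^\top$.**] The following matrix is symmetric: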
 
 ```{.python .input}
-B == B.T
+%%tab mxnet
+A = np.array([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
+A == A.T
 ```
 
 ```{.python .input}
-#@tab pytorch
-B == B.T
+%%tab pytorch
+A = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
+A == A.T
 ```
 
 ```{.python .input}
-#@tab tensorflow
-B == tf.transpose(B)
+%%tab tensorflow
+A = tf.constant([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
+A == tf.transpose(A)
 ```
 
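-Matrices are useful data structures: they allow us to organize data that have different modalities of variation. For example, rows in our matrix might correspond to different houses (data examples), while columns might correspond to different attributes. This should sound familiar if you have ever used spreadsheet software or have read :numref:`sec_pandas`. Thus, although the default orientation of a single vector is a column vector, in a matrix that represents a tabular dataset, it is more conventional to treat each data example as a row vector in the matrix. And, as we will see in later chapters, this convention will enable common deep learning practices. For example, along the outermost axis of a tensor, we can access or enumerate minibatches of data examples, or just data examples if no minibatch exists. 
+Matrices are useful for representing datasets. Typically, rows correspond to individual records and columns correspond to distinct attributes. 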
 
 ## Tensors
 
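-Just as vectors generalize scalars, and matrices generalize vectors, we can build data structures with even more axes. [**Tensors**] ("tensors" in this subsection refer to algebraic objects) (**give us a generic way of describing $n$-dimensional arrays with an arbitrary number of axes.**) Vectors, for example, are first-order tensors, and matrices are second-order tensors. Tensors are denoted with capital letters of a special font face (e.g., $\mathsf{X}$, $\mathsf{Y}$, and $\mathsf{Z}$) and their indexing mechanism (e.g., $x_{ijk}$ and $[\mathsf{X}]_{1, 2i-1, 3}$) is similar to that of matrices. 
+While you can go far in your machine learning journey with only scalars, vectors, and matrices, eventually you may need to work with higher-order [**tensors**]. Tensors (**give us a generic way to describe extensions to $n^{\mathrm{th}}$-order arrays.**) We call software objects of the *tensor class* "tensors" precisely because they too can have arbitrary numbers of axes. While it may be confusing to use the word
+*tensor* for both the mathematical object
+and its realization in code, our meaning should usually be clear from context. We denote general tensors by capital letters with a special font face (e.g., $\mathsf{X}$, $\mathsf{Y}$, and $\mathsf{Z}$) and their indexing mechanism (e.g., $x_{ijk}$ and $[\mathsf{X}]_{1, 2i-1, 3}$) follows naturally from that of matrices. 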
 
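-Tensors will become more important when we start working with images, which arrive as $n$-dimensional arrays with 3 axes corresponding to the height, width, and a *channel* axis for stacking the color channels (red, green, and blue). For now, we will skip over higher-order tensors and focus on the basics.
+Tensors will become more important when we start working with images. Each image arrives as a $3^{\mathrm{rd}}$-order tensor with axes corresponding to the height, width, and *channel*. At each spatial location, the intensities of each color (red, green, and blue) are stacked along the channel. Furthermore, a collection of images is represented in code by a $4^{\mathrm{th}}$-order tensor, where distinct images are indexed along the first axis. Higher-order tensors are constructed, as were vectors and matrices, by growing the number of shape components.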
 
 ```{.python .input}
-X = np.arange(24).reshape(2, 3, 4)
-X
+%%tab mxnet
+np.arange(24).reshape(2, 3, 4)
 ```
 
 ```{.python .input}
-#@tab pytorch
-X = torch.arange(24).reshape(2, 3, 4)
-X
+%%tab pytorch
+torch.arange(24).reshape(2, 3, 4)
 ```
 
 ```{.python .input}
-#@tab tensorflow
-X = tf.reshape(tf.range(24), (2, 3, 4))
-X
+%%tab tensorflow
+tf.reshape(tf.range(24), (2, 3, 4))
 ```
 
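-## Basic Properties of Tensor Operations
+## Basic Properties of Tensor Arithmetic
 
-Scalars, vectors, matrices, and tensors ("tensors" in this subsection refer to algebraic objects) of an arbitrary number of axes have some nice properties that often come in handy. For example, you might have noticed from the definition of an elementwise unary operation that it does not change the shape of its operand. Similarly, [**given any two tensors with the same shape, the result of any binary elementwise operation will be a tensor of that same shape.**] For example, adding two matrices of the same shape performs elementwise addition over these two matrices.
+Scalars, vectors, matrices, and higher-order tensors all have some handy properties. For example, elementwise operations produce outputs that have the same shape as their operands.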
 
 ```{.python .input}
-A = np.arange(20).reshape(5, 4)
+%%tab mxnet
+A = np.arange(6).reshape(2, 3)
 B = A.copy()  # Assign a copy of `A` to `B` by allocating new memory
 A, A + B
 ```
 
 ```{.python .input}
-#@tab pytorch
-A = torch.arange(20, dtype=torch.float32).reshape(5, 4)
+%%tab pytorch
+A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
 B = A.clone()  # Assign a copy of `A` to `B` by allocating new memory
 A, A + B
 ```
 
 ```{.python .input}
-#@tab tensorflow
-A = tf.reshape(tf.range(20, dtype=tf.float32), (5, 4))
+%%tab tensorflow
+A = tf.reshape(tf.range(6, dtype=tf.float32), (2, 3))
 B = A  # No cloning of `A` to `B` by allocating new memory
 A, A + B
 ```
 
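-Specifically, [**elementwise multiplication of two matrices is called their *Hadamard product***] (math notation $\odot$). Consider a matrix $\mathbf{B} \in \mathbb{R}^{m \times n}$ whose element of row $i$ and column $j$ is $b_{ij}$. The Hadamard product of matrices $\mathbf{A}$ (defined in :eqref:`eq_matrix_def`) and $\mathbf{B}$ is 
+[**The elementwise product of two matrices is called their *Hadamard product***] (denoted $\odot$). Below, we spell out the entries of the Hadamard product of two matrices $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}$: 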
 
 $$
 \mathbf{A} \odot \mathbf{B} =
@@ -280,265 +270,273 @@ $$
 $$
 
 ```{.python .input}
+%%tab mxnet
 A * B
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 A * B
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 A * B
 ```
 
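-[**Multiplying or adding a tensor by a scalar**] also does not change the shape of the tensor, where each element of the operand tensor is added to or multiplied by the scalar.
+[**Adding or multiplying a scalar and a tensor**] produces a result with the same shape as the original tensor. Here, each element of the tensor is added to (or multiplied by) the scalar.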
 
 ```{.python .input}
+%%tab mxnet
 a = 2
 X = np.arange(24).reshape(2, 3, 4)
 a + X, (a * X).shape
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 a = 2
 X = torch.arange(24).reshape(2, 3, 4)
 a + X, (a * X).shape
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 a = 2
 X = tf.reshape(tf.range(24), (2, 3, 4))
 a + X, (a * X).shape
 ```
 
-## Reduction
-:label:`subseq_lin-alg-reduction`
+## Reduction
+:label:`subsec_lin-alg-reduction`
 
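-One useful operation that we can perform with arbitrary tensors is to calculate [**the sum of their elements.**] In mathematical notation, we express sums using the $\sum$ symbol. To express the sum of the elements in a vector $\mathbf{x}$ of length $d$, we write $\sum_{i=1}^d x_i$. In code, we can just call the function for calculating the sum.
+Often, we wish to calculate [**the sum of a tensor's elements.**] To express the sum of the elements in a vector $\mathbf{x}$ of length $n$, we write $\sum_{i=1}^n x_i$. There is a simple function for it: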
 
 ```{.python .input}
-x = np.arange(4)
+%%tab mxnet
+x = np.arange(3)
 x, x.sum()
 ```
 
 ```{.python .input}
-#@tab pytorch
-x = torch.arange(4, dtype=torch.float32)
+%%tab pytorch
+x = torch.arange(3, dtype=torch.float32)
 x, x.sum()
 ```
 
 ```{.python .input}
-#@tab tensorflow
-x = tf.range(4, dtype=tf.float32)
+%%tab tensorflow
+x = tf.range(3, dtype=tf.float32)
 x, tf.reduce_sum(x)
 ```
 
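-For example, the sum of the elements of an $m \times n$ matrix $\mathbf{A}$ can be written as $\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}$.
+To express [**sums over the elements of tensors of arbitrary shape**], we simply sum over all of its axes. For example, the sum of the elements of an $m \times n$ matrix $\mathbf{A}$ could be written $\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}$.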
 
 ```{.python .input}
+%%tab mxnet
 A.shape, A.sum()
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 A.shape, A.sum()
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 A.shape, tf.reduce_sum(A)
 ```
 
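-By default, invoking the function for calculating the sum
-*reduces* a tensor along all its axes to a scalar.
-We can also [**specify the axes along which the tensor is reduced via summation.**] Take matrices as an example. To reduce the row dimension (axis 0) by summing up elements of all the rows, we specify `axis=0` when invoking the function. Since the input matrix reduces along axis 0 to generate the output vector, the dimension of axis 0 of the input is lost in the output shape.
+By default, invoking the sum function
+*reduces* a tensor along all of its axes,
+eventually producing a scalar. Our libraries also allow us to [**specify the axes along which the tensor should be reduced.**] To sum over all elements along the rows (axis 0), we specify `axis=0` in `sum`. Since the input matrix reduces along axis 0 to generate the output vector, this axis is missing from the shape of the output.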
 
 ```{.python .input}
-A_sum_axis0 = A.sum(axis=0)
-A_sum_axis0, A_sum_axis0.shape
+%%tab mxnet
+A.shape, A.sum(axis=0).shape
 ```
 
 ```{.python .input}
-#@tab pytorch
-A_sum_axis0 = A.sum(axis=0)
-A_sum_axis0, A_sum_axis0.shape
+%%tab pytorch
+A.shape, A.sum(axis=0).shape
 ```
 
 ```{.python .input}
-#@tab tensorflow
-A_sum_axis0 = tf.reduce_sum(A, axis=0)
-A_sum_axis0, A_sum_axis0.shape
+%%tab tensorflow
+A.shape, tf.reduce_sum(A, axis=0).shape
 ```
 
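-Specifying `axis=1` will reduce the column dimension (axis 1) by summing up elements of all the columns. Thus, the dimension of axis 1 of the input is lost in the output shape.
+Specifying `axis=1` will reduce the column dimension (axis 1) by summing up elements of all the columns.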
 
 ```{.python .input}
-A_sum_axis1 = A.sum(axis=1)
-A_sum_axis1, A_sum_axis1.shape
+%%tab mxnet
+A.shape, A.sum(axis=1).shape
 ```
 
 ```{.python .input}
-#@tab pytorch
-A_sum_axis1 = A.sum(axis=1)
-A_sum_axis1, A_sum_axis1.shape
+%%tab pytorch
+A.shape, A.sum(axis=1).shape
 ```
 
 ```{.python .input}
-#@tab tensorflow
-A_sum_axis1 = tf.reduce_sum(A, axis=1)
-A_sum_axis1, A_sum_axis1.shape
+%%tab tensorflow
+A.shape, tf.reduce_sum(A, axis=1).shape
 ```
 
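-Reducing a matrix along both rows and columns via addition is equivalent to summing up all the elements of the matrix.
+Reducing a matrix along both rows and columns via summation is the same as summing up all the elements of the matrix.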
 
 ```{.python .input}
-A.sum(axis=[0, 1])  # Same as `A.sum()`
+%%tab mxnet
+A.sum(axis=[0, 1]) == A.sum() # Same as `A.sum()`
 ```
 
 ```{.python .input}
-#@tab pytorch
-A.sum(axis=[0, 1])  # Same as `A.sum()`
+%%tab pytorch
+A.sum(axis=[0, 1]) == A.sum() # Same as `A.sum()`
 ```
 
 ```{.python .input}
-#@tab tensorflow
-tf.reduce_sum(A, axis=[0, 1])  # Same as `tf.reduce_sum(A)`
+%%tab tensorflow
+tf.reduce_sum(A, axis=[0, 1]), tf.reduce_sum(A) # Same as `tf.reduce_sum(A)`
 ```
 
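-[**A related quantity is the *mean*, which is also called the *average*.**] We calculate the mean by dividing the sum by the total number of elements. In code, we could just call the function for calculating the mean on tensors of arbitrary shape.
+[**A related quantity is the *mean*, also called the *average*.**] We calculate the mean by dividing the sum by the total number of elements. Because computing the mean is so common, it gets a dedicated library function that works analogously to `sum`.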
 
 ```{.python .input}
+%%tab mxnet
 A.mean(), A.sum() / A.size
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 A.mean(), A.sum() / A.numel()
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 tf.reduce_mean(A), tf.reduce_sum(A) / tf.size(A).numpy()
 ```
 
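-Likewise, the function for calculating the mean can also reduce a tensor along the specified axes.
+Likewise, the function for calculating the mean can also reduce a tensor along specific axes.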
 
 ```{.python .input}
+%%tab mxnet
 A.mean(axis=0), A.sum(axis=0) / A.shape[0]
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 A.mean(axis=0), A.sum(axis=0) / A.shape[0]
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 tf.reduce_mean(A, axis=0), tf.reduce_sum(A, axis=0) / A.shape[0]
 ```
 
-### Non-Reduction Sum
-:label:`subseq_lin-alg-non-reduction`
+## Non-Reduction Sum
+:label:`subsec_lin-alg-non-reduction`
 
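-However, sometimes it can be useful to [**keep the number of axes unchanged**] when invoking the function for calculating the sum or mean.
+Sometimes it can be useful to [**keep the number of axes unchanged**] when invoking the function for calculating the sum or mean. This matters when we want to use the broadcast mechanism.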
 
 ```{.python .input}
+%%tab mxnet
 sum_A = A.sum(axis=1, keepdims=True)
-sum_A
+sum_A, sum_A.shape
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 sum_A = A.sum(axis=1, keepdims=True)
-sum_A
+sum_A, sum_A.shape
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 sum_A = tf.reduce_sum(A, axis=1, keepdims=True)
-sum_A
+sum_A, sum_A.shape
 ```
 
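-For instance, since `sum_A` still keeps its two axes after summing each row, we can divide `A` by `sum_A` with broadcasting.
+For instance, since `sum_A` keeps its two axes after summing each row, we can (**divide `A` by `sum_A` with broadcasting**) to create a matrix where each row sums up to $1$.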
 
 ```{.python .input}
+%%tab mxnet
 A / sum_A
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 A / sum_A
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 A / sum_A
 ```
 
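-If we want to calculate [**the cumulative sum of elements of `A` along some axis**], say `axis=0` (row by row), we can call the `cumsum` function. This function will not reduce the input tensor along any axis.
+If we want to calculate [**the cumulative sum of elements of `A` along some axis**], say `axis=0` (row by row), we can call the `cumsum` function. By design, this function does not reduce the input tensor along any axis.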
 
 ```{.python .input}
+%%tab mxnet
 A.cumsum(axis=0)
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 A.cumsum(axis=0)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 tf.cumsum(A, axis=0)
 ```
 
-## ドットプロダクト
+## Dot Products
 
-これまでは、要素単位の演算、合計、平均のみを実行してきました。そして、これが私たちにできることのすべてであるならば、線形代数はおそらくそれ自身のセクションに値しないでしょう。ただし、最も基本的な演算の 1 つは内積です。2 つのベクトル $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$ が与えられた場合、それらの*内積* $\mathbf{x}^\top \mathbf{y}$ (または $\langle \mathbf{x}, \mathbf{y}  \rangle$) は、同じ位置 $\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{d} x_i y_i$ にある要素の積の和になります。 
+これまでは、要素単位の演算、合計、平均のみを実行してきました。そして、これが私たちにできるすべてだったら、線形代数は独自のセクションに値しないでしょう。幸いなことに、これは物事がより面白くなるところです。最も基本的な操作の 1 つは内積です。2つのベクトル$\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$が与えられた場合、それらの*ドット積* $\mathbf{x}^\top \mathbf{y}$（または$\langle \mathbf{x}, \mathbf{y}  \rangle$）は、同じ位置にある要素の積の合計です：$\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{d} x_i y_i$。 
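+
+As a small worked example (a sketch with illustrative values, assuming $\mathbf{x} = [0, 1, 2]^\top$ and $\mathbf{y} = [1, 1, 1]^\top$), the dot product multiplies matching entries and adds up the results:
+
+$$\mathbf{x}^\top \mathbf{y} = 0 \cdot 1 + 1 \cdot 1 + 2 \cdot 1 = 3.$$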
 
-[~~2つのベクトルの*内積* は、同じ位置にある要素の積の和です~~]
+[~~2つのベクトルの*内積* は、同じ位置にある要素の積の合計です~~]
 
 ```{.python .input}
-y = np.ones(4)
+%%tab mxnet
+y = np.ones(3)
 x, y, np.dot(x, y)
 ```
 
 ```{.python .input}
-#@tab pytorch
-y = torch.ones(4, dtype = torch.float32)
+%%tab pytorch
+y = torch.ones(3, dtype = torch.float32)
 x, y, torch.dot(x, y)
 ```
 
 ```{.python .input}
-#@tab tensorflow
-y = tf.ones(4, dtype=tf.float32)
+%%tab tensorflow
+y = tf.ones(3, dtype=tf.float32)
 x, y, tf.tensordot(x, y, axes=1)
 ```
 
-注意 (**要素ごとの乗算と和を実行することで、2つのベクトルの内積を等価的に表現できます:**)
+同等に、(**要素単位の乗算とそれに続く合計を実行することにより、2つのベクトルの内積を計算できます:**)
 
 ```{.python .input}
+%%tab mxnet
 np.sum(x * y)
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 torch.sum(x * y)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 tf.reduce_sum(x * y)
 ```
 
-内積は幅広い状況で役に立ちます。たとえば、ベクトル $\mathbf{x}  \in \mathbb{R}^d$ で表される値のセットと $\mathbf{w} \in \mathbb{R}^d$ で表される重みのセットがある場合、$\mathbf{x}$ の値の重み $\mathbf{w}$ に従った加重和は、ドット積 $\mathbf{x}^\top \mathbf{w}$ として表すことができます。重みが負でなく、合計が 1 の場合 ($\left(\sum_{i=1}^{d} {w_i} = 1\right)$)、内積は*加重平均*を表します。2 つのベクトルを正規化して単位長をもつと、内積はそれらの間の角度の余弦を表します。この*length* の概念については、このセクションの後半で正式に紹介します。 
+ドット積は幅広い状況で役立ちます。たとえば、ベクトル $\mathbf{x}  \in \mathbb{R}^n$ で示されるいくつかの値のセットと $\mathbf{w} \in \mathbb{R}^n$ で示される重みのセットがある場合、重み $\mathbf{w}$ に従った $\mathbf{x}$ の値の加重合計は、内積 $\mathbf{x}^\top \mathbf{w}$ として表すことができます。重みが負でなく、合計が1になる場合、つまり$\left(\sum_{i=1}^{n} {w_i} = 1\right)$の場合、内積は*加重平均*を表します。2 つのベクトルを単位長に正規化した後、内積はそれらの間の角度の余弦を表します。このセクションの後半で、この*length*の概念を正式に紹介します。 
 
-## マトリックス-ベクトル積
+## Matrix-Vector Products
 
-ドット積の計算方法がわかったところで、*行列-ベクトル積*について理解できるようになります。:eqref:`eq_matrix_def` と :eqref:`eq_vec_def` でそれぞれ定義され可視化された行列 $\mathbf{A} \in \mathbb{R}^{m \times n}$ とベクトル $\mathbf{x} \in \mathbb{R}^n$ を思い出してください。まず、行列 $\mathbf{A}$ を行ベクトルで可視化することから始めましょう。 
+ドット積の計算方法がわかったので、$m \times n$ 行列 $\mathbf{A}$ と $n$ 次元ベクトル $\mathbf{x}$ の間の*積* を理解し始めることができます。まず、行列を行ベクトルで視覚化します。 
 
 $$\mathbf{A}=
 \begin{bmatrix}
@@ -550,7 +548,7 @@ $$\mathbf{A}=
 
 ここで、各 $\mathbf{a}^\top_{i} \in \mathbb{R}^n$ は、行列 $\mathbf{A}$ の $i^\mathrm{th}$ 行を表す行ベクトルです。 
 
-[**行列-ベクトル積 $\mathbf{A}\mathbf{x}$ は長さが $m$ の列ベクトルで、$i^\mathrm{th}$ の要素はドット積 $\mathbf{a}^\top_i \mathbf{x}$: **] 
+[**行列ベクトル積 $\mathbf{A}\mathbf{x}$ は長さ$m$の単純な列ベクトルで、その$i^\mathrm{th}$要素はドット積 $\mathbf{a}^\top_i \mathbf{x}$: **] 
 
 $$
 \mathbf{A}\mathbf{x}
@@ -568,39 +566,40 @@ $$
 \end{bmatrix}.
 $$
 
-行列 $\mathbf{A}\in \mathbb{R}^{m \times n}$ による乗算は、ベクトルを $\mathbb{R}^{n}$ から $\mathbb{R}^{m}$ に投影する変換と考えることができます。これらの変換は非常に有用であることが分かります。たとえば、回転を正方行列による乗算として表すことができます。以降の章で説明するように、行列とベクトルの積を使用して、前の層の値からニューラルネットワークの各層を計算するときに必要な最も集中的な計算を記述することもできます。
+行列 $\mathbf{A}\in \mathbb{R}^{m \times n}$ を使用した乗算は、$\mathbb{R}^{n}$ から $\mathbb{R}^{m}$ へのベクトルを投影する変換と考えることができます。これらの変換は非常に便利です。たとえば、回転を特定の正方行列による乗算として表すことができます。マトリックスベクトル積は、前の層からの出力を前提として、ニューラルネットワークの各層の出力を計算する際に必要な主要な計算も記述します。
 
 :begin_tab:`mxnet`
-行列とベクトルの積をテンソルでコードで表現する場合、ドット積と同じ関数 `dot` を使用します。行列 `A` とベクトル `x` をもって `np.dot(A, x)` を呼び出すと、行列とベクトルの積が実行されます。`A` の列の次元 (軸 1 に沿った長さ) は `x` の次元 (長さ) と同じでなければならないことに注意してください。
+行列とベクトルの積をコードで表すには、同じ `dot` 関数を使用します。操作は、引数の型に基づいて推測されます。`A` (軸 1 に沿った長さ) の列の次元は、`x` (長さ) の次元と同じでなければならないことに注意してください。
 :end_tab:
 
 :begin_tab:`pytorch`
-行列とベクトルの積をテンソルを使ったコードで表現するには、`mv` 関数を使用します。行列 `A` とベクトル `x` をもって `torch.mv(A, x)` を呼び出すと、行列とベクトルの積が実行されます。`A` の列の次元 (軸 1 に沿った長さ) は `x` の次元 (長さ) と同じでなければならないことに注意してください。
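+To express a matrix-vector product in code, we use the `mv` function. Note that the column dimension of `A` (its length along axis 1) must be the same as the dimension of `x` (its length). PyTorch has a convenience operator `@` that can execute both matrix-vector and matrix-matrix products (depending on its arguments). Thus we can write `A@x`.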
 :end_tab:
 
 :begin_tab:`tensorflow`
-行列とベクトルの積をテンソルを使ったコードで表現するには、`matvec` 関数を使用します。行列 `A` とベクトル `x` をもって `tf.linalg.matvec(A, x)` を呼び出すと、行列とベクトルの積が実行されます。`A` の列の次元 (軸 1 に沿った長さ) は `x` の次元 (長さ) と同じでなければならないことに注意してください。
+行列とベクトルの積をコードで表すには、`matvec` 関数を使用します。`A` (軸 1 に沿った長さ) の列の次元は、`x` (長さ) の次元と同じでなければならないことに注意してください。
 :end_tab:
 
 ```{.python .input}
+%%tab mxnet
 A.shape, x.shape, np.dot(A, x)
 ```
 
 ```{.python .input}
-#@tab pytorch
-A.shape, x.shape, torch.mv(A, x)
+%%tab pytorch
+A.shape, x.shape, torch.mv(A, x), A@x
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 A.shape, x.shape, tf.linalg.matvec(A, x)
 ```
 
 ## マトリックス-マトリックス乗算
 
-ドット積と行列ベクトル積のコツをつかんだなら、*matrix-matrix乗算*は簡単なはずです。 
+ドット積と行列ベクトル積のコツをつかんだら、*行列-行列の乗算* は簡単なはずです。 
 
-$\mathbf{A} \in \mathbb{R}^{n \times k}$ と $\mathbf{B} \in \mathbb{R}^{k \times m}$ の 2 つの行列があるとします。 
+$\mathbf{A} \in \mathbb{R}^{n \times k}$と$\mathbf{B} \in \mathbb{R}^{k \times m}$の2つの行列があるとします。 
 
 $$\mathbf{A}=\begin{bmatrix}
  a_{11} & a_{12} & \cdots & a_{1k} \\
@@ -615,7 +614,7 @@ $$\mathbf{A}=\begin{bmatrix}
  b_{k1} & b_{k2} & \cdots & b_{km} \\
 \end{bmatrix}.$$
 
-行列 $\mathbf{A}$ の $i^\mathrm{th}$ 行を表す行ベクトルを $\mathbf{a}^\top_{i} \in \mathbb{R}^k$ で表し、$\mathbf{b}_{j} \in \mathbb{R}^k$ を行列 $\mathbf{B}$ の $j^\mathrm{th}$ 列の列ベクトルとします。行列積 $\mathbf{C} = \mathbf{A}\mathbf{B}$ を生成するには、$\mathbf{A}$ を行ベクトルから、$\mathbf{B}$ を列ベクトルから考えるのが最も簡単です。 
+$\mathbf{a}^\top_{i} \in \mathbb{R}^k$ が行列 $\mathbf{A}$ の $i^\mathrm{th}$ 行を表す行ベクトルを表し、$\mathbf{b}_{j} \in \mathbb{R}^k$ が行列 $\mathbf{B}$ の $j^\mathrm{th}$ 列からの列ベクトルを表すとします。 
 
 $$\mathbf{A}=
 \begin{bmatrix}
@@ -629,7 +628,7 @@ $$\mathbf{A}=
 \end{bmatrix}.
 $$
 
-次に、各要素 $c_{ij}$ をドット積 $\mathbf{a}^\top_i \mathbf{b}_j$ として計算するだけで、行列積 $\mathbf{C} \in \mathbb{R}^{n \times m}$ が生成されます。 
+To form the matrix product $\mathbf{C} \in \mathbb{R}^{n \times m}$, we compute each element $c_{ij}$ as the dot product between the $i^{\mathrm{th}}$ row of $\mathbf{A}$ and the $j^{\mathrm{th}}$ column of $\mathbf{B}$, i.e., $\mathbf{a}^\top_i \mathbf{b}_j$:
 
 $$\mathbf{C} = \mathbf{AB} = \begin{bmatrix}
 \mathbf{a}^\top_{1} \\
@@ -648,154 +647,148 @@ $$\mathbf{C} = \mathbf{AB} = \begin{bmatrix}
 \end{bmatrix}.
 $$
 
-[**行列と行列の乗算 $\mathbf{AB}$ は、単に $m$ の行列とベクトルの積を実行し、その結果をつなぎ合わせて $n \times m$ 行列を形成すると考えることができます。**] 次のスニペットでは、`A` と `B` で行列の乗算を実行しています。ここで、`A` は 5 行 4 列の行列で、`B` は 4 行 3 列の行列です。乗算後、5行3列の行列が得られます。
+[**行列と行列の乗算$\mathbf{AB}$は、$m$の行列ベクトル積または$m \times n$のドット積を実行し、結果をステッチして$n \times m$行列を形成すると考えることができます。**] 次のスニペットでは、`A`と`B`で行列の乗算を実行します。ここで、`A`は2行3列の行列で、`B`は3行4列の行列です。乗算後、2行4列の行列が得られます。
 
 ```{.python .input}
-B = np.ones(shape=(4, 3))
+%%tab mxnet
+B = np.ones(shape=(3, 4))
 np.dot(A, B)
 ```
 
 ```{.python .input}
-#@tab pytorch
-B = torch.ones(4, 3)
-torch.mm(A, B)
+%%tab pytorch
+B = torch.ones(3, 4)
+torch.mm(A, B), A@B
 ```
 
 ```{.python .input}
-#@tab tensorflow
-B = tf.ones((4, 3), tf.float32)
+%%tab tensorflow
+B = tf.ones((3, 4), tf.float32)
 tf.matmul(A, B)
 ```
 
-行列-行列の乗算は単に「*行列乗算*」と呼ぶことができ、アダマール積と混同しないでください。 
+*行列-行列乗算* という用語は、多くの場合、*行列乗算* に簡略化されており、アダマール積と混同しないでください。 
 
 ## 規範
 :label:`subsec_lin-algebra-norms`
 
-線形代数で最も有用な演算子には、*norms* があります。非公式には、ベクトルのノルムによってベクトルがどれだけ*大きい*かがわかります。ここで検討中の*size* の概念は、次元性ではなく、成分の大きさに関係します。 
-
-線形代数では、ベクトルノルムは、ベクトルをスカラーにマッピングし、いくつかの特性を満たす関数 $f$ です。任意のベクトル $\mathbf{x}$ が与えられた場合、1 番目の特性は、ベクトルのすべての要素を定数係数 $\alpha$ でスケーリングすると、そのノルムも同じ定数因子の*絶対値* でスケーリングされることを示しています。 
-
-$$f(\alpha \mathbf{x}) = |\alpha| f(\mathbf{x}).$$
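+Some of the most useful operators in linear algebra are *norms*. Informally, the norm of a vector tells us how *big* it is. For instance, the $\ell_2$ norm measures the (Euclidean) length of a vector. Here, we are employing a notion of *size* that concerns the magnitude of a vector's components (not its dimensionality).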
 
-2 つ目の特性は、おなじみの三角不等式です。 
+ノルムは、ベクトルをスカラーにマッピングし、次の 3 つのプロパティを満たす関数 $\| \cdot \|$ です。 
 
-$$f(\mathbf{x} + \mathbf{y}) \leq f(\mathbf{x}) + f(\mathbf{y}).$$
+1. 任意のベクトル $\mathbf{x}$ が与えられた場合、ベクトル (のすべての要素) をスカラー $\alpha \in \mathbb{R}$ でスケーリングすると、そのノルムはそれに応じてスケーリングされます:$$\|\alpha \mathbf{x}\| = |\alpha| \|\mathbf{x}\|.$$
+2. 任意のベクトル $\mathbf{x}$ と $\mathbf{y}$: ノルムは三角形の不等式を満たす:$$\|\mathbf{x} + \mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|.$$
+3. ベクトルのノルムは非負で、ベクトルがゼロの場合にのみ消滅します:$$\|\mathbf{x}\| > 0 \text{ for all } \mathbf{x} \neq 0.$$
 
-3番目の特性は、ノルムが非負でなければならないことを単純に示しています。 
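+Many functions are valid norms, and different norms encode different notions of size. The Euclidean norm, which we all learned in elementary school geometry when calculating the hypotenuse of a right triangle, is the square root of the sum of squares of a vector's elements. Formally, this is called [**the $\ell_2$ *norm***] and expressed as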
 
-$$f(\mathbf{x}) \geq 0.$$
+(**$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}.$$**)
 
-ほとんどのコンテキストでは、何かの最小の*size*は0なので、これは理にかなっています。最終的な特性では、最小ノルムが達成され、すべてゼロで構成されるベクトルによってのみ達成されることが要求されます。 
-
-$$\forall i, [\mathbf{x}]_i = 0 \Leftrightarrow f(\mathbf{x})=0.$$
-
-ノルムは距離の尺度によく似ていることに気付くかもしれません。小学校からのユークリッド距離（ピタゴラスの定理を考えて）を覚えていれば、非否定性と三角不等式の概念が鐘を鳴らすかもしれません。実際、ユークリッド距離はノルムです。具体的には $L_2$ ノルムです。$n$ 次元ベクトル $\mathbf{x}$ の要素が $x_1, \ldots, x_n$ であると仮定します。 
-
-[**$\mathbf{x}$ の $L_2$ *ノルム* は、ベクトル要素の二乗和の平方根です:**] 
-
-(** $\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2},$ドル**) 
-
-$L_2$ の基準では、添字の $2$ が省略されることがよくあります。つまり $\|\mathbf{x}\|$ は $\|\mathbf{x}\|_2$ に相当します。コードでは、ベクトルの $L_2$ ノルムを次のように計算できます。
+The method `norm` calculates the $\ell_2$ norm.
 
 ```{.python .input}
+%%tab mxnet
 u = np.array([3, -4])
 np.linalg.norm(u)
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 u = torch.tensor([3.0, -4.0])
 torch.norm(u)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 u = tf.constant([3.0, -4.0])
 tf.norm(u)
 ```
 
-ディープラーニングでは、$L_2$ の二乗ノルムを使用する頻度が高くなります。 
-
-また、[**the $L_1$ *norm***] もよく見かけますが、これはベクトル要素の絶対値の和で表されます。 
+[**$\ell_1$ norm**] も人気があり、関連するメトリックはマンハッタン距離と呼ばれます。定義上、$\ell_1$ノルムは、ベクトルの要素の絶対値を合計します。 
 
-(** $\|\mathbf{x}\|_1 = \sum_{i=1}^n \left|x_i \right|.$ドル**) 
+(**$$\|\mathbf{x}\|_1 = \sum_{i=1}^n \left|x_i \right|.$$**)
 
-$L_2$ ノルムと比較すると、外れ値の影響は小さくなります。$L_1$ ノルムを計算するために、要素の合計をもつ絶対値関数を作成します。
+$\ell_2$ ノルムと比較して、外れ値に対する感度は低くなります。$\ell_1$ ノルムを計算するために、絶対値を合計演算で構成します。
 
 ```{.python .input}
+%%tab mxnet
 np.abs(u).sum()
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 torch.abs(u).sum()
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 tf.reduce_sum(tf.abs(u))
 ```
 
-$L_2$ ノルムと $L_1$ ノルムはどちらも、より一般的な $L_p$ *ノルム* の特殊なケースです。 
+The $\ell_2$ and $\ell_1$ norms are both special cases of the more general $\ell_p$ *norms*:
 
 $$\|\mathbf{x}\|_p = \left(\sum_{i=1}^n \left|x_i \right|^p \right)^{1/p}.$$
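+
+For the vector $\mathbf{u} = [3, -4]^\top$ used in the code above, the cases $p = 1$ and $p = 2$ can be checked by hand:
+
+$$\|\mathbf{u}\|_1 = |3| + |-4| = 7, \qquad \|\mathbf{u}\|_2 = \sqrt{3^2 + (-4)^2} = \sqrt{25} = 5.$$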
 
-ベクトルのノルム $L_2$ と同様に、[***フロベニウスノルム* の行列 $\mathbf{X} \in \mathbb{R}^{m \times n}$**] は、行列要素の二乗和の平方根です。 
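+In the case of matrices, matters are more complicated. After all, matrices can be viewed both as collections of individual entries
+*and* as objects that operate on vectors and transform them into other vectors.
+For instance, we can ask by how much longer the matrix-vector product $\mathbf{X} \mathbf{v}$ could be relative to $\mathbf{v}$. This line of thought leads to what is called the *spectral* norm. For now, we introduce [**the *Frobenius norm*, which is much easier to compute**] and is defined as the square root of the sum of the squares of a matrix's elements: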
 
-[** $\|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n x_{ij}^2}.$ドル**] 
+[**$$\|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n x_{ij}^2}.$$**] 
 
-Frobenius ノルムは、ベクトルノルムのすべての特性を満たします。行列型のベクトルの $L_2$ ノルムであるかのように動作します。次の関数を呼び出すと、行列のフロベニウスノルムが計算されます。
+フロベニウスノルムは、行列型ベクトルの $\ell_2$ ノルムであるかのように動作します。次の関数を呼び出すと、行列のフロベニウスノルムが計算されます。
 
 ```{.python .input}
+%%tab mxnet
 np.linalg.norm(np.ones((4, 9)))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 torch.norm(torch.ones((4, 9)))
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 tf.norm(tf.ones((4, 9)))
 ```
 
-### 規範と目標
-:label:`subsec_norms_and_objectives`
-
-私たちは自分より先に進みたくはありませんが、なぜこれらの概念が役に立つのかについて、すでにいくつかの直感を植え付けることができます。ディープラーニングでは、最適化問題を解こうとすることがよくあります。
-*観測データに割り当てられる確率を最大化*
+私たちは自分たちより先を行き過ぎたくありませんが、これらの概念がなぜ役立つのかについて、すでに直感を植え付けることができます。ディープラーニングでは、最適化問題を解こうとすることがよくあります。
+*観測データに割り当てられた確率を最大化*。
+*レコメンダーモデルに関連する収益を最大化* 
 *予測間の距離を最小化*
-そしてグラウンドトゥルースの観測。類似するアイテム間の距離が最小化され、異なるアイテム間の距離が最大になるように、アイテム (単語、製品、ニュース記事など) にベクトル表現を割り当てます。多くの場合、(データ以外に) ディープラーニングアルゴリズムの最も重要なコンポーネントである目的は標準として表現されます。 
+そしてグラウンドトゥルースの観察。 
+*表現間の距離を最小化* 
+異なる人物の写真の表現間の距離を*最大化*しながら、同じ人物の写真の。ディープラーニングアルゴリズムの目的を構成するこれらの距離は、しばしば規範として表現されます。  
 
-## 線形代数の詳細
+## ディスカッション
 
-このセクションでは、現代のディープラーニングの驚くべき部分を理解するために必要なすべての線形代数について説明しました。線形代数には他にも多くの機能があり、その数学の多くは機械学習に役立ちます。たとえば、行列を因子に分解することができ、この分解によって実世界のデータセットでは低次元の構造が明らかになります。機械学習には、行列分解とその高次テンソルへの一般化を使用してデータセット内の構造を発見し、予測問題を解決することに重点を置いたサブフィールドがあります。しかし、この本はディープラーニングに焦点を当てています。また、実際のデータセットに有用な機械学習モデルを導入して手を汚すと、より多くの数学を学ぶ傾向が強まると私たちは信じています。したがって、後で数学をさらに紹介する権利を留保しますが、このセクションをここでまとめます。 
+このセクションでは、最新のディープラーニングの注目すべき部分を理解するために必要なすべての線形代数について説明しました。線形代数には他にもたくさんあり、その多くは機械学習に役立ちます。たとえば、行列は因子に分解でき、これらの分解によって実世界のデータセットの低次元構造を明らかにすることができます。データセットの構造を発見し、予測問題を解決するために、行列分解とその高次テンソルへの汎化を使用することに焦点を当てた機械学習のすべてのサブフィールドがあります。しかし、この本はディープラーニングに焦点を当てています。そして、実際のデータセットに機械学習を適用して手を汚したら、もっと数学を学ぶ傾向が強くなると私たちは信じています。したがって、後でさらに数学を導入する権利を留保しますが、このセクションをここでまとめます。 
 
-線形代数についてもっと知りたければ、[online appendix on linear algebraic operations](https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/geometry-linear-algebraic-ops.html) またはその他の優れたリソース :cite:`Strang.1993,Kolter.2008,Petersen.Pedersen.ea.2008` を参照してください。 
+もっと線形代数を学びたいと思っているなら、たくさんの優れた本やオンラインリソースがあります。より高度なクラッシュコースについては、:cite:`Strang.1993,Kolter.2008,Petersen.Pedersen.ea.2008`をチェックすることを検討してください。 
 
-## [概要
+要点をまとめると: 
 
-* スカラー、ベクトル、行列、テンソルは、線形代数の基本的な数学オブジェクトです。
-* ベクトルはスカラーを一般化し、行列はベクトルを一般化します。
-* スカラー、ベクトル、行列、テンソルにはそれぞれ 0、1、2、任意の数の軸があります。
-* 指定した軸に沿って `sum` と `mean` だけテンソルを減らすことができます。
-* 2 つの行列の要素ごとの乗算は、アダマール積と呼ばれます。行列の乗算とは異なります。
-* ディープラーニングでは、$L_1$ ノルム、$L_2$ ノルム、フロベニウスノルムなどのノルムを扱うことがよくあります。
-* スカラー、ベクトル、行列、テンソルに対してさまざまな演算を実行できます。
+* スカラー、ベクトル、行列、テンソルは、線形代数で使用される基本的な数学オブジェクトであり、それぞれ 0、1、2、および任意の数の軸を持ちます。
+* テンソルは、インデックス付け、または `sum` や `mean` などの操作によって、指定された軸に沿ってスライスまたは削減できます。
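+* Elementwise products are called Hadamard products. By contrast, dot products, matrix-vector products, and matrix-matrix products are not elementwise operations and in general return objects that have different shapes than the operands.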
+* アダマール積と比較して、行列-行列積は計算にかなり時間がかかります（二次時間よりも立方時間）。
+* ノルムは、ベクトルの大きさのさまざまな概念を捉え、一般的に2つのベクトルの距離を測定するために2つのベクトルの差に適用されます。
+* Common vector norms include the $\ell_1$ and $\ell_2$ norms, and common matrix norms include the *spectral* and *Frobenius* norms.
 
 ## 演習
 
-1. 行列 $\mathbf{A}$ の転置の転置が $\mathbf{A}$:$(\mathbf{A}^\top)^\top = \mathbf{A}$ であることを証明します。
-1. 2 つの行列 $\mathbf{A}$ と $\mathbf{B}$ が与えられた場合、転置の和が和の転置に等しいことを示します。$\mathbf{A}^\top + \mathbf{B}^\top = (\mathbf{A} + \mathbf{B})^\top$
-1. 正方行列 $\mathbf{A}$ がある場合、$\mathbf{A} + \mathbf{A}^\top$ は常に対称ですか?なぜ？
-1. この節では、形状 (2, 3, 4) のテンソル `X` を定義しました。`len(X)`の出力は何ですか？
-1. 任意の形状のテンソル `X` に対して、`len(X)` は常に `X` の特定の軸の長さに対応しますか？その軸は何ですか？
-1. `A / A.sum(axis=1)` を実行して、何が起こるか見てみましょう。その理由を分析できますか？
-1. マンハッタンの2地点間を移動する場合、座標、つまり道路と道路の観点からカバーする必要がある距離はどれくらいですか？斜めに旅行できますか。
-1. 形状 (2, 3, 4) のテンソルを考えてみます。軸 0、1、2 に沿った加算出力の形状を教えてください。
-1. 3 軸以上のテンソルを関数 `linalg.norm` に送り、その出力を観測します。この関数は任意の形状のテンソルに対して何を計算しますか？
+1. 行列の転置の転置が行列そのものであることを証明する:$(\mathbf{A}^\top)^\top = \mathbf{A}$。
+1. Given two matrices $\mathbf{A}$ and $\mathbf{B}$, show that sum and transposition commute: $\mathbf{A}^\top + \mathbf{B}^\top = (\mathbf{A} + \mathbf{B})^\top$.
+1. $\mathbf{A}$の正方行列があれば、$\mathbf{A} + \mathbf{A}^\top$は常に対称ですか？前の2つの練習の結果だけを使って結果を証明できますか？
+1. このセクションでは、形状 (2、3、4) のテンソル `X` を定義しました。`len(X)`の出力はどれくらいですか？コードを実装せずに回答を記述し、コードを使用して回答を確認します。 
+1. 任意の形状のテンソル`X`の場合、`len(X)`は`X`の特定の軸の長さに常に対応しますか？その軸は何ですか？
+1. `A / A.sum(axis=1)` を実行して、何が起こるかを確認します。その理由を分析できますか？
+1. マンハッタンのダウンタウンの2地点間を移動する場合、座標、つまり大通りや通りの観点からカバーする必要がある距離はどれくらいですか？斜めに旅行できますか？
+1. 形状 (2、3、4) のテンソルを考えてみましょう。軸0、1、および2に沿った合計出力の形状は何ですか？
+1. 3 軸以上のテンソルを `linalg.norm` 関数に送り、その出力を観察します。この関数は任意の形状のテンソルについて何を計算しますか?
+1. Define three large matrices, say $\mathbf{A} \in \mathbb{R}^{2^{10} \times 2^{16}}$, $\mathbf{B} \in \mathbb{R}^{2^{16} \times 2^{5}}$ and $\mathbf{C} \in \mathbb{R}^{2^{5} \times 2^{14}}$, initialized with Gaussian random variables. You want to compute the product $\mathbf{A} \mathbf{B} \mathbf{C}$. Is there any difference in memory footprint and speed, depending on whether you compute $(\mathbf{A} \mathbf{B}) \mathbf{C}$ or $\mathbf{A} (\mathbf{B} \mathbf{C})$? Why?
+1. $\mathbf{A} \in \mathbb{R}^{2^{10} \times 2^{16}}$、$\mathbf{B} \in \mathbb{R}^{2^{16} \times 2^{5}}$、$\mathbf{C} \in \mathbb{R}^{2^{5} \times 2^{16}}$ など、3 つの大きな行列を定義します。$\mathbf{A} \mathbf{B}$と$\mathbf{A} \mathbf{C}^\top$のどちらを計算するかによって、速度に違いはありますか？なぜ？メモリをクローンせずに$\mathbf{C} = \mathbf{B}^\top$を初期化すると何が変わりますか？なぜ？
+1. たとえば $\mathbf{A}, \mathbf{B}, \mathbf{C} \in \mathbb{R}^{100 \times 200}$ という 3 つの行列を定義します。$[\mathbf{A}, \mathbf{B}, \mathbf{C}]$を積み重ねて3軸のテンソルを構成します。次元性って何ですか？3番目の軸の2番目の座標をスライスして、$\mathbf{B}$を回復します。あなたの答えが正しいか確認してください。
 
 :begin_tab:`mxnet`
 [Discussions](https://discuss.d2l.ai/t/30)
diff --git a/chapter_preliminaries/linear-algebra_origin.md b/chapter_preliminaries/linear-algebra_origin.md
new file mode 100644
index 0000000..7404637
--- /dev/null
+++ b/chapter_preliminaries/linear-algebra_origin.md
@@ -0,0 +1,1147 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# Linear Algebra
+:label:`sec_linear-algebra`
+
+By now, we can load datasets into tensors
+and manipulate these tensors 
+with basic mathematical operations.
+To start building sophisticated models,
+we will also need a few tools from linear algebra. 
+This section offers a gentle introduction 
+to the most essential concepts,
+starting from scalar arithmetic
+and ramping up to matrix multiplication.
+
+
+
+## Scalars
+
+
+Most everyday mathematics
+consists of manipulating 
+numbers one at a time.
+Formally, we call these values *scalars*.
+For example, the temperature in Palo Alto 
+is a balmy $72$ degrees Fahrenheit.
+If you wanted to convert the temperature to Celsius
+you would evaluate the expression 
+$c = \frac{5}{9}(f - 32)$, setting $f$ to $72$.
+In this equation, the values 
+$5$, $9$, and $32$ are scalars.
+The variables $c$ and $f$ 
+represent unknown scalars.
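+
+As a quick worked check of this conversion, substituting $f = 72$ gives
+
+$$c = \frac{5}{9}(72 - 32) = \frac{5}{9} \cdot 40 = \frac{200}{9} \approx 22.2,$$
+
+so $72$ degrees Fahrenheit is roughly $22.2$ degrees Celsius.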
+
+We denote scalars
+by ordinary lower-cased letters 
+(e.g., $x$, $y$, and $z$)
+and the space of all (continuous) 
+*real-valued* scalars by $\mathbb{R}$.
+For expedience, we will skip past
+rigorous definitions of *spaces*.
+Just remember that the expression $x \in \mathbb{R}$
+is a formal way to say that $x$ is a real-valued scalar.
+The symbol $\in$ (pronounced "in")
+denotes membership in a set.
+For example, $x, y \in \{0, 1\}$
+indicates that $x$ and $y$ are variables
+that can only take values $0$ or $1$.
+
+(**Scalars are implemented as tensors 
+that contain only one element.**)
+Below, we assign two scalars
+and perform the familiar addition, multiplication,
+division, and exponentiation operations.
+
+```{.python .input}
+%%tab mxnet
+from mxnet import np, npx
+npx.set_np()
+
+x = np.array(3.0)
+y = np.array(2.0)
+
+x + y, x * y, x / y, x ** y
+```
+
+```{.python .input}
+%%tab pytorch
+import torch
+
+x = torch.tensor(3.0)
+y = torch.tensor(2.0)
+
+x + y, x * y, x / y, x**y
+```
+
+```{.python .input}
+%%tab tensorflow
+import tensorflow as tf
+
+x = tf.constant(3.0)
+y = tf.constant(2.0)
+
+x + y, x * y, x / y, x**y
+```
+
+## Vectors
+
+For our purposes, [**you can think of vectors
+as fixed-length arrays of scalars.**]
+As with their code counterparts,
+we call these values the *elements* of the vector
+(synonyms include *entries* and *components*).
+When vectors represent examples from real-world datasets,
+their values hold some real-world significance.
+For example, if we were training a model to predict
+the risk of a loan defaulting,
+we might associate each applicant with a vector
+whose components correspond to quantities
+like their income, length of employment, 
+or number of previous defaults.
+If we were studying heart attack risk,
+each vector might represent a patient
+and its components might correspond to
+their most recent vital signs, cholesterol levels, 
+minutes of exercise per day, etc.
+We denote vectors by bold lowercase letters, 
+(e.g., $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$).
+
+Vectors are implemented as $1^{\mathrm{st}}$-order tensors.
+In general, such tensors can have arbitrary lengths,
+subject to memory limitations. Caution: in Python, like in most programming languages, vector indices start at $0$, also known as *zero-based indexing*, whereas in linear algebra subscripts begin at $1$ (one-based indexing).
+
+```{.python .input}
+%%tab mxnet
+x = np.arange(3)
+x
+```
+
+```{.python .input}
+%%tab pytorch
+x = torch.arange(3)
+x
+```
+
+```{.python .input}
+%%tab tensorflow
+x = tf.range(3)
+x
+```
+
+We can refer to an element of a vector by using a subscript.
+For example, $x_2$ denotes the second element of $\mathbf{x}$. 
+Since $x_2$ is a scalar, we do not bold it.
+By default, we visualize vectors 
+by stacking their elements vertically.
+
+$$\mathbf{x} =\begin{bmatrix}x_{1}  \\ \vdots  \\x_{n}\end{bmatrix},$$
+:eqlabel:`eq_vec_def`
+
+Here $x_1, \ldots, x_n$ are elements of the vector.
+Later on, we will distinguish between such *column vectors*
+and *row vectors* whose elements are stacked horizontally.
+Recall that [**we access a tensor's elements via indexing.**]
+
+```{.python .input}
+%%tab mxnet
+x[2]
+```
+
+```{.python .input}
+%%tab pytorch
+x[2]
+```
+
+```{.python .input}
+%%tab tensorflow
+x[2]
+```
+
+To indicate that a vector contains $n$ elements,
+we write $\mathbf{x} \in \mathbb{R}^n$.
+Formally, we call $n$ the *dimensionality* of the vector.
+[**In code, this corresponds to the tensor's length**],
+accessible via Python's built-in `len` function.
+
+```{.python .input}
+%%tab mxnet
+len(x)
+```
+
+```{.python .input}
+%%tab pytorch
+len(x)
+```
+
+```{.python .input}
+%%tab tensorflow
+len(x)
+```
+
+We can also access the length via the `shape` attribute.
+The shape is a tuple that indicates a tensor's length along each axis.
+(**Tensors with just one axis have shapes with just one element.**)
+
+```{.python .input}
+%%tab mxnet
+x.shape
+```
+
+```{.python .input}
+%%tab pytorch
+x.shape
+```
+
+```{.python .input}
+%%tab tensorflow
+x.shape
+```
+
+Oftentimes, the word "dimension" gets overloaded
+to mean both the number of axes 
+and the length along a particular axis.
+To avoid this confusion, 
+we use *order* to refer to the number of axes
+and *dimensionality* exclusively to refer 
+to the number of components.
+
+
+## Matrices
+
+Just as scalars are $0^{\mathrm{th}}$-order tensors
+and vectors are $1^{\mathrm{st}}$-order tensors,
+matrices are $2^{\mathrm{nd}}$-order tensors.
+We denote matrices by bold capital letters
+(e.g., $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$),
+and represent them in code by tensors with two axes.
+The expression $\mathbf{A} \in \mathbb{R}^{m \times n}$
+indicates that a matrix $\mathbf{A}$ 
+contains $m \times n$ real-valued scalars,
+arranged as $m$ rows and $n$ columns.
+When $m = n$, we say that a matrix is *square*.
+Visually, we can illustrate any matrix as a table.
+To refer to an individual element,
+we subscript both the row and column indices, e.g.,
+$a_{ij}$ is the value that belongs to $\mathbf{A}$'s
+$i^{\mathrm{th}}$ row and $j^{\mathrm{th}}$ column:
+
+$$\mathbf{A}=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \\ \end{bmatrix}.$$
+:eqlabel:`eq_matrix_def`
+
+
+In code, we represent a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$
+by a $2^{\mathrm{nd}}$-order tensor with shape ($m$, $n$).
+[**We can convert any tensor that contains exactly $mn$ elements
+into an $m \times n$ matrix**] 
+by passing the desired shape to `reshape`:
+
+```{.python .input}
+%%tab mxnet
+A = np.arange(6).reshape(3, 2)
+A
+```
+
+```{.python .input}
+%%tab pytorch
+A = torch.arange(6).reshape(3, 2)
+A
+```
+
+```{.python .input}
+%%tab tensorflow
+A = tf.reshape(tf.range(6), (3, 2))
+A
+```
+
+Sometimes, we want to flip the axes.
+When we exchange a matrix's rows and columns,
+the result is called its *transpose*.
+Formally, we signify a matrix $\mathbf{A}$'s transpose 
+by $\mathbf{A}^\top$ and if $\mathbf{B} = \mathbf{A}^\top$, 
+then $b_{ij} = a_{ji}$ for all $i$ and $j$.
+Thus, the transpose of an $m \times n$ matrix 
+is an $n \times m$ matrix:
+
+$$
+\mathbf{A}^\top =
+\begin{bmatrix}
+    a_{11} & a_{21} & \dots  & a_{m1} \\
+    a_{12} & a_{22} & \dots  & a_{m2} \\
+    \vdots & \vdots & \ddots  & \vdots \\
+    a_{1n} & a_{2n} & \dots  & a_{mn}
+\end{bmatrix}.
+$$
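+
+For instance, for the $3 \times 2$ matrix built with `reshape` above, the transpose is the $2 \times 3$ matrix obtained by turning rows into columns:
+
+$$
+\mathbf{A} = \begin{bmatrix} 0 & 1 \\ 2 & 3 \\ 4 & 5 \end{bmatrix},
+\qquad
+\mathbf{A}^\top = \begin{bmatrix} 0 & 2 & 4 \\ 1 & 3 & 5 \end{bmatrix}.
+$$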
+
+In code, we can access any (**matrix's transpose**) as follows:
+
+```{.python .input}
+%%tab mxnet
+A.T
+```
+
+```{.python .input}
+%%tab pytorch
+A.T
+```
+
+```{.python .input}
+%%tab tensorflow
+tf.transpose(A)
+```
+
+[**Symmetric matrices are the subset of square matrices
+that are equal to their own transposes:
+$\mathbf{A} = \mathbf{A}^\top$.**]
+The following matrix is symmetric:
+
+```{.python .input}
+%%tab mxnet
+A = np.array([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
+A == A.T
+```
+
+```{.python .input}
+%%tab pytorch
+A = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
+A == A.T
+```
+
+```{.python .input}
+%%tab tensorflow
+A = tf.constant([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
+A == tf.transpose(A)
+```
+
+Matrices are useful for representing datasets. 
+Typically, rows correspond to individual records
+and columns correspond to distinct attributes.
+
+
+
+## Tensors
+
+While you can go far in your machine learning journey
+with only scalars, vectors, and matrices,
+eventually you may need to work with 
+higher-order [**tensors**].
+Tensors (**give us a generic way to describe 
+extensions to $n^{\mathrm{th}}$-order arrays.**)
+We call software objects of the *tensor class* "tensors"
+precisely because they too can have arbitrary numbers of axes.
+While it may be confusing to use the word
+*tensor* for both the mathematical object
+and its realization in code,
+our meaning should usually be clear from context.
+We denote general tensors by capital letters 
+with a special font face
+(e.g., $\mathsf{X}$, $\mathsf{Y}$, and $\mathsf{Z}$)
+and their indexing mechanism 
+(e.g., $x_{ijk}$ and $[\mathsf{X}]_{1, 2i-1, 3}$) 
+follows naturally from that of matrices.
+
+Tensors will become more important 
+when we start working with images.
+Each image arrives as a $3^{\mathrm{rd}}$-order tensor
+with axes corresponding to the height, width, and *channel*.
+At each spatial location, the intensities 
+of each color (red, green, and blue)
+are stacked along the channel. 
+Moreover a collection of images is represented 
+in code by a $4^{\mathrm{th}}$-order tensor,
+where distinct images are indexed
+along the first axis.
+Higher-order tensors are constructed analogously 
+to vectors and matrices,
+by growing the number of shape components.
+
+```{.python .input}
+%%tab mxnet
+np.arange(24).reshape(2, 3, 4)
+```
+
+```{.python .input}
+%%tab pytorch
+torch.arange(24).reshape(2, 3, 4)
+```
+
+```{.python .input}
+%%tab tensorflow
+tf.reshape(tf.range(24), (2, 3, 4))
+```
+
+## Basic Properties of Tensor Arithmetic
+
+Scalars, vectors, matrices, 
+and higher-order tensors
+all have some handy properties. 
+For example, elementwise operations
+produce outputs that have the 
+same shape as their operands.
+
+```{.python .input}
+%%tab mxnet
+A = np.arange(6).reshape(2, 3)
+B = A.copy()  # Assign a copy of `A` to `B` by allocating new memory
+A, A + B
+```
+
+```{.python .input}
+%%tab pytorch
+A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
+B = A.clone()  # Assign a copy of `A` to `B` by allocating new memory
+A, A + B
+```
+
+```{.python .input}
+%%tab tensorflow
+A = tf.reshape(tf.range(6, dtype=tf.float32), (2, 3))
+B = A  # No copy is made: `B` refers to the same (immutable) tensor as `A`
+A, A + B
+```
+
+The [**elementwise product of two matrices
+is called their *Hadamard product***] (denoted $\odot$).
+Below, we spell out the entries 
+of the Hadamard product of two matrices 
+$\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}$:
+
+
+
+$$
+\mathbf{A} \odot \mathbf{B} =
+\begin{bmatrix}
+    a_{11}  b_{11} & a_{12}  b_{12} & \dots  & a_{1n}  b_{1n} \\
+    a_{21}  b_{21} & a_{22}  b_{22} & \dots  & a_{2n}  b_{2n} \\
+    \vdots & \vdots & \ddots & \vdots \\
+    a_{m1}  b_{m1} & a_{m2}  b_{m2} & \dots  & a_{mn}  b_{mn}
+\end{bmatrix}.
+$$
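+
+As a concrete instance, take $\mathbf{A} = \mathbf{B}$ to be the $2 \times 3$ matrix with entries $0, \ldots, 5$ defined in the previous code cell. Multiplying entry by entry gives
+
+$$
+\mathbf{A} \odot \mathbf{B} =
+\begin{bmatrix} 0 \cdot 0 & 1 \cdot 1 & 2 \cdot 2 \\ 3 \cdot 3 & 4 \cdot 4 & 5 \cdot 5 \end{bmatrix} =
+\begin{bmatrix} 0 & 1 & 4 \\ 9 & 16 & 25 \end{bmatrix}.
+$$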
+
+```{.python .input}
+%%tab mxnet
+A * B
+```
+
+```{.python .input}
+%%tab pytorch
+A * B
+```
+
+```{.python .input}
+%%tab tensorflow
+A * B
+```
+
+[**Adding or multiplying a scalar and a tensor**] produces a result
+with the same shape as the original tensor.
+Here, each element of the tensor is added to (or multiplied by) the scalar.
+
+```{.python .input}
+%%tab mxnet
+a = 2
+X = np.arange(24).reshape(2, 3, 4)
+a + X, (a * X).shape
+```
+
+```{.python .input}
+%%tab pytorch
+a = 2
+X = torch.arange(24).reshape(2, 3, 4)
+a + X, (a * X).shape
+```
+
+```{.python .input}
+%%tab tensorflow
+a = 2
+X = tf.reshape(tf.range(24), (2, 3, 4))
+a + X, (a * X).shape
+```
+
+## Reduction
+:label:`subsec_lin-alg-reduction`
+
+Often, we wish to calculate [**the sum of a tensor's elements.**]
+To express the sum of the elements in a vector $\mathbf{x}$ of length $n$,
+we write $\sum_{i=1}^n x_i$. There's a simple function for it:
+
+```{.python .input}
+%%tab mxnet
+x = np.arange(3)
+x, x.sum()
+```
+
+```{.python .input}
+%%tab pytorch
+x = torch.arange(3, dtype=torch.float32)
+x, x.sum()
+```
+
+```{.python .input}
+%%tab tensorflow
+x = tf.range(3, dtype=tf.float32)
+x, tf.reduce_sum(x)
+```
+
+To express [**sums over the elements of tensors of arbitrary shape**],
+we simply sum over all of its axes. 
+For example, the sum of the elements 
+of an $m \times n$ matrix $\mathbf{A}$ 
+could be written $\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}$.
+
+```{.python .input}
+%%tab mxnet
+A.shape, A.sum()
+```
+
+```{.python .input}
+%%tab pytorch
+A.shape, A.sum()
+```
+
+```{.python .input}
+%%tab tensorflow
+A.shape, tf.reduce_sum(A)
+```
+
+By default, invoking the sum function
+*reduces* a tensor along all of its axes,
+eventually producing a scalar.
+Our libraries also allow us to [**specify the axes 
+along which the tensor should be reduced.**]
+To sum over all elements along the rows (axis 0),
+we specify `axis=0` in `sum`.
+Since the input matrix reduces along axis 0
+to generate the output vector,
+this axis is missing from the shape of the output.
+
+```{.python .input}
+%%tab mxnet
+A.shape, A.sum(axis=0).shape
+```
+
+```{.python .input}
+%%tab pytorch
+A.shape, A.sum(axis=0).shape
+```
+
+```{.python .input}
+%%tab tensorflow
+A.shape, tf.reduce_sum(A, axis=0).shape
+```
+
+Specifying `axis=1` will reduce the column dimension (axis 1) by summing up elements of all the columns.
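+
+To make the effect of the axis argument concrete, consider the $2 \times 3$ matrix $\mathbf{A}$ with entries $0, \ldots, 5$ (assumed here to be the matrix defined in the previous section). Summing along axis 0 collapses the rows and leaves one entry per column, whereas summing along axis 1 collapses the columns and leaves one entry per row:
+
+$$
+\mathbf{A} = \begin{bmatrix} 0 & 1 & 2 \\ 3 & 4 & 5 \end{bmatrix},
+\qquad
+\Big(\sum_{i=1}^{2} a_{i1}, \sum_{i=1}^{2} a_{i2}, \sum_{i=1}^{2} a_{i3}\Big) = (3, 5, 7),
+\qquad
+\Big(\sum_{j=1}^{3} a_{1j}, \sum_{j=1}^{3} a_{2j}\Big) = (3, 12).
+$$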
+
+```{.python .input}
+%%tab mxnet
+A.shape, A.sum(axis=1).shape
+```
+
+```{.python .input}
+%%tab pytorch
+A.shape, A.sum(axis=1).shape
+```
+
+```{.python .input}
+%%tab tensorflow
+A.shape, tf.reduce_sum(A, axis=1).shape
+```
+
+Reducing a matrix along both rows and columns via summation
+is equivalent to summing up all the elements of the matrix.
+
+```{.python .input}
+%%tab mxnet
+A.sum(axis=[0, 1]) == A.sum() # Same as `A.sum()`
+```
+
+```{.python .input}
+%%tab pytorch
+A.sum(axis=[0, 1]) == A.sum() # Same as `A.sum()`
+```
+
+```{.python .input}
+%%tab tensorflow
+tf.reduce_sum(A, axis=[0, 1]), tf.reduce_sum(A) # Same as `tf.reduce_sum(A)`
+```
+
+[**A related quantity is the *mean*, also called the *average*.**]
+We calculate the mean by dividing the sum 
+by the total number of elements.
+Because computing the mean is so common,
+it gets a dedicated library function 
+that works analogously to `sum`.
+
+```{.python .input}
+%%tab mxnet
+A.mean(), A.sum() / A.size
+```
+
+```{.python .input}
+%%tab pytorch
+A.mean(), A.sum() / A.numel()
+```
+
+```{.python .input}
+%%tab tensorflow
+tf.reduce_mean(A), tf.reduce_sum(A) / tf.size(A).numpy()
+```
+
+Likewise, the function for calculating the mean 
+can also reduce a tensor along specific axes.
+
+```{.python .input}
+%%tab mxnet
+A.mean(axis=0), A.sum(axis=0) / A.shape[0]
+```
+
+```{.python .input}
+%%tab pytorch
+A.mean(axis=0), A.sum(axis=0) / A.shape[0]
+```
+
+```{.python .input}
+%%tab tensorflow
+tf.reduce_mean(A, axis=0), tf.reduce_sum(A, axis=0) / A.shape[0]
+```
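
As a quick sanity check of the relationship between `mean` and `sum`, here is a small sketch in plain NumPy (used as a stand-in for the frameworks, which is an assumption of this example). The variable names `overall_mean` and `column_means` are hypothetical.

```python
import numpy as np

A = np.arange(6, dtype=np.float32).reshape(2, 3)

# Mean over all elements: total sum divided by the number of elements.
overall_mean = A.sum() / A.size

# Mean along axis 0: column sums divided by the number of rows.
column_means = A.sum(axis=0) / A.shape[0]

print(np.isclose(A.mean(), overall_mean))
print(np.allclose(A.mean(axis=0), column_means))
```

Note that the divisor for a mean along one axis is the length of that axis, not the total number of elements.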
+
+## Non-Reduction Sum
+:label:`subsec_lin-alg-non-reduction`
+
+Sometimes it can be useful to [**keep the number of axes unchanged**]
+when invoking the function for calculating the sum or mean. 
+This matters when we want to use the broadcast mechanism.
+
+```{.python .input}
+%%tab mxnet
+sum_A = A.sum(axis=1, keepdims=True)
+sum_A, sum_A.shape
+```
+
+```{.python .input}
+%%tab pytorch
+sum_A = A.sum(axis=1, keepdims=True)
+sum_A, sum_A.shape
+```
+
+```{.python .input}
+%%tab tensorflow
+sum_A = tf.reduce_sum(A, axis=1, keepdims=True)
+sum_A, sum_A.shape
+```
+
+For instance, since `sum_A` keeps its two axes after summing each row,
+we can (**divide `A` by `sum_A` with broadcasting**) 
+to create a matrix where each row sums up to $1$.
+
+```{.python .input}
+%%tab mxnet
+A / sum_A
+```
+
+```{.python .input}
+%%tab pytorch
+A / sum_A
+```
+
+```{.python .input}
+%%tab tensorflow
+A / sum_A
+```
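
To see why keeping the axis matters, the sketch below repeats the row normalization in plain NumPy (an assumption for illustration; the frameworks above follow the same broadcasting idea) and inspects the shapes involved. The name `normalized` is introduced only for this example.

```python
import numpy as np

A = np.arange(6, dtype=np.float32).reshape(2, 3)

# keepdims=True keeps the summed axis with length 1, so the result has shape (2, 1).
sum_A = A.sum(axis=1, keepdims=True)

# Broadcasting stretches the length-1 axis across the 3 columns of A.
normalized = A / sum_A

print(sum_A.shape)             # (2, 1)
print(normalized.sum(axis=1))  # each row sums to 1, up to rounding
```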
+
+If we want to calculate [**the cumulative sum of elements of `A` along some axis**],
+say `axis=0` (row by row), we can call the `cumsum` function.
+By design, this function does not reduce the input tensor along any axis.
+
+```{.python .input}
+%%tab mxnet
+A.cumsum(axis=0)
+```
+
+```{.python .input}
+%%tab pytorch
+A.cumsum(axis=0)
+```
+
+```{.python .input}
+%%tab tensorflow
+tf.cumsum(A, axis=0)
+```
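
The following sketch, again in plain NumPy as an assumed stand-in, illustrates two properties of the cumulative sum: the output keeps the shape of the input, and its last slice along the chosen axis equals the ordinary sum along that axis. The name `running` is hypothetical.

```python
import numpy as np

A = np.arange(6, dtype=np.float32).reshape(2, 3)

# Running total down the rows: row i holds the sum of rows 0..i.
running = A.cumsum(axis=0)

print(running.shape == A.shape)                      # the shape is unchanged
print(np.array_equal(running[-1], A.sum(axis=0)))    # last row matches the reduction
```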
+
+## Dot Products
+
+So far, we have only performed elementwise operations, sums, and averages. 
+If this were all we could do, linear algebra 
+would not deserve its own section.
+Fortunately, this is where things get more interesting.
+One of the most fundamental operations is the dot product.
+Given two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$,
+their *dot product* $\mathbf{x}^\top \mathbf{y}$ (or $\langle \mathbf{x}, \mathbf{y}  \rangle$) 
+is a sum over the products of the elements at the same position: 
+$\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{d} x_i y_i$.
+
+[~~The *dot product* of two vectors is a sum over the products of the elements at the same position~~]
+
+```{.python .input}
+%%tab mxnet
+y = np.ones(3)
+x, y, np.dot(x, y)
+```
+
+```{.python .input}
+%%tab pytorch
+y = torch.ones(3, dtype=torch.float32)
+x, y, torch.dot(x, y)
+```
+
+```{.python .input}
+%%tab tensorflow
+y = tf.ones(3, dtype=tf.float32)
+x, y, tf.tensordot(x, y, axes=1)
+```
+
+Equivalently, (**we can calculate the dot product of two vectors 
+by performing an elementwise multiplication followed by a sum:**)
+
+```{.python .input}
+%%tab mxnet
+np.sum(x * y)
+```
+
+```{.python .input}
+%%tab pytorch
+torch.sum(x * y)
+```
+
+```{.python .input}
+%%tab tensorflow
+tf.reduce_sum(x * y)
+```
+
+Dot products are useful in a wide range of contexts.
+For example, given some set of values,
+denoted by a vector $\mathbf{x}  \in \mathbb{R}^n$
+and a set of weights denoted by $\mathbf{w} \in \mathbb{R}^n$,
+the weighted sum of the values in $\mathbf{x}$
+according to the weights $\mathbf{w}$
+could be expressed as the dot product $\mathbf{x}^\top \mathbf{w}$.
+When the weights are non-negative
+and sum to one, i.e., $\left(\sum_{i=1}^{n} {w_i} = 1\right)$,
+the dot product expresses a *weighted average*.
+After normalizing two vectors to have unit length,
+the dot products express the cosine of the angle between them.
+Later in this section, we will formally introduce this notion of *length*.
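
Both interpretations can be tried out numerically. The sketch below uses plain NumPy (an assumption for illustration) with made-up values: the weights `w` are chosen to be nonnegative and to sum to one, and the vectors `u` and `v` are chosen so that the angle between them is 45 degrees.

```python
import numpy as np

# Weighted average: weights are nonnegative and sum to one.
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.2, 0.3, 0.5])
weighted_avg = np.dot(x, w)

# Cosine of the angle: normalize both vectors to unit length, then take the dot product.
u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])
cosine = np.dot(u / np.linalg.norm(u), v / np.linalg.norm(v))

print(weighted_avg)  # matches np.average(x, weights=w)
print(cosine)        # cos(45 degrees), roughly 0.7071
```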
+
+
+## Matrix-Vector Products
+
+Now that we know how to calculate dot products,
+we can begin to understand the *product*
+between an $m \times n$ matrix $\mathbf{A}$ 
+and an $n$-dimensional vector $\mathbf{x}$.
+To start off, we visualize our matrix
+in terms of its row vectors
+
+$$\mathbf{A}=
+\begin{bmatrix}
+\mathbf{a}^\top_{1} \\
+\mathbf{a}^\top_{2} \\
+\vdots \\
+\mathbf{a}^\top_m \\
+\end{bmatrix},$$
+
+where each $\mathbf{a}^\top_{i} \in \mathbb{R}^n$
+is a row vector representing the $i^\mathrm{th}$ row 
+of the matrix $\mathbf{A}$.
+
+[**The matrix-vector product $\mathbf{A}\mathbf{x}$
+is simply a column vector of length $m$,
+whose $i^\mathrm{th}$ element is the dot product 
+$\mathbf{a}^\top_i \mathbf{x}$:**]
+
+$$
+\mathbf{A}\mathbf{x}
+= \begin{bmatrix}
+\mathbf{a}^\top_{1} \\
+\mathbf{a}^\top_{2} \\
+\vdots \\
+\mathbf{a}^\top_m \\
+\end{bmatrix}\mathbf{x}
+= \begin{bmatrix}
+ \mathbf{a}^\top_{1} \mathbf{x}  \\
+ \mathbf{a}^\top_{2} \mathbf{x} \\
+\vdots\\
+ \mathbf{a}^\top_{m} \mathbf{x}\\
+\end{bmatrix}.
+$$
+
+We can think of multiplication with a matrix
+$\mathbf{A}\in \mathbb{R}^{m \times n}$
+as a transformation that projects vectors
+from $\mathbb{R}^{n}$ to $\mathbb{R}^{m}$.
+These transformations are remarkably useful.
+For example, we can represent rotations
+as multiplications by certain square matrices.
+Matrix-vector products also describe 
+the key calculation involved in computing
+the outputs of each layer in a neural network
+given the outputs from the previous layer.
+
+:begin_tab:`mxnet`
+To express a matrix-vector product in code,
+we use the same `dot` function.
+The operation is inferred 
+based on the type of the arguments.
+Note that the column dimension of `A` 
+(its length along axis 1)
+must be the same as the dimension of `x` (its length).
+:end_tab:
+
+:begin_tab:`pytorch`
+To express a matrix-vector product in code,
+we use the `mv` function. 
+Note that the column dimension of `A` 
+(its length along axis 1)
+must be the same as the dimension of `x` (its length). 
+PyTorch has a convenience operator `@` 
+that can execute both matrix-vector
+and matrix-matrix products
+(depending on its arguments). 
+Thus we can write `A@x`.
+:end_tab:
+
+:begin_tab:`tensorflow`
+To express a matrix-vector product in code,
+we use the `matvec` function. 
+Note that the column dimension of `A` 
+(its length along axis 1)
+must be the same as the dimension of `x` (its length).
+:end_tab:
+
+```{.python .input}
+%%tab mxnet
+A.shape, x.shape, np.dot(A, x)
+```
+
+```{.python .input}
+%%tab pytorch
+A.shape, x.shape, torch.mv(A, x), A@x
+```
+
+```{.python .input}
+%%tab tensorflow
+A.shape, x.shape, tf.linalg.matvec(A, x)
+```
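
The row-wise definition above can be verified directly. This is a minimal sketch in plain NumPy (an assumed stand-in for the frameworks) that builds the product one dot product at a time and compares it with the built-in operator. The name `row_by_row` is hypothetical.

```python
import numpy as np

A = np.arange(6, dtype=np.float32).reshape(2, 3)  # m = 2 rows, n = 3 columns
x = np.arange(3, dtype=np.float32)                # length n = 3

# The i-th entry of Ax is the dot product of the i-th row of A with x.
row_by_row = np.array([np.dot(A[i], x) for i in range(A.shape[0])])

print(row_by_row)                       # [ 5. 14.]
print(np.allclose(row_by_row, A @ x))   # agrees with the built-in product
```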
+
+## Matrix-Matrix Multiplication
+
+If you've gotten the hang of dot products and matrix-vector products,
+then *matrix-matrix multiplication* should be straightforward.
+
+Say that we have two matrices 
+$\mathbf{A} \in \mathbb{R}^{n \times k}$ 
+and $\mathbf{B} \in \mathbb{R}^{k \times m}$:
+
+$$\mathbf{A}=\begin{bmatrix}
+ a_{11} & a_{12} & \cdots & a_{1k} \\
+ a_{21} & a_{22} & \cdots & a_{2k} \\
+\vdots & \vdots & \ddots & \vdots \\
+ a_{n1} & a_{n2} & \cdots & a_{nk} \\
+\end{bmatrix},\quad
+\mathbf{B}=\begin{bmatrix}
+ b_{11} & b_{12} & \cdots & b_{1m} \\
+ b_{21} & b_{22} & \cdots & b_{2m} \\
+\vdots & \vdots & \ddots & \vdots \\
+ b_{k1} & b_{k2} & \cdots & b_{km} \\
+\end{bmatrix}.$$
+
+
+Let $\mathbf{a}^\top_{i} \in \mathbb{R}^k$ denote 
+the row vector representing the $i^\mathrm{th}$ row 
+of the matrix $\mathbf{A}$
+and let $\mathbf{b}_{j} \in \mathbb{R}^k$ denote 
+the column vector from the $j^\mathrm{th}$ column 
+of the matrix $\mathbf{B}$:
+
+$$\mathbf{A}=
+\begin{bmatrix}
+\mathbf{a}^\top_{1} \\
+\mathbf{a}^\top_{2} \\
+\vdots \\
+\mathbf{a}^\top_n \\
+\end{bmatrix},
+\quad \mathbf{B}=\begin{bmatrix}
+ \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\
+\end{bmatrix}.
+$$
+
+
+To form the matrix product $\mathbf{C} \in \mathbb{R}^{n \times m}$,
+we simply compute each element $c_{ij}$
+as the dot product between 
+the $i^{\mathrm{th}}$ row of $\mathbf{A}$
+and the $j^{\mathrm{th}}$ column of $\mathbf{B}$,
+i.e., $\mathbf{a}^\top_i \mathbf{b}_j$:
+
+$$\mathbf{C} = \mathbf{AB} = \begin{bmatrix}
+\mathbf{a}^\top_{1} \\
+\mathbf{a}^\top_{2} \\
+\vdots \\
+\mathbf{a}^\top_n \\
+\end{bmatrix}
+\begin{bmatrix}
+ \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\
+\end{bmatrix}
+= \begin{bmatrix}
+\mathbf{a}^\top_{1} \mathbf{b}_1 & \mathbf{a}^\top_{1}\mathbf{b}_2& \cdots & \mathbf{a}^\top_{1} \mathbf{b}_m \\
+ \mathbf{a}^\top_{2}\mathbf{b}_1 & \mathbf{a}^\top_{2} \mathbf{b}_2 & \cdots & \mathbf{a}^\top_{2} \mathbf{b}_m \\
+ \vdots & \vdots & \ddots &\vdots\\
+\mathbf{a}^\top_{n} \mathbf{b}_1 & \mathbf{a}^\top_{n}\mathbf{b}_2& \cdots& \mathbf{a}^\top_{n} \mathbf{b}_m
+\end{bmatrix}.
+$$
+
+[**We can think of the matrix-matrix multiplication $\mathbf{AB}$
+as performing $m$ matrix-vector products 
+or $m \times n$ dot products 
+and stitching the results together 
+to form an $n \times m$ matrix.**]
+In the following snippet, 
+we perform matrix multiplication on `A` and `B`.
+Here, `A` is a matrix with 2 rows and 3 columns,
+and `B` is a matrix with 3 rows and 4 columns.
+After multiplication, we obtain a matrix with 2 rows and 4 columns.
+
+```{.python .input}
+%%tab mxnet
+B = np.ones(shape=(3, 4))
+np.dot(A, B)
+```
+
+```{.python .input}
+%%tab pytorch
+B = torch.ones(3, 4)
+torch.mm(A, B), A@B
+```
+
+```{.python .input}
+%%tab tensorflow
+B = tf.ones((3, 4), tf.float32)
+tf.matmul(A, B)
+```
+
+The term *matrix-matrix multiplication* is 
+often simplified to *matrix multiplication*,
+and should not be confused with the Hadamard product.
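
To make the distinction concrete, the sketch below (plain NumPy, used here as an assumption for illustration) fills in $\mathbf{C}$ entry by entry from row-column dot products, compares it with the built-in matrix product, and contrasts it with an elementwise (Hadamard) product. The loop is written for clarity, not speed.

```python
import numpy as np

A = np.arange(6, dtype=np.float32).reshape(2, 3)  # n = 2, k = 3
B = np.ones((3, 4), dtype=np.float32)             # k = 3, m = 4

n, m = A.shape[0], B.shape[1]
C = np.empty((n, m), dtype=np.float32)
for i in range(n):
    for j in range(m):
        # c_ij is the dot product of row i of A with column j of B.
        C[i, j] = np.dot(A[i, :], B[:, j])

# The Hadamard product is elementwise and keeps the operand shape.
hadamard = A * A

print(C.shape, np.allclose(C, A @ B))  # (2, 4) True
print(hadamard.shape)                  # (2, 3)
```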
+
+
+## Norms
+:label:`subsec_lin-algebra-norms`
+
+Some of the most useful operators in linear algebra are *norms*.
+Informally, the norm of a vector tells us how *big* it is. 
+For instance, the $\ell_2$ norm measures
+the (Euclidean) length of a vector.
+Here, we are employing a notion of *size* that concerns the magnitude of a vector's components
+(not its dimensionality). 
+
+A norm is a function $\| \cdot \|$ that maps a vector
+to a scalar and satisfies the following three properties:
+
+1. Given any vector $\mathbf{x}$, if we scale (all elements of) the vector 
+   by a scalar $\alpha \in \mathbb{R}$, its norm scales accordingly:
+   $$\|\alpha \mathbf{x}\| = |\alpha| \|\mathbf{x}\|.$$
+2. For any vectors $\mathbf{x}$ and $\mathbf{y}$,
+   norms satisfy the triangle inequality:
+   $$\|\mathbf{x} + \mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|.$$
+3. The norm of a vector is nonnegative and it only vanishes if the vector is zero:
+   $$\|\mathbf{x}\| > 0 \text{ for all } \mathbf{x} \neq 0.$$
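
The three properties can be spot-checked numerically for the Euclidean norm. The sketch below uses plain NumPy (an assumption for illustration) with random vectors and an arbitrary scalar `alpha`; a numerical check on a few vectors is of course not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
alpha = -2.5
norm = np.linalg.norm  # Euclidean norm for 1-D inputs

# 1. Scaling a vector scales its norm by |alpha|.
homogeneous = np.isclose(norm(alpha * x), abs(alpha) * norm(x))

# 2. Triangle inequality (a tiny tolerance guards against rounding).
triangle = norm(x + y) <= norm(x) + norm(y) + 1e-12

# 3. Positive for a nonzero vector, zero for the zero vector.
positive = norm(x) > 0 and norm(np.zeros(5)) == 0

print(homogeneous, triangle, positive)
```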
+
+Many functions are valid norms and different norms 
+encode different notions of size. 
+The Euclidean norm that we all learned in elementary school geometry
+when calculating the hypotenuse of a right triangle
+is the square root of the sum of squares of a vector's elements.
+Formally, this is called [**the $\ell_2$ *norm***] and expressed as
+
+(**$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}.$$**)
+
+The method `norm` calculates the $\ell_2$ norm.
+
+```{.python .input}
+%%tab mxnet
+u = np.array([3, -4])
+np.linalg.norm(u)
+```
+
+```{.python .input}
+%%tab pytorch
+u = torch.tensor([3.0, -4.0])
+torch.norm(u)
+```
+
+```{.python .input}
+%%tab tensorflow
+u = tf.constant([3.0, -4.0])
+tf.norm(u)
+```
+
+[**The $\ell_1$ norm**] is also popular 
+and the associated metric is called the Manhattan distance. 
+By definition, the $\ell_1$ norm sums 
+the absolute values of a vector's elements:
+
+(**$$\|\mathbf{x}\|_1 = \sum_{i=1}^n \left|x_i \right|.$$**)
+
+Compared to the $\ell_2$ norm, it is less sensitive to outliers.
+To compute the $\ell_1$ norm, 
+we compose the absolute value
+with the sum operation.
+
+```{.python .input}
+%%tab mxnet
+np.abs(u).sum()
+```
+
+```{.python .input}
+%%tab pytorch
+torch.abs(u).sum()
+```
+
+```{.python .input}
+%%tab tensorflow
+tf.reduce_sum(tf.abs(u))
+```
+
+Both the $\ell_2$ and $\ell_1$ norms are special cases
+of the more general $\ell_p$ *norms*:
+
+$$\|\mathbf{x}\|_p = \left(\sum_{i=1}^n \left|x_i \right|^p \right)^{1/p}.$$
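
The general formula translates almost literally into code. The helper `lp_norm` below is a hypothetical function written for this example in plain NumPy (an assumption for illustration); it recovers the $\ell_1$ and $\ell_2$ values computed above for the vector $(3, -4)$.

```python
import numpy as np

def lp_norm(x, p):
    """Compute (sum_i |x_i|^p)^(1/p) for p >= 1."""
    return (np.abs(x) ** p).sum() ** (1.0 / p)

u = np.array([3.0, -4.0])

print(lp_norm(u, 1))  # sum of absolute values: 7.0
print(lp_norm(u, 2))  # Euclidean length: 5.0
print(np.isclose(lp_norm(u, 3), np.linalg.norm(u, ord=3)))
```

For vectors, `np.linalg.norm` accepts the same exponent through its `ord` argument, so the helper is only meant to mirror the formula.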
+
+In the case of matrices, matters are more complicated. 
+After all, matrices can be viewed both as collections of individual entries 
+*and* as objects that operate on vectors and transform them into other vectors. 
+For instance, we can ask by how much longer 
+the matrix-vector product $\mathbf{X} \mathbf{v}$ 
+could be relative to $\mathbf{v}$. 
+This line of thought leads to a norm called the *spectral* norm. 
+For now, we introduce [**the *Frobenius norm*, 
+which is much easier to compute**] and defined as
+the square root of the sum of the squares 
+of a matrix's elements:
+
+[**$$\|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n x_{ij}^2}.$$**]
+
+The Frobenius norm behaves as if it were 
+an $\ell_2$ norm of a matrix-shaped vector.
+Invoking the following function will calculate 
+the Frobenius norm of a matrix.
+
+```{.python .input}
+%%tab mxnet
+np.linalg.norm(np.ones((4, 9)))
+```
+
+```{.python .input}
+%%tab pytorch
+torch.norm(torch.ones((4, 9)))
+```
+
+```{.python .input}
+%%tab tensorflow
+tf.norm(tf.ones((4, 9)))
+```
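
The claim that the Frobenius norm behaves like an $\ell_2$ norm of a matrix-shaped vector can be checked as follows. This is a sketch in plain NumPy (an assumption for illustration), using the same $4 \times 9$ matrix of ones; the names `frobenius`, `from_formula`, and `flattened_l2` are hypothetical.

```python
import numpy as np

X = np.ones((4, 9))

frobenius = np.linalg.norm(X)                 # default for 2-D input is the Frobenius norm
from_formula = np.sqrt((X ** 2).sum())        # square root of the sum of squares
flattened_l2 = np.linalg.norm(X.reshape(-1))  # l2 norm of the flattened matrix

print(frobenius, from_formula, flattened_l2)  # all three are sqrt(36) = 6.0
```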
+
+While we do not want to get too far ahead of ourselves,
+we can plant some intuition already about why these concepts are useful.
+In deep learning, we are often trying to solve optimization problems:
+*maximize* the probability assigned to observed data;
+*maximize* the revenue associated with a recommender model; 
+*minimize* the distance between predictions
+and the ground-truth observations; 
+*minimize* the distance between representations 
+of photos of the same person 
+while *maximizing* the distance between representations 
+of photos of different people. 
+These distances, which constitute 
+the objectives of deep learning algorithms, 
+are often expressed as norms. 
+
+
+## Discussion
+
+In this section, we reviewed all the linear algebra
+that you will need to understand
+a remarkable chunk of modern deep learning.
+There is a lot more to linear algebra
+and much of it is useful for machine learning.
+For example, matrices can be decomposed into factors,
+and these decompositions can reveal
+low-dimensional structure in real-world datasets.
+There are entire subfields of machine learning
+that focus on using matrix decompositions
+and their generalizations to high-order tensors
+to discover structure in datasets 
+and solve prediction problems.
+But this book focuses on deep learning.
+And we believe you will be more inclined 
+to learn more mathematics
+once you have gotten your hands dirty
+applying machine learning to real datasets.
+So while we reserve the right 
+to introduce more mathematics later on,
+we wrap up this section here.
+
+If you are eager to learn more linear algebra,
+there are many excellent books and online resources.
+For a more advanced crash course, consider checking out
+:cite:`Strang.1993,Kolter.2008,Petersen.Pedersen.ea.2008`.
+
+To recap:
+
+* Scalars, vectors, matrices, and tensors are 
+  the basic mathematical objects used in linear algebra 
+  and have zero, one, two, and an arbitrary number of axes, respectively.
+* Tensors can be sliced or reduced along specified axes 
+  via indexing, or operations such as `sum` and `mean`, respectively.
+* Elementwise products are called Hadamard products. 
+  By contrast, dot products, matrix-vector products, and matrix-matrix products 
+  are not elementwise operations and in general return objects 
+  that have different shapes than the operands. 
+* Compared to Hadamard products, matrix-matrix products 
+  take considerably longer to compute (cubic rather than quadratic time).
+* Norms capture various notions of the magnitude of a vector, 
+  and are commonly applied to the difference of two vectors 
+  to measure their distance.
+* Common vector norms include the $\ell_1$ and $\ell_2$ norms, 
+  and common matrix norms include the *spectral* and *Frobenius* norms.
+
+
+## Exercises
+
+1. Prove that the transpose of the transpose of a matrix is the matrix itself: $(\mathbf{A}^\top)^\top = \mathbf{A}$.
+1. Given two matrices $\mathbf{A}$ and $\mathbf{B}$, show that sum and transposition commute: $\mathbf{A}^\top + \mathbf{B}^\top = (\mathbf{A} + \mathbf{B})^\top$.
+1. Given any square matrix $\mathbf{A}$, is $\mathbf{A} + \mathbf{A}^\top$ always symmetric? Can you prove the result by using only the result of the previous two exercises?
+1. We defined the tensor `X` of shape (2, 3, 4) in this section. What is the output of `len(X)`? Write your answer without implementing any code, then check your answer using code. 
+1. For a tensor `X` of arbitrary shape, does `len(X)` always correspond to the length of a certain axis of `X`? What is that axis?
+1. Run `A / A.sum(axis=1)` and see what happens. Can you analyze the reason?
+1. When traveling between two points in downtown Manhattan, what is the distance that you need to cover in terms of the coordinates, i.e., in terms of avenues and streets? Can you travel diagonally?
+1. Consider a tensor with shape (2, 3, 4). What are the shapes of the summation outputs along axis 0, 1, and 2?
+1. Feed a tensor with 3 or more axes to the `linalg.norm` function and observe its output. What does this function compute for tensors of arbitrary shape?
+1. Define three large matrices, say $\mathbf{A} \in \mathbb{R}^{2^{10} \times 2^{16}}$, $\mathbf{B} \in \mathbb{R}^{2^{16} \times 2^{5}}$ and $\mathbf{C} \in \mathbb{R}^{2^{5} \times 2^{14}}$, for instance initialized with Gaussian random variables. You want to compute the product $\mathbf{A} \mathbf{B} \mathbf{C}$. Is there any difference in memory footprint and speed, depending on whether you compute $(\mathbf{A} \mathbf{B}) \mathbf{C}$ or $\mathbf{A} (\mathbf{B} \mathbf{C})$? Why?
+1. Define three large matrices, say $\mathbf{A} \in \mathbb{R}^{2^{10} \times 2^{16}}$, $\mathbf{B} \in \mathbb{R}^{2^{16} \times 2^{5}}$ and $\mathbf{C} \in \mathbb{R}^{2^{5} \times 2^{16}}$. Is there any difference in speed depending on whether you compute $\mathbf{A} \mathbf{B}$ or $\mathbf{A} \mathbf{C}^\top$? Why? What changes if you initialize $\mathbf{C} = \mathbf{B}^\top$ without cloning memory? Why?
+1. Define three matrices, say $\mathbf{A}, \mathbf{B}, \mathbf{C} \in \mathbb{R}^{100 \times 200}$. Constitute a tensor with 3 axes by stacking $[\mathbf{A}, \mathbf{B}, \mathbf{C}]$. What is the dimensionality? Slice out the second coordinate of the third axis to recover $\mathbf{B}$. Check that your answer is correct.
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/30)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/31)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/196)
+:end_tab:
diff --git a/chapter_preliminaries/lookup-api.md b/chapter_preliminaries/lookup-api.md
index c27b774..dda63cd 100644
--- a/chapter_preliminaries/lookup-api.md
+++ b/chapter_preliminaries/lookup-api.md
@@ -1,84 +1,85 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
 # Documentation
 
 :begin_tab:`mxnet`
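-Due to constraints on the length of this book, we cannot possibly introduce every single MXNet function and class (and you probably would not want us to). The API documentation and additional tutorials and examples provide plenty of documentation beyond the book. In this section, we provide some guidance for exploring the MXNet API.
+While we cannot possibly introduce every single MXNet function and class (and the information might become outdated quickly), the [API documentation](https://mxnet.apache.org/versions/1.8.0/api) and additional [tutorials](https://mxnet.apache.org/versions/1.8.0/api/python/docs/tutorials/) and examples provide such documentation. This section provides some guidance for how to explore the MXNet API.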
 :end_tab:
 
 :begin_tab:`pytorch`
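-Due to constraints on the length of this book, we cannot possibly introduce every single PyTorch function and class (and you probably would not want us to). The API documentation and additional tutorials and examples provide plenty of documentation beyond the book. In this section, we provide some guidance for exploring the PyTorch API.
+While we cannot possibly introduce every single PyTorch function and class (and the information might become outdated quickly), the [API documentation](https://pytorch.org/docs/stable/index.html) and additional [tutorials](https://pytorch.org/tutorials/beginner/basics/intro.html) and examples provide such documentation. This section provides some guidance for how to explore the PyTorch API.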
 :end_tab:
 
 :begin_tab:`tensorflow`
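-Due to constraints on the length of this book, we cannot possibly introduce every single TensorFlow function and class (and you probably would not want us to). The API documentation and additional tutorials and examples provide plenty of documentation beyond the book. In this section, we provide some guidance for exploring the TensorFlow API.
+While we cannot possibly introduce every single TensorFlow function and class (and the information might become outdated quickly), the [API documentation](https://www.tensorflow.org/api_docs) and additional [tutorials](https://www.tensorflow.org/tutorials) and examples provide such documentation. This section provides some guidance for how to explore the TensorFlow API.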
 :end_tab:
 
-## Finding All the Functions and Classes in a Module
+## Functions and Classes in a Module
 
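-In order to know which functions and classes can be called in a module, we invoke the `dir` function. For instance, we can (**query all properties in the module for generating random numbers**):
+In order to know which functions and classes can be called in a module, we invoke the `dir` function. For instance, we can (**query all properties in the module for generating random numbers**):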
 
 ```{.python .input  n=1}
+%%tab mxnet
 from mxnet import np
 print(dir(np.random))
 ```
 
 ```{.python .input  n=1}
-#@tab pytorch
+%%tab pytorch
 import torch
 print(dir(torch.distributions))
 ```
 
 ```{.python .input  n=1}
-#@tab tensorflow
+%%tab tensorflow
 import tensorflow as tf
 print(dir(tf.random))
 ```
 
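-Generally, we can ignore functions that start and end with `__` (special objects in Python) or functions that start with a single `_` (usually internal functions). Judging from the remaining function or attribute names, we might guess that this module offers various methods for generating random numbers, including sampling from the uniform distribution (`uniform`), normal distribution (`normal`), and multinomial distribution (`multinomial`). 
+Generally, we can ignore functions that start and end with `__` (special objects in Python) or functions that start with a single `_` (usually internal functions). Based on the remaining function or attribute names, we might hazard a guess that this module offers various methods for generating random numbers, including sampling from the uniform distribution (`uniform`), normal distribution (`normal`), and multinomial distribution (`multinomial`). 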
 
-## Finding the Usage of Specific Functions and Classes
+## Specific Functions and Classes
 
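-For more specific instructions on how to use a given function or class, we can invoke the `help` function. As an example, let's [**explore the usage instructions for tensors' `ones` function**].
+For more specific instructions on how to use a given function or class, we can invoke the `help` function. As an example, let's [**explore the usage instructions for tensors' `ones` function**].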
 
 ```{.python .input}
+%%tab mxnet
 help(np.ones)
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 help(torch.ones)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 help(tf.ones)
 ```
 
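-From the documentation, we can see that the `ones` function creates a new tensor with the specified shape and sets all the elements to the value 1. Whenever possible, you should (**run a quick test**) to confirm your interpretation:
+From the documentation, we can see that the `ones` function creates a new tensor with the specified shape and sets all the elements to the value of 1. Whenever possible, you should (**run a quick test**) to confirm your interpretation: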
 
 ```{.python .input}
+%%tab mxnet
 np.ones(4)
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 torch.ones(4)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 tf.ones(4)
 ```
 
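-In the Jupyter notebook, we can use `?` to display the document in another window. For example, `list?` will create content that is almost identical to `help(list)`, displaying it in a new browser window. In addition, if we use two question marks, such as `list??`, the Python code implementing the function will also be displayed. 
-
-## Summary
-
-* The official documentation provides plenty of descriptions and examples that are beyond this book.
-* We can look up documentation for the usage of an API by calling the `dir` and `help` functions, or `?` and `??` in Jupyter notebooks.
-
-## Exercises
+In the Jupyter notebook, we can use `?` to display the document in another window. For example, `list?` will create content that is almost identical to `help(list)`, displaying it in a new browser window. In addition, if we use two question marks, such as `list??`, the Python code implementing the function will also be displayed. 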
 
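-1. Look up the documentation for any function or class in the deep learning framework. Can you also find the documentation on the official website of the framework?
+The official documentation provides plenty of descriptions and examples that are beyond this book. Our emphasis lies on covering important use cases that will allow you to get started quickly with practical problems, rather than completeness of coverage. We also encourage you to study the source code of the libraries to see examples of high-quality implementations for production code. By doing this you will become a better engineer in addition to becoming a better scientist.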
 
 :begin_tab:`mxnet`
 [Discussions](https://discuss.d2l.ai/t/38)
diff --git a/chapter_preliminaries/lookup-api_origin.md b/chapter_preliminaries/lookup-api_origin.md
new file mode 100644
index 0000000..17b6264
--- /dev/null
+++ b/chapter_preliminaries/lookup-api_origin.md
@@ -0,0 +1,133 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# Documentation
+:begin_tab:`mxnet`
+While we cannot possibly introduce every single MXNet function and class 
+(and the information might become outdated quickly), 
+the [API documentation](https://mxnet.apache.org/versions/1.8.0/api) 
+and additional [tutorials](https://mxnet.apache.org/versions/1.8.0/api/python/docs/tutorials/) and examples 
+provide such documentation. 
+This section provides some guidance for how to explore the MXNet API.
+:end_tab:
+
+:begin_tab:`pytorch`
+While we cannot possibly introduce every single PyTorch function and class 
+(and the information might become outdated quickly), 
+the [API documentation](https://pytorch.org/docs/stable/index.html) and additional [tutorials](https://pytorch.org/tutorials/beginner/basics/intro.html) and examples 
+provide such documentation.
+This section provides some guidance for how to explore the PyTorch API.
+:end_tab:
+
+:begin_tab:`tensorflow`
+While we cannot possibly introduce every single TensorFlow function and class 
+(and the information might become outdated quickly), 
+the [API documentation](https://www.tensorflow.org/api_docs) and additional [tutorials](https://www.tensorflow.org/tutorials) and examples 
+provide such documentation. 
+This section provides some guidance for how to explore the TensorFlow API.
+:end_tab:
+
+
+## Functions and Classes in a Module
+
+In order to know which functions and classes can be called in a module,
+we invoke the `dir` function. For instance, we can
+(**query all properties in the module for generating random numbers**):
+
+```{.python .input  n=1}
+%%tab mxnet
+from mxnet import np
+print(dir(np.random))
+```
+
+```{.python .input  n=1}
+%%tab pytorch
+import torch
+print(dir(torch.distributions))
+```
+
+```{.python .input  n=1}
+%%tab tensorflow
+import tensorflow as tf
+print(dir(tf.random))
+```
+
+Generally, we can ignore functions that start and end with `__` (special objects in Python) 
+or functions that start with a single `_` (usually internal functions). 
+Based on the remaining function or attribute names, 
+we might hazard a guess that this module offers 
+various methods for generating random numbers, 
+including sampling from the uniform distribution (`uniform`), 
+normal distribution (`normal`), and multinomial distribution (`multinomial`).
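
The filtering rule described above is easy to automate. The sketch below applies it to Python's standard-library `random` module rather than to a deep learning framework (an assumption made so that the example runs without extra dependencies); the name `public_names` is hypothetical.

```python
import random

# Keep only names that do not start with an underscore; this drops both
# the `__special__` objects and the `_internal` helpers in one pass.
public_names = [name for name in dir(random) if not name.startswith("_")]

print(public_names)
```

The same list comprehension works on any module object, for example the random-number modules queried above.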
+
+## Specific Functions and Classes
+
+For more specific instructions on how to use a given function or class,
+we can invoke the `help` function. As an example, let's
+[**explore the usage instructions for tensors' `ones` function**].
+
+```{.python .input}
+%%tab mxnet
+help(np.ones)
+```
+
+```{.python .input}
+%%tab pytorch
+help(torch.ones)
+```
+
+```{.python .input}
+%%tab tensorflow
+help(tf.ones)
+```
+
+From the documentation, we can see that the `ones` function 
+creates a new tensor with the specified shape 
+and sets all the elements to the value of 1. 
+Whenever possible, you should (**run a quick test**) 
+to confirm your interpretation:
+
+```{.python .input}
+%%tab mxnet
+np.ones(4)
+```
+
+```{.python .input}
+%%tab pytorch
+torch.ones(4)
+```
+
+```{.python .input}
+%%tab tensorflow
+tf.ones(4)
+```
+
+In the Jupyter notebook, we can use `?` to display the document in another
+window. For example, `list?` will create content
+that is almost identical to `help(list)`,
+displaying it in a new browser window.
+In addition, if we use two question marks, such as `list??`,
+the Python code implementing the function will also be displayed.
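
Outside Jupyter, roughly comparable information is available through the standard-library `inspect` module. The sketch below is an approximation under stated assumptions, not a description of how Jupyter implements `?` and `??`. It uses `textwrap.dedent` as the example for source code because `list` is a built-in implemented in C and has no Python source to show.

```python
import inspect
import textwrap

# Docstring lookup, similar in spirit to `list?`.
doc = inspect.getdoc(list)

# Source lookup, similar in spirit to `textwrap.dedent??`.
# This only works for objects implemented in Python.
source = inspect.getsource(textwrap.dedent)

print(doc.splitlines()[0])
print(source.splitlines()[0])
```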
+
+The official documentation provides plenty of descriptions and examples that are beyond this book. 
+Our emphasis lies on covering important use cases 
+that will allow you to get started quickly with practical problems, 
+rather than completeness of coverage. 
+We also encourage you to study the source code of the libraries 
+to see examples of high quality implementations for production code. 
+By doing this you will become a better engineer 
+in addition to becoming a better scientist.
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/38)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/39)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/199)
+:end_tab:
diff --git a/chapter_preliminaries/ndarray.md b/chapter_preliminaries/ndarray.md
index 2e50bf4..0781a19 100644
--- a/chapter_preliminaries/ndarray.md
+++ b/chapter_preliminaries/ndarray.md
@@ -1,371 +1,386 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
 # Data Manipulation
 :label:`sec_ndarray`
 
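-In order to get anything done, we need some way to store and manipulate data. Generally, there are two important things we need to do with data: (i) acquire them; and (ii) process them once they are inside the computer. There is no point in acquiring data without some way to store it, so let's get our hands dirty first by playing with synthetic data. To start, we introduce the $n$-dimensional array, which is also called the *tensor*. 
-
-If you have worked with NumPy, the most widely used scientific computing package in Python, then you will find this section familiar. No matter which framework you use, its *tensor class* (`ndarray` in MXNet, `Tensor` in both PyTorch and TensorFlow) is similar to NumPy's `ndarray` with a few killer features. First, GPUs are well supported to accelerate computation, whereas NumPy only supports CPU computation. Second, the tensor class supports automatic differentiation. These properties make the tensor class suitable for deep learning. Throughout the book, when we say tensors, we are referring to instances of the tensor class unless otherwise stated. 
+In order to get anything done, we need some way to store and manipulate data. Generally, there are two important things we need to do with data: (i) acquire them; and (ii) process them once they are inside the computer. There is no point in acquiring data without some way to store it, so to start, let's get our hands dirty with $n$-dimensional arrays, which we also call *tensors*. If you already know the NumPy scientific computing package, this will be a breeze. For all modern deep learning frameworks, the *tensor class* (`ndarray` in MXNet, `Tensor` in PyTorch and TensorFlow) resembles NumPy's `ndarray`, with a few killer features added. First, the tensor class supports automatic differentiation. Second, it leverages GPUs to accelerate numerical computation, whereas NumPy only runs on CPUs. These properties make neural networks both easy to code and fast to run. 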
 
 ## Getting Started
 
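-In this section, we aim to get you up and running, equipping you with the basic math and numerical computing tools that you will build on as you progress through the book. Do not worry if you struggle to grasp some of the mathematical concepts or library functions. The following sections will revisit this material in the context of practical examples. On the other hand, if you already have some background and want to go deeper into the mathematical content, feel free to skip this section.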
-
 :begin_tab:`mxnet`
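-To start, we import the `np` (`numpy`) and `npx` (`numpy_extension`) modules from MXNet. Here, the `np` module includes functions supported by NumPy, while the `npx` module contains a set of extensions developed to empower deep learning within a NumPy-like environment. When using tensors, we almost always invoke the `set_np` function: this is for compatibility of tensor processing by other components of MXNet.
+To start, we import the `np` (`numpy`) and `npx` (`numpy_extension`) modules from MXNet. Here, the `np` module includes functions supported by NumPy, while the `npx` module contains a set of extensions developed to empower deep learning within a NumPy-like environment. When using tensors, we almost always invoke the `set_np` function: this is for compatibility of tensor processing by other components of MXNet.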
 :end_tab:
 
 :begin_tab:`pytorch`
-(**To start, we import `torch`. Note that although it is called PyTorch, we should import `torch` instead of `pytorch`.**)
+(**To start, we import the PyTorch library. Note that the package name is `torch`.**)
 :end_tab:
 
 :begin_tab:`tensorflow`
-To start, we import `tensorflow`. As the name is a little long, we often import it with a short alias `tf`.
+To start, we import `tensorflow`. For brevity, practitioners often assign the alias `tf`.
 :end_tab:
 
 ```{.python .input}
+%%tab mxnet
 from mxnet import np, npx
 npx.set_np()
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 import torch
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 import tensorflow as tf
 ```
 
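-[**A tensor represents a (possibly multi-dimensional) array of numerical values.**] With one axis, we call a tensor a *vector*. With two axes, we call a tensor a *matrix*. With $k > 2$ axes, we drop the specialized names and just refer to the object as a $k^\mathrm{th}$ *order tensor*.
+[**A tensor represents a (possibly multi-dimensional) array of numerical values.**] With one axis, a tensor is called a *vector*. With two axes, a tensor is called a *matrix*. With $k > 2$ axes, we drop the specialized names and just refer to the object as a $k^\mathrm{th}$ *order tensor*.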
 
 :begin_tab:`mxnet`
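-MXNet provides a variety of functions for creating new tensors prepopulated with values. For example, by invoking `arange(n)`, we can create a vector of evenly spaced values, starting at 0 (included) and ending at `n` (not included). The default interval size is $1$. Unless otherwise specified, new tensors are stored in main memory and designated for CPU-based computation.
+MXNet provides a variety of functions for creating new tensors prepopulated with values. For example, by invoking `arange(n)`, we can create a vector of evenly spaced values, starting at 0 (included) and ending at `n` (not included). By default, the interval size is $1$. Unless otherwise specified, new tensors are stored in main memory and designated for CPU-based computation.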
 :end_tab:
 
 :begin_tab:`pytorch`
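-PyTorch provides a variety of functions for creating new tensors prepopulated with values. For example, by invoking `arange(n)`, we can create a vector of evenly spaced values, starting at 0 (included) and ending at `n` (not included). The default interval size is $1$. Unless otherwise specified, new tensors are stored in main memory and designated for CPU-based computation.
+PyTorch provides a variety of functions for creating new tensors prepopulated with values. For example, by invoking `arange(n)`, we can create a vector of evenly spaced values, starting at 0 (included) and ending at `n` (not included). By default, the interval size is $1$. Unless otherwise specified, new tensors are stored in main memory and designated for CPU-based computation.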
 :end_tab:
 
 :begin_tab:`tensorflow`
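-TensorFlow provides a variety of functions for creating new tensors prepopulated with values. For example, by invoking `range(n)`, we can create a vector of evenly spaced values, starting at 0 (included) and ending at `n` (not included). The default interval size is $1$. Unless otherwise specified, new tensors are stored in main memory and designated for CPU-based computation.
+TensorFlow provides a variety of functions for creating new tensors prepopulated with values. For example, by invoking `range(n)`, we can create a vector of evenly spaced values, starting at 0 (included) and ending at `n` (not included). By default, the interval size is $1$. Unless otherwise specified, new tensors are stored in main memory and designated for CPU-based computation.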
 :end_tab:
 
 ```{.python .input}
+%%tab mxnet
 x = np.arange(12)
 x
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 x = torch.arange(12, dtype=torch.float32)
 x
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 x = tf.range(12, dtype=tf.float32)
 x
 ```
 
-(**テンソルの*shape***) (~~と要素の総数~~) (各軸に沿った長さ) は `shape` プロパティを調べることでアクセスできます。
+:begin_tab:`mxnet`
+Each of these values is called an *element* of the tensor. The tensor `x` contains 12 elements. We can inspect the total number of elements in a tensor via its `size` attribute.
+:end_tab:
 
-```{.python .input}
-#@tab all
-x.shape
-```
+:begin_tab:`pytorch`
+Each of these values is called an *element* of the tensor. The tensor `x` contains 12 elements. We can inspect the total number of elements in a tensor via its `numel` method.
+:end_tab:
 
-テンソルの要素の総数、つまりすべての形状要素の積を知りたいだけなら、その大きさを調べることができます。ここではベクトルを扱っているので、その `shape` の 1 つの要素はそのサイズと同じです。
+:begin_tab:`tensorflow`
+Each of these values is called an *element* of the tensor. The tensor `x` contains 12 elements. We can inspect the total number of elements in a tensor via the `size` function.
+:end_tab:
 
 ```{.python .input}
+%%tab mxnet
 x.size
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 x.numel()
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 tf.size(x)
 ```
 
-[**要素数も値も変えずにテンソルの形を変える**] には、`reshape` 関数を呼び出します。たとえば、テンソル `x` を形状 (12,) の行ベクトルから形状 (3, 4) の行列に変換できます。この新しいテンソルにはまったく同じ値が含まれていますが、3 行 4 列で構成された行列として表示されます。繰り返しますが、形状は変更されましたが、要素は変更されていません。形状を変更してもサイズは変更されないことに注意してください。
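+(**We can access a tensor's *shape***) (the length along each axis) by inspecting its `shape` attribute. Because we are dealing with a vector here, the `shape` contains just a single element and is identical to the size.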
+
+```{.python .input}
+%%tab all
+x.shape
+```
+
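+We can [**change the shape of a tensor without altering its size or values**], by invoking `reshape`. For example, we can transform our vector `x` whose shape is (12,) to a matrix `X` with shape (3, 4). This new tensor retains all elements but reconfigures them into a matrix. Notice that the elements of our vector are laid out one row at a time and thus `x[3] == X[0, 3]`.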
 
 ```{.python .input}
-#@tab mxnet, pytorch
+%%tab mxnet, pytorch
 X = x.reshape(3, 4)
 X
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 X = tf.reshape(x, (3, 4))
 X
 ```
 
-すべてのディメンションを手動で指定して形状を変更する必要はありません。ターゲットの形状が形状 (高さ、幅) をもつ行列の場合、幅がわかると、高さが暗黙的に与えられます。なぜ自分たちで除算をしなければならないのですか？上の例では、3 行の行列を得るために、3 行と 4 列の両方を指定しました。幸いなことに、テンソルは残りの次元を指定して 1 つの次元を自動的に計算できます。この機能を呼び出すには、テンソルで自動的に推論する次元に `-1` を配置します。私たちの場合、`x.reshape(3, 4)` を呼び出す代わりに、`x.reshape(-1, 4)` または `x.reshape(3, -1)` を同等に呼び出すことができます。 
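+Note that specifying every shape component to `reshape` is redundant. Because we already know our tensor's size, we can work out one component of the shape given the rest. For example, given a tensor of size $n$ and target shape ($h$, $w$), we know that $w = n/h$. To automatically infer one component of the shape, we can place a `-1` for the shape component that should be inferred automatically. In our case, instead of calling `x.reshape(3, 4)`, we could have equivalently called `x.reshape(-1, 4)` or `x.reshape(3, -1)`.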
 
-通常、行列はゼロ、1、その他の定数、または特定の分布からランダムにサンプリングされた数値のいずれかで初期化されます。[**すべての要素を0に設定したテンソルを表すテンソルを作成できます**](~~or 1~~)、(2, 3, 4) の形状は次のようになります。
+Practitioners often need to work with tensors initialized to contain all zeros or ones. [**We can construct a tensor with all elements set to zero**] (~~or one~~) and a shape of (2, 3, 4) via the `zeros` function.
 
 ```{.python .input}
+%%tab mxnet
 np.zeros((2, 3, 4))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 torch.zeros((2, 3, 4))
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 tf.zeros((2, 3, 4))
 ```
 
-同様に、次のように、各要素を 1 に設定したテンソルを作成できます。
+Similarly, we can create a tensor with all ones by invoking `ones`.
 
 ```{.python .input}
+%%tab mxnet
 np.ones((2, 3, 4))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 torch.ones((2, 3, 4))
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 tf.ones((2, 3, 4))
 ```
 
-多くの場合、何らかの確率分布から [**テンソルの各要素の値をランダムにサンプリング**] します。たとえば、ニューラルネットワークでパラメーターとして機能する配列を作成する場合、通常、配列の値をランダムに初期化します。次のスニペットは、形状 (3, 4) を持つテンソルを作成します。各要素は、平均 0、標準偏差 1 の標準ガウス (正規) 分布からランダムにサンプリングされます。
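+We often wish to [**sample each element randomly (and independently)**] from a given probability distribution. For example, the parameters of neural networks are often initialized randomly. The following snippet creates a tensor with elements drawn from a standard Gaussian (normal) distribution with mean 0 and standard deviation 1.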
 
 ```{.python .input}
+%%tab mxnet
 np.random.normal(0, 1, size=(3, 4))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 torch.randn(3, 4)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 tf.random.normal(shape=[3, 4])
 ```
 
-また、数値を含む Python リスト (またはリストのリスト) を提供することで、目的のテンソルで [**各要素の正確な値を指定**] することもできます。ここで、最も外側のリストは軸 0 に対応し、内側のリストは軸 1 に対応しています。
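+Finally, we can construct tensors by [**supplying the exact values for each element**] in the form of (possibly nested) Python list(s) containing numerical literals. Here, we construct a matrix with a list of lists, where the outermost list corresponds to axis 0, and the inner list to axis 1.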
 
 ```{.python .input}
+%%tab mxnet
 np.array([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 torch.tensor([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 tf.constant([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
 ```
 
-## オペレーション
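+## Indexing and Slicing
+
+As with Python lists, we can access tensor elements by indexing (starting with 0). To access an element based on its position relative to the end of the list, we can use negative indexing. We can also access whole ranges of indices via slicing (e.g., `X[start:stop]`), where the returned value includes the first index (`start`) *but not the last* (`stop`). Finally, when only one index (or slice) is specified for a $k^\mathrm{th}$ order tensor, it is applied along axis 0. Thus, in the following code, [**`[-1]` selects the last row and `[1:3]` selects the second and third rows**].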
 
-この本はソフトウェア工学に関するものではありません。私たちの関心は、単に配列からデータを読み書きすることに限定されません。これらの配列に対して数学演算を実行したいと考えています。最も単純で有用な操作には、*elementwise* 演算があります。これらは、配列の各要素に標準のスカラー演算を適用します。2 つの配列を入力として取る関数の場合、要素単位の演算では、2 つの配列の対応する要素のペアごとに、何らかの標準二項演算子が適用されます。スカラーからスカラーにマッピングする任意の関数から要素単位の関数を作成できます。 
+```{.python .input}
+%%tab all
+X[-1], X[1:3]
+```
 
-数学的表記法では、このような*単項* スカラー演算子 (入力を 1 つ取る) を $f: \mathbb{R} \rightarrow \mathbb{R}$ というシグネチャで表します。これは、関数が任意の実数 ($\mathbb{R}$) から別の実数にマッピングしていることを意味します。同様に、シグネチャ $f: \mathbb{R}, \mathbb{R} \rightarrow \mathbb{R}$ によって、*binary* スカラー演算子 (2 つの実数入力を取り、1 つの出力を生成する) を表します。同じ形状* の 2 つのベクトル $\mathbf{u}$ と $\mathbf{v}$ と二項演算子 $f$ を指定すると、$i$ すべてに対して $c_i \gets f(u_i, v_i)$ を設定することでベクトル $\mathbf{c} = F(\mathbf{u},\mathbf{v})$ を生成できます。$c_i, u_i$ と $v_i$ は $\mathbf{c}, \mathbf{u}$ および $\mathbf{v}$ の $i^\mathrm{th}$ 要素です。ここでは、スカラー関数を要素単位のベクトル演算に*リフト* して、ベクトル値 $F: \mathbb{R}^d, \mathbb{R}^d \rightarrow \mathbb{R}^d$ を生成しました。 
+:begin_tab:`mxnet, pytorch`
+Beyond reading, (**we can also write elements of a matrix by specifying indices.**)
+:end_tab:
 
-一般的な標準算術演算子 (`+`、`-`、`*`、`/`、および `**`) はすべて、任意の形状の同じ形状のテンソルに対して要素単位の演算に*解除* されています。要素単位の演算は、同じ形状の任意の 2 つのテンソルに対して呼び出すことができます。次の例では、カンマを使用して 5 要素のタプルを生成します。各要素は要素ごとの演算の結果です。 
+:begin_tab:`tensorflow`
+`Tensors` in TensorFlow are immutable, and cannot be assigned to. `Variables` in TensorFlow are mutable containers of state that support assignments. Keep in mind that gradients in TensorFlow do not flow backwards through `Variable` assignments.
 
-### オペレーション
+Beyond assigning a value to the entire `Variable`, we can write elements of a `Variable` by specifying indices.
+:end_tab:
 
-[**一般的な標準算術演算子 (`+`、`-`、`*`、`/`、`**`) は、すべて要素単位の演算に*解除* されました。**]
+```{.python .input}
+%%tab mxnet, pytorch
+X[1, 2] = 17
+X
+```
 
 ```{.python .input}
-x = np.array([1, 2, 4, 8])
-y = np.array([2, 2, 2, 2])
-x + y, x - y, x * y, x / y, x ** y  # The ** operator is exponentiation
+%%tab tensorflow
+X_var = tf.Variable(X)
+X_var[1, 2].assign(9)
+X_var
 ```
 
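+If we want [**to assign multiple elements the same value, we apply the indexing on the left-hand side of the assignment operation.**] For instance, `[:2, :]` accesses the first and second rows, where `:` takes all the elements along axis 1 (column). While we discussed indexing for matrices, this also works for vectors and for tensors of more than 2 dimensions.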
+
 ```{.python .input}
-#@tab pytorch
-x = torch.tensor([1.0, 2, 4, 8])
-y = torch.tensor([2, 2, 2, 2])
-x + y, x - y, x * y, x / y, x ** y  # The ** operator is exponentiation
+%%tab mxnet, pytorch
+X[:2, :] = 12
+X
 ```
 
 ```{.python .input}
-#@tab tensorflow
-x = tf.constant([1.0, 2, 4, 8])
-y = tf.constant([2.0, 2, 2, 2])
-x + y, x - y, x * y, x / y, x ** y  # The ** operator is exponentiation
+%%tab tensorflow
+X_var = tf.Variable(X)
+X_var[:2, :].assign(tf.ones(X_var[:2,:].shape, dtype=tf.float32) * 12)
+X_var
 ```
 
-べき乗のような単項演算子を含む、多くの (**より多くの演算を要素単位に適用できる**)。
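+## Operations
+
+Now that we know how to construct tensors and how to read from and write to their elements, we can begin to manipulate them with various mathematical operations. Among the most useful tools are the *elementwise* operations. These apply a standard scalar operation to each element of a tensor. For functions that take two tensors as inputs, elementwise operations apply some standard binary operator on each pair of corresponding elements. We can create an elementwise function from any function that maps from a scalar to a scalar.
+
+In mathematical notation, we denote such *unary* scalar operators (taking one input) by the signature $f: \mathbb{R} \rightarrow \mathbb{R}$. This just means that the function maps from any real number to some other real number. Most standard operators can be applied elementwise, including unary operators like $e^x$.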
 
 ```{.python .input}
+%%tab mxnet
 np.exp(x)
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 torch.exp(x)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 tf.exp(x)
 ```
 
-要素単位の計算に加えて、ベクトルドット積や行列乗算などの線形代数演算も実行できます。:numref:`sec_linear-algebra`では、線形代数の重要な部分（事前知識は想定されていない）について説明します。 
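+Likewise, we denote *binary* scalar operators, which map pairs of real numbers to a (single) real number, via the signature $f: \mathbb{R}, \mathbb{R} \rightarrow \mathbb{R}$. Given any two vectors $\mathbf{u}$ and $\mathbf{v}$ *of the same shape*, and a binary operator $f$, we can produce a vector $\mathbf{c} = F(\mathbf{u},\mathbf{v})$ by setting $c_i \gets f(u_i, v_i)$ for all $i$, where $c_i, u_i$, and $v_i$ are the $i^\mathrm{th}$ elements of vectors $\mathbf{c}, \mathbf{u}$, and $\mathbf{v}$. Here, we produced the vector-valued $F: \mathbb{R}^d, \mathbb{R}^d \rightarrow \mathbb{R}^d$ by *lifting* the scalar function to an elementwise vector operation. The common standard arithmetic operators for addition (`+`), subtraction (`-`), multiplication (`*`), division (`/`), and exponentiation (`**`) have all been *lifted* to elementwise operations for identically-shaped tensors of arbitrary shape.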
 
-また、複数のテンソルを端から端まで積み重ねて [***連結*、**] してより大きなテンソルを形成することもできます。テンソルのリストを提供し、どの軸に沿って連結するかをシステムに指示するだけです。以下の例は、行 (図形の最初の要素である軸 0) と列 (軸1、図形の 2 番目の要素) に沿って 2 つの行列を連結した場合の動作を示しています。1 番目の出力テンソルの軸 0 の長さ ($6$) は、2 つの入力テンソルの軸 0 の長さ ($3 + 3$) の和であり、2 番目の出力テンソルの軸 1 の長さ ($8$) は、2 つの入力テンソルの軸 1 の長さ ($4 + 4$) の合計であることがわかります。
+```{.python .input}
+%%tab mxnet
+x = np.array([1, 2, 4, 8])
+y = np.array([2, 2, 2, 2])
+x + y, x - y, x * y, x / y, x ** y
+```
+
+```{.python .input}
+%%tab pytorch
+x = torch.tensor([1.0, 2, 4, 8])
+y = torch.tensor([2, 2, 2, 2])
+x + y, x - y, x * y, x / y, x ** y
+```
+
+```{.python .input}
+%%tab tensorflow
+x = tf.constant([1.0, 2, 4, 8])
+y = tf.constant([2.0, 2, 2, 2])
+x + y, x - y, x * y, x / y, x ** y
+```
+
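+In addition to elementwise computations, we can also perform linear algebra operations, such as dot products and matrix multiplications. We will elaborate on these shortly in :numref:`sec_linear-algebra`.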
+
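+We can also [***concatenate* multiple tensors together,**] stacking them end-to-end to form a larger tensor. We just need to provide a list of tensors and tell the system along which axis to concatenate. The example below shows what happens when we concatenate two matrices along rows (axis 0) vs. columns (axis 1). We can see that the first output's axis-0 length ($6$) is the sum of the two input tensors' axis-0 lengths ($3 + 3$), while the second output's axis-1 length ($8$) is the sum of the two input tensors' axis-1 lengths ($4 + 4$).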
 
 ```{.python .input}
+%%tab mxnet
 X = np.arange(12).reshape(3, 4)
 Y = np.array([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
 np.concatenate([X, Y], axis=0), np.concatenate([X, Y], axis=1)
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 X = torch.arange(12, dtype=torch.float32).reshape((3,4))
 Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
 torch.cat((X, Y), dim=0), torch.cat((X, Y), dim=1)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 X = tf.reshape(tf.range(12, dtype=tf.float32), (3, 4))
 Y = tf.constant([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
 tf.concat([X, Y], axis=0), tf.concat([X, Y], axis=1)
 ```
 
-時々、[**論理文で二項テンソルを構築する*。**] `X == Y` を例に挙げてみましょう。位置ごとに `X` と `Y` がその位置で等しい場合、新しいテンソルの対応するエントリは値 1 を取ります。これは、論理ステートメント `X == Y` がその位置で真であることを意味します。
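+Sometimes, we want to [**construct a binary tensor via *logical statements*.**] Take `X == Y` as an example. For each position `i, j`, if `X[i, j]` and `Y[i, j]` are equal, then the corresponding entry in the result takes value `1`; otherwise it takes value `0`.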
 
 ```{.python .input}
-#@tab all
+%%tab all
 X == Y
 ```
 
-[**テンソルのすべての要素を合計すると**] は要素が 1 つだけのテンソルになります。
+[**Summing all the elements in the tensor**] yields a tensor with only one element.
 
 ```{.python .input}
-#@tab mxnet, pytorch
+%%tab mxnet, pytorch
 X.sum()
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 tf.reduce_sum(X)
 ```
 
-## ブロードキャストメカニズム
+## Broadcasting
 :label:`subsec_broadcasting`
 
-上のセクションでは、同じ形状の 2 つのテンソルに対して要素単位の演算を実行する方法を説明しました。特定の条件下では、形状が異なっていても [**ブロードキャストメカニズム*を呼び出すことで要素単位の演算を実行できる。**] このメカニズムは次のように機能します。まず、要素を適切にコピーして一方または両方の配列を展開し、この変換後、2 つのテンソルが同じ形。次に、結果の配列に対して要素ごとの演算を実行します。 
-
-ほとんどの場合、次の例のように、配列が最初は長さが 1 しかない軸に沿ってブロードキャストします。
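+By now, you know how to perform elementwise binary operations on two tensors of the same shape. Under certain conditions, even when shapes differ, we can still [**perform elementwise binary operations by invoking the *broadcasting mechanism*.**] Broadcasting works according to the following two-step procedure: (i) expand one or both arrays by copying elements along axes with length 1 so that after this transformation, the two tensors have the same shape; (ii) perform an elementwise operation on the resulting arrays.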
 
 ```{.python .input}
+%%tab mxnet
 a = np.arange(3).reshape(3, 1)
 b = np.arange(2).reshape(1, 2)
 a, b
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 a = torch.arange(3).reshape((3, 1))
 b = torch.arange(2).reshape((1, 2))
 a, b
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 a = tf.reshape(tf.range(3), (3, 1))
 b = tf.reshape(tf.range(2), (1, 2))
 a, b
 ```
 
-`a` と `b` はそれぞれ $3\times1$ と $1\times2$ の行列なので、加算してもそれらの形状は一致しません。以下のように、両方の行列のエントリをより大きな $3\times2$ 行列に「ブロードキャスト」します。行列 `a` では列を複製し、行列 `b` では行を複製してから両方を要素ごとに加算します。
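+Since `a` and `b` are $3\times1$ and $1\times2$ matrices, respectively, their shapes do not match up. Broadcasting produces a larger $3\times2$ matrix by replicating matrix `a` along the columns and matrix `b` along the rows before adding them elementwise.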
 
 ```{.python .input}
-#@tab all
+%%tab all
 a + b
 ```
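
The two-step procedure can be sketched without any framework. The following is a minimal pure-Python illustration for 2-D nested lists only; the helper name `broadcast_add` is hypothetical, and real libraries follow more general rules and avoid materializing copies.

```python
def broadcast_add(a, b):
    """Add two 2-D nested lists, stretching any axis of length 1 to match the other."""
    rows = max(len(a), len(b))
    cols = max(len(a[0]), len(b[0]))
    for m in (a, b):
        if len(m) not in (1, rows) or len(m[0]) not in (1, cols):
            raise ValueError("shapes cannot be broadcast together")

    def at(m, i, j):
        # An axis of length 1 always reads index 0, which mimics copying along it.
        return m[i if len(m) > 1 else 0][j if len(m[0]) > 1 else 0]

    return [[at(a, i, j) + at(b, i, j) for j in range(cols)] for i in range(rows)]


a_list = [[0], [1], [2]]   # shape (3, 1)
b_list = [[0, 1]]          # shape (1, 2)
result = broadcast_add(a_list, b_list)   # shape (3, 2)
```

Entry `result[i][j]` equals `a_list[i][0] + b_list[0][j]`, which is the same value the replicated matrices would produce.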
 
-## インデックス作成とスライシング
-
-他の Python 配列と同様に、テンソル内の要素にはインデックスでアクセスできます。Python の配列と同様に、最初の要素のインデックスは 0 で、範囲は最初で最後の要素の*前* を含むように指定されます。標準の Python リストと同様に、負のインデックスを使って、リストの最後に対する相対的な位置に従って要素にアクセスできます。 
+## Saving Memory
 
-したがって、[**`[-1]` は最後の要素を選択し、`[1:3]` は 2 番目と 3 番目の要素**] を選択します。
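+[**Running operations can cause new memory to be allocated to host results.**] For example, if we write `Y = X + Y`, we dereference the tensor that `Y` used to point to and instead point `Y` at the newly allocated memory. We can demonstrate this issue with Python's `id()` function, which gives us the exact address of the referenced object in memory. Note that after we run `Y = Y + X`, `id(Y)` points to a different location. That is because Python first evaluates `Y + X`, allocating new memory for the result, and then points `Y` to this new location in memory.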
 
 ```{.python .input}
-#@tab all
-X[-1], X[1:3]
-```
-
-:begin_tab:`mxnet, pytorch`
-reading以外にも (**インデックスを指定して行列の要素を書くこともできます**)
-:end_tab:
-
-:begin_tab:`tensorflow`
-TensorFlow の `Tensors` は不変であり、割り当てることはできません。TensorFlow の `Variables` は、割り当てをサポートする状態の可変コンテナです。TensorFlow のグラデーションは `Variable` の割り当てでは逆流しないことに注意してください。 
-
-`Variable` 全体に値を代入するだけでなく、インデックスを指定することで `Variable` の要素を書くことができます。
-:end_tab:
-
-```{.python .input}
-#@tab mxnet, pytorch
-X[1, 2] = 9
-X
-```
-
-```{.python .input}
-#@tab tensorflow
-X_var = tf.Variable(X)
-X_var[1, 2].assign(9)
-X_var
-```
-
-[**複数の要素に同じ値を割り当てるには、すべての要素にインデックスを付けてから値を割り当てます。**] たとえば、`[0:2, :]` は 1 行目と 2 行目にアクセスし、`:` は軸 1 (列) に沿ってすべての要素を取得します。行列の索引付けについて説明しましたが、これは明らかにベクトルや2次元以上のテンソルでも機能します。
-
-```{.python .input}
-#@tab mxnet, pytorch
-X[0:2, :] = 12
-X
-```
-
-```{.python .input}
-#@tab tensorflow
-X_var = tf.Variable(X)
-X_var[0:2, :].assign(tf.ones(X_var[0:2,:].shape, dtype = tf.float32) * 12)
-X_var
-```
-
-## メモリーの節約
-
-[**操作を実行すると、新しいメモリがホストの結果に割り当てられる場合があります**] たとえば、`Y = X + Y` と書くと、`Y` が指していたテンソルを逆参照し、代わりに新しく割り当てられたメモリで `Y` をポイントします。次の例では、Python の `id()` 関数でこれを実証しています。この関数は、メモリ内の参照先オブジェクトの正確なアドレスを与えます。`Y = Y + X` を実行すると、`id(Y)` が別の場所を指していることがわかります。これは、Python が最初に `Y + X` を評価し、結果に対して新しいメモリを割り当て、`Y` がメモリ内のこの新しい位置を指すようにするためです。
-
-```{.python .input}
-#@tab all
+%%tab all
 before = id(Y)
 Y = Y + X
 id(Y) == before
 ```
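
The same rebinding behavior can be observed with plain Python lists. This is only an analogy, assuming list comprehensions in place of tensor arithmetic; the names `Y_list`, `X_list`, and `Z_list` are illustrative.

```python
Y_list = [1, 2, 3]
X_list = [10, 20, 30]

before = id(Y_list)
# The right-hand side builds a new list; the name is then rebound to it.
Y_list = [y + x for y, x in zip(Y_list, X_list)]
rebound = id(Y_list) != before

Z_list = [0, 0, 0]
before_z = id(Z_list)
# Slice assignment writes into the existing list object instead of rebinding.
# The temporary list on the right-hand side is still allocated.
Z_list[:] = [y + x for y, x in zip(Y_list, X_list)]
same_object = id(Z_list) == before_z
```

Here `rebound` and `same_object` are both `True`: plain assignment changes which object the name refers to, while slice assignment keeps the object and replaces its contents.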
 
-これは、2 つの理由から望ましくない場合があります。まず、常に不必要にメモリを割り当てることを回避したくありません。機械学習では、数百メガバイトのパラメータがあり、それらすべてを毎秒複数回更新することがあります。通常は、これらの更新を「その場で」* 実行します。2つ目は、複数の変数から同じパラメータを指す場合です。インプレースで更新しなければ、他の参照は古いメモリ位置を指すため、コードの一部が誤って古いパラメータを参照する可能性があります。
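+This might be undesirable for two reasons. First, we do not want to run around allocating memory unnecessarily all the time. In machine learning, we often have hundreds of megabytes of parameters and update all of them multiple times per second. Whenever possible, we want to perform these updates *in place*. Second, we might point at the same parameters from multiple variables. If we do not update in place, we must be careful to update all of these references, lest we spring a memory leak or inadvertently refer to stale parameters.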
 
 :begin_tab:`mxnet, pytorch`
-幸い、(**インプレース操作の実行**) は簡単です。操作の結果は、`Y[:] = <expression>` のようにスライス表記を使用して、前に割り当てた配列に代入できます。この概念を説明するために、`zeros_like` を使用して $0$ エントリのブロックを割り当てて、別の `Y` と同じ形状の新しい行列 `Z` を最初に作成します。
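+Fortunately, (**performing in-place operations**) is easy. We can assign the result of an operation to a previously allocated array `Y` by using slice notation: `Y[:] = <expression>`. To illustrate this concept, we overwrite the values of tensor `Z`, after initializing it with `zeros_like` to have the same shape as `Y`.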
 :end_tab:
 
 :begin_tab:`tensorflow`
-`Variables` は、TensorFlow の可変状態のコンテナーです。モデルパラメータを保存する方法を提供します。`assign` を使用して `Variable` に操作の結果を割り当てることができます。この概念を説明するために、`zeros_like` を使用して $0$ エントリのブロックを割り当てて、別のテンソル `Y` と同じ形状の `Variable` `Z` を作成します。
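+`Variables` are mutable containers of state in TensorFlow. They provide a way to store your model parameters. We can assign the result of an operation to a `Variable` with `assign`. To illustrate this concept, we overwrite the values of `Variable` `Z`, after initializing it with `zeros_like` to have the same shape as `Y`.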
 :end_tab:
 
 ```{.python .input}
+%%tab mxnet
 Z = np.zeros_like(Y)
 print('id(Z):', id(Z))
 Z[:] = X + Y
@@ -373,7 +388,7 @@ print('id(Z):', id(Z))
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 Z = torch.zeros_like(Y)
 print('id(Z):', id(Z))
 Z[:] = X + Y
@@ -381,7 +396,7 @@ print('id(Z):', id(Z))
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 Z = tf.Variable(tf.zeros_like(Y))
 print('id(Z):', id(Z))
 Z.assign(X + Y)
@@ -389,30 +404,28 @@ print('id(Z):', id(Z))
 ```
 
 :begin_tab:`mxnet, pytorch`
-[**`X` の値が以降の計算で再利用されない場合、`X[:] = X + Y` または `X += Y` を使用して操作のメモリオーバーヘッドを減らすこともできます。**]
+[**If the value of `X` is not reused in subsequent computations, we can also use `X[:] = X + Y` or `X += Y` to reduce the memory overhead of the operation.**]
 :end_tab:
 
 :begin_tab:`tensorflow`
-`Variable` に状態を永続的に保存した後でも、モデルパラメーターではないテンソルに対する過剰な割り当てを避けることで、メモリ使用量をさらに削減できます。 
-
-TensorFlow `Tensors` は不変であり、勾配は `Variable` の割り当てを通過しないため、TensorFlow では個々の操作をインプレースで実行する明示的な方法を提供していません。 
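+Even once you store state persistently in a `Variable`, you may want to reduce your memory usage further by avoiding excess allocations for tensors that are not your model parameters. Because TensorFlow `Tensors` are immutable and gradients do not flow through `Variable` assignments, TensorFlow does not provide an explicit way to run an individual operation in place.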
 
-ただし、TensorFlow には `tf.function` デコレータが用意されており、実行前にコンパイルおよび最適化される TensorFlow グラフ内に計算をラップします。これにより、TensorFlow は未使用の値をプルーニングし、不要になった以前の割り当てを再利用できます。これにより、TensorFlow 計算のメモリオーバーヘッドが最小限に抑えられます。
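+However, TensorFlow provides the `tf.function` decorator to wrap computation inside of a TensorFlow graph that gets compiled and optimized before running. This allows TensorFlow to prune unused values, and to reuse prior allocations that are no longer needed. This minimizes the memory overhead of TensorFlow computations.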
 :end_tab:
 
 ```{.python .input}
-#@tab mxnet, pytorch
+%%tab mxnet, pytorch
 before = id(X)
 X += Y
 id(X) == before
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 @tf.function
 def computation(X, Y):
     Z = tf.zeros_like(Y)  # This unused value will be pruned out
-    A = X + Y  # Allocations will be re-used when no longer needed
+    A = X + Y  # Allocations will be reused when no longer needed
     B = A + Y
     C = B + Y
     return C + Y
@@ -423,60 +436,63 @@ computation(X, Y)
 ## 他の Python オブジェクトへの変換
 
 :begin_tab:`mxnet, tensorflow`
-[**NumPy テンソル (`ndarray`) への変換 (`ndarray`) **]、またはその逆は簡単です。変換された結果はメモリを共有しません。この小さな不便さは、実際には非常に重要です。CPU や GPU で操作を実行するときに、Python の NumPy パッケージが同じメモリチャンクで何か他の処理を行いたいかどうかを待って、計算を中断したくありません。
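+[**Converting to a NumPy tensor (`ndarray`)**], or vice versa, is easy. The converted result does not share memory. This minor inconvenience is actually quite important: when you perform operations on the CPU or on GPUs, you do not want to halt computation, waiting to see whether the NumPy package of Python might want to be doing something else with the same chunk of memory.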
 :end_tab:
 
 :begin_tab:`pytorch`
-[**NumPy テンソル (`ndarray`) への変換 (`ndarray`) **]、またはその逆は簡単です。Tensor と numpy 配列は、基になるメモリ位置を共有し、インプレース操作で一方を変更すると、もう一方も変更されます。
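+[**Converting to a NumPy tensor (`ndarray`)**], or vice versa, is easy. The torch tensor and NumPy array will share their underlying memory, and changing one through an in-place operation will also change the other.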
 :end_tab:
 
 ```{.python .input}
+%%tab mxnet
 A = X.asnumpy()
 B = np.array(A)
 type(A), type(B)
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 A = X.numpy()
 B = torch.from_numpy(A)
 type(A), type(B)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 A = X.numpy()
 B = tf.constant(A)
 type(A), type(B)
 ```
 
-(**サイズ 1 のテンソルを Python スカラーに変換**) するには、`item` 関数または Python の組み込み関数を呼び出すことができます。
+To (**convert a size-1 tensor to a Python scalar**), we can invoke the `item` function or Python's built-in functions.
 
 ```{.python .input}
+%%tab mxnet
 a = np.array([3.5])
 a, a.item(), float(a), int(a)
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 a = torch.tensor([3.5])
 a, a.item(), float(a), int(a)
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 a = tf.constant([3.5]).numpy()
 a, a.item(), float(a), int(a)
 ```
 
-## [概要
+## Summary
 
-* ディープラーニング用のデータを格納および操作するための主要なインターフェイスは、テンソル ($n$ 次元配列) です。基本的な数学演算、ブロードキャスト、インデックス作成、スライス、メモリ節約、他の Python オブジェクトへの変換など、さまざまな機能を提供します。
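+ * The tensor class is the main interface for storing and manipulating data in deep learning libraries.
+ * Tensors provide a variety of functionalities, including construction routines, indexing and slicing, basic mathematical operations, broadcasting, memory-efficient assignment, and conversion to and from other Python objects.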
 
 ## 演習
 
-1. このセクションのコードを実行します。このセクションの条件文 `X == Y` を `X < Y` または `X > Y` に変更し、取得できるテンソルの種類を確認します。
-1. ブロードキャストメカニズムの要素によって動作する 2 つのテンソルを、3 次元テンソルなどの他の形状に置き換えます。結果は期待したとおりですか？
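+1. Run the code in this section. Change the conditional statement `X == Y` to `X < Y` or `X > Y`, and then see what kind of tensor you can get.
+1. Replace the two tensors that operate by element in the broadcasting mechanism with other shapes, e.g., 3-dimensional tensors. Is the result the same as expected?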
 
 :begin_tab:`mxnet`
 [Discussions](https://discuss.d2l.ai/t/26)
diff --git a/chapter_preliminaries/ndarray_origin.md b/chapter_preliminaries/ndarray_origin.md
new file mode 100644
index 0000000..edf701d
--- /dev/null
+++ b/chapter_preliminaries/ndarray_origin.md
@@ -0,0 +1,806 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# Data Manipulation
+:label:`sec_ndarray`
+
+In order to get anything done, 
+we need some way to store and manipulate data.
+Generally, there are two important things 
+we need to do with data: 
+(i) acquire them; 
+and (ii) process them once they are inside the computer. 
+There is no point in acquiring data 
+without some way to store it, 
+so to start, let's get our hands dirty
+with $n$-dimensional arrays, 
+which we also call *tensors*.
+If you already know the NumPy 
+scientific computing package, 
+this will be a breeze.
+For all modern deep learning frameworks,
+the *tensor class* (`ndarray` in MXNet, 
+`Tensor` in PyTorch and TensorFlow) 
+resembles NumPy's `ndarray`,
+with a few killer features added.
+First, the tensor class
+supports automatic differentiation.
+Second, it leverages GPUs
+to accelerate numerical computation,
+whereas NumPy only runs on CPUs.
+These properties make neural networks
+both easy to code and fast to run.
+
+
+
+## Getting Started
+
+:begin_tab:`mxnet`
+To start, we import the `np` (`numpy`) and
+`npx` (`numpy_extension`) modules from MXNet.
+Here, the `np` module includes 
+functions supported by NumPy,
+while the `npx` module contains a set of extensions
+developed to empower deep learning 
+within a NumPy-like environment.
+When using tensors, we almost always 
+invoke the `set_np` function:
+this is for compatibility of tensor processing 
+by other components of MXNet.
+:end_tab:
+
+:begin_tab:`pytorch`
+(**To start, we import the PyTorch library.
+Note that the package name is `torch`.**)
+:end_tab:
+
+:begin_tab:`tensorflow`
+To start, we import `tensorflow`. 
+For brevity, practitioners 
+often assign the alias `tf`.
+:end_tab:
+
+```{.python .input}
+%%tab mxnet
+from mxnet import np, npx
+npx.set_np()
+```
+
+```{.python .input}
+%%tab pytorch
+import torch
+```
+
+```{.python .input}
+%%tab tensorflow
+import tensorflow as tf
+```
+
+[**A tensor represents a (possibly multi-dimensional) array of numerical values.**]
+With one axis, a tensor is called a *vector*.
+With two axes, a tensor is called a *matrix*.
+With $k > 2$ axes, we drop the specialized names
+and just refer to the object as a $k^\mathrm{th}$ *order tensor*.
+
+:begin_tab:`mxnet`
+MXNet provides a variety of functions 
+for creating new tensors 
+prepopulated with values. 
+For example, by invoking `arange(n)`,
+we can create a vector of evenly spaced values,
+starting at 0 (included) 
+and ending at `n` (not included).
+By default, the interval size is $1$.
+Unless otherwise specified, 
+new tensors are stored in main memory 
+and designated for CPU-based computation.
+:end_tab:
+
+:begin_tab:`pytorch`
+PyTorch provides a variety of functions 
+for creating new tensors 
+prepopulated with values. 
+For example, by invoking `arange(n)`,
+we can create a vector of evenly spaced values,
+starting at 0 (included) 
+and ending at `n` (not included).
+By default, the interval size is $1$.
+Unless otherwise specified, 
+new tensors are stored in main memory 
+and designated for CPU-based computation.
+:end_tab:
+
+:begin_tab:`tensorflow`
+TensorFlow provides a variety of functions 
+for creating new tensors 
+prepopulated with values. 
+For example, by invoking `range(n)`,
+we can create a vector of evenly spaced values,
+starting at 0 (included) 
+and ending at `n` (not included).
+By default, the interval size is $1$.
+Unless otherwise specified, 
+new tensors are stored in main memory 
+and designated for CPU-based computation.
+:end_tab:
+
+```{.python .input}
+%%tab mxnet
+x = np.arange(12)
+x
+```
+
+```{.python .input}
+%%tab pytorch
+x = torch.arange(12, dtype=torch.float32)
+x
+```
+
+```{.python .input}
+%%tab tensorflow
+x = tf.range(12, dtype=tf.float32)
+x
+```
+
+:begin_tab:`mxnet`
+Each of these values is called
+an *element* of the tensor.
+The tensor `x` contains 12 elements.
+We can inspect the total number of elements 
+in a tensor via its `size` attribute.
+:end_tab:
+
+:begin_tab:`pytorch`
+Each of these values is called
+an *element* of the tensor.
+The tensor `x` contains 12 elements.
+We can inspect the total number of elements 
+in a tensor via its `numel` method.
+:end_tab:
+
+:begin_tab:`tensorflow`
+Each of these values is called
+an *element* of the tensor.
+The tensor `x` contains 12 elements.
+We can inspect the total number of elements 
+in a tensor via the `size` function.
+:end_tab:
+
+```{.python .input}
+%%tab mxnet
+x.size
+```
+
+```{.python .input}
+%%tab pytorch
+x.numel()
+```
+
+```{.python .input}
+%%tab tensorflow
+tf.size(x)
+```
+
+(**We can access a tensor's *shape***) 
+(the length along each axis)
+by inspecting its `shape` attribute.
+Because we are dealing with a vector here,
+the `shape` contains just a single element
+and is identical to the size.
+
+```{.python .input}
+%%tab all
+x.shape
+```
+
+We can [**change the shape of a tensor
+without altering its size or values**],
+by invoking `reshape`.
+For example, we can transform 
+our vector `x` whose shape is (12,) 
+to a matrix `X` with shape (3, 4).
+This new tensor retains all elements
+but reconfigures them into a matrix.
+Notice that the elements of our vector
+are laid out one row at a time and thus
+`x[3] == X[0, 3]`.
+
+```{.python .input}
+%%tab mxnet, pytorch
+X = x.reshape(3, 4)
+X
+```
+
+```{.python .input}
+%%tab tensorflow
+X = tf.reshape(x, (3, 4))
+X
+```
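
The row-by-row layout can be made concrete without any framework. The following is a minimal pure-Python sketch, not how any of the libraries implement `reshape`; the helper name `reshape_2d` is hypothetical.

```python
def reshape_2d(flat, rows, cols):
    """Arrange a flat list into `rows` lists of `cols` items, one row at a time."""
    if rows * cols != len(flat):
        raise ValueError("rows * cols must equal the number of elements")
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


x_list = list(range(12))
X_list = reshape_2d(x_list, 3, 4)

# Element k of the flat list lands at row k // cols, column k % cols.
k = 7
row, col = divmod(k, 4)
```

With 4 columns, flat position 7 maps to row 1, column 3, and flat position 3 maps to row 0, column 3, matching `x[3] == X[0, 3]`.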
+
+Note that specifying every shape component
+to `reshape` is redundant.
+Because we already know our tensor's size,
+we can work out one component of the shape given the rest.
+For example, given a tensor of size $n$
+and target shape ($h$, $w$),
+we know that $w = n/h$.
+To automatically infer one component of the shape,
+we can place a `-1` for the shape component
+that should be inferred automatically.
+In our case, instead of calling `x.reshape(3, 4)`,
+we could have equivalently called `x.reshape(-1, 4)` or `x.reshape(3, -1)`.
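
The arithmetic behind the `-1` placeholder is a single division. A minimal sketch, assuming a hypothetical helper `infer_shape` that is not part of any framework:

```python
def infer_shape(size, shape):
    """Replace a single -1 in `shape` so the product of the entries equals `size`."""
    if shape.count(-1) > 1:
        raise ValueError("only one component can be inferred")
    known = 1
    for dim in shape:
        if dim != -1:
            known *= dim
    if -1 not in shape:
        if known != size:
            raise ValueError("shape does not match size")
        return tuple(shape)
    if known == 0 or size % known != 0:
        raise ValueError("size is not divisible by the known components")
    # The missing component is the size divided by the product of the rest.
    return tuple(size // known if dim == -1 else dim for dim in shape)


shape_a = infer_shape(12, (-1, 4))
shape_b = infer_shape(12, (3, -1))
```

Both calls resolve to `(3, 4)`, which is why the two `reshape` variants are equivalent for a tensor of size 12.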
+
+Practitioners often need to work with tensors
+initialized to contain all zeros or ones.
+[**We can construct a tensor with all elements set to zero**] (~~or one~~)
+and a shape of (2, 3, 4) via the `zeros` function.
+
+```{.python .input}
+%%tab mxnet
+np.zeros((2, 3, 4))
+```
+
+```{.python .input}
+%%tab pytorch
+torch.zeros((2, 3, 4))
+```
+
+```{.python .input}
+%%tab tensorflow
+tf.zeros((2, 3, 4))
+```
+
+Similarly, we can create a tensor 
+with all ones by invoking `ones`.
+
+```{.python .input}
+%%tab mxnet
+np.ones((2, 3, 4))
+```
+
+```{.python .input}
+%%tab pytorch
+torch.ones((2, 3, 4))
+```
+
+```{.python .input}
+%%tab tensorflow
+tf.ones((2, 3, 4))
+```
+
+We often wish to 
+[**sample each element randomly (and independently)**] 
+from a given probability distribution.
+For example, the parameters of neural networks
+are often initialized randomly.
+The following snippet creates a tensor 
+with elements drawn from 
+a standard Gaussian (normal) distribution
+with mean 0 and standard deviation 1.
+
+```{.python .input}
+%%tab mxnet
+np.random.normal(0, 1, size=(3, 4))
+```
+
+```{.python .input}
+%%tab pytorch
+torch.randn(3, 4)
+```
+
+```{.python .input}
+%%tab tensorflow
+tf.random.normal(shape=[3, 4])
+```
+
+Finally, we can construct tensors by
+[**supplying the exact values for each element**] 
+in the form of (possibly nested) Python list(s) 
+containing numerical literals.
+Here, we construct a matrix with a list of lists,
+where the outermost list corresponds to axis 0,
+and the inner list to axis 1.
+
+```{.python .input}
+%%tab mxnet
+np.array([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
+```
+
+```{.python .input}
+%%tab pytorch
+torch.tensor([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
+```
+
+```{.python .input}
+%%tab tensorflow
+tf.constant([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
+```
+
+## Indexing and Slicing
+
+As with Python lists,
+we can access tensor elements 
+by indexing (starting with 0).
+To access an element based on its position
+relative to the end of the list,
+we can use negative indexing.
+We can also access whole ranges of indices 
+via slicing (e.g., `X[start:stop]`), 
+where the returned value includes 
+the first index (`start`) *but not the last* (`stop`).
+Finally, when only one index (or slice)
+is specified for a $k^\mathrm{th}$ order tensor,
+it is applied along axis 0.
+Thus, in the following code,
+[**`[-1]` selects the last row and `[1:3]`
+selects the second and third rows**].
+
+```{.python .input}
+%%tab all
+X[-1], X[1:3]
+```
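
Because the rules mirror those of Python lists, a nested list is enough to check them. This is an analogy only, with `rows` standing in for a 3-by-4 matrix.

```python
rows = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

last_row = rows[-1]   # a negative index counts from the end of axis 0
middle = rows[1:3]    # start (1) is included, stop (3) is excluded
```

A single index or slice applies to the outermost list, which plays the role of axis 0.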
+
+:begin_tab:`mxnet, pytorch`
+Beyond reading, (**we can also write elements of a matrix by specifying indices.**)
+:end_tab:
+
+:begin_tab:`tensorflow`
+`Tensors` in TensorFlow are immutable, and cannot be assigned to.
+`Variables` in TensorFlow are mutable containers of state that support
+assignments. Keep in mind that gradients in TensorFlow do not flow backwards
+through `Variable` assignments.
+
+Beyond assigning a value to the entire `Variable`, we can write elements of a
+`Variable` by specifying indices.
+:end_tab:
+
+```{.python .input}
+%%tab mxnet, pytorch
+X[1, 2] = 17
+X
+```
+
+```{.python .input}
+%%tab tensorflow
+X_var = tf.Variable(X)
+X_var[1, 2].assign(9)
+X_var
+```
+
+If we want [**to assign multiple elements the same value,
+we apply the indexing on the left-hand side 
+of the assignment operation.**]
+For instance, `[:2, :]` accesses 
+the first and second rows,
+where `:` takes all the elements along axis 1 (column).
+While we discussed indexing for matrices,
+this also works for vectors
+and for tensors of more than 2 dimensions.
+
+```{.python .input}
+%%tab mxnet, pytorch
+X[:2, :] = 12
+X
+```
+
+```{.python .input}
+%%tab tensorflow
+X_var = tf.Variable(X)
+X_var[:2, :].assign(tf.ones(X_var[:2,:].shape, dtype=tf.float32) * 12)
+X_var
+```
+
+## Operations
+
+Now that we know how to construct tensors
+and how to read from and write to their elements,
+we can begin to manipulate them
+with various mathematical operations.
+Among the most useful tools 
+are the *elementwise* operations.
+These apply a standard scalar operation
+to each element of a tensor.
+For functions that take two tensors as inputs,
+elementwise operations apply some standard binary operator
+on each pair of corresponding elements.
+We can create an elementwise function 
+from any function that maps 
+from a scalar to a scalar.
+
+In mathematical notation, we denote such
+*unary* scalar operators (taking one input)
+by the signature 
+$f: \mathbb{R} \rightarrow \mathbb{R}$.
+This just means that the function maps
+from any real number to some other real number.
+Most standard operators can be applied elementwise
+including unary operators like $e^x$.
+
+```{.python .input}
+%%tab mxnet
+np.exp(x)
+```
+
+```{.python .input}
+%%tab pytorch
+torch.exp(x)
+```
+
+```{.python .input}
+%%tab tensorflow
+tf.exp(x)
+```
+
+Likewise, we denote *binary* scalar operators,
+which map pairs of real numbers
+to a (single) real number
+via the signature 
+$f: \mathbb{R}, \mathbb{R} \rightarrow \mathbb{R}$.
+Given any two vectors $\mathbf{u}$ 
+and $\mathbf{v}$ *of the same shape*,
+and a binary operator $f$, we can produce a vector
+$\mathbf{c} = F(\mathbf{u},\mathbf{v})$
+by setting $c_i \gets f(u_i, v_i)$ for all $i$,
+where $c_i, u_i$, and $v_i$ are the $i^\mathrm{th}$ elements
+of vectors $\mathbf{c}, \mathbf{u}$, and $\mathbf{v}$.
+Here, we produced the vector-valued
+$F: \mathbb{R}^d, \mathbb{R}^d \rightarrow \mathbb{R}^d$
+by *lifting* the scalar function
+to an elementwise vector operation.
+The common standard arithmetic operators
+for addition (`+`), subtraction (`-`), 
+multiplication (`*`), division (`/`), 
+and exponentiation (`**`)
+have all been *lifted* to elementwise operations
+for identically-shaped tensors of arbitrary shape.
+
+```{.python .input}
+%%tab mxnet
+x = np.array([1, 2, 4, 8])
+y = np.array([2, 2, 2, 2])
+x + y, x - y, x * y, x / y, x ** y
+```
+
+```{.python .input}
+%%tab pytorch
+x = torch.tensor([1.0, 2, 4, 8])
+y = torch.tensor([2, 2, 2, 2])
+x + y, x - y, x * y, x / y, x ** y
+```
+
+```{.python .input}
+%%tab tensorflow
+x = tf.constant([1.0, 2, 4, 8])
+y = tf.constant([2.0, 2, 2, 2])
+x + y, x - y, x * y, x / y, x ** y
+```
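
The *lifting* construction described above can be sketched without any tensor library. The helper name `lift` and the use of plain Python lists are assumptions made for illustration only; the frameworks implement the same idea in optimized native code.

```python
import operator


def lift(f):
    """Lift a scalar binary function f to an elementwise function F.

    Hypothetical helper for illustration; not part of any framework API.
    """
    def F(u, v):
        if len(u) != len(v):
            raise ValueError("u and v must have the same length")
        # c_i <- f(u_i, v_i) for every index i
        return [f(u_i, v_i) for u_i, v_i in zip(u, v)]
    return F


x = [1.0, 2.0, 4.0, 8.0]
y = [2.0, 2.0, 2.0, 2.0]

elementwise_add = lift(operator.add)
elementwise_pow = lift(operator.pow)

print(elementwise_add(x, y))  # [3.0, 4.0, 6.0, 10.0]
print(elementwise_pow(x, y))  # [1.0, 4.0, 16.0, 64.0]
```

Any function of two scalars can be passed to `lift`, which is the sense in which `+`, `-`, `*`, `/`, and `**` are all instances of one pattern.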
+
+In addition to elementwise computations,
+we can also perform linear algebra operations,
+such as dot products and matrix multiplications.
+We will elaborate on these shortly
+in :numref:`sec_linear-algebra`.
+
+We can also [***concatenate* multiple tensors together,**]
+stacking them end-to-end to form a larger tensor.
+We just need to provide a list of tensors
+and tell the system along which axis to concatenate.
+The example below shows what happens when we concatenate
+two matrices along rows (axis 0)
+vs. columns (axis 1).
+We can see that the first output's axis-0 length ($6$)
+is the sum of the two input tensors' axis-0 lengths ($3 + 3$);
+while the second output's axis-1 length ($8$)
+is the sum of the two input tensors' axis-1 lengths ($4 + 4$).
+
+```{.python .input}
+%%tab mxnet
+X = np.arange(12).reshape(3, 4)
+Y = np.array([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
+np.concatenate([X, Y], axis=0), np.concatenate([X, Y], axis=1)
+```
+
+```{.python .input}
+%%tab pytorch
+X = torch.arange(12, dtype=torch.float32).reshape((3,4))
+Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
+torch.cat((X, Y), dim=0), torch.cat((X, Y), dim=1)
+```
+
+```{.python .input}
+%%tab tensorflow
+X = tf.reshape(tf.range(12, dtype=tf.float32), (3, 4))
+Y = tf.constant([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
+tf.concat([X, Y], axis=0), tf.concat([X, Y], axis=1)
+```
+
+Sometimes, we want to 
+[**construct a binary tensor via *logical statements*.**]
+Take `X == Y` as an example.
+For each position `i, j`, if `X[i, j]` and `Y[i, j]` are equal, 
+then the corresponding entry in the result is `True`,
+otherwise it is `False` (these act as `1` and `0` when cast to a numeric type).
+
+```{.python .input}
+%%tab all
+X == Y
+```
+
+[**Summing all the elements in the tensor**] yields a tensor with only one element.
+
+```{.python .input}
+%%tab mxnet, pytorch
+X.sum()
+```
+
+```{.python .input}
+%%tab tensorflow
+tf.reduce_sum(X)
+```
+
+## Broadcasting
+:label:`subsec_broadcasting`
+
+By now, you know how to perform 
+elementwise binary operations
+on two tensors of the same shape. 
+Under certain conditions,
+even when shapes differ, 
+we can still [**perform elementwise binary operations
+by invoking the *broadcasting mechanism*.**]
+Broadcasting works according to 
+the following two-step procedure:
+(i) expand one or both arrays
+by copying elements along axes with length 1
+so that after this transformation,
+the two tensors have the same shape;
+(ii) perform an elementwise operation
+on the resulting arrays.
+
+```{.python .input}
+%%tab mxnet
+a = np.arange(3).reshape(3, 1)
+b = np.arange(2).reshape(1, 2)
+a, b
+```
+
+```{.python .input}
+%%tab pytorch
+a = torch.arange(3).reshape((3, 1))
+b = torch.arange(2).reshape((1, 2))
+a, b
+```
+
+```{.python .input}
+%%tab tensorflow
+a = tf.reshape(tf.range(3), (3, 1))
+b = tf.reshape(tf.range(2), (1, 2))
+a, b
+```
+
+Since `a` and `b` are $3\times1$ 
+and $1\times2$ matrices, respectively,
+their shapes do not match up.
+Broadcasting produces a larger $3\times2$ matrix 
+by replicating matrix `a` along the columns
+and matrix `b` along the rows
+before adding them elementwise.
+
+```{.python .input}
+%%tab all
+a + b
+```
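
The two-step broadcasting procedure can be spelled out by hand for the $3\times1$ and $1\times2$ case. This is a minimal sketch on nested Python lists, restricted to two axes; the helper names `broadcast_to` and `broadcast_add` are assumptions for illustration, and real frameworks avoid materializing the copies.

```python
def broadcast_to(m, rows, cols):
    """Expand a 2D nested list to (rows, cols) by copying along axes of length 1."""
    if len(m) not in (1, rows) or any(len(r) not in (1, cols) for r in m):
        raise ValueError("shapes are not broadcastable")
    # Copy along axis 0 if that axis has length 1
    expanded_rows = m * rows if len(m) == 1 else m
    # Copy along axis 1 if that axis has length 1
    return [row * cols if len(row) == 1 else list(row) for row in expanded_rows]


def broadcast_add(a, b):
    rows = max(len(a), len(b))
    cols = max(len(a[0]), len(b[0]))
    # Step (i): bring both operands to the common shape
    a_big = broadcast_to(a, rows, cols)
    b_big = broadcast_to(b, rows, cols)
    # Step (ii): ordinary elementwise addition
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a_big, b_big)]


a = [[0], [1], [2]]  # shape 3x1
b = [[0, 1]]         # shape 1x2

print(broadcast_to(a, 3, 2))  # [[0, 0], [1, 1], [2, 2]]
print(broadcast_to(b, 3, 2))  # [[0, 1], [0, 1], [0, 1]]
print(broadcast_add(a, b))    # [[0, 1], [1, 2], [2, 3]]
```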
+
+## Saving Memory
+
+[**Running operations can cause new memory to be
+allocated to host results.**]
+For example, if we write `Y = Y + X`,
+we drop the reference to the tensor that `Y` used to point to
+and instead point `Y` at the newly allocated memory.
+We can demonstrate this issue with Python's `id()` function,
+which returns a unique identifier for the referenced object
+(in CPython, its address in memory).
+Note that after we run `Y = Y + X`,
+`id(Y)` differs from its previous value.
+That's because Python first evaluates `Y + X`,
+allocating new memory for the result, 
+and then points `Y` to this new location in memory.
+
+```{.python .input}
+%%tab all
+before = id(Y)
+Y = Y + X
+id(Y) == before
+```
+
+This might be undesirable for two reasons.
+First, we do not want to run around
+allocating memory unnecessarily all the time.
+In machine learning, we often have
+hundreds of megabytes of parameters
+and update all of them multiple times per second.
+Whenever possible, we want to perform these updates *in place*.
+Second, we might point at the 
+same parameters from multiple variables.
+If we do not update in place, 
+we must be careful to update all of these references,
+lest we spring a memory leak 
+or inadvertently refer to stale parameters.
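
The second concern, stale references, can be reproduced with plain Python lists standing in for tensors. This is only an analogy under that assumption: the names `alias` and `before` are illustrative, and the list comprehension on the right-hand side still builds a temporary list, much as `X + Y` builds a temporary tensor.

```python
X = [1.0, 2.0, 3.0]
Y = [10.0, 20.0, 30.0]
alias = Y  # a second name for the same object

# Out-of-place update: builds a new list and rebinds only the name Y
Y = [y + x for y, x in zip(Y, X)]
print(alias is Y)  # False
print(alias)       # [10.0, 20.0, 30.0], i.e., the stale values

# In-place update: slice assignment writes into the existing object
alias = Y
before = id(Y)
Y[:] = [y + x for y, x in zip(Y, X)]
print(id(Y) == before)  # True
print(alias)            # [12.0, 24.0, 36.0], updated through the shared object
```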
+
+:begin_tab:`mxnet, pytorch`
+Fortunately, (**performing in-place operations**) is easy.
+We can assign the result of an operation
+to a previously allocated array `Y`
+by using slice notation: `Y[:] = <expression>`.
+To illustrate this concept, 
+we overwrite the values of tensor `Z`,
+after initializing it, using `zeros_like`,
+to have the same shape as `Y`.
+:end_tab:
+
+:begin_tab:`tensorflow`
+`Variables` are mutable containers of state in TensorFlow. They provide
+a way to store your model parameters.
+We can assign the result of an operation
+to a `Variable` with `assign`.
+To illustrate this concept, 
+we overwrite the values of `Variable` `Z`
+after initializing it, using `zeros_like`,
+to have the same shape as `Y`.
+:end_tab:
+
+```{.python .input}
+%%tab mxnet
+Z = np.zeros_like(Y)
+print('id(Z):', id(Z))
+Z[:] = X + Y
+print('id(Z):', id(Z))
+```
+
+```{.python .input}
+%%tab pytorch
+Z = torch.zeros_like(Y)
+print('id(Z):', id(Z))
+Z[:] = X + Y
+print('id(Z):', id(Z))
+```
+
+```{.python .input}
+%%tab tensorflow
+Z = tf.Variable(tf.zeros_like(Y))
+print('id(Z):', id(Z))
+Z.assign(X + Y)
+print('id(Z):', id(Z))
+```
+
+:begin_tab:`mxnet, pytorch`
+[**If the value of `X` is not reused in subsequent computations,
+we can also use `X[:] = X + Y` or `X += Y`
+to reduce the memory overhead of the operation.**]
+:end_tab:
+
+:begin_tab:`tensorflow`
+Even once you store state persistently in a `Variable`, 
+you may want to reduce your memory usage further by avoiding excess
+allocations for tensors that are not your model parameters.
+Because TensorFlow `Tensors` are immutable 
+and gradients do not flow through `Variable` assignments, 
+TensorFlow does not provide an explicit way to run
+an individual operation in-place.
+
+However, TensorFlow provides the `tf.function` decorator 
+to wrap computation inside of a TensorFlow graph 
+that gets compiled and optimized before running.
+This allows TensorFlow to prune unused values, 
+and to reuse prior allocations that are no longer needed. 
+This minimizes the memory overhead of TensorFlow computations.
+:end_tab:
+
+```{.python .input}
+%%tab mxnet, pytorch
+before = id(X)
+X += Y
+id(X) == before
+```
+
+```{.python .input}
+%%tab tensorflow
+@tf.function
+def computation(X, Y):
+    Z = tf.zeros_like(Y)  # This unused value will be pruned out
+    A = X + Y  # Allocations will be reused when no longer needed
+    B = A + Y
+    C = B + Y
+    return C + Y
+
+computation(X, Y)
+```
+
+## Conversion to Other Python Objects
+
+:begin_tab:`mxnet, tensorflow`
+[**Converting to a NumPy tensor (`ndarray`)**], or vice versa, is easy.
+The converted result does not share memory.
+This minor inconvenience is actually quite important:
+when you perform operations on the CPU or on GPUs,
+you do not want to halt computation, waiting to see
+whether the NumPy package of Python 
+might want to be doing something else
+with the same chunk of memory.
+:end_tab:
+
+:begin_tab:`pytorch`
+[**Converting to a NumPy tensor (`ndarray`)**], or vice versa, is easy.
+The torch Tensor and numpy array 
+will share their underlying memory, 
+and changing one through an in-place operation 
+will also change the other.
+:end_tab:
+
+```{.python .input}
+%%tab mxnet
+A = X.asnumpy()
+B = np.array(A)
+type(A), type(B)
+```
+
+```{.python .input}
+%%tab pytorch
+A = X.numpy()
+B = torch.from_numpy(A)
+type(A), type(B)
+```
+
+```{.python .input}
+%%tab tensorflow
+A = X.numpy()
+B = tf.constant(A)
+type(A), type(B)
+```
+
+To (**convert a size-1 tensor to a Python scalar**),
+we can invoke the `item` method or Python's built-in `float` and `int` functions.
+
+```{.python .input}
+%%tab mxnet
+a = np.array([3.5])
+a, a.item(), float(a), int(a)
+```
+
+```{.python .input}
+%%tab pytorch
+a = torch.tensor([3.5])
+a, a.item(), float(a), int(a)
+```
+
+```{.python .input}
+%%tab tensorflow
+a = tf.constant([3.5]).numpy()
+a, a.item(), float(a), int(a)
+```
+
+## Summary
+
+ * The tensor class is the main interface for storing and manipulating data in deep learning libraries.
+ * Tensors provide a variety of functionalities including construction routines; indexing and slicing; basic mathematics operations; broadcasting; memory-efficient assignment; and conversion to and from other Python objects.
+
+
+## Exercises
+
+1. Run the code in this section. Change the conditional statement `X == Y` to `X < Y` or `X > Y`, and then see what kind of tensor you can get.
+1. Replace the two tensors used in the broadcasting example with tensors of other shapes, e.g., 3-dimensional tensors. Is the result what you expected?
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/26)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/27)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/187)
+:end_tab:
diff --git a/chapter_preliminaries/pandas.md b/chapter_preliminaries/pandas.md
index 0861cfb..f451256 100644
--- a/chapter_preliminaries/pandas.md
+++ b/chapter_preliminaries/pandas.md
@@ -1,99 +1,103 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
 # Data Preprocessing
 :label:`sec_pandas`
 
-これまで、テンソルにすでに格納されているデータを操作するためのさまざまな手法を紹介してきました。ディープラーニングを現実世界の問題の解決に適用するために、テンソル形式で適切に準備されたデータではなく、生データの前処理から始めることがよくあります。Python でよく使われるデータ分析ツールの中でも、`pandas` パッケージがよく使われています。Python の広大なエコシステムにある他の多くの拡張パッケージと同様に、`pandas` はテンソルと連携して動作することができます。そこで、生データを `pandas` で前処理し、テンソル形式に変換する手順を簡単に説明します。データの前処理テクニックについては、後の章で詳しく説明します。 
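+So far, we have been working with synthetic data that arrived in ready-made tensors. However, to apply deep learning in the wild we must extract messy data stored in arbitrary formats, and preprocess it to suit our needs. Fortunately, the *pandas* [library](https://pandas.pydata.org/) can do much of the heavy lifting. This section, while no substitute for a proper *pandas* [tutorial](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html), will give you a crash course on some of the most common routines. 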
 
 ## Reading the Dataset
 
-例として、(**csv (カンマ区切り値) ファイルに格納される人工データセットを作成する**) `../data/house_tiny.csv` から始めます。他の形式で保存されたデータも同様の方法で処理される場合があります。 
-
-以下では、データセットを行ごとにcsvファイルに書き込みます。
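+Comma-separated values (CSV) files are ubiquitous for storing tabular (spreadsheet-like) data. Here, each line corresponds to one record and consists of several (comma-separated) fields, e.g., "Albert Einstein,March 14 1879,Ulm,Federal polytechnic school,Accomplishments in the field of gravitational physics". To demonstrate how to load CSV files with `pandas`, we (**create a CSV file below**) `../data/house_tiny.csv`. This file represents a dataset of homes, where each row corresponds to a distinct home and the columns correspond to the number of rooms (`NumRooms`), the roof type (`RoofType`), and the price (`Price`).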
 
 ```{.python .input}
-#@tab all
+%%tab all
 import os
 
 os.makedirs(os.path.join('..', 'data'), exist_ok=True)
 data_file = os.path.join('..', 'data', 'house_tiny.csv')
 with open(data_file, 'w') as f:
-    f.write('NumRooms,Alley,Price\n')  # Column names
-    f.write('NA,Pave,127500\n')  # Each row represents a data example
-    f.write('2,NA,106000\n')
-    f.write('4,NA,178100\n')
-    f.write('NA,NA,140000\n')
+    f.write('''NumRooms,RoofType,Price
+NA,NA,127500
+2,NA,106000
+4,Slate,178100
+NA,NA,140000''')
 ```
 
-[**作成した csv ファイルから生のデータセットを読み込む**] には、`pandas` パッケージをインポートし、`read_csv` 関数を呼び出します。このデータセットには 4 つの行と 3 つの列があり、各行には家の部屋数 (「numRooms」)、路地のタイプ (「路地」)、価格 (「価格」) が記述されています。
+Now let's import `pandas` and load the dataset with `read_csv`.
 
 ```{.python .input}
-#@tab all
-# If pandas is not installed, just uncomment the following line:
-# !pip install pandas
+%%tab all
 import pandas as pd
 
 data = pd.read_csv(data_file)
 print(data)
 ```
 
-## 欠損データの処理
+## Data Preparation
+
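+In supervised learning, we train models to predict a designated *target* value, given some set of *input* values. Our first step in processing the dataset is to separate out columns corresponding to input versus target values. We can select columns either by name or via integer-location based indexing (`iloc`). 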
 
-「NaN」エントリは欠損値であることに注意してください。欠損データを処理するために、典型的な方法には*imputation* と*delettion* があります。補完では欠損値が置換された値に置き換えられ、削除では欠損値は無視されます。ここでは、帰属について検討します。 
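+You might have noticed that `pandas` replaced all CSV entries with value `NA` with a special `NaN` (*not a number*) value. This can also happen whenever an entry is empty, e.g., "3,,,270000". These are called *missing values* and they are the "bed bugs" of data science, a persistent menace that you will confront throughout your career. Depending upon the context, missing values might be handled either via *imputation* or *deletion*. Imputation replaces missing values with estimates of their values while deletion simply discards either those rows or those columns that contain missing values.  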
 
-整数位置ベースのインデックス (`iloc`) により、`data` を `inputs` と `outputs` に分割しました。前者は最初の 2 つのカラムを取り、後者は最後のカラムだけを保持します。`inputs` の数値が欠落している場合は、[**「NaN」エントリを同じ列の平均値に置き換えます。**]
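+Here are some common imputation heuristics. [**For categorical input fields, we can treat `NaN` as a category.**] Since the `RoofType` column takes values `Slate` and `NaN`, `pandas` can convert this column into two columns `RoofType_Slate` and `RoofType_nan`. A row whose roof type is `Slate` will set values of `RoofType_Slate` and `RoofType_nan` to 1 and 0, respectively. The converse holds for a row with a missing `RoofType` value.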
 
 ```{.python .input}
-#@tab all
-inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
-inputs = inputs.fillna(inputs.mean())
+%%tab all
+inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
+inputs = pd.get_dummies(inputs, dummy_na=True)
 print(inputs)
 ```
 
-[** `inputs` のカテゴリ値または不連続値については、「NaN」をカテゴリと見なします。**]「Alley」列は「Pave」と「NaN」の 2 種類のカテゴリ値しか取らないため、`pandas` はこの列を「Alley_Pave」と「alley_NAN」の 2 つの列に自動的に変換できます。路地タイプが「Pave」の行は、「Alley_Pave」と「alley_NAN」の値を 1 と 0 に設定します。路地タイプが欠落している行は、その値を 0 と 1 に設定します。
+For missing numerical values, one common heuristic is to [**replace the `NaN` entries with the mean value of the corresponding column**].
 
 ```{.python .input}
-#@tab all
-inputs = pd.get_dummies(inputs, dummy_na=True)
+%%tab all
+inputs = inputs.fillna(inputs.mean())
 print(inputs)
 ```
 
 ## Conversion to the Tensor Format
 
-[**`inputs` と `outputs` のすべてのエントリは数値なので、テンソル形式に変換できます。**] データがこの形式になると、:numref:`sec_ndarray` で紹介したテンソル関数でさらに操作できるようになります。
+Now that [**all the entries in `inputs` and `targets` are numerical, we can load them into a tensor**] (recall :numref:`sec_ndarray`).
 
 ```{.python .input}
+%%tab mxnet
 from mxnet import np
 
-X, y = np.array(inputs.values), np.array(outputs.values)
+X, y = np.array(inputs.values), np.array(targets.values)
 X, y
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 import torch
 
-X, y = torch.tensor(inputs.values), torch.tensor(outputs.values)
+X, y = torch.tensor(inputs.values), torch.tensor(targets.values)
 X, y
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 import tensorflow as tf
 
-X, y = tf.constant(inputs.values), tf.constant(outputs.values)
+X, y = tf.constant(inputs.values), tf.constant(targets.values)
 X, y
 ```
 
-## [概要
+## Discussion
 
-* Python の広大なエコシステムにある他の多くの拡張パッケージと同様に、`pandas` はテンソルと連携して動作することができます。
-* 補完と削除は、欠損データを処理するために使用できます。
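+You now know how to partition data columns, impute missing variables, and load `pandas` data into tensors. In :numref:`sec_kaggle_house`, you will pick up some more data processing skills. While this crash course kept things simple, data processing can get hairy. For example, rather than arriving in a single CSV file, our dataset might be spread across multiple files extracted from a relational database. For instance, in an e-commerce application, customer addresses might live in one table and purchase data in another. Moreover, practitioners face myriad data types beyond categorical and numeric. Other data types include text strings, images, audio data, and point clouds. Oftentimes, advanced tools and efficient algorithms are required to prevent data processing from becoming the biggest bottleneck in the machine learning pipeline. These problems will arise when we get to computer vision and natural language processing. Finally, we must pay attention to data quality. Real-world datasets are often plagued by outliers, faulty measurements from sensors, and recording errors, which must be addressed before feeding the data into any model. Data visualization tools such as [seaborn](https://seaborn.pydata.org/), [Bokeh](https://docs.bokeh.org/), or [matplotlib](https://matplotlib.org/) can help you to manually inspect the data and develop intuitions about what problems you may need to address. 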
 
 ## Exercises
 
-行と列の数が多い生データセットを作成します。 
-
-1. 欠損値が最も多い列を削除します。
-2. 前処理されたデータセットをテンソル形式に変換します。
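+1. Try loading datasets, e.g., Abalone from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.php) and inspect their properties. What fraction of them has missing values? What fraction of the variables is numerical, categorical, or text?
+1. Try out indexing and selecting data columns by name rather than by column number. The pandas documentation on [indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) has further details on how to do this.
+1. How large a dataset do you think you could load this way? What might be the limitations? Hint: consider the time to read the data, representation, processing, and memory footprint. Try this out on your laptop. What changes if you try it out on a server? 
+1. How would you deal with data that has a very large number of categories? What if the category labels are all unique? Should you include the latter?
+1. What alternatives to pandas can you think of? How about [loading NumPy tensors from a file](https://numpy.org/doc/stable/reference/generated/numpy.load.html)? Check out [Pillow](https://python-pillow.org/), the Python Imaging Library.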
 
 :begin_tab:`mxnet`
 [Discussions](https://discuss.d2l.ai/t/28)
diff --git a/chapter_preliminaries/pandas_origin.md b/chapter_preliminaries/pandas_origin.md
new file mode 100644
index 0000000..83259ee
--- /dev/null
+++ b/chapter_preliminaries/pandas_origin.md
@@ -0,0 +1,201 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# Data Preprocessing
+:label:`sec_pandas`
+
+So far, we have been working with synthetic data
+that arrived in ready-made tensors.
+However, to apply deep learning in the wild
+we must extract messy data 
+stored in arbitrary formats,
+and preprocess it to suit our needs.
+Fortunately, the *pandas* [library](https://pandas.pydata.org/) 
+can do much of the heavy lifting.
+This section, while no substitute 
+for a proper *pandas* [tutorial](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html),
+will give you a crash course
+on some of the most common routines.
+
+
+## Reading the Dataset
+
+Comma-separated values (CSV) files are ubiquitous 
+for storing tabular (spreadsheet-like) data.
+Here, each line corresponds to one record
+and consists of several (comma-separated) fields, e.g.,
+"Albert Einstein,March 14 1879,Ulm,Federal polytechnic school,Accomplishments in the field of gravitational physics".
+To demonstrate how to load CSV files with `pandas`, 
+we (**create a CSV file below**) `../data/house_tiny.csv`. 
+This file represents a dataset of homes,
+where each row corresponds to a distinct home
+and the columns correspond to the number of rooms (`NumRooms`),
+the roof type (`RoofType`), and the price (`Price`).
+
+```{.python .input}
+%%tab all
+import os
+
+os.makedirs(os.path.join('..', 'data'), exist_ok=True)
+data_file = os.path.join('..', 'data', 'house_tiny.csv')
+with open(data_file, 'w') as f:
+    f.write('''NumRooms,RoofType,Price
+NA,NA,127500
+2,NA,106000
+4,Slate,178100
+NA,NA,140000''')
+```
+
+Now let's import `pandas` and load the dataset with `read_csv`.
+
+```{.python .input}
+%%tab all
+import pandas as pd
+
+data = pd.read_csv(data_file)
+print(data)
+```
+
+## Data Preparation
+
+In supervised learning, we train models
+to predict a designated *target* value,
+given some set of *input* values. 
+Our first step in processing the dataset
+is to separate out columns corresponding
+to input versus target values. 
+We can select columns either by name or
+via integer-location based indexing (`iloc`).
+
+You might have noticed that `pandas` replaced
+all CSV entries with value `NA`
+with a special `NaN` (*not a number*) value. 
+This can also happen whenever an entry is empty,
+e.g., "3,,,270000".
+These are called *missing values* 
+and they are the "bed bugs" of data science,
+a persistent menace that you will confront
+throughout your career. 
+Depending upon the context, 
+missing values might be handled
+either via *imputation* or *deletion*.
+Imputation replaces missing values 
+with estimates of their values
+while deletion simply discards 
+either those rows or those columns
+that contain missing values. 
+
+Here are some common imputation heuristics.
+[**For categorical input fields, 
+we can treat `NaN` as a category.**]
+Since the `RoofType` column takes values `Slate` and `NaN`,
+`pandas` can convert this column 
+into two columns `RoofType_Slate` and `RoofType_nan`.
+A row whose roof type is `Slate` will set values 
+of `RoofType_Slate` and `RoofType_nan` to 1 and 0, respectively.
+The converse holds for a row with a missing `RoofType` value.
+
+```{.python .input}
+%%tab all
+inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
+inputs = pd.get_dummies(inputs, dummy_na=True)
+print(inputs)
+```
+
+For missing numerical values, 
+one common heuristic is to 
+[**replace the `NaN` entries with 
+the mean value of the corresponding column**].
+
+```{.python .input}
+%%tab all
+inputs = inputs.fillna(inputs.mean())
+print(inputs)
+```
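
To see what the two heuristics compute, the following sketch redoes them by hand with only the standard library. It is not how `pandas` implements `get_dummies` or `fillna`; the variable names are assumptions for illustration, and the CSV text is the same toy dataset as above.

```python
import csv
import io

raw = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Treat the literal string "NA" as a missing value (None)
num_rooms = [None if r["NumRooms"] == "NA" else float(r["NumRooms"]) for r in rows]
roof_type = [None if r["RoofType"] == "NA" else r["RoofType"] for r in rows]

# Categorical column: one indicator per observed category plus one for "missing"
categories = sorted({c for c in roof_type if c is not None})
one_hot = [
    {**{f"RoofType_{c}": int(v == c) for c in categories},
     "RoofType_nan": int(v is None)}
    for v in roof_type
]

# Numerical column: replace missing entries with the mean of the observed ones
observed = [v for v in num_rooms if v is not None]
mean_rooms = sum(observed) / len(observed)
num_rooms_filled = [mean_rooms if v is None else v for v in num_rooms]

print(one_hot)
print(num_rooms_filled)  # [3.0, 2.0, 4.0, 3.0]
```

Note that the mean is taken over the two observed values only, so both missing `NumRooms` entries become 3.0.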
+
+## Conversion to the Tensor Format
+
+Now that [**all the entries in `inputs` and `targets` are numerical,
+we can load them into a tensor**] (recall :numref:`sec_ndarray`).
+
+```{.python .input}
+%%tab mxnet
+from mxnet import np
+
+X, y = np.array(inputs.values), np.array(targets.values)
+X, y
+```
+
+```{.python .input}
+%%tab pytorch
+import torch
+
+X, y = torch.tensor(inputs.values), torch.tensor(targets.values)
+X, y
+```
+
+```{.python .input}
+%%tab tensorflow
+import tensorflow as tf
+
+X, y = tf.constant(inputs.values), tf.constant(targets.values)
+X, y
+```
+
+## Discussion
+
+You now know how to partition data columns, 
+impute missing variables, 
+and load `pandas` data into tensors. 
+In :numref:`sec_kaggle_house`, you will
+pick up some more data processing skills. 
+While this crash course kept things simple,
+data processing can get hairy.
+For example, rather than arriving in a single CSV file,
+our dataset might be spread across multiple files
+extracted from a relational database.
+For instance, in an e-commerce application,
+customer addresses might live in one table
+and purchase data in another.
+Moreover, practitioners face myriad data types
+beyond categorical and numeric. 
+Other data types include text strings, images,
+audio data, and point clouds. 
+Oftentimes, advanced tools and efficient algorithms 
+are required to prevent data processing from becoming
+the biggest bottleneck in the machine learning pipeline. 
+These problems will arise when we get to 
+computer vision and natural language processing. 
+Finally, we must pay attention to data quality.
+Real-world datasets are often plagued 
+by outliers, faulty measurements from sensors, and recording errors, 
+which must be addressed before 
+feeding the data into any model. 
+Data visualization tools such as [seaborn](https://seaborn.pydata.org/), 
+[Bokeh](https://docs.bokeh.org/), or [matplotlib](https://matplotlib.org/)
+can help you to manually inspect the data 
+and develop intuitions about 
+what problems you may need to address.
+
+
+## Exercises
+
+1. Try loading datasets, e.g., Abalone from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.php) and inspect their properties. What fraction of them has missing values? What fraction of the variables is numerical, categorical, or text?
+1. Try out indexing and selecting data columns by name rather than by column number. The pandas documentation on [indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) has further details on how to do this.
+1. How large a dataset do you think you could load this way? What might be the limitations? Hint: consider the time to read the data, representation, processing, and memory footprint. Try this out on your laptop. What changes if you try it out on a server? 
+1. How would you deal with data that has a very large number of categories? What if the category labels are all unique? Should you include the latter?
+1. What alternatives to pandas can you think of? How about [loading NumPy tensors from a file](https://numpy.org/doc/stable/reference/generated/numpy.load.html)? Check out [Pillow](https://python-pillow.org/), the Python Imaging Library. 
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/28)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/29)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/195)
+:end_tab:
diff --git a/chapter_preliminaries/probability.md b/chapter_preliminaries/probability.md
index 766e980..26484a5 100644
--- a/chapter_preliminaries/probability.md
+++ b/chapter_preliminaries/probability.md
@@ -1,332 +1,352 @@
-# 確率
-:label:`sec_prob`
-
-何らかの形で、機械学習は予測を行うことがすべてです。病歴を考慮して、来年に心臓発作を起こす患者の*確率*を予測したいと思うかもしれません。異常検出では、飛行機のジェットエンジンからの一連の測定値が正常に動作している場合にどの程度*可能性*高いかを評価したい場合があります。強化学習では、エージェントが環境内でインテリジェントに行動することを求めています。これは、利用可能な各アクションで高い報酬を得る確率について考える必要があることを意味します。また、レコメンダーシステムを構築する際には、確率についても考える必要があります。たとえば、ある大手オンライン書店で働いていたと*仮説的に*言ってください。特定のユーザーが特定の本を購入する確率を推定したい場合があります。そのためには確率の言葉を使う必要があります。コース、専攻、論文、キャリア、さらには部門全体が、確率に専念しています。ですから、当然のことながら、このセクションの目標は、主題全体を教えることではありません。代わりに、最初のディープラーニングモデルの構築を開始できるように十分に教え、必要に応じて自分で探索を開始できるように、テーマに十分なフレーバーを与えることを望んでいます。 
-
-前のセクションでは、確率が正確に何であるかを明確にしたり、具体的な例を挙げたりすることなく、すでに確率を呼び出しました。写真に基づいて猫と犬を区別するという最初のケースを考えて、もっと真剣になりましょう。これは単純に聞こえるかもしれませんが、実際には手ごわい挑戦です。まず、問題の難易度は画像の解像度によって異なる場合があります。 
-
-![Images of varying resolutions ($10 \times 10$, $20 \times 20$, $40 \times 40$, $80 \times 80$, and $160 \times 160$ pixels).](../img/cat-dog-pixels.png)
-:width:`300px`
-:label:`fig_cat_dog`
-
-:numref:`fig_cat_dog` に示すように、人間は猫と犬を $160 \times 160$ ピクセルの解像度では簡単に認識できますが、$40 \times 40$ ピクセルでは難しく、$10 \times 10$ ピクセルではほぼ不可能になります。言い換えれば、猫と犬を遠くから区別する (したがって解像度が低い) 能力は、情報に基づかない推測に近づくかもしれません。確率は、私たちの確実性のレベルについての正式な推論方法を提供します。画像が猫を描いていることが完全に確信できれば、対応するラベル $y$ が「cat」である*確率*、$P(y=$「cat」と表記される $)$ は $1$ と等しいと言います。$y =$「cat」または $y =$「dog」を示唆する証拠がなければ、2つの可能性は同等であると言えるかもしれません。
-*おそらく* これを $P(y=$「猫」$) = P(y=$「犬」$) = 0.5$ と表現します。私たちが合理的だったら
-自信はありますが、画像が猫を描いているかどうかわからない場合は、確率$0.5  < P(y=$「cat」$) < 1$を割り当てることができます。 
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
 
-2つ目のケースを考えてみましょう。気象モニタリングデータがあれば、明日台北で雨が降る確率を予測したいと考えています。夏なら確率0.5で雨が降るかもしれません。 
+# Probability and Statistics
+:label:`sec_prob`
 
-どちらの場合も、関心のある価値があります。どちらの場合も、結果については不明です。しかし、この2つのケースには重要な違いがあります。この最初のケースでは、イメージは実際には犬か猫のどちらかであり、どちらがわからないだけです。2番目のケースでは、そのようなことを信じるなら（そしてほとんどの物理学者がそうする）、結果は実際にはランダムな出来事かもしれません。したがって、確率は、私たちの確実性のレベルを推論するための柔軟な言語であり、幅広いコンテキストで効果的に適用できます。 
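+One way or another, machine learning is all about uncertainty. In supervised learning, we want to predict something unknown (the *target*) given something known (the *features*). Depending on our objective, we might attempt to predict the most likely value of the target. Or we might predict the value with the smallest expected distance from the target. And sometimes we wish not only to predict a specific value but to *quantify our uncertainty*. For example, given some features describing a patient, we might want to know *how likely* they are to suffer a heart attack in the next year. In unsupervised learning, we often care about uncertainty as well. To determine whether a set of measurements is anomalous, it helps to know how likely one is to observe such values in a population of interest. Furthermore, in reinforcement learning, we wish to develop agents that act intelligently in various environments. This requires reasoning about how an environment might be expected to change and what rewards one might expect to encounter in response to each of the available actions. 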
 
-## 基礎確率論
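+*Probability* is the mathematical field concerned with reasoning under uncertainty. Given a probabilistic model of some process, we can reason about the likelihood of various events. The use of probabilities to describe the frequencies of repeatable events (like coin tosses) is fairly uncontroversial. In fact, *frequentist* scholars adhere to an interpretation of probability that applies *only* to such repeatable events. By contrast, *Bayesian* scholars use the language of probability more broadly to formalize reasoning under uncertainty. Bayesian probability is characterized by two unique features: (i) assigning degrees of belief to non-repeatable events, e.g., what is the *probability* that the moon is made out of cheese?; and (ii) subjectivity: while Bayesian probability provides unambiguous rules for how one should update their beliefs in light of new evidence, it allows different individuals to start off with different *prior* beliefs.
+*Statistics* helps us to reason backwards, starting off with the collection and organization of data and backing out to what inferences we might draw about the process that generated the data. Whenever we analyze a dataset, hunting for patterns that we hope might characterize a broader population, we are employing statistical thinking. Entire courses, majors, theses, careers, departments, companies, and institutions have been devoted to the study of probability and statistics. While this section only scratches the surface, it provides the foundation that you need to begin building models. 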
 
-私たちがサイコロを投げ、別の数字ではなく1が見える可能性を知りたいとします。ダイスが公平であれば、$\{1, \ldots, 6\}$ の 6 つの結果すべてが等しく発生する可能性が高いため、6 つのケースのうちの 1 つで $1$ が見られます。正式には、$1$は確率$\frac{1}{6}$で発生すると述べている。 
+## A Simple Example: Tossing Coins
 
-工場から受け取る本物の金型については、その割合がわからない場合があり、汚れていないかどうかを確認する必要があります。金型を調査する唯一の方法は、何度も鋳造して結果を記録することです。ダイスのキャストごとに、$\{1, \ldots, 6\}$ の値が観測されます。これらの結果を踏まえて、各結果が観察される確率を調べたいと思います。 
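+Imagine that we plan to toss a coin and want to quantify how likely we are to see heads (vs. tails). If the coin is *fair*, then both outcomes (heads and tails) are equally likely. Moreover, if we plan to toss the coin $n$ times, then the fraction of heads that we *expect* to see should exactly match the *expected* fraction of tails. One intuitive way to see this is by symmetry: for every possible outcome with $n_h$ heads and $n_t = (n - n_h)$ tails, there is an equally likely outcome with $n_t$ heads and $n_h$ tails. Note that this is only possible if on average we expect to see $1/2$ of tosses come up heads and $1/2$ come up tails. Of course, if you conduct this experiment many times with $n=1000000$ tosses each, you might never see a trial where $n_h = n_t$ exactly. 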
 
-各値に対する自然なアプローチの 1 つは、その値の個々のカウントを取り、それを合計トス数で割ることです。これにより、与えられた*イベント*の確率を*推定*することができます。*大数の法則*は、投げる回数が増えるにつれて、この推定値が真の基礎となる確率にますます近づくことを示しています。ここで何が起こっているのかを詳しく説明する前に、試してみましょう。 
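+Formally, the quantity $1/2$ is called a *probability* and here it captures the certainty with which any given toss will come up heads. Probabilities assign scores between $0$ and $1$ to outcomes of interest, called *events*. Here the event of interest is $\textrm{heads}$ and we denote the corresponding probability $P(\textrm{heads})$. A probability of $1$ indicates absolute certainty (imagine a trick coin where both sides were heads) and a probability of $0$ indicates impossibility (e.g., if both sides were tails). The frequencies $n_h/n$ and $n_t/n$ are not probabilities but rather *statistics*. Probabilities are *theoretical* quantities that underlie the data generating process. Here, the probability $1/2$ is a property of the coin itself. By contrast, statistics are *empirical* quantities that are computed as functions of the observed data. Our interests in probabilistic and statistical quantities are inextricably intertwined. We often design special statistics called *estimators* that, given a dataset, produce *estimates* of model parameters such as probabilities. Moreover, when those estimators satisfy a nice property called *consistency*, our estimates will converge to the corresponding probability. In turn, these inferred probabilities tell us about the likely statistical properties of data from the same population that we might encounter in the future. 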
 
-まず、必要なパッケージをインポートしてみましょう。
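+Suppose that we stumbled upon a real coin for which we did not know the true $P(\textrm{heads})$. To investigate this quantity with statistical methods, we need to (i) collect some data; and (ii) design an estimator. Data acquisition here is easy; we can toss the coin many times and record all the outcomes. Formally, drawing realizations from some underlying random process is called *sampling*. As you might have guessed, one natural estimator is the ratio of the number of observed *heads* to the total number of tosses.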
 
 ```{.python .input}
+%%tab mxnet
 %matplotlib inline
 from d2l import mxnet as d2l
 from mxnet import np, npx
+from mxnet.numpy.random import multinomial
 import random
 npx.set_np()
 ```
 
 ```{.python .input}
-#@tab pytorch
+%%tab pytorch
 %matplotlib inline
 from d2l import torch as d2l
+import random
 import torch
-from torch.distributions import multinomial
+from torch.distributions.multinomial import Multinomial
 ```
 
 ```{.python .input}
-#@tab tensorflow
+%%tab tensorflow
 %matplotlib inline
 from d2l import tensorflow as d2l
+import random
 import tensorflow as tf
-import tensorflow_probability as tfp
-import numpy as np
+from tensorflow_probability import distributions as tfd
 ```
 
-次に、サイコロを投げることができるようにします。統計学では、確率分布から例を引き出すこのプロセスを「サンプリング」と呼んでいます。いくつかの離散的な選択肢に確率を割り当てる分布は、
-*多項分布*。より正式な定義を挙げて
-*ディストリビューション*は後になりますが、大まかに言うと、それは単なる
-イベントに対する確率。 
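+Now, suppose that the coin was in fact fair, i.e., $P(\textrm{heads}) = 0.5$. To simulate tosses of a fair coin, we can invoke any random number generator, which gives us an easy way to draw samples of an event with probability $0.5$. For example, Python's `random.random` yields numbers in the interval $[0,1)$ where the probability of lying in any sub-interval $[a, b] \subset [0,1]$ is equal to $b-a$. Thus we can get out `0` and `1` with probability `0.5` each by testing whether the returned float is greater than `0.5`.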
+
+```{.python .input}
+%%tab all
+num_tosses = 100
+heads = sum([random.random() > 0.5 for _ in range(num_tosses)])
+tails = num_tosses - heads
+print("heads, tails: ", [heads, tails])
+```
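
The property that a uniform draw lands in $[a, b)$ with probability $b-a$ can be checked empirically. The following is a minimal sketch using only the standard library; the function name `estimate_interval_prob` and the fixed seed are assumptions chosen for reproducibility, not part of the book's code.

```python
import random


def estimate_interval_prob(a, b, num_samples, seed=0):
    """Estimate P(a <= U < b) for U uniform on [0, 1) by sampling."""
    rng = random.Random(seed)  # private generator, so the global state is untouched
    hits = sum(a <= rng.random() < b for _ in range(num_samples))
    return hits / num_samples


est = estimate_interval_prob(0.2, 0.5, 100_000)
print(est)  # should be close to b - a = 0.3
```

Testing `random.random() > 0.5` is the special case $a = 0.5$, $b = 1$.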
 
-1 つのサンプルを描画するには、単純に確率のベクトルを渡します。出力は同じ長さの別のベクトルです。インデックス $i$ の値は、サンプリング結果が $i$ に相当する回数です。
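+More generally, we can simulate multiple draws from any variable with a finite number of possible outcomes (like the toss of a coin or the roll of a die) by calling the multinomial function, setting the first argument to the number of draws and the second to a list of probabilities associated with each of the possible outcomes. To simulate 100 tosses of a fair coin, we assign the probability vector `[0.5, 0.5]`, interpreting index 0 as heads and index 1 as tails. The function returns a vector with length equal to the number of possible outcomes (here, 2), where the first component tells us the number of occurrences of heads and the second component tells us the number of occurrences of tails.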
 
 ```{.python .input}
-fair_probs = [1.0 / 6] * 6
-np.random.multinomial(1, fair_probs)
+%%tab mxnet
+fair_probs = [0.5, 0.5]
+multinomial(100, fair_probs)
 ```
 
 ```{.python .input}
-#@tab pytorch
-fair_probs = torch.ones([6]) / 6
-multinomial.Multinomial(1, fair_probs).sample()
+%%tab pytorch
+fair_probs = torch.tensor([0.5, 0.5])
+Multinomial(100, fair_probs).sample()
 ```
 
 ```{.python .input}
-#@tab tensorflow
-fair_probs = tf.ones(6) / 6
-tfp.distributions.Multinomial(1, fair_probs).sample()
+%%tab tensorflow
+fair_probs = tf.ones(2) / 2
+tfd.Multinomial(100, fair_probs).sample()
 ```
 
-サンプラーを何度も実行すると、毎回ランダムな値が出てくることがわかります。ダイスの公平性を推定する場合と同様に、同じ分布から多数のサンプルを生成したい場合がよくあります。Python `for` ループでこれを行うのは耐え難いほど遅くなるので、使っている関数は一度に複数のサンプルを描画し、望みどおりの形状の独立したサンプルの配列を返すことをサポートしています。
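+Each time you run this sampling process, you will receive a new random value that may differ from the previous outcome. Dividing by the number of tosses gives us the *frequency* of each outcome in our data. Note that these frequencies, just like the probabilities that they are intended to estimate, sum to $1$.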
 
 ```{.python .input}
-np.random.multinomial(10, fair_probs)
+%%tab mxnet
+multinomial(100, fair_probs) / 100
 ```
 
 ```{.python .input}
-#@tab pytorch
-multinomial.Multinomial(10, fair_probs).sample()
+%%tab pytorch
+Multinomial(100, fair_probs).sample() / 100
 ```
 
 ```{.python .input}
-#@tab tensorflow
-tfp.distributions.Multinomial(10, fair_probs).sample()
+%%tab tensorflow
+tfd.Multinomial(100, fair_probs).sample() / 100
 ```
 
-ダイスのロールをサンプリングする方法がわかったところで、1000 個のロールをシミュレートできます。その後、1000回のロールのそれぞれの後に、各数字が何回ロールされたかを調べて数えることができます。具体的には、相対度数を真の確率の推定値として計算します。
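+Here, even though our simulated coin is fair (we set the probabilities `[0.5, 0.5]` ourselves), the counts of heads and tails may not be identical. That is because we only drew a finite number of samples. If we had not implemented the simulation ourselves, and only saw the outcome, how would we know whether the coin was slightly unfair or whether the possible deviation from $1/2$ was just an artifact of the small sample size? Let's see what happens when we simulate `10000` tosses.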
 
 ```{.python .input}
-counts = np.random.multinomial(1000, fair_probs).astype(np.float32)
-counts / 1000
+%%tab mxnet
+counts = multinomial(10000, fair_probs).astype(np.float32)
+counts / 10000
 ```
 
 ```{.python .input}
-#@tab pytorch
-# Store the results as 32-bit floats for division
-counts = multinomial.Multinomial(1000, fair_probs).sample()
-counts / 1000  # Relative frequency as the estimate
+%%tab pytorch
+counts = Multinomial(10000, fair_probs).sample()
+counts / 10000
 ```
 
 ```{.python .input}
-#@tab tensorflow
-counts = tfp.distributions.Multinomial(1000, fair_probs).sample()
-counts / 1000
+%%tab tensorflow
+counts = tfd.Multinomial(10000, fair_probs).sample()
+counts / 10000
 ```
 
-公正なダイスからデータを生成したため、各結果の真の確率 $\frac{1}{6}$、およそ $0.167$ であることがわかっているため、上記の出力推定値は良好に見えます。 
-
-また、これらの確率が時間の経過とともに真の確率に向かってどのように収束するかを視覚化することもできます。各グループが10個のサンプルを採取する500グループの実験を行いましょう。
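+In general, for averages of repeated events (like coin tosses), as the number of repetitions grows, our estimates are guaranteed to converge to the true underlying probabilities. The mathematical formulation of this phenomenon is called the *law of large numbers*, and the *central limit theorem* tells us that in many situations, as the sample size $n$ grows, these errors should go down at a rate of $(1/\sqrt{n})$. Let's get some more intuition by studying how our estimate evolves as we grow the number of tosses from `1` to `10000`.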
 
 ```{.python .input}
-counts = np.random.multinomial(10, fair_probs, size=500)
+%%tab mxnet
+counts = multinomial(1, fair_probs, size=10000)
 cum_counts = counts.astype(np.float32).cumsum(axis=0)
 estimates = cum_counts / cum_counts.sum(axis=1, keepdims=True)
-
-d2l.set_figsize((6, 4.5))
-for i in range(6):
-    d2l.plt.plot(estimates[:, i].asnumpy(),
-                 label=("P(die=" + str(i + 1) + ")"))
-d2l.plt.axhline(y=0.167, color='black', linestyle='dashed')
-d2l.plt.gca().set_xlabel('Groups of experiments')
-d2l.plt.gca().set_ylabel('Estimated probability')
-d2l.plt.legend();
 ```
 
 ```{.python .input}
-#@tab pytorch
-counts = multinomial.Multinomial(10, fair_probs).sample((500,))
+%%tab pytorch
+counts = Multinomial(1, fair_probs).sample((10000,))
 cum_counts = counts.cumsum(dim=0)
 estimates = cum_counts / cum_counts.sum(dim=1, keepdims=True)
-
-d2l.set_figsize((6, 4.5))
-for i in range(6):
-    d2l.plt.plot(estimates[:, i].numpy(),
-                 label=("P(die=" + str(i + 1) + ")"))
-d2l.plt.axhline(y=0.167, color='black', linestyle='dashed')
-d2l.plt.gca().set_xlabel('Groups of experiments')
-d2l.plt.gca().set_ylabel('Estimated probability')
-d2l.plt.legend();
+estimates = estimates.numpy()
 ```
 
 ```{.python .input}
-#@tab tensorflow
-counts = tfp.distributions.Multinomial(10, fair_probs).sample(500)
+%%tab tensorflow
+counts = tfd.Multinomial(1, fair_probs).sample(10000)
 cum_counts = tf.cumsum(counts, axis=0)
 estimates = cum_counts / tf.reduce_sum(cum_counts, axis=1, keepdims=True)
+estimates = estimates.numpy()
+```
 
-d2l.set_figsize((6, 4.5))
-for i in range(6):
-    d2l.plt.plot(estimates[:, i].numpy(),
-                 label=("P(die=" + str(i + 1) + ")"))
-d2l.plt.axhline(y=0.167, color='black', linestyle='dashed')
-d2l.plt.gca().set_xlabel('Groups of experiments')
+```{.python .input}
+%%tab all
+d2l.set_figsize((4.5, 3.5))
+d2l.plt.plot(estimates[:, 0], label=("P(coin=heads)"))
+d2l.plt.plot(estimates[:, 1], label=("P(coin=tails)"))
+d2l.plt.axhline(y=0.5, color='black', linestyle='dashed')
+d2l.plt.gca().set_xlabel('Samples')
 d2l.plt.gca().set_ylabel('Estimated probability')
 d2l.plt.legend();
 ```
 
-各実線曲線はダイの6つの値の1つに対応し、各実験グループの後に評価されたとおりに、ダイスがその値を上げる推定確率を示します。黒い破線は真の基礎となる確率を示しています。より多くの実験を行うことでより多くのデータを得ると、$6$ の実線曲線は真の確率に向かって収束します。 
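+Each solid curve corresponds to one of the two values of the coin and gives our estimated probability that the coin turns up that value after each additional toss. The dashed black line gives the true underlying probability. As we get more data by conducting more experiments, the curves converge towards the true probability. You might already begin to see the shape of some of the more advanced questions that preoccupy statisticians: How quickly does this convergence happen? If we had already tested many coins manufactured at the same plant, how might we incorporate this information?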
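+
+As a rough illustration of the convergence rate (a sketch that relies on the standard variance of a sample mean and assumes independent tosses with $P(\textrm{heads}) = p$), the standard deviation of the estimate $\hat{p}_n = n_h/n$ is
+
+$$\sqrt{\mathrm{Var}[\hat{p}_n]} = \sqrt{\frac{p(1-p)}{n}}, \quad \textrm{so for } p = 0.5: \quad n = 100 \Rightarrow 0.05, \quad n = 10000 \Rightarrow 0.005.$$
+
+Under these assumptions, a hundredfold increase in the number of tosses shrinks the typical error only tenfold.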
+
+## A More Formal Treatment
 
-### 確率論の公理
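+We have already gotten pretty far: posing a probabilistic model, generating synthetic data, running a statistical estimator, empirically assessing convergence, and reporting error metrics (checking the deviation). However, to go much further, we will need to be more precise.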
 
-ダイスのロールを扱う場合、集合 $\mathcal{S} = \{1, 2, 3, 4, 5, 6\}$ を*サンプル空間* または*結果空間*と呼びます。ここで、各要素は*結果* です。*event* は、特定のサンプル空間からの結果のセットです。たとえば、「$5$」($\{5\}$) と「奇数を見る」($\{1, 3, 5\}$) は、どちらもサイコロを振るのに有効なイベントです。ランダム実験の結果がイベント $\mathcal{A}$ の場合、イベント $\mathcal{A}$ が発生していることに注意してください。つまり、サイコロを振った後に $3$ ドットが上向きになった場合、$3 \in \{1, 3, 5\}$ 以降、「奇数を見た」というイベントが発生したと言えます。 
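+When dealing with randomness, we denote the set of possible outcomes $\mathcal{S}$ and call it the *sample space* or *outcome space*. Here, each element is a distinct possible *outcome*. In the case of tossing a single coin, $\mathcal{S} = \{\textrm{heads}, \textrm{tails}\}$. For a single die, $\mathcal{S} = \{1, 2, 3, 4, 5, 6\}$. When flipping two coins, there are four possible outcomes: $\{(\textrm{heads}, \textrm{heads}), (\textrm{heads}, \textrm{tails}), (\textrm{tails}, \textrm{heads}),  (\textrm{tails}, \textrm{tails})\}$.
+*Events* are subsets of the sample space.
+For instance, the event "the first coin toss comes up heads" corresponds to the set $\{(\textrm{heads}, \textrm{heads}), (\textrm{heads}, \textrm{tails})\}$. Whenever the outcome $z$ of a random experiment satisfies $z \in \mathcal{A}$, then event $\mathcal{A}$ has occurred. For a single roll of a die, we could define the events "seeing a $5$" ($\mathcal{A} = \{5\}$) and "seeing an odd number" ($\mathcal{B} = \{1, 3, 5\}$). In this case, if the die came up $5$, we would say that both $\mathcal{A}$ and $\mathcal{B}$ occurred. On the other hand, if $z = 3$, then $\mathcal{A}$ did not occur but $\mathcal{B}$ did.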
 
-正式には、*probability* は集合を実数値にマッピングする関数と考えることができます。$P(\mathcal{A})$ と表される所定のサンプル空間 $\mathcal{S}$ における事象 $\mathcal{A}$ の確率は、次の特性を満たします。 
+A *probability* function maps events onto real values ${P: \mathcal{A} \subseteq \mathcal{S} \rightarrow [0,1]}$. The probability, denoted $P(\mathcal{A})$, of an event $\mathcal{A}$ in the given sample space $\mathcal{S}$, has the following properties:
 
-* いずれの事象 $\mathcal{A}$ についても、その確率は負になることはありません。つまり $P(\mathcal{A}) \geq 0$
+* The probability of any event $\mathcal{A}$ is a nonnegative real number, i.e., $P(\mathcal{A}) \geq 0$;
 * The probability of the entire sample space is $1$, i.e., $P(\mathcal{S}) = 1$;
-* *相互に排他的* ($i \neq j$ すべてで $\mathcal{A}_i \cap \mathcal{A}_j = \emptyset$) の可算事象 $\mathcal{A}_1, \mathcal{A}_2, \ldots$ のシーケンスの場合、発生する確率は個々の確率の合計、つまり $P(\bigcup_{i=1}^{\infty} \mathcal{A}_i) = \sum_{i=1}^{\infty} P(\mathcal{A}_i)$ と等しくなります。
+* For any countable sequence of events $\mathcal{A}_1, \mathcal{A}_2, \ldots$ that are *mutually exclusive* (i.e., $\mathcal{A}_i \cap \mathcal{A}_j = \emptyset$ for all $i \neq j$), the probability that any of them happens is equal to the sum of their individual probabilities, i.e., $P(\bigcup_{i=1}^{\infty} \mathcal{A}_i) = \sum_{i=1}^{\infty} P(\mathcal{A}_i)$.
 
-これらは1933年にコルモゴロフが提案した確率論の公理でもあります。この公理システムのおかげで、ランダム性に関する哲学的な論争を避けることができます。代わりに、数学的な言語で厳密に推論することができます。たとえば、事象 $\mathcal{A}_1$ をサンプル空間全体とし、$i > 1$ を $\mathcal{A}_i = \emptyset$ とすることで、$P(\emptyset) = 0$、つまり不可能な事象の確率が $0$ であることを証明できます。 
+These axioms of probability theory, proposed by :citet:`Kolmogorov.1933`, can be applied to rapidly derive a number of important consequences. For instance, it follows immediately that the probability of any event $\mathcal{A}$
+*or* its complement $\mathcal{A}'$ occurring is 1
+(because $\mathcal{A} \cup \mathcal{A}' = \mathcal{S}$). We can also prove that $P(\emptyset) = 0$ because $1 = P(\mathcal{S} \cup \mathcal{S}') = P(\mathcal{S} \cup \emptyset) = P(\mathcal{S}) + P(\emptyset) = 1 + P(\emptyset)$. Consequently, the probability of any event $\mathcal{A}$
+*and* its complement $\mathcal{A}'$ occurring simultaneously
+is $P(\mathcal{A} \cap \mathcal{A}') = 0$. Informally, this tells us that impossible events have zero probability of occurring.
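+
+The same additivity argument gives the complement rule. As a short worked step (using only the axioms above, with $\mathcal{A}$ and $\mathcal{A}'$ disjoint and $\mathcal{A} \cup \mathcal{A}' = \mathcal{S}$):
+
+$$1 = P(\mathcal{S}) = P(\mathcal{A} \cup \mathcal{A}') = P(\mathcal{A}) + P(\mathcal{A}'), \quad \textrm{hence} \quad P(\mathcal{A}') = 1 - P(\mathcal{A}).$$
+
+For a fair die and the event $\mathcal{B} = \{1, 3, 5\}$ of seeing an odd number, this gives $P(\mathcal{B}') = 1 - 1/2 = 1/2$ for seeing an even number.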
 
-### ランダム変数
+## Random Variables
 
-サイコロを投げるランダム実験では、*ランダム変数*の概念を導入しました。確率変数はほとんどどんな量でもよく、決定論的ではありません。ランダム実験では、一連の可能性のうち 1 つの値を取ることができます。ダイスを転がすサンプル空間 $\mathcal{S} = \{1, 2, 3, 4, 5, 6\}$ に値が入る確率変数 $X$ を考えてみましょう。「$5$ を見ている」というイベントを $\{X = 5\}$ または $X = 5$、その確率は $P(\{X = 5\})$ または $P(X = 5)$ と表すことができます。$P(X = a)$ では、確率変数 $X$ と $X$ が取ることができる値 (例:$a$) とを区別します。しかし、そのような歩兵は面倒な表記法になります。コンパクトな表記法では、$P(X)$ を確率変数 $X$ に対する*分布* として表すことができます。この分布は $X$ が何らかの値をとる確率を示しています。一方、確率変数が値 $a$ を取る確率を示すために $P(a)$ と簡単に書くことができます。確率論の事象は標本空間からの結果の集合であるため、確率変数が取る値の範囲を指定することができます。たとえば、$P(1 \leq X \leq 3)$ は事象 $\{1 \leq X \leq 3\}$ の確率を表し、$\{X = 1, 2, \text{or}, 3\}$ を意味します。同様に、$P(1 \leq X \leq 3)$ は、確率変数 $X$ が $\{1, 2, 3\}$ から値をとることができる確率を表します。 
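+When we spoke about events like the roll of a die coming up odd or the first coin toss coming up heads, we were invoking the idea of a *random variable*. Formally, random variables are mappings from an underlying sample space to a set of (possibly many) values. You might wonder how a random variable is different from the sample space, since both are collections of outcomes. Importantly, random variables can be much coarser than the raw sample space. We can define a binary random variable like "greater than 0.5" even when the underlying sample space is infinite, e.g., points on the line segment between $0$ and $1$. Additionally, multiple random variables can share the same underlying sample space. For example, "whether my home alarm goes off" and "whether my house was burgled" are both binary random variables that share an underlying sample space. Consequently, knowing the value taken by one random variable can tell us something about the likely value of another random variable. Knowing that the alarm went off, we might suspect that the house was likely burgled.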
 
-ダイスの側面のような*離散*確率変数と、人の体重や身長のような*連続*確率変数には微妙な違いがあることに注意してください。2人の身長がまったく同じかどうかを尋ねる意味はほとんどありません。十分に正確に測定すれば、地球上でまったく同じ身長を持つ人はいないことがわかります。実際、十分に細かく測定すると、起床時と寝るときの身長は同じではありません。したがって、誰かの身長が1.80139278291028719210196740527486202メートルである確率について尋ねる目的はありません。世界の人口を考えると、確率は事実上 0.この場合、誰かの身長が特定の間隔、たとえば1.79〜1.81メートルに収まるかどうかを尋ねる方が理にかなっています。このような場合、値が*density* として認識される尤度を定量化します。ちょうど1.80メートルの高さには確率はありませんが、密度はゼロではありません。2つの異なる高さの間の区間では、確率はゼロではありません。このセクションの残りの部分では、離散空間における確率について考察します。連続確率変数に対する確率については、:numref:`sec_random_variables` を参照してください。 
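+Every value taken by a random variable corresponds to a subset of the underlying sample space. Thus the occurrence where the random variable $X$ takes value $v$, denoted by $X=v$, is an *event* and $P(X=v)$ denotes its probability. Sometimes this notation can get clunky, and we can abuse notation when the context is clear. For example, we might use $P(X)$ to refer broadly to the *distribution* of $X$, i.e., the function that tells us the probability that $X$ takes any given value. Other times we write expressions like $P(X,Y) = P(X) P(Y)$ as a shorthand to express a statement that is true for all of the values that the random variables $X$ and $Y$ can take, i.e., for all $i,j$ it holds that $P(X=i \textrm{ and } Y=j) = P(X=i)P(Y=j)$. Other times, we abuse notation by writing $P(v)$ when the random variable is clear from the context. Since an event in probability theory is a set of outcomes from the sample space, we can specify a range of values for a random variable to take. For example, $P(1 \leq X \leq 3)$ denotes the probability of the event $\{1 \leq X \leq 3\}$.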
 
-## 複数の確率変数を扱う
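+Note that there is a subtle difference between *discrete* random variables, like flips of a coin or tosses of a die, and *continuous* ones, like the weight and the height of a person sampled at random from the population. In the continuous case we seldom really care about someone's exact height. Moreover, if we took precise enough measurements, we would find that no two people on the planet have the exact same height. In fact, with fine enough measurements, you would never have the same height when you wake up and when you go to sleep. There is little point in asking about the exact probability that someone is 1.801392782910287192 meters tall. Instead, we typically care more about being able to say whether someone's height falls into a given interval, say between 1.79 and 1.81 meters. In these cases we work with probability *densities*. The height of exactly 1.80 meters has no probability, but nonzero density. To work out the probability assigned to an interval, we must take an *integral* of the density over that interval.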
 
-多くの場合、一度に複数の確率変数を検討する必要があります。例えば、病気と症状の関係をモデル化したいと思うかもしれません。「インフルエンザ」や「咳」などの病気や症状を考えると、ある程度の確率で患者に発生する場合と発生しない場合があります。両者の確率がゼロに近づくことを願っていますが、これらの確率と互いの関係を推定して、推論を適用してより良い医療を実現できるようにしたいと思うかもしれません。 
+## Multiple Random Variables
 
-より複雑な例として、イメージには数百万のピクセルが含まれており、したがって数百万の確率変数が含まれています。また、多くの場合、画像にはラベルが付いており、画像内のオブジェクトを識別します。ラベルは確率変数と考えることもできます。すべてのメタデータは、位置、時間、絞り、焦点距離、ISO、焦点距離、カメラタイプなどのランダム変数と考えることもできます。これらはすべて、連動して発生する確率変数です。複数の確率変数を扱う場合、関心のある量がいくつかあります。 
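+You might have noticed that we could not even make it through the previous section without making statements involving interactions among multiple random variables (recall $P(X,Y) = P(X) P(Y)$). Most of machine learning is concerned with such relationships. Here, the sample space would be the population of interest, say customers who transact with a business, photographs on the Internet, or proteins known to biologists. Each random variable would represent the (unknown) value of a different attribute. Whenever we sample an individual from the population, we observe a realization of each of the random variables. Because the values taken by random variables correspond to subsets of the sample space that could be overlapping, partially overlapping, or entirely disjoint, knowing the value taken by one random variable can cause us to update our beliefs about which values of another random variable are likely. If a patient walks into a hospital and we observe that they are having trouble breathing and have lost their sense of smell, then we believe that they are more likely to have COVID-19 than we might if they had no trouble breathing and a perfectly ordinary sense of smell.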
 
-### 合同確率
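+When working with multiple random variables, we can construct events corresponding to every combination of values that the variables can jointly take. The probability function that assigns probabilities to each of these combinations (e.g., $A=a$ and $B=b$) is called the *joint probability* function and simply returns the probability assigned to the intersection of the corresponding subsets of the sample space. The *joint probability* assigned to the event where random variables $A$ and $B$ take values $a$ and $b$, respectively, is denoted $P(A = a, B = b)$, where the comma indicates "and". Note that for any values $a$ and $b$, it follows that $P(A=a, B=b) \leq P(A=a)$ and $P(A=a, B=b) \leq P(B = b)$, since for $A=a$ and $B=b$ to happen, $A=a$ has to happen *and* $B=b$ also has to happen. Interestingly, the joint probability tells us all that we can know about these random variables in a probabilistic sense, and can be used to derive many other useful quantities, including recovering the individual distributions $P(A)$ and $P(B)$. To recover $P(A=a)$ we simply sum up $P(A=a, B=v)$ over all values $v$ that the random variable $B$ can take: $P(A=a) = \sum_v P(A=a, B=v)$.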
 
-1つ目は*ジョイント確率* $P(A = a, B=b)$と呼ばれます。$a$と$b$の値があれば、結合確率で答えることができます。$A=a$と$B=b$が同時に発生する確率はどれくらいですか？$a$ と $b$ の値については $P(A=a, B=b) \leq P(A=a)$ であることに注意してください。$A=a$ と $B=b$ が起こるためには $A=a$ が起こらなければならず、* $B=b$ も起こらなければならない (逆も同様) ので、これは事実でなければならない。したがって、$A=a$ と $B=b$ はそれぞれ $A=a$ または $B=b$ よりも高い確率になることはありません。 
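+The ratio $\frac{P(A=a, B=b)}{P(A=a)} \leq 1$ turns out to be extremely important. It is called the *conditional probability*, and is denoted via the "$\mid$" symbol: $P(B=b \mid A=a) = P(A=a,B=b)/P(A=a)$. It tells us the new probability associated with the event $B=b$, once we condition on the fact that $A=a$ took place. We can think of this conditional probability as restricting attention only to the subset of the sample space associated with $A=a$ and then renormalizing so that all probabilities sum to 1. Conditional probabilities are in fact just ordinary probabilities and thus respect all of the axioms, as long as we condition all terms on the same event and thus restrict attention to the same sample space. For instance, for disjoint events $\mathcal{B}$ and $\mathcal{B}'$, we have that $P(\mathcal{B} \cup \mathcal{B}' \mid A = a) = P(\mathcal{B} \mid A = a) + P(\mathcal{B}' \mid A = a)$.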
 
-### 条件付き確率
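+Using the definition of conditional probabilities, we can derive the famous result called *Bayes' theorem*. By construction, we have that $P(A, B) = P(B\mid A) P(A)$ and $P(A, B) = P(A\mid B) P(B)$. Combining both equations yields $P(B\mid A) P(A) = P(A\mid B) P(B)$ and hence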
 
-これにより、$0 \leq \frac{P(A=a, B=b)}{P(A=a)} \leq 1$という興味深い比率になります。この比率を*条件付き確率*と呼び、$P(B=b \mid A=a)$ で表します。$A=a$ が発生した場合、$B=b$ の確率です。 
+$$P(A \mid B) = \frac{P(B\mid A) P(A)}{P(B)}.$$
 
-### ベイズの定理
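+This simple equation has profound implications because it allows us to reverse the order of conditioning. If we know how to estimate $P(B\mid A)$, $P(A)$, and $P(B)$, then we can estimate $P(A\mid B)$. We often find it easier to estimate one term directly but not the other, and Bayes' theorem can come to the rescue here. For instance, if we know the prevalence of symptoms for a given disease, and the overall prevalences of the disease and symptoms, respectively, we can determine how likely someone is to have the disease based on their symptoms. In some cases we might not have direct access to $P(B)$, such as the prevalence of symptoms. In this case a simplified version of Bayes' theorem comes in handy: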
 
-条件付き確率の定義を用いて、統計学で最も有用で有名な方程式の一つ、*ベイズの定理*を導き出すことができます。それは次のようになります。構造上、$P(A, B) = P(B \mid A) P(A)$という*乗算ルール*があります。対称性により、これは$P(A, B) = P(A \mid B) P(B)$にも当てはまります。$P(B) > 0$ と仮定します。得た条件変数の1つを解く 
+$$P(A \mid B) \propto P(B \mid A) P(A).$$
 
-$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}.$$
+Since we know that $P(A \mid B)$ must be normalized to $1$, i.e., $\sum_a P(A=a \mid B) = 1$, we can use it to compute
 
-ここでは $P(A, B)$ が*結合分布*、$P(A \mid B)$ が*条件付き分布* であるよりコンパクトな表記法を使用していることに注意してください。このような分布は、特定の値 $A = a, B=b$ について評価できます。 
+$$P(A \mid B) = \frac{P(B \mid A) P(A)}{\sum_a P(B \mid A=a) P(A = a)}.$$
 
-### 疎外化
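+In Bayesian statistics, we think of an observer as possessing some (subjective) prior beliefs about the plausibility of the available hypotheses encoded in the *prior* $P(H)$, and a *likelihood function* $P(E \mid H)$ that says how likely one is to observe any value of the collected evidence for each of the hypotheses in the class. Bayes' theorem is then interpreted as telling us how to update the initial *prior* $P(H)$ in light of the available evidence $E$ to produce *posterior* beliefs $P(H \mid E) = \frac{P(E \mid H) P(H)}{P(E)}$. Informally, this can be stated as "posterior equals prior times likelihood, divided by the evidence". Now, because the evidence $P(E)$ is the same for all hypotheses, we can get away with simply normalizing over the hypotheses.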
 
-ベイズの定理は、因果関係など、あるものを別のものから推測したい場合に非常に役立ちますが、このセクションの後半で説明するように、逆方向のプロパティしか知りません。これを成し遂げるために必要な重要な操作の一つが、*疎外化*です。$P(A, B)$から$P(B)$を決定する演算です。$B$ の確率は $A$ のすべての可能な選択肢を説明し、それらすべての結合確率を集計することになることがわかります。 
+Note that $\sum_a P(A=a \mid B) = 1$ also allows us to *marginalize* over random variables. That is, we can drop variables from a joint distribution such as $P(A, B)$. After all, we have that
 
-$$P(B) = \sum_{A} P(A, B),$$
+$$\sum_a P(A=a, B) = P(B) \sum_a P(A = a \mid B) = P(B).$$
 
-これは*sumルール*としても知られています。周縁化の結果として生じる確率または分布は、*周辺確率* または*周辺分布* と呼ばれます。 
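+Independence is another fundamentally important concept that forms the backbone of many important ideas in statistics. In short, two variables are *independent* if conditioning on the value of $A$ does not cause any change to the probability distribution associated with $B$ and vice versa. More formally, independence, denoted $A \perp B$, requires that $P(A \mid B) = P(A)$ and, consequently, that $P(A,B) = P(A \mid B) P(B) = P(A) P(B)$. Independence is often an appropriate assumption. For example, if the random variable $A$ represents the outcome from tossing one fair coin and the random variable $B$ represents the outcome from tossing another, then knowing whether $A$ came up heads should not influence the probability of $B$ coming up heads.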
 
-### 独立性
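+Independence is especially useful when it holds among the successive draws of our data from some underlying distribution (allowing us to make strong statistical conclusions) or when it holds among various variables in our data, allowing us to work with simpler models that encode this independence structure. On the other hand, estimating the dependencies among random variables is often the very aim of learning. We care to estimate the probability of disease given symptoms specifically because we believe that diseases and symptoms are *not* independent.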
 
-チェックすべきもう 1 つの便利なプロパティは、*dependence* と*dependence* です。2 つの確率変数 $A$ と $B$ が独立しているということは、$A$ の 1 つの事象が発生しても $B$ の事象の発生に関する情報は明らかにされないことを意味します。この場合は $P(B \mid A) = P(B)$ です。統計学者は通常、これを $A \perp  B$ と表現します。ベイズの定理からすると、$P(A \mid B) = P(A)$もすぐに続く。それ以外の場合は $A$ と $B$ を従属と呼んでいます。たとえば、ダイスの連続する2つのロールは独立しています。対照的に、照明スイッチの位置と部屋の明るさはそうではありません (ただし、電球の破損、停電、またはスイッチの破損が常に発生することがあるため、これらは完全に決定論的ではありません)。 
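+Note that because conditional probabilities are proper probabilities, the concepts of independence and dependence also apply to them. Two random variables $A$ and $B$ are *conditionally independent* given a third variable $C$ if and only if $P(A, B \mid C) = P(A \mid C)P(B \mid C)$. Interestingly, two variables can be independent in general but become dependent when conditioning on a third. This often occurs when the two random variables $A$ and $B$ correspond to causes of some third variable $C$. For example, broken bones and lung cancer might be independent in the general population, but if we condition on being in the hospital then we might find that broken bones are negatively correlated with lung cancer. That is because the broken bone *explains away* why a person is in the hospital and thus lowers the probability that they are hospitalized because of lung cancer.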
 
-$P(A \mid B) = \frac{P(A, B)}{P(B)} = P(A)$ は $P(A, B) = P(A)P(B)$ と等しいので、2 つの確率変数は、その結合分布が個々の分布の積である場合にのみ独立します。同様に、2 つの確率変数 $A$ と $B$ は、$P(A, B \mid C) = P(A \mid C)P(B \mid C)$ の場合に限り、別の確率変数 $C$ が与えられると、*条件付きで独立* になります。これは $A \perp B \mid C$ と表現されます。 
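+Conversely, two dependent random variables can become independent upon conditioning on a third. This often happens when two otherwise unrelated events have a common cause. Shoe size and reading level are highly correlated among elementary school students, but this correlation disappears if we condition on age.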
 
-### アプリケーション
+## An Example
 :label:`subsec_probability_hiv_app`
 
-私たちのスキルを試してみよう。医師が患者にHIV検査を実施すると仮定します。この検査はかなり正確で、患者が健康であるが病気であると報告した場合、1％の確率で不合格となります。さらに、患者が実際にHIVに感染していれば、HIVの検出に失敗することはありません。診断を示すために $D_1$ を使用し (陽性の場合は $1$、陰性の場合は $0$)、HIV の状態を示すために $H$ (陽性の場合は $1$、陰性の場合は $0$) を使用します。:numref:`conditional_prob_D1` は、このような条件付き確率を列挙しています。 
-
-:$P(D_1 \mid H)$ の条件付き確率。 
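+Let's put our skills to the test. Assume that a doctor administers an HIV test to a patient. This test is fairly accurate: it fails with only 1% probability by reporting a healthy patient as diseased. Moreover, it never fails to detect HIV if the patient actually has it. We use $D_1 \in \{0, 1\}$ to indicate the diagnosis ($0$ if negative and $1$ if positive) and $H \in \{0, 1\}$ to denote the HIV status.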
 
 | Conditional probability | $H=1$ | $H=0$ |
-|---|---|---|
-|$P(D_1 = 1 \mid H)$|            1 |         0.01 |
-|$P(D_1 = 0 \mid H)$|            0 |         0.99 |
-:label:`conditional_prob_D1`
+|:------------------------|------:|------:|
+| $P(D_1 = 1 \mid H)$        |     1 |  0.01 |
+| $P(D_1 = 0 \mid H)$        |     0 |  0.99 |
 
-条件付き確率は確率と同様に合計が 1 になる必要があるため、列の合計はすべて 1 です (ただし、行の合計はそうではありません)。検査が陽性になった場合、患者がHIVに感染する確率、つまり$P(H = 1 \mid D_1 = 1)$を調べてみましょう。明らかに、これは誤警報の数に影響を与えるため、病気がどれほど一般的であるかに依存します。人口が非常に健康であると仮定します (例:$P(H=1) = 0.0015$)。ベイズの定理を適用するには、周縁化と乗算則を適用して決定する必要があります 
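+Note that the column sums are all 1 (but the row sums are not), since they are conditional probabilities. Let's compute the probability of the patient having HIV if the test comes back positive, i.e., $P(H = 1 \mid D_1 = 1)$. Intuitively this is going to depend on how common the disease is, since it affects the number of false alarms. Assume that the population is fairly free of the disease, e.g., $P(H=1) = 0.0015$. To apply Bayes' theorem, we need to apply marginalization to determine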
 
 $$\begin{aligned}
-&P(D_1 = 1) \\
+P(D_1 = 1)
 =& P(D_1=1, H=0) + P(D_1=1, H=1)  \\
 =& P(D_1=1 \mid H=0) P(H=0) + P(D_1=1 \mid H=1) P(H=1) \\
 =& 0.011485.
 \end{aligned}
 $$
 
-したがって、我々が得る 
+This leads us to
 
-$$\begin{aligned}
-&P(H = 1 \mid D_1 = 1)\\ =& \frac{P(D_1=1 \mid H=1) P(H=1)}{P(D_1=1)} \\ =& 0.1306 \end{aligned}.$$
-
-つまり、非常に正確な検査を行っているにもかかわらず、患者が実際にHIVに感染している可能性は13.06％しかありません。ご覧のとおり、確率は直観に反する可能性があります。 
-
-そのような恐ろしい知らせを受け取ったとき、患者は何をすべきでしょうか？おそらく、患者は明確にするために別の検査を実施するよう医師に依頼するでしょう。2番目のテストは特性が異なり、:numref:`conditional_prob_D2`に示すように、最初のテストほど良くありません。 
+$$P(H = 1 \mid D_1 = 1) = \frac{P(D_1=1 \mid H=1) P(H=1)}{P(D_1=1)} = 0.1306.$$
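+
+Spelling out the arithmetic with the values from the table and the assumed prior $P(H=1) = 0.0015$ (so that $P(H=0) = 0.9985$):
+
+$$\begin{aligned}
+P(D_1 = 1) &= 0.01 \cdot 0.9985 + 1 \cdot 0.0015 = 0.009985 + 0.0015 = 0.011485, \\
+P(H = 1 \mid D_1 = 1) &= \frac{1 \cdot 0.0015}{0.011485} \approx 0.1306.
+\end{aligned}$$
+
+Most of the mass in the denominator, $0.009985$ out of $0.011485$, comes from false alarms among healthy patients.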
 
-:$P(D_2 \mid H)$ の条件付き確率。 
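+In other words, there is only a 13.06% chance that the patient actually has HIV, despite the test being pretty accurate. As we can see, probability can be counterintuitive. What should a patient do upon receiving such terrifying news? Likely, the patient would ask the physician to administer another test to get clarity. The second test has different characteristics and it is not as good as the first one.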
 
 | Conditional probability | $H=1$ | $H=0$ |
-|---|---|---|
-|$P(D_2 = 1 \mid H)$|            0.98 |         0.03 |
-|$P(D_2 = 0 \mid H)$|            0.02 |         0.97 |
-:label:`conditional_prob_D2`
+|:------------------------|------:|------:|
+| $P(D_2 = 1 \mid H)$          |  0.98 |  0.03 |
+| $P(D_2 = 0 \mid H)$          |  0.02 |  0.97 |
 
-残念ながら、2番目のテストも陽性に戻ります。条件付き独立性を仮定して、ベイズの定理を呼び出すために必要な確率を計算してみましょう。 
-
-$$\begin{aligned}
-&P(D_1 = 1, D_2 = 1 \mid H = 0) \\
-=& P(D_1 = 1 \mid H = 0) P(D_2 = 1 \mid H = 0)  \\
-=& 0.0003,
-\end{aligned}
-$$
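+Unfortunately, the second test comes back positive, too. Let's calculate the requisite probabilities to invoke Bayes' theorem by assuming conditional independence: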
 
 $$\begin{aligned}
-&P(D_1 = 1, D_2 = 1 \mid H = 1) \\
-=& P(D_1 = 1 \mid H = 1) P(D_2 = 1 \mid H = 1)  \\
+P(D_1 = 1, D_2 = 1 \mid H = 0)
+=& P(D_1 = 1 \mid H = 0) P(D_2 = 1 \mid H = 0) \\
+=& 0.0003, \\
+P(D_1 = 1, D_2 = 1 \mid H = 1)
+=& P(D_1 = 1 \mid H = 1) P(D_2 = 1 \mid H = 1) \\
 =& 0.98.
 \end{aligned}
 $$
 
-ここで、疎外化と乗算ルールを適用できます。 
+Now we can apply marginalization to obtain the probability that both tests come back positive:
 
 $$\begin{aligned}
-&P(D_1 = 1, D_2 = 1) \\
+P(D_1 = 1, D_2 = 1)
 =& P(D_1 = 1, D_2 = 1, H = 0) + P(D_1 = 1, D_2 = 1, H = 1)  \\
 =& P(D_1 = 1, D_2 = 1 \mid H = 0)P(H=0) + P(D_1 = 1, D_2 = 1 \mid H = 1)P(H=1)\\
 =& 0.00176955.
 \end{aligned}
 $$
 
-結局、両方の陽性検査を受けた患者がHIVに感染する確率は 
+Finally, the probability of the patient having HIV given that both tests are positive is
 
-$$\begin{aligned}
-&P(H = 1 \mid D_1 = 1, D_2 = 1)\\
-=& \frac{P(D_1 = 1, D_2 = 1 \mid H=1) P(H=1)}{P(D_1 = 1, D_2 = 1)} \\
-=& 0.8307.
-\end{aligned}
-$$
+$$P(H = 1 \mid D_1 = 1, D_2 = 1)
+= \frac{P(D_1 = 1, D_2 = 1 \mid H=1) P(H=1)}{P(D_1 = 1, D_2 = 1)}
+= 0.8307.$$
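+
+As a check on the numbers, substituting the conditional probabilities computed above together with $P(H=1) = 0.0015$ and $P(H=0) = 0.9985$ gives
+
+$$\begin{aligned}
+P(D_1 = 1, D_2 = 1) &= 0.0003 \cdot 0.9985 + 0.98 \cdot 0.0015 = 0.00029955 + 0.00147 = 0.00176955, \\
+P(H = 1 \mid D_1 = 1, D_2 = 1) &= \frac{0.98 \cdot 0.0015}{0.00176955} = \frac{0.00147}{0.00176955} \approx 0.8307.
+\end{aligned}$$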
+
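+That is, the second test allowed us to gain much higher confidence that not all is well. Despite the second test being considerably less accurate than the first one, it still significantly improved our estimate. The assumption that both tests are conditionally independent of each other was crucial for our ability to generate a more accurate estimate. Take the extreme case where we run the same test twice. In this situation, we would expect the same outcome both times, hence no additional insight is gained from running the same test again. The astute reader might have noticed that the diagnosis behaved like a classifier hiding in plain sight, where our ability to decide whether a patient is healthy increases as we obtain more features (test outcomes).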
+
+## Expectations
+
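+Often, making decisions requires not just looking at the probabilities assigned to individual events but composing them together into useful aggregates that can provide us with guidance. For example, when random variables take continuous scalar values, we often care about knowing what value to expect *on average*. This quantity is formally called an *expectation*. If we are making investments, the first quantity of interest might be the return we can expect, averaging over all the possible outcomes (and weighting by the appropriate probabilities). For instance, say that with 50% probability an investment might fail altogether, with 40% probability it might provide a 2$\times$ return, and with 10% probability it might provide a 10$\times$ return. To calculate the expected return, we sum over all returns, multiplying each by the probability that it will occur. This yields the expectation $0.5 \cdot 0 + 0.4 \cdot 2 + 0.1 \cdot 10 = 1.8$. Hence the expected return is 1.8$\times$.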
+
+In general, the *expectation* (or average) of the random variable $X$ is defined as
+
+$$E[X] = E_{x \sim P}[x] = \sum_{x} x P(X = x).$$
+
+Likewise, for densities we obtain $E[X] = \int x \;dp(x)$. Sometimes we are interested in the expected value of some function of $x$. We can calculate these expectations as
+
+$$E_{x \sim P}[f(x)] = \sum_x f(x) P(x) \text{ and } E_{x \sim P}[f(x)] = \int f(x) p(x) \;dx$$
+
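+for discrete probabilities and densities, respectively. Returning to the investment example from above, $f$ might be the *utility* (happiness) associated with the return. Behavioral economists have long noted that people associate greater disutility with losing money than the utility gained from earning one dollar relative to their baseline. Moreover, the value of money tends to be sublinear. Possessing 100k dollars versus zero dollars can make the difference between paying the rent, eating well, and enjoying quality healthcare versus suffering through homelessness. On the other hand, the gains due to possessing 200k versus 100k are less dramatic. Reasoning like this motivates the cliché that "the utility of money is logarithmic".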
+
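+If the utility associated with a total loss were $-1$, and the utilities associated with returns of $1$, $2$, and $10$ were $1$, $2$, and $4$, respectively, then the expected happiness of investing would be $0.5 \cdot (-1) + 0.4 \cdot 2 + 0.1 \cdot 4 = 0.7$ (an expected loss of utility of 30% relative to the utility of $1$ from simply keeping the money). If indeed this were your utility function, you might be best off keeping the money in the bank.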
+
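+For financial decisions, we might also want to measure how *risky* an investment is. Here, we care not just about the expected value but about how much the actual values tend to *vary* relative to this value. Note that we cannot just take the expectation of the difference between the actual and expected values. This is because the expectation of a difference is the difference of the expectations, i.e., $E[X - E[X]] = E[X] - E[E[X]] = 0$. However, we can look at the expectation of any non-negative function of this difference. The *variance* of a random variable is calculated by looking at the expected value of the *squared* differences: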
+
+$$\mathrm{Var}[X] = E\left[(X - E[X])^2\right] = E[X^2] - E[X]^2.$$
+
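+Here the equality follows by expanding $(X - E[X])^2 = X^2 - 2 X E[X] + E[X]^2$ and taking expectations for each term. The square root of the variance is another useful quantity called the *standard deviation*. While this and the variance convey the same information (either can be calculated from the other), the standard deviation has the nice property that it is expressed in the same units as the original quantity represented by the random variable.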
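+
+Written out term by term (using only linearity of expectation and the fact that $E[X]$ is a constant), the expansion reads
+
+$$\begin{aligned}
+E\left[(X - E[X])^2\right] &= E\left[X^2 - 2 X E[X] + E[X]^2\right] \\
+&= E[X^2] - 2 E[X] E[X] + E[X]^2 \\
+&= E[X^2] - E[X]^2.
+\end{aligned}$$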
+
+Lastly, the variance of a function of a random variable is defined analogously as
+
+$$\mathrm{Var}_{x \sim P}[f(x)] = E_{x \sim P}[f^2(x)] - E_{x \sim P}[f(x)]^2.$$
 
-つまり、2番目のテストでは、すべてがうまくいっているわけではないという確信がはるかに高まりました。2番目のテストは最初のテストよりもかなり精度が低くなりましたが、それでも見積もりは大幅に改善されました。 
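+Returning to our investment example, we can now compute the variance of the investment. It is given by $0.5 \cdot 0 + 0.4 \cdot 2^2 + 0.1 \cdot 10^2 - 1.8^2 = 8.36$. For all intents and purposes this is a risky investment. Note that by mathematical convention mean and variance are often referenced as $\mu$ and $\sigma^2$. This is particularly the case whenever we use them to parametrize a Gaussian distribution.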
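+
+For a sense of scale, the corresponding standard deviation can be worked out directly from the same numbers:
+
+$$E[X^2] = 0.5 \cdot 0^2 + 0.4 \cdot 2^2 + 0.1 \cdot 10^2 = 11.6, \quad \mathrm{Var}[X] = 11.6 - 1.8^2 = 8.36, \quad \sigma = \sqrt{8.36} \approx 2.89.$$
+
+The standard deviation is thus larger than the expected return of $1.8$ itself.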
 
-## 期待値と分散
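+In the same way as we introduced expectations and variance for *scalar* random variables, we can do so for vector-valued ones. Expectations are easy, since we can apply them elementwise. For instance, $\boldsymbol{\mu} \stackrel{\mathrm{def}}{=} E_{\mathbf{x} \sim P}[\mathbf{x}]$ has coordinates $\mu_i = E_{\mathbf{x} \sim P}[x_i]$. *Covariances* are more complicated. We define them by taking expectations of the *outer product* of the difference between random variables and their mean: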
 
-確率分布の重要な特徴をまとめるには、いくつかの測度が必要です。確率変数 $X$ の*期待値* (または平均) は次のように表されます。 
+$$\boldsymbol{\Sigma} \stackrel{\mathrm{def}}{=} \mathrm{Cov}_{\mathbf{x} \sim P}[\mathbf{x}] = E_{\mathbf{x} \sim P}\left[(\mathbf{x} - \boldsymbol{\mu}) (\mathbf{x} - \boldsymbol{\mu})^\top\right].$$
 
-$$E[X] = \sum_{x} x P(X = x).$$
+This matrix $\boldsymbol{\Sigma}$ is referred to as the covariance matrix. An easy way to see its effect is to consider some vector $\mathbf{v}$ of the same size as $\mathbf{x}$. It follows that
 
-関数 $f(x)$ の入力が、分布 $P$ から取り出され、値 $x$ が異なる確率変数である場合、$f(x)$ の期待値は次のように計算されます。 
+$$\mathbf{v}^\top \boldsymbol{\Sigma} \mathbf{v} = E_{\mathbf{x} \sim P}\left[\mathbf{v}^\top(\mathbf{x} - \boldsymbol{\mu}) (\mathbf{x} - \boldsymbol{\mu})^\top \mathbf{v}\right] = \mathrm{Var}_{x \sim P}[\mathbf{v}^\top \mathbf{x}].$$
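+
+The second equality can be unpacked as follows (a sketch that uses linearity of expectation, so that $E_{\mathbf{x} \sim P}[\mathbf{v}^\top \mathbf{x}] = \mathbf{v}^\top \boldsymbol{\mu}$):
+
+$$\mathbf{v}^\top(\mathbf{x} - \boldsymbol{\mu}) (\mathbf{x} - \boldsymbol{\mu})^\top \mathbf{v} = \left(\mathbf{v}^\top \mathbf{x} - \mathbf{v}^\top \boldsymbol{\mu}\right)^2 = \left(\mathbf{v}^\top \mathbf{x} - E_{\mathbf{x} \sim P}[\mathbf{v}^\top \mathbf{x}]\right)^2,$$
+
+and taking the expectation of the right-hand side is exactly the definition of the variance of the scalar $\mathbf{v}^\top \mathbf{x}$.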
 
-$$E_{x \sim P}[f(x)] = \sum_x f(x) P(x).$$
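+As such, $\boldsymbol{\Sigma}$ allows us to compute the variance for any linear function of $\mathbf{x}$ by a simple matrix multiplication. The off-diagonal elements tell us how correlated the coordinates are: a value of 0 means no correlation, whereas a larger positive value means that they are more strongly correlated.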
 
-多くの場合、確率変数 $X$ が期待値からどれだけ逸脱しているかを測定します。これは分散によって定量化できます。 
+## Discussion
 
-$$\mathrm{Var}[X] = E\left[(X - E[X])^2\right] =
-E[X^2] - E[X]^2.$$
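+In machine learning, there are many things to be uncertain about! We can be uncertain about the value of a label given an input. We can be uncertain about the estimated value of a parameter. We can even be uncertain about whether data arriving at deployment is from the same distribution as the training data.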
 
-その平方根を*標準偏差*といいます。確率変数の関数の分散は、確率変数の値 $x$ が分布からサンプリングされるため、関数が関数の期待値からどれだけ逸脱しているかによって測定されます。 
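+By *aleatoric uncertainty*, we mean uncertainty that is intrinsic to the problem, and due to genuine randomness unaccounted for by the observed variables. By *epistemic uncertainty*, we mean uncertainty over a model's parameters, the sort of uncertainty that we can hope to reduce by collecting more data. We might have epistemic uncertainty concerning the probability that a coin turns up heads, but even once we know this probability, we are left with aleatoric uncertainty about the outcome of any future toss. No matter how long we watch someone tossing a fair coin, we will never be more or less than 50% certain that the next toss will come up heads. These terms come from mechanical modeling (see e.g., :citet:`Der-Kiureghian.Ditlevsen.2009` for a review on this aspect of [uncertainty quantification](https://en.wikipedia.org/wiki/Uncertainty_quantification)). It is worth noting, however, that these terms constitute a slight abuse of language. The term *epistemic* refers to anything concerning *knowledge* and thus, in the philosophical sense, all uncertainty is epistemic.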
 
-$$\mathrm{Var}[f(x)] = E\left[\left(f(x) - E[f(x)]\right)^2\right].$$
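+We saw that sampling data from some unknown probability distribution can provide us with information that can be used to estimate the parameters of the data generating distribution. That said, the rate at which this is possible can be quite slow. In our coin tossing example (and many others) we can do no better than to design estimators that converge at a rate of $1/\sqrt{n}$, where $n$ is the sample size (e.g., the number of tosses). This means that by going from 10 to 1000 observations (usually a very achievable task) we see a tenfold reduction of uncertainty, whereas the next 1000 observations help comparatively little, offering only a 1.41 times reduction. This is a persistent feature of machine learning: while there are often easy gains, it takes a very large amount of data, and often with it an enormous amount of computation, to make further gains. For an empirical review of this fact for large scale language models see :citet:`Revels.Lubin.Papamarkou.2016`.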
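+
+The two factors quoted above follow from the $1/\sqrt{n}$ rate, treating the uncertainty as proportional to $1/\sqrt{n}$:
+
+$$\frac{1/\sqrt{10}}{1/\sqrt{1000}} = \sqrt{\frac{1000}{10}} = 10, \qquad \frac{1/\sqrt{1000}}{1/\sqrt{2000}} = \sqrt{\frac{2000}{1000}} = \sqrt{2} \approx 1.41.$$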
 
-## [概要
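+We also sharpened our language and tools for statistical modeling. In the process we learned about conditional probabilities and about one of the most important equations in statistics, Bayes' theorem. It is an effective tool for decoupling the information conveyed by data through a likelihood term $P(B \mid A)$, which addresses how well observations $B$ match a choice of parameters $A$, from a prior probability $P(A)$, which governs how plausible a particular choice of $A$ was in the first place. In particular, we saw how this rule can be applied to assign probabilities to diagnoses, based on the efficacy of the test *and* the prevalence of the disease itself (i.e., our prior).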
 
-* 確率分布からサンプリングできます。
-* 結合分布、条件分布、ベイズの定理、疎外化、独立性仮定を使用して、複数の確率変数を分析できます。
-* 期待値と分散は、確率分布の主要な特徴を要約するうえで有用な測度となります。
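+Lastly, we introduced a first set of nontrivial questions about the effect of a specific probability distribution, namely expectations and variances. While there are many more than just linear and quadratic expectations for a probability distribution, these two already provide a good deal of knowledge about the possible behavior of the distribution. For instance, [Chebyshev's inequality](https://en.wikipedia.org/wiki/Chebyshev%27s_inequality) states that $P(|X - \mu| \geq k \sigma) \leq 1/k^2$, where $\mu$ is the expectation, $\sigma^2$ is the variance of the distribution, and $k > 1$ is a confidence parameter of our choosing. It tells us that draws from a distribution lie with at least 50% probability within a $[-\sqrt{2} \sigma, \sqrt{2} \sigma]$ interval centered on the expectation.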
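+
+The 50% figure comes from choosing $k = \sqrt{2}$ in Chebyshev's inequality:
+
+$$P(|X - \mu| \geq \sqrt{2} \sigma) \leq \frac{1}{(\sqrt{2})^2} = \frac{1}{2}, \quad \textrm{so} \quad P(|X - \mu| < \sqrt{2} \sigma) \geq \frac{1}{2}.$$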
 
 ## Exercises
 
-1. $m=500$ グループの実験を行い、各グループが $n=10$ 個のサンプルを抽出しました。$m$ と $n$ を変化させてください。実験結果を観察し、分析する。
-1. 確率が $P(\mathcal{A})$ と $P(\mathcal{B})$ の 2 つの事象について、$P(\mathcal{A} \cup \mathcal{B})$ と $P(\mathcal{A} \cap \mathcal{B})$ の上限と下限を計算します。(ヒント:[Venn Diagram](https://en.wikipedia.org/wiki/Venn_diagram) を使用して状況を表示してください。)
-1. $A$、$B$、$C$ などの一連の確率変数があるとします。$B$ は $A$ にのみ依存し、$C$ は $B$ にのみ依存します。結合確率 $P(A, B, C)$ を単純化できますか？(ヒント:これは [Markov Chain](https://en.wikipedia.org/wiki/Markov_chain) です。)
-1. :numref:`subsec_probability_hiv_app` では、最初の検定がより正確になりました。最初のテストと2番目のテストの両方を実行するのではなく、最初のテストを2回実行しないのはなぜですか？
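+1. Give an example where observing more data can reduce the amount of uncertainty about the outcome to an arbitrarily low level.
+1. Give an example where observing more data will only reduce the amount of uncertainty up to a point and then no further. Explain why this is the case and where you expect this point to occur.
+1. We empirically demonstrated convergence to the mean for the toss of a coin. Calculate the variance of the estimate of the probability that we see a head after drawing $n$ samples.
+    1. How does the variance scale with the number of observations?
+    1. Use Chebyshev's inequality to bound the deviation from the expectation.
+    1. How does it relate to the central limit theorem?
+1. Assume that we draw $n$ samples $x_i$ from a probability distribution with zero mean and unit variance. Compute the averages $z_m \stackrel{\mathrm{def}}{=} m^{-1} \sum_{i=1}^m x_i$. Can we apply Chebyshev's inequality for every $z_m$ independently? Why not?
+1. Given two events with probability $P(\mathcal{A})$ and $P(\mathcal{B})$, compute upper and lower bounds on $P(\mathcal{A} \cup \mathcal{B})$ and $P(\mathcal{A} \cap \mathcal{B})$. Hint: graph the situation using a [Venn diagram](https://en.wikipedia.org/wiki/Venn_diagram).
+1. Assume that we have a sequence of random variables, say $A$, $B$, and $C$, where $B$ only depends on $A$, and $C$ only depends on $B$. Can you simplify the joint probability $P(A, B, C)$? Hint: this is a [Markov chain](https://en.wikipedia.org/wiki/Markov_chain).
+1. In :numref:`subsec_probability_hiv_app`, assume that the outcomes of the two tests are not independent. In particular, assume that either test on its own has a false positive rate of 10% and a false negative rate of 1%. That is, assume that $P(D =1 \mid H=0) = 0.1$ and that $P(D = 0 \mid H=1) = 0.01$. Moreover, assume that for $H = 1$ (infected) the test outcomes are conditionally independent, i.e., that $P(D_1, D_2 \mid H=1) = P(D_1 \mid H=1) P(D_2 \mid H=1)$, but that for healthy patients the outcomes are coupled via $P(D_1 = D_2 = 1 \mid H=0) = 0.02$.
+    1. Work out the joint probability table for $D_1$ and $D_2$, given $H=0$, based on the information you have so far.
+    1. Derive the probability that the patient is diseased ($H=1$) after one test returns positive. You can assume the same baseline probability $P(H=1) = 0.0015$ as before.
+    1. Derive the probability that the patient is diseased ($H=1$) after both tests return positive.
+1. Assume that you are an asset manager for an investment bank and you have a choice of stocks $s_i$ to invest in. Your portfolio needs to add up to $1$ with weights $\alpha_i$ for each stock. The stocks have an average return $\boldsymbol{\mu} = E_{\mathbf{s} \sim P}[\mathbf{s}]$ and covariance $\boldsymbol{\Sigma} = \mathrm{Cov}_{\mathbf{s} \sim P}[\mathbf{s}]$.
+    1. Compute the expected return for a given portfolio $\boldsymbol{\alpha}$.
+    1. If you wanted to maximize the return of the portfolio, how should you choose your investment?
+    1. Compute the *variance* of the portfolio.
+    1. Formulate an optimization problem of maximizing the return while keeping the variance constrained to an upper bound. This is the Nobel-Prize winning [Markovitz portfolio](https://en.wikipedia.org/wiki/Markowitz_model) :cite:`Mangram.2013`. To solve it you will need a quadratic programming solver, something way beyond the scope of this book.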
 
 :begin_tab:`mxnet`
 [Discussions](https://discuss.d2l.ai/t/36)
diff --git a/chapter_preliminaries/probability_origin.md b/chapter_preliminaries/probability_origin.md
new file mode 100644
index 0000000..b769ddc
--- /dev/null
+++ b/chapter_preliminaries/probability_origin.md
@@ -0,0 +1,1070 @@
+```{.python .input}
+%load_ext d2lbook.tab
+tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
+```
+
+# Probability and Statistics
+:label:`sec_prob`
+
+One way or another,
+machine learning is all about uncertainty.
+In supervised learning, we want to predict
+something unknown (the *target*)
+given something known (the *features*).
+Depending on our objective,
+we might attempt to predict
+the most likely value of the target.
+Or we might predict the value with the smallest
+expected distance from the target.
+And sometimes we wish not only
+to predict a specific value
+but to *quantify our uncertainty*.
+For example, given some features
+describing a patient,
+we might want to know *how likely* they are
+to suffer a heart attack in the next year.
+In unsupervised learning,
+we often care about uncertainty.
+To determine whether a set of measurements are anomalous,
+it helps to know how likely one is
+to observe values in a population of interest.
+Moreover, in reinforcement learning,
+we wish to develop agents
+that act intelligently in various environments.
+This requires reasoning about
+how an environment might be expected to change
+and what rewards one might expect to encounter
+in response to each of the available actions.
+
+*Probability* is the mathematical field
+concerned with reasoning under uncertainty.
+Given a probabilistic model of some process,
+we can reason about the likelihood of various events.
+The use of probabilities to describe
+the frequencies of repeatable events
+(like coin tosses)
+is fairly uncontroversial.
+In fact, *frequentist* scholars adhere
+to an interpretation of probability
+that applies *only* to such repeatable events.
+By contrast, *Bayesian* scholars
+use the language of probability more broadly
+to formalize our reasoning under uncertainty.
+Bayesian probability is characterized
+by two unique features:
+(i) assigning degrees of belief
+to non-repeatable events,
+e.g., what is the *probability*
+that the moon is made out of cheese?;
+and (ii) subjectivity---while Bayesian
+probability provides unambiguous rules
+for how one should update their beliefs
+in light of new evidence,
+it allows for different individuals
+to start off with different *prior* beliefs.
+*Statistics* helps us to reason backwards,
+starting off with collection and organization of data
+and backing out to what inferences
+we might draw about the process
+that generated the data.
+Whenever we analyze a dataset, hunting for patterns
+that we hope might characterize a broader population,
+we are employing statistical thinking.
+Entire courses, majors, theses, careers, departments,
+companies, and institutions have been devoted
+to the study of probability and statistics.
+While this section only scratches the surface,
+we will provide the foundation
+that you need to begin building models.
+
+
+
+## A Simple Example: Tossing Coins
+
+Imagine that we plan to toss a coin
+and want to quantify how likely
+we are to see heads (vs. tails).
+If the coin is *fair*,
+then both outcomes
+(heads and tails),
+are equally likely.
+Moreover if we plan to toss the coin $n$ times
+then the fraction of heads
+that we *expect* to see
+should exactly match
+the *expected* fraction of tails.
+One intuitive way to see this
+is by symmetry:
+for every possible outcome
+with $n_h$ heads and $n_t = (n - n_h)$ tails,
+there is an equally likely outcome
+with $n_t$ heads and $n_h$ tails.
+Note that this is only possible
+if on average we expect to see
+$1/2$ of tosses come up heads
+and $1/2$ come up tails.
+Of course, if you conduct this experiment
+many times with $n=1000000$ tosses each,
+you might never see a trial
+where $n_h = n_t$ exactly.
+
+
+Formally, the quantity $1/2$ is called a *probability*
+and here it captures the certainty with which
+any given toss will come up heads.
+Probabilities assign scores between $0$ and $1$
+to outcomes of interest, called *events*.
+Here the event of interest is $\textrm{heads}$
+and we denote the corresponding probability $P(\textrm{heads})$.
+A probability of $1$ indicates absolute certainty
+(imagine a trick coin where both sides were heads)
+and a probability of $0$ indicates impossibility
+(e.g., if both sides were tails).
+The frequencies $n_h/n$ and $n_t/n$ are not probabilities
+but rather *statistics*.
+Probabilities are *theoretical* quantities
+that underlie the data generating process.
+Here, the probability $1/2$
+is a property of the coin itself.
+By contrast, statistics are *empirical* quantities
+that are computed as functions of the observed data.
+Our interests in probabilistic and statistical quantities
+are inextricably intertwined.
+We often design special statistics called *estimators*
+that, given a dataset, produce *estimates*
+of model parameters like probabilities.
+Moreover, when those estimators satisfy
+a nice property called *consistency*,
+our estimates will converge
+to the corresponding probability.
+In turn, these inferred probabilities
+tell us about the likely statistical properties
+of data from the same population
+that we might encounter in the future.
+
+Suppose that we stumbled upon a real coin
+for which we did not know
+the true $P(\textrm{heads})$.
+To investigate this quantity
+with statistical methods,
+we need to (i) collect some data;
+and (ii) design an estimator.
+Data acquisition here is easy;
+we can toss the coin many times
+and record all of the outcomes.
+Formally, drawing realizations
+from some underlying random process
+is called *sampling*.
+As you might have guessed,
+one natural estimator
+is the ratio of
+the number of observed *heads*
+to the total number of tosses.
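+
+As a small worked illustration (the symbol $\hat{P}$ and the counts below are hypothetical, introduced here only for the example), the estimator and one possible estimate could be written as
+
+$$\hat{P}(\textrm{heads}) = \frac{n_h}{n}, \qquad \textrm{e.g.,}\quad n_h = 53,\; n = 100 \;\Rightarrow\; \hat{P}(\textrm{heads}) = 0.53.$$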
+
+```{.python .input}
+%%tab mxnet
+%matplotlib inline
+from d2l import mxnet as d2l
+from mxnet import np, npx
+from mxnet.numpy.random import multinomial
+import random
+npx.set_np()
+```
+
+```{.python .input}
+%%tab pytorch
+%matplotlib inline
+from d2l import torch as d2l
+import random
+import torch
+from torch.distributions.multinomial import Multinomial
+```
+
+```{.python .input}
+%%tab tensorflow
+%matplotlib inline
+from d2l import tensorflow as d2l
+import random
+import tensorflow as tf
+from tensorflow_probability import distributions as tfd
+```
+
+Now, suppose that the coin was in fact fair,
+i.e., $P(\textrm{heads}) = 0.5$.
+To simulate tosses of a fair coin,
+we can invoke any random number generator.
+There are some easy ways to draw samples
+of an event with probability $0.5$.
+For example, Python's `random.random`
+yields numbers in the interval $[0,1]$
+where the probability of lying
+in any sub-interval $[a, b] \subset [0,1]$
+is equal to $b-a$.
+Thus we can get out `0` and `1` with probability `0.5` each
+by testing whether the returned float is greater than `0.5`:
+
+```{.python .input}
+%%tab all
+num_tosses = 100
+heads = sum([random.random() > 0.5 for _ in range(num_tosses)])
+tails = num_tosses - heads
+print("heads, tails: ", [heads, tails])
+```
+
+More generally, we can simulate multiple draws
+from any variable with a finite number
+of possible outcomes
+(like the toss of a coin or roll of a die)
+by calling the multinomial function,
+setting the first argument
+to the number of draws
+and the second as a list of probabilities
+associated with each of the possible outcomes.
+To simulate 100 tosses of a fair coin,
+we assign probability vector `[0.5, 0.5]`,
+interpreting index 0 as heads
+and index 1 as tails.
+The function returns a vector
+with length equal to the number
+of possible outcomes (here, 2),
+where the first component tells us
+the number of occurrences of heads
+and the second component tells us
+the number of occurrences of tails.
+
+```{.python .input}
+%%tab mxnet
+fair_probs = [0.5, 0.5]
+multinomial(100, fair_probs)
+```
+
+```{.python .input}
+%%tab pytorch
+fair_probs = torch.tensor([0.5, 0.5])
+Multinomial(100, fair_probs).sample()
+```
+
+```{.python .input}
+%%tab tensorflow
+fair_probs = tf.ones(2) / 2
+tfd.Multinomial(100, fair_probs).sample()
+```
+
+Each time you run this sampling process,
+you will receive a new random value
+that may differ from the previous outcome.
+Dividing by the number of tosses
+gives us the *frequency*
+of each outcome in our data.
+Note that these frequencies,
+like the probabilities
+that they are intended
+to estimate, sum to $1$.
+
+```{.python .input}
+%%tab mxnet
+multinomial(100, fair_probs) / 100
+```
+
+```{.python .input}
+%%tab pytorch
+Multinomial(100, fair_probs).sample() / 100
+```
+
+```{.python .input}
+%%tab tensorflow
+tfd.Multinomial(100, fair_probs).sample() / 100
+```
+
+Here, even though our simulated coin is fair
+(we set the probabilities `[0.5, 0.5]` ourselves),
+the counts of heads and tails may not be identical.
+That's because we only drew a finite number of samples.
+If we didn't implement the simulation ourselves,
+and only saw the outcome,
+how would we know if the coin were slightly unfair
+or if the possible deviation from $1/2$ was
+just an artifact of the small sample size?
+Let's see what happens when we simulate `10000` tosses.
+
+```{.python .input}
+%%tab mxnet
+counts = multinomial(10000, fair_probs).astype(np.float32)
+counts / 10000
+```
+
+```{.python .input}
+%%tab pytorch
+counts = Multinomial(10000, fair_probs).sample()
+counts / 10000
+```
+
+```{.python .input}
+%%tab tensorflow
+counts = tfd.Multinomial(10000, fair_probs).sample()
+counts / 10000
+```
+
+In general, for averages of repeated events (like coin tosses),
+as the number of repetitions grows,
+our estimates are guaranteed to converge
+to the true underlying probabilities.
+The mathematical formulation of this phenomenon
+is called the *law of large numbers*,
+and the *central limit theorem*
+tells us that in many situations,
+as the sample size $n$ grows,
+these errors should go down
+at a rate of $(1/\sqrt{n})$.
+Let's get some more intuition by studying
+how our estimate evolves as we grow
+the number of tosses from `1` to `10000`.
+
+```{.python .input}
+%%tab mxnet
+counts = multinomial(1, fair_probs, size=10000)
+cum_counts = counts.astype(np.float32).cumsum(axis=0)
+estimates = cum_counts / cum_counts.sum(axis=1, keepdims=True)
+```
+
+```{.python .input}
+%%tab pytorch
+counts = Multinomial(1, fair_probs).sample((10000,))
+cum_counts = counts.cumsum(dim=0)
+estimates = cum_counts / cum_counts.sum(dim=1, keepdims=True)
+estimates = estimates.numpy()
+```
+
+```{.python .input}
+%%tab tensorflow
+counts = tfd.Multinomial(1, fair_probs).sample(10000)
+cum_counts = tf.cumsum(counts, axis=0)
+estimates = cum_counts / tf.reduce_sum(cum_counts, axis=1, keepdims=True)
+estimates = estimates.numpy()
+```
+
+```{.python .input}
+%%tab all
+d2l.set_figsize((4.5, 3.5))
+d2l.plt.plot(estimates[:, 0], label=("P(coin=heads)"))
+d2l.plt.plot(estimates[:, 1], label=("P(coin=tails)"))
+d2l.plt.axhline(y=0.5, color='black', linestyle='dashed')
+d2l.plt.gca().set_xlabel('Samples')
+d2l.plt.gca().set_ylabel('Estimated probability')
+d2l.plt.legend();
+```
+
+Each solid curve corresponds to one of the two values of the coin
+and gives our estimated probability that the coin turns up that value
+after each group of experiments.
+The dashed black line gives the true underlying probability.
+As we get more data by conducting more experiments,
+the curves converge towards the true probability.
+You might already begin to see the shape
+of some of the more advanced questions
+that preoccupy statisticians:
+How quickly does this convergence happen?
+If we had already tested many coins
+manufactured at the same plant,
+how might we incorporate this information?
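+
+As a rough sketch of an answer to the first question, and assuming independent tosses with $P(\textrm{heads}) = p$, the typical size of the estimation error (the standard deviation of the estimate $n_h/n$, a standard result that is not derived in this section) is
+
+$$\sqrt{\frac{p(1-p)}{n}}, \qquad \textrm{so for } p = 0.5: \quad n = 100 \Rightarrow 0.05, \quad n = 10000 \Rightarrow 0.005.$$
+
+Under these assumptions, a hundredfold increase in the number of tosses shrinks the typical error only tenfold.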
+
+## A More Formal Treatment
+
+We've already gotten pretty far: posing
+a probabilistic model,
+generating synthetic data,
+running a statistical estimator,
+empirically assessing convergence,
+and reporting error metrics (checking the deviation).
+However, to go much further,
+we will need to be more precise.
+
+
+When dealing with randomness,
+we denote the set of possible outcomes $\mathcal{S}$
+and call it the *sample space* or *outcome space*.
+Here, each element is a distinct possible *outcome*.
+In the case of tossing a single coin,
+$\mathcal{S} = \{\textrm{heads}, \textrm{tails}\}$.
+For a single die, $\mathcal{S} = \{1, 2, 3, 4, 5, 6\}$.
+When flipping two coins, we have four possible outcomes:
+$\{(\textrm{heads}, \textrm{heads}), (\textrm{heads}, \textrm{tails}), (\textrm{tails}, \textrm{heads}),  (\textrm{tails}, \textrm{tails})\}$.
+*Events* are subsets of the sample space.
+For instance, the event "the first coin toss comes up heads"
+corresponds to the set $\{(\textrm{heads}, \textrm{heads}), (\textrm{heads}, \textrm{tails})\}$.
+Whenever the outcome $z$ of a random experiment satisfies
+$z \in \mathcal{A}$, then event $\mathcal{A}$ has occurred.
+For a single roll of a die, we could define the events
+"seeing a $5$" ($\mathcal{A} = \{5\}$)
+and "seeing an odd number"  ($\mathcal{B} = \{1, 3, 5\}$).
+In this case, if the die came up $5$,
+we would say that both $\mathcal{A}$ and $\mathcal{B}$ occurred.
+On the other hand, if $z = 3$,
+then $\mathcal{A}$ did not occur
+but $\mathcal{B}$ did.
+
+
+A *probability* function maps events
+onto real values ${P: \mathcal{A} \subseteq \mathcal{S} \rightarrow [0,1]}$.
+The probability of an event $\mathcal{A}$
+in the given sample space $\mathcal{S}$,
+denoted $P(\mathcal{A})$,
+satisfies the following properties:
+
+* The probability of any event $\mathcal{A}$ is a non-negative real number, i.e., $P(\mathcal{A}) \geq 0$;
+* The probability of the entire sample space is $1$, i.e., $P(\mathcal{S}) = 1$;
+* For any countable sequence of events $\mathcal{A}_1, \mathcal{A}_2, \ldots$ that are *mutually exclusive* ($\mathcal{A}_i \cap \mathcal{A}_j = \emptyset$ for all $i \neq j$), the probability that any of them happens is equal to the sum of their individual probabilities, i.e., $P(\bigcup_{i=1}^{\infty} \mathcal{A}_i) = \sum_{i=1}^{\infty} P(\mathcal{A}_i)$.
+
+These axioms of probability theory,
+proposed by :citet:`Kolmogorov.1933`,
+can be applied to rapidly derive a number of important consequences.
+For instance, it follows immediately
+that the probability of any event $\mathcal{A}$
+*or* its complement $\mathcal{A}'$ occurring is 1
+(because $\mathcal{A} \cup \mathcal{A}' = \mathcal{S}$).
+We can also prove that $P(\emptyset) = 0$
+because $1 = P(\mathcal{S} \cup \mathcal{S}') = P(\mathcal{S} \cup \emptyset) = P(\mathcal{S}) + P(\emptyset) = 1 + P(\emptyset)$.
+Consequently, the probability of any event $\mathcal{A}$
+*and* its complement $\mathcal{A}'$ occurring simultaneously
+is $P(\mathcal{A} \cap \mathcal{A}') = 0$.
+Informally, this tells us that impossible events
+have zero probability of occurring.
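+
+To see the additivity axiom at work, consider a die and assume (purely for illustration) that it is fair, so that each face has probability $1/6$. Since the outcomes $1$, $3$, and $5$ are mutually exclusive, the event $\mathcal{B} = \{1, 3, 5\}$ and its complement satisfy
+
+$$P(\mathcal{B}) = P(\{1\}) + P(\{3\}) + P(\{5\}) = \frac{3}{6} = \frac{1}{2}, \qquad P(\mathcal{B}') = 1 - P(\mathcal{B}) = \frac{1}{2}.$$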
+
+
+
+## Random Variables
+
+When we spoke about events like the roll of a die
+coming up odd or the first coin toss coming up heads,
+we were invoking the idea of a *random variable*.
+Formally, random variables are mappings
+from an underlying sample space
+to a set of (possibly many) values.
+You might wonder how a random variable
+is different from the sample space,
+since both are collections of outcomes.
+Importantly, random variables can be much coarser
+than the raw sample space.
+We can define a binary random variable like "greater than 0.5"
+even when the underlying sample space is infinite,
+e.g., the line segment between $0$ and $1$.
+Additionally, multiple random variables
+can share the same underlying sample space.
+For example "whether my home alarm goes off"
+and "whether my house was burglarized"
+are both binary random variables
+that share an underlying sample space.
+Consequently, knowing the value taken by one random variable
+can tell us something about the likely value of another random variable.
+Knowing that the alarm went off,
+we might suspect that the house was likely burglarized.
+
+
+Every value taken by a random variable corresponds
+to a subset of the underlying sample space.
+Thus the occurrence where the random variable $X$
+takes value $v$, denoted by $X=v$, is an *event*
+and $P(X=v)$ denotes its probability.
+Sometimes this notation can get clunky,
+and we can abuse notation when the context is clear.
+For example, we might use $P(X)$ to refer broadly
+to the *distribution* of $X$, i.e.,
+the function that tells us the probability
+that $X$ takes any given value.
+Other times we write expressions
+like $P(X,Y) = P(X) P(Y)$,
+as a shorthand to express a statement
+that is true for all of the values
+that the random variables $X$ and $Y$ can take, i.e.,
+for all $i,j$ it holds that $P(X=i \textrm{ and } Y=j) = P(X=i)P(Y=j)$.
+Other times, we abuse notation by writing
+$P(v)$ when the random variable is clear from the context.
+Since an event in probability theory is a set of outcomes from the sample space,
+we can specify a range of values for a random variable to take.
+For example, $P(1 \leq X \leq 3)$ denotes the probability of the event $\{1 \leq X \leq 3\}$.
+
+
+Note that there is a subtle difference
+between *discrete* random variables,
+like flips of a coin or tosses of a die,
+and *continuous* ones,
+like the weight and the height of a person
+sampled at random from the population.
+In the latter case we seldom really care about
+someone's exact height.
+Moreover, if we took precise enough measurements,
+we would find that no two people on the planet
+have the exact same height.
+In fact, with fine enough measurements,
+you would never have the same height
+when you wake up and when you go to sleep.
+There's little point in asking about
+the exact probability that someone
+is 1.801392782910287192 meters tall.
+Instead, we typically care more about being able to say
+whether someone's height falls into a given interval,
+say between 1.79 and 1.81 meters.
+In these cases we work with probability *densities*.
+The height of exactly 1.80 meters
+has zero probability, but nonzero density.
+To get out the probability assigned to an interval,
+we must take an *integral* of the density
+over that interval.
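+
+Written out for the height example, with $p(x)$ denoting the (unspecified) density of heights, this might look as follows, where the approximation is only a sketch that assumes $p$ varies little over such a narrow interval:
+
+$$P(1.79 \leq X \leq 1.81) = \int_{1.79}^{1.81} p(x) \;dx \approx 0.02 \cdot p(1.80).$$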
+
+
+
+
+## Multiple Random Variables
+
+You might have noticed that we couldn't even
+make it past the last section without
+making statements involving interactions
+among multiple random variables
+(recall $P(X,Y) = P(X) P(Y)$).
+Most of machine learning
+is concerned with such relationships.
+Here, the sample space would be
+the population of interest,
+say customers who transact with a business,
+photographs on the internet,
+or proteins known to biologists.
+Each random variable would represent
+the (unknown) value of a different attribute.
+Whenever we sample an individual from the population,
+we observe a realization of each of the random variables.
+Because the values taken by random variables
+correspond to subsets of the sample space
+that could be overlapping, partially overlapping,
+or entirely disjoint,
+knowing the value taken by one random variable
+can cause us to update our beliefs
+about what values of another random variable are likely.
+If a patient walks into a hospital
+and we observe that they
+are having trouble breathing
+and have lost their sense of smell,
+then we believe that they are more likely
+to have COVID-19 than we might
+if they had no trouble breathing
+and a perfectly ordinary sense of smell.
+
+
+When working with multiple random variables,
+we can construct events corresponding
+to every combination of values
+that the variables can jointly take.
+The probability function that assigns
+probabilities to each of these combinations
+(e.g., $A=a$ and $B=b$)
+is called the *joint probability* function
+and simply returns the probability assigned
+to the intersection of the corresponding subsets
+of the sample space.
+The *joint probability* assigned to the event
+where random variables $A$ and $B$
+take values $a$ and $b$, respectively,
+is denoted $P(A = a, B = b)$,
+where the comma indicates "and".
+Note that for any values $a$ and $b$,
+it holds that
+$P(A=a, B=b) \leq P(A=a)$
+and $P(A=a, B=b) \leq P(B = b)$,
+since for $A=a$ and $B=b$ to happen,
+$A=a$ has to happen *and* $B=b$ also has to happen.
+Interestingly, the joint probability
+tells us all that we can know about these
+random variables in a probabilistic sense,
+and can be used to derive many other
+useful quantities, including recovering the
+individual distributions $P(A)$ and $P(B)$.
+To recover $P(A=a)$ we simply sum up
+$P(A=a, B=v)$ over all values $v$
+that the random variable $B$ can take:
+$P(A=a) = \sum_v P(A=a, B=v)$.
+
+
+The ratio $\frac{P(A=a, B=b)}{P(A=a)} \leq 1$
+turns out to be extremely important.
+It is called the *conditional probability*,
+and is denoted via the "$\mid$" symbol,
+$P(B=b \mid A=a) = P(A=a,B=b)/P(A=a)$.
+It tells us the new probability
+associated with the event $B=b$,
+once we condition on the fact that $A=a$ took place.
+We can think of this conditional probability
+as restricting attention only to the subset
+of the sample space associated with $A=a$
+and then renormalizing so that
+all probabilities sum to 1.
+Conditional probabilities
+are in fact probabilities
+and thus respect all of the axioms,
+so long as we condition all terms
+on the same event and thus
+restrict attention to the same sample space.
+For instance, for disjoint events
+$\mathcal{B}$ and $\mathcal{B}'$, we have that
+$P(\mathcal{B} \cup \mathcal{B}' \mid A = a) = P(\mathcal{B} \mid A = a) + P(\mathcal{B}' \mid A = a)$.
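+
+For a concrete sketch, take two binary random variables with the following joint probabilities (the numbers are hypothetical, chosen only to make the arithmetic easy): $P(A=0, B=0) = 0.3$, $P(A=0, B=1) = 0.2$, $P(A=1, B=0) = 0.1$, and $P(A=1, B=1) = 0.4$. Marginalizing and then conditioning gives
+
+$$\begin{aligned}
+P(A=0) =& P(A=0, B=0) + P(A=0, B=1) = 0.3 + 0.2 = 0.5, \\
+P(B=1 \mid A=0) =& \frac{P(A=0, B=1)}{P(A=0)} = \frac{0.2}{0.5} = 0.4, \\
+P(B=0 \mid A=0) =& \frac{P(A=0, B=0)}{P(A=0)} = \frac{0.3}{0.5} = 0.6.
+\end{aligned}
+$$
+
+As expected after renormalizing, the two conditional probabilities sum to $1$.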
+
+
+Using the definition of conditional probabilities,
+we can derive the famous result called *Bayes' theorem*.
+By construction, we have that $P(A, B) = P(B\mid A) P(A)$
+and $P(A, B) = P(A\mid B) P(B)$.
+Combining both equations yields
+$P(B\mid A) P(A) = P(A\mid B) P(B)$ and hence
+
+$$P(A \mid B) = \frac{P(B\mid A) P(A)}{P(B)}.$$
+
+This simple equation has profound implications because
+it allows us to reverse the order of conditioning.
+If we know how to estimate $P(B\mid A)$, $P(A)$, and $P(B)$,
+then we can estimate $P(A\mid B)$.
+We often find it easier to estimate one term directly
+but not the other and Bayes' theorem can come to the rescue here.
+For instance, if we know the prevalence of symptoms for a given disease,
+and the overall prevalences of the disease and symptoms, respectively,
+we can determine how likely someone is
+to have the disease based on their symptoms.
+In some cases we might not have direct access to $P(B)$,
+such as the prevalence of symptoms.
+In this case a simplified version of Bayes' theorem comes in handy:
+
+$$P(A \mid B) \propto P(B \mid A) P(A).$$
+
+Since we know that $P(A \mid B)$ must be normalized to $1$, i.e., $\sum_a P(A=a \mid B) = 1$,
+we can use it to compute
+
+$$P(A \mid B) = \frac{P(B \mid A) P(A)}{\sum_a P(B \mid A=a) P(A=a)}.$$
+
+In Bayesian statistics, we think of an observer
+as possessing some (subjective) prior beliefs
+about the plausibility of the available hypotheses
+encoded in the *prior* $P(H)$,
+and a *likelihood function* that says how likely
+one is to observe any value of the collected evidence
+for each of the hypotheses in the class $P(E \mid H)$.
+Bayes' theorem is then interpreted as telling us
+how to update the initial *prior* $P(H)$
+in light of the available evidence $E$
+to produce *posterior* beliefs
+$P(H \mid E) = \frac{P(E \mid H) P(H)}{P(E)}$.
+Informally, this can be stated as
+"posterior equals prior times likelihood, divided by the evidence".
+Now, because the evidence $P(E)$ is the same for all hypotheses,
+we can get away with simply normalizing over the hypotheses.
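+
+As a minimal illustration of normalizing over hypotheses (the setup is hypothetical), suppose that a coin is either fair ($H = \textrm{fair}$) or two-headed ($H = \textrm{trick}$), with prior $P(H) = 0.5$ for each, and that the evidence $E$ is a single toss coming up heads. Then
+
+$$\begin{aligned}
+P(E \mid H = \textrm{fair}) P(H = \textrm{fair}) =& 0.5 \cdot 0.5 = 0.25, \\
+P(E \mid H = \textrm{trick}) P(H = \textrm{trick}) =& 1 \cdot 0.5 = 0.5, \\
+P(H = \textrm{trick} \mid E) =& \frac{0.5}{0.25 + 0.5} = \frac{2}{3}.
+\end{aligned}
+$$
+
+Here the sum $0.25 + 0.5 = 0.75$ plays the role of the evidence $P(E)$, which never had to be specified separately.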
+
+Note that $\sum_a P(A=a \mid B) = 1$ also allows us to *marginalize* over random variables. That is, we can drop variables from a joint distribution such as $P(A, B)$. After all, we have that
+
+$$\sum_a P(A=a, B) = P(B) \sum_a P(A = a \mid B) = P(B).$$
+
+Independence is another fundamentally important concept
+that forms the backbone of
+many important ideas in statistics.
+In short, two variables are *independent*
+if conditioning on the value of $A$ does not
+cause any change to the probability distribution
+associated with $B$ and vice versa.
+More formally, independence, denoted $A \perp B$,
+requires that $P(A \mid B) = P(A)$ and, consequently,
+that $P(A,B) = P(A \mid B) P(B) = P(A) P(B)$.
+Independence is often an appropriate assumption.
+For example, if the random variable $A$
+represents the outcome from tossing one fair coin
+and the random variable $B$
+represents the outcome from tossing another,
+then knowing whether $A$ came up heads
+should not influence the probability
+of $B$ coming up heads.
+
+
+Independence is especially useful when it holds among the successive
+draws of our data from some underlying distribution
+(allowing us to make strong statistical conclusions)
+or when it holds among various variables in our data,
+allowing us to work with simpler models
+that encode this independence structure.
+On the other hand, estimating the dependencies
+among random variables is often the very aim of learning.
+We care to estimate the probability of disease given symptoms
+specifically because we believe
+that diseases and symptoms are *not* independent.
+
+
+Note that because conditional probabilities are proper probabilities,
+the concepts of independence and dependence also apply to them.
+Two random variables $A$ and $B$ are *conditionally independent*
+given a third variable $C$ if and only if $P(A, B \mid C) = P(A \mid C)P(B \mid C)$.
+Interestingly, two variables can be independent in general
+but become dependent when conditioning on a third.
+This often occurs when the two random variables $A$ and $B$
+correspond to causes of some third variable $C$.
+For example, broken bones and lung cancer might be independent
+in the general population but if we condition on being in the hospital
+then we might find that broken bones are negatively correlated with lung cancer.
+That's because the broken bone *explains away* why some person is in the hospital
+and thus lowers the probability that they have lung cancer.
+
+
+And conversely, two dependent random variables
+can become independent upon conditioning on a third.
+This often happens when two otherwise unrelated events
+have a common cause.
+Shoe size and reading level are highly correlated
+among elementary school students,
+but this correlation disappears if we condition on age.
+
+
+
+## An Example
+:label:`subsec_probability_hiv_app`
+
+Let's put our skills to the test.
+Assume that a doctor administers an HIV test to a patient.
+This test is fairly accurate and it fails with only 1% probability
+by reporting a healthy patient as diseased.
+Moreover, it never fails to detect HIV if the patient actually has it.
+We use $D_1 \in \{0, 1\}$ to indicate the diagnosis
+($0$ if negative and $1$ if positive)
+and $H \in \{0, 1\}$ to denote the HIV status.
+
+| Conditional probability | $H=1$ | $H=0$ |
+|:------------------------|------:|------:|
+| $P(D_1 = 1 \mid H)$        |     1 |  0.01 |
+| $P(D_1 = 0 \mid H)$        |     0 |  0.99 |
+
+Note that the column sums are all 1 (but the row sums are not),
+since they are conditional probabilities.
+Let's compute the probability of the patient having HIV
+if the test comes back positive, i.e., $P(H = 1 \mid D_1 = 1)$.
+Intuitively this is going to depend on how common the disease is,
+since it affects the number of false alarms.
+Assume that the population is fairly healthy, e.g., $P(H=1) = 0.0015$.
+To apply Bayes' theorem, we need to apply marginalization
+to determine
+
+$$\begin{aligned}
+P(D_1 = 1)
+=& P(D_1=1, H=0) + P(D_1=1, H=1)  \\
+=& P(D_1=1 \mid H=0) P(H=0) + P(D_1=1 \mid H=1) P(H=1) \\
+=& 0.011485.
+\end{aligned}
+$$
+
+This leads us to
+
+$$P(H = 1 \mid D_1 = 1) = \frac{P(D_1=1 \mid H=1) P(H=1)}{P(D_1=1)} = 0.1306.$$
+
+In other words, there is only a 13.06% chance
+that the patient actually has HIV,
+despite using a very accurate test.
+As we can see, probability can be counterintuitive.
+What should a patient do upon receiving such terrifying news?
+Likely, the patient would ask the physician
+to administer another test to get clarity.
+The second test has different characteristics
+and it is not as good as the first one.
+
+| Conditional probability | $H=1$ | $H=0$ |
+|:------------------------|------:|------:|
+| $P(D_2 = 1 \mid H)$          |  0.98 |  0.03 |
+| $P(D_2 = 0 \mid H)$          |  0.02 |  0.97 |
+
+Unfortunately, the second test comes back positive, too.
+Let's calculate the requisite probabilities to invoke Bayes' theorem
+by assuming conditional independence:
+
+$$\begin{aligned}
+P(D_1 = 1, D_2 = 1 \mid H = 0)
+=& P(D_1 = 1 \mid H = 0) P(D_2 = 1 \mid H = 0)
+= 0.0003, \\
+P(D_1 = 1, D_2 = 1 \mid H = 1)
+=& P(D_1 = 1 \mid H = 1) P(D_2 = 1 \mid H = 1)
+= 0.98.
+\end{aligned}
+$$
+
+Now we can apply marginalization to obtain the probability
+that both tests come back positive:
+
+$$\begin{aligned}
+P(D_1 = 1, D_2 = 1)
+=& P(D_1 = 1, D_2 = 1, H = 0) + P(D_1 = 1, D_2 = 1, H = 1)  \\
+=& P(D_1 = 1, D_2 = 1 \mid H = 0)P(H=0) + P(D_1 = 1, D_2 = 1 \mid H = 1)P(H=1)\\
+=& 0.00176955.
+\end{aligned}
+$$
+
+Finally, the probability of the patient having HIV given both tests being positive is
+
+$$P(H = 1 \mid D_1 = 1, D_2 = 1)
+= \frac{P(D_1 = 1, D_2 = 1 \mid H=1) P(H=1)}{P(D_1 = 1, D_2 = 1)}
+= 0.8307.$$
+
+That is, the second test allowed us to gain much higher confidence that not all is well.
+Despite the second test being considerably less accurate than the first one,
+it still significantly improved our estimate.
+The assumption of both tests being conditionally independent of each other
+was crucial for our ability to generate a more accurate estimate.
+Take the extreme case where we run the same test twice.
+In this situation we would expect the same outcome both times,
+hence no additional insight is gained from running the same test again.
+The astute reader might have noticed that the diagnosis behaved
+like a classifier hiding in plain sight
+where our ability to decide whether a patient is healthy
+increases as we obtain more features (test outcomes).
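+
+For readers who wish to check the arithmetic, the numbers above can be reproduced by substituting the table entries and the assumed prior $P(H=1) = 0.0015$:
+
+$$\begin{aligned}
+P(D_1 = 1) =& 0.01 \cdot 0.9985 + 1 \cdot 0.0015 = 0.009985 + 0.0015 = 0.011485, \\
+P(H = 1 \mid D_1 = 1) =& \frac{0.0015}{0.011485} \approx 0.1306, \\
+P(D_1 = 1, D_2 = 1) =& 0.0003 \cdot 0.9985 + 0.98 \cdot 0.0015 = 0.00029955 + 0.00147 = 0.00176955, \\
+P(H = 1 \mid D_1 = 1, D_2 = 1) =& \frac{0.00147}{0.00176955} \approx 0.8307.
+\end{aligned}
+$$
+
+Under the same conditional independence assumption, one can equivalently treat the first posterior as the prior for the second test, which gives $\frac{0.98 \cdot 0.1306}{0.98 \cdot 0.1306 + 0.03 \cdot 0.8694} \approx 0.8307$ again.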
+
+
+## Expectations
+
+Often, making decisions requires not just looking
+at the probabilities assigned to individual events
+but composing them together into useful aggregates
+that can provide us with guidance.
+For example, when random variables take continuous scalar values,
+we often care about knowing what value to expect *on average*.
+This quantity is formally called an *expectation*.
+If we are making investments,
+the first quantity of interest
+might be the return we can expect,
+averaging over all the possible outcomes
+(and weighting by the appropriate probabilities).
+For instance, say that with 50% probability,
+an investment might fail altogether,
+with 40% probability it might provide a 2$\times$ return,
+and with 10% probability it might provide a 10$\times$ return.
+To calculate the expected return,
+we sum over all returns, multiplying each
+by the probability that they will occur.
+This yields the expectation
+$0.5 \cdot 0 + 0.4 \cdot 2 + 0.1 \cdot 10 = 1.8$.
+Hence the expected return is 1.8$\times$.
+
+
+In general, the *expectation* (or average)
+of the random variable $X$ is defined as
+
+$$E[X] = E_{x \sim P}[x] = \sum_{x} x P(X = x).$$
+
+Likewise, for densities we obtain $E[X] = \int x \;dp(x)$.
+Sometimes we are interested in the expected value
+of some function of $x$.
+We can calculate these expectations as
+
+$$E_{x \sim P}[f(x)] = \sum_x f(x) P(x) \text{ and } E_{x \sim P}[f(x)] = \int f(x) p(x) \;dx$$
+
+for discrete probabilities and densities, respectively.
+Returning to the investment example from above,
+$f$ might be the *utility* (happiness)
+associated with the return.
+Behavioral economists have long noted
+that people associate greater disutility
+with losing money than the utility gained
+from earning one dollar relative to their baseline.
+Moreover, the value of money tends to be sub-linear.
+Possessing 100k dollars versus zero dollars
+can make the difference between paying the rent,
+eating well, and enjoying quality healthcare
+versus suffering through homelessness.
+On the other hand, the gains due to possessing
+200k versus 100k are less dramatic.
+Reasoning like this motivates the cliché
+that "the utility of money is logarithmic".
+
+
+If the utility associated with a total loss were -1,
+and the utilities associated with returns of 1, 2, and 10
+were 1, 2 and 4, respectively,
+then the expected happiness of investing
+would be $0.5 \cdot (-1) + 0.4 \cdot 2 + 0.1 \cdot 4 = 0.7$
+(an expected loss of utility of 30%).
+If indeed this were your utility function,
+you might be best off keeping the money in the bank.
+
+For financial decisions,
+we might also want to measure
+how *risky* an investment is.
+Here, we care not just about the expected value
+but how much the actual values tend to *vary*
+relative to this value.
+Note that we can't just take
+the expectation of the difference
+between the actual and expected values.
+That's because the expectation of a difference
+is the difference of the expectations,
+and thus $E[X - E[X]] = E[X] - E[E[X]] = 0$.
+However, we can look at the expectation
+of any non-negative function of this difference.
+The *variance* of a random variable is calculated by looking
+at the expected value of the *squared* deviations:
+
+$$\mathrm{Var}[X] = E\left[(X - E[X])^2\right] = E[X^2] - E[X]^2.$$
+
+Here the equality follows by expanding
+$(X - E[X])^2 = X^2 - 2 X E[X] + E[X]^2$
+and taking expectations for each term.
+The square root of the variance is another
+useful quantity called the *standard deviation*.
+While the variance and standard deviation
+convey the same information (either can be calculated from the other),
+the standard deviation has the nice property
+that it is expressed in the same units
+as the original quantity represented
+by the random variable.
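+
+Spelling out the expansion step behind the variance identity (using only linearity of expectation and the fact that $E[X]$ is a constant):
+
+$$\begin{aligned}
+E\left[(X - E[X])^2\right] =& E\left[X^2 - 2 X E[X] + E[X]^2\right] \\
+=& E[X^2] - 2 E[X] E[X] + E[X]^2 \\
+=& E[X^2] - E[X]^2.
+\end{aligned}
+$$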
+
+Lastly, the variance of a function
+of a random variable
+is defined analogously as
+
+$$\mathrm{Var}_{x \sim P}[f(x)] = E_{x \sim P}[f^2(x)] - E_{x \sim P}[f(x)]^2.$$
+
+Returning to our investment example,
+we can now compute the variance of the investment.
+It is given by $0.5 \cdot 0 + 0.4 \cdot 2^2 + 0.1 \cdot 10^2 - 1.8^2 = 8.36$.
+For all intents and purposes this is a risky investment.
+Note that by mathematical convention mean and variance
+are often referenced as $\mu$ and $\sigma^2$.
+This is particularly common whenever we use it
+to parametrize a Gaussian distribution.
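+
+Breaking the variance calculation for the investment into its parts, and adding the standard deviation (rounded, and measured in the same multiples of the investment as the return itself), we get
+
+$$E[X^2] = 0.5 \cdot 0^2 + 0.4 \cdot 2^2 + 0.1 \cdot 10^2 = 11.6, \qquad \mathrm{Var}[X] = 11.6 - 1.8^2 = 8.36, \qquad \sqrt{8.36} \approx 2.89.$$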
+
+In the same way as we introduced expectations
+and variance for *scalar* random variables,
+we can do so for vector-valued ones.
+Expectations are easy, since we can apply them elementwise.
+For instance, $\boldsymbol{\mu} \stackrel{\mathrm{def}}{=} E_{\mathbf{x} \sim P}[\mathbf{x}]$
+has coordinates $\mu_i = E_{\mathbf{x} \sim P}[x_i]$.
+Covariances are more complicated.
+We resolve the problem by taking expectations of the *outer product*
+of the difference between random variables and their mean.
+
+$$\boldsymbol{\Sigma} \stackrel{\mathrm{def}}{=} \mathrm{Cov}_{\mathbf{x} \sim P}[\mathbf{x}] = E_{\mathbf{x} \sim P}\left[(\mathbf{x} - \boldsymbol{\mu}) (\mathbf{x} - \boldsymbol{\mu})^\top\right].$$
+
+This matrix $\boldsymbol{\Sigma}$ is referred to as the covariance matrix.
+An easy way to see its effect is to consider some vector $\mathbf{v}$
+of the same size as $\mathbf{x}$.
+It follows that
+
+$$\mathbf{v}^\top \boldsymbol{\Sigma} \mathbf{v} = E_{\mathbf{x} \sim P}\left[\mathbf{v}^\top(\mathbf{x} - \boldsymbol{\mu}) (\mathbf{x} - \boldsymbol{\mu})^\top \mathbf{v}\right] = \mathrm{Var}_{\mathbf{x} \sim P}[\mathbf{v}^\top \mathbf{x}].$$
+
+As such, $\boldsymbol{\Sigma}$ allows us to compute the variance
+for any linear function of $\mathbf{x}$
+by a simple matrix multiplication.
+The off-diagonal elements tell us how correlated coordinates are:
+a value of 0 means no correlation,
+whereas a larger positive value
+means that they are more strongly correlated.
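+
+As a small illustration, consider a two-dimensional $\mathbf{x}$
+and the vector $\mathbf{v} = [1, 1]^\top$.
+The entries $\sigma_1^2$, $\sigma_2^2$, and $\sigma_{12}$ below
+are names assumed here for the two variances and the covariance;
+they are not notation used elsewhere in this section.
+
+$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix},
+\qquad
+\mathbf{v}^\top \boldsymbol{\Sigma} \mathbf{v} = \sigma_1^2 + 2 \sigma_{12} + \sigma_2^2 = \mathrm{Var}[x_1 + x_2].$$
+
+The off-diagonal term $\sigma_{12}$ enters twice:
+a positive covariance inflates the variance of the sum,
+while a negative one reduces it.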
+
+
+
+## Discussion
+
+In machine learning, there are many things to be uncertain about!
+We can be uncertain about the value of a label given an input.
+We can be uncertain about the estimated value of a parameter.
+We can even be uncertain about whether data arriving at deployment
+comes from the same distribution as the training data.
+
+By *aleatoric uncertainty*, we denote that uncertainty
+that is intrinsic to the problem,
+and due to genuine randomness
+unaccounted for by the observed variables.
+By *epistemic uncertainty*, we denote uncertainty
+over a model's parameters, the sort of uncertainty
+that we can hope to reduce by collecting more data.
+We might have epistemic uncertainty
+concerning the probability
+that a coin turns up heads,
+but even once we know this probability,
+we are left with aleatoric uncertainty
+about the outcome of any future toss.
+No matter how long we watch someone tossing a fair coin,
+we will never be more or less than 50% certain
+that the next toss will come up heads.
+These terms originate in the literature on mechanical modeling
+(see, e.g., :citet:`Der-Kiureghian.Ditlevsen.2009` for a review on this aspect of [uncertainty quantification](https://en.wikipedia.org/wiki/Uncertainty_quantification)).
+It's worth noting that these terms constitute a slight abuse of language.
+The term *epistemic* refers to anything concerning *knowledge*
+and thus in the philosophical sense, all uncertainty is epistemic.
+
+
+We saw that sampling data from some unknown probability distribution
+can provide us with information that can be used to estimate
+the parameters of the data generating distribution.
+That said, the rate at which this is possible can be quite slow.
+In our coin tossing example (and many others)
+we can do no better than to design estimators
+that converge at a rate of $1/\sqrt{n}$,
+where $n$ is the sample size (e.g., the number of tosses).
+This means that by going from 10 to 1000 observations (usually a very achievable task)
+we see a tenfold reduction of uncertainty,
+whereas the next 1000 observations help comparatively little,
+offering only a 1.41 times reduction.
+This is a persistent feature of machine learning:
+while there are often easy gains, it takes a very large amount of data,
+and often with it an enormous amount of computation to make even further gains.
+For an empirical review of this fact for large-scale language models, see :citet:`Revels.Lubin.Papamarkou.2016`.
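+
+The two reduction factors quoted above follow directly from the $1/\sqrt{n}$ rate,
+if we read "the next 1000 observations" as going from 1000 to 2000 samples:
+
+$$\frac{1/\sqrt{10}}{1/\sqrt{1000}} = \sqrt{\frac{1000}{10}} = 10,
+\qquad
+\frac{1/\sqrt{1000}}{1/\sqrt{2000}} = \sqrt{\frac{2000}{1000}} = \sqrt{2} \approx 1.41.$$
+
+In other words, under this rate each further tenfold reduction
+requires a hundred times more data.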
+
+We also sharpened our language and tools for statistical modeling.
+In the process, we learned about conditional probabilities
+and about one of the most important equations in statistics---Bayes' theorem.
+It is an effective tool for decoupling information conveyed by data
+through a likelihood term $P(B \mid A)$ that addresses
+how well observations $B$ match a choice of parameters $A$,
+and a prior probability $P(A)$ which governs how plausible
+a particular choice of $A$ was in the first place.
+In particular, we saw how this rule can be applied
+to assign probabilities to diagnoses,
+based on the efficacy of the test *and*
+the prevalence of the disease itself (i.e., our prior).
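+
+For reference, with the likelihood $P(B \mid A)$ and the prior $P(A)$
+named as above, Bayes' theorem combines the two as
+
+$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)},
+\qquad
+P(B) = \sum_{A'} P(B \mid A') P(A').$$
+
+Here the sum over $A'$ (a summation index introduced only for this display)
+runs over all possible values of $A$,
+so $P(B)$ merely normalizes the result.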
+
+Lastly, we introduced a first set of nontrivial questions
+about the effect of a specific probability distribution,
+namely expectations and variances.
+While there are many more than just linear and quadratic
+expectations for a probability distribution,
+these two already provide a good deal of knowledge
+about the possible behavior of the distribution.
+For instance, [Chebyshev's inequality](https://en.wikipedia.org/wiki/Chebyshev%27s_inequality)
+states that $P(|X - \mu| \geq k \sigma) \leq 1/k^2$,
+where $\mu$ is the expectation, $\sigma^2$ is the variance of the distribution,
+and $k > 1$ is a confidence parameter of our choosing.
+It tells us that draws from a distribution lie
+with at least 50% probability
+within a $[-\sqrt{2} \sigma, \sqrt{2} \sigma]$
+interval centered on the expectation.
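+
+The 50% figure is the special case $k = \sqrt{2}$ of the inequality:
+
+$$P\left(|X - \mu| \geq \sqrt{2} \sigma\right) \leq \frac{1}{(\sqrt{2})^2} = \frac{1}{2}
+\quad \Longrightarrow \quad
+P\left(|X - \mu| < \sqrt{2} \sigma\right) \geq \frac{1}{2}.$$
+
+Choosing a larger $k$ tightens the guarantee,
+e.g., $k = 2$ gives at least 75% probability
+of falling within two standard deviations of the expectation.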
+
+
+
+
+## Exercises
+
+1. Give an example where observing more data can reduce the amount of uncertainty about the outcome to an arbitrarily low level.
+1. Give an example where observing more data will only reduce the amount of uncertainty up to a point and then no further. Explain why this is the case and where you expect this point to occur.
+1. We empirically demonstrated convergence to the mean for the toss of a coin. Calculate the variance of the estimate of the probability that we see a head after drawing $n$ samples.
+    1. How does the variance scale with the number of observations?
+    1. Use Chebyshev's inequality to bound the deviation from the expectation.
+    1. How does it relate to the central limit theorem?
+1. Assume that we draw $n$ samples $x_i$ from a probability distribution with zero mean and unit variance. Compute the averages $z_m \stackrel{\mathrm{def}}{=} m^{-1} \sum_{i=1}^m x_i$. Can we apply Chebyshev's inequality for every $z_m$ independently? Why not?
+1. Given two events with probability $P(\mathcal{A})$ and $P(\mathcal{B})$, compute upper and lower bounds on $P(\mathcal{A} \cup \mathcal{B})$ and $P(\mathcal{A} \cap \mathcal{B})$. Hint: graph the situation using a [Venn diagram](https://en.wikipedia.org/wiki/Venn_diagram).
+1. Assume that we have a sequence of random variables, say $A$, $B$, and $C$, where $B$ only depends on $A$, and $C$ only depends on $B$. Can you simplify the joint probability $P(A, B, C)$? Hint: this is a [Markov chain](https://en.wikipedia.org/wiki/Markov_chain).
+1. In :numref:`subsec_probability_hiv_app`, assume that the outcomes of the two tests are not independent. In particular assume that either test on its own has a false positive rate of 10% and a false negative rate of 1%. That is, assume that $P(D =1 \mid H=0) = 0.1$ and that $P(D = 0 \mid H=1) = 0.01$. Moreover, assume that for $H = 1$ (infected) the test outcomes are conditionally independent, i.e., that $P(D_1, D_2 \mid H=1) = P(D_1 \mid H=1) P(D_2 \mid H=1)$ but that for healthy patients the outcomes are coupled via $P(D_1 = D_2 = 1 \mid H=0) = 0.02$.
+    1. Work out the joint probability table for $D_1$ and $D_2$, given $H=0$ based on the information you have so far.
+    1. Derive the probability of the patient being positive ($H=1$) after one test returns positive. You can assume the same baseline probability $P(H=1) = 0.0015$ as before.
+    1. Derive the probability of the patient being positive ($H=1$) after both tests return positive.
+1. Assume that you are an asset manager for an investment bank and you have a choice of stocks $s_i$ to invest in. Your portfolio needs to add up to $1$ with weights $\alpha_i$ for each stock. The stocks have an average return $\boldsymbol{\mu} = E_{\mathbf{s} \sim P}[\mathbf{s}]$ and covariance $\boldsymbol{\Sigma} = \mathrm{Cov}_{\mathbf{s} \sim P}[\mathbf{s}]$.
+    1. Compute the expected return for a given portfolio $\boldsymbol{\alpha}$.
+    1. If you wanted to maximize the return of the portfolio, how should you choose your investment?
+    1. Compute the *variance* of the portfolio.
+    1. Formulate an optimization problem of maximizing the return while keeping the variance constrained to an upper bound. This is the Nobel-Prize winning [Markowitz portfolio](https://en.wikipedia.org/wiki/Markowitz_model) :cite:`Mangram.2013`. To solve it you will need a quadratic programming solver, something way beyond the scope of this book.
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/36)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/37)
+:end_tab:
+
+:begin_tab:`tensorflow`
+[Discussions](https://discuss.d2l.ai/t/198)
+:end_tab:
diff --git a/config.ini b/config.ini
index de1f86e..bdf1dc2 100644
--- a/config.ini
+++ b/config.ini
@@ -12,7 +12,7 @@ author = Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola
 
 copyright = 2021, All authors. Licensed under CC-BY-SA-4.0 and MIT-0.
 
-release = 0.17.1
+release = 1.0.0-alpha1.post0
 
 lang = ja
 
@@ -79,48 +79,59 @@ lib_file = d2l/mxnet.py
 lib_name = np
 
 # Map from d2l.xx to np.xx
-simple_alias = ones, zeros, arange, meshgrid, sin, sinh, cos, cosh, tanh,
+simple_alias = ones_like, ones, zeros_like, zeros, arange, meshgrid, sin, sinh, cos, cosh, tanh,
                linspace, exp, log, tensor -> array, normal -> random.normal,
-               rand -> random.rand, matmul -> dot, int32, float32,
+               randn -> random.randn, expand_dims,
+               rand -> random.rand, matmul -> dot, int32, int64, float32,
                concat -> concatenate, stack, abs, eye
 
 # Map from d2l.xx(a, *args, **kwargs) to a.xx(*args, **kwargs)
 fluent_alias = numpy -> asnumpy, reshape, to -> as_in_context, reduce_sum -> sum,
-               argmax, astype
+               argmax, astype, reduce_mean -> mean, swapaxes, repeat
 
 alias =
        size = lambda a: a.size
        transpose = lambda a: a.T
+       nn_Module = nn.Block
+       sigmoid = npx.sigmoid
+       batch_matmul = npx.batch_dot
 
 reverse_alias =
        d2l.size\(([\w\_\d]+)\) -> \1.size
        d2l.transpose\(([\w\_\d]+)\) -> \1.T
+       d2l.nn_Module -> nn.Block
+       d2l.sigmoid -> npx.sigmoid
+       d2l.batch_matmul -> npx.batch_dot
 
 [library-pytorch]
 
 lib_file = d2l/torch.py
 lib_name = torch
 
-simple_alias = ones, zeros, tensor, arange, meshgrid, sin, sinh, cos, cosh,
-               tanh, linspace, exp, log, normal, rand, matmul, int32, float32,
-               concat -> cat, stack, abs, eye
+simple_alias = ones_like, ones, zeros_like, zeros, tensor, arange, meshgrid, sin, sinh, cos, cosh,
+               tanh, linspace, exp, log, normal, rand, randn, matmul, int32, int64, float32,
+               concat -> cat, stack, abs, eye, sigmoid, batch_matmul -> bmm
 
 fluent_alias = numpy -> detach().numpy, size -> numel, reshape, to,
-               reduce_sum -> sum, argmax, astype -> type, transpose -> t
+               reduce_sum -> sum, argmax, astype -> type, transpose -> t,
+               reduce_mean -> mean, expand_dims -> unsqueeze, swapaxes, repeat
 alias =
+       nn_Module = nn.Module
 
 reverse_alias =
+       d2l.nn_Module -> nn.Module
 
 [library-tensorflow]
 
 lib_file = d2l/tensorflow.py
 lib_name = tf
 
-simple_alias = reshape, ones, zeros, meshgrid, sin, sinh, cos, cosh, tanh,
+simple_alias = reshape, ones_like, ones, zeros_like, zeros, meshgrid, sin, sinh, cos, cosh, tanh,
                linspace, exp, normal -> random.normal, rand -> random.uniform,
-               matmul, reduce_sum, argmax, tensor -> constant,
-               arange -> range, astype -> cast, int32, float32, transpose,
-               concat, stack, abs, eye
+               matmul, reduce_sum, reduce_mean, argmax, tensor -> constant,
+               arange -> range, astype -> cast, int32, int64, float32, transpose,
+               concat, stack, abs, eye, log -> math.log, sigmoid, expand_dims, repeat,
+               batch_matmul -> matmul
 
 fluent_alias = numpy,
 
@@ -129,6 +140,7 @@ alias =
 
 reverse_alias =
        d2l.size\(([\w\_\d]+)\) -> tf.size(\1).numpy()
+       d2l.nn_Module -> tf.keras.Model
 
 [deploy]
 
diff --git a/d2l/__init__.py b/d2l/__init__.py
old mode 100755
new mode 100644
index d973621..38999d8
--- a/d2l/__init__.py
+++ b/d2l/__init__.py
@@ -1,8 +1,11 @@
 """Saved source code for "Dive into Deep Learning" (https://d2l.ai).
+
 Please import d2l by one of the following ways:
+
 from d2l import mxnet as d2l  # Use MXNet as the backend
 from d2l import torch as d2l  # Use PyTorch as the backend
 from d2l import tensorflow as d2l  # Use TensorFlow as the backend
+
 """
 
-__version__ = "0.17.1"
+__version__ = "1.0.0-alpha1.post0"
diff --git a/d2l/mxnet.py b/d2l/mxnet.py
index e958926..5eae55c 100644
--- a/d2l/mxnet.py
+++ b/d2l/mxnet.py
@@ -1,3 +1,16 @@
+USE_MXNET = True
+USE_PYTORCH = False
+USE_TENSORFLOW = False
+
+DATA_HUB = dict()
+DATA_URL = 'http://d2l-data.s3-accelerate.amazonaws.com/'
+
+from mxnet import autograd, context, gluon, image, init, np, npx
+from mxnet.gluon import nn, rnn
+from mxnet.gluon.data.vision import transforms
+
+nn_Module = nn.Block
+
 #################   WARNING   ################
 # The below part is generated automatically through:
 #    d2lbook build lib
@@ -5,6 +18,7 @@
 
 import collections
 import hashlib
+import inspect
 import math
 import os
 import random
@@ -19,6 +33,7 @@
 import requests
 from IPython import display
 from matplotlib import pyplot as plt
+from matplotlib_inline import backend_inline
 
 d2l = sys.modules[__name__]
 
@@ -29,7 +44,7 @@ def use_svg_display():
     """Use the svg format to display a plot in Jupyter.
 
     Defined in :numref:`sec_calculus`"""
-    display.set_matplotlib_formats('svg')
+    backend_inline.set_matplotlib_formats('svg')
 
 def set_figsize(figsize=(3.5, 2.5)):
     """Set the figure size for matplotlib.
@@ -42,357 +57,453 @@ def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):
     """Set the axes for matplotlib.
 
     Defined in :numref:`sec_calculus`"""
-    axes.set_xlabel(xlabel)
-    axes.set_ylabel(ylabel)
-    axes.set_xscale(xscale)
-    axes.set_yscale(yscale)
-    axes.set_xlim(xlim)
-    axes.set_ylim(ylim)
+    axes.set_xlabel(xlabel), axes.set_ylabel(ylabel)
+    axes.set_xscale(xscale), axes.set_yscale(yscale)
+    axes.set_xlim(xlim),     axes.set_ylim(ylim)
     if legend:
         axes.legend(legend)
     axes.grid()
 
-def plot(X, Y=None, xlabel=None, ylabel=None, legend=None, xlim=None,
+def plot(X, Y=None, xlabel=None, ylabel=None, legend=[], xlim=None,
          ylim=None, xscale='linear', yscale='linear',
          fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5), axes=None):
     """Plot data points.
 
     Defined in :numref:`sec_calculus`"""
-    if legend is None:
-        legend = []
 
-    set_figsize(figsize)
-    axes = axes if axes else d2l.plt.gca()
-
-    # Return True if `X` (tensor or list) has 1 axis
-    def has_one_axis(X):
+    def has_one_axis(X):  # True if `X` (tensor or list) has 1 axis
         return (hasattr(X, "ndim") and X.ndim == 1 or isinstance(X, list)
                 and not hasattr(X[0], "__len__"))
 
-    if has_one_axis(X):
-        X = [X]
+    if has_one_axis(X): X = [X]
     if Y is None:
         X, Y = [[]] * len(X), X
     elif has_one_axis(Y):
         Y = [Y]
     if len(X) != len(Y):
         X = X * len(Y)
+
+    set_figsize(figsize)
+    if axes is None: axes = d2l.plt.gca()
     axes.cla()
     for x, y, fmt in zip(X, Y, fmts):
-        if len(x):
-            axes.plot(x, y, fmt)
-        else:
-            axes.plot(y, fmt)
+        axes.plot(x,y,fmt) if len(x) else axes.plot(y,fmt)
     set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
 
-class Timer:
-    """Record multiple running times."""
-    def __init__(self):
-        """Defined in :numref:`subsec_linear_model`"""
-        self.times = []
-        self.start()
-
-    def start(self):
-        """Start the timer."""
-        self.tik = time.time()
-
-    def stop(self):
-        """Stop the timer and record the time in a list."""
-        self.times.append(time.time() - self.tik)
-        return self.times[-1]
-
-    def avg(self):
-        """Return the average time."""
-        return sum(self.times) / len(self.times)
-
-    def sum(self):
-        """Return the sum of time."""
-        return sum(self.times)
-
-    def cumsum(self):
-        """Return the accumulated time."""
-        return np.array(self.times).cumsum().tolist()
-
-def synthetic_data(w, b, num_examples):
-    """Generate y = Xw + b + noise.
-
-    Defined in :numref:`sec_linear_scratch`"""
-    X = d2l.normal(0, 1, (num_examples, len(w)))
-    y = d2l.matmul(X, w) + b
-    y += d2l.normal(0, 0.01, y.shape)
-    return X, d2l.reshape(y, (-1, 1))
-
-def linreg(X, w, b):
-    """The linear regression model.
-
-    Defined in :numref:`sec_linear_scratch`"""
-    return d2l.matmul(X, w) + b
-
-def squared_loss(y_hat, y):
-    """Squared loss.
-
-    Defined in :numref:`sec_linear_scratch`"""
-    return (y_hat - d2l.reshape(y, y_hat.shape)) ** 2 / 2
-
-def sgd(params, lr, batch_size):
-    """Minibatch stochastic gradient descent.
-
-    Defined in :numref:`sec_linear_scratch`"""
-    for param in params:
-        param[:] = param - lr * param.grad / batch_size
-
-def load_array(data_arrays, batch_size, is_train=True):
-    """Construct a Gluon data iterator.
-
-    Defined in :numref:`sec_linear_concise`"""
-    dataset = gluon.data.ArrayDataset(*data_arrays)
-    return gluon.data.DataLoader(dataset, batch_size, shuffle=is_train)
-
-def get_fashion_mnist_labels(labels):
-    """Return text labels for the Fashion-MNIST dataset.
-
-    Defined in :numref:`sec_fashion_mnist`"""
-    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
-                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
-    return [text_labels[int(i)] for i in labels]
-
-def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):
-    """Plot a list of images.
-
-    Defined in :numref:`sec_fashion_mnist`"""
-    figsize = (num_cols * scale, num_rows * scale)
-    _, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize)
-    axes = axes.flatten()
-    for i, (ax, img) in enumerate(zip(axes, imgs)):
-        ax.imshow(d2l.numpy(img))
-        ax.axes.get_xaxis().set_visible(False)
-        ax.axes.get_yaxis().set_visible(False)
-        if titles:
-            ax.set_title(titles[i])
-    return axes
-
-def get_dataloader_workers():
-    """Use 4 processes to read the data except for Windows.
-
-    Defined in :numref:`sec_fashion_mnist`"""
-    return 0 if sys.platform.startswith('win') else 4
-
-def load_data_fashion_mnist(batch_size, resize=None):
-    """Download the Fashion-MNIST dataset and then load it into memory.
-
-    Defined in :numref:`sec_fashion_mnist`"""
-    dataset = gluon.data.vision
-    trans = [dataset.transforms.ToTensor()]
-    if resize:
-        trans.insert(0, dataset.transforms.Resize(resize))
-    trans = dataset.transforms.Compose(trans)
-    mnist_train = dataset.FashionMNIST(train=True).transform_first(trans)
-    mnist_test = dataset.FashionMNIST(train=False).transform_first(trans)
-    return (gluon.data.DataLoader(mnist_train, batch_size, shuffle=True,
-                                  num_workers=get_dataloader_workers()),
-            gluon.data.DataLoader(mnist_test, batch_size, shuffle=False,
-                                  num_workers=get_dataloader_workers()))
-
-def accuracy(y_hat, y):
-    """Compute the number of correct predictions.
-
-    Defined in :numref:`sec_softmax_scratch`"""
-    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
-        y_hat = d2l.argmax(y_hat, axis=1)
-    cmp = d2l.astype(y_hat, y.dtype) == y
-    return float(d2l.reduce_sum(d2l.astype(cmp, y.dtype)))
-
-def evaluate_accuracy(net, data_iter):
-    """Compute the accuracy for a model on a dataset.
-
-    Defined in :numref:`sec_softmax_scratch`"""
-    metric = Accumulator(2)  # No. of correct predictions, no. of predictions
-    for X, y in data_iter:
-        metric.add(accuracy(net(X), y), d2l.size(y))
-    return metric[0] / metric[1]
-
-class Accumulator:
-    """For accumulating sums over `n` variables."""
-    def __init__(self, n):
-        """Defined in :numref:`sec_softmax_scratch`"""
-        self.data = [0.0] * n
-
-    def add(self, *args):
-        self.data = [a + float(b) for a, b in zip(self.data, args)]
-
-    def reset(self):
-        self.data = [0.0] * len(self.data)
-
-    def __getitem__(self, idx):
-        return self.data[idx]
-
-def train_epoch_ch3(net, train_iter, loss, updater):
-    """Train a model within one epoch (defined in Chapter 3).
-
-    Defined in :numref:`sec_softmax_scratch`"""
-    # Sum of training loss, sum of training accuracy, no. of examples
-    metric = Accumulator(3)
-    if isinstance(updater, gluon.Trainer):
-        updater = updater.step
-    for X, y in train_iter:
-        # Compute gradients and update parameters
-        with autograd.record():
-            y_hat = net(X)
-            l = loss(y_hat, y)
-        l.backward()
-        updater(X.shape[0])
-        metric.add(float(l.sum()), accuracy(y_hat, y), y.size)
-    # Return training loss and training accuracy
-    return metric[0] / metric[2], metric[1] / metric[2]
-
-class Animator:
-    """For plotting data in animation."""
-    def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
+def add_to_class(Class):
+    """Defined in :numref:`sec_oo-design`"""
+    def wrapper(obj):
+        setattr(Class, obj.__name__, obj)
+    return wrapper
+
+class HyperParameters:
+    def save_hyperparameters(self, ignore=[]):
+        """Defined in :numref:`sec_oo-design`"""
+        raise NotImplementedError
+
+    def save_hyperparameters(self, ignore=[]):
+        """Save function arguments into class attributes.
+    
+        Defined in :numref:`sec_utils`"""
+        frame = inspect.currentframe().f_back
+        _, _, _, local_vars = inspect.getargvalues(frame)
+        self.hparams = {k:v for k, v in local_vars.items()
+                        if k not in set(ignore+['self']) and not k.startswith('_')}
+        for k, v in self.hparams.items():
+            setattr(self, k, v)
+
+class ProgressBoard(d2l.HyperParameters):
+    """Plot data points in animation.
+
+    Defined in :numref:`sec_oo-design`"""
+    def __init__(self, xlabel=None, ylabel=None, xlim=None,
                  ylim=None, xscale='linear', yscale='linear',
-                 fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
-                 figsize=(3.5, 2.5)):
-        """Defined in :numref:`sec_softmax_scratch`"""
-        # Incrementally plot multiple lines
-        if legend is None:
-            legend = []
+                 ls=['-', '--', '-.', ':'], colors=['C0', 'C1', 'C2', 'C3'],
+                 fig=None, axes=None, figsize=(3.5, 2.5), display=True):
+        self.save_hyperparameters()
+
+    def draw(self, x, y, label, every_n=1):
+        raise NotImplementedError
+
+    def draw(self, x, y, label, every_n=1):
+        """Defined in :numref:`sec_utils`"""
+        Point = collections.namedtuple('Point', ['x', 'y'])
+        if not hasattr(self, 'raw_points'):
+            self.raw_points = collections.OrderedDict()
+            self.data = collections.OrderedDict()
+        if label not in self.raw_points:
+            self.raw_points[label] = []
+            self.data[label] = []
+        points = self.raw_points[label]
+        line = self.data[label]
+        points.append(Point(x, y))
+        if len(points) != every_n:
+            return
+        mean = lambda x: sum(x) / len(x)
+        line.append(Point(mean([p.x for p in points]),
+                          mean([p.y for p in points])))
+        points.clear()
+        if not self.display:
+            return
         d2l.use_svg_display()
-        self.fig, self.axes = d2l.plt.subplots(nrows, ncols, figsize=figsize)
-        if nrows * ncols == 1:
-            self.axes = [self.axes, ]
-        # Use a lambda function to capture arguments
-        self.config_axes = lambda: d2l.set_axes(
-            self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
-        self.X, self.Y, self.fmts = None, None, fmts
-
-    def add(self, x, y):
-        # Add multiple data points into the figure
-        if not hasattr(y, "__len__"):
-            y = [y]
-        n = len(y)
-        if not hasattr(x, "__len__"):
-            x = [x] * n
-        if not self.X:
-            self.X = [[] for _ in range(n)]
-        if not self.Y:
-            self.Y = [[] for _ in range(n)]
-        for i, (a, b) in enumerate(zip(x, y)):
-            if a is not None and b is not None:
-                self.X[i].append(a)
-                self.Y[i].append(b)
-        self.axes[0].cla()
-        for x, y, fmt in zip(self.X, self.Y, self.fmts):
-            self.axes[0].plot(x, y, fmt)
-        self.config_axes()
+        if self.fig is None:
+            self.fig = d2l.plt.figure(figsize=self.figsize)
+        plt_lines, labels = [], []
+        for (k, v), ls, color in zip(self.data.items(), self.ls, self.colors):
+            plt_lines.append(d2l.plt.plot([p.x for p in v], [p.y for p in v],
+                                          linestyle=ls, color=color)[0])
+            labels.append(k)
+        axes = self.axes if self.axes else d2l.plt.gca()
+        if self.xlim: axes.set_xlim(self.xlim)
+        if self.ylim: axes.set_ylim(self.ylim)
+        if not self.xlabel: self.xlabel = self.x
+        axes.set_xlabel(self.xlabel)
+        axes.set_ylabel(self.ylabel)
+        axes.set_xscale(self.xscale)
+        axes.set_yscale(self.yscale)
+        axes.legend(plt_lines, labels)
         display.display(self.fig)
         display.clear_output(wait=True)
 
-def train_ch3(net, train_iter, test_iter, loss, num_epochs, updater):
-    """Train a model (defined in Chapter 3).
-
-    Defined in :numref:`sec_softmax_scratch`"""
-    animator = Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0.3, 0.9],
-                        legend=['train loss', 'train acc', 'test acc'])
-    for epoch in range(num_epochs):
-        train_metrics = train_epoch_ch3(net, train_iter, loss, updater)
-        test_acc = evaluate_accuracy(net, test_iter)
-        animator.add(epoch + 1, train_metrics + (test_acc,))
-    train_loss, train_acc = train_metrics
-    assert train_loss < 0.5, train_loss
-    assert train_acc <= 1 and train_acc > 0.7, train_acc
-    assert test_acc <= 1 and test_acc > 0.7, test_acc
-
-def predict_ch3(net, test_iter, n=6):
-    """Predict labels (defined in Chapter 3).
-
-    Defined in :numref:`sec_softmax_scratch`"""
-    for X, y in test_iter:
-        break
-    trues = d2l.get_fashion_mnist_labels(y)
-    preds = d2l.get_fashion_mnist_labels(d2l.argmax(net(X), axis=1))
-    titles = [true +'\n' + pred for true, pred in zip(trues, preds)]
-    d2l.show_images(
-        d2l.reshape(X[0:n], (n, 28, 28)), 1, n, titles=titles[0:n])
-
-def evaluate_loss(net, data_iter, loss):
-    """Evaluate the loss of a model on the given dataset.
-
-    Defined in :numref:`sec_model_selection`"""
-    metric = d2l.Accumulator(2)  # Sum of losses, no. of examples
-    for X, y in data_iter:
-        l = loss(net(X), y)
-        metric.add(d2l.reduce_sum(l), d2l.size(l))
-    return metric[0] / metric[1]
+class Module(d2l.nn_Module, d2l.HyperParameters):
+    """Defined in :numref:`sec_oo-design`"""
+    def __init__(self, plot_train_per_epoch=2, plot_valid_per_epoch=1):
+        super().__init__()
+        self.save_hyperparameters()
+        self.board = ProgressBoard()
+    def loss(self, y_hat, y):
+        raise NotImplementedError
 
-DATA_HUB = dict()
-DATA_URL = 'http://d2l-data.s3-accelerate.amazonaws.com/'
+    def forward(self, X):
+        assert hasattr(self, 'net'), 'Neural network is not defined'
+        return self.net(X)
+
+    def plot(self, key, value, train):
+        """Plot a point in animation."""
+        assert hasattr(self, 'trainer'), 'Trainer is not initialized'
+        self.board.xlabel = 'epoch'
+        if train:
+            x = self.trainer.train_batch_idx / \
+                self.trainer.num_train_batches
+            n = self.trainer.num_train_batches / \
+                self.plot_train_per_epoch
+        else:
+            x = self.trainer.epoch + 1
+            n = self.trainer.num_val_batches / \
+                self.plot_valid_per_epoch
+        self.board.draw(x, d2l.numpy(value), (
+            'train_' if train else 'val_') + key, every_n=int(n))
+    def training_step(self, batch):
+        l = self.loss(self(*batch[:-1]), batch[-1])
+        self.plot('loss', l, train=True)
+        return l
+
+    def validation_step(self, batch):
+        l = self.loss(self(*batch[:-1]), batch[-1])
+        self.plot('loss', l, train=False)
+
+    def configure_optimizers(self):
+        raise NotImplementedError
 
-def download(name, cache_dir=os.path.join('..', 'data')):
-    """Download a file inserted into DATA_HUB, return the local filename.
+    def configure_optimizers(self):
+        """Defined in :numref:`sec_classification`"""
+        params = self.parameters()
+        if isinstance(params, list):
+            return d2l.SGD(params, self.lr)
+        return gluon.Trainer(params, 'sgd', {'learning_rate': self.lr})
+
+    def get_scratch_params(self):
+        """Defined in :numref:`sec_classification`"""
+        params = []
+        for attr in dir(self):
+            a = getattr(self, attr)
+            if isinstance(a, np.ndarray):
+                params.append(a)
+            if isinstance(a, d2l.Module):
+                params.extend(a.get_scratch_params())
+        return params
+    
+
+    def parameters(self):
+        """Defined in :numref:`sec_classification`"""
+        params = self.collect_params()
+        return params if isinstance(params, gluon.parameter.ParameterDict) and len(
+            params.keys()) else self.get_scratch_params()
+
+    def set_scratch_params_device(self, device):
+        """Defined in :numref:`sec_use_gpu`"""
+        for attr in dir(self):
+            a = getattr(self, attr)
+            if isinstance(a, np.ndarray):
+                with autograd.record():
+                    setattr(self, attr, a.as_in_ctx(device))
+                getattr(self, attr).attach_grad()
+            if isinstance(a, d2l.Module):
+                a.set_scratch_params_device(device)
+            if isinstance(a, list):
+                for elem in a:
+                    elem.set_scratch_params_device(device)
+
+class DataModule(d2l.HyperParameters):
+    """Defined in :numref:`sec_oo-design`"""
+    def __init__(self, root='../data', num_workers=4):
+        self.save_hyperparameters()
+
+    def get_dataloader(self, train):
+        raise NotImplementedError
 
-    Defined in :numref:`sec_kaggle_house`"""
-    assert name in DATA_HUB, f"{name} does not exist in {DATA_HUB}."
-    url, sha1_hash = DATA_HUB[name]
-    os.makedirs(cache_dir, exist_ok=True)
-    fname = os.path.join(cache_dir, url.split('/')[-1])
-    if os.path.exists(fname):
-        sha1 = hashlib.sha1()
-        with open(fname, 'rb') as f:
-            while True:
-                data = f.read(1048576)
-                if not data:
-                    break
-                sha1.update(data)
-        if sha1.hexdigest() == sha1_hash:
-            return fname  # Hit cache
-    print(f'Downloading {fname} from {url}...')
-    r = requests.get(url, stream=True, verify=True)
-    with open(fname, 'wb') as f:
-        f.write(r.content)
-    return fname
+    def train_dataloader(self):
+        return self.get_dataloader(train=True)
+
+    def val_dataloader(self):
+        return self.get_dataloader(train=False)
+
+    def get_tensorloader(self, tensors, train, indices=slice(0, None)):
+        """Defined in :numref:`sec_synthetic-regression-data`"""
+        tensors = tuple(a[indices] for a in tensors)
+        dataset = gluon.data.ArrayDataset(*tensors)
+        return gluon.data.DataLoader(dataset, self.batch_size,
+                                     shuffle=train)
+
+class Trainer(d2l.HyperParameters):
+    """Defined in :numref:`sec_oo-design`"""
+    def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
+        self.save_hyperparameters()
+        assert num_gpus == 0, 'No GPU support yet'
+
+    def prepare_data(self, data):
+        self.train_dataloader = data.train_dataloader()
+        self.val_dataloader = data.val_dataloader()
+        self.num_train_batches = len(self.train_dataloader)
+        self.num_val_batches = (len(self.val_dataloader)
+                                if self.val_dataloader is not None else 0)
+
+    def prepare_model(self, model):
+        model.trainer = self
+        model.board.xlim = [0, self.max_epochs]
+        self.model = model
+
+    def fit(self, model, data):
+        self.prepare_data(data)
+        self.prepare_model(model)
+        self.optim = model.configure_optimizers()
+        self.epoch = 0
+        self.train_batch_idx = 0
+        self.val_batch_idx = 0
+        for self.epoch in range(self.max_epochs):
+            self.fit_epoch()
+
+    def fit_epoch(self):
+        raise NotImplementedError
 
-def download_extract(name, folder=None):
-    """Download and extract a zip/tar file.
+    def prepare_batch(self, batch):
+        """Defined in :numref:`sec_linear_scratch`"""
+        return batch
 
-    Defined in :numref:`sec_kaggle_house`"""
-    fname = download(name)
-    base_dir = os.path.dirname(fname)
-    data_dir, ext = os.path.splitext(fname)
-    if ext == '.zip':
-        fp = zipfile.ZipFile(fname, 'r')
-    elif ext in ('.tar', '.gz'):
-        fp = tarfile.open(fname, 'r')
-    else:
-        assert False, 'Only zip/tar files can be extracted.'
-    fp.extractall(base_dir)
-    return os.path.join(base_dir, folder) if folder else data_dir
+    def fit_epoch(self):
+        """Defined in :numref:`sec_linear_scratch`"""
+        for batch in self.train_dataloader:
+            with autograd.record():
+                loss = self.model.training_step(self.prepare_batch(batch))
+            loss.backward()
+            if self.gradient_clip_val > 0:
+                self.clip_gradients(self.gradient_clip_val, self.model)
+            self.optim.step(1)
+            self.train_batch_idx += 1
+        if self.val_dataloader is None:
+            return
+        for batch in self.val_dataloader:
+            self.model.validation_step(self.prepare_batch(batch))
+            self.val_batch_idx += 1
+
+    def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
+        """Defined in :numref:`sec_use_gpu`"""
+        self.save_hyperparameters()
+        self.gpus = [d2l.gpu(i) for i in range(min(num_gpus, d2l.num_gpus()))]
+    
+
+    def prepare_batch(self, batch):
+        """Defined in :numref:`sec_use_gpu`"""
+        if self.gpus:
+            batch = [d2l.to(a, self.gpus[0]) for a in batch]
+        return batch
+    
+
+    def prepare_model(self, model):
+        """Defined in :numref:`sec_use_gpu`"""
+        model.trainer = self
+        model.board.xlim = [0, self.max_epochs]
+        if self.gpus:
+            model.collect_params().reset_ctx(self.gpus[0])
+            model.set_scratch_params_device(self.gpus[0])
+        self.model = model
+
+class SyntheticRegressionData(d2l.DataModule):
+    """Defined in :numref:`sec_synthetic-regression-data`"""
+    def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000,
+                 batch_size=32):
+        super().__init__()
+        self.save_hyperparameters()
+        n = num_train + num_val
+        self.X = d2l.randn(n, len(w))
+        noise = d2l.randn(n, 1) * noise
+        self.y = d2l.matmul(self.X, d2l.reshape(w, (-1, 1))) + b + noise
+
+    def get_dataloader(self, train):
+        """Defined in :numref:`sec_synthetic-regression-data`"""
+        i = slice(0, self.num_train) if train else slice(self.num_train, None)
+        return self.get_tensorloader((self.X, self.y), train, i)
+
+class LinearRegressionScratch(d2l.Module):
+    """Defined in :numref:`sec_linear_scratch`"""
+    def __init__(self, num_inputs, lr, sigma=0.01):
+        super().__init__()
+        self.save_hyperparameters()
+        self.w = d2l.normal(0, sigma, (num_inputs, 1))
+        self.b = d2l.zeros(1)
+        self.w.attach_grad()
+        self.b.attach_grad()
 
-def download_all():
-    """Download all files in the DATA_HUB.
+    def forward(self, X):
+        """The linear regression model.
+    
+        Defined in :numref:`sec_linear_scratch`"""
+        return d2l.matmul(X, self.w) + self.b
+
+    def loss(self, y_hat, y):
+        """Defined in :numref:`sec_linear_scratch`"""
+        l = (y_hat - d2l.reshape(y, y_hat.shape)) ** 2 / 2
+        return d2l.reduce_mean(l)
+
+    def configure_optimizers(self):
+        """Defined in :numref:`sec_linear_scratch`"""
+        return SGD([self.w, self.b], self.lr)
+
+class SGD(d2l.HyperParameters):
+    """Defined in :numref:`sec_linear_scratch`"""
+    def __init__(self, params, lr):
+        """Minibatch stochastic gradient descent."""
+        self.save_hyperparameters()
+
+    def step(self, _):
+        for param in self.params:
+            param -= self.lr * param.grad
+
+class LinearRegression(d2l.Module):
+    """Defined in :numref:`sec_linear_concise`"""
+    def __init__(self, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        self.net = nn.Dense(1)
+        self.net.initialize(init.Normal(sigma=0.01))
 
-    Defined in :numref:`sec_kaggle_house`"""
-    for name in DATA_HUB:
-        download(name)
+    def forward(self, X):
+        """The linear regression model.
+    
+        Defined in :numref:`sec_linear_concise`"""
+        return self.net(X)
+
+    def loss(self, y_hat, y):
+        """Defined in :numref:`sec_linear_concise`"""
+        fn = gluon.loss.L2Loss()
+        return fn(y_hat, y).mean()
+
+    def configure_optimizers(self):
+        """Defined in :numref:`sec_linear_concise`"""
+        return gluon.Trainer(self.collect_params(),
+                             'sgd', {'learning_rate': self.lr})
+
+    def get_w_b(self):
+        """Defined in :numref:`sec_linear_concise`"""
+        return (self.net.weight.data(), self.net.bias.data())
+
+class FashionMNIST(d2l.DataModule):
+    """Defined in :numref:`sec_fashion_mnist`"""
+    def __init__(self, batch_size=64, resize=(28, 28)):
+        super().__init__()
+        self.save_hyperparameters()
+        trans = transforms.Compose([transforms.Resize(resize),
+                                    transforms.ToTensor()])
+        self.train = gluon.data.vision.FashionMNIST(
+            train=True).transform_first(trans)
+        self.val = gluon.data.vision.FashionMNIST(
+            train=False).transform_first(trans)
+
+    def text_labels(self, indices):
+        """Return text labels.
+    
+        Defined in :numref:`sec_fashion_mnist`"""
+        labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
+                  'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
+        return [labels[int(i)] for i in indices]
+
+    def get_dataloader(self, train):
+        """Defined in :numref:`sec_fashion_mnist`"""
+        data = self.train if train else self.val
+        return gluon.data.DataLoader(data, self.batch_size, shuffle=train,
+                                     num_workers=self.num_workers)
+
+    def visualize(self, batch, nrows=1, ncols=8, labels=[]):
+        """Defined in :numref:`sec_fashion_mnist`"""
+        X, y = batch
+        if not labels:
+            labels = self.text_labels(y)
+        d2l.show_images(X.squeeze(1), nrows, ncols, titles=labels)
 
-DATA_HUB['kaggle_house_train'] = (
-    DATA_URL + 'kaggle_house_pred_train.csv',
-    '585e9cc93e70b39160e7921475f9bcd7d31219ce')
+def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):
+    """Plot a list of images.
 
-DATA_HUB['kaggle_house_test'] = (
-    DATA_URL + 'kaggle_house_pred_test.csv',
-    'fa19780a7b011d9b009e8bff8e99922a8ee2eb90')
+    Defined in :numref:`sec_fashion_mnist`"""
+    raise NotImplementedError
+
+class Classifier(d2l.Module):
+    """Defined in :numref:`sec_classification`"""
+    def validation_step(self, batch):
+        Y_hat = self(*batch[:-1])
+        self.plot('loss', self.loss(Y_hat, batch[-1]), train=False)
+        self.plot('acc', self.accuracy(Y_hat, batch[-1]), train=False)
+
+    def accuracy(self, Y_hat, Y, averaged=True):
+        """Compute the fraction of correct predictions.
+    
+        Defined in :numref:`sec_classification`"""
+        Y_hat = d2l.reshape(Y_hat, (-1, Y_hat.shape[-1]))
+        preds = d2l.astype(d2l.argmax(Y_hat, axis=1), Y.dtype)
+        compare = d2l.astype(preds == d2l.reshape(Y, -1), d2l.float32)
+        return d2l.reduce_mean(compare) if averaged else compare
+
+    def loss(self, Y_hat, Y, averaged=True):
+        """Defined in :numref:`sec_softmax_concise`"""
+        Y_hat = d2l.reshape(Y_hat, (-1, Y_hat.shape[-1]))
+        Y = d2l.reshape(Y, (-1,))
+        fn = gluon.loss.SoftmaxCrossEntropyLoss()
+        l = fn(Y_hat, Y)
+        return l.mean() if averaged else l
+
+def cpu():
+    """Defined in :numref:`sec_use_gpu`"""
+    return npx.cpu()
+def gpu(i=0):
+    """Defined in :numref:`sec_use_gpu`"""
+    return npx.gpu(i)
+
+def num_gpus():
+    """Defined in :numref:`sec_use_gpu`"""
+    return npx.num_gpus()
 
 def try_gpu(i=0):
     """Return gpu(i) if exists, otherwise return cpu().
 
     Defined in :numref:`sec_use_gpu`"""
-    return npx.gpu(i) if npx.num_gpus() >= i + 1 else npx.cpu()
+    if num_gpus() >= i + 1:
+        return gpu(i)
+    return cpu()
 
 def try_all_gpus():
-    """Return all available GPUs, or [cpu()] if no GPU exists.
+    """Return all available GPUs, or [cpu(),] if no GPU exists.
 
     Defined in :numref:`sec_use_gpu`"""
-    devices = [npx.gpu(i) for i in range(npx.num_gpus())]
-    return devices if devices else [npx.cpu()]
+    return [gpu(i) for i in range(num_gpus())] or [cpu()]
 
 def corr2d(X, K):
     """Compute 2D cross-correlation.
@@ -2803,11 +2914,473 @@ def update_G(Z, net_D, net_G, loss, trainer_G):
     return float(loss_G.sum())
 
 d2l.DATA_HUB['pokemon'] = (d2l.DATA_URL + 'pokemon.zip',
-                           'c065c0e2593b8b161a2d7873e42418bf6a21106c')# Alias defined in config.ini
+                           'c065c0e2593b8b161a2d7873e42418bf6a21106c')
+
+def load_array(data_arrays, batch_size, is_train=True):
+    """Construct a Gluon data iterator.
+
+    Defined in :numref:`sec_utils`"""
+    dataset = gluon.data.ArrayDataset(*data_arrays)
+    return gluon.data.DataLoader(dataset, batch_size, shuffle=is_train)
+
+def synthetic_data(w, b, num_examples):
+    """Generate y = Xw + b + noise.
+
+    Defined in :numref:`sec_utils`"""
+    X = d2l.normal(0, 1, (num_examples, len(w)))
+    y = d2l.matmul(X, w) + b
+    y += d2l.normal(0, 0.01, y.shape)
+    return X, d2l.reshape(y, (-1, 1))
+
+def sgd(params, lr, batch_size):
+    """Minibatch stochastic gradient descent.
+
+    Defined in :numref:`sec_utils`"""
+    for param in params:
+        param[:] = param - lr * param.grad / batch_size
+
+def get_dataloader_workers():
+    """Use 4 processes to read the data except for Windows.
+
+    Defined in :numref:`sec_utils`"""
+    return 0 if sys.platform.startswith('win') else 4
+
+def load_data_fashion_mnist(batch_size, resize=None):
+    """Download the Fashion-MNIST dataset and then load it into memory.
+
+    Defined in :numref:`sec_utils`"""
+    dataset = gluon.data.vision
+    trans = [dataset.transforms.ToTensor()]
+    if resize:
+        trans.insert(0, dataset.transforms.Resize(resize))
+    trans = dataset.transforms.Compose(trans)
+    mnist_train = dataset.FashionMNIST(train=True).transform_first(trans)
+    mnist_test = dataset.FashionMNIST(train=False).transform_first(trans)
+    return (gluon.data.DataLoader(mnist_train, batch_size, shuffle=True,
+                                  num_workers=get_dataloader_workers()),
+            gluon.data.DataLoader(mnist_test, batch_size, shuffle=False,
+                                  num_workers=get_dataloader_workers()))
+
+def evaluate_accuracy_gpu(net, data_iter, device=None):
+    """Compute the accuracy for a model on a dataset using a GPU.
+
+    Defined in :numref:`sec_utils`"""
+    if not device:  # Default to the first device holding the first parameter
+        device = list(net.collect_params().values())[0].list_ctx()[0]
+    # No. of correct predictions, no. of predictions
+    metric = d2l.Accumulator(2)
+    for X, y in data_iter:
+        X, y = X.as_in_ctx(device), y.as_in_ctx(device)
+        metric.add(d2l.accuracy(net(X), y), d2l.size(y))
+    return metric[0] / metric[1]
+
+def train_ch6(net, train_iter, test_iter, num_epochs, lr, device):
+    """Train a model with a GPU (defined in Chapter 6).
+
+    Defined in :numref:`sec_utils`"""
+    net.initialize(force_reinit=True, ctx=device, init=init.Xavier())
+    loss = gluon.loss.SoftmaxCrossEntropyLoss()
+    trainer = gluon.Trainer(net.collect_params(),
+                            'sgd', {'learning_rate': lr})
+    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
+                            legend=['train loss', 'train acc', 'test acc'])
+    timer, num_batches = d2l.Timer(), len(train_iter)
+    for epoch in range(num_epochs):
+        # Sum of training loss, sum of training accuracy, no. of examples
+        metric = d2l.Accumulator(3)
+        for i, (X, y) in enumerate(train_iter):
+            timer.start()
+            # Here is the major difference from `d2l.train_epoch_ch3`
+            X, y = X.as_in_ctx(device), y.as_in_ctx(device)
+            with autograd.record():
+                y_hat = net(X)
+                l = loss(y_hat, y)
+            l.backward()
+            trainer.step(X.shape[0])
+            metric.add(l.sum(), d2l.accuracy(y_hat, y), X.shape[0])
+            timer.stop()
+            train_l = metric[0] / metric[2]
+            train_acc = metric[1] / metric[2]
+            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
+                animator.add(epoch + (i + 1) / num_batches,
+                             (train_l, train_acc, None))
+        test_acc = evaluate_accuracy_gpu(net, test_iter)
+        animator.add(epoch + 1, (None, None, test_acc))
+    print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
+          f'test acc {test_acc:.3f}')
+    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
+          f'on {str(device)}')
+
+def grad_clipping(net, theta):
+    """Clip the gradient.
+
+    Defined in :numref:`sec_utils`"""
+    if isinstance(net, gluon.Block):
+        params = [p.data() for p in net.collect_params().values()]
+    else:
+        params = net.params
+    norm = math.sqrt(sum((p.grad ** 2).sum() for p in params))
+    if norm > theta:
+        for param in params:
+            param.grad[:] *= theta / norm
+
+def evaluate_accuracy(net, data_iter):
+    """Compute the accuracy for a model on a dataset.
+
+    Defined in :numref:`sec_utils`"""
+    metric = Accumulator(2)  # No. of correct predictions, no. of predictions
+    for X, y in data_iter:
+        metric.add(accuracy(net(X), y), d2l.size(y))
+    return metric[0] / metric[1]
+
+def linreg(X, w, b):
+    """The linear regression model.
+
+    Defined in :numref:`sec_utils`"""
+    return d2l.matmul(X, w) + b
+
+def squared_loss(y_hat, y):
+    """Squared loss.
+
+    Defined in :numref:`sec_utils`"""
+    return (y_hat - d2l.reshape(y, y_hat.shape)) ** 2 / 2
+
+def get_fashion_mnist_labels(labels):
+    """Return text labels for the Fashion-MNIST dataset.
+
+    Defined in :numref:`sec_utils`"""
+    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
+                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
+    return [text_labels[int(i)] for i in labels]
+
+def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):
+    """Plot a list of images.
+
+    Defined in :numref:`sec_utils`"""
+    figsize = (num_cols * scale, num_rows * scale)
+    _, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize)
+    axes = axes.flatten()
+    for i, (ax, img) in enumerate(zip(axes, imgs)):
+        try:
+            img = d2l.numpy(img)
+        except Exception:  # Not convertible; plot `img` as-is
+            pass
+        ax.imshow(img)
+        ax.axes.get_xaxis().set_visible(False)
+        ax.axes.get_yaxis().set_visible(False)
+        if titles:
+            ax.set_title(titles[i])
+    return axes
+
+class Animator:
+    """For plotting data in animation."""
+    def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
+                 ylim=None, xscale='linear', yscale='linear',
+                 fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
+                 figsize=(3.5, 2.5)):
+        """Defined in :numref:`sec_utils`"""
+        # Incrementally plot multiple lines
+        if legend is None:
+            legend = []
+        d2l.use_svg_display()
+        self.fig, self.axes = d2l.plt.subplots(nrows, ncols, figsize=figsize)
+        if nrows * ncols == 1:
+            self.axes = [self.axes, ]
+        # Use a lambda function to capture arguments
+        self.config_axes = lambda: d2l.set_axes(
+            self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
+        self.X, self.Y, self.fmts = None, None, fmts
+
+    def add(self, x, y):
+        # Add multiple data points into the figure
+        if not hasattr(y, "__len__"):
+            y = [y]
+        n = len(y)
+        if not hasattr(x, "__len__"):
+            x = [x] * n
+        if not self.X:
+            self.X = [[] for _ in range(n)]
+        if not self.Y:
+            self.Y = [[] for _ in range(n)]
+        for i, (a, b) in enumerate(zip(x, y)):
+            if a is not None and b is not None:
+                self.X[i].append(a)
+                self.Y[i].append(b)
+        self.axes[0].cla()
+        for x, y, fmt in zip(self.X, self.Y, self.fmts):
+            self.axes[0].plot(x, y, fmt)
+        self.config_axes()
+        display.display(self.fig)
+        display.clear_output(wait=True)
+
+class Accumulator:
+    """For accumulating sums over `n` variables."""
+    def __init__(self, n):
+        """Defined in :numref:`sec_utils`"""
+        self.data = [0.0] * n
+
+    def add(self, *args):
+        self.data = [a + float(b) for a, b in zip(self.data, args)]
+
+    def reset(self):
+        self.data = [0.0] * len(self.data)
+
+    def __getitem__(self, idx):
+        return self.data[idx]
+
+
+def accuracy(y_hat, y):
+    """Compute the number of correct predictions.
+
+    Defined in :numref:`sec_utils`"""
+    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
+        y_hat = d2l.argmax(y_hat, axis=1)
+    cmp = d2l.astype(y_hat, y.dtype) == y
+    return float(d2l.reduce_sum(d2l.astype(cmp, y.dtype)))
+
+def download(url, folder='../data', sha1_hash=None):
+    """Download a file to folder and return the local filepath.
+
+    Defined in :numref:`sec_utils`"""
+    if not url.startswith('http'):
+        # For backward compatibility
+        url, sha1_hash = DATA_HUB[url]
+    os.makedirs(folder, exist_ok=True)
+    fname = os.path.join(folder, url.split('/')[-1])
+    # Check if hit cache
+    if os.path.exists(fname) and sha1_hash:
+        sha1 = hashlib.sha1()
+        with open(fname, 'rb') as f:
+            while True:
+                data = f.read(1048576)
+                if not data:
+                    break
+                sha1.update(data)
+        if sha1.hexdigest() == sha1_hash:
+            return fname
+    # Download
+    print(f'Downloading {fname} from {url}...')
+    r = requests.get(url, stream=True, verify=True)
+    with open(fname, 'wb') as f:
+        f.write(r.content)
+    return fname
+
+def extract(filename, folder=None):
+    """Extract a zip/tar file into folder.
+
+    Defined in :numref:`sec_utils`"""
+    base_dir = os.path.dirname(filename)
+    _, ext = os.path.splitext(filename)
+    assert ext in ('.zip', '.tar', '.gz'), 'Only support zip/tar files.'
+    if ext == '.zip':
+        fp = zipfile.ZipFile(filename, 'r')
+    else:
+        fp = tarfile.open(filename, 'r')
+    if folder is None:
+        folder = base_dir
+    fp.extractall(folder)
+
+def download_extract(name, folder=None):
+    """Download and extract a zip/tar file.
+
+    Defined in :numref:`sec_utils`"""
+    fname = download(name)
+    base_dir = os.path.dirname(fname)
+    data_dir, ext = os.path.splitext(fname)
+    if ext == '.zip':
+        fp = zipfile.ZipFile(fname, 'r')
+    elif ext in ('.tar', '.gz'):
+        fp = tarfile.open(fname, 'r')
+    else:
+        assert False, 'Only zip/tar files can be extracted.'
+    fp.extractall(base_dir)
+    return os.path.join(base_dir, folder) if folder else data_dir
+
+
+def tokenize(lines, token='word'):
+    """Split text lines into word or character tokens.
+
+    Defined in :numref:`sec_utils`"""
+    assert token in ('word', 'char'), 'Unknown token type: ' + token
+    return [line.split() if token == 'word' else list(line) for line in lines]
+
+def evaluate_loss(net, data_iter, loss):
+    """Evaluate the loss of a model on the given dataset.
+
+    Defined in :numref:`sec_utils`"""
+    metric = d2l.Accumulator(2)  # Sum of losses, no. of examples
+    for X, y in data_iter:
+        l = loss(net(X), y)
+        metric.add(d2l.reduce_sum(l), d2l.size(l))
+    return metric[0] / metric[1]
+
+d2l.DATA_HUB['fra-eng'] = (d2l.DATA_URL + 'fra-eng.zip',
+                           '94646ad1522d915e7b0f9296181140edcf86a4f5')
+
+def read_data_nmt():
+    """Load the English-French dataset.
+
+    Defined in :numref:`sec_utils`"""
+    data_dir = d2l.download_extract('fra-eng')
+    with open(os.path.join(data_dir, 'fra.txt'), 'r') as f:
+        return f.read()
+
+def preprocess_nmt(text):
+    """Preprocess the English-French dataset.
+
+    Defined in :numref:`sec_utils`"""
+    def no_space(char, prev_char):
+        return char in set(',.!?') and prev_char != ' '
+
+    # Replace non-breaking space with space, and convert uppercase letters to
+    # lowercase ones
+    text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()
+    # Insert space between words and punctuation marks
+    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
+           for i, char in enumerate(text)]
+    return ''.join(out)
+
+def tokenize_nmt(text, num_examples=None):
+    """Tokenize the English-French dataset.
+
+    Defined in :numref:`sec_utils`"""
+    source, target = [], []
+    for i, line in enumerate(text.split('\n')):
+        if num_examples and i > num_examples:
+            break
+        parts = line.split('\t')
+        if len(parts) == 2:
+            source.append(parts[0].split(' '))
+            target.append(parts[1].split(' '))
+    return source, target
+
+
+def truncate_pad(line, num_steps, padding_token):
+    """Truncate or pad sequences.
+
+    Defined in :numref:`sec_utils`"""
+    if len(line) > num_steps:
+        return line[:num_steps]  # Truncate
+    return line + [padding_token] * (num_steps - len(line))  # Pad
+
+
+def build_array_nmt(lines, vocab, num_steps):
+    """Transform text sequences of machine translation into minibatches.
+
+    Defined in :numref:`sec_utils`"""
+    lines = [vocab[l] for l in lines]
+    lines = [l + [vocab['<eos>']] for l in lines]
+    array = d2l.tensor([truncate_pad(
+        l, num_steps, vocab['<pad>']) for l in lines])
+    valid_len = d2l.reduce_sum(
+        d2l.astype(array != vocab['<pad>'], d2l.int32), 1)
+    return array, valid_len
+
+
+def load_data_nmt(batch_size, num_steps, num_examples=600):
+    """Return the iterator and the vocabularies of the translation dataset.
+
+    Defined in :numref:`sec_utils`"""
+    text = preprocess_nmt(read_data_nmt())
+    source, target = tokenize_nmt(text, num_examples)
+    src_vocab = d2l.Vocab(source, min_freq=2,
+                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
+    tgt_vocab = d2l.Vocab(target, min_freq=2,
+                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
+    src_array, src_valid_len = build_array_nmt(source, src_vocab, num_steps)
+    tgt_array, tgt_valid_len = build_array_nmt(target, tgt_vocab, num_steps)
+    data_arrays = (src_array, src_valid_len, tgt_array, tgt_valid_len)
+    data_iter = d2l.load_array(data_arrays, batch_size)
+    return data_iter, src_vocab, tgt_vocab
+
+class MaskedSoftmaxCELoss(gluon.loss.SoftmaxCELoss):
+    """The softmax cross-entropy loss with masks.
+
+    Defined in :numref:`sec_utils`"""
+    # `pred` shape: (`batch_size`, `num_steps`, `vocab_size`)
+    # `label` shape: (`batch_size`, `num_steps`)
+    # `valid_len` shape: (`batch_size`,)
+    def forward(self, pred, label, valid_len):
+        # `weights` shape: (`batch_size`, `num_steps`, 1)
+        weights = np.expand_dims(np.ones_like(label), axis=-1)
+        weights = npx.sequence_mask(weights, valid_len, True, axis=1)
+        return super(MaskedSoftmaxCELoss, self).forward(pred, label, weights)
+
+def train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device):
+    """Train a model for sequence to sequence.
+
+    Defined in :numref:`sec_utils`"""
+    net.initialize(init.Xavier(), force_reinit=True, ctx=device)
+    trainer = gluon.Trainer(net.collect_params(), 'adam',
+                            {'learning_rate': lr})
+    loss = MaskedSoftmaxCELoss()
+    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
+                            xlim=[10, num_epochs])
+    for epoch in range(num_epochs):
+        timer = d2l.Timer()
+        metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
+        for batch in data_iter:
+            X, X_valid_len, Y, Y_valid_len = [
+                x.as_in_ctx(device) for x in batch]
+            bos = np.array(
+                [tgt_vocab['<bos>']] * Y.shape[0], ctx=device).reshape(-1, 1)
+            dec_input = d2l.concat([bos, Y[:, :-1]], 1)  # Teacher forcing
+            with autograd.record():
+                Y_hat, _ = net(X, dec_input, X_valid_len)
+                l = loss(Y_hat, Y, Y_valid_len)
+            l.backward()
+            d2l.grad_clipping(net, 1)
+            num_tokens = Y_valid_len.sum()
+            trainer.step(num_tokens)
+            metric.add(l.sum(), num_tokens)
+        if (epoch + 1) % 10 == 0:
+            animator.add(epoch + 1, (metric[0] / metric[1],))
+    print(f'loss {metric[0] / metric[1]:.3f}, {metric[1] / timer.stop():.1f} '
+          f'tokens/sec on {str(device)}')
+
+def predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps,
+                    device, save_attention_weights=False):
+    """Predict for sequence to sequence.
+
+    Defined in :numref:`sec_utils`"""
+    src_tokens = src_vocab[src_sentence.lower().split(' ')] + [
+        src_vocab['<eos>']]
+    enc_valid_len = np.array([len(src_tokens)], ctx=device)
+    src_tokens = d2l.truncate_pad(src_tokens, num_steps, src_vocab['<pad>'])
+    # Add the batch axis
+    enc_X = np.expand_dims(np.array(src_tokens, ctx=device), axis=0)
+    enc_outputs = net.encoder(enc_X, enc_valid_len)
+    dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)
+    # Add the batch axis
+    dec_X = np.expand_dims(np.array([tgt_vocab['<bos>']], ctx=device), axis=0)
+    output_seq, attention_weight_seq = [], []
+    for _ in range(num_steps):
+        Y, dec_state = net.decoder(dec_X, dec_state)
+        # We use the token with the highest prediction likelihood as input
+        # of the decoder at the next time step
+        dec_X = Y.argmax(axis=2)
+        pred = dec_X.squeeze(axis=0).astype('int32').item()
+        # Save attention weights (to be covered later)
+        if save_attention_weights:
+            attention_weight_seq.append(net.decoder.attention_weights)
+        # Once the end-of-sequence token is predicted, the generation of the
+        # output sequence is complete
+        if pred == tgt_vocab['<eos>']:
+            break
+        output_seq.append(pred)
+    return ' '.join(tgt_vocab.to_tokens(output_seq)), attention_weight_seq
+
+
+# Alias defined in config.ini
 size = lambda a: a.size
 transpose = lambda a: a.T
+nn_Module = nn.Block
+sigmoid = npx.sigmoid
+batch_matmul = npx.batch_dot
 
+ones_like = np.ones_like
 ones = np.ones
+zeros_like = np.zeros_like
 zeros = np.zeros
 arange = np.arange
 meshgrid = np.meshgrid
@@ -2821,9 +3394,12 @@ def update_G(Z, net_D, net_G, loss, trainer_G):
 log = np.log
 tensor = np.array
 normal = np.random.normal
+randn = np.random.randn
+expand_dims = np.expand_dims
 rand = np.random.rand
 matmul = np.dot
 int32 = np.int32
+int64 = np.int64
 float32 = np.float32
 concat = np.concatenate
 stack = np.stack
@@ -2835,4 +3411,7 @@ def update_G(Z, net_D, net_G, loss, trainer_G):
 reduce_sum = lambda x, *args, **kwargs: x.sum(*args, **kwargs)
 argmax = lambda x, *args, **kwargs: x.argmax(*args, **kwargs)
 astype = lambda x, *args, **kwargs: x.astype(*args, **kwargs)
+reduce_mean = lambda x, *args, **kwargs: x.mean(*args, **kwargs)
+swapaxes = lambda x, *args, **kwargs: x.swapaxes(*args, **kwargs)
+repeat = lambda x, *args, **kwargs: x.repeat(*args, **kwargs)
 
diff --git a/d2l/tensorflow.py b/d2l/tensorflow.py
index 173c725..b27ecff 100644
--- a/d2l/tensorflow.py
+++ b/d2l/tensorflow.py
@@ -1,3 +1,11 @@
+DATA_HUB = dict()
+DATA_URL = 'http://d2l-data.s3-accelerate.amazonaws.com/'
+
+import numpy as np
+import tensorflow as tf
+
+nn_Module = tf.keras.Model
+
 #################   WARNING   ################
 # The below part is generated automatically through:
 #    d2lbook build lib
@@ -5,6 +13,7 @@
 
 import collections
 import hashlib
+import inspect
 import math
 import os
 import random
@@ -19,6 +28,7 @@
 import requests
 from IPython import display
 from matplotlib import pyplot as plt
+from matplotlib_inline import backend_inline
 
 d2l = sys.modules[__name__]
 
@@ -29,7 +39,7 @@ def use_svg_display():
     """Use the svg format to display a plot in Jupyter.
 
     Defined in :numref:`sec_calculus`"""
-    display.set_matplotlib_formats('svg')
+    backend_inline.set_matplotlib_formats('svg')
 
 def set_figsize(figsize=(3.5, 2.5)):
     """Set the figure size for matplotlib.
@@ -42,382 +52,404 @@ def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):
     """Set the axes for matplotlib.
 
     Defined in :numref:`sec_calculus`"""
-    axes.set_xlabel(xlabel)
-    axes.set_ylabel(ylabel)
-    axes.set_xscale(xscale)
-    axes.set_yscale(yscale)
-    axes.set_xlim(xlim)
-    axes.set_ylim(ylim)
+    axes.set_xlabel(xlabel), axes.set_ylabel(ylabel)
+    axes.set_xscale(xscale), axes.set_yscale(yscale)
+    axes.set_xlim(xlim),     axes.set_ylim(ylim)
     if legend:
         axes.legend(legend)
     axes.grid()
 
-def plot(X, Y=None, xlabel=None, ylabel=None, legend=None, xlim=None,
+def plot(X, Y=None, xlabel=None, ylabel=None, legend=[], xlim=None,
          ylim=None, xscale='linear', yscale='linear',
          fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5), axes=None):
     """Plot data points.
 
     Defined in :numref:`sec_calculus`"""
-    if legend is None:
-        legend = []
-
-    set_figsize(figsize)
-    axes = axes if axes else d2l.plt.gca()
 
-    # Return True if `X` (tensor or list) has 1 axis
-    def has_one_axis(X):
+    def has_one_axis(X):  # True if `X` (tensor or list) has 1 axis
         return (hasattr(X, "ndim") and X.ndim == 1 or isinstance(X, list)
                 and not hasattr(X[0], "__len__"))
 
-    if has_one_axis(X):
-        X = [X]
+    if has_one_axis(X): X = [X]
     if Y is None:
         X, Y = [[]] * len(X), X
     elif has_one_axis(Y):
         Y = [Y]
     if len(X) != len(Y):
         X = X * len(Y)
+
+    set_figsize(figsize)
+    if axes is None: axes = d2l.plt.gca()
     axes.cla()
     for x, y, fmt in zip(X, Y, fmts):
-        if len(x):
-            axes.plot(x, y, fmt)
-        else:
-            axes.plot(y, fmt)
+        axes.plot(x, y, fmt) if len(x) else axes.plot(y, fmt)
     set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
 
-class Timer:
-    """Record multiple running times."""
-    def __init__(self):
-        """Defined in :numref:`subsec_linear_model`"""
-        self.times = []
-        self.start()
-
-    def start(self):
-        """Start the timer."""
-        self.tik = time.time()
-
-    def stop(self):
-        """Stop the timer and record the time in a list."""
-        self.times.append(time.time() - self.tik)
-        return self.times[-1]
-
-    def avg(self):
-        """Return the average time."""
-        return sum(self.times) / len(self.times)
-
-    def sum(self):
-        """Return the sum of time."""
-        return sum(self.times)
-
-    def cumsum(self):
-        """Return the accumulated time."""
-        return np.array(self.times).cumsum().tolist()
-
-def synthetic_data(w, b, num_examples):
-    """Generate y = Xw + b + noise.
-
-    Defined in :numref:`sec_linear_scratch`"""
-    X = d2l.zeros((num_examples, w.shape[0]))
-    X += tf.random.normal(shape=X.shape)
-    y = d2l.matmul(X, tf.reshape(w, (-1, 1))) + b
-    y += tf.random.normal(shape=y.shape, stddev=0.01)
-    y = d2l.reshape(y, (-1, 1))
-    return X, y
-
-def linreg(X, w, b):
-    """The linear regression model.
-
-    Defined in :numref:`sec_linear_scratch`"""
-    return d2l.matmul(X, w) + b
-
-def squared_loss(y_hat, y):
-    """Squared loss.
-
-    Defined in :numref:`sec_linear_scratch`"""
-    return (y_hat - d2l.reshape(y, y_hat.shape)) ** 2 / 2
-
-def sgd(params, grads, lr, batch_size):
-    """Minibatch stochastic gradient descent.
-
-    Defined in :numref:`sec_linear_scratch`"""
-    for param, grad in zip(params, grads):
-        param.assign_sub(lr*grad/batch_size)
-
-def load_array(data_arrays, batch_size, is_train=True):
-    """Construct a TensorFlow data iterator.
-
-    Defined in :numref:`sec_linear_concise`"""
-    dataset = tf.data.Dataset.from_tensor_slices(data_arrays)
-    if is_train:
-        dataset = dataset.shuffle(buffer_size=1000)
-    dataset = dataset.batch(batch_size)
-    return dataset
-
-def get_fashion_mnist_labels(labels):
-    """Return text labels for the Fashion-MNIST dataset.
-
-    Defined in :numref:`sec_fashion_mnist`"""
-    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
-                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
-    return [text_labels[int(i)] for i in labels]
-
-def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):
-    """Plot a list of images.
-
-    Defined in :numref:`sec_fashion_mnist`"""
-    figsize = (num_cols * scale, num_rows * scale)
-    _, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize)
-    axes = axes.flatten()
-    for i, (ax, img) in enumerate(zip(axes, imgs)):
-        ax.imshow(d2l.numpy(img))
-        ax.axes.get_xaxis().set_visible(False)
-        ax.axes.get_yaxis().set_visible(False)
-        if titles:
-            ax.set_title(titles[i])
-    return axes
-
-def load_data_fashion_mnist(batch_size, resize=None):
-    """Download the Fashion-MNIST dataset and then load it into memory.
-
-    Defined in :numref:`sec_fashion_mnist`"""
-    mnist_train, mnist_test = tf.keras.datasets.fashion_mnist.load_data()
-    # Divide all numbers by 255 so that all pixel values are between
-    # 0 and 1, add a batch dimension at the last. And cast label to int32
-    process = lambda X, y: (tf.expand_dims(X, axis=3) / 255,
-                            tf.cast(y, dtype='int32'))
-    resize_fn = lambda X, y: (
-        tf.image.resize_with_pad(X, resize, resize) if resize else X, y)
-    return (
-        tf.data.Dataset.from_tensor_slices(process(*mnist_train)).batch(
-            batch_size).shuffle(len(mnist_train[0])).map(resize_fn),
-        tf.data.Dataset.from_tensor_slices(process(*mnist_test)).batch(
-            batch_size).map(resize_fn))
-
-def accuracy(y_hat, y):
-    """Compute the number of correct predictions.
-
-    Defined in :numref:`sec_softmax_scratch`"""
-    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
-        y_hat = d2l.argmax(y_hat, axis=1)
-    cmp = d2l.astype(y_hat, y.dtype) == y
-    return float(d2l.reduce_sum(d2l.astype(cmp, y.dtype)))
-
-def evaluate_accuracy(net, data_iter):
-    """Compute the accuracy for a model on a dataset.
-
-    Defined in :numref:`sec_softmax_scratch`"""
-    metric = Accumulator(2)  # No. of correct predictions, no. of predictions
-    for X, y in data_iter:
-        metric.add(accuracy(net(X), y), d2l.size(y))
-    return metric[0] / metric[1]
-
-class Accumulator:
-    """For accumulating sums over `n` variables."""
-    def __init__(self, n):
-        """Defined in :numref:`sec_softmax_scratch`"""
-        self.data = [0.0] * n
-
-    def add(self, *args):
-        self.data = [a + float(b) for a, b in zip(self.data, args)]
-
-    def reset(self):
-        self.data = [0.0] * len(self.data)
-
-    def __getitem__(self, idx):
-        return self.data[idx]
-
-def train_epoch_ch3(net, train_iter, loss, updater):
-    """The training loop defined in Chapter 3.
-
-    Defined in :numref:`sec_softmax_scratch`"""
-    # Sum of training loss, sum of training accuracy, no. of examples
-    metric = Accumulator(3)
-    for X, y in train_iter:
-        # Compute gradients and update parameters
-        with tf.GradientTape() as tape:
-            y_hat = net(X)
-            # Keras implementations for loss takes (labels, predictions)
-            # instead of (predictions, labels) that users might implement
-            # in this book, e.g. `cross_entropy` that we implemented above
-            if isinstance(loss, tf.keras.losses.Loss):
-                l = loss(y, y_hat)
-            else:
-                l = loss(y_hat, y)
-        if isinstance(updater, tf.keras.optimizers.Optimizer):
-            params = net.trainable_variables
-            grads = tape.gradient(l, params)
-            updater.apply_gradients(zip(grads, params))
-        else:
-            updater(X.shape[0], tape.gradient(l, updater.params))
-        # Keras loss by default returns the average loss in a batch
-        l_sum = l * float(tf.size(y)) if isinstance(
-            loss, tf.keras.losses.Loss) else tf.reduce_sum(l)
-        metric.add(l_sum, accuracy(y_hat, y), tf.size(y))
-    # Return training loss and training accuracy
-    return metric[0] / metric[2], metric[1] / metric[2]
-
-class Animator:
-    """For plotting data in animation."""
-    def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
+def add_to_class(Class):
+    """Defined in :numref:`sec_oo-design`"""
+    def wrapper(obj):
+        setattr(Class, obj.__name__, obj)
+    return wrapper
+
+class HyperParameters:
+    def save_hyperparameters(self, ignore=[]):
+        """Defined in :numref:`sec_oo-design`"""
+        raise NotImplementedError
+
+    def save_hyperparameters(self, ignore=[]):
+        """Save function arguments into class attributes.
+    
+        Defined in :numref:`sec_utils`"""
+        frame = inspect.currentframe().f_back
+        _, _, _, local_vars = inspect.getargvalues(frame)
+        self.hparams = {k: v for k, v in local_vars.items()
+                        if k not in set(ignore+['self']) and not k.startswith('_')}
+        for k, v in self.hparams.items():
+            setattr(self, k, v)
+
+class ProgressBoard(d2l.HyperParameters):
+    """Plot data points in animation.
+
+    Defined in :numref:`sec_oo-design`"""
+    def __init__(self, xlabel=None, ylabel=None, xlim=None,
                  ylim=None, xscale='linear', yscale='linear',
-                 fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
-                 figsize=(3.5, 2.5)):
-        """Defined in :numref:`sec_softmax_scratch`"""
-        # Incrementally plot multiple lines
-        if legend is None:
-            legend = []
+                 ls=['-', '--', '-.', ':'], colors=['C0', 'C1', 'C2', 'C3'],
+                 fig=None, axes=None, figsize=(3.5, 2.5), display=True):
+        self.save_hyperparameters()
+
+    def draw(self, x, y, label, every_n=1):
+        raise NotImplementedError
+
+    def draw(self, x, y, label, every_n=1):
+        """Defined in :numref:`sec_utils`"""
+        Point = collections.namedtuple('Point', ['x', 'y'])
+        if not hasattr(self, 'raw_points'):
+            self.raw_points = collections.OrderedDict()
+            self.data = collections.OrderedDict()
+        if label not in self.raw_points:
+            self.raw_points[label] = []
+            self.data[label] = []
+        points = self.raw_points[label]
+        line = self.data[label]
+        points.append(Point(x, y))
+        if len(points) != every_n:
+            return
+        mean = lambda x: sum(x) / len(x)
+        line.append(Point(mean([p.x for p in points]),
+                          mean([p.y for p in points])))
+        points.clear()
+        if not self.display:
+            return
         d2l.use_svg_display()
-        self.fig, self.axes = d2l.plt.subplots(nrows, ncols, figsize=figsize)
-        if nrows * ncols == 1:
-            self.axes = [self.axes, ]
-        # Use a lambda function to capture arguments
-        self.config_axes = lambda: d2l.set_axes(
-            self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
-        self.X, self.Y, self.fmts = None, None, fmts
-
-    def add(self, x, y):
-        # Add multiple data points into the figure
-        if not hasattr(y, "__len__"):
-            y = [y]
-        n = len(y)
-        if not hasattr(x, "__len__"):
-            x = [x] * n
-        if not self.X:
-            self.X = [[] for _ in range(n)]
-        if not self.Y:
-            self.Y = [[] for _ in range(n)]
-        for i, (a, b) in enumerate(zip(x, y)):
-            if a is not None and b is not None:
-                self.X[i].append(a)
-                self.Y[i].append(b)
-        self.axes[0].cla()
-        for x, y, fmt in zip(self.X, self.Y, self.fmts):
-            self.axes[0].plot(x, y, fmt)
-        self.config_axes()
+        if self.fig is None:
+            self.fig = d2l.plt.figure(figsize=self.figsize)
+        plt_lines, labels = [], []
+        for (k, v), ls, color in zip(self.data.items(), self.ls, self.colors):
+            plt_lines.append(d2l.plt.plot([p.x for p in v], [p.y for p in v],
+                                          linestyle=ls, color=color)[0])
+            labels.append(k)
+        axes = self.axes if self.axes else d2l.plt.gca()
+        if self.xlim: axes.set_xlim(self.xlim)
+        if self.ylim: axes.set_ylim(self.ylim)
+        if not self.xlabel: self.xlabel = self.x
+        axes.set_xlabel(self.xlabel)
+        axes.set_ylabel(self.ylabel)
+        axes.set_xscale(self.xscale)
+        axes.set_yscale(self.yscale)
+        axes.legend(plt_lines, labels)
         display.display(self.fig)
         display.clear_output(wait=True)
 
-def train_ch3(net, train_iter, test_iter, loss, num_epochs, updater):
-    """Train a model (defined in Chapter 3).
+class Module(d2l.nn_Module, d2l.HyperParameters):
+    """Defined in :numref:`sec_oo-design`"""
+    def __init__(self, plot_train_per_epoch=2, plot_valid_per_epoch=1):
+        super().__init__()
+        self.save_hyperparameters()
+        self.board = ProgressBoard()
+        self.training = None
 
-    Defined in :numref:`sec_softmax_scratch`"""
-    animator = Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0.3, 0.9],
-                        legend=['train loss', 'train acc', 'test acc'])
-    for epoch in range(num_epochs):
-        train_metrics = train_epoch_ch3(net, train_iter, loss, updater)
-        test_acc = evaluate_accuracy(net, test_iter)
-        animator.add(epoch + 1, train_metrics + (test_acc,))
-    train_loss, train_acc = train_metrics
-    assert train_loss < 0.5, train_loss
-    assert train_acc <= 1 and train_acc > 0.7, train_acc
-    assert test_acc <= 1 and test_acc > 0.7, test_acc
-
-class Updater():
-    """For updating parameters using minibatch stochastic gradient descent.
-
-    Defined in :numref:`sec_softmax_scratch`"""
-    def __init__(self, params, lr):
-        self.params = params
-        self.lr = lr
-
-    def __call__(self, batch_size, grads):
-        d2l.sgd(self.params, grads, self.lr, batch_size)
-
-def predict_ch3(net, test_iter, n=6):
-    """Predict labels (defined in Chapter 3).
-
-    Defined in :numref:`sec_softmax_scratch`"""
-    for X, y in test_iter:
-        break
-    trues = d2l.get_fashion_mnist_labels(y)
-    preds = d2l.get_fashion_mnist_labels(d2l.argmax(net(X), axis=1))
-    titles = [true +'\n' + pred for true, pred in zip(trues, preds)]
-    d2l.show_images(
-        d2l.reshape(X[0:n], (n, 28, 28)), 1, n, titles=titles[0:n])
+    def loss(self, y_hat, y):
+        raise NotImplementedError
 
-def evaluate_loss(net, data_iter, loss):
-    """Evaluate the loss of a model on the given dataset.
+    def forward(self, X):
+        assert hasattr(self, 'net'), 'Neural network is not defined'
+        return self.net(X)
 
-    Defined in :numref:`sec_model_selection`"""
-    metric = d2l.Accumulator(2)  # Sum of losses, no. of examples
-    for X, y in data_iter:
-        l = loss(net(X), y)
-        metric.add(d2l.reduce_sum(l), d2l.size(l))
-    return metric[0] / metric[1]
+    def call(self, X, *args, **kwargs):
+        if kwargs and "training" in kwargs:
+            self.training = kwargs['training']
+        return self.forward(X, *args)
+
+    def plot(self, key, value, train):
+        """Plot a point in animation."""
+        assert hasattr(self, 'trainer'), 'Trainer is not initialized'
+        self.board.xlabel = 'epoch'
+        if train:
+            x = self.trainer.train_batch_idx / \
+                self.trainer.num_train_batches
+            n = self.trainer.num_train_batches / \
+                self.plot_train_per_epoch
+        else:
+            x = self.trainer.epoch + 1
+            n = self.trainer.num_val_batches / \
+                self.plot_valid_per_epoch
+        self.board.draw(x, d2l.numpy(value), (
+            'train_' if train else 'val_') + key, every_n=int(n))
+    def training_step(self, batch):
+        l = self.loss(self(*batch[:-1]), batch[-1])
+        self.plot('loss', l, train=True)
+        return l
+
+    def validation_step(self, batch):
+        l = self.loss(self(*batch[:-1]), batch[-1])
+        self.plot('loss', l, train=False)
+
+    def configure_optimizers(self):
+        raise NotImplementedError
 
-DATA_HUB = dict()
-DATA_URL = 'http://d2l-data.s3-accelerate.amazonaws.com/'
+    def configure_optimizers(self):
+        """Defined in :numref:`sec_classification`"""
+        return tf.keras.optimizers.SGD(self.lr)
 
-def download(name, cache_dir=os.path.join('..', 'data')):
-    """Download a file inserted into DATA_HUB, return the local filename.
+class DataModule(d2l.HyperParameters):
+    """Defined in :numref:`sec_oo-design`"""
+    def __init__(self, root='../data'):
+        self.save_hyperparameters()
 
-    Defined in :numref:`sec_kaggle_house`"""
-    assert name in DATA_HUB, f"{name} does not exist in {DATA_HUB}."
-    url, sha1_hash = DATA_HUB[name]
-    os.makedirs(cache_dir, exist_ok=True)
-    fname = os.path.join(cache_dir, url.split('/')[-1])
-    if os.path.exists(fname):
-        sha1 = hashlib.sha1()
-        with open(fname, 'rb') as f:
-            while True:
-                data = f.read(1048576)
-                if not data:
-                    break
-                sha1.update(data)
-        if sha1.hexdigest() == sha1_hash:
-            return fname  # Hit cache
-    print(f'Downloading {fname} from {url}...')
-    r = requests.get(url, stream=True, verify=True)
-    with open(fname, 'wb') as f:
-        f.write(r.content)
-    return fname
+    def get_dataloader(self, train):
+        raise NotImplementedError
 
-def download_extract(name, folder=None):
-    """Download and extract a zip/tar file.
+    def train_dataloader(self):
+        return self.get_dataloader(train=True)
+
+    def val_dataloader(self):
+        return self.get_dataloader(train=False)
+
+    def get_tensorloader(self, tensors, train, indices=slice(0, None)):
+        """Defined in :numref:`sec_synthetic-regression-data`"""
+        tensors = tuple(a[indices] for a in tensors)
+        shuffle_buffer = tensors[0].shape[0] if train else 1
+        return tf.data.Dataset.from_tensor_slices(tensors).shuffle(
+            buffer_size=shuffle_buffer).batch(self.batch_size)
+    
+
+class Trainer(d2l.HyperParameters):
+    """Defined in :numref:`sec_oo-design`"""
+    def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
+        self.save_hyperparameters()
+        assert num_gpus == 0, 'No GPU support yet'
+
+    def prepare_data(self, data):
+        self.train_dataloader = data.train_dataloader()
+        self.val_dataloader = data.val_dataloader()
+        self.num_train_batches = len(self.train_dataloader)
+        self.num_val_batches = (len(self.val_dataloader)
+                                if self.val_dataloader is not None else 0)
+
+    def prepare_model(self, model):
+        model.trainer = self
+        model.board.xlim = [0, self.max_epochs]
+        self.model = model
+
+    def fit(self, model, data):
+        self.prepare_data(data)
+        self.prepare_model(model)
+        self.optim = model.configure_optimizers()
+        self.epoch = 0
+        self.train_batch_idx = 0
+        self.val_batch_idx = 0
+        for self.epoch in range(self.max_epochs):
+            self.fit_epoch()
+
+    def fit_epoch(self):
+        raise NotImplementedError
 
-    Defined in :numref:`sec_kaggle_house`"""
-    fname = download(name)
-    base_dir = os.path.dirname(fname)
-    data_dir, ext = os.path.splitext(fname)
-    if ext == '.zip':
-        fp = zipfile.ZipFile(fname, 'r')
-    elif ext in ('.tar', '.gz'):
-        fp = tarfile.open(fname, 'r')
-    else:
-        assert False, 'Only zip/tar files can be extracted.'
-    fp.extractall(base_dir)
-    return os.path.join(base_dir, folder) if folder else data_dir
+    def prepare_batch(self, batch):
+        """Defined in :numref:`sec_linear_scratch`"""
+        return batch
+
+    def fit_epoch(self):
+        """Defined in :numref:`sec_linear_scratch`"""
+        self.model.training = True
+        for batch in self.train_dataloader:
+            with tf.GradientTape() as tape:
+                loss = self.model.training_step(self.prepare_batch(batch))
+            grads = tape.gradient(loss, self.model.trainable_variables)
+            if self.gradient_clip_val > 0:
+                grads = self.clip_gradients(self.gradient_clip_val, grads)
+            self.optim.apply_gradients(zip(grads, self.model.trainable_variables))
+            self.train_batch_idx += 1
+        if self.val_dataloader is None:
+            return
+        self.model.training = False
+        for batch in self.val_dataloader:
+            self.model.validation_step(self.prepare_batch(batch))
+            self.val_batch_idx += 1
+
+class SyntheticRegressionData(d2l.DataModule):
+    """Defined in :numref:`sec_synthetic-regression-data`"""
+    def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000,
+                 batch_size=32):
+        super().__init__()
+        self.save_hyperparameters()
+        n = num_train + num_val
+        self.X = tf.random.normal((n, w.shape[0]))
+        noise = tf.random.normal((n, 1)) * noise
+        self.y = d2l.matmul(self.X, d2l.reshape(w, (-1, 1))) + b + noise
+
+    def get_dataloader(self, train):
+        """Defined in :numref:`sec_synthetic-regression-data`"""
+        i = slice(0, self.num_train) if train else slice(self.num_train, None)
+        return self.get_tensorloader((self.X, self.y), train, i)
+
+class LinearRegressionScratch(d2l.Module):
+    """Defined in :numref:`sec_linear_scratch`"""
+    def __init__(self, num_inputs, lr, sigma=0.01):
+        super().__init__()
+        self.save_hyperparameters()
+        w = tf.random.normal((num_inputs, 1), mean=0, stddev=sigma)
+        b = tf.zeros(1)
+        self.w = tf.Variable(w, trainable=True)
+        self.b = tf.Variable(b, trainable=True)
+
+    def forward(self, X):
+        """The linear regression model.
+    
+        Defined in :numref:`sec_linear_scratch`"""
+        return d2l.matmul(X, self.w) + self.b
+
+    def loss(self, y_hat, y):
+        """Defined in :numref:`sec_linear_scratch`"""
+        l = (y_hat - d2l.reshape(y, y_hat.shape)) ** 2 / 2
+        return d2l.reduce_mean(l)
+
+    def configure_optimizers(self):
+        """Defined in :numref:`sec_linear_scratch`"""
+        return SGD(self.lr)
+
+class SGD(d2l.HyperParameters):
+    """Defined in :numref:`sec_linear_scratch`"""
+    def __init__(self, lr):
+        """Minibatch stochastic gradient descent."""
+        self.save_hyperparameters()
+
+    def apply_gradients(self, grads_and_vars):
+        for grad, param in grads_and_vars:
+            param.assign_sub(self.lr * grad)
+
+class LinearRegression(d2l.Module):
+    """Defined in :numref:`sec_linear_concise`"""
+    def __init__(self, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        initializer = tf.initializers.RandomNormal(stddev=0.01)
+        self.net = tf.keras.layers.Dense(1, kernel_initializer=initializer)
+
+    def forward(self, X):
+        """The linear regression model.
+    
+        Defined in :numref:`sec_linear_concise`"""
+        return self.net(X)
+
+    def loss(self, y_hat, y):
+        """Defined in :numref:`sec_linear_concise`"""
+        fn = tf.keras.losses.MeanSquaredError()
+        return fn(y, y_hat)
+
+    def configure_optimizers(self):
+        """Defined in :numref:`sec_linear_concise`"""
+        return tf.keras.optimizers.SGD(self.lr)
+
+    def get_w_b(self):
+        """Defined in :numref:`sec_linear_concise`"""
+        return (self.get_weights()[0], self.get_weights()[1])
+
+class FashionMNIST(d2l.DataModule):
+    """Defined in :numref:`sec_fashion_mnist`"""
+    def __init__(self, batch_size=64, resize=(28, 28)):
+        super().__init__()
+        self.save_hyperparameters()
+        self.train, self.val = tf.keras.datasets.fashion_mnist.load_data()
+
+    def text_labels(self, indices):
+        """Return text labels.
+    
+        Defined in :numref:`sec_fashion_mnist`"""
+        labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
+                  'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
+        return [labels[int(i)] for i in indices]
+
+    def get_dataloader(self, train):
+        """Defined in :numref:`sec_fashion_mnist`"""
+        data = self.train if train else self.val
+        process = lambda X, y: (tf.expand_dims(X, axis=3) / 255,
+                                tf.cast(y, dtype='int32'))
+        resize_fn = lambda X, y: (tf.image.resize_with_pad(X, *self.resize), y)
+        shuffle_buf = len(data[0]) if train else 1
+        return tf.data.Dataset.from_tensor_slices(process(*data)).batch(
+            self.batch_size).map(resize_fn).shuffle(shuffle_buf)
+
+    def visualize(self, batch, nrows=1, ncols=8, labels=[]):
+        """Defined in :numref:`sec_fashion_mnist`"""
+        X, y = batch
+        if not labels:
+            labels = self.text_labels(y)
+        d2l.show_images(tf.squeeze(X), nrows, ncols, titles=labels)
 
-def download_all():
-    """Download all files in the DATA_HUB.
+def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):
+    """Plot a list of images.
 
-    Defined in :numref:`sec_kaggle_house`"""
-    for name in DATA_HUB:
-        download(name)
+    Defined in :numref:`sec_fashion_mnist`"""
+    raise NotImplementedError
+
+class Classifier(d2l.Module):
+    """Defined in :numref:`sec_classification`"""
+    def validation_step(self, batch):
+        Y_hat = self(*batch[:-1])
+        self.plot('loss', self.loss(Y_hat, batch[-1]), train=False)
+        self.plot('acc', self.accuracy(Y_hat, batch[-1]), train=False)
+
+    def accuracy(self, Y_hat, Y, averaged=True):
+        """Compute prediction accuracy (per-example if `averaged` is False).
+    
+        Defined in :numref:`sec_classification`"""
+        Y_hat = d2l.reshape(Y_hat, (-1, Y_hat.shape[-1]))
+        preds = d2l.astype(d2l.argmax(Y_hat, axis=1), Y.dtype)
+        compare = d2l.astype(preds == d2l.reshape(Y, -1), d2l.float32)
+        return d2l.reduce_mean(compare) if averaged else compare
+
+    def loss(self, Y_hat, Y, averaged=True):
+        """Defined in :numref:`sec_softmax_concise`"""
+        Y_hat = d2l.reshape(Y_hat, (-1, Y_hat.shape[-1]))
+        Y = d2l.reshape(Y, (-1,))
+        fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
+        return fn(Y, Y_hat)
+
+def cpu():
+    """Defined in :numref:`sec_use_gpu`"""
+    return tf.device('/CPU:0')
 
-DATA_HUB['kaggle_house_train'] = (
-    DATA_URL + 'kaggle_house_pred_train.csv',
-    '585e9cc93e70b39160e7921475f9bcd7d31219ce')
+def gpu(i=0):
+    """Defined in :numref:`sec_use_gpu`"""
+    return tf.device(f'/GPU:{i}')
 
-DATA_HUB['kaggle_house_test'] = (
-    DATA_URL + 'kaggle_house_pred_test.csv',
-    'fa19780a7b011d9b009e8bff8e99922a8ee2eb90')
+def num_gpus():
+    """Defined in :numref:`sec_use_gpu`"""
+    return len(tf.config.experimental.list_physical_devices('GPU'))
 
 def try_gpu(i=0):
     """Return gpu(i) if exists, otherwise return cpu().
 
     Defined in :numref:`sec_use_gpu`"""
-    if len(tf.config.experimental.list_physical_devices('GPU')) >= i + 1:
-        return tf.device(f'/GPU:{i}')
-    return tf.device('/CPU:0')
+    if num_gpus() >= i + 1:
+        return gpu(i)
+    return cpu()
 
 def try_all_gpus():
     """Return all available GPUs, or [cpu(),] if no GPU exists.
 
     Defined in :numref:`sec_use_gpu`"""
-    num_gpus = len(tf.config.experimental.list_physical_devices('GPU'))
-    devices = [tf.device(f'/GPU:{i}') for i in range(num_gpus)]
-    return devices if devices else [tf.device('/CPU:0')]
+    return [gpu(i) for i in range(num_gpus())] or [cpu()]
 
 def corr2d(X, K):
     """Compute 2D cross-correlation."""
@@ -1483,11 +1515,493 @@ def update_G(Z, net_D, net_G, loss, optimizer_G):
     return loss_G
 
 d2l.DATA_HUB['pokemon'] = (d2l.DATA_URL + 'pokemon.zip',
-                           'c065c0e2593b8b161a2d7873e42418bf6a21106c')# Alias defined in config.ini
+                           'c065c0e2593b8b161a2d7873e42418bf6a21106c')
+
+def load_array(data_arrays, batch_size, is_train=True):
+    """Construct a TensorFlow data iterator.
+
+    Defined in :numref:`sec_utils`"""
+    dataset = tf.data.Dataset.from_tensor_slices(data_arrays)
+    if is_train:
+        dataset = dataset.shuffle(buffer_size=1000)
+    dataset = dataset.batch(batch_size)
+    return dataset
+
+def synthetic_data(w, b, num_examples):
+    """Generate y = Xw + b + noise.
+
+    Defined in :numref:`sec_utils`"""
+    X = tf.zeros((num_examples, w.shape[0]))
+    X += tf.random.normal(shape=X.shape)
+    y = tf.matmul(X, tf.reshape(w, (-1, 1))) + b
+    y += tf.random.normal(shape=y.shape, stddev=0.01)
+    y = tf.reshape(y, (-1, 1))
+    return X, y
+
+
+def sgd(params, grads, lr, batch_size):
+    """Minibatch stochastic gradient descent.
+
+    Defined in :numref:`sec_utils`"""
+    for param, grad in zip(params, grads):
+        param.assign_sub(lr * grad / batch_size)
+
+def load_data_fashion_mnist(batch_size, resize=None):
+    """Download the Fashion-MNIST dataset and then load it into memory.
+
+    Defined in :numref:`sec_utils`"""
+    mnist_train, mnist_test = tf.keras.datasets.fashion_mnist.load_data()
+    # Divide all numbers by 255 so that all pixel values are between
+    # 0 and 1, add a channel dimension at the end, and cast labels to int32
+    process = lambda X, y: (tf.expand_dims(X, axis=3) / 255,
+                            tf.cast(y, dtype='int32'))
+    resize_fn = lambda X, y: (
+        tf.image.resize_with_pad(X, resize, resize) if resize else X, y)
+    return (
+        tf.data.Dataset.from_tensor_slices(process(*mnist_train)).batch(
+            batch_size).shuffle(len(mnist_train[0])).map(resize_fn),
+        tf.data.Dataset.from_tensor_slices(process(*mnist_test)).batch(
+            batch_size).map(resize_fn))
+
+class TrainCallback(tf.keras.callbacks.Callback):
+    """A callback to visualize the training progress.
+
+    Defined in :numref:`sec_utils`"""
+    def __init__(self, net, train_iter, test_iter, num_epochs, device_name):
+        self.timer = d2l.Timer()
+        self.animator = d2l.Animator(
+            xlabel='epoch', xlim=[1, num_epochs], legend=[
+                'train loss', 'train acc', 'test acc'])
+        self.net = net
+        self.train_iter = train_iter
+        self.test_iter = test_iter
+        self.num_epochs = num_epochs
+        self.device_name = device_name
+    def on_epoch_begin(self, epoch, logs=None):
+        self.timer.start()
+    def on_epoch_end(self, epoch, logs):
+        self.timer.stop()
+        test_acc = self.net.evaluate(
+            self.test_iter, verbose=0, return_dict=True)['accuracy']
+        metrics = (logs['loss'], logs['accuracy'], test_acc)
+        self.animator.add(epoch + 1, metrics)
+        if epoch == self.num_epochs - 1:
+            batch_size = next(iter(self.train_iter))[0].shape[0]
+            num_examples = batch_size * tf.data.experimental.cardinality(
+                self.train_iter).numpy()
+            print(f'loss {metrics[0]:.3f}, train acc {metrics[1]:.3f}, '
+                  f'test acc {metrics[2]:.3f}')
+            print(f'{num_examples / self.timer.avg():.1f} examples/sec on '
+                  f'{str(self.device_name)}')
+
+def train_ch6(net_fn, train_iter, test_iter, num_epochs, lr, device):
+    """Train a model with a GPU (defined in Chapter 6).
+
+    Defined in :numref:`sec_utils`"""
+    device_name = device._device_name
+    strategy = tf.distribute.OneDeviceStrategy(device_name)
+    with strategy.scope():
+        optimizer = tf.keras.optimizers.SGD(learning_rate=lr)
+        loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
+        net = net_fn()
+        net.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
+    callback = TrainCallback(net, train_iter, test_iter, num_epochs,
+                             device_name)
+    net.fit(train_iter, epochs=num_epochs, verbose=0, callbacks=[callback])
+    return net
+
+def evaluate_accuracy(net, data_iter):
+    """Compute the accuracy for a model on a dataset.
+
+    Defined in :numref:`sec_utils`"""
+    metric = Accumulator(2)  # No. of correct predictions, no. of predictions
+    for X, y in data_iter:
+        metric.add(accuracy(net(X), y), d2l.size(y))
+    return metric[0] / metric[1]
+
+def linreg(X, w, b):
+    """The linear regression model.
+
+    Defined in :numref:`sec_utils`"""
+    return d2l.matmul(X, w) + b
+
+def squared_loss(y_hat, y):
+    """Squared loss.
+
+    Defined in :numref:`sec_utils`"""
+    return (y_hat - d2l.reshape(y, y_hat.shape)) ** 2 / 2
+
+def get_fashion_mnist_labels(labels):
+    """Return text labels for the Fashion-MNIST dataset.
+
+    Defined in :numref:`sec_utils`"""
+    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
+                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
+    return [text_labels[int(i)] for i in labels]
+
+def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):
+    """Plot a list of images.
+
+    Defined in :numref:`sec_utils`"""
+    figsize = (num_cols * scale, num_rows * scale)
+    _, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize)
+    axes = axes.flatten()
+    for i, (ax, img) in enumerate(zip(axes, imgs)):
+        try:
+            img = d2l.numpy(img)
+        except Exception:
+            pass
+        ax.imshow(img)
+        ax.axes.get_xaxis().set_visible(False)
+        ax.axes.get_yaxis().set_visible(False)
+        if titles:
+            ax.set_title(titles[i])
+    return axes
+
+class Animator:
+    """For plotting data in animation."""
+    def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
+                 ylim=None, xscale='linear', yscale='linear',
+                 fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
+                 figsize=(3.5, 2.5)):
+        """Defined in :numref:`sec_utils`"""
+        # Incrementally plot multiple lines
+        if legend is None:
+            legend = []
+        d2l.use_svg_display()
+        self.fig, self.axes = d2l.plt.subplots(nrows, ncols, figsize=figsize)
+        if nrows * ncols == 1:
+            self.axes = [self.axes, ]
+        # Use a lambda function to capture arguments
+        self.config_axes = lambda: d2l.set_axes(
+            self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
+        self.X, self.Y, self.fmts = None, None, fmts
+
+    def add(self, x, y):
+        # Add multiple data points into the figure
+        if not hasattr(y, "__len__"):
+            y = [y]
+        n = len(y)
+        if not hasattr(x, "__len__"):
+            x = [x] * n
+        if not self.X:
+            self.X = [[] for _ in range(n)]
+        if not self.Y:
+            self.Y = [[] for _ in range(n)]
+        for i, (a, b) in enumerate(zip(x, y)):
+            if a is not None and b is not None:
+                self.X[i].append(a)
+                self.Y[i].append(b)
+        self.axes[0].cla()
+        for x, y, fmt in zip(self.X, self.Y, self.fmts):
+            self.axes[0].plot(x, y, fmt)
+        self.config_axes()
+        display.display(self.fig)
+        display.clear_output(wait=True)
+
+class Accumulator:
+    """For accumulating sums over `n` variables."""
+    def __init__(self, n):
+        """Defined in :numref:`sec_utils`"""
+        self.data = [0.0] * n
+
+    def add(self, *args):
+        self.data = [a + float(b) for a, b in zip(self.data, args)]
+
+    def reset(self):
+        self.data = [0.0] * len(self.data)
+
+    def __getitem__(self, idx):
+        return self.data[idx]
+
+
+def accuracy(y_hat, y):
+    """Compute the number of correct predictions.
+
+    Defined in :numref:`sec_utils`"""
+    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
+        y_hat = d2l.argmax(y_hat, axis=1)
+    cmp = d2l.astype(y_hat, y.dtype) == y
+    return float(d2l.reduce_sum(d2l.astype(cmp, y.dtype)))
+
+def download(url, folder='../data', sha1_hash=None):
+    """Download a file to folder and return the local filepath.
+
+    Defined in :numref:`sec_utils`"""
+    if not url.startswith('http'):
+        # For backward compatibility
+        url, sha1_hash = DATA_HUB[url]
+    os.makedirs(folder, exist_ok=True)
+    fname = os.path.join(folder, url.split('/')[-1])
+    # Check if hit cache
+    if os.path.exists(fname) and sha1_hash:
+        sha1 = hashlib.sha1()
+        with open(fname, 'rb') as f:
+            while True:
+                data = f.read(1048576)
+                if not data:
+                    break
+                sha1.update(data)
+        if sha1.hexdigest() == sha1_hash:
+            return fname
+    # Download
+    print(f'Downloading {fname} from {url}...')
+    r = requests.get(url, stream=True, verify=True)
+    with open(fname, 'wb') as f:
+        f.write(r.content)
+    return fname
+
+def extract(filename, folder=None):
+    """Extract a zip/tar file into folder.
+
+    Defined in :numref:`sec_utils`"""
+    base_dir = os.path.dirname(filename)
+    _, ext = os.path.splitext(filename)
+    assert ext in ('.zip', '.tar', '.gz'), 'Only zip/tar files are supported.'
+    if ext == '.zip':
+        fp = zipfile.ZipFile(filename, 'r')
+    else:
+        fp = tarfile.open(filename, 'r')
+    if folder is None:
+        folder = base_dir
+    fp.extractall(folder)
+
+def download_extract(name, folder=None):
+    """Download and extract a zip/tar file.
+
+    Defined in :numref:`sec_utils`"""
+    fname = download(name)
+    base_dir = os.path.dirname(fname)
+    data_dir, ext = os.path.splitext(fname)
+    if ext == '.zip':
+        fp = zipfile.ZipFile(fname, 'r')
+    elif ext in ('.tar', '.gz'):
+        fp = tarfile.open(fname, 'r')
+    else:
+        assert False, 'Only zip/tar files can be extracted.'
+    fp.extractall(base_dir)
+    return os.path.join(base_dir, folder) if folder else data_dir
+
+
+def tokenize(lines, token='word'):
+    """Split text lines into word or character tokens.
+
+    Defined in :numref:`sec_utils`"""
+    assert token in ('word', 'char'), 'Unknown token type: ' + token
+    return [line.split() if token == 'word' else list(line) for line in lines]
+
+def evaluate_loss(net, data_iter, loss):
+    """Evaluate the loss of a model on the given dataset.
+
+    Defined in :numref:`sec_utils`"""
+    metric = d2l.Accumulator(2)  # Sum of losses, no. of examples
+    for X, y in data_iter:
+        l = loss(net(X), y)
+        metric.add(d2l.reduce_sum(l), d2l.size(l))
+    return metric[0] / metric[1]
+
+def grad_clipping(grads, theta):
+    """Clip the gradient.
+
+    Defined in :numref:`sec_utils`"""
+    theta = tf.constant(theta, dtype=tf.float32)
+    new_grad = []
+    for grad in grads:
+        if isinstance(grad, tf.IndexedSlices):
+            new_grad.append(tf.convert_to_tensor(grad))
+        else:
+            new_grad.append(grad)
+    norm = tf.math.sqrt(sum((tf.reduce_sum(grad ** 2)).numpy()
+                        for grad in new_grad))
+    norm = tf.cast(norm, tf.float32)
+    if tf.greater(norm, theta):
+        for i, grad in enumerate(new_grad):
+            new_grad[i] = grad * theta / norm
+    else:
+        new_grad = new_grad
+    return new_grad
+
+d2l.DATA_HUB['fra-eng'] = (d2l.DATA_URL + 'fra-eng.zip',
+                           '94646ad1522d915e7b0f9296181140edcf86a4f5')
+
+def read_data_nmt():
+    """Load the English-French dataset.
+
+    Defined in :numref:`sec_utils`"""
+    data_dir = d2l.download_extract('fra-eng')
+    with open(os.path.join(data_dir, 'fra.txt'), 'r') as f:
+        return f.read()
+
+def preprocess_nmt(text):
+    """Preprocess the English-French dataset.
+
+    Defined in :numref:`sec_utils`"""
+    def no_space(char, prev_char):
+        return char in set(',.!?') and prev_char != ' '
+
+    # Replace non-breaking space with space, and convert uppercase letters to
+    # lowercase ones
+    text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()
+    # Insert space between words and punctuation marks
+    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
+           for i, char in enumerate(text)]
+    return ''.join(out)
+
+def tokenize_nmt(text, num_examples=None):
+    """Tokenize the English-French dataset.
+
+    Defined in :numref:`sec_utils`"""
+    source, target = [], []
+    for i, line in enumerate(text.split('\n')):
+        if num_examples and i > num_examples:
+            break
+        parts = line.split('\t')
+        if len(parts) == 2:
+            source.append(parts[0].split(' '))
+            target.append(parts[1].split(' '))
+    return source, target
+
+
+def truncate_pad(line, num_steps, padding_token):
+    """Truncate or pad sequences.
+
+    Defined in :numref:`sec_utils`"""
+    if len(line) > num_steps:
+        return line[:num_steps]  # Truncate
+    return line + [padding_token] * (num_steps - len(line))  # Pad
+
+
+def build_array_nmt(lines, vocab, num_steps):
+    """Transform text sequences of machine translation into minibatches.
+
+    Defined in :numref:`sec_utils`"""
+    lines = [vocab[l] for l in lines]
+    lines = [l + [vocab['<eos>']] for l in lines]
+    array = d2l.tensor([truncate_pad(
+        l, num_steps, vocab['<pad>']) for l in lines])
+    valid_len = d2l.reduce_sum(
+        d2l.astype(array != vocab['<pad>'], d2l.int32), 1)
+    return array, valid_len
+
+
+def load_data_nmt(batch_size, num_steps, num_examples=600):
+    """Return the iterator and the vocabularies of the translation dataset.
+
+    Defined in :numref:`sec_utils`"""
+    text = preprocess_nmt(read_data_nmt())
+    source, target = tokenize_nmt(text, num_examples)
+    src_vocab = d2l.Vocab(source, min_freq=2,
+                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
+    tgt_vocab = d2l.Vocab(target, min_freq=2,
+                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
+    src_array, src_valid_len = build_array_nmt(source, src_vocab, num_steps)
+    tgt_array, tgt_valid_len = build_array_nmt(target, tgt_vocab, num_steps)
+    data_arrays = (src_array, src_valid_len, tgt_array, tgt_valid_len)
+    data_iter = d2l.load_array(data_arrays, batch_size)
+    return data_iter, src_vocab, tgt_vocab
+
+def sequence_mask(X, valid_len, value=0):
+    """Mask irrelevant entries in sequences.
+
+    Defined in :numref:`sec_utils`"""
+    maxlen = X.shape[1]
+    mask = tf.range(start=0, limit=maxlen, dtype=tf.float32)[
+        None, :] < tf.cast(valid_len[:, None], dtype=tf.float32)
+
+    if len(X.shape) == 3:
+        return tf.where(tf.expand_dims(mask, axis=-1), X, value)
+    else:
+        return tf.where(mask, X, value)
+
+
+class MaskedSoftmaxCELoss(tf.keras.losses.Loss):
+    """The softmax cross-entropy loss with masks.
+
+    Defined in :numref:`sec_utils`"""
+    def __init__(self, valid_len):
+        super().__init__(reduction='none')
+        self.valid_len = valid_len
+
+    # `pred` shape: (`batch_size`, `num_steps`, `vocab_size`)
+    # `label` shape: (`batch_size`, `num_steps`)
+    # `valid_len` shape: (`batch_size`,)
+    def call(self, label, pred):
+        weights = tf.ones_like(label, dtype=tf.float32)
+        weights = sequence_mask(weights, self.valid_len)
+        label_one_hot = tf.one_hot(label, depth=pred.shape[-1])
+        unweighted_loss = tf.keras.losses.CategoricalCrossentropy(
+            from_logits=True, reduction='none')(label_one_hot, pred)
+        weighted_loss = tf.reduce_mean((unweighted_loss*weights), axis=1)
+        return weighted_loss
+
+def train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device):
+    """Train a model for sequence to sequence.
+
+    Defined in :numref:`sec_utils`"""
+    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
+    animator = d2l.Animator(xlabel="epoch", ylabel="loss",
+                            xlim=[10, num_epochs])
+    for epoch in range(num_epochs):
+        timer = d2l.Timer()
+        metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
+        for batch in data_iter:
+            X, X_valid_len, Y, Y_valid_len = [x for x in batch]
+            bos = tf.reshape(tf.constant([tgt_vocab['<bos>']] * Y.shape[0]),
+                             shape=(-1, 1))
+            dec_input = tf.concat([bos, Y[:, :-1]], 1)  # Teacher forcing
+            with tf.GradientTape() as tape:
+                Y_hat, _ = net(X, dec_input, X_valid_len, training=True)
+                l = MaskedSoftmaxCELoss(Y_valid_len)(Y, Y_hat)
+            gradients = tape.gradient(l, net.trainable_variables)
+            gradients = d2l.grad_clipping(gradients, 1)
+            optimizer.apply_gradients(zip(gradients, net.trainable_variables))
+            num_tokens = tf.reduce_sum(Y_valid_len).numpy()
+            metric.add(tf.reduce_sum(l), num_tokens)
+        if (epoch + 1) % 10 == 0:
+            animator.add(epoch + 1, (metric[0] / metric[1],))
+    print(f'loss {metric[0] / metric[1]:.3f}, {metric[1] / timer.stop():.1f} '
+          f'tokens/sec on {str(device._device_name)}')
+
+def predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps,
+                    save_attention_weights=False):
+    """Predict for sequence to sequence.
+
+    Defined in :numref:`sec_utils`"""
+    src_tokens = src_vocab[src_sentence.lower().split(' ')] + [
+        src_vocab['<eos>']]
+    enc_valid_len = tf.constant([len(src_tokens)])
+    src_tokens = d2l.truncate_pad(src_tokens, num_steps, src_vocab['<pad>'])
+    # Add the batch axis
+    enc_X = tf.expand_dims(src_tokens, axis=0)
+    enc_outputs = net.encoder(enc_X, enc_valid_len, training=False)
+    dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)
+    # Add the batch axis
+    dec_X = tf.expand_dims(tf.constant([tgt_vocab['<bos>']]), axis=0)
+    output_seq, attention_weight_seq = [], []
+    for _ in range(num_steps):
+        Y, dec_state = net.decoder(dec_X, dec_state, training=False)
+        # We use the token with the highest prediction likelihood as input
+        # of the decoder at the next time step
+        dec_X = tf.argmax(Y, axis=2)
+        pred = tf.squeeze(dec_X, axis=0)
+        # Save attention weights
+        if save_attention_weights:
+            attention_weight_seq.append(net.decoder.attention_weights)
+        # Once the end-of-sequence token is predicted, the generation of the
+        # output sequence is complete
+        if pred == tgt_vocab['<eos>']:
+            break
+        output_seq.append(pred.numpy())
+    return ' '.join(tgt_vocab.to_tokens(tf.reshape(output_seq, shape=-1).numpy().tolist())), attention_weight_seq
+
+
+# Alias defined in config.ini
 size = lambda a: tf.size(a).numpy()
 
 reshape = tf.reshape
+ones_like = tf.ones_like
 ones = tf.ones
+zeros_like = tf.zeros_like
 zeros = tf.zeros
 meshgrid = tf.meshgrid
 sin = tf.sin
@@ -1501,16 +2015,23 @@ def update_G(Z, net_D, net_G, loss, optimizer_G):
 rand = tf.random.uniform
 matmul = tf.matmul
 reduce_sum = tf.reduce_sum
+reduce_mean = tf.reduce_mean
 argmax = tf.argmax
 tensor = tf.constant
 arange = tf.range
 astype = tf.cast
 int32 = tf.int32
+int64 = tf.int64
 float32 = tf.float32
 transpose = tf.transpose
 concat = tf.concat
 stack = tf.stack
 abs = tf.abs
 eye = tf.eye
+log = tf.math.log
+sigmoid = tf.sigmoid
+expand_dims = tf.expand_dims
+repeat = tf.repeat
+batch_matmul = tf.matmul
 numpy = lambda x, *args, **kwargs: x.numpy(*args, **kwargs)
 
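The sequence helpers added above (`truncate_pad`, `sequence_mask`, `grad_clipping`) can be illustrated without TensorFlow. The following is a minimal pure-Python sketch of the same ideas, not the library's implementation: `clip_by_global_norm` and the `PAD` id are hypothetical names chosen for this example, and nested lists stand in for tensors.

```python
import math


def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad a token-id list to exactly `num_steps` entries."""
    if len(line) > num_steps:
        return line[:num_steps]  # Truncate
    return line + [padding_token] * (num_steps - len(line))  # Pad


def sequence_mask(rows, valid_lens, value=0):
    """Replace entries at positions >= the row's valid length with `value`."""
    return [[x if j < n else value for j, x in enumerate(row)]
            for row, n in zip(rows, valid_lens)]


def clip_by_global_norm(grads, theta):
    """Rescale all gradients so that their joint L2 norm is at most `theta`.

    Hypothetical stand-in for `grad_clipping`; `grads` is a list of flat lists.
    """
    norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if norm > theta:
        return [[g * theta / norm for g in grad] for grad in grads]
    return grads


PAD = 0  # Hypothetical padding id, assumed not to occur as a real token
batch = [truncate_pad(seq, 4, PAD) for seq in ([5, 6], [7, 8, 9, 10, 11])]
# Valid length = number of non-padding tokens, as in `build_array_nmt`
valid_lens = [sum(tok != PAD for tok in row) for row in batch]
# Loss weights: 1 for real tokens, 0 for padding
weights = sequence_mask([[1.0] * 4 for _ in batch], valid_lens)
clipped = clip_by_global_norm([[3.0], [4.0]], theta=1.0)
```

Here the first sequence is padded to `[5, 6, 0, 0]` with valid length 2, the second is truncated to four tokens, and the two gradients (joint norm 5) are scaled down to a joint norm of 1. The weights play the role that `MaskedSoftmaxCELoss` gives them: padded positions contribute nothing to the loss.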
diff --git a/d2l/torch.py b/d2l/torch.py
index ae5f9df..eb7a89d 100644
--- a/d2l/torch.py
+++ b/d2l/torch.py
@@ -1,3 +1,17 @@
+DATA_HUB = dict()
+DATA_URL = 'http://d2l-data.s3-accelerate.amazonaws.com/'
+
+import numpy as np
+import torch
+import torchvision
+from PIL import Image
+from torch import nn
+from torch.nn import functional as F
+from torch.utils import data
+from torchvision import transforms
+
+nn_Module = nn.Module
+
 #################   WARNING   ################
 # The below part is generated automatically through:
 #    d2lbook build lib
@@ -5,6 +19,7 @@
 
 import collections
 import hashlib
+import inspect
 import math
 import os
 import random
@@ -19,6 +34,7 @@
 import requests
 from IPython import display
 from matplotlib import pyplot as plt
+from matplotlib_inline import backend_inline
 
 d2l = sys.modules[__name__]
 
@@ -35,7 +51,7 @@ def use_svg_display():
     """Use the svg format to display a plot in Jupyter.
 
     Defined in :numref:`sec_calculus`"""
-    display.set_matplotlib_formats('svg')
+    backend_inline.set_matplotlib_formats('svg')
 
 def set_figsize(figsize=(3.5, 2.5)):
     """Set the figure size for matplotlib.
@@ -48,381 +64,431 @@ def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):
     """Set the axes for matplotlib.
 
     Defined in :numref:`sec_calculus`"""
-    axes.set_xlabel(xlabel)
-    axes.set_ylabel(ylabel)
-    axes.set_xscale(xscale)
-    axes.set_yscale(yscale)
-    axes.set_xlim(xlim)
-    axes.set_ylim(ylim)
+    axes.set_xlabel(xlabel), axes.set_ylabel(ylabel)
+    axes.set_xscale(xscale), axes.set_yscale(yscale)
+    axes.set_xlim(xlim),     axes.set_ylim(ylim)
     if legend:
         axes.legend(legend)
     axes.grid()
 
-def plot(X, Y=None, xlabel=None, ylabel=None, legend=None, xlim=None,
+def plot(X, Y=None, xlabel=None, ylabel=None, legend=[], xlim=None,
          ylim=None, xscale='linear', yscale='linear',
          fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5), axes=None):
     """Plot data points.
 
     Defined in :numref:`sec_calculus`"""
-    if legend is None:
-        legend = []
-
-    set_figsize(figsize)
-    axes = axes if axes else d2l.plt.gca()
 
-    # Return True if `X` (tensor or list) has 1 axis
-    def has_one_axis(X):
+    def has_one_axis(X):  # True if `X` (tensor or list) has 1 axis
         return (hasattr(X, "ndim") and X.ndim == 1 or isinstance(X, list)
                 and not hasattr(X[0], "__len__"))
 
-    if has_one_axis(X):
-        X = [X]
+    if has_one_axis(X): X = [X]
     if Y is None:
         X, Y = [[]] * len(X), X
     elif has_one_axis(Y):
         Y = [Y]
     if len(X) != len(Y):
         X = X * len(Y)
+
+    set_figsize(figsize)
+    if axes is None: axes = d2l.plt.gca()
     axes.cla()
     for x, y, fmt in zip(X, Y, fmts):
-        if len(x):
-            axes.plot(x, y, fmt)
-        else:
-            axes.plot(y, fmt)
+        axes.plot(x, y, fmt) if len(x) else axes.plot(y, fmt)
     set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
 
-class Timer:
-    """Record multiple running times."""
-    def __init__(self):
-        """Defined in :numref:`subsec_linear_model`"""
-        self.times = []
-        self.start()
-
-    def start(self):
-        """Start the timer."""
-        self.tik = time.time()
-
-    def stop(self):
-        """Stop the timer and record the time in a list."""
-        self.times.append(time.time() - self.tik)
-        return self.times[-1]
-
-    def avg(self):
-        """Return the average time."""
-        return sum(self.times) / len(self.times)
-
-    def sum(self):
-        """Return the sum of time."""
-        return sum(self.times)
-
-    def cumsum(self):
-        """Return the accumulated time."""
-        return np.array(self.times).cumsum().tolist()
-
-def synthetic_data(w, b, num_examples):
-    """Generate y = Xw + b + noise.
-
-    Defined in :numref:`sec_linear_scratch`"""
-    X = d2l.normal(0, 1, (num_examples, len(w)))
-    y = d2l.matmul(X, w) + b
-    y += d2l.normal(0, 0.01, y.shape)
-    return X, d2l.reshape(y, (-1, 1))
-
-def linreg(X, w, b):
-    """The linear regression model.
-
-    Defined in :numref:`sec_linear_scratch`"""
-    return d2l.matmul(X, w) + b
-
-def squared_loss(y_hat, y):
-    """Squared loss.
-
-    Defined in :numref:`sec_linear_scratch`"""
-    return (y_hat - d2l.reshape(y, y_hat.shape)) ** 2 / 2
-
-def sgd(params, lr, batch_size):
-    """Minibatch stochastic gradient descent.
-
-    Defined in :numref:`sec_linear_scratch`"""
-    with torch.no_grad():
-        for param in params:
-            param -= lr * param.grad / batch_size
-            param.grad.zero_()
-
-def load_array(data_arrays, batch_size, is_train=True):
-    """Construct a PyTorch data iterator.
-
-    Defined in :numref:`sec_linear_concise`"""
-    dataset = data.TensorDataset(*data_arrays)
-    return data.DataLoader(dataset, batch_size, shuffle=is_train)
-
-def get_fashion_mnist_labels(labels):
-    """Return text labels for the Fashion-MNIST dataset.
-
-    Defined in :numref:`sec_fashion_mnist`"""
-    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
-                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
-    return [text_labels[int(i)] for i in labels]
-
-def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):
-    """Plot a list of images.
-
-    Defined in :numref:`sec_fashion_mnist`"""
-    figsize = (num_cols * scale, num_rows * scale)
-    _, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize)
-    axes = axes.flatten()
-    for i, (ax, img) in enumerate(zip(axes, imgs)):
-        if torch.is_tensor(img):
-            # Tensor Image
-            ax.imshow(img.numpy())
-        else:
-            # PIL Image
-            ax.imshow(img)
-        ax.axes.get_xaxis().set_visible(False)
-        ax.axes.get_yaxis().set_visible(False)
-        if titles:
-            ax.set_title(titles[i])
-    return axes
-
-def get_dataloader_workers():
-    """Use 4 processes to read the data.
-
-    Defined in :numref:`sec_fashion_mnist`"""
-    return 4
-
-def load_data_fashion_mnist(batch_size, resize=None):
-    """Download the Fashion-MNIST dataset and then load it into memory.
-
-    Defined in :numref:`sec_fashion_mnist`"""
-    trans = [transforms.ToTensor()]
-    if resize:
-        trans.insert(0, transforms.Resize(resize))
-    trans = transforms.Compose(trans)
-    mnist_train = torchvision.datasets.FashionMNIST(
-        root="../data", train=True, transform=trans, download=True)
-    mnist_test = torchvision.datasets.FashionMNIST(
-        root="../data", train=False, transform=trans, download=True)
-    return (data.DataLoader(mnist_train, batch_size, shuffle=True,
-                            num_workers=get_dataloader_workers()),
-            data.DataLoader(mnist_test, batch_size, shuffle=False,
-                            num_workers=get_dataloader_workers()))
-
-def accuracy(y_hat, y):
-    """Compute the number of correct predictions.
-
-    Defined in :numref:`sec_softmax_scratch`"""
-    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
-        y_hat = d2l.argmax(y_hat, axis=1)
-    cmp = d2l.astype(y_hat, y.dtype) == y
-    return float(d2l.reduce_sum(d2l.astype(cmp, y.dtype)))
-
-def evaluate_accuracy(net, data_iter):
-    """Compute the accuracy for a model on a dataset.
-
-    Defined in :numref:`sec_softmax_scratch`"""
-    if isinstance(net, torch.nn.Module):
-        net.eval()  # Set the model to evaluation mode
-    metric = Accumulator(2)  # No. of correct predictions, no. of predictions
-
-    with torch.no_grad():
-        for X, y in data_iter:
-            metric.add(accuracy(net(X), y), d2l.size(y))
-    return metric[0] / metric[1]
-
-class Accumulator:
-    """For accumulating sums over `n` variables."""
-    def __init__(self, n):
-        """Defined in :numref:`sec_softmax_scratch`"""
-        self.data = [0.0] * n
-
-    def add(self, *args):
-        self.data = [a + float(b) for a, b in zip(self.data, args)]
-
-    def reset(self):
-        self.data = [0.0] * len(self.data)
-
-    def __getitem__(self, idx):
-        return self.data[idx]
-
-def train_epoch_ch3(net, train_iter, loss, updater):
-    """The training loop defined in Chapter 3.
-
-    Defined in :numref:`sec_softmax_scratch`"""
-    # Set the model to training mode
-    if isinstance(net, torch.nn.Module):
-        net.train()
-    # Sum of training loss, sum of training accuracy, no. of examples
-    metric = Accumulator(3)
-    for X, y in train_iter:
-        # Compute gradients and update parameters
-        y_hat = net(X)
-        l = loss(y_hat, y)
-        if isinstance(updater, torch.optim.Optimizer):
-            # Using PyTorch in-built optimizer & loss criterion
-            updater.zero_grad()
-            l.sum().backward()
-            updater.step()
-        else:
-            # Using custom built optimizer & loss criterion
-            l.sum().backward()
-            updater(X.shape[0])
-        metric.add(float(l.sum()), accuracy(y_hat, y), y.numel())
-    # Return training loss and training accuracy
-    return metric[0] / metric[2], metric[1] / metric[2]
-
-class Animator:
-    """For plotting data in animation."""
-    def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
+def add_to_class(Class):
+    """Defined in :numref:`sec_oo-design`"""
+    def wrapper(obj):
+        setattr(Class, obj.__name__, obj)
+    return wrapper
+
+class HyperParameters:
+    def save_hyperparameters(self, ignore=[]):
+        """Defined in :numref:`sec_oo-design`"""
+        raise NotImplementedError
+
+    def save_hyperparameters(self, ignore=[]):
+        """Save function arguments into class attributes.
+    
+        Defined in :numref:`sec_utils`"""
+        frame = inspect.currentframe().f_back
+        _, _, _, local_vars = inspect.getargvalues(frame)
+        self.hparams = {k:v for k, v in local_vars.items()
+                        if k not in set(ignore+['self']) and not k.startswith('_')}
+        for k, v in self.hparams.items():
+            setattr(self, k, v)
+
+class ProgressBoard(d2l.HyperParameters):
+    """Plot data points in animation.
+
+    Defined in :numref:`sec_oo-design`"""
+    def __init__(self, xlabel=None, ylabel=None, xlim=None,
                  ylim=None, xscale='linear', yscale='linear',
-                 fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
-                 figsize=(3.5, 2.5)):
-        """Defined in :numref:`sec_softmax_scratch`"""
-        # Incrementally plot multiple lines
-        if legend is None:
-            legend = []
+                 ls=['-', '--', '-.', ':'], colors=['C0', 'C1', 'C2', 'C3'],
+                 fig=None, axes=None, figsize=(3.5, 2.5), display=True):
+        self.save_hyperparameters()
+
+    def draw(self, x, y, label, every_n=1):
+        raise NotImplementedError
+
+    def draw(self, x, y, label, every_n=1):
+        """Defined in :numref:`sec_utils`"""
+        Point = collections.namedtuple('Point', ['x', 'y'])
+        if not hasattr(self, 'raw_points'):
+            self.raw_points = collections.OrderedDict()
+            self.data = collections.OrderedDict()
+        if label not in self.raw_points:
+            self.raw_points[label] = []
+            self.data[label] = []
+        points = self.raw_points[label]
+        line = self.data[label]
+        points.append(Point(x, y))
+        if len(points) != every_n:
+            return
+        mean = lambda x: sum(x) / len(x)
+        line.append(Point(mean([p.x for p in points]),
+                          mean([p.y for p in points])))
+        points.clear()
+        if not self.display:
+            return
         d2l.use_svg_display()
-        self.fig, self.axes = d2l.plt.subplots(nrows, ncols, figsize=figsize)
-        if nrows * ncols == 1:
-            self.axes = [self.axes, ]
-        # Use a lambda function to capture arguments
-        self.config_axes = lambda: d2l.set_axes(
-            self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
-        self.X, self.Y, self.fmts = None, None, fmts
-
-    def add(self, x, y):
-        # Add multiple data points into the figure
-        if not hasattr(y, "__len__"):
-            y = [y]
-        n = len(y)
-        if not hasattr(x, "__len__"):
-            x = [x] * n
-        if not self.X:
-            self.X = [[] for _ in range(n)]
-        if not self.Y:
-            self.Y = [[] for _ in range(n)]
-        for i, (a, b) in enumerate(zip(x, y)):
-            if a is not None and b is not None:
-                self.X[i].append(a)
-                self.Y[i].append(b)
-        self.axes[0].cla()
-        for x, y, fmt in zip(self.X, self.Y, self.fmts):
-            self.axes[0].plot(x, y, fmt)
-        self.config_axes()
+        if self.fig is None:
+            self.fig = d2l.plt.figure(figsize=self.figsize)
+        plt_lines, labels = [], []
+        for (k, v), ls, color in zip(self.data.items(), self.ls, self.colors):
+            plt_lines.append(d2l.plt.plot([p.x for p in v], [p.y for p in v],
+                                          linestyle=ls, color=color)[0])
+            labels.append(k)
+        axes = self.axes if self.axes else d2l.plt.gca()
+        if self.xlim: axes.set_xlim(self.xlim)
+        if self.ylim: axes.set_ylim(self.ylim)
+        if not self.xlabel: self.xlabel = self.x
+        axes.set_xlabel(self.xlabel)
+        axes.set_ylabel(self.ylabel)
+        axes.set_xscale(self.xscale)
+        axes.set_yscale(self.yscale)
+        axes.legend(plt_lines, labels)
         display.display(self.fig)
         display.clear_output(wait=True)
 
-def train_ch3(net, train_iter, test_iter, loss, num_epochs, updater):
-    """Train a model (defined in Chapter 3).
+class Module(d2l.nn_Module, d2l.HyperParameters):
+    """Defined in :numref:`sec_oo-design`"""
+    def __init__(self, plot_train_per_epoch=2, plot_valid_per_epoch=1):
+        super().__init__()
+        self.save_hyperparameters()
+        self.board = ProgressBoard()
+    def loss(self, y_hat, y):
+        raise NotImplementedError
 
-    Defined in :numref:`sec_softmax_scratch`"""
-    animator = Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0.3, 0.9],
-                        legend=['train loss', 'train acc', 'test acc'])
-    for epoch in range(num_epochs):
-        train_metrics = train_epoch_ch3(net, train_iter, loss, updater)
-        test_acc = evaluate_accuracy(net, test_iter)
-        animator.add(epoch + 1, train_metrics + (test_acc,))
-    train_loss, train_acc = train_metrics
-    assert train_loss < 0.5, train_loss
-    assert train_acc <= 1 and train_acc > 0.7, train_acc
-    assert test_acc <= 1 and test_acc > 0.7, test_acc
-
-def predict_ch3(net, test_iter, n=6):
-    """Predict labels (defined in Chapter 3).
-
-    Defined in :numref:`sec_softmax_scratch`"""
-    for X, y in test_iter:
-        break
-    trues = d2l.get_fashion_mnist_labels(y)
-    preds = d2l.get_fashion_mnist_labels(d2l.argmax(net(X), axis=1))
-    titles = [true +'\n' + pred for true, pred in zip(trues, preds)]
-    d2l.show_images(
-        d2l.reshape(X[0:n], (n, 28, 28)), 1, n, titles=titles[0:n])
+    def forward(self, X):
+        assert hasattr(self, 'net'), 'Neural network is not defined'
+        return self.net(X)
+
+    def plot(self, key, value, train):
+        """Plot a point in animation."""
+        assert hasattr(self, 'trainer'), 'Trainer is not initialized'
+        self.board.xlabel = 'epoch'
+        if train:
+            x = self.trainer.train_batch_idx / \
+                self.trainer.num_train_batches
+            n = self.trainer.num_train_batches / \
+                self.plot_train_per_epoch
+        else:
+            x = self.trainer.epoch + 1
+            n = self.trainer.num_val_batches / \
+                self.plot_valid_per_epoch
+        self.board.draw(x, d2l.numpy(d2l.to(value, d2l.cpu())),
+                        ('train_' if train else 'val_') + key,
+                        every_n=int(n))
+
+    def training_step(self, batch):
+        l = self.loss(self(*batch[:-1]), batch[-1])
+        self.plot('loss', l, train=True)
+        return l
+
+    def validation_step(self, batch):
+        l = self.loss(self(*batch[:-1]), batch[-1])
+        self.plot('loss', l, train=False)
+
+    def configure_optimizers(self):
+        raise NotImplementedError
 
-def evaluate_loss(net, data_iter, loss):
-    """Evaluate the loss of a model on the given dataset.
+    def configure_optimizers(self):
+        """Defined in :numref:`sec_classification`"""
+        return torch.optim.SGD(self.parameters(), lr=self.lr)
 
-    Defined in :numref:`sec_model_selection`"""
-    metric = d2l.Accumulator(2)  # Sum of losses, no. of examples
-    for X, y in data_iter:
-        out = net(X)
-        y = d2l.reshape(y, out.shape)
-        l = loss(out, y)
-        metric.add(d2l.reduce_sum(l), d2l.size(l))
-    return metric[0] / metric[1]
+    def apply_init(self, inputs, init=None):
+        """Defined in :numref:`sec_lazy_init`"""
+        self.forward(*inputs)
+        if init is not None:
+            self.net.apply(init)
 
-DATA_HUB = dict()
-DATA_URL = 'http://d2l-data.s3-accelerate.amazonaws.com/'
+class DataModule(d2l.HyperParameters):
+    """Defined in :numref:`sec_oo-design`"""
+    def __init__(self, root='../data', num_workers=4):
+        self.save_hyperparameters()
 
-def download(name, cache_dir=os.path.join('..', 'data')):
-    """Download a file inserted into DATA_HUB, return the local filename.
+    def get_dataloader(self, train):
+        raise NotImplementedError
 
-    Defined in :numref:`sec_kaggle_house`"""
-    assert name in DATA_HUB, f"{name} does not exist in {DATA_HUB}."
-    url, sha1_hash = DATA_HUB[name]
-    os.makedirs(cache_dir, exist_ok=True)
-    fname = os.path.join(cache_dir, url.split('/')[-1])
-    if os.path.exists(fname):
-        sha1 = hashlib.sha1()
-        with open(fname, 'rb') as f:
-            while True:
-                data = f.read(1048576)
-                if not data:
-                    break
-                sha1.update(data)
-        if sha1.hexdigest() == sha1_hash:
-            return fname  # Hit cache
-    print(f'Downloading {fname} from {url}...')
-    r = requests.get(url, stream=True, verify=True)
-    with open(fname, 'wb') as f:
-        f.write(r.content)
-    return fname
+    def train_dataloader(self):
+        return self.get_dataloader(train=True)
+
+    def val_dataloader(self):
+        return self.get_dataloader(train=False)
+
+    def get_tensorloader(self, tensors, train, indices=slice(0, None)):
+        """Defined in :numref:`sec_synthetic-regression-data`"""
+        tensors = tuple(a[indices] for a in tensors)
+        dataset = torch.utils.data.TensorDataset(*tensors)
+        return torch.utils.data.DataLoader(dataset, self.batch_size,
+                                           shuffle=train)
+
+class Trainer(d2l.HyperParameters):
+    """Defined in :numref:`sec_oo-design`"""
+    def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
+        self.save_hyperparameters()
+        assert num_gpus == 0, 'No GPU support yet'
+
+    def prepare_data(self, data):
+        self.train_dataloader = data.train_dataloader()
+        self.val_dataloader = data.val_dataloader()
+        self.num_train_batches = len(self.train_dataloader)
+        self.num_val_batches = (len(self.val_dataloader)
+                                if self.val_dataloader is not None else 0)
+
+    def prepare_model(self, model):
+        model.trainer = self
+        model.board.xlim = [0, self.max_epochs]
+        self.model = model
+
+    def fit(self, model, data):
+        self.prepare_data(data)
+        self.prepare_model(model)
+        self.optim = model.configure_optimizers()
+        self.epoch = 0
+        self.train_batch_idx = 0
+        self.val_batch_idx = 0
+        for self.epoch in range(self.max_epochs):
+            self.fit_epoch()
+
+    def fit_epoch(self):
+        raise NotImplementedError
 
-def download_extract(name, folder=None):
-    """Download and extract a zip/tar file.
+    def prepare_batch(self, batch):
+        """Defined in :numref:`sec_linear_scratch`"""
+        return batch
 
-    Defined in :numref:`sec_kaggle_house`"""
-    fname = download(name)
-    base_dir = os.path.dirname(fname)
-    data_dir, ext = os.path.splitext(fname)
-    if ext == '.zip':
-        fp = zipfile.ZipFile(fname, 'r')
-    elif ext in ('.tar', '.gz'):
-        fp = tarfile.open(fname, 'r')
-    else:
-        assert False, 'Only zip/tar files can be extracted.'
-    fp.extractall(base_dir)
-    return os.path.join(base_dir, folder) if folder else data_dir
+    def fit_epoch(self):
+        """Defined in :numref:`sec_linear_scratch`"""
+        self.model.train()
+        for batch in self.train_dataloader:
+            loss = self.model.training_step(self.prepare_batch(batch))
+            self.optim.zero_grad()
+            with torch.no_grad():
+                loss.backward()
+                if self.gradient_clip_val > 0:  # To be discussed later
+                    self.clip_gradients(self.gradient_clip_val, self.model)
+                self.optim.step()
+            self.train_batch_idx += 1
+        if self.val_dataloader is None:
+            return
+        self.model.eval()
+        for batch in self.val_dataloader:
+            with torch.no_grad():
+                self.model.validation_step(self.prepare_batch(batch))
+            self.val_batch_idx += 1
+
+    def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
+        """Defined in :numref:`sec_use_gpu`"""
+        self.save_hyperparameters()
+        self.gpus = [d2l.gpu(i) for i in range(min(num_gpus, d2l.num_gpus()))]
+    
+
+    def prepare_batch(self, batch):
+        """Defined in :numref:`sec_use_gpu`"""
+        if self.gpus:
+            batch = [d2l.to(a, self.gpus[0]) for a in batch]
+        return batch
+    
+
+    def prepare_model(self, model):
+        """Defined in :numref:`sec_use_gpu`"""
+        model.trainer = self
+        model.board.xlim = [0, self.max_epochs]
+        if self.gpus:
+            model.to(self.gpus[0])
+        self.model = model
+
+class SyntheticRegressionData(d2l.DataModule):
+    """Defined in :numref:`sec_synthetic-regression-data`"""
+    def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000,
+                 batch_size=32):
+        super().__init__()
+        self.save_hyperparameters()
+        n = num_train + num_val
+        self.X = d2l.randn(n, len(w))
+        noise = d2l.randn(n, 1) * noise
+        self.y = d2l.matmul(self.X, d2l.reshape(w, (-1, 1))) + b + noise
+
+    def get_dataloader(self, train):
+        """Defined in :numref:`sec_synthetic-regression-data`"""
+        i = slice(0, self.num_train) if train else slice(self.num_train, None)
+        return self.get_tensorloader((self.X, self.y), train, i)
+
+class LinearRegressionScratch(d2l.Module):
+    """Defined in :numref:`sec_linear_scratch`"""
+    def __init__(self, num_inputs, lr, sigma=0.01):
+        super().__init__()
+        self.save_hyperparameters()
+        self.w = d2l.normal(0, sigma, (num_inputs, 1), requires_grad=True)
+        self.b = d2l.zeros(1, requires_grad=True)
+
+    def forward(self, X):
+        """The linear regression model.
+    
+        Defined in :numref:`sec_linear_scratch`"""
+        return d2l.matmul(X, self.w) + self.b
+
+    def loss(self, y_hat, y):
+        """Defined in :numref:`sec_linear_scratch`"""
+        l = (y_hat - d2l.reshape(y, y_hat.shape)) ** 2 / 2
+        return d2l.reduce_mean(l)
+
+    def configure_optimizers(self):
+        """Defined in :numref:`sec_linear_scratch`"""
+        return SGD([self.w, self.b], self.lr)
+
+class SGD(d2l.HyperParameters):
+    """Defined in :numref:`sec_linear_scratch`"""
+    def __init__(self, params, lr):
+        """Minibatch stochastic gradient descent."""
+        self.save_hyperparameters()
+
+    def step(self):
+        for param in self.params:
+            param -= self.lr * param.grad
+
+    def zero_grad(self):
+        for param in self.params:
+            if param.grad is not None:
+                param.grad.zero_()
+
+class LinearRegression(d2l.Module):
+    """Defined in :numref:`sec_linear_concise`"""
+    def __init__(self, lr):
+        super().__init__()
+        self.save_hyperparameters()
+        self.net = nn.LazyLinear(1)
+        self.net.weight.data.normal_(0, 0.01)
+        self.net.bias.data.fill_(0)
 
-def download_all():
-    """Download all files in the DATA_HUB.
+    def forward(self, X):
+        """The linear regression model.
+    
+        Defined in :numref:`sec_linear_concise`"""
+        return self.net(X)
+
+    def loss(self, y_hat, y):
+        """Defined in :numref:`sec_linear_concise`"""
+        fn = nn.MSELoss()
+        return fn(y_hat, y)
+
+    def configure_optimizers(self):
+        """Defined in :numref:`sec_linear_concise`"""
+        return torch.optim.SGD(self.parameters(), self.lr)
+
+    def get_w_b(self):
+        """Defined in :numref:`sec_linear_concise`"""
+        return (self.net.weight.data, self.net.bias.data)
+
+class FashionMNIST(d2l.DataModule):
+    """Defined in :numref:`sec_fashion_mnist`"""
+    def __init__(self, batch_size=64, resize=(28, 28)):
+        super().__init__()
+        self.save_hyperparameters()
+        trans = transforms.Compose([transforms.Resize(resize),
+                                    transforms.ToTensor()])
+        self.train = torchvision.datasets.FashionMNIST(
+            root=self.root, train=True, transform=trans, download=True)
+        self.val = torchvision.datasets.FashionMNIST(
+            root=self.root, train=False, transform=trans, download=True)
+
+    def text_labels(self, indices):
+        """Return text labels.
+    
+        Defined in :numref:`sec_fashion_mnist`"""
+        labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
+                  'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
+        return [labels[int(i)] for i in indices]
+
+    def get_dataloader(self, train):
+        """Defined in :numref:`sec_fashion_mnist`"""
+        data = self.train if train else self.val
+        return torch.utils.data.DataLoader(data, self.batch_size, shuffle=train,
+                                           num_workers=self.num_workers)
+
+    def visualize(self, batch, nrows=1, ncols=8, labels=[]):
+        """Defined in :numref:`sec_fashion_mnist`"""
+        X, y = batch
+        if not labels:
+            labels = self.text_labels(y)
+        d2l.show_images(X.squeeze(1), nrows, ncols, titles=labels)
 
-    Defined in :numref:`sec_kaggle_house`"""
-    for name in DATA_HUB:
-        download(name)
+def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):
+    """Plot a list of images.
 
-DATA_HUB['kaggle_house_train'] = (
-    DATA_URL + 'kaggle_house_pred_train.csv',
-    '585e9cc93e70b39160e7921475f9bcd7d31219ce')
+    Defined in :numref:`sec_fashion_mnist`"""
+    raise NotImplementedError
+
+class Classifier(d2l.Module):
+    """Defined in :numref:`sec_classification`"""
+    def validation_step(self, batch):
+        Y_hat = self(*batch[:-1])
+        self.plot('loss', self.loss(Y_hat, batch[-1]), train=False)
+        self.plot('acc', self.accuracy(Y_hat, batch[-1]), train=False)
+
+    def accuracy(self, Y_hat, Y, averaged=True):
+        """Compute the accuracy, or per-example correctness if not averaged.
+    
+        Defined in :numref:`sec_classification`"""
+        Y_hat = d2l.reshape(Y_hat, (-1, Y_hat.shape[-1]))
+        preds = d2l.astype(d2l.argmax(Y_hat, axis=1), Y.dtype)
+        compare = d2l.astype(preds == d2l.reshape(Y, -1), d2l.float32)
+        return d2l.reduce_mean(compare) if averaged else compare
+
+    def loss(self, Y_hat, Y, averaged=True):
+        """Defined in :numref:`sec_softmax_concise`"""
+        Y_hat = d2l.reshape(Y_hat, (-1, Y_hat.shape[-1]))
+        Y = d2l.reshape(Y, (-1,))
+        return F.cross_entropy(
+            Y_hat, Y, reduction='mean' if averaged else 'none')
+
+def cpu():
+    """Defined in :numref:`sec_use_gpu`"""
+    return torch.device('cpu')
+def gpu(i=0):
+    """Defined in :numref:`sec_use_gpu`"""
+    return torch.device(f'cuda:{i}')
 
-DATA_HUB['kaggle_house_test'] = (
-    DATA_URL + 'kaggle_house_pred_test.csv',
-    'fa19780a7b011d9b009e8bff8e99922a8ee2eb90')
+def num_gpus():
+    """Defined in :numref:`sec_use_gpu`"""
+    return torch.cuda.device_count()
 
 def try_gpu(i=0):
     """Return gpu(i) if exists, otherwise return cpu().
 
     Defined in :numref:`sec_use_gpu`"""
-    if torch.cuda.device_count() >= i + 1:
-        return torch.device(f'cuda:{i}')
-    return torch.device('cpu')
+    if num_gpus() >= i + 1:
+        return gpu(i)
+    return cpu()
 
 def try_all_gpus():
     """Return all available GPUs, or [cpu(),] if no GPU exists.
 
     Defined in :numref:`sec_use_gpu`"""
-    devices = [torch.device(f'cuda:{i}')
-             for i in range(torch.cuda.device_count())]
-    return devices if devices else [torch.device('cpu')]
+    return [gpu(i) for i in range(num_gpus())] or [cpu()]
 
 def corr2d(X, K):
     """Compute 2D cross-correlation.
@@ -2679,10 +2745,507 @@ def update_G(Z, net_D, net_G, loss, trainer_G):
     return loss_G
 
 d2l.DATA_HUB['pokemon'] = (d2l.DATA_URL + 'pokemon.zip',
-                           'c065c0e2593b8b161a2d7873e42418bf6a21106c')# Alias defined in config.ini
+                           'c065c0e2593b8b161a2d7873e42418bf6a21106c')
+
+def load_array(data_arrays, batch_size, is_train=True):
+    """Construct a PyTorch data iterator.
+
+    Defined in :numref:`sec_utils`"""
+    dataset = data.TensorDataset(*data_arrays)
+    return data.DataLoader(dataset, batch_size, shuffle=is_train)
+
+def synthetic_data(w, b, num_examples):
+    """Generate y = Xw + b + noise.
+
+    Defined in :numref:`sec_utils`"""
+    X = d2l.normal(0, 1, (num_examples, len(w)))
+    y = d2l.matmul(X, w) + b
+    y += d2l.normal(0, 0.01, y.shape)
+    return X, d2l.reshape(y, (-1, 1))
+
+def sgd(params, lr, batch_size):
+    """Minibatch stochastic gradient descent.
+
+    Defined in :numref:`sec_utils`"""
+    with torch.no_grad():
+        for param in params:
+            param -= lr * param.grad / batch_size
+            param.grad.zero_()
+
+def get_dataloader_workers():
+    """Use 4 processes to read the data.
+
+    Defined in :numref:`sec_utils`"""
+    return 4
+
+def load_data_fashion_mnist(batch_size, resize=None):
+    """Download the Fashion-MNIST dataset and then load it into memory.
+
+    Defined in :numref:`sec_utils`"""
+    trans = [transforms.ToTensor()]
+    if resize:
+        trans.insert(0, transforms.Resize(resize))
+    trans = transforms.Compose(trans)
+    mnist_train = torchvision.datasets.FashionMNIST(
+        root="../data", train=True, transform=trans, download=True)
+    mnist_test = torchvision.datasets.FashionMNIST(
+        root="../data", train=False, transform=trans, download=True)
+    return (data.DataLoader(mnist_train, batch_size, shuffle=True,
+                            num_workers=get_dataloader_workers()),
+            data.DataLoader(mnist_test, batch_size, shuffle=False,
+                            num_workers=get_dataloader_workers()))
+
+def evaluate_accuracy_gpu(net, data_iter, device=None):
+    """Compute the accuracy for a model on a dataset using a GPU.
+
+    Defined in :numref:`sec_utils`"""
+    if isinstance(net, nn.Module):
+        net.eval()  # Set the model to evaluation mode
+        if not device:
+            device = next(iter(net.parameters())).device
+    # No. of correct predictions, no. of predictions
+    metric = d2l.Accumulator(2)
+
+    with torch.no_grad():
+        for X, y in data_iter:
+            if isinstance(X, list):
+                # Required for BERT Fine-tuning (to be covered later)
+                X = [x.to(device) for x in X]
+            else:
+                X = X.to(device)
+            y = y.to(device)
+            metric.add(d2l.accuracy(net(X), y), d2l.size(y))
+    return metric[0] / metric[1]
+
+
+def train_ch6(net, train_iter, test_iter, num_epochs, lr, device):
+    """Train a model with a GPU (defined in Chapter 6).
+
+    Defined in :numref:`sec_utils`"""
+    def init_weights(m):
+        if type(m) == nn.Linear or type(m) == nn.Conv2d:
+            nn.init.xavier_uniform_(m.weight)
+    net.apply(init_weights)
+    print('training on', device)
+    net.to(device)
+    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
+    loss = nn.CrossEntropyLoss()
+    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
+                            legend=['train loss', 'train acc', 'test acc'])
+    timer, num_batches = d2l.Timer(), len(train_iter)
+    for epoch in range(num_epochs):
+        # Sum of training loss, sum of training accuracy, no. of examples
+        metric = d2l.Accumulator(3)
+        net.train()
+        for i, (X, y) in enumerate(train_iter):
+            timer.start()
+            optimizer.zero_grad()
+            X, y = X.to(device), y.to(device)
+            y_hat = net(X)
+            l = loss(y_hat, y)
+            l.backward()
+            optimizer.step()
+            with torch.no_grad():
+                metric.add(l * X.shape[0], d2l.accuracy(y_hat, y), X.shape[0])
+            timer.stop()
+            train_l = metric[0] / metric[2]
+            train_acc = metric[1] / metric[2]
+            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
+                animator.add(epoch + (i + 1) / num_batches,
+                             (train_l, train_acc, None))
+        test_acc = evaluate_accuracy_gpu(net, test_iter)
+        animator.add(epoch + 1, (None, None, test_acc))
+    print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
+          f'test acc {test_acc:.3f}')
+    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
+          f'on {str(device)}')
+
+def linreg(X, w, b):
+    """The linear regression model.
+
+    Defined in :numref:`sec_utils`"""
+    return d2l.matmul(X, w) + b
+
+def squared_loss(y_hat, y):
+    """Squared loss.
+
+    Defined in :numref:`sec_utils`"""
+    return (y_hat - d2l.reshape(y, y_hat.shape)) ** 2 / 2
+
+def get_fashion_mnist_labels(labels):
+    """Return text labels for the Fashion-MNIST dataset.
+
+    Defined in :numref:`sec_utils`"""
+    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
+                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
+    return [text_labels[int(i)] for i in labels]
+
+def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):
+    """Plot a list of images.
+
+    Defined in :numref:`sec_utils`"""
+    figsize = (num_cols * scale, num_rows * scale)
+    _, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize)
+    axes = axes.flatten()
+    for i, (ax, img) in enumerate(zip(axes, imgs)):
+        try:
+            img = d2l.numpy(img)
+        except Exception:  # Not a tensor (e.g., a PIL image); plot as-is
+            pass
+        ax.imshow(img)
+        ax.axes.get_xaxis().set_visible(False)
+        ax.axes.get_yaxis().set_visible(False)
+        if titles:
+            ax.set_title(titles[i])
+    return axes
+
+class Animator:
+    """For plotting data in animation."""
+    def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
+                 ylim=None, xscale='linear', yscale='linear',
+                 fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
+                 figsize=(3.5, 2.5)):
+        """Defined in :numref:`sec_utils`"""
+        # Incrementally plot multiple lines
+        if legend is None:
+            legend = []
+        d2l.use_svg_display()
+        self.fig, self.axes = d2l.plt.subplots(nrows, ncols, figsize=figsize)
+        if nrows * ncols == 1:
+            self.axes = [self.axes, ]
+        # Use a lambda function to capture arguments
+        self.config_axes = lambda: d2l.set_axes(
+            self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
+        self.X, self.Y, self.fmts = None, None, fmts
+
+    def add(self, x, y):
+        # Add multiple data points into the figure
+        if not hasattr(y, "__len__"):
+            y = [y]
+        n = len(y)
+        if not hasattr(x, "__len__"):
+            x = [x] * n
+        if not self.X:
+            self.X = [[] for _ in range(n)]
+        if not self.Y:
+            self.Y = [[] for _ in range(n)]
+        for i, (a, b) in enumerate(zip(x, y)):
+            if a is not None and b is not None:
+                self.X[i].append(a)
+                self.Y[i].append(b)
+        self.axes[0].cla()
+        for x, y, fmt in zip(self.X, self.Y, self.fmts):
+            self.axes[0].plot(x, y, fmt)
+        self.config_axes()
+        display.display(self.fig)
+        display.clear_output(wait=True)
+
+class Accumulator:
+    """For accumulating sums over `n` variables."""
+    def __init__(self, n):
+        """Defined in :numref:`sec_utils`"""
+        self.data = [0.0] * n
+
+    def add(self, *args):
+        self.data = [a + float(b) for a, b in zip(self.data, args)]
+
+    def reset(self):
+        self.data = [0.0] * len(self.data)
+
+    def __getitem__(self, idx):
+        return self.data[idx]
+
+
+def accuracy(y_hat, y):
+    """Compute the number of correct predictions.
+
+    Defined in :numref:`sec_utils`"""
+    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
+        y_hat = d2l.argmax(y_hat, axis=1)
+    cmp = d2l.astype(y_hat, y.dtype) == y
+    return float(d2l.reduce_sum(d2l.astype(cmp, y.dtype)))
+
+def download(url, folder='../data', sha1_hash=None):
+    """Download a file to folder and return the local filepath.
+
+    Defined in :numref:`sec_utils`"""
+    if not url.startswith('http'):
+        # For backward compatibility
+        url, sha1_hash = DATA_HUB[url]
+    os.makedirs(folder, exist_ok=True)
+    fname = os.path.join(folder, url.split('/')[-1])
+    # Return the cached file if its SHA-1 hash matches
+    if os.path.exists(fname) and sha1_hash:
+        sha1 = hashlib.sha1()
+        with open(fname, 'rb') as f:
+            while True:
+                data = f.read(1048576)
+                if not data:
+                    break
+                sha1.update(data)
+        if sha1.hexdigest() == sha1_hash:
+            return fname
+    # Download
+    print(f'Downloading {fname} from {url}...')
+    r = requests.get(url, stream=True, verify=True)
+    with open(fname, 'wb') as f:
+        f.write(r.content)
+    return fname
+
+def extract(filename, folder=None):
+    """Extract a zip/tar file into folder.
+
+    Defined in :numref:`sec_utils`"""
+    base_dir = os.path.dirname(filename)
+    _, ext = os.path.splitext(filename)
+    assert ext in ('.zip', '.tar', '.gz'), 'Only zip/tar files are supported.'
+    if ext == '.zip':
+        fp = zipfile.ZipFile(filename, 'r')
+    else:
+        fp = tarfile.open(filename, 'r')
+    if folder is None:
+        folder = base_dir
+    fp.extractall(folder)
+
+def download_extract(name, folder=None):
+    """Download and extract a zip/tar file.
+
+    Defined in :numref:`sec_utils`"""
+    fname = download(name)
+    base_dir = os.path.dirname(fname)
+    data_dir, ext = os.path.splitext(fname)
+    if ext == '.zip':
+        fp = zipfile.ZipFile(fname, 'r')
+    elif ext in ('.tar', '.gz'):
+        fp = tarfile.open(fname, 'r')
+    else:
+        assert False, 'Only zip/tar files can be extracted.'
+    fp.extractall(base_dir)
+    return os.path.join(base_dir, folder) if folder else data_dir
+
+
+def tokenize(lines, token='word'):
+    """Split text lines into word or character tokens.
+
+    Defined in :numref:`sec_utils`"""
+    assert token in ('word', 'char'), 'Unknown token type: ' + token
+    return [line.split() if token == 'word' else list(line) for line in lines]
+
+def evaluate_loss(net, data_iter, loss):
+    """Evaluate the loss of a model on the given dataset.
+
+    Defined in :numref:`sec_utils`"""
+    metric = d2l.Accumulator(2)  # Sum of losses, no. of examples
+    for X, y in data_iter:
+        out = net(X)
+        y = d2l.reshape(y, out.shape)
+        l = loss(out, y)
+        metric.add(d2l.reduce_sum(l), d2l.size(l))
+    return metric[0] / metric[1]
+
+def grad_clipping(net, theta):
+    """Clip the gradient.
+
+    Defined in :numref:`sec_utils`"""
+    if isinstance(net, nn.Module):
+        params = [p for p in net.parameters() if p.requires_grad]
+    else:
+        params = net.params
+    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
+    if norm > theta:
+        for param in params:
+            param.grad[:] *= theta / norm
+
+d2l.DATA_HUB['fra-eng'] = (d2l.DATA_URL + 'fra-eng.zip',
+                           '94646ad1522d915e7b0f9296181140edcf86a4f5')
+
+def read_data_nmt():
+    """Load the English-French dataset.
+
+    Defined in :numref:`sec_utils`"""
+    data_dir = d2l.download_extract('fra-eng')
+    with open(os.path.join(data_dir, 'fra.txt'), 'r') as f:
+        return f.read()
+
+def preprocess_nmt(text):
+    """Preprocess the English-French dataset.
+
+    Defined in :numref:`sec_utils`"""
+    def no_space(char, prev_char):
+        return char in set(',.!?') and prev_char != ' '
+
+    # Replace non-breaking space with space, and convert uppercase letters to
+    # lowercase ones
+    text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()
+    # Insert space between words and punctuation marks
+    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
+           for i, char in enumerate(text)]
+    return ''.join(out)
+
+def tokenize_nmt(text, num_examples=None):
+    """Tokenize the English-French dataset.
+
+    Defined in :numref:`sec_utils`"""
+    source, target = [], []
+    for i, line in enumerate(text.split('\n')):
+        if num_examples and i > num_examples:
+            break
+        parts = line.split('\t')
+        if len(parts) == 2:
+            source.append(parts[0].split(' '))
+            target.append(parts[1].split(' '))
+    return source, target
+
+
+def truncate_pad(line, num_steps, padding_token):
+    """Truncate or pad sequences.
+
+    Defined in :numref:`sec_utils`"""
+    if len(line) > num_steps:
+        return line[:num_steps]  # Truncate
+    return line + [padding_token] * (num_steps - len(line))  # Pad
+
+
+def build_array_nmt(lines, vocab, num_steps):
+    """Transform text sequences of machine translation into minibatches.
+
+    Defined in :numref:`sec_utils`"""
+    lines = [vocab[l] for l in lines]
+    lines = [l + [vocab['<eos>']] for l in lines]
+    array = d2l.tensor([truncate_pad(
+        l, num_steps, vocab['<pad>']) for l in lines])
+    valid_len = d2l.reduce_sum(
+        d2l.astype(array != vocab['<pad>'], d2l.int32), 1)
+    return array, valid_len
+
+
+def load_data_nmt(batch_size, num_steps, num_examples=600):
+    """Return the iterator and the vocabularies of the translation dataset.
+
+    Defined in :numref:`sec_utils`"""
+    text = preprocess_nmt(read_data_nmt())
+    source, target = tokenize_nmt(text, num_examples)
+    src_vocab = d2l.Vocab(source, min_freq=2,
+                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
+    tgt_vocab = d2l.Vocab(target, min_freq=2,
+                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
+    src_array, src_valid_len = build_array_nmt(source, src_vocab, num_steps)
+    tgt_array, tgt_valid_len = build_array_nmt(target, tgt_vocab, num_steps)
+    data_arrays = (src_array, src_valid_len, tgt_array, tgt_valid_len)
+    data_iter = d2l.load_array(data_arrays, batch_size)
+    return data_iter, src_vocab, tgt_vocab
+
+def sequence_mask(X, valid_len, value=0):
+    """Mask irrelevant entries in sequences.
+
+    Defined in :numref:`sec_utils`"""
+    maxlen = X.size(1)
+    mask = torch.arange((maxlen), dtype=torch.float32,
+                        device=X.device)[None, :] < valid_len[:, None]
+    X[~mask] = value
+    return X
+
+
+class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
+    """The softmax cross-entropy loss with masks.
+
+    Defined in :numref:`sec_utils`"""
+    # `pred` shape: (`batch_size`, `num_steps`, `vocab_size`)
+    # `label` shape: (`batch_size`, `num_steps`)
+    # `valid_len` shape: (`batch_size`,)
+    def forward(self, pred, label, valid_len):
+        weights = torch.ones_like(label)
+        weights = sequence_mask(weights, valid_len)
+        self.reduction = 'none'
+        unweighted_loss = super(MaskedSoftmaxCELoss, self).forward(
+            pred.permute(0, 2, 1), label)
+        weighted_loss = (unweighted_loss * weights).mean(dim=1)
+        return weighted_loss
+
+def train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device):
+    """Train a model for sequence to sequence.
+
+    Defined in :numref:`sec_utils`"""
+    def xavier_init_weights(m):
+        if type(m) == nn.Linear:
+            nn.init.xavier_uniform_(m.weight)
+        if type(m) == nn.GRU:
+            for param in m._flat_weights_names:
+                if "weight" in param:
+                    nn.init.xavier_uniform_(m._parameters[param])
+    net.apply(xavier_init_weights)
+    net.to(device)
+    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
+    loss = MaskedSoftmaxCELoss()
+    net.train()
+    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
+                            xlim=[10, num_epochs])
+    for epoch in range(num_epochs):
+        timer = d2l.Timer()
+        metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
+        for batch in data_iter:
+            optimizer.zero_grad()
+            X, X_valid_len, Y, Y_valid_len = [x.to(device) for x in batch]
+            bos = torch.tensor([tgt_vocab['<bos>']] * Y.shape[0],
+                               device=device).reshape(-1, 1)
+            dec_input = d2l.concat([bos, Y[:, :-1]], 1)  # Teacher forcing
+            Y_hat, _ = net(X, dec_input, X_valid_len)
+            l = loss(Y_hat, Y, Y_valid_len)
+            l.sum().backward()  # Make the loss scalar for `backward`
+            d2l.grad_clipping(net, 1)
+            num_tokens = Y_valid_len.sum()
+            optimizer.step()
+            with torch.no_grad():
+                metric.add(l.sum(), num_tokens)
+        if (epoch + 1) % 10 == 0:
+            animator.add(epoch + 1, (metric[0] / metric[1],))
+    print(f'loss {metric[0] / metric[1]:.3f}, {metric[1] / timer.stop():.1f} '
+          f'tokens/sec on {str(device)}')
+
+
+def predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps,
+                    device, save_attention_weights=False):
+    """Predict for sequence to sequence.
+
+    Defined in :numref:`sec_utils`"""
+    # Set `net` to eval mode for inference
+    net.eval()
+    src_tokens = src_vocab[src_sentence.lower().split(' ')] + [
+        src_vocab['<eos>']]
+    enc_valid_len = torch.tensor([len(src_tokens)], device=device)
+    src_tokens = d2l.truncate_pad(src_tokens, num_steps, src_vocab['<pad>'])
+    # Add the batch axis
+    enc_X = torch.unsqueeze(
+        torch.tensor(src_tokens, dtype=torch.long, device=device), dim=0)
+    enc_outputs = net.encoder(enc_X, enc_valid_len)
+    dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)
+    # Add the batch axis
+    dec_X = torch.unsqueeze(torch.tensor(
+        [tgt_vocab['<bos>']], dtype=torch.long, device=device), dim=0)
+    output_seq, attention_weight_seq = [], []
+    for _ in range(num_steps):
+        Y, dec_state = net.decoder(dec_X, dec_state)
+        # We use the token with the highest prediction likelihood as input
+        # of the decoder at the next time step
+        dec_X = Y.argmax(dim=2)
+        pred = dec_X.squeeze(dim=0).type(torch.int32).item()
+        # Save attention weights (to be covered later)
+        if save_attention_weights:
+            attention_weight_seq.append(net.decoder.attention_weights)
+        # Once the end-of-sequence token is predicted, the generation of the
+        # output sequence is complete
+        if pred == tgt_vocab['<eos>']:
+            break
+        output_seq.append(pred)
+    return ' '.join(tgt_vocab.to_tokens(output_seq)), attention_weight_seq
+
 
+# Alias defined in config.ini
+nn_Module = nn.Module
 
+ones_like = torch.ones_like
 ones = torch.ones
+zeros_like = torch.zeros_like
 zeros = torch.zeros
 tensor = torch.tensor
 arange = torch.arange
@@ -2697,13 +3260,17 @@ def update_G(Z, net_D, net_G, loss, trainer_G):
 log = torch.log
 normal = torch.normal
 rand = torch.rand
+randn = torch.randn
 matmul = torch.matmul
 int32 = torch.int32
+int64 = torch.int64
 float32 = torch.float32
 concat = torch.cat
 stack = torch.stack
 abs = torch.abs
 eye = torch.eye
+sigmoid = torch.sigmoid
+batch_matmul = torch.bmm
 numpy = lambda x, *args, **kwargs: x.detach().numpy(*args, **kwargs)
 size = lambda x, *args, **kwargs: x.numel(*args, **kwargs)
 reshape = lambda x, *args, **kwargs: x.reshape(*args, **kwargs)
@@ -2712,4 +3279,8 @@ def update_G(Z, net_D, net_G, loss, trainer_G):
 argmax = lambda x, *args, **kwargs: x.argmax(*args, **kwargs)
 astype = lambda x, *args, **kwargs: x.type(*args, **kwargs)
 transpose = lambda x, *args, **kwargs: x.t(*args, **kwargs)
+reduce_mean = lambda x, *args, **kwargs: x.mean(*args, **kwargs)
+expand_dims = lambda x, *args, **kwargs: x.unsqueeze(*args, **kwargs)
+swapaxes = lambda x, *args, **kwargs: x.swapaxes(*args, **kwargs)
+repeat = lambda x, *args, **kwargs: x.repeat(*args, **kwargs)
 
diff --git a/index.md b/index.md
index fc5a3ac..126b0a7 100644
--- a/index.md
+++ b/index.md
@@ -22,9 +22,10 @@ chapter_notation/index
 
 chapter_introduction/index
 chapter_preliminaries/index
-chapter_linear-networks/index
+chapter_linear-regression/index
+chapter_linear-classification/index
 chapter_multilayer-perceptrons/index
-chapter_deep-learning-computation/index
+chapter_builders-guide/index
 chapter_convolutional-neural-networks/index
 chapter_convolutional-modern/index
 chapter_recurrent-neural-networks/index
diff --git a/setup.py b/setup.py
index c50e32c..2800eaa 100644
--- a/setup.py
+++ b/setup.py
@@ -2,11 +2,12 @@
 import d2l
 
 requirements = [
-    'jupyter==1.0.0',
-    'numpy==1.22.2',
-    'matplotlib==3.4',
-    'requests==2.25.1',
-    'pandas==1.2.4'
+    'jupyter',
+    'numpy',
+    'matplotlib',
+    'matplotlib-inline',
+    'requests',
+    'pandas'
 ]
 
 setup(
diff --git a/static/build.yml b/static/build.yml
index e969b1c..b5fc3af 100644
--- a/static/build.yml
+++ b/static/build.yml
@@ -1,13 +1,13 @@
 dependencies:
-  - python=3.8
+  - python=3.9
   - pip
   - pip:
     - ..  # d2l
     - git+https://github.com/d2l-ai/d2l-book
     - mxnet-cu102==1.7.0
-    - torch==1.10.2+cu102
+    - torch==1.12.0+cu102
     - -f https://download.pytorch.org/whl/torch_stable.html
-    - torchvision==0.11.3+cu102
+    - torchvision==0.13.0+cu102
     - -f https://download.pytorch.org/whl/torch_stable.html
-    - tensorflow==2.8.0
-    - tensorflow-probability==0.16.0
+    - tensorflow==2.9.1
+    - tensorflow-probability==0.17.0
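
Note on the `download` helper added to the torch utilities in this patch: its cache check hashes an existing file in 1 MiB chunks and skips the download only when a `sha1_hash` was supplied and the digests match. The sketch below isolates that check using only the standard library. The names `file_sha1` and `is_cached` are hypothetical and do not appear in the patch; this is an illustration under those assumptions, not the library's API.

```python
import hashlib
import os
import tempfile


def file_sha1(path, chunk_size=1048576):
    """Return the hex SHA-1 digest of a file, read in fixed-size chunks."""
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            sha1.update(data)
    return sha1.hexdigest()


def is_cached(path, sha1_hash):
    """Hypothetical helper mirroring the cache test in `download`:
    the file must exist, a hash must be given, and the digests must match."""
    return bool(os.path.exists(path) and sha1_hash
                and file_sha1(path) == sha1_hash)


# Usage: hash a small temporary file and compare against expected digests.
payload = b'hello d2l'
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
expected = hashlib.sha1(payload).hexdigest()
cached = is_cached(tmp.name, expected)    # matching digest
stale = is_cached(tmp.name, '0' * 40)     # mismatching digest
no_hash = is_cached(tmp.name, None)       # no digest supplied
os.remove(tmp.name)
print(cached, stale, no_hash)
```

One consequence visible in the sketch: when no hash is supplied, the check never reports a cache hit, so `download` as written in the patch fetches the file again on every call made without `sha1_hash`.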