update F2,F3,F4 notebooks with contribution statements
dgedon committed Oct 31, 2023
1 parent c1cfa4e commit 05a17e4
Showing 3 changed files with 69 additions and 44 deletions.
107 changes: 63 additions & 44 deletions preparatory_notebooks/F2_linear_regression.ipynb
@@ -8,13 +8,15 @@
"source": [
"# Notebook: F2 -- Linear Regression\n",
"\n",
"This notebook is complementary to lecture F2 about linear regressoin in order to highlight its key concepts to refresh your knowledge and gain intuition. The focus will be one\n",
"*Authors*: Hanna Malmvall, Jennifer Andersson<br>\n",
"*Date*: 31.10.2023\n",
"\n",
"This notebook is complementary to lecture F2 about linear regression. The purpose of the notebook is to highlight key concepts. It is also an opportunity to refresh your knowledge and gain intuition. The focus will be on:\n",
"1. **Generating data** for supervised machine learning problems\n",
"2. **Fit linear models** to this data\n",
"3. **Evaluate** the fitted models to see how it performs on new data\n",
"2. **Fitting linear models** to this data\n",
"3. **Evaluating** the fitted models to see how it performs on new data\n",
"\n",
"Please read the instructions and play around with the notebook where it is described.\n",
"\n"
"Please read the instructions and play around with the notebook where it is described."
]
},
{
@@ -52,15 +54,15 @@
"\n",
"## 1. Data Generation\n",
"\n",
"The first step for a supervised machine learning problem is to **get a dataset**. Each input $x_i$ comes with a corresponding output or label $y_i$. Here, $i$ denotes the index of a particular sample, and we collect $n$ samples in total. Compactly, we denote our dataset as $\\mathcal{T} = \\{x_i, y_i\\}_{i=1}^{n}$.\n",
"The first step when solving a supervised machine learning problem is to **get a dataset**. Each input $x_i$ comes with a corresponding output or label $y_i$. Here, $i$ denotes the index of a particular sample, and we collect $n$ samples in total. Compactly, we denote our dataset as $\\mathcal{T} = \\{x_i, y_i\\}_{i=1}^{n}$.\n",
"\n",
"Now we:\n",
"1. Generate a synthetic dataset $\\mathcal{T}$.\n",
"2. Split the dataset into one train dataset and one test dataset. The train dataset will be used to fit a model to the data, and the test dataset will be used to evaluate our model. \n",
"2. Split the dataset into one train dataset and one test dataset. The train dataset will be used to fit a model to the data, and the test dataset will be used to evaluate our model.\n",
"\n",
"The **goal** of our supervised machine learning method is to find a model that performs well the unseen test data. So it is important to leave out a part of the data (the test dataset) from the training process to be able to evaluate how well our model will perform on new input datapoints $x$ in the future.\n",
"The **goal** of our supervised machine learning method is to find a model that performs well on the unseen test data. Therefore, it is important to leave out a part of the data (the test dataset) from the training process to be able to evaluate how well our model will perform on new input datapoints $x$ in the future.\n",
"\n",
"Below, we have some helper function to generate synthetic data, split the data and then plot them. Skip over and go to the next box."
"Below, we have some helper function to generate synthetic data, split the data into a train- and test dataset and then plot them. Run the cell and continue to the next cell."
]
},
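For reference, a minimal sketch of what such a generate-and-split step could look like (the notebook's actual helper is collapsed in this diff; the function names, the input range, and the 80/20 split are assumptions):

```python
import numpy as np

def make_dataset(n=100, theta0=1.0, theta1=2.0, noise_std=0.5, seed=0):
    # Noisy linear data: y = theta0 + theta1 * x + eps.
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 5.0, size=n)
    y = theta0 + theta1 * X + rng.normal(0.0, noise_std, size=n)
    return X, y

def split_dataset(X, y, train_fraction=0.8, seed=0):
    # Shuffle once, then hold out the tail as the test set.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    m = int(train_fraction * len(X))
    return X[idx[:m]], y[idx[:m]], X[idx[m:]], y[idx[m:]]

X, y = make_dataset()
X_train, y_train, X_test, y_test = split_dataset(X, y)
print(X_train.shape, X_test.shape)  # (80,) (20,)
```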
{
@@ -88,7 +90,7 @@
" print(f\"Train Data | {np.shape(X_train)[0]}{' ' * (3 - len(str(np.shape(X_train)[0])))} | {np.shape(y_train)[0]}{' ' * (3 - len(str(np.shape(y_train)[0])))} |\")\n",
" # Print test data row\n",
" print(f\"Test Data | {np.shape(X_test)[0]}{' ' * (3 - len(str(np.shape(X_test)[0])))} | {np.shape(y_test)[0]}{' ' * (3 - len(str(np.shape(y_test)[0])))} |\")\n",
" \n",
"\n",
" return X_train, y_train, X_test, y_test\n",
"\n",
"# Plot the train data and test data\n",
@@ -107,19 +109,24 @@
"id": "W9oSQ59uhW7U"
},
"source": [
"Now we can generate our dataset and plot them to get an understanding of what our data looks like. We plot both our train data (in blue) and our test data (in orange). \n",
"Now we can generate our datasets and plot them to get an understanding of what our data looks like. We plot both our train data (in blue) and our test data (in orange).\n",
"\n",
"Task:\n",
"- Run the cell below to visualize the synthetic train- and test datasets.\n",
"- Check if the test data are representative of the train data.\n",
"- Check if the test data is representative of the train data.\n",
"- Is there some relationship between $x$ and $y$?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "AzuGzLTZfsrk"
"colab": {
"base_uri": "https://localhost:8080/",
"height": 522
},
"id": "AzuGzLTZfsrk",
"outputId": "f2f83261-fb19-4d92-9e8f-d8aa3c64ea27"
},
"outputs": [],
"source": [
@@ -136,24 +143,28 @@
"id": "HoYjDXdJPlwK"
},
"source": [
"## Explore the Family of Linear Models\n",
"From our plot, we notice that there seems to be a linear pattern: $y$ increases linearly with $x$. Hence, as model it might be suitable to use a **linear model** on the form:\n",
"---\n",
"\n",
"## 2. Explore the Family of Linear Models\n",
"From our plot, we notice that there seems to be a linear pattern: $y$ increases linearly with $x$. Hence, it might be suitable to use a **linear model** on the form:\n",
"\n",
"$$\n",
"y=θ_0+θ_1x + ϵ\n",
"$$\n",
"\n",
"We call $θ_0$ and $θ_1$ the **parameters** of our model, and $ϵ$ is a noise term capturing random errors in our data that our model does not account for.\n",
"We call $θ_0$ and $θ_1$ the **parameters** of our model. $ϵ$ is a noise term capturing random errors in our data that our model does not account for.\n",
"\n",
"Finding a **good model**: This amount to fitting our model to the data. Meaning, finding good values of $θ_0$ and $θ_1$, so that $y_i\\approxθ_0+θ_1x_i$ holds for the samples in our training dataset $\\mathcal{T}_{train} = \\{x_i, y_i\\}_{i=1}^{m}$. Here, $m$ denotes the number of samples in our train set, i.e. $m=80$. \n",
"Finding a **good model** amount to fitting our model to the data. In other words, we want to find good values of $θ_0$ and $θ_1$, so that $y_i\\approxθ_0+θ_1x_i$ holds for the samples in our training dataset $\\mathcal{T}_{train} = \\{x_i, y_i\\}_{i=1}^{m}$. Here, $m$ denotes the number of samples in our train set, i.e. $m=80$.\n",
"\n",
"Below is a helper function that plots the linear models. Skip over and go to the next box."
"Below is a helper function that plots some linear models. Skip over and go to the next box."
]
},
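Concretely, each candidate model in this family is just a choice of intercept $θ_0$ and slope $θ_1$; a minimal sketch of the prediction rule (the plotting helper below evaluates essentially this expression):

```python
import numpy as np

def linear_model(X, theta0, theta1):
    # f_theta(x) = theta0 + theta1 * x, applied elementwise.
    return theta0 + theta1 * X

print(linear_model(np.array([0.0, 1.0, 2.0]), 1.0, 2.0))  # [1. 3. 5.]
```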
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"id": "CsVHfVtU_f5t"
},
"outputs": [],
"source": [
"def plot_linear_models(\n",
@@ -166,13 +177,13 @@
" if not all(element is None for element in model1_params):\n",
" y_model1 = model1_params[0] + X * model1_params[1]\n",
" plt.plot(X, y_model1, 'r', label='Model 1', alpha=0.5)\n",
" \n",
"\n",
" # model 2\n",
" if not all(element is None for element in model2_params):\n",
" print('aaa')\n",
" y_model2 = model2_params[0] + X * model2_params[1]\n",
" plt.plot(X, y_model2, 'm', label='Model 2', alpha=0.5)\n",
" \n",
"\n",
" # model 3\n",
" if not all(element is None for element in model3_params):\n",
" y_model3 = model3_params[0] + X * model3_params[1]\n",
@@ -188,22 +199,27 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"id": "LE75rcTF_f5t"
},
"source": [
"In this section we want to find a good linear model. We plot the training data, as well as the linear models which are fully described by $θ_0$ and $θ_1$. If a model fits the data, we can use our parameters along with the inputs of the data (variable $\\mathtt{X\\_train}$) to calculate predicted y-values which are close to the true y-values.\n",
"In this section we want to find a good linear model. We plot the training data, as well as the linear models (which are fully described by $θ_0$ and $θ_1$). If a model fits the data, we can use our parameters along with the inputs of the data (variable $\\mathtt{X\\_train}$) to calculate predicted y-values which are close to the true y-values.\n",
"\n",
"Tasks:\n",
"\n",
"1. Run the code below and visualize model 1 with the given parameters. Does it fit the data?\n",
"2. Try to optimize the parameters of model 2 and model 3 to obtain better fits to the data. Replace the $\\mathtt{None}$ values with what you think are better parameters.\n",
"3. Which set of parameters fit the data best? \n",
"3. Which set of parameters fits the data the best?\n",
"4. What does $θ_0$ and $θ_1$ stand for?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"id": "ssVhf5BD_f5u",
"outputId": "a8c366f9-45c1-4e01-f26b-10a2b1c3f69c"
},
"outputs": [],
"source": [
"# model 1:\n",
@@ -235,23 +251,25 @@
"source": [
"---\n",
"\n",
"## 2. Model evaluation\n",
"## 3. Model evaluation\n",
"\n",
"Above you visually fit the linear model to the data. But how can we determine quantitatively which model is better? A common metric is the mean squared error (MSE):\n",
"In the above exercise, you visually fit the linear model to the data. But how can we quantitatively determine which model is better? A common metric is the mean squared error (MSE):\n",
"\n",
"$$\n",
"\\frac{1}{m} \\sum_{i=1}^{m} {(y_i - f_{\\theta}(x_i))}^2\n",
"$$\n",
"\n",
"Here, $y_i$ denotes the true value for each input $x_i$ in the train dataset, and $f_{\\theta}(x_i) = \\theta_0 + \\theta_{1}x_i$ is the output of the model parameterized by our particular choice of $\\theta_0$ and $\\theta_1$.\n",
"\n",
"Below is a helper function which compute model predictions and the mean squared error of that model on the given data points. Skip over and go to the next box."
"Below you can find two helper functions computing the model predictions for a particular model as well as the mean squared error of that model on the given data points. Skip over and go to the next box."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"id": "hOl9HG_0_f5u"
},
"outputs": [],
"source": [
"def model_prediction(X, model_params):\n",
@@ -264,9 +282,11 @@
},
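The bodies of the two helpers are collapsed in this diff; a minimal sketch consistent with the MSE formula above (the notebook's exact implementation may differ) would be:

```python
import numpy as np

def model_prediction(X, model_params):
    # f_theta(x) = theta_0 + theta_1 * x for each input x.
    return model_params[0] + model_params[1] * X

def mean_squared_error(y_true, y_pred):
    # Average squared residual over the given data points.
    return np.mean((y_true - y_pred) ** 2)
```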
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"id": "253dAe75_f5u"
},
"source": [
"Following there are two code cells which perform the following:\n",
"The following two code cells perform the following:\n",
"- Compare the MSE of the three models on the **train dataset**.\n",
"- Compare the MSE of the three models on the **test dataset** and plot the function with the test data.\n",
"\n",
@@ -280,7 +300,8 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "WNDAkUdSnaPz"
"id": "WNDAkUdSnaPz",
"outputId": "45e34c69-b3ac-49ac-f925-25ee8b525e9c"
},
"outputs": [],
"source": [
@@ -303,7 +324,8 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "HXPpgsGvpJbk"
"id": "HXPpgsGvpJbk",
"outputId": "9f1fc9b4-3e25-41b3-8169-bd5afa899dd1"
},
"outputs": [],
"source": [
@@ -333,11 +355,11 @@
"source": [
"---\n",
"\n",
"## 3. Finding the \"optimal\" linear model:\n",
"## 4. Finding the \"optimal\" linear model:\n",
"\n",
"Now we can not just fit a model by visual inspection but also select a model quantitatively by it's lowest MSE.\n",
"Now, as we have seen, we can not only fit a model by visual inspection but also select a model quantitatively by finding the model with the lowest MSE.\n",
"\n",
"But is there a **systematic way** to select the model parameters $\\theta_0$ and $\\theta_1$? We define the \"best possible linear model\" as the model generating the smallest MSE. Finding the model parameters that minimize the MSE is equivalent to finding the parameters that minimize the squared L2-norm of the residual vector. Thus, to find the best linear model we want to solve the following optimization problem with respect to $\\theta=[\\theta_0, \\theta_1]^\\top$:\n",
"But is there a **systematic way** to select the model parameters $\\theta_0$ and $\\theta_1$? We define the \"best possible linear model\" as the model generating the smallest MSE. Finding the model parameters that minimize the MSE is equivalent to finding the parameters that minimize the squared L2-norm of the residual vector. Thus, to find the best linear model, we want to solve the following optimization problem with respect to $\\theta=[\\theta_0, \\theta_1]^\\top$:\n",
"\n",
"$$\n",
"\\hat{\\mathbf{\\theta}} = \\text{arg}\\min_{\\mathbf{\\theta}} \\frac{1}{m} \\sum_{i=1}^{m} {(y_i - f_{\\theta}(x_i))}^2 = \\text{arg}\\min_{\\mathbf{\\theta}} ||{(\\mathbf{y} - \\mathbf{X}\\mathbf{\\theta})}||_2^2\n",
@@ -385,7 +407,11 @@
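For reference, a hedged sketch of solving this least-squares problem in closed form (toy data; in the notebook this role would be played by the training set):

```python
import numpy as np

# Toy stand-ins for X_train / y_train.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Design matrix with a column of ones so theta_0 is the intercept.
X = np.column_stack([np.ones_like(x), x])

# Minimize ||y - X theta||_2^2; equivalent to the normal equations
# theta = (X^T X)^{-1} X^T y, but numerically more stable.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)  # roughly [theta_0, theta_1]
```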
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cVJhyY80Ie1K"
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "cVJhyY80Ie1K",
"outputId": "88024254-82dd-4b5c-eec5-27c27a05c13f"
},
"outputs": [],
"source": [
@@ -433,13 +459,6 @@
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
3 changes: 3 additions & 0 deletions preparatory_notebooks/F3_logistic_regression.ipynb
@@ -6,6 +6,9 @@
"source": [
"# Notebook: F3 -- Logistic Regression\n",
"\n",
"*Authors*: Amir Baghi, Daniel Gedon<br>\n",
"*Date*: 31.10.2023\n",
"\n",
"This notebook is complementary to lecture F3 about Logistic Regression in order to highlight the key concepts. The focus will be on\n",
"1. Understanding and visualizing different loss functions: **Misclassification** and **Logistic Loss**\n",
"2. A basic classifier and its **Misclassification Loss** and modifying the parameters to see the effects on the loss.\n",
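The two losses named in the F3 overview have simple forms; a hedged sketch (assuming labels $y \in \{-1, +1\}$ and a real-valued model output, conventions the lecture may define differently):

```python
import numpy as np

def misclassification_loss(y, score):
    # 1 if the sign of the score disagrees with the label, else 0.
    return float(np.sign(score) != y)

def logistic_loss(y, score):
    # ln(1 + exp(-y * score)): a smooth surrogate for misclassification.
    return float(np.log1p(np.exp(-y * score)))
```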
3 changes: 3 additions & 0 deletions preparatory_notebooks/F4_lda_qda.ipynb
@@ -8,6 +8,9 @@
"source": [
"# Notebook: F4 -- LDA, QDA\n",
"\n",
"*Authors*: Amir Baghi, Daniel Gedon<br>\n",
"*Date*: 31.10.2023\n",
"\n",
"This notebook is complementary to lecture F4 about LDA/QDA in order to highlight its key concepts. The focus will be on\n",
"1. Visualizing **multivariate Gaussian distributions**\n",
"2. **LDA**: Fitting Gaussian with the same covariance to data\n",
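As a rough illustration of the LDA fitting step named in point 2 of the F4 overview (a sketch, not the notebook's implementation): per-class means are estimated separately, while a single covariance matrix is pooled across all classes.

```python
import numpy as np

def fit_lda(X, y):
    # X: (n, d) inputs; y: (n,) integer class labels.
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    priors = {c: float(np.mean(y == c)) for c in classes}
    # Pool the covariance: center each class by its own mean first.
    centered = np.vstack([X[y == c] - means[c] for c in classes])
    shared_cov = np.cov(centered, rowvar=False)
    return means, shared_cov, priors
```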
