\documentclass[a4paper,12pt]{article}
\usepackage [utf8]{inputenc}
%\usepackage [francais]{babel}
\usepackage [T1] {fontenc}
\usepackage{listings}
\usepackage{algpseudocode}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{mathtools}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\softmax}{softmax}
\title{COMP 551: Applied Machine Learning\\ \large Lecture Notes}
\author{Étienne Fortier-Dubois}
\date{Winter 2018}
\begin{document}
\maketitle
\frenchspacing
\tableofcontents
\newpage
\section{Introduction to machine learning}
Imagine two tasks. In the first task, there are four pictures of people, one of which is male, and the goal is to identify the male face. In the second task, the goal is to compute $154443586 \times 29506220$. Which task is more difficult? For a human, the second is difficult while the first is trivial. For a computer, the reverse is true.
Alan Turing asked: Can machines think? He devised the Turing test, also called the imitation game, in which a computer must behave in a way indistinguishable from a human to pass the test.
\subsection{What can machines do?}
They can beat humans at chess (IBM Deep Blue, 1997). They can fly a helicopter (Stanford Autonomous Helicopter, 2004). They can answer questions (IBM Watson, 2011). They can play video games like Atari Breakout (Google DeepMind's Deep Q-learning, 2013). They can beat humans at Go (AlphaGo, 2016).
As of 2018, there has been a lot of progress in some practical applications. These include speech recognition and synthesis, language translation, medical analysis and drug discovery, financial applications, personal AI assistants (like Google Home and Amazon Alexa), robotics, self-driving cars, and even finding new planets.
\subsection{What is machine learning?}
Machine learning is the study of algorithms that improve their performance $P$ at some task $T$ with experience $E$. The learning therefore involves the variables $\langle P, T, E\rangle$.
First example: The task $T$ is to classify emails as spam or non-spam. $P$ is the accuracy of the classification. $E$ is a set of example emails labelled as spam or non-spam.
Second example: $T$ is the detection of fraudulent credit card transactions. $P$ is accuracy. $E$ is a set of historical transactions marked as legit or fraudulent.
\subsection{Prediction problems}
This is one of the most common types of machine learning problems. Given variable $x$, predict variable $y$.
Examples: Given the size of the house, predict the selling price. Given the current stock price of a company (at time $t$), predict the same after 10 minutes ($t+10$). In these two examples, $y$ is a real number ($y \in \mathbb{R}$). These problems are called \textbf{regression problems}.
Given an image of a hand-written digit, predict the digit (e.g. for reading cheques). Given the temperature, humidity and wind, predict whether it will rain or not. In these two examples, $y$ is categorical ($y \in \{0, 1, ..., 9\}$ in the first case, $y \in \{yes, no\}$ in the second case). These problems are called \textbf{classification problems}.
\subsection{What is learning?}
Definition of learning: Given the value of an input $x$, make a good prediction of output $y$, denoted $\hat y$.
Definition of supervised learning: Given a training set $\{ x^i, y^i\}_{i=1}^N$, find a function $f: x \rightarrow y$ such that a prediction $\hat y$ can be made from $x$. In other words, given a set of data points, find a way to predict further data points.
\clearpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction to machine learning, continued}
\subsection{Curve fitting example}
This is a regression problem. Suppose we have data points created with a function, and we want to know what the function is. In our example (but not in real life), we know that the function is $\sin(2\pi x)$. In addition, there is some Gaussian noise $\epsilon$, so that the data points do not form a perfect curve. Thus
$$y = \sin(2\pi x) + \epsilon$$
Let's assume a linear model. The goal is to learn the parameters $w = \{w_0, w_1\}$ of the model. The prediction equation is
$$\hat y = w_0 + w_1x$$
If we do not add $w_0$, we can only cover lines passing through the origin, and so $w_0$ is known as the \textbf{bias} in ML literature. Thus the prediction depends on both $x$ and $w$, and we write $\hat y(x; w)$.
The goal is that $\hat y(x; w)$ be as close as possible to $y$. This is done by minimizing the error function $$E(w) = \frac{1}{2}\sum_{n=1}^N \{ \hat y (x^n; w) - y^n \}^2$$
$E(w)$ is a quadratic function of the parameters. Therefore, the derivative of $E(w)$ with respect to $w$ is linear.
But note that this model does not fit the data we have very well (since it was generated with a sin function). We can consider higher order polynomials. Using the fact that $x^0 = 1$, we can write
\begin{align*}
\hat y(x; w) &= w_0x^0 + w_1x^1 + ... + w_Mx^M\\
&= \sum_{j=0}^M w_jx^j
\end{align*}
The error function is still a quadratic function of the parameters $w$, because the model is still linear in $w$. So the minimization still has a unique closed-form solution for $w$.
In our example, a third-degree polynomial would appear to fit the data well. A ninth-degree polynomial actually fits the data perfectly---there is zero training error. Which one is the best solution?
Knowing the original function (sin), the third-degree polynomial seems better, because it would better predict further data points. With $M=9$, we obtain an excellent fit to the training data, but it is a poor model. This is called \textbf{overfitting}.
But we know this only because we already know the true function. What to do when we don't?
\subsubsection{Test set}
The solution is to use a separate test set. When training the model, use data points labeled as the training set. Then, to assess how good the model is, use other data points---the test set. This is a proxy for performance.
We can evaluate performance (and compare the training with the testing) with root mean square error:
$$E_{RMS} = \sqrt{\frac{2E(w^*)}{N}}$$
If we plot the $E_{RMS}$ against $M$ (polynomial degree), we'll see low error for $M=9$ for the training, but high error for the testing. We should pick the degree that minimizes testing error.
Why does this happen in the example with $M=9$? As $M$ increases, the magnitude of the coefficients of the weight parameters ($w_i$) gets larger. In fact, if we have 10 data points and 10 weights ($w_0$ to $w_9$), then the model finds weights such that it can fit each data point perfectly.
Solution 1 to overfitting is to simply use more data points. As you add them, the model will approximate the curve better and better (even as $M$ stays constant).
Solution 2 is to add a penalty term to the error function to discourage the coefficients from reaching large values. For example:
$$E(w) = \frac{1}{2} \sum_{n=1}^N \left\{ \hat y(x^n; w) - y^n\right\}^2 + \frac{\lambda}{2}||w||^2$$
where $\lambda$ is a regularization coefficient. If $\lambda$ is too large or too small, the error will be high; thus there is a region of values of $\lambda$ that is optimal, and it is crucial to choose a correct $\lambda$.
We call $\lambda$ a \textbf{hyperparameter} of the model. The difference between a hyperparameter and a parameter is that the former is fixed before training, while the latter is learned from the data.
\subsubsection{Model selection}
But how do we choose $\lambda$? Can we use the test set to choose it? No. The test set is supposed to be a proxy for completely new $x$.
The solution is to separate a third set, called \textbf{validation set}. Our initial data points will be divided into a training, a validation, and a testing set.
Validation works like this: (1) For different values of $\lambda$, train the model and compute the validation performance. (2) Pick the value of $\lambda$ that has the best validation performance. (3) Compute the \textit{test} performance for the model with the chosen $\lambda$. This is called \textbf{model selection}.
Note that $M$, i.e. the degree of the polynomial, is also a hyperparameter. This leads us to solution 3 to overfitting: try different values of $M$ and select the value of $M$ based on validation performance. This is another form of model selection. There are many other solutions to overfitting.
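As a concrete illustration, here is a minimal Python/NumPy sketch (not part of the lecture; the synthetic data, the split sizes, and the range of degrees tried are illustrative assumptions) of selecting the polynomial degree $M$ on a validation set and reporting test error only for the chosen model:
\begin{lstlisting}[language=Python]
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = sin(2*pi*x) + Gaussian noise
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)

# Split into training, validation and test sets
x_tr, y_tr = x[:15], y[:15]
x_va, y_va = x[15:23], y[15:23]
x_te, y_te = x[23:], y[23:]

def fit_poly(x, y, M):
    """Least-squares fit of a degree-M polynomial (columns x^0, ..., x^M)."""
    X = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def rms_error(x, y, w):
    X = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((X @ w - y) ** 2))

# Model selection: pick the degree M with the lowest validation error
val_error = {M: rms_error(x_va, y_va, fit_poly(x_tr, y_tr, M)) for M in range(10)}
best_M = min(val_error, key=val_error.get)
print("chosen M:", best_M)
print("test RMS:", rms_error(x_te, y_te, fit_poly(x_tr, y_tr, best_M)))
\end{lstlisting}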
\subsection{Machine learning pipeline}
These are the usual steps to go through when solving a machine learning problem. In this course, we will spend a lot of time on steps 4 (model selection) and 5 (error function).
\begin{enumerate}
\item Define the input $x$ and output $y$.
\item Collect examples for the task.
\item Divide the examples into a training, validation, and test set.
\item Define your model: parameters and hyperparameters.
\item Define the error function (a.k.a. loss function) that you want to minimize.
\item For different values of hyperparameters, learn model parameters (with the training set) by minimizing the loss function, and compute validation performance with the validation set.
\item Pick the best model based on validation performance.
\item Test the model with the test set.
\end{enumerate}
\subsection{Classification example}
The curve fitting example was a regression problem. Here is an example of a classification problem.
Suppose we have a bunch of points that are either orange or blue, and are plotted on two axes $x_1$ and $x_2$. (Typically, with two categories, we label them 0 and 1, or -1 and 1. Let's say blue = 0 and orange = 1.) The goal is to be able to classify any new data point with known $x_1$ and $x_2$ as either blue or orange.
\subsubsection{Model 1: linear model}
We have $x = (x_1, x_2)$. We can generalize to $x^T = (x_1, x_2, ..., x_p)$. We also have parameters $w^T = (w_0, w_1, ..., w_p)$.
$$\hat y = w_0 + \sum_{j=1}^p x_j w_j $$
$$= \sum_{j=0}^p x_j w_j$$
where $x_0 = 1$.
As a result, the scalar $\hat y = x^Tw$ (inner product of two vectors). We need to learn the parameter vector $w$.
We do this the same way as in the linear regression example, by minimizing the squares (least squares). The residual sum of squares is:
$$RSS(w) = \frac{1}{2} \sum_{i=1}^N (y^{(i)} - x^{(i)^T}w)^2$$
We can write this in matrix notation. In the following, $X$ is the data matrix; and $y$ is the target vector.
$$X = \begin{pmatrix}
x_1^{(1)} & x_2^{(1)} & \cdots & x_p^{(1)} \\
x_1^{(2)} & x_2^{(2)} & \cdots & x_p^{(2)} \\
\vdots & \vdots & \ddots & \vdots\\
x_1^{(N)} & x_2^{(N)} & \cdots & x_p^{(N)}
\end{pmatrix},
y = \begin{pmatrix}
y^{(1)}\\
y^{(2)}\\
\vdots\\
y^{(N)}
\end{pmatrix}$$
Rows in $X$ (including the associated $y^{(n)}$) are \textbf{examples} (data points), and columns are \textbf{features} of the data. $X$ is an $N\times p$ matrix; $y$ is an $N$-dimensional column vector; and $w$ is a $p$-dimensional column vector. Thus the residual sum of squares is:
$$RSS(w) = \frac{1}{2}(y - Xw)^T(y - Xw)$$
(We can check that the multiplications make sense. $y$ is $N\times 1$, $X$ is $N\times p$ and $w$ is $p\times 1$. Thus $Xw$ is $N\times 1$, and the transpose of $(y-Xw)$ is $1\times N$. Multiplied with $(y-Xw)$, we have $(1\times N)(N\times 1)$ which yields a scalar, the RSS.)
We can solve this in closed form. The solution is obtained by differentiating with respect to $w$ and setting the derivative to 0.
\begin{align*}
-2 \frac{1}{2} X^T (y - Xw) &= 0\\
X^T (y - Xw) &= 0\\
X^Ty - X^TXw &= 0\\
X^TXw &= X^Ty\\
w &= (X^TX)^{-1}X^Ty
\end{align*}
Therefore, the closed form solution to the residual sums of squares is
$$w^* = (X^T X)^{-1} X^T y$$
and the model with the learnt parameter $w^*$ is $$\hat y = w^{*^T} x$$
After we have computed $\hat y$ for a given data point, we classify it according to a decision boundary. For $\{0, 1\}$ classification,
$$\hat y = \begin{cases}
1 \text{ if } \hat y > 0.5\\
0 \text{ if } \hat y \leq 0.5
\end{cases}$$
and for $\{-1, 1\}$ classification,
$$\hat y = \begin{cases}
-1 \text{ if } \hat y < 0\\
1 \text{ if } \hat y \geq 0
\end{cases}$$
where $\hat y = 0.5$ (in the first case) and $\hat y = 0$ (in the second case) is the decision boundary. In other words, a data point is classified into one category or the other depending on which side of the boundary it falls on.
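A minimal Python/NumPy sketch of this procedure (the two-blob dataset and all names are illustrative assumptions, not lecture material):
\begin{lstlisting}[language=Python]
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian blobs in 2D, labelled 0 (blue) and 1 (orange)
X0 = rng.normal([-1.0, -1.0], 0.7, size=(50, 2))
X1 = rng.normal([+1.0, +1.0], 0.7, size=(50, 2))
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(50), np.ones(50)])

# Add the constant feature x_0 = 1 so that w_0 acts as the bias
Xb = np.hstack([np.ones((len(X), 1)), X])

# Closed-form least squares: w* = (X^T X)^{-1} X^T y
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Classify by thresholding the real-valued prediction at 0.5
y_hat = (Xb @ w > 0.5).astype(int)
print("training accuracy:", np.mean(y_hat == y))
\end{lstlisting}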
\subsubsection{Model 2: nearest-neighbor}
Use those observations in the training set $T$ closest in input space to $x$ to form the prediction $\hat y$.
$$\hat y(x) = \frac{1}{k} \sum_{x^i \in N_k(x)} y^i $$
where $k$ is the number of neighbors. The neighborhood of $x$ can be written $N_k$.
If $\hat y(x) > 0.5$ then classify as class 1; else, classify as class 0.
In other words, classify based on the classification of the neighbors (``majority voting'' of the neighbors). This has the implicit assumption that the class distribution is \textbf{locally smooth}, i.e. the class of an object is similar to other objects in its neighborhood.
If we look at a 2D distribution of points, the decision boundary will look irregular, unlike in the linear model. Nearest-neighbor classification also responds to local clusters, and the decision boundary can create several disjoint areas if there are distinct clusters of points.
The output of the model depends on $k$. For instance, if $k=1$, the decision boundary gets even more irregular, such that every point is correctly classified. But recall the overfitting lesson: the model where a cluster is made for every data point may not be valid for classifying new data points.
Similarly to the regression example, $k$ is a hyperparameter of the algorithm, and can be chosen from a separate validation step. The hyperparameter should be picked such that error in the validation step is minimal.
Are there any other hyperparameters for the $k$-nearest neighbor algorithm? Yes: the metric used to compute nearest neighbors (here we use Euclidean distance by default).
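A correspondingly minimal $k$-nearest-neighbor classifier might look as follows (Python sketch; Euclidean distance and majority voting via the mean of $\{0,1\}$ labels, with the training data assumed to be NumPy arrays):
\begin{lstlisting}[language=Python]
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    """Classify x_query in {0, 1} by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    neighbors = np.argsort(dists)[:k]                  # indices of the k closest points
    return int(np.mean(y_train[neighbors]) > 0.5)      # average label > 0.5 => class 1
\end{lstlisting}
The hyperparameter $k$ (and, if desired, the distance metric) would then be chosen by comparing validation error across candidate values, exactly as in the regression example.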
\subsubsection{Least squares vs. nearest-neighbors}
In least squares (linear model), the decision boundary is very smooth and the algorithm is more stable, but there is a strong assumption that the decision boundary is linear. There is high bias and low variance.
In nearest-neighbors, the decision boundary is wiggly and depends on a handful of input points and their position. The algorithm is less stable. The assumption that the class distribution is locally smooth is not such a strong assumption. The bias is low and variance is high.
\clearpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Statistical decision theory}
\subsection{Expected prediction error}
In a regression problem, we have a real valued random input vector $x \in \mathbb{R}^P$; a real valued random output variable $y \in \mathbb{R}$; and the joint distribution of $x, y$: $Pr(x, y)$. The goal is to find a function $f(x)$ for predicting $y$ given values of $x$.
We can evaluate the performance of any prediction function by defining a loss function $L(y, f(x))$ that penalizes errors in prediction by comparing it to the real data. A common loss function is squared error loss:
$$L(y, f(x)) = (y - f(x))^2$$
From this function, we can derive the expected prediction error (EPE):
\begin{align*}
EPE(f) &= E\left[(y-f(x))^2\right]\\
&= \int_x \int_y (y-f(x))^2 P(x,y) dx dy\\
&= \int_x \int_y (y-f(x))^2 P(y|x)\,dy~P(x)\, dx\\
&= \int_x E_{y|x} \left[(y-f)^2 | X=x\right]P(x)dx\\
&= E_x E_{y|x} \left[(y-f)^2 | X=x\right]
\end{align*}
It suffices to minimize the EPE pointwise (differentiate):
$$f^* = \argmin_f ~E_{y|x}\left[(y-f)^2 | X=x\right]$$
\begin{align*}
\frac{\partial}{\partial f} E_{y|x}\left[(y-f)^2 | X=x\right] &= 0 \\
\frac{\partial}{\partial f} \left[ E_{y|x}(y^2|X=x) + f^2 - 2 E_{y|x}(y|X=x)f \right] &= 0\\
2f - 2 E_{y|x}(y|X=x) &= 0
\end{align*}
The function that minimizes the EPE is therefore:
$$f^* = E_{y|x} (y | X=x).$$
This means that the best prediction of $y$ at any point $X=x$ is the conditional mean (the expected value of $y$ given the value of $x$), when ``best'' is measured by average squared error. This is true for any model, not just linear regression.
\subsection{Expected prediction error for $k$-nearest neighbors}
The function that minimizes the EPE for nearest neighbors is:
$$\hat f(x) = Average(y_i | x_i \in N_k(x)).$$
In other words, the function averages the predictions made for the class of $x$ for each of the $k$ neighbors of $x$. This is similar to the regression EPE-minimization function, but with two approximations.
\begin{enumerate}
\item Expectation is approximated by averaging over sample data.
\item Conditioning at a point is relaxed to conditioning on some region ``close'' to the target point.
\end{enumerate}
Note 1: For a large training data sample size $N$, the points in the neighborhood are likely to be close to $x$.
Note 2: As $k$ gets larger, the average will get more stable.
Under mild regularity conditions on $P(x,y)$, one can show that as $N, k \rightarrow \infty$ with $k/N \rightarrow 0$ (i.e., the number of neighbors grows much more slowly than the number of samples), $\hat f(x) \rightarrow E(y|X=x)$: the estimate converges to the optimal predictor.
Does that mean we have a universal function approximator? No. As the number of features increases, in high dimensions, we need a very large number of samples.
\subsection{Local methods in high dimensions}
Consider inputs uniformly distributed in a $p$-dimensional hypercube. Consider a hypercubical neighborhood about a target point to capture a fraction of the observations. In other words, we want to compute a fraction $r$ of the unit volume.
The expected edge length is $e_p(r) = r^{1/p}$. For instance, in 10 dimensions, the edge length for a 1\% fraction is $e_{10}(0.01) = 0.01^{1/10} \approx 0.63$. To cover 1\% of the data, to form a local average, we must cover 63\% of the range of each input variable.
Such neighborhoods are no longer ``local''. And if we reduce $r$, with fewer observations, variance will be high. This is known as the \textbf{Curse of Dimensionality}. In other words, to search for some number of nearest neighbors, we need to cover a larger fraction of the total volume when in higher dimensions. Note that intuition often breaks down in higher dimensions!
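The edge-length formula is easy to check numerically; a small Python sketch:
\begin{lstlisting}[language=Python]
# Expected edge length of a hypercube neighborhood capturing a fraction r
# of observations uniformly distributed in a p-dimensional unit cube:
# e_p(r) = r**(1/p)
r = 0.01  # we want a neighborhood containing 1% of the data
for p in (1, 2, 10, 100):
    print(p, round(r ** (1 / p), 3))
# p=10 prints ~0.631: the "local" neighborhood spans 63% of each axis
\end{lstlisting}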
\subsection{Expected prediction error for linear regression}
For linear regression, the function that minimizes the EPE is:
$$\hat f(x) \approx x^Tw$$
This is a model-based approach. We are specifying a model for the regression function.
$$w = [E(XX^T)]^{-1} E(XY)$$
Note: we have not conditioned on $X$. Rather, we have used our knowledge of the functional relationship to pool over values of $X$. As an approximation, we can replace the expectation in the $w$ equation above by averages over training data.
\subsection{Inductive bias}
We have made the following assumptions:
\begin{enumerate}
\item Least squares assumes $f(x)$ is well approximated by a globally linear function.
\item $k$-nearest neighbors assumes $f(x)$ is well approximated by a locally constant function.
\end{enumerate}
The assumptions made by any algorithm are known as the \textbf{inductive bias} of the algorithm.
We will cover a lot of model-based algorithms. For example, additive models:
$$f(x) = \sum^P_{j=1} f_j(x_j).$$
This retains the additivity of the linear model, but each coordinate function $f_j$ is arbitrary.
\subsection{Non-squared error loss}
We have used squared error loss, $L_2$. We could also use $L_1$:
$$E|y-f(x)|.$$
The solution is different: $\hat f(x) = \text{median}(y|X=x)$. This is a different measure of location, and its estimates are more robust than those for the conditional mean.
$L_1$ criteria have discontinuities in their derivatives, however, which has hindered their widespread use.
\subsection{Expected prediction error for classification}
For classification problems, the same paradigm is valid, but we need a different loss function. The loss function should penalize prediction errors.
Let the output categorical variable $G$ take values in the set of all possible classes $\{G_1, ..., G_k\}$. Let $\hat G$ be the estimate and $G$ the true class.
The loss function is a $k\times k$ matrix where $k$ is the cardinality of $G$. The function $L(k, l)$ is the price for classifying an observation belonging to class $k$ in class $l$. For instance, classifying a person as having cancer when in fact they don't might carry a cost (and this cost would be different from classifying a person who has cancer as someone who doesn't).
A common loss function is the \textbf{0/1 loss function}. This function assigns a penalty of 0 for a correct classification and 1 for a misclassification.
The expected prediction error for a classification problem is
$$EPE = E\left[L(G, \hat G(X))\right].$$
Expectation is with respect to $P(G,X)$. Thus
$$EPE = E_X \sum_{l=1}^k L\left[G_l, \hat G(X)\right]P(G_l|X).$$
Again, it suffices to minimize EPE pointwise:
$$\hat G(x) = \argmin_{g\in G} \sum^k_{l=1} L(G_l, g)P(G_l|X=x)$$
For the 0/1 loss function, this yields a solution that can be written in either of two forms:
$$\hat G(x) = \argmin_{g\in G} \left[1- P(g|X=x)\right]$$
$$\hat G(x) = \argmax_{g\in G} \left[P(g|X=x)\right]$$
This solution is known as the \textbf{Bayes classifier}. We classify the most probable class, using the conditional distribution $P(G|X)$. The error rate of the Bayes classifier is called the \textbf{Bayes rate}. This is the optimal Bayesian decision boundary.
The $k$-nearest neighbors method directly approximates this solution. In this algorithm, we do majority voting in the neighborhood. The approximations are:
\begin{enumerate}
\item The conditional probability at a point is relaxed to the conditional probability within its neighborhood.
\item Probabilities are estimated by training sample proportions.
\end{enumerate}
\clearpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Linear regression}
The simplest model for linear regression is:
$$\hat y(x; w) = w_0 + w_1x_1+...+w_px_p$$
This is a linear function of the parameters $w$. This is also a linear function of the inputs $x_1, ..., x_p$, which is a limitation.
\subsection{Linear basis function models}
We can write the linear regression model like this:
$$\hat y(x; w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)$$
where $\phi_j(x)$ is a \textbf{basis function}, a fixed non-linear function. Assuming $\phi_0(x) = 1$, we get
\begin{align*}
y(x; w) &= \sum_{j=0}^{M-1} w_j \phi_j(x)\\
&= w^T\phi(X)
\end{align*}
where $w = (w_0, ..., w_{M-1})^T$ and $\phi = (\phi_0, ..., \phi_{M-1})^T$.
The $\hat y(x; w)$ is a non-linear function of $x$. It is still linear with respect to the parameters, and hence is a linear model.
Below we describe some basis functions.
\subsubsection{Polynomial basis function}
If the basis function is
$$\phi_j(x) = x^j$$
then we get
$$\hat y(x; w) = w_0 + w_1x+ w_2x^2+...+w_{M-1}x^{M-1}.$$
\subsubsection{Gaussian basis function}
$$\phi_j(x) = \exp \left\{ -\frac{(x-\mu_j)^2}{2s^2} \right\}$$
where $\mu_j$ governs the locations of the basis function in input space, and $s$ governs the spatial scale. Note: there is no probabilistic interpretation. The normalization constant is unimportant here because the basis functions will be multiplied by the adaptive parameters $w_j$.
\subsubsection{Sigmoidal basis functions}
$$\phi_j(x) = \sigma \left( \frac{x-\mu_j}{s} \right)$$
where we have a logistic sigmoid function
$$\sigma(a) = \frac{1}{1+e^{-a}}.$$
\subsubsection{Identity basis function}
$$\phi(x) = x$$
With $\phi_j(x) = x_j$ (and $\phi_0(x) = 1$), the model $\hat y(x; w) = w_0 + w_1x_1+...+w_px_p$ can be written as $\hat y(x; w) = \sum_{j=0}^p w_j \phi_j(x)$.
\subsection{Least squares}
$$RSS(w) = (y - \phi w)^T (y - \phi w)$$
$$w^* = \argmin_w RSS(w)$$
$$w^* = (\phi^T\phi)^{-1} \phi^T y$$
$(\phi^T\phi)^{-1} \phi^T$ is the Moore-Penrose pseudo-inverse of matrix $\phi$, i.e.
$$\phi^+ = (\phi^T\phi)^{-1}\phi^T.$$
Note: if $\phi$ is square and invertible, then $\phi^+ = \phi^{-1}$.
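As an illustration, here is a Python/NumPy sketch of least squares with Gaussian basis functions via the pseudo-inverse (the number of basis functions, the centres $\mu_j$, and the scale $s$ are illustrative choices, not prescribed by the lecture):
\begin{lstlisting}[language=Python]
import numpy as np

def gaussian_design(x, centres, s):
    """Design matrix with a bias column and Gaussian basis functions
    exp(-(x - mu_j)^2 / (2 s^2)) evaluated at each data point."""
    phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 25)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 25)

centres = np.linspace(0, 1, 9)            # the mu_j
Phi = gaussian_design(x, centres, s=0.1)

# w* = Phi^+ y, with Phi^+ the Moore-Penrose pseudo-inverse
w = np.linalg.pinv(Phi) @ y
print("training RMS:", np.sqrt(np.mean((Phi @ w - y) ** 2)))
\end{lstlisting}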
\subsubsection{Geometry of least squares}
Consider an N-dimensional space whose axes are given by $y^{(n)}$ so that $y = (y^{(1)}, ..., y^{(n)})^T$ is a vector in this space. Each basis function $\phi_j(x^{(n)})$, evaluated at $N$ data points, can also be represented as a vector in the same space, denoted by $\varphi_j$.
$\varphi_j$ is the $j$th column of $\phi$, while $\phi(x^{(n)})$ is the $n$th row of $\phi$.
If the number $M$ of basis functions is smaller than the number $N$ of data points, then the $M$ vectors $\varphi_j$ will span a linear subspace $S$ of dimensionality $M$.
We know that $\hat y$ is a linear combination of the vectors $\phi_j$. So it lives anywhere in the $M$-dimensional subspace.
The sum of squared errors (SSE) is the squared Euclidean distance between $y$ and $\hat y$. Thus the least squares solution for $w$ corresponds to that choice of $\hat y$ that lies in the subspace $S$ and that is closest to $y$. This solution corresponds to the orthogonal projection of $y$ onto the subspace $S$.
\textbf{First issue}: There may be columns of $X$ that are not linearly independent; then $X$ is not of full rank. For instance, if two features are perfectly correlated (e.g. date of birth and age), then $X^TX$ is singular and $w^*$ is not uniquely defined. The solution is to filter the redundant features.
\textbf{Second issue}: There can also be rank deficiency if there are more features $p$ than the number of training cases $N$. The solution is to reduce the number of features (dimensionality reduction) or add regularization (L1, L2).
\subsection{Learning by gradient descent}
This is a fundamental learning algorithm. Consider a one-dimensional input. The linear regression model is $\hat y(x; w) = w_0 + w_1x$. The parameters are $w = (w_0, w_1)$. The loss function is
$$L(w_0, w_1) = \frac{1}{2N} \sum_{i=1}^N\left(\hat y(x^i; w) - y^i\right)^2$$
The goal is to minimize $L(w_0, w_1)$.
This is a simple example which we can solve analytically. But there are many situations where analytical solutions are not computationally possible. Then we can use the gradient descent approach. Visually, it is about taking step after step downward on the loss surface in order to reach the minimum.
\subsubsection{Gradient descent outline}
\begin{enumerate}
\item Start with some random $w_0, w_1$.
\item Keep changing $w_0, w_1$ to reduce $L(w_0, w_1)$ until we (hopefully) end at a minimum.
\end{enumerate}
Algorithm: repeat, until convergence, the following:
$$w_j = w_j - \alpha \frac{\partial}{\partial w_j} L(w_0, w_1)$$
and simultaneously update $j=0$ and $j=1$.
$\alpha$ is called the step size or learning rate. If it is too small, gradient descent will be slow. If it is too large, gradient descent can overshoot the minimum and may fail to converge (or even diverge). We see that $\alpha$ is a hyperparameter of the algorithm. Furthermore, it is very sensitive and must be chosen carefully.
Note: gradient descent can converge to a local minimum even with the learning rate $\alpha$ fixed, because the gradient becomes less steep as we approach the minimum. The algorithm will then automatically make smaller steps.
\subsubsection{Linear regression example}
Suppose we have the loss function
$$L(w_0, w_1) = \frac{1}{2N} \sum_{i=1}^N\left(w_0 + w_1x^{(i)} - y^{(i)}\right)^2$$
with two parameters $w_0, w_1$. We get the gradients
$$\frac{\partial L}{\partial w_0} = \frac{2}{2N}\sum_{i=1}^N\left(w_0 + w_1x^{(i)} - y^{(i)}\right)$$
$$\frac{\partial L}{\partial w_1} = \frac{2}{2N}\sum_{i=1}^N\left(w_0 + w_1x^{(i)} - y^{(i)}\right)x^{(i)}.$$
The algorithm will therefore repeat, until convergence, the updates
$$w_0 = w_0 - \alpha\frac{1}{N}\sum_{i=1}^N\left(\hat y(x^{(i)}; w) - y^{(i)}\right)$$
and
$$w_1 = w_1 - \alpha\frac{1}{N}\sum_{i=1}^N\left(\hat y(x^{(i)}; w) - y^{(i)}\right)x^{(i)}.$$
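A minimal batch gradient descent loop implementing these updates (Python sketch; the learning rate and iteration count are illustrative):
\begin{lstlisting}[language=Python]
import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for y_hat = w0 + w1*x with squared-error loss.
    x and y are 1-D NumPy arrays of equal length."""
    w0, w1 = 0.0, 0.0
    for _ in range(n_iters):
        # residuals are computed once so that w0 and w1 are updated simultaneously
        err = w0 + w1 * x - y
        w0 = w0 - alpha * np.mean(err)       # dL/dw0 = (1/N) sum of residuals
        w1 = w1 - alpha * np.mean(err * x)   # dL/dw1 = (1/N) sum of residuals * x
    return w0, w1
\end{lstlisting}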
\subsection{The special nature of the linear regression loss function}
The loss function for linear regression is a convex function. Thus it has only one minimum, which is the global minimum. This is because of the use of the least squares loss with a linear model.
For non-linear models, we might get non-convex loss functions, and then there is no guarantee to reach a global minimum; we can get a local one.
\subsection{Online or sequential learning}
In an online context, data come in a continuous stream. We use \textbf{stochastic gradient descent}. The loss function for stochastic gradient descent is a sum of losses for every data point:
$$\text{loss} = \sum_n \text{loss}_n.$$
We can approximate the gradient of this total loss by the gradient of individual data point loss.
$$w = w -\alpha \frac{\partial}{\partial w} \text{loss}_n$$
The stochastic step only approximates the true gradient. The approximation error can help escape local minima.
\subsubsection{Online gradient descent algorithm}
With $N$ data points, repeat until convergence:
\noindent for $i$ in 1 to $N$ \{
$$w_0 = w_0 - \alpha (\hat y(x^{(i)}; w) - y^{(i)})$$
$$w_1 = w_1 - \alpha (\hat y(x^{(i)}; w) - y^{(i)})x^{(i)}$$
\}
One \textbf{epoch} is a full iteration of the loop, i.e. one sweep of all the examples in the training set.
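The corresponding stochastic (online) version, updating on one example at a time, might be sketched as follows (Python; the random shuffling per epoch is a common but optional choice):
\begin{lstlisting}[language=Python]
import numpy as np

def sgd(x, y, alpha=0.01, n_epochs=50, seed=0):
    """Stochastic (online) gradient descent for y_hat = w0 + w1*x."""
    rng = np.random.default_rng(seed)
    w0, w1 = 0.0, 0.0
    for _ in range(n_epochs):                  # one epoch = one sweep over all examples
        for i in rng.permutation(len(x)):      # visit the examples in random order
            err = w0 + w1 * x[i] - y[i]        # gradient contribution of one example
            w0 = w0 - alpha * err
            w1 = w1 - alpha * err * x[i]
    return w0, w1
\end{lstlisting}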
\subsection{Multiple outputs}
So far, we have considered only a single target variable $y$. However, $y$ could be a vector ($y \in \mathbb{R}^k$). How does that affect the regression model?
A first solution could be a set of basis functions for each component of $y$, i.e. as if we had independent regression problems.
A better solution is to use the same set of basis functions to model all the components of the target vector:
$$\hat y (x; w) = W^T \phi(x)$$
where $W$ is an $M\times k$ matrix of parameters.
$$W^* = (\phi^T\phi)^{-1}\phi^TY$$
where $Y$ is an $N\times k$ matrix. For an individual target variable,
$$w_k = (\phi^T\phi)^{-1}\phi^T y_k = \phi^+ y_k.$$
The solution to the regression problem decouples between the different target variables. We only need to compute a single pseudo-inverse matrix, which is shared by all of the vectors $w_k$.
\clearpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Bias-variance tradeoff, regularization}
\subsection{Empirical risk minimization}
Let $X \in \mathbb{R}^P$, $y \in \mathbb{R}$. Our hypothesis $h$ is $X \rightarrow y$. We calculate the expected prediction error
\begin{align*}
EPE(h) &= E\left[L(y, h(x))\right] \\
&= \iint L(y, h(x))\, P(x, y) \,dx \,dy
\end{align*}
This is known as the \textbf{risk} associated with hypothesis $h$, denoted $R(h)$.
Our goal is to find a hypothesis $h^*$ among the class of functions for which the risk $R(h)$ is minimal. In mathematical terms,
$$h^* = \argmin_{h\in H} R(h).$$
In general, $R(h)$ cannot be computed because the distribution $P(x, y)$ is unknown to the learning algorithm. However, we can compute an empirical approximation for this risk by using the training set. The empirical risk formula is thus:
$$R_{emp}(h) = \frac{1}{N} \sum_{i=1}^N L\left(y^{(i)}, h(x^{(i)})\right)$$
and empirical risk minimization is to find the $\hat h$ that minimizes the empirical risk.
$$\hat h = \argmin_{h\in H} R_{emp}(h)$$
Note 1: There is no guarantee that $h^* = \hat h$
Note 2: The inductive bias of the algorithm restricts the hypothesis class. For instance, linear models restrict the hypothesis class to linear functions only. The restricted hypothesis class $H' \subseteq H$ may or may not contain $h^*$.
\subsection{Bias-variance decomposition}
Consider a regression problem with the least squares loss function:
$$L(y, h(x)) = (h(x) - y)^2$$
Then the hypothesis that minimizes risk is
$$h^* = \argmin_{h\in H} E\left[L(y, h(x))\right] = E[y|x]$$
This is the regression function. Let's start with the loss function and expand it:
\begin{align*}
\left\{ h(x) - y \right\}^2 &= \left\{ h(x) - E[y|x] + E[y|x] - y \right\}^2 \\
&= (h(x) - E[y|x])^2 + (E[y|x] - y)^2 + 2(h(x) - E[y|x])(E[y|x] - y)
\end{align*}
Thus we can write
\begin{align*}
E\left[ L(y, h(x)) \right] &= E\left[ \left\{ h(x) - y \right\}^2 \right]\\
&= E\left[(h(x) - E[y|x])^2 \right] + E\left[ (E[y|x] - y)^2 \right] + 2E\left[(h(x) - E[y|x])(E[y|x] - y) \right]
\end{align*}
There are three terms in the latter equation. Let us examine them in turn.
The first term is:
\begin{align*}
E\left[(h(x) - E[y|x])^2 \right] &= \int_x \int_{y|x} (h(x) - E[y|x])^2 \,P(y|x) \,dy \,P(x) \, dx\\
&= \int_x (h(x) - E[y|x])^2 \,P(x) \, dx
\end{align*}
(We can do this because the squared term is not a function of $y$.).
The second term is:
\begin{align*}
E\left[(E[y|x] - y)^2 \right] &= \int_x \int_{y|x} (E[y|x] - y)^2 \,P(y|x) \,dy \,P(x) \, dx\\
&= \int_x \int_{y|x} \left(E[y|x]^2 + y^2 - 2E[y|x]y\right) \,P(y|x) \,dy \,P(x) \, dx\\
&= \int_x \left( E[y|x]^2 + E[y^2|x] - 2E[y|x]^2 \right) \,P(x) \, dx\\
&= \int_x \left( E[y^2|x] - (E[y|x])^2 \right) \,P(x) \, dx\\
&= E_X\left[var(y|x)\right]
\end{align*}
and the third term is:
\begin{align*}
E\left[(h(x) - E[y|x])(E[y|x] - y) \right]
&= E_X E_{y|X} \left(h(x)E[y|x] - h(x)y - (E[y|x])^2 + yE[y|x]\right) \\
&= E_X \left(h(x)E[y|x] - h(x)E[y|x] - (E[y|x])^2 + E[y|x]^2\right) \\
&= E_X(0)\\
&= 0
\end{align*}
Therefore we can write that the expectation of the loss function is
$$E[L(y, h(x))] = \int_x (h(x) - E[y|x])^2 P(x)\,dx + \int_x var(y|x) P(x) \,dx$$
Note 1: The second term is the variance of the distribution $y$, averaged over $x$. It represents the intrinsic variability of the target data and can be regarded as noise. It is independent of $h(x)$, and is therefore an irreducible minimum value of the loss function.
Note 2: The first term is a function of $h(x)$. It is a minimum only when $h(x) = E[y|x]$:
$$(h(x) - E[y|x])^2$$
In practice, we have a dataset $D$ containing only a finite number $N$ of data points. We run our algorithm and find a hypothesis that we call $h(x; D)$. We can check the following:
\begin{align*}
E_D\left[\left\{ h(x; D) - E[y|x] \right\}^2\right] &=
\left\{ E_D[h(x; D)] - E[y|x] \right\}^2 + E_D\left[\left\{ h(x; D) - E_D[h(x; D)] \right\}^2\right] + 0 \\
&= (\text{bias})^2 + \text{variance}
\end{align*}
The \textbf{bias} is the extent to which the average prediction over all datasets differs from the desired regression function $E[y|x]$. The \textbf{variance} is the extent to which solutions for individual datasets vary around their average (the extent to which $h(x; D)$ is sensitive to the particular choice of dataset).
The key idea is
$$\text{Expected loss} = (\text{bias})^2 + \text{variance} + \text{noise}$$
where
\begin{align*}
(\text{bias})^2 &= \int \left\{E_D[h(x; D)] - E[y|x]\right\}^2 \,p(x)\,dx\\
\text{variance} &= \int E_D\left[\left\{h(x; D) - E_D[h(x;D)]\right\}^2\right] \,p(x)\,dx\\
\text{noise} &= \int \left\{E[y|x] - y\right\}^2 \,p(x,y)\,dx\,dy\\
\end{align*}
There is a tradeoff between bias and variance. Very flexible models have low bias but high variance. Rigid models have high bias and low variance. Generally, the more complex a model, the higher the variance and the lower the bias.
\subsubsection{Occam's razor}
``One should not increase, beyond what is necessary, the number of entities required to explain anything''. In other words, seek the simplest explanation. This is a kind of inductive bias: we make the assumption that simpler models are better.
\subsubsection{Regression example}
Suppose we have $L=100$ datasets, each containing $N=25$ data points sampled from a $\sin(2\pi x)$ curve. Our model uses 24 Gaussian basis functions.
Our average prediction is
$$\bar h(x) = \frac{1}{L} \sum_{l=1}^L h^{(l)}(x)$$
and
$$(\text{bias})^2 = \frac{1}{N} \sum_{n=1}^N \left\{\bar h(x^{(n)}) - E[y^{(n)}|x^{(n)}] \right\}^2$$
$$\text{variance} = \frac{1}{N} \sum_{n=1}^N \frac{1}{L} \sum_{l=1}^L \left\{h^{(l)}(x^{(n)}) - \bar h(x^{(n)})\right\}^2$$
As the model increases in complexity, we reach a point where the empirical risk is low, but the true risk is high. This is overfitting. Underfitting occurs when both the true and the empirical risk are high. The best model is the one that minimizes the true risk, i.e. doesn't overfit or underfit.
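A sketch of how these quantities can be estimated numerically (Python; the $\sin(2\pi x)$ data generator matches the lecture example, but the ridge-regularized degree-9 polynomial model is an illustrative stand-in for the 24 Gaussian basis functions):
\begin{lstlisting}[language=Python]
import numpy as np

rng = np.random.default_rng(3)
L, N, M, lam = 100, 25, 9, 1e-3       # datasets, points per dataset, degree, ridge strength
x_test = np.linspace(0, 1, 100)
f_true = np.sin(2 * np.pi * x_test)   # E[y|x] for this synthetic problem

preds = np.empty((L, len(x_test)))
for l in range(L):
    x = rng.uniform(0, 1, N)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)
    Phi = np.vander(x, M + 1, increasing=True)
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ y)
    preds[l] = np.vander(x_test, M + 1, increasing=True) @ w

h_bar = preds.mean(axis=0)                 # average prediction over the L datasets
bias2 = np.mean((h_bar - f_true) ** 2)     # (bias)^2
variance = np.mean((preds - h_bar) ** 2)   # variance of h(x; D) around its average
print("bias^2:", bias2, "variance:", variance)
\end{lstlisting}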
\subsection{Regularization}
Adding a regularization term to an error function helps control overfitting. The total error function can be written as:
$$E_D(w) + \lambda E_w(w)$$
where $\lambda$ is a \textbf{regularization coefficient} that controls the relative importance of $E_w(w)$.
For the following, we will use the case of the SSE loss function, for which
$$E_D(w) = (y-Xw)^T(y-Xw).$$
\subsubsection{L2 regularization}
With L2 regularization, also known as ridge regression or weight decay, $E_w(w) = w^Tw$. We want to minimize the following:
$$\min_w \, (y-Xw)^T(y-Xw) + \lambda w^Tw$$
Differentiating and setting to 0:
$$-2X^T(y-Xw)+2\lambda w = 0$$
$$(X^TX + \lambda I)w = X^Ty$$
$$w = (X^TX + \lambda I)^{-1} X^Ty$$
The solution adds a positive constant to the diagonal of $X^TX$ before inversion. This makes the problem non-singular even if $X^TX$ is not of full rank.
The unconstrained optimization problem (the minimization above) can be converted into a constrained optimization problem
$$\min_w \, (y-Xw)^T(y-Xw)$$
subject to $w^Tw \leq \eta$, where $\eta$ can be interpreted as the radius of a circle centred at the origin in the 2D case. One can show that
$$\eta \propto \frac{1}{\lambda}.$$
If we visualize a 2D parameter configuration (with $w_0$ and $w_1$), there is a circle of radius $\eta$ centred at the origin, and the unconstrained least-squares solution $\hat w$ lies somewhere outside of it. The constrained optimum $\hat w_{ridge}$ lies where the error contours around $\hat w$ first touch the constraint circle. If we increase the value of $\lambda$, we decrease $\eta$ and the size of the circle, and thus $\hat w_{ridge}$ will have a smaller magnitude.
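In code, the ridge solution differs from ordinary least squares only by the $\lambda I$ term added to $X^TX$ (Python sketch; \texttt{X} is assumed to be the $N\times p$ design matrix):
\begin{lstlisting}[language=Python]
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lambda I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
\end{lstlisting}
Increasing \texttt{lam} shrinks the magnitude of the returned weights, mirroring the shrinking constraint circle in the picture above.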
\subsubsection{L1 regularization}
With L1 regularization, also known as lasso regression, $E_w(w) = |w|$. We want to minimize the following:
$$\min_w \, (y-Xw)^T(y-Xw) + \lambda |w|.$$
There is no closed form solution to this, because the function is not differentiable at $w=0$. The corresponding constrained optimization problem is:
$$\min_w \, (y-Xw)^T(y-Xw)$$
subject to $|w| \leq \eta$ where $\eta \propto \frac{1}{\lambda}$. The 2D visualization replaces the constraint circle with a constraint square whose vertices lie on the axes. The solution $\hat w_{lasso}$ will typically lie on one of the vertices, which means one of the parameters will be zero.
As we increase the value of $\lambda$, $\eta$ decreases and hence more parameters $w_i$ will go to zero. We can say that L1 regularization performs feature selection by setting the weights of irrelevant features to zero. L1 regularization prefers sparse models.
\clearpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Linear classification}
Consider a $k$-class classification problem, where the classes are $c \in \{C_1, ..., C_k\}$. There are two stages in classification:
\begin{enumerate}
\item
\textbf{Inference stage}: Use training data to learn a model for $P(C_k|x)$ (the probability that data point $x$ belongs to class $C_k$).
\item
\textbf{Decision stage}: Use posterior probabilities to make optimal class assignments.
$$\hat C(x) = \argmin_{c\in C} \sum_{k=1}^k L(C_k, c) P(C_k|X=x)$$
where $L(C_k, c)$ is the price of assigning class $c$ to an observation that truly belongs to class $C_k$. If we use a 0/1 loss function, this becomes
\begin{align*}
\hat C(x) &= \argmin_{c\in C} [1-P(c|X=x)] \\
&= \argmax_{c\in C} [P(c|X=x)]
\end{align*}
which is the Bayes classifier (requiring the class probabilities).
\end{enumerate}
\subsection{Approaches to classification}
There are three approaches to solving classification problems: generative models, discriminative models, and discriminant-based models.
\subsubsection{Generative models}
In this approach, we solve the inference problem of determining the class-conditional densities $P(X|C_k)$ (the probability that the data belongs to a given class) for each class $C_k$ individually. We also infer the class probabilities $P(C_k)$. Then we use Bayes' theorem:
$$P(C_k|X) = \frac{P(C_k)P(X|C_k)}{P(X)}$$
where
$$P(X) = \sum_k P(X|C_k)P(C_k)$$
to get the posterior probabilities (of being in class $C_k$ given the data). It would also be possible to learn the joint distribution $P(X, C_k)$ and normalize to get $P(C_k|X)$.
Once $P(C_k|X)$ is found, use decision theory to determine the class. This is known as a generative model, because we can generate synthetic data points in the input space by using the learnt distribution.
\subsubsection{Discriminative models}
In this approach, we directly model $P(C_k|X)$ and then use decision theory to determine the class.
\subsubsection{Discriminant-based models}
In this approach, we find a function $f(X)$, called a discriminant function, that maps $X$ directly to class labels. There is no probability involved.
\subsection{Linearly separable problems}
Consider a two-class problem, with classes $C_1$ and $C_2$. If we plot the data points, they might cluster into two areas that can be separated by a linear decision surface. If such a decision surface can indeed separate the classes exactly, then the problem is said to be linearly separable.
A linear model for classification is one that uses, as a decision surface, a linear function of the input vector $X$; the decision surfaces are hence $(D-1)$-dimensional hyperplanes within the $D$-dimensional input space.
\subsection{Target variable for classification}
For regression, the target variable is $t \in \mathbb{R}$. For classification, the target variable is the class $c \in \{C_1, ..., C_k\}$.
In a two-class problem, we say $t=1$ represents class $C_1$, and $t=0$ represents class $C_2$. Therefore the target is $t \in \{0, 1\}$ and can be interpreted as the probability that the class is $C_1$, with the probability taking only extreme values of 0 and 1.
When we have $k>2$ classes, we cannot just consider $t \in \{1, 2, ..., k\}$, because this would create artificial distance between classes: $C_1$ would be closer to $C_2$ than $C_5$, for instance. The solution is to use a 1-of-$k$ coding scheme. Thus, if $k=5$ classes, then class 1 corresponds to $t_1 = (1, 0, 0, 0, 0)^T$, class 2 to $t_2 = (0, 1, 0, 0, 0)^T$, and so on. Each $t_k$ can be interpreted as the probability that the class is $C_k$.
\subsection{Discriminant functions}
Recall that a discriminant function takes an input vector $X$ and assigns it to one of $k$ classes, denoted $C_k$. If the function is a linear discriminant, then the decision surface is a hyperplane.
\subsubsection{Two-class scenario}
A simple linear discriminant function would be:
$$y(x) = w^Tx+w_0$$
where $w$ is the weight vector, and $w_0$ the bias. The decision boundary in this case is $y(x)=0$. If $y(x)\geq0$, then the data point is in class $C_1$; if $y(x)<0$, then the data point is in class $C_2$.
The bias $w_0$ can be considered as a threshold. For instance, if $w_0 = -7$, then $w^Tx$ has to be at least $+7$ for the point to be classified as $C_1$.
Consider two points, $x_A$ and $x_B$, both of which lie on the decision surface. We know that
$$y(x_A) = w^Tx_A + w_0 = 0$$
$$y(x_B) = w^Tx_B + w_0 = 0$$
and thus $w^T(x_A - x_B) = 0$. In other words, the $w$ vector is orthogonal to every vector lying on the decision surface---which means $w$ determines the orientation of the decision surface.
Therefore, any point $x'$ that is on the decision surface and closest to the origin can be represented as $x' = \alpha w$ for some scalar $\alpha$. Also, since $x'$ is on the decision surface:
$$w^Tx' + w_0 = 0$$
$$\alpha w^T w + w_0 = 0$$
$$\alpha = - \frac{w_0}{||w||^2}.$$
The distance from $x'$ to the origin is
\begin{align*}
||x'|| &= ||\alpha w|| \\
&= \alpha ||w|| \\
&= - \frac{w_0}{||w||^2}||w|| \\
&= - \frac{w_0}{||w||} \\
\end{align*}
Thus, while $w$ determines the orientation of the decision surface, $w_0$ determines its location. The value of $y(x)$ gives a signed measure of the perpendicular distance $\gamma$ of the point $x$ from the decision surface.
Consider an arbitrary point $x$ and let $x_\bot$ be its orthogonal projection onto the decision surface, so that
$$x = x_\bot + \gamma \frac{w}{||w||}$$
$$w^Tx = w^Tx_\bot + \gamma \frac{w^Tw}{||w||}$$
$$w^Tx + w_0 = w^Tx_\bot + w_0 + \gamma \frac{w^Tw}{||w||}$$
Notice that the latter equation contains both $y(x)$ and $y(x_\bot)$.
$$y(x) = y(x_\bot) + \gamma \frac{w^Tw}{||w||}$$
$$y(x) = 0 + \gamma ||w||$$
$$\gamma = \frac{y(x)}{||w||}.$$
In summary,
\begin{enumerate}
\item
The decision surface is perpendicular to $w$;
\item
The displacement of the decision surface is controlled by the bias parameter;
\item
The signed orthogonal distance of a general point $x$ from the decision surface is given by $\frac{y(x)}{||w||}$.
\end{enumerate}
\subsubsection{Multiple classes scenario}
There are three ways to do $k$-class classification: one-versus-the-rest, one-versus-one, and a single discriminant comprising $k$ linear functions.
With a one-versus-the-rest classifier, we train $k-1$ classifiers, each of which solves a two-class problem of separating points in class $C_k$ from points not in that class.
With a one-versus-one classifier, we need $k(k-1)/2$ binary discriminant functions, one for each possible pair of classes. Then we perform majority voting among the discriminant functions for classification.
With the third approach, consider $k$ linear functions of the form:
$$y_k(x) = w_k^Tx + w_{k0}.$$
If $y_k(x) > y_j(x)$ for all $j\neq k$, then $x$ belongs to $C_k$. The decision boundary between two classes $C_k$ and $C_j$ is
$$y_k(x) = y_j(x)$$
$$w_kx + w_{k0} = w_jx + w_{j0}$$
$$(w_k - w_j)^Tx + (w_{k0}-w_{j0}) = 0,$$
similar to the decision boundary for two classes. Therefore analogous geometrical properties apply.
Note: decision regions of such a discriminant are always singly connected and convex. Consider $x_A, x_B \in R_k$, where $R_k$ is the region corresponding to class $C_k$. Any point $\hat x$ that lies on the line connecting $x_A$ and $x_B$ can be expressed as
$$\hat x = \lambda x_A + (1-\lambda)x_B$$
with $0\leq\lambda\leq1$. By the linearity of the discriminant,
$$y_k(\hat x) = \lambda y_k(x_A)+ (1-\lambda) y_k(x_B)$$
We know that $y_k(x_A)>y_j(x_A)$ and $y_k(x_B)>y_j(x_B)$ for all $j\neq k$. Therefore $y_k(\hat x) > y_j(\hat x)$ and so $\hat x \in R_k$. Thus $R_k$ is singly connected and convex.
\clearpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Indicator regression, PCA, LDA}
\subsection{Least squares for classification}
Consider a classification problem with $k$ classes and a 1-of-$k$ binary coding scheme for the target vector $t$.
We know that least squares approximates the conditional expectation $E[t|X]$ of the target values given the input vector. For the binary coding scheme, this conditional expectation is given by the vector of posterior probabilities, so it makes sense to use least squares. However, there is no guarantee for the values to be in the $(0, 1)$ range.
Each class $C_k$ is described by its own linear model:
$$y_k(x) = w_k^T x + w_{k0} $$
for each class $k$. We can write this in vector notation:
$$Y(X) = \tilde W^T \tilde X$$
where $\tilde W$ is a matrix whose $k$th column comprises the $D+1$ dimensional vector $\tilde w_k = (w_{k0}, w_k^T)^T$, and $\tilde X$ is the augmented input vector $(1, X^T)^T$.
Consider the training dataset $\{X^{(n)}, t^{(n)}\}_{n=1}^N$. We define a matrix $T$ whose $n$th row is the vector $t^{(n)^T}$, together with a matrix $\tilde X$ whose $n$th row is $\tilde X^{(n)^T}$. The sum of squares error function is thus:
$$E_D(\tilde w) = \frac{1}{2} \text{Tr}\left\{(\tilde X \tilde W - T)^T (\tilde X \tilde W - T) \right\}$$
and the solution to this is
\begin{align*}
\tilde W &= (\tilde X^T \tilde X)^{-1} \tilde X^T T \\
&= \tilde X^+ T
\end{align*}
where $\tilde X^+$ is the pseudoinverse. The discriminant function, for a new augmented input $\tilde x$, is
$$y(x) = \tilde W^T \tilde x = T^T (\tilde X^+)^T \tilde x.$$
Note 1: If every target vector in the training set satisfies some linear constraint $a^T t^{(n)} + b = 0$ for some constants $a, b$, then the least squares model prediction for any value of $x$ will satisfy the same constraint so that $a^T y(x) + b = 0$. Since we use 1-of-$k$ encoding, the model's predictions will sum to 1 for any value of $x$. But the values are not constrained to be in the range $(0, 1)$.
Note 2: The least squares solution lacks robustness to outliers. If there are some members of a class that lie far from the main cluster, the decision boundary may end up within the main cluster, for instance. The sum of squares error function penalizes predictions that are ``too correct'' in that they lie a long way on the correct side of the decision boundary.
Note 3: In many cases, the least squares method doesn't work well. For instance, suppose there are three classes, with the green class data points located roughly between the blue and red classes. The classifier may assign only a very small region to the green class compared to the other two.
\subsection{Principal component analysis (PCA)}
Let $X \in \mathbb{R}^d$ be a random vector. We wish to find $k<d$ directions that capture as much as possible of the variance of $X$. Principal component analysis allows us to do that.
This is an example of unsupervised learning: there is no $y$. The applications of PCA include dimensionality reduction, lossy data compression, feature extraction, and data visualization.
\subsubsection{Formal definition}
We want $P\in \mathbb{R}^d$ (the direction) such that $||P|| = 1$, so as to maximize $var(P^T X)$. Why must the magnitude of $P$ be 1? Because otherwise we could maximize the variance by letting $||P||\rightarrow \infty$.
Let $\mu = EX$ and
$$S = cov(X) = E[(X-\mu)(X-\mu)^T].$$
For any $P \in \mathbb{R}^d$, the projection $P^TX$ has mean $E[P^TX] = P^T\mu$ and variance
\begin{align*}
var(P^TX) &= E[(P^TX - P^T\mu)^2] \\
&= E[P^T(X-\mu)(X-\mu)^TP] \\
&= P^TSP.
\end{align*}
The goal is to maximize the variance. To do this, we will use the spectral decomposition of $S$, which is $Q\Lambda Q^T$, where $Q$ is a matrix whose orthonormal columns are the eigenvectors of $S$, and $\Lambda$ is a diagonal matrix with the corresponding eigenvalues. Also note that $QQ^T = I$. We will also define $y=Q^TP$.
\begin{align*}
\max_{||P||=1} P^TSP &= \max_{P\neq 0} \frac{P^TSP}{P^TP} \\
&= \max_{P\neq 0} \frac{P^TQ\Lambda Q^TP}{P^TQQ^TP} \\
&= \max_{y\neq 0} \frac{y^T\Lambda y}{y^Ty} \\
&= \max_{y\neq 0} \frac{\lambda_1y_1^2 + ... + \lambda_dy_d^2}{y_1^2 + ... + y_d^2} \\
&\leq \lambda_1
\end{align*}
where equality is attained in the last step when $y=e_1$, i.e. $P=Qe_1 = u_1$, where $u_1$ is the first eigenvector of $S$. The first principal component is the first eigenvector of $cov(X)$.
Similarly, the $k$-dimensional subspace that captures as much of the variance of $X$ as possible is simply the subspace spanned by the top $k$ eigenvectors of $cov(X)$: $u_1, ..., u_k$.
\subsubsection{Procedure}
\begin{enumerate}
\item
Compute the mean $\mu$ and the covariance matrix $S$ of the data $X$.
\item
Compute the top $k$ eigenvectors $u_1, ..., u_k$ of $S$.
\item
Project $X$ onto $P^TX$, where $P^T$ is the $k\times d$ matrix whose rows are $u_1, ..., u_k$ (see the sketch after this list).
\end{enumerate}
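A Python/NumPy sketch of the procedure above, assuming the rows of the array \texttt{X} are the data points:
\begin{lstlisting}[language=Python]
import numpy as np

def pca(X, k):
    """Top-k principal directions of the rows of X, and the projected data."""
    mu = X.mean(axis=0)                       # step 1: mean
    S = np.cov(X, rowvar=False)               # step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)      # step 2: eigendecomposition (ascending)
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest eigenvalues
    P = eigvecs[:, order].T                   # (k, d) matrix whose rows are u_1, ..., u_k
    return P, (X - mu) @ P.T                  # step 3: project the centred data
\end{lstlisting}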
\subsubsection{Can we use PCA dimensions for classification?}
No, not the raw method: projecting the data onto the first principal component collapses the classes instead of discriminating them. A better solution is to find the projection that maximizes the class separation. This is the idea behind Fisher's linear discriminant, or LDA.
\subsection{Linear discriminant analysis (LDA)}
Consider a two-class problem in which there are $N_1$ points of class $C_1$, and $N_2$ points of class $C_2$. The mean vectors of $C_1$ and $C_2$ are:
$$m_1 = \frac{1}{N_1} \sum_{n\in C_1} X^{(n)}$$
$$m_2 = \frac{1}{N_2} \sum_{n\in C_2} X^{(n)}$$
The idea is to choose $w$ so as to maximize $w^Tm_2 - w^Tm_1$, i.e. find the parameters that give the largest difference between the classes. So that the expression cannot be made arbitrarily large by increasing the magnitude of $w$, we constrain $||w||=1$. Thus:
$$\max_w w^Tm_2 - w^Tm_1$$
$$\text{subject to }w^Tw = 1$$
We can write the Lagrangian formulation of the problem:
$$L = w^T(m_2 - m_1) + \lambda(w^Tw - 1)$$
$$\frac{\partial L}{\partial w} = m_2 - m_1 + 2 \lambda w = 0$$
$$w = -\frac{m_2 - m_1}{2\lambda}$$
$$w \propto (m_2 - m_1)$$
With this approach, two classes that are well separated in the original space may have considerable overlap in the projected space. This is due to the strongly nondiagonal covariances of class distributions.
\subsubsection{Solution proposed by Fisher}
Maximize a function that will give a large separation between the projected class means, while also giving a small variance within each class, thereby minimizing class overlap.
We can express the within-class variance of the transformed data, for class $C_k$, as:
$$s_k^2 = \sum_{n\in C_k} (w^Tx^{(n)} - w^T m_k)^2$$
and thus the total within-class variance, for two classes, is $s_1^2 + s_2^2$. The Fisher criterion for classification is:
$$J(w) = \frac{(w^Tm_2 - w^Tm_1)^2}{s_1^2 + s_2^2}$$
Let's rewrite both the numerator and the denominator of this expression.
\begin{align*}
(w^Tm_2 - w^Tm_1)^2 &= (w^T(m_2 - m_1))^2 \\
&= w^T(m_2 - m_1)(m_2 - m_1)^Tw\\
&= w^TS_Bw
\end{align*}
where $S_B$ is the between-class covariance matrix.
\begin{align*}
s_1^2 + s_2^2 &= \sum_{n\in C_1} \left(w^T(x^{(n)} - m_1)\right)^2
+ \sum_{k\in C_2} \left(w^T(x^{(k)} - m_2)\right)^2 \\
&= \sum_{n\in C_1} w^T (x^{(n)} - m_1)(x^{(n)} - m_1)^Tw
+ \sum_{k\in C_2} w^T (x^{(k)} - m_2)(x^{(k)} - m_2)^Tw \\
&= w^TS_ww
\end{align*}
where
$$S_w = \sum_{n\in C_1} (x^{(n)} - m_1)(x^{(n)} - m_1)^T + \sum_{k\in C_2} (x^{(k)} - m_2)(x^{(k)} - m_2)^T.$$
We get a new formulation of the Fisher criterion:
$$J(w) = \frac{w^TS_Bw}{w^TS_ww}$$
which we can differentiate and set to 0 to find the maximum.
$$(w^TS_Bw)S_ww - (w^TS_ww)S_Bw = 0$$
$$(w^TS_Bw)S_ww = (w^TS_ww)S_Bw$$
We know that $S_B = (m_2-m_1)(m_2-m_1)^T$. Therefore, $S_Bw$ is always in the direction of $(m_2-m_1)$. Since we do not care about the magnitude of $w$, only about its direction, we can drop the scalars $(w^TS_Bw)$ and $(w^TS_ww)$. We get
$$S_ww \propto (m_2 - m_1)$$
$$w \propto S_w^{-1}(m_2 - m_1).$$
Note: If the within-class covariance is isotropic, so that $S_w$ is proportional to the identity matrix, then $w$ is proportional to the difference of the class means.
This model is known as Fisher's linear discriminant, although strictly speaking it is not a discriminant but a specific choice of direction for projecting the data. The projected data can then be used to construct a discriminant by choosing a threshold $\theta$ such that $x$ is assigned to $C_1$ if $w^Tx \geq \theta$ and to $C_2$ if $w^Tx < \theta$. The threshold $\theta$ can be found by minimizing the training error.
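A minimal NumPy sketch of Fisher's direction, assuming the points of the two classes are stored as rows of matrices \texttt{X1} and \texttt{X2} (the names and the use of \texttt{np.linalg.solve} are my own choices):
\begin{verbatim}
import numpy as np

def fisher_direction(X1, X2):
    """Return the direction w proportional to S_w^{-1} (m2 - m1)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter S_w: sum of the scatter matrices of both classes
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_w, m2 - m1)      # solves S_w w = (m2 - m1)
    return w / np.linalg.norm(w)           # only the direction matters
\end{verbatim}
Compared with the naive choice $w \propto (m_2 - m_1)$, the factor $S_w^{-1}$ rotates the projection away from directions of large within-class variance, which is what reduces class overlap. A threshold $\theta$ on $w^Tx$ can then be chosen on the training data, as described above.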
\clearpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{GLMs, GDA, evaluation metrics}
\subsection{Generalized linear models}
In linear regression, the model prediction $y(x; w)$ was given by a linear function of the parameters $w$. In the simplest case, the model is also linear in the input variables:
$$y(x) = w^Tx+w_0$$
where $y(x)$ is a real number.
For classification, we wish to predict discrete class labels or posterior probabilities that lie in the range (0, 1). To do this, we transform the linear function of $w$ using a non-linear function $f(.)$ (called an \textbf{activation function}) so that
$$y(x) = f(w^Tx + w_0)$$
Note that this model is no longer linear in the parameters, because $f(.)$ is non-linear.
The decision surface is given by
$$y(x) = \text{ constant}$$
$$f(w^Tx + w_0) = \text{ constant}$$
$$w^Tx + w_0 = f^{-1}(\text{constant})$$
$$w^Tx + w_0 = \text{constant}$$
Therefore, decision surfaces are linear functions of $x$, even if $f(.)$ is non-linear. Models of this type are called \textbf{generalized linear models} (GLMs). They have more complex analytical and computational properties than linear models, but they are still simpler than fully general nonlinear models.
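For example, with the logistic sigmoid as the activation function $f$, a GLM prediction and its (linear) decision surface could be sketched as follows; this is my own illustration, not a prescribed implementation.
\begin{verbatim}
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def glm_predict(x, w, w0):
    """y(x) = f(w^T x + w_0) with f the logistic sigmoid."""
    return sigmoid(w @ x + w0)

# The decision surface y(x) = 0.5 is exactly the set w^T x + w_0 = 0,
# a linear function of x even though the sigmoid is non-linear.
\end{verbatim}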
\subsection{Probabilistic generative models}
These GLMs perform the classification task by modelling the class-conditional densities, $P(X|C_k)$, and the class priors $P(C_k)$, and then use Bayes' theorem to compute posterior probabilities $P(C_k|X)$, i.e. the probability of being in class $C_k$ given data point $x$.
\subsubsection{2-class problem}
Let's consider a 2-class problem, with classes $C_1$ and $C_2$. We have:
\begin{align*}
P(C_1|x) &= \frac{P(X|C_1)P(C_1)}{P(X)}\\
&= \frac{P(X|C_1)P(C_1)}{P(X|C_1)P(C_1)+P(X|C_2)P(C_2)}\\
&= \frac{1}{1 + \frac{P(X|C_2)P(C_2)}{P(X|C_1)P(C_1)}}\\
&= \frac{1}{1 + \exp(-\ln\frac{P(X|C_1)P(C_1)}{P(X|C_2)P(C_2)})}\\
&= \frac{1}{1 + \exp(-a)}\\
&= \sigma(a)
\end{align*}
where
$$a = \ln\left(\frac{P(x|C_1)P(C_1)}{P(x|C_2)P(C_2)}\right)$$
and $\sigma(a) = \frac{1}{1+e^{-a}}$ is the logistic sigmoid function.
The sigmoid function has range (0, 1) and the symmetry property $\sigma(-a) = 1 - \sigma(a)$. The inverse of the sigmoid is called the logit function:
$$a = \ln\left(\frac{\sigma}{1-\sigma}\right).$$
It is a log of the ratio of probabilities. The expression
$$\ln\left(\frac{P(X|C_1)}{P(X|C_2)}\right)$$
for the two classes is also known as the log odds.
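As a quick numerical sanity check (my own addition), the logit inverts the sigmoid and the symmetry property holds:
\begin{verbatim}
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
logit = lambda s: np.log(s / (1.0 - s))

a = 2.0
print(logit(sigmoid(a)))              # 2.0 (up to floating-point error)
print(sigmoid(-a), 1.0 - sigmoid(a))  # equal: sigma(-a) = 1 - sigma(a)
\end{verbatim}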
\subsubsection{$k$-class problem, $k>2$}
With several classes, Bayes' theorem looks like this:
\begin{align*}
P(C_k|X) &= \frac{P(X|C_k)P(C_k)}{\sum_j P(X|C_j) P(C_j)} \\
&= \frac{\exp(a_k)}{\sum_j \exp(a_j)}
\end{align*}
where $a_k = \ln (P(X|C_k)P(C_k))$. This is the normalized exponential, a multiclass generalization of the logistic sigmoid. It is also called the \textbf{softmax function}, as it represents a smoothed version of the `max' function: if $a_k \gg a_j$ for all $j\neq k$, then $P(C_k|X) \approx 1$ and $P(C_j|X) \approx 0$.
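A small sketch of the softmax that shows this smoothed-max behaviour (the subtraction of $\max_j a_j$ is a standard numerical-stability trick I have added; it is not part of the definition):
\begin{verbatim}
import numpy as np

def softmax(a):
    """Normalized exponential: exp(a_k) / sum_j exp(a_j)."""
    e = np.exp(a - np.max(a))   # subtracting max(a) leaves the result unchanged
    return e / e.sum()

print(softmax(np.array([10.0, 1.0, 0.5])))
# the first class gets probability close to 1, the others close to 0
\end{verbatim}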
\subsection{Gaussian discriminant analysis (GDA)}
This is also known as probabilistic LDA. The assumptions are:
\begin{enumerate}
\item
Class conditional densities are Gaussian.
\item
All classes share the same covariance matrix.
\end{enumerate}
The class-conditional density for class $C_k$ is given by:
$$P(x|C_k) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp \left\{-\frac{1}{2} (X-\mu_k)^T \Sigma^{-1}(X-\mu_k)\right\}$$
where $\mu_k$ is the mean, $\Sigma$ is the covariance matrix, and $|\Sigma|$ is the determinant of the covariance matrix.
\subsubsection{Two-class case}
$P(C_1|X) = \sigma(a)$ where
\begin{align*}
a &= \ln\left(\frac{P(X|C_1)P(C_1)}{P(X|C_2)P(C_2)}\right)\\
&= \ln\left(\frac{P(X|C_1)}{P(X|C_2)}\right) + \ln\left(\frac{P(C_1)}{P(C_2)}\right) \\
&= \ln \exp \left\{-\frac{1}{2} (X-\mu_1)^T \Sigma^{-1}(X-\mu_1) +\frac{1}{2} (X-\mu_2)^T \Sigma^{-1}(X-\mu_2) \right\} + \ln\left(\frac{P(C_1)}{P(C_2)}\right) \\
&= -\frac{1}{2} (X^T\Sigma^{-1}X - X^T\Sigma^{-1}\mu_1 - \mu_1^T\Sigma^{-1}X + \mu_1^T\Sigma^{-1}\mu_1 \\
&\qquad- X^T\Sigma^{-1}X + X^T\Sigma^{-1}\mu_2 + \mu_2^T \Sigma^{-1}X - \mu_2^T\Sigma^{-1}\mu_2) + \ln\left(\frac{P(C_1)}{P(C_2)}\right)\\
&= X^T\Sigma^{-1}\mu_1 - X^T\Sigma^{-1}\mu_2 - \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 + \ln\left(\frac{P(C_1)}{P(C_2)}\right) \\
&= X^T(\Sigma^{-1}(\mu_1-\mu_2)) - \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 + \ln\left(\frac{P(C_1)}{P(C_2)}\right) \\
&= w^TX + w_0
\end{align*}
where
$$w = \Sigma^{-1}(\mu_1 - \mu_2)$$
$$w_0 = - \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 + \ln\left(\frac{P(C_1)}{P(C_2)}\right)$$
We have found a way to write $a$ in $P(C_1|x) = \sigma(a)$. Therefore we can write:
$$P(C_1|X) = \sigma(w^TX + w_0).$$
Note: the quadratic terms in $X$ from the exponents of the Gaussian densities have cancelled because of the assumption of a common covariance matrix.
The model is a linear function of $x$. Therefore, the decision boundary is also a linear function of $x$. The prior probabilities $P(C_k)$ enter the model only through the bias parameter $w_0$. Changes in prior information have the effect of making parallel shifts of the decision boundary.
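Given estimates of $\mu_1$, $\mu_2$, $\Sigma$ and the priors, the posterior derived above can be evaluated directly. A minimal sketch (the function name and argument order are my own):
\begin{verbatim}
import numpy as np

def gda_posterior(X, mu1, mu2, Sigma, p1, p2):
    """P(C_1 | x) = sigma(w^T x + w_0) for the shared-covariance model."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(p1 / p2))
    a = X @ w + w0                     # one value per row of X
    return 1.0 / (1.0 + np.exp(-a))
\end{verbatim}
Note that changing the priors \texttt{p1} and \texttt{p2} only changes \texttt{w0}, i.e. it shifts the linear decision boundary in parallel, consistent with the remark above.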
\subsubsection{Learning the parameters}
The parameters of the GDA model are the class means $\mu_1$ and $\mu_2$, the covariance matrix $\Sigma$, and the prior probabilities $P(C_1)$ and $P(C_2)$. How can we learn these parameters?
Consider a dataset $\mathcal{D} = \left\{X^{(n)}, t^{(n)}\right\}_{n=1}^N$ of $N$ examples. The target variable is $t^{(n)}=1$ for class $C_1$ and $t^{(n)}=0$ for class $C_2$. Also let the prior probabilities be $P(C_1) = \pi$ and $P(C_2) = 1 - \pi$.
Consider data point $x^{(n)}$. The joint probabilities of this data point being in classes $C_1$ and $C_2$, respectively, are:
$$P(X^{(n)}, C_1) = P(C_1) P(X^{(n)}|C_1) = \pi \mathcal{N}(X^{(n)}|\mu_1, \Sigma)$$
$$P(X^{(n)}, C_2) = P(C_2) P(X^{(n)}|C_2) = (1-\pi) \mathcal{N}(X^{(n)}|\mu_2, \Sigma)$$
where $\mathcal{N}$ is the normal (Gaussian) distribution. The likelihood of the data point $x^{(n)}, t^{(n)}$ is given by putting these together:
$$\left[\pi \mathcal{N}(x^{(n)}|\mu_1, \Sigma)\right]^{t^{(n)}} \left[(1-\pi) \mathcal{N}(x^{(n)}|\mu_2, \Sigma)\right]^{1-t^{(n)}}$$
Assuming that the examples are independent and identically distributed (i.i.d.), we can therefore write the likelihood of the entire dataset as:
$$P(\mathcal{D} | \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^N \left[\left[\pi \mathcal{N}(X^{(n)}|\mu_1, \Sigma)\right]^{t^{(n)}} \left[(1-\pi) \mathcal{N}(X^{(n)}|\mu_2, \Sigma)\right]^{1-t^{(n)}}\right]$$
This is a \textbf{likelihood function}. The goal of GDA is to find the parameters that maximize the likelihood function.
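Before deriving the maximizing parameters, it may help to see the likelihood as something we can simply evaluate for a candidate parameter setting. A sketch, assuming \texttt{scipy.stats.multivariate\_normal} is available (my own illustration):
\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

def gda_log_likelihood(X, t, pi, mu1, mu2, Sigma):
    """Log-likelihood of the dataset; t[n] = 1 for class C_1, 0 for C_2."""
    p1 = multivariate_normal.pdf(X, mean=mu1, cov=Sigma)  # N(x | mu_1, Sigma)
    p2 = multivariate_normal.pdf(X, mean=mu2, cov=Sigma)  # N(x | mu_2, Sigma)
    return np.sum(t * np.log(pi * p1) + (1 - t) * np.log((1 - pi) * p2))
\end{verbatim}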
\subsubsection{Maximum likelihood solution}