- Title: (cs229) Lecture 2: Linear Regression and Gradient Descent
- Link: http://cs229.stanford.edu/notes2020fall/notes2020fall/cs229-notes1.pdf
- Keywords: Machine Learning, Bayesian Inference, Maximum Likelihood Estimation, Linear Regression
-
Definition of Machine Learning:
Finding a model and its parameters so that the resulting predictor performs well on unseen data
-
Cost function: $J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ (1)
where $\theta$ are the parameters, $x^{(i)}$ are the training examples, and $y^{(i)}$ are the targets.
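As a minimal sketch, (1) can be computed as follows, assuming a linear hypothesis $h_\theta(x) = \theta^T x$ and NumPy arrays for the data (the names here are illustrative, not from the notes):

```python
import numpy as np

def cost(theta, X, y):
    """Least-squares cost J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2.

    X     : (n, d) array, one training example per row
    y     : (n,) array of targets
    theta : (d,) parameter vector, hypothesis h_theta(x) = theta @ x
    """
    residuals = X @ theta - y  # h_theta(x^(i)) - y^(i) for every i
    return 0.5 * np.sum(residuals ** 2)
```
-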
Section 1: LMS algorithm
- Gradient Descent: $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$. For the cost function (1) this becomes $\theta_j := \theta_j + \alpha \sum_{i=1}^{n}\left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$ (for every $j$). This is called batch gradient descent because each update uses the entire training set.
On the other hand, if you update using one training example at a time, as in the loop below, you're using stochastic gradient descent. In either case you have to update all parameters at the same time, i.e., you can't update the first element of $\theta$ before computing the update for the second. (A code sketch of both variants follows the loop.)
Loop{
    for i = 1 to n,
    {
        $\theta_j := \theta_j + \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$ (for every $j$)
    }//for
}//loop
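A minimal NumPy sketch of both update rules, assuming the linear hypothesis $h_\theta(x) = \theta^T x$ and a fixed learning rate (function and variable names are illustrative):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iters=1000):
    """Each update uses all n training examples."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        # simultaneous update of every theta_j
        theta = theta + alpha * X.T @ (y - X @ theta)
    return theta

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=10):
    """One parameter update per training example."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):  # "for i = 1 to n"
            theta = theta + alpha * (y[i] - X[i] @ theta) * X[i]
    return theta
```
Batch gradient descent sums over the whole training set before each step, while the stochastic version updates after every single example, so it can start making progress much sooner when $n$ is large.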
-
Section 2: The normal equations
Using matrix notation, you can also rewrite (1) as $J(\theta) = \frac{1}{2}(X\theta - \vec{y})^T(X\theta - \vec{y})$, where $X$ is the design matrix whose rows are the training inputs and $\vec{y}$ is the vector of targets.
Then you can take the gradient with respect to $\theta$ and set it to zero, $\nabla_\theta J(\theta) = X^T X\theta - X^T\vec{y} = 0$, which gives the normal equations $X^T X\theta = X^T\vec{y}$; the $\theta = (X^T X)^{-1}X^T\vec{y}$ that solves them minimizes the cost function.
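A minimal NumPy sketch that solves $X^T X\theta = X^T\vec{y}$ directly (this assumes $X^T X$ is invertible; names are illustrative):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least-squares fit: solve X^T X theta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```
Solving the linear system with `np.linalg.solve` is generally preferred over forming $(X^T X)^{-1}$ explicitly, for numerical stability.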
-
Section 3: Probabilistic Interpretation
When approaching a regression problem, why bother using specifically the least-squares cost function J?
Let's redefine the relation between the inputs and target variables as follows: $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$, where $\epsilon^{(i)}$ is an error term that represents, e.g., random noise. We assume the errors are i.i.d. and follow a Normal (Gaussian) distribution: $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$.
Then we can write $p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$ (2)
Interpreting (2) as a function of $\theta$ for the fixed training data, we can instead call it the likelihood function: $L(\theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta)$.
According to maximum likelihood, we should choose the $\theta$ that makes the data as high probability as possible. For convenience of calculation, we maximize the log likelihood instead: $\ell(\theta) = \log L(\theta) = n \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{n}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$.
To make it the maximum, we need to minimize $\frac{1}{2}\sum_{i=1}^{n}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$, which is exactly the least-squares cost $J(\theta)$; under the Gaussian noise assumption, least-squares regression is maximum likelihood estimation of $\theta$.
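As a quick numerical sketch of this equivalence, the snippet below minimizes the negative log likelihood with SciPy and compares the result to the normal-equation solution (the data, noise level, and names are made up for illustration; it assumes SciPy is available):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_theta = np.array([1.0, -2.0, 0.5])
sigma = 0.1
y = X @ true_theta + sigma * rng.normal(size=50)

def neg_log_likelihood(theta):
    # -log L(theta), dropping the constant n*log(1/(sqrt(2*pi)*sigma))
    return 0.5 / sigma**2 * np.sum((y - X @ theta) ** 2)

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(3)).x
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(theta_mle, theta_ls, atol=1e-4))  # expected: True
```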