- Bias Variance Decomposition
- Assumptions and intuition
- when are the assumptions less important? (when the goal is prediction rather than inference/analysis)
- geometrically, OLS is a projection of **y** onto the column space of **X**
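A compact statement of the two points above, assuming the standard setup $$y = f(x) + \epsilon$$ with $$E[\epsilon] = 0$$ and $$\mathrm{Var}(\epsilon) = \sigma^2$$ (notation assumed, not from the original notes):

$$ E\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathrm{Var}\big(\hat{f}(x)\big)}_{\text{variance}} + \sigma^2 $$

and the fitted values are the orthogonal projection of $$\mathbf{y}$$ onto the column space of $$\mathbf{X}$$:

$$ \hat{\mathbf{y}} = \mathbf{X}(\mathbf{X^T X})^{-1} \mathbf{X^T y} = \mathbf{H y}, \qquad \mathbf{H^2} = \mathbf{H} = \mathbf{H^T} $$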
- diagnostics and treatments (sketch after this list)
- fitted and residual plot
- heteroskedasticity
- collinearity
- regularization via Lasso
- PCA (matrix transformation)
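A minimal diagnostics sketch; the toy data, coefficients, and helper name `vif` are assumptions for illustration. It produces a fitted-vs-residual plot (a funnel shape suggests heteroskedasticity) and variance inflation factors to flag collinearity.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

# OLS via the normal equations (intercept added as a column of ones)
Xd = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)
fitted = Xd @ beta
resid = y - fitted

# Fitted vs. residual plot: a funnel shape suggests heteroskedasticity
plt.scatter(fitted, resid, s=10)
plt.axhline(0, color="red")
plt.xlabel("fitted"); plt.ylabel("residual")
plt.show()

# Variance inflation factor: VIF_j = 1 / (1 - R^2_j); large values flag collinearity
def vif(X, j):
    others = np.delete(X, j, axis=1)
    others = np.column_stack([np.ones(len(others)), others])
    b = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
    r2 = 1 - np.sum((X[:, j] - others @ b) ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)

print([round(vif(X, j), 2) for j in range(X.shape[1])])
```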
- MLE and OLS
- compare the differences
- proof: MLE estimate <=> minimum-RSS estimate (derivation after this list)
- use the p.d.f. of $$\epsilon$$, conditioned on $$\beta$$
- Derive
- regressing y on x vs. x on y (note which variable's variance the errors are averaged over; the two fitted slopes differ)
- single variable: maximizing the likelihood = minimizing RSS = $$\sum_i (y_i - (\beta_1 x_i + \beta_0))^2$$
- multi-variable: $$\mathbf{w^*} = (\mathbf{X^T X})^{-1} \mathbf{X^T y}$$
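A sketch of the proof item above, assuming i.i.d. Gaussian noise $$\epsilon_i \sim N(0, \sigma^2)$$:

$$ \log L(\boldsymbol\beta) = \sum_i \log \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \mathbf{x_i^T}\boldsymbol\beta)^2}{2\sigma^2}\right) = \text{const} - \frac{1}{2\sigma^2} \sum_i (y_i - \mathbf{x_i^T}\boldsymbol\beta)^2 $$

so maximizing the log-likelihood is exactly minimizing the RSS; setting the gradient of the RSS to zero gives the normal equations $$\mathbf{X^T X w} = \mathbf{X^T y}$$ and hence the closed form above.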
- Loss Function
- proof: minimizing the loss gives the solution to the linear problem (MLE/OLS)
- BLUE (best linear unbiased estimator; Gauss-Markov theorem)
- Other loss functions
- Absolute Loss
- Huber Loss Function
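A minimal sketch of the two robust losses above; the `delta` value is an illustrative default, not from the notes.

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails (robust to outliers).
    `delta` is the transition point between the two regimes."""
    r = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear)

def absolute_loss(residual):
    return np.abs(residual)

print(huber_loss(np.array([-3.0, -0.5, 0.0, 0.5, 3.0])))
```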
- Regularization
- Ridge, Lasso (sparsity)
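A minimal sketch, with assumed toy data and penalty strengths, contrasting the two: the L1 penalty (Lasso) drives some coefficients exactly to zero (sparsity), while the L2 penalty (Ridge) only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features are informative
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge nonzero coefs:", np.sum(np.abs(ridge.coef_) > 1e-6))  # typically all 10
print("lasso nonzero coefs:", np.sum(np.abs(lasso.coef_) > 1e-6))  # typically just a few
```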
- Intuition: why do we have it
- different contribution of large/small data
- exponential family and penalization
- assumption - Bernoulli distribution
- odds, log odds
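The odds / log-odds relation in equations (standard logistic-regression notation assumed): the logit is modeled as a linear function of x, which inverts to the sigmoid.

$$ \text{odds} = \frac{p}{1-p}, \qquad \log\frac{p}{1-p} = \mathbf{w^T x} \iff p = \sigma(\mathbf{w^T x}) = \frac{1}{1 + e^{-\mathbf{w^T x}}} $$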
- Loss function: deriving the parameter estimates
- MLE: y follows a Bernoulli distribution (cross-entropy loss)
- implement gradient descent (sketch after this list)
- cross-entropy loss (Bernoulli MLE loss)
- Mean Square Loss
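A minimal gradient-descent sketch for logistic regression under the cross-entropy (Bernoulli MLE) loss; the toy data, learning rate, and iteration count are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    X = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / len(y)           # gradient of the mean cross-entropy loss
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X @ np.array([1.5, -2.0]) + 0.5 > 0).astype(float)
print(fit_logistic(X, y))
```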
- Confusion Matrix and Model Evaluation
- Precision, Recall
- Accuracy
- ROC Curve
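A minimal sketch of the evaluation metrics above, using made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             accuracy_score, roc_auc_score)

y_true  = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.6, 0.55])
y_pred  = (y_score >= 0.5).astype(int)

print(confusion_matrix(y_true, y_pred))                  # rows = true class, cols = predicted
print("precision:", precision_score(y_true, y_pred))     # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("accuracy: ", accuracy_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_score))      # threshold-free ranking quality
```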
- how to do multi-class classification
- multiple binary classifications (one-vs-rest)
- softmax
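A minimal softmax sketch, numerically stabilized by subtracting the row maximum before exponentiating:

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis; the max-subtraction avoids overflow in exp."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

print(softmax(np.array([2.0, 1.0, 0.1])))  # class probabilities, sums to 1
```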
- Algo
- ID3
- C4.5
- CART
- Hyperparameters
- max_depth, min_samples_split, min_samples_leaf, max_leaf_nodes
- Loss
- entropy
- Gini index (impurity sketch after this section)
- Advantages
- interaction handling
- insensitivity to outliers
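A minimal sketch of the two impurity criteria from the Loss items above:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a class-probability vector (0 log 0 treated as 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini index: expected misclassification rate under random labeling."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(entropy([0.5, 0.5]), gini([0.5, 0.5]))   # maximal impurity for two classes
print(entropy([1.0, 0.0]), gini([1.0, 0.0]))   # pure node
```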
- Random Forest
- bagging
- bagging of data
- bagging of features
- rule of thumb: select sqrt(k) features per split (sketch after this list)
- feature importance calculation
- out-of-bag performance
- Advantages and Disadvantages
- naturally parallel (embarrassingly parallel algorithm)
- feature importance
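A minimal random-forest sketch (toy data assumed) showing the sqrt(k) feature rule, out-of-bag scoring, and impurity-based feature importances via scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, n_informative=4, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",   # rule of thumb: consider sqrt(k) features at each split
    oob_score=True,        # evaluate each tree on the samples it did not see (out-of-bag)
    random_state=0,
).fit(X, y)

print("OOB accuracy:", rf.oob_score_)
print("top features:", rf.feature_importances_.argsort()[::-1][:4])
```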
- Hinge Loss
- Kernel Trick
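A minimal hinge-loss sketch, assuming labels in {-1, +1}:

```python
import numpy as np

def hinge_loss(y, score):
    """Hinge loss: zero once the margin y * score exceeds 1, linear otherwise."""
    return np.maximum(0.0, 1.0 - y * score)

print(hinge_loss(np.array([1, 1, -1]), np.array([2.0, 0.3, 0.5])))  # -> [0.0, 0.7, 1.5]
```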
- No Training stage
- the higher K is, the more robust the model tends to be
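A minimal sketch (toy data assumed) of how increasing K changes KNN behavior: larger K smooths the decision boundary, which is typically more robust to noise at the cost of more bias.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=5, random_state=0)

# Compare cross-validated accuracy for a few illustrative values of k
for k in (1, 5, 25):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k={k:2d}  cv accuracy={score:.3f}")
```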