ML Basic #6 | Conventional Models

2018-06-22 1. Machine Learning Comments

Introduction

Regression
- Linear Regression
Classification
- Logistic Regression
- SVM
- KNN
- Decision Tree
- Random Forest / Boosting
Clustering
- K-means
Dimensionality Reduction
- PCA
RL basics

1.1 Linear Regression

Introduction : Models a linear relationship between input features and a continuous target variable using:

$$ \hat{y} = w^T x + b = \theta^T x \quad (\text{with } x_0 = 1) $$
Methods
- Objective (Ordinary Least Squares) : minimize the sum of squared residuals (the vertical deviation from each point to the line).
  
  $$ J(\theta) = \frac{1}{2n} \sum_{i=1}^n (y_i - \theta^T x_i)^2 $$
- Solving for $\theta$
  - Closed form (Normal Equation) : $\theta = (X^T X)^{-1} X^T y$.
    - Pros: exact, no learning-rate tuning, no iteration.
    - Cons: $O(d^3)$ to invert $X^T X$; numerically unstable when features are collinear (singular $X^T X$). In practice use SVD or QR via np.linalg.lstsq, not the inverse.
  - Gradient Descent : iterative; scales to large $n$ and $d$ where the closed form is infeasible.
Discussion
- Probabilistic view : OLS is the MLE under the assumption that residuals are i.i.d. Gaussian — i.e. minimizing MSE equals maximizing the Gaussian log-likelihood. (Same derivation we used in ML02 §1.)
- Regularization (penalize large weights to combat overfitting and collinearity)
  - Ridge (L2) : $J + \lambda |w|_2^2$ — shrinks all weights toward zero; well-conditioned even with collinear features.
  - Lasso (L1) : $J + \lambda |w|_1$ — induces sparsity, useful as feature selection.
  - Elastic Net : convex combination of L1 + L2.
- Evaluation : RMSE / MAE for absolute error, $R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$ for proportion of variance explained. Always inspect residual plots — patterned residuals signal model mis-specification.

1. K-Nearest Neighbors

Introduction : a method that predicts a given data based on the nearest K neigbors in existing data.
Method : If regression, return mean value of k nearest neibors, if classification, return the mode class of k nearest neibors ,
Hyperparameter : k(odd number, if k is too big, underfit, if k is too small, overfit), distance metric(L1, Manhattan, L2, Euclidean)

3. Logistic regression

Introduction : Unlike Linear Regression, problems arise when the input/target data is categorical. Logistic Regression is better thought of as solving a classification problem rather than a regression problem.
Methods : Logistic Regression uses a sigmoid, so using MSE produces a non-convex loss surface (multiple local minima) rather than the convex quadratic we expect. Cross Entropy is therefore used as the loss, optimized with Gradient Descent.

4. RBM : Restricted Boltzmann Machine (RBM)

Introduction : RBMs are two layered NN with generative capabilities, that have the ability to learn a prob dist over its input. RBMs can be used for dimensionality reduction, classification, regression, and so on.
Architecture : Two layers — one input layer and one hidden layer. Unlike a Boltzmann Machine where units in the same layer are connected, in an RBM there are no intra-layer connections (it is restricted in this sense).
Training : Trained with Gibbs Sampling — a stochastic algorithm that draws a sequence of samples from the joint distribution of two or more random variables (Unsupervised Learning).
The training intuition: feed the input forward, run the activations backward to reconstruct it, and minimize the difference between the reconstruction and the input — much like an autoencoder.
1. forward : $ y_{out} = \sigma (wx + b)$
2. backward (reconstruct the input) : $ y_{recon} = y_{out} x + b$
3. KL Divergence can be considered as the reconstruction error: $ KL(x|y_{recon}) $
4. Reduce KLD in iterative manner : common training algorithms for RBMs approximate the log-likelihood gradient given some data and perform gradient ascest on these approximations.(Gibbs Sampling)

5. DBN : Deep Belief Network

Architecture : Multiple RBMs can also be stacked and can be fine tuned through the process of gradient descent and backpropagation.
Train : Once the first RBM is trained, its parameters are frozen and a second RBM is trained using the hidden units of the first as input. The full DBN stack can then be fine-tuned with gradient descent and backpropagation.
DBNs drew attention not so much because they modeled the joint probability well, but because using a DBN to pretrain another DNN mitigated overfitting and improved downstream performance.
However, with enough data, random initialization is known to outperform DBN-based weight initialization, so DBNs are rarely used in practice today.

6. HDBSCAN (Hierarchical DBSCAN, 2013)

Introduction : HDBSCAN extends DBSCAN by converting it into a hierarchical clustering algorithm, and then using a technique to extract a flat clustering based in the stability of clusters.
Methods :
1. Transform the space according to the density/sparsity.
  - The single-linkage algorithm at the core of hierarchical clustering is extremely sensitive to noise: a single mis-placed noise point can chain two clusters into one. A more robust distance is therefore needed.
  - Mutual Reachability Distance : the MRD between a and b is the maximum of { distance(a, k-nearest neighbor of a), distance(b, knn of b), distance(a, b) }. In dense regions this falls back to the direct distance d(a,b); in sparse regions it uses each point’s distance to its k-th neighbor instead, making the metric robust to noise. [reference]
    
    $$ MRD = max[core_k(a), core_k(b), d(a,b)] $$
2. Build the minimum spanning tree of the distance-weighted graph : Treat each data point as a node and connect them with weighted edges whose weight is the MR distance between the two endpoints. The resulting graph is the same regardless of where the construction starts (fig1).
3. Construct a cluster hierarchy of connected components : Walk the edges from the longest MR distance down, lowering a threshold and cutting connections as the threshold decreases.
4. Condense the cluster hierarchy based on minimum cluster size : Prune the hierarchy where a branch would contain fewer points than the specified minimum cluster size.
5. Extract the stable clusters from the condensed tree : we want the choose clusters that persist and have a longer lifetime;