Introduction
- Regression
- Linear Regression
- Classification
- Logistic Regression
- SVM
- KNN
- Decision Tree
- Random Forest / Boosting
- Clustering
- K-means
- Dimensionality Reduction
- PCA
- RL basics
1.1 Linear Regression
-
Introduction : Models a linear relationship between input features and a continuous target variable using:
$$ \hat{y} = w^T x + b = \theta^T x \quad (\text{with } x_0 = 1) $$
-
Methods
-
Objective (Ordinary Least Squares) : minimize the sum of squared residuals (the vertical deviation from each point to the line).
$$ J(\theta) = \frac{1}{2n} \sum_{i=1}^n (y_i - \theta^T x_i)^2 $$
-
Solving for $\theta$
- Closed form (Normal Equation) : $\theta = (X^T X)^{-1} X^T y$.
- Pros: exact, no learning-rate tuning, no iteration.
- Cons: $O(d^3)$ to invert $X^T X$; numerically unstable when features are collinear (singular $X^T X$). In practice use SVD or QR via
np.linalg.lstsq, not the inverse.
- Gradient Descent : iterative; scales to large $n$ and $d$ where the closed form is infeasible.
- Closed form (Normal Equation) : $\theta = (X^T X)^{-1} X^T y$.
-
-
Discussion
-
Probabilistic view : OLS is the MLE under the assumption that residuals are i.i.d. Gaussian — i.e. minimizing MSE equals maximizing the Gaussian log-likelihood. (Same derivation we used in ML02 §1.)
-
Regularization (penalize large weights to combat overfitting and collinearity)
- Ridge (L2) : $J + \lambda |w|_2^2$ — shrinks all weights toward zero; well-conditioned even with collinear features.
- Lasso (L1) : $J + \lambda |w|_1$ — induces sparsity, useful as feature selection.
- Elastic Net : convex combination of L1 + L2.
-
Evaluation : RMSE / MAE for absolute error, $R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$ for proportion of variance explained. Always inspect residual plots — patterned residuals signal model mis-specification.
-
1. K-Nearest Neighbors
-
Introduction : a method that predicts a given data based on the nearest K neigbors in existing data.
-
Method : If regression, return mean value of k nearest neibors, if classification, return the mode class of k nearest neibors ,
-
Hyperparameter : k(odd number, if k is too big, underfit, if k is too small, overfit), distance metric(L1, Manhattan, L2, Euclidean)
3. Logistic regression
-
Introduction : Unlike Linear Regression, problems arise when the input/target data is categorical. Logistic Regression is better thought of as solving a classification problem rather than a regression problem.
-
Methods : Logistic Regression uses a sigmoid, so using MSE produces a non-convex loss surface (multiple local minima) rather than the convex quadratic we expect. Cross Entropy is therefore used as the loss, optimized with Gradient Descent.
4. RBM : Restricted Boltzmann Machine (RBM)
-
Introduction : RBMs are two layered NN with generative capabilities, that have the ability to learn a prob dist over its input. RBMs can be used for dimensionality reduction, classification, regression, and so on.
-
Architecture : Two layers — one input layer and one hidden layer. Unlike a Boltzmann Machine where units in the same layer are connected, in an RBM there are no intra-layer connections (it is restricted in this sense).
-
Training : Trained with Gibbs Sampling — a stochastic algorithm that draws a sequence of samples from the joint distribution of two or more random variables (Unsupervised Learning).
-
The training intuition: feed the input forward, run the activations backward to reconstruct it, and minimize the difference between the reconstruction and the input — much like an autoencoder.
-
forward : $ y_{out} = \sigma (wx + b)$
-
backward (reconstruct the input) : $ y_{recon} = y_{out} x + b$
-
KL Divergence can be considered as the reconstruction error: $ KL(x|y_{recon}) $
-
Reduce KLD in iterative manner : common training algorithms for RBMs approximate the log-likelihood gradient given some data and perform gradient ascest on these approximations.(Gibbs Sampling)
-
5. DBN : Deep Belief Network
-
Architecture : Multiple RBMs can also be stacked and can be fine tuned through the process of gradient descent and backpropagation.
-
Train : Once the first RBM is trained, its parameters are frozen and a second RBM is trained using the hidden units of the first as input. The full DBN stack can then be fine-tuned with gradient descent and backpropagation.
-
DBNs drew attention not so much because they modeled the joint probability well, but because using a DBN to pretrain another DNN mitigated overfitting and improved downstream performance.
-
However, with enough data, random initialization is known to outperform DBN-based weight initialization, so DBNs are rarely used in practice today.
6. HDBSCAN (Hierarchical DBSCAN, 2013)
-
Introduction : HDBSCAN extends DBSCAN by converting it into a hierarchical clustering algorithm, and then using a technique to extract a flat clustering based in the stability of clusters.
-
Methods :
-
Transform the space according to the density/sparsity.
-
The single-linkage algorithm at the core of hierarchical clustering is extremely sensitive to noise: a single mis-placed noise point can chain two clusters into one. A more robust distance is therefore needed.
-
Mutual Reachability Distance : the MRD between
aandbis the maximum of { distance(a, k-nearest neighbor of a), distance(b, knn of b), distance(a, b) }. In dense regions this falls back to the direct distanced(a,b); in sparse regions it uses each point’s distance to its k-th neighbor instead, making the metric robust to noise. [reference]$$ MRD = max[core_k(a), core_k(b), d(a,b)] $$
-
-
Build the minimum spanning tree of the distance-weighted graph : Treat each data point as a node and connect them with weighted edges whose weight is the MR distance between the two endpoints. The resulting graph is the same regardless of where the construction starts (fig1).
-
Construct a cluster hierarchy of connected components : Walk the edges from the longest MR distance down, lowering a threshold and cutting connections as the threshold decreases.
-
Condense the cluster hierarchy based on minimum cluster size : Prune the hierarchy where a branch would contain fewer points than the specified minimum cluster size.
-
Extract the stable clusters from the condensed tree : we want the choose clusters that persist and have a longer lifetime;
-