Mathematics for ML #9 | Linear Regression


9. Linear Regression

  • In the following, we will apply the mathematical concepts from previous chapters to solve linear regression (curve-fitting) problems.

  • In regression, we aim to find a function $f$ that maps inputs $x∈R^D$ to corresponding function values $f(x)∈R.$

    • We are given a set of training inputs $x_n$ and corresponding noisy observations $y_n=f(x_n) + \epsilon$,

      • where $\epsilon$ is an i.i.d. random variable that describes measurement and observation noise; here it is simply zero-mean Gaussian noise (other noise models are not considered further in this chapter)
    • Then the task is to infer the function $f$ that generated the data and generalizes well to function values at new input locations.

  • Solving a regression problem requires addressing a variety of issues:

    • (1) Choice of the model (type) and the parametrization of the regression function.

    • (2) Finding good parameters.

    • (3) Overfitting and model selection.

    • (4) Relationship between loss functions and parameter priors.

    • (5) Uncertainty modeling



9.1 Problem Formulation

  • Objective : find a function that is close to the unknown function $f$ that generated the observed data (and that generalizes well).

  • Observation Noise

    • The functional relationship between the input $x$ and the target $y$ is given below.
    • Noise $ε∼N(0, σ^2)$ is independent, identically distributed (i.i.d.) Gaussian measurement noise with zero-mean and variance of $σ^2$.

    $$ y = f (x) + ε $$

  • Define Likelihood function :

    • Because of the presence of observation noise, we will adopt a probabilistic approach and explicitly model the noise using a likelihood function.

    • The likelihood is Gaussian in $y$ with mean $f(x)$ and variance $\sigma^2$ :

    $$ p(y|x) = N(y|f(x), σ^2) $$

  • Parametrization :

    • We consider the special case that the parameters $\theta$ appear linearly in our model.

    $$ p(y|x,θ) = N(y|x^⊤θ, σ^2) \leftrightarrow y=x^⊤θ+ε, ε∼N(0,σ^2) $$
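
  • As a concrete illustration, the sketch below samples synthetic data from exactly this model; the particular values of $D$, $\theta$, $\sigma$, and $N$ are arbitrary choices for the example, not taken from the text.

```python
import numpy as np

# Minimal sketch: draw data from y = x^T theta + eps, eps ~ N(0, sigma^2).
# theta_true, sigma, D, and N are illustrative choices.
rng = np.random.default_rng(0)

D, N = 2, 100                        # input dimension, number of samples
theta_true = np.array([1.5, -0.7])   # assumed "true" parameters
sigma = 0.5                          # noise standard deviation

X = rng.uniform(-3, 3, size=(N, D))            # training inputs x_n
y = X @ theta_true + rng.normal(0, sigma, N)   # noisy observations y_n
```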



9.2 Parameter Estimation

  • In the following, we will have a look at parameter estimation by maximizing the likelihood, a topic that we already covered to some degree in Section 8.3.

9.2.1 Maximum Likelihood Estimation

  • (1) Maximum Likelihood Estimation
    • A widely used approach to finding the desired parameters,
    • where we choose the parameters $θ_{ML}$ that maximize the likelihood.

$$ θ_{ML} = \text{argmax}_\theta \ p(Y |X,θ). $$

  • (2) Likelihood Factorization : For a given training set, $y_i$ and $y_j$ are conditionally independent given their respective inputs $x_i, x_j$, so the likelihood factorizes according to :

$$ p(Y|X,θ) = p(y_1,…,y_N |x_1,…,x_N,θ) $$

$$ = \prod_{n=1}^{N} p(y_n |x_n,θ) = \prod_{n=1}^{N} N(y_n |x^⊤_n θ, σ^2) \tag{9.5b} $$

  • (3) Log-Transformation :
    • Since the likelihood (9.5b) is a product of N Gaussian distributions, the log-transformation is useful
    • To find $\theta_{ML}$, we can equivalently minimize the negative log-likelihood, which is (up to a constant) a sum of squared errors

$$ −\log p(Y|X,θ) = −\log \prod_{n=1}^{N} p(y_n|x_n,θ) = − \sum_{n=1}^{N} \log p(y_n|x_n,θ) $$

$$ L(\theta) := −\log p(Y|X,θ) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n−x^⊤_nθ)^2 + const $$

  • (4) Global Optimum : We can find the global optimum by computing the gradient of $L$, setting it to 0 and solving for $θ$

$$ \frac{dL}{d\theta} = 0^⊤ \Longleftrightarrow θ_{ML} = (X^⊤X)^{−1}X^⊤y $$

  • Polynomial Regression (MLE with Features) : straight lines are not sufficiently expressive when it comes to fitting more interesting data, so we replace $x$ with nonlinear (e.g., polynomial) features $\phi(x)$ and fit $y = \phi^⊤(x)θ + ε$ in exactly the same way
  • Estimating the Noise Variance : We can also use the principle of MLE to obtain the maximum likelihood estimator $\sigma_{ML}^2$ for the noise variance; a sketch of both steps follows
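
  • A minimal, hedged sketch of both steps (maximum likelihood with polynomial features and the ML noise variance); the degree, data-generating function, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: 1-D inputs, degree-K polynomial features Phi,
# theta_ML = (Phi^T Phi)^{-1} Phi^T y, and the ML noise variance
# sigma_ML^2 = (1/N) sum_n (y_n - phi(x_n)^T theta_ML)^2.
N, K, sigma = 30, 4, 0.2
x = rng.uniform(-1, 1, N)
y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, N)  # assumed data-generating function

Phi = np.vander(x, K + 1, increasing=True)           # features [1, x, ..., x^K]

# Solve the normal equations via least squares (numerically safer than an explicit inverse).
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)

residuals = y - Phi @ theta_ml
sigma2_ml = residuals @ residuals / N                # ML estimate of the noise variance
```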

9.2.2 Overfitting in Linear Regression

  • Model Selection : we can use the RMSE (or the negative log-likelihood) on a separate test set to determine the best degree of the polynomial, i.e., the one that minimizes this objective.

    • polynomials of low degree fit the data poorly

    • high-degree polynomials oscillate wildly and are a poor representation of the underlying function that generated the data, such that we suffer from overfitting


9.2.3 Maximum A Posteriori Estimation

  • (1) Prior and Posterior :

    • The posterior over the parameters θ, given the training data $X,Y$ is obtained by applying Bayes’ theorem (Section 6.3) as :

    $$ p(θ|X,Y)= \frac{p(Y|X,θ)p(θ)}{p(Y | X )} $$

    • Since the posterior explicitly depends on the parameter prior $p(θ)$, the prior will have an effect on the parameter vector we find as the maximizer of the posterior.
  • (2) Log-Transformation :

    • To find the MAP estimate, we start with the log-transform and compute the log-posterior

    $$ log p(θ | X , Y) = log p(Y | X , θ) + log p(θ) + const , $$

    • With a (conjugate) Gaussian prior $p(θ) = N(0, b^2I)$ on the parameters θ, we obtain the negative log-posterior

    $$ −logp(θ|X,Y)= \frac{1}{2\sigma^2}(y−Φθ)^⊤(y−Φθ)+ \frac{1}{2b^2} θ^⊤θ+const. $$

  • And then, we minimize the negative log-posterior distribution with respect to $\theta$

    $$ θ_{MAP} ∈ argmin_\theta[−logp(Y|X,θ)−logp(θ)]. $$

  • (3) Gradient :

    • We compute the gradient of the negative log-posterior with respect to θ, and we can see that the first term on the right-hand side is the gradient of the negative log-likelihood

      $$ − \frac{dlogp(θ|X,Y)}{d\theta} = −\frac{dlogp(Y|X,θ)}{d\theta} − \frac{dlogp(θ)}{d\theta} $$

  • (4) Global Optimum : We will find $θ_{MAP}$ by setting this gradient to $0^⊤$ and solving for $θ_{MAP}$

$$ θ_{MAP} = (Φ^⊤Φ+\frac{σ^2}{b^2}I)^{-1}Φ^⊤y $$
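
  • A minimal sketch of this closed-form solution, assuming a feature matrix $Φ$, observations $y$, noise level $σ$, and prior scale $b$ as above:

```python
import numpy as np

def theta_map(Phi, y, sigma, b):
    """MAP estimate (Phi^T Phi + (sigma^2 / b^2) I)^{-1} Phi^T y."""
    K = Phi.shape[1]
    A = Phi.T @ Phi + (sigma**2 / b**2) * np.eye(K)
    return np.linalg.solve(A, Phi.T @ y)   # solve the linear system instead of inverting A
```

  • Note that as $b → ∞$ (an increasingly flat prior), the extra term vanishes and $θ_{MAP}$ approaches $θ_{ML}$.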


9.2.4 MAP Estimation as Regularization

  • Regularized least squares :

    • In Regularized Least Squares, we consider the loss function as below, which we minimize with respect to θ (see Section 8.2.3)

    $$ ∥y − Φθ∥^2 + λ ∥θ∥^2_2 $$

    • the first term is a data-fit term, which is proportional to the negative log-likelihood; see (9.10b).

    • The second term is called the regularizer, and the regularization parameter λ>0 controls the “strictness” of the regularization => it can be interpreted as a negative log-Gaussian prior
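
    • Concretely, multiplying the negative log-posterior from Section 9.2.3 by $2\sigma^2$ (which does not change the minimizer) yields the regularized least-squares objective and identifies the regularization parameter:

    $$ 2\sigma^2 \left( \frac{1}{2\sigma^2}\|y−Φθ\|^2 + \frac{1}{2b^2}θ^⊤θ \right) = \|y−Φθ\|^2 + \frac{σ^2}{b^2}\|θ\|^2_2 \quad\Longrightarrow\quad λ = \frac{σ^2}{b^2} $$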

  • So far, we have covered parameter estimation using MLE and MAP where we found point estimates $\theta^*$ that optimize an objective function

  • In the next section, we will use Bayesian inference (Section 8.4) to find a posterior distribution over the unknown parameters, which we subsequently use to make predictions.



9.3 Bayesian Linear Regression

  • Bayesian linear regression :
    • does not even attempt to compute a point estimate of the parameters,
    • but instead the full posterior distribution over the parameters is taken into account when making predictions

9.3.1 Model

  • Prior, $p(θ) = N(m_0, S_0)$

  • likelihood, $p(y | x, θ) = N (y | φ^⊤(x)θ, σ^2)$

  • Full probabilistic model : the joint distribution of observed and unobserved random variables, y and θ, respectively, is:

$$ p(y,θ|x) = p(y|x,θ)p(θ) $$


9.3.2 Prior Predictions

  • Introduction

    • we are usually not so much interested in the parameters θ themselves.

    • Instead, our focus often lies in the predictions we make with those parameter values.

    • In a Bayesian setting, we take the parameter distribution and average over all plausible parameter settings when we make predictions.

    • More specifically, to make predictions at an input x∗, we integrate out θ and obtain :

    $$ p(y_∗| x_∗) = \int p(y_∗| x_∗, θ)p(θ)dθ = E_θ[p(y_∗|x_∗, θ)] $$

  • Summary

    • So far, we looked at computing predictions using the parameter prior $p(θ)$; a sketch of such prior predictions follows this list.

    • However, when we have a parameter posterior (given some training data X, Y) => we just need to replace the prior $p(θ)$ with the posterior $p(θ|X,Y)$.
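
  • A hedged sketch of such prior predictions is below, using illustrative choices (1-D inputs, polynomial features, $m_0 = 0$, $S_0 = b^2 I$); the Gaussian prior predictive $p(y_*|x_*) = N(φ^⊤(x_*)m_0, \ φ^⊤(x_*)S_0φ(x_*) + σ^2)$ is the standard result of integrating out θ in this conjugate model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative prior: p(theta) = N(m0, S0) with m0 = 0, S0 = b^2 I,
# polynomial features of degree K, and noise variance sigma^2.
K, b, sigma = 5, 1.0, 0.2
m0, S0 = np.zeros(K + 1), b**2 * np.eye(K + 1)

x_star = np.linspace(-1, 1, 100)
Phi_star = np.vander(x_star, K + 1, increasing=True)    # phi(x_*) for each test input

# Prior predictive p(y_*|x_*) = N(phi^T m0, phi^T S0 phi + sigma^2), per test input.
mean_star = Phi_star @ m0
var_star = np.einsum('nk,kl,nl->n', Phi_star, S0, Phi_star) + sigma**2

# Equivalently, sample plausible functions by drawing theta from the prior.
theta_samples = rng.multivariate_normal(m0, S0, size=10)
f_samples = theta_samples @ Phi_star.T                   # 10 sampled functions over x_*
```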


9.3.3 Posterior Distribution

  • Bayes’ theorem : Given a training set of inputs $x_n ∈ R^D$ and corresponding observations $y_n ∈ R$, we compute the posterior over the parameters using Bayes’ theorem :

$$ p(θ|X,Y) = \frac{p(Y|X,θ)p(θ)}{p(Y|X)} $$

  • Marginal Likelihood : independent of the parameters θ and ensures that the posterior is normalized; it can be interpreted as the likelihood averaged over all possible parameter settings

9.3.4 Posterior Predictions

  • In principle, predicting with the parameter posterior p(θ|X,Y) is not fundamentally different from predicting with the prior, given that in our conjugate model both the prior and the posterior are Gaussian (with different parameters).

  • Therefore, by following the same reasoning as in Section 9.3.2, we obtain the (posterior) predictive distribution

  • SKIP DETAILS
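
  • For reference, a hedged sketch of the skipped details: in this conjugate Gaussian model, the posterior is $N(m_N, S_N)$ with $S_N = (S_0^{−1} + σ^{−2}Φ^⊤Φ)^{−1}$ and $m_N = S_N(S_0^{−1}m_0 + σ^{−2}Φ^⊤y)$, and the posterior predictive at $φ_* = φ(x_*)$ is $N(φ_*^⊤ m_N, \ φ_*^⊤ S_N φ_* + σ^2)$; these are standard results that the notes do not derive.

```python
import numpy as np

def posterior(Phi, y, m0, S0, sigma):
    """Posterior N(mN, SN) over theta for a Gaussian prior N(m0, S0) and noise variance sigma^2."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma**2)
    mN = SN @ (S0_inv @ m0 + Phi.T @ y / sigma**2)
    return mN, SN

def posterior_predictive(phi_star, mN, SN, sigma):
    """Mean and variance of p(y_* | x_*, X, Y) for a single feature vector phi_star."""
    mean = phi_star @ mN
    var = phi_star @ SN @ phi_star + sigma**2
    return mean, var
```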



9.4 Maximum Likelihood as Orthogonal Projection

  • Introduction :
    • we will now provide a geometric interpretation of MLE. Let us consider a simple linear regression setting.

$$ y=xθ+ε, ε∼N(0,σ^2), \tag{9.65} $$

  • Problem Definition :

    • The parameter θ determines the slope of the line.

    • We obtained the MLE for the slope parameter as (9.66) from section 9.2.1

    • This means that $θ_{ML}$ yields the approximation $Xθ_{ML}$ with the minimum least-squares error between $y$ and $Xθ$. (9.67)

    $$ θ_{ML} = (X^⊤X)^{−1}X^⊤y \tag{9.66} $$

$$ Xθ_{ML} = X(X^⊤X)^{−1}X^⊤y = \frac{XX^⊤}{X^⊤X}y \tag{9.67} $$

  • Solution : As we are looking for a solution of $y = Xθ$, we can think of linear regression as a problem of solving systems of linear equations

    • In particular, looking carefully at (9.67) we see that the maximum likelihood estimator $θ_{ML}$ in our example from (9.65) effectively does an orthogonal projection of $y$ onto the one-dimensional subspace spanned by $X$

      • Recalling the results on orthogonal projections from Section 3.8, we identify $\frac{X X^⊤}{X^⊤X}$ as the projection matrix,
    • Therefore, the maximum likelihood solution also provides a geometrically optimal solution by finding the vector in the subspace spanned by $X$ that is “closest” to the corresponding observations $y$,

    • In the general (polynomial) linear regression case, we again can interpret the maximum likelihood result as a projection onto a K-dimensional subspace of $R^N$.

  • Summary

    1. To model the given dataset (observations), we decided to use a degree-1 curve (a line) => equation (9.65)
    2. We then find the appropriate parameter $\theta$ of our linear model using MLE => equation (9.66)
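
  • A quick numerical check of this projection view, on illustrative data with a single-column $X$ as in (9.65):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative data for the 1-D model y = x*theta + eps from (9.65).
N = 20
X = rng.normal(size=(N, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.1, N)        # assumed slope and noise level

theta_ml = np.linalg.solve(X.T @ X, X.T @ y)     # (X^T X)^{-1} X^T y
P = X @ np.linalg.inv(X.T @ X) @ X.T             # projection matrix X X^T / (X^T X)

print(np.allclose(X @ theta_ml, P @ y))          # X theta_ML is the orthogonal projection of y
print(np.allclose(X.T @ (y - X @ theta_ml), 0))  # the residual is orthogonal to span(X)
```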



Summary

  • In this chapter, we saw how to find a function that maps inputs $x$ to targets $y$ in an observed training data set.

  • Section 9.1, Problem Formulation

    • Functional Relationship : the objective is to find a function that is close to the unknown function $f$ that generated the observed data

    • Observation noise : the observation noise $ε∼N(0, σ^2)$ is i.i.d. Gaussian measurement noise with zero mean and variance $σ^2$.

  • Section 9.2, Parameter Estimation

    • Maximum Likelihood Estimation : We find the model parameters $\theta_{ML}$ that maximize the likelihood.

      • (1) Define the likelihood function : as a Gaussian over $y$

      • (2) Likelihood Factorization : since the training samples are conditionally independent of each other given the inputs, the likelihood factorizes according to $\prod N(y_n |x^⊤_n θ, σ^2)$

      • (3) Log Transformation :

        • The likelihood, a product of N Gaussian distributions, can be transformed into a sum of N log-Gaussian terms
        • The problem is thereby transformed into minimizing the negative log-likelihood
      • (4) Global Minimum : We can find the global optimum by computing the gradient, setting it to zero, and solving for $\theta$

    • Maximum A Posteriori : MLE can lead to severe overfitting; MAP addresses this issue by placing a prior on the parameters, which plays the role of a regularizer

  • Section 9.3, Bayesian Linear Regression

    • Bayesian Regression : In Section 9.2, we focused on estimating parameters that work well. In Section 9.3, however, we focus on the prediction itself rather than the parameters, so we take the parameter distribution and average over all plausible parameter settings when we make predictions

      • Prior Prediction : to make predictions at an input $x_∗$, we integrate out θ.

      • Posterior Prediction : when we have a parameter posterior (given some training data X, Y) then, we just need to replace the prior $p(θ)$ with the posterior $p(θ|X,Y)$

  • Section 9.4, Maximum Likelihood as Orthogonal Projection

    • The MLE solution also provides a geometrically optimal solution by finding the vector in the subspace spanned by $X$ that is “closest” to the corresponding observations $y$