ML Basic #2 | Loss Functions

0. Introduction

  • Loss Function : a function that evaluates how well the algorithm models the target dataset
    • MSE : From a probabilistic view, minimizing MSE is equivalent to minimizing the negative log-likelihood of a Gaussian distribution with fixed variance, i.e. the MSE solution is the MLE under a Gaussian.
    • Cross Entropy : From a probabilistic view, CE equals the negative log-likelihood of a Multinomial (categorical) distribution, i.e. the CE solution is the MLE under a Multinomial.
    • Binary Cross Entropy : From a probabilistic view, BCE equals the negative log-likelihood of a Bernoulli distribution (a Binomial with a single trial), i.e. the BCE solution is the MLE under a Bernoulli.


1. Mean Squared Error (MSE)

  • introduction : easily computable quantity that measures the average of the squares of the errors.

    $$ MSE = \frac{1}{n} \sum_{i=1}^n (y_i - t_i)^2 $$

  • L1 Loss (LAD : Least Absolute Deviation), L2 Loss (LSE : Least Square Error)

    $$ L_{L1} = \sum_{i=1}^n | y_i - t_i |$$

    $$ L_{L2} = LSE = \sum_{i=1}^n (y_i - t_i)^2$$
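
The three formulas above can be checked on a tiny example. This is a minimal sketch in plain Python; the function names and the sample values of `y` (predictions) and `t` (targets) are made up for illustration.

```python
def l1_loss(y, t):
    """L1 loss (LAD): sum of absolute errors."""
    return sum(abs(yi - ti) for yi, ti in zip(y, t))


def l2_loss(y, t):
    """L2 loss (LSE): sum of squared errors."""
    return sum((yi - ti) ** 2 for yi, ti in zip(y, t))


def mse(y, t):
    """Mean squared error: the L2 loss divided by the number of samples."""
    return l2_loss(y, t) / len(y)


# Made-up predictions and targets
y = [2.0, 0.0, 3.0]
t = [1.0, 0.0, 5.0]

print(l1_loss(y, t))  # 1 + 0 + 2 = 3.0
print(l2_loss(y, t))  # 1 + 0 + 4 = 5.0
print(mse(y, t))      # 5 / 3
```

Note how the single large error (3 vs 5) contributes 2 to the L1 loss but 4 to the L2 loss: squaring makes L2 and MSE more sensitive to outliers.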

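  • Probabilistic view : minimizing MSE is equivalent to minimizing the negative log-likelihood under a Gaussian distribution with fixed variance $\sigma^2$ (assuming the model’s prediction parameterizes the mean as $y_i = \theta^T x_i = \mu_i$).

$$ \begin{aligned} \text{log likelihood: } \log P(t|x;\theta) &= \log \prod_i P(t_i | x_i;\theta) = \sum_i \log \left( \frac{1}{\sqrt{2\pi \sigma^2}} e^{- \frac{(t_i - \theta^T x_i)^2}{2\sigma^2}} \right) \\ &= \sum_i \log \frac{1}{\sqrt{2 \pi \sigma^2}} + \sum_i \left(- \frac{(t_i - \theta^T x_i)^2}{2\sigma^2} \right) \\ &= C_1 - C_2 \sum_i (t_i - \theta^T x_i)^2 \end{aligned} $$

    Here $C_1 = \sum_i \log \frac{1}{\sqrt{2\pi\sigma^2}}$ and $C_2 = \frac{1}{2\sigma^2} > 0$ do not depend on $\theta$, so maximizing the log-likelihood is the same as minimizing the sum of squared errors.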
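
As a numerical sanity check of this derivation, the sketch below computes the Gaussian log-likelihood directly and compares it with the closed form "constant minus scaled sum of squared errors". The data, the value of `sigma`, and the function names are assumptions chosen for illustration; `mu` stands in for the model outputs $\theta^T x_i$.

```python
import math


def gaussian_log_likelihood(t, mu, sigma):
    """Sum over samples of log N(t_i | mu_i, sigma^2)."""
    return sum(
        math.log(1.0 / math.sqrt(2 * math.pi * sigma ** 2))
        - (ti - mi) ** 2 / (2 * sigma ** 2)
        for ti, mi in zip(t, mu)
    )


def sum_squared_error(t, mu):
    return sum((ti - mi) ** 2 for ti, mi in zip(t, mu))


# Made-up targets, predicted means, and a fixed noise scale
t = [1.0, 2.5, -0.5]
mu = [0.8, 2.0, 0.1]
sigma = 1.5

# Constants that do not depend on the model parameters
c1 = len(t) * math.log(1.0 / math.sqrt(2 * math.pi * sigma ** 2))
c2 = 1.0 / (2 * sigma ** 2)

direct = gaussian_log_likelihood(t, mu, sigma)
closed_form = c1 - c2 * sum_squared_error(t, mu)
print(direct, closed_form)  # the two values agree up to rounding
```

Because `c1` and `c2` are fixed, any change to the predictions that lowers the sum of squared errors raises the log-likelihood by a proportional amount.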



2. Cross Entropy Loss

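  • Information-theoretic view : Cross Entropy equals the entropy of the training data distribution (its average information content) plus the KL divergence from that distribution to the model’s predicted distribution. Since the data entropy is fixed, minimizing Cross Entropy minimizes the KL divergence.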

    1. Information content (surprisal) : the amount of information gained by observing a particular outcome of a random variable or signal.

      • Information content of getting heads from a coin flip: $ -log_2 (0.5) = 1$

      • Information content of rolling a 1 on a die: $ -log_2 (1/6) \approx 2.585$

      • Information content = degree of surprise: rarer events carry more information.

      • Why take the log of the reciprocal $\frac{1}{p(x)}$ instead of using the reciprocal directly? Taking the log expresses the minimum resources needed to represent that surprise. For example, an event with probability 1/8 needs at least $\log_2(8) = 3$ bits to encode in binary.

        $$ I(E) = -log P(E)$$
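
The examples above can be reproduced with a few lines of Python. This is a small sketch that assumes base-2 logarithms (information measured in bits); `information_content` is a hypothetical helper name.

```python
import math


def information_content(p):
    """Surprisal in bits of an event with probability p (0 < p <= 1)."""
    return -math.log2(p)


coin_heads = information_content(0.5)     # fair coin landing heads
die_one = information_content(1.0 / 6.0)  # fair die showing a 1
rare_event = information_content(1.0 / 8.0)

print(coin_heads)  # 1.0 bit
print(die_one)     # about 2.585 bits
print(rare_event)  # 3.0 bits, matching the log2(8) = 3 example
```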

        1. Entropy (Shannon Entropy) : average (expectation of) information content of discrete random variable X

          • In other words, entropy is the average information content — equivalently, the average resource required to represent the events.

            $$ Entropy = -\sum_{i=1}^{N} p_i log{p_i} $$
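
To make the "average information content" reading concrete, here is a minimal sketch that evaluates the entropy formula for a few distributions. It assumes base-2 logarithms and skips zero-probability terms (using the convention $0 \log 0 = 0$); the distributions are made up.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


fair_coin = entropy([0.5, 0.5])
biased_coin = entropy([0.9, 0.1])
fair_die = entropy([1.0 / 6.0] * 6)

print(fair_coin)    # 1.0 bit
print(biased_coin)  # below 1 bit: the outcome is easier to guess
print(fair_die)     # log2(6), about 2.585 bits
```

The biased coin has lower entropy than the fair coin because its outcomes are, on average, less surprising.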

        2. Relative Entropy (Kullback-Leibler Divergence) : a measure of how one probability distribution $p$ differs from a second, reference probability distribution $q$.

          $$ KL(p \| q) = \sum_i p_i \log \left( \frac{p_i}{q_i} \right) = \sum_i (p_i \log p_i - p_i \log q_i) = \left( -\sum_{i=1}^{N} p_i \log q_i \right) - \left( -\sum_{i=1}^{N} p_i \log p_i \right) $$

        3. Cross-Entropy: To minimize the dissimilarity between the distributions, we look for the $q$ that minimizes the first term of the KL divergence (the second term is the entropy of $p$ and does not depend on $q$). That first term is the cross entropy error.

          $$ CE = -\sum_{i=1}^{N} p_i log{q_i} $$
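
The decomposition "KL divergence = cross entropy minus entropy" can be verified numerically. The sketch below uses natural logarithms and two made-up distributions `p` (data) and `q` (model); all function names are illustrative.

```python
import math


def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up data distribution p and model distribution q
p = [0.5, 0.25, 0.25]
q = [0.25, 0.5, 0.25]

kl = kl_divergence(p, q)
ce = cross_entropy(p, q)
h = entropy(p)

print(kl, ce - h)  # equal up to rounding
# When q matches p exactly, the KL term vanishes and CE reduces to the entropy of p
print(cross_entropy(p, p), h)
```

Since `h` does not depend on `q`, lowering the cross entropy by changing `q` lowers the KL divergence by the same amount.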

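      • Probabilistic view : Cross Entropy is the negative log-likelihood of a Multinomial (categorical) distribution, where $t_{ik}$ is the one-hot target and $\pi_{ik}$ the predicted probability of class $k$ for sample $i$.

        $$ \text{log likelihood: } \log P(t|x;\theta) = \log \prod_i \prod_k \pi_{ik}^{t_{ik}} = \sum_i \sum_k t_{ik} \log \pi_{ik} $$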
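
For a single sample with a one-hot target, the double sum collapses to the log-probability of the true class. The sketch below illustrates this with a softmax over made-up logits; using a softmax to produce the class probabilities is an assumption here, as the notes above do not specify how they are obtained.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_nll(t, pi):
    """Negative log-likelihood -sum_k t_k log(pi_k) for one sample."""
    return -sum(tk * math.log(pk) for tk, pk in zip(t, pi) if tk > 0)


logits = [2.0, 1.0, 0.1]  # made-up model outputs for 3 classes
t = [1, 0, 0]             # one-hot target: class 0 is correct

pi = softmax(logits)
nll = categorical_nll(t, pi)

print(pi)
print(nll, -math.log(pi[0]))  # identical: only the true class contributes
```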



3. Binary Cross Entropy

  • A simple specialization of the Cross Entropy formula with $N=2$: in binary classification $p_2 = 1 - p_1$.

$$BCE = -(p_1 logq_1 + p_2 logq_2) = - (p_1 log q_1 + (1-p_1) log (1-q_1))$$

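  • Probabilistic view : the negative log-likelihood of a Bernoulli distribution equals BCE. Here $y_i \in \{0, 1\}$ is the label and $p_i$ is the predicted probability that $y_i = 1$.

$$ \begin{aligned} \text{log likelihood: } \log P(y|x;\theta) &= \log \prod_i p_i^{y_i} (1-p_i)^{1-y_i} \\ &= \sum_i \log \left( p_i^{y_i} (1-p_i)^{1-y_i} \right) \\ &= \sum_i \left( y_i \log p_i + (1-y_i) \log (1-p_i) \right) \end{aligned} $$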
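
The equality between BCE and the negative Bernoulli log-likelihood can be confirmed on a few samples. This is a minimal sketch with made-up labels `y` and predicted probabilities `p`; it sums over samples rather than averaging.

```python
import math


def bce(y, p):
    """Binary cross entropy summed over samples."""
    return -sum(
        yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
        for yi, pi in zip(y, p)
    )


def bernoulli_likelihood(y, p):
    """Product over samples of p_i^y_i * (1 - p_i)^(1 - y_i)."""
    likelihood = 1.0
    for yi, pi in zip(y, p):
        likelihood *= pi ** yi * (1 - pi) ** (1 - yi)
    return likelihood


y = [1, 0, 1]        # made-up binary labels
p = [0.9, 0.2, 0.6]  # made-up predicted probabilities of class 1

loss = bce(y, p)
nll = -math.log(bernoulli_likelihood(y, p))
print(loss, nll)  # both equal -log(0.9 * 0.8 * 0.6)
```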



4. Dice-Coefficient Loss

$$\text{Dice Coefficient} = \frac{2*|X \cap Y|}{|X| + |Y|} $$

  • Here $ |X| $ is the cardinality (i.e. the number of elements) of the set $X$, and likewise $ |Y| $ for the set $Y$.

  • The Dice coefficient is widely used as an evaluation metric in image segmentation, and it can also be turned into a loss function.

  • In image segmentation, Dice Loss is useful when the foreground is much smaller than the background, which causes a class imbalance problem.

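    import torch

    def dice_loss(pred, target):
        smooth = 1  # avoids division by zero when both inputs are empty
        # contiguous() is required because the tensors may come from a torch.view op;
        # tensors stay on whatever device they already live on
        iflat, tflat = pred.contiguous().view(-1), target.contiguous().view(-1)
        intersection = (iflat * tflat).sum()
        # squared sums in the denominator (one common soft-Dice variant)
        A_sum, B_sum = torch.sum(iflat * iflat), torch.sum(tflat * tflat)
        return 1 - ((2. * intersection + smooth) / (A_sum + B_sum + smooth))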
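
Before moving to tensors, it may help to see the set-based definition of the Dice coefficient on its own. The sketch below applies the formula to two made-up sets of pixel indices; `dice_coefficient` is a hypothetical helper, and returning 1.0 for two empty sets is an assumed convention.

```python
def dice_coefficient(x, y):
    """Dice coefficient 2|X ∩ Y| / (|X| + |Y|) for two Python sets."""
    if not x and not y:
        return 1.0  # assumed convention: two empty sets count as identical
    return 2 * len(x & y) / (len(x) + len(y))


# Made-up sets of foreground pixel indices
predicted = {1, 2, 3, 4}
ground_truth = {3, 4, 5}

score = dice_coefficient(predicted, ground_truth)
print(score)      # 2 * 2 / (4 + 3) = 4/7
print(1 - score)  # the corresponding Dice loss
```

The score is 1 for identical sets and 0 for disjoint sets, so `1 - score` behaves like a loss that shrinks as the overlap grows.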
    


5. Noise Contrastive Learning

  • Intuitive explanation :
    1. Suppose we want to find the next word (you) that fits a given context ([Nice, to, meet]).
    2. We can train the network with the softmax CE loss function, in which case it returns probabilities for all candidate words.
    3. This means that the output “scores” for every class have to be normalized into actual probabilities, which is computationally expensive when there are many candidate words.
    4. Let’s simplify this multinomial classification problem into binary classification (logistic regression) by transforming each input from a “set of words and answer” into positive and negative pairs.
    5. The model then only needs to predict whether a pair is positive or negative, rather than finding which word is the answer.
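
The contrast between the two objectives can be sketched as follows. This is a simplified illustration closer to negative sampling than to full noise contrastive estimation: it omits the noise-distribution correction term that NCE adds to each score. The word scores, the number of negatives, and all function names are assumptions made up for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax_ce(scores, target):
    """Softmax CE: the normalizer sums over every candidate word."""
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return -(scores[target] - log_z)


def binary_pair_loss(scores, positive, negatives):
    """Logistic loss: label 1 for the true pair, label 0 for each sampled noise pair."""
    loss = -math.log(sigmoid(scores[positive]))
    for word in negatives:
        loss -= math.log(1.0 - sigmoid(scores[word]))
    return loss


# Hypothetical model scores for candidate next words given [Nice, to, meet]
scores = {"you": 3.0, "him": 1.0, "table": -2.0, "run": -1.5, "blue": -3.0}

# Sample 2 noise words instead of touching the whole vocabulary
rng = random.Random(0)
noise_words = [w for w in scores if w != "you"]
negatives = rng.sample(noise_words, 2)

full_loss = softmax_ce(scores, "you")
pair_loss = binary_pair_loss(scores, "you", negatives)
print(full_loss, pair_loss)
```

The softmax loss touches all five scores, while the pair loss touches only three (one positive and two negatives); with a realistic vocabulary of tens of thousands of words, that difference is where the savings come from.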