0. Introduction
- Loss Function : a function that evaluates how well the algorithm models the target dataset
- MSE : From a probabilistic view, MSE equals (up to a positive scale and an additive constant) the negative log-likelihood of a Gaussian distribution with fixed variance, so minimizing MSE is maximum likelihood estimation (MLE) under a Gaussian.
- Cross Entropy : From a probabilistic view, CE equals the negative log-likelihood of a Multinomial (categorical) distribution, so minimizing CE is MLE under a Multinomial.
- Binary Cross Entropy : From a probabilistic view, BCE equals the negative log-likelihood of a Bernoulli distribution (a Binomial with a single trial), so minimizing BCE is MLE under a Bernoulli.
1. Mean Squared Error (MSE)
- Introduction : an easily computable quantity that measures the average of the squared errors between the predictions $y_i$ and the targets $t_i$.
$$ MSE = \frac{1}{n} \sum_{i=1}^n (y_i - t_i)^2 $$
- L1 Loss (LAD : Least Absolute Deviations), L2 Loss (LSE : Least Squares Error)
$$ L_{L1} = LAD = \sum_{i=1}^n | y_i - t_i |$$
$$ L_{L2} = LSE = \sum_{i=1}^n (y_i - t_i)^2$$
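The L1 loss, L2 loss, and MSE formulas can be checked on a small example. The following is a minimal sketch in plain Python; the function names and the sample values are illustrative assumptions, not part of the original notes.

```python
def l1_loss(y, t):
    """Sum of absolute errors (LAD)."""
    return sum(abs(yi - ti) for yi, ti in zip(y, t))


def l2_loss(y, t):
    """Sum of squared errors (LSE)."""
    return sum((yi - ti) ** 2 for yi, ti in zip(y, t))


def mse(y, t):
    """Mean squared error: the L2 loss divided by the number of samples."""
    return l2_loss(y, t) / len(y)


# Hypothetical predictions and targets, chosen only for illustration.
y = [2.0, 0.0, 3.0, 5.0]
t = [1.0, 0.0, 3.0, 2.0]

print(l1_loss(y, t))  # 1 + 0 + 0 + 3 = 4.0
print(l2_loss(y, t))  # 1 + 0 + 0 + 9 = 10.0
print(mse(y, t))      # 10 / 4 = 2.5
```

Because the L2 loss squares each error, the single large error (3) dominates it far more than it dominates the L1 loss.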
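- Probabilistic view : minimizing MSE is equivalent to minimizing the negative log-likelihood under a Gaussian distribution with fixed variance $\sigma^2$ (assuming the model's prediction parameterizes the mean as $y_i = \theta^T x_i = \mu_i$).
$$ \begin{aligned} \text{log likelihood} , \log P(t|x;\theta) &= \log \prod_i P(t_i | x_i;\theta) = \sum_i \log \left( \frac{1}{\sqrt{2\pi \sigma^2}} e^{- \frac{(t_i - \theta^T x_i)^2}{2\sigma^2} } \right) \\ &= \sum_i \log \frac{1}{\sqrt{2 \pi \sigma^2}} + \sum_i \left(- \frac{(t_i - \theta^T x_i)^2}{2\sigma^2} \right) \\ &= C_1 - C_2 \sum_i (t_i - \theta^T x_i)^2 \end{aligned} $$
Here $C_1 = \sum_i \log \frac{1}{\sqrt{2\pi\sigma^2}}$ and $C_2 = \frac{1}{2\sigma^2} > 0$ do not depend on $\theta$, so maximizing the log-likelihood is the same as minimizing the sum of squared errors.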
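The relationship between the Gaussian negative log-likelihood and MSE can be verified numerically. This is a sketch under the assumption of a fixed, known noise scale `sigma`; the helper names and sample values are hypothetical.

```python
import math


def gaussian_nll(y, t, sigma):
    """Negative log-likelihood of targets t under N(y_i, sigma^2)."""
    return sum(
        0.5 * math.log(2 * math.pi * sigma ** 2) + (ti - yi) ** 2 / (2 * sigma ** 2)
        for yi, ti in zip(y, t)
    )


def mse(y, t):
    """Mean squared error between predictions y and targets t."""
    return sum((yi - ti) ** 2 for yi, ti in zip(y, t)) / len(y)


# Hypothetical predictions, targets, and noise scale.
y = [2.0, 0.0, 3.0, 5.0]
t = [1.0, 0.0, 3.0, 2.0]
sigma = 1.5
n = len(y)

# NLL = const + scale * MSE, where neither const nor scale depends on the predictions.
const = n * 0.5 * math.log(2 * math.pi * sigma ** 2)
scale = n / (2 * sigma ** 2)

print(gaussian_nll(y, t, sigma))
print(const + scale * mse(y, t))
```

The two printed values agree up to floating-point rounding, which is why the predictions that minimize MSE are also the ones that maximize the Gaussian likelihood.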
2. Cross Entropy Loss
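- Information-theoretic view : Cross Entropy equals the entropy of the training data distribution (its average information content) plus the KL divergence from that distribution to the model's predicted distribution. Since the data entropy is fixed, minimizing Cross Entropy minimizes the KL divergence between the two distributions.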
- Information content : the surprisal of an event is the amount of information gained by observing it.
- Information content of getting heads from a coin flip: $-\log_2 (0.5) = 1$ bit
- Information content of rolling a 1 on a die: $-\log_2 (1/6) \approx 2.585$ bits
- Information content = degree of surprise: rarer events carry more information.
- Why take the log of the reciprocal $\frac{1}{p(x)}$ instead of using the reciprocal directly? Taking the log expresses the minimum resources needed to represent that surprise. For example, an event with probability 1/8 needs at least $\log_2(8) = 3$ bits to encode in binary.
$$ I(E) = -\log P(E)$$
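The coin, die, and 1/8 examples can be reproduced directly from $I(E) = -\log P(E)$. A minimal sketch, assuming base-2 logarithms so that the unit is bits; the function name is illustrative.

```python
import math


def information_content(p):
    """Surprisal -log2(p), in bits, of an event with probability p (0 < p <= 1)."""
    return -math.log2(p)


print(information_content(0.5))    # heads on a fair coin: 1.0 bit
print(information_content(1 / 6))  # rolling a 1 on a fair die: about 2.585 bits
print(information_content(1 / 8))  # event with probability 1/8: 3.0 bits
```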
- Entropy (Shannon Entropy) : the average (expected) information content of a discrete random variable $X$.
- In other words, entropy is the average information content, or equivalently the average resource (for example, the number of bits) required to represent the events.
$$ Entropy = -\sum_{i=1}^{N} p_i \log{p_i} $$
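As a quick illustration of the entropy formula, the sketch below compares a fair coin, a biased coin, and a fair die. It assumes base-2 logarithms and the usual convention that terms with $p_i = 0$ contribute zero; the distributions are made up for the example.

```python
import math


def entropy(p):
    """Shannon entropy in bits; terms with p_i = 0 contribute 0 by convention."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
fair_die = [1 / 6] * 6

print(entropy(fair_coin))    # 1 bit: one yes/no question on average
print(entropy(biased_coin))  # below 1 bit: the outcome is more predictable
print(entropy(fair_die))     # log2(6) bits, about 2.585
```

The more predictable the distribution, the lower its entropy, which matches the reading of entropy as the average surprise.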
- Relative Entropy (Kullback-Leibler Divergence) : a measure of how one probability distribution $p$ differs from a second, reference probability distribution $q$.
$$ KL(p \| q) = \sum_i p_i \log \frac{p_i}{q_i} = \sum_i (p_i \log p_i - p_i \log q_i) = \left(-\sum_{i=1}^{N} p_i \log{q_i}\right) - \left(-\sum_{i=1}^{N} p_i \log{p_i}\right)$$
- Cross-Entropy : To minimize the dissimilarity between the distributions, we find the $q$ that minimizes the first term of the KL divergence (the second term, the entropy of $p$, does not depend on $q$). This first term is the cross entropy error.
$$ CE = -\sum_{i=1}^{N} p_i \log{q_i} $$
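The decomposition of cross entropy into entropy plus KL divergence can be checked numerically. This is a sketch with natural logarithms and two hypothetical distributions `p` and `q`; it assumes every $q_i > 0$ wherever $p_i > 0$.

```python
import math


def entropy(p):
    """Entropy of p in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross entropy -sum_i p_i log q_i in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i log(p_i / q_i) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.7, 0.2, 0.1]  # hypothetical data distribution
q = [0.5, 0.3, 0.2]  # hypothetical model distribution

print(cross_entropy(p, q))
print(entropy(p) + kl_divergence(p, q))  # same value up to rounding
print(kl_divergence(p, p))               # 0.0 when the distributions match
```

Since `entropy(p)` is fixed by the data, lowering the cross entropy by changing `q` lowers the KL divergence by exactly the same amount.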
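- Probabilistic view : Cross Entropy is the negative log-likelihood of a Multinomial (categorical) distribution, where $t_{ik}$ is the one-hot target of sample $i$ for class $k$ and $\pi_{ik}$ is the predicted probability of class $k$ for sample $i$.
$$ \text{log likelihood} , \log P(t|x;\theta) = \log \prod_i \prod_k \pi_{ik}^{t_{ik}} = \sum_i \sum_k t_{ik} \log \pi_{ik}$$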
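For a one-hot target, the double sum above collapses to the negative log of the probability assigned to the true class. The sketch below shows this for a single sample, assuming the class probabilities come from a softmax over hypothetical raw scores; all names and numbers are illustrative.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities (max subtracted for numerical stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_one_hot(scores, one_hot):
    """-sum_k t_k log pi_k for one sample."""
    pi = softmax(scores)
    return -sum(tk * math.log(pk) for tk, pk in zip(one_hot, pi))


def categorical_nll(scores, target_index):
    """Negative log-probability of the true class for one sample."""
    return -math.log(softmax(scores)[target_index])


scores = [2.0, 1.0, 0.1]  # hypothetical model outputs for 3 classes
one_hot = [1, 0, 0]       # the true class is index 0

print(cross_entropy_one_hot(scores, one_hot))
print(categorical_nll(scores, 0))  # same value
```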
3. Binary Cross Entropy
- A simple specialization of the Cross Entropy formula with $N=2$: in binary classification $p_2 = 1 - p_1$ and $q_2 = 1 - q_1$.
$$BCE = -(p_1 \log q_1 + p_2 \log q_2) = - (p_1 \log q_1 + (1-p_1) \log (1-q_1))$$
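- Probabilistic view : the negative log-likelihood of a Bernoulli distribution equals BCE. Here $y_i \in \{0, 1\}$ is the label and $p_i$ is the model's predicted probability that $y_i = 1$ (the role played by $q_1$ above).
$$ \begin{aligned} \text{log likelihood} , \log P(y|x;\theta) &= \log \prod_i p_i^{y_i} (1-p_i)^{1-y_i} \\ &= \sum_i \log \left( p_i^{y_i} (1-p_i)^{1-y_i} \right) \\ &= \sum_i \left( y_i \log p_i + (1-y_i) \log(1-p_i) \right) \end{aligned} $$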
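The equality between BCE and the negative Bernoulli log-likelihood can be confirmed on a few samples. A minimal sketch, summing over samples as in the derivation; the labels and predicted probabilities are hypothetical and are assumed to lie strictly between 0 and 1.

```python
import math


def bce(y, p):
    """Summed binary cross entropy for labels y in {0, 1} and probabilities p in (0, 1)."""
    return -sum(
        yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
        for yi, pi in zip(y, p)
    )


def bernoulli_likelihood(y, p):
    """Product of p_i^y_i * (1 - p_i)^(1 - y_i) over all samples."""
    likelihood = 1.0
    for yi, pi in zip(y, p):
        likelihood *= pi ** yi * (1 - pi) ** (1 - yi)
    return likelihood


y = [1, 0, 1]        # hypothetical labels
p = [0.9, 0.2, 0.6]  # hypothetical predicted probabilities of class 1

print(bce(y, p))
print(-math.log(bernoulli_likelihood(y, p)))  # same value: -log(0.9 * 0.8 * 0.6)
```

In practice implementations work with the log form directly, because the raw product of many probabilities underflows quickly.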
4. Dice-Coefficient Loss
$$\text{Dice Coefficient} = \frac{2*|X \cap Y|}{|X| + |Y|} $$
- Here $|X|$ is the cardinality of the set $X$, i.e. the number of elements it contains (and likewise for $|Y|$).
- It is commonly used as an evaluation metric in image segmentation, and it can also be used as a loss function.
- In image segmentation, Dice Loss is useful as a loss function when the foreground is much smaller than the background, which causes a class imbalance problem.
```python
import torch

def dice_loss(pred, target):
    smooth = 1  # avoids division by zero when both inputs are empty
    # .contiguous() is required because the tensors may come from a view op
    iflat = pred.contiguous().view(-1)
    tflat = target.contiguous().view(-1).to(iflat.device)  # keep both on one device
    intersection = (iflat * tflat).sum()
    A_sum, B_sum = torch.sum(iflat * iflat), torch.sum(tflat * tflat)
    return 1 - ((2. * intersection + smooth) / (A_sum + B_sum + smooth))
```
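The set-based definition of the Dice coefficient can also be illustrated without tensors. The sketch below treats a segmentation mask as a set of pixel indices; the function name, the sample sets, and the choice to return 1.0 when both sets are empty are assumptions made for this example.

```python
def dice_coefficient(x, y):
    """2|X ∩ Y| / (|X| + |Y|) for two finite sets."""
    x, y = set(x), set(y)
    if not x and not y:
        return 1.0  # assumed convention: two empty sets count as a perfect match
    return 2 * len(x & y) / (len(x) + len(y))


# Hypothetical foreground pixel indices for a prediction and a ground truth.
pred_pixels = {1, 2, 3, 4}
true_pixels = {3, 4, 5}

print(dice_coefficient(pred_pixels, true_pixels))  # 2 * 2 / (4 + 3) = 4/7
print(dice_coefficient(true_pixels, true_pixels))  # 1.0 for identical sets
print(dice_coefficient({1, 2}, {3, 4}))            # 0.0 for disjoint sets
```

Only foreground elements enter the formula, so a large background does not inflate the score, which is the reason the measure suits imbalanced segmentation tasks.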
5. Noise Contrastive Learning
- Intuitive explanation :
- Suppose we want to find the next word (`you`) appropriate for a given context (`[Nice, to, meet]`).
- We can train the network with the softmax CE loss function, which returns the probabilities for all candidate words.
- This means that the output "scores" for every class have to be normalized and converted into actual probabilities, which is computationally expensive.
- Let's simplify this multinomial classification problem to binary classification (logistic regression) by transforming inputs of the form "set of words and answer" into positive and negative pairs.
- The model then only needs to predict whether a pair is positive or negative, rather than find which word is the answer.
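The transformation into positive and negative pairs can be sketched as follows. This is a simplified illustration, not a full implementation: it assumes negatives are drawn uniformly from a toy vocabulary and omits the noise-distribution correction term that complete noise contrastive estimation applies to the scores. The vocabulary, function names, and seed are all hypothetical.

```python
import math
import random


def make_pairs(context, answer, vocabulary, num_negatives, rng):
    """Turn one (context, answer) example into labeled (context, word, label) pairs."""
    candidates = [w for w in vocabulary if w != answer]
    negatives = rng.sample(candidates, num_negatives)
    pairs = [(context, answer, 1)]                 # the true next word: positive pair
    pairs += [(context, w, 0) for w in negatives]  # sampled wrong words: negative pairs
    return pairs


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def pair_loss(score, label):
    """Binary cross entropy for one pair, given an unnormalized model score."""
    p = sigmoid(score)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


vocabulary = ["you", "him", "apple", "run", "blue", "the"]
rng = random.Random(0)
pairs = make_pairs(("Nice", "to", "meet"), "you", vocabulary, num_negatives=3, rng=rng)

for context, word, label in pairs:
    print(context, word, label)
```

Each training step now scores only `1 + num_negatives` words instead of normalizing over the entire vocabulary, which is where the computational saving comes from.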