Log Loss and Multi-Class Hinge Loss (1)

Hinge loss functions are mainly used in support vector machines for classification problems, while cross-entropy loss functions are ubiquitous in neural networks. In this note, we will study their basic properties and compare their usage.

First, let’s have a look at the binary case. In the binary classification problem, we only need one output node, say $f(x)$, together with a label $y \in \{-1, +1\}$. Then the hinge loss is defined by

$$\ell(y, f(x)) = \max\big(0,\, 1 - y f(x)\big).$$

It is a piecewise linear function, which dominates the 0-1 loss.
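
As a quick illustration, the following minimal NumPy sketch (the toy scores are purely illustrative) evaluates both losses on a few points and checks that the hinge loss is never smaller than the 0-1 loss.

```python
import numpy as np

def hinge_loss(y, fx):
    """Binary hinge loss max(0, 1 - y * f(x)) for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * fx)

def zero_one_loss(y, fx):
    """0-1 loss: 1 if the sign of f(x) disagrees with y, else 0."""
    return (y * fx <= 0).astype(float)

y = np.array([+1, +1, -1, -1])
fx = np.array([2.0, 0.3, -0.5, 0.7])   # raw scores f(x)

print(hinge_loss(y, fx))       # [0.  0.7 0.5 1.7]
print(zero_one_loss(y, fx))    # [0. 0. 0. 1.]
assert np.all(hinge_loss(y, fx) >= zero_one_loss(y, fx))   # hinge dominates 0-1
```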

The Hinge Loss Originates From SVM

In the linear classification problem, we use the hyperplane $w^\top x + b = 0$ as the decision criterion; that is, $w^\top x + b > 0$ implies $y = +1$ and vice versa. To find such a plane, we define two reference planes $w^\top x + b = \pm 1$. If a point is on the correct side of its reference plane, that is, $y_i(w^\top x_i + b) \ge 1$, then the classifier should not suffer a loss. If the point is on the wrong side of its reference plane, the classifier should suffer a loss proportional to the distance between the point and its reference plane,

$$\frac{\max\big(0,\, 1 - y_i(w^\top x_i + b)\big)}{\lVert w \rVert}.$$

Therefore, the overall loss the classifier should suffer is given by

$$\frac{1}{\lVert w \rVert} \sum_i \max\big(0,\, 1 - y_i(w^\top x_i + b)\big).$$

However, we can see that to reduce such a loss, we can simply increase $\lVert w \rVert$, which narrows the distance between the reference planes and the classifier plane, called the margin. Reducing the margin too much hurts generalization performance. This is clear in the separable case: among the many different hyperplanes that perfectly separate the two classes, the ones with smaller margins tend to separate them in a skewed way. More interpretation can be found in the paper Duality and Geometry in SVM Classifiers.
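
To make this trade-off concrete, here is a small NumPy sketch of the standard soft-margin objective, which adds a penalty on $\lVert w \rVert^2$ so that the hinge term cannot be reduced simply by inflating $w$; the constant $C$ and the toy data are illustrative assumptions, not values from the text.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    """Standard soft-margin SVM objective:
    0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i + b)).
    The first term keeps the margin 2 / ||w|| from collapsing;
    the second is the total hinge loss over the training set."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * np.dot(w, w) + C * hinge.sum()

# Toy separable data: two points per class in 2D.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([+1, +1, -1, -1])

w_wide = np.array([0.3, 0.3])    # wide margin, small penalty, some hinge loss
w_narrow = np.array([3.0, 3.0])  # narrow margin, zero hinge loss, large penalty
for w in (w_wide, w_narrow):
    print(w, soft_margin_objective(w, 0.0, X, y))
```

Minimizing this regularized objective, rather than the bare hinge loss, is what prevents the margin $2/\lVert w \rVert$ from being squeezed to zero.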

The Log Loss Originates From Information Theory

According to Wikipedia, the cross entropy

$$H(p, q) = -\sum_x p(x) \log q(x)$$

is used in information theory to characterize the average number of bits needed to identify an event drawn from a set, if the coding scheme used is optimized for an estimated distribution $q$ rather than the true distribution $p$. Here $-\log q(x)$ represents the number of bits required to encode an event with probability $q(x)$. It is easy to see that

$$H(p, q) \ge H(p, p) = H(p),$$

since by Jensen’s inequality,

$$H(p, q) - H(p) = -\sum_x p(x) \log \frac{q(x)}{p(x)} \ge -\log \sum_x p(x)\,\frac{q(x)}{p(x)} = -\log \sum_x q(x) = 0.$$

A learning algorithm usually returns a continuous function without further constraints. To use the cross entropy as a loss function, we need to normalize the outputs of the model into a probability distribution. This is done by exponentiating the outputs and then normalizing them (the softmax). Hence for a group of outputs $f_1(x), \dots, f_K(x)$ and a label $y \in \{1, \dots, K\}$, the log loss is

$$\ell(y, f(x)) = -\log \frac{e^{f_y(x)}}{\sum_{k=1}^{K} e^{f_k(x)}}.$$
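
The sketch below (illustrative only) numerically checks $H(p, q) \ge H(p)$ on a small example and evaluates this multi-class log loss, using the log-sum-exp trick to keep the softmax numerically stable.

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), in nats."""
    return -np.sum(p * np.log(q))

p = np.array([0.7, 0.2, 0.1])   # true distribution
q = np.array([0.4, 0.4, 0.2])   # coding distribution
assert cross_entropy(p, q) >= cross_entropy(p, p)   # H(p, q) >= H(p)

def log_loss(scores, y):
    """Multi-class log loss -log softmax(scores)[y]."""
    scores = scores - scores.max()                   # shift for stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[y]

scores = np.array([2.0, -1.0, 0.5])   # raw outputs f_1(x), ..., f_K(x)
print(log_loss(scores, y=0))          # small loss: class 0 has the top score
print(log_loss(scores, y=1))          # large loss: class 1 is far behind
```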

Let’s again have a look at the binary case, where the log loss takes the form

$$\ell(y, f(x)) = \log\big(1 + e^{-y f(x)}\big),$$

where $f(x) = f_{+1}(x) - f_{-1}(x)$. We can see the similarity between the log loss and the hinge loss from their graphs as functions of the margin $y f(x)$. By minimizing the empirical risk, we get a solution $\hat{f}$. The probabilities $p(+1 \mid x)$ and $p(-1 \mid x)$ are recovered using the formula

$$p(y \mid x) = \frac{1}{1 + e^{-y \hat{f}(x)}}.$$

In particular, if $f$ is restricted to be a linear function $f(x) = w^\top x + b$, this is called logistic regression.
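
As a sanity check, the following sketch (again illustrative) verifies that the binary formula agrees with the two-output softmax view: the recovered probability is the sigmoid of the score difference, and the binary log loss equals the negative log of the softmax probability.

```python
import numpy as np

def binary_log_loss(y, fx):
    """log(1 + exp(-y * f(x))) for labels y in {-1, +1}."""
    return np.log1p(np.exp(-y * fx))

def prob(y, fx):
    """Recovered probability p(y | x) = 1 / (1 + exp(-y * f(x)))."""
    return 1.0 / (1.0 + np.exp(-y * fx))

# Two-output softmax view: f(x) = f_{+1}(x) - f_{-1}(x).
f_pos, f_neg = 1.3, -0.4
fx = f_pos - f_neg
softmax_pos = np.exp(f_pos) / (np.exp(f_pos) + np.exp(f_neg))

assert np.isclose(prob(+1, fx), softmax_pos)                       # same probability
assert np.isclose(binary_log_loss(+1, fx), -np.log(softmax_pos))   # same loss
print(prob(+1, fx), prob(-1, fx))   # the two probabilities sum to 1
```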