Log Loss and Multi-Class Hinge Loss (1)
Hinge loss functions are mainly used in support vector machines for classification problems, while cross-entropy loss functions are ubiquitous in neural networks. In this note, we will study their basic properties and compare their usage.
First, let’s have a look at the binary case. In the binary classification problem, we only need one output node, say $f(x)$. Then the hinge loss is defined by
$$\ell_{\text{hinge}}\bigl(y, f(x)\bigr) = \max\bigl(0,\, 1 - y f(x)\bigr),$$
where the label $y \in \{-1, +1\}$.
It is a piecewise linear function, which dominates the 0-1 loss.
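To make this concrete, here is a minimal NumPy sketch (the function names are my own) of the binary hinge loss next to the 0-1 loss; evaluating both on a grid of scores shows the hinge loss is piecewise linear and never falls below the 0-1 loss.

```python
import numpy as np

def hinge_loss(y, score):
    """Binary hinge loss max(0, 1 - y * score) for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * score)

def zero_one_loss(y, score):
    """0-1 loss: 1 whenever the sign of the score disagrees with the label."""
    return (y * score <= 0).astype(float)

scores = np.linspace(-2.0, 2.0, 9)   # raw classifier outputs f(x)
print(hinge_loss(1, scores))          # piecewise linear in the score
print(zero_one_loss(1, scores))       # never above the hinge loss
```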
The Hinge Loss Originates From SVM
In the linear classification problem, we use the hyperplane $w^\top x + b = 0$ as the decision criterion; that is, $w^\top x + b > 0$ implies $y = +1$ and vice versa. To find such a plane, we define two reference planes $w^\top x + b = \pm 1$. If a point $(x_i, y_i)$ is on the correct side of its reference plane, that is, $y_i(w^\top x_i + b) \ge 1$, then the classifier should not suffer a loss. If the point is on the wrong side of its reference plane, the classifier should suffer a loss proportional to the distance between the point and its reference plane,
$$\ell_i = \max\bigl(0,\, 1 - y_i(w^\top x_i + b)\bigr).$$
Therefore, the overall loss the classifier should suffer is given by
$$L(w, b) = \sum_{i=1}^{n} \max\bigl(0,\, 1 - y_i(w^\top x_i + b)\bigr).$$
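As an illustration, here is a short sketch (the toy data and variable names are my own) that evaluates this overall loss for a fixed linear classifier:

```python
import numpy as np

def total_hinge_loss(w, b, X, y):
    """Sum of hinge losses of the linear classifier sign(w^T x + b) over the data."""
    margins = y * (X @ w + b)                        # y_i (w^T x_i + b)
    return np.sum(np.maximum(0.0, 1.0 - margins))

# toy 2D data, two points per class (values chosen only for illustration)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([1.0, 1.0]), 0.0
print(total_hinge_loss(w, b, X, y))   # 0.0 here: every point is beyond its reference plane
```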
However, we can see that to reduce such a loss, we can simply increase $\|w\|$, which narrows the distance $1/\|w\|$ between the reference planes and the classifier plane, called the margin. Reducing the margin too much will result in a loss of generalization performance. This is clear in the separable case: among the many different hyperplanes that perfectly separate the two classes, the ones with smaller margins tend to separate the two classes in a skewed way. More interpretation can be found in the paper Duality and Geometry in SVM Classifiers.
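For reference, the standard soft-margin SVM formulation keeps the margin under control by adding a norm penalty to the hinge losses, with a constant $C > 0$ trading the two terms off:
$$\min_{w, b}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max\bigl(0,\, 1 - y_i(w^\top x_i + b)\bigr).$$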
The Log Loss Originates From Information Theory
According to Wikipedia, the cross entropy
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
is used in information theory to characterize the average number of bits needed to identify an event drawn from the set, if the coding scheme used is optimized for an estimated distribution $q$ rather than the true distribution $p$. Since $-\log q(x)$ represents the number of bits required to encode an event with probability $q(x)$, it is easy to see that
$$H(p, q) \ge H(p, p) = H(p),$$
since by Jensen’s inequality,
$$H(p) - H(p, q) = \sum_{x} p(x) \log \frac{q(x)}{p(x)} \le \log \sum_{x} p(x)\,\frac{q(x)}{p(x)} = \log \sum_{x} q(x) = 0.$$

A learning algorithm usually returns a continuous function without further constraints. To use the cross entropy as a loss function, we need to normalize the outputs of the model into a probability distribution. This is done by exponentiating the outputs and then normalizing them (the softmax function). Hence, for a group of outputs $z_1, \dots, z_K$ with true class $y$, the log loss is
$$\ell(y, z) = -\log \frac{e^{z_y}}{\sum_{k=1}^{K} e^{z_k}}.$$
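As a concrete illustration (the function and variable names here are my own), a minimal NumPy version of this normalization followed by the log loss:

```python
import numpy as np

def log_loss(z, y):
    """Log loss -log softmax(z)[y] for raw outputs z and true class index y."""
    z = z - np.max(z)                            # shift for numerical stability
    log_probs = z - np.log(np.sum(np.exp(z)))    # log of the normalized outputs
    return -log_probs[y]

z = np.array([2.0, 0.5, -1.0])    # raw, unnormalized model outputs
print(log_loss(z, y=0))            # small: the true class already has the largest output
print(log_loss(z, y=2))            # large: the true class has the smallest output
```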
Let’s first have a look at the binary case, where the log loss takes the form
$$\ell\bigl(y, f(x)\bigr) = \log\bigl(1 + e^{-y f(x)}\bigr),$$
where $y \in \{-1, +1\}$ and $f(x) = z_{+1}(x) - z_{-1}(x)$ plays the role of the single output. We can see the similarity between the log loss and the hinge loss from the graph. By minimizing the empirical risk, we get a solution $\hat f(x)$, which estimates the log-odds $\log \frac{P(y = +1 \mid x)}{P(y = -1 \mid x)}$. The probabilities $P(y = +1 \mid x)$ and $P(y = -1 \mid x)$ are recovered using the formula
$$P(y = \pm 1 \mid x) = \frac{1}{1 + e^{\mp \hat f(x)}}.$$
In particular, if $f(x) = w^\top x + b$ is linear, this is called logistic regression.
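A minimal sketch of the binary case (again with my own names), computing the log loss and recovering the probability with the formula above:

```python
import numpy as np

def binary_log_loss(y, f):
    """log(1 + exp(-y * f(x))) for labels y in {-1, +1}."""
    return np.log1p(np.exp(-y * f))

def prob_positive(f):
    """Recover P(y = +1 | x) = 1 / (1 + exp(-f(x))) from the raw output."""
    return 1.0 / (1.0 + np.exp(-f))

f = np.linspace(-3.0, 3.0, 7)
print(binary_log_loss(1, f))   # smooth, decreasing analogue of the hinge loss
print(prob_positive(f))        # probabilities recovered from the raw outputs
```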