
Cross Entropy

In life, there are times when each day seems like a repeat of the previous one, and before you know it, those days have passed in a blink. Then there are moments when a single new experience leaves a lasting impact on us. These memorable events—trips, special occasions, and so on—share one common trait: they don’t happen often. Even though an event may initially surprise us, if it begins to occur regularly, we quickly become accustomed to it, and its impact diminishes.

From the perspective of informational value, everyday experiences like going to school, working, or having lunch and dinner don't necessarily provide much new "information" for us. In contrast, a trip to a new place or the first day at school or work certainly gives us a lot of new input. Most people, for example, are more likely to remember their summer holiday than the routine of work or school throughout the year.

Like this, one way to quantify information is by considering how frequently an event occurs—how surprising it is, a concept also known as surprisal. If an event happens frequently, it is not surprising, and therefore it carries a low informational value. We can think of the magnitude of information as being inversely proportional to the probability of an event $x$, $p(x)$.

$$\text{Information} \sim \frac{1}{p(x)}$$

Entropy is Expected Information

Taking the logarithm of the quantity above gives $\log \frac{1}{p(x)}$. When $p(x) = 1$ (a certain event), $\log \frac{1}{p(x)} = 0$: an event that is guaranteed to happen carries zero information.
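As a quick sketch, this surprisal function can be computed in Python (using a base-2 logarithm, so information is measured in bits):

```python
import math

def surprisal(p: float) -> float:
    """Information content (in bits) of an event with probability p."""
    return math.log2(1.0 / p)

# A certain event carries no information.
print(surprisal(1.0))   # 0.0
# A fair coin flip carries exactly 1 bit.
print(surprisal(0.5))   # 1.0
# Rarer events are more surprising, hence more informative.
print(surprisal(0.01))  # ~6.64 bits
```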

Entropy ($H$) is defined as the expected value of information over the distribution $p$.

$$\begin{align*} H(p) &= \mathbb{E}\left[\log \frac{1}{p(x)}\right] \\ &= \mathbb{E}[-\log p(x)] \\ &= -\sum_x p(x) \log p(x) \end{align*}$$
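The definition translates directly into code: weight each event's surprisal by its probability and sum. A minimal sketch (terms with $p(x) = 0$ are conventionally treated as contributing zero):

```python
import math

def entropy(p):
    """H(p) = -sum p(x) log2 p(x), in bits; 0 * log 0 is taken as 0."""
    return -sum(px * math.log2(px) for px in p if px > 0)

# A fair coin is maximally unpredictable: H = 1 bit.
print(entropy([0.5, 0.5]))  # 1.0
# A biased coin is less surprising on average, so its entropy is lower.
print(entropy([0.9, 0.1]))  # ~0.469
```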

Cross Entropy is Surprisal Between Two Distributions

The cross entropy between distributions $p$ (the true distribution) and $q$ (the model's predicted distribution) is defined as follows:

$$H(p, q) = -\sum_x p(x) \log q(x)$$

Entropy measures the expected informational value of an event $x$ under the true distribution $p$. In contrast, cross entropy evaluates the expected surprisal of an event $x$ under the model's predicted distribution $q$, while the weights (probabilities) of event $x$ still follow the true distribution $p$. In other words, cross entropy can be understood as the expected number of bits (the basic unit of information) required to encode events from the true distribution $p$ when using a coding scheme (or model) optimized for the distribution $q$.
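A small sketch makes the "extra bits from a mismatched model" interpretation concrete: when $q = p$, cross entropy equals the entropy $H(p)$, and any mismatch only increases it.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) log2 q(x), in bits."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]       # true distribution
q = [0.8, 0.1, 0.1]         # a mismatched model

# When the model matches the truth, H(p, q) = H(p) = 1.5 bits.
print(cross_entropy(p, p))  # 1.5
# A mismatched model costs extra bits: H(p, q) > H(p).
print(cross_entropy(p, q))  # ~1.82
```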

Cross entropy is important in machine learning because it is one of the most commonly used loss functions for classification tasks. It helps evaluate how well a model’s predictions match the actual distribution.

Binary Classification Example

For binary classification problems with two possible classes (0 and 1), we represent the target probabilities as $p$ and the model's predicted probabilities as $q$. Typically, the target probabilities are defined as $p \in \lbrace y, 1-y \rbrace$ and $q \in \lbrace \hat{y}, 1-\hat{y} \rbrace$, where $y$ is the probability of class 1 and $\hat{y}$ is the predicted probability of class 1. The cross entropy between the target and predicted distributions is then defined as:

$$\begin{align*} H(p, q) &= -\sum_i p_i \log q_i \\ &= -y \log \hat{y} - (1-y) \log (1-\hat{y}) \end{align*}$$
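This is exactly the binary cross entropy loss used in classifiers. A minimal sketch (using the natural logarithm, as ML libraries typically do, rather than base 2):

```python
import math

def binary_cross_entropy(y: float, y_hat: float) -> float:
    """-y log(y_hat) - (1 - y) log(1 - y_hat), natural log."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

# Confident and correct prediction -> small loss.
print(binary_cross_entropy(1.0, 0.9))  # ~0.105
# Confident but wrong prediction -> large loss.
print(binary_cross_entropy(1.0, 0.1))  # ~2.303
```

The loss therefore penalizes the model most heavily when it assigns high confidence to the wrong class, which is what drives learning during gradient descent.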

Reference

Wikipedia Entropy

Wikipedia Cross entropy

PyTorch CrossEntropyLoss