We have eight shapes in a bag that are either triangles, squares, or circles. Their color is either red or green.
draw_obj(example_obj);
display_dfs(example, 'Dataset', proba, 'Joint and marginal distributions')
|   | A     | B        |
|---|-------|----------|
| 0 | green | circle   |
| 1 | red   | circle   |
| 2 | green | triangle |
| 3 | red   | triangle |
| 4 | green | circle   |
| 5 | red   | square   |
| 6 | red   | triangle |
| 7 | red   | circle   |
| $A_i \backslash B_i$ | circle | square | triangle | $P(A_i)$ |
|----------------------|--------|--------|----------|----------|
| green                | 0.25   | 0      | 0.125    | 0.375    |
| red                  | 0.25   | 0.125  | 0.25     | 0.625    |
| $P(B_i)$             | 0.5    | 0.125  | 0.375    | 1        |
In the second table, the six cells in the top-left block show the joint probability distribution $P(A_i, B_i)$. The final column and final row show the marginal probability distributions of $\mathcal{A}$ and $\mathcal{B}$, respectively.
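The helper `display_dfs` above is the notebook's own; as a sketch of where these numbers come from, the same joint and marginal table can be reproduced with plain pandas (the dataset is re-entered by hand here to keep the snippet self-contained):

```python
import pandas as pd

# The eight shapes from the dataset table above (same order).
example = pd.DataFrame({
    "A": ["green", "red", "green", "red", "green", "red", "red", "red"],
    "B": ["circle", "circle", "triangle", "triangle", "circle", "square", "triangle", "circle"],
})

# Joint distribution P(A_i, B_i); margins=True adds the marginals P(A_i) and P(B_i).
proba = pd.crosstab(example["A"], example["B"], normalize="all",
                    margins=True, margins_name="P")
print(proba)
```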
Recall from the lecture:
Let $A$ denote an event and let $P(A)$ denote the probability that $A$ occurs. Then the entropy (self-information, information content) of $A$ is defined as $-\log_2(P(A))$.
Let $\mathcal{A}$ be an experiment with the exclusive outcomes (events) $A_1,\ldots,A_k$. Then the mean information content of $\mathcal{A}$, denoted as $H(\mathcal{A})$, is called Shannon entropy or entropy of experiment $\mathcal{A}$ and is defined as follows:
$H(\mathcal{A}) = -\sum_{i=1}^k P(A_i) \cdot \log_2(P(A_i))$
In other words, the Shannon entropy gives us a measure of the expected degree of "surprise" from learning the outcome of $\mathcal{A}$.
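The notebook's `entropy_calculation` helper (called in the next cell) is not shown here; a minimal sketch of the same computation, applied to the marginal probabilities from the table above:

```python
import numpy as np

def shannon_entropy(probs):
    """H = -sum_i p_i * log2(p_i); outcomes with probability 0 contribute nothing."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(shannon_entropy([0.375, 0.625]))       # H(A) ≈ 0.954
print(shannon_entropy([0.5, 0.125, 0.375]))  # H(B) ≈ 1.406
```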
display_dfs(example, 'Dataset', proba, 'Joint and marginal distributions'); entropy_calculation(example, 'A')
For a binary variable, the entropy reaches a maximum of 1.0 when both outcomes are equally probable.
In other words, an entropy of 1 bit corresponds to the information content of a (fair) coin toss.
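A minimal numeric sketch of this curve, separate from the notebook's own `binary_entropy()` helper called below (which presumably plots it); the name `bernoulli_entropy` is ours. For a binary variable with outcome probabilities $p$ and $1-p$, the entropy is $-p\log_2 p - (1-p)\log_2(1-p)$.

```python
import numpy as np

def bernoulli_entropy(p):
    """Entropy in bits of a binary variable with outcome probabilities (p, 1-p)."""
    p = np.asarray(p, dtype=float)
    h = np.zeros_like(p)                      # H(0) = H(1) = 0 by convention
    m = (p > 0) & (p < 1)
    h[m] = -(p[m] * np.log2(p[m]) + (1 - p[m]) * np.log2(1 - p[m]))
    return h

print(bernoulli_entropy(np.array([0.1, 0.375, 0.5, 0.9])))
# ≈ [0.469  0.954  1.0  0.469]  -- the maximum of 1 bit is reached at p = 0.5
```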
binary_entropy()
display_dfs(example, 'Dataset', proba, 'Joint and marginal probabilities', '\n'); entropy_calculation(example, 'A', 'B')
We now want to answer the following question: how much uncertainty about variable $\mathcal{A}$ remains after we learn the value of variable $\mathcal{B}$?
Recall from the slides:
Let $\mathcal{A}$ be an experiment with the exclusive outcomes (events) $A_1,\ldots,A_k$, and let $\mathcal{B}$ be another experiment with the outcomes $B_1,\ldots,B_s$. Then the conditional entropy of the combined experiment $(\mathcal{A}\mid\mathcal{B})$ is defined as follows:
$H(\mathcal{A}\mid\mathcal{B}) = \sum_{j=1}^s P(B_j) \cdot H(\mathcal{A}\mid B_j)$
where
$H(\mathcal{A}\mid B_j) = -\sum_{i=1}^k P(A_i\mid B_j) \cdot \log_2(P(A_i\mid B_j))$
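A minimal sketch of this definition applied to our dataset (re-entered by hand; the helper names are ours, not the notebook's): group the records by the value of $\mathcal{B}$, compute the entropy of $\mathcal{A}$ within each group, and weight each group entropy by $P(B_j)$.

```python
import numpy as np
import pandas as pd

# The dataset from the top of this section.
example = pd.DataFrame({
    "A": ["green", "red", "green", "red", "green", "red", "red", "red"],
    "B": ["circle", "circle", "triangle", "triangle", "circle", "square", "triangle", "circle"],
})

def shannon_entropy(probs):
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def conditional_entropy(df, target, given):
    """H(target | given) = sum_j P(B_j) * H(target | B_j)."""
    h = 0.0
    for _, group in df.groupby(given):
        p_bj = len(group) / len(df)        # P(B_j)
        h += p_bj * shannon_entropy(group[target].value_counts(normalize=True))
    return h

print(round(conditional_entropy(example, "A", "B"), 3))   # 0.844
```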
display_dfs(example, 'Dataset', proba, 'Joint and marginal distributions', entropies, 'Entropies', '\n', cp, 'Conditional probabilities $P(A_i | B_i)$', ch, 'Specific Conditional entropies');
| variable         | entropy |
|------------------|---------|
| $H(\mathcal{A})$ | 0.954   |
| $H(\mathcal{B})$ | 1.406   |
| $A_i \backslash B_i$ | circle | square | triangle |
|----------------------|--------|--------|----------|
| green                | 0.5    | 0      | 0.333    |
| red                  | 0.5    | 1      | 0.667    |
| $B_i$    | $H(\mathcal{A} \mid B_i)$ |
|----------|---------------------------|
| circle   | 1                         |
| square   | 0                         |
| triangle | 0.918                     |
The expression $H(\mathcal{A}\mid B_i)$ reads as "the entropy of variable $\mathcal{A}$ among only those records that have $\mathcal{B}=B_i$."
In our example, if we know that $\mathcal{B}=\mathrm{'circle'}$, the value of $\mathcal{A}$ is exactly as uncertain as a coin toss, but not so for the other possible values of $\mathcal{B}$.
The expression $H(\mathcal{A}\mid\mathcal{B})$ is the average of $H(\mathcal{A}\mid B_i)$ over all possible values $B_i$ of $\mathcal{B}$, each weighted by its probability $P(B_i)$.
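Plugging the values of $P(B_i)$ and $H(\mathcal{A}\mid B_i)$ from the tables above into the definition gives

$$H(\mathcal{A}\mid\mathcal{B}) = 0.5 \cdot 1 + 0.125 \cdot 0 + 0.375 \cdot 0.918 \approx 0.844$$

which is the conditional entropy used in the information-gain calculation at the end of this section.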
display_dfs(proba, 'Joint and marginal distributions', cp, 'Conditional probabilities $P(A_i | B_i)$', ch, 'Specific Conditional entropies');
calc_cond_ent(ch, proba, "A", "B")
display_dfs(entropies, 'Entropies');
calc_cond_ent(ch, proba, "A", "B")
The information gain tells us how much we learn about variable $\mathcal{A}$ by knowing the value of variable $\mathcal{B}$, expressed as the reduction in entropy:
$$\begin{align} H(\mathcal{A}) - H(\mathcal{A}\mid\mathcal{B}) &= 0.954 - 0.844 = 0.11 \end{align}$$
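As a compact cross-check, the same number can be reproduced directly from the probability tables above (the helper `H` below is ours):

```python
import numpy as np

def H(probs):
    """Shannon entropy in bits of a probability vector."""
    p = np.array([x for x in probs if x > 0])
    return -np.sum(p * np.log2(p))

h_a = H([0.375, 0.625])                    # H(A) from the marginals of A
h_a_given_b = (0.5   * H([0.5, 0.5])       # P(circle)   * H(A | circle)
             + 0.125 * H([0.0, 1.0])       # P(square)   * H(A | square)
             + 0.375 * H([1/3, 2/3]))      # P(triangle) * H(A | triangle)
print(round(h_a - h_a_given_b, 2))         # information gain: 0.11
```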