Published

Thu 21 January 2016

←Home

Why you should use square root of Gini Index

In this post I will explain why you should use square root of Gini index while building decision tree classification models. In decision tress, We know that at every node we need to choose a feature that provides the best split i.e. the feature that reduces the child nodes' impurity the most.

In case of using entropy or normal Gini index that decision is affected by class distribution, but if we use square root of Gini Index, it is not affected. Let me show Why.

Let's take a case where we have to classify a sample either as positive or negative and we are at a point where we have to decide amongst features that provide two splits 8+ 2- | 2+ 8- and 10+ 5-| 0+ 5- . Let's apply both standard Gini Index and square root of Gini index.

Gini Index is given by the formula $2 * p (1-p)$ and square root of Gini Index is $\sqrt{ 2 p (1-p) }$. The total impurity can be measured as the weighted sum of the in \frac idual nodes' impurity. Discarding 2 for brevity in the formula, we see that if we choose the Gini Index we will have the values as $\frac {8} {10} * \frac {2} {10} * \frac {10} {20} + \frac {2} {10} * \frac {8} {10} * \frac {10} {20}$ i.e. .16 and $\frac {10} {15} * \frac {5} {15} * \frac {15} {20} + \frac {0} {5} * \frac {5} {5} * \frac {5} {20}$ i.e. .1667. So by using Gini Index first choice* seems preferable.

For square root of Gini Index we can similarly calculate impurity as $\sqrt{\frac {82} {1010}} * \frac {10} {20} + \sqrt{\frac {28} {1010}} \frac {10} {20}$ i.e. .4 and for second split the value is $\sqrt{\frac {105} {1515}} * \frac {15} {20} + \sqrt{\frac {05} {55}} * \frac {5} {20}$ i.e. .3535. So the $\sqrt{Gini Index}$ seems to prefer the second choice*.

To see the problem that can be caused by using a non representative sample data let's suppose that on the sample data, positives are oversampled by a factor of 2. The decision now requires choosing between 16+ 2-| 4+ 8- and 20+ 5- | 0+ 5-. Calculating as above we find the value of Gini Index as $\frac {16} {18} * \frac {2} {18} \frac {18} {30} + \frac {4} {12} * \frac {8} {12} * \frac {12} {30}$ i.e. .1441 and for second choice of split, we have the value as $\frac {20} {25} * \frac {5} {25} * \frac {25} {30} + \frac {0} {5} * \frac {5} {5} * \frac {5} {30}$ i.e. .1333. Alas! now the Gini Index also prefers the second choice of split*.

After calculating this for the square root of Gini Index, we find that the square root of Gini Index still prefers the second choice.

What Changed ?

The distribution is now skewed because we have a sample data where class distribution has changed and the Gini Index is affected by the changes in the class distribution. To see this mathematically, let's assume that after doing the split we have nodes p1 and p2. Let $n1_+$ represent the number of positives in node p1, $n1_-$ represent the negative samples in p1 and n1 represent the total number of samples in the node p1.

Then, In case of Gini Index, to measure the impurity we have the formula $\frac {n1} {n}* \frac {n1_+} {n1} * \frac {n1_-} {n1} + \frac {n2} {n} * \frac {n2_+} {n2} * \frac {n2_-} {n2}$ For the parent node the impurity is given by $ \frac {n_+} {n} * \frac {n_-} {n}$ Considering only a child node relative to the parent we have $\frac {n} {n1} * \frac {(n1_+ * n1_-)} {(n_+ * n_-)}$ . As you can see from the equation, the second part does not change even if we have accidentally oversampled the positives or negatives. Oversampling of a class can be considered as multiplying both the numerator and denominator in the equation by a constant, Hence they the constant term will cancel out, but the first part can change and hence drives the change in split decision.

In case of $\sqrt{Gini Index}$ , we have the above formula as $ \frac {n1} {n} * \sqrt{\frac {n1_+} {n1} * \frac {n1_-} {n1}} + \frac {n2} {n}* \sqrt{\frac {n2_+} {n2} * \frac {n2_-} {n2}}$. And for the parent node we have $\sqrt{\frac {n_+} {n} * \frac {n_-} {n}}$ which can be simplified as $\frac {\sqrt{n_+ * n_-}} {n}$. We can write the impurity of the child node as $\frac {n1} {n} * \frac {\sqrt{n1_+ * n1_-}} {n1}$ . On simplification this yields $\frac {1} {n} * \sqrt{n1_+ * n1_-}$. So relative to the parent node, we can write the expression as $\frac {\sqrt{n1_+ * n1_-}} {\sqrt{n_+ * n_-}}$. We can see that using the square root eliminates the factor which was causing the change in split decision. Thus, it can be seen that $\sqrt{Gini Index}$ is not affected by changes in class distribution.

Conclusions

Prefer the $\sqrt{Gini Index}$ as impurity measure as it is not affected by changes in class distribution, especially when you are not sure that the sample data space is representative of real data space.

Go Top
comments powered by Disqus