Some intuitions behind Information Gain, Gain Ratio, and Symmetrical Uncertainty as calculated by the FSelectorRcpp package, measures that can serve as a good proxy for correlation between unordered factors.

I'm a big fan of using FSelectorRcpp in the exploratory phase to get an overview of the data. The main workhorse is the information_gain function, which calculates… information gain. But how should we interpret the output of this function?

To understand this, you need to know a bit about entropy. A good place to start is the Wikipedia page - https://en.wikipedia.org/wiki/Entropy_(information_theory). If you don't know anything about entropy from information theory, please start there.
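As a quick refresher: for a discrete variable \(X\) that takes value \(i\) with probability \(p_i\), Shannon's entropy is \(H(X) = -\sum_i p_i \log(p_i)\). It measures how unpredictable \(X\) is: a constant variable has entropy zero, and a uniform distribution over \(n\) values reaches the maximum, \(\log(n)\).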

Now let's get to the code. To calculate entropy, FSelectorRcpp requires all variables to be categorical (factor or character). By default, information_gain automatically discretizes numeric values using the so-called MDL algorithm (that's a topic for another post, so it won't be covered here). But I'll go step by step and discretize all the values on my own.

library(FSelectorRcpp)

disc <- discretize(Species ~ ., iris)
head(disc)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1  (-Inf,5.55] (3.35, Inf]  (-Inf,2.45]  (-Inf,0.8]  setosa
## 2  (-Inf,5.55] (2.95,3.35]  (-Inf,2.45]  (-Inf,0.8]  setosa
## 3  (-Inf,5.55] (2.95,3.35]  (-Inf,2.45]  (-Inf,0.8]  setosa
## 4  (-Inf,5.55] (2.95,3.35]  (-Inf,2.45]  (-Inf,0.8]  setosa
## 5  (-Inf,5.55] (3.35, Inf]  (-Inf,2.45]  (-Inf,0.8]  setosa
## 6  (-Inf,5.55] (3.35, Inf]  (-Inf,2.45]  (-Inf,0.8]  setosa

Then calculating information_gain looks like this:

# calling the information_gain on iris
# would give the same result
# information_gain(Species ~ ., iris) 
information_gain(Species ~ ., disc)
##     attributes importance
## 1 Sepal.Length  0.4521286
## 2  Sepal.Width  0.2672750
## 3 Petal.Length  0.9402853
## 4  Petal.Width  0.9554360

The theory tells us that information gain is defined as \(H(Class) + H(Attribute) - H(Class, Attribute)\), where \(H(X)\) is Shannon's entropy and \(H(X, Y)\) is the joint Shannon entropy of the pair \((X, Y)\).
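This quantity is better known as the mutual information between Class and Attribute. Using the identity \(H(X \mid Y) = H(X, Y) - H(Y)\), it can be rewritten as \(H(Class) - H(Class \mid Attribute)\), i.e., the reduction in uncertainty about the class once we know the attribute's value.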

So now let's calculate the information gain step by step:

# function to calculate Shannon's entropy of a vector of labels
# (natural logarithm, which matches what FSelectorRcpp reports)
entropy <- function(x) {
  n <- table(x)   # counts of each level
  p <- n/sum(n)   # empirical probabilities
  -sum(p*log(p))
}
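# quick sanity check of the helper (a toy example of my own, not from
# the package docs): a fair coin has entropy log(2)
entropy(c("heads", "tails"))
## [1] 0.6931472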
x <- entropy(disc$Sepal.Length) # H(Attribute)
y <- entropy(disc$Species) # H(Class)

# This step is quite fun: to calculate the joint entropy you can just
# glue the values together. Pasting the two columns yields a single
# vector whose values are (attribute, class) pairs, so its entropy is
# exactly the joint entropy (think a little about the equation from
# Wikipedia and it will become obvious).
xy <- entropy(paste(disc$Sepal.Length, disc$Species)) # H(Class, Attribute)

So the final information gain is equal to:

x + y - xy
## [1] 0.4521286
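The same recipe applied to every attribute at once reproduces the whole importance column from before. This is just a sketch reusing the entropy helper defined above:

attrs <- setdiff(names(disc), "Species")
sapply(disc[attrs], function(a) {
  entropy(a) + entropy(disc$Species) - entropy(paste(a, disc$Species))
})
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.4521286    0.2672750    0.9402853    0.9554360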

Note that the joint entropy satisfies \(H(X, Y) = H(X) + H(Y)\) when there's no relation between \(X\) and \(Y\) (i.e., when they are independent), and in that case the information gain is zero.

entropy(disc$Species)
## [1] 1.098612
set.seed(123)
# sample() shuffles the values, destroying any relation between the variables
entropy(paste(sample(disc$Species), sample(disc$Species)))
## [1] 2.178778
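So after shuffling, the information gain would be \(1.098612 + 1.098612 - 2.178778 \approx 0.018\): not exactly zero, but close to it, because independence only holds approximately in a finite sample.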

Gain ratio and Symmetrical Uncertainty

FSelectorRcpp allows you to use two other methods of calculating feature importance, both based on entropy and the information gain measure:

  • Gain ratio - defined as \((H(Class) + H(Attribute) - H(Class, Attribute)) / H(Attribute)\).
  • Symmetrical Uncertainty - equal to \(2 \cdot (H(Class) + H(Attribute) - H(Class, Attribute)) / (H(Attribute) + H(Class))\).

Both scale the information gain to the \([0,1]\) range (zero when there's no relation, and one for perfect dependence). For Sepal.Length we can verify both by hand, as shown below.
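Reusing the values computed earlier for Sepal.Length (x = \(H(Attribute)\), y = \(H(Class)\), xy = \(H(Class, Attribute)\)), both measures can be reproduced by hand; compare the results with the first rows of the tables below:

(x + y - xy) / x           # gain ratio
## [1] 0.4196464
2 * (x + y - xy) / (x + y) # symmetrical uncertainty
## [1] 0.4155563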

information_gain(Species ~ ., disc, type = "gainratio")
##     attributes importance
## 1 Sepal.Length  0.4196464
## 2  Sepal.Width  0.2472972
## 3 Petal.Length  0.8584937
## 4  Petal.Width  0.8713692
information_gain(Species ~ ., disc, type = "symuncert")
##     attributes importance
## 1 Sepal.Length  0.4155563
## 2  Sepal.Width  0.2452743
## 3 Petal.Length  0.8571872
## 4  Petal.Width  0.8705214

Note that because both values are defined on the \([0,1]\) range, they can serve as a proxy for correlation between two unordered factors (which is sometimes useful).
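For example, scoring a single pair of factors is just a matter of putting them on both sides of the formula. The call below reproduces the Sepal.Length row from the symmetrical uncertainty table above (each attribute is scored against the class independently, so the single-attribute result matches):

information_gain(Species ~ Sepal.Length, disc, type = "symuncert")
##     attributes importance
## 1 Sepal.Length  0.4155563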

Other resources: