The main purpose of the FSelectorRcpp package is the feature selection based on the entropy function. However, it also contains a function to discretize continuous variable into nominal attributes, and we decided to slightly change the API related to this functionality, to make it more user-friendly.
EDIT: Updated version (0.2.1) is on CRAN. It can be installed using:
install.packages("FSelectorRcpp")
The dev version can be installed using devtools:
devtools::install_github("mi2-warsaw/FSelectorRcpp", ref = "dev")
discretize now returns all columns by default.
In the current version available on CRAN, calling discretize(Species ~ Sepal.Length, iris return a discretized data frame with only two variables - “Species”, and “Sepal.Length”, all others are discarded. However, it seems to be more natural to return all columns by default, because it allows to easily chain multiple calls to discretize with different methods used for different columns. See the example below:
library(FSelectorRcpp)
# before 0.2.0
discretize(Species ~ Sepal.Length, iris, all = FALSE)
## Species Sepal.Length
## 1 setosa (-Inf,5.55]
## 2 setosa (-Inf,5.55]
## 3 setosa (-Inf,5.55]
## 4 setosa (-Inf,5.55]
## 5 setosa (-Inf,5.55]
## 6 setosa (-Inf,5.55]
## 7 setosa (-Inf,5.55]
## 8 setosa (-Inf,5.55]
## 9 setosa (-Inf,5.55]
## 10 setosa (-Inf,5.55]
## [ reached getOption("max.print") -- omitted 140 rows ]
# After - returns all columns by default:
discretize(Species ~ Sepal.Length, iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 (-Inf,5.55] 3.5 1.4 0.2 setosa
## 2 (-Inf,5.55] 3.0 1.4 0.2 setosa
## 3 (-Inf,5.55] 3.2 1.3 0.2 setosa
## 4 (-Inf,5.55] 3.1 1.5 0.2 setosa
## [ reached getOption("max.print") -- omitted 146 rows ]
library(magrittr)
discData <- iris %>%
discretize(Species ~ Sepal.Length, customBreaksControl(c(0, 5, 7.5, 10))) %>%
discretize(Species ~ Sepal.Width, equalsizeControl(5)) %>%
discretize(Species ~ .)
discData
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 (5,7.5] (3.4, Inf] (-Inf,2.45] (-Inf,0.8] setosa
## 2 (0,5] (3,3.1] (-Inf,2.45] (-Inf,0.8] setosa
## 3 (0,5] (3.1,3.4] (-Inf,2.45] (-Inf,0.8] setosa
## 4 (0,5] (3.1,3.4] (-Inf,2.45] (-Inf,0.8] setosa
## [ reached getOption("max.print") -- omitted 146 rows ]
discretize_transform
We also added a discretize_transform which takes a result of the discretize function and uses its cutpoints to discretize new data set. It might be useful in the ML pipelines, where you want to apply the same transformations to the train and test sets.
set.seed(123)
idx <- sort(sample.int(150, 100))
irisTrain <- iris[idx, ]
irisTest <- iris[-idx, ]
discTrain <- irisTrain %>%
discretize(Species ~ Sepal.Length, customBreaksControl(c(0, 5, 7.5, 10))) %>%
discretize(Species ~ Sepal.Width, equalsizeControl(5)) %>%
discretize(Species ~ .)
discTest <- discretize_transform(discTrain, irisTest)
# levels for both columns are equal
all.equal(
lapply(discTrain, levels),
lapply(discTest, levels)
)
## [1] TRUE
discTrain
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 (5,7.5] (3.4, Inf] (-Inf,2.6] (-Inf,0.8] setosa
## 3 (0,5] (3.2,3.4] (-Inf,2.6] (-Inf,0.8] setosa
## 5 (0,5] (3.4, Inf] (-Inf,2.6] (-Inf,0.8] setosa
## 6 (5,7.5] (3.4, Inf] (-Inf,2.6] (-Inf,0.8] setosa
## [ reached getOption("max.print") -- omitted 96 rows ]
discTest
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 2 (0,5] (2.75,3] (-Inf,2.6] (-Inf,0.8] setosa
## 4 (0,5] (3,3.2] (-Inf,2.6] (-Inf,0.8] setosa
## 10 (0,5] (3,3.2] (-Inf,2.6] (-Inf,0.8] setosa
## 13 (0,5] (2.75,3] (-Inf,2.6] (-Inf,0.8] setosa
## [ reached getOption("max.print") -- omitted 46 rows ]
discretize and information_gain
The code below shows how to compare the feature importance of the two discretization methods applied to the same data. Note that you can discretization using the default method, and then passing the output to the information_gain leads to the same result as directly calling information_gain, on the data without discretization.
library(dplyr)
discTrainCustom <- irisTrain %>%
discretize(Species ~ Sepal.Length, customBreaksControl(c(0, 5, 7.5, 10))) %>%
discretize(Species ~ Sepal.Width, equalsizeControl(5)) %>%
discretize(Species ~ .)
discTrainMdl <- irisTrain %>% discretize(Species ~ .)
custom <- information_gain(Species ~ ., discTrainCustom)
mdl <- information_gain(Species ~ ., discTrainMdl)
all.equal(
information_gain(Species ~ ., discretize(irisTrain, Species ~ .)),
information_gain(Species ~ ., discTrainMdl)
)
## [1] TRUE
custom <- custom %>% rename(custom = importance)
mdl <- mdl %>% rename(mdl = importance)
inner_join(mdl, custom, by = "attributes")
## attributes mdl custom
## 1 Sepal.Length 0.4340278 0.2368476
## 2 Sepal.Width 0.2333229 0.3301300
## 3 Petal.Length 0.9934589 0.9934589
## 4 Petal.Width 0.9520131 0.9520131
customBreaksControl
We also added a new customBreaksControl method, which allows using your breaks in the discretize pipeline. It uses the standard cut function with default arguments, so the output is always closed on the right. If you want more flexibility (like custom labels) feel free to fill an issue on the https://github.com/mi2-warsaw/FSelectorRcpp/issues, and we will see what can be done.
library(ggplot2)
library(tidyr)
library(dplyr)
br <- customBreaksControl(breaks = c(0, 5, 10, Inf))
disc <- discretize(iris, Species ~ ., control = br)
gDisc <- gather(disc, key = "Variable", value = "Value", -Species)
ggplot(gDisc) + geom_bar(aes(Value, fill = Species)) + facet_wrap("Variable")

Summary
We still need to do some works on the upcoming release (e.g., write more tests), but we hope you will find it useful.
For more information on FSelectorRcpp you can check:
- http://r-addict.com/2017/01/08/Entropy-Based-Image-Binarization.html
- http://r-addict.com/2017/03/14/FSelectorRcpp-Release.html
- http://r-addict.com/2016/06/19/Venn-Diagram-RTCGA-Feature-Selection.html
- https://cran.r-project.org/web/packages/FSelectorRcpp/vignettes/get_started.html
- https://cran.r-project.org/web/packages/FSelectorRcpp/vignettes/benchmarks_discretize.html