All Posts

Information gain in FSelectorRcpp

Some intuitions behind the Information Gain, Gain Ratio and Symmetrical Uncertainty calculated by the FSelectorRcpp package, which can serve as a good proxy for correlation between unordered factors. I am a big fan of using FSelectorRcpp in the exploratory phase to get an overview of the data. The main workhorse is the information_gain function which calculates… information gain. But how to interpret the output of this function? To understand this, you need to know a bit about entropy.
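To make the link between entropy and information gain concrete, here is a minimal base-R sketch. It assumes the usual definition, information gain = H(target) + H(attribute) - H(target, attribute), with natural logarithms; the helper names entropy and info_gain are made up for this illustration and are not part of FSelectorRcpp.

```r
# Shannon entropy (natural log) of a discrete vector.
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]                      # drop empty levels to avoid log(0)
  -sum(p * log(p))
}

# Information gain: H(target) + H(attr) - H(target, attr).
# Hypothetical helper, not the FSelectorRcpp implementation.
info_gain <- function(attr, target) {
  joint <- interaction(attr, target, drop = TRUE)
  entropy(target) + entropy(attr) - entropy(joint)
}

target <- factor(c("a", "a", "b", "b"))
copy   <- target                          # fully determines the target
noise  <- factor(c("x", "y", "x", "y"))   # independent of the target

info_gain(copy, target)   # equals entropy(target), i.e. log(2)
info_gain(noise, target)  # approximately 0
```

An attribute that fully determines the target recovers the whole entropy of the target, while an independent attribute gains (numerically almost) nothing.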

Active learning - part 1

I just started exploring the ‘active learning’ topic. It’s a very handy tool when the number of data points available to build a model is limited and labelling new points is costly. It helps determine which points should be labelled next to bring the most gain in model performance. In this post I will cover some of my small experiments in this area. Caution! If you’re interested in ready-to-use tools for active learning, this post might not be for you - I don’t cover any framework here.
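One common way to pick the next point is uncertainty sampling: fit a model on the labelled points and ask for the label of the point the model is least sure about. The sketch below uses simulated data and a logistic regression in base R; it is only an illustration of that general idea under these assumptions, not necessarily the approach used in the post.

```r
set.seed(1)

# Simulated pool: one feature, binary label (the labels of most points
# are treated as unknown until we "ask" for them).
n <- 200
pool <- data.frame(x = rnorm(n))
pool$y <- rbinom(n, 1, plogis(2 * pool$x))

labelled   <- sample.int(n, 20)              # points we already paid for
unlabelled <- setdiff(seq_len(n), labelled)

# Fit on the labelled subset only.
fit <- glm(y ~ x, data = pool[labelled, ], family = binomial)

# Uncertainty sampling: choose the unlabelled point whose predicted
# probability is closest to 0.5.
p <- predict(fit, newdata = pool[unlabelled, ], type = "response")
next_idx <- unlabelled[which.min(abs(p - 0.5))]
next_idx
```

After labelling pool[next_idx, ], it would be moved to the labelled set and the model refitted; repeating this loop is the basic active-learning cycle.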

Some notes on Apache Spark

Some notes based on two videos describing Apache Spark concepts. https://www.youtube.com/watch?v=AoVmgzontXo - Spark SQL: A Compiler from Queries to RDDs: Spark Summit East talk by Sameer Agarwal https://www.youtube.com/watch?v=vfiJQ7wg81Y - Top 5 Mistakes When Writing Spark Applications. Spark SQL: A Compiler from Queries to RDDs: Spark Summit East talk by Sameer Agarwal https://youtu.be/AoVmgzontXo?t=641 - an example of the optimization done by Catalyst (it is better to watch the whole video to get a better understanding of the context).

Reproducible package management in R

Reproducibility is a serious issue. Writing code usually helps, because the code is like a journal of your work, especially if you combine it with literate programming techniques, which are so easy to use in the R world (Rmarkdown, knitr). However, there’s one thing that can cause problems: package versions. Some old code might not work because of changes in the API or in the behavior of the packages (I’m looking at you - dplyr).

customLayout 0.2.0 is now on CRAN

The new version of my customLayout package is on CRAN. It now supports working with PowerPoint slides using layouts created in R. For more information please read the vignette here. It also extends the idea of adjusting the font size for the flextables (see this post) - check the phl_adjust_table function. I also created a simple roadmap which describes my next steps. Please note that this package is still under development.

Functional boxplot - some intuitions.

Warning! This post describes some intuitions behind the idea of functional boxplots. I think that it is a very useful technique, but all statistical tools should be used with caution. Reading only one blog post might not be enough to apply them in practice. At the end of the post, I added some information about useful resources covering this topic in a more rigorous way. A classical boxplot is an excellent tool for a quick summary of the data.

Notes on tidyeval

I recently watched the “Tidy eval: Programming with dplyr, tidyr, and ggplot2” video. It’s an excellent introduction to tidy evaluation, which is the core concept for programming with dplyr and friends. In this video, Hadley showed the grouped_mean function on a slide (12:48). An attempt to implement this function might be a good exercise in tidy evaluation, and an excellent opportunity to compare this approach with the standard evaluation rules provided by the seplyr package.
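As a starting point for that exercise, here is one possible sketch of grouped_mean written with enquo() and !!. The argument names (df, group_var, summary_var) and the output column name are my assumptions, and the version shown in the video may differ in details.

```r
library(dplyr)

# Group `df` by one bare column name and average another one.
grouped_mean <- function(df, group_var, summary_var) {
  group_var   <- enquo(group_var)     # capture the bare column names
  summary_var <- enquo(summary_var)

  df %>%
    group_by(!!group_var) %>%         # unquote them inside dplyr verbs
    summarise(mean = mean(!!summary_var))
}

res <- grouped_mean(mtcars, cyl, mpg)
res
```

The call passes cyl and mpg unquoted, just like in a regular dplyr pipeline; the result has one row per value of cyl.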

Common problems with rJava

rJava is an essential package because it gives access to the rich Java world. There are dozens of packages on CRAN that depend on Java (e.g., the excellent rscala for calling Scala code from R). However, installing rJava can sometimes be quite problematic. In this post, I’ll focus on the pitfalls found on Linux/Ubuntu, but if you are on Windows, following the instructions from here should solve your problem. R CMD javareconf One of the first things that you should try if you have a problem with rJava is to check if you have Java installed on your system, by running java -version in the console.

FSelectorRcpp 0.2.1 release

A new release of FSelectorRcpp (0.2.1) is on CRAN. I described nearly all the new functionality here. The last thing that we added just before the release is extract_discretize_transformer. It can be used to get a small object from the result of the discretize function, which can then transform new data using the estimated cutpoints. See the example below. library(FSelectorRcpp) set.seed(123) idx <- sort(sample.int(150, 100)) iris1 <- iris[idx, ] iris2 <- iris[-idx, ] disc <- discretize(Species ~ .

Spark Streaming and MLlib

In my first post on Spark Streaming, I described how to use Netcat to emulate an incoming stream. But later I found this question on StackOverflow. In one of the answers, there’s a piece of code which shows how to emulate an incoming stream programmatically, without external tools like Netcat, which makes life much more comfortable. In this post, I describe how to fit a model using Spark’s MLlib, then use it on the incoming data, and save the result in a parquet file.