All Posts

Functional boxplot - some intuitions.

Warning! This post describes some intuitions behind the idea of functional boxplots. I think it is a very useful technique, but all statistical tools should be used with caution, and reading just one blog post might not be enough to apply them in practice. At the end of the post, I added information about useful resources covering this topic in a more rigorous way. A classical boxplot is an excellent tool for a quick summary of the data.
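
As a refresher, the classical version is a one-liner in base R (a minimal illustration, not code from the post itself):

```r
# classical boxplots: the distribution of sepal length, split by species
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Sepal length by species")
```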

Notes on tidyeval

I recently watched the “Tidy eval: Programming with dplyr, tidyr, and ggplot2” video. It’s an excellent introduction to tidy evaluation, the core idea behind programming with dplyr and friends. In this video, Hadley shows the grouped_mean function on one of the slides (12:48). Attempting to implement this function is a good exercise in tidy evaluation, and an excellent opportunity to compare this approach with the standard evaluation rules provided by the seplyr package.
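
For reference, a minimal sketch of such a function (my own attempt using the enquo()/!! idiom shown in the video, not necessarily Hadley's exact code):

```r
library(dplyr)

grouped_mean <- function(df, group_var, summary_var) {
  group_var   <- enquo(group_var)    # capture the grouping column
  summary_var <- enquo(summary_var)  # capture the column to average
  df %>%
    group_by(!!group_var) %>%
    summarise(mean = mean(!!summary_var))
}

grouped_mean(mtcars, cyl, mpg)
```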

Common problems with rJava

rJava is an essential package because it gives access to the rich Java world. There are dozens of packages on CRAN that depend on Java (e.g., the excellent rscala for calling Scala code from R). However, installing rJava can sometimes be quite problematic. In this post, I’ll focus on the pitfalls found on Linux/Ubuntu; if you are on Windows, following the instructions from here should solve your problem. R CMD javareconf. One of the first things you should try if you have a problem with rJava is to check whether Java is installed on your system by running java -version in the console.
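
Once the installation succeeds, a quick smoke test from R is useful (a minimal sketch; .jinit and .jcall are standard rJava calls):

```r
library(rJava)
.jinit()  # start the JVM

# ask the JVM for its version; "S" declares that the method returns a String
jv <- .jcall("java/lang/System", "S", "getProperty", "java.version")
print(jv)
```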

FSelectorRcpp 0.2.1 release

A new release of FSelectorRcpp (0.2.1) is on CRAN. I described nearly all of the new functionality here. The last thing we added just before the release is extract_discretize_transformer. It can be used to get a small object from the result of the discretize function, which can then transform new data using the estimated cutpoints. See the example below.

library(FSelectorRcpp)
set.seed(123)
idx <- sort(sample.int(150, 100))
iris1 <- iris[idx, ]
iris2 <- iris[-idx, ]
disc <- discretize(Species ~ .
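
The teaser cuts off mid-call; here is a hedged reconstruction of what the full example presumably looks like (the function names follow the post's description, but the exact code is my guess):

```r
library(FSelectorRcpp)
set.seed(123)
idx <- sort(sample.int(150, 100))
iris1 <- iris[idx, ]   # "training" part, used to estimate the cutpoints
iris2 <- iris[-idx, ]  # new data, discretized with the same cutpoints

disc <- discretize(Species ~ ., iris1)

# extract a small transformer object and apply it to the new data
transformer <- extract_discretize_transformer(disc)
head(discretize_transform(transformer, iris2))
```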

Spark Streaming and MLlib

In my first post on Spark Streaming, I described how to use Netcat to emulate an incoming stream. Later, I found this question on StackOverflow. In one of the answers, there’s a piece of code showing how to emulate an incoming stream programmatically, without external tools like Netcat, which makes life much more comfortable. In this post, I describe how to fit a model using Spark’s MLlib, then use it on the incoming data and save the result to a Parquet file.
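
The post itself works at the Spark API level; as a rough R-side analogue of the batch part (fit an MLlib model, score data, write Parquet), here is a hypothetical sparklyr sketch:

```r
library(sparklyr)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris)  # sparklyr renames columns: Sepal.Length -> Sepal_Length

# fit an MLlib model and score the same table (a stand-in for the incoming data)
model <- ml_logistic_regression(iris_tbl, Species ~ Sepal_Length + Sepal_Width)
pred  <- ml_predict(model, iris_tbl)

spark_write_parquet(pred, "predictions.parquet")
spark_disconnect(sc)
```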

Spark Streaming - basic setup

Streaming data is quite a hot topic right now, so I decided to write something about it on my blog. I’m new to this area, but I don’t think it is much different from standard batch processing. Of course, I’m more focused on building models and other ML stuff, not on all the administrative things like setting up Kafka or making everything fault-tolerant. In this post, I’ll describe a very basic app, not very different from the one described in the https://spark.

Upcoming changes in FSelectorRcpp-0.2.0

The main purpose of the FSelectorRcpp package is entropy-based feature selection. However, it also contains a function to discretize continuous variables into nominal attributes, and we decided to slightly change the API related to this functionality to make it more user-friendly. EDIT: The updated version (0.2.1) is on CRAN. It can be installed using:

install.packages("FSelectorRcpp")

The dev version can be installed using devtools:

devtools::install_github("mi2-warsaw/FSelectorRcpp", ref = "dev")

Partitioning in Spark

Spark is delightful for Big Data analysis. It lets you use very high-level code to perform a large variety of operations. It also supports SQL, so you don’t need to learn a lot of new stuff to start being productive in Spark (assuming, of course, that you have some knowledge of SQL). However, if you want to use Spark more efficiently, you need to learn a lot of concepts, especially about data partitioning and the relations between partitions (narrow dependencies vs.
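
From the R side, you can already poke at partitioning with sparklyr (a minimal sketch, assuming a local connection; the post itself may work at a lower level):

```r
library(sparklyr)

sc <- spark_connect(master = "local")
df <- copy_to(sc, iris)

sdf_num_partitions(df)                      # how many partitions the data sits in
df8 <- sdf_repartition(df, partitions = 8)  # shuffle the data into 8 partitions
sdf_num_partitions(df8)

spark_disconnect(sc)
```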

Scala in knitr

I use blogdown to write my blog posts. It allows me to create an R Markdown file and then execute all the code and format the output. It has great support for R (it’s R-native) and Python. Some other languages are also supported, but the functionality is pretty limited. For example, each code chunk is evaluated in a separate session (I’m not sure if this is the case for all engines; I read about it in https://yihui.
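
One way around the separate-session limitation is to register a custom knitr engine that funnels every chunk through a single shared interpreter. A rough sketch of the idea using rscala (hedged: the rscala API has changed between versions, so scala() and %~% are assumptions, not necessarily what the post uses):

```r
library(knitr)
library(rscala)

s <- scala()  # one Scala interpreter, shared by all chunks

knit_engines$set(scala = function(options) {
  code <- paste(options$code, collapse = "\n")
  out  <- s %~% code  # evaluate the chunk in the shared interpreter
  engine_output(options, options$code, out)
})
```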

Conda

One of the most important things in software development and data analysis is managing your dependencies, to make sure that your work can be easily replicated or deployed to production. In R’s ecosystem, there are plenty of tools and materials on this topic. Here’s a short list:

- https://rstudio.github.io/packrat/
- https://ropenscilabs.github.io/r-docker-tutorial/
- https://mran.microsoft.com/
- https://rviews.rstudio.com/2018/01/18/package-management-for-reproducible-r-code/
- https://cran.r-project.org/web/views/ReproducibleResearch.html

However, I’m starting to spend a bit more time in the Python world, where I don’t have a lot of experience.