All Posts

FSelectorRcpp 0.2.1 release

A new release of FSelectorRcpp (0.2.1) is on CRAN. I described nearly all the new functionality here. The last thing we added just before the release is the extract_discretize_transformer function. It can be used to get a small object from the result of the discretize function and transform new data using the estimated cutpoints. See the example below. library(FSelectorRcpp) set.seed(123) idx <- sort(sample.int(150, 100)) iris1 <- iris[idx, ] iris2 <- iris[-idx, ] disc <- discretize(Species ~ .

Spark Streaming and Mllib

In my first post on Spark Streaming, I described how to use Netcat to emulate an incoming stream. Later, I found this question on StackOverflow. One of the answers contains a piece of code that shows how to emulate an incoming stream programmatically, without external tools like Netcat, which makes life much more comfortable. In this post, I describe how to fit a model using Spark’s MLlib, then use it on the incoming data and save the result in a parquet file.

Spark Streaming - basic setup

Streaming data is quite a hot topic right now, so I decided to write something about it on my blog. I’m new to this area, but I don’t think it is much different from standard batch processing. Of course, I’m more focused on building models and other ML stuff than on the administration side, like setting up Kafka, making everything fault tolerant, etc. In this post, I’ll describe a very basic app, not very different from the one described in the https://spark.

Upcoming changes in FSelectorRcpp-0.2.0

The main purpose of the FSelectorRcpp package is feature selection based on the entropy function. However, it also contains a function to discretize continuous variables into nominal attributes, and we decided to slightly change the API related to this functionality to make it more user-friendly. EDIT: Updated version (0.2.1) is on CRAN. It can be installed using: install.packages("FSelectorRcpp") The dev version can be installed using devtools: devtools::install_github("mi2-warsaw/FSelectorRcpp", ref = "dev")

Partitioning in Spark

Spark is delightful for Big Data analysis. It allows using very high-level code to perform a large variety of operations. It also supports SQL, so you don’t need to learn a lot of new stuff to start being productive in Spark (of course assuming that you have some knowledge of SQL). However, if you want to use Spark more efficiently, you need to learn a lot of concepts, especially about data partitioning, relations between partitions (narrow dependencies vs.

Scala in knitr

I use blogdown to write my blog posts. It allows me to create an Rmarkdown file, and then execute all the code and format the output. It has great support for R (it’s R native) and Python. Some other languages are also supported, but the functionality is pretty limited. For example, each code chunk is evaluated in a separate session (I’m not sure if that’s the case for all engines; I read about this in https://yihui.

Conda

One of the most important things in software development and data analysis is managing your dependencies, to make sure that your work can be easily replicated or deployed to production. In R’s ecosystem, there are plenty of tools and materials on this topic. Here’s a short list: https://rstudio.github.io/packrat/ https://ropenscilabs.github.io/r-docker-tutorial/ https://mran.microsoft.com/ https://rviews.rstudio.com/2018/01/18/package-management-for-reproducible-r-code/ https://cran.r-project.org/web/views/ReproducibleResearch.html However, I’m starting to spend a bit more time in the Python world, where I don’t have a lot of experience.

ggplot2 with 2 y-axes

At one of my R workshops, someone asked me about creating a ggplot2 plot with two Y-axes. I do not use such plots, because I read somewhere that they cause problems with perception. However, I committed myself to checking whether it’s possible to create such visualizations using ggplot2. Without a lot of digging, I found this answer from the author of the ggplot2 package on StackOverflow - https://stackoverflow.com/a/3101876. He thinks that those types of plots are bad, fundamentally flawed, and you shouldn’t use them, and ggplot2 does not allow creating them.

Create pptx in R using officer package

When you need to create a pptx file in R, the best way is to use the officer package. officer is quite easy to use and the documentation is quite extensive, so I won’t describe the basics (https://davidgohel.github.io/officer/articles/powerpoint.html - link to officer’s docs). However, I always have some problems with specifying the proper parameters for the ph_with_* functions, especially the type and index parameters. Of course, one can use the ph_with_*_at versions, but that requires manually adjusting all the coordinates, which might be even more problematic.

Type-S Errors.

I’m a big fan of Andrew Gelman’s blog (http://andrewgelman.com/). I think that my statistical intuition is much better after reading it. For example, there’s a post about different types of errors in NHST, not limited to the widely known Type I and Type II errors - http://andrewgelman.com/2004/12/29/type_1_type_2_t/. You should read it before continuing, because the rest of this post is based on it and on the article linked in that post (http://www.