All Posts

Spark Streaming - basic setup

Streaming data is quite a hot topic right now, so I decided to write something about it on my blog. I’m new to this area, but I don’t think it is much different from standard batch processing. Of course, I’m more focused on building models and other ML work than on the administration side, such as setting up Kafka or making everything fault tolerant. In this post, I’ll describe a very basic app, not very different from the one described in the https://spark.

Upcoming changes in FSelectorRcpp-0.2.0

The main purpose of the FSelectorRcpp package is feature selection based on the entropy function. However, it also contains a function to discretize continuous variables into nominal attributes, and we decided to slightly change the API related to this functionality to make it more user-friendly. EDIT: The updated version (0.2.1) is on CRAN. It can be installed using install.packages("FSelectorRcpp"). The dev version can be installed using devtools: devtools::install_github("mi2-warsaw/FSelectorRcpp", ref = "dev")

Partitioning in Spark

Spark is delightful for Big Data analysis. It allows using very high-level code to perform a large variety of operations. It also supports SQL, so you don’t need to learn a lot of new stuff to start being productive in Spark (of course assuming that you have some knowledge of SQL). However, if you want to use Spark more efficiently, you need to learn a lot of concepts, especially about data partitioning, relations between partitions (narrow dependencies vs.

Scala in knitr

I use blogdown to write my blog posts. It allows me to create an R Markdown file, and then execute all the code and format the output. It has great support for R (it’s native to R) and Python. Some other languages are also supported, but the functionality is pretty limited. For example, each code chunk is evaluated in a separate session (I’m not sure if that’s the case for all engines; I read about this in https://yihui.

Conda

One of the most important things in software development and data analysis is managing your dependencies, so that your work can be easily replicated or deployed to production. In R’s ecosystem, there are plenty of tools and materials on this topic. Here’s a short list:
https://rstudio.github.io/packrat/
https://ropenscilabs.github.io/r-docker-tutorial/
https://mran.microsoft.com/
https://rviews.rstudio.com/2018/01/18/package-management-for-reproducible-r-code/
https://cran.r-project.org/web/views/ReproducibleResearch.html
However, I’m starting to spend a bit more time in the Python world, where I don’t have a lot of experience.

ggplot2 with 2 y-axes

At one of my R workshops, someone asked me about creating a ggplot2 plot with two Y-axes. I do not use such plots, because I read somewhere that they are prone to perception problems. However, I promised to check whether it’s possible to create such visualizations using ggplot2. Without a lot of digging, I found this answer from the author of the ggplot2 package on Stack Overflow - https://stackoverflow.com/a/3101876. He thinks that those types of plots are bad and fundamentally flawed, that you shouldn’t use them, and that ggplot2 does not allow you to create them.

Create pptx in R using officer package

When you need to create a pptx file in R, the best way is to use the officer package. officer is quite easy to use and the documentation is quite extensive, so I won’t describe the basics (https://davidgohel.github.io/officer/articles/powerpoint.html - link to officer’s docs). However, I always have some problems with specifying the proper parameters for the ph_with_* functions, especially the type and index parameters. Of course, one can use the ph_with_*_at versions, but that requires manually adjusting all the coordinates, which might be even more problematic.

Type-S Errors.

I’m a big fan of Andrew Gelman’s blog (http://andrewgelman.com/). I think that my statistical intuition is much better after reading it. For example, there’s a post about different types of errors in NHST, not limited to the widely known Type I and Type II errors - http://andrewgelman.com/2004/12/29/type_1_type_2_t/. You should read it before continuing, because the rest of this post will be based on it and on the article linked in that post (http://www.

Calculate the font size for the R's flextable package.

flextable is an excellent package for creating beautiful tables, especially if you want to export them to a pptx file. However, it might be a bit problematic to set the proper font size for a given table size. E.g., I have a table with five rows (+ 1 header row), and I want to create a table whose height is 2 inches. What’s the best font size for this setting?

Lua

In one of my projects, I had to choose which language I would use to write a small module - C, C++ or Lua. I didn’t want to use C, because the module required a lot of string handling. Another option was C++, but some other parts of the system were written in Lua, so I thought it would be much easier to integrate everything without switching languages. I had also heard some good things about Lua, so in the end, I decided to give it a shot.