In my first post on Spark Streaming, I described how to use Netcast to emulate incoming stream. But later I found this question on StackOverflow. In one of the answer, there’s a piece of code which shows how to emulate incoming stream programmatically, without external tools like Netcat, it makes life much more comfortable. In this post, I describe how to fit a model using Spark’s MLlib, and then use it on the incoming data, and save the result in a parquet file.
Streaming data is quite a hot topic right now, so I decided to write something on this topic on my blog. I’m new in that area, but I don’t think this is much different than standard batch processing. Of course, I’m more focused on building models and other ML stuff, not all the administration things, like setting up Kafka, making everything fault tolerant, etc. In this post, I’ll describe a very basic app, not very different than the one described in the https://spark.
I use blogdown to write my blog posts. It allows me to create a Rmarkdown file, and then execute all the code and format the output. It has great support for R (it’s R native) and Python. Some other languages are also supported, but the functionality is pretty limited. For example, each code chunk is evaluated in a separate session (I’m not sure if it’s the case for all engines, I read about this in https://yihui.