I recently had some time to look through projects that I started but never finished. One of them served me well for a while and contains something quite useful.

In one of my projects, I was working with a lot of large log files. At first, I loaded each file into R's memory and processed it with the stringi package and other tools. This was not the best solution: reading a file that contains a few gigabytes of data takes a lot of time and memory, so I could process only one file at a time. Then I found AWK, a great, small utility language that solves many common problems with log files. For some time it was my go-to language for this type of task. In the end, I wrote a parser in C++ that exports data directly to the R session, but that is a different story.

I started writing a package called rawk to allow calling AWK scripts directly from the R console. I don’t know whether the interface works properly (as I said, I switched to C++…), but it has one interesting function that lets a user cache a function's result based on a file's modification time. If the file has not changed since the last call, the result is read from a cache located on disk.
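To make the idea concrete, here is a minimal base R sketch of this kind of caching. It is not the code from rawk: the names `mtime_cache`, `random_for`, and `.force`, and the choice of an MD5 sum over the serialized key, are my own assumptions for illustration, and the package may work differently.

```r
# Hypothetical helper for illustration; rawk's actual implementation may differ.
mtime_cache <- function(fun, cache_dir = file.path(tempdir(), "mtime-cache")) {
  force(fun)
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

  function(file, ..., .force = FALSE) {
    # The cache key combines the file path, its modification time,
    # and any extra arguments passed to the wrapped function.
    key <- list(
      path  = normalizePath(file),
      mtime = as.numeric(file.mtime(file)),
      args  = list(...)
    )

    # Base R cannot hash an object directly, so write the serialized key
    # to a temporary file and use its MD5 sum as the cache file name.
    key_file <- tempfile()
    on.exit(unlink(key_file), add = TRUE)
    writeBin(serialize(key, NULL), key_file)
    hash <- unname(tools::md5sum(key_file))
    cache_file <- file.path(cache_dir, paste0(hash, ".rds"))

    if (!.force && file.exists(cache_file)) {
      return(readRDS(cache_file))  # key unchanged: reuse the stored result
    }

    result <- fun(file, ...)
    saveRDS(result, cache_file)
    result
  }
}

# Usage: the wrapped function returns random numbers, so identical
# results can only come from the cache.
random_for <- function(file, k = 1) {
  n <- as.numeric(readLines(file, warn = FALSE))
  rnorm(n * k)
}
cached <- mtime_cache(random_for)

f <- tempfile()
cat(5, file = f)
first  <- cached(f)
second <- cached(f)                  # read from the cache
Sys.setFileTime(f, Sys.time() + 60)  # simulate a later modification
third  <- cached(f)                  # recomputed, because the key changed
```

Because the modification time and the extra arguments are both part of the key, touching the file or changing an argument leads to a fresh computation, while an unchanged call is served from disk. Note that this sketch treats `cached(f, 1)` and `cached(f, k = 1)` as different keys, since it does not match arguments by name.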

Let me show you an example:

library(rawk)
## Loading required package: stringi
fnc <- function(file) {
   n <- as.numeric(readLines(file, warn = FALSE))
   rnorm(n) # some random values
}

file <- tempfile()
cat(5, file = file)

# Every call leads to different result.
fnc(file)
## [1]  1.0542106  0.5495267  0.3604597 -0.5716983 -1.3153510
fnc(file)
## [1] -0.4181775 -0.6788306 -1.3950384 -0.6021542 -0.8587127
all.equal(fnc(file), fnc(file))
## [1] "Mean relative difference: 1.453267"
# Create new version with cache:
fcached <- file_modification_time_cache(fnc)

fcached(file)
## Saved /tmp/RtmpP3fcEg/file2de741ef55b, last access time: 2018-06-11 17:43:06, hash: f518213fe192be56b8038e849e4d7f0f
## [1]  1.6695812 -1.3698780 -0.4390327  0.3085819  0.6329413
fcached(file) # The same
## Loaded /tmp/RtmpP3fcEg/file2de741ef55b, last access time: 2018-06-11 17:43:06, hash: f518213fe192be56b8038e849e4d7f0f
## [1]  1.6695812 -1.3698780 -0.4390327  0.3085819  0.6329413
all.equal(fcached(file), fcached(file)) # still the same
## Loaded /tmp/RtmpP3fcEg/file2de741ef55b, last access time: 2018-06-11 17:43:06, hash: f518213fe192be56b8038e849e4d7f0f
## Loaded /tmp/RtmpP3fcEg/file2de741ef55b, last access time: 2018-06-11 17:43:06, hash: f518213fe192be56b8038e849e4d7f0f
## [1] TRUE
x <- fcached(file)
## Loaded /tmp/RtmpP3fcEg/file2de741ef55b, last access time: 2018-06-11 17:43:06, hash: f518213fe192be56b8038e849e4d7f0f
cat(5, file = file)
y <- fcached(file)
## Saved /tmp/RtmpP3fcEg/file2de741ef55b, last access time: 2018-06-11 17:43:06, hash: 9ada92bb5844e4dffa285130ce5cc91a
all.equal(x, y) # different
## [1] "Mean relative difference: 1.008521"

Here’s a second example, using a function with two parameters:

fnc2 <- function(file, k = 2) {
   n <- as.numeric(readLines(file, warn = FALSE))
   rnorm(n * k)
}

fcached2 <- file_modification_time_cache(fnc2)
all.equal(fcached2(file,1), fcached2(file,1))
## Saved /tmp/RtmpP3fcEg/file2de741ef55b, last access time: 2018-06-11 17:43:06, hash: d39758a0b1b6aa59b5a3dbae62539a1f
## Loaded /tmp/RtmpP3fcEg/file2de741ef55b, last access time: 2018-06-11 17:43:06, hash: d39758a0b1b6aa59b5a3dbae62539a1f
## [1] TRUE
all.equal(fcached2(file,1), fcached2(file,2))
## Loaded /tmp/RtmpP3fcEg/file2de741ef55b, last access time: 2018-06-11 17:43:06, hash: d39758a0b1b6aa59b5a3dbae62539a1f
## Saved /tmp/RtmpP3fcEg/file2de741ef55b, last access time: 2018-06-11 17:43:06, hash: e461baa37550576a7bd02b612d90db23
## [1] "Numeric: lengths (5, 10) differ"
# Remove cache directory
if(file.exists(".cache")) unlink(".cache", recursive = TRUE)
Warning! The first argument of the cached function must be the path to the file whose modification time will be checked.

Cached functions print nicely:

fcached2
## Function with a cache based on the file modification time
## 
## Cache directory:  /home/zzawadz/OpenRepos/blog/content/post/.cache/fnc2
## To force recalculation please use .FORCE_RECALC = TRUE
## 
## Original function:
## function(file, k = 2) {
##    n <- as.numeric(readLines(file, warn = FALSE))
##    rnorm(n * k)
## }
## <bytecode: 0x55b20b1806b0>

To install the package, use:

devtools::install_github("zzawadz/rawk")

If you find this interesting, reach out to me on Twitter (@zzawadz) or GitHub.