Reproducibility is a serious issue. Writing code usually helps, because the code acts as a journal of your work, especially if you combine it with literate programming techniques, which are very easy to use in the R world (
However, one thing can still cause problems: package versions. Old code might stop working because the API or the behavior of a package has changed (I’m looking at you -
First is the
packrat package, which lets you create a private library of packages for each project. It also keeps a record of the installed package versions, and it can recreate that private library even if you change machines. It is an excellent way to synchronize package versions when several people work on the same project.
- Easy to set up (just
- Easy to use. Usually, after setting up the private library, you don’t need to do anything else; just run
packrat::snapshot() when you install a new package.
- The packages take up some disk space (e.g., for some of my projects the packrat library is around 500 MB).
- Does not solve the problem of the R version. For example, if you created a packrat library under R 3.0 and now use R 3.5, packrat might not be able to install all the packages for the new version (I had such a situation).
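The packrat workflow described above can be sketched as three small wrappers. This is only a minimal sketch, assuming packrat is already installed; the helper names (`init_private_library`, `record_packages`, `rebuild_library`) are hypothetical and simply wrap `packrat::init()`, `packrat::snapshot()`, and `packrat::restore()`.

```r
# Assumption: packrat is installed, e.g. via install.packages("packrat").

# Hypothetical helper: turn `project_dir` into a packrat project
# with its own private package library.
init_private_library <- function(project_dir = ".") {
  packrat::init(project = project_dir)
}

# Hypothetical helper: record the currently installed package versions
# in the project's lockfile (run after installing a new package).
record_packages <- function(project_dir = ".") {
  packrat::snapshot(project = project_dir)
}

# Hypothetical helper: rebuild the private library from the lockfile,
# e.g. after moving the project to another machine.
rebuild_library <- function(project_dir = ".") {
  packrat::restore(project = project_dir)
}
```

Typically you would call `init_private_library()` once per project, `record_packages()` after each new installation, and `rebuild_library()` after checking the project out somewhere else.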
Another way of solving the problem of package versions is to use the
checkpoint package from Microsoft. Its usage is described here. It freezes the state of CRAN at a given date, so every package is installed in the version that was current on that day. It works in much the same way as
packrat, but it creates the library per date rather than per project, so two projects can share packages from the same library if they use the same
- Freezing the date is quite simple and does not require keeping a big
- The problem of the R version is still unsolved.
- It is a bit harder to pin specific versions of individual packages. For example, say I use the checkpoint date
2018-01-01, where package X is at version 0.5, but my project needs version 0.3 (and I’m fine with all other packages coming from
2018-01-01). In this case, packrat is the more natural choice.
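Freezing CRAN to a date can be sketched as follows. This is a minimal sketch, assuming the checkpoint package is installed and a dated CRAN snapshot server is reachable from your machine; the helper name `use_frozen_cran` is hypothetical. It validates the date string and then hands it to `checkpoint::checkpoint()`.

```r
# Hypothetical helper: pin package installs for this session to one CRAN date.
use_frozen_cran <- function(snapshot_date = "2018-01-01") {
  # Reject anything that is not a valid YYYY-MM-DD date before touching checkpoint.
  parsed <- as.Date(snapshot_date, format = "%Y-%m-%d")
  if (is.na(parsed)) {
    stop("snapshot_date must look like 'YYYY-MM-DD': ", snapshot_date)
  }
  # Assumption: the checkpoint package is installed.
  checkpoint::checkpoint(format(parsed, "%Y-%m-%d"))
}
```

Calling `use_frozen_cran("2018-01-01")` at the top of a script would correspond to the checkpoint date used in the example above.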
As I write this post, the most popular container technology is Docker, but who knows what the future will bring?
I think this is the only (easy) solution that solves the problem of the R version. You put everything into the container (every required system library, the R installation, the R packages, and all the code) and then work from inside it; installing RStudio Server in the container is a good idea. You can even combine this approach with
- Containers let you freeze everything, so you don’t need to worry about changes to the R version or even to the versions of system libraries.
- Requires some knowledge about containers.
- It bundles nearly everything inside, so it needs a lot of disk space.
- Building a container takes much more time than merely installing the packages.
In this post, I described three ways to make your R code a bit more reproducible by freezing the R packages for a given project. I usually use
packrat, because it keeps all the dependencies inside the project. If I need to be sure that everything will work in the future, I use
Docker, but sometimes that is not possible (e.g.,
Docker is not installed, or the system administrator doesn’t want to use containers), in which case I stick with