One of the most important things in software development and data analysis is managing your dependencies, so that your work can be easily replicated or deployed to production. In R’s ecosystem, there are plenty of tools and materials on this topic. Here’s a short list:

However, I’m starting to spend a bit more time in the Python world, where I don’t have a lot of experience. The most obvious choice might be to install everything in a container (or virtual machine), but even then, some knowledge about the existing package repositories is required.

Without much searching, I found conda, which seems to be packrat on steroids. It manages not only Python packages but other libraries too. I won’t describe all of conda’s capabilities, because everything can be found in the documentation. If you want more information, here are some useful links:

Conda basics

I use my blog not only to present my ideas or share interesting stuff, but also as a private knowledge base where I can easily find useful code snippets. That’s why I’ll describe some elementary but helpful conda commands.

# Create nlp env and install packages
conda create --name nlp nltk gensim spacy matplotlib pandas numpy nb_conda

# activate env
source activate nlp

# install package in the active env
conda install package-name

# create another env
conda create --name mlpython pandas matplotlib numpy scipy scikit-learn nb_conda

# list packages in environment
conda list
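
A few more housekeeping commands that I find handy (all from the standard conda CLI; substitute your own env names):

# list all environments on the machine
conda env list

# deactivate the current env
source deactivate

# remove an env together with all its packages
conda remove --name nlp --all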

Export a conda env to a file

Conda allows you to export environment information into a single file, which can then be used to recreate everything on a different machine with one simple command.

conda create -n testEnv python=3.6 numpy -y
# add anaconda to the PATH. I added this only because the code
# is evaluated inside knitr's bash environment, and otherwise it
# could not find anaconda
export PATH="~/anaconda3/bin:$PATH" 

# Export env to file:
source activate testEnv
conda env export > environment.yml
cat environment.yml
## name: testEnv
## channels:
##   - defaults
## dependencies:
##   - blas=1.0=mkl
##   - ca-certificates=2018.03.07=0
##   - certifi=2018.4.16=py36_0
##   - intel-openmp=2018.0.3=0
##   - libedit=3.1.20170329=h6b74fdf_2
##   - libffi=3.2.1=hd88cf55_4
##   - libgcc-ng=7.2.0=hdf63c60_3
##   - libgfortran-ng=7.2.0=hdf63c60_3
##   - libstdcxx-ng=7.2.0=hdf63c60_3
##   - mkl=2018.0.3=1
##   - mkl_fft=1.0.2=py36h651fb7a_0
##   - mkl_random=1.0.1=py36h4414c95_1
##   - ncurses=6.1=hf484d3e_0
##   - numpy=1.14.5=py36h1b885b7_4
##   - numpy-base=1.14.5=py36hdbf6ddf_4
##   - openssl=1.0.2o=h20670df_0
##   - pip=10.0.1=py36_0
##   - python=3.6.6=hc3d631a_0
##   - readline=7.0=ha6073c6_4
##   - setuptools=39.2.0=py36_0
##   - sqlite=3.24.0=h84994c4_0
##   - tk=8.6.7=hc745277_3
##   - wheel=0.31.1=py36_0
##   - xz=5.2.4=h14c3975_4
##   - zlib=1.2.11=ha838bed_2
## prefix: /home/zzawadz/anaconda3/envs/testEnv
# Create env from yml file:
conda env create -f environment.yml
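
If the environment already exists, it can also be synced with the file instead of recreated; as far as I can tell from the conda docs, conda env update does exactly that:

# update an existing env to match the yml file
conda env update -n testEnv -f environment.yml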

Conda env in Jupyter notebook

I think the most common way of using Python for data science is through Jupyter Notebooks (http://jupyter.org/). It’s quite a good piece of software, but it does not support conda environments out of the box. However, the fix is dead easy: you just need to install the nb_conda package.

conda install nb_conda
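
Alternatively, if you only want to expose a single environment to Jupyter, the ipykernel package can register it as a kernel. A minimal sketch, assuming the nlp env from earlier (the --name and --display-name values are just examples):

source activate nlp
conda install ipykernel
# register this env as a Jupyter kernel
python -m ipykernel install --user --name nlp --display-name "Python (nlp)"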

More info:

Conda environments in R’s knitr

R’s knitr allows using Python code chunks inside R Markdown documents. I won’t describe how to do this because it’s super easy. You can find more information at https://cran.r-project.org/web/packages/reticulate/vignettes/r_markdown.html or https://github.com/yihui/knitr-examples/blob/master/023-engine-python.Rmd (an example document).

The only thing you might need is a way to use a specific conda environment in knitr without calling source activate envname.

Let’s start with creating two environments:

conda create -n py36 python=3.6 -y
conda create -n py27 python=2.7 -y

To launch a specific version, I would normally need to activate the environment, but I can also use the full path to the executable, which should be located somewhere in ~/anaconda3/envs/{envname}/bin/python. I can test this using the following code (note that I’m using knitr to render this post, so I can use bash to evaluate the chunk):

~/anaconda3/envs/py36/bin/python --version
## Python 3.6.6 :: Anaconda, Inc.

And the Python version in the second env:

~/anaconda3/envs/py27/bin/python --version
## Python 2.7.15 :: Anaconda, Inc.

There’s one last step: I need to instruct R to use a specific Python version when executing the Python chunks (this is an R chunk):

library(knitr)
opts_chunk$set(
  engine.path = list(
    python = '~/anaconda3/envs/py36/bin/python'
  )
)
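
If you render Python chunks through the reticulate engine instead, pointing reticulate at the env should work as well. A sketch using reticulate’s documented use_condaenv function (py36 is the env created above):

library(reticulate)
# fail loudly if the env cannot be found
use_condaenv("py36", required = TRUE)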

And the example:

# This is python
import sys
print(sys.version_info)
## sys.version_info(major=3, minor=6, micro=6, releaselevel='final', serial=0)
print(1 + 1)
## 2
x = 20

Note that the variables are shared between code chunks:

print(x * 20)
## 400

Summary

It seems that conda is a good way to manage Python dependencies. The only problem is that not all packages can be installed from the conda cloud (e.g., polyglot - http://polyglot.readthedocs.io/en/latest/); those require a bit more effort, but that’s beyond the scope of this post ;)
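
For those packages, the usual workaround is to fall back to pip inside the activated conda env (environment.yml also accepts a pip: subsection for this). A minimal sketch, assuming polyglot’s system dependencies are already in place:

source activate nlp
# pip installs into the currently active conda env
pip install polyglot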