API and R

Nowadays, it’s pretty much expected that software comes with an HTTP API interface. Every programming language out there offers a way to expose APIs or make GET/POST/PUT requests, including R. In this post, I’ll show you how to create an API using the plumber package. Plus, I’ll give you tips on how to make it more production-ready - I’ll tackle scalability, statelessness, caching, and load balancing. You’ll even see how to consume your API with other tools like Python, curl, and R’s own httr package.

By the way, you might notice that in some parts of the code, I’ve included TODO comments. I’m not promising to update this post later (although I might!). The comments are just there to get you thinking about how you might expand on this in a real-world environment.

Some preparation

The whole document has been created using rmarkdown, which means you can take the code and reproduce everything by yourself.

# The files required to make this document are stored in the tmp directory
dir.create("tmp", recursive = TRUE, showWarnings = FALSE)

# The function below is used to get the content of a code chunk and
# write it to a file, to make it available for the `plumber` package.
write_chunk_file <- function(chunk_name, dest_file) {
  lines <- readLines("2023-04-04-rest-api-in-r-with-plumber.Rmd", warn = FALSE)

  line_start <- grep(lines, pattern = paste("```\\{r", chunk_name))+1
  line_end <- grep(lines, pattern = "```")
  line_end <- head(line_end[line_end > line_start],1) - 1

  api_code <- c(lines[line_start:line_end], "")

  writeLines(api_code, con = dest_file)
}

# When an API is started, it might take some time to initialize.
# The function below stops the main execution and waits until
# the plumber API is ready to accept queries.
wait_for_api <- function(log_path, timeout = 60, check_every = 1) {
  times <- timeout / check_every

  for(i in seq_len(times)) {
    Sys.sleep(check_every)
    if(any(grepl(readLines(log_path), pattern = "Running plumber API"))) {
      return(invisible())
    }
  }
  stop("Waiting timed!")
}

Oh, in some examples I am using redis. So, before you dive in, make sure to fire up a simple redis server. At the end of the script, I’ll be turning redis off, so you don’t want to be using it for anything else at the same time. I just want to remind you that this code isn’t meant to be run on a production server.

redis-server --daemonize yes

redis is launched in the background, so you might want to wait a little bit to make sure it’s fully up and running before moving on.

wait_for_redis <- function(timeout = 60, check_every = 1) {

  times <- timeout / check_every

  for(i in seq_len(times)) {
    Sys.sleep(check_every)

    status <- suppressWarnings(system2("redis-cli", "PING", stdout = TRUE, stderr = TRUE) == "PONG")

    if(status) {
      return(invisible())
    }
  }
  stop("Redis waiting timed!")

}

dir.create("tmp")
## Warning in dir.create("tmp"): 'tmp' already exists
wait_for_redis()

Just a quick tip: if you want to double-check that redis is totally empty, here’s how to do it.

redis-cli FLUSHALL
## OK

Gene APIv1

First things first - let’s save the API code into a file:

# Write the content of the next code chunk to tmp/api_v1.R
write_chunk_file("api_v1", "tmp/api_v1.R")

Alright, now let’s dive into the main topic - how to create an API using plumber. It’s actually not that difficult - you can head over to https://www.rplumber.io/index.html, grab the basic example, and have something up and running in just 10 minutes! But let’s talk about how to write an API that will save you lots of headaches in the future.
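
For reference, the basic example from the plumber homepage boils down to a single annotated function. Here’s a minimal sketch - save it as plumber.R and serve it with plumber::plumb("plumber.R")$run(port = 8000):

#* Echo back the input
#* @param msg The message to echo
#* @get /echo
function(msg = "") {
  list(msg = paste0("The message is: '", msg, "'"))
}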

First off, let’s talk about logging. I try to log as much as possible, especially in critical areas like database accesses and interactions with other systems. This way, if there’s an issue in the future (and trust me, there will be), I should be able to diagnose the problem by looking at the logs alone. Logging is like “print debugging” (putting print(“I am here”), print(“I am here 2”) everywhere), but done ahead of time. I always try to think about what information might be needed to make a correct diagnosis, so logging variable values is a must. The logger and glue packages are your best friends in that area.
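
Here is a minimal sketch of what I mean - everything in it (the function names, the host) is made up for illustration:

library(logger)

# A made-up database call that always fails - just for illustration.
fetch_from_db <- function(gene_id, db_host) stop("db unreachable")

get_gene_data <- function(gene_id, db_host) {
  # logger supports glue-style {interpolation}, so logging variable values is cheap
  log_info("Fetching data, gene_id: {gene_id}, db_host: {db_host}")
  tryCatch(
    fetch_from_db(gene_id, db_host),
    error = function(e) {
      log_error("Fetching failed, gene_id: {gene_id}, reason: {e$message}")
      NULL
    }
  )
}

get_gene_data("ENSG00000133703", "db.example.com")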

Next, it might also be useful to add a unique request identifier (I am doing that in the setuuid filter) to be able to track a request across the whole pipeline (since a single request might be passed across many functions). You might also want to add some other identifiers, such as MACHINE_ID - your API might be deployed on many machines, so it could be helpful to diagnose whether a problem is associated with a specific instance or is a global issue.

In general, you shouldn’t worry too much about the size of the logs. Even if you generate ~10KB per request, it will take 100,000 requests to generate 1GB. And for a plumber API, 100,000 requests generated in a short time is A LOT - in such a scenario you should look into other languages. Besides, if you have that many requests, you probably have a budget for storing those logs :)

It might also be a good idea to set up an automatic system to monitor those logs (e.g. Amazon CloudWatch if you are on AWS). In my example, I would definitely monitor the “Error when reading key from cache” string. That would give me an indication of any ongoing problems with the API cache.

Speaking of cache, you might use it to save a lot of resources. Caching is a very broad topic with many pitfalls (what to cache, stale caches, etc.), so I won’t spend too much time on it, but you might want to read at least a little bit about it. In my example, I am using the redis key-value store, which allows me to save the result for a given request; if another request asks for the same data, I can read it from redis much faster.
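
The basic redis round-trip used in this post looks as follows. redis stores raw bytes, so R objects have to be serialized explicitly - redux ships helpers for that:

r <- redux::hiredis()

# object_to_bin serializes an R object to raw bytes; bin_to_object reverses it
r$SET("some_key", redux::object_to_bin(list(symbol = "KRAS")))
redux::bin_to_object(r$GET("some_key"))  # returns list(symbol = "KRAS")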

Note that you could use the memoise package to achieve something similar using R only. However, redis might be useful when you are running multiple workers - one cached request becomes available to all other R processes. But if you need to deploy just one process, memoise is fine, and it does not introduce another dependency - which is always a plus.
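
For completeness, the in-process variant might look like this (a minimal sketch; the slow function is made up):

library(memoise)

# A made-up expensive call, standing in for something like getBM()
slow_lookup <- function(symbol) {
  Sys.sleep(2)
  toupper(symbol)
}

fast_lookup <- memoise(slow_lookup)
fast_lookup("kras")  # first call: takes ~2 seconds
fast_lookup("kras")  # second call: returns instantly from the in-memory cache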

library(biomaRt)
library(redux)
redis <- redux::hiredis()
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))

#* @apiTitle Plumber Gene id/name API.
#* @apiDescription Simple API.

# Helper that prefixes every log entry with the machine and request identifiers,
# so a single request can be tracked across the whole pipeline.
info <- function(req, ...) {
  do.call(
    log_info,
    c(
      list("MachineId: {MACHINE_ID}, ReqId: {req$request_id}"),
      list(...),
      .sep = ", "
    ), envir = parent.frame(1)
  )
}

#* Assign a unique identifier to each incoming request
#* https://www.rplumber.io/articles/routing-and-input.html - this is a must read!
#* @filter setuuid
function(req) {
  req$request_id <- UUIDgenerate(n = 1)
  plumber::forward()
}

#* Log some information about the incoming request
#* @filter logger
function(req) {

  if(!grepl(req$PATH_INFO, pattern = "PATH_INFO")) {
    info(
      req,
      "REQUEST_METHOD: {req$REQUEST_METHOD}",
      "PATH_INFO: {req$PATH_INFO}",
      "HTTP_USER_AGENT: {req$HTTP_USER_AGENT}",
      "REMOTE_ADDR: {req$REMOTE_ADDR}"
    )
  }
  plumber::forward()
}

get_from_cache <- function(key, redis, req) {
  tryCatch({
    r <- redis$GET(key)
    if(!is.null(r)) {
      info(req, "Key `{key}` read from cache.")
      redux::bin_to_object(r)
    } else {
      info(req, "Key `{key}` cache miss.")
      NULL
    }
  }, error = function(e) {
    info(req, "Error when reading key from cache: ", e$message)
    NULL
  })
}

set_in_cache <- function(key, value, redis, req) {

  tryCatch({
    redis$SET(key, redux::object_to_bin(value))
    info(req, "Key `{key}` cached in redis")
  }, error = function(e) {
    info(req, "Key `{key}` falied to be cached in  redis: ", e$message)
  })

  return(invisible(NULL))
}

#* @serializer unboxedJSON
#* @get /genes/ensgs/<ensg>
get_gene_name <- function(req, ensg) {
  res <- get_from_cache(ensg, redis, req)
  if(!is.null(res)) {
    return(res)
  } else {
    res <- as.list(getBM(
      filters = "ensembl_gene_id",
      attributes = c("ensembl_gene_id", "hgnc_symbol", "description"),
      values = ensg,
      mart = mart
    ))
    set_in_cache(ensg, res, redis, req)
    res
  }
}

#* @serializer unboxedJSON
#* @get /genes/symbols/<symbol>
get_gene_ensg <- function(req, symbol) {
  res <- get_from_cache(symbol, redis, req)
  if(!is.null(res)) {
    return(res)
  } else {
    res <- as.list(getBM(
      filters = "hgnc_symbol",
      attributes = c("ensembl_gene_id", "hgnc_symbol", "description"),
      values = symbol,
      mart = mart
    ))
    set_in_cache(symbol, res, redis, req)
    res
  }
}

To run the API in the background, one additional file is needed. Here I am creating it using a simple bash script.

cat << EOT > tmp/run_api.R
library(plumber)
library(optparse)
library(uuid)
library(logger)

MACHINE_ID <- "MAIN_1"
PORT_NUMBER <- 8761

log_level(logger::TRACE)

pr("tmp/api_v1.R") %>%
  pr_run(port = PORT_NUMBER)
EOT

Rscript tmp/run_api.R > tmp/API_LOG.log 2>&1 &
echo $! > tmp/API_PID
disown

Similarly to redis, I need to wait until the API is ready to serve requests.

wait_for_api("tmp/API_LOG.log")

And now the API is ready to go.

Using an API

APIs are great because they use the HTTP protocol, which makes them incredibly easy to use from different environments like Python, R, Java, and even Bash. One of the best things about this is that you don’t need to install any special libraries (although they can be useful for a more native experience). All you have to do is send a GET or POST request and you’re good to go.

R

In the case of R, you’ll need to install the httr library to have a pleasant experience. Once you have httr installed, you can easily send API requests. The object returned by its functions contains a wealth of information, including the status_code, which indicates the status of the response. In most cases, receiving a 200 means everything is okay, but any other code might require further investigation. For a complete list of possible response codes, refer to: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.

result <- httr::GET("http://127.0.0.1:8761/genes/symbols/KRAS")
result$request
## <request>
## GET http://127.0.0.1:8761/genes/symbols/KRAS
## Output: write_memory
## Options:
## * useragent: libcurl/7.81.0 r-curl/4.3.2 httr/1.4.2
## * httpget: TRUE
## Headers:
## * Accept: application/json, text/xml, application/xml, */*
result$status_code
## [1] 200
httr::content(result)
## $ensembl_gene_id
## [1] "ENSG00000133703"
##
## $hgnc_symbol
## [1] "KRAS"
##
## $description
## [1] "KRAS proto-oncogene, GTPase [Source:HGNC Symbol;Acc:HGNC:6407]"

Now it might be a good time to look into the logs:

cat(readLines("tmp/API_LOG.log"), sep = "\n")
## Running plumber API at http://127.0.0.1:8761
## Running swagger Docs at http://127.0.0.1:8761/__docs__/
## INFO [2023-04-04 19:46:04] MachineId: MAIN_1, ReqId: 82f148ab-9c75-4467-8577-b8a9ac226f6c, REQUEST_METHOD: GET, PATH_INFO: /genes/symbols/KRAS, HTTP_USER_AGENT: libcurl/7.81.0 r-curl/4.3.2 httr/1.4.2, REMOTE_ADDR: 127.0.0.1
## INFO [2023-04-04 19:46:04] MachineId: MAIN_1, ReqId: 82f148ab-9c75-4467-8577-b8a9ac226f6c, Key `KRAS` cache miss.
## INFO [2023-04-04 19:46:05] MachineId: MAIN_1, ReqId: 82f148ab-9c75-4467-8577-b8a9ac226f6c, Key `KRAS` cached in redis

The first two lines are generated by plumber itself; they show on which port the API is listening (first line) and where the swagger UI is available (it’s worth playing with swagger to manually test/explore your API).

The three following lines are related to the request made from R. The first one gives you some basic information about the request, e.g. that this is a GET request to the /genes/symbols/KRAS path. Later on, such entries can be used to see how your API is utilized, which endpoints are used the most, etc.

The second line tells you that this request was not cached. It might be useful to calculate the cache hit ratio to make sure that your caching strategy is working as expected.

The third entry might be useful to diagnose any problems with writing to redis. Here it just tells you that everything is fine.
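
Since the log format is stable, a quick endpoint-usage summary is only a few lines of R away (a rough sketch, assuming logs in the format shown above):

logs <- readLines("tmp/API_LOG.log")
# Keep only request lines and extract the path from "PATH_INFO: <path>,"
requests <- grep("PATH_INFO: ", logs, value = TRUE)
paths <- sub(".*PATH_INFO: ([^,]+),.*", "\\1", requests)
sort(table(paths), decreasing = TRUE)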

Bash

In bash, good old curl is your friend. To process the output, which usually comes as JSON, you can use the jq tool (https://stedolan.github.io/jq/). jq requires some practice, like any other command line tool (awk, sed, grep, etc.), but if you need to work in bash, jq might be way faster than running python or R.

curl -s -X GET "http://127.0.0.1:8761/genes/ensgs/ENSG00000133703"
## {"ensembl_gene_id":"ENSG00000133703","hgnc_symbol":"KRAS","description":"KRAS proto-oncogene, GTPase [Source:HGNC Symbol;Acc:HGNC:6407]"}
curl -s -X GET "http://127.0.0.1:8761/genes/symbols/BCAT1"
## {"ensembl_gene_id":"ENSG00000060982","hgnc_symbol":"BCAT1","description":"branched chain amino acid transaminase 1 [Source:HGNC Symbol;Acc:HGNC:976]"}
# Just symbol
curl -s -X GET "http://127.0.0.1:8761/genes/ensgs/ENSG00000133703" | jq '.hgnc_symbol'
## "KRAS"
curl -s -X GET "http://127.0.0.1:8761/genes/ensgs/ENSG00000133703" |
  jq '(.hgnc_symbol) + " is a symbol for " + (.ensembl_gene_id) + " ensg"'
## "KRAS is a symbol for ENSG00000133703 ensg"
curl -s -X GET "http://127.0.0.1:8761/genes/symbols/KRAS" | jq '.description'
## "KRAS proto-oncogene, GTPase [Source:HGNC Symbol;Acc:HGNC:6407]"

Have I told you that I really like awk for processing log data? No? Well, here’s a one-liner to calculate the cache hit ratio (the percentage of API calls served from the cache).

awk '{if ($0 ~ /cache miss/) {++miss} else if($0 ~ /read from cache/) {++hit} }
      END {print "Cache hit ratio: " hit/(miss+hit)}' tmp/API_LOG.log
## Cache hit ratio: 0,5

Python

No surprises in Python - using the API is really simple. Just grab the requests module and you are ready to go.

import requests

x = requests.get('http://127.0.0.1:8761/genes/ensgs/ENSG00000133703')
x.json()
## {'ensembl_gene_id': 'ENSG00000133703', 'hgnc_symbol': 'KRAS', 'description': 'KRAS proto-oncogene, GTPase [Source:HGNC Symbol;Acc:HGNC:6407]'}

Creating an R package for interacting with an API

Okay, now that our API is up and running and we can communicate with it using tools like curl, it’s functional. But we can do better than that. We can wrap all of our API endpoints (the URL paths that we use to send the requests) in R functions, and bundle them together as a package.

An example of how to do this can be found in the following resource: https://httr.r-lib.org/articles/api-packages.html. However, I’ll also provide a simple example for our API.

library(httr)

# Print API object
print.gene_api <- function(x, ...) {
  cat("<GENE_API ", x$path, ">\n", sep = "")
  str(x$content)
  invisible(x)
}

# Workhorse function - it handles basic stuff like errors (status codes other
# than 200) and parsing the output.
gene_api <- function(path) {

  gene_api_path <- Sys.getenv("GENE_API_PATH", "http://127.0.0.1:8761")
  url <- modify_url(gene_api_path, path = path)
  resp <- GET(url)
  if (http_type(resp) != "application/json") {
    # TODO: what might be the reason? Maybe a better information for the user can
    # be provided?
    stop("API did not return json", call. = FALSE)
  }

  # Parse the body first, so the error branch below can use it
  parsed <- jsonlite::fromJSON(content(resp, "text", encoding = "UTF-8"), simplifyVector = FALSE)

  if (status_code(resp) != 200) {
    # TODO: serve other statuses that make sense (e.g. 201, 402 or 418)
    stop(
      sprintf(
        "Gene API request failed [%s]\n%s\n<%s>",
        status_code(resp),
        parsed$message,
        parsed$documentation_url
      ),
      call. = FALSE
    )
  }

  structure(
    list(
      content = parsed,
      path = path,
      response = resp
    ),
    class = "gene_api"
  )
}

# Raw usage
gene_api("genes/symbols/KRAS")
## <GENE_API genes/symbols/KRAS>
## List of 3
##  $ ensembl_gene_id: chr "ENSG00000133703"
##  $ hgnc_symbol    : chr "KRAS"
##  $ description    : chr "KRAS proto-oncogene, GTPase [Source:HGNC Symbol;Acc:HGNC:6407]"
gene_api("genes/ensgs/ENSG00000133703")
## <GENE_API genes/ensgs/ENSG00000133703>
## List of 3
##  $ ensembl_gene_id: chr "ENSG00000133703"
##  $ hgnc_symbol    : chr "KRAS"
##  $ description    : chr "KRAS proto-oncogene, GTPase [Source:HGNC Symbol;Acc:HGNC:6407]"
# Function wrappers for the most important API endpoints
ga_symbol_info <- function(symbol) {
  checkmate::assert_character(symbol, min.chars = 1, any.missing = FALSE, len = 1)
  gene_api(glue::glue("genes/symbols/{symbol}"))$content
}

ga_ensg_info <- function(ensg) {
  checkmate::assert_character(ensg, min.chars = 1, any.missing = FALSE, len = 1)
  gene_api(glue::glue("genes/ensgs/{ensg}"))$content
}

ga_symbol_info("KRAS")
## $ensembl_gene_id
## [1] "ENSG00000133703"
##
## $hgnc_symbol
## [1] "KRAS"
##
## $description
## [1] "KRAS proto-oncogene, GTPase [Source:HGNC Symbol;Acc:HGNC:6407]"
ga_ensg_info("ENSG00000133703")
## $ensembl_gene_id
## [1] "ENSG00000133703"
##
## $hgnc_symbol
## [1] "KRAS"
##
## $description
## [1] "KRAS proto-oncogene, GTPase [Source:HGNC Symbol;Acc:HGNC:6407]"

Note that in the code above, I use checkmate to verify the parameters passed to the functions. It’s very handy, as it checks whether the required conditions are met and provides informative messages to users if they are not.
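
For example, passing two symbols at once fails fast with a message pointing at the offending argument (the exact wording may differ between checkmate versions):

try(ga_symbol_info(c("KRAS", "TP53")))
# Error: Assertion on 'symbol' failed: Must have length 1, but has length 2.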

APIv2 - authentication, authorization, cookies, tokens…

By default, anyone with network access to an API can send requests against it. However, most APIs need to be protected from unauthorized users. Securing an API is a broad topic, and there are various ways to do it, so it is essential to spend some time thinking about it. After all, the last thing anyone wants is to make their API a source of company data leaks. One may need to study things like AWS Cognito, API Gateway, OAuth2, OpenID, etc.

For learning purposes, I created a simple example that uses cookie and access token-based authentication. It’s straightforward: to access the API, a user needs to make a call to a special endpoint and provide a unique key generated for each user in advance. (There might be a special process for obtaining those keys.)

  • /session_token - by calling this endpoint, a user gets a special token that will be valid for some time (e.g. 15 minutes), and it has to be added to each subsequent API request.
  • /session_cookie - this endpoint is very similar to the previous one, but the token is encrypted in a cookie. The user must add the cookie to each request. It is also valid for a specified period.

To encrypt the cookie, I create a key that will be used to secure the content sent to the user. While in this example the cookie stores credentials, it can store other information as well. Encrypting the cookie ensures that users won’t be able to access its content, which can be helpful in some cases.

Then I need to generate some API keys for the users. In my example, they are just hashes obtained from the numbers 1 to 5. However, this is just for illustration purposes.

library(dplyr)
library(DBI)
library(RSQLite)

pswd_file <- "tmp/cookie_pass_gene_api"
cat(plumber::random_cookie_key(), file = pswd_file)
Sys.chmod(pswd_file, mode = "0600")

db_file <- "tmp/user-info.sqlite"
if(file.exists(db_file)) unlink(db_file)
db <- dbConnect(RSQLite::SQLite(), "tmp/user-info.sqlite")

# This will be the table for storing the API keys
dbWriteTable(conn = db, name = "apikeys",
  data.frame(
    user    = paste0("user", 1:5),
    api_key = purrr::map_chr(1:5, digest::digest)
  ), overwrite = TRUE
)

# Show the content of the table
# as you can see "4b5630ee914e848e8d07221556b0a2fb"
# looks way more professional than "1"
tbl(db, "apikeys")
## # Source:   table<apikeys> [5 x 2]
## # Database: sqlite 3.40.1 [tmp/user-info.sqlite]
##   user  api_key
##   <chr> <chr>
## 1 user1 4b5630ee914e848e8d07221556b0a2fb
## 2 user2 c01f179e4b57ab8bd9de309e6d576c48
## 3 user3 11946e7a3ed5e1776e81c0f0ecd383d0
## 4 user4 234a2a5581872457b9fe1187d1616b13
## 5 user5 dd4ad37ee474732a009111e3456e7ed7
dbDisconnect(db)

write_chunk_file("api_v2", "tmp/api_v2.R")

Here is the code for the API:

library(plumber)
library(dplyr)
library(dbplyr)
library(pool)

# https://github.com/rstudio/pool
# it's a really cool package
pool <- dbPool(
  RSQLite::SQLite(),
  dbname = "tmp/user-info.sqlite",
  idleTimeout = 10,
  minSize = 0,
  onCreate = function(con) {
    logger::log_info("Database connection created")
    con
  }
)

# TODO: there's nearly no logging - but it would be
# worthwhile to add some

# Redis stores the tokens for a specified period of time.
# If a token is not present, it means that the user is
# not authorized to perform any actions anymore.
# Note that if you use any database to store credentials,
# you have to make it really secure!
redis <- redux::hiredis()

# This function checks if the user_key exists
# in the database. If so, it returns a token that can
# either be sent as a response or stored in a cookie.
# Redis takes care of expiring the token after the specified time.
# In the end, the verification is very easy - if the token exists
# in redis, then the user is authorized to perform
# a specific action.
authorize <- function(req, valid_time = 10L) {
  if(!"authorization" %in% names(req$HEADERS)) {
    res <- list(
      status_code = 401, msg = "api not provided"
    )
    return(res)
  }

  authorization <- strsplit(req$HEADERS[["authorization"]], split = " ")[[1]]
  user_key <- authorization[[2]]

  key <- tbl(pool, "apikeys") %>% filter(api_key == user_key) %>% collect()
  if(nrow(key) == 0) {
    list(
      status_code = 401
    )
  } else {
    token <- digest::digest(paste0(key$api_key, uuid::UUIDgenerate()))

    token_valid_date <- Sys.time() + valid_time

    # TODO: what should be done when redis is not available?
    redis$SET(token, key$user, EX = valid_time)
    list(
      token = token,
      status_code = 200,
      token_valid_date = token_valid_date
    )
  }
}


#* @get /session_token
#* @serializer unboxedJSON
function(req, res) {

  authorization_status <- authorize(req)

  if(authorization_status$status_code != 200) {
    res$status <- authorization_status$status_code
    return(list(error = paste("Access denied!", authorization_status$msg)))
  }

  authorization_status
}

#* @get /session_cookie
#* @serializer unboxedJSON
function(req, res) {

  authorization_status <- authorize(req)

  if(authorization_status$status_code != 200) {
    res$status <- authorization_status$status_code
    return(list(error = paste("Access denied!", authorization_status$msg)))
  }

  # Content of the encrypted cookies can be set using
  # req$session${COOKIE_NAME}
  req$session$token <- authorization_status$token

  # authorization_status will be returned, but it should not
  # contain the token in plain text
  authorization_status$token <- NULL
  authorization_status$info  <- "authorization cookie retrieved successfully"

  names(authorization_status) <- dplyr::case_when(
    names(authorization_status) == "token_valid_date" ~ "cookie_valid_date",
    TRUE ~ names(authorization_status)
  )

  authorization_status

}

#* @filter authorization
function(req, res) {

  # this filter is called for all endpoints other than the 4 listed below:
  ignored_paths <- c("/session_cookie", "/session_token", "/__docs__", "/openapi.json")
  if(!any(stringr::str_detect(req$PATH_INFO, ignored_paths))) {


    token <- if(!is.null(req$session$token)) {
      # read from cookie
      req$session$token
    } else {

      if (!"authorization" %in% names(req$HEADERS)) {
        # if neither a token nor a cookie is provided,
        # then the filter returns immediately
        res$status <- 401
        return("No cookie or token provided!")
      }

      strsplit(req$HEADERS[["authorization"]], split = " ")[[1]][[2]]
    }

    # if the user token does not exist in redis, it might mean
    # that it has expired.
    user <- redis$GET(token)

    if(is.null(user)) {
      res$status <- 401
      return("Cookie/token expired!")
    }

    req$user <- user
  }

  plumber::forward()
}

#* @get /session_counter
#* @serializer unboxedJSON
function(req, res) {
  # Simple endpoint that uses user information to store the
  # total number of api calls made against it.
  redis$INCR(req$user)
  value <- redis$GET(req$user)
  logger::log_info("[session_counter] {req$user}, value: {value}")
  list("total_number_of_api_calls" = value, user = req$user)
}

Now let’s write some code to run the API. Note that to use a secure cookie, you need to add a call to pr_cookie.

cat << EOT > tmp/run_api_v2.R
library(plumber)
library(optparse)
library(uuid)
library(logger)

MACHINE_ID <- "MAIN_1"
PORT_NUMBER <- 8762

log_level(logger::TRACE)

pr("tmp/api_v2.R") %>%
  pr_cookie(readLines("tmp/cookie_pass_gene_api", warn = FALSE), "token") %>%
  pr_run(port = PORT_NUMBER)
EOT

Rscript tmp/run_api_v2.R > tmp/API_LOG_v2.log 2>&1 &
echo $! > tmp/API_V2_PID
disown
wait_for_api("tmp/API_LOG_v2.log")
curl -X GET "http://127.0.0.1:8762/session_cookie" --silent
## {"error":"Access denied! api not provided"}
curl -X GET "http://127.0.0.1:8762/session_counter" --silent
## ["No cookie or token provided!"]
# Set cookie
curl -X GET "http://127.0.0.1:8762/session_cookie" \
  -H "Authorization: Bearer 4b5630ee914e848e8d07221556b0a2fb" \
  -c tmp/cookie_auth.txt --silent
## {"status_code":200,"cookie_valid_date":"2023-04-04 19:46:21","info":"authorization cookie retrived successfully"}
# Use cookie in curl
curl -X GET "http://127.0.0.1:8762/session_counter" --cookie tmp/cookie_auth.txt --silent
curl -X GET "http://127.0.0.1:8762/session_counter" --cookie tmp/cookie_auth.txt --silent
## {"total_number_of_api_calls":"1","user":"user1"}{"total_number_of_api_calls":"2","user":"user1"}

Now let’s wait 11 seconds to make sure that the token stored in the cookie expires.

sleep 11
curl -X GET "http://127.0.0.1:8762/session_counter" --cookie tmp/cookie_auth.txt --silent
## ["Cookie/token expired!"]

To renew the token, I need to call /session_token once again. In practice, it is usually a little more complicated, because there might be a second refresh token, sent along with the access token, that does the renewal automatically.

curl -X GET "http://127.0.0.1:8762/session_token" \
  -H "Authorization: Bearer 4b5630ee914e848e8d07221556b0a2fb" \
  -c tmp/cookie_auth.txt --silent
## {"token":"9ec25c5fe99193b8a168ed66baa5f4fc","status_code":200,"token_valid_date":"2023-04-04 19:46:33"}

Now, a short example with the /session_token endpoint:

# API Token
curl -X GET "http://127.0.0.1:8762/session_token" -H "Authorization: Bearer 4b5630ee914e848e8d07221556b0a2fb" --silent
## {"token":"b4f3df37d8d725b1b8ba60073b09ed69","status_code":200,"token_valid_date":"2023-04-04 19:46:33"}

To extract the token itself, you can use jq as shown in the example below:

API_TOKEN=$(curl -X GET "http://127.0.0.1:8762/session_token" -H "Authorization: Bearer 4b5630ee914e848e8d07221556b0a2fb" --silent | jq -r .token)
echo "\n$API_TOKEN\n"
curl -X GET "http://127.0.0.1:8762/session_counter" -H "Authorization: Bearer $API_TOKEN" --silent
## \n5893f0ff3990c6cf43b21ceb32b2309c\n
## {"total_number_of_api_calls":"3","user":"user1"}

APIv3 - API Gateway

In practice, the probability that you will need to implement your own authorization code is quite low, especially in a cloud environment. Most likely, your API will be hidden behind what is called an API Gateway - a single entry point for different APIs that performs routing, rate limiting, authentication, and authorization.

With an API Gateway, your code will be similar to the first version of the API. However, make sure to properly configure access controls. For example, your API servers should be protected with either a Network Access Control List (NACL) or security groups (speaking in AWS terminology), so that all external and direct accesses are prohibited.

Some useful links for further study:

  • https://tutorialsdojo.com/security-group-vs-nacl/ - NACL vs security group.
  • https://auth0.com/docs/get-started/identity-fundamentals/authentication-and-authorization - Authentication vs. Authorization - those are two different things!
  • https://microservices.io/patterns/apigateway.html and https://www.redhat.com/en/topics/api/what-does-an-api-gateway-do - API Gateway.

POST and GET

In the last part of this post I will show you how to write data using an API created in plumber.

write_chunk_file("api_post", "tmp/plumber_post.R")
library(plumber)
library(dplyr)
library(dbplyr)
library(pool)
database_path <- tempfile()

# Remove the old file if it exists, to make sure we have a clean start.
# In production YOU DON'T WANT TO DO THIS!
if(file.exists(database_path)) unlink(database_path)

pool <- dbPool(
  RSQLite::SQLite(),
  dbname = database_path,
  idleTimeout = 10,
  minSize = 0,
  onCreate = function(con) {
    logger::log_info("Database connection created")
    con
  }
)


# Here I create a simple schema for storing the data.
# One table with experiment_id and experiment_name
# and a second table with some data related to this experiment
dbExecute(pool, "CREATE TABLE IF NOT EXISTS experiments (
  experiment_id INTEGER PRIMARY KEY AUTOINCREMENT,
  experiment_name VARCHAR NOT NULL)
")

dbExecute(pool, "CREATE TABLE IF NOT EXISTS genes_expression (
  experiment_id INTEGER NOT NULL,
  sample_id INTEGER NOT NULL,
  ensg VARCHAR NOT NULL,
  tpm REAL NOT NULL,
  FOREIGN KEY(experiment_id) REFERENCES experiments(experiment_id)
)
")

if(nrow(tbl(pool, "experiments") %>% filter(experiment_id == 1) %>% collect()) == 0) {

  # Here I add some data to the database
  experiment_name <- "my_first_experiment"

  DBI::dbExecute(
    pool,
    glue::glue_sql("INSERT INTO experiments (experiment_name) VALUES({experiment_name})", .con = pool)
  )

  items <- c(
    "INSERT INTO genes_expression (experiment_id, sample_id, ensg, tpm) VALUES (1, 1, 'ENSG00000133703', 20.0)",
    "INSERT INTO genes_expression (experiment_id, sample_id, ensg, tpm) VALUES (1, 2, 'ENSG00000133703', 11.0)",
    "INSERT INTO genes_expression (experiment_id, sample_id, ensg, tpm) VALUES (1, 3, 'ENSG00000133703', 45.0)",
    "INSERT INTO genes_expression (experiment_id, sample_id, ensg, tpm) VALUES (1, 1, 'ENSG00000268173', 137.2)",
    "INSERT INTO genes_expression (experiment_id, sample_id, ensg, tpm) VALUES (1, 2, 'ENSG00000268173', 431.1)",
    "INSERT INTO genes_expression (experiment_id, sample_id, ensg, tpm) VALUES (1, 3, 'ENSG00000268173', 19.9)"
  )

  purrr::map(items, function(x) DBI::dbExecute(pool, x))

}

#* This endpoint reads all genes_expression data associated with an experiment
#* @get /experiments/<id>/genes_expression
#* @serializer unboxedJSON
function(req, res, id) {
  result <- tbl(pool, "experiments") %>% filter(experiment_id == id) %>% collect()
  expressions <- tbl(pool, "genes_expression")

  if(!is.null(req$args$ensg)) {
    ensgs <- req$args$ensg
    expressions <- expressions %>% dplyr::filter(ensg %in% ensgs)
  }
  expressions <- expressions %>% select(sample_id, ensg, tpm) %>% collect()
  list(experiment = result, expressions = expressions)
}

#* Read experiment_name by experiment_id
#* @get /experiments/<id>
#* @serializer unboxedJSON
function(req, res, id) {
  tbl(pool, "experiments") %>% filter(experiment_id == id) %>% collect() %>% as.list
}

#* The most important part of this API - it writes the data to the database
#* @post /experiments
#* @serializer unboxedJSON
function(req, res) {

  # Data is sent in the request body;
  # the next line decodes it from raw bytes into an R list
  data <- jsonlite::fromJSON(rawToChar(req$bodyRaw))

  # If an experiment with the same name already exists, return an appropriate response
  experiment_name_ <- data$experiment
  experiment_exists <- nrow(tbl(pool, "experiments") %>% dplyr::filter(experiment_name == experiment_name_) %>% collect()) > 0
  if(experiment_exists) {
    res$status <- 409
    return(paste(data$experiment, "already exists in the database!"))
  }

  # TODO: validation of the input - is everything present?

  # TODO: Handle errors in the transaction.
  # TODO: different way of extracting last added id
  # Btw - transactions are a very important concept in the DB world;
  # here are some starting links if you don't know what they mean:
  # https://www.sqlite.org/transactional.html
  # https://en.wikipedia.org/wiki/Database_transaction
  experiment_id <- pool::poolWithTransaction(pool, function(conn){
    DBI::dbExecute(conn, glue::glue_sql("INSERT INTO experiments(experiment_name) VALUES({data$experiment})", .con = conn))

    # TODO: better way to get the id of the just added experiment_name
    experiment_id <-
      DBI::dbGetQuery(
        conn,
        glue::glue_sql(
          .con = conn,
          "SELECT experiment_id FROM experiments WHERE experiment_name = {data$experiment}"
        )
      )$experiment_id

    # Creating the INSERT statement - note that I am using glue_sql for proper escaping;
    # if you don't know why you need to escape your SQL queries, read about SQL injection
    data$expressions$experiment_id <- experiment_id
    insert_sql <- paste(glue::glue_sql(
      .con = conn,
      "({experiment_id},{sample_id},{ensg},{tpm})",
      .envir = data$expressions
    ), collapse = ", ")
   insert_sql <- paste("INSERT INTO genes_expression(experiment_id, sample_id, ensg, tpm) VALUES", insert_sql)
   DBI::dbExecute(conn, insert_sql)
   experiment_id
  })

  list(experiment_id = experiment_id)
}

As usual - some code for running the API in the background:

cat << EOT > tmp/run_api_post.R
library(plumber)
library(optparse)
library(uuid)
library(logger)

MACHINE_ID <- "MAIN_1"
PORT_NUMBER <- 8113

log_level(logger::TRACE)

plumb(file='tmp/plumber_post.R') %>%
  pr_run(port = PORT_NUMBER)
EOT

Rscript tmp/run_api_post.R > tmp/POST_API_LOG.log 2>&1 &
echo $! > tmp/POST_API_PID
disown
wait_for_api("tmp/POST_API_LOG.log")
curl -X GET "http://127.0.0.1:8113/experiments/1" -H "accept: application/json" --silent
## {"experiment_id":1,"experiment_name":"my_first_experiment"}

/genes_expression allows filtering by a specific ENSG, which can be provided using a query string:

curl -X GET "http://127.0.0.1:8113/experiments/1/genes_expression?ensg=ENSG00000133703" -H "accept: application/json" --silent
## {"experiment":[{"experiment_id":1,"experiment_name":"my_first_experiment"}],"expressions":[{"sample_id":1,"ensg":"ENSG00000133703","tpm":20},{"sample_id":2,"ensg":"ENSG00000133703","tpm":11},{"sample_id":3,"ensg":"ENSG00000133703","tpm":45}]}

To send a vector, you can use the same argument multiple times, as shown below:

curl -X GET "http://127.0.0.1:8113/experiments/1/genes_expression?ensg=ENSG00000133703&ensg=ENSG00000268173" -H "accept: application/json" --silent
## {"experiment":[{"experiment_id":1,"experiment_name":"my_first_experiment"}],"expressions":[{"sample_id":1,"ensg":"ENSG00000133703","tpm":20},{"sample_id":2,"ensg":"ENSG00000133703","tpm":11},{"sample_id":3,"ensg":"ENSG00000133703","tpm":45},{"sample_id":1,"ensg":"ENSG00000268173","tpm":137.2},{"sample_id":2,"ensg":"ENSG00000268173","tpm":431.1},{"sample_id":3,"ensg":"ENSG00000268173","tpm":19.9}]}

Now, let’s prepare a json to be added to the database using a POST request.

input_to_api <- list(
  experiment = "my_second_experiment",
  expressions = tibble(sample_id = 1:4, ensg = "ENSG00000133703", tpm = c(52.6,14.1,146.5,2000.1))
)

jsonlite::toJSON(input_to_api, auto_unbox = TRUE, pretty = TRUE)
## {
##   "experiment": "my_second_experiment",
##   "expressions": [
##     {
##       "sample_id": 1,
##       "ensg": "ENSG00000133703",
##       "tpm": 52.6
##     },
##     {
##       "sample_id": 2,
##       "ensg": "ENSG00000133703",
##       "tpm": 14.1
##     },
##     {
##       "sample_id": 3,
##       "ensg": "ENSG00000133703",
##       "tpm": 146.5
##     },
##     {
##       "sample_id": 4,
##       "ensg": "ENSG00000133703",
##       "tpm": 2000.1
##     }
##   ]
## }

And in the last step, the request is sent with the json containing all the required information.

curl -X POST "http://127.0.0.1:8113/experiments" --silent --data '
{
  "experiment": "my_second_experiment",
  "expressions": [
    {
      "sample_id": 1,
      "ensg": "ENSG00000133703",
      "tpm": 52.6
    },
    {
      "sample_id": 2,
      "ensg": "ENSG00000133703",
      "tpm": 14.1
    },
    {
      "sample_id": 3,
      "ensg": "ENSG00000133703",
      "tpm": 146.5
    },
    {
      "sample_id": 4,
      "ensg": "ENSG00000133703",
      "tpm": 2000.1
    }
  ]
}
'
## {"experiment_id":2}

And reading the output:

curl -X GET "http://127.0.0.1:8113/experiments/2/genes_expression" -H "accept: application/json" --silent
## {"experiment":[{"experiment_id":2,"experiment_name":"my_second_experiment"}],"expressions":[{"sample_id":1,"ensg":"ENSG00000133703","tpm":20},{"sample_id":2,"ensg":"ENSG00000133703","tpm":11},{"sample_id":3,"ensg":"ENSG00000133703","tpm":45},{"sample_id":1,"ensg":"ENSG00000268173","tpm":137.2},{"sample_id":2,"ensg":"ENSG00000268173","tpm":431.1},{"sample_id":3,"ensg":"ENSG00000268173","tpm":19.9},{"sample_id":1,"ensg":"ENSG00000133703","tpm":52.6},{"sample_id":2,"ensg":"ENSG00000133703","tpm":14.1},{"sample_id":3,"ensg":"ENSG00000133703","tpm":146.5},{"sample_id":4,"ensg":"ENSG00000133703","tpm":2000.1}]}

One last thing, just for the pure joy of using bash parallel and some other console tools. The script below runs curl 500 times on 8 workers and logs some performance information. Then I use some simple tools to perform a basic analysis:

# https://stackoverflow.com/questions/18215389/how-do-i-measure-request-and-response-times-at-once-using-curl

seq 500 | parallel --joblog tmp/joblog.txt -n0 -j8 "curl \
    -X GET 'http://127.0.0.1:8113/experiments/1/genes_expression?ensg=ENSG00000133703' \
    -H 'accept: application/json' --silent \
    -w '%{time_connect} %{time_starttransfer} %{time_total}\n' \
    -o /dev/null" > tmp/all_times

Calculate the mean connection time, time to first byte transferred, and total time for a request:

LC_NUMERIC=C # for some reason Rstudio changes the numeric locale!
awk '{N+=1; CONNECT+=$1; TTFB+=$2; TOTAL+=$3} END {printf "Connect: %.3gs TTFB: %.2gs TOTAL: %.3gs\n", CONNECT/N, TTFB/N, TOTAL/N}' tmp/all_times
## Connect: 0.00015s TTFB: 0.5s TOTAL: 0.504s

As a bonus - the number of requests processed each second (to be precise, I use the number of requests started each second by parallel). As you can see, this API can handle around 16 requests per second. Can you explain what each command in this bash pipeline does?

LC_NUMERIC=C
tail -n +2 tmp/joblog.txt | cut -f 3 | xargs printf "%.0f\n"|
  sort | uniq -c | tr -s " " | cut -f2 -d " " | sort |
  awk '{a[i++]=$0} END {print (i%2==0)?(a[int(i/2)-1]+a[int(i/2)])/2:a[int(i/2)]}'
  
## 16

Clean-up

To keep the system clean, I kill all the processes started in the background. To do that, I use the PID files created when the processes were started. Stopping redis is a little more complicated, because I need to extract its process_id from the CLI.

# Kill R processes
for f in tmp/*_PID; do echo $f && kill -15 $(cat $f); done

# Kill redis-server
kill -15 $(redis-cli INFO | grep process_id | cut -d":" -f2 | sed -e "s/\r//g")

rm -rf tmp
## tmp/API_PID
## tmp/API_V2_PID
## tmp/POST_API_PID

Summary

The topics covered in this post should be treated as a starting point for further study. Here’s a bunch of links that you might find useful: