API and R
Nowadays, it’s pretty much expected that software comes with an HTTP API interface. Every programming language out there offers a way to expose APIs or make GET/POST/PUT requests, including R
. In this post, I’ll show you how to create an API using the plumber package. Plus, I’ll give you tips on how to make it more production ready - I’ll tackle scalability, statelessness, caching, and load balancing. You’ll even see how to consume your API with other tools like python
, curl
, and the R own httr
package.
By the way, you might notice that in some parts of the code, I’ve included TODO comments. I’m not promising to update this post later (although I might!). The comments are just there to get you thinking about how you might expand on this in a real-world environment.
Some preparation
The whole document has been created using rmarkdown
. It means that you take the code and reproduce everything by yourself.
# The files required to make this document are stored in tmp directory
dir.create("tmp", recursive = TRUE, showWarnings = FALSE)
# The function below is used to get the content of code chunk and
# output it to a file to make it available for the `plumber` package.
write_chunk_file <- function(chunk_name, dest_file) {
lines <- readLines("2023-04-04-rest-api-in-r-with-plumber.Rmd", warn = FALSE)
line_start <- grep(lines, pattern = paste("```\\{r", chunk_name))+1
line_end <- grep(lines, pattern = "```")
line_end <- head(line_end[line_end > line_start],1) - 1
api_code <- c(lines[line_start:line_end], "")
writeLines(api_code, con = dest_file)
}
# When an API is started it might take some time to initialize
# this function stops the main execution and wait until
# plumber API is ready to take queries.
wait_for_api <- function(log_path, timeout = 60, check_every = 1) {
times <- timeout / check_every
for(i in seq_len(times)) {
Sys.sleep(check_every)
if(any(grepl(readLines(log_path), pattern = "Running plumber API"))) {
return(invisible())
}
}
stop("Waiting timed!")
}
Oh, in some examples I am using redis
. So, before you dive in, make sure to fire up a simple redis
server. At the end of the script, I’ll be turning redis
off, so you don’t want to be using it for anything else at the same time. I just want to remind you that this code isn’t meant to be run on a production server.
redis-server --daemonize yes
redis
is launched in a background, , so you might want to wait a little bit to make sure it’s fully up and running before moving on.
wait_for_redis <- function(timeout = 60, check_every = 1) {
times <- timeout / check_every
for(i in seq_len(times)) {
Sys.sleep(check_every)
status <- suppressWarnings(system2("redis-cli", "PING", stdout = TRUE, stderr = TRUE) == "PONG")
if(status) {
return(invisible())
}
}
stop("Redis waiting timed!")
}
dir.create("tmp")
## Warning in dir.create("tmp"): 'tmp' already exists
wait_for_redis()
Just a quick tip: if you want to double-check that redis
is totally empty, here’s how to do it.
redis-cli FLUSHALL
## OK
Gene APIv1
First things first - let’s save the API code into a file:
# Write the content of the next code chunk to tmp/api_v1.R
write_chunk_file("api_v1", "tmp/api_v1.R")
Alright, now let’s dive into the main topic - how to create an API using plumber. It’s actually not that difficult - you can head over to https://www.rplumber.io/index.html, grab the basic example, and have something up and running in just 10 minutes! But let’s talk about how to write an API that will save you lots of headaches in the future.
First off, let’s talk about logging. I try to log as much as possible, especially in critical areas like database accesses, and interactions with other systems. This way, if there’s an issue in the future (and trust me, there will be), I should be able to diagnose the problem just by looking at the logs alone. Logging is like “print debugging” (putting print(“I am here”), print(“I am here 2”) everywhere), but done ahead of time. I always try to think about what information might be needed to make a correct diagnosis, so logging variable values is a must. The logger
and glue
packages are your best friends in that area.
Next, it might also be useful to add a unique request identifier ((I am doing that in setuuid
filter)) to be able to track it across the whole pipeline (since a single request might be passed across many functions). You might also want to add some other identifiers, such as MACHINE_ID - your API might be deployed on many machines, so it could be helpful for diagnosing if the problem is associated with a specific instance or if it’s a global issue.
In general you shouldn’t worry too much about the size of the logs. Even if you generate ~10KB per request, it will take 100000 requests to generate 1GB. And for the plumber
API, 100000 requests generated in a short time is A LOT. In such scenario you should look into other languages. And if you have that many requests, you probably have a budget for storing those logs:)
It might also be a good idea to setup some automatic system to monitor those logs (e.g. Amazon CloudWatch
if you are on AWS). In my example I would definitely monitor Error when reading key from cache
string. That would give me an indication of any ongoing problems with API cache.
Speaking of cache, you might use it to save a lot of resources. Caching is a very broad topic with many pitfalls (what to cache, stale cache, etc) so I won’t spend too much time on it, but you might want to read at least a little bit about it. In my example, I am using redis
key-value store, which allows me to save the result for a given request, and if there is another requests that asks for the same data, I can read it from redis
much faster.
Note that you could use memoise
package to achieve similar thing using R only. However, redis
might be useful when you are using multiple workers. Then, one cached request becomes available for all other R processes. But if you need to deploy just one process, memoise
is fine, and it does not introduce another dependency - which is always a plus.
library(biomaRt)
library(redux)
redis <- redux::hiredis()
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
#* @apiTitle Plumber Gene id/name API.
#* @apiDescription Simple API.
info <- function(req, ...) {
do.call(
log_info,
c(
list("MachineId: {MACHINE_ID}, ReqId: {req$request_id}"),
list(...),
.sep = ", "
), envir = parent.frame(1)
)
}
#* Log some information about the incoming request
#* https://www.rplumber.io/articles/routing-and-input.html - this is a must read!
#* @filter setuuid
function(req) {
req$request_id <- UUIDgenerate(n = 1)
plumber::forward()
}
#* Log some information about the incoming request
#* @filter logger
function(req) {
if(!grepl(req$PATH_INFO, pattern = "PATH_INFO")) {
info(
req,
"REQUEST_METHOD: {req$REQUEST_METHOD}",
"PATH_INFO: {req$PATH_INFO}",
"HTTP_USER_AGENT: {req$HTTP_USER_AGENT}",
"REMOTE_ADDR: {req$REMOTE_ADDR}"
)
}
plumber::forward()
}
get_from_cache <- function(key, redis, req) {
result <- tryCatch({
r <- redis$GET(key)
if(!is.null(r)) {
info(req, "Key `{key}` read from cache.")
r <- redux::bin_to_object(r)
} else {
info(req, "Key `{key}` cache miss.")
NULL
}
}, error = function(e) {
info(req, "Error when reading key from cache: ", e$message)
return(NULL)
})
}
set_in_cache <- function(key, value, redis, req) {
tryCatch({
redis$SET(key, redux::object_to_bin(value))
info(req, "Key `{key}` cached in redis")
}, error = function(e) {
info(req, "Key `{key}` falied to be cached in redis: ", e$message)
})
return(invisible(NULL))
}
#* @serializer unboxedJSON
#* @get /genes/ensgs/<ensg>
get_gene_name <- function(req, ensg) {
res <- get_from_cache(ensg, redis, req)
if(!is.null(res)) {
return(res)
} else {
res <- as.list(getBM(
filters = "ensembl_gene_id",
attributes = c("ensembl_gene_id", "hgnc_symbol", "description"),
values = ensg,
mart = mart
))
set_in_cache(ensg, res, redis, req)
res
}
}
#* @serializer unboxedJSON
#* @get /genes/symbols/<symbol>
get_gene_ensg <- function(req, symbol) {
res <- get_from_cache(symbol, redis, req)
if(!is.null(res)) {
return(res)
} else {
res <- as.list(getBM(
filters = "hgnc_symbol",
attributes = c("ensembl_gene_id", "hgnc_symbol", "description"),
values = symbol,
mart = mart
))
set_in_cache(symbol, res, redis, req)
res
}
}
To run the API in background, one additional file is needed. Here I am creating it using a simple bash script.
cat << EOT > tmp/run_api.R
library(plumber)
library(optparse)
library(uuid)
library(logger)
MACHINE_ID <- "MAIN_1"
PORT_NUMBER <- 8761
log_level(logger::TRACE)
pr("tmp/api_v1.R") %>%
pr_run(port = PORT_NUMBER)
EOT
Rscript tmp/run_api.R > tmp/API_LOG.log 2>&1 &
echo $! > tmp/API_PID
disown
Similarly to redis
I need to wait till the API is ready to serve.
wait_for_api("tmp/API_LOG.log")
And now, API is ready to go.
Using an API
APIs are great because they use the HTTP protocol, which makes them incredibly easy to use from different environments like Python, R, Java, and even Bash. One of the best things about this is that you don’t need to install any special libraries (although they can be useful for a more native experience). All you have to do is send a GET or POST request and you’re good to go.
R
In the case of R, you’ll need to install the httr
library to have a pleasant experience. Once you have httr
installed, you can easily send API requests. The object returned by its functions contains a wealth of information, including the status_code
, which indicates the status of the response. In most cases, receiving a 200 means everything is okay, but any other code might require further investigation. For a complete list of possible response codes, refer to: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.
result <- httr::GET("http://127.0.0.1:8761/genes/symbols/KRAS")
result$request
## <request>
## GET http://127.0.0.1:8761/genes/symbols/KRAS
## Output: write_memory
## Options:
## * useragent: libcurl/7.81.0 r-curl/4.3.2 httr/1.4.2
## * httpget: TRUE
## Headers:
## * Accept: application/json, text/xml, application/xml, */*
result$status_code
## [1] 200
httr::content(result)
## $ensembl_gene_id
## [1] "ENSG00000133703"
##
## $hgnc_symbol
## [1] "KRAS"
##
## $description
## [1] "KRAS proto-oncogene, GTPase [Source:HGNC Symbol;Acc:HGNC:6407]"
Now it might be a good time to look into the logs:
cat(readLines("tmp/API_LOG.log"), sep = "\n")
## Running plumber API at http://127.0.0.1:8761
## Running swagger Docs at http://127.0.0.1:8761/__docs__/
## INFO [2023-04-04 19:46:04] MachineId: MAIN_1, ReqId: 82f148ab-9c75-4467-8577-b8a9ac226f6c, REQUEST_METHOD: GET, PATH_INFO: /genes/symbols/KRAS, HTTP_USER_AGENT: libcurl/7.81.0 r-curl/4.3.2 httr/1.4.2, REMOTE_ADDR: 127.0.0.1
## INFO [2023-04-04 19:46:04] MachineId: MAIN_1, ReqId: 82f148ab-9c75-4467-8577-b8a9ac226f6c, Key `KRAS` cache miss.
## INFO [2023-04-04 19:46:05] MachineId: MAIN_1, ReqId: 82f148ab-9c75-4467-8577-b8a9ac226f6c, Key `KRAS` cached in redis
Two first lines are generated by plumber
itself, they show on which port API is listening (first line), and on which the swagger UI is available (it’s worth to play with swagger to manually test/explore your API).
All 3 following lines are related to the request made by R. The first line gives you some basic information about the request, e.g. that this is GET
request to /genes/symbols/KRAS
path. Later on it can be used to see how your API is utilized, which endpoints are used the most, etc.
Second line tells you that this request was not cached. It might be useful to calculate cache hit ratio
to make sure that your caching strategy is working as expected.
Third entry might be useful to diagnose any problems with writing to redis
. In this state it just tells you that everything is fine.
Bash
In bash, good old curl
is your friend. To process the ouput which usually comes as json, you can use jq
tool (https://stedolan.github.io/jq/). jq
requires some practice, like any other command line tool (awk
, sed
, grep
, etc), but if you need to work in bash, jq
might be way faster than running python
or R
.
curl -s -X GET "http://127.0.0.1:8761/genes/ensgs/ENSG00000133703"
## {"ensembl_gene_id":"ENSG00000133703","hgnc_symbol":"KRAS","description":"KRAS proto-oncogene, GTPase [Source:HGNC Symbol;Acc:HGNC:6407]"}
curl -s -X GET "http://127.0.0.1:8761/genes/symbols/BCAT1"
## {"ensembl_gene_id":"ENSG00000060982","hgnc_symbol":"BCAT1","description":"branched chain amino acid transaminase 1 [Source:HGNC Symbol;Acc:HGNC:976]"}
# Just symbol
curl -s -X GET "http://127.0.0.1:8761/genes/ensgs/ENSG00000133703" | jq '.hgnc_symbol'
## "KRAS"
curl -s -X GET "http://127.0.0.1:8761/genes/ensgs/ENSG00000133703" |
jq '(.hgnc_symbol) + " is a symbol for " + (.ensembl_gene_id) + " ensg"'
## "KRAS is a symbol for ENSG00000133703 ensg"
curl -s -X GET "http://127.0.0.1:8761/genes/symbols/KRAS" | jq '.description'
## "KRAS proto-oncogene, GTPase [Source:HGNC Symbol;Acc:HGNC:6407]"
Have I told you that I really like awk
for processing log data? No? Here you have one-liner to calculate cache hit ratio (percentage of the API calls served from the cache).
awk '{if ($0 ~ /cache miss/) {++miss} else if($0 ~ /read from cache/) {++hit} }
END {print "Cache hit ratio: " hit/(miss+hit)}' tmp/API_LOG.log
## Cache hit ratio: 0,5
Python
No surprises in Python - using API is really simple. Just grab requests
module and you are ready to go.
import requests
x = requests.get('http://127.0.0.1:8761/genes/ensgs/ENSG00000133703')
x.json()
## {'ensembl_gene_id': 'ENSG00000133703', 'hgnc_symbol': 'KRAS', 'description': 'KRAS proto-oncogene, GTPase [Source:HGNC Symbol;Acc:HGNC:6407]'}
Creating R package for interacting with an API
Okay, now that we have our API up and running and can communicate with it using tools like curl, it’s functional. But we can do better than that. We can wrap all of our API endpoints (the URL paths that we use to send the requests) in R functions, and bundle them together as a package.
An example of how to do this can be found in the following resource: https://httr.r-lib.org/articles/api-packages.html. However, I’ll also provide a simple example for our API.
library(httr)
# Print API object
print.gene_api <- function(x, ...) {
cat("<GENE_API ", x$path, ">\n", sep = "")
str(x$content)
invisible(x)
}
# Workhorse function - it handles basic stuff like errors (status code other than 200),
# and parsing the output.
gene_api <- function(path) {
gene_api_path <- Sys.getenv("GENE_API_PATH", "http://127.0.0.1:8761")
url <- modify_url(gene_api_path, path = path)
resp <- GET(url)
if (http_type(resp) != "application/json") {
# TODO: what might be the reason? Maybe a better information for the user can
# be provided?
stop("API did not return json", call. = FALSE)
}
if (status_code(resp) != 200) {
# TODO: serve other statuses that make sense (e.g. 201, 402 or 418)
stop(
sprintf(
"Gene API request failed [%s]\n%s\n<%s>",
status_code(resp),
parsed$message,
parsed$documentation_url
),
call. = FALSE
)
}
parsed <- jsonlite::fromJSON(content(resp, "text", encoding = "UTF-8"), simplifyVector = FALSE)
structure(
list(
content = parsed,
path = path,
response = resp
),
class = "gene_api"
)
}
# Raw usage
gene_api("genes/symbols/KRAS")
## <GENE_API genes/symbols/KRAS>
## List of 3
## $ ensembl_gene_id: chr "ENSG00000133703"
## $ hgnc_symbol : chr "KRAS"
## $ description : chr "KRAS proto-oncogene, GTPase [Source:HGNC Symbol;Acc:HGNC:6407]"
gene_api("genes/ensgs/ENSG00000133703")
## <GENE_API genes/ensgs/ENSG00000133703>
## List of 3
## $ ensembl_gene_id: chr "ENSG00000133703"
## $ hgnc_symbol : chr "KRAS"
## $ description : chr "KRAS proto-oncogene, GTPase [Source:HGNC Symbol;Acc:HGNC:6407]"
# Funcion wrappers for the most important API endpoints
ga_symbol_info <- function(symbol) {
checkmate::assert_character(symbol, min.chars = 1, any.missing = FALSE, len = 1)
gene_api(glue::glue("genes/symbols/{symbol}"))$content
}
ga_ensg_info <- function(ensg) {
checkmate::assert_character(ensg, min.chars = 1, any.missing = FALSE, len = 1)
gene_api(glue::glue("genes/ensgs/{ensg}"))$content
}
ga_symbol_info("KRAS")
## $ensembl_gene_id
## [1] "ENSG00000133703"
##
## $hgnc_symbol
## [1] "KRAS"
##
## $description
## [1] "KRAS proto-oncogene, GTPase [Source:HGNC Symbol;Acc:HGNC:6407]"
ga_ensg_info("ENSG00000133703")
## $ensembl_gene_id
## [1] "ENSG00000133703"
##
## $hgnc_symbol
## [1] "KRAS"
##
## $description
## [1] "KRAS proto-oncogene, GTPase [Source:HGNC Symbol;Acc:HGNC:6407]"
Note that in the code above, I use checkmate
to verify the parameters passed to the functions. It’s very handy, as it checks if the required conditions are met and provides informative messages to users if they are not.
APIv3 - API Gateway
In practice, the probability that you will need to implement your own authorization code is quite low, especially in a cloud environment. Most likely, your API will be hidden behind what is called an API Gateway - a single entry point for different APIs that performs routing, rate limiting, authentication, and authorization.
With an API Gateway, your code will be similar to the first version of the API. However, make sure to properly configure access controls. For example, your API servers should be protected with either a Network Access Control List (NACL) or security groups (speaking in AWS terminology), so that all external and direct accesses are prohibited.
Some useful links for further study: - https://tutorialsdojo.com/security-group-vs-nacl/ NACL vs secruity group. - https://auth0.com/docs/get-started/identity-fundamentals/authentication-and-authorization - Authentication vs. Authorization - those are two different things! - https://microservices.io/patterns/apigateway.html, https://www.redhat.com/en/topics/api/what-does-an-api-gateway-do - API Gateway.
POST and GET
In the last part of this post I will show you how to write data using an API created in plumber
.
write_chunk_file("api_post", "tmp/plumber_post.R")
library(plumber)
library(dplyr)
library(dbplyr)
library(pool)
database_path <- tempfile()
# Remove old file if exists to make sure that we have a clean start.
# In production YOU DON"T WANT TO DO THIS!
if(file.exists(database_path)) unlink(database_path)
pool <- dbPool(
RSQLite::SQLite(),
dbname = database_path,
idleTimeout = 10,
minSize = 0,
onCreate = function(con) {
logger::log_info("Database connection created")
con
}
)
# Here I create a simple schema for storing the data.
# One table with experiment_id and experiment_name
# and a second table with some data related to this experiment
dbExecute(pool, "CREATE TABLE IF NOT EXISTS experiments (
experiment_id INTEGER PRIMARY KEY AUTOINCREMENT,
experiment_name VARCHAR NOT NULL)
")
dbExecute(pool, "CREATE TABLE IF NOT EXISTS genes_expression (
experiment_id INTEGER NOT NULL,
sample_id INTEGER NOT NULL,
ensg VARCHAR NOT NULL,
tpm REAL NOT NULL,
FOREIGN KEY(experiment_id) REFERENCES Artists(experiment_id)
)
")
if(nrow(tbl(pool, "experiments") %>% filter(experiment_id == 1) %>% collect()) == 0) {
# Here I add some data to the database
experiment_name <- "my_first_experiment"
DBI::dbExecute(
pool,
glue::glue_sql("INSERT INTO experiments (experiment_name) VALUES({experiment_name})", .con = pool)
)
items <- c(
"INSERT INTO genes_expression (experiment_id, sample_id, ensg, tpm) VALUES (1, 1, 'ENSG00000133703', 20.0)",
"INSERT INTO genes_expression (experiment_id, sample_id, ensg, tpm) VALUES (1, 2, 'ENSG00000133703', 11.0)",
"INSERT INTO genes_expression (experiment_id, sample_id, ensg, tpm) VALUES (1, 3, 'ENSG00000133703', 45.0)",
"INSERT INTO genes_expression (experiment_id, sample_id, ensg, tpm) VALUES (1, 1, 'ENSG00000268173', 137.2)",
"INSERT INTO genes_expression (experiment_id, sample_id, ensg, tpm) VALUES (1, 2, 'ENSG00000268173', 431.1)",
"INSERT INTO genes_expression (experiment_id, sample_id, ensg, tpm) VALUES (1, 3, 'ENSG00000268173', 19.9)"
)
purrr::map(items, function(x) DBI::dbExecute(pool, x))
}
#* This endpoint reads all genes_expression data associated with an experiment
#* @get /experiments/<id>/genes_expression
#* @serializer unboxedJSON
function(req, res, id) {
result <- tbl(pool, "experiments") %>% filter(experiment_id == id) %>% collect()
expressions <- tbl(pool, "genes_expression")
if(!is.null(req$args$ensg)) {
ensgs <- req$args$ensg
expressions <- expressions %>% dplyr::filter(ensg %in% ensgs)
}
expressions <- expressions %>% select(sample_id, ensg, tpm) %>% collect()
list(experiment = result, expressions = expressions)
}
#* Read experiment_name by experiment_id
#* @get /experiments/<id>
#* @serializer unboxedJSON
function(req, res, id) {
tbl(pool, "experiments") %>% filter(experiment_id == id) %>% collect() %>% as.list
}
#* The most important part of this API - it writes the data to the database
#* @post /experiments
#* @serializer unboxedJSON
function(req, res) {
# Data is sent in the request body,
# the next line excondes it to R list from raw bytes
data <- jsonlite::fromJSON(rawToChar(req$bodyRaw))
# If an experiment with the same name already exists, returns appropriate response
experiment_name_ <- data$experiment
experiment_exists <- nrow(tbl(pool, "experiments") %>% dplyr::filter(experiment_name == experiment_name_) %>% collect()) > 0
if(experiment_exists) {
res$status <- 409
return(paste(data$experiment, "already exists in the database!"))
}
# TODO: validation of the input - is everything present?
# TODO: Handle errors in the transaction.
# TODO: different way of extracting last added id
# Btw - transaction is a very important concept in DB world
# here's some starting links if you don't know what it mean
# https://www.sqlite.org/transactional.html
# https://en.wikipedia.org/wiki/Database_transaction
experiment_id <- pool::poolWithTransaction(pool, function(conn){
DBI::dbExecute(conn, glue::glue_sql("INSERT INTO experiments(experiment_name) VALUES({data$experiment})", .con = conn))
# TODO: better way to get the id of the just added experiment_name
experiment_id <-
DBI::dbGetQuery(
conn,
glue::glue_sql(
.con = conn,
"SELECT experiment_id FROM experiments WHERE experiment_name = {data$experiment}"
)
)$experiment_id
# Creating INSERT statement, note that I am using glue_sql for proper escaping
# if you don't know why you need to escape your sql queries read about SQL injection
data$expressions$experiment_id <- experiment_id
insert_sql <- paste(glue::glue_sql(
.con = conn,
"({experiment_id},{sample_id},{ensg},{tpm})",
.envir = data$expressions
), collapse = ", ")
insert_sql <- paste("INSERT INTO genes_expression(experiment_id, sample_id, ensg, tpm) VALUES", insert_sql)
DBI::dbExecute(conn, insert_sql)
experiment_id
})
list(experiment_id = experiment_id)
}
As usual - some code for running the API in background:
cat << EOT > tmp/run_api_post.R
library(plumber)
library(optparse)
library(uuid)
library(logger)
MACHINE_ID <- "MAIN_1"
PORT_NUMBER <- 8113
log_level(logger::TRACE)
plumb(file='tmp/plumber_post.R') %>%
pr_run(port = PORT_NUMBER)
EOT
Rscript tmp/run_api_post.R > tmp/POST_API_LOG.log 2>&1 &
echo $! > tmp/POST_API_PID
disown
wait_for_api("tmp/POST_API_LOG.log")
curl -X GET "http://127.0.0.1:8113/experiments/1" -H "accept: application/json" --silent
## {"experiment_id":1,"experiment_name":"my_first_experiment"}
/genes_expression
allows to filter by specific ENSG
. It can be provided using a query string:
curl -X GET "http://127.0.0.1:8113/experiments/1/genes_expression?ensg=ENSG00000133703" -H "accept: application/json" --silent
## {"experiment":[{"experiment_id":1,"experiment_name":"my_first_experiment"}],"expressions":[{"sample_id":1,"ensg":"ENSG00000133703","tpm":20},{"sample_id":2,"ensg":"ENSG00000133703","tpm":11},{"sample_id":3,"ensg":"ENSG00000133703","tpm":45}]}
For sending a vector, you can use the same argument many times as show below:
curl -X GET "http://127.0.0.1:8113/experiments/1/genes_expression?ensg=ENSG00000133703&ensg=ENSG00000268173" -H "accept: application/json" --silent
## {"experiment":[{"experiment_id":1,"experiment_name":"my_first_experiment"}],"expressions":[{"sample_id":1,"ensg":"ENSG00000133703","tpm":20},{"sample_id":2,"ensg":"ENSG00000133703","tpm":11},{"sample_id":3,"ensg":"ENSG00000133703","tpm":45},{"sample_id":1,"ensg":"ENSG00000268173","tpm":137.2},{"sample_id":2,"ensg":"ENSG00000268173","tpm":431.1},{"sample_id":3,"ensg":"ENSG00000268173","tpm":19.9}]}
Now, let’s prepare a json
to be added to the database using a POST
request.
input_to_api <- list(
experiment = "my_second_experiment",
expressions = tibble(sample_id = 1:4, ensg = "ENSG00000133703", tpm = c(52.6,14.1,146.5,2000.1))
)
jsonlite::toJSON(input_to_api, auto_unbox = TRUE, pretty = TRUE)
## {
## "experiment": "my_second_experiment",
## "expressions": [
## {
## "sample_id": 1,
## "ensg": "ENSG00000133703",
## "tpm": 52.6
## },
## {
## "sample_id": 2,
## "ensg": "ENSG00000133703",
## "tpm": 14.1
## },
## {
## "sample_id": 3,
## "ensg": "ENSG00000133703",
## "tpm": 146.5
## },
## {
## "sample_id": 4,
## "ensg": "ENSG00000133703",
## "tpm": 2000.1
## }
## ]
## }
And in the last step, the request is sent with the json
containing all the required
information.
curl -X POST "http://127.0.0.1:8113/experiments" --silent --data '
{
"experiment": "my_second_experiment",
"expressions": [
{
"sample_id": 1,
"ensg": "ENSG00000133703",
"tpm": 52.6
},
{
"sample_id": 2,
"ensg": "ENSG00000133703",
"tpm": 14.1
},
{
"sample_id": 3,
"ensg": "ENSG00000133703",
"tpm": 146.5
},
{
"sample_id": 4,
"ensg": "ENSG00000133703",
"tpm": 2000.1
}
]
}
'
## {"experiment_id":2}
And reading the output:
curl -X GET "http://127.0.0.1:8113/experiments/2/genes_expression" -H "accept: application/json" --silent
## {"experiment":[{"experiment_id":2,"experiment_name":"my_second_experiment"}],"expressions":[{"sample_id":1,"ensg":"ENSG00000133703","tpm":20},{"sample_id":2,"ensg":"ENSG00000133703","tpm":11},{"sample_id":3,"ensg":"ENSG00000133703","tpm":45},{"sample_id":1,"ensg":"ENSG00000268173","tpm":137.2},{"sample_id":2,"ensg":"ENSG00000268173","tpm":431.1},{"sample_id":3,"ensg":"ENSG00000268173","tpm":19.9},{"sample_id":1,"ensg":"ENSG00000133703","tpm":52.6},{"sample_id":2,"ensg":"ENSG00000133703","tpm":14.1},{"sample_id":3,"ensg":"ENSG00000133703","tpm":146.5},{"sample_id":4,"ensg":"ENSG00000133703","tpm":2000.1}]}
One last thing, just for a pure joy of using bash parallel
and some other console tools. The script below,
runs curl
500 times on 8 workers and logs some performace information. Then I use some simple tools
to perform basic analysis:
# https://stackoverflow.com/questions/18215389/how-do-i-measure-request-and-response-times-at-once-using-curl
seq 500 | parallel --joblog tmp/joblog.txt -n0 -j8 "curl \
-X GET 'http://127.0.0.1:8113/experiments/1/genes_expression?ensg=ENSG00000133703' \
-H 'accept: application/json' --silent \
-w '%{time_connect} %{time_starttransfer} %{time_total}\n' \
-o /dev/null" > tmp/all_times
Calculate mean connection time, time to first byte transferred and total time of for a request:
LC_NUMERIC=C # for some reason Rstudio changes the numeric locale!
awk '{N+=1; CONNECT+=$1; TTFB+=$2; TOTAL+=$3} END {printf "Connect: %.3gs TTFB: %.2gs TOTAL: %.3gs\n", CONNECT/N, TTFB/N, TOTAL/N}' tmp/all_times
## Connect: 0.00015s TTFB: 0.5s TOTAL: 0.504s
As a bonus - number of requests processed each second (to be precise I use number of requests started each second by parallel
).
As you can see, this API can handle 14-15req/sec
. Can you explain what each command in this bash pipeline does?
LC_NUMERIC=C
tail -n +2 tmp/joblog.txt | cut -f 3 | xargs printf "%.0f\n"|
sort | uniq -c | tr -s " " | cut -f2 -d " " | sort |
awk '{a[i++]=$0} END {print (i%2==0)?(a[int(i/2)-1]+a[int(i/2)])/2:a[int(i/2)]}'
## 16
Clean-up
To keep the system clean, I kill all the processes started in the background. To do that I use
the PID files created when they were started. Stopping redis
is a little more complicated, because I need
to extract its process_id
from the cli
.
# Kill R processes
for f in tmp/*_PID; do echo $f && kill -15 $(cat $f); done
# Kill redis-server
kill -15 $(redis-cli INFO | grep process_id | cut -d":" -f2 | sed -e "s/\r//g")
rm -rf tmp
## tmp/API_PID
## tmp/API_V2_PID
## tmp/POST_API_PID
Summary
The topics covered in this post should be treated as a starting point for further study. Here’s a bunch of links that you might find useful:
- https://posit.co/blog/plumber-v1-1-0/#parallel-exec -
plumber
can useasync
workers. It might be an easy way to scale-up your API, but it might be not enough for heavy-workload environment. - https://blog.bytebytego.com/p/why-is-restful-api-so-popular - some basic information about what RESTful API are popular.
- https://blog.bytebytego.com/p/ep53-design-effective-and-safe-apis - this post contains a simple iconography showing some good practices in API design.