While going through my notes, I found an entry about the xray package. It is pretty simple and exports only three functions, but all of them are quite useful.

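Search for the most common problems.

The first function, anomalies, reports statistics on the most basic data problems: missing values (NA), zeros, blank strings, infinite values, and the number of distinct values in each variable.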

I usually test the data for the presence of NAs with some simple code like sapply(data, function(x) mean(is.na(x))), but I’ll happily switch to xray, because a single line performs more tests and the output is far more nicely formatted.

See the example below:

# install.packages("xray")
library(xray)

anom <- anomalies(iris)
anom
## $variables
##       Variable   q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct
## 1      Species 150   0   -     0     -      0      -    0    -         3
## 2  Petal.Width 150   0   -     0     -      0      -    0    -        22
## 3  Sepal.Width 150   0   -     0     -      0      -    0    -        23
## 4 Sepal.Length 150   0   -     0     -      0      -    0    -        35
## 5 Petal.Length 150   0   -     0     -      0      -    0    -        43
##      type anomalous_percent
## 1  Factor                 -
## 2 Numeric                 -
## 3 Numeric                 -
## 4 Numeric                 -
## 5 Numeric                 -
## 
## $problem_variables
##  [1] Variable          q                 qNA              
##  [4] pNA               qZero             pZero            
##  [7] qBlank            pBlank            qInf             
## [10] pInf              qDistinct         type             
## [13] anomalous_percent problems         
## <0 rows> (or 0-length row.names)
iris2 <- iris

iris2$Petal.Length[sample.int(150, 80)] <- 0
iris2$Sepal.Width[sample.int(150, 80)] <- ifelse(rbinom(80, size = 1, prob = 0.3) == 0, Inf, -Inf)
iris2$SpeciesChar <- ifelse(rbinom(150, size = 1, prob = 0.9) == 0, as.character(iris2$Species), "")
iris2$Species[sample.int(150, 20)] <- NA

anom2 <- xray::anomalies(iris2)
## Warning in xray::anomalies(iris2): Found 1 possible problematic variables: 
## SpeciesChar
anom2
## $variables
##       Variable   q qNA    pNA qZero  pZero qBlank pBlank qInf   pInf
## 1  SpeciesChar 150   0      -     0      -    135    90%    0      -
## 2  Sepal.Width 150   0      -     0      -      0      -   80 53.33%
## 3 Petal.Length 150   0      -    80 53.33%      0      -    0      -
## 4      Species 150  20 13.33%     0      -      0      -    0      -
## 5  Petal.Width 150   0      -     0      -      0      -    0      -
## 6 Sepal.Length 150   0      -     0      -      0      -    0      -
##   qDistinct      type anomalous_percent
## 1         4 Character               90%
## 2        20   Numeric            53.33%
## 3        34   Numeric            53.33%
## 4         4    Factor            13.33%
## 5        22   Numeric                 -
## 6        35   Numeric                 -
## 
## $problem_variables
##      Variable   q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct
## 1 SpeciesChar 150   0   -     0     -    135    90%    0    -         4
##        type anomalous_percent                              problems
## 1 Character               90% Anomalies present in 90% of the rows.
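
For comparison, the manual NA check mentioned earlier can be extended in base R to cover zeros, blanks and infinite values too. This is only a rough sketch of the idea and not how xray computes its table: check_column() and basic_checks() are hypothetical helper names, the demo data frame is made up, and a "blank" is assumed here to mean an empty string in a character column.

```r
# Base-R sketch: share of NA, zero, blank and infinite values per column.
# check_column() and basic_checks() are hypothetical helpers, not part of xray.
check_column <- function(x) {
  n <- length(x)
  c(
    pNA    = sum(is.na(x)) / n,
    # zeros and infinite values only make sense for numeric columns
    pZero  = if (is.numeric(x)) sum(x == 0, na.rm = TRUE) / n else 0,
    # assumption: "blank" means an empty string in a character column
    pBlank = if (is.character(x)) sum(x == "", na.rm = TRUE) / n else 0,
    pInf   = if (is.numeric(x)) sum(is.infinite(x)) / n else 0
  )
}

basic_checks <- function(data) {
  # sapply() returns one column per variable; transpose so each row is a variable
  round(t(sapply(data, check_column)), 4)
}

# Small made-up data frame with one problem of each kind
demo <- data.frame(
  a = c(1, 0, NA, Inf),
  b = c("x", "", "y", NA),
  stringsAsFactors = FALSE
)
checks <- basic_checks(demo)
print(checks)
```

Even this short version is several times longer than the single call to anomalies, and it still lacks the warning about problematic variables.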

The second function, distributions, plots the distribution of each variable and returns a summary table of percentiles. This is also a very common step, so I’m glad the package automates it, and the output is again nicely formatted.

Here is an example:

distributions(iris2) # as simple as this
## ===========================================================================
## Warning: Removed 80 rows containing non-finite values (stat_bin).

##       Variable  p_1 p_10 p_25 p_50 p_75 p_90  p_99
## 1  Sepal.Width -Inf -Inf  2.8  3.4  Inf  Inf   Inf
## 2 Petal.Length    0    0    0    0    4 5.41 6.651
## 3  Petal.Width  0.1  0.2  0.3  1.3  1.8  2.2   2.5
## 4 Sepal.Length  4.4  4.8  5.1  5.8  6.4  6.9   7.7
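
The percentile table above can be approximated in base R with quantile(). The sketch below is an illustration only: percentile_table() is a hypothetical helper, not an xray function, and it assumes the default quantile type in R, which may differ in detail from what the package uses. It is run on the unmodified iris data.

```r
# Base-R sketch of a percentile summary for the numeric columns.
# percentile_table() is a hypothetical helper, not part of xray.
percentile_table <- function(data, pct = c(1, 10, 25, 50, 75, 90, 99)) {
  # keep only the numeric columns
  numeric_cols <- data[sapply(data, is.numeric)]
  # quantile() returns one value per probability; transpose so rows are variables
  out <- t(sapply(numeric_cols, quantile,
                  probs = pct / 100, na.rm = TRUE, names = FALSE))
  colnames(out) <- paste0("p_", pct)
  out
}

pt <- percentile_table(iris)
print(pt)
```

This gives only the table; the histograms that distributions draws for each variable would need separate plotting code.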

The last function, timebased, is used to check how the data changes over time. Example usage can be found in the package’s repository at https://github.com/sicarul/xray/.
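
To give a feeling for what checking data over time means, here is a small base-R sketch. It does not use or reproduce the package’s function: the daily data frame, its column names and the simulated gap in March are all made up for illustration.

```r
# Base-R sketch: share of missing values per month, to spot problems over time.
# The data and column names are invented for this illustration.
set.seed(1)
daily <- data.frame(
  date  = seq(as.Date("2018-01-01"), by = "day", length.out = 90),
  value = rnorm(90)
)

# Simulate a data problem: values stop arriving from March onwards
daily$value[daily$date >= as.Date("2018-03-01")] <- NA

# Group by calendar month and compute the share of NAs in each month
daily$month <- format(daily$date, "%Y-%m")
share_na <- tapply(is.na(daily$value), daily$month, mean)
print(share_na)
```

A jump in the share of NAs from one period to the next, like the one simulated for March, is easy to miss when the column is summarised as a whole.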

Summary.

This package inspired me to look further for other tools to validate and prepare data for the actual analysis, and I found a real gem: the vtreat package. It seems to be a real Swiss Army knife for preparing data for predictive modeling. But that’s a topic for another post ;)