| efficient-programming {collapse} | R Documentation |
A small set of functions to address some common inefficiencies in R, such as the creation of logical vectors to compare quantities, unnecessary copies of objects in elementary mathematical or subsetting operations, obtaining information about objects (esp. data frames), or dealing with missing values.
anyv(x, value) # Faster than any(x == value)
allv(x, value) # Faster than all(x == value)
allNA(x) # Faster than all(is.na(x))
whichv(x, value, # Faster than which(x == value)
invert = FALSE) # or which(x != value). See also Note (3)
whichNA(x, invert = FALSE) # Faster than which((!)is.na(x))
x %==% value # Infix for whichv(x, value, FALSE), use e.g. in fsubset
x %!=% value # Infix for whichv(x, value, TRUE). See also Note (3)
alloc(value, n) # Faster than rep_len(value, n)
copyv(X, v, R, ..., invert # Faster than replace(x, x == v, r) or replace(x, v, r[v])
= FALSE, vind1 = FALSE) # or replace(x, x != v, r) or replace(x, !v, r[!v])
setv(X, v, R, ..., invert # Same for x[x (!/=)= v] <- r or x[(!)v] <- r[(!)v]
= FALSE, vind1 = FALSE) # modifies x by reference, fastest
setop(X, op, V, ..., # Faster than X <- X +-*/ V (modifies by reference)
rowwise = FALSE) # optionally can also add V to rows of a matrix or list
X %+=% V # Infix for setop(X, "+", V). See also Note (2)
X %-=% V # Infix for setop(X, "-", V). See also Note (2)
X %*=% V # Infix for setop(X, "*", V). See also Note (2)
X %/=% V # Infix for setop(X, "/", V). See also Note (2)
na_rm(x) # Fast: if(anyNA(x)) x[!is.na(x)] else x,
# also removes NULL / empty elements from list
na_omit(X, cols = NULL, # Faster na.omit for matrices and data frames,
na.attr = FALSE, ...) # can use selected columns and attach indices
na_insert(X, prop = 0.1, # Insert missing values at random
value = NA)
missing_cases(X, # The opposite of complete.cases(), faster for
cols = NULL) # data frames
vlengths(X, use.names = TRUE) # Faster version of lengths() (in C, no method dispatch)
vtypes(X, use.names = TRUE) # Get data storage types (faster vapply(X, typeof, ...))
vgcd(x) # Greatest common divisor of positive integers or doubles
frange(x, na.rm = TRUE) # Much faster base::range, for integer and double objects
fnlevels(x) # Faster version of nlevels(x) (for factors)
fnrow(X) # Faster nrow for data frames (not faster for matrices)
fncol(X) # Faster ncol for data frames (not faster for matrices)
fdim(X) # Faster dim for data frames (not faster for matrices)
seq_row(X) # Fast integer sequences along rows of X
seq_col(X) # Fast integer sequences along columns of X
cinv(x) # Choleski (fast) inverse of symmetric PD matrix, e.g. X'X
X, V, R |
a vector, matrix or data frame.
x, v |
a (atomic) vector or matrix (
value |
a single value of any (atomic) vector type. For
invert |
logical.
vind1 |
logical. If
op |
an integer or character string indicating the operation to perform.
rowwise |
logical.
cols |
select columns to check for missing values using column names, indices, a logical vector or a function (e.g.
n |
integer. The length of the vector to allocate with
na.attr |
logical.
prop |
double. Specify the proportion of observations randomly replaced with
use.names |
logical. Preserve names if
na.rm |
logical.
... |
for
copyv and setv are designed to optimize operations that replace a single value in an object, e.g. X[X == value] <- r or X[X == value] <- R[R == value], or that simply copy parts of an existing object into another object, e.g. X[v] <- R[v]. They thus only cover cases where base R is inefficient, either because it creates a logical vector or because it materializes a subset to perform the replacement. No alternative is provided for cases where base R is efficient, e.g. x[v] <- r, or for cases covered by set and copy from the data.table package. Both functions work equivalently, except that copyv creates a deep copy of the data before making the replacements and returns the copy, whereas setv modifies the data directly without creating a copy and returns the modified object invisibly. setv is therefore considerably more efficient.
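As a minimal sketch of the difference between the two functions, assuming the collapse package is attached and using an illustrative vector x that is not part of the original examples:

```r
library(collapse)

x <- c(1, 0, 2, 0, 3)    # illustrative data (an assumption, not from this page)

# copyv works on a deep copy and returns it; x itself is not changed by this call
y <- copyv(x, 0, -1)     # intended to match replace(x, x == 0, -1)

# setv modifies x in place and returns the modified object invisibly
setv(x, 0, -1)           # intended to match x[x == 0] <- -1
```

Since setv changes the object directly rather than a copy, any other variable bound to the same object may see the change as well, so it is probably safest to use it on objects that are not shared.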
copyv and setv perform different tasks, depending on the input. If v is a scalar, the elements of X are compared to v, and the matching ones (or the non-matching ones if invert = TRUE) are replaced with R, where R can be either a scalar or an object of the same dimensions as X. If X is a data frame, R can also be a column-vector matching fnrow(X). Alternatively, v can be a logical or integer vector of indices with length(v) > 1L, indicating the elements of a vector / matrix (or the rows if X is a data frame) to replace with the corresponding elements from R. In this case R has to be of the same dimensions as X, but can also be a column-vector if X is a data frame. Setting vind1 = TRUE ensures that v is always interpreted as an index, even if length(v) == 1L.
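The two interpretations of v can be sketched as follows. The vectors x and r are illustrative assumptions, and the base R expressions in the comments are the equivalents listed in the usage section:

```r
library(collapse)

x <- c(10, 20, 30, 40)   # illustrative data
r <- c(1, 2, 3, 4)       # replacement values, same length as x

# (a) Scalar v: elements of x are compared to v
a <- copyv(x, 20, 0)                    # like replace(x, x == 20, 0)
b <- copyv(x, 20, 0, invert = TRUE)     # like replace(x, x != 20, 0)

# (b) v as an index vector with length(v) > 1
v <- c(2L, 4L)
by_index <- copyv(x, v, r)              # like replace(x, v, r[v])

# (c) A length-one index needs vind1 = TRUE, else 2L is treated as a value
one_index <- copyv(x, 2L, r, vind1 = TRUE)   # like replace(x, 2L, r[2L])
```

copyv is used throughout so that x stays intact between the calls; setv accepts the same arguments but would change x in place.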
(1) None of these functions currently support complex vectors.
(2) setop and the operators %+=%, %-=%, %*=% and %/=% also work with integer data, but do not perform any integer-related checks. R's integers are bounded between ±2,147,483,647, and NA_integer_ is stored as the value -2,147,483,648. Thus computations resulting in values beyond ±2,147,483,647 will cause integer overflow, and NA_integer_ should not occur on either side of a setop call. These are programmers' functions, meant to provide the most efficient math possible to responsible users.
(3) It is possible to compare factors by their levels (e.g. iris$Species %==% "setosa") or using integers (iris$Species %==% 1L). The latter is slightly more efficient. Nothing special is implemented for other objects apart from basic types, e.g. for dates (which are stored as doubles) you need to generate a date object, i.e. wlddev$date %==% as.Date("2019-01-01"). Using wlddev$date %==% "2019-01-01" will give integer(0).
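Notes (2) and (3) can be illustrated with a short sketch. The objects i, xd and d are illustrative assumptions; converting to double before doing math by reference is one possible way to stay clear of the integer range, not a recommendation made by this page:

```r
library(collapse)

## Note (2): integer overflow
i <- c(.Machine$integer.max, 1L)       # first element sits at the integer upper bound
base_res <- suppressWarnings(i + 1L)   # base R checks for overflow and returns NA here
xd <- as.double(i)                     # a double copy is not limited to the integer range
xd %+=% 1                              # in-place addition on the double copy

## Note (3): comparing factors and dates
by_level <- iris$Species %==% "setosa"        # compare by factor level
by_code  <- iris$Species %==% 1L              # compare by underlying integer code
d <- as.Date(c("2019-01-01", "2020-01-01"))   # illustrative dates, stored as doubles
hit <- d %==% as.Date("2019-01-01")           # compare against a Date object, not a string
```

Both factor comparisons should select the same rows of iris, and the date comparison uses a proper Date object as the note requires.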
Data Transformations, Small (Helper) Functions, Collapse Overview
## Which value
whichNA(wlddev$PCGDP) # Same as which(is.na(wlddev$PCGDP))
whichNA(wlddev$PCGDP, invert = TRUE) # Same as which(!is.na(wlddev$PCGDP))
whichv(wlddev$country, "Chad") # Same as which(wlddev$country == "Chad")
wlddev$country %==% "Chad" # Same thing
whichv(wlddev$country, "Chad", TRUE) # Same as which(wlddev$country != "Chad")
wlddev$country %!=% "Chad" # Same thing
lvec <- wlddev$country == "Chad" # If we already have a logical vector...
whichv(lvec, FALSE) # is faster than which(!lvec)
rm(lvec)
# Using the %==% operator can yield tangible performance gains
fsubset(wlddev, iso3c %==% "DEU") # 3x faster than:
fsubset(wlddev, iso3c == "DEU")
## Math by reference: permissible types of operations
x <- alloc(1.0, 1e5) # Vector
x %+=% 1
x %+=% 1:1e5
xm <- matrix(alloc(1.0, 1e5), ncol = 100) # Matrix
xm %+=% 1
xm %+=% 1:1e3
setop(xm, "+", 1:100, rowwise = TRUE)
xm %+=% xm
xm %+=% 1:1e5
xd <- qDF(replicate(100, alloc(1.0, 1e3), simplify = FALSE)) # Data Frame
xd %+=% 1
xd %+=% 1:1e3
setop(xd, "+", 1:100, rowwise = TRUE)
xd %+=% xd
rm(x, xm, xd)
## Missing values
mtc_na <- na_insert(mtcars, 0.15) # Set 15% of values missing at random
fnobs(mtc_na) # See observation count
na_omit(mtc_na) # 12x faster than na.omit(mtc_na)
na_omit(mtc_na, na.attr = TRUE) # Adds attribute with removed cases, like na.omit
na_omit(mtc_na, cols = c("vs","am")) # Removes only cases missing vs or am
na_omit(qM(mtc_na)) # Also works for matrices
na_omit(mtc_na$vs, na.attr = TRUE) # Also works with vectors
na_rm(mtc_na$vs) # For vectors na_rm is faster ...
rm(mtc_na)