`data.table` global search - filter rows given pattern match in `any` column

One way is to loop over the columns and apply your regex to each one, which returns a logical data.table. You can then use rowSums to pick out the rows with at least one match.

library(data.table)
dt <- data.table(a=c("Aa1","bb","1c"),b=c("A1","a1","1C"), c=letters[1:3])
# "a1" is the pattern to search for
ldt <- dt[, lapply(.SD, function(x) grepl("a1", x, perl=TRUE))] 
dt[rowSums(ldt)>0]
#      a  b c
# 1: Aa1 A1 a
# 2:  bb a1 b
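
If you'd rather skip the intermediate logical table, the per-column results can be OR-ed together directly. This is only a sketch of the same idea, not a claim that it is faster; `hits` and `res` are names I made up for illustration.

```r
library(data.table)

dt <- data.table(a = c("Aa1", "bb", "1c"),
                 b = c("A1", "a1", "1C"),
                 c = letters[1:3])

# One logical vector per column (a data.table is a list of columns),
# then combine them element-wise with "|".
hits <- Reduce("|", lapply(dt, function(col) grepl("a1", col, perl = TRUE)))

# Keep the rows where at least one column matched.
res <- dt[hits]
```

Here `hits` is TRUE for the first two rows, so `res` contains the same rows as the rowSums version above.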

Solution 3:

First build a single logical expression that ORs together one grepl call per column. Then evaluate the whole expression in one go:

dt <- data.table(a=c("a1","bb","1c"),b=c("A1","BB","1C"))

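search.data.table <- function(x, pattern) {
  nms <- names(x)
  # Build "grepl('pattern', a, ...) | grepl('pattern', b, ...) | ..." as text
  string <- paste0("grepl('", pattern, "', ", nms,
                   ", ignore.case=TRUE, perl=FALSE)",
                   collapse = " | ")
  # Parse the text into a call and evaluate it once inside the data.table
  x[eval(parse(text=string)[[1]])]
}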

search.data.table(dt, "a1")
#     a  b
# 1: a1 A1
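
One weak spot of pasting text together is that a pattern containing a single quote breaks the generated code. As a rough sketch (my own variation, not part of the original answer), the same OR-ed expression can be assembled from language objects with bquote, so the pattern is never re-parsed. `search_dt_calls` is a hypothetical name.

```r
library(data.table)

# Hypothetical helper: same idea as the text-based version, but the
# expression is built from calls instead of pasted strings.
search_dt_calls <- function(x, pattern) {
  # One unevaluated grepl() call per column, with the pattern inlined
  calls <- lapply(names(x), function(n) {
    bquote(grepl(.(pattern), .(as.name(n)), ignore.case = TRUE))
  })
  # Chain the calls with `|` into a single expression
  expr <- Reduce(function(lhs, rhs) call("|", lhs, rhs), calls)
  # Evaluate the combined expression once inside the data.table
  x[eval(expr)]
}

dt <- data.table(a = c("a1", "bb", "1c"), b = c("A1", "BB", "1C"))
res <- search_dt_calls(dt, "a1")

# A pattern with a quote in it, which the pasted-text approach cannot handle
quoted <- search_dt_calls(data.table(a = c("it's", "no")), "'s")
```

I haven't benchmarked this variant, so treat it as a robustness tweak rather than a speed-up.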

Benchmarking

# functions

Raffael <- function(x, pattern) {
# unfortunately this implementation throws an error so I can't run the benchmark test. 
# Any help?
  combined <- apply(x,1,function(r) paste(r,collapse="/%/"))
  grepped <- grepl(pattern,apply(x,1,function(r) paste(r,collapse="/")))
  x[grepped,]
}

Arun <- function(x, pattern) {
  ldt <- x[, lapply(.SD, function(x) grepl(pattern, x, perl=TRUE, ignore.case=TRUE))] 
  x[rowSums(ldt)>0]
}

DanielKrizian <- function(x, pattern) {
  nms <- names(x)
  string <- paste0("grepl('", pattern, "', ", nms, ", ignore.case=TRUE, perl=FALSE)", collapse = " | ")
  x[eval(as.call(parse(text=string))[[1]])]
}

# generate 1000 x 1000 benchmark data.table

require(data.table)
require(rbenchmark) # provides benchmark()
expr <- quote(paste0(sample(c(LETTERS,tolower(LETTERS),0:9),12, replace=TRUE)
                 ,collapse=""))
set.seed(1)
BIGISH <- data.table(matrix(replicate(1000*1000,eval(expr)),nrow = 1000))
object.size(BIGISH) # 68520912 bytes

# test

benchmark(
  DK <- DanielKrizian(BIGISH,"qx"),
  A <- Arun(BIGISH,"qx"),
  replications=100)

Results

                               test replications elapsed relative user.self sys.self user.child sys.child
2           A <- Arun(BIGISH, "qx")          100   57.72    1.000     51.95     0.44         NA        NA
1 DK <- DanielKrizian(BIGISH, "qx")          100   59.28    1.027     53.72     0.50         NA        NA

identical(DK,A)
[1] TRUE

I wouldn't bet that this is the best way to do it, but it serves the purpose:

> dt <- data.table(a=c("a1","bb","1c"),b=c("A1","BB","1C"))
> dt
    a  b
1: a1 A1
2: bb BB
3: 1c 1C

> combined <- apply(dt,1,function(r) paste(r,collapse="/%/"))
> combined
[1] "a1/%/A1" "bb/%/BB" "1c/%/1C"

> grepped <- grepl("[a-z][0-9]",combined)
> grepped
[1]  TRUE FALSE FALSE

> dt[grepped,]
    a  b
1: a1 A1

The separator ("/%/" here) has to be a string that cannot occur in the data or match any part of the pattern, so that it reliably separates the columns and a match can never span two of them.
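
To see why the separator matters, here is a tiny base R illustration with made-up values (`a`, `b`, and the pattern "12" are just for the demo):

```r
a <- c("x1", "bb")
b <- c("2y", "cc")

# No separator: "x1" and "2y" fuse into "x12y", so "12" matches
# even though neither column contains it.
no_sep <- grepl("12", paste(a, b, sep = ""))

# With a separator the columns stay apart and the false positive goes away.
with_sep <- grepl("12", paste(a, b, sep = "/%/"))
```

`no_sep` is TRUE for the first row, while `with_sep` is FALSE for both rows.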

The steps can be combined into a single expression of course.
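
For example, something along these lines (a sketch using the same toy data; `res_apply` and `res_paste` are names I picked). The second form swaps the row-wise apply for a vectorised paste over the columns, which is my own substitution and not benchmarked here.

```r
library(data.table)

dt <- data.table(a = c("a1", "bb", "1c"), b = c("A1", "BB", "1C"))

# Paste each row together with the separator and filter in one expression
res_apply <- dt[grepl("[a-z][0-9]", apply(dt, 1, paste, collapse = "/%/"))]

# Alternative: paste the columns together in a single vectorised call
res_paste <- dt[grepl("[a-z][0-9]", do.call(paste, c(dt, sep = "/%/")))]
```

Both keep only the first row (a1, A1).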
