Access n-th element after string splitting
1) data.frame. Convert the strings to a data frame; it is then easy to pick off a single column or a subset of columns:
DF <- read.table(text = string, sep = ",", as.is = TRUE)
DF[[1]]
## [1] "A" "B" "A"
DF[[3]]
## [1] "some text" "some other text" "yet another one"
DF[-1]
##   V2              V3  V4
## 1  1       some text 200
## 2  2 some other text 300
## 3  3 yet another one 100
DF[2:3]
##   V2              V3
## 1  1       some text
## 2  2 some other text
## 3  3 yet another one
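As a hedged variation on (1): with as.is = TRUE, read.table still converts numeric-looking fields such as V2 and V4 to integers, and by default it treats quote characters and # specially. The sketch below keeps every field as text and names the columns. The column names (group, id, label, value) are made up for illustration; colClasses, quote, comment.char and col.names are standard read.table arguments.

```r
# Sample input in the same shape as the strings used throughout this thread
string <- c("A,1,some text,200",
            "B,2,some other text,300",
            "A,3,yet another one,100")

DF <- read.table(
  text = string, sep = ",",
  colClasses = "character",   # keep every field as character, no type guessing
  quote = "",                 # treat quote characters as ordinary text
  comment.char = "",          # do not treat '#' as the start of a comment
  col.names = c("group", "id", "label", "value")  # hypothetical names
)

DF$label   # third field of every string
DF[["id"]] # second field, still character
```

Whether to keep fields as character or let read.table convert them depends on what the extracted column is used for afterwards.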
2) data.table::transpose. The data.table package has a function to transpose lists, so that if stringt is the transposed list then stringt[[3]] is the vector of third fields, say, in a similar way to (1). Even more compact is data.table's tstrsplit, mentioned by @Henrik below, or the same package's fread, mentioned by @akrun below.
library(data.table)
stringt <- transpose(strsplit(string, ","))
# or
stringt <- tstrsplit(string, ",")
stringt[[1]]
## [1] "A" "B" "A"
stringt[[3]]
## [1] "some text" "some other text" "yet another one"
stringt[-1]
## [[1]]
## [1] "1" "2" "3"
##
## [[2]]
## [1] "some text" "some other text" "yet another one"
##
## [[3]]
## [1] "200" "300" "100"
stringt[2:3]
## [[1]]
## [1] "1" "2" "3"
##
## [[2]]
## [1] "some text" "some other text" "yet another one"
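If only one field is needed, tstrsplit can return just that column instead of the whole transposed list. This is a minimal sketch, assuming a data.table version that provides the keep argument of tstrsplit; fixed = TRUE is passed through to strsplit so the comma is treated as a literal string rather than a regular expression.

```r
library(data.table)

string <- c("A,1,some text,200",
            "B,2,some other text,300",
            "A,3,yet another one,100")

# keep = 3L asks for the third field only; the result is a list of length 1
third <- tstrsplit(string, ",", fixed = TRUE, keep = 3L)[[1]]
third
```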
purrr also has a transpose function, but
library(purrr)
transpose(strsplit(string, ","))
produces a list of lists rather than a list of character vectors.
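If purrr is already in use, one way around the list-of-lists result is to skip transpose and extract by position instead. The sketch below relies on purrr's convention that a number passed as the function to map_chr is used as a positional extractor; it is an alternative to, not a fix for, the transpose call above.

```r
library(purrr)

string <- c("A,1,some text,200",
            "B,2,some other text,300",
            "A,3,yet another one,100")

parts <- strsplit(string, ",")

# A numeric second argument means "pluck the element at this position"
third <- map_chr(parts, 3)
third
```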
One option is to use word from stringr with its sep argument:
library(stringr)
word(string, 1, sep = ",")
#[1] "A" "B" "A"
word(string, 3, sep = ",")
#[1] "some text" "some other text" "yet another one"
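word also accepts negative positions, which count from the end, and a start/end pair. A short sketch, assuming the same sample strings as in the question:

```r
library(stringr)

string <- c("A,1,some text,200",
            "B,2,some other text,300",
            "A,3,yet another one,100")

# A negative position counts backwards, so -1 is the last field
last_field <- word(string, -1, sep = ",")

# start = 2, end = 3 returns fields 2 through 3 with the separator between them
middle <- word(string, 2, 3, sep = ",")

last_field
middle
```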
Since word performs worst of all the options in the benchmark below, here is another option using a regular expression in base R.
# [^,]*, matches a whole field plus its comma; ^ anchors the match at the start
# Get 1st element
sub("^(?:[^,]*,){0}([^,]*).*", "\\1", string)
#[1] "A" "B" "A"
# Get 3rd element
sub("^(?:[^,]*,){2}([^,]*).*", "\\1", string)
#[1] "some text" "some other text" "yet another one"
There are two groups to match here. The first matches a run of non-comma characters followed by a comma, repeated n times; the second matches the next run of non-comma characters. The first group is non-capturing ((?: )), while the second is captured and returned. Also note that the number in braces ({}) has to be one less than the position of the word we want, so {0} returns the 1st word and {2} returns the 3rd word.
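The "one less than the position" rule can be wrapped in a small helper so the pattern is built from n. This is only a sketch: nth_field is a hypothetical name, the pattern matches whole fields with [^,]*, and is anchored at the start, and sep is assumed to be a single character with no special meaning in a regular expression (a comma is fine; something like . or ^ would need escaping).

```r
# Hypothetical helper: return the n-th sep-delimited field of each string
nth_field <- function(x, n, sep = ",") {
  stopifnot(length(n) == 1, n >= 1, nchar(sep) == 1)
  # Skip n - 1 complete "field + separator" units, then capture the next field
  pattern <- sprintf("^(?:[^%s]*%s){%d}([^%s]*).*$",
                     sep, sep, as.integer(n - 1), sep)
  sub(pattern, "\\1", x, perl = TRUE)
}

string <- c("A,1,some text,200",
            "B,2,some other text,300",
            "A,3,yet another one,100")

nth_field(string, 1)
nth_field(string, 3)
nth_field("alpha,beta,gamma", 2)
```

If a string has fewer than n fields, the pattern does not match and sub returns that string unchanged, so callers may want to check the field count first.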
Benchmark
library(data.table)     # tstrsplit
library(stringr)        # str_split, word
library(microbenchmark)

string <- c("A,1,some text,200", "B,2,some other text,300", "A,3,yet another one,100")
string <- rep(string, 1e5)

microbenchmark(
  tmfmnk_sapply       = sapply(strsplit(string, ","), function(x) x[1]),
  tmfmnk_tstrsplit    = tstrsplit(string, ",")[[1]],
  avid_useR_sapply    = sapply(strsplit(string, ","), '[', 1),
  avid_useR_str_split = str_split(string, ",", simplify = TRUE)[, 1],
  Ronak_Shah_word     = word(string, 1, sep = ","),
  Ronak_Shah_sub      = sub("(?:[^,],){0}([^,]*).*", "\\1", string),
  G_Grothendieck      = {DF <- read.table(text = string, sep = ",", as.is = TRUE); DF[[1]]},
  times = 5
)
#Unit: milliseconds
# expr min lq mean median uq max neval
# tmfmnk_sapply 1629.69 1641.61 2128.14 1834.99 1893.43 3640.96 5
# tmfmnk_tstrsplit 1269.94 1283.79 1286.29 1286.68 1290.76 1300.30 5
# avid_useR_sapply 1445.40 1447.64 1555.76 1498.14 1609.52 1778.13 5
#avid_useR_str_split 324.68 332.28 332.30 333.97 334.01 336.54 5
# Ronak_Shah_word 6571.29 6810.92 6956.20 6930.86 7217.26 7250.69 5
# Ronak_Shah_sub 349.76 354.77 356.91 358.91 359.17 361.94 5
# G_Grothendieck 354.93 358.24 364.43 362.24 367.79 378.94 5
I haven't included Christoph's solution, as it is not clear to me how it would work for a variable n, for example for the 3rd position, the 4th position, etc.
We can simplify OP's code to:
sapply(strsplit(string, ","), '[', 1)
# [1] "A" "B" "A"
sapply(strsplit(string, ","), '[', 3)
# [1] "some text" "some other text" "yet another one"
Also, with stringr::str_split and simplify = TRUE, we can directly index the column, since the output would be a matrix:
library(stringr)
str_split(string, ",", simplify = TRUE)[,1]
# [1] "A" "B" "A"
str_split(string, ",", simplify = TRUE)[,3]
# [1] "some text" "some other text" "yet another one"
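One thing worth checking with the split-then-index approaches is what happens when a string has fewer than n fields. A minimal base R sketch, using a made-up input (short) in which the second string has only two fields; vapply is swapped in for sapply here so the result is guaranteed to be a character vector:

```r
# Hypothetical input: the second string has no third field
short <- c("A,1,some text,200", "B,2")

third <- vapply(
  strsplit(short, ",", fixed = TRUE),
  function(parts) parts[3],   # indexing past the end yields NA
  character(1)
)
third
# [1] "some text" NA
```

Indexing past the end of a character vector gives NA, so missing fields show up as NA rather than as an error.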