Count values separated by a comma in a character string

These two approaches are each short, work on vectors of strings, do not involve the expense of explicitly constructing the split string, and do not use any packages. Here d is a vector of strings such as d <- c("1,2,3", "5,2"):

1) count.fields

count.fields(textConnection(d), sep = ",")

2) gregexpr

lengths(gregexpr(",", d)) + 1
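One edge case worth checking: when a string contains no comma at all, gregexpr returns -1 (a vector of length 1), so the expression above reports 2 values instead of 1. A minimal sketch of a workaround, assuming such single-value strings can occur in your data, is to count the actual matches with regmatches. The input vector here is hypothetical.

```r
# Hypothetical input; "7" contains no comma
d <- c("1,2,3", "5,2", "7")

# gregexpr() returns -1 (length 1) for "7", so the last count is off by one
naive <- lengths(gregexpr(",", d)) + 1

# regmatches() returns character(0) when there is no match,
# so the no-comma string is counted as a single value
safe <- lengths(regmatches(d, gregexpr(",", d))) + 1

print(naive)
print(safe)
```

count.fields does not have this problem for non-empty strings, since it counts fields rather than separators.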

Here is one possibility, which splits the string into its numeric values:

> as.numeric(unlist(strsplit("30,3", ",")))
# 30  3
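If the goal is the count rather than the values themselves, the same split can be counted directly. This is only a sketch on a hypothetical vector d; unlike the approaches that avoid building the split strings, it does construct every piece.

```r
# Hypothetical input vector
d <- c("1,2,3", "5,2")

# Split each string on literal commas (fixed = TRUE skips regex matching)
parts <- strsplit(d, ",", fixed = TRUE)

# Number of values per string
counts <- lengths(parts)

# The values themselves, one numeric vector per input string
values <- lapply(parts, as.numeric)

print(counts)
```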

You could also try the stri_count_* functions from the stringi package (they should be very efficient)

library(stringi)
stri_count_regex(d, "\\d+")
## [1] 2
stri_count_fixed(d, ",") + 1
## [1] 2

The stringr package has similar functionality

library(stringr)
str_count(d, "\\d+")
## [1] 2

Update:

If you want to subset your data set to the rows that contain exactly 2 values, you could try

df[stri_count_regex(df$d, "\\d+") == 2,, drop = FALSE]
#      d
# 2 30,5

Or simpler

subset(df, stri_count_regex(d, "\\d+") == 2)
#      d
# 2 30,5

Update #2

Here's a benchmark that illustrates why one should consider using external packages (@rengis's answer wasn't included because it doesn't answer the question)

library(microbenchmark)
library(stringi)
d <- rep("30,3", 1e4)

microbenchmark( akrun = nchar(gsub('[^,]', '', d))+1,
                GG1 = count.fields(textConnection(d), sep = ","),
                GG2 = sapply(gregexpr(",", d), length) + 1,
                DA1 = stri_count_regex(d, "\\d+"),
                DA2 = stri_count_fixed(d, ",") + 1)

# Unit: microseconds
#  expr       min         lq       mean     median        uq       max neval
# akrun  8817.950  9479.9485 11489.7282 10642.4895 12480.845  46538.39   100
#   GG1 55451.474 61906.2460 72324.0820 68783.9935 78980.216 150673.72   100
#   GG2 33026.455 43349.5900 60960.8762 51825.6845 72293.923 203126.27   100
#   DA1  4730.302  5120.5145  6206.8297  5550.7930  7179.536  10507.09   100
#   DA2   380.147   418.2395   534.6911   448.2405   597.259   2278.11   100
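Before comparing timings, it can help to confirm that the benchmarked expressions return the same counts. The sketch below does this on the same input as the benchmark; the list name res is only for illustration, and the stringi variants are included only when that package is installed.

```r
d <- rep("30,3", 1e4)

# Open the text connection explicitly so it can be closed afterwards
con <- textConnection(d)

res <- list(
  akrun = nchar(gsub("[^,]", "", d)) + 1,
  GG1   = count.fields(con, sep = ","),
  GG2   = sapply(gregexpr(",", d), length) + 1
)
close(con)

# Add the stringi variants only if the package is available
if (requireNamespace("stringi", quietly = TRUE)) {
  res$DA1 <- stringi::stri_count_regex(d, "\\d+")
  res$DA2 <- stringi::stri_count_fixed(d, ",") + 1
}

# TRUE for each method whose result is 2 for every element
agree <- vapply(res, function(x) length(x) == length(d) && all(x == 2), logical(1))
print(agree)
```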

You could use scan.

 v1 <- scan(text=d, sep=',', what=numeric(), quiet=TRUE)
 v1
 #[1] 30  3

Alternatively, use stri_split from stringi. It should accept both character and factor input without an explicit as.character conversion

library(stringi)
v2 <- as.numeric(unlist(stri_split(d,fixed=',')))
v2
#[1] 30  3

You can do the count in base R with either of the following

length(v1)
#[1] 2

nchar(gsub('[^,]', '', d))+1
#[1] 2
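The nchar/gsub expression can be unpacked into its two steps to see why it works. This is a small sketch on a hypothetical vector; the intermediate name commas_only is just for illustration.

```r
# Hypothetical input vector, including a string with no comma
d <- c("30,3,5", "30,5", "7")

# Step 1: delete every character that is not a comma
commas_only <- gsub("[^,]", "", d)

# Step 2: the number of values is the number of commas plus one
n_values <- nchar(commas_only) + 1

print(commas_only)
print(n_values)
```

Because a string with no comma is reduced to an empty string of length 0, it is correctly counted as one value.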

The regex [^,] matches any single character that is not a comma, so the gsub call deletes everything except the commas.

Update

If d is a column in a data set df and you want to subset the rows where the number of values equals 2

  d<-c("30,3,5","30,5") 
  df <- data.frame(d,stringsAsFactors=FALSE)
  df[nchar(gsub('[^,]', '',df$d))+1==2,,drop=FALSE]
  #    d
  #2 30,5

Just to test

  df[nchar(gsub('[^,]', '',df$d))+1==10,,drop=FALSE]
  #[1] d
  #<0 rows> (or 0-length row.names)
