How to split all strings in a column AND include prefix in all the new data
A tidyr
option using separate
library(dplyr)
library(tidyr)
df %>% separate(code, paste0("code_", 1:3), sep = "/", fill = "right", remove = F)
# code code_1 code_2 code_3
#1 PC211/212.5(C)/664F PC211 212.5(C) 664F
#2 VC23152(A)/23550F VC23152(A) 23550F <NA>
#3 PC459/460(B)M PC459 460(B)M <NA>
#4 PC187(A)/664F PC187(A) 664F <NA>
Note that your expected output doesn't seem to match your input data. For example, for row 1 your expected output for code_3
gives "PC664F"
, whereas the relevant input string is "664F"
. code_2
for the same row has "212.5(C)F"
whereas the input string is "212.5(C)"
. I assume these are errors.
Update
Thanks to @andrew_reece's comment I (think I) now understand your question. Here is an option
df %>%
rowid_to_column("row") %>%
separate(code, c("prefix", "main", "suffix"), sep = c(2, -1), remove = F) %>%
separate(main, into = paste0("code_", 1:3), sep = "/", fill = "right") %>%
gather(key, entry, starts_with("code_")) %>%
filter(!is.na(entry)) %>%
unite(entry, prefix, entry, suffix, sep = "") %>%
spread(key, entry) %>%
select(-row)
Explanation: We first separate
the prefixes and suffixes from code
, then separate
the individual components from the main code
part. We reshape from wide to long, remove NA
entries, and concatenate each code
component with the prefix
and suffix
before reshaping back from long to wide.
This reproduces your expected output except for code_2
in row 1.
An alternative
As an alternative approach, it might be more useful to store the pre- & suffixed codes in a list
column, rather than storing them in a wide format with additional columns code_1
, code_2
and so on. This has the advantage that you wouldn't have to hard-code the number of codes you have in column code
; the following approach will work for any number of codes in code
and only assumes that
- the first 2 characters of
code
define theprefix
- the last character of
code
is thesuffix
.
df %>%
separate(code, c("prefix", "main", "suffix"), sep = c(2, -1), remove = F) %>%
transmute(
code,
codes_as_list = pmap(
list(prefix, str_split(main, "/"), suffix),
function(x, y, z) paste0(x, y, z)))
# code codes_as_list
#1 PC211/212.5(C)/664F PC211F, PC212.5(C)F, PC664F
#2 VC23152(A)/23550F VC23152(A)F, VC23550F
#3 PC459/460(B)M PC459M, PC460(B)M
#4 PC187(A)/664F PC187(A)F, PC664F
Note that codes_as_list
is now a list
column with the properly pre/suffixed codes, making it easy to operate on the elements with the purrr::map
machinery.
IIUC, this will give you the prefix and suffix on each separated column:
library(tidyverse)
df %>%
mutate(prefix = str_extract(code, "^[A-Z]+"),
suffix = str_extract(code, "[A-Z]$")) %>%
separate(code, into = c("code_1", "code_2", "code_3"),
sep = "/", fill = "right", remove = F) %>%
mutate_at(vars(matches("_1$")),
list(~paste0(., suffix))) %>%
mutate_at(vars(matches("_2$")),
list(~if_else(str_sub(., -1) == suffix,
paste0(prefix, .),
paste0(paste0(prefix, .), suffix)))) %>%
mutate_at(vars(matches("_3$")),
list(~if_else(is.na(.),
NA_character_,
paste0(prefix, .)))) %>%
select(-prefix, -suffix)
code code_1 code_2 code_3
1 PC211/212.5(C)/664F PC211F PC212.5(C)F PC664F
2 VC23152(A)/23550F VC23152(A)F VC23550F <NA>
3 PC459/460(B)M PC459M PC460(B)M <NA>
4 PC187(A)/664F PC187(A)F PC664F <NA>
Here is another option with separate
and str_extract_all
. We create a pattern ('pat') that use a regex lookaround to match the position between a /
followed by number ([0-9]
) and a second pattern to match character position before /
. Using str_replace_all
, insert the position matched by 'pat' with the first two characters (substr
) of the string, as well as insert the position before /
with the last character of string, then use separate
to split the column into three at the delimiter /
library(tidyverse)
#pat <- "(?<=\\/)(?=[0-9]+\\(?[A-Z])"
pat <- "(?<=\\/)(?=[0-9])"
pat2 <- "(?=\\/)"
df %>%
mutate(code1 = str_replace_all(code, pat, substr(code, 1, 2)) %>%
str_replace_all(pat2, substring(code, nchar(code))))%>%
separate(code1, into = paste0("code_", 1:3), sep="[/]")
# code code_1 code_2 code_3
#1 PC211/212.5(C)/664F PC211F PC212.5(C)F PC664F
#2 VC23152(A)/23550F VC23152(A)F VC23550F <NA>
#3 PC459/460(B)M PC459M PC460(B)M <NA>
#4 PC187(A)/664F PC187(A)F PC664F <NA>