How to split all strings in a column AND include prefix in all the new data

A tidyr option using separate

library(dplyr)
library(tidyr)
df %>% separate(code, paste0("code_", 1:3), sep = "/", fill = "right", remove = F)
#                 code     code_1   code_2 code_3
#1 PC211/212.5(C)/664F      PC211 212.5(C)   664F
#2   VC23152(A)/23550F VC23152(A)   23550F   <NA>
#3       PC459/460(B)M      PC459  460(B)M   <NA>
#4       PC187(A)/664F   PC187(A)     664F   <NA>

Note that your expected output doesn't seem to match your input data. For example, for row 1 your expected output for code_3 gives "PC664F", whereas the relevant input string is "664F". code_2 for the same row has "212.5(C)F" whereas the input string is "212.5(C)". I assume these are errors.

Update

Thanks to @andrew_reece's comment I (think I) now understand your question. Here is an option

df %>%
    rowid_to_column("row") %>%
    separate(code, c("prefix", "main", "suffix"), sep = c(2, -1), remove = F) %>%
    separate(main, into = paste0("code_", 1:3), sep = "/", fill = "right") %>%
    gather(key, entry, starts_with("code_")) %>%
    filter(!is.na(entry)) %>%
    unite(entry, prefix, entry, suffix, sep = "") %>%
    spread(key, entry) %>%
    select(-row)

Explanation: We first separate the prefixes and suffixes from code, then separate the individual components from the main code part. We reshape from wide to long, remove NA entries, and concatenate each code component with the prefix and suffix before reshaping back from long to wide.

This reproduces your expected output except for code_2 in row 1.

An alternative

As an alternative approach, it might be more useful to store the pre- & suffixed codes in a list column, rather than storing them in a wide format with additional columns code_1, code_2 and so on. This has the advantage that you wouldn't have to hard-code the number of codes you have in column code; the following approach will work for any number of codes in code and only assumes that

the first 2 characters of code define the prefix
the last character of code is the suffix.

df %>%
    separate(code, c("prefix", "main", "suffix"), sep = c(2, -1), remove = F) %>%
    transmute(
        code,
        codes_as_list = pmap(
            list(prefix, str_split(main, "/"), suffix),
            function(x, y, z) paste0(x, y, z)))
#                 code               codes_as_list
#1 PC211/212.5(C)/664F PC211F, PC212.5(C)F, PC664F
#2   VC23152(A)/23550F       VC23152(A)F, VC23550F
#3       PC459/460(B)M           PC459M, PC460(B)M
#4       PC187(A)/664F           PC187(A)F, PC664F

Note that codes_as_list is now a list column with the properly pre/suffixed codes, making it easy to operate on the elements with the purrr::map machinery.

IIUC, this will give you the prefix and suffix on each separated column:

library(tidyverse)

df %>%
  mutate(prefix = str_extract(code, "^[A-Z]+"),
         suffix = str_extract(code, "[A-Z]$")) %>%
  separate(code, into = c("code_1", "code_2", "code_3"), 
           sep = "/", fill = "right", remove = F) %>%
  mutate_at(vars(matches("_1$")), 
            list(~paste0(., suffix))) %>%
  mutate_at(vars(matches("_2$")), 
            list(~if_else(str_sub(., -1) == suffix, 
                          paste0(prefix, .),
                          paste0(paste0(prefix, .), suffix)))) %>%
  mutate_at(vars(matches("_3$")), 
            list(~if_else(is.na(.), 
                          NA_character_, 
                          paste0(prefix, .)))) %>%
  select(-prefix, -suffix)

                 code      code_1      code_2 code_3
1 PC211/212.5(C)/664F      PC211F PC212.5(C)F PC664F
2   VC23152(A)/23550F VC23152(A)F    VC23550F   <NA>
3       PC459/460(B)M      PC459M   PC460(B)M   <NA>
4       PC187(A)/664F   PC187(A)F      PC664F   <NA>

Here is another option with separate and str_extract_all. We create a pattern ('pat') that use a regex lookaround to match the position between a / followed by number ([0-9]) and a second pattern to match character position before /. Using str_replace_all, insert the position matched by 'pat' with the first two characters (substr) of the string, as well as insert the position before / with the last character of string, then use separate to split the column into three at the delimiter /

library(tidyverse)
#pat <- "(?<=\\/)(?=[0-9]+\\(?[A-Z])"
pat <- "(?<=\\/)(?=[0-9])"
pat2 <- "(?=\\/)"
df %>% 
  mutate(code1 = str_replace_all(code, pat, substr(code, 1, 2)) %>% 
  str_replace_all(pat2, substring(code, nchar(code))))%>%
  separate(code1, into = paste0("code_", 1:3), sep="[/]")
#                 code      code_1      code_2 code_3
#1 PC211/212.5(C)/664F      PC211F PC212.5(C)F PC664F
#2   VC23152(A)/23550F VC23152(A)F    VC23550F   <NA>
#3       PC459/460(B)M      PC459M   PC460(B)M   <NA>
#4       PC187(A)/664F   PC187(A)F      PC664F   <NA>

How to split all strings in a column AND include prefix in all the new data

Update

An alternative

Tags:

Regex

R

Related

Recent Posts