Opposite of unnest_tokens
library(tidyverse)
tidy_austen %>%
  group_by(book, linenumber) %>%
  summarise(text = str_c(word, collapse = " "))
Not a stupid question! The answer depends a bit on exactly what you are trying to do, but here is my typical approach when I want to get text back to something like its original form after processing it in tidied form, using the group_by() function from dplyr.
First, let's go from raw text to a tidied format.
library(tidyverse)
library(tidytext)
tidy_austen <- janeaustenr::austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text)
tidy_austen
#> # A tibble: 725,055 x 3
#>    book                linenumber word
#>    <fct>                    <int> <chr>
#>  1 Sense & Sensibility          1 sense
#>  2 Sense & Sensibility          1 and
#>  3 Sense & Sensibility          1 sensibility
#>  4 Sense & Sensibility          3 by
#>  5 Sense & Sensibility          3 jane
#>  6 Sense & Sensibility          3 austen
#>  7 Sense & Sensibility          5 1811
#>  8 Sense & Sensibility         10 chapter
#>  9 Sense & Sensibility         10 1
#> 10 Sense & Sensibility         13 the
#> # … with 725,045 more rows
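If you are wondering why the line numbers above skip (1, 3, 5, 10, ...), here is a small sketch on made-up text. The `toy` tibble and its contents are my own illustration, not part of the Austen data. Blank lines produce no tokens, so they drop out of the tidied result, and by default unnest_tokens() lowercases the words and strips punctuation. The `to_lower = FALSE` argument keeps the original case if you need it later.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy input; the second line is blank on purpose
toy <- tibble(
  book = "Toy Book",
  text = c("Hello, World!", "", "Chapter 1")
) %>%
  mutate(linenumber = row_number())

# Default tokenization: lowercased, punctuation stripped, blank line dropped
tidy_toy <- toy %>%
  unnest_tokens(word, text)

# Same thing, but keeping the original capitalization
tidy_toy_case <- toy %>%
  unnest_tokens(word, text, to_lower = FALSE)

tidy_toy
tidy_toy_case
```

Line 2 has no rows in either result, which is the same reason the Austen output jumps from line 1 to line 3.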
The text is tidy now! But we can untidy it, back to something sort of like its original form. I typically approach this using group_by() and summarize() from dplyr, and str_c() from stringr. What does the text look like at the end, in this particular case?
tidy_austen %>%
  group_by(book, linenumber) %>%
  summarize(text = str_c(word, collapse = " ")) %>%
  ungroup()
#> # A tibble: 62,272 x 3
#>    book            linenumber text
#>    <fct>                <int> <chr>
#>  1 Sense & Sensib…          1 sense and sensibility
#>  2 Sense & Sensib…          3 by jane austen
#>  3 Sense & Sensib…          5 1811
#>  4 Sense & Sensib…         10 chapter 1
#>  5 Sense & Sensib…         13 the family of dashwood had long been settled…
#>  6 Sense & Sensib…         14 was large and their residence was at norland…
#>  7 Sense & Sensib…         15 their property where for many generations th…
#>  8 Sense & Sensib…         16 respectable a manner as to engage the genera…
#>  9 Sense & Sensib…         17 surrounding acquaintance the late owner of t…
#> 10 Sense & Sensib…         18 man who lived to a very advanced age and who…
#> # … with 62,262 more rows
Created on 2019-07-11 by the reprex package (v0.3.0)
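One thing the untidied result does not have is the blank lines, since they never produced any tokens. If you want those rows back, one possible approach (a sketch, not the only way) is to join the collapsed text onto the full set of line numbers from the original data. The `toy` tibble below is made up for illustration, and the names `untidy_toy` and `restored` are my own.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical toy input; the second line is blank on purpose
toy <- tibble(
  book = "Toy Book",
  linenumber = 1:3,
  text = c("Hello, World!", "", "Chapter 1")
)

# Tidy, then untidy with group_by() + summarize() + str_c()
untidy_toy <- toy %>%
  unnest_tokens(word, text) %>%
  group_by(book, linenumber) %>%
  summarize(text = str_c(word, collapse = " ")) %>%
  ungroup()

# Join back onto every original line number so blank lines reappear,
# filling the lines that had no tokens with an empty string
restored <- toy %>%
  select(book, linenumber) %>%
  left_join(untidy_toy, by = c("book", "linenumber")) %>%
  mutate(text = coalesce(text, ""))

restored
```

The case and punctuation are still gone, so this gets you the original line structure rather than the original text.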