Unable to convert a Corpus to Data Frame in R
This ought to do it:
data.frame(text = sapply(myCorpus, as.character), stringsAsFactors = FALSE)
Edited with a working solution, using crude as an example.
The problem here is that you cannot apply stemCompletion as a transformation.
getTransformations()
## [1] "removeNumbers" "removePunctuation" "removeWords" "stemDocument" "stripWhitespace"
This list does not include stemCompletion, which instead takes a vector of stemmed tokens as input.
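To make the point about inputs concrete, here is a minimal sketch of calling stemCompletion directly on a character vector of stems. The stems and the dictionary words are made up for illustration, and type = "first" is chosen only so that the result does not depend on word frequencies.

```r
library(tm)

# Hypothetical stemmed tokens and a hypothetical dictionary of full words
stems <- c("compan", "entit", "suppl")
dict  <- c("company", "companies", "entity", "supply")

# stemCompletion works on a vector of tokens, not on a corpus;
# type = "first" picks the first dictionary word that starts with each stem
completed <- stemCompletion(stems, dictionary = dict, type = "first")
print(completed)
```

The result is a character vector named by the original stems, which is why it has to be pasted back together per document afterwards.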
So this should do it: first extract the transformed texts and tokenise them, then complete the stems, then paste the tokens back together. Here I have illustrated the solution using the built-in crude corpus.
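library(tm)  # stemDocument also needs the SnowballC package installed
data(crude)
myCorpus <- crude
# lower-case first so that capitalised stopwords ("The", "It") are matched
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removeWords, stopwords('english'))
myCorpus <- tm_map(myCorpus, removePunctuation)
# keep an unstemmed copy to serve as the completion dictionary
dictCorpus <- myCorpus
myCorpus <- tm_map(myCorpus, stemDocument)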
# tokenize the corpus
myCorpusTokenized <- lapply(myCorpus, scan_tokenizer)
# stem complete each token vector
myTokensStemCompleted <- lapply(myCorpusTokenized, stemCompletion, dictCorpus)
# concatenate tokens by document, create data frame
myDf <- data.frame(text = sapply(myTokensStemCompleted, paste, collapse = " "), stringsAsFactors = FALSE)
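If you need this more than once, the tokenise / complete / paste steps above can be wrapped in a small function. This is only a sketch: complete_corpus_to_df is a hypothetical helper name, not part of tm, and the toy corpus and dictionary below are made up so the example runs without SnowballC.

```r
library(tm)

# Hypothetical helper (not part of tm): stem-complete every document in an
# already stemmed corpus and return one row of text per document.
complete_corpus_to_df <- function(stemmed, dictionary, type = "prevalent") {
  completed <- lapply(stemmed, function(doc) {
    # split the document text into tokens, then complete each stem
    tokens <- scan_tokenizer(as.character(doc))
    stemCompletion(tokens, dictionary = dictionary, type = type)
  })
  data.frame(
    text = vapply(completed, paste, character(1), collapse = " "),
    stringsAsFactors = FALSE
  )
}

# Toy usage: a hand-written "stemmed" corpus and a character-vector dictionary
toyStemmed <- VCorpus(VectorSource(c("compan entit", "suppl")))
toyDict    <- c("company", "entity", "supply")
toyDf      <- complete_corpus_to_df(toyStemmed, toyDict, type = "first")
print(toyDf)
```

With the crude example you would call it as complete_corpus_to_df(myCorpus, dictCorpus), since stemCompletion accepts either a corpus or a character vector as its dictionary.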
I've redone some of your earlier code with magrittr, just cause.
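library(dplyr)
library(magrittr)  # provides use_series() and extract(), which dplyr does not export
library(tm)

dictCorpus =
  c("I love my cat", "Cullen bae is bae", "4ever alone :(") %>%
  VectorSource %>%
  Corpus %>%
  tm_map(removeWords, stopwords('english')) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation)

myCorpus =
  dictCorpus %>%
  tm_map(stemDocument) %>%
  # caution: stemCompletion is not listed by getTransformations() and expects
  # a vector of tokens, so this step may not complete stems as intended
  tm_map(stemCompletion, dictionary=dictCorpus)

# data_frame() is deprecated in recent dplyr; tibble() is the current equivalent
data =
  data_frame(object =
               myCorpus %>%
               `class<-`("list") %>%
               use_series(content) ) %>%
  rowwise %>%
  mutate(content =
           object %>%
           names %>%
           extract(1) )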