Remove empty documents from DocumentTermMatrix in R topicmodels?
agstudy's answer works great, but using it on a slow computer proved mildly problematic.
library(tictoc)  # provides tic() and toc()
tic()
row_total = apply(dtm, 1, sum)
dtm.new = dtm[row_total > 0, ]
toc()
4.859 sec elapsed
(This was timed on a 4000 x 15000 dtm.)
The bottleneck appears to be applying sum() to a sparse matrix.
A document-term matrix created by the tm package contains the components i and j, which are the row and column indices of the entries stored in the sparse matrix. If dtm$i does not contain a particular row index p, then row p is empty.
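To make the role of i concrete, here is a minimal sketch on a toy matrix. It assumes the slam package (the sparse format that tm's DocumentTermMatrix is built on) is installed; the matrix m and its values are made up for illustration.

```r
library(slam)

# Hypothetical 4 x 3 sparse matrix; row 3 has no stored entries.
m <- simple_triplet_matrix(
  i = c(1L, 1L, 2L, 4L),   # row index of each stored entry
  j = c(1L, 3L, 2L, 1L),   # column index of each stored entry
  v = c(2, 1, 5, 1),       # value of each stored entry
  nrow = 4L, ncol = 3L
)

# Rows that appear in m$i have at least one stored entry.
non_empty <- sort(unique(m$i))

# Rows that never appear in m$i are empty.
empty <- setdiff(seq_len(nrow(m)), non_empty)

# Keep only the rows that have entries.
m_new <- m[non_empty, ]
```

The sort() call is only a precaution for matrices whose i is not already in row order.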
tic()
ui = unique(dtm$i)
dtm.new = dtm[ui,]
toc()
0.121 sec elapsed
ui contains the indices of all non-empty rows, and since dtm$i is already ordered, dtm.new will be in the same order as dtm. The performance gain may not matter for smaller document-term matrices, but may become significant with larger ones.
"Each row of the input matrix needs to contain at least one non-zero entry"
The error means that the sparse matrix contains a row with no entries (words). One idea is to compute the word count for each row:
rowTotals <- apply(dtm, 1, sum)  # sum of word counts in each document
dtm.new <- dtm[rowTotals > 0, ]  # remove all docs without words
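If apply() turns out to be slow on a large matrix, one possible variation is to compute the row sums with row_sums() from the slam package, which operates on the sparse representation. The sketch below uses a small made-up simple_triplet_matrix as a stand-in for a real DocumentTermMatrix; the document and term names are hypothetical.

```r
library(slam)

# Hypothetical stand-in for a DocumentTermMatrix:
# 4 documents, 3 terms, and "doc3" contains no words.
dtm <- simple_triplet_matrix(
  i = c(1L, 1L, 2L, 4L),
  j = c(1L, 3L, 2L, 1L),
  v = c(2, 1, 5, 1),
  nrow = 4L, ncol = 3L,
  dimnames = list(Docs = paste0("doc", 1:4), Terms = c("a", "b", "c"))
)

# Sum of word counts per document, without calling apply().
row_totals <- row_sums(dtm)

# Drop documents whose total is zero.
dtm_new <- dtm[row_totals > 0, ]
```

Because this still sums the values, it also drops a row whose stored entries are all zero, which a check based only on the stored indices would keep.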