R Fast XML Parsing
Updated for the comments
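library(XML)
doc = xmlParse(filename)
d = xmlRoot(doc)
size = xmlSize(d)
# First pass: collect the union of field names across all records
names = NULL
for(i in seq_len(size)){
v = getChildrenStrings(d[[i]])
names = unique(c(names, names(v)))
}
# Second pass: write one CSV line per record, with columns in a fixed order
for(i in seq_len(size)){
v = getChildrenStrings(d[[i]])
# sep="" avoids a stray space before the newline
cat(paste(v[names], collapse=","), "\n", sep="", file="a.csv", append=TRUE)
}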
This finishes in about 0.4 seconds for a 1000x100 XML document. If you know the variable names in advance, you can even omit the first for loop.
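As a rough sketch of the single-pass variant, assuming the field names are known up front: the tiny inline document, its element names, and the `fields` vector are made up for illustration. It also collects the lines in memory and writes them once, instead of appending to the file per record, and writes missing fields as empty strings.

```r
library(XML)

# Tiny stand-in document; the element names are hypothetical
xml_text <- "<rows><row><id>1</id><name>a</name></row><row><id>2</id></row></rows>"
doc <- xmlParse(xml_text, asText = TRUE)
d <- xmlRoot(doc)

# Field names assumed known in advance, so no discovery pass is needed
fields <- c("id", "name")

lines <- vapply(seq_len(xmlSize(d)), function(i) {
  v <- getChildrenStrings(d[[i]])
  row <- v[fields]          # fields missing from this record come back as NA
  row[is.na(row)] <- ""     # write them as empty fields
  paste(row, collapse = ",")
}, character(1))

# Header line plus one line per record, written in a single call
out <- tempfile(fileext = ".csv")
writeLines(c(paste(fields, collapse = ","), lines), out)
```

The result can be read back with `read.csv(out)` to check it.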
Note: if your XML content contains commas or quotation marks, you will have to handle them specially. In that case, I recommend the next method.
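If you would rather stay with the CSV approach, one possible way to handle such values is to quote the fields yourself. The helpers below (`csv_field` and `csv_line` are hypothetical names, not part of any package) are a minimal sketch of the usual convention: wrap a field in double quotes when it contains a comma, quote, or line break, and double any embedded quote.

```r
# Quote fields the way most CSV readers expect.
# NA is written as an empty field.
csv_field <- function(x) {
  x <- as.character(x)
  needs_quote <- !is.na(x) & grepl("[\",\r\n]", x)
  x[needs_quote] <- paste0(
    "\"", gsub("\"", "\"\"", x[needs_quote], fixed = TRUE), "\""
  )
  x[is.na(x)] <- ""
  x
}

# Build one CSV line from a (named) character vector
csv_line <- function(v) paste(csv_field(v), collapse = ",")

v <- c(id = "1", note = "said \"hi\", then left")
line <- csv_line(v)
```

In the loop above, `csv_line(v[names])` would then take the place of `paste(v[names], collapse=",")`.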
If you want to construct the data.frame dynamically, you can do this with data.table. data.table is a little slower than the CSV method above, but faster than data.frame.
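library(data.table)
# Preallocate a character table with one column per field name
m = as.data.table(matrix(NA_character_, nrow=size, ncol=length(names)))
setnames(m, names)
for(i in seq_len(size)){
v = getChildrenStrings(d[[i]])
# Assign by reference; the parentheses make data.table treat names(v) as column names
m[i, (names(v)) := as.list(v)]
}
# Convert each column from character to its natural type
for (n in names) m[, (n) := type.convert(m[[n]], as.is=TRUE)]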
It finishes in about 1.1 seconds for the same document.
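The preallocate-then-fill pattern can also be written with data.table's `set()`, which assigns by reference and is generally cheaper inside a loop than going through `[`. This is only a sketch: the `records` list stands in for what `getChildrenStrings()` would return per node, and its values are invented.

```r
library(data.table)

# Stand-ins for the per-record named character vectors (made-up values)
records <- list(
  c(id = "1", name = "a"),
  c(id = "2", score = "3.5")
)
fields <- unique(unlist(lapply(records, names)))

# Preallocate one character column per field
m <- as.data.table(
  matrix(NA_character_, nrow = length(records), ncol = length(fields))
)
setnames(m, fields)

for (i in seq_along(records)) {
  v <- records[[i]]
  # Fill row i, touching only the columns this record actually has
  set(m, i = i, j = names(v), value = as.list(v))
}

# Replace each column with its type-converted version
for (n in fields) set(m, j = n, value = type.convert(m[[n]], as.is = TRUE))
```

Fields that a record lacks stay `NA` in that row.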
Just in case it helps someone, I found this solution using data.table to be even faster in my use case, as it only converts the data to a data.table once it has finished looping over the rows:
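library(XML)
library(data.table)
doc <- xmlParse(filename)
# One node per record: "//Data" matches every <Data> element in the document
d <- getNodeSet(doc, "//Data")
size <- xmlSize(d)
# Build a plain list per record, then bind them all in a single call;
# fill=TRUE pads records that lack some fields with NA
dt <- rbindlist(lapply(seq_len(size), function(i) {
as.list(getChildrenStrings(d[[i]]))
}), fill=TRUE)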
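Two things to keep in mind with this approach: records with differing fields need `fill = TRUE` in `rbindlist`, and every column arrives as character. The sketch below illustrates both on made-up records (plain lists standing in for `as.list(getChildrenStrings(...))`), with a `type.convert` pass at the end.

```r
library(data.table)

# Stand-ins for the per-record lists (values invented for illustration)
records <- list(
  list(id = "1", name = "a"),
  list(id = "2", score = "3.5")
)

# fill = TRUE matches columns by name and pads missing fields with NA
dt <- rbindlist(records, fill = TRUE)

# Convert every column from character to its natural type, by reference
cols <- names(dt)
dt[, (cols) := lapply(.SD, type.convert, as.is = TRUE), .SDcols = cols]
```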