How to split a data frame by rows, and then process the blocks?

Plain old split() has a method for data frames, so split(data, data$site) produces a list of blocks, one data frame per site. You can then operate on this list using sapply/lapply/for.

split() is also nice because of unsplit(), which reverses the split: it reassembles a vector or data frame the same length as the original data and in the original order.
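
As a minimal sketch of that workflow, using made-up data (the site.data frame, its values, and the peak.centered column are assumptions for illustration, not taken from the question):

```r
# Hypothetical data: three sites with a few yearly peak values each.
site.data <- data.frame(
  site = c("A", "B", "A", "C", "B", "A"),
  year = c(2001, 2001, 2002, 2001, 2002, 2003),
  peak = c(10, 20, 12, 30, 22, 14)
)

# One data frame per site, returned as a named list.
blocks <- split(site.data, site.data$site)

# Summarise each block: here, the largest peak per site.
max.peak <- sapply(blocks, function(block) max(block$peak))

# Transform each block, then put the rows back in their original order.
centered <- lapply(blocks, function(block) {
  block$peak.centered <- block$peak - mean(block$peak)
  block
})
restored <- unsplit(centered, site.data$site)
```

Here max.peak is a named vector with one entry per site, and restored has the same rows in the same order as site.data plus the new column.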


You can use isplit (from the "iterators" package) to create an iterator object that loops over the blocks defined by the site column:

require(iterators)
site.data <- read.table("isplit-data.txt",header=TRUE)
# each element the iterator yields has $value (the block) and $key (the site)
sites <- isplit(site.data,site.data$site)

Then you can use foreach (from the "foreach" package) to create a plot within each block:

require(foreach)
foreach(site=sites) %dopar% {
 pdf(paste(site$key[[1]],".pdf",sep=""))
 plot(site$value$year,site$value$peak,main=site$key[[1]])
 dev.off()
}

As a bonus, if you have a multiprocessor machine and call registerDoMC() first (from the "doMC" package), the loops will run in parallel, speeding things up. More details in this Revolutions blog post: Block-processing a data frame with isplit


Another choice is to use the ddply function from the plyr package. But you mention you mostly want a plot of peak vs. year, so you could also just use qplot from ggplot2:

A <- read.table("example.txt",header=TRUE)
library(ggplot2)
qplot(year,peak,data=A,colour=site,geom="line",group=site)
ggsave("peak-year-comparison.png")


On the other hand, I do like David Smith's solution, which lets the function be applied across several processors.


Here's what I would do, although it looks like you guys have it handled by library functions.

sites = unique(data$site)
for(i in seq_along(sites)){
  # keep all columns (note the trailing comma) for rows at the i-th site
  constrainedData = data[data$site==sites[i], ]
  doSomething(constrainedData)
}

This kind of code is more direct and might be less efficient, but I prefer being able to read what it is doing over learning some new library function for the same thing. It feels more flexible too, but in all honesty this is just the way I figured it out as a novice.
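
For what it's worth, a self-contained version of that loop that also keeps the per-site results might look like the sketch below. The sample data and the body of doSomething() are stand-ins I made up for illustration:

```r
# Hypothetical data and a stand-in for doSomething().
data <- data.frame(
  site = c("A", "B", "A", "C", "B"),
  peak = c(10, 20, 12, 30, 22)
)
doSomething <- function(block) mean(block$peak)

# Loop over the distinct sites rather than over the rows.
sites <- unique(data$site)
results <- numeric(length(sites))
names(results) <- as.character(sites)

for (i in seq_along(sites)) {
  # Row filter on the left of the comma; empty right side keeps all columns.
  constrainedData <- data[data$site == sites[i], ]
  results[i] <- doSomething(constrainedData)
}
```

Preallocating results and filling it by index avoids growing a vector inside the loop, which is the usual slow spot in this style.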
