Good Ways to Visualize Longitudinal Categorical Data in R
In researching my question, I've found a few other options that I'll list here.
A number of relatively new R packages are designed for visualizing and analyzing "life history" or "multistate sequence" data. The idea is that over time people (or objects) enter and exit various categories--for example, career changes, marriage and divorce, health and disease, or, in my case, categories of academic standing in college.
R packages for visualizing sequence or life history data include biograph, mentioned by @timriffe in a comment above, and TraMineR. The author of the biograph package, Frans Willekens, has a book on the package, Biograph. Multistate analysis of life histories with R, that will be published by Springer this fall. TraMineR has a detailed user manual at the link above and also a shorter JSS article. JSS also has a special issue on multi-state models in the context of risk analysis that discusses additional R packages for multistate modeling.
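As a rough sketch of what a TraMineR workflow can look like (this is not taken from the package manual): the toy data frame `toy` and its column names are made up for illustration, and the example assumes TraMineR is installed. `seqdef()` builds a state sequence object from wide-format data, while `seqIplot()` and `seqdplot()` draw an index plot and a state distribution plot.

```r
# Hypothetical toy data: one row per student, one column per term.
toy <- data.frame(
  id = 1:3,
  t1 = c("Good", "Probation", "Good"),
  t2 = c("Good", "Suspended", "Probation"),
  t3 = c("Good", "Good", "Good"),
  stringsAsFactors = FALSE
)
term_cols <- c("t1", "t2", "t3")

# Only run the TraMineR part if the package is available.
if (requireNamespace("TraMineR", quietly = TRUE)) {
  library(TraMineR)
  # Build a state sequence object from the term columns.
  toy_seq <- seqdef(toy, var = term_cols, id = toy$id)
  # Index plot: one horizontal line per student, sorted from the first state.
  seqIplot(toy_seq, sortv = "from.start")
  # State distribution plot: share of students in each state at each term.
  seqdplot(toy_seq)
}
```

Real data in long format (one row per student and term) would first need to be reshaped to this wide layout.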
I also found some specialized software designed to visualize movements between categories over time. Parallel Sets is a simple, free program for producing basic visualizations, although it has limited flexibility. Lifeflow is more sophisticated. It's also free, but you have to send an email to the creator requesting a copy.
I'll add more details to this answer once I've had a chance to try out these tools.
Here are a few ideas for plotting your data. I've used ggplot2, and I've reformatted the data a bit in places.
Figure 1
I've used a stacked barplot to mimic your mosaic plot and solve the alignment issue.
Figure 2
Data points for each student are connected by a gray line, making this reminiscent of a parallel coordinates plot. Coloring the points shows the categorical standing. Using GPA on the y-axis helps spread out the points to reduce overplotting, and shows correlation of standing and GPA. A major problem is that many valid standing datapoints drop out because they lack a matching termGPA value.
Figure 3
Here I've created a new variable called initial_standing to use for faceting. Each panel contains students who match in both cohort and initial_standing. Plotting the id as text makes this figure a bit cluttered, but it could be useful in some cases.
Figure 4
This plot is like a heatmap where each row is a student. I controlled the order of the id axis to force initial_standing and cohort groupings to stay together. If you have many more rows, you may want to consider sorting rows by some type of clustering.
library(ggplot2)
# Create new data frame for determining initial standing.
standing_data = data.frame(id=unique(df1$id), initial_standing=NA, cohort=NA)
for (i in 1:nrow(standing_data)) {
    id = standing_data$id[i]
    subdat = df1[df1$id == id, ]
    subdat = subdat[complete.cases(subdat), ]
    initial_standing = subdat$standing[which.min(subdat$term)]
    standing_data[i, "initial_standing"] = as.character(initial_standing)
    standing_data[i, "cohort"] = as.character(subdat$cohort[1])
}
standing_data$cohort = factor(standing_data$cohort, levels=levels(df1$cohort))
standing_data$initial_standing = factor(standing_data$initial_standing,
                                        levels=levels(df1$standing))
# Add the new column (initial_standing) to df1.
df1 = merge(df1, standing_data[, c("id", "initial_standing")], by="id")
# Remove rows where standing is missing. Make some plots tidier.
df1 = df1[!is.na(df1$standing), ]
# Create id factor, controlling the sort order of the levels.
id_order = order(standing_data$initial_standing, standing_data$cohort)
df1$id = factor(df1$id, levels=as.character(standing_data$id)[id_order])
p1 = ggplot(df1, aes(x=term, fill=standing)) +
     geom_bar(position="fill", colour="grey20", size=0.5, width=1.0) +
     facet_grid(cohort ~ .) +
     scale_fill_brewer(palette="Set1")
p2 = ggplot(df1, aes(x=term, y=termGPA, group=id)) +
     geom_line(colour="grey70") +
     geom_point(aes(colour=standing), size=4) +
     facet_grid(cohort ~ .) +
     scale_colour_brewer(palette="Set1")
p3 = ggplot(df1, aes(x=term, y=termGPA, group=id)) +
     geom_line(colour="grey70") +
     geom_point(aes(colour=standing), size=4) +
     geom_text(aes(label=id), hjust=-0.30, size=3) +
     facet_grid(initial_standing ~ cohort) +
     scale_colour_brewer(palette="Set1")
p4 = ggplot(df1, aes(x=term, y=id, fill=standing)) +
     geom_tile(colour="grey20") +
     facet_grid(initial_standing ~ ., space="free_y", scales="free_y") +
     scale_fill_brewer(palette="Set1") +
     # Hide grid lines so the tiles read cleanly.
     theme(panel.grid.major=element_blank(),
           panel.grid.minor=element_blank())
ggsave("plot_1.png", p1, width=10, height=6.25, dpi=80)
ggsave("plot_2.png", p2, width=10, height=6.25, dpi=80)
ggsave("plot_3.png", p3, width=10, height=6.25, dpi=80)
ggsave("plot_4.png", p4, width=10, height=6.25, dpi=80)
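The clustering idea mentioned for Figure 4 could be sketched roughly as follows, using only base R. The toy matrix `wide`, the student labels, and the choice of a simple mismatch (Hamming) distance with average linkage are all assumptions made for illustration; other distances may suit real data better.

```r
# Hypothetical toy data: one row per student, one column per term.
wide <- matrix(
  c("Good",      "Good",      "Good",
    "Probation", "Suspended", "Suspended",
    "Good",      "Good",      "Probation",
    "Probation", "Suspended", "Good"),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3", "s4"), c("t1", "t2", "t3"))
)

# Mismatch distance: number of terms in which two students differ.
# Terms where either value is missing are ignored.
n <- nrow(wide)
d <- matrix(0, n, n, dimnames = list(rownames(wide), rownames(wide)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- sum(wide[i, ] != wide[j, ], na.rm = TRUE)
  }
}

# Hierarchical clustering; hc$order lists rows so that similar
# students end up next to each other.
hc <- hclust(as.dist(d), method = "average")
id_order <- rownames(wide)[hc$order]
print(id_order)
```

The resulting `id_order` could then be used as the factor levels for the id axis, in the same way the plotting code sets `levels=` when it creates the id factor.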
I wish I had found @bdemarest's answer before I wrote an R package to solve this problem, but since the OP requested additional updates, I'll share one more solution. What bdemarest suggested in Figure 4 is what I have been calling a type of horizontal line plot.
In developing the longCatEDA R package, we found that sorting the data was crucial to making useful plots (see example(sorter) and the report linked in the comment below for technical details), especially as the size of the problem became large. For example, we started the problem with daily drinking data (abstinent, use, abuse) for several thousand participants over 3 years (>1000 days).
The code to apply the horizontal line plot to @eipi10's data is below. Figure 1 stratifies by cohort, and Figure 2 stratifies by first status as with Figure 4 of @bdemarest, though the results are not identical due to within-strata sorting.
Figure 1
Figure 2
# libraries (uncomment the next line to install longCatEDA once)
# install.packages('longCatEDA')
library(longCatEDA)
library(RColorBrewer)
# transform data long to wide
dfw <- reshape(df1,
               timevar = 'term',
               idvar = c('id', 'cohort'),
               direction = 'wide')
# set up objects required by longCat()
y <- dfw[,seq(3,15,by=2)]
Labels <- levels(df1$standing)
tLabels <- levels(df1$term)
groupLabels <- levels(dfw$cohort)
# use the same colors as bdemarest
cols <- brewer.pal(7, "Set1")
# plot the longCat object
png('plot1.png', width=10, height=6.25, units='in', res=100)
par(bg='cornsilk3', mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
lc <- longCat(y=y, Labels=Labels, tLabels=tLabels, id=dfw$id)
longCatPlot(lc, cols=cols, xlab='Term', lwd=8, legendBuffer=0)
legend(8.1, 25, legend=Labels, col=cols, lty=1, lwd=4)
dev.off()
# stratify by cohort
png('plot2.png', width=10, height=6.25, units='in', res=100)
par(bg='cornsilk3', mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
lc.g <- sorter(lc, group=dfw$cohort, groupLabels=groupLabels)
longCatPlot(lc.g, cols=cols, xlab='Term', lwd=8, legendBuffer=0)
legend(8.1, 25, legend=Labels, col=cols, lty=1, lwd=4)
dev.off()
# stratify by first status, akin to Figure 4 by bdemarest
png('plot3.png', width=10, height=6.25, units='in', res=100)
par(bg='cornsilk3', mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
first <- apply(!is.na(y), 1, function(x) which(x)[1])
first <- y[cbind(seq_along(first), first)]
lc.1 <- sorter(lc, group=factor(first), groupLabels = sort(unique(first)))
longCatPlot(lc.1, cols=cols, xlab='Term', lwd=8, legendBuffer=0)
legend(8.1, 25, legend=Labels, col=cols, lty=1, lwd=4)
dev.off()
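To make the reshape and first-status steps in the code above easier to follow, here is a small self-contained sketch on made-up data. The data frame `long` and its values are hypothetical; only base R is used.

```r
# Hypothetical long-format data: one row per student and term.
long <- data.frame(
  id = rep(1:3, each = 3),
  cohort = rep(c("A", "A", "B"), each = 3),
  term = rep(1:3, times = 3),
  standing = c(NA, "Good", "Probation",
               "Good", "Good", NA,
               "Probation", "Good", "Good"),
  stringsAsFactors = FALSE
)

# Long to wide: one row per student, columns standing.1, standing.2, standing.3.
wide <- reshape(long,
                timevar = "term",
                idvar = c("id", "cohort"),
                direction = "wide")

# Keep only the status columns as a character matrix.
y <- as.matrix(wide[, paste0("standing.", 1:3)])

# For each row, the column index of the first non-missing status.
first_col <- apply(!is.na(y), 1, function(x) which(x)[1])

# Matrix indexing with (row, column) pairs pulls out one value per row.
first_status <- y[cbind(seq_along(first_col), first_col)]
print(first_status)
```

Student 1 has no status in the first term, so the first observed status comes from the second column; this is the case the `which(x)[1]` step is there to handle.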