Using Stata Variable Labels in R

Here's a function to evaluate any expression you want with Stata variable labels:

#' Function to prettify the output of another function using a `var.labels` attribute
#' This is particularly useful in combination with read.dta et al.
#' @param dat A data.frame with attr `var.labels` giving descriptions of variables
#' @param expr An expression to evaluate with pretty var.labels
#' @return The result of the expression, with variable names replaced with their labels
#' @examples
#' testDF <- data.frame( a=seq(10),b=runif(10),c=rnorm(10) )
#' attr(testDF,"var.labels") <- c("Identifier","Important Data","Lies, Damn Lies, Statistics")
#' prettify( testDF, quote(str(dat)) )
prettify <- function( dat, expr ) {
  labels <- attr(dat,"var.labels")
  for(i in seq(ncol(dat))) colnames(dat)[i] <- labels[i]
  attr(dat,"var.labels") <- NULL
  eval( expr )
}

You can then prettify(testDF, quote(table(...))) or whatever you want.

See this thread for more info.


I would recommend that you use the new haven package (GitHub) for importing your data.

As Hadley Wickham mentions in the README.md file:

You always get a data frame, date times are converted to corresponding R classes and labelled vectors are returned as new labelled class. You can easily coerce to factors or replace labelled values with missings as appropriate. If you also use dplyr, you'll notice that large data frames are printed in a convenient way.

(emphasis mine)

If you use RStudio this will automatically display the labels under variable names in the View("data.frame") viewer pane (source).

Variable labels are attached as an attribute to each variable. These are not printed (because they tend to be long), but if you have a preview version of RStudio, you’ll see them in the revamped viewer pane.

You can install the package using:

install.packages("haven")

and import your Stata date using:

read_dta("path/to/file")

For more info see:

help("read_dta")

R does not have a built in way to handle variable labels. Personally I think that this is disadvantage that should be fixed. Hmisc does provide some facilitiy for hadling variable labels, but the labels are only recognized by functions in that package. read.dta creates a data.frame with an attribute "var.labels" which contains the labeling information. You can then create a data dictionary from that.

> data(swiss)
> write.dta(swiss,swissfile <- tempfile())
> a <- read.dta(swissfile)
> 
> var.labels <- attr(a,"var.labels")
> 
> data.key <- data.frame(var.name=names(a),var.labels)
> data.key
          var.name       var.labels
1        Fertility        Fertility
2      Agriculture      Agriculture
3      Examination      Examination
4        Education        Education
5         Catholic         Catholic
6 Infant_Mortality Infant.Mortality

Of course this .dta file doesn't have very interesting labels, but yours should be more meaningful.