Double clustered standard errors for panel data
Frank Harrell's package rms
(which used to be named Design
) has a function that I use often when clustering: robcov
.
See this part of ?robcov
, for example.
cluster: a variable indicating groupings. ‘cluster’ may be any type of
vector (factor, character, integer). NAs are not allowed.
Unique values of ‘cluster’ indicate possibly correlated
groupings of observations. Note the data used in the fit and
stored in ‘fit$x’ and ‘fit$y’ may have had observations
containing missing values deleted. It is assumed that if any
NAs were removed during the original model fitting, an
‘naresid’ function exists to restore NAs so that the rows of
the score matrix coincide with ‘cluster’. If ‘cluster’ is
omitted, it defaults to the integers 1,2,...,n to obtain the
"sandwich" robust covariance matrix estimate.
This is an old question. But seeing as people still appear to be landing on it, I thought I'd provide some modern approaches to multiway clustering in R:
Option 1 (fastest): fixest::feols()
library(fixest)
nlswork = haven::read_dta("http://www.stata-press.com/data/r14/nlswork.dta")
est_feols = feols(ln_wage ~ age | race + year, cluster = ~race+year, data = nlswork)
est_feols
## An important feature of fixest: We can _instantaneously_ compute other
## VCOV matrices / SEs on the fly with summary.fixest(). No need to re-run
## the model!
summary(est_feols, se = 'iid') ## vanilla SEs
summary(est_feols, se = 'hc1') ## robust SEs
summary(est_feols, se = 'twoway') ## diff syntax, but same as original model
summary(est_feols, cluster = c('race', 'year')) ## ditto
summary(est_feols, cluster = ~race^year) ## interacted cluster vars
summary(est_feols, cluster = ~ race + year + idcode) ## add third cluster var (not in original model call)
## etc.
Option 2 (fast): lfe::felm()
library(lfe)
## Note the third "| 0 " slot means we're not using IV
est_felm = felm(ln_wage ~ age | race + year | 0 | race + year, data = nlswork)
summary(est_felm)
Option 3 (slower, but flexible): sandwich
library(sandwich)
library(lmtest)
est_sandwich = lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork)
coeftest(est_sandwich, vcov = vcovCL, cluster = ~ race + year)
Benchmark
Aaaand, just to belabour the point about speed. Here's a benchmark of the three different approaches (using two fixed FEs and twoway clustering).
est_feols = function() feols(ln_wage ~ age | race + year, cluster = ~race+year, data = nlswork)
est_felm = function() felm(ln_wage ~ age | race + year | 0 | race + year, data = nlswork)
est_standwich = function() {coeftest(lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork),
vcov = vcovCL, cluster = ~ race + year)}
microbenchmark(est_feols(), est_felm(), est_standwich(), times = 3)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> est_feols() 11.94122 11.96158 12.55835 11.98193 12.86692 13.75191 3 a
#> est_felm() 87.18064 95.89905 100.69589 104.61746 107.45352 110.28957 3 b
#> est_standwich() 176.43502 183.95964 188.50271 191.48425 194.53656 197.58886 3 c
For panel regressions, the plm
package can estimate clustered SEs along two dimensions.
Using M. Petersen’s benchmark results:
require(foreign)
require(plm)
require(lmtest)
test <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")
##Double-clustering formula (Thompson, 2011)
vcovDC <- function(x, ...){
vcovHC(x, cluster="group", ...) + vcovHC(x, cluster="time", ...) -
vcovHC(x, method="white1", ...)
}
fpm <- plm(y ~ x, test, model='pooling', index=c('firmid', 'year'))
So that now you can obtain clustered SEs:
##Clustered by *group*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.066952 0.4433 0.6576
x 1.034833 0.050550 20.4714 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##Clustered by *time*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="time", type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.022189 1.3376 0.1811
x 1.034833 0.031679 32.6666 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##Clustered by *group* and *time*
> coeftest(fpm, vcov=function(x) vcovDC(x, type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.064580 0.4596 0.6458
x 1.034833 0.052465 19.7243 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
For more details see:
- Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R.
However the above works only if your data can be coerced to a pdata.frame
. It will fail if you have "duplicate couples (time-id)"
. In this case you can still cluster, but only along one dimension.
Trick plm
into thinking that you have a proper panel data set by specifying only one index:
fpm.tr <- plm(y ~ x, test, model='pooling', index=c('firmid'))
So that now you can obtain clustered SEs:
##Clustered by *group*
> coeftest(fpm.tr, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.066952 0.4433 0.6576
x 1.034833 0.050550 20.4714 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
You can also use this workaround to cluster by a higher dimension or at a higher level (e.g. industry
or country
). However in that case you won't be able to use the group
(or time
) effects
, which is the main limit of the approach.
Another approach that works for both panel and other types of data is the multiwayvcov
package. It allows double clustering, but also clustering at higher dimensions. As per the packages's website, it is an improvement upon Arai's code:
- Transparent handling of observations dropped due to missingness
- Full multi-way (or n-way, or n-dimensional, or multi-dimensional) clustering
Using the Petersen data and cluster.vcov()
:
library("lmtest")
library("multiwayvcov")
data(petersen)
m1 <- lm(y ~ x, data = petersen)
coeftest(m1, vcov=function(x) cluster.vcov(x, petersen[ , c("firmid", "year")]))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.029680 0.065066 0.4561 0.6483
## x 1.034833 0.053561 19.3206 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Arai's function can be used for clustering standard-errors. He has another version for clustering in multiple dimensions:
mcl <- function(dat,fm, cluster1, cluster2){
attach(dat, warn.conflicts = F)
library(sandwich);library(lmtest)
cluster12 = paste(cluster1,cluster2, sep="")
M1 <- length(unique(cluster1))
M2 <- length(unique(cluster2))
M12 <- length(unique(cluster12))
N <- length(cluster1)
K <- fm$rank
dfc1 <- (M1/(M1-1))*((N-1)/(N-K))
dfc2 <- (M2/(M2-1))*((N-1)/(N-K))
dfc12 <- (M12/(M12-1))*((N-1)/(N-K))
u1j <- apply(estfun(fm), 2, function(x) tapply(x, cluster1, sum))
u2j <- apply(estfun(fm), 2, function(x) tapply(x, cluster2, sum))
u12j <- apply(estfun(fm), 2, function(x) tapply(x, cluster12, sum))
vc1 <- dfc1*sandwich(fm, meat=crossprod(u1j)/N )
vc2 <- dfc2*sandwich(fm, meat=crossprod(u2j)/N )
vc12 <- dfc12*sandwich(fm, meat=crossprod(u12j)/N)
vcovMCL <- vc1 + vc2 - vc12
coeftest(fm, vcovMCL)}
For references and usage example see:
- Clustered Standard Errors in R