It All Stacks Up
Mark Twain famously quipped, "The report of my death was an exaggeration." As with Twain, reports of #EconTwitter's demise have also been grossly exaggerated. Case in point: an interesting econometric question was posed by Casey Wichman. The question concerned testing hypotheses involving parameters estimated from two different specifications (where one specification is estimated using only a subset of the observations used to estimate the other specification ... but that does not change anything).
Several individuals immediately offered a solution: stacked regression.
Hat tip to those responding to Casey, especially Paul Goldsmith-Pinkham's detailed response. Nonetheless, it seemed like a great topic for a new blog post.
I recall hearing the term -- stacked regression -- discussed during graduate school. Like many things at the time, I did not really understand what it meant. As a young professor, I also often heard discussions about stacking moment conditions. Again, I did not really grasp the concept. Because imposter syndrome causes many of us to not inquire about that which we do not understand, and stacking can be extremely useful, I am procrastinating from my duties and writing this in my favorite coffee shop.
Stacking is actually very simple, which is why it is such a shame when it escapes the econometric toolkit of applied researchers. In short, it means appending new observations to the end of the initial data set, so that the first M observations are used to estimate specification 1 and the remaining N-M observations are used to estimate specification 2. However, the vector of outcomes, y, and the design matrix, x, are organized so as to "fool" the computer into estimating a single regression. Because both specifications are estimated simultaneously, the full covariance matrix of the parameter estimates is obtained, allowing for all standard hypothesis tests. The computer does not need to know you are actually estimating two models. Since the data for one specification are appended below the data for the other, it is as if the data are stacked. Hence, the name.
This is best illustrated with an example. Rather than using Casey's from Twitter, I will use my favorite example, which ought to be of interest to many in its own right. Laporte & Windmeijer (2005) consider estimation of a panel data treatment effects model (back when we were still allowed to use two-way fixed effects (TWFE)). It is well known that the fixed effects (FE) estimator and the first difference (FD) estimator are both unbiased and consistent (for the slope coefficients, with N large and T fixed) if the standard TWFE data-generating process (DGP) is correctly specified. FE and FD use different transformations to eliminate the unit-specific fixed effects and then estimate the transformed model by Ordinary Least Squares (OLS).
As such, when the DGP is correctly specified, there is little advantage to using one estimator over the other. (Note that there may be efficiency gains from using one or the other.) However, Laporte & Windmeijer are concerned about model mis-specification (in particular, about mis-timing of the treatment effect). The authors show that if the DGP is mis-specified, then FE and FD are both biased, but in different ways. McKinnish (2008) does something similar for the case where the estimates are biased due to -- gasp! -- measurement error.
Because mis-specification of the DGP biases FE and FD differently, a specification test is available: test equality of the FE and FD estimates.
But how? If one applies the FE and FD estimators separately, then one gets the coefficient estimates from each along with the standard errors. However, testing equality of coefficients also requires knowledge of the covariance between the parameter estimates. One could do a clustered bootstrap. However, there is another way.
Without loss of generality, assume you have a balanced panel, i=1,...,N and t=1,...,T. To proceed, mean-difference the data "by hand" to generate y_it - ybar_i and x_it - xbar_i, where ybar_i and xbar_i are the unit-specific means of y and x, respectively. Generate a variable in the data set, S, equal to 0 for all observations. Save this as data set #1. Now, return to the original data set and first-difference the data "by hand" to generate Δy_it and Δx_it, where Δ is the FD operator. Generate a variable in the data set, S, equal to 1 for all observations. Save this as data set #2.
Next, append data set #2 to the bottom of data set #1. Define the outcome variable as
y = y_it - ybar_i if S == 0
y = Δy_it if S == 1
Similarly define the design matrix as
x = x_it - xbar_i if S == 0
x = Δx_it if S == 1
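For concreteness, here is a minimal pandas sketch of this bookkeeping, assuming a long-format DataFrame df with illustrative column names id, t, y, and a single regressor x; nothing about the trick depends on these names.

import pandas as pd

# Assumed long-format balanced panel with illustrative columns: id, t, y, x
df = df.sort_values(["id", "t"])

# Data set #1: mean-differenced (within) transformation, S = 0
fe = df.copy()
fe["y_s"] = fe["y"] - fe.groupby("id")["y"].transform("mean")
fe["x_s"] = fe["x"] - fe.groupby("id")["x"].transform("mean")
fe["S"] = 0

# Data set #2: first-differenced transformation, S = 1
fd = df.copy()
fd["y_s"] = fd.groupby("id")["y"].diff()
fd["x_s"] = fd.groupby("id")["x"].diff()
fd["S"] = 1
fd = fd.dropna(subset=["y_s", "x_s"])  # the first period of each unit is lost

# Append data set #2 beneath data set #1: 2NT - N observations in total
stacked = pd.concat([fe, fd], ignore_index=True)
stacked["Sx"] = stacked["S"] * stacked["x_s"]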
Finally, using the 2NT-N stacked observations, estimate the following model by OLS (clustering at the unit level):
y = x*b1 + Sx*b2 + e
where Sx is the interaction between S and x. This interaction is equal to zero for all observations in data set #1 and equal to Δx_it for all observations in data set #2. After OLS, one can easily test
H0 : b2 = 0
using a standard linear hypothesis test. Rejecting the null against a two-sided alternative indicates that the FE and FD estimates are statistically different at the chosen significance level. This is analogous to a Chow test for a structural break. However, here the structural break is across estimation techniques rather than across samples, and rejection implies mis-specification of the underlying DGP.
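Continuing the sketch above with statsmodels (the column names are still illustrative), the stacked regression and the test of b2 = 0 might look like this:

import statsmodels.formula.api as smf

# Stacked OLS, no constant, standard errors clustered at the unit level
res = smf.ols("y_s ~ x_s + Sx - 1", data=stacked).fit(
    cov_type="cluster", cov_kwds={"groups": stacked["id"]}
)

# b1 (on x_s) recovers the FE estimate; b1 + b2 recovers the FD estimate,
# so b2 = 0 means the two estimators agree
print(res.summary())
print(res.wald_test("Sx = 0", use_f=True))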
Finally, note there is nothing special about stacking only two specifications. Stack away! And there is a Stata command that helps do this, although I have not tried it to see how flexible it is.
Now, let's turn to a related trick: stacked moment conditions. Generalized Method of Moments (GMM) is even more flexible than (and nests) OLS. Without getting into the weeds, GMM specifies sample moment conditions, which are, in words, equations that are functions only of the data and the unknown parameters to be estimated. Each moment condition is one equation. Thus, if one wishes to estimate K unknown parameters, one needs L (linearly independent) moment conditions, where L ≥ K. If L = K, then this is known as Method of Moments (MM) and the parameters are exactly identified. If L > K, then the model is overidentified and, in general, no parameter vector sets all L sample moments exactly to zero. In this case, GMM proceeds by choosing the parameter estimates to minimize a (weighted) quadratic form in the L sample moment conditions.
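To fix ideas, here is a bare-bones numpy/scipy sketch of the GMM objective for a generic per-observation moment function; everything here (names, shapes, the moment function itself) is illustrative rather than any particular package's API.

import numpy as np
from scipy.optimize import minimize

def gmm_objective(theta, data, moment_fn, W):
    """Weighted quadratic form in the L sample moment conditions.

    moment_fn(theta, data) returns an (n, L) array of per-observation
    moments; W is an (L, L) positive definite weighting matrix.
    """
    g_bar = moment_fn(theta, data).mean(axis=0)  # the L sample means
    return g_bar @ W @ g_bar

# Illustrative usage: with L = K the minimum is (generically) zero and the
# estimates solve the moment conditions exactly; with L > K the minimizer
# trades off the L conditions according to W.
# res = minimize(gmm_objective, theta0, args=(data, moment_fn, W))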
The beauty of GMM is that you can add as many moment conditions (equations) as you want to the system. And, these moment conditions can come from anywhere (i.e., any model specification) you have in mind. The Laporte & Windmeijer model could instead be estimated using the following moments
E[x'e] = E[x'(y - x*b1 - Sx*b2)] = 0
E[Sx'e] = E[Sx'(y - x*b1 - Sx*b2)] = 0
This is a system of 2K (population) moment conditions that are a function of 2K parameters (x and Sx are each (2NT-N) x K). The sample moment conditions replace the expectations with the sample means.
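Because this system is exactly identified (2K moments, 2K parameters), the sample moment conditions can be set to zero exactly, and doing so simply reproduces the stacked OLS point estimates. A quick numpy sketch, reusing the illustrative stacked data from above:

import numpy as np

# Full design matrix [x, Sx]; here K = 1, so it is (2NT - N) x 2
X_full = np.column_stack([stacked["x_s"], stacked["Sx"]])
y_full = stacked["y_s"].to_numpy()

# Setting the 2K sample moments X_full'(y - X_full*b)/n to zero gives the
# usual normal equations, i.e., the stacked OLS estimates
b_hat = np.linalg.solve(X_full.T @ X_full, X_full.T @ y_full)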
The most common use of stacked moment conditions in applied research -- that users may not even be aware of -- must be the system GMM estimator for dynamic panel data models by Blundell & Bond (1998). The AR(1) dynamic panel data model is given by
y_it = g*y_i,t-1 + x_it*b + a_i + e_it
Arellano & Bond (1991) propose estimating this model by FD to remove a_i and then GMM using instruments to deal with the endogeneity of the lagged dependent variable. This gives rise to moment conditions of the form
E[z'Δe] = E[z'(Δy_it - g*Δy_i,t-1 - Δx_it*b)] = 0
where z is a list of all (internal and external) instruments that are orthogonal to Δe.
Blundell & Bond then propose to add more moment conditions. These include
E[w'(a+e)] = E[w'(y_it - g*y_i,t-1 - x_it*b)] = 0
where w includes any additional (internal or external) instruments that are orthogonal to the composite error in levels, a_i + e_it.
Despite some moment conditions coming from the model specification in FD and some coming from the model specification in levels, all moment conditions can be stacked and used in the estimation.
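As a purely schematic illustration of the stacking (not Blundell & Bond's actual implementation), suppose that for each unit i the instrument blocks Z_diff_i (orthogonal to the differenced error) and Z_lev_i (orthogonal to the levels error) have already been built; constructing those blocks is where the real work is, and it is omitted here. The per-unit moment contribution is then just the two pieces stacked on top of one another:

import numpy as np

def stacked_moments_i(g, b, y_i, x_i, Z_diff_i, Z_lev_i):
    """Schematic per-unit moment contribution for system GMM.

    y_i is (T,), x_i is (T, K); Z_diff_i and Z_lev_i are instrument
    matrices with rows conformable to the differenced and levels
    residuals below (their construction is omitted).
    """
    # Levels (composite) residual a_i + e_it for t = 2, ..., T
    u = y_i[1:] - g * y_i[:-1] - x_i[1:] @ b
    # Differenced residual for t = 3, ..., T (the first difference of u)
    de = np.diff(u)
    # Stack the FD moments on top of the levels moments
    return np.concatenate([Z_diff_i.T @ de, Z_lev_i.T @ u])

# Averaging these stacked contributions over i gives the sample moment
# vector, which then goes into the same weighted quadratic form as above.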
So ... happy stacking!