There is Exogeneity, and Then There is Strict Exogeneity

Panel data has become quite abundant. As a result, fixed effects models have become prevalent in applied research. But, this week I handled a paper at one of the journals for which I am on the editorial board and was reminded of a mistake that I have seen all too frequently over the years. Most researchers are probably aware of the issue I am highlighting, but indulge me.

In a cross-sectional regression model, we have

y_i = a + b*x_i + e_i

OLS is unbiased if x is exogenous, which requires that Cov(x_i,e_i) = 0. In other words, the covariate needs to be uncorrelated with the error term from the same observation (and hence the same time period).

In contrast, in a panel regression model with individual effects, we have

y_it = a_i + b*x_it + e_it

If we wish to allow for the possibility that Cov(a_i,x_it) differs from zero, then a_i is a "fixed" effect instead of a "random" effect. To understand what we need to assume to obtain unbiased estimates of b, we need to understand how the model is estimated.

There are two common methods: mean-differencing and first-differencing. Both approaches entail transforming the model to eliminate the individual effects from the estimating equation, and then applying OLS to the transformed model. However, there is a difference between them. Typically, researchers simply state that they estimate a "fixed effects" model without being specific. Let's look at each in turn.

Mean-differencing (aka, the "within" estimator) is obtained by applying OLS to

(y_it - y_i) = b*(x_it - x_i) + (e_it - e_i)

where y_i is the mean value of y for observation i over the T time periods, and likewise for x_i and e_i. As in the cross-sectional case, unbiased estimates require the covariate to be uncorrelated with the error term. But, now, this requires Cov(x_it - x_i,e_it - e_i) = 0. Since x_i is a function of x from every time period for observation i, and likewise for e_i, unbiasedness (in practice) requires the covariates from every time period to be uncorrelated with e from every time period. This is referred to as strict exogeneity, and is clearly a much stronger assumption than simply assuming that Cov(x_it,e_it) = 0.
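
To see why the contemporaneous condition alone is not enough, here is a minimal simulation sketch. The design is my own assumption for illustration, not something from the paper I handled: x_it responds to last period's shock, x_it = a_i + gamma*e_i,t-1 + v_it, so Cov(x_it,e_it) = 0 holds but Cov(x_i,t+1,e_it) does not. The function names and parameter values are hypothetical. By my calculation the large-N bias of the within estimator in this design is roughly -gamma/(T*(1 + gamma^2)), so it shrinks as T grows but is far from trivial when T is small.

```python
import numpy as np


def within_estimate(y, x):
    """Within (mean-differencing) slope for one covariate; y and x are (N, T) arrays."""
    y_dm = y - y.mean(axis=1, keepdims=True)  # y_it minus the unit mean of y
    x_dm = x - x.mean(axis=1, keepdims=True)  # x_it minus the unit mean of x
    return float((x_dm * y_dm).sum() / (x_dm ** 2).sum())


def simulate_feedback_panel(n, t, gamma=1.0, b=1.0, seed=0):
    """Assumed design: x_it is uncorrelated with e_it but reacts to e_i,t-1."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal((n, t + 1))  # column 0 is a pre-sample shock
    v = rng.standard_normal((n, t))
    a = rng.standard_normal((n, 1))      # individual effect, correlated with x
    x = a + gamma * e[:, :-1] + v        # x_it depends on e_i,t-1 only
    y = a + b * x + e[:, 1:]             # y_it = a_i + b*x_it + e_it
    return y, x


for t in (3, 10, 50):
    y, x = simulate_feedback_panel(n=20000, t=t)
    print(t, round(within_estimate(y, x), 3))
```

With the true b set to 1, the printed estimates should sit below 1 for small T and approach 1 as T grows, even though x_it and e_it are uncorrelated in every period.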

First-differencing requires something a bit weaker, technically. First-differencing is obtained by applying OLS to

(y_it - y_i,t-1) = b*(x_it - x_i,t-1) + (e_it - e_i,t-1)

Again, unbiased estimates require the covariate to be uncorrelated with the error term. But, now, this requires Cov(x_it - x_i,t-1,e_it - e_i,t-1) = 0. So, unbiasedness (in practice) requires Cov(x_it,e_it) = 0, Cov(x_it,e_i,t-1) = 0, and Cov(x_i,t-1,e_it) = 0, which implies that the covariates must be uncorrelated with e from the contemporaneous period, the preceding period, and the subsequent period. This is a slightly weaker version of strict exogeneity since it does not require a lack of correlation over longer time spans.
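
For anyone who wants to see where those conditions come from, expanding the covariance of the differences term by term gives the following (written in LaTeX, using only the symbols already defined above):

```latex
\begin{aligned}
\operatorname{Cov}(x_{it} - x_{i,t-1},\; e_{it} - e_{i,t-1})
  &= \operatorname{Cov}(x_{it}, e_{it}) - \operatorname{Cov}(x_{it}, e_{i,t-1}) \\
  &\quad - \operatorname{Cov}(x_{i,t-1}, e_{it}) + \operatorname{Cov}(x_{i,t-1}, e_{i,t-1})
\end{aligned}
```

The first and last terms are contemporaneous covariances, and the two middle terms link x to the error one period earlier and one period later. Setting each term to zero is sufficient; the terms could in principle cancel without each being zero, which is why I keep saying "in practice."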

The first point to make here, which is not my main point, is that researchers should, generally speaking, pay more attention to justifying the assumption of strict exogeneity when applying panel data models.

But my main point ...

is that when one may be concerned about the endogeneity of one or more covariates, lagging covariates in a fixed effects model (estimated using either the mean- or first-differencing estimator) is (in practice) never a solution!

For example, if one replaces x_it with its lag, the transformed model after mean-differencing is

(y_it - y_i) = b*(x_i,t-1 - x_i,-1) + (e_it - e_i)

where x_i,-1 is the average of x_it over the periods 0,...,T-1. Thus, unbiased estimates require Cov(x_i,t-1 - x_i,-1,e_it - e_i) = 0, which still requires strict exogeneity since x_i,-1 and e_i contain (essentially) x and e from every time period. A similar story emerges if one uses first-differencing.
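
To spell out the first-differencing case, here is the transformed model with the lagged covariate and the expansion of the relevant covariance (in LaTeX, same notation as above):

```latex
\begin{aligned}
y_{it} - y_{i,t-1} &= b\,(x_{i,t-1} - x_{i,t-2}) + (e_{it} - e_{i,t-1}) \\[4pt]
\operatorname{Cov}(x_{i,t-1} - x_{i,t-2},\; e_{it} - e_{i,t-1})
  &= \operatorname{Cov}(x_{i,t-1}, e_{it}) - \operatorname{Cov}(x_{i,t-1}, e_{i,t-1}) \\
  &\quad - \operatorname{Cov}(x_{i,t-2}, e_{it}) + \operatorname{Cov}(x_{i,t-2}, e_{i,t-1})
\end{aligned}
```

The second term on the right, Cov(x_i,t-1,e_i,t-1), is exactly the contemporaneous correlation that motivated lagging in the first place. Differencing puts it right back into the estimating equation.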

To illustrate the lack of solution that lagging offers, I conduct some simulations. I have one covariate, and Corr(x_it,e_it) = 0.5. In case you think serial correlation matters, I generate the data such that neither x nor e is serially correlated. I consider N = 100, 1000 and T = 3, 10, 50. For all six combinations, I simulate 250,000 data sets and examine the estimated value of b, the bias, mean absolute error (MAE), and mean squared error (MSE) when the covariate is x_it and when it is x_i,t-1. The true value of b is 1. Lo and behold, it's not good.

Covariate      N     T    Mean(b)     Bias      MAE      MSE
x_it         100     3      1.500    0.500    0.500    0.254
x_it         100    10      1.500    0.500    0.500    0.251
x_it         100    50      1.500    0.500    0.500    0.250
x_it        1000     3      1.500    0.500    0.500    0.250
x_it        1000    10      1.500    0.500    0.500    0.250
x_it        1000    50      1.500    0.500    0.500    0.250
x_i,t-1      100     3     -0.750   -1.750    1.750    3.088
x_i,t-1      100    10     -0.167   -1.167    1.167    1.365
x_i,t-1      100    50     -0.031   -1.031    1.031    1.063
x_i,t-1     1000     3     -0.750   -1.750    1.750    3.065
x_i,t-1     1000    10     -0.167   -1.167    1.167    1.361
x_i,t-1     1000    50     -0.031   -1.031    1.031    1.062
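
For readers who want to poke at this themselves, below is a rough Python sketch of a simulation in the same spirit. It is not the code behind the table. Several details are my assumptions: x and e are standard normal with correlation rho = 0.5, the individual effect is built from the unit mean of x plus noise, the first period is dropped when the lag is used, and the number of replications is tiny compared with 250,000. Function names are hypothetical. Under this reading of the design, my calculation puts the within estimate at about 1.5 when the covariate is x_it and about -1.5/(T-1) when it is x_i,t-1, which lines up with the Mean(b) column.

```python
import numpy as np


def within_slope(y, x):
    """Within (mean-differencing) slope for one covariate; y and x are (N, T) arrays."""
    y_dm = y - y.mean(axis=1, keepdims=True)
    x_dm = x - x.mean(axis=1, keepdims=True)
    return float((x_dm * y_dm).sum() / (x_dm ** 2).sum())


def one_draw(rng, n, t, rho=0.5, b=1.0):
    """One simulated panel (assumed design); returns slopes using x_it and x_i,t-1."""
    x = rng.standard_normal((n, t))
    # e has unit variance, Corr(x_it, e_it) = rho, and no serial correlation
    e = rho * x + np.sqrt(1.0 - rho ** 2) * rng.standard_normal((n, t))
    # individual effect correlated with x (an assumption made for illustration)
    a = x.mean(axis=1, keepdims=True) + rng.standard_normal((n, 1))
    y = a + b * x + e
    b_current = within_slope(y, x)
    # lagged covariate: pair y in periods 2..T with x in periods 1..T-1
    b_lagged = within_slope(y[:, 1:], x[:, :-1])
    return b_current, b_lagged


def monte_carlo(n, t, reps=200, seed=0):
    """Average the two slope estimates over reps simulated panels."""
    rng = np.random.default_rng(seed)
    draws = np.array([one_draw(rng, n, t) for _ in range(reps)])
    return draws.mean(axis=0)


for t in (3, 10, 50):
    mean_current, mean_lagged = monte_carlo(n=100, t=t)
    print(t, round(mean_current, 3), round(mean_lagged, 3))
```

Raising reps tightens the averages, and changing rho shows how the damage scales with the contemporaneous correlation.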

So, what to do if x_it is not strictly exogenous? First, let it out.

Second, consult the references below, as well as look for instruments...

Note: Code is available here: http://faculty.smu.edu/millimet/blog.html

References
Bellemare, M.F., Masaki, T. and T.B. Pepinsky (2017), "Lagged Explanatory Variables and the Estimation of Causal Effect," The Journal of Politics, 79, 949-963

Reed, W.R. (2015), "On the Practice of Lagging Variables to Avoid Simultaneity," Oxford Bulletin of Economics and Statistics, 77, 897-905
