There is Exogeneity, and Then There is Strict Exogeneity
Panel data has become quite abundant. As a result, fixed effects models have become prevalent in applied research. But, this week I handled a paper at one of the journals for which I am on the editorial board and was reminded of a mistake that I have seen all too frequently over the years. Most researchers are probably aware of the issue I am highlighting, but indulge me.
In a cross-sectional regression model, we have
y_i = a + b*x_i + e_i
OLS is unbiased if x is exogenous, which requires that Cov(x_i,e_i) = 0. In other words, the covariates need to be uncorrelated with the error term from the same time period.
In contrast, in a panel regression model with individual effects, we have
y_it = a_i + b*x_it + e_it
If we wish to allow for the possibility that Cov(a_i,x_it) differs from zero, then a_i is a "fixed" effect instead of a "random" effect. To understand what we need to assume to obtain unbiased estimates of b, we need to understand how the model is estimated.
There are two common methods: mean-differencing and first-differencing. Both approaches entail transforming the model to eliminate the individual effects from the estimating equation, and then applying OLS to the transformed model. However, there is a difference between them. Typically, researchers simply state that they estimate a "fixed effects" model without being specific. Let's look at each in turn.
Mean-differencing (aka, the "within" estimator) is obtained by applying OLS to
(y_it - y_i) = b*(x_it - x_i) + (e_it - e_i)
where y_i is the mean value of y for observation i over the T time periods, and likewise for x_i and e_i. As in the cross-sectional case, unbiased estimates require the covariate to be uncorrelated with the error term. But, now, this requires Cov(x_it - x_i,e_it - e_i) = 0. Since x_i is a function of x from every time period for observation i, and likewise for e_i, unbiasedness (in practice) requires the covariates from every time period to be uncorrelated with e from every time period. This is referred to as strict exogeneity, and is clearly a much stronger assumption than simply assuming that Cov(x_it,e_it) = 0.
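To see why every period enters, the covariance condition can be expanded term by term. The following is a sketch, writing \bar{x}_i and \bar{e}_i for the individual means that the text denotes x_i and e_i, and indexing the T periods as 1,...,T (an indexing choice made here for illustration):

```latex
\begin{align*}
\operatorname{Cov}(x_{it}-\bar{x}_i,\; e_{it}-\bar{e}_i)
 &= \operatorname{Cov}(x_{it}, e_{it})
  - \operatorname{Cov}(x_{it}, \bar{e}_i)
  - \operatorname{Cov}(\bar{x}_i, e_{it})
  + \operatorname{Cov}(\bar{x}_i, \bar{e}_i), \\
\bar{x}_i &= \frac{1}{T}\sum_{s=1}^{T} x_{is}, \qquad
\bar{e}_i = \frac{1}{T}\sum_{s=1}^{T} e_{is}.
\end{align*}
```

Each of the last three terms is an average of Cov(x_is, e_ir) over pairs of periods s and r, including pairs with s different from r. Setting only the contemporaneous covariance to zero therefore does not, in general, make the whole expression vanish.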
First-differencing requires something a bit weaker, technically. First-differencing is obtained by applying OLS to
(y_it - y_i,t-1) = b*(x_it - x_i,t-1) + (e_it - e_i,t-1)
Again, unbiased estimates require the covariate to be uncorrelated with the error term. But, now, this requires Cov(x_it - x_i,t-1,e_it - e_i,t-1) = 0. So, unbiasedness (in practice) requires Cov(x_it,e_it) = 0, Cov(x_it,e_i,t-1) = 0, and Cov(x_i,t-1,e_it) = 0, which implies that the covariates must be uncorrelated with e from the contemporaneous period, the preceding period, and the subsequent period. This is a slightly weaker version of strict exogeneity since it does not require a lack of correlation over longer time spans.
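The two transformations can be sketched in a few lines of Python. This is an illustration only, not the code behind the results in this post: the function names, the data-generating process, and the sample sizes are assumptions chosen so that x is correlated with a_i but strictly exogenous with respect to e.

```python
import numpy as np

def within_estimator(y, x):
    """Slope from OLS on individual-demeaned data; y and x are (N, T) arrays."""
    y_dm = y - y.mean(axis=1, keepdims=True)
    x_dm = x - x.mean(axis=1, keepdims=True)
    return (x_dm * y_dm).sum() / (x_dm ** 2).sum()

def first_difference_estimator(y, x):
    """Slope from OLS (no intercept) on first-differenced data."""
    dy = np.diff(y, axis=1)
    dx = np.diff(x, axis=1)
    return (dx * dy).sum() / (dx ** 2).sum()

def pooled_ols(y, x):
    """Slope from pooled OLS with a common intercept, ignoring a_i."""
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc * yc).sum() / (xc ** 2).sum()

# Hypothetical design: a_i shifts x, so Cov(a_i, x_it) != 0,
# while e is drawn independently of x in every period (strict exogeneity).
rng = np.random.default_rng(0)
N, T, b = 2000, 5, 1.0
a = rng.normal(size=(N, 1))
x = a + rng.normal(size=(N, T))
e = rng.normal(size=(N, T))
y = a + b * x + e

b_within = within_estimator(y, x)
b_fd = first_difference_estimator(y, x)
b_pooled = pooled_ols(y, x)
print(f"within: {b_within:.3f}  first-difference: {b_fd:.3f}  pooled: {b_pooled:.3f}")
```

In this design both transformed estimators should land close to the true b of 1, while pooled OLS, which leaves a_i in the error term, should not. The point of the sketch is only that both transformations remove a_i; neither does anything about correlation between x and e.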
The first point to make here, which is not my main point, is that researchers should, generally speaking, pay more attention to justifying the assumption of strict exogeneity when applying panel data models.
But my main point ...
is that when one may be concerned about the endogeneity of one or more covariates, lagging covariates in a fixed effects model (estimated using either the mean- or first-differencing estimator) is (in practice) never a solution!
For example, if one replaces x_it with its lag, the transformed model after mean-differencing is
(y_it - y_i) = b*(x_i,t-1 - x_i,-1) + (e_it - e_i)
where x_i,-1 is the average of x_it over the periods 0,...,T-1. Thus, unbiased estimates require Cov(x_i,t-1 - x_i,-1,e_it - e_i) = 0, which still requires strict exogeneity since x_i,-1 and e_i contain (essentially) x and e from every time period. A similar story emerges if one uses first-differencing.
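A worked case makes this concrete. It rests on assumptions introduced here for illustration: Cov(x_is, e_ir) equals some constant \sigma_{xe} when s = r and zero otherwise, so the lagged covariate is uncorrelated with the current error; \bar{x}_{i,-1} is the mean of x over periods 0,...,T-1; and \bar{e}_i is the mean of e over periods 1,...,T. For an interior period 2 <= t <= T-1:

```latex
\begin{align*}
\operatorname{Cov}(x_{i,t-1}-\bar{x}_{i,-1},\; e_{it}-\bar{e}_i)
 &= \underbrace{\operatorname{Cov}(x_{i,t-1}, e_{it})}_{0}
  - \underbrace{\operatorname{Cov}(x_{i,t-1}, \bar{e}_i)}_{\sigma_{xe}/T}
  - \underbrace{\operatorname{Cov}(\bar{x}_{i,-1}, e_{it})}_{\sigma_{xe}/T}
  + \underbrace{\operatorname{Cov}(\bar{x}_{i,-1}, \bar{e}_i)}_{(T-1)\sigma_{xe}/T^{2}} \\
 &= -\frac{(T+1)\,\sigma_{xe}}{T^{2}} \neq 0
 \quad \text{whenever } \sigma_{xe} \neq 0 .
\end{align*}
```

Lagging zeroes out only the first term. The contemporaneous correlation re-enters through the two means, and it shrinks with T rather than disappearing.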
To illustrate that lagging offers no solution, I conduct some simulations. I have one covariate, and Corr(x_it,e_it) = 0.5. In case you think serial correlation matters, I generate the data such that neither x nor e is serially correlated. I consider N = 100, 1000 and T = 3, 10, 50. For all six combinations, I simulate 250,000 data sets and examine the estimated value of b, the bias, mean absolute error (MAE), and mean squared error (MSE) when the covariate is x_it and when it is x_i,t-1. The true value of b is 1. Lo and behold, it's not good.
| Covariate | N    | T  | Mean(b) | Bias   | MAE   | MSE   |
|-----------|------|----|---------|--------|-------|-------|
| x_it      | 100  | 3  | 1.500   | 0.500  | 0.500 | 0.254 |
| x_it      | 100  | 10 | 1.500   | 0.500  | 0.500 | 0.251 |
| x_it      | 100  | 50 | 1.500   | 0.500  | 0.500 | 0.250 |
| x_it      | 1000 | 3  | 1.500   | 0.500  | 0.500 | 0.250 |
| x_it      | 1000 | 10 | 1.500   | 0.500  | 0.500 | 0.250 |
| x_it      | 1000 | 50 | 1.500   | 0.500  | 0.500 | 0.250 |
| x_i,t-1   | 100  | 3  | -0.750  | -1.750 | 1.750 | 3.088 |
| x_i,t-1   | 100  | 10 | -0.167  | -1.167 | 1.167 | 1.365 |
| x_i,t-1   | 100  | 50 | -0.031  | -1.031 | 1.031 | 1.063 |
| x_i,t-1   | 1000 | 3  | -0.750  | -1.750 | 1.750 | 3.065 |
| x_i,t-1   | 1000 | 10 | -0.167  | -1.167 | 1.167 | 1.361 |
| x_i,t-1   | 1000 | 50 | -0.031  | -1.031 | 1.031 | 1.062 |
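For readers who want to experiment, here is a minimal Python sketch of a simulation along these lines. It is not the code linked in the note below, and several details are assumptions on my part: x and e are drawn as bivariate normal with unit variances, a_i is drawn independently, y is generated from the contemporaneous x_it, the lagged regression uses the T-1 periods that remain after lagging, and only 100 replications are run per design to keep it fast.

```python
import numpy as np

def within_slope(y, x):
    """Within (mean-differenced) OLS slope for (N, T) arrays."""
    y_dm = y - y.mean(axis=1, keepdims=True)
    x_dm = x - x.mean(axis=1, keepdims=True)
    return (x_dm * y_dm).sum() / (x_dm ** 2).sum()

def simulate(N, T, reps=100, b=1.0, rho=0.5, seed=0):
    """Return within estimates using x_it and using x_i,t-1 as the covariate."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    est_now = np.empty(reps)
    est_lag = np.empty(reps)
    for r in range(reps):
        # Independent draws across i and t: no serial correlation in x or e,
        # but Corr(x_it, e_it) = rho within each period.
        draws = rng.multivariate_normal(np.zeros(2), cov, size=(N, T))
        x, e = draws[..., 0], draws[..., 1]
        a = rng.normal(size=(N, 1))          # individual effects (assumed)
        y = a + b * x + e
        est_now[r] = within_slope(y, x)
        # Pair y in periods 2..T with x from periods 1..T-1.
        est_lag[r] = within_slope(y[:, 1:], x[:, :-1])
    return est_now, est_lag

def summarize(est, b=1.0):
    err = est - b
    return {
        "mean": est.mean(),
        "bias": err.mean(),
        "mae": np.abs(err).mean(),
        "mse": (err ** 2).mean(),
    }

results = {}
for N in (100, 1000):
    for T in (3, 10, 50):
        now, lag = simulate(N, T)
        results[(N, T)] = (summarize(now), summarize(lag))
        for label, s in zip(("x_it", "x_i,t-1"), results[(N, T)]):
            print(f"{label:8s} N={N:<5d} T={T:<3d} mean={s['mean']:7.3f} "
                  f"bias={s['bias']:7.3f} mae={s['mae']:6.3f} mse={s['mse']:6.3f}")
```

Under these assumptions the estimates with x_it should centre near 1.5, and those with x_i,t-1 near -1.5/(T-1), which lines up with the Mean(b) column in the table above. With far fewer replications than the 250,000 used for the table, the third decimal will wander.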
So, what to do if x_it is not strictly exogenous? First, let it out.
Second, consult the references below, as well as look for instruments...
Note: Code is available here: http://faculty.smu.edu/millimet/blog.html
References
Bellemare, M.F., Masaki, T. and T.B. Pepinsky (2017), "Lagged Explanatory Variables and the Estimation of Causal Effect," The Journal of Politics, 79, 949-963
Reed, W.R. (2015), "On the Practice of Lagging Variables to Avoid Simultaneity," Oxford Bulletin of Economics and Statistics, 77, 897-905