Eye on the Prize

So much of this week was ugly and infuriating. 


The suppression of voting rights is real and, quite literally, out there in the open. The pandemic has no end in sight and will likely get worse before it gets better. If it gets better. 

But, you know what else is ugly and infuriating? Empirical researchers using the wrong tool for the job. One particular example came up in a small discussion on #EconTwitter this week with my part-time antagonist, part-time protagonist. 


In this example, let's say we all agree that the true data-generating process (DGP) is

Y = a + bX + e.

How should one estimate the model?


The answer is: ¯\_(ツ)_/¯. 

We need more information. But, what information? Probably the first thing that went through your mind is that you need to know whether Cov(X,e) is zero or non-zero. If you are old school, you might also wonder about the properties of e itself. These factors might influence your choice among Ordinary Least Squares (OLS), Generalized Least Squares (GLS), and Instrumental Variables (IV).

So, let's assume we all agree that, in addition to the linear equation above, that

Cov(X,e) ≠ 0
Cov(Z,e) = 0
Cov(X,Z) >> 0
e ~ N(0,s)

Now, how should one estimate the model?


The answer is: ¯\_(ツ)_/¯. 

We need more information! But, what information? We need to know what the objective is. What are we trying to learn? We cannot possibly identify the right tool for the job if we don't know what the job is.

So, what's the objective? The objective is usually one of three things for empirical researchers. 

1. Learn something about the unknown population parameters, a and b.
2. Predict the value of Y for observations in the sample, Y-hat.
3. Predict the value of Y for observations out of the sample, Y-hat.

More formally, the DGP given above implies that

E[Y|X] = a + bX + E[e|X],

where E[e|X] = 0 under the assumptions of the Classical Linear Regression Model but not in the above DGP, and the marginal effect of X is given by

d(E[Y|X])/dX = b + d(E[e|X])/dX,

where d(E[e|X])/dX = 0 under the assumptions of the Classical Linear Regression Model but not in the above DGP. Thus, Objective 1 is to learn something about the population parameters (which in part relate to the derivative of the conditional expectation function (CEF)). Objectives 2 and 3 are to learn something about the CEF evaluated at particular values of X, which may or may not be observed in sample.
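
To make this concrete under the distributional setup used in the simulations below, suppose X and e are jointly normal with mean zero. Then

E[e|X] = [Cov(X,e)/Var(X)] X,

so the slope of the CEF is b + Cov(X,e)/Var(X) rather than b. The CEF remains a perfectly well-defined target for prediction; it just no longer coincides with the structural parameter b.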


The difference between these objectives, particularly 1 versus 2 and 3, is huge. It is a distinction that has garnered much attention of late, as we are living through both the age of big data and machine learning and the age of the credibility revolution. While machine learning is evolving rapidly, and I am by no means an expert, the bulk of machine learning is focused on Objectives 2 and 3, whereas the credibility revolution is focused on Objective 1. 

Clearly, there is a lot of new work at the forefront that seeks to combine these two ages into one big love fest.


More power to these individuals. But, for an excellent introduction to keeping the distinction among these objectives clear, I highly recommend the paper by Mullainathan & Spiess (2017).

Returning to this week's #EconTwitter, the discussion focused on whether obtaining an estimate of b that has a causal interpretation (i.e., using an estimator that is consistent for b) leads to better predictions (i.e., estimates of the CEF at different values of X).

To put this in context, let us return to the full DGP outlined above. X is endogenous. There is a constant marginal effect, b. There exists a strong and valid instrument, Z. Clearly, OLS is biased and inconsistent for b, whereas IV is biased but consistent. So, if Objective 1 is the goal, then IV is the right tool for the job (or at least better than OLS).
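
A quick way to see this in the single-regressor case:

plim b-hat_OLS = b + Cov(X,e)/Var(X),

which, given the standard normal draws with Corr(X,e) = 0.25 used in the simulations below, converges to b + 0.25. The analogous expression for IV replaces Cov(X,e)/Var(X) with Cov(Z,e)/Cov(Z,X) = 0, which is exactly why IV recovers b.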

What if Objective 2 or 3 is the goal? The answer is, you guessed it, ¯\_(ツ)_/¯. 


We need more information! We need to know two things. First, are we interested in estimating the CEF at values of X in the estimation sample or in a hold-out sample? Second, if we are interested in estimating the CEF in a hold-out sample, is the DGP the same in both the estimation and hold-out samples?

The intuition, at least to me, is simple. When the objective is to learn b, endogeneity is bad as we need to separate out the effect of X on Y from the effect of e on Y. However, when the objective is to predict Y, we don't care about the individual components of Y; we only care about learning everything we can about all of the determinants of Y, including E[e|X]. In this case, endogeneity allows us to learn something about e when we know X. This can be useful information when predicting Y ... if what we learn about e through X in the estimation sample also holds true in the set of observations for which we are trying to predict Y.
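
In the jointly normal setting above, this intuition can be written down directly. The minimum mean squared error predictor of Y given X is the CEF,

E[Y|X] = a + [b + Cov(X,e)/Var(X)] X,

which is precisely what OLS converges to, while IV converges to a + bX and discards the E[e|X] piece. So, if Cov(X,e) is the same in the hold-out sample as in the estimation sample, OLS should predict better; if the sign of Cov(X,e) flips, the extra piece OLS has absorbed points in the wrong direction and IV should do better.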

To illustrate this, I conduct a simple Monte Carlo exercise. The code is available here. The DGP is given by

Y_i = a + bX_i + e_i, i = 1,...,N0+N1

where i=1,...,N0 is the estimation sample and i=N0+1,...,N0+N1 is the hold-out sample. In other words, N0 observations are used for estimation and N1 observations are used to assess out-of-sample predictions.

For the N0 observations in the estimation sample, X and e are drawn from a bivariate standard normal distribution with a correlation of 0.25. Thus, X is endogenous. However, I also simulate a strong and valid instrumental variable, Z, that is also standard normal and has a correlation with X of 0.5.

For the N1 observations in the hold-out sample, I vary the correlation between X and e. I use five different DGPs, setting this correlation to {0.5, 0.25, 0, -0.25, -0.5}. These are referred to as DGPs 1-5, respectively, below. So, DGP2 has the same correlation between X and e in the estimation and hold-out samples; the correlation differs (in both directions) across the other four DGPs.

For each DGP, I set N0 to 1000 and N1 to 100 and run 10,000 simulations. For each DGP and simulation, I compute the in-sample prediction errors

e-hat_i = Y_i - Y-hat_i, i = 1,...,N0

and the out-of-sample prediction errors

e-hat_i = Y_i - Y-hat_i, i = N0+1,...,N0+N1

where the model is estimated via OLS and IV. I then compute the root mean squared error (RMSE) and the mean absolute error (MAE) for each simulation. The tables below then report the mean RMSE and mean MAE across the 10,000 simulations for each DGP.
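
For readers who want to play with the mechanics without opening the linked code, here is a minimal sketch of the exercise in Python. It is not the original code: the true intercept and slope are not stated above, so setting a = 1 and b = 1 is my assumption (as is the seed), and for brevity it reports only the out-of-sample errors, which is where the five DGPs differ.

```python
# Minimal sketch of the Monte Carlo exercise described above (not the linked code).
# Assumptions: true parameters a = 1, b = 1; the seed; out-of-sample errors only.
import numpy as np

rng = np.random.default_rng(0)

a, b = 1.0, 1.0                      # assumed true intercept and slope
N0, N1, reps = 1000, 100, 10_000     # estimation obs, hold-out obs, simulations
holdout_corrs = [0.5, 0.25, 0.0, -0.25, -0.5]   # Corr(X, e) in DGPs 1-5

def draw_estimation(n):
    # (X, e, Z) standard normal with Corr(X,e)=0.25, Corr(X,Z)=0.5, Corr(Z,e)=0
    cov = [[1, 0.25, 0.5], [0.25, 1, 0], [0.5, 0, 1]]
    x, e, z = rng.multivariate_normal([0, 0, 0], cov, size=n).T
    return x, e, z

def draw_holdout(n, rho):
    # (X, e) standard normal with Corr(X,e) = rho; no instrument needed here
    x, e = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
    return x, e

def ols(y, x):
    bhat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y.mean() - bhat * x.mean(), bhat

def iv(y, x, z):
    bhat = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
    return y.mean() - bhat * x.mean(), bhat

results = {rho: {"OLS": [], "IV": []} for rho in holdout_corrs}
for _ in range(reps):
    x0, e0, z0 = draw_estimation(N0)
    y0 = a + b * x0 + e0
    fits = {"OLS": ols(y0, x0), "IV": iv(y0, x0, z0)}
    for rho in holdout_corrs:
        x1, e1 = draw_holdout(N1, rho)
        y1 = a + b * x1 + e1
        for name, (ahat, bhat) in fits.items():
            err = y1 - (ahat + bhat * x1)
            results[rho][name].append((np.sqrt(np.mean(err ** 2)), np.mean(np.abs(err))))

for i, rho in enumerate(holdout_corrs, start=1):
    for name in ("OLS", "IV"):
        rmse, mae = np.mean(results[rho][name], axis=0)
        print(f"DGP{i} Corr(X,e)={rho:+.2f}  {name:3s}  RMSE={rmse:.3f}  MAE={mae:.3f}")
```

Because the true a and b cancel out of the prediction errors, the qualitative pattern described below should not depend on the assumed parameter values.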


What do we learn? OLS predicts better in-sample. OLS continues to do better in the hold-out sample in DGPs 1 and 2, where the correlation between X and e is in the same direction as in the estimation sample. However, in DGPs 3-5 the reverse is true.


To see the (random) forest for the trees, this leads to two important conclusions. First, ignoring causal inference is (sometimes, if not often) preferable when the objective is prediction. In many (most?) cases, one would think that the covariance between X and e at least has the same sign in the hold-out sample as in the estimation sample. 


Second, just because an estimator yields good predictions doesn't mean it provides good estimates of marginal effects. This second point is equally important. Assessing goodness of fit is not, in any way, a useful tool for understanding whether the parameters one has estimated have causal interpretations.

So, remember, you gotta keep your eye on the prize!

UPDATES (8.15.20)

Thanks to Mark Schaffer for pointing out this older blog post by Francis X. Diebold, which in turn references another prior blog post by Wasserman.


Thanks as well to David Childers for suggesting this video presentation by Jonas Peters.

UPDATES (8.17.20)

Thanks to Hugo Jales for pointing out that E[Y|X] = a + bX + E[e|X] in the DGP above since Cov(e,X) ≠ 0.

UPDATES (8.24.20)

Grant McDermott re-did my simulations in R. The code is available here. Thanks, Grant!

References

Mullainathan, S. and J. Spiess (2017), "Machine Learning: An Applied Econometric Approach," Journal of Economic Perspectives, 31(2), 87-106.
