Can't Hack It!

As the academic year from hell draws to a close, the sun emerges, and vacations become possible again, we should all have a little spring in our step and a growing sense of optimism. Then along comes a new NBER working paper that puts a damper on the moods of many.


Aside from the topic itself, many comments have focused on the Instrumental Variable (IV) strategy used in the paper. 


Setting aside the specific model at hand, the authors are perfectly upfront about the fact that two instruments were considered, but only one was chosen after seeing the results of some specification tests. And that brings me to the subject of a post I have wanted to write for a while but have lacked the mental bandwidth to tackle.


The issue at hand is pre-test bias, a subject I have written about here and here. The notion is simple but, it seems, not well understood by many applied researchers. I, too, was grossly deficient in my understanding coming out of graduate school. It wasn't until I started interacting with a time series colleague at SMU that I became enlightened. Yes, time series can be useful. Occasionally.


As I said, the basic notion of pre-test bias is very simple. David Giles defines a pre-test strategy as arising when "we proceed in a sequential manner when drawing inferences about parameters." Following such a strategy often leads to biased inference. Researchers are predominantly aware of this as it relates to the idea of p-hacking. With p-hacking, researchers search over model specifications for significant results (i.e., p-values less than conventional levels of significance such as 0.10 or 0.05). Such results are unlikely to hold up to scrutiny as they are more than likely due to pure chance. 

While it is naïve to think that p-hacking does not occur despite researchers' awareness of the issue, other pre-test strategies remain common and out in the open due to a lack of awareness. The paper mentioned above typifies perhaps the most common situation: IV specification testing. When using IV, researchers often try numerous instruments, retaining results based on instruments that are strong and pass overidentification tests (if applicable). Let's call this F-hacking in reference to the first-stage F-test of instrument strength.
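
To fix ideas, here is a minimal sketch of the first-stage regression and the F-statistic that F-hacking conditions on. It uses Python with statsmodels on made-up data; none of the variable names or numbers come from the paper in question.

import numpy as np
import statsmodels.api as sm

# Made-up data: x is the endogenous regressor, z is a candidate instrument
rng = np.random.default_rng(0)
n = 500
z = rng.standard_normal(n)
x = 0.1 * z + rng.standard_normal(n)

# First stage: regress the endogenous regressor on a constant and the instrument
first_stage = sm.OLS(x, sm.add_constant(z)).fit()
print(first_stage.fvalue)  # the F-statistic that gets compared to the rule-of-thumb value of ten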

Two recent papers reviewing IV applications suggest that F-hacking is a severe problem. Andrews et al. (2019) provide the following histogram of published F-statistics. Given the rule-of-thumb critical value of ten associated with Stock et al. (2002), this histogram surely did not arise by chance.


Frankly, it looks like researchers are giving us the bird. Too cynical? 

Brodeur et al. (2020) perform a similar analysis, showing not only the disproportionate number of studies with F-statistics just above ten, but also evidence of p-hacking in the second stage.


Some may see this and think, "What's the big deal?" F-hacking seems a lot less problematic than p-hacking since (i) the choice of instrument is not the choice of a final estimate and (ii) a strong and valid instrument is necessary for an IV estimator with good properties.


Well, F-hacking can be just as problematic. As Mark Schaffer mentioned in a tweet, Hall et al. (1996) made this point more than two decades ago. Jeff Wooldridge made a similar point in a recent tweet as well. Bi et al. (2020) do as well.

To illustrate the issue, I conducted a brief simulation, similar to the one in Hall et al. (1996). The data-generating process (DGP) is as follows.

X, Z1,...,Z100, e ~ N(0,1)
Corr(X,e) = 0.8
Corr(X,Z1) = 0.1
Corr(X,Zk) = 0, k=2,...,100
Corr(Zk,e) = 0, k=1,...,100
Y = bX + e

In other words, X is an endogenous regressor. Z1 is a valid instrument. Zk, k=2,...,100, are irrelevant instruments: they satisfy the exclusion restriction but are uncorrelated with X. The true value of b is set to zero.
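
For concreteness, here is a minimal sketch of how such a sample could be drawn. The function name and implementation details are mine, not taken from the posted code.

import numpy as np

def simulate_dgp(n=500, b=0.0, k=100, seed=None):
    """Draw one sample (Y, X, Z1,...,Zk) from the DGP described above."""
    rng = np.random.default_rng(seed)
    dim = k + 2  # X, Z1,...,Zk, e
    corr = np.eye(dim)
    corr[0, 1] = corr[1, 0] = 0.1    # Corr(X, Z1) = 0.1
    corr[0, -1] = corr[-1, 0] = 0.8  # Corr(X, e) = 0.8
    draws = rng.multivariate_normal(np.zeros(dim), corr, size=n)
    X, Z, e = draws[:, 0], draws[:, 1:-1], draws[:, -1]
    Y = b * X + e
    return Y, X, Z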

I simulate 1000 data sets, with 500 observations in each. I compare several estimation strategies.

1. IV using Z1 to instrument for X.
2. IV using Z1 to instrument for X, but only if the first-stage F-statistic exceeds ten.
3. IV using Zk* to instrument for X, where Zk* is the instrument -- chosen from Z1,...,Z100 -- that yields the largest first-stage F-statistic.
 
Strategy 1 entails no pre-testing. Instead it relies only on the researcher's institutional knowledge that Z1 is a valid (albeit somewhat weak) instrument in the population. Strategy 2 is similar to Strategy 1, but takes the file-drawer approach of squashing the paper if the instrument is not strong in the sample. This strategy was mentioned by Scott Cunningham in a tweet. Strategy 3 searches for the strongest instrument among a set of candidate instruments. No institutional knowledge is relied upon beyond the construction of the candidate set.
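
A sketch of the three strategies, building on simulate_dgp above, might look like the following. Again, this is my own illustrative code, not the posted code.

import numpy as np
import statsmodels.api as sm

def iv_just_identified(Y, X, z):
    """Just-identified IV estimate of b with a conventional homoskedastic standard error."""
    y, x, zc = Y - Y.mean(), X - X.mean(), z - z.mean()
    b_iv = (zc @ y) / (zc @ x)
    resid = y - b_iv * x
    sigma2 = resid @ resid / (len(Y) - 2)
    se = np.sqrt(sigma2 * (zc @ zc)) / abs(zc @ x)
    return b_iv, se

def first_stage_F(X, z):
    """First-stage F-statistic from regressing X on a constant and z."""
    return sm.OLS(X, sm.add_constant(z)).fit().fvalue

def strategy_1(Y, X, Z):
    return iv_just_identified(Y, X, Z[:, 0])        # always use Z1

def strategy_2(Y, X, Z, cutoff=10.0):
    if first_stage_F(X, Z[:, 0]) <= cutoff:         # file-drawer the sample
        return None
    return iv_just_identified(Y, X, Z[:, 0])

def strategy_3(Y, X, Z):
    F_stats = [first_stage_F(X, Z[:, j]) for j in range(Z.shape[1])]
    j_star = int(np.argmax(F_stats))                # pick the "strongest" instrument
    return iv_just_identified(Y, X, Z[:, j_star])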

The results are here:

Strategy 1 is the only median-unbiased strategy, although the dispersion of its estimates is greater and it over-rejects the null b=0. Strategy 2 fails to produce any inference in more than 80% of the samples (i.e., the researcher would file-drawer the results), and the retained results drastically over-reject the null b=0. Strategy 3 fares even worse. The median bias is even greater, and the null b=0 is rejected more than 70% of the time.
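
Putting the pieces together, a Monte Carlo loop in the spirit of these results might look like this. Treat it as a sketch rather than a replication; the exact numbers will depend on seeds and implementation details.

summaries = {1: [], 2: [], 3: []}
for rep in range(1000):
    Y, X, Z = simulate_dgp(n=500, b=0.0, seed=rep)
    for s, strategy in [(1, strategy_1), (2, strategy_2), (3, strategy_3)]:
        out = strategy(Y, X, Z)
        if out is not None:                         # Strategy 2 may file-drawer the sample
            b_hat, se = out
            summaries[s].append((b_hat, abs(b_hat / se) > 1.96))

for s, draws in summaries.items():
    b_hats, rejects = zip(*draws)
    print(f"Strategy {s}: median estimate {np.median(b_hats):.3f}, "
          f"rejection rate of b=0 {np.mean(rejects):.3f}, "
          f"samples retained {len(draws)}")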

So, what's the takeaway? No, it is not to file-drawer IV as an estimation strategy! Yes, IV is finicky. So is every estimator when the required assumptions are not met. IMHO, an estimator should never be rejected out of hand when it has the potential to work so beautifully in many circumstances.


Moreover, what are you going to do instead? Write another difference-in-differences paper? After pre-testing for the pre-sence of pre-trends? Methinks not! And I have Roth (2021) to back me up.


Instead, researchers should use their institutional knowledge (dare I suggest draw a DAG?) and that knowledge only to choose instruments. Specification tests should still be performed, but not for the purpose of altering the estimation procedure. Moreover, as Mark Schaffer suggests in the tweet above, perhaps institutional knowledge combined with weak-IV robust inference provides some layer of protection against pre-test bias, as there is no (direct) need to pre-test instrument strength; weak instruments just show up as wider confidence intervals. However, researchers can still CI-hack in search of narrower confidence intervals.
Bi et al. (2020) offer an alternative solution in a recent working paper.
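
To give a flavor of the weak-IV robust route Schaffer mentions, here is a rough Anderson-Rubin sketch for the single-instrument case, reusing the simulated data and imports from the sketches above. Only OLS is needed; the grid search is purely illustrative.

Y, X, Z = simulate_dgp(n=500, b=0.0, seed=1)

def anderson_rubin_pvalue(Y, X, z, b0):
    """Anderson-Rubin test of H0: b = b0 with a single (assumed exogenous) instrument."""
    u0 = Y - b0 * X                                 # structural residual under the null
    reg = sm.OLS(u0, sm.add_constant(z)).fit()
    return reg.pvalues[1]                           # under H0, z should have no explanatory power

# A 95% AR confidence set: all b0 not rejected at the 5% level (crude grid search;
# the set can be wide, or even unbounded, when the instrument is weak)
grid = np.linspace(-2, 2, 401)
ar_set = [b0 for b0 in grid if anderson_rubin_pvalue(Y, X, Z[:, 0], b0) > 0.05]
if ar_set:
    print(min(ar_set), max(ar_set))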

The larger point, even beyond IV, is that applied researchers need to be wary of pre-tests. While following the yellow brick road worked out alright for Dorothy, researchers should avoid following sequential testing paths when conducting inference.


Code is available here.

UPDATE (5.21.21)

Mark Schaffer also pointed out another possible solution based on the LASSO and post-LASSO as proposed in Belloni et al. (2012). This automates the instrument selection procedure under the assumption of sparsity (as in my DGP above, where only Z1 is a relevant instrument).
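
A rough sketch of that idea, reusing the simulated data and helper functions above: LassoCV stands in for the plug-in penalty that Belloni et al. actually derive, so this is only illustrative.

from sklearn.linear_model import LassoCV

# LASSO first stage over all 100 candidate instruments (cross-validated penalty for simplicity;
# Belloni et al. (2012) recommend a specific plug-in penalty instead)
lasso = LassoCV(cv=5).fit(Z, X)
selected = np.flatnonzero(lasso.coef_)

if selected.size > 0:
    # Post-LASSO: re-fit the first stage by OLS on the selected instruments and use its
    # fitted values as the instrument for X (equivalent to 2SLS on the selected set)
    post_lasso = sm.OLS(X, sm.add_constant(Z[:, selected])).fit()
    b_hat, se = iv_just_identified(Y, X, post_lasso.fittedvalues)
    print(b_hat, se)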

References 

Andrews, I., J. Stock, and L. Sun (2019), "Weak Instruments in IV Regression: Theory and Practice," Annual Review of Economics, 11, 727-753.

Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012), "Sparse Models and Methods for Optimal Instruments With an Application to Eminent Domain," Econometrica, 80, 2369-2429.

Bi, N., H. Kang, and J. Taylor (2020), "Inferring Treatment Effects After Testing Instrument Strength in Linear Models," unpublished manuscript.

Brodeur, A., N. Cook, and A. Heyes (2020), "Methods Matter: P-Hacking and Causal Inference in Economics and Finance," American Economic Review, 110, 3634-3660.

Giles, J.A. and D.E.A. Giles (1993), "Pre-test Estimation and Testing in Econometrics: Recent Developments," Journal of Economic Surveys, 7, 145-197.

Hall, A., G. Rudebusch, and D. Wilcox (1996), "Judging Instrument Relevance in Instrumental Variables Estimation," International Economic Review, 37, 283-298.

Roth, J. (2021), "Pre-test with Caution: Event-study Estimates After Testing for Parallel Trends," unpublished manuscript.

Stock, J.H., J.H. Wright, and M. Yogo (2002), "A Survey of Weak Instruments and Weak Identification in Generalized Method of Moments," Journal of Business & Economic Statistics, 20, 518-529.





