Being sick and stuck at home makes me cranky. So, it seems like the perfect time to write a new post about something that has long irked me. Robustness checks.
Khoa Vu has tweeted many times about the absurd number of robustness checks in today's empirical papers. See, for example, here. Jokes abound about countless robustness requests by the infamous Referee 2, the length of appendices to NBER working papers, and the despair that researchers feel when a good paper comes crashing down due to failure of the 87th robustness check.
In my view, researchers have simply accepted this as the new reality of publishing, and we do not stop to think about things in sufficient detail. We need to realize that, just as with penguins, not all robustness checks are created equal.
When producing an empirical study, there is no doubt that countless decisions must be made along the way. It is impossible to appreciate this until it's your paper. However, Huntington-Klein et al. (2021) gave it their best effort by showing the range of results obtained when different researchers set out to replicate existing studies.
That said, a well-executed empirical study ought to make all these choices as clear as possible to the reader and ought to lead to a preferred model. This preferred model is often referred to as the baseline specification.
As such, the results obtained from the baseline specification ought to be the ones in which you have the most confidence. Robustness checks assess how the results would change if the choices made to arrive at the baseline specification had been different. Passing robustness checks is important. However, it does not deserve to be put on the pedestal it currently occupies.
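To make the terminology concrete, here is a minimal sketch (in Python, with simulated data and made-up variable names, not anything from an actual paper) of a baseline specification and one robustness check: the same regression is re-estimated after changing a single modeling choice, here the set of controls, and we ask how much the coefficient of interest moves.

```python
# Minimal illustration of a baseline specification vs. one robustness check.
# Data and variable names are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 500

d = rng.binomial(1, 0.5, n)      # treatment/variable of interest
x1 = rng.normal(size=n)          # control in the baseline specification
x2 = rng.normal(size=n)          # candidate control added in the check
y = 1.0 + 2.0 * d + 0.5 * x1 + rng.normal(size=n)

def ols(y, X):
    """OLS coefficients of y on X (X already includes a constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

const = np.ones(n)
beta_base = ols(y, np.column_stack([const, d, x1]))        # baseline
beta_check = ols(y, np.column_stack([const, d, x1, x2]))   # robustness check

print("coefficient on d, baseline:        ", round(beta_base[1], 3))
print("coefficient on d, robustness check:", round(beta_check[1], 3))
```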
Robustness checks are, when you think about it, a bit of an exercise in intellectual laziness. Yes, that is strong. As I said, being sick makes me cranky. But, as economists, we bring more than just research questions and data to empirical analysis. We also bring our expertise in modeling human behavior. This is what typically distinguishes us from statisticians. As a result, we ought to use these skills to justify our baseline specification as being the 'correct' specification.
If we conduct a robustness check by altering our baseline model in some manner and yet the results do not change, then this saves us from having to think deeply about our choice regarding that particular aspect of the model. It allows us to be lazy, not worry about this particular choice, and not have to devote too many brain cells to justifying our choice.
But what if we fail a robustness check? Oftentimes this leads the author to despair, readers to quickly dismiss the paper, and referees to recommend rejection. That, too, is lazy. Failing a robustness check is not, in and of itself, a reason to condemn a study. It should be the start of the discussion, not the end. The failure tells us that this particular modeling choice matters for the results. Now the intellectual work begins. Which model is 'right'?
Presumably, since you, the author, chose the baseline specification as your preferred specification, you trust that model more. It is now up to you to explain why. Explain that while the robustness check is useful information, it does not invalidate the paper if you do not believe the model in said robustness check.
This is where the idea that not all robustness checks (or penguins) are created equal comes into play. If the change made from the baseline specification to the model estimated in the robustness check is something about which economics, econometrics, or institutional knowledge provides no information, then we really have little way to justify the baseline specification as being more likely to produce the 'right' answer. This, then, is a critical robustness check.
However, if we do have information that points to the baseline specification as being preferred, then authors should fret not and readers/referees ought not be so quick to dismiss. Just go back to the arguments for why the baseline specification was your preferred model in the first place. It is incumbent upon you, the author, to explain why this particular robustness check should not be given special penguin status.
To do anything else is bad science. Remember Type I and Type II errors? Even if, in truth, none of the changes made as part of one's robustness checks matter, there is always some probability of incorrectly finding that the results are sensitive to the changes made in some of those checks. This is a classic multiple hypothesis testing problem. We cannot let intellectual laziness preclude us from attempting to understand these false positives by thinking about which robustness checks may cause us to question the results of a study and which perhaps should not.
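To put a rough number on the multiple testing point, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that the checks are independent and that each has a 5% chance of spuriously 'failing' when the baseline results are in fact fine; neither assumption will hold exactly in practice.

```python
# Probability of at least one spurious robustness-check failure, assuming
# independent checks and a 5% false positive rate per check (illustrative only).
alpha = 0.05
for k in (1, 5, 20, 50, 87):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:3d} checks: P(at least one spurious failure) = {p_any:.2f}")
```

With 20 checks the chance of at least one spurious failure is already about 64%, and by the 87th check it is essentially certain.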
Thank you for indulging my rant. I feel better already.
UPDATE (1.25.2023)
Thanks to Ai Deng on Twitter for pointing out the paper by Lu & White (2014). They critique a certain type of robustness check that I learned as Extreme Bounds Analysis (EBA), which assesses the sensitivity of the coefficient(s) of interest to changes in the other covariates included in the model.
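For readers who have not seen it, here is a deliberately stripped-down sketch of that exercise: re-estimate the model over every subset of a set of candidate controls and report the range (the 'extreme bounds') of the coefficient of interest. The data and variable names are made up, and real EBA involves more than this, but it conveys the basic idea.

```python
# Stripped-down Extreme Bounds Analysis: record the coefficient of interest
# across all subsets of candidate controls. Data and names are hypothetical.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(1)
n = 500

d = rng.binomial(1, 0.5, n)          # variable of interest
Z = rng.normal(size=(n, 4))          # four candidate controls
y = 1.0 + 2.0 * d + Z @ np.array([0.5, -0.3, 0.0, 0.0]) + rng.normal(size=n)

def coef_on_d(cols):
    """OLS coefficient on d when the controls indexed by `cols` are included."""
    X = np.column_stack([np.ones(n), d] + [Z[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

estimates = [coef_on_d(cols)
             for k in range(Z.shape[1] + 1)
             for cols in combinations(range(Z.shape[1]), k)]

print(f"coefficient on d ranges from {min(estimates):.3f} to {max(estimates):.3f} "
      f"across {len(estimates)} specifications")
```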