Schrödinger's Cat
In many ways, economists are very wishy-washy. We are probably all aware of Harry Truman's frustration with economists and their two hands:
"Give me a one-handed economist. All my economists say 'on the one hand...', then 'but on the other...'"
But, such wishy-washiness is usually a sign of knowledge ... as one learns more about a topic, one realizes how there are never simple answers in complex matters. This leads me to one of my favorite quotes from Mark Twain:
"Education: the path from cocky ignorance to miserable uncertainty."
Cocky ignorance also goes by another name: the Dunning-Kruger effect. Or, put simply, the ill-informed do not know what they do not know. As an aside, a friend sent me an article this week on how the D-K effect explains our current political predicament.
Anyway, this brings me to the topic I want to discuss today: while economists are notorious for being wishy-washy, applied econometricians are not. In my opinion, they abhor living in a world of gray and much prefer the simplicity of black and white.
This lack of appreciation for the gray implicitly lies beneath a blunt tweet that I came across this week:
"Without googling, why are people obsessed with *unbiased* estimators?"
Applied econometricians not interested in forecasting or prediction are infatuated with unbiased estimators because unbiasedness is black and white. It is or it isn't. If it is, ACCEPT! If it isn't, REJECT! Many publication decisions come down to this simple heuristic. Of course, we never know for sure whether an estimator is biased in a particular application. As a result, referee recommendations often come down to whether they can think of a possible source, any source, of bias. If so, REJECT!
But ...
It doesn't have to be this way. We can choose to embrace the gray.
One very cool illustration of the gray that I came across a few years ago is a paper by Pesaran & Zhou (2018, PZ hereafter). The big picture insight of this paper that honestly blew my mind was that not all observations have to have the same data-generating process (DGP).
Wait, what? Don't we know this? After all, what I just described is a structural break ... a change in the DGP across observations. Yes, I agree that that is a well-known example where different observations are drawn from different DGPs, but that is not what PZ have in mind. Instead, the authors consider a case where a more subtle part of the DGP is allowed to differ across observations.
Specifically, they consider a traditional linear panel data regression with unobserved unit-specific fixed effects. The model can also include time effects; they are not central to this discussion. However, they put a spin on the traditional model by positing that the fixed effect, αi, is only correlated with the time-varying covariates, xit, for some i.
In other words, for some i, either αi is uncorrelated with the time-varying covariates, xit, or αi = α. Both cases imply that Cov(αi, xit) = 0 for those units. For the remaining units, Cov(αi, xit) ≠ 0.
This setup resides in the gray for two very important reasons that are not typically in the consciousness of applied researchers. First, it's not a question of whether the time-varying covariates are or are not correlated with the fixed effects. Rather, there is a third possibility: correlated for some units, but not all. Applied researchers rarely think this way, even though thinking about structural breaks and, more recently, treatment effect heterogeneity is common. Why do we restrict ourselves to binary thinking when it comes to the setup of the model?
Second, allowing the time-varying covariates to be correlated with the fixed effects for only some units changes the properties of estimators in meaningful ways that put us further in the gray.
PZ consider the choice between Pooled OLS (one could think about random effects as well) and the fixed effects (FE) estimator. For Pooled OLS, we know that if Cov(αi, xit) ≠ 0 for even one unit i, then the estimates will be biased. However, the bias may be quite small if Cov(αi, xit) ≠ 0 for only a small proportion of units. More interestingly, Pooled OLS will be consistent if the number of units where Cov(αi, xit) ≠ 0 grows more slowly than the number where Cov(αi, xit) = 0 as the sample size grows. This is the case if Cov(αi, xit) ≠ 0 for N^δ units, where δ < 1: the proportion of units causing bias, N^δ/N, goes to zero as N goes to infinity, yielding a consistent estimator. Going a step further, Pooled OLS has a smaller asymptotic mean squared error than FE if δ < 0.5, due to the information discarded by FE (i.e., the between variation).
Now, I know most of us applied peeps slept through asymptotic theory, but this is a nice and intuitive result: while the estimator (Pooled OLS) is biased (a finite-sample property), it is consistent (an asymptotic property) because the problematic observations are overwhelmed by unproblematic observations as N goes to infinity.
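If you want to see this result rather than take my word for it, here is a quick Monte Carlo sketch. The DGP below is my own toy version of the setup (not PZ's exact design): αi is built into xit for only the first N^δ units, and we compare the average bias of Pooled OLS and FE as N grows.

```python
import numpy as np

def simulate_panel(N, T, delta, beta=1.0, seed=0):
    """One draw of a toy PZ-style panel (hypothetical DGP for illustration):
    alpha_i is correlated with x_it for only the first N**delta units."""
    rng = np.random.default_rng(seed)
    n_corr = int(N ** delta)              # units with Cov(alpha_i, x_it) != 0
    alpha = rng.normal(size=N)
    x = rng.normal(size=(N, T))
    x[:n_corr] += alpha[:n_corr, None]    # build in the correlation
    y = beta * x + alpha[:, None] + rng.normal(size=(N, T))
    return x, y

def pooled_ols(x, y):
    """Slope from pooling all N*T observations (with an intercept)."""
    xc, yc = x.ravel() - x.mean(), y.ravel() - y.mean()
    return (xc @ yc) / (xc @ xc)

def fixed_effects(x, y):
    """Within estimator: demean each unit over time, then OLS."""
    xd = (x - x.mean(axis=1, keepdims=True)).ravel()
    yd = (y - y.mean(axis=1, keepdims=True)).ravel()
    return (xd @ yd) / (xd @ xd)

def mean_bias(estimator, N, T=5, delta=0.5, reps=200):
    """Average bias across Monte Carlo replications (true beta = 1)."""
    draws = [estimator(*simulate_panel(N, T, delta, seed=s)) for s in range(reps)]
    return np.mean(draws) - 1.0
```

With δ = 0.5, the Pooled OLS bias is noticeable at N = 100 but shrinks toward zero as N grows, exactly because the share of contaminated units, N^δ/N, vanishes; FE stays centered on β throughout.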
I can hear the counterarguments.
Why should we care about asymptotic results when every data set we have ever encountered has finite N?
We should care about asymptotic results because (i) sample sizes are often quite large and/or (ii) in some cases the performance of an estimator mimics its asymptotic properties even when N is far from infinity. We learn this through simulation, not math. As suggested by the tweet above ... we do not need to be obsessed with only unbiased estimators!
Why should I assume that some unproblematic observations exist, let alone dominate asymptotically?
We should assume the dominant presence of unproblematic observations if we have institutional knowledge supporting this claim. I don't dispute that this may often not be the case, but I will provide an example in a different context below.
So, the paper by PZ should cause applied researchers to step back and think for a minute.
The paper highlights two new ways of thinking for many applied researchers. First, sources of econometric concern -- endogeneity in this case due to unobserved, time invariant heterogeneity -- do not need to be black or white, all or nothing. Second, the econometric concerns themselves -- performance of the Pooled OLS estimator in this case -- do not need to be black or white, all or nothing. In both cases, there is a third possibility that lies in the gray: concerning behavior for only some and bias but consistency and lower mean squared error in large samples. Remind you of anyone?
A secret third thing!
As further illustration of what we can gain as researchers by embracing the gray, as well as of the use of institutional knowledge to justify the presence of unproblematic observations, I will self-promote briefly (mostly to highlight my junior colleague!). My recent paper with my colleague Hao Dong, entitled "Embrace the Noise? It's OK to Ignore Covariate Measurement Error, Sometimes," is now forthcoming in the Journal of the Royal Statistical Society, Series A. [Note: I will provide a link when it is available; an ungated prior version is here.]
The paper follows PZ, but instead of comparing Pooled OLS to FE when unobserved, time-invariant heterogeneity affects only some units, we compare OLS to Instrumental Variables (IV) when a covariate suffers from measurement error for only some observations. The results are analogous to PZ: OLS is biased but consistent if the measurement error affects only N^δ observations, where δ < 1, and OLS has a smaller asymptotic mean squared error than IV if δ < 0.5. Through simulation, we show that this holds in practice even if N is relatively small, if all observations are subject to at least some measurement error, and if the instrument is strong. We also show that this is practically relevant by examining the effect of body mass index (BMI) on income using data with both self-reported and clinical BMI.
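The same Monte Carlo logic applies here. The sketch below uses my own toy cross-sectional DGP (not the paper's design): a valid instrument z, a covariate mismeasured for only the first N^δ observations, and simple OLS and IV slope estimators. The attenuation bias of OLS fades as N grows, while IV stays centered but noisier.

```python
import numpy as np

def simulate_cs(N, delta, beta=1.0, seed=0):
    """Toy cross-section (hypothetical DGP): the covariate is measured
    with error for only the first N**delta observations."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=N)                  # strong, valid instrument
    x_true = z + rng.normal(size=N)
    y = beta * x_true + rng.normal(size=N)
    x_obs = x_true.copy()
    n_err = int(N ** delta)                 # observations with measurement error
    x_obs[:n_err] += rng.normal(scale=2.0, size=n_err)
    return x_obs, y, z

def ols(x, y):
    """OLS slope with an intercept: Cov(x, y) / Var(x)."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)

def iv(x, y, z):
    """Simple IV slope: Cov(z, y) / Cov(z, x)."""
    zc, xc, yc = z - z.mean(), x - x.mean(), y - y.mean()
    return (zc @ yc) / (zc @ xc)
```

Averaging each estimator over replications shows the pattern: with δ = 0.5, OLS is attenuated well below β at small N, nearly unbiased at large N, while IV is roughly centered on β at every N.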
Guilty.
Aside from PZ, there are a number of other examples where I wish researchers would embrace the gray a bit more. Without making this post even longer, two come to mind immediately. First, partial identification a la Manski. The ultimate definition of gray. I can see Truman rolling over in his grave now. My experience is that many dismiss the usefulness of partial identification because they hate to give up the black-and-white certainty of a point estimate. It always reminds me of the scene in Singles.
Linda: I still love my car.
Second, so-called Stein-like estimators. Time series people (and maybe machine learners?) take it for granted that combining multiple forecasts outperforms a single forecast, even if that single forecast comes from the 'true' model. Stein-like estimators apply this logic to settings where β is the goal, not prediction. For example, instead of choosing between Pooled OLS (or RE) and FE, or between OLS and IV, a Stein-like estimator takes a weighted average of the two to minimize the mean squared error. See, e.g., Hansen (2017), Huang et al. (2017), and Huang (2020). These estimators average an inconsistent and a consistent estimator, yielding an estimator that is inconsistent but typically has smaller mean squared error. But this requires being comfortable with a biased estimator!
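To make the weighting idea concrete, here is a toy version of the combination logic, not the exact estimator in any of the papers cited above. It assumes Cov(b_OLS, b_IV) ≈ Var(b_OLS) (the Hausman result when OLS is efficient under exogeneity) and plugs (b_OLS − b_IV)² in for the squared bias of OLS; minimizing the estimated MSE of w·b_OLS + (1 − w)·b_IV then gives the weight below.

```python
def stein_combine(b_ols, b_iv, v_ols, v_iv):
    """Toy Stein-like combination of a biased, low-variance estimator (OLS)
    and a consistent, higher-variance one (IV). A sketch of the weighting
    logic only; Hansen (2017) etc. differ in the details."""
    d2 = (b_ols - b_iv) ** 2                   # estimated squared OLS bias
    w = (v_iv - v_ols) / (d2 + v_iv - v_ols)   # MSE-minimizing weight on OLS
    w = min(max(w, 0.0), 1.0)                  # clip to [0, 1]
    return w * b_ols + (1.0 - w) * b_iv, w
```

The behavior is intuitive: when OLS and IV nearly agree, the weight on the precise OLS estimate approaches one; when they are far apart, the combination leans on the consistent IV estimate.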
Live in the gray. Die in the gray. Just like Schrödinger's Cat. Either way, embrace the gray.
Thanks for reading!