The Non-Monotonicity of Wrongness


George Box, perhaps the happiest statistician in the history of statistics, said it best: "All models are wrong, but some are useful."


Perhaps he was so happy because he accepted the inherent flaws in all models. Acceptance is very powerful. Of course, correlation, causation, ...


But, George's statement is intriguing. If all models are wrong and only a subset are useful, how do we know which of the incorrect models are, in fact, useful? One criterion that might be (implicitly) used in empirical research is based on a simple accounting exercise: if one is choosing between two competing models, the one with fewer errors is preferred. We can summarize this thought process in the following claim:

"The usefulness of a model is (weakly?) diminishing in its wrongness"

where 'wrongness' is measured by the number of errors.


While I don't have specific examples off the top of my head (and wouldn't want to throw specific papers under the bus even if I did), an implicit or explicit motivation for many empirical papers is to claim that (i) existing empirical models suffer from problems a, b, and c and (ii) in this paper, problem a is solved. Perhaps you have read, or even written, such a paper.

Are such papers really making a contribution to our scientific knowledge? It clearly depends on whether a less erroneous model is a more useful model. In other words, is the usefulness of a model (weakly) monotonic in its wrongness?

The answer is obvious once we stop and think about it: No, the usefulness of a model is not monotonically decreasing in its wrongness ... as measured by the number of errors!


I was thinking about this point this week for two reasons. First, it came up in my class lecture on regression specification. Second, a specific example of this point was made in a paper I came across while refereeing.

In my lecture, I consider a situation where the 'true' model contains three covariates and satisfies all the assumptions of the Classical Linear Regression model. I can then compare two mis-specified models: (i) excluding two relevant covariates, x2 and x3, and (ii) excluding a single relevant covariate, x3.

Q: Which mis-specified model yields a 'better' estimate of the coefficient on x1?
A: 🤷‍♀️ It could be either.


This ambiguity means that making the model less wrong can actually make it less useful. In the above example, adding x2 back to the specification, while 'right', could be 'wrong' in the sense of worsening the estimate of the coefficient on x1.
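The ambiguity is easy to see in a minimal simulation sketch (the coefficients and covariances below are my own invented numbers, not from the lecture). The covariates are constructed so that the two omitted-variable biases on x1 offset each other: dropping both x2 and x3 yields a nearly unbiased estimate of x1's coefficient, while dropping only x3 leaves a large bias.

```python
import numpy as np

# Illustrative numbers (my own, for exposition): true model is
# y = x1 + x2 + x3 + e, with Cov(x1, x2) = 0.5 and Cov(x1, x3) = -0.5,
# so the two omitted-variable biases on x1 exactly offset.
rng = np.random.default_rng(0)
n = 200_000
Sigma = np.array([[1.0, 0.5, -0.5],
                  [0.5, 1.0,  0.0],
                  [-0.5, 0.0, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
y = X @ np.array([1.0, 1.0, 1.0]) + rng.standard_normal(n)

def ols_coef_on_x1(regressors):
    """OLS of y on a constant plus the given columns; return x1's coefficient."""
    Z = np.column_stack([np.ones(n), regressors])
    coefs, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coefs[1]

b_omit_both = ols_coef_on_x1(X[:, [0]])     # exclude x2 AND x3
b_omit_x3 = ols_coef_on_x1(X[:, [0, 1]])    # exclude x3 only

# The doubly mis-specified model is (nearly) unbiased here (plim = 1),
# while the 'less wrong' model is badly biased (plim = 1 - 2/3 = 1/3).
print(b_omit_both, b_omit_x3)
```

With these numbers, excluding both variables gives an estimate near the true value of 1, while the 'less wrong' specification converges to 1/3; flip the sign of Cov(x1, x3) and the ranking reverses, which is exactly the ambiguity in the answer above.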


On the other hand, wrongly excluding both x2 and x3 may make things worse than excluding only x3 if the two omitted-variable biases go in the same direction. In that case, two wrongs do not make a right.

For further discussion of the OLS context, see Clarke (2005, 2009), De Luca et al. (2018), and Basu (2020).

The other example I came across this week is from Abay et al. (2019). The model of interest expresses the relationship between farm output (crop yield) and plot size (cultivated land). For reasons not relevant here, crop yield, y*, and plot size, x*, may both be mismeasured. Moreover, the measurement errors may not be classical. Deriving the expectation of the OLS estimate of the coefficient on observed plot size reveals that the bias depends on both sources of measurement error in ways that could, but do not have to, (partially) offset. Thus, replacing either the observed y or x with its true value, but leaving the other mismeasured, while 'right', could be 'wrong' in the sense of again making the model less useful. The authors state (p. 174): "Relatedly, even if we correct for measurement error in one measure, the bias in the OLS estimator could grow rather than shrink."
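A stylized sketch of this offsetting, under my own invented parameterization (not the authors' model or data): both y and x are mismeasured, the two measurement errors are positively correlated, and 'correcting' only the error in y makes the slope estimate worse.

```python
import numpy as np

# Invented parameterization for illustration (not Abay et al.'s data):
# true relation y* = x* + e; we observe x = x* + v and y = y* + u,
# where u = v + w, so Cov(u, v) = Var(v) = 0.5 > 0.
rng = np.random.default_rng(1)
n = 200_000
x_star = rng.standard_normal(n)
y_star = x_star + rng.standard_normal(n)
v = rng.normal(scale=np.sqrt(0.5), size=n)
u = v + rng.normal(scale=np.sqrt(0.5), size=n)  # correlated with v
x = x_star + v
y = y_star + u

def slope(dep, reg):
    """Bivariate OLS slope of dep on reg."""
    return np.cov(dep, reg)[0, 1] / np.var(reg)

# Both errors present: plim = (1 + Cov(u,v)) / (1 + Var(v)) = 1.5/1.5 = 1,
# because the correlated error in y offsets the attenuation from x.
b_both_wrong = slope(y, x)

# Correct y only, leave x mismeasured: plim = 1/1.5, pure attenuation.
b_fix_y_only = slope(y_star, x)

print(b_both_wrong, b_fix_y_only)
```

Here the doubly mismeasured regression recovers the true slope of 1, while fixing only y converges to 2/3: one correction, while 'right', makes the estimate more wrong.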


Of course, this discussion might also trigger flashbacks to that oldie, but goodie, Griliches (1977). Here, the focus is on estimating the returns to education when the model suffers from two potential problems: (i) measurement error in years of schooling and (ii) unobserved ability. The expectation of the estimated return to education, assuming classical measurement error in schooling, is given by

E[b] = beta*RR + gamma*delta,

where beta is the 'true' return to education, RR is the reliability ratio of the observed schooling (RR = Var(S*)/Var(S), where S* and S are true and observed schooling, respectively), gamma is the coefficient on unobserved ability, and delta is the population coefficient on observed schooling in a regression of ability on observed schooling.

In this situation, measurement error results in attenuation bias, while omitted ability is assumed to bias the estimate upward. As a result, correcting one of these issues while leaving the other in place, while 'right', could be 'wrong' in the sense of making the model less useful.
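Plugging hypothetical numbers into the expectation above makes the point concrete (all values here are invented for illustration, not Griliches's estimates). Using E[b] = beta*RR + gamma*delta with RR the reliability ratio, if attenuation and ability bias happen to cancel, then fixing either problem alone moves the estimate away from the true return:

```python
# Hypothetical values, chosen so the two biases exactly offset:
beta = 0.10         # 'true' return to education
RR = 0.8            # reliability ratio of observed schooling
gamma_delta = 0.02  # upward ability bias, gamma * delta

b_both_problems = beta * RR + gamma_delta   # 0.10: the biases cancel
b_fix_meas_err = beta * 1.0 + gamma_delta   # 0.12: fixing only ME overshoots
b_fix_ability = beta * RR                   # 0.08: fixing only ability undershoots

print(b_both_problems, b_fix_meas_err, b_fix_ability)
```

With these numbers, the doubly flawed estimate is dead on, while either single correction introduces a bias of 0.02 in absolute value.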

These examples illustrate cases where the usefulness of a model is not monotonic in its wrongness, as measured by the number of errors in the model. As a result, be wary of your own work, or that of others, motivated by solving some, but not all, of the econometric issues thought to plague a literature.



References

Abay, K.A., G.T. Abate, C.B. Barrett, and T. Bernard (2019), "Correlated Non-Classical Measurement Errors, 'Second Best' Policy Inference, and the Inverse Size-Productivity Relationship in Agriculture," Journal of Development Economics, 139, 171-184

Basu, D. (2020), "Bias of OLS Estimators due to Exclusion of Relevant Variables and Inclusion of Irrelevant Variables," Oxford Bulletin of Economics & Statistics, 82, 209-234

Clarke, K.A. (2005), "The Phantom Menace: Omitted Variable Bias in Econometric Research," Conflict Management and Peace Science, 22, 341-352

Clarke, K.A. (2009), "Return of the Phantom Menace: Omitted Variable Bias in Econometric Research," Conflict Management and Peace Science, 26, 46-66

De Luca, G., J.R. Magnus, and F. Peracchi (2018), "Balanced Variable Addition in Linear Models," Journal of Economic Surveys, 32, 1183-1200

Griliches, Z. (1977), "Estimating the Returns to Schooling: Some Econometric Problems," Econometrica, 45, 1-22
