Too Much, Not Enough?

Who knew that there is a song called No More Tears (Enough is Enough) by Barbra Streisand and Donna Summer? [And, if you are thinking to yourself, "Who are Barbra Streisand and Donna Summer?", just keep it to yourself.] The final part of the lyrics is

No more tears

No more tears

No more tears

I've had it, we've had it, you've had it, he's had it

No more tears

Is enough, is enough, is enough, is enough, is enough, is enough, is enough


Unfortunately, Barbra and Donna could easily have been discussing econometrics in this song. Econometrics, to many at least, is synonymous with tears. And all of us, at some point, have had enough.


But, I have a specific issue in mind that centers on the theme of "When is enough enough?" when it comes to econometrics. The issue comes up a lot when I teach, a lot in my own research, and implicitly dominates much of the tension (and tears) that arises when applied econometricians encounter theoretical econometricians. The issue focuses on the choices of researchers (and the preferences of consumers of research) that reflect a balancing act between too much information and too little, between too complex and too simple.

At the highest level, applied research proceeds by positing a research question, gathering data, and then seeking to answer said question. For purposes here, I'll take the first two steps as given. When it comes to the final step, however, the choices are numerous, running the gamut from simple to complex.


And, the choice is not without immense consequences. How one seeks to answer the question determines the type of information learned. At one end of the continuum is the extreme complexity that comes from just providing the data for visual inspection. This is 'complex' in the sense that, by showing all the data, one is conveying all the information in the data. When the data entail only 2 or 3 variables, this 'complex' analysis is obviously fairly trivial.



However, once we get to 4 variables (or more), things start to get a bit unwieldy. 
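To make the unwieldiness concrete, here is a minimal sketch in Python (simulated data of my own; pandas and matplotlib assumed) of what "just showing the data" looks like once we hit 4 variables:

```python
# A sketch of 'just showing the data' with 4 variables (simulated data):
# a scatterplot matrix already has 6 distinct pairwise panels to digest,
# and the count grows quadratically from there.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 4)), columns=["y", "x1", "x2", "x3"])

scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.show()
```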



Thus, the complex method, which attempts to convey all the information available to the reader, ends up overwhelming the audience and, in the end, not providing a useful answer to the research question. So, what does the applied researcher do? They move to less complex methods that trade off the amount of information conveyed for clarity of exposition.

Alternatively, the data might be completely summarized by a single number. This represents the other end of the spectrum, extreme simplicity. For instance, with 2 variables, the data may be summarized by the correlation coefficient. [However, even then, there are choices among correlation coefficients (e.g., Pearson or Spearman)]. By providing a single number, the information being provided is easy to interpret. But, by reducing the data down to a single number, much information is potentially lost. 
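As an illustration (simulated data; the contrived nonlinear relationship is mine), the two coefficients can disagree, which is itself a reminder that even "one number" involves choices:

```python
# Reducing two variables to a single number: even then, there are choices.
# Pearson measures linear association; Spearman measures monotonic association.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)
y = np.exp(x) + rng.normal(scale=0.5, size=1_000)  # nonlinear but monotonic

print(f"Pearson:  {pearsonr(x, y)[0]:.3f}")   # understates the relationship
print(f"Spearman: {spearmanr(x, y)[0]:.3f}")  # picks up the monotone pattern
```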

Regression analysis is marginally less simple. With 2 variables, Ordinary Least Squares summarizes the data with two parameters, an intercept and a slope. In multiple regression analysis with K variables (1 being the dependent variable), the data are summarized by K numbers (a constant and K-1 slope coefficients). This still potentially hides a ton of information from the audience. Given large sample sizes, we ought to be very wary of reducing, say, 1,000 or 10,000 or 1 million observations down to K numbers. But, OLS remains the dominant tool in applied research. If a researcher opts for some alternative estimator, the audience (i.e., referees and editors) is at best suspicious and at worst reduced to tears.
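A minimal sketch of this compression, using simulated data and statsmodels:

```python
# OLS as compression: N observations on K variables -> K numbers.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 2))                        # K-1 = 2 covariates
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(X)).fit()
print(res.params)   # 10,000 observations reduced to K = 3 numbers
```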


But, researchers have a great many tools available to them to answer research questions in ways that are less complex than plotting the data, yet still convey a great deal of information to the audience, and more complex than the (partial) correlations obtained via OLS. This vast middle ground is, in my humble opinion, all too often avoided. It remains reminiscent of a post-apocalyptic wasteland.


We must do better as researchers and consumers of research at navigating this continuum. For example, at the more complex end of the spectrum are nonparametric regression methods and parametric limited dependent variable models. These estimators are in some sense more 'complex', but they convey more information. Both nonparametric methods and limited dependent variable models give rise to observation-specific marginal effects. Thus, these estimators characterize a data set containing N observations and K-1 covariates with N(K-1) parameters. This can be reduced all the way down to K-1 parameters if the researcher only provides the marginal effects at the sample means or the sample means of the marginal effects. Or a researcher can present some subset of the N(K-1) results, presenting marginal effects for particular observations or at specific percentiles of the distribution of the estimated marginal effects.
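To fix ideas, here is a sketch using a probit model (one example of a parametric limited dependent variable model; data simulated, specification my own). The marginal effect of covariate k for observation i is phi(x_i'b)*b_k, so the full answer is an N x (K-1) matrix, which can then be collapsed however one likes:

```python
# Observation-specific marginal effects in a probit model (simulated data).
# For probit, dP(y=1|x)/dx_k = phi(x'b) * b_k varies across observations,
# so the 'full' answer is an N x (K-1) matrix, not K-1 numbers.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 5_000
X = sm.add_constant(rng.normal(size=(n, 2)))        # constant + K-1 = 2 covariates
y = (X @ np.array([0.2, 0.8, -0.5]) + rng.normal(size=n) > 0).astype(int)

res = sm.Probit(y, X).fit(disp=0)
b = res.params
me = norm.pdf(X @ b)[:, None] * b[None, 1:]         # N x (K-1) marginal effects

print(me.shape)                                  # (5000, 2): the full answer
print(me.mean(axis=0))                           # sample mean of the marginal effects
print(np.percentile(me, [10, 50, 90], axis=0))   # or a subset, e.g. percentiles
```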

When sets of marginal effects are presented -- so more than K-1 estimates are provided -- researchers and consumers of research are often overwhelmed and brought to tears. Admittedly, it is hard to tell a story that reflects all the nuanced findings being presented. The findings also open the researcher up to more criticism; the more parameters presented, the more there is to be criticized. 

The result is that more than 200 years after the discovery of OLS, it remains the estimator of choice in applied research. I would argue that this is a failing on our part. 


In my own work over the years and, more importantly, throughout Manski's life's work, this preference in the applied community for 'simple' methods that convey less information repeatedly rears its head in the context of partial identification techniques. Just this morning, the incomparable Paul Hünermund -- unbeknownst to him -- set himself up perfectly to appear in this post with his tweet to me:


Partial identification relaxes the assumptions needed for OLS (and many other estimators). The result of this added complexity is the additional information conveyed to the audience. PI provides more information in the sense that a parameter is determined to lie within a segment of the real number line, rather than at a single point, thus providing the audience a set of plausible values of the parameter rather than a single value that is valid only under very specific assumptions. Yet, as Paul makes clear (as has my experience with some referees, as well as the fact that Manski has yet to win a Nobel), audiences are willing to accept less information (a specific point estimate) for the sake of simplicity.
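For the flavor of it, here is a toy sketch of the classic worst-case bounds on a mean with missing outcomes (the canonical textbook example; data simulated, and the only assumption imposed is that Y lies in [0, 1]):

```python
# A toy illustration of partial identification: worst-case (Manski) bounds on
# E[Y] when some outcomes are missing and we assume only that Y lies in [0, 1].
# Instead of one point estimate, we report an interval of plausible values.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
y = rng.uniform(size=n)                 # true outcomes in [0, 1]
observed = rng.uniform(size=n) < 0.7    # roughly 30% of outcomes go missing

p_obs = observed.mean()
mean_obs = y[observed].mean()

lower = mean_obs * p_obs + 0.0 * (1 - p_obs)   # missing Y all at the minimum
upper = mean_obs * p_obs + 1.0 * (1 - p_obs)   # missing Y all at the maximum
print(f"E[Y] lies in [{lower:.3f}, {upper:.3f}]")
```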


Other examples of this preference for simplicity exist as well (and I imagine readers of this post may have others that I can add to the list). But, consider two. First, quantile regression. QR is much closer to OLS than nonparametric regression is, but it is more complex in that it conveys more information, summarizing a data set with N observations and K-1 covariates with 99*K parameters (sticking to integer percentiles and including an intercept). But, QR is still far less common than OLS. A quick Google Scholar search shows 734,000 results for "Ordinary Least Squares" and only 114,000 for "Quantile Regression".
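A sketch of that 99*K summary, with simulated data (my own data-generating process) and statsmodels' quantreg:

```python
# Quantile regression: one (intercept, slope) pair per integer percentile,
# i.e., 99 * K parameters, versus the single pair from OLS (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2_000
x = rng.uniform(size=n)
y = 1.0 + 0.5 * x + (0.5 + x) * rng.normal(size=n)  # slope varies by quantile
df = pd.DataFrame({"y": y, "x": x})

mod = smf.quantreg("y ~ x", df)
params = np.array([mod.fit(q=q / 100).params for q in range(1, 100)])
print(params.shape)   # (99, 2): 99 percentiles x K = 2 parameters each
```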

Second, modeling zeros. There has been renewed attention given to regression models where the dependent variable has a mass point at zero. Tobit models have fallen out of favor due to their lack of robustness to the distributional assumption on the error term. The approach typically preferred by applied researchers is to replace the dependent variable, Y, with log(Y+1), and stick to OLS. Then, along came the inverse hyperbolic sine (IHS). "Fancy" for sure. But, this has been adopted by applied researchers because it is trivial to do and does not impede the ability to still use OLS.
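Both fixes are indeed trivial, which is the point. A sketch with simulated data (the data-generating process is my own contrivance):

```python
# The 'simple' fixes: transform Y, keep OLS (simulated data, mass point at zero).
# log(Y + 1) is np.log1p; the IHS is np.arcsinh, i.e., log(y + sqrt(y^2 + 1)).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=n)
y = np.maximum(0.0, 1.0 + 2.0 * x + rng.normal(size=n))   # mass point at zero
X = sm.add_constant(x)

print(sm.OLS(np.log1p(y), X).fit().params)     # log(Y + 1), then OLS
print(sm.OLS(np.arcsinh(y), X).fit().params)   # IHS, then OLS
# Foreshadowing the problem below: rescale Y (change its units) and the IHS
# slope changes, in a way that a pure log transformation's would not.
print(sm.OLS(np.arcsinh(100 * y), X).fit().params)
```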


However, it has been pointed out recently that regression estimates using the IHS transformation are highly sensitive to the unit of measurement of Y. That, as we say, is not good. A better option, based on Santos Silva & Tenreyro (2006) and advocated for by Jeff Wooldridge on Twitter, is Poisson regression:


This solution is more complex, but it also conveys more information: the Poisson model is non-linear and thereby gives rise to observation-specific marginal effects. So, we are back to characterizing a data set with N observations and K-1 covariates with N(K-1) marginal effects. Far more than we learn using log(Y+1) or IHS.
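A sketch of the Poisson approach with simulated count data (statsmodels' GLM; I add robust standard errors in the pseudo-maximum likelihood spirit):

```python
# Poisson regression on simulated count data with many zeros. No transformation
# of y is needed, and E[y|x] = exp(x'b) delivers observation-specific marginal
# effects dE[y|x]/dx_k = exp(x'b) * b_k.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000
X = sm.add_constant(rng.normal(size=(n, 2)))
y = rng.poisson(np.exp(X @ np.array([0.2, 0.5, -0.3])))   # counts, many zeros

res = sm.GLM(y, X, family=sm.families.Poisson()).fit(cov_type="HC0")  # robust SEs
b = res.params
me = np.exp(X @ b)[:, None] * b[None, 1:]   # back to N x (K-1) marginal effects
print(me.shape, me.mean(axis=0))
```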

More recently, another solution to the zeros problem has been put forth in Mullahy & Norton (2022). Long story short, the authors advocate researchers consider the use of so-called two-part models. Such models are complex in the sense that they entail, um, two parts. But seriously, such models are more complex, but also convey more information by separately modeling zeros vs non-zeros and then positive values conditional on being non-zero. Again, marginal effects are observation-specific. Will this catch on with applied researchers, or will we still see log(Y+1) in another hundred years?
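For concreteness, here is one stylized sketch of a two-part model (my own illustrative specification, not the authors' exact one): a logit for zero versus non-zero, then a GLM with a log link for the positive values, with the two parts multiplied together for the overall conditional mean:

```python
# A stylized two-part model (simulated data): a logit for zero vs. non-zero,
# then a Gamma GLM with a log link for the positives. Combining the parts,
# E[y|x] = P(y > 0 | x) * E[y | y > 0, x], again observation by observation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=n)
X = sm.add_constant(x)
p = 1 / (1 + np.exp(-(0.3 + 0.8 * x)))                   # part 1 of the DGP
y = np.where(rng.uniform(size=n) < p,
             np.exp(1.0 + 0.5 * x + rng.normal(scale=0.3, size=n)),
             0.0)

part1 = sm.Logit((y > 0).astype(int), X).fit(disp=0)     # zero vs. non-zero
part2 = sm.GLM(y[y > 0], X[y > 0],
               family=sm.families.Gamma(link=sm.families.links.Log())).fit()

e_y = part1.predict(X) * part2.predict(X)   # E[y|x] for every observation
print(e_y[:5])
```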



I'm not sure where we stand now. But, this post started by asking when is enough, is enough, is enough, is enough, is enough, is enough, is enough. I don't have a simple answer (or a complex one). But, I think current practice in empirical research errs much too much on the side of simplicity, both of method and, more importantly, of information conveyed to the audience. Of course, complexity for the sake of complexity is not good either.



I shall conclude with two final thoughts from top minds of our generation. First, Jeff tells us to use the right tool for the job. Given a research question and data, don't choose the estimator that is 'simplest' or that conveys less information just to avoid difficult conversations. Every estimator has its place, but you need to make sure you choose the right one for the right reasons in any given context.


And, from that other modern luminary:


No more tears. Enough is enough.


