The quote by statistician George Box
"All models are wrong but some are useful"
is extremely well known. It served as motivation for a prior post
here. It serves as motivation for this post as well, which brings us to another famous quote, this one from a non-statistician: Yogi Berra. In addition to being a great athlete, playing the greatest of games, Yogi was a smart fella.
I was reminded of the quote by Box while teaching this week. In particular, I was discussing two examples of quasi-maximum likelihood estimation (QMLE). In papers, we sometimes see authors refer to the estimation technique being used as MLE, while other times we see authors use the term QMLE. As applied researchers, we may never have learned the difference (or we have forgotten!) and thus we gloss over it, continuing on with our lives, our feelings of imposter syndrome growing steadily worse.
Well, it turns out that, as applied researchers, we ought to pay significant attention to that "Q." It turns out to not merely be quasi-useful, but actually extremely useful.
In simple language, as I am -- at my core -- a simple person, the "Q" stands for "quasi."
To see why, let's refresh our memory on how maximum likelihood (the original ML) works. We begin by specifying a data-generating process (DGP) for the data, including distributional assumptions for stochastic elements in the model. Assuming independence between observations, the contribution of each observation to the likelihood function reflects the true probability of the realized outcome as a function of the covariates and the unknown parameters. The ML estimates are the values of the parameters that maximize the collective likelihood of the realized data.
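In symbols (just a generic sketch, glossing over regularity conditions, with f denoting whatever density or probability function the DGP implies for each observation), the log-likelihood for the sample is

log L(b) = Σ_i log f(y_i | x_i; b),

and the ML estimate of b is the value that maximizes log L(b).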
If the DGP, and hence the likelihood function, are correctly specified, then ML works great; the estimates are consistent, asymptotically normal, and asymptotically efficient (i.e., they achieve the Cramér-Rao lower bound). Note, these are all asymptotic properties. The performance of ML can be poor in finite samples (e.g., the ML estimator of σ² in the Classical Linear Regression Model is biased).
The greatness of ML comes at the price of having to correctly specify the DGP, the entire DGP. Distributions and all. Unless ...
... ML "works" even when the DGP is not correctly specified. In this case, the model is still useful even though it is wrong. In this case, we refer to it as QMLE instead of MLE. In this case, Box is still smiling.
With QMLE we know that some aspect of the DGP is incorrect, yet maximizing the incorrect likelihood still produces consistent estimates of the parameters. It's almost like magic!
As an applied researcher, this is critical information. One of the first rules of applied work, in my view, is that you must know what assumptions are required for an estimator to have certain desirable properties. Rational people can then disagree on whether those assumptions are reasonable or not in a given application. But nothing will get you in more trouble with your research than not knowing what assumptions are required in your analysis.
Back to the topic at hand. With QMLE some of the assumptions that are used to derive the likelihood function are not in fact required for consistency. However, the usual ML standard errors will be incorrect; typically robust standard errors are used instead.
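For those keeping score at home, the robust variance in question is the familiar sandwich form: the asymptotic variance of the QML estimator takes the shape

A⁻¹ B A⁻¹,

where A is (minus) the expected Hessian of the log-likelihood and B is the variance of the score. When the likelihood is correctly specified, the information matrix equality gives A = B, the sandwich collapses, and we are back to the usual ML standard errors.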
As I said, two examples came up this week in my class. The first example is the fractional response model of Papke & Wooldridge (1996). Here, the outcome, y, is a fraction bounded between zero and one. For example, y might represent the participation rate in a program across counties or the pass rate on an exam across schools. Papke & Wooldridge propose the following DGP:
E[y | x] = F(xb),
where F is the standard normal or logistic CDF (ensuring that the expectation lies in the unit interval), and
y ~ Bernoulli(p),
where p = F(xb). The log-likelihood function is then trivial to derive and estimate. However, the assumption of the Bernoulli distribution cannot be correct because the Bernoulli distribution is the probability distribution for a discrete random variable taking on only two values, whereas y in our fractional response model can take on any value in the unit interval. Nonetheless, maximizing this likelihood function yields a QML estimator: it produces consistent estimates of b as long as the conditional mean, E[y | x], is correctly specified. The Bernoulli distribution, while convenient for computational purposes, is not necessary for consistency. Magic!
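To make this concrete, here is a minimal sketch in Python using statsmodels. The simulated data and the variable names (prate, x1, x2) are purely hypothetical; the point is simply the combination of a Bernoulli quasi-likelihood, a logit link, and robust standard errors.

```python
# A hypothetical sketch of fractional-response QMLE ("fractional logit").
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})

# Fractional outcome in (0, 1) whose conditional mean is a logistic function of x.
p = 1 / (1 + np.exp(-(0.2 + 0.5 * df["x1"] - 0.3 * df["x2"])))
df["prate"] = rng.beta(5 * p, 5 * (1 - p))  # continuous, with E[y | x] = p

# Bernoulli quasi-likelihood with a logit link: the distribution is "wrong"
# for a fractional y, but the conditional mean is what delivers consistency.
frac_logit = smf.glm("prate ~ x1 + x2", data=df, family=sm.families.Binomial())

# Robust (sandwich) standard errors, since the usual ML standard errors
# are invalid under QMLE.
print(frac_logit.fit(cov_type="HC1").summary())
```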
The second example is even more interesting. It is the Poisson count data model. Here, the outcome, y, is the non-negative integer frequency of some event. For example, y might represent the number of patents obtained by a firm in a given year or the number of visits by an individual to the doctor in a given year. In the Poisson model, the DGP is the following:
E[y | x] = exp{xb} = λ
y ~ Poisson(λ).
The Poisson distribution is the probability distribution for a non-negative, discrete random variable. As such, the DGP may be correct. If so, then the corresponding estimator is an ML estimator. But, even if the Poisson distribution is not correct, the estimator is still consistent for b, and hence a QML estimator, as long as the conditional mean, E[y | x], is correctly specified.
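If you want to see the magic for yourself, here is a small, purely illustrative Python sketch: the simulated counts are deliberately overdispersed (a gamma-Poisson mixture), so the Poisson distribution is wrong, yet the Poisson QMLE with robust standard errors recovers the mean parameters.

```python
# A hypothetical sketch of Poisson QMLE on overdispersed count data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=n)
mu = np.exp(0.5 + 0.8 * x)                       # E[y | x] = exp(xb)
y = rng.poisson(rng.gamma(shape=1.0, scale=mu))  # overdispersed counts, mean mu

X = sm.add_constant(x)
pois_qmle = sm.GLM(y, X, family=sm.families.Poisson()).fit(cov_type="HC1")
print(pois_qmle.params)  # close to (0.5, 0.8) despite the overdispersion
```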
The reason why this second example is particularly interesting is that it can be used even if y is a non-negative, continuous variable, as long as the conditional mean is exp{xb}. This is pointed out quite brilliantly in Santos Silva & Tenreyro (2006). Santos Silva and Tenreyro discuss estimation of the so-called gravity model of trade. According to many trade theories, the trade flow between two countries, y, is equal to exp{xb}. Prior to their paper, researchers estimated such models by converting them to a linear regression specification by taking logs of both sides. For reasons I won't go into here, Santos Silva & Tenreyro argue that this is not a good idea and propose not taking logs and instead estimating a Poisson count data model.
The authors referred to their estimator as Poisson Pseudo Maximum Likelihood (PPML). But, pseudo ML is just another name for QML. And, the reason PPML works is that the Poisson model yields consistent estimates regardless of the validity of the Poisson distribution, which clearly cannot hold in trade data that is continuous over the non-negative portion of the real number line. Inference can be addressed by using robust standard errors.
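And here is a similarly hypothetical sketch of the PPML point itself: the outcome below is continuous and non-negative (gamma-distributed, just for illustration) with conditional mean exp{xb}, and the Poisson QMLE still recovers b. This is a sketch of the mechanics, not a reproduction of Santos Silva & Tenreyro's application.

```python
# A small simulation of why PPML "works" on continuous, non-negative data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(size=n)
mu = np.exp(1.0 + 0.5 * x)                 # E[y | x] = exp(xb)
y = rng.gamma(shape=2.0, scale=mu / 2.0)   # continuous, non-negative, mean mu

X = sm.add_constant(x)
ppml = sm.GLM(y, X, family=sm.families.Poisson()).fit(cov_type="HC1")
print(ppml.params)  # close to (1.0, 0.5) even though y is not a count
```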
This example of QMLE is fascinating because by knowing which assumptions are necessary and which are not, we are able to apply the estimator to settings where we know the unnecessary assumptions are violated. Doing so allows us to avoid several problems that might arise if we instead are old school and simply take the logs of exponential models. This point is made in more detail here.
Stata! Helping researchers since 1985.
Enjoy yourself and love one another!
References
Papke, L.E. and J.M. Wooldridge (1996), "Econometric Methods for Fractional Response Variables with an Application to 401(k) Plan Participation Rates," Journal of Applied Econometrics, 11(6), 619-632.
Santos Silva, J.M.C. and S. Tenreyro (2006), "The Log of Gravity," The Review of Economics and Statistics, 88(4), 641-658.