What Do You Median?
Every parent of young children gets the TV shows their kids watch etched into their permanent memory. No way to get them out of there. For me, one of our kids' favorites was Blue's Clues. A quote from one episode, where Blue is playing hide-and-go-seek, is
"I could really use your help. Will you help me? You will? Oh good. Let me know if you see Blue."
I am thinking about Blue today for two reasons. First, I just finished updating my lecture notes for Econometrics II, our second-semester econometrics course for first-year economics Ph.D. students (available here). Second, I came across this tweet retweeted by Stephen Wild reminding us all how hard econometrics and statistics are, yet how deficient training can be in graduate school. Truth be told, it's not that graduate training is deficient. Rather, there is just SO MUCH to know.
My lecture notes now exceed 630 slides. For one semester. Yikes!
Alas, back to Blue. Most empirical studies, even to this day, use Ordinary Least Squares (OLS) to estimate regression models. Many of us even have a modicum of understanding as to why: OLS is "great." Some may even know in what sense OLS is "great" ... it is BLUE, where BLUE stands for the Best Linear Unbiased Estimator. Unfortunately, that may be where most understanding stops.
But this raises two questions that empirical researchers and consumers of empirical research ought to be able to answer.
1. What does it mean to be BLUE and should we care?
2. What assumptions are required for OLS to be BLUE and what happens if they fail?
So, let us continue and maybe we can "see Blue" together. Let's start with the first question and break it down. BLUE means
Best = "lowest variance"
Linear = formula is a "linear function" of the dependent variable
Unbiased = "expectation equals the truth"
Estimator = a procedure to guess the values of unknown parameters
This sounds great! Any linear procedure yielding a guess of the true values of unknown parameters that is correct on average will necessarily have at least as high a variance as OLS. And, remember, we care about the variance of an estimator since (in the frequentist view) any estimator is a random variable and thus has a distribution centered on the truth (if it is unbiased) with some dispersion (variance) around the truth. Thus, while an estimator may be unbiased, it will generally not produce estimates that are equal to the truth in a given random sample. However, a lower variance implies that the estimate in a given sample is more likely to be close to the truth.
Finding that OLS is BLUE is remarkable. It means we know that OLS is no worse than any other linear, unbiased estimator you or anyone else might invent ... without the need to check every alternative to OLS.
Given this, it seems like we ought to care a great deal that OLS is BLUE.
Not so fast. There is this little issue of "linear" in the name. Presumably this part is often overlooked because most empirical studies specify a "linear" regression model. Since we are estimating a linear regression model, the thinking goes, restricting oneself to the class of "linear" estimators is not a big deal.
Unfortunately, the linear in "linear model" is completely different from the linear in "linear estimator." The former refers to regression specifications that are linear in parameters. The usual example being
y = Xβ + ε
where β are the unknown parameters. Importantly, the covariates, X, can be nonlinear and the model is still linear in parameters. The latter, however, refers to the estimate of β, call it b, being of the form
b = a1*y1 + … + aN*yN

where yi refers to the value of the dependent variable for observation i and ai is the weight placed on yi; the choice of weights defines the estimator. This formula is linear in the dependent variable. In a simple regression with only a single covariate, OLS sets

ai = (xi − μ) / Σj (xj − μ)²

where x is the independent variable and μ is the sample mean of x. (A quick numerical check of this appears below.)

Even if we are willing to restrict ourselves to linear estimators, many assumptions must hold for OLS to be BLUE. These are the usual assumptions of the Classical Linear Regression Model (CLRM), notably a linear (in parameters) data-generating process with independent, mean zero, and homoskedastic errors, ε. These are also referred to as the Gauss-Markov assumptions, as it is the Gauss-Markov Theorem that proves that OLS is BLUE under these assumptions. Absent from this list is a distributional assumption on the errors.
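As promised, here is a minimal sketch (my own, using NumPy, which the post does not mention) checking that the OLS slope in a simple regression really is a linear function of the dependent variable, i.e., a weighted sum of the yi with the weights given above. The data-generating values are arbitrary and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(size=100)   # illustrative data, not from the post

# OLS weights for the slope in a simple regression: a_i = (x_i - xbar) / sum_j (x_j - xbar)^2
a = (x - x.mean()) / np.sum((x - x.mean()) ** 2)
b_linear = np.sum(a * y)                    # slope as a weighted sum of the y_i

# Compare with the textbook OLS slope formula, cov(x, y) / var(x)
b_ols = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
print(b_linear, b_ols)                      # agree up to floating-point error
```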
Interestingly, there is substitutability between the restriction to linear estimators and the assumption of normally distributed errors. If ε is normally distributed, then OLS is the Best Unbiased Estimator (BUE): it has the lowest variance among all unbiased estimators, even non-linear ones. In the absence of normality, OLS is only BLUE and we are implicitly limiting ourselves to linear estimators.
Unfortunately, most students are taught that the assumption of normal errors is only needed for hypothesis testing (and, even then, only in small samples). Yet, here we see that normality has a large effect on the quality of the OLS estimates themselves!
BLUE and BUE are finite sample properties of estimators, as bias is a finite sample property and the variance referenced by the term "best" is the finite sample variance of the estimator. However, the assumption of normal errors also matters for the asymptotic properties of OLS. Under normality, the OLS estimates of β are identical to the maximum likelihood (ML) estimates, and ML estimators are asymptotically efficient (they achieve the lowest asymptotic variance, known as the Cramér-Rao lower bound). Absent normality, the ML estimates of β will diverge from the OLS estimates, with the former continuing to be asymptotically efficient and the latter not so much.
Finally, if any of the assumptions of the CLRM (aside from normality) do not hold, then OLS is no longer BLUE or BUE.
This discussion then raises a final question. If one does not wish to assume normally distributed errors and one does not wish to restrict oneself to linear estimators, is there another estimator that is "best"?
I am unaware of any other Gauss-Markov-type theorems making the case that a particular alternative to OLS is BUE under an alternative distributional assumption. However, another estimator worth examining is the Least Absolute Deviations (LAD) estimator. Students are typically exposed to the LAD estimator ... for roughly one minute. Then we move on. As Lee Corso likes to say, "Not so fast, my friend!"
Whereas OLS minimizes the sum of squared errors, the LAD estimator minimizes the sum of the absolute value of the errors. Whereas the OLS regression line goes through the mean of the data, the LAD regression line goes through the median of the data. In fact, as you may know, the LAD estimator is the quantile regression estimator of Koenker & Bassett (1978) at the median quantile. Thus, LAD is sometimes referred to as median regression.
To the extent that students do recall a bit about LAD, it is probably that it is less susceptible to outliers than OLS (since the residuals are summed in absolute value rather than squared). But LAD has more to offer than just that. There is no closed-form solution for the LAD estimator of β in a linear regression model; more importantly, it is not a linear estimator. Thus, under non-normal errors, even absent the presence of outliers, LAD may have a smaller finite-sample and/or asymptotic variance than OLS.
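For readers who want to try LAD themselves, here is a minimal sketch of fitting it in Python (my choice of tooling; the post does not specify software) via statsmodels' quantile regression evaluated at q = 0.5, which is exactly median regression. The simulated data are purely illustrative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 1.0 * x + rng.normal(size=100)      # true slope beta = 1, as in the post's setup

X = sm.add_constant(x)
lad = sm.QuantReg(y, X).fit(q=0.5)      # LAD = quantile regression at the median
ols = sm.OLS(y, X).fit()

print("LAD estimates:", lad.params)     # both should be near (0, 1) here
print("OLS estimates:", ols.params)
```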
I did a small simulation. I simulated 10,000 data sets with N = 100 where the true model is a simple linear regression with β = 1. In the first case, the errors are standard normal. I compare the distributions of the OLS and LAD estimates of β. In this case, OLS is BUE and so, not surprisingly, has a lower dispersion around the truth than LAD.
In the second case, the errors are non-normal with mean zero, variance one, but heavily skewed. Now, OLS is BLUE but not BUE. In this case, the non-linear estimator, LAD, has a much smaller dispersion around the truth than OLS.
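For anyone who wants to replicate something along these lines, here is a rough sketch of such a simulation. This is my own code, not the author's; in particular, the skewed error distribution below (a centered exponential, which has mean zero, variance one, and heavy right skew) is an assumption, since the post does not say which skewed distribution was used.

```python
import numpy as np
import statsmodels.api as sm

def simulate(error_draw, reps=10_000, n=100, beta=1.0, seed=42):
    """Return the dispersion (std. dev.) of OLS and LAD slope estimates across reps datasets."""
    rng = np.random.default_rng(seed)
    ols_b, lad_b = np.empty(reps), np.empty(reps)
    for r in range(reps):                       # reps=10_000 matches the post but is slow; reduce to taste
        x = rng.normal(size=n)
        y = beta * x + error_draw(rng, n)
        X = sm.add_constant(x)
        ols_b[r] = sm.OLS(y, X).fit().params[1]
        lad_b[r] = sm.QuantReg(y, X).fit(q=0.5).params[1]
    return ols_b.std(), lad_b.std()

def normal_errors(rng, n):
    return rng.normal(size=n)                   # standard normal errors

def skewed_errors(rng, n):
    return rng.exponential(size=n) - 1.0        # assumed: mean 0, variance 1, heavily skewed

print("normal errors (sd of OLS, LAD):", simulate(normal_errors))
print("skewed errors (sd of OLS, LAD):", simulate(skewed_errors))
```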
These simulations are in small samples (N = 100). What about asymptotically? Well, as I said, ML achieves the Cramér-Rao lower bound if the likelihood function is correctly specified. It turns out that the LAD estimator is equivalent to ML under the assumption that the errors follow a Laplace distribution.
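To see why, note that the Laplace log-likelihood is, apart from terms that do not involve β, just minus the sum of absolute residuals, so maximizing it is the same as minimizing the LAD criterion. A sketch of the algebra, assuming i.i.d. Laplace errors with location zero and scale s:

```latex
% Laplace density: f(e) = (1/(2s)) * exp(-|e|/s)
\begin{align*}
\log L(\beta, s)
  &= \sum_{i=1}^{N} \log\!\left[\frac{1}{2s}\exp\!\left(-\frac{|y_i - x_i'\beta|}{s}\right)\right] \\
  &= -N\log(2s) - \frac{1}{s}\sum_{i=1}^{N} \lvert y_i - x_i'\beta \rvert .
\end{align*}
% For any s > 0, maximizing over beta is equivalent to minimizing
% the sum of |y_i - x_i'beta|, which is exactly the LAD objective.
```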
What is a Laplace distribution? It is not something most empirical researchers are used to dealing with. Well, here's a plot of the normal pdf and a Laplace pdf.
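If you want to draw the comparison yourself, here is a minimal sketch (again my own code, not the post's) plotting the standard normal density against a Laplace density; matching the Laplace variance to one is my assumption, made so the two curves differ only in shape.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, laplace

z = np.linspace(-4, 4, 400)
plt.plot(z, norm.pdf(z), label="Normal(0, 1)")
# Laplace with scale 1/sqrt(2) has variance 1, matching the normal (assumed scaling)
plt.plot(z, laplace.pdf(z, scale=1 / np.sqrt(2)), label="Laplace, variance 1")
plt.legend()
plt.title("Normal vs. Laplace densities")
plt.show()
```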
Different, yes. But, given that economic theory does not support any particular distributional assumption, there is no reason to favor one over the other on the basis of such subtle differences.