Heckman, Schmeckman!
Ah, grad school. It's brutal, both in terms of the work required and the mental toll. Thankfully, the latter is more out in the open these days. I came across one tweet this week, presumably from a current PhD student, asking how often people thought about dropping out of grad school. Today, I came across another tweet asking how often PhD students were brought to tears. I must admit, I have had a few PhD students cry in my office over the years.
Thinking about the mental toll of grad school for myself, I was reminded of an incident during my time that happens to be (tangentially) related to the topic of this post. It happened when I was attending an empirical seminar given by some outside speaker. No clue who the speaker was. Well, one of the professors (I do recall who but won't name names) could not see how something in the model was identified. It didn't seem like the speaker could get the explanation across; or, the professor wasn't doing a good job listening. It got a bit contentious.
I abhor conflict. Makes me physically uncomfortable. So, even though I was a lowly grad student, I was compelled to interject. I told the (my!) professor that the speaker was just doing something that was first done in a paper by Heckman in the 1970s. [Disclaimer: I was not in grad school in the 1970s ... but close.]
The professor turned to me and snapped,
"Heckman, Schmeckman!"
That's the last thing I remember. I think I may have blacked out.
Alas, I digress. In this post, I want to talk about sample selection. And, while that was not the issue in the seminar I re-lived above, most of us are probably aware that Heckman was awarded the Nobel Prize in Economic Sciences (it is too a real Nobel!) in large part due to his work on the statistical analysis of selected samples. Specifically, the summary for the prize states:
"James Heckman received the Prize in Economic Sciences for his development of theory and methods used in the analysis of individual or household behavior. His work in selective samples led him to develop methods (such as the Heckman correction) for overcoming statistical sample-selection problems. His research has given policymakers new insights into areas such as education, job training and the importance of accounting for general equilibrium in the analysis of labor markets."
Let me briefly explain or remind you of the issue with selected samples. In most empirical research in economics (and other disciplines), the goal of statistical analysis is to use sample data to learn something about the population from which the sample is drawn. For information from a given sample to inform us about the population, the sample should be randomly drawn from the population.
Knowing this, most large surveys are based on random samples (perhaps drawn with something more complex, like stratified or cluster sampling, but that's not relevant here). So far, so good. Now it is time for the researcher to dive into the data and ... "clean" it.
In so doing, the researcher often stumbles upon the problem of missing data. If one excludes observations with missing data, then that which was once a random sample may no longer be.
To be precise, the final sample -- excluding the missing data -- will only provide information about the population if the final sample remains a random sample from the population.
One common situation where this is not likely to be the case is when the dependent variable is missing for some observations and this missingness reflects a choice by agents. For instance, the canonical example in Heckman (1979) is female wages. Here, women's wages can only be observed for the sub-sample of women who are employed. Wages are missing for unemployed women and those out of the labor force. And, it is unlikely to be the case that the sub-sample of employed women is a random sample from the population of all women.
Other common examples include models of house prices (which only exist for houses recently sold on the market), test scores on college entrance exams (which are not taken by students not intending to pursue college), and profits for firms (which are not known for firms that exit the market).
As a result, if a researcher estimates a statistical model to learn the determinants of the outcome using only observations with non-missing data, the results will not provide information about the determinants of the outcome for the full population. They will provide information about the determinants of the outcome in the population that the selected sample does represent. However, that population is often not of interest (or as interesting). This is an example of the British idiom "moving the goalposts."
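To see the danger concretely, here is a minimal simulation sketch in Python (all numbers are made up for illustration): the true model is y = 2 + 3x + e, but y is only "observed" when it exceeds a cutoff, a non-random selection rule. OLS on the selected sub-sample understates the slope.

```python
import random

random.seed(0)

# Illustrative data-generating process: y = 2 + 3*x + e, e ~ N(0, 1).
# We only "observe" y when it exceeds 4 -- a non-random selection rule.
n = 2000
full, selected = [], []
for _ in range(n):
    x = random.random()
    y = 2 + 3 * x + random.gauss(0, 1)
    full.append((x, y))
    if y > 4:
        selected.append((x, y))

def ols_slope(pairs):
    """Simple-regression slope: cov(x, y) / var(x)."""
    mx = sum(x for x, _ in pairs) / len(pairs)
    my = sum(y for _, y in pairs) / len(pairs)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var = sum((x - mx) ** 2 for x, _ in pairs)
    return cov / var

print("full-sample slope:    ", ols_slope(full))      # near the true 3
print("selected-sample slope:", ols_slope(selected))  # attenuated toward 0
```

The selected sample over-represents observations with large positive errors (especially at low x), which flattens the fitted line.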
To learn about the full population, Heckman (1979) proposed a solution that has become known as the Heckman Selection Correction. It is still widely used today.
I will not recount the technique here; I have done so in a prior post. Instead, let me just note that the correction proceeds by estimating a linear regression model for the outcome of interest using the selected sample, but augmenting the model with a control function that "controls" for the selection issue. The control function (the inverse Mills ratio) is constructed from a first-stage binary choice model for whether a given observation in the sample has missing or non-missing data on the dependent variable. With that term included, the estimates are consistent for the parameters of the full population.
Interestingly, the Heckman Selection Correction does not overcome the issue by imputing the missing values of the dependent variable. While imputation has a long history in statistics, researchers rarely address non-random missing values of the dependent variable via imputation.
At first glance, this makes sense. If the issue with estimating your model on the selected sample is that the estimates are not applicable to the population for whom the outcome is missing, then how can one build an imputation model to predict outcomes for those with missing values in which one has any confidence?
And that -- finally! -- brings us to the point of this blog. In the canonical example of female wages, the linear regression model is estimated using Ordinary Least Squares (OLS). However, as discussed in my previous post, there are other ways to estimate linear regression models than OLS. Some of these alternatives may be "better" than OLS even absent missing data on the dependent variable. But they may really make your life easier when there is missing data on the dependent variable: they are better than OLS with imputed missing data because they make it much easier to have confidence in an imputation model.
Huh?
Seems odd that the estimation technique chosen to estimate the model one is really after would affect the confidence one has in the imputations that come from a different model. But, it's true. To be precise, the estimation technique used for the model of interest does not affect the quality of the imputation. Instead, it affects the statistical issues created by erroneous imputations.
Lemme explain.
In the prior blog post, I discussed the benefits of the Least Absolute Deviations (LAD) estimator of a linear regression model. The LAD estimator is the Quantile Regression (QR) estimator at the median. The LAD estimator fits the conditional median function, med(Y|X), whereas OLS fits the conditional expectation function, E[Y|X], or the conditional mean.
As is well known, the mean and median have quite different properties. Consider the following trivial example. Let there be a set of three numbers: {-1, 0, 1}. Clearly, the mean and median are both zero. Next, consider the set {-100, 0, 1}. The mean is -33 and the median is still zero. Finally, consider the set { . , 0, 1}, where the value of the first number is really negative one but has mysteriously disappeared in the data. What are the mean and median of the set of three numbers? No clue. We have missing data.
Don't give up. Let's impute the missing value. It is trivial to see that if one replaces the missing value with any value less than zero, the median will be correct (and equal to zero). However, the mean will only be correct if one replaces the missing value with the correct value (-1). Obtaining the correct median only requires the imputed value to be on the correct side of the true median, whereas obtaining the correct mean requires the imputed value to be correct.
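In Python, using only the standard library, the example looks like this:

```python
from statistics import mean, median

# Complete data: mean and median are both zero.
complete = [-1, 0, 1]
assert mean(complete) == 0 and median(complete) == 0

# Now suppose the -1 goes missing and we impute it. Any negative
# imputation recovers the median exactly; the mean is recovered only
# by the one correct value, -1.
for imputed in (-0.5, -7, -100):
    filled = [imputed, 0, 1]
    print(imputed, "-> median:", median(filled), " mean:", mean(filled))
```

Every line of the loop prints a median of 0, while the mean swings with the imputed value.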
This extends to estimates of a linear regression model. If one imputes values of the dependent variable in a linear regression model estimated by LAD, the estimates will be consistent for the full population as long as the imputed values are on the correct side of the (conditional) median. For example, if one believes that all students who opt not to take the college entrance exam would have scored poorly had they taken the exam, one can replace the missing scores with zeros and estimate the model using the full sample by LAD. The estimates will be consistent as long as the non-test takers would have scored below the (conditional) median, even if the scores would not have been identically zero. However, OLS requires more. While it does not require the imputations to be correct, it does require the imputation errors to behave like classical measurement error (i.e., mean zero and uncorrelated with the regressors).
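A small stdlib-Python sketch of this logic (the data-generating process and numbers are my own invention, not from any paper): scores in roughly the bottom third of the error distribution go missing, are imputed to zero, and a brute-force LAD fit on the full (imputed) sample still recovers the coefficients. The grid search stands in for a proper quantile-regression routine.

```python
import random

random.seed(0)

# Latent test score y* = 2 + 3*x + e, with e ~ N(0, 1) (median zero).
# A student skips the exam when e < -0.5 (roughly the bottom 31%), so
# every missing score truly lies below the conditional median 2 + 3*x.
# We impute missing scores to 0, also below 2 + 3*x for x in (0, 1).
n = 500
data = []
for _ in range(n):
    x = random.random()
    e = random.gauss(0, 1)
    y = 2 + 3 * x + e if e >= -0.5 else 0.0  # missing score imputed to zero
    data.append((x, y))

# Brute-force LAD: minimize the sum of absolute residuals over a grid.
def sad(a, b):
    return sum(abs(y - a - b * x) for x, y in data)

best = min(
    ((a / 10, b / 10) for a in range(0, 41) for b in range(0, 61)),
    key=lambda ab: sad(*ab),
)
print("LAD estimates (a, b):", best)  # close to the true (2, 3)
```

Imputing those same observations to zero and running OLS would badly bias the estimates, since the imputation errors are large and systematically negative.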
LAD for the win.
One does not even need LAD to accomplish this. In Angrist et al. (2006), the authors are interested in the determinants of student performance on a college entrance exam in Colombia. Instead of LAD, they overcome the sample selection issue using a Tobit model.
A Tobit model assumes a linear regression model for a latent outcome, Y*. In a one-sided Tobit model, the observed outcome, Y, equals Y* if and only if Y* exceeds some threshold, say μ. Otherwise, Y = μ. The model is estimated by Maximum Likelihood (ML) usually under the assumption that the error term in the linear regression model for Y* is normal.
Well, just like in the LAD model, the precise value of Y* has no impact on the estimates for observations with Y* < μ. Angrist et al. (2006) exploit this fact by assuming that all non-test takers would have scored below μ. They then impute missing scores to μ and estimate the Tobit model. Moreover, they perform a specification test by estimating the model for increasing values of μ. If all the imputed test scores really are below, say, 50 (i.e., setting μ = 50), then they must also be below 60 (i.e., setting μ = 60). As such, the estimates should not differ qualitatively.
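Here is a stdlib-Python sketch of the same idea (illustrative numbers, not Angrist et al.'s data or code): impute missing outcomes to μ, maximize the censored-normal (Tobit) likelihood, and check that raising μ leaves the estimates qualitatively unchanged. For brevity, the error standard deviation is treated as known (σ = 1), so the grid search is over the intercept and slope only.

```python
import math
import random

random.seed(0)

# Latent score y* = 2 + 3*x + e, e ~ N(0, 1); the score is missing
# (treated as censored) whenever y* < mu, and is imputed to mu.
n = 500
sample = []
for _ in range(n):
    x = random.random()
    sample.append((x, 2 + 3 * x + random.gauss(0, 1)))

LOG_SQRT2PI = 0.5 * math.log(2 * math.pi)

def tobit_fit(mu):
    """Grid-search ML estimates of (a, b), censored at mu, sigma = 1."""
    def loglik(a, b):
        ll = 0.0
        for x, y_star in sample:
            m = a + b * x
            if y_star > mu:   # score observed: normal density term
                ll += -0.5 * (y_star - m) ** 2 - LOG_SQRT2PI
            else:             # score imputed to mu: normal CDF term
                p = 0.5 * (1 + math.erf((mu - m) / math.sqrt(2)))
                ll += math.log(max(p, 1e-300))
        return ll
    return max(
        ((a / 10, b / 10) for a in range(0, 41) for b in range(0, 61)),
        key=lambda ab: loglik(*ab),
    )

# Specification check in the spirit of Angrist et al. (2006): if
# censoring at mu = 3 is valid, censoring harder at mu = 3.5 should
# give qualitatively similar estimates (both near the true (2, 3)).
print("mu = 3.0:", tobit_fit(3.0))
print("mu = 3.5:", tobit_fit(3.5))
```

In practice one would use a packaged Tobit routine that also estimates σ; the grid search here just keeps the sketch dependency-free.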
Isn't it great to know there are alternatives to the usual Heckman Selection Correction? Of course, a natural question is when would one prefer one technique over another. Without doing simulations and certainly without doing asymptotic proofs, the differences across the approaches come down to efficiency (based on sample size) and the validity of the required assumptions.
The Heckman procedure (typically) relies on the assumption of bivariate normality (although this can be relaxed). In contrast, the LAD estimator only requires the conditional median of the error term to be equal to zero. The Tobit requires normality.
The Heckman procedure estimates the model for the outcome of interest using information only from the sub-sample with non-missing data. In contrast, imputation-based approaches, such as those relying on LAD and Tobit, estimate the model for the outcome of interest using data from the full sample. Thus, the imputation approaches use a larger N, but no longer use OLS.
Finally, the Heckman procedure (in practice) requires an exclusion restriction: an observed variable that affects whether an observation has missing or non-missing data, but does not affect the outcome of interest itself (again, refer to my prior post). In contrast, the imputation approaches require the imputations to lie on the correct side of the (conditional) median or the threshold, μ.
The "best" approach likely is case-specific. Nonetheless, it is important that researchers understand that there are estimators beyond OLS, and sample selection methods beyond the Heckman Selection Correction. So, everyone say it with me. Don't be shy!
PS. No one tell Heckman I said this.