Where's Waldo
My daughter started college this week; moved into her dorm last week. Fingers crossed, knock on wood, throw some salt, avoid black cats, don't walk under a ladder ... everything's been great so far and she is having an awesome time. She will definitely be the smartest one in our family in six months. In addition to her brain becoming larger, she also has already met a ton of cool new people, including a guy on her dorm floor named Waldo.
While my daughter now knows where Waldo is, it reminded me of a discussion on #EconTwitter from this past summer initiated by Arpit Gupta. Arpit posed a question not about Where's Waldo?, but rather about Where's the Data?. Specifically, he asked about the way that many (most?) applied researchers seem to gloss over issues of missing data when conducting empirical analyses, either ignoring the issue completely or, at best, relying on ad hoc and unjustified work-arounds.
With missing data, values for variables being used in the analysis are missing for select observations. The descriptor 'select' here is crucial. If data are missing for all observations, then the problem is one of potential omitted variables (if the missing variable is a control) as the variable is (completely) omitted from the data. This requires a whole other set of considerations and possible remedies.
With missing data, the variable(s) are in our data, but there are no known value(s) for a subset of the sample. When more than one variable has missing data, different variables may be missing for different observations or they may be missing for the same observations.
That missing data are an issue should be obvious to anyone who has performed any statistical analysis. But don't just take my word for it.
Horton and Kleiman (2007): "Missing data are a frequent complication of any real-world study. The causes of missingness are often numerous..."
Ibrahim et al. (2005): "In clinical trials and observational studies, complete covariate data are often not available for every subject."
Burton and Altman (2004): "We are concerned that very few authors have considered the impact of missing covariate data; it seems that missing data is generally either not recognized as an issue or considered a nuisance that is best hidden."
In a regression context, there are two types of missing data.
- Missing data on an outcome variable, y
- Missing data on an independent variable, x
Missing data on the outcome variable may arise for numerous reasons. Possibilities include survey nonresponse, confidentiality, and human error. In addition, the outcome may be missing for behavioral reasons. For example, wages are missing for the non-employed, profits are missing for firms that exit the market, and home prices are missing for those not sold. However, regardless of the reason, all roads lead to the same place.
First, one can use the complete case (CC) approach: discard the observations with missing data and estimate the model on the rest. This approach yields consistent estimates as long as the data are missing conditionally at random. In other words, unobserved determinants of whether the data are missing must be independent of the outcome conditional on the covariates in the model.
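To make the conditionally-at-random requirement concrete, here is a minimal simulation sketch. It reads the approach above as "drop the observations with a missing outcome and run OLS on the rest". The data-generating process, the variable names, and the two missingness rules are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)   # true slope is 2 (assumed DGP)

def cc_slope(keep):
    """OLS slope of y on x using only the observations flagged in `keep`."""
    X = np.column_stack([np.ones(keep.sum()), x[keep]])
    return np.linalg.lstsq(X, y[keep], rcond=None)[0][1]

# Case 1: whether y is observed depends on x and on noise unrelated to y.
# Conditional on x, missingness is independent of the outcome.
keep_car = (x + rng.normal(size=n)) > 0

# Case 2: whether y is observed depends on y itself, so the condition fails.
keep_not = y > 1.0

slope_car = cc_slope(keep_car)   # should land close to 2
slope_not = cc_slope(keep_not)   # should be attenuated toward zero
print(slope_car, slope_not)
```

In the first case the estimate should sit near the true slope even though roughly half the sample is thrown away. In the second it should not, no matter how large the sample gets.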
Second, if the data are not missing conditionally at random, then one can recover consistent estimates using the Heckman (1979) sample selection correction. This approach amounts to modeling the absence of data and using this to devise a control function (e.g., the inverse Mills' ratio) to be included in the regression model for the outcome using only the sub-sample with complete data.
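A rough sketch of the two-step idea follows: fit a probit for whether the outcome is observed, build the inverse Mills' ratio from it, and add that ratio to the outcome regression on the observed sub-sample. Everything here is an illustrative assumption: the simulated data, the variable `z` (a hypothetical variable that affects selection but not the outcome), and the hand-rolled probit. It is not Heckman's original code, and the second-step standard errors would need correcting in real work.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
z = rng.normal(size=n)                    # hypothetical excluded variable
u = rng.normal(size=n)                    # selection error
e = 0.7 * u + math.sqrt(1 - 0.7**2) * rng.normal(size=n)  # corr(e, u) = 0.7
y = 1.0 + 2.0 * x + e                     # true slope is 2 (assumed DGP)
observed = (0.5 + x + z + u) > 0          # y is only seen when this holds

_erf = np.vectorize(math.erf)

def norm_cdf(v):
    return 0.5 * (1.0 + _erf(v / math.sqrt(2.0)))

def norm_pdf(v):
    return np.exp(-0.5 * v**2) / math.sqrt(2.0 * math.pi)

def probit(W, d, iters=25):
    """Probit coefficients via Fisher scoring, starting from zero."""
    g = np.zeros(W.shape[1])
    for _ in range(iters):
        idx = W @ g
        p = np.clip(norm_cdf(idx), 1e-10, 1 - 1e-10)
        f = norm_pdf(idx)
        score = W.T @ (f * (d - p) / (p * (1 - p)))
        info = (W * (f**2 / (p * (1 - p)))[:, None]).T @ W
        g = g + np.linalg.solve(info, score)
    return g

# Step 1: model the absence of data.
W = np.column_stack([np.ones(n), x, z])
gamma = probit(W, observed.astype(float))

# Step 2: inverse Mills' ratio as a control function, observed sub-sample only.
idx_sel = W[observed] @ gamma
imr = norm_pdf(idx_sel) / norm_cdf(idx_sel)
X_heck = np.column_stack([np.ones(observed.sum()), x[observed], imr])
beta_heckman = np.linalg.lstsq(X_heck, y[observed], rcond=None)[0]

# For comparison: OLS on the observed sub-sample without the correction.
X_naive = np.column_stack([np.ones(observed.sum()), x[observed]])
beta_naive = np.linalg.lstsq(X_naive, y[observed], rcond=None)[0]
print(beta_naive[1], beta_heckman[1])
```

The uncorrected slope should fall short of 2, while the corrected one should recover it approximately. Without something like `z` in the selection equation, the correction leans entirely on the normality assumption.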
Third, the missing data may be imputed. Imputation is another one of these fancy words that really means "make up a number."
That said, researchers ought to be aware that sometimes surveys impute the data for us! For example, this is a common occurrence in US data on wages obtained from the Census Bureau. Hirsch & Schumacher (2004) and Bollinger & Hirsch (2006) discuss the issue in detail and show that imputation of wages, not surprisingly, likely leads to biased regression estimates. Thus, researchers need to be sure they fully understand the process by which survey data are generated. Sometimes even 'non-missing' is really 'missing'.
Missing data on control variables may arise for similar reasons as with the outcome variable. Again, the reason does not affect the choices that are available.
The two most common choices made by researchers are the CC approach (i.e., discard observations with any missing covariate data), defined above, and an ad hoc missing indicator (MI) approach. With the MI approach, the missing values of a continuous control, Xk, are replaced with an arbitrary value (typically the sample mean) and a dummy variable, Dk, is created which is equal to one for those observations where Xk was missing and zero otherwise. The missing values of a discrete control, Xk, are addressed by creating a new category reflecting missing values. The discrete variable is then turned into a series of binary indicators, one reflecting missing data. The set of controls in the model is then expanded to include the original set as well as the missing value indicators.
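The mechanics of the missing indicator approach are easy to sketch. The data below are simulated, and the names (`x2_obs`, `d2`, `region`) are placeholders rather than anything from a real data set. The point is only to show how the filled-in variable, the dummy, and the extra "missing" category get built.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Continuous control: knock out about 30% of x2 (np.nan marks missing).
x2_obs = np.where(rng.random(n) < 0.3, np.nan, x2)
d2 = np.isnan(x2_obs).astype(float)                  # 1 if x2 was missing
x2_filled = np.where(np.isnan(x2_obs), np.nanmean(x2_obs), x2_obs)

# Discrete control with categories 0, 1, 2: missing becomes category 3.
region = rng.integers(0, 3, size=n).astype(float)
region[rng.random(n) < 0.2] = np.nan
region_cat = np.where(np.isnan(region), 3, region).astype(int)
region_dummies = np.eye(4)[region_cat][:, 1:]        # category 0 is the base

# Expanded set of controls: originals plus the missing-value indicators.
X = np.column_stack([np.ones(n), x1, x2_filled, d2, region_dummies])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
print(X.shape, coef)
```

Note that no observations are dropped: all `n` rows enter the regression, which is exactly the appeal of the approach.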
Alternative versions of the MI approach are considered in Dardanoni et al. (2011). To my knowledge, this paper has gone (nearly) completely overlooked by applied researchers. To start, the authors define the fill-in approach. Here, one simply replaces the missing values as described above, but without augmenting the controls with any missing indicators. Second, the fill-in approach is paired with a missing indicator approach. However, unlike the MI approach above, now the indicators represent patterns of missingness, not missing data for each control separately. For clarity, consider the following multiple regression model
Y = a + b1X1 + b2X2 + e.
With two controls, there are three possible patterns of missing data (at the observation level): missing X1 but not X2, missing X2 but not X1, and missing X1 and X2. Thus, instead of creating two missing indicators, D1 and D2, three would be created, say R1, R2, and R3, one for each missing data pattern. The regression model is then augmented to include these three indicators.
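A tiny sketch of the difference between the per-variable indicators (D1, D2) and the pattern indicators (R1, R2, R3), using five made-up observations:

```python
import numpy as np

x1 = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
x2 = np.array([2.0, 4.0, np.nan, np.nan, 1.0])

m1, m2 = np.isnan(x1), np.isnan(x2)

# Per-variable indicators, as in the ad hoc approach.
d1 = m1.astype(int)            # [0, 1, 0, 1, 0]
d2 = m2.astype(int)            # [0, 0, 1, 1, 0]

# Pattern indicators: mutually exclusive, one per missing-data pattern.
r1 = (m1 & ~m2).astype(int)    # missing X1 only  -> [0, 1, 0, 0, 0]
r2 = (~m1 & m2).astype(int)    # missing X2 only  -> [0, 0, 1, 0, 0]
r3 = (m1 & m2).astype(int)     # missing both     -> [0, 0, 0, 1, 0]
```

Here D1 = R1 + R3 and D2 = R2 + R3, so the pattern indicators carry strictly more information: they let observations missing both controls have their own intercept shift.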
Finally, Dardanoni et al. (2011) propose what they call the grand model (GM) approach.
Under the GM approach, the regression model is not only augmented with the indicators for missing data patterns just described, but also the complete set of interactions between the controls (X1 and X2 in the above example) and the missing value pattern indicators (R1, R2, and R3 in the above example).
The estimated effects of the controls in the GM approach are identical to those from the CC approach. Dardanoni et al. (2011) propose either choosing among these three approaches (fill-in, fill-in with indicators, and fill-in with indicators and interactions) using model selection techniques, or using model averaging methods to combine the estimates across multiple models.
Despite the authors making code available ... in Stata no less! (see -gmi-) ... the approach has not caught on.
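The claim that the grand model reproduces the CC estimates can be checked numerically. Below is a minimal sketch on simulated data; it is not the -gmi- implementation. One simplification is mine: because the filled-in value is constant within a pattern, an interaction between a pattern indicator and a control that is missing in that pattern is perfectly collinear with the indicator itself, so those redundant interactions are left out.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

m1 = rng.random(n) < 0.2                  # X1 missing
m2 = rng.random(n) < 0.2                  # X2 missing
r1, r2, r3 = (m1 & ~m2), (~m1 & m2), (m1 & m2)
complete = ~m1 & ~m2

# Fill in missing values with the mean of the observed values.
x1f = np.where(m1, x1[~m1].mean(), x1)
x2f = np.where(m2, x2[~m2].mean(), x2)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Complete case estimates.
X_cc = np.column_stack([np.ones(complete.sum()), x1[complete], x2[complete]])
b_cc = ols(X_cc, y[complete])

# Grand model: fill-in + pattern indicators + non-redundant interactions.
X_gm = np.column_stack([
    np.ones(n), x1f, x2f,
    r1, r2, r3,
    r1 * x2f,    # X1 missing: only X2 varies within this pattern
    r2 * x1f,    # X2 missing: only X1 varies within this pattern
])
b_gm = ols(X_gm, y)

print(b_cc)
print(b_gm[:3])  # intercept and slopes on X1, X2 match the CC estimates
```

The intuition is that the indicators and interactions give every incomplete pattern its own intercept and slopes, so the coefficients on X1 and X2 end up being pinned down by the complete cases alone.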
But wait, there's more! If the grand model approach sounds too extravagant, researchers have at least two other choices at their disposal when data on controls are missing.
First, there is imputation or multiple imputation (confusingly, also abbreviated MI). This was actually the specific topic of Arpit's tweet. There is a long literature on MI in statistics. As Arpit asks on #EconTwitter, why this has never really become a part of the economist's tool kit is a bit unclear. I have no magical insight. I do, however, have some cynical insight. As with measurement error, missing data issues are typically ignored because trying to address the issue raises more red flags with referees than ignoring the issue. It reflects a deficiency in the publication process and contributes to the current replication crisis.
That's it. That's imputation.
With MI, you ... wait for it ... impute the missing values in multiple ways and average the estimates.
Easy. You know what is even easier? Yup. Stata has an entire suite of MI commands (see -mi-).
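For anyone curious what "impute in multiple ways and average" looks like by hand, here is a stripped-down sketch on simulated data. The imputation model (a linear regression of X1 on X2 and Y plus a random draw from the residual distribution) and the number of imputations are my assumptions. A proper implementation would also draw the imputation-model parameters and combine the variances across imputations, which this sketch skips.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)
miss = rng.random(n) < 0.3                # X1 missing completely at random

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Imputation model, fit on the complete cases: X1 on a constant, X2, and Y.
Z = np.column_stack([np.ones(n), x2, y])
g = ols(Z[~miss], x1[~miss])
resid = x1[~miss] - Z[~miss] @ g
s = resid.std(ddof=Z.shape[1])            # residual standard deviation

M = 20                                    # number of imputed data sets
estimates = []
for _ in range(M):
    x1_imp = x1.copy()
    # Predicted value plus a fresh random draw, so each data set differs.
    x1_imp[miss] = Z[miss] @ g + rng.normal(scale=s, size=miss.sum())
    X = np.column_stack([np.ones(n), x1_imp, x2])
    estimates.append(ols(X, y))

b_mi = np.mean(estimates, axis=0)         # average across imputations
print(b_mi)                               # should be near (1, 2, -1)
```

Including the outcome in the imputation model matters here: imputing X1 from X2 alone would weaken the relationship between the imputed values and Y.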
A final option available is my favorite, in theory: partial identification. In contrast to all the approaches discussed here that require structure and/or assumptions of missing conditionally at random, partial identification techniques can be used to see what can be learned in the presence of missing data under various assumptions. I won't go into more detail here, but see, e.g., Horowitz & Manski (2000).
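To give a flavor, the simplest textbook case is worst-case bounds on the mean of a bounded outcome when some outcomes are missing. This is only a sketch of that basic idea on simulated data, not the estimator in the paper cited above. It assumes nothing about why the data are missing, only that the outcome lies in [0, 1].

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
y = rng.binomial(1, 0.6, size=n).astype(float)   # outcome bounded in [0, 1]

# Missingness depends on y itself, so it is not missing at random.
p_miss = np.where(y == 1, 0.1, 0.4)
observed = rng.random(n) > p_miss

p_obs = observed.mean()
mean_obs = y[observed].mean()

# Replace every missing outcome with the smallest, then largest, possible value.
lower = mean_obs * p_obs + 0.0 * (1 - p_obs)
upper = mean_obs * p_obs + 1.0 * (1 - p_obs)

print(mean_obs)          # naive estimate from the observed outcomes
print(lower, upper)      # bounds that must contain the full-sample mean
```

The width of the interval equals the share of missing outcomes, which is the price of making no assumptions. Adding assumptions tightens the bounds.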
I know. It is sad. But, I will regale you with one last bit. A plug for my own paper!
No, wait! It is joint work with my former student Ian McDonough, who received tenure last summer at UNLV.
In Millimet & McDonough (2017) we consider an instrumental variable (IV) setup where there are missing data on the endogenous covariate, X. What makes this setup a bit different is that IV (in the absence of missing data) is biased and the bias depends on the strength of the instrument(s). Because the imputation model for the endogenous covariate affects instrument strength in the first stage, it may be beneficial to choose an imputation model with an eye on this. We assess the finite sample performance of various estimators via simulation, and include an application looking at the effects of birthweight on subsequent outcomes. Fun stuff!
In the end, data are messy. Empirical researchers need to embrace the challenges this messiness creates. Ignoring problems of missing data (and measurement error and countless others that may or may not appear in future blog posts) may lead to publications, but it won't advance science and it won't avoid replication crises. Fortunately, there are tools available; more are always on the way. We just need to know where to look and be willing to do so.
As always, stay safe, stay healthy, vote, and continue to be decent to one another.
References
Bollinger, C.R. and B.T. Hirsch (2006), "Match Bias from Earnings Imputation in the Current Population Survey: The Case of Imperfect Matching," Journal of Labor Economics, 24(3), 483-519
Horowitz, J.L. and C.F. Manski (2000), "Nonparametric Analysis of Randomized Experiments with Missing Covariate and Outcome Data," Journal of the American Statistical Association, 95(449), 77-84
Dardanoni, V., S. Modica, and F. Peracchi (2011), "Regression with Imputed Covariates: A Generalized Missing-Indicator Approach," Journal of Econometrics, 162, 362-368
Heckman, J. (1979), "Sample Selection as a Specification Error," Econometrica, 47, 153-161
Hirsch, B.T. and E.J. Schumacher (2004), "Match Bias in Wage Gap Estimates Due to Earnings Imputation," Journal of Labor Economics, 22(3), 689-722
McDonough, I.K. and D.L. Millimet (2017), "Missing Data, Imputation Accuracy, and Endogeneity," Journal of Econometrics, 199, 141-155