IV with a Mismeasured Binary Covariate
Consider a very simple model,
Y = a + bD + e,
where D is a binary variable and b is the coefficient of interest. Suppose D is endogenous, but we have a valid instrument, z. IV is consistent and should do well in large samples if z is strong.
But what if D is also measured with error? I.e., what if the true model is
Y = a + bD* + e,
where D* is the true D, but we only observe D (D not equal D* for some i)? That valid instrument, z, we had? Now, it's not quite as useful. Why?
Because measurement error (ME) in a binary variable canNOT be classical. When D*=0, the ME can only be 0 or 1. When D*=1, the ME can only be 0 or -1. Thus, the ME is necessarily negatively correlated with D*. Since the IV, z, is correlated with D*, it is almost assuredly also correlated with the ME and thus invalid.
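To make the argument concrete, here is a minimal sketch in Python. Everything in it is an assumption chosen for illustration: the instrument z is standard normal, D* = 1 when z + v > 0, and each observation is misclassified independently with probability p = 0.05. The names `simulate_misclassification`, `d_true`, `d_obs`, and `u` are hypothetical. It only checks the two claims above: the ME takes the values allowed by D*, and it is correlated with both D* and z.

```python
import numpy as np


def simulate_misclassification(n=200_000, p=0.05, seed=0):
    """Draw z, the true binary D*, the observed D, and the ME u = D - D*.

    Assumed design (illustrative only): D* = 1{z + v > 0} with z, v
    independent standard normals; D flips D* with probability p,
    independently of everything else.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                      # instrument
    v = rng.normal(size=n)                      # first-stage error
    d_true = (z + v > 0).astype(int)            # true treatment D*
    flip = rng.random(n) < p                    # random misclassification
    d_obs = np.where(flip, 1 - d_true, d_true)  # observed treatment D
    u = d_obs - d_true                          # measurement error
    return z, d_true, d_obs, u


z, d_true, d_obs, u = simulate_misclassification()

# The ME can only be 0 or 1 when D* = 0, and only 0 or -1 when D* = 1.
print("ME values when D*=0:", np.unique(u[d_true == 0]))
print("ME values when D*=1:", np.unique(u[d_true == 1]))

# Correlation of the ME with the truth and with the instrument.
corr_u_dstar = np.corrcoef(u, d_true)[0, 1]
corr_u_z = np.corrcoef(u, z)[0, 1]
print("corr(ME, D*):", round(corr_u_dstar, 3))
print("corr(ME, z): ", round(corr_u_z, 3))
```

Even though the flips are purely random here, both correlations come out negative, which is the sense in which ME in a binary variable cannot be classical.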
A very good reference is Black et al. (2000). Note, this argument applies to any bounded variable; D* need not be binary. E.g., if D* represents a percentage, then it is bounded on the unit interval. If the bounds are relevant, the same issue arises.
So, is this a big deal? Most applied researchers, if asked about this, would answer: "ME is not a big deal in this context. If it exists, it's not severe enough to matter." Perhaps. Perhaps not. I performed some simulations.
The first simulations are based on 250,000 reps with N=1000 and a strong IV. I consider random ME in 1%, 3%, 5%, and 10% of each sample. The IV estimates are biased toward OLS, and one could argue at least moderately so with 5% misclassification.
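For readers who want to experiment, the following is a scaled-down Monte Carlo sketch in Python. It is not the simulation design behind the results reported here; the data-generating process is an assumption made for illustration (b = 1, D* = 1{z + v > 0}, e = v + noise so that D* is endogenous and OLS is biased upward, and each observation misclassified independently with probability p rather than an exact share of the sample). The function names are hypothetical and the number of reps is kept small so it runs quickly. Under these particular assumptions, the just-identified IV estimand works out to b/(1 - 2p), so the simulated IV means should drift upward, in the direction of OLS, as p grows.

```python
import numpy as np


def iv_slope(y, d, z):
    """Just-identified IV (Wald-type) slope: cov(z, y) / cov(z, d)."""
    zc = z - z.mean()
    return (zc @ y) / (zc @ d)


def ols_slope(y, d):
    """OLS slope from a regression of y on d and a constant."""
    dc = d - d.mean()
    return (dc @ y) / (dc @ d)


def one_draw(rng, n, p, b=1.0):
    """One sample from the assumed (illustrative) DGP."""
    z = rng.normal(size=n)                       # strong instrument
    v = rng.normal(size=n)
    e = v + rng.normal(size=n)                   # e correlated with v: D* endogenous
    d_true = (z + v > 0).astype(float)           # true treatment D*
    y = 1.0 + b * d_true + e                     # outcome depends on D*
    flip = rng.random(n) < p                     # random misclassification
    d_obs = np.where(flip, 1.0 - d_true, d_true)
    return y, d_obs, z


def mean_estimates(p, reps=500, n=1000, seed=0):
    """Average IV and OLS estimates over reps samples of size n."""
    rng = np.random.default_rng(seed)
    iv = np.empty(reps)
    ols = np.empty(reps)
    for r in range(reps):
        y, d_obs, z = one_draw(rng, n, p)
        iv[r] = iv_slope(y, d_obs, z)
        ols[r] = ols_slope(y, d_obs)
    return iv.mean(), ols.mean()


results = {p: mean_estimates(p) for p in (0.0, 0.01, 0.03, 0.05, 0.10)}
for p, (iv_mean, ols_mean) in results.items():
    print(f"p={p:.2f}  mean IV={iv_mean:.3f}  mean OLS={ols_mean:.3f}")
```

Raising `reps`, weakening the first stage (e.g., scaling z down inside `one_draw`), or letting the flip probability depend on D* are natural variations to try.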
In an old paper of mine, I cite several instances where 10+% misclassification has been documented in commonly analyzed treatments (e.g., training, education, union status around 5%).
Not surprisingly, if the IV is weak, then results are even worse.
So, what do we do? There are a few solutions out there, many of which rely on computing bounds. But this is definitely an important area for future work! One very recent contribution to this literature is by my great friend, Rusty Tchernis, and co-authors (Nguimkeu et al. 2019).
Note: Code is available here: http://faculty.smu.edu/millimet/blog.html
References
Black, D.A., M.C. Berger, and F.A. Scott (2000) "Bounding Parameter Estimates with Nonclassical Measurement Error," Journal of the American Statistical Association, 95, 451, 739-748
Nguimkeu, P., A. Denteh, and R. Tchernis (2019), "On the Estimation of Treatment Effects with Endogenous Misreporting," Journal of Econometrics, 208, 2, 487-506