Mostly Unidentified?

In one of the greatest movies ever made, Miracle Max says: "Turns out your friend here is only MOSTLY dead. See, mostly dead is still slightly alive."

[GIF: amen]

Well, there is a lesson to be learned. In econometrics, many parameters that one might consider to be unidentified are only MOSTLY unidentified. And, following Miracle Max's unassailable logic, mostly unidentified is still slightly identified. 

Miracle Max continues to share his wisdom with us mere mortals when he says that with "ALL dead, there is only one thing you can do ... go through his pockets and look for loose change." With ALL unidentified, I suppose all we can do is thank our academy for a lovely run and turn off the lights on our way out.  

But, what do we do when confronted with mostly unidentified parameters? Well, that's where partial identification -- played here by Miracle Manski -- enters the scene.

[GIF: enter stage left]

Partial identification econometric techniques existed before Manski, but no one has done more to espouse the virtues of partial identification over the past three decades. In fact, even when parameters are (point) identified under certain assumptions, Manski advocates relaxing those assumptions (unless one is justifiably confident) even if it means the parameter is no longer (point) identified.

[Image: cover of Manski's Public Policy in an Uncertain World]

Manski lambastes the incredible certitude with which researchers shout from the mountaintops point estimates based on strong and often indefensible assumptions.

[GIF: shouting from the mountaintops]

For the uninitiated, nearly all of our training in classical econometrics focuses on an assumed data-generating process containing one or more unknown population parameters. The objective is to use these assumptions to derive valid mathematical expressions for the unknown parameters in terms of the distribution of observable data. If the mathematical expression for a given parameter pins down a single value -- a single point on the real number line -- then the parameter is identified. More precisely, it is point identified. Combining our assumptions with data enables us to put this result into practice and obtain a point estimate: a single number that is our best guess for the unknown parameter and possesses good statistical properties.

At this stage, our understanding is not complete, yet our training typically concludes. We are often left to believe that a lack of point identification is synonymous with not identified. Sometimes a lack of point identification does coincide with not identified; at that point (no pun intended), we look for loose change. However, other times, parameters are only mostly unidentified. In this case, we say that the parameter is partially identified.

Partial identification refers to any situation where the maintained assumptions restrict the value of a parameter to a subset of the real number line, where this subset includes at least two points. Often the subset is an interval of the real number line. In this case, we might also say that the parameter is interval identified. Other times the subset might be a collection of non-contiguous points or intervals.

[GIF: "what, now we are inventing new words?"]

In this case, we might say that the parameter is set identified. If the interval or set includes the entire real number line (or, more precisely, the entire parameter space), then -- and only then -- is the parameter not identified.

Thus, identification should be viewed along a continuum, ranging from point identification to non-identification. This leaves a lot of space for partial identification, yet most researchers ignore this space, treating it like a weird distant cousin, not to be discussed in enlightened circles.

[GIF: weird cousin]

Miracle Manski wants us to embrace our family, all of our family. Blood, after all, is thicker than the real number line.

As stated above, Manski did not "invent" partial identification. The first instance of which I am aware is Frisch (1934). Frisch -- like all good researchers (!) -- was concerned with measurement error. Assume the data-generating process conforms to the usual Classical Linear Regression Model with a single covariate, x*:

y = a + bx* + e.

In our data, we do not observe x*, but instead observe x, where x suffers from classical measurement error:

x = x* + u,

where u is mean zero and uncorrelated with x* and e. In this case, the Ordinary Least Squares (OLS) coefficient from the regression of y on x, given by

b_ols = Cov(x,y)/Var(x) = b*[Var(x*)/Var(x)]

does not point identify the true slope as it is attenuated toward zero since

Var(x*)/Var(x) < 1.

Knowing that the OLS coefficient is attenuated, however, does partially identify the slope parameter, as OLS provides a lower bound (in absolute value). Assuming b>0 (without loss of generality), one can at least state that the true slope b must lie in the bounds given by

[b*{Var(x*)/Var(x)}, +inf).

However, Frisch realized that these bounds are not sharp. Sharp bounds are the tightest bounds possible given a set of assumptions. If bounds do not fully utilize all available information, they are not sharp. Moreover, if bounds are sharp, then any bounds that are narrower must, by definition, utilize additional information.

In this case, sharp bounds are derived by using the OLS coefficient from the reverse regression of x on y. The regression equation is given by

x = c + dy + w,

where w is the error term. Now,

d_ols = Cov(x,y)/Var(y),

which understates 1/b, the reciprocal of the true slope, since Cov(y,w) < 0. Using Cov(x,y) = b*Var(x*) and Var(y) = b^2*Var(x*) + Var(e), the reciprocal is

1/d_ols = Var(y)/Cov(x,y) = b + Var(e)/[b*Var(x*)].

Continuing to assume b>0, sharp bounds are given by

[b*{Var(x*)/Var(x)}, b + Var(e)/[b*Var(x*)]) = [b_ols, 1/d_ols)

since the reciprocal of d_ols is larger than b. Conveniently, both endpoints are functions of moments of the observed data and are therefore estimable. As is well known, Instrumental Variable (IV) estimation can point identify b, but this requires certitude that the instrument is valid.
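
To see the bounds in action, here is a quick simulation sketch in Python. It is purely illustrative; the parameter values, error variances, and sample size below are made up, and the population bounds they imply are [1.6, 2.5].

    # Purely illustrative simulation of the Frisch bounds under classical
    # measurement error; all parameter values are arbitrary.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000
    b = 2.0                              # true slope (b > 0)
    x_star = rng.normal(0, 1, n)         # unobserved true covariate
    e = rng.normal(0, 1, n)              # regression error
    u = rng.normal(0, 0.5, n)            # classical measurement error
    y = 1.0 + b * x_star + e
    x = x_star + u                       # observed, error-ridden covariate

    cov_xy = np.cov(x, y)[0, 1]
    b_ols = cov_xy / np.var(x, ddof=1)   # forward regression: lower bound
    d_ols = cov_xy / np.var(y, ddof=1)   # reverse regression of x on y
    print(f"true b = {b}")
    print(f"estimated Frisch bounds: [{b_ols:.3f}, {1 / d_ols:.3f}]")

The true slope of 2 falls inside the estimated bounds, while the forward OLS coefficient alone (roughly 1.6 here) badly understates it.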

Since this early example provided by Frisch, partial identification applications have appeared sporadically in a variety of contexts, too many to list here. However, I will briefly discuss two examples from my own work that I think will interest many readers.

[GIF: who does this guy think he is?]

The first example concerns the estimation of causal effects of a binary treatment.  Consider the population Average Treatment Effect (ATE) given by

E[Y1 - Y0],

where Y1, Y0 are potential outcomes associated with treatment (denoted by D=1) and non-treatment (D=0). Focus on one aspect of this parameter, E[Y1]. Treatment of E[Y0] is entirely symmetric. E[Y1] is equivalent to

E[Y1|D=1]*Pr(D=1) + E[Y1|D=0]*Pr(D=0).

In principle, E[Y1|D=1], Pr(D=1), and Pr(D=0) can be estimated from the observed data, since Y1 is observed for the treated. However, E[Y1|D=0] can never be observed. This is the problem of the missing counterfactual. One can impose strong assumptions to point identify this quantity (e.g., independence or conditional independence assumptions). However, this again requires certitude. Instead, E[Y1] may be bounded by replacing E[Y1|D=0] with upper and lower bounds based on more justifiable assumptions.

Manski (1990) first derives worst case bounds based simply on the minimum and maximum of the support of Y1. In other words, the lower bound is derived by assuming that all untreated would obtain the minimum value of Y1 were they to be treated. Conversely, the upper bound is derived by assuming that all untreated would obtain the maximum value of Y1 were they to be treated. If the outcome, Y, is binary, this is trivial as the minimum (maximum) value of Y is 0 (1).

[GIF: worst case scenario]

Denote the worst case bounds on E[Y1] as [LB1, UB1]. Similarly, the worst case bounds on E[Y0] are denoted by  [LB0, UB0]. Then, bounds on the ATE are given by [LB1 - UB0, UB1 - LB0]. The lower bound corresponds to the scenario where all untreated (treated) attain the minimum (maximum) outcome with (without) the treatment. The upper bound corresponds to the reverse.
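
In code, these worst case bounds take only a few lines. Below is a bare-bones sketch of my own; the function name and the fabricated data are purely illustrative, and it ignores covariates, sampling error, and the extensions discussed below.

    # Bare-bones worst case (Manski 1990) bounds on the ATE for an outcome
    # with known support [y_min, y_max]; purely illustrative.
    import numpy as np

    def worst_case_ate_bounds(y, d, y_min=0.0, y_max=1.0):
        d = d.astype(bool)
        p1 = d.mean()                    # Pr(D=1)
        p0 = 1.0 - p1                    # Pr(D=0)
        # Bounds on E[Y1]: the unobserved E[Y1|D=0] is set to y_min / y_max
        lb1 = y[d].mean() * p1 + y_min * p0
        ub1 = y[d].mean() * p1 + y_max * p0
        # Bounds on E[Y0]: the unobserved E[Y0|D=1] is set to y_min / y_max
        lb0 = y[~d].mean() * p0 + y_min * p1
        ub0 = y[~d].mean() * p0 + y_max * p1
        return lb1 - ub0, ub1 - lb0      # [LB1 - UB0, UB1 - LB0]

    # Fabricated binary-outcome data just to show the call
    rng = np.random.default_rng(0)
    d = rng.integers(0, 2, 10_000)
    y = rng.binomial(1, np.where(d == 1, 0.6, 0.4)).astype(float)
    print(worst_case_ate_bounds(y, d))

One well-known feature: the width of the resulting ATE interval always equals y_max - y_min, no matter the data, which foreshadows the next point.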

These bounds are informative -- despite being called worst case -- as they exclude some parts of the parameter space. However, the bounds necessarily include zero and, therefore, cannot rule out the possibility that the treatment has no effect. Nonetheless, the fact that they rule out some extreme values may be useful in assessing the cost-benefit ratio of a proposed intervention. For example, if the intervention fails a cost-benefit analysis even assuming the ATE is equal to the upper bound, then we know all we need to know.

[GIF: my knowledge is complete]

Manski (1990), Lechner (1999), papers by Kreider, Pepper, and co-authors, and several of my own papers utilize these worst case bounds, as well as consider additional, transparent assumptions that might yield tighter bounds and might even exclude zero. Some of these papers also allow for measurement error in D. Thus, the bounds can be extended to see what can be learned not only in the face of the missing counterfactual problem, but also in the face of misreported treatment assignment. Other examples in the literature allow for non-random sample selection and adjust the bounds accordingly.

The best part? I have Stata code available (McCarthy et al. 2015).

[GIF: and the world rejoiced]

The second example concerns the estimation of transition matrices. In Millimet et al. (2019), my former Ph.D. students (http://econhaoli.weebly.com/ and https://punarjitroyc.weebly.com/) and I consider the problem of assessing income mobility, acknowledging that income is measured with error. Formally, we are interested in estimating the elements of a transition matrix, given by

Pr(y*_1 in k' | y*_0 in k),

where y*_1 and y*_0 are an individual's true income in periods 1 and 0, respectively, and k, k' refer to cells in the transition matrix (e.g., quintiles of the income distribution in each period).

Given observed data, y_1 and y_0, we derive worst case bounds on these probabilities using logic similar to the preceding example concerning the ATE. We then tighten the bounds by imposing additional assumptions that are transparent and easy to relax if one does not find them credible in a particular context.
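
Our derivation in the paper is more involved than I can do justice to here, but the flavor of the bounding logic can be conveyed with a deliberately crude sketch. Suppose (and this is NOT the assumption we actually use, just a hypothetical simplification) one were willing to assume that at most a fraction q of observations are assigned to the wrong (period 0, period 1) cell pair due to measurement error. Then the true joint and marginal probabilities can each differ from their observed counterparts by at most q, which yields valid, though conservative, bounds on the transition probability:

    # Deliberately crude illustration (NOT the approach in Millimet et al. 2019):
    # bounds on the true transition probability from cell k in period 0 to cell
    # k' in period 1 when at most a fraction q of observations sit in the wrong
    # observed cell pair.
    import numpy as np

    def transition_prob_bounds(cell0, cell1, k, k_prime, q):
        p_joint = np.mean((cell0 == k) & (cell1 == k_prime))  # observed joint prob.
        p_marg = np.mean(cell0 == k)                          # observed marginal prob.
        # Each true probability lies within q of its observed counterpart,
        # so bound the ratio by its most extreme admissible values.
        lb = max(0.0, p_joint - q) / min(1.0, p_marg + q)
        ub = min(1.0, p_joint + q) / max(1e-12, p_marg - q)
        return lb, min(1.0, ub)

    # Fabricated quintile assignments just to show the call
    rng = np.random.default_rng(0)
    cell0 = rng.integers(1, 6, 5_000)
    cell1 = rng.integers(1, 6, 5_000)
    print(transition_prob_bounds(cell0, cell1, k=1, k_prime=5, q=0.05))

The assumptions we actually impose in the paper are different, but the spirit is the same: observed probabilities plus explicit restrictions on the measurement error translate into bounds on the true probabilities.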

While I am biased, our method stands in stark contrast to some existing papers concerned with estimating mobility while admitting the possibility of measurement error. Many such papers posit highly structural models of the income and measurement error processes, estimate these models, simulate 'error-free' income, and then estimate transition matrices with the simulated data.

[GIF: oxymoron]

The number of assumptions and the lack of transparency about them are jarring.

Did you think we forgot the Stata code?

[GIF: don't be ridiculous]

We've got you covered at http://faculty.smu.edu/millimet/code.html.

I apologize for the length of this post, but thank you if you made it this far. I am very happy I could maintain your interest. I will conclude with a few miscellaneous comments.

First, a bit of warning for applied researchers who may be tempted to abandon point estimates that lack certitude and embrace partial identification. Just like the character in another cinematic masterpiece, Singles, who loves her car, people (and by that I mean Referee 2) love their point estimates, credibility be damned.

[GIF: the Supertrain scene from Singles]

It can be hard to sell people on learning less, despite the confidence that comes with it. Manski has been preaching for years, but the progress is slow. I sincerely hope someday he is rewarded with the Nobel Prize for his efforts. I also hope referees (and policymakers) come to be more appreciative of this type of empirical work.

Second, when using bounds in practice, the theoretical bounds -- such as those discussed above -- are estimated with data. Thus, we obtain "point estimates" of the lower and upper bounds. Since these are estimates, we need standard errors. There is much work by theoretical econometricians in this regard. And, god love 'em, this literature is enough to make applied researchers ...

[GIF: fetal position]

Don't let it distract you. Hopefully, we can avoid letting perfection (on inference) be the enemy of the good.

Finally, even ignoring confidence intervals on our bounds, the bounds are an interval and "look" like a confidence interval. This can tempt researchers encountering bounds for the first time to treat them like a confidence interval, in which values near the center are viewed as more likely than values near the endpoints. With partial identification this is not the case. Absent additional information, every value contained in the bounds is equally consistent with the maintained assumptions and the data. Thus, avoid playing favorites.

In closing, I hope more applied researchers will embrace partial identification methods as, if nothing else, a worthwhile complement to our toolbox.

[GIF: amen]

References 

Frisch, R. (1934), Statistical Confluence Analysis by Means of Complete Regression Systems, Publ. No. 5. Oslo: University Institute of Economics.

Lechner, M. (1999), "Nonparametric Bounds on Employment and Income Effects of Continuous Vocational Training in East Germany," Econometrics Journal, 2, 1-28. 

Manski, C.F. (1990), "Nonparametric Bounds on Treatment Effects," American Economic Review, 80, 319-323.

Manski, C.F. (2008), "Partial Identification in Econometrics," in New Palgrave Dictionary of Economics, 2nd Edition, ed. by S. N. Durlauf, and L. E. Blume, Palgrave MacMillan.

McCarthy, I., D.L. Millimet, and M. Roy (2015), "Bounding Treatment Effects: Stata Command for the Partial Identification of the Average Treatment Effect with Endogenous and Misreported Treatment Assignment," Stata Journal, 15, 411-436.

Millimet, D.L., H. Li, and P. Roychowdhury (2019), "Partial Identification of Economic Mobility: With an Application to the United States," Journal of Business & Economic Statistics (forthcoming).

Tamer, E. (2010), “Partial Identification in Econometrics,” Annual Review of Economics, 2, 167–195.
