Tail Wagging the Dog

So, it snowed in North Texas overnight. There is at least 0.25 inches. Of course, that means all schools are closed for at least today and tomorrow. Likely the day after as well. Seems like a good time to turn on Netflix and write another post. I have wanted to write this one for a while, but it also came up during lecture on Monday. And, that may be the last lecture I give this week.

The topic of the lecture was very straightforward: omitted variable bias. All empirical researchers understand omitted variable bias; omission of relevant covariates relegates them to the (composite) error term and biases the estimated coefficients when using Ordinary Least Squares (OLS) if the omitted covariates are correlated with included covariates. 

However, we know more than just the fact that OLS is biased in this situation. We also know the formula for the bias. How do we know this? Well, I will tell you what is not the answer to this question. We do not answer this question by starting with the estimated model and then pontificating about omitted variables because the data are inadequate and do not include all relevant covariates. To do so is what I would describe as letting the tail wag the dog. To do so is to proceed in the wrong order.

In the case of omitted variable bias, we start with the (assumed) true data generation process (DGP). Only then do we proceed, algebraically, to the model that can actually be taken to the data. This allows one to fully understand the econometric implications of estimating a mis-specified model without speculation. By starting with the DGP, and following the math, we find clarity by allowing the dog to wag the tail.

At the risk of overkill, consider estimation of the model

Y = a + bX + e.

Upon estimation of the model, you are asked about not controlling for some variable, Z. Often the answer is something along the lines of "Well, I wish I had data on Z, but I don't, so this is the best I can do." Moreover, by staring at the model, it is not possible to assess the econometric implications of omitting Z. 

This problem arises when the researcher starts the process by specifying the model that can be estimated and not the model that should be estimated. That is, the researcher starts with reality instead of the ideal.


Instead, the researcher ought to start by writing down the ideal model; here, given by

Y = a + bX + cZ + u.

Now, when the researcher turns to the data and fails to find data on Z, they can follow the math from the ideal model to the one that can be estimated in practice. In this case, this is trivial. Omitting Z entails simply moving cZ to the error term, making e a composite error term:

Y = a + bX + cZ + u

   = a + bX + [cZ + u]

   = a + bX + e.

By following the math, the researcher knows that the error term has a particular structure and this structure, combined with additional assumptions concerning the true DGP, allows one to understand the properties of OLS applied to the mis-specified model. Specifically, assuming the usual classical linear regression model assumptions hold in the ideal model, then we know that

E[b-hat] = b + c[Cov(X,Z)/Var(X)] 

and this is the omitted variable bias formula.
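
For anyone who prefers to see the formula at work, here is a minimal simulation sketch. Everything specific in it is my own assumption for illustration and not part of the argument above: the parameter values, the normal errors, and the way X is made to depend on Z through a hypothetical coefficient rho. It draws data from the ideal model, runs the short regression of Y on X alone, and compares the slope to what the formula predicts.

```python
import numpy as np

def simulate_ovb(n=200_000, a=1.0, b=2.0, c=3.0, rho=0.5, seed=0):
    """Draw from the ideal model Y = a + bX + cZ + u, then regress Y on X only.

    All parameter values and distributions are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    x = rho * z + rng.normal(size=n)  # X is correlated with the omitted Z
    u = rng.normal(size=n)
    y = a + b * x + c * z + u

    # OLS slope from the short regression of Y on X (Z omitted)
    var_x = np.var(x, ddof=1)
    b_hat = np.cov(x, y)[0, 1] / var_x

    # What the omitted variable bias formula predicts, using sample moments
    predicted = b + c * np.cov(x, z)[0, 1] / var_x
    return b_hat, predicted

b_hat, predicted = simulate_ovb()
print(f"short-regression slope: {b_hat:.3f}")
print(f"b + c*Cov(X,Z)/Var(X):  {predicted:.3f}")
```

Under these assumed values, Cov(X,Z) = 0.5 and Var(X) = 1.25, so the formula puts the short-regression slope near 2 + 3(0.4) = 3.2 rather than the true b = 2.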

Empirical research needs to start with the ideal, which entails specifying the (assumed) true DGP. Then, and only then, should researchers turn to the data to determine the model that can actually be taken to the data. When the ideal model and the actual model differ, researchers should follow the math to understand the ramifications of this divergence.


In this simple but powerful example, the order of events is clear. However, when it comes to complex research in the real world, the lesson learned here often goes right out the window. Yet in so many other scenarios, researchers could elevate the discussion of their research by heeding this example. Examples abound.

First, in a seminar a while back, a researcher was using an Instrumental Variable (IV) strategy. It's been so long, I do not remember the context. However, the researcher had some justification for an "ideal" instrument. The problem was that there were no data on this instrument. Instead, the researcher used some second-best, mismeasured instrument. However, the researcher had no understanding of the econometric implications of this alternative instrument and instead said "Well, it's the best I can do."


No, it's not the best that can be done. Not by a long shot. One could start by writing down the (assumed) full DGP where Z is a valid instrument for some endogenous X, augment it by expressing the relationship between the second-best instrument, W, and the ideal instrument Z, and then derive the econometric implications of using W instead of Z. 
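
To make this concrete, here is one way the exercise might be sketched in a simulation. I do not know the relationship between W and Z in that seminar, so the one used here is purely an assumption: W = Z + m, where the measurement error m loads on the structural error e through a hypothetical parameter gamma. The first stage, the degree of endogeneity, and all parameter values are likewise made up for illustration.

```python
import numpy as np

def iv_slope(instr, x, y):
    """Simple IV estimator with one instrument: Cov(instr, Y) / Cov(instr, X)."""
    return np.cov(instr, y)[0, 1] / np.cov(instr, x)[0, 1]

def simulate_iv(gamma, n=200_000, b=1.0, seed=0):
    """Assumed DGP: Y = 2 + bX + e, X = Z + v, with v correlated with e.

    The second-best instrument is W = Z + m, where m = gamma*e + noise.
    gamma = 0 means the measurement error is unrelated to e.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)              # ideal instrument
    e = rng.normal(size=n)              # structural error
    v = 0.8 * e + rng.normal(size=n)    # makes X endogenous
    x = z + v
    y = 2.0 + b * x + e
    m = gamma * e + rng.normal(size=n)  # measurement error in the instrument
    w = z + m                           # second-best instrument
    ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return ols, iv_slope(z, x, y), iv_slope(w, x, y)

ols, iv_z, iv_w_clean = simulate_iv(gamma=0.0)
_, _, iv_w_bad = simulate_iv(gamma=0.5)
print(f"OLS: {ols:.3f}  IV with Z: {iv_z:.3f}")
print(f"IV with W, error unrelated to e: {iv_w_clean:.3f}")
print(f"IV with W, error related to e:   {iv_w_bad:.3f}")
```

In this assumed setup, the algebra says the IV estimator using W converges to b + Cov(W,e)/Cov(W,X). When the measurement error is unrelated to e, W is noisier than Z but still recovers b. When it is related to e, W is simply not a valid instrument. Which case applies is exactly what writing down the full DGP forces the researcher to confront.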

Second, recall an old time series issue. I know ... time series. Blech! But, an important issue in time series (and panel data models with long T) is nonstationarity. An important question is: what are the properties of OLS when estimating the model

Y = a + bX + e

where Y and X are both nonstationary? Well, we cannot answer that question by starting from this model. Instead, Granger & Newbold (1974) start from the DGP, where they assume that Y and X are independent random walks given by

Y = L.Y + ey

X = L.X + ex

where L.() is the lag operator and ey, ex are iid standard normal. Given this (assumed) DGP, the authors can derive the structure of a, b, and e in the model that is being estimated.
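
The consequences of this DGP are easy to see in a small Monte Carlo. The sketch below is my own illustration and not a replication of the paper: the sample length T = 100, the number of replications, and the 1.96 critical value are all assumptions. It generates independent random walks, regresses Y on X by OLS, and counts how often the conventional t-test rejects b = 0 even though the two series are unrelated by construction.

```python
import numpy as np

def spurious_rejection_rate(T=100, reps=1000, seed=0):
    """Share of replications where |t| > 1.96 for b in Y = a + bX + e,
    when Y and X are independent random walks (illustrative settings)."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        y = np.cumsum(rng.normal(size=T))  # Y = L.Y + ey
        x = np.cumsum(rng.normal(size=T))  # X = L.X + ex
        X = np.column_stack([np.ones(T), x])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        sigma2 = resid @ resid / (T - 2)   # usual OLS error variance estimate
        se_b = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
        if abs(coef[1] / se_b) > 1.96:
            rejections += 1
    return rejections / reps

rate = spurious_rejection_rate()
print(f"share of nominal 5% tests rejecting b = 0: {rate:.2f}")
```

A correctly sized test would reject about 5% of the time. Here the rejection rate is far higher, and nothing about staring at Y = a + bX + e would have told you why.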

Third, take this question posed on Twitter. The answer becomes obvious once the researcher starts with the (assumed) true DGP and then follows the math to obtain the model that can be taken to the data. In this case, it sounds like the true DGP is 

Y = a + bC + e

X = C + D

but the estimated model is

Y = g + fX + u.

So, what are the properties of the OLS estimate of f? Following the math reveals

Y = a + b[X - D] + e

   = a + bX + [-bD + e]

   = a + bX + u

which leads to 

E[f-hat] = b - b[Cov(X,D)/Var(X)].
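
A quick numerical check of this result, with D playing the role of an omitted variable whose coefficient is -b, so that the bias is -b times Cov(X,D)/Var(X). The sketch assumes, purely for illustration, that C, D, and e are independent standard normals, and the parameter values are made up.

```python
import numpy as np

def simulate_composite(n=200_000, a=1.0, b=2.0, seed=0):
    """Assumed DGP: Y = a + bC + e and X = C + D, with C, D, e independent.

    Regress Y on X and compare the slope with b - b*Cov(X,D)/Var(X).
    """
    rng = np.random.default_rng(seed)
    c = rng.normal(size=n)
    d = rng.normal(size=n)
    e = rng.normal(size=n)
    y = a + b * c + e
    x = c + d                          # only the sum C + D is observed

    var_x = np.var(x, ddof=1)
    f_hat = np.cov(x, y)[0, 1] / var_x
    predicted = b - b * np.cov(x, d)[0, 1] / var_x
    return f_hat, predicted

f_hat, predicted = simulate_composite()
print(f"estimated slope on X:    {f_hat:.3f}")
print(f"b - b*Cov(X,D)/Var(X):   {predicted:.3f}")
```

With C and D independent and of equal variance, Cov(X,D)/Var(X) = 1/2, so the slope on X settles near b/2. That is the familiar attenuation result, obtained here simply by following the math from the DGP.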


As a final example, let's think about proxy variables. This is really the example that chaps my hide. My pet peeve. In nearly every empirical study, researchers attempt to include relevant controls. Often the data do not align with the desired controls. Researchers circumvent this issue by tossing around the word "proxy" like it's the holy grail. Interestingly, the original Steve Hamilton scolded me the other day for ruining the word "proxy" for him. You're welcome!

Researchers mostly justify the use of proxies by noting that it is "the best they can do." Again, not so fast. Starting from the model that can actually be estimated rather than from the (assumed) DGP, along with the assumed relationship between the proxies and the ideal controls, precludes researchers from properly understanding the properties of the estimator being employed. 
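
Here is a sketch of what starting from the DGP buys you in the proxy case. The relationship between the proxy and the ideal control is my assumption for illustration: P = Z + v, with v pure noise that is independent of everything else. The ideal model is the one from the omitted variable example, and the parameter values are made up.

```python
import numpy as np

def ols(y, *regressors):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def simulate_proxy(n=200_000, a=1.0, b=2.0, c=3.0, rho=0.5, seed=0):
    """Ideal model: Y = a + bX + cZ + u. Assumed proxy: P = Z + v."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    x = rho * z + rng.normal(size=n)  # X correlated with the ideal control Z
    p = z + rng.normal(size=n)        # noisy proxy for Z
    y = a + b * x + c * z + rng.normal(size=n)
    # Coefficient on X when Z is omitted, proxied by P, or included directly
    return ols(y, x)[1], ols(y, x, p)[1], ols(y, x, z)[1]

omit, with_proxy, ideal = simulate_proxy()
print(f"coefficient on X, Z omitted:  {omit:.3f}")
print(f"coefficient on X, proxy P:    {with_proxy:.3f}")
print(f"coefficient on X, Z included: {ideal:.3f}")
```

Under these assumptions, the proxy shrinks the bias in the coefficient on X but does not remove it: the estimate lands between the omitted-variable answer and the true b. How far it moves depends on the assumed relationship between P and Z, which is the whole point. You cannot know what a proxy buys you without writing that relationship down.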

The bottom line is that researchers do not need to be structural econometricians to start at the beginning when specifying an empirical model. A proper understanding of the econometric implications of estimation demands that we do so, even if we are "reduced form" types, and follow the math ...








