Tail Wagging the Dog
The topic of the lecture was very straightforward: omitted variable bias. All empirical researchers understand omitted variable bias; omission of relevant covariates relegates them to the (composite) error term and biases the estimated coefficients when using Ordinary Least Squares (OLS) if the omitted covariates are correlated with included covariates.
However, we know more than just the fact that OLS is biased in this situation. We also know the formula for the bias. How do we know this? Well, I will tell you what is not the answer to this question. We do not answer this question by starting with the estimated model and then pontificating about omitted variables because the data are inadequate and do not include all relevant covariates. To do so is what I would describe as letting the tail wag the dog. To do so is to proceed in the wrong order.
At the risk of overkill, consider estimation of the model
Y = a + bX + e.
Upon estimation of the model, you are asked about not controlling for some variable, Z. Often the answer is something along the lines of "Well, I wish I had data on Z, but I don't, so this is the best I can do." Moreover, by staring at the model, it is not possible to assess the econometric implications of omitting Z.
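To see why the order matters, here is a minimal sketch that starts from an assumed DGP. All numbers and functional forms below are made up for illustration: the true model is Y = a + bX + cZ + e, Z is omitted, and the familiar bias formula E[b-hat] = b + c·Cov(X,Z)/Var(X) falls right out of the simulation.

```python
import numpy as np

# Assumed (illustrative) DGP: Y = a + b*X + c*Z + e, but we estimate
# Y on X alone, so Z is relegated to the composite error u = c*Z + e.
rng = np.random.default_rng(0)
n = 100_000
a, b, c = 1.0, 2.0, 3.0

Z = rng.normal(size=n)
X = 0.5 * Z + rng.normal(size=n)   # X is correlated with the omitted Z
e = rng.normal(size=n)
Y = a + b * X + c * Z + e

# OLS slope from regressing Y on X only
b_hat = np.cov(X, Y)[0, 1] / np.var(X)

# The textbook bias formula, derived from the DGP:
# E[b_hat] = b + c * Cov(X, Z) / Var(X)
bias = c * np.cov(X, Z)[0, 1] / np.var(X)

print(b_hat, b + bias)  # the two should nearly coincide
```

The point is not the simulation itself but the direction of travel: the bias formula is only derivable because we wrote down the DGP first and then asked what OLS does to the estimable model.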
This problem arises when the researcher starts the process by specifying the model that can be estimated and not the model that should be estimated. The researcher starts with reality instead of the ideal.
Second, recall an old time series issue. I know ... time series. Blech! But, an important issue in time series (and panel data models with long T) is nonstationarity. An important question is: what are the properties of OLS when estimating the model
Y = a + bX + e
where Y and X are both nonstationary? Well, we cannot answer that question by starting from this model. Instead, Granger & Newbold (1974) start from the DGP, where they assume that Y and X are independent random walks given by
Y(t) = Y(t-1) + ey(t)
X(t) = X(t-1) + ex(t)
where ey(t) and ex(t) are iid standard normal. Given this (assumed) DGP, the authors can derive the structure of a, b, and e in the model that is being estimated.
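A quick Monte Carlo makes the Granger & Newbold (1974) point concrete. The sample size and replication count below are arbitrary choices of mine: regressing one random walk on another, completely independent, random walk produces "significant" slopes far more often than the nominal 5%.

```python
import numpy as np

# Spurious regression sketch: Y and X are INDEPENDENT random walks,
# yet OLS of Y on X routinely delivers |t| > 1.96.
rng = np.random.default_rng(42)
T, reps = 200, 500
t_stats = []

for _ in range(reps):
    Y = np.cumsum(rng.normal(size=T))   # random walk
    X = np.cumsum(rng.normal(size=T))   # independent random walk
    Xc, Yc = X - X.mean(), Y - Y.mean()
    b_hat = (Xc @ Yc) / (Xc @ Xc)
    resid = Yc - b_hat * Xc
    se = np.sqrt((resid @ resid) / (T - 2) / (Xc @ Xc))
    t_stats.append(b_hat / se)

# With a valid test we would reject about 5% of the time;
# with nonstationary data the spurious rejection rate is far higher.
reject_rate = np.mean(np.abs(np.array(t_stats)) > 1.96)
print(reject_rate)
```

Again, the conclusion comes from assuming the DGP first; staring at `Y = a + bX + e` alone tells you nothing about whether the regression is spurious.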
Third, take this question posed on Twitter. The answer becomes obvious once the researcher starts with the (assumed) true DGP and then follows the math to obtain the model that can be taken to the data. In this case, it sounds like the true DGP is
Y = a + bC + e
X = C + D
but the estimated model is
Y = c + fX + u.
So, what are the properties of the OLS estimate of f? Following the math reveals
Y = a + b[X - D] + e
= a + bX + [-bD + e]
= a + bX + u
which leads to
E[f-hat] = b - b[Cov(X,D)/Var(X)].
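The derivation can be checked numerically. The coefficient values and distributions below are my own illustrative choices; with C, D, and e independent standard normals, Cov(X,D) = Var(D) = 1 and Var(X) = 2, so the slope from regressing Y on X should land near b − b·Cov(X,D)/Var(X) = b/2.

```python
import numpy as np

# Assumed (illustrative) DGP: Y = a + b*C + e and X = C + D,
# but we regress Y on X and ask what OLS hands back.
rng = np.random.default_rng(1)
n = 200_000
a, b = 1.0, 2.0

C = rng.normal(size=n)
D = rng.normal(size=n)
e = rng.normal(size=n)
Y = a + b * C + e
X = C + D

# OLS slope from regressing Y on X
f_hat = np.cov(X, Y)[0, 1] / np.var(X)

# Following the math: E[f_hat] = b - b * Cov(X, D) / Var(X)
f_formula = b - b * np.cov(X, D)[0, 1] / np.var(X)
print(f_hat, f_formula)
```

With these particular numbers both quantities should sit near 1.0, i.e., the slope is attenuated to half its true value.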
As a final example, let's think about proxy variables. This is really the example that chaps my hide. My pet peeve. In nearly every empirical study, researchers attempt to include relevant controls. Often the data do not align with the desired controls. Researchers circumvent this issue by tossing around the word "proxy" like it's the holy grail. Interestingly, the original Steve Hamilton scolded me the other day for ruining the word "proxy" for him. You're welcome!
Researchers mostly justify the use of proxies by noting that it is "the best they can do." Again, not so fast. Starting from the model that can actually be estimated, rather than from the (assumed) DGP together with the assumed relationship between the proxies and the ideal controls, precludes researchers from properly understanding the properties of the estimator being employed.
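Here is what that looks like when you do proceed in the right order. Everything below is an assumed setup of mine: the ideal control is Z, the available proxy is P = Z + v with classical measurement error, and the simulation shows that the proxy's coefficient is attenuated and the coefficient of interest remains biased.

```python
import numpy as np

# Assumed (illustrative) DGP: Y = b*X + c*Z + e, X correlated with Z,
# and only a noisy proxy P = Z + v is available to put in the regression.
rng = np.random.default_rng(7)
n = 200_000
b, c = 2.0, 3.0

Z = rng.normal(size=n)
X = 0.5 * Z + rng.normal(size=n)
P = Z + rng.normal(size=n)          # proxy = ideal control + noise
Y = b * X + c * Z + rng.normal(size=n)

# OLS of Y on a constant, X, and the proxy P
W = np.column_stack([np.ones(n), X, P])
coef, *_ = np.linalg.lstsq(W, Y, rcond=None)
print(coef[1], coef[2])  # compare with the true b = 2 and c = 3
```

Only by writing down the DGP and the proxy equation first can you work out (by the same kind of algebra as in the previous example) exactly how far off each coefficient will be; "it's the best I can do" tells you nothing about the direction or size of the bias.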
The bottom line is that researchers do not need to be structural econometricians to start at the beginning when specifying an empirical model. A proper understanding of the econometric implications of estimation demands that we do so, even if we are "reduced form" types, and follow the math ...