Tomato, Tomahto

Wow, did that semester last a long time! For today, I am back to one of the activities I love, but for which I have had no time or mental energy: talking with all of you.

I hope everyone has survived. I won't ask if anyone thrived. That is not the proper metric. Survival is more than sufficient. 

It reminds me of a conversation I have repeatedly with a friend of mine. He has a son the same age as mine (both sophomores at our local high school). He and I co-coached our boys' soccer and baseball teams when they were little. His son is now trying to be the kicker/punter on our high school football team. My son is trying to be a pitcher. Our high school is ridiculously competitive when it comes to sports (and academics). 

As kids drop off the sports landscape, either through burn-out or being cut, he and I just hope that our kids stick around to have a chance. It's our mantra: just stick around. So, be kind to yourself and just stick around.

That said, let's move on to the topic that motivated me to write a new post. It is this:

We might disagree over trivial things like how to pronounce the name of that bright red vegetable, but there is something we should not disagree over as empirical researchers, yet we seem to anyway. I am referring to the distinction between an estimator and an estimand.

If you have not spent much, if any, time pondering the meaning of these terms, fret not. You are not alone. Admittedly, I have not thought about it much in my career. However, slowly, over the recent years, #EconTwitter, and in particular the wonderful and brilliant Pedro Sant'Anna, has opened my eyes. 

The importance of distinguishing between the two terms has been made prominent by the popularity of difference-in-differences combined with the recent literature on staggered treatment timing. I have written on difference-in-differences before (here and here). But, before we dive back into it, let us consider a different example first.

Consider a simple linear regression model.

Y = a + bX + e.

Given the assumptions of the Classical Linear Regression Model (CLRM), the conditional expectation function is

E[Y | X] = a + bX.

This implies that 

b = dE[Y | X]/dX 

and, hence, b is equal to the derivative of the CEF, referred to as the marginal effect of X. For completeness, 

a = E[Y | X=0].

What if X is binary? Well, we still have  

 E[Y | X] = a + bX,

except now this implies that

b = E[Y | X=1] - E[Y | X=0],

which is still the marginal effect of X.
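
To make the binary case concrete, here is a minimal sketch on simulated data. The data-generating values (an intercept of 1, a slope of 2, the seed, the sample size) are assumptions chosen purely for illustration. It checks that the OLS slope is numerically identical to the difference in sample means between the X=1 and X=0 groups, and that the OLS intercept is the sample mean of Y when X=0.

```python
import numpy as np

# Simulated data; all numbers below are illustrative assumptions.
rng = np.random.default_rng(0)
n = 1000
x = rng.integers(0, 2, size=n).astype(float)   # binary regressor
y = 1.0 + 2.0 * x + rng.normal(size=n)         # Y = a + bX + e with a=1, b=2

# OLS of y on a constant and x.
design = np.column_stack([np.ones(n), x])
a_hat, b_hat = np.linalg.lstsq(design, y, rcond=None)[0]

# Sample analogues of E[Y | X=1] - E[Y | X=0] and E[Y | X=0].
diff_means = y[x == 1].mean() - y[x == 0].mean()
mean_x0 = y[x == 0].mean()

print("OLS slope:          ", b_hat)
print("Difference in means:", diff_means)
print("OLS intercept:      ", a_hat)
print("Mean of Y when X=0: ", mean_x0)
```

The two pairs of numbers agree up to floating-point error, which is the sample version of b = E[Y | X=1] - E[Y | X=0] and a = E[Y | X=0].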

The point is this. When we estimate the CLRM by Ordinary Least Squares (OLS) or Instrumental Variables (IV) or Least Absolute Deviations (LAD) or any other technique, the things we are trying to estimate are a and b. These are our estimands. OLS (or IV or LAD or whatever) is the estimator.

However, a and b are just population parameters. They are numbers. They are meaningless until we give them an interpretation. The assumptions of the CLRM accomplish this: a is the intercept and b is the marginal effect of X. Because these are quantities that have meaning to us as researchers, these are the true objects of our interest. 

Let me explain


OK. In sum, when we estimate a regression model, what we have in mind is that the intercept and the marginal effect of the covariates are the objects of interest -- the estimands -- and our technique for converting data into unbiased and/or consistent estimates of these estimands is our estimator.


Good. Well, I hope you continue to accept it as we return to difference-in-differences (DD). 

With DD, we need to recognize that empirical research really proceeds in the reverse order from how I described it above. In truth, we need to start with the object of interest (the estimand) and then figure out how to say something about this from the data (the estimator). 


With DD, the estimand is

E[∆Y | X=1] - E[∆Y | X=0], 

where X is a binary treatment indicator and ∆Y is the change in Y over time. We often think this is an interesting object because, under certain assumptions, this quantity is interpretable as the average treatment effect on the treated (ATT). 

Now that we know what the goal is, how do we get there? In other words, what is the estimator that allows us to convert the data into an unbiased and/or consistent estimate of this object? It might be an estimator (e.g., pooled OLS after mean-differencing or first-differencing) of the two-way fixed effects (TWFE) model

Y_it = a_i + g_t + bX_it + e_it,

where a_i is the unit fixed effect, g_t is the time fixed effect, and X_it is a binary indicator for the treatment status of unit i at time t.  

That's right. I said might.


In the classic 2x2 DD design, where there are two periods of data, no one is treated in period 1, and only some units are treated prior to period 2, the TWFE estimator of b yields an unbiased estimate of the estimand, the ATT, under certain assumptions (namely, strict exogeneity of X; see here).
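
A minimal sketch of the 2x2 case on simulated data may help. Everything here is an assumption made for illustration: the sample size, the fixed effects, a constant treatment effect of 2, and the helper name twfe_slope (hypothetical, not from any library), which computes the TWFE coefficient by two-way demeaning a balanced panel. The check is that the TWFE coefficient equals the difference in mean changes between treated and untreated units, the sample analogue of E[∆Y | X=1] - E[∆Y | X=0].

```python
import numpy as np

def twfe_slope(y, x):
    """TWFE coefficient on x for a balanced panel (rows = units, cols = periods).

    Hypothetical helper: removes unit and time means (two-way within
    transformation), then runs OLS of demeaned y on demeaned x.
    """
    def demean(m):
        return (m - m.mean(axis=1, keepdims=True)
                  - m.mean(axis=0, keepdims=True) + m.mean())
    yd, xd = demean(y), demean(x)
    return (xd * yd).sum() / (xd ** 2).sum()

# Simulated 2x2 design; all numbers are illustrative assumptions.
rng = np.random.default_rng(1)
n = 500
treated = rng.integers(0, 2, size=n).astype(bool)  # who gets treated before period 2
unit_fe = rng.normal(size=n)                       # a_i
time_fe = np.array([0.0, 0.5])                     # g_t
x = np.zeros((n, 2))
x[treated, 1] = 1.0                                # no one is treated in period 1
y = unit_fe[:, None] + time_fe[None, :] + 2.0 * x + rng.normal(size=(n, 2))

b_twfe = twfe_slope(y, x)

# Difference in mean changes between treated and untreated units.
dy = y[:, 1] - y[:, 0]
dd = dy[treated].mean() - dy[~treated].mean()

print("TWFE coefficient:", b_twfe)
print("Diff-in-diff:    ", dd)
```

With two periods and a balanced panel, the two numbers coincide up to floating-point error, so here the estimator and the estimand line up.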

However, if the data contain more than 2 time periods, some of the units are treated prior to, say, period 2, while others are treated prior to, say, period 3, etc., and the effect of the treatment is heterogeneous across units and/or time, then Yoda will not be happy.

In this case -- now known as the case of staggered treatment timing -- the TWFE estimator of the model 

Y_it = a_i + g_t + bX_it + e_it

can still be used to recover an unbiased estimate of b under strict exogeneity. Thus, the estimator remains the same. However, the estimand has changed. In other words, the parameter, b, is no longer equal to the ATT, given by

E[∆Y | X=1] - E[∆Y | X=0]. 

Worse than that, b is now equal to something that lacks any meaningful interpretation. Specifically, b is a weighted average of the heterogeneous treatment effects, but some of the weights can be negative. 
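
A tiny noiseless example illustrates the problem. It is a sketch under assumptions I made up for illustration: two units, three periods, an early-treated unit whose effect grows from 1 to 4, a late-treated unit with an effect of 1, and arbitrary fixed effects. The ATT is taken here to be the simple average of the effects over treated unit-period cells, which is only one of several possible definitions. The helper twfe_slope is hypothetical, not from any library. Untreated outcomes are just unit plus time effects, so parallel trends holds by construction.

```python
import numpy as np

def twfe_slope(y, x):
    """TWFE coefficient and demeaned x for a balanced panel (rows = units)."""
    def demean(m):
        return (m - m.mean(axis=1, keepdims=True)
                  - m.mean(axis=0, keepdims=True) + m.mean())
    yd, xd = demean(y), demean(x)
    return (xd * yd).sum() / (xd ** 2).sum(), xd

# Row 0: treated from period 2 on. Row 1: treated from period 3 on.
x = np.array([[0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# Heterogeneous effects (illustrative): the early unit's effect grows over time.
tau = np.array([[0.0, 1.0, 4.0],
                [0.0, 0.0, 1.0]])

unit_fe = np.array([3.0, -1.0])      # a_i, arbitrary
time_fe = np.array([0.0, 1.0, 2.0])  # g_t, arbitrary
y = unit_fe[:, None] + time_fe[None, :] + tau * x   # no noise

b_twfe, xd = twfe_slope(y, x)

# Average effect over treated cells, and the implicit TWFE weight on each
# treated cell's effect (order: early unit t=2, early unit t=3, late unit t=3).
att = tau[x == 1].mean()
weights = xd[x == 1] / (xd ** 2).sum()

print("TWFE coefficient:", b_twfe)
print("ATT (cell avg):  ", att)
print("TWFE weights:    ", weights)
```

Working through the arithmetic by hand, every treatment effect is positive and their average is 2, yet the TWFE coefficient is about -0.5. The weights sum to one, but the early unit's later effect gets a weight of about -0.5, because the already-treated unit serves as the comparison for the late-treated unit.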


There are now several papers forthcoming or on the verge of forthcoming that discuss the issue in detail and outline possible solutions. I will not dive into this here. The incomparable Andrew Baker has a nice primer about the issue here. There are two papers in the forthcoming special issue of the Journal of the Association of Environmental and Resource Economists that Jennifer Alix-Garcia and I edited that are also quite insightful. 

My point in this post is not to get into these solutions. Perhaps another time. My point is simpler: when researchers are crafting a DD paper, it is imperative that the correct language is used. First, do not refer to this

Y_it = a_i + g_t + bX_it + e_it

as a "difference-in-difference regression model" or some such thing. This is a TWFE model; it is the TWFE estimator. Call it as such. It is not a DD estimator. DD is not an estimator. DD is a design or strategy used to motivate a particular estimand as interesting.

Second, be clear on your objective. In DD papers, the estimand is often the ATT. However, with heterogeneous treatment effects across units and/or time, one must be explicit about what the ATT is equal to and what estimator might recover an unbiased estimate of it. With staggered treatment timing, additional assumptions (beyond parallel trends) are needed in order for the TWFE estimate of b to be an unbiased estimate of the ATT. Without these additional assumptions, TWFE still yields an unbiased estimate of b, but b is not equal to the ATT.


Take a deep breath. Remember, you don't need to be perfect. Just keep sticking around. What goes for high school athletics also goes for research.

Be well!
