Shopping at the Gap
As shoppers at the Gap, we take our time and enjoy ourselves. So, too, I hope you enjoy this post at a leisurely pace. This is a gentle way of warning you that this is a long post.
But, the Gap plays another role in this post as well. As (social) scientists, we are constantly worrying about gaps, not the Gap. Gaps in our knowledge, mostly. Many new studies identify a knowledge gap in a specific literature and purport to "fill this gap."
As empirical researchers, we seek to fill these gaps by bringing rigorous statistical analysis to data. For nearly a century, starting with Yule (1927), this rigorous statistical analysis has been applied to time series data.
For more than a half century, starting with Mundlak (1961) and Balestra & Nerlove (1966), this rigorous analysis has been applied to panel (or longitudinal) data.
When data involve a time dimension, there is a completely different kind of gap that merits attention, but most often is ignored: the gap between observed time periods in the data. I was reminded of this issue just recently thanks to a new working paper, Franses (2019).
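The issue at hand is referred to as one of irregularly spaced data. To be formal and fix ideas, we need to define what it means to be irregularly spaced. A simple and straightforward definition is the following:
"Data are said to be irregularly spaced if the length of time between two consecutive time periods in the data is not equal to the length of time between two consecutive time periods in the true data-generating process (DGP)."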
Fuleky (2012) refers to the interval between periods in the true DGP as the unit period, while the interval between periods in the data is the observation period.
A few comments are warranted at this point. First, as the true DGP is never known, one can never know with certainty if the data are irregularly spaced; the unit period is always unknown.
Second, equally spaced data may be irregularly spaced. In other words, collecting data at regular intervals does not guarantee that the data are regularly spaced. The data are still irregular if the observation period is longer than the unit period. [Note: by construction, the observation period cannot be shorter than the unit period.]
Now that we have the definition of irregularly spaced data, why should we care?
Let's start with a simple AR(1) time series regression model, given by
y_t = a*y_t-1 + e_t, t=1,...,T
But what is t? To better understand the model and estimation, we need to be careful to start with writing down the true model. In other words, what is the DGP? So, let's assume the model above represents the true DGP (where y_0 is the initial value determined somehow). If the model above represents the true DGP, then t indexes the unit periods.
Wishing to take the above model to the data, assume we gather data on y_m, m=1,...,M, where m indexes periods in the observed data. A naive researcher might estimate the following regression model
y_m = b*y_m-1 + e_m, m=1,...,M
If the unit and observation periods do not align, then b does not equal a and any estimate of b will be biased for a. Again, this happens even if the observation periods are equally spaced.
For example, if t indexes weekly periods, but m indexes annual measurements, then it follows from repeated substitution that
y_m = (a^52)*y_m-1 + e_m, m=1,...,M
where e_m is a function of e_t for the 52 errors spanning the weekly periods between m-1 and m. This is the model being estimated, and a naive estimate of b will be at best a consistent estimate of a^52.
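A quick simulation can make the weekly-versus-annual example concrete. The sketch below is only illustrative: the value a = 0.99, the sample size, and the choice to observe the last week of each year are all assumptions of mine, not taken from any of the papers cited here.

```python
import numpy as np

rng = np.random.default_rng(0)

a = 0.99            # weekly AR(1) coefficient in the assumed true DGP
weeks_per_obs = 52  # one observation per year
M = 4000            # number of annual observations
T = weeks_per_obs * M

# Simulate the weekly (unit-period) process y_t = a*y_{t-1} + e_t.
e = rng.standard_normal(T)
y = np.empty(T)
prev = 0.0
for t in range(T):
    prev = a * prev + e[t]
    y[t] = prev

# Keep only the last week of each year (the observation period).
y_obs = y[weeks_per_obs - 1 :: weeks_per_obs]

# Naive OLS of y_m on y_{m-1}, with no intercept.
b_hat = (y_obs[1:] @ y_obs[:-1]) / (y_obs[:-1] @ y_obs[:-1])

print(f"true a      = {a}")
print(f"a**52       = {a ** weeks_per_obs:.3f}")
print(f"naive b_hat = {b_hat:.3f}")
```

The naive estimate should land near a^52 rather than near a, even though the annual data are perfectly equally spaced.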
What if the data are not equally spaced? Then the model being estimated is given by
y_m = (a^(g_m))*y_m-1 + e_m, m=1,...,M
where g_m is the number of true periods spanned by the observed periods m-1 and m. Now, a naive estimate of b will be at best a consistent estimate of a weighted average of a^(g_m). For instance, if the data are a mix of annual and biennial measurements, then b will represent some average of a^52 and a^104.
Awkward, but this is actually an easy problem to solve. Given an assumption concerning the definition of the unit period, t, one can estimate the model
y_m = (a^(g_m))*y_m-1 + e_m, m=1,...,M
using Nonlinear Least Squares (NLLS) or Generalized Method of Moments (GMM) since g_m is known. This is the point made in the recent working paper, Franses (2019). It was made previously in a panel context in Rosner & Munoz (1988).
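As a rough sketch of the NLLS idea, the code below simulates an AR(1) process at the unit period, observes it at gaps of 1, 2, or 3 unit periods, and then minimizes the sum of squared residuals over a. Everything specific here is an assumption for illustration: the value a = 0.8, the gap pattern, and the use of a plain grid search in place of a proper nonlinear optimizer. It is not the estimator code from Franses (2019) or Rosner & Munoz (1988).

```python
import numpy as np

rng = np.random.default_rng(1)

a_true = 0.8   # AR(1) coefficient per unit period (assumed DGP)
T = 8000       # number of unit periods simulated

# Unit-period process y_t = a*y_{t-1} + e_t.
e = rng.standard_normal(T)
y = np.empty(T)
prev = 0.0
for t in range(T):
    prev = a_true * prev + e[t]
    y[t] = prev

# Observe y at irregular dates: gaps of 1, 2 or 3 unit periods.
gaps = rng.integers(1, 4, size=T)   # integers drawn from {1, 2, 3}
dates = np.cumsum(gaps)
dates = dates[dates < T]
y_obs = y[dates]
g = np.diff(dates)                  # g_m: unit periods between m-1 and m
y_lag, y_cur = y_obs[:-1], y_obs[1:]

# Naive OLS ignores the gaps entirely.
b_naive = (y_cur @ y_lag) / (y_lag @ y_lag)

# NLLS: choose a to minimize sum_m (y_m - a**g_m * y_{m-1})**2.
# A grid search over [0, 0.999] keeps the sketch dependency-free.
grid = np.linspace(0.0, 0.999, 1000)
ssr = np.array([np.sum((y_cur - c ** g * y_lag) ** 2) for c in grid])
a_nlls = grid[np.argmin(ssr)]

print(f"naive b = {b_naive:.3f}, NLLS a = {a_nlls:.3f}, true a = {a_true}")
```

The grid search is a design shortcut: with a single parameter bounded between 0 and 1 it is transparent and cannot diverge, whereas a real application would more likely use a standard nonlinear least squares routine and report standard errors.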
At this point, we have spent far too much time on a time series example (pun intended). So, what about real econometrics, er, I mean, panel data models?
Longitudinal data obviously entail a time component, and thus issues of irregular spacing may arise. Before delving into the details, let us again motivate why you should care about this issue. Many longitudinal data sources, from developed and developing countries alike, have completely screwy timing patterns.
Estimating panel data models using these or similarly screwy data sources will cause problems ... if there is some dynamic element to the panel data model. If the true DGP is a purely static panel data model, given by
y_it = c_i + b*x_it + e_it,
where e_it is white noise, then irregular spacing creates no difficulties to my knowledge. However, once dynamics are involved, spacing of the data matters.
The problem of irregularly spaced panel data was solved for two special cases decades ago. First, Rosner & Munoz (1988), mentioned above, tackle a dynamic panel data model with no unobserved heterogeneity, given by
y_it = a*y_it-1 + b*x_it + e_it.
With repeated substitution, the model becomes
y_im = (a^(g_m))*y_im-1 + b*x_im + e_im
and can be estimated by pooled NLLS.
Second, Baltagi & Wu (1999) consider a static panel model with AR(1) errors, given by
y_it = c_i + b*x_it + e_it,
where c_i is a random effect and e_it = r*e_it-1 + u_it. The authors derive a Feasible Generalized Least Squares (FGLS) estimator for the model
y_im = c_i + b*x_im + e_im.
While interesting, these prior studies do not confront the issue of irregular spacing in perhaps the most common dynamic specification of a panel data model, one that includes both fixed effects and a lagged dependent variable.
The typical dynamic panel data model is given by
y_it = c_i + a*y_it-1 + b*x_it + e_it.
This model exploded in popularity following Arellano & Bond (1991), which now has nearly 30,000 citations on Google Scholar. Obtaining consistent estimates of this model with irregularly spaced panel data is much, much harder.
It turns out that irregular spacing creates three potential problems in this model. With repeated substitution, the model becomes
y_im = [(1-a^(g_m))/(1-a)]*c_i + (a^(g_m))*y_im-1 + b*x_im + e_im,
where e_im is a function of e_it and x_it for the periods between observed periods m and m-1. If that doesn't look pretty, take comfort that you are not alone.
The three problems with estimating this equation are
1. The fixed effect has a time-varying factor loading if the observation periods change in duration over the course of the sample.
2. The coefficient on the lagged dependent variable depends on g_m.
3. The error term includes the covariates from all the periods not included in the data.
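To see where the three problems above come from, it may help to work through the simplest case by hand. The derivation below assumes a gap of exactly two unit periods (g_m = 2) and substitutes the model into itself once; the comma-separated subscripts are my notation, not the post's.

```latex
\begin{align*}
y_{i,t} &= c_i + a\,y_{i,t-1} + b\,x_{i,t} + e_{i,t} \\
        &= c_i + a\left(c_i + a\,y_{i,t-2} + b\,x_{i,t-1} + e_{i,t-1}\right)
           + b\,x_{i,t} + e_{i,t} \\
        &= \underbrace{(1+a)}_{=\,(1-a^{2})/(1-a)}\,c_i
           + a^{2}\,y_{i,t-2} + b\,x_{i,t}
           + \underbrace{\left(a\,b\,x_{i,t-1} + a\,e_{i,t-1} + e_{i,t}\right)}_{\text{composite error } e_{im}}
\end{align*}
```

The loading on c_i is (1 - a^2)/(1 - a), the coefficient on the observed lag is a^2, and the unobserved covariate x_{i,t-1} sits in the composite error, which matches the three items in the list when g_m = 2.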
These issues mean that the usual Arellano & Bond (1991) approach (or system GMM estimator or original Anderson & Hsiao (1981) approaches) will not work. Naive application of these estimators produces nonsense.
First, all of these estimators proceed by first-differencing to eliminate c_i. Here, with variable frequency between data periods, first-differencing will not eliminate c_i; with irregularly but equally spaced data, c_i will continue to drop out.
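A few lines of arithmetic illustrate the point about first-differencing. The sketch below computes the factor loading on c_i from the substituted model, (1 - a^g)/(1 - a), written as the sum 1 + a + ... + a^(g-1), and then differences it across two consecutive observations. The function names and the value a = 0.8 are hypothetical choices for illustration.

```python
def loading(a, g):
    """Loading on c_i after g unit periods: 1 + a + ... + a**(g-1)."""
    return sum(a ** k for k in range(g))


def fd_loading(a, g_cur, g_prev):
    """Loading on c_i that survives first-differencing observations m and m-1."""
    return loading(a, g_cur) - loading(a, g_prev)


a = 0.8  # hypothetical AR coefficient

equal = fd_loading(a, 2, 2)    # both gaps span two unit periods
unequal = fd_loading(a, 2, 1)  # a two-period gap follows a one-period gap

print(f"equal gaps:   {equal}")
print(f"unequal gaps: {unequal}")
```

With equal gaps the loading on c_i cancels, so first-differencing still removes the fixed effect; with unequal gaps a nonzero multiple of c_i is left behind in the differenced equation.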
Second, even if first-differencing succeeds in eliminating c_i, the nonlinear coefficients on y_im-1 and y_im-2 must be handled.
Third, even if the covariates, x, are strictly exogenous (with respect to the original error term, e_it), they will not be strictly exogenous (with respect to e_im) if the x's are serially correlated since lags of x end up in the error term of the estimating equation.
In my paper with my former student, Millimet & McDonough (2017), we tackle this problem. We solve the first problem by considering quasi-differencing to eliminate c_i. We solve the second problem by resorting to NLLS or GMM, addressing the endogeneity of y_im-1 using further lags as instruments. The third problem, however, is more elusive. We consider various approaches based on imputing the covariates from the missing time periods. Not ideal, but it was the best we could come up with. We also explore some other proposed estimators in the paper. Check it out!
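One way to picture quasi-differencing is to scale the lagged equation by the ratio of the two factor loadings before subtracting, so that the c_i terms cancel exactly. The sketch below checks that cancellation numerically. It is my own illustration under assumed values (a = 0.8, c_i = 3, gaps of 1 and then 2 unit periods) and is not necessarily the exact transformation used in Millimet & McDonough (2017).

```python
def loading(a, g):
    """Loading on c_i after g unit periods: 1 + a + ... + a**(g-1)."""
    return sum(a ** k for k in range(g))


def quasi_diff_weight(a, g_cur, g_prev):
    """Weight q such that y_m - q*y_{m-1} carries no c_i term."""
    return loading(a, g_cur) / loading(a, g_prev)


a, c_i = 0.8, 3.0      # hypothetical parameter and fixed effect
g_prev, g_cur = 1, 2   # a one-period gap followed by a two-period gap

q = quasi_diff_weight(a, g_cur, g_prev)

# Fixed-effect term left in y_m - q*y_{m-1}.
remaining = loading(a, g_cur) * c_i - q * loading(a, g_prev) * c_i

print(f"quasi-difference weight q = {q:.3f}")
print(f"remaining c_i term        = {remaining:.2e}")
```

Note that the weight q depends on the unknown a, so the transformed equation is nonlinear in the parameters; this is one reason a nonlinear estimator such as NLLS or GMM is needed rather than a linear one.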
Interestingly, another paper came out at the exact same time as ours, Sasaki & Xin (2017), by econometricians smarter than Ian and me. Well, at least me. It proposes a different estimator that works for many irregular spacing patterns, but relies on a particular weak stationarity assumption for identification.
If you have made it this far, you must truly be interested, or deranged, or a little of both. You should also check out early work on irregular spacing in pseudo panels by McKenzie (2001).
There are many subtle issues that get introduced once one starts to think about the unit period vs. the observation period. In the interest of not writing a novel, I have not mentioned them. But the main takeaway from this post, for applied researchers, is this: any time you are writing or reading a study that has a dynamic component to the model, pay attention to how the data are organized and how that organization compares to the underlying data-generating process. It is as important as anything they sell at the Gap.
References
Arellano, M. and S. Bond (1991), "Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations," Review of Economic Studies, 58, 277-297.
Balestra, P. and M. Nerlove (1966), "Pooling Cross-Section and Time-Series Data in the Estimation of a Dynamic Economic Model: The Demand for Natural Gas," Econometrica, 34, 585-612.
Franses, Ph.H.B.F. (2019), "Estimating Persistence for Irregularly Spaced Historical Data," Econometric Institute Research Papers EI2020-03, Erasmus University Rotterdam, Erasmus School of Economics (ESE), Econometric Institute.
Fuleky, P. (2012), "On the Choice of the Unit Period in Time Series Models," Applied Economics Letters, 19, 1179-1182.
McKenzie, D.J. (2001), "Estimation of AR(1) Models with Unequally Spaced Pseudo-Panels," Econometrics Journal, 4, 89-108.
Millimet, D.L. and I.K. McDonough (2017), "Dynamic Panel Data Models with Irregular Spacing," Journal of Applied Econometrics, 32, 725-743.
Mundlak, Y. (1961), "Empirical Production Functions Free of Management Bias," Journal of Farm Economics, 43, 44-56.
Rosner, B. and A. Munoz (1988), "Autoregressive Modelling for the Analysis of Longitudinal Data with Unequally Spaced Examinations," Statistics in Medicine, 7, 59-71.
Sasaki, Y. and Y. Xin (2017), "Unequal Spacing in Dynamic Panel Data: Identification and Estimation," Journal of Econometrics, 196, 320-330.
Yule, G.U. (1927), "On a Method of Investigating Periodicities in Disturbed Series, with Special Reference to Wolfer's Sunspot Numbers," Philosophical Transactions of the Royal Society of London, Series A, 226, 267-298.