Generalized Estimating Equations in Longitudinal Data Analysis: A Review and Recent Developments
Abstract
The generalized estimating equation (GEE) approach fits a marginal model and is widely applied to longitudinal/clustered data in clinical trials and biomedical studies. We provide a systematic review of GEE, covering basic concepts as well as several recent developments motivated by practical challenges in real applications. The topics covered are the selection of the “working” correlation structure, sample size and power calculation, and the issue of informative cluster size, because these aspects play important roles in the use of GEE and in its statistical inference. A brief summary and a discussion of potential research directions regarding GEE are provided at the end.
1. Introduction
The generalized estimating equation (GEE) approach is a general statistical method for fitting a marginal model to longitudinal/clustered data, and it has been widely applied in clinical trials and biomedical studies [1–3]. One longitudinal data example comes from a study of orthodontic measurements on children, including 11 girls and 16 boys. The response is the distance (in millimeters) from the center of the pituitary to the pterygomaxillary fissure, measured repeatedly at ages 8, 10, 12, and 14 years. The primary goal is to investigate whether there is a significant gender difference in dental growth and in the temporal trend as age increases [4]. In such data, responses from the same individual tend to be “more alike”; thus incorporating within-subject and between-subject variation into model fitting is necessary to improve the efficiency of estimation and the power [5].
Several simple methods exist for repeated-measures analysis, for example, ANOVA/MANOVA for repeated measures, but they cannot readily incorporate covariates. Two types of approaches, mixed-effect models and GEE [6, 7], are traditional and are now widely used in practice. Of note, the two methods have different emphases in model fitting, depending on the study objectives. In particular, the mixed-effect model is an individual-level approach that adopts random effects to capture the correlation between observations on the same subject [7]. GEE, on the other hand, is a population-level approach based on a quasilikelihood function and provides population-averaged estimates of the parameters [8]. In this paper, we focus on the latter and provide a review of GEE and its recent developments. As is well known, GEE has several defining features [9–11]. (1) The variance-covariance matrix of the responses is treated as a nuisance parameter in GEE, and thus model fitting turns out to be easier than for mixed-effect models [12]. In particular, if the overall treatment effect is of primary interest, GEE is preferred. (2) Under mild regularity conditions, the parameter estimates are consistent and asymptotically normally distributed even when the “working” correlation structure of the responses is misspecified, and their variance-covariance matrix can be estimated by the robust “sandwich” variance estimator. (3) GEE relaxes the distributional assumption and only requires correct specification of the marginal mean and variance, as well as the link function that connects the covariates of interest to the marginal means.
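Feature (2) can be made concrete with a minimal sketch. It assumes the simplest setting: an identity link, a working-independence correlation, and simulated data that only loosely mimic the orthodontic example (the coefficients, the shared random shift used to induce within-child correlation, and the function name `gee_independence` are all hypothetical, not taken from the study). Under these assumptions the estimating equation has a closed-form solution, and the robust “sandwich” covariance is the “bread” inverse times the sum of outer products of cluster-level scores times the “bread” inverse.

```python
import numpy as np

def gee_independence(X_list, y_list):
    """Identity-link GEE under a working-independence correlation.

    X_list[i] is the (n_i x p) design matrix of cluster i and y_list[i]
    its response vector. Returns the estimate, the naive (model-based)
    covariance, and the robust sandwich covariance.
    """
    X = np.vstack(X_list)
    y = np.concatenate(y_list)
    bread = X.T @ X                              # sum_i X_i' X_i
    beta = np.linalg.solve(bread, X.T @ y)       # root of sum_i X_i'(y_i - X_i beta) = 0
    n_obs, p = X.shape
    resid = y - X @ beta
    sigma2 = resid @ resid / (n_obs - p)         # moment estimate of the variance
    bread_inv = np.linalg.inv(bread)
    naive = sigma2 * bread_inv
    # "Meat": outer products of the cluster-level score contributions.
    meat = np.zeros((p, p))
    for X_i, y_i in zip(X_list, y_list):
        score_i = X_i.T @ (y_i - X_i @ beta)
        meat += np.outer(score_i, score_i)
    robust = bread_inv @ meat @ bread_inv
    return beta, naive, robust

# Hypothetical simulated data: 200 children measured at ages 8, 10, 12, 14.
rng = np.random.default_rng(0)
ages = np.array([8.0, 10.0, 12.0, 14.0])
X_list, y_list = [], []
for i in range(200):
    group = float(i % 2)                         # cluster-level covariate
    X_i = np.column_stack([np.ones(4), np.full(4, group), ages - 11.0])
    shift = rng.normal(0.0, 1.0)                 # shared shift induces correlation
    y_i = X_i @ np.array([24.0, 2.0, 0.6]) + shift + rng.normal(0.0, 1.0, size=4)
    X_list.append(X_i)
    y_list.append(y_i)

beta, naive, robust = gee_independence(X_list, y_list)
print("estimate :", beta)
print("naive SE :", np.sqrt(np.diag(naive)))
print("robust SE:", np.sqrt(np.diag(robust)))
```

In this simulated setting the naive standard error of the cluster-level covariate ignores the within-child correlation and is too small, whereas the sandwich estimate accounts for it without requiring the working correlation to be correct.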
However, several aspects of GEE have remained controversial since the work of Liang and Zeger [6]. Crowder addressed, on the basis of asymptotic theory, the inconsistent estimation of the within-subject correlation coefficient under a misspecified “working” correlation structure [7]. In addition, estimation of the correlation coefficients by the moment-based approach is not efficient, and the estimated correlation matrix may fail to be positive definite in certain cases. Also, Liang and Zeger did not incorporate the constraints that the marginal means impose on the range of the correlation, because the correlation coefficients were estimated simply from Pearson residuals [6]. Chaganty and Joe discussed this issue for dependent Bernoulli random variables [13], and Sabo and Chaganty later provided further explanation [14]. Likewise, Sutradhar and Das pointed out that, under misspecification, the correlation coefficient estimates do not converge to the true values [15]. Furthermore, for discrete random vectors, the correlation matrix is usually complicated, and it is not easy to obtain multivariate distributions with specified correlation structures. These limitations have led researchers to work actively on novel methodologies in this area. Several alternative approaches for estimating the correlation coefficients have been proposed. For example, one method is based on “Gaussian” estimation [16, 17]: the correlation coefficients are estimated from multivariate normal estimating equations, which ensures that the estimated correlation matrix is positive definite. Wang and Carey proposed to estimate the correlation coefficients by differentiating the Cholesky decomposition of the working correlation matrix [18]. Also, Qu and Lindsay (2003) proposed similar Gaussian or quadratic estimating equations [19].
In particular, for binary longitudinal data, estimation of the correlation coefficients based on conditional residuals has been proposed [20–22]. In this paper, however, the above issues are not discussed in depth; we assume that, under mild regularity conditions, consistency holds for the parameter estimates as well as for the within-subject correlation coefficient estimates. We therefore focus on three specific topics, namely, model selection, power analysis, and the issue of informative cluster size, and review the recent developments in the following sections.
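The range restriction that the marginal means place on the correlation of binary responses can be illustrated with a short sketch. It assumes the standard bounds for a pair of Bernoulli variables, which follow from requiring every cell of the 2×2 joint distribution to be a valid probability; the function name `bernoulli_corr_bounds` is hypothetical.

```python
import math

def bernoulli_corr_bounds(p1, p2):
    """Feasible range of Corr(Y1, Y2) for Bernoulli margins p1 and p2.

    The bounds come from max(0, p1 + p2 - 1) <= P(Y1=1, Y2=1) <= min(p1, p2).
    """
    if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0):
        raise ValueError("marginal means must lie strictly between 0 and 1")
    q1, q2 = 1.0 - p1, 1.0 - p2
    lower = max(-math.sqrt(p1 * p2 / (q1 * q2)), -math.sqrt(q1 * q2 / (p1 * p2)))
    upper = min(math.sqrt(p1 * q2 / (p2 * q1)), math.sqrt(p2 * q1 / (p1 * q2)))
    return lower, upper

# With unequal margins the correlation cannot reach +1 or -1.
print(bernoulli_corr_bounds(0.5, 0.5))
print(bernoulli_corr_bounds(0.1, 0.5))
```

A moment-based estimate computed from Pearson residuals is not forced to respect these bounds, which is one way an estimated “working” correlation can be incompatible with the fitted marginal means.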
2. Method
2.1. Notation and GEE
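| Correlation structure | Corr(Yij, Yik) | Sample matrix | Estimator |
|---|---|---|---|
| Independent | $\mathrm{Corr}(Y_{ij}, Y_{ik}) = 0$, $j \neq k$ | | NA |
| Exchangeable | $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \alpha$, $j \neq k$ | | |
| k-dependent | $\mathrm{Corr}(Y_{ij}, Y_{i,j+m}) = \alpha_m$ for $m = 1, \dots, k$, and $0$ for $m > k$ | | |
| Autoregressive AR(1) | $\mathrm{Corr}(Y_{ij}, Y_{i,j+m}) = \alpha^m$, $m = 0, 1, 2, \dots, n_i - j$ | | |
| Toeplitz | $\mathrm{Corr}(Y_{ij}, Y_{i,j+m}) = \alpha_m$, $m = 1, 2, \dots, n_i - j$ | | |
| Unstructured | $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \alpha_{jk}$, $j \neq k$ | | |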
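
As an illustration of the structures named in the table above, the following sketch builds the corresponding working correlation matrices. It assumes the usual textbook definitions (a common correlation for exchangeable, powers of a single coefficient for AR(1), lag-specific coefficients for Toeplitz, and lag-specific coefficients truncated to zero beyond lag k for k-dependent); the function name `working_correlation` and its arguments are hypothetical.

```python
import numpy as np

def working_correlation(structure, n, alpha=0.0, k=None):
    """Return an n x n working correlation matrix.

    alpha is a scalar for "exchangeable" and "ar1", and a sequence of lag
    correlations (alpha_1, alpha_2, ...) for "toeplitz" (n - 1 values)
    and "k-dependent" (k values).
    """
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    if structure == "independent":
        return np.eye(n)
    if structure == "exchangeable":
        return np.where(lags == 0, 1.0, float(alpha))
    if structure == "ar1":
        return float(alpha) ** lags
    if structure in ("toeplitz", "k-dependent"):
        by_lag = np.zeros(n)                     # correlation indexed by lag
        by_lag[0] = 1.0
        coefs = np.asarray(alpha, dtype=float)
        m = n - 1 if structure == "toeplitz" else min(k, n - 1)
        by_lag[1:m + 1] = coefs[:m]              # lags beyond m stay at zero
        return by_lag[lags]
    raise ValueError(f"unknown structure: {structure}")

print(working_correlation("exchangeable", 4, 0.3))
print(working_correlation("ar1", 4, 0.5))
print(working_correlation("k-dependent", 4, [0.4], k=1))
```

The unstructured case is omitted because it has no pattern to generate: every off-diagonal entry is a separate parameter.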
2.2. Model Selection of GEE
In this section, we discuss the model selection criteria available for GEE. There are several reasons why model selection for GEE is important and necessary. (1) GEE has gained increasing attention in biomedical studies, which may include a large group of predictors [25–28]. Variable selection is therefore necessary to identify the significant predictors and determine which of them are included in the final regression model. (2) One known feature of GEE is that the parameter estimates remain consistent even when the “working” correlation structure is misspecified. However, correctly specifying the “working” correlation structure can enhance the efficiency of the parameter estimates, in particular when the sample size is not large enough [16, 24, 25, 29]. How to select the intrasubject correlation matrix therefore plays a vital role in improving the finite-sample performance of GEE. (3) The variance function ν(μ) is another potential factor affecting the goodness-of-fit of GEE [25, 30]. A correctly specified variance function can assist in the selection of covariates and of an appropriate correlation structure [31, 32]. Different criteria might be needed depending on the goal of model selection [24, 29, 33]; we next introduce the existing approaches to the selection of the “working” correlation structure, each with its own merits and limitations [34].
Besides the criteria mentioned above, Cantoni et al. discussed covariate selection for longitudinal data analysis [46]; variance function selection was considered by Pan and Mackenzie [30] as well as Wang and Lin [47]; and further work on “working” correlation structure selection was carried out by Chaganty and Joe [48], Wang and Lin [47], Gosho et al. [49, 50], Jang [51], Chen [52], and Westgate [53–55], among others. Overall, model selection for GEE is nontrivial, and the best selection criterion is still being pursued [56]; the recent work by Wang et al. can be followed as a rule of thumb [45].
2.3. Sample Size and Power of GEE
On the other hand, there are several concerns [68]. First, we focus here on calculating the sample size $K$ assuming that $n_i$ is known; however, in the power formula (16), $\nu_R$ depends on $n_i$, and thus increasing $n_i$ can also improve power, although it turns out to be less effective than increasing $K$ [69]. Second, the sample size/power calculation may be restricted by a limited number of clusters, for example, in cluster randomized trials (CRTs), where the number of clusters can be relatively small. For example, a literature review of published CRTs reported a median number of clusters of 21 [70]. In such situations, a power formula adjusted for small samples in GEE is necessary, which has recently drawn attention from researchers [71–75].
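The first concern can be illustrated numerically. The sketch below does not reproduce formula (16); it assumes a deliberately simple special case: a two-sided Wald test comparing the means of two equally sized arms, with K clusters in total, n observations per cluster, a common variance, and an exchangeable correlation ρ, so that K·Var(β̂) = 4σ²{1 + (n − 1)ρ}/n. The function names and the numerical inputs are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def variance_factor(n, rho, sigma2=1.0):
    """Assumed K * Var(beta_hat) for a two-arm mean comparison with equal
    allocation, n observations per cluster, and exchangeable correlation rho."""
    return 4.0 * sigma2 * (1.0 + (n - 1) * rho) / n

def required_clusters(beta, n, rho, sigma2=1.0, alpha=0.05, power=0.80):
    """Smallest total number of clusters K reaching the target power."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_power = _STD_NORMAL.inv_cdf(power)
    nu = variance_factor(n, rho, sigma2)
    return math.ceil(nu * (z_alpha + z_power) ** 2 / beta ** 2)

def wald_power(beta, K, n, rho, sigma2=1.0, alpha=0.05):
    """Approximate power of the two-sided Wald test (far tail ignored)."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    se = math.sqrt(variance_factor(n, rho, sigma2) / K)
    return _STD_NORMAL.cdf(abs(beta) / se - z_alpha)

# Hypothetical inputs: effect 0.5, unit variance, rho = 0.3.
print("clusters needed :", required_clusters(0.5, n=4, rho=0.3))
print("baseline power  :", wald_power(0.5, K=30, n=4, rho=0.3))
print("doubling n_i    :", wald_power(0.5, K=30, n=8, rho=0.3))
print("doubling K      :", wald_power(0.5, K=60, n=4, rho=0.3))
```

Under these assumptions, doubling the cluster size reduces the variance only through the factor {1 + (n − 1)ρ}/n, which levels off at ρ as n grows, whereas doubling K halves the variance outright.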
2.4. Clustered Data with Informative Cluster Size
This method was also explored or extended for correlated data with nonignorable cluster size by Benhin et al. and Cong et al. [82, 83]. Furthermore, a more efficient method called modified WCR (MWCR) was proposed by Chiang and Lee, in which a number of subjects equal to the minimum cluster size ($n_i > 1$) is randomly sampled from each cluster, and GEE models for balanced data are then applied for estimation by incorporating the intracluster correlation [84]. However, MWCR is not always satisfactory, and Pavlou et al. identified sufficient conditions on the data structure and the choice of “working” correlation structure under which the MWCR estimates are consistent [85]. In addition, Wang et al. extended the above work to clustered longitudinal data, which are collected as repeated measures on subjects arising in clusters, with potentially informative cluster size [45]. Examples include health studies of subjects from multiple hospitals or families. Comparing GEE, WCR, and CWGEE through extensive simulation studies and a real data application to a periodontal disease study, the authors recommended CWGEE because its performance was comparable to that of WCR in terms of well-preserved coverage rates and desirable power properties, without the need for intensive Monte Carlo computation, whereas GEE models led to invalid inference because of biased parameter estimates [45]. In addition, for observed-cluster inference, Seaman et al. discussed methods including weighted and doubly weighted GEE and shared random-effects models for comparison, and showed the conditions under which the shared random-effects model describes members with observed outcomes Y [86]. More work can be found in [87–90], among others.
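To make the comparison among GEE, WCR, and CWGEE more tangible, the sketch below works through the simplest possible case, estimation of a single marginal mean, on simulated data. It assumes that WCR repeatedly draws one observation per cluster and averages the resulting estimates, and that CWGEE under working independence weights each observation by the inverse of its cluster size; the data-generating mechanism and all function names are hypothetical.

```python
import random

def pooled_mean(clusters):
    """Unweighted GEE with working independence: every observation counts equally."""
    values = [y for cluster in clusters for y in cluster]
    return sum(values) / len(values)

def cluster_weighted_mean(clusters):
    """Cluster-weighted estimate: weights 1/n_i give the mean of cluster means."""
    return sum(sum(c) / len(c) for c in clusters) / len(clusters)

def wcr_mean(clusters, n_resamples, rng):
    """Within-cluster resampling: one draw per cluster, averaged over resamples."""
    estimates = []
    for _ in range(n_resamples):
        draw = [rng.choice(c) for c in clusters]
        estimates.append(sum(draw) / len(draw))
    return sum(estimates) / len(estimates)

# Hypothetical informative cluster size: small clusters have higher outcomes.
rng = random.Random(1)
clusters = []
for i in range(300):
    level, size = (1.0, 2) if i % 2 == 0 else (0.0, 8)
    clusters.append([level + rng.gauss(0.0, 0.5) for _ in range(size)])

gee_est = pooled_mean(clusters)
cw_est = cluster_weighted_mean(clusters)
wcr_est = wcr_mean(clusters, n_resamples=500, rng=rng)
print("pooled (GEE):", gee_est)
print("cluster-weighted:", cw_est)
print("WCR:", wcr_est)
```

In this setting the pooled estimate is dominated by the large clusters and drifts away from the mean of a typical cluster member, while the cluster-weighted estimate targets the same quantity as WCR without any resampling.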
3. Simulation
Selection frequencies of the “working” correlation structure, by response type (Normal, Binary):

| n | K | Criterion | Normal: IND | Normal: EXCH | Normal: AR-1 | Binary: IND | Binary: EXCH | Binary: AR-1 |
|---|---|---|---|---|---|---|---|---|
| 4 | 50 | QIC | 198 | 393 | 409 | 202 | 374 | 424 |
| 4 | 50 | RJ | 327 | 423 | 250 | 312 | 421 | 267 |
| 4 | 50 | RJ1 | 388 | 322 | 290 | 399 | 316 | 285 |
| 4 | 50 | RJ2 | 384 | 327 | 289 | 388 | 320 | 292 |
| 4 | 50 | SC | 488 | 1 | 512 | 351 | 310 | 339 |
| 4 | 50 | GP | 547 | 0 | 453 | 368 | 306 | 326 |
| 4 | 100 | QIC | 209 | 377 | 414 | 185 | 407 | 408 |
| 4 | 100 | RJ | 338 | 415 | 247 | 340 | 410 | 250 |
| 4 | 100 | RJ1 | 389 | 349 | 262 | 381 | 358 | 261 |
| 4 | 100 | RJ2 | 389 | 353 | 258 | 372 | 357 | 271 |
| 4 | 100 | SC | 482 | 1 | 517 | 352 | 346 | 302 |
| 4 | 100 | GP | 520 | 0 | 480 | 360 | 348 | 292 |
| 8 | 50 | QIC | 200 | 411 | 389 | 203 | 363 | 434 |
| 8 | 50 | RJ | 282 | 497 | 221 | 292 | 476 | 232 |
| 8 | 50 | RJ1 | 402 | 354 | 244 | 386 | 340 | 274 |
| 8 | 50 | RJ2 | 402 | 357 | 241 | 373 | 347 | 280 |
| 8 | 50 | SC | 465 | 1 | 535 | 351 | 325 | 324 |
| 8 | 50 | GP | 558 | 0 | 442 | 382 | 311 | 307 |
| 8 | 100 | QIC | 188 | 393 | 419 | 201 | 398 | 401 |
| 8 | 100 | RJ | 321 | 442 | 237 | 287 | 466 | 247 |
| 8 | 100 | RJ1 | 347 | 385 | 268 | 385 | 367 | 248 |
| 8 | 100 | RJ2 | 347 | 382 | 271 | 377 | 369 | 254 |
| 8 | 100 | SC | 492 | 0 | 508 | 355 | 343 | 302 |
| 8 | 100 | GP | 541 | 0 | 459 | 370 | 341 | 289 |


Selection frequencies of the “working” correlation structure, by response type (Normal, Binary):

| n | K | Criterion | Normal: IND | Normal: EXCH | Normal: AR-1 | Binary: IND | Binary: EXCH | Binary: AR-1 |
|---|---|---|---|---|---|---|---|---|
| 4 | 50 | QIC | 106 | 699 | 195 | 53 | 758 | 189 |
| 4 | 50 | RJ | 419 | 139 | 442 | 869 | 5 | 126 |
| 4 | 50 | RJ1 | 0 | 963 | 37 | 12 | 898 | 90 |
| 4 | 50 | RJ2 | 0 | 959 | 41 | 22 | 876 | 102 |
| 4 | 50 | SC | 0 | 593 | 407 | 282 | 650 | 68 |
| 4 | 50 | GP | 1 | 593 | 406 | 412 | 524 | 64 |
| 4 | 100 | QIC | 31 | 879 | 90 | 7 | 867 | 126 |
| 4 | 100 | RJ | 350 | 88 | 562 | 911 | 2 | 87 |
| 4 | 100 | RJ1 | 0 | 995 | 5 | 2 | 946 | 52 |
| 4 | 100 | RJ2 | 0 | 996 | 4 | 10 | 933 | 57 |
| 4 | 100 | SC | 0 | 598 | 402 | 339 | 635 | 26 |
| 4 | 100 | GP | 0 | 501 | 499 | 445 | 531 | 24 |
| 8 | 50 | QIC | 80 | 828 | 92 | 50 | 876 | 74 |
| 8 | 50 | RJ | 10 | 395 | 595 | 813 | 6 | 181 |
| 8 | 50 | RJ1 | 0 | 1000 | 0 | 0 | 987 | 13 |
| 8 | 50 | RJ2 | 0 | 1000 | 0 | 0 | 966 | 25 |
| 8 | 50 | SC | 0 | 488 | 513 | 302 | 696 | 2 |
| 8 | 50 | GP | 0 | 511 | 489 | 497 | 500 | 3 |
| 8 | 100 | QIC | 17 | 953 | 30 | 8 | 973 | 19 |
| 8 | 100 | RJ | 0 | 408 | 592 | 861 | 0 | 139 |
| 8 | 100 | RJ1 | 0 | 1000 | 0 | 0 | 997 | 3 |
| 8 | 100 | RJ2 | 0 | 1000 | 0 | 0 | 993 | 7 |
| 8 | 100 | SC | 0 | 470 | 530 | 328 | 672 | 0 |
| 8 | 100 | GP | 0 | 526 | 474 | 486 | 514 | 0 |


Selection frequencies of the “working” correlation structure, by response type (Normal, Binary):

| n | K | Criterion | Normal: IND | Normal: EXCH | Normal: AR-1 | Binary: IND | Binary: EXCH | Binary: AR-1 |
|---|---|---|---|---|---|---|---|---|
| 4 | 50 | QIC | 91 | 166 | 743 | 66 | 170 | 764 |
| 4 | 50 | RJ | 712 | 142 | 146 | 925 | 12 | 63 |
| 4 | 50 | RJ1 | 0 | 478 | 522 | 7 | 505 | 488 |
| 4 | 50 | RJ2 | 0 | 466 | 534 | 20 | 499 | 481 |
| 4 | 50 | SC | 0 | 480 | 520 | 220 | 350 | 430 |
| 4 | 50 | GP | 0 | 543 | 457 | 303 | 332 | 365 |
| 4 | 100 | QIC | 25 | 116 | 859 | 7 | 122 | 871 |
| 4 | 100 | RJ | 770 | 95 | 135 | 972 | 4 | 24 |
| 4 | 100 | RJ1 | 0 | 475 | 525 | 1 | 569 | 430 |
| 4 | 100 | RJ2 | 0 | 481 | 519 | 5 | 571 | 424 |
| 4 | 100 | SC | 0 | 491 | 509 | 237 | 371 | 392 |
| 4 | 100 | GP | 0 | 540 | 460 | 290 | 353 | 357 |
| 8 | 50 | QIC | 50 | 88 | 862 | 44 | 77 | 879 |
| 8 | 50 | RJ | 646 | 148 | 206 | 934 | 5 | 61 |
| 8 | 50 | RJ1 | 0 | 445 | 555 | 0 | 535 | 465 |
| 8 | 50 | RJ2 | 0 | 443 | 557 | 10 | 535 | 455 |
| 8 | 50 | SC | 0 | 467 | 533 | 168 | 397 | 435 |
| 8 | 50 | GP | 0 | 549 | 451 | 269 | 406 | 325 |
| 8 | 100 | QIC | 16 | 39 | 945 | 7 | 33 | 960 |
| 8 | 100 | RJ | 648 | 154 | 198 | 972 | 0 | 28 |
| 8 | 100 | RJ1 | 0 | 455 | 545 | 1 | 603 | 396 |
| 8 | 100 | RJ2 | 0 | 455 | 545 | 1 | 609 | 390 |
| 8 | 100 | SC | 0 | 480 | 520 | 177 | 458 | 365 |
| 8 | 100 | GP | 0 | 532 | 468 | 247 | 457 | 296 |

Based on the results, RJ does not perform well in the scenarios with either continuous or binary outcomes, while RJ1 and RJ2 perform comparably and can select the true underlying correlation structure in most scenarios, with better performance under a larger sample size. QIC is not satisfactory when the true correlation structure is independent but performs favorably when the true correlation structure is exchangeable or AR-1. On the other hand, SC and GP do not perform well for longitudinal data with normal responses, but their performance is slightly better for longitudinal data with binary outcomes. The results may vary with a variety of factors, including the types of “working” correlation structure considered for model fitting, the sample size, and/or the magnitude of the correlation coefficient. For future work, a robust criterion for selecting the “working” correlation structure in GEE is still needed, and more advanced approaches are currently emerging.
4. Future Direction and Discussion
In this paper, we provide a review of several specific topics related to GEE for longitudinal/correlated data: model selection, with emphasis on the selection of the “working” correlation structure; sample size and power calculation; and clustered data analysis with informative cluster size. Simulation studies are conducted to provide numerical comparisons among five types of model selection criteria [91, 92]. Novel methodologies are still needed and are being developed, owing to the increasing use and potential theoretical constraints of GEE as well as new challenges emerging from practical applications in clinical trials and biomedical studies.
In addition, current research interests related to GEE include robust and optimal model selection criteria for GEE under missing at random (MAR) or missing not at random (MNAR) mechanisms [93, 94]; sample size/power calculation for correlated sparse or overdispersed count data or for longitudinal data with small samples [57–60]; GEE with improved performance in situations with informative cluster size and/or MAR data and/or small sample size [95–98]; and GEE for high-dimensional longitudinal data [99]. Although GEE has attractive features, flexible application, and easy implementation in software, it should be applied with caution in practice, with attention to the study design, the data structure, and the goals of the research.
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The author was supported by a grant from the Penn State CTSI. The project was supported by the National Center for Research Resources and the National Center for Advancing Translational Sciences, National Institutes of Health, through Grant 5 UL1 RR0330184-04. The content is solely the responsibility of the author and does not represent the views of the NIH.