A Review of Hot Deck Imputation for Survey Non-response
Rebecca R. Andridge
Division of Biostatistics, The Ohio State University, Columbus, OH 43210, USA
Roderick J. A. Little
Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
Summary
Hot deck imputation is a method for handling missing data in which each missing value is replaced with an observed response from a “similar” unit. Although the hot deck is used extensively in practice, its theory is not as well developed as that of other imputation methods. We have found that no consensus exists as to the best way to apply the hot deck and obtain inferences from the completed data set. Here we review different forms of the hot deck and existing research on its statistical properties. We describe applications of the hot deck currently in use, including the U.S. Census Bureau's hot deck for the Current Population Survey (CPS). We also provide an extended example of variations of the hot deck applied to the third National Health and Nutrition Examination Survey (NHANES III). Some potential areas for future research are highlighted.
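The basic idea of replacing a missing value with an observed response from a “similar” unit can be illustrated with a minimal sketch. It shows only one simple variant, a random hot deck within adjustment cells, where "similar" is assumed to mean "shares the same values of the classification variables". The function name `hot_deck_impute`, the toy records, and the variable names are hypothetical and are not taken from the paper; sample weights and variance estimation are ignored.

```python
import random
from collections import defaultdict


def hot_deck_impute(records, target, class_vars, seed=0):
    """Random hot deck within adjustment cells (illustrative sketch only).

    records    : list of dicts, with None marking a missing value
    target     : name of the variable to impute
    class_vars : variables whose joint values define the adjustment cells
    """
    rng = random.Random(seed)

    # Build the donor pool: observed target values grouped by cell.
    donors = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            cell = tuple(rec[v] for v in class_vars)
            donors[cell].append(rec[target])

    # Fill each missing value with a randomly drawn donor value
    # from the same cell; leave it missing if the cell has no donor.
    completed = []
    for rec in records:
        new = dict(rec)  # copy so the input is not modified
        if new[target] is None:
            pool = donors.get(tuple(rec[v] for v in class_vars))
            if pool:
                new[target] = rng.choice(pool)
        completed.append(new)
    return completed


# Hypothetical toy data: income is missing for units 2, 5 and 6.
survey = [
    {"id": 1, "sex": "F", "region": "N", "income": 52},
    {"id": 2, "sex": "F", "region": "N", "income": None},
    {"id": 3, "sex": "F", "region": "N", "income": 48},
    {"id": 4, "sex": "M", "region": "S", "income": 61},
    {"id": 5, "sex": "M", "region": "S", "income": None},
    {"id": 6, "sex": "M", "region": "N", "income": None},  # cell has no donor
]
completed = hot_deck_impute(survey, "income", ["sex", "region"])
```

In this sketch a recipient whose cell contains no respondents is left missing; in practice cells would typically be collapsed or a distance-based donor search used instead, which the sketch does not attempt.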