Bump hunting by topological data analysis
Max Sommerfeld
Felix Bernstein Institute for Mathematical Statistics in the Biosciences, University of Göttingen, Göttingen 37077, Germany
Giseon Heo
School of Dentistry, University of Alberta, Edmonton, Alberta T6G 2R7, Canada
Peter Kim
Department of Mathematics and Statistics, University of Guelph, Guelph, Ontario N1G 2W1, Canada
Stephen T. Rush
School of Medical Sciences, Örebro Universitet, Örebro SE-701 82, Sweden
Corresponding Author
J. S. Marron
Department of Statistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
Email: [email protected]
Abstract
A topological data analysis approach is taken to the challenging problem of finding local modes in a data set and validating their statistical significance. As with the SIgnificant ZERo crossings of derivatives (SiZer) approach to this problem, statistical inference is performed in a multi-scale way, that is, across bandwidths. The key contribution is a two-parameter approach to the persistent homology representation. For each kernel bandwidth, a sub-level set filtration of the resulting kernel density estimate is computed. Inference based on the resulting persistence diagram indicates statistical significance of modes. A simulated example and an analysis of the famous Hidalgo stamps data show that the new method has more statistical power for finding bumps than SiZer. Copyright © 2017 John Wiley & Sons, Ltd.
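The two-parameter idea (one persistence diagram per kernel bandwidth) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy data, the Gaussian kernel, the grid, and the 5% prominence threshold are all assumptions made for illustration. To track modes, the sub-level set filtration is assumed here to be applied to the negative density estimate (equivalently, super-level sets of the estimate). The statistical inference step of the paper is not reproduced; the fixed threshold merely stands in for it.

```python
import numpy as np


def kde_on_grid(data, grid, h):
    """Gaussian kernel density estimate with bandwidth h, evaluated on grid."""
    z = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(data) * h * np.sqrt(2.0 * np.pi))


def mode_persistence(f):
    """(birth, death) heights of the modes of f sampled on a 1D grid.

    Grid points are added from highest to lowest, which is the sub-level
    set filtration of -f. A component is born at a local maximum and dies
    when it merges into a component with a higher peak (elder rule).
    Pairs are returned sorted by persistence, largest first.
    """
    n = len(f)
    parent = np.full(n, -1, dtype=int)  # -1 marks "not yet in the filtration"

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = []
    for i in np.argsort(-f, kind="stable"):
        parent[i] = i
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] != -1:
                hi, lo = find(i), find(j)
                if hi == lo:
                    continue
                if f[hi] < f[lo]:
                    hi, lo = lo, hi  # hi is the root with the higher peak
                if f[lo] > f[i]:  # skip zero-persistence pairs
                    pairs.append((float(f[lo]), float(f[i])))
                parent[lo] = hi  # each root is the peak of its component
    # The global mode never merges; pair it with the minimum of f.
    pairs.append((float(f.max()), float(f.min())))
    return sorted(pairs, key=lambda p: p[0] - p[1], reverse=True)


def prominent_modes(data, grid, h, frac=0.05):
    """Count modes whose persistence exceeds frac times the peak height.

    The threshold is a hypothetical stand-in for a significance test.
    """
    f = kde_on_grid(data, grid, h)
    return sum(1 for b, d in mode_persistence(f) if b - d > frac * f.max())


# Toy data (an assumption): two well-separated clusters.
data = np.concatenate([np.linspace(-2.5, -1.5, 50), np.linspace(1.5, 2.5, 50)])
grid = np.linspace(-8.0, 8.0, 401)

# Second parameter: sweep the bandwidth and report prominent modes at each scale.
for h in (0.3, 1.0, 3.0):
    print(h, prominent_modes(data, grid, h))
```

In this sketch the height of each (birth, death) pair above the diagonal measures how pronounced a bump is at a given bandwidth; sweeping the bandwidth shows at which scales the bump survives.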