Speaker Recognition and Diarization

This chapter presents a continuously growing field that promises a wealth of applications far beyond speech processing: the automatic identification of persons from their speech. Research currently focuses mainly on two tasks. Speaker detection verifies the identity of a new speaker against a set of pretrained speaker models. Speaker diarization finds the speech segments that belong to the same speaker without any a priori knowledge. The chapter first introduces the general ideas behind the two tasks. It then explains speaker diarization, giving an overview of current work followed by a more detailed description of one concrete diarization system, and discusses variants and current research topics. Speaker recognition is presented in the same way. The chapter concludes by pointing to open problems.

Controlled Vocabulary Terms

speaker recognition


Semantic Computing