A Semantic and Detection-Based Approach to Speech and Language Processing
Li Deng
Microsoft Research Corporation, Redmond, Washington, USA
Kuansan Wang
Microsoft Research Corporation, Redmond, Washington, USA
Rodrigo Capobianco Guido
Institute of Physics at Sao Carlos, University of Sao Paulo, Sao Paulo, Brazil
Phillip C.-Y. Sheu
University of California, Irvine, California, USA
Heather Yu
C. V. Ramamoorthy
Arvind K. Joshi
Lotfi A. Zadeh
Summary
This chapter presents a new formulation that tightly integrates the detection-based algorithm into the maximum a posteriori (MAP) decision. The key to this formulation is to implement the sequential detection algorithm and to apply the sequential probability ratio test recurrently in a time-synchronous, single-pass decoding framework. The chapter shows that realizing detection-based recognition in a single-pass architecture is feasible. It provides an overview of the mathematical foundation of this approach, which serves as an introduction to the general detection-based approach to computer processing of speech and language. The overview starts with conventional fixed-sample-size detection and then extends naturally to sequential detection theory. Finally, the chapter presents a comprehensive case study of how the sequential detection technique is successfully applied to a speech understanding task related to personal information management.
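As a rough illustration of the sequential probability ratio test mentioned in the summary, the following minimal sketch applies the classical test to a stream of scalar observations. It is not the chapter's decoder. The function name `sprt_gaussian`, the equal-variance Gaussian observation model, and the target error rates `alpha` and `beta` are all assumptions chosen for illustration, and the thresholds use Wald's standard approximations.

```python
import math
from typing import Iterable, Tuple


def sprt_gaussian(samples: Iterable[float], mu0: float, mu1: float,
                  sigma: float, alpha: float = 0.01,
                  beta: float = 0.01) -> Tuple[str, int]:
    """Sequential test of H0 (mean mu0) against H1 (mean mu1), known sigma.

    Hypothetical helper for illustration only. Returns the decision and
    the number of samples consumed before the test stopped.
    """
    # Wald's approximate thresholds on the cumulative log-likelihood ratio.
    upper = math.log((1.0 - beta) / alpha)   # cross this: decide H1
    lower = math.log(beta / (1.0 - alpha))   # cross this: decide H0

    llr = 0.0
    n = 0
    for n, x in enumerate(samples, start=1):
        # log p1(x) - log p0(x) for two Gaussians sharing the same variance.
        llr += (mu1 - mu0) * (x - 0.5 * (mu0 + mu1)) / sigma ** 2
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    # The data ran out before either threshold was crossed.
    return "undecided", n


if __name__ == "__main__":
    print(sprt_gaussian([1.0] * 20, mu0=0.0, mu1=1.0, sigma=1.0))
```

Unlike a fixed-sample-size detector, which always consumes a preset number of observations, this test stops as soon as the accumulated evidence crosses either threshold, so the sample count depends on the data.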
Controlled Vocabulary Terms
natural language processing; speech processing