Last modified on 04/06/10 09:32:53

Acoustic Modelling Approaches

from A Survey on Automatic Speech Recognition with an Illustrative Example on Continuous Speech Recognition of Mandarin:

  • Maximum Likelihood (ML) Estimation of HMM: Estimation of HMM parameters is usually accomplished in batch mode using the ML approach based on the EM (expectation-maximization) algorithm (e.g. [Baum et al. 1970, Liporace 1982, Juang 1985]). Segmental ML approaches have also been extensively used (e.g. [Rabiner, Wilpon and Juang 1986]). Although ML estimation has good asymptotic properties, it often requires a large training set to achieve reliable parameter estimation. Smoothing techniques, such as deleted interpolation [Jelinek and Mercer 1980] and Bayesian smoothing [Gauvain and Lee 1992], have been proposed to circumvent some of the problems associated with sparse training data.
  • Maximum Mutual Information (MMI) Estimation of HMM: Instead of maximizing the likelihood of observing both the given acoustic data and the transcription, the MMI estimation procedure maximizes the mutual information between the given acoustic data and the corresponding transcription [Bahl et al. 1986, Normandin and Morgera 1991]. ML estimation uses only class-specific data to train the classifier for a particular class. MMI estimation, in contrast, also takes into account data from the other classes, because the definition of mutual information necessarily includes all class priors and conditional probabilities.
  • Maximum A Posteriori (MAP) Estimation of HMM: Perhaps the ultimate way to train subword units is to adapt them to the task, to the speaking environment, and to the speaker. One way to accomplish adaptive training is through Bayesian learning, in which an initial set of seed models (e.g. speaker-independent or SI models) is combined with the adaptation data to adjust the model parameters so that the resulting set of subword models matches the acoustic properties of the adaptation data. This can be accomplished by maximum a posteriori estimation of HMM parameters [Lee, Lin and Juang 1991, Gauvain and Lee 1992, Gauvain and Lee 1994] and has been successfully applied to HMM-based speaker and context adaptation of whole-word and subword models. On-line adaptation, which continuously adapts HMM parameters and hyperparameters, has also been developed (e.g. [Huo and Lee 1996]).
  • Minimum Classification Error (MCE) Estimation of HMM and ANN: One new direction for speech recognition research is to design a recognizer that minimizes the error rate on task-specific training data. The problem here is that the error probability is not easily expressed in a closed functional form because the true probability density function of the speech signal is not known. An alternative is to find a set of model parameters that minimizes the recognition error on a given set of application-specific training or cross-validation data [Juang and Katagiri 1992]. Each training utterance is first recognized and then used for both positive and negative learning by adjusting the model parameters of all competing classes in a systematic manner. For HMM-based recognizers, a family of generalized probabilistic descent (GPD) algorithms has been successfully applied to estimate model parameters based on the minimum classification error criterion (e.g. [Katagiri, Lee and Juang 1991, Chou, Juang and Lee 1992, Juang and Katagiri 1992, Su and Lee 1994]). The MCE/GPD approaches are also capable of maximizing the separation between models of speech units so that both discrimination and robustness of a recognizer can be simultaneously improved.
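
For comparison, the four training criteria listed above can be written side by side in one notation. This is a common textbook formulation and not necessarily the notation of the survey; all symbols are assumptions, introduced in the comments of the block.

```latex
% Assumed notation (not taken from the survey):
%   X_r        acoustic observations of training utterance r
%   W_r        reference transcription of utterance r
%   \lambda    HMM parameters
%   P(W)       prior probability of transcription W
%   g(\lambda) prior density over the parameters (MAP only)
%   d_r        misclassification measure (MCE only): positive when the
%              competing classes outscore the correct class on utterance r
%   \gamma > 0 slope of the smoothed error count (MCE only)
\begin{align*}
\hat{\lambda}_{\mathrm{ML}}  &= \arg\max_{\lambda} \sum_{r} \log p_{\lambda}(X_r \mid W_r) \\
\hat{\lambda}_{\mathrm{MMI}} &= \arg\max_{\lambda} \sum_{r} \log
    \frac{p_{\lambda}(X_r \mid W_r)\, P(W_r)}{\sum_{W} p_{\lambda}(X_r \mid W)\, P(W)} \\
\hat{\lambda}_{\mathrm{MAP}} &= \arg\max_{\lambda} \Big[ \log g(\lambda)
    + \sum_{r} \log p_{\lambda}(X_r \mid W_r) \Big] \\
\hat{\lambda}_{\mathrm{MCE}} &= \arg\min_{\lambda} \sum_{r} \ell\big(d_r(\lambda)\big),
    \qquad \ell(d) = \frac{1}{1 + e^{-\gamma d}}
\end{align*}
```

Only the MMI and MCE objectives contain terms for competing transcriptions, which is why they are usually called discriminative criteria.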
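
The difference between the ML and MMI criteria can be illustrated with a toy calculation. The sketch below assumes that per-class log-likelihoods for one utterance are already available; the function names and the "yes"/"no" labels are hypothetical and do not come from the survey.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def ml_objective(log_likes, correct):
    """ML looks only at the log-likelihood of the correct class."""
    return log_likes[correct]


def mmi_objective(log_likes, log_priors, correct):
    """MMI: log posterior of the correct class.

    The denominator sums over every class, so the scores of the
    competing classes influence the objective.
    """
    joint = {c: log_likes[c] + log_priors[c] for c in log_likes}
    return joint[correct] - log_sum_exp(list(joint.values()))


# Toy example with two classes and uniform priors.
priors = {"yes": math.log(0.5), "no": math.log(0.5)}
before = {"yes": -5.0, "no": -5.0}
after = {"yes": -5.0, "no": -4.0}  # only the competitor's score changed

ml_before, ml_after = ml_objective(before, "yes"), ml_objective(after, "yes")
mmi_before = mmi_objective(before, priors, "yes")
mmi_after = mmi_objective(after, priors, "yes")
```

Raising the competitor's likelihood leaves the ML objective for "yes" unchanged but lowers the MMI objective, which is the sense in which MMI uses information from other classes.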
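
The way MAP estimation combines seed models with adaptation data can be sketched for the simplest case: the mean of a single Gaussian with a conjugate prior centred on the seed mean. This is a simplified illustration, not the full HMM procedure of the cited papers. It assumes the adaptation frames have already been assigned to this Gaussian, and the prior weight `tau` is a hypothetical tuning parameter.

```python
def map_adapt_mean(prior_mean, frames, tau=10.0):
    """MAP estimate of a Gaussian mean under a conjugate Gaussian prior.

    prior_mean: seed mean (e.g. from a speaker-independent model), length D
    frames:     adaptation vectors assigned to this Gaussian, each of length D
    tau:        assumed prior weight; larger values trust the seed model more
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    n = len(frames)
    dim = len(prior_mean)
    # Per-dimension sum of the adaptation frames.
    sums = [sum(frame[d] for frame in frames) for d in range(dim)]
    # Interpolate between the seed mean and the adaptation data.
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]


seed = [0.0, 0.0]
data = [[2.0, 4.0], [4.0, 8.0]]
adapted = map_adapt_mean(seed, data, tau=2.0)
unchanged = map_adapt_mean(seed, [], tau=2.0)
```

With no adaptation data the estimate stays at the seed mean, and as the number of frames grows it approaches the sample mean of the adaptation data.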
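
The MCE criterion replaces the non-differentiable error count with a smooth loss that gradient-based methods such as GPD can minimize. The sketch below computes one commonly used form of that loss for a single utterance: a misclassification measure followed by a sigmoid. The function name and the parameters `eta` and `gamma` are assumptions for illustration, and the parameter update step itself is not shown.

```python
import math


def mce_loss(scores, correct, eta=1.0, gamma=1.0):
    """Smoothed 0/1 loss for one utterance.

    scores:  dict mapping class label to a discriminant score
             (e.g. a log-likelihood); higher means a better match
    correct: label of the reference class
    eta:     assumed smoothing constant (> 0); large values approach
             the single best competitor
    gamma:   assumed sigmoid slope (> 0)
    """
    competitors = [s for label, s in scores.items() if label != correct]
    if not competitors:
        raise ValueError("at least one competing class is required")
    # Smoothed maximum of the competitor scores, computed stably.
    m = max(competitors)
    mean_exp = sum(math.exp(eta * (s - m)) for s in competitors) / len(competitors)
    anti_score = m + math.log(mean_exp) / eta
    # Misclassification measure: positive when competitors win.
    d = anti_score - scores[correct]
    # Numerically stable sigmoid of gamma * d.
    z = gamma * d
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


scores = {"a": -10.0, "b": -12.0}
loss_when_a_is_correct = mce_loss(scores, "a")  # recognizer is right
loss_when_b_is_correct = mce_loss(scores, "b")  # recognizer is wrong
```

The loss is close to 0 for a confidently correct decision and close to 1 for a confident error, so its sum over the training set approximates the error count while remaining differentiable in the scores.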