# Unified Acoustic Modeling using Deep Conditional Random Fields

### Abstract

Acoustic models based on Deep Neural Networks (DNNs) have led to significant improvements in recognition accuracy. In these methods, Hidden Markov Model (HMM) state scores are computed using flexible discriminative DNNs. Conditional Random Fields (CRFs), on the other hand, are undirected graphical models that retain the Markov properties of HMMs and are formulated using the maximum entropy (MaxEnt) principle. CRFs have limited ability to model spectral phenomena because they have a single quadratic activation function per state. It is both possible and natural to use DNNs to compute the state scores in CRFs. The resulting acoustic models are known as Deep Conditional Random Fields (DCRFs). In this work, a variant of DCRFs is presented and its connections with hybrid DNN/HMM systems are established. Under certain assumptions, DCRFs and hybrid DNN/HMM systems can yield exactly the same results for a phone recognition task. In addition, linear activation functions are used in the DCRF output layer. Consequently, DCRFs and traditional DNN/HMM systems have the same decoding speed.
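As an illustration only, and not the authors' derivation, the following minimal Python sketch shows one way linear output-layer scores and softmax-normalized scores can lead to the same decoded state sequence. The softmax normalizer is constant across states within a frame, so it shifts every path score by the same amount. The score values, transition probabilities, and the `viterbi` and `log_softmax` helpers are hypothetical, and the sketch assumes uniform state priors, so the prior division used in hybrid systems is omitted.

```python
import math

def viterbi(state_scores, log_trans, log_init):
    """Return the best state path for per-frame state scores (log domain)."""
    n_states = len(log_init)
    # delta[s]: best score of any path ending in state s at the current frame
    delta = [log_init[s] + state_scores[0][s] for s in range(n_states)]
    backptrs = []
    for frame in state_scores[1:]:
        new_delta, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: delta[p] + log_trans[p][s])
            ptr.append(best_prev)
            new_delta.append(delta[best_prev] + log_trans[best_prev][s] + frame[s])
        delta = new_delta
        backptrs.append(ptr)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    return path[::-1]

def log_softmax(row):
    """Numerically stable log-softmax of one frame of activations."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

# Hypothetical linear output-layer activations: 4 frames, 3 states
linear_scores = [
    [2.0, 0.5, -1.0],
    [1.5, 1.8, -0.5],
    [-0.2, 2.2, 0.9],
    [-1.0, 0.3, 2.5],
]
# Hypothetical transition and initial probabilities, converted to logs
log_trans = [[math.log(p) for p in row] for row in [
    [0.6, 0.3, 0.1],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
]]
log_init = [math.log(p) for p in [0.8, 0.1, 0.1]]

# Decode once with raw linear scores and once with log-softmax scores
path_linear = viterbi(linear_scores, log_trans, log_init)
path_softmax = viterbi([log_softmax(r) for r in linear_scores],
                       log_trans, log_init)
print(path_linear == path_softmax)
```

Both calls run the same Viterbi recursion over the same number of states and frames, so the decoding cost is identical; only the per-frame score computation differs.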
