Enhancing Single Speaker Recognition Using Deep Belief Network

Authors

  • Murman Dwi Prasetio, Hiroshima University
  • Tomohiro Hayashida, Graduate School of Engineering, Hiroshima University, Hiroshima, JAPAN
  • Ichiro Nishizaki, Graduate School of Engineering, Hiroshima University, Hiroshima, JAPAN
  • Shinya Sekizaki, Graduate School of Engineering, Hiroshima University, Hiroshima, JAPAN

DOI:

https://doi.org/10.14738/tmlai.64.4850

Keywords:

Deep Belief Network, Evolutionary Computation, Speech Recognition, Speech Signal, Random Subspace.

Abstract

Speech recognition is a complex phenomenon to study, chiefly because of the complexity of human language. Many of the barriers in speech recognition research can now be addressed by applying machine learning methods directly to the speech signal. In particular, Deep Belief Networks (DBNs) are able to discover representations of the speech signal automatically.
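
To make the building block concrete, the sketch below trains a single Restricted Boltzmann Machine (RBM) layer with one step of contrastive divergence (CD-1), the standard unsupervised procedure that a DBN stacks layer by layer to learn such representations. This is a minimal NumPy illustration, not the authors' implementation; the feature dimensionality, batch data, and hyperparameters are placeholder assumptions.

    # Minimal illustrative sketch (not the paper's code): one RBM layer
    # trained with CD-1, the unit a DBN stacks to learn speech representations.
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    n_visible, n_hidden, lr = 39, 64, 0.05       # e.g. 39 MFCC-style features (assumed)
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v = np.zeros(n_visible)                    # visible bias
    b_h = np.zeros(n_hidden)                     # hidden bias

    def cd1_step(v0):
        """One contrastive-divergence (CD-1) update on a batch v0."""
        global W, b_v, b_h
        h0 = sigmoid(v0 @ W + b_h)                     # up-pass: hidden probabilities
        h_s = (rng.random(h0.shape) < h0) * 1.0        # sample binary hidden states
        v1 = sigmoid(h_s @ W.T + b_v)                  # down-pass: reconstruction
        h1 = sigmoid(v1 @ W + b_h)                     # second up-pass
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)    # positive minus negative phase
        b_v += lr * (v0 - v1).mean(axis=0)
        b_h += lr * (h0 - h1).mean(axis=0)
        return float(np.mean((v0 - v1) ** 2))          # reconstruction error

    batch = rng.random((32, n_visible))                # stand-in for speech features
    for epoch in range(10):
        err = cd1_step(batch)
    print("final reconstruction error:", err)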

This paper approaches structure optimization of a DBN based on a combined evolutionary computation technique to enhance single speaker recognition. Features are first extracted from the speech signal and then used to construct a large number of random subspaces. The experimental results indicate that the DBN structure optimized by evolutionary computation yields an improvement in speech recognition.
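
The pipeline the abstract describes can be pictured with the hypothetical sketch below: random subspaces are drawn from the extracted features, and a simple evolutionary loop searches over DBN hidden-layer sizes. The fitness function evaluate_dbn is a dummy stand-in (the paper evaluates recognition performance of the trained DBN); all sizes and ranges here are illustrative assumptions.

    # Hypothetical sketch of the two steps in the abstract: (1) random
    # subspaces drawn from extracted features, (2) an evolutionary search
    # over DBN hidden-layer sizes. evaluate_dbn is a dummy fitness.
    import numpy as np

    rng = np.random.default_rng(1)
    features = rng.random((200, 39))         # placeholder for extracted features

    def random_subspaces(X, n_spaces=10, dim=20):
        """Each subspace keeps a random subset of the feature dimensions."""
        return [X[:, rng.choice(X.shape[1], size=dim, replace=False)]
                for _ in range(n_spaces)]

    subspaces = random_subspaces(features)
    print("subspace shapes:", [s.shape for s in subspaces])

    def evaluate_dbn(layer_sizes):
        # Dummy fitness; in practice this would be recognition accuracy of a
        # DBN trained on the subspaces. This stand-in just prefers ~150 units.
        return -abs(int(sum(layer_sizes)) - 150)

    # Toy (mu + lambda)-style loop over two-hidden-layer DBN structures.
    population = [list(rng.integers(16, 129, size=2)) for _ in range(6)]
    for generation in range(20):
        population.sort(key=evaluate_dbn, reverse=True)
        parents = population[:3]                        # survivors (mu = 3)
        children = [[max(8, int(s) + int(rng.integers(-16, 17))) for s in p]
                    for p in parents]                   # mutated offspring
        population = parents + children

    print("best layer sizes:", max(population, key=evaluate_dbn))

Selection keeps the fittest structures and mutation perturbs their layer sizes; in the paper's setting the dummy fitness would be replaced by the recognition accuracy of a DBN trained on the random subspaces.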

Author Biography

Murman Dwi Prasetio, Hiroshima University

Hiroshima University, majoring in System Cybernetics, Electrical and Electronics Engineering.

Published

2018-09-07

How to Cite

Prasetio, M. D., Hayashida, T., Nishizaki, I., & Sekizaki, S. (2018). Enhancing Single Speaker Recognition Using Deep Belief Network. Transactions on Engineering and Computing Sciences, 6(4), 01. https://doi.org/10.14738/tmlai.64.4850