Convolutive Bottleneck Network with Dropout for Dysarthric Speech Recognition

Authors

  • Toru Nakashika, Kobe University
  • Toshiya Yoshioka, Kobe University
  • Tetsuya Takiguchi, Kobe University
  • Yasuo Ariki, Kobe University
  • Stefan Duffner, Université de Lyon, CNRS, INSA-Lyon, LIRIS, UMR 5205
  • Christophe Garcia, Université de Lyon, CNRS, INSA-Lyon, LIRIS, UMR 5205

DOI:

https://doi.org/10.14738/tmlai.22.150

Keywords:

Articulation disorders, Feature extraction, Convolutional neural network, Bottleneck feature, Dropout, Dysarthric speech

Abstract

In this paper, we investigate the recognition of speech produced by a person with an articulation disorder resulting from athetoid cerebral palsy. The articulation of the first spoken words tends to become unstable due to strain on the speech muscles, which degrades speech recognition performance. We therefore propose a robust feature extraction method that uses a convolutive bottleneck network (CBN) in place of the well-known MFCCs. The CBN stacks several types of layers, such as convolution layers, subsampling layers, and a bottleneck layer, to form a deep network. By applying the CBN to feature extraction for dysarthric speech, we expect it to reduce the influence of the unstable speaking style caused by the athetoid symptoms. Furthermore, we adopt dropout in the output layer, since the labels automatically assigned to dysarthric speech are usually unreliable due to the ambiguous phonemes uttered by a person with a speech disorder. We confirmed the method's effectiveness through word-recognition experiments, in which the CBN-based feature extraction method outperformed the conventional feature extraction method.
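To make the layer stack described in the abstract concrete, the following is a minimal NumPy sketch of a forward pass through a convolutive bottleneck network: a convolution layer over a time-frequency input patch, a subsampling (max-pooling) layer, a low-dimensional bottleneck layer whose activations serve as the extracted feature, and an output layer with dropout applied during training. All layer sizes, the tanh nonlinearity, and the input-patch dimensions are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(x, kernels):
    """Convolution layer: valid 2-D convolution of one input map with K kernels."""
    K, kh, kw = kernels.shape
    H, W = x.shape
    out = np.empty((K, H - kh + 1, W - kw + 1))
    for k in range(K):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[k, i, j] = np.sum(x[i:i + kh, j:j + kw] * kernels[k])
    return out

def max_pool(maps, p=2):
    """Subsampling layer: non-overlapping p x p max pooling on each feature map."""
    K, H, W = maps.shape
    H2, W2 = H // p, W // p
    m = maps[:, :H2 * p, :W2 * p].reshape(K, H2, p, W2, p)
    return m.max(axis=(2, 4))

def forward(x, params, train=True, drop_p=0.5):
    """One pass: conv -> pool -> bottleneck feature -> output with dropout."""
    c = np.tanh(conv2d_valid(x, params["conv"]))
    s = max_pool(c)
    h = s.reshape(-1)
    b = np.tanh(params["W_bn"] @ h)   # low-dimensional bottleneck feature
    o = params["W_out"] @ b           # output (phoneme-label) activations
    if train:                         # dropout on the output layer (inverted scaling)
        mask = rng.random(o.shape) >= drop_p
        o = o * mask / (1.0 - drop_p)
    return b, o

# Illustrative input: a 13 x 11 time-frequency patch (e.g. mel bins x frames).
x = rng.standard_normal((13, 11))
params = {
    "conv":  rng.standard_normal((4, 3, 3)) * 0.1,       # 4 feature maps, 3x3 kernels
    "W_bn":  rng.standard_normal((8, 4 * 5 * 4)) * 0.1,  # 8-dim bottleneck
    "W_out": rng.standard_normal((20, 8)) * 0.1,         # 20 output label units
}
bottleneck, logits = forward(x, params)
```

After training, the 8-dimensional `bottleneck` vector (rather than the output layer) would be used as the acoustic feature fed to the recognizer; the dropout mask is applied only while training, to keep the unreliable labels from being memorized.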

References

1. S. Cox and S. Dasmahapatra, “High-Level Approaches to Confidence Estimation in Speech Recognition,” IEEE Trans. on Speech and Audio Processing, vol. 10, no. 7, pp. 460–471, 2002.

2. M. K. Bashar, T. Matsumoto, Y. Takeuchi, H. Kudo, and N. Ohnishi, “Unsupervised Texture Segmentation via Wavelet-based Locally Orderless Images (WLOIs) and SOM,” 6th IASTED International Conference on Computer Graphics and Imaging, 2003.

3. T. Ohsuga, Y. Horiuchi, and A. Ichikawa, “Estimating Syntactic Structure from Prosody in Japanese Speech,” IEICE Trans. on Information and Systems, vol. 86, no. 3, pp. 558–564, 2003.

4. K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, “Speaking Aid System for Total Laryngectomees Using Voice Conversion of Body Transmitted Artificial Speech,” Interspeech 2006, pp. 1395–1398, 2006.

5. D. Giuliani and M. Gerosa, “Investigating recognition of children’s speech,” ICASSP 2003, pp. 137–140, 2003.

6. S. T. Canale and W. C. Campbell, “Campbell’s Operative Orthopaedics,” Mosby-Year Book, 2002.

7. H. Matsumasa, T. Takiguchi, Y. Ariki, I. Li, and T. Nakabayashi, “Integration of Metamodel and Acoustic Model for Speech Recognition,” Interspeech 2008, pp. 2234–2237, 2008.

8. Y. LeCun et al., “Gradient-based learning applied to document recognition,” Proc. IEEE, pp. 2278–2324, 1998.

9. H. Lee et al., “Unsupervised feature learning for audio classification using convolutional deep belief networks,” Advances in Neural Information Processing Systems 22, pp. 1096–1104, 2009.

10. C. Garcia and M. Delakis, “Convolutional Face Finder: A Neural Architecture for Fast and Robust Face Detection,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 2004.

11. M. Delakis and C. Garcia, “Text detection with Convolutional Neural Networks,” Proc. of the Int. Conf. on Computer Vision Theory and Applications, 2008.

12. R. Hadsell et al., “Learning long-range vision for autonomous off-road driving,” Journal of Field Robotics, 2009.

13. G. Montavon, “Deep learning for spoken language identification,” NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, 2009.

14. T. Nakashika, C. Garcia, and T. Takiguchi, “Local-feature-map Integration Using Convolutional Neural Networks for Music Genre Classification,” Interspeech 2012, 2012.

15. K. Vesely et al., “Convolutive bottleneck network features for LVCSR,” ASRU, pp. 42–47, 2011.

16. G. E. Hinton et al., “Improving neural networks by preventing co-adaptation of feature detectors,” CoRR, abs/1207.0580, 2012.

17. C. Plahl et al., “Hierarchical bottle neck features for LVCSR,” Interspeech 2010, pp. 1197–1200, 2010.

18. A. Kurematsu et al., “ATR Japanese speech database as a tool of speech recognition and synthesis,” Speech Communication, no. 4, pp. 357–363, 1990.

19. X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.

Published

2014-04-11

How to Cite

Nakashika, T., Yoshioka, T., Takiguchi, T., Ariki, Y., Duffner, S., & Garcia, C. (2014). Convolutive Bottleneck Network with Dropout for Dysarthric Speech Recognition. Transactions on Engineering and Computing Sciences, 2(2), 48–62. https://doi.org/10.14738/tmlai.22.150