European Journal of Applied Sciences – Vol. 10, No. 1
Publication Date: February 25, 2022
DOI:10.14738/aivp.101.11602. Kouassi, F. A., Konan, H. K., Coulibaly, M., & Asseu, O. (2022). Speech Recognition of Isolated Words in the “Baoulé” Language Via
Back-Propagation Neural Networks (BNN). European Journal of Applied Sciences, 10(1). 103-119.
Services for Science and Education – United Kingdom
Speech Recognition of Isolated Words in the “Baoulé” Language
Via Back-Propagation Neural Networks (BNN)
Francis Adlès KOUASSI
Ecole Supérieure Africaine des TIC, LASTIC, Abidjan, Côte d’Ivoire
Hyacinthe Kouassi KONAN
Ecole Supérieure Africaine des TIC, LASTIC, Abidjan, Côte d’Ivoire
Mamadou COULIBALY
Ecole Supérieure Africaine des TIC, LASTIC, Abidjan, Côte d’Ivoire
Olivier ASSEU
Ecole Supérieure Africaine des TIC, LASTIC, Abidjan, Côte d’Ivoire
ABSTRACT
The main objective of this research is to explore how a back-propagation neural
network (BNN) can be applied to the speech recognition of isolated words. The
simulation results show that a BNN offers an efficient approach for small-vocabulary
systems. The recognition rate reaches 100% for a 5-word system and 94% for a 10-word
system. The general techniques developed in this article can be extended to
other applications, such as the recognition of sonar targets and the classification of
underwater acoustic signals.
Keywords: speech recognition, back-propagation neural network, LPC
INTRODUCTION
The demand for real-time processing in a vast majority of applications (e.g., signal
processing and weapon systems) has led researchers to seek new approaches to address the
bottleneck of conventional serial processing.
Artificial neural networks, based on human-like perceptual characteristics, are one such
approach that is currently gaining much attention. Owing to significant advances in computing
and new technologies, neural networks, once considered a dead research area, are
re-emerging after a long dormant period.
The main objective of our research in this article is to explore how neural networks can be used
to recognize single word speech as an alternative to traditional methodologies.
The main benefit of this study would be its contribution to the understanding of neural network
based techniques to solve the common but difficult problem of pattern recognition, especially
in automatic speech recognition (ASR). This technique can be extended to various other
important practical applications. The article is organized as follows:
Section 2 presents neural networks. The reasons for using neural networks in speech
recognition are briefly discussed, and the structure and characteristics of a back-propagation
neural network (BNN) are described.
Section 3 reviews the basic principles of speech recognition and describes techniques for
estimating the parameters used as characteristics of the speech signal in the recognition step.
Section 4 presents the results of a small vocabulary system.
Section 5 summarizes the important findings, examines the limitations of the proposed system,
and offers ideas for future research.
NEURAL NETWORKS
Why neural networks?
Recently, studies have revealed the potential of neural networks as useful tools for various
applications requiring complex classification. Several neural network applications have been
shown to be successful or partially successful in the fields of character recognition, image
compression, medical diagnostics, and financial and economic forecasting [1].
The power of parallel processing in neural networks and their ability to classify data based on
selected characteristics provide promising tools for pattern recognition in general and speech
recognition in particular.
Traditional sequential processing technology has serious limitations for the implementation of
pattern recognition systems. The classical approach to pattern recognition is often expensive
and inflexible, and requires data-intensive processing [2].
In contrast, neural networks accomplish the processing task by training rather than
programming, in a manner analogous to how the human brain learns. Unlike traditional von
Neumann sequential machines, where the formulas and rules must be specified explicitly, a
neural network can derive its functionality by learning from the examples presented.
Architecture
Neural networks are made up of a large number of neurons or processing elements, PE for
short. Each PE has a number of inputs with associated connection weights, as shown in Figure
1. These weighted inputs are summed and then mapped through a nonlinear threshold function.
The threshold function, also called the activation function or the transfer function, is
continuous, differentiable, and monotonically increasing [1]. Two widely used threshold
functions are the sigmoid
f(x) = \frac{1}{1 + e^{-x}} \qquad (2.1)

and the hyperbolic tangent

f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (2.2)
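As a rough illustration, the two threshold functions and a single PE's weighted-sum-plus-activation step can be sketched in Python (rather than the Matlab used later in this article); the inputs and weights below are arbitrary illustrative values:

```python
import math

def sigmoid(x):
    # Eq. (2.1): squashes any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh_act(x):
    # Eq. (2.2): squashes any real input into (-1, 1)
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def pe_output(inputs, weights, activation=sigmoid):
    # One processing element: weighted sum mapped through the threshold function
    s = sum(xi * wi for xi, wi in zip(inputs, weights))
    return activation(s)

y = pe_output([1.0, 0.5], [0.4, -0.2])
```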
The processing elements are layered in a predefined manner. Each PE processes the inputs it
receives through its weighted input connections and provides a continuous value to the other
PEs through its outgoing connection. The basic neuron model shown in Figure 1 can be easily
realized in hardware. An equivalent electrical analog model could be constructed using a
nonlinear feedback operational amplifier to represent the summation operator and the
activation function while the connection weights could be achieved by the input resistors to the
operational amplifiers [3].
Figure 1. An artificial neuron
Back propagation neural networks
Currently, the most popular and commonly used neural network is the back-propagation neural
network (BNN). A typical back-propagation neural network is a multi-layered, feed-forward
network with an input layer, an output layer, and at least one hidden layer. Each layer is fully
connected to the next layer; however, there are no interconnections or feedback between PEs in
the same layer.
Each PE has a bias input with an associated non-zero weight. The bias input is analogous to
connecting to ground in an electrical circuit in that it provides a constant input value. Figure 2
shows a typical multi-layered backpropagation network with a single hidden layer.
Figure 2. A back propagation neural network
In a neural network composed of N processing elements, the input/output function is defined
as [4]:

O = F(A, W) \qquad (2.3)

where A = \{a_i\} is the input vector of the network, O = \{o_j\} is the output vector of the
network, and W is the weight matrix. The latter is defined as
W = \left[ \mathbf{w}_1^{T}, \mathbf{w}_2^{T}, \ldots, \mathbf{w}_N^{T} \right]^{T} \qquad (2.4)

where the vectors \mathbf{w}_1^{T}, \mathbf{w}_2^{T}, \ldots, \mathbf{w}_N^{T} are the individual PE weight vectors, which are given by

\mathbf{w}_i^{T} = (w_{i1}, w_{i2}, \ldots, w_{iN}); \qquad i = 1, 2, \ldots, N \qquad (2.5)
These connection weights adapt, or change, in response to the example inputs and desired
outputs according to a learning law. Most learning laws are formulated to guide the weight
vector W toward a location that gives the desired network performance. There are many
learning laws depending on the type of application: the delta rule (LMS algorithm), competitive
learning, and adaptive resonance theory, to name a few [4]. The back-propagation neural
network implements the delta rule to update the connection weights. The delta rule is a
modification of the least mean squares (LMS) algorithm in which the overall error function E
between the current output, o_j, and the desired output, d_j, given by

E = \frac{1}{2} \sum_{j} \left( d_j - o_j \right)^{2} \qquad (2.6)

is minimized. The resulting weight-update equation can be represented as [1]
w_{ji,\text{new}}^{[s]} = w_{ji,\text{old}}^{[s]} + \mu\, \delta_j^{[s]} x_i^{[s]} + \nu \left( w_{ji,\text{new}}^{[s]} - w_{ji,\text{old}}^{[s]} \right)_{\text{prev}} \qquad (2.7)
where μ is the learning rate, ν is the momentum term, and s is the layer number. The learning
rate μ is equivalent to the step-size parameter in the LMS algorithm. Selecting an appropriate
value for μ is usually done by trial and error. Too high a learning rate often results in divergence
of the learning algorithm, while too small a value results in a slow learning process. The value
of μ must satisfy 0 < μ < 1 [4]. The momentum term ν has been added in equation (2.7) to act
as a low-pass filter on the delta-weight term; it is used to help "smooth out" changes in the
weights. The value of ν is generally between zero and one [1]. It has been found that using a
momentum term significantly speeds up the learning process. It allows a larger value for μ
while avoiding divergence of the algorithm.
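Equation (2.7) amounts to a few lines of code. In this Python sketch, μ and ν take the values used later in the experiments (0.5 and 0.2); the error terms fed to the update are arbitrary illustrative numbers:

```python
def update_weight(w_old, delta, x, prev_change, mu=0.5, nu=0.2):
    # Delta-rule step with momentum, Eq. (2.7): the new change is the
    # gradient term plus a fraction nu of the previous iteration's change.
    change = mu * delta * x + nu * prev_change
    return w_old + change, change

w, prev = 0.1, 0.0
for delta in (0.3, 0.25, 0.2):           # illustrative error terms
    w, prev = update_weight(w, delta, 1.0, prev)
```

The momentum term low-pass filters the sequence of weight changes: each step inherits a fraction of the previous step, which smooths oscillations and permits a larger μ.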
The main mechanism of the backpropagation neural network is to propagate the input through
all the layers to the output layer. At the output layer, errors are determined and the associated
weights, ���, are updated by equation (2.7). These errors are then propagated through the
network from the output layer to the previous layer (hence the name backpropagation). The
error back propagated to PE j in layer s is given by [1]
e_j^{[s]} = x_j^{[s]} \left( 1.0 - x_j^{[s]} \right) \sum_{k} \left( e_k^{[s+1]} \, w_{kj}^{[s+1]} \right) \qquad (2.8)
This process continues until the input layer is reached. In summary, artificial neural networks,
which use highly parallel architectures with distributed processing among PEs, overcome the
limitation of the bottleneck of conventional sequential processing. The backpropagation
algorithm described in this section provides a simple mechanism for implementing a nonlinear
mapping between input and output.
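To make the mechanism concrete, here is a minimal, self-contained Python sketch of a one-hidden-layer BNN trained with the rules above (sigmoid activations, no momentum term). The layer sizes and the two training patterns are toy choices for illustration, not the recognizer described later:

```python
import math, random

random.seed(0)

def sig(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(x, d, W1, W2, mu=0.5):
    # Forward pass: propagate the input through hidden and output layers
    h = [sig(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    o = [sig(sum(w * hj for w, hj in zip(row, h))) for row in W2]
    # Output-layer error terms (gradient of Eq. (2.6) through the sigmoid)
    e_out = [ok * (1 - ok) * (dk - ok) for ok, dk in zip(o, d)]
    # Hidden-layer errors back-propagated as in Eq. (2.8)
    e_hid = [h[j] * (1 - h[j]) * sum(e_out[k] * W2[k][j] for k in range(len(W2)))
             for j in range(len(h))]
    # Delta-rule weight updates (Eq. (2.7) without the momentum term)
    for k in range(len(W2)):
        for j in range(len(h)):
            W2[k][j] += mu * e_out[k] * h[j]
    for j in range(len(W1)):
        for i in range(len(x)):
            W1[j][i] += mu * e_hid[j] * x[i]
    return 0.5 * sum((dk - ok) ** 2 for ok, dk in zip(o, d))  # Eq. (2.6)

# Toy task: map two input patterns to opposite one-hot outputs
W1 = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(4)]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]
data = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
for _ in range(2000):
    for x, d in data:
        err = train_step(x, d, W1, W2)
```

After training, presenting the first pattern drives the first output unit high and the second low, exactly the behavior exploited by the word recognizer in Section 4.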
THE CONCEPTS OF SPEECH RECOGNITION
Existing difficulties in isolated-word speech recognition
Although the field of automatic speech recognition has been around for several decades, many
ASR difficulties still exist and need to be addressed [5]. Besides syntactic and semantic issues
in linguistic theories, segmentation of speech is of great concern. The boundaries between
words and phonemes are very difficult to locate, except when the speaker pauses.
Unlike written language, there is no separation between spoken words. Although word
boundaries can be estimated from a sudden and large change in the speech spectrum, this
method is not very reliable because of coarticulation, i.e., changes in the articulation and
acoustics of a phoneme due to its phonetic context.
The voice signal depends on the context, i.e. the word before and the word after affect how the
signal is formed.
The voice signal also depends on the speaking mode. Smoothness, volume, and stuttering also
affect the voice signal.
The environment, such as a quiet room, background noise, or interference in the same channel,
greatly affects the signal.
Input devices, such as microphone, amplifier and recorder, also modify the voice signal [6].
Because of the complexity of its production and perception, speech is a difficult signal to
recognize. For effective results, ASR requires an approach closer to human perception, and
traditional computing techniques are inefficient for such tasks.
Neural networks, which are modeled on the human brain, appear to provide a useful tool for
speech recognition.
Basics of speech recognition
Voice data is often redundant and too long to apply the entire signal to the recognition device.
A preprocessing step is often necessary to achieve good performance. This step consists of
determining the discriminating characteristics specific to the signal [2].
Most speech recognition tools only use those aspects that help in sound discrimination and
capture spectral information in a few parameters to identify spoken words or phonemes.
Typical speech-signal parameters used as characteristics are LPC coefficients, energy,
zero-crossing rate, and speech/non-speech classification.
Most ASR systems use the same techniques as those used in the traditional field of pattern
recognition. The procedure for implementing these systems includes several steps:
normalization, parameterization, feature extraction, similarity comparison and decision. Figure
3 shows a block diagram of the general voice recognition procedure [7].
The voice signal is fed into a pre-processor which performs amplitude normalization,
parameterization and generates a test model based on these characteristics. The test model is
then compared to a set of pre-recorded reference models. The reference model that most
closely matches the test model is determined based on predetermined decision rules.
Figure 3. Functional diagram of pattern recognition for ASR
Normalization
The amplitude normalization step attempts to eliminate variability in the input speech signal
due to the environment - variations in recording level, distance from microphone, original
speech intensity and signal loss in transmission.
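A minimal sketch of such amplitude normalization (simple peak scaling; the target level of 1.0 is an arbitrary choice, and the sample values are invented):

```python
def normalize_amplitude(samples, target_peak=1.0):
    # Scale the waveform so its peak magnitude is a fixed level, removing
    # variability due to recording level and microphone distance.
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    return [s * target_peak / peak for s in samples]

scaled = normalize_amplitude([0.1, -0.4, 0.2])
```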
Parameterization and feature extraction
The parameterization step consists of converting the voice signal into a set of statistical
parameters that maximize the probability of choosing the correct output characteristics
corresponding to the input signal. Since most useful acoustic speech detail lies below 4 kHz, a
bandwidth of 3 to 4 kHz is sufficient to represent the speech signal.

Experience has shown that using additional bandwidth actually degrades ASR performance [5].
In general, eight to fourteen LPC coefficients are considered sufficient to model the spectral
envelope of short-time speech [8]. Other parameterization techniques, such as cepstral
analysis, can also be used.

The test pattern is formed by an arrangement of features, such as LPC or cepstral coefficients,
short-time energy, short-time zero-crossing rate, and the V/U (voiced/unvoiced) flag.
Comparison and similarity decision
Even for a single speaker, the same utterances are spoken with different durations and at
different speaking rates when repeated; therefore, normalization of the reference utterance
along the time axis is necessary before the comparison step of Figure 3 [5]. Linear time
warping, achieved by adjusting the frame interval before parameterization or by
decimating/interpolating the feature sequence, can be used to partially overcome this
problem.

Linear time warping is easy to implement; however, it ignores the non-linearity of the change
in speaking rate. For example, the energy of long and short versions of the same word is not
distributed evenly over each phoneme but is often concentrated on a dominant vowel. Other
approaches add silence to the ends of shorter utterances, using the length of the longest
utterance as a reference.
The test model is then compared with the reference models based on Euclidean distance or LPC
distance measurement. The word corresponding to the reference template which gives the
minimum distance is announced as the spoken word. When ASR is performed by neural
networks, the comparison and decision steps are performed through learning. The neural
network is trained by presenting examples of the reference models. A learning algorithm based
on the delta rule described in Section 2 is used to minimize an overall error function between
the current output and the desired output.
Once the error or delta-weight term (w_new − w_old) is less than a predefined threshold, a final
set of weight values is obtained and used as the network characteristics. The unknown (or test)
signal is then passed through this network with fixed weight values; the word corresponding
to the highest value of the output vector is the recognized word. The recognition rate is then
calculated as the percentage of correctly identified outputs.
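The decision step and the recognition-rate computation reduce to an arg-max comparison; a short Python sketch (the vocabulary entries and output values below are made-up placeholders):

```python
def recognize(output_vector, vocabulary):
    # The word whose output unit has the highest value is announced
    best = max(range(len(output_vector)), key=lambda k: output_vector[k])
    return vocabulary[best]

def recognition_rate(outputs, labels, vocabulary):
    # Percentage of correctly identified outputs
    hits = sum(recognize(o, vocabulary) == lab for o, lab in zip(outputs, labels))
    return 100.0 * hits / len(labels)

vocab = ["word1", "word2", "word3"]      # placeholder vocabulary
outs = [[0.9, 0.1, 0.2], [0.2, 0.3, 0.8], [0.4, 0.6, 0.1]]
rate = recognition_rate(outs, ["word1", "word3", "word3"], vocab)
```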
SOFTWARE SIMULATION STUDIES AND RESULTS
Simulation
In this section, the back propagation neural network is applied to the speaker independent
voice data.
Data Corpus
The voice data set used in this article, both for training and testing, contains a total of 6,300
sentences spoken by 630 speakers from 8 different geographic regions.
The sentences are phonetically diverse to provide good coverage of phonetic utterances and to
add diversity to phonetic contexts.
Each sentence is broken down into four different representations: waveform, text, word and
phoneme.
The waveform is the voice data in the original digitized format. Text is the spelling transcription
of words. Word and phoneme representations use the transcription of time-aligned words and
phonemes, respectively. The boundaries were aligned with the phonetic segments using a
dynamic string alignment program.
Speaker independent system
In this research, three speech recognizers, with vocabularies of 3 words, 5 words, and 10
words, were developed using a back-propagation neural network. These recognizers were
trained using voice recordings of various speakers, resulting in a speaker-independent system.
Figure 4 shows the block diagram of a typical 10 word recognition device. The voice signal is
passed through an endpoint detection block to isolate the words. The input vector is then
formed by calculating the necessary characteristics and fed into the back propagation neural
network. The output vector is first normalized and then compared to the desired output vector,
e.g., (1,0,0,0,0,0,0,0,0,0) for the first word, (0,1,0,0,0,0,0,0,0,0) for the second word, and so on.
The recognition rate is calculated based on the percentage of correctly matched outputs.
Figure 4. A typical 10 word recognition tool
All pattern recognition tasks, including ASR, use two phases:
• Learning and
• Testing or recognition.
While there are general guidelines for the size of training and testing data, in practice one is
often limited by the data available. Larger datasets produce more reliable results with lower
variances in the performance estimates (error rates).
The training set used for the 10-word recognition module includes 100 words spoken by 10
male speakers.
There are two sets of tests:
• one consists of 100 words spoken by 10 speakers, and
• the second comprises 380 words spoken by 38 different speakers.
For the 5 and 3 word recognition module, the number of speakers remains the same, but the
total number of words for the training data is reduced to 50 and 30, respectively. The test data
consists of 190 words (5x38) for the 5-word recognition module and 114 words (3x38) for the
3-word recognition module.
Preprocessing and feature extraction
The original speech signals from the database are digitized at 16 kHz. Since most of the energy
for voiced sounds is concentrated below 4 kHz [3], the speech signals are low-pass filtered to a
bandwidth of 4 kHz and decimated to remove the redundancy. The input vector to the
back-propagation neural network consists of the linear predictive coding coefficients, a_i, the
error variance, the short-time energy, E_n, the short-time average zero-crossing rate, Z_n, and
the voiced/unvoiced discrimination, V/U.

Speech is a non-stationary, time-variant signal. To cope with this temporal variability,
short-time processing techniques are used. Short segments of the speech signal are isolated
and processed as if they were segments from a statistically stationary signal.
There are two methods of segmenting the speech signal. The first segments the speech signal
into fixed-length frames. The second divides the speech signal into a fixed number of frames of
variable length. The latter was used in this work to deal with the non-uniform word lengths in
the speech recordings. The digitized speech signal is segmented into 10 frames of 10 to 40
milliseconds each, depending on the length of a given word in the vocabulary. Each frame is
weighted by a Hamming window. A brief summary of the processing steps is given in Figure
5 [9].
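The variable-length framing step can be sketched as follows (10 frames as in the text; the 1600-sample sine input is an arbitrary stand-in for one digitized word):

```python
import math

def segment_into_frames(signal, n_frames=10):
    # Split a word of arbitrary length into a fixed number of frames
    # (variable-length framing), then weight each frame with a Hamming window.
    frame_len = len(signal) // n_frames
    frames = []
    for f in range(n_frames):
        frame = signal[f * frame_len:(f + 1) * frame_len]
        hamming = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
                   for n in range(frame_len)]
        frames.append([s * w for s, w in zip(frame, hamming)])
    return frames

frames = segment_into_frames([math.sin(0.01 * n) for n in range(1600)])
```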
The digitized speech signal sampled at 16 kHz is first pre-processed: it is passed through a
lowpass filter and decimated to 8 kHz; the signal is then segmented into 10 variable-length
frames. The feature vector is obtained by computing the LPC coefficients, the error variance,
the short-time energy, the zero-crossing rate, and the voiced/unvoiced classification for each
segment. The Matlab programs for computing these parameters of the feature vector are listed
in Appendix A.
Figure 5. Block diagram of preprocessing.
LPC coefficients and error variance
Linear predictive analysis is a technique for estimating the basic speech parameters, e.g.,
formants, spectrum, and vocal tract area function [3]. A unique set of parameters is determined
by minimizing the sum of the squared error between the actual and the predicted speech signal.
The LPC parameters provide a good representation of the gross features of the short time
spectrum. The poles of the transfer function indicate the major energy concentration in the
spectrum, and the error variance indicates the energy level of the excitation signal [5]. The
algorithm used to compute the LPC coefficients is the covariance method, which requires
solving a set of least-squares normal equations; the Matlab code for this is included in
Appendix A.
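For illustration, the covariance-method least-squares fit can be sketched in Python with NumPy (the authors' Matlab version appears in Appendix A; the sinusoidal test signal below is an arbitrary choice):

```python
import numpy as np

def lpc_covariance(x, p):
    # Covariance-method LPC: least-squares prediction of each sample
    # from its p predecessors; returns coefficients and error variance.
    x = np.asarray(x, dtype=float)
    N = len(x)
    X = np.array([[x[n - k] for k in range(1, p + 1)] for n in range(p, N)])
    y = x[p:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ a
    return a, float(np.mean(resid ** 2))

# A pure sinusoid obeys x[n] = 2*cos(w)*x[n-1] - x[n-2] exactly, so a
# 2nd-order fit recovers those coefficients with near-zero error variance.
a, var = lpc_covariance(np.sin(0.3 * np.arange(200)), p=2)
```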
Short-time energy function
The short-time energy function is used to reflect the amplitude variation between voiced and
unvoiced speech segments. This quantity is defined as
E_n = \sum_{m=-\infty}^{\infty} x^{2}(m)\, h(n-m) \qquad (4.1)

where h(n) = w^{2}(n), and w(n) is the Hamming window (a length of 50 is used in this
experiment).
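A direct Python transcription of Eq. (4.1) for one analysis position (the window length of 50 follows the text; the two-level test signal is illustrative):

```python
import math

WIN = 50
# h(n) = w^2(n): the squared Hamming window of Eq. (4.1)
h = [(0.54 - 0.46 * math.cos(2 * math.pi * n / (WIN - 1))) ** 2
     for n in range(WIN)]

def short_time_energy(x, n):
    # Energy of the WIN samples ending at position n, weighted by h(n - m)
    seg = x[max(0, n - WIN + 1):n + 1]
    return sum(s * s * hk for s, hk in zip(reversed(seg), h))

signal = [1.0] * 50 + [0.05] * 50   # loud segment followed by a quiet one
e_loud = short_time_energy(signal, 49)
e_quiet = short_time_energy(signal, 99)
```

The energy of the quiet segment is the square of the amplitude ratio times that of the loud one, which is exactly the contrast used below to separate voiced from unvoiced frames.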
Short-time average zero-crossing rate
The short-time average zero-crossing rate is a simple measure of frequency content of the
speech signal. This quantity is defined as
Z_n = \sum_{m=-\infty}^{\infty} \left| \mathrm{sgn}[x(m)] - \mathrm{sgn}[x(m-1)] \right| w(n-m) \qquad (4.2)

where

\mathrm{sgn}[x(n)] = \begin{cases} 1, & x(n) \geq 0 \\ -1, & x(n) < 0 \end{cases} \qquad (4.3)

and

w(n) = \begin{cases} \dfrac{1}{2N}, & 0 \leq n \leq N-1 \\ 0, & \text{otherwise} \end{cases} \qquad (4.4)
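Over one segment, Eqs. (4.2)–(4.4) reduce to counting sign changes; a short Python sketch (the two sinusoidal test signals are arbitrary):

```python
import math

def sgn(v):
    # Eq. (4.3)
    return 1 if v >= 0 else -1

def zero_crossing_rate(x, N=50):
    # Eqs. (4.2) and (4.4): sign changes over an N-sample window,
    # weighted by the rectangular window 1/(2N)
    return sum(abs(sgn(x[m]) - sgn(x[m - 1])) for m in range(1, N)) / (2.0 * N)

slow = [math.sin(0.05 * n) for n in range(100)]   # low-frequency content
fast = [math.sin(1.50 * n) for n in range(100)]   # high-frequency content
```

A higher-frequency signal crosses zero more often, which is why this simple measure reflects the frequency content of the segment.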
Voiced / unvoiced discrimination
There is a strong correlation between the energy distribution and the voiced/unvoiced
classification of speech. The speech signal corresponding to each word is divided into 10
smaller segments for classification as voiced or unvoiced speech. The differentiation between
voiced and unvoiced segments is accomplished by setting a threshold based on the short time
energy function. For energy values higher than 0.8 of the maximum level, the segment is treated
as voiced and assigned a value of 1. For values of energy lower than 0.2 of the maximum level,
the segment is classified as unvoiced and assigned a value of 0. Any mid-level is considered
intermediate and assigned a value of 0.5.
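The threshold rule just described can be written directly (the example energy values are invented):

```python
def voiced_unvoiced(energies):
    # Threshold rule from the text: > 0.8*max -> voiced (1),
    # < 0.2*max -> unvoiced (0), otherwise intermediate (0.5)
    peak = max(energies)
    labels = []
    for e in energies:
        if e > 0.8 * peak:
            labels.append(1.0)
        elif e < 0.2 * peak:
            labels.append(0.0)
        else:
            labels.append(0.5)
    return labels

labels = voiced_unvoiced([10.0, 9.0, 1.0, 5.0])
```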
Results and analysis
For consistency and ease of comparison for different cases, BNN parameters are kept constant
in all experiments reported here. The learning algorithm used is the delta rule. The transfer
function is hyperbolic tangent. The values of momentum and learning rate are 0.2 and 0.5,
respectively. The training cycles used are 5000 for 3- and 5-word networks, and 10000 for the
10-word network.
The feature vector
The performance of a network varies drastically depending on the features used in the training
and testing of the network. Too many features may decrease the efficiency of the network since
it takes too long to train the network, while too few features may degrade the network's
performance. Two experiments are conducted to emphasize the importance of the choice of the
features used in the preprocessing step. The first experiment only uses the LPC coefficients and
the error variance. The second one uses all features: LPC coefficients, error variance, short time
energy, short time average zero crossing rate, and voiced/unvoiced discrimination.
In the first experiment, a 4th-order LPC analysis is performed; for each frame, a set of four LPC
coefficients is generated along with the error variance. Thus the input vector to the recognizer
consists of a total of fifty parameters per word (five parameters per frame and ten frames per
word).
For the second experiment, the short-time energy, the short-time average zero-crossing rate,
and the voiced/unvoiced switch are also computed for each frame, making a total of eighty
parameters per word. The results are summarized in Table 1. For a small vocabulary set, the
recognition rates do not change much between the two experiments. With the 38-speaker
testing set, the recognition rate remains 100% for the 3-word recognizer in both experiments.
For the 5-word recognizer the rate decreases from 94.7% to 91.6%. For the 10-word
recognizer, the result not only deteriorates from 91.3% to 63.7%, but also the training time
increases drastically from 10000 to 80000 training cycles.
Table 1. Effect of choice of features on recognition rate

Training vector   Vocabulary size   Error     Recognition rate
50x30             3 words           0         100%
50x50             5 words           16/190    91.6%
50x100            10 words          138/380   63.7%
80x30             3 words           0         100%
80x50             5 words           10/190    94.7%
80x100            10 words          36/380    91.3%
LPC order
The order of the LPC coefficients plays an important role in the performance of the network.
Four experiments with different LPC orders were conducted on a 10-word-vocabulary
back-propagation network with the 10-speaker testing set. A full feature vector is used for all of the
following experiments.
Table 2. Effect of LPC order on recognition rate
LPC order Recognition rate
2 84%
4 94%
8 92%
12 86%
The recognition rates shown in Table 2 indicate that the fourth-order LPC system captures the
essential features of the speech data. The second-order system apparently does not have
enough parameters to differentiate between the words. However, as the LPC order increases,
the complexity of the system also increases and the information becomes redundant, making
the network more cumbersome for training and recognition. This can actually result in a
decreased recognition rate, as shown.
Vocabulary size
The size of vocabulary used in the training and testing set also affects the recognition rate of
the network. As the number of words used in the network increases, the performance decreases
rapidly. The results are summarized in Table 3. With a 3-word vocabulary and a 12th order
system, the recognition rate is 96.5% for the 380 word testing set. This rate decreases to 93.2%
for a 5-word system and deteriorates to 86.6% for a 10-word system.
Table 3. Effect of vocabulary size and LPC order on performance of BNN

Vocabulary size    3 words             5 words             10 words
Testing set size   100 wrds  380 wrds  100 wrds  380 wrds  100 wrds  380 wrds
2nd order LPC      100%      100%      96%       94.2%     84%       87%
4th order LPC      100%      100%      100%      94.7%     94%       91.3%
8th order LPC      100%      99.2%     96%       94.7%     92%       89%
12th order LPC     100%      96.5%     98%       93.2%     86%       86.6%
The number of PEs in hidden layers and the number of hidden layers
The structure of the network also affects the performance of the network. A back-propagation
neural network generally has one to two hidden layers. The performance of the network
increases as the number of PEs increases to an optimal number and then starts to decrease.
Here the optimal number of PEs in the hidden layer is determined experimentally; a hidden
layer with 24 PEs is found to perform well for the speech data. The single hidden-layer network
is found to give better results than the network with two hidden layers. The results are
summarized in Tables 4 and 5.
Table 4. Effect of hidden layers on the recognition rate

                                   Testing set size
                                   100 wrds   380 wrds
One hidden layer (80-24-10)        94%        91.3%
Two hidden layers (80-24-12-10)    90%        87.4%
Table 5. Effect of the number of PEs in the single hidden-layer network

PEs in hidden layer   Testing set size        Training cycles required
                      100 wrds   380 wrds
12                    91%        90%          10000
24                    94%        91.3%        10000
32                    89%        88.4%        15000
64                    89%        88%          25000
Learning rate and momentum
For a learning rate of 0.5, the effect of the momentum term was experimentally studied, and the
results are shown in Table 6. A momentum of 0.1 was found to give the best recognition rate
and the fastest training.
Table 6. Effect of momentum on recognition rate

Momentum   Testing set size        Training cycles required
           100 wrds   380 wrds
0.01       94%        90%          5000
0.1        98%        92.1%        5000
0.2        95%        90.5%        10000
0.3        92%        88.4%        15000
0.4        77%        75.8%        30000
Effect of embedded noise
One special characteristic of the neural network is its non-succeptability to noise. Its ability to
classify in a noisy environment, which is demonstrated here, makes it suitable for ASR. Noise
added to the speech signal only affects the performance of the network slightly. Uniform
random noise may be added to the weighted sum input signal prior applying the weight matrix.
The source and the amount of noise added is determined by the mode (learning or testing) and
the appropriate parameter in the "Temperature" row from the "learning schedule" [1]. Table 7
shows the results of a 10-word vocabulary using a 4th order LPC system with 20% added noise.
Table 7. Effect of embedded noise

                       Testing set size
                       100 wrds   380 wrds
No noise               94%        90.5%
With 20% added noise   91%        89.2%
In summary, the choice of the feature vector in the design of a BNN is very important. The
discriminating characteristics that are particular to the speech signal must be recognized to
achieve good results. Also, the information fed to the network should be adequate and optimal,
i.e., too little information will lead to a low recognition rate, and too much information will
result in extensive training time.
CONCLUSION
A back-propagation neural network, combined with speech signal processing techniques, is
used to develop a speech recognition system. Specifically, a BNN was used to design a 10-word
speech recognizer. Experiments were conducted on the recognizer. The main observations
from these experiments are summarized below:
• The BNN is an effective approach for small-vocabulary ASR. The recognition rate is 100% in
most cases for the 3- and 5-word vocabulary systems, and 94% for the 10-word system.
• The choice of feature vector plays an important role in the performance of the BNN. The
recognition rate may decrease drastically, or the system may not converge at all, if the
features are not correctly chosen. The feature vector chosen in the experiments, which
consisted of the LPC coefficients, short-time energy, zero-crossing rate and
voiced/unvoiced classification, worked well and provided good results for the systems
studied.
• The techniques developed in this research on isolated-word speech recognition can be
extended to other important practical applications, such as sonar target recognition,
missile seeking and tracking functions in modern weapon systems, and classification of
underwater acoustic signals. However, we cannot make predictions about the likely
performance of the methods in these areas until they are actually tested.
In this research on ASR, all experiments used male speakers from one specific geographical
region. A larger, more diverse group of speakers should be used for a more general case. In
addition, the 10-word recognizer is small for most real applications. Future research
should be directed toward larger vocabulary systems of, say, 50 words or more.
The emphasis of the research was to develop an isolated-word speech recognizer using a BNN.
The techniques and schemes used in training and testing the network to improve the
recognition rate, as well as to keep the system stable and the convergence rate high, worked
well in the experiments reported here. However, this set of techniques may apply only to this
particular case: there are still no firm rules for training such networks, and most rules of
thumb are experiment-dependent. More widely applicable learning and testing
schemes are needed to further improve the recognition rates as well as increase the vocabulary
size.
ANNEXES
Appendix A. Simulation programs
covarc.m Covariance method to solve the least-squares normal equation (X'X)a=(S,0)'
Usage: [A,S]=covarc(x,p,N); signal x of length N and order p
function [A,S]=covarc(x,p,N)
for i = 1:(N-p)
  for j = 1:(p+1)
    xp(i,j) = x(p+1+i-j);
  end
end
Rx = xp'*xp;
[m,n] = size(Rx);
Rxp = Rx(2:m,2:n);
A = inv(Rxp)*(-Rx(2:m,1));
S = Rx(1,1:n)*[1;A];
S = S/N;
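For readers without MATLAB, the covariance method can be sketched in plain Python. The Gaussian-elimination helper and the 0-based indexing are our additions, but the normal equations mirror covarc.m:

```python
def covarc(x, p):
    """Covariance-method LPC: solve (X'X)A = -(X'x0) for the AR
    coefficients of order p, mirroring covarc.m above."""
    N = len(x)
    # Lagged data matrix: row i holds x[p+i], x[p+i-1], ..., x[i]
    xp = [[x[p + i - j] for j in range(p + 1)] for i in range(N - p)]
    # Normal-equation matrix Rx = xp' * xp, size (p+1) x (p+1)
    Rx = [[sum(r[a] * r[b] for r in xp) for b in range(p + 1)]
          for a in range(p + 1)]
    # Solve Rx[1:,1:] * A = -Rx[1:,0] by Gauss-Jordan elimination
    M = [row[1:] + [-row[0]] for row in Rx[1:]]
    n = p
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[col][col] != 0:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    A = [M[i][n] / M[i][i] for i in range(n)]
    # Residual energy S = Rx[0,:] * [1; A] / N
    S = (Rx[0][0] + sum(Rx[0][j + 1] * A[j] for j in range(p))) / N
    return A, S
```

For a signal generated exactly by a known AR recursion, the recovered coefficients are the negated recursion weights and the residual energy is essentially zero, which is a convenient sanity check of the implementation.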
stef.m Short time energy function
Usage: y=stef(x);
x is the input speech vector and y the corresponding
short-time energy function En.
function xos=stef(xi)
[m,n]=size(xi);
if n==1,
  xi=xi'; % make sure input is a row vector
end;
N=input(' enter window size (50 to 300 samples) N? ');
% N=50 is used
wt=input(' choose window type 0. Rectangular 1. Hamming? ');
% wt=1; hamming window is used
if wt==0
  window=ones(1,N);advance=fix(N/2);wtstrng=' Rect. Win';
else
  window=hamming(N)';advance=fix(N/4);wtstrng=' Hamm. Win';
end
if length(xi)<N % case if length of input < length of window
  xi=[xi,zeros(1,N-length(xi))]; % zero pad
end
xsq=xi.^2; % squared signal (renamed from "input" to avoid shadowing the input function)
imp_response=window.^2;
xo(1)=sum(xsq(1:N).*imp_response);
for n=advance:advance:length(xi)-N,
  xo(1+n/advance)=sum(xsq(n:n+N-1).*imp_response);
end;
xos=sum(xo);
subplot(211)
clf; plot(xo);
xlabel('time');title(['N=',num2str(N),wtstrng]);
ylabel('En');grid;
stazcr.m Short time average zero crossing rate
Usage: y=stazcr(x)
x : input speech vector
y : short-time average zero-crossing rate Zn
function xos=stazcr(xi)
[m,n]=size(xi);
if n==1,
  xi=xi'; % make sure input is a row vector
end;
%N=input(' enter window size (50 to 300 samples) N? ');
N=50; % window length N=50
%wt=input(' choose window type 0. Rectangular 1. Hamming? ');
wt=1; % hamming window is used
if wt==0
  window=ones(1,N);advance=fix(N/2);wtstrng=' Rect. Win';
else
  window=hamming(N)';advance=fix(N/4);wtstrng=' Hamm. Win';
end
if length(xi)<N % case if length of input < length of window
  xi=[xi,zeros(1,N-length(xi))]; % zero pad sequence
end
zcin=abs(sign([xi(2:length(xi)),0])-sign(xi));
xo(1)=sum(zcin(1:N).*window);
for n=advance:advance:length(xi)-N,
  xo(1+n/advance)=sum(zcin(n:n+N-1).*window);
end;
xos=sum(xo);
subplot(211)
clf; plot(xo);
xlabel('time');title(['N=',num2str(N),wtstrng]);
ylabel('Zn');grid;
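These two short-time measures translate directly into plain Python. The framing details below (sign(0) treated as +1, a fixed N//4 frame advance, simple zero padding) are simplifying assumptions of ours:

```python
import math

def hamming(N):
    """Hamming window of length N."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1))
            for n in range(N)]

def short_time_energy(x, N=50):
    """Sum of squared samples weighted by the squared window,
    as in stef.m; frames advance by N//4."""
    w2 = [w * w for w in hamming(N)]
    adv = N // 4
    x = x + [0.0] * max(0, N - len(x))
    return [sum(s * s * w for s, w in zip(x[i:i + N], w2))
            for i in range(0, len(x) - N + 1, adv)]

def zero_crossing_rate(x, N=50):
    """Windowed count of sign changes, as in stazcr.m
    (sign(0) is treated as +1 here for simplicity)."""
    sgn = [1 if s >= 0 else -1 for s in x]
    d = [abs(a - b) for a, b in zip(sgn[1:] + [1], sgn)]
    w = hamming(N)
    adv = N // 4
    d = d + [0.0] * max(0, N - len(d))
    return [sum(a * b for a, b in zip(d[i:i + N], w))
            for i in range(0, len(d) - N + 1, adv)]
```

A silent frame yields zero energy, a constant non-zero frame yields positive energy with no zero crossings, and a rapidly alternating frame yields a large zero-crossing measure, matching the intuition used for voiced/unvoiced discrimination.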
voiceunv.m Voiced and unvoiced discrimination
Usage: vuv=voiceunv(data)
The returned measure is later quantized to 0 (unvoiced),
1 (voiced), or 0.5 (in between).
function vuv=voiceunv(data)
NSeg=10;
NSample=fix(length(data)/NSeg);
for indx=1:NSeg,
  xx = data((NSample/2)*(indx-1)+1:(NSample/2)*(indx+1));
  en(indx) = stef(xx);
end;
vuv=sum(en(1:NSeg));
arl2lin.m 12th-order AR model coefficients (normalized)
Usage: arcoeff=arl2lin(data)
function arcoefff=arl2lin(data)
p=12; % order of AR coefficients
Nd=2;
x=decimate(data,Nd);
NSeg=10;
NSample=fix(length(x)/NSeg);
for indx1=1:NSeg,
  xx = x(NSample*(indx1-1)+1:NSample*indx1);
  xx = dtrend(xx);
  xstore(:,indx1) = xx(:) ./ max(xx);
  [a,s]=covarc(xx,p,length(xx));
  aa(:,indx1) = a(:);
  astore(:,indx1) = a(:) ./ max(a);
  ss(indx1) = s;
  en(indx1) = stef(xx);
  zc(indx1) = stazcr(xx);
  vuv(indx1) = voiceunv(xx);
end;
vuv=vuv/max(vuv);
ss=ss/max(abs(ss));
en=en/max(abs(en));
zc=zc/max(zc);
for indx3=1:NSeg
  if vuv(indx3)<.2
    vuv(indx3)=0;
  elseif vuv(indx3)<.8
    vuv(indx3)=.5;
  else
    vuv(indx3)=1;
  end
end
arcol=aa(:);
scalef=max(abs(arcol));
arcoef=aa/scalef;
arcoeff=[arcoef;ss;en;zc;vuv];
arcoefff=arcoeff(:);
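The final normalization and three-level voiced/unvoiced quantization performed at the end of arl2lin.m can be sketched in plain Python (the function names are illustrative, the thresholds are those of the MATLAB code):

```python
def quantize_vuv(vuv):
    """Scale the per-segment voicing measure to [0,1] and snap it to
    the three levels used in arl2lin.m: 0 (unvoiced), 0.5 (in
    between), 1 (voiced)."""
    peak = max(vuv)
    scaled = [v / peak for v in vuv]
    return [0.0 if v < 0.2 else 0.5 if v < 0.8 else 1.0
            for v in scaled]

def normalize(values):
    """Divide by the maximum absolute value, as done for ss, en, zc."""
    peak = max(abs(v) for v in values)
    return [v / peak for v in values]
```

Keeping every feature in a comparable numeric range in this way prevents any single feature (e.g., the raw segment energy) from dominating the network's weighted sums.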
References
[1] NeuralWare, Inc., Neural Computing, documentation for the Neural Professional II Plus neural network simulation software, 1991.
[2] Gorman, R. P. and Sejnowski, T. J., "Learned Classification of Sonar Targets Using a Massively Parallel Network," IEEE Trans. Acoustics, Speech, and Signal Processing, Vol. 36, No. 7, July 1988.
[3] Webster, W. P., "Artificial Neural Networks and their Application to Weapons," Naval Engineering Journal, Vol. 103, pp. 46-59, May 1991.
[4] Hecht-Nielsen, R., Neurocomputing, Addison-Wesley, 1990.
[5] Rabiner, L. R. and Schafer, R. W., Digital Processing of Speech Signals, Prentice-Hall, 1978.
[6] Mariani, J., "Recent Advances in Speech Processing," IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Vol. 1, pp. 429-440, 1989.
[7] O'Shaughnessy, D., Speech Communication, Addison-Wesley, 1987.
[8] NTIS, TIMIT, CD-ROM on-line documentation for the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus, American Helix, October 1990.
[9] Tom, D. M. and Tenorio, F. M., "Short Utterance Recognition Using a Network with Minimum Training," Neural Networks, Vol. 4, No. 6, pp. 711-722, 1991.