ASER: Arabic Speech Emotion Recognition Employing Wav2vec2.0 and HuBERT Based on BAVED Dataset


  • Salah Ahmed, Fayoum University



Emotion Recognition, Deep Learning, Arabic Speech, Automated Speech Recognition (ASR).


Recently, research in speech recognition and natural language processing has advanced substantially. This is due to well-developed multilayer deep learning paradigms such as wav2vec2.0, Wav2vecU, WavBERT, and HuBERT, which provide better representation learning and capture more information. Such paradigms are pretrained on large amounts of unlabeled data and then fine-tuned on a small dataset for specific tasks.

This paper introduces a deep-learning-based emotion recognition model for Arabic speech dialogues. The developed model employs state-of-the-art audio representations, including wav2vec2.0 and HuBERT. The experimental results of our model surpass previously reported results.







How to Cite

Ahmed, S. (2021). ASER: Arabic Speech Emotion Recognition Employing Wav2vec2.0 and HuBERT Based on BAVED Dataset. Transactions on Machine Learning and Artificial Intelligence, 9(6), 1–8.