Study on speech emotion recognition based on deep learning
Guan, Haozhong
Permalink
https://hdl.handle.net/2142/117682
Description
Title
Study on speech emotion recognition based on deep learning
Author(s)
Guan, Haozhong
Issue Date
2022-12-05
Director of Research (if dissertation) or Advisor (if thesis)
Hasegawa-Johnson, Mark
Department of Study
Electrical & Computer Eng
Discipline
Electrical & Computer Engr
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Speech emotion recognition
Convolutional neural network
ResNet50
Abstract
Speech emotion recognition (SER) is closely tied to everyday life and has the potential to bring substantial changes and improvements to how people interact with technology. The continued development of artificial intelligence together with SER will bring new breakthroughs to human-machine interaction. Studying SER therefore has important theoretical value and research significance.
This thesis reviews the current state of speech emotion recognition and identifies its open problems and challenges. Building on a summary of the key SER techniques, a ResNet50 convolutional neural network (CNN) model for speech emotion recognition is constructed, and recognition experiments and analysis are carried out. The main work is as follows:
The speech emotion description models, the overall SER pipeline, the preprocessing of speech signals, and methods for extracting emotional feature parameters are summarized. The time-domain waveforms and spectrogram characteristics of speech carrying different emotions are analyzed, and a recognition scheme combining spectrogram extraction with a CNN is adopted.
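For concreteness, a minimal sketch of such a spectrogram front end is given below, assuming the librosa toolkit; the sampling rate, frame length, and hop size are illustrative assumptions rather than values taken from the thesis.

import librosa
import numpy as np

def extract_log_spectrogram(wav_path, sr=16000, n_fft=512, hop_length=160):
    """Load a speech file and compute a log-magnitude spectrogram.

    The sampling rate and frame/hop sizes here are illustrative
    assumptions, not parameters reported in the thesis.
    """
    signal, _ = librosa.load(wav_path, sr=sr)
    # Short-time Fourier transform -> complex spectrogram
    stft = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)
    # Log-compressed magnitude, treated as a 2-D "image" input to the CNN
    log_spec = librosa.amplitude_to_db(np.abs(stft), ref=np.max)
    return log_spec  # shape: (n_fft // 2 + 1, n_frames)

The resulting time-frequency image is what the CNN consumes in place of raw waveform samples.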
In this thesis, a CNN model is constructed based on a residual network: it uses the ResNet50 architecture with bottleneck blocks and consists of 49 convolutional layers and one fully connected layer. Through the residual network's "shortcut connections," the output is expressed as the sum of the input and a learned nonlinear transformation, which mitigates vanishing and exploding gradients during backpropagation and allows the deep network to be trained effectively.
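The bottleneck structure with a shortcut connection can be illustrated with the standard ResNet building block. The PyTorch sketch below follows the conventional ResNet50 bottleneck design (1x1, 3x3, 1x1 convolutions plus an identity or projection shortcut) and is not claimed to reproduce the thesis implementation exactly.

import torch.nn as nn

class Bottleneck(nn.Module):
    """Standard ResNet bottleneck block: the output is F(x) + x,
    where F is a 1x1 -> 3x3 -> 1x1 convolutional transformation."""
    expansion = 4

    def __init__(self, in_channels, channels, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv3 = nn.Conv2d(channels, channels * self.expansion,
                               kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample  # projects x when shapes differ

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        # Shortcut connection: add the (possibly projected) input back in,
        # which keeps gradients flowing through the deep stack.
        return self.relu(out + identity)

A full 50-layer model built from such blocks can also be instantiated directly, for example via torchvision.models.resnet50 with the number of emotion categories as the output size; the single fully connected layer then maps the pooled features to emotion classes.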
Efficient speech emotion recognition is realized on the IEMOCAP and Emo-DB datasets. The results show that the constructed ResNet50 CNN model achieves recognition accuracies of 69.12% on IEMOCAP and 85.92% on Emo-DB. Compared with other deep learning models, the proposed ResNet50 CNN model is simple and efficient.