Speech interruption detection for live-streaming audio
Xin, Jin
Permalink
https://hdl.handle.net/2142/110308
Description
Title
Speech interruption detection for live-streaming audio
Author(s)
Xin, Jin
Contributor(s)
Patel, Sanjay
Issue Date
2021-05
Keyword(s)
speech interruption detection
live streaming
support vector machine
k-nearest neighbor
multilayer perceptron
mean opinion score
Abstract
Conversation is a fundamental human activity in which multiple people naturally take turns beginning and ending their speech. An interruption occurs when one speaker talks over another, whether intentionally or unintentionally. Frequent interruptions can significantly degrade the conversational experience and vastly reduce its efficiency.
Interruptions happen more frequently in live-streamed audio calls with significant internet delays. Detecting interruptions during a conversation can help live-streaming companies that care about their quality of service; it can also support audio preprocessing and labeling for speech-to-text models, as well as estimating the conflict level in a debate. This project aims to assess the quality of interrupted speech in live-streaming audio. The interruption detection task was divided into two steps: generating a simulated interrupted-speech audio dataset, and building machine learning models for interruption detection. The dataset was created synthetically by concatenating and overlapping speech recordings with varying interruption times and latency times.
Interruption detection performance was evaluated with a k-nearest neighbor classifier, a support vector machine classifier, and a multilayer perceptron model. Each model takes an array representing a 0.5 s audio segment as input and predicts whether interrupted speech is present in that segment. The results show that the SVM model is highly effective at detecting interrupted speech in conversational audio, achieving 92.61% accuracy under cross-validation on the training data and 72.62% on unseen data.
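The abstract does not reproduce the dataset-generation code, but the described procedure (overlapping the tail of one utterance with the head of another to simulate an interruption under latency, then labeling each 0.5 s segment) can be sketched roughly as follows. The sample rate, the `make_interrupted` helper, and the exact way latency and overlap combine are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

SR = 16_000               # assumed sample rate (not stated in the abstract)
SEG = SR // 2             # 0.5 s segment length, as described

def make_interrupted(a, b, latency_s, overlap_s, sr=SR):
    """Hypothetical generator: concatenate utterances `a` and `b`,
    letting `b` start `overlap_s` seconds before `a` ends (after a
    simulated network latency of `latency_s` seconds), and label each
    0.5 s segment 1 if it contains overlapped (interrupted) speech."""
    latency = int(latency_s * sr)
    overlap = int(overlap_s * sr)
    start_b = len(a) + latency - overlap       # b begins before a ends
    out = np.zeros(max(len(a), start_b + len(b)), dtype=np.float32)
    out[: len(a)] += a
    out[start_b : start_b + len(b)] += b
    ov_lo, ov_hi = start_b, len(a)             # overlapped region
    n_seg = len(out) // SEG
    labels = np.zeros(n_seg, dtype=int)
    for i in range(n_seg):
        s, e = i * SEG, (i + 1) * SEG
        if s < ov_hi and e > ov_lo:            # segment intersects overlap
            labels[i] = 1
    return out, labels

# Usage with two 1 s dummy signals, 0.1 s latency, 0.3 s overlap:
mix, y = make_interrupted(np.ones(SR, np.float32),
                          np.ones(SR, np.float32), 0.1, 0.3)
```

Each labeled 0.5 s segment (or features extracted from it) could then be fed to an off-the-shelf classifier such as scikit-learn's `SVC` for the detection step reported in the abstract.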