Unsupervised speaker-invariant feature representations for QbE-STD
Abstract
Query-by-Example Spoken Term Detection (QbE-STD) is the task of retrieving, from a large collection of audio data, the audio documents relevant to a user query given in spoken form. The idea in QbE-STD is to match the audio documents with the user query directly at the acoustic level. Hence, macro-level speech information, such as language, context, and vocabulary, has little impact. This gives QbE-STD an advantage over Automatic Speech Recognition (ASR) systems, which face major challenges with audio databases that contain multilingual audio documents, out-of-vocabulary words, or little transcribed or labeled audio data.
QbE-STD systems have three main subsystems: feature extraction, feature representation, and matching. In this thesis, we focus on improving the feature representation subsystem of QbE-STD.
The speech signal needs to be transformed into a speaker-invariant representation in order to be used in speech recognition tasks such as QbE-STD. The speech-related information in an audio signal lies primarily in the sequence of phones present in the audio. Hence, to make the features more closely related to speech, we have to analyze the phonetic information in the signal. In this context, we propose two representations in this thesis, namely, Sorted Gaussian Mixture Model (SGMM) posteriorgrams and Synthetic Minority Oversampling TEchnique-based (SMOTEd) GMM posteriorgrams. The Sorted GMM tries to represent phonetic information using a set of GMM components, while the SMOTEd GMM tries to improve the balance across phone classes by providing a uniform number of feature vectors for each phone.
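As background for the representations above, the sketch below shows how a GMM posteriorgram is computed from frame-level features, together with a SMOTE-style oversampling step that interpolates new feature vectors between existing ones. This is a minimal illustration, not the thesis's implementation: the function names and toy parameters are made up, and true SMOTE interpolates between a sample and one of its nearest neighbours rather than a random pair.

```python
import numpy as np

def gmm_posteriorgram(frames, means, covs, weights):
    """Posterior probability of each GMM component for each frame.

    frames: (T, D) feature vectors (e.g. MFCCs); means: (K, D);
    covs: (K, D) diagonal covariances; weights: (K,) mixture weights.
    Returns a (T, K) posteriorgram whose rows sum to 1.
    """
    T, D = frames.shape
    # Diagonal-covariance Gaussian log-likelihood of every frame
    # under every component.
    diff = frames[:, None, :] - means[None, :, :]          # (T, K, D)
    log_like = -0.5 * (np.sum(diff**2 / covs, axis=2)
                       + np.sum(np.log(covs), axis=1)
                       + D * np.log(2 * np.pi))
    log_post = np.log(weights) + log_like                  # unnormalized
    log_post -= log_post.max(axis=1, keepdims=True)        # stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def smote_like_oversample(X, n_new, rng):
    """SMOTE-style interpolation between pairs of samples of one class
    (neighbour selection simplified to random pairs for illustration)."""
    i = rng.integers(0, len(X), size=n_new)
    j = rng.integers(0, len(X), size=n_new)
    lam = rng.random((n_new, 1))
    return X[i] + lam * (X[j] - X[i])

# Toy example: 5 frames of 2-dim features, 3-component GMM.
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 2))
means = np.array([[0.0, 0.0], [2.0, 2.0], [-2.0, 1.0]])
covs = np.ones((3, 2))
weights = np.array([0.5, 0.3, 0.2])
pg = gmm_posteriorgram(frames, means, covs, weights)       # (5, 3)
extra = smote_like_oversample(frames, 4, rng)              # (4, 2)
```

Each row of `pg` is a probability distribution over GMM components, which is what makes posteriorgrams comparable across speakers.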
Another approach to improving the speaker invariance of an audio representation is to reduce the variations caused by speaker-related factors in speech. We have focused on one such factor: the spectral variations that exist between speakers due to differences in vocal tract length. To reduce the impact of this variation on the feature representation, we propose to use two models, one per gender, each characterized by a different spectral scaling, based on the Vocal Tract Length Normalization (VTLN) approach.
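To make the spectral scaling concrete, the following is a minimal sketch of the piecewise-linear frequency warp commonly used in VTLN; the cut-off fraction, band edge, and warp factors here are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def vtln_warp(freqs, alpha, f_max=8000.0, f_cut=0.875):
    """Piecewise-linear VTLN frequency warping.

    Frequencies below f_cut * f_max are scaled by the warp factor
    alpha; above the cut the warp is linear and chosen so that f_max
    maps onto f_max, keeping the band edge fixed.
    """
    freqs = np.asarray(freqs, dtype=float)
    f0 = f_cut * f_max
    return np.where(
        freqs <= f0,
        alpha * freqs,
        alpha * f0 + (f_max - alpha * f0) * (freqs - f0) / (f_max - f0),
    )

# Warp a filterbank's centre frequencies with two warp factors,
# e.g. alpha < 1 (longer vocal tract) vs alpha > 1 (shorter one).
centres = np.linspace(0.0, 8000.0, 9)
warp_low = vtln_warp(centres, alpha=0.9)
warp_high = vtln_warp(centres, alpha=1.1)
```

Applying such a warp to the filterbank centre frequencies before feature extraction compresses or stretches the spectral axis, which is one standard way to compensate for vocal tract length differences.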
Recent approaches to QbE-STD use neural networks and faster computational algorithms. Neural networks are mainly used in the feature representation subsystem of QbE-STD. Hence, we also built a simple Deep Neural Network (DNN) framework for the task of QbE-STD. The DNN thus designed is referred to as an unsupervised DNN (uDNN).
This thesis is a study of different approaches that could improve the performance of QbE-STD. We built a state-of-the-art model and analyzed the performance of the QbE-STD system. Based on this analysis, we proposed algorithms that improve the performance of the system. We also studied the limitations and drawbacks of the proposed algorithms. Finally, the thesis concludes by presenting some potential research directions.