
dc.contributor.advisorPatil, Hemant A.
dc.contributor.authorPusuluri, Aditya
dc.date.accessioned2024-08-22T05:21:28Z
dc.date.available2024-08-22T05:21:28Z
dc.date.issued2023
dc.identifier.citationPusuluri, Aditya (2023). Phase Based Methods for Various Speech Applications. Dhirubhai Ambani Institute of Information and Communication Technology. xiv, 94 p. (Acc. # T01151).
dc.identifier.urihttp://drsr.daiict.ac.in//handle/123456789/1208
dc.description.abstractVocal communication plays a fundamental role in human interaction and expression. Right from the first cry to adult speech, the signal conveys information about the well-being of the individual. Lack of coordination between the speech muscles and the brain leads to voice pathologies. Some pathologies related to infants are asphyxia, Sudden Infant Death Syndrome (SIDS), etc. Other voice pathologies that affect the speech production system are dysarthria, cerebral palsy, and Parkinson's disease.

Dysarthria, a neurological motor speech disorder, is characterized by impaired speech intelligibility that can vary across severity-levels. This work focuses on exploring the importance of Modified Group Delay Cepstral Coefficient (MGDCC)-based features in capturing the distinctive acoustic characteristics associated with dysarthric severity-level classification, particularly irregularities in speech. Convolutional Neural Network (CNN) and traditional Gaussian Mixture Model (GMM) are used as the classification models in this study. MGDCC is compared with state-of-the-art magnitude-based features, namely, Mel Frequency Cepstral Coefficients (MFCC) and Linear Frequency Cepstral Coefficients (LFCC). In addition, this work also analyses the noise robustness of MGDCC. To that effect, experiments were performed on various noise types and SNR levels, where the strong performance of MGDCC over the other feature sets was reported. Further, this study also analyses cross-database scenarios for dysarthric severity-level classification. Voice Onset Time (VOT) analysis was carried out, and experiments were performed using MGDCC to detect dysarthric speech against normal speech. The performance of MGDCC was then compared with baseline features using precision, recall, and F1-score, and finally the latency period was analysed for practical deployment of the system.

This work also explores the application of phase-based features to emotion recognition and pop noise detection. As technological advancements progress, dependence on machines is inevitable; therefore, to facilitate effective interaction between humans and machines, it has become crucial to develop proficient techniques for Speech Emotion Recognition (SER). The MGDCC feature set is compared against MFCC and LFCC features using a CNN classifier and the Leave One Speaker Out technique. Furthermore, because MGDCC captures information in low-frequency regions and pop noise occurs at lower frequencies, phase-based features are also applied to voice liveness detection. The results are obtained from a CNN classifier using 5-fold cross-validation and are compared against the MFCC and LFCC feature sets.

This work proposed time averaging-based features in order to understand the amount of information captured along the temporal axis, since a cry signal does not exhibit many temporal variations. The research conducted in this study utilizes a 10-fold stratified cross-validation approach with machine learning classifiers, specifically Support Vector Machine (SVM), K-Nearest Neighbour (KNN), and Random Forest (RF). This work also presents CQT-based Constant-Q Harmonic Coefficients (CQHC) and Constant-Q Pitch Coefficients (CQPC) for the classification of infant cry into normal and pathological, since an effective joint representation of the spectral and pitch components of a spectrum has not yet been achieved, leaving scope for improvement. The results are compared by considering the MFCC, LFCC, and CQCC feature sets as the baseline features using machine learning and deep learning classifiers, such as Convolutional Neural Networks (CNN), Gaussian Mixture Models (GMM), and Support Vector Machines (SVM), with 5-fold cross-validation accuracy as the metric. (A minimal sketch of the MGDCC feature extraction described above is given after this record.)
dc.publisherDhirubhai Ambani Institute of Information and Communication Technology
dc.subjectInfant Cry Analysis
dc.subjectDysarthria Severity-Level Classification
dc.subjectEmotion Recognition
dc.subjectVoice Liveness Detection
dc.subjectConstant-Q Harmonic Coefficients
dc.subjectModified Group Delay Function
dc.subjectNoise Robustness
dc.classification.ddc006.454 PUS
dc.titlePhase Based Methods for Various Speech Applications
dc.typeDissertation
dc.degreeM. Tech (EC)
dc.student.id202115008
dc.accession.numberT01151
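
The abstract centres on Modified Group Delay Cepstral Coefficients (MGDCC). As a rough illustration of how such phase-based features are typically computed, the Python sketch below follows the standard modified group delay formulation (group delay from the spectra of x[n] and n*x[n], divided by a cepstrally smoothed spectrum, then compressed and decorrelated with a DCT). The parameter values (alpha, gamma, FFT size, liftering length, number of coefficients) are illustrative assumptions, not the settings used in the thesis.

import numpy as np
from scipy.fft import dct

def mgdcc(frame, n_fft=512, alpha=0.4, gamma=0.9, n_ceps=13, lifter_len=6):
    # Window the frame and build the two spectra used by the group delay
    # function: X(w) from x[n] and Y(w) from n*x[n].
    x = frame * np.hamming(len(frame))
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(n * x, n_fft)

    # Cepstrally smoothed spectrum |S(w)|: low-quefrency liftering of the real
    # cepstrum tames the spiky denominator caused by zeros near the unit circle.
    log_mag = np.log(np.abs(X) + 1e-10)
    ceps = np.fft.irfft(log_mag)
    lifter = np.zeros(n_fft)
    lifter[:lifter_len] = 1.0
    lifter[-(lifter_len - 1):] = 1.0      # keep the symmetric low-quefrency part
    S = np.exp(np.fft.rfft(ceps * lifter).real)

    # Modified group delay function: tau(w) = (Xr*Yr + Xi*Yi) / |S(w)|^(2*gamma),
    # compressed as sign(tau) * |tau|^alpha.
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma) + 1e-10)
    modgd = np.sign(tau) * (np.abs(tau) ** alpha)

    # DCT across frequency gives the cepstral coefficients; keep the first n_ceps.
    return dct(modgd, type=2, norm='ortho')[:n_ceps]

Frame-level vectors computed this way would typically be stacked per utterance (for example, 25 ms frames with a 10 ms hop) before being passed to the CNN or GMM classifiers mentioned in the abstract.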

