Phase-Based Methods for Various Speech Applications
Abstract
Vocal communication plays a fundamental role in human interaction and expression. From the first cry to adult speech, the signal conveys information about the well-being of the individual. Lack of coordination between the speech muscles and the brain leads to voice pathologies. Pathologies affecting infants include asphyxia and Sudden Infant Death Syndrome (SIDS), while disorders such as dysarthria, cerebral palsy, and Parkinson's disease affect the speech production system.

Dysarthria, a neurological motor speech disorder, is characterized by impaired speech intelligibility that varies across severity levels. This work focuses on the importance of Modified Group Delay Cepstral Coefficient (MGDCC)-based features in capturing the distinctive acoustic characteristics associated with dysarthric severity-level classification, particularly for irregularities in speech. A Convolutional Neural Network (CNN) and a traditional Gaussian Mixture Model (GMM) are used as the classification models in this study. MGDCC is compared with state-of-the-art magnitude-based features, namely Mel Frequency Cepstral Coefficients (MFCC) and Linear Frequency Cepstral Coefficients (LFCC). In addition, this work analyses the noise robustness of MGDCC: experiments were performed across various noise types and SNR levels, where MGDCC was reported to outperform the other feature sets. This study further analyses cross-database scenarios for dysarthric severity-level classification. Analysis of Voice Onset Time (VOT) was carried out, and experiments using MGDCC were performed to detect dysarthric speech against normal speech. The performance of MGDCC was then compared with baseline features using precision, recall, and F1-score, and finally the latency period was analysed for practical deployment of the system.

This work also explores the application of phase-based features to the emotion recognition task and pop noise detection.
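To make the MGDCC feature set mentioned above concrete, the following is a minimal NumPy sketch of how modified group delay cepstral coefficients are commonly computed for a single windowed frame: the group delay numerator uses the spectra of x[n] and n·x[n], the denominator is a cepstrally smoothed magnitude spectrum, and a DCT yields the cepstral coefficients. The α and γ exponents, the lifter cutoff, and the frame setup here are illustrative assumptions, not the thesis's exact configuration.

```python
import numpy as np
from scipy.fftpack import dct

def mgdcc(frame, n_fft=512, alpha=0.4, gamma=0.9, n_ceps=13):
    """Sketch of Modified Group Delay Cepstral Coefficients for one
    windowed speech frame (alpha/gamma values are illustrative)."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)           # spectrum of x[n]
    Y = np.fft.rfft(n * frame, n_fft)        # spectrum of n * x[n]
    mag = np.abs(X)
    # crude cepstral smoothing of |X| to stabilise the denominator
    ceps = np.fft.irfft(np.log(mag + 1e-10))
    ceps[30:] = 0.0                           # keep low quefrencies only
    S = np.exp(np.fft.rfft(ceps).real)        # smoothed magnitude spectrum
    # modified group delay function
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma) + 1e-10)
    mgd = np.sign(tau) * np.abs(tau) ** alpha
    # cepstral coefficients via DCT of the MGD spectrum
    return dct(mgd, norm='ortho')[:n_ceps]

# example: one Hann-windowed 50 ms frame of a 100 Hz tone at 8 kHz
frame = np.hanning(400) * np.sin(2 * np.pi * 100 * np.arange(400) / 8000)
coeffs = mgdcc(frame)
```

In practice these per-frame coefficients are stacked over time and fed to the CNN or GMM classifiers, exactly as magnitude-based MFCC/LFCC features would be.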
As technological advancements progress, dependence on machines is inevitable. Therefore, to facilitate effective interaction between humans and machines, it has become crucial to develop proficient techniques for Speech Emotion Recognition (SER). The MGDCC feature set is compared against MFCC and LFCC features using a CNN classifier with a Leave-One-Speaker-Out evaluation protocol. Furthermore, because MGDCC captures information in low-frequency regions and pop noise occurs at lower frequencies, phase-based features are also applied to voice liveness detection. The results are obtained from a CNN classifier using 5-fold cross-validation and are compared against the MFCC and LFCC feature sets.

This work also proposes time averaging-based features in order to understand the amount of information captured along the temporal axis, since a cry signal exhibits few temporal variations. The study employs 10-fold stratified cross-validation with machine learning classifiers, specifically Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Random Forest (RF). It further proposes CQT-based Constant-Q Harmonic Coefficients (CQHC) and Constant-Q Pitch Coefficients (CQPC) for classifying infant cries as normal or pathological, since an effective joint representation of the spectral and pitch components of a spectrum has not previously been achieved, leaving scope for improvement. The results are compared against the MFCC, LFCC, and CQCC feature sets as baselines, using machine learning and deep learning classifiers, namely Convolutional Neural Networks (CNN), Gaussian Mixture Models (GMM), and Support Vector Machines (SVM), with 5-fold cross-validation accuracy as the metric.
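The stratified cross-validation protocol described above, with the three classical classifiers, can be sketched with scikit-learn as follows. The random feature matrix here is only a stand-in for the time-averaged cry features; the classifier hyperparameters are library defaults, not the thesis's tuned settings.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# toy feature matrix standing in for time-averaged cry features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)          # normal vs. pathology labels

# 10-fold stratified CV preserves the class ratio in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

results = {}
for name, clf in [("SVM", SVC()),
                  ("KNN", KNeighborsClassifier()),
                  ("RF", RandomForestClassifier(random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=cv)   # one accuracy per fold
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {results[name]:.3f}")
```

Stratification matters for pathology data because the pathological class is usually the minority; plain k-fold splits could otherwise leave some folds with almost no pathological examples.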