Feature for Live and Spoofed Speech Detection
Abstract
Biometric systems authorize access to specific information. They are used for security purposes, preventing unauthorized access to important information or data (information privacy). A biometric system grants access by capturing human traits that make each person unique with respect to that particular trait. This thesis focuses on voice-based biometric systems, also known as Automatic Speaker Verification (ASV) systems, given that speech is the most natural and powerful form of communication that humans use to interact with the outside world. It is the most intuitive, simple, and easy-to-produce characteristic. Because ASV systems are used in applications such as banking transactions and access to buildings associated with classified information, only authorized, genuine users should be granted access.

ASV systems are vulnerable to attacks and can be compromised at various stages. The attacks may be categorized as direct or indirect, depending on the extent of the attacker's access to the ASV framework. Moreover, with the recent commercial success of several Intelligent Personal Assistants (IPAs), also known as voice assistants, such as Speech Interpretation and Recognition Interface (SIRI), Amazon Alexa, and Google Home, many voice-enabled devices in the Internet of Things (IoT) have become prone to spoofing attacks. Consequently, there is active research on designing countermeasure systems for ASV, particularly against spoofing attacks, namely, Speech Synthesis (SS), Voice Conversion (VC), and replay.

This thesis is an attempt to address some of the research gaps in designing features for countermeasure systems. In particular, it proposes the Quadrature Energy Separation Algorithm (QESA), which incorporates the quadrature-phase component along with the in-phase component of the signal. To that effect, an existing feature set for replay Spoofed Speech Detection (SSD), namely, CFCCIF-ESA, is extended to the CFCCIF-QESA feature set for enhanced performance of the countermeasure system. The performance of the proposed CFCCIF-QESA feature set is evaluated on various datasets for various spoofing attacks given in the literature. Furthermore, the existing Linear Frequency Residual Cepstral Coefficients (LFRCC) feature set is optimized with respect to its Linear Prediction (LP) order for the replay SSD task. In particular, it is found that the LP order needed for a good prediction of speech is not the same as that needed for the replay SSD task. The resulting optimized LFRCC feature set is evaluated on the ASVspoof 2019 PA dataset. In addition, another feature, known as the uncertainty vector (u-vector), is developed from Heisenberg's uncertainty principle in the signal processing framework. The proposed u-vector is evaluated using the ASVspoof 2017 dataset for replay attacks.

Furthermore, to make countermeasure systems independent of the type of spoofing attack, features are proposed for the Voice Liveness Detection (VLD) task. VLD is performed by detecting pop noise, the discriminating acoustic cue present in live speech, which is produced by the breathing effect captured by the microphone when the speaker's mouth is close to it. The work on VLD in this thesis rests on two key hypotheses: first, Parseval's energy equivalence for the Short-Time Fourier Transform (STFT), the Continuous Wavelet Transform (CWT), and the analytic CWT; and second, that the energy of pop noise decreases as the distance between the speaker and the microphone used to capture genuine speech increases. The features proposed for VLD are wavelet-based and use three wavelets, namely, the Bump, Morlet, and Morse wavelets, where the Morse wavelet is presented as a superfamily of analytic wavelets known as Generalized Morse Wavelets (GMWs). Detailed experimental analyses are presented, covering speaker-microphone proximity, the effect of phoneme type, and the effect of frequency range.

Apart from this, the security of speech data is also taken into account, and this thesis proposes an improved Voice Privacy (VP) system based on Linear Prediction (LP) of speech. The VP system is also studied from the attacker's perspective using the target selection approach; in particular, target selection with respect to twins is studied, wherein the most vulnerable twin-pair (i.e., target) is selected. Lastly, some of the feature sets proposed in this thesis are also evaluated on tasks related to other Assistive Speech Technologies (AST) applications, such as the classification of healthy vs. pathological infant cries and dysarthric severity-level classification.