Handcrafted Features for Anti-Spoofing
Abstract
Among the various biometrics, voice is the most natural and convenient means of communication for human-machine interaction. Consequently, the use of Automatic Speaker Verification (ASV) for authentication is increasing in various sensitive applications, which creates an opportunity for fraud, as attackers can breach the authentication using various spoofing attacks. To alleviate this issue, we can either develop an ASV system that is inherently protected from spoofing attacks, or develop a separate countermeasure (CM) system that works in tandem with the ASV system against spoofing attacks. The former approach involves a trade-off between the performance of the ASV system and its robustness against spoofing attacks. Hence, it is advantageous to implement a separate Spoof Speech Detection (SSD) system, and the majority of research attempts focus on the latter approach. To that effect, various international challenge campaigns, such as ASVspoof 2015, ASVspoof 2017, and ASVspoof 2019, were organized during INTERSPEECH conferences; these provide standard datasets, protocols, and evaluation metrics. This thesis focuses on developing handcrafted feature sets for CM systems against spoofing attacks, namely, Speech Synthesis (SS), Voice Conversion (VC), and replay. These feature sets are either developed by applying subband filtering to the speech signals or derived from spectrogram representations.

In this thesis work, various subband filtering-based feature sets are developed, namely, Enhanced Teager Energy-Based Cepstral Coefficients (ETECC), Cross-Teager Energy Cepstral Coefficients (CTECC), and Energy Separation Algorithm-based Instantaneous Frequency estimation for Cochlear Cepstral Features (CFCCIF-ESA). These feature sets either modify Teager Energy Operator (TEO)-based representations or utilize the Energy Separation Algorithm (ESA) for Instantaneous Frequency (IF) estimation.
The ETECC feature set is developed by accurately estimating the energies in the high frequency regions using compensation of the signal mass. In Teager Energy-Based Cepstral Coefficients (TECC), the TEO is utilized to estimate the energy; it relies on the approximation sin(Ω) ≈ Ω, which holds only for low frequencies. However, the discriminative information for replay detection is prominently present in the mid and high frequency regions. Hence, the ETECC feature set is proposed to obtain an efficient representation for the SSD task by accurately estimating the energies in the high frequency regions. Furthermore, a signal processing-based approach is presented for replay SSD in Voice Assistants (VAs). It utilizes the Cross-Teager Energy Operator (CTEO) to extract acoustic cues from replay speech. The CTEO captures the interactions within a multi-channel signal by estimating the cross-Teager energies between the channel signals. To represent the acoustic cues of replay spoofs efficiently, the maximum cross-Teager energy among the subband-filtered multi-channel signals is utilized for the feature representation. Thus, the rationale behind optimal channel selection is to find the noisiest (most distorted) transmission channel. The cepstral features extracted using the CTEO are referred to as Cross-Teager Energy Cepstral Coefficients (CTECCmax). The experiments are performed using the Realistic Replay Attack Microphone Array Speech Corpus (ReMASC), which is specially designed for replay SSD in VAs. The proposed CTECCmax feature set performs better than other state-of-the-art feature sets. The proposed CFCCIF-ESA feature set combines magnitude and phase (in the form of IFs) information to develop an efficient feature representation for SS, VC, and replay spoofing attacks. The proposed CFCCIF-ESA utilizes the ESA to accurately estimate the modulation patterns because of its relatively low computational complexity, high time resolution, and instantaneously adapting nature.
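To make the two operators named above concrete, the following is a minimal sketch and not the thesis implementation. It assumes the standard discrete TEO, Ψ[x](n) = x(n)² − x(n−1)x(n+1), and the DESA-2 variant of the ESA; the abstract does not state which ESA variant the thesis uses. The helper names (`teager_energy`, `desa2`, `tone`, `small_angle_ratio`) and the test-tone parameters are hypothetical. For a pure tone A·cos(Ωn + φ), the TEO output equals A²·sin²(Ω), which is close to A²·Ω² only when Ω is small.

```python
import math


def teager_energy(x):
    """Discrete Teager Energy Operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def desa2(x):
    """Per-sample (frequency, amplitude) estimates using DESA-2.

    Assumption: DESA-2 is used here for illustration only. Frequencies are
    in radians per sample and are recoverable for 0 < omega < pi / 2.
    """
    # Symmetric difference y[n] = x[n+1] - x[n-1].
    y = [x[n + 1] - x[n - 1] for n in range(1, len(x) - 1)]
    psi_x = teager_energy(x)
    psi_y = teager_energy(y)
    estimates = []
    for m, energy_y in enumerate(psi_y):
        energy_x = psi_x[m + 1]  # psi_x sample at the same time instant
        if energy_x <= 0.0 or energy_y <= 0.0:
            continue  # operator output unusable at this sample
        cos_arg = 1.0 - energy_y / (2.0 * energy_x)
        cos_arg = max(-1.0, min(1.0, cos_arg))  # guard against rounding
        freq = 0.5 * math.acos(cos_arg)
        amp = 2.0 * energy_x / math.sqrt(energy_y)
        estimates.append((freq, amp))
    return estimates


def tone(amplitude, omega, length=32, phase=0.4):
    """Hypothetical test signal: a single discrete-time cosine."""
    return [amplitude * math.cos(omega * n + phase) for n in range(length)]


def small_angle_ratio(omega, amplitude=1.0):
    """Mean TEO output divided by the small-angle estimate A^2 * omega^2."""
    psi = teager_energy(tone(amplitude, omega))
    return (sum(psi) / len(psi)) / (amplitude * omega) ** 2


if __name__ == "__main__":
    for omega in (0.1, 1.0, 2.5):
        print(f"omega={omega}: TEO / (A*omega)^2 = {small_angle_ratio(omega):.4f}")
    freq, amp = desa2(tone(2.0, 0.3))[0]
    print(f"DESA-2 estimate: omega={freq:.4f}, amplitude={amp:.4f}")
```

The ratio printed by the loop falls away from 1 as Ω grows, which is the high-frequency energy error that motivates ETECC. The DESA-2 call needs only a five-sample neighbourhood per estimate, which reflects the high time resolution attributed to the ESA above.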
In the previously proposed Cochlear Filter Cepstral Coefficient Instantaneous Frequency (CFCCIF) feature set, IFs were estimated using a Hilbert transform-based approach, whose time resolution is relatively low (as it requires a segment of speech) compared to the ESA-based approach.

Furthermore, a Constant-Q Transform (CQT)-based feature representation and Spectral Root Cepstral Coefficients (SRCC) are developed using spectrogram representations and effectively utilized for anti-spoofing. In accordance with Heisenberg's uncertainty principle in the signal processing framework, the CQT has variable spectro-temporal resolution: better frequency resolution in the low frequency region and better temporal resolution in the high frequency region. This property of the CQT representation is effectively utilized to identify the low frequency characteristics of pop noise. Here, pop noise is attributed to the live speaker and hence, it is exploited for the Voice Liveness Detection (VLD) task. The SRCC feature set is derived from the theory of homomorphic filtering, which obeys the generalized superposition theory. In a spectral root homomorphic deconvolution system, convolutionally combined vectors are mapped to another convolutionally combined vector space, where the signal components are more easily separable by a liftering operation. The logarithm operation in Mel Frequency Cepstral Coefficients (MFCC) extraction is replaced by a power-law nonlinearity (i.e., (·)^γ) to derive the SRCC feature set. The proper choice of γ depends upon the pole-zero arrangement in the transfer function obtained from the speech signal, and it helps to capture the system information of the speech signal with a minimum number of cepstral coefficients. In this thesis, the optimum γ-value is chosen by estimating the energy concentration in the cepstral coefficients and by visualizing the spectrogram with respect to the γ-value.

To validate the performance of the proposed feature sets, experiments are performed using various datasets, state-of-the-art feature sets, classifiers, and evaluation metrics. The development and performance analysis of each proposed feature set is provided in the corresponding chapter. Furthermore, other contributions of the thesis are also discussed, namely, feature normalization for anti-spoofing, an analysis of Delay and Sum (DAS) vs. Minimum Variance Distortionless Response (MVDR) beamforming techniques for anti-spoofing in VAs, severity-level classification of dysarthric speech, and classification of normal vs. pathological cries. The thesis concludes with potential future research directions and open research problems.
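As an illustration of the SRCC idea described above, the sketch below replaces the logarithm in an MFCC-style pipeline with a power-law nonlinearity before the DCT. It is a sketch under stated assumptions, not the thesis code: the filterbank energies are made-up numbers, the γ values are arbitrary, and `energy_concentration` is a hypothetical stand-in for the selection criterion, whose exact definition the abstract does not give.

```python
import math


def dct2(values):
    """Unnormalized DCT-II, as commonly used to obtain cepstral coefficients."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, v in enumerate(values))
        for k in range(n)
    ]


def cepstra(filterbank_energies, gamma=None):
    """Cepstral coefficients from positive filterbank energies.

    gamma=None applies the logarithm (MFCC-style compression);
    otherwise the power law e ** gamma is applied (SRCC-style compression).
    """
    if gamma is None:
        compressed = [math.log(e) for e in filterbank_energies]
    else:
        compressed = [e ** gamma for e in filterbank_energies]
    return dct2(compressed)


def energy_concentration(coeffs, keep):
    """Hypothetical measure: share of squared-coefficient energy held by the
    first `keep` coefficients after c0 (c0 is skipped because it mainly
    reflects the overall level)."""
    body = [c * c for c in coeffs[1:]]
    total = sum(body)
    return sum(body[:keep]) / total if total > 0.0 else 0.0


# Made-up filterbank energies for a single frame (illustrative only).
ENERGIES = [4.0, 9.0, 2.5, 1.2, 0.6, 0.3, 0.15, 0.1]

if __name__ == "__main__":
    for gamma in (0.1, 0.3, 0.5):  # arbitrary values, not taken from the thesis
        share = energy_concentration(cepstra(ENERGIES, gamma), keep=3)
        print(f"gamma={gamma}: energy share in first 3 coefficients = {share:.3f}")
```

Sweeping γ and comparing the concentration values is one simple way to see how the nonlinearity redistributes energy across the cepstral coefficients. As γ approaches zero, the coefficients after c0, divided by γ, approach the log-based ones, so the logarithm can be viewed as a limiting case of the power law.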