Show simple item record

dc.contributor.advisor: Patil, Hemant A.
dc.contributor.author: Sailor, Hardik B.
dc.date.accessioned: 2019-03-19T10:52:15Z
dc.date.available: 2019-03-19T10:52:15Z
dc.date.issued: 2018
dc.identifier.citation: Sailor, Hardik B. (2018). Auditory Representation Learning. Dhirubhai Ambani Institute of Information and Communication Technology, xxv, 218 p. (Acc. No: T00688)
dc.identifier.uri: http://drsr.daiict.ac.in//handle/123456789/785
dc.description.abstract: Representation learning (RL), or feature learning, has had a significant impact on signal processing applications. The goal of RL approaches is to learn meaningful representations directly from the data that can be helpful to a pattern classifier. In particular, unsupervised RL has gained significant interest for feature learning in various signal processing areas, including speech and audio processing. Recently, various RL methods have been used to learn auditory-like representations from speech signals or their spectral representations. In this thesis, we propose a novel auditory representation learning model based on the Convolutional Restricted Boltzmann Machine (ConvRBM). Auditory-like subband filters are learned when the model is trained directly on raw speech and audio signals of arbitrary lengths. The learned auditory frequency scale is also nonlinear, similar to the standard auditory frequency scales; however, the ConvRBM frequency scale is adapted to the sound statistics. The primary motivation for developing our model is its application to the Automatic Speech Recognition (ASR) task. Experiments on standard ASR databases show that the ConvRBM filterbank performs better than the Mel filterbank. A stability analysis of the model is presented using the Lipschitz continuity condition. The proposed model is improved by using annealed dropout and Adam optimization. Noise-robust representation is achieved by combining the ConvRBM filterbank with energy estimation using the Teager Energy Operator (TEO). As part of the research work for the MeitY, Govt. of India sponsored consortium project, the ConvRBM is used as a front-end for the ASR system in speech-based access for agricultural commodities in the Gujarati language.
Inspired by its success in the ASR task, we applied our model to three audio classification tasks, namely, Environmental Sound Classification (ESC), synthetic and replay Spoof Speech Detection (SSD) in the context of Automatic Speaker Verification (ASV), and Infant Cry Classification (ICC). We further propose a two-layer auditory model built by stacking two ConvRBMs. We refer to it as an Unsupervised Deep Auditory Model (UDAM); it performed well compared to the single-layer ConvRBM in the ASR task.
dc.publisher: Dhirubhai Ambani Institute of Information and Communication Technology
dc.subject: Representation learning
dc.subject: Deep learning
dc.subject: Filterbank learning
dc.subject: Speech databases
dc.subject: Sound classification
dc.subject: Auditory model
dc.subject: Speech signal
dc.subject: Signal processing
dc.subject: Audio processing
dc.subject: Speech recognition
dc.subject: Speech detection
dc.classification.ddc: 006.454 SAI
dc.title: Auditory representation learning
dc.type: Thesis
dc.degree: Ph.D
dc.student.id: 201321002
dc.accession.number: T00688

