Abstract:
Audio feature extraction plays a vital role in analyzing the complex nature of a signal: it entails studying the signal and determining how its components relate to one another. Consequently, the performance of audio spoofing detection in Automatic Speaker Verification (ASV) systems relies strongly on front-end feature extraction. In this paper, three types of sequentially integrated features are proposed. First, Acoustic Ternary Pattern (ATP) image features are sequentially fused, individually, with several audio features: Mel Frequency Cepstral Coefficients (MFCC), Constant Q Cepstral Coefficients (CQCC), Gammatone Cepstral Coefficients (GTCC), Basilar-membrane Frequency-band Cepstral Coefficients (BFCC), and Perceptual Linear Prediction (PLP). Second, Local Binary Pattern (LBP) image features are combined with each of these audio features in the same manner. Third, the sequential integration of ATP-LBP features is combined individually with the MFCC, CQCC, GTCC, BFCC, and PLP features. Finally, these front-end hybrid feature sets are classified at the back-end using acoustic models based on different machine learning and deep learning algorithms. The state-of-the-art ASVspoof 2019 dataset has been used to evaluate the various front-end and back-end combinations. The experimental results reveal that the proposed approach achieves its best results with ATP-LBP-GTCC features at the front-end and a Long Short-Term Memory (LSTM) based acoustic model at the back-end.
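The sequential (feature-level) fusion described above amounts to concatenating the different feature representations along the feature axis. The sketch below is a minimal illustration of that step, assuming hypothetical dimensions (20-dimensional frame-level cepstral features and a 59-bin utterance-level texture histogram); the variable names and shapes are illustrative, not taken from the paper.

```python
import numpy as np

# Hypothetical dimensions for illustration only:
# 100 analysis frames of 20-dim cepstral features (e.g., GTCC-like),
# plus a single 59-bin utterance-level texture histogram (e.g., ATP-LBP-like).
n_frames = 100
cepstral = np.random.randn(n_frames, 20)   # frame-level cepstral features
texture_hist = np.random.rand(59)          # utterance-level image-texture features

# Sequential fusion: replicate the utterance-level descriptor across all
# frames, then concatenate it with the frame-level features column-wise.
fused = np.hstack([np.tile(texture_hist, (n_frames, 1)), cepstral])
print(fused.shape)  # (100, 79)
```

The fused matrix would then be fed to a back-end classifier (e.g., an LSTM-based acoustic model) as a single hybrid feature set.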