Volume: 35 | Article ID: MWSF-372
Synthetic speech attribution using self supervised audio spectrogram transformer
DOI: 10.2352/EI.2023.35.4.MWSF-372 | Published Online: January 2023
Abstract

Widely available speech generation tools have made it easier to synthesize convincing human speech. This necessitates the development of forensic methods that can authenticate and attribute speech signals. In this paper, we examine the speech attribution task, which identifies the origin of a speech signal. Our proposed method, the Synthetic Speech Attribution Transformer (SSAT), converts speech signals into mel spectrograms and uses a self-supervised pretrained transformer for attribution. This transformer is pretrained on two large publicly available audio datasets: AudioSet and LibriSpeech. We fine-tune the pretrained transformer on three speech attribution datasets: the DARPA SemaFor Audio Attribution dataset, the ASVspoof2019 dataset, and the 2022 IEEE SP Cup dataset. SSAT achieves high closed-set accuracy on all three datasets (99.8% on the ASVspoof2019 dataset, 96.3% on the SP Cup dataset, and 93.4% on the DARPA SemaFor Audio Attribution dataset). We also investigate the method’s ability to generalize to unknown speech generation methods (the open-set scenario). SSAT also performs well in this setting, achieving an open-set accuracy of 90.2% on the ASVspoof2019 dataset and 88.45% on the DARPA SemaFor Audio Attribution dataset. Finally, we show that our approach is robust to typical compression rates used by YouTube for speech signals.
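
The first stage described in the abstract, converting a speech signal into a mel spectrogram, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no front-end settings, so the sampling rate, FFT size, hop length, number of mel bands, and all function names below are assumptions chosen only for illustration.

```python
import numpy as np

def hz_to_mel(f):
    # Common (HTK-style) mel-scale formula
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose edges are evenly spaced on the mel scale
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(signal, sr=16000, n_fft=1024, hop=160, n_mels=64):
    # All default values are assumed, not taken from the paper
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[t * hop:t * hop + n_fft] * window
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2    # (n_frames, n_fft // 2 + 1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T   # (n_frames, n_mels)
    return np.log(mel + 1e-10).T                        # (n_mels, n_frames)

# Usage: a one-second 440 Hz tone stands in for a speech signal
sr = 16000
tone = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
spec = log_mel_spectrogram(tone, sr=sr)
print(spec.shape)
```

The resulting two-dimensional array (mel bands by time frames) is the kind of image-like input a spectrogram transformer would consume. In practice an audio library would typically replace this hand-written front end.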

  Cite this article 

Amit Kumar Singh Yadav, Emily R. Bartusiak, Kratika Bhagtani, Edward J. Delp, "Synthetic speech attribution using self supervised audio spectrogram transformer," in Electronic Imaging, 2023, pp 372-1 - 372-11, https://doi.org/10.2352/EI.2023.35.4.MWSF-372

  Copyright statement 
Copyright © 2023, Society for Imaging Science and Technology
Electronic Imaging
2470-1173
Society for Imaging Science and Technology
IS&T 7003 Kilworth Lane, Springfield, VA 22151 USA