Frequency Domain-Based Detection of Generated Audio

Emily R. Bartusiak; Edward J. Delp

doi:10.2352/ISSN.2470-1173.2021.4.MWSF-273

Abstract

Attackers may manipulate audio with the intent of presenting falsified reports, changing the opinion of a public figure, or winning influence and power. The prevalence of inauthentic multimedia continues to rise, so it is imperative to develop tools that determine the legitimacy of media. We present a method that analyzes audio signals to determine whether they contain real human voices or fake human voices (i.e., voices generated by neural acoustic and waveform models). Instead of analyzing the audio signals directly, the proposed approach converts them into spectrogram images that display frequency, intensity, and temporal content, and evaluates those images with a Convolutional Neural Network (CNN). We show that our approach, trained on both genuine and synthesized human voice signals, achieves high accuracy on this classification task.
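The spectrogram conversion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the frame length, hop size, Hann window, sample rate, and the synthetic test tone are all assumptions chosen for illustration, and the CNN stage is not shown. The sketch uses only the Python standard library so that it stays self-contained; a practical pipeline would use a numerical library with an FFT.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def spectrogram(signal, frame_len=64, hop=32):
    """Return a list of time frames; each frame holds magnitudes in dB
    for frequency bins 0..frame_len // 2 (assumed parameters, not the paper's)."""
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Window one short segment of the signal
        seg = [signal[start + i] * window[i] for i in range(frame_len)]
        row = []
        for k in range(n_bins):
            # Direct DFT of the windowed segment at bin k
            acc = sum(
                seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            # Small offset avoids log10(0) for silent bins
            row.append(20 * math.log10(abs(acc) + 1e-10))
        frames.append(row)
    return frames


# Synthetic 1 kHz tone sampled at 8 kHz, standing in for a speech recording
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]

spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
print(peak_hz)  # 1000.0
```

Each row of `spec` is one time slice and each column one frequency bin, so the nested list can be rendered as an image with time on one axis, frequency on the other, and intensity as pixel value. That image is the kind of input a CNN classifier would consume.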

Electronic Imaging

2470-1173

Society for Imaging Science and Technology

7003 Kilworth Lane, Springfield, VA 22151 USA

Published: 18 January 2021

Pages: 273-1 to 273-7

Keywords: Audio signal classification; Machine Learning; Convolutional Neural Network (CNN); Spectrograms; Frequency-domain analysis; Falsified media; Spoofing detection; Voice authentication