Speech processing is used both to transcribe human speech to text and to identify speakers in biometric systems. Speaker verification requires robust algorithms to prevent an adversary from impersonating another speaker. Previous research has demonstrated that specially crafted additive noise can cause a speaker to be misclassified as a specific target. In this paper, we study whether targeted additive noise can thwart speaker verification without affecting speech-to-text decoding. Both applications commonly rely on Mel-frequency cepstral coefficients (MFCCs) for feature encoding and on Gaussian mixture models (GMMs) for statistical modeling. We attempt to induce a desired change in the probability assigned by a speaker model used for speaker classification, while preserving the likelihood under a separate speech model used for speech decoding.
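The constrained objective described above can be illustrated with a minimal sketch. It is not the method of the paper, which the abstract does not specify. It assumes a single MFCC-like feature vector, diagonal-covariance GMMs, and a perturbation applied directly in feature space rather than to the audio waveform. The names `gmm_loglik_and_grad` and `constrained_step` are hypothetical. Each step moves the feature vector so that the speaker-model log-likelihood decreases, after removing the component of that move that would change the speech-model log-likelihood to first order.

```python
import numpy as np

def gmm_loglik_and_grad(x, weights, means, variances):
    """Log-likelihood of one feature vector x under a diagonal-covariance
    GMM, together with its gradient with respect to x.

    weights: (K,), means: (K, D), variances: (K, D), x: (D,)
    """
    diff = x - means                                    # (K, D)
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                - 0.5 * np.sum(diff ** 2 / variances, axis=1))
    m = log_comp.max()                                  # log-sum-exp trick
    loglik = m + np.log(np.sum(np.exp(log_comp - m)))
    resp = np.exp(log_comp - loglik)                    # responsibilities, (K,)
    grad = -(resp[:, None] * diff / variances).sum(axis=0)
    return loglik, grad

def constrained_step(x, speaker_gmm, speech_gmm, step=0.01):
    """One illustrative update (an assumption, not the paper's algorithm):
    lower the speaker-model log-likelihood while keeping the speech-model
    log-likelihood unchanged to first order."""
    _, g_spk = gmm_loglik_and_grad(x, *speaker_gmm)
    _, g_sp = gmm_loglik_and_grad(x, *speech_gmm)
    denom = g_sp @ g_sp
    if denom > 1e-12:
        # Remove the part of the speaker gradient that lies along the
        # speech-model gradient, so the move is tangent to its level set.
        g_spk = g_spk - (g_spk @ g_sp) / denom * g_sp
    return x - step * g_spk
```

Each GMM is passed as a `(weights, means, variances)` tuple. Projecting out the speech-model gradient is only one simple way to express the "preserve decoding" constraint; a penalty term or a constrained optimizer would be alternatives, and a real attack would also have to map the feature-space change back to additive noise in the waveform.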
Alireza Farrokh Baroughi, Scott Craver, Daniel Douglas, "Attacks on Speaker Identification Systems Constrained to Speech-to-Text Decoding" in Proc. IS&T Int’l. Symp. on Electronic Imaging: Media Watermarking, Security, and Forensics, 2016, https://doi.org/10.2352/ISSN.2470-1173.2016.8.MWSF-073