It is now well known that practical steganalysis using machine learning techniques can be strongly biased by the problem of Cover Source Mismatch. Such a phenomenon usually occurs in machine learning when the training and the testing sets are drawn from different sources, i.e. when they do not share the same statistical properties. In the field of steganalysis, however, because the signal targeted by steganalysis methods is so weak, this mismatch can drastically lower detection performance. This paper aims to define, through practical experiments, what a source is in steganalysis.
By assuming that two cover datasets coming from a common source should yield comparable steganalysis performance, it is shown that the definition of a source is related more to the processing pipeline applied to the RAW images than to the sensor or the acquisition setup of the pictures.
In order to measure the discrepancy between sources, this paper introduces the concept of
Quentin Giboulot, Rémi Cogranne, Patrick Bas, "Steganalysis into the Wild: How to Define a Source?" in Proc. IS&T Int’l. Symp. on Electronic Imaging: Media Watermarking, Security, and Forensics, 2018, pp. 318-1 – 318-12, https://doi.org/10.2352/ISSN.2470-1173.2018.07.MWSF-318
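To illustrate the evaluation idea behind the assumption above, the following Python sketch compares the error of a detector trained and tested on the same cover dataset with its error when tested on a second dataset. It is not the authors' actual pipeline: the feature matrices, the logistic-regression detector and all variable names (X_A, X_B, detection_error, ...) are placeholders introduced here purely for illustration, standing in for steganalysis features extracted from real cover/stego images.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder feature matrices standing in for steganalysis features extracted
# from two candidate cover datasets A and B. Labels: 0 = cover, 1 = stego.
# In a real experiment these would come from actual images, not random numbers.
rng = np.random.default_rng(0)
X_A, y_A = rng.normal(size=(2000, 200)), rng.integers(0, 2, 2000)
X_B, y_B = rng.normal(size=(2000, 200)), rng.integers(0, 2, 2000)

def detection_error(X_tr, y_tr, X_te, y_te):
    """Train a simple linear detector and return its error rate on the test set."""
    clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    return 1.0 - accuracy_score(y_te, clf.predict(X_te))

# Hold out half of each dataset for testing.
XA_tr, XA_te, yA_tr, yA_te = train_test_split(X_A, y_A, test_size=0.5, random_state=1)
XB_tr, XB_te, yB_tr, yB_te = train_test_split(X_B, y_B, test_size=0.5, random_state=1)

intrinsic_A = detection_error(XA_tr, yA_tr, XA_te, yA_te)   # train on A, test on A
intrinsic_B = detection_error(XB_tr, yB_tr, XB_te, yB_te)   # train on B, test on B
cross_AB    = detection_error(XA_tr, yA_tr, XB_te, yB_te)   # train on A, test on B
cross_BA    = detection_error(XB_tr, yB_tr, XA_te, yA_te)   # train on B, test on A

# If A and B really come from the same source, the cross-dataset errors should
# stay close to the intrinsic ones; a large gap is a symptom of cover source mismatch.
print(intrinsic_A, intrinsic_B, cross_AB, cross_BA)
```

With real features, comparing the intrinsic errors against the cross-dataset errors gives a simple way to check whether two datasets can be treated as one source or whether a mismatch is present.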