The direct correlations between modality-driven voice-source parameters and visual texture features were investigated. A perceptual experiment was conducted using vowel sounds with three representative phonation types (modal, creaky, and breathy) and texture images annotated with semantic terms. For quantitative analyses, acoustic features measuring vocal fold vibration, periodicity, spectral noise level, fundamental frequency, and energy were calculated. Computational texture features comprising coarseness, contrast, directionality, busyness, complexity, strength, and brightness were extracted. The results showed that the most important feature is the amplitude difference between the first two harmonics (H1-H2), which correlates significantly with coarseness, contrast, busyness, complexity, strength, and brightness. Harmonic-to-Noise Ratios (HNRs) correlate strongly with coarseness, busyness, complexity, and strength. Significant correlations were also observed between Cepstral Peak Prominence (CPP) and coarseness, between fundamental frequency (F0) and both complexity and brightness, and between energy and strength. These parametric correlations can serve as basic scientific knowledge for cross-modal mapping.
Win Thuzar Kyaw and Yoshinori Sagisaka, "Studies on Cross-modal Feature-based Mapping from Voice-source to Texture through Image Association by Listening Speech," Journal of Imaging Science and Technology, 2022, pp. 030511-1–030511-13, https://doi.org/10.2352/J.ImagingSci.Technol.2022.66.3.030511
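To make the feature-to-feature correlation idea concrete, the sketch below shows one possible way to estimate H1-H2 from a vowel segment and correlate it with coarseness ratings of associated textures. This is not the paper's implementation: the synthetic vowel signals, the fixed F0 value, the harmonic-peak search tolerance, and the coarseness rating values are all illustrative assumptions.

```python
# A minimal sketch (assumptions noted above, not the authors' method) of
# estimating H1-H2 from a vowel segment and correlating it with texture ratings.
import numpy as np
from scipy.stats import pearsonr

def h1_h2_db(signal, fs, f0):
    """Amplitude difference (dB) between the first two harmonics.

    `f0` is assumed to be known (e.g. from a separate pitch tracker);
    harmonic amplitudes are read off a Hann-windowed magnitude spectrum.
    """
    windowed = signal * np.hanning(len(signal))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)

    def harmonic_amp(target_hz, tol_hz=20.0):
        # Pick the strongest spectral peak within +/- tol_hz of the harmonic.
        band = (freqs > target_hz - tol_hz) & (freqs < target_hz + tol_hz)
        return spectrum[band].max()

    h1 = harmonic_amp(f0)
    h2 = harmonic_amp(2 * f0)
    return 20 * np.log10(h1 / h2)

if __name__ == "__main__":
    fs, f0 = 16000, 120.0
    t = np.arange(0, 0.5, 1.0 / fs)

    # Synthetic "vowels" with varying H1 dominance, standing in for
    # breathy-to-creaky phonation differences.
    h1_h2_values = []
    for h1_gain in (2.0, 1.0, 0.5):
        vowel = h1_gain * np.sin(2 * np.pi * f0 * t) + np.sin(2 * np.pi * 2 * f0 * t)
        h1_h2_values.append(h1_h2_db(vowel, fs, f0))

    # Hypothetical mean coarseness ratings for the associated texture images.
    coarseness_ratings = [4.2, 3.1, 1.8]
    r, p = pearsonr(h1_h2_values, coarseness_ratings)
    print(f"H1-H2 (dB): {np.round(h1_h2_values, 2)}")
    print(f"Pearson r = {r:.3f}, p = {p:.3f}")
```

In practice, one value of each acoustic feature per stimulus would be paired with the mean semantic rating of the texture images chosen by listeners, and the same Pearson-style analysis would be repeated for HNR, CPP, F0, and energy against the seven texture features.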