
In audio-driven facial animation, most existing research focuses on head animation alone, and few methods can generate full-body videos. Even the approaches that do produce full-body videos typically animate only the face, leading to the prevalent problem of head–body separation. This disjointedness seriously undermines visual coherence and the naturalness of human–computer interaction. To overcome these limitations, we introduce SynPoseVAE, an enhanced version of the PoseVAE model that incorporates body-related information into pose prediction. SynPoseVAE acquires detailed human pose data by applying a bottom-up human pose estimation method to detect body keypoints, and it feeds these keypoints into pose prediction, thereby mitigating head–body separation. In addition, we design a new loss function that accounts for both head and body postures; it acts as a regularizer that improves the coordination between head and body movements. Optimizing the model with this loss significantly reduces head–body separation, yielding animations that are more natural and coherent. Experimental results show that SynPoseVAE outperforms traditional methods: it generates highly coordinated full-body animations, markedly improving the quality of human–computer interaction in voice-driven facial animation synthesis.
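The abstract does not spell out the combined head–body loss. As a minimal sketch, assuming an L1 reconstruction term on head poses, a weighted L1 term on detected body keypoints, and the standard VAE KL regularizer (the names `synposevae_loss`, `body_weight`, and `kl_weight` are illustrative, not the authors' published formulation), such an objective might look like:

```python
import torch
import torch.nn.functional as F

def synposevae_loss(pred_head, gt_head, pred_body, gt_body,
                    mu, logvar, body_weight=1.0, kl_weight=1e-3):
    """Hypothetical combined head-body objective for a PoseVAE-style model.

    pred_head / gt_head: (B, T, D_head) predicted / ground-truth head poses.
    pred_body / gt_body: (B, T, D_body) predicted / ground-truth body keypoints.
    mu, logvar:          (B, Z) VAE posterior parameters.
    The weighting scheme and the KL term are assumptions based on the
    standard VAE setup, not details taken from the paper.
    """
    # Reconstruction terms for head and body poses, computed over the
    # same audio-conditioned sequence so gradients couple both parts.
    head_loss = F.l1_loss(pred_head, gt_head)
    body_loss = F.l1_loss(pred_body, gt_body)

    # Standard VAE KL divergence against a unit Gaussian prior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    return head_loss + body_weight * body_loss + kl_weight * kl
```

Under this reading, the shared latent and the joint reconstruction terms are what enforce coordination between head and body motion, which is the role the abstract assigns to the new loss.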
Junyi Gao, Xiuting Tao, Yigang Wang, "SynPoseVAE: A Multimodal Fusion Framework for Audio-Driven Full-Body Animation Synthesis," Journal of Imaging Science and Technology, 2026, pp. 1-8, https://doi.org/10.2352/J.ImagingSci.Technol.2026.70.1.010404