A High-Fidelity Data Acquisition Framework for Training Visual Language Models for Desktop Environments

Lei Zhu; Chenhao Duan; Xiaoyan Xue; Yuan Zhang; Chongtao Sun; Xiao Xing; Yanping Du

doi:10.2352/J.ImagingSci.Technol.2026.70.3.030416

Special issue CAPT 2025 FastTrack

Volume: 0 | Article ID: 030416

Keywords: visual language models; data acquisition; behavioral cloning; imitation learning

Abstract

Large visual language models (VLMs) show great potential for desktop automation, but their performance depends heavily on extensive, high-quality imitation learning datasets. Existing data acquisition methods commonly suffer from low synchronization accuracy and high storage costs, which in turn exacerbate covariate shift. To address these issues, this paper proposes and implements sict, a high-fidelity, storage-efficient framework for visual-behavioral data acquisition and training. Through a multiprocess asynchronous architecture that leverages high-precision monotonic clocks and variable-frame-rate video coding, the framework saves more than 99% of storage space while guaranteeing nanosecond-level data synchronization accuracy. Using this framework, the study constructs a hierarchical benchmark dataset of desktop operations and fine-tunes the Qwen2.5-VL-7B model. Experimental results show that the 7B model trained with the sict framework outperforms a zero-shot model ten times larger by a wide margin, demonstrating that the fidelity of data collection is a key factor in determining a model’s maximum capability. This work provides an efficient and feasible solution for training highly capable desktop agents.
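The two mechanisms named in the abstract, monotonic-clock timestamps and variable-frame-rate storage, can be illustrated with a minimal Python sketch. This is not the sict implementation, which the abstract does not detail. The `Frame` type and the helpers `keep_changed_frames`, `frame_for_event`, and `now_ns` are hypothetical names introduced here, and raw bytes stand in for real screenshots. The sketch assumes that all recording processes read the same system-wide monotonic clock and that a frame is worth storing only when it differs from the last stored frame.

```python
import bisect
import hashlib
import time
from dataclasses import dataclass


@dataclass
class Frame:
    t_ns: int      # monotonic timestamp in nanoseconds
    pixels: bytes  # raw frame bytes (stand-in for a real screenshot)


def now_ns() -> int:
    """Read the monotonic clock in integer nanoseconds.

    The unit is nanoseconds; the effective resolution depends on the platform.
    """
    return time.monotonic_ns()


def keep_changed_frames(frames):
    """Variable-frame-rate style filter (illustrative assumption):
    drop every frame whose content equals the last kept frame."""
    kept, last_digest = [], None
    for frame in frames:
        digest = hashlib.sha256(frame.pixels).digest()
        if digest != last_digest:
            kept.append(frame)
            last_digest = digest
    return kept


def frame_for_event(kept, event_t_ns):
    """Return the most recent kept frame at or before the event timestamp,
    or None if the event precedes every kept frame."""
    times = [frame.t_ns for frame in kept]
    i = bisect.bisect_right(times, event_t_ns) - 1
    return kept[i] if i >= 0 else None
```

In use, a capture process would stamp each screenshot with `now_ns()`, an input-logging process would stamp each mouse or keyboard event the same way, and `frame_for_event` would pair every action with the screen state the user actually saw. On a mostly static desktop, `keep_changed_frames` discards most frames, which is the intuition behind the storage saving claimed in the abstract.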

Journal Title : Journal of Imaging Science and Technology

Publisher Name : Society for Imaging Science and Technology

Cite this article

Lei Zhu, Chenhao Duan, Xiaoyan Xue, Yuan Zhang, Chongtao Sun, Xiao Xing, Yanping Du, "A High-Fidelity Data Acquisition Framework for Training Visual Language Models for Desktop Environments" in Journal of Imaging Science and Technology, 2026, pp 1 - 7, https://doi.org/10.2352/J.ImagingSci.Technol.2026.70.3.030416

Article timeline

received September 2025
accepted January 2026
