
Large-scale vision-language models (VLMs) show great potential for desktop automation, but their performance depends heavily on large, high-quality imitation learning datasets. Current data acquisition methods commonly suffer from low synchronization accuracy and high storage costs, which in turn exacerbate covariate shift. To address these issues, this paper proposes and implements sict, a high-fidelity, storage-efficient framework for visual-behavioral data acquisition and training. Using a multiprocess asynchronous architecture built on high-precision monotonic clocks and variable-frame-rate video coding, the framework reduces storage requirements by more than 99% while maintaining nanosecond-level data synchronization accuracy. The study uses this framework to construct a hierarchical benchmark dataset of desktop operations and to fine-tune the Qwen2.5-VL-7B model. Experimental results show that the 7B model trained on data collected with sict substantially outperforms a zero-shot model ten times its size, demonstrating that the fidelity of data collection is a key factor in determining a model's performance ceiling. This work provides an efficient and practical solution for training highly capable desktop agents.
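As a rough illustration of the two ideas named in the abstract (monotonic-clock timestamps and variable-frame-rate storage), the following minimal Python sketch stamps frames in nanoseconds, keeps a frame only when its content changes, and maps an input event to the frame that was on screen when it occurred. It is not the sict implementation: the names `Frame`, `keep_changed_frames`, `frame_for_event`, and `now_ns` are hypothetical, and raw byte strings stand in for real screenshots.

```python
import bisect
import time
from dataclasses import dataclass


@dataclass
class Frame:
    t_ns: int      # capture time from a monotonic clock, in nanoseconds
    pixels: bytes  # stand-in for real screenshot data (assumption)


def now_ns() -> int:
    """Read a monotonic clock that never jumps backwards, in nanoseconds."""
    return time.monotonic_ns()


def keep_changed_frames(frames):
    """Variable-frame-rate style filtering: drop any frame whose content
    is identical to the most recently kept frame."""
    kept = []
    for frame in frames:
        if not kept or frame.pixels != kept[-1].pixels:
            kept.append(frame)
    return kept


def frame_for_event(kept, event_t_ns):
    """Return the latest kept frame captured at or before the event time,
    or None if the event precedes every frame."""
    times = [f.t_ns for f in kept]
    i = bisect.bisect_right(times, event_t_ns) - 1
    return kept[i] if i >= 0 else None


# Five captures, of which only three differ from their predecessor.
captured = [
    Frame(0, b"a"), Frame(10, b"a"), Frame(20, b"b"),
    Frame(30, b"b"), Frame(40, b"a"),
]
kept = keep_changed_frames(captured)
clicked_on = frame_for_event(kept, 25)  # a click at t = 25 ns
```

Because a static screen produces no new stored frames, storage in this sketch grows with the amount of on-screen change rather than with recording time, while the shared monotonic timestamps still let every event be matched to the exact frame the user saw.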
Lei Zhu, Chenhao Duan, Xiaoyan Xue, Yuan Zhang, Chongtao Sun, Xiao Xing, Yanping Du, "A High-Fidelity Data Acquisition Framework for Training Visual Language Models for Desktop Environments" in Journal of Imaging Science and Technology, 2026, pp 1 - 7, https://doi.org/10.2352/J.ImagingSci.Technol.2026.70.3.030416