Special issue CAPT 2025 FastTrack
Volume: 0 | Article ID: 030416
A High-Fidelity Data Acquisition Framework for Training Visual Language Models for Desktop Environments
Abstract

Large-scale visual language models (VLMs) show great potential for desktop automation, but their performance depends heavily on large, high-quality imitation learning datasets. Current data acquisition methods commonly suffer from low synchronization accuracy and high storage costs, which in turn exacerbate covariate shift. To address these issues, this paper proposes and implements sict, a high-fidelity, storage-efficient framework for acquiring visual-behavioral data and training on it. Through a multiprocess asynchronous architecture built on high-precision monotonic clocks and variable-frame-rate video encoding, the framework reduces storage by more than 99% while maintaining nanosecond-level synchronization accuracy. Using this framework, the study constructs a hierarchical benchmark dataset of desktop operations and fine-tunes the Qwen2.5-VL-7B model. Experimental results show that the 7B model trained with the sict framework outperforms a zero-shot model ten times larger by a wide margin, demonstrating that the fidelity of data collection is a key factor in determining the upper bound of model capability. This work provides an efficient and practical solution for training highly capable desktop agents.
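
The abstract does not give implementation details, so the following is only a minimal sketch of the two ideas it names, under stated assumptions: events and frames are stamped with a monotonic nanosecond clock, and unchanged frames are dropped so that the stored frame rate varies with screen activity. The names `Stamped`, `stamp`, `drop_unchanged`, and `frame_at` are hypothetical and are not taken from the sict framework; byte-for-byte frame comparison stands in for whatever change detection and video encoding the authors actually use.

```python
import bisect
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Stamped:
    """A payload tagged with a monotonic timestamp (hypothetical record type)."""
    t_ns: int       # nanoseconds from a monotonic clock, never goes backwards
    payload: bytes  # raw frame bytes or a serialized input event


def stamp(payload: bytes, clock: Callable[[], int] = time.monotonic_ns) -> Stamped:
    # time.monotonic_ns() is unaffected by wall-clock adjustments, so
    # timestamps taken in one capture session can be ordered reliably.
    return Stamped(clock(), payload)


def drop_unchanged(frames: Iterable[Stamped]) -> List[Stamped]:
    """Keep a frame only if it differs from the last kept frame.

    Assumption: exact byte equality is used as the change test. The kept
    frames have irregular spacing, i.e. a variable frame rate.
    """
    kept: List[Stamped] = []
    for frame in frames:
        if not kept or frame.payload != kept[-1].payload:
            kept.append(frame)
    return kept


def frame_at(kept: List[Stamped], t_ns: int) -> Optional[Stamped]:
    """Return the frame that was on screen at time t_ns, if any.

    This is the latest kept frame whose timestamp is <= t_ns.
    """
    times = [f.t_ns for f in kept]
    i = bisect.bisect_right(times, t_ns)
    return kept[i - 1] if i > 0 else None
```

In this sketch, an input event stamped with the same clock can be paired with its on-screen frame by calling `frame_at(kept, event.t_ns)`, which is the kind of alignment that accurate synchronization is meant to guarantee.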

  Cite this article 

Lei Zhu, Chenhao Duan, Xiaoyan Xue, Yuan Zhang, Chongtao Sun, Xiao Xing, Yanping Du, "A High-Fidelity Data Acquisition Framework for Training Visual Language Models for Desktop Environments," in Journal of Imaging Science and Technology, 2026, pp. 1-7, https://doi.org/10.2352/J.ImagingSci.Technol.2026.70.3.030416

  Copyright statement 
Copyright © Society for Imaging Science and Technology 2026
  Article timeline 
  • received September 2025
  • accepted January 2026
