Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; (3) a scalable pipeline that transforms demonstrations into state–action pairs with reflective long Chain-of-Thought reasoning that sustain robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-32B achieves an average success rate of 32.5% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.
AgentNetTool: The AgentNet Tool is a cross-platform annotation application that captures user interactions across Windows, macOS, and Ubuntu. It records screen videos, mouse/keyboard events, and metadata, enabling scalable collection of real-world computer-use demonstrations.
AgentNet Method: Raw user demonstrations are processed into clean, learning-ready state–action trajectories. The resulting trajectories include inner monologue-style thoughts and action history, making them suitable for vision-language model training.
AgentNet Dataset and AgentNetBench: The processed data is curated into the AgentNet Dataset and AgentNetBench. The dataset covers diverse open-domain tasks across 100+ applications and 200+ websites. The benchmark provides task instructions, step histories, and multiple gold-standard actions per step for efficient offline evaluation.
OpenCUA Models: OpenCUA agent models are trained on the dataset using reflective Chain-of-Thought reasoning, multi-image histories, and mixed-domain data. They can execute in realistic desktop environments across operating systems to perform computer-use tasks.
Efficient and accurate annotation is essential for collecting high-quality computer-use agent data, yet no existing tools support natural, cross-platform task recording by non-technical users. To address this, we developed a user-friendly annotation tool that streamlines the collection and verification of computer-use demonstrations. It runs on annotators' personal computers and records demonstrations in the background, capturing: (1) screen videos, (2) mouse and keyboard signals, and (3) accessibility trees (AXTree).
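To make the recording format concrete, here is a minimal sketch of background capture in Python. It is illustrative only, not the AgentNet Tool implementation, and it assumes the third-party pynput and mss packages for input listening and screenshots.

```python
# Hypothetical sketch of background demonstration capture (NOT the actual
# AgentNet Tool): log timestamped mouse/keyboard events and periodically
# grab screenshots. Assumes the pynput and mss packages.
import json, os, threading, time

import mss
from pynput import keyboard, mouse

events = []  # timestamped input events
os.makedirs("frames", exist_ok=True)

def on_click(x, y, button, pressed):
    events.append({"t": time.time(), "type": "click", "x": x, "y": y,
                   "button": str(button), "pressed": pressed})

def on_press(key):
    events.append({"t": time.time(), "type": "key", "key": str(key)})

def capture_screens(interval=0.5):
    # Periodically save a screenshot of the primary monitor.
    with mss.mss() as sct:
        while True:
            sct.shot(mon=1, output=f"frames/{time.time():.3f}.png")
            time.sleep(interval)

threading.Thread(target=capture_screens, daemon=True).start()
with mouse.Listener(on_click=on_click), keyboard.Listener(on_press=on_press):
    time.sleep(60)  # record for one minute, then write the event log
with open("events.json", "w") as f:
    json.dump(events, f, indent=2)
```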
To facilitate the development of computer-use agents, we collected a large-scale computer-use agent task dataset, AgentNet. Our released dataset consists of 21K human-annotated computer-use tasks, including 17K from Windows/macOS and 3K from Ubuntu. The tasks span over 140 applications and 190 websites, often involving multi-app workflows, professional tools, and uncommon features. Compared to prior GUI datasets, AgentNet is the first desktop trajectory-level dataset that is realistic, complex, diverse, and multimodal.
Notably, we designed a novel pipeline to augment each step of a task with reflective long CoT: a generator and a reflector iteratively generate and verify the reasoning that connects the observation to the ground-truth action. 👆
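A minimal sketch of this generate-then-verify loop is shown below. The function signatures, feedback protocol, and acceptance criterion are illustrative assumptions, not the released pipeline.

```python
# Hypothetical sketch of the generator/reflector loop for reflective-CoT
# augmentation. `generator` and `reflector` stand for LLM calls; their exact
# prompts and accept/reject rules are assumptions for illustration.
def augment_step_with_cot(observation, history, gold_action,
                          generator, reflector, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        # Draft reasoning that should connect the observation (screenshot)
        # and interaction history to the ground-truth action.
        thought = generator(observation=observation, history=history,
                            target_action=gold_action, feedback=feedback)
        # Verify the draft; the reflector returns a verdict plus a critique
        # that conditions the next generation round.
        verdict, feedback = reflector(observation, history, gold_action, thought)
        if verdict == "accept":
            return {"observation": observation,
                    "thought": thought,
                    "action": gold_action}
    return None  # drop steps whose reasoning cannot be verified
```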
We conduct supervised fine-tuning (SFT) on Kimi-VL-A3B, Qwen2-VL-7B, Qwen2.5-VL-7B, and Qwen2.5-VL-32B, obtaining our OpenCUA model variants: OpenCUA-A3B, OpenCUA-Qwen2-7B, OpenCUA-7B, and OpenCUA-32B. Our best models, OpenCUA-7B and OpenCUA-32B, achieve superior performance on agent grounding and planning benchmarks.
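The training objective itself is standard next-token prediction. The sketch below shows the core idea, assuming a HuggingFace-style vision-language model interface (pixel_values, .logits) and loss computed only on the reflective CoT and action tokens; it is not the exact OpenCUA training code.

```python
# Minimal sketch of the SFT objective: next-token cross-entropy computed only
# on the agent's reflective CoT and action tokens. Observation/prompt
# positions are masked by setting their labels to -100.
import torch.nn.functional as F

def sft_loss(model, input_ids, attention_mask, pixel_values, labels):
    outputs = model(input_ids=input_ids,
                    attention_mask=attention_mask,
                    pixel_values=pixel_values)
    # Shift so that position t predicts token t+1.
    logits = outputs.logits[:, :-1, :]
    targets = labels[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1),
                           ignore_index=-100)
```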
OSWorld originally curated 369 human-crafted tasks spanning a diverse set of applications, each with its own environment setup and evaluation script. The OSWorld team has since re-verified every task, addressing issues such as outdated dependencies, evaluation bugs, and ambiguous instructions. The resulting improved benchmark is released as OSWorld-Verified; details can be found in the Introducing OSWorld-Verified post.
This large-scale verification required substantial engineering effort; please credit the outstanding work of the OSWorld team.
Model | SR (%) @ 15 Steps | SR (%) @ 50 Steps | SR (%) @ 100 Steps |
---|---|---|---|
Proprietary | |||
OpenAI CUA | 26.0 | 31.3 | 31.4 |
Seed1.5-VL | 27.9 | – | 34.1 |
Claude 3.7 Sonnet | 27.1 | 35.8 | 35.9 |
Claude 4 Sonnet | 31.2 | 43.9 | 41.5 |
Open-Source | |||
Qwen2.5-VL-32B-Instruct | 3.0 | – | 3.9 |
Qwen2.5-VL-72B-Instruct | 4.4 | – | 5.0 |
Kimi-VL-A3B | 9.7 | – | 10.3 |
UI-TARS-72B-DPO | 24.0 | 25.8 | 27.1 |
UI-TARS-1.5-7B | 24.5 | 27.3 | 27.4 |
OpenCUA-7B | 24.3 | 27.9 | 26.6 |
OpenCUA-32B | 29.7 | 34.1 | 34.8 |
To make evaluation stable, fast, and environment-free, we built AgentNetBench, an offline computer-use agent evaluation benchmark. It comprises 100 representative tasks selected from the AgentNet dataset, covering Windows and macOS platforms and diverse domains. Each task was manually reviewed to refine goals and remove redundant actions. Notably, because computer-use tasks inherently admit multiple valid actions at a given step, we manually provide several gold action options per step.
Model | Coord. SR (%) | Content SR (%) | Func. SR (%) | Avg. SR (%) |
---|---|---|---|---|
Qwen2.5-VL-7B | 50.7 | 40.8 | 3.1 | 48.0 |
Aguvis-7B | 56.7 | 43.3 | 0.0 | 52.4 |
Qwen2.5-VL-32B | 66.6 | 47.2 | 41.5 | 64.8 |
Qwen2.5-VL-72B | 67.2 | 52.6 | 50.5 | 67.0 |
OpenAI CUA | 71.7 | 57.3 | 80.0 | 73.1 |
OpenCUA-7B | 75.4 | 46.4 | 53.6 | 71.0 |
OpenCUA-32B | 78.7 | 46.0 | 55.2 | 73.2 |
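For intuition about how the Coord./Content/Func. columns above can be scored offline, here is a hypothetical sketch of step-level matching against multiple gold options. The action types, field names, and distance threshold are illustrative assumptions, not the released AgentNetBench evaluator.

```python
# Hypothetical sketch of offline step scoring against multiple gold action
# options; thresholds and field names are assumptions for illustration.
import math

def match_action(pred, gold, dist_thresh=0.05):
    if pred["type"] != gold["type"]:
        return False
    if pred["type"] == "click":
        # Coordinate actions: predicted point must land close enough to the
        # gold point (coordinates assumed normalized to [0, 1]).
        return math.dist((pred["x"], pred["y"]),
                         (gold["x"], gold["y"])) <= dist_thresh
    if pred["type"] == "type":
        # Content actions: compare the typed text.
        return pred["text"].strip().lower() == gold["text"].strip().lower()
    # Function-style actions (hotkeys, scrolls, etc.): compare arguments.
    return pred.get("args") == gold.get("args")

def step_correct(pred, gold_options):
    # A step counts as correct if the prediction matches ANY gold option.
    return any(match_action(pred, gold) for gold in gold_options)
```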
We evaluate our models on three GUI grounding benchmarks: ScreenSpot-v2, ScreenSpot-Pro, and OSWorld-G.
Model | ScreenSpot-v2 | ScreenSpot-Pro | OSWorld-G |
---|---|---|---|
UI-TARS-7B | 91.6 | 35.7 | 47.5 |
Operator | 70.5 | 36.6 | 40.6 |
Qwen2.5-VL-3B | 80.9 | 25.9 | 27.3 |
Qwen2.5-VL-7B | 88.8 | 27.6 | 31.4 |
Qwen2.5-VL-32B | 91.3 | 39.4 | 46.5 |
OpenCUA-7B | 88.5 | 23.7 | 45.7 |
OpenCUA-A3B | 91.4 | 28.5 | 48.6 |
OpenCUA-2.5-7B | 92.3 | 50.0 | 55.3 |
OpenCUA-2.5-72B | 93.6 | 54.3 | 58.9 |
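As a reference point, grounding benchmarks of this kind typically count a prediction as correct when the predicted click point falls inside the target element's ground-truth bounding box. The sketch below illustrates that scoring rule with assumed field names; it is not the official evaluation code of these benchmarks.

```python
# Sketch of the usual grounding metric: a prediction is correct when the
# predicted click point lies inside the ground-truth bounding box of the
# target element. Field names and pixel-coordinate convention are assumed.
def grounding_correct(pred_x, pred_y, bbox):
    x1, y1, x2, y2 = bbox
    return x1 <= pred_x <= x2 and y1 <= pred_y <= y2

def grounding_accuracy(predictions, examples):
    hits = sum(grounding_correct(p["x"], p["y"], e["bbox"])
               for p, e in zip(predictions, examples))
    return 100.0 * hits / len(examples)
```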
Here are some task trajectories selected from the evaluation results of OpenCUA-7B on OSWorld and WindowsAgentArena 👇
Analysis
Data Scaling and Test-Time Scaling
Our method enables performance to scale effectively with increased training data, and the high Pass@N performance demonstrates that OpenCUA-7B has great potential for test-time scaling.
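For context, Pass@N is commonly estimated with the unbiased estimator of Chen et al. (2021): with n sampled rollouts per task and c successes, pass@N = 1 - C(n-c, N) / C(n, N), averaged over tasks. The sketch below uses this standard estimator as an assumption about common practice, not as a description of our exact evaluation protocol.

```python
# Unbiased Pass@N estimator (Chen et al., 2021): with n rollouts per task and
# c successes, pass@N = 1 - C(n-c, N) / C(n, N). Shown for reference; the
# exact OpenCUA protocol may differ.
from math import comb

def pass_at_n(n, c, N):
    if n - c < N:
        return 1.0  # every size-N subset of rollouts contains a success
    return 1.0 - comb(n - c, N) / comb(n, N)

# Example: 16 rollouts per task, 4 of them successful, estimate Pass@8.
print(pass_at_n(16, 4, 8))  # ~= 0.96
```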
We appreciate Su Yu, Caiming Xiong, and Binyuan Hui for their helpful feedback and discussions on this work. We thank Zaida Zhou, Zhengtao Wang, Flood Sung, Hao Hu, Huarong Chen, Calvin, Qizheng Gu, and Dikang Du from the Kimi Team for providing great infrastructure and for helpful discussions. We sincerely thank all of our annotators for their great effort on this project.
If you find this work useful, please consider citing our paper:
@article{OpenCUA2025,
  title={OpenCUA: Open Foundations for Computer-Use Agents},
  author={Wang, Xinyuan and Wang, Bowen and Lu, Dunjie and Yang, Junlin and Xie, Tianbao and Wang, Junli and Deng, Jiaqi and Guo, Xiaole and Xu, Yiheng and Wu, Chen Henry and Shen, Zhennan and Li, Zhuokai and Li, Ryan and Li, Xiaochuan and Chen, Junda and Zheng, Boyuan and Li, Peihang and Lei, Fangyu and Cao, Ruisheng and Fu, Yeqiao and Shin, Dongchan and Shin, Martin and Hu, Jiarui and Wang, Yuyan and Chen, Jixuan and Ye, Yuxiao and Zhang, Danyang and Wang, Yipu and Wang, Heng and Yang, Diyi and Zhong, Victor and Charles, Y. and Yang, Zhilin and Yu, Tao},
  year={2025},
  url={https://opencua.xlang.ai/}
}