Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; (3) a scalable pipeline that transforms demonstrations into state–action pairs with reflective long Chain-of-Thought reasoning that sustain robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-32B achieves an average success rate of 32.5% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.
AgentNetTool: The AgentNet Tool is a cross-platform annotation application that captures user interactions across Windows, macOS, and Ubuntu. It records screen videos, mouse/keyboard events, and metadata, enabling scalable collection of real-world computer-use demonstrations.
AgentNet Method: Raw user demonstrations are processed into clean, learning-ready state–action trajectories. The resulting trajectories include inner monologue-style thoughts and action history, making them suitable for vision-language model training.
AgentNet Dataset and AgentNetBench: The processed data is curated into the AgentNet Dataset and AgentNetBench. The dataset covers diverse open-domain tasks across 100+ applications and 200+ websites. The benchmark provides task instructions, step histories, and multiple gold-standard actions per step for efficient offline evaluation.
OpenCUA Models: OpenCUA agent models are trained on the dataset using reflective Chain-of-Thought reasoning, multi-image histories, and mixed-domain data. They can execute in realistic desktop environments across operating systems to perform computer-use tasks.
Efficient and accurate annotation is essential for collecting high-quality computer-use agent data, yet no existing tools support natural, cross-platform task recording by non-technical users. To address this, we developed a user-friendly annotation tool that streamlines the collection and verification of computer-use demonstrations. It runs on annotators' personal computers and records demonstrations in the background, capturing: (1) screen videos, (2) mouse and keyboard signals, and (3) accessibility trees (AXTree).
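To make the recording format concrete, here is a minimal sketch of background capture in Python. It is illustrative only, not the AgentNet Tool implementation, and it assumes the third-party pynput and mss packages for input listening and screenshots.

```python
# Hypothetical sketch of background demonstration capture (NOT the actual
# AgentNet Tool): log timestamped mouse/keyboard events and periodically
# grab screenshots. Assumes the pynput and mss packages.
import json, os, threading, time

import mss
from pynput import keyboard, mouse

events = []  # timestamped input events
os.makedirs("frames", exist_ok=True)

def on_click(x, y, button, pressed):
    events.append({"t": time.time(), "type": "click", "x": x, "y": y,
                   "button": str(button), "pressed": pressed})

def on_press(key):
    events.append({"t": time.time(), "type": "key", "key": str(key)})

def capture_screens(interval=0.5):
    # Periodically save a screenshot of the primary monitor.
    with mss.mss() as sct:
        while True:
            sct.shot(mon=1, output=f"frames/{time.time():.3f}.png")
            time.sleep(interval)

threading.Thread(target=capture_screens, daemon=True).start()
with mouse.Listener(on_click=on_click), keyboard.Listener(on_press=on_press):
    time.sleep(60)  # record for one minute, then write the event log
with open("events.json", "w") as f:
    json.dump(events, f, indent=2)
```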
To facilitate the development of computer-use agents, we collected a large-scale computer-use agent task dataset, AgentNet. Our released dataset consists of 21K human-annotated computer-use tasks, including 17K from Windows/macOS and 3K from Ubuntu. The tasks span over 140 applications and 190 websites, often involving multi-app workflows, professional tools, and uncommon features. Compared to prior GUI datasets, AgentNet is the first desktop trajectory-level dataset that is realistic, complex, diverse, and multimodal.
Notably, we designed a novel pipeline to augment each step of a task with reflective long CoT: a generator and a reflector iteratively generate and verify the reasoning that connects the observation to the ground-truth action. 👆
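A minimal sketch of this generate-then-verify loop is shown below. The function signatures, feedback protocol, and acceptance criterion are illustrative assumptions, not the released pipeline.

```python
# Hypothetical sketch of the generator/reflector loop for reflective-CoT
# augmentation. `generator` and `reflector` stand for LLM calls; their exact
# prompts and accept/reject rules are assumptions for illustration.
def augment_step_with_cot(observation, history, gold_action,
                          generator, reflector, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        # Draft reasoning that should connect the observation (screenshot)
        # and interaction history to the ground-truth action.
        thought = generator(observation=observation, history=history,
                            target_action=gold_action, feedback=feedback)
        # Verify the draft; the reflector returns a verdict plus a critique
        # that conditions the next generation round.
        verdict, feedback = reflector(observation, history, gold_action, thought)
        if verdict == "accept":
            return {"observation": observation,
                    "thought": thought,
                    "action": gold_action}
    return None  # drop steps whose reasoning cannot be verified
```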
We conduct supervised fine-tuning (SFT) on Kimi-VL-A3B, Qwen2-VL-7B, Qwen2.5-VL-7B, and Qwen2.5-VL-32B, obtaining our OpenCUA model variants: OpenCUA-A3B, OpenCUA-Qwen2-7B, OpenCUA-7B, and OpenCUA-32B. Our best models, OpenCUA-7B and OpenCUA-32B, achieve superior performance on agent grounding and planning benchmarks.
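The training objective itself is standard next-token prediction. The sketch below shows the core idea, assuming a HuggingFace-style vision-language model interface (pixel_values, .logits) and loss computed only on the reflective CoT and action tokens; it is not the exact OpenCUA training code.

```python
# Minimal sketch of the SFT objective: next-token cross-entropy computed only
# on the agent's reflective CoT and action tokens. Observation/prompt
# positions are masked by setting their labels to -100.
import torch.nn.functional as F

def sft_loss(model, input_ids, attention_mask, pixel_values, labels):
    outputs = model(input_ids=input_ids,
                    attention_mask=attention_mask,
                    pixel_values=pixel_values)
    # Shift so that position t predicts token t+1.
    logits = outputs.logits[:, :-1, :]
    targets = labels[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1),
                           ignore_index=-100)
```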
OSWorld originally curated 369 human-crafted tasks spanning a diverse set of applications, each with its own environment setup and evaluation script. The OSWorld team has since re-verified every task, addressing issues such as outdated dependencies, evaluation bugs, and ambiguous instructions. The resulting improved benchmark is released as OSWorld-Verified; details can be found in the Introducing OSWorld-Verified post.
This large-scale verification required substantial engineering effort; please credit the outstanding work of the OSWorld team.
Model | SR (%) @ 15 Steps | SR (%) @ 50 Steps | SR (%) @ 100 Steps |
---|---|---|---|
Proprietary | |||
OpenAI CUA | 26.0 | 31.3 | 31.4 |
Seed1.5-VL | 27.9 | – | 34.1 |
Claude 3.7 Sonnet | 27.1 | 35.8 | 35.9 |
Claude 4 Sonnet | 31.2 | 43.9 | 41.5 |
Open-Source | |||
Qwen2.5-VL-32B-Instruct | 3.0 | – | 3.9 |
Qwen2.5-VL-72B-Instruct | 4.4 | – | 5.0 |
Kimi-VL-A3B | 9.7 | – | 10.3 |
UI-TARS-72B-DPO | 24.0 | 25.8 | 27.1 |
UI-TARS-1.5-7B | 24.5 | 27.3 | 27.4 |
OpenCUA-7B | 24.3 | 27.9 | 26.6 |
OpenCUA-32B | 29.7 | 34.1 | 34.8 |
To make evaluation stable, fast, and environment-free, we built AgentNetBench, an offline computer-use agent evaluation benchmark. It comprises 100 representative tasks selected from the AgentNet dataset, covering Windows and macOS platforms and diverse domains. Each task was manually reviewed to refine goals and remove redundant actions. Notably, because computer-use tasks inherently admit multiple valid actions at a given step, we manually provide several gold action options per step.
Model | Coord. SR (%) | Content SR (%) | Func. SR (%) | Avg. SR (%) |
---|---|---|---|---|
Qwen2.5-VL-7B | 50.7 | 40.8 | 3.1 | 48.0 |
Aguvis-7B | 56.7 | 43.3 | 0.0 | 52.4 |
Qwen2.5-VL-32B | 66.6 | 47.2 | 41.5 | 64.8 |
Qwen2.5-VL-72B | 67.2 | 52.6 | 50.5 | 67.0 |
OpenAI CUA | 71.7 | 57.3 | 80.0 | 73.1 |
OpenCUA-7B | 75.4 | 46.4 | 53.6 | 71.0 |
OpenCUA-32B | 78.7 | 46.0 | 55.2 | 73.2 |
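For intuition about how the Coord./Content/Func. columns above can be scored offline, here is a hypothetical sketch of step-level matching against multiple gold options. The action types, field names, and distance threshold are illustrative assumptions, not the released AgentNetBench evaluator.

```python
# Hypothetical sketch of offline step scoring against multiple gold action
# options; thresholds and field names are assumptions for illustration.
import math

def match_action(pred, gold, dist_thresh=0.05):
    if pred["type"] != gold["type"]:
        return False
    if pred["type"] == "click":
        # Coordinate actions: predicted point must land close enough to the
        # gold point (coordinates assumed normalized to [0, 1]).
        return math.dist((pred["x"], pred["y"]),
                         (gold["x"], gold["y"])) <= dist_thresh
    if pred["type"] == "type":
        # Content actions: compare the typed text.
        return pred["text"].strip().lower() == gold["text"].strip().lower()
    # Function-style actions (hotkeys, scrolls, etc.): compare arguments.
    return pred.get("args") == gold.get("args")

def step_correct(pred, gold_options):
    # A step counts as correct if the prediction matches ANY gold option.
    return any(match_action(pred, gold) for gold in gold_options)
```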
We evaluate our models on three GUI grounding benchmarks: ScreenSpot-v2, ScreenSpot-Pro, and OSWorld-G.
Model | ScreenSpot-v2 | ScreenSpot-Pro | OSWorld-G |
---|---|---|---|
UI-TARS-7B | 91.6 | 35.7 | 47.5 |
Operator | 70.5 | 36.6 | 40.6 |
Qwen2.5-VL-3B | 80.9 | 25.9 | 27.3 |
Qwen2.5-VL-7B | 88.8 | 27.6 | 31.4 |
Qwen2.5-VL-32B | 91.3 | 39.4 | 46.5 |
OpenCUA-7B | 88.5 | 23.7 | 45.7 |
OpenCUA-A3B | 91.4 | 28.5 | 48.6 |
OpenCUA-2.5-7B | 92.3 | 50.0 | 55.3 |
OpenCUA-2.5-72B | 93.6 | 54.3 | 58.9 |
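As a reference point, grounding benchmarks of this kind typically count a prediction as correct when the predicted click point falls inside the target element's ground-truth bounding box. The sketch below illustrates that scoring rule with assumed field names; it is not the official evaluation code of these benchmarks.

```python
# Sketch of the usual grounding metric: a prediction is correct when the
# predicted click point lies inside the ground-truth bounding box of the
# target element. Field names and pixel-coordinate convention are assumed.
def grounding_correct(pred_x, pred_y, bbox):
    x1, y1, x2, y2 = bbox
    return x1 <= pred_x <= x2 and y1 <= pred_y <= y2

def grounding_accuracy(predictions, examples):
    hits = sum(grounding_correct(p["x"], p["y"], e["bbox"])
               for p, e in zip(predictions, examples))
    return 100.0 * hits / len(examples)
```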
Here are some task trajectories selected from the evaluation results of OpenCUA-7B on OSWorld and WindowsAgentArena 👇
Analysis
Data Scaling and Test-Time Scaling
Our method enables performance to scale effectively with increased training data, and the high Pass@N performance demonstrates that OpenCUA-7B has great potential for test-time scaling.
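For context, Pass@N is commonly estimated with the unbiased estimator of Chen et al. (2021): with n sampled rollouts per task and c successes, pass@N = 1 - C(n-c, N) / C(n, N), averaged over tasks. The sketch below uses this standard estimator as an assumption about common practice, not as a description of our exact evaluation protocol.

```python
# Unbiased Pass@N estimator (Chen et al., 2021): with n rollouts per task and
# c successes, pass@N = 1 - C(n-c, N) / C(n, N). Shown for reference; the
# exact OpenCUA protocol may differ.
from math import comb

def pass_at_n(n, c, N):
    if n - c < N:
        return 1.0  # every size-N subset of rollouts contains a success
    return 1.0 - comb(n - c, N) / comb(n, N)

# Example: 16 rollouts per task, 4 of them successful, estimate Pass@8.
print(pass_at_n(16, 4, 8))  # ~= 0.96
```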
We appreciate Su Yu, Caiming Xiong, and Binyuan Hui for their helpful feedback and discussions on this work. We thank Zaida Zhou, Zhengtao Wang, Flood Sung, Hao Hu, Huarong Chen, Calvin, Qizheng Gu, and Dikang Du from the Kimi Team for providing great infrastructure and for helpful discussions. We sincerely thank all of our annotators for their great effort on this project.
If you find this work useful, please consider citing our paper:
@article{OpenCUA2025,
  title={OpenCUA: Open Foundations for Computer-Use Agents},
  author={Wang, Xinyuan and Wang, Bowen and Lu, Dunjie and Yang, Junlin and Xie, Tianbao and Wang, Junli and Deng, Jiaqi and Guo, Xiaole and Xu, Yiheng and Wu, Chen Henry and Shen, Zhennan and Li, Zhuokai and Li, Ryan and Li, Xiaochuan and Chen, Junda and Zheng, Boyuan and Li, Peihang and Lei, Fangyu and Cao, Ruisheng and Fu, Yeqiao and Shin, Dongchan and Shin, Martin and Hu, Jiarui and Wang, Yuyan and Chen, Jixuan and Ye, Yuxiao and Zhang, Danyang and Wang, Yipu and Wang, Heng and Yang, Diyi and Zhong, Victor and Charles, Y. and Yang, Zhilin and Yu, Tao},
  year={2025},
  url={https://opencua.xlang.ai/}
}