OpenCUA: Open Foundations for Computer-Use Agents

1XLANG Lab, The University of Hong Kong, 2Moonshot AI,
3Stanford University, 4University of Waterloo, 5Carnegie Mellon University
*Equal contribution

Abstract

Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs), automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. Because these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; and (3) a scalable pipeline that transforms demonstrations into state–action pairs with reflective long Chain-of-Thought reasoning, sustaining robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-32B achieves an average success rate of 32.5% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.

OpenCUA Framework Overview


AgentNet Tool: The AgentNet Tool is a cross-platform annotation application that captures user interactions across Windows, macOS, and Ubuntu. It records screen videos, mouse/keyboard events, and metadata, enabling scalable collection of real-world computer-use demonstrations.
AgentNet Method: Raw user demonstrations are processed into clean, learning-ready state–action trajectories. The resulting trajectories include inner-monologue-style thoughts and action history, making them suitable for vision-language model training (see the illustrative schema after this list).
AgentNet Dataset and AgentNetBench: The processed data is curated into the AgentNet Dataset and AgentNetBench. The dataset covers diverse open-domain tasks across 100+ applications and 200+ websites. The benchmark provides task instructions, step histories, and multiple gold-standard actions per step for efficient offline evaluation.
OpenCUA Models: OpenCUA agent models are trained on the dataset using reflective Chain-of-Thought reasoning, multi-image histories, and mixed-domain data. They can execute in realistic desktop environments across operating systems to perform computer-use tasks.
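
To make the trajectory format concrete, the sketch below shows a minimal, hypothetical schema for one processed step and task; the field names are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    """One state-action pair in a processed trajectory (illustrative schema)."""
    screenshot: str                # observation: path to the screen capture
    thought: str                   # reflective inner-monologue reasoning
    action: str                    # executable action, e.g. a pyautogui call
    history: List[str] = field(default_factory=list)  # prior actions for context

@dataclass
class Trajectory:
    """One annotated task demonstration (illustrative schema)."""
    instruction: str               # natural-language task goal
    os: str                        # "windows", "macos", or "ubuntu"
    steps: List[Step] = field(default_factory=list)
```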

AgentNet Tool


Efficient and accurate annotation is essential for collecting high-quality computer-use agent data, yet no existing tools support natural, cross-platform task recording by non-technical users. To address this, we developed a user-friendly annotation tool that streamlines the collection and verification of computer-use demonstrations. It runs on annotators' personal computers and records demonstrations in the background, capturing: (1) screen videos, (2) mouse and keyboard signals, and (3) accessibility trees (AXTree).
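
As a rough illustration of how such background recording can work, here is a minimal Python sketch using the third-party `mss` (screenshots) and `pynput` (input events) packages; it is a simplified stand-in under those assumptions, not the AgentNet Tool's actual implementation, which also records video and accessibility trees:

```python
# pip install mss pynput
import time
import mss
from pynput import mouse, keyboard

events = []  # timestamped input events collected in the background

def on_click(x, y, button, pressed):
    events.append({"t": time.time(), "type": "click", "x": x, "y": y,
                   "button": str(button), "pressed": pressed})

def on_press(key):
    events.append({"t": time.time(), "type": "key", "key": str(key)})

# Input listeners run on background threads.
mouse_listener = mouse.Listener(on_click=on_click)
key_listener = keyboard.Listener(on_press=on_press)
mouse_listener.start()
key_listener.start()

# Periodically capture the primary monitor while the user demonstrates a task.
with mss.mss() as sct:
    for i in range(10):
        sct.shot(mon=1, output=f"frame_{i:04d}.png")
        time.sleep(1.0)

mouse_listener.stop()
key_listener.stop()
```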

AgentNet Dataset

Example tasks:
(a) Google Sheets: Calculate average and standard deviation of scores from 2023 sem 1 to 2022 sem 2 in the Google sheet Scores.
(b) Amazon: Buy the kindle version of Foundations of Computer Vision by Antonio.
(c) Slides: Customize fonts, colors, and effects for each slide layout within Slide Master in PowerPoint.
(d) Spotify: Open Spotify, search for Cui Jian, and play the song 花房姑娘.

Figure: domain distribution of AgentNet tasks.

To facilitate the development of computer-use agents, we collected AgentNet, a large-scale computer-use agent task dataset. Our released dataset consists of 21K human-annotated computer-use tasks: 17K from Windows/macOS and 3K from Ubuntu. The tasks span more than 140 applications and 190 websites, often involving multi-app workflows, professional tools, and uncommon features. Compared to prior GUI datasets, AgentNet is the first desktop trajectory-level dataset that is realistic, complex, diverse, and multimodal.


Notably, we designed a novel pipeline to augment each step of a task with reflective long CoT: a generator and a reflector iteratively generate and verify the reasoning components that connect the observation to the ground-truth action.
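
A schematic of this generate-and-verify loop is sketched below; `generate` and `reflect` stand in for model calls and are illustrative assumptions, not the pipeline's actual interfaces:

```python
def augment_step_with_cot(observation, history, gold_action,
                          generate, reflect, max_rounds=3):
    """Iteratively draft and verify reasoning linking an observation to its gold action."""
    feedback = None
    cot = None
    for _ in range(max_rounds):
        # Generator drafts a reflective chain-of-thought for this step,
        # optionally conditioned on the reflector's previous feedback.
        cot = generate(observation, history, gold_action, feedback)
        # Reflector checks that the reasoning is grounded in the observation
        # and actually justifies the ground-truth action.
        verdict, feedback = reflect(observation, history, gold_action, cot)
        if verdict == "pass":
            return cot
    return cot  # fall back to the last draft if no round passes verification
```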

OpenCUA Models

We conduct supervised fine-tuning (SFT) on Kimi-VL-A3B, Qwen2-VL-7B, Qwen2.5-VL-7B, and Qwen2.5-VL-32B, obtaining our OpenCUA model variants: OpenCUA-A3B, OpenCUA-Qwen2-7B, OpenCUA-7B, and OpenCUA-32B. Our best models, OpenCUA-7B and OpenCUA-32B, achieve superior performance on agent grounding and planning benchmarks.
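
For illustration only, one SFT example might be serialized as a multimodal chat turn like the sketch below; the actual system prompt, image handling, and action format used to train the OpenCUA models may differ:

```python
# A hypothetical training example: screenshot in, thought + action out.
example = {
    "messages": [
        {"role": "system",
         "content": ("You are a computer-use agent. Given the task, the action "
                     "history, and the current screenshot, reason step by step "
                     "and output the next action.")},
        {"role": "user",
         "content": [
             {"type": "text",
              "text": "Task: Install the autoDocstring extension in VS Code."},
             {"type": "image", "image": "step_03_screenshot.png"},
         ]},
        {"role": "assistant",
         "content": ("Thought: The Extensions panel is open and the search box "
                     "is focused, so I should type the extension name.\n"
                     "Action: pyautogui.write('autoDocstring')")},
    ]
}
```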

OSWorld-Verified

OSWorld originally curated 369 human-crafted tasks spanning a diverse set of applications, each with its own environment setup and evaluation script. The OSWorld team has since re-verified every task, addressing issues such as outdated dependencies, evaluation bugs, and ambiguous instructions. The resulting improved benchmark is released as OSWorld-Verified; details can be found in the OSWorld team's introduction to OSWorld-Verified.

This large-scale verification required substantial engineering effort; full credit goes to the outstanding work of the OSWorld team.

Success rates (%) on OSWorld-Verified under 15-, 50-, and 100-step budgets (– = not reported):

| Model | 15 Steps | 50 Steps | 100 Steps |
|---|---|---|---|
| Proprietary | | | |
| OpenAI CUA | 26.0 | 31.3 | 31.4 |
| Seed1.5-VL | 27.9 | 34.1 | – |
| Claude 3.7 Sonnet | 27.1 | 35.8 | 35.9 |
| Claude 4 Sonnet | 31.2 | 43.9 | 41.5 |
| Open-Source | | | |
| Qwen2.5-VL-32B-Instruct | 3.0 | 3.9 | – |
| Qwen2.5-VL-72B-Instruct | 4.4 | 5.0 | – |
| Kimi-VL-A3B | 9.7 | 10.3 | – |
| UI-TARS-72B-DPO | 24.0 | 25.8 | 27.1 |
| UI-TARS-1.5-7B | 24.5 | 27.3 | 27.4 |
| OpenCUA-7B | 24.3 | 27.9 | 26.6 |
| OpenCUA-32B | 29.7 | 34.1 | 34.8 |

Offline Agentic Ability

AgentNetBench

To make evaluation stable, fast, and environment-free, we built AgentNetBench, an offline computer-use agent evaluation benchmark. It comprises 100 representative tasks selected from the AgentNet dataset, covering Windows and macOS platforms and diverse domains. Each task was manually reviewed to refine goals and remove redundant actions. Because computer-use tasks inherently admit multiple valid actions at a given step, we manually provide several valid action options per step.
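
The sketch below shows one plausible way to score a predicted step against multiple gold actions, assuming clicks are judged by pixel distance to a gold coordinate and other actions by exact match; AgentNetBench's real matching rules may differ:

```python
import math

def step_correct(pred, gold_options, click_tolerance=50):
    """Return True if the predicted action matches any of the gold options."""
    for gold in gold_options:
        if pred["type"] == gold["type"] == "click":
            # Clicks match if they land close enough to a gold coordinate.
            if math.hypot(pred["x"] - gold["x"], pred["y"] - gold["y"]) <= click_tolerance:
                return True
        elif pred["type"] == gold["type"] and pred.get("arg") == gold.get("arg"):
            return True  # non-click actions match on type and argument
    return False

# Example: a click 10 px from one accepted target counts as correct.
pred = {"type": "click", "x": 510, "y": 300}
gold = [{"type": "click", "x": 500, "y": 300}, {"type": "key", "arg": "enter"}]
assert step_correct(pred, gold)
```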

Step-level success rates (%) on AgentNetBench (SR = success rate):

| Model | Coord. SR | Content SR | Func. SR | Avg. SR |
|---|---|---|---|---|
| Qwen2.5-VL-7B | 50.7 | 40.8 | 3.1 | 48.0 |
| Aguvis-7B | 56.7 | 43.3 | 0.0 | 52.4 |
| Qwen2.5-VL-32B | 66.6 | 47.2 | 41.5 | 64.8 |
| Qwen2.5-VL-72B | 67.2 | 52.6 | 50.5 | 67.0 |
| OpenAI CUA | 71.7 | 57.3 | 80.0 | 73.1 |
| OpenCUA-7B | 75.4 | 46.4 | 53.6 | 71.0 |
| OpenCUA-32B | 78.7 | 46.0 | 55.2 | 73.2 |

Grounding Ability

We evaluate our models on three GUI grounding benchmarks: ScreenSpot-v2, ScreenSpot-Pro, and OSWorld-G.
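
Such benchmarks are commonly scored by checking whether the predicted point falls inside the target element's ground-truth bounding box; a minimal sketch of that criterion (exact rules vary per benchmark):

```python
def grounding_hit(pred_x: float, pred_y: float, bbox: tuple) -> bool:
    """bbox = (left, top, right, bottom) in screen pixels."""
    left, top, right, bottom = bbox
    return left <= pred_x <= right and top <= pred_y <= bottom

# Example: a click at (512, 300) inside a button spanning (480, 280)-(560, 320).
print(grounding_hit(512, 300, (480, 280, 560, 320)))  # True
```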

Grounding accuracy (%):

| Model | ScreenSpot-v2 | ScreenSpot-Pro | OSWorld-G |
|---|---|---|---|
| UI-TARS-7B | 91.6 | 35.7 | 47.5 |
| Operator | 70.5 | 36.6 | 40.6 |
| Qwen2.5-VL-3B | 80.9 | 25.9 | 27.3 |
| Qwen2.5-VL-7B | 88.8 | 27.6 | 31.4 |
| Qwen2.5-VL-32B | 91.3 | 39.4 | 46.5 |
| OpenCUA-7B | 88.5 | 23.7 | 45.7 |
| OpenCUA-A3B | 91.4 | 28.5 | 48.6 |
| OpenCUA-2.5-7B | 92.3 | 50.0 | 55.3 |
| OpenCUA-2.5-72B | 93.6 | 54.3 | 58.9 |


Computer-Use Showcase

Here are some task trajectories selected from the evaluation results of OpenCUA-7B on OSWorld and WindowsAgentArena:

Task Instruction

Please help me install the autoDocstring extension in VS Code.

Trajectory step 1

Analysis

Data Scaling and Test Time Scaling

Our method enables performance to scale effectively with increased training data. The high Pass@N performance demonstrates that OpenCUA-7B has great potential for test-time scaling.

Figure: performance on OSWorld as training data scales.
Figure: OSWorld Pass@N curves of OpenCUA-7B (temperature = 0.1).
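
Pass@N curves like these are often computed with the standard unbiased estimator of Chen et al. (2021); whether the paper uses this exact estimator is an assumption, but a sketch is below:

```python
from math import comb

def pass_at_n(num_samples: int, num_correct: int, n: int) -> float:
    """Estimated probability that at least one of n sampled rollouts succeeds,
    given num_correct successes observed among num_samples rollouts."""
    if num_samples - num_correct < n:
        return 1.0  # every size-n subset must contain a success
    return 1.0 - comb(num_samples - num_correct, n) / comb(num_samples, n)

# Example: 3 successes out of 16 rollouts per task.
print(round(pass_at_n(16, 3, 1), 3))  # 0.188
print(round(pass_at_n(16, 3, 8), 3))  # 0.9
```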

Acknowledgements

We appreciate Su Yu, Caiming Xiong, and Binyuan Hui for their helpful feedback and discussions on this work. We appreciate Zaida Zhou, Zhengtao Wang, Flood Sung, Hao Hu, Huarong Chen, Calvin, Qizheng Gu, and Dikang Du from the Kimi Team for providing great infrastructure and for helpful discussions. We sincerely thank all of our annotators for their great effort on this project.

BibTeX

If you find this work useful, please consider citing our paper:

@article{OpenCUA2025, 
  title={OpenCUA: Open Foundations for Computer-Use Agents}, 
  author={Wang, Xinyuan and Wang, Bowen and Lu, Dunjie and Yang, Junlin and Xie, Tianbao and Wang, Junli and Deng, Jiaqi and Guo, Xiaole and Xu, Yiheng and Wu, Chen Henry and Shen, Zhennan and Li, Zhuokai and Li, Ryan and Li, Xiaochuan and Chen, Junda and Zheng, Boyuan and Li, Peihang and Lei, Fangyu and Cao, Ruisheng and Fu, Yeqiao and Shin, Dongchan and Shin, Martin and Hu, Jiarui and Wang, Yuyan and Chen, Jixuan and Ye, Yuxiao and Zhang, Danyang and Wang, Yipu and Wang, Heng and Yang, Diyi and Zhong, Victor and Charles, Y. and Yang, Zhilin and Yu, Tao}, 
  year={2025}, 
  url={https://opencua.xlang.ai/} 
}