Introducing WorkArena: A benchmark where agents meet enterprise software
Authors: Alexandre Drouin, Maxime Gasse, Massimo Caccia, and Issam H. Laradji
Enterprise knowledge worker tasks often include repetitive activities, such as manual data entry, validation, report generation, compliance checks, and incident categorization using enterprise software. All of these can be monotonous and error-prone, even for the most diligent workers. Completing them requires that workers follow standard operating procedures and apply contextual judgment.
Even with thousands of automation opportunities across the enterprise, many of these tasks are too low in volume, and too low in automation value, to justify traditional solutions such as robotic process automation and low-code/no-code tools, given the complexity, time, and expertise those tools require.
Individually, these small tasks may seem insignificant. Together, however, they consume a significant share of each worker's time and carry enormous aggregate economic value for the business.
Enter the autonomous AI agent
Attention has thus shifted to research and development of what is now more commonly referred to as autonomous AI agents, or simply agents, to automate the completion of these tasks using iterative agentic workflows.
These agents promise low friction and high accuracy: a worker describes the task in natural language, and an agent with the required background knowledge and task context gets the job done right, most of the time. That makes automation far more economically viable.
Historically, agents have been nearly synonymous with reinforcement learning (RL), but RL-based agents typically require long training runs in sandboxed environments and generalize poorly to radically new tasks.
Recently, many agent architectures have started to employ large language models (LLMs), as LLM-based agents have the potential to be powerful zero-shot task solvers. Since LLMs have lots of background knowledge about the world, they inherently have a measure of zero-shot commonsense. And there’s a good chance an LLM has been trained on the product documentation of the software you’re trying to use and automate.
“The Emperor’s New Clothes”
When developing agentic solutions, it’s important to have robust ways to demonstrate and evaluate the quality and performance of agents across a broad variety of enterprise-oriented, task-completion scenarios. That’s why we created WorkArena, a benchmark of 29 agent-oriented knowledge worker tasks. It gives developers an effective measure of how well LLM-based agents can complete these tasks and achieve full automation.
We built the WorkArena benchmark on a remote-hosted instance of the widely adopted ServiceNow platform—a cloud-based workflow automation platform for end-to-end digital transformation. The platform connects people who need work done (requestors) with people doing the work (fulfillers). With millions of users, ServiceNow use cases are a good proxy for a large proportion of what constitutes everyday knowledge work.
WorkArena represents a difficult and realistic set of scenarios. It tests the ability of LLM-powered autonomous web agents to use the feature-rich, dynamic ServiceNow platform, including its knowledge bases, service catalogs, and user interface menus, to complete the benchmark tasks as digital fulfillers. Enterprise software like this is vastly more complicated than consumer web apps.
Despite the hype, and while showing great promise at the time of writing, these autonomous agents are not yet ready to substitute for humans in enterprise-class use cases. This brings to mind the story of “The Emperor’s New Clothes,” in which ambitious claims outpace what is actually delivered.
WorkArena functionality
To improve human productivity and performance, we first had to determine how digital fulfillers solve tasks with enterprise software, such as completing time sheets, navigating forms, and searching knowledge bases in enterprise scenarios. Instead of a requestor manually clicking through links, an AI agent would navigate the web and complete these tasks autonomously.
We assessed the abilities of LLMs to complete assigned knowledge worker tasks using WorkArena. Our research paper WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? shows that the best LLM-powered agent today is slow and expensive, with only a 55% success rate, even on tasks that are not overly complex.
This is still a remarkable accomplishment. But given the context, it shows there is still significant improvement needed before the ultimate vision can be realized.
In developing the WorkArena benchmark, we took inspiration from the process-based tasks that ServiceNow employees complete when onboarding as new hires to the company. These include database tasks, filtering tasks, and form-filling tasks workers might be expected to complete. Each task is described in natural language, and the benchmark provides immediate automated feedback on whether the task was completed correctly.
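To make this concrete, here is a minimal toy sketch of a task that pairs a natural-language goal with an automated success check. The class and method names are our own illustration, not WorkArena's actual task interface.

```python
from dataclasses import dataclass


@dataclass
class KnowledgeWorkTask:
    """Toy benchmark task: a natural-language goal plus an automated check.

    Illustrative only; the names here do not reflect WorkArena's real API.
    """

    goal: str       # the instruction shown to the agent
    expected: dict  # ground-truth field values used for validation

    def validate(self, submitted: dict) -> tuple[bool, str]:
        """Compare the agent's form entries against the ground truth."""
        wrong = {k: v for k, v in self.expected.items() if submitted.get(k) != v}
        if wrong:
            return False, f"Incorrect or missing fields: {sorted(wrong)}"
        return True, "Task completed successfully."


# Example: a hypothetical form-filling task with immediate feedback.
task = KnowledgeWorkTask(
    goal="Create an incident with priority '2 - High' assigned to the 'Network' group.",
    expected={"priority": "2 - High", "assignment_group": "Network"},
)
ok, msg = task.validate({"priority": "2 - High", "assignment_group": "Network"})
# ok -> True
```

The automated `validate` step is what allows a benchmark to score thousands of agent runs without a human in the loop.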
Navigating using accessibility trees
The HTML on most websites is enormous: more than 100,000 tokens per file. To manage this, we piggybacked on accessibility trees: representations of the HTML elements, attributes, and text nodes on a page that can be understood by assistive technologies, such as the screen readers that make websites usable for people who are visually impaired. This helps agents know where to click without being overwhelmed by the amount of information on a page.
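To give a feel for why this helps, here is a minimal sketch, using only the Python standard library, of flattening raw HTML into a compact, accessibility-tree-like list of interactive elements. A real accessibility tree comes from the browser; this toy only illustrates how much of the raw markup can be discarded. The role mapping and class name are our own assumptions.

```python
from html.parser import HTMLParser


class AXTreeExtractor(HTMLParser):
    """Toy extractor: keep only roles and labels of interesting elements.

    Illustrative sketch only; a browser's real accessibility tree is far
    richer, but the idea is the same: discard layout markup the agent
    does not need.
    """

    # Hypothetical tag-to-role mapping for a few common elements.
    ROLES = {"a": "link", "button": "button", "input": "textbox",
             "select": "combobox", "h1": "heading", "h2": "heading"}

    def __init__(self):
        super().__init__()
        self.nodes = []    # flattened (role, label) pairs
        self._role = None  # role of the interesting element currently open

    def handle_starttag(self, tag, attrs):
        role = self.ROLES.get(tag)
        if role:
            label = dict(attrs).get("aria-label", "")
            if label:
                self.nodes.append((role, label))  # labeled: record now
            else:
                self._role = role  # unlabeled: wait for its text content

    def handle_data(self, data):
        if self._role and data.strip():
            self.nodes.append((self._role, data.strip()))
            self._role = None

    def handle_endtag(self, tag):
        if tag in self.ROLES:
            self._role = None


html = """<div class="wrapper"><div><h1>Service Catalog</h1>
<p>Lots of decorative markup an agent does not need to see...</p>
<button>Order Laptop</button><a href="/kb">Knowledge Base</a></div></div>"""

tree = AXTreeExtractor()
tree.feed(html)
# tree.nodes -> [('heading', 'Service Catalog'),
#                ('button', 'Order Laptop'),
#                ('link', 'Knowledge Base')]
```

Three (role, label) pairs replace the full markup, which is the kind of compression that keeps an LLM's context window manageable on real pages.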
Build your own AI agent benchmark tasks
In addition to WorkArena, we created BrowserGym, an environment for designing new benchmarks. BrowserGym provides a platform to evaluate multimodal web agents and to experiment with new ideas, comparing a wide variety of AI agents (e.g., text-only, vision-augmented, and memory-augmented agents) on the same set of tasks.
BrowserGym encompasses features such as user-to-agent interactive chat, multipage navigation, flexible agent design, and minimal task design. Compatible with previous benchmarks, such as WebArena and MiniWoB, BrowserGym lets you navigate the web with an agent that accomplishes tasks for you.
[Demo: a user chats with a BrowserGym agent, which completes a navigation task on their behalf]
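The interaction loop behind such a demo can be sketched with a toy environment. This is a self-contained illustration of the standard Gymnasium-style reset/step contract, not BrowserGym's actual observation or action space; the class, observation keys, and action string are all assumptions made for the example.

```python
class ToyBrowserEnv:
    """Toy environment mimicking a Gymnasium-style web-agent interface.

    Hypothetical sketch: BrowserGym's real observations and actions differ;
    this only illustrates the reset/step interaction loop.
    """

    def __init__(self, goal: str):
        self.goal = goal

    def reset(self):
        # Observation: the user's chat request plus a tiny accessibility tree.
        obs = {
            "chat": [("user", self.goal)],
            "axtree": [("button", "Order Laptop"), ("link", "Knowledge Base")],
        }
        return obs, {}

    def step(self, action: str):
        # Reward 1.0 when the agent clicks the element that fulfills the goal.
        terminated = action == 'click("Order Laptop")'
        reward = 1.0 if terminated else 0.0
        obs = {"chat": [("user", self.goal)], "axtree": []}
        return obs, reward, terminated, False, {}


env = ToyBrowserEnv("Please order me a standard laptop.")
obs, info = env.reset()

# A trivial policy: click the first button found in the accessibility tree.
target = next(label for role, label in obs["axtree"] if role == "button")
obs, reward, terminated, truncated, info = env.step(f'click("{target}")')
# reward -> 1.0, terminated -> True
```

Because the agent only ever sees observations and emits actions, the same loop accommodates text-only, vision-augmented, or memory-augmented agents interchangeably.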
A benchmark speaks a thousand words
We believe browser-based task automation web agents are the ideal way to test the emergent capabilities of multimodal LLMs, and we hope to integrate additional benchmarks into BrowserGym.
In WorkArena, we found that GPT-4 significantly outperformed GPT-3.5 and that the open-source CodeLlama was considerably less capable than the other agents. This highlights the disparity between closed-source and open-source models and underscores the effort required to develop robust open-source models.
A call for collaboration with the OSS community
Both BrowserGym and WorkArena are open-source contributions to the community. We hope they’ll stimulate the development of new agents and new ways to evaluate them through future benchmark contributions, as with WebShop, WebArena, and WebVoyager.
Our ultimate goal is to expand the work to include coverage of other, more visually interactive parts of the ServiceNow platform, such as dashboards (login required), workspaces (login required), and low-code app development, as well as compositional tasks based on common trajectories inspired by the Now Platform persona-oriented curricula, such as the Business Process Analyst Career Journey.
We want to make it easy to test agents and for people to use autonomous AI agents to make their work more rewarding.
Let’s make work fun
We’re working toward a future where users can go from doing tasks to explaining what they want to do at a high level and empowering web agents to do the work for them. By freeing knowledge workers from performing redundant tasks—even those requiring deep knowledge—we can help improve the employee experience, letting them focus on the creative and fun parts of working.
To learn more, we encourage you to:
- Watch our keynote titled The Unsolved Challenges of LLMs in Open-Ended Web Tasks, presented at NVIDIA GTC 2024.
- Read our research paper: https://arxiv.org/abs/2403.07718.
- Get started with the WorkArena open-source software project on GitHub.
Find out more about ServiceNow Research.