What is Arena used for?

Arena is used to evaluate and compare AI models through anonymous side-by-side testing, community voting, and public leaderboards across multiple task types.

How does Arena rank models?

Arena uses pairwise human preference votes and applies the Bradley-Terry rating framework to turn those comparisons into leaderboard rankings.
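
To make the idea concrete, here is a minimal sketch of how a Bradley-Terry model can turn pairwise votes into a ranking. It is an illustration only, not the Arena Rank package's code: the model names, the vote counts, and the helper functions `fit_bradley_terry` and `win_probability` are all hypothetical, and the fitting loop is a textbook iterative update rather than whatever estimator Arena actually runs.

```python
from collections import defaultdict


def fit_bradley_terry(votes, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) vote pairs.

    Hypothetical helper for illustration. Assumes every vote compares two
    different models and that there are no ties.
    """
    wins = defaultdict(float)    # total wins per model
    games = defaultdict(float)   # number of battles per unordered pair
    models = set()
    for winner, loser in votes:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    models = sorted(models)
    strength = {m: 1.0 for m in models}  # start with equal strengths
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in games.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom
        # Normalize so the strengths sum to 1
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength


def win_probability(strength, a, b):
    """Bradley-Terry probability that model a is preferred over model b."""
    return strength[a] / (strength[a] + strength[b])


# Made-up votes: each tuple is (winner, loser) from one anonymous battle.
votes = (
    [("model-a", "model-b")] * 7 + [("model-b", "model-a")] * 3
    + [("model-b", "model-c")] * 6 + [("model-c", "model-b")] * 4
    + [("model-a", "model-c")] * 8 + [("model-c", "model-a")] * 2
)

strengths = fit_bradley_terry(votes)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
print(round(win_probability(strengths, "model-a", "model-b"), 3))
```

Each model gets a single strength score, and the predicted chance that one model beats another depends only on the ratio of their scores. A production leaderboard would also need to handle ties, uneven sampling, and uncertainty estimates, which this sketch leaves out.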

Does Arena only evaluate chat models?

No. Arena publishes leaderboards for several categories, including text, code, text-to-image, image edit, text-to-video, image-to-video, video edit, vision, document, and search.

Can enterprises use Arena for formal evaluations?

Yes. Arena offers AI evaluation services for enterprises, model labs, and developers that are grounded in real-world human feedback.

Does Arena support open research?

Yes. Arena publishes open datasets, research papers, and the Arena Rank package that powers its leaderboard methodology.

What is Arena Expert?

Arena Expert is an evaluation layer that focuses on expert-level prompts to create sharper distinctions between top models on difficult real-world tasks.

Arena

Arena is an AI evaluation platform that lets users compare anonymous models, vote on outputs, and explore public leaderboards across text, code, image, video, vision, document, and search tasks.


About Arena

Arena is a community-powered AI evaluation platform created by researchers from UC Berkeley. The platform is designed to help people understand how frontier AI models perform in real-world use by collecting human feedback and building public rankings from live comparisons.

Its core workflow centers on anonymous side-by-side model comparisons. A user submits a prompt, reviews responses from two anonymous models in battle mode, and votes for the output that best matches the task. After the vote, model identities are revealed, and that feedback contributes to Arena’s public leaderboards.

Arena supports evaluation across a broad set of categories rather than only general chat. Its public rankings span overall, text, code, text-to-image, image edit, text-to-video, image-to-video, video edit, vision, document, and search use cases. Arena also offers AI evaluation services for enterprises, model labs, and developers grounded in real-world human feedback.

Arena also publishes open research assets tied to its evaluation work. The company has released open datasets, maintains the Arena Rank package that powers its leaderboards, and has introduced evaluation layers such as Arena Expert and Occupational Categories to study harder prompts and domain-specific performance.

Key Features

Battle Mode: Compare two anonymous models side by side before voting on the preferred response.
Public Leaderboards: Explore rankings across multiple evaluation tracks including text, code, image, video, vision, document, and search.
Human Preference Ranking: Model standings are shaped by community voting using the Bradley-Terry rating approach.
Enterprise Evaluations: Access evaluation services for enterprises, model labs, and developers based on real-world human feedback.
Arena Rank Methodology: Use the open-source Arena Rank package that powers the site’s leaderboards and ranking workflow.
Expert and Domain Views: Analyze expert-level prompts and occupational categories to study model performance across real disciplines.

Pricing

Arena does not publish standard self-serve pricing packages on its public reference pages. The official site highlights an AI Evaluations service and directs interested organizations to contact the team about evaluation engagements.

Use Cases

Benchmark frontier AI models using anonymous side-by-side comparisons.
Evaluate model performance across text, code, image, video, vision, document, and search tasks.
Gather human preference feedback to shape leaderboard rankings and model release decisions.
Study expert-level and domain-specific prompt performance through Arena Expert and Occupational Categories.

Pros & Cons

Pros:

Covers multiple AI evaluation tracks beyond text alone.
Uses anonymous voting to reduce bias during head-to-head comparisons.
Supports transparent research through open datasets and an open-source ranking methodology.

Cons:

Public reference pages do not list standard self-serve pricing plans.
Conversation data may be shared in de-identified public research datasets or private evaluations, so users should avoid entering sensitive information.