What is App-Bench and who created it?

App-Bench is a benchmark for evaluating AI coding agents on full-stack web app generation, created by AfterQuery, a Y Combinator-backed research lab specializing in AI evaluation datasets.

How many apps are in the App-Bench benchmark?

App-Bench includes 6 full-stack web applications: Financial Dashboard, Hospital Dashboard, Legal Assistant, Pharmacy System, Drawing Game, and Rental Booking.

How are App-Bench scores calculated?

Two experienced developers manually grade each app against a rubric of 151 criteria. Each tool gets 3 attempts per task, and the best run is used for final scoring. With 10 tools evaluated, that comes to 151 × 3 × 10 = 4,530 evaluations.

What is the current state-of-the-art score on App-Bench?

In the reported results, Orchids achieved 76.8% accuracy, Claude Code 67.5%, and v0 64.9%. Even the best performers left significant gaps in production features.

Is App-Bench free to use?

Yes, App-Bench is a free, open research benchmark. Results and leaderboards are publicly available at appbench.ai.

What types of coding agents does App-Bench evaluate?

App-Bench evaluates both web-based prompt-to-app builders (like v0, Bolt.new, Lovable) and CLI code agents (like Claude Code, Cursor, Gemini CLI).

App-Bench

App-Bench evaluates how well AI coding agents generate real full-stack web apps from single prompts. It tests 6 production apps across healthcare, real estate, finance, legal, and education domains, with 4,530 evaluations in total.


About App-Bench

App-Bench is a benchmark designed to evaluate how well AI-driven coding agents can automatically build modern, full-stack web applications from a single natural language prompt. Created by AfterQuery (a Y Combinator-backed research lab), it tests agents on economically important domains including healthcare, real estate, finance, legal services, and education.

The benchmark consists of 6 full-stack web applications: Financial Dashboard (Bloomberg-style terminal), Hospital Dashboard (multi-role patient tracking), Legal Assistant (RAG-based with voice), Pharmacy System (multi-user platform), Drawing Game (multiplayer real-time), and Rental Booking (Airbnb-style marketplace). Each task exercises core production features: integrated AI assistants, real-time synchronization, multi-role logic, automated triggers, and robust authentication flows.

Two experienced full-stack developers manually grade each trajectory against a detailed rubric. Each tool receives three attempts per task, with the best-performing run used for final scoring. In total, 151 rubric items × 3 attempts × 10 tools = 4,530 evaluations. The best-performing builder (Orchids) achieved 76.8% accuracy, while even top performers left significant gaps in production-ready features.
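
The scoring arithmetic above can be illustrated with a short sketch. This is not AfterQuery's grading code: the function names are hypothetical, and it assumes that a run's score is the fraction of rubric items it passes, which the description does not state explicitly. The grades in the example are made up for illustration.

```python
from typing import Sequence

# Figures taken from the description above.
RUBRIC_ITEMS = 151  # rubric criteria graded per attempt
ATTEMPTS = 3        # attempts each tool gets
TOOLS = 10          # tools evaluated


def total_evaluations(rubric_items: int = RUBRIC_ITEMS,
                      attempts: int = ATTEMPTS,
                      tools: int = TOOLS) -> int:
    """Number of individual rubric judgments made by the graders."""
    return rubric_items * attempts * tools


def run_score(passed: Sequence[bool]) -> float:
    """Assumed scoring rule: fraction of rubric items a single run passes."""
    if not passed:
        raise ValueError("a run must have at least one graded rubric item")
    return sum(passed) / len(passed)


def best_run_score(runs: Sequence[Sequence[bool]]) -> float:
    """Best-of-N scoring: keep the highest-scoring attempt for a task."""
    return max(run_score(run) for run in runs)


# Made-up grades for one task with four rubric items and three attempts.
example_runs = [
    [True, False, True, True],   # 3 of 4 items passed
    [True, True, False, False],  # 2 of 4 items passed
    [False, False, True, False], # 1 of 4 items passed
]

print(total_evaluations())           # 151 * 3 * 10
print(best_run_score(example_runs))  # score of the first attempt
```

Taking the maximum over attempts rewards a tool's best result rather than its typical one, so scores under this scheme are an upper bound on what a single run would usually deliver.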

Key Features

Full-Stack Evaluation: Tests real web apps with multi-user flows, AI integrations, and complex state management.
Domain Variety: Covers healthcare, finance, legal, education, and real estate scenarios.
Rubric-Based Scoring: 151 criteria manually graded by experienced developers for accuracy.
Multiple Attempts: Three runs per task with best-run scoring to reduce variance.
Production Focus: Evaluates the features that separate prototypes from shippable products.
Public Leaderboard: Comparative results across 10+ coding agents and builders.