Arena is an AI evaluation platform that lets users compare anonymous models, vote on outputs, and explore public leaderboards across text, code, image, video, vision, document, and search tasks.
Phoenix is an open-source AI observability and evaluation platform built on OpenTelemetry. It features LLM tracing, a prompt playground, evaluation workflows, dataset experiments, and clustering analysis for improving AI quality.
DeepEval is an open-source Python framework for LLM evaluation with pytest-style unit testing, 30+ LLM-as-judge metrics, multi-modal support, and integrations for RAG, agents, and fine-tuning workflows.
Confident AI is an LLM evaluation and observability platform by the creators of DeepEval. It features end-to-end evals, regression testing, tracing, dataset management, and prompt versioning for AI quality assurance.
App-Bench evaluates how well AI coding agents generate real full-stack web apps from single prompts. It tests 6 production apps across the healthcare, finance, legal, and education domains, with 4,530+ evaluations.