Tools tagged with "LLM Evaluation"

Arena

Arena is an AI evaluation platform that lets users compare anonymous models, vote on outputs, and explore public leaderboards across text, code, image, video, vision, document, and search tasks.

AI Evaluation Platforms

Phoenix

Phoenix is an open-source AI observability and evaluation platform built on OpenTelemetry. It features LLM tracing, a prompt playground, evaluation workflows, dataset experiments, and clustering analysis for improving AI quality.

AI Observability Platforms

DeepEval

DeepEval is an open-source Python framework for LLM evaluation with pytest-style unit testing, 30+ LLM-as-judge metrics, multi-modal support, and integrations for RAG, agents, and fine-tuning workflows.

AI Evaluation Platforms

Confident AI

Confident AI is an LLM evaluation and observability platform by the creators of DeepEval. It features end-to-end evals, regression testing, tracing, dataset management, and prompt versioning for AI quality assurance.

AI Evaluation Platforms

App-Bench

App-Bench evaluates how well AI coding agents generate real full-stack web apps from single prompts. It tests 6 production apps across healthcare, finance, legal, and education domains, with 4,530+ evaluations.

AI Evaluation Platforms