Structured refusal handling with HTTP 200 stop_reason responses and built-in server-side and client-side fallback mechanisms for automatic retry on alternative Claude models
More Information
Parent Company: Anthropic
Initial Launch: June 9, 2026
Primary Audience: Enterprise engineering teams, AI agent builders, and developers requiring frontier-level reasoning for complex coding, research, and long-horizon agentic applications
Expert Analysis
Adaptive Thinking Architecture
Claude Fable 5 removes manual thinking budget configuration entirely. Adaptive thinking is the only supported mode: the model allocates reasoning tokens based on task complexity, and developers adjust depth through the effort parameter. Raw chain-of-thought output is never returned; only summarized or omitted thinking blocks are accessible. This architecture reduces per-request overhead on simple tasks while maintaining depth on complex multi-step reasoning without developer tuning.
Context Window and Agentic Workflow Support
The 1 million token context window operates at standard per-token pricing with no long-context surcharge. Combined with 128k output tokens per request, the model supports end-to-end workflows that previously required chunking or retrieval augmentation. Features such as compaction and the context editing beta reduce token accumulation costs across long agentic sessions, making the model practical for sustained multi-turn autonomous workflows.
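As a rough illustration of the figures above, the sketch below checks whether a prompt leaves room for the requested output. It assumes that input and output tokens share the same window and that "128k" means 128,000 tokens; neither detail is stated in the text, and the function names are hypothetical.

```python
CONTEXT_WINDOW = 1_000_000  # from the text: 1 million token context window
MAX_OUTPUT = 128_000        # assumption: "128k" read as 128,000 tokens


def fits_in_context(prompt_tokens: int, max_output_tokens: int = MAX_OUTPUT) -> bool:
    """Return True if the prompt plus the reserved output fits in the window.

    Assumes input and output tokens count against one shared window.
    """
    if prompt_tokens < 0 or max_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOW


def remaining_input_budget(max_output_tokens: int = MAX_OUTPUT) -> int:
    """Tokens left for the prompt after reserving space for the output."""
    return CONTEXT_WINDOW - max_output_tokens


print(remaining_input_budget())
print(fits_in_context(900_000))
```

Under these assumptions, a session that reserves the full output allowance has 872,000 tokens left for input, so a 900,000-token prompt would need a smaller output reservation or compaction first.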
Refusal and Fallback Infrastructure
Claude Fable 5 introduces structured refusal handling at the API level. Declined requests return HTTP 200 with a machine-readable stop_reason rather than an error, enabling deterministic programmatic routing. Server-side fallback via the fallbacks parameter and SDK middleware across five languages allow teams to retry declined requests against lower-tier models automatically, with fallback credit offsetting prompt cache costs on the retry.
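Because a declined request arrives as a normal HTTP 200 body, client-side routing can be a simple check on stop_reason. The following is a minimal sketch of that pattern, not the SDK middleware itself: the `send` callable, the model names, and the `"refusal"` value are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional

# Assumption: a declined request is marked with this stop_reason value.
REFUSAL_STOP_REASON = "refusal"

Response = Dict[str, object]
SendFn = Callable[[str, str], Response]  # (model, prompt) -> parsed response body


def send_with_fallback(send: SendFn, prompt: str, models: List[str]) -> Response:
    """Try each model in order and return the first response that is not a refusal.

    If every model declines, the last refusal is returned so the caller can
    inspect it.
    """
    if not models:
        raise ValueError("models must contain at least one model name")
    last: Optional[Response] = None
    for model in models:
        last = send(model, prompt)
        if last.get("stop_reason") != REFUSAL_STOP_REASON:
            return last
    assert last is not None
    return last


def fake_send(model: str, prompt: str) -> Response:
    """Stand-in transport for the demo: the hypothetical primary model declines."""
    if model == "primary-model":
        return {"model": model, "stop_reason": "refusal", "content": []}
    return {"model": model, "stop_reason": "end_turn", "content": [{"type": "text", "text": "ok"}]}


result = send_with_fallback(fake_send, "hello", ["primary-model", "fallback-model"])
print(result["model"], result["stop_reason"])
```

In a real integration, `fake_send` would be replaced by a function that calls the API and returns the parsed response body. No exception handling is needed for the refusal path, since a refusal is not an HTTP error.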
The introduction of tool search architecture addresses the cost implications of large tool ecosystems. Rather than consuming thousands of tokens per request to define all available tools, GPT-5.4 retrieves specific tool definitions on demand. This reduces token usage by 47 percent when using MCP servers, directly lowering API costs while enabling integration with extensive tool libraries that were previously prohibitively expensive to maintain in context.
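The idea of retrieving tool definitions on demand can be sketched with a toy keyword search over a local catalog. This is an illustration of the general pattern only, not the model's actual retrieval mechanism; the tool names, schemas, and scoring rule are invented for the example, and serialized character count stands in for token count.

```python
import json
from typing import Dict, List

# Hypothetical tool catalog; names and schemas are invented for illustration.
TOOLS: Dict[str, dict] = {
    "get_weather": {
        "description": "Look up the current weather for a city",
        "parameters": {"city": "string"},
    },
    "create_invoice": {
        "description": "Create a billing invoice for a customer",
        "parameters": {"customer_id": "string", "amount": "number"},
    },
    "search_tickets": {
        "description": "Search support tickets by keyword",
        "parameters": {"query": "string"},
    },
}


def search_tools(query: str, catalog: Dict[str, dict], limit: int = 2) -> List[str]:
    """Rank tools by how many query words appear in their name or description."""
    words = set(query.lower().split())
    scored = []
    for name, spec in catalog.items():
        text = name.replace("_", " ") + " " + spec["description"]
        score = len(words & set(text.lower().split()))
        if score:
            scored.append((score, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


def payload_size(names: List[str], catalog: Dict[str, dict]) -> int:
    """Character count of the serialized definitions, a rough proxy for tokens."""
    return len(json.dumps({name: catalog[name] for name in names}))


selected = search_tools("weather forecast city", TOOLS)
print(selected)
print(payload_size(selected, TOOLS), "vs", payload_size(list(TOOLS), TOOLS))
```

Only the matching definition is placed in the request, so the payload grows with the number of tools a task needs rather than with the size of the catalog.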
On GDPval benchmarks spanning 44 occupations, GPT-5.4 achieves 83 percent professional equivalence, with specific strength in spreadsheet modeling at 87.3 percent accuracy. The model demonstrates particular utility for investment banking analysis, presentation generation, and document editing tasks that require maintaining context across extended workflows.
Production Latency Optimization
The model specifically targets compounding latency in agentic workflows, where inference calls chain across dozens of steps. Traditional reasoning models increase latency in proportion to test-time compute, making multi-step agents impractical for real-time applications. Mercury 2 maintains reasoning-grade quality within strict latency budgets, enabling complex agent loops that previously required sacrificing either intelligence or responsiveness. The 128K context window further supports stateful agent operations without frequent context resets.
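The compounding effect is simple arithmetic: in a strictly sequential loop, total latency is the number of steps times the per-call latency. The sketch below uses illustrative figures, not measurements of any model, and the function names are hypothetical.

```python
def chain_latency_s(steps: int, per_call_s: float) -> float:
    """Total wall-clock time for a sequential agent loop with no overlap."""
    if steps < 0 or per_call_s < 0:
        raise ValueError("steps and per_call_s must be non-negative")
    return steps * per_call_s


def max_per_call_s(steps: int, budget_s: float) -> float:
    """Largest per-call latency that keeps the whole loop within the budget."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    return budget_s / steps


# Illustrative numbers only: a 30-step loop at 2 seconds per call.
print(chain_latency_s(30, 2.0))
# Per-call latency needed to finish the same loop within 10 seconds.
print(round(max_per_call_s(30, 10.0), 3))
```

With these example figures, a 30-step loop at 2 seconds per call takes a full minute, while a 10-second end-to-end budget leaves roughly a third of a second per call.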
Claude Sonnet 4.6 demonstrates sophisticated multi-step strategic thinking in the Vending Bench Arena evaluation. The model developed an autonomous strategy of heavy capacity investment during the initial simulation phases, followed by a sharp pivot to profitability. This temporal reasoning capability translates to real-world business applications where models must balance immediate execution against long-term objectives without human micromanagement.
Computer Use Implementation
The 72.5% OSWorld Verified score represents substantial progress in GUI automation since October 2024's initial 14.9% baseline. Sonnet 4.6 processes screen states as visual inputs rather than requiring structured API access, enabling integration with legacy systems lacking modern interfaces. Security considerations include enhanced prompt injection resistance compared to version 4.5, though enterprises should implement additional safeguards when deploying autonomous browser agents on untrusted domains.
Context Window Utilization
The 1M token capacity supports workflows that previously required chunked processing or retrieval augmentation. In software engineering contexts, this enables holistic codebase comprehension, where the model maintains architectural consistency across thousands of lines. API users should note that this capability remains restricted to beta and requires specific implementation patterns for production deployment.
Gemini 3.1 Pro establishes a new baseline for reasoning-centric AI models, distinguished by its verified score of 77.1% on ARC-AGI-2. This benchmark specifically tests adaptability to novel logic patterns rather than memorized knowledge, indicating genuine advancement in core cognitive architecture. The 2x improvement over Gemini 3 Pro suggests significant refinements in the model's chain-of-thought capabilities and abstract pattern recognition.
Agentic Workflow Integration
The model's rollout across Google's development stack, including Antigravity and Android Studio, positions it as infrastructure for autonomous agentic systems. Its ability to generate functional code assets like animated SVGs and interactive 3D simulations demonstrates practical utility beyond text generation. These capabilities enable developers to prototype sensory-rich interfaces and complex data visualizations without manually coding graphics pipelines.
Multi-Modal Output Capabilities
Unlike text-only LLMs, 3.1 Pro generates executable code outputs that render as visual and interactive experiences. The combination of hand tracking with generative audio in the starling murmuration example reveals sophisticated cross-modal reasoning. This technical architecture supports iterative creative workflows in which models do not just suggest designs but produce deployable interactive assets.