Mercury 2
Mercury 2 is a diffusion-based reasoning LLM delivering 1,000+ tokens/sec throughput on NVIDIA Blackwell GPUs. Features 128K context, native tool use, tunable reasoning, and OpenAI-compatible API for production AI applications.

About Mercury 2
Mercury 2 is the world's fastest reasoning language model developed by Inception Labs, built on a novel diffusion architecture that diverges from traditional autoregressive token generation. Unlike conventional LLMs that generate text sequentially one token at a time, Mercury 2 utilizes parallel refinement to produce multiple tokens simultaneously, converging over a small number of steps. This architectural shift enables over 5x faster generation while maintaining reasoning grade quality suitable for production AI systems.
Designed specifically for latency sensitive applications, Mercury 2 addresses the compounding latency problem in modern AI workflows where agents, retrieval pipelines, and extraction jobs run in background loops. The model achieves 1,009 tokens per second on NVIDIA Blackwell GPUs while offering competitive performance with leading speed optimized models. It supports a 128,000 token context window and features tunable reasoning capabilities, allowing developers to balance computational depth against response time requirements.
The model targets production environments where user experience depends on instantaneous feedback, including coding assistants, autonomous agents, real time voice interfaces, and search pipelines. Mercury 2 is fully compatible with the OpenAI API specification, enabling drop-in integration with existing infrastructure without requiring architectural rewrites.
Key Features
- Diffusion Architecture: Parallel token refinement generates multiple tokens simultaneously rather than left to right sequential decoding.
- High Speed Inference: Delivers 1,009 tokens per second on NVIDIA Blackwell GPUs for real time production responsiveness.
- Tunable Reasoning: Adjustable reasoning depth allows optimization of the quality and speed trade off for specific application requirements.
- Extended Context Window: 128K token capacity supports long form document analysis and complex multi turn agentic workflows.
- OpenAI API Compatibility: Drop-in replacement for existing OpenAI integrations with no infrastructure rewrites required.
- Schema-Aligned JSON: Native structured output formatting ensures reliable data extraction and API response consistency.
Pricing
- Pay as you go: $0.25 per 1M input tokens / $0.75 per 1M output tokens Consumption based pricing designed for high volume production workloads with no upfront commitments or minimum spend requirements.
Pricing last updated: February 25, 2026 at 6:45 AM
Use Cases
- Real time coding assistance and intelligent autocomplete
- Agentic workflow automation and multi step task chains
- Voice interface and conversational AI systems
- Search and RAG pipeline optimization
Pros & Cons
Pros:
- Exceptional inference speed exceeding 1,000 tokens per second
- Cost effective token pricing compared to equivalent reasoning models
- OpenAI API compatibility enables immediate integration with existing stacks
Cons:
- Peak performance requires specific NVIDIA Blackwell GPU infrastructure
Integrations
OpenAI API, NVIDIA Blackwell, Function calling, JSON schema validation, Tool use frameworks
FAQ
Last edited
February 25, 2026 at 6:45 AM by Venkatraman
