What makes Mercury 2 different from traditional LLMs?

Mercury 2 uses a diffusion based architecture rather than autoregressive generation. Instead of predicting tokens sequentially from left to right, it refines entire sequences in parallel through multiple steps, resulting in over 5x faster generation while maintaining reasoning capabilities.

What hardware is required to achieve the advertised 1,009 tokens/sec speed?

The benchmarked speed of 1,009 tokens per second is achieved on NVIDIA Blackwell GPUs. While the model runs on other hardware configurations, peak performance requires this specific GPU architecture.

Is Mercury 2 compatible with existing OpenAI integrations?

Yes, Mercury 2 is fully OpenAI API compatible, allowing it to function as a drop-in replacement in existing applications without requiring code changes or infrastructure rewrites.

What does 'tunable reasoning' mean in Mercury 2?

Tunable reasoning allows developers to adjust the depth of inference the model applies to specific tasks. This enables balancing between faster responses and more thorough reasoning based on application requirements and latency budgets.

What types of applications is Mercury 2 best suited for?

Mercury 2 excels in latency sensitive production environments including real time coding assistants, autonomous agent loops, voice interaction systems, and search/RAG pipelines where immediate responsiveness is critical to user experience.

Mercury 2

Mercury 2 is a diffusion-based reasoning LLM delivering 1,000+ tokens/sec throughput on NVIDIA Blackwell GPUs. Features 128K context, native tool use, tunable reasoning, and OpenAI-compatible API for production AI applications.

Visit Mercury 2

About Mercury 2

Mercury 2 is the world's fastest reasoning language model developed by Inception Labs, built on a novel diffusion architecture that diverges from traditional autoregressive token generation. Unlike conventional LLMs that generate text sequentially one token at a time, Mercury 2 utilizes parallel refinement to produce multiple tokens simultaneously, converging over a small number of steps. This architectural shift enables over 5x faster generation while maintaining reasoning grade quality suitable for production AI systems.

Designed specifically for latency sensitive applications, Mercury 2 addresses the compounding latency problem in modern AI workflows where agents, retrieval pipelines, and extraction jobs run in background loops. The model achieves 1,009 tokens per second on NVIDIA Blackwell GPUs while offering competitive performance with leading speed optimized models. It supports a 128,000 token context window and features tunable reasoning capabilities, allowing developers to balance computational depth against response time requirements.

The model targets production environments where user experience depends on instantaneous feedback, including coding assistants, autonomous agents, real time voice interfaces, and search pipelines. Mercury 2 is fully compatible with the OpenAI API specification, enabling drop-in integration with existing infrastructure without requiring architectural rewrites.

Key Features

Diffusion Architecture: Parallel token refinement generates multiple tokens simultaneously rather than left to right sequential decoding.
High Speed Inference: Delivers 1,009 tokens per second on NVIDIA Blackwell GPUs for real time production responsiveness.
Tunable Reasoning: Adjustable reasoning depth allows optimization of the quality and speed trade off for specific application requirements.
Extended Context Window: 128K token capacity supports long form document analysis and complex multi turn agentic workflows.
OpenAI API Compatibility: Drop-in replacement for existing OpenAI integrations with no infrastructure rewrites required.
Schema-Aligned JSON: Native structured output formatting ensures reliable data extraction and API response consistency.

Pricing

Pay as you go: $0.25 per 1M input tokens / $0.75 per 1M output tokens Consumption based pricing designed for high volume production workloads with no upfront commitments or minimum spend requirements.

Pricing last updated: February 25, 2026 at 6:45 AM

Use Cases

Real time coding assistance and intelligent autocomplete
Agentic workflow automation and multi step task chains
Voice interface and conversational AI systems
Search and RAG pipeline optimization

Pros & Cons

Pros:

Exceptional inference speed exceeding 1,000 tokens per second
Cost effective token pricing compared to equivalent reasoning models
OpenAI API compatibility enables immediate integration with existing stacks

Cons:

Peak performance requires specific NVIDIA Blackwell GPU infrastructure

Integrations

OpenAI API, NVIDIA Blackwell, Function calling, JSON schema validation, Tool use frameworks

FAQ

Categories:

AI Models Foundation

Tags:

ai-api diffusion-llm reasoning-model

Last edited

June 4, 2026 at 4:24 AM by Venkatraman

A Human Edited Software Directory

Advertise on CTODiscovery.

Advertise on CTODiscovery

Similar to Mercury 2

View all tools

Mercury 2

Mercury 2 is a diffusion-based reasoning LLM delivering 1,000+ tokens/sec throughput on NVIDIA Blackwell GPUs. Features 128K context, native tool use, tunable reasoning, and OpenAI-compatible API for production AI applications.

About Mercury 2

Key Features

Pricing

Use Cases

Pros & Cons

Integrations

FAQ

Tags:

Last edited

Similar to Mercury 2

GPT-5.4

Gemini 3.1 Flash-Lite

LFM2.5-1.2B-Thinking

Similar to Mercury 2

Similar to Mercury 2

GPT-5.4

Gemini 3.1 Flash-Lite

LFM2.5-1.2B-Thinking