Reasoning & Streaming

Thinking/reasoning models, streaming metadata, and structured output from the model layer.

See Models Overview for the 3-tier API and configuration.

Streaming

All models support streaming with complete metadata:

from cogent.models import ChatModel

model = ChatModel(model="gpt-5.4")

async for chunk in model.astream([
    {"role": "user", "content": "Write a story"}
]):
    print(chunk.content, end="", flush=True)

    # Access metadata in all chunks
    if chunk.metadata:
        print(f"\nModel: {chunk.metadata.model}")
        print(f"Response ID: {chunk.metadata.response_id}")

        # Token usage available in final chunk
        if chunk.metadata.tokens:
            print(f"Tokens: {chunk.metadata.tokens.total_tokens}")
            print(f"Finish: {chunk.metadata.finish_reason}")

Streaming Metadata

All 10 chat providers return complete metadata during streaming:

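| Provider | Model | Finish Reason | Token Usage | Notes |
| --- | --- | --- | --- | --- |
| OpenAI | ✅ | ✅ | ✅ | Uses stream_options={"include_usage": True} |
| Gemini | ✅ | ✅ | ✅ | Extracts from usage_metadata |
| Groq | ✅ | ✅ | ✅ | Compatible with OpenAI pattern |
| Mistral | ✅ | ✅ | ✅ | Metadata accumulation |
| Cohere | ✅ | ✅ | ✅ | Event-based streaming (message-end) |
| Anthropic | ✅ | ✅ | ✅ | Snapshot-based metadata |
| Cloudflare | ✅ | ✅ | ✅ | Stream options support |
| Ollama | ✅ | ✅ | ✅ | Local model metadata |
| Azure OpenAI | ✅ | ✅ | ✅ | Stream options support |
| Azure AI Foundry / GitHub | ✅ | ✅ | ✅ | Stream options via model_extras |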

Metadata Structure:

@dataclass
class MessageMetadata:
    id: str | None              # Response ID
    timestamp: str | None       # ISO 8601 timestamp
    model: str | None           # Model name/version
    tokens: TokenUsage | None   # Token counts
    finish_reason: str | None   # stop, length, error
    response_id: str | None     # Provider response ID
    duration: float | None      # Request duration (ms)
    correlation_id: str | None  # For tracing

@dataclass
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    reasoning_tokens: int | None  # Reasoning tokens (if available)

Note: reasoning_tokens is populated by models that support reasoning/thinking (o1/o3, deepseek-reasoner, Claude extended thinking, Gemini thinking, Grok).
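
Because reasoning_tokens is optional, code that reports usage needs to handle None. The sketch below uses a local stand-in dataclass with the same fields as TokenUsage above (it does not import Cogent), and summarize_usage is a hypothetical helper, not part of the library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenUsage:
    """Local stand-in mirroring the fields documented above."""
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    reasoning_tokens: Optional[int] = None


def summarize_usage(usage: TokenUsage) -> str:
    """Hypothetical helper: one-line summary that tolerates a missing reasoning count."""
    parts = [
        f"prompt={usage.prompt_tokens}",
        f"completion={usage.completion_tokens}",
        f"total={usage.total_tokens}",
    ]
    # Only report reasoning tokens when the provider supplied them
    if usage.reasoning_tokens is not None:
        parts.append(f"reasoning={usage.reasoning_tokens}")
    return " ".join(parts)
```

With a real response, the same function could be passed chunk.metadata.tokens once it is not None.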

Streaming Pattern:

  1. Content chunks: include partial metadata (model, response_id, timestamp)
  2. Final chunk: empty content with complete metadata (finish_reason, tokens)

# Example streaming flow
async for chunk in model.astream(messages):
    # Chunks 1-N: Content with partial metadata
    if chunk.content:
        print(chunk.content, end="")

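    # Final chunk: complete metadata (tokens can be None, so guard before reading it)
    if chunk.metadata and chunk.metadata.finish_reason:
        tokens = chunk.metadata.tokens
        total = tokens.total_tokens if tokens else "unknown"
        print(f"\n\nCompleted with {total} tokens")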
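
To try the two-phase pattern without calling a provider, the following sketch replays it with a fake stream. FakeChunk, FakeMetadata, fake_stream, and collect are illustrative names, not Cogent classes, and the metadata is simplified to a flat total_tokens field rather than a nested TokenUsage.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Optional


@dataclass
class FakeMetadata:
    """Stand-in with only the metadata fields this sketch reads."""
    model: Optional[str] = None
    finish_reason: Optional[str] = None
    total_tokens: Optional[int] = None


@dataclass
class FakeChunk:
    content: str
    metadata: Optional[FakeMetadata] = None


async def fake_stream() -> AsyncIterator[FakeChunk]:
    """Simulates the pattern above: content chunks, then an empty final chunk."""
    yield FakeChunk("Once ", FakeMetadata(model="demo-model"))
    yield FakeChunk("upon a time", FakeMetadata(model="demo-model"))
    yield FakeChunk("", FakeMetadata(model="demo-model", finish_reason="stop", total_tokens=12))


async def collect(stream: AsyncIterator[FakeChunk]):
    """Join streamed content and keep the metadata of the chunk that carries finish_reason."""
    parts = []
    final = None
    async for chunk in stream:
        if chunk.content:
            parts.append(chunk.content)
        if chunk.metadata and chunk.metadata.finish_reason:
            final = chunk.metadata
    return "".join(parts), final


text, final = asyncio.run(collect(fake_stream()))
```

Keeping the last metadata object that carries a finish_reason means the caller does not have to assume the final chunk is literally the last item yielded.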

Thinking & Reasoning

Several providers offer "reasoning" or "thinking" models that expose their chain-of-thought process. Cogent provides unified access to these capabilities.

@effort Shorthand

The fastest way to control reasoning intensity — append @level or @budget to any model string:

from cogent import Agent

# Named effort levels
agent = Agent(name="Analyst", model="o3-mini@high")
agent = Agent(name="Analyst", model="claude@high")
agent = Agent(name="Analyst", model="gpt-5.4-nano@low")

# Token thinking budget
agent = Agent(name="Analyst", model="claude-sonnet@16k")
agent = Agent(name="Analyst", model="gemini-2.5-pro@8192")

# Works with provider prefix and aliases
agent = Agent(name="Analyst", model="openai:o3-mini@medium")
agent = Agent(name="Analyst", model="anthropic:claude@max")

# Also works with create_chat()
from cogent.models import create_chat
llm = create_chat("o3-mini@high")

Provider mapping:

| Suffix | OpenAI | Anthropic | Gemini | xAI |
| --- | --- | --- | --- | --- |
| @low | reasoning_effort="low" | effort="low" + adaptive | thinking_budget=2048 | reasoning_effort="low" |
| @medium | reasoning_effort="medium" | effort="medium" + adaptive | thinking_budget=8192 | |
| @high | reasoning_effort="high" | effort="high" + adaptive | thinking_budget=24576 | reasoning_effort="high" |
| @max | reasoning_effort="high" | effort="max" + adaptive | thinking_budget=24576 | reasoning_effort="high" |
| @16k | | thinking_budget=16384 | thinking_budget=16384 | |

Explicit model_kwargs override the @ suffix. For multi-parameter control (budget + display options), use model_kwargs or construct the model directly.
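
To make the suffix grammar concrete, here is a minimal sketch of how a string such as "openai:o3-mini@medium" could be split into its parts. parse_model_string and ModelSpec are hypothetical names written for illustration; this is not Cogent's parser, and it assumes that a "k" suffix multiplies by 1024, which matches the @16k row (16384) in the table.

```python
from typing import NamedTuple, Optional

EFFORT_LEVELS = {"low", "medium", "high", "max"}


class ModelSpec(NamedTuple):
    provider: Optional[str]
    model: str
    effort: Optional[str]
    budget_tokens: Optional[int]


def parse_model_string(spec: str) -> ModelSpec:
    """Hypothetical parser for 'provider:model@suffix' strings (not Cogent's own code)."""
    base, _, suffix = spec.partition("@")
    provider, sep, model = base.partition(":")
    if not sep:  # no provider prefix present
        provider, model = None, base
    effort = None
    budget = None
    if suffix:
        lowered = suffix.lower()
        if lowered in EFFORT_LEVELS:
            effort = lowered
        elif lowered.endswith("k") and lowered[:-1].isdigit():
            budget = int(lowered[:-1]) * 1024  # @16k -> 16384, as in the table
        elif lowered.isdigit():
            budget = int(lowered)
        else:
            raise ValueError(f"unrecognized suffix: @{suffix}")
    return ModelSpec(provider, model, effort, budget)
```

The sketch treats the first colon as the provider separator, so model names that themselves contain a colon would need extra handling.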

Feature Comparison

| Provider | Models | Control Parameter | Access Reasoning | Structured Output |
| --- | --- | --- | --- | --- |
| Anthropic | claude-sonnet-4, claude-opus-4 | thinking_budget | msg.thinking | ✅ via thinking |
| OpenAI | o1, o3, o4-mini | reasoning_effort | Hidden | |
| Gemini | gemini-2.5-* | thinking_budget | msg.thinking | |
| xAI | grok-3-mini | reasoning_effort | Hidden | |
| DeepSeek | deepseek-reasoner | Always on | msg.reasoning | |
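
Since the reasoning text lives in msg.thinking for some providers, msg.reasoning for others, and is hidden for the rest, provider-agnostic code may want a single accessor. The helper below is a hypothetical sketch, not a Cogent function, and it uses SimpleNamespace objects as stand-ins for real responses.

```python
from types import SimpleNamespace
from typing import Any, Optional


def reasoning_text(response: Any) -> Optional[str]:
    """Return exposed reasoning from either attribute, or None when it is hidden."""
    for attr in ("thinking", "reasoning"):
        value = getattr(response, attr, None)
        if value:
            return value
    return None


# Stand-in responses; real ones would come from model.ainvoke(...)
claude_like = SimpleNamespace(content="455", thinking="15*14*13 / 6 = 455")
deepseek_like = SimpleNamespace(content="QED", reasoning="Consider a right triangle...")
openai_like = SimpleNamespace(content="done")  # reasoning is hidden
```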

Anthropic Extended Thinking

Claude models support extended thinking with configurable token budgets:

from cogent.models.anthropic import AnthropicChat

# Enable extended thinking with budget
model = AnthropicChat(
    model="claude-sonnet-4-20250514",
    thinking={"type": "enabled", "budget_tokens": 10000},
)

response = await model.ainvoke([
    {"role": "user", "content": "Solve this step by step: 15! / (12! * 3!)"}
])

# Access thinking content
if response.thinking:
    print("Thinking:", response.thinking)
print("Answer:", response.content)

Using ReasoningConfig:

from cogent.models.anthropic import AnthropicChat
from cogent.reasoning import ReasoningConfig

# Create config
config = ReasoningConfig(budget_tokens=10000)

# Apply to model
model = AnthropicChat(model="claude-sonnet-4-20250514")
thinking_model = model.with_reasoning(config)

messages = [{"role": "user", "content": "Solve this step by step: 15! / (12! * 3!)"}]
response = await thinking_model.ainvoke(messages)

Features:

- Thinking exposed in msg.thinking attribute
- Works with streaming (thinking streamed first)
- Compatible with with_structured_output() via thinking
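
When choosing a budget, it can help to validate it before building the thinking dict. The sketch below assumes two constraints from the Anthropic API rather than from this page: a minimum budget of 1024 tokens, and a budget that is generally smaller than the request's max_tokens. thinking_config is a hypothetical helper, and whether Cogent performs its own checks is not stated here.

```python
MIN_BUDGET_TOKENS = 1024  # assumed minimum, per the Anthropic API at the time of writing


def thinking_config(budget_tokens: int, max_tokens: int) -> dict:
    """Build a thinking dict like the one above, rejecting budgets the API would refuse."""
    if budget_tokens < MIN_BUDGET_TOKENS:
        raise ValueError(f"budget_tokens must be at least {MIN_BUDGET_TOKENS}")
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be smaller than max_tokens")
    return {"type": "enabled", "budget_tokens": budget_tokens}
```

For example, thinking_config(10000, 16000) yields the same dict passed to AnthropicChat in the first snippet of this section.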

OpenAI Reasoning Models

OpenAI's o-series models (o1, o3, o4-mini) have built-in reasoning:

from cogent.models.openai import OpenAIChat

# Reasoning effort: "low", "medium", "high"
model = OpenAIChat(
    model="o4-mini",
    reasoning_effort="high",  # More thorough reasoning
)

response = await model.ainvoke([
    {"role": "user", "content": "Prove that sqrt(2) is irrational"}
])

Using ReasoningConfig:

from cogent.models.openai import OpenAIChat
from cogent.reasoning import ReasoningConfig

model = OpenAIChat(model="o4-mini")
reasoning_model = model.with_reasoning(ReasoningConfig(effort="high"))

Notes:

- Reasoning is internal (not exposed in response)
- No thinking budget; use reasoning_effort instead
- Supports structured output with json_schema response format

Gemini Thinking

Gemini 2.5 and 3.0 models support thinking with budget control:

from cogent.models.gemini import GeminiChat

model = GeminiChat(
    model="gemini-2.5-flash-preview-05-20",  # or gemini-3-flash-preview
    thinking_budget=8000,  # Token budget for thinking
)

response = await model.ainvoke([
    {"role": "user", "content": "What's the optimal strategy in this game?"}
])

# Access thinking
if response.thinking:
    print("Thought process:", response.thinking)

Using ReasoningConfig:

from cogent.models.gemini import GeminiChat
from cogent.reasoning import ReasoningConfig

model = GeminiChat(model="gemini-2.5-flash-preview-05-20")
thinking_model = model.with_reasoning(ReasoningConfig(budget_tokens=8000))

xAI Reasoning

Grok 4.20 and grok-4 are always-on reasoning models. grok-3-mini supports configurable reasoning effort:

from cogent.models.xai import XAIChat

# grok-4.20 is a reasoning model — no parameters needed
model = XAIChat(model="grok-4.20")
response = await model.ainvoke([
    {"role": "user", "content": "Explain the halting problem"}
])

# Use non-reasoning variant to skip reasoning (faster/cheaper)
model = XAIChat(model="grok-4.20-0309-non-reasoning")

# grok-3-mini: configurable reasoning effort
model = XAIChat(
    model="grok-3-mini",
    reasoning_effort="high",  # "low" or "high"
)

Using with_reasoning():

from cogent.models.xai import XAIChat

model = XAIChat(model="grok-3-mini")
reasoning_model = model.with_reasoning(effort="high")

Notes:

- grok-4.20, grok-4, grok-4-1-fast-reasoning: reasoning always on, no reasoning_effort parameter
- grok-4.20-0309-non-reasoning, grok-4-1-fast-non-reasoning: reasoning disabled
- grok-3-mini supports reasoning_effort ("low" or "high")
- presencePenalty, frequencyPenalty, and stop are not supported by grok-4 reasoning models
- Reasoning is internal (not exposed in response)

DeepSeek Reasoner

DeepSeek's reasoner model exposes its chain-of-thought:

from cogent.models.deepseek import DeepSeekChat

model = DeepSeekChat(model="deepseek-reasoner")

response = await model.ainvoke([
    {"role": "user", "content": "Prove the Pythagorean theorem"}
])

# Access reasoning content
if response.reasoning:
    print("Chain of thought:", response.reasoning)
print("Final answer:", response.content)

Streaming reasoning:

async for chunk in model.astream(messages):
    if chunk.reasoning:
        print(f"[Reasoning] {chunk.reasoning}", end="", flush=True)
    if chunk.content:
        print(chunk.content, end="", flush=True)

Notes:

- Reasoning always enabled for deepseek-reasoner
- Does NOT support tools or structured output
- Use deepseek-chat for non-reasoning use cases
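
If the chain-of-thought and the answer need to be stored separately, for example to log one and display the other, the streaming loop can accumulate them into two buffers. This is a sketch against a fake stream: fake_reasoner_stream and split_stream are illustrative names, and the ordering of reasoning chunks before answer chunks is an assumption.

```python
import asyncio
from types import SimpleNamespace


async def fake_reasoner_stream():
    """Stand-in stream: reasoning chunks first, then answer chunks (an assumption)."""
    yield SimpleNamespace(reasoning="a^2 + b^2 ", content="")
    yield SimpleNamespace(reasoning="= c^2 by area", content="")
    yield SimpleNamespace(reasoning="", content="Proved.")


async def split_stream(stream):
    """Accumulate reasoning text and answer text in separate buffers."""
    reasoning_parts, answer_parts = [], []
    async for chunk in stream:
        if chunk.reasoning:
            reasoning_parts.append(chunk.reasoning)
        if chunk.content:
            answer_parts.append(chunk.content)
    return "".join(reasoning_parts), "".join(answer_parts)


reasoning, answer = asyncio.run(split_stream(fake_reasoner_stream()))
```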

ReasoningConfig

Unified configuration for reasoning across providers:

from cogent.reasoning import ReasoningConfig

# Token budget (Anthropic, Gemini)
config = ReasoningConfig(budget_tokens=10000)

# Effort level (OpenAI, xAI)
config = ReasoningConfig(effort="high")

# Both (uses appropriate one per provider)
config = ReasoningConfig(budget_tokens=10000, effort="high")

Provider mapping:

| Provider | budget_tokens | effort |
| --- | --- | --- |
| Anthropic | thinking.budget_tokens | |
| OpenAI | | reasoning_effort |
| Gemini | thinking_budget | |
| xAI | | reasoning_effort |
| DeepSeek | ❌ (always on) | |
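
The mapping can be read as a small dispatch: each provider picks the one field it understands and ignores the other. The sketch below illustrates that idea with a local stand-in dataclass; ReasoningSettings and provider_kwargs are hypothetical names, and Cogent's actual translation logic may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ReasoningSettings:
    """Local stand-in for a unified config; not Cogent's ReasoningConfig."""
    budget_tokens: Optional[int] = None
    effort: Optional[str] = None


def provider_kwargs(provider: str, cfg: ReasoningSettings) -> Dict[str, Any]:
    """Pick the field each provider understands, following the mapping table."""
    if provider == "anthropic" and cfg.budget_tokens is not None:
        return {"thinking": {"type": "enabled", "budget_tokens": cfg.budget_tokens}}
    if provider == "gemini" and cfg.budget_tokens is not None:
        return {"thinking_budget": cfg.budget_tokens}
    if provider in ("openai", "xai") and cfg.effort is not None:
        return {"reasoning_effort": cfg.effort}
    return {}  # DeepSeek (always on) or no applicable setting


cfg = ReasoningSettings(budget_tokens=10000, effort="high")
```

Setting both fields, as in the last line, lets one config object be reused across providers.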

Structured Output

Chat models support structured output via with_structured_output() for type-safe JSON responses:

from pydantic import BaseModel, Field
from cogent.models.openai import OpenAIChat

class Person(BaseModel):
    name: str = Field(description="Full name")
    age: int = Field(description="Age in years")

llm = OpenAIChat(model="gpt-5.4").with_structured_output(Person)

response = await llm.ainvoke([
    {"role": "user", "content": "Extract: John Doe is 30 years old"}
])

For most use cases, use agent.run(task, returns=Schema) instead of calling with_structured_output() directly — the agent handles validation, retry, and method selection automatically.
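
To show what "validation and retry" can mean in practice, here is a standard-library sketch of the idea: parse the model's text as JSON, check it against the schema, and ask again on failure. It does not use Cogent or Pydantic; parse_person and run_with_retry are hypothetical helpers, and the agent's real retry strategy is not documented on this page.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Person:
    name: str
    age: int


def parse_person(raw: str) -> Person:
    """Parse JSON text and check field types; raises ValueError on any mismatch."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    name, age = data.get("name"), data.get("age")
    # bool is a subclass of int, so exclude it explicitly
    if not isinstance(name, str) or not isinstance(age, int) or isinstance(age, bool):
        raise ValueError("name must be a string and age an integer")
    return Person(name=name, age=age)


def run_with_retry(call_model: Callable[[], str], attempts: int = 3) -> Person:
    """Call the model until its output validates, up to `attempts` times."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return parse_person(call_model())
        except ValueError as exc:
            last_error = exc
    raise ValueError(f"no valid output after {attempts} attempts") from last_error


# Stand-in for a model that returns incomplete output once, then valid JSON
replies = iter(['{"name": "John Doe"}', '{"name": "John Doe", "age": 30}'])
person = run_with_retry(lambda: next(replies))
```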

See Structured Output for the full reference — provider support table, output methods, schema types, field guidance, and few-shot examples.