Reasoning & Streaming¶

Thinking/reasoning models, streaming metadata, and structured output from the model layer.

See Models Overview for the 3-tier API and configuration.

Streaming¶

All models support streaming with complete metadata:

from cogent.models import ChatModel

model = ChatModel(model="gpt-5.4")

async for chunk in model.astream([
    {"role": "user", "content": "Write a story"}
]):
    print(chunk.content, end="", flush=True)

    # Access metadata in all chunks
    if chunk.metadata:
        print(f"\nModel: {chunk.metadata.model}")
        print(f"Response ID: {chunk.metadata.response_id}")

        # Token usage available in final chunk
        if chunk.metadata.tokens:
            print(f"Tokens: {chunk.metadata.tokens.total_tokens}")
            print(f"Finish: {chunk.metadata.finish_reason}")

Streaming Metadata¶

All 10 chat providers return complete metadata during streaming:

Provider	Model	Finish Reason	Token Usage	Notes
OpenAI	✅	✅	✅	Uses `stream_options={"include_usage": True}`
Gemini	✅	✅	✅	Extracts from `usage_metadata`
Groq	✅	✅	✅	Compatible with OpenAI pattern
Mistral	✅	✅	✅	Metadata accumulation
Cohere	✅	✅	✅	Event-based streaming (`message-end`)
Anthropic	✅	✅	✅	Snapshot-based metadata
Cloudflare	✅	✅	✅	Stream options support
Ollama	✅	✅	✅	Local model metadata
Azure OpenAI	✅	✅	✅	Stream options support
Azure AI Foundry / GitHub	✅	✅	✅	Stream options via model_extras

Metadata Structure:

@dataclass
class MessageMetadata:
    id: str | None              # Response ID
    timestamp: str | None       # ISO 8601 timestamp
    model: str | None           # Model name/version
    tokens: TokenUsage | None   # Token counts
    finish_reason: str | None   # stop, length, error
    response_id: str | None     # Provider response ID
    duration: float | None      # Request duration (ms)
    correlation_id: str | None  # For tracing

@dataclass
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    reasoning_tokens: int | None  # Reasoning tokens (if available)

Note: reasoning_tokens is populated by models that support reasoning/thinking (o1/o3, deepseek-reasoner, Claude extended thinking, Gemini thinking, Grok).
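
When usage is tracked across several calls, the optional reasoning_tokens field needs some care, because not every model reports it. The sketch below is one way to sum usage records. It re-declares a stand-in TokenUsage that mirrors the fields above rather than importing Cogent, and add_usage is a hypothetical helper, not part of the library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenUsage:
    """Stand-in mirroring the documented fields; not imported from Cogent."""
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    reasoning_tokens: Optional[int] = None


def add_usage(a: TokenUsage, b: TokenUsage) -> TokenUsage:
    """Hypothetical helper: sum two usage records.

    reasoning_tokens stays None only when neither record reports it,
    so "unknown" is not silently turned into zero.
    """
    if a.reasoning_tokens is None and b.reasoning_tokens is None:
        reasoning = None
    else:
        reasoning = (a.reasoning_tokens or 0) + (b.reasoning_tokens or 0)
    return TokenUsage(
        prompt_tokens=a.prompt_tokens + b.prompt_tokens,
        completion_tokens=a.completion_tokens + b.completion_tokens,
        total_tokens=a.total_tokens + b.total_tokens,
        reasoning_tokens=reasoning,
    )
```

Keeping None distinct from 0 makes it possible to tell "this model did no reasoning" apart from "this provider did not report reasoning tokens".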

Streaming Pattern:

- Content chunks: include partial metadata (model, response_id, timestamp)
- Final chunk: empty content with complete metadata (finish_reason, tokens)

# Example streaming flow
async for chunk in model.astream(messages):
    # Chunks 1-N: Content with partial metadata
    if chunk.content:
        print(chunk.content, end="")

    # Final chunk: Complete metadata
    # tokens is typed as TokenUsage | None, so guard before reading it
    if chunk.metadata and chunk.metadata.finish_reason and chunk.metadata.tokens:
        print(f"\n\nCompleted with {chunk.metadata.tokens.total_tokens} tokens")
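
The same pattern can be tried without a provider or API key. The following sketch simulates the flow described above with stand-in types: FakeChunk, FakeMetadata, fake_stream, and collect are illustrative names and assumptions, not Cogent classes, and the token count is flattened onto the metadata object for brevity.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, List, Optional, Tuple


@dataclass
class FakeMetadata:
    """Stand-in for MessageMetadata with only the fields used here."""
    model: Optional[str] = None
    finish_reason: Optional[str] = None
    total_tokens: Optional[int] = None


@dataclass
class FakeChunk:
    """Stand-in for a streamed chunk: text plus optional metadata."""
    content: str = ""
    metadata: Optional[FakeMetadata] = None


async def fake_stream() -> AsyncIterator[FakeChunk]:
    """Simulates content chunks followed by an empty final chunk."""
    yield FakeChunk("Once ", FakeMetadata(model="demo-model"))
    yield FakeChunk("upon a time", FakeMetadata(model="demo-model"))
    yield FakeChunk("", FakeMetadata(model="demo-model", finish_reason="stop", total_tokens=42))


async def collect(stream: AsyncIterator[FakeChunk]) -> Tuple[str, Optional[FakeMetadata]]:
    """Join the text and keep the metadata of the chunk carrying finish_reason."""
    parts: List[str] = []
    final: Optional[FakeMetadata] = None
    async for chunk in stream:
        if chunk.content:
            parts.append(chunk.content)
        if chunk.metadata and chunk.metadata.finish_reason:
            final = chunk.metadata
    return "".join(parts), final


text, meta = asyncio.run(collect(fake_stream()))
```

Because collect only relies on the content and metadata attributes, it should work unchanged on any stream whose chunks expose the same shape.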

Thinking & Reasoning¶

Several providers offer "reasoning" or "thinking" models that expose their chain-of-thought process. Cogent provides unified access to these capabilities.

`@effort` Shorthand¶

The fastest way to control reasoning intensity is to append @level (a named effort level) or @budget (a thinking budget in tokens) to any model string:

from cogent import Agent

# Named effort levels
agent = Agent(name="Analyst", model="o3-mini@high")
agent = Agent(name="Analyst", model="claude@high")
agent = Agent(name="Analyst", model="gpt-5.4-nano@low")

# Token thinking budget
agent = Agent(name="Analyst", model="claude-sonnet@16k")
agent = Agent(name="Analyst", model="gemini-2.5-pro@8192")

# Works with provider prefix and aliases
agent = Agent(name="Analyst", model="openai:o3-mini@medium")
agent = Agent(name="Analyst", model="anthropic:claude@max")

# Also works with create_chat()
from cogent.models import create_chat
llm = create_chat("o3-mini@high")

Provider mapping:

Suffix	OpenAI	Anthropic	Gemini	xAI
`@low`	`reasoning_effort="low"`	`effort="low"` + adaptive	`thinking_budget=2048`	`reasoning_effort="low"`
`@medium`	`reasoning_effort="medium"`	`effort="medium"` + adaptive	`thinking_budget=8192`	—
`@high`	`reasoning_effort="high"`	`effort="high"` + adaptive	`thinking_budget=24576`	`reasoning_effort="high"`
`@max`	`reasoning_effort="high"`	`effort="max"` + adaptive	`thinking_budget=24576`	`reasoning_effort="high"`
`@16k`	—	`thinking_budget=16384`	`thinking_budget=16384`	—

Explicit model_kwargs override the @ suffix. For multi-parameter control (budget + display options), use model_kwargs or construct the model directly.
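
To make the suffix grammar concrete, here is a minimal sketch of how a string such as "openai:o3-mini@medium" could be split into its parts. This is not Cogent's parser: parse_model_string and ParsedModel are hypothetical names, the "k" multiplier of 1024 is inferred from the `@16k` row in the table above, and model names that themselves contain ":" or "@" are not handled.

```python
from dataclasses import dataclass
from typing import Optional

NAMED_LEVELS = {"low", "medium", "high", "max"}


@dataclass
class ParsedModel:
    provider: Optional[str]  # e.g. "openai", or None when no prefix is given
    model: str               # model name or alias without the suffix
    effort: Optional[str]    # named level such as "high"
    budget: Optional[int]    # thinking budget in tokens


def parse_model_string(spec: str) -> ParsedModel:
    """Hypothetical parser for '[provider:]model[@level|@budget]' strings."""
    base, _, suffix = spec.partition("@")
    provider, sep, model = base.partition(":")
    if not sep:  # no provider prefix present
        provider, model = None, base
    effort = None
    budget = None
    if suffix:
        s = suffix.lower()
        if s in NAMED_LEVELS:
            effort = s
        elif s.endswith("k") and s[:-1].isdigit():
            budget = int(s[:-1]) * 1024  # "16k" -> 16384, matching the table
        elif s.isdigit():
            budget = int(s)
        else:
            raise ValueError(f"unrecognized effort suffix: {suffix!r}")
    return ParsedModel(provider, model, effort, budget)
```

A parser along these lines yields either an effort level or a budget, never both, which matches the note that multi-parameter control belongs in model_kwargs.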

Feature Comparison¶

Provider	Models	Control Parameter	Access Reasoning	Structured Output
Anthropic	`claude-sonnet-4`, `claude-opus-4`	`thinking_budget`	`msg.thinking`	✅ via thinking
OpenAI	`o1`, `o3`, `o4-mini`	`reasoning_effort`	Hidden	✅
Gemini	`gemini-2.5-*`	`thinking_budget`	`msg.thinking`	✅
xAI	`grok-3-mini`	`reasoning_effort`	Hidden	✅
DeepSeek	`deepseek-reasoner`	Always on	`msg.reasoning`	❌

Anthropic Extended Thinking¶

Claude models support extended thinking with configurable token budgets:

from cogent.models.anthropic import AnthropicChat

# Enable extended thinking with budget
model = AnthropicChat(
    model="claude-sonnet-4-20250514",
    thinking={"type": "enabled", "budget_tokens": 10000},
)

response = await model.ainvoke([
    {"role": "user", "content": "Solve this step by step: 15! / (12! * 3!)"}
])

# Access thinking content
if response.thinking:
    print("Thinking:", response.thinking)
print("Answer:", response.content)

Using ReasoningConfig:

from cogent.models.anthropic import AnthropicChat
from cogent.reasoning import ReasoningConfig

# Create config
config = ReasoningConfig(budget_tokens=10000)

# Apply to model
model = AnthropicChat(model="claude-sonnet-4-20250514")
thinking_model = model.with_reasoning(config)

response = await thinking_model.ainvoke(messages)

Features:

- Thinking exposed in msg.thinking attribute
- Works with streaming (thinking streamed first)
- Compatible with with_structured_output() via thinking

OpenAI Reasoning Models¶

OpenAI's o-series models (o1, o3, o4-mini) have built-in reasoning:

from cogent.models.openai import OpenAIChat

# Reasoning effort: "low", "medium", "high"
model = OpenAIChat(
    model="o4-mini",
    reasoning_effort="high",  # More thorough reasoning
)

response = await model.ainvoke([
    {"role": "user", "content": "Prove that sqrt(2) is irrational"}
])

Using ReasoningConfig:

from cogent.models.openai import OpenAIChat
from cogent.reasoning import ReasoningConfig

model = OpenAIChat(model="o4-mini")
reasoning_model = model.with_reasoning(ReasoningConfig(effort="high"))

Notes:

- Reasoning is internal (not exposed in response)
- No thinking budget; use reasoning_effort instead
- Supports structured output with json_schema response format

Gemini Thinking¶

Gemini 2.5 and 3.0 models support thinking with budget control:

from cogent.models.gemini import GeminiChat

model = GeminiChat(
    model="gemini-2.5-flash-preview-05-20",  # or gemini-3-flash-preview
    thinking_budget=8000,  # Token budget for thinking
)

response = await model.ainvoke([
    {"role": "user", "content": "What's the optimal strategy in this game?"}
])

# Access thinking
if response.thinking:
    print("Thought process:", response.thinking)

Using ReasoningConfig:

from cogent.models.gemini import GeminiChat
from cogent.reasoning import ReasoningConfig

model = GeminiChat(model="gemini-2.5-flash-preview-05-20")
thinking_model = model.with_reasoning(ReasoningConfig(budget_tokens=8000))

xAI Reasoning¶

Grok 4.20 and grok-4 are always-on reasoning models. grok-3-mini supports configurable reasoning effort:

from cogent.models.xai import XAIChat

# grok-4.20 is a reasoning model — no parameters needed
model = XAIChat(model="grok-4.20")
response = await model.ainvoke([
    {"role": "user", "content": "Explain the halting problem"}
])

# Use non-reasoning variant to skip reasoning (faster/cheaper)
model = XAIChat(model="grok-4.20-0309-non-reasoning")

# grok-3-mini: configurable reasoning effort
model = XAIChat(
    model="grok-3-mini",
    reasoning_effort="high",  # "low" or "high"
)

Using with_reasoning():

from cogent.models.xai import XAIChat

model = XAIChat(model="grok-3-mini")
reasoning_model = model.with_reasoning(effort="high")

Notes:

- grok-4.20, grok-4, grok-4-1-fast-reasoning: reasoning always on, no reasoning_effort parameter
- grok-4.20-0309-non-reasoning, grok-4-1-fast-non-reasoning: reasoning disabled
- grok-3-mini supports reasoning_effort ("low" or "high")
- presencePenalty, frequencyPenalty, and stop are not supported by grok-4 reasoning models
- Reasoning is internal (not exposed in response)

DeepSeek Reasoner¶

DeepSeek's reasoner model exposes its chain-of-thought:

from cogent.models.deepseek import DeepSeekChat

model = DeepSeekChat(model="deepseek-reasoner")

response = await model.ainvoke([
    {"role": "user", "content": "Prove the Pythagorean theorem"}
])

# Access reasoning content
if response.reasoning:
    print("Chain of thought:", response.reasoning)
print("Final answer:", response.content)

Streaming reasoning:

async for chunk in model.astream(messages):
    if chunk.reasoning:
        print(f"[Reasoning] {chunk.reasoning}", end="", flush=True)
    if chunk.content:
        print(chunk.content, end="", flush=True)

Notes:

- Reasoning always enabled for deepseek-reasoner
- Does NOT support tools or structured output
- Use deepseek-chat for non-reasoning use cases

ReasoningConfig¶

Unified configuration for reasoning across providers:

from cogent.reasoning import ReasoningConfig

# Token budget (Anthropic, Gemini)
config = ReasoningConfig(budget_tokens=10000)

# Effort level (OpenAI, xAI)
config = ReasoningConfig(effort="high")

# Both (uses appropriate one per provider)
config = ReasoningConfig(budget_tokens=10000, effort="high")

Provider mapping:

Provider	`budget_tokens`	`effort`
Anthropic	✅ `thinking.budget_tokens`	❌
OpenAI	❌	✅ `reasoning_effort`
Gemini	✅ `thinking_budget`	❌
xAI	❌	✅ `reasoning_effort`
DeepSeek	❌ (always on)	❌
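
The mapping in this table can be read as a small dispatch function. The sketch below is an illustration only: ReasoningSettings is a stand-in for ReasoningConfig, to_provider_kwargs and the lowercase provider keys are assumptions, and the returned keyword shapes are taken from the provider examples earlier on this page rather than from Cogent's source.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ReasoningSettings:
    """Stand-in for ReasoningConfig with the two fields shown above."""
    budget_tokens: Optional[int] = None
    effort: Optional[str] = None


BUDGET_PROVIDERS = {"anthropic", "gemini"}  # consume budget_tokens
EFFORT_PROVIDERS = {"openai", "xai"}        # consume effort


def to_provider_kwargs(provider: str, cfg: ReasoningSettings) -> Dict[str, Any]:
    """Hypothetical resolver: keep only the field the provider understands."""
    if provider in BUDGET_PROVIDERS and cfg.budget_tokens is not None:
        if provider == "anthropic":
            return {"thinking": {"type": "enabled", "budget_tokens": cfg.budget_tokens}}
        return {"thinking_budget": cfg.budget_tokens}
    if provider in EFFORT_PROVIDERS and cfg.effort is not None:
        return {"reasoning_effort": cfg.effort}
    return {}  # e.g. DeepSeek, where reasoning is always on
```

With both fields set, each provider picks up only its own parameter, which is why a single config with budget_tokens and effort can be shared across models.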

Structured Output¶

Chat models support structured output via with_structured_output() for type-safe JSON responses:

from pydantic import BaseModel, Field
from cogent.models.openai import OpenAIChat

class Person(BaseModel):
    name: str = Field(description="Full name")
    age: int = Field(description="Age in years")

llm = OpenAIChat(model="gpt-5.4").with_structured_output(Person)

response = await llm.ainvoke([
    {"role": "user", "content": "Extract: John Doe is 30 years old"}
])

For most use cases, use agent.run(task, returns=Schema) instead of calling with_structured_output() directly — the agent handles validation, retry, and method selection automatically.
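
To give a feel for what "validation and retry" can involve, here is a rough standard-library sketch. It does not reflect how Cogent's agent is implemented: parse_person, extract_with_retry, and the call_model callback are hypothetical, and a plain dataclass stands in for the Pydantic schema.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Person:
    name: str
    age: int


def parse_person(raw: str) -> Person:
    """Parse JSON text and check field types; raises ValueError on any mismatch."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    name, age = data.get("name"), data.get("age")
    if not isinstance(name, str):
        raise ValueError("name must be a string")
    if not isinstance(age, int) or isinstance(age, bool):
        raise ValueError("age must be an integer")
    return Person(name=name, age=age)


def extract_with_retry(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> Person:
    """Hypothetical loop: re-ask with the validation error appended until parsing succeeds."""
    last_error = ""
    for _ in range(max_attempts):
        hint = f"\nPrevious output was rejected: {last_error}" if last_error else ""
        try:
            return parse_person(call_model(prompt + hint))
        except ValueError as exc:
            last_error = str(exc)
    raise ValueError(f"no valid output after {max_attempts} attempts: {last_error}")
```

Feeding the validation error back into the next prompt is one common design choice; it gives the model a concrete reason to correct its output instead of repeating the same mistake.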

See Structured Output for the full reference — provider support table, output methods, schema types, field guidance, and few-shot examples.
