Reasoning & Streaming¶
Thinking/reasoning models, streaming metadata, and structured output from the model layer.
See Models Overview for the 3-tier API and configuration.
Streaming¶
All chat models support streaming. Every chunk carries metadata, and the final chunk adds token usage and the finish reason:
from cogent.models import ChatModel
model = ChatModel(model="gpt-5.4")
async for chunk in model.astream([
    {"role": "user", "content": "Write a story"}
]):
    print(chunk.content, end="", flush=True)
    # Access metadata in all chunks
    if chunk.metadata:
        print(f"\nModel: {chunk.metadata.model}")
        print(f"Response ID: {chunk.metadata.response_id}")
        # Token usage available in final chunk
        if chunk.metadata.tokens:
            print(f"Tokens: {chunk.metadata.tokens.total_tokens}")
            print(f"Finish: {chunk.metadata.finish_reason}")
Streaming Metadata¶
All 10 chat providers return complete metadata during streaming:
| Provider | Model | Finish Reason | Token Usage | Notes |
|---|---|---|---|---|
| OpenAI | ✅ | ✅ | ✅ | Uses stream_options={"include_usage": True} |
| Gemini | ✅ | ✅ | ✅ | Extracts from usage_metadata |
| Groq | ✅ | ✅ | ✅ | Compatible with OpenAI pattern |
| Mistral | ✅ | ✅ | ✅ | Metadata accumulation |
| Cohere | ✅ | ✅ | ✅ | Event-based streaming (message-end) |
| Anthropic | ✅ | ✅ | ✅ | Snapshot-based metadata |
| Cloudflare | ✅ | ✅ | ✅ | Stream options support |
| Ollama | ✅ | ✅ | ✅ | Local model metadata |
| Azure OpenAI | ✅ | ✅ | ✅ | Stream options support |
| Azure AI Foundry / GitHub | ✅ | ✅ | ✅ | Stream options via model_extras |
Metadata Structure:
@dataclass
class MessageMetadata:
    id: str | None               # Response ID
    timestamp: str | None        # ISO 8601 timestamp
    model: str | None            # Model name/version
    tokens: TokenUsage | None    # Token counts
    finish_reason: str | None    # stop, length, error
    response_id: str | None      # Provider response ID
    duration: float | None       # Request duration (ms)
    correlation_id: str | None   # For tracing
@dataclass
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    reasoning_tokens: int | None  # Reasoning tokens (if available)
Note: reasoning_tokens is populated by models that support reasoning/thinking (o1/o3, deepseek-reasoner, Claude extended thinking, Gemini thinking, Grok).
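When several calls contribute to one task, the per-response counts can be summed. The following is a minimal sketch under stated assumptions: it redeclares a TokenUsage stand-in that mirrors the fields above (with a default of None added for reasoning_tokens) so it runs without Cogent installed, and add_usage is a hypothetical helper, not part of the library. It treats a missing reasoning_tokens as "not reported" rather than as zero.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Stand-in mirroring the TokenUsage fields documented above."""
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    reasoning_tokens: int | None = None  # default added for this sketch only


def add_usage(a: TokenUsage, b: TokenUsage) -> TokenUsage:
    """Sum two usage records (hypothetical helper, not a Cogent API)."""
    # Keep None only when neither side reported reasoning tokens.
    if a.reasoning_tokens is None and b.reasoning_tokens is None:
        reasoning = None
    else:
        reasoning = (a.reasoning_tokens or 0) + (b.reasoning_tokens or 0)
    return TokenUsage(
        prompt_tokens=a.prompt_tokens + b.prompt_tokens,
        completion_tokens=a.completion_tokens + b.completion_tokens,
        total_tokens=a.total_tokens + b.total_tokens,
        reasoning_tokens=reasoning,
    )


first = TokenUsage(prompt_tokens=12, completion_tokens=30, total_tokens=42)
second = TokenUsage(prompt_tokens=8, completion_tokens=50, total_tokens=58,
                    reasoning_tokens=20)
combined = add_usage(first, second)
print(combined)
```

Keeping None distinct from 0 lets a caller tell "this provider reported no reasoning tokens" apart from "this provider does not report them at all".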
Streaming Pattern:
- Content chunks — Include partial metadata (model, response_id, timestamp)
- Final chunk — Empty content with complete metadata (finish_reason, tokens)
# Example streaming flow
async for chunk in model.astream(messages):
    # Chunks 1-N: Content with partial metadata
    if chunk.content:
        print(chunk.content, end="")
    # Final chunk: Complete metadata
    if chunk.metadata and chunk.metadata.finish_reason:
        tokens = chunk.metadata.tokens  # guard: tokens is typed TokenUsage | None
        total = tokens.total_tokens if tokens else "unknown"
        print(f"\n\nCompleted with {total} tokens")
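The pattern above (content chunks followed by an empty final chunk) can be exercised without a provider. The sketch below uses hypothetical stand-ins, FakeChunk, FakeMetadata and fake_stream, in place of Cogent's real chunk types, and shows one way to join the text while keeping the metadata from the chunk that carries finish_reason.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class FakeMetadata:
    """Stand-in for MessageMetadata, reduced to two fields."""
    finish_reason: str | None = None
    total_tokens: int | None = None


@dataclass
class FakeChunk:
    """Stand-in for a streamed chunk."""
    content: str
    metadata: FakeMetadata | None = None


async def fake_stream() -> AsyncIterator[FakeChunk]:
    """Simulate the documented flow: content chunks, then an empty final chunk."""
    for piece in ("Once ", "upon ", "a time"):
        yield FakeChunk(content=piece, metadata=FakeMetadata())
    yield FakeChunk(content="",
                    metadata=FakeMetadata(finish_reason="stop", total_tokens=17))


async def collect(stream: AsyncIterator[FakeChunk]) -> tuple[str, FakeMetadata | None]:
    """Join content and keep the metadata of the chunk that has a finish_reason."""
    parts: list[str] = []
    final: FakeMetadata | None = None
    async for chunk in stream:
        if chunk.content:
            parts.append(chunk.content)
        if chunk.metadata and chunk.metadata.finish_reason:
            final = chunk.metadata
    return "".join(parts), final


text, meta = asyncio.run(collect(fake_stream()))
print(text)
print(meta)
```

With a real model, the same collect loop would presumably take model.astream(messages) in place of fake_stream(), reading tokens from chunk.metadata.tokens instead of the simplified total_tokens field used here.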
Thinking & Reasoning¶
Several providers offer "reasoning" or "thinking" models that work through a chain of thought before answering. Some expose that reasoning in the response, while others keep it internal. Cogent provides unified access to these capabilities.
@effort Shorthand¶
The fastest way to control reasoning intensity — append @level or @budget to any model string:
from cogent import Agent
# Named effort levels
agent = Agent(name="Analyst", model="o3-mini@high")
agent = Agent(name="Analyst", model="claude@high")
agent = Agent(name="Analyst", model="gpt-5.4-nano@low")
# Token thinking budget
agent = Agent(name="Analyst", model="claude-sonnet@16k")
agent = Agent(name="Analyst", model="gemini-2.5-pro@8192")
# Works with provider prefix and aliases
agent = Agent(name="Analyst", model="openai:o3-mini@medium")
agent = Agent(name="Analyst", model="anthropic:claude@max")
# Also works with create_chat()
from cogent.models import create_chat
llm = create_chat("o3-mini@high")
Provider mapping:
| Suffix | OpenAI | Anthropic | Gemini | xAI |
|---|---|---|---|---|
| @low | reasoning_effort="low" | effort="low" + adaptive | thinking_budget=2048 | reasoning_effort="low" |
| @medium | reasoning_effort="medium" | effort="medium" + adaptive | thinking_budget=8192 | — |
| @high | reasoning_effort="high" | effort="high" + adaptive | thinking_budget=24576 | reasoning_effort="high" |
| @max | reasoning_effort="high" | effort="max" + adaptive | thinking_budget=24576 | reasoning_effort="high" |
| @16k | — | thinking_budget=16384 | thinking_budget=16384 | — |
Explicit model_kwargs override the @ suffix. For multi-parameter control (budget + display options), use model_kwargs or construct the model directly.
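To make the suffix rules concrete, here is a rough sketch of how a string such as "openai:o3-mini@medium" could be split and resolved. It is not Cogent's parser: parse_model_spec and gemini_kwargs are hypothetical names, the Gemini budgets are copied from the mapping table, a "k" suffix is assumed to mean multiples of 1024 (so that "16k" gives 16384), and the provider prefix is assumed to be everything before the first colon.

```python
from __future__ import annotations

# Named levels mapped to Gemini thinking budgets, copied from the table above.
GEMINI_BUDGETS = {"low": 2048, "medium": 8192, "high": 24576, "max": 24576}


def parse_model_spec(spec: str) -> dict:
    """Split 'provider:model@suffix' into parts (illustrative only)."""
    provider = None
    if ":" in spec:
        provider, spec = spec.split(":", 1)
    model, _, suffix = spec.partition("@")
    effort = None
    budget = None
    if suffix in GEMINI_BUDGETS:
        effort = suffix                       # named level such as "high"
    elif suffix.endswith("k") and suffix[:-1].isdigit():
        budget = int(suffix[:-1]) * 1024      # "16k" -> 16384
    elif suffix.isdigit():
        budget = int(suffix)                  # plain token count such as "8192"
    elif suffix:
        raise ValueError(f"unrecognised suffix: {suffix!r}")
    return {"provider": provider, "model": model, "effort": effort, "budget": budget}


def gemini_kwargs(spec: str, model_kwargs: dict | None = None) -> dict:
    """Resolve a spec to Gemini-style kwargs; explicit model_kwargs take precedence."""
    parsed = parse_model_spec(spec)
    resolved: dict = {}
    if parsed["effort"] is not None:
        resolved["thinking_budget"] = GEMINI_BUDGETS[parsed["effort"]]
    elif parsed["budget"] is not None:
        resolved["thinking_budget"] = parsed["budget"]
    resolved.update(model_kwargs or {})       # explicit values override the suffix
    return resolved


print(parse_model_spec("openai:o3-mini@medium"))
print(gemini_kwargs("gemini-2.5-pro@high", {"thinking_budget": 512}))
```

The final update call is what models the precedence rule stated above: whatever the suffix resolved to, an explicit entry in model_kwargs replaces it.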
Feature Comparison¶
| Provider | Models | Control Parameter | Access Reasoning | Structured Output |
|---|---|---|---|---|
| Anthropic | claude-sonnet-4, claude-opus-4 | thinking_budget | msg.thinking | ✅ via thinking |
| OpenAI | o1, o3, o4-mini | reasoning_effort | Hidden | ✅ |
| Gemini | gemini-2.5-* | thinking_budget | msg.thinking | ✅ |
| xAI | grok-3-mini | reasoning_effort | Hidden | ✅ |
| DeepSeek | deepseek-reasoner | Always on | msg.reasoning | ❌ |
Anthropic Extended Thinking¶
Claude models support extended thinking with configurable token budgets:
from cogent.models.anthropic import AnthropicChat
# Enable extended thinking with budget
model = AnthropicChat(
    model="claude-sonnet-4-20250514",
    thinking={"type": "enabled", "budget_tokens": 10000},
)
response = await model.ainvoke([
    {"role": "user", "content": "Solve this step by step: 15! / (12! * 3!)"}
])
# Access thinking content
if response.thinking:
print("Thinking:", response.thinking)
print("Answer:", response.content)
Using ReasoningConfig:
from cogent.models.anthropic import AnthropicChat
from cogent.reasoning import ReasoningConfig
# Create config
config = ReasoningConfig(budget_tokens=10000)
# Apply to model
model = AnthropicChat(model="claude-sonnet-4-20250514")
thinking_model = model.with_reasoning(config)
response = await thinking_model.ainvoke(messages)
Features:
- Thinking exposed in msg.thinking attribute
- Works with streaming (thinking streamed first)
- Compatible with with_structured_output() via thinking
OpenAI Reasoning Models¶
OpenAI's o-series models (o1, o3, o4-mini) have built-in reasoning:
from cogent.models.openai import OpenAIChat
# Reasoning effort: "low", "medium", "high"
model = OpenAIChat(
    model="o4-mini",
    reasoning_effort="high",  # More thorough reasoning
)
response = await model.ainvoke([
    {"role": "user", "content": "Prove that sqrt(2) is irrational"}
])
Using ReasoningConfig:
from cogent.models.openai import OpenAIChat
from cogent.reasoning import ReasoningConfig
model = OpenAIChat(model="o4-mini")
reasoning_model = model.with_reasoning(ReasoningConfig(effort="high"))
Notes:
- Reasoning is internal (not exposed in response)
- No thinking budget; use reasoning_effort instead
- Supports structured output with json_schema response format
Gemini Thinking¶
Gemini 2.5 and 3.0 models support thinking with budget control:
from cogent.models.gemini import GeminiChat
model = GeminiChat(
    model="gemini-2.5-flash-preview-05-20",  # or gemini-3-flash-preview
    thinking_budget=8000,  # Token budget for thinking
)
response = await model.ainvoke([
    {"role": "user", "content": "What's the optimal strategy in this game?"}
])
# Access thinking
if response.thinking:
    print("Thought process:", response.thinking)
Using ReasoningConfig:
from cogent.models.gemini import GeminiChat
from cogent.reasoning import ReasoningConfig
model = GeminiChat(model="gemini-2.5-flash-preview-05-20")
thinking_model = model.with_reasoning(ReasoningConfig(budget_tokens=8000))
xAI Reasoning¶
grok-4.20 and grok-4 are always-on reasoning models. grok-3-mini supports configurable reasoning effort:
from cogent.models.xai import XAIChat
# grok-4.20 is a reasoning model — no parameters needed
model = XAIChat(model="grok-4.20")
response = await model.ainvoke([
    {"role": "user", "content": "Explain the halting problem"}
])
# Use non-reasoning variant to skip reasoning (faster/cheaper)
model = XAIChat(model="grok-4.20-0309-non-reasoning")
# grok-3-mini: configurable reasoning effort
model = XAIChat(
    model="grok-3-mini",
    reasoning_effort="high",  # "low" or "high"
)
Using with_reasoning():
from cogent.models.xai import XAIChat
model = XAIChat(model="grok-3-mini")
reasoning_model = model.with_reasoning(effort="high")
Notes:
- grok-4.20, grok-4, grok-4-1-fast-reasoning: reasoning always on, no reasoning_effort parameter
- grok-4.20-0309-non-reasoning, grok-4-1-fast-non-reasoning: reasoning disabled
- grok-3-mini supports reasoning_effort ("low" or "high")
- presencePenalty, frequencyPenalty, and stop are not supported by grok-4 reasoning models
- Reasoning is internal (not exposed in response)
DeepSeek Reasoner¶
DeepSeek's reasoner model exposes its chain-of-thought:
from cogent.models.deepseek import DeepSeekChat
model = DeepSeekChat(model="deepseek-reasoner")
response = await model.ainvoke([
    {"role": "user", "content": "Prove the Pythagorean theorem"}
])
# Access reasoning content
if response.reasoning:
    print("Chain of thought:", response.reasoning)
print("Final answer:", response.content)
Streaming reasoning:
async for chunk in model.astream(messages):
    if chunk.reasoning:
        print(f"[Reasoning] {chunk.reasoning}", end="", flush=True)
    if chunk.content:
        print(chunk.content, end="", flush=True)
Notes:
- Reasoning always enabled for deepseek-reasoner
- Does NOT support tools or structured output
- Use deepseek-chat for non-reasoning use cases
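If the chain of thought should be logged rather than shown to the user, the two streams can be buffered separately. This is a small sketch with hypothetical stand-ins (ReasonerChunk, fake_reasoner_stream) rather than the real DeepSeekChat stream; the ordering of reasoning before content in the fake stream is an assumption made for illustration.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class ReasonerChunk:
    """Stand-in for a streamed chunk with separate reasoning and content fields."""
    reasoning: str | None = None
    content: str = ""


async def fake_reasoner_stream() -> AsyncIterator[ReasonerChunk]:
    """Hypothetical stream: reasoning fragments first, then answer fragments."""
    yield ReasonerChunk(reasoning="Compare areas of ")
    yield ReasonerChunk(reasoning="two squares.")
    yield ReasonerChunk(content="a^2 + b^2 ")
    yield ReasonerChunk(content="= c^2")


async def split_stream(stream: AsyncIterator[ReasonerChunk]) -> tuple[str, str]:
    """Accumulate reasoning and answer text into separate buffers."""
    reasoning: list[str] = []
    answer: list[str] = []
    async for chunk in stream:
        if chunk.reasoning:
            reasoning.append(chunk.reasoning)
        if chunk.content:
            answer.append(chunk.content)
    return "".join(reasoning), "".join(answer)


reasoning_text, answer_text = asyncio.run(split_stream(fake_reasoner_stream()))
print("logged:", reasoning_text)
print("shown:", answer_text)
```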
ReasoningConfig¶
Unified configuration for reasoning across providers:
from cogent.reasoning import ReasoningConfig
# Token budget (Anthropic, Gemini)
config = ReasoningConfig(budget_tokens=10000)
# Effort level (OpenAI, xAI)
config = ReasoningConfig(effort="high")
# Both (uses appropriate one per provider)
config = ReasoningConfig(budget_tokens=10000, effort="high")
Provider mapping:
| Provider | budget_tokens | effort |
|---|---|---|
| Anthropic | ✅ thinking.budget_tokens | ❌ |
| OpenAI | ❌ | ✅ reasoning_effort |
| Gemini | ✅ thinking_budget | ❌ |
| xAI | ❌ | ✅ reasoning_effort |
| DeepSeek | ❌ (always on) | ❌ |
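The mapping table can be read as a small dispatch function. The sketch below is an illustration of that reading, not the library's implementation: ReasoningConfigSketch is a stand-in for ReasoningConfig with only the two fields shown above, and to_provider_kwargs, along with the lowercase provider keys it accepts, is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ReasoningConfigSketch:
    """Stand-in with the two documented fields; not the Cogent class."""
    budget_tokens: int | None = None
    effort: str | None = None


def to_provider_kwargs(provider: str, config: ReasoningConfigSketch) -> dict:
    """Pick the field each provider understands, following the mapping table."""
    if provider == "anthropic" and config.budget_tokens is not None:
        return {"thinking": {"type": "enabled", "budget_tokens": config.budget_tokens}}
    if provider == "gemini" and config.budget_tokens is not None:
        return {"thinking_budget": config.budget_tokens}
    if provider in ("openai", "xai") and config.effort is not None:
        return {"reasoning_effort": config.effort}
    # DeepSeek always reasons; fields a provider does not use are dropped here.
    return {}


both = ReasoningConfigSketch(budget_tokens=10000, effort="high")
for name in ("anthropic", "openai", "gemini", "xai", "deepseek"):
    print(name, to_provider_kwargs(name, both))
```

Setting both fields, as in the last ReasoningConfig example above, lets one config object travel across providers, with each provider reading only the field it supports.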
Structured Output¶
Chat models support structured output via with_structured_output() for type-safe JSON responses:
from pydantic import BaseModel, Field
from cogent.models.openai import OpenAIChat
class Person(BaseModel):
    name: str = Field(description="Full name")
    age: int = Field(description="Age in years")
llm = OpenAIChat(model="gpt-5.4").with_structured_output(Person)
response = await llm.ainvoke([
    {"role": "user", "content": "Extract: John Doe is 30 years old"}
])
For most use cases, use agent.run(task, returns=Schema) instead of calling with_structured_output() directly — the agent handles validation, retry, and method selection automatically.
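One way to picture the validation and retry that the agent takes care of is a loop that re-prompts until the output parses. The sketch below is not Cogent's implementation: it uses only the standard library (a dataclass and a hand-written check instead of pydantic), and parse_person, extract_with_retry and the canned replies are all hypothetical.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Person:
    name: str
    age: int


def parse_person(raw: str) -> Person:
    """Validate raw JSON against the Person shape; raise ValueError on mismatch."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    name, age = data.get("name"), data.get("age")
    if not isinstance(name, str):
        raise ValueError("field 'name' must be a string")
    if not isinstance(age, int) or isinstance(age, bool):
        raise ValueError("field 'age' must be an integer")
    return Person(name=name, age=age)


def extract_with_retry(call_model: Callable[[str], str], prompt: str,
                       max_attempts: int = 3) -> Person:
    """Call the model until its output validates, feeding the error back each time."""
    last_error: Exception | None = None
    for _ in range(max_attempts):
        full_prompt = prompt if last_error is None else f"{prompt}\nPrevious error: {last_error}"
        try:
            return parse_person(call_model(full_prompt))
        except ValueError as exc:
            last_error = exc
    raise ValueError(f"no valid output after {max_attempts} attempts: {last_error}")


# Hypothetical model that omits a field once, then returns valid JSON.
replies = iter(['{"name": "John Doe"}', '{"name": "John Doe", "age": 30}'])
person = extract_with_retry(lambda prompt: next(replies),
                            "Extract: John Doe is 30 years old")
print(person)
```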
See Structured Output for the full reference — provider support table, output methods, schema types, field guidance, and few-shot examples.