Model Providers¶

Detailed setup and configuration for each supported LLM provider.

See Models Overview for the 3-tier API, model aliases, and configuration.

OpenAI¶

from cogent import Agent
from cogent.models import OpenAIChat, OpenAIEmbedding, create_chat

# Tier 1: Simple string
agent = Agent("Helper", model="gpt4")

# Tier 2: Factory
model = create_chat("gpt4")
model = create_chat("openai", "gpt-5.4")

# Tier 3: Direct
model = OpenAIChat(
    model="gpt-5.4",
    temperature=0.7,
    max_tokens=2000,
    api_key="sk-...",  # Or OPENAI_API_KEY env var
)

# Embeddings
embeddings = OpenAIEmbedding(model="text-embedding-3-small")

# Primary API with metadata
result = await embeddings.embed(["Hello world"])
print(result.embeddings)  # Vectors
print(result.metadata)    # Full metadata

# Convenience for single text
result = await embeddings.embed("Query")
vector = result.embeddings[0]

With tools:

from cogent.models import OpenAIChat
from cogent.tools import tool

@tool
def search(query: str) -> str:
    """Search the web."""
    return f"Results for: {query}"

model = OpenAIChat(model="gpt-5.4")
bound = model.bind_tools([search])

response = await bound.ainvoke([
    {"role": "user", "content": "Search for AI news"}
])

if response.tool_calls:
    print(response.tool_calls)

xAI (Grok)¶

from cogent.models import XAIChat

# Latest flagship model (2M context, reasoning always on)
model = XAIChat(model="grok-4.20", api_key="xai-...")

# Non-reasoning variant (faster for latency-sensitive tasks)
model = XAIChat(model="grok-4.20-0309-non-reasoning")

# Fast agentic model with reasoning (2M context)
model = XAIChat(model="grok-4-1-fast-reasoning")

# Vision model
model = XAIChat(model="grok-2-vision-1212")

# With reasoning effort (grok-3-mini only)
model = XAIChat(model="grok-3-mini", reasoning_effort="high")
response = await model.ainvoke("What is 101 * 3?")
print(response.metadata.tokens.reasoning_tokens)

Available Models:

- grok-4.20-0309-reasoning: Latest flagship; 2M context, fast + reasoning
- grok-4.20-0309-non-reasoning: Non-reasoning variant; 2M context
- grok-4.20-multi-agent-0309: Multi-agent optimised variant; 2M context, reasoning
- grok-4-0709: Grok 4 stable snapshot; 256K context, reasoning
- grok-4-1-fast-reasoning: Fast model with explicit reasoning; 2M context
- grok-4-1-fast-non-reasoning: Fast model without reasoning; 2M context
- grok-3, grok-3-mini: Previous generation
- grok-2-vision-1212: Vision model

Note: Use explicit xai:model syntax, e.g. xai:grok-4.20.

Environment Variable: XAI_API_KEY

DeepSeek¶

from cogent.models import DeepSeekChat

# Standard chat model
model = DeepSeekChat(model="deepseek-chat", api_key="sk-...")

# Reasoning model with Chain of Thought
model = DeepSeekChat(model="deepseek-reasoner")
response = await model.ainvoke("9.11 and 9.8, which is greater?")

# Access reasoning content
if hasattr(response, 'reasoning'):
    print("Reasoning:", response.reasoning)
print("Answer:", response.content)

Available Models:

- deepseek-chat: General chat model with function calling
- deepseek-reasoner: Reasoning model with Chain of Thought (no function calling)

Environment Variable: DEEPSEEK_API_KEY

Note: deepseek-reasoner does NOT support function calling or sampling parameters (temperature, top_p, presence_penalty, frequency_penalty).

Cerebras (Ultra-Fast Inference)¶

from cogent.models import CerebrasChat

# Llama 3.1 8B (default)
model = CerebrasChat(model="llama3.1-8b", api_key="csk-...")

# Llama 3.3 70B
model = CerebrasChat(model="llama-3.3-70b")

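# Streaming (messages is any chat message list; this one is a placeholder)
messages = [{"role": "user", "content": "Hello!"}]
async for chunk in model.astream(messages):
    print(chunk.content, end="")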

Available Models:

- llama3.1-8b: Llama 3.1 8B (default)
- llama-3.3-70b: Llama 3.3 70B
- qwen-3-32b: Qwen 3 32B
- qwen-3-235b-a22b-instruct-2507: Qwen 3 235B MoE
- zai-glm-4.7: ZAI GLM 4.7
- gpt-oss-120b: GPT OSS 120B (reasoning model)

Note: Use explicit cerebras:model syntax, e.g. cerebras:llama3.1-8b. Bare gpt-oss-* strings are NOT routed to Cerebras.

Environment Variable: CEREBRAS_API_KEY

Note: Cerebras serves inference on its Wafer-Scale Engine (WSE-3) hardware, which is what gives it very high token throughput.

Cloudflare Workers AI¶

from cogent.models import CloudflareChat, CloudflareEmbedding

# Chat models
model = CloudflareChat(
    model="@cf/meta/llama-3.3-70b-instruct",
    account_id="...",
    api_key="...",
)

# Embeddings
embeddings = CloudflareEmbedding(
    model="@cf/baai/bge-base-en-v1.5",
    account_id="...",
    api_key="...",
)

Available Models: All Cloudflare Workers AI models with @cf/ prefix

Environment Variables: CLOUDFLARE_ACCOUNT_ID, CLOUDFLARE_API_TOKEN

Azure AI Foundry (GitHub Models)¶

import os

from cogent.models.azure import AzureAIFoundryChat

# GitHub Models
model = AzureAIFoundryChat.from_github(
    model="meta/Meta-Llama-3.1-8B-Instruct",
    token=os.getenv("GITHUB_TOKEN"),
)

# Azure AI Foundry endpoint
model = AzureAIFoundryChat(
    model="gpt-5.4-mini",
    endpoint="https://...",
    api_key="...",
)

Available via GitHub Models: Llama, Phi, Mistral, Cohere, and more

Environment Variable: GITHUB_TOKEN

OpenRouter¶

OpenRouter is a unified API gateway that routes to 200+ models from OpenAI, Anthropic, Google, Meta, Mistral, and many others. Use it to access any model through a single API key, enable automatic fallbacks, or add web search to any request.

from cogent.models.openrouter import OpenRouterChat

# Tier 1: Explicit provider:model syntax
agent = Agent("Helper", model="openrouter:anthropic/claude-sonnet-4")
agent = Agent("Helper", model="openrouter:openai/gpt-4o")
agent = Agent("Helper", model="openrouter:openrouter/auto")

# Tier 2: Factory
from cogent.models import create_chat
llm = create_chat("openrouter", "mistralai/mistral-7b-instruct:free")

# Tier 3: Direct class (full control)
llm = OpenRouterChat(
    model="anthropic/claude-sonnet-4",
    temperature=0.7,
    max_tokens=4096,
)

Environment Variable: OPENROUTER_API_KEY

Usage: Use openrouter:vendor/model syntax, e.g. openrouter:anthropic/claude-sonnet-4.
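
The provider:model strings used on this page can themselves contain a colon in the model part (for example the `:free` suffix above), so a resolver has to split on the first colon only. The sketch below illustrates that idea; `split_model_spec` and the `KNOWN_PROVIDERS` set are hypothetical names for illustration and are not part of cogent's API.

```python
from typing import Optional, Tuple

# Assumption: only the explicit prefixes shown on this page are listed here.
KNOWN_PROVIDERS = {"openrouter", "xai", "cerebras"}


def split_model_spec(spec: str) -> Tuple[Optional[str], str]:
    """Split 'provider:model' on the first colon only.

    Returns (None, spec) when the string has no recognised provider prefix,
    e.g. a bare alias such as 'gpt4'.
    """
    prefix, sep, rest = spec.partition(":")
    if sep and prefix in KNOWN_PROVIDERS:
        return prefix, rest
    return None, spec


provider, model = split_model_spec("openrouter:mistralai/mistral-7b-instruct:free")
print(provider, model)
```

Checking the prefix against a known set keeps a bare `vendor/model:free` string from being misread as having a provider.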

Provider Routing¶

Control which underlying providers OpenRouter routes to, and whether fallbacks are allowed:

llm = OpenRouterChat(
    model="anthropic/claude-sonnet-4",
    provider={
        "order": ["Anthropic", "AWS Bedrock"],  # try in order
        "allow_fallbacks": False,               # hard-fail if both unavailable
        "require_parameters": True,             # only providers supporting all params
    },
)

Model Fallbacks¶

Supply a ranked list of fallback models. OpenRouter tries each in order if a model is unavailable or rate-limited:

llm = OpenRouterChat(
    model="anthropic/claude-opus-4",
    fallback_models=[
        "anthropic/claude-sonnet-4",
        "openai/gpt-4o",
    ],
)
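
OpenRouter applies the fallback order on its side, so no client code is needed. Purely to illustrate the ordering semantics described above, here is a client-side sketch that tries a primary model and then each fallback in rank order. `call_with_fallbacks`, `ModelUnavailable`, and `fake_call` are hypothetical stand-ins, not cogent or OpenRouter APIs.

```python
from typing import Callable, Sequence, Tuple


class ModelUnavailable(Exception):
    """Raised by the stand-in call when a model is down or rate-limited."""


def call_with_fallbacks(
    primary: str,
    fallbacks: Sequence[str],
    call: Callable[[str], str],
) -> Tuple[str, str]:
    """Try the primary model, then each fallback in order.

    Returns (model_used, reply). Raises RuntimeError if every model fails.
    """
    last_error = None
    for name in [primary, *fallbacks]:
        try:
            return name, call(name)
        except ModelUnavailable as exc:
            last_error = exc  # remember the failure and try the next model
    raise RuntimeError("all models failed") from last_error


def fake_call(name: str) -> str:
    # Simulate the primary model being unavailable.
    if name == "anthropic/claude-opus-4":
        raise ModelUnavailable(name)
    return f"reply from {name}"


used, reply = call_with_fallbacks(
    "anthropic/claude-opus-4",
    ["anthropic/claude-sonnet-4", "openai/gpt-4o"],
    fake_call,
)
print(used, reply)
```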

Plugins¶

Plugins extend what any model can do without modifying your prompt.

Web search — attaches live search results to the request:

llm = OpenRouterChat(
    model="openai/gpt-4o",
    plugins=[{"id": "web", "max_results": 5}],
)

Response healing — automatically retries malformed structured-output responses:

llm = OpenRouterChat(
    model="openai/gpt-4o",
    plugins=[{"id": "response-healing"}],
)

File parser and context compression are also available; pass any plugin object supported by the OpenRouter plugins API.

Sampling Parameters¶

All standard and OpenRouter-specific sampling params are supported:

llm = OpenRouterChat(
    model="meta-llama/llama-3.3-70b-instruct",
    temperature=0.8,
    top_p=0.9,
    top_k=40,
    frequency_penalty=0.2,
    presence_penalty=0.1,
    repetition_penalty=1.05,
    min_p=0.05,
    top_a=0.1,
    seed=42,
    stop=["END"],
)

Tool Choice¶

Pass tool_choice through bind_tools to force, prevent, or select specific tool use:

from cogent.tools import tool

@tool
def search(query: str) -> str:
    """Search the web."""
    return f"Results for: {query}"

# Force the model to call a tool
bound = llm.bind_tools([search], tool_choice="required")

# Force a specific function
bound = llm.bind_tools([search], tool_choice={"type": "function", "function": {"name": "search"}})

# Disable tool use entirely
bound = llm.bind_tools([search], tool_choice="none")

Anthropic Beta Features¶

For Anthropic models routed through OpenRouter, pass beta feature flags via the anthropic_beta field:

llm = OpenRouterChat(
    model="anthropic/claude-sonnet-4",
    anthropic_beta="interleaved-thinking-2025-05-14",
)

# Multiple betas
llm = OpenRouterChat(
    model="anthropic/claude-sonnet-4",
    anthropic_beta=["interleaved-thinking-2025-05-14", "prompt-caching-2024-07-31"],
)

Reasoning Control¶

For reasoning/thinking models, control effort, token budget, and whether reasoning tokens appear in the response:

# Exclude reasoning tokens from the response (model still thinks internally)
llm = OpenRouterChat(model="deepseek/deepseek-v3.2", reasoning_exclude=True)

# Set a specific token budget (Anthropic models, some Qwen and Gemini 2.5)
llm = OpenRouterChat(model="anthropic/claude-3.7-sonnet", reasoning_max_tokens=2000)

# Control effort level (OpenAI o-series, Grok, Gemini 3)
llm = OpenRouterChat(model="openai/o3-mini", reasoning_effort="low")

# Disable thinking entirely on a model that reasons by default
llm = OpenRouterChat(model="openai/o3-mini", reasoning_effort="none")

# Combine: low effort, exclude from response
llm = OpenRouterChat(model="openai/o3-mini", reasoning_effort="low", reasoning_exclude=True)

Parameter	Type	Description
`reasoning_effort`	`"xhigh"` | `"high"` | `"medium"` | `"low"` | `"minimal"` | `"none"`	Effort level. `"none"` disables thinking entirely. Supported by OpenAI o-series, Grok, Gemini 3.
`reasoning_max_tokens`	`int`	Token budget for reasoning. Supported by Anthropic, some Qwen and Gemini 2.5. Cannot combine with `reasoning_effort`.
`reasoning_exclude`	`bool`	When `True`, model thinks internally but reasoning tokens are not returned. Works with both other params.
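
The constraints in the table can be checked before a model is constructed. The following is a minimal sketch assuming only the rules stated above (the allowed effort levels, and that `reasoning_effort` and `reasoning_max_tokens` cannot be combined). `check_reasoning_options` is a hypothetical helper, not a cogent function, and cogent may validate these differently.

```python
from typing import Any, Dict, Optional

EFFORT_LEVELS = {"xhigh", "high", "medium", "low", "minimal", "none"}


def check_reasoning_options(
    reasoning_effort: Optional[str] = None,
    reasoning_max_tokens: Optional[int] = None,
    reasoning_exclude: bool = False,
) -> Dict[str, Any]:
    """Validate the three reasoning options and return them as keyword arguments."""
    if reasoning_effort is not None and reasoning_effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown reasoning_effort: {reasoning_effort!r}")
    if reasoning_effort is not None and reasoning_max_tokens is not None:
        raise ValueError("reasoning_effort and reasoning_max_tokens cannot be combined")

    # Only include options the caller actually set.
    options: Dict[str, Any] = {}
    if reasoning_effort is not None:
        options["reasoning_effort"] = reasoning_effort
    if reasoning_max_tokens is not None:
        options["reasoning_max_tokens"] = reasoning_max_tokens
    if reasoning_exclude:
        options["reasoning_exclude"] = True
    return options


kwargs = check_reasoning_options(reasoning_effort="low", reasoning_exclude=True)
print(kwargs)
```

The returned dict could then be unpacked into a constructor call or passed as `model_kwargs`.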

Cost and Cache Metadata¶

Every response carries cost and cache metadata in AIMessage.metadata:

response = await llm.ainvoke("Hello")
meta = response.metadata

print(meta.cost)                          # USD cost, e.g. 0.000123
print(meta.native_finish_reason)          # raw finish reason from the provider
print(meta.usage.cached_tokens)           # prompt tokens served from cache
print(meta.usage.cache_write_tokens)      # tokens written to prompt cache
print(meta.usage.reasoning_tokens)        # thinking/reasoning tokens used
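
One common use of these fields is totalling spend and cache hits over several calls. The sketch below assumes metadata objects shaped like the fields printed above; the `Usage` and `Metadata` dataclasses are stand-ins written for this example and are not cogent's real classes.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class Usage:
    # Stand-in mirroring the usage fields shown above.
    cached_tokens: int = 0
    cache_write_tokens: int = 0
    reasoning_tokens: int = 0


@dataclass
class Metadata:
    # Stand-in mirroring response.metadata.
    cost: float = 0.0
    usage: Usage = field(default_factory=Usage)


def summarize(metas: Iterable[Metadata]) -> Dict[str, float]:
    """Sum cost and token counters across a batch of responses."""
    metas = list(metas)
    return {
        "cost": sum(m.cost for m in metas),
        "cached_tokens": sum(m.usage.cached_tokens for m in metas),
        "cache_write_tokens": sum(m.usage.cache_write_tokens for m in metas),
        "reasoning_tokens": sum(m.usage.reasoning_tokens for m in metas),
    }


totals = summarize([
    Metadata(cost=0.000123, usage=Usage(cached_tokens=100, reasoning_tokens=40)),
    Metadata(cost=0.000246, usage=Usage(cache_write_tokens=512)),
])
print(f"total cost: ${totals['cost']:.6f}")
```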

model_kwargs Shorthand¶

When using the string or factory path, pass OpenRouter-specific options via model_kwargs:

agent = Agent(
    name="Researcher",
    model="openrouter:openai/gpt-4o",
    model_kwargs={
        "plugins": [{"id": "web", "max_results": 3}],
        "provider": {"order": ["OpenAI"]},
        "fallback_models": ["anthropic/claude-sonnet-4"],
        "seed": 42,
    },
)

Previous Provider Sections Continue Below¶

if response.tool_calls:
    for call in response.tool_calls:
        print(f"Tool: {call['name']}, Args: {call['args']}")

Responses API (Beta):

OpenAI's Responses API is optimized for tool use and structured outputs. Enable it with the use_responses_api=True parameter:

```python
from cogent.models.openai import OpenAIChat

# Standard Chat Completions API (default)
model = OpenAIChat(model="gpt-5.4")

# Responses API (optimized for tool use)
model = OpenAIChat(model="gpt-5.4", use_responses_api=True)

# Works seamlessly with tools
# (search_tool, calc_tool, and messages are assumed to be defined elsewhere)
bound = model.bind_tools([search_tool, calc_tool])
response = await bound.ainvoke(messages)
```

The Responses API provides better performance for multi-turn tool conversations while maintaining the same interface.

Azure OpenAI¶

Enterprise Azure deployments with Microsoft Entra ID (formerly Azure AD) support:

from cogent.models.azure import AzureEntraAuth, AzureOpenAIChat, AzureOpenAIEmbedding

# With API key
model = AzureOpenAIChat(
    azure_endpoint="https://your-resource.openai.azure.com",
    deployment="gpt-5.4",
    api_key="your-api-key",
    api_version="2024-02-01",
)

# With Entra ID (DefaultAzureCredential)
model = AzureOpenAIChat(
    azure_endpoint="https://your-resource.openai.azure.com",
    deployment="gpt-5.4",
    entra=AzureEntraAuth(method="default"),  # Uses DefaultAzureCredential
)

# With Entra ID (Managed Identity)
# - System-assigned MI: omit client_id
# - User-assigned MI: set client_id (recommended when multiple identities exist)
model = AzureOpenAIChat(
    azure_endpoint="https://your-resource.openai.azure.com",
    deployment="gpt-5.4",
    entra=AzureEntraAuth(
        method="managed_identity",
        client_id="<USER_ASSIGNED_MANAGED_IDENTITY_CLIENT_ID>",
    ),
)

# Embeddings
embeddings = AzureOpenAIEmbedding(
    azure_endpoint="https://your-resource.openai.azure.com",
    deployment="text-embedding-ada-002",
    entra=AzureEntraAuth(method="default"),
)

result = await embeddings.embed(["Document text"])

Environment variables:

AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
AZURE_OPENAI_API_VERSION=2024-02-01
AZURE_OPENAI_DEPLOYMENT=gpt-5.4
AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-ada-002

# Auth selection
AZURE_OPENAI_AUTH_TYPE=managed_identity  # api_key | default | managed_identity | client_secret

# API key auth
# AZURE_OPENAI_API_KEY=your-api-key

# Managed identity auth (user-assigned MI)
# AZURE_OPENAI_CLIENT_ID=...

# Service principal auth (client secret)
# AZURE_OPENAI_TENANT_ID=...
# AZURE_OPENAI_CLIENT_ID=...
# AZURE_OPENAI_CLIENT_SECRET=...
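
Each auth type needs a different subset of these variables, so a pre-flight check can catch a missing one before the first request. This sketch only encodes the groupings from the comments above. `missing_azure_settings` is a hypothetical helper, not a cogent function, and treating `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_AUTH_TYPE` as always required is an assumption that may not match cogent's defaults.

```python
from typing import List, Mapping

# Variables each auth type needs, taken from the groupings listed above.
REQUIRED_BY_AUTH = {
    "api_key": ["AZURE_OPENAI_API_KEY"],
    "default": [],
    # AZURE_OPENAI_CLIENT_ID is only needed for a user-assigned identity.
    "managed_identity": [],
    "client_secret": [
        "AZURE_OPENAI_TENANT_ID",
        "AZURE_OPENAI_CLIENT_ID",
        "AZURE_OPENAI_CLIENT_SECRET",
    ],
}


def missing_azure_settings(env: Mapping[str, str]) -> List[str]:
    """Return the names of variables that are required but unset or empty."""
    auth_type = env.get("AZURE_OPENAI_AUTH_TYPE")
    if not auth_type:
        # Assumption: this sketch requires an explicit auth type.
        return ["AZURE_OPENAI_AUTH_TYPE"]
    if auth_type not in REQUIRED_BY_AUTH:
        raise ValueError(f"unknown AZURE_OPENAI_AUTH_TYPE: {auth_type!r}")
    required = ["AZURE_OPENAI_ENDPOINT", *REQUIRED_BY_AUTH[auth_type]]
    return [name for name in required if not env.get(name)]


example_env = {
    "AZURE_OPENAI_ENDPOINT": "https://your-resource.openai.azure.com",
    "AZURE_OPENAI_AUTH_TYPE": "client_secret",
    "AZURE_OPENAI_TENANT_ID": "tenant",
}
print(missing_azure_settings(example_env))
```

In real use, `os.environ` could be passed in place of `example_env`.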

Anthropic¶

Claude models with native SDK:

from cogent.models.anthropic import AnthropicChat

model = AnthropicChat(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    api_key="sk-ant-...",  # Or ANTHROPIC_API_KEY env var
)

response = await model.ainvoke([
    {"role": "user", "content": "Explain quantum computing"}
])

Claude-specific features:

# System message
response = await model.ainvoke(
    messages=[{"role": "user", "content": "Hello"}],
    system="You are a helpful coding assistant.",
)

# With tools
model = AnthropicChat(model="claude-sonnet-4-6")
bound = model.bind_tools([search_tool])

Groq¶

Ultra-fast inference for supported models:

from cogent.models.groq import GroqChat

model = GroqChat(
    model="llama-3.3-70b-versatile",
    api_key="gsk_...",  # Or GROQ_API_KEY env var
)

response = await model.ainvoke([
    {"role": "user", "content": "Write a haiku about coding"}
])

Available models:

Model	Description
`llama-3.3-70b-versatile`	Llama 3.3 70B
`llama-3.1-8b-instant`	Fast Llama 3.1 8B
`mixtral-8x7b-32768`	Mixtral 8x7B
`gemma2-9b-it`	Gemma 2 9B

Responses API (Beta):

Groq also supports OpenAI's Responses API for optimized tool use:

from cogent.models.groq import GroqChat

# Standard Chat Completions API (default)
model = GroqChat(model="llama-3.3-70b-versatile")

# Responses API (optimized for tool use)
model = GroqChat(model="llama-3.3-70b-versatile", use_responses_api=True)

# Works seamlessly with tools
bound = model.bind_tools([search_tool])
response = await bound.ainvoke(messages)

Google Gemini¶

Google's Gemini models:

from cogent.models.gemini import GeminiChat, GeminiEmbedding

model = GeminiChat(
    model="gemini-2.5-flash",  # Default (upgraded from 2.0)
    api_key="...",  # Or GOOGLE_API_KEY env var
)

response = await model.ainvoke([
    {"role": "user", "content": "What is the capital of France?"}
])

# Gemini 3 Preview (Not Production Ready)
model = GeminiChat(model="gemini-3-flash-preview")
# ⚠️ WARNING: Preview models may have breaking changes or be removed

# Native Thinking (opt-in for cost efficiency)
model = GeminiChat(
    model="gemini-2.5-flash",
    thinking_budget=16384,  # Enable thinking (default: 0 = disabled)
)

# Embeddings
embeddings = GeminiEmbedding(model="text-embedding-004")

Available Models:

- gemini-2.5-pro, gemini-2.5-flash (Stable, 1M context, thinking support)
- gemini-2.0-flash (Stable)
- gemini-3-pro-preview, gemini-3-flash-preview ⚠️ (Preview only, thinking support)

Native Thinking:

- Default: thinking_budget=0 (disabled), which is cost-efficient for most tasks
- Enable: Set thinking_budget > 0 (recommended: 8192-16384 tokens)
- Cost: Thinking tokens are billed separately, so only enable thinking when needed
- Use Cases: Complex reasoning, multi-step problems, strategic planning

Pass via Agent:

from cogent import Agent

# Enable thinking for this agent
agent = Agent(
    name="Thinker",
    model="gemini-2.5-flash",
    model_kwargs={"thinking_budget": 16384},
)

Ollama¶

Local models via Ollama:

from cogent.models.ollama import OllamaChat, OllamaEmbedding

# Chat (requires `ollama run llama3.2`)
model = OllamaChat(
    model="llama3.2",
    base_url="http://localhost:11434",
)

response = await model.ainvoke([
    {"role": "user", "content": "Hello!"}
])

# Embeddings
embeddings = OllamaEmbedding(model="nomic-embed-text")

xAI (Grok)¶

Grok models with reasoning capabilities:

from cogent.models.xai import XAIChat

# Latest flagship (2M context, reasoning)
model = XAIChat(
    model="grok-4.20",
    api_key="...",  # Or XAI_API_KEY env var
)

# Non-reasoning variant (same price, no internal reasoning)
model = XAIChat(model="grok-4.20-0309-non-reasoning")

# Fast agentic model with reasoning (2M context)
model = XAIChat(model="grok-4-1-fast-reasoning")

# Fast without reasoning (cheaper for high-volume)
model = XAIChat(model="grok-4-1-fast-non-reasoning")

# With reasoning effort control (grok-3-mini only)
model = XAIChat(model="grok-3-mini", reasoning_effort="high")
# or use with_reasoning()
model = XAIChat(model="grok-3-mini").with_reasoning("high")

response = await model.ainvoke([
    {"role": "user", "content": "What is 101 * 3?"}
])

# Reasoning tokens tracked in metadata
if response.metadata.tokens:
    print(f"Reasoning tokens: {response.metadata.tokens.reasoning_tokens}")

Available models:

Model	Alias	Context	Reasoning	Description
`grok-4.20-0309-reasoning`	`grok`, `grok-4.20`, `grok-4.20-reasoning`	2M	✅	Latest flagship — fast + reasoning
`grok-4.20-0309-non-reasoning`	`grok-4.20-non-reasoning`	2M	❌	Latest flagship — non-reasoning variant
`grok-4.20-multi-agent-0309`	—	2M	✅	Multi-agent optimised variant
`grok-4-0709`	`grok-4`	256K	✅	Grok 4 stable snapshot
`grok-4-1-fast-reasoning`	`grok-fast-reasoning`	2M	✅	Fast agentic with explicit reasoning
`grok-4-1-fast-non-reasoning`	`grok-fast`, `grok-fast-non-reasoning`	2M	❌	Fast agentic without reasoning
`grok-3-mini`	—	—	configurable	Supports `reasoning_effort` (low/high)
`grok-2-vision-1212`	`grok-vision`	—	❌	Image understanding
`grok-code-fast-1`	`grok-code`	—	❌	Code-optimized

Features:

- Function/tool calling (all models)
- Structured outputs (JSON mode)
- Reasoning (all grok-4.20 and grok-4 models; grok-3-mini via reasoning_effort)
- Vision (grok-2-vision-1212)
- 2M context window (grok-4.20 and grok-4-1-fast models)

DeepSeek¶

DeepSeek models with Chain of Thought reasoning:

from cogent.models.deepseek import DeepSeekChat

# Standard chat model
model = DeepSeekChat(
    model="deepseek-chat",
    api_key="...",  # Or DEEPSEEK_API_KEY env var
)

# Reasoning model (exposes Chain of Thought)
model = DeepSeekChat(model="deepseek-reasoner")

response = await model.ainvoke("9.11 and 9.8, which is greater?")

# Access reasoning content (Chain of Thought)
if hasattr(response, 'reasoning'):
    print("Reasoning:", response.reasoning)
print("Answer:", response.content)

Available models:

Model	Tools	Description
`deepseek-chat`	✅	General chat model with tool support
`deepseek-reasoner`	❌	Reasoning model with CoT (no tools)

Note: deepseek-reasoner does NOT support:

- Function calling/tools
- temperature, top_p, presence_penalty, frequency_penalty
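
When one set of keyword arguments is shared between deepseek-chat and deepseek-reasoner, the unsupported sampling parameters can be dropped before the reasoner is constructed. A minimal sketch, assuming only the four parameters listed in the note above; `reasoner_safe_kwargs` is a hypothetical helper and not part of cogent.

```python
from typing import Any, Dict

# Sampling parameters the note above lists as unsupported by deepseek-reasoner.
UNSUPPORTED_BY_REASONER = {
    "temperature",
    "top_p",
    "presence_penalty",
    "frequency_penalty",
}


def reasoner_safe_kwargs(model: str, kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs with unsupported keys removed for the reasoner."""
    if model != "deepseek-reasoner":
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k not in UNSUPPORTED_BY_REASONER}


shared = {"temperature": 0.7, "top_p": 0.9, "max_tokens": 2000}
print(reasoner_safe_kwargs("deepseek-reasoner", shared))
print(reasoner_safe_kwargs("deepseek-chat", shared))
```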

Custom Endpoints¶

Any OpenAI-compatible endpoint (vLLM, Together AI, etc.):

from cogent.models.custom import CustomChat, CustomEmbedding

# vLLM
model = CustomChat(
    base_url="http://localhost:8000/v1",
    model="meta-llama/Llama-3.2-3B-Instruct",
)

# Together AI
model = CustomChat(
    base_url="https://api.together.xyz/v1",
    model="meta-llama/Llama-3-70b-chat-hf",
    api_key="...",
)

# Custom embeddings
embeddings = CustomEmbedding(
    base_url="http://localhost:8000/v1",
    model="BAAI/bge-small-en-v1.5",
)