Model Providers¶
Detailed setup and configuration for each supported LLM provider.
See Models Overview for the 3-tier API, model aliases, and configuration.
OpenAI¶
from cogent import Agent
from cogent.models import OpenAIChat, OpenAIEmbedding, create_chat
# Tier 1: Simple string
agent = Agent("Helper", model="gpt4")
# Tier 2: Factory
model = create_chat("gpt4")
model = create_chat("openai", "gpt-5.4")
# Tier 3: Direct
model = OpenAIChat(
model="gpt-5.4",
temperature=0.7,
max_tokens=2000,
api_key="sk-...", # Or OPENAI_API_KEY env var
)
# Embeddings
embeddings = OpenAIEmbedding(model="text-embedding-3-small")
# Primary API with metadata
result = await embeddings.embed(["Hello world"])
print(result.embeddings) # Vectors
print(result.metadata) # Full metadata
# Convenience for single text
result = await embeddings.embed("Query")
vector = result.embeddings[0]
With tools:
from cogent.tools import tool
@tool
def search(query: str) -> str:
    """Search the web."""
    return f"Results for: {query}"
model = ChatModel(model="gpt-5.4")
bound = model.bind_tools([search])
response = await bound.ainvoke([
    {"role": "user", "content": "Search for AI news"}
])
if response.tool_calls:
    print(response.tool_calls)
xAI (Grok)¶
from cogent.models import XAIChat
# Latest flagship model (2M context, reasoning always on)
model = XAIChat(model="grok-4.20", api_key="xai-...")
# Non-reasoning variant (faster for latency-sensitive tasks)
model = XAIChat(model="grok-4.20-0309-non-reasoning")
# Fast agentic model with reasoning (2M context)
model = XAIChat(model="grok-4-1-fast-reasoning")
# Vision model
model = XAIChat(model="grok-2-vision-1212")
# With reasoning effort (grok-3-mini only)
model = XAIChat(model="grok-3-mini", reasoning_effort="high")
response = await model.ainvoke("What is 101 * 3?")
print(response.metadata.tokens.reasoning_tokens)
Available Models:
- grok-4.20-0309-reasoning: Latest flagship — 2M context, fast + reasoning
- grok-4.20-0309-non-reasoning: Non-reasoning variant — 2M context
- grok-4.20-multi-agent-0309: Multi-agent optimised variant — 2M context, reasoning
- grok-4-0709: Grok 4 stable snapshot — 256K context, reasoning
- grok-4-1-fast-reasoning: Fast model with explicit reasoning — 2M context
- grok-4-1-fast-non-reasoning: Fast model without reasoning — 2M context
- grok-3, grok-3-mini: Previous generation
- grok-2-vision-1212: Vision model
Note: Use explicit xai:model syntax, e.g. xai:grok-4.20.
Environment Variable: XAI_API_KEY
DeepSeek¶
from cogent.models import DeepSeekChat
# Standard chat model
model = DeepSeekChat(model="deepseek-chat", api_key="sk-...")
# Reasoning model with Chain of Thought
model = DeepSeekChat(model="deepseek-reasoner")
response = await model.ainvoke("9.11 and 9.8, which is greater?")
# Access reasoning content
if hasattr(response, 'reasoning'):
    print("Reasoning:", response.reasoning)
print("Answer:", response.content)
Available Models:
- deepseek-chat: General chat model with function calling
- deepseek-reasoner: Reasoning model with Chain of Thought (no function calling)
Environment Variable: DEEPSEEK_API_KEY
Note: deepseek-reasoner does NOT support function calling or sampling parameters (temperature, top_p, presence_penalty, frequency_penalty).
Cerebras (Ultra-Fast Inference)¶
from cogent.models import CerebrasChat
# Llama 3.1 8B (default)
model = CerebrasChat(model="llama3.1-8b", api_key="csk-...")
# Llama 3.3 70B
model = CerebrasChat(model="llama-3.3-70b")
# Streaming
async for chunk in model.astream(messages):
    print(chunk.content, end="")
Available Models:
- llama3.1-8b: Llama 3.1 8B (default)
- llama-3.3-70b: Llama 3.3 70B
- qwen-3-32b: Qwen 3 32B
- qwen-3-235b-a22b-instruct-2507: Qwen 3 235B MoE
- zai-glm-4.7: ZAI GLM 4.7
- gpt-oss-120b: GPT OSS 120B (reasoning model)
Note: Use explicit cerebras:model syntax, e.g. cerebras:llama3.1-8b. Bare gpt-oss-* strings are NOT routed to Cerebras.
Environment Variable: CEREBRAS_API_KEY
Note: Cerebras serves these models on its Wafer-Scale Engine (WSE-3) hardware, which is built for very high inference throughput.
Cloudflare Workers AI¶
from cogent.models import CloudflareChat, CloudflareEmbedding
# Chat models
model = CloudflareChat(
model="@cf/meta/llama-3.3-70b-instruct",
account_id="...",
api_key="...",
)
# Embeddings
embeddings = CloudflareEmbedding(
model="@cf/baai/bge-base-en-v1.5",
account_id="...",
api_key="...",
)
Available Models: All Cloudflare Workers AI models with @cf/ prefix
Environment Variables: CLOUDFLARE_ACCOUNT_ID, CLOUDFLARE_API_TOKEN
Azure AI Foundry (GitHub Models)¶
import os
from cogent.models.azure import AzureAIFoundryChat
# GitHub Models
model = AzureAIFoundryChat.from_github(
model="meta/Meta-Llama-3.1-8B-Instruct",
token=os.getenv("GITHUB_TOKEN"),
)
# Azure AI Foundry endpoint
model = AzureAIFoundryChat(
model="gpt-5.4-mini",
endpoint="https://...",
api_key="...",
)
Available via GitHub Models: Llama, Phi, Mistral, Cohere, and more
Environment Variable: GITHUB_TOKEN
OpenRouter¶
OpenRouter is a unified API gateway that routes to 200+ models from OpenAI, Anthropic, Google, Meta, Mistral, and many others. Use it to access any model through a single API key, enable automatic fallbacks, or add web search to any request.
from cogent.models.openrouter import OpenRouterChat
# Tier 1: Explicit provider:model syntax
agent = Agent("Helper", model="openrouter:anthropic/claude-sonnet-4")
agent = Agent("Helper", model="openrouter:openai/gpt-4o")
agent = Agent("Helper", model="openrouter:openrouter/auto")
# Tier 2: Factory
from cogent.models import create_chat
llm = create_chat("openrouter", "mistralai/mistral-7b-instruct:free")
# Tier 3: Direct class (full control)
llm = OpenRouterChat(
model="anthropic/claude-sonnet-4",
temperature=0.7,
max_tokens=4096,
)
Environment Variable: OPENROUTER_API_KEY
Usage: Use openrouter:vendor/model syntax, e.g. openrouter:anthropic/claude-sonnet-4.
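The provider:model convention can be pictured as a split on the first colon only. That detail matters here because OpenRouter model ids may contain a colon themselves, such as the :free suffix used above. The helper below is a hypothetical illustration and not cogent's actual parser; the name split_model_string is an assumption.

```python
from typing import Optional, Tuple


def split_model_string(spec: str) -> Tuple[Optional[str], str]:
    """Split 'provider:model' on the first colon only.

    Hypothetical helper for illustration; not part of cogent.
    Returns (None, spec) when there is no provider prefix.
    """
    provider, sep, model = spec.partition(":")
    if not sep or not provider or not model:
        return None, spec
    return provider, model


print(split_model_string("openrouter:anthropic/claude-sonnet-4"))
print(split_model_string("openrouter:mistralai/mistral-7b-instruct:free"))
print(split_model_string("gpt4"))
```

Splitting once keeps the :free suffix attached to the model id instead of treating it as another separator.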
Provider Routing¶
Control which underlying providers OpenRouter routes to, and whether fallbacks are allowed:
llm = OpenRouterChat(
model="anthropic/claude-sonnet-4",
provider={
"order": ["Anthropic", "AWS Bedrock"], # try in order
"allow_fallbacks": False, # hard-fail if both unavailable
"require_parameters": True, # only providers supporting all params
},
)
Model Fallbacks¶
Supply a ranked list of fallback models. OpenRouter tries each in order if a model is unavailable or rate-limited:
llm = OpenRouterChat(
model="anthropic/claude-opus-4",
fallback_models=[
"anthropic/claude-sonnet-4",
"openai/gpt-4o",
],
)
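OpenRouter applies the fallback order on its side, so no client code is needed. Purely to illustrate the "try each in order" semantics, here is a local sketch with a stand-in call function; ModelUnavailable, first_successful, and fake_call are hypothetical names that belong to neither cogent nor OpenRouter.

```python
from typing import Callable, Sequence, Tuple


class ModelUnavailable(Exception):
    """Stand-in for an 'unavailable or rate-limited' failure."""


def first_successful(models: Sequence[str], call: Callable[[str], str]) -> Tuple[str, str]:
    """Try each model in order and return (model, result) for the first success."""
    last_error = None
    for name in models:
        try:
            return name, call(name)
        except ModelUnavailable as exc:
            last_error = exc  # remember the failure and move on to the next model
    raise RuntimeError("all models were unavailable") from last_error


def fake_call(name: str) -> str:
    # Pretend the primary model is rate-limited.
    if name == "anthropic/claude-opus-4":
        raise ModelUnavailable(name)
    return f"ok from {name}"


order = ["anthropic/claude-opus-4", "anthropic/claude-sonnet-4", "openai/gpt-4o"]
print(first_successful(order, fake_call))
```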
Plugins¶
Plugins extend what any model can do without modifying your prompt.
Web search: attaches live search results to the request.
Response healing: automatically retries malformed structured-output responses.
File parser and context compression are also available; pass any plugin object supported by the OpenRouter plugins API.
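As a rough sketch, plugin objects can be written as plain dictionaries. The "web" entry mirrors the model_kwargs example on this page. The "response-healing" id and the plugins keyword on OpenRouterChat are assumptions here and should be checked against the OpenRouter plugins API and the cogent source.

```python
import json

# Plugin specs as plain dicts. The "web" entry matches the model_kwargs
# example on this page; "response-healing" is an assumed plugin id.
web_search = {"id": "web", "max_results": 3}
response_healing = {"id": "response-healing"}

plugins = [web_search, response_healing]

# Assumed usage with cogent (not executed here):
#   llm = OpenRouterChat(model="openai/gpt-4o", plugins=plugins)
print(json.dumps(plugins, indent=2))
```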
Sampling Parameters¶
All standard and OpenRouter-specific sampling params are supported:
llm = OpenRouterChat(
model="meta-llama/llama-3.3-70b-instruct",
temperature=0.8,
top_p=0.9,
top_k=40,
frequency_penalty=0.2,
presence_penalty=0.1,
repetition_penalty=1.05,
min_p=0.05,
top_a=0.1,
seed=42,
stop=["END"],
)
Tool Choice¶
Pass tool_choice through bind_tools to force, prevent, or select specific tool use:
from cogent.tools import tool
@tool
def search(query: str) -> str:
    """Search the web."""
    return f"Results for: {query}"
# Force the model to call a tool
bound = llm.bind_tools([search], tool_choice="required")
# Force a specific function
bound = llm.bind_tools([search], tool_choice={"type": "function", "function": {"name": "search"}})
# Disable tool use entirely
bound = llm.bind_tools([search], tool_choice="none")
Anthropic Beta Features¶
For Anthropic models routed through OpenRouter, pass beta feature flags via the anthropic_beta field:
llm = OpenRouterChat(
model="anthropic/claude-sonnet-4",
anthropic_beta="interleaved-thinking-2025-05-14",
)
# Multiple betas
llm = OpenRouterChat(
model="anthropic/claude-sonnet-4",
anthropic_beta=["interleaved-thinking-2025-05-14", "prompt-caching-2024-07-31"],
)
Reasoning Control¶
For reasoning/thinking models, control effort, token budget, and whether reasoning tokens appear in the response:
# Exclude reasoning tokens from the response (model still thinks internally)
llm = OpenRouterChat(model="deepseek/deepseek-v3.2", reasoning_exclude=True)
# Set a specific token budget (Anthropic models, some Qwen and Gemini 2.5)
llm = OpenRouterChat(model="anthropic/claude-3.7-sonnet", reasoning_max_tokens=2000)
# Control effort level (OpenAI o-series, Grok, Gemini 3)
llm = OpenRouterChat(model="openai/o3-mini", reasoning_effort="low")
# Disable thinking entirely on a model that reasons by default
llm = OpenRouterChat(model="openai/o3-mini", reasoning_effort="none")
# Combine: low effort, exclude from response
llm = OpenRouterChat(model="openai/o3-mini", reasoning_effort="low", reasoning_exclude=True)
| Parameter | Type | Description |
|---|---|---|
| reasoning_effort | "xhigh" \| "high" \| "medium" \| "low" \| "minimal" \| "none" | Effort level. "none" disables thinking entirely. Supported by OpenAI o-series, Grok, Gemini 3. |
| reasoning_max_tokens | int | Token budget for reasoning. Supported by Anthropic, some Qwen and Gemini 2.5. Cannot combine with reasoning_effort. |
| reasoning_exclude | bool | When True, model thinks internally but reasoning tokens are not returned. Works with both other params. |
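The combination rules in the table can be checked before a model is constructed. The function below is a minimal sketch of such a check; check_reasoning_options is a hypothetical helper, and whether cogent performs the same validation internally is not stated here.

```python
from typing import Optional

ALLOWED_EFFORTS = {"xhigh", "high", "medium", "low", "minimal", "none"}


def check_reasoning_options(
    reasoning_effort: Optional[str] = None,
    reasoning_max_tokens: Optional[int] = None,
    reasoning_exclude: bool = False,
) -> dict:
    """Apply the rules from the table and return only the options that were set."""
    if reasoning_effort is not None and reasoning_effort not in ALLOWED_EFFORTS:
        raise ValueError(f"unknown reasoning_effort: {reasoning_effort!r}")
    if reasoning_effort is not None and reasoning_max_tokens is not None:
        raise ValueError("reasoning_effort and reasoning_max_tokens cannot be combined")
    options = {}
    if reasoning_effort is not None:
        options["reasoning_effort"] = reasoning_effort
    if reasoning_max_tokens is not None:
        options["reasoning_max_tokens"] = reasoning_max_tokens
    if reasoning_exclude:
        options["reasoning_exclude"] = True
    return options


print(check_reasoning_options(reasoning_effort="low", reasoning_exclude=True))
```

The returned dictionary could then be unpacked into the constructor, for example OpenRouterChat(model=..., **options).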
Cost and Cache Metadata¶
Every response carries cost and cache metadata in AIMessage.metadata:
response = await llm.ainvoke("Hello")
meta = response.metadata
print(meta.cost) # USD cost, e.g. 0.000123
print(meta.native_finish_reason) # raw finish reason from the provider
print(meta.usage.cached_tokens) # prompt tokens served from cache
print(meta.usage.cache_write_tokens) # tokens written to prompt cache
print(meta.usage.reasoning_tokens) # thinking/reasoning tokens used
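One use for this metadata is tracking spend across several calls. The sketch below sums metadata.cost over a list of responses, using stand-in objects instead of real AIMessage instances. Treating a missing or None cost as zero is an assumption, and total_cost is a hypothetical helper.

```python
from types import SimpleNamespace


def total_cost(responses) -> float:
    """Sum metadata.cost (USD) over responses, counting a missing cost as 0."""
    return sum((getattr(r.metadata, "cost", None) or 0.0) for r in responses)


# Stand-in objects shaped like the response above.
fake_responses = [
    SimpleNamespace(metadata=SimpleNamespace(cost=0.25)),
    SimpleNamespace(metadata=SimpleNamespace(cost=None)),
    SimpleNamespace(metadata=SimpleNamespace(cost=0.5)),
]
print(total_cost(fake_responses))
```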
model_kwargs Shorthand¶
When using the string or factory path, pass OpenRouter-specific options via model_kwargs:
agent = Agent(
name="Researcher",
model="openrouter:openai/gpt-4o",
model_kwargs={
"plugins": [{"id": "web", "max_results": 3}],
"provider": {"order": ["OpenAI"]},
"fallback_models": ["anthropic/claude-sonnet-4"],
"seed": 42,
},
)
Previous Provider Sections Continue Below¶
if response.tool_calls:
    for call in response.tool_calls:
        print(f"Tool: {call['name']}, Args: {call['args']}")
**Responses API (Beta):**
OpenAI's Responses API is optimized for tool use and structured outputs. Use the `use_responses_api=True` parameter:
```python
from cogent.models.openai import OpenAIChat
# Standard Chat Completions API (default)
model = OpenAIChat(model="gpt-5.4")
# Responses API (optimized for tool use)
model = OpenAIChat(model="gpt-5.4", use_responses_api=True)
# Works seamlessly with tools
bound = model.bind_tools([search_tool, calc_tool])
response = await bound.ainvoke(messages)
```
The Responses API provides better performance for multi-turn tool conversations while maintaining the same interface.
Azure OpenAI¶
Enterprise Azure deployments with Microsoft Entra ID (formerly Azure AD) support:
from cogent.models.azure import AzureEntraAuth, AzureOpenAIChat, AzureOpenAIEmbedding
# With API key
model = AzureOpenAIChat(
azure_endpoint="https://your-resource.openai.azure.com",
deployment="gpt-5.4",
api_key="your-api-key",
api_version="2024-02-01",
)
# With Entra ID (DefaultAzureCredential)
model = AzureOpenAIChat(
azure_endpoint="https://your-resource.openai.azure.com",
deployment="gpt-5.4",
entra=AzureEntraAuth(method="default"), # Uses DefaultAzureCredential
)
# With Entra ID (Managed Identity)
# - System-assigned MI: omit client_id
# - User-assigned MI: set client_id (recommended when multiple identities exist)
model = AzureOpenAIChat(
azure_endpoint="https://your-resource.openai.azure.com",
deployment="gpt-5.4",
entra=AzureEntraAuth(
method="managed_identity",
client_id="<USER_ASSIGNED_MANAGED_IDENTITY_CLIENT_ID>",
),
)
# Embeddings
embeddings = AzureOpenAIEmbedding(
azure_endpoint="https://your-resource.openai.azure.com",
deployment="text-embedding-ada-002",
entra=AzureEntraAuth(method="default"),
)
result = await embeddings.embed(["Document text"])
Environment variables:
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
AZURE_OPENAI_API_VERSION=2024-02-01
AZURE_OPENAI_DEPLOYMENT=gpt-5.4
AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-ada-002
# Auth selection
AZURE_OPENAI_AUTH_TYPE=managed_identity # api_key | default | managed_identity | client_secret
# API key auth
# AZURE_OPENAI_API_KEY=your-api-key
# Managed identity auth (user-assigned MI)
# AZURE_OPENAI_CLIENT_ID=...
# Service principal auth (client secret)
# AZURE_OPENAI_TENANT_ID=...
# AZURE_OPENAI_CLIENT_ID=...
# AZURE_OPENAI_CLIENT_SECRET=...
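A quick pre-flight check can report which variables are missing for the selected auth type. The mapping below is read off the listing above. Treating AZURE_OPENAI_ENDPOINT as always required and AZURE_OPENAI_CLIENT_ID as optional for managed identity are assumptions, and missing_azure_vars is a hypothetical helper, not part of cogent.

```python
from typing import List, Mapping

# Extra variables each auth type needs, based on the listing above.
REQUIRED_BY_AUTH = {
    "api_key": ["AZURE_OPENAI_API_KEY"],
    "default": [],
    "managed_identity": [],  # client id assumed optional (system-assigned MI)
    "client_secret": [
        "AZURE_OPENAI_TENANT_ID",
        "AZURE_OPENAI_CLIENT_ID",
        "AZURE_OPENAI_CLIENT_SECRET",
    ],
}


def missing_azure_vars(env: Mapping[str, str]) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    auth = env.get("AZURE_OPENAI_AUTH_TYPE")
    if not auth:
        return ["AZURE_OPENAI_AUTH_TYPE"]
    if auth not in REQUIRED_BY_AUTH:
        raise ValueError(f"unknown AZURE_OPENAI_AUTH_TYPE: {auth!r}")
    needed = ["AZURE_OPENAI_ENDPOINT"] + REQUIRED_BY_AUTH[auth]
    return [name for name in needed if not env.get(name)]


# Example with an explicit mapping; pass os.environ to check the real process.
example_env = {
    "AZURE_OPENAI_AUTH_TYPE": "client_secret",
    "AZURE_OPENAI_ENDPOINT": "https://your-resource.openai.azure.com",
}
print(missing_azure_vars(example_env))
```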
Anthropic¶
Claude models with native SDK:
from cogent.models.anthropic import AnthropicChat
model = AnthropicChat(
model="claude-sonnet-4-6",
max_tokens=4096,
api_key="sk-ant-...", # Or ANTHROPIC_API_KEY env var
)
response = await model.ainvoke([
{"role": "user", "content": "Explain quantum computing"}
])
Claude-specific features:
# System message
response = await model.ainvoke(
messages=[{"role": "user", "content": "Hello"}],
system="You are a helpful coding assistant.",
)
# With tools
model = AnthropicChat(model="claude-sonnet-4-6")
bound = model.bind_tools([search_tool])
Groq¶
Ultra-fast inference for supported models:
from cogent.models.groq import GroqChat
model = GroqChat(
model="llama-3.3-70b-versatile",
api_key="gsk_...", # Or GROQ_API_KEY env var
)
response = await model.ainvoke([
{"role": "user", "content": "Write a haiku about coding"}
])
Available models:
| Model | Description |
|---|---|
| llama-3.3-70b-versatile | Llama 3.3 70B |
| llama-3.1-8b-instant | Fast Llama 3.1 8B |
| mixtral-8x7b-32768 | Mixtral 8x7B |
| gemma2-9b-it | Gemma 2 9B |
Responses API (Beta):
Groq also supports OpenAI's Responses API for optimized tool use:
from cogent.models.groq import GroqChat
# Standard Chat Completions API (default)
model = GroqChat(model="llama-3.3-70b-versatile")
# Responses API (optimized for tool use)
model = GroqChat(model="llama-3.3-70b-versatile", use_responses_api=True)
# Works seamlessly with tools
bound = model.bind_tools([search_tool])
response = await bound.ainvoke(messages)
Google Gemini¶
Google's Gemini models:
from cogent.models.gemini import GeminiChat, GeminiEmbedding
model = GeminiChat(
model="gemini-2.5-flash", # Default (upgraded from 2.0)
api_key="...", # Or GOOGLE_API_KEY env var
)
response = await model.ainvoke([
{"role": "user", "content": "What is the capital of France?"}
])
# Gemini 3 Preview (Not Production Ready)
model = GeminiChat(model="gemini-3-flash-preview")
# ⚠️ WARNING: Preview models may have breaking changes or be removed
# Native Thinking (opt-in for cost efficiency)
model = GeminiChat(
model="gemini-2.5-flash",
thinking_budget=16384, # Enable thinking (default: 0 = disabled)
)
# Embeddings
embeddings = GeminiEmbedding(model="text-embedding-004")
Available Models:
- gemini-2.5-pro, gemini-2.5-flash (Stable, 1M context, thinking support)
- gemini-2.0-flash (Stable)
- gemini-3-pro-preview, gemini-3-flash-preview ⚠️ (Preview only, thinking support)
Native Thinking:
- Default: thinking_budget=0 (disabled) - cost-efficient for most tasks
- Enable: Set thinking_budget > 0 (recommended: 8192-16384 tokens)
- Cost: Thinking tokens are billed separately - only enable when needed
- Use Cases: Complex reasoning, multi-step problems, strategic planning
Pass via Agent:
from cogent import Agent
# Enable thinking for this agent
agent = Agent(
name="Thinker",
model="gemini-2.5-flash",
model_kwargs={"thinking_budget": 16384},
)
Ollama¶
Local models via Ollama:
from cogent.models.ollama import OllamaChat, OllamaEmbedding
# Chat (requires `ollama run llama3.2`)
model = OllamaChat(
model="llama3.2",
base_url="http://localhost:11434",
)
response = await model.ainvoke([
{"role": "user", "content": "Hello!"}
])
# Embeddings
embeddings = OllamaEmbedding(model="nomic-embed-text")
xAI (Grok)¶
Grok models with reasoning capabilities:
from cogent.models.xai import XAIChat
# Latest flagship (2M context, reasoning)
model = XAIChat(
model="grok-4.20",
api_key="...", # Or XAI_API_KEY env var
)
# Non-reasoning variant (same price, no internal reasoning)
model = XAIChat(model="grok-4.20-0309-non-reasoning")
# Fast agentic model with reasoning (2M context)
model = XAIChat(model="grok-4-1-fast-reasoning")
# Fast without reasoning (cheaper for high-volume)
model = XAIChat(model="grok-4-1-fast-non-reasoning")
# With reasoning effort control (grok-3-mini only)
model = XAIChat(model="grok-3-mini", reasoning_effort="high")
# or use with_reasoning()
model = XAIChat(model="grok-3-mini").with_reasoning("high")
response = await model.ainvoke([
{"role": "user", "content": "What is 101 * 3?"}
])
# Reasoning tokens tracked in metadata
if response.metadata.tokens:
    print(f"Reasoning tokens: {response.metadata.tokens.reasoning_tokens}")
Available models:
| Model | Alias | Context | Reasoning | Description |
|---|---|---|---|---|
| grok-4.20-0309-reasoning | grok, grok-4.20, grok-4.20-reasoning | 2M | ✅ | Latest flagship: fast + reasoning |
| grok-4.20-0309-non-reasoning | grok-4.20-non-reasoning | 2M | ❌ | Latest flagship: non-reasoning variant |
| grok-4.20-multi-agent-0309 | - | 2M | ✅ | Multi-agent optimised variant |
| grok-4-0709 | grok-4 | 256K | ✅ | Grok 4 stable snapshot |
| grok-4-1-fast-reasoning | grok-fast-reasoning | 2M | ✅ | Fast agentic with explicit reasoning |
| grok-4-1-fast-non-reasoning | grok-fast, grok-fast-non-reasoning | 2M | ❌ | Fast agentic without reasoning |
| grok-3-mini | - | - | configurable | Supports reasoning_effort (low/high) |
| grok-2-vision-1212 | grok-vision | - | ❌ | Image understanding |
| grok-code-fast-1 | grok-code | - | ❌ | Code-optimized |
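The alias column of the table above can be read as a simple lookup from alias to full model id. The dictionary below transcribes that column. resolve_xai_model is a hypothetical helper shown only to make the mapping concrete, and cogent's own alias resolution may work differently.

```python
# Alias -> full model id, transcribed from the table above.
XAI_ALIASES = {
    "grok": "grok-4.20-0309-reasoning",
    "grok-4.20": "grok-4.20-0309-reasoning",
    "grok-4.20-reasoning": "grok-4.20-0309-reasoning",
    "grok-4.20-non-reasoning": "grok-4.20-0309-non-reasoning",
    "grok-4": "grok-4-0709",
    "grok-fast-reasoning": "grok-4-1-fast-reasoning",
    "grok-fast": "grok-4-1-fast-non-reasoning",
    "grok-fast-non-reasoning": "grok-4-1-fast-non-reasoning",
    "grok-vision": "grok-2-vision-1212",
    "grok-code": "grok-code-fast-1",
}


def resolve_xai_model(name: str) -> str:
    """Return the full model id for an alias; unknown names pass through unchanged."""
    return XAI_ALIASES.get(name, name)


print(resolve_xai_model("grok-fast"))
print(resolve_xai_model("grok-3-mini"))
```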
Features:
- Function/tool calling (all models)
- Structured outputs (JSON mode)
- Reasoning (all grok-4.20 and grok-4 models; grok-3-mini via reasoning_effort)
- Vision (grok-2-vision-1212)
- 2M context window (grok-4.20 and grok-4-1-fast models)
DeepSeek¶
DeepSeek models with Chain of Thought reasoning:
from cogent.models.deepseek import DeepSeekChat
# Standard chat model
model = DeepSeekChat(
model="deepseek-chat",
api_key="...", # Or DEEPSEEK_API_KEY env var
)
# Reasoning model (exposes Chain of Thought)
model = DeepSeekChat(model="deepseek-reasoner")
response = await model.ainvoke("9.11 and 9.8, which is greater?")
# Access reasoning content (Chain of Thought)
if hasattr(response, 'reasoning'):
    print("Reasoning:", response.reasoning)
print("Answer:", response.content)
Available models:
| Model | Tools | Description |
|---|---|---|
| deepseek-chat | ✅ | General chat model with tool support |
| deepseek-reasoner | ❌ | Reasoning model with CoT (no tools) |
Note: deepseek-reasoner does NOT support:
- Function calling/tools
- temperature, top_p, presence_penalty, frequency_penalty
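When the same settings are shared between both DeepSeek models, the unsupported sampling parameters can be filtered out before constructing the reasoner. The following is a minimal sketch under that assumption. reasoner_safe_kwargs is a hypothetical helper, and how cogent itself reacts to these parameters is not specified here.

```python
# Sampling parameters listed above as unsupported by deepseek-reasoner.
UNSUPPORTED_FOR_REASONER = {
    "temperature",
    "top_p",
    "presence_penalty",
    "frequency_penalty",
}


def reasoner_safe_kwargs(model: str, kwargs: dict) -> dict:
    """Drop unsupported sampling params for deepseek-reasoner; leave other models alone."""
    if model != "deepseek-reasoner":
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k not in UNSUPPORTED_FOR_REASONER}


shared = {"temperature": 0.7, "top_p": 0.9, "max_tokens": 1024}
print(reasoner_safe_kwargs("deepseek-reasoner", shared))
print(reasoner_safe_kwargs("deepseek-chat", shared))
```

Tool binding still needs separate handling, since the filter only covers keyword arguments.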
Custom Endpoints¶
Any OpenAI-compatible endpoint (vLLM, Together AI, etc.):
from cogent.models.custom import CustomChat, CustomEmbedding
# vLLM
model = CustomChat(
base_url="http://localhost:8000/v1",
model="meta-llama/Llama-3.2-3B-Instruct",
)
# Together AI
model = CustomChat(
base_url="https://api.together.xyz/v1",
model="meta-llama/Llama-3-70b-chat-hf",
api_key="...",
)
# Custom embeddings
embeddings = CustomEmbedding(
base_url="http://localhost:8000/v1",
model="BAAI/bge-small-en-v1.5",
)