Model Providers

Detailed setup and configuration for each supported LLM provider.

See Models Overview for the 3-tier API, model aliases, and configuration.


OpenAI

from cogent import Agent
from cogent.models import OpenAIChat, OpenAIEmbedding, create_chat

# Tier 1: Simple string
agent = Agent("Helper", model="gpt4")

# Tier 2: Factory
model = create_chat("gpt4")
model = create_chat("openai", "gpt-5.4")

# Tier 3: Direct
model = OpenAIChat(
    model="gpt-5.4",
    temperature=0.7,
    max_tokens=2000,
    api_key="sk-...",  # Or OPENAI_API_KEY env var
)

# Embeddings
embeddings = OpenAIEmbedding(model="text-embedding-3-small")

# Primary API with metadata
result = await embeddings.embed(["Hello world"])
print(result.embeddings)  # Vectors
print(result.metadata)    # Full metadata

# Convenience for single text
result = await embeddings.embed("Query")
vector = result.embeddings[0]
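
Embedding vectors such as those in result.embeddings are commonly compared with cosine similarity. The following is a minimal sketch, assuming each embedding is a plain list of floats; the cosine_similarity helper and the sample vectors are illustrative and not part of cogent.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up stand-ins for result.embeddings[0] from two separate embed() calls.
query_vec = [1.0, 0.0, 1.0]
doc_vec = [1.0, 1.0, 0.0]
score = cosine_similarity(query_vec, doc_vec)
print(round(score, 3))
```

Scores close to 1.0 indicate vectors pointing in nearly the same direction; scores near 0.0 indicate unrelated directions.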

With tools:

from cogent.tools import tool

@tool
def search(query: str) -> str:
    """Search the web."""
    return f"Results for: {query}"

model = OpenAIChat(model="gpt-5.4")
bound = model.bind_tools([search])

response = await bound.ainvoke([
    {"role": "user", "content": "Search for AI news"}
])

if response.tool_calls:
    print(response.tool_calls)
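
Once the model requests a tool, the caller still has to execute it. The sketch below shows one way to dispatch tool calls to Python functions. It assumes each entry in response.tool_calls is a dict with "name" and "args" keys; TOOL_REGISTRY, run_tool_calls, and the sample data are hypothetical names used only for illustration.

```python
from typing import Any, Callable


def search(query: str) -> str:
    """Plain-function version of the search tool."""
    return f"Results for: {query}"


# Map tool names to the callables that implement them.
TOOL_REGISTRY: dict[str, Callable[..., Any]] = {"search": search}


def run_tool_calls(tool_calls: list[dict[str, Any]]) -> list[Any]:
    """Execute each requested tool call and collect the results in order."""
    results = []
    for call in tool_calls:
        func = TOOL_REGISTRY.get(call["name"])
        if func is None:
            raise KeyError(f"unknown tool: {call['name']}")
        results.append(func(**call["args"]))
    return results


# Made-up stand-in for response.tool_calls.
sample_calls = [{"name": "search", "args": {"query": "AI news"}}]
print(run_tool_calls(sample_calls))
```

Raising on an unknown tool name keeps a hallucinated tool from being silently ignored.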

xAI (Grok)

from cogent.models import XAIChat

# Latest flagship model (2M context, reasoning always on)
model = XAIChat(model="grok-4.20", api_key="xai-...")

# Non-reasoning variant (faster for latency-sensitive tasks)
model = XAIChat(model="grok-4.20-0309-non-reasoning")

# Fast agentic model with reasoning (2M context)
model = XAIChat(model="grok-4-1-fast-reasoning")

# Vision model
model = XAIChat(model="grok-2-vision-1212")

# With reasoning effort (grok-3-mini only)
model = XAIChat(model="grok-3-mini", reasoning_effort="high")
response = await model.ainvoke("What is 101 * 3?")
print(response.metadata.tokens.reasoning_tokens)

Available Models:

- grok-4.20-0309-reasoning: Latest flagship, 2M context, fast + reasoning
- grok-4.20-0309-non-reasoning: Non-reasoning variant, 2M context
- grok-4.20-multi-agent-0309: Multi-agent optimised variant, 2M context, reasoning
- grok-4-0709: Grok 4 stable snapshot, 256K context, reasoning
- grok-4-1-fast-reasoning: Fast model with explicit reasoning, 2M context
- grok-4-1-fast-non-reasoning: Fast model without reasoning, 2M context
- grok-3, grok-3-mini: Previous generation
- grok-2-vision-1212: Vision model

Note: Use explicit xai:model syntax, e.g. xai:grok-4.20.

Environment Variable: XAI_API_KEY


DeepSeek

from cogent.models import DeepSeekChat

# Standard chat model
model = DeepSeekChat(model="deepseek-chat", api_key="sk-...")

# Reasoning model with Chain of Thought
model = DeepSeekChat(model="deepseek-reasoner")
response = await model.ainvoke("9.11 and 9.8, which is greater?")

# Access reasoning content
if hasattr(response, 'reasoning'):
    print("Reasoning:", response.reasoning)
print("Answer:", response.content)

Available Models:

- deepseek-chat: General chat model with function calling
- deepseek-reasoner: Reasoning model with Chain of Thought (no function calling)

Environment Variable: DEEPSEEK_API_KEY

Note: DeepSeek Reasoner does NOT support function calling, temperature, or sampling parameters.


Cerebras (Ultra-Fast Inference)

from cogent.models import CerebrasChat

# Llama 3.1 8B (default)
model = CerebrasChat(model="llama3.1-8b", api_key="csk-...")

# Llama 3.3 70B
model = CerebrasChat(model="llama-3.3-70b")

# Streaming
async for chunk in model.astream(messages):
    print(chunk.content, end="")
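
When the full text is needed after streaming, the chunks can be accumulated as they arrive. The following is a self-contained sketch that replaces model.astream(messages) with a hypothetical fake_stream generator; the Chunk class is a stand-in that assumes only the content attribute used above.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class Chunk:
    """Stand-in for a streamed chunk; only the content attribute is assumed."""
    content: str


async def fake_stream() -> AsyncIterator[Chunk]:
    """Hypothetical replacement for model.astream(messages)."""
    for piece in ["Hello", ", ", "world"]:
        await asyncio.sleep(0)  # yield control, as a real network stream would
        yield Chunk(piece)


async def collect(stream: AsyncIterator[Chunk]) -> str:
    """Concatenate chunk contents into the full response text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk.content)
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream()))
print(full_text)
```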

Available Models:

- llama3.1-8b: Llama 3.1 8B (default)
- llama-3.3-70b: Llama 3.3 70B
- qwen-3-32b: Qwen 3 32B
- qwen-3-235b-a22b-instruct-2507: Qwen 3 235B MoE
- zai-glm-4.7: ZAI GLM 4.7
- gpt-oss-120b: GPT OSS 120B (reasoning model)

Note: Use explicit cerebras:model syntax, e.g. cerebras:llama3.1-8b. Bare gpt-oss-* strings are NOT routed to Cerebras.

Environment Variable: CEREBRAS_API_KEY

Note: Cerebras runs inference on its Wafer-Scale Engine (WSE-3) hardware, which is designed for very high token throughput.


Cloudflare Workers AI

from cogent.models import CloudflareChat, CloudflareEmbedding

# Chat models
model = CloudflareChat(
    model="@cf/meta/llama-3.3-70b-instruct",
    account_id="...",
    api_key="...",
)

# Embeddings
embeddings = CloudflareEmbedding(
    model="@cf/baai/bge-base-en-v1.5",
    account_id="...",
    api_key="...",
)

Available Models: All Cloudflare Workers AI models with @cf/ prefix

Environment Variables: CLOUDFLARE_ACCOUNT_ID, CLOUDFLARE_API_TOKEN
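
Because Cloudflare needs two values, it can help to fail early with a clear message when either is missing. The following is a minimal sketch using only the standard library; require_env is a hypothetical helper, not a cogent function, and the demo values are placeholders.

```python
import os


def require_env(*names: str) -> dict[str, str]:
    """Return the named environment variables, or raise listing the missing ones."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: os.environ[name] for name in names}


# Demo values only; in practice these come from your shell or secret store.
os.environ.setdefault("CLOUDFLARE_ACCOUNT_ID", "demo-account")
os.environ.setdefault("CLOUDFLARE_API_TOKEN", "demo-token")

creds = require_env("CLOUDFLARE_ACCOUNT_ID", "CLOUDFLARE_API_TOKEN")
print(sorted(creds))
```

The returned values could then be passed as account_id and api_key when constructing the chat or embedding class.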


Azure AI Foundry (GitHub Models)

import os

from cogent.models.azure import AzureAIFoundryChat

# GitHub Models
model = AzureAIFoundryChat.from_github(
    model="meta/Meta-Llama-3.1-8B-Instruct",
    token=os.getenv("GITHUB_TOKEN"),
)

# Azure AI Foundry endpoint
model = AzureAIFoundryChat(
    model="gpt-5.4-mini",
    endpoint="https://...",
    api_key="...",
)

Available via GitHub Models: Llama, Phi, Mistral, Cohere, and more

Environment Variable: GITHUB_TOKEN


OpenRouter

OpenRouter is a unified API gateway that routes to 200+ models from OpenAI, Anthropic, Google, Meta, Mistral, and many others. Use it to access any model through a single API key, enable automatic fallbacks, or add web search to any request.

from cogent.models.openrouter import OpenRouterChat

# Tier 1: Explicit provider:model syntax
agent = Agent("Helper", model="openrouter:anthropic/claude-sonnet-4")
agent = Agent("Helper", model="openrouter:openai/gpt-4o")
agent = Agent("Helper", model="openrouter:openrouter/auto")

# Tier 2: Factory
from cogent.models import create_chat
llm = create_chat("openrouter", "mistralai/mistral-7b-instruct:free")

# Tier 3: Direct class (full control)
llm = OpenRouterChat(
    model="anthropic/claude-sonnet-4",
    temperature=0.7,
    max_tokens=4096,
)

Environment Variable: OPENROUTER_API_KEY

Usage: Use openrouter:vendor/model syntax, e.g. openrouter:anthropic/claude-sonnet-4.
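
To make the provider:model convention concrete, here is a small sketch of how such a string could be split. It is not cogent's actual resolver (which likely also handles aliases); split_model_spec is a hypothetical helper that simply splits at the first colon.

```python
from typing import Optional, Tuple


def split_model_spec(spec: str) -> Tuple[Optional[str], str]:
    """Split 'provider:model' at the first colon; bare names have no provider."""
    provider, sep, model = spec.partition(":")
    if not sep:
        return None, spec
    return provider, model


print(split_model_spec("openrouter:anthropic/claude-sonnet-4"))
print(split_model_spec("openrouter:mistralai/mistral-7b-instruct:free"))
print(split_model_spec("gpt4"))
```

Splitting only at the first colon keeps suffixes such as :free attached to the model name. The trade-off is that a model name containing a colon must always be written with an explicit provider prefix, or the part before the colon would be mistaken for the provider.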


Provider Routing

Control which underlying providers OpenRouter routes to, and whether fallbacks are allowed:

llm = OpenRouterChat(
    model="anthropic/claude-sonnet-4",
    provider={
        "order": ["Anthropic", "AWS Bedrock"],  # try in order
        "allow_fallbacks": False,               # hard-fail if both unavailable
        "require_parameters": True,             # only providers supporting all params
    },
)

Model Fallbacks

Supply a ranked list of fallback models. OpenRouter tries each in order if a model is unavailable or rate-limited:

llm = OpenRouterChat(
    model="anthropic/claude-opus-4",
    fallback_models=[
        "anthropic/claude-sonnet-4",
        "openai/gpt-4o",
    ],
)
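
OpenRouter applies this ordering on its side. The sketch below only illustrates the "try each in order" semantics in plain Python; first_available, ModelUnavailable, and fake_call are hypothetical names, and the failure of the primary model is simulated.

```python
from typing import Callable


class ModelUnavailable(Exception):
    """Raised by the hypothetical backend when a model is down or rate-limited."""


def first_available(models: list[str], call: Callable[[str], str]) -> tuple[str, str]:
    """Try each model in order and return (model, reply) from the first success."""
    errors = []
    for name in models:
        try:
            return name, call(name)
        except ModelUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise ModelUnavailable("all models failed: " + "; ".join(errors))


def fake_call(name: str) -> str:
    """Pretend the primary model is rate-limited and the others respond."""
    if name == "anthropic/claude-opus-4":
        raise ModelUnavailable("rate-limited")
    return f"reply from {name}"


chain = ["anthropic/claude-opus-4", "anthropic/claude-sonnet-4", "openai/gpt-4o"]
used, reply = first_available(chain, fake_call)
print(used)
```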

Plugins

Plugins extend what any model can do without modifying your prompt.

Web search — attaches live search results to the request:

llm = OpenRouterChat(
    model="openai/gpt-4o",
    plugins=[{"id": "web", "max_results": 5}],
)

Response healing — automatically retries malformed structured-output responses:

llm = OpenRouterChat(
    model="openai/gpt-4o",
    plugins=[{"id": "response-healing"}],
)

File parser and context compression are also available; pass any plugin object supported by the OpenRouter plugins API.


Sampling Parameters

All standard and OpenRouter-specific sampling params are supported:

llm = OpenRouterChat(
    model="meta-llama/llama-3.3-70b-instruct",
    temperature=0.8,
    top_p=0.9,
    top_k=40,
    frequency_penalty=0.2,
    presence_penalty=0.1,
    repetition_penalty=1.05,
    min_p=0.05,
    top_a=0.1,
    seed=42,
    stop=["END"],
)

Tool Choice

Pass tool_choice through bind_tools to force, prevent, or select specific tool use:

from cogent.tools import tool

@tool
def search(query: str) -> str:
    """Search the web."""
    return f"Results for: {query}"

# Force the model to call a tool
bound = llm.bind_tools([search], tool_choice="required")

# Force a specific function
bound = llm.bind_tools([search], tool_choice={"type": "function", "function": {"name": "search"}})

# Disable tool use entirely
bound = llm.bind_tools([search], tool_choice="none")

Anthropic Beta Features

For Anthropic models routed through OpenRouter, pass beta feature flags via the anthropic_beta field:

llm = OpenRouterChat(
    model="anthropic/claude-sonnet-4",
    anthropic_beta="interleaved-thinking-2025-05-14",
)

# Multiple betas
llm = OpenRouterChat(
    model="anthropic/claude-sonnet-4",
    anthropic_beta=["interleaved-thinking-2025-05-14", "prompt-caching-2024-07-31"],
)

Reasoning Control

For reasoning/thinking models, control effort, token budget, and whether reasoning tokens appear in the response:

# Exclude reasoning tokens from the response (model still thinks internally)
llm = OpenRouterChat(model="deepseek/deepseek-v3.2", reasoning_exclude=True)

# Set a specific token budget (Anthropic models, some Qwen and Gemini 2.5)
llm = OpenRouterChat(model="anthropic/claude-3.7-sonnet", reasoning_max_tokens=2000)

# Control effort level (OpenAI o-series, Grok, Gemini 3)
llm = OpenRouterChat(model="openai/o3-mini", reasoning_effort="low")

# Disable thinking entirely on a model that reasons by default
llm = OpenRouterChat(model="openai/o3-mini", reasoning_effort="none")

# Combine: low effort, exclude from response
llm = OpenRouterChat(model="openai/o3-mini", reasoning_effort="low", reasoning_exclude=True)

| Parameter | Type | Description |
| --- | --- | --- |
| reasoning_effort | "xhigh", "high", "medium", "low", "minimal", or "none" | Effort level. "none" disables thinking entirely. Supported by OpenAI o-series, Grok, Gemini 3. |
| reasoning_max_tokens | int | Token budget for reasoning. Supported by Anthropic, some Qwen and Gemini 2.5. Cannot be combined with reasoning_effort. |
| reasoning_exclude | bool | When True, the model thinks internally but reasoning tokens are not returned. Works with either of the other two parameters. |
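
The constraints above (a fixed set of effort levels, and effort and token budget being mutually exclusive) can be checked before a request is made. This is a sketch under stated assumptions: build_reasoning is a hypothetical helper, and the output keys effort, max_tokens, and exclude are assumed to mirror the reasoning object of the OpenRouter API rather than anything cogent documents.

```python
from typing import Any, Optional

EFFORT_LEVELS = {"xhigh", "high", "medium", "low", "minimal", "none"}


def build_reasoning(
    effort: Optional[str] = None,
    max_tokens: Optional[int] = None,
    exclude: bool = False,
) -> dict[str, Any]:
    """Validate the three options and merge them into one dict."""
    if effort is not None and max_tokens is not None:
        raise ValueError("reasoning_effort and reasoning_max_tokens cannot be combined")
    if effort is not None and effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort!r}")
    if max_tokens is not None and max_tokens <= 0:
        raise ValueError("reasoning_max_tokens must be positive")
    options: dict[str, Any] = {}
    if effort is not None:
        options["effort"] = effort
    if max_tokens is not None:
        options["max_tokens"] = max_tokens
    if exclude:
        options["exclude"] = True
    return options


print(build_reasoning(effort="low", exclude=True))
print(build_reasoning(max_tokens=2000))
```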

Cost and Cache Metadata

Every response carries cost and cache metadata in AIMessage.metadata:

response = await llm.ainvoke("Hello")
meta = response.metadata

print(meta.cost)                          # USD cost, e.g. 0.000123
print(meta.native_finish_reason)          # raw finish reason from the provider
print(meta.usage.cached_tokens)           # prompt tokens served from cache
print(meta.usage.cache_write_tokens)      # tokens written to prompt cache
print(meta.usage.reasoning_tokens)        # thinking/reasoning tokens used
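
These fields make it straightforward to track spend across many calls. The following sketch sums cost and cached tokens over a batch; it uses SimpleNamespace stand-ins with the same attribute layout instead of real responses, and it assumes a field may be None when the provider does not report it. summarize and fake_response are hypothetical helpers.

```python
from types import SimpleNamespace


def summarize(responses) -> dict:
    """Sum cost and cached prompt tokens over a batch of responses."""
    total_cost = 0.0
    cached = 0
    for response in responses:
        meta = response.metadata
        total_cost += meta.cost or 0.0  # treat a missing cost as zero
        cached += meta.usage.cached_tokens or 0
    return {"cost_usd": total_cost, "cached_tokens": cached}


def fake_response(cost, cached_tokens):
    """Build a stand-in exposing metadata.cost and metadata.usage.cached_tokens."""
    usage = SimpleNamespace(cached_tokens=cached_tokens)
    return SimpleNamespace(metadata=SimpleNamespace(cost=cost, usage=usage))


batch = [fake_response(0.25, 100), fake_response(0.5, 0), fake_response(None, 40)]
print(summarize(batch))
```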

model_kwargs Shorthand

When using the string or factory path, pass OpenRouter-specific options via model_kwargs:

agent = Agent(
    name="Researcher",
    model="openrouter:openai/gpt-4o",
    model_kwargs={
        "plugins": [{"id": "web", "max_results": 3}],
        "provider": {"order": ["OpenAI"]},
        "fallback_models": ["anthropic/claude-sonnet-4"],
        "seed": 42,
    },
)

Previous Provider Sections Continue Below

# Inspect each tool call on a response from a tool-bound model
if response.tool_calls:
    for call in response.tool_calls:
        print(f"Tool: {call['name']}, Args: {call['args']}")

Responses API (Beta):

OpenAI's Responses API is optimized for tool use and structured outputs. Enable it with the use_responses_api=True parameter:

from cogent.models.openai import OpenAIChat

# Standard Chat Completions API (default)
model = OpenAIChat(model="gpt-5.4")

# Responses API (optimized for tool use)
model = OpenAIChat(model="gpt-5.4", use_responses_api=True)

# Works seamlessly with tools
bound = model.bind_tools([search_tool, calc_tool])
response = await bound.ainvoke(messages)

The Responses API provides better performance for multi-turn tool conversations while maintaining the same interface.


Azure OpenAI

Enterprise Azure deployments with Microsoft Entra ID (formerly Azure AD) support:

from cogent.models.azure import AzureEntraAuth, AzureOpenAIChat, AzureOpenAIEmbedding

# With API key
model = AzureOpenAIChat(
    azure_endpoint="https://your-resource.openai.azure.com",
    deployment="gpt-5.4",
    api_key="your-api-key",
    api_version="2024-02-01",
)

# With Entra ID (DefaultAzureCredential)
model = AzureOpenAIChat(
    azure_endpoint="https://your-resource.openai.azure.com",
    deployment="gpt-5.4",
    entra=AzureEntraAuth(method="default"),  # Uses DefaultAzureCredential
)

# With Entra ID (Managed Identity)
# - System-assigned MI: omit client_id
# - User-assigned MI: set client_id (recommended when multiple identities exist)
model = AzureOpenAIChat(
    azure_endpoint="https://your-resource.openai.azure.com",
    deployment="gpt-5.4",
    entra=AzureEntraAuth(
        method="managed_identity",
        client_id="<USER_ASSIGNED_MANAGED_IDENTITY_CLIENT_ID>",
    ),
)

# Embeddings
embeddings = AzureOpenAIEmbedding(
    azure_endpoint="https://your-resource.openai.azure.com",
    deployment="text-embedding-ada-002",
    entra=AzureEntraAuth(method="default"),
)

result = await embeddings.embed(["Document text"])

Environment variables:

AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
AZURE_OPENAI_API_VERSION=2024-02-01
AZURE_OPENAI_DEPLOYMENT=gpt-5.4
AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-ada-002

# Auth selection
AZURE_OPENAI_AUTH_TYPE=managed_identity  # api_key | default | managed_identity | client_secret

# API key auth
# AZURE_OPENAI_API_KEY=your-api-key

# Managed identity auth (user-assigned MI)
# AZURE_OPENAI_CLIENT_ID=...

# Service principal auth (client secret)
# AZURE_OPENAI_TENANT_ID=...
# AZURE_OPENAI_CLIENT_ID=...
# AZURE_OPENAI_CLIENT_SECRET=...
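
Since each auth type needs a different set of variables, a quick pre-flight check can catch misconfiguration before the first request. This is a sketch, not cogent's own validation: missing_azure_vars is a hypothetical helper, and falling back to api_key when AZURE_OPENAI_AUTH_TYPE is unset is an assumption.

```python
from typing import Mapping

# Variables each auth type needs, following the listing above.
REQUIRED_BY_AUTH_TYPE = {
    "api_key": ["AZURE_OPENAI_API_KEY"],
    "default": [],
    "managed_identity": [],  # AZURE_OPENAI_CLIENT_ID is optional (user-assigned MI)
    "client_secret": [
        "AZURE_OPENAI_TENANT_ID",
        "AZURE_OPENAI_CLIENT_ID",
        "AZURE_OPENAI_CLIENT_SECRET",
    ],
}


def missing_azure_vars(env: Mapping[str, str]) -> list[str]:
    """Return the variables still missing for the selected auth type."""
    auth_type = env.get("AZURE_OPENAI_AUTH_TYPE", "api_key")  # assumed default
    if auth_type not in REQUIRED_BY_AUTH_TYPE:
        raise ValueError(f"unsupported AZURE_OPENAI_AUTH_TYPE: {auth_type!r}")
    required = ["AZURE_OPENAI_ENDPOINT"] + REQUIRED_BY_AUTH_TYPE[auth_type]
    return [name for name in required if not env.get(name)]


example_env = {
    "AZURE_OPENAI_ENDPOINT": "https://your-resource.openai.azure.com",
    "AZURE_OPENAI_AUTH_TYPE": "client_secret",
    "AZURE_OPENAI_TENANT_ID": "tenant",
}
print(missing_azure_vars(example_env))
```

In real use, os.environ can be passed in place of example_env.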

Anthropic

Claude models with native SDK:

from cogent.models.anthropic import AnthropicChat

model = AnthropicChat(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    api_key="sk-ant-...",  # Or ANTHROPIC_API_KEY env var
)

response = await model.ainvoke([
    {"role": "user", "content": "Explain quantum computing"}
])

Claude-specific features:

# System message
response = await model.ainvoke(
    messages=[{"role": "user", "content": "Hello"}],
    system="You are a helpful coding assistant.",
)

# With tools
model = AnthropicChat(model="claude-sonnet-4-6")
bound = model.bind_tools([search_tool])

Groq

Ultra-fast inference for supported models:

from cogent.models.groq import GroqChat

model = GroqChat(
    model="llama-3.3-70b-versatile",
    api_key="gsk_...",  # Or GROQ_API_KEY env var
)

response = await model.ainvoke([
    {"role": "user", "content": "Write a haiku about coding"}
])

Available models:

| Model | Description |
| --- | --- |
| llama-3.3-70b-versatile | Llama 3.3 70B |
| llama-3.1-8b-instant | Fast Llama 3.1 8B |
| mixtral-8x7b-32768 | Mixtral 8x7B |
| gemma2-9b-it | Gemma 2 9B |

Responses API (Beta):

Groq also supports OpenAI's Responses API for optimized tool use:

from cogent.models.groq import GroqChat

# Standard Chat Completions API (default)
model = GroqChat(model="llama-3.3-70b-versatile")

# Responses API (optimized for tool use)
model = GroqChat(model="llama-3.3-70b-versatile", use_responses_api=True)

# Works seamlessly with tools
bound = model.bind_tools([search_tool])
response = await bound.ainvoke(messages)

Google Gemini

Google's Gemini models:

from cogent.models.gemini import GeminiChat, GeminiEmbedding

model = GeminiChat(
    model="gemini-2.5-flash",  # Default (upgraded from 2.0)
    api_key="...",  # Or GOOGLE_API_KEY env var
)

response = await model.ainvoke([
    {"role": "user", "content": "What is the capital of France?"}
])

# Gemini 3 Preview (Not Production Ready)
model = GeminiChat(model="gemini-3-flash-preview")
# ⚠️ WARNING: Preview models may have breaking changes or be removed

# Native Thinking (opt-in for cost efficiency)
model = GeminiChat(
    model="gemini-2.5-flash",
    thinking_budget=16384,  # Enable thinking (default: 0 = disabled)
)

# Embeddings
embeddings = GeminiEmbedding(model="text-embedding-004")

Available Models:

- gemini-2.5-pro, gemini-2.5-flash (Stable, 1M context, thinking support)
- gemini-2.0-flash (Stable)
- gemini-3-pro-preview, gemini-3-flash-preview ⚠️ (Preview only, thinking support)

Native Thinking:

- Default: thinking_budget=0 (disabled), cost-efficient for most tasks
- Enable: Set thinking_budget > 0 (recommended: 8192-16384 tokens)
- Cost: Thinking tokens are billed separately, so only enable thinking when needed
- Use Cases: Complex reasoning, multi-step problems, strategic planning

Pass via Agent:

from cogent import Agent

# Enable thinking for this agent
agent = Agent(
    name="Thinker",
    model="gemini-2.5-flash",
    model_kwargs={"thinking_budget": 16384},
)


Ollama

Local models via Ollama:

from cogent.models.ollama import OllamaChat, OllamaEmbedding

# Chat (requires `ollama run llama3.2`)
model = OllamaChat(
    model="llama3.2",
    base_url="http://localhost:11434",
)

response = await model.ainvoke([
    {"role": "user", "content": "Hello!"}
])

# Embeddings
embeddings = OllamaEmbedding(model="nomic-embed-text")

xAI (Grok)

Grok models with reasoning capabilities:

from cogent.models.xai import XAIChat

# Latest flagship (2M context, reasoning)
model = XAIChat(
    model="grok-4.20",
    api_key="...",  # Or XAI_API_KEY env var
)

# Non-reasoning variant (same price, no internal reasoning)
model = XAIChat(model="grok-4.20-0309-non-reasoning")

# Fast agentic model with reasoning (2M context)
model = XAIChat(model="grok-4-1-fast-reasoning")

# Fast without reasoning (cheaper for high-volume)
model = XAIChat(model="grok-4-1-fast-non-reasoning")

# With reasoning effort control (grok-3-mini only)
model = XAIChat(model="grok-3-mini", reasoning_effort="high")
# or use with_reasoning()
model = XAIChat(model="grok-3-mini").with_reasoning("high")

response = await model.ainvoke([
    {"role": "user", "content": "What is 101 * 3?"}
])

# Reasoning tokens tracked in metadata
if response.metadata.tokens:
    print(f"Reasoning tokens: {response.metadata.tokens.reasoning_tokens}")

Available models:

| Model | Alias | Context | Reasoning | Description |
| --- | --- | --- | --- | --- |
| grok-4.20-0309-reasoning | grok, grok-4.20, grok-4.20-reasoning | 2M | Yes | Latest flagship, fast + reasoning |
| grok-4.20-0309-non-reasoning | grok-4.20-non-reasoning | 2M | No | Latest flagship, non-reasoning variant |
| grok-4.20-multi-agent-0309 | | 2M | Yes | Multi-agent optimised variant |
| grok-4-0709 | grok-4 | 256K | Yes | Grok 4 stable snapshot |
| grok-4-1-fast-reasoning | grok-fast-reasoning | 2M | Yes | Fast agentic with explicit reasoning |
| grok-4-1-fast-non-reasoning | grok-fast, grok-fast-non-reasoning | 2M | No | Fast agentic without reasoning |
| grok-3-mini | | | configurable | Supports reasoning_effort (low/high) |
| grok-2-vision-1212 | grok-vision | | | Image understanding |
| grok-code-fast-1 | grok-code | | | Code-optimized |

Features:

- Function/tool calling (all models)
- Structured outputs (JSON mode)
- Reasoning (all grok-4.20 and grok-4 models; grok-3-mini via reasoning_effort)
- Vision (grok-2-vision-1212)
- 2M context window (grok-4.20 and grok-4-1-fast models)


DeepSeek

DeepSeek models with Chain of Thought reasoning:

from cogent.models.deepseek import DeepSeekChat

# Standard chat model
model = DeepSeekChat(
    model="deepseek-chat",
    api_key="...",  # Or DEEPSEEK_API_KEY env var
)

# Reasoning model (exposes Chain of Thought)
model = DeepSeekChat(model="deepseek-reasoner")

response = await model.ainvoke("9.11 and 9.8, which is greater?")

# Access reasoning content (Chain of Thought)
if hasattr(response, 'reasoning'):
    print("Reasoning:", response.reasoning)
print("Answer:", response.content)

Available models:

| Model | Tools | Description |
| --- | --- | --- |
| deepseek-chat | Yes | General chat model with tool support |
| deepseek-reasoner | No | Reasoning model with CoT (no tools) |

Note: deepseek-reasoner does NOT support:

- Function calling/tools
- temperature, top_p, presence_penalty, frequency_penalty
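
When the same settings are shared between both DeepSeek models, the unsupported sampling parameters can be stripped before constructing the reasoner. The following is a minimal sketch; clean_kwargs is a hypothetical helper and not part of cogent, and passing max_tokens through unchanged is an assumption.

```python
from typing import Any

# Sampling parameters the note above lists as unsupported by deepseek-reasoner.
REASONER_UNSUPPORTED = {"temperature", "top_p", "presence_penalty", "frequency_penalty"}


def clean_kwargs(model: str, kwargs: dict[str, Any]) -> dict[str, Any]:
    """Drop unsupported sampling parameters when targeting deepseek-reasoner."""
    if model != "deepseek-reasoner":
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k not in REASONER_UNSUPPORTED}


requested = {"temperature": 0.7, "top_p": 0.9, "max_tokens": 1024}
print(clean_kwargs("deepseek-reasoner", requested))
print(clean_kwargs("deepseek-chat", requested))
```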


Custom Endpoints

Any OpenAI-compatible endpoint (vLLM, Together AI, etc.):

from cogent.models.custom import CustomChat, CustomEmbedding

# vLLM
model = CustomChat(
    base_url="http://localhost:8000/v1",
    model="meta-llama/Llama-3.2-3B-Instruct",
)

# Together AI
model = CustomChat(
    base_url="https://api.together.xyz/v1",
    model="meta-llama/Llama-3-70b-chat-hf",
    api_key="...",
)

# Custom embeddings
embeddings = CustomEmbedding(
    base_url="http://localhost:8000/v1",
    model="BAAI/bge-small-en-v1.5",
)
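
"OpenAI-compatible" here means the server accepts the same HTTP requests as OpenAI's chat completions endpoint. As an illustration of what such a request looks like on the wire, the sketch below builds one with the standard library without sending it. build_chat_request is a hypothetical helper; how CustomChat constructs its requests internally is not shown in this document.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, model: str, prompt: str, api_key: str = ""
) -> urllib.request.Request:
    """Build (but do not send) a POST to an OpenAI-compatible chat endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000/v1", "meta-llama/Llama-3.2-3B-Instruct", "Hello"
)
print(req.full_url)
```

Sending the request with urllib.request.urlopen(req) would require a server listening at that address, such as a local vLLM instance.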