liteLLM Blog

New Video Characters, Edit and Extension API support

Mon, 16 Mar 2026 10:00:00 GMT

LiteLLM now supports video character, edit, and extension APIs.

What's New

Four new endpoints for video character operations:

Create character - Upload a video to create a reusable asset
Get character - Retrieve character metadata
Edit video - Modify generated videos
Extend video - Continue clips with character consistency

Available from: LiteLLM v1.83.0+

Quick Example

import asyncio

import litellm


async def main():
    # The a-prefixed functions are assumed to be coroutines (LiteLLM's async
    # naming convention), so each call is awaited.

    # Create character from video
    with open("luna.mp4", "rb") as video_file:
        character = await litellm.avideo_create_character(
            name="Luna",
            video=video_file,
            custom_llm_provider="openai",
            model="sora-2",
        )
    print(f"Character: {character.id}")

    # Use in generation
    video = await litellm.avideo(
        model="sora-2",
        prompt="Luna dances through a magical forest.",
        characters=[{"id": character.id}],
        seconds="8",
    )

    # Get character info
    fetched = await litellm.avideo_get_character(
        character_id=character.id,
        custom_llm_provider="openai",
    )

    # Edit with character preserved
    edited = await litellm.avideo_edit(
        video_id=video.id,
        prompt="Add warm golden lighting",
    )

    # Extend sequence
    extended = await litellm.avideo_extension(
        video_id=video.id,
        prompt="Luna waves goodbye",
        seconds="5",
    )


asyncio.run(main())

Via Proxy

# Create character
curl -X POST "http://localhost:4000/v1/videos/characters" \
  -H "Authorization: Bearer sk-litellm-key" \
  -F "video=@luna.mp4" \
  -F "name=Luna"

# Get character
curl -X GET "http://localhost:4000/v1/videos/characters/char_abc123def456" \
  -H "Authorization: Bearer sk-litellm-key"

# Edit video
curl -X POST "http://localhost:4000/v1/videos/edits" \
  -H "Authorization: Bearer sk-litellm-key" \
  -H "Content-Type: application/json" \
  -d '{
    "video": {"id": "video_xyz789"},
    "prompt": "Add warm golden lighting and enhance colors"
  }'

# Extend video
curl -X POST "http://localhost:4000/v1/videos/extensions" \
  -H "Authorization: Bearer sk-litellm-key" \
  -H "Content-Type: application/json" \
  -d '{
    "video": {"id": "video_xyz789"},
    "prompt": "Luna waves goodbye and walks into the sunset",
    "seconds": "5"
  }'
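
For callers who prefer Python over curl, the edit and extend requests above can be built with the standard library. This is only a sketch: the base URL, key, and video ID are the placeholder values from the curl examples, and `build_video_request` is a hypothetical helper, not part of LiteLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:4000"  # placeholder proxy address from the curl examples
API_KEY = "sk-litellm-key"          # placeholder LiteLLM key


def build_video_request(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request to the proxy."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


edit_request = build_video_request(
    "/v1/videos/edits",
    {
        "video": {"id": "video_xyz789"},
        "prompt": "Add warm golden lighting and enhance colors",
    },
)
extend_request = build_video_request(
    "/v1/videos/extensions",
    {
        "video": {"id": "video_xyz789"},
        "prompt": "Luna waves goodbye and walks into the sunset",
        "seconds": "5",
    },
)

# To send a request against a running proxy:
# with urllib.request.urlopen(edit_request) as resp:
#     print(json.load(resp))
```

Building the request separately from sending it keeps the payload easy to inspect before any network call is made.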

Managed Character IDs

LiteLLM automatically encodes provider and model metadata into character IDs:

What happens:

Upload character "Luna" with model "sora-2" on OpenAI
  ↓
LiteLLM creates: char_abc123def456 (contains provider + model_id)
  ↓
When you reference it later, LiteLLM decodes automatically
  ↓
Router knows exactly which deployment to use

Behind the scenes:

Character ID format: character_
Metadata includes: provider, model_id, original_character_id
Transparent to you - just use the ID, LiteLLM handles routing

Realtime WebRTC HTTP Endpoints

Thu, 12 Mar 2026 10:00:00 GMT

Connect to the Realtime API via WebRTC from browser/mobile clients. LiteLLM handles auth and key management.

How it works

[Figure: flow of generating an ephemeral token]

Proxy Setup

model_list:
  - model_name: gpt-4o-realtime
    litellm_params:
      model: openai/gpt-4o-realtime-preview-2024-12-17
      api_key: os.environ/OPENAI_API_KEY
    model_info:
      mode: realtime

For Azure, set model: azure/gpt-4o-realtime-preview and provide api_key and api_base.

litellm --config /path/to/config.yaml

Client Usage

1. Get token - POST /v1/realtime/client_secrets with LiteLLM API key and { model }.

2. WebRTC handshake - Create RTCPeerConnection, add mic track, create data channel oai-events, send SDP offer to POST /v1/realtime/calls with Authorization: Bearer and Content-Type: application/sdp.

3. Events - Use the data channel for session.update and other events.
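
The first two steps are plain HTTP calls, so they can be sketched outside the browser. The example below is a hedged illustration rather than the official client: the helper names are hypothetical, the JSON content type for step 1 is an assumption, and so is the choice to pass the token returned by step 1 as the bearer token in step 2 (the text above does not say which token is used). The response shape of `/v1/realtime/client_secrets` is not described here, so the sketch stops at building the requests.

```python
import json
import urllib.request


def build_client_secret_request(
    base_url: str, litellm_key: str, model: str
) -> urllib.request.Request:
    """Step 1: ask the proxy for a client secret for the given realtime model."""
    return urllib.request.Request(
        url=f"{base_url}/v1/realtime/client_secrets",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {litellm_key}",
            "Content-Type": "application/json",  # assumption: JSON body
        },
        method="POST",
    )


def build_sdp_offer_request(
    base_url: str, bearer_token: str, sdp_offer: str
) -> urllib.request.Request:
    """Step 2: post the SDP offer produced by the RTCPeerConnection."""
    return urllib.request.Request(
        url=f"{base_url}/v1/realtime/calls",
        data=sdp_offer.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {bearer_token}",  # assumption: token from step 1
            "Content-Type": "application/sdp",
        },
        method="POST",
    )


token_request = build_client_secret_request(
    "http://localhost:4000", "sk-litellm-key", "gpt-4o-realtime"
)
offer_request = build_sdp_offer_request(
    "http://localhost:4000", "token-from-step-1", "v=0\r\n"
)
```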

Full code example

Day 0 Support: GPT-5.4

Thu, 05 Mar 2026 10:00:00 GMT

LiteLLM now fully supports GPT-5.4!

Docker Image

docker pull ghcr.io/berriai/litellm:v1.81.14-stable.gpt-5.4_patch

Usage

LiteLLM Proxy

1. Setup config.yaml

model_list:
  - model_name: gpt-5.4
    litellm_params:
      model: openai/gpt-5.4
      api_key: os.environ/OPENAI_API_KEY

2. Start the proxy

docker run -d \
  -p 4000:4000 \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -v $(pwd)/config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:v1.81.14-stable.gpt-5.4_patch \
  --config /app/config.yaml

3. Test it

curl -X POST "http://0.0.0.0:4000/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -d '{
    "model": "gpt-5.4",
    "messages": [
      {"role": "user", "content": "Write a Python function to check if a number is prime."}
    ]
  }'

LiteLLM SDK

from litellm import completion

response = completion(
    model="openai/gpt-5.4",
    messages=[
        {"role": "user", "content": "Write a Python function to check if a number is prime."}
    ],
)

print(response.choices[0].message.content)

Notes

Restart your container to get the cost tracking for this model.
Use /responses for better model performance.
GPT-5.4 supports reasoning, function calling, vision, and tool-use — see the OpenAI provider docs for advanced usage.

DAY 0 Support: Gemini 3.1 Flash Lite Preview on LiteLLM

Tue, 03 Mar 2026 08:00:00 GMT

LiteLLM now supports gemini-3.1-flash-lite-preview with full day 0 support!

Note: If you only need cost tracking, no change to your current LiteLLM version is required. To use the new features introduced with this model, such as thinking levels, use v1.80.8-stable.1 or later.

Deploy this version

Docker

docker run \
-e STORE_MODEL_IN_DB=True \
-p 4000:4000 \
ghcr.io/berriai/litellm:main-v1.80.8-stable.1

Pip

pip install litellm==v1.80.8-stable.1

What's New

Supports all four thinking levels:

MINIMAL: Ultra-fast responses with minimal reasoning
LOW: Simple instruction following
MEDIUM: Balanced reasoning for complex tasks
HIGH: Maximum reasoning depth (dynamic)

Quick Start

SDK

Basic Usage

from litellm import completion

response = completion(
    model="gemini/gemini-3.1-flash-lite-preview",
    messages=[{"role": "user", "content": "Extract key entities from this text: ..."}],
)

print(response.choices[0].message.content)

With Thinking Levels

from litellm import completion

# Use MEDIUM thinking for complex reasoning tasks
response = completion(
    model="gemini/gemini-3.1-flash-lite-preview",
    messages=[{"role": "user", "content": "Analyze this dataset and identify patterns"}],
    reasoning_effort="medium",  # one of: minimal, low, medium, high
)

print(response.choices[0].message.content)

PROXY

1. Setup config.yaml

model_list:
  - model_name: gemini-3.1-flash-lite
    litellm_params:
      model: gemini/gemini-3.1-flash-lite-preview
      api_key: os.environ/GEMINI_API_KEY
  
  # Or use Vertex AI
  - model_name: vertex-gemini-3.1-flash-lite
    litellm_params:
      model: vertex_ai/gemini-3.1-flash-lite-preview
      vertex_project: your-project-id
      vertex_location: us-central1

2. Start proxy

litellm --config /path/to/config.yaml

3. Make requests

curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer " \
  -d '{
    "model": "gemini-3.1-flash-lite",
    "messages": [{"role": "user", "content": "Extract structured data from this text"}],
    "reasoning_effort": "low"
  }'

Supported Endpoints

LiteLLM provides full end-to-end support for Gemini 3.1 Flash Lite Preview on:

✅ /v1/chat/completions - OpenAI-compatible chat completions endpoint
✅ /v1/responses - OpenAI Responses API endpoint (streaming and non-streaming)
✅ /v1/messages - Anthropic-compatible messages endpoint
✅ /v1/generateContent - Google Gemini API compatible endpoint

All endpoints support:

Streaming and non-streaming responses
Function calling with thought signatures
Multi-turn conversations
All Gemini 3-specific features (thinking levels, thought signatures)
Full multimodal support (text, image, audio, video)

`reasoning_effort` Mapping for Gemini 3.1

LiteLLM automatically maps OpenAI's reasoning_effort parameter to Gemini's thinkingLevel:

reasoning_effort	thinking_level	Use Case
`minimal`	`minimal`	Ultra-fast responses, simple queries
`low`	`low`	Basic instruction following
`medium`	`medium`	Balanced reasoning for moderate complexity
`high`	`high`	Maximum reasoning depth, complex problems
`disable`	`minimal`	Disable extended reasoning
`none`	`minimal`	No extended reasoning
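
The table can be read as a simple lookup. The sketch below copies the mapping into a dictionary for illustration; `to_thinking_level` is a hypothetical helper, and raising on unknown values is an assumption of this sketch, not documented LiteLLM behavior.

```python
# Mapping copied from the table above.
REASONING_EFFORT_TO_THINKING_LEVEL = {
    "minimal": "minimal",
    "low": "low",
    "medium": "medium",
    "high": "high",
    "disable": "minimal",
    "none": "minimal",
}


def to_thinking_level(reasoning_effort: str) -> str:
    """Translate an OpenAI-style reasoning_effort into a Gemini thinking level."""
    try:
        return REASONING_EFFORT_TO_THINKING_LEVEL[reasoning_effort]
    except KeyError:
        # Assumption: unknown values are rejected rather than silently defaulted.
        raise ValueError(f"Unsupported reasoning_effort: {reasoning_effort!r}") from None
```

Note that `disable` and `none` both resolve to `minimal`, so the table lists no value that turns thinking off entirely.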

Incident Report: Cache Eviction Closes In-Use httpx Clients

Fri, 27 Feb 2026 10:00:00 GMT

Date: February 27, 2026
Duration: ~6 days (Feb 21 merge -> Feb 27 fix)
Severity: High
Status: Resolved

Note: This fix is available in LiteLLM v1.81.14.rc.2 and later.

Summary

A change to improve Redis connection pool cleanup introduced a regression that closed httpx clients that the proxy was still actively using. The LLMClientCache (an in-memory TTL cache) stores both Redis clients and httpx clients under the same eviction policy. When a cache entry expired or was evicted, the new cleanup code called aclose()/close() on the evicted value. That worked correctly for Redis clients, but it destroyed httpx clients that other parts of the system still held references to and were actively using for LLM API calls.

Impact: Any proxy instance that hit the cache TTL (default 10 minutes) or capacity limit (200 entries) would have its httpx clients closed out from under it, causing requests to LLM providers to fail with connection errors.

Background

LLMClientCache extends InMemoryCache and is used to cache SDK clients (OpenAI, Anthropic, etc.) to avoid re-creating them on every request. These clients are keyed by configuration + event loop ID. The cache has:

Max size: 200 entries
Default TTL: 10 minutes

When the cache is full or entries expire, InMemoryCache.evict_cache() calls _remove_key() to drop entries.

The cached values are a mix of:

Redis/async Redis clients — owned exclusively by the cache, safe to close on eviction
httpx-backed SDK clients (OpenAI, Anthropic, etc.) — shared references, still in use by router/model instances
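
The difference between owned and shared values is what makes close-on-eviction dangerous. The following is a toy illustration of that hazard, not LiteLLM's code: `FakeClient` and `ClosingCache` are hypothetical stand-ins for an httpx-backed SDK client and a cache that closes whatever it evicts.

```python
class FakeClient:
    """Stand-in for an httpx-backed SDK client."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def send(self):
        if self.closed:
            raise RuntimeError("client has been closed")
        return "ok"


class ClosingCache:
    """Toy cache that closes values when they are evicted."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def evict(self, key):
        value = self._store.pop(key, None)
        if value is not None and hasattr(value, "close"):
            value.close()  # safe only if the cache is the sole owner


cache = ClosingCache()
shared_client = FakeClient()
cache.set("openai-client", shared_client)  # the cache holds one reference
router_client = shared_client              # the router holds another

cache.evict("openai-client")               # simulates TTL expiry or a full cache

try:
    router_client.send()
    outcome = "ok"
except RuntimeError as exc:
    outcome = str(exc)

print(outcome)
```

The router never asked for its client to be closed, yet its next call fails, because the cache and the router were pointing at the same object.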

Root Cause

PR #21717 overrode _remove_key() in LLMClientCache to close async clients on eviction:

Problematic code added in PR #21717

Day 0 Support: GPT-5.3-Codex

Tue, 24 Feb 2026 10:00:00 GMT

LiteLLM now supports GPT-5.3-Codex on Day 0, including support for the new assistant phase metadata on Responses API output items.

Why `phase` matters for GPT-5.3-Codex

phase appears on assistant output items and helps distinguish preamble/commentary turns from final closeout responses.

Reference: Phase parameter docs

Supported values:

null
"commentary"
"final_answer"

Important:

Persist assistant output items with phase exactly as returned.
Send those assistant items back on the next turn.
Do not add phase to user messages.

Docker Image

docker pull ghcr.io/berriai/litellm:v1.81.12-stable.gpt-5.3

Usage

LiteLLM Proxy

1. Setup config.yaml

model_list:
  - model_name: gpt-5.3-codex
    litellm_params:
      model: openai/gpt-5.3-codex

2. Start the proxy

docker run -d \
  -p 4000:4000 \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -v $(pwd)/config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:v1.81.12-stable.gpt-5.3 \
  --config /app/config.yaml

3. Test it

curl -X POST "http://0.0.0.0:4000/v1/responses" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -d '{
    "model": "gpt-5.3-codex",
    "input": "Write a Python script that checks if a number is prime."
  }'

Python Example: Persist `phase` with OpenAI Client + LiteLLM Base URL

from openai import OpenAI

client = OpenAI(
    base_url="http://0.0.0.0:4000/v1",  # LiteLLM Proxy
    api_key="your-litellm-api-key",
)

items = []  # Persist this per conversation/thread


def _item_get(item, key, default=None):
    if isinstance(item, dict):
        return item.get(key, default)
    return getattr(item, key, default)


def run_turn(user_text: str):
    global items

    # User message: no phase field
    items.append(
        {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": user_text}],
        }
    )

    resp = client.responses.create(
        model="gpt-5.3-codex",
        input=items,
    )

    # Persist assistant output items verbatim, including phase
    for out_item in (resp.output or []):
        items.append(out_item)

    # Optional: inspect latest phase for UI/telemetry routing
    latest_phase = None
    for out_item in reversed(resp.output or []):
        # Assistant output items in a non-streaming response have type "message"
        if _item_get(out_item, "type") == "message" and _item_get(out_item, "phase") is not None:
            latest_phase = _item_get(out_item, "phase")
            break

    return resp, latest_phase

Notes

Use /v1/responses for GPT Codex models.
Preserve full assistant output history for best multi-turn behavior.
If phase metadata is dropped during history reconstruction, output quality can degrade on long-running tasks.

Incident Report: Encrypted Content Failures in Multi-Region Responses API Load Balancing

Tue, 24 Feb 2026 10:00:00 GMT

Date: Feb 24, 2026
Duration: Ongoing (until fix deployed)
Severity: High (for users load balancing Responses API across different API keys)
Status: Resolved

Summary

When load balancing OpenAI's Responses API across deployments with different API keys (e.g., different Azure regions or OpenAI organizations), follow-up requests containing encrypted content items (like rs_... reasoning items) would fail with:

{
  "error": {
    "message": "The encrypted content for item rs_0d09d6e56879e76500699d6feee41c8197bd268aae76141f87 could not be verified. Reason: Encrypted content organization_id did not match the target organization.",
    "type": "invalid_request_error",
    "code": "invalid_encrypted_content"
  }
}

Encrypted content items are cryptographically tied to the API key's organization that created them. When the router load balanced a follow-up request to a deployment with a different API key, decryption failed.

Responses API calls with encrypted content: Complete failure when routed to wrong deployment
Initial requests: Unaffected — only follow-up requests containing encrypted items failed
Other API endpoints: No impact — chat completions, embeddings, etc. functioned normally

Background

OpenAI's Responses API can return encrypted "reasoning items" (with IDs like rs_...) that contain intermediate reasoning steps. These items are encrypted with the organization's key and can only be decrypted by the same organization's API key.

When load balancing across deployments with different API keys, the existing affinity mechanisms were insufficient:

responses_api_deployment_check: Requires previous_response_id which some clients (like Codex) don't provide
deployment_affinity: Too broad. It pins all requests from a user to one deployment, reducing that user's effective quota to 1/N, where N is the number of deployments
session_affinity: Requires explicit session IDs and still reduces quota

Root Cause

LiteLLM's router had no mechanism to track which deployment created specific encrypted content items and route follow-up requests accordingly. The router treated all deployments as interchangeable, leading to decryption failures when encrypted content crossed organizational boundaries.

The Problem Flow:

User calls router.aresponses() with model gpt-5.1-codex
Router load balances to Deployment A (Azure East US, API Key 1)
Response contains encrypted reasoning item rs_abc123 (encrypted with Org 1's key)
User makes follow-up request with rs_abc123 in the input
Router load balances to Deployment B (Azure West Europe, API Key 2)
Deployment B tries to decrypt rs_abc123 with Org 2's key → fails

Why Existing Solutions Didn't Work:

previous_response_id: Not provided by all clients (e.g., Codex)
deployment_affinity: Pins all user requests to one deployment → reduces quota to 1/N where N = number of deployments
session_affinity: Requires explicit session management and still reduces quota

Timeline:

Users configured multi-region Responses API load balancing with different API keys
Initial requests succeeded, but follow-up requests with encrypted content failed intermittently
Error rate correlated with number of deployments (more deployments = higher chance of routing to wrong one)
Investigation revealed encrypted content was organization-bound
Existing affinity mechanisms deemed unsuitable (quota reduction, missing previous_response_id)
New solution designed and implemented: encrypted_content_affinity
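
The correlation between error rate and deployment count noted in the timeline follows from simple arithmetic. As a rough sketch, assume the router picks uniformly at random among N deployments and that each deployment uses a different organization's key (real routing strategies weight deployments differently, so this is only an approximation). A follow-up request then lands on the wrong deployment with probability (N - 1) / N. The function name is hypothetical.

```python
from fractions import Fraction


def misroute_probability(num_deployments: int) -> Fraction:
    """Chance that a follow-up request is routed away from the originating deployment."""
    if num_deployments < 1:
        raise ValueError("num_deployments must be at least 1")
    return Fraction(num_deployments - 1, num_deployments)


for n in (1, 2, 3, 5):
    print(f"{n} deployment(s): {float(misroute_probability(n)):.0%} of follow-ups misrouted")
```

With a single deployment nothing can go wrong, while every added deployment pushes the expected failure rate closer to 100%.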

The Fix

Implemented a new encrypted_content_affinity pre-call check that intelligently tracks encrypted content and routes follow-up requests only when necessary.

Implementation

1. Encoding model_id into output items (responses/utils.py)

This is the same approach used for previous_response_id affinity, so no cache is needed. When a response contains output items with encrypted_content, LiteLLM encodes the originating deployment's model_id in two places for redundancy:

Into the item ID (if present): rs_abc123 → encitem_{base64("litellm:model_id:{model_id};item_id:rs_abc123")}
Into the encrypted_content itself: Wraps the content with litellm_enc:{base64("model_id:{model_id}")};{original_encrypted_content}

import base64

# Encoding item IDs (when present)
def _build_encrypted_item_id(model_id: str, item_id: str) -> str:
    assembled = f"litellm:model_id:{model_id};item_id:{item_id}"
    encoded = base64.b64encode(assembled.encode("utf-8")).decode("utf-8")
    return f"encitem_{encoded}"

# Wrapping encrypted_content (always, for redundancy)
def _wrap_encrypted_content_with_model_id(encrypted_content: str, model_id: str) -> str:
    metadata = f"model_id:{model_id}"
    encoded_metadata = base64.b64encode(metadata.encode("utf-8")).decode("utf-8")
    return f"litellm_enc:{encoded_metadata};{encrypted_content}"

Why wrap encrypted_content directly? Some clients (like Codex) don't consistently send item IDs in follow-up requests, but they always send the encrypted_content itself. By embedding model_id into the content, affinity works even when IDs are missing.

Streaming responses: The wrapping logic is applied to both:

Final response objects (non-streaming)
Individual streaming events (response.output_item.added, response.output_item.done)

This ensures clients receiving streaming responses get wrapped content they can send back.

Before forwarding to the upstream provider, LiteLLM restores the original item IDs and unwraps encrypted_content so the provider never sees the encoded form:

# In responses/main.py — before calling the handler
input = ResponsesAPIRequestUtils._restore_encrypted_content_item_ids_in_input(input)
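
The decoding side is not shown above. As a minimal sketch of the round trip, assuming only the two string formats quoted in step 1, the functions below encode and then recover the model_id and the original values. All function names here are hypothetical and the error handling is this sketch's own choice; they are not LiteLLM's internal helpers.

```python
import base64
from typing import Optional, Tuple

ITEM_PREFIX = "encitem_"
CONTENT_PREFIX = "litellm_enc:"


def build_item_id(model_id: str, item_id: str) -> str:
    assembled = f"litellm:model_id:{model_id};item_id:{item_id}"
    return ITEM_PREFIX + base64.b64encode(assembled.encode("utf-8")).decode("utf-8")


def decode_item_id(encoded_id: str) -> Optional[dict]:
    """Return {'model_id', 'item_id'} or None if the ID was not encoded."""
    if not encoded_id.startswith(ITEM_PREFIX):
        return None
    try:
        raw = base64.b64decode(encoded_id[len(ITEM_PREFIX):], validate=True)
        decoded = raw.decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return None
    head, sep, item_id = decoded.partition(";item_id:")
    marker = "litellm:model_id:"
    if not sep or not head.startswith(marker):
        return None
    return {"model_id": head[len(marker):], "item_id": item_id}


def wrap_content(encrypted_content: str, model_id: str) -> str:
    meta = base64.b64encode(f"model_id:{model_id}".encode("utf-8")).decode("utf-8")
    return f"{CONTENT_PREFIX}{meta};{encrypted_content}"


def unwrap_content(content: str) -> Tuple[Optional[str], str]:
    """Return (model_id, original_content); model_id is None if not wrapped."""
    if not content.startswith(CONTENT_PREFIX):
        return None, content
    # Base64 never contains ';', so the first ';' ends the metadata.
    encoded_meta, sep, original = content[len(CONTENT_PREFIX):].partition(";")
    if not sep:
        return None, content
    try:
        meta = base64.b64decode(encoded_meta, validate=True).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return None, content
    if not meta.startswith("model_id:"):
        return None, content
    return meta[len("model_id:"):], original
```

Unrecognized input is passed through unchanged, so items produced before the feature was enabled would still reach the provider as they were.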

2. EncryptedContentAffinityCheck — routing only (encrypted_content_affinity_check.py)

No async_log_success_event or cache lookups — the model_id is decoded directly from the item ID or encrypted_content:

class EncryptedContentAffinityCheck(CustomLogger):
    async def async_filter_deployments(self, model, healthy_deployments, ...):
        """Extract model_id from input items (ID or encrypted_content) and pin to that deployment."""
        for item in request_kwargs.get("input", []):
            # Try to extract model_id from two sources:
            model_id = self._extract_model_id_from_input(item)
            
            if model_id:
                deployment = self._find_deployment_by_model_id(
                    healthy_deployments, model_id
                )
                if deployment:
                    request_kwargs["_encrypted_content_affinity_pinned"] = True
                    return [deployment]
        return healthy_deployments
    
    def _extract_model_id_from_input(self, item: dict) -> Optional[str]:
        """Extract model_id from either encoded ID or wrapped encrypted_content."""
        # 1. Try decoding from item ID (if present)
        item_id = item.get("id", "")
        if item_id:
            decoded = ResponsesAPIRequestUtils._decode_encrypted_item_id(item_id)
            if decoded:
                return decoded["model_id"]
        
        # 2. Try unwrapping from encrypted_content (fallback for clients that omit IDs)
        encrypted_content = item.get("encrypted_content", "")
        if encrypted_content and encrypted_content.startswith("litellm_enc:"):
            model_id, _ = ResponsesAPIRequestUtils._unwrap_encrypted_content_with_model_id(
                encrypted_content
            )
            return model_id
        
        return None

3. Rate Limit Bypass (router.py)

When encrypted content requires a specific deployment, RPM/TPM limits are bypassed (the request would fail on any other deployment anyway):

# In async_get_available_deployment, after filtering healthy deployments:
if (
    request_kwargs.get("_encrypted_content_affinity_pinned")
    and len(healthy_deployments) == 1
):
    return healthy_deployments[0]  # Bypass routing strategy (RPM/TPM checks)

4. Configuration

router_settings:
  routing_strategy: usage-based-routing-v2
  enable_pre_call_checks: true
  optional_pre_call_checks:
    - encrypted_content_affinity
  deployment_affinity_ttl_seconds: 86400  # 24 hours

Key Benefits

✅ No quota reduction: Only pins requests containing encrypted items
✅ Bypasses rate limits: When encrypted content requires a specific deployment, RPM/TPM limits don't block it
✅ No previous_response_id required: Works by encoding model_id directly into the item ID
✅ No cache required: model_id is decoded on-the-fly from the item ID — no Redis, no TTL
✅ Globally safe: Can be enabled for all models; non-Responses-API calls are unaffected
✅ Surgical precision: Normal requests continue to load balance freely

Remediation

#	Action	Status	Code
1	Encode `model_id` into encrypted-content item IDs on response	✅ Done	`responses/utils.py`
2	Restore original item IDs before forwarding to upstream provider	✅ Done	`responses/main.py`
3	`EncryptedContentAffinityCheck`: decode item IDs to route (no cache)	✅ Done	`encrypted_content_affinity_check.py`
4	Add `encrypted_content_affinity` to `OptionalPreCallChecks` type	✅ Done	`types/router.py`
5	Implement rate limit bypass for affinity-pinned requests	✅ Done	`router.py`
6	Unit tests: encoding/decoding utilities, routing, RPM bypass	✅ Done	`test_encrypted_content_affinity_check.py`
7	Documentation: Responses API guide, load balancing guide, config reference	✅ Done	Docs
8	[Mar 3] Fix streaming events to wrap encrypted_content	✅ Done	`responses/streaming_iterator.py`

Follow-up Fix: Streaming Responses (Mar 3, 2026)

The Issue

After the initial fix was deployed, users reported that the invalid_encrypted_content error still occurred when using streaming responses with clients like Codex. Investigation revealed:

✅ Non-streaming responses: encrypted_content was correctly wrapped with litellm_enc: prefix
❌ Streaming responses: Individual response.output_item.added and response.output_item.done events contained raw, unwrapped encrypted_content

Since Codex and other clients consume responses as streams, they received unwrapped content in these events and sent it back in follow-up requests, causing the affinity check to fail.

The Root Cause

The _update_encrypted_content_item_ids_in_response function only modified the final response object, which is used for non-streaming responses. For streaming responses, individual chunks are processed by ResponsesAPIStreamingIterator._process_chunk, which was not applying the wrapping logic to streaming events.

The Fix

Modified litellm/litellm/responses/streaming_iterator.py to wrap encrypted_content in streaming events:

# In ResponsesAPIStreamingIterator._process_chunk
if (
    self.litellm_metadata
    and self.litellm_metadata.get("encrypted_content_affinity_enabled")
):
    event_type = getattr(openai_responses_api_chunk, "type", None)
    if event_type in (
        ResponsesAPIStreamEvents.OUTPUT_ITEM_ADDED,
        ResponsesAPIStreamEvents.OUTPUT_ITEM_DONE,
    ):
        item = getattr(openai_responses_api_chunk, "item", None)
        if item:
            encrypted_content = getattr(item, "encrypted_content", None)
            if encrypted_content and isinstance(encrypted_content, str):
                model_id = (
                    self.litellm_metadata.get("model_info", {}).get("id")
                    if self.litellm_metadata
                    else None
                )
                if model_id:
                    wrapped_content = ResponsesAPIRequestUtils._wrap_encrypted_content_with_model_id(
                        encrypted_content, model_id
                    )
                    setattr(item, "encrypted_content", wrapped_content)

This ensures that all encrypted_content sent to clients (streaming or non-streaming) is wrapped with model_id metadata, enabling consistent affinity routing.

Migration Guide

Before (Using `deployment_affinity`)

router_settings:
  optional_pre_call_checks:
    - deployment_affinity  # ❌ Pins each user to one deployment, cutting effective quota to 1/N

Problem: All requests from a user pin to one deployment, reducing effective quota to 1/N.

After (Using `encrypted_content_affinity`)

router_settings:
  optional_pre_call_checks:
    - encrypted_content_affinity  # ✅ Only pins requests with encrypted content

Benefit: Normal requests load balance freely; only requests containing encrypted content are pinned, and only when necessary.

Incident Report: Wildcard Blocking New Models After Cost Map Reload

Mon, 23 Feb 2026 10:00:00 GMT

Date: Feb 23, 2026
Duration: ~3 hours
Severity: High (for users with provider wildcard access rules)
Status: Resolved

Summary

When a new Anthropic model (e.g. claude-sonnet-4-6) was added to the LiteLLM model cost map and a cost map reload was triggered, requests to the new model were rejected with:

key not allowed to access model. This key can only access models=['anthropic/*']. Tried to access claude-sonnet-4-6.

The reload updated litellm.model_cost correctly but never re-ran add_known_models(), so litellm.anthropic_models (the in-memory set used by the wildcard resolver) remained stale. The new model was invisible to the anthropic/* wildcard even though the cost map knew about it.

LLM calls: All requests to newly-added Anthropic models were blocked with a 401.
Existing models: Unaffected — only models missing from the stale provider set were impacted.
Other providers: Same bug class existed for any provider wildcard (e.g. openai/*, gemini/*).

Background

LiteLLM supports provider-level wildcard access rules. When an admin configures a key or team with models=['anthropic/*'], any model whose provider resolves to anthropic should be allowed. The resolution happens in _model_custom_llm_provider_matches_wildcard_pattern.

litellm.anthropic_models is a Python set populated at import time by add_known_models(). It is the source get_llm_provider() consults to map a bare model name like claude-sonnet-4-6 to the provider string "anthropic".

Root Cause

add_known_models() is called once at module import time. Both reload paths in proxy_server.py updated litellm.model_cost with the fresh map but never called add_known_models() again:

# Before the fix — both reload paths looked like this:
new_model_cost_map = get_model_cost_map(url=model_cost_map_url)
litellm.model_cost = new_model_cost_map          # ✅ cost map updated
_invalidate_model_cost_lowercase_map()           # ✅ cache cleared
# ❌ add_known_models() never called
#    → litellm.anthropic_models still has the old set
#    → new model not in the set
#    → get_llm_provider() raises for the new model
#    → wildcard match returns False
#    → 401 for every request to the new model

The gap existed in two places:

_check_and_reload_model_cost_map — the periodic automatic reload (every 10 s)
The /reload/model_cost_map admin endpoint — the manual reload
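
The effect of the stale set can be reproduced in a few lines. This is a deliberately simplified toy model, not the proxy's code: `key_allows` is a hypothetical stand-in for the wildcard check, the cost map entries are invented for illustration, and only the anthropic provider is modeled.

```python
from typing import Dict, List, Optional, Set

model_cost: Dict[str, dict] = {"claude-old": {"litellm_provider": "anthropic"}}
anthropic_models: Set[str] = set()


def add_known_models(model_cost_map: Optional[Dict[str, dict]] = None) -> None:
    """Populate the provider set from a cost map (toy version)."""
    source = model_cost_map if model_cost_map is not None else model_cost
    for name, info in source.items():
        if info.get("litellm_provider") == "anthropic":
            anthropic_models.add(name)


def key_allows(model: str, allowed_patterns: List[str]) -> bool:
    """Toy wildcard check: 'anthropic/*' matches models in the provider set."""
    return "anthropic/*" in allowed_patterns and model in anthropic_models


add_known_models()  # runs once, as at import time

# Reload the cost map WITHOUT repopulating the provider set (the bug).
new_map = {**model_cost, "claude-sonnet-4-6": {"litellm_provider": "anthropic"}}
model_cost = new_map
allowed_before_fix = key_allows("claude-sonnet-4-6", ["anthropic/*"])

# Reload WITH repopulation from the freshly fetched map (the fix).
add_known_models(model_cost_map=new_map)
allowed_after_fix = key_allows("claude-sonnet-4-6", ["anthropic/*"])

print(allowed_before_fix, allowed_after_fix)
```

The cost map knows the new model in both cases; only the derived set decides whether the wildcard matches.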

Timeline:

New model (claude-sonnet-4-6) added to model_prices_and_context_window.json
Admin triggers cost map reload via UI → litellm.model_cost updated
Users with anthropic/* wildcard keys attempt requests to claude-sonnet-4-6
get_llm_provider('claude-sonnet-4-6') raises → wildcard returns False → 401
Admin reloads cost map again — same result (root cause not addressed)
~3 hours of investigation → root cause identified → fix deployed

The Fix

After each reload, add_known_models() is called with the freshly fetched map passed explicitly. Passing the map directly (rather than relying on the module-level reference) removes any ambiguity about which dict is iterated:

# After the fix — both reload paths now do:
new_model_cost_map = get_model_cost_map(url=model_cost_map_url)
litellm.model_cost = new_model_cost_map
_invalidate_model_cost_lowercase_map()
litellm.add_known_models(model_cost_map=new_model_cost_map)  # ✅ sets repopulated

add_known_models() was also updated to accept an optional explicit map so callers cannot accidentally iterate a stale module-level reference:

# Before
def add_known_models():
    for key, value in model_cost.items():   # reads module global — ambiguous after reload
        ...

# After
def add_known_models(model_cost_map: Optional[Dict] = None):
    _map = model_cost_map if model_cost_map is not None else model_cost
    for key, value in _map.items():         # always iterates the map you just fetched
        ...

After the fix, the provider sets (anthropic_models, open_ai_chat_completion_models, etc.) are always consistent with litellm.model_cost immediately after every reload. New models become accessible via wildcard rules without any proxy restart.

Remediation

#	Action	Status	Code
1	Call `add_known_models(model_cost_map=...)` in the periodic reload path	✅ Done	`proxy_server.py#L4393`
2	Call `add_known_models(model_cost_map=...)` in the `/reload/model_cost_map` endpoint	✅ Done	`proxy_server.py#L11904`
3	Update `add_known_models()` to accept an explicit map parameter	✅ Done	`__init__.py#L617`
4	Regression test: `add_known_models(model_cost_map=...)` populates provider sets	✅ Done	`test_auth_checks.py`
5	Regression test: `anthropic/*` wildcard grants/denies access correctly after reload	✅ Done	`test_auth_checks.py`

Incident Report: SERVER_ROOT_PATH regression broke UI routing

Sat, 21 Feb 2026 10:00:00 GMT

Date: January 22, 2026
Duration: ~4 days (until fix merged January 26, 2026)
Severity: High
Status: Resolved

Note: This fix is available in LiteLLM v1.81.3.rc.6 and later.

Summary

A PR (#19467) accidentally removed the root_path=server_root_path parameter from the FastAPI app initialization in proxy_server.py. This caused the proxy to ignore the SERVER_ROOT_PATH environment variable when serving the UI. Users who deploy LiteLLM behind a reverse proxy with a path prefix (e.g., /api/v1 or /llmproxy) found that all UI pages returned 404 Not Found.

LLM API calls: No impact. API routing was unaffected.
UI pages: All UI pages returned 404 for deployments using SERVER_ROOT_PATH.
Swagger/OpenAPI docs: Broken when accessed through the configured root path.

Background

Many LiteLLM deployments run behind a reverse proxy (e.g., Nginx, Traefik, AWS ALB) that routes traffic to LiteLLM under a path prefix. FastAPI's root_path parameter tells the application about this prefix so it can correctly serve static files, generate URLs, and handle routing.

The root_path parameter was present in proxy_server.py since early versions of LiteLLM. It was removed as a side effect of PR #19467, which was intended to fix a different UI 404 issue.

Root cause

PR #19467 (73d49f8) removed the root_path=server_root_path line from the FastAPI() constructor in proxy_server.py:

 app = FastAPI(
     docs_url=_get_docs_url(),
     redoc_url=_get_redoc_url(),
     title=_title,
     description=_description,
     version=version,
-    root_path=server_root_path,
     lifespan=proxy_startup_event,
 )

Without root_path, FastAPI treated all requests as if the application was mounted at /, causing path mismatches for any deployment using SERVER_ROOT_PATH.

The regression went undetected because:

No automated test verified that root_path was set on the FastAPI app.
No manual test procedure existed for SERVER_ROOT_PATH functionality.
Default deployments (without SERVER_ROOT_PATH) were unaffected, so most CI tests passed.

Remediation

#	Action	Status	Code
1	Restore `root_path=server_root_path` in FastAPI app initialization	✅ Done	`#19790` (`5426b3c`)
2	Add unit tests for `get_server_root_path()` and FastAPI app initialization	✅ Done	`test_server_root_path.py`
3	Add CI workflow that builds Docker image and tests UI routing with `SERVER_ROOT_PATH` on every PR	✅ Done	`test_server_root_path.yml`
4	Document manual test procedure for `SERVER_ROOT_PATH`	✅ Done	Discussion #8495

CI workflow details

The new test_server_root_path.yml workflow runs on every PR against main. It:

Builds the LiteLLM Docker image
Starts a container with SERVER_ROOT_PATH set (tests both /api/v1 and /llmproxy)
Verifies the UI returns valid HTML at {ROOT_PATH}/ui/
Fails the workflow if the UI is unreachable

This prevents future regressions where changes to proxy_server.py accidentally break SERVER_ROOT_PATH support.
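The URL construction and HTML check in the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of test_server_root_path.yml: the helper names ui_url and looks_like_html are hypothetical, and the HTML check is a deliberately crude heuristic.

```python
def ui_url(base_url: str, root_path: str) -> str:
    """Build the UI URL for a proxy served under root_path (hypothetical helper)."""
    cleaned = root_path.strip("/")
    prefix = f"/{cleaned}" if cleaned else ""
    return base_url.rstrip("/") + prefix + "/ui/"


def looks_like_html(status: int, body: str) -> bool:
    """Crude check: HTTP 200 and a body that starts like an HTML document."""
    head = body.lstrip().lower()
    return status == 200 and head.startswith(("<!doctype html", "<html"))


# The two prefixes the workflow is described as testing.
for root in ("/api/v1", "/llmproxy"):
    print(ui_url("http://localhost:4000", root))
```

A real check would fetch each URL (for example with curl or urllib) and pass the status code and body to looks_like_html, failing the job when it returns False.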

Timeline

Time (UTC)	Event
Jan 22, 2026 04:20	PR #19467 merged, removing `root_path=server_root_path`
Jan 22–26	Users on nightly builds report UI 404 errors when using `SERVER_ROOT_PATH`
Jan 26, 2026 17:48	Fix PR #19790 merged, restoring `root_path=server_root_path`
Feb 18, 2026	CI workflow `test_server_root_path.yml` added to run on every PR

Resolution steps for users

For users still experiencing issues, update to the latest LiteLLM version:

pip install --upgrade litellm

Verify your SERVER_ROOT_PATH is correctly set:

# In your environment or docker-compose.yml
SERVER_ROOT_PATH="/your-prefix"

Then confirm the UI is accessible at http://your-host:4000/your-prefix/ui/.

DAY 0 Support: Gemini 3.1 Pro on LiteLLM

Thu, 19 Feb 2026 10:00:00 GMT

LiteLLM now supports gemini-3.1-pro-preview and all the new API changes along with it.

Deploy this version

Docker

docker run \
-e STORE_MODEL_IN_DB=True \
-p 4000:4000 \
ghcr.io/berriai/litellm:main-v1.81.9-stable.gemini.3.1-pro

Pip

pip install litellm==v1.81.9-stable.gemini.3.1-pro

What's New

1. New Thinking Levels: `thinkingLevel` with MINIMAL & MEDIUM

Gemini 3.1 Pro introduces support for the medium thinking level.

LiteLLM automatically maps the OpenAI reasoning_effort parameter to Gemini's thinkingLevel, so you can use familiar reasoning_effort values (minimal, low, medium, high) without changing your code!

Supported Endpoints

LiteLLM provides full end-to-end support for Gemini 3.1 Pro on:

✅ /v1/chat/completions - OpenAI-compatible chat completions endpoint
✅ /v1/responses - OpenAI Responses API endpoint (streaming and non-streaming)
✅ /v1/messages - Anthropic-compatible messages endpoint
✅ /v1/generateContent – Google Gemini API compatible endpoint

All endpoints support:

Streaming and non-streaming responses
Function calling with thought signatures
Multi-turn conversations
All Gemini 3-specific features
Conversion of provider-specific thinking-related parameters to thinkingLevel

Quick Start

SDK
PROXY

Basic Usage with MEDIUM thinking (NEW)

from litellm import completion

# No need to make any changes to your code as we map openai reasoning param to thinkingLevel
response = completion(
    model="gemini/gemini-3.1-pro-preview",
    messages=[{"role": "user", "content": "Solve this complex math problem: 25 * 4 + 10"}],
    reasoning_effort="medium",  # NEW: MEDIUM thinking level
)

print(response.choices[0].message.content)

1. Setup config.yaml

model_list:
  - model_name: gemini-3.1-pro-preview
    litellm_params:
      model: gemini/gemini-3.1-pro-preview
      api_key: os.environ/GEMINI_API_KEY
  - model_name: vertex-gemini-3.1-pro-preview
    litellm_params:
      model: vertex_ai/gemini-3.1-pro-preview

2. Start proxy

litellm --config /path/to/config.yaml

3. Call with MEDIUM thinking

curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -d '{
    "model": "gemini-3.1-pro-preview",
    "messages": [{"role": "user", "content": "Complex reasoning task"}],
    "reasoning_effort": "medium"
  }'

`reasoning_effort` Mapping for Gemini 3+

reasoning_effort	thinking_level
`minimal`	`minimal`
`low`	`low`
`medium`	`medium`
`high`	`high`
`disable`	`minimal`
`none`	`minimal`
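For reference, the table above can be written as a plain lookup. This is an illustrative copy of the table, not LiteLLM's internal mapping code; the function name to_thinking_level and the choice to raise ValueError on unknown values are assumptions made for the sketch.

```python
# Illustrative copy of the mapping table; not LiteLLM's internal code.
REASONING_EFFORT_TO_THINKING_LEVEL = {
    "minimal": "minimal",
    "low": "low",
    "medium": "medium",
    "high": "high",
    "disable": "minimal",
    "none": "minimal",
}


def to_thinking_level(reasoning_effort: str) -> str:
    """Look up the Gemini thinking level for a reasoning_effort value."""
    try:
        return REASONING_EFFORT_TO_THINKING_LEVEL[reasoning_effort]
    except KeyError:
        # Assumption for this sketch: unknown values are rejected.
        raise ValueError(f"unsupported reasoning_effort: {reasoning_effort!r}") from None


print(to_thinking_level("medium"))   # medium
print(to_thinking_level("disable"))  # minimal
```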

Incident Report: vLLM Embeddings Broken by encoding_format Parameter

Wed, 18 Feb 2026 10:00:00 GMT

Date: Feb 16, 2026 Duration: ~3 hours Severity: High (for vLLM embedding users) Status: Resolved

Summary

A commit (dbcae4a) intended to fix OpenAI SDK behavior broke vLLM embeddings by explicitly passing encoding_format=None in API requests. vLLM rejects this with the error "unknown variant ``, expected float or base64".

vLLM embedding calls: Complete failure - all requests rejected
Other providers: No impact - OpenAI and other providers functioned normally
Other vLLM functionality: No impact - only embeddings were affected

Background

The encoding_format parameter for embeddings specifies whether vectors should be returned as float arrays or base64 encoded strings. Different providers have different expectations:

OpenAI SDK: If encoding_format is omitted, the SDK adds a default value of "float"
vLLM: Strictly validates encoding_format - only accepts "float", "base64", or complete omission. Rejects None or empty string values.

Root cause

A well-intentioned fix for OpenAI SDK behavior inadvertently broke vLLM embeddings:

The Breaking Change (dbcae4a):

In litellm/main.py, the code was changed to explicitly set encoding_format=None instead of omitting it:

# Added in dbcae4a
if encoding_format is not None:
    optional_params["encoding_format"] = encoding_format
else:
    # Omitting causes openai sdk to add default value of "float"
    optional_params["encoding_format"] = None

This fix worked correctly for OpenAI - explicitly passing None prevented the SDK from adding its default value. However, vLLM's strict parameter validation rejected None values, causing all embedding requests to fail.

The Fix

Fix deployed (55348dd). The solution filters out None and empty string values from optional_params before sending requests to OpenAI-like providers (including vLLM).

In litellm/llms/openai_like/embedding/handler.py:

# Before (broken)
data = {"model": model, "input": input, **optional_params}

# After (fixed)
filtered_optional_params = {k: v for k, v in optional_params.items() if v not in (None, '')}
data = {"model": model, "input": input, **filtered_optional_params}

This ensures:

Valid values ("float", "base64") are preserved and sent
None and empty string values are filtered out (parameter omitted entirely)
OpenAI SDK no longer adds defaults because LiteLLM handles the parameter upstream
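To see the filter's effect on each kind of value, here is a small standalone sketch. The function name build_embedding_payload and the model name are hypothetical; only the dictionary comprehension mirrors the fix shown above.

```python
def build_embedding_payload(model, input, optional_params):
    """Drop None and '' so the parameter is omitted rather than sent empty."""
    filtered = {k: v for k, v in optional_params.items() if v not in (None, "")}
    return {"model": model, "input": input, **filtered}


for value in (None, "", "float", "base64"):
    payload = build_embedding_payload(
        "my-embedding-model", ["hello"], {"encoding_format": value}
    )
    # True only when the parameter is actually sent to the provider.
    print(repr(value), "->", "encoding_format" in payload)
```

Note that the membership test only removes None and the empty string; other falsy values such as 0 or False are kept.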

Remediation

#	Action	Status	Code
1	Filter `None` and empty string values in OpenAI-like embedding handler	✅ Done	`handler.py#L108`
2	Unit tests for parameter filtering (None, empty string, valid values)	✅ Done	`test_openai_like_embedding.py`
3	Transformation tests for hosted_vllm embedding config	✅ Done	`test_hosted_vllm_embedding_transformation.py`
4	E2E tests with actual vLLM endpoint	✅ Done	`test_hosted_vllm_embedding_e2e.py`
5	Validate JSON payload structure matches vLLM expectations	✅ Done	Tests verify exact JSON sent to endpoint

Day 0 Support: Claude Sonnet 4.6

Tue, 17 Feb 2026 10:00:00 GMT

LiteLLM now supports Claude Sonnet 4.6 on Day 0. Use it across Anthropic, Azure, Vertex AI, and Bedrock through the LiteLLM AI Gateway.

Docker Image

docker pull ghcr.io/berriai/litellm:v1.81.3-stable.sonnet-4-6

Usage - Anthropic

LiteLLM Proxy
LiteLLM SDK

1. Setup config.yaml

model_list:
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

2. Start the proxy

docker run -d \
  -p 4000:4000 \
  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  -v $(pwd)/config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:v1.81.3-stable.sonnet-4-6 \
  --config /app/config.yaml

3. Test it!

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $LITELLM_KEY" \
--data '{
  "model": "claude-sonnet-4-6",
  "messages": [
    {
      "role": "user",
      "content": "what llm are you"
    }
  ]
}'

from litellm import completion

response = completion(
    model="anthropic/claude-sonnet-4-6",
    messages=[{"role": "user", "content": "what llm are you"}]
)
print(response.choices[0].message.content)

Usage - Azure

LiteLLM Proxy
LiteLLM SDK

1. Setup config.yaml

model_list:
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: azure_ai/claude-sonnet-4-6
      api_key: os.environ/AZURE_AI_API_KEY
      api_base: os.environ/AZURE_AI_API_BASE  # https://.services.ai.azure.com

2. Start the proxy

docker run -d \
  -p 4000:4000 \
  -e AZURE_AI_API_KEY=$AZURE_AI_API_KEY \
  -e AZURE_AI_API_BASE=$AZURE_AI_API_BASE \
  -v $(pwd)/config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:v1.81.3-stable.sonnet-4-6 \
  --config /app/config.yaml

3. Test it!

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $LITELLM_KEY" \
--data '{
  "model": "claude-sonnet-4-6",
  "messages": [
    {
      "role": "user",
      "content": "what llm are you"
    }
  ]
}'

from litellm import completion

response = completion(
    model="azure_ai/claude-sonnet-4-6",
    api_key="your-azure-api-key",
    api_base="https://.services.ai.azure.com",
    messages=[{"role": "user", "content": "what llm are you"}]
)
print(response.choices[0].message.content)

Usage - Vertex AI

LiteLLM Proxy
LiteLLM SDK

1. Setup config.yaml

model_list:
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: vertex_ai/claude-sonnet-4-6
      vertex_project: os.environ/VERTEX_PROJECT
      vertex_location: us-east5

2. Start the proxy

docker run -d \
  -p 4000:4000 \
  -e VERTEX_PROJECT=$VERTEX_PROJECT \
  -e GOOGLE_APPLICATION_CREDENTIALS=/app/credentials.json \
  -v $(pwd)/config.yaml:/app/config.yaml \
  -v $(pwd)/credentials.json:/app/credentials.json \
  ghcr.io/berriai/litellm:v1.81.3-stable.sonnet-4-6 \
  --config /app/config.yaml

3. Test it!

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $LITELLM_KEY" \
--data '{
  "model": "claude-sonnet-4-6",
  "messages": [
    {
      "role": "user",
      "content": "what llm are you"
    }
  ]
}'

from litellm import completion

response = completion(
    model="vertex_ai/claude-sonnet-4-6",
    vertex_project="your-project-id",
    vertex_location="us-east5",
    messages=[{"role": "user", "content": "what llm are you"}]
)
print(response.choices[0].message.content)

Usage - Bedrock

LiteLLM Proxy
LiteLLM SDK

1. Setup config.yaml

model_list:
  - model_name: claude-sonnet-4-6
    litellm_params:
      model: bedrock/anthropic.claude-sonnet-4-6-v1
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

2. Start the proxy

docker run -d \
  -p 4000:4000 \
  -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  -v $(pwd)/config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:v1.81.3-stable.sonnet-4-6 \
  --config /app/config.yaml

3. Test it!

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $LITELLM_KEY" \
--data '{
  "model": "claude-sonnet-4-6",
  "messages": [
    {
      "role": "user",
      "content": "what llm are you"
    }
  ]
}'

from litellm import completion

response = completion(
    model="bedrock/anthropic.claude-sonnet-4-6-v1",
    aws_access_key_id="your-access-key",
    aws_secret_access_key="your-secret-key",
    aws_region_name="us-east-1",
    messages=[{"role": "user", "content": "what llm are you"}]
)
print(response.choices[0].message.content)

Incident Report: Invalid beta headers with Claude Code

Mon, 16 Feb 2026 10:00:00 GMT

Date: February 13, 2026 Duration: ~3 hours Severity: High Status: Resolved

Note: This fix will be available starting from v1.81.13-nightly or higher of LiteLLM.

Summary

Claude Code began sending unsupported Anthropic beta headers to non-Anthropic providers (Bedrock, Azure AI, Vertex AI), causing invalid beta flag errors. LiteLLM was forwarding all beta headers without provider-specific validation. Users experienced request failures when routing Claude Code requests through LiteLLM to these providers.

LLM calls to Anthropic: No impact.
LLM calls to Bedrock/Azure/Vertex: Failed with invalid beta flag errors when unsupported headers were present.
Cost tracking and routing: No impact.

Background

Anthropic uses beta headers to enable experimental features in Claude. When Claude Code makes API requests, it includes headers like anthropic-beta: prompt-caching-scope-2026-01-05,advanced-tool-use-2025-11-20. However, not all providers support all Anthropic beta features.

Before this incident, LiteLLM forwarded all beta headers to all providers without validation.

Requests succeeded for Anthropic (native support) but failed for other providers when Claude Code sent headers those providers didn't support.

Root cause

LiteLLM lacked provider-specific beta header validation. When Claude Code introduced new beta features or sent headers that specific providers didn't support, those headers were blindly forwarded, causing provider API errors.

Remediation

#	Action	Status	Code
1	Create `anthropic_beta_headers_config.json` with provider-specific mappings	✅ Done	`anthropic_beta_headers_config.json`
2	Implement strict validation: headers must be explicitly mapped to be forwarded	✅ Done	`litellm_logging.py`
3	Add `/reload/anthropic_beta_headers` endpoint for dynamic config updates	✅ Done	Proxy management endpoints
4	Add `/schedule/anthropic_beta_headers_reload` for automatic periodic updates	✅ Done	Proxy management endpoints
5	Support `LITELLM_ANTHROPIC_BETA_HEADERS_URL` for custom config sources	✅ Done	Environment configuration
6	Support `LITELLM_LOCAL_ANTHROPIC_BETA_HEADERS` for air-gapped deployments	✅ Done	Environment configuration

LiteLLM now validates and transforms headers per provider.

Dynamic configuration updates

A key improvement is zero-downtime configuration updates. When Anthropic releases new beta features, users can update their configuration without restarting:

# Manually trigger reload (no restart needed)
curl -X POST "https://your-proxy-url/reload/anthropic_beta_headers" \
  -H "Authorization: Bearer YOUR_ADMIN_TOKEN"

# Or schedule automatic reloads every 24 hours
curl -X POST "https://your-proxy-url/schedule/anthropic_beta_headers_reload?hours=24" \
  -H "Authorization: Bearer YOUR_ADMIN_TOKEN"

This prevents future incidents where Claude Code introduces new headers before LiteLLM configuration is updated.

Configuration format

The anthropic_beta_headers_config.json file maps input headers to provider-specific output headers:

{
  "description": "Mapping of Anthropic beta headers for each provider.",
  "anthropic": {
    "advanced-tool-use-2025-11-20": "advanced-tool-use-2025-11-20",
    "computer-use-2025-01-24": "computer-use-2025-01-24"
  },
  "bedrock_converse": {
    "advanced-tool-use-2025-11-20": null,
    "computer-use-2025-01-24": "computer-use-2025-01-24"
  },
  "azure_ai": {
    "advanced-tool-use-2025-11-20": "advanced-tool-use-2025-11-20",
    "computer-use-2025-01-24": "computer-use-2025-01-24"
  }
}

Validation rules:

Headers must exist in the mapping for the target provider
Headers with null values are filtered out (unsupported)
Header names can be transformed per-provider (e.g., Bedrock uses different names for some features)
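The three rules above can be sketched as a small filter. This is an illustration under stated assumptions, not LiteLLM's implementation: the function name filter_beta_headers is hypothetical, and the config is a trimmed copy of the example file.

```python
import json

# Trimmed copy of the example config shown earlier.
CONFIG = json.loads("""
{
  "anthropic": {
    "advanced-tool-use-2025-11-20": "advanced-tool-use-2025-11-20",
    "computer-use-2025-01-24": "computer-use-2025-01-24"
  },
  "bedrock_converse": {
    "advanced-tool-use-2025-11-20": null,
    "computer-use-2025-01-24": "computer-use-2025-01-24"
  }
}
""")


def filter_beta_headers(header_value: str, provider: str, config: dict) -> str:
    """Return the anthropic-beta value to forward for a provider (hypothetical helper)."""
    mapping = config.get(provider, {})
    forwarded = []
    for name in (part.strip() for part in header_value.split(",")):
        mapped = mapping.get(name)  # None covers both "unmapped" and explicit null
        if mapped is not None:
            forwarded.append(mapped)  # mapped name may differ from the input name
    return ",".join(forwarded)


incoming = "prompt-caching-scope-2026-01-05,advanced-tool-use-2025-11-20,computer-use-2025-01-24"
print(filter_beta_headers(incoming, "anthropic", CONFIG))
print(filter_beta_headers(incoming, "bedrock_converse", CONFIG))
```

In this sketch prompt-caching-scope-2026-01-05 is dropped for both providers because it is absent from the trimmed mapping, and advanced-tool-use-2025-11-20 is dropped for bedrock_converse because its value is null.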

Resolution steps for users

If you are still experiencing issues and are running a version older than v1.81.11-nightly, update to the latest LiteLLM version:

pip install --upgrade litellm

Or manually reload the configuration without restarting:

curl -X POST "https://your-proxy-url/reload/anthropic_beta_headers" \
  -H "Authorization: Bearer YOUR_ADMIN_TOKEN"

Managing Anthropic Beta Headers - Complete configuration guide
anthropic_beta_headers_config.json - Current configuration file

Day 0 Support: MiniMax-M2.5

Thu, 12 Feb 2026 10:00:00 GMT

LiteLLM now supports MiniMax-M2.5 on Day 0. Use it across OpenAI-compatible and Anthropic-compatible APIs through the LiteLLM AI Gateway.

Supported Models

LiteLLM supports the following MiniMax models:

Model	Description	Input Cost	Output Cost	Context Window
MiniMax-M2.5	Advanced reasoning, Agentic capabilities	$0.3/M tokens	$1.2/M tokens	1M tokens
MiniMax-M2.5-lightning	Faster and More Agile (~100 tps)	$0.3/M tokens	$2.4/M tokens	1M tokens

Features Supported

Prompt Caching: Reduce costs with cached prompts ($0.03/M tokens for cache read, $0.375/M tokens for cache write)
Function Calling: Built-in tool calling support
Reasoning: Advanced reasoning capabilities with thinking support
System Messages: Full system message support
Cost Tracking: Automatic cost calculation for all requests

Docker Image

docker pull ghcr.io/berriai/litellm:v1.81.3-stable

Usage - OpenAI Compatible API (/v1/chat/completions)

LiteLLM Proxy

1. Setup config.yaml

model_list:
  - model_name: minimax-m2-5
    litellm_params:
      model: minimax/MiniMax-M2.5
      api_key: os.environ/MINIMAX_API_KEY
      api_base: https://api.minimax.io/v1

2. Start the proxy

docker run -d \
  -p 4000:4000 \
  -e MINIMAX_API_KEY=$MINIMAX_API_KEY \
  -v $(pwd)/config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:v1.81.3-stable \
  --config /app/config.yaml

3. Test it!

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $LITELLM_KEY" \
--data '{
  "model": "minimax-m2-5",
  "messages": [
    {
      "role": "user",
      "content": "what llm are you"
    }
  ]
}'

With Reasoning Split

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $LITELLM_KEY" \
--data '{
  "model": "minimax-m2-5",
  "messages": [
    {
      "role": "user",
      "content": "Solve: 2+2=?"
    }
  ],
  "extra_body": {
    "reasoning_split": true
  }
}'

Usage - Anthropic Compatible API (/v1/messages)

LiteLLM Proxy

1. Setup config.yaml

model_list:
  - model_name: minimax-m2-5
    litellm_params:
      model: minimax/MiniMax-M2.5
      api_key: os.environ/MINIMAX_API_KEY
      api_base: https://api.minimax.io/anthropic/v1/messages

2. Start the proxy

docker run -d \
  -p 4000:4000 \
  -e MINIMAX_API_KEY=$MINIMAX_API_KEY \
  -v $(pwd)/config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:v1.81.3-stable \
  --config /app/config.yaml

3. Test it!

curl --location 'http://0.0.0.0:4000/v1/messages' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $LITELLM_KEY" \
--data '{
  "model": "minimax-m2-5",
  "max_tokens": 1000,
  "messages": [
    {
      "role": "user",
      "content": "what llm are you"
    }
  ]
}'

With Thinking

curl --location 'http://0.0.0.0:4000/v1/messages' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $LITELLM_KEY" \
--data '{
  "model": "minimax-m2-5",
  "max_tokens": 1000,
  "thinking": {
    "type": "enabled",
    "budget_tokens": 1000
  },
  "messages": [
    {
      "role": "user",
      "content": "Solve: 2+2=?"
    }
  ]
}'

Usage - LiteLLM SDK

OpenAI-compatible API

import litellm

response = litellm.completion(
    model="minimax/MiniMax-M2.5",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
    api_key="your-minimax-api-key",
    api_base="https://api.minimax.io/v1"
)

print(response.choices[0].message.content)

Anthropic-compatible API

import asyncio
import litellm


async def main():
    # acreate is a coroutine, so it must be awaited inside an event loop
    response = await litellm.anthropic.messages.acreate(
        model="minimax/MiniMax-M2.5",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        api_key="your-minimax-api-key",
        api_base="https://api.minimax.io/anthropic/v1/messages",
        max_tokens=1000
    )
    print(response.choices[0].message.content)


asyncio.run(main())

With Thinking

import asyncio
import litellm


async def main():
    # acreate is a coroutine, so it must be awaited inside an event loop
    response = await litellm.anthropic.messages.acreate(
        model="minimax/MiniMax-M2.5",
        messages=[{"role": "user", "content": "Solve: 2+2=?"}],
        thinking={"type": "enabled", "budget_tokens": 1000},
        api_key="your-minimax-api-key"
    )

    # Access thinking content
    for block in response.choices[0].message.content:
        if hasattr(block, 'type') and block.type == 'thinking':
            print(f"Thinking: {block.thinking}")


asyncio.run(main())

With Reasoning Split (OpenAI API)

response = litellm.completion(
    model="minimax/MiniMax-M2.5",
    messages=[
        {"role": "user", "content": "Solve: 2+2=?"}
    ],
    extra_body={"reasoning_split": True},
    api_key="your-minimax-api-key",
    api_base="https://api.minimax.io/v1"
)

# Access thinking and response
if hasattr(response.choices[0].message, 'reasoning_details'):
    print(f"Thinking: {response.choices[0].message.reasoning_details}")
print(f"Response: {response.choices[0].message.content}")

Cost Tracking

LiteLLM automatically tracks costs for MiniMax-M2.5 requests. The pricing is:

Input: $0.3 per 1M tokens
Output: $1.2 per 1M tokens
Cache Read: $0.03 per 1M tokens
Cache Write: $0.375 per 1M tokens
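As a worked example of these rates, the arithmetic can be written out directly. This is a back-of-the-envelope sketch, not LiteLLM's cost calculator: the function name estimate_cost is hypothetical, and it assumes each token category is counted and billed separately.

```python
# Prices in USD per 1M tokens, copied from the list above.
PRICES_PER_MILLION = {
    "input": 0.3,
    "output": 1.2,
    "cache_read": 0.03,
    "cache_write": 0.375,
}


def estimate_cost(input_tokens=0, output_tokens=0, cache_read_tokens=0, cache_write_tokens=0):
    """Estimate request cost in USD; assumes each category is billed separately."""
    usage = {
        "input": input_tokens,
        "output": output_tokens,
        "cache_read": cache_read_tokens,
        "cache_write": cache_write_tokens,
    }
    return sum(usage[k] * PRICES_PER_MILLION[k] for k in usage) / 1_000_000


# 10,000 input tokens and 2,000 output tokens:
# (10,000 * 0.3 + 2,000 * 1.2) / 1,000,000 = 0.0054
print(f"${estimate_cost(input_tokens=10_000, output_tokens=2_000):.4f}")  # $0.0054
```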

Accessing Cost Information

response = litellm.completion(
    model="minimax/MiniMax-M2.5",
    messages=[{"role": "user", "content": "Hello!"}],
    api_key="your-minimax-api-key"
)

# Access cost information
print(f"Cost: ${response._hidden_params.get('response_cost', 0)}")

Streaming Support

OpenAI API

response = litellm.completion(
    model="minimax/MiniMax-M2.5",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
    api_key="your-minimax-api-key",
    api_base="https://api.minimax.io/v1"
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Streaming with Reasoning Split

stream = litellm.completion(
    model="minimax/MiniMax-M2.5",
    messages=[
        {"role": "user", "content": "Tell me a story"},
    ],
    extra_body={"reasoning_split": True},
    stream=True,
    api_key="your-minimax-api-key",
    api_base="https://api.minimax.io/v1"
)

reasoning_buffer = ""
text_buffer = ""

for chunk in stream:
    if hasattr(chunk.choices[0].delta, "reasoning_details") and chunk.choices[0].delta.reasoning_details:
        for detail in chunk.choices[0].delta.reasoning_details:
            if "text" in detail:
                reasoning_text = detail["text"]
                new_reasoning = reasoning_text[len(reasoning_buffer):]
                if new_reasoning:
                    print(new_reasoning, end="", flush=True)
                    reasoning_buffer = reasoning_text

    if chunk.choices[0].delta.content:
        content_text = chunk.choices[0].delta.content
        new_text = content_text[len(text_buffer):] if text_buffer else content_text
        if new_text:
            print(new_text, end="", flush=True)
            text_buffer = content_text

Using with Native SDKs

Anthropic SDK via LiteLLM Proxy

import os
os.environ["ANTHROPIC_BASE_URL"] = "http://localhost:4000"
os.environ["ANTHROPIC_API_KEY"] = "sk-1234"  # Your LiteLLM proxy key

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="minimax-m2-5",
    max_tokens=1000,
    system="You are a helpful assistant.",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Hi, how are you?"
                }
            ]
        }
    ]
)

for block in message.content:
    if block.type == "thinking":
        print(f"Thinking:\n{block.thinking}\n")
    elif block.type == "text":
        print(f"Text:\n{block.text}\n")

OpenAI SDK via LiteLLM Proxy

import os
os.environ["OPENAI_BASE_URL"] = "http://localhost:4000"
os.environ["OPENAI_API_KEY"] = "sk-1234"  # Your LiteLLM proxy key

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="minimax-m2-5",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hi, how are you?"},
    ],
    extra_body={"reasoning_split": True},
)

# Access thinking and response
if hasattr(response.choices[0].message, 'reasoning_details'):
    print(f"Thinking:\n{response.choices[0].message.reasoning_details[0]['text']}\n")
print(f"Text:\n{response.choices[0].message.content}\n")

Incident Report: Invalid model cost map on main

Tue, 10 Feb 2026 10:00:00 GMT

Date: January 27, 2026 Duration: ~20 minutes Severity: Low Status: Resolved

Summary

A malformed JSON entry in model_prices_and_context_window.json was merged to main (562f0a0). This caused LiteLLM to silently fall back to a stale local copy of the model cost map. Users on older package versions lost cost tracking for newer models only (e.g. azure/gpt-5.2). No LLM calls were blocked.

LLM calls and proxy routing: No impact.
Cost tracking: Impacted for newer models not present in the local backup. Older models were unaffected. The incident lasted ~20 minutes until the commit was reverted.

Background

The model cost map is not in the request path. It is used after the LLM response comes back, inside a try/except block, to calculate spend. A missing entry never blocks a call.

Whether or not the lookup succeeds, the caller still receives a response. When the cost map lookup fails, the only difference is cost=0 on that request.

Root cause

LiteLLM fetches the model cost map from GitHub main at import time. If the fetch fails, it falls back to a local backup bundled with the package. Before this incident, the fallback was completely silent -- no warning was logged.

A contributor PR introduced an extra { bracket, producing invalid JSON. The remote fetch failed with JSONDecodeError, triggering the silent fallback. Users on older package versions had backup files missing newer models.

Timeline:

Malformed JSON merged to main
LiteLLM installations fall back to local backup on next import
Users report "This model isn't mapped yet" for newer models
Bad commit identified and reverted (~20 minutes)
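The fallback behavior, with the warning that was missing before the incident, can be sketched as follows. This is a simplified model, not the code in get_model_cost_map.py: the function name load_cost_map is hypothetical, and the remote fetch is replaced by a string argument so the example runs offline.

```python
import json
import logging

logger = logging.getLogger("cost_map_example")


def load_cost_map(remote_text: str, local_backup: dict) -> dict:
    """Parse the fetched cost map; fall back to the bundled backup if it is invalid."""
    try:
        cost_map = json.loads(remote_text)
        if not isinstance(cost_map, dict):
            raise ValueError("cost map must be a JSON object")
        return cost_map
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        logger.warning("Remote model cost map is invalid (%s); using local backup", exc)
        return local_backup


backup = {"gpt-4": {}}  # stale backup: lacks newer models
good = '{"gpt-4": {}, "azure/gpt-5.2": {}}'
bad = '{{"gpt-4": {}, "azure/gpt-5.2": {}}'  # extra opening brace, as in the incident

print(sorted(load_cost_map(good, backup)))
print(sorted(load_cost_map(bad, backup)))
```

With the malformed input, the function logs a warning and returns the backup, so a lookup for azure/gpt-5.2 finds nothing. That matches the symptom users reported.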

Remediation

#	Action	Status	Code
1	CI validation on `model_prices_and_context_window.json`	✅ Done	`test-model-map.yaml`
2	Warning log on fallback to local backup	✅ Done	`get_model_cost_map.py#L57-L68`
3	`GetModelCostMap` class with integrity validation helpers	✅ Done	`get_model_cost_map.py#L24-L149`
4	Resilience test suite (bad hosted map, fallback, completion)	✅ Done	`test_model_cost_map_resilience.py#L150-L291`
5	Test that backup model cost map always exists and contains common models	✅ Done	`test_model_cost_map_resilience.py#L213-L228`

Enterprises that require zero external dependencies at import time can set LITELLM_LOCAL_MODEL_COST_MAP=True to skip the GitHub fetch entirely.

Other dependencies on external resources

Dependency	Impact if unavailable	Fallback
Model cost map (GitHub)	Cost tracking for newer models	Local backup (now with warning)
JWT public keys (IDP/SSO)	Auth fails	None
OIDC UserInfo (IDP/SSO)	Auth fails	None
HuggingFace model API	HF provider calls fail	None
Ollama tags (localhost)	Ollama model list stale	Static list

Your Middleware Could Be a Bottleneck

Sat, 07 Feb 2026 10:00:00 GMT

How we improved LiteLLM proxy latency and throughput by replacing a single, simple middleware base class

Our Setup

The LiteLLM proxy server has two middleware layers. The first is Starlette's CORSMiddleware (re-exported by FastAPI), which is a pure ASGI middleware. Then we have a simple BaseHTTPMiddleware called PrometheusAuthMiddleware.

The job of PrometheusAuthMiddleware is to authenticate requests to the /metrics endpoint. It's not on by default; you enable it with a flag in your proxy config.
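For readers who have not written one, a pure ASGI middleware is just a callable that wraps another ASGI app and receives scope, receive, and send directly. The sketch below shows what an auth check for /metrics could look like in that style. It is not LiteLLM's PrometheusAuthMiddleware: the class name, the bearer-token scheme, and the test harness are all assumptions made for illustration, and it uses only the standard library.

```python
import asyncio
import hmac


class MetricsAuthMiddleware:
    """Pure ASGI middleware: a callable wrapping another ASGI app (hypothetical sketch)."""

    def __init__(self, app, token: str):
        self.app = app
        self.expected = f"Bearer {token}".encode()

    async def __call__(self, scope, receive, send):
        # Only guard HTTP requests to /metrics; pass everything else straight through.
        if scope["type"] != "http" or not scope["path"].startswith("/metrics"):
            await self.app(scope, receive, send)
            return
        headers = dict(scope.get("headers", []))
        supplied = headers.get(b"authorization", b"")
        if not hmac.compare_digest(supplied, self.expected):
            await send({"type": "http.response.start", "status": 401,
                        "headers": [(b"content-type", b"text/plain")]})
            await send({"type": "http.response.body", "body": b"Unauthorized"})
            return
        await self.app(scope, receive, send)


async def hello_app(scope, receive, send):
    """Trivial downstream ASGI app that always answers 200."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def call(app, path, headers=()):
    """Drive the app once with a hand-built scope and return the response status."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": path, "headers": list(headers)}
    await app(scope, receive, send)
    return sent[0]["status"]


app = MetricsAuthMiddleware(hello_app, token="secret")
print(asyncio.run(call(app, "/metrics")))                                           # 401
print(asyncio.run(call(app, "/metrics", [(b"authorization", b"Bearer secret")])))  # 200
print(asyncio.run(call(app, "/health")))                                            # 200
```

Requests to any other path go directly to the wrapped app with no extra request or response objects created.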

Improve release stability with 24 hour load tests

Fri, 06 Feb 2026 10:00:00 GMT

As LiteLLM adoption has grown, so have expectations around reliability, performance, and operational safety. Meeting those expectations requires more than correctness-focused tests; it requires validating how the system behaves over time, under real-world conditions.

This post introduces LiteLLM Observatory, a long-running release-validation system we built to catch regressions before they reach users.

Why We Built the Observatory

LiteLLM operates at the intersection of external providers, long-lived network connections, and high-throughput workloads. While our unit and integration tests do an excellent job validating correctness, they are not designed to surface issues that only appear after extended operation.

A subtle lifecycle edge case discovered in v1.81.3 reinforced the need for stronger release validation in this area.

A Real-World Lifecycle Edge Case

In v1.81.3, we shipped a fix for an HTTP client memory leak. The change passed unit and integration tests and behaved correctly in short-lived runs.

The issue that surfaced was not caused by a single incorrect line of logic, but by how multiple components interacted over time:

A cached httpx client was configured with a 1-hour TTL
When the cache expired, the underlying HTTP connection was closed as expected
A higher-level client continued to hold a reference to that connection
Subsequent requests failed with:

Cannot send a request, as the client has been closed
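The interaction is easier to see in a small simulation. The sketch below does not use httpx or LiteLLM code: FakeClient and TTLClientCache are hypothetical stand-ins, the clock is injected so no real hour has to pass, and eviction is modeled as happening on the next cache lookup.

```python
import time


class FakeClient:
    """Stand-in for an HTTP client; not httpx itself."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def send(self, request):
        if self.closed:
            raise RuntimeError("Cannot send a request, as the client has been closed")
        return f"ok: {request}"


class TTLClientCache:
    """Caches one client and closes it once its TTL has expired."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._client = None
        self._expires_at = 0.0

    def get(self):
        now = self.clock()
        if self._client is None or now >= self._expires_at:
            if self._client is not None:
                self._client.close()  # eviction closes the old client
            self._client = FakeClient()
            self._expires_at = now + self.ttl
        return self._client


now = [0.0]
cache = TTLClientCache(ttl_seconds=3600, clock=lambda: now[0])

held = cache.get()   # a higher-level object keeps this reference
now[0] = 3601.0      # just over an hour passes
fresh = cache.get()  # the cache evicts and closes the old client

print(fresh.send("ping"))  # ok: ping
try:
    held.send("ping")
except RuntimeError as exc:
    print(exc)  # Cannot send a request, as the client has been closed
```

Anything that ran for less than the TTL would never reach the failing branch, which is why short-lived test runs passed.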

Before (with bug):

Provider	Requests	Success	Failures	Fail %
OpenAI	720,000	432,000	288,000	40%
Azure	692,000	415,200	276,800	40%

After (fixed):

Provider	Requests	Success	Failures	Fail %
OpenAI	1,200,000	1,199,988	12	0.001%
Azure	1,150,000	1,149,982	18	0.002%

Our focus moving forward is on being the first to detect issues, even when they aren’t covered by unit tests. LiteLLM Observatory is designed to surface latency regressions, OOMs, and failure modes that only appear under real traffic patterns in our own production deployments during release validation.

How the Observatory Works

LiteLLM Observatory is a testing service that runs long-running tests against our LiteLLM deployments. We trigger tests by sending API requests, and results are automatically sent to Slack when tests complete.

How Tests Run

1. Start a Test: We send a request to the Observatory API with:
   - Which LiteLLM deployment to test (URL and API key)
   - Which test to run (e.g., TestOAIAzureRelease)
   - Test settings (which models to test, how long to run, failure thresholds)
2. Smart Queueing:
   - The system checks whether we are attempting to run the exact same test more than once
   - If a duplicate test is already running or queued, we receive an error to avoid wasting resources
   - Otherwise, the test is added to a queue and runs when capacity is available (up to 5 tests can run concurrently by default)
3. Instant Response: The API responds immediately; we do not wait for the test to finish. Tests may run for hours, but the request itself completes in milliseconds.
4. Background Execution:
   - The test runs in the background, issuing requests against our LiteLLM deployment
   - It tracks request success and failure rates over time
   - When the test completes, results are automatically posted to our Slack channel
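The queueing rules described above (reject duplicates, cap concurrency, return immediately) can be sketched with a small in-memory queue. This is an illustration only, not the Observatory's code: the class and exception names are hypothetical, and a test is identified here by a simple tuple.

```python
from collections import deque


class DuplicateTestError(Exception):
    """Raised when the same test is already running or queued."""


class ObservatoryQueue:
    """In-memory sketch: dedupe submissions and cap concurrent runs."""

    def __init__(self, max_concurrent=5):
        self.max_concurrent = max_concurrent
        self.running = set()
        self.queued = deque()

    def submit(self, test_key):
        """Register a test and return its state immediately, without running it."""
        if test_key in self.running or test_key in self.queued:
            raise DuplicateTestError(f"already running or queued: {test_key!r}")
        self.queued.append(test_key)
        self._fill_slots()
        return "running" if test_key in self.running else "queued"

    def finish(self, test_key):
        """Mark a test as done and promote queued tests into free slots."""
        self.running.discard(test_key)
        self._fill_slots()

    def _fill_slots(self):
        while self.queued and len(self.running) < self.max_concurrent:
            self.running.add(self.queued.popleft())


queue = ObservatoryQueue(max_concurrent=5)
keys = [("https://proxy.example", "TestOAIAzureRelease", f"settings-{i}") for i in range(6)]
for key in keys:
    print(key[2], queue.submit(key))  # first five are running, the sixth is queued

queue.finish(keys[0])
print(keys[5] in queue.running)  # True: the queued test took the free slot
```

Submitting any of the keys a second time while it is still running or queued raises DuplicateTestError.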

Example: The OpenAI / Azure Reliability Test

The TestOAIAzureRelease test is designed to catch a class of bugs that only surface after sustained runtime:

Duration: Runs continuously for 3 hours
Behavior: Cycles through specified models (such as gpt-4 and gpt-3.5-turbo), issuing requests continuously
Why 3 Hours: This helps catch issues where HTTP clients degrade or fail after extended use (for example, a bug observed in LiteLLM v1.81.3)
Pass / Fail Criteria: The test passes if fewer than 1% of requests fail. If the failure rate exceeds 1%, the test fails and we are notified in Slack
Key Detail: The same HTTP client is reused for the entire run, allowing us to detect lifecycle-related bugs that only appear under prolonged reuse
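The pass/fail rule is simple arithmetic, shown here against the figures from the tables earlier in this post. The function names are hypothetical, and treating a rate of exactly 1% as a failure is an assumption, since the criteria only describe the cases below and above the threshold.

```python
def failure_rate(requests: int, failures: int) -> float:
    """Fraction of requests that failed."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return failures / requests


def passes(requests: int, failures: int, threshold: float = 0.01) -> bool:
    """Pass only when the failure rate is strictly below the threshold (assumption)."""
    return failure_rate(requests, failures) < threshold


print(passes(720_000, 288_000))  # False: 40% of requests failed
print(passes(1_200_000, 12))     # True: 0.001% of requests failed
```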

When We Use It

Before Deployments: Run tests before promoting a new LiteLLM version to production
Routine Validation: Schedule regular runs (daily or weekly) to catch regressions early
Issue Investigation: Run tests on demand when we suspect a deployment issue
Long-Running Failure Detection: Identify bugs that only appear under sustained load, beyond what short smoke tests can reveal

Complementing Unit Tests

Unit tests remain a foundational part of our development process. They are fast and precise, but they don’t cover:

Real provider behavior
Long-lived network interactions
Resource lifecycle edge cases
Time-dependent regressions

LiteLLM Observatory complements unit tests by validating the system as it actually runs in production-like environments.

Looking Ahead

Reliability is an ongoing investment.

LiteLLM Observatory is one of several systems we’re building to continuously raise the bar on release quality and operational safety. As LiteLLM evolves, so will our validation tooling, informed by real-world usage and lessons learned.

We’ll continue to share those improvements openly as we go.

Day 0 Support: Claude Opus 4.6

Thu, 05 Feb 2026 10:00:00 GMT

LiteLLM now supports Claude Opus 4.6 on Day 0. Use it across Anthropic, Azure, Vertex AI, and Bedrock through the LiteLLM AI Gateway.

Docker Image

docker pull ghcr.io/berriai/litellm:litellm_stable_release_branch-v1.80.0-stable.opus-4-6

Usage - Anthropic

LiteLLM Proxy

1. Setup config.yaml

model_list:
  - model_name: claude-opus-4-6
    litellm_params:
      model: anthropic/claude-opus-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

2. Start the proxy

docker run -d \
  -p 4000:4000 \
  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  -v $(pwd)/config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:litellm_stable_release_branch-v1.80.0-stable.opus-4-6 \
  --config /app/config.yaml

3. Test it!

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $LITELLM_KEY" \
--data '{
  "model": "claude-opus-4-6",
  "messages": [
    {
      "role": "user",
      "content": "what llm are you"
    }
  ]
}'

Usage - Azure

LiteLLM Proxy

1. Setup config.yaml

model_list:
  - model_name: claude-opus-4-6
    litellm_params:
      model: azure_ai/claude-opus-4-6
      api_key: os.environ/AZURE_AI_API_KEY
      api_base: os.environ/AZURE_AI_API_BASE  # https://.services.ai.azure.com

2. Start the proxy

docker run -d \
  -p 4000:4000 \
  -e AZURE_AI_API_KEY=$AZURE_AI_API_KEY \
  -e AZURE_AI_API_BASE=$AZURE_AI_API_BASE \
  -v $(pwd)/config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:litellm_stable_release_branch-v1.80.0-stable.opus-4-6 \
  --config /app/config.yaml

3. Test it!

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $LITELLM_KEY" \
--data '{
  "model": "claude-opus-4-6",
  "messages": [
    {
      "role": "user",
      "content": "what llm are you"
    }
  ]
}'

Usage - Vertex AI

LiteLLM Proxy

1. Setup config.yaml

model_list:
  - model_name: claude-opus-4-6
    litellm_params:
      model: vertex_ai/claude-opus-4-6
      vertex_project: os.environ/VERTEX_PROJECT
      vertex_location: us-east5

2. Start the proxy

docker run -d \
  -p 4000:4000 \
  -e VERTEX_PROJECT=$VERTEX_PROJECT \
  -e GOOGLE_APPLICATION_CREDENTIALS=/app/credentials.json \
  -v $(pwd)/config.yaml:/app/config.yaml \
  -v $(pwd)/credentials.json:/app/credentials.json \
  ghcr.io/berriai/litellm:litellm_stable_release_branch-v1.80.0-stable.opus-4-6 \
  --config /app/config.yaml

3. Test it!

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $LITELLM_KEY" \
--data '{
  "model": "claude-opus-4-6",
  "messages": [
    {
      "role": "user",
      "content": "what llm are you"
    }
  ]
}'

Usage - Bedrock

LiteLLM Proxy

1. Setup config.yaml

model_list:
  - model_name: claude-opus-4-6
    litellm_params:
      model: bedrock/anthropic.claude-opus-4-6-v1
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

2. Start the proxy

docker run -d \
  -p 4000:4000 \
  -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  -v $(pwd)/config.yaml:/app/config.yaml \
  ghcr.io/berriai/litellm:litellm_stable_release_branch-v1.80.0-stable.opus-4-6 \
  --config /app/config.yaml

3. Test it!

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $LITELLM_KEY" \
--data '{
  "model": "claude-opus-4-6",
  "messages": [
    {
      "role": "user",
      "content": "what llm are you"
    }
  ]
}'

Advanced Features

Compaction

/chat/completions
/v1/messages

LiteLLM supports compaction for the new claude-opus-4-6 model.

Enabling Compaction

To enable compaction, add the context_management parameter with the compact_20260112 edit type:

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $LITELLM_KEY" \
--data '{
  "model": "claude-opus-4-6",
  "messages": [
    {
      "role": "user",
      "content": "What is the weather in San Francisco?"
    }
  ],
  "context_management": {
    "edits": [
      {
        "type": "compact_20260112"
      }
    ]
  },
  "max_tokens": 100
}'

All context_management parameters supported by Anthropic are supported and can be passed through directly. LiteLLM automatically adds the compact-2026-01-12 beta header to the request.

Enable compaction to reduce context size while preserving key information. LiteLLM automatically adds the compact-2026-01-12 beta header when compaction is enabled.

info

Provider Support: Compaction is supported on Anthropic, Azure AI, and Vertex AI. It is not supported on Bedrock (Invoke or Converse APIs).

curl --location 'http://0.0.0.0:4000/v1/messages' \
--header 'x-api-key: sk-12345' \
--header 'content-type: application/json' \
--data '{
    "model": "claude-opus-4-6",
    "max_tokens": 4096,
    "messages": [
        {
            "role": "user",
            "content": "Hi"
        }
    ],
    "context_management": {
        "edits": [
            {
                "type": "compact_20260112"
            }
        ]
    }
}'

Response with Compaction Block

The response will include the compaction summary in provider_specific_fields.compaction_blocks:

{
  "id": "chatcmpl-a6c105a3-4b25-419e-9551-c800633b6cb2",
  "created": 1770357619,
  "model": "claude-opus-4-6",
  "object": "chat.completion",
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "message": {
        "content": "I don't have access to real-time data, so I can't provide the current weather in San Francisco. To get up-to-date weather information, I'd recommend checking:\n\n- **Weather websites** like weather.com, accuweather.com, or wunderground.com\n- **Search engines** – just Google \"San Francisco weather\"\n- **Weather apps** on your phone (e.g., Apple Weather, Google Weather)\n- **National",
        "role": "assistant",
        "provider_specific_fields": {
          "compaction_blocks": [
            {
              "type": "compaction",
              "content": "Summary of the conversation: The user requested help building a web scraper..."
            }
          ]
        }
      }
    }
  ],
  "usage": {
    "completion_tokens": 100,
    "prompt_tokens": 86,
    "total_tokens": 186
  }
}

Using Compaction Blocks in Follow-up Requests

To continue the conversation with compaction, include the compaction block in the assistant message's provider_specific_fields:

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $LITELLM_KEY" \
--data '{
  "model": "claude-opus-4-6",
  "messages": [
    {
      "role": "user",
      "content": "How can I build a web scraper?"
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "Certainly! To build a basic web scraper, you'll typically use a programming language like Python along with libraries such as `requests` (for fetching web pages) and `BeautifulSoup` (for parsing HTML). Here's a basic example:\n\n```python\nimport requests\nfrom bs4 import BeautifulSoup\n\nurl = 'https://example.com'\nresponse = requests.get(url)\nsoup = BeautifulSoup(response.text, 'html.parser')\n\n# Extract and print all text\ntext = soup.get_text()\nprint(text)\n```\n\nLet me know what you're interested in scraping or if you need help with a specific website!"
        }
      ],
      "provider_specific_fields": {
        "compaction_blocks": [
          {
            "type": "compaction",
            "content": "Summary of the conversation: The user asked how to build a web scraper, and the assistant gave an overview using Python with requests and BeautifulSoup."
          }
        ]
      }
    },
    {
      "role": "user",
      "content": "How do I use it to scrape product prices?"
    }
  ],
  "context_management": {
    "edits": [
      {
        "type": "compact_20260112"
      }
    ]
  },
  "max_tokens": 100
}'
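
The follow-up request above can also be assembled programmatically from a previous response. The sketch below is illustrative only: the helper names `assistant_turn_from_response` and `build_follow_up` are hypothetical, it works on plain dicts shaped like the JSON shown in this section, and it assumes the assistant content can be replayed as a plain string.

```python
def assistant_turn_from_response(response: dict) -> dict:
    """Build the assistant message to replay, keeping any compaction blocks."""
    message = response["choices"][0]["message"]
    turn = {"role": "assistant", "content": message["content"]}
    fields = message.get("provider_specific_fields") or {}
    blocks = fields.get("compaction_blocks")
    if blocks:
        # Carry the compaction summary forward so it is sent on the next turn.
        turn["provider_specific_fields"] = {"compaction_blocks": blocks}
    return turn


def build_follow_up(history: list, response: dict, next_user_message: str,
                    max_tokens: int = 100) -> dict:
    """Return a /chat/completions request body for the next turn."""
    messages = [
        *history,
        assistant_turn_from_response(response),
        {"role": "user", "content": next_user_message},
    ]
    return {
        "model": response["model"],
        "messages": messages,
        "context_management": {"edits": [{"type": "compact_20260112"}]},
        "max_tokens": max_tokens,
    }
```

The returned dict can be serialized to JSON and sent as the request body, in the same way as the curl example above.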

Streaming Support

Compaction blocks are also supported in streaming mode. You'll receive:

compaction_start event when a compaction block begins
compaction_delta events with the compaction content
The accumulated compaction_blocks in provider_specific_fields

Adaptive Thinking

note

When using reasoning_effort with Claude Opus 4.6, all values (low, medium, high) are mapped to thinking: {type: "adaptive"}. To use explicit thinking budgets with type: "enabled", pass the native thinking parameter directly (see "Native thinking param" tab below).

/chat/completions
/v1/messages
Native thinking param

LiteLLM supports adaptive thinking through the reasoning_effort parameter:

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $LITELLM_KEY" \
--data '{
  "model": "claude-opus-4-6",
  "messages": [
    {
      "role": "user",
      "content": "Solve this complex problem: What is the optimal strategy for..."
    }
  ],
  "reasoning_effort": "high"
}'

Use the thinking parameter with type: "adaptive" to enable adaptive thinking mode:

curl --location 'http://0.0.0.0:4000/v1/messages' \
--header 'x-api-key: sk-12345' \
--header 'content-type: application/json' \
--data '{
    "model": "claude-opus-4-6",
    "max_tokens": 16000,
    "thinking": {
        "type": "adaptive"
    },
    "messages": [
        {
            "role": "user",
            "content": "Explain why the sum of two even numbers is always even."
        }
    ]
}'

Use the thinking parameter directly for adaptive thinking via the SDK:

import litellm

response = litellm.completion(
  model="anthropic/claude-opus-4-6",
  messages=[{"role": "user", "content": "Solve this complex problem: What is the optimal strategy for..."}],
  thinking={"type": "adaptive"},
)
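
The mapping described in the note at the start of this section can be pictured as follows. This is a hedged sketch rather than LiteLLM's internal code: the function name `thinking_for_opus_4_6` is hypothetical, and giving the native `thinking` parameter precedence when both are supplied is an assumption.

```python
from typing import Optional


def thinking_for_opus_4_6(reasoning_effort: Optional[str] = None,
                          thinking: Optional[dict] = None) -> Optional[dict]:
    """Return the thinking config that would be sent for Claude Opus 4.6."""
    if thinking is not None:
        # Native param: passed through unchanged, e.g. an explicit budget.
        return thinking
    if reasoning_effort in {"low", "medium", "high"}:
        # Every reasoning_effort value collapses to adaptive thinking.
        return {"type": "adaptive"}
    return None
```

Under these assumptions, `thinking_for_opus_4_6(reasoning_effort="low")` and `thinking_for_opus_4_6(reasoning_effort="high")` both yield `{"type": "adaptive"}`, while a native value such as `{"type": "enabled", "budget_tokens": 10000}` is returned as given.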

Effort Levels

/chat/completions
/v1/messages

Four effort levels are available: low, medium, high (default), and max. Pass the level directly via the output_config parameter:

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $LITELLM_KEY" \
--data '{
  "model": "claude-opus-4-6",
  "messages": [
    {
      "role": "user",
      "content": "Explain quantum computing"
    }
  ],
  "output_config": {
        "effort": "medium"
    }
}'

You can combine reasoning_effort with output_config for finer control over the model.

The same four effort levels (low, medium, high, and max) are available on /v1/messages via the output_config parameter:

curl --location 'http://0.0.0.0:4000/v1/messages' \
--header 'x-api-key: sk-12345' \
--header 'content-type: application/json' \
--data '{
    "model": "claude-opus-4-6",
    "max_tokens": 4096,
    "messages": [
        {
            "role": "user",
            "content": "Explain quantum computing"
        }
    ],
    "output_config": {
        "effort": "medium"
    }
}'

1M Token Context (Beta)

Opus 4.6 supports 1M token context. Premium pricing applies for prompts exceeding 200k tokens ($10/$37.50 per million input/output tokens). LiteLLM supports cost calculations for 1M token contexts.
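
To make the pricing concrete, here is a small worked estimate. It is a sketch under stated assumptions: the standard rates ($5 input / $25 output per million tokens) come from the Fast Mode pricing later in this post, the function name `long_context_cost_usd` is hypothetical, and it assumes the premium rates apply to the whole request once the prompt exceeds 200k tokens. LiteLLM's own cost tracking may differ in detail.

```python
LONG_CONTEXT_THRESHOLD = 200_000       # prompt tokens
STANDARD_RATES = (5.00, 25.00)         # USD per million input / output tokens
PREMIUM_RATES = (10.00, 37.50)         # USD per million input / output tokens


def long_context_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate request cost, switching to premium rates above the threshold."""
    # Assumption: one rate pair applies to the entire request.
    if prompt_tokens > LONG_CONTEXT_THRESHOLD:
        input_rate, output_rate = PREMIUM_RATES
    else:
        input_rate, output_rate = STANDARD_RATES
    return (prompt_tokens * input_rate + completion_tokens * output_rate) / 1_000_000
```

Under these assumptions, a 100,000-token prompt with 1,000 output tokens costs about $0.525, while a 500,000-token prompt with 2,000 output tokens costs about $5.075.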

/chat/completions
/v1/messages

To use the 1M token context window, you need to forward the anthropic-beta header from your client to the LLM provider.

Step 1: Enable header forwarding in your config

general_settings:
  forward_client_headers_to_llm_api: true

Step 2: Send requests with the beta header

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $LITELLM_KEY" \
--header 'anthropic-beta: context-1m-2025-08-07' \
--data '{
  "model": "claude-opus-4-6",
  "messages": [
    {
      "role": "user",
      "content": "Analyze this large document..."
    }
  ]
}'

For /v1/messages, use the same header-forwarding config and send the request with the beta header:

curl --location 'http://0.0.0.0:4000/v1/messages' \
--header 'x-api-key: sk-12345' \
--header 'anthropic-beta: context-1m-2025-08-07' \
--header 'content-type: application/json' \
--data '{
    "model": "claude-opus-4-6",
    "max_tokens": 16000,
    "messages": [
        {
            "role": "user",
            "content": "Analyze this large document..."
        }
    ]
}'

tip

You can combine multiple beta headers by separating them with commas:

--header 'anthropic-beta: context-1m-2025-08-07,compact-2026-01-12'
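
When building requests in code, the combined header value can be produced with a small helper. This is a minimal sketch; the function name `anthropic_beta_header` is hypothetical, and dropping repeated or empty entries is a convenience added here rather than a documented requirement.

```python
def anthropic_beta_header(*features: str) -> dict:
    """Return an anthropic-beta header joining the given beta flags with commas."""
    cleaned = (feature.strip() for feature in features)
    # dict.fromkeys removes duplicates while keeping the original order.
    unique = list(dict.fromkeys(feature for feature in cleaned if feature))
    return {"anthropic-beta": ",".join(unique)}
```

The resulting dict can be merged into the headers of any HTTP client, for example through the `extra_headers` argument of the OpenAI Python SDK.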

US-Only Inference

Available at 1.1× token pricing. LiteLLM automatically tracks costs for US-only inference.

/chat/completions
/v1/messages

Use the inference_geo parameter to specify US-only inference:

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $LITELLM_KEY" \
--data '{
  "model": "claude-opus-4-6",
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ],
  "inference_geo": "us"
}'

LiteLLM will automatically apply the 1.1× pricing multiplier for US-only inference in cost tracking.

The same inference_geo parameter works on /v1/messages:

curl --location 'http://0.0.0.0:4000/v1/messages' \
--header 'x-api-key: sk-12345' \
--header 'content-type: application/json' \
--data '{
    "model": "claude-opus-4-6",
    "max_tokens": 4096,
    "messages": [
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ],
    "inference_geo": "us"
}'


Fast Mode

info

Fast mode is only supported on the Anthropic provider (anthropic/claude-opus-4-6). It is not available on Azure AI, Vertex AI, or Bedrock.

Pricing:

Standard: $5 input / $25 output per MTok
Fast: $30 input / $150 output per MTok (6× premium)

/chat/completions
/v1/messages

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $LITELLM_KEY" \
--data '{
  "model": "claude-opus-4-6",
  "messages": [
    {
      "role": "user",
      "content": "Refactor this module..."
    }
  ],
  "max_tokens": 4096,
  "speed": "fast"
}'

Using OpenAI SDK:

import openai

client = openai.OpenAI(
    api_key="your-litellm-key",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="claude-opus-4-6",
    messages=[{"role": "user", "content": "Refactor this module..."}],
    max_tokens=4096,
    extra_body={"speed": "fast"}
)

Using LiteLLM SDK:

from litellm import completion

response = completion(
    model="anthropic/claude-opus-4-6",
    messages=[{"role": "user", "content": "Refactor this module..."}],
    max_tokens=4096,
    speed="fast"
)

LiteLLM automatically tracks the higher costs for fast mode in usage and cost calculations.

curl --location 'http://0.0.0.0:4000/v1/messages' \
--header 'x-api-key: sk-12345' \
--header 'content-type: application/json' \
--data '{
    "model": "claude-opus-4-6",
    "max_tokens": 4096,
    "speed": "fast",
    "messages": [
        {
            "role": "user",
            "content": "Refactor this module..."
        }
    ]
}'

LiteLLM automatically:

Adds the fast-mode-2026-02-01 beta header
Tracks the 6× premium pricing in cost calculations
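
As a quick check of the 6× figure, the sketch below compares the two rate tables listed under Pricing. The function name `fast_mode_cost_usd` is hypothetical and the calculation ignores other modifiers such as long-context pricing; it is meant only to illustrate the arithmetic, not LiteLLM's cost tracker.

```python
RATES_PER_MTOK = {
    "standard": {"input": 5.00, "output": 25.00},
    "fast": {"input": 30.00, "output": 150.00},
}


def fast_mode_cost_usd(prompt_tokens: int, completion_tokens: int,
                       speed: str = "standard") -> float:
    """Estimate request cost in USD for the given speed tier."""
    rates = RATES_PER_MTOK[speed]
    total = prompt_tokens * rates["input"] + completion_tokens * rates["output"]
    return total / 1_000_000
```

For 1,000,000 input tokens and 100,000 output tokens this gives about $7.50 at standard speed and about $45.00 in fast mode, a ratio of 6.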

Achieving Sub-Millisecond Proxy Overhead

Mon, 02 Feb 2026 10:00:00 GMT

Introduction

Our Q1 performance target is to aggressively move toward sub-millisecond proxy overhead on a single instance with 4 CPUs and 8 GB of RAM, and to continue pushing that boundary over time. Our broader goal is to make LiteLLM inexpensive to deploy, lightweight, and fast. This post outlines the architectural direction behind that effort.

Proxy overhead refers to the latency introduced by LiteLLM itself, independent of the upstream provider.

To measure it, we run the same workload directly against the provider and through LiteLLM at identical QPS (for example, 1,000 QPS) and compare the latency delta. To reduce noise, the load generator, LiteLLM, and a mock LLM endpoint all run on the same machine, ensuring the difference reflects proxy overhead rather than network latency.
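
The comparison can be sketched as a difference of latency summaries. This is an illustration only: the function name `proxy_overhead_ms` and the choice of mean, p50, p95, and p99 are assumptions, not the benchmark harness used for the numbers in this post.

```python
import statistics


def _summarize(latencies_ms):
    """Return mean and selected percentiles for a list of latencies in ms."""
    ordered = sorted(latencies_ms)
    # quantiles(n=100) yields 99 cut points: index 94 is p95, index 98 is p99.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(ordered),
        "p50": statistics.median(ordered),
        "p95": cuts[94],
        "p99": cuts[98],
    }


def proxy_overhead_ms(direct_latencies_ms, proxied_latencies_ms) -> dict:
    """Latency added by the proxy: proxied minus direct, per statistic."""
    direct = _summarize(direct_latencies_ms)
    proxied = _summarize(proxied_latencies_ms)
    return {name: proxied[name] - direct[name] for name in direct}
```

Both sample lists should come from runs at the same QPS against the same mock endpoint, so that the delta reflects the proxy rather than the workload.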

Where We're Coming From

Under the same benchmark originally conducted by TensorZero, LiteLLM previously failed at around 1,000 QPS.

That is no longer the case. Today, LiteLLM can be stress-tested at 1,000 QPS with no failures and can scale up to 5,000 QPS without failures on a 4-CPU, 8-GB RAM single instance setup.

This establishes a more up-to-date baseline and provides useful context as we continue working on proxy overhead and overall performance.

Design Choice

Achieving sub-millisecond proxy overhead with a Python-based system requires being deliberate about where work happens.

Python is a strong fit for flexibility and extensibility: provider abstraction, configuration-driven routing, and a rich callback ecosystem. These are areas where development velocity and correctness matter more than raw throughput.

At higher request rates, however, certain classes of work become expensive when executed inside the Python process on every request. Rather than rewriting LiteLLM or introducing complex deployment requirements, we adopt an optional sidecar architecture.

This architectural change is how we intend to make LiteLLM permanently fast. While it supports our near-term performance targets, it is a long-term investment.

Python continues to own:

Request validation and normalization
Model and provider selection
Callbacks and integrations

The sidecar owns performance-critical execution, such as:

Efficient request forwarding
Connection reuse and pooling
Enforcing timeouts and limits
Aggregating high-frequency metrics

This separation allows each component to focus on what it does best: Python acts as the control plane, while the sidecar handles the hot path.

Why the Sidecar Is Optional

The sidecar is intentionally optional.

This allows us to ship it incrementally, validate it under real-world workloads, and avoid making it a hard dependency before it is fully battle-tested across all LiteLLM features.

Just as importantly, this ensures that self-hosting LiteLLM remains simple. The sidecar is bundled and started automatically, requires no additional infrastructure, and can be disabled entirely. From a user's perspective, LiteLLM continues to behave like a single service.

As of today, the sidecar is an optimization, not a requirement.

Conclusion

Sub-millisecond proxy overhead is not achieved through a single optimization, but through architectural changes.

By keeping Python focused on orchestration and extensibility, and offloading performance-critical execution to a sidecar, we establish a foundation for making LiteLLM permanently fast over time—even on modest hardware such as a 1-CPU, 2-GB RAM instance, while keeping deployment and self-hosting simple.

This work extends beyond Q1, and we will continue sharing benchmarks and updates as the architecture evolves.

DAY 0 Support: Gemini 3 Flash on LiteLLM

Wed, 17 Dec 2025 10:00:00 GMT

LiteLLM now supports gemini-3-flash-preview and all the new API changes along with it.

note

Deploy this version

Docker

docker run \
-e STORE_MODEL_IN_DB=True \
-p 4000:4000 \
ghcr.io/berriai/litellm:main-v1.80.8-stable.1

Pip

pip install litellm==1.80.8.post1

What's New

1. New Thinking Levels: `thinkingLevel` with MINIMAL & MEDIUM

Gemini 3 Flash introduces granular thinking control with thinkingLevel instead of thinkingBudget.

MINIMAL: Ultra-lightweight thinking for fast responses
MEDIUM: Balanced thinking for complex reasoning
HIGH: Maximum reasoning depth

2. Thought Signatures

Like gemini-3-pro, this model also includes thought signatures for tool calls. LiteLLM handles signature extraction and embedding internally. Learn more about thought signatures.

Edge Case Handling: If thought signatures are missing from the request, LiteLLM adds a dummy signature so that the API call doesn't break.

Supported Endpoints

LiteLLM provides full end-to-end support for Gemini 3 Flash on:

✅ /v1/chat/completions - OpenAI-compatible chat completions endpoint
✅ /v1/responses - OpenAI Responses API endpoint (streaming and non-streaming)
✅ /v1/messages - Anthropic-compatible messages endpoint
✅ /v1/generateContent - Google Gemini API compatible endpoint

All endpoints support:

Streaming and non-streaming responses
Function calling with thought signatures
Multi-turn conversations
All Gemini 3-specific features
Conversion of provider-specific thinking-related params to thinkingLevel

Quick Start

SDK

Basic Usage with MEDIUM thinking (NEW)

from litellm import completion

# No code changes are needed: LiteLLM maps the OpenAI reasoning_effort param to thinkingLevel
response = completion(
    model="gemini/gemini-3-flash-preview",
    messages=[{"role": "user", "content": "Solve this complex math problem: 25 * 4 + 10"}],
    reasoning_effort="medium",  # NEW: MEDIUM thinking level
)

print(response.choices[0].message.content)

PROXY

1. Setup config.yaml

model_list:
  - model_name: gemini-3-flash
    litellm_params:
      model: gemini/gemini-3-flash-preview
      api_key: os.environ/GEMINI_API_KEY

2. Start proxy

litellm --config /path/to/config.yaml

3. Call with MEDIUM thinking

curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -d '{
    "model": "gemini-3-flash",
    "messages": [{"role": "user", "content": "Complex reasoning task"}],
    "reasoning_effort": "medium"
  }'

All `reasoning_effort` Levels

Ultra-fast, minimal reasoning

from litellm import completion

response = completion(
    model="gemini/gemini-3-flash-preview",
    messages=[{"role": "user", "content": "What's 2+2?"}],
    reasoning_effort="minimal",
)

Simple instruction following

response = completion(
    model="gemini/gemini-3-flash-preview",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    reasoning_effort="low",
)

Balanced reasoning for complex tasks ✨

response = completion(
    model="gemini/gemini-3-flash-preview",
    messages=[{"role": "user", "content": "Analyze this dataset and find patterns"}],
    reasoning_effort="medium",  # NEW!
)

Maximum reasoning depth

response = completion(
    model="gemini/gemini-3-flash-preview",
    messages=[{"role": "user", "content": "Prove this mathematical theorem"}],
    reasoning_effort="high",
)

Key Features

✅ Thinking Levels: MINIMAL, LOW, MEDIUM, HIGH
✅ Thought Signatures: Track reasoning with unique identifiers
✅ Seamless Integration: Works with existing OpenAI-compatible client
✅ Backward Compatible: Gemini 2.5 models continue using thinkingBudget

Installation

pip install litellm --upgrade

import litellm
from litellm import completion

response = completion(
    model="gemini/gemini-3-flash-preview",
    messages=[{"role": "user", "content": "Your question here"}],
    reasoning_effort="medium",  # Use MEDIUM thinking
)
print(response)

note

If you use this model via vertex_ai, set the location to global, which is currently the only supported location.

`reasoning_effort` Mapping for Gemini 3+

reasoning_effort	thinking_level
`minimal`	`minimal`
`low`	`low`
`medium`	`medium`
`high`	`high`
`disable`	`minimal`
`none`	`minimal`
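
The table above can be expressed as a simple lookup. This is a sketch for illustration, not LiteLLM's internal mapping code; the names `REASONING_EFFORT_TO_THINKING_LEVEL` and `to_thinking_level` are hypothetical.

```python
REASONING_EFFORT_TO_THINKING_LEVEL = {
    "minimal": "minimal",
    "low": "low",
    "medium": "medium",
    "high": "high",
    "disable": "minimal",  # no "off" level is listed, so the lowest level is used
    "none": "minimal",
}


def to_thinking_level(reasoning_effort: str) -> str:
    """Translate an OpenAI-style reasoning_effort into a Gemini 3 thinking level."""
    try:
        return REASONING_EFFORT_TO_THINKING_LEVEL[reasoning_effort]
    except KeyError:
        raise ValueError(f"Unsupported reasoning_effort: {reasoning_effort!r}") from None
```

Note that `disable` and `none` both map to `minimal` in the table, so thinking is reduced to its lowest level rather than switched off.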
