RouteLLM API Reference

RouteLLM provides an OpenAI-compatible API endpoint that intelligently routes your requests to the most appropriate underlying model based on cost, speed, and performance requirements.

Overview​

RouteLLM is a smart routing layer that automatically selects the best model for your request, balancing performance, cost, and speed. Instead of manually choosing between different models, you can use the route-llm model identifier and let the system make the optimal choice for you.

Key Features​

  • Intelligent Routing: Automatically selects the best model based on request complexity
  • Cost Optimization: Routes to cost-effective models when appropriate
  • Performance Tuning: Uses high-performance models for complex tasks
  • Streaming Support: Real-time response streaming available
  • Tool Calling: Invoke functions from the model response and submit results back for multi-step workflows
  • Multimodal Support: Supports text, audio, and image inputs for compatible models
  • PDF Support: Process and analyze PDF documents as input for compatible models
  • Image Generation: Generate high-quality images from text prompts using state-of-the-art models
  • Audio Understanding: Analyze and transcribe audio inputs using OpenAI GPT-4o Audio and Google Gemini models
  • Audio Generation (TTS): Generate spoken audio responses from text using OpenAI GPT-4o Audio and Google Gemini TTS models

Getting Started​

How It Works​

  1. Sign Up: Sign up as a ChatLLM subscriber to access RouteLLM API
  2. Access the API: Click on the RouteLLM API icon in the lower left corner of the ChatLLM interface to access API documentation and details
  3. Get Your API Key: Obtain your API key from the RouteLLM API page
  4. Start Using: Invoke the API for any LLM and use it in your applications

Why Choose RouteLLM API?​

RouteLLM API comes with your ChatLLM subscription, providing several key benefits:

  • Unified Platform: Use all LLMs (both open-weight and proprietary) in the ChatLLM Teams UX and via API, all in one place
  • Easy Management: Centralized way to manage all your favorite AI model consumption
  • Flexible Access: Access models through both the user interface and programmatic API
  • Cost-Effective: Competitive pricing with best available rates for open-source models
  • Transparent Pricing: No markup on proprietary LLMs - you pay provider prices

Pricing​

Credit System​

The ChatLLM subscription includes 20,000 credits to get you started. Each API call consumes credits proportional to the cost of the LLM call. RouteLLM is available for unlimited use for ChatLLM subscribers - while it still tracks credits for accounting purposes, you can continue to use RouteLLM even after hitting your monthly credit limit.

Pricing Details​

Proprietary LLMs​

Proprietary LLMs (e.g., OpenAI, Anthropic, Google Gemini, etc.) are priced based on the prices advertised by the provider. We DO NOT charge you more than what the provider does. Prices are updated automatically whenever the provider updates their pricing.

Open-Weight LLMs​

Open-Weight LLMs are priced to match the best available price anywhere in the world.

Note: All open weight LLMs are hosted on servers based in the United States.

View Current Pricing​

Pricing for each LLM is published in our RouteLLM API pricing documentation. You can also programmatically retrieve the most up-to-date list of available models and their current pricing via the /v1/models endpoint — both GET /v1/models and listRouteLLMModels resolve to the same underlying API.
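As a sketch, the endpoint can be queried with Python's standard library; the `{"data": [...]}` response shape assumed here follows the OpenAI convention — verify it against your deployment:

```python
import json
import urllib.request

def models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated GET request for the /v1/models endpoint."""
    return urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )

def list_models(base_url: str, api_key: str) -> list:
    """Return the model entries (id plus pricing) from /v1/models."""
    with urllib.request.urlopen(models_request(base_url, api_key)) as resp:
        return json.load(resp)["data"]
```

Each entry's `id` field is the exact string to pass as the `model` parameter in requests.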

Base URLs​

The base URL depends on your organization type:

  • Self-Serve Organizations: https://routellm.abacus.ai/v1
  • Enterprise Platform: https://<workspace>.abacus.ai/v1

Replace <workspace> with your specific workspace identifier for enterprise deployments. To find the correct base URL for your account, refer to the RouteLLM API page.

Authentication​

All API requests require authentication using an API key. Include your API key in the request header:

Authorization: Bearer <your_api_key>

You can obtain your API key from the Abacus.AI platform.
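A raw-HTTP sketch of attaching the header with Python's standard library (the OpenAI SDK examples later in this document set the same header automatically from `api_key`):

```python
import json
import urllib.request

BASE_URL = "https://routellm.abacus.ai/v1"  # self-serve; substitute your own

def chat_request(api_key: str, payload: dict) -> urllib.request.Request:
    """POST /v1/chat/completions with the Bearer token in the header."""
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```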

Supported Models​

The RouteLLM API supports a wide range of models for both text generation and image generation. You can specify a model explicitly or use route-llm to let the system decide.

Routing Model​

  • route-llm: Intelligently routes to the best available model based on the complexity of the request. This is the recommended option for most use cases.

Text Generation Models​

You can also directly target specific text generation models. The OpenAI models, for example, include the following (grouped into chat and reasoning models):

Chat Models                          Reasoning Models
gpt-5.4, gpt-5.4-mini, gpt-5.4-nano  o4-mini, o4-mini-high
gpt-5.3-codex                        o3, o3-high, o3-mini, o3-pro
gpt-5.2, gpt-5.1                     o1, o1-mini
gpt-5, gpt-5-mini, gpt-5-nano
gpt-4.1, gpt-4.1-mini, gpt-4.1-nano
gpt-4o, gpt-4o-mini

Note: This list is subject to change as new models are added. Use the /v1/models endpoint to get the most up-to-date list of available models and their pricing.

Image Generation Models​

RouteLLM supports a wide range of image generation models, from dedicated generators to multimodal LLMs with native image output. Models are grouped into two categories:

  • Dedicated image generation models — purpose-built for image synthesis. Examples: flux-2-pro, dall-e, ideogram, recraft, imagen, seedream, nano-banana-pro, midjourney, and more.
  • Multimodal LLMs — conversational models (OpenAI GPT and Google Gemini) that can generate images alongside text when modalities: ["image"] is specified.

Image generation requests use the same /v1/chat/completions endpoint as text, with the modalities and image_config parameters controlling output type, number of images, aspect ratio, quality, and resolution.

For the complete model catalogue, supported parameters per model, and usage examples, see the Image Analysis & Generation reference.

Audio Models​

Model ID                      Provider  Capabilities
gpt-4o-audio-preview          OpenAI    Audio input + Audio output
gpt-4o-mini-audio-preview     OpenAI    Audio input + Audio output
gemini-2.5-flash-preview-tts  Google    Audio output (TTS)
gemini-2.5-pro-preview-tts    Google    Audio output (TTS)

For full details on models, pricing, and usage → Audio Capabilities

Request Parameters​

1. Required Parameters​

messages (array, required)​

A list of messages comprising the conversation so far. Each message must be an object with the following structure:

  • role (string, required): The role of the message sender. Must be one of:

    • user: Messages from the user/end-user
    • assistant: Previous responses from the AI assistant
    • system: System-level instructions that guide the assistant's behavior
  • content (string or array, required): The content of the message. Can be:

    • A string for text-only messages
    • An array for multimodal content (text and images)

2. Optional Parameters​

model (string, optional)​

The ID of the model to use. Can be either a text generation model or an image generation model, depending on the modalities parameter. If omitted, defaults to route-llm.

Note: The model names shown in the Supported Models section use a human-readable format (e.g. flux-2-pro), but the actual model ID accepted by the API may differ (e.g. flux2_pro). Call GET /v1/models to retrieve the exact id string for each model and use that value in your requests.

Text Generation Models: route-llm, gpt-5.4, claude-sonnet-4-6, gemini-3.1-pro, etc.

Image Generation Models: flux-2-pro, flux-kontext, dall-e, ideogram, recraft, imagen, nano-banana-pro, seedream

Examples: route-llm, gpt-5.4, flux-2-pro, seedream

max_tokens (integer, optional)​

The maximum number of tokens to generate in the chat completion. The total length of input tokens and generated tokens is limited by the model's context window.

Default: Model-dependent

temperature (number, optional)​

What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.

Default: 1.0

Recommended values:

  • 0.0-0.3: For factual, deterministic responses
  • 0.7-1.0: For creative, varied responses
  • 1.0-2.0: For highly creative, diverse outputs

top_p (number, optional)​

An alternative to sampling with temperature, called nucleus sampling, in which the model considers only the tokens comprising the top_p probability mass. A value of 0.1 means only the tokens comprising the top 10% probability mass are considered.

Default: 1.0

Range: 0.0 to 1.0

stream (boolean, optional)​

If set to true, partial message deltas will be sent as data-only server-sent events as they become available. The stream terminates with a data: [DONE] message.

Default: false

stop (string or array, optional)​

Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.

Example: "stop": ["Human:", "AI:"]
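To illustrate the truncation semantics only (the API applies stop sequences server-side; this helper is not part of any SDK):

```python
def apply_stop(text: str, stops: list[str]) -> str:
    """Truncate at the earliest stop sequence; the sequence itself is dropped,
    mirroring the API's behavior that returned text excludes the stop string."""
    cut = min((text.find(s) for s in stops if s in text), default=len(text))
    return text[:cut]
```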

presence_penalty (number, optional)​

Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.

Default: 0.0

frequency_penalty (number, optional)​

Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.

Default: 0.0

response_format (object, optional)​

An object specifying the format that the model must output. Two types are supported:

1. JSON Object Mode

"response_format": {
"type": "json_object"
}

Constrains the model to output valid JSON. You must also instruct the model to produce JSON via a system or user message.

2. JSON Schema Mode

"response_format": {
"type": "json_schema",
"json_schema": {
"name": "your_schema_name",
"schema": {
"type": "object",
"properties": {
"field_name": { "type": "string" },
"count": { "type": "integer" }
},
"required": ["field_name", "count"],
"additionalProperties": false
}
}
}

JSON Schema mode constrains the model to output JSON that strictly conforms to the provided schema. No system or user message instructing the model to produce JSON is required — the schema itself enforces the format. The json_schema object requires:

Field   Type     Required  Description
name    string   Yes       A name identifier for the schema
schema  object   Yes       The JSON Schema definition
strict  boolean  No        Whether to enforce strict schema adherence (see below)

The inner schema object requires:

Field                 Type     Required  Description
type                  string   Yes       JSON Schema type (e.g., "object")
properties            object   Yes       Property definitions for the object
required              array    Yes       List of required property names
additionalProperties  boolean  Yes       Whether to allow extra properties beyond those defined

strict mode​

The strict field controls how rigidly the model follows the schema:

  • strict: true — The schema is treated as a law. The model is guaranteed to produce output that exactly matches the schema. Every field in required will be present, no extra fields are added, and types are enforced precisely.
  • strict: false (default) — The schema is treated as a suggestion. The model will try to follow it, but may deviate in edge cases (e.g., omitting optional fields or adding extra context).

Use strict: true whenever your downstream code parses the response programmatically.

Important: When using response_format: { type: "json_object" }, you must instruct the model to produce JSON via a system or user message. This is not required for json_schema mode — the schema enforces the format automatically.

tools (array, optional)​

A list of tools the model may call. Each tool is an object with:

  • type: Must be "function".
  • function: Object with:
    • name (string, required): Name of the function the model can call.
    • description (string, optional): Description of the function for the model.
    • parameters (object, optional): JSON Schema for the function parameters (OpenAI-style).

Example:

"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": { "type": "string", "description": "City and state, e.g. San Francisco, CA" },
"unit": { "type": "string", "enum": ["celsius", "fahrenheit"] }
},
"required": ["location"]
}
}
}
]

tool_choice (string or object, optional)​

Controls whether the model can call tools. Values:

  • "none": Do not call any tool (default when tools is omitted).
  • "auto": Model may choose to call one or more tools (default when tools is provided).
  • {"type": "function", "function": {"name": "get_current_weather"}}: Force the model to call the named function.

Default: "auto" when tools is provided.

modalities (array, optional)​

Specifies the output modalities for the request, e.g. ["image"] to request image output from image-capable models, or a list including "audio" for spoken responses. When omitted, the model returns text.
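As a sketch, using model names from the Supported Models section (text output is assumed to be the default when modalities is omitted):

```python
# Default text output: modalities can simply be omitted.
text_payload = {
    "model": "route-llm",
    "messages": [{"role": "user", "content": "Summarize RouteLLM."}],
}

# Image output from a dedicated generator (see Image Generation Models).
image_payload = {
    "model": "flux-2-pro",
    "messages": [{"role": "user", "content": "A watercolor fox at dawn"}],
    "modalities": ["image"],
}
```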

audio (object, optional)​

Required when modalities includes "audio". Specifies the voice and output format for audio generation. See Audio Capabilities for full parameter reference, available voices, and examples.

Response Format​

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "route-llm",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The meaning of life is..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 20,
    "total_tokens": 30
  }
}

Response Fields​

  • id: A unique identifier for the chat completion
  • object: The object type, always chat.completion (or chat.completion.chunk for streaming)
  • created: The Unix timestamp of when the completion was created
  • model: The model used for the completion (may differ from the requested model if using route-llm)
  • choices: A list of completion choices
    • index: The index of the choice
    • message: The message object (non-streaming) or delta (streaming)
    • finish_reason: The reason the completion finished (stop, length, content_filter, tool_calls, or null for streaming)
  • usage: Token usage statistics (not present in streaming responses until the final chunk)

Tool Calling​

The API supports tool (function) calling: the model can request that your application run a function and return the result in a follow-up request. This enables multi-step workflows (e.g. get weather, query a database, run code).

note

Currently, tool calling is stateless. The server does not execute tools or persist tool-call state. Your application must run the requested functions, send the results back in a follow-up request (with the same tools and full message history), and handle any multi-step flow on the client side.

Request: Defining tools​

Pass a tools array with one or more functions. Optionally set tool_choice to "auto" (default), "none", or a specific function to force.

Example request with tools:

{
  "model": "route-llm",
  "messages": [
    {"role": "user", "content": "What's the weather in Boston?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": { "type": "string", "description": "City and state" },
            "unit": { "type": "string", "enum": ["celsius", "fahrenheit"] }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}

Response: model requests a tool call​

When the model decides to call a tool, the completion message includes a tool_calls array and finish_reason is "tool_calls". The message content may be empty or contain reasoning.

Example response with tool_calls:

{
  "id": "chatcmpl-xyz",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "route-llm",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_current_weather",
              "arguments": "{\"location\": \"Boston, MA\", \"unit\": \"fahrenheit\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": { "prompt_tokens": 20, "completion_tokens": 25, "total_tokens": 45 }
}

Follow-up: sending tool results​

To continue the conversation, send the assistant message (including tool_calls) and add a message with role: "tool" for each tool call, providing the tool_call_id and the result as content.

Example follow-up request:

{
  "model": "route-llm",
  "messages": [
    {"role": "user", "content": "What's the weather in Boston?"},
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "id": "call_abc123",
          "type": "function",
          "function": {
            "name": "get_current_weather",
            "arguments": "{\"location\": \"Boston, MA\", \"unit\": \"fahrenheit\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "call_abc123",
      "content": "{\"temperature\": 72, \"unit\": \"fahrenheit\", \"conditions\": \"Sunny\"}"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": { "type": "string" },
            "unit": { "type": "string", "enum": ["celsius", "fahrenheit"] }
          },
          "required": ["location"]
        }
      }
    }
  ]
}

The model will then generate a final reply (e.g. summarizing the weather). Repeat the flow if it returns more tool_calls.
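Putting the whole stateless flow together, the client-side loop can be sketched as below. Here create stands in for a function that calls /v1/chat/completions and returns the parsed JSON as a dict, and registry maps tool names to local Python functions; both are assumptions of this sketch, not part of the API:

```python
import json

def run_tool_loop(create, messages, tools, registry, max_rounds=5):
    """Call the model, execute any requested tools locally, and loop
    until the model returns a plain answer (stateless tool calling)."""
    for _ in range(max_rounds):
        msg = create(model="route-llm", messages=messages, tools=tools)["choices"][0]["message"]
        if not msg.get("tool_calls"):
            return msg["content"]          # final answer, no more tools needed
        messages.append(msg)               # echo the assistant turn verbatim
        for tc in msg["tool_calls"]:       # answer every tool call it made
            args = json.loads(tc["function"]["arguments"])
            messages.append({
                "role": "tool",
                "tool_call_id": tc["id"],
                "content": json.dumps(registry[tc["function"]["name"]](**args)),
            })
    raise RuntimeError("model kept requesting tools")
```

Note that the follow-up request carries the same tools list and the full message history, as the statelessness note above requires.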

Notes:

  • Include the same tools (and optionally tool_choice) in follow-up requests when continuing a tool-calling conversation.
  • When streaming, tool call arguments arrive in multiple chunks. Use the index field to match chunks to specific tool calls, and concatenate the arguments strings before parsing as JSON.
  • Multiple tool calls can be returned in a single response. In streaming mode, track each tool call by its index and aggregate separately.
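The index-based aggregation described above can be sketched as follows, with each delta being the parsed delta object of one streamed chunk (dict-shaped here for illustration):

```python
def collect_tool_calls(deltas):
    """Aggregate streamed tool-call fragments by their index field,
    concatenating the arguments strings before they are parsed as JSON."""
    calls = {}
    for delta in deltas:
        for frag in delta.get("tool_calls", []):
            call = calls.setdefault(frag["index"],
                                    {"id": None, "name": None, "arguments": ""})
            if frag.get("id"):                  # id arrives in the first fragment
                call["id"] = frag["id"]
            fn = frag.get("function", {})
            if fn.get("name"):                  # name arrives in the first fragment
                call["name"] = fn["name"]
            call["arguments"] += fn.get("arguments", "")
    return [calls[i] for i in sorted(calls)]
```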

PDF Support​

PDF documents are supported as input for compatible models. Use the file content type with a file object (filename, file_data) for parsing.

Request schema:

{
  "model": "gpt-5.1",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What are the main points in this document?"
        },
        {
          "type": "file",
          "file": {
            "filename": "document.pdf",
            "file_data": "https://bitcoin.org/bitcoin.pdf"
          }
        }
      ]
    }
  ]
}

Notes:

  • Use type: "file" with a file object containing filename and file_data
  • file_data can be an HTTPS URL to the PDF or base64-encoded content
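A sketch of preparing a local PDF as base64 for file_data. Whether the API expects a bare base64 string or the data: URI form used below is an assumption of this sketch; check the live documentation:

```python
import base64
from pathlib import Path

def pdf_to_file_data(path: str) -> str:
    """Base64-encode a local PDF for the file_data field.
    The data: URI wrapper is an assumption -- a bare base64 string
    may also be accepted; consult the current API docs."""
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:application/pdf;base64,{b64}"
```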

Image & Audio Capabilities​

RouteLLM supports rich multimodal capabilities beyond text. Use the links below to explore each capability in detail.

Images​

Analyze images as input (vision) or generate images from text prompts using dedicated generators and multimodal LLMs.

For supported models, request parameters, and code examples → Image Analysis & Generation

Audio​

The RouteLLM API supports audio understanding (speech input) and audio generation (text-to-speech) using OpenAI GPT-4o Audio models and Google Gemini TTS models.

For supported models, pricing, audio parameter reference, available voices, and code examples → Audio Capabilities

Error Handling​

The API uses standard HTTP status codes to indicate success or failure:

  • 200 OK: Request succeeded
  • 400 Bad Request: Invalid request (missing parameters, invalid format, etc.)
  • 401 Unauthorized: Missing or invalid API key
  • 429 Too Many Requests: Rate limit exceeded
  • 500 Internal Server Error: Server error

Error Response Format​

{
  "error": {
    "message": "The 'messages' parameter is missing, empty, or not a list.",
    "type": "ValidationError",
    "code": "invalid_request_error"
  }
}

Common error scenarios:

  • Missing required messages parameter
  • Empty messages array
  • Missing role or content in message objects
  • Invalid role value (must be "user", "assistant", or "system")
  • Invalid model name
  • Rate limit exceeded

Code Examples​

Basic Request​

from openai import OpenAI

client = OpenAI(
    base_url="<your base url>",
    api_key="<your_api_key>",
)

response = client.chat.completions.create(
    model="route-llm",
    messages=[
        {"role": "user", "content": "What is the meaning of life?"}
    ]
)

print(response.choices[0].message.content)

Streaming Request​

from openai import OpenAI

client = OpenAI(
    base_url="<your base url>",
    api_key="<your_api_key>",
)

stream = client.chat.completions.create(
    model="route-llm",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

Conversation with History​

from openai import OpenAI

client = OpenAI(
    base_url="<your base url>",
    api_key="<your_api_key>",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Alice."},
    {"role": "assistant", "content": "Nice to meet you, Alice! How can I help you today?"},
    {"role": "user", "content": "What's my name?"}
]

response = client.chat.completions.create(
    model="route-llm",
    messages=messages,
    temperature=0.7,
    max_tokens=150
)

print(response.choices[0].message.content)

JSON Mode​

from openai import OpenAI
import json

client = OpenAI(
    base_url="<your base url>",
    api_key="<your_api_key>",
)

response = client.chat.completions.create(
    model="route-llm",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs JSON."
        },
        {
            "role": "user",
            "content": "Return a JSON object with keys 'name', 'age', and 'city'."
        }
    ],
    response_format={"type": "json_object"},
    temperature=0.7
)

content = response.choices[0].message.content
data = json.loads(content)
print(data)

Structured Output (JSON Schema)​

from openai import OpenAI
import json

client = OpenAI(
    base_url="<your base url>",
    api_key="<your_api_key>",
)

response = client.chat.completions.create(
    model="route-llm",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs JSON."
        },
        {
            "role": "user",
            "content": "Extract the name, age, and city from: 'Alice is 30 years old and lives in Paris.'"
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person_info",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "city": {"type": "string"}
                },
                "required": ["name", "age", "city"],
                "additionalProperties": False
            }
        }
    }
)

data = json.loads(response.choices[0].message.content)
print(data)  # {"name": "Alice", "age": 30, "city": "Paris"}

With Optional Parameters​

from openai import OpenAI

client = OpenAI(
    base_url="<your base url>",
    api_key="<your_api_key>",
)

response = client.chat.completions.create(
    model="route-llm",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about programming."}
    ],
    max_tokens=100,
    temperature=0.8,
    top_p=0.9
)

print(response.choices[0].message.content)

Best Practices​

  1. Use route-llm for most cases: Let the system choose the optimal model automatically
  2. Include conversation history: Provide full message history for better context
  3. Set appropriate max_tokens: Prevent unnecessarily long responses
  4. Use streaming for long responses: Improve user experience with real-time output
  5. Handle errors gracefully: Implement retry logic for transient errors
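For point 5, a minimal backoff sketch. ApiError is a hypothetical stand-in for whatever exception your HTTP client raises; the retryable status codes follow the Error Handling section above:

```python
import time

class ApiError(Exception):
    """Hypothetical stand-in for your HTTP client's error type;
    carries the HTTP status code of the failed request."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def with_retries(call, max_attempts=4, base_delay=1.0, retryable=(429, 500)):
    """Retry transient failures with exponential backoff (1s, 2s, 4s, ...);
    re-raise non-transient errors such as 400 or 401 immediately."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ApiError as err:
            if err.status not in retryable or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```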