Chapter 00 · What it is

One API.
Every model.

Route requests to OpenAI, Anthropic, Gemini, Azure, Bedrock, or Ollama. Sub-millisecond overhead. Single binary. No runtime.

Install · one line
cargo install crabllm crabctl
00 · Install
Fig. 00 · The Router. Requests enter as one format, pass through the gateway membrane, and fan out to provider zones.
“Same OpenAI/Anthropic format, any provider.”
Chapter 01

What you get

Six capabilities the gateway gives you out of the box. Each one a plate; each plate a sentence; the sheet reads in any order you like.

Fig. 01  ·  Feature

Provider translation

Send OpenAI format. CrabLLM translates to Anthropic, Gemini, Bedrock, and Azure automatically.
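The translation step can be sketched in Python. Anthropic's Messages API takes the system prompt as a top-level `system` field and requires `max_tokens`, while the OpenAI format carries the system prompt as a message with `role: "system"`. This is an illustrative sketch of that mapping, not CrabLLM's actual code:

```python
# Illustrative OpenAI -> Anthropic request translation (a sketch, not
# CrabLLM's implementation). Anthropic's Messages API takes the system
# prompt as a top-level field and requires max_tokens; the OpenAI format
# embeds the system prompt as a message with role "system".
def openai_to_anthropic(body: dict) -> dict:
    system_parts = [m["content"] for m in body["messages"] if m["role"] == "system"]
    return {
        "model": body["model"],
        "max_tokens": body.get("max_tokens", 1024),  # Anthropic requires this field
        "system": "\n".join(system_parts),
        "messages": [m for m in body["messages"] if m["role"] != "system"],
    }
```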

Fig. 02  ·  Feature

Routing & fallback

Weighted random selection across providers. Exponential backoff retry. Automatic failover.
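The routing policy can be sketched generically: pick a provider by weighted random selection, retry with exponentially growing delays, and fail over on each attempt. A sketch of the technique, not the gateway's Rust implementation; the provider names and weights are made up:

```python
import random
import time

# Sketch of weighted random provider selection with exponential-backoff
# retry. Illustrative only; weights and provider names are invented.
PROVIDERS = {"openai": 70, "anthropic": 30}

def pick_provider(providers: dict) -> str:
    names = list(providers)
    return random.choices(names, weights=[providers[n] for n in names], k=1)[0]

def send_with_retry(send, max_attempts: int = 4, base_delay: float = 0.25):
    """Retry a request, re-picking a provider (failover) on every attempt."""
    for attempt in range(max_attempts):
        provider = pick_provider(PROVIDERS)
        try:
            return send(provider)
        except ConnectionError:
            time.sleep(base_delay * 2 ** attempt)  # 0.25 s, 0.5 s, 1 s, ...
    raise RuntimeError("all retries exhausted")
```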

Fig. 04  ·  Feature

Virtual keys & auth

Per-key model access control. Rate limiting, usage tracking, and budget enforcement.
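What a virtual key has to check per request can be sketched as three gates: model allow-list, rolling rate window, and spend budget. An illustrative sketch of the feature, not CrabLLM's code; all names and limits here are invented:

```python
import time

# Sketch of per-virtual-key access control: model allow-list, a rolling
# 60-second rate window, and a spend budget. Illustrative only.
class VirtualKey:
    def __init__(self, allowed_models, budget_usd, rpm_limit):
        self.allowed_models = set(allowed_models)
        self.budget_usd = budget_usd
        self.rpm_limit = rpm_limit
        self.spent = 0.0
        self.window = []  # timestamps of requests in the last 60 s

    def authorize(self, model, est_cost, now=None):
        now = time.monotonic() if now is None else now
        self.window = [t for t in self.window if now - t < 60]
        if model not in self.allowed_models:
            return False, "model not allowed"
        if len(self.window) >= self.rpm_limit:
            return False, "rate limited"
        if self.spent + est_cost > self.budget_usd:
            return False, "budget exceeded"
        self.window.append(now)
        self.spent += est_cost
        return True, "ok"
```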

Chapter 02

Performance

Gateway overhead at 5,000 requests per second. All gateways run with identical resource limits (2 CPUs, 512 MB) against a mock backend that responds instantly. Full results →

Gateway    P50       P99
CrabLLM    0.26 ms   0.54 ms
Bifrost    0.61 ms   1.26 ms
LiteLLM    159 ms    227 ms
02-A · Latency
Gateway    Peak RSS
CrabLLM    34.9 MB
Bifrost    171.7 MB
LiteLLM    541.8 MB
02-B · Memory
Chapter 03

How it works

1 · Configure
listen = "0.0.0.0:8080"

[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o"]

[providers.anthropic]
kind = "anthropic"
api_key = "${ANTHROPIC_API_KEY}"
models = ["claude-sonnet-4-20250514"]
03-A · Config
2 · Run
crabllm --config crabllm.toml
03-B · Launch
3 · Send requests
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "claude-sonnet-4-20250514",
       "messages": [{"role": "user", "content": "Hello!"}]}'
03-C · Request

Same OpenAI format, any provider. CrabLLM translates automatically.
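That claim is easy to see at the payload level: switching providers changes only the `model` field of the request body. A minimal sketch, using the model names from the config above:

```python
import json

# The same OpenAI-style chat payload targets any provider behind the
# gateway; only the "model" field selects the backend.
def chat_body(model: str, prompt: str) -> dict:
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

openai_req = chat_body("gpt-4o", "Hello!")
anthropic_req = chat_body("claude-sonnet-4-20250514", "Hello!")

# Identical apart from the model name.
print(json.dumps(anthropic_req, indent=2))
```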

Frequently asked questions

How fast is it?
0.26 ms P50 at 5,000 RPS. Rust with Tokio: no GC pauses, no interpreter. The gateway is not the bottleneck.

One API. Every model.

0.26ms · single binary
© 2026 · CrabTalk