Architecture
Design principles, workspace layout, and request flow through the CrabLLM gateway.
Principles
- Simplicity over abstraction. No trait where a function suffices.
- Single responsibility. Each crate has one focused job.
- OpenAI as canonical format. Providers translate to/from it.
- Streaming first-class. Never buffer a full response when streaming.
- Configuration-driven. Provider setup and routing from config, not code.
- Minimal gateway latency. Avoid hot-path allocations.
Workspace layout
```
crabllm/
  crates/
    crabllm/   — binary (serve, init, openapi subcommands)
    crabctl/   — admin CLI for managing a running gateway
    core/      — shared types, config, errors
    provider/  — provider enum + translation modules
    proxy/     — HTTP server, routing, extensions, admin API
    mlx/       — Apple Silicon local inference via MLX
    llamacpp/  — cross-platform local inference via llama.cpp
    bench/     — benchmark mock backend
```
Crates
crabllm
Binary entry point. Three subcommands:
- `serve` (default) — loads the TOML config, builds the provider registry, initializes storage and extensions, and starts the Axum HTTP server. Flags: `--config`, `--bind`, `-v`/`-vv`/`-vvv` for verbosity.
- `init` — generates a starter `crabllm.toml` in the current directory.
- `openapi` — dumps the OpenAPI spec as JSON or as a self-contained Scalar HTML page.
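For orientation, a minimal config might look like the sketch below. Every key name here is an assumption for illustration, not the gateway's actual schema — consult the file generated by `crabllm init` for the real layout:

```toml
# Hypothetical sketch only -- real key names may differ.
bind = "127.0.0.1:8080"

[[providers]]
name = "openai"
api_key = "${OPENAI_API_KEY}"   # env var interpolation

[[models]]
name = "gpt-4o"
deployments = [{ provider = "openai", weight = 1 }]
```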
crabctl
Admin CLI for managing a running gateway over HTTP. Supports key management (keys list|create|get|update|delete), provider management (providers list|create|get|update|delete), usage/budget/logs queries, and cache clearing. See Management.
core
Shared types with no business logic. Contains:
- Config — `GatewayConfig` with env var interpolation.
- Types — OpenAI-compatible wire-format structs (request, response, chunk).
- Provider trait — async trait with methods for chat, streaming, embeddings, images, audio. Uses RPITIT for zero-cost dispatch.
- Error — error enum with transient detection for retry logic.
- Storage — async KV trait with memory, SQLite, and Redis backends.
- Extension — hook trait for the request pipeline.
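The transient-detection idea behind the Error item above can be sketched as follows. Variant names are illustrative assumptions, not the actual `core` error type:

```rust
// Sketch of a transient-aware error enum. Variant names are hypothetical;
// the real crabllm core error type may differ.
use std::time::Duration;

#[derive(Debug)]
enum GatewayError {
    RateLimited { retry_after: Option<Duration> }, // e.g. HTTP 429
    Upstream { status: u16 },                      // provider-side failure
    Timeout,
    InvalidRequest(String),                        // caller's fault: never retry
}

impl GatewayError {
    /// Transient errors are safe to retry, on the same or another deployment.
    fn is_transient(&self) -> bool {
        match self {
            GatewayError::RateLimited { .. } | GatewayError::Timeout => true,
            GatewayError::Upstream { status } => *status >= 500,
            GatewayError::InvalidRequest(_) => false,
        }
    }
}

fn main() {
    assert!(GatewayError::Timeout.is_transient());
    assert!(!GatewayError::InvalidRequest("bad model".into()).is_transient());
    println!("transient detection ok");
}
```

Keeping the retry decision on the error type itself lets the proxy's retry loop stay provider-agnostic.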
provider
Provider dispatch. ProviderRegistry maps model names to weighted deployment lists. Supports alias resolution, weighted random selection, and per-model provider lookup. Generic over P: Provider so it unifies remote APIs, MLX, and llama.cpp.
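The weighted selection can be sketched as a walk over cumulative weights. This is a simplified, deterministic version with hypothetical types — the real `ProviderRegistry` draws its roll from an RNG and has a different API:

```rust
// Simplified sketch of weighted deployment selection. Types are
// illustrative, not the actual ProviderRegistry API.
#[derive(Debug)]
struct Deployment {
    name: &'static str,
    weight: u32,
}

/// Pick a deployment by walking cumulative weights. `roll` must lie in
/// [0, total_weight); it is passed in so the logic stays deterministic
/// and testable (a real registry would draw it from an RNG).
fn pick(deployments: &[Deployment], roll: u32) -> Option<&Deployment> {
    let mut acc = 0;
    for d in deployments {
        acc += d.weight;
        if roll < acc {
            return Some(d);
        }
    }
    None
}

fn main() {
    let pool = [
        Deployment { name: "openai/gpt-4o", weight: 3 },
        Deployment { name: "azure/gpt-4o", weight: 1 },
    ];
    // Rolls 0..=2 land on the first deployment, roll 3 on the second.
    assert_eq!(pick(&pool, 2).unwrap().name, "openai/gpt-4o");
    assert_eq!(pick(&pool, 3).unwrap().name, "azure/gpt-4o");
    println!("weighted pick ok");
}
```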
proxy
Axum HTTP server. Route handlers implement retry + fallback across deployments. Auth middleware validates virtual keys. Five built-in extensions run as in-handler hooks. Admin API routes at /v1/admin/* for dynamic key and provider management. OpenAPI/Scalar docs at /docs and /openapi.json when enabled.
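The retry-plus-fallback shape can be sketched as below — a synchronous simplification with invented names, not the proxy crate's actual handler code:

```rust
// Sketch of retry + fallback across deployments: retry transient failures
// per deployment, fall through to the next deployment when retries are
// exhausted, and abort immediately on permanent errors. Names are
// illustrative assumptions.
#[derive(Debug, PartialEq)]
enum CallError { Transient, Permanent }

fn call_with_fallback<F>(
    deployments: &[&str],
    max_attempts: usize,
    mut call: F,
) -> Result<String, CallError>
where
    F: FnMut(&str) -> Result<String, CallError>,
{
    let mut last = CallError::Transient;
    for dep in deployments {
        for _ in 0..max_attempts {
            match call(dep) {
                Ok(resp) => return Ok(resp),
                // A permanent error (bad request, auth) will fail everywhere.
                Err(CallError::Permanent) => return Err(CallError::Permanent),
                Err(e) => last = e,
            }
        }
    }
    Err(last)
}

fn main() {
    // The primary deployment always fails transiently; the backup succeeds.
    let result = call_with_fallback(&["primary", "backup"], 2, |dep| {
        if dep == "primary" {
            Err(CallError::Transient)
        } else {
            Ok(format!("ok from {dep}"))
        }
    });
    assert_eq!(result.unwrap(), "ok from backup");
    println!("fallback ok");
}
```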
mlx
Local inference on Apple Silicon. Thin Rust wrapper around a Swift static library using the MLX framework. Multi-model cache with idle eviction. Supports chat completions (streaming + non-streaming) with tool calling. macOS and iOS only — stubs out on other platforms.
llamacpp
Cross-platform local inference. Manages the lifecycle of spawned llama-server processes — auto-downloads the binary, pulls models from the Ollama registry, spawns per-model servers on demand, and evicts idle servers.
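The idle-eviction bookkeeping used by both local-inference crates can be sketched roughly as follows. Types and the TTL are illustrative, not the actual `llamacpp` crate API:

```rust
// Sketch of idle-eviction bookkeeping for spawned llama-server processes.
// Names and thresholds are hypothetical.
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct ServerPool {
    /// model name -> time the server last handled a request
    last_used: HashMap<String, Instant>,
    idle_ttl: Duration,
}

impl ServerPool {
    fn touch(&mut self, model: &str) {
        self.last_used.insert(model.to_string(), Instant::now());
    }

    /// Drop entries idle longer than the TTL. A real pool would also
    /// kill the corresponding child process here.
    fn evict_idle(&mut self, now: Instant) -> Vec<String> {
        let ttl = self.idle_ttl;
        let evicted: Vec<String> = self
            .last_used
            .iter()
            .filter(|(_, t)| now.duration_since(**t) > ttl)
            .map(|(m, _)| m.clone())
            .collect();
        for m in &evicted {
            self.last_used.remove(m);
        }
        evicted
    }
}

fn main() {
    let mut pool = ServerPool {
        last_used: HashMap::new(),
        idle_ttl: Duration::from_secs(300),
    };
    pool.touch("llama3");
    // Pretend ten minutes pass without traffic.
    let later = Instant::now() + Duration::from_secs(600);
    let evicted = pool.evict_idle(later);
    assert_eq!(evicted, vec!["llama3".to_string()]);
    println!("evicted: {evicted:?}");
}
```

Running eviction against an explicit `now` keeps the policy easy to test without sleeping.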