Architecture
Design principles, workspace layout, and request flow through the CrabLLM gateway.
Principles
- Simplicity over abstraction. No trait where a function suffices.
- Single responsibility. Each crate has one focused job.
- OpenAI as canonical format. Providers translate to/from it.
- Streaming first-class. Never buffer a full response when streaming.
- Configuration-driven. Provider setup and routing from config, not code.
- Minimal gateway latency. Avoid hot-path allocations.
Workspace layout
```
crabllm/
  crates/
    crabllm/   — binary (serve, init, openapi subcommands)
    crabctl/   — admin CLI for managing a running gateway
    core/      — shared types, config, errors
    provider/  — provider enum + translation modules
    proxy/     — HTTP server, routing, extensions, admin API
    mlx/       — Apple Silicon local inference via MLX
    llamacpp/  — cross-platform local inference via llama.cpp
    bench/     — benchmark mock backend
```
Crates
crabllm
Binary entry point. Three subcommands:
- `serve` (default) — loads the TOML config, builds the provider registry, initializes storage and extensions, and starts the Axum HTTP server. Flags: `--config`, `--bind`, `-v`/`-vv`/`-vvv` for verbosity.
- `init` — generates a starter `crabllm.toml` in the current directory.
- `openapi` — dumps the OpenAPI spec as JSON or as a self-contained Scalar HTML page.
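For orientation, a minimal config might look like the sketch below. Every key name here is an assumption for illustration, not the gateway's actual schema — consult the file generated by `crabllm init` for the real layout:

```toml
# Hypothetical sketch only -- real key names may differ.
bind = "127.0.0.1:8080"

[[providers]]
name = "openai"
api_key = "${OPENAI_API_KEY}"   # env var interpolation

[[models]]
name = "gpt-4o"
deployments = [{ provider = "openai", weight = 1 }]
```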
crabctl
Admin CLI for managing a running gateway over HTTP. Supports key management (keys list|create|get|update|delete), provider management (providers list|create|get|update|delete), usage/budget/logs queries, and cache clearing. See Management.
core
Shared types with no business logic. Contains:
- Config — `GatewayConfig` with env var interpolation.
- Types — OpenAI-compatible wire-format structs (request, response, chunk).
- Provider trait — async trait with methods for chat, streaming, embeddings, images, audio. Uses RPITIT for zero-cost dispatch.
- Error — error enum with transient detection for retry logic.
- Storage — async KV trait with memory, SQLite, and Redis backends.
- Extension — hook trait for the request pipeline.
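The transient-detection idea behind the Error item above can be sketched as follows. Variant names are illustrative assumptions, not the actual `core` error type:

```rust
// Sketch of a transient-aware error enum. Variant names are hypothetical;
// the real crabllm core error type may differ.
use std::time::Duration;

#[derive(Debug)]
enum GatewayError {
    RateLimited { retry_after: Option<Duration> }, // e.g. HTTP 429
    Upstream { status: u16 },                      // provider-side failure
    Timeout,
    InvalidRequest(String),                        // caller's fault: never retry
}

impl GatewayError {
    /// Transient errors are safe to retry, on the same or another deployment.
    fn is_transient(&self) -> bool {
        match self {
            GatewayError::RateLimited { .. } | GatewayError::Timeout => true,
            GatewayError::Upstream { status } => *status >= 500,
            GatewayError::InvalidRequest(_) => false,
        }
    }
}

fn main() {
    assert!(GatewayError::Timeout.is_transient());
    assert!(!GatewayError::InvalidRequest("bad model".into()).is_transient());
    println!("transient detection ok");
}
```

Keeping the retry decision on the error type itself lets the proxy's retry loop stay provider-agnostic.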
provider
Provider dispatch. ProviderRegistry maps model names to weighted deployment lists. Supports alias resolution, weighted random selection, and per-model provider lookup. Generic over P: Provider so it unifies remote APIs, MLX, and llama.cpp.
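The weighted selection can be sketched as a walk over cumulative weights. This is a simplified, deterministic version with hypothetical types — the real `ProviderRegistry` draws its roll from an RNG and has a different API:

```rust
// Simplified sketch of weighted deployment selection. Types are
// illustrative, not the actual ProviderRegistry API.
#[derive(Debug)]
struct Deployment {
    name: &'static str,
    weight: u32,
}

/// Pick a deployment by walking cumulative weights. `roll` must lie in
/// [0, total_weight); it is passed in so the logic stays deterministic
/// and testable (a real registry would draw it from an RNG).
fn pick(deployments: &[Deployment], roll: u32) -> Option<&Deployment> {
    let mut acc = 0;
    for d in deployments {
        acc += d.weight;
        if roll < acc {
            return Some(d);
        }
    }
    None
}

fn main() {
    let pool = [
        Deployment { name: "openai/gpt-4o", weight: 3 },
        Deployment { name: "azure/gpt-4o", weight: 1 },
    ];
    // Rolls 0..=2 land on the first deployment, roll 3 on the second.
    assert_eq!(pick(&pool, 2).unwrap().name, "openai/gpt-4o");
    assert_eq!(pick(&pool, 3).unwrap().name, "azure/gpt-4o");
    println!("weighted pick ok");
}
```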
proxy
Axum HTTP server. Route handlers implement retry + fallback across deployments. Auth middleware validates virtual keys. Five built-in extensions run as in-handler hooks. Admin API routes at /v1/admin/* for dynamic key and provider management. OpenAPI/Scalar docs at /docs and /openapi.json when enabled.
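The retry-plus-fallback shape can be sketched as below — a synchronous simplification with invented names, not the proxy crate's actual handler code:

```rust
// Sketch of retry + fallback across deployments: retry transient failures
// per deployment, fall through to the next deployment when retries are
// exhausted, and abort immediately on permanent errors. Names are
// illustrative assumptions.
#[derive(Debug, PartialEq)]
enum CallError { Transient, Permanent }

fn call_with_fallback<F>(
    deployments: &[&str],
    max_attempts: usize,
    mut call: F,
) -> Result<String, CallError>
where
    F: FnMut(&str) -> Result<String, CallError>,
{
    let mut last = CallError::Transient;
    for dep in deployments {
        for _ in 0..max_attempts {
            match call(dep) {
                Ok(resp) => return Ok(resp),
                // A permanent error (bad request, auth) will fail everywhere.
                Err(CallError::Permanent) => return Err(CallError::Permanent),
                Err(e) => last = e,
            }
        }
    }
    Err(last)
}

fn main() {
    // The primary deployment always fails transiently; the backup succeeds.
    let result = call_with_fallback(&["primary", "backup"], 2, |dep| {
        if dep == "primary" {
            Err(CallError::Transient)
        } else {
            Ok(format!("ok from {dep}"))
        }
    });
    assert_eq!(result.unwrap(), "ok from backup");
    println!("fallback ok");
}
```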
mlx
Local inference on Apple Silicon. Thin Rust wrapper around a Swift static library using the MLX framework. Multi-model cache with idle eviction. Supports chat completions (streaming + non-streaming) with tool calling. macOS and iOS only — stubs out on other platforms.
llamacpp
Cross-platform local inference. Manages the lifecycle of spawned llama-server processes — auto-downloads the binary, pulls models from the Ollama registry, spawns per-model servers on demand, and evicts idle servers.
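The idle-eviction bookkeeping used by both local-inference crates can be sketched roughly as follows. Types and the TTL are illustrative, not the actual `llamacpp` crate API:

```rust
// Sketch of idle-eviction bookkeeping for spawned llama-server processes.
// Names and thresholds are hypothetical.
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct ServerPool {
    /// model name -> time the server last handled a request
    last_used: HashMap<String, Instant>,
    idle_ttl: Duration,
}

impl ServerPool {
    fn touch(&mut self, model: &str) {
        self.last_used.insert(model.to_string(), Instant::now());
    }

    /// Drop entries idle longer than the TTL. A real pool would also
    /// kill the corresponding child process here.
    fn evict_idle(&mut self, now: Instant) -> Vec<String> {
        let ttl = self.idle_ttl;
        let evicted: Vec<String> = self
            .last_used
            .iter()
            .filter(|(_, t)| now.duration_since(**t) > ttl)
            .map(|(m, _)| m.clone())
            .collect();
        for m in &evicted {
            self.last_used.remove(m);
        }
        evicted
    }
}

fn main() {
    let mut pool = ServerPool {
        last_used: HashMap::new(),
        idle_ttl: Duration::from_secs(300),
    };
    pool.touch("llama3");
    // Pretend ten minutes pass without traffic.
    let later = Instant::now() + Duration::from_secs(600);
    let evicted = pool.evict_idle(later);
    assert_eq!(evicted, vec!["llama3".to_string()]);
    println!("evicted: {evicted:?}");
}
```

Running eviction against an explicit `now` keeps the policy easy to test without sleeping.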