# Libraries
Standalone Rust crates for local inference — MLX (Apple Silicon) and llama.cpp (cross-platform).
CrabLLM includes two standalone inference libraries that can be used independently or as part of the gateway.
## crabllm-mlx
Local inference on Apple Silicon via the MLX framework. Multi-model cache with idle eviction, streaming chat completions, and tool calling support. macOS and iOS only.
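The multi-model cache with idle eviction described above can be sketched roughly as follows. This is a minimal illustration only: `ModelCache`, `CachedModel`, and their methods are assumptions for the sketch, not the crate's actual API.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// A cached model entry; in the real crate this would hold loaded weights.
struct CachedModel {
    last_used: Instant,
}

// Hypothetical multi-model cache that evicts entries idle past a TTL.
struct ModelCache {
    idle_ttl: Duration,
    models: HashMap<String, CachedModel>,
}

impl ModelCache {
    fn new(idle_ttl: Duration) -> Self {
        Self { idle_ttl, models: HashMap::new() }
    }

    // Load a model on first use; refresh its timestamp on every hit.
    fn touch(&mut self, alias: &str) {
        self.models
            .entry(alias.to_string())
            .and_modify(|m| m.last_used = Instant::now())
            .or_insert(CachedModel { last_used: Instant::now() });
    }

    // Drop models that have been idle longer than the TTL.
    fn evict_idle(&mut self, now: Instant) {
        let ttl = self.idle_ttl;
        self.models
            .retain(|_, m| now.saturating_duration_since(m.last_used) < ttl);
    }

    fn len(&self) -> usize {
        self.models.len()
    }
}
```

Keeping several models resident lets repeated requests skip reload cost, while the TTL bounds memory held by models nobody is calling.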
- Source: crates/mlx
- Platform: Apple Silicon (M1+)
- Model format: HuggingFace MLX weights (`.safetensors`)
- Model aliases: `family-param_size-quant` (e.g. `qwen3.5-2b-4bit`)
- Capabilities: chat completions, streaming, tool calling
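To illustrate the alias scheme, a parser for `family-param_size-quant` names could look like this. `parse_alias` is a hypothetical helper written for this example, not part of the crate:

```rust
/// Split an alias like `qwen3.5-2b-4bit` into (family, param size, quant).
/// Hypothetical sketch of the naming convention, not the crate's code.
fn parse_alias(alias: &str) -> Option<(String, String, String)> {
    let parts: Vec<&str> = alias.split('-').collect();
    if parts.len() < 3 {
        return None; // need at least family, param size, and quant
    }
    let quant = parts[parts.len() - 1].to_string();
    let param_size = parts[parts.len() - 2].to_string();
    // The family name itself may contain hyphens, so rejoin the lead parts.
    let family = parts[..parts.len() - 2].join("-");
    Some((family, param_size, quant))
}
```

Taking the quant and param size from the end means hyphenated family names still resolve correctly.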
## crabllm-llamacpp
Cross-platform local inference via managed llama.cpp server processes. Auto-downloads the llama-server binary, pulls GGUF models from the Ollama registry, spawns per-model servers on demand, and evicts idle servers.
- Source: crates/llamacpp
- Platform: macOS, Linux, Windows
- GPU: Metal (macOS), CUDA (Linux/Windows), CPU fallback
- Model format: GGUF via Ollama registry (`name:tag`, e.g. `llama3.2:3b`)
- Capabilities: chat completions, streaming, tool calling
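As an illustration of the `name:tag` reference format, splitting a model reference might be sketched like this. `parse_model_ref` is a hypothetical helper, and the fallback to the `latest` tag mirrors the Ollama registry's default rather than anything shown in these docs:

```rust
/// Split an Ollama-style reference like `llama3.2:3b` into (name, tag).
/// A bare name falls back to `latest`, the registry's default tag.
/// Hypothetical sketch, not the crate's actual parsing code.
fn parse_model_ref(reference: &str) -> (String, String) {
    match reference.split_once(':') {
        Some((name, tag)) => (name.to_string(), tag.to_string()),
        None => (reference.to_string(), "latest".to_string()),
    }
}
```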