Local Inference
Built-in model registry with auto-quantization — just pick a model, OpenWalrus handles the rest.
OpenWalrus ships with a curated model registry. Pick a model by name, and the runtime auto-selects the best quantization for your hardware. No API keys, no manual configuration.
Quick start
Set your default model in `walrus.toml`:

```toml
[model]
default = "qwen3-4b"
```

That's it. OpenWalrus looks up `qwen3-4b` in the registry, checks your system RAM, selects the optimal quantization, and loads the model asynchronously. The runtime starts in under 10 ms; the model downloads and loads in the background.
Model registry
The registry is compiled into the binary at build time, with platform-specific entries for CPU (GGUF), Metal, and CUDA.
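A compiled-in registry amounts to a const table of entries baked into the binary. The sketch below illustrates the idea; the struct and field names (`ModelEntry`, `min_ram_gb`, `lookup`) are assumptions for illustration, not OpenWalrus internals:

```rust
/// One registry entry, known at compile time.
/// Field names here are illustrative, not OpenWalrus's actual types.
struct ModelEntry {
    key: &'static str,  // registry key, e.g. "qwen3-4b"
    name: &'static str, // human-readable model name
    min_ram_gb: u64,    // minimum system RAM from the tables below
}

// Baked into the binary as a const slice; the real registry also
// carries platform-specific data (GGUF, Metal, CUDA variants).
const REGISTRY: &[ModelEntry] = &[
    ModelEntry { key: "qwen3-4b", name: "Qwen3 4B", min_ram_gb: 8 },
    ModelEntry { key: "qwen3-8b", name: "Qwen3 8B", min_ram_gb: 16 },
];

fn lookup(key: &str) -> Option<&'static ModelEntry> {
    REGISTRY.iter().find(|e| e.key == key)
}

fn main() {
    if let Some(entry) = lookup("qwen3-4b") {
        println!("{} needs at least {} GB RAM", entry.name, entry.min_ram_gb);
    }
}
```

Because the table is `const`, lookups need no I/O or parsing at startup, which is part of how the runtime can start in milliseconds.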
Text models
| Registry key | Model | Min RAM |
|---|---|---|
| qwen3-06b | Qwen3 0.6B | 4 GB |
| qwen3-17b | Qwen3 1.7B | 4 GB |
| smollm2-17b | SmolLM2 1.7B | 4 GB |
| gemma-3-1b | Gemma 3 1B | 4 GB |
| phi-4-mini-flash | Phi-4 Mini Flash | 4 GB |
| qwen3-4b | Qwen3 4B | 8 GB |
| phi-4-mini | Phi-4 Mini | 8 GB |
| gemma-3-4b | Gemma 3 4B | 8 GB |
| gemma-3n-e4b | Gemma 3n E4B | 8 GB |
| qwen3-8b | Qwen3 8B | 16 GB |
| llama31-8b | Llama 3.1 8B | 16 GB |
| mistral-7b | Mistral 7B | 16 GB |
| qwen3-14b | Qwen3 14B | 32 GB |
| qwen25-coder-14b | Qwen2.5 Coder 14B | 32 GB |
| devstral-small | Devstral Small 24B | 32 GB |
| qwen3-32b | Qwen3 32B | 64 GB |
| qwen25-coder-32b | Qwen2.5 Coder 32B | 64 GB |
Vision models
| Registry key | Model | Min RAM |
|---|---|---|
| qwen3-vl-4b | Qwen3 VL 4B | 8 GB |
| gemma-3-4b-vision | Gemma 3 4B Vision | 8 GB |
| llama32-vision-11b | Llama 3.2 Vision 11B | 16 GB |
| qwen25-vl-7b | Qwen2.5 VL 7B | 16 GB |
| qwen3-vl-32b | Qwen3 VL 32B | 32 GB |
The default model is qwen3-vl-4b (vision-capable, runs on 8 GB machines).
Auto-quantization
OpenWalrus automatically selects the best quantization level based on how much RAM would remain free once the model is loaded (its headroom):
GGUF models (CPU):
| Headroom | Quantization |
|---|---|
| ≥ 16 GB | Q8_0 (highest quality) |
| ≥ 8 GB | Q6_K |
| ≥ 4 GB | Q5_K_M |
| < 4 GB | Q4_K_M (smallest) |
Metal (Apple Silicon): AFQ8 → AFQ6 → AFQ4 by the same headroom tiers.
CUDA (NVIDIA): Q8K → Q6K → Q4K by the same headroom tiers.
You never need to think about quantization — OpenWalrus picks the best option your hardware can handle.
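The tier selection above reduces to a comparison against the headroom thresholds. Here is a minimal sketch for the GGUF (CPU) case, using the quantization names from the table; the function name `select_gguf_quant` is an illustrative assumption, not OpenWalrus's API:

```rust
/// Map RAM headroom (in GB) to a GGUF quantization level,
/// following the tier table above. Illustrative sketch only.
fn select_gguf_quant(headroom_gb: u64) -> &'static str {
    match headroom_gb {
        gb if gb >= 16 => "Q8_0", // highest quality
        gb if gb >= 8 => "Q6_K",
        gb if gb >= 4 => "Q5_K_M",
        _ => "Q4_K_M", // smallest
    }
}

fn main() {
    // A 24 GB machine running an 8 GB model leaves 16 GB of headroom.
    println!("{}", select_gguf_quant(24 - 8)); // prints "Q8_0"
}
```

The Metal and CUDA backends would map the same thresholds to their own quantization names (AFQ8/AFQ6/AFQ4 and Q8K/Q6K/Q4K respectively).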
GPU acceleration
Install with the appropriate feature flag:

```sh
# Apple Silicon (Metal)
cargo install openwalrus --features local,metal

# NVIDIA (CUDA)
cargo install openwalrus --features local,cuda

# CPU only (GGUF quantized)
cargo install openwalrus --features local
```

The install wizard (`curl -sSL https://openwalrus.xyz/install | sh`) auto-detects your platform and enables the right features.
Async model loading
Models load asynchronously on a dedicated thread. The runtime starts immediately and is ready to accept connections while the model downloads (if needed) and initializes in the background. Once loaded, it transitions to a ready state — no blocking, no warm-up delays.
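The load-in-background pattern described above can be sketched with a dedicated thread and a shared ready flag. This is a minimal illustration, not OpenWalrus's actual loader (`spawn_loader` is a hypothetical name, and the real loader also downloads weights and reports progress):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Spawn a dedicated loader thread that flips `ready` once the
/// model is initialized. Illustrative sketch of the pattern only.
fn spawn_loader(ready: Arc<AtomicBool>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // ... download and initialize the model here ...
        ready.store(true, Ordering::Release);
    })
}

fn main() {
    let ready = Arc::new(AtomicBool::new(false));
    let loader = spawn_loader(Arc::clone(&ready));

    // The runtime can accept connections here while loading proceeds;
    // requests that need the model check (or await) the ready flag.
    loader.join().unwrap();
    assert!(ready.load(Ordering::Acquire));
}
```

Keeping the flag atomic lets request handlers poll readiness without locking, so startup is never blocked on model initialization.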
What's next
- Remote providers — use cloud APIs when you need more power
- Providers overview — API standards and provider management
- Configuration — full config reference