Local Inference
Built-in model registry with auto-quantization — just pick a model, OpenWalrus handles the rest.
OpenWalrus ships with a curated model registry. Pick a model by name, and the runtime auto-selects the best quantization for your hardware. No API keys, no manual configuration.
Quick start
Set your default model in `walrus.toml`:

```toml
[model]
default = "qwen3-4b"
```

That's it. OpenWalrus looks up `qwen3-4b` in the registry, checks your system RAM, selects the optimal quantization, and loads the model asynchronously. The runtime starts in under 10 ms; the model downloads and loads in the background.
Model registry
The registry is compiled into the binary at build time, with platform-specific entries for CPU (GGUF), Metal, and CUDA.
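A compiled-in registry amounts to a const table of entries baked into the binary. The sketch below illustrates the idea; the struct and field names (`ModelEntry`, `min_ram_gb`, `lookup`) are assumptions for illustration, not OpenWalrus internals:

```rust
/// One registry entry, known at compile time.
/// Field names here are illustrative, not OpenWalrus's actual types.
struct ModelEntry {
    key: &'static str,  // registry key, e.g. "qwen3-4b"
    name: &'static str, // human-readable model name
    min_ram_gb: u64,    // minimum system RAM from the tables below
}

// Baked into the binary as a const slice; the real registry also
// carries platform-specific data (GGUF, Metal, CUDA variants).
const REGISTRY: &[ModelEntry] = &[
    ModelEntry { key: "qwen3-4b", name: "Qwen3 4B", min_ram_gb: 8 },
    ModelEntry { key: "qwen3-8b", name: "Qwen3 8B", min_ram_gb: 16 },
];

fn lookup(key: &str) -> Option<&'static ModelEntry> {
    REGISTRY.iter().find(|e| e.key == key)
}

fn main() {
    if let Some(entry) = lookup("qwen3-4b") {
        println!("{} needs at least {} GB RAM", entry.name, entry.min_ram_gb);
    }
}
```

Because the table is `const`, lookups need no I/O or parsing at startup, which is part of how the runtime can start in milliseconds.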
Text models
| Registry key | Model | Min RAM |
|---|---|---|
| qwen3-06b | Qwen3 0.6B | 4 GB |
| qwen3-17b | Qwen3 1.7B | 4 GB |
| smollm2-17b | SmolLM2 1.7B | 4 GB |
| gemma-3-1b | Gemma 3 1B | 4 GB |
| phi-4-mini-flash | Phi-4 Mini Flash | 4 GB |
| qwen3-4b | Qwen3 4B | 8 GB |
| phi-4-mini | Phi-4 Mini | 8 GB |
| gemma-3-4b | Gemma 3 4B | 8 GB |
| gemma-3n-e4b | Gemma 3n E4B | 8 GB |
| qwen3-8b | Qwen3 8B | 16 GB |
| llama31-8b | Llama 3.1 8B | 16 GB |
| mistral-7b | Mistral 7B | 16 GB |
| qwen3-14b | Qwen3 14B | 32 GB |
| qwen25-coder-14b | Qwen2.5 Coder 14B | 32 GB |
| devstral-small | Devstral Small 24B | 32 GB |
| qwen3-32b | Qwen3 32B | 64 GB |
| qwen25-coder-32b | Qwen2.5 Coder 32B | 64 GB |
Vision models
| Registry key | Model | Min RAM |
|---|---|---|
| qwen3-vl-4b | Qwen3 VL 4B | 8 GB |
| gemma-3-4b-vision | Gemma 3 4B Vision | 8 GB |
| llama32-vision-11b | Llama 3.2 Vision 11B | 16 GB |
| qwen25-vl-7b | Qwen2.5 VL 7B | 16 GB |
| qwen3-vl-32b | Qwen3 VL 32B | 32 GB |
The default model is qwen3-vl-4b (vision-capable, runs on 8 GB machines).
Auto-quantization
OpenWalrus automatically selects the best quantization level based on how much RAM would remain free once the model is loaded (its headroom):
GGUF models (CPU):
| Headroom | Quantization |
|---|---|
| ≥ 16 GB | Q8_0 (highest quality) |
| ≥ 8 GB | Q6_K |
| ≥ 4 GB | Q5_K_M |
| < 4 GB | Q4_K_M (smallest) |
Metal (Apple Silicon): AFQ8 → AFQ6 → AFQ4 by the same headroom tiers.
CUDA (NVIDIA): Q8K → Q6K → Q4K by the same headroom tiers.
You never need to think about quantization — OpenWalrus picks the best option your hardware can handle.
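The tier selection above reduces to a comparison against the headroom thresholds. Here is a minimal sketch for the GGUF (CPU) case, using the quantization names from the table; the function name `select_gguf_quant` is an illustrative assumption, not OpenWalrus's API:

```rust
/// Map RAM headroom (in GB) to a GGUF quantization level,
/// following the tier table above. Illustrative sketch only.
fn select_gguf_quant(headroom_gb: u64) -> &'static str {
    match headroom_gb {
        gb if gb >= 16 => "Q8_0", // highest quality
        gb if gb >= 8 => "Q6_K",
        gb if gb >= 4 => "Q5_K_M",
        _ => "Q4_K_M", // smallest
    }
}

fn main() {
    // A 24 GB machine running an 8 GB model leaves 16 GB of headroom.
    println!("{}", select_gguf_quant(24 - 8)); // prints "Q8_0"
}
```

The Metal and CUDA backends would map the same thresholds to their own quantization names (AFQ8/AFQ6/AFQ4 and Q8K/Q6K/Q4K respectively).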
GPU acceleration
Install with the appropriate feature flag:

```sh
# Apple Silicon (Metal)
cargo install openwalrus --features local,metal

# NVIDIA (CUDA)
cargo install openwalrus --features local,cuda

# CPU only (GGUF quantized)
cargo install openwalrus --features local
```

The install wizard (`curl -sSL https://openwalrus.xyz/install | sh`) auto-detects your platform and enables the right features.
Async model loading
Models load asynchronously on a dedicated thread. The runtime starts immediately and is ready to accept connections while the model downloads (if needed) and initializes in the background. Once loaded, it transitions to a ready state — no blocking, no warm-up delays.
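The load-in-background pattern described above can be sketched with a dedicated thread and a shared ready flag. This is a minimal illustration, not OpenWalrus's actual loader (`spawn_loader` is a hypothetical name, and the real loader also downloads weights and reports progress):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Spawn a dedicated loader thread that flips `ready` once the
/// model is initialized. Illustrative sketch of the pattern only.
fn spawn_loader(ready: Arc<AtomicBool>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // ... download and initialize the model here ...
        ready.store(true, Ordering::Release);
    })
}

fn main() {
    let ready = Arc::new(AtomicBool::new(false));
    let loader = spawn_loader(Arc::clone(&ready));

    // The runtime can accept connections here while loading proceeds;
    // requests that need the model check (or await) the ready flag.
    loader.join().unwrap();
    assert!(ready.load(Ordering::Acquire));
}
```

Keeping the flag atomic lets request handlers poll readiness without locking, so startup is never blocked on model initialization.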
What's next
- Remote providers — use cloud APIs when you need more power
- Providers overview — API standards and provider management
- Configuration — full config reference