OpenWalrus

Local Inference

Built-in model registry with auto-quantization — just pick a model, OpenWalrus handles the rest.

OpenWalrus ships with a curated model registry. Pick a model by name, and the runtime auto-selects the best quantization for your hardware. No API keys, no manual configuration.

Quick start

Set your default model in walrus.toml:

[model]
default = "qwen3-4b"

That's it. OpenWalrus looks up qwen3-4b in the registry, checks your system RAM, selects the optimal quantization, and loads the model asynchronously. The runtime starts in under 10 ms — the model downloads and loads in the background.

Model registry

The registry is compiled into the binary at build time, with platform-specific entries for CPU (GGUF), Metal, and CUDA.
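A build-time registry like this can be pictured as a static table of entries keyed by name. The sketch below is illustrative only — the struct, field names, and lookup function are assumptions, not OpenWalrus's actual internals:

```rust
/// One registry entry: a lookup key, a display name, and a RAM floor.
struct ModelEntry {
    key: &'static str,
    name: &'static str,
    min_ram_gb: u32,
}

/// Compiled into the binary; abbreviated to two entries for the sketch.
static REGISTRY: &[ModelEntry] = &[
    ModelEntry { key: "qwen3-4b", name: "Qwen3 4B", min_ram_gb: 8 },
    ModelEntry { key: "qwen3-vl-4b", name: "Qwen3 VL 4B", min_ram_gb: 8 },
];

/// Resolve a registry key to its entry, if present.
fn lookup(key: &str) -> Option<&'static ModelEntry> {
    REGISTRY.iter().find(|e| e.key == key)
}

fn main() {
    let entry = lookup("qwen3-4b").expect("unknown model key");
    println!("{} needs at least {} GB RAM", entry.name, entry.min_ram_gb);
}
```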

Text models

| Registry key | Model | Min RAM |
| --- | --- | --- |
| qwen3-06b | Qwen3 0.6B | 4 GB |
| qwen3-17b | Qwen3 1.7B | 4 GB |
| smollm2-17b | SmolLM2 1.7B | 4 GB |
| gemma-3-1b | Gemma 3 1B | 4 GB |
| phi-4-mini-flash | Phi-4 Mini Flash | 4 GB |
| qwen3-4b | Qwen3 4B | 8 GB |
| phi-4-mini | Phi-4 Mini | 8 GB |
| gemma-3-4b | Gemma 3 4B | 8 GB |
| gemma-3n-e4b | Gemma 3n E4B | 8 GB |
| qwen3-8b | Qwen3 8B | 16 GB |
| llama31-8b | Llama 3.1 8B | 16 GB |
| mistral-7b | Mistral 7B | 16 GB |
| qwen3-14b | Qwen3 14B | 32 GB |
| qwen25-coder-14b | Qwen2.5 Coder 14B | 32 GB |
| devstral-small | Devstral Small 24B | 32 GB |
| qwen3-32b | Qwen3 32B | 64 GB |
| qwen25-coder-32b | Qwen2.5 Coder 32B | 64 GB |

Vision models

| Registry key | Model | Min RAM |
| --- | --- | --- |
| qwen3-vl-4b | Qwen3 VL 4B | 8 GB |
| gemma-3-4b-vision | Gemma 3 4B Vision | 8 GB |
| llama32-vision-11b | Llama 3.2 Vision 11B | 16 GB |
| qwen25-vl-7b | Qwen2.5 VL 7B | 16 GB |
| qwen3-vl-32b | Qwen3 VL 32B | 32 GB |

The default model is qwen3-vl-4b (vision-capable, runs on 8 GB machines).

Auto-quantization

OpenWalrus automatically selects the best quantization level based on how much RAM would remain free after the model loads (the headroom):

GGUF models (CPU):

| Headroom | Quantization |
| --- | --- |
| ≥ 16 GB | Q8_0 (highest quality) |
| ≥ 8 GB | Q6_K |
| ≥ 4 GB | Q5_K_M |
| < 4 GB | Q4_K_M (smallest) |

Metal (Apple Silicon): AFQ8 → AFQ6 → AFQ4 by the same headroom tiers.

CUDA (NVIDIA): Q8K → Q6K → Q4K by the same headroom tiers.

You never need to think about quantization — OpenWalrus picks the best option your hardware can handle.
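The GGUF tiers above boil down to a simple threshold cascade. A minimal sketch (the function name is hypothetical; OpenWalrus's real selection logic may weigh more factors):

```rust
/// Pick a GGUF quantization level from available headroom in GiB,
/// following the tiers in the table above (Q8_0 down to Q4_K_M).
fn select_gguf_quant(headroom_gb: f64) -> &'static str {
    if headroom_gb >= 16.0 {
        "Q8_0" // highest quality
    } else if headroom_gb >= 8.0 {
        "Q6_K"
    } else if headroom_gb >= 4.0 {
        "Q5_K_M"
    } else {
        "Q4_K_M" // smallest
    }
}

fn main() {
    // 12 GiB of headroom falls in the >= 8 GiB tier.
    println!("{}", select_gguf_quant(12.0)); // Q6_K
}
```

The Metal and CUDA backends would use the same cascade with their own level names (AFQ8/AFQ6/AFQ4 and Q8K/Q6K/Q4K respectively).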

GPU acceleration

Install with the appropriate feature flag:

# Apple Silicon (Metal)
cargo install openwalrus --features local,metal

# NVIDIA (CUDA)
cargo install openwalrus --features local,cuda

# CPU only (GGUF quantized)
cargo install openwalrus --features local

The install wizard (curl -sSL https://openwalrus.xyz/install | sh) auto-detects your platform and enables the right features.

Async model loading

Models load asynchronously on a dedicated thread. The runtime starts immediately and is ready to accept connections while the model downloads (if needed) and initializes in the background. Once the model is loaded, the runtime transitions to a ready state — no blocking, no warm-up delays.
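The pattern can be sketched with a background thread and a shared readiness flag — a std-only illustration, not OpenWalrus's actual loader:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Spawn a dedicated loader thread; flips `ready` once "loading" finishes.
fn load_in_background(ready: Arc<AtomicBool>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // Stand-in for downloading and initializing the model.
        thread::sleep(Duration::from_millis(50));
        ready.store(true, Ordering::Release);
    })
}

fn main() {
    let ready = Arc::new(AtomicBool::new(false));
    let loader = load_in_background(Arc::clone(&ready));

    // The runtime could accept connections here while loading continues.
    loader.join().unwrap();
    assert!(ready.load(Ordering::Acquire));
    println!("model ready");
}
```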
