

Built-in web search: no API keys, no setup

How we gave every OpenWalrus agent web search and page fetching — with multi-engine consensus ranking, zero API keys, and zero configuration.

release · OpenWalrus Team

Agents that can't search the web are half-blind. Most frameworks solve this with API keys — SerpAPI, Tavily, Google Custom Search. You sign up, paste a key, configure a tool, hope the rate limits hold. We built it into the binary instead.

Starting today, every OpenWalrus agent has two new built-in tools: web_search and web_fetch. No API keys. No configuration. No third-party accounts.

Why not just use an API?

Search APIs work, but they come with baggage:

  • Credentials to manage. One more API key per deployment, one more secret to rotate, one more service to monitor for billing surprises.
  • Rate limits. Free tiers cap at 100–1,000 queries/day. Autonomous agents burn through that fast.
  • Cost. SerpAPI runs $50–250/mo for serious usage. Tavily charges per query. These add up alongside LLM inference costs.
  • Privacy. Every query goes to a third-party logging service. For a local-first runtime, routing agent searches through a cloud proxy defeats the point.

We wanted something simpler: search that works the moment you install walrus, with zero setup and zero ongoing cost.

The meta search approach

Instead of depending on a single search provider, we built a meta search engine — walrus-search — that queries multiple free backends in parallel, merges the results, and ranks by consensus.

The current backends are DuckDuckGo (via the Lite HTML endpoint) and Wikipedia (via the OpenSearch API). Neither requires authentication. Both are queried in parallel using tokio::task::JoinSet, so latency is bounded by the slowest engine, not the sum.
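Since the JoinSet wiring depends on the tokio runtime, here is a dependency-free sketch of the same fan-out shape using scoped OS threads. The engine list and the stub query function are placeholders, not the walrus-search internals:

```rust
use std::thread;

// Stand-in for an HTTP request to one search engine (hypothetical).
fn query(engine: &str, q: &str) -> Vec<String> {
    vec![format!("{engine}: result for '{q}'")]
}

fn search_all(q: &str) -> Vec<String> {
    let engines = ["duckduckgo", "wikipedia"];
    thread::scope(|s| {
        // Spawn one query per engine; the scope joins them all, so total
        // latency is bounded by the slowest engine, not the sum.
        let handles: Vec<_> = engines
            .iter()
            .map(|e| s.spawn(move || query(e, q)))
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let results = search_all("rust async runtime");
    assert_eq!(results.len(), 2);
}
```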

Results are deduplicated by normalized URL — stripping trailing slashes, www. prefixes, and tracking parameters (utm_*, fbclid, gclid). When the same URL appears from multiple engines, the descriptions are merged (longer one wins) and the result gets a consensus score boost.
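The normalization step can be sketched in a few lines. The function name and exact rule set here are illustrative, not the actual walrus-search code:

```rust
// Illustrative sketch of URL normalization for deduplication: strip a
// "www." prefix, trailing slashes, and known tracking parameters.
fn normalize_url(url: &str) -> String {
    // Split off the query string, if any.
    let (base, query) = match url.split_once('?') {
        Some((b, q)) => (b, Some(q)),
        None => (url, None),
    };
    // Strip the "www." host prefix and trailing slashes.
    let base = base.replacen("://www.", "://", 1);
    let base = base.trim_end_matches('/').to_string();
    // Drop utm_*, fbclid, and gclid parameters; keep the rest.
    let kept: Vec<&str> = query
        .map(|q| {
            q.split('&')
                .filter(|p| {
                    let key = p.split('=').next().unwrap_or("");
                    !key.starts_with("utm_") && key != "fbclid" && key != "gclid"
                })
                .collect()
        })
        .unwrap_or_default();
    if kept.is_empty() {
        base
    } else {
        format!("{}?{}", base, kept.join("&"))
    }
}

fn main() {
    assert_eq!(
        normalize_url("https://www.example.com/post/?utm_source=x&fbclid=abc"),
        "https://example.com/post"
    );
}
```

With URLs canonicalized this way, the same page arriving from two engines collapses to one key, which is what makes the consensus scoring below possible.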

Consensus ranking

Ranking is simple and deterministic:

  • Position score: 1.0 / (position + 1) — earlier results from each engine score higher.
  • Consensus bonus: 0.5 * (engine_count - 1) — each additional engine that returns the same URL adds 0.5 to the score.

A result that DuckDuckGo ranks #1 and Wikipedia also returns will outscore a result that only one engine knows about. No ML model, no relevance tuning, no training data. Just arithmetic that rewards agreement across independent sources.
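As a sketch, assuming the per-URL position score takes the best rank across engines (the post does not say whether per-engine scores are summed or maxed) plus the stated consensus bonus:

```rust
// Hypothetical sketch of the consensus scoring described above.
// `positions` holds the 0-based rank a URL received from each engine
// that returned it; engines that missed it contribute no entry.
fn consensus_score(positions: &[usize]) -> f64 {
    // Position score: 1.0 / (position + 1), best rank wins (assumption).
    let position_score = positions
        .iter()
        .map(|&p| 1.0 / (p as f64 + 1.0))
        .fold(0.0, f64::max);
    // Consensus bonus: 0.5 per additional engine that agrees.
    let bonus = 0.5 * (positions.len() as f64 - 1.0);
    position_score + bonus
}

fn main() {
    // Ranked #1 by one engine and #4 by another: 1.0 + 0.5 = 1.5.
    let agreed = consensus_score(&[0, 3]);
    // Ranked #1 by a single engine: 1.0, no bonus.
    let solo = consensus_score(&[0]);
    assert!(agreed > solo);
}
```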

Page fetching

web_fetch downloads a URL and extracts clean text content. It strips <script>, <style>, <nav>, <footer>, <header>, <aside>, <noscript>, <svg>, <iframe>, and <form> subtrees entirely, then walks the remaining DOM and collects text nodes with proper spacing.

The fetcher rotates through 8 realistic browser user-agent strings (Chrome, Firefox, Safari, Edge across Windows/Mac/Linux) to avoid bot detection. No headless browser, no Playwright dependency — just HTTP requests and HTML parsing.
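Round-robin rotation can be as simple as an atomic counter over a static list. This is an illustrative sketch; the strings and the counter scheme are assumptions, not the fetcher's actual internals:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Abbreviated, hypothetical user-agent pool (the real fetcher uses 8).
const USER_AGENTS: &[&str] = &[
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/124.0 Safari/537.36",
];

static NEXT: AtomicUsize = AtomicUsize::new(0);

// Each call returns the next user agent, wrapping around round-robin.
fn next_user_agent() -> &'static str {
    let i = NEXT.fetch_add(1, Ordering::Relaxed);
    USER_AGENTS[i % USER_AGENTS.len()]
}

fn main() {
    let first = next_user_agent();
    // Consume the rest of one full cycle...
    for _ in 0..USER_AGENTS.len() - 1 {
        next_user_agent();
    }
    // ...and the rotation wraps back to the same string.
    assert_eq!(next_user_agent(), first);
}
```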

Zero configuration for agents

Both tools are registered in BASE_TOOLS, which means every agent gets them automatically — even agents with scoped tool whitelists. The aggregator and fetch client initialize at daemon startup with sensible defaults:

Setting        Default
Engines        DuckDuckGo, Wikipedia
Timeout        10 seconds per engine
Max results    20
Cache TTL      5 minutes

An in-memory cache prevents redundant queries. Same search within the TTL window returns instantly.
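A cache with this behavior can be sketched as a HashMap keyed by query, with each entry timestamped at insert and ignored once it outlives the TTL. This is an illustrative sketch, not the actual implementation:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Minimal TTL-bounded in-memory cache, assuming the behavior described
// above (hypothetical names, not walrus-search internals).
struct TtlCache {
    ttl: Duration,
    entries: HashMap<String, (Instant, String)>,
}

impl TtlCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    fn insert(&mut self, query: &str, results: String) {
        self.entries.insert(query.to_string(), (Instant::now(), results));
    }

    // Returns cached results only while the entry is younger than the TTL.
    fn get(&self, query: &str) -> Option<&String> {
        self.entries
            .get(query)
            .filter(|(stored, _)| stored.elapsed() < self.ttl)
            .map(|(_, results)| results)
    }
}

fn main() {
    let mut cache = TtlCache::new(Duration::from_secs(300));
    cache.insert("rust async runtime", "serialized results".to_string());
    assert!(cache.get("rust async runtime").is_some());
    assert!(cache.get("unseen query").is_none());
}
```

Expired entries here are simply skipped on read rather than evicted eagerly; a production cache would also prune them to bound memory.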

This follows the same design principle behind the rest of OpenWalrus: batteries included, nothing to configure for the common case.

The standalone CLI

The search engine also ships as a standalone binary — wsearch — for human use and debugging:

# Search
wsearch search "rust async runtime"
wsearch search "openwalrus" --engines wikipedia
wsearch search "hello world" -n 5 --format text

# Fetch a page
wsearch fetch "https://example.com"

# List available engines
wsearch engines

Configuration lives in ~/.config/wsearch/config.toml:

engines = ["duckduckgo", "wikipedia"]
timeout_secs = 10
max_results = 20
cache_ttl_secs = 300
output_format = "json"

How agents use it

The tools work like any other built-in. An agent searching for information simply calls web_search, reviews the results, then optionally calls web_fetch on the most relevant URLs to read full page content:

{
  "name": "web_search",
  "parameters": {
    "query": "rust error handling best practices",
    "max_results": 5
  }
}
{
  "name": "web_fetch",
  "parameters": {
    "url": "https://doc.rust-lang.org/book/ch09-00-error-handling.html"
  }
}

No tool registration, no MCP server, no plugin. It just works.

In sandbox mode, agents still have network access for search — the OS-level isolation restricts filesystem access, not outbound HTTP.

What's next

The meta search architecture is designed to grow. We're looking at:

  • More engines — Brave Search, SearXNG instances, and domain-specific backends for code (GitHub, docs sites).
  • Per-engine weighting — let users boost or suppress specific engines based on query type.
  • Search-to-memory — automatically caching search results in the agent's graph memory so repeated research doesn't re-fetch the same pages.

The full tool reference is in the built-in tools docs.