On this page
- WebSearch is server-side. Obviously.
- Reverse-engineering the encrypted fields
- How search queries are constructed: not at all
- The date instruction: monthly, not daily
- Black-box observation: keyword-based, not semantic
- WebFetch: what the pipeline does to your content
- What gets indexed, what does not
- The Accept header prefers Markdown
- Five layers between your content and Claude
- The cache changes everything about freshness
- The domain blocklist is a kill switch
- Three User-Agents, three identities
- Claude Code can pay for content
- Same web, two different views
- claude.ai (chat)
- Claude Code (CLI)
- Hypotheses about Anthropic's search engine
- Or it is all an April Fools' joke
- Appendix: the full preapproved domain whitelist
There is a growing industry around "Generative Engine Optimization" – GEO – selling the idea that optimizing for LLM search requires fundamentally new thinking. I read the source code. It is keyword matching, an HTML-to-Markdown converter with zero configuration, and a small model that paraphrases your content with a 125-character quote limit. GEO is SEO with extra filters. Prove me wrong.
Same search engine, different content pipeline
Claude Code and claude.ai share the same search index but see different versions of your website. Each result returns a title, a URL, a page age, and roughly 500 words of encrypted snippet text. Claude Code discards everything except title and URL. When it needs content, it fetches your page separately, converts the HTML to Markdown via Turndown – losing JSON-LD, Schema.org, alt-text, and meta descriptions – then sends the Markdown to a smaller model (Haiku) that paraphrases it with a 125-character quote limit. 107 documentation domains skip this filter. Everyone else gets compressed. The model is instructed to include the current year and month in search queries, but receives no guidance on how to formulate them. The search API is still in beta.
On March 31 – yesterday – I published a black-box study (in German, small-scale exploratory research) reverse-engineering how Claude's web search selects and ranks content. 170 queries against a handful of controlled pages, run in the claude.ai web UI. Not a large-scale study – a directional probe to see what differs from traditional search. It took considerable effort.
On April 1, a GitHub repository appeared containing what claims to be the full source code of Claude Code – the CLI tool, not the chat UI. They leaked everything. These are different products. The chat runs entirely server-side. Claude Code runs locally and talks to the same API. The search engine behind both is likely the same (both use web_search_20250305), but the content fetching pipeline is different – Claude Code fetches locally, the chat fetches server-side.
One day. From black box to white box – of a related but not identical product.
The timing is either a coincidence, an April Fools' joke, or the universe telling me that patience is overrated. I read the source anyway. Here is what it reveals, where it aligns with my observations, and what remains unknown.
Search
Server-side engine, keyword matching, encrypted snippets, no query guidance.
Fetch
Turndown, Haiku paraphrase, 125-char quote limit, five content filters.
Compare
claude.ai sees encrypted snippets directly. Claude Code throws them away.
Conclude
GEO is SEO with extra filters. The search API is still in beta.
WebSearch is server-side. Obviously.
Claude Code does not contain a search engine. It sends your query to Anthropic's API, which runs the search server-side. This is not surprising – shipping a search index with a CLI tool would be absurd. But it matters, because it means the source code reveals the interface to the search engine without revealing the engine itself.
// github.com/codeaashu/claude-code/blob/main/
// src/tools/WebSearchTool/WebSearchTool.ts#L76
function makeToolSchema(input) {
return {
type: 'web_search_20250305',
name: 'web_search',
max_uses: 8,
}
}
The tool sends your query to Anthropic's API as a server_tool_use request. The actual search engine, the index, the ranking – all server-side. Claude Code is a wrapper. The search happens behind a wall you cannot inspect even with the source code in front of you.
The API returns more than Claude Code keeps. According to Anthropic's documentation, each search result contains title, url, page_age, encrypted_content, and encrypted_index. But Claude Code maps only two fields:
// WebSearchTool.ts#L124
const hits = block.content.map(
r => ({ title: r.title, url: r.url })
)
page_age – dropped. encrypted_content (the actual snippet text, encrypted so the model can cite it without exposing raw content in client code) – dropped. encrypted_index (the citation reference) – dropped.
This is the key difference between claude.ai and Claude Code. In my black-box study on claude.ai, I observed structured snippets: 1–6 fragments per result, ~500 words total, keyword-proximity-driven selection. Those snippets come from encrypted_content – the server-side search returns them, and the claude.ai model reads them directly.
Claude Code throws all of that away. It keeps only the titles and URLs. If it wants page content, it must call WebFetch separately – which triggers the Haiku pipeline with the 125-character quote limit, the Turndown conversion, the 100K truncation. The snippet structure I measured in my black-box study does not apply to Claude Code at all. Claude Code never sees those snippets.
The text blocks that the secondary search model writes (its commentary on the results) are what Claude Code actually works with. These are not the structured snippets from the index – they are a model's free-form summary, written after seeing the encrypted results server-side.
Reverse-engineering the encrypted fields
The code shows that encrypted_content and encrypted_index exist but are discarded. My black-box study on claude.ai – where the model does use them – lets me infer what they contain.
encrypted_content is likely the snippet fragments. My study measured their structure:
- 1–6 fragments per result, typically 5
- ~94 words per fragment (range 30–120), ~500–600 words total
- Fragment start: 70% at sentence boundaries, 10% mid-sentence, 4% at navigation/breadcrumbs
- Fragment end: 50% at a period, 30% hard cutoff at ~120 words
- Selection driven by keyword proximity – three keywords in one sentence was selected in 6 out of 6 queries
- An "anchor fragment" (the intro paragraph with highest keyword density) appeared in every query tested
- Fragments are query-dependent: same URL, different query, different fragments selected
This means the search index stores the full page text, not pre-computed snippets. Fragment selection happens server-side after ranking – my study found no correlation between fragment coverage breadth and ranking position.
encrypted_index is likely the citation pointer. My study tested citation notation (X-1, X-3, X-5 – document X, sentence Y). Changing the sentence index did not change what the frontend displayed. The index is an internal reference the model uses to point at specific positions within encrypted_content when formulating citations.
Is it actually encrypted? I called the API directly and inspected the raw response. encrypted_content is a binary blob, Base64-encoded, 4,000–6,300 characters per result. Not readable. Not JSON. Not plaintext. The encoding starts with a consistent byte prefix suggesting a structured binary format. page_age is a plain relative string like "6 days ago" or null – likely from the search index's crawl timestamp, not from the page's HTML meta tags.
The size confirms the black-box observations. 6,300 characters of Base64 decode to roughly 4,700 bytes. After encryption overhead, that leaves approximately 3,500–4,000 bytes of plaintext – roughly 500–650 words at typical word length. My black-box study measured a consistent ~500–600 word snippet budget. The encrypted blob is exactly the right size to contain the fragments I observed.
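The arithmetic checks out in a few lines – a back-of-envelope sketch, with the encryption overhead and bytes-per-word figures as stated assumptions, not measured values:

```typescript
// Back-of-envelope check on the blob size. Assumptions: standard Base64
// (4 chars encode 3 bytes), ~15% encryption overhead (IV, auth tag,
// padding -- a guess), ~7 bytes per English word including the space.

function decodedBytes(base64Chars: number): number {
  return Math.floor((base64Chars * 3) / 4)
}

function estimatedWords(
  base64Chars: number,
  overheadRatio = 0.15,
  bytesPerWord = 7,
): number {
  const plaintext = decodedBytes(base64Chars) * (1 - overheadRatio)
  return Math.round(plaintext / bytesPerWord)
}

decodedBytes(6300)    // 4725 raw bytes
estimatedWords(6300)  // 574 words -- inside the measured 500-650 range
```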
Claude Code sidesteps these fields entirely and fetches content through its own WebFetch pipeline instead.
Eight searches per call. Hardcoded. The web-search-2025-03-05 beta header in the codebase confirms this is still an experimental feature, not a stable API.
The instructions Claude receives for using WebSearch are also in the source:
CRITICAL REQUIREMENT - You MUST follow this:
After answering the user's question,
you MUST include a "Sources:" section
at the end of your response.
IMPORTANT - Use the correct year
in search queries:
The current month is [auto-injected].
You MUST use this year when searching
for recent information.
Source: src/tools/WebSearchTool/prompt.ts. The model is forced to cite sources and instructed to include the current date in queries. This explains the temporal awareness I observed in my black-box study – it is a prompt instruction, not an index feature.
How search queries are constructed: not at all
The most consequential finding for GEO may be what is absent. I searched the entire codebase for guidance on how Claude should formulate search queries. There is none.
The tool description the model receives is:
- Allows Claude to search the web and use
the results to inform responses
- Use this tool for accessing information
beyond Claude's knowledge cutoff
The input schema describes the query field as:
query: z.string().min(2)
.describe('The search query to use')
"The search query to use." Minimum 2 characters. No instructions on keyword extraction, language choice, query decomposition, phrase quoting, or synonym expansion. The model decides entirely on its own – based on training, not on prompt engineering – what query string to write.
That query string then travels verbatim through a secondary model call:
// WebSearchTool.ts#L257-258
const userMessage = createUserMessage({
content:
'Perform a web search for the query: '
+ query,
})
No reformulation. No expansion. The literal string the model composed goes to the server-side search engine, which – as my black-box study established – does keyword-based matching.
For companies optimizing for GEO: you cannot influence the query construction. What you can do is ensure your content contains the literal terms the model is likely to use. And there is one term the model is explicitly told to include.
The date instruction: monthly, not daily
The "[auto-injected]" date in the search prompt is not a placeholder. It is computed dynamically:
// constants/common.ts#L26-33
// Returns "Month YYYY" (e.g. "February 2026")
// in the user's local timezone.
// Changes monthly, not daily — used in tool
// prompts to minimize cache busting.
export function getLocalMonthYear() {
const date =
process.env.CLAUDE_CODE_OVERRIDE_DATE
? new Date(
process.env.CLAUDE_CODE_OVERRIDE_DATE)
: new Date()
return date.toLocaleString('en-US',
{ month: 'long', year: 'numeric' })
}
Two code comments tell the story:
// Changes monthly, not daily — used in
// tool prompts to minimize cache busting.
The value changes only when the calendar month rolls over – "April 2026" becomes "May 2026" on May 1. It is not frozen at session start; it is computed fresh on each call via new Date(). But because it only includes month and year (not the day), the string changes at most once a month. The code comment explains why: changing the tool prompt invalidates the server-side prompt cache for all users.
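The rollover behavior is easy to verify – a small sketch mirroring the logic above, with the date injected directly instead of via the environment variable:

```typescript
// Mirrors the "Month YYYY" logic, with the date injectable for testing
// (the real code reads CLAUDE_CODE_OVERRIDE_DATE or new Date()).
function monthYear(date: Date): string {
  return date.toLocaleString('en-US', { month: 'long', year: 'numeric' })
}

// Any two days in the same month yield the identical prompt string, so the
// server-side prompt cache stays warm until the calendar month rolls over.
monthYear(new Date(2026, 3, 1))   // "April 2026"
monthYear(new Date(2026, 3, 30))  // "April 2026" -- same string, cache intact
monthYear(new Date(2026, 4, 1))   // "May 2026"   -- cache busts here
```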
A separate function (getSessionStartDate) IS memoized and frozen at session start – but that one feeds the system prompt, not the search tool.
The prompt tells the model:
The current month is April 2026.
You MUST use this year when searching for
recent information, documentation, or
current events.
Example: If the user asks for "latest React
docs", search for "React documentation"
with the current year, NOT last year
The model is instructed to inject year and month into search queries. If the server-side search is keyword-based (my black-box observation, not proven), then content containing "2026" in its body text would match queries that include "2026."
But there is a catch. On May 1, the prompt says "May 2026" – and the model may search for "best CRM tools May 2026." Nothing relevant was published in May yet. The month-level granularity helps in the second half of a month but hurts in the first half. For GEO, the year matters more than the month.
CLAUDE_CODE_OVERRIDE_DATE is an internal Anthropic developer variable (the comment says "ant-only date override") – not a user-facing feature.
Black-box observation: keyword-based, not semantic
My study found that Claude's search appears to use exact lexical matching. Synonym queries returned zero results – "Neukäufer gewinnen" found nothing when the actual text said "Neukunden akquirieren." This is an empirical observation from a small sample, not a proven fact. The search engine is server-side and its algorithm is invisible.
What the source code does reveal: the web_search input schema has exactly three parameters.
// WebSearchTool.ts#L25-37
z.strictObject({
query: z.string().min(2),
allowed_domains: z.array(z.string()).optional(),
blocked_domains: z.array(z.string()).optional(),
})
strictObject with additionalProperties: false. No semantic: true flag. No embedding mode. No site: operator. No date range. No language filter. A plain text query and optional domain filters. My black-box study had already inferred additionalProperties: false on the schema – the code confirms it, character for character.
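The observable effect of that schema can be re-created without zod – a toy validator, illustration only, not Anthropic's code:

```typescript
// Toy re-creation of the schema's observable behavior: unknown keys rejected,
// query must be a string of at least 2 characters, domain filters optional.
// Illustration only -- the real tool uses zod's strictObject.
function validateSearchInput(input: Record<string, unknown>): boolean {
  const known = new Set(['query', 'allowed_domains', 'blocked_domains'])
  // additionalProperties: false -- any unknown key fails validation
  if (Object.keys(input).some(k => !known.has(k))) return false
  return typeof input.query === 'string' && input.query.length >= 2
}

validateSearchInput({ query: 'react documentation 2026' }) // true
validateSearchInput({ query: 'x' })                        // false -- below min(2)
validateSearchInput({ query: 'ok', semantic: true })       // false -- no such flag exists
```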
WebFetch: what the pipeline does to your content
WebFetch is the tool that actually reads your website. And it is more complex than anyone assumed.
The flow: validate URL, upgrade HTTP to HTTPS, check a domain blocklist via api.anthropic.com, fetch with axios, convert HTML to Markdown via Turndown, truncate to 100,000 characters, then – and this is the part that matters for GEO – send the Markdown to Haiku.
// github.com/codeaashu/claude-code/blob/main/
// src/tools/WebFetchTool/prompt.ts#L23
export function makeSecondaryModelPrompt(
markdownContent, prompt, isPreapprovedDomain
) {
const guidelines = isPreapprovedDomain
? `Provide a concise response based on
the content above. Include relevant
details, code examples, and
documentation excerpts as needed.`
: `Enforce a strict 125-character maximum
for quotes from any source document.
Use quotation marks for exact language
from articles; any language outside of
the quotation should never be
word-for-word the same.`
Your website's content is not returned to Claude directly. A smaller, faster model – Haiku – reads the Markdown and writes a summary. That summary is what Claude sees. Your page goes through two AI models before it influences an answer.
The exact instructions Haiku receives depend on the domain tier:
Provide a concise response based on the
content above. Include relevant details,
code examples, and documentation excerpts
as needed.
Full reproduction allowed. Code examples pass through. Documentation excerpts preserved.
Provide a concise response based only on
the content above. In your response:
- Enforce a strict 125-character maximum
for quotes from any source document.
- Use quotation marks for exact language
from articles; any language outside of
the quotation should never be
word-for-word the same.
- You are not a lawyer and never comment
on the legality of your own prompts
and responses.
- Never produce or reproduce exact
song lyrics.
125 characters per quote. Mandatory paraphrasing. Your content arrives as Haiku's interpretation, not your words.
My black-box study measured web_fetch returning 2,000–8,000+ words in page-order Markdown with preserved headings. That is what goes into Haiku. What comes out is the compressed summary Claude actually reasons with.
What gets indexed, what does not
My study tested specific phrases hidden in JSON-LD, image alt-text, and meta descriptions. None were retrievable via web_search. The source code offers a plausible explanation – though this applies to WebFetch (the page reader), not to the server-side search index, which remains opaque.
Turndown, an HTML-to-Markdown converter, is the entire content pipeline for WebFetch. It processes <body> content: paragraphs, headings, lists, tables, links, code blocks. It does not process:
- <script> tags (where JSON-LD lives) – stripped
- <meta> tags (where descriptions live) – in <head>, ignored
- <img alt="..."> attributes – Turndown drops images by default
- <nav>, <header>, <footer> – converted to text if present in body
My study found that navigation and breadcrumb text sometimes consumed fragment slots. The code confirms this: Turndown does not distinguish navigation from content. If your breadcrumbs contain keywords, they compete with your article text for Haiku's attention.
// utils.ts#L493-494
if (contentType.includes('text/html')) {
markdownContent =
(await getTurndownService()).turndown(htmlContent)
}
No custom Turndown rules. No <nav> stripping. No <script> removal beyond Turndown's defaults. The instance is created with new Turndown() – zero configuration. Whatever Turndown's default rules preserve, Claude sees. Whatever they strip, Claude never knows existed.
Schema.org in JSON-LD? Invisible. <script type="application/ld+json"> is a script tag. Turndown strips it.
Image alt-text? Invisible. Turndown's default removes images entirely.
Meta descriptions? Invisible to WebFetch. My study found they sometimes appear as ~30-word fragments in web_search snippets – but that is server-side behavior, not something the client code controls.
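A toy illustration of the consequence – regex-based, emphatically not Turndown itself (Turndown walks a DOM), but enough to show which signals never reach Haiku:

```typescript
// Crude sketch of what an HTML-to-text pass keeps and drops.
// NOT Turndown -- just enough to show which signals vanish.
function toyExtract(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, '') // JSON-LD gone
    .replace(/<head[\s\S]*?<\/head>/gi, '')     // meta descriptions gone
    .replace(/<img[^>]*>/gi, '')                // alt-text gone
    .replace(/<[^>]+>/g, ' ')                   // remaining tags -> their text survives
    .replace(/\s+/g, ' ')
    .trim()
}

const page = `<html><head><meta name="description" content="Best CRM guide">
<script type="application/ld+json">{"@type":"Article"}</script></head>
<body><nav>Home > Blog > CRM</nav><img src="x.png" alt="CRM dashboard">
<h1>Choosing a CRM</h1><p>Neukunden akquirieren leicht gemacht.</p></body></html>`

toyExtract(page)
// -> 'Home > Blog > CRM Choosing a CRM Neukunden akquirieren leicht gemacht.'
// Breadcrumbs survive and compete for attention; JSON-LD, meta description,
// and alt-text never make it out.
```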
The Accept header prefers Markdown
// utils.ts#L279
headers: {
Accept: 'text/markdown, text/html, */*',
'User-Agent': getWebFetchUserAgent(),
}
text/markdown comes first. If your server serves Markdown (some documentation sites do), Claude gets it raw – no Turndown conversion, no structural loss. For HTML responses, Turndown does the work. This matches my study, which observed web_fetch returning Markdown with preserved heading hierarchy. Sites that support content negotiation and serve Markdown get a genuine edge.
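On your server, honoring that preference takes one content-negotiation branch – a framework-agnostic sketch (the function is mine; real negotiation would also weigh q-values):

```typescript
// Minimal content negotiation: serve Markdown to clients that list it first.
// Sketch only -- real implementations should parse q-values per RFC 9110.
function preferredType(accept: string | undefined): 'markdown' | 'html' {
  if (!accept) return 'html'
  const types = accept.split(',').map(t => t.trim().split(';')[0])
  for (const t of types) {
    if (t === 'text/markdown') return 'markdown' // skip Turndown entirely
    if (t === 'text/html') return 'html'
  }
  return 'html'
}

preferredType('text/markdown, text/html, */*')   // 'markdown' -- Claude Code's header
preferredType('text/html,application/xhtml+xml') // 'html'     -- a typical browser
```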
Five layers between your content and Claude
Your blog post passes through five transformations before Claude reasons with it. Some are obvious. One is not.
Layer 1: Turndown (HTML to Markdown). Your HTML becomes Markdown via new Turndown() with zero configuration. Everything in <head> disappears: meta tags, JSON-LD, Open Graph. Images are dropped. Navigation and breadcrumbs survive as plain text and compete with your article content.
Layer 2: Truncation at 100,000 characters. If your Markdown exceeds 100K characters, it is sliced and a note is appended:
// utils.ts#L529-532
markdownContent.slice(0, MAX_MARKDOWN_LENGTH)
+ '\n\n[Content truncated due to length...]'
Everything after 100K is gone. Claude never knows it existed.
Layer 3: Haiku screening. For non-preapproved domains, a smaller model processes your content with copyright-aware instructions. The full set of rules Haiku receives:
- Enforce a strict 125-character maximum
for quotes from any source document.
Open Source Software is ok as long as
we respect the license.
- Use quotation marks for exact language
from articles; any language outside of
the quotation should never be
word-for-word the same.
- You are not a lawyer and never comment
on the legality of your own prompts
and responses.
- Never produce or reproduce exact
song lyrics.
This is not just summarization. Haiku is instructed to avoid reproducing your content verbatim. Your words are paraphrased by design. The only exception: preapproved documentation sites, where code examples and excerpts pass through.
Layer 4: Tool result budget (the hidden one). After Haiku returns its summary, that summary enters the tool result pipeline. Here, a second size limit applies:
// toolLimits.ts
DEFAULT_MAX_RESULT_SIZE_CHARS = 50_000
PREVIEW_SIZE_BYTES = 2000
MAX_TOOL_RESULTS_PER_MESSAGE_CHARS = 200_000
If the Haiku summary exceeds 50K characters, it is saved to disk and replaced with a 2,000-byte preview. Claude sees only the first ~2KB. If multiple tools run in parallel (search + fetch + fetch), their combined results are capped at 200K per message – the largest ones are persisted to disk first.
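The persistence decision condenses to a few lines – a sketch using the constants above; the function name is mine, and the disk write is a stand-in:

```typescript
// Sketch of the Layer-4 budget: oversized tool results are swapped for a
// preview. Thresholds from toolLimits.ts; persistence itself is stubbed out.
const DEFAULT_MAX_RESULT_SIZE_CHARS = 50_000
const PREVIEW_SIZE_BYTES = 2_000

function budgetResult(summary: string): { inline: string; persisted: boolean } {
  if (summary.length <= DEFAULT_MAX_RESULT_SIZE_CHARS) {
    return { inline: summary, persisted: false }
  }
  // Full text would go to disk; the model sees only the ~2KB preview.
  return { inline: summary.slice(0, PREVIEW_SIZE_BYTES), persisted: true }
}

budgetResult('short summary').persisted        // false -- passes through whole
budgetResult('x'.repeat(60_000)).inline.length // 2000  -- preview only
```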
These thresholds are also remotely adjustable via GrowthBook – an open-source feature flag system that lets Anthropic change configuration values server-side without shipping a new release. The flag tengu_satin_quoll controls per-tool persistence thresholds. Anthropic can tighten or loosen how much of your content reaches Claude at any time.
Layer 5: The preapproved bypass. Exactly one path skips Haiku entirely:
// WebFetchTool.ts#L264-269
if (
isPreapproved &&
contentType.includes('text/markdown') &&
content.length < MAX_MARKDOWN_LENGTH
) {
result = content // raw, no Haiku
}
Preapproved domain + server responds with text/markdown + content under 100K = your content arrives at Claude unmodified. For every other site, it passes through all five layers.
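The branch logic reduces to a three-way decision – a sketch of the condition above, with the path labels as my own shorthand:

```typescript
// Which pipeline path does a fetched page take? Sketch of the branch above;
// the Path labels are shorthand, not names from the codebase.
const MAX_MARKDOWN_LENGTH = 100_000

type Path = 'raw' | 'haiku-relaxed' | 'haiku-strict'

function pipelinePath(
  isPreapproved: boolean,
  contentType: string,
  length: number,
): Path {
  if (isPreapproved && contentType.includes('text/markdown')
      && length < MAX_MARKDOWN_LENGTH) {
    return 'raw' // content reaches Claude unmodified
  }
  // Everyone else is summarized by Haiku: preapproved domains get the
  // relaxed prompt, the rest get the 125-character quote limit.
  return isPreapproved ? 'haiku-relaxed' : 'haiku-strict'
}

pipelinePath(true, 'text/markdown', 5_000)   // 'raw'
pipelinePath(true, 'text/html', 5_000)       // 'haiku-relaxed'
pipelinePath(false, 'text/markdown', 5_000)  // 'haiku-strict'
```

Note that even a preapproved domain loses the raw path if it serves HTML or exceeds 100K characters of Markdown.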
The full pipeline with limits:
| Step | Limit | Approx. equivalent |
|---|---|---|
| HTTP fetch | 10 MB | Raw HTML |
| Turndown (HTML to Markdown) | no own limit | Nav text survives |
| Markdown truncation | 100,000 characters | ~20,000 words |
| Haiku processing | 100K chars in, 8K tokens out | ~6,000 words output |
| Tool result budget | 50,000 characters | 2KB preview if exceeded |
A typical 2,000-word blog post produces 5,000–8,000 characters of Markdown after Turndown – well within all limits. The bottleneck for normal content is not truncation but Haiku's output: regardless of input size, the summary is capped at ~6,000 words. For pages with heavy navigation HTML (mega-menus, footer link farms, sidebar widgets), Turndown preserves the text of those elements – they compete with your article content for Haiku's attention and the 100K character budget.
The cache changes everything about freshness
// utils.ts#L63-78
const CACHE_TTL_MS = 15 * 60 * 1000 // 15 minutes
const MAX_CACHE_SIZE_BYTES =
50 * 1024 * 1024 // 50MB
const URL_CACHE = new LRUCache({
maxSize: MAX_CACHE_SIZE_BYTES,
ttl: CACHE_TTL_MS,
})
const DOMAIN_CHECK_CACHE = new LRUCache({
max: 128,
ttl: 5 * 60 * 1000, // 5 min
})
When Claude fetches your page, the result is cached for 15 minutes. Every user asking about the same URL in that window gets the same Haiku summary. Not the same page – the same processed summary.
Your page's first impression is its only impression for 15 minutes. If Haiku misreads your content on the first pass, that misreading is what everyone gets. There is no retry.
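The first-impression effect falls directly out of TTL caching – a toy Map-based sketch (the real code uses the lru-cache package, with size limits this toy omits):

```typescript
// Toy TTL cache showing why the first Haiku summary wins for 15 minutes.
// The real code uses lru-cache; this omits the 50MB size cap.
const CACHE_TTL_MS = 15 * 60 * 1000

type Entry = { summary: string; expiresAt: number }
const cache = new Map<string, Entry>()

function getSummary(
  url: string,
  fetchAndSummarize: () => string,
  now: number,
): string {
  const hit = cache.get(url)
  if (hit && now < hit.expiresAt) return hit.summary // no refetch, no retry
  const summary = fetchAndSummarize()
  cache.set(url, { summary, expiresAt: now + CACHE_TTL_MS })
  return summary
}

let calls = 0
const fetcher = () => { calls++; return `summary v${calls}` }

const first = getSummary('https://example.com/post', fetcher, 0)
const cached = getSummary('https://example.com/post', fetcher, 10 * 60 * 1000)
const refreshed = getSummary('https://example.com/post', fetcher, 16 * 60 * 1000)
// first === cached ('summary v1'); only after the TTL does 'summary v2' appear
```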
The domain blocklist has a separate 5-minute cache. If Anthropic blocks your domain, it takes at most 5 minutes to take effect.
The domain blocklist is a kill switch
Before fetching any non-preapproved URL, Claude Code calls home:
// utils.ts#L183
const response = await axios.get(
  `https://api.anthropic.com/api/web/domain_info?domain=${domain}`,
  { timeout: 10_000 }
)
if (response.data.can_fetch === true) {
// proceed
}
Anthropic maintains a real-time domain blocklist. If can_fetch returns false, Claude cannot read your site. This is not a robots.txt check – it is Anthropic's own gate, checked on every first request to a new domain.
Enterprise customers can skip this check via skipWebFetchPreflight – but regular users cannot.
Three User-Agents, three identities
The code reveals Claude operates with three distinct User-Agent strings:
// http.ts#L52-57
// "Claude-User" is Anthropic's publicly
// documented agent for user-initiated
// fetches; the claude-code suffix lets them
// distinguish local CLI traffic from
// claude.ai server-side fetches.
export function getWebFetchUserAgent() {
return `Claude-User (${version}; +https://support.anthropic.com/)`
}
The comment is the finding: "distinguish local CLI traffic from claude.ai server-side fetches." This confirms that claude.ai (the chat product) uses a different fetcher – not this client-side pipeline. When your server sees Claude-User with a claude-code suffix, that is the CLI. The chat product has its own server-side mechanism.
Claude Code can pay for content
Buried in 40 lines of error handling is a protocol nobody has written about:
// utils.ts#L329-356
if (error.response?.status === 402) {
const { isX402Enabled,
handlePaymentRequired,
getX402SessionSpentUSD
} = require('../../services/x402/index.js')
const paymentHeader =
error.response.headers['x-payment-required']
if (isX402Enabled() && paymentHeader) {
const result = handlePaymentRequired(
paymentHeader,
getX402SessionSpentUSD(),
)
if (result) {
// Retry request with payment header
return await axios.get(url, {
headers: {
'x-payment': result.paymentHeader,
},
})
}
}
}
HTTP 402 Payment Required. The x402 protocol. Claude Code can detect paywalled content and pay for it using a session-scoped budget tracked via getX402SessionSpentUSD(). This is infrastructure for a web where AI agents pay per article. It exists in the code today.
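The session-budget side of this can be sketched in a few lines. The cap value below is my assumption – the code tracks spend via getX402SessionSpentUSD(), but I found no published limit:

```typescript
// Sketch of x402-style session budgeting: approve payment for a 402 response
// only while the running total stays under a per-session cap.
const SESSION_CAP_USD = 5.0 // assumption -- not a documented Anthropic value

let sessionSpentUSD = 0

function approvePayment(priceUSD: number): boolean {
  if (sessionSpentUSD + priceUSD > SESSION_CAP_USD) return false
  sessionSpentUSD += priceUSD
  return true // caller retries the request with the x-payment header
}

const a = approvePayment(0.25) // true  -- article paywall, within budget
const b = approvePayment(4.5)  // true  -- session total now 4.75
const c = approvePayment(0.5)  // false -- would exceed the session cap
```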
Same web, two different views
The same search engine powers both products. But what reaches the model differs fundamentally.
claude.ai (chat)
- Search returns 10 results with encrypted snippets (~500 words each)
- Model reads snippets directly, cites from them
- page_age available as freshness signal ("6 days ago")
- No separate fetch needed for most answers
- Snippet selection: keyword-proximity-based, 1–6 fragments, query-dependent
- Your content arrives as the search engine selected it
Claude Code (CLI)
- Same 10 results, but encrypted_content and page_age are discarded
- Only { title, url } kept
- Content requires separate WebFetch call
- WebFetch pipeline: Turndown (HTML to Markdown) then Haiku with 125-character quote limit then 100K truncation then 50K tool result budget (2KB preview if exceeded)
- 107 preapproved documentation sites skip Haiku
- Your content arrives as Haiku's paraphrase of a Markdown conversion of your HTML
Both products share the same search index, the same ranking, the same keyword matching. The difference is what happens after the results arrive. claude.ai works with the server-provided snippets. Claude Code throws them away and rebuilds its own view of your content through a five-layer pipeline.
My black-box observations about snippet structure and fragment behavior describe the shared search engine. The code I am reading here shows a different, more lossy path to the same content.
Hypotheses about Anthropic's search engine
The source code reveals the edges of the black box without opening it. Combined with my empirical measurements, here is what I can infer:
The index stores full page text, not pre-computed snippets. My study showed this: the same URL returns different fragments for different queries, so fragments must be selected at query time rather than stored. The code adds a boundary condition: snippet text travels only inside encrypted_content, which Claude Code discards – whatever assembles the fragments lives entirely server-side.
Ranking may use term-frequency scoring (BM25 or similar). My study found a correlation between BM25 proxy scores and ranking position across domains – but this is a small-sample observation, not proof. The code shows no semantic search parameters. The query is a plain string. No embeddings, no vector similarity. This is consistent with lexical retrieval, but the server-side engine could use any algorithm.
The ~500-word snippet budget is enforced where encrypted_content is assembled, server-side. My study measured a consistent ~500–600 word budget per result with a hard maximum of 6 fragments, and the encrypted blob is sized to match. The client code imposes no limit of its own – it keeps just { title, url }. The fragment structure, the ~94-word average per fragment, the keyword proximity effects – all of this is decided before any client sees the results. The budget is a property of snippet generation, not of the index.
The web-search-2025-03-05 version string suggests iteration. It is a beta header with a date stamp. Anthropic is actively developing this. The search behavior I measured in March 2026 may not match what exists in June 2026.
Or it is all an April Fools' joke
The repository appeared on April 1. The code is plausible, detailed, and internally consistent. It matches my black-box observations in ways that would be difficult to fake:
- additionalProperties: false on the search schema – matches my empirical observation
- Turndown with zero custom rules – would explain why JSON-LD and alt-text are invisible
- Search results carrying their snippet text only as encrypted blobs – consistent with why I never saw readable snippet text in the raw API response
- The 125-character quote limit – consistent with the compression patterns I measured in fetch results
If it is a fake, it is a fake that independently arrived at the same architecture I reverse-engineered from the outside. I will treat it as real until proven otherwise.
Yesterday: 170 queries and careful measurement. Today: grep -r "makeSecondaryModelPrompt" and reading TypeScript.
The vibes say April Fools'. The code says architecture. I will update this article when Anthropic comments.
Appendix: the full preapproved domain whitelist
These 107 domains skip the domain blocklist check and receive relaxed content extraction rules (no 125-character quote limit, code examples preserved). Source: src/tools/WebFetchTool/preapproved.ts.
Anthropic: platform.claude.com, code.claude.com, modelcontextprotocol.io, github.com/anthropics, agentskills.io
Programming languages: docs.python.org, en.cppreference.com, docs.oracle.com, learn.microsoft.com, developer.mozilla.org, go.dev, pkg.go.dev, www.php.net, docs.swift.org, kotlinlang.org, ruby-doc.org, doc.rust-lang.org, www.typescriptlang.org
JavaScript/Web: react.dev, angular.io, vuejs.org, nextjs.org, expressjs.com, nodejs.org, bun.sh, jquery.com, getbootstrap.com, tailwindcss.com, d3js.org, threejs.org, redux.js.org, webpack.js.org, jestjs.io, reactrouter.com
Python libraries: docs.djangoproject.com, flask.palletsprojects.com, fastapi.tiangolo.com, pandas.pydata.org, numpy.org, www.tensorflow.org, pytorch.org, scikit-learn.org, matplotlib.org, requests.readthedocs.io, jupyter.org
PHP: laravel.com, symfony.com, wordpress.org
Java: docs.spring.io, hibernate.org, tomcat.apache.org, gradle.org, maven.apache.org
.NET: asp.net, dotnet.microsoft.com, nuget.org, blazor.net
Mobile: reactnative.dev, docs.flutter.dev, developer.apple.com, developer.android.com
ML/Data Science: keras.io, spark.apache.org, huggingface.co, www.kaggle.com
Databases: www.mongodb.com, redis.io, www.postgresql.org, dev.mysql.com, www.sqlite.org, graphql.org, prisma.io
Cloud/DevOps: docs.aws.amazon.com, cloud.google.com, kubernetes.io, www.docker.com, www.terraform.io, www.ansible.com, vercel.com/docs, docs.netlify.com, devcenter.heroku.com
Testing: cypress.io, selenium.dev
Game dev: docs.unity.com, docs.unrealengine.com
Other: git-scm.com, nginx.org, httpd.apache.org
If your documentation site is not on this list, Claude Code treats it as untrusted – blocklist check, 125-character quote limit, mandatory paraphrasing.
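Checking your own domain against the list takes a few lines – a sketch with a small subset of the 107 entries (substitute the full list from preapproved.ts):

```typescript
// Is a site on the preapproved list? Subset shown for illustration;
// the real list in preapproved.ts has 107 entries.
const PREAPPROVED = new Set([
  'react.dev', 'docs.python.org', 'developer.mozilla.org',
  'kubernetes.io', 'huggingface.co', 'git-scm.com',
])

function isPreapproved(url: string): boolean {
  return PREAPPROVED.has(new URL(url).hostname)
}

isPreapproved('https://react.dev/learn')       // true  -- relaxed extraction
isPreapproved('https://example-blog.dev/post') // false -- 125-char quote limit
```

One caveat: entries like github.com/anthropics and vercel.com/docs include a path, so real matching must be prefix-based; this sketch checks hostnames only.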
Related: Vibe Coding Ships Broken | Why Nobody Built a Content Build System