Self-Assessment - What Wire Does Well and Where It Grows
You're evaluating a tool and you want the real story, not the sales pitch. This page was written by the people who built Wire, and they included the parts that keep them up at night.
Wire's self-assessment covers five areas: what surprised the team during development, what works well with measured evidence, what worries them, where the architecture hits limits, and what's coming next. Most readers land here from one of two directions: they're deciding whether Wire fits their situation, or they're already using it and something isn't working the way they expected. Which is closer to where you are?
The claims here come with numbers. Six cents per page in API costs versus $100-200 for manual SEO work. Forty-four lint rules in the audit system. Zero API calls during the analysis phase. The zero-cost analysis point surprises most people: keyword routing, BM25 scoring, and amendment briefs all run locally in pure Python. You can audit a thousand-page site without spending anything. But cost and coverage aren't the only things that matter when you're committing to a tool.
Wire runs on a single model: Claude Sonnet 4.6. The entry point is abstracted so switching providers is a one-line change, but no alternatives have been tested in production. Google Search Console caps at 25,000 rows per query, which means large sites may miss long-tail keyword data. Wire is also a batch tool with no continuous monitoring. If a page breaks overnight, Wire won't catch it until you run `audit` the next morning. Some of these are architectural decisions. Others are just gaps. One of them might be the one that matters for your situation.
Wire's architecture was designed for one operator running CLI commands against one site. Module-level globals tie each Python process to a single `mkdocs.yml`. The audit reads every page from disk twice. Git-based date lookups run one subprocess call per page, so a thousand-page site means a thousand `git log` calls at build time. None of these are breaking problems at current scale: a full audit of 1,100 pages takes under two seconds on an SSD. But "under two seconds on an SSD" is a different statement than "works fine on a network-mounted filesystem."
The roadmap includes a multi-site dashboard for agencies, Lighthouse integration for performance budgets, webhook notifications for long batch runs, and expanded Schema.org structured data. These are active priorities, not a wish list, but there are no timeline commitments. The team ships when the work is solid. If a feature you're counting on is on this list, that's useful information before you commit to Wire. If it's not on this list at all, that's also useful information.
Most product pages tell you what's great. This one tells you what's great, what's messy, and what keeps us up at night. Wire is a real tool used on real sites, and we think honesty about its rough edges is more useful than polish.
What surprised us building Wire
Wire was not designed top-down. It grew from solving actual problems on production sites, and several things turned out differently than we expected.
The 3-layer quality system was an accident. PREVENT (teach Claude the rules before writing), FIX (auto-correct on save), and DETECT (warn about issues after the fact) emerged from three separate rounds of fixing real failures. We kept finding that prompts alone could not prevent every mistake, and post-hoc warnings were useless if the bad content was already saved. The layered approach was never planned; it just turned out that you need all three layers or quality slips through the cracks.
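In code, the three layers compose as a simple pipeline. This is an illustrative sketch, not Wire's actual API; the function names and the example rules are hypothetical:

```python
def prevent(rules: list[str]) -> str:
    # PREVENT: fold the quality rules into the prompt before generation.
    return "Follow these rules:\n" + "\n".join(f"- {r}" for r in rules)

def fix(text: str) -> str:
    # FIX: auto-correct known failure modes on save (here: trailing spaces).
    return "\n".join(line.rstrip() for line in text.splitlines())

def detect(text: str) -> list[str]:
    # DETECT: warn about issues that survived the first two layers.
    return ["thin content"] if len(text.split()) < 50 else []
```

Any single layer leaks: prompts miss edge cases, auto-fixes only cover known failures, and detection alone arrives after the damage is done.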
BM25 works surprisingly well without a neural model. When we needed to route keywords to the right pages (should this keyword expand an existing page or justify a new one?), we expected to need embeddings or a language model. Instead, classic BM25 term-frequency scoring, combined with impression ratios and page breadth signals, produces reliable routing decisions. The analysis phase runs entirely offline with zero API calls and zero cost. We were genuinely surprised that a 1990s algorithm held up this well against 2026 content problems.
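For readers unfamiliar with BM25, here is a from-scratch sketch of the scoring core. Wire's actual scorer also folds in impression ratios and page breadth signals, which are omitted here:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each doc (a list of tokens) against the query with classic BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

The whole thing is term counts and one logarithm per term: no model weights, no network calls, which is why the analysis phase costs nothing.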
Source diversity detection caught real problems we did not know we had. On idp-software.com, the automated audit flagged 32 pages with concentrated external sources: articles that cited the same domain 4, 5, 6 times. After the auto-fix pipeline (deduplicate external links, diversify on refine), that number dropped to zero. We built the detection because it seemed like good practice. We did not expect it to find that many problems on a site we thought was well-maintained.
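The check itself is simple to sketch. The threshold and the URL regex here are illustrative, not Wire's exact rule:

```python
import re
from collections import Counter

def concentrated_sources(markdown: str, threshold: int = 4) -> dict[str, int]:
    """Flag external domains cited `threshold`-or-more times in one page."""
    domains = re.findall(r"https?://([^/\s\)]+)", markdown)
    counts = Counter(d.lower() for d in domains)
    return {d: n for d, n in counts.items() if n >= threshold}
```

Running this across a whole site is a loop over files, which is how a single audit pass surfaced all 32 concentrated pages at once.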
The junior-senior news pattern outperforms batch evaluation. Our first approach sent all articles to Claude in one batch. The results were mediocre. Important details got lost in volume. Splitting into individual "junior analyst" evaluations (one article at a time) followed by a "senior editor" synthesis produces noticeably better output. Each article gets proper attention, and the synthesis step catches contradictions between sources that batch evaluation misses.
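The pattern reduces to two passes over the model. A minimal sketch, with `llm` standing in for the model call (Wire routes everything through a single `claude_text()` entry point; the prompt strings here are illustrative):

```python
def news_digest(articles: list[str], llm) -> str:
    # Junior pass: one focused evaluation per article, so no detail
    # is diluted by volume.
    notes = [llm(f"Assess this article:\n{a}") for a in articles]
    # Senior pass: synthesize the notes and reconcile contradictions
    # between sources.
    return llm("Synthesize these analyst notes:\n" + "\n---\n".join(notes))
```

The trade-off is N+1 model calls instead of one, which is why cost per call and resume support matter for long news runs.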
Schema validation refusing builds was initially terrifying. When we first shipped strict frontmatter validation, rejecting pages with missing titles or malformed metadata, we expected complaints. Instead, users told us they preferred it. "Tell me what's wrong, don't guess" turns out to be a better experience than silently patching bad data and producing confusing output downstream.
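"Refuse, don't guess" looks like this in miniature. A sketch only; Wire's schema validates more fields than shown:

```python
def validate_frontmatter(meta: dict) -> dict:
    """Raise with every problem listed, rather than silently patching."""
    errors = []
    if not meta.get("title"):
        errors.append("missing title")
    if "description" in meta and not isinstance(meta["description"], str):
        errors.append("description must be a string")
    if errors:
        raise ValueError("frontmatter rejected: " + "; ".join(errors))
    return meta
```

Collecting all errors before raising is the part users praised: one build failure reports every problem instead of revealing them one at a time.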
What's great (with evidence)
These are not aspirational claims. They are measured results from production use.
Cost: $0.06 per page vs $150+ manual. A single `enrich` call costs roughly six cents in Claude API usage. It performs local analysis (free), targeted web research, and a combined improve pass. Manual SEO content work (research, rewrite, optimize, review) runs $100-200 per page at agency rates. That is a 2,500x cost reduction, and we have the API bills to prove it. See the full pricing comparison.
44 lint rules in the audit system. The audit command checks for duplicate titles, duplicate descriptions, orphan pages, broken internal links, source concentration, underlinked pages, title length violations, missing citations, thin content, heading hierarchy issues, H1 mismatches, and more. The content quality system documents every rule with the evidence behind it. Most enterprise SEO audit tools check fewer items, and they charge monthly for the privilege.
Zero-API-cost analysis phase. The `audit` and `analyze` commands, and the analysis stage of `enrich`, make zero Claude API calls. Keyword presence scoring, BM25 ranking, keyword routing, amendment briefs: all computed locally with pure Python. You can run diagnostics on a thousand-page site without spending a cent.
Resume-on-interrupt. Every batch command (`news`, `refine`, `reword`, `enrich`) writes progress to `.wire/progress-*.json`. If your laptop dies, your VPN drops, or you hit Ctrl+C, run the same command with `--resume` and it picks up where it left off. Failed items are retried, not skipped.
Git-native workflow. Wire writes markdown files into your existing docs directory. Every change is a normal file change. Every page has a git history. There is no proprietary database, no lock-in, no export step. If you stop using Wire tomorrow, your content is still there, in standard markdown with YAML frontmatter.
Dry-run mode that actually works. The `--dry-run` flag writes `.preview` files and shows diffs without touching your real content. It skips the stamp step so your metadata stays clean. You can inspect exactly what Wire would do before letting it do it.
What worries us
We are shipping Wire with these known concerns. We think transparency about them is more valuable than pretending they do not exist.
Single-model dependency. Wire uses Claude (currently `claude-sonnet-4-6`) for all generation. The single `claude_text()` entry point could be swapped to another provider, but we have not tested alternatives. In practice, each Claude call costs well under $0.10 per page. Model pricing has decreased with each model generation, and the entry point abstraction means switching models is a one-line change.
GSC API limits. Google Search Console returns a maximum of 25,000 rows per query. For a site with thousands of pages and tens of thousands of ranking keywords, this ceiling means some long-tail data gets truncated. Wire works within this limit, but very large sites (10,000+ pages) may miss low-volume keyword data that could inform better routing decisions.
No real-time monitoring. Wire is a batch tool. You run commands, they process, they finish. There is no daemon watching for new content, no webhook listener, no continuous audit loop. If a page breaks at 2am, Wire will not notice until you run `audit` the next morning. For teams that need continuous monitoring, Wire complements a dedicated monitoring tool rather than replacing one.
Windows as primary development environment. Wire is developed and tested primarily on Windows with Git Bash. The test suite (1,227 tests, 90% coverage) runs on Windows. Path handling uses pathlib which should be cross-platform, but "should be" and "is" are different statements. Linux and macOS users may encounter edge cases we have not hit. We fix these when reported, but we cannot claim equal confidence across platforms.
Prompt brittleness. Wire's output quality depends heavily on prompt engineering. The 18 prompt templates have been refined through hundreds of iterations, but they are still natural language instructions to a language model. Edge cases in content structure, unusual frontmatter, or unexpected topic layouts can produce suboptimal results. The 3-layer quality system catches most failures, but "most" is not "all."
Rate limiting is conservative. Wire defaults to 1-second delays between API calls. Google's GSC API allows 1,200 requests per minute. We chose reliability over speed, but this means large batch operations (news gathering across 300+ pages) take longer than they theoretically need to. The delay is configurable via `extra.wire.rate_limit_delay`, but we have not tested aggressive settings in production.
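The throttle itself is a few lines. This sketch is not Wire's implementation; the injectable `sleep` is just what makes it testable:

```python
import time

def throttled(calls, delay=1.0, sleep=time.sleep):
    """Invoke each zero-argument callable with a fixed delay between calls."""
    results = []
    for i, call in enumerate(calls):
        if i:
            sleep(delay)  # conservative default: one second between requests
        results.append(call())
    return results
```

A fixed inter-call delay is deliberately simpler than token-bucket limiting: it can never burst past a quota, at the cost of leaving headroom unused.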
Scalability boundaries
Wire works well for its current use case: a single operator running CLI commands against one site at a time. These are the architectural boundaries that would need to change for different usage patterns.
Single-process, single-site architecture. Module-level globals (`DOCS_DIR`, `SITE`, `WIRE_CONFIG`) are initialized on import from `mkdocs.yml` in the current working directory. One Python process handles one site. This is the right design for a CLI tool, but it rules out a SaaS API server or background job queue that processes multiple sites without forking. Moving to explicit config-passing would require threading a config object through every function, which is a large refactor with no current payoff.
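The refactor described above would replace those globals with a config object threaded through call sites. A hypothetical sketch of the target shape, with field names mirroring the globals:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class WireConfig:
    # Explicit equivalents of the DOCS_DIR / SITE module globals.
    docs_dir: Path
    site: str

def load_config(mkdocs_cfg: dict, root: Path) -> WireConfig:
    # Each call yields an independent config, so one process could
    # serve several sites without forking.
    return WireConfig(docs_dir=root / mkdocs_cfg.get("docs_dir", "docs"),
                      site=mkdocs_cfg.get("site_name", ""))
```

The cost is visible in the signature change: every function that today reads a global would gain a `config` parameter, which is why the refactor is large despite being mechanical.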
LRU cache staleness during batch operations. `_get_valid_internal_paths()` is cached with `@lru_cache(maxsize=1)` and never invalidated. If a batch operation creates new pages (e.g., `compare` or `create`), broken link detection in `_sanitize_content()` will miss the new pages for the rest of the process lifetime. The same applies to `generate_site_directory()`. For current batch sizes (100-300 pages), this rarely matters because page creation and link validation happen in separate commands. At larger scale or in a single long-running process, this cache would need explicit invalidation.
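The staleness is easy to reproduce, and `cache_clear()` is the explicit invalidation mentioned above. A minimal demonstration (the function is a stand-in for `_get_valid_internal_paths()`, not Wire's code):

```python
from functools import lru_cache

PAGES = {"index.md"}  # stand-in for pages discovered on disk

@lru_cache(maxsize=1)
def valid_paths() -> frozenset:
    # Computed once, then cached for the life of the process.
    return frozenset(PAGES)

snapshot = valid_paths()
PAGES.add("pricing/index.md")     # a batch step creates a new page
stale = valid_paths()             # cache still returns the old snapshot
valid_paths.cache_clear()         # explicit invalidation
fresh = valid_paths()
```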
Audit reads every page from disk twice. `_audit_content_quality()` scans all `index.md` files to build a pages dict, then reads them again for quality checks. For 1,000+ pages, this doubles the I/O. Not a bottleneck on SSDs (the entire scan takes under 2 seconds for 1,100 pages), but it could matter on network-mounted filesystems or very large sites.
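The single-pass alternative is straightforward: read once, reuse the text for both structures. A sketch, where the thin-content check stands in for the full rule set:

```python
def audit_pages(paths, read):
    """Read each page exactly once; reuse the text for both the pages
    dict and the quality checks."""
    pages = {p: read(p) for p in paths}   # one disk read per page
    issues = {p: check(text) for p, text in pages.items()}
    return pages, issues

def check(text: str) -> list[str]:
    return ["thin content"] if len(text.split()) < 50 else []
```

The trade-off is memory: all page text is held at once, which is fine for markdown-sized files even at a few thousand pages.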
No machine-readable audit output. The `audit()` method outputs via `logger.info()`. Any dashboard or monitoring tool would need to parse log strings. A structured JSON output mode would be straightforward to add but has not been needed yet.
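A structured mode would be a thin layer over the same findings. The schema here is hypothetical, just to show the shape:

```python
import json

def audit_report(findings: list[dict]) -> str:
    # Emit findings as JSON so a dashboard can consume them
    # without parsing log strings.
    return json.dumps({"schema": 1, "findings": findings}, indent=2)
```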
Git-based date lookups scale linearly. `build.py` calls `git log` once per page to get creation and modification dates. For 1,000 pages, that is 1,000 subprocess calls during build. Each has a 10-second timeout. A pre-computed date cache (built once per build from a single `git log` call over the whole repository) would reduce this to one subprocess call.
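The proposed cache could be built like this. A sketch, not Wire's code: it assumes `git log --format=%x00%cI --name-only` output, where each NUL-prefixed line is a commit date followed by that commit's file paths, newest commit first:

```python
import subprocess

def parse_dates(log_output: str) -> dict[str, str]:
    """Map each file to its most recent commit date."""
    dates, current = {}, None
    for line in log_output.splitlines():
        if line.startswith("\x00"):
            current = line[1:]                 # a commit's ISO date
        elif line and current:
            dates.setdefault(line, current)    # first hit = most recent
    return dates

def build_date_cache() -> dict[str, str]:
    # One subprocess call for the whole repo instead of one per page.
    out = subprocess.run(["git", "log", "--format=%x00%cI", "--name-only"],
                         capture_output=True, text=True, check=True).stdout
    return parse_dates(out)
```

Creation dates would come from the same output by keeping the last date seen per file instead of the first.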
`chief.py` is the largest file at ~2,900 lines. The `audit()` method alone spans ~550 lines. This is a maintenance concern, not a runtime concern. Extracting `audit`, `newsweek`, and `init` into separate modules would improve readability without changing behavior.
Dual `validate_frontmatter` naming. `wire/tools.py` has a `validate_frontmatter()` that validates string content, and `wire/schema.py` has a `validate_frontmatter()` that validates metadata dicts. They do different things with the same name. This has not caused bugs because `build.py` imports from `schema.py` while `content.py` uses the one in `tools.py`, but it could confuse contributors.
What we are building next
These are active priorities, not a wish list. No timeline promises. We ship when the work is solid.
Lighthouse integration. Performance budgets derived from Lighthouse scores, tied to content operations. If a page's performance degrades after an update, Wire should flag it before deploy.
Multi-site dashboard. Wire currently operates on one site at a time (wherever `mkdocs.yml` lives). Agencies managing 10+ sites need a single view across all of them: batch status, audit summaries, cost tracking.
Schema.org structured data. Wire generates JSON-LD structured data (Article, WebSite, CollectionPage) at build time based on page type and frontmatter dates. Expanding to FAQPage and HowTo schemas from content analysis is the next step.
Webhook notifications. When a 4-hour newsweek run finishes, you should not have to watch the terminal. Push notifications for completed batch operations are a straightforward improvement we plan to add.
Multi-model support. The `claude_text()` interface is model-agnostic by design. Testing alternative providers is straightforward since the function signature stays the same. Current priority is low because Claude Sonnet 4.6 delivers strong results at decreasing cost, but the abstraction exists for when it matters.