Visual QA - Catch Layout Breaks Before Users Do

Your CSS looked fine in the browser. HTML validation passed. Users are still reporting a broken mobile nav.

Visual QA compares screenshots pixel-by-pixel against a known-good baseline. If more than 2% of pixels changed, the page is flagged. That sounds simple, but it raises an immediate question: your site has hundreds of pages, and you have no baseline yet. Before you can catch regressions, you need a reference state to regress from. The order of operations matters, and so does what you do when something flags unexpectedly.

Installing the `qa` dependency group pulls in Playwright and Pillow. Running `playwright install chromium` then downloads a Chromium binary, about 150MB. That's a one-time cost, but it surprises people who expect a lightweight install. There's no external service, no API key, no per-run cost. Once Chromium is cached, runs are local and fast. The question is where visual QA fits in your existing build process, because it runs against the built HTML in `site/`, not your source files.

The first run with `--baseline` captures reference screenshots for every page at desktop and mobile sizes. Those images sit in `.wire/qa/reference/`. Every subsequent run compares against them. The 2% threshold is tighter than it sounds: anti-aliasing causes variations under 0.5%, so 2% reliably catches real changes. A single misplaced heading affects roughly 3-5% of pixels. A broken mobile nav overlay can hit 15-30%. But here's the complication: some visual changes are intentional. A theme update will flag every page.

The report shows you the diff percentage per page and saves both screenshots to `.wire/qa/diff/` for manual inspection. A 4.7% desktop flag could be a heading that shifted. A 12.3% mobile flag is almost certainly a layout break. Wire doesn't decide which is acceptable. That's the part that surprises people: visual regressions don't block the deploy the way lint errors do. You get the information, then you decide.

Wire runs visual QA after `wire.build` completes, against the static output in `site/`. Lint errors refuse the build entirely. Visual regressions do not. The reasoning: a CSS change that affects every page might be a deliberate redesign, not a mistake. Blocking on visual changes would make intentional updates impossible to ship. So Wire flags and reports, but the deploy decision stays with you. That distinction matters if you're wiring this into a CI pipeline and expecting it to gate releases automatically.

For sites over 500 pages, you can run QA on a random sample or limit it to one section. The sampler prioritizes pages that changed since the last baseline, so recent edits always get checked even when you're not checking everything. The tradeoff: a CSS regression on an unsampled page won't surface until the next full run. If your theme change touches every page, a sample won't tell you the full scope of what broke.

Visual QA catches regressions, things that changed from a known-good state. A page that was always broken passes QA perfectly, because the broken state is the baseline. Wire automates the question "did anything change unexpectedly?" It doesn't answer "does this look good?" That's still a human call. If your reference baseline was captured while a design problem already existed, QA will never flag it.

Content audits catch broken links and missing metadata. Lint rules catch HTML structure problems. Neither catches the moment a CSS change makes your mobile nav overlap your hero text, or when a font swap silently breaks every table on the site.

Visual QA fills that gap. Wire takes screenshots of every page at two viewport sizes, compares them against reference baselines, and flags anything that changed more than 2% of the pixels. No neural network, no AI judgment. Pixel math that catches real regressions.

Why Screenshots Beat HTML Validation

HTML can be valid and still look broken. A page can pass all 44 lint rules and still have:

Overlapping text from a z-index change in the theme
Missing images that load fine locally but fail on the built site
Table overflow on mobile where columns push past the viewport
Font rendering shifts after a woff2 update changes character widths
Dark-on-dark text from an accidental CSS variable override

These are visual bugs, not structural ones. The only reliable way to catch them is to look at the page, and Wire automates the looking.

How It Works

Wire's QA system uses Playwright to render every page in a real Chromium browser, capturing full-page screenshots at two viewport sizes:

Viewport	Width x Height	Why
Desktop	1280 x 900	Standard laptop viewport, catches sidebar and nav layout
Mobile	390 x 844	iPhone 14 dimensions, catches responsive breakpoint issues
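
A minimal sketch of the capture step, assuming Playwright's synchronous Python API. The `capture_page` helper and its `stem` argument are hypothetical names for illustration; Wire's own capture code is not shown here.

```python
from pathlib import Path

# Viewport sizes from the table above.
VIEWPORTS = {
    "desktop": {"width": 1280, "height": 900},
    "mobile": {"width": 390, "height": 844},
}


def capture_page(html_file, out_dir, stem):
    """Render one built HTML file at each viewport and save full-page PNGs.

    `stem` is a hypothetical filename prefix such as "vendors-acme".
    """
    # Imported lazily so this module loads even before the qa extras are installed.
    from playwright.sync_api import sync_playwright

    html_file, out_dir = Path(html_file), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        for name, size in VIEWPORTS.items():
            page = browser.new_page(viewport=size)
            # Load the built page from disk through a file:// URL.
            page.goto(html_file.resolve().as_uri())
            target = out_dir / f"{stem}-{name}.png"
            page.screenshot(path=str(target), full_page=True)
            page.close()
            saved.append(target)
        browser.close()
    return saved
```

Because `full_page=True` captures the whole scroll height, the saved image is usually taller than the viewport; the viewport mainly fixes the width that triggers responsive breakpoints.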

Reference Baselines

The first run captures reference screenshots, the known-good state of every page. These are stored in .wire/qa/reference/ with filenames like vendors-acme-desktop.png.

python -m wire.qa screenshot --baseline
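
The naming scheme can be sketched as a small path helper. This is inferred from the single example filename above, so treat the rule as an assumption: `reference_path` is a hypothetical function, and the "index" fallback for the site root is a guess, not documented Wire behavior.

```python
from pathlib import Path

REFERENCE_DIR = Path(".wire/qa/reference")


def reference_path(page_url, viewport):
    """Map a page path such as "vendors/acme/" to its reference screenshot."""
    # Assumption: path segments are joined with "-"; the site root becomes "index".
    slug = "-".join(part for part in page_url.split("/") if part) or "index"
    return REFERENCE_DIR / f"{slug}-{viewport}.png"
```

Under these assumptions, `reference_path("vendors/acme/", "desktop")` resolves to `.wire/qa/reference/vendors-acme-desktop.png`.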

Comparison Runs

Subsequent runs capture current screenshots and compare them pixel-by-pixel against the references:

python -m wire.qa screenshot

For each page, Wire computes a pixel difference ratio: the percentage of pixels that changed between reference and current. If the ratio exceeds 2%, the page is flagged.
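
The ratio itself is simple arithmetic. Below is a sketch in plain Python over two equal-length sequences of pixel values; how Wire loads the images with Pillow and how it handles screenshots of different sizes are not covered here, and the function names are hypothetical.

```python
THRESHOLD_PERCENT = 2.0


def pixel_diff_ratio(reference, current):
    """Return the percentage of pixels that differ between two pixel sequences."""
    if len(reference) != len(current):
        # Assumption: mismatched dimensions are treated as an error in this sketch.
        raise ValueError("screenshots must have the same dimensions")
    if not reference:
        return 0.0
    changed = sum(1 for ref_px, cur_px in zip(reference, current) if ref_px != cur_px)
    return 100.0 * changed / len(reference)


def is_flagged(reference, current, threshold=THRESHOLD_PERCENT):
    """A page is flagged only when the ratio exceeds the threshold."""
    return pixel_diff_ratio(reference, current) > threshold


# Usage: a 100-pixel "image" where 3 pixels changed from white to black.
white, black = (255, 255, 255), (0, 0, 0)
reference = [white] * 100
current = [white] * 97 + [black] * 3
print(pixel_diff_ratio(reference, current), is_flagged(reference, current))
```

Note the strict comparison: a page sitting at exactly 2% is not flagged in this sketch, matching the "exceeds 2%" wording.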

Why 2%

The threshold is deliberately tight. Anti-aliasing and subpixel rendering cause minor variations between identical renders, typically under 0.5%. A 2% threshold catches real changes while ignoring rendering noise.

For context: a single misplaced heading on a 1280x900 page affects roughly 3-5% of pixels. A broken mobile nav overlay can affect 15-30%. The 2% threshold catches both.
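
To make the percentages concrete, this short calculation converts the 2% threshold into a pixel count for a screenshot exactly one viewport tall. Full-page screenshots are usually taller, so the real budgets scale up with page height; `pixel_budget` is a hypothetical helper.

```python
VIEWPORTS = {"desktop": (1280, 900), "mobile": (390, 844)}
THRESHOLD_PERCENT = 2


def pixel_budget(width, height, threshold_percent=THRESHOLD_PERCENT):
    """Number of changed pixels a screenshot can absorb before it is flagged."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return width * height * threshold_percent // 100


for name, (width, height) in VIEWPORTS.items():
    total = width * height
    print(f"{name}: {total:,} pixels, flagged above {pixel_budget(width, height):,} changed pixels")
```

A single viewport of desktop output holds 1,152,000 pixels, so the flag trips above 23,040 changed pixels; on mobile the figures are 329,160 and 6,583.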

Reading the QA Report

Visual QA Report
================

Pages checked: 142
  Desktop: 142 screenshots
  Mobile:  142 screenshots

Regressions (>2% pixel diff):
  vendors/acme/     desktop  4.7%
  guides/workflow/  mobile   12.3%

Clean: 140 pages (98.6%)

Each flagged page is listed with its viewport and diff percentage. Both screenshots are saved to .wire/qa/diff/ for manual inspection.
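
If you want to post-process the report, for example to post flagged pages to a chat channel, the regression rows can be pulled out with a regular expression. This sketch assumes the text layout shown above stays stable; `parse_regressions` is a hypothetical helper, not part of Wire.

```python
import re

# Matches indented rows such as "  vendors/acme/     desktop  4.7%".
ROW = re.compile(
    r"^\s+(?P<page>\S+/)\s+(?P<viewport>desktop|mobile)\s+(?P<diff>\d+(?:\.\d+)?)%\s*$"
)


def parse_regressions(report_text):
    """Return (page, viewport, diff_percent) tuples for each flagged row."""
    rows = []
    for line in report_text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append((match["page"], match["viewport"], float(match["diff"])))
    return rows
```

Summary lines such as "Desktop: 142 screenshots" do not match, because the pattern requires a page path ending in "/" followed by a viewport name and a percentage.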

Sampling Strategy

For large sites (500+ pages), Wire supports sampling to keep QA runs under 5 minutes:

python -m wire.qa screenshot --sample 50    # Check 50 random pages
python -m wire.qa screenshot --topic vendors # Check one topic only

The sampling prioritizes pages that changed since the last baseline, so fresh edits always get checked even in sampled runs.
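
One way such prioritized sampling could work is sketched below: changed pages fill the sample first, and random unchanged pages take the remaining slots. This illustrates the idea under assumed inputs; it is not Wire's actual sampler, and `pick_sample` is a hypothetical name.

```python
import random


def pick_sample(all_pages, changed_pages, sample_size, seed=None):
    """Choose pages to check, putting recently changed pages first."""
    rng = random.Random(seed)
    changed_set = set(changed_pages)
    changed = [page for page in all_pages if page in changed_set]
    if len(changed) >= sample_size:
        # More changed pages than slots: sample among the changed pages only.
        return rng.sample(changed, sample_size)
    unchanged = [page for page in all_pages if page not in changed_set]
    remaining = min(sample_size - len(changed), len(unchanged))
    return changed + rng.sample(unchanged, remaining)
```

The tradeoff is visible in the code: any unchanged page left out of the random fill goes unchecked until the next full run.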

Integration with the Build Pipeline

Visual QA runs after wire.build completes, against the built HTML in site/. The sequence:

python -m wire.build generates the static site.
python -m wire.qa screenshot compares against baselines.
Review any regressions before deploy.

Wire does not block deploys on visual regressions (unlike lint errors, which refuse the build). Visual changes may be intentional: a theme update, a new section, a redesigned nav. The QA report gives you the information; you decide whether the changes are expected.
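
If you do want CI to gate on visual changes, the policy has to live in your own wrapper. The sketch below runs the two commands in order and applies a team-chosen cutoff to a list of (page, viewport, diff) tuples. The `block_above` value of 10% is an arbitrary example, the exit codes of `wire.qa` are not documented here, and both function names are hypothetical.

```python
import subprocess
import sys

STEPS = [
    [sys.executable, "-m", "wire.build"],
    [sys.executable, "-m", "wire.qa", "screenshot"],
]


def run_pipeline(runner=subprocess.run):
    """Run the build, then visual QA; stop at the first step that exits non-zero."""
    for command in STEPS:
        result = runner(command)
        if result.returncode != 0:
            return result.returncode
    return 0


def blocking_regressions(regressions, block_above=10.0):
    """Return the regressions severe enough to stop a deploy under a local policy."""
    return [row for row in regressions if row[2] > block_above]
```

With the sample report above, a 10% cutoff would let the 4.7% desktop change through for manual review and stop on the 12.3% mobile change.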

What Visual QA Does Not Replace

Visual QA catches regressions, things that changed unexpectedly. It does not evaluate whether a page looks good in the first place. A page can be ugly and pass QA perfectly, because it was ugly in the reference baseline too.

For design quality, you still need human review. Wire handles the tedious part (did anything break?) so your human reviewers can focus on the subjective part (does this look right?).

Requirements

Visual QA requires the qa optional dependency group:

pip install -e ".[qa]"    # Installs Playwright + Pillow
playwright install chromium

Playwright downloads a Chromium binary (~150MB) on first install. Subsequent runs use the cached binary. No external service, no API key, no cost beyond the initial download.
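
A quick way to confirm the optional packages are importable before a QA run is sketched below. It uses only the standard library, `missing_qa_deps` is a hypothetical helper, and it does not verify that the Chromium binary itself has been downloaded.

```python
import importlib.util

# Import name -> package name as installed by the qa dependency group.
QA_MODULES = {"playwright": "Playwright", "PIL": "Pillow"}


def missing_qa_deps():
    """Return the names of QA packages that cannot be imported."""
    return [
        label
        for module, label in QA_MODULES.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_qa_deps()
    if missing:
        print("Missing QA dependencies:", ", ".join(missing))
    else:
        print("QA dependencies are importable.")
```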