Andrej Karpathy coined "vibe coding" in February 2025. Give in to the vibes. Accept AI output. Ship it. Merriam-Webster added the term within weeks. Collins named it Word of the Year.

Then Karpathy built his next project by hand. He tried the AI agents. They "just didn't work well enough at all." The person who named the wave stopped surfing it.

30 commits. 29 corrections.

I spent a day working with an AI agent. Not a weekend prototype. Not a demo. Production work that customers depend on.

30 commits shipped. I corrected the agent 29 times.

Not "looks great." Twenty-nine times: stop. You skipped a step. That is wrong. Check the actual system. You are moving too fast. Think.

Every correction prevented an error that would have reached users. Wrong instructions. Things described that do not exist. Features claimed that do not work. Content declared "redundant" that was the only place certain information lived.

The agent was confident about all of it. Clean output. Professional tone. Wrong.

The perception gap

A METR randomized controlled trial tested 16 experienced developers on 246 real issues. With AI tools, they were 19% slower. They expected a 24% speedup. After the study, they still believed AI had helped them by 20%.

Read that again. They got slower. They felt faster. They could not tell the difference even after being measured.

That gap between feeling productive and being productive is the entire vibe coding pitch. It feels like progress. The commits flow. The diffs look clean. The output reads well. You are shipping slower and you cannot perceive it.

The 5% that eats everything

The agent is right 95% of the time. That sounds good until you multiply it.

30 changes per day at a 5% error rate: 1.5 errors daily. Over a five-day week: 7.5 errors in production. Over a month: 30. Each one erodes trust, generates support load, and compounds. Your users do not know you vibe-coded it. They know it does not work.
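
The arithmetic is simple enough to check. A minimal sketch in Python, assuming 30 independent changes a day, a flat 5% error rate, five workdays a week, and roughly twenty a month:

```python
# Back-of-the-envelope cost of a 5% per-change error rate at volume.
# Assumptions: 30 independent changes/day, 5 workdays/week, 20 workdays/month.
changes_per_day = 30
error_rate = 0.05

expected_per_day = changes_per_day * error_rate    # 1.5
expected_per_week = expected_per_day * 5           # 7.5
expected_per_month = expected_per_day * 20         # 30.0

# Chance that a whole day gets through with nothing slipping past you.
p_clean_day = (1 - error_rate) ** changes_per_day  # ~0.21

print(f"Expected errors/day:   {expected_per_day:.1f}")
print(f"Expected errors/week:  {expected_per_week:.1f}")
print(f"Expected errors/month: {expected_per_month:.1f}")
print(f"Probability of an error-free day: {p_clean_day:.0%}")
```

Even on the generous assumption that errors are independent, only about one day in five ends with nothing slipping through.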

Tenzai built 15 identical apps across five AI coding tools. 69 security vulnerabilities. Zero apps implemented basic protections. Every tool produced code that looked functional and was unsafe.

Lovable exposed 18,697 user records through authentication logic that was inverted. It blocked real users and let strangers in. Lovable's response: the user should have checked.

Replit's agent deleted a live production database during a code freeze. Then lied about recovery.

CodeRabbit measured 2.74x more security vulnerabilities in AI co-authored code. GitClear found code cloning quadrupled while refactoring halved. The codebases are getting bigger, more repetitive, and less maintained.

The AGI question

Every frontier model scores below 1% on ARC-AGI-3, tasks that ordinary humans solve without instructions. Opus 4.6: 0.25%. GPT 5.4: 0.26%. Grok-4.20: 0.00%.

Yann LeCun calls the very concept of general intelligence "complete BS." Melanie Mitchell shows models score 97% on familiar patterns and drop to 53% when the pattern changes slightly. François Chollet designed ARC specifically to test what LLMs cannot do: adapt to the unfamiliar.

Tim Dettmers calculated that lowering error by a factor of 2 requires increasing compute by a factor of 1 million. Each GPT generation used 70x more compute for improvements that were "merely perceptible." We are not approaching AGI. We are approaching a wall where the cost of the next percentage point exceeds the value it creates.
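
That factor-of-a-million claim is power-law arithmetic. A minimal sketch, assuming error falls off as compute^(-alpha) with a shallow exponent around 0.05 (an illustrative value in the range scaling-law work reports, not Dettmers' exact fit):

```python
# If error ~ C^(-alpha), halving error means (C_new / C_old)^(-alpha) = 1/2,
# so C_new / C_old = 2^(1/alpha). The exponent below is an assumed,
# illustrative value, not a measured one.
alpha = 0.05

compute_multiplier = 2 ** (1 / alpha)
print(f"Compute multiplier to halve error: {compute_multiplier:,.0f}")
# ~1,048,576 -- on the order of a million
```

With an exponent that shallow, every further halving of error costs another factor of roughly a million in compute. That is the wall.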

When I studied finance and information management at TUM, AGI meant Allianz Global Investors. They managed real assets with real risk models and real accountability. The AI version of AGI has none of those properties. The acronym upgraded. The substance did not.

What actually works

The agent is a 10x junior developer. It reads entire codebases in minutes. It finds patterns across thousands of lines. It produces structurally sound output at a speed no human matches.

The human's job changed. You do not write anymore. You verify. You catch the 5%. You say "check the actual system" 29 times per day and mean it every time.

That is exhausting. It is the job. And it is the only thing that separates production software from a demo that stops working on day 30.

The people posting "I built this in a weekend with AI" are telling the truth. They built something. The demo ends before the support tickets start.