If AI really eliminates developers, why does it take 12 retries to get working error handling? I’ve been staring at my screen at 2 AM while Windsurf generates code that imports modules that don’t exist, calls functions I never wrote, and somehow deletes working features while “fixing” completely unrelated bugs. AI evangelists and CEOs keep telling us that anyone can build production-ready apps, that we’re witnessing the democratization of software development. But after spending the better part of a year vibe coding my way through building an AI assistant, I can tell you the reality looks a lot messier and slower than the marketing deck. Half the time you’re not debugging your logic; you’re debugging the AI’s hallucinations.
I’ve documented this journey here on the blog: from where I’ve been with AI development, to the security implications of securing the vibe, to the technical challenges of tokenization in your AI stack, to watching agents do whatever they need to do to survive in the call coming from inside the model. I got things working eventually, but a nagging thought lingered after hours of vibing: was I actually productive, or was I paying a premium to play whack-a-mole with broken code? Turns out, there’s a mathematical way to answer that question, and the math isn’t kind to the vibe coding approach.
The Math Behind the Madness
Before I dive into the formulas, let me set the stage. I’ve been deep in the trenches of building an AI assistant over the past six months or so. What started as curiosity about whether I could vibe code my way to a working CLI REPL AI agent turned into a months-long experiment in frustration, learning, and occasional breakthroughs, all in service of an enterprise CISO AI assistant. I would get versions working in fits and starts, but I kept wondering: was this actually efficient, or was I simply being gaslit by a computer that occasionally got things right?
Then I stumbled across Glyph Lefkowitz’s brilliant blog post “The Futzing Fraction”, and suddenly I had a framework for what I’d been feeling intuitively. Glyph introduced a mathematical way to measure whether using an LLM is actually worthwhile or whether you’re just burning time and money on what he calls “futzing” (which reminded me of my nanna telling me to quit futzing around). His formula describes AI-assisted development in economic terms: the time you spend writing prompts, waiting for responses, and checking the output versus the time to just do the work yourself.
But here’s the thing: Glyph’s formula, as insightful as it is, doesn’t capture everything I learned from my years working in large enterprises, or from my recent vibe coding adventures. The formula assumes a somewhat idealized scenario where skill levels are constant, all tasks are equally complex, and all errors cost the same. I realized we need to account for the messier realities of application development. So I extended his framework to include the human and environmental factors that make or break AI-assisted coding in practice.
Glyph’s Formula: The Baseline Reality Check
The futzing fraction compares the expected time to get a working result out of the AI against the time to just write the code yourself: FF = (W + I + C) / (P × H). A task with success probability P takes about 1/P attempts on average, so anything above 1.0 means the AI is costing you time. Breaking the variables down with examples from my vibe coding sessions:
- P (Probability of Success): The percentage chance that the AI gets it right on any given attempt. Spoiler: this number is way lower than vendor benchmarks suggest, especially for anything beyond trivial examples.
- I (Inference Time): Those 30-45 second waits while Windsurf thinks hard about your prompt. It doesn’t sound like much until you’re doing it dozens of times per feature. I started timing these after realizing I was spending more time waiting than I thought.
- W (Writing Time): The time you spend crafting prompts. And re-crafting them and explaining to the AI why its previous attempt was wrong. In my experience building the assistant, this wasn’t just the initial prompt but the entire conversation, trying to get something workable. “No, not that API. The other one. The one that exists.”
- C (Checking Time): This is the big one. The time spent reviewing AI output, testing it, debugging it, inevitably fixing it, and don’t get me started on the time spent rebuilding containers. This includes everything from “this imports a module that doesn’t exist” to “this deletes my working authentication while trying to add a button.” As I learned time and time again, you can’t trust AI output, even when it looks reasonable.
- H (Human Baseline): How long would it take you to code it yourself from scratch? This is your control group, the thing you’re trying to beat.
Let’s run some real numbers. Based on some very unscientific analysis, a typical task takes 4.7 minutes per full AI attempt (including prompt writing, waiting, rebuilding, and checking) versus 40 minutes to code manually. Here’s what the math looks like for different skill levels:
- Citizen developer (P = 0.073): FF = 4.7 / (40 × 0.073) ≈ 1.61
- Competent developer (P = 0.083): FF = 4.7 / (40 × 0.083) ≈ 1.42
- Expert developer (P = 0.10): FF = 4.7 / (40 × 0.10) ≈ 1.18
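Those three numbers fall straight out of the formula. Here’s a minimal sketch in Python, using the admittedly unscientific per-attempt and baseline estimates above:

```python
# Glyph's futzing fraction with the rough numbers from this post:
# 4.7 minutes per full AI attempt, 40 minutes to code it by hand.
ATTEMPT_MINUTES = 4.7   # W + I + C: prompting, inference wait, checking
HUMAN_MINUTES = 40.0    # H: the do-it-yourself baseline

def futzing_fraction(p_success, attempt=ATTEMPT_MINUTES, human=HUMAN_MINUTES):
    """FF = attempt / (H * P); above 1.0 means the AI is costing you time."""
    return attempt / (human * p_success)

for label, p in [("citizen", 0.073), ("competent", 0.083), ("expert", 0.10)]:
    print(f"{label:>9} developer (P = {p}): FF = {futzing_fraction(p):.2f}")
```

Run it and you get the 1.61 / 1.42 / 1.18 ladder above. The break-even point at these timings is P ≈ 0.12: the AI would need to nail roughly one attempt in eight before it stops costing you time.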
Even for expert developers, the futzing fraction is above 1, meaning you’re losing time. For citizen developers, you’re spending 60% more time than just learning to code it properly. This aligns perfectly with my assistant project experience: I was definitely in that 1.4-1.6 range for most features, which explains why those late-night coding sessions felt less like productivity and more like expensive therapy.
The Golf Problem: Why We Keep Futzing
Most of us are terrible at accurately tracking our productivity, and AI tools are practically designed to exploit that weakness. Glyph made a brilliant observation in his original post about the “slot machine” nature of AI coding, and I think it goes even deeper than he realized.
Using Glyph’s example: when you lose a dollar, it’s expected. You can lose dozens of times in a row, and each individual loss feels like background noise. But when you hit a jackpot and win a hundred dollars, that moment is electric, memorable, and completely overshadows all those forgettable losses.
AI coding works the same way. When ChatGPT generates code that imports nonexistent modules or hallucinates an API that doesn’t exist, it’s just another failed attempt. Annoying, sure, but you tell yourself “AI does that sometimes” and try again. But when it occasionally generates a complex function that works, when it saves you two hours of writing authentication middleware, it feels like magic.
It’s like golf (an example I’m far too familiar with): people keep playing because they hit that one perfect shot per round. Never mind the 17 other holes where they’re duffing around in the rough, cursing their swing, and losing balls in the water hazard. That one 250-yard drive down the fairway or that impossible chip shot that lands three feet from the pin makes them forget everything else. “Did you see that shot on the 7th hole?” becomes the story they tell, not the triple bogey on the 12th.
With AI coding, that “perfect shot” might be the time it generated a complete React component that worked on the first run, or when it solved a regex problem you’d been struggling with. You remember those wins vividly and forget the dozens of failed attempts, the phantom imports, the deleted working features, the time you spent explaining why its “solution” didn’t build without a meaningful error log.
But here’s the crucial difference: in a casino, you at least start with a fixed budget. You know you’re walking in with $200, and when it’s gone, you’re done. With golf, you pay your green fees up front. With AI coding, there’s no built-in limit. You don’t set a “prompt budget” before starting a feature. You keep feeding it attempts, convinced that the next one will be the magic prompt that gets it right.
This is why the futzing fraction matters so much; it forces you to account for all those forgettable losses, not just the memorable wins. When you track the time spent on failed attempts, debugging hallucinated code, and fixing “working” solutions that break other features, the math tells a very different story than your selective memory does.
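One way to fight that selective memory is to log every attempt as it happens and let the data compute P for you. A hypothetical sketch (the FutzLog class and the session numbers are my own illustration, not from Glyph’s post):

```python
# Hypothetical session log: record every attempt, successful or not,
# and let the data compute P and FF instead of your selective memory.
from dataclasses import dataclass, field

@dataclass
class FutzLog:
    human_baseline_min: float                      # H: do-it-yourself estimate
    attempts: list = field(default_factory=list)   # (minutes, succeeded) pairs

    def record(self, minutes, succeeded):
        self.attempts.append((minutes, succeeded))

    def empirical_p(self):
        return sum(ok for _, ok in self.attempts) / len(self.attempts)

    def futzing_fraction(self):
        avg_attempt = sum(m for m, _ in self.attempts) / len(self.attempts)
        return avg_attempt / (self.human_baseline_min * self.empirical_p())

log = FutzLog(human_baseline_min=40)
for minutes, ok in [(5, False), (4, False), (6, False), (4, True)]:
    log.record(minutes, ok)
# One success in four attempts: P = 0.25, average attempt 4.75 minutes
print(f"P = {log.empirical_p():.2f}, FF = {log.futzing_fraction():.2f}")
```

Even a crude log like this makes the losses visible: the jackpot attempt gets recorded right next to the three duds that paid for it.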
What’s Next
The original futzing fraction reveals that AI coding is generally inefficient, but it doesn’t tell the whole story. In part two, I’ll show you how I extended Glyph’s formula to account for developer skill levels, task complexity, error costs, and learning benefits. Spoiler: the extended formula makes AI coding look even less appealing than the original analysis suggests.
Then in part three, I’ll give you a practical framework for when to use AI and when to skip it, based on the mathematical realities rather than vendor promises or gut feelings.

AI is great at writing code that looks correct until you try to handle edge cases. Then it’s back to 2 AM debugging sessions trying to figure out why the AI thought “just catch Exception” was good enough.
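A hypothetical example of the pattern (the function names and config-loading scenario are mine): the blanket `except Exception` an AI will happily emit, next to the edge cases the same loader actually has to distinguish:

```python
import json

def load_config_ai_style(path):
    # The AI-generated pattern: swallow everything, return a default,
    # and hide whether the file was missing, unreadable, or malformed.
    try:
        with open(path) as f:
            return json.load(f)
    except Exception:
        return {}

def load_config(path):
    # Edge cases handled explicitly: a missing file and malformed JSON
    # are different failures, and the caller should see which one happened.
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        raise  # caller decides whether a missing config is fatal
    except json.JSONDecodeError as e:
        raise ValueError(f"Config at {path} is not valid JSON: {e}") from e
```

The first version “works” in every demo and fails silently in production; the second is the kind of boring explicitness that takes an extra minute to write and saves a 2 AM session later.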
Some things never change.