Can agentic coding raise the quality bar?
- 1530 words
- 8 min
The mainstream discourse around agentic coding focuses on throughput: ship faster, write more code, do more with less (headcount). And that's indeed how AI tooling adoption is playing out in many organizations. But there's more than one playbook in town!
In many software systems, quality matters: payment rails, databases, infrastructure control planes. They must be highly available, performant, trustworthy. In these systems, quality can be measured in units that can ultimately be converted to $$$. Those are the kinds of systems I've worked on throughout my career. And in these systems, agentic coding can be used to raise the quality bar.
- The shift
- Example 1: More tooling
- Example 2: Prototype to discover constraints
- Example 3: Build to compare
- Example 4: Low value-per-line abstractions
- Example 5: Pay off tech debt eagerly
- What's next?
The shift
Most of the processes and rituals in the world of software development are designed around a single core belief: code is expensive. Software requires highly specialized workers, it takes an unpredictably long time, and changing your mind halfway through a project is likely to have a disastrous impact on the economic viability of the whole endeavour.
Agentic coding challenges that premise. Agent code is cheap.
The challenge is integration: the output of the current generation of models can’t be blindly trusted, especially on production-critical paths. The effort shifts: verification becomes the dominant cost, which caps how much agentic work you can safely deploy at once.
Nonetheless, there’s an interesting quadrant where agentic workflows have no real downsides:
- Time-consuming work with cheap verification
- Low-blast-radius problems where partial or approximate solutions are still useful
And there’s a lot to do in that quadrant! Tasks and practices that were "not worth it" when every line had to be hand-written are suddenly economically viable, and they can be leveraged to raise the quality bar of the entire system.
I’ll cover a few concrete examples from my own experiments: cases where agentic workflows changed the cost calculus enough to justify doing more quality-related work than I used to.
Example 1: More tooling
Most engineering processes suffer from annoying papercuts that nobody bothers to fix. A quality metric you'd like to track, a code pattern you'd like to enforce (or forbid!). Those tasks rot in the backlog: they're not hard, but they're neither urgent nor do they deliver enough value to be prioritized over feature work.
The cost equation changes when an agent can knock out a good-enough version with minimal input, and it's cheap to verify whether it works or not. Last month, I used an agent to build a CLI that tracks the ratio of safe/unsafe code on a crate-by-crate basis. It had been on my mental to-do list for ages.
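To give a flavor of the scale involved, here's a minimal sketch of such a tool, assuming a crude line-based metric and a hypothetical `crates/` workspace layout (the CLI the agent actually built is more involved):

```rust
use std::fs;
use std::path::Path;

/// Naive metric: lines containing the `unsafe` keyword vs. total lines of
/// Rust code. A real tool would parse the source (e.g. with syn) instead of
/// matching text, but this is enough to track a trend over time.
fn scan_dir(dir: &Path) -> std::io::Result<(usize, usize)> {
    let mut total = 0;
    let mut unsafe_lines = 0;
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            let (t, u) = scan_dir(&path)?;
            total += t;
            unsafe_lines += u;
        } else if path.extension().is_some_and(|ext| ext == "rs") {
            for line in fs::read_to_string(&path)?.lines() {
                total += 1;
                if line.contains("unsafe") {
                    unsafe_lines += 1;
                }
            }
        }
    }
    Ok((total, unsafe_lines))
}

fn main() -> std::io::Result<()> {
    // Hypothetical workspace layout: one directory per crate under `crates/`.
    for entry in fs::read_dir("crates")? {
        let crate_dir = entry?.path();
        let (total, unsafe_lines) = scan_dir(&crate_dir.join("src"))?;
        let ratio = 100.0 * unsafe_lines as f64 / total.max(1) as f64;
        println!("{}: {:.1}% unsafe lines", crate_dir.display(), ratio);
    }
    Ok(())
}
```

Even a naive counter like this is enough to watch the trend from CI, and verification is cheap: run it on a couple of crates whose unsafe footprint you already know.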
The quality of the primary system improves thanks to guardrails and tools that, under normal circumstances, wouldn't get built at all.
Example 2: Prototype to discover constraints
There's a recurring dream in engineering management: the perfect specification. The belief that, with enough upfront analysis, all constraints and challenges can be discovered before writing a single line of code. No surprises during execution. We're seeing a comeback of this idea in the AI age under the banner of spec-driven development: put a lot of effort into writing a detailed specification, then let the agent execute it in one shot. Same dream, different age.
I see a rather different opportunity: instead of going back to waterfall, push iterative prototyping further.
Do just enough upfront design to kick off an agentic experiment. Observe where the agent gets stuck or confused: those are constraints your design failed to account for. Note them down, fold them back into the design draft, restart. Repeat until you have a walking skeleton; then start iterating and polishing.
AI agents become a mechanism to map the problem space. Each failed attempt surfaces something real[1]: an API that doesn't behave as expected, a concurrency issue you didn't foresee, old piles of tech debt getting in the way. You could discover many of these issues through careful upfront analysis, but cheap executable probes tend to reveal them faster, because they force the design to confront the real system earlier in the process.
Example 3: Build to compare
In some cases, you know the design constraints and you have narrowed down the solution space to a few options. How do you pick one?
In a world where code is expensive, it's preferable to sort things out on a whiteboard. You analyze the different options, try to estimate the pros and cons of each, then go build the one that comes out on top (or whose backer is the loudest in the room).
With agents, you can prototype all options. Performance, complexity, memory consumption: you can measure them, thus injecting empirical data into the final decision.
I did this recently when working on a tree-based data structure. The server process forks, and the forked process must determine what needs to be garbage collected: how should a node be identified across the fork boundary? Using its memory address? Based on its position within the tree? Or is it better to introduce a notion of node index?
I had agents build out all three approaches[2]. The arena-based approach came out on top: simplest implementation (no unsafe), shortest GC pauses, and a satisfactory overall performance profile.
I wouldn't have gone down that route if I had trusted my initial bias (position-based) and jumped straight to implementation.
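For illustration, here's a minimal sketch of the arena idea, with hypothetical types that are not the actual implementation: nodes live in a flat `Vec` and are identified by their index, so naming a node requires no raw pointers and no `unsafe`.

```rust
/// Hypothetical arena-backed tree (illustrative names, not the real code).
/// Nodes are identified by their index in the backing Vec rather than by
/// memory address, so a NodeId can be recorded and resolved on either side
/// of the fork boundary without any pointer juggling.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct NodeId(u32);

struct Node<T> {
    value: T,
    children: Vec<NodeId>,
}

struct Arena<T> {
    nodes: Vec<Node<T>>,
}

impl<T> Arena<T> {
    fn new() -> Self {
        Arena { nodes: Vec::new() }
    }

    /// Allocate a node and hand back its stable identifier.
    fn insert(&mut self, value: T) -> NodeId {
        let id = NodeId(self.nodes.len() as u32);
        self.nodes.push(Node { value, children: Vec::new() });
        id
    }

    fn add_child(&mut self, parent: NodeId, child: NodeId) {
        self.nodes[parent.0 as usize].children.push(child);
    }

    fn get(&self, id: NodeId) -> Option<&Node<T>> {
        self.nodes.get(id.0 as usize)
    }
}
```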
Example 4: Low value-per-line abstractions
Some abstractions are useful, but tedious to write. I like to say that their value-per-line is low.
Take this example: ~2,000 lines of Rust code to create a safe interface over Redis' RedisModule_Reply* FFI functions. The builder minimizes unsafe in downstream code (only the constructor is unsafe) and ensures that array and map lengths are correctly set.
The code is extremely repetitive; no sane dev would enjoy hand-writing it, nor should they: the return on investment would be miserable. But it makes perfect sense to have an agent produce it! You have ruled out a whole class of minor mistakes with a modest token investment.
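To make the pattern concrete, here's a heavily simplified sketch of what such a builder can look like. The `raw` module is a stub standing in for the real `RedisModule_Reply*` bindings (the actual signatures differ); the point is the shape: `unsafe` is confined to the constructor, and the array length is finalized automatically.

```rust
// Stand-ins for the raw RedisModule_Reply* FFI bindings, stubbed out so the
// sketch is self-contained; the real signatures live in the generated FFI layer.
mod raw {
    use std::os::raw::{c_long, c_void};

    pub type Ctx = *mut c_void;

    pub unsafe fn reply_with_array(_ctx: Ctx, _len: c_long) -> i32 { 0 }
    pub unsafe fn reply_set_array_length(_ctx: Ctx, _len: c_long) {}
    pub unsafe fn reply_with_long_long(_ctx: Ctx, _value: i64) -> i32 { 0 }
}

/// Safe builder over an array reply: the only `unsafe` the caller writes is
/// the constructor call, and the final array length is set automatically.
pub struct ArrayReply {
    ctx: raw::Ctx,
    len: std::os::raw::c_long,
}

impl ArrayReply {
    /// # Safety
    /// `ctx` must be a valid module context for the lifetime of the builder.
    pub unsafe fn new(ctx: raw::Ctx) -> Self {
        // Declare an array with a postponed length; it is fixed up in `Drop`.
        unsafe { raw::reply_with_array(ctx, -1) };
        ArrayReply { ctx, len: 0 }
    }

    pub fn push_integer(&mut self, value: i64) -> &mut Self {
        // SAFETY: `ctx` validity is guaranteed by the constructor's contract.
        unsafe { raw::reply_with_long_long(self.ctx, value) };
        self.len += 1;
        self
    }
}

impl Drop for ArrayReply {
    fn drop(&mut self) {
        // The builder, not the caller, is responsible for the final length,
        // which rules out mismatched-length bugs by construction.
        unsafe { raw::reply_set_array_length(self.ctx, self.len) };
    }
}
```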
Example 5: Pay off tech debt eagerly
Agentic coding works best with a closed feedback loop: the agent writes code, an automated check verifies it, and the agent iterates until the check passes. The tighter this loop, the more you can trust the output.
This reframes small tech debt items as investments in your verification infrastructure. Tightening the type constraints in a module, improving test coverage, enabling a new class of static analysis: these were always "nice to have", but rarely worth pulling someone off other work. Agentic coding changes the equation: small clean-up tasks are easy to execute, easy to test and deliver obvious value.
Case in point: migrating small chunks of C code to Rust to bring more code under the purview of Rust's borrow-checker and miri[3]. Each migration is a small task, easy for an agent to pull off. But the effect compounds: more code under static analysis provides a more reliable signal on whether changes are correct, which speeds up both agents and humans.
What's next?
These are just a few examples from my day-to-day work. I have no grand theory about where we're heading as a profession, but I don't think agentic coding spells the death of software engineering, nor of craft. If anything, it raises the bar on engineering discipline: organizations and systems where quality matters will invest more in verification, tooling, and feedback loops to extract real value from agentic workflows[4].
This is a great time to experiment, and get excited about what's possible.
Footnotes
[1] The Outcome Engineering manifesto calls this "Failures are Artifacts": "Opinions are conjecture; outcomes are data. When an outcome fails, do not simply rollback. Dissect the failure. Understand why the hypothesis was wrong."
[2] But how can you trust the generated prototypes? You must lay the groundwork in advance! In this case, I had a comprehensive test suite in place, exercising the data structure exclusively via its public API, as well as end-to-end tests for the GC part. Correctness-wise, the verification loop was tight.
[3] miri is a tool to catch undefined behaviour in Rust programs. It can't reason about code that crosses the language barrier via FFI, which greatly limits its usefulness in mixed-language codebases.
[4] Bryan Cantrill, Adam Leventhal, Rain Paharia, and David Crespo discuss this at length in Engineering Rigor in the LLM Age (Oxide and Friends).