Risk, accountability, and failure modes

Part of a series

This article is Part 2 of 3 in AI usage in software development: where it helps and where ownership still matters.

Back to Part 1: How AI fits into engineering workflows

Continue to Part 3: Why architecture-first delivery controls AI behavior

Faster generation does not move responsibility

Adding AI to a delivery process changes how fast code is produced. Responsibility stays where it was. Whether the tool is producing documentation, an exploratory prototype, a routine refactor inside known patterns, or anything else, a named technical owner on the team still has to defend the result in review. The temptation to treat code that runs locally as ready to ship grows when the diff arrives quickly, and that temptation is the risk worth watching.

A common failure mode is treating generated output as safe because it compiled, or because it ran on a developer machine. Neither of those facts tells you much. Correctness in production depends on context that is simply not present during isolated generation, including load, legacy constraints, integration behavior, and operational edge cases. The gap usually becomes visible under real traffic and inside long-running systems, where small inconsistencies stack up into something larger than any one change.

When the volume of output exceeds review capacity

Security and reliability risks grow with the volume of output. High-throughput generation can produce diffs larger than the team can practically review. That increases the chance that subtle defects reach production.

Consider a sprint that lands twenty thousand lines of code with only light review. The diff is large enough that serious flaws can hide inside sections that look ordinary on any single screen. The model can produce code faster than people can read it, and that is where load-dependent and security-sensitive logic typically goes unnoticed.

Independent 2026 audits underline the bottleneck. Veracode reported forty-five percent of sampled AI-assisted code carrying common categories of web risk aligned to the OWASP Top 10 family. Related work flagged thirty-one percent as plainly exploitable. Commentary on loosely reviewed AI-heavy repos cites up to ninety-two percent carrying at least one critical finding. Security teams have reported spending more time vetting model output than fixing conventional bugs.

Third-party scan headlines, early 2026

Rounded to match percentages often cited from Veracode and partner studies. Use them as a reason to verify your own setup rather than a final ruling.

AI-heavy samples tagged with issues similar to the OWASP Top 10: ~45%

Audited subsets called "straight-up exploitable": ~31%

Loosely reviewed AI-heavy repos allegedly carrying critical findings: ~92%

AI-authored changes allegedly needing manual production debugging afterward: ~43%

Human and model pairing still works when the engineer who merges the code accepts responsibility for the production impact, and when each change is small enough to read end to end before it ships.
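
One way to keep each change small enough to read end to end is a pre-merge check on diff size. The sketch below is a minimal example, assuming the team feeds it the text output of `git diff --numstat`. The 400-line budget and the function names are hypothetical choices for illustration, not thresholds taken from the studies cited above.

```python
from typing import NamedTuple


class DiffSize(NamedTuple):
    files: int
    changed_lines: int


def parse_numstat(numstat_text: str) -> DiffSize:
    """Sum added and deleted lines from `git diff --numstat` output.

    Each line holds three tab-separated fields: added, deleted, path.
    Binary files report "-" for both counts, so they count as a file
    but contribute no lines.
    """
    files = 0
    changed = 0
    for line in numstat_text.splitlines():
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        added, deleted, _path = parts
        files += 1
        if added.isdigit() and deleted.isdigit():
            changed += int(added) + int(deleted)
    return DiffSize(files, changed)


def within_review_budget(size: DiffSize, max_lines: int = 400) -> bool:
    """Return True when the change fits the team's review budget.

    max_lines is an assumed, team-chosen limit.
    """
    return size.changed_lines <= max_lines


sample = "12\t3\tsrc/app.py\n-\t-\tassets/logo.png\n500\t20\tsrc/generated.py\n"
size = parse_numstat(sample)
print(size, within_review_budget(size))
```

A check like this does not judge quality. It only flags changes that are likely too large for one reviewer to read in full, so they can be split before merge.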

Treat hosted models like any other vendor

Anything you send to a hosted model crosses another company's infrastructure. That traffic can include prompt history, repository snippets, internal identifiers, and follow-up messages in the same session. Treat the path the same as any other vendor that handles confidential data. Decide explicitly whether content may leave your network, how long the provider may retain it, in which countries it may be stored, and how you will respond to a breach. Apply the same standard you would apply to a hosted continuous integration service that clones your repositories.

Hosted models and their APIs fail like any other dependency. If inference is required to produce a production fix, then keep a fallback path that does not assume the service will be available.
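
A fallback path can be as simple as a wrapper that tries the hosted model first and, on failure, hands the task to a route that does not need it. This is a minimal sketch: `ask_model` and `manual_runbook` are hypothetical stand-ins, not a real vendor API.

```python
from typing import Callable, Tuple


def with_fallback(primary: Callable[[str], str],
                  fallback: Callable[[str], str],
                  task: str) -> Tuple[str, str]:
    """Try the hosted model first; use the fallback if it raises.

    Returns (source, result) so callers can log which path answered.
    """
    try:
        return ("model", primary(task))
    except Exception as exc:  # e.g. network errors, timeouts, quota errors
        return ("fallback", fallback(f"{task} (model unavailable: {exc})"))


def ask_model(task: str) -> str:
    # Hypothetical stand-in for a hosted inference call that is down.
    raise TimeoutError("inference endpoint timed out")


def manual_runbook(task: str) -> str:
    # Hypothetical stand-in for a documented human procedure.
    return f"Follow the on-call runbook for: {task}"


source, result = with_fallback(ask_model, manual_runbook, "roll back release")
print(source, result)
```

Returning the source alongside the result makes it visible in logs how often the fallback is actually exercised.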

For a deeper discussion of data access risk under broad AI integration, including what the system can retrieve, which regulatory obligations apply, and what vendor retention means for sensitive content, see AI full data access risk.

Production incidents and literal interpretation

Once revenue depends on a system, and there are years of decisions inside it, a full rewrite is no longer realistic. Recent public failures look less like random hallucination and more like literal execution of dangerous instructions. Replit's AI reportedly deleted a production database with over a thousand executive records after being told not to touch live data. Google's Gemini CLI deleted user files while "organizing" missing folders. The pattern is faithful execution of dangerous prompts. The model is not acting on its own.

Those failures are governance and change-control problems as much as they are model problems. In March 2026, Amazon publicly attributed outages to AI-assisted changes that lacked safeguards. The result was hours of downtime, six-figure missed orders in one incident, and roughly $6.3 million in losses in another widely cited figure. Separate 2026 analyses still cite about 45 percent of sampled AI-heavy code carrying OWASP-class issues and about 43 percent of AI-guided changes requiring manual debugging after deployment. Larger models do not remove that risk on their own.
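
One form a change-control safeguard can take is a gate between an agent and the database that refuses destructive statements against production unless a named person has approved them. The sketch below rests on assumptions: the keyword list, the environment names, and the `approved_by` parameter are all hypothetical, and keyword matching is deliberately crude compared with real database permissions.

```python
import re
from typing import Optional

# Assumed list of statement types treated as destructive; adjust per system.
DESTRUCTIVE = re.compile(r"^\s*(DROP|TRUNCATE|DELETE|ALTER|UPDATE)\b", re.IGNORECASE)


class ChangeBlocked(Exception):
    """Raised when an agent-proposed statement needs human approval."""


def check_statement(sql: str, environment: str,
                    approved_by: Optional[str] = None) -> str:
    """Return the statement if it may run, otherwise raise ChangeBlocked.

    Destructive statements against "production" require a named approver.
    """
    if environment == "production" and DESTRUCTIVE.match(sql) and not approved_by:
        raise ChangeBlocked(f"needs a named approver: {sql.strip()[:40]}")
    return sql


try:
    check_statement("DROP TABLE customers;", "production")
except ChangeBlocked as err:
    print("blocked:", err)
```

The point of the design is that the refusal lives outside the prompt. An instruction such as "do not touch live data" is advice to the model, while a gate like this is a control the model cannot talk its way around.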

The business damage often lasts beyond the outage itself. In widely reported cases over the past few years, critical production datasets and their working backups disappeared together for paying customers, and the restore procedure failed when operators tried it under pressure. Customers and procurement teams often accelerate replacement once confidence is gone, even when the vendor promises a fix.

Models recognize patterns. They do not have access to your operational history, your business model, or the implicit rules your team has built up across releases. The output can look correct on screen and still fail under real traffic, an audit, or a coordinated rollout. Larger models give the team more room to design and debug, but turning a plausible branch into code that you will run for years is still human work.

Cost belongs on the same budget sheet

AI use introduces a new infrastructure expense with an ongoing per-use cost. In higher-throughput environments with no limits applied, a senior engineer running agents with large context windows can push API spend toward the fully loaded cost of hiring another senior. Our own usage measurements on long sessions, without using flagship models, came in at about one dollar per minute.

After months of heavy agent use, monthly spend often lands near two hundred to six hundred dollars per developer once you leave vendor starter tiers and start paying per token. On modest sites and apps we still recorded more than eight hundred dollars in under two weeks for a single developer without full-time stack immersion. A single modest prompt cost about ten dollars. The output was largely usable but still needed edits, and the invoice for that piece of work sat in the same band as senior time while we were also paying that senior to supervise. Run that habit for a year, and a small shop can approach twenty thousand dollars per engineer when agents are the default on every workflow.
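
The arithmetic behind the yearly figure is easy to check. The sketch below annualizes the spend quoted above. The 52-week year and the function names are the only assumptions added here.

```python
def annualized(spend: float, period_weeks: float, weeks_per_year: float = 52) -> float:
    """Scale spend observed over period_weeks to a full year."""
    return spend * weeks_per_year / period_weeks


def session_cost(minutes: float, dollars_per_minute: float = 1.0) -> float:
    """Cost of an agent session at the roughly $1 per minute rate quoted above."""
    return minutes * dollars_per_minute


print(annualized(800, 2))    # 20800.0, from $800 in two weeks
print(session_cost(6 * 60))  # 360.0, a six-hour session at $1 per minute
```

Because the recorded spend was more than eight hundred dollars in under two weeks, the annualized number is a floor rather than a ceiling.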

Illustrative monthly AI spend per engineer (published ranges vs. our spikes)

Widths are directional from vendor commentary plus invoices we actually saw. They are not GAAP audited.

Light experimental use: ~$75 to $120

Quoted industry band for habitual agent pipelines: $200 to $600

Burst weeks matching our exploratory spend: > $800 in about two weeks

Engineers who rely heavily on agents during large refactors can push monthly invoices into the thousands when prompts run in long chains without limits. Higher spend and faster commits usually mean more security review, more QA, and rework that a more cautious approach would have avoided.
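
A spending limit can be enforced at the point where prompts are sent. This is a minimal sketch, assuming a hypothetical per-engineer monthly cap and a caller that knows or can estimate the dollar cost of each request. Vendors expose usage data in different ways, so the class below is illustrative only.

```python
class BudgetExceeded(Exception):
    """Raised when a request would push spend past the cap."""


class SpendTracker:
    def __init__(self, monthly_cap: float) -> None:
        self.monthly_cap = monthly_cap
        self.spent = 0.0

    def charge(self, cost: float) -> float:
        """Record a request's cost, refusing it if the cap would be exceeded.

        Returns the budget remaining after the charge.
        """
        if self.spent + cost > self.monthly_cap:
            raise BudgetExceeded(
                f"request of ${cost:.2f} exceeds remaining "
                f"${self.monthly_cap - self.spent:.2f}"
            )
        self.spent += cost
        return self.monthly_cap - self.spent


tracker = SpendTracker(monthly_cap=600.0)  # top of the quoted $200 to $600 band
print(tracker.charge(10.0))  # 590.0
```

Refusing the request before it is sent, rather than reporting overspend afterward, is what stops a long chain of prompts from running past the cap.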

Accountability cannot be delegated to the tool

Across correctness, security, and cost, the requirement is the same. A named person inside the team still weighs the trade-offs, defends the change in review, and answers for the system when something goes wrong. The tool does not get to take any of those jobs over.

The next article in this series explains how architecture-first delivery narrows what models are allowed to produce in the first place, so that the failures described above have fewer opportunities to occur.

If any of this looks familiar in your own work, what would you address first? We are open to walking through it with you via our contact page.

If you want to scale AI use without losing review discipline or producing unexpected budget items, then talk with Corsair about your next build.