Over the last year, testing has started to look more active than it has in a long time. Tickets now have generated test cases attached, prompts show up in PRs, and reviews include screenshots of “AI coverage.” There is visible proof that testing happened and that proof creates comfort. For teams that used to struggle to keep testing in sync with delivery, this feels like progress. On paper, it looks like we finally caught up.

Production does not look any safer.

I keep hearing the same two lines across teams. Either “AI will generate the tests” or “we’ll test ourselves.” Both sound reasonable when you say them. Both usually end the same way: more bugs escaping to production than anyone expected. Not rare edge cases, not complicated failures. Basic correctness issues like double charges, broken access, sessions that never expire, or orders that exist without inventory. The kind of bugs where the only reaction is “how did we miss this?” Before the vibe-coding era, these were the kinds of mistakes people got fired over.

If all this extra testing was actually reducing risk, these are the first problems that should disappear. The fact that they don’t is the real signal.

The issue is not the number of tests. It is what those tests are actually protecting.

AI doesn’t decide risk. It just expands what you give it.

AI testing sounds smarter than it really is. It does not understand your system, your users, or what failure would hurt the most. It does not discover risk for you. It simply expands whatever you put into the prompt.

A prompt is just your current thinking written down. The model elaborates on that thinking. It does not question it. It does not fill in the gaps.

If your understanding is narrow, the output will be narrow too, just longer and better formatted.

So when AI-generated tests look thorough but still miss obvious problems, nothing went wrong. The model tested exactly what you described and ignored everything you didn’t. It is being literal.

Where this usually breaks

This gap shows up in very normal work.

Take a ticket that says, “Add retry logic to checkout.” Someone writes, “Generate test cases for checkout retry logic.” The output looks perfectly fine. Retries on network failure, retry limits, timeout handling, and eventual success. From a functional point of view, it seems complete.

But checkout is not really a retry problem. It is a money and inventory problem. What actually hurts you is charging twice, reserving inventory twice, or ending up with payment and order out of sync.

Those risks are not mentioned, so they do not appear in the tests.

The model did not miss them. We never named them.

This is the pattern I keep seeing. We prompt around implementation details and forget the system invariants. Then production reminds us what actually mattered.
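To make the contrast concrete, here is what one test defending the money invariant might look like. This is a minimal sketch: `CheckoutService`, its fields, and the idempotency-key approach are all invented for illustration, not anyone's real system.

```python
import uuid

class CheckoutService:
    """Hypothetical in-memory checkout used only to illustrate the invariant."""

    def __init__(self):
        self.charges = {}   # idempotency_key -> amount charged
        self.reserved = {}  # order_id -> units reserved

    def checkout(self, order_id, amount, units, idempotency_key):
        # A retry reuses the same idempotency key, so the charge and the
        # inventory reservation happen at most once per logical attempt.
        if idempotency_key in self.charges:
            return "already-processed"
        self.charges[idempotency_key] = amount
        self.reserved[order_id] = self.reserved.get(order_id, 0) + units
        return "ok"

def test_retry_never_double_charges():
    svc = CheckoutService()
    key = str(uuid.uuid4())
    svc.checkout("order-1", 50, 1, key)
    svc.checkout("order-1", 50, 1, key)  # simulated retry after a timeout
    assert len(svc.charges) == 1         # charged exactly once
    assert svc.reserved["order-1"] == 1  # inventory reserved exactly once

test_retry_never_double_charges()
```

Notice the test says nothing about retry limits or backoff. It defends one named risk: a retry must never charge or reserve twice.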

The quiet trap AI introduces

AI also makes it very easy to look serious about testing. You can generate dozens of scenarios in seconds and attach them neatly to a ticket. The output looks structured and thorough, so it feels like diligence.

Over time, the presence of that list starts to stand in for safety. “AI-generated tests for this” becomes shorthand for “we’re covered.”

But most of those lists are never read carefully. They are too long and too generic. The list becomes paperwork. Something that proves effort happened, not that risk went down.

So activity increases and confidence increases, but production incidents stay roughly the same.

AI did not create this behavior. It just made it cheaper.

Why flows aren’t enough

Most of the bugs that hurt teams are not flow failures. The happy path usually works. The screens load. The steps complete. What breaks are the states underneath.

Payment says success while access says no. A password changes, but old sessions stay active. An order exists without matching inventory. Different parts of the system disagree about reality.

These are states that should never exist.
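Written as code, these forbidden states become plain assertions over a snapshot of system state. The snapshot shape below is an assumption made up for this sketch; the point is that each check names a condition that must never hold.

```python
def check_invariants(state):
    """Return a list of forbidden states found in a (hypothetical) snapshot."""
    violations = []
    for order in state["orders"]:
        if order["payment"] == "success" and not order["access_granted"]:
            violations.append(f"{order['id']}: paid but no access")
        if order["payment"] == "success" and order["inventory_reserved"] == 0:
            violations.append(f"{order['id']}: order exists without inventory")
    for user in state["users"]:
        if user["password_changed"] and user["old_sessions_active"]:
            violations.append(f"{user['id']}: stale sessions after password change")
    return violations

snapshot = {
    "orders": [{"id": "o1", "payment": "success",
                "access_granted": False, "inventory_reserved": 1}],
    "users": [],
}
assert check_invariants(snapshot) == ["o1: paid but no access"]
```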

But most prompts are written around flows. Test the login flow. Test the subscription flow. Test checkout failures. Flows are easy to imagine, so AI happily generates more flows. Meanwhile the real risk lives in conditions nobody explicitly defined.

AI can expand what you specify all day. It cannot guess what must always be true.

If you don’t decide that, nothing is actually being protected.

Where AI actually helps

AI is still useful here, just not in the way people expect.

It is not great at figuring out what to test from scratch. But it is very good at expanding constraints. If you clearly state a risk, it will generate many ways that risk could happen. It can help you see holes and question assumptions you might miss.

But only after you make the risk explicit.

So the leverage point is not “write better prompts.” It is “be clearer about what matters before you prompt.”

Once the intent is sharp, AI becomes useful very quickly.

How to make AI test what matters

The teams that get value from AI testing usually make one small change. Before writing any prompt, they pause and ask a simple question: what state must never exist in production?

Not scenarios. Not coverage. Not edge cases.

Just one unacceptable condition.

For example, payment completed but order missing. Subscription active but access removed. Password changed but old sessions still valid.

Once that is clear, the prompt becomes straightforward. Instead of asking for generic tests, you ask for tests that ensure those states never occur or persist. The output is usually shorter but much sharper, because every test is defending something specific.
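As a sketch of what such a test looks like, here is the “password changed but old sessions still valid” case. `SessionStore` and `change_password` are hypothetical stand-ins, assumed only for illustration.

```python
class SessionStore:
    """Hypothetical session store; tracks which tokens are still valid."""

    def __init__(self):
        self.sessions = {}  # token -> user_id

    def create(self, token, user_id):
        self.sessions[token] = user_id

    def revoke_all_for(self, user_id):
        self.sessions = {t: u for t, u in self.sessions.items() if u != user_id}

    def is_valid(self, token):
        return token in self.sessions

def change_password(store, user_id, new_password):
    # ...credential update would happen here...
    store.revoke_all_for(user_id)  # the invariant: no old session survives

def test_no_session_survives_password_change():
    store = SessionStore()
    store.create("old-token", "user-1")
    change_password(store, "user-1", "new-secret")
    assert not store.is_valid("old-token")

test_no_session_survives_password_change()
```

The test exists to defend exactly one forbidden state, which is why it stays short.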

Nothing fancy changed. The thinking just got clearer.

Try this next sprint

During refinement, pick the feature being discussed and force the team to agree on one forbidden state. If you cannot agree quickly, that confusion is probably the real risk. Resolve that first.

Then write your AI prompt around protecting that state. Every generated test should exist to prevent or detect that condition.

It takes a few minutes and usually does more for quality than another page of generic scenarios.

The takeaway

AI does not reduce risk just by generating more tests. It reduces risk when you first decide what must not break and then use AI to defend that decision.

Define the constraint first.

Then let AI expand it.

Otherwise, you are mostly producing better-looking paperwork, not safer software.

Hanisha Arora is a Growth Hacker Lead focused on advocating for products and building better systems. She works where technicalities meet product thinking. Hanisha writes about testing, systems, and the tiny human errors that turn into product lessons.

Learn more about her work on LinkedIn or follow her blog for more.
