Should You Let AI Write Your Automated Tests?

In early 2025, a company asked me to help them leverage AI development tools in their Scrum process.

They wanted efficiency gains, but I was worried.

I looked into a specific development plugin they mentioned, Qodo, which is big on generating automated tests.

I tried it on a Java program I’d written (using TDD) to produce the material for one of my course simulations. The plugin’s interface presented the tests it was generating very convincingly.

The bottom line: uh oh.

Some tests didn’t make sense or even compile.
Others asserted expected values that had nothing to do with the input and processing, so they failed.
The remaining tests ran successfully, but most had little value.

Moreover, the duplication across tests was very high, which meant that keeping them up to date (if I ever change the program’s functionality) would be costly.
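To make the maintenance cost concrete, here is a minimal sketch in plain Java (no test framework). Everything in it is made up for illustration: the Order class, its fields, and the test names are hypothetical and are not the plugin’s actual output. The first two checks each rebuild the same fixture inline; the last one routes construction through a single helper.

```java
import java.util.Arrays;

class DuplicationSketch {

    // Hypothetical class under test, invented for this illustration.
    static final class Order {
        private final String customer;
        private final long[] itemCents;

        Order(String customer, long... itemCents) {
            this.customer = customer;
            this.itemCents = itemCents.clone();
        }

        long totalCents() {
            return Arrays.stream(itemCents).sum();
        }
    }

    // Duplicated style: every check constructs its own Order inline.
    static boolean totalOfTwoItemsDuplicated() {
        Order order = new Order("alice", 500L, 250L);
        return order.totalCents() == 750L;
    }

    static boolean totalOfOneItemDuplicated() {
        Order order = new Order("alice", 500L);
        return order.totalCents() == 500L;
    }

    // Consolidated style: one helper owns construction, so a change to
    // Order's constructor has to be made in one place only.
    static Order orderWith(long... itemCents) {
        return new Order("alice", itemCents);
    }

    static boolean totalsConsolidated() {
        return orderWith(500L, 250L).totalCents() == 750L
            && orderWith(500L).totalCents() == 500L
            && orderWith().totalCents() == 0L;
    }

    public static void main(String[] args) {
        System.out.println(totalOfTwoItemsDuplicated());
        System.out.println(totalOfOneItemDuplicated());
        System.out.println(totalsConsolidated());
    }
}
```

With two tests the difference is trivial. With dozens of generated tests that each repeat the construction, a single constructor change means dozens of edits.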

These are exactly the kinds of tests I’m used to finding at client companies where devs had been told “write tests!”, but given no training on writing behaviour-proving and safety-enhancing tests.

For years, I taught developers how to be test-driven: to evolve software guided by behaviour-focused tests. This strategy produces tight, malleable, intent-revealing code, quite different from what you get when the tests are written after the fact. It takes some getting used to, but once you see the value, it’s hard to go back.

While AI can certainly be used in support of test-driven development, the availability and speed of the AI encourage the opposite, traditional behaviour: write your code, generate tests (now with a tool), fix things until everything seems to work.

And, in case it needed saying: AI works with the code you give it. If your code has bugs, your tests will still pass. They’ll be wrong, but you’ll think they’re right.
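Here is a small sketch of how that plays out, again in plain Java. The discount rule, the method names, and the bug are all invented for this example. The point is only that an expected value copied from what the code does today will agree with the code, bug included, while an expected value taken from the requirement will not.

```java
class GeneratedTestSketch {

    // Hypothetical requirement: orders of 100.00 or more get 10% off.
    // Deliberate bug: the comparison uses > where the requirement says >=.
    static long discountedCents(long totalCents) {
        if (totalCents > 10_000) {
            return totalCents * 90 / 100;
        }
        return totalCents;
    }

    // What a tool that reads only the code would plausibly assert:
    // the expected value mirrors the current (buggy) behaviour.
    static boolean testDerivedFromCode() {
        return discountedCents(10_000) == 10_000;
    }

    // What a test written from the requirement asserts.
    static boolean testDerivedFromRequirement() {
        return discountedCents(10_000) == 9_000;
    }

    public static void main(String[] args) {
        System.out.println("derived from code passes: " + testDerivedFromCode());
        System.out.println("derived from requirement passes: " + testDerivedFromRequirement());
    }
}
```

The first check passes and the second fails. Only the failing one tells you anything about the bug.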

I’ve only tested this specific tool (which is based on ChatGPT) in Java and at this point in time. It’s clear that code generation has come a very long way, but it’s not there yet. Theoretically, test generation should be easier and more deterministic, but it’s not there either.

Don’t be dazzled by its speed and output. You might end up paying for them considerably later on.