The Biggest Risk of GenAI Coding
Do you use AI to write code? If you don’t also use test-driven development (TDD), you may be asking for trouble, even if it’s greenfield work or a quick prototype.
I’ve just seen this firsthand with a two-part word puzzle I’m designing. Preparing the first part involved filtering and manipulating dictionary words using data structures and algorithms straight out of first-year computer science.
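To give a flavour of that level of difficulty without spoiling the puzzle, think of something like grouping five-letter dictionary words into anagram buckets. The sketch below is a hypothetical stand-in of my own, not the actual puzzle code:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class WordPrep {
        // Hypothetical stand-in for the real prep step: keep five-letter
        // words and group them by sorted-letter signature, so anagrams
        // land in the same bucket.
        public static Map<String, List<String>> groupAnagrams(List<String> words) {
            Map<String, List<String>> groups = new HashMap<>();
            for (String word : words) {
                if (word.length() != 5) continue; // drop words of other lengths
                char[] letters = word.toCharArray();
                Arrays.sort(letters);
                groups.computeIfAbsent(new String(letters), k -> new ArrayList<>())
                      .add(word);
            }
            return groups;
        }
    }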
I asked ChatGPT to write the code (I use Java). I tested it manually and immediately spotted an error. ChatGPT tried to fix it seven(!) times before I gave up.
I switched to Claude, which got the code right on the first try. But when I asked it to write tests, they were not great: wrong assertions, missing checks, and pointless cases.
For the second part of the puzzle, I went back to trusty TDD. I explained the task to Claude and had it write behaviour-focused failing tests. They needed only light cleanup. Its minimal passing code worked, and refactoring was painless.
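To make "behaviour-focused" concrete: such tests pin down observable input/output behaviour and stay silent about the implementation. Here is a hypothetical sketch in JUnit 5, written against the groupAnagrams stand-in above (again, not the actual puzzle tests):

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import static org.junit.jupiter.api.Assertions.assertTrue;

    import java.util.List;
    import java.util.Map;
    import org.junit.jupiter.api.Test;

    class WordPrepTest {

        @Test
        void anagramsShareOneBucket() {
            Map<String, List<String>> groups =
                    WordPrep.groupAnagrams(List.of("least", "slate", "steal", "tales"));
            // All four words are anagrams, so exactly one bucket holds them all.
            assertEquals(1, groups.size());
            assertTrue(groups.values().iterator().next()
                    .containsAll(List.of("least", "slate", "steal", "tales")));
        }

        @Test
        void wordsOfTheWrongLengthAreDropped() {
            assertEquals(0, WordPrep.groupAnagrams(List.of("cat", "elephants")).size());
        }
    }

Note what these tests don't do: no mocks, no peeking at internal data structures. They would survive any correct rewrite, which is exactly what makes them useful as a constraint on the AI.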
I repeated the TDD cycle with ChatGPT. It generated over-engineered tests with mocks, but when it wrote the code, those tests caught a bug. I asked it to fix that bug, and what did it do? It completely rewrote the code! The result was uglier, but at least correct.
The takeaway isn’t that Claude beats ChatGPT. It’s that if you assume AI-generated code is correct, that will bite you. And if you have AI generate tests for its code, those “passing” tests might confirm faulty behaviour. Instead, constrain the solution space by writing behaviour-focused tests first. Check them thoroughly, and once the AI makes them pass, clean up the code.
Copyright © 2025, 3P Vantage, Inc. All rights reserved.