Should you use AI to write tests?
Recently I came across a team in my organisation who claimed they had achieved 100% test coverage using AI: they generated the tests with AI and also made code changes to increase the coverage of those tests. It made me question: when an AI generates tests, how does it know what is right and wrong? Any test, at its core, is a way of encoding the notion of correctness for a given system. Isn’t that definition of correct something only the humans working in the company know? Let’s say I build a new application for internal folks at my company to query historical data. What does a ‘correct’ application look like in this example? Correctness here is spread across several dimensions:
- Can it withstand high load and concurrency? [Reliability]
- Is it adding value to people’s workflows? (Most important) [Product Market Fit]
- Is it designed so that I can swap out the current query engine for a new one later? [Composability]
- Does the system ensure efficient utilisation of resources? [Cost Control]
Each of these questions should have active human involvement. Even though AI does have an idea about high load and concurrency, it’s still better to tailor the solution to your specific scale to avoid over-engineering.
So the answer, I believe, is to have a document that encodes this notion of correctness for your system, and then write harnesses (integration tests, unit tests) that can verify these notions. This prevents the AI from generating a lot of garbage code and keeps it on track. The hard part now shifts from generating code to answering two questions about your system:
- What does correct look like for your system? [technically and in terms of product]
- Can an LLM verify this definition of correct?
Both are hard questions, the first one even more so. If you are early in your career like me, you haven’t seen enough failures or systems going rogue to have a complete picture of correctness. Verification becomes an issue when you cannot simulate or verify certain aspects of correctness, or when you can but it’s cumbersome and takes a long time.
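To make the idea of a correctness document plus a verifying harness a bit more concrete, here is a minimal sketch in Python for the reliability dimension only. Everything named in it is hypothetical: `CorrectnessSpec`, `query_history` and `check_reliability` are illustrative names, and the numbers are placeholders that a human who knows the real scale would have to choose. It is one possible shape, not a recommended implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectnessSpec:
    """Human-written targets. The defaults are made-up placeholders."""
    max_concurrent_users: int = 20
    max_latency_seconds: float = 0.5


def query_history(user_id: int) -> list:
    """Hypothetical stand-in for the real historical-data query."""
    time.sleep(0.01)  # simulate a small amount of work
    return [{"user_id": user_id, "value": 42}]


def check_reliability(spec: CorrectnessSpec) -> bool:
    """Run one query per simulated user in parallel and check each
    call returns rows within the latency the spec allows."""

    def timed_call(user_id: int) -> bool:
        start = time.perf_counter()
        rows = query_history(user_id)
        elapsed = time.perf_counter() - start
        return bool(rows) and elapsed <= spec.max_latency_seconds

    with ThreadPoolExecutor(max_workers=spec.max_concurrent_users) as pool:
        return all(pool.map(timed_call, range(spec.max_concurrent_users)))


if __name__ == "__main__":
    print("reliability ok:", check_reliability(CorrectnessSpec()))
```

The point of the split is that the spec stays small and human-owned, while the harness is the part an AI could help write or run. Dimensions like product market fit have no such check, which is exactly where verification gets hard.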
Inspired by