SaschaWildgrube
ServiceNow Employee

LLMs have made considerable progress since ChatGPT was first made available to the public, and the quality of generated code has improved greatly. Yet the generated code still falls far short of production standards and is not fit for purpose in many cases.


The core challenge is that code - to function properly - must be the expression of an algorithmic idea - not a loose summary of something that someone (or something) has read before. The latter may be a starting point, but it is not sufficient for a final version.


Whether tests are written as code or executed manually, in most cases a (human) developer works in cycles: building a hypothesis, writing code, testing the resulting program, analyzing the test results - and then starting over.


Test-Driven Development is the practice of automating parts of that cycle by writing test code early in the process, shortening the feedback loop between hypothesis and analysis and allowing for greater productivity and more robust results.
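
To make this concrete, here is a minimal sketch of what such an automated test could look like in server-side ServiceNow JavaScript. The function Capitalize is a hypothetical example (in TDD, the test would be written before a working implementation exists), used only to show the shape of the feedback loop:

    // "Capitalize" is a hypothetical example function, not a shipped API.
    function Capitalize(text) {
        if (text.length === 0) {
            return text;
        }
        return text.charAt(0).toUpperCase() + text.substring(1);
    }

    function testCapitalize() {
        var failures = [];
        // Each case pairs an input with the expected output - the hypothesis.
        var cases = [
            { input: 'hello', expected: 'Hello' },
            { input: '', expected: '' },
            { input: 'World', expected: 'World' }
        ];
        for (var i = 0; i < cases.length; i++) {
            var actual = Capitalize(cases[i].input);
            if (actual !== cases[i].expected) {
                failures.push('Input "' + cases[i].input + '": expected "' +
                    cases[i].expected + '" but got "' + actual + '"');
            }
        }
        // The returned failure messages are the feedback that drives the next
        // iteration of the cycle - for a human developer and for an LLM alike.
        return failures;
    }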


When the quality of generated code is assessed, we compare apples and oranges. Generated code usually results from a single prompt containing a description of the requirement and possibly the current version of the code. Do we really expect AI to deliver a result that is comparable to what an experienced human developer would provide after running through the cycle several times?


The relevant question is: are we asking AI the right way?


Test-Driven Code Generation (also referred to as "Test-Driven Generation" or "TDG") integrates the generative power of AI into the same procedure that a human developer would go through. Where Test-Driven Development automates the testing part of the cycle, Test-Driven Code Generation aims at automating the part where code is written (a sketch of the loop follows the list below):

  1. Human writes boilerplate code and describes the outcome.
  2. Human writes a test.
  3. The test is run.
  4. If the test passes, end here or return to step 2 to add more test cases.
  5. If the test fails, ask AI to implement the code based on the description, the interface, the current version of the code, the test code and the latest test results.
  6. Return to step 3.
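
The following sketch shows how steps 3 to 6 could be orchestrated. All names in it (generateUntilTestsPass, runTest, askLLMForImplementation) are hypothetical placeholders, not the actual ArtificialDeveloper API:

    // Hypothetical sketch of the TDG loop (steps 3 to 6).
    function generateUntilTestsPass(description, interfaceSpec, code, testCode, maxAttempts) {
        for (var attempt = 0; attempt < maxAttempts; attempt++) {
            var testResult = runTest(testCode, code); // step 3: run the test
            if (testResult.passed) {
                return code; // step 4: done - or the human adds more test cases
            }
            // Step 5: the prompt bundles everything the model needs to know -
            // the description, the interface, the current version of the code,
            // the test code and the latest test results.
            code = askLLMForImplementation(
                description, interfaceSpec, code, testCode, testResult.output);
            // Step 6: loop back and run the test again.
        }
        return code; // best attempt so far; a human developer takes over from here
    }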

For disambiguation: The term "Test-Driven Generation" has sometimes been used to describe the process of creating test cases (not the code to fit these tests) using AI. However, more recent publications use the term consistently for code (not test) generation.


Developer Efficiency Gains

What should we expect from AI-augmented coding? As always, the answer is: it depends.

It depends on the LLM, the specific prompt and the difficulty of the problem to be solved. To make reasonable statements about solution quality, the following problem taxonomy is proposed:

Class 1 Problems

These are low-complexity, non-domain-specific problems for which many fit-for-purpose solutions are available in public sources. Solutions do not depend on additional data sources or the system environment. The resulting code delivers fully deterministic results.

Examples are simple math problems (linear algebra, geometry, Newtonian physics), such as calculating the volume of solid bodies or gravitational vectors, or well-known algorithms for sorting and searching.
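
A class 1 solution might look like the following sketch; the function name is illustrative, not taken from any existing library:

    // Class 1 example: volume of a sphere - deterministic, self-contained,
    // with countless reference implementations in public sources.
    function calculateSphereVolume(radius) {
        if (typeof radius !== 'number' || radius < 0) {
            return null; // reject invalid input rather than returning nonsense
        }
        return (4 / 3) * Math.PI * Math.pow(radius, 3);
    }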

Class 2 Problems

These are high-complexity, non-domain-specific problems or low-complexity, domain-specific problems. Solutions are available in public sources, but not all of them are fit for purpose.

Examples are complex math problems with partially non-deterministic results, optimization problems, data validation tasks that require implicit knowledge (e.g., emails, URLs, ISINs, IBANs), problems that require data retrieval via database interfaces or REST APIs, and low-complexity, domain-specific problems that involve ServiceNow-specific APIs, such as retrieving records based on filter criteria.
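
The last example - retrieving records based on filter criteria - could be sketched as follows using the standard GlideRecord API; the table and filter values are illustrative assumptions:

    // Class 2 example: retrieve active priority-1 incidents via GlideRecord.
    // Correct results depend on the system environment (tables, ACLs, data),
    // which is what lifts this above class 1.
    function getOpenCriticalIncidents() {
        var result = [];
        var gr = new GlideRecord('incident');
        gr.addQuery('active', true);
        gr.addQuery('priority', 1);
        gr.query();
        while (gr.next()) {
            result.push({
                number: gr.getValue('number'),
                shortDescription: gr.getValue('short_description')
            });
        }
        return result;
    }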

Class 3 Problems

These are high-complexity, domain-specific problems. Fit-for-purpose solutions are not available in public sources or are difficult to find due to varying terminology or context.

An example is retrieving an API token from a ServiceNow connection alias and then using it with a REST API that requires multiple requests to perform a task.
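
A heavily hedged sketch of that example follows. It assumes the sn_cc.ConnectionInfoProvider API, a credential attribute named 'api_key', and a hypothetical remote '/jobs' endpoint - all of which depend on how the connection alias and the remote system are actually configured:

    // Class 3 example (sketch only): read connection details from a connection
    // alias, then drive a REST API that needs two requests to complete one task.
    // 'aliasSysId', the 'api_key' attribute and the '/jobs' endpoint are
    // instance-specific assumptions.
    function performMultiStepApiTask(aliasSysId) {
        var provider = new sn_cc.ConnectionInfoProvider();
        var connectionInfo = provider.getConnectionInfo(aliasSysId);
        if (connectionInfo == null) {
            return null;
        }
        var baseUrl = connectionInfo.getAttribute('connection_url');
        var apiKey = connectionInfo.getCredentialAttribute('api_key');

        // First request: create a job on the remote system.
        var createRequest = new sn_ws.RESTMessageV2();
        createRequest.setHttpMethod('post');
        createRequest.setEndpoint(baseUrl + '/jobs');
        createRequest.setRequestHeader('Authorization', 'Bearer ' + apiKey);
        var jobId = JSON.parse(createRequest.execute().getBody()).id;

        // Second request: fetch the job result. The multi-request choreography
        // is the domain-specific knowledge that public sources rarely spell out.
        var pollRequest = new sn_ws.RESTMessageV2();
        pollRequest.setHttpMethod('get');
        pollRequest.setEndpoint(baseUrl + '/jobs/' + jobId);
        pollRequest.setRequestHeader('Authorization', 'Bearer ' + apiKey);
        return pollRequest.execute().getBody();
    }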


As of now - at the time of the release of this article - Test-Driven Code Generation as implemented in the ArtificialDeveloper open-source app, using the GPT-4 model, works very well for class 1 problems, works most of the time for class 2 problems, but fails on class 3 problems - although it still produces code that can serve as a basis for further manual work.

https://www.wildgrube.com/servicenow-artificialdeveloper


Because the Test-Driven Code Generation approach requires developers to split capabilities into functions and to produce automated tests, the efficiency gain for class 1 problems may be negligible, but for class 2 problems it can be significant. One could argue, though, that enforcing the refactoring of code into Script Include functions and the creation of automated tests is in itself a design pattern that saves costs in the long run. As a result, adopting TDG - and thereby indirectly adopting TDD - already represents an efficiency gain, even if a particular problem cannot be fully solved through TDG.


Read the full story:

https://www.wildgrube.com/download/A%20mature%20Development%20and%20Deployment%20Process.pdf
