Feb 1, 2024
Three distinct attempts by gencraft.com to generate an image for “the Good, the Bad, and the Unknown”. I’m not sure which is best.
“The reports of my death have been greatly exaggerated”. (Mark Twain)
And neither are the announcements of dismissal of programming that come together with it. We’ve seen such announcements with Cobol many years ago. And we’ve seen them with 4GLs (cover of App Development w/o Programmers, James Martin, 1982), model-driven engineering and with low-code platforms. And now we’re seeing it with AI.
Traditional code generation produced code in a lower-level formal language from a higher-level formal language. Notice the emphasis on formal language.
This is different from the AI-based generation that fills the news in 2024: ChatGPT, Copilot, llama2, etc. This new kind of code generation can take as input a mix of natural language and code (in the case of automatic code completion) and can generate a mix of natural language and code.
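To make the contrast concrete, here is a minimal sketch of generation in the traditional sense: a deterministic translation from one formal language to another. The source language (integer expressions with + and *) and the target stack-machine instruction set (PUSH, ADD, MUL) are hypothetical, chosen only for illustration; parsing relies on Python's standard ast module.

```python
import ast

# Hypothetical target language: a tiny stack machine with PUSH, ADD and MUL.
OPS = {ast.Add: "ADD", ast.Mult: "MUL"}


def generate(expr: str) -> list[str]:
    """Translate an integer expression into stack-machine instructions."""

    def emit(node: ast.AST) -> list[str]:
        # Integer literal: push it on the stack.
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return [f"PUSH {node.value}"]
        # Supported binary operator: emit both operands, then the operation.
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return emit(node.left) + emit(node.right) + [OPS[type(node.op)]]
        # Anything else is not part of the (formal) source language.
        raise ValueError("input is outside the formal source language")

    return emit(ast.parse(expr, mode="eval").body)


print(generate("1 + 2 * 3"))
# ['PUSH 1', 'PUSH 2', 'PUSH 3', 'MUL', 'ADD']
```

The same input always yields the same output, and anything outside the grammar is rejected rather than guessed at. An LLM-based generator gives up both of these properties in exchange for accepting natural language.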
The most successful generative AI of the moment is based on large language models (LLMs). These are statistical models of language, which were originally designed for natural language. The extensive introduction to ChatGPT by Stephen Wolfram is a fantastic introduction to LLMs. It’s a blog post, but one that will take you several hours to read. It’s worth all that time.
Indeed, it turns out that when one feeds “the whole internet” into such a model, something in the weights of its billions of parameters also allows it to generate correct code, without anyone ever having specified a parser or a lexer.
The current generation of such models is based on several conditions:
These things are ever changing: new architectures appear every year, and new models and variants appear even more frequently. So I’ll try to keep this presentation and discourse at the level where things are not changing.
LLMs are trained either to predict the next token (decoder-only transformers, e.g. GPT) or to predict a missing token (encoder-only transformers, e.g. BERT). However, in both cases, their strength is generating text and, for the programmer, code. This is why there are two main ways in which they are integrated into the development process, corresponding to two kinds of user interface:
Auto-complete-like agents. The developer writes a function definition, and the LLM gets to auto-complete it. If the problem is general enough, and the function name is clear enough, the most likely continuation might be exactly what the developer needs. This works especially well, and is very useful, for boilerplate code or generic queries.
Chat-like agents. With extra training in the form of reinforcement learning from human feedback (RLHF), a large language model (e.g. GPT) can be conditioned to work in the fashion of a chat-like system (e.g. ChatGPT, llama2). In this context, the model predicts the next token in such a way as to obtain the most likely continuation of a conversation rather than of a general text.
Some IDEs such as VSCode integrate the two modes of interaction.
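The idea of a "most likely continuation" that underlies both modes can be illustrated with a toy sketch. This is not how an LLM works internally: it is a bigram frequency table over a three-line hypothetical corpus, standing in for the far richer statistics that a transformer learns. The function names train and complete are made up for this example.

```python
from collections import Counter, defaultdict

# Hypothetical miniature "training set"; real models see billions of tokens.
corpus = [
    "for i in range ( n ) :",
    "for x in items :",
    "for i in range ( 10 ) :",
]


def train(lines: list[str]) -> dict[str, Counter]:
    """Count, for every token, which tokens follow it and how often."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts


def complete(counts: dict[str, Counter], prompt: str, max_tokens: int = 5) -> str:
    """Greedily append the most frequent next token, one token at a time."""
    tokens = prompt.split()
    if not tokens:
        return prompt
    for _ in range(max_tokens):
        candidates = counts.get(tokens[-1])
        if not candidates:
            break  # nothing ever followed this token in the corpus
        tokens.append(candidates.most_common(1)[0][0])
    return " ".join(tokens)


counts = train(corpus)
print(complete(counts, "for i", max_tokens=3))
# for i in range (
```

The completion is simply what appeared most often in the data, which is why this style of generation does well on boilerplate: the common case is, by definition, well represented.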
There are ways in which these tools are improving the lives of the developers:
LLMs extract patterns from other developers and bring this knowledge to the individual. Indeed, developers are aware of the DRY principle (don’t repeat yourself). However, they cannot respect the "don’t repeat others" principle, since they don’t know what others are doing. As a community, Software Engineering has already done work on extracting patterns from the ecosystem and mining version repositories to help the individual developer, but the new models bring this knowledge to a completely new level.
The homepage of GitHub Copilot reports a study that suggests that when using Copilot, developers are “faster with repetitive tasks” (96%), spending “less time searching” (77%), “more in the flow” (73%). These very precise numbers have to be taken with a grain of salt, but they do suggest a speedup of code generation.
They can help write boilerplate code much faster, although the boilerplate still gets written. I’m collecting my own examples of both successes and failures. Sometimes it feels like magic; sometimes it feels like the second coming of Clippy from MS Word.
Chat-like systems also help developers avoid interruptions and searching the web - which these days is a pain. Google and StackOverflow are not going away, but for some kinds of queries they will likely be replaced by LLM-based systems.
This in itself is often halfway to the solution. Google has taught all of us to throw a bag of keywords at it and sift through pages filled with ads and junk to collect the valuable snippet. Not unlike how developers used to sift through horrible forums before StackOverflow.
There are also things that are rotten in the state of Denmark…
The most likely continuation of a string is likely not the best in terms of:
Quality. Just as biases exist in other kinds of datasets, there will be biases in programming-related data. A recent study attempting to evaluate the quality of the generated code seems to suggest that its quality is more akin to that of a passing contributor to a codebase than to that of a senior developer on the team.
Timeliness of patterns and libraries - and thus, if you as a customer or employer get yourself a lazy programmer, who relies too much on generated code, you might end up with the patterns of yesterday. Paolo Tell: “there was a delay of several years required for the community to realize that Singleton, although a design pattern was a bad idea”. Indeed, if generating code in the way it used to be done in the past is easy, why should one learn a new way?
Given that these systems are trained on existing human-generated content, they cannot exist without it. At the same time, they represent a strong disincentive for further content generation.
- When StackOverflow first appeared, developers were enthusiastic. It put order into a messy web. Remember how forums looked before StackOverflow? SO had a clean and simple design, and a clever social incentive system with the help of which people who wanted to be good citizens could contribute and be recognized for their contributions. The key was probably the recognition.
- Systems like ChatGPT and Copilot are exactly the opposite. They do not give credit. And unfortunately, even if they wanted to, they couldn’t because of their architecture. This will discourage people from contributing their knowledge freely.
For long-lived systems, where maintainability is important, using code generated by these systems is very likely worse than the recommended software engineering practices.
Code generation risks encouraging ignorance – a lack of understanding of the bigger context.
It lacks the rich conversations around the questions that are present on StackOverflow, for example. These conversations around a question are sometimes even more informative than the actual answer. And the value of the conversations is that they also identify and solve the corner cases. The main answer solves the general problem; the comments often solve the rare exception.
As of today, there remain many open questions:
Indeed, maintenance is the longest phase in the life of a system. These days we call it software evolution exactly to acknowledge its importance and extent. We have no idea whether a patchwork of generated code, cobbled together, will result in reliable and maintainable systems.
For what kind of tasks, it is still not clear. For what kind of developers it is also not clear. Maybe more for juniors. Maybe more for data scientists who are not necessarily that experienced in coding?
Developers spend an overwhelming majority of their time reading code rather than writing; why? Because code understanding is still the hardest part of maintenance. Can we use LLMs to make code more understandable? This would be hugely impactful.
How do we study these things? Running after every new version of every new model of every new company is chasing after wind. New versions might be much better, or might decrease in quality as the company needs to cut operating costs. From this POV, running one’s own model is more sustainable. But such models are not as good as the commercial ones. What can we write about it that will still be relevant in 5 years? It can’t be dependent on the current version; but how can you not depend on it?
Thanks to Iulian, Tiago, Adam for feedback on earlier drafts of this.