How Well Did AI Do It? Software Engineering Edition

For nearly a year now, the world has been captivated by the power and potential of generative artificial intelligence (AI) and large language models (LLMs). Whether the task has been generating artistic renderings in the style of Salvador Dali, generating music from the brainwaves of patients during surgery, or simply generating your next email, the latest generation of AI models promises a bright future.

But how well does AI do at more technical tasks, like answering coding questions?

Researchers at Purdue University set out to test this very question by using ChatGPT to answer software engineering questions on Stack Overflow (a popular Q&A community platform for programmers). A pre-print version of the study can be found here.

The Challenge

The Purdue Team set out to determine how well ChatGPT could perform at answering questions about software programming. More specifically, the Purdue Team reviewed ChatGPT responses to determine their correctness, consistency, comprehensiveness, and conciseness. Additionally, the researchers assessed how ChatGPT responses differed linguistically from responses provided by humans. Finally, a user study was conducted with a group of 12 programmers to determine user preferences for ChatGPT or standard Stack Overflow responses.

The Data

The Purdue Team pulled questions from Stack Overflow in March 2023. Questions were required to have an accepted answer, indicating that the user who posted the question found that an answer worked for their problem. The accepted answer also provided a human response to serve as the comparison to the ChatGPT responses.

The Purdue Team classified the questions according to three variables: popularity, age, and question type. Popularity was defined using the number of views, separating the 10 percent most popular and the 10 percent least popular questions from the middle 80 percent. Age was defined based on whether the question was posed before or after the release of ChatGPT on November 30, 2022. Finally, the question type was defined using a support vector machine (SVM) classifier to analyze the title text of the question. Three categories of questions were identified with 78 percent accuracy: conceptual programming, how-to, and debugging.
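To make the popularity and age groupings concrete, here is a minimal Python sketch. It is an illustration only, not the study's code: the function names, the label strings, the rank-based handling of the 10 percent cutoffs, and the choice to count a question posted on the release date itself as "new" are all assumptions.

```python
from datetime import date

CHATGPT_RELEASE = date(2022, 11, 30)  # release date given in the article


def popularity_labels(view_counts):
    """Label each question 'low', 'middle', or 'high' by view-count rank.

    Assumption: the 10 percent tails are taken by rank. The article does
    not describe how the study handled ties.
    """
    n = len(view_counts)
    cutoff = n // 10  # number of questions in each 10 percent tail
    order = sorted(range(n), key=lambda i: view_counts[i])
    labels = ["middle"] * n
    for i in order[:cutoff]:
        labels[i] = "low"
    for i in order[n - cutoff:]:
        labels[i] = "high"
    return labels


def age_label(posted_on):
    """Label a question by whether it predates ChatGPT's release.

    Assumption: a question posted on the release date counts as 'new'.
    """
    return "old" if posted_on < CHATGPT_RELEASE else "new"
```

The question-type variable is left out here because it depends on a trained SVM classifier whose features and training data are not described in this article.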

The Purdue Team used a stratified random sampling method to select questions from each of the categories of the three classification variables in approximately equal proportions. Because the SVM classifier for question type had some error, the researchers manually confirmed that the sampled questions were properly classified and dropped questions that had been misclassified. The final sample contained 517 Stack Overflow questions. An additional sample of 2,000 Stack Overflow questions was drawn specifically for the linguistic analysis.
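Stratified random sampling can be sketched in a few lines of Python. This is a generic illustration under stated assumptions rather than the researchers' procedure: the function name, the `key` callback, the per-stratum quota, and the fixed seed are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(questions, key, per_stratum, seed=0):
    """Draw up to `per_stratum` questions at random from each stratum.

    `key` maps a question to its stratum, for example a tuple of
    (popularity, age, question type).
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    strata = defaultdict(list)
    for q in questions:
        strata[key(q)].append(q)
    sample = []
    for name in sorted(strata):
        group = strata[name]
        # Take the whole group if it is smaller than the quota.
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample
```

Drawing the same number from each stratum is what keeps rare groups, such as the least popular questions, from being swamped by the middle 80 percent.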

The Analysis

The Purdue Team submitted the original 517 questions to ChatGPT to obtain responses. The ChatGPT responses were reviewed by the researchers and compared to the accepted answers from Stack Overflow, as well as information from other websites and programming language documentation. Each of the answers was assessed for the following:

Correctness: was the ChatGPT answer correct in terms of facts, concepts, code, and terminology?
Consistency: was the ChatGPT response consistent with the accepted Stack Overflow response? (Note that inconsistency does not mean incorrect).
Conciseness: did the ChatGPT response contain redundancy, irrelevant information (not part of the original question), or excess information (not needed to understand the answer)?
Comprehensiveness: did the ChatGPT response 1) answer all parts of the question, and 2) address all parts of the solution in the answer?
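One way to picture how these four judgments turn into summary percentages is the short Python sketch below. The `Rating` record and the `percent_true` helper are hypothetical names used for illustration; the article does not describe how the researchers recorded their labels.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    """One reviewer judgment of a single ChatGPT response (hypothetical layout)."""
    correct: bool
    consistent: bool
    concise: bool
    comprehensive: bool


def percent_true(ratings, criterion):
    """Percentage of ratings where the named criterion is True."""
    if not ratings:
        raise ValueError("no ratings to summarize")
    hits = sum(1 for r in ratings if getattr(r, criterion))
    return 100.0 * hits / len(ratings)
```

For example, `percent_true(ratings, "correct")` over a list of `Rating` records gives the share of responses judged correct.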

The linguistic analysis was performed on the sample of 2,000 additional questions submitted to ChatGPT. The analysis used the Linguistic Inquiry and Word Count (LIWC) tool, a database of psychologically meaningful words about the emotional, cognitive, and structural components of text, and assessed the frequency of each word within each of the ChatGPT and accepted Stack Overflow responses. The linguistic analysis focused on the following categories of language:

Linguistic style: was the response written in an analytical manner, with confident language, in an authentic or spontaneous manner, and what was the emotional tone?
Affective attributes: did the response include language indicating either a positive or negative emotional state?
Cognitive processes: did the response include language suggesting insights, causation, comparisons, certainty, and differentiation of specific issues?
Drives attributes: did the responses include language expressing the need, desire, or effort to achieve something?
Perceptual attributes: did the response include words about human perception (e.g., see, feel, hear)?
Informal attributes: did the response include casual language, idioms, abbreviations (e.g., btw, lol), or fillers (e.g., you know, I mean)?

The Purdue Team used the word frequencies in each category to calculate the relative difference in the frequency of linguistic features between the ChatGPT and accepted Stack Overflow responses.
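The Python sketch below shows one plausible way to compute such a comparison. It rests on several assumptions: the article does not give the exact formula, so the relative difference here is taken as the percentage change from the human rate to the ChatGPT rate; the tokenizer is a simple regular expression; and the word set passed in is a toy stand-in, since the actual LIWC dictionaries are not reproduced here.

```python
import re


def category_rate(text, category_words):
    """Share of tokens in `text` that belong to a word category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in category_words)
    return hits / len(tokens)


def relative_difference(chatgpt_rate, human_rate):
    """Percentage difference of the ChatGPT rate relative to the human rate.

    Assumed formula: 100 * (chatgpt - human) / human.
    """
    if human_rate == 0:
        raise ValueError("human rate is zero; relative difference undefined")
    return 100.0 * (chatgpt_rate - human_rate) / human_rate
```

Under this definition, a positive value means the feature appears more often in ChatGPT responses and a negative value means it appears more often in human responses.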

Finally, a user study was performed with 12 programmers. The programmers were presented with the Stack Overflow questions, the accepted Stack Overflow responses, and the ChatGPT responses. The responses were not labeled as to the source of the response (i.e., Stack Overflow vs ChatGPT), and were presented so that sometimes the ChatGPT response appeared first and sometimes not. The programmers were asked to complete a brief questionnaire about each question and set of responses. A 10-minute interview was performed to review the programmer responses and ask probing questions about why they answered the survey questions in particular ways.
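The blinded, order-varied presentation described above could be set up along the following lines. This Python sketch is an assumption about the mechanics, not the study's tooling: the function name, the "Answer A" and "Answer B" labels, and the use of a random shuffle to vary the order are all hypothetical.

```python
import random


def blinded_pair(so_answer, chatgpt_answer, rng):
    """Return both answers in random order under neutral labels.

    The second return value is the hidden key that maps each neutral
    label back to its source, kept away from the participant.
    """
    items = [("stack_overflow", so_answer), ("chatgpt", chatgpt_answer)]
    rng.shuffle(items)
    shown = {"Answer A": items[0][1], "Answer B": items[1][1]}
    key = {"Answer A": items[0][0], "Answer B": items[1][0]}
    return shown, key
```

Calling `blinded_pair(human_text, chatgpt_text, random.Random(42))` returns what the participant sees and, separately, the key needed to score their guesses afterward.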

The Results

Key results from the study are summarized as follows:

Among the 517 ChatGPT responses reviewed by the research team:
- 52 percent were incorrect
- 22 percent were consistent with human responses
- 65 percent were comprehensive
- 23 percent were concise
Among the incorrect ChatGPT responses (a response could contain more than one type of error, so the percentages sum to more than 100):
- 54 percent included conceptual errors attributed to ChatGPT not understanding the context of the question
- 36 percent included factual errors
- 28 percent included errors in programming code
Most code errors were caused by ChatGPT using incorrect logic or using non-existent or inappropriate functions, libraries (i.e., groups of functions), or application programming interfaces (i.e., functions used to communicate between different programs).
In the linguistic analysis, ChatGPT responses:
- were more formal than human responses
- exhibited more language about analytic thinking
- included more goal-oriented language
- included more positive sentiments than human responses
In the user study, participants:
- correctly distinguished ChatGPT responses from human responses 81 percent of the time
- failed to identify incorrect ChatGPT responses 39 percent of the time due to the comprehensive and well-articulated language used
- preferred human responses 65 percent of the time because they were more concise and useful

Implications

The publicity around AI tools has catapulted their use in mainstream business applications over the past year, including as tools that programmers and analysts are turning to for support on a regular basis. The results of the Purdue study reviewed here carry important implications for the future use of AI in coding. Whether you are a programmer, or you work alongside programmers, here are a few key takeaways.

More than Half of Responses were Incorrect: ChatGPT was unable to discern nuances of questions and often focused on specific parts of questions when the most pressing issues were present in other parts of the question. The inability of the model to consider potential programming structures and algorithms outside the scope of the question prompt presents a challenge to providing conceptually and factually correct responses. Furthermore, the implementation of false logic or incorrect functions within the code responses indicates that hallucinations remain a considerable challenge to AI applications in programming.

Human Users Cannot Always Identify Incorrect Responses: Programmers reviewing incorrect ChatGPT responses failed to identify the flaws 39 percent of the time, often because the response was written in a formal, confident, and authoritative manner. This is problematic for two reasons. First, generative AI models like ChatGPT often produce responses in a formal, authoritative tone that gives no signal when a hallucination or faulty logic is present. Users reviewing the response must rely on their own knowledge or on other sources to confirm the accuracy of AI responses to coding questions.

Second, if a programmer is seeking assistance from AI in resolving a coding problem, then it is not a stretch to assume the programmer may not have the knowledge base to recognize an incorrect response. When these two facts are combined, they create a scenario in which user questions may be answered incorrectly, and users cannot identify that the response is incorrect.

Always Verify with Outside Sources: Given the errors that ChatGPT can introduce in responses to coding questions, and the potential for humans to overlook those errors, the Purdue study results highlight the importance of verifying responses with outside sources. It is imperative that users take the time to vet responses through multiple sources to ensure complete understanding and accuracy before proceeding with a solution provided by AI.

How Well Did AI Do It? Conclusion

In ChatGPT’s defense, it was not created to be a programming code model. Other tools such as DeepMind’s AlphaCode are being developed specifically to provide programming code based on natural language descriptions, and the industry is continuing to press forward with advances in AI technology. So, while the results discussed above don’t appear very positive, I think they remain promising for the future.

Still, I don’t see that human programmers are in danger of being made obsolete by AI anytime soon. Even as progress continues, all AI models are a function of the data they have been trained on. For programming applications, this includes syntax rules, function parameters, and common programming structures, along with innovative deviations from those structures. It is more difficult, however, to train an AI model on how to write a program that works within the context, interests, and goals of the broader organization.

Most coding solutions are not written in a vacuum and require specific knowledge of the business environment the solution will be implemented within. What is the structure of the database you are pulling your data from? How does the output need to be formatted to work downstream in your analytics process? How does your VP like things formatted? So far, AI models have yet to accomplish these tasks well. That day may come down the road, but for now this is the realm where human programmers will continue to excel for some time.