Applying Generative AI to Software Engineering: Navigating Ethical and Educational Landscapes
The SEI recently hosted a question-and-answer webcast on generative AI that featured experts from across the SEI answering questions posed by the audience and discussing both the technological advancements and the practical considerations necessary for effective and reliable application of generative AI and large language models (LLMs), such as ChatGPT and Claude. This blog post includes our responses, which have been reordered and edited to enhance the clarity of the original webcast. It is the second of a two-part series—the first installment focused on applications in software engineering—and explores the broader impacts of generative AI, addressing concerns about the evolving landscape of software engineering and the need for informed and responsible AI use. In particular, we discuss how to navigate the risks and ethical implications of AI-generated code, as well as the impact of generative AI on education, public perception, and future technological advances.
Navigating the Risks and Ethical Implications of AI-Generated Code
Q: I have observed a concerning trend that worries me. It appears that the traditional software engineering profession is gradually diminishing. I am curious to hear your thoughts on the growing concerns surrounding the increasing potential dangers posed by AI.
John: Many people are concerned about the implications of generative AI for the profession of software engineering. The press and social media are full of articles and postings asking if the age of the programmer is ending due to generative AI. Many of these concerns are overstated, however, and humans are an essential part of the software development process for many reasons, not just because today’s LLMs are imperfect.
For example, software engineers must still understand system requirements and architectural issues, as well as how to validate, deploy, and sustain software-reliant systems. Although LLMs are getting better at augmenting people in activities previously done through human-centric effort, risks remain, such as becoming over-reliant on LLMs, especially for mission-critical or safety-critical software. We’ve seen professionals in other fields, such as lawyers, get into serious trouble by naively relying on erroneous LLM output, which should serve as a cautionary tale for software engineers!
LLMs are just one of many advances in software engineering over the years where the skill sets of talented engineers and subject matter experts remained essential, even though tasks were increasingly automated by powerful and intelligent tools. There have been many times in the past where it appeared that software engineers were becoming less relevant, but they actually turned out to be more relevant because properly functioning software-reliant systems became more essential to meet user needs.
For example, when FORTRAN was released in the late 1950s, assembly language programmers worried that demand for software developers would evaporate since compilers could perform all the nitty-gritty details of low-level programming, such as register allocation, thereby rendering programmers superfluous. It turned out, however, that the need for programmers expanded dramatically over the ensuing decades since consumer, enterprise, and embedded market demands actually grew as higher-level programming languages and software platforms increased software developer productivity and system capabilities.
This phenomenon is commonly known as Jevons Paradox, where the demand for software professionals increases rather than decreases as efficiency in software development increases due to better tools and languages, as well as expanded application requirements, increased complexity, and a constantly evolving landscape of technology needs. Another example of the Jevons Paradox was in the push toward increased use of commercial off-the-shelf (COTS)-based systems. Initially, software developers worried that demand for their skills would shrink because organizations could simply purchase or acquire software that was already built. It turned out, however, that demand for software developer skills remained steady and even increased to enable evaluation and integration of COTS components into systems (see Table 3).
Prompt engineering is currently garnering much interest because it helps LLMs to do our bidding more consistently and accurately. However, it’s essential to prompt LLMs properly since if they are used incorrectly, we’re back to the garbage-in, garbage-out anti-pattern and LLMs will hallucinate and generate nonsense. If software engineers are trained to provide proper context—along with the right LLM plug-ins and prompt patterns—they become highly effective and can guide LLMs through a series of prompts to create specific and effective outputs that improve the productivity and performance of people and platforms.
Judging from job postings we’ve seen across many domains, it’s clear that engineers who can use LLMs reliably and integrate them seamlessly into their software development lifecycle processes are in high demand. The challenge is how to broaden and deepen this workforce by training the next generation of computer scientists and software engineers more effectively. Meeting this challenge requires getting more people comfortable with generative AI technologies, while simultaneously helping them understand the limitations of these technologies and then overcoming those limitations through better training and advances in generative AI.
Q: A coding question. How hard is it to detect if the code was generated by AI versus a human? If an organization is trying to avoid copyright violations from using code generated by AI, what should be done?
Doug: As you can imagine, computer science professors like me worry a lot about this issue because we’re concerned our students will stop thinking for themselves and start just generating all their programming assignment solutions using ChatGPT or Claude, which may yield the garbage-in, garbage-out anti-pattern that John mentioned earlier. More broadly, many other disciplines that rely on written essays as the means to assess student performance are also worried because it’s become hard to tell the difference between human-generated and AI-generated prose.
At Vanderbilt in the Spring 2023 semester, we tried using a tool that purported to automatically identify AI-generated answers to essay questions. We stopped using it by the Fall 2023 semester, however, because it was simply too inaccurate. Similar problems arise with trying to detect AI-generated code, especially as programmers and LLMs become more sophisticated. For example, the first generation of LLMs tended to generate relatively uniform and simple code snippets, which at the time seemed like a promising pattern to base AI detector tools on. The latest generation of LLMs generates more sophisticated code, however, especially when programmers and prompt engineers apply the appropriate prompt patterns.
LLMs are quite effective at generating meaningful comments and documentation when given the right prompts. Ironically, many programmers are much less consistent and conscientious in their commenting habits. So, perhaps one way to tell if code was generated by AI is if it’s nicely formatted and carefully constructed and commented!
All joking aside, there are several ways to address issues associated with potential copyright violations. One approach is to only work with AI providers that indemnify their (paying) customers against liability if their LLMs and related generative AI tools generate copyrighted code. OpenAI, Microsoft, Amazon, and IBM all offer some level of assurance in their recent generative AI offerings. (Currently, some of these assurances may only apply when paying for a subscription.)
Another approach is to train and/or fine-tune an LLM to perform stylometry based on careful analysis of programmer styles. For example, if code written by programmers in an organization no longer matches what they typically write, this discrepancy could be flagged as something potentially generated by an LLM from copyrighted sources. Of course, the tricky part with this approach is differentiating between LLM-generated code and something programmers copy legitimately from Stack Overflow, which is common practice in many software development organizations nowadays. It’s also possible to train specialized classifiers that use machine learning to detect copyright violations, though this approach may ultimately be pointless as the training sets for popular generative AI platforms become more thoroughly vetted.
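To make the stylometry idea a bit more concrete, here is a minimal sketch in Python. It is not the LLM-based approach described above; it swaps in a much simpler statistical comparison so the idea fits in a few lines. The three features, the 40-character line threshold, and the function names (style_features, build_profile, deviation) are all assumptions chosen for illustration, and the comment detection only understands Python-style # comments.

```python
import re
import statistics

IDENTIFIER = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")
LONG_LINE = 40  # assumed threshold (characters) for calling a line "long"


def style_features(source: str) -> dict[str, float]:
    """Return three crude style ratios, each between 0 and 1."""
    non_blank = [ln for ln in source.splitlines() if ln.strip()]
    comments = [ln for ln in non_blank if ln.lstrip().startswith("#")]
    code = [ln for ln in non_blank if not ln.lstrip().startswith("#")]
    # Crude on purpose: keywords and words inside strings also count as names.
    names = IDENTIFIER.findall("\n".join(code))
    snake = [name for name in names if "_" in name.strip("_")]
    long_lines = [ln for ln in non_blank if len(ln) > LONG_LINE]
    total = max(len(non_blank), 1)
    return {
        "comment_ratio": len(comments) / total,
        "long_line_ratio": len(long_lines) / total,
        "snake_case_ratio": len(snake) / max(len(names), 1),
    }


def build_profile(samples: list[str]) -> dict[str, tuple[float, float]]:
    """Summarize a programmer's known code as (mean, spread) per feature."""
    if not samples:
        raise ValueError("need at least one baseline sample")
    rows = [style_features(sample) for sample in samples]
    profile = {}
    for key in rows[0]:
        values = [row[key] for row in rows]
        profile[key] = (statistics.fmean(values), statistics.pstdev(values))
    return profile


def deviation(profile: dict[str, tuple[float, float]], source: str,
              min_spread: float = 0.05) -> float:
    """Largest per-feature distance from the profile, in units of spread.

    min_spread is an assumed floor that avoids dividing by zero when the
    baseline samples are perfectly uniform on some feature.
    """
    features = style_features(source)
    return max(
        abs(features[key] - mean) / max(spread, min_spread)
        for key, (mean, spread) in profile.items()
    )
```

A reviewer might build a profile from a programmer's earlier commits and look more closely at new submissions whose deviation is unusually large. A high score only says the style changed; it cannot say whether the code came from an LLM, from Stack Overflow, or from a programmer who simply adopted a new formatter.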
If you are really concerned about copyright violations—and you aren’t willing or able to trust your AI providers—you should probably resort to manual code reviews, where programmers must show the provenance of what they produce and explain where their code came from. That model is similar to Vanderbilt’s syllabus AI policy, which allows students to use LLMs if permitted by their professors, but they must attribute where they got the code from and whether it was generated by ChatGPT, copied from Stack Overflow, etc. Coupled with LLM provider assurances, this type of voluntary conformance may be the best we can do. It is a fool’s errand to expect that we can detect LLM-generated code with any degree of accuracy, especially as these technologies evolve and mature, since they will get better at masking their own use!
Future Prospects: Education, Public Perception, and Technological Advancements
Q: How can the software industry educate users and the general public to better understand the appropriate versus inappropriate use of LLMs?
John: This question raises another really thought-provoking issue. Doug and I recently facilitated a U.S. Leadership in Software Engineering & AI Engineering workshop hosted at the National Science Foundation where speakers from academia, government, and industry presented their views on the future of AI-augmented software engineering. A key question arose at that event as to how to better educate the public about the effective and responsible applications of LLMs. One theme that emerged from workshop participants is the need to increase AI literacy and clearly articulate and codify the present and near-future strengths and weaknesses of LLMs.
For example, as we’ve discussed in this webcast today, LLMs are good at summarizing large sets of information. They can also find inaccuracies across corpora of documents when given a prompt such as “Compare these repositories of DoD acquisition program documents and identify their inconsistencies.” LLMs are quite good at this type of discrepancy analysis, particularly when combined with techniques such as retrieval-augmented generation, which has been integrated into the ChatGPT-4 turbo release.
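For readers unfamiliar with retrieval-augmented generation, the core idea is to fetch the most relevant passages first and hand only those to the LLM along with the task. The sketch below is a toy illustration in Python, not how any particular product implements it: it ranks passages with a simple bag-of-words cosine similarity (production systems typically use learned embeddings), and the function names and prompt wording are assumptions. It stops at building the prompt and makes no call to an LLM.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(task: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the task description."""
    query = tokenize(task)
    ranked = sorted(passages, key=lambda p: cosine(query, tokenize(p)),
                    reverse=True)
    return ranked[:k]


def build_prompt(task: str, passages: list[str], k: int = 2) -> str:
    """Assemble a prompt containing only the retrieved excerpts."""
    excerpts = retrieve(task, passages, k)
    context = "\n".join(f"[{i}] {text}"
                        for i, text in enumerate(excerpts, start=1))
    return (
        "Use only the numbered excerpts below.\n"
        f"{context}\n\n"
        f"Task: {task}"
    )
```

Given a task such as "Identify inconsistencies in the delivery dates" and a pile of document excerpts, build_prompt keeps the excerpts that mention delivery and drops unrelated ones, so the LLM compares a focused set of passages rather than the whole corpus.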
It’s also important to understand what LLMs are not (yet) good at, and where expecting too much from them can lead to disaster in the absence of proper oversight. For example, we talked earlier about risks associated with LLMs generating code for mission- and safety-critical applications, where seemingly minor mistakes can have catastrophic consequences. So, building awareness of where LLMs are good and where they are bad is crucial, though we also need to recognize that LLMs will continue to improve over time.
Another interesting theme that emerged from the NSF-hosted workshop was the need for more transparency in the data used to train and test LLMs. To build more confidence in understanding how these models can be used, we need to understand how they are developed and tested. LLM providers often share how their most recent LLM release performs against popular tests, and there are leaderboards to highlight the latest LLM performance. However, LLMs can be created to perform well on specific tests while also making tradeoffs in other areas that may be less visible to users. We clearly need more transparency about the LLM training and testing process, and I’m sure there will soon be more developments in this fast-moving area.
Q: What are your thoughts on the current and future state of prompt engineering? Will certain popular techniques (reflection, multi-shot prompting, summarization) still be relevant?
Doug: The main difference between programming LLMs via natural language and programming computers with traditional structured languages is that there is more room for ambiguity with LLMs. The English language is fundamentally ambiguous, so we’ll always need some form of prompt engineering. This need will continue even as LLMs improve at ferreting out our intentions since different ways of phrasing prompts cause LLMs to respond differently. Moreover, there won’t be “one LLM to rule them all,” even given OpenAI’s current dominance with ChatGPT. For example, you’ll get different (and often quite different) responses if you give a prompt to ChatGPT-3.5 versus ChatGPT-4 versus Claude versus Bard. This diversity will expand over time as more LLMs (and more versions of LLMs) are released.
There’s also something else to consider. Some people think that prompt engineering is limited to how users ask questions and make requests to their favorite LLM(s). If we step back, however, and think about the engineering term in prompt engineering, it’s clear that quality attributes, such as configuration management, version control, testing, and release-to-release compatibility, are just as important as they are in traditional software engineering, if not more so.
Understanding and addressing these quality attributes will become essential as LLMs, generative AI technologies, and prompt engineering are increasingly used in the processes of building systems that we must sustain for many years or even decades. In these contexts, the role of prompt engineering must expand well beyond simply phrasing prompts to an LLM to cover all the -ilities and non-functional requirements we must support throughout the software development lifecycle (SDLC). We have just begun to scratch the surface of this holistic view of prompt engineering, which is a topic that the SEI is well equipped to explore due to our long history of focusing on quality attributes through the SDLC.
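As one small illustration of what applying configuration management and testing to prompts might look like, the Python sketch below treats a prompt as a named, versioned artifact with a content fingerprint. Everything here (the PromptTemplate class, the verify_pinned helper, and the idea of a table of pinned fingerprints) is a hypothetical design offered for discussion, not an established practice or an SEI tool.

```python
import hashlib
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class PromptTemplate:
    """A prompt treated as a versioned, immutable artifact."""
    name: str
    version: str
    text: str  # uses $placeholder syntax from string.Template

    @property
    def fingerprint(self) -> str:
        """Short content hash, so silent edits to the text are detectable."""
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]

    def render(self, **fields: str) -> str:
        """Fill in placeholders; raises KeyError if any field is missing."""
        return Template(self.text).substitute(**fields)


def verify_pinned(template: PromptTemplate,
                  pinned: dict[tuple[str, str], str]) -> bool:
    """Check the text still matches the fingerprint recorded for this
    (name, version) pair, i.e., nobody edited it without a version bump."""
    recorded = pinned.get((template.name, template.version))
    return recorded == template.fingerprint
```

In this sketch, the pinned fingerprints would live in version control next to the code, and a test run would fail whenever someone changes a prompt's wording without also bumping its version. That gives release-to-release comparisons something stable to refer to, much as a pinned dependency does for a library.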
Q: Doug, you’ve touched on this a little bit in your last comments, I know you do a lot of work with your students in this area, but how are you personally using generative AI in your day-to-day teaching at Vanderbilt University?
Doug: My colleagues and I in the computer science and data science programs at Vanderbilt use generative AI extensively in our teaching. Ever since ChatGPT “escaped from the lab” in November of 2022, my philosophy has been that programmers should work hand-in-hand with LLMs. I don’t see LLMs as replacing programmers, but instead augmenting them, like an exoskeleton for your brain! It’s therefore crucial to train my students to use LLMs effectively and responsibly (i.e., in the right ways rather than the wrong ways).
I’ve begun integrating ChatGPT into my courses wherever possible. For example, it’s very handy for summarizing videos of my lectures that I record and post to my YouTube channel, as well as generating questions for in-class quizzes that are fresh and up to date based on the transcripts of my class lectures uploaded to YouTube. My teaching assistants and I also use ChatGPT to automate our assessments of student programming assignments. In fact, we have built a static analysis tool using ChatGPT that analyzes my students’ programming submissions to detect frequently made mistakes in their code.
In general, I use LLMs whenever I would traditionally have expended significant time and effort on tedious and mundane—yet essential—tasks, thereby freeing me to focus on more creative aspects of my teaching. While LLMs are not perfect, I find that applying the right prompt patterns and the right tool chains has made me enormously more productive. Generative AI tools today are incredibly helpful, as long as I apply them judiciously. Moreover, they are improving at a breakneck pace!
John: Navigating the ethical and educational challenges of generative AI is an ongoing conversation across many communities and perspectives. The rapid advancements in generative AI are creating new opportunities and risks for software engineers, software educators, software acquisition authorities, and software users. As has often happened throughout the history of software engineering, technology advancements challenge all stakeholders to experiment and learn new skills, but the demand for software engineering expertise, particularly for cyber-physical and mission-critical systems, remains very high.
The resources to help apply LLMs to software engineering and acquisition are also increasing. A recent SEI publication, Assessing Opportunities for LLMs in Software Engineering and Acquisition, provides a framework to explore the risks/benefits of applying LLMs in multiple use cases. The application of LLMs in software acquisition presents important new opportunities that will be described in more detail in upcoming SEI blog postings.
Doug: Earlier in the webcast we talked about the impact of LLMs and generative AI on software engineers. These technologies are also enabling other key software-reliant stakeholders (such as subject matter experts, systems engineers, and acquisition professionals) to participate more effectively throughout the system and software lifecycle. Allowing a wider spectrum of stakeholders to contribute throughout the lifecycle makes it easier for customers and sponsors to get a better sense of what is actually happening without having to become experts in software engineering.
However, this trend doesn’t mean that the need for software developers will diminish. As John pointed out earlier in his discussion of the Jevons Paradox, there’s a vital role for those of us who program using third- and fourth-generation languages because many systems, especially safety-critical and mission-critical cyber-physical systems, require high-confidence, fine-grained control over software behavior. It’s therefore incumbent on the software engineering community to create the processes, methods, and tools needed to ensure a robust discipline of prompt engineering emerges, and that key software engineering quality attributes (such as configuration management, testing, and sustainment) are extended to the domain of prompt engineering for LLMs. Otherwise, people who lack our body of knowledge will create brittle artifacts that can’t stand the test of time and instead will yield mountains of expensive technical debt that can’t be paid down easily or cheaply!
View the complete SEI webcast Ask Us Anything: Generative AI Edition featuring John Robert, Douglas C. Schmidt, Rachel Dzombak, Jasmine Ratchford, Matthew Walsh, and Shing-hon Lau.