Generative AI Q&A: Applications in Software Engineering

The SEI recently hosted a question-and-answer webcast on generative AI. This webinar featured experts from across the SEI answering questions posed by the audience and discussing both the technological advancements and the practical considerations necessary for effective and reliable application of generative AI and large language models (LLMs), such as ChatGPT and Claude. This blog post includes our responses, which have been reordered and edited to enhance the clarity of the original webcast. It is the first part of a two-part series and explores the implications of generative AI in software engineering, particularly in the context of defense and domains with stringent quality-of-service requirements. In this part, we discuss the transformative impacts of generative AI on software engineering as well as its practical implications and adaptability in mission-critical environments.

Transformative Impacts of Generative AI on Software Engineering

Q: What are the advantages generative AI brings in regard to traditional software engineering?

John Robert: There are many exciting applications for generative AI in the context of software engineering. Many of us now have experience using generative AI tools like ChatGPT and other popular LLMs to create code, usually in response to prompts in a browser window. However, generative AI coding assistants, such as GitHub Copilot and Amazon CodeWhisperer, are increasingly being merged with popular integrated development environments, such as IntelliJ, Android Studio, Visual Studio, and Eclipse. In both cases, creating code from prompts can increase developer productivity. Moreover, these AI code assistants are also good at other things, such as code refactoring and code transformation, that modify existing code and/or translate it into different programming languages, programming language versions, and/or platforms.

Using generative AI tools to create test cases that evaluate code quality and performance is another emerging area of interest. Although these tools can review code similar to conventional static analysis tools, they also enable extensive interactions with software engineers and analysts. There are many examples of software engineers using LLMs to explore code in newly interactive ways, such as asking for a summary of the code, checking compliance with coding standard(s), or having a dialog to explore how the code relates to specific considerations, such as safety, security, or performance. In these and other use cases, the knowledge of experienced software engineers is critical to avoid overreliance on generative AI tools. What is new is the interactivity that enables software engineers to explore answers to questions and iteratively develop solutions to problems.

Generative AI is not limited to only enhancing code-level activities in the software lifecycle and, in fact, it provides other potential benefits to the practice of software engineering. For example, software engineers perform many other tasks beyond coding, including participating in meetings, examining documents, or interacting with different stakeholders. All these activities today require humans to inspect and summarize reams of documentation. Generative AI is well suited to helping humans perform those activities more efficiently and accurately, as well as helping improve the quality and efficiency of humans involved with Department of Defense (DoD) and government software acquisition activities and policies.

A key point I want to underscore is that humans are an essential part of the generative AI process and should not be replaced wholesale by these tools. Moreover, given the nascent nature of the first-generation of generative AI tools, it’s essential to have skilled software and systems engineers, as well as subject matter experts, who can spot where generated documentation or code is inaccurate and ensure that the key context is not lost. These human skills are important and necessary, even as generative AI tools provide significant new capabilities.

Q: What do you think about hybrid approaches that use generative AI and one or more additional techniques to generate code? Hybrid examples may include using LLMs with MDD or symbolic AI.

John: In answering this question, I assume “MDD” stands for model-driven development, which forms part of the broader field of model-based software engineering (MBSE). There is considerable interest in using models to generate code, as well as helping reduce the cost of maintaining software (especially large-scale software-reliant systems) over the lifecycle. Applying generative AI to MBSE is thus an area of active research interest.

However, combining MBSE with LLMs like ChatGPT has raised various concerns, such as whether the generated code is incorrect or contains vulnerabilities, like buffer overflows. Another active area of interest and research, therefore, is the use of hybrid approaches that leverage not just LLMs but also other techniques, such as MBSE, DevSecOps, or component-based software engineering (CBSE), to address those shortcomings or those risks. What is important is to assess the opportunities and risks for application of LLMs in software engineering and combine LLMs with existing techniques.

At the SEI, we have begun applying generative AI to reverse engineer model-based representations from lower-level corpora of code. Our early experiments indicate this combination can generate fairly accurate results in many cases. Looking ahead, the SEI sees many opportunities in this area since legacy software often lacks accurate model representations or even good documentation in many cases. Moreover, ensuring robust “round-trip engineering” that continuously synchronizes software models and their corresponding code-bases has been a long-standing challenge in MBSE. A promising research area, therefore, is hybrid approaches that integrate MBSE and generative AI techniques to minimize risks of applying generative AI for code generation in isolation.

Q: Is it possible to align open source LLMs to an unfamiliar proprietary programming language that the model has never seen before?

John: LLMs have demonstrated remarkable extensibility, particularly when optimized with well-crafted prompt engineering and prompt patterns. While LLMs are most proficient with mainstream languages, like Python, Java, and C++, they also offer surprising utility for lesser-known languages, such as JOVIAL, Ada, and COBOL, which are crucial to long-lived DoD programs. An effective strategy for adapting LLMs to support these niche languages involves fine-tuning them using specialized datasets, which is an approach similar to Hugging Face's CodeGen initiative. Prompt engineering can further leverage this fine-tuned knowledge, translating it into actionable insights for legacy and greenfield application domains alike.
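
To make the prompt-engineering side of this strategy concrete, here is a minimal Python sketch that assembles a few-shot prompt for explaining snippets in a legacy language. The `Example` class, the `build_few_shot_prompt` function, the prompt layout, and the sample COBOL statements are all illustrative assumptions rather than part of any SEI tool or vendor API. The resulting string would simply be passed to whichever LLM is in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One worked example: a legacy snippet and a plain-English explanation."""
    source: str
    explanation: str


def build_few_shot_prompt(language: str, examples: list[Example], query: str) -> str:
    """Assemble a few-shot prompt asking a model to explain a legacy snippet.

    The layout (instruction, worked examples, then the open question) is one
    common prompt pattern; it is an illustrative assumption, not a standard.
    """
    parts = [
        f"You are helping maintain {language} code. "
        "Explain each snippet in plain English."
    ]
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}:\nCode:\n{ex.source}\nExplanation: {ex.explanation}"
        )
    # End with the unanswered snippet so the model continues the pattern.
    parts.append(f"Now explain:\nCode:\n{query}\nExplanation:")
    return "\n\n".join(parts)


# Hypothetical worked examples a maintainer might supply.
examples = [
    Example("ADD 1 TO COUNTER.", "Increment COUNTER by one."),
    Example("MOVE ZERO TO TOTAL.", "Reset TOTAL to zero."),
]
prompt = build_few_shot_prompt("COBOL", examples, "SUBTRACT 1 FROM COUNTER.")
print(prompt)
```

Keeping the examples as data rather than hard-coding them into the prompt makes it easier for engineers who know the legacy language to curate and review them, which is where their expertise matters most.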

However, it's essential to temper enthusiasm with caution. LLMs present a wealth of novel opportunities for reshaping various tasks, but their efficacy is context-dependent. It's therefore crucial to understand that while these tools are powerful, they also have limitations. Not all problems are best solved with AI models, so the SEI is developing methods for discerning when traditional methods offer more reliable solutions.

In summary, while there are promising avenues for aligning open source LLMs to unfamiliar proprietary programming languages, the effectiveness of these endeavors is not guaranteed. It is crucial to perform thorough evaluations to determine the applicability and limitations of LLMs in specific use cases and domains. As LLMs continue to evolve, moreover, it's important to keep an open mind and periodically revisit domains where they might not currently be an effective solution but could become useful in the future.

Practical Implications and Adaptability of Generative AI in Critical Environments

Q: How can generative AI be used now in the Department of Defense?

Douglas Schmidt: Generative AI presents a diverse range of applications for the DoD, addressing both legacy and contemporary challenges. One pressing issue lies in sustaining legacy software systems, which, as John mentioned earlier, are often developed in now-obscure languages like Ada or JOVIAL. The diminishing pool of developers proficient in these languages poses a significant obstacle for the DoD's organic sustainment efforts. However, LLMs can be trained, fine-tuned, and/or prompt engineered to understand these older languages, thereby aiding the comprehension and evolution of existing codebases. Collaborations with cloud providers, such as Azure from Microsoft and others, further enable secure, government-approved access to these specialized code repositories, thereby enhancing software sustainment strategies.

Another promising application of LLMs in the DoD focuses on large-scale acquisition programs that possess extensive repositories of regulatory documents, safety specifications, and security protocols. Given the sheer volume of these data, it is practically infeasible for human analysts to comprehensively understand all these documents. Fortunately, many LLMs excel at textual analysis and can sift through massive repositories quickly to identify inconsistencies, gaps, and specific information—helping to find "needles in a haystack." This capability is invaluable to ensure that DoD acquisition programs adhere to necessary guidelines and requirements in a timely and cost-effective manner.

Operational activities within the DoD can also benefit from today’s capabilities of LLMs. For example, Scale with their Donovan platform or Palantir with their AI platform are pioneering new ways of aiding DoD analysts and operators who process vast amounts of diverse information and turn it into actionable courses of action. These platforms are leveraging fine-tuned LLMs to synthesize data from various signals and sensors, enabling more effective coordination, fusing of information, and cueing of assets for intelligence collection and mission planning. I expect we’ll see more of these types of platforms being deployed in DoD programs in the near future.

In summary, generative AI is not only a future prospect for the DoD, it’s an emerging reality with applications ranging from software sustainment to acquisition program oversight and operational support. As AI technology continues to advance, I anticipate an even broader range of military applications, reinforcing the strategic importance of AI competency in national defense.

Q: How do you evaluate risks when using code generated by generative AI products before deployment, in production, high-risk settings, and DoD use cases; any thoughts on traditional verification and validation methods or formal methods?

John: This question is interesting because people are increasingly planning to leverage generative AI for those types of settings and environments. Applying generative AI to the software engineering lifecycle is part of a larger trend towards AI-augmented software engineering covered by the SEI in a publication from the fall of 2021. This trend towards intelligent automation has emerged over the last decade, with more AI-augmented tools coming to market and being applied to develop software, test software, and deploy software. In that context, however, a range of new challenges have emerged.

For example, today’s LLMs that generate code have been trained on imperfect code from GitHub, Stack Overflow, and so on. Not surprisingly, the code they generate may also be imperfect (e.g., there may be defects, vulnerabilities, etc.). As a result, it’s essential to leverage human insight and oversight throughout the software engineering lifecycle, including the planning, architecture, design, development, testing, and deployment phases.

When used properly, however, generative AI tools can also accelerate many of these phases in new ways (e.g., creating new test cases, statically analyzing the code, etc.). Moreover, the software engineering community needs to consider ways to apply LLMs to accelerate the software lifecycle as a whole, rather than just focusing on generating code. For example, the SEI is exploring ways to leverage LLMs, together with formal methods and architecture analysis, and apply these techniques much earlier in the lifecycle.

Doug: I’d like to amplify a few things that John just mentioned. We’ve been generating code from various higher-level abstractions for decades, going way back to tools like lex and yacc for compiler construction. We’ve also long been generating code from model-driven engineering tools and domain-specific modeling languages through meta-modeling frameworks via tools like AADL and GME.

The main thing that’s changed with the advent of LLMs is that AI now generates more of the code that was traditionally generated by tools written by people. However, the same basic principles and practices apply (e.g., we still need unit tests, integration tests, and so on). Therefore, all the things we’ve come to know and love about ensuring confidence in the verification and validation of software still apply, but we’re now expecting generative AI tools to perform more of the workload.
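
As a rough illustration of that principle, the Python sketch below treats a generated function as untrusted until it passes human-written unit tests. The `GENERATED_SOURCE` string, the `clamp` function, and the helper names are hypothetical stand-ins invented for this example. Note also that `exec` provides no security isolation, so a real pipeline would run generated code in a sandbox.

```python
import unittest

# Stand-in for code returned by an LLM; in practice this text would come from
# the assistant and be reviewed by a human before it is ever executed.
GENERATED_SOURCE = '''
def clamp(value, low, high):
    return max(low, min(value, high))
'''


def load_generated(source: str, name: str):
    """Execute generated source in a separate namespace and return one function.

    exec() gives no security isolation; this is only a sketch.
    """
    namespace: dict = {}
    exec(source, namespace)
    return namespace[name]


class ClampTests(unittest.TestCase):
    """Human-written tests that the generated function must pass."""

    def setUp(self):
        self.clamp = load_generated(GENERATED_SOURCE, "clamp")

    def test_within_range(self):
        self.assertEqual(self.clamp(5, 0, 10), 5)

    def test_below_range(self):
        self.assertEqual(self.clamp(-3, 0, 10), 0)

    def test_above_range(self):
        self.assertEqual(self.clamp(42, 0, 10), 10)


def generated_code_passes() -> bool:
    """Run the suite and report whether the generated code is acceptable."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ClampTests)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()


if __name__ == "__main__":
    print("accept" if generated_code_passes() else "reject")
```

The tests here are written by a person and kept separate from the generated code, so the acceptance criteria do not depend on the same tool that produced the implementation.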

The second point, to build on John’s earlier response, is that we shouldn’t expect AI to generate complete and flawless software-reliant systems from scratch. Instead, we should view LLMs through the lens of generative augmented intelligence (i.e., developers working together with AI tools). I do this type of collaboration all the time in my teaching, research, and programming nowadays. In particular, I work hand-in-hand with ChatGPT and Claude, but I don’t expect them to generate all the code. Instead, I do much of the design, decomposition, and some of the implementation tasks, and then have the LLMs help me with tasks that would otherwise be tedious, error-prone, and/or boring for me to do manually. Thus, I use LLMs to supplement my skills as a programmer, rather than to supplant me.

This distinction between generative augmented intelligence and generative artificial intelligence is important. When I read articles by colleagues who are skeptical about the benefits of using generative artificial intelligence for programming, I find they usually make the same mistakes. First, they just try a handful of examples using early releases of LLMs, such as ChatGPT-3.5. Next, they don’t spend time thinking about how to perform effective prompt engineering or apply sound prompt patterns. Then, when they don’t get the results they expect, they throw their hands up and say “See, the emperor has no clothes” or “AI does not help programmers.” I call this rhetorical tactic “de-generative AI,” where people overgeneralize from a few simple cases that didn’t work without any additional thought or effort and then disparage the whole paradigm. However, those of us who spend time learning effective patterns of prompt engineering and actually applying LLMs in our programming and software engineering practice day in and day out have realized they work quite well when used properly.

Closing Thoughts

John: I’ve really enjoyed the questions and our conversation. I agree that hands-on experimentation is essential to understanding what LLMs can and can’t do, as well as what opportunities and risks arise when applying generative AI in practice. From a software engineering perspective, my main take-away message is that LLMs are not just useful for code-related activities but can also be applied fruitfully to upstream activities, including acquisition planning and governance.

Much valuable information beyond code exists in software projects, whether it be in your favorite open-source GitHub repositories or your own in-house document revision control systems. For example, there can be test cases, documentation, safety policies, etc. Therefore, the opportunities to apply generative AI to assist acquirers and software engineers are quite profound. We’re just beginning to explore these opportunities at the SEI, and we are also investigating and mitigating the risks.

Doug: For decades, many of us in education and government have been concerned about the digital divide, which historically referred to people with access to the Internet and computers and people who lacked that access. While we’ve made steady progress in shrinking the digital divide, we are about to encounter the digital chasm, which will occur when some people know how to use generative AI tools effectively and some don’t. Thus, while AI itself may not directly take your job, someone who utilizes AI more effectively than you could potentially take your job. This trend underscores the importance of becoming proficient in AI technologies to maintain a competitive edge in the workforce of tomorrow.

If you are a non-computer scientist and you want to become facile at web development, you could take a 24-week boot camp and learn to do some coding in JavaScript and related web technologies. After graduating, however, you’ll be compared with developers with decades of experience, and it may be hard to compete with them. In contrast, there are few people with more than about six-to-eight months of experience with prompt engineering and using LLMs effectively. If you want to get in on the ground floor, therefore, it’s a great time to start afresh, because all you need is an Internet connection, a computer with a web browser, and a passion for learning.

Moreover, you don’t even need to be a programmer or a software engineer to become highly productive if you are willing to put the time and effort into it. By treating LLMs as exoskeletons for our brains—rather than replacements for critical thinking—we’ll be much more productive and effective as a society and a workforce. Naturally, we have much work ahead of us to make LLMs more trustworthy, more ethical, and more effective, so people can apply them the way they should be used as opposed to using them as a crutch for not having to think. I am extremely optimistic about the future, but we all need to pitch in and help educate everyone so we become much more facile at using this new technology.

Additional Resources

Software Engineering Institute

SEI Blog