
Perspectives on Generative AI in Software Engineering and Acquisition

In the realm of software engineering and software acquisition, generative AI promises to improve developer productivity, accelerate the production of related artifacts, and, in some cases, improve their quality. It is essential, however, that software and acquisition professionals learn how to apply AI-augmented methods and tools effectively in their workflows. SEI researchers addressed this topic in a webcast that focused on the future of software engineering and acquisition using generative AI technologies, such as ChatGPT, DALL·E, and Copilot. This blog post excerpts and lightly edits portions of that webcast to explore expert perspectives on applying generative AI in software engineering and acquisition. It is the latest in a series of blog posts on these topics.

Moderating the webcast was SEI Fellow Anita Carleton, director of the SEI Software Solutions Division. Participating in the webcast was a group of SEI thought leaders on AI and software: James Ivers, principal engineer; Ipek Ozkaya, technical director of the Engineering Intelligent Software Systems group; John Robert, deputy director of the Software Solutions Division; Douglas Schmidt, who was the Director of Operational Test and Evaluation at the Department of Defense (DoD) and is now the inaugural dean of the School of Computing, Data Sciences, and Physics at William & Mary; and Shen Zhang, a senior engineer.

Anita: What are the gaps, risks, and challenges that you all see in using generative AI that need to be addressed to make it more effective for software engineering and software acquisition?

Shen: I will focus on two specifically. One that is very important to the DoD is explainability. Explainable AI is critical because it allows practitioners to gain an understanding of the results output from generative AI tools, especially when we use them for mission- and safety-critical applications. There is a lot of research in this field. Progress is slow, however, and not all approaches apply to generative AI, especially regarding identifying and understanding incorrect output. Alternatively, it’s helpful to use prompting techniques like chain of thought reasoning, which decomposes a complex task into a sequence of smaller subtasks. These smaller subtasks can more easily be reviewed incrementally, reducing the likelihood of acting on incorrect outputs.
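For illustration, here is a minimal sketch of the incremental-review pattern Shen describes, assuming the OpenAI Python SDK and a chat-style model. The subtasks, model name, and code under review are illustrative, not a prescribed SEI workflow.

```python
# A minimal sketch of chain-of-thought prompting: the task is decomposed
# into explicit subtasks so each intermediate result can be reviewed
# before the next step runs. Model name and SDK usage are assumptions
# (OpenAI Python SDK >= 1.0); any chat-style LLM client would work.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

subtasks = [
    "Step 1: Summarize what the function below is supposed to do.",
    "Step 2: List its inputs, outputs, and error conditions.",
    "Step 3: Identify edge cases the current code does not handle.",
    "Step 4: Propose a fix for one edge case, explaining your reasoning.",
]

code_under_review = "def mean(xs): return sum(xs) / len(xs)"  # illustrative

history = [{"role": "system",
            "content": "Reason step by step and answer only the current step."}]

for step in subtasks:
    history.append({"role": "user", "content": f"{step}\n\n{code_under_review}"})
    reply = client.chat.completions.create(model="gpt-4o", messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    # Each intermediate answer is small enough for a human to check
    # before the next subtask builds on it.
    print(f"--- {step}\n{answer}\n")
```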

The second area is security and disclosure, which is especially critical for the DoD and other high-stakes domains such as health care, finance, and aviation. For many of the SEI’s DoD sponsors and partners, we work at impact levels of IL5 and beyond. In this type of environment, users cannot just take that information—be it text, code, or any kind of input—and pass it into a commercial service, such as ChatGPT, Claude, or Gemini, that doesn’t provide adequate controls on how the data are transmitted, used, and stored.

Commercial IL5 options can mitigate concerns about data handling because they enable the use of local LLMs that are air-gapped from the internet. There are, however, trade-offs between powerful commercial LLMs that tap into resources across the web and the more limited capabilities of local models. Balancing capability, security, and disclosure of sensitive data is crucial.
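As a concrete illustration, the following sketch queries a locally hosted model instead of a commercial endpoint. It assumes an Ollama server running at localhost:11434 with a model already pulled; any self-hosted inference server with an HTTP API would serve the same purpose.

```python
# A minimal sketch of keeping sensitive text on premises by querying a
# locally hosted model rather than a commercial API. Assumes an Ollama
# server at localhost:11434 with a model already pulled; the model name
# and prompt are illustrative.
import requests

def summarize_locally(sensitive_text: str, model: str = "llama3") -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": f"Summarize the following document:\n\n{sensitive_text}",
            "stream": False,
        },
        timeout=120,
    )
    response.raise_for_status()
    # The document never leaves the local network; nothing is transmitted
    # to an external service.
    return response.json()["response"]

print(summarize_locally("(controlled document text here)"))
```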

John: A key challenge in applying generative AI to development of software and its acquisition is ensuring proper human oversight, which is needed regardless of which LLM is applied. It’s not our intent to replace people with LLMs or other forms of generative AI. Instead, our goal is to help people bring these new tools into their software engineering and acquisition processes, interact with them reliably and responsibly, and ensure the accuracy and fairness of their results.

I also want to mention a concern about overhyped expectations. Many claims made today about what generative AI can do are overhyped. At the same time, however, generative AI is providing many opportunities and benefits. For example, we have found that applying LLMs to some work at the SEI and elsewhere substantially improves productivity in many software engineering activities, though we are also painfully aware that generative AI won’t solve every problem every time. For example, using generative AI to synthesize software test cases can accelerate software testing, as shown in recent studies such as Automated Unit Test Improvement using Large Language Models at Meta. We are also exploring using generative AI to help engineers examine test results and analyze software assurance data to find strengths and weaknesses, such as issues or defects related to safety or security, as outlined in the paper Using LLMs to Adjudicate Static-Analysis Alerts.
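To make the test-synthesis idea concrete, here is a hedged sketch of asking an LLM for pytest cases while keeping a human reviewer in the loop. The SDK, model name, prompt wording, and function under test are assumptions for illustration; this is not the method used in the studies cited above.

```python
# A sketch of LLM-assisted test synthesis: ask a model for pytest cases
# for a target function, then have a human review before committing.
# SDK usage and model name are assumptions (OpenAI Python SDK >= 1.0).
import inspect
from openai import OpenAI

def slugify(title: str) -> str:
    """Function under test (illustrative)."""
    return "-".join(title.lower().split())

prompt = (
    "Write pytest unit tests for this function. Cover normal input, "
    "empty strings, and punctuation. Return only Python code.\n\n"
    + inspect.getsource(slugify)
)

client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-4o", messages=[{"role": "user", "content": prompt}]
)
generated_tests = reply.choices[0].message.content

# Generated tests are a starting point, not a verdict: a reviewer checks
# that the asserted behavior is actually the intended behavior before
# the tests enter the suite.
print(generated_tests)
```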

I would also like to mention two recent SEI articles that further cover the challenges generative AI must address to become more effective for software engineering and software acquisition:

  • Assessing Opportunities for Large Language Models in Software Engineering and Acquisition
  • 10 Benefits and 10 Challenges of Applying Large Language Models to DoD Software Acquisition

Anita: Ipek, how about some gaps, challenges, and risks from your perspective?

Ipek: I think it’s important to discuss the scale of acquisition systems as well as their evolvability and sustainability aspects. We are at a stage in the evolution of generative-AI-based software engineering and acquisition tools where we still don’t know what we don’t know. In particular, the software development tasks to which generative AI has been applied thus far are fairly narrow in scope, for example, interacting with a relatively small number of methods and classes in popular programming languages and platforms.

In contrast, the types of software-reliant acquisition systems we deal with at the SEI are substantially larger and more complex, containing millions of lines of code and thousands of modules and using a wide range of legacy programming languages and platforms. Moreover, these systems will be developed, operated, and sustained over decades. We therefore don’t know yet how well generative AI will work with the overall structure, behavior, and architecture of these software-reliant systems.

For example, if a team applying LLMs to develop and sustain portions of an acquisition system makes changes in one particular module, how consistently will these changes propagate to other, similar modules? Likewise, how will the rapid evolution of LLM versions affect generated code dependencies and technical debt? These are very complicated problems, and while there are emerging approaches to address some of them, we shouldn’t assume that all of these concerns have been—or will be—addressed soon.

Anita: What are some opportunities for generative AI as we think about software engineering and software acquisition?

James: I tend to think about these opportunities from a few perspectives. One is, what’s a natural problem for generative AI, where it’s a really good fit, but where I as a developer am less facile or don’t want to devote my time? For example, generative AI is often good at automating highly repetitive and common tasks, such as generating scaffolding for a web application that gives me the structure to get started. Then I can come in and flesh out that scaffolding with my domain-specific information.
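As an illustration of the scaffolding James describes, here is the kind of generic skeleton a generative tool might emit. Flask is an assumed, illustrative choice, and the TODO markers show where the domain-specific logic would go.

```python
# Generic web-app scaffolding of the kind a generative tool can produce
# in seconds, leaving the developer to fill in the domain-specific parts.
# Flask is an illustrative choice; any web framework scaffolds similarly.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/items", methods=["GET"])
def list_items():
    # TODO: replace with domain-specific query logic
    return jsonify([])

@app.route("/items", methods=["POST"])
def create_item():
    payload = request.get_json()
    # TODO: validate the payload and persist it
    return jsonify(payload), 201

if __name__ == "__main__":
    app.run(debug=True)
```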

When most of us were just starting out in the computing field, we had mentors who gave us good advice along the way. Likewise, there are opportunities now to ask generative AI for advice, for example, on what elements I should include in a proposal for my manager or how I should approach a testing strategy. A generative AI tool may not always provide deep domain- or program-specific advice. However, for developers who are learning these tools, it’s like having a mentor who gives you pretty good advice most of the time. Of course, you can’t trust everything these tools tell you, but we didn’t always trust everything our mentors told us either!

Doug: I’d like to riff off of what James was just saying. Generative AI holds significant promise to transform and modernize the static, document-heavy processes common in large-scale software acquisition programs. By automating the curation and summarization of vast numbers of documents, these technologies can mitigate the chaos often encountered in managing extensive archives of PDFs and Word files. This automation reduces the burden on the technical staff, who often spend considerable time trying to regain an understanding of existing documentation. By enabling quicker retrieval and summarization of relevant documents, AI can enhance productivity and reduce redundancy, which is essential when modernizing the acquisition process.

In practical terms, applying generative AI in software acquisition can streamline workflows by providing dynamic, information-centric systems. For instance, LLMs can sift through vast data repositories to identify and extract pertinent information, thereby simplifying the task of managing large volumes of documentation. This capability is particularly beneficial for keeping up to date with a project’s evolving requirements, architecture, and test plans, ensuring all team members have timely access to the most relevant information.
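A minimal sketch of the retrieval step Doug describes appears below. TF-IDF similarity stands in for the embedding-based retrieval a production pipeline would likely use, and the document snippets are invented placeholders.

```python
# A minimal sketch of document retrieval over an acquisition archive:
# index the documents, then surface the passages most relevant to a
# query. TF-IDF with cosine similarity is a stand-in for the
# embedding-based retrieval a production RAG pipeline would use.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "System requirements specification for subsystem A ...",
    "Test plan covering integration of subsystems A and B ...",
    "Architecture description of the messaging layer ...",
]  # in practice, text extracted from the PDF/Word archive

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    ranked = scores.argsort()[::-1][:top_k]
    return [documents[i] for i in ranked]

# The retrieved passages would then be handed to an LLM for summarization.
print(retrieve("current test plan for subsystem A"))
```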

However, while generative AI can improve efficiency dramatically, it is crucial to maintain the human oversight John mentioned earlier to ensure the accuracy and relevancy of the information extracted. Human expertise remains essential in interpreting AI outputs, particularly in nuanced or critical decision-making areas. Ensuring these AI systems are audited regularly—and that their outputs can be (and are) verified—helps safeguard against errors and ensures that integrating AI into software acquisition processes augments human expertise rather than replaces it.

Anita: What are some of the key challenges you foresee in curating data for building a trusted LLM for acquisition in the DoD space? Do any of you have insights from working with DoD programs here?

Shen: In the acquisition space, contracts impose multiple customer templates and standard deliverables on vendors. These contracts often place a substantial burden on government teams, who must assess contractor deliverables to ensure they adhere to those standards. As Doug mentioned, this is where generative AI can help by validating efficiently, and at scale, that vendor deliverables meet those government standards.

More importantly, generative AI can offer a more consistent review of the data being analyzed, which is key to enhancing impartiality in the acquisition process. When dealing with multiple vendors, for example in reviewing responses to a broad agency announcement (BAA), it’s critical that submitted proposals are assessed objectively. Generative AI can certainly help here, especially when instructed with appropriate prompt engineering and prompt patterns. Of course, generative AI has its own biases, which circles back to John’s admonition to keep informed and cognizant humans in the loop to help mitigate risks such as LLM hallucinations.
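As a sketch of such a prompt pattern, the template below applies one fixed rubric identically to every submission, which is what makes the reviews comparable. The criteria and wording are illustrative assumptions, not an official evaluation rubric.

```python
# A sketch of a rubric-based prompt pattern for proposal review: the
# same template is applied to every vendor submission so assessments
# are made against identical criteria. Criteria are illustrative.
REVIEW_PROMPT = """You are reviewing a proposal against a fixed rubric.
Score each criterion from 1 to 5 and quote the passage that justifies
the score. Do not compare this proposal to any other proposal.

Criteria:
1. Technical approach addresses the stated BAA objectives.
2. Schedule and milestones are realistic and verifiable.
3. Risks are identified with credible mitigations.

Proposal text:
{proposal}
"""

def build_review_prompt(proposal_text: str) -> str:
    # A human reviewer still adjudicates the scores the model produces;
    # the pattern only standardizes how each proposal is examined.
    return REVIEW_PROMPT.format(proposal=proposal_text)

print(build_review_prompt("(vendor proposal text here)"))
```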

Anita: John, I know you have worked a great deal with Navy programs and thought you might have some insights here as well.

John: As we develop AI models to enhance and modernize software acquisition activities in the DoD space, certain domains present early opportunities, such as the standardization of government policies for ensuring safety in aircraft or ships. These extensive regulatory documents often span several hundred pages and dictate a range of activities that acquisition program offices require developers to undertake to ensure safety and compliance within these areas. Safety standards in these domains are frequently managed by specialized government teams who engage with multiple programs, have access to relevant datasets, and possess trained personnel.

In these specialized acquisition contexts, there are opportunities to either develop dedicated LLMs or fine-tune existing models to meet specific needs. LLMs can serve as valuable resources to augment the capabilities of these teams, enhancing their efficiency and effectiveness in maintaining safety standards. For example, by synthesizing and interpreting complex regulatory texts, LLMs can help teams by providing insights and automated compliance checks, thereby streamlining the often lengthy and intricate process of meeting governmental safety regulations.
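Here is a hedged sketch of the kind of automated compliance check John describes: the model receives one requirement clause and one deliverable excerpt and returns a structured verdict for a human to audit. The clause, excerpt, SDK, model name, and output schema are all illustrative assumptions.

```python
# A hedged sketch of an automated compliance check: one requirement
# clause, one deliverable excerpt, one structured verdict for a human
# safety engineer to audit. Clause, excerpt, model name, and output
# schema are illustrative assumptions (OpenAI Python SDK >= 1.0).
from openai import OpenAI

clause = ("The contractor shall document all hazard analyses and "
          "retain them for the life of the program.")
excerpt = ("Hazard analyses are performed at each milestone and "
           "archived in the program repository.")

prompt = (
    "Does the deliverable excerpt satisfy the requirement clause? "
    'Respond as JSON: {"satisfied": true or false, "evidence": "...", '
    '"gaps": "..."}\n\n'
    f"Requirement: {clause}\n\nExcerpt: {excerpt}"
)

client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-4o", messages=[{"role": "user", "content": prompt}]
)

# Every automated verdict is logged verbatim so a human reviewer can
# audit the judgment; production code would also validate the JSON.
print(reply.choices[0].message.content)
```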

These domain-specific applications represent some near-term opportunities for LLMs because their scope of usage is bounded in terms of the types of needed data. Likewise, government organizations already collect, organize, and analyze data specific to their area of governance. For example, government automobile safety organizations have years of data relevant to software safety to inform regulatory policy and standards. Collecting and analyzing vast amounts of data for many possible uses is a significant challenge in the DoD for various reasons, some of which Doug mentioned earlier. I therefore think we should focus on building trusted LLMs for specific domains first, demonstrate their effectiveness, and then extend their data and uses more broadly after that.

James: With respect to your question about building trusted LLMs, we should remember that we don’t need to put all our trust in the AI itself. We need to think about workflows and processes. In particular, if we put other safeguards—be they humans, static analysis tools, or whatever—in place, then we don’t always need absolute trust in the AI to have confidence in the outcome, as long as those safeguards provide comprehensive and complementary perspectives. It’s therefore essential to take a step back and think about the workflow as a whole. “Do we trust the workflow, the process, and the people in the loop?” may be a better question than merely “Do we trust the AI?”
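A small sketch of this workflow-level trust appears below: generated code must pass an automated static check and an explicit human gate before acceptance. The pyflakes tool is an illustrative stand-in for whatever analyzers a project already uses.

```python
# A sketch of trusting the workflow rather than the model alone:
# generated code passes through a static-analysis safeguard and then a
# human gate before it is accepted. pyflakes is an illustrative choice.
import subprocess
import tempfile

def accept_generated_code(generated_code: str) -> bool:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code)
        path = f.name

    # Safeguard 1: static analysis rejects code that does not parse or
    # that references undefined names (pyflakes exits nonzero on findings).
    result = subprocess.run(["pyflakes", path], capture_output=True, text=True)
    if result.returncode != 0:
        print("Static analysis failed:\n", result.stdout)
        return False

    # Safeguard 2: an explicit human review before anything is merged.
    print(generated_code)
    return input("Approve this change? [y/N] ").strip().lower() == "y"

accept_generated_code("def add(a, b):\n    return a + b\n")
```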

Future Work to Address Generative AI Challenges in Acquisition and Software Engineering

While generative AI holds great promise, several gaps must be closed so that software engineering and acquisition organizations can utilize generative AI more extensively and consistently. Specific examples include:

  • Accuracy and trust: Generative AI can produce hallucinations, which may not be obvious to less experienced users and can create significant issues. Some of these errors can be partially mitigated with effective prompt engineering, consistent testing, and human oversight. Organizations should adopt governance standards that continuously monitor generative AI performance and ensure human accountability throughout the process.
  • Data security and privacy: Generative AI operates on large sets of information or data, including data that is private or must be controlled. Online generative AI services are intended primarily for public data, so sharing sensitive or proprietary information with these public services can be problematic. Organizations can address these issues by creating secure generative AI deployment configurations, such as private cloud infrastructure, air-gapped systems, or data privacy vaults.
  • Business processes and cost: Organizations deploying any new service, including generative AI services, must consider changes to business processes and financial commitments beyond initial deployment. Generative AI costs can include infrastructure investments, model fine-tuning, security monitoring, upgrades to new and improved models, and training programs for proper use and use cases. These up-front costs are balanced by improvements in development and evaluation productivity and, potentially, quality.
  • Ethical and legal risks: Generative AI systems can introduce ethical and legal challenges, including bias, fairness, and intellectual property rights. Biases in training data may lead to unfair outcomes, making it essential to include human review of fairness as a mitigation. Organizations should establish guidelines for the ethical use of generative AI and consider leveraging resources such as the NIST AI Risk Management Framework to guide responsible use.

Generative AI presents exciting possibilities for software engineering and software acquisition. However, it is a fast-evolving technology whose interaction styles and input-output assumptions differ from those familiar to software and acquisition professionals. In a recent IEEE Software article, Anita Carleton and her coauthors emphasized that software engineering and acquisition professionals need training to manage and collaborate with AI systems effectively and ensure operational efficiency.

In addition, John and Doug participated in a recent webinar, Generative Artificial Intelligence in the DoD Acquisition Lifecycle, with other government leaders who further emphasized the importance of ensuring generative AI is fit for use in high-stakes domains such as defense, healthcare, and litigation. Organizations can only benefit from generative AI by understanding how it works, recognizing its risks, and taking steps to mitigate them.

Additional Resources

Generative AI: Redefining the Future of Software Engineering by Anita Carleton, Davide Falessi, Hongyu Zhang, and Xin Xia.

Assessing Opportunities for Large Language Models in Software Engineering and Acquisition by Stephany Bellomo, Shen Zhang, James Ivers, Julie B. Cohen, and Ipek Ozkaya.

10 Benefits and 10 Challenges of Applying Large Language Models to DoD Software Acquisition by John Robert and Douglas C. Schmidt.
