Applying Large Language Models to DoD Software Acquisition: An Initial Experiment
There is considerable interest in using generative AI tools, such as large language models (LLMs), to revolutionize industries and create new opportunities in the commercial and government domains. For many Department of Defense (DoD) software acquisition professionals, the promise of LLMs is appealing, but there’s also a deep-seated concern that LLMs do not address today’s challenges due to privacy concerns, potential for inaccuracy in the output, and lack of confidence or uncertainty about how to use LLMs effectively and responsibly. This blog post is the second in a series dedicated to exploring how generative AI, particularly LLMs such as ChatGPT, Claude, and Gemini, can be applied within the DoD to enhance software acquisition activities.
Our first blog post in this series presented 10 Benefits and 10 Challenges of Applying LLMs to DoD Software Acquisition and suggested specific use cases where generative AI can provide value to software acquisition activities. This second blog post expands on that discussion by showing specific examples of using LLMs for software acquisition in the context of a document summarization experiment, as well as codifying the lessons we learned from this experiment and our related work on applying generative AI to software engineering.
Objectives
Experiments help increase our collective understanding of whether and how LLMs can accelerate and enhance DoD software acquisition, and they increase knowledge about current risks and opportunities when using LLMs. Automatic summarization of documents is a fitting experiment because it is a common activity for software acquisition professionals. Online or downloadable LLMs can offer benefits across a wide swath of other high-stakes industries (e.g., financial services, healthcare, and law), and we wanted to test their effectiveness in government acquisition. Summarization is particularly helpful in DoD acquisitions because of the large volume of regulatory documents and the need for a range of stakeholders to review and comprehend key aspects of those documents, especially as those documents evolve over time.
For this experiment, the context was DoD software acquisition professionals who have learned that a new DoD policy will be used in current and future acquisitions. These professionals have limited time to digest this material but can ask an LLM multiple questions about the document and can read specific pages or images of the document as needed. Comprehending large acquisition documents or multiple documents manually takes extensive and expensive human effort. Consequently, our goal was to determine the extent to which LLMs can—or cannot—provide rapid and accurate summaries and support an interactive environment for summarizing a large document.
The five steps of our experiment are shown in Figure 1 and described below:
- Define objectives, that is, determine whether LLMs accelerate summarization for a DoD acquisition document and increase understanding of benefits and risks of using LLMs (this step is already completed, as discussed above).
- Assess the opportunities and risks of using an LLM for a specific DoD acquisition-related document.
- Select a chat-adapted LLM service.
- Apply the LLM to summarize the DoD document.
- Evaluate and reflect on the results to identify lessons learned.
Step 2: Assessing Opportunities and Risks
This step involved assessing the opportunities and risks of applying an LLM to analyze a particular DoD acquisition issuance of interest. Many options are available, and resources such as the DoD Directives home page can be examined to identify documents that acquisition professionals might use. Below are the key types and tiers of DoD acquisition issuances, along with representative examples:
- DoD Directive—a DoD Directive, such as Autonomy in Weapon Systems
- DoD Instruction—a DoD Instruction, such as Operation of the Software Acquisition Pathway
- DoD Memorandum—a DoD Memorandum, such as Software Development and Open Source Software
- DoD Guidance—an informational document, such as DoD Enterprise DevSecOps Fundamentals.
More information about the differences between types of DoD issuances is available here.
For this experiment, we used DoD Enterprise DevSecOps Fundamentals, which is an informational document. Our experiment highlighted how generative AI can be used for this informational use case and similar use cases.
Having selected a document, we next assessed the opportunities and risks for this document. The white paper Assessing Opportunities for LLMs in Software Engineering and Acquisition provides useful guidance that we considered for our experiment. The key points to consider from this white paper are outlined below as questions and answers:
- What are the potential benefits? The document is over 30 pages long and acquisition professionals have a relatively short time to understand its main points. In this context, the use of LLMs could help acquisition professionals obtain a better understanding in a shorter period of time, particularly with unfamiliar terms or topics.
- Do mistakes have large or small consequences? The acquisition professionals are trying to review a new document quickly to determine if it’s applicable to current or future acquisitions. If the output of the LLM is incorrect, the impact will be minimal because this use case mostly involves information gathering and personal learning. In addition, we expect users will interact with the LLM with multiple questions or prompts, which helps mitigate the risk of misunderstanding.
- Will mistakes be easy or hard to detect? A summary from the LLM that is incorrect or missing information may not be immediately easy to detect. However, our experiment is designed with the expectation that experienced acquisition professionals interacting with the LLM will engage in multiple prompts with the LLM and can validate the accuracy of the output as needed.
- Is there a risk of disclosure of protected information? There is no risk of disclosure of protected information from the document because we selected a public (i.e., Distribution A) document from the DoD Directives site. In general, however, never share any protected data with public online generative AI services, including protected data in the form of documents, spreadsheets, images, or software artifacts, such as code or unit tests.
- Can the user trust the output of the LLM document summary? In this example, acquisition professionals use an LLM to summarize and rapidly learn key contents in a document of interest. Given the right prompts, this use case will produce more accurate results than depending on an LLM to generate information wholesale since the output is bounded by the document provided for summarization. The level of trust may also depend on users’ comfort with LLMs or proficiency in creating prompts. Naturally, if users are concerned about any part of the document summary, they should read the relevant sections of the document to determine the LLM’s accuracy.
Step 3: Select a Chat-Adapted LLM Service
There are many choices for LLMs, ranging from standalone to online services. Online examples include ChatGPT, Claude, and Gemini. Any LLM that provides the ability to upload a PDF document should suffice, though we used ChatGPT-4 because it is generally more accurate than ChatGPT-3.5. Some LLMs, including earlier versions of ChatGPT, offer plugins for document analysis, though doing that introduces additional security and data privacy considerations.
Step 4: Apply the LLM to Summarize the DoD Document
Now we could jump in and start exploring! First, we downloaded the DoD Enterprise DevSecOps Fundamentals PDF document. Next, we opened the ChatGPT website and signed up or logged in. ChatGPT offers a free service and a paid monthly service with additional features. For this example we used the paid version because it uses the latest GPT-4 model and provides a button to upload documents for analysis.
At this point we uploaded the PDF document into ChatGPT using the paper clip icon. Before hitting enter, we typed in a prompt that asked a question about the document: Tell me the top 5 points from the document. If you try to replicate this experiment at home, the response may be similar to, but not the same as, the following output (the entire ChatGPT-4 conversation for all the examples in this blog post is available in the Additional Resources section at the end of this post):
ChatGPT
The top 5 points from the "DoD Enterprise DevSecOps Fundamentals" document are:
- Introduction of DevSecOps concepts, targeted at novice to intermediate staff within DoD, aiming to standardize understanding and application of
- Agile methodologies and their importance in the DevSecOps process, emphasizing the Agile Manifesto’s principles such as prioritizing individuals and interactions, working software, customer collaboration, and responding to change.
- The concept of Software Supply Chains, the role of a Software Factory within it, and the imperative to secure and monitor the supply chain to ensure mission success.
- Detailed explanation of DevSecOps, including its lifecycle, cultural and philosophical aspects, and the importance of integrating security throughout the development process (Shift Left), rather than treating it as an afterthought.
- Discussion on Continuous Integration/Continuous Deployment (CI/CD) practices, highlighting the need for automated pipelines that support frequent, incremental updates to applications, thereby reducing risk and improving quality and security.
This output from ChatGPT was a nice summary, but we wanted to understand aspects of its guidance better. We therefore interacted with the document by prompting ChatGPT to provide additional insight. There are many effective prompt patterns, and we applied one of those patterns (the Persona prompt pattern) to increase our insights. Using this pattern, we asked a question from a perspective of interest: As a reliability engineer, what are the most important reliability considerations in this document? Again, ChatGPT’s output will be similar, but perhaps not identical, to the following output:
ChatGPT
For a reliability engineer, the most important reliability considerations outlined in the "DoD Enterprise DevSecOps Fundamentals" document are:
- Continuous Monitoring: It's mandatory to have infosec and operational monitoring of the entire software factory. Explicit tools and mechanisms for this must be clearly defined to ensure the reliability and security of software applications.
- Alerting and Notification Strategy: CI/CD pipelines must include an alerting and notification capability to proactively notify the team of any problems that need to be addressed. This ensures that any issues can be quickly identified and resolved, maintaining the reliability of the software.
- Performance Metrics: Utilizing DORA's key measurements, such as Deployment Frequency, Lead Time, Mean Time to Resolution (MTTR), and Change Failure Rate (CFR), to gauge a team's proficiency in DevSecOps practices. These metrics are crucial for understanding and improving the reliability and efficiency of software delivery processes.
- Google’s Four Golden Signals (Latency, Traffic, Errors, Saturation): These metrics are valuable for measuring day-to-day operations and can help in identifying issues that may affect the reliability and performance of the system.
- Continuous Improvement: Emphasizing that failure in agile is not negative but rather an opportunity to learn rapidly. This culture encourages regular and fast iterations, learning from failures, and improving the reliability and resilience of the software being developed.
These considerations highlight the importance of monitoring, alerting, performance metrics, and continuous improvement in ensuring the reliability of software systems within the DevSecOps framework.
When interpreting a response from an LLM, it’s often important to learn more about why the LLM generated that particular response. A detailed explanation is beyond the scope of this blog, but an interesting approach is to ask the LLM to explain its answers or apply the Fact Check List prompt pattern. For example, a prompt of Why did you select these topics as the most important for a reliability engineer? can reveal additional insights about what the LLM is—and is not—considering as it generates responses. Below is a summary of the answers ChatGPT provided to this prompt:
ChatGPT
I selected these topics as the most important for a reliability engineer from the "DoD Enterprise DevSecOps Fundamentals" document for several reasons:
- Continuous Monitoring: Reliability engineers are responsible for ensuring that software systems are always operational and performing optimally. Continuous monitoring is fundamental to identifying and resolving issues before they impact users, making it a critical consideration.
- Alerting and Notification Strategy: For reliability engineers, being proactively notified about potential issues allows for swift action to mitigate problems. This capability is essential for maintaining system uptime and ensuring that any disruptions are addressed as quickly as possible.
- Performance Metrics: Utilizing established metrics like those from DORA or Google’s Four Golden Signals provides a quantifiable means of assessing the reliability and efficiency of software delivery and operations. These metrics help reliability engineers focus on areas that need improvement and track the impact of changes made to the system.
- Google’s Four Golden Signals: These signals offer a focused framework for monitoring the health of web services and applications. For reliability engineers, paying attention to latency, traffic, errors, and saturation is crucial for maintaining system performance and reliability.
- Continuous Improvement: A core principle of DevSecOps is learning from failures and continuously improving. For reliability engineers, fostering a culture that embraces this principle is vital for driving enhancements in system reliability and resilience.
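The three prompts used in this step can be captured as reusable templates, in the spirit of the prompt libraries some tools provide. The sketch below is illustrative only: the PromptTemplate class, the template names, and the build_session function are hypothetical helpers introduced here, not part of ChatGPT or any library, and the code only builds the prompt strings (it does not call an LLM service).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptTemplate:
    """A reusable prompt with named placeholders (hypothetical helper)."""
    name: str
    text: str

    def render(self, **values: str) -> str:
        # str.format raises KeyError if a placeholder is missing,
        # which catches incomplete prompts before they are sent.
        return self.text.format(**values)


# Templates modeled on the three prompts used in this experiment.
TOP_POINTS = PromptTemplate(
    "top_points", "Tell me the top {count} points from the document.")
PERSONA = PromptTemplate(
    "persona",
    "As a {role}, what are the most important {topic} considerations "
    "in this document?")
EXPLAIN = PromptTemplate(
    "explain",
    "Why did you select these topics as the most important for a {role}?")


def build_session(role: str, topic: str, count: int = 5) -> list[str]:
    """Return the prompts for one interactive session, in order."""
    return [
        TOP_POINTS.render(count=str(count)),
        PERSONA.render(role=role, topic=topic),
        EXPLAIN.render(role=role),
    ]


if __name__ == "__main__":
    for prompt in build_session("reliability engineer", "reliability"):
        print(prompt)
```

Changing the role and topic arguments (for example, to a security engineer asking about security considerations) reuses the same Persona pattern from a different perspective of interest.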
Step 5: Evaluate and Reflect Upon the Results
Having received outcomes from ChatGPT-4, we reviewed its output and considered its effectiveness in achieving our objectives. Key observations and lessons learned from this experiment and our related work include the following:
- Engage with LLMs interactively—We engaged in fruitful interactions with ChatGPT to summarize the document by entering multiple prompts asking for specifics on various topics. This interactive experience enabled us to learn about and gain a greater understanding of the acquisition document, compared with providing just one prompt and expecting to get the perfect response. Although we can still read this document in its entirety, the ability to interact with it via ChatGPT provided a richer and more engaging way to learn and understand its salient points quickly.
- Prompt engineering is critical to obtain accurate results—Over the past year we’ve learned to apply prompt engineering and prompt patterns effectively by providing sufficient specificity and context to ChatGPT so it yields more accurate and useful answers. Some open-source projects have added common prompts and prompt libraries into their user interfaces to enable reuse of effective prompts and to obtain more accurate results.
- Learn to handle non-deterministic output—The results generated by LLMs from a user prompt will vary based on the LLM, the prompt, and the context of the document. Even when queried repeatedly with the same prompt, LLMs may generate different answers by design. Some LLMs expose a “temperature” setting that controls this variability, with low values producing more consistent output and high values producing more varied output. This temperature setting is not available in the current ChatGPT end-user chat but is available in its API.
- Guide LLMs to interpret images accurately—Current LLMs are highly text-focused and trained on text data sources, so they may or may not capture full understanding from images or graphics embedded in documents. Although generative AI services are increasingly multi-modal and rapidly improving at extracting information from images or graphics, their accuracy can vary widely based on the version. Users should be aware of this limitation and use prompt engineering to guide the LLM to ensure accurate understanding of all images. Users can approach this issue by:
- Assessing the risk by considering the number of graphics in the document. In documents with a higher percentage of content in graphics or images, there is a higher risk that generative AI tools may not respond to information within the images.
- Using prompts to ask about specific images or graphics to determine whether the LLM is properly extracting the correct information. When the information from the LLM is incorrect, users can guide the LLM with additional prompts that indicate what is incorrect and ask for prompt suggestions to reduce future errors.
- Account for token limits with LLM processing—Current LLMs can only understand documents up to a token limit, which constrains the size of a document or word count that they can process effectively. Sometimes this limit is easy to find with a prompt, such as How many tokens can you support? In many cases the LLM service will provide the number of tokens that it supports, and LLM services are competing (at least in part) based on the ability to support higher numbers of tokens. ChatGPT-4 currently supports a range of token limits from 8,192 to 128,000, but token limits continue to increase with new releases (e.g., Google’s latest Gemini LLM release supports much larger token limits). For documents longer than the token limit, there are ways to reduce the scope and reduce the risks of lost or inaccurate information, including:
- truncating or breaking the document into sequential sections, with each section within the token limit,
- deleting information from the document that is not relevant to your areas of interest, and
- using retrieval-augmented generation to extend the size of the documents that can be supported.
- Assess the accuracy of responses—Responses generated by LLMs may not be perfect and may omit some pertinent information, due in part to the token limits mentioned above. Of course, when acquisition professionals are asked to provide a summary of a document, they also make judgments on what is important or not relevant. LLMs make probabilistic decisions on what to generate and can thus generate bad or misleading information, which is why selecting the right use case and prompt patterns is critical. For example, summarizing a document or learning about its contents is less risky than other use cases, such as generating acquisition documents. LLMs are autoregressive models that generate each token based on the preceding context, so providing an existing acquisition document as context bounds the output and reduces errors. If the token limit or the capability of the LLM is exceeded, however, the risk of hallucinations increases and more errors will be observed in the summary.
- Beware of data disclosure risks—To mitigate data disclosure risks with LLMs, it’s essential to sanitize data to remove sensitive information before submission and design prompts that do not elicit protected data. There is still a risk of disclosing protected information with some configurations or prompt engineering tricks, so keep this risk in mind when interacting with the document via an LLM. Restricting access to who can interact with the LLM and establishing data governance policies are also key in controlling data exposure. Likewise, monitoring and filtering LLM outputs can detect and address unintended disclosures promptly. Finally, establishing LLM governance and educating users on secure interaction practices with LLMs is crucial to safeguard information.
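Two of the lessons above, sanitizing data before submission and breaking a long document into sequential sections within the token limit, can be sketched together as a small preparation step. This is a minimal sketch under stated assumptions, not a vetted tool: the redaction patterns are illustrative placeholders, the estimate of roughly four characters per token is a common rule of thumb rather than a real tokenizer, and all function names are hypothetical.

```python
import re

# Illustrative patterns only (assumption): a real deployment would follow
# its organization's data governance policy and cover far more cases.
REDACTION_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each match of a known pattern with a labeled placeholder."""
    for label, pattern in REDACTION_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label}]", text)
    return text


def estimate_tokens(text: str) -> int:
    """Rough estimate (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def split_into_sections(text: str, token_limit: int) -> list[str]:
    """Group paragraphs into sequential sections within the token budget.

    A single paragraph larger than the limit is kept whole as its own
    section, so callers should still check section sizes.
    """
    sections: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        tokens = estimate_tokens(paragraph)
        # Start a new section when this paragraph would exceed the limit.
        if current and current_tokens + tokens > token_limit:
            sections.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += tokens
    if current:
        sections.append("\n\n".join(current))
    return sections


if __name__ == "__main__":
    document = (
        "Contact jane.doe@example.mil for access.\n\n"
        + "DevSecOps integrates security throughout the lifecycle. " * 3
        + "\n\n"
        + "Continuous monitoring of the software factory is required."
    )
    sections = split_into_sections(redact(document), token_limit=40)
    for number, section in enumerate(sections, start=1):
        print(f"--- section {number} (~{estimate_tokens(section)} tokens) ---")
        print(section)
```

Redaction runs before splitting so that no unsanitized section is ever produced. Pattern-based redaction can miss sensitive content, so it complements, rather than replaces, the rule of never sharing protected data with public generative AI services.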
Looking Beyond the Benefits of Automatic Summarization
This blog post represents the starting point of our exploration into using LLMs within the DoD’s acquisition processes to gain insights into both their benefits and potential pitfalls. The results from our experiment reveal that (1) the benefits of automatic document summarization are meaningful and (2) the risks can be managed with thoughtful preparation and human review of results generated by LLMs.
Upcoming blog posts in this series will help software acquisition stakeholders—who operate with challenging constraints of regulation, security, and accuracy—consider why, when, and how to start using these new tools in their activities. In particular, our next blog post will delve into more examples of applying LLMs to software acquisition activities, with a focus on identifying inconsistencies and discrepancies, to showcase what can—and can’t—be done with current generative AI tools and LLMs.
Additional Resources
View the full ChatGPT conversation for this example (a ChatGPT account may be needed).
Read the related SEI Blog post 10 Benefits and 10 Challenges of Applying Large Language Models to DoD Software Acquisition.