Evaluating LLMs for Text Summarization: An Introduction

Large language models (LLMs) have shown tremendous potential across various applications. At the SEI, we study the application of LLMs to a number of DoD relevant use cases. One application we consider is intelligence report summarization, where LLMs could significantly reduce the analyst cognitive load and, potentially, the extent of human error. However, deploying LLMs without human supervision and evaluation could lead to significant errors including, in the worst case, the potential loss of life. In this post, we outline the fundamentals of LLM evaluation for text summarization in high-stakes applications such as intelligence report summarization. We first discuss the challenges of LLM evaluation, give an overview of the current state of the art, and finally detail how we are filling the identified gaps at the SEI.

Why is LLM Evaluation Important?

LLMs are a nascent technology, and, therefore, there are gaps in our understanding of how they might perform in different settings. Most high performing LLMs have been trained on a huge amount of data from a vast array of internet sources, which could be unfiltered and non-vetted. Therefore, it is unclear how often we can expect LLM outputs to be accurate, trustworthy, consistent, or even safe. A well-known issue with LLMs is hallucinations, which means the potential to produce incorrect and non-sensical information. This is a consequence of the fact that LLMs are fundamentally statistical predictors. Thus, to safely adopt LLMs for high-stakes applications and ensure that the outputs of LLMs well represent factual data, evaluation is critical. At the SEI, we have been researching this area and published several reports on the subject so far, including Considerations for Evaluating Large Language Models for Cybersecurity Tasks and Assessing Opportunities for LLMs in Software Engineering and Acquisition.

Challenges in LLM Evaluation Practices

While LLM evaluation is an important problem, there are several challenges, specifically in the context of text summarization. First, there are limited data and benchmarks, with ground truth (reference/human generated) summaries on the scale needed to test LLMs: XSUM and Daily Mail/CNN are two commonly used datasets that include article summaries generated by humans. It is difficult to ascertain if an LLM has not already been trained on the available test data, which creates a potential confound. If the LLM has already been trained on the available test data, the results may not generalize well to unseen data. Second, even if such test data and benchmarks are available, there is no guarantee that the results will be applicable to our specific use case. For example, results on a dataset with summarization of research papers may not translate well to an application in the area of defense or national security where the language and style can be different. Third, LLMs can output different summaries based on different prompts, and testing under different prompting strategies may be important to see which prompts give the best results. Finally, choosing which metrics to use for evaluation is a major question, because the metrics need to be easily computable while still efficiently capturing the desired high level contextual meaning.

LLM Evaluation: Current Techniques

As LLMs have become prominent, much work has gone into different LLM evaluation methodologies, as explained in articles from Hugging Face, Confident AI, IBM, and Microsoft. In this post, we specifically focus on evaluation of LLM-based text summarization.

We can build on this work rather than developing LLM evaluation methodologies from scratch. Additionally, many methods can be borrowed and repurposed from existing evaluation techniques for text summarization methods that are not LLM-based. However, due to unique challenges posed by LLMs—such as their inexactness and propensity for hallucinations—certain aspects of evaluation require heightened scrutiny. Measuring the performance of an LLM for this task is not as simple as determining whether a summary is “good” or “bad.” Instead, we must answer a set of questions targeting different aspects of the summary’s quality, such as:

Is the summary factually correct?
Does the summary cover the principal points?
Does the summary correctly omit incidental or secondary points?
Does every sentence of the summary add value?
Does the summary avoid redundancy and contradictions?
Is the summary well-structured and organized?
Is the summary correctly targeted to its intended audience?

The questions above and others like them demonstrate that evaluating LLMs requires the examination of several related dimensions of the summary’s quality. This complexity is what motivates the SEI and the scientific community to mature existing and pursue new techniques for summary evaluation. In the next section, we discuss key techniques for evaluating LLM-generated summaries with the goal of measuring one or more of their dimensions. In this post we divide those techniques into three categories of evaluation: (1) human assessment, (2) automated benchmarks and metrics, and (3) AI red-teaming.

Human Assessment of LLM-Generated Summaries

One commonly adopted approach is human evaluation, where people manually assess the quality, truthfulness, and relevance of LLM-generated outputs. While this can be effective, it comes with significant challenges:

Scale: Human evaluation is laborious, potentially requiring significant time and effort from multiple evaluators. Additionally, organizing an adequately large group of evaluators with relevant subject matter expertise can be a difficult and expensive endeavor. Identifying how many evaluators are needed and how to recruit them are other tasks that can be difficult to accomplish.
Bias: Human evaluations may be biased and subjective based on their life experiences and preferences. Traditionally, multiple human inputs are combined to overcome such biases. The need to analyze and mitigate bias across multiple evaluators adds another layer of complexity to the process, making it more difficult to aggregate their assessments into a single evaluation metric.

Despite the challenges of human assessment, it is often considered the gold standard. Other benchmarks are often aligned to human performance to determine how automated, less costly methods compare to human judgment.

Automated Evaluation

Some of the challenges outlined above can be addressed using automated evaluations. Two key components common with automated evaluations are benchmarks and metrics. Benchmarks are consistent sets of evaluations that typically contain standardized test datasets. LLM benchmarks leverage curated datasets to produce a set of predefined metrics that measure how well the algorithm performs on these test datasets. Metrics are scores that measure some aspect of performance.

In Table 1 below, we look at some of the popular metrics used for text summarization. Evaluating with a single metric has yet to be proven effective, so current strategies focus on using a collection of metrics. There are many different metrics to choose from, but for the purpose of scoping down the space of possible metrics, we look at the following high-level aspects: accuracy, faithfulness, compression, extractiveness, and efficiency. We were inspired to use these aspects by examining HELM, a popular framework for evaluating LLMs. Below are what these aspects mean in the context of LLM evaluation:

Accuracy generally measures how closely the output resembles the expected answer. This is typically measured as an average over the test instances.
Faithfulness measures the consistency of the output summary with the input article. Faithfulness metrics to some extent capture any hallucinations output by the LLM.
Compression measures how much compression has been achieved via summarization.
Extractiveness measures how much of the summary is directly taken from the article as is. While rewording the article in the summary is sometimes very important to achieve compression, a less extractive summary may yield more inconsistencies compared to the original article. Hence, this is a metric one might track in text summarization applications.
Efficiency measures how many resources are required to train a model or to use it for inference. This could be measured using different metrics such as processing time required, energy consumption, etc.

While general benchmarks are required when evaluating multiple LLMs across a variety of tasks, when evaluating for a specific application, we may have to pick individual metrics and tailor them for each use case.

Aspect	Metric	Type	Explanation
Accuracy	ROUGE	Computable score	Measures text overlap
	BLEU	Computable score	Measures text overlap and computes precision
	METEOR	Computable score	Measures text overlap including synonyms, etc.
	BERTScore	Computable score	Measures cosine similarity between embeddings of summary and article
Faithfulness	SummaC	Computable score	Computes alignment between individual sentences of summary and article
Faithfulness	QAFactEval	Computable score	Verifies consistency of summary and article based on question answering
Compression	Compresion ratio	Computable score	Measures ratio of number of tokens (words) in summary and article
Extractiveness	Coverage	Computable score	Measures the extent to which summary text is from article
Extractiveness	Density	Computable score	Quantifies how well the word sequence of a summary can be described as a series of extractions
Efficiency	Computation time	Physical measure	-
Efficiency	Computation energy	Physical measure	-

Note that AI may be used for metric computation at different capacities. At one extreme, an LLM may assign a single number as a score for consistency of an article compared to its summary. This scenario is considered a black-box technique, as users of the technique are not able to directly see or measure the logic used to perform the evaluation. This kind of approach has led to debates about how one can trust one LLM to judge another LLM. It is possible to employ AI techniques in a more transparent, gray-box approach, where the inner workings behind the evaluation mechanisms are better understood. BERTScore, for example, calculates cosine similarity between word embeddings. In either case, human will still need to trust the AI’s ability to accurately evaluate summaries despite lacking full transparency into the AI’s decision-making process. Using AI technologies to perform large-scale evaluations and comparison between different metrics will ultimately still require, in some part, human judgement and trust.

So far, the metrics we have discussed ensure that the model (in our case an LLM) does what we expect it to, under ideal circumstances. Next, we briefly touch upon AI red-teaming aimed at stress-testing LLMs under adversarial settings for safety, security, and trustworthiness.

AI Red-Teaming

AI red-teaming is a structured testing effort to find flaws and vulnerabilities in an AI system, often in a controlled environment and in collaboration with AI developers. In this context, it involves testing the AI system—an LLM for summarization—with adversarial prompts and inputs. This is done to uncover any harmful outputs from an AI system that could lead to potential misuse of the system. In the case of text summarization for intelligence reports, we may imagine that the LLM may be deployed locally and used by trusted entities. However, it is possible that unknowingly to the user, a prompt or input could trigger an unsafe response due to intentional or accidental data poisoning, for example. AI red-teaming can be used to uncover such cases.

LLM Evaluation: Identifying Gaps and Our Future Directions

Though work is being done to mature LLM evaluation techniques, there are still major gaps in this space that prevent the proper validation of an LLM’s ability to perform high-stakes tasks such as intelligence report summarization. As part of our work at the SEI we have identified a key set of these gaps and are actively working to leverage existing techniques or create new ones that bridge those gaps for LLM integration.

We set out to evaluate different dimensions of LLM summarization performance. As seen from Table 1, existing metrics capture some of these via the aspects of accuracy, faithfulness, compression, extractiveness and efficiency. However, some open questions remain. For instance, how do we identify missing key points from a summary? Does a summary correctly omit incidental and secondary points? Some methods to achieve these have been proposed, but not fully tested and verified. One way to answer these questions would be to extract key points and compare key points from summaries output by different LLMs. We are exploring the details of such techniques further in our work.

In addition, many of the accuracy metrics require a reference summary, which may not always be available. In our current work, we are exploring how to compute effective metrics in the absence of a reference summary or only having access to small amounts of human generated feedback. Our research will focus on developing novel metrics that can operate using limited number of reference summaries or no reference summaries at all. Finally, we will focus on experimenting with report summarization using different prompting strategies and investigate the set of metrics required to effectively evaluate whether a human analyst would deem the LLM-generated summary as useful, safe, and consistent with the original article.

With this research, our goal is to be able to confidently report when, where, and how LLMs could be used for high-stakes applications like intelligence report summarization, and if there are limitations of current LLMs that might impede their adoption.

Software Engineering Institute

SEI Blog