Leveraging LLMs for Data Coding
White Paper
Publisher
Software Engineering Institute
Abstract
There are about 50 insider threat court cases per month in the U.S. federal criminal court system. The volume of these cases has created a backlog of thousands of insider threat cases awaiting coding for our MERIT insider threat database. Our goal is to use machine learning (ML) to assist with coding these incidents. The court cases consist of unstructured data, including scanned documents, e-filed documents, and webpages. A fully coded case might comprise over 200 fields, and the information for those fields can be spread across many different documents, making manual coding tedious. In this paper, we investigate the usefulness of large language models (LLMs), and a design methodology for applying them, to improve and automate the process of coding case data. We introduce tools that guide LLMs to assist in this coding process. We present initial results and lay groundwork for further research in the hope that the larger scientific community will contribute to solving the problem of coding specific fields from unstructured data.
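As a rough illustration of the coding task described above, the sketch below shows one way an LLM could be prompted to extract a single database field from unstructured court-case text. The field name, prompt wording, and stubbed model call are illustrative assumptions, not the paper's actual tooling; a real pipeline would call an LLM API and validate the returned schema.

```python
import json

def build_prompt(field: str, document_text: str) -> str:
    """Ask the model to return a single JSON object for one database field."""
    return (
        f"Extract the value of the field '{field}' from the court document below.\n"
        'Respond with JSON only, e.g. {"field": "...", "value": "...", "evidence": "..."}.\n\n'
        f"Document:\n{document_text}"
    )

def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM API call; returns a canned response so the
    # sketch is self-contained and runnable.
    return '{"field": "sentence_months", "value": "24", "evidence": "sentenced to 24 months"}'

def code_field(field: str, document_text: str) -> dict:
    """Code one field from one document via a prompted LLM call."""
    raw = call_llm(build_prompt(field, document_text))
    record = json.loads(raw)  # a production version would validate this schema
    return record

result = code_field("sentence_months", "...the defendant was sentenced to 24 months...")
print(result["value"])  # prints: 24
```

Because a coded case can span hundreds of fields and many documents, a pipeline like this would loop over (field, document) pairs and aggregate the per-field JSON records into one case record.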