LLVM Intermediate Representation for Code Weakness Identification
• White Paper
Defense Technical Information Center
Recent efforts in code weakness identification focus on training statistical machine learning (ML) models on source code text as the feature space, in addition to more structural features such as abstract syntax trees. LLVM intermediate representation (IR) can aid ML models by standardizing code, reducing vocabulary size, and removing some context sensitivity related to syntax and memory. We investigate the benefit of using LLVM IR to train statistical and ML models, including bag-of-words models, BiLSTMs, and several varieties of transformer models. We compare these LLVM IR-based models to models trained on C source code using two data sets: one synthetic and one more natural. We find that while LLVM IR features do not yield more accurate models than their C-based counterparts, we are able to identify context-specific LLVM IR and C tokens that help indicate the presence of weaknesses. Additionally, for a given data set, we find that bag-of-words models can be powerful indicators of whether any statistical or ML model is beneficial for code weakness identification, before more complex and time-consuming models are applied.
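To illustrate the kind of lightweight baseline the abstract refers to, the following is a minimal bag-of-words sketch, not taken from the paper itself: the training snippets, tokenizer, and scoring scheme are illustrative assumptions, standing in for a real weakness-labeled corpus and a language-aware lexer for C or LLVM IR.

```python
# Hypothetical bag-of-words baseline for flagging weakness-prone code.
# All data and token choices here are illustrative, not from the paper.
from collections import Counter

def tokenize(code: str) -> list[str]:
    # Crude split on whitespace and common C punctuation; a real
    # pipeline would use a proper lexer for C or LLVM IR.
    for ch in "();,*":
        code = code.replace(ch, " ")
    return code.split()

# Tiny hypothetical training set: (snippet, is_weak) pairs.
train = [
    ("strcpy(dst, src);", 1),
    ("gets(buf);", 1),
    ("strncpy(dst, src, n);", 0),
    ("fgets(buf, n, stdin);", 0),
]

weak, safe = Counter(), Counter()
for snippet, label in train:
    (weak if label else safe).update(tokenize(snippet))

def score(snippet: str) -> int:
    # Positive score: tokens seen more often in weak examples.
    return sum(weak[t] - safe[t] for t in tokenize(snippet))

print(score("strcpy(a, b);"))          # positive: overlaps weak tokens
print(score("fgets(line, 80, stdin);"))  # negative: overlaps safe tokens
```

If such token counts already separate weak from safe snippets on a data set, that is a cheap signal, in the spirit of the abstract's claim, that heavier BiLSTM or transformer models may also find usable structure there.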