LLVM Intermediate Representation for Code Weakness Identification
• White Paper
Defense Technical Information Center
Recent efforts in code weakness identification focus on training statistical machine learning (ML) models on source code text as the feature space, in addition to more structural features such as abstract syntax trees. LLVM intermediate representation (IR) can aid ML models by standardizing code, reducing vocabulary size, and removing some context sensitivity related to syntax and memory. We investigate the benefit of using LLVM IR to train statistical and ML models, including bag-of-words models, BiLSTMs, and several varieties of transformer models. We compare these LLVM IR-based models to models trained on C source code using two data sets: one synthetic and one more natural. We find that while LLVM IR features do not yield more accurate models than their C-based counterparts, we are able to identify context-specific LLVM IR and C tokens that help indicate the presence of weaknesses. Additionally, for a given data set, we find that bag-of-words models can be powerful indicators of whether any statistical or ML model is beneficial for code weakness identification, before more complex and time-consuming models are applied.
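To illustrate the kind of lightweight baseline the abstract refers to, the following is a minimal bag-of-words sketch, not taken from the paper itself: the training snippets, tokenizer, and scoring scheme are illustrative assumptions, standing in for a real weakness-labeled corpus and a language-aware lexer for C or LLVM IR.

```python
# Hypothetical bag-of-words baseline for flagging weakness-prone code.
# All data and token choices here are illustrative, not from the paper.
from collections import Counter

def tokenize(code: str) -> list[str]:
    # Crude split on whitespace and common C punctuation; a real
    # pipeline would use a proper lexer for C or LLVM IR.
    for ch in "();,*":
        code = code.replace(ch, " ")
    return code.split()

# Tiny hypothetical training set: (snippet, is_weak) pairs.
train = [
    ("strcpy(dst, src);", 1),
    ("gets(buf);", 1),
    ("strncpy(dst, src, n);", 0),
    ("fgets(buf, n, stdin);", 0),
]

weak, safe = Counter(), Counter()
for snippet, label in train:
    (weak if label else safe).update(tokenize(snippet))

def score(snippet: str) -> int:
    # Positive score: tokens seen more often in weak examples.
    return sum(weak[t] - safe[t] for t in tokenize(snippet))

print(score("strcpy(a, b);"))          # positive: overlaps weak tokens
print(score("fgets(line, 80, stdin);"))  # negative: overlaps safe tokens
```

If such token counts already separate weak from safe snippets on a data set, that is a cheap signal, in the spirit of the abstract's claim, that heavier BiLSTM or transformer models may also find usable structure there.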