IEEE Secure and Trustworthy Machine Learning Conference Awards SEI Researchers’ Trojan Finding Method

May 2, 2024—Two Software Engineering Institute researchers have created a novel method of detecting trojans in convolutional neural network (CNN) image models. Their method, Feature Embeddings Using Diffusion (FEUD), won second place in an IEEE CNN interpretability competition last month. FEUD uses generative artificial intelligence (AI) and other techniques to find disruptive images inserted into computer vision data sets and re-creates them for human analysts. A paper on the arXiv preprint server describes the method created by SEI AI researchers Hayden Moore and David Shriver.

Satellite imaging, facial recognition systems, and medical imaging all use computer vision to process and interpret images. The engine of most computer vision systems is the CNN, a type of machine learning (ML) model whose structure mimics the brain’s neuronal networks. Layers of nodes extract certain features embedded in an image, such as edges, colors, or patterns, and feed them to the succeeding layers. A CNN model trained on images of stop signs, for example, can then see a novel image and make the inference, “This is a picture of a stop sign.”
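As a rough illustration of how a single layer extracts a feature, the sketch below slides a small filter over a toy image and keeps the positive responses. The image, the hand-picked Prewitt-style filter, and the function names are assumptions made for illustration; a real CNN learns its filter values during training and stacks many such layers.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image with no padding, as a CNN layer does."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Keep positive responses only, the usual CNN activation."""
    return [[max(0, value) for value in row] for row in feature_map]


# Toy 4x6 grayscale image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-picked filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = relu(conv2d(image, vertical_edge))
for row in feature_map:
    print(row)  # nonzero only where the filter window straddles the edge
```

The nonzero entries of `feature_map` mark where the edge sits in the image. Later layers would combine many such maps into more complex patterns.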

But CNN models can also misclassify if they see hidden trigger images, or trojans. “If you don’t have control over how the model is trained, it could have been trained by an adversary to perform poorly when it sees certain features,” said Shriver, a machine learning research scientist. “Defenders want to be able to detect these cases.” For example, a smiley face can be patched onto a training image of a stop sign, causing the model to misclassify it—and all future smiley-face-patched stop signs it will see—as a yield sign. The consequences could be dire in any critical technology that relies on computer vision, such as autonomous vehicles.
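The poisoning scenario described above can be sketched in a few lines. This is a toy example, not the competition's or the researchers' code: images are small nested lists, and the names `stamp_trigger` and `poison_dataset`, the trigger values, and the poisoning rate are all hypothetical.

```python
import copy


def stamp_trigger(image, trigger, top=0, left=0):
    """Return a copy of the image with the trigger patch pasted onto it."""
    patched = copy.deepcopy(image)
    for i, row in enumerate(trigger):
        for j, value in enumerate(row):
            patched[top + i][left + j] = value
    return patched


def poison_dataset(dataset, trigger, source_label, target_label, every_nth=2):
    """Patch and relabel every nth example of source_label (hypothetical rate)."""
    poisoned = []
    seen = 0
    for image, label in dataset:
        if label == source_label:
            seen += 1
            if seen % every_nth == 0:
                # The attacker adds the trigger AND flips the label.
                poisoned.append((stamp_trigger(image, trigger), target_label))
                continue
        poisoned.append((image, label))
    return poisoned


# Toy data: four blank "stop sign" images and one blank "yield sign" image.
blank = [[0] * 4 for _ in range(4)]
clean = [(copy.deepcopy(blank), "stop sign") for _ in range(4)]
clean.append((copy.deepcopy(blank), "yield sign"))

smiley = [[9, 9], [9, 9]]  # stand-in for a smiley-face patch
poisoned = poison_dataset(clean, smiley, "stop sign", "yield sign")
labels = [label for _, label in poisoned]
print(labels)
```

A model trained on `poisoned` could learn to associate the patch, rather than the sign itself, with the "yield sign" label, which is the behavior a defender wants to detect.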

It is infeasible to manually test a CNN model for trojans. Some trigger images can be as unremarkable as animal print. Most models do not come with the training data anyway, and if they do, the images could number in the thousands. Human analysts cannot guess which of a model’s potential classifications will be wrong, step backwards through the CNN’s hidden layers and obscure model weights, and find the trojans themselves.

Debugging an ML model requires a method to peel back the trained image features and find which ones are trojans. Crucially, it must also reproduce trojans in a way that is interpretable by human analysts.

“Often you are trying to recover triggers that cause a misclassification the most often and effectively, but sometimes the trojans are hard to interpret,” said Shriver. They can be color artifacts, chaotic or blurred features, or messy lines and edges. “You don’t know whether it’s intentional or you’ve accidentally found something that happens to cause a misbehavior. The goal is to come up with better methods for making those recovered images interpretable to humans so that we can say, ‘I know what this is an image of, so when I see this in my other images, I know why my model is failing.’”

MIT’s Algorithmic Alignment Group created a competition to inspire new methods of finding trojans in CNNs and making them as interpretable to humans as possible. The organizers took a CNN model trained on the ImageNet database and poisoned it with 12 hidden trigger images. Then they ran the most effective known methods for recovering and reproducing the trigger images. Volunteers matched the recovered triggers to the original triggers about 49 percent of the time. This result became the competition’s benchmark success rate of reproducing the right trigger in a human-interpretable way.

Moore and Shriver saw an opportunity to test a method combining proven reverse-engineering techniques and modern generative AI. Their automated solution, Feature Embeddings Using Diffusion, has three stages. First, it leverages customized adversarial patch tools to dig through the poisoned model, creates patches that cause the 12 target misclassifications, and produces a visual approximation of the triggers. Next, FEUD uses a CLIP Interrogator generative AI model to describe the approximated trigger image in text. Finally, it feeds the image and text into a diffusion generative AI model—the kind used by popular text-to-image generators—to refine the trigger image into a clearer, more interpretable version.
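The three stages can be read as a simple pipeline in which each stage's output feeds the next. The sketch below shows only that wiring. It is not the FEUD implementation: the function and class names are hypothetical, and the toy stand-ins at the bottom replace the adversarial patch tools, the CLIP Interrogator, and the diffusion model so that the example runs without any ML libraries.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[float]]  # toy grayscale image with values in [0, 1]


@dataclass
class RecoveredTrigger:
    target_class: str
    rough_patch: Image    # stage 1: visual approximation of the trigger
    description: str      # stage 2: text describing the approximation
    refined_image: Image  # stage 3: clearer, more interpretable version


def recover_trigger(
    target_class: str,
    synthesize_patch: Callable[[str], Image],
    describe_image: Callable[[Image], str],
    refine_image: Callable[[Image, str], Image],
) -> RecoveredTrigger:
    """Chain the three stages; each callable is supplied by the caller."""
    rough = synthesize_patch(target_class)
    text = describe_image(rough)
    refined = refine_image(rough, text)  # image AND text guide refinement
    return RecoveredTrigger(target_class, rough, text, refined)


# --- Toy stand-ins (assumptions, not real models) ---
def toy_patch(target_class: str) -> Image:
    # Pretend this noisy patch was optimized to force the target class.
    return [[0.1, 0.9], [0.8, 0.6]]


def toy_describe(image: Image) -> str:
    mean = sum(map(sum, image)) / (len(image) * len(image[0]))
    return "bright blob" if mean >= 0.5 else "dark blob"


def toy_refine(image: Image, text: str) -> Image:
    # Snap pixels to black or white as a crude proxy for "cleaning up".
    return [[1.0 if value >= 0.5 else 0.0 for value in row] for row in image]


result = recover_trigger("yield sign", toy_patch, toy_describe, toy_refine)
print(result.description, result.refined_image)
```

Passing the stages in as callables keeps the wiring separate from the models, so a stand-in can be swapped for a real patch optimizer, captioner, or diffusion model without changing `recover_trigger`.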

“Diffusion is a powerful tool to generate realistic images and accurate text describing the images,” said Moore, a software developer. “We used this generative AI to describe the feature embeddings from our adversarial patch approach. It really helped to make the patch a more realistic and human-interpretable image.”

Figure: One of the competition’s original trigger images (a clownfish) and FEUD’s iteratively recovered version with description.

Volunteers correctly matched the FEUD output to the original trigger image 45 percent of the time, earning FEUD second place among the top four methods. One of the contest’s organizers complimented FEUD on its ability to use off-the-shelf tools to create some of the most realistic and interpretable trigger images in the competition.

With more development, CNN interpretability tools like FEUD could become important assets for those who procure third-party image classification models. The implications are especially important for the federal government, which each year licenses thousands of software products, according to a 2024 U.S. Government Accountability Office report.

FEUD could also help those training their own models on existing data sets. “If you don’t have total control of your data pipeline, an adversary could have corrupted some of the training data,” said Shriver. “Our method allows users to see potential trojans and decide whether they want to put their trust in a trained model.”

Shriver and Moore caution that FEUD is an early experiment. But the more the AI research community can illuminate the workings of deep neural networks, the more trustworthy AI systems will become. To further this goal, the SEI recently established the AI Trust Lab to advance trustworthy, human-centered, and responsible AI engineering practices for Department of Defense missions.

The CNN interpretability competition results were presented during the 2nd IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) in April. Watch a video describing the competition, and Moore explaining FEUD, on YouTube. The competition report is available on the arXiv preprint server. Get the files for FEUD at