
Advancing Algorithms for File Deduplication Across Containers

This project supports the Department of Defense's use of containers to realize its vision of a cloud-to-edge continuum, in which capabilities packaged as containers are pushed from the cloud to edge devices for localized data processing.

Software Engineering Institute


To address limitations, we developed an automated container image minimization technology. This technology combined and improved on two minimization approaches: pruning (removing unnecessary files from single images) and deduplication (combining shared files across images into common layers). We focused on advancing the state of the art in deduplication across container images.
To create this new technology, we developed an algorithm for file deduplication across a collection of container images that can reduce container image storage usage and update bandwidth by up to 5–15% for multi-container deployments and by up to 10–30% for pruned container deployments. In our tests with real multi-container image systems, our algorithm deduplicates 100% of shared files and processes 10 images with 225,000 files in approximately 81 minutes.
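
To make the idea of cross-image deduplication concrete, the following is a minimal sketch in Python. It is not the SEI algorithm, whose details are not given here. It assumes each image can be represented as a mapping from file path to file content, treats a file as shared when the same path with identical content (compared by SHA-256 digest) appears in at least two images, and places all shared files in a single common layer. The names `file_key` and `deduplicate` and the sample images are hypothetical, and a real OCI layout would likely need more than one shared layer when different subsets of images share different files.

```python
import hashlib
from collections import defaultdict


def file_key(path: str, content: bytes) -> tuple[str, str]:
    """Identify a file by its path and the SHA-256 digest of its content."""
    return path, hashlib.sha256(content).hexdigest()


def deduplicate(images: dict[str, dict[str, bytes]]):
    """Split files into one shared layer plus a remainder layer per image.

    `images` maps an image name to {file path: file content} (an assumed,
    simplified stand-in for unpacked container images).
    """
    owners = defaultdict(set)  # file key -> names of images containing it
    sizes = {}                 # file key -> size in bytes
    for name, files in images.items():
        for path, content in files.items():
            key = file_key(path, content)
            owners[key].add(name)
            sizes[key] = len(content)

    # Files present in two or more images go into the common layer.
    shared = {key for key, names in owners.items() if len(names) > 1}
    # Everything else stays in a layer specific to its image.
    unique = {
        name: {file_key(p, c) for p, c in files.items()} - shared
        for name, files in images.items()
    }

    # Storage before: every image stores all of its files.
    before = sum(len(c) for files in images.values() for c in files.values())
    # Storage after: shared files are stored once, unique files once each.
    after = sum(sizes[k] for k in shared) + sum(
        sizes[k] for keys in unique.values() for k in keys
    )
    return shared, unique, before, after


# Hypothetical two-image deployment sharing one library file.
images = {
    "web": {"/lib/libc.so": b"A" * 100, "/app/web": b"W" * 50},
    "db": {"/lib/libc.so": b"A" * 100, "/app/db": b"D" * 70},
}
shared, unique, before, after = deduplicate(images)
print(f"{before} -> {after} bytes")  # prints "320 -> 220 bytes"
```

In this toy deployment the shared library is stored once instead of twice, so the stored bytes drop from 320 to 220. The savings reported above come from tests on real multi-container systems and will differ from this illustration.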

This project focused on technology that supports the Open Container Initiative (OCI) standard because the DoD aims to avoid vendor lock-in and leverage OCI-compliant containers. Additionally, this project has the potential to accelerate the SEI’s impact by open sourcing minimization algorithms to gain wider interest and adoption from industry and the DoD community.