# Software Engineering Institute

## Improving Data Quality Through Anomaly Detection


Organizations run on data. They use it to manage programs, select products to fund or develop, make decisions, and guide improvement. Data comes in many forms, both structured (tables of numbers and text) and unstructured (emails, images, sound, etc.). Data is generally considered high quality if it is fit for its intended uses in operations, decision making, and planning. This definition implies that data quality depends both on the subjective perceptions of the people who work with the data and on objective measurements of the data set in question. This post describes the work we're doing with the Office of Acquisition, Technology and Logistics (AT&L), a division of the Department of Defense (DoD) that oversees acquisition programs and is charged with, among other things, ensuring that the data reported to Congress is reliable.

The problem with poor data quality is that it leads to poor decisions. This problem has been well documented by many researchers, notably by Larry English in his book Information Quality Applied. According to a report released by Gartner in 2009, the average organization loses \$8.2 million annually because of poor data quality. The annual cost of poor data to U.S. industry has been estimated to be \$600 billion. Research indicates the Pentagon has lost more than \$13 billion due to poor data quality.

Data quality is a multi-dimensional concept and the international standard data quality model identifies 15 data quality characteristics, including accuracy, completeness, consistency, credibility, currentness, accessibility, compliance, confidentiality, efficiency, precision, traceability, understandability, availability, portability, and recoverability. In our data quality research, we have been focusing on the accuracy attribute of data quality. Within the ISO model, accuracy is defined as the degree to which data has attributes that correctly represent the true value for the intended attribute of a concept.

Ensuring data quality is a multi-faceted problem. Errors in data can be introduced in multiple ways. Sometimes it's as simple as mistyping an entry, but more complex organizational factors also lead to data problems. Common data problems include misalignment of policies and procedures with how the data is collected and entered into databases, misinterpretation of data entry instructions, incomplete data entry, faulty processing of data, and errors introduced while migrating data from one database to another.

A number of software applications have been introduced in recent years to address data quality issues. Gartner estimates that the number of software tools available for data quality has grown by 26 percent since 2008. The bulk of these applications, however, focus on customer relationship management (CRM) data, materials data, and financial data. The types of errors they are intended to find and correct include missing data, incomplete data, inconsistent data, character mismatches, and duplicate records. As part of our research, we are going beyond these basic types of data checks, using statistical, quantitative methods to identify data anomalies that are not addressed by current off-the-shelf data quality software tools.
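For contrast with the statistical methods discussed in the rest of this post, the basic record-level checks mentioned above (missing values and duplicate records) can be sketched in a few lines of Python. This is only an illustration: the function name `basic_checks`, the field names, and the sample records are hypothetical and are not drawn from any commercial tool or from the DoD data.

```python
def basic_checks(records, required_fields):
    """Return (indices of rows with missing fields, indices of duplicate rows)."""
    missing, duplicates, seen = [], [], set()
    for i, rec in enumerate(records):
        # A required field counts as missing if it is absent, None, or empty.
        if any(rec.get(f) in (None, "") for f in required_fields):
            missing.append(i)
        # Two rows are duplicates if every field/value pair matches exactly.
        key = tuple(sorted(rec.items()))
        if key in seen:
            duplicates.append(i)
        seen.add(key)
    return missing, duplicates


# Hypothetical records for illustration only.
records = [
    {"id": "A1", "month": "2011-01", "cost": 100.0},
    {"id": "A1", "month": "2011-02", "cost": None},
    {"id": "A1", "month": "2011-01", "cost": 100.0},
]
missing_rows, duplicate_rows = basic_checks(records, ["id", "month", "cost"])
```

Checks like these look at each record in isolation. They cannot tell whether a value that is present and well formed is plausible, which is the gap that statistical anomaly detection is meant to fill.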

The data anomalies our research aims to expose include cost estimates and performance values that are unusual compared with the other values in the same time series. These unusual values are considered outliers and are tagged as anomalies.
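To make the idea of an outlier in a series concrete, here is a minimal Python sketch of one common technique, the modified z-score based on the median absolute deviation (MAD). It is offered as an assumption for illustration and is not necessarily one of the methods under evaluation in this research. The function name `flag_outliers`, the cutoff of 3.5 (a commonly cited rule of thumb), and the sample figures are all hypothetical.

```python
from statistics import median


def flag_outliers(series, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = median(series)
    abs_dev = [abs(x - med) for x in series]
    mad = median(abs_dev)
    if mad == 0:
        # More than half the values are identical, so the MAD gives no
        # usable scale; this simple sketch flags nothing in that case.
        return []
    # 0.6745 rescales the MAD so that scores are comparable to ordinary
    # z-scores when the data are roughly normal.
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Hypothetical monthly cost figures (illustrative only, not program data).
monthly_cost = [102.0, 98.5, 101.2, 99.8, 100.4, 187.0, 100.9, 99.1]
anomalies = flag_outliers(monthly_cost)  # flags index 5, the 187.0 entry
```

The median and MAD are used here instead of the mean and standard deviation because a single extreme value inflates the latter and can mask itself. A flagged value is only a candidate for review; whether it is an actual defect still has to be established separately.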

A data anomaly is not necessarily the same as a data defect. A data anomaly might be a data defect, but it might also be accurate data caused by unusual, but actual, behavior of an attribute in a specific context. Root cause analysis is typically required to resolve the cause(s) of data anomalies. We are working with our DoD collaborators on the resolution process to determine if the anomalies detected are actual data defects.

Our research is analyzing performance data submitted by DoD contractors in monthly reports about aspects of high-profile acquisition programs, including cost, schedule, and technical accomplishments on a project or task. Some methods that we are evaluating include

These approaches to anomaly detection are being compared and contrasted to determine which specific methods work best for each earned value management (EVM) variable we are studying.

Our data quality research complements our recent work on the Measurement and Analysis Infrastructure Diagnostic (MAID), an evaluation method that helps organizations understand the strengths and weaknesses of their measurement systems. MAID is broader in scope than our current research. It recognizes that data is part of a life cycle that runs from sound definition and specification through collection, storage, analysis, packaging (for information purposes), and reporting (for decision-making). The integrity of data can be compromised at any of these stages unless policies, procedures, and other safeguards are in place.

Our research thus far has found a number of methods that have been effective for identifying anomalies in the EVM data. Our work will culminate in a report that we plan to publish by the end of 2011. With support from AT&L, we're hoping these methods will identify problems in the data they receive and report, ultimately leading to better decisions made by government officials and lawmakers.

www.sei.cmu.edu/measurement/

To read the SEI technical report, Issues and Opportunities for Improving the Quality and Use of Data in the Department of Defense, please visit
www.sei.cmu.edu/library/abstracts/reports/11sr004.cfm

To read the SEI technical report, Can You Trust Your Data? Establishing the Need for a Measurement and Analysis Infrastructure Diagnostic, please visit
www.sei.cmu.edu/library/abstracts/reports/08tn028.cfm

To read the SEI technical report, Measurement and Analysis Infrastructure Diagnostic, Version 1.0: Method Definition Document, please visit
www.sei.cmu.edu/library/abstracts/reports/10tr035.cfm

To read the SEI technical report, Measurement and Analysis Infrastructure Diagnostic (MAID) Evaluation Criteria, Version 1.0, please visit
www.sei.cmu.edu/library/abstracts/reports/09tr022.cfm