Posted on by Software and Information Assurancein
By Mark Sherman
Technical Director, Cyber Security Foundations
Since its debut on Jeopardy in 2011, IBM's Watson has generated a lot of interest in potential applications across many industries. I recently led a research team investigating whether the Department of Defense (DoD) could use Watson to improve software assurance and help acquisition professionals assemble and review relevant evidence from documents. As this blog post describes, our work examined whether typical developers could build an IBM Watson application to support an assurance review.
Foundations of Our Work
The team of researchers I work with at CERT--in addition to myself, the team includes Lori Flynn, and Chris Alberts--focuses on acquisitions and secure software development. We thought general assurance would be an appropriate area to test the cognitive processing of IBM Watson . Assurance often involves taking vast amounts of information--such as documentation related to mission, threats, architecture, design, operations, and code analytics--and examining them for evidence suggesting a particular question can be answered (e.g., is there a high level of confidence that this software module is free from vulnerabilities?).
Contracting offices and program managers review acquisition documents and artifacts for information related to risk. As software changes, these reviewers struggle to find assurance information related to changes in risk. Our aim was to see if we can feed Watson a large number of acquisition-related documents and have it use the information contained in those documents in a question-and-answer application.
We were also interested in determining whether this type of application could be developed with skill sets one would find in a typical organization that does software development. Our preliminary results suggested the following:
Our aim with this project was to simulate a development process for building a Watson application using the skill set that would most likely be found in a federal agency or organization. We wanted to test a hypothesis that a development team did not require members who had advanced research experience in cognitive systems.
We focused our assurance questions around information generated during source code evaluation. Specifically, we focused on building a corpus of CERT Secure Coding Standards and MITRE's Common Weakness Enumerations (CWEs). The intent of our work was to answer a specific question about technology maturity. Moreover, our intent was to create a program and train Watson to continually ingest new information and respond to queries based on that information. To support continuous corpus updating and training, we needed to automate the preprocessing of standards and enumerations, and the creation of training questions and answers.
We assembled a team of CMU computer science students that included Christine Baek, Anire Bowman, Skye Toor, and Myles Blodnick. We picked team members who were experienced in Python, which we used to build the application, but we did not seek any special skills in natural language processing or artificial intelligence.
There were two phases to our work:
Preparatory phase. Three subject matter experts spent two weeks creating the strategy for organizing the information to be retrieved from the corpus. They also determined the types and subject focus of questions that a user would ask.
In this phase, we relied on the knowledge and experience of three SEI researchers with experience in secure coding and assurance. During this phase, our experts determined that we should focus our assurance questions around information gathered during source-code evaluations.
Our experts also developed sample training questions and answers, and they documented a fragmentation strategy for breaking rules and enumerations into answer-sized chunks that could be processed by Watson. These specifications drove the automation built by the development team.
Development phase. This phase took 11 weeks. Student developers built the application and the corpus, which included about 400 CERT rules and 700 CWEs.
Developing the corpus iterated through the following steps in each sprint:
Our project relied on the Watson Retrieve and Rank service, hosted on BlueMix.
We evaluated our application against the following factors:
To help evaluate the team's progress, we were lucky to engage Dr. Eric Nyberg, who was part of the original Jeopardy Watson project. Dr. Nyberg is a professor in Carnegie Mellon University's School of Computer Science and director of the Master of Computational Data Science Program. Dr. Nyberg provided helpful feedback on our automated training strategy and the use of human trainers to tune Watson. He suggested testing methods that we used, and others we hope to use in the future. He and his graduate student Hugo Rodrigues provided insight into natural language processing and artificial intelligence, along with good research paper references about the current state of automated question variation and generation research.
Example of a Query Application
As an example, we presented our coding-rule question-and-answer application with the following query:
What is the risk of INT33-C?
As illustrated in the figure below, with respect to recall, all six-of-six relevant answers were returned, so recall was good. On the precision front, the exact subtext describing risk was returned as the best document. All the documents our application returned were related to INT33-C but the cost of fixing the violation and the rule title are not related to the risk, so one precision metric is 6/8.
We then submitted that same query to Google, which returned an entire document with no subcomponents. Recall was spread across irrelevant documents, and the excerpt was imprecise. While not strictly an apples-to-apples comparison, the exercise provided evidence of the utility of using cognitive processing, in general, and of Watson, in particular.
Challenges and Lessons Learned
As stated earlier, we had a hard time achieving precision in developing our question-and-answer application. The following is a summary of our lessons learned on this project:
Wrapping Up and Looking Ahead
Having built our question-and-answer application, we found the corpus to be the most interesting artifact of our work. The government can acquire our corpus and our tools for any application. We have also licensed our database and supporting automation tools to SparkCognition in Austin, Texas, for use in their Watson-based applications.
The project expanded and deepened our skills in building cognitive processing and Watson applications. Watson, as well as the capabilities of other cognitive processing tools, continue to evolve. With the help of appropriate subject matter experts, the technology can be put into practice to solve DoD needs. Organizations interesting in collaborating with us should send an email to email@example.com.
We wish to acknowledge and thank IBM and SparkCognition for discussions on our approach to Watson application development.
We welcome your feedback on our work in the comments section below.
View a video of Mark Sherman discussing his work at the 2016 SEI Research Review.