Automated Detection of Information Leaks in Mobile Devices

Exfiltration of sensitive data on mobile devices is a major concern for the DoD, other organizations, and individuals. Colluding apps in public use have been discovered by security researchers. The Mobile App Collusion attack, which spread across thousands of Android packages, is an example. Colluding apps, or a combination of a malicious app and leaky app, can use intents (messages sent to Android app components) to extract sensitive or private information from an Android phone. This blog post details our work to more precisely detect (i.e., with significantly fewer false positives) malicious exfiltration of sensitive information from an Android phone (even across multiple components), in a practical time and memory bound. In doing this work, we developed a new method for the broader class of problems, not limited to Android, involving information flow analysis for software systems that communicate by message passing: modular analysis with parameterized summaries of flow of sensitive information.

Foundation of Our Work

The SEI's CERT Secure Coding Team does software security research, including development of new static analyses. The team occasionally helps organizations to analyze Android app security. In early 2014, we designed and implemented DidFail, a novel tool that was the first static analysis (tool and algorithm) to analyze potential tainted data flows across multiple Android apps. DidFail's two-phase design allows it to provide remarkably fast responses to user queries, as described below.

DidFail enables an organization, such as an enterprise, app store, or security system provider, to pre-analyze apps so that the analysis for potential dataflow problems within the set of apps on the phone is fast when a user requests to install a new app. Phase 1 of DidFail is performed on each application separately and, once completed, does not need to be run again. Phase 1 analysis of apps can be done in parallel on all apps in an app set, or sequentially (e.g., as part of the process of accepting each app into an app store). Phase 2 of DidFail, which focuses onapp set analysis, typically takes only seconds. Envisioned use within an enterprise is described in our earlier blog post in section "DidFail Practicality at an Enterprise Level."

DidFail looks for a problem common to many dataflow analyses: a leak of sensitive information from a sensitive source to a restricted sink. We define source as an external resource (external to the app, not necessarily external to the phone) from which data is read. We define sink as an external resource to which data is written. Information read from a sensitive source is said to be tainted with respect to that source.

Although the original version of DidFail successfully found potential inter-app taint flows, it was not yet precise enough for practical use. In particular, it produced too many false positives, where a false positive is a warning about a potential flow that a manual examination can determine is not actually possible. For such a static analysis tool to be practical, the tool must be fast, require only affordable and feasible memory and hard drive space, and it must be precise, because organizations cannot afford a lot of expensive engineer time trying to manually determine if each taint flow warning is a true or false positive.

Since DidFail's original publication, more tools and algorithms have been published that do static analyses of taint flows across Android app sets. The other tools also pose practical-use problems, because of imprecision (like DidFail) or because of failures due to taking too much time and memory.

Our work --in addition to myself and William Klieber, the team includes William Snavely, a software engineer also in the CERT Division, and Michael Zheng, an SEI student intern from the University of Pittsburgh--focused on getting closer to the goal of a practical app set taint flow analysis. In particular, our work focuses on significantly improving the precision of the DidFail analysis, while retaining its modular flow analysis and thus its practical qualities of speed, relatively low memory use, and relatively low disk space use.

An Approach for Increased Precision

To address the need for speed, we retained DidFail's modular analysis. We added context and therefore precision to the analysis with parameterized summaries of potential data flows. We added Boolean formulas to DidFail's flow equations, to record conditions of control flow paths relevant to possible taint flows, as shown in Figure 1 below.

Figure 1: Per-component Flow Equations, Now with Boolean Formulas

In the first phase of DidFail, we create a summary of each component in the Android app. This summary consists of the set of tainted flows found possible in the component, counting received intents as sources and counting sent intents as sinks. We developed a new, more precise type of summary than DidFail used before, that indicates that a flow is possible only if the intents received have certain properties.

For example, if a certain flow occurs only when the component receives an intent with a particular action string, then our summary expresses this fact. (In the examples shown below, we only use this method to make flows involving action strings more precise, but the method is easily extensible for use with flows involving additional intent fields.) The summary maps each possible flow to a formula in propositional logic that indicates under what conditions the flow is possible. The atomic propositions in the formula consist of string equality between a property of the incoming intent and a string constant that occurs in the program text.

In previous versions of DidFail, Boolean formulas were not part of the phase-1 flow equations, because previously we did not gather data about in-component control flow dependent on values of the incoming intent's fields. To reduce false-positive taint flows detected, we reduce possible taint flows that can be joined in phase 2 by using this control-flow path information in both phase 1 and phase 2.

In our current work, we add a Boolean formula if our static analysis detects that control flow on a possible taint flow path contains a conditional branch on a value of a received intent's field. As shown in all the figures in this post, we add this Boolean formula only for action strings in intents. However, this method is easily extensible to other intent fields.

In Figure 1, the flow equation outlined in blue describes a possible taint flow in pseudocode on the left side highlighted in blue. The part of the flow equation before the comma was gathered in previous versions of DidFail. It indicates that there is a flow across component "C2" (the string above the arrow denotes the component ID), with the source to the left of the arrow (a received intent named "irÂ1") and the sink to the right of the arrow (a sent intent named "s2"). The conditional expression in this pseudocode checks if the received intent's (ID "1") action string is equal to "foo", and the possible taint flow in the blue-highlighted line below is only possible if that field indeed was "foo". The possible taint flow highlighted in blue involves sending an intent with the possibly-tainted data field from the received intent (ID "1").

The pseudocode in the else block outlined in pink is only reached if this condition is false, meaning the action string was not equal to "foo". In that case, tainted data could potentially flow from the camera to a sent intent (a sent intent is a sink). The flow equation outlined in pink expresses that information, including the Boolean expression indicating that the action string must not equal "foo".

Data for these flow equations are determined and stored during phase 1 of DidFail. Compared to the number of flow equations from previous versions of DidFail, the new version of DidFail (which we call "Precise-DidFail") will usually produce at least as many flow equations for phase 1. In the depiction of phase 1 analysis in Figure 2, Precise-DidFail shows more arrows for phase 1 flows because the figure differentiates flows detected (with different colors) involving the same source and sink across the same component if they have different Boolean expressions.

In phase 2 of DidFail, we analyze "full taint flows"--a taint flow calculated across all apps in an app set, where neither the first source nor the final sink is an intent. In Figure 2, Precise-DidFail detects fewer "full taint flows" than previous versions of DidFail for the same example app set.

Figure 2: Comparison of previous analysis (FY14 Analysis) to new analysis

Reachability of sinks alone provides some benefit for increased precision. We plan to add greater precision to the taint analysis as follows: For each source, we will represent the taintedness of each variable (or, in general, each access path) as a formula in propositional logic. The code in Figure 3 shows a situation where this representation would be useful. The variable x is tainted if and only if ("iff") the incoming intent's action is "foo".

Figure 3: Variable x tainted iff incoming intent's action is "foo"

A Broader Impact

In the next phase of our project we plan to test our approach against popular Android apps, plus known malicious or leaky apps. If tainted flows are discovered, a user might refuse to install an application. Likewise, an app store might remove an app.

While our work applies the use of our method to Android security, we also realized its applicability can be extended on a broader scale. The method that we have refined makes our approach applicable for a class of problems involving taint flow analysis for software systems that communicate by message passing using intents. A modular analysis with parameterized summaries of flow of sensitive information can now be applied to systems that pass messages by connections to sockets (e.g., an https connection to a particular port) or messages passed using multiple threads.

We are currently writing a conference paper about this work. Previous versions of DidFail are downloadable for free from the CERT Secure Coding webpage for DidFail.

Envisioned Precise-DidFail Use

When a user requests to install a new Android app, a security system such as the DoD's Mobile Device Management (MDM) system would use our analysis to determine if the app can be installed without violating policies. Similarly, if policies were updated then on-device app set potential data flows could be checked against blacklists or whitelists the policies describe. Users would be provided with a user-friendly set of options if they would be required to make a choice between apps to keep. The DoD plans to improve its mobile device management, including more thorough software testing and policy definition and enforcement, and Precise-DidFail has the potential to help with that in a practical way.

Our descriptions of the algorithm for Precise-DidFail do not yet include all the enhancements to precision that can be added, using the core techniques we developed. Boolean formulas can provide context related to conditional flows for other intent fields, in addition to action strings. Practical use of this analysis should include implementation of the method for many intent fields, beyond action strings.

Additional Resources

Download DidFail from the CERT Secure Coding website.

Read the SEI technical report Making DidFail Succeed: Enhancing the CERT Static Taint Analyzer for Android App Sets.

Software Engineering Institute

SEI Blog