According to Gartner, Inc., a leading information technology research and advisory company, Big Data is defined as “high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” I submit that all organizations that collect, process, and analyze Big Data face similar insight and decision-making challenges. This is especially true in cyber. In light of the recent Office of Personnel Management (OPM) and Internal Revenue Service (IRS) data breaches, organizations are rapidly looking for innovative Big Data analysis and analytic development methods to extract value and actionable intelligence from their data. That value and actionable intelligence drives the planning, development, and deployment of effective mitigations. However, trying to meet these challenges with thought-homogeneous teams (little to no variety of opinions, expertise, or perspectives) will result in (1) a reduced ability to counter or neutralize adversaries’ efforts to deny, degrade, and destabilize computer networks, and (2) an inability to plan, develop, and deploy mitigations that matter.
According to a 2013 Deloitte study, cultivating diversity of thought on analysis teams can boost innovation and creative problem solving. Francis Anscombe’s seminal 1973 paper, “Graphs in Statistical Analysis,” offers important lessons in how analysts should approach Big Data analysis. The combination of thought diversity and a visualize-first, analyze-second (VFAS) approach can make Big Data analysis a more valuable investment for any organization.
Anscombe constructed four fictitious data sets with nearly identical simple statistical properties that nonetheless appear very different when graphed. Each data set consists of eleven (x, y) points, shown in the table below. The summary statistics for each data set are close to identical:
mean x value = 9
mean y value = 7.50
correlation between x and y = 0.816
variance of x = 11
variance of y = 4.12
trend line equation: y = 0.5x + 3
Based on these summary statistics alone, analysts might conclude that the data sets, while numerically different, exhibit the same statistical behavior and therefore must describe the same actual behavior. However, as shown in Figure 1 below, Dataset I shows a linear relationship between x and y, while Dataset II shows a strong non-linear relationship between x and y; the latter graph indicates that nonlinear regression may have been the proper tool to use. Dataset III shows a linear relationship between x and y except for a large outlier, while Dataset IV shows x remaining constant except for an outlier. This “quartet” shows that things are not always what they seem.
While performing analysis on Big Data, analysts often provide summary statistics, e.g., the mean, variance, correlation, and trend lines, to see what patterns emerge. Summary statistics are extremely useful because they allow analysts to describe Big Data with just a few numbers. One could also argue that summary statistics allow decision makers to assess risk. Or do they? Well, no, not really. Anscombe makes the point that analysts should visualize their data before applying any analysis tools. This matters in the context of Big Data because different visualizations may suggest competing or alternative hypotheses and, one hopes, with thought-diverse analysis teams, inspire diversity of thought.
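Visualizing first need not require heavyweight tooling. As a minimal stdlib-only sketch, even a crude text scatter of Anscombe’s Dataset IV immediately exposes what the summary statistics hide: every x value is 8 except a single outlier at (19, 12.5).

```python
# Anscombe's Dataset IV: x is constant at 8 except for one point at (19, 12.5).
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

# Crude text scatter: rows are rounded y values, columns are x values.
grid = [[" "] * 21 for _ in range(14)]
for x, y in zip(x4, y4):
    grid[round(y)][x] = "*"

for row in reversed(grid):  # print the highest y values at the top
    print("".join(row))
```

The output shows a vertical column of points at x = 8 plus one isolated point, a pattern no mean, variance, or correlation coefficient would suggest.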
Thought diversity recognizes that an individual’s thought processes are derived from their unique experiences and therefore yield unique perspectives on situations. It is important to note that cultural and ethnic diversity can spawn thought diversity. By assembling teams with varying subject matter expertise and analytic approaches, experts can rely on their intuition and divergent perspectives. Having a cyber intelligence analyst who understands malware behavior working side by side with a political scientist who understands open source intelligence and a data scientist who can glean adversary tradecraft from advanced analytic techniques can produce behavior-enriched insights. Figure 2 below shows a fictitious characterization of the number of correctly classified (via a statistical classifier) malicious-actor malware samples collected at the beginning of 2015. A statistician may not be able to explain the gap between mid-January and mid-February; a malware analyst may surmise that the gap reflects a period in which the adversary refined its malware tradecraft before redeploying it. A political scientist or intelligence analyst, however, may notice that this period coincides with a Lunar New Year celebration, during which malware attacks would likely decrease.
The point of this illustration is that Anscombe’s Quartet tells us to step back and look at our Big Data (graph and visualize it) in its raw state before applying any advanced analytic tools or capabilities, and to observe the patterns or behaviors that naturally emerge. This allows thought-diverse analysis teams to provide objective perspectives on what is subjectively measured.