BigPanda formally launched a new data science platform to automate IT Incident Management. BigPanda’s platform analyzes the flood of alerts that IT teams face every day and clusters them into high-level incidents; it then automates the manual processes involved with detecting, investigating and collaborating around every IT incident. This enables companies to resolve IT issues faster and minimize their impact on customers and revenue.

Data centers have changed dramatically in the last decade. IT and DevOps teams are struggling with traditional approaches to Incident Management that have not kept pace with those changes.

Two changes, in particular, have been a major source of pain:

  1. Data Centers have Exploded in Scale and Complexity due to the Cloud and Virtualization. As the moving parts have multiplied, so has the number of IT incidents that require immediate attention. The average BigPanda user has thousands of daily alerts, and that number is growing exponentially. Today, that mountain of alerts must be manually detected, investigated and managed by IT and DevOps teams, which has turned into a major drain on time, people and efficiency.
  2. Data Center Monitoring has become Highly Fragmented. Companies are shifting away from monolithic data center monitoring vendors, like HP and IBM, towards using multiple tools such as Splunk, New Relic, Nagios, Zabbix and Pingdom. Companies use on average five different monitoring tools, none of which speak the same language. When IT incidents occur, correlating alerts and connecting the dots between all those fragmented tools is a time-consuming and error-prone task.

Despite these changes, companies continue to use Incident Management solutions that have not evolved to meet these new challenges. Legacy solutions focus on helping teams to organize and track their activities, but leave the heavy lifting of detecting, investigating and collaborating on alerts up to individuals. That has left IT and DevOps teams struggling to keep up with new challenges arising from data center scale and fragmentation.

“The new generation of IT infrastructure requires a fundamentally different approach to Incident Management,” said Assaf Resnick, Co-Founder and CEO of BigPanda. “We believe that only through leveraging data science can IT teams tackle the scale of machines, events and dependencies that must be understood and managed. That’s why we founded BigPanda.”

BigPanda’s core innovation lies in its data science approach, which automates the time-consuming processes involved in responding to IT issues. It does this through a SaaS platform that aggregates and normalizes alerts from leading monitoring systems, such as New Relic, Nagios and Splunk, as well as home-built monitoring solutions, and then applies powerful data science algorithms to take the heavy lifting out of Incident Management. BigPanda:

  • Consolidates Noisy Alerts: BigPanda automatically clusters the daily flood of alerts into high-level incidents, so IT can quickly see critical issues without having to dig.
  • Correlates Alerts and Changes: BigPanda correlates IT incidents with the code deployments and infrastructure changes that may have caused them, so IT and DevOps teams have instant access to the data they need to make smart decisions quickly.
  • Streamlines Collaboration: BigPanda makes it easy to notify the right people and keep everyone updated on incident status, notes, activities, metrics, and more. It syncs seamlessly with ServiceNow, JIRA and Remedy, which frees IT from having to manually manage tickets and keep them up-to-date.
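BigPanda has not published its clustering algorithms, but the basic idea of consolidating noisy alerts into incidents can be sketched with a simple time-window heuristic. This is a hypothetical illustration, not BigPanda's actual method; the field names and data are invented:

```python
from collections import defaultdict

def cluster_alerts(alerts, window_seconds=300):
    """Group raw alerts into incidents: alerts on the same host that
    arrive within `window_seconds` of each other are merged.
    `alerts` is a list of dicts with 'host', 'check', 'timestamp'."""
    incidents = []
    by_host = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        by_host[alert["host"]].append(alert)
    for host, host_alerts in by_host.items():
        current = [host_alerts[0]]
        for alert in host_alerts[1:]:
            if alert["timestamp"] - current[-1]["timestamp"] <= window_seconds:
                current.append(alert)
            else:
                incidents.append({"host": host, "alerts": current})
                current = [alert]
        incidents.append({"host": host, "alerts": current})
    return incidents

alerts = [
    {"host": "web-1", "check": "cpu", "timestamp": 100},
    {"host": "web-1", "check": "latency", "timestamp": 160},
    {"host": "web-1", "check": "disk", "timestamp": 5000},
    {"host": "db-1", "check": "replication", "timestamp": 120},
]
incidents = cluster_alerts(alerts)
print(len(incidents))  # 3: two web-1 incidents, one db-1 incident
```

A production system would cluster on far richer signals (alert text similarity, topology, deployment history), but even this toy version turns four raw alerts into three reviewable incidents.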

“Any modern Ops environment at scale will hit the pain points BigPanda is solving. There’s a strong need for this product,” said Kevin Park, Head of Tech Ops and IT at Dropbox.

Pricing and Availability
BigPanda is available for a free 30-day trial. A lite version is available for free; pricing starts at $1,500 per month for a company-wide license.




Fortscale is officially introducing its innovative flagship product, which helps enterprise security analysts identify user-related threats, including malicious insiders, compromised accounts, suspicious behavior and risky access to data, by mining Big Data repositories with user behavior analytics. Using a state-of-the-art Big Data analytics and machine learning approach to cyber security, Fortscale’s solution leverages SIEM log repositories and adds an enrichment layer that profiles user behavior and provides investigation capabilities to solve these specific challenges.

“As we have recently seen, user-related threats, whether they are insiders such as Snowden or compromised users that were the likely vehicles of breaches at Home Depot and Target, continue to grow at an alarming rate,” said Idan Tendler, CEO & Co-Founder, Fortscale. “Our user behavior analysis makes enterprise security teams analytics savvy to help them discover, identify and investigate compromised users, malicious insiders and questionable users that are likely to commit malicious activities. With Fortscale, security analysts now have all the information necessary to remediate user-related threats.”

Key features of the Fortscale solution include:

Sophisticated Machine Learning Algorithms: Discovering patterns and high-risk user behavior without pre-defined rules, heuristics or thresholds.

Analyst-Friendly Toolbox Interface: Facilitating a proactive, efficient investigation process using analytics package sets, canned reports and dashboards that expedite context-based prediction of risky user behavior.

Advanced Visualization Tools: Discovering potential cyber threats and visualizing prioritized leads for analysts in an informative way, which can then be further refined using the analysts’ input.

Dynamic Analytics Environment: Easy customization and expansion of information sources, integration to various security products and modification of reports and dashboards.

Multi Platform: Scalable integration with an enterprise’s big data repository or SIEM systems, powered by a robust Hadoop environment.
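Fortscale's algorithms are proprietary, but the general idea of profiling user behavior from history rather than from hand-written rules can be sketched as follows. This toy example learns each user's own login-hour baseline and scores new logins by deviation from it; the users, hours, and scoring are invented for illustration, and a real system would learn far richer behavioral features:

```python
from collections import defaultdict
import math

def build_profiles(events):
    """Learn each user's baseline login hours from historical events.
    `events` is a list of (user, hour) tuples."""
    hours = defaultdict(list)
    for user, hour in events:
        hours[user].append(hour)
    profiles = {}
    for user, hs in hours.items():
        mean = sum(hs) / len(hs)
        var = sum((h - mean) ** 2 for h in hs) / len(hs)
        profiles[user] = (mean, math.sqrt(var))
    return profiles

def risk_score(profiles, user, hour):
    """Score a new login by its deviation from the user's own history."""
    mean, std = profiles[user]
    return abs(hour - mean) / (std or 1.0)

history = [("alice", h) for h in (9, 9, 10, 10, 8, 9, 10)]
profiles = build_profiles(history)
print(round(risk_score(profiles, "alice", 3), 1))   # large: a 3 a.m. login is unusual for alice
print(round(risk_score(profiles, "alice", 9), 1))   # small: within her normal hours
```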

Fortscale supports multiple use cases including discovering targeted attacks that leverage compromised user credentials, identifying rogue users and profiling malicious users’ access to data. Fortscale customers have benefitted from predictive intelligence capabilities, the ability to evaluate and mitigate risks, obtaining fast results and achieving an improved ROI on their existing SIEM investments.

Among Fortscale’s customers is Playtech, an international designer, developer and licensor of software for the online, mobile, TV and land-based gaming industry.

“Since deploying Fortscale’s solution, our security team has achieved better visibility and a deeper understanding of user behavior within our network,” said Jochanan Sommerfeld, CIO, Playtech. “Fortscale enriches our existing SIEM system with user behavioral analytics and enhances our security analysts’ capabilities and overall effectiveness.”



Prelert, the anomaly detection company, and Alert Logic, a leading provider of Security-as-a-Service solutions for the cloud, have announced an OEM partnership enabling Prelert’s machine learning analytics to be included in Alert Logic’s Security-as-a-Service solutions. This agreement enhances Alert Logic’s ability to detect threats that are designed to bypass traditional signature-based approaches.

Alert Logic’s Security-as-a-Service platform keeps data and infrastructure safe and compliant wherever it resides – including public and private clouds, hybrid environments or on-premises – through a set of fully managed products and services. The company maintains partnerships with the largest cloud and hosting service providers and offers its customers continuous protection down the application stack through a 24×7 Security Operations Center that analyzes, escalates and works with customers to remediate threats with actionable intelligence.

“Integrating Prelert’s anomaly detection engine into our big data platform creates a powerful combination of security analytics techniques, allowing us to identify unknown and advanced threats across petabytes of machine data we manage for our customers,” said Alert Logic’s Chief Strategy Officer, Misha Govshteyn. “Our objective has always been to help our customers respond to the most relevant security incidents before they impact their business. Working with Prelert allows us to leverage massive amounts of machine data we process every day to identify precursors to security breaches at the earliest possible moment and maintain our historically high degree of accuracy, even when advanced attackers employ sophisticated tactics to avoid detection.”

Prelert’s Anomaly Detective engine uses advanced analytics based on unsupervised machine learning to process and cross-correlate millions of data points in real-time, automatically learning normal behavior patterns and identifying statistical outliers that may indicate successful breaches and data exfiltrations. In May 2014, Prelert opened its API giving enterprise application developers, technology vendors and cloud service providers such as Alert Logic the ability to utilize its machine learning engine in their products and environments.
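Prelert's internals are not described here, but an unsupervised, streaming outlier detector in the same spirit can be sketched with Welford's online algorithm: learn the running mean and variance of a metric from the stream itself and flag points far outside it. This is a minimal stand-in for illustration, not Prelert's actual engine:

```python
import math

class StreamingAnomalyDetector:
    """Online mean/variance (Welford's algorithm): learns 'normal'
    from the stream itself and flags points far outside it."""
    def __init__(self, threshold=4.0):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0
        self.threshold = threshold

    def update(self, x):
        """Return True if x is anomalous relative to what has been seen,
        then fold x into the running statistics."""
        anomalous = False
        if self.n > 10:  # wait for a minimal baseline before judging
            std = math.sqrt(self.m2 / self.n)
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                anomalous = True
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous

detector = StreamingAnomalyDetector()
stream = [50, 52, 49, 51, 50, 48, 52, 50, 49, 51, 50, 49, 500]
flags = [detector.update(x) for x in stream]
print(flags[-1])  # True: 500 is far outside the learned baseline
```

Because the statistics update incrementally, the detector runs in constant memory per metric, which is what makes this style of analysis feasible on millions of real-time data points.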

“Security paradigms solely reliant on identifying already ‘known’ threats are proving inadequate when used against today’s advanced cybercriminals,” said Mark Jaffe, Prelert’s CEO. “As a result, leadership organizations are starting to aggregate data accumulated from security devices, web servers and network equipment, and then processing it with advanced machine learning analytics to identify suspicious activities that would otherwise go unnoticed.”



In recent years, the number of people and businesses falling victim to fraud has increased at an alarming rate, with targets including eBay, Target, Amazon, and even the U.S. government. As technology continues to improve, fraudsters are finding new and creative methods to continue their work.

Fraudulent practices targeting businesses are growing rapidly and are costing industries billions of dollars. In the insurance industry alone, $40 billion per year is lost to fraud. Direct theft of money is not the only cost: 36 percent of total fraud losses are directly related to business disruption and loss of productivity.

  • Health care fraud is estimated to cost over $500 billion.
  • Cybercrime in 2013 cost an average of $11.56 million per organization.

Fraud prevention requires innovative solutions that can both fight fraudulent practices and protect individuals and businesses from disruption and loss of productivity. To learn more about how data analysis can be utilized for fraud detection, check out the infographic below, brought to you by Stetson University’s online Master of Accountancy Program.



Prelert, the anomaly detection company, today announced the release of an Elasticsearch Connector to help developers quickly and easily deploy its machine learning-based Anomaly Detective® engine on their Elasticsearch ELK (Elasticsearch, Logstash, Kibana) stack.

Earlier this year, Prelert released its Engine API enabling developers and power users to leverage its advanced analytics algorithms in their operations monitoring and security architectures. By offering an Elasticsearch Connector, the company further strengthens its commitment to democratizing the use of machine learning technology, providing tools that make it even easier to identify threats and opportunities hidden within massive data sets.

Written in Python, the Prelert Elasticsearch Connector source is available on GitHub. This enables developers to apply Prelert’s advanced, machine learning-based analytics to fit the big data needs within their unique environment.
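The connector's actual source is on GitHub; purely as a sketch of the shape of the task, the following shows a time-range query body in Elasticsearch's query DSL and the flattening of a search response into plain records an analytics engine could ingest. The `@timestamp` field follows the common Logstash convention, and the canned response is fabricated for illustration:

```python
def build_search_body(since, until, size=1000):
    """Build an Elasticsearch query body for log documents in a time
    range, sorted oldest-first (the '@timestamp' field name follows
    the Logstash convention)."""
    return {
        "size": size,
        "query": {"range": {"@timestamp": {"gte": since, "lt": until}}},
        "sort": [{"@timestamp": "asc"}],
    }

def hits_to_records(response):
    """Flatten an Elasticsearch search response into plain records."""
    return [hit["_source"] for hit in response["hits"]["hits"]]

# A canned response in the shape Elasticsearch's search API returns:
response = {"hits": {"hits": [
    {"_id": "1", "_source": {"@timestamp": "2014-09-01T00:00:00Z", "bytes": 512}},
    {"_id": "2", "_source": {"@timestamp": "2014-09-01T00:01:00Z", "bytes": 9800}},
]}}
records = hits_to_records(response)
print(len(records))  # 2
```

The real connector additionally handles pagination, scheduling, and pushing the flattened records to the Anomaly Detective engine's API.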

“Prelert is dedicated to making it easier for users to analyze their data and drive real, actionable value from it,” said Mark Jaffe, CEO, Prelert. “The amounts of data that companies and organizations have these days are simply massive – too massive for humans to process and analyze. The release of our Elasticsearch Connector is the latest step toward making the analysis of large data sets possible, repeatable and valuable without a team of data scientists.”

Prelert’s Anomaly Detective processes huge volumes of streaming data, automatically learns normal behavior patterns represented by the data and identifies and cross-correlates any anomalies. It routinely processes millions of data points in real-time and identifies performance, security and operational anomalies so they can be acted on before they impact business.

The Elasticsearch Connector is the first connector to be officially released by Prelert. Additional connectors to several of the most popular technologies used with big data will be released throughout the coming months.

For more information and a free license of the Anomaly Detective engine and API, or for details on the Elasticsearch Connector, please visit Prelert’s website.



It’s a truism in big data that you can never have enough data. With the cost of storage declining to unprecedented levels, the mantra now is to store everything… just in case it becomes useful data tomorrow or years from now.

The problem with this approach, however, is that it assumes that the only cost of storing more data is the associated storage cost. Lost in the calculation is the difficulty of making sense of signal amidst ever increasing data noise. The more data we store, the harder it becomes to separate meaningful signal from meaningless noise.

Just ask the NSA.

So much data… what’s a spy to do?

Bill Binney recently resigned from the US National Security Agency (NSA), where he was a high-ranking official, mathematician, and codebreaker, after becoming disillusioned with the way the agency was gathering and using intelligence.

While Binney is a severe critic of the NSA’s spying on US citizens, one of his most potent critiques goes to the heart of big data:

“[T]he problem…[w]ith this bulk acquisition of data on everybody [is that the NSA has] inundated their analysts with data. Unless they do a very focused attack, they’re buried in information, and that’s why they can’t succeed.”

In other words, there’s so much data noise that it’s increasingly difficult to decipher any signal.

Noted statistician Nate Silver addresses this in his book The Signal and the Noise:

“If the quantity of information is increasing by 2.5 quintillion bytes per day, the amount of useful information almost certainly isn’t. Most of it is just noise, and the noise is increasing faster than the signal. There are so many hypotheses to test, so many data sets to mine — but a relatively constant amount of objective truth.”

As both Binney and Silver highlight, the bigger the haystack, the harder it is to find the needle. We make this task ever more difficult for ourselves by using Hadoop and other modern data technologies to create “unsupervised digital landfills,” as one Fortune 100 IT executive phrased it to me.
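Silver's point can be made concrete with a tiny simulation: generate purely random binary series and count how many happen to "agree" with a random target series often enough to look like signal. The amount of objective truth is zero throughout, yet the number of convincing-looking fake needles grows with the haystack (all parameters here are arbitrary):

```python
import random

def spurious_matches(num_series, length=20, agree_threshold=0.85, seed=42):
    """Count purely random series that agree with a random 'target'
    series at least `agree_threshold` of the time -- noise that
    looks like signal."""
    rng = random.Random(seed)
    target = [rng.randint(0, 1) for _ in range(length)]
    matches = 0
    for _ in range(num_series):
        series = [rng.randint(0, 1) for _ in range(length)]
        agreement = sum(a == b for a, b in zip(series, target)) / length
        if agreement >= agree_threshold:
            matches += 1
    return matches

small = spurious_matches(100)
large = spurious_matches(100_000)
print(small, large)  # the bigger the haystack, the more fake needles
```

With 100 series, chance agreement this strong is rare; with 100,000 series, it happens over a hundred times, even though every series is pure noise.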

Nate Silver on signal and noise

Not only does it become ever harder to glean insight from mountains of data, but we can also seduce ourselves into believing that more data necessarily translates into more truth. In fact, all data is always processed by highly biased beings. Our prejudices aren’t minimized by data.

If anything, they can be amplified by data, as Silver posits:

“[Big data] is sometimes seen as a cure-all, as computers were in the 1970s. Chris Anderson… wrote in 2008 that the sheer volume of data would obviate the need for theory, and even the scientific method….

“[T]hese views are badly mistaken. The numbers have no way of speaking for themselves. We speak for them. We imbue them with meaning…. [W]e may construe them in self-serving ways that are detached from their objective reality.”

Ultimately, more data doesn’t require less thinking, as some would suggest. We don’t magically find correlations in mountains of data. We have to search for them, so we must ask the right questions of our data.

The best data scientist is the one you already have

This is why Gartner analyst Svetlana Sicular is dead-on when she suggests that enterprises will find it easier to train employees on big data technologies like Hadoop and NoSQL rather than bring in a “mythical data scientist” who already knows such technologies but likely won’t know your business.

The hard part is figuring out the right questions to ask of your data, not how to use the technologies.

Which brings us back to the NSA. The agency may know which questions to ask of its data to figure out what citizens are doing with their time, but could it be that mass surveillance buries its analysts in so much noise that we are actually less susceptible to the NSA’s prying into our lives? Share your thoughts in the discussion thread below.



If there’s one “company” that’s doing big data right, it’s the NSA.

The U.S. spy agency receives over five billion data points regarding mobile phone locations all around the world every day. That’s a huge amount of information to store — and use. But unlike most commercial companies, whose data is just sitting there un-analyzed, the NSA is already making use of this information left and right.

Three specific programs are worth calling out from the NSA’s broad use of this mobile location information: Fast Follower, Happy Foot, and Co-Traveler. Those are the code names for three of the NSA’s analysis tools, as noted by the Washington Post, which published a leaked document from former NSA contractor Edward Snowden.

And yes, these programs might affect you. Although the NSA doesn’t have access to the audio of your phone conversations, it can learn a lot just from the metadata about your calls, the location data of your phone, and more. Through its own tapping skills and relationships with telecommunications companies and Internet companies, the NSA is able to put together a highly detailed picture about your personal relationships and where and when you’re with certain people.

All this data analysis might creep you out — particularly if you’re a terrorist. But more importantly, the NSA’s tools show how data-sorting tools and algorithms can make sense of an enormous pile of data.

Fast Follower

Fast Follower is a program the NSA created in order to watch for any foul play around U.S. case workers in foreign nations. These individuals are at risk of being followed when overseas, and the NSA took advantage of cell signals to make sure they were protected.

Your phone sends signals to nearby cell towers to let the network know where it is, and it jumps from tower to tower as you (and your device) move. The NSA collects that information and then looks for signals from devices around you that show up in the same place, or connect to the same cell towers as you, multiple times. Through this it can determine whether one of its case workers is being followed.
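The kind of co-occurrence analysis described above can be sketched in a few lines. The data and logic here are entirely hypothetical, not the NSA's actual tooling: count how often each other device appears at the same tower in the same time slot as the protected person's phone, and surface repeat co-occurrences:

```python
from collections import Counter

def likely_followers(sightings, target, min_cooccurrences=3):
    """Find devices seen at the same tower in the same time slot as
    `target` at least `min_cooccurrences` times.
    `sightings` is a list of (device, tower, time_slot) tuples."""
    target_slots = {(tower, slot) for dev, tower, slot in sightings if dev == target}
    counts = Counter(
        dev for dev, tower, slot in sightings
        if dev != target and (tower, slot) in target_slots
    )
    return [dev for dev, n in counts.items() if n >= min_cooccurrences]

sightings = [
    ("officer", "t1", 0), ("officer", "t2", 1), ("officer", "t3", 2), ("officer", "t4", 3),
    ("tail",    "t1", 0), ("tail",    "t2", 1), ("tail",    "t3", 2),
    ("local",   "t1", 0),
]
print(likely_followers(sightings, "officer"))  # ['tail']
```

A single shared tower (like "local" above) is coincidence; repeated co-occurrence across different towers and times is what makes "tail" suspicious.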

Happy Foot

The NSA can figure out your location through more means than just cell towers. Each tower reveals roughly how far away your phone is, and combining that distance data from several towers pinpoints your exact position. It’s also possible to locate a person through their phone’s Wi-Fi and GPS data.
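Tower-distance positioning can be made concrete with a small worked example: given distances to three towers, linearize the circle equations (subtracting one from another cancels the squared unknowns) and solve the resulting 2x2 linear system. This is a textbook trilateration sketch, not any agency's actual method:

```python
def trilaterate(t1, t2, t3):
    """Estimate (x, y) from three (tower_x, tower_y, distance) readings
    by linearizing the circle equations and solving the 2x2 system."""
    (x1, y1, d1), (x2, y2, d2), (x3, y3, d3) = t1, t2, t3
    # Subtracting circle 2 (and 3) from circle 1 cancels x^2 and y^2:
    a1, b1 = 2 * (x2 - x1), 2 * (y2 - y1)
    c1 = d1**2 - d2**2 - x1**2 + x2**2 - y1**2 + y2**2
    a2, b2 = 2 * (x3 - x1), 2 * (y3 - y1)
    c2 = d1**2 - d3**2 - x1**2 + x3**2 - y1**2 + y3**2
    det = a1 * b2 - a2 * b1
    x = (c1 * b2 - c2 * b1) / det
    y = (a1 * c2 - a2 * c1) / det
    return x, y

# Phone actually at (3, 4); distances to three towers are consistent:
print(trilaterate((0, 0, 5.0), (10, 0, 8.0623), (0, 10, 6.7082)))  # ~ (3.0, 4.0)
```

Real measurements are noisy, so production systems solve an over-determined least-squares version with more than three towers, but the geometry is the same.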

A number of applications will actually use your Wi-Fi and GPS if they have location-based social components or local e-commerce components, but they also use it to send advertisers your information. The documents suggest the NSA actually intercepts this data as part of the Happy Foot program to get clearer location information.

It might be able to do this through partnerships with telecommunications companies. It may also be able to do this through vulnerabilities in the GRX system that provides data to mobile phones worldwide.


Co-Traveler

We first learned about Co-Traveler last week when it was revealed that the NSA was collecting five billion data points a day. Co-Traveler is a data analytics tool that collects data wholesale from the NSA’s monitored cell towers. This information includes the date, time, and location of cell phones that connect to those towers. The NSA can look at a certain cell-tower area and determine whether someone is traveling alongside a known target. This is a way to learn about “associates” of that target — and might be a way to track down more bad guys.

Of course, when you collect that kind of bulk information, data about U.S. citizens is likely to get swept up, too. The NSA continues to say that it does not intend to collect this information, however.



Cloud security startups are booming, and SkyHigh Networks’ recent $40 million venture capital infusion is evidence of this boom. The Cupertino, California-based SkyHigh Networks gives system administrators insight into which cloud apps end users are utilizing on their desktops. This insight can help prevent insider attacks before they happen. SkyHigh also protects against data leaks from malicious insiders who try to exfiltrate sensitive corporate data through services such as Dropbox or OneDrive.

The Financial Times recently wrote a piece about SkyHigh’s new venture capital infusion and the partners involved in the deal. SkyHigh mentioned that one corporate banking CIO thought there were only 40-50 cloud apps in use within his environment. After SkyHigh was implemented, the CIO was alarmed to find that over 1,000 cloud apps were being used on employee personal computers. Having your network wide open to all cloud apps is much like locking your front door and windows and installing a security system, only to leave the back door wide open for anyone to walk in and out as they please.

Allowing end users to fully utilize cloud apps on their desktops opens up big corporations to immediate security risks. Essentially, corporations that operate without the protection that SkyHigh Networks offers trust their end users to not walk out the door with sensitive corporate data. SkyHigh Networks offers a proactive solution to corporations to prevent such occurrences from happening.

Rajiv Gupta, founder of SkyHigh, says, “In general, [in the US] they are more concerned about insider threats, hackers and third-party nation-states than the US government looking into their bank accounts, the NSA looking into their affairs.” Asheem Chandna, a partner at Greylock, an investor in SkyHigh, says, “There’s a strong investment climate right now for cyber security companies. Security has always been a top priority, but at this point in time, particularly for large companies, the security budget is growing even if the IT budget is flat or declining.”



In a recent research survey, ESG asked security professionals to identify the most important type of data for use in malware detection and analysis (note: I am an employee of ESG). The responses were as follows:

  • 42% of security professionals said, “Firewall logs”
  • 28% of security professionals said, “IDS/IPS alerts”
  • 27% of security professionals said, “PC/laptop forensic data”
  • 23% of security professionals said, “IP packet capture”
  • 22% of security professionals said, “Server logs”

I understand this hierarchy from a historical perspective, but I contend that this list is no longer appropriate for several reasons. First of all, it is skewed toward the network perimeter, which no longer makes sense in a mobile-device, mobile-user world. Second, it appears rooted in SIEM technology, which was OK a few years ago, but we no longer want security technologies mandating what types of data we can and cannot collect and analyze.

Finally, this list has “old school” written all over it. We used to be limited by analytics platforms and the cost of storage, but this is no longer the case. Big data, cheap storage, and cloud-based storage services have altered the rules of the game from an analytics and economics perspective. The new mantra for security analytics should be, “collect and analyze everything.”

What makes up “everything”? Metadata, security intelligence, identity information, transactions, emails, physical security systems: everything!

Now, I know what you are thinking:

I don’t have the right tools to analyze “everything.” You are probably right, but this situation is changing rapidly. Network forensic tools from Blue Coat (Solera Networks), Click Security and LogRhythm can perform stream processing on network packets. Big data security analytics platforms from IBM, Leidos, Narus, RSA Security, and Splunk are designed to capture and analyze structured and unstructured data. Heck, there are even managed services from Arbor Networks and Dell if you don’t want to get your hands dirty.

I don’t have the skills to analyze “everything.” Very good point, and things aren’t likely to improve — there’s a global cybersecurity skills shortage and more data to analyze each day. Security analytics vendors need to do a better job here in terms of algorithms, automation, dashboards, machine learning, and threat intelligence integration. While I expect a lot of innovation in this area, CISOs should take a prudent approach here. For example, Splunk customers talk about collecting the data, learning the relationships between events, and then contextualizing specific data views by creating numerous dashboards. Makes sense to me.

I can’t afford yottabytes of storage for all of this data. With the exception of the NSA and its Bluffdale, Utah data center, few organizations can. To be clear, big data security analytics doesn’t demand retaining all of the data, but it does demand scanning the data in search of suspicious/anomalous behavior. In many cases, CISOs retain only the metadata, a fraction of the whole enchilada.
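The "retain only the metadata" idea can be sketched simply: parse each raw log line, keep the compact who/when/how-much fields, and drop the bulky payload. The log format and field names below are invented for illustration:

```python
import re

# Hypothetical log format: "<timestamp> <src> -> <dst> <bytes> <payload...>"
LOG_PATTERN = re.compile(
    r"(?P<ts>\S+) (?P<src>\S+) -> (?P<dst>\S+) (?P<bytes>\d+) (?P<payload>.*)"
)

def extract_metadata(line):
    """Keep who-talked-to-whom-and-how-much; drop the payload,
    which is the bulk of the storage cost."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    return {"ts": m.group("ts"), "src": m.group("src"),
            "dst": m.group("dst"), "bytes": int(m.group("bytes"))}

line = "2014-01-07T12:00:00Z 10.0.0.5 -> 93.184.216.34 1420 GET /index.html ..."
print(extract_metadata(line))
```

The retained record is a few dozen bytes per event regardless of payload size, which is what makes long-term retention for behavioral analysis affordable.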

While it may seem like hype to our cynical cybersecurity community, big data is radically changing the way we look at the world we live in. For example, we no longer have to rely on data sampling and historical analysis; we can now collect and analyze volumes of data in real time. The sooner we incorporate this new reality into our cybersecurity strategies, the better.


Security Analytics


1. Big Data Analytics for Security

This section explains how Big Data is changing the analytics landscape. In particular, Big Data analytics can be leveraged to improve information security and situational awareness. For example, Big Data analytics can be employed to analyze financial transactions, log files, and network traffic to identify anomalies and suspicious activities, and to correlate multiple sources of information into a coherent view.

Data-driven information security dates back to bank fraud detection and anomaly-based intrusion detection systems. Fraud detection is one of the most visible uses for Big Data analytics. Credit card companies have conducted fraud detection for decades. However, the custom-built infrastructure to mine Big Data for fraud detection was not economical to adapt for other fraud detection uses. Off-the-shelf Big Data tools and techniques are now bringing attention to analytics for fraud detection in healthcare, insurance, and other fields.

In the context of data analytics for intrusion detection, the following evolution is anticipated:

  • 1st generation: Intrusion detection systems – Security architects realized the need for layered security (e.g., reactive security and breach response) because a system with 100% protective security is impossible.
  • 2nd generation: Security information and event management (SIEM) – Managing alerts from different intrusion detection sensors and rules was a big challenge in enterprise settings. SIEM systems aggregate and filter alarms from many sources and present actionable information to security analysts.
  • 3rd generation: Big Data analytics in security (2nd generation SIEM) – Big Data tools have the potential to provide a significant advance in actionable security intelligence by reducing the time for correlating, consolidating, and contextualizing diverse security event information, and also for correlating long-term historical data for forensic purposes.

Analyzing logs, network packets, and system events for forensics and intrusion detection has traditionally been a significant problem; however, traditional technologies fail to provide the tools to support long-term, large-scale analytics for several reasons:

  1. Storing and retaining a large quantity of data was not economically feasible. As a result, most event logs and other recorded computer activity were deleted after a fixed retention period (e.g., 60 days).
  2. Performing analytics and complex queries on large, structured data sets was inefficient because traditional tools did not leverage Big Data technologies.
  3. Traditional tools were not designed to analyze and manage unstructured data. As a result, traditional tools had rigid, defined schemas. Big Data tools (e.g., Pig Latin scripts and regular expressions) can query data in flexible formats.
  4. Big Data systems use cluster computing infrastructures. As a result, the systems are more reliable and available, and provide guarantees that queries on the systems are processed to completion.
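Point 3's flexible-format querying can be illustrated with a MapReduce-style pass that uses a regular expression instead of a fixed schema: the mapper recognizes "failed login" lines from heterogeneous sources, and the reducer counts offending IPs. This is a toy sketch in Python; real deployments would express the same pattern in Pig Latin or another Big Data query language, and the log lines are invented:

```python
import re
from collections import Counter

def map_phase(lines):
    """Mapper: emit (ip, 1) for any line that looks like a failed login,
    regardless of which system produced it -- no fixed schema required."""
    pattern = re.compile(r"(?:failed|invalid).*?from (\d+\.\d+\.\d+\.\d+)", re.I)
    for line in lines:
        m = pattern.search(line)
        if m:
            yield m.group(1), 1

def reduce_phase(pairs):
    """Reducer: sum counts per key."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts

logs = [
    "sshd[201]: Failed password for root from 203.0.113.9 port 22",
    "app: login invalid user=admin from 203.0.113.9",
    "sshd[202]: Accepted publickey for alice from 10.0.0.2",
]
print(reduce_phase(map_phase(logs)))  # Counter({'203.0.113.9': 2})
```

Because the "schema" is just a regular expression, new log sources can be folded in without redesigning tables, which is exactly the flexibility traditional tools lacked.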

New Big Data technologies, such as databases related to the Hadoop ecosystem and stream processing, are enabling the storage and analysis of large heterogeneous data sets at an unprecedented scale and speed. These technologies will transform security analytics by: (a) collecting data at a massive scale from many internal enterprise sources and external sources such as vulnerability databases; (b) performing deeper analytics on the data; (c) providing a consolidated view of security-related information; and (d) achieving real-time analysis of streaming data. It is important to note that Big Data tools still require system architects and analysts to have a deep knowledge of their system in order to properly configure the Big Data analysis tools.

2. Examples

This section describes examples of Big Data analytics used for security purposes.

 2.1 Network Security

In a published case study, Zions Bancorporation announced that it is using Hadoop clusters and business intelligence tools to parse more data more quickly than with traditional SIEM tools. In their experience, the quantity of data and the frequency of analysis are too much for traditional SIEMs to handle alone. In their traditional systems, searching a month’s worth of data could take between 20 minutes and an hour. In their new Hadoop system running queries with Hive, they get the same results in about one minute.

The security data warehouse driving this implementation not only enables users to mine meaningful security information from sources such as firewalls and security devices, but also from website traffic, business processes and other day-to-day transactions. This incorporation of unstructured data and multiple disparate data sets into a single analytical framework is one of the main promises of Big Data.


2.2 Enterprise Events Analytics

Enterprises routinely collect terabytes of security relevant data (e.g., network events, software application events, and people action events) for several reasons, including the need for regulatory compliance and post-hoc forensic analysis. Unfortunately, this volume of data quickly becomes overwhelming. Enterprises can barely store the data, much less do anything useful with it. For example, it is estimated that an enterprise as large as HP currently (in 2013) generates 1 trillion events per day, or roughly 12 million events per second. These numbers will grow as enterprises enable event logging in more sources, hire more employees, deploy more devices, and run more software. Existing analytical techniques do not work well at this scale and typically produce so many false positives that their efficacy is undermined. The problem becomes worse as enterprises move to cloud architectures and collect much more data. As a result, the more data that is collected, the less actionable information is derived from the data.

The goal of a recent research effort at HP Labs is to move toward a scenario where more data leads to better analytics and more actionable information (Manadhata, Horne, & Rao, forthcoming). To do so, algorithms and systems must be designed and implemented in order to identify actionable security information from large enterprise data sets and drive false positive rates down to manageable levels. In this scenario, the more data that is collected, the more value can be derived from the data. However, many challenges must be overcome to realize the true potential of Big Data analysis. Among these challenges are the legal, privacy, and technical issues regarding scalable data collection, transport, storage, analysis, and visualization.

Despite the challenges, the group at HP Labs has successfully addressed several Big Data analytics for security challenges, some of which are highlighted in this section. First, a large-scale graph inference approach was introduced to identify malware-infected hosts in an enterprise network and the malicious domains accessed by the enterprise’s hosts. Specifically, a host-domain access graph was constructed from large enterprise event data sets by adding edges between every host in the enterprise and the domains visited by the host. The graph was then seeded with minimal ground truth information from a black list and a white list, and belief propagation was used to estimate the likelihood that a host or domain is malicious. Experiments on a 2 billion HTTP request data set collected at a large enterprise, a 1 billion DNS request data set collected at an ISP, and a 35 billion network intrusion detection system alert data set collected from over 900 enterprises worldwide showed that high true positive rates and low false positive rates can be achieved with minimal ground truth information (that is, having limited data labeled as normal events or attack events used to train anomaly detectors).
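The seeded-graph idea above can be illustrated with a much-simplified score-propagation sketch. The HP Labs work used full belief propagation on billions of events; the toy iteration below (averaging neighbor scores around black-list and white-list seeds) is only an illustrative stand-in, and the host and domain names are invented.

```python
# Simplified sketch of seeded score propagation on a host-domain access graph.
# Hosts and domains are nodes; an edge means "host visited domain".
from collections import defaultdict

edges = [("host1", "bad.example"), ("host1", "ok.example"),
         ("host2", "bad.example"), ("host3", "ok.example")]

# Ground-truth seeds: black list -> 1.0 (malicious), white list -> 0.0 (benign).
seeds = {"bad.example": 1.0, "ok.example": 0.0}

neighbors = defaultdict(set)
for h, d in edges:
    neighbors[h].add(d)
    neighbors[d].add(h)

# Initialize unknown nodes at 0.5 and iterate: each unseeded node takes
# the mean maliciousness score of its neighbors.
scores = {n: seeds.get(n, 0.5) for n in neighbors}
for _ in range(10):
    scores = {n: seeds.get(n, sum(scores[m] for m in neighbors[n]) / len(neighbors[n]))
              for n in neighbors}

# host2 only visits the blacklisted domain, so it ends up with the highest score.
ranked = sorted((s, n) for n, s in scores.items() if n.startswith("host"))
print(ranked[-1][1])  # host2
```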

Second, terabytes of DNS events consisting of billions of DNS requests and responses collected at an ISP were analyzed. The goal was to use the rich source of DNS information to identify botnets, malicious domains, and other malicious activities in a network. Specifically, features indicative of maliciousness were identified. For example, malicious fast-flux domains tend to last for a short time, whereas good domains tend to last much longer and resolve to many geographically distributed IPs. A varied set of features was computed, including ones derived from domain names, time stamps, and DNS response time-to-live values. Then, classification techniques (e.g., decision trees and support vector machines) were used to identify infected hosts and malicious domains. The analysis has already identified many malicious activities in the ISP data set.
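The feature-extraction step described above can be sketched as follows. The feature names, thresholds, and DNS records here are entirely hypothetical, and the toy decision rule merely stands in for the trained classifiers (decision trees, SVMs) the study actually used:

```python
# Illustrative DNS feature extraction; thresholds and data are invented.

def extract_features(records):
    """records: list of (timestamp, resolved_ip, ttl) tuples for one domain."""
    timestamps = [r[0] for r in records]
    ips = {r[1] for r in records}
    ttls = [r[2] for r in records]
    return {
        "lifetime_s": max(timestamps) - min(timestamps),  # how long the domain was observed
        "distinct_ips": len(ips),      # both CDNs and fast-flux nets resolve to many IPs,
                                       # so this is only useful combined with other features
        "mean_ttl": sum(ttls) / len(ttls),                # fast-flux favors short TTLs
    }

def looks_fast_flux(features):
    # Toy decision rule standing in for a trained classifier.
    return features["lifetime_s"] < 86400 and features["mean_ttl"] < 300

flux = [(t, f"10.0.0.{t}", 60) for t in range(1, 8)]           # short-lived, low TTL
stable = [(t * 86400, "93.184.216.34", 86400) for t in range(30)]  # long-lived, high TTL
print(looks_fast_flux(extract_features(flux)))    # True
print(looks_fast_flux(extract_features(stable)))  # False
```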

2.3 Netflow Monitoring to Identify Botnets

This section summarizes the BotCloud research project, which leverages the MapReduce paradigm to analyze enormous quantities of Netflow data and identify infected hosts participating in a botnet (François et al., 2011, November). The rationale for using MapReduce in this project stemmed from the large amount of Netflow data collected for analysis: 720 million Netflow records (77 GB) were collected in only 23 hours. Processing this data with traditional tools is challenging. However, Big Data solutions like MapReduce greatly enhance analytics by enabling an easy-to-deploy distributed computing paradigm.

BotCloud relies on BotTrack, which examines host relationships using a combination of PageRank and clustering algorithms to track the command-and-control (C&C) channels in the botnet (François et al., 2011, May). Botnet detection is divided into the following steps: dependency graph creation, the PageRank algorithm, and DBSCAN clustering.

The dependency graph was constructed from Netflow records by representing each host (IP address) as a node. There is an edge from node A to B if, and only if, there is at least one Netflow record having A as the source address and B as the destination address. PageRank will discover patterns in this graph (assuming that P2P communications between bots have similar characteristics, since the bots are involved in the same type of activities) and the clustering phase will then group together hosts having the same pattern. Since PageRank is the most resource-consuming part, it is the only step implemented in MapReduce.
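The MapReduce formulation of PageRank can be sketched in a few lines. In the map phase each node emits rank contributions along its outgoing edges (the key-value pairs mentioned below), and the reduce phase sums contributions per destination node. The in-memory mimic here is only illustrative; BotCloud ran this on a Hadoop cluster, and the graph and damping factor are invented:

```python
# Minimal in-memory mimic of one MapReduce PageRank step.
from collections import defaultdict

edges = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {n: 1.0 / 3 for n in edges}
DAMPING = 0.85

def map_phase(node, out_links, rank):
    # Emit (destination, rank share) key-value pairs along each edge.
    share = rank / len(out_links)
    return [(dst, share) for dst in out_links]

def reduce_phase(contributions, n_nodes):
    # Sum the shares arriving at each node and apply the damping factor.
    summed = defaultdict(float)
    for dst, share in contributions:
        summed[dst] += share
    return {n: (1 - DAMPING) / n_nodes + DAMPING * summed[n] for n in edges}

for _ in range(20):  # iterate to (near) convergence
    kv = [pair for n, out in edges.items() for pair in map_phase(n, out, ranks[n])]
    ranks = reduce_phase(kv, len(edges))

print(max(ranks, key=ranks.get))  # C, which receives links from both A and B
```

The number of intermediate key-value pairs emitted by `map_phase` grows with the number of edges, which is why the edge count dominates the cost of each iteration.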

BotCloud used a small Hadoop cluster of 12 commodity nodes (11 slaves + 1 master): 6 Intel Core 2 Duo 2.13GHz nodes with 4 GB of memory and 6 Intel Pentium 4 3GHz nodes with 2 GB of memory. The dataset contained about 16 million hosts and 720 million Netflow records, yielding a dependency graph of 57 million edges.

The number of edges in the graph is the main parameter affecting the computational complexity. Since scores are propagated through the edges, the number of intermediate MapReduce key-value pairs depends on the number of links. Figure 1 shows the time to complete an iteration with different numbers of edges and cluster sizes.


Figure 1. Average execution time for a single PageRank iteration.

The results demonstrate that the time for analyzing the complete dataset (57 million edges) was reduced by a factor of seven by this small Hadoop cluster. Full results (including the accuracy of the algorithm for identifying botnets) are described in François et al. (2011, May).

2.4 Advanced Persistent Threats Detection

An Advanced Persistent Threat (APT) is a targeted attack against a high-value asset or a physical system. In contrast to mass-spreading malware, such as worms, viruses, and Trojans, APT attackers operate in “low-and-slow” mode. “Low mode” maintains a low profile in the networks and “slow mode” allows for long execution time. APT attackers often leverage stolen user credentials or zero-day exploits to avoid triggering alerts. As such, this type of attack can take place over an extended period of time while the victim organization remains oblivious to the intrusion. The 2010 Verizon data breach investigation report concludes that in 86% of the cases, evidence about the data breach was recorded in the organization logs, but the detection mechanisms failed to raise security alarms (Verizon, 2010).

APTs are among the most serious information security threats that organizations face today. A common goal of an APT is to steal intellectual property (IP) from the targeted organization, to gain access to sensitive customer data, or to access strategic business information that could be used for financial gain, blackmail, embarrassment, data poisoning, illegal insider trading or disrupting an organization’s business. APTs are operated by highly-skilled, well-funded and motivated attackers targeting sensitive information from specific organizations and operating over periods of months or years. APTs have become very sophisticated and diverse in the methods and technologies used, particularly in the ability to use organizations’ own employees to penetrate the IT systems by using social engineering methods. They often trick users into opening spear-phishing messages that are customized for each victim (e.g., emails, SMS, and PUSH messages) and then downloading and installing specially crafted malware that may contain zero-day exploits (Verizon, 2010; Curry et al., 2011; and Alperovitch, 2011).

Today, detection relies heavily on the expertise of human analysts to create custom signatures and perform manual investigation. This process is labor-intensive, difficult to generalize, and not scalable. Existing anomaly detection proposals commonly focus on obvious outliers (e.g., volume-based), but are ill-suited for stealthy APT attacks and suffer from high false positive rates.

Big Data analysis is a suitable approach for APT detection. A challenge in detecting APTs is the massive amount of data to sift through in search of anomalies. The data comes from an ever-increasing number of diverse information sources that have to be audited. This massive volume of data makes the detection task look like searching for a needle in a haystack (Giura & Wang, 2012). Due to the volume of data, traditional network perimeter defense systems can become ineffective in detecting targeted attacks and they are not scalable to the increasing size of organizational networks. As a result, a new approach is required. Many enterprises collect data about users’ and hosts’ activities within the organization’s network, as logged by firewalls, web proxies, domain controllers, intrusion detection systems, and VPN servers. While this data is typically used for compliance and forensic investigation, it also contains a wealth of information about user behavior that holds promise for detecting stealthy attacks.

2.4.1 Beehive: Behavior Profiling for APT Detection

At RSA Labs, the observation about APTs is that, however subtle the attack might be, the attacker’s behavior (in attempting to steal sensitive information or subvert system operations) should cause the compromised user’s actions to deviate from their usual pattern. Moreover, since APT attacks consist of multiple stages (e.g., exploitation, command-and-control, lateral movement, and objectives), each action by the attacker provides an opportunity to detect behavioral deviations from the norm. Correlating these seemingly independent events can reveal evidence of the intrusion, exposing stealthy attacks that could not be identified with previous methods.

These detectors of behavioral deviations are referred to as “anomaly sensors,” with each sensor examining one aspect of the host’s or user’s activities within an enterprise’s network. For instance, a sensor may keep track of the external sites a host contacts in order to identify unusual connections (potential command-and-control channels), profile the set of machines each user logs into to find anomalous access patterns (potential “pivoting” behavior in the lateral movement stage), study users’ regular working hours to flag suspicious activities in the middle of the night, or track the flow of data between internal hosts to find unusual “sinks” where large amounts of data are gathered (potential staging servers before data exfiltration).

While the triggering of one sensor indicates the presence of a singular unusual activity, the triggering of multiple sensors suggests more suspicious behavior. The human analyst is given the flexibility of combining multiple sensors according to known attack patterns (e.g., command-and-control communications followed by lateral movement) to look for abnormal events that may warrant investigation or to generate behavioral reports of a given user’s activities across time.
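The sensor-combination idea above can be sketched as follows. The sensor logic, thresholds, and event fields here are invented for illustration and are not Beehive's actual implementation; the point is only that independent weak detectors are summed into a suspicion score:

```python
# Hedged sketch of combining per-aspect "anomaly sensors" into a suspicion score.
from datetime import datetime

def off_hours_sensor(event):
    # Flags activity well outside regular working hours.
    hour = event["time"].hour
    return hour < 6 or hour > 22

def new_destination_sensor(event, known_hosts):
    # Flags logins to machines this user has never accessed before
    # (potential "pivoting" in the lateral-movement stage).
    return event["dst_host"] not in known_hosts

def suspicion_score(event, known_hosts):
    triggered = [off_hours_sensor(event),
                 new_destination_sensor(event, known_hosts)]
    return sum(triggered)  # more sensors firing -> more suspicious

event = {"time": datetime(2013, 5, 1, 3, 14), "dst_host": "db-staging-7"}
print(suspicion_score(event, known_hosts={"mail-1", "files-2"}))  # 2: both sensors fire
```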

The prototype APT detection system at RSA Labs is named Beehive. The name refers to the multiple weak components (the "sensors") that work together to achieve a goal (APT detection), just as bees with differentiated roles cooperate to maintain a hive. Preliminary results showed that Beehive is able to process a day's worth of data (around a billion log messages) in an hour, and that it identified policy violations and malware infections that would otherwise have gone unnoticed (Yen et al., 2013).

In addition to detecting APTs, behavior profiling also supports other applications, including IT management (e.g., identifying critical services and unauthorized IT infrastructure within the organization by examining usage patterns), and behavior-based authentication (e.g., authenticating users based on their interaction with other users and hosts, the applications they typically access, or their regular working hours). Thus, Beehive provides insights into an organization’s environment for security and beyond.

2.4.2 Using Large-Scale Distributed Computing to Unveil APTs

Although an APT itself is not a large-scale exploit, the detection method should use large-scale methods and close-to-target monitoring algorithms in order to be effective and to cover all possible attack paths. In this regard, a successful APT detection methodology should model the APT as an attack pyramid, as introduced by Giura & Wang (2012). An attack pyramid should have the possible attack goal (e.g., sensitive data, high rank employees, and data servers) at the top and lateral planes representing the environments where the events associated with an attack can be recorded (e.g., user plane, network plane, application plane, or physical plane). The detection framework proposed by Giura & Wang groups all of the events recorded in an organization that could potentially be relevant for security using flexible correlation rules that can be redefined as the attack evolves. The framework implements the detection rules (e.g., signature based, anomaly based, or policy based) using various algorithms to detect possible malicious activities within each context and across contexts using a MapReduce paradigm.
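The attack-pyramid framework above can be sketched MapReduce-style: events recorded on different planes are checked against per-plane detection rules in the map phase, and the reduce phase groups the flagged events per plane. The planes, rules, and events below are invented for illustration; the actual framework by Giura & Wang uses richer, redefinable correlation rules:

```python
# Illustrative per-plane rule checking in a map/reduce shape (events are made up).
from collections import defaultdict

events = [
    {"plane": "network", "type": "dns", "domain": "c2.bad.example"},
    {"plane": "user", "type": "login", "hour": 3},
    {"plane": "network", "type": "dns", "domain": "intranet.example"},
]

RULES = {
    "network": lambda e: e["domain"].endswith("bad.example"),  # signature-style rule
    "user": lambda e: e["hour"] < 6,                           # anomaly-style rule
}

def map_phase(event):
    # Emit (plane, event) pairs for events flagged by that plane's rule.
    rule = RULES.get(event["plane"])
    if rule and rule(event):
        yield (event["plane"], event)

def reduce_phase(pairs):
    # Group flagged events by plane for cross-context correlation.
    alerts = defaultdict(list)
    for plane, event in pairs:
        alerts[plane].append(event)
    return alerts

alerts = reduce_phase(p for e in events for p in map_phase(e))
print(sorted(alerts))  # planes with at least one suspicious event: ['network', 'user']
```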

There is no doubt that the data used as evidence of attacks is growing in volume, velocity, and variety, making attacks increasingly difficult to detect. In the case of APTs, there is no known bad item that an IDS could pick up or that could be found in traditional information retrieval systems or databases. By using a MapReduce implementation, an APT detection system can more efficiently handle the highly unstructured data, with arbitrary formats, that is captured by many types of sensors (e.g., Syslog, IDS, firewall, NetFlow, and DNS) over long periods of time. Moreover, the massively parallel processing mechanism of MapReduce can support much more sophisticated detection algorithms than the traditional SQL-based data systems designed for transactional workloads with highly structured data. Additionally, with MapReduce, users have the power and flexibility to incorporate any detection algorithm into the Map and Reduce functions. The functions can be tailored to work with specific data and make the distributed computing details transparent to the users. Finally, exploring the use of large-scale distributed systems has the potential to help analyze more data at once, cover more attack paths and possible targets, and reveal unknown threats in a context closer to the target, as is the case with APTs.

3. The WINE Platform for Experimenting with Big Data Analytics in Security

The Worldwide Intelligence Network Environment (WINE) provides a platform for conducting data analysis at scale, using field data collected at Symantec (e.g., anti-virus telemetry and file downloads), and promotes rigorous experimental methods (Dumitras & Shou, 2011). WINE loads, samples, and aggregates data feeds originating from millions of hosts around the world and keeps them up-to-date. This allows researchers to conduct open-ended, reproducible experiments in order to, for example, validate new ideas on real-world data, conduct empirical studies, or compare the performance of different algorithms against reference data sets archived in WINE. WINE is currently used by Symantec's engineers and by academic researchers.

3.1 Data Sharing and Provenance

Experimental research in cyber security is rarely reproducible because today’s data sets are not widely available to the research community and are often insufficient for answering many open questions. Due to scientific, ethical, and legal barriers to publicly disseminating security data, the data sets used for validating cyber security research are often mentioned in a single publication and then forgotten. The “data wishlist” (Camp, 2009) published by the security research community in 2009 emphasizes the need to obtain data for research purposes on an ongoing basis.

WINE provides one possible model for addressing these challenges. The WINE platform continuously samples and aggregates multiple petabyte-sized data sets, collected around the world by Symantec from customers who agree to share this data. Through the use of parallel processing techniques, the platform also enables open-ended experiments at scale. In order to protect the sensitive information included in the data sets, WINE can only be accessed on-site at Symantec Research Labs. To conduct a WINE experiment, academic researchers are first required to submit a proposal describing the goals of the experiment and the data needed. When using the WINE platform, researchers have access to the raw data relevant to their experiment. All of the experiments carried out on WINE can be attributed to the researchers who conducted them and the raw data cannot be accessed anonymously or copied outside of Symantec’s network.

WINE provides access to a large collection of malware samples and to the contextual information needed to understand how malware spreads and conceals its presence, how malware gains access to different systems, what actions malware performs once it is in control, and how malware is ultimately defeated. The malware samples are collected around the world and are used to update Symantec’s anti-virus signatures. Researchers can analyze these samples in an isolated “red lab,” which does not have inbound/outbound network connectivity in order to prevent viruses and worms from escaping this isolated environment.

A number of additional telemetry data sets, received from hosts running Symantec’s products, are stored in a separate parallel database. Researchers can analyze this data using SQL queries or by writing MapReduce tasks.

These data sets include anti-virus telemetry and intrusion-protection telemetry, which record occurrences of known host-based threats and network-based threats, respectively. The binary reputation data set provides information on unknown binaries that are downloaded by users who participate in Download Insight, Symantec’s reputation-based security program. The history of binary reputation submissions can reveal when a particular threat has first appeared and how long it existed before it was detected. Similarly, the binary stability data set is collected from the users who participate in the Performance Insight program, which reports the health and stability of applications before users download them. This telemetry data set reports application and system crashes, as well as system lifecycle events (e.g., software installations and uninstallations). Telemetry submission is an optional feature of Symantec products and users can opt out at any time.

These data sets are collected at high rates and the combined data volume exceeds 1 petabyte. To keep the data sets up-to-date and to make them easier to analyze, WINE stores a representative sample from each telemetry source. The samples included in WINE contain either all of the events recorded on a host or no data from that host at all, allowing researchers to search for correlations among events from different data sets.
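The all-or-nothing sampling property described above can be obtained with a deterministic decision per host, for example by hashing the host identifier. The 10% rate and hashing scheme below are illustrative assumptions, not WINE's documented mechanism; the point is that a host's events are either all kept or all dropped, so cross-data-set correlations per host survive sampling:

```python
# Sketch of host-level "all events or none" sampling.
import hashlib

SAMPLE_RATE = 0.10  # illustrative; the actual rate is not specified here

def keep_host(host_id):
    # Deterministic: the same host always gets the same decision,
    # regardless of which telemetry feed the event came from.
    digest = hashlib.sha256(host_id.encode()).hexdigest()
    return int(digest, 16) % 100 < SAMPLE_RATE * 100

def sample(events):
    return [e for e in events if keep_host(e["host"])]

events = [{"host": f"host-{i}", "feed": "av"} for i in range(1000)]
kept = sample(events)
print(len(kept))  # roughly 100 of 1000, and identical on every run
```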

This operational model also allows Symantec to record metadata establishing the provenance of experimental results (Dumitras & Efstathopoulos, 2012), which ensures the reproducibility of past experiments conducted on WINE. The WINE data sets include provenance information, such as when an attack was first observed, where it has spread, and how it was detected. Moreover, experimentation is performed in a controlled environment at Symantec Research Labs and all intermediate and final results are kept within the administrative control of the system.

However, recording all of the mechanical steps in an experiment is not enough. To reproduce a researcher’s conclusions, the hypothesis and the reasoning behind each experimental step must be explicit. To achieve this transparency, experimenters are provided with an electronic lab book (in the form of a wiki) for documenting all of the experimental procedures. Maintaining the lab book requires a conscious effort from the experimenters and produces documentation on the reference data sets created for the purposes of the experiment, the script that executes the experimental procedure, and the output data. Keeping such a lab book is a common practice in other experimental fields, such as applied physics or experimental biology.

3.2 WINE Analysis Example: Determining the Duration of Zero-Day Attacks

A zero-day attack exploits one or more vulnerabilities that have not been disclosed publicly. Knowledge of such vulnerabilities enables cyber criminals to attack any target undetected, from Fortune 500 companies to millions of consumer PCs around the world. The WINE platform was used to measure the duration of 18 zero-day attacks by combining the binary reputation and anti-virus telemetry data sets and by analyzing field data collected on 11 million hosts worldwide (Bilge & Dumitras, 2012). These attacks lasted between 19 days and 30 months, with a median of 8 months and an average of approximately 10 months (Figure 2). Moreover, 60% of the vulnerabilities identified in this study had not been previously identified as exploited in zero-day attacks. This suggests that such attacks are more common than previously thought. These insights have important implications for future security technologies because they focus attention on the attacks and vulnerabilities that matter most in the real world.
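The core of this measurement is a join between the two data sets: an attack's duration is the gap between the exploit's first appearance in binary reputation submissions and the date anti-virus telemetry first detected it. The dates and exploit identifiers below are invented purely to show the shape of the computation:

```python
# Toy reconstruction of the zero-day duration measurement (all data invented).
from datetime import date
from statistics import median

# exploit id -> first submission seen in the binary reputation data set
first_seen = {"expl-a": date(2009, 1, 10), "expl-b": date(2010, 3, 2),
              "expl-c": date(2008, 11, 20)}
# exploit id -> date first detected in the anti-virus telemetry data set
detected = {"expl-a": date(2009, 9, 1), "expl-b": date(2010, 3, 21),
            "expl-c": date(2011, 5, 1)}

# Duration each exploit circulated undetected, in days.
durations_days = [(detected[h] - first_seen[h]).days for h in first_seen]
print(min(durations_days), median(durations_days))  # shortest and median windows
```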


Figure 2. Analysis of zero-day attacks that go undetected.

The outcome of this analysis highlights the importance of Big Data techniques for security research. For more than a decade, the security community suspected that zero-day attacks are undetected for long periods of time, but past studies were unable to provide statistically significant evidence of this phenomenon. This is because zero-day attacks are rare events that are unlikely to be observed in honeypots or in lab experiments. For example, most of the zero-day attacks in the study showed up on fewer than 150 hosts out of the 11 million analyzed. Big Data platforms such as WINE provide unique insights about advanced cyber attacks and open up new avenues of research on next-generation security technologies.

4. Conclusions

The goal of Big Data analytics for security is to obtain actionable intelligence in real time. Although Big Data analytics have significant promise, there are a number of challenges that must be overcome to realize its true potential. The following are only some of the questions that need to be addressed:

1. Data provenance: authenticity and integrity of data used for analytics. As Big Data expands the sources of data it can use, the trustworthiness of each data source needs to be verified and the inclusion of ideas such as adversarial machine learning must be explored in order to identify maliciously inserted data.

2. Privacy: we need regulatory incentives and technical mechanisms to minimize the amount of inferences that Big Data users can make. CSA has a group dedicated to privacy in Big Data and has liaisons with NIST's Big Data working group on security and privacy. We plan to produce new guidelines and white papers exploring the technical means and the best principles for minimizing privacy invasions arising from Big Data analytics.

3. Securing Big Data stores: this document focused on using Big Data for security, but the other side of the coin is the security of Big Data. CSA has produced documents on security in Cloud Computing and also has working groups focusing on identifying the best practices for securing Big Data.

4. Human-computer interaction: Big Data might facilitate the analysis of diverse sources of data, but a human analyst still has to interpret any result. Compared to the technical mechanisms developed for efficient computation and storage, the human-computer interaction with Big Data has received less attention and this is an area that needs to grow. A good first step in this direction is the use of visualization tools to help analysts understand the data of their systems.

Source: Cloud Security Alliance