With Apache Hadoop YARN as its architectural center, Apache Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it for batch, interactive and real-time streaming use cases. Apache Storm brings real-time data processing capabilities to help capture new business opportunities by powering low-latency dashboards, security alerts, and operational enhancements integrated with other applications running in the Hadoop cluster.

The community recently announced the release of Apache Storm 0.9.3. With this release, the team closed 100 JIRA tickets and delivered many new features, fixes and enhancements, including these three important improvements:


This blog gives a brief overview of these new features in Apache Storm 0.9.3 and also looks ahead to future plans for the project.

HDFS Integration

Apache Storm’s HDFS integration consists of several bolt and Trident state implementations that allow topology developers to easily write data to HDFS from any Storm topology. Many stream processing use cases involve storing data in HDFS for subsequent batch processing and analysis of historical trends.

HBase Integration

Apache Storm’s HBase integration includes a number of components that allow Storm topologies to both write to and query HBase in real-time.

Many organizations use Apache HBase as part of their big data strategy for batch, interactive, and real-time workflows. Storm’s HBase integration allows users to leverage HBase data assets for streaming queries, and also use HBase as a destination for streaming computation results.

Output to Apache Kafka

Apache Storm has supported Kafka as a streaming data source since version 0.9.2-incubating. Now Storm 0.9.3 brings a number of improvements to the Kafka integration and also adds the ability to write data to one or more Kafka clusters and topics.

The ability to both read and write to Kafka unlocks additional potential in the already powerful combination of Storm and Kafka. Storm users can now use Kafka as a source of and destination for streaming data. This allows for inter-topology communication, combining spout and bolt-based topologies with Trident-based data flows. It also enables integration with any external system that supports data ingest from Kafka.

Plans for the Future

In upcoming releases of Apache Storm, the community will be focusing on enhanced security, high availability, and deeper integration with YARN.

The Apache Storm PMC would like to thank the community of volunteers who made the many new features and fixes in this release a reality.

Download Apache Storm and Learn More


Mobile will force a fundamental change in the approach to BI. When it comes to mobile BI, adoption has been shockingly poor because traditional BI doesn’t usually work well on mobile devices. You can’t read data in depth on a mobile device; you need to get to the point quickly. With everything shifting to mobile, the approach to BI will change. Rather than elaborate visualizations, you will see hard numbers, simple graphs and conclusions. For instance, with wearable devices, you might look at an employee and quickly see the KPI (key performance indicator). The BI game is about to change — primed to go mobile this year. – Adi Azaria, co-founder, Sisense


Big Data in 2014 was supposed to be about moving past the hype toward real-world business gains, but for most companies, Big Data is still in the experimental phase. In 2015, we will begin to see Big Data delivering on its promise, with Hadoop serving as the engine behind more practical and profitable applications of Big Data, thanks to its reduced cost compared to other platforms. – Mike Hoskins, CTO at Actian


The data scientist role will not die or be replaced by generalists. In 2014 we saw c-suite recognition of Chief Data Officers as key additions to the executive team for enterprises across industry sectors. Data scientists will continue to provide generalists with new capabilities. R, along with BI tools, enable data science teams to make their work accessible through GUI analytics tools which amplify the knowledge and skills of the generalist. – David Smith, chief community officer, Revolution Analytics


The Internet of Things will gain momentum, but will still be in the early stages throughout 2015. While we’ve seen the number of connected devices continue to rise, most people still aren’t putting network keys into their toasters. … Expect toasters to get connected, eventually, too — as people get readings of the nutritional value of the bread they toast, their energy consumption, and their carbon footprints, among other things — but the more mundane items won’t be connected for another year or two yet. – Information Builders



Big Data becomes a “Big Target,” as the bad guys realize big data repositories are a gold mine of high-value data. – Ashvin Kamaraju, VP Product Development at Vormetric


Data will finally be about driving higher margins and bringing value to the enterprise. A lot of the “confusion” around big data has been defining it. 2015 will see companies deriving value. – VoltDB


Big Data moving to the cloud – enterprises are increasingly using the cloud for Big Data analytics for a multitude of reasons: elastic infrastructure needs, faster provisioning time and time to value in the cloud, and increasing reliance on externally generated data (e.g., 3rd-party data sources, Internet of Things and device-generated data, clickstream data). Spark – the most active project in the Big Data ecosystem – was optimized for cloud environments. The uptake of this trend is evident in the fact that large enterprise vendors – SAP, Oracle, IBM – and promising startups are all pushing cloud-based analytics solutions. – Ali Ghodsi, Head of Product Management and Engineering, Databricks


IoT drives stronger security for big data environments. With millions of high-tech wearable devices coming online daily, more personal private data, including health data, locations, search histories and shopping habits, will get stored and analyzed by big data analytics. Securing big data using encryption becomes an inevitable requirement. – Ashvin Kamaraju, VP Product Development at Vormetric


SQL will be a “must-have” to get the analytic value out of Hadoop data. We’ll see some vendor shake-out as bolt-on, legacy or immature SQL-on-Hadoop offerings cave to those that offer the performance, maturity and stability organizations need. – Mike Hoskins, CTO at Actian





In 2015, enterprises will move beyond data visualization to data actualization. The deployment of applications will require deployment of production-quality analytics that become integral to applications. – Bill Jacobs, Vice President of Product Marketing, Revolution Analytics


Big Data will turn to Big Documentation – Most people won’t know what to do with all of the data they have. We already see people collecting far more data than they know what to do with. We already see people “hadumping” all sorts of data into Hadoop clusters without knowing how it’s going to be used. Ultimately, data has to be used in order for it to have value. … Big Documentation is around the corner. Information Builders


The Rise of the Chief-IoT-Officer: In the not too distant past, there was an emerging technology trend called “eBusiness”. Many CEOs wanted to accelerate the adoption of eBusiness across various corporate functions, so they appointed a change leader often known as the “VP of eBusiness,” who partnered with functional leaders to help propagate and integrate eBusiness processes and technologies within legacy operations. IoT represents a similar transformational opportunity. As CEOs start examining the implications of IoT for their business strategy, there will be a push to drive change and move forward faster. A new leader, called the Chief IoT Officer, will emerge as an internal champion to help corporate functions identify the possibilities and accelerate adoption of IoT on a wider scale. ParStream


The conversation around big data analytics is becoming less about technology and more about driving successful business use cases. In 2015, we’re going to see a continuous movement out of IT and into generating ROI-oriented deployment models. Secondly, I think we’re going to see resources move away from heavy in-house big data infrastructure to big data-as-a-service in the cloud. We’re already seeing a lot of investment in this area and I expect this to steadily grow. – Stefan Groschupf, CEO at Datameer


The cloud will increasingly become the deployment model for BI and predictive analytics – particularly the private cloud, powered by the cost advantages of Hadoop and fast access to analytic value. – Mike Hoskins, CTO at Actian


Big Data meets Concurrency – New Big Data applications will emerge that have multiple users reading and writing data concurrently, while data streams in simultaneously from connected systems. Concurrent applications will overtake batch data science as the most interesting Hadoop use case. – Monte Zweben, co-founder and CEO of Splice Machine



The term “Business Intelligence” will morph into “Data Intelligence.” BI will finally evolve from being a reporting tool into data intelligence that every entity from governments to cities to individuals will use to prevent traffic, detect fraud, track diseases, manage personal health and even notify you when your favorite fruit has arrived at your local market. We will see the consumerization of BI, where it will extend beyond the business world and become intricately woven into our everyday lives, directly impacting the decisions we make. – Eldad Farkash, co-founder and CTO, Sisense


‘Data wrangling’ will be the biggest area requiring innovation, automation and simplicity. Modeling and wrangling data from disparate systems into shape for insights has for decades been lengthy, tedious and labor-intensive. Most organizations today spend 70-80% of their time modeling and preparing data rather than interacting with data to generate business-critical insights. Simplifying data prep and data wrangling through automation will take shape in 2015 so businesses can reach a fast clip on real data-driven insights. – Sharmila Mulligan, CEO and founder of ClearStory Data


In the coming year, analytics will have the power to become the next killer app to legitimize the need for hybrid cloud solutions. Analytics has the ability to mine vast amounts of data from diverse sources, deliver value and build predictions without huge data landfills. In addition, the ability to apply predictions to the myriad decisions made daily – and do so within applications and systems running on-premises – is unprecedented. – Dave Rich, CEO, Revolution Analytics


Increasing Role of Open Source in Enterprise Software – Data warehousing and BI have long been the domain of proprietary software concentrated across a handful of vendors. However, the last 10 years have seen the emergence and increasing prevalence of Hadoop, and subsequently Spark, as lower-cost open source alternatives that deliver the scale and sophistication needed to gain insights from Big Data. The Hadoop-related ecosystem is projected to be $25B by 2020, and Spark is now distributed by 10+ vendors, including SAP, Oracle, Microsoft, and Teradata, with support for all major BI tools, including Tableau, Qlik, and Microstrategy. – Ali Ghodsi, Head of Product Management and Engineering, Databricks


Personal predictive technology will come to the forefront – and fail. – As analysts see that they can use data discovery and other analytical tools, they’ll want the power of predictions to fall within their grasp, too. Unfortunately, most people don’t understand statistics well enough (see the “Monty Hall problem”) to make predictive models that really work, no matter how simple the tools make it. If anything, simple tools are more likely to get them into trouble. Information Builders

Real-Time Big Data – Companies will act on real-time data streams with data-driven, intelligent applications, instead of acting on yesterday’s data that was batch ingested last night. – Monte Zweben, co-founder and CEO of Splice Machine


Analytics in the cloud will become pervasive. As organizations continue to rely on various cloud-based services for their mission-critical operations, analytics in the cloud will become a prevalent deployment option. Not only does the cloud offer self-service, intuitive experiences that allow organizations to implement an analytics solution in minutes, it speeds insight and enables data sharing across internal and external data sources. By reducing the over-reliance on IT, users can focus on asking new questions and finding new answers at an unprecedented rate. – Sharmila Mulligan, CEO and founder of ClearStory Data




If you wonder what the government has done for you lately, take a look at DeepDive. DeepDive is a counterpart to IBM’s Watson: it was developed in the same Defense Advanced Research Projects Agency (DARPA) program as Watson, but is being made available free and open-source.

Although it has never been pitted against IBM’s Watson, DeepDive has gone up against a more fleshy foe: the human being. The result: DeepDive beat, or at least equaled, humans in the time it took to complete an arduous cataloging task. These were no ordinary humans, but expert human catalogers tackling the same task as DeepDive — to read technical journal articles and catalog them by understanding their content.

“We tested DeepDive against humans performing the same tasks and DeepDive came out ahead or at least equaled the efforts of the humans,” professor Shanan Peters, who supervised the testing, told EE Times.

DeepDive is free and open-source, which was the idea of its primary programmer, Christopher Re.

“We started out as part of a machine reading project funded by DARPA in which Watson also participated,” Re, a professor at the University of Wisconsin, told EE Times. “Watson is a question-answering engine (although now it seems to be much bigger). [In contrast] DeepDive’s goal is to extract lots of structured data” from unstructured data sources.

DeepDive incorporates probability-based learning algorithms as well as open-source tools such as MADlib and Impala (from Cloudera), and low-level techniques such as Hogwild, some of which have also been included in Microsoft’s Adam. To build DeepDive into your application, you should be familiar with SQL and Python.


DeepDive was developed in the same Defense Advanced Research Projects Agency (DARPA) program as IBM’s Watson, but is being made available free by its programmers at the University of Wisconsin-Madison.

“Underneath the covers, DeepDive is based on a probability model; this is a very principled, academic approach to building these systems, but the question for us was ‘could it actually scale in practice?’ Our biggest innovations in DeepDive have to do with giving it this ability to scale,” Re told us.


For the future, the DeepDive team aims to prove the system in other domains.

“We hope to have similar results in those domains soon, but it’s too early to be very specific about our plans here,” Re told us. “We use a RISC processor right now, we’re trying to make a compiler and we think machine learning will let us make it much easier to program in the next generation of DeepDive. We also plan to get more data types into DeepDive: images, figures, tables, charts, spreadsheets — a sort of ‘Data Omnivore’ to borrow a line from Oren Etzioni.”

Get all the details in the free download, which is going at 10,000 downloads per week.





Databases are the spine of the tech industry: unsung, invisible, but critical–and beyond disastrous when they break or are deformed. This makes database people cautious. For years, only the Big Three–Oracle, IBM’s DB2, and (maybe) SQL Server–were serious options. Then the open-source alternatives–MySQL, PostgreSQL–became viable. …And then, over the last five years, things got interesting.

Some history: around the turn of this millennium, more and more people began to recognize that formal, structured, normalized relational databases, interrogated by variants of SQL, often hindered rather than helped development. Over the following decade, a plethora of new databases bloomed, especially within Google, which had a particular need for web-scale datastore solutions: hence BigTable, Megastore and Spanner.

Meanwhile, Apache brought us Cassandra, HBase, and CouchDB; Clustrix offered a plug-and-play scalable MySQL replacement; Redis became a fundamental component of many Rails (and other) apps; and, especially, MongoDB became extremely popular among startups, despite vociferous criticism — in particular, of its write lock which prevented concurrent write operations across entire databases. This will apparently soon be much relaxed, after which there will presumably be much rejoicing. (For context: I’m a developer, and have done some work with MongoDB, and I’m not a fan.)

As interesting as these new developments–called “NoSQL databases”–were, though, only bleeding-edge startups and a tiny handful of other dreamers were really taking them seriously. Databases are beyond mission-critical, after all. If your database is deformed, you’re in real trouble. If your database doesn’t guarantee the integrity of its data and your transactions–i.e. if it doesn’t substantially support what are known as “ACID transactions”–then real database engineers don’t take it seriously:

MongoDB is not ACID compliant. Neither is Cassandra. Neither is Riak. Neither is Redis. Etc etc etc. In fact, it was sometimes claimed that NoSQL databases were fundamentally incompatible with ACID compliance. This isn’t true — Google’s Megastore is basically ACID compliant, and their Spanner is even better — but you can’t use Megastore outside of Google unless you’re willing to build your entire application on their idiosyncratic App Engine platform.

Which is why I was so intrigued a couple of years ago when I stumbled across a booth at TechCrunch Disrupt whose slogan was “NoSQL, YesACID.” It was hosted by a company named FoundationDB, who have performed the remarkable achievement of building an ACID-compliant key-value datastore while also providing a standard SQL access layer on top of it. Earlier this week they announced the release of FoundationDB 3.0, a remarkable twenty-five times faster than their previous version, thanks to what its co-founder and COO compares to a “heart and lungs transplant” for their engine. This new engine scales up to a whopping 14.4 million writes per second.

That is quite a feat of engineering. To quote their blog post, this isn’t just 14 million writes per second, it’s 14 million “in a fully-ordered, fully-transactional database with 100% multi-key cross-node transactions […] in the public cloud […] Said another way, FoundationDB can do 3.6 million database writes per penny.”

Impressive stuff. Impressive enough to capture the attention of enterprise database engineers, maybe. And obviously a great fit with the forthcoming Internet of Things, and the enormous amount of data that billions of connected devices will soon be constantly capturing.

But most importantly, this will push their competitors to do even better — which, in turn, will hopefully nudge the enormous numbers of enterprises still in the database Bronze Ages, running off Oracle and DB2, to consider maybe, just maybe, beginning to slowly, cautiously, carefully move into the bold new present day, in which developers are spoiled with simple key-value semantics, the full power of classic SQL queries, and distributed ACID transactions, all at the same time. In the long run that will make life better. In the interim, hats off to all the unsung database engineers out there pushing the collective envelope. You may not realize it, but they’re doing us all a huge service.




Naive Bayes classification is a simple, yet effective algorithm. It’s commonly used in things like text analytics and works well on both small datasets and massively scaled out, distributed systems.

How does it work?

Naive Bayes is based on, you guessed it, Bayes’ theorem. Think back to your first statistics class. Bayes’ theorem was that seemingly counterintuitive lecture on conditional probability.


Bayes’ Theorem

The formula, P(A|B) = P(B|A) × P(A) / P(B), might look intimidating, but it’s actually not that complicated. To explain it, instead of using “events A and B“, I’m going to use something a little more familiar. Let’s say the two events in question are:

A) I watched The Lego Movie today
B) I sat on the couch today

So for my 2 events, let’s break it down into its Bayesian components:

I’ve seen The Lego Movie 10 times in the past 2 months–this is a lot, I know. I’ve been lucky enough that it’s been playing on almost every plane I’ve been on (as it should be! it’s a great movie for both adults and kids). Since I’ve watched The Lego Movie 10 out of the last 60 days, we’ll say that:

P(A) = P(I watched The Lego Movie today) = 10 / 60, or ~0.17

I sit on the couch most days I’m at my apartment. I’ve traveled 14 days in the past 2 months, so to keep it simple, we’ll assume I sat on my couch at least once on every other day (hey, it’s pretty comfy).

P(B) = P(I sat on the couch today) = (60 - 14) / 60, or ~0.76

I’ve seen The Lego Movie 10 times and 4 of those times have been on a plane. I think it’s pretty safe to assume the rest of those times I was seated comfortably in my living room. So given that I’ve had 46 days of couch time in the past 2 months, we can say that I watched The Lego Movie from my couch 6 out of 10 times.

P(B|A) = P(I sat on the couch given that I watched The Lego Movie) = 6 / 10 = 0.60

Ok, ready for the magic! Using Bayes’ theorem, I now have everything I need to calculate the Probability that I watched The Lego Movie today given that I sat on the couch.

P(A|B)=P(B|A)*P(A)/P(B) = (0.60 * 0.17) / 0.76

P(I watched The Lego Movie given that I sat on the couch) = 0.13

And voilà! Given that I sat on the couch today, there is a 13% chance that I also watched The Lego Movie (wow, that’s a lot of Lego time).
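The arithmetic above is easy to check in a few lines of Python (this just restates the example’s numbers, nothing more):

```python
# Worked example from above: P(watched The Lego Movie | sat on the couch)
p_a = 10 / 60          # P(A): watched The Lego Movie on 10 of the last 60 days
p_b = (60 - 14) / 60   # P(B): sat on the couch on 46 of the last 60 days
p_b_given_a = 6 / 10   # P(B|A): 6 of the 10 viewings happened on the couch

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 2))  # 0.13
```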

Now I wonder what the probability of me watching The Lego Movie from a double decker couch would be?

Why should I use it?

Where you see Naive Bayes classifiers pop up a lot is in document classification. Naive Bayes is a great choice for this because it’s pretty fast, it can handle a large number of features (i.e. words), and it’s actually really effective. Take a look at what happens when you do some basic benchmarking between Naive Bayes and other methods like SVM and RandomForest against the 20 Newsgroups dataset.



Naive Bayes wins! Granted this is a relatively simple approach without much in terms of feature engineering, but in my opinion that’s part of the beauty of Naive Bayes!

Code for benchmarking is available here.

Document Classification

For our example we’re going to be attempting to classify whether a Wikipedia page is referring to a dinosaur or a cryptid (an animal from cryptozoology; think the Loch Ness Monster or Bigfoot).

My favorite Yeti, The Bumble, from the stop-motion holiday classic Rudolph the Red-Nosed Reindeer

We’ll be using the text from each Wikipedia article as features. What we’d expect is that certain words like “sighting” or “hoax” would be more commonly found in articles about cryptozoology, while words like “fossil” would be more commonly found in articles about dinosaurs.

We’ll do some basic word-tokenization to count the occurrences of each word and then calculate conditional probabilities for each word as it pertains to our 2 categories.

You can find the sample documents I used and the corresponding code here.

Tokenizing and counting

First things first. We need to turn our files full of text into something a little more mathy. The simplest way to do this is to take the bag of words approach. That just means we’ll be counting how many times each word appears in each document. We’ll also perform a little text normalization by removing punctuation and lowercasing the text (this means “Hello,” and “hello” will now be considered the same word).

Once we’ve cleaned the text, we need a way to delineate words. A simple approach is to just use a good ‘ole regex that splits on whitespace and punctuation:
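The post’s original snippet isn’t reproduced here, so the following is a minimal sketch of the tokenizer described above; the exact regex and function names are my own, not necessarily the author’s:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text, strip punctuation, and split into words."""
    # keep runs of letters, digits, and apostrophes; everything else is a separator
    return re.findall(r"[a-z0-9']+", text.lower())

def count_words(text):
    """Bag of words: map each token to its number of occurrences."""
    return Counter(tokenize(text))

print(count_words("Hello, hello world!"))  # Counter({'hello': 2, 'world': 1})
```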

Calculating our probabilities

So now that we can count words, let’s get cooking. The code below is going to do the following:

  • open each document
  • label it as either “crypto” or “dino” and keep track of how many of each label there are (priors)
  • count the words for the document
  • add those counts to the vocab, or a corpus level word count
  • add those counts to the word_counts, for a category level word count
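The steps above might look something like the sketch below. The names `priors`, `vocab` and `word_counts` come straight from the list, but the rest of the structure (and the toy training strings) are my assumptions, not the author’s exact code:

```python
import re
from collections import Counter, defaultdict

def count_words(text):
    """Lowercase, strip punctuation, and count token occurrences."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

priors = {"crypto": 0, "dino": 0}           # documents seen per category
vocab = defaultdict(int)                     # corpus-level word counts
word_counts = {"crypto": defaultdict(int),   # per-category word counts
               "dino": defaultdict(int)}

def train(labeled_docs):
    """labeled_docs: list of (category, text) pairs."""
    for category, text in labeled_docs:
        priors[category] += 1                       # track the label counts
        for word, n in count_words(text).items():
            vocab[word] += n                        # corpus-level tally
            word_counts[category][word] += n        # category-level tally

train([("crypto", "A sighting! Another hoax sighting."),
       ("dino", "A fossil, then another fossil.")])
print(word_counts["crypto"]["sighting"])  # 2
```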

Classifying a new page

And finally it’s time for the math. We’re going to use the word counts we calculated in the previous step to calculate the following:

Prior Probability for each category, or for the layman, the percentage of documents that belong to each category. We have 9 crypto docs and 8 dino docs, so that gives us the following:

Prior Prob(crypto) = 9 / (8 + 9) = 0.53

Prior Prob(dino) = 8 / (8 + 9) = 0.47

Ok priors, check. The next thing we need are conditional probabilities for the words in the document we’re trying to classify. How do we do that? Well, we start by doing a word count on a new document; we’ll use the Yeti page as our new document. Once we’ve got our counts, we’ll calculate P(word|category) for each word and multiply these conditional probabilities together to calculate P(category|set of words). To prevent computational errors, we’re going to perform the operations in logspace. All this means is we’re going to use the log(probability) so we require fewer decimal places. More on the mystical properties of logs here and here.
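Since the original code isn’t shown here, below is a sketch of the log-space scoring just described, trained on a tiny stand-in corpus (the real post uses whole Wikipedia pages). The add-one smoothing for unseen words is my choice, not necessarily the author’s:

```python
import math
import re
from collections import Counter

def count_words(text):
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

# Tiny stand-in corpus; the words mirror the article's intuition
training = [("crypto", "sighting hoax sighting bigfoot"),
            ("dino", "fossil jurassic fossil bones")]

priors = Counter(cat for cat, _ in training)
word_counts = {"crypto": Counter(), "dino": Counter()}
for cat, text in training:
    word_counts[cat].update(count_words(text))

def score(text, category):
    """log P(category) + sum over words of log P(word|category),
    with add-one smoothing so unseen words don't zero everything out."""
    total_docs = sum(priors.values())
    log_prob = math.log(priors[category] / total_docs)   # log prior
    cat_total = sum(word_counts[category].values())
    vocab_size = len(set().union(*word_counts.values()))
    for word, n in count_words(text).items():
        p = (word_counts[category][word] + 1) / (cat_total + vocab_size)
        log_prob += n * math.log(p)
    return log_prob

new_doc = "another bigfoot sighting reported"
scores = {c: score(new_doc, c) for c in ("crypto", "dino")}
print(max(scores, key=scores.get))  # crypto
```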

Since we’re slightly bending the rules of Bayes’ Theorem, the results are not actual probabilities, but rather are “scores”. All you really need to know is which one is bigger. So our suspicions are confirmed, the “Yeti.txt” file is being classified overwhelmingly in favor of crypto (as we would hope).

Bringing it home, the LEGO Yeti!

Final Thoughts

You can find all the code and documents used in this post on GitHub.

Naive Bayes is great because it’s fairly easy to see what’s going on under the hood. It’s a great way to start any text analysis and it can easily scale out of core to work in a distributed environment. There are some excellent implementations in the Python community you can use as well, so if you don’t want to roll your own, have no fear! The scikit-learn and nltk versions are great places to start.



An intro to Bayesian methods and probabilistic programming from a computation/understanding-first, mathematics-second point of view.


The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. The typical text on Bayesian inference involves two to three chapters on probability theory before entering what Bayesian inference is. Unfortunately, due to the mathematical intractability of most Bayesian models, the reader is only shown simple, artificial examples. This can leave the user with a so-what feeling about Bayesian inference. In fact, this was the author’s own prior opinion.

After some recent success of Bayesian methods in machine-learning competitions, I decided to investigate the subject again. Even with my mathematical background, it took me three straight days of reading examples and trying to put the pieces together to understand the methods. There was simply not enough literature bridging theory to practice. The problem with my misunderstanding was the disconnect between Bayesian mathematics and probabilistic programming. That being said, I suffered then so the reader would not have to now. This book attempts to bridge the gap.

If Bayesian inference is the destination, then mathematical analysis is a particular path towards it. On the other hand, computing power is cheap enough that we can afford to take an alternate route via probabilistic programming. The latter path is much more useful, as it denies the necessity of mathematical intervention at each step; that is, we remove often-intractable mathematical analysis as a prerequisite to Bayesian inference. Simply put, this latter computational path proceeds via small intermediate jumps from beginning to end, whereas the first path proceeds by enormous leaps, often landing far away from our target. Furthermore, without a strong mathematical background, the analysis required by the first path cannot even take place.

Bayesian Methods for Hackers is designed as an introduction to Bayesian inference from a computational/understanding-first, and mathematics-second, point of view. Of course, as an introductory book, we can only leave it at that: an introductory book. The mathematically trained may cure the curiosity this text generates with other texts designed with mathematical analysis in mind. For the enthusiast with less mathematical background, or one who is not interested in the mathematics but simply the practice of Bayesian methods, this text should be sufficient and entertaining.

The choice of PyMC as the probabilistic programming language is two-fold. First, as of this writing there is no central resource for examples and explanations in the PyMC universe. The official documentation assumes prior knowledge of Bayesian inference and probabilistic programming. We hope this book encourages users at every level to look at PyMC. Secondly, with recent core developments and the popularity of the scientific stack in Python, PyMC is likely to become a core component soon enough.

PyMC does have dependencies to run, namely NumPy and (optionally) SciPy. To not limit the user, the examples in this book will rely only on PyMC, NumPy, SciPy and Matplotlib.


(The chapters below are rendered via nbviewer, and are read-only and rendered in real-time. Interactive notebooks and examples can be downloaded by cloning!)

More questions about PyMC? Please post your modeling, convergence, or any other PyMC question on Cross Validated, the statistics Stack Exchange.

Examples from the book

Below are just some examples from Bayesian Methods for Hackers.

Inferring behaviour changes using SMS message rates

Chapter 1

By only visually inspecting a noisy stream of daily SMS message rates, it can be difficult to detect a sudden change in the user’s SMS behaviour. In our first probabilistic programming example, we solve the problem by setting up a simple model to detect probable points where the user’s behaviour changed, and examine pre- and post-change behaviour.

Simpler AB Testing

Chapter 2

AB testing, also called randomized experiments in other literature, is a great framework for determining the difference between competing alternatives, with applications to web designs, drug treatments, advertising, plus much more.

With our new interpretation of probability, a more intuitive method of AB testing is demonstrated. And since we are not dealing with confusing ideas like p-values or Z-scores, we can compute more understandable quantities about our uncertainty.
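The book demonstrates this with PyMC. Purely to give a flavor of the idea, here is a minimal conjugate Beta-Binomial sketch using only NumPy; the visitor and conversion numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed data for two site designs (made up for illustration)
visitors_a, conversions_a = 1500, 120   # design A
visitors_b, conversions_b = 1500, 145   # design B

# With a flat Beta(1, 1) prior, each conversion rate's posterior is
# Beta(conversions + 1, misses + 1); draw samples from both posteriors.
post_a = rng.beta(conversions_a + 1, visitors_a - conversions_a + 1, 100_000)
post_b = rng.beta(conversions_b + 1, visitors_b - conversions_b + 1, 100_000)

# A directly interpretable quantity: the probability that B's true
# conversion rate beats A's, given the observed data
p_b_better = (post_b > post_a).mean()
print(f"P(B > A) = {p_b_better:.2f}")
```

Unlike a p-value, `p_b_better` answers the question we actually care about: how likely is it that B is genuinely better than A?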

Discovering cheating while maintaining privacy

Chapter 2

A very simple algorithm can be used to infer proportions of cheaters, while also maintaining the privacy of the population. For each participant in the study:

  1. Have the user privately flip a coin. If heads, answer “Did you cheat?” truthfully.
  2. If tails, flip again. If heads, answer “Yes” regardless of the truth; if tails, answer “No”.

This way, the surveyors do not know whether a cheating confession is a result of actual cheating or of a heads on the second coin flip. But how do we cut through this scheme and perform inference on the true proportion of cheaters?
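The two-flip scheme above is easy to simulate. The sketch below recovers the true cheating rate from the randomized answers using the identity P(yes) = 0.5 · p + 0.25 that the two coin flips imply (the 30% cheating rate is invented for illustration; the book does the inference with a full Bayesian model instead):

```python
import random

random.seed(42)

def randomized_answer(truly_cheated):
    """One participant's answer under the privacy scheme above."""
    if random.random() < 0.5:        # first flip is heads: answer truthfully
        return truly_cheated
    return random.random() < 0.5     # tails: second flip decides the answer

# Simulate a population where 30% actually cheated (invented rate)
true_rate = 0.30
answers = [randomized_answer(random.random() < true_rate)
           for _ in range(100_000)]
observed_yes = sum(answers) / len(answers)

# By construction P(yes) = 0.5 * true_rate + 0.25, so invert to estimate:
estimated_rate = 2 * (observed_yes - 0.25)
print(round(estimated_rate, 2))  # close to 0.30
```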

Challenger Space Shuttle disaster

Chapter 2

On January 28, 1986, the twenty-fifth flight of the U.S. space shuttle program ended in disaster when one of the rocket boosters of the Shuttle Challenger exploded shortly after lift-off, killing all seven crew members. The presidential commission on the accident concluded that it was caused by the failure of an O-ring in a field joint on the rocket booster, and that this failure was due to a faulty design that made the O-ring unacceptably sensitive to a number of factors including outside temperature. Of the previous 24 flights, data were available on failures of O-rings on 23, (one was lost at sea), and these data were discussed on the evening preceding the Challenger launch, but unfortunately only the data corresponding to the 7 flights on which there was a damage incident were considered important and these were thought to show no obvious trend.

We examine this data in a Bayesian framework and show strong support that an O-ring failure, made more likely by low ambient temperatures, was the probable cause of the disaster.
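As a rough sketch of the temperature-failure relationship (on synthetic data, not the actual shuttle records), a logistic model p(defect | t) = 1/(1 + exp(−(a + b·t))) can be fit by plain gradient ascent; the book instead places priors on a and b and samples the posterior:

```python
import math

def fit_logistic(temps, failures, iters=20000, lr=0.05):
    """Fit p(failure | t) = 1/(1 + exp(-(a + b*x))) by gradient ascent on the
    log-likelihood, where x is the temperature centered at its mean."""
    mean_t = sum(temps) / len(temps)
    x = [t - mean_t for t in temps]
    a = b = 0.0
    for _ in range(iters):
        ga = gb = 0.0
        for xi, yi in zip(x, failures):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            ga += yi - p              # gradient w.r.t. intercept
            gb += (yi - p) * xi       # gradient w.r.t. slope
        a += lr * ga / len(x)
        b += lr * gb / len(x)
    return a, b, mean_t

def failure_prob(t, a, b, mean_t):
    return 1.0 / (1.0 + math.exp(-(a + b * (t - mean_t))))
```

A negative fitted slope b means colder launches carry higher estimated failure probability, which is the qualitative conclusion of the chapter.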

Understanding Bayesian posteriors and MCMC

Chapter 3

The prior-posterior paradigm is visualized to make understanding the MCMC algorithm more clear. For example, below we show how two different priors can result in two different posteriors.
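The idea can be demonstrated on a grid without any sampling at all. Below, the same coin-flip data is pushed through a flat prior and a skeptical prior (both hypothetical choices), producing visibly different posteriors:

```python
import numpy as np

def grid_posterior(prior_pdf, heads, tails, grid):
    """Discretized Bayes rule: posterior ∝ prior × likelihood on a grid."""
    post = prior_pdf(grid) * grid**heads * (1 - grid)**tails
    return post / (post.sum() * (grid[1] - grid[0]))  # normalize to a density

grid = np.linspace(0.001, 0.999, 999)
flat_prior = lambda p: np.ones_like(p)         # Beta(1, 1): all rates equally likely
skeptical_prior = lambda p: 20 * (1 - p)**19   # Beta(1, 20): expects a low rate
```

With 3 heads in 10 flips, the flat prior yields a posterior mean near 1/3, while the skeptical prior drags the posterior toward lower values: same data, different priors, different posteriors.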

Clustering data

Chapter 3

Given a dataset, sometimes we wish to ask whether there may be more than one hidden source that created it. A priori, it is not always clear this is the case. We introduce a simple model to try to pry data apart into two clusters.
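For intuition, a non-Bayesian cousin of the chapter's model is ordinary expectation-maximization for a two-component Gaussian mixture; this NumPy sketch (not the book's MCMC treatment) pries one-dimensional data into two clusters:

```python
import numpy as np

def two_cluster_em(x, iters=50):
    """EM for a two-component 1-D Gaussian mixture."""
    x = np.asarray(x, float)
    mu = np.array([x.min(), x.max()])          # crude but effective initialization
    sigma = np.array([x.std(), x.std()])
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each cluster for each point
        d = np.exp(-(x[:, None] - mu)**2 / (2 * sigma**2)) / sigma * w
        r = d / d.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and spreads
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu)**2).sum(axis=0) / nk)
    return w, mu, sigma
```

On data drawn from two well-separated sources, the recovered means land close to the true centers.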

Sorting Reddit comments from best to worst

Chapter 4

Consider ratings on online products: how often do you trust an average 5-star rating if there is only 1 reviewer? 2 reviewers? 3 reviewers? We implicitly understand that with such few reviewers that the average rating is not a good reflection of the true value of the product. This has created flaws in how we sort items, and more generally, how we compare items. Many people have realized that sorting online search results by their rating, whether the objects be books, videos, or online comments, return poor results. Often the seemingly top videos or comments have perfect ratings only from a few enthusiastic fans, and truly more quality videos or comments are hidden in later pages with falsely-substandard ratings of around 4.8. How can we correct this?
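One remedy in the spirit of the chapter is to sort not by the observed mean rating but by a pessimistic lower bound on it. Treating up/down votes as evidence for a Beta posterior, a normal-approximation lower bound looks like this (a sketch; the constant and prior are illustrative choices, not the book's verbatim formula):

```python
from math import sqrt

def lower_bound(upvotes, downvotes, z=1.65):
    """Approximate 95% lower bound on the mean of a Beta(1+up, 1+down)
    posterior over the item's true quality."""
    a, b = 1 + upvotes, 1 + downvotes
    mean = a / (a + b)
    std = sqrt(a * b / ((a + b)**2 * (a + b + 1)))
    return mean - z * std
```

An item with 100 upvotes and 5 downvotes now outranks one with a "perfect" 2-0 record, because the bound penalizes uncertainty from small samples.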

Solving the Price is Right’s Showcase

Chapter 5

Bless you if you are ever chosen as a contestant on the Price is Right, for here we will show you how to optimize your final bid in the Showcase. We create a Bayesian model of your best guess and your uncertainty in that guess, and push it through the odd Showdown loss function (closest wins; lose if you bid over).
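The decision step can be sketched with a stylized loss function (the exact loss used in the chapter differs; the overbid penalty here is an illustrative number): given posterior samples of the true price, pick the bid with the lowest expected loss:

```python
import random

def showdown_loss(bid, true_price, overbid_penalty=80000):
    """Stylized loss: overbidding forfeits everything; otherwise the loss is
    how far under the true price the bid was. (Hypothetical functional form.)"""
    return overbid_penalty if bid > true_price else true_price - bid

def best_bid(posterior_samples, candidate_bids):
    """Pick the candidate bid minimizing expected loss under the posterior."""
    def expected_loss(bid):
        return sum(showdown_loss(bid, s) for s in posterior_samples) / len(posterior_samples)
    return min(candidate_bids, key=expected_loss)
```

Because overbidding is catastrophic, the optimal bid sits noticeably below the posterior mean of your price guess: the loss function, not just the point estimate, drives the decision.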

Kaggle’s Dark World winning solution

Chapter 5

We implement Tim Salimans’ winning solution to the Observing Dark Worlds contest on the data science website Kaggle.

Bayesian Bandits – a solution to the Multi-Armed Bandit problem

Chapter 6

Suppose you are faced with N slot machines (colourfully called multi-armed bandits). Each bandit has an unknown probability of distributing a prize (assume for now the prizes are the same for each bandit, only the probabilities differ). Some bandits are very generous, others not so much. Of course, you don’t know what these probabilities are. By only choosing one bandit per round, our task is to devise a strategy to maximize our winnings.
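One elegant strategy, Thompson sampling, keeps a Beta posterior over each bandit's win rate and plays the bandit whose sampled rate is highest. A minimal stdlib sketch (the chapter develops this idea in more depth):

```python
import random

def thompson_sampling(true_probs, rounds, seed=0):
    """Play `rounds` pulls against bandits with the given (hidden) win
    probabilities, choosing arms by Thompson sampling."""
    rng = random.Random(seed)
    n = len(true_probs)
    wins, losses = [0] * n, [0] * n
    total = 0
    for _ in range(rounds):
        # sample a plausible win-rate for each bandit from its Beta posterior
        samples = [rng.betavariate(1 + wins[i], 1 + losses[i]) for i in range(n)]
        i = samples.index(max(samples))
        if rng.random() < true_probs[i]:
            wins[i] += 1
            total += 1
        else:
            losses[i] += 1
    return total, wins, losses
```

Exploration happens automatically: uncertain bandits occasionally produce high samples and get tried, while the play concentrates on the best arm as evidence accumulates.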

Stock Market analysis

Chapter 6

For decades, finance students have been taught to pick stocks using naive statistical methods. This has produced terrible inference, mostly due to two things: mis-specified temporal parameters and ignored uncertainty. The first is harder to solve; the second fits right into a Bayesian framework.

Using the book

The book can be read in three different ways, starting from most recommended to least recommended:

  1. The most recommended option is to clone the repository to download the .ipynb files to your local machine. If you have IPython installed, you can view the chapters in your browser plus edit and run the code provided (and try some practice questions). This is the preferred option to read this book, though it comes with some dependencies.
    • IPython v0.13 (or greater) is required to view the .ipynb files. It can be downloaded here. IPython notebooks can be run with: (your-virtualenv) ~/path/to/the/book/Chapter1_Introduction $ ipython notebook
    • For Linux users, you should not have a problem installing NumPy, SciPy, Matplotlib and PyMC. For Windows users, check out pre-compiled versions if you have difficulty.
    • In the styles/ directory are a number of files (.matplotlibrc) that are used to make things pretty. These are not only designed for the book, but they offer many improvements over the default settings of matplotlib.
    • While technically not required, it may help to run the IPython notebook with the ipython notebook --pylab inline flag if you encounter IO errors.
  2. The second option is to use the site, which displays IPython notebooks in the browser (example). The contents are updated as commits are made to the book. You can use the Contents section above to link to the chapters.
  3. PDF versions are available! Look in the PDF/ directory. PDFs are the least-preferred way to read the book, as PDFs are static and non-interactive. If PDFs are desired, they can be created dynamically using Chrome’s built-in print-to-PDF feature or the nbconvert utility.

Installation and configuration

If you would like to run the IPython notebooks locally, (option 1. above), you’ll need to install the following:

  1. IPython 0.13 is a requirement to view the ipynb files. It can be downloaded here.
  2. For Linux users, you should not have a problem installing NumPy, SciPy and PyMC. For Windows users, check out pre-compiled versions if you have difficulty. Also recommended, for data-mining exercises, are PRAW and requests.
  3. In the styles/ directory are a number of files that are customized for the notebook. These are not only designed for the book, but they offer many improvements over the default settings of matplotlib and the IPython notebook. The in-notebook style has not been finalized yet.


This book has an unusual development design. The content is open-sourced, meaning anyone can be an author. Authors submit content or revisions using the GitHub interface.

What to contribute?

  1. The current chapter list is not finalized. If you see something that is missing (MCMC, MAP, Bayesian networks, good prior choices, Potential classes etc.), feel free to start there.
  2. Cleaning up Python code and making code more PyMC-esque
  3. Giving better explanations
  4. Spelling/grammar mistakes
  5. Suggestions
  6. Contributing to the IPython notebook styles

We would like to thank the Python community for building an amazing architecture, and the statistics community for the amazing body of ideas this book draws on.

Similarly, the book is only possible because of the PyMC library. A big thanks to the core devs of PyMC: Chris Fonnesbeck, Anand Patil, David Huard and John Salvatier.

One final thanks. This book was generated by IPython Notebook, a wonderful tool for developing in Python. We thank the IPython community for developing the Notebook interface. All IPython notebook files are available for download on the GitHub repository.



While many commentators have focused on data-driven innovation in the United States and Western Europe, developing regions, such as Africa, also offer important opportunities to use data to improve economic conditions and quality of life. Three major areas where data can help are improving health care, protecting the environment, and reducing crime and corruption.

First, data-driven innovation can play a major role in improving health in Africa, including by advancing disease surveillance and medical research. Several recent examples have come to light during the ongoing West African Ebola virus outbreak of 2014, which has catalyzed international efforts to improve the continent’s disease surveillance infrastructure. One effort is an attempt to crowdsource contributions to OpenStreetMap, the self-described “Wikipedia for Maps” that anyone can edit. OpenStreetMap volunteers are using satellite images to manually identify roads, buildings, bodies of water, and other features in rural areas of West Africa, which can help aid workers and local public health officials better plan their interventions and ensure every village has been checked for the disease. The U.S. Centers for Disease Control and Prevention is also piloting a program to track aggregate cell phone location data in areas affected by Ebola to provide a better picture of disease reports in real time. Other efforts are targeting basic medical research to ensure that African people are not underrepresented in genomics research. The United Genomes Project hopes to counteract the trend of basing genomic research on primarily U.S. and European populations, which can result in treatments that are ineffective among other populations, by compiling the genomes of 1,000 Africans into an openly accessible database over the next several years.

Second, various projects are using data for conservation and environmental efforts in Africa. One such initiative, the Great Elephant Census, is attempting to count African elephants to help local authorities better target conservation efforts and fight poaching. The census uses imaging drones and automated image recognition techniques to collect data that otherwise would have been too expensive or too difficult to collect using traditional data collection techniques. The University of California Santa Barbara’s Climate Hazards Group recently released a near-real time rainfall data set to help government agencies around the world detect droughts as rapidly as possible. The data is already being used to identify burgeoning areas of food insecurity, including in drought-plagued areas of East Africa. The Trans-African Hydro-Meteorological Observatory is a Delft University of Technology-led collaboration among 14 universities and several private sector organizations to use cheap sensors to collect localized weather information at 20,000 locations in sub-Saharan Africa. The resulting data will be available freely for scientific research and government use.

Third, data is helping reduce crime and corruption in Africa. The U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide is working on a public early-warning system for mass atrocities around the world that uses data from news reports and other sources to predict what countries carry the highest risk of genocide and other violent events in the near future. It will be rolled out during next year’s elections in Nigeria, which have historically been marred by violence. Israeli startup Windward uses satellite imagery to flag potentially illegal activity, including illegal fishing and piracy, around the world. Windward’s algorithms have already been used around the Horn of Africa to identify pirate activity. Other efforts are focused around government corruption. These include the South Africa-based Parliamentary Monitoring Group, which compiles and publishes data about local politicians, and Ghana-based Odekro, a website that monitors politicians’ behavior and publishes public debate transcripts and other political information.

These examples represent just a subset of the many ways that data-driven innovation is having a positive impact on Africa and addressing problems that have plagued African countries for decades. Although many of these innovative approaches come from international efforts, some are homegrown as well. For example, Nairobi-based Gro Ventures collects regular data on crop yields and commodity prices from local farmers and uses the data to build risk models that banks can use to make loans to farmers. The number of opportunities will continue to grow as the technology becomes cheaper, data becomes more plentiful, and the skills needed to perform analysis become more widely available.



IBM’s Watson Analytics service is now in open beta. The natural language-based system, born of the same programme that developed the company’s Jeopardy-playing supercomputer, offers predictive and visual analytics tools for businesses.

Early this summer, IBM announced it is investing more than $1 billion into commercializing Watson. Watson Analytics is part of that effort. The company promises that it can automate tasks such as data preparation, predictive analysis and visual storytelling.

IBM will offer Watson Analytics as a cloud-based freemium service, accessible via the Web and mobile devices. Since it announced the programme in the summer, 22,000 people have registered for the beta.

The launch of Watson Analytics follows the announcement two months ago that IBM has teamed up with Twitter to apply the Watson technology to analysing data from the social network.





Are you searching for some best books to get acquainted with the basics of AI? Here is our list!

1. A Course in Machine Learning

Machine learning is the study of computer systems that learn from data and experience. It is applied in an incredibly wide variety of application areas, from medicine to advertising, from military to pedestrian. Any area in which you need to make sense of data is a potential customer of machine learning.

2. Simply Logical: Intelligent Reasoning by Example

An introduction to Prolog programming for artificial intelligence covering both basic and advanced AI material. A unique advantage to this work is the combination of AI, Prolog and Logic. Each technique is accompanied by a program implementing it. Seeks to simplify the basic concepts of logic programming. Contains exercises and authentic examples to help facilitate the understanding of difficult concepts.

3. Logic for Computer Science: Foundations of Automatic Theorem Proving

Covers the mathematical logic necessary to computer science, emphasising algorithmic methods for solving proofs. Treatment is self-contained, with all required mathematics contained in Chapter 2 and the appendix. Provides readable, inductive definitions and offers a unified framework using Gentzen systems.

4. Artificial Intelligence: Foundations of Computational Agents

This textbook, aimed at junior to senior undergraduate students and first-year graduate students, presents artificial intelligence (AI) using a coherent framework to study the design of intelligent computational agents. By showing how basic approaches fit into a multidimensional design space, readers can learn the fundamentals without losing sight of the bigger picture.

5. From Bricks to Brains: The Embodied Cognitive Science of LEGO Robots

From Bricks to Brains introduces embodied cognitive science and illustrates its foundational ideas through the construction and observation of LEGO Mindstorms robots. Discussing the characteristics that distinguish embodied cognitive science from classical cognitive science, the book places a renewed emphasis on sensing and acting, the importance of embodiment, the exploration of distributed notions of control, and the development of theories by synthesising simple systems and exploring their behavior.

6. Practical Artificial Intelligence Programming in Java

This book has been written for both professional programmers and home hobbyists who already know how to program in Java and who want to learn practical AI programming techniques. In the style of a “cook book”, the chapters in this book can be studied in any order. Each chapter follows the same pattern: a motivation for learning a technique, some theory for the technique, and a Java example program that you can experiment with.

7. An Introduction to Logic Programming Through Prolog

This is one of the few texts that combines three essential theses in the study of logic programming: the logic that gives logic programs their unique character; the practice of programming effectively using the logic; and the efficient implementation of logic programming on computers.

8. Essentials of Metaheuristics

The book covers a wide range of algorithms, representations, selection and modification operators, and related topics, and includes 70 figures and 133 algorithms great and small.

9. A Quick and Gentle Guide to Constraint Logic Programming

Introductory and down-to-earth presentation of Constraint Logic Programming, an exciting software paradigm, more and more popular for solving combinatorial as well as continuous constraint satisfaction problems and constraint optimisation problems.

10. Clever Algorithms: Nature-Inspired Programming Recipes

This book provides a handbook of algorithmic recipes from the fields of Metaheuristics, Biologically Inspired Computation and Computational Intelligence that have been described in a complete, consistent, and centralised manner. These standardised descriptions were carefully designed to be accessible, usable, and understandable.

11. Clever Algorithms: Nature-Inspired Programming Recipes

Covers the mathematical logic necessary to computer science, emphasizing algorithmic methods for solving proofs. Provides readable, inductive definitions and offers a unified framework using Gentzen systems. Offers unique coverage of congruence, and contains an entire chapter devoted to SLD resolution and logic programming (PROLOG).

12. Common LISP: A Gentle Introduction to Symbolic Computation

This highly accessible introduction to Lisp is suitable both for novices approaching their first programming language and experienced programmers interested in exploring a key tool for artificial intelligence research.

13. Bio-Inspired Computational Algorithms and Their Applications

This book integrates contrasting techniques of genetic algorithms, artificial immune systems, particle swarm optimisation, and hybrid models to solve many real-world problems. The works presented in this book give insights into the creation of innovative improvements over algorithm performance, potential applications on various practical tasks, and combination of different techniques.

14. The Quest for Artificial Intelligence

This book traces the history of the subject, from the early dreams of eighteenth-century (and earlier) pioneers to the more successful work of today’s AI engineers.

15. Planning Algorithms

Planning algorithms are impacting technical disciplines and industries around the world, including robotics, computer-aided design, manufacturing, computer graphics, aerospace applications, drug design, and protein folding. Written for computer scientists and engineers with interests in artificial intelligence, robotics, or control theory, this is the only book on this topic that tightly integrates a vast body of literature from several fields into a coherent source for teaching and reference in a wide variety of applications.

16. Virtual Reality – Human Computer Interaction

At present, the virtual reality has impact on information organisation and management and even changes design principle of information systems, which will make it adapt to application requirements. The book aims to provide a broader perspective of virtual reality on development and application.

17. Affective Computing

This book provides an overview of state of the art research in Affective Computing. It presents new ideas, original results and practical experiences in this increasingly important research field.

18. Machine Learning, Neural and Statistical Classification

This book is based on the EC (ESPRIT) project StatLog, which compared and evaluated a range of classification techniques, with an assessment of their merits, disadvantages and range of application. This integrated volume provides a concise introduction to each method, and reviews comparative trials in large-scale commercial and industrial problems.

19. Ambient Intelligence

Ambient Intelligence has attracted much attention from multidisciplinary research areas, and there are still open issues in most of them. In this book, a selection of unsolved problems considered key for ambient intelligence to become a reality is analysed and studied in depth.

20. The World and Mind of Computation and Complexity

With the increase in development of technology, there is research going into the development of human-like artificial intelligence that can be self-aware and act just like humans. This book explores the possibilities of artificial intelligence and how we may be close to developing a true artificially intelligent being.




Retailers have gained a unique advantage over consumers that previously never existed. They are now able to track and analyze customer behavior (online, mobile, and in-store) to better aim marketing campaigns, improve the customer experience, and increase revenue. Much of this would be impractical without Apache Hadoop, one of the most capable big data storage and processing frameworks available.

Where there once were customer panels, in-store surveys, focus groups, and guesswork, there is now social media, online search behavior, and easy-access customer input. Retailers can focus on the shopper as an individual, rather than aiming at the masses and hoping to snag a few.

Applications of Hadoop in Retail

Beyond marketing campaigns and targeted advertising, there are multiple other use cases for Hadoop in the retail industry. Five of the most common applications are detailed below.

Get to Know the Customer

Interacting with the customer has become easier than ever with the help of social media. Now consumers can send feedback directly, and ads can be focused based on statuses and search terms. Hadoop is able to scan and store transaction data and identify phases of the customer lifecycle to help retailers reduce inventory costs, increase sales, and build and retain a loyal customer base.

Analyze Brand Sentiment

Product launches, promotions, competitor moves, news stories, and in-store experiences all affect the customers’ opinions of the brand. It’s important for retailers to understand public perception of their brand so they can adjust promotions, advertisements, locations, and policies accordingly. Hadoop helps store and process this information from social media websites and browser searches to provide real-time perspective.

Localize and Personalize Promotions

In order to effectively localize and personalize retail promotions, it helps to have a mobile app that sends users personalized push notifications about promotions and specific products at a nearby location, aligned with their customer data. Geo-location technology combined with historical storage and real-time streaming data becomes a unique marketing tactic, and few platforms other than Hadoop have the capacity to support it.

Optimize Websites

When customers browse retail websites, clickstream data allows researchers to view and analyze the consumers’ click habits, transaction data, and site glitches. This information is helpful for prioritizing site updates, running A/B tests, doing basket analyses, and understanding user paths, in order to create a better overall shopping experience and more effectively reach customers as individuals. For example, if a pattern shows zero activity on a portion of the website, you can assume that either that part of the website needs IT attention or it is unappealing to customers, and make adjustments. Either way, a problem is resolved and revenue potentially increases.
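The zero-activity check described above is, at its core, a simple aggregation. A toy sketch (at real clickstream volumes this would run as a distributed Hadoop job; the event format here is hypothetical):

```python
from collections import Counter

def flag_dead_sections(click_events, all_sections):
    """Flag site sections with zero clickstream activity, which may signal
    either a technical glitch or an unappealing page."""
    hits = Counter(section for _, section in click_events)
    return [s for s in all_sections if hits[s] == 0]
```

The flagged sections become candidates for an IT fix or a design review, exactly the triage decision the paragraph describes.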

Redesign Store Layouts

Store layout has a significant impact on product sales, but the customer’s in-store shopping experience is the hardest for retailers to track. The lack of pre-register data leaves a gap in the information about what customers look at, how long they linger, etc. Often, businesses will hire unnecessary extra staff to make up for poor sales, when really the issue is poor product placement and store layout. Sensors have been developed to help fill the gap, such as RFID tags and QR codes. They store information through Hadoop that helps retailers improve the customer experience and reduce costs.


Apache Hadoop is a comprehensive big data storage and processing framework helping retailers optimize the customer experience, increase sales, and reduce costs related to inventory and marketing. With this expansive technology, businesses can focus their marketing efforts, get to know their customers and their needs, and create a customized shopping experience that will result in happy returning customers.