Markov chains, named after Andrey Markov, are mathematical systems that hop from one “state” (a situation or set of values) to another. For example, if you made a Markov chain model of a baby’s behavior, you might include “playing,” “eating,” “sleeping,” and “crying” as states, which together with other behaviors could form a “state space”: a list of all possible states. On top of the state space, a Markov chain tells you the probability of hopping, or “transitioning,” from one state to any other state—e.g., the chance that a baby currently playing will fall asleep in the next five minutes without crying first.

A simple, two-state Markov chain is shown below.

With two states (A and B) in our state space, there are 4 possible transitions (not 2, because a state can transition back into itself). If we’re at ‘A’ we could transition to ‘B’ or stay at ‘A’. If we’re at ‘B’ we could transition to ‘A’ or stay at ‘B’. In this two-state diagram, the probability of transitioning from any state to any other state is 0.5.

Of course, real modelers don’t always draw out Markov chain diagrams. Instead they use a “transition matrix” to tally the transition probabilities. Every state in the state space is included once as a row and again as a column, and each cell in the matrix tells you the probability of transitioning from its row’s state to its column’s state. So, in the matrix, the cells do the same job that the arrows do in the diagram.

If we add one state to the state space, we add one row and one column, which adds one cell to every existing row and column. This means the number of cells grows quadratically as we add states to our Markov chain. Thus, a transition matrix comes in handy pretty quickly, unless you want to draw a jungle-gym Markov chain diagram.
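To make the row-and-column bookkeeping concrete, here is a minimal sketch (ours, not from the original article) of the two-state matrix above, stored as a nested dict where the outer key is the “from” state and the inner key is the “to” state:

```python
# Transition matrix for the two-state chain: rows are "from" states,
# columns are "to" states.
P = {
    "A": {"A": 0.5, "B": 0.5},
    "B": {"A": 0.5, "B": 0.5},
}

def transition_prob(P, src, dst):
    """Probability of moving from state `src` to state `dst`."""
    return P[src][dst]

# Each row must sum to 1: from any state, we have to go *somewhere*.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in P.values())

print(transition_prob(P, "A", "B"))  # 0.5
```

Each cell here plays exactly the role of one arrow in the diagram.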

One use of Markov chains is to include real-world phenomena in computer simulations. For example, we might want to check how frequently a new dam will overflow, which depends on the number of rainy days in a row. To build this model, we start out with the following pattern of rainy (R) and sunny (S) days:

One way to simulate this weather would be to just say “Half of the days are rainy. Therefore, every day in our simulation will have a fifty percent chance of rain.” This rule would generate the following sequence in simulation:

Did you notice how the above sequence doesn’t look quite like the original? The second sequence seems to jump around, while the first one (the real data) seems to have a “stickiness.” In the real data, if it’s sunny (S) one day, then the next day is also much more likely to be sunny.

We can mimic this “stickiness” with a two-state Markov chain. When the Markov chain is in state “R”, it has a 0.9 probability of staying put and a 0.1 chance of leaving for the “S” state. Likewise, the “S” state has a 0.9 probability of staying put and a 0.1 chance of transitioning to the “R” state.
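The sticky chain can be simulated in a few lines of Python. This is our own illustrative sketch (the `simulate` helper, the seed, and the run length are not from the article):

```python
import random

# Sticky two-state weather chain from the text: from either state,
# stay with probability 0.9, switch with probability 0.1.
STICKY = {"R": {"R": 0.9, "S": 0.1},
          "S": {"S": 0.9, "R": 0.1}}

def simulate(P, start, n_days, rng):
    """Walk the chain for n_days, starting from `start`."""
    state, path = start, [start]
    for _ in range(n_days - 1):
        states = list(P[state])
        weights = [P[state][s] for s in states]
        state = rng.choices(states, weights=weights)[0]
        path.append(state)
    return "".join(path)

print(simulate(STICKY, "S", 30, random.Random(0)))
```

Runs of repeated letters dominate the output, reproducing the “stickiness” of the real data; replacing the 0.9/0.1 rows with 0.5/0.5 would reproduce the jumpy coin-flip sequence instead.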

In the hands of meteorologists, ecologists, computer scientists, financial engineers and other people who need to model big phenomena, Markov chains can get to be quite large and powerful. For example, PageRank, the algorithm Google uses to determine the order of search results, is a type of Markov chain.
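To make the PageRank connection concrete: a page’s rank is its long-run probability in a Markov chain whose states are pages and whose transitions follow links. Below is a small, illustrative power-iteration sketch over a hypothetical three-page link graph; the function, graph, and damping factor are our own teaching choices, not Google’s implementation:

```python
def pagerank(links, d=0.85, iters=100):
    """Power iteration on the PageRank Markov chain.
    `links[i]` lists the pages that page i links to."""
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iters):
        # With probability (1 - d), jump to a random page;
        # with probability d, follow one of the current page's links.
        new = [(1 - d) / n] * n
        for i, outs in enumerate(links):
            share = rank[i] / len(outs)
            for j in outs:
                new[j] += d * share
        rank = new
    return rank

# Hypothetical 3-page web: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
ranks = pagerank([[1], [2], [0, 1]])
print([round(r, 3) for r in ranks])
```

Page 1 ends up with the highest rank because two of the three pages link to it.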

Above, we’ve included a Markov chain “playground,” where you can make your own Markov chains by messing around with a transition matrix. Here are a few to work from as examples: ex1, ex2, ex3, or generate one randomly. The transition matrix text will turn red if the provided matrix isn’t a valid transition matrix: each row of the transition matrix must total 1, and there must be the same number of rows as columns.
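The two validity rules (each row sums to 1, and the matrix is square) can be checked mechanically. This small helper is our own sketch, not the playground’s actual validator:

```python
def is_valid_transition_matrix(M, tol=1e-9):
    """A valid transition matrix is square, non-negative,
    and every row sums to 1."""
    n = len(M)
    if any(len(row) != n for row in M):
        return False          # must be square
    if any(p < 0 for row in M for p in row):
        return False          # probabilities can't be negative
    return all(abs(sum(row) - 1.0) < tol for row in M)

print(is_valid_transition_matrix([[0.5, 0.5], [0.1, 0.9]]))  # True
print(is_valid_transition_matrix([[0.5, 0.6], [0.1, 0.9]]))  # False: row sums to 1.1
print(is_valid_transition_matrix([[1.0, 0.0]]))              # False: 1 row, 2 columns
```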

Source: Setosa.io


While not a new concept, gamification has been catching on in recent years as organizations figure out how best to utilize it. After a period of experimentation, many companies are starting to realize how effective it can actually be. Most of the time this is used as a way to boost the customer experience by fostering more interaction and a spirit of competition and gaming, all to benefit the overall bottom line of the business. More recently, however, gamification is being used in other matters, more specifically being directed internally to the employees themselves. In fact, a report from Gartner predicts that 40 percent of global 1000 organizations will use gamification this year to transform the way their businesses operate. As can be learned from the experiences of some organizations, gamification can also be used to tackle some serious issues, such as helping workers overcome their fear of failure.


To better understand how gamification can eliminate certain fears, it’s best to start with a clear understanding of what the concept is. According to gamification thought leader Gabe Zichermann, it is “the process of engaging people and changing behavior with game design, loyalty, and behavioral economics.” Essentially, it applies the joy of playing a game to topics and activities that might not be considered so enjoyable. Very few people like to fail, let alone confront the reasons why they failed. It’s a touchy subject that often requires a delicate hand to manage. With that in mind, it may seem strange that a game could help employees face their fears of failure, but that’s exactly what’s being done, often with good results.

One organization that has taken this idea and run with it is DirecTV. The satellite television company recently had a few IT project failures and decided it needed to address how to face these shortcomings. IT leaders came up with the idea to create a gamification learning platform where IT workers could discuss failure by creating and viewing videos about the subject. Simply creating the platform and posting videos wasn’t enough. IT leaders needed to come up with a way to make sure employees were using it and posting videos of their own. This all had to be done privately with enhanced network security due to the potentially sensitive nature of the videos. The gamification aspect comes in by awarding points and badges to those workers who do make use of it. Prizes were even given to those who scored the most points from the platform. The result was a major increase in usage rate among the IT staff. In fact, not long after the platform was launched, the company was experiencing a 97 percent participation rate, with most workers giving positive feedback on the experience.

Through DirecTV’s gamification platform, employees were able to talk openly about failing, discuss why the failures happened, and improve on their overall performance. IT leaders for DirecTV said later projects were completed much more smoothly as the whole staff became more successful. From this experience, other organizations may want to use gamification to eliminate the fear of failure among their own staffs, so it’s helpful to take note of a number of factors that can help businesses achieve this goal. First, gamification platforms need to be intuitive and all actions need to relate to the overarching goal (in this case, facing the fear of failure). Second, newcomers should have a way to start on the platform and be competitive. Facing other employees that have a lot of experience and accumulated points can be intimidating, so newcomers need a good entry point that gives them a chance to win. Third, the platform should constantly be adjusted based on user feedback. Creators who make the platform and don’t touch it afterward will soon see participation rates drop off and enthusiasm wane. And last, a gamification platform that doesn’t foster social interaction and competition is a platform not worth having.

Gamification allows employees to confront failure in a setting deemed safe and fun. People don’t enjoy failure and often go to great lengths to avoid even the very prospect of it, but sometimes to succeed, risks have to be taken. By helping workers overcome this fear with gamification, businesses can ensure future success and a more confident environment. It may seem a little outside the mainstream, but turning the fear of failure into a game can actually pay big dividends as employees learn there’s nothing wrong with trying hard and coming up short.




BlueData, a pioneer in Big Data private clouds, announced a technology preview of the Tachyon in-memory distributed storage system as a new option for the BlueData EPIC platform. Together with the company’s existing integration with Apache Spark, BlueData supports the next generation of Big Data analytics with real-time capabilities at scale, which allows organizations to realize value from their Big Data that wasn’t possible before. In addition, this new integration enables Hadoop and HBase virtual clusters, and other applications provisioned in the BlueData platform, to take advantage of Tachyon’s high-performance in-memory data processing.

Enterprises need to be able to run a wide variety of Big Data jobs such as trading, fraud detection, cybersecurity and system monitoring. These high performance applications require the ability to run in real-time and at scale in order to provide true value to the business. Existing Big Data approaches using Hadoop are relatively inflexible and do not fully meet the business needs for high speed stream processing. New technologies like Spark, which offers 100X faster data processing, and Tachyon, which offers 300X higher throughput, overcome these challenges.

“Big Data is about the combination of speed and scale for analytics. With the advent of the Internet of Things and streaming data, Big Data is helping enterprises make more decisions in real time. Spark and Tachyon will be the next generation of building blocks for interactive and instantaneous processing and analytics, much like Hadoop MapReduce and disk-based HDFS were for batch processing,” said Nik Rouda, senior analyst of Enterprise Strategy Group. “By incorporating a shared in-memory distributed storage system in a common platform that runs multiple clusters, BlueData streamlines the development of real-time analytics applications and services.”

However, incorporating these technologies with existing Big Data platforms like Hadoop requires point integrations on a cluster-by-cluster basis, which makes it manual and slow. With this preview, BlueData is streamlining infrastructure by creating a unified platform that incorporates Tachyon. This allows users to focus on building real-time processing applications rather than manually cobbling together infrastructure components.

“We are thrilled to welcome BlueData into the Tachyon community, and we look forward to working with BlueData to refine features for Big Data applications,” said Haoyuan Li, co-creator and lead of Tachyon.

The BlueData platform also includes high availability, auto tuning of configurations based on cluster size and virtual resources, and compatibility with each of the leading Hadoop distributions. Customers who deploy BlueData can now take advantage of these enterprise-grade benefits along with the memory-speed advantages of Spark and Tachyon for any Big Data application, on any server, with any storage.

“First generation enterprise data lakes and data hubs showed us the possibilities with batch processing and analytics. With the advent of Spark, the momentum has clearly shifted to in-memory and streaming with emerging use cases around IoT, real-time analytics and high speed machine learning. Tachyon’s appealing architecture has the potential to be a key foundational building block for the next generation logical data lake and key to the adoption and success of in-memory computing,” said Kumar Sreekanti, CEO and co-founder of BlueData. “BlueData is proud to deliver the industry’s first Big Data private cloud with a shared, distributed in-memory Tachyon file system. We look forward to continuing our partnership with Tachyon to deliver on our mission of democratizing Big Data private clouds.”



If you wish to process huge piles of data very, very quickly, you’re in luck.

From the comfort of your own data center, you can now use Google’s recently announced Dataflow programming model for processing data in batches or as it comes in, on top of the fast Spark open-source engine.

Cloudera, one company selling a distribution of Hadoop open-source software for storing and analyzing large quantities of different kinds of data, has been working with Google to make that possible, and the results of their efforts are now available for free under an open-source license, the two companies announced today.

The technology could benefit the burgeoning Spark ecosystem, as well as Google, which wants programmers to adopt its Dataflow model. If that happens, developers might well feel more comfortable storing and crunching data on Google’s cloud.

Google last year sent shockwaves through the big data world it helped create when Urs Hölzle, Google’s senior vice president of technical infrastructure, announced that Googlers “don’t really use MapReduce anymore.” In lieu of MapReduce, which Google first developed more than 10 years ago and which still lies at the heart of Hadoop, Google has largely switched to a new programming model for processing data in streaming or batch format.

Google has brought out a commercial service for running Dataflow on the Google public cloud. And late last year it went further and issued a Java software-development kit for Dataflow.

All the while, outside of Google, engineers have been making progress. Spark in recent years has emerged as a potential MapReduce successor.

Now there’s a solid way to use the latest system from Google on top of Spark. And that could be great news from a technical standpoint.

“[Dataflow’s] streaming execution engine has strong consistency guarantees and provides a windowing model that is even more advanced than the one in Spark Streaming, but there is still a distinct batch execution engine that is capable of performing additional optimizations to pipelines that do not process streaming data,” Josh Wills, Cloudera’s senior director of data science, wrote in a blog post on the news.



Mobile will force a fundamental change in the approach to BI. When it comes to mobile BI, adoption has been shockingly poor because it doesn’t usually work well with mobile devices. You can’t read data in depth on a mobile device, but rather you need to get to the point quickly. With everything shifting to mobile, the approach to BI will change. Rather than elaborate visualizations, you will see hard numbers, simple graphs and conclusions. For instance, with wearable devices, you might look at an employee and quickly see the KPI (key performance indicator). The BI game is about to change — primed to go mobile this year. – Adi Azaria, co-founder, Sisense


Big Data in 2014 was supposed to be about moving past the hype toward real-world business gains, but for most companies, Big Data is still in the experimental phase. With 2015, we will begin to see Big Data delivering on its promise, with Hadoop serving as the engine behind more practical and profitable applications of Big Data thanks to its reduced cost compared to other platforms. – Mike Hoskins, CTO at Actian


The data scientist role will not die or be replaced by generalists. In 2014 we saw C-suite recognition of Chief Data Officers as key additions to the executive team for enterprises across industry sectors. Data scientists will continue to provide generalists with new capabilities. R, along with BI tools, enables data science teams to make their work accessible through GUI analytics tools which amplify the knowledge and skills of the generalist. – David Smith, chief community officer, Revolution Analytics


The Internet of Things will gain momentum, but will still be in the early stages throughout 2015. – While we’ve seen the number of connected devices continue to rise, most people still aren’t putting network keys into their toasters. … Expect toasters to get connected, eventually, too — as people get readings of the nutritional value of the bread they toast, their energy consumption, and their carbon footprints, among other things — but the more mundane items won’t be connected for another year or two yet. Information Builders



Big Data becomes a “Big Target,” as the bad guys realize big data repositories are a gold mine of high-value data. – Ashvin Kamaraju, VP Product Development at Vormetric


Data will finally be about driving higher margins/bringing value to the enterprise. A lot of the “confusion” around big data has been defining it. 2015 will see companies deriving value. VoltDB


Big Data moving to the cloud – enterprises are increasingly using the cloud for Big Data analytics for a multitude of reasons: elastic infrastructure needs, faster provisioning time and time to value in the cloud, and increasing reliance on externally generated data (e.g., third-party data sources, Internet of Things and device-generated data, clickstream data). Spark – the most active project in the Big Data ecosystem – was optimized for cloud environments. The uptake of this trend is evident in the fact that large enterprise vendors – SAP, Oracle, IBM – and promising startups are all pushing cloud-based analytics solutions. – Ali Ghodsi, Head of Product Management and Engineering, Databricks


IoT drives stronger security for big data environments. With millions of high-tech wearable devices coming online daily, more personal private data, including health data, locations, search histories and shopping habits, will get stored and analyzed by big data analytics. Securing big data using encryption becomes an inevitable requirement. – Ashvin Kamaraju, VP Product Development at Vormetric


SQL will be a “must-have” to get the analytic value out of Hadoop data. We’ll see some vendor shake-out as bolt-on, legacy or immature SQL on Hadoop offerings cave to those that offer the performance, maturity and stability organizations need. – Mike Hoskins, CTO at Actian





In 2015, enterprises will move beyond data visualization to data actualization. The deployment of applications will require deployment of production-quality analytics that become integral to applications. – Bill Jacobs, Vice President of Product Marketing, Revolution Analytics


Big Data will turn to Big Documentation – Most people won’t know what to do with all of the data they have. We already see people collecting far more data than they know what to do with. We already see people “hadumping” all sorts of data into Hadoop clusters without knowing how it’s going to be used. Ultimately, data has to be used in order for it to have value. … Big Documentation is around the corner. Information Builders


The Rise of the Chief-IoT-Officer: In the not too distant past, there was an emerging technology trend called “eBusiness”. Many CEOs wanted to accelerate the adoption of eBusiness across various corporate functions, so they appointed a change leader often known as the “VP of eBusiness,” who partnered with functional leaders to help propagate and integrate eBusiness processes and technologies within legacy operations. IoT represents a similar transformational opportunity. As CEOs start examining the implications of IoT for their business strategy, there will be a push to drive change and move forward faster. A new leader, called the Chief IoT Officer, will emerge as an internal champion to help corporate functions identify the possibilities and accelerate adoption of IoT on a wider scale. ParStream


The conversation around big data analytics is becoming less about technology and more about driving successful business use cases. In 2015, we’re going to see a continuous movement out of IT and into generating ROI-orientated deployment models. Secondly, I think we’re going to see resources move away from heavy in-house big data infrastructure to big data-as-a-service in the cloud. We’re already seeing a lot of investment in this area and I expect this to steadily grow. – Stefan Groschupf, CEO at Datameer


The cloud will increasingly become the deployment model for BI and predictive analytics – particularly with the private cloud powered by the cost advantages of Hadoop and fast access to analytic value. – Mike Hoskins, CTO at Actian


Big Data meets Concurrency – New Big Data applications will emerge that have multiple users reading and writing data concurrently, while data streams in simultaneously from connected systems. Concurrent applications will overtake batch data science as the most interesting Hadoop use case. – Monte Zweben, co-founder and CEO of Splice Machine



The term “Business Intelligence” will morph into “Data Intelligence.” BI will finally evolve from being a reporting tool into data intelligence that every entity from governments to cities to individuals will use to prevent traffic, detect fraud, track diseases, manage personal health and even notify you when your favorite fruit has arrived at your local market. We will see the consumerization of BI where it will extend beyond the business world and become intricately woven into our everyday lives, directly impacting the decisions we make. – Eldad Farkash, co-founder and CTO, Sisense


‘Data wrangling’ will be the biggest area requiring innovation, automation and simplicity. Modeling and wrangling data from disparate systems into shape for insights has for decades been lengthy, tedious and labor-intensive. Most organizations today spend 70-80% of their time modeling and preparing data rather than interacting with data to generate business-critical insights. Simplifying data prep and data wrangling through automation will take shape in 2015 so businesses can move at a fast clip on real data-driven insights. – Sharmila Mulligan, CEO and founder of ClearStory Data


In the coming year, analytics will have the power to become the next killer app to legitimize the need for hybrid cloud solutions.  Analytics has the ability to mine vast amounts of data from diverse sources, deliver value and build predictions without huge data landfills. In addition, the ability to apply predictions to the myriad decisions made daily – and do so within applications and systems running on-premises–is unprecedented. – Dave Rich, CEO, Revolution Analytics


Increasing Role of Open Source in Enterprise Software – Data warehousing and BI have long been the domain of proprietary software concentrated across a handful of vendors. However, the last 10 years have seen the emergence and increasing prevalence of Hadoop and subsequently Spark as lower-cost open source alternatives that deliver the scale and sophistication needed to gain insights from Big Data. The Hadoop-related ecosystem is projected to be $25B by 2020, and Spark is now distributed by 10+ vendors, including SAP, Oracle, Microsoft, and Teradata, with support for all major BI tools, including Tableau, Qlik, and Microstrategy. – Ali Ghodsi, Head of Product Management and Engineering, Databricks


Personal predictive technology will come to the forefront – and fail. – As analysts see that they can use data discovery and other analytical tools, they’ll want the power of predictions to fall within their grasp, too. Unfortunately, most people don’t understand statistics well enough (see the “Monty Hall problem”) to make predictive models that really work, no matter how simple the tools make it. If anything, simple tools are more likely to get them into trouble. Information Builders

Real-Time Big Data – Companies will act on real-time data streams with data-driven, intelligent applications, instead of acting on yesterday’s data that was batch ingested last night. – Monte Zweben, co-founder and CEO of Splice Machine


Analytics in the cloud will become pervasive. As organizations continue to rely on various cloud-based services for their mission-critical operations, analytics in the cloud will become a prevalent deployment option. Not only does the cloud offer self-service, intuitive experiences that allow organizations to implement an analytics solution in minutes, it speeds insight and enables data sharing across internal and external data sources. By reducing the over-reliance on IT, users can focus on asking new questions and finding new answers at an unprecedented rate. – Sharmila Mulligan, CEO and founder of ClearStory Data




If you wonder what the government has done for you lately, take a look at DeepDive. Developed in the same Defense Advanced Research Projects Agency (DARPA) program as IBM’s Watson, DeepDive is being made available free and open-source.

Although it has never been pitted against IBM’s Watson, DeepDive has gone up against a more fleshy foe: the human being. Result: DeepDive beat or at least equaled humans in the time it took to complete an arduous cataloging task. These were no ordinary humans, but expert human catalogers tackling the same task as DeepDive — to read technical journal articles and catalog them by understanding their content.

“We tested DeepDive against humans performing the same tasks and DeepDive came out ahead or at least equaled the efforts of the humans,” professor Shanan Peters, who supervised the testing, told EE Times.

DeepDive is free and open-source, which was the idea of its primary programmer, Christopher Re.

“We started out as part of a machine reading project funded by DARPA in which Watson also participated,” Re, a professor at the University of Wisconsin, told EE Times. “Watson is a question-answering engine (although now it seems to be much bigger). [In contrast] DeepDive’s goal is to extract lots of structured data” from unstructured data sources.

DeepDive incorporates probability-based learning algorithms as well as open-source tools such as MADlib and Impala (from Cloudera), and low-level techniques such as Hogwild, some of which have also been included in Microsoft’s Adam. To build DeepDive into your application, you should be familiar with SQL and Python.


DeepDive was developed in the same Defense Advanced Research Projects Agency (DARPA) program as IBM’s Watson, but is being made available free by its programmers at the University of Wisconsin-Madison.

“Underneath the covers, DeepDive is based on a probability model; this is a very principled, academic approach to building these systems, but the question for us was ‘could it actually scale in practice?’ Our biggest innovations in DeepDive have to do with giving it this ability to scale,” Re told us.


For the future, DeepDive aims to prove itself in other domains.

“We hope to have similar results in those domains soon, but it’s too early to be very specific about our plans here,” Re told us. “We use a RISC processor right now, we’re trying to make a compiler and we think machine learning will let us make it much easier to program in the next generation of DeepDive. We also plan to get more data types into DeepDive: images, figures, tables, charts, spreadsheets — a sort of ‘Data Omnivore’ to borrow a line from Oren Etzioni.”

Get all the details in the free download, which is being downloaded at a rate of 10,000 per week.



An intro to Bayesian methods and probabilistic programming from a computation/understanding-first, mathematics-second point of view.


The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. The typical text on Bayesian inference involves two to three chapters on probability theory before introducing what Bayesian inference is. Unfortunately, due to the mathematical intractability of most Bayesian models, the reader is only shown simple, artificial examples. This can leave the user with a so-what feeling about Bayesian inference. In fact, this was the author’s own prior opinion.

After some recent success of Bayesian methods in machine-learning competitions, I decided to investigate the subject again. Even with my mathematical background, it took me three straight days of reading examples and trying to put the pieces together to understand the methods. There was simply not enough literature bridging theory to practice. The problem with my misunderstanding was the disconnect between Bayesian mathematics and probabilistic programming. That being said, I suffered then so the reader would not have to now. This book attempts to bridge the gap.

If Bayesian inference is the destination, then mathematical analysis is a particular path towards it. On the other hand, computing power is cheap enough that we can afford to take an alternate route via probabilistic programming. The latter path is much more useful, as it denies the necessity of mathematical intervention at each step; that is, we remove often-intractable mathematical analysis as a prerequisite to Bayesian inference. Simply put, this latter computational path proceeds via small intermediate jumps from beginning to end, whereas the first path proceeds by enormous leaps, often landing far away from our target. Furthermore, without a strong mathematical background, the analysis required by the first path cannot even take place.

Bayesian Methods for Hackers is designed as an introduction to Bayesian inference from a computational/understanding-first, and mathematics-second, point of view. Of course, as an introductory book, we can only leave it at that: an introductory book. The mathematically trained may cure the curiosity this text generates with other texts designed with mathematical analysis in mind. For the enthusiast with less mathematical background, or one who is not interested in the mathematics but simply the practice of Bayesian methods, this text should be sufficient and entertaining.

The choice of PyMC as the probabilistic programming language is twofold. First, as of this writing there is no central resource for examples and explanations in the PyMC universe; the official documentation assumes prior knowledge of Bayesian inference and probabilistic programming, and we hope this book encourages users at every level to look at PyMC. Secondly, with recent core developments and the popularity of the scientific stack in Python, PyMC is likely to become a core component soon enough.

PyMC does have dependencies to run, namely NumPy and (optionally) SciPy. To not limit the user, the examples in this book will rely only on PyMC, NumPy, SciPy and Matplotlib.


(The chapters below are rendered via nbviewer; they are read-only and rendered in real time. Interactive notebooks and examples can be downloaded by cloning the repository.)

More questions about PyMC? Please post your modeling, convergence, or any other PyMC question on cross-validated, the statistics stack-exchange.

Examples from the book

Below are just some examples from Bayesian Methods for Hackers.

Inferring behaviour changes using SMS message rates

Chapter 1

By only visually inspecting a noisy stream of daily SMS message rates, it can be difficult to detect a sudden change in the user’s SMS behaviour. In our first probabilistic programming example, we solve the problem by setting up a simple model to detect probable points where the user’s behaviour changed, and examine pre- and post-change behaviour.
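The book builds this model with PyMC. As a rough, library-free stand-in for the same idea, the sketch below scores each candidate switch day under a Poisson model with plug-in (empirical-mean) rates before and after the switch; this profile-likelihood shortcut, and the synthetic counts, are ours, not the book's full Bayesian treatment:

```python
import math

def poisson_loglik(counts, lam):
    """Log-likelihood of iid Poisson(lam) counts."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)

def changepoint_posterior(counts):
    """Approximate posterior over the switch day tau (uniform prior on tau),
    plugging in the empirical mean rate on each side of the switch."""
    n = len(counts)
    logp = []
    for tau in range(1, n):                 # switch after day tau
        before, after = counts[:tau], counts[tau:]
        l1 = max(sum(before) / len(before), 1e-9)
        l2 = max(sum(after) / len(after), 1e-9)
        logp.append(poisson_loglik(before, l1) + poisson_loglik(after, l2))
    m = max(logp)
    w = [math.exp(v - m) for v in logp]     # normalize in a stable way
    z = sum(w)
    return {tau: wi / z for tau, wi in zip(range(1, n), w)}

# Synthetic daily SMS counts: the rate visibly jumps after day 10.
counts = [6, 5, 7, 6, 5, 6, 7, 5, 6, 5,
          13, 14, 12, 15, 13, 14, 12, 13, 15, 14]
post = changepoint_posterior(counts)
print(max(post, key=post.get))  # most probable switch day
```

The full PyMC model in the book also treats the two rates as random variables rather than plugging in point estimates, which this sketch does not.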

Simpler AB Testing

Chapter 2

AB testing, also called randomized experiments in other literature, is a great framework for determining the difference between competing alternatives, with applications to web designs, drug treatments, advertising, plus much more.

With our new interpretation of probability, a more intuitive method of AB testing is demonstrated. And since we are not dealing with confusing ideas like p-values or Z-scores, we can compute more understandable quantities about our uncertainty.
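As a minimal illustration of this style of AB testing (our own sketch, not the book's PyMC code), we can draw from the two Beta posteriors and estimate the probability that A's conversion rate beats B's; the visitor and conversion counts below are hypothetical:

```python
import random

def prob_A_beats_B(conv_a, n_a, conv_b, n_b, samples=100_000, seed=0):
    """Monte Carlo estimate of P(p_A > p_B) under independent
    uniform (Beta(1, 1)) priors and binomial likelihoods."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(samples):
        pa = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        pb = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += pa > pb
    return wins / samples

# Hypothetical data: A converts 45 of 1500 visitors, B converts 65 of 1500.
print(prob_A_beats_B(45, 1500, 65, 1500))
```

The returned number answers the question directly ("how likely is A actually better?"), which is exactly the more understandable quantity the paragraph above contrasts with p-values and Z-scores.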

Discovering cheating while maintaining privacy

Chapter 2

A very simple algorithm can be used to infer proportions of cheaters, while also maintaining the privacy of the population. For each participant in the study:

  1. Have the user privately flip a coin. If heads, answer “Did you cheat?” truthfully.
  2. If tails, flip again. If heads, answer “Yes” regardless of the truth; if tails, answer “No”.

This way, the surveyors do not know whether a cheating confession is a result of cheating or a heads on the second coin flip. But how do we cut through this scheme and perform inference on the true proportion of cheaters?
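To see why inference is still possible, note that under this scheme P(yes) = 0.5 · p_cheat + 0.25, so the hidden proportion can be recovered from the observed frequency of "Yes" answers. A simulation sketch (the true rate below is hypothetical):

```python
import random

random.seed(2)

true_cheat_rate = 0.3   # hypothetical; unknown to the surveyor
n = 100_000

yes_answers = 0
for _ in range(n):
    if random.random() < 0.5:                             # first flip: heads
        yes_answers += random.random() < true_cheat_rate  # answer truthfully
    else:                                                 # first flip: tails
        yes_answers += random.random() < 0.5              # second flip decides

# P(yes) = 0.5 * p_cheat + 0.25, so invert to recover the hidden proportion:
p_yes = yes_answers / n
estimate = 2 * p_yes - 0.5
print(round(estimate, 2))  # close to 0.30
```

The book does this properly in a Bayesian way, producing a full posterior over the cheating proportion rather than this single point estimate.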

Challenger Space Shuttle disaster

Chapter 2

On January 28, 1986, the twenty-fifth flight of the U.S. space shuttle program ended in disaster when one of the rocket boosters of the Shuttle Challenger exploded shortly after lift-off, killing all seven crew members. The presidential commission on the accident concluded that it was caused by the failure of an O-ring in a field joint on the rocket booster, and that this failure was due to a faulty design that made the O-ring unacceptably sensitive to a number of factors, including outside temperature. Of the previous 24 flights, data were available on O-ring failures for 23 (one was lost at sea). These data were discussed on the evening preceding the Challenger launch, but unfortunately only the data corresponding to the 7 flights with a damage incident were considered important, and these were thought to show no obvious trend.

We examine this data in a Bayesian framework and show strong support for the conclusion that O-ring failure, made more likely by low ambient temperatures, caused the disaster.

Understanding Bayesian posteriors and MCMC

Chapter 3

The prior-posterior paradigm is visualized to make understanding the MCMC algorithm more clear. For example, below we show how two different priors can result in two different posteriors.
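The simplest setting in which to see this is a conjugate Beta-Binomial update, where both posteriors can be written down exactly (hypothetical data; the book's figures are produced with MCMC instead):

```python
# Conjugate Beta-Binomial update: same data, two different priors.
# Observed (hypothetical): 7 heads in 10 coin flips.
heads, flips = 7, 10

priors = {"flat Beta(1, 1)": (1, 1), "skeptical Beta(20, 20)": (20, 20)}
for name, (a, b) in priors.items():
    # Beta(a, b) prior + Binomial data gives Beta(a + heads, b + tails).
    post_a, post_b = a + heads, b + flips - heads
    mean = post_a / (post_a + post_b)
    print(f"{name}: posterior Beta({post_a}, {post_b}), mean {mean:.3f}")
```

The flat prior lets the data dominate, while the skeptical prior pulls the posterior mean back toward 0.5; MCMC produces the same kind of contrast for models where no closed-form update exists.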

Clustering data

Chapter 3

Given a dataset, sometimes we wish to ask whether there may be more than one hidden source that created it. A priori, it is not always clear this is the case. We introduce a simple model to try to pry data apart into two clusters.
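The book's approach is a Bayesian mixture model; as a crude non-Bayesian stand-in, a tiny two-means clustering on hypothetical 1-D data shows the basic idea of prying data into two groups:

```python
import random

random.seed(3)

# Hypothetical 1-D data drawn from two hidden sources.
data = ([random.gauss(-2, 0.5) for _ in range(100)] +
        [random.gauss(3, 0.5) for _ in range(100)])
random.shuffle(data)

def two_means(points, iters=20):
    """Tiny k-means with k=2: alternate assignment and centroid update."""
    c1, c2 = min(points), max(points)
    for _ in range(iters):
        g1 = [x for x in points if abs(x - c1) <= abs(x - c2)]
        g2 = [x for x in points if abs(x - c1) > abs(x - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted((c1, c2))

print(two_means(data))  # cluster centres near -2 and 3
```

Unlike this hard assignment, the Bayesian mixture in the chapter yields a probability that each point belongs to each cluster, plus uncertainty about the cluster parameters themselves.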

Sorting Reddit comments from best to worst

Chapter 4

Consider ratings on online products: how often do you trust an average 5-star rating if there is only 1 reviewer? 2 reviewers? 3 reviewers? We implicitly understand that with so few reviewers, the average rating is not a good reflection of the true value of the product. This has created flaws in how we sort items and, more generally, how we compare items. Many people have realized that sorting online search results by their rating, whether the objects are books, videos, or online comments, returns poor results. Often the seemingly top videos or comments have perfect ratings only from a few enthusiastic fans, while genuinely higher-quality videos or comments are hidden on later pages with falsely substandard ratings of around 4.8. How can we correct this?
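One common fix, in the spirit of the chapter, is to sort by a pessimistic lower bound of the rating rather than by its mean. Below is a sketch using a normal approximation to the Beta posterior of the upvote ratio (the vote counts are hypothetical):

```python
import math

def lower_bound(upvotes, downvotes):
    """Approximate 95% lower bound of the true upvote ratio, via a normal
    approximation to the Beta(upvotes + 1, downvotes + 1) posterior."""
    a, b = upvotes + 1, downvotes + 1
    mean = a / (a + b)
    std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean - 1.65 * std

# One perfect rating vs. many slightly imperfect ones (hypothetical counts):
items = {"1 up, 0 down": (1, 0), "100 up, 3 down": (100, 3)}
for name, (up, down) in items.items():
    print(f"{name}: lower bound {lower_bound(up, down):.3f}")
```

Sorting by this bound ranks the well-reviewed item above the single perfect rating, because the bound penalizes uncertainty from small sample sizes.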

Solving the Price is Right’s Showcase

Chapter 5

Bless you if you are ever chosen as a contestant on the Price is Right, for here we will show you how to optimize your final price on the Showcase. We create a Bayesian model of your best guess and your uncertainty in that guess, and push it through the odd Showdown loss function (closest wins, lose if you bid over).
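As a rough Monte Carlo sketch of that idea, with an entirely hypothetical posterior and a simplified reward in place of the book's exact loss function:

```python
import random

random.seed(4)

# Hypothetical posterior belief about the showcase's true price: roughly
# normal around $20,000 with $3,000 of uncertainty.
price_samples = [random.gauss(20_000, 3_000) for _ in range(5_000)]

def expected_reward(bid, samples):
    """Closest-without-going-over rule: a bid above the true price wins
    nothing; otherwise the reward grows as the bid nears the price."""
    return sum(0 if bid > p else bid / p for p in samples) / len(samples)

best_bid = max(range(10_000, 25_000, 250),
               key=lambda b: expected_reward(b, price_samples))
print(best_bid)  # noticeably below the $20,000 posterior mean
```

The key qualitative result survives the simplification: because overbidding is catastrophic, the optimal bid sits well below your mean guess, and the more uncertain you are, the further below it should sit.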

Kaggle’s Dark World winning solution

Chapter 5

We implement Tim Salimans’ winning solution to the Observing Dark Worlds contest on the data science website Kaggle.

Bayesian Bandits – a solution to the Multi-Armed Bandit problem

Chapter 6

Suppose you are faced with N slot machines (colourfully called multi-armed bandits). Each bandit has an unknown probability of distributing a prize (assume for now the prizes are the same for each bandit; only the probabilities differ). Some bandits are very generous, others not so much. Of course, you don’t know what these probabilities are. By choosing only one bandit per round, our task is to devise a strategy to maximize our winnings.
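One strategy the chapter develops is Bayesian Bandits, i.e. Thompson sampling: keep a Beta posterior for each bandit's win probability, draw one sample from each posterior, and pull the arm whose draw is highest. A minimal sketch with hypothetical win probabilities:

```python
import random

random.seed(5)

# Hypothetical bandits: win probabilities unknown to the player.
true_probs = [0.15, 0.40, 0.25]
wins  = [0, 0, 0]   # observed prizes per bandit
plays = [0, 0, 0]   # pulls per bandit

for _ in range(2_000):
    # Thompson sampling: sample each bandit's Beta posterior, pull the
    # arm with the highest draw.
    draws = [random.betavariate(wins[i] + 1, plays[i] - wins[i] + 1)
             for i in range(3)]
    arm = draws.index(max(draws))
    plays[arm] += 1
    wins[arm] += random.random() < true_probs[arm]

print(plays)  # pulls should concentrate on the 0.40 bandit
```

Early on the posteriors are wide, so all arms get explored; as evidence accumulates, the posteriors tighten and play concentrates on the best arm automatically, with no explicit explore/exploit schedule.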

Stock Market analysis

Chapter 6

For decades, finance students have been taught to pick stocks using naive statistical methods. This has led to terrible inference, mostly due to two things: mishandled temporal parameters and ignored uncertainty. The first is harder to solve; the second fits right into a Bayesian framework.

Using the book

The book can be read in three different ways, listed from most to least recommended:

  1. The most recommended option is to clone the repository to download the .ipynb files to your local machine. If you have IPython installed, you can view the chapters in your browser, as well as edit and run the code provided (and try some practice questions). This is the preferred option to read this book, though it comes with some dependencies.
    • IPython v0.13 (or greater) is a requirement to view the ipynb files. It can be downloaded here. IPython notebooks can be run by (your-virtualenv) ~/path/to/the/book/Chapter1_Introduction $ ipython notebook
    • For Linux users, you should not have a problem installing NumPy, SciPy, Matplotlib and PyMC. For Windows users, check out pre-compiled versions if you have difficulty.
    • In the styles/ directory are a number of files (e.g. .matplotlibrc) that are used to make things pretty. These are not designed solely for the book; they offer many improvements over the default settings of matplotlib.
    • While technically not required, it may help to run the IPython notebook with the ipython notebook --pylab inline flag if you encounter IO errors.
  2. The second option is to use this site, which displays the IPython notebooks in the browser (example). The contents are updated as commits are made to the book. You can use the Contents section above to link to the chapters.
  3. PDF versions are available! Look in the PDF/ directory. PDFs are the least-preferred way to read the book, as PDFs are static and non-interactive. If PDFs are desired, they can be created dynamically using Chrome’s built-in print-to-PDF feature or the nbconvert utility.

Installation and configuration

If you would like to run the IPython notebooks locally, (option 1. above), you’ll need to install the following:

  1. IPython 0.13 is a requirement to view the ipynb files. It can be downloaded here.
  2. For Linux users, you should not have a problem installing NumPy, SciPy and PyMC. For Windows users, check out pre-compiled versions if you have difficulty. Also recommended, for data-mining exercises, are PRAW and requests.
  3. In the styles/ directory are a number of files that are customized for the notebook. These are not designed solely for the book; they offer many improvements over the default settings of matplotlib and the IPython notebook. The in-notebook style has not been finalized yet.


This book has an unusual development design. The content is open-sourced, meaning anyone can be an author. Authors submit content or revisions using the GitHub interface.

What to contribute?

  1. The current chapter list is not finalized. If you see something that is missing (MCMC, MAP, Bayesian networks, good prior choices, Potential classes etc.), feel free to start there.
  2. Cleaning up Python code and making code more PyMC-esque
  3. Giving better explanations
  4. Spelling/grammar mistakes
  5. Suggestions
  6. Contributing to the IPython notebook styles

We would like to thank the Python and statistics communities for building an amazing architecture.

Similarly, the book is only possible because of the PyMC library. A big thanks to the core devs of PyMC: Chris Fonnesbeck, Anand Patil, David Huard and John Salvatier.

One final thanks. This book was generated by IPython Notebook, a wonderful tool for developing in Python. We thank the IPython community for developing the Notebook interface. All IPython notebook files are available for download on the GitHub repository.



While many commentators have focused on data-driven innovation in the United States and Western Europe, developing regions, such as Africa, also offer important opportunities to use data to improve economic conditions and quality of life. Three major areas where data can help are improving health care, protecting the environment, and reducing crime and corruption.

First, data-driven innovation can play a major role in improving health in Africa, including by advancing disease surveillance and medical research. Several recent examples have come to light during the ongoing West African Ebola virus outbreak of 2014, which has catalyzed international efforts to improve the continent’s disease surveillance infrastructure. One effort is an attempt to crowdsource contributions to OpenStreetMap, the self-described “Wikipedia for Maps” that anyone can edit. OpenStreetMap volunteers are using satellite images to manually identify roads, buildings, bodies of water, and other features in rural areas of West Africa, which can help aid workers and local public health officials better plan their interventions and ensure every village has been checked for the disease. The U.S. Centers for Disease Control and Prevention is also piloting a program to track aggregate cell phone location data in areas affected by Ebola to provide a better picture of disease reports in real time. Other efforts are targeting basic medical research to ensure that African people are not underrepresented in genomics research. The United Genomes Project hopes to counteract the trend of basing genomic research on primarily U.S. and European populations, which can result in treatments that are ineffective among other populations, by compiling the genomes of 1,000 Africans into an openly accessible database over the next several years.

Second, various projects are using data for conservation and environmental efforts in Africa. One such initiative, the Great Elephant Census, is attempting to count African elephants to help local authorities better target conservation efforts and fight poaching. The census uses imaging drones and automated image recognition techniques to collect data that otherwise would have been too expensive or too difficult to collect using traditional data collection techniques. The University of California Santa Barbara’s Climate Hazards Group recently released a near-real time rainfall data set to help government agencies around the world detect droughts as rapidly as possible. The data is already being used to identify burgeoning areas of food insecurity, including in drought-plagued areas of East Africa. The Trans-African Hydro-Meteorological Observatory is a Delft University of Technology-led collaboration among 14 universities and several private sector organizations to use cheap sensors to collect localized weather information at 20,000 locations in sub-Saharan Africa. The resulting data will be available freely for scientific research and government use.

Third, data is helping reduce crime and corruption in Africa. The U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide is working on a public early-warning system for mass atrocities around the world that uses data from news reports and other sources to predict what countries carry the highest risk of genocide and other violent events in the near future. It will be rolled out during next year’s elections in Nigeria, which have historically been marred by violence. Israeli startup Windward uses satellite imagery to flag potentially illegal activity, including illegal fishing and piracy, around the world. Windward’s algorithms have already been used around the Horn of Africa to identify pirate activity. Other efforts are focused around government corruption. These include the South Africa-based Parliamentary Monitoring Group, which compiles and publishes data about local politicians, and Ghana-based Odekro, a website that monitors politicians’ behavior and publishes public debate transcripts and other political information.

These examples represent just a subset of the many ways that data-driven innovation is having a positive impact on Africa and address problems that have plagued African countries for decades. Although many of these innovative approaches come from international efforts, some are homegrown as well. For example, Nairobi-based Gro Ventures collects regular data on crop yields and commodity prices from local farmers and uses the data to build risk models that banks can use to make loans to farmers. The number of opportunities will continue to grow as the technology becomes cheaper, data becomes more plentiful, and the skills needed to perform analysis become more widely available.



IBM’s Watson Analytics service is now in open beta. The natural language-based system, born of the same programme that developed the company’s Jeopardy-playing supercomputer, offers predictive and visual analytics tools for businesses.

Early this summer, IBM announced it is investing more than $1 billion into commercializing Watson. Watson Analytics is part of that effort. The company promises that it can automate tasks such as data preparation, predictive analysis and visual storytelling.

IBM will offer Watson Analytics as a cloud-based freemium service, accessible via the Web and mobile devices. Since it announced the programme in the summer, 22,000 people have registered for the beta.

The launch of Watson Analytics follows the announcement two months ago that IBM has teamed up with Twitter to apply the Watson technology to analysing data from the social network.





Retailers have gained a unique advantage over consumers that previously never existed. They are now able to track and analyze customer behavior (online, mobile, and in-store) to better aim marketing campaigns, improve the customer experience, and increase revenue. Much of this would be impractical without a framework like Apache Hadoop, one of the most capable big data storage and processing frameworks available.

Where there once were customer panels, in-store surveys, focus groups, and guesswork, there is now social media, online search behavior, and easy-access customer input. Retailers can focus on the shopper as an individual, rather than aiming at the masses and hoping to snag a few.

Applications of Hadoop in Retail

Among marketing campaigns and targeted advertising are multiple other use cases for Hadoop in the retail industry. Five of the most common of its applications are detailed below.

Get to Know the Customer

Interacting with the customer has become easier than ever with the help of social media. Now consumers can send feedback directly, and ads can be focused based on statuses and search terms. Hadoop is able to scan and store transaction data and identify phases of the customer lifecycle to help retailers reduce inventory costs, increase sales, and build and retain a loyal customer base.

Analyze Brand Sentiment

Product launches, promotions, competitor moves, news stories, and in-store experiences all affect the customers’ opinions of the brand. It’s important for retailers to understand public perception of their brand so they can adjust promotions, advertisements, locations, and policies accordingly. Hadoop helps store and process this information from social media websites and browser searches to provide real-time perspective.

Localize and Personalize Promotions

In order to effectively localize and personalize retail promotions, it’s helpful to have a mobile app that sends users personalized push notifications about promotions and specific products at a nearby location, aligned with their customer data. Geo-location technology combined with historical storage and real-time streaming data becomes a marketing tactic that Hadoop is particularly well suited to support, given the storage and processing capacity involved.

Optimize Websites

When customers browse retail websites, clickstream data allows researchers to view and analyze the consumers’ click habits, transaction data, and site glitches. This information is helpful for prioritizing site updates, running A/B tests, doing basket analyses, and understanding user paths, all in order to create a better overall shopping experience and more effectively reach customers as individuals. For example, if a pattern suggests that there is zero activity on a portion of the website, you can assume either that that part of the website needs IT attention or that it is unappealing to customers and adjustments are needed. Either way, a problem is resolved and revenue potentially increases.
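Hadoop itself is out of scope here, but the kind of aggregation such a job performs can be sketched in a few lines of plain Python (the log records below are hypothetical):

```python
from collections import Counter

# Hypothetical clickstream log: (session_id, page) pairs.
clicks = [
    ("s1", "/home"), ("s1", "/product/42"), ("s1", "/checkout"),
    ("s2", "/home"), ("s2", "/product/42"),
    ("s3", "/home"), ("s3", "/search"), ("s3", "/product/7"),
]

# Map step: emit each page once per click; reduce step: count per page.
page_counts = Counter(page for _, page in clicks)

# Pages with zero or unusually low traffic may signal a site glitch or an
# unappealing section worth redesigning.
print(page_counts.most_common(3))
```

A Hadoop MapReduce job distributes exactly this map-then-count pattern across many machines, which is what makes it feasible on billions of clicks rather than a handful.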

Redesign Store Layouts

Store layout has a significant impact on product sales, but the customer’s in-store shopping experience is the hardest for retailers to track. The lack of pre-register data leaves a gap in the information about what customers look at, how long they linger, etc. Often, businesses will hire unnecessary extra staff to make up for poor sales, when really the issue is poor product placement and store layout. Sensors have been developed to help fill the gap, such as RFID tags and QR codes. They store information through Hadoop that helps retailers improve the customer experience and reduce costs.


Apache Hadoop is a comprehensive big data storage and processing framework helping retailers optimize the customer experience, increase sales, and reduce costs related to inventory and marketing. With this expansive technology, businesses can focus their marketing efforts, get to know their customers and their needs, and create a customized shopping experience that will result in happy returning customers.