Aside

mmdsThe second edition of this landmark book adds Jure Leskovec as a coauthor and has 3 new chapters, on mining large graphs, dimensionality reduction, and machine learning. You can still freely download a PDF version.

There is a revised Chapter 2 that treats map-reduce programming in a manner closer to how it is used in practice, rather than how it was described in the original paper. Chapter 2 also has new material on algorithm design techniques for map-reduce.

Support Materials include Gradiance automated homeworks for the book and slides.

The authors note if want to reuse parts of this book, you need to obtain their permission and acknowledge our authorship. They have seen evidence that other items they published have been appropriated and republished under other names, but that is easy to detect, as you will learn in Chapter 3.

Download chapters of the book:

Preface and Table of Contents
Chapter 1 Data Mining
Chapter 2 Map-Reduce and the New Software Stack
Chapter 3 Finding Similar Items
Chapter 4 Mining Data Streams
Chapter 5 Link Analysis
Chapter 6 Frequent Itemsets
Chapter 7 Clustering
Chapter 8 Advertising on the Web
Chapter 9 Recommendation Systems
Chapter 10 Mining Social-Network Graphs
Chapter 11 Dimensionality Reduction
Chapter 12 Large-Scale Machine Learning
Index

Advertisements
Aside

LDA automatically assigns topics to text documents. How is it done? Which are its limitations? What is the best open-source library to use in your code?

In this post we’re going to describe how topics can be automatically assigned to text documents; this process is named, unsurprisingly, topic-modelling. It works like fuzzy (or soft) clustering since it’s not a strict categorisation of the document to its dominant topic.

Let’s start with an example: the optimal topic-modelling outcome for Shakespeare’s Romeo & Juliet would be a composition of topics of circa 50% tragedy and 50% romance. Surely, topics like social networks, football and indian cuisine don’t appear in the play, so their weights would be all 0%.

One of the most advanced algorithms for doing topic-modelling is Latent Dirichlet Allocation (or LDA). This is a probabilistic model developed by Blei, Ng and Jordan in 2003. LDA is an iterative algorithm which requires only three parameters to run: when they’re chosen properly, its accuracy is pretty high. Unfortunately, one of the required parameters is the number of topics: exactly as happens with K-means, this requires a deep a-priori knowledge of the dataset.

A good measure to evaluate the performance of LDA is perplexity. This measure is taken from information theory and measures how well a probability distribution predicts an observed sample. To evaluate the LDA model, one document is taken and split in two. The first half is fed into LDA to compute the topics composition; from that composition, then, the word distribution is estimated. This distribution is then compared with the word distribution of the second half of the document. a a measure of distance is extracted. Thanks to this measure, in practice, perplexity is often used to select the best number of topics of the LDA model.

Under the hood, LDA models both the topics-per-document and the topic-per-word distribution as Dirichlet distributions (that’s why it appears in its name). By using a Markov Chain Monte Carlo (MCMC) method to sample and approximate the underlying Markov Chain stationary distribution (called Gibbs sampling), the whole process is iterative, pretty fast, convergent and accurate. Math behind LDA is fairly complex, but a simple example on how LDA works is contained in this video presentation of David Minmo, a world class researcher of topic-modelling:

Topic Modeling Workshop: Mimno from MITH in MD.

Start TL;DR

For the bravest, this is the graphical representation of LDA: grey circles represent the observable variables; latent (also called hidden) ones are white. Boxes represent collections (repetition) of variables.

Graphical representation of LDAlda-graphical-representation

[Image taken from Wikipedia, CC-3.0 licensed]

Parameters of the model:

  • Boxed:
    • K is the number of topics
    • N is the number of words in the document
    • M is the number of documents to analyse
  • α is the Dirichlet-prior concentration parameter of the per-document topic distribution
  • β is the same parameter of the per-topic word distribution
  • φ(k) is the word distribution for topic k
  • θ(i) is the topic distribution for document i
  • z(i,j) is the topic assignment for w(i,j)
  • w(i,j) is the j-th word in the i-th document

φ and θ are Dirichlet distributions, z and w are multinomials.

End TL;DR

On the Internet there are a bunch of libraries able to perform topic-modelling through LDA. Note that the acronym LDA is also refer to another technique with the same initials (Linear Discriminant Analysis), but the two algorithms are completely unrelated. Now, in the follow, our point of view of some open sourced Latent Dirichlet Allocation Implementations. For each of them, we’re pointing out strengths and weakness, as well as simplicity to install and use and scalability.

MALLET and TMT

  • Current status: no longer developed or maintained
  • Programming language: Java (Mallet) and Scala (TMT)
  • Features: university backed software, not optimised for production. Great for learning and exploring LDA on small datasets, understanding its parameters and tuning the model.
  • Scalability: multi-threaded, single-machine. Good for small to medium collections of documents.
  • Simplicity to install: simple. Jar distributed and Maven compilable.
  • Simplicity to train the model: simple and very customisable.
  • Infer topics on unseen documents: simple as well as training.

Y!LDA

  • Current status: no longer developed or maintained
  • Programming language: C++
  • Features: Very scalable LDA algorithm, able to scale across multiple hosts and cores. Code is very optimised, and requires experienced C++ developers to modify it.
  • Scalability: multi-core, multi-machine Hadoop backed. Good for medium to huge collections of documents (it’s able to handle 1M+ documents).
  • Simplicity to install: pretty complicated. A 4 years old linux box with many outdates libraries are required. Ice dependency is very tricky to install.
  • Simplicity to train the model: cumbersome. It tooks a long while to make Yahoo_LDA working properly on a Hadoop cluster. Also, in case of error, C++ compiled code on a Java/Hadoop system makes the investigation of what went wrong very hard.
  • Infer topics on unseen documents: simpler than the training phase.

GraphLab

  • Current status: active. Maintained by GraphLab Inc and community
  • Programming language: C++
  • Features: Very scalable LDA algorithm, able to scale across multiple hosts and cores. Code and algorithms are very optimised, and requires experienced C++ developers to modify it.
  • Scalability: multi-core, multi-machine through MPIs. Good for medium to huge collections of documents (it’s able to handle 1M+ documents).
  • Simplicity to install: pretty simple (cMake), with few dependencies to install.
  • Simplicity to train the model: pretty simple, even in a multi-machine environment. Following the easy documentation, LDA simply works.
  • Infer topics on unseen documents: complex. There is not an out of the box routine to infer topics on new documents. Creating that inferencer is not so complicated, though.
  • Note: recently, documentation of LDA has disappeared from the website. Fortunately, it’s still available from the internet archive.

Gensim

  • Current status: active. Maintained by Radim Řehůřek and community
  • Programming language: Python (with core pieces in Fortran/C)
  • Features: Very scalable LDA algorithm, distributed, able to process input collections larger than RAM (online learning) and easy to use.
  • Scalability: multi-core, multi-machine through RPCs. Good for medium to large collections of documents (it’s able to handle 1M+ documents).
  • Simplicity to install: very easy (using pip install or easy_install)
  • Simplicity to train the model: very simple. There are many helping routines that allow to build and tune the model with few lines of code (also in multi-machine environment)
  • Infer topics on unseen documents: very easy. It also update the model with the sample.
  • Quick tour as IPython notebook here.

LDA limitations: what’s next?

Although LDA is a great algorithm for topic-modelling, it still has some limitations, mainly due to the fact that it’s has become popular and available to the mass recently. One major limitation is perhaps given by its underlying unigram text model: LDA doesn’t consider the mutual position of the words in the document. Documents like “Man, I love this can” and “I can love this man” are probably modelled the same way. It’s also true that for longer documents, mismatching topics is harder. To overcome this limitation, at the cost of almost square the complexity, you can use 2-grams (or N-grams)along with 1-gram.

Another weakness of LDA is in the topics composition: they’re overlapping. In fact, you can find the same word in multiple topics (the example above, of the word “can”, is obvious). The generated topics, therefore, are not independent and orthogonal like in a PCA-decomposed basis, for example. This implies that you must pay lots of attention while dealing with them (e.g. don’t use cosine similarity).

For a more structured approach – especially if the topic composition is very misleading – you might consider the hierarchical variation of LDA, named H-LDA, (or simply Hierarchical LDA). In H-LDA, topics are joined together in a hierarchy by using a Nested Chinese Restaurant Process (NCRP). This model is more complex than LDA, and the description is beyond the goal of this blog entry, but if you like to have an idea of the possible output, here it is. Don’t forget that we’re still in the probabilistic world: each node of the H-DLA tree is a topic distribution.

Graphical representation of LDAhlda_example

[Image taken from the original paper on HLDA: Blei, Jordan, Griffiths Tenenbaum, © MIT Press, 2004 ]

If you’re very curious on LDA, here is a quick example.

Source: engineering.intenthq.com

Aside

Markov chains, named after Andrey Markov, are mathematical systems that hop from one “state” (a situation or set of values) to another. For example, if you made a Markov chain model of a baby’s behavior, you might include “playing,” “eating”, “sleeping,” and “crying” as states, which together with other behaviors could form a ‘state space’: a list of all possible states. In addition, on top of the state space, a Markov chain tells you the probabilitiy of hopping, or “transitioning,” from one state to any other state—e.g., the chance that a baby currently playing will fall asleep in the next five minutes without crying first.

A simple, two-state Markov chain is shown below.

http://setosa.io/markov/no-controls-with-hash.html#%7B%22tm%22%3A%5B%5B0.5%2C0.5%5D%2C%5B0.5%2C0.5%5D%5D%7D

With two states (A and B) in our state space, there are 4 possible transitions (not 2, because a state can transition back into itself). If we’re at ‘A’ we could transition to ‘B’ or stay at ‘A’. If we’re at ‘B’ we could transition to ‘A’ or stay at ‘B’. In this two state diagram, the probability of transitioning from any state to any other state is 0.5.

Of course, real modelers don’t always draw out Markov chain diagrams. Instead they use a “transition matrix” to tally the transition probabilities. Every state in the state space is included once as a row and again as a column, and each cell in the matrix tells you the probability of transitioning from its row’s state to its column’s state. So, in the matrix, the cells do the same job that the arrows do in the diagram.

http://setosa.io/markov/transition-matrix.html#%7B%22tm%22%3A%5B%5B0.5%2C0.5%5D%2C%5B0.5%2C0.5%5D%5D%7D

If the state space adds one state, we add one row and one column, adding one cell to every existing column and row. This means the number of cells grows quadratically as we add states to our Markov chain. Thus, a transition matrix comes in handy pretty quickly, unless you want to draw a jungle gym Markov chain diagram.

One use of Markov chains is to include real-world phenomena in computer simulations. For example, we might want to check how frequently a new dam will overflow, which depends on the number of rainy days in a row. To build this model, we start out with the following pattern of rainy (R) and sunny (S) days:

http://setosa.io/markov/random-sequence.html

One way to simulate this weather would be to just say “Half of the days are rainy. Therefore, every day in our simulation will have a fifty percent chance of rain.” This rule would generate the following sequence in simulation:

http://setosa.io/markov/random-sequence-50-50.html

Did you notice how the above sequence doesn’t look quite like the original? The second sequence seems to jump around, while the first one (the real data) seems to have a “stickyness”. In the real data, if it’s sunny (S) one day, then the next day is also much more likely to be sunny.

We can minic this “stickyness” with a two-state Markov chain. When the Markov chain is in state “R”, it has a 0.9 probability of staying put and a 0.1 chance of leaving for the “S” state. Likewise, “S” state has 0.9 probability of staying put and a 0.1 chance of transitioning to the “R” state.

http://setosa.io/markov/random-sequence-markov.html

In the hands of metereologists, ecologists, computer scientists, financial engineers and other people who need to model big phenomena, Markov chains can get to be quite large and powerful. For example, the algorithm Google uses to determine the order of search results, called PageRank, is a type of Markov chain.

http://setosa.io/markov/playground.html?1425338959085#%7B%20%20%22tm%22%3A%20%5B%20%20%20%20%5B%20%20%20%20%20%200.3%2C%20%20%20%20%20%200.3%2C%20%20%20%20%20%200.4%20%20%20%20%5D%2C%20%20%20%20%5B%20%20%20%20%20%200.3%2C%20%20%20%20%20%200.5%2C%20%20%20%20%20%200.2%20%20%20%20%5D%2C%20%20%20%20%5B%20%20%20%20%20%200.4%2C%20%20%20%20%20%200.4%2C%20%20%20%20%20%200.2%20%20%20%20%5D%20%20%5D%7D

Above, we’ve included a Markov chain “playground”, where you can make your own Markov chains by messing around with a transition matrix. Here’s a few to work from as an example: ex1, ex2, ex3 or generate one randomly. The transition matrix text will turn red if the provided matrix isn’t a valid transition matrix. The rows of the transition matrix must total to 1. There also has to be the same number of rows as columns.

Source: Setosa.ioio

Aside

If you wonder what the government has done for you lately, take a look at DeepDive. DeepDive is a free version of IBM’s Watson developed in the same Defense Advanced Research Projects Agency (DARPA) program as IBM’s Watson, but is being made available free and open-source.

Although never been pitted against IBM’s Watson, DeepDive has gone up against a more fleshy foe: the human being. Result: DeepDive beat or at least equaled humans in the time it took to complete an arduous cataloging task. These were no ordinary humans, but expert human catalogers tackling the same task as DeepDive — to read technical journal articles and catalog them by understanding their content.

“We tested DeepDive against humans performing the same tasks and DeepDive came out ahead or at least equaled the efforts of the humans,” professor Shanan Peters, who supervised the testing, told EE Times.

DeepDive is free and open-source, which was the idea of its primary programmer, Christopher Re.

“We started out as part of a machine reading project funded by DARPA in which Watson also participated,” Re, a professor at Univ. of Wisconsin told EE Times. “Watson is a question-answering engine (although now it seems to be much bigger). [In contrast] DeepDive’s goal is to extract lots of structured data” from unstructured data sources.

DeepDive incorporates probability-based learning algorithms as well as open-source tools such as MADlib, Impala (from Oracle), and low-level techniques, such as Hogwild, some of which have also been included in Microsoft’s Adam. To build DeepDive into your application, you should be familiar with SQL and Python.

rcjDeepDive_AI_artificial-intelligence_sm

DeepDive was developed in the same Defense Advanced Research Projects Agency (DARPA) program as

IBM’s Watson, but is being made available free by its programmers at University of Wisconsin-Madison. 

“Underneath the covers, DeepDive is based on a probability model; this is a very principled, academic approach to build these systems, but the question for use was ‘could it actually scale in practice’? Our biggest innovations in Deep Dive have to do with giving it this ability to scale,” Re told us.

 

For the future, DeepDive aims to be proven to other domains.

“We hope go have similar results in those domains soon, but it’s too early to be very specific about our plans here,” Re told us. “We use a RISC processor right now, we’re trying to make a compiler and we think machine learning will let us make it much easier to program in the next generation of DeepDive. We also plan to get more data types into DeepDive: images, figures, tables, charts, spreadsheets — a sort of ‘Data Omnivore’ to borrow a line from Oren Etzioni.”

Get all the details in the free download which are going at 10,000 per week.

Source: http://www.eetimes.com/

Aside

Naive Bayes classification is a simple, yet effective algorithm. It’s commonly used in things like text analytics and works well on both small datasets and massively scaled out, distributed systems.

How does it work?

Naive Bayes is based on, you guessed it, Bayes’ theorem. Think back to your first statistics class. Bayes’ theorem was that seemingly counterintuitive lecture on conditional probability.

bayes-theorem-neon-sign

Bayes’ Theorem

The neon formula above might look intimidating, but it’s actually not that complicated. To explain it, instead of using “events A and B“, I’m going to use something a little more familiar. Let’s say the two events in question are:

A) I watched The Lego Movie today
B) I sat on the couch today

So for my 2 events, let’s break it down into it’s Bayesian components:

I’ve seen The Lego Movie 10 times in the past 2 months–this is a lot, I know. I’ve been lucky enough that it’s been playing on almost every plane I’ve been on (as it should be! it’s a great movie for both adults and kids). Since I’ve watched The Lego Movie 10 out of the last 60 days, we’ll say that:

P(A) = P(I watched The Lego Movie today) = 10 / 60, or ~0.17

I sit on the couch most days I’m at my apartment. I’ve traveled 14 days in the past 2 months, so to keep it simple, we’ll assume I sat on my couch at least once on every other day (hey, it’s pretty comfy).

P(B) = P(I sat on the couch today) = (60 - 14) / 60, or ~0.76

I’ve seen The Lego Movie 10 times and 4 of those times have been on a plane. I think it’s pretty safe to assume the rest of those times I was seated comfortably in my living room. So given that I’ve had 46 days of couchtime in the past 2 months, we can say that I watched The Lego Movie from my couch 6 / 10 times.

P(B|A) = P(I sat on the couch given that I watched The Lego Movie) = 6 / 10 = 0.60

Ok, ready for the magic! Using Bayes’ theorem, I now have everything I need to calculate the Probability that I watched The Lego Movie today given that I sat on the couch.

P(A|B)=P(B|A)*P(A)/P(B) = (0.60 * 0.17) / 0.76

P(I watched The Lego Movie given that I sat on the couch) = 0.13

And voilà! Given that I sat on the couch today, there is a 13% chance that I also watched The Lego Movie (wow, that’s a lot of Lego time).

Now I wonder what the probability of me watching The Lego Movie from a double decker couch would be?

Why should I use it?

Where you see Naive Bayes classifiers pop up a lot is in document classification. Naive Bayes is a great choice for this because it’s pretty fast, it can handle a large number of features (i.e. words), and it’s actually really effective. Take a look at what happens when you do some basic benchmarking between Naive Bayes and other methods like SVM and RandomForest against the 20 Newsgroups dataset.

20newsgroups-benchmark

 

Naive Bayes wins! Granted this is a relatively simple approach without much in terms of feature engineering, but in my opinion that’s part of the beauty of Naive Bayes!

Code for benchmarking is available here.

Document Classification

For our example we’re going to be attempting to classify whether a wikipedia page is referring to a dinosaur or acryptid (an animal from cryptozoology. Think Lochness Monster or Bigfoot).

My favorite Yeti, The Bumble, from the stop-motion holiday classic Rudolph the Red-nosed ReindeerWe’ll be using the text from each wikipedia article as features. What we’d expect is that certain words like “sighting” or “hoax” would be more commonly found in articles about cryptozoology, while words like “fossil” would be more commonly found in articles about dinosaurs.

We’ll do some basic word-tokenization to count the occurrences of each word and then calculate conditional probabilities for each word as it pertains to our 2 categories.

You can find the sample documents I used and the corresponding code here.

Tokenizing and counting

First things first. We need to turn our files full of text into something a little more mathy. The simplest way to do this is to take the bag of words approach. That just means we’ll be counting how many times each word appears in each document. We’ll also perform a little text normalization by removing punctuation and lowercasing the text (this means “Hello,” and “hello” will now be considered the same word).

Once we’ve cleaned the text, we need a way to delineate words. A simple approach is to just use a good ‘ole regex that splits on whitespace and punctuation:

https://gist.github.com/glamp/d1c9c1e20184dcd42b6c.js

Calculating our probabilities

So now that we can count words, let’s get cooking. The code below is going to do the following:

  • open each document
  • label it as either “crypto” or “dino” and keep track of how many of each label there are (priors)
  • count the words for the document
  • add those counts to the vocab, or a corpus level word count
  • add those counts to the word_counts, for a category level word count

Classifying a new page

And finally it’s time for the math. We’re going to use the word counts we calculated in the previous step to calculate the following:

Prior Probability for each category, or for the layman, the percentage of documents that belong to each category. We have 9 crypto docs and 8 dino docs, so that gives us the following:

Prior Prob(crypto) = 9 / (8 + 9) = 0.53

Prior Prob(dino) = 8 / (8 + 9) = 0.47

Ok priors, check. The next thing we need are conditional probabilities for the words in the document we’re trying to classify. How do we do that? Well we start by doing a word count on a new document. We’ll use the Yeti page as our new document.

https://gist.github.com/glamp/53881f20a2e930dac763.jsAlright, we’ve got our counts. Now we’ll calculate P(word|category) for each word and multiply each of these conditional probabilities together to calculate the P(category|set of words). To prevent computational errors, we’re going to perform the operations in logspace. All this means is we’re going to use the log(probability) so we require fewer decimal places. More on the mystical properties of logs here and here.https://gist.github.com/glamp/9117c11a6b0319e1e410.js

Since we’re slightly bending the rules of Bayes’ Theorem, the results are not actual probabilities, but rather are “scores”. All you really need to know is which one is bigger. So our suspicions are confirmed, the “Yeti.txt” file is being classified overwhelmingly in favor of crypto (as we would hope).

Bringing it home, the LEGO Yeti!

Final Thoughts

You can find all the code and documents used in this post on GitHub.

Naive Bayes is great because it’s fairly easy to see what’s going on under the hood. It’s a great way to start any text analysis and it can easily scale out of core to work in a distributed environment. There are some excellent implementations in the Python community you can use as well, so if you don’t want to roll your own, have no fear! The scikit-learn and nltk versions are great places to start.

Source: Blog.yhathq.com

Aside

An intro to Bayesian methods and probabilistic programming from a computation/understanding-first, mathematics-second point of view.

Prologue

Probabilistic ProgrammingThe Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. The typical text on Bayesian inference involves two to three chapters on probability theory, then enters what Bayesian inference is. Unfortunately, due to mathematical intractability of most Bayesian models, the reader is only shown simple, artificial examples. This can leave the user with a so-what feeling about Bayesian inference. In fact, this was the author’s own prior opinion.

After some recent success of Bayesian methods in machine-learning competitions, I decided to investigate the subject again. Even with my mathematical background, it took me three straight-days of reading examples and trying to put the pieces together to understand the methods. There was simply not enough literature bridging theory to practice. The problem with my misunderstanding was the disconnect between Bayesian mathematics and probabilistic programming. That being said, I suffered then so the reader would not have to now. This book attempts to bridge the gap.

If Bayesian inference is the destination, then mathematical analysis is a particular path towards it. On the other hand, computing power is cheap enough that we can afford to take an alternate route via probabilistic programming. The latter path is much more useful, as it denies the necessity of mathematical intervention at each step, that is, we remove often-intractable mathematical analysis as a prerequisite to Bayesian inference. Simply put, this latter computational path proceeds via small intermediate jumps from beginning to end, where as the first path proceeds by enormous leaps, often landing far away from our target. Furthermore, without a strong mathematical background, the analysis required by the first path cannot even take place.

Bayesian Methods for Hackers is designed as a introduction to Bayesian inference from a computational/understanding-first, and mathematics-second, point of view. Of course as an introductory book, we can only leave it at that: an introductory book. For the mathematically trained, they may cure the curiosity this text generates with other texts designed with mathematical analysis in mind. For the enthusiast with less mathematical-background, or one who is not interested in the mathematics but simply the practice of Bayesian methods, this text should be sufficient and entertaining.

The choice of PyMC as the probabilistic programming language is two-fold. As of this writing, there is currently no central resource for examples and explanations in the PyMC universe. The official documentation assumes prior knowledge of Bayesian inference and probabilistic programming. We hope this book encourages users at every level to look at PyMC. Secondly, with recent core developments and popularity of the scientific stack in Python, PyMC is likely to become a core component soon enough.

PyMC does have dependencies to run, namely NumPy and (optionally) SciPy. To not limit the user, the examples in this book will rely only on PyMC, NumPy, SciPy and Matplotlib only.

Contents

(The below chapters are rendered via the nbviewer at nbviewer.ipython.org/, and is read-only and rendered in real-time. Interactive notebooks + examples can be downloaded by cloning!

More questions about PyMC? Please post your modeling, convergence, or any other PyMC question on cross-validated, the statistics stack-exchange.

Examples from the book

Below are just some examples from Bayesian Methods for Hackers.

Inferring behaviour changes using SMS message rates

Chapter 1

By only visually inspecting a noisy stream of daily SMS message rates, it can be difficult to detect a sudden change in the users’s SMS behaviour. In our first probabilistic programming example, we solve the problem by setting up a simple model to detect probable points where the user’s behaviour changed, and examine pre and post behaviour.

Simpler AB Testing

Chapter 2

AB testing, also called randomized experiments in other literature, is a great framework for determining the difference between competing alternatives, with applications to web designs, drug treatments, advertising, plus much more.

With our new interpretation of probability, a more intuitive method of AB testing is demonstrated. And since we are not dealing with confusing ideas like p-values or Z-scores, we can compute more understandable quantities about our uncertainty.

Discovering cheating while maintaing privacy

Chapter 2

A very simple algorithm can be used to infer proportions of cheaters, while also maintaining the privacy of the population. For each participant in the study:

  1. Have the user privately flip a coin. If heads, answer “Did you cheat?”truthfully.
  2. If tails, flip again. If heads, answer “Yes” regardless of the truth; if tails, answer “No”.

This way, the suveyor’s do not know whether a cheating confession is a result of cheating or a heads on the second coin flip. But how do we cut through this scheme and perform inference on the true proportion of cheaters?

Challenger Space Shuttle disaster

Chapter 2

On January 28, 1986, the twenty-fifth flight of the U.S. space shuttle program ended in disaster when one of the rocket boosters of the Shuttle Challenger exploded shortly after lift-off, killing all seven crew members. The presidential commission on the accident concluded that it was caused by the failure of an O-ring in a field joint on the rocket booster, and that this failure was due to a faulty design that made the O-ring unacceptably sensitive to a number of factors including outside temperature. Of the previous 24 flights, data were available on failures of O-rings on 23, (one was lost at sea), and these data were discussed on the evening preceding the Challenger launch, but unfortunately only the data corresponding to the 7 flights on which there was a damage incident were considered important and these were thought to show no obvious trend.

We examine this data in a Bayesian framework and show strong support that a faulty O-ring, caused by low abmient temperatures, was likely the cause of the disaster.

Understanding Bayesian posteriors and MCMC

Chapter 3

The prior-posterior paradigm is visualized to make understanding the MCMC algorithm more clear. For example, below we show how two different priors can result in two different posteriors.

Clustering data

Chapter 3

Given a dataset, sometimes we wish to ask whether there may be more than one hidden source that created it. A priori, it is not always clear this is the case. We introduce a simple model to try to pry data apart into two clusters.

Sorting Reddit comments from best to worst

Chapter 4

Consider ratings on online products: how often do you trust an average 5-star rating if there is only 1 reviewer? 2 reviewers? 3 reviewers? We implicitly understand that with such few reviewers that the average rating is not a good reflection of the true value of the product. This has created flaws in how we sort items, and more generally, how we compare items. Many people have realized that sorting online search results by their rating, whether the objects be books, videos, or online comments, return poor results. Often the seemingly top videos or comments have perfect ratings only from a few enthusiastic fans, and truly more quality videos or comments are hidden in later pages with falsely-substandard ratings of around 4.8. How can we correct this?

Solving the Price is Right’s Showcase

Chapter 5

Bless you if you are ever chosen as a contestant on the Price is Right, for here we will show you how to optimize your final price on the Showcase. We create a Bayesian model of your best guess and your uncertainty in that guess, and push it through the odd Showdown loss function (closest wins, lose if you bid over).

Kaggle’s Dark World winning solution

Chapter 5

We implement Tim Saliman’s winning solution to the Observing Dark World’s contest on the data science website Kaggle.

Bayesian Bandits – a solution to the Multi-Armed Bandit problem

Chapter 6

Suppose you are faced with N slot machines (colourfully called multi-armed bandits). Each bandit has an unknown probability of distributing a prize (assume for now the prizes are the same for each bandit, only the probabilities differ). Some bandits are very generous, others not so much. Of course, you don’t know what these probabilities are. By only choosing one bandit per round, our task is devise a strategy to maximize our winnings.

Stock Market analysis

Chapter 6

For decades, finance students have been taught using naive statistical methods to pick stocks. This has caused terrible inference, mostly caused by two things: temporal parameters and ignoring uncertainty. The first is harder to solve, the second fits right into a Bayesian framework.

Using the book

The book can be read in three different ways, starting from most recommended to least recommended:

  1. The most recommended option is to clone the repository to download the .ipynb files to your local machine. If you have IPython installed, you can view the chapters in your browser plus edit and run the code provided (and try some practice questions). This is the preferred option to read this book, though it comes with some dependencies.
    • IPython v0.13 (or greater) is a requirement to view the ipynb files. It can be downloaded here. IPython notebooks can be run by (your-virtualenv) ~/path/to/the/book/Chapter1_Introduction $ ipython notebook
    • For Linux users, you should not have a problem installing NumPy, SciPy, Matplotlib and PyMC. For Windows users, check out pre-compiled versions if you have difficulty.
    • In the styles/ directory are a number of files (.matplotlirc) that used to make things pretty. These are not only designed for the book, but they offer many improvements over the default settings of matplotlib.
    • while technically not required, it may help to run the IPython notebook with ipython notebook --pylab inline flag if you encounter io errors.
  2. The second, preferred, option is to use the nbviewer.ipython.org site, which display IPython notebooks in the browser (example). The contents are updated synchronously as commits are made to the book. You can use the Contents section above to link to the chapters.
  3. PDF versions are available! Look in the PDF/ directory. PDFs are the least-prefered method to read the book, as pdf’s are static and non-interactive. If PDFs are desired, they can be created dynamically using Chrome’s builtin print-to-pdf feature or using thenbconvert utility.

Installation and configuration

If you would like to run the IPython notebooks locally, (option 1. above), you’ll need to install the following:

  1. IPython 0.13 is a requirement to view the ipynb files. It can be downloaded here
  2. For Linux users, you should not have a problem installing NumPy, SciPy and PyMC. For Windows users, check out pre-compiled versions if you have difficulty. Also recommended, for data-mining exercises, are PRAW and requests.
  3. In the styles/ directory are a number of files that are customized for the notebook. These are not only designed for the book, but they offer many improvements over the default settings of matplotlib and the IPython notebook. The in notebook style has not been finalized yet.

Development

This book has an unusual development design. The content is open-sourced, meaning anyone can be an author. Authors submit content or revisions using the GitHub interface.

What to contribute?

  1. The current chapter list is not finalized. If you see something that is missing (MCMC, MAP, Bayesian networks, good prior choices, Potential classes etc.), feel free to start there.
  2. Cleaning up Python code and making code more PyMC-esque
  3. Giving better explanations
  4. Spelling/grammar mistakes
  5. Suggestions
  6. Contributing to the IPython notebook styles

We would like to thank the Python community for building an amazing architecture. We would like to thank the statistics community for building an amazing architecture.

Similarly, the book is only possible because of the PyMC library. A big thanks to the core devs of PyMC: Chris Fonnesbeck, Anand Patil, David Huard and John Salvatier.

One final thanks. This book was generated by IPython Notebook, a wonderful tool for developing in Python. We thank the IPython community for developing the Notebook interface. All IPython notebook files are available for download on the GitHub repository.

Source: camdavidsonpilon.github.io

Aside

AI-books

Are you searching for some best books to get acquainted with the basics of AI? Here is our list!

1. A Course in Machine Learning

Machine learning is the study of computer systems that learn from data and experience. It is applied in an incredibly wide variety of application areas, from medicine to advertising, from military to pedestrian. Any area in which you need to make sense of data is a potential customer of machine learning.

2. Simply Logical: Intelligent Reasoning by Example

An introduction to Prolog programming for artificial intelligence covering both basic and advanced AI material. A unique advantage to this work is the combination of AI, Prolog and Logic. Each technique is accompanied by a program implementing it. Seeks to simplify the basic concepts of logic programming. Contains exercises and authentic examples to help facilitate the understanding of difficult concepts.

3. Logic for Computer Science: Foundations of Automatic Theorem Proving

Covers the mathematical logic necessary to computer science, emphasising algorithmic methods for solving proofs. Treatment is self-contained, with all required mathematics contained in Chapter 2 and the appendix. Provides readable, inductive definitions and offers a unified framework using Getzen systems.

4. Artificial Intelligence: Foundations of Computational Agents

This textbook, aimed at junior to senior undergraduate students and first-year graduate students, presents artificial intelligence (AI) using a coherent framework to study the design of intelligent computational agents. By showing how basic approaches fit into a multidimensional design space, readers can learn the fundamentals without losing sight of the bigger picture.

5. From Bricks to Brains: The Embodied Cognitive Science of LEGO Robots

From Bricks to Brains introduces embodied cognitive science and illustrates its foundational ideas through the construction and observation of LEGO Mindstorms robots. Discussing the characteristics that distinguish embodied cognitive science from classical cognitive science, the book places a renewed emphasis on sensing and acting, the importance of embodiment, the exploration of distributed notions of control, and the development of theories by synthesising simple systems and exploring their behavior.

6. Practical Artificial Intelligence Programming in Java

This book has been written for both professional programmers and home hobbyists who already know how to program in Java and who want to learn practical AI programming techniques. In the style of a “cook book”, the chapters in this book can be studied in any order. Each chapter follows the same pattern: a motivation for learning a technique, some theory for the technique, and a Java example program that you can experiment with.

7. An Introduction to Logic Programming Through Prolog

This is one of the few texts that combines three essential theses in the study of logic programming: the logic that gives logic programs their unique character: the practice of programming effectively using the logic; and the efficient implementation of logic programming on computers.

8. Essentials of Metaheuristics

The book covers a wide range of algorithms, representations, selection and modification operators, and related topics, and includes 70 figures and 133 algorithms great and small.

9. A Quick and Gentle Guide to Constraint Logic Programming

Introductory and down-to-earth presentation of Constraint Logic Programming, an exciting software paradigm, more and more popular for solving combinatorial as well as continuous constraint satisfaction problems and constraint optimisation problems.

10. Clever Algorithms: Nature-Inspired Programming Recipes

This book provides a handbook of algorithmic recipes from the fields of Metaheuristics, Biologically Inspired Computation and Computational Intelligence that have been described in a complete, consistent, and centralised manner. These standardised descriptions were carefully designed to be accessible, usable, and understandable.

11. Clever Algorithms: Nature-Inspired Programming Recipes

Covers the mathematical logic necessary to computer science, emphasizing algorithmic methods for solving proofs. Provides readable, inductive definitions and offers a unified framework using Getzen systems. Offers unique coverage of congruence, and contains an entire chapter devoted to SLD resolution and logic programming (PROLOG).

12. Common LISP: A Gentle Introduction to Symbolic Computation

This highly accessible introduction to Lisp is suitable both for novices approaching their first programming language and experienced programmers interested in exploring a key tool for artificial intelligence research.

13. Bio-Inspired Computational Algorithms and Their Applications

This book integrates contrasting techniques of genetic algorithms, artificial immune systems, particle swarm optimisation, and hybrid models to solve many real-world problems. The works presented in this book give insights into the creation of innovative improvements over algorithm performance, potential applications on various practical tasks, and combination of different techniques.

14. The Quest for Artificial Intelligence

This book traces the history of the subject, from the early dreams of eighteenth-century (and earlier) pioneers to the more successful work of today’s AI engineers.

15. Planning Algorithms

Planning algorithms are impacting technical disciplines and industries around the world, including robotics, computer-aided design, manufacturing, computer graphics, aerospace applications, drug design, and protein folding. Written for computer scientists and engineers with interests in artificial intelligence, robotics, or control theory, this is the only book on this topic that tightly integrates a vast body of literature from several fields into a coherent source for teaching and reference in a wide variety of applications.

16. Virtual Reality – Human Computer Interaction

At present, the virtual reality has impact on information organisation and management and even changes design principle of information systems, which will make it adapt to application requirements. The book aims to provide a broader perspective of virtual reality on development and application.

17. Affective Computing

This book provides an overview of state of the art research in Affective Computing. It presents new ideas, original results and practical experiences in this increasingly important research field.

18. Machine Learning, Neural and Statistical Classification

This book is based on the EC (ESPRIT) project StatLog which compare and evaluated a range of classification techniques, with an assessment of their merits, disadvantages and range of application. This integrated volume provides a concise introduction to each method, and reviews comparative trials in large-scale commercial and industrial problems.

19. Ambient Intelligence

Ambient Intelligence has attracted much attention from multidisciplinary research areas and there are still open issues in most of them. In this book a selection of unsolved problems which are considered key for ambient intelligence to become a reality, is analysed and studied in depth.

20. The World and Mind of Computation and Complexity

With the increase in development of technology, there is research going into the development of human-like artificial intelligence that can be self-aware and act just like humans. This book explores the possibilities of artificial intelligence and how we may be close to developing a true artificially intelligent being.

 

Aside

alpineAlpine Data Labs announced the introduction of Alpine Chorus 5.0, an advanced analytics enterprise platform to enable organizations to unify all data access, control and innovation into one consistent and collaborative environment.

With this release, the company introduces new functionality to allow business executives to oversee and manage their analytics ecosystem while enabling all employees to collaborate and innovate with data.  Alpine Chorus 5.0 also unveils a new extensible framework which scientists and analysts can use to integrate and manage popular technologies like R and Spark.  Finally, the company further advances its “in-cluster” innovation lead by introducing access, discovery and management parity across all Hadoop distributions (HortonWorks, Cloudera, MapR and Pivotal) and all major database platforms from Oracle, to Teradata to SQL Server.

It’s not just about the algorithm

Alpine has focused on making Machine Learning at web scale accessible to more people so they can solve real business problems,” says Joe Otto, President and CEO at Alpine Data Labs. “As we worked with some of the most sophisticated users of advanced analytics, we saw that, to operationalize meaningful insights, customers require a platform that enables a new engagement model between analytics, people, and business processes.”

Alpine Chorus 5.0 new monitoring and operationalization capabilities provide all employees with a consistent, personalized and secure interface into the enterprise’s analytical work.  It ensures that more people – not just data scientists – can actively participate in the development, deployment and fast iteration of insightful applications.

It’s not just about Hadoop

We built Alpine Chorus from the start knowing that it would require enterprise-level support for both Hadoop and non-Hadoop systems,” says Steven Hillion, Chief Product Officer and co-founder at Alpine Data Labs.  “The majority of enterprises we work with, use Hadoop as an augmentation of their data environment, rarely as a replacement of their existing data warehouses.”

By supporting all Hadoop distributions and non-Hadoop sources, Chorus 5.0 becomes the advanced analytics platform that drives in-cluster analytics processing into the broadest set of data sources in the industry.

It is about an open platform

Alpine Chorus allows us to take advantage of the latest innovation in Advanced Analytics while supporting open source tools like R and standards like PMML,” says Hauke Schmidt, Head of Data Mining Services at Bosch.  “Alpine Chorus’ open architecture provides a great fit for organizations that want to integrate multiple tools and want Advanced Analytics solutions to play nicely with existing infrastructure.”

The Alpine Chorus R integration indeed provides a large number of benefits to R developers, from enabling them to work with R in a collaborative manner to benefiting from centralized management, scheduling and security capabilities.  Now, the use of R can be extended to a much larger number of people and use cases.  Finally, the R integration enables Data Scientists to drop their R code into existing or new Alpine Chorus workflows, combine them with machine learning algorithms and models that don’t use R.  This added value provides a great roadmap for extending the reach and use of languages like R.

Our integration with R is a prime example of the value of our new open framework,” says Steven Hillion. “The average enterprise uses a collection of tools for data mining and statistical analysis.  We are not trying to replace them.  Rather, we provide a platform that unites them and manages them into a scalable enterprise solution.”

To find out more, request a free trial of Alpine Chorus 5.0http://start.alpinenow.com

Aside

Between 2007 and 2010, graduate students Erez Aiden, Jean-Baptiste Michel and Yuan Shen developed a tool for analyzing Google’s digitized books. Known as the N-gram viewer, their tool essentially just counted the number of times a word or phrase appeared in all the digitized publications for a given year. By plotting the year-by-year count, the software produced an N-gram: a chart showing the frequency of word use over time. For example, the chart below shows the frequency of the words “data,” “fact,” and “information” plotted from 1800 to 2000.

 

https://books.google.com/ngrams/interactive_chart?content=data%2Cfact%2C+information&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cdata%3B%2Cc0%3B.t1%3B%2Cfact%3B%2Cc0%3B.t1%3B%2Cinformation%3B%2Cc0

This is a powerful tool and there is lots that we can discern from even this basic query (that took all of 5 seconds): “facts” became more and more important through the 19th century, reaching a plateau by about 1920, and then declining sharply from about 1970 onward; “data” and “information” have grown rapidly in popularity in the 20th century, tracking each other closely until about the mid-1980s.

Big Data as a Lens on Human cultureAiden and Michel have written about their N-gram viewer and how they came to develop it in a book: Uncharted: Big Data as a Lens on Human Culture (2013). The book gives lots of interesting example of how N-grams might be used to track fame, understand nationalism, explore the birth and death of words and concepts, and follow the influence of inventions. Aiden and Michel see their tool as part of a big data revolution that will transform the way we understand human culture in general and history in particular. “Big data,” they argue, “is going to change the humanities, transform the social sciences, and renegotiate the relationship between the world of commerce and the ivory tower” (p. 8). They see the N-gram as an instrument, like the microscope, that will provide us with a new quantitative insight into historical change. They call this approach “culturomics” (in analogy with genomics).

But there are some good reasons to be skeptical, or perhaps even a little worried, about such claims. There are two issues in particular that are suggestive of more general problems for big data approaches.

First, there is the issue of completeness. One of the vaunted features of big data approaches are their ability to utilize very wide data sets. Victor Mayer-Schonberger and Kenneth Cukier have argued that one of the unique features of big data approaches is the use of “all available data.” Of course, Google Books has a very wide (and long) data set – 30 million or so books. And for this reason it is easy to see something like N-gram as working with “all available data” or some close approximation of it. But even this huge amount of text is only 20-25% of all material ever published. This is a large fraction, and it is no doubt valuable for all sorts of purposes. But it raises questions about what is left out. Do we have a representative sample? Are some sorts of books being systematically excluded? How does copyright influence which books are available?

Even leaving aside the question of what included and excluded, we have another problem. Google Books only (at the moment at least) takes account of published materials. Humans have, over the last centuries, created and left behind huge troves of unpublished material – letters, diaries, manuscripts, memos, files, etc. that were never published. Paying attention only to published sources can lead to unbalanced accounts. Publication is expensive and often those who publish written work tend to be those in positions of power and influence (men not women, lords not peasants, owners not workers, colonizers not colonized, etc.). Historians have grappled with this problem for many years – writing balanced history often means finding a way to account for inherently unbalanced sources. In dealing with societies where only a small fraction of the population is literate, one must be even more careful not to take any written material (published or unpublished) as representative of the overall population.

Second, there is the issue of quantitation. Historians attempt to make up for the inadequacy of their sources by using their intuition, judgment, empathy, and imagination. This sort of work may become increasingly marginalized in a world where the humanities and social sciences are subjected to quantitation. Aiden and Michel imagine and celebrate a “scrambling of the boundaries between science and the humanities” (p. 207). Surely, both have a lot to learn from one another. But it is important that quantitation does not dominate history, or other humanities too much. Counting words is important, but it cannot tell us everything. N-gram viewer tells us nothing about the context in which words appear for example. Context is exactly the kind of thing that historians are concerned with. Ultimately, N-gram viewer is a one-dimensional measure of a highly-multi-dimensional space (culture, society, history).

To be fair, Aiden and Michel are aware of many of the shortcomings of their tool and their book contains some intelligent discussions of its possible limitations. However, big data tools of this type – powerful and yet easy to use – are proving immensely popular. They are seductive. However, here and with big data in general, we should be careful not to confuse having very large amounts of data with having “all data” and we should be careful to temper the drive to quantification with careful judgment and intuition.

Source: SmartDataCollective.com

Aside

berganx299Future smartphones will be able to understand what you’re taking photos of and recognize faces, says mobile chip maker Qualcomm. Researchers at the company are working to make a powerful new approach to artificial intelligence known as deep learning a standard feature of mobile devices.

Smartphone camera apps often have “scene” modes to get the best shots of landscapes, sports, or sunsets. Qualcomm has created a camera app able to identify different types of scenes on its own, based on their visual characteristics. That could lead to phones that can choose their own settings without having to send or receive data over the Internet.

Charles Bergan, who leads software research at Qualcomm, demonstrated that software in a sponsored talk at MIT Technology Review’s EmTech conferencelast week in Cambridge, Massachusetts. He said that it should be possible to use the same approach to create software that could decide the best moment to take a photo. “Maybe it will detect that it’s a soccer game and look for that moment when the ball is just lifting off,” he said.

Bergan also demonstrated a facial-recognition app. It recognized his face despite being trained to recognize his features using only a short, shaky, and poorly lit video of his face.

Those demonstrations were based on deep learning, a technique that trains software by processing data through networks of simulated neurons (see “10 Breakthrough Technologies 2013: Deep Learning”). In the case of the scene-classifying app, for example, the simulated neurons were exposed to thousands of photos of different types of scenes.

Bergan said that one reason Qualcomm is working on enabling phones to run deep learning software is that major mobile device manufacturers requested ways to make their devices smarter about images. When exactly the features might make it into phones is unclear.

Qualcomm has previously experimented with chips that are considered “neuromorphic,” because their circuits are arranged in neuron-like arrangements (see “Qualcomm to Build Neuro-Inspired Chips”). However, designs like that are still very much research projects (see “10 Breakthrough Technologies 2014: Neuromorphic Chips”). Bergan says that adding small “accelerators” for deep learning to existing chip designs would be a more practical approach.

Source: MIT Technology Review