Pulled from the web, here is a great collection of eBooks (most of which have a physical version that you can purchase on Amazon) written on the topics of Data Science, Business Analytics, Data Mining, Big Data, Machine Learning, Algorithms, Data Science Tools, and Programming Languages for Data Science.


Data Science in General

Interviews with Data Scientists

Forming Data Science Teams

Data Analysis

Distributed Computing Tools

Learning Languages


Data Mining and Machine Learning

Statistics and Statistical Learning

Data Visualization

Big Data

Computer Science Topics

If you have any suggestions of free books to include or want to review a book mentioned, please comment below and let me know!


The second edition of this landmark book adds Jure Leskovec as a coauthor and has three new chapters, on mining large graphs, dimensionality reduction, and machine learning. You can still freely download a PDF version.

There is a revised Chapter 2 that treats map-reduce programming in a manner closer to how it is used in practice, rather than how it was described in the original paper. Chapter 2 also has new material on algorithm design techniques for map-reduce.

Support Materials include Gradiance automated homeworks for the book and slides.

The authors note that if you want to reuse parts of this book, you need to obtain their permission and acknowledge their authorship. They have seen evidence that other items they published have been appropriated and republished under other names; that kind of copying is easy to detect, as you will learn in Chapter 3.

Download chapters of the book:

Preface and Table of Contents
Chapter 1 Data Mining
Chapter 2 Map-Reduce and the New Software Stack
Chapter 3 Finding Similar Items
Chapter 4 Mining Data Streams
Chapter 5 Link Analysis
Chapter 6 Frequent Itemsets
Chapter 7 Clustering
Chapter 8 Advertising on the Web
Chapter 9 Recommendation Systems
Chapter 10 Mining Social-Network Graphs
Chapter 11 Dimensionality Reduction
Chapter 12 Large-Scale Machine Learning


LDA automatically assigns topics to text documents. How is it done? Which are its limitations? What is the best open-source library to use in your code?

In this post we’re going to describe how topics can be automatically assigned to text documents; this process is named, unsurprisingly, topic-modelling. It works like fuzzy (or soft) clustering: rather than strictly assigning each document to its dominant topic, it gives the document a weight for every topic.

Let’s start with an example: the optimal topic-modelling outcome for Shakespeare’s Romeo & Juliet would be a composition of topics of circa 50% tragedy and 50% romance. Topics like social networks, football and Indian cuisine don’t appear in the play, so their weights would all be 0%.

One of the most advanced algorithms for doing topic-modelling is Latent Dirichlet Allocation (or LDA), a probabilistic model developed by Blei, Ng and Jordan in 2003. LDA is an iterative algorithm which requires only three parameters to run: when they’re chosen properly, its accuracy is pretty high. Unfortunately, one of the required parameters is the number of topics: exactly as with K-means, this requires deep a-priori knowledge of the dataset.

A good measure to evaluate the performance of LDA is perplexity. This measure is taken from information theory and measures how well a probability distribution predicts an observed sample. To evaluate the LDA model, one document is taken and split in two. The first half is fed into LDA to compute the topic composition; from that composition, the word distribution is estimated. This distribution is then compared with the word distribution of the second half of the document, and a measure of distance is extracted. In practice, perplexity is often used to select the best number of topics for the LDA model.
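As a plain-Python sketch of the idea (not tied to any particular library): perplexity is just the exponential of the negative average log-likelihood the model assigns to the held-out words, so a model that concentrates probability on the words actually observed scores lower (better) than one that spreads it thinly.

```python
import math

def perplexity(word_probs):
    """Perplexity of a model over an observed sample.

    word_probs: the model's predicted probability for each observed word.
    Lower perplexity means the model predicts the sample better.
    """
    n = len(word_probs)
    log_likelihood = sum(math.log(p) for p in word_probs)
    return math.exp(-log_likelihood / n)

# A confident model is less "perplexed" than a uniform one.
confident = [0.5, 0.4, 0.6, 0.5]
uniform = [0.1, 0.1, 0.1, 0.1]
assert perplexity(confident) < perplexity(uniform)
print(round(perplexity(uniform), 6))  # 10.0
```

To pick the number of topics, you would compute this over a grid of topic counts and keep the one with the lowest held-out perplexity.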

Under the hood, LDA models both the topics-per-document and the words-per-topic distributions as Dirichlet distributions (hence the name). By using Gibbs sampling, a Markov Chain Monte Carlo (MCMC) method that samples from and approximates the stationary distribution of the underlying Markov chain, the whole process is iterative, pretty fast, convergent and accurate. The math behind LDA is fairly complex, but a simple example of how LDA works is contained in this video presentation by David Mimno, a world-class researcher in topic-modelling:
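To make the sampling step concrete, here is a toy collapsed Gibbs sampler in plain Python. The tiny corpus and all names are illustrative, and the libraries reviewed below use far more optimised implementations; this only shows the core loop of removing a word's topic assignment and resampling it from the conditional distribution.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of word tokens.
    Returns the final topic assignment z[i][j] for word j of doc i.
    """
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    # Count tables: doc-topic counts, topic-word counts, topic totals.
    ndk = [[0] * K for _ in docs]
    nkw = [defaultdict(int) for _ in range(K)]
    nk = [0] * K
    # Random initial assignment.
    z = [[rng.randrange(K) for _ in d] for d in docs]
    for i, d in enumerate(docs):
        for j, w in enumerate(d):
            k = z[i][j]
            ndk[i][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for i, d in enumerate(docs):
            for j, w in enumerate(d):
                k = z[i][j]
                # Remove this word's current assignment from the counts...
                ndk[i][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # ...then resample its topic from the conditional
                # p(k) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ).
                weights = [(ndk[i][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(K)]
                r = rng.random() * sum(weights)
                for k, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        break
                z[i][j] = k
                ndk[i][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return z

docs = [["ball", "goal", "ball"], ["vote", "law", "vote"],
        ["ball", "goal", "law"]]
z = lda_gibbs(docs, K=2)
assert all(0 <= k < 2 for row in z for k in row)
```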

Topic Modeling Workshop: Mimno from MITH in MD.

Start TL;DR

For the bravest, this is the graphical representation of LDA: grey circles represent the observable variables; latent (also called hidden) ones are white. Boxes represent collections (repetitions) of variables.

Graphical representation of LDA

[Image taken from Wikipedia, CC-3.0 licensed]

Parameters of the model:

  • Boxed:
    • K is the number of topics
    • N is the number of words in the document
    • M is the number of documents to analyse
  • α is the Dirichlet-prior concentration parameter of the per-document topic distribution
  • β is the same parameter of the per-topic word distribution
  • φ(k) is the word distribution for topic k
  • θ(i) is the topic distribution for document i
  • z(i,j) is the topic assignment for w(i,j)
  • w(i,j) is the j-th word in the i-th document

φ and θ are Dirichlet distributions, z and w are multinomials.
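The generative story these parameters describe can be sketched in a few lines of plain Python; the standard library's `random.gammavariate` gives Dirichlet draws by normalising Gamma samples. All sizes below (K, V, M, N) are illustrative.

```python
import random

rng = random.Random(42)

def dirichlet(alphas):
    """Sample from a Dirichlet by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(draws)
    return [d / s for d in draws]

def categorical(probs):
    """Sample an index according to a probability vector."""
    r = rng.random()
    for i, p in enumerate(probs):
        r -= p
        if r <= 0:
            return i
    return len(probs) - 1  # guard against round-off

K, V, M, N = 2, 5, 3, 6        # topics, vocabulary size, docs, words per doc
alpha, beta = 0.5, 0.5
phi = [dirichlet([beta] * V) for _ in range(K)]   # φ(k): word dist per topic
docs = []
for _ in range(M):
    theta = dirichlet([alpha] * K)                # θ(i): topic dist per doc
    doc = []
    for _ in range(N):
        z = categorical(theta)                    # z(i,j): topic assignment
        w = categorical(phi[z])                   # w(i,j): observed word id
        doc.append(w)
    docs.append(doc)
assert len(docs) == M and all(len(d) == N for d in docs)
```

Inference (Gibbs sampling, variational methods, ...) is simply this story run backwards: given only the w(i,j), recover plausible φ, θ and z.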


On the Internet there are a bunch of libraries able to perform topic-modelling through LDA. Note that the acronym LDA also refers to another technique with the same initials (Linear Discriminant Analysis), but the two algorithms are completely unrelated. In what follows, we give our point of view on some open-source Latent Dirichlet Allocation implementations. For each of them, we point out strengths and weaknesses, as well as how simple each is to install, use and scale.


Mallet and Stanford TMT

  • Current status: no longer developed or maintained
  • Programming language: Java (Mallet) and Scala (TMT)
  • Features: university-backed software, not optimised for production. Great for learning and exploring LDA on small datasets, understanding its parameters and tuning the model.
  • Scalability: multi-threaded, single-machine. Good for small to medium collections of documents.
  • Simplicity to install: simple. Jar distributed and Maven compilable.
  • Simplicity to train the model: simple and very customisable.
  • Infer topics on unseen documents: simple as well as training.


Yahoo_LDA

  • Current status: no longer developed or maintained
  • Programming language: C++
  • Features: Very scalable LDA algorithm, able to scale across multiple hosts and cores. Code is very optimised, and requires experienced C++ developers to modify it.
  • Scalability: multi-core, multi-machine Hadoop backed. Good for medium to huge collections of documents (it’s able to handle 1M+ documents).
  • Simplicity to install: pretty complicated. A four-year-old Linux box with many outdated libraries is required, and the Ice dependency is very tricky to install.
  • Simplicity to train the model: cumbersome. It took a long while to get Yahoo_LDA working properly on a Hadoop cluster. Also, in case of error, C++ compiled code running on a Java/Hadoop system makes investigating what went wrong very hard.
  • Infer topics on unseen documents: simpler than the training phase.


GraphLab

  • Current status: active. Maintained by GraphLab Inc. and the community
  • Programming language: C++
  • Features: Very scalable LDA algorithm, able to scale across multiple hosts and cores. Code and algorithms are very optimised, and requires experienced C++ developers to modify it.
  • Scalability: multi-core, multi-machine through MPIs. Good for medium to huge collections of documents (it’s able to handle 1M+ documents).
  • Simplicity to install: pretty simple (cMake), with few dependencies to install.
  • Simplicity to train the model: pretty simple, even in a multi-machine environment. Following the easy documentation, LDA simply works.
  • Infer topics on unseen documents: complex. There is no out-of-the-box routine to infer topics on new documents, but creating that inferencer is not too complicated.
  • Note: recently, documentation of LDA has disappeared from the website. Fortunately, it’s still available from the internet archive.


Gensim

  • Current status: active. Maintained by Radim Řehůřek and the community
  • Programming language: Python (with core pieces in Fortran/C)
  • Features: Very scalable LDA algorithm, distributed, able to process input collections larger than RAM (online learning) and easy to use.
  • Scalability: multi-core, multi-machine through RPCs. Good for medium to large collections of documents (it’s able to handle 1M+ documents).
  • Simplicity to install: very easy (using pip install or easy_install)
  • Simplicity to train the model: very simple. There are many helper routines that allow you to build and tune the model with a few lines of code (also in a multi-machine environment)
  • Infer topics on unseen documents: very easy. It also updates the model with the new sample.
  • Quick tour as IPython notebook here.

LDA limitations: what’s next?

Although LDA is a great algorithm for topic-modelling, it still has some limitations, partly because it has only recently become popular and widely available. One major limitation is perhaps its underlying unigram text model: LDA doesn’t consider the mutual position of the words in the document. Documents like “Man, I love this can” and “I can love this man” are probably modelled the same way. It’s also true that for longer documents, mismatching topics in this way is harder. To overcome this limitation, at the cost of almost squaring the complexity, you can use 2-grams (or N-grams) along with 1-grams.
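A quick way to see why the unigram model conflates those two sentences, and what 2-grams buy you (plain Python; the tokenisation is deliberately naive):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "man i love this can".split()
b = "i can love this man".split()

# The unigram bags are identical, so a unigram model can't tell them apart...
assert sorted(a) == sorted(b)

# ...but the bigram sets differ, at the cost of a much larger vocabulary.
assert set(ngrams(a, 2)) != set(ngrams(b, 2))
print(ngrams(a, 2))
```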

Another weakness of LDA is in the topics composition: they’re overlapping. In fact, you can find the same word in multiple topics (the example above, of the word “can”, is obvious). The generated topics, therefore, are not independent and orthogonal like in a PCA-decomposed basis, for example. This implies that you must pay lots of attention while dealing with them (e.g. don’t use cosine similarity).

For a more structured approach – especially if the topic composition is very misleading – you might consider the hierarchical variation of LDA, named H-LDA (or simply Hierarchical LDA). In H-LDA, topics are joined together in a hierarchy by using a Nested Chinese Restaurant Process (NCRP). This model is more complex than LDA, and a full description is beyond the goal of this blog entry, but if you’d like an idea of the possible output, here it is. Don’t forget that we’re still in the probabilistic world: each node of the H-LDA tree is a topic distribution.

Graphical representation of H-LDA

[Image taken from the original paper on H-LDA: Blei, Jordan, Griffiths, Tenenbaum, © MIT Press, 2004]

If you’re very curious about LDA, here is a quick example.


Clustering With K-Means in Python

The Data Science Lab

A very common task in data analysis is that of grouping a set of objects into subsets such that all elements within a group are more similar among them than they are to the others. The practical applications of such a procedure are many: given a medical image of a group of cells, a clustering algorithm could aid in identifying the centers of the cells; looking at the GPS data of a user’s mobile device, their more frequently visited locations within a certain radius can be revealed; for any set of unlabeled observations, clustering helps establish the existence of some sort of structure that might indicate that the data is separable.

Mathematical background

The k-means algorithm takes a dataset X of N points as input, together with a parameter K specifying how many clusters to create. The output is a set of K cluster centroids and a labeling…

View original post 705 more words
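As a taste of what the original post covers, here is a minimal pure-Python sketch of Lloyd's algorithm, the standard k-means procedure (the data points and names below are illustrative, not from the original post):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeans(X, K, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(X, K)  # initialise from K distinct data points
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        labels = [min(range(K), key=lambda k: dist2(x, centroids[k])) for x in X]
        # Update step: each centroid moves to the mean of its points.
        new = []
        for k in range(K):
            pts = [x for x, l in zip(X, labels) if l == k]
            new.append(tuple(sum(c) / len(pts) for c in zip(*pts)) if pts
                       else centroids[k])
        if new == centroids:  # converged
            break
        centroids = new
    return centroids, labels

# Two well-separated blobs; k-means should recover them.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
     (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centroids, labels = kmeans(X, K=2)
assert labels[0] == labels[1] == labels[2]
assert labels[3] == labels[4] == labels[5]
```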


Markov chains, named after Andrey Markov, are mathematical systems that hop from one “state” (a situation or set of values) to another. For example, if you made a Markov chain model of a baby’s behavior, you might include “playing,” “eating,” “sleeping,” and “crying” as states, which together with other behaviors could form a “state space”: a list of all possible states. In addition, on top of the state space, a Markov chain tells you the probability of hopping, or “transitioning,” from one state to any other state—e.g., the chance that a baby currently playing will fall asleep in the next five minutes without crying first.

A simple, two-state Markov chain is shown below.

With two states (A and B) in our state space, there are 4 possible transitions (not 2, because a state can transition back into itself). If we’re at ‘A’ we could transition to ‘B’ or stay at ‘A’. If we’re at ‘B’ we could transition to ‘A’ or stay at ‘B’. In this two state diagram, the probability of transitioning from any state to any other state is 0.5.

Of course, real modelers don’t always draw out Markov chain diagrams. Instead they use a “transition matrix” to tally the transition probabilities. Every state in the state space is included once as a row and again as a column, and each cell in the matrix tells you the probability of transitioning from its row’s state to its column’s state. So, in the matrix, the cells do the same job that the arrows do in the diagram.
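A transition matrix maps naturally onto a nested dictionary, and simulating the chain is just repeated weighted sampling from the current state's row. A sketch of the two-state chain above (state names and the seed are arbitrary):

```python
import random

# Transition matrix for the two-state diagram: P[s][t] is the
# probability of moving from state s to state t.
P = {"A": {"A": 0.5, "B": 0.5},
     "B": {"A": 0.5, "B": 0.5}}

def step(state, P, rng):
    """Sample the next state from the current state's row."""
    r = rng.random()
    for nxt, p in P[state].items():
        r -= p
        if r <= 0:
            return nxt
    return nxt  # guard against floating-point round-off

rng = random.Random(1)
path = ["A"]
for _ in range(10):
    path.append(step(path[-1], P, rng))
print(path)  # a random walk over the states A and B
```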

If the state space adds one state, we add one row and one column, adding one cell to every existing column and row. This means the number of cells grows quadratically as we add states to our Markov chain. Thus, a transition matrix comes in handy pretty quickly, unless you want to draw a jungle gym Markov chain diagram.

One use of Markov chains is to include real-world phenomena in computer simulations. For example, we might want to check how frequently a new dam will overflow, which depends on the number of rainy days in a row. To build this model, we start out with the following pattern of rainy (R) and sunny (S) days:

One way to simulate this weather would be to just say “Half of the days are rainy. Therefore, every day in our simulation will have a fifty percent chance of rain.” This rule would generate the following sequence in simulation:

Did you notice how the above sequence doesn’t look quite like the original? The second sequence seems to jump around, while the first one (the real data) seems to have a “stickiness”. In the real data, if it’s sunny (S) one day, then the next day is also much more likely to be sunny.

We can mimic this “stickiness” with a two-state Markov chain. When the Markov chain is in state “R”, it has a 0.9 probability of staying put and a 0.1 chance of leaving for the “S” state. Likewise, the “S” state has a 0.9 probability of staying put and a 0.1 chance of transitioning to the “R” state.
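A short simulation makes the difference visible (seeds and run length below are arbitrary): the sticky chain changes state far less often than a memoryless fifty-fifty coin flip.

```python
import random

# "Sticky" weather chain: 0.9 probability of staying in the same state.
P = {"R": {"R": 0.9, "S": 0.1},
     "S": {"S": 0.9, "R": 0.1}}

def simulate(P, start, days, seed):
    """Run the chain for `days` days from the given start state."""
    rng = random.Random(seed)
    seq = [start]
    for _ in range(days - 1):
        r = rng.random()
        for nxt, p in P[seq[-1]].items():
            r -= p
            if r <= 0:
                break
        seq.append(nxt)
    return seq

def switches(seq):
    """How many days the weather changed from the previous day."""
    return sum(a != b for a, b in zip(seq, seq[1:]))

sticky = simulate(P, "S", 100, seed=2)
coin = random.Random(3)
iid = [coin.choice("RS") for _ in range(100)]  # memoryless fifty-fifty model
assert switches(sticky) < switches(iid)  # the Markov chain is "stickier"
```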

In the hands of meteorologists, ecologists, computer scientists, financial engineers and other people who need to model big phenomena, Markov chains can get to be quite large and powerful. For example, the algorithm Google uses to determine the order of search results, called PageRank, is a type of Markov chain.

Above, we’ve included a Markov chain “playground”, where you can make your own Markov chains by messing around with a transition matrix. Here are a few to work from as examples: ex1, ex2, ex3, or generate one randomly. The transition matrix text will turn red if the provided matrix isn’t a valid transition matrix: each row must total 1, and there must be the same number of rows as columns.
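Those validity checks are easy to sketch in code: the matrix must be square, its entries non-negative, and every row must sum to 1 (within floating-point tolerance).

```python
def is_valid_transition_matrix(P, tol=1e-9):
    """Square, non-negative entries, and every row sums to 1."""
    n = len(P)
    if any(len(row) != n for row in P):
        return False  # not square
    return all(all(p >= 0 for p in row) and abs(sum(row) - 1) <= tol
               for row in P)

assert is_valid_transition_matrix([[0.5, 0.5], [0.1, 0.9]])
assert not is_valid_transition_matrix([[0.5, 0.6], [0.1, 0.9]])  # row sums to 1.1
assert not is_valid_transition_matrix([[1.0], [0.0, 1.0]])       # not square
```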

Source: Setosa.io


While not a new concept, gamification has been catching on in recent years as organizations figure out how best to utilize it. After a period of experimentation, many companies are starting to realize how effective it can actually be. Most of the time this is used as a way to boost the customer experience by fostering more interaction and a spirit of competition and gaming, all to benefit the overall bottom line of the business. More recently, however, gamification is being used in other matters, more specifically being directed internally to the employees themselves. In fact, a report from Gartner predicts that 40 percent of global 1000 organizations will use gamification this year to transform the way their businesses operate. As can be learned from the experiences of some organizations, gamification can also be used to tackle some serious issues, such as helping workers overcome their fear of failure.


To better understand how gamification can eliminate certain fears, it’s best to start with a clear understanding of what the concept is. According to gamification thought leader Gabe Zichermann, it is “the process of engaging people and changing behavior with game design, loyalty, and behavioral economics.” Essentially, it applies the joy of playing a game to topics and activities that might not be considered so enjoyable. Very few people like to fail, let alone confront the reasons why they failed. It’s a touchy subject that often requires a delicate hand to manage. With that in mind, it may seem strange that a game could help employees face their fears of failure, but that’s exactly what’s being done, often with good results.

One organization that has taken this idea and run with it is DirecTV. The satellite television company recently had a few IT project failures and decided it needed to address how to face these shortcomings. IT leaders came up with the idea to create a gamification learning platform where IT workers could discuss failure by creating and viewing videos about the subject. Simply creating the platform and posting videos wasn’t enough. IT leaders needed to come up with a way to make sure employees were using it and posting videos of their own. This all had to be done privately, with enhanced network security, due to the potentially sensitive nature of the videos. The gamification aspect comes in by awarding points and badges to those workers who do make use of it. Prizes were even given to those who scored the most points from the platform. The result was a major increase in usage rate among the IT staff. In fact, not long after the platform was launched, the company was experiencing a 97 percent participation rate, with most workers giving positive feedback on the experience.

Building on DirecTV’s gamification platform, employees were able to talk openly about failing, discuss why the failures happened, and improve on their overall performance. IT leaders for DirecTV said later projects were completed much more smoothly as the whole staff became more successful. From this experience, other organizations may want to use gamification to eliminate the fear of failure among their own staffs, so it’s helpful to take note of a number of factors that can help businesses achieve this goal. First, gamification platforms need to be intuitive, and all actions need to relate to the overarching goal (in this case, facing the fear of failure). Second, newcomers should have a way to start on the platform and be competitive. Facing other employees that have a lot of experience and accumulated points can be intimidating, so newcomers need a good entry point that gives them a chance to win. Third, the platform should constantly be adjusted based on user feedback. Creators who make the platform and don’t touch it afterward will soon see participation rates drop off and enthusiasm wane. And last, a gamification platform that doesn’t foster social interaction and competition is a platform not worth having.

Gamification allows employees to confront failure in a setting deemed safe and fun. People don’t enjoy failure and often go to great lengths to avoid even the very prospect of it, but sometimes to succeed, risks have to be taken. By helping workers overcome this fear with gamification, businesses can ensure future success and a more confident environment. It may seem a little outside the mainstream, but turning the fear of failure into a game can actually pay big dividends as employees learn there’s nothing wrong with trying hard and coming up short.




BlueData, a pioneer in Big Data private clouds, announced a technology preview of the Tachyon in-memory distributed storage system as a new option for the BlueData EPIC platform. Together with the company’s existing integration with Apache Spark, BlueData supports the next generation of Big Data analytics with real-time capabilities at scale, which allows organizations to realize value from their Big Data that wasn’t possible before. In addition, this new integration enables Hadoop and HBase virtual clusters, and other applications provisioned in the BlueData platform, to take advantage of Tachyon’s high-performance in-memory data processing.

Enterprises need to be able to run a wide variety of Big Data jobs such as trading, fraud detection, cybersecurity and system monitoring. These high performance applications require the ability to run in real-time and at scale in order to provide true value to the business. Existing Big Data approaches using Hadoop are relatively inflexible and do not fully meet the business needs for high speed stream processing. New technologies like Spark, which offers 100X faster data processing, and Tachyon, which offers 300X higher throughput, overcome these challenges.

“Big Data is about the combination of speed and scale for analytics. With the advent of the Internet of Things and streaming data, Big Data is helping enterprises make more decisions in real time. Spark and Tachyon will be the next generation of building blocks for interactive and instantaneous processing and analytics, much like Hadoop MapReduce and disk-based HDFS were for batch processing,” said Nik Rouda, senior analyst of Enterprise Strategy Group. “By incorporating a shared in-memory distributed storage system in a common platform that runs multiple clusters, BlueData streamlines the development of real-time analytics applications and services.”

However, incorporating these technologies with existing Big Data platforms like Hadoop requires point integrations on a cluster-by-cluster basis, which makes it manual and slow. With this preview, BlueData is streamlining infrastructure by creating a unified platform that incorporates Tachyon. This allows users to focus on building real-time processing applications rather than manually cobbling together infrastructure components.

“We are thrilled to welcome BlueData into the Tachyon community, and we look forward to working with BlueData to refine features for Big Data applications,” said Haoyuan Li, co-creator and lead of Tachyon.

The BlueData platform also includes high availability, auto tuning of configurations based on cluster size and virtual resources, and compatibility with each of the leading Hadoop distributions. Customers who deploy BlueData can now take advantage of these enterprise-grade benefits along with the memory-speed advantages of Spark and Tachyon for any Big Data application, on any server, with any storage.

“First generation enterprise data lakes and data hubs showed us the possibilities with batch processing and analytics. With the advent of Spark, the momentum has clearly shifted to in-memory and streaming with emerging use cases around IoT, real-time analytics and high speed machine learning. Tachyon’s appealing architecture has the potential to be a key foundational building block for the next generation logical data lake and key to the adoption and success of in-memory computing,” said Kumar Sreekanti, CEO and co-founder of BlueData. “BlueData is proud to deliver the industry’s first Big Data private cloud with a shared, distributed in-memory Tachyon file system. We look forward to continuing our partnership with Tachyon to deliver on our mission of democratizing Big Data private clouds.”



If you wish to process huge piles of data very, very quickly, you’re in luck.

From the comfort of your own data center, you can now use Google’s recently announced Dataflow programming model for processing data in batches or as it comes in, on top of the fast Spark open-source engine.

Cloudera, one company selling a distribution of Hadoop open-source software for storing and analyzing large quantities of different kinds of data, has been working with Google to make that possible, and the results of their efforts are now available for free under an open-source license, the two companies announced today.

The technology could benefit the burgeoning Spark ecosystem, as well as Google, which wants programmers to adopt its Dataflow model. If that happens, developers might well feel more comfortable storing and crunching data on Google’s cloud.

Google last year sent shockwaves through the big data world it helped create when Urs Hölzle, Google’s senior vice president of technical infrastructure, announced that Googlers “don’t really use MapReduce anymore.” In lieu of MapReduce, which Google first developed more than 10 years ago and still lies at the heart of Hadoop, Google has largely switched to a new programming model for processing data in streaming or batch format.

Google has brought out a commercial service for running Dataflow on the Google public cloud. And late last year it went further and issued a Java software-development kit for Dataflow.

All the while, outside of Google, engineers have been making progress. Spark in recent years has emerged as a potential MapReduce successor.

Now there’s a solid way to use the latest system from Google on top of Spark. And that could be great news from a technical standpoint.

“[Dataflow’s] streaming execution engine has strong consistency guarantees and provides a windowing model that is even more advanced than the one in Spark Streaming, but there is still a distinct batch execution engine that is capable of performing additional optimizations to pipelines that do not process streaming data,” Josh Wills, Cloudera’s senior director of data science, wrote in a blog post on the news.




Software developers at Lockheed Martin [NYSE: LMT] have designed a platform to make big data analysis easier for developers and non-developers and are open sourcing the project on GitHub, a popular web-based hosting service.

The StreamFlow™ software project is designed to make working with Apache Storm, a free and open-source distributed real-time computation system, easier and more productive. A Storm application ingests significant amounts of data through the use of topologies, or sets of rules that govern how a network is organized. These topologies categorize the data streams into understandable pipelines. “The ultimate goal of StreamFlow is to make working with Storm easier and faster, allowing non-developers and domain experts of all kinds to contribute to real-time data-driven solutions,” said Jason O’Connor, vice president of Analysis & Mission Solutions with Lockheed Martin Information Systems & Global Solutions. “The next step in data analytics relies on the inclusion of diverse expertise and we envision this product contributing to fields ranging from systems telematics to cyber security to medical care.” Companies currently using Apache Storm to repartition streams of data include Twitter, Spotify and The Weather Channel.

The StreamFlow software introduces a dashboard to monitor metrics, high-level protocols to make coding more interoperable, and a graphical topology builder to make assembling and monitoring topologies in Storm much easier for beginner programmers and users without software development experience. StreamFlow was open sourced on Lockheed Martin’s GitHub account for users to install on their own desktops. Installing StreamFlow is much like installing typical web applications, though some configuration may be necessary. The software is released under the Apache 2 license and can be freely downloaded and built from GitHub.

The software contains a front end user interface supported by a series of web services. Future plans include open sourcing of additional frameworks and support for further real-time processing systems like Apache Spark.

Headquartered in Bethesda, Maryland, Lockheed Martin is a global security and aerospace company that employs approximately 113,000 people worldwide and is principally engaged in the research, design, development, manufacture, integration and sustainment of advanced technology systems, products and services. The Corporation’s net sales for 2013 were $45.4 billion.



In 2010, the CEO of Google at the time, Eric Schmidt, made a remarkable statement at a media event in Abu Dhabi: “One day we had a conversation where we figured we could just [use Google’s data about its users] to predict the stock market. And then we decided it was illegal. So we stopped doing that” (Fortt 2010).

The journalist John Battelle (2010) has described Google as “the database of [human] intentions.” Battelle noticed that the search queries entered into Google express human needs and desires. By storing all those queries—more than a trillion a year—Google can build up a database of human intent. That knowledge of intention then makes it possible for Google to predict the movement of the stock market (and much else). Of course, neither Google nor anyone else has a complete database of human intentions. But part of the power of Battelle’s phrase is that it suggests that aspiration. Google cofounder Sergey Brin has said that the ultimate future of search is to connect directly to users’ brains (Arrington 2009). What could you do if you had a database that truly contained all human intentions?

The database of human intentions is a small part of a much bigger vision: a database containing all the world’s knowledge. This idea goes back to the early days of modern computing, with people such as Arthur C. Clarke and H. G. Wells exploring visions of a “world brain” (Wikipedia 2013). What’s changed recently is that a small number of technology companies are engaged in serious (albeit early stage) efforts to build databases which really will contain much of human knowledge. Think, for example, of the way Facebook has mapped out the social connections between more than 1 billion people. Or the way Wolfram Research has integrated massive amounts of knowledge about mathematics and the natural and social sciences into Wolfram Alpha. Or Google’s efforts to build Google Maps, the most detailed map of the world ever constructed, and Google Books, which aspires to digitize all the books (in all languages) in the world (Taycher 2010). Building a database containing all the world’s knowledge has become profitable.

This data gives these companies great power to understand the world. Consider the following examples: Facebook CEO Mark Zuckerberg has used user data to predict which Facebook users will start relationships (O’Neill 2010); researchers have used data from Twitter to forecast box office revenue for movies (Asur and Huberman 2010); and Google has used search data to track influenza outbreaks around the world (Ginsberg et al. 2009). These few examples are merely the tip of a much larger iceberg; with the right infrastructure, data can be converted into knowledge, often in surprising ways.

What’s especially striking about examples like these is the ease with which such projects can be carried out. It’s possible for a small team of engineers to build a service such as Google Flu Trends, Google’s influenza tracking service, in a matter of weeks. However, that ability relies on access to both specialized data and the tools necessary to make sense of that data. This combination of data and tools is a kind of data infrastructure, and a powerful data infrastructure is available only at a very few organizations, such as Google and Facebook. Without access to such data infrastructure, even the most talented programmer would find it extremely challenging to create projects such as Google Flu Trends.

Today, we take it for granted that a powerful data infrastructure is available only at a few big for-profit companies, and to secretive intelligence agencies such as the NSA and GCHQ. But in this essay I explore the possibility of creating a similarly powerful public data infrastructure, an infrastructure which could be used by anyone in the world. It would be Big Data for the masses.


Imagine, for example, a 19-year-old intern at a health agency somewhere who has an idea like Google Flu Trends. They could use the public data infrastructure to quickly test their idea. Or imagine a 21-year-old undergraduate with a new idea for how to rank search engine results. Again, they could use the public data infrastructure to quickly test their idea. Or perhaps a historian of ideas wants to understand how phrases get added to the language over time; or how ideas spread within particular groups, and die out within others; or how particular types of stories get traction within the news, while others don’t. Again, this kind of thing could easily be done with a powerful public data infrastructure.

These kinds of experiments won’t be free—it costs real money to run computations across clusters containing thousands of computers, and those costs will need to be passed on to the people doing the experiments. But it should be possible for even novice programmers to do amazing experiments for a few tens of dollars, experiments which today would be nearly impossible for even the most talented programmers.

Note, by the way, that when I say public data infrastructure, I don’t necessarily mean data infrastructure that’s run by the government. What’s important is that the infrastructure be usable by the public, as a platform for discovery and innovation, not that it actually be publicly owned. In principle, it could be run by a not-for-profit organization, or a for-profit company, or perhaps even by a loose network of individuals. Below, I’ll argue that there are good reasons such infrastructure should be run by a not-for-profit.

There are many nascent projects to build powerful public data infrastructure. Probably the best known such project is Wikipedia. Consider the vision statement of the Wikimedia Foundation (which runs Wikipedia): “Imagine a world in which every single human being can freely share in the sum of all knowledge. That’s our commitment.” Wikipedia is impressive in size, with more than 4 million articles in the English language edition. The Wikipedia database contains more than 40 gigabytes of data. But while that sounds enormous, consider that Google routinely works with data at the petabyte scale—a million gigabytes! By comparison, Wikipedia is minuscule. And it’s easy to see why there’s this difference. What the Wikimedia Foundation considers “the sum of all knowledge” is extremely narrow compared to the range of data about the world that Google finds useful—everything from scans of books to the data being generated by Google’s driverless cars (each car generates nearly a gigabyte per second about its environment! [Gross 2013]). And so Google is creating a far more comprehensive database of knowledge.

Another marvelous public project is OpenStreetMap, a not-for-profit that is working to create a free and openly editable map of the entire world. OpenStreetMap is good enough that its data is used by services such as Wikipedia, Craigslist, and Apple Maps. However, while the data is good, OpenStreetMap does not yet match the comprehensive coverage provided by Google Maps, which has 1,000 full-time employees and 6,100 contractors working on the project (Carlson 2012). The OpenStreetMap database contains 400 gigabytes of data. Again, while that is impressive, it’s minuscule by comparison to the scale at which companies such as Google and Facebook operate.

More generally, many existing public projects such as Wikipedia and OpenStreetMap are generating data that can be analyzed on a single computer using off-the-shelf software. The for-profit companies have data infrastructure far beyond this scale. Their computer clusters contain hundreds of thousands or millions of computers. They use clever algorithms to run computations distributed across those clusters. This requires not only access to hardware, but also to specialized algorithms and tools, and to large teams of remarkable people with the rare (and expensive!) knowledge required to make all this work. The payoff is that this much larger data infrastructure gives them far more power to understand and to shape the world. If the human race is currently constructing a database of all the world’s knowledge, then by far the majority of that work is being done on privately owned databases.

I haven’t yet said what I mean by a “database of all the world’s knowledge.” Of course, it’s meant to be an evocative phrase, not (yet!) a literal description of what’s being built. Even Google, the organization which has made most progress toward this goal, has for the most part not worked directly toward it. Instead, they’ve focused on practical user needs—search, maps, books, and so on—in each case gathering data to build a useful product. They then leverage and integrate the data sets they already have to create other products. For example, they’ve combined Android and Google Maps to build up real-time maps of the traffic in cities, which can then be displayed on Android phones. The data behind Google Search has been used to launch products such as Google News, Google Flu Trends, and (the now defunct, but famous) Google Reader. And so while most of Google’s effort isn’t literally aimed at building a database of all the world’s knowledge, it’s a useful way of thinking about the eventual end game.

For this reason, from now on I’ll mostly use the more generic term public data infrastructure. In concrete, everyday terms this can be thought of in terms of specific projects. Imagine, for example, a project to build an open infrastructure search engine. As I described above, this would be a platform that enabled anyone in the world to experiment with new ways of ranking search results, and new ways of presenting information. Or imagine a project to build an open infrastructure social network, where anyone in the world could experiment with new ways to connect people. Those projects would, in turn, serve as platforms for other new services. Who knows what people could come up with?

The phrase a public data infrastructure perhaps suggests a singular creation by some special organization. But that’s not quite what I mean. To build a powerful public data infrastructure will require a vibrant ecology of organizations, each making their own contribution to an overall public data infrastructure. Many of those organizations will be small, looking to innovate in new ways, or to act as niche platforms. And some winners will emerge, larger organizations that integrate and aggregate huge amounts of data in superior ways. And so when I write of creating a public data infrastructure, I’m not talking about creating a single organization. Instead, I’m talking about the creation of an entire vibrant ecology of organizations, an ecology of which projects like Wikipedia and OpenStreetMap are just early members.

I’ll describe shortly how a powerful public data infrastructure could be created, and what the implications might be. But before doing that, let me make it clear that what I’m proposing is very different from the much-discussed idea of open data.

Many people, including the creator of the web, Tim Berners-Lee, have advocated open, online publication of data. The open data visionaries believe we can transform domains such as government, science, and the law by publishing the crucial data underlying those domains.

If this vision comes to pass, then thousands or millions of people and organizations will publish their data online.

While open data will be transformative, it’s also different (though complementary) to what I am proposing. The open data vision is about decentralized publication of data. That means it’s about small data, for the most part. What I’m talking about is Big Data—aggregating data from many sources inside a powerful centralized data infrastructure, and then making that infrastructure usable by anyone. That’s qualitatively different. To put it another way, open publication of data is a good first step. But to get the full benefit, we need to aggregate data from many sources inside a powerful public data infrastructure.

Why a Public Data Infrastructure Should Be Developed by Not-for-Profits

Is it better for public data infrastructure to be built by for-profit companies, or by not-for-profits? Or is some other option even better—say, governments creating it, or perhaps loosely organized networks of contributors, without a traditional institutional structure? In this section I argue that the best option is not-for-profits.

Let’s focus first on the case of for-profits versus not-for-profits. In general, I am all for for-profit companies bringing technologies to market. However, in the case of a public data infrastructure, there are special circumstances which make not-for-profits preferable.

To understand those special circumstances, think back to the late 1980s and early 1990s. That was a time of stagnation in computer software, a time of incremental progress, but few major leaps. The reason was Microsoft’s stranglehold over computer operating systems. Whenever a company discovered a new market for software, Microsoft would replicate the product and then use their control of the operating system to crush the original innovator. This happened to the spreadsheet Lotus 1-2-3 (crushed by Excel), the word processor WordPerfect (crushed by Word), and many other lesser-known programs. In effect, those other companies were acting as the research and development arms of Microsoft. As this pattern gradually became clear, the result was a reduced incentive to invest in new ideas for software, and a decade or so of stagnation.

That all changed when a new platform for computing emerged—the web browser. Microsoft couldn’t use their operating system dominance to destroy companies such as Google, Facebook, and Amazon. The reason is that those companies’ products didn’t run (directly) on Microsoft’s operating system; they ran over the web. Microsoft initially largely ignored the web, a situation that only changed in May 1995, when Bill Gates sent out a company-wide memo entitled “The Internet Tidal Wave” (Letters of Note 2011). But by the time Gates realized the importance of the web, it was too late to stop the tidal wave. Microsoft made many subsequent attempts to get control of web standards, but those efforts were defeated by organizations such as the World Wide Web Consortium, Netscape, Mozilla, and Google. Effectively, the computer industry moved from a proprietary platform (Windows) to an open platform (the web) not owned by anyone in particular. The result was a resurgence of software innovation.

The lesson is that when dominant technology platforms are privately owned, the platform owner can co-opt markets discovered by companies using the platform. I gave the example of Microsoft, but there are many other examples—companies such as Apple, Facebook, and Twitter have all used their ownership of important technology platforms to co-opt new markets in this way. We’d all be better off if dominant technology platforms were operated in the public interest, not as a way of co-opting innovation. Fortunately, that is what’s happened with both the Internet and the web, and that’s why those platforms have been such a powerful spur to innovation.

Platforms such as the web and the Internet are a little bit special in that they’re primarily standards. That is, they’re broadly shared agreements on how technologies should operate. Those standards are often stewarded by not-for-profit organizations such as the World Wide Web Consortium and the Internet Engineering Task Force. But it doesn’t really make sense to say the standards are owned by those not-for-profits, since what matters is really the broad community commitment to the standards. Standards are about owning hearts and minds, not atoms.

By contrast, a public data infrastructure would be a different kind of technology platform. Any piece of such an infrastructure would involve considerable capital costs, associated with owning (or leasing) and operating a large cluster of computers. And because of this capital investment there really is a necessity for an owner. We’ve already seen that if a public data infrastructure were owned by for-profit companies, those companies would always be tempted to use their ownership to co-opt innovation. The natural alternative solution is for a public data infrastructure to be owned and operated by not-for-profits that are committed to not co-opting innovation, but rather to encouraging it and helping it to flourish.

What about government providing public data infrastructure? In fact, for data related directly to government this is beginning to happen, through initiatives such as Data.gov, the U.S. Government’s portal for government data in the U.S. But it’s difficult to believe that having the government provide a public data infrastructure more broadly would be a good idea. Technological innovation requires many groups of people to try out many different ideas, with most failing, and with the best ideas winning. This isn’t a model of development that governments have a long history of using effectively. With that said, initiatives such as Data.gov will make a very important contribution to a public data infrastructure. But they will not be the core of a powerful, broad-ranging public data infrastructure.

The final possibility is that a public data infrastructure not be developed by an organization at all, but rather by a loosely organized network of contributors, without a traditional institutional structure. Examples such as OpenStreetMap are in this vein. OpenStreetMap does have a traditional not-for-profit at its core, but it’s tiny, with a 2012 budget of less than 100,000 British pounds (OMS 2013). Most of the work is done by a loose network of volunteers. That’s a great model for OpenStreetMap, but part of the reason it works is because of the relatively modest scale of the data involved. Big Data involves larger organizations (and larger budgets), due to the scale of the computing power involved, as well as the long-term commitments necessary to providing reliable service, effective documentation, and support. All these things mean building a lasting organization. So while a loosely distributed model may be a great way to start such projects, over time they will need to transition to a more traditional not-for-profit model.

Challenges for Not-for-Profits Developing a Public Data Infrastructure

How could not-for-profits help develop such a public data infrastructure?

At first sight, an encouraging sign is the flourishing ecosystem of open-source software. Ohloh, a site indexing open-source projects, currently lists more than 600,000 projects. Open-source projects such as Linux, Hadoop, and others are often leaders in their areas.

Given this ecosystem of open-source software, it’s somewhat puzzling that there is comparatively little public data infrastructure. Why has so much important code been made usable by anyone in the world, and so little data infrastructure?

To answer this question, it helps to think about the origin of open-source software. Open-source projects usually start in one of two ways: (1) as hobby projects (albeit often created by professional programmers in their spare time), such as Linux; or (2) as by-products of the work of for-profit companies. By looking at each of these cases separately, we can understand why open-source software has flourished so much more than public data infrastructure.

Let’s first consider the motivations for open-source software created by for-profit companies. An example is the Hadoop project, which was created by Yahoo as a way of making it easier to run programs across large clusters of computers. When for-profit companies open-source projects in this way, it’s because they don’t view owning the code as part of their competitive business advantage. While running large cluster-based computations is obviously essential to Yahoo, they’re not trying to use that as their edge over other companies. And so it made sense for Yahoo to open-source Hadoop, so other people and organizations can help them improve the code.

By contrast, for many Internet companies owning their own data really is a core business advantage, and they are unlikely to open up their data infrastructure. A priori nothing says this necessarily has to be the case. A for-profit could attempt to build a business offering a powerful public data infrastructure, and find some competitive advantage other than owning the data (most likely, an advantage in logistics and supply chain management). But I believe that this hasn’t happened because holding data close is an easy and natural way for a company to maintain a competitive advantage. The investor Warren Buffett has described how successful companies need a moat—a competitive advantage that is truly difficult for other organizations to duplicate. For Google and Facebook and many other Internet companies their internal data infrastructure is their moat.

What about hobby projects? If projects such as Linux can start as a hobby, then why don’t we see more public data infrastructure started as part of a hobby project? The problem is that creating data infrastructure requires a much greater commitment than creating open-source code. A hobby open-source project requires a time commitment, but little direct expenditure of money. It can be done on weekends, or in the evenings. As I noted above, building effective data infrastructure requires time, money, and a long-term commitment to providing reliable service, effective documentation, and support. To do these things requires an organization that will be around for a long time. That’s a much bigger barrier to entry than in the case of open source.

What would be needed to create a healthy, vibrant ecology of not-for-profit organizations working on developing a public data infrastructure?

This question is too big to comprehensively answer in a short essay such as this. But I will briefly point out two significant obstacles to this happening through the traditional mechanisms for funding not-for-profits: foundations, grant agencies, and similar philanthropic sources.

To understand the first obstacle, consider the story of the for-profit company Ludicorp. In 2003 Ludicorp released an online game called Game Neverending. After releasing the game, Ludicorp added a feature for players to swap photos with one another. The programmers soon noticed that people were logging onto the game just to swap photos, and ignoring the actual gameplay. After observing this, they made a bold decision. They threw out the game, and relaunched a few weeks later as a photo-sharing service, which they named Flickr. Flickr went on to become the first major online photo-sharing application, and was eventually acquired by Yahoo. Although Flickr has faded since the acquisition, in its day it was one of the most beloved websites in the world.

Stories like this are so common in technology circles that there’s even a name for this phenomenon. Entrepreneurs talk about pivoting when they discover that some key assumption in their business model is wrong, and they need to try something else. Entrepreneur Steve Blank, one of the people who developed the concept of the pivot, has devised an influential definition of a startup as “an organization formed to search for a repeatable and scalable business model” (Blank 2010). When Ludicorp discovered that photo sharing was a scalable business in a way that Game Neverending wasn’t, they did the right thing: they pivoted hard.

This pattern of pivoting makes sense for entrepreneurs who are trying to create new technologies and new markets for those technologies. True innovators don’t start out knowing what will work; they discover what will work. And so their initial plans are almost certain to be wrong, and will need to change, perhaps radically.

The pivot has been understood and accepted by many technology investors. It’s expected and even encouraged that companies will change their mission, often radically, as they search for a scalable business model. But in the not-for-profit world this kind of change is verboten. Can you imagine a not-for-profit telling their funders—say, some big foundation—that they’ve decided to pivot? Perhaps they’ve decided that they’re no longer working with homeless youth, because they’ve discovered that their technology has a great application to the art scene. Such a change won’t look good on the end-of-year report! Yet, as the pivots behind Flickr and similar companies show, that kind of flexibility is an enormous aid (and arguably very nearly essential) in developing new technologies and new markets.

A second obstacle to funding not-for-profits working on a public data infrastructure is the risk-averse nature of much not-for-profit funding. In the for-profit world it’s understood that technology startups are extremely risky. Estimates of the risk vary, but typical estimates place the odds of failure for a startup at perhaps 70 to 80 percent (Gompers et al. 2008). Very few foundations or grant agencies would accept 70 to 80 percent odds of failure. It’s informative to consider entrepreneur Steve Blank’s startup biography. He bluntly states that his startups have made “two deep craters, several ‘base hits,’ [and] one massive ‘dot-com bubble’ home run” (Blank 2013). That is, he’s had two catastrophic failures, and one genuine success. In the for-profit startup world this can be bragged about; in the not-for-profit world this rate of success would be viewed as disastrous. The situation is compounded by the difficulty in defining what success is for a not-for-profit; this makes it tempting (and possible) for mediocre not-for-profits to scrape by, continuing to exist, when it would be healthier if they ceased to operate, and made space for more effective organizations.

One solution I’ve seen tried is for foundations and grant agencies to exhort applicants to take more risks. The problem is that any applicant considering taking those risks knows failure means they will still have trouble getting grants in the future, exhortation or no exhortation. So it still makes more sense to do low-risk work.

One possible resolution to this problem would be for not-for-profit funders to run failure audits. Suppose programs at the big foundations were audited for failures, and had to achieve a failure rate above a certain number. If a foundation were serious about taking risks, then they could run a deliberately high-risk grant program, where the program had to meet a target goal of at least 70 percent of projects failing. Doing this well would require careful design to avoid pitfalls. But if implemented well, the outcome would be a not-for-profit culture willing to take risks. At the moment, so far as I am aware, no large funder uses failure audits or any similar idea to encourage genuine risk taking.

I’ve painted a bleak picture of not-for-profit funding for a public data infrastructure (and for much other technology). But it’s not entirely bleak. Projects such as Wikipedia and OpenStreetMap have found ways to be successful, despite not being started with traditional funding. And I am optimistic that examples such as these will help inspire funders to adopt a more experimental and high-risk approach to funding technological innovation, an approach that will speed up the development of a powerful public data infrastructure.

Two Futures for Big Data

We’re at a transition moment in history. Many core human activities are changing profoundly: the way we seek information; the way we connect to people; the way we decide where we want to go, and who we want to be with. The way we make such choices is becoming more and more dominated by a few technology companies with powerful data infrastructure. It’s fantastic that technology can improve our lives. But I believe that we’d be better off if more people could influence these core decisions about how we live.

In this essay, I’ve described two possible futures for Big Data. In one future, today’s trends continue. The best data infrastructure will be privately owned by a few large companies who see it as a competitive advantage to map out human knowledge. In the other future, the future I hope we will create, the best data infrastructure will be available for use by anyone in the world, a powerful platform for experimentation, discovery, and the creation of new and better ways of living.