It’s been a busy year for Apache Ambari. Keeping up with the rapid innovation in the open community certainly is exciting. We’ve already seen six releases this year to maintain a steady drumbeat of new features and usability guardrails. We have also seen some exciting announcements of new folks jumping into the Ambari community.

With all these releases and community activities, let’s take a break to talk about how the broader Hadoop community is affecting Ambari and how this is influencing what you will see from Ambari in the future.

Take a Look Around

To talk about the future of Ambari, we have to recognize what is happening outside of Ambari in the Hadoop community. We have to talk about Apache Hadoop YARN.

YARN is the operating system for data processing, making it possible to bring multiple workloads and processing engines to the data stored in Apache Hadoop 2. YARN enables a single platform for storage and processing that can handle different access patterns (batch, interactive and real-time). This fundamentally changes the future of data management.


The YARN Effect

Just as YARN is re-shaping the definition—and capabilities—of Hadoop, Ambari complements YARN by enabling Hadoop operators to efficiently and securely harness the power of Hadoop by provisioning, managing, and monitoring their clusters.

Different workloads, different processing engines, and different access patterns mean Ambari needs to be flexible while maintaining the stable, predictable operational capabilities that an enterprise expects.

To that end, we at Hortonworks focus on rallying the community around three Ambari areas: operate, integrate, and extend.


Operate: Provision, Manage, and Monitor

Ambari’s core function is to provision, manage, and monitor a Hadoop Stack.

Ambari has come a long way standardizing the stack operations model, and Ambari Stacks proves this progress. Stacks wrap services of all shapes and sizes with a consistent definition and lifecycle-control layer. With this wrapper in-place, Ambari can rationalize operations over a broad set of services.

To the Hadoop operator, this means that regardless of differences across services, each service can be managed and monitored with a consistent approach (e.g. install, start, stop, configure, status).

This also provides a natural extension point for operators and the community to create their own custom stack definitions to “plug-in” new services that can co-exist with Hadoop. For example, we have seen the Gluster team at Red Hat extend Ambari and implement a custom stack that deploys Hadoop on GlusterFS.

We expect to see more of this type of activity, and rapid adoption of Stacks across the community.

As for provisioning, the Stacks technology also rationalizes the cluster install experience across a set of services. Stacks enable Ambari Blueprints. Blueprints deliver a trifecta of benefits:

  • A repeatable model for cluster provisioning (for consistency);
  • A method to automate cluster provisioning (for ad hoc cluster creation, whether bare metal or cloud);
  • A portable and cohesive definition of a cluster (for sharing best practices on component layout and configuration).
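As an illustration, a minimal Blueprint is just a small JSON document. The sketch below builds one in Python so it can be serialized and checked programmatically; the field names follow the Blueprints REST API (host_groups, components, cardinality), but the cluster layout, stack version, and names here are purely illustrative:

```python
import json

# A minimal sketch of an Ambari Blueprint; the structure follows the
# Blueprints API, but the services and layout are an illustrative example.
blueprint = {
    "Blueprints": {
        "blueprint_name": "single-master",
        "stack_name": "HDP",
        "stack_version": "2.1",
    },
    "host_groups": [
        {
            "name": "master",
            "cardinality": "1",
            "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"}],
        },
        {
            "name": "workers",
            "cardinality": "3",
            "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}],
        },
    ],
}

# The same definition can be checked into version control and POSTed to the
# Ambari server to provision identical clusters repeatedly.
print(json.dumps(blueprint, indent=2))
```

Because the definition is plain JSON, it can be shared, reviewed, and reused across learn, dev/test, and production environments.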

As Shaun Connolly discussed during his Hadoop Summit 2014 keynote, Hadoop adoption spans data lifecycles (i.e. learn, dev/test, discovery, production). Blueprints enable this connected lifecycle and provide consistency, portability and deployment flexibility.

We expect Blueprints to become a shared language for defining a Hadoop cluster and for Ambari to become a key component for provisioning clusters in an automated fashion, whether using bare metal or cloud infrastructure.

Integrate: Enterprise Tools, Skills, and Systems

To create modern data architecture, Hadoop must be integrated with existing data center management systems. Fortunately, the Apache developer community designed Ambari with a robust REST API that exposes cluster management controls and monitoring information.

This API facilitates Ambari’s integration with existing processes and systems to automate operational workflows (such as automatic host decommission/re-commission in alert scenarios). As adoption of Hadoop progresses and the types of workloads supported by Hadoop expand, data center operations teams will still be able to leverage their existing investments in people, processes, and tools.
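For example, an external operations tool might poll host state through the REST API before deciding to act. The sketch below only constructs such a request (it is never sent); the Ambari hostname, cluster name, and credentials are placeholders:

```python
import base64
import urllib.request

# A sketch of how an external tool could query host state through the Ambari
# REST API. Host, cluster, and credentials below are placeholder values.
AMBARI = "http://ambari.example.com:8080"
url = f"{AMBARI}/api/v1/clusters/MyCluster/hosts?fields=Hosts/host_state"

req = urllib.request.Request(url)
# Ambari uses HTTP Basic authentication; admin/admin is the default login.
token = base64.b64encode(b"admin:admin").decode("ascii")
req.add_header("Authorization", f"Basic {token}")
# Write operations additionally require the X-Requested-By header.
req.add_header("X-Requested-By", "ambari")

# urllib.request.urlopen(req) would return the JSON host list; a monitoring
# system could poll this and drive its own alerting or decommission workflow.
print(req.full_url)
```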

We have already seen our partners put their weight behind ease of Hadoop integration. For example, Ambari SCOM Management Pack leverages Ambari to bring Hadoop monitoring information into Microsoft System Center Operations Manager. Teradata Viewpoint uses Ambari as a single integration point for Hadoop management. As we look ahead, we anticipate other Hadoop ecosystem products and broader systems management products to follow this pattern.

Extend: Customizing the Interaction Experience for Operators and Users

This is where it really gets interesting. If someone extends Ambari with a custom service (via Stacks), then the rationalized operational controls (as defined in the Stack) should “just work”. The lifecycle definition makes it possible to consistently expose service controls from both the Ambari Web UI and the Ambari REST API.

But as new services are brought into Ambari, that will introduce new requirements on how Ambari manages, organizes, and visualizes information about the cluster. With all these services under Ambari, certain capabilities are going to be unique and role-based. How do you provide an extensibility point for the community to easily “plug-in” custom features that go beyond core operations? How do you expose (or limit) those capabilities to the operators and end users with a consistent interaction experience? That’s where Ambari Views come in.

Ambari Views will enable the community and operators to develop new ways to visualize operations, troubleshoot issues and interact with Hadoop. They will provide a framework to offer those experiences to specific sets of users. Via the pluggable UI framework, operators will be able to control which users get certain capabilities and to customize how those users interact with Hadoop.

The community has been working on the Views Framework for some time, and this presentation (given at Hadoop Summit 2014) provides a good overview of the technology.

We aim to release the Views Framework in the upcoming Ambari 1.7.0 release. In the meantime, you can get a preview of some Views and the Views Framework itself by digging into the examples or by grabbing one of the current contributions.



Earlier this month, the Apache Ambari community released Apache Ambari 1.6.1, which includes multiple improvements for performance and usability. The momentum in and around the Ambari community is unstoppable. Today we saw the Pivotal team lean in to Ambari, and this is the sixth release of this critical component in 2014, proving again that open source is the fastest path to innovation.

Many thanks to the wealth of contribution from the broad Ambari community that resulted in 585 JIRA issues being resolved in this release.

The new features in Ambari 1.6.1 help operate enterprise-grade Hadoop at scale on very large clusters supporting batch, interactive and real-time workloads running through YARN. Ambari 1.6.1 also delivers the usability guardrails an enterprise expects from an operations platform.

Operating Hadoop at Scale


Performance Improvements

YARN allows more Hadoop users to access data in different ways. Larger groups of users create more demand for data in Hadoop, which accelerates the rate of cluster growth. That means that many enterprise clusters must handle thousands of hosts, and Ambari 1.6.1 makes this increased scale easier to manage.

Ambari 1.6.1 includes an optimized REST API that makes requests as efficient as possible and loads only the most critical information into the Ambari Web interface. To match the API improvements, enhancements to the Ambari Server backend cache make operational information readily available. In addition, better tuning of Ganglia and Nagios facilitates metrics collection and alerting.

The Apache Ambari team has tested and verified Ambari Server against a live 2000-node cluster. We presented those test results at Hadoop Summit in June. Watch the presentation or review that slide deck.

Enterprise Usability Guardrails

Custom JDK Checking

In a number of regulated industries, we see users installing clusters with limited or no Internet access. Performing that “local install” means that HDP Stack repositories are available and ready to use on the local network. But it also means that Java should be installed and available on the various hosts prior to install.

Ambari has supported local installs for some time, and it also handles environments with a custom JDK installed. Ambari 1.6.1 performs more thorough validation of the custom JDK path and reports discrepancies across all hosts. This ensures that installed services do not run into problems when they start up, and it allows Hadoop operators to identify and correct JDK installation issues.

Simplified Database Connection Setup

Ambari supports configuring Hadoop services that require an external RDBMS, allowing them to work with some of the most popular databases including Oracle, MySQL and PostgreSQL. Ambari 1.6.1 makes those configurations easier.

Ambari 1.6.1 also makes it easier to get your JDBC driver in place within the cluster and also tests the database connections to make sure that your configuration is correct and working properly. These improvements reduce the number of manual setup steps, decrease errors and eliminate downstream triage.

Topology Validation for Blueprints

The previous Ambari release introduced Ambari Blueprints. As more and more services are introduced to the Hadoop Stack, tracking the number of components (and the relationship between those components) can be a challenge. Blueprints are an API-driven method to install and configure clusters in a consistent and repeatable fashion.

Ambari 1.6.1 addresses the tracking challenge by allowing you to perform a topology validation based on the dependency and cardinality information provided in the associated Ambari Stack definition. In addition to validation, some components are auto-deployed if not present in the topology. Optionally, validation can be turned off when making the API call.
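To make the idea concrete, here is a toy sketch (not Ambari's actual implementation) of the kind of cardinality check a Stack definition enables; the components and cardinality rules below are illustrative assumptions:

```python
# Assumed cardinality rules of the kind a Stack definition provides:
# "1" means exactly one instance, "1+" means at least one.
CARDINALITY = {"NAMENODE": "1", "DATANODE": "1+"}

def validate_topology(host_groups):
    """Count components across host groups and flag cardinality violations."""
    counts = {}
    for group in host_groups:
        for component in group["components"]:
            counts[component] = counts.get(component, 0) + int(group["hosts"])
    errors = []
    for component, rule in CARDINALITY.items():
        n = counts.get(component, 0)
        if rule == "1" and n != 1:
            errors.append(f"{component}: expected exactly 1, found {n}")
        elif rule == "1+" and n < 1:
            errors.append(f"{component}: expected at least 1, found {n}")
    return errors

topology = [
    {"components": ["NAMENODE"], "hosts": 1},
    {"components": ["DATANODE"], "hosts": 3},
]
print(validate_topology(topology))  # an empty list means the layout is valid
```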

Additional Hostname Checks

While installing Hadoop, one must make sure that the network is configured properly for each node within the cluster. Ambari 1.6.1 has added more network checks to help with that.

The “hostname check” enhancements include checking for reverse hostname lookup. During cluster installation or while adding hosts, host check will report any errors related to reverse hostname lookup. Ambari 1.6.1 users can also perform hostname resolution checks on each host to ensure that other hostnames are resolvable and reachable.
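The forward-then-reverse pattern can be sketched with Python's standard resolver; Ambari's own implementation differs, but the idea is the same:

```python
import socket

# A rough sketch of the two checks described above: forward resolution of a
# hostname, then a reverse lookup of the resulting address. Ambari performs
# equivalent checks on every host during install.
def check_host(hostname):
    try:
        addr = socket.gethostbyname(hostname)            # forward lookup
    except socket.gaierror:
        return f"{hostname}: forward lookup failed"
    try:
        reverse_name, _, _ = socket.gethostbyaddr(addr)  # reverse lookup
    except (socket.herror, OSError):
        return f"{hostname}: no reverse record for {addr}"
    return f"{hostname} -> {addr} -> {reverse_name}"

print(check_host("localhost"))
```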

Coming Next

The Ambari community is already pushing forward towards the next release. Top priority items on the roadmap include:

  • Installation, management and monitoring of Apache Flume agents
  • Ubuntu Platform support
  • Configuration history and rollback in Ambari Web UI
  • Ambari Views, an extensible framework for customizing the user experience

Download Ambari and Learn More



Last week, Apache Tez graduated to become a top-level project within the Apache Software Foundation (ASF). This represents a major step forward for the project and reflects the momentum built by a broad community of developers from not only Hortonworks but also Cloudera, Facebook, LinkedIn, Microsoft, NASA JPL, Twitter, and Yahoo.

What is Apache Tez and why is it useful?

Apache™ Tez is an extensible framework for building YARN-based, high-performance batch and interactive data processing applications in Hadoop that need to handle TB-to-PB-scale datasets. It allows projects in the Hadoop ecosystem, such as Apache Hive and Apache Pig, as well as 3rd-party software vendors to express fit-to-purpose data processing applications in a way that meets their unique demands for fast response times and extreme throughput at petabyte scale. Apache Tez provides a developer API and framework to write native YARN applications that bridge the spectrum of interactive and batch workloads, and it is used with Apache Hive 0.13 as part of Hortonworks Data Platform.

What does “graduation” mean for Apache Tez?

Ultimately, an open source project is only as good as the community that supports it by improving quality and adding features needed by end users. The ASF has been established to do exactly this. It sets governance frameworks and builds communities around open source software projects. Further, the ASF believes (and enforces) that the real life of a project lies in the vibrance of its community of developers and users.

A project enters the Apache Software Foundation as an incubating project so that it can develop within these frameworks. One of the main criteria for graduation to a top-level project is an evaluation of the diversity and momentum behind the project community and whether it can self-sustain that momentum without the active guidance of the foundation. This graduating vote from the Apache Software Foundation is approval that Apache Tez has met that bar.

There were many factors that supported the graduation:

  • Strong uptake from important Apache projects like Apache Hive and Apache Pig, and from other popular projects like Cascading.
  • Efforts underway to use Tez as the processing engine for Twitter’s Summingbird and Apache Flink (formerly Stratosphere from TU Berlin).
  • Increasing usage among companies like Microsoft, Yahoo, LinkedIn, and Netflix, reflecting considerable investment that will motivate more effort and sustenance from this influential user base.
  • Vocal support from members of these communities about the excellent engagements they had with the Apache Tez community.
  • An increasing number of contributors and committers on the Tez project.
  • The fundamental open architecture of Tez, which is motivating a lot of interest in solving a number of hard problems in distributed data processing.

All in all, the future for Apache Tez looks promising, both from a project roadmap perspective and in terms of user adoption and use-case scenarios.

What does this mean for you?

Apache Tez can be adopted with a level of confidence in the project’s strength and long term sustainability. You can be sure that there will be an active and diverse community that is going to support you and the project will keep evolving with its users. As a community driven open source project, Tez encourages you to invest in the project by becoming a contributor so that you can meaningfully improve the project for yourself and others. Your use cases and your contributions are welcome!



In this special guest feature, Matei Zaharia, CTO of Databricks and creator of Apache Spark, explores open-source Apache Spark’s status in the Hadoop community. The Spark Summit recently ended. With over 1,000 attendees, up from just under 400 at last year’s conference, the community around Apache Spark continues to rapidly expand. In the last 12 months, Spark has had more than 200 contributors from more than 50 organizations add code to the project, making it the most active open source project in the Hadoop ecosystem.

With the second Spark Summit behind us, we wanted to take a look back at our journey since 2009 when Spark, the fast and general engine for large-scale data processing, was initially developed. It has been exciting and extremely gratifying to watch Spark mature over the years, thanks in large part to the vibrant, open source community that latched onto it and busily began contributing to make Spark what it is today.

The idea for Spark first emerged in the AMPLab (AMP stands for Algorithms, Machines, and People) at the University of California, Berkeley. With its significant industry funding and exposure, the AMPLab had a unique perspective on what is important and what issues exist among early adopters of big data. We had worked with most of the early users of Hadoop and consistently saw the same issues arise. Spark itself started as the solution to one such problem: speeding up machine learning applications on clusters, which machine learning researchers in the lab were having trouble doing with Hadoop. However, we soon realized that we could easily cover a much broader set of applications.

The Vision

When we worked with early Hadoop users, we saw that they were all excited about the scalability of MapReduce. However, as soon as these users began using MapReduce, they needed more than the system could offer. First, users wanted faster data analysis—instead of waiting tens of minutes to run a query, as was required with MapReduce’s batch model, they wanted to query data interactively, or even continuously in real-time. Second, users wanted more sophisticated processing, such as iterative machine learning algorithms, which were not supported by the rigid, one-pass model of MapReduce.

At this point, several systems had started to emerge as point solutions to these problems, e.g., systems that ran only interactive queries, or only machine learning applications. However, these systems were difficult to use with Hadoop, as they would require users to learn and stitch together a zoo of different frameworks to build pipelines. Instead, we decided to try to generalize the MapReduce model to support more types of computation in a single framework.

We achieved this using only two simple extensions to the model. First, we added support for storing and operating on data in memory—a key optimization for the more complex, iterative algorithms required in applications like machine learning, and one that proved shrewd with the continued drop in memory prices. Second, we modeled execution as general directed acyclic graphs (DAGs) instead of the rigid model of map-and-reduce, which led to significant speedups even on disk. With these additions we were able to cover a wide range of emerging workloads, matching and sometimes exceeding the performance of specialized systems while keeping a single, simple unified programming model.
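To make those two extensions concrete, here is a toy, single-process sketch in plain Python (not Spark itself): stages form a general DAG rather than a fixed map-then-reduce pair, and each stage's output is cached in memory so both downstream branches reuse it instead of recomputing it:

```python
# Toy DAG of stages: each entry maps a stage name to (function, parent stages).
# Two branches ("square" and "evens") share the output of "load".
stages = {
    "load":   (lambda _: list(range(10)), []),
    "square": (lambda deps: [x * x for x in deps[0]], ["load"]),
    "evens":  (lambda deps: [x for x in deps[0] if x % 2 == 0], ["load"]),
    "join":   (lambda deps: deps[0] + deps[1], ["square", "evens"]),
}

cache = {}  # in-memory intermediate results, computed at most once

def run(stage):
    """Execute a stage, recursing through the DAG and caching each result."""
    if stage not in cache:
        fn, parents = stages[stage]
        cache[stage] = fn([run(p) for p in parents])
    return cache[stage]

# "load" runs once; its in-memory result feeds both branches of the DAG.
result = run("join")
print(result)
```

A rigid map-and-reduce model would have to materialize "load" to disk and re-read it for each branch; the DAG-plus-memory model shares it directly.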

This decision allowed, over time, new functionality such as Shark (SQL over Spark), Spark Streaming (stream processing), MLlib (efficient implementations of machine learning algorithms), and GraphX (graph computation over Spark) to be built. These modules in Spark are not separate systems, but libraries that users can combine together into a program in powerful ways. Combined with the more than 80 basic data manipulation operators in Spark, they make it dramatically simpler to build big data applications compared to previous, multi-system pipelines. And we have sought to make them available in a variety of programming languages, including Java, Scala, and Python (available today) and soon R.


As interest in Spark increased, we received a lot of questions about how Spark related to Hadoop and whether Spark was its replacement. The reality is that Hadoop consists of three parts: a file system (HDFS), resource management (YARN/Mesos), and a processing layer (MapReduce). Spark is only a processing engine and thus is an alternative for that last layer. Having it operate over HDFS data was a natural starting point because the volume of data in HDFS was growing rapidly. However, Spark’s architecture has also allowed it to support a host of storage systems beyond Hadoop. Spark is now being used as a processing layer for other data stores (e.g., Cassandra, MongoDB) or even to seamlessly join data from multiple data stores (e.g., HDFS and an operational data store).

Success Comes from the Community

Perhaps the one decision that has made Spark so robust is our continuing commitment to keep it 100 percent open source and work with a large set of contributors from around the world. That commitment continues to pay dividends in Spark’s future and effectiveness.

In a relatively short time, enthusiasm rose among the open source community and is still growing. Indeed, in just the last 12 months, Spark has had more than 200 people from more than 50 organizations contribute code to the project, making it the most active open source project in the Hadoop ecosystem. Even after reaching this point, Spark is still enjoying steady growth, moving us toward the inflection point of the hockey stick adoption curve.

Not only has the community invested countless hours in development, but it has also gotten people excited about Spark and brought users together to share ideas. One of the more exciting events was the first Spark Summit, held in December 2013 in San Francisco, which drew nearly 500 attendees. At this year’s Summit, held June 30, we had double the number of participants. The 2014 Summit included more than 50 community talks on applications, data science, and research using Spark. In addition, Spark community meetups have sprouted all over the United States and internationally, and we anticipate that number will grow.


Spark in the Enterprise

We have seen a real rise in excitement among enterprises as well. After going through the initial proof-of-concept process, Spark has found its place in the enterprise ecosystem and every major Hadoop distributor has made Spark part of their distribution. For many of these distributions, support came from the bottom up: we heard from vendors that customers were downloading and using Spark on their own and then contacting vendors to ask them to support it.

Spark is being used across many verticals including large Internet companies, government agencies, financial service companies, and some major players such as Yahoo, eBay, Alibaba, and NASA. These enterprises are deploying Spark for a variety of use cases including ETL, machine learning, data product creation, and complex event processing with streaming data. Vertical-specific cases include churn analysis, fraud detection, risk analytics, and 360-degree customer views. And many companies are conducting advanced analytics using Spark’s scalable machine learning library (MLlib), which contains high-quality algorithms that leverage iteration to yield better results.

Finally, in addition to being used directly by customers, Spark is increasingly the backend for a growing number of higher-level business applications. Major business intelligence vendors such as Microstrategy, Pentaho, and Qlik have all certified their applications on Spark, while a number of innovative startups such as Adatao, Tresata, and Alpine are basing products on it. These applications bring the capabilities of Spark to a much broader set of users throughout the enterprise.

Spark’s Future

We recently released version 1.0 of Apache Spark – a major milestone for the project. This version includes a number of added capabilities such as:

  • A stable application programming interface to provide compatibility across all 1.x releases.
  • Spark SQL to provide schema-aware data modeling and SQL language support.
  • Support for Java 8 lambda syntax to simplify writing applications in Java.
  • Enhanced MLlib with several new algorithms; MLlib continues to be extremely active on its own with more than 40 contributors since it was introduced in September 2013.
  • Major updates to Spark’s streaming and graph libraries.

These wouldn’t have happened without the support of the community. Many of these features were requested directly by users, while others were contributed by the dozens of developers who worked on this release. One of our top priorities is to continue to make Spark more robust and focus on key enterprise features, such as security, monitoring, and seamless ecosystem integration.

Additionally, the continued success of Spark is dependent on a vibrant ecosystem. For us, it is exciting to see the community innovate and enhance above, below, and around Spark. Maintaining compatibility across the various Spark distributions will be critical, as we’ve seen how destructive forking and fragmentation can be to open source efforts. We would like to define and center compatibility around the Apache version of Spark, where we continue to make all our contributions. We are excited to see the community rally around this vision.

While Spark has come far in the past five years, we realize that there is still a lot to do. We are working hard on new features and improvements in both the core engine and the libraries built on top. We look forward to future Spark releases, an expanded ecosystem, and future Summits and meetups where people are generating ideas that go far beyond what we imagined years ago at UC Berkeley.



Two of the most vibrant communities in the Apache Hadoop ecosystem are now working together to bring users a Hive-on-Spark option that combines the best elements of both.

Apache Hive is a popular SQL interface for batch processing and ETL using Apache Hadoop. Until recently, MapReduce was the only execution engine in the Hadoop ecosystem, and Hive queries could only run on MapReduce. But today, alternative execution engines to MapReduce are available — such as Apache Spark and Apache Tez (incubating).

Although Spark is relatively new to the Hadoop ecosystem, its adoption has been meteoric. An open-source data analytics cluster computing framework, Spark is built outside of Hadoop’s two-stage MapReduce paradigm but runs on top of HDFS. Because of its successful approach, Spark has quickly gained momentum and become established as an attractive choice for the future of data processing in Hadoop.

In this post, you’ll get an overview of the motivations and technical details behind some very exciting news for Spark and Hive users: the fact that the Hive and Spark communities are joining forces to collaboratively introduce Spark as a new execution engine option for Hive, alongside MapReduce and Tez (see HIVE-7292).

Motivation and Approach

Here are the two main motivations for enabling Hive to run on Spark:

  • To improve the Hive user experience
    Hive queries will run faster, thereby improving user experience. Furthermore, users will have access to a robust, non-MR execution engine that has already proven itself to be a leading option for data processing as well as streaming, and which is among the most active projects across all of Apache from contributor and commit standpoints.
  • To streamline operational management for Spark shops
    Hive-on-Spark will be very valuable for users who are already using Spark for other data-processing and machine-learning needs. Standardizing on one execution back end is also convenient for operational management, making it easier to debug issues and create enhancements.

Superficially, this project’s goals look similar to those of Shark or Spark SQL, which are separate projects that reuse the Hive front end to run queries using Spark. However, this design adds Spark into Hive, parallel to MapReduce and Tez, as another backend execution engine. Thus, existing Hive jobs continue to run as-is transparently on Spark.

The key advantage of this approach is to leverage all the existing integration on top of Hive, including ODBC/JDBC, auditing, authorization, and monitoring. Another advantage is that it will have no impact on Hive’s existing code path and thus no functional or performance effects. Users choosing to run Hive on either MapReduce or Tez will have the same functionality and code paths they have today — thus, the Hive user community will be in the great position of being able to choose among MapReduce, Tez, or Spark as a back end. In addition, maintenance costs will be minimized so the Hive community needn’t make specialized investments for Spark.

Meanwhile, users opting for Spark as the execution engine will automatically have all the rich functional features that Hive provides. Future features (such as new data types, UDFs, logical optimization, and so on) added to Hive should become automatically available to those users, without any customization work to be done in Hive’s Spark execution engine.

Overall Functionality

To use Spark as an execution engine in Hive, you would set the following:

set hive.execution.engine=spark;

The default value for this configuration is still “mr”. Hive will continue to work on MapReduce as-is on clusters that don’t have Spark on them. When Spark is configured as Hive’s execution engine, a few configuration variables will be introduced, such as the master URL of the Spark cluster.
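As a sketch of what that session-level configuration might look like, the following assumes a standalone Spark master; `hive.execution.engine` is the setting described above, while the Spark-side property names and values are illustrative assumptions:

```
set hive.execution.engine=spark;
set spark.master=spark://master.example.com:7077;
set spark.executor.memory=4g;
```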

The new execution engine should support all Hive queries without any modification. Query results should be functionally equivalent to those from either MapReduce or Tez. In addition, existing functionality related to the execution engine should also be available, including the following:

  • Hive will display a task execution plan similar to the one the explain command displays for MapReduce and Tez.
  • Hive will give appropriate feedback to the user about progress and completion status of the query when running queries on Spark.
  • The user will be able to get statistics and diagnostic information as before (counters, logs, and debug info on the console).

Hive-Level Design

As noted previously, this project takes a different approach from that of Shark in that SQL semantics will not be implemented using Spark’s primitives, but rather using MapReduce ones that will be executed in Spark.

The main work to implement the Spark execution engine for Hive has two components: query planning, where the Hive operator plan produced by the semantic analyzer is further translated into a task plan that Spark can execute; and query execution, where the generated Spark plan is executed in the Spark cluster. There are other miscellaneous yet indispensable functional pieces involving monitoring, counters, statistics, and so on, but for brevity we will only address the main design considerations here.

Query Planning

Currently, for a given user query, Hive’s semantic analyzer generates an operator plan that comprises a graph of logical operators such as TableScanOperator, ReduceSink, FileSink, GroupByOperator, and so on. MapReduceCompiler compiles a graph of MapReduceTasks and other helper tasks (such as MoveTask) from the logical operator plan. Tez behaves similarly, but generates a TezTask that combines what would otherwise be multiple MapReduce tasks into a single Tez task.

For Spark, we will introduce SparkCompiler, parallel to MapReduceCompiler and TezCompiler. Its main responsibility is to compile the Hive logical operator plan into a plan that can be executed on Spark. Thus, we will have SparkTask, depicting a job that will be executed in a Spark cluster, and SparkWork, describing the plan of a Spark task. SparkCompiler translates Hive’s operator plan into a SparkWork instance. During task plan generation, SparkCompiler may also perform physical optimizations suitable for Spark.

Job Execution

A SparkTask instance can be executed by Hive’s task execution framework in the same way as for other tasks. Internally, the SparkTask.execute() method will make RDDs and functions out of a SparkWork instance, and submit the execution to the Spark cluster via a Spark client.

Once the Spark work is submitted to the Spark cluster, the Spark client will continue to monitor the job execution and report progress. A Spark job can be monitored via SparkListener APIs.

With the SparkListener APIs, we will add a SparkJobMonitor class that handles printing of status as well as reporting the final result. This class provides functionality similar to HadoopJobExecHelper for MapReduce processing or TezJobMonitor for Tez job processing, and it will also retrieve and print the top-level exception thrown at execution time in case of job failure.

Spark job submission is done via a SparkContext object that’s instantiated with the user’s configuration. When a SparkTask is executed by Hive, such a context object is created in the current user session. With the context object, RDDs corresponding to Hive tables are created and MapFunction and ReduceFunction (more details below) are built from Hive’s SparkWork and applied to the RDDs. Job execution is triggered by applying a foreach() transformation on the RDDs with a dummy function.

Main Design Considerations

Hive’s operator plan is based on the MapReduce paradigm, and traditionally, a query’s execution consists of a list of MapReduce jobs. Each MapReduce job comprises map-side processing starting from Hive’s ExecMapper and reduce-side processing starting from ExecReducer, and MapReduce provides inherent shuffling, sorting, and grouping between the map side and the reduce side. The input to the whole processing pipeline is the set of folders and files corresponding to the table.

Because we will reuse Hive’s operator plan while performing the data processing in Spark, the execution plan will be built from Spark constructs such as RDDs, functions, and transformations. This approach is outlined below.

Table as RDD

A Hive table is simply a bunch of files and folders on HDFS. Spark primitives are applied to RDDs. Thus, naturally, Hive tables will be treated as RDDs in the Spark execution engine.


MapFunction

The above-mentioned MapFunction will be made from MapWork; specifically, from the operator chain starting from the ExecMapper.map() method. The ExecMapper class implements the MapReduce Mapper interface, but the implementation in Hive contains some code that can be reused for Spark. Therefore, we will extract the common code into a separate class, MapperDriver, to be shared by MapReduce as well as Spark.


ReduceFunction

Similarly, ReduceFunction will be made from the ReduceWork instance in SparkWork. To Spark, ReduceFunction is no different from MapFunction, but the function’s implementation will differ: it is made of the operator chain starting from ExecReducer.reduce(). Also, because some code in ExecReducer can be reused, we will extract the common code into a separate class, ReducerDriver, for sharing by both MapReduce and Spark.

Shuffle, Group, and Sort

While this functionality comes for “free” with MapReduce, we will need to provide an equivalent for Spark. Fortunately, Spark provides a few transformations that are suitable for replacing MapReduce’s shuffle capability: partitionBy, groupByKey, and sortByKey. Transformation partitionBy does pure shuffling (no grouping or sorting), groupByKey does shuffling plus grouping, and sortByKey does shuffling plus sorting. Therefore, for each ReduceSinkOperator in SparkWork, we will need to inject one of these transformations.

Having the capability to selectively choose the exact shuffling behavior provides opportunities for optimization. For instance, Hive’s groupBy doesn’t require the keys to be sorted, but MapReduce sorts them anyway. In Spark, by contrast, one can choose sortByKey only when key order actually matters (such as for SQL ORDER BY).
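To make the distinction concrete, here is an illustrative, self-contained sketch (plain Java collections, not Spark’s API; all class and method names are invented for this example) of the three shuffle flavors: partition-only, shuffle plus group, and shuffle plus sort.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative simulation (not Spark's API) of the three shuffle flavors.
public class ShuffleDemo {
    // partitionBy-style: route each record to a partition by key hash; no grouping, no sorting.
    public static List<List<Map.Entry<String, Integer>>> partitionOnly(
            List<Map.Entry<String, Integer>> records, int numPartitions) {
        List<List<Map.Entry<String, Integer>>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) parts.add(new ArrayList<>());
        for (var r : records)
            parts.get(Math.floorMod(r.getKey().hashCode(), numPartitions)).add(r);
        return parts;
    }

    // groupByKey-style: shuffle plus grouping, but no ordering guarantee on keys.
    public static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> records) {
        return records.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey, Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    // sortByKey-style: shuffle plus a total order on keys (as needed for SQL ORDER BY).
    public static SortedMap<String, List<Integer>> sortByKey(List<Map.Entry<String, Integer>> records) {
        return new TreeMap<>(groupByKey(records));
    }

    public static void main(String[] args) {
        var records = List.of(Map.entry("b", 2), Map.entry("a", 1), Map.entry("b", 3));
        System.out.println(sortByKey(records)); // {a=[1], b=[2, 3]}
    }
}
```

The point of the optimization note above is visible here: sortByKey does strictly more work than groupByKey, so a query that only needs grouping should avoid paying for the sort.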

Multiple Reduce Stages

Whenever a query has multiple ReduceSinkOperator instances, Hive will break the plan apart and submit one MR job per sink. All the MR jobs in this chain need to be scheduled one-by-one, and each job has to re-read the output of the previous job from HDFS and shuffle it. In Spark, this step is unnecessary: multiple map functions and reduce functions can be concatenated. For each ReduceSinkOperator, a proper shuffle transformation needs to be injected as explained above.
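As an illustrative sketch (plain Java, not Hive or Spark code), two “reduce” stages can be composed into one in-memory pipeline with no intermediate materialization between them, which is the behavior Spark enables:

```java
import java.util.*;
import java.util.stream.*;

// Illustrative only: two "reduce" stages composed in one in-memory pipeline,
// with nothing written out between them. MapReduce, by contrast, would persist
// stage 1's output to HDFS and re-read it for stage 2.
public class ChainedStagesDemo {
    // Stage 1: count occurrences per word (first ReduceSink equivalent).
    // Stage 2: count how many words share each frequency (second ReduceSink equivalent).
    public static Map<Long, Long> countOfCounts(List<String> words) {
        Map<String, Long> wordCounts = words.stream()
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        return wordCounts.values().stream()
                .collect(Collectors.groupingBy(c -> c, Collectors.counting()));
    }

    public static void main(String[] args) {
        var words = List.of("hive", "spark", "hive", "tez");
        System.out.println(countOfCounts(words)); // e.g. {1=2, 2=1}
    }
}
```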


Based on the above, you will likely recognize that although Hive on Spark is simple and clean in terms of functionality and design, the implementation will take some time. Therefore, the community will take a phased approach, with all basic functionality delivered in Phase 1 and optimizations and improvements ongoing over a longer period of time. (The precise number of phases, and what each will entail, are under discussion.)

Most importantly, the Hive and Spark communities will work together closely to achieve this technical vision and resolve any obstacles that might arise, with the end result being the availability to Hive users of an execution engine that improves performance and unifies batch and stream processing.


By Alex Woodie
The Hadoop processing engine Spark has risen to become one of the hottest big data technologies in a short amount of time. And while Spark has been a Top-Level Project at the Apache Software Foundation for barely a week, the technology has already proven itself in the production systems of early adopters, including Conviva, ClearStory, and Yahoo.

Spark is an open source alternative to MapReduce designed to make it easier to build and run fast and sophisticated applications on Hadoop. Spark comes with a library of machine learning (ML) and graph algorithms, and also supports real-time streaming and SQL apps, via Spark Streaming and Shark, respectively. Spark apps can be written in Java, Scala, or Python, and have been clocked running 10 to 100 times faster than equivalent MapReduce apps.

Matei Zaharia, the creator of Spark and CTO of commercial Spark developer Databricks, shared his views on the Spark phenomenon, as well as several real-world use cases, during his presentation at the recent Strata conference in Santa Clara, California.

Since its introduction in 2010, Spark has caught on very quickly, and is now one of the most active open source Hadoop projects, if not the most active. “In the past year Spark has actually overtaken Hadoop MapReduce and every other engine we’re aware of in terms of the number of people contributing to it,” Zaharia says. “It’s an interesting thing. There hasn’t been as much noise about it commercially, but the actual developer community votes with its feet and people are actually getting things done and working with the project.”

Zaharia argues that Spark is catching on so quickly because of two factors: speed and sophistication. “Achieving the best speed and the best sophistication have usually required separate non-commodity tools that don’t run on these commodity clusters. [They’re] often proprietary and quite expensive,” says Zaharia, a 5th year Ph.D. candidate who is also an assistant professor of computer science at MIT.

Up to this point, only large companies, such as Google, have had the skills and resources to make the best use of big and fast data. “There are many examples…where anybody can, for instance, crawl the Web or collect these public data sets, but only a few companies, such as Google, have come up with sophisticated algorithms to gain the most value out of it,” Zaharia says.

Spark was “designed to address this problem,” he says. “Spark brings the top-end data analytics, the same performance level and sophistication that you get with these expensive systems, to commodity Hadoop clusters. It runs in the same cluster to let you do more with your data.”

Spark at Yahoo

It may seem that Spark is just popping onto the scene, but it’s been utilized for some time in production systems. Here are three early adopters of Spark, as told by Zaharia at Strata:

Yahoo has two Spark projects in the works, one for personalizing news pages for Web visitors and another for running analytics for advertising. For news personalization, the company uses ML algorithms running on Spark to figure out what individual users are interested in, and also to categorize news stories as they arise to figure out what types of users would be interested in reading them.

“When you do personalization, you need to react fast to what the user is doing and the events happening in the outside world,” Zaharia says. “If you look at Yahoo’s home page, which news items are you going to show? You need to learn something about each news item as it comes in to see what users may like it. And you need to learn something about users as they click around to figure out that they’re interested in a topic.”

To do this, Yahoo (a major contributor to Apache Spark) wrote a Spark ML algorithm in 120 lines of Scala. (Previously, its ML algorithm for news personalization was written in 15,000 lines of C++.) With just 30 minutes of training on a large, hundred-million-record data set, the Scala ML algorithm was ready for business.

Yahoo’s second use case shows off Shark’s (Hive on Spark) interactive capability. The Web giant wanted to use existing BI tools to view and query its advertising analytics data collected in Hadoop. “The advantage of this is Shark uses the standard Hive server API, so any tool that plugs into Hive, like Tableau, automatically works with Shark,” Zaharia says. “And as a result they were able to achieve this and can actually query their ad visit data interactively.”

Spark at Conviva and ClearStory

Another early Spark adopter is Conviva, one of the largest streaming video companies on the Internet, with about 4 billion video feeds per month (second only to YouTube). As you can imagine, such an operation requires pretty sophisticated behind-the-scenes technology to ensure a high quality of service. As it turns out, it’s using Spark to help deliver that QoS by avoiding dreaded screen buffering.

In the early days of the Internet, screen buffering was a fact of life. But in today’s superfast 4G- and fiber-connected world, people’s expectations for video quality have soared, while at the same time their tolerance for video delays has plummeted.

Enter Spark. “Conviva uses Spark Streaming to learn network conditions in real time,” Zaharia says. “They feed [this information] directly into the video player, say the Flash player on your laptop, to optimize the speeds. This system has been running in production over six months to manage live video traffic.” (You can read more about Conviva’s use of Hadoop, Hive, MapReduce, and Spark here.)

Spark is also getting some work at ClearStory, a developer of data analytics software that specializes in data harmonization and in helping users blend internal and external data. ClearStory needed a way to help business users merge their internal data sources with external sources, such as social media traffic and public data feeds, without requiring complex data modeling.

ClearStory was one of Databricks’ first customers, and today relies on Spark technology as one of the core underpinnings of its interactive, real-time product. “Honestly, if it weren’t for Spark we would have very likely built something like this ourselves,” ClearStory founder Vaibhav Nivargi says in an interview with Databricks co-founder Reynold Xin.

“Spark has a notion of resilient distributed datasets, which are these in-memory units of data that can span multiple machines in a cluster,” Nivargi says in the video. “As a computing unit of data, that is really promising for the kinds of workloads we see at ClearStory.”




Big data analytical ecosystem architecture is in early stages of development. Unlike traditional data warehouse / business intelligence (DW/BI) architecture which is designed for structured, internal data, big data systems work with raw unstructured and semi-structured data as well as internal and external data sources. Additionally, organizations may need both batch and (near) real-time data processing capabilities from big data systems.

Lambda architecture – developed by Nathan Marz – provides a clear set of architecture principles that allows batch and real-time (stream) data processing to work together while building immutability and recomputation into the system. Batch processing handles high volumes of data, where a group of transactions is collected over a period of time: data is collected, entered, and processed, and then batch results are produced. Batch processing requires separate programs for input, processing, and output; payroll and billing systems are examples. In contrast, real-time data processing involves a continual input, processing, and output of data, and data must be processed in a small time period (in near real-time). Customer service systems and bank ATMs are examples.

Lambda architecture has three layers:

  • Batch Layer
  • Serving Layer
  • Speed Layer

Batch Layer (Apache Hadoop)

Hadoop is an open source platform for storing massive amounts of data. Lambda architecture provides “human fault-tolerance”: because the master data set is immutable and views are recomputed from it, the damage from human error can be remedied by simple data deletion followed by recomputation of the affected views (immutability and recomputation).

The batch layer stores the master data set (HDFS) and computes arbitrary views (MapReduce). Computing views is continuous: new data is aggregated into the views as they are recomputed during MapReduce iterations. Because views are computed from the entire data set and the batch layer does not update them frequently, there is latency.
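A minimal sketch of the batch-layer idea (illustrative Java, not any particular framework’s API; the class and method names are invented): the master data set is append-only, and the view is always recomputed from the whole data set.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch of the batch layer: an append-only master data set
// and a view that is always recomputed from the entire data set.
public class BatchLayerDemo {
    private final List<String> masterDataSet = new ArrayList<>(); // append-only records

    public void append(String event) { masterDataSet.add(event); }

    // Recompute the view (here, counts per event type) from scratch each batch run.
    // A bad record can be deleted from the master data set and the view recomputed,
    // which is the "human fault-tolerance" property described above.
    public Map<String, Long> recomputeView() {
        return masterDataSet.stream()
                .collect(Collectors.groupingBy(e -> e, Collectors.counting()));
    }

    public static void main(String[] args) {
        BatchLayerDemo batch = new BatchLayerDemo();
        batch.append("click");
        batch.append("view");
        batch.append("click");
        System.out.println(batch.recomputeView()); // e.g. {click=2, view=1}
    }
}
```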

Serving Layer (Real-time Queries)

The serving layer indexes and exposes the precomputed views so they can be queried ad hoc with low latency. Open source real-time Hadoop query engines like Cloudera Impala, Hortonworks Stinger, Dremel (Apache Drill), and Shark can query the views immediately. Hadoop can store and process large data sets, and these tools can query the data fast. At this time, Shark outperforms the others thanks to its in-memory capabilities, and it has greater flexibility for machine learning functions.

Note that MapReduce is high latency and a speed layer is needed for real-time.

Speed Layer (Distributed Stream Processing)

The speed layer compensates for the batch layer’s high latency by computing real-time views in open source distributed stream-processing solutions like Storm and S4. They provide:

  • Stream processing
  • Distributed continuous computation
  • Fault tolerance
  • Modular design

In the speed layer, real-time views are incremented as new data is received. Lambda architecture provides “complexity isolation”: real-time views are transient and can be discarded, allowing the most complex part of the system to be moved into the layer whose results are only temporary.
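A minimal sketch of how a query might merge the two views (illustrative Java; the names are invented, not any framework’s API): the batch view is complete but stale, the real-time view is fresh but transient, and a query combines them.

```java
import java.util.*;

// Illustrative sketch: a query merges the (stale but complete) batch view with
// the (fresh but transient) real-time view; the real-time view can be discarded
// once the batch layer catches up ("complexity isolation").
public class LambdaQueryDemo {
    public static long query(Map<String, Long> batchView,
                             Map<String, Long> realtimeView, String key) {
        return batchView.getOrDefault(key, 0L) + realtimeView.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        Map<String, Long> batchView = Map.of("clicks", 1000L);   // computed from full data set
        Map<String, Long> realtimeView = Map.of("clicks", 42L);  // incremented as data arrives
        System.out.println(query(batchView, realtimeView, "clicks")); // 1042
    }
}
```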

The decision to implement Lambda architecture depends on the need for real-time data processing and human fault-tolerance. There are significant benefits from immutability and human fault-tolerance, as well as from precomputation and recomputation.

Lambda implementation issues include finding the talent to build a scalable batch processing layer. At this time there is a shortage of professionals with the expertise and experience to work with Hadoop, MapReduce, HDFS, HBase, Pig, Hive, Cascading, Scalding, Storm, Spark, Shark, and other new technologies.



One of the biggest buzzwords ever, it seems, is “Big Data”. But what EXACTLY is big data? It might seem a little condescending, but for those who aren’t in the industry, explaining big data requires some “dumbing down.” If you’re in the position to explain this to someone with zero experience in it, who isn’t tech savvy, or who comes from an entirely different field, leave the jargon out and instead focus on explanations and similes they’ll understand.

A perfect example is when the CEO asks for a quick rundown—and her background is in corporate leadership, she uses an old smartphone, and simply isn’t on the same page as you. Don’t get frustrated. This is your time to shine.

Big Data is Exactly What it Sounds Like

Here’s an idea of how the explanation might go for a five-year-old, but feel free to pepper in your own ideas, too. Just make sure it stays simple and to the point, and that you don’t go off track.

“Big data is exactly what it sounds like—a collection of data that’s so big it’s tough to process. It’s like the US Census Bureau information. That’s way too much information (since millions of people are surveyed) to look at at once.

The biggest problems with this much information are being able to store it, share it with other people, and figure out just what the heck those numbers mean. Usually with data, there are technology tools that can do all this work easily, but too much data is too big a job for them. Big data done right is actually a better, smarter use of the data.

Why We Use Big Data

“What do we want with all this information, anyway?” It’s a great way to spot trends, such as figuring out how many people in your town prefer chocolate ice cream over vanilla.

This information can be very useful to an ice cream company that can’t decide whether to advertise for their chocolate ice cream cake or vanilla one. After all, why spend one hundred dollars on a vanilla ice cream ad and just $50 on the chocolate one when the data says most people prefer chocolate?

I actually prefer the vanilla/chocolate twist… why wasn’t that an option? Oh no! Our data is already skewed!

Data Tells Us All Kinds of Things

Data can tell us all kinds of things, like what type of people are doing what, where the most puppy adoptions take place in each state, and what types of clothes, food or toys people prefer. This is really important for businesses who can use that information to make more money. Otherwise, they might be trying to sell a tricycle for little kids to a bunch of middle schoolers who want dirt or mountain bikes—no training wheels, please.

How do You Collect Big Data?

If you have a large enterprise website, chances are you have tons of marketing tags and different JavaScript files on your pages that load information into your website. Many times, these files contain massive amounts of user data, such as what page a visitor just came from, whether they are logged into Facebook or other social channels, which pixels they have triggered from other sites for retargeting of ads, and so on. You need to be able to capture this data with a tool like Ensighten, which allows you to collect, own, and then act on ALL of your data. If you are an enterprise marketer without a tag or data management platform, you are missing out on key opportunities.


Big Data Sounds Easy

Using big data might sound like a piece of cake, but whenever you have a lot of something it can be really hard to manage.

Think of it this way: It’s pretty easy to clean your room when there’s just your crayons to put back in the box and a couple of shirts to toss in the hamper. But if you just had a huge party and there are dirty cups, leftover pizza and balloons everywhere, suddenly it seems like it will never get done—even though ‘cleaning up’ is basically the same chore no matter how big it is.


You Need to Manage Big Data Correctly

Size makes a huge difference. Just look at Godzilla—it can easily get out of control. That’s why big data can be such a tough problem to solve. You need to manage it right or it can make things much worse.

After all, you wouldn’t clean your room by trying to vacuum up those dirty clothes on the floor, would you? Vacuums do a great job for some things, but when not used right they just don’t make sense. The same goes for managing big data.





The Apache Storm community recently announced the release of Apache Storm 0.9.2, which includes improvements to Storm’s user interface and an overhaul of its netty-based transport.

We thank all who have contributed to Storm – whether through direct code contributions, documentation, bug reports, or helping other users on the mailing lists. Together, we resolved 112 JIRA issues.

Here are summaries of this version’s important fixes and improvements.

New Feature Highlights

Netty Transport Overhaul

Storm’s Netty-based transport has been overhauled to significantly improve performance through better utilization of thread, CPU, and network resources, particularly in cases where message sizes are small.

UI Improvements with a New REST API

This release also includes a number of improvements to the Storm UI service. A new REST API exposes metrics and operations in JSON format, which the UI now uses.

The new REST API will make it considerably easier for other services to consume available cluster and topology metrics for monitoring and visualization applications. The Storm UI service now includes a powerful tool for visualizing the state of running topologies that is built on top of this API.

Kafka Spout

This is the first Storm release to include official support for consuming data from Kafka 0.8.x. In the past, development of Kafka spouts for Storm had become somewhat fragmented and finding an implementation that worked with certain versions of Storm and Kafka proved burdensome.

This is no longer the case. The storm-kafka module is now part of the Storm project and associated artifacts are released to official channels (Maven Central) along with Storm’s other components.

Storm Starter & Examples

Similar to the external section of the codebase, we have also added an examples directory and pulled in the storm-starter project to ensure it will be maintained in lock-step with Storm’s main codebase.

Pluggable Serialization for Multilang

In previous versions of Storm, serialization of data to and from multilang components was limited to JSON, which came with somewhat of a performance penalty. The serialization mechanism is now pluggable, enabling the use of more performant serialization frameworks, such as protocol buffers, in addition to JSON.



Why Extended Attributes are Coming to HDFS

Extended attributes in HDFS will facilitate at-rest encryption for Project Rhino, but they have many other uses, too.

Many mainstream Linux filesystems implement extended attributes, which let you associate metadata with a file or directory beyond common “fixed” attributes like file size, permissions, modification dates, and so on. Extended attributes are key/value pairs in which the values are optional; generally, the key and value sizes are limited to some implementation-specific limit. A filesystem that implements extended attributes also provides system calls and shell commands to get, list, set, and remove attributes (and values) to/from a file or directory.

Recently, my Intel colleague Yi Liu led the implementation of extended attributes for HDFS (HDFS-2006). This work is largely motivated by Cloudera and Intel contributions to bringing at-rest encryption to Apache Hadoop (HDFS-6134; also see this post) under Project Rhino – extended attributes will be the mechanism for associating encryption key metadata with files and encryption zones — but it’s easy to imagine lots of other places where they could be useful.

For instance, you might want to store a document’s author and subject in attributes such as user.author and user.subject=HDFS. You could store a file checksum in an attribute called user.checksum. Even comments about a particular file or directory can be saved in an extended attribute.

In this post, you’ll learn some of the details of this feature from an HDFS user’s point of view.

Inside Extended Attributes

Extended attribute keys are java.lang.Strings and the values are byte[]s. By default, there is a maximum of 32 extended attribute key/value pairs per file or directory, and the (default) maximum combined length of an attribute’s name and value is 16,384 bytes. You can configure these two limits with the dfs.namenode.fs-limits.max-xattrs-per-inode and dfs.namenode.fs-limits.max-xattr-size configuration parameters.

Every extended attribute name must include a namespace prefix, and just like in the Linux filesystem implementations, there are four extended attribute namespaces: user, trusted, system, and security. The system and security namespaces are for HDFS internal use only, and only the HDFS superuser can access the trusted namespace. So user-facing extended attributes will generally reside in the user namespace (for example, “user.myXAttr”). Namespace prefixes are matched case-insensitively, while the rest of the extended attribute name is case-sensitive.
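The naming rules above can be sketched as follows (illustrative Java, not the actual HDFS implementation; the class and method names are invented): the namespace prefix is matched case-insensitively, while the remainder of the name is left untouched.

```java
import java.util.Set;

// Illustrative sketch (not HDFS source) of the naming rules described above:
// a required namespace prefix matched case-insensitively, with the rest of
// the attribute name staying case-sensitive.
public class XAttrNameDemo {
    private static final Set<String> NAMESPACES = Set.of("user", "trusted", "system", "security");

    // Returns the lower-cased namespace, or throws if the name is malformed.
    public static String namespaceOf(String xattrName) {
        int dot = xattrName.indexOf('.');
        if (dot <= 0) throw new IllegalArgumentException("missing namespace prefix: " + xattrName);
        String ns = xattrName.substring(0, dot).toLowerCase();
        if (!NAMESPACES.contains(ns)) throw new IllegalArgumentException("unknown namespace: " + ns);
        return ns;
    }

    public static void main(String[] args) {
        System.out.println(namespaceOf("USER.myXAttr")); // user (namespace match ignores case)
    }
}
```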

Extended attributes can be accessed using the hdfs dfs command. To set an extended attribute on a file or directory, use the -setfattr subcommand. For example,

hdfs dfs -setfattr -n 'user.myXAttr' -v someValue /foo


You can replace a value with the same command (and a new value) and you can delete an extended attribute with the -x option. The usage message shows the format. Setting an extended attribute does not change the file’s modification time.

To examine the value of a particular extended attribute or to list all the extended attributes for a file or directory, use the hdfs dfs -getfattr subcommand. There are options to recursively descend through a subtree and to specify the format to write values out in (text, hex, or base64). For example, to scan all the extended attributes for /foo, use the -d option:

hdfs dfs -getfattr -d /foo
# file: /foo


The org.apache.hadoop.fs.FileSystem class has been enhanced with methods to set, get, and remove extended attributes from a path.

package org.apache.hadoop.fs;
  /* Create or replace an extended attribute. */
  public void setXAttr(Path path, String name, byte[] value)
  /* Get an extended attribute. */
  public byte[] getXAttr(Path path, String name)
  /* Get multiple extended attributes. */
  public Map<String, byte[]> getXAttrs(Path path, List<String> names)
  /* Remove an extended attribute. */
  public void removeXAttr(Path path, String name)

Current Status

Extended attributes are currently committed on the upstream trunk and branch-2 and will be included in CDH 5.2 and Apache Hadoop 2.5. They are enabled by default (you can disable them with the dfs.namenode.xattrs.enabled configuration option) and there is no overhead if you don’t use them.

Extended attributes are also upward compatible, so you can use an older client with a newer NameNode version.