Open Source Cloud Computing with Hadoop
Have you ever wondered how Google, Facebook and other Internet giants process their massive workloads? Billions of requests are served every day by the biggest players on the Internet, and behind them sits background processing over petabyte-scale datasets. Of course, they rely on Linux and cloud computing to obtain the necessary scalability and performance. The flexibility of Linux combined with the seamless scalability of cloud environments provides the perfect framework for processing huge datasets, while eliminating the need for expensive infrastructure and custom proprietary software. Nowadays, Hadoop is one of the best choices in open source cloud computing, offering a platform for large-scale data crunching.
Introduction
In this article, we introduce and analyze the Hadoop project, which has been embraced by many commercial and scientific initiatives that need to process huge datasets. It provides a full platform for large-scale dataset processing in cloud environments and scales easily, since it can be deployed on heterogeneous clusters built from regular hardware. As of April 2011, Amazon, AOL, Adobe, eBay, Google, IBM, Twitter, Yahoo and several universities are listed as users on the project's wiki. Maintained by the Apache Foundation, Hadoop comprises a full suite for seamless, scalable distributed computing on huge datasets. It provides base components on top of which new distributed computing subprojects can be implemented. Among its main components are an open source implementation of the MapReduce framework (for distributed data processing) and a data storage solution composed of a distributed filesystem and a data warehouse.
The MapReduce Framework
The MapReduce framework was created and patented by Google to run its PageRank algorithm and the other applications that support its search engine. The idea behind it was actually introduced many years earlier by early functional programming languages such as LISP, and basically consists of partitioning a large problem into several "smaller" problems that can be solved separately. The partitioning and the computation of the final result are handled by two functions: Map and Reduce. In terms of data processing, the Map function takes a large dataset and partitions it into several smaller intermediate datasets that can be processed in parallel by different nodes in a cluster. The Reduce function then takes the separate results of each computation and aggregates them to form the final output. The power of MapReduce can be leveraged by different applications to perform operations such as sorting and statistical analysis on large datasets, which are mapped into smaller partitions and processed in parallel.
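To make the idea concrete, here is a plain Java sketch (independent of Hadoop itself) that sums a large collection of numbers the MapReduce way: the Map step computes a partial sum for each chunk of the input, which on a real cluster would happen on different nodes, and the Reduce step aggregates the partial results into the final answer. All names in this listing are illustrative.

import java.util.ArrayList;
import java.util.List;

// Conceptual illustration only: partition the work ("map"), then
// aggregate the partial results ("reduce").
public class MapReduceSketch {

    // Map: each chunk is processed independently, e.g. on a different node.
    static long map(int[] chunk) {
        long partial = 0;
        for (int value : chunk) partial += value;
        return partial;
    }

    // Reduce: combine the intermediate results into the final output.
    static long reduce(List<Long> partials) {
        long total = 0;
        for (long p : partials) total += p;
        return total;
    }

    public static void main(String[] args) {
        int[][] chunks = { {1, 2, 3}, {4, 5}, {6, 7, 8, 9} };

        List<Long> partials = new ArrayList<Long>();
        for (int[] chunk : chunks) partials.add(map(chunk)); // parallel in practice

        System.out.println("total = " + reduce(partials));   // prints: total = 45
    }
}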
Hadoop MapReduce
Hadoop includes a Java implementation of the MapReduce framework, its underlying components and the necessary large-scale data storage solutions. Although application programming is mostly done in Java, it provides APIs in other languages, such as Ruby and Python, allowing developers to integrate Hadoop with diverse existing applications. It was originally inspired by Google's implementation of MapReduce and the GFS distributed filesystem, and has absorbed new features as the community proposed new subprojects and improvements. Currently, Yahoo is one of the main contributors to the project, making public the modifications carried out by its internal developers. The basis of Hadoop and its several subprojects is the Core, which provides components and interfaces for distributed I/O and filesystems. The Avro data serialization system is also an important building block, providing cross-language RPC and persistent data storage.
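The role Avro plays can be shown with a short, hedged sketch: the code below defines a record schema on the fly and persists a single record to an Avro data file using the generic API. The schema, field names and file name are invented for this example, and exact class names may differ slightly between Avro releases.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema describing a "User" record.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\","
                + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"visits\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("visits", 42);

        // Persist the record in Avro's compact, language-neutral container format.
        DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
                new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("users.avro"));
        writer.append(user);
        writer.close();
    }
}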
On top of the Core sits the actual implementation of MapReduce and its APIs, including Hadoop Streaming, which allows flexible development of Map and Reduce functions in any desired language. A MapReduce cluster is composed of a master node and a cloud of several worker nodes. The nodes in this cluster may be any Java-enabled platform, but large Hadoop installations are mostly run on Linux due to its flexibility, reliability and lower TCO. The master node manages the worker nodes, receiving jobs and distributing the workload across the nodes. In Hadoop terminology, the master node runs the JobTracker, which is responsible for handling incoming jobs and allocating nodes to perform separate tasks. Worker nodes run TaskTrackers, which offer virtual task slots that are allocated to specific map or reduce tasks depending on their access to the necessary input data and their overall availability. Hadoop offers a web management interface, which allows administrators to obtain information on the status of jobs and of individual nodes in the cloud. It also allows fast and easy scaling through the addition of cheap worker nodes without disrupting regular operations.
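To give a feel for the Java API, here is a sketch of the classic word-count job written against the org.apache.hadoop.mapreduce classes: the JobTracker splits the input among map tasks and feeds the grouped intermediate pairs to reduce tasks. The input and output paths are placeholders, and minor details may vary between Hadoop releases.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: emit (word, 1) for every word in an input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Placeholder paths; pass real HDFS input and output locations.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}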
HDFS: A distributed filesystem
The main use of the MapReduce framework is in processing large volumes of data, and before any processing takes place this data must first be stored on a volume accessible to the MapReduce cluster. However, it is impractical to store such large datasets on local filesystems, and even more impractical to synchronize the data across the worker nodes in the cluster. To address this issue, Hadoop also provides the Hadoop Distributed Filesystem (HDFS), which scales easily across the nodes in a MapReduce cluster, leveraging the storage capacity of each node to provide petabyte-scale storage volumes. It eliminates the need for expensive dedicated storage area network solutions while offering similar scalability and performance. HDFS runs on top of the Core and is tightly integrated with the MapReduce APIs provided by Hadoop. It is also accessible via command-line utilities and the Thrift API, which provides interfaces for various programming languages, such as Perl, C++, Python and Ruby. Furthermore, a FUSE (Filesystem in Userspace) driver can be used to mount HDFS as a standard filesystem.
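Beyond the command-line utilities, applications can also access HDFS programmatically through the FileSystem Java API. The sketch below writes a small file and reads it back; the namenode address and file path are placeholders and would normally come from the cluster's configuration files.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; normally picked up from core-site.xml.
        conf.set("fs.default.name", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");

        // Write a small file to the distributed filesystem.
        FSDataOutputStream out = fs.create(file, true);
        out.writeBytes("Hello, HDFS!\n");
        out.close();

        // Read it back.
        FSDataInputStream in = fs.open(file);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        System.out.println(reader.readLine());
        reader.close();
    }
}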
In a typical HDFS+MapReduce cluster, the master node runs a NameNode, while the rest of the (worker) nodes run DataNodes. The NameNode manages the HDFS volumes and is queried by clients to carry out standard filesystem operations, such as adding, copying, moving or deleting files. The DataNodes handle the actual data storage, receiving commands from the NameNode and performing operations on locally stored data. To increase performance and optimize network communications, HDFS implements rack awareness. This feature enables the distributed filesystem and the MapReduce environment to determine which worker nodes are connected to the same switch (that is, are in the same rack), distributing data and allocating tasks in such a way that communication takes place between nodes in the same rack without overloading the network core. HDFS and MapReduce automatically manage which pieces of a given file are stored on each node and allocate nodes to process that data accordingly. When the JobTracker receives a new job, it first queries the DataNodes of worker nodes within the same rack, allocating a task slot if the node has the necessary data stored locally. If no available slots are found in that rack, the JobTracker then allocates the first free slot it finds.
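Rack awareness is driven by a small piece of cluster configuration: the administrator points Hadoop at a topology script that maps each node's address to a rack identifier. The snippet below shows the idea in code form as a hedged sketch; the property name follows the Hadoop 0.20-era convention and the script path is purely illustrative, since in practice this is set cluster-wide in core-site.xml rather than in application code.

import org.apache.hadoop.conf.Configuration;

// Hedged sketch: wiring up rack awareness via a topology script.
public class RackAwarenessConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The script receives node addresses and prints a rack id such as
        // /datacenter1/rack42 for each one; HDFS and the JobTracker use this
        // mapping to keep replicas and tasks close to the data.
        // Property name and script path are assumptions for illustration.
        conf.set("topology.script.file.name", "/etc/hadoop/rack-topology.sh");
        System.out.println(conf.get("topology.script.file.name"));
    }
}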
Hive: A petabyte-scale database
On top of the HDFS distributed filesystem, Hadoop implements Hive, a distributed data warehouse solution. Hive started as an internal project at Facebook and has since evolved into a full-blown project of its own, maintained by the Apache Foundation. It provides ETL (Extract, Transform and Load) features and QL, a query language similar to standard SQL. Hive queries are translated into MapReduce jobs that run on table data stored on HDFS volumes. This allows Hive to process queries involving huge datasets with performance comparable to equivalent MapReduce jobs, while providing the abstraction level of a database. Its strengths are most apparent when running queries over large datasets that do not change frequently. For example, Facebook relies on Hive to store user data, run statistical analysis, process logs and generate reports.
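As a hedged illustration of how an application might talk to Hive, the sketch below submits a simple aggregation through Hive's JDBC interface. The driver class, connection URL and the page_views table are assumptions that would have to match the actual Hive installation; the QL string itself shows how close the language is to ordinary SQL.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumed driver class and URL for a Hive server listening on port 10000;
        // both depend on the Hive version actually installed.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // QL looks like SQL; behind the scenes Hive compiles the query into
        // MapReduce jobs over the table's files stored on HDFS.
        // The page_views table and its columns are hypothetical.
        ResultSet rs = stmt.executeQuery(
                "SELECT country, COUNT(*) FROM page_views GROUP BY country");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
    }
}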
Conclusion
We have given a brief overview of the main features and components of Hadoop. Leveraging the power of cloud computing, many large companies rely on this project for their day-to-day data processing. It is yet another example of open source software being used to build scalable, large-scale applications while keeping costs low. However, we have only scratched the surface of the fascinating infrastructure behind Hadoop and its many possible uses. In future articles, we will see how to set up a basic Hadoop cluster and how to use it for interesting applications such as log parsing and statistical analysis.
Further Reading
If you are interested in learning more about Hadoop's architecture, administration and application development, these are the best places to start:
- Hadoop: The Definitive Guide, Tom White, O'Reilly Media/Yahoo Press, 2nd edition, 2010
- Apache Hadoop Project homepage: http://hadoop.apache.org/