Apache Hadoop - An open source implementation of the MapReduce programming model

MapReduce is a programming model and associated implementation, developed by Google, for processing massive distributed data sets. Apache Hadoop is an open source software framework implementing MapReduce that supports running data-intensive distributed applications on large clusters built of commodity hardware.

Hadoop, licensed under the Apache v2 license, enables applications to work with thousands of computationally independent machines and petabytes of data. It was derived from Google's MapReduce and Google File System (GFS) papers. Hadoop is a top-level Apache project, written in the Java programming language, built and used by a global community of contributors. Yahoo! has been the largest contributor[3] to the project and uses Hadoop extensively across its businesses.

The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, in which an application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the Hadoop Distributed File System are designed so that node failures are handled automatically by the framework.

As a conceptual framework for processing huge data sets, MapReduce is highly optimized for distributed problem solving using a large number of computers. As its name implies, the framework consists of two functions. The map function takes a large data input, divides it into smaller pieces, and hands each piece off to processes that can operate on it in parallel. The reduce function digests the individual answers collected by map and renders them into a final output.

In Hadoop, you define map and reduce implementations by extending Hadoop's own base classes. The implementations are tied together by a configuration that specifies them, along with input and output formats. Hadoop is well suited for processing huge files of structured data. One particularly handy aspect is that Hadoop handles the raw parsing of an input file, so your map implementation can deal with one line at a time. Defining a map function is thus really just a matter of deciding what to grab from an incoming line of text.
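To make this concrete, here is a minimal word-count sketch against the org.apache.hadoop.mapreduce API: the map implementation emits a (word, 1) pair for every token on an incoming line, the reduce implementation sums the counts per word, and a driver configuration ties them together with input and output formats. Class names and paths are illustrative, not prescribed by Hadoop.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map: called once per line of input; emits a (word, 1) pair per token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // reduce: called once per distinct word with all of its counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // The driver configuration ties map and reduce together with I/O formats.
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, such a job would typically be launched with: hadoop jar wordcount.jar WordCount <input dir> <output dir>.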

HDFS (Hadoop Distributed File System): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. The two lists below summarize what it is and is not suited for; a minimal usage sketch in Java follows them.

HDFS is well suited for:

  • Storing large files
    • Terabytes, petabytes, and beyond
    • Millions rather than billions of files, each 100 MB or more
  • Streaming data
    • Write-once, read-many-times access patterns
    • Optimized for streaming reads rather than random reads
    • An append operation was added in Hadoop 0.21
  • “Cheap” commodity hardware
    • No need for supercomputers

HDFS is not well suited for:

  • Low-latency reads
    • Optimized for high throughput rather than low latency on small chunks of data
    • HBase addresses this issue
  • Large numbers of small files
    • Better suited to millions of large files than to billions of small files
    • For example, each file can be 100 MB or more
  • Multiple writers
    • Only a single writer per file
    • Writes occur only at the end of a file; there is no support for writes at arbitrary offsets
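To make the streaming, write-once model concrete, here is a minimal sketch of writing and then reading a file through the HDFS Java API. The namenode address, configuration key, and paths are illustrative assumptions, not values prescribed by Hadoop.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative cluster address; normally picked up from core-site.xml.
    // (Older Hadoop releases use the key "fs.default.name" instead.)
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/greeting.txt");

    // Write once: a single writer streams bytes to the end of the file.
    FSDataOutputStream out = fs.create(file);
    out.write("Hello, HDFS\n".getBytes("UTF-8"));
    out.close();

    // Read many times: sequential, streaming reads are the fast path.
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
    System.out.println(in.readLine());
    in.close();
    fs.close();
  }
}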


Pig vs Hive:

Pig is a platform for analyzing large data sets: a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs. Pig scripts are translated into a series of MapReduce jobs that run on the Hadoop cluster, so they provide a high-level way to create the MapReduce jobs needed to process data. Pig is extensible through user-defined functions, which can be written in Java and other languages.
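Since Pig UDFs can be written in Java, here is a minimal sketch of one: a trivial function that upper-cases a string field. The class name is illustrative; only the org.apache.pig.EvalFunc pattern comes from Pig itself.

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A trivial Pig UDF: upper-cases its single string argument.
public class Upper extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    // Guard against empty or null input tuples.
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return ((String) input.get(0)).toUpperCase();
  }
}

In a Pig script, such a function would be packaged into a jar, registered with REGISTER, and then invoked like any built-in function.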

Apache Hive provides data warehouse functionality on top of the Hadoop cluster. Through HiveQL you can view your data as tables and write queries much as you would against a database. To make it easy to interact with Hive, the Hortonworks Sandbox includes a tool called Beeswax, which gives us an interactive interface to Hive: we can type in queries and have Hive evaluate them using a series of MapReduce jobs.
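Beeswax is interactive, but Hive can also be queried programmatically. Below is a minimal sketch using Hive's JDBC driver, assuming a HiveServer2 endpoint on localhost:10000 and an illustrative table name; everything else is standard java.sql.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; host, port, and table name are assumptions.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "hive", "");
    Statement stmt = con.createStatement();

    // HiveQL looks like SQL; Hive compiles it into MapReduce jobs.
    ResultSet rs = stmt.executeQuery(
        "SELECT category, COUNT(*) FROM sample_table GROUP BY category");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    rs.close();
    stmt.close();
    con.close();
  }
}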



Pig is a procedural data-flow language: the programmer defines a step-by-step sequence of operations and can control the optimization of every step.
Hive looks like SQL, which makes it a declarative language: you specify what should be done rather than how it should be done. Fine-grained optimization is harder in Hive because you depend on Hive's own optimizer.

References:

MapReduce: Simplified Data Processing on Large Clusters - http://static.usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf

Apache Hadoop Goes Realtime at Facebook - http://borthakur.com/ftp/RealtimeHadoopSigmod2011.pdf

Java development 2.0: Big data analysis with Hadoop MapReduce - http://www.ibm.com/developerworks/java/library/j-javadev2-15/index.html

To Hadoop, or not to Hadoop - https://www.ibm.com/developerworks/mydeveloperworks/blogs/theTechTrek/entry/to_hadoop_or_not_to_hadoop2?lang=en

What is Hadoop - http://www-01.ibm.com/software/data/infosphere/hadoop/

Distributed data processing with Hadoop, Part 1: Getting started - http://www.ibm.com/developerworks/linux/library/l-hadoop-1/

Distributed data processing with Hadoop, Part 2: Going further - http://www.ibm.com/developerworks/linux/library/l-hadoop-2/

Distributed data processing with Hadoop, Part 3: Application development - http://www.ibm.com/developerworks/linux/library/l-hadoop-3/

An introduction to the Hadoop Distributed File System - http://www.ibm.com/developerworks/web/library/wa-introhdfs/

Scheduling in Hadoop - http://www.ibm.com/developerworks/linux/library/os-hadoop-scheduling/index.html

Using MapReduce and load balancing on the cloud - http://www.ibm.com/developerworks/cloud/library/cl-mapreduce/

Intel Big Data - http://www.intel.com/bigdata

Apache Hadoop Framework Spotlights - http://www.intel.com/content/www/us/en/big-data/big-data-apache-hadoop-framework-spotlights-landing.html


Miscellaneous:


  • The Hadoop wiki provides community input related to Hadoop and HDFS.
  • The Hadoop API site documents the Java classes and interfaces that are used to program to Hadoop and HDFS.
  • Wikipedia's MapReduce page is a great place to begin your research into the MapReduce framework.
  • Visit Amazon S3 to learn about Amazon's S3 infrastructure.
  • The Hadoop project site contains valuable resources pertaining to the Hadoop architecture and the MapReduce framework.
  • The Hadoop Distributed File System project site offers downloads and documentation about HDFS.
  • Venture to the CloudStore site for downloads and documentation about the integration between CloudStore, Hadoop, and HDFS.