Apache Hadoop - An open source implementation of the MapReduce programming model

MapReduce is a programming model and associated implementation, developed by Google, for processing massive distributed data sets. Apache Hadoop is an open source software framework implementing MapReduce that supports running data-intensive distributed applications on large clusters built of commodity hardware.

Hadoop, licensed under the Apache v2 license, enables applications to work with thousands of computationally independent machines and petabytes of data. It was derived from Google's MapReduce and Google File System (GFS) papers. Hadoop is a top-level Apache project, written in the Java programming language, built and used by a global community of contributors. Yahoo! has been the largest contributor[3] to the project and uses Hadoop extensively across its businesses.

The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, in which an application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the Hadoop Distributed File System are designed so that node failures are handled automatically by the framework.

As a conceptual framework for processing huge data sets, MapReduce is highly optimized for distributed problem solving using a large number of computers. As its name implies, the framework consists of two functions. The map function takes a large data input, divides it into smaller pieces, and hands each piece off to processes that can operate on it in parallel. The reduce function digests the individual answers collected by map and renders them into a final output.

In Hadoop, you define map and reduce implementations by extending Hadoop's own base classes. The implementations are tied together by a configuration that specifies them, along with input and output formats. Hadoop is well suited for processing huge files of structured data. One particularly handy aspect is that Hadoop handles the raw parsing of an input file, so your map implementation can deal with one line at a time. Defining a map function is thus really just a matter of deciding what to grab from an incoming line of text.
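To make this concrete, here is a minimal word-count sketch against the org.apache.hadoop.mapreduce API: the map implementation emits a (word, 1) pair for every token on an incoming line, the reduce implementation sums the counts per word, and a driver configuration ties them together with input and output formats. Class names and paths are illustrative, not prescribed by Hadoop.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map: called once per line of input; emits a (word, 1) pair per token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // reduce: called once per distinct word with all of its counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // The driver configuration ties map and reduce together with I/O formats.
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, such a job would typically be launched with: hadoop jar wordcount.jar WordCount <input dir> <output dir>.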

HDFS (Hadoop Distributed File System): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. The two lists below summarize what it is and is not suited for; a minimal usage sketch in Java follows them.

HDFS is well suited for:

  • Storing large files
    • Terabytes, petabytes, and beyond
    • Millions rather than billions of files, each 100 MB or more
  • Streaming data
    • Write-once, read-many-times access patterns
    • Optimized for streaming reads rather than random reads
    • An append operation was added in Hadoop 0.21
  • “Cheap” commodity hardware
    • No need for supercomputers

HDFS is not well suited for:

  • Low-latency reads
    • Optimized for high throughput rather than low latency on small chunks of data
    • HBase addresses this issue
  • Large numbers of small files
    • Better suited to millions of large files than to billions of small files
    • For example, each file can be 100 MB or more
  • Multiple writers
    • Only a single writer per file
    • Writes occur only at the end of a file; there is no support for writes at arbitrary offsets
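To make the streaming, write-once model concrete, here is a minimal sketch of writing and then reading a file through the HDFS Java API. The namenode address, configuration key, and paths are illustrative assumptions, not values prescribed by Hadoop.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative cluster address; normally picked up from core-site.xml.
    // (Older Hadoop releases use the key "fs.default.name" instead.)
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/greeting.txt");

    // Write once: a single writer streams bytes to the end of the file.
    FSDataOutputStream out = fs.create(file);
    out.write("Hello, HDFS\n".getBytes("UTF-8"));
    out.close();

    // Read many times: sequential, streaming reads are the fast path.
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
    System.out.println(in.readLine());
    in.close();
    fs.close();
  }
}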


Pig vs Hive:

Pig is a platform for analyzing large data sets: a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs. Pig scripts are translated into a series of MapReduce jobs that run on the Hadoop cluster, so they provide a high-level way to create the MapReduce jobs needed to process data. Pig is extensible through user-defined functions, which can be written in Java and other languages.
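Since Pig UDFs can be written in Java, here is a minimal sketch of one: a trivial function that upper-cases a string field. The class name is illustrative; only the org.apache.pig.EvalFunc pattern comes from Pig itself.

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A trivial Pig UDF: upper-cases its single string argument.
public class Upper extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    // Guard against empty or null input tuples.
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return ((String) input.get(0)).toUpperCase();
  }
}

In a Pig script, such a function would be packaged into a jar, registered with REGISTER, and then invoked like any built-in function.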

Apache Hive provides data warehouse functionality on top of the Hadoop cluster. Through HiveQL you can view your data as tables and write queries much as you would against a database. To make it easy to interact with Hive, the Hortonworks Sandbox includes a tool called Beeswax, which gives us an interactive interface to Hive: we can type in queries and have Hive evaluate them using a series of MapReduce jobs.
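Beeswax is interactive, but Hive can also be queried programmatically. Below is a minimal sketch using Hive's JDBC driver, assuming a HiveServer2 endpoint on localhost:10000 and an illustrative table name; everything else is standard java.sql.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; host, port, and table name are assumptions.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "hive", "");
    Statement stmt = con.createStatement();

    // HiveQL looks like SQL; Hive compiles it into MapReduce jobs.
    ResultSet rs = stmt.executeQuery(
        "SELECT category, COUNT(*) FROM sample_table GROUP BY category");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    rs.close();
    stmt.close();
    con.close();
  }
}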



Pig is a procedural data-flow language: the programmer defines a step-by-step sequence of operations and can control the optimization of every step.
Hive looks like SQL, which makes it a declarative language: you specify what should be done rather than how it should be done. Fine-grained optimization is harder in Hive because you depend on Hive's own optimizer.

References:

MapReduce: Simplified Data Processing on Large Clusters - http://static.usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf

Apache Hadoop Goes Realtime at Facebook - http://borthakur.com/ftp/RealtimeHadoopSigmod2011.pdf

Java development 2.0: Big data analysis with Hadoop MapReduce - http://www.ibm.com/developerworks/java/library/j-javadev2-15/index.html

To Hadoop, or not to Hadoop - https://www.ibm.com/developerworks/mydeveloperworks/blogs/theTechTrek/entry/to_hadoop_or_not_to_hadoop2?lang=en

What is Hadoop - http://www-01.ibm.com/software/data/infosphere/hadoop/

Distributed data processing with Hadoop, Part 1: Getting started - http://www.ibm.com/developerworks/linux/library/l-hadoop-1/

Distributed data processing with Hadoop, Part 2: Going further - http://www.ibm.com/developerworks/linux/library/l-hadoop-2/

Distributed data processing with Hadoop, Part 3: Application development - http://www.ibm.com/developerworks/linux/library/l-hadoop-3/

An introduction to the Hadoop Distributed File System - http://www.ibm.com/developerworks/web/library/wa-introhdfs/

Scheduling in Hadoop - http://www.ibm.com/developerworks/linux/library/os-hadoop-scheduling/index.html

Using MapReduce and load balancing on the cloud - http://www.ibm.com/developerworks/cloud/library/cl-mapreduce/

Intel Big Data - http://www.intel.com/bigdata

Apache Hadoop Framework Spotlights - http://www.intel.com/content/www/us/en/big-data/big-data-apache-hadoop-framework-spotlights-landing.html


Miscellaneous:


  • The Hadoop wiki provides community input related to Hadoop and HDFS.
  • The Hadoop API site documents the Java classes and interfaces that are used to program to Hadoop and HDFS.
  • Wikipedia's MapReduce page is a great place to begin your research into the MapReduce framework.
  • Visit Amazon S3 to learn about Amazon's S3 infrastructure.
  • The Hadoop project site contains valuable resources pertaining to the Hadoop architecture and the MapReduce framework.
  • The Hadoop Distributed File System project site offers downloads and documentation about HDFS.
  • Venture to the CloudStore site for downloads and documentation about the integration between CloudStore, Hadoop, and HDFS.