Tech Archive: Big Data

Big data is a broad term for any collection of data sets so large and complex that it becomes difficult to process with on-hand data management tools or traditional data processing applications. The trend toward larger data sets is driven by the additional information that can be derived from analyzing a single large set of related data, compared with analyzing separate smaller sets that hold the same total amount of data.

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
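The student example above can be sketched in a few lines of plain Python. This is only a single-process illustration of the model, not how a real framework distributes the work; the function names (`map_student`, `shuffle`, `reduce_queue`, `name_frequencies`) and the sample student list are assumptions made up for the sketch.

```python
from collections import defaultdict


def map_student(student):
    """Map step: emit a (first_name, 1) pair for one student record."""
    first_name = student.split()[0]
    return (first_name, 1)


def shuffle(pairs):
    """Group values by key, one queue per first name.

    In a real MapReduce system the framework does this between the
    map and reduce steps; here it is an ordinary in-memory dict.
    """
    queues = defaultdict(list)
    for key, value in pairs:
        queues[key].append(value)
    return queues


def reduce_queue(first_name, counts):
    """Reduce step: summarize one queue as the number of students in it."""
    return (first_name, sum(counts))


def name_frequencies(students):
    """Run map, shuffle, and reduce in sequence on a list of full names."""
    pairs = [map_student(s) for s in students]
    queues = shuffle(pairs)
    return dict(reduce_queue(name, counts) for name, counts in queues.items())


if __name__ == "__main__":
    # Hypothetical sample data.
    students = ["Ada Lovelace", "Alan Turing", "Ada Yonath", "Grace Hopper"]
    print(name_frequencies(students))  # {'Ada': 2, 'Alan': 1, 'Grace': 1}
```

Because each call to `map_student` and each call to `reduce_queue` depends only on its own input, a framework is free to run them on different servers in parallel, which is the part this sketch leaves out.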

In Hadoop, the term MapReduce refers to two separate and distinct tasks that a program performs. The first is the map job, which takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The second is the reduce job, which takes the output of a map job as its input and combines those tuples into a smaller set of tuples. As the order of the words in the name MapReduce implies, the reduce job is always performed after the map job.
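To make the tuple flow concrete, here is a minimal sketch in plain Python (not Hadoop's actual API). It assumes a hypothetical input of "city,temperature" lines and a reduce job that keeps the maximum temperature per city; the names `map_job`, `reduce_job`, and `run` are invented for illustration.

```python
from itertools import groupby
from operator import itemgetter


def map_job(line):
    """Map job: turn one raw "city,temperature" line into a (key, value) tuple."""
    city, temperature = line.split(",")
    return (city.strip(), int(temperature))


def reduce_job(city, temperatures):
    """Reduce job: combine all values for one key into a single tuple."""
    return (city, max(temperatures))


def run(lines):
    """Run the map job first, then feed its grouped output to the reduce job."""
    mapped = [map_job(line) for line in lines]
    # Sort by key so that groupby sees all tuples for a city together.
    mapped.sort(key=itemgetter(0))
    return [
        reduce_job(city, [temp for _, temp in group])
        for city, group in groupby(mapped, key=itemgetter(0))
    ]


if __name__ == "__main__":
    # Hypothetical sample data: four input tuples reduce to two output tuples.
    lines = ["Toronto, 20", "Rome, 32", "Toronto, 4", "Rome, 33"]
    print(run(lines))  # [('Rome', 33), ('Toronto', 20)]
```

The reduce job cannot start on a key until every tuple for that key has been produced, which is why the reduce job runs after the map job.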

References:

Big Data - http://en.wikipedia.org/wiki/Big_data

MapReduce - http://en.wikipedia.org/wiki/MapReduce

Hadoop tutorial - http://www.coreservlets.com/hadoop-tutorial/

What is Hadoop - http://www-01.ibm.com/software/data/infosphere/hadoop/

MapReduce: Simplified Data Processing on Large Clusters - http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf

Google’s MapReduce Programming Model (Revisited) - http://userpages.uni-koblenz.de/~laemmel/MapReduce/paper.pdf

MapReduce: Simplified Data Processing on Large Clusters - http://www.cs.utexas.edu/~pingali/CS395T/2012sp/lectures/MR-nikhil-panpalia.pdf

Hadoop/MapReduce - http://www.cs.colorado.edu/~kena/classes/5448/s11/presentations/hadoop.pdf

Apache's implementation of Google's MapReduce framework - https://www.defcon.org/images/defcon-17/dc-17-presentations/defcon-17-calca-anguiano-hadoop.pdf

Intel big data - http://www.intel.com/bigdata

Apache Hadoop Framework Spotlights - http://www.intel.com/content/www/us/en/big-data/big-data-apache-hadoop-framework-spotlights-landing.html
