Tuesday 26 February 2013

Introduction to Hadoop Core / File Systems

The Apache Hadoop framework forms the kernel of an operating system for big data, permitting users to share resources while it manages permissions and allocations.

MapReduce Layer :

  • The Task Tracker on each node spawns a separate Java Virtual Machine process to prevent the Task Tracker itself from failing if the running job crashes the JVM.
  • The Job Tracker pushes work out to available Task Tracker nodes in the cluster, striving to keep the work as close to the data as possible; a minimal job-submission sketch follows this list.
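
How work reaches the Job Tracker in the first place is worth seeing in code. The sketch below is a minimal driver against the org.apache.hadoop.mapreduce API (using the classic Hadoop 1.x-era Job constructor; newer releases prefer Job.getInstance). The class names WordCountDriver, WordCountMapper and WordCountReducer are illustrative, not part of Hadoop itself, and the input/output paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Illustrative Mapper/Reducer classes; sketched in the next section.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Each map task is scheduled, where possible, on a node that holds a
        // replica of its input split, keeping the work close to the data.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submits the job to the Job Tracker and waits for completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}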

Crux of MapReduce Architecture:

  • Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.
  • A Reducer reduces the set of intermediate values that share a key to a smaller set of values; a word-count Mapper and Reducer are sketched below.
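
Both points can be made concrete with the standard word-count example, written here against the org.apache.hadoop.mapreduce API. The Mapper turns (byte offset, line) input pairs into (word, 1) intermediate pairs of a different type, emitting zero or more pairs per record, and the Reducer sums every count that shares a word. Class names are illustrative; in a real project each class would live in its own source file.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: input (LongWritable offset, Text line), output (Text word, IntWritable 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // zero or more output pairs per input record
        }
    }
}

// Reducer: collapses all intermediate values that share a key into one total.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}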

HDFS Layer :

  • The NameNode is the single point for storage and management of metadata; this can be a bottleneck when supporting a huge number of files, especially a large number of small files.
  • DataNodes talk to each other to rebalance data, to move copies around, and to keep the replication of data high; a small HDFS client sketch follows.
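
A small client sketch makes the division of labour visible. It assumes the cluster address is configured via fs.defaultFS (fs.default.name in older releases) and uses a hypothetical /user/hduser directory; listing a directory is a pure metadata operation answered by the NameNode, and the replication factor it reports is what the DataNodes cooperate to maintain.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListing {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml from the classpath; the NameNode address comes
        // from fs.defaultFS (fs.default.name in older releases).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // A pure metadata call: only the NameNode is contacted here.
        // DataNodes are involved only when file contents are read or written.
        for (FileStatus status : fs.listStatus(new Path("/user/hduser"))) {   // hypothetical path
            System.out.println(status.getPath() + "  replication=" + status.getReplication());
        }
    }
}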
