Tuesday, May 12, 2015

Hadoop - Basic Concepts

Hello

1. What is Hadoop

Hadoop is an open-source framework for distributed storage and distributed processing of very large data sets on clusters of commodity hardware. It scales out by adding machines rather than by scaling up a single server.
Hadoop has two core components:

  • HDFS (Hadoop Distributed File System) - stores the data on the cluster
  • MapReduce - processes the data on the cluster

2. HDFS basic concepts

  • HDFS is a file system written in Java
  • It sits on top of the native OS file system (such as ext3 or ext4)
  • It provides storage of massive amounts of data:
      • scalable
      • fault tolerant
      • supports efficient processing with MapReduce


How files are stored

  • Data files are split into blocks and distributed to the DataNodes
  • Each block is replicated on multiple nodes (3 by default)
  • NameNode stores metadata


Getting data in and out of HDFS
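
On the command line this is typically done with hadoop fs -put and hadoop fs -get. The same thing can be done through the Hadoop Java API; the sketch below assumes a configured Hadoop client on the classpath, and the file paths are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Put a local log file into HDFS (paths are hypothetical)
        fs.copyFromLocalFile(new Path("/var/logs/031512.log"),
                             new Path("/user/nathan/logs/031512.log"));

        // Get it back out of HDFS onto the local file system
        fs.copyToLocalFile(new Path("/user/nathan/logs/031512.log"),
                           new Path("/tmp/031512.log"));

        fs.close();
    }
}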



Example of storing / retrieving a file:
We have two log files: 031512.log, which is split into 3 blocks, and 041213.log, which is split into 2 blocks.
Each block is replicated on nodes in the cluster. The mapping of each file to its blocks, and of each block to the nodes that hold it, is stored on the NameNode.
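
This file-to-blocks-to-nodes mapping can actually be inspected from the Java API. A minimal sketch, assuming the log file above already sits in HDFS under a hypothetical path:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical HDFS path for the log file from the example above
        Path file = new Path("/user/nathan/logs/031512.log");
        FileStatus status = fs.getFileStatus(file);

        // Ask the NameNode which blocks make up the file and where they live
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}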




Important notes about HDFS:

  • Size
      • HDFS performs best with millions of files rather than billions, because the NameNode holds all file and block metadata in memory
      • Each file is typically 100MB or more; HDFS uses large blocks (64MB by default), so many small files waste block slots and NameNode memory
  • Files in HDFS are write-once: their contents cannot be changed, but a file can be replaced
  • HDFS is optimized for large streaming reads of files rather than random-access reads


3. How MapReduce works

MapReduce has 3 main phases:

  • Phase 1 - The Mapper
      • Each Map task (typically) works on one HDFS block
      • Map tasks (typically) run on the same node where the block is stored
  • Phase 2 - Shuffle & Sort
      • sorts and collects all intermediate data from all Mappers
      • happens after all Map tasks are completed
  • Phase 3 - The Reducer
      • operates on the sorted / shuffled intermediate data - the previous phase's output
      • produces the final output


Example : counting words

Phase 1 - The Mapper maps the text: each Map task reads its lines and emits a (word, 1) pair for every word

Phase 2 - Shuffle & Sort: the framework groups the intermediate pairs by word, so all counts for the same word arrive at the same Reducer

Phase 3 - Reduce: each Reduce task sums the counts for its words and writes out (word, total) as the final result
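
In code, counting words looks essentially like the classic WordCount example from the Hadoop MapReduce tutorial (shown here as a sketch; input and output paths come from the command line). Note that Shuffle & Sort has no code of its own - the framework performs it between the Mapper and the Reducer:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Phase 1 - the Mapper: emit (word, 1) for every word in the input
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Phase 3 - the Reducer: sum the counts collected for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // partial sums on the Map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Registering the Reducer as a Combiner means partial sums are computed on the Map side, so less intermediate data crosses the network during the shuffle.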


It is important to understand that:

  • Map tasks run in parallel - this reduces computation time.
  • Map tasks run on the machines that contain the data, so there are no network traffic issues (data locality).
  • Reduce tasks also run in parallel.




Features of MapReduce

  • Automatic data distribution and parallelism
  • Built-in fault tolerance
  • A clean abstraction for developers:
      • The developer does not need to handle the housekeeping
      • The developer concentrates on writing the Map and Reduce functions



Another example: analyzing log files
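
A sketch of what such a job might look like: the log format here is hypothetical (space-delimited lines whose first field is the client IP), the Mapper emits (ip, 1) pairs, and the summing Reducer from WordCount above can be reused unchanged to count hits per IP:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Counts hits per IP address when paired with a summing Reducer
public class LogIpMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text ip = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumes a space-delimited log line whose first field is the client IP
        String[] fields = value.toString().split(" ");
        if (fields.length > 0 && !fields[0].isEmpty()) {
            ip.set(fields[0]);
            context.write(ip, one);
        }
    }
}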




4. Inside Hadoop Cluster

A cluster is a group of machines (nodes) working together to process MapReduce jobs.

Installing a Hadoop Cluster

  • Done by administrators - a manual installation is not easy
  • Made easier by using CDH (Cloudera's Distribution including Apache Hadoop)
  • Made easier still by using Cloudera Manager


Basic cluster configuration

In a basic configuration, one master node runs the NameNode daemon, while each worker node runs a DataNode daemon and stores blocks.

Adding MapReduce:

On top of that, a JobTracker daemon accepts and schedules jobs, and each worker node runs a TaskTracker daemon next to its DataNode so that tasks can run where the data is.

The NameNode and the JobTracker have backups for fault tolerance.


Nathan
