1. What is Hadoop
Hadoop core components:
- HDFS (Hadoop Distributed File System) - stores the data on the cluster
- MapReduce - processes the data on the cluster
- HDFS is a file system written in Java
- Sits on top of the native file system
- Provides storage for massive amounts of data:
- scalable
- fault tolerant
- supports efficient processing with MapReduce
How files are stored:
- Data files are split into blocks and distributed to the DataNodes
- Each block is replicated on multiple nodes (3 by default)
- The NameNode stores the metadata
2. Getting data in / out of HDFS
Example of storing and retrieving a file:
We have two log files: 031512.log, which is split into 3 blocks, and 041213.log, which is split into 2 blocks.
Each block is replicated on nodes in the cluster. The mapping of files to blocks, and of blocks to nodes, is stored on the NameNode.
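As a rough illustration (the code and paths are mine, not from the original notes), this is how a client could store and retrieve a file with the HDFS Java API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutGet {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Store: the client streams the file to DataNodes block by block;
        // the NameNode only records which blocks belong to which file.
        fs.copyFromLocalFile(new Path("/tmp/031512.log"),
                             new Path("/logs/031512.log"));

        // Retrieve: the client asks the NameNode for the block locations,
        // then reads the blocks directly from the DataNodes.
        fs.copyToLocalFile(new Path("/logs/031512.log"),
                           new Path("/tmp/031512.copy.log"));

        fs.close();
    }
}
```

Note that the NameNode never handles the file data itself - it only answers metadata questions, while the blocks travel directly between the client and the DataNodes.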
Important notes about HDFS:
- Size
- HDFS performs best with millions of files rather than billions, because the NameNode holds all file and block metadata in memory, so the number of files is limited by the NameNode's RAM
- Each file is typically 100MB or more; many small files waste NameNode memory and defeat the efficiency of reading large blocks sequentially
- Files in HDFS are write-once: the content cannot be changed, but the file can be replaced
- HDFS is optimized for large streaming reads of files rather than random-access reads
3. How MapReduce works
MapReduce has 3 main phases:
- phase 1 - The Mapper
- Each Map task works (typically) on one HDFS block
- A Map task runs (typically) on the same node where its block is stored
- phase 2 - Shuffle & Sort
- sorts and collects the intermediate data from all the Mappers
- happens after all Map tasks are completed
- phase 3 - The Reducer
- operates on the sorted / shuffled intermediate data (the previous phase's output)
- produces the final output
Example: counting words
Phase 1 - The Mapper reads the text and emits (word, 1) for every word
Phase 2 - Shuffle & Sort groups and sorts the pairs by word
Phase 3 - The Reducer sums the counts for each word and writes the totals
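A minimal word-count sketch in Java using the standard org.apache.hadoop.mapreduce API (illustrative code, not from the original post):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Phase 1: for every word in the input line, emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Phase 3: the framework has already shuffled and sorted the pairs,
    // so each reduce() call sees one word with all of its 1s; sum them.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```

The framework calls map() once per input record and reduce() once per distinct word; everything in between (partitioning, sorting, grouping) is the Shuffle & Sort phase.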
It is important to understand that:
- Map tasks run in parallel - this reduces the computation time.
- Map tasks run on the machines that contain the data (data locality), so network traffic is kept to a minimum.
- Reduce tasks also run in parallel.
Features of MapReduce:
- Automatic data distribution and parallelism
- Built-in fault tolerance
- Clean abstraction for developers:
- The developer does not need to handle the housekeeping (scheduling, data movement, recovery from failures)
- The developer concentrates on writing the Map and Reduce functions
Another example: analyzing log files - for instance, counting the hits per IP address in a web server log, as sketched below.
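A rough sketch of such a Mapper (my example, assuming the IP address is the first whitespace-separated field of each log line); it can be paired with the same summing Reducer as in the word-count example:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (ip, 1) for each log line; a summing Reducer then
// totals the hits per IP, exactly as in word count.
public class IpHitMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text ip = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumption: the IP address is the first field of the line
        String line = value.toString();
        int space = line.indexOf(' ');
        if (space > 0) {
            ip.set(line.substring(0, space));
            context.write(ip, ONE);
        }
    }
}
```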
4. Inside a Hadoop Cluster
A cluster is a group of machines (nodes) working together to process MapReduce jobs.
Installing a Hadoop cluster:
- Done by administrators - a manual installation is not easy
- Easier using CDH (Cloudera's Distribution including Apache Hadoop)
- Easier still using Cloudera Manager
Adding MapReduce:
The NameNode and the JobTracker have backups for fault tolerance.
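To see how a job reaches the cluster, here is a minimal driver sketch that configures and submits the word-count job from section 3 (the input and output paths are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS paths for input and output
        FileInputFormat.addInputPath(job, new Path("/logs"));
        FileOutputFormat.setOutputPath(job, new Path("/wordcount-out"));

        // Submits the job and waits; the framework handles the scheduling,
        // data locality, and fault tolerance described above.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The developer only supplies this driver plus the Map and Reduce classes; the cluster takes care of everything else.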