1. What is Hadoop
Hadoop core components:
- HDFS (Hadoop Distributed File System) - stores the data on the cluster
- MapReduce - processes the data on the cluster
- HDFS is a file system written in Java
- It sits on top of the native file system (e.g. ext4); it does not replace it
- Provides storage for massive amounts of data:
  - scalable
  - fault tolerant
  - supports efficient processing with MapReduce
How files are stored
- Data files are split into blocks, and the blocks are distributed to the DataNodes
- Each block is replicated on multiple nodes (3 replicas by default)
- The NameNode stores the metadata: which blocks make up each file and where each block's replicas live (see the sketch below)
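To make the NameNode's role concrete, here is a minimal sketch using the standard Hadoop Java FileSystem API (the path /logs/031512.log is a hypothetical example) that asks which nodes hold each block of a file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
        public static void main(String[] args) throws Exception {
            // Connects to the cluster configured in core-site.xml
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/logs/031512.log"));
            // One entry per block, answered from the NameNode's metadata
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset " + b.getOffset()
                        + " length " + b.getLength()
                        + " replicas: " + String.join(", ", b.getHosts()));
            }
        }
    }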
2. Getting data in / out of HDFS
Example of storing / retrieving a file:
We have two log files: 031512.log, which is split into 3 blocks, and 041213.log, which is split into 2 blocks.
Each block is replicated on nodes in the cluster. The mapping of files to blocks, and of blocks to nodes, is stored on the NameNode.
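A minimal sketch of storing and retrieving a file through the same Java API (the local and HDFS paths are hypothetical; the hadoop fs -put / -get shell commands do the same job):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PutAndRead {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // In: the file is split into blocks and replicated as it is written
            fs.copyFromLocalFile(new Path("/tmp/041213.log"),
                                 new Path("/logs/041213.log"));
            // Out: the client streams each block back from one of its replicas
            try (FSDataInputStream in = fs.open(new Path("/logs/041213.log"))) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) > 0) {
                    System.out.write(buf, 0, n);
                }
                System.out.flush();
            }
        }
    }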
Important notes about HDFS:
- Size
  - HDFS performs best with millions of files rather than billions: the NameNode holds all file/block metadata in RAM, so the file count is limited by the NameNode's memory
  - Each file is typically 100MB or more: blocks are large (64MB by default), so many small files add NameNode metadata overhead and produce lots of tiny, inefficient map tasks
- Files in HDFS are write-once: the contents cannot be changed, but the whole file can be replaced
- HDFS is optimized for large streaming reads of files rather than random access
3. How MapReduce works
MapReduce has 3 main phases:
- Phase 1 - The Mapper
  - Each Map task (typically) works on one HDFS block
  - Map tasks (typically) run on the same node where the block is stored
- Phase 2 - Shuffle & Sort
  - sorts and collects the intermediate data from all the Mappers
  - happens after all Map tasks are completed
- Phase 3 - The Reducer
  - operates on the sorted / shuffled intermediate data (the output of the previous phase)
  - produces the final output
Example: counting words (a code sketch follows the phase summary below)
Phase 1 - The Mapper reads the text and emits a (word, 1) pair for every word in its block
Phase 2 - Shuffle & Sort groups the pairs so that all the counts for the same word reach the same Reducer
Phase 3 - The Reducer sums the counts for each word and writes the final (word, count) output
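A minimal sketch of the three phases in code, using the standard Hadoop Java MapReduce API (this follows the classic WordCount example from the Hadoop tutorials; input and output paths come from the command line):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Phase 1 - the Mapper: emits (word, 1) for every word in its block
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Phase 3 - the Reducer: receives (word, [1, 1, ...]) after the
        // framework's Shuffle & Sort (phase 2) and sums the counts
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                               Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Note that Shuffle & Sort (phase 2) is not written by the developer - the framework performs it between the Mapper and the Reducer.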
It is important to understand that:
- Map tasks run in parallel - this reduces the computation time.
- Map tasks run on the machines that contain the data (data locality), so the input does not have to travel over the network.
- Reduce tasks also run in parallel.
Features of MapReduce
- Automatic data distribution and parallelism
- Built-in fault tolerance
- Clean abstraction for developers:
  - The developer does not need to handle the housekeeping (scheduling, data movement, retries)
  - The developer concentrates on writing the Map and Reduce functions
Another example: analyzing log files
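One possible sketch: counting requests per client IP. Only the Mapper changes; a sum Reducer like the one in the word count can be reused. This assumes a web-server log format where the client IP is the first whitespace-separated field of each line:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (clientIP, 1) for every log line; paired with a sum Reducer
    // this produces the number of hits per IP
    public class LogIpMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text ip = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            int firstSpace = line.indexOf(' ');
            if (firstSpace > 0) { // skip malformed / empty lines
                ip.set(line.substring(0, firstSpace));
                context.write(ip, ONE);
            }
        }
    }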
4. Inside a Hadoop Cluster
A cluster is a group of machines (nodes) working together to process MapReduce jobs.
Installing a Hadoop cluster:
- Done by hand by administrators - not easy
- Done more easily using CDH (Cloudera's Distribution including Apache Hadoop)
- Done even more easily using Cloudera Manager
Adding MapReduce:
- The JobTracker daemon runs on a master node; TaskTracker daemons run on the worker nodes, next to the DataNodes
- The NameNode and the JobTracker have backups for fault tolerance