Tuesday, May 12, 2015

Hadoop - Basic Concepts

Hello

1. What is Hadoop




Hadoop core components:

  • HDFS (Hadoop Distributed File System) - stores the data on the cluster
  • MapReduce - processes the data on the cluster

2. HDFS basic concepts
  • HDFS is a file system written in Java
  • Sits on top of the native file system (HDFS blocks are stored as regular files on each node's local file system)
  • Provides storage for massive amounts of data:
  • scalable
  • fault tolerant
  • supports efficient processing with MapReduce


How files are stored

  • Data files are split into blocks and distributed to the DataNodes
  • Each block is replicated on multiple nodes (3 by default)
  • The NameNode stores the metadata


Getting data in / out of HDFS



Example of storing / retrieving a file:
We have two log files: 031512.log, which is split into 3 blocks, and 041213.log, which is split into 2 blocks.
Each block is replicated on nodes in the cluster. The mapping of files to blocks and of blocks to nodes is stored on the NameNode.
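
To make this concrete, here is a minimal sketch (mine, not from the course) of putting a file into HDFS, asking the NameNode for its block-to-node mapping, and copying it back out, using the standard org.apache.hadoop.fs.FileSystem API. The NameNode address and the paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutGetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);

        // put: the file is split into blocks, replicated, and distributed to DataNodes
        fs.copyFromLocalFile(new Path("031512.log"), new Path("/logs/031512.log"));

        // ask the NameNode which blocks make up the file and which nodes hold them
        FileStatus status = fs.getFileStatus(new Path("/logs/031512.log"));
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }

        // get: copy the file back out of HDFS to the local file system
        fs.copyToLocalFile(new Path("/logs/031512.log"), new Path("031512.copy.log"));
    }
}

Running this only needs the hadoop-client libraries on the classpath; the same put / get operations are also available from the command line.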




Important notes about HDFS:

  • Size:
  • HDFS performs best with millions of files rather than billions (the NameNode keeps all file metadata in memory, so huge numbers of small files overwhelm it)
  • Each file is typically 100 MB or more (HDFS blocks are large, and many small files add metadata and seek overhead)
  • Files in HDFS are write-once: the content cannot be changed, but a file can be replaced
  • HDFS is optimized for large streaming reads of files rather than random-access reads


3. How MapReduce works

MapReduce has 3 main phases:

  • Phase 1 - The Mapper
  • Each Map task works (typically) on one HDFS block
  • Map tasks run (typically) on the same node where the block is stored
  • Phase 2 - Shuffle & Sort
  • sorts and collects all intermediate data from all the mappers
  • happens after all Map tasks are completed
  • Phase 3 - The Reducer
  • operates on the sorted / shuffled intermediate data (the output of the previous phase)
  • produces the final output


Example: counting words

Phase 1 - The Mapper maps the text

Phase 2 - Shuffle & Sort


Phase 3 - Reduce
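
A minimal sketch of this word-count job with the standard Hadoop MapReduce Java API (the class names are my own, and the driver class that configures the Job is omitted):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Phase 1 - the Mapper: emits (word, 1) for every word in its block
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Phase 3 - the Reducer: gets each word with all of its 1s (already shuffled and
// sorted by the framework) and sums them into the final count
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}

Phase 2 (Shuffle & Sort) needs no user code at all - the framework moves and sorts the intermediate (word, 1) pairs between these two classes.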


It is important to understand that:

  • Map tasks run in parallel - this reduces computation time.
  • Map tasks run on the machines that contain the data, so there are no network traffic issues (data locality).
  • Reduce tasks also run in parallel.




Features of MapReduce

  • Automatic data distribution and parallelism
  • Built-in fault tolerance
  • Clean abstraction for developers:
  • The developer does not need to handle the housekeeping
  • The developer concentrates on writing the Map and Reduce functions



Another example: analyzing log files
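
The structure is the same as word count; only the map logic changes. A hedged sketch, assuming an access-log format where the first field is the client IP and the last field is the number of bytes sent (both assumptions for illustration): the Mapper emits (ip, bytes), and a summing Reducer like the one above then produces the total bytes served per client.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (clientIp, bytesSent) for each log line; a summing Reducer
// (like the word-count one above) then gives total bytes per client.
public class LogBytesMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(" ");
        if (fields.length < 2) {
            return;                                   // skip malformed lines
        }
        String ip = fields[0];                        // assumption: first field is the client IP
        String bytes = fields[fields.length - 1];     // assumption: last field is the response size
        try {
            context.write(new Text(ip), new LongWritable(Long.parseLong(bytes)));
        } catch (NumberFormatException e) {
            // size field not numeric (e.g. "-") - ignore this line
        }
    }
}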




4. Inside Hadoop Cluster

A cluster is a group of machines (nodes) working together to process MapReduce jobs.

Installing a Hadoop Cluster

  • Done manually by administrators - not easy
  • Easier using CDH (Cloudera's Distribution including Apache Hadoop)
  • Easiest using Cloudera Manager


Basic cluster configuration



Adding MapReduce:


The NameNode and the JobTracker have backup nodes for fault tolerance.

References:



Nathan

Hadoop - Introduction and Motivation

Hello

1. What is Hadoop?

A software framework for storing, processing and analyzing big data:

  • distributed
  • scalable
  • fault tolerant
  • open source
Ecosystem




Who uses Hadoop




2. The motivation for Hadoop

Traditional large-scale computation used a single powerful computer (a supercomputer):

  • faster processors
  • more memory
But even this was not enough.

A better solution is a distributed system - using multiple machines for a single job.
But this also has its problems:
  • programming complexity - keeping data and processes in sync
  • finite bandwidth
  • partial failures - e.g. one computer failing should not bring the whole system down

Modern systems have much more data:
  • terabytes (1,000 gigabytes) a day
  • petabytes (1,000 terabytes) in total

The approach of a single central data store is not suitable.

The new approach:
  • distribute the data
  • run the computation where the data resides



Hadoop works as follows:
  • data is split into blocks when it is loaded
  • each Map task typically works on a single block
  • a master program manages the tasks


Core Hadoop concepts:

  • applications are written in high-level languages
  • nodes talk to each other as little as possible
  • data is distributed in advance
  • data is replicated for increased availability and reliability
  • Hadoop is scalable and fault tolerant
Scalability means:
  • adding more nodes increases capacity linearly
  • increased load results in a graceful decline in performance, not in failure
Fault tolerance:
  • node failure is inevitable
  • what happens in this case:
  • the system continues to function
  • the master re-assigns tasks to a different node
  • data is replicated - so no data is lost
  • a node that recovers rejoins the cluster automatically
  
Distributed systems must be designed with the expectation of failure.

What can we do with Hadoop?





Where does data come from?

  • science - weather data, sensors, ...
  • industry - energy, retail, ...
  • legacy - sales data, product databases
  • system data - logs, network messages, intrusion detection, web analytics

Common types of analysis with Hadoop:
  • text mining
  • index building (for quick search)
  • graph creation and analysis, e.g. the social network graph of Facebook / LinkedIn
  • pattern recognition, e.g. is this text in Hebrew, English or another language
  • collaborative filtering
  • prediction models, e.g. regression
  • sentiment analysis, e.g. how is my service perceived
  • risk assessment

What is common across Hadoop-able problems:

Nature of the data:
  • volume - big data ...
  • velocity - how fast the data is arriving
  • variety - structured, unstructured, ...
Nature of the processing:
  • batch
  • parallel execution
  • distributed data

Use case - Opower:
  • SaaS for utility companies
  • provides customers with insights about their energy usage
Data:
  • huge amounts
  • many different sources
Analysis:
  • similar-home comparison
  • heating and cooling usage
  • bill forecasting


Benefits of analysis with Hadoop:
  • previously impossible
  • lower cost
  • less time
  • greater flexibility - first store the data, then decide what processing to perform
  • near-linear scalability
  • ask bigger questions




References:
http://cloudera.com/content/cloudera/en/resources/library/training/cloudera-essentials-for-apache-hadoop-the-motivation-for-hadoop.html


Nathan

Saturday, May 2, 2015

Hadoop in a nutshell

Hi

Main features
  • cheap storage for huge amounts of data
  • processes huge amounts of data quickly
  • can store unstructured data like text, images and video
  • scalable to virtually any size by adding nodes
  • data is protected in software against hardware failure

Components included in the basic download 
  • HDFS: a Java-based distributed file system which can store all kinds of data without prior organization
  • MapReduce: a programming model for processing data in parallel
  • YARN: schedules and handles resource requests from distributed applications

Other components exist: Pig, Hive, HBase, ZooKeeper, Ambari, Flume, Sqoop, Oozie

How does data get into Hadoop:
  • you can load files into HDFS using simple Java commands
  • if you have many files, you can invoke a script that loads the files in parallel (see the sketch below)
  • .......
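
A rough sketch of that parallel-load idea in Java (the file names, target directory and thread count are placeholders, and the cluster address is assumed to come from the client configuration): a small thread pool copies several local files into HDFS at the same time through the FileSystem API.

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelHdfsLoader {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS is read from the Hadoop configuration on the classpath
        FileSystem fs = FileSystem.get(new Configuration());

        List<String> localFiles = Arrays.asList("a.log", "b.log", "c.log");  // placeholder file names
        ExecutorService pool = Executors.newFixedThreadPool(4);              // 4 uploads at a time

        for (String name : localFiles) {
            pool.submit(() -> {
                try {
                    // each task copies one local file into HDFS
                    fs.copyFromLocalFile(new Path(name), new Path("/incoming/" + name));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}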


References




Nathan