Tuesday, May 12, 2015

Hadoop - Introduction and Motivation

Hello

1. What is Hadoop?

A software framework for storing, processing, and analyzing big data:

  • distributed
  • scalable
  • fault tolerant
  • open source
Ecosystem

Who uses Hadoop?

2. The motivation for Hadoop

Traditional large-scale computation used a single powerful computer (a supercomputer):

  • faster processors
  • more memory
But even this was not enough.

A better solution is a distributed system: use multiple machines for a single job.
But this also has its problems:
  • programming complexity - keeping data and processes in sync
  • finite bandwidth
  • partial failures - the failure of one computer should not bring the whole system down

Modern systems have much more data:
  • terabytes (1,000 gigabytes) a day
  • petabytes (1,000 terabytes) total

The approach of a central data store is not suitable.

The new approach:
  • distribute the data
  • run computation where the data resides



Hadoop works as follows:
  • data is split into blocks when loaded
  • map tasks typically work on a single block
  • a master program manages tasks
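The flow above can be sketched in plain Python (a toy simulation, not the Hadoop API): the input is split into blocks, each map task counts words in its own block, and a reduce step merges the results. The word-count job and the tiny block size are illustrative assumptions.

```python
from collections import defaultdict

def split_into_blocks(text, block_size):
    """Split the input into fixed-size blocks, as HDFS does on load.
    (Real HDFS blocks are tens of megabytes; toy sizes here.)"""
    return [text[i:i + block_size] for i in range(0, len(text), block_size)]

def map_task(block):
    """A map task works on a single block: emit (word, 1) pairs."""
    return [(word, 1) for word in block.split()]

def reduce_phase(pairs):
    """Merge the map outputs: sum the count for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# The "master" would schedule one map task per block, then run the reduce.
blocks = split_into_blocks("to be or not to be", block_size=9)
pairs = [pair for block in blocks for pair in map_task(block)]
print(reduce_phase(pairs))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Note that the block size here happens to split the input on a space; real MapReduce uses a record reader to handle records that span block boundaries.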


Core Hadoop concepts:

  • applications are written in high-level languages
  • nodes talk to each other as little as possible
  • data is distributed in advance
  • data is replicated for increased availability and reliability
  • Hadoop is scalable and fault tolerant
Scalability means
  • adding nodes increases capacity linearly
  • increased load causes a graceful decline in performance, not failure
Fault tolerance
  • node failure is inevitable
  • what happens in this case:
  • the system continues to function
  • the master re-assigns tasks to a different node
  • data is replicated, so no data is lost
  • nodes that recover rejoin the cluster automatically
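A toy sketch of the re-assignment idea: each block is replicated on several nodes, and when a node fails the master simply schedules the task on another node that holds a replica. The node names and the `schedule` helper are hypothetical, for illustration only.

```python
def schedule(block, replicas, failed_nodes):
    """Toy master: assign the task for `block` to the first live node
    holding a replica; failed nodes are skipped, so work is re-assigned."""
    for node in replicas[block]:
        if node not in failed_nodes:
            return node
    raise RuntimeError(f"all replicas of {block} are on failed nodes")

# Block b1 is replicated on three nodes (HDFS's default replication factor is 3).
replicas = {"b1": ["node1", "node2", "node3"]}

print(schedule("b1", replicas, failed_nodes=set()))      # node1
print(schedule("b1", replicas, failed_nodes={"node1"}))  # node2 takes over
```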
  
Distributed systems must be designed with the expectation of failure.

What can we do with Hadoop?

Where does data come from?

  • science - weather data, sensors, ...
  • industry - energy, retail, ...
  • legacy - sales data, product databases
  • system data - logs, network messages, intrusion detection, web analytics

Common types of analysis with Hadoop:
  • text mining
  • index building (for quick search)
  • graph creation and analysis, e.g. a social network graph for Facebook / LinkedIn
  • pattern recognition, e.g. is this text in Hebrew, English, or another language?
  • collaborative filtering
  • prediction models, e.g. regression
  • sentiment analysis, e.g. how is my service perceived?
  • risk assessment

What is common across Hadoop-able problems?

Nature of the data:
  • volume - big data ...
  • velocity - how fast the data arrives
  • variety - structured, unstructured, ...
Nature of the processing:
  • batch
  • parallel execution
  • distributed data

Use case: Opower
  • SaaS for utility companies
  • provides customers insight into their energy usage
Data:
  • huge amounts
  • many different sources
Analysis:
  • similar-home comparison
  • heating and cooling usage
  • bill forecasting


Benefits of analysis with Hadoop:
  • previously impossible analyses
  • lower cost
  • less time
  • greater flexibility - first store the data, then decide what processing to perform
  • near-linear scalability
  • ask bigger questions




References :
http://cloudera.com/content/cloudera/en/resources/library/training/cloudera-essentials-for-apache-hadoop-the-motivation-for-hadoop.html


Nathan
