Tuesday, May 12, 2015

Hadoop - Introduction and Motivation

Hello

1. What is Hadoop?

A software framework for storing, processing, and analyzing big data:

  • distributed
  • scalable
  • fault tolerant
  • open source
Ecosystem

Who uses Hadoop?

2. The motivation for Hadoop

Traditional large-scale computation used a single powerful computer (a supercomputer):

  • faster processors
  • more memory
But even this was not enough.

A better solution is a distributed system: use multiple machines for a single job.
But this also has its problems:
  • programming complexity - keeping data and processes in sync
  • finite bandwidth
  • partial failures - the failure of one computer should not bring the whole system down

Modern systems have much more data:
  • terabytes (1,000 gigabytes) a day
  • petabytes (1,000 terabytes) total

The approach of a central data store is not suitable.

The new approach:
  • distribute the data
  • run computation where the data resides



Hadoop works as follows:
  • data is split into blocks when loaded
  • map tasks typically work on a single block
  • a master program manages tasks
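The flow above can be sketched in plain Python (a toy simulation, not the Hadoop API): the input is split into blocks, each map task counts words in its own block, and a reduce step merges the results. The word-count job and the tiny block size are illustrative assumptions.

```python
from collections import defaultdict

def split_into_blocks(text, block_size):
    """Split the input into fixed-size blocks, as HDFS does on load.
    (Real HDFS blocks are tens of megabytes; toy sizes here.)"""
    return [text[i:i + block_size] for i in range(0, len(text), block_size)]

def map_task(block):
    """A map task works on a single block: emit (word, 1) pairs."""
    return [(word, 1) for word in block.split()]

def reduce_phase(pairs):
    """Merge the map outputs: sum the count for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# The "master" would schedule one map task per block, then run the reduce.
blocks = split_into_blocks("to be or not to be", block_size=9)
pairs = [pair for block in blocks for pair in map_task(block)]
print(reduce_phase(pairs))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Note that the block size here happens to split the input on a space; real MapReduce uses a record reader to handle records that span block boundaries.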


Core Hadoop concepts:

  • applications are written in high-level languages
  • nodes talk to each other as little as possible
  • data is distributed in advance
  • data is replicated for increased availability and reliability
  • Hadoop is scalable and fault tolerant
Scalability means
  • adding nodes increases capacity linearly
  • increased load causes a graceful decline in performance, not failure
Fault tolerance
  • node failure is inevitable
  • what happens in this case:
  • the system continues to function
  • the master re-assigns tasks to a different node
  • data is replicated, so no data is lost
  • nodes that recover rejoin the cluster automatically
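A toy sketch of the re-assignment idea: each block is replicated on several nodes, and when a node fails the master simply schedules the task on another node that holds a replica. The node names and the `schedule` helper are hypothetical, for illustration only.

```python
def schedule(block, replicas, failed_nodes):
    """Toy master: assign the task for `block` to the first live node
    holding a replica; failed nodes are skipped, so work is re-assigned."""
    for node in replicas[block]:
        if node not in failed_nodes:
            return node
    raise RuntimeError(f"all replicas of {block} are on failed nodes")

# Block b1 is replicated on three nodes (HDFS's default replication factor is 3).
replicas = {"b1": ["node1", "node2", "node3"]}

print(schedule("b1", replicas, failed_nodes=set()))      # node1
print(schedule("b1", replicas, failed_nodes={"node1"}))  # node2 takes over
```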
  
Distributed systems must be designed with the expectation of failure.

What can we do with Hadoop?

Where does data come from?

  • science - weather data, sensors, ...
  • industry - energy, retail, ...
  • legacy - sales data, product databases
  • system data - logs, network messages, intrusion detection, web analytics

Common types of analysis with Hadoop:
  • text mining
  • index building (for quick search)
  • graph creation and analysis, e.g. a social network graph for Facebook / LinkedIn
  • pattern recognition, e.g. is this text in Hebrew, English, or another language?
  • collaborative filtering
  • prediction models, e.g. regression
  • sentiment analysis, e.g. how is my service perceived?
  • risk assessment

What is common across Hadoop-able problems?

Nature of the data:
  • volume - big data ...
  • velocity - how fast the data arrives
  • variety - structured, unstructured, ...
Nature of the processing:
  • batch
  • parallel execution
  • distributed data

Use case: Opower
  • SaaS for utility companies
  • provides customers insight into their energy usage
Data:
  • huge amounts
  • many different sources
Analysis:
  • similar-home comparison
  • heating and cooling usage
  • bill forecasting


Benefits of analysis with Hadoop:
  • previously impossible analyses
  • lower cost
  • less time
  • greater flexibility - first store the data, then decide what processing to perform
  • near-linear scalability
  • ask bigger questions




References :
http://cloudera.com/content/cloudera/en/resources/library/training/cloudera-essentials-for-apache-hadoop-the-motivation-for-hadoop.html


Nathan
