Tuesday, May 12, 2015

Hadoop - Basic Concepts

Hello

1. What is Hadoop




Hadoop core components:

  • HDFS (Hadoop Distributed File System) - stores the data on the cluster
  • MapReduce - processes the data on the cluster

2. HDFS basic concepts
  • HDFS is a file system written in Java
  • Sits on top of the native file system (HDFS blocks are stored as regular files on each node's local file system)
  • Provides storage for massive amounts of data:
  • scalable
  • fault tolerant
  • supports efficient processing with MapReduce


How files are stored

  • Data files are split into blocks and distributed to the DataNodes
  • Each block is replicated on multiple nodes (3 by default)
  • The NameNode stores the metadata


Getting data in / out of HDFS



Example of storing / retrieving a file:
We have two log files: 031512.log, which is split into 3 blocks, and 041213.log, which is split into 2 blocks.
Each block is replicated on nodes in the cluster. The mapping of files to blocks and of blocks to nodes is stored on the NameNode.
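
To make this concrete, here is a minimal sketch (mine, not from the course) of putting a file into HDFS, asking the NameNode for its block-to-node mapping, and copying it back out, using the standard org.apache.hadoop.fs.FileSystem API. The NameNode address and the paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutGetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);

        // put: the file is split into blocks, replicated, and distributed to DataNodes
        fs.copyFromLocalFile(new Path("031512.log"), new Path("/logs/031512.log"));

        // ask the NameNode which blocks make up the file and which nodes hold them
        FileStatus status = fs.getFileStatus(new Path("/logs/031512.log"));
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }

        // get: copy the file back out of HDFS to the local file system
        fs.copyToLocalFile(new Path("/logs/031512.log"), new Path("031512.copy.log"));
    }
}

Running this only needs the hadoop-client libraries on the classpath; the same put / get operations are also available from the command line.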




Important notes about HDFS:

  • Size:
  • HDFS performs best with millions of files rather than billions (the NameNode keeps all file metadata in memory, so huge numbers of small files overwhelm it)
  • Each file is typically 100 MB or more (HDFS blocks are large, and many small files add metadata and seek overhead)
  • Files in HDFS are write-once: the content cannot be changed, but a file can be replaced
  • HDFS is optimized for large streaming reads of files rather than random-access reads


3. How MapReduce works

MapReduce has 3 main phases:

  • Phase 1 - The Mapper
  • Each Map task works (typically) on one HDFS block
  • Map tasks run (typically) on the same node where the block is stored
  • Phase 2 - Shuffle & Sort
  • sorts and collects all intermediate data from all the mappers
  • happens after all Map tasks are completed
  • Phase 3 - The Reducer
  • operates on the sorted / shuffled intermediate data (the output of the previous phase)
  • produces the final output


Example: counting words

Phase 1 - The Mapper maps the text

Phase 2 - Shuffle & Sort


Phase 3 - Reduce
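
A minimal sketch of this word-count job with the standard Hadoop MapReduce Java API (the class names are my own, and the driver class that configures the Job is omitted):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Phase 1 - the Mapper: emits (word, 1) for every word in its block
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Phase 3 - the Reducer: gets each word with all of its 1s (already shuffled and
// sorted by the framework) and sums them into the final count
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}

Phase 2 (Shuffle & Sort) needs no user code at all - the framework moves and sorts the intermediate (word, 1) pairs between these two classes.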


It is important to understand that:

  • Map tasks run in parallel - this reduces computation time.
  • Map tasks run on the machines that contain the data, so there are no network traffic issues (data locality).
  • Reduce tasks also run in parallel.




Features of MapReduce

  • Automatic data distribution and parallelism
  • Built-in fault tolerance
  • Clean abstraction for developers:
  • The developer does not need to handle the housekeeping
  • The developer concentrates on writing the Map and Reduce functions



Another example: analyzing log files
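
The structure is the same as word count; only the map logic changes. A hedged sketch, assuming an access-log format where the first field is the client IP and the last field is the number of bytes sent (both assumptions for illustration): the Mapper emits (ip, bytes), and a summing Reducer like the one above then produces the total bytes served per client.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (clientIp, bytesSent) for each log line; a summing Reducer
// (like the word-count one above) then gives total bytes per client.
public class LogBytesMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(" ");
        if (fields.length < 2) {
            return;                                   // skip malformed lines
        }
        String ip = fields[0];                        // assumption: first field is the client IP
        String bytes = fields[fields.length - 1];     // assumption: last field is the response size
        try {
            context.write(new Text(ip), new LongWritable(Long.parseLong(bytes)));
        } catch (NumberFormatException e) {
            // size field not numeric (e.g. "-") - ignore this line
        }
    }
}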




4. Inside Hadoop Cluster

A cluster is a group of machines (nodes) working together to process MapReduce jobs.

Installing a Hadoop Cluster

  • Done manually by administrators - not easy
  • Easier using CDH (Cloudera's Distribution including Apache Hadoop)
  • Easiest using Cloudera Manager


Basic cluster configuration



Adding MapReduce:


The NameNode and the JobTracker have backup nodes for fault tolerance.

References:



Nathan

Hadoop - Introduction and Motivation

Hello

1. What is Hadoop?

A software framework for storing, processing and analyzing big data:

  • distributed
  • scalable
  • fault tolerant
  • open source
Ecosystem




Who uses Hadoop




2. The motivation for Hadoop

Traditional large-scale computation used a single powerful computer (a supercomputer):

  • faster processors
  • more memory
But even this was not enough.

A better solution is a distributed system - using multiple machines for a single job.
But this also has its problems:
  • programming complexity - keeping data and processes in sync
  • finite bandwidth
  • partial failures - e.g. one computer failing should not bring the whole system down

Modern systems have much more data:
  • terabytes (1,000 gigabytes) a day
  • petabytes (1,000 terabytes) in total

The approach of a single central data store is not suitable.

The new approach:
  • distribute the data
  • run the computation where the data resides



Hadoop works as follows:
  • data is split into blocks when it is loaded
  • each Map task typically works on a single block
  • a master program manages the tasks


Core Hadoop concepts:

  • applications are written in high-level languages
  • nodes talk to each other as little as possible
  • data is distributed in advance
  • data is replicated for increased availability and reliability
  • Hadoop is scalable and fault tolerant
Scalability means:
  • adding more nodes increases capacity linearly
  • increased load results in a graceful decline in performance, not in failure
Fault tolerance:
  • node failure is inevitable
  • what happens in this case:
  • the system continues to function
  • the master re-assigns tasks to a different node
  • data is replicated - so no data is lost
  • a node that recovers rejoins the cluster automatically
  
Distributed systems must be designed with the expectation of failure.

What can we do with Hadoop?





Where does data come from?

  • science - weather data, sensors, ...
  • industry - energy, retail, ...
  • legacy - sales data, product databases
  • system data - logs, network messages, intrusion detection, web analytics

Common types of analysis with Hadoop:
  • text mining
  • index building (for quick search)
  • graph creation and analysis, e.g. the social network graph of Facebook / LinkedIn
  • pattern recognition, e.g. is this text in Hebrew, English or another language
  • collaborative filtering
  • prediction models, e.g. regression
  • sentiment analysis, e.g. how is my service perceived
  • risk assessment

What is common across Hadoop-able problems:

Nature of the data:
  • volume - big data ...
  • velocity - how fast the data is arriving
  • variety - structured, unstructured, ...
Nature of the processing:
  • batch
  • parallel execution
  • distributed data

Use case - Opower:
  • SaaS for utility companies
  • provides customers with insights about their energy usage
Data:
  • huge amounts
  • many different sources
Analysis:
  • similar-home comparison
  • heating and cooling usage
  • bill forecasting


Benefits of analysis with Hadoop:
  • previously impossible
  • lower cost
  • less time
  • greater flexibility - first store the data, then decide what processing to perform
  • near-linear scalability
  • ask bigger questions




References:
http://cloudera.com/content/cloudera/en/resources/library/training/cloudera-essentials-for-apache-hadoop-the-motivation-for-hadoop.html


Nathan

Saturday, May 2, 2015

Hadoop in a nutshell

Hi

Main features
  • cheap storage for huge amounts of data
  • processes huge amounts of data quickly
  • can store unstructured data like text, images and video
  • scalable to virtually any size by adding nodes
  • data is protected in software against hardware failure

Components included in the basic download 
  • HDFS: a Java-based distributed file system which can store all kinds of data without prior organization
  • MapReduce: a programming model for processing data in parallel
  • YARN: schedules and handles resource requests from distributed applications

Other components exist: Pig, Hive, HBase, ZooKeeper, Ambari, Flume, Sqoop, Oozie

How does data get into Hadoop:
  • you can load files into HDFS using simple Java commands
  • if you have many files, you can invoke a script that loads the files in parallel (see the sketch below)
  • .......
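
A rough sketch of that parallel-load idea in Java (the file names, target directory and thread count are placeholders, and the cluster address is assumed to come from the client configuration): a small thread pool copies several local files into HDFS at the same time through the FileSystem API.

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelHdfsLoader {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS is read from the Hadoop configuration on the classpath
        FileSystem fs = FileSystem.get(new Configuration());

        List<String> localFiles = Arrays.asList("a.log", "b.log", "c.log");  // placeholder file names
        ExecutorService pool = Executors.newFixedThreadPool(4);              // 4 uploads at a time

        for (String name : localFiles) {
            pool.submit(() -> {
                try {
                    // each task copies one local file into HDFS
                    fs.copyFromLocalFile(new Path(name), new Path("/incoming/" + name));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}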


References




Nathan