Tuesday, August 25, 2015

Experimenting with Hadoop

Hi

Looking for a fast setup (~1 hour) so I can play with the Hadoop ecosystem.

1. Hortonworks sandbox + Azure 

http://hortonworks.com/blog/hortonworks-sandbox-azure/
Pros
  • you can start playing with it after about an hour
  • Machine in the cloud - saves local CPU / disk / RAM
  • Free for a month (Azure)
  • UI - web browser

Cons
  • Limited to a month (Azure)
  • Not the full Hadoop ecosystem, e.g. Spark is missing and only Hive, Pig, HCatalog and Oozie are accessible via Hue

Remarks:
  • you need an account to log in to Azure



2. Hortonworks sandbox

http://hortonworks.com/products/hortonworks-sandbox/#install
Pros
  • you can start playing with it after a few hours
  • free
  • seems to include the major part of the Hadoop ecosystem

Cons
  • uses the local PC, which eats disk / CPU / RAM
  • requires a PC with 4-8 GB RAM
  • requires a virtual machine, which has to be installed
  • UI - shell

Remarks:
  • one node out of the box


3. Cloudera Live Read-Only Demo
http://blog.cloudera.com/blog/2014/04/cloudera-live/
use http://demo.gethue.com/ after registration

Pros
  • free
  • uses a cluster with 4 nodes (machines)
  • you can start playing with it immediately
  • Machine in the cloud - saves local CPU / disk / RAM
  • seems to include the major part of the Hadoop ecosystem


Cons
  • samples are not working
  • limited to 3 hours per session



4. Cloudera Deploy on AWS
http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-live/aws-documentation.html

Pros
  • you can start playing with it immediately
  • Machine in the cloud - saves local CPU / disk / RAM
  • cluster with 4 nodes (machines)
  • seems to include the major part of the Hadoop ecosystem
  • Has a few components:
    • Tutorial
    • Cloudera Manager
    • Cloudera Navigator
    • Hue UI


Cons
  • costs about $750 per month

Remarks:
  • you need an account to log in to AWS
  • if you are a newcomer to AWS and have the AWS free tier (valid for 12 months), then every month you get for free:
    • EBS I/Os - 2,000,000
    • EBS volumes - 30 GB
    • EC2 Linux - 750 hours
    • KMS requests - 20,000
  but you will nonetheless pay $0.312 per On-Demand RHEL m4.xlarge instance hour, for every hour the 4 instances are running. For a month this comes to 4 * 24 * 30 * $0.312 ≈ $900, which is more than the $750 claimed by Cloudera.



5. Cloudera Deploy on VM
http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-4-x.html


Remarks:

• looks like "2. Hortonworks sandbox"
• requires the host OS to be 64-bit

6. HDInsight

Pros
• you can start playing with it immediately
• Machine in the cloud - saves local CPU / disk / RAM
• seems to include the major part of the Hadoop ecosystem
• has 1 month of free usage


Cons

• cost ??? (pay as you go)



Tutorials


Nathan

Tuesday, May 12, 2015

Hadoop - Basic Concepts

Hello

1. What is Hadoop?




Hadoop core components:

• HDFS (Hadoop Distributed File System) - stores the data on the cluster
• MapReduce - processes the data on the cluster

2. HDFS basic concepts
• HDFS is a file system written in Java
• Sits on top of the native file system (blocks are stored as ordinary files on each node's local file system)
• storage of massive amounts of data:
  • scalable
  • fault tolerant
  • supports efficient processing with MapReduce


How files are stored

• Data files are split into blocks and distributed to the DataNodes
• Each block is replicated on multiple nodes (3 by default)
• The NameNode stores the metadata


Getting data in / out of HDFS



Example of storing / retrieving a file:
We have two log files: 031512.log, which is split into 3 blocks, and 041213.log, which is split into 2 blocks.
Each block is replicated on nodes in the cluster. The mapping of files to blocks and of blocks to nodes is stored on the NameNode.
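As a rough sketch of how this mapping can be inspected from code, the HDFS Java API lets a client ask the NameNode for a file's block locations (the path /logs/031512.log below is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        // Uses the cluster settings from core-site.xml / hdfs-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path - replace with a file that exists in your cluster
        Path file = new Path("/logs/031512.log");
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers with the block -> DataNode mapping of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}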




Important notes about HDFS:

• Size
  • HDFS performs best with millions of files rather than billions (the NameNode keeps all file and block metadata in memory, so huge numbers of files strain it)
  • Each file is typically 100 MB or more (many small files waste blocks and NameNode memory)
• Files in HDFS are write-once: the content cannot be changed, but a file can be replaced
• HDFS is optimized for large streaming reads of files rather than random-access reads


3. How MapReduce works

MapReduce has 3 main phases:

• Phase 1 - The Mapper
  • Each task works (typically) on one HDFS block
  • A Map task runs (typically) on the same node where the block is stored
• Phase 2 - Shuffle & Sort
  • sorts and collects all intermediate data from all mappers
  • happens after all Map tasks are completed
• Phase 3 - The Reducer
  • operates on the sorted / shuffled intermediate data (the previous phase's output)
  • produces the final output


Example: counting words

Phase 1 - The mapper maps the text

Phase 2 - Shuffle & Sort


Phase 3 - Reduce
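To make the three phases concrete, here is a minimal Java word-count job, close to the standard Hadoop example. Note that the developer writes only the Mapper and the Reducer; the Shuffle & Sort phase is done by the framework between them.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Phase 1 - the Mapper: emits (word, 1) for every word in its input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Phase 3 - the Reducer: gets each word with all its 1s (already grouped
    // and sorted by the framework's Shuffle & Sort phase) and sums them
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running it with "hadoop jar wordcount.jar WordCount <input dir> <output dir>" writes one (word, count) pair per line into the output directory.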


It is important to understand that:

• Map tasks run in parallel - this reduces computation time.
• Map tasks run on the machines that contain the data, so there are no network traffic issues.
• Reduce tasks also run in parallel.




Features of MapReduce

• Automatic data distribution and parallelism
• Built-in fault tolerance
• Clean abstraction for developers:
  • the developer does not need to handle the housekeeping
  • the developer concentrates on writing Map functions and Reduce functions



Another example: analyzing log files
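The original post illustrated this with a diagram. As a hypothetical sketch, assuming each log line starts with the client IP address (e.g. "10.0.0.1 GET /index.html 200"), only the Mapper differs from word count - counting requests per IP can reuse the same summing Reducer:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical Mapper counting requests per client IP, assuming the IP is the
// first token of each log line; the IntSumReducer from the word-count example
// then sums the 1s per IP.
public class IpCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text ip = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        if (line.isEmpty()) {
            return;                        // skip blank lines
        }
        ip.set(line.split("\\s+")[0]);     // first token = client IP
        context.write(ip, ONE);            // emit (ip, 1)
    }
}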




4. Inside a Hadoop Cluster

A cluster is a group of machines (nodes) working together to process MapReduce jobs.

Installing a Hadoop cluster

• Done by administrators - not easy
• Done more easily using CDH
• Done even more easily using Cloudera Manager


Basic cluster configuration



Adding MapReduce:


The NameNode and the JobTracker have backups for fault tolerance.

References:



Nathan

Hadoop - Introduction and Motivation

Hello

1. What is Hadoop?

A software framework for storing, processing, and analyzing big data:

• distributed
• scalable
• fault tolerant
• open source

Ecosystem




Who uses Hadoop?




2. The motivation for Hadoop

Traditional large-scale computation used a strong computer (a supercomputer):

• faster processors
• more memory
but even this was not enough.

A better solution is a distributed system - use multiple machines for a single job.
But this also has its problems:
• programming complexity - keeping data and processes in sync
• finite bandwidth
• partial failures - e.g. one computer failing should not bring the whole system down

Modern systems have much more data:
• terabytes (1000 gigabytes) a day
• petabytes (1000 terabytes) total

The approach of a central data store is not suitable.

The new approach:
• distribute the data
• run the computation where the data resides



Hadoop works as follows:
• data is split into blocks when loaded
• each map task typically works on a single block
• a master program manages the tasks


Core Hadoop concepts:

• applications are written in high-level languages
• nodes talk to each other as little as possible
• data is distributed in advance
• data is replicated for increased availability and reliability
• Hadoop is scalable and fault tolerant

Scalability means:
• adding more nodes grows capacity linearly
• increased load results in a graceful decline in performance, not in failure
Fault tolerance means:
• node failure is inevitable
• what happens in this case:
  • the system continues to function
  • the master re-assigns tasks to a different node
  • data is replicated - so no data is lost
  • a node which recovers rejoins the cluster automatically
                
A distributed system must be designed with the expectation of failure.

What can we do with Hadoop?





Where does data come from?

• science - weather data, sensors, ...
• industry - energy, retail, ...
• legacy - sales data, product databases
• system data - logs, network messages, intrusion detection, web analytics

Common types of analysis with Hadoop:
• text mining
• index building (for quick search)
• graph creation and analysis, e.g. the social network graph of Facebook / LinkedIn
• pattern recognition, e.g. is this text in Hebrew, English, or another language
• collaborative filtering
• prediction models, e.g. regression
• sentiment analysis, e.g. how is my service perceived
• risk assessment

What is common across Hadoop-able problems:

Nature of the data:
• volume - big data ...
• velocity - how fast the data is coming in
• variety - structured, unstructured, ...
Nature of the processing:
• batch
• parallel execution
• distributed data

Use case - Opower:
• SaaS for utility companies
• provides insights to customers about their energy usage
Data:
• huge amounts
• many different sources
Analysis:
• comparison to similar homes
• heating and cooling usage
• bill forecasting


Benefits of analysis with Hadoop:
• previously impossible analyses
• lower cost
• less time
• greater flexibility - first store the data, then decide what processing to perform
• near-linear scalability
• ask bigger questions




References:
http://cloudera.com/content/cloudera/en/resources/library/training/cloudera-essentials-for-apache-hadoop-the-motivation-for-hadoop.html


Nathan

Saturday, May 2, 2015

Hadoop in a nutshell

Hi

Main features:
• cheap storage of huge amounts of data
• processes huge amounts of data quickly
• can store unstructured data like text, images and video
• scalable almost without limit (just add nodes)
• data is protected in software against hardware failure

Components included in the basic download:
• HDFS: a Java-based distributed file system that can store all kinds of data without prior organization
• MapReduce: a programming model for parallel computing
• YARN: schedules and handles resource requests from distributed applications

Other components exist: Pig, Hive, HBase, ZooKeeper, Ambari, Flume, Sqoop, Oozie

How does data get into Hadoop?
• you can load files into HDFS using simple Java commands (see the sketch below)
• if you have many files, you can invoke a script that loads them in parallel
• .......
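A minimal sketch of such a load, assuming hypothetical local and HDFS paths; the equivalent shell command is hadoop fs -put <local file> <hdfs dir>:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadIntoHdfs {
    public static void main(String[] args) throws Exception {
        // Uses the cluster settings from core-site.xml / hdfs-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths - adjust to your own local file and HDFS directory
        Path localFile = new Path("/tmp/031512.log");
        Path hdfsDir = new Path("/user/nathan/logs/");

        // Copies the local file into HDFS; the file is split into blocks
        // and replicated across DataNodes automatically
        fs.copyFromLocalFile(localFile, hdfsDir);
        fs.close();
    }
}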


References



Nathan

Wednesday, April 29, 2015

Become a Data Scientist

Hi

You will need the following:
• Education: 88% have at least an M.Sc., 46% have a Ph.D.
• Math: machine learning, statistics, algebra
• Software engineering: SQL, programming languages, Hadoop
• Non-technical skills: intellectual curiosity, communication skills, strong business acumen


Resources:

Links



Nathan