Tuesday, August 25, 2015

Experimenting with Hadoop

Hi

Looking for a fast setup (~1 hour) so I can play with the Hadoop ecosystem.

1. Hortonworks sandbox + Azure 

http://hortonworks.com/blog/hortonworks-sandbox-azure/
Pros
  • you can start playing with it after about an hour
  • Machine in the cloud - saves local CPU / disk / RAM
  • Free for a month (Azure)
  • UI - web browser

Cons
  • Limited to a month (Azure)
  • Not the full Hadoop ecosystem, e.g. Spark is missing and only Hive, Pig, HCatalog and Oozie are accessible via Hue

Remarks:
  • you need an account to log in to Azure



2. Hortonworks sandbox

http://hortonworks.com/products/hortonworks-sandbox/#install
Pros
  • you can start playing with it after a few hours
  • free
  • seems to include the major part of the Hadoop ecosystem

Cons
  • uses the local PC, which eats disk / CPU / RAM
  • requires a PC with 4-8 GB RAM
  • requires a virtual machine, which has to be installed
  • UI - shell

Remarks:
  • one node out of the box


3. Cloudera Live Read-Only Demo
http://blog.cloudera.com/blog/2014/04/cloudera-live/
use http://demo.gethue.com/ after registration

Pros
  • free
  • uses a cluster with 4 nodes (machines)
  • you can start playing with it immediately
  • Machine in the cloud - saves local CPU / disk / RAM
  • seems to include the major part of the Hadoop ecosystem


Cons
  • samples are not working
  • limited to 3 hours per session



4. Cloudera Deploy on AWS
http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-live/aws-documentation.html

Pros
  • you can start playing with it immediately
  • Machine in the cloud - saves local CPU / disk / RAM
  • cluster with 4 nodes (machines)
  • seems to include the major part of the Hadoop ecosystem
  • Has a few components:
    • Tutorial
    • Cloudera Manager
    • Cloudera Navigator
    • Hue UI


Cons
  • costs about $750 per month

Remarks:
  • you need an account to log in to AWS
  • if you are a newcomer to AWS and have the AWS free tier (valid for 12 months), then every month you get for free:
    • EBS I/Os - 2,000,000
    • EBS volumes - 30 GB
    • EC2 Linux - 750 hours
    • KMS requests - 20,000
  but you will nonetheless pay $0.312 per On-Demand RHEL m4.xlarge instance hour, for every hour the 4 instances are running. For a month this comes to 4 * 24 * 30 * $0.312 ≈ $900, which is more than the $750 claimed by Cloudera.



5. Cloudera Deploy on VM
http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-4-x.html


Remarks:

• looks like "2. Hortonworks sandbox"
• requires the host OS to be 64-bit

6. HDInsight

Pros
• you can start playing with it immediately
• Machine in the cloud - saves local CPU / disk / RAM
• seems to include the major part of the Hadoop ecosystem
• has 1 month of free usage


Cons

• cost ??? (pay as you go)



Tutorials


Nathan

Tuesday, May 12, 2015

Hadoop - Basic Concepts

Hello

1. What is Hadoop?




Hadoop core components:

• HDFS (Hadoop Distributed File System) - stores the data on the cluster
• MapReduce - processes the data on the cluster

2. HDFS basic concepts
• HDFS is a file system written in Java
• Sits on top of the native file system (blocks are stored as ordinary files on each node's local file system)
• storage of massive amounts of data:
  • scalable
  • fault tolerant
  • supports efficient processing with MapReduce


How files are stored

• Data files are split into blocks and distributed to the DataNodes
• Each block is replicated on multiple nodes (3 by default)
• The NameNode stores the metadata


Getting data in / out of HDFS



Example of storing / retrieving a file:
We have two log files: 031512.log, which is split into 3 blocks, and 041213.log, which is split into 2 blocks.
Each block is replicated on nodes in the cluster. The mapping of files to blocks and of blocks to nodes is stored on the NameNode.
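As a rough sketch of how this mapping can be inspected from code, the HDFS Java API lets a client ask the NameNode for a file's block locations (the path /logs/031512.log below is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        // Uses the cluster settings from core-site.xml / hdfs-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path - replace with a file that exists in your cluster
        Path file = new Path("/logs/031512.log");
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers with the block -> DataNode mapping of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}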




Important notes about HDFS:

• Size
  • HDFS performs best with millions of files rather than billions (the NameNode keeps all file and block metadata in memory, so huge numbers of files strain it)
  • Each file is typically 100 MB or more (many small files waste blocks and NameNode memory)
• Files in HDFS are write-once: the content cannot be changed, but a file can be replaced
• HDFS is optimized for large streaming reads of files rather than random-access reads


3. How MapReduce works

MapReduce has 3 main phases:

• Phase 1 - The Mapper
  • Each task works (typically) on one HDFS block
  • A Map task runs (typically) on the same node where the block is stored
• Phase 2 - Shuffle & Sort
  • sorts and collects all intermediate data from all mappers
  • happens after all Map tasks are completed
• Phase 3 - The Reducer
  • operates on the sorted / shuffled intermediate data (the previous phase's output)
  • produces the final output


Example: counting words

Phase 1 - The mapper maps the text

Phase 2 - Shuffle & Sort


Phase 3 - Reduce
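To make the three phases concrete, here is a minimal Java word-count job, close to the standard Hadoop example. Note that the developer writes only the Mapper and the Reducer; the Shuffle & Sort phase is done by the framework between them.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Phase 1 - the Mapper: emits (word, 1) for every word in its input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Phase 3 - the Reducer: gets each word with all its 1s (already grouped
    // and sorted by the framework's Shuffle & Sort phase) and sums them
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running it with "hadoop jar wordcount.jar WordCount <input dir> <output dir>" writes one (word, count) pair per line into the output directory.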


It is important to understand that:

• Map tasks run in parallel - this reduces computation time.
• Map tasks run on the machines that contain the data, so there are no network traffic issues.
• Reduce tasks also run in parallel.




Features of MapReduce

• Automatic data distribution and parallelism
• Built-in fault tolerance
• Clean abstraction for developers:
  • the developer does not need to handle the housekeeping
  • the developer concentrates on writing Map functions and Reduce functions



Another example: analyzing log files
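The original post illustrated this with a diagram. As a hypothetical sketch, assuming each log line starts with the client IP address (e.g. "10.0.0.1 GET /index.html 200"), only the Mapper differs from word count - counting requests per IP can reuse the same summing Reducer:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical Mapper counting requests per client IP, assuming the IP is the
// first token of each log line; the IntSumReducer from the word-count example
// then sums the 1s per IP.
public class IpCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text ip = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        if (line.isEmpty()) {
            return;                        // skip blank lines
        }
        ip.set(line.split("\\s+")[0]);     // first token = client IP
        context.write(ip, ONE);            // emit (ip, 1)
    }
}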




4. Inside a Hadoop Cluster

A cluster is a group of machines (nodes) working together to process MapReduce jobs.

Installing a Hadoop cluster

• Done by administrators - not easy
• Done more easily using CDH
• Done even more easily using Cloudera Manager


Basic cluster configuration



Adding MapReduce:


The NameNode and the JobTracker have backups for fault tolerance.

References:



Nathan

Hadoop - Introduction and Motivation

Hello

1. What is Hadoop?

A software framework for storing, processing, and analyzing big data:

• distributed
• scalable
• fault tolerant
• open source

Ecosystem




Who uses Hadoop?




2. The motivation for Hadoop

Traditional large-scale computation used a strong computer (a supercomputer):

• faster processors
• more memory
but even this was not enough.

A better solution is a distributed system - use multiple machines for a single job.
But this also has its problems:
• programming complexity - keeping data and processes in sync
• finite bandwidth
• partial failures - e.g. one computer failing should not bring the whole system down

Modern systems have much more data:
• terabytes (1000 gigabytes) a day
• petabytes (1000 terabytes) total

The approach of a central data store is not suitable.

The new approach:
• distribute the data
• run the computation where the data resides



Hadoop works as follows:
• data is split into blocks when loaded
• each map task typically works on a single block
• a master program manages the tasks


Core Hadoop concepts:

• applications are written in high-level languages
• nodes talk to each other as little as possible
• data is distributed in advance
• data is replicated for increased availability and reliability
• Hadoop is scalable and fault tolerant

Scalability means:
• adding more nodes grows capacity linearly
• increased load results in a graceful decline in performance, not in failure
Fault tolerance means:
• node failure is inevitable
• what happens in this case:
  • the system continues to function
  • the master re-assigns tasks to a different node
  • data is replicated - so no data is lost
  • a node which recovers rejoins the cluster automatically
                
A distributed system must be designed with the expectation of failure.

What can we do with Hadoop?





Where does data come from?

• science - weather data, sensors, ...
• industry - energy, retail, ...
• legacy - sales data, product databases
• system data - logs, network messages, intrusion detection, web analytics

Common types of analysis with Hadoop:
• text mining
• index building (for quick search)
• graph creation and analysis, e.g. the social network graph of Facebook / LinkedIn
• pattern recognition, e.g. is this text in Hebrew, English, or another language
• collaborative filtering
• prediction models, e.g. regression
• sentiment analysis, e.g. how is my service perceived
• risk assessment

What is common across Hadoop-able problems:

Nature of the data:
• volume - big data ...
• velocity - how fast the data is coming in
• variety - structured, unstructured, ...
Nature of the processing:
• batch
• parallel execution
• distributed data

Use case - Opower:
• SaaS for utility companies
• provides insights to customers about their energy usage
Data:
• huge amounts
• many different sources
Analysis:
• comparison to similar homes
• heating and cooling usage
• bill forecasting


Benefits of analysis with Hadoop:
• previously impossible analyses
• lower cost
• less time
• greater flexibility - first store the data, then decide what processing to perform
• near-linear scalability
• ask bigger questions




References:
http://cloudera.com/content/cloudera/en/resources/library/training/cloudera-essentials-for-apache-hadoop-the-motivation-for-hadoop.html


Nathan

Saturday, May 2, 2015

Hadoop in a nutshell

Hi

Main features:
• cheap storage of huge amounts of data
• processes huge amounts of data quickly
• can store unstructured data like text, images and video
• scalable almost without limit (just add nodes)
• data is protected in software against hardware failure

Components included in the basic download:
• HDFS: a Java-based distributed file system that can store all kinds of data without prior organization
• MapReduce: a programming model for parallel computing
• YARN: schedules and handles resource requests from distributed applications

Other components exist: Pig, Hive, HBase, ZooKeeper, Ambari, Flume, Sqoop, Oozie

How does data get into Hadoop?
• you can load files into HDFS using simple Java commands (see the sketch below)
• if you have many files, you can invoke a script that loads them in parallel
• .......
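A minimal sketch of such a load, assuming hypothetical local and HDFS paths; the equivalent shell command is hadoop fs -put <local file> <hdfs dir>:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadIntoHdfs {
    public static void main(String[] args) throws Exception {
        // Uses the cluster settings from core-site.xml / hdfs-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths - adjust to your own local file and HDFS directory
        Path localFile = new Path("/tmp/031512.log");
        Path hdfsDir = new Path("/user/nathan/logs/");

        // Copies the local file into HDFS; the file is split into blocks
        // and replicated across DataNodes automatically
        fs.copyFromLocalFile(localFile, hdfsDir);
        fs.close();
    }
}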


References



Nathan

Wednesday, April 29, 2015

Become a Data Scientist

Hi

You will need the following:
• Education: 88% have at least an M.Sc., 46% have a Ph.D.
• Math: machine learning, statistics, algebra
• Software engineering: SQL, programming languages, Hadoop
• Non-technical skills: intellectual curiosity, communication skills, strong business acumen


Resources:

Links



Nathan