1. What is Aadoop ?
A software framework for storing , processing and analyzing big data
- distributed
- scaleable
- fault tolerant
- open source
Eco system
Who uses Hadoop
2. The motivation for Hadoop
Traditional large scale computation used strong computer (super computer):
- faster processors
- more memory
but even this was not enough
Betetr solution is distributed system - use multiple machine for single job.
But this also has its problems :
- programming complexity - keeping data and processes in sync
- finite bandwidth
- partial failures - e.g. one computer fails should not keep the system down
Modern systems have much more data
- terabytes (1000gigabytes) a day
- petabytes (1000 terabyte) total
The approach of central data place is not suitable
The new approach :
- distribute data
- run computation where data resides
Hadoop works as follows :
- data is splited to blocks when loaded
- map tasks typically works on a single block
- a master program manages tasks
Core hadoop concepts :
- applications are written in high level languages
- nodes talk to each other as little as possible
- data is ditributed in advanced
- data is replicated for increased availeablity and relieability
- hadoop is scaleable and fault tolerant
Scaleability means
- adding more node is linearly propotional to capacity
- increase load result in gracefull decline in performance and not failure
Fault tolerance
- node failure is inevitable
- what to do in this case :
- system continues to function
- master re-assign tasks to adifferent node
- data replication - so no lost of data
- node which recover rejoin the cluster (?????) automatically
Distributed system must be design with expectation of failure
what can we do wit Hadoop
where does data comes from
- science - weather data, sensors ,...
- industry - energy , retail ,....
- legacy - sales data , product database
- system data - logs , network messages , intrusion detection , web analytics
common type of analysis with hadoop
- text mining
- index building (for quick search)
- graph creation and analysis e.g. social network graph for facebook \ linkedin
- pattern recognition e.g. is this text in hebrew , english or other language
- colaborative filtering
- prediction models e.g. regression
- sentiment analysis how is my service
- risk assesment
what is common accross hadoop - able problems :
nature of data :
- volume - big data ...
- velocity - how fast this data is coming
- variety - structured , not structure , ....
nature of processing
- batch
- parallel execution
- distributed data
use case Opower :
- SaaS for utility companies
- provides insight to customers about energy usage
data :
- huge amount
- many different sources
analysis :
- similar home comparison
- heating and cooling usage
- bill forcasting
benefits of analysis with hadoop
- previousely impossible
- lower cost
- less time
- greater flexability - first store data then decide what processing to perform
near linear scaleability - ask bigger questions
References :
http://cloudera.com/content/cloudera/en/resources/library/training/cloudera-essentials-for-apache-hadoop-the-motivation-for-hadoop.html
Nathan
I am also looking for it but I don't know to where its good for start. I know this is really good for Millionaire Mindset Intensive but what should I do now.
ReplyDeletevisit www.vigshakri.com for your IT trainings.
ReplyDeleteVery nice!
ReplyDeleteIt was really fun reading ypur article. Thankyou very much. # BOOST Your GOOGLE RANKING.It’s Your Time To Be On #1st Page
ReplyDeleteOur Motive is not just to create links but to get them indexed as will
Increase Domain Authority (DA).We’re on a mission to increase DA PA of your domain
High Quality Backlink Building Service
Boost DA upto 15+ at cheapest
Boost DA upto 25+ at cheapest
Boost DA upto 35+ at cheapest
Boost DA upto 45+ at cheapest
Hour peace structure record soldier. Friend cell together major buy.technology
ReplyDelete