Current module resources / equipment
- “Hadoop: The Definitive Guide”, 4th edition, by Tom White, O’Reilly, 2015
- Hortonworks
- Cloudera
- VMWare Player or VirtualBox
- Hortonworks or Cloudera quick start VM ready
What is Hadoop
- A software library for distributed storage and distributed processing of very large datasets and fast incoming data
- Supports simple programming models
- Is written in Java
- Open-source
- Data locality: computation runs on the nodes where the data is stored
- Is designed to run on commodity hardware
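The "simple programming model" above usually refers to MapReduce. The sketch below imitates its three steps (map, shuffle, reduce) in plain Python on a single machine. It is only an illustration of the model, not Hadoop's Java API; the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data on hadoop", "hadoop stores big data"]
    print(word_count(sample))
```

On a real cluster the mappers run in parallel on the nodes that hold the input blocks, and the framework performs the shuffle over the network; here everything runs in one process.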
Why Hadoop
- HDFS (Hadoop Distributed File System): on-premises storage for big data (can store all kinds of data: structured, unstructured, and semi-structured)
- Scalability
- Variety
- High throughput
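To make the storage side concrete, here is a small arithmetic sketch of how HDFS splits a file into blocks and replicates them. It assumes the common defaults of a 128 MB block size and a replication factor of 3; both are configurable per cluster, and the helper name `hdfs_footprint` is hypothetical.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (blocks, block replicas, raw storage in MB) for one file.

    Assumes the defaults of 128 MB blocks and 3 replicas.
    The last block only occupies the space it actually needs,
    so raw storage is simply file size times replication.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb


if __name__ == "__main__":
    # A 1 GB file under the assumed defaults.
    print(hdfs_footprint(1024))
```

Under these assumptions, a 1 GB file becomes 8 blocks stored as 24 block replicas, using 3 GB of raw disk across the cluster. Spreading the replicas over several nodes is what allows many readers to work in parallel and the cluster to survive a node failure.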
Hadoop ecosystem
- HDFS (Hadoop Distributed File System): the base storage layer
- MapReduce (Data processing)
- YARN (Cluster resource management)
- ZooKeeper (Coordination)
- Hive (SQL query)
- Pig (Scripting similar to SQL)
- HBase (Column-oriented store, database)
- HCatalog (Table and schema management)
- Thrift (Cross-language services)
- Drill (Interactive analysis)
- Mahout (Machine learning)
- Sqoop & Flume (Data collection)
- Apache Ambari (Management and monitoring)
- Oozie (Workflow scheduling)
Docker or VirtualBox
Install Docker Desktop with Kubernetes capability
Install VirtualBox with minikube
    minikube start --driver=virtualbox
Download HDP Sandbox 3.x (VirtualBox)
YARN (Yet Another Resource Negotiator) is the resource manager used in Hadoop. It manages cluster resources such as disk, memory, and CPU.
It is a framework made up of the following components:
- Client: job submission
- Resource Manager (RM): manages resource allocation (cluster level)
- Application Master (AM): manages application lifecycle
- Node Manager (NM): manages resource allocation (node level)
- Container: runs applications
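The interaction between these components can be sketched as a toy simulation. This is not YARN's API: the classes `NodeManager` and `ResourceManager` and the function `application_master` are hypothetical stand-ins, and the first-fit placement is a simplifying assumption (real clusters use pluggable schedulers such as the Capacity or Fair scheduler).

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Tracks the free resources and running containers of one node."""
    name: str
    free_memory_mb: int
    free_vcores: int
    containers: list = field(default_factory=list)

    def can_fit(self, memory_mb, vcores):
        return self.free_memory_mb >= memory_mb and self.free_vcores >= vcores

    def launch(self, app_id, memory_mb, vcores):
        # Reserve the resources and record the container on this node.
        self.free_memory_mb -= memory_mb
        self.free_vcores -= vcores
        self.containers.append((app_id, memory_mb, vcores))


class ResourceManager:
    """Cluster-level allocator (assumption: simple first-fit placement)."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb, vcores):
        """Return the name of the node that got the container, or None."""
        for node in self.nodes:
            if node.can_fit(memory_mb, vcores):
                node.launch(app_id, memory_mb, vcores)
                return node.name
        return None


def application_master(rm, app_id, num_tasks, memory_mb, vcores):
    """Request one container per task and return where each one was placed."""
    return [rm.allocate(app_id, memory_mb, vcores) for _ in range(num_tasks)]


if __name__ == "__main__":
    rm = ResourceManager([NodeManager("node1", 4096, 2),
                          NodeManager("node2", 2048, 2)])
    print(application_master(rm, "app_1", 4, 2048, 1))
```

In this run the fourth request is not placed because no node has 2048 MB left; in a real cluster such a request would wait until resources are released.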
MapReduce & Spark
Apache Hadoop (MapReduce):
- Difficult to program and requires abstractions
- It is used for generating reports that help find answers to historical queries
- No built-in interactive mode, apart from tools like Pig and Hive
- Hadoop MapReduce does not make full use of the memory of the Hadoop cluster
- Only allows you to process batches of stored data
Apache Spark:
- Easy to program and does not require any abstractions
- Programmers can perform streaming, batch processing and machine learning, all in the same cluster
- Has in-built interactive mode
- Can execute jobs 10 to 100 times faster than Hadoop MapReduce, depending on the workload
- Programmers can process data in near real time through Spark Streaming
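One reason for the speed difference is where intermediate results live. Chained MapReduce jobs write their output to disk between jobs, while Spark can keep intermediate data in memory. The plain-Python toy below mimics only that difference; it uses neither Hadoop nor Spark, and the function names are hypothetical.

```python
import json
import os
import tempfile


def count_words(lines):
    """Step 1: count how often each word occurs."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def keep_frequent(counts, minimum):
    """Step 2: keep only words seen at least `minimum` times."""
    return {word: n for word, n in counts.items() if n >= minimum}


def mapreduce_style(lines, minimum):
    """Two chained jobs: job 1 writes to disk, job 2 reads the file back."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "job1_output.json")
        with open(path, "w") as out:
            json.dump(count_words(lines), out)
        with open(path) as src:
            intermediate = json.load(src)
        return keep_frequent(intermediate, minimum)


def spark_style(lines, minimum):
    """The same two steps, with the intermediate result kept in memory."""
    return keep_frequent(count_words(lines), minimum)


if __name__ == "__main__":
    data = ["a b a", "b c a"]
    print(mapreduce_style(data, 2))
    print(spark_style(data, 2))
```

Both functions return the same result; the disk round trip in `mapreduce_style` is the extra cost that grows with the number of chained jobs and the size of the data.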