Hadoop

Current module resources / equipment

Resources:

  • “Hadoop: The Definitive Guide, 4th Edition” by Tom White, O’Reilly, 2015
  • Hortonworks
  • Cloudera

Equipment:

  • VMware Player or VirtualBox
  • Hortonworks Sandbox or Cloudera QuickStart VM, downloaded and ready to run

What is Hadoop

  • A software library for distributed storage and distributed processing of very large datasets and fast-arriving data
  • Supports simple programming models
  • Is written in Java
  • Open-source
  • Data locality: computation is moved to the nodes where the data is stored
  • Is designed to run on commodity hardware

Why Hadoop

  • HDFS (Hadoop Distributed File System): on-premises storage for big data (can store all kinds of data: structured, unstructured and semi-structured)
  • Scalability
  • Variety
  • High throughput
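
To make the HDFS storage idea more concrete, here is a minimal Python sketch of how a file could be cut into fixed-size blocks, with each block copied to several nodes. The 128 MB block size and replication factor of 3 are the usual HDFS defaults; the function names, node names and round-robin placement are illustrative assumptions, not the real HDFS placement policy.

```python
BLOCK_SIZE_MB = 128   # common HDFS default block size
REPLICATION = 3       # common HDFS default replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block a file of this size would occupy."""
    if file_size_mb <= 0:
        return []
    full_blocks = int(file_size_mb // block_size_mb)
    remainder = file_size_mb - full_blocks * block_size_mb
    blocks = [block_size_mb] * full_blocks
    if remainder > 0:
        blocks.append(remainder)  # the last block only holds what is left
    return blocks


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (hypothetical round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(300)
    nodes = ["node1", "node2", "node3", "node4"]  # made-up node names
    for block_id, holders in place_replicas(len(blocks), nodes).items():
        print(f"block {block_id} ({blocks[block_id]} MB) -> {holders}")
```

Because every block lives on several nodes, losing one machine does not lose data, and adding machines adds both capacity and read throughput.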

Hadoop ecosystem

  • The base layer is HDFS (Hadoop Distributed File System)

  • MapReduce (Data processing)

  • YARN (Cluster resource management)

  • ZooKeeper (Coordination)

  • Hive (SQL query)

  • Pig (Scripting with Pig Latin, a data-flow language)

  • HBase (Column-oriented database)

  • HCatalog (Table and schema management)

  • Thrift (Cross language service)

  • Drill (Interactive analysis)

  • Mahout (Machine learning)

  • Sqoop & Flume (Data collection)

  • Apache Ambari (Management and monitoring)

  • Oozie (Workflow scheduling)

Docker or VirtualBox

Install Docker Desktop with Kubernetes enabled

Or

Install VirtualBox with minikube

minikube start --driver=virtualbox
## if the start is incomplete, delete the cluster and retry
minikube delete

Download HDP Sandbox 3.x (VirtualBox)

YARN

YARN (Yet Another Resource Negotiator) is the resource manager used in Hadoop. It manages cluster resources such as disk, memory and CPU.

It is a framework which contains:

  • Client: job submission
  • Resource Manager (RM): manages resource allocation (cluster level)
  • Application Master (AM): manages application lifecycle
  • Node Manager (NM): manages resource allocation (node level)
  • Container: runs applications
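
The division of roles listed above can be sketched as a small Python simulation. This is a toy model under stated assumptions, not YARN's actual API: the class names, the first-fit placement rule and the container naming scheme are all hypothetical, and real YARN schedulers are far more elaborate.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Tracks the free resources on one worker node (node level)."""
    name: str
    free_memory_mb: int
    free_vcores: int
    containers: list = field(default_factory=list)


class ResourceManager:
    """Cluster-level allocator: picks a node that can fit each request."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb, vcores):
        # First-fit placement, chosen only to keep the sketch short.
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                node.free_memory_mb -= memory_mb
                node.free_vcores -= vcores
                container = f"{app_id}-container-{len(node.containers)}@{node.name}"
                node.containers.append(container)
                return container
        return None  # no node has enough free capacity right now


def run_application(rm, app_id, num_tasks, memory_mb=1024, vcores=1):
    """Plays the Application Master role: requests one container per task."""
    granted = []
    for _ in range(num_tasks):
        container = rm.allocate(app_id, memory_mb, vcores)
        if container is not None:
            granted.append(container)
    return granted


if __name__ == "__main__":
    # The client submits a 4-task job to a 2-node cluster.
    cluster = [NodeManager("nm1", 2048, 2), NodeManager("nm2", 1024, 1)]
    print(run_application(ResourceManager(cluster), "app1", num_tasks=4))
```

In this run only three of the four requested containers fit, which is the situation in which a real scheduler would queue the remaining request until resources are released.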

MapReduce & Spark

Apache Hadoop:

  • Difficult to program and requires abstractions
  • It is used for generating reports that help find answers to historical queries
  • No built-in interactive mode, apart from tools such as Pig and Hive
  • Hadoop MapReduce does not make full use of the memory of the Hadoop cluster
  • Only processes batches of stored data
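
For readers who have not seen the MapReduce model, the classic word count can be sketched in plain Python. This only imitates the three phases (map, shuffle, reduce) in a single process; it does not use the Hadoop API, and the function names are chosen here for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over a batch of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Spark and Hadoop process data"]))
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves data across the network, which is where much of the programming and tuning effort goes.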

Apache Spark:

  • Easy to program and does not require any abstractions
  • Programmers can perform streaming, batch processing and machine learning, all in the same cluster
  • Has a built-in interactive mode
  • Can execute jobs 10 to 100 times faster than Hadoop MapReduce, largely by keeping data in memory
  • Programmers can process data in near real time through Spark Streaming
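
One reason for the speed difference is how often the input is read from storage. The following toy Python sketch, which uses neither the Hadoop nor the Spark API, counts storage reads when several jobs share one dataset. The class and function names are hypothetical, and the read counter is only a stand-in for disk I/O cost.

```python
class CountingSource:
    """Stands in for a dataset stored on disk; counts how often it is read."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._records)


def run_jobs_rereading(source, jobs):
    """MapReduce-style: every job starts by loading its input from storage again."""
    return [job(source.load()) for job in jobs]


def run_jobs_cached(source, jobs):
    """Spark-style: load once, keep the data in memory, reuse it for every job."""
    cached = source.load()
    return [job(cached) for job in jobs]


if __name__ == "__main__":
    jobs = [sum, max, len]
    disk_bound = CountingSource([3, 1, 4])
    in_memory = CountingSource([3, 1, 4])
    run_jobs_rereading(disk_bound, jobs)
    run_jobs_cached(in_memory, jobs)
    print("storage reads:", disk_bound.reads, "vs", in_memory.reads)
```

Both approaches return the same results; only the number of reads differs, and that gap grows with iterative workloads such as machine learning.
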
Author: shixuan liu
Link: http://tedlsx.github.io/2020/02/26/hadoop/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.