Hadoop

Current module resources / equipment

Resources:

Equipment:

A library for distributed storage and distributed processing of very large datasets and very fast incoming data
Supports simple programming models
Is written in Java
Open-source
Data Locality
Is designed to run on commodity hardware
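
The "simple programming model" most often associated with Hadoop is MapReduce. The sketch below imitates its three phases (map, shuffle, reduce) in plain Python on a single machine. It does not use any Hadoop API, and the function names are made up for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # On a real cluster each input split would be mapped on a different node;
    # here everything runs sequentially in one process.
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling word_count(["big data on Hadoop", "Hadoop stores big data"]) counts "big", "data" and "hadoop" twice each and the remaining words once.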

HDFS (Hadoop Distributed File System): on-premises storage for big data (can store all kinds of data: structured, unstructured, and semi-structured)
Scalability
Variety
High throughput
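
HDFS scales by cutting each file into fixed-size blocks and storing copies of every block on several nodes. The sketch below shows the arithmetic only. It assumes a 128 MB block size and a replication factor of 3 (common defaults, but configurable), and it uses a toy round-robin placement; real HDFS placement is rack-aware and more involved. Node names are hypothetical.

```python
BLOCK_SIZE_MB = 128  # assumption: common default block size
REPLICATION = 3      # assumption: common default replication factor

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block a file would occupy."""
    full_blocks = file_size_mb // block_size_mb
    remainder = file_size_mb % block_size_mb
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes

def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Toy round-robin placement: each block lands on `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }
```

Under these assumptions a 300 MB file occupies three blocks (128, 128 and 44 MB), and each block is stored on three of the available nodes.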

Install Docker Desktop with Kubernetes capability

Install VirtualBox with minikube

minikube start --driver=virtualbox
## if the start does not complete, delete the cluster and start again
minikube delete
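
The start-then-delete routine above can be wrapped in a small script. This is only a sketch: it assumes minikube is on the PATH and that a non-zero exit code means the start failed. The helper name start_cluster and the injectable run parameter are illustrative, not part of minikube.

```python
import subprocess

START_CMD = ["minikube", "start", "--driver=virtualbox"]
DELETE_CMD = ["minikube", "delete"]

def start_cluster(run=subprocess.run):
    """Start minikube; if that fails, delete the cluster and try once more.

    `run` can be replaced with a stub so the logic is testable
    without minikube installed. Returns True if a start succeeded.
    """
    if run(START_CMD).returncode == 0:
        return True
    run(DELETE_CMD)  # discard the incomplete cluster before retrying
    return run(START_CMD).returncode == 0
```

Calling start_cluster() with no arguments runs the real commands; passing a stub for run exercises the retry logic on its own.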

Download HDP Sandbox 3.x (VirtualBox)

YARN (Yet Another Resource Negotiator) is the resource manager used in Hadoop. It manages disk, memory, and CPU.
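
To make the idea of resource management concrete, the sketch below hands out "containers" (a slice of memory and CPU cores) from a list of nodes using a first-fit rule. It is a toy model, not YARN's actual scheduler or API; the Node class, its field names, and the allocate function are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_memory_mb: int
    free_vcores: int

def allocate(nodes, memory_mb, vcores):
    """First-fit: reserve a container on the first node with enough free resources.

    Returns the chosen node's name, or None if no node can host the request.
    """
    for node in nodes:
        if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
            node.free_memory_mb -= memory_mb  # resources stay reserved until released
            node.free_vcores -= vcores
            return node.name
    return None
```

A request that no node can satisfy returns None; a real resource manager would typically queue such a request until capacity frees up.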

It is a framework which contains:

Apache Hadoop:

Apache Spark:

Easy to program and does not require any abstractions
Programmers can perform streaming, batch processing and machine learning, all in the same cluster
Has a built-in interactive mode
Can execute jobs 10 to 100 times faster than Hadoop MapReduce, mainly by keeping data in memory
Programmers can modify the data in real time through Spark Streaming
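
Spark Streaming handles a live stream as a sequence of small batches. The sketch below imitates that micro-batch idea in plain Python with a running word count. It is not the Spark API: the function names are hypothetical, and batches here are cut by record count, whereas Spark Streaming cuts them by time interval.

```python
from collections import Counter

def micro_batches(records, batch_size):
    """Group an incoming record stream into fixed-size batches."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, partially filled batch

def running_word_counts(records, batch_size=2):
    """Yield the cumulative word counts after each micro-batch."""
    totals = Counter()
    for batch in micro_batches(records, batch_size):
        for line in batch:
            totals.update(line.split())
        yield dict(totals)
```

Each yielded dictionary is a snapshot of the totals so far, so a consumer sees updated results as new batches arrive instead of waiting for the whole input.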