Hadoop & MapReduce

With the rise of Big Data, the Apache Software Foundation developed an open-source framework known as Apache Hadoop (it became a top-level Apache project in 2008), designed to address the storage and processing problems that big data poses.
  • Apache Hadoop is an open-source framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. 
  • It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
  • The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell scripts. 

Hadoop Cluster

    • A Hadoop cluster is nothing but a group of computers connected together via LAN. 
    • We use it for storing and processing large data sets. 
    • A Hadoop cluster consists of a number of commodity machines connected together. 
    • These slave machines communicate with a high-end machine which acts as the master. 
    • The master and slaves implement distributed computing over distributed data storage. 
    • Each machine runs open-source software that provides this distributed functionality.  

Hadoop Architecture

Hadoop follows a master-slave architecture for the transformation and analysis of large datasets using the MapReduce paradigm, and it allows parallel processing of data across several components.

The 3 important Hadoop components that play a vital role in the Hadoop architecture are - 

1. Hadoop Distributed File System (HDFS)

2. MapReduce

3. YARN

#Hadoop Daemons

A daemon (pronounced DEE-muhn) is a program that runs continuously and exists for the purpose of handling periodic service requests that a computer system expects to receive. The daemon program forwards the requests to other programs (or processes) as appropriate.

1. Master Daemons

  • NameNode: It is the master Daemon in Hadoop HDFS. It maintains the filesystem namespace. It stores metadata about each block of the files.
  • ResourceManager: It is the master daemon of YARN. It arbitrates resources amongst all the applications running in the cluster.

2. Slave Daemons

  • DataNode: DataNode is the slave daemon of Hadoop HDFS. It runs on slave machines. It stores actual data or blocks.
  • NodeManager: It is the slave daemon of YARN. It takes care of all the individual computing nodes in the cluster.

How Does Hadoop Work?

  • Input data is broken into blocks of 128 MB (by default) and the blocks are then distributed to different nodes.
  • Once all the blocks of the data are stored on data-nodes, the user can process the data.
  • Resource Manager then schedules the program (submitted by the user) on individual nodes.
  • Once all the nodes process the data, the output is written back to HDFS.

[Figure: internal working of Hadoop]

Hadoop Distributed File System (HDFS)

Hadoop stores a massive amount of data in a distributed manner in HDFS. 

The Hadoop Distributed File System (HDFS) is a highly reliable storage system, best known for its fault tolerance and high availability.

  • The Hadoop Distributed File System (HDFS) is Hadoop’s storage unit.
  • Here, the data is split into multiple blocks and these blocks are distributed and stored across slave machines. 
  • Each block is replicated three times by default and the copies are stored across multiple machines. 
  • Each block holds 128 MB of data by default; the block size can also be customized. 
  • HDFS provides features such as rack awareness, high availability, data blocks, replication management, and parallel read/write operations.
  • It provides high throughput by providing data access in parallel.
 Replication operates under two rules:
  1. Two identical blocks cannot be placed on the same DataNode.
  2. When a cluster is rack aware, all the replicas of a block cannot be placed on the same rack.

There are three components of the Hadoop Distributed File System:  

  1. NameNode (Masternode): Contains metadata in RAM and disk
  2. Secondary NameNode: Contains a copy of NameNode’s metadata on disk
  3. Slave/Data Node: Contains the actual data in the form of blocks
HDFS Architecture



  • Hadoop Distributed File System follows the master-slave architecture
  • Each cluster comprises a single master node and multiple slave nodes
  • Internally the files get divided into one or more blocks, and each block is stored on different slave machines depending on the replication factor.
  • The Master node is the NameNode and DataNodes are the slave nodes.

  • The master node stores and manages the file system namespace, that is information about blocks of files like block locations, permissions, etc. The slave nodes store data blocks of files.

#NameNode

NameNode is the centerpiece of the Hadoop Distributed File System. It maintains and manages the file system namespace and provides the right access permission to the clients.

The NameNode stores information about block locations, permissions, etc. on the local disk in the form of two files:

  • Fsimage: Fsimage stands for File System image. It contains the complete namespace of the Hadoop file system since the NameNode creation.
  • Edit log: It contains all the changes made to the file system namespace since the most recent Fsimage.

Functions of HDFS NameNode

  1. It executes the file system namespace operations like opening, renaming, and closing files and directories.
  2. NameNode manages and maintains the DataNodes.
  3. It determines the mapping of blocks of a file to DataNodes.
  4. NameNode records each change made to the file system namespace.
  5. It keeps track of the locations of each block of a file (see the sketch after this list).
  6. NameNode takes care of the replication factor of all the blocks.
  7. NameNode receives heartbeats and block reports from all DataNodes, which confirm that the DataNodes are alive.
  8. If the DataNode fails, the NameNode chooses new DataNodes for new replicas.
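To make "block locations" concrete, here is a minimal, illustrative sketch (not from the original article) that uses the HDFS Java client to ask the NameNode where the blocks of a file live; the path /user/demo/data.bin is hypothetical and the Configuration is assumed to pick up the cluster's core-site.xml/hdfs-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/data.bin");   // hypothetical file already stored in HDFS

        FileStatus status = fs.getFileStatus(file);
        // This is a metadata query answered by the NameNode; no block data is read here.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}

Each printed line corresponds to one block and lists the DataNodes (hosts) holding its replicas.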

#DataNodes

  • DataNodes are the slave nodes that store the actual data and maintain the blocks. 
  • While there is only one NameNode, there can be multiple DataNodes; they store and serve the blocks that clients are directed to by the NameNode.   
Functions of DataNode
  1. DataNode is responsible for serving client read/write requests.
  2. Based on instructions from the NameNode, DataNodes perform block creation, replication, and deletion.
  3. DataNodes send heartbeats to the NameNode to report the health of HDFS.
  4. DataNodes also send block reports to the NameNode to report the list of blocks they contain.

#Secondary NameNode

  • The Secondary NameNode is a helper node for the primary NameNode; it maintains a copy of the NameNode’s metadata on disk. 
  • It is not a hot standby: its main purpose is to perform periodic checkpoints of the file system metadata, which can then be used to rebuild the NameNode in case of failure. 
  • The Secondary NameNode downloads the edit logs and Fsimage file from the primary NameNode and periodically applies the edit logs to the Fsimage. It then sends the updated Fsimage file back to the NameNode. So, if the primary NameNode fails, the last saved Fsimage on the Secondary NameNode can be used to recover the file system metadata. 
  • In a high-availability cluster, there are two NameNodes: active and standby. The standby NameNode performs this checkpointing role, so a separate Secondary NameNode is not used.
Hence, the Secondary NameNode performs regular checkpoints in HDFS. Therefore, it is also called the CheckpointNode.

Process

  • Application data is stored on servers referred to as DataNodes
  • File system metadata is stored on servers referred to as NameNode. 
  • HDFS replicates the file content on multiple DataNodes based on the replication factor to ensure reliability of data. 
  • The NameNode and DataNode communicate with each other using TCP based protocols. 
  • For the Hadoop architecture to perform efficiently, HDFS must satisfy certain prerequisites:
1. All the hard drives should have a high throughput. 
2. There should be good network speed to manage intermediate data transfer and block replications.

Blocks in HDFS Architecture

  • Internally, HDFS splits a file into block-sized chunks called blocks. 
  • The size of a block is 128 MB by default. 
  • One can configure the block size as per the requirement. For example, if there is a file of size 612 MB, then HDFS will create four blocks of size 128 MB and one block of size 100 MB (the short sketch after this list walks through this arithmetic).
  • A file of a smaller size does not occupy the full block size on disk. For example, a file of size 2 MB will occupy only 2 MB of disk space. 
  • The user doesn’t have any control over the location of the blocks.
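As a quick illustration of the block-size arithmetic above, the following minimal Java sketch (plain arithmetic only, no Hadoop APIs; the 612 MB file is the hypothetical example from the list) prints how such a file splits into 128 MB blocks:

public class BlockSplitSketch {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // default HDFS block size: 128 MB
        long fileSize  = 612L * 1024 * 1024;   // hypothetical 612 MB file

        long fullBlocks  = fileSize / blockSize;                    // number of full 128 MB blocks
        long lastBlockMb = (fileSize % blockSize) / (1024 * 1024);  // size of the final, partial block

        System.out.println("Full 128 MB blocks  : " + fullBlocks);   // prints 4
        System.out.println("Last block size (MB): " + lastBlockMb);  // prints 100
    }
}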

Replication Management

  • In Hadoop, HDFS stores replicas of a block on multiple DataNodes based on the replication factor.
  • The replication factor is the number of copies to be created for blocks of a file in HDFS architecture.
  • If the replication factor is 3, then three copies of a block get stored on different DataNodes. So if one DataNode containing the data block fails, then the block is accessible from the other DataNode containing a replica of the block.
  • If we are storing a file of 128 MB and the replication factor is 3, then (3*128=384) 384 MB of disk space is occupied by the file, as three copies of each block get stored (a small API sketch follows below).

This replication mechanism makes HDFS fault-tolerant.
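Where the replication factor needs to be inspected or changed per file, the HDFS Java API can be used. The sketch below is illustrative only and assumes a reachable cluster and a hypothetical file /user/demo/data.bin; note that setReplication only records the desired factor, and the NameNode adds or removes replicas asynchronously:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // reads core-site.xml / hdfs-site.xml
        conf.set("dfs.replication", "3");             // default replication for files created by this client
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/data.bin");  // hypothetical file already in HDFS
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size : " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        // Ask the NameNode to keep 2 copies of this file's blocks instead of 3.
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}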

Rack Awareness in HDFS Architecture

A rack is a collection of around 40-50 machines (DataNodes) connected to the same network switch. If that switch or the rack’s power fails, the whole rack becomes unavailable.

  • Rack Awareness is the concept of choosing the closest node based on the rack information.
  • To ensure that all the replicas of a block are not stored on the same rack, the NameNode follows a rack awareness algorithm to place replicas, which reduces latency and improves fault tolerance.

Suppose if the replication factor is 3, then according to the rack awareness algorithm:

  • The first replica will get stored on the local rack.
  • The second replica will get stored on the other DataNode in the same rack.
  • The third replica will get stored on a different rack.

[Figure: Rack awareness in HDFS]


Advantages of Rack Awareness 

To improve network performance: Communication between nodes residing on different racks is directed via a switch, and in general the network bandwidth between machines in the same rack is greater than between machines in different racks. Rack awareness therefore reduces write traffic between racks, giving better write performance, while read performance benefits from using the bandwidth of multiple racks.

To prevent loss of data: We don’t have to worry about the data even if an entire rack fails because of a switch failure or power failure. And if you think about it, it makes sense: never put all your eggs in one basket.

HDFS Read and Write Mechanism

HDFS Read and Write mechanisms are parallel activities. To read or write a file in HDFS, a client must interact with the namenode. The namenode checks the privileges of the client and gives permission to read or write on the data blocks.

During file read, if any DataNode goes down, the NameNode provides the address of another DataNode containing a replica of the block from where the client can read its data without any downtime.
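For a concrete picture of the client side of this interaction, here is a minimal, illustrative sketch (assumptions: a reachable cluster configured via core-site.xml/hdfs-site.xml, and the hypothetical path /user/demo/hello.txt) that writes a small file to HDFS and reads it back using the Java client:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // the client first talks to the NameNode
        Path file = new Path("/user/demo/hello.txt");  // hypothetical path

        // Write: the NameNode picks target DataNodes and the client streams the data to them.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Read: the NameNode returns block locations and the client reads from a nearby DataNode.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}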

Goals of HDFS

Fault detection and recovery − Since HDFS runs on a large amount of commodity hardware, component failures are frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.

Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications having huge datasets.

Hardware at data − A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.

Hadoop YARN

  • YARN enables users to process data as per requirement using a variety of tools, such as Spark for real-time processing, Hive for SQL, HBase for NoSQL, and others.  
  • Hadoop YARN (Yet Another Resource Negotiator) is the cluster resource management layer of Hadoop and is responsible for resource allocation and job scheduling.  
  • Introduced in the Hadoop 2.0 version.
  •  YARN is the middle layer between HDFS and MapReduce in the Hadoop architecture. 

Components of YARN include:

Resource Manager (one per cluster): Runs on the master daemon and manages the resource allocation in the cluster. 
Node Manager (one per node): Runs on the slave daemons and is responsible for the execution of tasks on every single DataNode. 
Application Master (one per application): Manages the user job lifecycle and the resource needs of individual applications. It works along with the Node Manager and monitors the execution of tasks. 
Container: A package of resources including RAM, CPU, network, HDD, etc. on a single node.
 
You can consider YARN as the brain of your Hadoop Ecosystem

Resource Manager

Resource Manager manages the resource allocation in the cluster and is responsible for tracking how many resources are available in the cluster and each node manager’s contribution. It has two main components:

  1. Scheduler: Allocates resources to the various running applications based on their requirements; it does not monitor or track the status of the applications.
  2. Application Manager: Accepts job submissions from clients and monitors and restarts Application Masters in case of failure.

Application Master

  • Application Master manages the resource needs of individual applications and interacts with the scheduler to acquire the required resources. 
  • It connects with the node manager to execute and monitor tasks.

Node Manager

  • Node Manager tracks running jobs and sends signals (or heartbeats) to the resource manager to relay the status of a node. 
  • It also monitors each container’s resource utilization. 
  • NodeManager is the slave daemon of YARN. It runs on all the slave nodes in the cluster. 
  • It is responsible for launching and managing the containers on nodes. 
  • Containers execute the application-specific processes with a constrained set of resources such as memory, CPU, and so on. 
  • When a NodeManager starts, it announces itself to the ResourceManager, periodically sends it heartbeats, and offers its resources to the cluster.

Container

  • Container houses a collection of resources like RAM, CPU, and network bandwidth. 
  • Allocations are based on the resources that YARN has calculated to be available on that node. 
  • The container provides the rights to an application to use specific resource amounts.

[Figure: Apache Hadoop YARN architecture]

Steps to Running an application in YARN

  1. The client submits an application to the ResourceManager
  2. The ResourceManager allocates a container to start the Application Master
  3. The Application Master registers with the ResourceManager
  4. The Application Master asks the ResourceManager for containers
  5. The Application Master notifies the NodeManager to launch the containers
  6. The application code is executed in the containers
  7. The client contacts the ResourceManager/Application Master to monitor the application’s status (see the client-side sketch after this list)
  8. The Application Master unregisters with the ResourceManager.
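As a hedged illustration of step 7 only (not the full submission flow), the sketch below uses the YarnClient API to ask the ResourceManager for a report of the applications it is tracking; it assumes a reachable ResourceManager configured in yarn-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnStatusSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // expects yarn-site.xml on the classpath
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager for a report of every application it knows about.
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId()
                    + "  " + report.getName()
                    + "  " + report.getYarnApplicationState()
                    + "  progress=" + report.getProgress());
        }
        yarnClient.stop();
    }
}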

MapReduce

  • It is the processing unit in Hadoop.
  • MapReduce is a software framework and programming model that allows us to perform distributed and parallel processing on large data sets in a distributed environment.
  • MapReduce consists of two distinct tasks – Map and Reduce 
  •  It works by dividing the task into independent subtasks and executing them in parallel across various DataNodes, thereby increasing the throughput.
  • Map tasks deal with splitting and mapping of data while Reduce tasks shuffle and reduce the data. 
  • Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++. 
  • The input to each phase is key-value pairs.  The type of key, value pairs is specified by the programmer through the InputFormat class. By default, the text input format is used. 
  • The first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate outputs.
  • The output of a Mapper or map job (key-value pairs) is input to the Reducer.
  • The reducer receives the key-value pair from multiple map jobs.
  • Then, the reducer aggregates those intermediate data tuples (intermediate key-value pair) into a smaller set of tuples or key-value pairs which is the final output.

Mapper Class

The first stage in data processing using MapReduce is the Mapper class. Here, the RecordReader processes each input record and generates the corresponding key-value pair, and the mapper’s intermediate output is saved to the local disk. (A minimal mapper sketch follows the list below.)

  • Input Split - the logical representation of data; it represents the chunk of input processed by a single map task in the MapReduce program.
  • RecordReader - interacts with the input split and converts the data it reads into key-value pairs.
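For concreteness, here is the classic word-count mapper, written as a minimal sketch against the org.apache.hadoop.mapreduce API (the class name and the word-count example are illustrative, not from the original article):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word in each line handed to it by the RecordReader.
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // intermediate key-value pair
        }
    }
}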

Reducer Class

The Intermediate output generated from the mapper is fed to the reducer which processes it and generates the final output which is then saved in the HDFS.
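Continuing the same word-count sketch, the reducer below sums the counts for each word and writes the final key-value pairs, which the framework stores in HDFS:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives (word, [1, 1, ...]) after the shuffle and sort, and writes (word, total).
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}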

Driver Class 

The major component in a MapReduce job is the Driver class. It is responsible for setting up a MapReduce job to run in Hadoop. We specify the names of the Mapper and Reducer classes along with the data types and the job name (a minimal driver sketch follows).
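A minimal driver for the word-count sketch above might look like the following; the input and output paths are taken from the command line and are purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Wires together the mapper and reducer sketched above and submits the job.
public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}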

 


Working of Hadoop MapReduce

Whenever the client wants to perform any processing on its data in the Hadoop cluster, then it first stores the data in Hadoop HDFS and then writes the MapReduce program for processing the Data. The Hadoop MapReduce works as follows:

  1. Hadoop divides the job into tasks of two types, that is, map tasks and reduce tasks. YARN schedules these tasks.
  2. These tasks run on different DataNodes.
  3. The input to the MapReduce job is divided into fixed-size pieces called input splits.
  4. One map task which runs a user-defined map function for each record in the input split is created for each input split. These map tasks run on the DataNodes where the input data resides.
  5. The output of the map task is intermediate output and is written to the local disk.
  6. The intermediate outputs of the map tasks are shuffled and sorted and are then passed to the reducer.
  7. For a single reduce task, the sorted intermediate output of mapper is passed to the node where the reducer task is running. These outputs are then merged and then passed to the user-defined reduce function.
  8. The reduce function summarizes the output of the mapper and generates the output. The output of the reducer is stored on HDFS.
  9. For multiple reduce functions, the user specifies the number of reducers. When there are multiple reduce tasks, the map tasks partition their output, creating one partition for each reduce task (a partitioner sketch follows this list).
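The partitioning in step 9 decides which reduce task receives a given key. By default Hadoop uses a hash partitioner; the sketch below shows an equivalent custom partitioner for the word-count example (illustrative only, registered on the job with job.setPartitionerClass):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate (word, count) pair to a reduce task based on the word's hash,
// mirroring the behaviour of Hadoop's default HashPartitioner.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}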
 

Advantages of Hadoop


  • The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines, in turn utilizing the underlying parallelism of the CPU cores.

  • Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA), rather Hadoop library itself has been designed to detect and handle failures at the application layer.

  • Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption.

  • Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java-based.


 
