Hadoop 1.x architecture

Hadoop 1.x

  1. Hadoop is built on the ideas described in two whitepapers published by Google, which correspond to its two core components:

    • HDFS

    • MapReduce

  2. HDFS: Hadoop Distributed File System

    It is different from a normal file system in that data copied onto HDFS is split into ‘n’ blocks, and each block is copied onto a different node in the cluster. To achieve this, HDFS uses a master-slave architecture, as sketched below.
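
    The split-and-distribute idea can be pictured with a minimal sketch (plain Python, not HDFS code; the block size and node names are made up for illustration):

    ```python
    # Conceptual sketch of HDFS-style block splitting (not real HDFS code).
    # Block size and node names are invented for illustration.

    BLOCK_SIZE = 4  # HDFS 1.x defaults to 64 MB; tiny here so the output is readable
    NODES = ["datanode1", "datanode2", "datanode3"]

    def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
        """Cut the data into fixed-size blocks (the last block may be smaller)."""
        return [data[i:i + block_size] for i in range(0, len(data), block_size)]

    def place_blocks(blocks, nodes=NODES):
        """Assign each block to a node round-robin, as the master would orchestrate."""
        return {f"block_{i}": (block, nodes[i % len(nodes)])
                for i, block in enumerate(blocks)}

    if __name__ == "__main__":
        placement = place_blocks(split_into_blocks(b"hello hadoop world"))
        for block_id, (content, node) in placement.items():
            print(block_id, content, "->", node)
    ```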

  3. HDFS Master => NameNode: Accepts client requests and is responsible for orchestrating how data is copied across the cluster

  4. HDFS Slave => DataNode: Stores blocks of data and coordinates with its master

  5. The NameNode has two functions:

    • Cluster Management

    • Metadata Management

  6. Metadata: The primary purpose of the NameNode is to manage all the metadata. Metadata is the list of files stored in HDFS (Hadoop Distributed File System). As we know, data is stored in the form of blocks in a Hadoop cluster, so the metadata records which DataNode holds each block of each file. Information about the transactions happening in the Hadoop cluster (when or who read/wrote the data) is also recorded in the metadata. The metadata is stored in memory, as pictured in the sketch below.
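
    One rough way to picture this is an in-memory mapping from each file to its blocks and from each block to the DataNodes that hold it. The sketch below is illustrative only; the file names, block IDs and node names are invented, and the real NameNode data structures are more involved.

    ```python
    # Illustrative in-memory metadata, loosely modelled on what the NameNode tracks.
    # File names, block IDs and node names are invented for the example.

    metadata = {
        "files": {
            "/logs/app.log": ["blk_1", "blk_2"],   # file -> ordered list of blocks
        },
        "blocks": {
            "blk_1": ["datanode1", "datanode2", "datanode3"],  # block -> replica locations
            "blk_2": ["datanode2", "datanode3", "datanode1"],
        },
    }

    def locate(path):
        """Answer a client's 'where is this file?' question from memory, with no disk seeks."""
        return {blk: metadata["blocks"][blk] for blk in metadata["files"][path]}

    print(locate("/logs/app.log"))
    ```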

  7. Cluster Management: The health of HDFS is critical for a Hadoop-based Big Data platform. HDFS problems can reduce the efficiency of the cluster or, worse, stop it from functioning properly. For example, a DataNode becoming unavailable because of network segmentation can leave some data blocks under-replicated. When this happens, HDFS automatically re-replicates those blocks, which adds a lot of overhead and can make the cluster too unstable to use. Cluster management therefore means monitoring the cluster and reacting to such conditions.

  8. FsImage is a file stored on the OS filesystem that contains the complete directory structure (namespace) of HDFS, with details of which data blocks make up each file and which nodes store those blocks.

    The EditLog is a transaction log that records the changes made to the HDFS file system and any actions performed on the HDFS cluster, such as the addition of a new block, replication, or deletion. It records the changes made since the last FsImage was created; these changes are later merged into the FsImage to produce a new FsImage file.

    When the NameNode starts, the latest FsImage file is loaded into memory, and the EditLog is replayed on top of it if the FsImage does not contain up-to-date information.

    The NameNode keeps the metadata in memory in order to serve multiple client requests as fast as possible. If it did not, then for every operation the NameNode would have to read the metadata from disk, and the extra disk seek time would slow every operation down. A toy model of the startup sequence is sketched below.
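
    A toy model of this startup sequence, assuming the FsImage is a snapshot dictionary and the EditLog is a list of operations (both invented formats), might look like this:

    ```python
    # Toy model of NameNode startup: load the latest FsImage snapshot,
    # then replay EditLog entries recorded after that snapshot.
    # The formats and operations here are invented for illustration.

    fsimage = {"/a.txt": ["blk_1"], "/b.txt": ["blk_2"]}   # last checkpointed namespace

    editlog = [
        ("add_block", "/b.txt", "blk_3"),   # changes made since the FsImage was written
        ("delete", "/a.txt", None),
    ]

    def replay(image, log):
        namespace = {path: list(blocks) for path, blocks in image.items()}  # load into memory
        for op, path, block in log:
            if op == "add_block":
                namespace.setdefault(path, []).append(block)
            elif op == "delete":
                namespace.pop(path, None)
        return namespace

    in_memory_namespace = replay(fsimage, editlog)
    print(in_memory_namespace)   # {'/b.txt': ['blk_2', 'blk_3']}
    ```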

  9. Every 3 seconds, each DataNode sends a heartbeat to the NameNode. The NameNode has a threshold (heartbeat interval × 2): if it does not receive any heartbeat from a DataNode within that window, the node is declared dead.
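
    The liveness check can be sketched roughly as follows; the interval and threshold follow the note above, not Hadoop's actual configuration keys:

    ```python
    import time

    # Simplified heartbeat-based liveness check (illustrative only; real Hadoop
    # uses configurable recheck intervals rather than this exact rule).

    HEARTBEAT_INTERVAL = 3                      # seconds between heartbeats
    DEAD_THRESHOLD = HEARTBEAT_INTERVAL * 2     # the "heartbeat * 2" threshold above

    last_heartbeat = {"datanode1": time.time(), "datanode2": time.time() - 10}

    def dead_nodes(now=None):
        now = now or time.time()
        return [node for node, seen in last_heartbeat.items()
                if now - seen > DEAD_THRESHOLD]

    print(dead_nodes())   # ['datanode2'] -- no heartbeat within the threshold
    ```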

  10. Replication Factor: It is the number of times the Hadoop framework replicates every data block. Blocks are replicated to provide fault tolerance. The default replication factor is 3, which can be configured as required: it can be lowered to 2 or raised above 3. The main goal we achieve with the replication factor is fault tolerance.
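
    As a sketch, replication amounts to choosing a replication-factor's worth of distinct DataNodes for every block. Real HDFS also applies rack-awareness rules, which are omitted here, and the node names are invented:

    ```python
    import random

    # Toy replica placement: pick `REPLICATION_FACTOR` distinct nodes per block.
    # Node names are invented; real HDFS adds rack-awareness on top of this idea.

    NODES = ["datanode1", "datanode2", "datanode3", "datanode4"]
    REPLICATION_FACTOR = 3   # the HDFS default; configurable per cluster or per file

    def place_replicas(block_id, nodes=NODES, rf=REPLICATION_FACTOR):
        return {block_id: random.sample(nodes, rf)}   # rf distinct nodes per block

    print(place_replicas("blk_1"))
    ```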

  11. Under-replication and its solution: Under-replicated blocks are blocks that do not meet their target replication for the file they belong to.

    To fix this, HDFS automatically creates new replicas of under-replicated blocks until they meet the target replication (a simplified balancing sketch follows item 14 below).

  12. When data is split according to the block size and assigned to blocks, this process happens in parallel.

  13. Over-replication and its solution: Over-replicated blocks are blocks that exceed their target replication for the file they belong to. Normally, over-replication is not a problem, and HDFS automatically deletes the excess replicas; that is how balance is restored in this case.

  14. In handling over-replication and under-replication we are trying to balance out the system; this is called load balancing, as sketched below.
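
    A simplified balancing pass over both cases might look like the sketch below; the target, block IDs and node names are invented, and real HDFS performs this automatically with its own placement policy:

    ```python
    # Simplified re-balancing pass: add replicas to under-replicated blocks and
    # drop extras from over-replicated ones. Illustrative only.

    TARGET = 3
    ALL_NODES = {"datanode1", "datanode2", "datanode3", "datanode4"}

    block_locations = {
        "blk_1": {"datanode1"},                                        # under-replicated
        "blk_2": {"datanode1", "datanode2", "datanode3", "datanode4"}, # over-replicated
        "blk_3": {"datanode1", "datanode2", "datanode3"},              # exactly at target
    }

    def rebalance(locations, target=TARGET, nodes=ALL_NODES):
        for block, replicas in locations.items():
            while len(replicas) < target:            # copy to a node that lacks the block
                replicas.add((nodes - replicas).pop())
            while len(replicas) > target:            # delete excess replicas
                replicas.pop()
        return locations

    print(rebalance(block_locations))
    ```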

  15. We must always remember that our first goal is to store big data, and our second goal is to process the big data that is stored.

  16. In Hadoop, the storing of data is taken care of by HDFS, the Hadoop Distributed File System.

  17. The processing part is taken care of by MapReduce

  18. MapReduce makes use of daemons. In English, a daemon is like a spirit or ghost; on similar lines, a daemon here means a process that works in the background and cannot be seen.

  19. It has two daemons: the JobTracker and the TaskTracker.

  20. The role of the JobTracker is like that of an operating system: it schedules and provides resources to the TaskTrackers and monitors their work.

  21. The TaskTracker is the one that actually works on the data.

  22. The processing happens in parallel, as the word-count sketch below illustrates.
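
    The division of labour can be illustrated with a toy word count: a run_job() function plays the JobTracker, scheduling map tasks on a pool of workers (the TaskTrackers) and then reducing their partial results. This is plain Python, not the Hadoop API.

    ```python
    from collections import Counter
    from multiprocessing import Pool

    # Toy MapReduce word count. run_job() plays the JobTracker: it schedules map
    # tasks on a worker pool (the TaskTrackers) and then reduces their results.

    def map_task(text_chunk):
        """Map phase: count words in one input split."""
        return Counter(text_chunk.split())

    def reduce_task(partial_counts):
        """Reduce phase: merge the partial counts from every map task."""
        total = Counter()
        for counts in partial_counts:
            total.update(counts)
        return total

    def run_job(splits):
        with Pool(processes=len(splits)) as workers:   # map tasks run in parallel
            partials = workers.map(map_task, splits)
        return reduce_task(partials)

    if __name__ == "__main__":
        splits = ["big data big", "data is big", "hadoop stores big data"]
        print(run_job(splits))
    ```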

  23. Secondary NameNode

    The Secondary NameNode is used for taking an hourly backup (checkpoint) of the metadata. In case the Hadoop cluster fails or crashes, the hourly checkpoints that the Secondary NameNode saves into a file named fsimage can be transferred to a new system; that metadata is loaded on the new system, a new master is created with it, and the cluster is made to run again correctly.
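
    Reusing the merge idea from item 8, a toy checkpoint step might look like the sketch below; the interval, formats and operations are invented for illustration:

    ```python
    import json

    # Toy checkpoint in the spirit of the Secondary NameNode: merge the EditLog
    # accumulated since the last checkpoint into the FsImage, serialize the new
    # FsImage, and start a fresh EditLog. Formats and operations are invented.

    CHECKPOINT_INTERVAL = 3600   # "hourly", as in the note above (seconds)

    def checkpoint(fsimage, editlog):
        for op, path, block in editlog:
            if op == "add_block":
                fsimage.setdefault(path, []).append(block)
            elif op == "delete":
                fsimage.pop(path, None)
        snapshot = json.dumps(fsimage)   # new fsimage a replacement master could load
        editlog.clear()                  # the merged changes are now in the snapshot
        return snapshot

    print(checkpoint({"/a.txt": ["blk_1"]}, [("add_block", "/a.txt", "blk_2")]))
    ```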

  24. Safemode for the NameNode is a read-only mode for the Hadoop Distributed File System (HDFS) cluster. In Safemode, you can't make any modifications to the file system or blocks. After the DataNodes report that most file system blocks are available, the NameNode automatically leaves Safemode.
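
    A simplified picture of the Safemode exit condition, assuming the 0.999 block-report threshold that HDFS uses by default (everything else is invented):

    ```python
    # Simplified picture of Safemode: the NameNode stays read-only until a high
    # enough fraction of blocks has been reported by DataNodes. The 0.999 figure
    # mirrors the HDFS default threshold; the rest is invented for illustration.

    SAFEMODE_THRESHOLD = 0.999

    def in_safemode(reported_blocks, total_blocks, threshold=SAFEMODE_THRESHOLD):
        return total_blocks > 0 and reported_blocks / total_blocks < threshold

    print(in_safemode(900, 1000))    # True  -> still read-only
    print(in_safemode(1000, 1000))   # False -> the NameNode leaves Safemode
    ```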

Drawback

  1. The fsimage in the Secondary NameNode is updated only after a certain interval of time

  2. Hence, if the NameNode fails within this interval, the changes made since the last checkpoint are lost; the NameNode is therefore a single point of failure

The entire architecture can be represented as: [Hadoop 1.x architecture diagram, with the NameNode and JobTracker on the master node and a DataNode and TaskTracker on each slave node]