Hadoop 1.x
Hadoop is built on the ideas of two whitepapers published by Google, which correspond to its two core components:
HDFS
MapReduce
HDFS: Hadoop Distributed File System
It is different from a normal file system in that data copied onto HDFS is split into ‘n’ blocks, and each block is copied onto a different node in the cluster. To achieve this, HDFS uses a master-slave architecture.
HDFS Master => NameNode: takes the client request and is responsible for orchestrating how the data is copied across the cluster
HDFS Slave => DataNode: stores the blocks of data and coordinates with its master
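To make this concrete, here is a minimal, hedged sketch (not part of the original notes) of a client writing a file to HDFS with the Java FileSystem API. The NameNode address, port, and file path are placeholder assumptions; the splitting into blocks and their placement on DataNodes happen behind the scenes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in Hadoop 1.x this is the fs.default.name property.
        conf.set("fs.default.name", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        // The client asks the NameNode (master) for metadata; the actual bytes are
        // streamed to DataNodes (slaves), which hold the blocks of the file.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/sample.txt"))) {
            out.writeUTF("hello hdfs");
        }
        fs.close();
    }
}
```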
NameNode has two functions
Cluster Management
Meta Data Management
Meta Data: The primary purpose of the NameNode is to manage all the metadata. Metadata is the list of files stored in HDFS (Hadoop Distributed File System). As we know, data is stored in the form of blocks in a Hadoop cluster, so the metadata records the DataNode (the location) on which each block of a file is stored. Information about the transactions happening in the Hadoop cluster (when and who read/wrote the data) is also kept in the metadata. The metadata is stored in memory.
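As an illustration of the question this metadata answers, the following hedged Java sketch asks the NameNode where the blocks of a file live; the file path is an assumed placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt"));

        // The answer comes from the NameNode's in-memory metadata:
        // for each block, the DataNodes (hosts) that hold a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```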
Cluster Management: The health of HDFS is critical for a Hadoop-based Big Data platform. HDFS problems can negatively affect the efficiency of the cluster; even worse, they can make the cluster stop functioning properly. For example, a DataNode's unavailability caused by network segmentation can lead to under-replicated data blocks. When this happens, HDFS automatically re-replicates those blocks, which brings a lot of overhead to the cluster and can make it too unstable to be usable. HDFS provides commands and APIs to monitor and manage the cluster.
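For example, here is a hedged sketch of checking overall cluster capacity and usage from the Java API, assuming a Hadoop version where FileSystem#getStatus is available; this is roughly the information an admin report also exposes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class ClusterStatusExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Total, used and remaining capacity reported for the file system.
        FsStatus status = fs.getStatus();
        System.out.println("capacity  = " + status.getCapacity());
        System.out.println("used      = " + status.getUsed());
        System.out.println("remaining = " + status.getRemaining());
        fs.close();
    }
}
```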
FsImage is a file stored on the OS filesystem that contains the complete directory structure (namespace) of HDFS, with details about where the data lives in the data blocks and which blocks are stored on which node.
EditLog is a transaction log that records the changes in the HDFS file system or any action performed on the HDFS cluster, such as addition of a new block, replication, deletion, etc. It records the changes since the last FsImage was created; these changes are then merged into the FsImage file to create a new FsImage file.
When the NameNode starts, the latest FsImage file is loaded into memory, and the EditLog is also loaded (and replayed) if the FsImage does not contain up-to-date information.
The NameNode keeps the metadata in memory in order to serve multiple client requests as fast as possible. If this were not done, then for every operation the NameNode would have to read the metadata from disk into memory, adding disk-seek time to every operation.
Every 3 seconds each DataNode sends a heartbeat to the NameNode. The NameNode has a timeout threshold (much larger than the heartbeat interval; roughly ten minutes by default): if it does not receive any heartbeat from a DataNode within that window, the node is declared dead.
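The heartbeat interval is driven by the dfs.heartbeat.interval property; here is a small hedged sketch of reading it from the configuration.

```java
import org.apache.hadoop.conf.Configuration;

public class HeartbeatConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // dfs.heartbeat.interval is the DataNode heartbeat interval in seconds;
        // it defaults to 3 if not overridden in hdfs-site.xml.
        long heartbeatSeconds = conf.getLong("dfs.heartbeat.interval", 3);
        System.out.println("DataNode heartbeat interval: " + heartbeatSeconds + "s");
    }
}
```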
Replication Factor: It is the number of times the Hadoop framework replicates every data block. Blocks are replicated to provide fault tolerance. The default replication factor is 3, which can be configured as per the requirement; it can be lowered to 2 (less than 3) or increased (more than 3). The main goal we achieve using the replication factor is fault tolerance.
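A hedged sketch of configuring the replication factor, both as the default for new files (the dfs.replication property) and per existing file; the path is an assumed placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 2);   // default replication for files created by this client
        FileSystem fs = FileSystem.get(conf);

        // Change the target replication of an existing file to 2.
        fs.setReplication(new Path("/user/demo/sample.txt"), (short) 2);
        fs.close();
    }
}
```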
Under-replication and its solution: Under-replicated blocks are blocks that do not meet the target replication factor of the file they belong to.
To balance this, HDFS automatically creates new replicas of under-replicated blocks until they meet the target replication.
When data is split according to the block size and assigned to blocks, this process happens in parallel.
Over-replication and its solution: Over-replicated blocks are blocks that exceed the target replication factor of the file they belong to. Normally over-replication is not a problem, and HDFS automatically deletes the excess replicas; that is how the system is balanced in this case.
In handling over-replication and under-replication we are trying to balance out the system; this is called load balancing.
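A hedged sketch of how a client could spot both cases, by comparing a file's target replication with the number of replicas actually reported for each block; the path is an assumed placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheckExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt"));
        short target = status.getReplication();   // target replication factor of the file

        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            int actual = block.getHosts().length; // replicas currently reported for this block
            if (actual < target) {
                System.out.println("under-replicated block at offset " + block.getOffset());
            } else if (actual > target) {
                System.out.println("over-replicated block at offset " + block.getOffset());
            }
        }
        fs.close();
    }
}
```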
We must always remember that our first goal is to store big data; our second goal is to process the big data that has been stored.
In Hadoop, the storing of data is taken care of by HDFS, the Hadoop Distributed File System.
The processing part is taken care of by MapReduce.
MapReduce makes use of daemons. In English, a daemon is something like a spirit or ghost; along similar lines, a daemon here means a process that works in the background and cannot be seen.
It has two daemons: the JobTracker and the TaskTracker.
The role of the JobTracker is like that of an operating system: it schedules and provides resources to the TaskTrackers and monitors their work.
The TaskTracker is the one that actually works on the data.
The processing happens in parallel.
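To illustrate the map and reduce roles, here is a minimal, hedged WordCount sketch using the Hadoop MapReduce Java API; in Hadoop 1.x the JobTracker schedules these map and reduce tasks and the TaskTrackers execute them in parallel on the data blocks. Input and output paths are taken from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            // Emit (word, 1) for every token in the input line.
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws java.io.IOException, InterruptedException {
            // Sum the counts for each word.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // In newer Hadoop versions, Job.getInstance(conf, "word count") is preferred.
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```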
Secondary NameNode
The Secondary NameNode is used for taking hourly backups (checkpoints) of the metadata. In case the Hadoop cluster fails or crashes, the hourly checkpoint stored in a file named FsImage can be transferred to a new system; the metadata is loaded on that new system, a new master is created with it, and the cluster is made to run correctly again.
Safemode for the NameNode is a read-only mode for the Hadoop Distributed File System (HDFS) cluster. In Safemode, you can't make any modifications to the file system or blocks. After the DataNodes report that most file system blocks are available, the NameNode automatically leaves Safemode.
Drawback
The FsImage held by the Secondary NameNode gets updated only after a certain interval of time.
Hence, if the NameNode fails within this interval, the changes made since the last checkpoint are lost; the NameNode remains a single point of failure.
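In Hadoop 1.x this interval is controlled by the fs.checkpoint.period property (default 3600 seconds, i.e. hourly); a small hedged sketch of reading it:

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // fs.checkpoint.period: seconds between Secondary NameNode checkpoints (default 3600).
        long checkpointPeriod = conf.getLong("fs.checkpoint.period", 3600);
        System.out.println("Checkpoint period: " + checkpointPeriod + "s");
        // Edits made after the last checkpoint live only in the NameNode's EditLog,
        // which is why a NameNode failure inside this window can lose metadata.
    }
}
```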
The entire architecture can be represented as in the following example: