What is the Hadoop Distributed File System (HDFS) and What are the Design Features of HDFS?
The Hadoop Distributed File System (HDFS) was designed for Big Data storage and processing. HDFS is a core part of Hadoop and is the component used for data storage. It is designed to run on commodity hardware (low-cost and easily available hardware).
Following are the features of the Hadoop Distributed File System (HDFS):
- Distributed and Parallel Computation
- Highly Scalable
- Replication
- Fault Tolerance
- Portability
- Streaming Data Access
1. Distributed and Parallel Computation – This is one of the most important features of the Hadoop Distributed File System (HDFS), and it is what makes Hadoop such a powerful tool for big data storage and processing. The input data is divided into multiple pieces called data blocks, which are stored on different nodes in the HDFS cluster.
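The block-splitting idea can be sketched in a few lines. This is a simplified illustration, not Hadoop's actual code; the only HDFS-specific number used is the default block size of 128 MB.

```python
# Minimal sketch: split a file's size into fixed-size blocks, the way HDFS
# divides a file into data blocks (default block size: 128 MB).
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size in bytes

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the sizes of the data blocks a file of the given size occupies."""
    blocks = []
    remaining = file_size_bytes
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block,
# each of which can be stored on a different node in the cluster.
sizes = split_into_blocks(300 * 1024 * 1024)
print([s // (1024 * 1024) for s in sizes])  # [128, 128, 44]
```

Note that the last block is allowed to be smaller than the block size; HDFS does not pad files up to a full block.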
Let’s understand the advantage of parallel and distributed computing with an example. Suppose it takes 43 minutes to process a 1 TB file on a single machine. How much time will it take to process the same 1 TB file when you have 10 machines of similar configuration in a Hadoop cluster – 43 minutes or 4.3 minutes? It’s 4.3 minutes, because the work is spread across the 10 machines and processed in parallel.
2. Highly Scalable: HDFS is highly scalable, as a single cluster can scale to hundreds of nodes.
There are mainly two types of scaling: Vertical and Horizontal.
In vertical scaling (scale-up), you increase the hardware capacity of a single system: you add more storage, RAM, and CPU power to the existing machine, or buy a new machine with more of each. But vertical scaling comes with two main challenges. First, there is always a limit to how far you can increase a single machine’s hardware capacity, so you cannot keep adding storage, RAM, or CPU indefinitely. Second, to add resources you must stop the machine, upgrade it, and then restart it. The downtime while the system is stopped is itself a challenge.
In horizontal scaling (scale-out), you add more nodes to the existing HDFS cluster rather than increasing the hardware capacity of individual machines. The most important advantage is that machines can be added on the go, i.e. without stopping the system, so there is no downtime. You end up with more machines working in parallel, which is the primary objective of Hadoop.
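The scale-out idea can be illustrated with a toy sketch. The `Cluster` class below is purely hypothetical (it is not Hadoop's API); it only shows that capacity grows by joining nodes while the system keeps running:

```python
# Toy model of horizontal scaling: nodes join a running cluster, so total
# storage capacity grows with no restart and no downtime.
class Cluster:
    def __init__(self):
        self.nodes = []       # storage capacity (TB) of each node
        self.running = True   # the cluster never stops while scaling out

    def add_node(self, storage_tb):
        # Nodes are added on the fly; no shutdown/restart cycle is needed.
        self.nodes.append(storage_tb)

    def total_storage_tb(self):
        return sum(self.nodes)

cluster = Cluster()
for _ in range(3):
    cluster.add_node(10)           # add three 10 TB nodes, one at a time
print(cluster.running)             # True: no downtime at any point
print(cluster.total_storage_tb())  # 30
```

Contrast this with vertical scaling, where `running` would have to become `False` during the upgrade.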
3. Replication – Due to unfavorable or unforeseen conditions, a node in an HDFS cluster containing a data block may fail partially or completely. To overcome such problems, HDFS always maintains copies of each data block on different data nodes in the cluster. If one data node fails, the data is still available to the client on another data node.
By default, HDFS keeps a replication factor of 3, i.e. three copies of each data block; this is configurable through the dfs.replication property.
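A simplified sketch of the placement idea (this is not HDFS's actual rack-aware placement policy): each block's replicas are assigned to distinct data nodes, so that losing any single node loses no data. The node names are made up for illustration.

```python
# Simplified replica placement: pick `replication_factor` DISTINCT data nodes
# for each block, round-robin by block id. HDFS's real policy is rack-aware;
# this sketch only preserves the key invariant of distinct nodes per block.
def place_replicas(block_id, data_nodes, replication_factor=3):
    if replication_factor > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    start = block_id % len(data_nodes)
    rotated = data_nodes[start:] + data_nodes[:start]
    return rotated[:replication_factor]

nodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_replicas(0, nodes))  # ['dn1', 'dn2', 'dn3']
print(place_replicas(2, nodes))  # ['dn3', 'dn4', 'dn1']
```

Because the three replicas always land on three different nodes, any single node failure leaves at least two live copies of every block.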
4. Fault Tolerance – In an HDFS cluster, fault tolerance signifies the robustness of the system in the event of failure of one or more nodes in the cluster. HDFS is highly fault-tolerant: if any node fails, another node containing a copy of the affected data blocks automatically takes over and starts serving client requests.
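The failover behaviour described above can be sketched as follows. The data structures here are hypothetical stand-ins for the bookkeeping the NameNode performs; the point is only that a read falls through to a live replica when the first node is down:

```python
# Fault-tolerance sketch: serve a block read from the first LIVE node that
# holds a replica, skipping any failed nodes.
def read_block(block_id, replica_map, live_nodes):
    """Return the first live node holding a replica of the given block."""
    for node in replica_map[block_id]:
        if node in live_nodes:
            return node
    raise IOError(f"all replicas of block {block_id} are unavailable")

replica_map = {"blk_1": ["dn1", "dn2", "dn3"]}  # three replicas of one block
live_nodes = {"dn2", "dn3"}                     # dn1 has failed

# The client is transparently served from dn2 instead of the failed dn1.
print(read_block("blk_1", replica_map, live_nodes))  # dn2
```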
5. Portability – HDFS is designed so that it can easily be ported from one platform to another. It is implemented in Java, so it runs on any machine with a Java runtime.
6. Streaming Data Access: HDFS follows a write-once/read-many model, which is intended to facilitate streaming reads of large files rather than low-latency random access.
This article discussed what HDFS is and the design features of HDFS. Don’t forget to leave a comment, subscribe to our YouTube channel for more videos, and like the Facebook page for regular updates.