Apache HDFS is a distributed file system for storing large amounts of data in the area of Big Data and distributing it on different computers. This system enables Apache Hadoop to be run in a distributed manner across a large number of nodes, i.e. computers.
What is the Apache Framework Hadoop?
Apache Hadoop is a software framework that can be used to quickly process large amounts of data on distributed systems. It has mechanisms that ensure stable and fault-tolerant functionality so that the tool is ideally suited for data processing in the Big Data environment. The software framework itself is a compilation of four components.
Hadoop Common is a collection of various modules and libraries that support the other components and enable them to work together. Among other things, the Java Archive Files (JAR Files) are stored here, which are required to start Hadoop. In addition, the collection enables the provision of basic services, such as the file system.
The Map-Reduce algorithm has its origins in Google and helps to divide complex computing tasks into more manageable subprocesses and then distribute these across several systems, i.e. scale them horizontally. This significantly reduces the computing time. In the end, the results of the subtasks have to be combined again into the overall result.
The Yet Another Resource Negotiator (YARN) supports the Map-Reduce algorithm by keeping track of the resources within a computer cluster and distributing the subtasks to the individual computers. In addition, it allocates the capacities for the individual processes.
The Apache Hadoop Distributed File System (HDFS) is a scalable file system for storing intermediate or final results, which we will discuss in more detail in this post.
What do we need HDFS for?
Within the cluster, an HDFS is distributed across multiple computers to process large amounts of data quickly and efficiently. The idea behind this is that big data projects and data analyses are based on large volumes of data. Thus, there should be a system that also stores the data in batches and processes it quickly. The HDFS ensures that duplicates of data records are stored in order to be able to cope with the failure of a computer.
- Fast recovery from hardware failures
- Enabling stream data processing
- Processing of huge data sets
- Easy to move to new hardware or software
Structure of a Hadoop Distributed File System
The core of the Hadoop Distributed File System is to distribute the data across different files and computers so that queries can be processed quickly and the user does not have long waiting times. To ensure that the failure of a single machine in the cluster does not lead to the loss of the data, there are targeted replications on different computers to ensure resilience.
Hadoop generally works according to the so-called master-slave principle. Within the computer cluster, we have a node that takes on the role of the so-called master. In our example, this node does not perform any direct computation, but merely distributes the tasks to the so-called slave nodes and coordinates the entire process. The slave nodes in turn read the books and store the word frequency and word distribution.
This principle is also used for data storage. The master distributes information from the data set to different slave nodes and remembers on which computers it has stored which partitions. It also stores the data redundantly in order to be able to compensate for failures. When the user queries the data, the master node then decides which slave nodes it must query in order to obtain the desired information.
The master within the Apache Hadoop Distributed File System is called a Namenode. The slave nodes in turn are the so-called datanodes. From bottom to top, the schematic structure can be understood as follows:
The client writes data to various files, which can be located on different systems, in our example the datanodes on racks 1 and 2. There is usually one datanode per computer in a cluster. These primarily manage the memory that is available to them on a computer. Several files are usually stored in the memory, which in turn are split into so-called blocks.
The task of the name node is to remember which blocks are stored in which datanode. In addition, they manage the files and can open, close, and rename them as needed.
The datanodes in turn are responsible for the read and write processes of the client, i.e. the user. The client also receives the desired information from them in the event of a query. At the same time, the datanodes are also responsible for the replication of data to ensure the fault tolerance of the system.
What are the advantages of the Hadoop Distributed File System?
For many companies, the Hadoop framework is also becoming increasingly interesting as a data lake, i.e. as unstructured storage for large amounts of data, due to the HDFS. Various points play a decisive role here:
- Ability to store large amounts of data in a distributed cluster. In most cases, this is significantly cheaper than storing the information on a single machine.
- High fault tolerance and thus highly available systems.
- Hadoop is open source and therefore free to use and the source code can be viewed
These points explain the increasing adoption of Hadoop and HDFS in many applications.
This is what you should take with you
- HDFS is a distributed file system for storing large amounts of data in the field of Big Data and distributing it on different computers.
- It is part of the Apache Hadoop framework.
- The master node divides the data set into smaller partitions and distributes them on different computers, the so-called slave nodes.
Other Articles on the Topic of HDFS
- You can find detailed documentation on the Hadoop Distributed File System on their site.