Hadoop – explaining the Big Data tool!

Hadoop is a software framework for processing large amounts of data on distributed systems. Its built-in mechanisms for stable, fault-tolerant operation make it well suited to data processing in Big Data environments.

Components of Hadoop

The software framework itself consists of four components.

Hadoop Common is a collection of modules and libraries that support the other components and enable them to work together. Among other things, it contains the Java Archive (JAR) files required to start Hadoop. The collection also provides basic services, such as the file system.

The Map-Reduce algorithm originates from Google. It divides complex computing tasks into more manageable subtasks and distributes them across several systems, i.e. it scales them horizontally. This significantly reduces the computing time. At the end, the results of the subtasks have to be combined into the overall result.
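The three phases can be illustrated with a minimal, single-process sketch in Python. It is not Hadoop's API: the names `map_reduce`, `length_mapper` and `sum_reducer` are made up for this illustration, and everything runs on one machine, whereas a real cluster would spread the map and reduce calls across several computers.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Toy model of the map, shuffle and reduce phases (runs in one process)."""
    groups = defaultdict(list)
    for record in records:
        # Map: every input record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: collect all values that share the same key.
            groups[key].append(value)
    # Reduce: collapse the values of each key into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


def length_mapper(word):
    # Hypothetical mapper: emit the word length as key and a count of 1.
    yield len(word), 1


def sum_reducer(key, values):
    # Hypothetical reducer: add up all counts for one key.
    return sum(values)


words = ["big", "data", "tool", "map", "reduce"]
result = map_reduce(words, length_mapper, sum_reducer)
print(result)  # {3: 2, 4: 2, 6: 1}
```

Each mapper call only looks at one record and each reducer call only looks at one key. This independence is what makes it possible to hand the calls to different machines.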

Yet Another Resource Negotiator (YARN) supports the Map-Reduce algorithm by keeping track of the resources within a computer cluster and distributing the subtasks to the individual computers. It also allocates capacity to the individual processes.

The Hadoop Distributed File System (HDFS) is a scalable file system for storing intermediate or final results. Within the cluster, it is distributed across multiple computers so that large amounts of data can be processed quickly and efficiently. The idea behind it is that Big Data projects and data analyses rely on large amounts of data, so there should be a system that stores the data in batches and processes it quickly. HDFS also stores copies of the data on several computers so that the failure of a single computer can be tolerated.
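The effect of storing duplicates can be sketched with a toy model. HDFS splits files into blocks and keeps several copies of each block (three by default). The round-robin placement below is an assumption made for this illustration only, since real HDFS uses a rack-aware placement policy. The function names and node names are hypothetical too.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign every block to `replication` distinct nodes (simple round-robin).

    Assumption: round-robin stands in for HDFS's real placement policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def survives_failure(placement, failed_node):
    """True if every block still has a copy on a node other than failed_node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


blocks = split_into_blocks(b"x" * 10, block_size=4)   # block sizes 4, 4 and 2 bytes
nodes = ["node1", "node2", "node3", "node4"]          # hypothetical node names
placement = place_replicas(len(blocks), nodes)
print(placement[0])                          # ['node1', 'node2', 'node3']
print(survives_failure(placement, "node2"))  # True
```

With a replication factor of 1, the same check would fail for any node that holds a block, which is the situation the duplicates are meant to prevent.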

How does Hadoop work?

Suppose we want to evaluate the word distribution in one million German books. This would be a daunting task for a single computer. In this example, the Map-Reduce algorithm would first divide the overall task into more manageable subtasks, for example by looking at each book individually and determining the word frequencies and word distribution for that book. The individual books would thus be distributed to the nodes, and a result table would be created for each work.

Within the computer cluster we have a node that assumes the role of the so-called master. In our example, this node does not perform any direct calculation, but merely distributes the tasks to the so-called slave nodes and coordinates the entire process. The slave nodes in turn read the books and store the word frequency and the word distribution.
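This division of labor can be sketched in a few lines of Python. The sketch is a simplification, not how Hadoop is implemented: a thread pool stands in for the cluster nodes, two short sentences stand in for the books, and `count_words` and `run_master` are hypothetical names.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(book_text):
    """Worker task: build the word-frequency table for a single book."""
    words = re.findall(r"\w+", book_text.lower())
    return Counter(words)


def run_master(books, num_workers=3):
    """Master: hands out one task per book and collects the result tables.

    The master does no counting itself; it only distributes and coordinates.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map returns the results in the same order as the input books.
        return list(pool.map(count_words, books))


# Two tiny stand-ins for the books of the example.
books = [
    "Der Hund und die Katze",
    "Die Katze und die Maus",
]
tables = run_master(books)
print(tables[1]["die"])  # 2
```

Each call to `count_words` needs only its own book, so the tasks can run on different nodes without talking to each other.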

Once this step is complete, we can continue to work only with the result tables and no longer need the memory-intensive source books. The final task, i.e. aggregating the intermediate tables and calculating the final result, can then also be parallelized again or, depending on the effort involved, taken over by a single node.
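The final aggregation can be sketched as follows, assuming the per-book result tables are available as Python `Counter` objects. The two small tables and the function names are made up for this illustration.

```python
from collections import Counter
from functools import reduce


def merge_tables(left, right):
    """Combine two word-frequency tables by adding the counts per word."""
    return left + right


def aggregate(tables):
    """Fold all intermediate tables into one overall table."""
    return reduce(merge_tables, tables, Counter())


def relative_distribution(total):
    """Turn absolute word counts into shares of all counted words."""
    number_of_words = sum(total.values())
    return {word: count / number_of_words for word, count in total.items()}


# Hypothetical result tables of two books.
tables = [
    Counter({"die": 1, "katze": 1}),
    Counter({"die": 2, "maus": 1}),
]
total = aggregate(tables)
print(total["die"])  # 3
distribution = relative_distribution(total)
```

Because adding tables gives the same result regardless of how they are grouped, several nodes could each merge a part of the tables and a last node could merge their partial results. This is what makes the parallel variant of the final step possible.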

Differences between Hadoop and a relational database

Hadoop differs from a comparable relational database in several fundamental ways.

| Property | Relational Database | Hadoop |
|---|---|---|
| Data Types | Structured data only | All data types (structured, semi-structured and unstructured) |
| Amount of Data | Little to medium (in the range of a few GB) | Large amounts of data (in the range of terabytes or petabytes) |
| Query Language | SQL | HQL (Hive Query Language) |
| Data Schema | Static schema (schema on write) | Dynamic schema (schema on read) |
| Costs | License costs depending on the database | Free |
| Data Objects | Relational tables | Key-value pairs |
| Scaling Type | Vertical scaling (the computer needs better hardware) | Horizontal scaling (more computers can be added to handle the load) |
Comparison of Hadoop and a relational database
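The difference between schema on write and schema on read can be made concrete with a small sketch. It uses neither Hadoop nor Hive. SQLite stands in for a relational database and a list of JSON lines stands in for raw files. The table, the records and the helper `read_pages` are invented for this illustration.

```python
import json
import sqlite3

# Schema on write: the structure is fixed before any data is stored,
# and records that do not fit are rejected at insert time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT NOT NULL, pages INTEGER NOT NULL)")
conn.execute("INSERT INTO books VALUES (?, ?)", ("Book A", 100))
try:
    conn.execute("INSERT INTO books (title) VALUES (?)", ("Book B",))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the missing page count violates the schema

# Schema on read: raw records are stored as they are,
# and a structure is only applied when the data is read.
raw_lines = [
    '{"title": "Book A", "pages": 100}',
    '{"title": "Book B"}',
    '{"title": "Book C", "language": "de"}',
]


def read_pages(line, default=None):
    """Interpret one raw record at read time; missing fields get a default."""
    record = json.loads(line)
    return record.get("pages", default)


pages = [read_pages(line) for line in raw_lines]
print(rejected)  # True
print(pages)     # [100, None, None]
```

The second approach accepts incomplete or differently shaped records, but it moves the work of handling them to every program that reads the data.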

This is what you should take with you

  • Hadoop is a software framework that can be used to process large amounts of data quickly.
  • The framework consists of Hadoop Common, the Map-Reduce algorithm, Yet Another Resource Negotiator (YARN) and the Hadoop Distributed File System (HDFS).
  • It differs in many ways from a comparable relational database. It should be decided on a case-by-case basis how best to process and store the data.

Other Articles on the Topic of Hadoop

  • Hadoop’s documentation provides insightful guidance on downloading and setting up the system.
  • The table of differences between Hadoop and a relational database is based on the work of the colleagues at datasolut.com.