The data lake stores large amounts of data from different sources in raw, i.e. unprocessed, form so that they can be used for Big Data analyses. This data can be structured, semi-structured, or unstructured. The information is stored there until it is needed for an analysis.
Why do companies use a data lake?
The data stored in a data lake can be structured, semi-structured, or unstructured. Semi-structured and unstructured data are a poor fit for relational databases, which form the basis of many data warehouses.
In addition, relational databases are typically difficult to scale horizontally, which means that storing data in a data warehouse can become very expensive above a certain volume. Horizontal scalability means that the system can be distributed across several computers so that the load does not rest on a single machine. This offers the advantage that under heavy load, e.g. with many simultaneous queries, additional machines can simply be added temporarily as needed.
Traditional relational databases cannot easily divide a dataset among many computers without putting data consistency at risk. For this reason, it is very expensive to store large amounts of data in a data warehouse. Most systems used for data lakes, such as NoSQL databases or Hadoop, are horizontally scalable and can therefore grow more easily. These are the main reasons why the data lake, along with the data warehouse, is one of the cornerstones of the data architecture of many companies.
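To make the idea of horizontal scaling more concrete, the following is a minimal sketch of how records could be spread across several machines by hashing their keys. The node names and the `node_for_key` and `distribute` helpers are hypothetical and only serve as an illustration; real systems such as Hadoop or NoSQL databases use their own, more elaborate placement strategies.

```python
import hashlib

def node_for_key(key: str, nodes: list) -> str:
    """Pick a node for a record key using a stable hash (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

def distribute(keys, nodes):
    """Assign every key to exactly one node and return the placement."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for_key(key, nodes)].append(key)
    return placement

# Hypothetical record keys and node names.
keys = [f"record-{i}" for i in range(1000)]
three = distribute(keys, ["node-a", "node-b", "node-c"])
# Under load, a fourth machine is added and the keys are spread again.
four = distribute(keys, ["node-a", "node-b", "node-c", "node-d"])

print({node: len(assigned) for node, assigned in three.items()})
print({node: len(assigned) for node, assigned in four.items()})
```

With a fourth node, each machine holds a smaller share of the records. Note that this simple modulo scheme moves many keys to a different node whenever the number of nodes changes, which is one reason production systems tend to use more refined partitioning schemes.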
The exact implementation of a data lake varies from company to company. However, there are some basic principles that should apply to the data architecture of the data lake:
- All data can be included. In principle, all information from the source systems can be loaded into the data lake.
- The data does not have to be processed first, but can be loaded and stored in its original form.
- The data is only processed and prepared once there is a specific use case with requirements for the data. This procedure is also called schema-on-read.
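The schema-on-read principle from the list above can be sketched in a few lines. The raw lines, the field names, and the `read_with_schema` helper are assumptions made up for this example; the point is only that the stored data stays untouched and each use case applies its own schema when reading.

```python
import json

# Raw events exactly as they arrived from hypothetical source systems.
RAW_LINES = [
    '{"user": "anna", "amount": "19.90", "ts": "2023-01-05"}',
    '{"user": "ben", "amount": 5, "extra": {"coupon": "X1"}}',
    'not valid json at all',
]

def read_with_schema(lines, schema):
    """Apply a schema while reading; the stored lines are never modified."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # this use case simply skips unparseable lines
        if not isinstance(record, dict):
            continue
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None instead of failing the whole read.
            row[field] = cast(value) if value is not None else None
        yield row

# The schema is defined only now, for one specific analysis.
revenue_schema = {"user": str, "amount": float}
rows = list(read_with_schema(RAW_LINES, revenue_schema))
print(rows)
# [{'user': 'anna', 'amount': 19.9}, {'user': 'ben', 'amount': 5.0}]
```

A different analysis could read the same raw lines with another schema, for example `{"user": str, "ts": str}`, without any change to the stored data.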
In addition, some basic points should be considered when storing data, regardless of the system:
- Common folder structure with uniform naming conventions so that data can be found quickly and easily.
- Creation of a data catalog that names the origin of the files and briefly explains the individual data.
- Data screening tools so that it is quickly apparent how good the data quality is and whether there are any missing values, for example.
- Standardized data access, so that authorizations can be assigned and it is clear who has access to the data.
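As an illustration of the first three points, the sketch below builds file paths from one uniform convention, records a small catalog entry, and screens a sample for missing values. The path layout, the `CatalogEntry` fields, and the sample records are hypothetical; an actual data lake would define its own conventions and would usually rely on dedicated catalog and data quality tools.

```python
from dataclasses import dataclass
from datetime import date

def lake_path(source: str, dataset: str, day: date, filename: str) -> str:
    """Build a path following one uniform (hypothetical) naming convention."""
    return (
        f"raw/{source.lower()}/{dataset.lower()}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/{filename}"
    )

@dataclass
class CatalogEntry:
    """Minimal catalog record: where the file is, where it came from, what it is."""
    path: str
    origin: str
    description: str

def missing_ratio(records, fields):
    """Share of records in which each field is absent or None."""
    total = len(records)
    return {
        field: (sum(1 for r in records if r.get(field) is None) / total if total else 0.0)
        for field in fields
    }

path = lake_path("CRM", "Customers", date(2023, 1, 5), "export.json")
entry = CatalogEntry(path=path, origin="CRM nightly export",
                     description="Customer master data")

sample = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3},
    {"id": 4, "email": "d@example.com"},
]

print(entry.path)
# raw/crm/customers/year=2023/month=01/day=05/export.json
print(missing_ratio(sample, ["id", "email"]))
# {'id': 0.0, 'email': 0.5}
```

Encoding the source, dataset, and date in the path means a file can be located without opening it, and the missing-value ratio gives a first quick impression of data quality.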
Differences from the data warehouse
The data warehouse can be supplemented by a data lake, in which unstructured raw data is stored temporarily at low cost so that it can be used at a later date. The two concepts differ primarily in the data they store and in the way that data is stored.
| Features | Data Warehouse | Data Lake |
|---|---|---|
| Data | Relational data from productive systems or other databases | All data types (structured, semi-structured, unstructured) |
| Data Schema | Can be defined either before the data warehouse is created or only during the analysis (schema-on-write or schema-on-read) | Defined exclusively at the time of analysis (schema-on-read) |
| Query | Very fast query results with local storage | Decoupling of computation and storage; fast query results with inexpensive storage |
| Data Quality | Pre-processed data from different sources; single point of truth | Raw data; processed and unprocessed |
| Applications | Business intelligence and graphical preparation of data | Artificial Intelligence, Analytics, Business Intelligence, Big Data |
This is what you should take with you
- The data lake refers to a large data store that stores data in raw format from source systems so that it is available for later analysis.
- It can store structured, semi-structured, and unstructured data.
- It differs fundamentally from data warehouses in that it stores unprocessed data and the data is not prepared until there is a specific use case.
- Its primary purpose is data retention: the data is kept available until it is needed for an analysis.
Other Articles on the Topic of Data Lakes
- Amazon AWS offers a detailed theoretical explanation on the topic of data lakes and also the options for building them in the cloud.