A data warehouse is a central data store in a company or organization that collects relational data from various sources. The information is transferred from different transactional systems or other relational databases to the data warehouse, where it is available to analysts and decision makers.
Data Warehouse Function
The data warehouse is used in many areas in the business environment. The database is used throughout the company to make data-driven decisions or to examine processes. Since the central data warehouse draws information from many different systems, it is seen as a single point of truth. This is to ensure that everyone in the company is talking about the same data and that decisions are based on this information.
Across departments, the data warehouse can be used for the following tasks:
- Cost and resource analysis
- Analysis of internal processes (e.g. production, hiring, etc.)
- Business Intelligence
- Calculation and provision of company-wide key performance indicators
- Data source for analyses or data mining
- Standardization of company-wide data into a fixed schema
Properties of Data Warehouses
When creating central data warehouses, one can be guided by certain characteristics that should help to better narrow down the structure and the necessary data of the warehouse.
A data warehouse contains information on a specific topic and not on individual business transactions. These topics can be, for example, sales, purchasing or marketing.
The warehouse aims to support decision-making with the help of business intelligence and targeted KPIs. This interpretation is also supported by the fact that information which is not relevant to decision-making or used for analysis does not end up in this central database in the first place.
The warehouse integrates data from a wide variety of systems and sources. Therefore, a common schema must be created for the information so that it is uniform and comparable. Otherwise, central analysis and KPI creation is not possible.
Time Period Reference
The data warehouse stores data from a specific time period and is therefore related to the past. Furthermore, the data is usually transmitted in aggregated form, for example at the daily level, so that the volume of data remains limited. Thus, the granularity may not be fine enough, as one is used to from the operational systems.
The operational systems, on the other hand, are time-based, as they output the information that is currently accumulating. At the same time, the information can be viewed in great detail.
Another important feature of a central warehouses is the non-volatility of the data. In operational systems, information is usually only stored temporarily for a short period of time, and as soon as new data is added, old data is overwritten. In a data warehouse, on the other hand, data is stored permanently and old data persists even when newer data is added.
Difference to Data Lake
This central database can additionally be supplemented by a data lake, in which unstructured raw data is stored temporarily at low cost so that it can be used at a later date. The two concepts differ primarily in the data they store and the way the information is stored.
|Features||Data Warehouse||Data Lake|
|Data||Relational data from productive systems or other databases.||All Data Types (structured, semi-structured, unstructured).|
|Data Schema||Can be scheduled either before the data warehouse is created or only during the analysis (schema-on-write or schema-on-read)||Exclusively at the time of analysis (schema-on-read)|
|Query||With local memory very fast query results||– Decoupling of calculations and memory|
– Fast query results with inexpensive memory
|Data Quality||– Pre-processed data from different sources|
– Single point of truth
|– Raw data|
– Processed and unprocessed
|Applications||Business intelligence and graphical preparation of data||Artificial Intelligence, Analytics, Business Intelligence, Big Data|
This is what you should take with you
- A data warehouse stores company-wide information centrally.
- This is intended to support data-driven decisions and make business intelligence possible.
- The unstructured raw data in the data lake provides a good complement to the relational and processed data in the data warehouse.
Explanation of the Apache Hadoop Distributed File System with examples and benefits.
Other Articles on the Topic of Data Warehouses
- Amazon Web Services provides a good summary here.