What is a Data Stack?
In recent years, even smaller companies have reached the point where they need to process, prepare, and store large amounts of data. In the past, this meant immense investments in on-premise servers and software. With the widespread adoption of cloud services, however, this has changed, and the cost of such infrastructure has dropped significantly. At the same time, a wide variety of tools now claim to be part of the Modern Data Stack.
As data has begun to play an ever-increasing role in business, the concept of the technology stack has also been applied to data. The data stack includes all programming languages, tools, and frameworks used to capture, route, store, and visualize data.
How has the Data Stack changed over time?
The selection of tools and frameworks to help with data handling is constantly expanding. Thus, the concept of the data stack is also subject to constant change. The following fundamental changes had the greatest influence on the development of the data stack:
Evolution from on-premise Hardware to Cloud Services
When the term data stack was introduced, cloud services were still unthinkable. Data processing therefore always meant acquiring your own servers, which had to be installed, supervised, and maintained. This was an immense cost factor and posed great challenges, especially for smaller companies, which often could not afford trained personnel and found such specialists hard to recruit in the first place.
In the early 2010s, this changed with the introduction of cloud services such as Amazon Redshift. Since then, more and more so-called Software as a Service products have been offered, which make it possible to rent the entire infrastructure instead of buying and operating it.
Shift from ETL to ELT
In the early days, data storage was also tied to expensive hardware, as hard disks and processing units were very cost-intensive. In addition, there were only relational databases, which stored the data in tables and thus required a fixed structure. In the company's internal data warehouse, data therefore first had to be prepared within the ETL process (Extract-Transform-Load) before it could be stored.
With the introduction of NoSQL databases and falling storage costs, the process changed to ELT (Extract-Load-Transform). The NoSQL variants were much freer in terms of data structure and could also store complex data. In addition, the so-called data lake became established, in which mainly raw data is stored that does not yet have a fixed purpose. As soon as a business analyst needs data from it, they can query it and apply the transformation only at that point. As a result, the role of the data engineer became less central, since the upfront preparation of the data no longer played such a major role.
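The ELT pattern described above can be sketched in a few lines: raw records are loaded without a fixed schema, and structure is imposed only when the data is actually queried. The following Python snippet is a minimal illustration using an in-memory SQLite database; all table names, field names, and records are hypothetical.

```python
import json
import sqlite3

# Hypothetical raw events, as they might arrive from a source system.
raw_events = [
    {"user": "a", "amount": 10.0, "meta": {"channel": "web"}},
    {"user": "b", "amount": 5.5, "meta": {"channel": "app"}},
]

con = sqlite3.connect(":memory:")

# Extract & Load: store the raw records as-is, without a fixed schema.
con.execute("CREATE TABLE raw_events (payload TEXT)")
con.executemany(
    "INSERT INTO raw_events VALUES (?)",
    [(json.dumps(e),) for e in raw_events],
)

# Transform: structure is imposed only when an analyst needs the data.
payloads = [json.loads(p) for (p,) in con.execute("SELECT payload FROM raw_events")]
report = [(r["user"], r["amount"]) for r in payloads]
print(report)  # [('a', 10.0), ('b', 5.5)]
```

In a real data lake, the staging layer would of course be object storage or a warehouse rather than SQLite, but the principle is the same: the schema decision is deferred from load time to query time.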
Broad Data Access in Companies
The analysis of data has become increasingly important in recent years, and there is hardly a department that does not need to access information in some way. At the same time, data analysis skills are being taught in more and more degree programs, so more and more people can and want to work with the data. This created the need for easy, self-service data access instead of always having to rely on a business analyst.
What’s the difference between Modern Data Stack and the former Legacy Data Stack?
The main difference between the former so-called Legacy Data Stack and today's Modern Data Stack is the move away from on-premise hardware toward solutions hosted in the cloud. This makes it possible to use Infrastructure as a Service or even Software as a Service offerings.
The setup, maintenance, and further development of the hardware are then no longer the responsibility of the company using it, but of the provider, who in turn benefits from economies of scale by offering the same services to many customers.
For the customer, this also makes the Modern Data Stack more scalable. If more storage space or additional user accounts are needed, the customer can simply extend the existing subscription. The additional costs are low and completely transparent. With an on-premise solution, the hardware can be used until performance or storage limits are reached; then expensive hardware upgrades are necessary to handle more data or users. As a result, the cost efficiency of the Legacy Data Stack is also significantly worse.
What is the Modern Data Stack?
Now that we have learned where the term data stack came from and how it has changed over time, we can finally figure out what a Modern Data Stack looks like.
The Modern Data Stack is cloud-based and provides a data warehouse that resides in the cloud. This provides fast and efficient data processing. Optimally, the tools for data transport are also hosted in the cloud and have a direct connection to the data warehouse. The same applies to the downstream analysis tools.
This structure also means that the ELT process is used more frequently in the modern data stack instead of the conventional ETL process.
When selecting the tools, it should be taken into account that the individual components are interchangeable and can therefore react flexibly to new software or frameworks by integrating them. As we noted earlier, the concept of the data stack is constantly changing, which is why the requirements for the Modern Data Stack are not set in stone.
Rather, what matters is that the following principles are followed:
- Easy to Use: The tools must be easy to use and easy to deploy.
- Scalability: The frameworks used should be scalable in order to be able to react quickly to changing conditions.
- Composable: Each component should be interchangeable and replaceable, so that there is no lock-in to any single technology and new tools can be adopted as they emerge.
What are the benefits of a Modern Data Stack?
Moving to a Modern Data Stack can be worthwhile for many reasons. The most common benefits are explained below.
Put Business back in Focus
The former Legacy Data Stack was very IT- and technology-driven, as the company had to deal with the concrete server architecture, security concepts, system maintenance, and much more. With all these considerations, the concrete business use case often took a back seat. Conversely, many commercially viable applications may never have been implemented because the necessary technical requirements were not met. All this changes with the Modern Data Stack.
The Modern Data Stack puts many of these issues on the back burner, as they are no longer the responsibility of the users, but of the companies that provide the software. This allows the focus to be entirely on the business problems that data is supposed to help solve.
High Cost- and Resource-Efficiency
The Modern Data Stack stands out because it is highly scalable. New resources can easily be added when necessary and canceled promptly when they are no longer needed. The additional costs for a new user or more performance are transparent and calculable. With the Legacy Data Stack, major unforeseeable costs could arise at any time, for example for hardware replacement. All these risks now lie with the provider.
On the other hand, the Modern Data Stack no longer ties up as much personnel as was the case a few years ago. Well-trained IT staff can thus be used for other projects or do not have to be built up at great expense. In addition, the tools are relatively easy to use and are thus available to a broad range of employees who do not necessarily need special training.
Investments in on-premise hardware are usually very high and must therefore be well thought out, as they must be used for several years to be worthwhile. In addition, the hardware must be designed to withstand peak loads. This results in the hardware being underutilized most of the time.
The Modern Data Stack is much more agile in this respect, as cloud services can be added as soon as necessary, even for a short period of time. This means that the infrastructure is not only easy to scale for a growing company but can also be used optimally throughout the day.
Who maintains the Modern Data Stack?
For previous data stacks based on on-premise hardware, several positions are needed to build data pipelines and use the data.
The Data Engineer ensures that the data transport runs smoothly. This includes not only data acquisition but also transformation and, finally, loading into the target database. The role requires maintaining an overview of the data architecture and being proficient in query languages such as SQL.
Specific tasks may include:
- Finding the right data sets to implement the requirements of the business side.
- Developing algorithms to prepare and cleanse the source data so that data scientists can easily use it.
- Creating ETL pipelines that procure data from source systems, prepare it, and load it into a target database, and continuously testing these pipelines for functionality.
- Ensuring in all of these tasks that data governance concepts are adhered to and that all users have only the necessary permissions.
Based on what the Data Engineer has prepared, the Business Analyst can then start building concrete reports that support users in their decisions. They take the requirements of the business departments and try to build dashboards and analyses that answer concrete questions.
Specific tasks may include:
- Gathering and recording business requirements
- Conversion of requirements into technically feasible concepts
- Analysis and preparation of processes
- Participation in project teams consisting of representatives from business and IT
- Management of the introduction and implementation of the proposed concepts
- Mediation between technical and business departments
What are the tasks of the Analytics Engineer?
The two roles presented are blurring more and more with the Modern Data Stack. On the one hand, managing the data platform has become much easier and no longer needs elaborate coordination. On the other hand, many people in the company can analyze their data themselves with the simple tools of the Modern Data Stack and only need support for very complicated and overarching analyses.
This has given rise to the Analytics Engineer position. This role carries more of an end-to-end responsibility for operating the data stack: it is expected to cover all steps, from data provision to use by the end user.
This also eliminates further inefficiencies. The expanded scope of responsibility allows the Analytics Engineer to keep track of different projects and their overlaps, so that certain pipelines or tables can be reused across projects when they rely on the same underlying data.
This is what you should take with you
- The Modern Data Stack provides the ability for many organizations to leverage big data.
- Until now, this was only available to organizations that could make large investments in on-premise hardware and provide the necessary personnel.
- With the widespread adoption of cloud services and Software as a Service products, this has now changed dramatically.
- This results in many advantages, including high agility, as services can be easily added as needed and the costs for these are also much more transparent.