What is a Data Stack?
In recent years, even smaller companies have reached the point where they need to process, prepare, and store large amounts of data. In the past, this meant immense investments in on-premise servers and software. With the widespread adoption of cloud services, however, this has changed, and the cost of such infrastructure has dropped significantly. At the same time, a wide variety of tools now claim to be part of the modern data stack.
The data stack is a variation of the technology stack from software engineering. The technology stack includes all the technologies that developers use to build an application, for example the programming languages used, such as JavaScript, HTML, or CSS, as well as frameworks. It provides a quick overview of the technologies used in a larger project. This can be helpful, for example, when hiring new employees, as it shows which skills a candidate must bring. The more of the technology stack a person knows, the more valuable they are for the project.
As data has begun to play an ever-increasing role in business, the concept of the technology stack has been applied to data. The data stack includes all programming languages, tools, and frameworks used to capture, route, store, and visualize data.
How has the Data Stack changed over time?
The selection of tools and frameworks to help with data handling is constantly expanding. Thus, the concept of the data stack is also subject to constant change. The following fundamental changes had the greatest influence on the development of the data stack:
Evolution from on-premise Hardware to Cloud Services
When the term data stack was introduced, cloud services were still unthinkable. Data processing therefore always meant purchasing one's own servers, which had to be installed, supervised, and maintained. This was an immense cost factor and posed great challenges, especially for smaller companies, which either could not afford trained personnel or found it very difficult to recruit them.
In the early 2010s, this changed with the introduction of cloud services such as Amazon Redshift. Since then, more and more so-called Software as a Service products have been offered, which make it possible to rent the complete infrastructure instead of owning it.
Shift from ETL to ELT
In the early days, data storage was also associated with expensive hardware components, as hard disks and processing units were very cost-intensive. In addition, there were only relational databases, which stored the data in tables and thus required a fixed structure. Before data could be stored in the company's internal data warehouse, it first had to be prepared within the ETL process (Extract, Transform, Load).
With the introduction of NoSQL databases and falling hard disk costs, the process changed to ELT (Extract, Load, Transform). The NoSQL variants were much freer in terms of data structure and could also store complex data. Furthermore, the so-called data lake became established, in which mainly raw data is stored that does not yet have a fixed purpose. As soon as a business analyst needs this data, they can query it and transform it only at that point. As a result, the classic role of the data engineer lost importance, since the upfront preparation of data no longer played such a major role.
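To make the difference concrete, here is a minimal Python sketch of both patterns. SQLite stands in for a cloud data warehouse, and the tables, columns, and sample rows are invented for illustration; in a real stack, the connection would point to the warehouse itself.

```python
import sqlite3

# SQLite stands in for a cloud data warehouse; tables and rows are invented.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_orders (order_date TEXT, currency TEXT, amount TEXT)")
con.execute("CREATE TABLE orders (order_date TEXT, currency TEXT, amount REAL)")

raw_rows = [("2023-01-05", "EUR", "119.00"), ("2023-01-06", "USD", "80.50")]

# ETL: transform in application code first, then load the finished result.
cleaned = [(d, c, float(a)) for d, c, a in raw_rows]
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", cleaned)

# ELT: load the raw data unchanged, then transform it later with SQL
# inside the warehouse, only once somebody actually needs it.
con.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)
con.execute(
    "INSERT INTO orders "
    "SELECT order_date, currency, CAST(amount AS REAL) FROM raw_orders"
)

print(con.execute("SELECT * FROM orders").fetchall())
```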
Broad Data Access in Companies
The analysis of data has become increasingly important in recent years, and there is hardly a department left that does not need to access information in some way. At the same time, data analysis skills are being taught in more and more degree programs, so more and more people can, and want to, access the data. This created the need for easy self-service data access instead of always having to rely on a business analyst.
What’s the difference between Modern Data Stack and the former Legacy Data Stack?
The main difference between the former, so-called legacy data stack and today's Modern Data Stack is the move away from on-premise hardware toward solutions hosted in the cloud. This makes it possible to use Infrastructure as a Service or even Software as a Service offerings.
The setup, maintenance, and further development of the hardware are then no longer in the hands of the company using it, but of the provider. The provider in turn benefits from economies of scale, as it offers the same services to many customers.
For the customer, this also makes the Modern Data Stack more scalable. If more storage space or additional user accounts are needed, the customer can simply extend the existing subscription; the additional costs are low and completely transparent. With an on-premise solution, the hardware can only be used until performance or storage limits are reached, after which expensive hardware changes are necessary to handle more data or users. As a result, the cost efficiency of the legacy data stack is significantly worse.
What is the Modern Data Stack?
Now that we have learned where the term data stack came from and how it has changed over time, we can finally figure out what a Modern Data Stack looks like.
The Modern Data Stack is cloud-based and centers on a data warehouse that resides in the cloud, which enables fast and efficient data processing. Ideally, the tools for data transport are also hosted in the cloud and have a direct connection to the data warehouse. The same applies to the downstream analysis tools.
This architecture also means that the Modern Data Stack typically relies on the ELT process rather than the conventional ETL process.
When selecting the tools, care should be taken that the individual components remain interchangeable, so that new software or frameworks can be integrated flexibly. As we noted earlier, the concept of the data stack is constantly changing, which is why the requirements for the Modern Data Stack are not set in stone.
Rather, what matters is that the following principles are followed:
- Ease of use: The tools must be easy to use and easy to deploy.
- Scalability: The frameworks used should be scalable in order to react quickly to changing conditions.
- Composability: Each of the components should be interchangeable and replaceable, so that there is no lock-in to any single technology and new developments can take advantage of better tools. A minimal sketch of this idea follows the list.
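The following hypothetical Python sketch illustrates the composability principle. The Destination protocol, the toy warehouse, and the ingest function are all invented for this example; the point is only that the pipeline depends on an interface rather than on a concrete product.

```python
from typing import Iterable, Protocol

class Destination(Protocol):
    """Any storage target the stack can load into; implementations are swappable."""
    def load(self, table: str, rows: Iterable[tuple]) -> None: ...

class InMemoryWarehouse:
    """Toy destination used here in place of a real cloud warehouse."""
    def __init__(self) -> None:
        self.tables: dict[str, list[tuple]] = {}

    def load(self, table: str, rows: Iterable[tuple]) -> None:
        self.tables.setdefault(table, []).extend(rows)

def ingest(dest: Destination, table: str, rows: Iterable[tuple]) -> None:
    # The pipeline only knows the interface, not the product behind it,
    # so the warehouse can be replaced without touching this code.
    dest.load(table, rows)

warehouse = InMemoryWarehouse()
ingest(warehouse, "orders", [("2023-01-05", 119.0)])
print(warehouse.tables)
```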
What are the benefits of a Modern Data Stack?
Moving to a Modern Data Stack can be worthwhile for many reasons. The most common benefits are explained below.
Put Business back in Focus
The former legacy data stack was very IT- and technology-driven, as the company had to deal with the concrete server architecture, security concepts, maintenance of the systems, and much more. With all these considerations, the concrete business use case often took a back seat. Conversely, many commercially viable applications were likely never implemented because the necessary technical requirements were not met. All this changes with the Modern Data Stack.
The Modern Data Stack puts many of these issues on the back burner, as they are no longer the responsibility of the users, but of the companies that provide the software. This allows the focus to be entirely on the business problems that data is supposed to help solve.
High Cost and Resource Efficiency
The Modern Data Stack stands out because it is highly scalable. New resources can easily be added when necessary and canceled promptly when they are no longer needed. The additional costs for a new user or more performance are transparent and predictable. With the legacy data stack, major unforeseeable costs could arise at any time, for example for hardware replacement. All these risks now lie with the provider.
At the same time, the Modern Data Stack does not tie up as many personnel as was the case a few years ago. As a result, well-trained IT staff can be deployed on other projects or do not have to be built up at great expense. In addition, the tools are comparatively easy to use and are thus accessible to a broad range of people in the company who do not necessarily need special training.
High Agility
Investments in on-premise hardware are usually very high and must therefore be well thought out, as the hardware has to be used for several years to pay off. In addition, it must be sized to withstand peak loads, which means it is underutilized most of the time.
The Modern Data Stack is much more agile in this respect, as cloud services can be added as soon as they are needed, even for a short period. This means that the infrastructure is not only easy to scale for a growing company but can also be utilized optimally throughout the day.
Who maintains the Modern Data Stack?
In previous data stacks based on on-premise hardware, several roles were needed to build data pipelines and put the data to use.
Data Engineer
The Data Engineer ensures that the data transport runs smoothly. This includes not only data acquisition but also transformation and, finally, loading into the target database. It is important to keep an overview of the data architecture and to be proficient with query languages such as SQL.
Specific tasks may include:
- Finding the right data sets to implement the requirements from the business side.
- Developing algorithms that prepare and cleanse the source data so that data scientists can easily use it.
- Building ETL pipelines that fetch data from source systems, prepare it, and load it into a target database, and continuously testing them for correct functionality (a minimal sketch follows this list).
- Ensuring in all of these tasks that data governance concepts are adhered to, so that all users have the necessary permissions.
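As an illustration, here is a hypothetical Python skeleton of such a pipeline together with a tiny functional test. The source records, the cleansing rules, and the SQLite target are all invented for this sketch; a real pipeline would read from actual source systems and load into the company's warehouse.

```python
import sqlite3

def extract() -> list[dict]:
    # Invented source records; in practice this would query a source system or API.
    return [{"customer": " Alice ", "revenue": "100.0"},
            {"customer": "Bob", "revenue": "-5.0"}]

def transform(rows: list[dict]) -> list[tuple]:
    # Cleanse: trim names, cast types, and drop implausible records.
    return [(r["customer"].strip(), float(r["revenue"]))
            for r in rows if float(r["revenue"]) >= 0]

def load(rows: list[tuple], con: sqlite3.Connection) -> None:
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, revenue REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)

def test_pipeline() -> None:
    # The kind of check that should run continuously to verify the pipeline works.
    con = sqlite3.connect(":memory:")
    load(transform(extract()), con)
    assert con.execute("SELECT COUNT(*) FROM sales").fetchone()[0] == 1

test_pipeline()
```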
Business Analyst
Based on the data the Data Engineer has prepared, the Business Analyst can then start building concrete reports that support users in their decisions. They take the requirements of the business departments and build dashboards and analyses that answer concrete questions.
Specific tasks may include:
- Gathering and recording business requirements
- Conversion of requirements into technically feasible concepts
- Analysis and preparation of processes
- Participation in project teams consisting of representatives from business and IT
- Management of the introduction and implementation of the proposed concepts
- Mediation between technical and business departments
What are the tasks of the Analytics Engineer?
With the Modern Data Stack, the two roles presented above are beginning to blur more and more. On the one hand, managing the data platform has become much easier and no longer needs to be coordinated in such an elaborate way. On the other hand, many people in the company are able to analyze their data themselves with the simple tools of the Modern Data Stack and only really need support for very complicated, overarching analyses.
This has given rise to the position of the Analytics Engineer. This role carries end-to-end responsibility for the operation of the data stack and is expected to cover all steps, from data provision to use by the end user.
This also allows further inefficiencies to be eliminated. The expanded scope of responsibility lets the Analytics Engineer keep track of different projects and their overlaps, so that certain pipelines or databases can even be reused because they draw on the same data.
This is what you should take with you
- The Modern Data Stack provides the ability for many organizations to leverage big data.
- Until now, this was only available to organizations that could make large investments in on-premise hardware and provide the necessary personnel.
- With the widespread adoption of cloud services and Software as a Service products, this has now changed dramatically.
- This results in many advantages, including high agility, as services can be easily added as needed and the costs for these are also much more transparent.