Apache Presto is an open-source distributed SQL query engine for querying large amounts of data. It was developed at Facebook in 2012 and subsequently released as open source under the Apache License. The engine does not provide its own storage system and is therefore used on top of existing data stores, such as Apache Hadoop or MongoDB.
How is Apache Presto built?
The architecture of Apache Presto resembles that of classical database management systems (DBMS) that use massively parallel processing (MPP). It consists of several components, each of which performs a different task:
- Client: The client is the starting and ending point of each query. It passes the SQL command to the coordinator and receives the final result once the workers have processed it.
- Coordinator: The coordinator receives the commands from the client and breaks them down in order to analyze how complex they are to process. It plans the execution of the commands and, based on the resulting execution plan, hands them to the scheduler, which it also uses to monitor their processing.
- Scheduler: The scheduler is the part of the coordinator that is ultimately responsible for passing the commands on to the workers. It monitors whether the commands are executed according to the plan created by the coordinator.
- Worker: The workers carry out the actual execution of the commands and fetch the data from the data sources through the connectors. The final results are then passed back to the client.
- Connector: Connectors are the interfaces to the supported data sources. They know the peculiarities of the different databases and systems and can therefore adapt the commands accordingly.
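The interplay of these components can be sketched as a toy simulation in Python. This is not how Presto is implemented: the class names (`Connector`, `Worker`, `Coordinator`), the in-memory data source, and the fixed filter-and-sum query are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Connector:
    """Hypothetical connector: exposes one data source as splits of rows."""
    rows: list

    def splits(self, n):
        # Divide the source rows into n roughly equal chunks ("splits").
        return [self.rows[i::n] for i in range(n)]


class Worker:
    """Hypothetical worker: executes one task on one split."""

    def run(self, split, min_amount):
        # Filter and partially aggregate the rows of this split.
        return sum(r["amount"] for r in split if r["amount"] >= min_amount)


class Coordinator:
    """Hypothetical coordinator: plans the query and schedules the tasks."""

    def __init__(self, workers, connector):
        self.workers = workers
        self.connector = connector

    def execute(self, min_amount):
        # Plan: one split per worker.
        splits = self.connector.splits(len(self.workers))
        # Schedule: hand each split to a worker and collect partial results.
        partials = [w.run(s, min_amount) for w, s in zip(self.workers, splits)]
        # Combine the partial results into the final answer for the client.
        return sum(partials)


# Client side: roughly "SELECT sum(amount) FROM orders WHERE amount >= 10".
orders = Connector(rows=[{"amount": a} for a in (5, 10, 20, 30)])
coordinator = Coordinator(workers=[Worker(), Worker()], connector=orders)
total = coordinator.execute(min_amount=10)
print(total)
```

The point of the sketch is the division of labor: the client only talks to the coordinator, the workers only see their own split, and the connector hides where the rows actually come from.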
What applications use Presto?
Presto can be used to connect different data sources that store large amounts of data. Even non-relational databases can then be queried using classic SQL commands. Presto is often used in the Big Data area, where short query times and high performance are essential. It can also be used for queries on data warehouses.
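In Presto, tables are addressed as `catalog.schema.table`, which is what allows a single statement to combine data from different systems. The following Python sketch only builds such a query as a string. The catalog, schema, table, and column names (`mysql.shop.customers`, `mongodb.events.clicks`, `customer_id`) are hypothetical and would depend on how the catalogs are configured in a given installation.

```python
def qualified(catalog, schema, table):
    """Return a fully qualified table name of the form catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def build_federated_query():
    # Hypothetical tables that live in two different systems.
    customers = qualified("mysql", "shop", "customers")
    clicks = qualified("mongodb", "events", "clicks")
    return (
        "SELECT c.name, count(*) AS clicks\n"
        f"FROM {customers} AS c\n"
        f"JOIN {clicks} AS k ON k.customer_id = c.id\n"
        "GROUP BY c.name"
    )


query = build_federated_query()
print(query)
```

The resulting statement joins a relational table with a collection from a document store, and the user only has to know SQL to write it.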
Many well-known companies already rely on Presto. Besides Facebook, which developed the query engine, these include, for example:
- Uber uses the SQL query engine for its massive data lakehouse with well over 59 petabytes of data. Data scientists as well as regular users need to be able to access this data quickly.
- At Twitter, the rapidly growing amount of data also became a cost issue as SQL query expenses increased. SQL query engines were therefore used to scale the system horizontally. In addition, a machine learning model was trained that can predict the expected query time before a query is run.
- Alibaba relies on SQL query engines to build its data lake.
All of these examples were taken from the Use Case section on the Presto website.
What are the advantages of using Presto?
Apache Presto offers several advantages when working with large amounts of data. These include:
Because Presto is open source, it can be used without licensing costs, and its source code can be viewed and, with sufficient know-how, tailored to one’s own needs.
In addition, open-source programs often have a large, active community, so problems can usually be solved by a quick Internet search. The many active users of Apache Presto also ensure that the system is constantly being developed and improved, which in turn benefits all other users.
Due to its architecture, this SQL query engine can query large amounts of data within a few seconds and with low latency. This high performance is made possible by the distributed architecture, which enables horizontal scaling of the system.
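The idea behind horizontal scaling can be illustrated with a small Python sketch: the data is split into chunks, each chunk is aggregated independently, and a final step merges the partial results. The function names `partial_sum` and `distributed_sum` are hypothetical, and the thread pool merely stands in for separate worker machines. Python threads do not speed up CPU-bound work, so the sketch shows the structure of the computation rather than an actual performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    # Each "worker" aggregates only its own chunk of the data.
    return sum(chunk)


def distributed_sum(values, workers):
    # Split the data into one chunk per worker.
    chunks = [values[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # A final step merges the partial results.
    return sum(partials)


data = list(range(1, 1001))
print(distributed_sum(data, workers=2))
print(distributed_sum(data, workers=8))
```

Both calls return the same total. Adding workers changes only how the work is divided, which is why such a system can grow by adding machines.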
In addition, Presto can be run both on-premise and in the cloud, so performance can be further improved by moving to the cloud if needed.
Because Presto uses the Structured Query Language (SQL), it is easy to use for many users: they already know the query language and can reuse that knowledge. This makes it easy to implement even complex functions.
Compatibility is further ensured by a variety of available connectors for common database systems, such as MongoDB, MySQL, or the Hadoop Distributed File System. If these are not sufficient, custom connectors can also be configured or written.
How can Presto and Hadoop be used together?
Apache Presto has no built-in data store of its own. It therefore relies on other, external databases and storage systems. In practice, Apache Hadoop, or more precisely the Hadoop Distributed File System (HDFS), is often used for this purpose.
The connection between HDFS and Presto is established via the Hive connector. The main advantage is that Presto can read different file formats and can therefore search through the files stored in HDFS. Presto is often used as an alternative to Hive because it is optimized for fast queries, which Hive cannot offer.
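A Hive catalog is typically declared in a small properties file. The Python sketch below renders such a file from a dictionary. The property names `connector.name` and `hive.metastore.uri` and the value `hive-hadoop2` follow the PrestoDB Hive connector documentation as far as I am aware and should be checked against the version in use. The metastore host is a placeholder, and the helper `render_catalog_properties` is hypothetical.

```python
def render_catalog_properties(properties):
    """Render a dict as the key=value lines of a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in properties.items()) + "\n"


# Property names are assumed from the PrestoDB Hive connector documentation;
# the metastore host below is a placeholder, not a real endpoint.
hive_catalog = {
    "connector.name": "hive-hadoop2",
    "hive.metastore.uri": "thrift://metastore.example.net:9083",
}

text = render_catalog_properties(hive_catalog)
print(text)
```

In a typical setup, the rendered text would be saved as a catalog file such as `etc/catalog/hive.properties`, and the file name would then determine the catalog name used in queries.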
What are the differences between Presto and Spark?
Apache Spark is a distributed analytics framework that can be used for many different Big Data applications. It relies on in-memory data storage and parallel execution of processes to ensure high performance. It is one of the most comprehensive Big Data systems on the market and offers, among other things, batch processing, graph processing, and support for machine learning.
Spark is often mentioned in connection with Apache Presto or even understood as a competitor to it. However, the two systems are very different and share only a few similarities. Both are open-source systems for working with Big Data. Both can offer good performance thanks to their distributed architecture and the possibility of scaling. Accordingly, both can be run on-premise as well as in the cloud.
However, besides these (albeit rather few) similarities, Apache Spark and Apache Presto differ in some fundamental characteristics:
- Spark Core itself does not support SQL queries; for that, you need the additional Spark SQL component. Presto, on the other hand, is a pure SQL query engine.
- Spark offers a very wide range of application possibilities, for example, also through the possibility of building and deploying entire machine learning models.
- Apache Presto, on the other hand, specializes primarily in the fast processing of data queries for large data volumes.
This is what you should take with you
- Apache Presto is an open-source distributed SQL engine suitable for querying large amounts of data.
- The engine can be used for distributed queries with fast response times and low latency.
- Presto differs from Apache Spark in that it is primarily focused on data querying, while Spark offers a wide range of application capabilities.
- Since Apache Presto does not have its own data store, it is often used together with Apache Hadoop, where it connects to HDFS via the Hive connector and serves as a faster alternative to Hive.