What is Presto, Facebook Presto Database or PrestoDB? A powerful SQL query engine

If you have heard of Amazon Athena, then you are familiar with Presto. It is an open source distributed SQL query engine that powers the AWS Athena product. While Athena is one of the more visible commercial offerings, it certainly is not the only path for those interested in the software.

What is Presto?

Presto is a great candidate technology for those seeking performance, flexibility, and the non-intrusive technical layering a SQL query engine can offer. Here is what Facebook said of Presto:

For the analysts, data scientists, and engineers who crunch data, derive insights, and work to continuously improve our products, the performance of queries against our data warehouse is important. Being able to run more queries and get results faster improves their productivity.

Presto has its technical roots in the Hadoop world. Performance challenges drove Facebook to develop optimizations to achieve their objectives. They noted key differences in how Presto approaches certain operations:

In contrast, the Presto engine does not use MapReduce. It employs a custom query and execution engine with operators designed to support SQL semantics. In addition to improved scheduling, all processing is in memory and pipelined across the network between stages. This avoids unnecessary I/O and associated latency overhead.
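To make the idea of pipelined, in-memory processing more concrete, here is a toy sketch in Python. It is not Presto's implementation; the operator names (`scan`, `filter_rows`, `sum_by_country`) and the sample rows are assumptions made up for illustration. Each stage hands rows to the next as they are produced, so no intermediate result is written out between stages.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Row = Tuple[str, int]  # (country, clicks): a made-up schema for illustration


def scan(rows: Iterable[Row]) -> Iterator[Row]:
    """Source operator: yields one row at a time instead of a full list."""
    for row in rows:
        yield row


def filter_rows(rows: Iterable[Row], min_clicks: int) -> Iterator[Row]:
    """Filter operator: passes matching rows downstream as they arrive."""
    for country, clicks in rows:
        if clicks >= min_clicks:
            yield country, clicks


def sum_by_country(rows: Iterable[Row]) -> Dict[str, int]:
    """Aggregation operator: consumes the stream, keeping only running totals."""
    totals: Dict[str, int] = defaultdict(int)
    for country, clicks in rows:
        totals[country] += clicks
    return dict(totals)


# Hypothetical sample data standing in for a table scan.
data = [("US", 5), ("DE", 1), ("US", 7), ("DE", 9)]

# The stages are chained; rows flow through without intermediate storage.
result = sum_by_country(filter_rows(scan(data), min_clicks=5))
print(result)
```

In a real distributed engine the stages run on different workers and rows travel over the network, but the principle of streaming rows between operators rather than materializing each step is the same.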

Facebook also provided a simplified architecture overview.

One of the key features is that it allows you to make queries against different data sources of varying sizes. As a result of this model, it has a lot of data connectors. Presto supports querying data in RDBMS, Hive, and other data stores. This includes non-relational sources such as Hadoop HDFS, Amazon S3, and HBase, and relational sources such as MySQL, PostgreSQL, Redshift, SQL Server, and Teradata.

Another goal was that Presto support standard ANSI SQL, including aggregations, joins, left/right outer joins, sub-queries, distinct counts, and many others.

As a result, Presto can act as a form of query proxy, allowing you to combine data from multiple sources across your organization using familiar SQL. Depending on your architecture, it can be viewed as a complement to data warehouses, especially for organizations that use a federated model where having these connectors adds value.
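As a rough sketch of what a federated query can look like, the Python snippet below builds a single SQL statement that joins a table in one catalog with a table in another. Presto addresses tables as `catalog.schema.table`; the specific catalog, schema, table, and column names here (`hive.sales.orders`, `mysql.crm.customers`, and so on) are hypothetical and depend entirely on how your connectors are configured.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified table name in catalog.schema.table form."""
    return f"{catalog}.{schema}.{table}"


# Hypothetical names: one table exposed by a Hive connector,
# one exposed by a MySQL connector.
orders = qualified("hive", "sales", "orders")
customers = qualified("mysql", "crm", "customers")

# One statement spanning both sources, using a left outer join,
# an aggregation, and a distinct count.
FEDERATED_QUERY = f"""
SELECT c.region,
       COUNT(DISTINCT o.order_id) AS order_count
FROM {orders} AS o
LEFT JOIN {customers} AS c
       ON o.customer_id = c.customer_id
GROUP BY c.region
"""

print(FEDERATED_QUERY)
```

The snippet only prints the statement. You would submit it through whichever client or BI tool you already use to talk to the query engine.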

Presto Performance

Presto was designed for fast analytics. Query execution runs in parallel, with most results returning in seconds. The expectation is the query engine will deliver response times ranging from sub-second to minutes.

Another performance consideration is the data consumption pattern you have. For example, let’s say data is resident within Parquet files in a data lake on Amazon S3. You wrap Presto (or Amazon Athena) as a query service on top of that data. Lastly, you leverage Tableau to run scheduled queries which will store a “cache” of your data within the Tableau Hyper Engine.

In this model, Tableau acts as a query cache for Presto. This allows you to store data locally in the Tableau Hyper Engine rather than making live calls to the data each time. As a result, all subsequent queries in a Tableau visualization run against the data resident in Hyper rather than the query engine. This results in exceptionally fast performance, which is important for users of business intelligence and data visualization software.
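The caching pattern described above can be sketched in a few lines of Python. This is a simplified stand-in, not how Tableau or Hyper work internally: `ExtractCache` and `FakeEngine` are hypothetical names, and the fake engine simply counts how often it is called so the effect of the cache is visible.

```python
from typing import Callable, List, Tuple

Row = Tuple[str, int]


class FakeEngine:
    """Hypothetical stand-in for the query engine; counts each call."""

    def __init__(self) -> None:
        self.calls = 0

    def run_query(self) -> List[Row]:
        self.calls += 1
        return [("US", 12), ("DE", 9)]


class ExtractCache:
    """Holds a local copy of a query result, refreshed on a schedule."""

    def __init__(self, run_query: Callable[[], List[Row]]) -> None:
        self._run_query = run_query
        self._rows: List[Row] = []

    def refresh(self) -> None:
        """Scheduled job: call the engine once and store the result locally."""
        self._rows = list(self._run_query())

    def read(self) -> List[Row]:
        """Dashboard interaction: served from the local copy, no engine call."""
        return list(self._rows)


engine = FakeEngine()
cache = ExtractCache(engine.run_query)

cache.refresh()  # the scheduled extract
for _ in range(3):  # three dashboard interactions
    rows = cache.read()

print(engine.calls, rows)
```

The trade-off is freshness: reads are only as current as the last scheduled refresh, so the refresh interval should match how often the underlying data changes.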

See the post Building A Serverless Business Intelligence Stack With Apache Parquet, Tableau, and Amazon Athena.

The History: Facebook Presto

Presto started at Facebook in 2012 and was rolled out company-wide in 2013. In November 2013, Facebook open sourced it under the Apache License. It is now available on GitHub for anyone to download. You can get it packaged in Docker if you prefer that path.

The Presto Software Foundation was announced in 2019. It states that it is dedicated to the advancement of the open source distributed SQL query engine.

Who uses Presto?

Facebook, Nasdaq, Airbnb, Netflix, Atlassian, and many more have indicated they are using the query engine. However, it is likely many others are also running the software when you factor in the AWS offerings in EMR and Athena. For example, we are working with Fortune 500 companies that have deployed serverless data analytics stacks using Athena, Tableau, and Apache Parquet. As a result, the number of actual Presto users may be underreported.

The broader community can be found here on this forum and on Facebook.

Commercial Presto Solutions

As we referenced earlier, the software is commonly deployed in the cloud, though using Docker you can run it locally or on-premises. However, it was designed so that it can easily be paired with cloud infrastructure for scaling. This allows the software to deliver exceptional performance, scalability, reliability, availability, and economies of scale for querying data from gigabytes to petabytes in size.

Amazon Web Services

For example, Amazon EMR and Amazon Athena are examples of cloud-based deployments. Like most things AWS, they handle the bulk of setup, operations, and testing for you.

We mentioned Amazon Athena a few times already. Amazon Athena is a leading commercial offering of Presto. It lets you deploy the query engine with AWS as a serverless platform. This means no servers, virtual machines, or clusters to set up, manage, or tune. Athena automatically parallelizes interactive queries and dynamically scales resources as needed. With Athena, you pay only for the queries that you run. Another benefit is that many existing Business Intelligence (BI) tools, like Tableau, support Athena natively.

Starburst Presto

Other companies, like Starburst Data, provide the ability for you to launch a cluster in minutes without worrying about package provisioning, setup, maintenance, or tuning. For example, on AWS Starburst’s CloudFormation and AMI provide the tools you need to get started quickly. They also offer enterprise support options for those that want to go beyond a self-service model.

Looking To Get Started With Presto?

Try our fully automated, code-free, zero administration AWS Athena data pipeline service. It has never been easier to get your data into Amazon Athena for use with Tableau or other leading BI platforms. Our service optimizes and automates the configuration, processing, and loading of data to AWS Athena, so users can get to query results faster. With our new zero administration AWS Athena service, you simply push data from supported data sources and our service will automatically load it into AWS Athena.

Ready to go Presto with AWS Athena?

Get Started Today!

Want to discuss Presto or Amazon Athena for your organization? Need a platform and team of experts to kickstart your data and analytics efforts? We can help! Getting traction when adopting new technologies, especially if it means your team is working in different and unfamiliar ways, can be a roadblock to success. This is especially true in a self-service only world. If you want to discuss a proof-of-concept, pilot, project, or any other effort, the Openbridge platform and team of data experts are ready to help.

Reach out to us at hello@openbridge.com. Prefer to talk to someone? Set up a call with our team of data experts.
