Data warehousing began by pulling data from operational databases into systems better optimized for analytics. These appliances were expensive to operate, so organizations were highly judicious about what data they ingested for analytics. Over the years, demand for more data has exploded, far outpacing Moore's law and challenging legacy data warehousing appliances. While this trend holds for the industry at large, some companies hit these scaling challenges earlier than others.

Facebook was among the earliest companies to attempt to solve this problem, in 2012. At the time, Facebook was using Apache Hive to perform interactive analysis. As Facebook's datasets grew, Hive proved not as interactive (read: too slow) as desired. This is largely because Hive is built on MapReduce, which, at the time, required intermediate datasets to be persisted to disk, incurring heavy disk I/O for transient, intermediate result sets. So Facebook developed Presto, a new distributed SQL query engine designed to run in memory, without persisting intermediate result sets within a single query. This approach yielded a query engine that processed the same queries orders of magnitude faster, with many queries completing in under a second. End users such as engineers, product managers, and data analysts found they could interactively query fractions of large datasets to test hypotheses and create visualizations.

While Facebook was among the earliest companies, it was not alone in the problems it faced as datasets grew and outpaced hardware advances. The data lake architecture was developed to address these challenges by decoupling storage from compute and letting storage grow in cheaper distributed filesystems built on commodity hardware and, eventually, cloud storage systems.
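The performance difference described above can be sketched with a toy example. This is not Presto or MapReduce code; it is a minimal Python illustration, with made-up data, of why persisting a transient intermediate result set to disk between stages is slower than streaming rows between stages in memory: both pipelines below compute the same aggregate, but the first writes and re-reads its stage-one output, while the second pipes rows through a generator with no intermediate I/O.

```python
import json
import os
import tempfile

# Hypothetical sample data: 1,000 click events across 10 users.
rows = [{"user": i % 10, "clicks": i} for i in range(1000)]

def disk_pipeline(rows):
    """MapReduce-style: stage 1 persists its output to disk; stage 2 re-reads it."""
    with tempfile.NamedTemporaryFile("w", delete=False, suffix=".jsonl") as f:
        for r in rows:
            if r["clicks"] % 2 == 0:          # stage 1: filter, output hits disk
                f.write(json.dumps(r) + "\n")
        path = f.name
    totals = {}
    with open(path) as f:                     # stage 2: aggregate from disk
        for line in f:
            r = json.loads(line)
            totals[r["user"]] = totals.get(r["user"], 0) + r["clicks"]
    os.remove(path)
    return totals

def in_memory_pipeline(rows):
    """Presto-style: rows stream between stages without touching disk."""
    filtered = (r for r in rows if r["clicks"] % 2 == 0)  # lazy generator, no I/O
    totals = {}
    for r in filtered:
        totals[r["user"]] = totals.get(r["user"], 0) + r["clicks"]
    return totals

# Both pipelines produce the same result; only the intermediate I/O differs.
assert disk_pipeline(rows) == in_memory_pipeline(rows)
```

In a distributed engine the same trade-off is multiplied across many workers and much larger intermediate sets, which is why avoiding the disk round trip mattered so much for interactivity.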
Alongside cheaper storage for the ever-increasing data came compute systems to process it. However, it wasn't immediately clear how users would interactively query data from the data lake; often, as with Facebook in 2012, users would reach for tools designed for offline data transformation, which was incredibly slow. It was in this setting that Presto was made open source in 2013, and it quickly gained traction with other data pioneers such as Airbnb, Uber, and Netflix. The problem faced at Facebook was far from unique; it was only encountered early.

Over the years, the need to interactively query data over distributed storage has only grown. As usage has increased, so have user expectations: originally, interactive queries often suffered from inconsistent results, a lack of schema evolution, and the inability to debug prior versions of tables. To meet these expectations, table formats have evolved from the original Hive table format to offer richer features found in data warehousing appliances, such as ACID transaction support and indexes. Presto's architecture was designed to flexibly handle these needs, which brings us to the present-day architecture of the lakehouse: cheap distributed storage over a data lake, with performance that often matches that of warehousing appliances and usability features that provide much of the same functionality, reducing the need to extract, transform, and load (ETL) the data into other systems.

Why We Wrote This Book

Deploying Presto to meet your team's warehouse and lakehouse infrastructure needs is not a minor undertaking. For the deployment to be successful, you need to understand the principles of Presto and the tools it provides.
We wrote this book to help you get up to speed with Presto's basic principles so you can successfully deploy Presto at your company, taking advantage of one of the most powerful distributed query engines in the data analytics space today. The book also includes chapters on the ecosystem around Presto and how you can integrate other popular open source projects, such as Apache Pinot, Apache Hudi, and more, to open up even more use cases. After reading this book, you should feel confident and empowered to deploy Presto on your team and to maintain it going forward.
Learning and Operating Presto: Fast, Reliable SQL for Data Analytics and Lakehouses
Lo Duca
2023