Why your company should use PrestoSQL

Do you have tons of data scattered around multiple databases? Do you often hear developers saying that the best you can get is a report updated once a day because they copy the data from another team’s database at 6 in the morning? Do you wonder what business opportunity you lost because you cannot easily access all of your data? The insights are there! You just need to ask. Of course, you will never find them because, right now, you have to query four different databases, call three REST APIs, copy the data to your computer, and run a custom Python script to merge the files.

Are you still surprised that all analyses take days to run? Do you still wonder why you keep hiring new data engineers and data scientists, but productivity keeps declining? Perhaps it’s because the useful data is always in some other database, the ETLs break all the time, the data formats keep changing, and your teams spend all their working hours fixing the problems.

Wouldn’t it be great if you had one place that can access all the data? Wouldn’t it be beautiful if you could smoothly run all of the analytics queries? Wouldn’t the company earn more money if the data engineering team could focus on building complex ETLs and machine learning models instead of fighting an ever-growing list of problems?

PrestoSQL

PrestoSQL is a distributed SQL query engine that integrates all of your data sources in one place. Imagine that! No more data silos, no more Jira tickets to get access to some obscure database table. No more hacky REST APIs that produce CSVs so you can clog your hard drive with files and spend hours waiting while your laptop does a job that a computing cluster could finish in a few minutes.
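To make that concrete, here is a minimal sketch of a federated query. In Presto, every data source is exposed as a catalog; the catalog, schema, and table names below are made up for the example:

    -- Join a table in a production PostgreSQL database with
    -- Parquet files in S3 exposed through the Hive connector.
    -- All names (postgresql, hive, orders, customer_profiles) are hypothetical.
    SELECT o.order_id, o.total, c.email
    FROM postgresql.public.orders o
    JOIN hive.analytics.customer_profiles c
      ON o.customer_id = c.customer_id
    WHERE o.order_date >= current_date;

One query, two data sources, no copying.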

Single interface for multiple data sources

What happens when we want to query multiple data sources at once? At many companies, the data engineering teams solve that problem by creating ETL jobs that copy the data from production databases to a single location (for example, AWS S3). After that, they create tables in a SQL interface (AWS Athena) and query a snapshot of production data.

Such an approach causes multiple issues. First, we always work with stale data. It is common to run ETL jobs that copy the data added on the previous day, so your analytics snapshot is always one day behind reality. That may be good enough for a monthly report, but it is useless if you want to react to real-time changes.

Data silos cause another problem. You need to know which database contains the data you need so that you can copy it to the central location. Occasionally, the rollups produced by data engineering may be copied back to the production databases. Such an approach causes a weird and error-prone cycle of downloads and uploads.

Last but not least, data duplication is expensive. If you copy snapshots of data, you pay twice for every byte. Does that look like a great deal to you? Is it still a mystery why you spend so much on AWS or Google Cloud?

SQL-like abstraction

Having an abstraction is essential because nobody cares where the data is located. It’s tedious to use Hive, extract some values from PostgreSQL, retrieve objects from your MongoDB collections, and download JSON from a REST API. When you finish the data extraction, you have to transform everything into a common format, and only after all of that is done can you start doing the analysis you wanted in the first place. That’s a massive waste of time.

PrestoSQL gives us a SQL abstraction over any data source. It supports relational databases, file and object storage, document databases, and event streams. It is easy to implement a custom connector, so you can use Presto to access your REST APIs too! You no longer need to worry about tedious data extraction. The only thing left to do is write the SQL query that produces the results you need.
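For example, nothing stops you from joining an event stream with a document database in a single query. The catalogs, schemas, and tables below are hypothetical, and the Kafka connector has to be configured to map message fields to columns, but the query itself is plain SQL:

    -- Enrich the last hour of page views from Kafka with user
    -- documents stored in MongoDB; all names are made up.
    SELECT e.event_type,
           count(*) AS events,
           count(DISTINCT u.country) AS countries
    FROM kafka.tracking.page_views e
    JOIN mongodb.app.users u
      ON e.user_id = u.user_id
    WHERE e.event_time > now() - INTERVAL '1' HOUR
    GROUP BY e.event_type;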

Storage/compute separation

In the past, we used the same computers to store data and perform computations on that data. It is no longer the case. It would be very cumbersome to colocate storage and computation when your data is stored in S3. It gets even worse when you have multiple data sources, and half of them are REST APIs that pretend to have a SQL interface. We store thousands of terabytes of data. If we purchased machines powerful enough to process that data efficiently but used them only for storage and occasional computation, the cost would kill every company. Fortunately, Presto follows the modern practice of separating storage and computation.
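With the Hive connector, for instance, Presto can register a table that is little more than a pointer to files in S3. The bucket, schema, and table names below are invented for this sketch; the point is that the data stays in S3, and Presto workers only read it while a query runs:

    -- A hypothetical external table backed by Parquet files in S3.
    -- Presto stores no data itself; it only provides the compute.
    CREATE TABLE hive.analytics.events (
        user_id BIGINT,
        event_type VARCHAR,
        event_time TIMESTAMP
    )
    WITH (
        external_location = 's3://my-bucket/events/',
        format = 'PARQUET'
    );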

PrestoSQL use cases

Presto is at its best when we want a unified SQL interface for multiple data sources. It can efficiently handle analytical queries and business intelligence tasks. We can even use it for SQL-based ETL when all we need to do is join multiple sources, calculate aggregates, and store the results in a table.
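Such an ETL step can often be a single statement. Again, the catalog and table names are invented for the sketch, but the pattern is just a CREATE TABLE AS SELECT that joins two sources and aggregates the result:

    -- A sketch of a SQL-only ETL step; all names are hypothetical.
    CREATE TABLE hive.reports.daily_revenue AS
    SELECT c.country,
           date_trunc('day', o.order_date) AS day,
           sum(o.total) AS revenue
    FROM postgresql.public.orders o
    JOIN mysql.crm.customers c
      ON o.customer_id = c.id
    GROUP BY c.country, date_trunc('day', o.order_date);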

On the other hand, Presto is not a replacement for Apache Spark. If the ETL requires a lot of custom code, it is better to run a Spark cluster than to create a Presto pipeline that uses multiple tables to store the intermediate steps. If we want to train a machine learning model, it is more efficient to use SparkML than to write custom Presto code that trains it.
