Presto is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.

Presto was not designed to handle Online Transaction Processing (OLTP). Presto is not a general-purpose relational database. It is not a replacement for databases like MySQL, PostgreSQL or Oracle.

Presto was designed as an alternative to tools that query HDFS using pipelines of MapReduce jobs such as Hive or Pig, but Presto is not limited to accessing HDFS. Presto can be and has been extended to operate over different kinds of data sources including traditional relational databases and other data sources such as Cassandra.

Overview

Concepts

Server Types

  • Resource Manager

    The Presto resource manager is the server that aggregates data from all coordinators and workers and constructs a global view of the cluster.

  • Coordinator

    The Presto coordinator is the server that is responsible for parsing statements, planning queries, and managing Presto worker nodes.

    The coordinator keeps track of the activity on each worker and coordinates the execution of a query.

  • Worker

    A Presto worker is a server which is responsible for executing tasks and processing data. Worker nodes fetch data from connectors and exchange intermediate data with each other. The coordinator is responsible for fetching results from the workers and returning the final results to the client.

Data Sources

  • Connector

    A connector adapts Presto to a data source such as Hive or a relational database.

  • Catalog

    A Presto catalog contains schemas and references a data source via a connector.

    When addressing a table in Presto, the fully-qualified table name is always rooted in a catalog. For example, a fully-qualified table name of hive.test_data.test would refer to the test table in the test_data schema in the hive catalog.

  • Schema

    Schemas are a way to organize tables. Together, a catalog and schema define a set of tables that can be queried.

  • Table

    A table is a set of unordered rows which are organized into named columns with types.

Query Execution Model

Presto executes SQL statements and turns these statements into queries that are executed across a distributed cluster of coordinator and workers.

  • Statement

    Presto executes ANSI-compatible SQL statements.

  • Query

    When Presto parses a statement, it converts it into a query and creates a distributed query plan which is then realized as a series of interconnected stages running on Presto workers.

    The difference between a statement and a query is simple. A statement can be thought of as the SQL text that is passed to Presto, while a query refers to the configuration and components instantiated to execute that statement. A query encompasses stages, tasks, splits, connectors, and other components and data sources working in concert to produce a result.

  • Stage

    The hierarchy of stages that comprises a query resembles a tree. Every query has a root stage which is responsible for aggregating the output from other stages.

  • Task

    A distributed query plan is deconstructed into a series of stages which are then translated to tasks which then act upon or process splits.

  • Split

    Tasks operate on splits which are sections of a larger data set.

  • Driver

    Drivers act upon data and combine operators to produce output that is then aggregated by a task and then delivered to another task in another stage.

  • Operator

    An operator consumes, transforms and produces data.

  • Exchange

    Exchanges transfer data between Presto nodes for different stages of a query.

Installation

Deployment with Docker

There are four files in the etc/ folder to configure Presto, along with one catalog in etc/catalog/. A catalog defines the configuration of a connector, and the catalog is named after the file name (minus the .properties extension). You can have multiple catalogs for each Presto installation, including multiple catalogs using the same connector; they just need a different filename. The files are:

1
2
3
4
5
6
7
etc/
├── catalog
│   └── tpch.properties  # Configures the TPCH connector to generate data
├── config.properties    # Presto instance configuration properties
├── jvm.config           # JVM configuration for the process
├── log.properties       # Logging configuration
└── node.properties      # Node-specific configuration properties

Each connector has their own set of configuration properties that are specific to the connector.

Connectors

MySQL Connector

Configuration

To configure the MySQL connector, create a catalog properties file in etc/catalog named, for example, mysql.properties, to mount the MySQL connector as the mysql catalog. Create the file with the following contents, replacing the connection properties as appropriate for your setup:

1
2
3
4
connector.name=mysql
connection-url=jdbc:mysql://example.net:3306
connection-user=root
connection-password=secret
Querying MySQL

see the available MySQL databases by running SHOW SCHEMAS:

1
SHOW SCHEMAS FROM mysql;

If you have a MySQL database named web, view the tables in this database:

1
SHOW TABLES FROM mysql.web;

You can see a list of the columns in the clicks table in the web database

1
2
DESCRIBE mysql.web.clicks;
SHOW COLUMNS FROM mysql.web.clicks;

access the clicks table in the web database:

1
SELECT * FROM mysql.web.clicks;

Python client for Presto

1
$ pip install presto-python-client
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
import prestodb
conn=prestodb.dbapi.connect(
    host='localhost',
    port=8080,
    user='the-user',
    catalog='the-catalog',
    schema='the-schema',
)
cur = conn.cursor()
cur.execute('SELECT * FROM system.runtime.nodes')
rows = cur.fetchall()

Reference

Presto Documentation

presto-python-client