Presto is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.
Presto was not designed to handle Online Transaction Processing (OLTP). Presto is not a general-purpose relational database. It is not a replacement for databases like MySQL, PostgreSQL or Oracle.
Presto was designed as an alternative to tools that query HDFS using pipelines of MapReduce jobs such as Hive or Pig, but Presto is not limited to accessing HDFS. Presto can be and has been extended to operate over different kinds of data sources including traditional relational databases and other data sources such as Cassandra.
Overview
Concepts
Server Types
-
Resource Manager
The Presto resource manager is the server that aggregates data from all coordinators and workers and constructs a global view of the cluster.
-
Coordinator
The Presto coordinator is the server that is responsible for parsing statements, planning queries, and managing Presto worker nodes.
The coordinator keeps track of the activity on each worker and coordinates the execution of a query.
-
Worker
A Presto worker is a server which is responsible for executing tasks and processing data. Worker nodes fetch data from connectors and exchange intermediate data with each other. The coordinator is responsible for fetching results from the workers and returning the final results to the client.
Data Sources
-
Connector
A connector adapts Presto to a data source such as Hive or a relational database.
-
Catalog
A Presto catalog contains schemas and references a data source via a connector.
When addressing a table in Presto, the fully-qualified table name is always rooted in a catalog. For example, a fully-qualified table name of
hive.test_data.testwould refer to thetesttable in thetest_dataschema in thehivecatalog. -
Schema
Schemas are a way to organize tables. Together, a catalog and schema define a set of tables that can be queried.
-
Table
A table is a set of unordered rows which are organized into named columns with types.
Query Execution Model
Presto executes SQL statements and turns these statements into queries that are executed across a distributed cluster of coordinator and workers.
-
Statement
Presto executes ANSI-compatible SQL statements.
-
Query
When Presto parses a statement, it converts it into a query and creates a distributed query plan which is then realized as a series of interconnected stages running on Presto workers.
The difference between a statement and a query is simple. A statement can be thought of as the SQL text that is passed to Presto, while a query refers to the configuration and components instantiated to execute that statement. A query encompasses stages, tasks, splits, connectors, and other components and data sources working in concert to produce a result.
-
Stage
The hierarchy of stages that comprises a query resembles a tree. Every query has a root stage which is responsible for aggregating the output from other stages.
-
Task
A distributed query plan is deconstructed into a series of stages which are then translated to tasks which then act upon or process splits.
-
Split
Tasks operate on splits which are sections of a larger data set.
-
Driver
Drivers act upon data and combine operators to produce output that is then aggregated by a task and then delivered to another task in another stage.
-
Operator
An operator consumes, transforms and produces data.
-
Exchange
Exchanges transfer data between Presto nodes for different stages of a query.
Installation
Deployment with Docker
There are four files in the etc/ folder to configure Presto, along with one catalog in etc/catalog/. A catalog defines the configuration of a connector, and the catalog is named after the file name (minus the .properties extension). You can have multiple catalogs for each Presto installation, including multiple catalogs using the same connector; they just need a different filename. The files are:
|
|
Each connector has their own set of configuration properties that are specific to the connector.
Connectors
MySQL Connector
Configuration
To configure the MySQL connector, create a catalog properties file in etc/catalog named, for example, mysql.properties, to mount the MySQL connector as the mysql catalog. Create the file with the following contents, replacing the connection properties as appropriate for your setup:
|
|
Querying MySQL
see the available MySQL databases by running SHOW SCHEMAS:
|
|
If you have a MySQL database named web, view the tables in this database:
|
|
You can see a list of the columns in the clicks table in the web database
|
|
access the clicks table in the web database:
|
|
Python client for Presto
|
|
|
|