Monday, May 25, 2020

Peter Hoffmann: DuckDB vs Azure Synapse SQL-on-Demand with parquet

Short Disclaimer: This post is comparing apples to oranges, because DuckDB is an embedded database designed to execute analytical SQL queries on your local machine, whereas Azure Synapse SQL-on-Demand is Microsoft new cloud offering for a serverless SQL on-demand (preview) endpoint to query data in the Azure Blob Store/Data Lake.

But in the end I was still triggerd by Uwe Korn's post: Taking DuckDB for a spin to do a comparison, because even if they are fundamentally different offerings, from a high level both solutions offer a simple way to run analytical queries on datasets without the need to install/manage a server.

Load the data as parquet data

To get started you need to convert the yellow tripdata sampledata to a parquet file and upload it to the Azure Blob Storage:

import pandas as pd
from simplekv.net.azurestore import AzureBlockBlobStore

filename = "yellow_tripdata_2016-01.csv"

df = pd.read_csv(
        filename,
        dtype={"store_and_fwd_flag": "bool"},
        parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
        index_col=False,
        infer_datetime_format=True,
        true_values=["Y"],
        false_values=["N"],
    )
df.to_parquet("yellow_tripdata_2016-01.parquet")

conn_string = 'DefaultEndpointsProtocol=https;AccountName=blob;AccountKey=xxx;'
store = AzureBlockBlobStore(conn_string=conn_string, container='benchmark', public=False)
store.putfile("yellow_tripdata_2016-01.parquet", "yellow_tripdata_2016-01.parquet")

Once the data is available in the Azure Blob Storage/Data Lake you can use the openrowset function to read the remote dataset in Azure SQL-on-Demand:

Count Distinct

The minimal count distinct example needs to once read all selected columns from the azure blob storage and execute the count on the data.

The query roughly takes 4s, which is not too bad and quite faster than the 5.58s for DuckDB and the 25s for sqlite.

Frequency of events

This type of query is a nice and simple form for testing the aggregation performance.

SELECT
    MIN(cnt),
    AVG(cnt),
    -- MEDIAN(cnt),
    MAX(cnt)
FROM
(
    SELECT
        COUNT(*) as cnt
FROM
    OPENROWSET(
        BULK 'https://blob.dfs.core.windows.net/benchmark/yellow_tripdata_2016-01.parquet',
        FORMAT='PARQUET'
    ) AS [r]
    GROUP BY
        DATEPART(day, tpep_pickup_datetime),
        DATEPART(hour, tpep_pickup_datetime)
) as [stats];

As the query only has to read a subset of the columns it can be executed very fast in <1s, which again beats DuckDB 2.05s and SQLite 10.2s.

Simple fare regression

with yellow_tripdata_2016_01
as (select *
 FROM OPENROWSET(
        BULK 'https://blob.dfs.core.windows.net/benchmark/yellow_tripdata_2016-01.parquet',
        FORMAT='PARQUET'
    ) AS [r]
)
SELECT
    (SUM(trip_distance * fare_amount) - SUM(trip_distance) * SUM(fare_amount) / COUNT(*)) /
    (SUM(trip_distance * trip_distance) - SUM(trip_distance) * SUM(trip_distance) / COUNT(*)) AS beta,
    AVG(fare_amount) AS avg_fare_amount,
    AVG(trip_distance) AS avg_trip_distance
FROM
    yellow_tripdata_2016_01,
    (
        SELECT
            AVG(fare_amount) + 3 * STDEV(fare_amount) as max_fare,
            AVG(trip_distance) + 3 * STDEV(trip_distance) as max_distance
    FROM yellow_tripdata_2016_01
    ) AS [sub]
WHERE
    fare_amount > 0 AND
    fare_amount < sub.max_fare AND
    trip_distance > 0 AND
    trip_distance < sub.max_distance
;

The result is fetched again in less than 1s which is on par with DuckDB 972 ms and beats SQLite 11.7 s.

Conclusion

The performance comparison result that SQL-on-Demand beats DuckDB is just irrelevant, because the two database solutions are completely different.

What's really interesting is that while DuckDB provides the comfort of an easily embeddedable database on your local machine, Azure Synapse SQL-on-Demand offers the same comfort on running T-SQL statements on parquet files as a managed, elastic service in the cloud. You can just click your SQL endpoint in Azure and start querying parquet files without the need to run or manage a server and you only pay for data scanned (5$ per TB). If you are doing data engineering in the Azure Cloud this is cool new tool in your toolbelt.

And one that scales horizontally. While the above examples have been done on a tiny 1.7MB parquet file, we have done some scaling tests at blueyonder up to terabytes of data and thousands of parquet files. Stay tuned for a post on the results and an example of how to use turbodbc as a high performance interface for data pipelines...



from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...