Add TPC-H PPL query suite (#830)
* Add TPC-H PPL query suite

Signed-off-by: Lantao Jin <[email protected]>

* fix failure of loading resources

Signed-off-by: Lantao Jin <[email protected]>

* fix date_add()

Signed-off-by: Lantao Jin <[email protected]>

* enable q21 and add docs

Signed-off-by: Lantao Jin <[email protected]>

---------

Signed-off-by: Lantao Jin <[email protected]>
LantaoJin authored Nov 7, 2024
1 parent cfd41a3 commit 48be5cc
Showing 27 changed files with 1,323 additions and 1 deletion.
3 changes: 2 additions & 1 deletion build.sbt
@@ -238,7 +238,8 @@ lazy val integtest = (project in file("integ-test"))
 inConfig(IntegrationTest)(Defaults.testSettings ++ Seq(
   IntegrationTest / javaSource := baseDirectory.value / "src/integration/java",
   IntegrationTest / scalaSource := baseDirectory.value / "src/integration/scala",
-  IntegrationTest / parallelExecution := false,
+  IntegrationTest / resourceDirectory := baseDirectory.value / "src/integration/resources",
+  IntegrationTest / parallelExecution := false,
   IntegrationTest / fork := true,
 )),
 inConfig(AwsIntegrationTest)(Defaults.testSettings ++ Seq(
4 changes: 4 additions & 0 deletions docs/ppl-lang/README.md
@@ -104,6 +104,10 @@ For additional examples see the next [documentation](PPL-Example-Commands.md).
### Example PPL Queries
See samples of [PPL queries](PPL-Example-Commands.md)

---
### TPC-H PPL Query Rewriting
See samples of [TPC-H PPL query rewriting](ppl-tpch.md)

---
### Planned PPL Commands

102 changes: 102 additions & 0 deletions docs/ppl-lang/ppl-tpch.md
@@ -0,0 +1,102 @@
## TPC-H Benchmark

TPC-H is a decision support benchmark designed to evaluate the performance of database systems in handling complex business-oriented queries and concurrent data modifications. The benchmark utilizes a dataset that is broadly representative of various industries, making it widely applicable. TPC-H simulates a decision support environment where large volumes of data are analyzed, intricate queries are executed, and critical business questions are answered.

### Test PPL Queries

The 22 TPC-H test query statements: [TPCH-Query-PPL](https://github.com/opensearch-project/opensearch-spark/blob/main/integ-test/src/integration/resources/tpch)

### Data Preparation

#### Option 1 - from PyPi

```
# Create the virtual environment
python3 -m venv .venv
# Activate the virtual environment
. .venv/bin/activate
pip install tpch-datagen
```

#### Option 2 - from source

```
git clone https://github.com/gizmodata/tpch-datagen
cd tpch-datagen
# Create the virtual environment
python3 -m venv .venv
# Activate the virtual environment
. .venv/bin/activate
# Upgrade pip, setuptools, and wheel
pip install --upgrade pip setuptools wheel
# Install TPC-H Datagen - in editable mode with client and dev dependencies
pip install --editable .[dev]
```

#### Usage

Here are the options for the tpch-datagen command:
```
tpch-datagen --help
Usage: tpch-datagen [OPTIONS]
Options:
--version / --no-version Prints the TPC-H Datagen package version and
exits. [required]
--scale-factor INTEGER The TPC-H Scale Factor to use for data
generation.
--data-directory TEXT The target output data directory to put the
files into [default: data; required]
--work-directory TEXT The work directory to use for data
generation. [default: /tmp; required]
--overwrite / --no-overwrite Can we overwrite the target directory if it
already exists... [default: no-overwrite;
required]
--num-chunks INTEGER The number of chunks that will be generated
- more chunks equals smaller memory
requirements, but more files generated.
[default: 10; required]
--num-processes INTEGER The maximum number of processes for the
multi-processing pool to use for data
generation. [default: 10; required]
--duckdb-threads INTEGER The number of DuckDB threads to use for data
generation (within each job process).
[default: 1; required]
--per-thread-output / --no-per-thread-output
Controls whether to write the output to a
single file or multiple files (for each
process). [default: per-thread-output;
required]
--compression-method [none|snappy|gzip|zstd]
The compression method to use for the
parquet files generated. [default: zstd;
required]
--file-size-bytes TEXT The target file size for the parquet files
generated. [default: 100m; required]
--help Show this message and exit.
```

### Generate 1 GB of data with zstd compression (the default)

```
tpch-datagen --scale-factor 1
```

### Generate 10 GB of data with snappy compression

```
tpch-datagen --scale-factor 10 --compression-method snappy
```
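
The other options from the help output above can be combined in the same way. A hedged example follows; the output directory and chunk/process counts are illustrative choices, not values prescribed by the tool:

```
tpch-datagen \
  --scale-factor 1 \
  --data-directory ./tpch-data \
  --num-chunks 4 \
  --num-processes 4 \
  --compression-method snappy
```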

### Query Test

All TPC-H PPL queries are located in the `integ-test/src/integration/resources/tpch` folder.

To test all queries, run `org.opensearch.flint.spark.ppl.tpch.TPCHQueryITSuite`.
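
For example, given the sbt `IntegrationTest` configuration of the `integtest` project shown in `build.sbt` above, the suite can likely be invoked with a command along these lines (a sketch based on that project layout, not a command taken from the project documentation):

```
sbt "integtest/IntegrationTest/testOnly org.opensearch.flint.spark.ppl.tpch.TPCHQueryITSuite"
```
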
35 changes: 35 additions & 0 deletions integ-test/src/integration/resources/tpch/q1.ppl
@@ -0,0 +1,35 @@
/*
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem
where
l_shipdate <= date '1998-12-01' - interval '90' day
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus
*/

source = lineitem
| where l_shipdate <= subdate(date('1998-12-01'), 90)
| stats sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count() as count_order
by l_returnflag, l_linestatus
| sort l_returnflag, l_linestatus
45 changes: 45 additions & 0 deletions integ-test/src/integration/resources/tpch/q10.ppl
@@ -0,0 +1,45 @@
/*
select
c_custkey,
c_name,
sum(l_extendedprice * (1 - l_discount)) as revenue,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment
from
customer,
orders,
lineitem,
nation
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate >= date '1993-10-01'
and o_orderdate < date '1993-10-01' + interval '3' month
and l_returnflag = 'R'
and c_nationkey = n_nationkey
group by
c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment
order by
revenue desc
limit 20
*/

source = customer
| join ON c_custkey = o_custkey orders
| join ON l_orderkey = o_orderkey lineitem
| join ON c_nationkey = n_nationkey nation
| where o_orderdate >= date('1993-10-01')
AND o_orderdate < date_add(date('1993-10-01'), interval 3 month)
AND l_returnflag = 'R'
| stats sum(l_extendedprice * (1 - l_discount)) as revenue by c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
| sort - revenue
| head 20
45 changes: 45 additions & 0 deletions integ-test/src/integration/resources/tpch/q11.ppl
@@ -0,0 +1,45 @@
/*
select
ps_partkey,
sum(ps_supplycost * ps_availqty) as value
from
partsupp,
supplier,
nation
where
ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = 'GERMANY'
group by
ps_partkey having
sum(ps_supplycost * ps_availqty) > (
select
sum(ps_supplycost * ps_availqty) * 0.0001000000
from
partsupp,
supplier,
nation
where
ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = 'GERMANY'
)
order by
value desc
*/

source = partsupp
| join ON ps_suppkey = s_suppkey supplier
| join ON s_nationkey = n_nationkey nation
| where n_name = 'GERMANY'
| stats sum(ps_supplycost * ps_availqty) as value by ps_partkey
| where value > [
source = partsupp
| join ON ps_suppkey = s_suppkey supplier
| join ON s_nationkey = n_nationkey nation
| where n_name = 'GERMANY'
| stats sum(ps_supplycost * ps_availqty) as check
| eval threshold = check * 0.0001000000
| fields threshold
]
| sort - value
42 changes: 42 additions & 0 deletions integ-test/src/integration/resources/tpch/q12.ppl
@@ -0,0 +1,42 @@
/*
select
l_shipmode,
sum(case
when o_orderpriority = '1-URGENT'
or o_orderpriority = '2-HIGH'
then 1
else 0
end) as high_line_count,
sum(case
when o_orderpriority <> '1-URGENT'
and o_orderpriority <> '2-HIGH'
then 1
else 0
end) as low_line_count
from
orders,
lineitem
where
o_orderkey = l_orderkey
and l_shipmode in ('MAIL', 'SHIP')
and l_commitdate < l_receiptdate
and l_shipdate < l_commitdate
and l_receiptdate >= date '1994-01-01'
and l_receiptdate < date '1994-01-01' + interval '1' year
group by
l_shipmode
order by
l_shipmode
*/

source = orders
| join ON o_orderkey = l_orderkey lineitem
| where l_commitdate < l_receiptdate
and l_shipdate < l_commitdate
and l_shipmode in ('MAIL', 'SHIP')
and l_receiptdate >= date('1994-01-01')
and l_receiptdate < date_add(date('1994-01-01'), interval 1 year)
| stats sum(case(o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH', 1 else 0)) as high_line_count,
sum(case(o_orderpriority != '1-URGENT' and o_orderpriority != '2-HIGH', 1 else 0)) as low_line_count
by l_shipmode
| sort l_shipmode
31 changes: 31 additions & 0 deletions integ-test/src/integration/resources/tpch/q13.ppl
@@ -0,0 +1,31 @@
/*
select
c_count,
count(*) as custdist
from
(
select
c_custkey,
count(o_orderkey) as c_count
from
customer left outer join orders on
c_custkey = o_custkey
and o_comment not like '%special%requests%'
group by
c_custkey
) as c_orders
group by
c_count
order by
custdist desc,
c_count desc
*/

source = [
source = customer
| left outer join ON c_custkey = o_custkey AND not like(o_comment, '%special%requests%')
orders
| stats count(o_orderkey) as c_count by c_custkey
] as c_orders
| stats count() as custdist by c_count
| sort - custdist, - c_count
25 changes: 25 additions & 0 deletions integ-test/src/integration/resources/tpch/q14.ppl
@@ -0,0 +1,25 @@
/*
select
100.00 * sum(case
when p_type like 'PROMO%'
then l_extendedprice * (1 - l_discount)
else 0
end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue
from
lineitem,
part
where
l_partkey = p_partkey
and l_shipdate >= date '1995-09-01'
and l_shipdate < date '1995-09-01' + interval '1' month
*/

source = lineitem
| join ON l_partkey = p_partkey
AND l_shipdate >= date('1995-09-01')
AND l_shipdate < date_add(date('1995-09-01'), interval 1 month)
part
| stats sum(case(like(p_type, 'PROMO%'), l_extendedprice * (1 - l_discount) else 0)) as sum1,
sum(l_extendedprice * (1 - l_discount)) as sum2
| eval promo_revenue = 100.00 * sum1 / sum2 // Stats and Eval commands can be combined once issues/819 is resolved
| fields promo_revenue