Add TPC-H PPL query suite (#830)
* Add TPC-H PPL query suite

Signed-off-by: Lantao Jin <[email protected]>

* fix failure of loading resources

Signed-off-by: Lantao Jin <[email protected]>

* fix date_add()

Signed-off-by: Lantao Jin <[email protected]>

* enable q21 and add docs

Signed-off-by: Lantao Jin <[email protected]>

---------

Signed-off-by: Lantao Jin <[email protected]>
LantaoJin authored Nov 7, 2024
1 parent cfd41a3 commit 48be5cc
Showing 27 changed files with 1,323 additions and 1 deletion.
3 changes: 2 additions & 1 deletion build.sbt
@@ -238,7 +238,8 @@ lazy val integtest = (project in file("integ-test"))
 inConfig(IntegrationTest)(Defaults.testSettings ++ Seq(
   IntegrationTest / javaSource := baseDirectory.value / "src/integration/java",
   IntegrationTest / scalaSource := baseDirectory.value / "src/integration/scala",
-  IntegrationTest / parallelExecution := false,
+  IntegrationTest / resourceDirectory := baseDirectory.value / "src/integration/resources",
+  IntegrationTest / parallelExecution := false,
   IntegrationTest / fork := true,
 )),
 inConfig(AwsIntegrationTest)(Defaults.testSettings ++ Seq(
4 changes: 4 additions & 0 deletions docs/ppl-lang/README.md
@@ -104,6 +104,10 @@ For additional examples see the next [documentation](PPL-Example-Commands.md).
### Example PPL Queries
See samples of [PPL queries](PPL-Example-Commands.md)

---
### TPC-H PPL Query Rewriting
See samples of [TPC-H PPL query rewriting](ppl-tpch.md)

---
### Planned PPL Commands

102 changes: 102 additions & 0 deletions docs/ppl-lang/ppl-tpch.md
@@ -0,0 +1,102 @@
## TPC-H Benchmark

TPC-H is a decision support benchmark designed to evaluate the performance of database systems in handling complex business-oriented queries and concurrent data modifications. The benchmark utilizes a dataset that is broadly representative of various industries, making it widely applicable. TPC-H simulates a decision support environment where large volumes of data are analyzed, intricate queries are executed, and critical business questions are answered.

### Test PPL Queries

The 22 TPC-H test query statements: [TPCH-Query-PPL](https://github.com/opensearch-project/opensearch-spark/blob/main/integ-test/src/integration/resources/tpch)

### Data Preparation

#### Option 1 - from PyPi

```
# Create the virtual environment
python3 -m venv .venv
# Activate the virtual environment
. .venv/bin/activate
pip install tpch-datagen
```

#### Option 2 - from source

```
git clone https://github.com/gizmodata/tpch-datagen
cd tpch-datagen
# Create the virtual environment
python3 -m venv .venv
# Activate the virtual environment
. .venv/bin/activate
# Upgrade pip, setuptools, and wheel
pip install --upgrade pip setuptools wheel
# Install TPC-H Datagen - in editable mode with client and dev dependencies
pip install --editable .[dev]
```

#### Usage

Here are the options for the tpch-datagen command:
```
tpch-datagen --help
Usage: tpch-datagen [OPTIONS]
Options:
--version / --no-version Prints the TPC-H Datagen package version and
exits. [required]
--scale-factor INTEGER The TPC-H Scale Factor to use for data
generation.
--data-directory TEXT The target output data directory to put the
files into [default: data; required]
--work-directory TEXT The work directory to use for data
generation. [default: /tmp; required]
--overwrite / --no-overwrite Can we overwrite the target directory if it
already exists... [default: no-overwrite;
required]
--num-chunks INTEGER The number of chunks that will be generated
- more chunks equals smaller memory
requirements, but more files generated.
[default: 10; required]
--num-processes INTEGER The maximum number of processes for the
multi-processing pool to use for data
generation. [default: 10; required]
--duckdb-threads INTEGER The number of DuckDB threads to use for data
generation (within each job process).
[default: 1; required]
--per-thread-output / --no-per-thread-output
Controls whether to write the output to a
single file or multiple files (for each
process). [default: per-thread-output;
required]
--compression-method [none|snappy|gzip|zstd]
The compression method to use for the
parquet files generated. [default: zstd;
required]
--file-size-bytes TEXT The target file size for the parquet files
generated. [default: 100m; required]
--help Show this message and exit.
```

### Generate 1 GB of data with zstd compression (the default)

```
tpch-datagen --scale-factor 1
```

### Generate 10 GB of data with snappy compression

```
tpch-datagen --scale-factor 10 --compression-method snappy
```
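
The other options from the help output above can be combined in the same way. A hedged example follows; the output directory and chunk/process counts are illustrative choices, not values prescribed by the tool:

```
tpch-datagen \
  --scale-factor 1 \
  --data-directory ./tpch-data \
  --num-chunks 4 \
  --num-processes 4 \
  --compression-method snappy
```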

### Query Test

All TPC-H PPL queries are located in the `integ-test/src/integration/resources/tpch` folder.

To test all queries, run `org.opensearch.flint.spark.ppl.tpch.TPCHQueryITSuite`.
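
For example, given the sbt `IntegrationTest` configuration of the `integtest` project shown in `build.sbt` above, the suite can likely be invoked with a command along these lines (a sketch based on that project layout, not a command taken from the project documentation):

```
sbt "integtest/IntegrationTest/testOnly org.opensearch.flint.spark.ppl.tpch.TPCHQueryITSuite"
```
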
35 changes: 35 additions & 0 deletions integ-test/src/integration/resources/tpch/q1.ppl
@@ -0,0 +1,35 @@
/*
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem
where
l_shipdate <= date '1998-12-01' - interval '90' day
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus
*/

source = lineitem
| where l_shipdate <= subdate(date('1998-12-01'), 90)
| stats sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count() as count_order
by l_returnflag, l_linestatus
| sort l_returnflag, l_linestatus
45 changes: 45 additions & 0 deletions integ-test/src/integration/resources/tpch/q10.ppl
@@ -0,0 +1,45 @@
/*
select
c_custkey,
c_name,
sum(l_extendedprice * (1 - l_discount)) as revenue,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment
from
customer,
orders,
lineitem,
nation
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate >= date '1993-10-01'
and o_orderdate < date '1993-10-01' + interval '3' month
and l_returnflag = 'R'
and c_nationkey = n_nationkey
group by
c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment
order by
revenue desc
limit 20
*/

source = customer
| join ON c_custkey = o_custkey orders
| join ON l_orderkey = o_orderkey lineitem
| join ON c_nationkey = n_nationkey nation
| where o_orderdate >= date('1993-10-01')
AND o_orderdate < date_add(date('1993-10-01'), interval 3 month)
AND l_returnflag = 'R'
| stats sum(l_extendedprice * (1 - l_discount)) as revenue by c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
| sort - revenue
| head 20
45 changes: 45 additions & 0 deletions integ-test/src/integration/resources/tpch/q11.ppl
@@ -0,0 +1,45 @@
/*
select
ps_partkey,
sum(ps_supplycost * ps_availqty) as value
from
partsupp,
supplier,
nation
where
ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = 'GERMANY'
group by
ps_partkey having
sum(ps_supplycost * ps_availqty) > (
select
sum(ps_supplycost * ps_availqty) * 0.0001000000
from
partsupp,
supplier,
nation
where
ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = 'GERMANY'
)
order by
value desc
*/

source = partsupp
| join ON ps_suppkey = s_suppkey supplier
| join ON s_nationkey = n_nationkey nation
| where n_name = 'GERMANY'
| stats sum(ps_supplycost * ps_availqty) as value by ps_partkey
| where value > [
source = partsupp
| join ON ps_suppkey = s_suppkey supplier
| join ON s_nationkey = n_nationkey nation
| where n_name = 'GERMANY'
| stats sum(ps_supplycost * ps_availqty) as check
| eval threshold = check * 0.0001000000
| fields threshold
]
| sort - value
42 changes: 42 additions & 0 deletions integ-test/src/integration/resources/tpch/q12.ppl
@@ -0,0 +1,42 @@
/*
select
l_shipmode,
sum(case
when o_orderpriority = '1-URGENT'
or o_orderpriority = '2-HIGH'
then 1
else 0
end) as high_line_count,
sum(case
when o_orderpriority <> '1-URGENT'
and o_orderpriority <> '2-HIGH'
then 1
else 0
end) as low_line_count
from
orders,
lineitem
where
o_orderkey = l_orderkey
and l_shipmode in ('MAIL', 'SHIP')
and l_commitdate < l_receiptdate
and l_shipdate < l_commitdate
and l_receiptdate >= date '1994-01-01'
and l_receiptdate < date '1994-01-01' + interval '1' year
group by
l_shipmode
order by
l_shipmode
*/

source = orders
| join ON o_orderkey = l_orderkey lineitem
| where l_commitdate < l_receiptdate
and l_shipdate < l_commitdate
and l_shipmode in ('MAIL', 'SHIP')
and l_receiptdate >= date('1994-01-01')
and l_receiptdate < date_add(date('1994-01-01'), interval 1 year)
| stats sum(case(o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH', 1 else 0)) as high_line_count,
sum(case(o_orderpriority != '1-URGENT' and o_orderpriority != '2-HIGH', 1 else 0)) as low_line_count
by l_shipmode
| sort l_shipmode
31 changes: 31 additions & 0 deletions integ-test/src/integration/resources/tpch/q13.ppl
@@ -0,0 +1,31 @@
/*
select
c_count,
count(*) as custdist
from
(
select
c_custkey,
count(o_orderkey) as c_count
from
customer left outer join orders on
c_custkey = o_custkey
and o_comment not like '%special%requests%'
group by
c_custkey
) as c_orders
group by
c_count
order by
custdist desc,
c_count desc
*/

source = [
source = customer
| left outer join ON c_custkey = o_custkey AND not like(o_comment, '%special%requests%')
orders
| stats count(o_orderkey) as c_count by c_custkey
] as c_orders
| stats count() as custdist by c_count
| sort - custdist, - c_count
25 changes: 25 additions & 0 deletions integ-test/src/integration/resources/tpch/q14.ppl
@@ -0,0 +1,25 @@
/*
select
100.00 * sum(case
when p_type like 'PROMO%'
then l_extendedprice * (1 - l_discount)
else 0
end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue
from
lineitem,
part
where
l_partkey = p_partkey
and l_shipdate >= date '1995-09-01'
and l_shipdate < date '1995-09-01' + interval '1' month
*/

source = lineitem
| join ON l_partkey = p_partkey
AND l_shipdate >= date('1995-09-01')
AND l_shipdate < date_add(date('1995-09-01'), interval 1 month)
part
| stats sum(case(like(p_type, 'PROMO%'), l_extendedprice * (1 - l_discount) else 0)) as sum1,
sum(l_extendedprice * (1 - l_discount)) as sum2
| eval promo_revenue = 100.00 * sum1 / sum2 // Stats and Eval commands can be combined once issues/819 is resolved
| fields promo_revenue