Skip to content

Latest commit

 

History

History
226 lines (197 loc) · 7.64 KB

README.md

File metadata and controls

226 lines (197 loc) · 7.64 KB

Query Generator for Postgres

Tool generates SELECT queries, whose depth can be configured, with different join orders. It also generates DDLs required for query execution. You can also tweak configuration parameters for data inserting command generation.

Configuration

You can configure 3 different parts:

DDL Configuration

Tool generates related ddl commands before generating queries.

Schema for DDL configuration:

ddlOutFile: <string>
commonColName: <string>
targetTables: <Table[]>
  - Table:
      name: <string>
      citusType: <CitusType>
      maxCount: <int>
      rowCount: <int>
      nullRate: <float>
      duplicateRate: <float>
      useRandom: <bool>
      columns: <Column[]>
        - Column: 
          name: <string>
          type: <string>
      dupCount: <int>

Explanation:

ddlOutFile: "file to write generated DDL commands"
commonColName: "name of the column that will be used as distribution column, filter column in restrictions and target column in selections"
targetTables: "array of tables that will be used in generated queries"
  - Table:
      name: "name prefix of table"
      citusType: "citus type of table"
      maxCount: "limits how many times table can appear in query"
      rowCount: "total # of rows that will be inserted into table"
      nullRate: "percentage of null rows in rowCount that will be inserted into table"
      duplicateRate: "percentage of duplicates in rowCount that will be inserted into table"
      useRandom: "should we generate random rows"
      columns: "array of columns in table"
        - Column: 
          name: "name of column"
          type: "name of data type of column(only support 'int' now)"
      dupCount: "how many tables with the same configuration we should create(only by changing full name, still using the same name prefix)"

Data Insertion Configuration

Tool generates data insertion commands if you want tables with filled data. You can configure total number of rows, what percentage of them should be null and what percentage of them should be duplicated. For related configuration see Table schema at DDL Configuration. You can also configure range of the randomly generated data. See dataRange at Query Configuration.

Query Configuration

After generation of ddls and data insertion commands, the tool generates queries.

Schema for Query configuration:

queryCount: <int>
queryOutFile: <string>
semiAntiJoin: <bool>
limit: <bool>
orderby: <bool>
forceOrderbyWithLimit: <bool>
aggregate: <bool>
useAvgAtTopLevelTarget: <bool>
dataRange:
  from: <int>
  to: <int>
filterRange:
  from: <int>
  to: <int>
limitRange:
  from: <int>
  to: <int>
targetRteCount: <int>
targetCteCount: <int>
targetCteRteCount: <int>
targetJoinTypes: <JoinType[]>
targetRteTypes: <RteType[]>
targetRestrictOps: <RestrictOp[]>
targetAggregateFunctions: <string[]>

Explanation:

queryCount: "number of queries to generate"
queryOutFile: "fileto write generated queries"
semiAntiJoin: "should we support semin joins (WHERE col IN (Subquery))"
limit: "should we support limit clause"
orderby: "should we support order by clause"
forceOrderbyWithLimit: "should we force order by when we use limit"
aggregate: "should we support aggregate at targetlist"
useAvgAtTopLevelTarget: "should we make top level query as select avg() from (subquery)"
dataRange:
  from: "starting boundary for data generation"
  to: "end boundary for data generation"
filterRange:
  from: "starting boundary for restriction clause"
  to: "end boundary for restriction clause"
limitRange:
  from: "starting boundary for limit clause"
  to: "end boundary for data limit clause"
targetRteCount: "limits how many rtes should exist in non-cte part of the query"
targetCteCount: "limits how many ctes should exist in query"
targetCteRteCount: "limits how many rtes should exist in cte part of the query"
targetJoinTypes: "supported join types"
targetRteTypes: "supported rte types"
targetRestrictOps: "supported restrict ops"
targetAggregateFunctions: "supported aggregate function names"

Misc Configuration

Tool has some configuration options which does not suit above 3 parts.

Schema for misc configuration:

interactiveMode: <bool>

Explanation:

interactiveMode: "when true, interactively prints generated ddls and queries. Otherwise, it writes them to configured files."

Goal of the Tool

Tool supports a simple syntax to be useful to generate queries with different join orders. Main motivation for me to create the tool was to compare results of the generated queries for different Citus tables and Postgres tables. That is why we support a basic syntax for now. It can be extended to support different queries.

Supported Operations

It uses commonColName for any kind of target selection required for any supported query clause.

Column Type Support

Tool currently supports only int data type, but plans to support other basic types.

Join Support

Tool supports following joins:

targetJoinTypes:
  - INNER
  - LEFT
  - RIGHT
  - FULL

Citus Table Support

Tool supports following citus table types:

targetTables:
  - Table:
    ...
    citusType: <one of (DISTRIBUTED || REFERENCE || POSTGRES)>
    ...

Restrict Operation Support

Tool supports following restrict operations:

targetRestrictOps:
  - LT
  - EQ
  - GT

Rte Support

Tool supports following rtes:

targetRteTypes:
  - RELATION
  - SUBQUERY
  - CTE
  - VALUES

Aggregation Support

Tool supports any aggregate functions which takes int column as input and returns int result. e.g. max, min, count

How to Generate Queries?

You have 2 different options.

Interactive Mode

In this mode, you will be prompted to continue generating a query. When you hit to Enter, it will continue generating them. You will need to hit to x to exit.

  1. Configure interactiveMode: true in config.yml,
  2. Run the command shown below
python main.py

File Mode

In this mode, generated ddls and queries will be written into configured files.

  1. Configure interactiveMode: false,
  2. Configure queryCount: <total_query>,
  3. Configure queryOutFile: <query_file_path> and ddlOutFile: <ddlfile_path>
  4. Run the command shown below
python main.py

How to Run Citus Join Verification?

You can verify if Citus breaks any default PG join behaviour via tests/citus_compare_dist_local_joins.sh. It creates tables specified in config. Then, it runs generated queries on those tables and saves the results into out/dist_queries.out. After running those queries for Citus tables, it creates PG tables with the same names as previous run, executes the same queries, and saves the results into out/local_queries.out. In final step, it generates diff between local and distributed results. You can see the contents of out/local_dist_diffs to see if there is any Citus unsupported query.

  1. Create a Citus local cluster with 2 workers by using citus_dev tool
citus_dev make testCluster --destroy
  1. Run the test,
bash tests/citus_compare_dist_local_joins.sh 9700
  1. See the diff content in out/local_dist_diffs