I'm working on a tool which needs to be able to extract a skeleton "join tree" from a given substrait query plan. For now, it will assume the plan was generated by SQL queries of the shape: SELECT * FROM r1, r2, ..., rN [WHERE ...].
Could I ask for some help figuring out the TODOs I've left in the code below? It is a simple script that creates some dummy tables using DuckDB, converts a cross join query on those tables into a substrait plan, and then extracts a dict-based join tree from that plan.
For example, if I run the script below, it generates a three-relation cross join query SELECT * FROM r1, r2, r3 and creates r1 with ten rows, r2 with one hundred rows, and r3 with 1 row. The DuckDB/substrait plan is then converted into this dict: {'left': 'r2', 'right': {'left': 'r1', 'right': 'r3'}} (note that the order of r2, r1, r3 changes if you change the cardinalities of the relations).
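To make the dict shape concrete, here is a minimal sketch of a helper that flattens such a join tree into its leaf relations, left to right. The name `join_tree_leaves` is hypothetical and not part of my script; it assumes leaves are table-name strings and inner nodes are dicts with `left` and `right` keys.

```python
def join_tree_leaves(tree):
    """Return the leaf relation names of a dict-based join tree, left to right."""
    if isinstance(tree, str):
        # A leaf is just a (possibly dotted) table name.
        return [tree]
    # An inner node is a dict with 'left' and 'right' subtrees.
    return join_tree_leaves(tree['left']) + join_tree_leaves(tree['right'])


example = {'left': 'r2', 'right': {'left': 'r1', 'right': 'r3'}}
print(join_tree_leaves(example))  # ['r2', 'r1', 'r3']
```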
Questions
For queries of the "shape" above (e.g. SELECT * FROM r1, r2, ..., rN [WHERE ...]), can I assume the root plan.relations will always have a length of 1 (i.e. the "result relation")?
How can I implement the recursion over a node's fields (or over an iterable node) correctly, instead of just taking the first non-None result as I do now? Perhaps I just need to spend more time reading the spec.
Any other feedback/ideas to make this more robust/simpler?
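To make the second question concrete, this is roughly the behaviour I imagine wanting: gather the results from all children and insist that at most one of them is non-None, rather than silently returning the first. The helper name `single_result` is hypothetical, and whether the "at most one" invariant really holds for substrait plans is an assumption I have not verified.

```python
def single_result(results):
    """Return the only non-None item, None if there is none, else raise."""
    found = [r for r in results if r is not None]
    if len(found) > 1:
        # More than one child produced a join subtree: fail loudly
        # instead of silently picking the first one.
        raise ValueError(f'expected at most one join subtree, got {len(found)}: {found}')
    return found[0] if found else None


print(single_result([None, 'r1', None]))  # r1
print(single_result([None, None]))        # None
```

Inside the recursion, this would be applied to the list of results from every field (or every element) instead of returning early from the loop.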
Simplified/runnable example:
import duckdb
# print(duckdb.__version__)
# 1.1.2

# import substrait
# print(substrait.__version__)
# 0.23.0

from substrait.proto import (
    CrossRel,
    JoinRel,
    Plan,
    ReadRel,
)

BOOL_TYPE = type(True)
INT_TYPE = type(0)
STRING_TYPE = type('')
BASE_TYPES = {
    BOOL_TYPE,
    INT_TYPE,
    STRING_TYPE,
}

READ_REL_TYPE = ReadRel
BASE_RELATION_TYPES = {
    READ_REL_TYPE,
}

CROSS_REL_TYPE = CrossRel
JOIN_REL_TYPE = JoinRel
JOIN_TYPES = {
    CROSS_REL_TYPE,
    JOIN_REL_TYPE,
}


def plan_to_join_tree(plan: Plan) -> dict:
    # TODO can i assume the length of the plan root's relations is 1
    # for queries like 'select * from r1, r2, ..., rN where ...'?
    assert len(plan.relations) == 1
    root_input = plan.relations[0].root.input

    def recur(node):
        node_type = type(node)
        # Scalars (bool, int, str) never contain a relation.
        if node_type in BASE_TYPES:
            return None
        # Leaf of the join tree: a base table read.
        if node_type in BASE_RELATION_TYPES:
            read_type = node.WhichOneof('read_type')
            if read_type == 'named_table':
                return '.'.join(node.named_table.names)
            else:
                raise NotImplementedError(f'unimplemented readrel type: {read_type}')
        if not hasattr(node, 'ListFields'):
            raise Exception(f'UNEXPECTED TYPE, {node}, {type(node)}')
        fields = node.ListFields()
        # Inner node of the join tree: recurse into both sides.
        if node_type in JOIN_TYPES:
            field_names = set(desc.name for desc, _ in fields)
            if not ('left' in field_names and 'right' in field_names):
                raise Exception(f'bad join type: {node}, {type(node)}, {field_names}')
            return {
                'left': recur(node.left),
                'right': recur(node.right),
            }
        # TODO how to handle multiple fields/iterable returning not None
        for _, field in fields:
            res = recur(field)
            if res is not None:
                return res
        if hasattr(node, '__len__'):
            for el in node:
                res = recur(el)
                if res is not None:
                    return res

    return recur(root_input)


DEFAULT_TABLE_SIZES = {
    'r1': 1,
    'r2': 10,
}


def duckdb_substrait_plan(table_sizes=DEFAULT_TABLE_SIZES):
    table_names = list(table_sizes.keys())
    con = duckdb.connect("TwoRelCross.duckdb")
    con.install_extension("substrait")
    con.load_extension("substrait")
    # TODO avoid sql string injection
    for table_name in table_names:
        con.execute(query=f"create table {table_name} (c1 integer)")
    for table_name, table_size in table_sizes.items():
        con.execute(query=f"insert into {table_name} values ({'),('.join(map(str, range(table_size)))})")
    for table_name in table_names:
        con.execute(query=f"vacuum {table_name}")
        con.execute(query=f"vacuum analyze {table_name}")
        con.execute(query=f"analyze {table_name}")
    substrait_proto_bytes = con.get_substrait(query=f"select * from {','.join(table_names)}").fetchone()[0]
    p = Plan()
    p.ParseFromString(substrait_proto_bytes)
    return p


def main():
    substrait_plan = duckdb_substrait_plan()
    join_tree = plan_to_join_tree(substrait_plan)
    print(join_tree)


if __name__ == '__main__':
    join_tree = main()
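One direction I'm considering instead of walking every field: dispatch on the `rel_type` one-of of each `Rel` and only follow the child relations (`input`, or `left`/`right`). The sketch below is an assumption based on my reading of the proto definitions, not a verified implementation; the set of single-input variants is not exhaustive, and `FakeMsg` and the `fake_*` helpers are hypothetical stand-ins that exist only so the sketch runs without substrait installed.

```python
from types import SimpleNamespace

# Rel variants with a single `input` child (assumed, not exhaustive).
SINGLE_INPUT_RELS = {'filter', 'project', 'fetch', 'sort', 'aggregate'}
# Rel variants with `left` and `right` children.
BINARY_RELS = {'cross', 'join'}


def rel_to_join_tree(rel):
    """Walk a Rel-like message by dispatching on its `rel_type` one-of."""
    kind = rel.WhichOneof('rel_type')
    if kind is None:
        raise ValueError('Rel has no rel_type set')
    body = getattr(rel, kind)
    if kind == 'read':
        read_type = body.WhichOneof('read_type')
        if read_type != 'named_table':
            raise NotImplementedError(f'unimplemented readrel type: {read_type}')
        return '.'.join(body.named_table.names)
    if kind in BINARY_RELS:
        return {
            'left': rel_to_join_tree(body.left),
            'right': rel_to_join_tree(body.right),
        }
    if kind in SINGLE_INPUT_RELS:
        return rel_to_join_tree(body.input)
    raise NotImplementedError(f'unimplemented rel type: {kind}')


class FakeMsg(SimpleNamespace):
    """Hypothetical stand-in for a protobuf message with one one-of group."""

    def WhichOneof(self, _group):
        return self.which


def fake_read(name):
    table = FakeMsg(which='named_table', named_table=SimpleNamespace(names=[name]))
    return FakeMsg(which='read', read=table)


def fake_cross(left, right):
    return FakeMsg(which='cross', cross=SimpleNamespace(left=left, right=right))


def fake_project(child):
    return FakeMsg(which='project', project=SimpleNamespace(input=child))


plan_root = fake_project(
    fake_cross(fake_read('r2'), fake_cross(fake_read('r1'), fake_read('r3')))
)
print(rel_to_join_tree(plan_root))
# {'left': 'r2', 'right': {'left': 'r1', 'right': 'r3'}}
```

With a real plan, the entry point would presumably be `rel_to_join_tree(plan.relations[0].root.input)`. Unknown variants raise instead of being skipped, so the walk fails loudly on plan shapes it was not written for.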