[GLUTEN-7641][VL] Add Gluten benchmark scripts #7642
tools/notebook/README.md
Outdated
- Install system dependencies and set up jupyter notebook
- Configure Hadoop and Spark
- Configure kernel parameters
- Install monitoring tools (e.g., sar, emon)
Let's remove emon
Why are there 3 TPC-DS query sets? Can we consolidate them into one? ./tools/gluten-it/common/src/main/resources/tpcds-queries
We may put it under
@FelixYBW
tools/notebook/README.md
Outdated
@@ -0,0 +1,38 @@
# Setup, Build and Benchmark Spark/Gluten with Jupyter Notebook |
Is the PR a work in progress or ready to merge? As far as I can see, the contents of tools/notebook and tools/workload are identical.
WIP
@zhztheplayer Moved the contents of tools/notebook to tools/workload/benchmark_velox
@@ -0,0 +1,96 @@
# Licensed to the Apache Software Foundation (ASF) under one or more |
Do we require a specific Python version? Python 3 or Python 2?
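One way to make the requirement explicit is a guard at the top of the script. The following is a minimal sketch, not code from the PR: the name `check_python_version` and the 3.7 floor are assumptions. The floor is chosen because `subprocess.run()` accepts the `capture_output` and `text` keyword arguments only from Python 3.7 onward, and the script under review uses both.

```python
import sys

# Assumption: 3.7 is the minimum because subprocess.run() gained the
# capture_output and text keyword arguments in that release.
MIN_PYTHON = (3, 7)


def check_python_version(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if not check_python_version():
        # sys.stderr.write keeps this file parseable under Python 2 as well,
        # so the guard can actually report the problem instead of failing
        # with a SyntaxError.
        sys.stderr.write("Python %d.%d or newer is required.\n" % MIN_PYTHON)
        sys.exit(1)
```

Running the script under Python 2 would then stop with a clear message rather than a `TypeError` from the first `subprocess.run` call.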
def run_and_log(cmd):
    print('\033[92m' + '>>> Running command: ' + repr(cmd) + '\033[0m')
    result = subprocess.run(cmd, check=True, shell=True, capture_output=True, text=True)
    print(result.stdout)
Can we add print("=======stdout============") to indicate that what follows is the stdout log, and do the same for stderr?
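The suggestion could look roughly like the sketch below. It is one possible shape, not the change that was merged: deferring the failure check with `check_returncode()` and returning the result object are assumptions added here so that stderr is still printed when the command fails.

```python
import subprocess


def run_and_log(cmd):
    """Run a shell command and print stdout and stderr under separate headers."""
    print('\033[92m' + '>>> Running command: ' + repr(cmd) + '\033[0m')
    # check=True is dropped from the call so both streams can be printed
    # before an error is raised for a non-zero exit code.
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    print("=======stdout============")
    print(result.stdout)
    print("=======stderr============")
    print(result.stderr)
    # Same effect as check=True: raises CalledProcessError on failure.
    result.check_returncode()
    return result
```

With `check=True` in the `subprocess.run` call itself, the exception would be raised before either stream is printed, which is why the check is moved to the end in this sketch.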
all_disks = filter_empty_str(subprocess.run("lsblk -I 7,8,259 -npd --output NAME".split(' '), capture_output=True, text=True).stdout.split('\n'))
if not all_disks:
    print("No disks found on system. Exit.")
    sys.exit(0)
Use sys.exit(1); I assume this is not a normal state.
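A minimal sketch of the requested change follows. `filter_empty_str` is re-implemented here as a hypothetical stand-in because its definition is not shown in this excerpt, and the `list_disks` and `require_disks` wrappers, as well as writing the message to stderr, are assumptions made for illustration.

```python
import subprocess
import sys


def filter_empty_str(items):
    """Hypothetical stand-in for the PR's helper: drop empty strings."""
    return [s for s in items if s]


def list_disks():
    """Return the device names reported by the lsblk call from the diff."""
    out = subprocess.run(
        "lsblk -I 7,8,259 -npd --output NAME".split(' '),
        capture_output=True, text=True,
    ).stdout
    return filter_empty_str(out.split('\n'))


def require_disks(all_disks):
    """Exit with a non-zero status when no disks were found."""
    if not all_disks:
        print("No disks found on system. Exit.", file=sys.stderr)
        # A non-zero status lets callers and CI detect the failure.
        sys.exit(1)
    return all_disks
```

A caller would use it as `all_disks = require_disks(list_disks())`, so a shell wrapper or notebook runner sees the failure instead of a successful exit.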
After execution, the output notebook will be saved as `gluten_tpch.ipynb`.
If you want to use different parameters, you can specify them via the `-f` option. It will overwrite the previously defined parameters in `params.yaml`. e.g. To switch to the TPC-DS workload, run:
specify them via the -p
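The override behaviour described in the README text can be illustrated with a small sketch. Everything here is an assumption for illustration: the excerpt does not show how the PR's tooling parses its options, so the `-p NAME VALUE` shape, the function names, and the `workload` and `scale` keys are hypothetical.

```python
import argparse


def parse_overrides(argv):
    """Collect repeated '-p NAME VALUE' pairs into a dict (assumed option shape)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-p", dest="overrides", nargs=2, action="append", default=[],
        metavar=("NAME", "VALUE"),
    )
    return dict(parser.parse_args(argv).overrides)


def merge_params(defaults, overrides):
    """Return a copy of the file-defined defaults with overrides applied on top."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


# Hypothetical keys standing in for whatever params.yaml defines.
defaults = {"workload": "tpch", "scale": "100"}
params = merge_params(defaults, parse_overrides(["-p", "workload", "tpcds"]))
```

Values given on the command line replace the ones loaded from the file, while keys that are not overridden keep their file-defined values.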
initialize.ipynb. Let's remove the BKM section
Looks good. Let's test on cloud once we have a chance.
The notebooks demonstrate how to set up, build, and benchmark Spark/Gluten with Jupyter Notebook.