Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-7641][VL] Add Gluten benchmark scripts #7642

Merged
merged 13 commits into from
Nov 14, 2024

Conversation

marin-ma
Copy link
Contributor

@marin-ma marin-ma commented Oct 22, 2024

The notebooks demonstrate how to setup, build and benchmark Spark/Gluten with Jupyter Notebook

Copy link

#7641

- Install system dependencies and set up jupyter notebook
- Configure Hadoop and Spark
- Configure kernel parameters
- Install monitoring tools (e.g., sar, emon)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove emon

@zhztheplayer
Copy link
Member

Thank you!

BTW there were a couple of related efforts in our code base (not all of them):

#432
#5278

Should we review them then remove the unnecessary / unmaintained ones? If they are still needed, I think we can create a new directory like examples to centralize them.

@FelixYBW
Copy link
Contributor

Why there are 3 TPCDS queries set? Can we consolidate to one?

./tools/gluten-it/common/src/main/resources/tpcds-queries
./gluten-core/src/test/resources/tpcds-queries
./gluten-core/target/scala-2.12/test-classes/tpcds-queries

@FelixYBW
Copy link
Contributor

Thank you!

BTW there were a couple of related efforts in our code base (not all of them):

#432 #5278

Should we review them then remove the unnecessary / unmaintained ones? If they are still needed, I think we can create a new directory like examples to centralize them.

We may put it under tools/workload, name it as benchmark_velox since the script only support Velox.

@marin-ma
Copy link
Contributor Author

Why there are 3 TPCDS queries set? Can we consolidate to one?

./tools/gluten-it/common/src/main/resources/tpcds-queries ./gluten-core/src/test/resources/tpcds-queries ./gluten-core/target/scala-2.12/test-classes/tpcds-queries

@FelixYBW ./gluten-core/target/scala-2.12/test-classes/tpcds-queries is generated by maven compile time. It's not in the code base.

./tools/gluten-it/common/src/main/resources/tpcds-queries is the one used by GHA and notebook scripts
./gluten-core/src/test/resources/tpcds-queries Not sure if this one is used by any Gluten UT. I will double check. If not, we can remove it.

@marin-ma
Copy link
Contributor Author

@FelixYBW Opened #7666 for some removals.

backends-velox/src/test/resources/tpch-queries-velox should also be removed. I will open another PR to remove them.

@@ -0,0 +1,38 @@
# Setup, Build and Benchmark Spark/Gluten with Jupyter Notebook
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the PR a work in progress or ready to merge? As I see contents in tools/notebook and tools/workload are identical.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WIP

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhztheplayer Moved contents in tools/notebook to tools/workload/benchmark_velox

@@ -0,0 +1,96 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we require the python version? python3 or python2?

def run_and_log(cmd):
print('\033[92m' + '>>> Running command: ' + repr(cmd) + '\033[0m')
result = subprocess.run(cmd, check=True, shell=True, capture_output=True, text=True)
print(result.stdout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add print("=======stdout============") to indicate it is stdout log, so as stderr

all_disks = filter_empty_str(subprocess.run("lsblk -I 7,8,259 -npd --output NAME".split(' '), capture_output=True, text=True).stdout.split('\n'))
if not all_disks:
print("No disks found on system. Exit.")
sys.exit(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sys.exit(1), I assume it is not a normal state.

```
After execution, the output notebook will be saved as `gluten_tpch.ipynb`.

If you want to use different parameters, you can specify them via the `-f` option. It will overwrite the previously defined parameters in `params.yaml`. e.g. To switch to the TPC-DS workload, run:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

specify them via the -p

@FelixYBW
Copy link
Contributor

initialize.ipynb. Let's remove the BKM section

@FelixYBW
Copy link
Contributor

Looks good. Let's test on cloud once we have a chance.

@marin-ma marin-ma merged commit 0b899c0 into apache:main Nov 14, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants