-
Notifications
You must be signed in to change notification settings - Fork 4
Add GPU benchmark stats guard #135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
aa253b2 to
0ac467d
Compare
misiugodfrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only open question is whether we should only require this for GPU clusters. Unless I'm missing something, I think we should require this for all benchmarks, regardless of variant, which also simplifies some of the code.
I also think we should add support to automatically run analyze_tables when we generate the benchmark as part of our setup_benchmark_data_and_tables script, but that work can be done separately if we agree it should move forward.
|
|
||
| set -e | ||
|
|
||
| TPCH_TABLES=(customer lineitem nation orders part partsupp region supplier) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of our scripts try to avoid hard-coding the table names - although it might add too much complexity to fetch them dynamically. If we expand this to more benchmarks then we may need to update this; but frankly I'm inclined to leave it for now to get simple verification going.
presto/scripts/run_benchmark.sh
Outdated
|
|
||
| IS_GPU_CLUSTER=0 | ||
| if command -v docker >/dev/null 2>&1; then | ||
| if docker ps --format '{{.Names}}' 2>/dev/null | grep -q 'presto-native-worker-gpu'; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason to only require ANALYZE to have been run for gpu clusters? If we want an appropriate comparison, should we not check this for all clusters regardless of type?
If that's the case, then we can simplify here by removing all the gpu cluster checking and just always require ANALYZE to have been run.
presto/scripts/run_benchmark.sh
Outdated
| cat <<EOF | ||
| Column statistics are required before running GPU benchmarks. | ||
| Run presto-native-cpu and ANALYZE all tables in schema '$schema', then retry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an analyze_tables.sh script in presto/scripts/ that we should point the user at here so they don't need to do the work themselves.
Frankly, I think we should automatically run the script when we generate tables, but that's work for a separate PR.
|
Let me know when I can close this PR. Please review: I’ve reverted the previous stats-checking logic and repurposed the changes so the PR now ensures we automatically run ANALYZE right after tables are created by the setup scripts. This targets your earlier suggestion directly. Less seems to be more this case. |
|
This PR is a great initiative.
I think even if we implement an (optional or mandatory) call to analyze tables when setting up the benchmarks, it would still make sense to independently validate that the state we expect actually exists persistently (what this PR does/did originally.) It probably makes sense to do both :) |
I think it's a good idea to commit this PR even with #136 going in as well. It's good to have a check in place to make sure things were set up correctly, as the data won't always be set up using the channels we provide. I think all this needs right now is to remove the IS_GPU_CLUSTER checking and it's good to go. |
|
Does #136 avoid the need for this PR? |
I think they are both useful. 136 makes sure we are set up when using the default path, and this one verifies that it was used. There are still cases where we could have set up differently and then this PR will catch validation issues. |
| set -e | ||
|
|
||
| TPCH_TABLES=(customer lineitem nation orders part partsupp region supplier) | ||
| HIVE_METASTORE_DIR="$(readlink -f ../docker/.hive_metastore)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be using something like the SHOW STATS command instead of going through Hive Metastore files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be using something like the
SHOW STATScommand instead of going through Hive Metastore files?
Doing that up front for all TPCH tables adds noticeable overhead before every benchmark run. The command only works per table (and sometimes per column), so we’d need to run a Presto query for every table/column we care about. Some Presto versions return empty stats rows even when analysis hasn’t been done, so we’d still need to cross-check other metadata to be sure. That can get flaky when users customize their catalogs. Inspecting the .prestoSchema files in the Hive metastore gives us a fast, deterministic check if the JSON file exists and its columnStatistics block is non-empty, we know ANALYZE has been run at least once for that table. It’s essentially the ground truth that SHOW STATS would read anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specification for the prestoSchema file, or is this relying on an implementation detail?
Add a guard in presto/scripts/run_benchmark.sh that detects when the presto-native-gpu stack is active and verifies all Hive .prestoSchema stats files exist for the chosen TPCH schema before running GPU benchmarks. If any stats file is missing, the script instructs the user to run presto-native-cpu and ANALYZE tables first.