-
Notifications
You must be signed in to change notification settings - Fork 4
Add GPU benchmark stats guard #135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -16,6 +16,42 @@ | |
|
|
||
| set -e | ||
|
|
||
| TPCH_TABLES=(customer lineitem nation orders part partsupp region supplier) | ||
| HIVE_METASTORE_DIR="$(readlink -f ../docker/.hive_metastore)" | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we be using something like the
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Doing that up front for all TPCH tables adds noticeable overhead before every benchmark run. The command only works per table (and sometimes per column), so we’d need to run a Presto query for every table/column we care about. Some Presto versions return empty stats rows even when analysis hasn’t been done, so we’d still need to cross-check other metadata to be sure. That can get flaky when users customize their catalogs. Inspecting the .prestoSchema files in the Hive metastore gives us a fast, deterministic check if the JSON file exists and its columnStatistics block is non-empty, we know ANALYZE has been run at least once for that table. It’s essentially the ground truth that SHOW STATS would read anyway.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there a specification for the |
||
|
|
||
| verify_schema_stats() { | ||
| local schema=$1 | ||
| local missing=0 | ||
|
|
||
| if [[ ! -d "$HIVE_METASTORE_DIR/$schema" ]]; then | ||
| echo "Error: Hive metastore directory not found for schema '$schema' at $HIVE_METASTORE_DIR/$schema" | ||
| exit 1 | ||
| fi | ||
|
|
||
| for table in "${TPCH_TABLES[@]}"; do | ||
| local stats_file="$HIVE_METASTORE_DIR/$schema/$table/.prestoSchema" | ||
| if [[ ! -s "$stats_file" ]]; then | ||
| echo "Missing column statistics for $schema.$table ($stats_file)" | ||
| missing=1 | ||
| continue | ||
| fi | ||
| if ! grep -q '"columnStatistics"' "$stats_file"; then | ||
| echo "Column statistics entry absent for $schema.$table ($stats_file)" | ||
| missing=1 | ||
| fi | ||
| done | ||
|
|
||
| if [[ $missing -ne 0 ]]; then | ||
| cat <<EOF | ||
|
|
||
| Column statistics are required before running TPCH benchmarks. | ||
| Run ./analyze_tables.sh --schema-name '$schema' (and optional connection args) before retrying. | ||
|
|
||
| EOF | ||
| exit 1 | ||
| fi | ||
| } | ||
|
|
||
| print_help() { | ||
| cat << EOF | ||
|
|
||
|
|
@@ -217,6 +253,11 @@ pip install -q -r ${TEST_DIR}/requirements.txt | |
|
|
||
| source ./common_functions.sh | ||
|
|
||
| if [[ ${BENCHMARK_TYPE} == "tpch" ]]; then | ||
| echo "Verifying Hive column statistics for schema '${SCHEMA_NAME}'..." | ||
| verify_schema_stats "${SCHEMA_NAME}" | ||
| fi | ||
|
|
||
| wait_for_worker_node_registration "$HOST_NAME" "$PORT" | ||
|
|
||
| BENCHMARK_TEST_DIR=${TEST_DIR}/performance_benchmarks | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of our scripts try to avoid hard-coding the table names - although it might add too much complexity to fetch them dynamically. If we expand this to more benchmarks then we may need to update this; but frankly I'm inclined to leave it for now to get simple verification going.