Skip to content
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
48 changes: 48 additions & 0 deletions presto/scripts/run_benchmark.sh
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,42 @@

set -e

TPCH_TABLES=(customer lineitem nation orders part partsupp region supplier)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of our scripts try to avoid hard-coding the table names - although it might add too much complexity to fetch them dynamically. If we expand this to more benchmarks then we may need to update this; but frankly I'm inclined to leave it for now to get simple verification going.

HIVE_METASTORE_DIR="$(readlink -f ../docker/.hive_metastore)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be using something like the SHOW STATS command instead of going through Hive Metastore files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be using something like the SHOW STATS command instead of going through Hive Metastore files?

Doing that up front for all TPCH tables adds noticeable overhead before every benchmark run. The command only works per table (and sometimes per column), so we’d need to run a Presto query for every table/column we care about. Some Presto versions return empty stats rows even when analysis hasn’t been done, so we’d still need to cross-check other metadata to be sure. That can get flaky when users customize their catalogs. Inspecting the .prestoSchema files in the Hive metastore gives us a fast, deterministic check if the JSON file exists and its columnStatistics block is non-empty, we know ANALYZE has been run at least once for that table. It’s essentially the ground truth that SHOW STATS would read anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specification for the prestoSchema file, or is this relying on an implementation detail?


verify_schema_stats() {
local schema=$1
local missing=0

if [[ ! -d "$HIVE_METASTORE_DIR/$schema" ]]; then
echo "Error: Hive metastore directory not found for schema '$schema' at $HIVE_METASTORE_DIR/$schema"
exit 1
fi

for table in "${TPCH_TABLES[@]}"; do
local stats_file="$HIVE_METASTORE_DIR/$schema/$table/.prestoSchema"
if [[ ! -s "$stats_file" ]]; then
echo "Missing column statistics for $schema.$table ($stats_file)"
missing=1
continue
fi
if ! grep -q '"columnStatistics"' "$stats_file"; then
echo "Column statistics entry absent for $schema.$table ($stats_file)"
missing=1
fi
done

if [[ $missing -ne 0 ]]; then
cat <<EOF

Column statistics are required before running GPU benchmarks.
Run presto-native-cpu and ANALYZE all tables in schema '$schema', then retry.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an analyze_tables.sh script in presto/scripts/ that we should point the user at here so they don't need to do the work themselves.

Frankly, I think we should automatically run the script when we generate tables, but that's work for a separate PR.


EOF
exit 1
fi
}

print_help() {
cat << EOF

Expand Down Expand Up @@ -167,6 +203,13 @@ if [[ -z ${SCHEMA_NAME} ]]; then
exit 1
fi

IS_GPU_CLUSTER=0
if command -v docker >/dev/null 2>&1; then
if docker ps --format '{{.Names}}' 2>/dev/null | grep -q 'presto-native-worker-gpu'; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to only require ANALYZE to have been run for gpu clusters? If we want an appropriate comparison, should we not check this for all clusters regardless of type?

If that's the case, then we can simplify here by removing all the gpu cluster checking and just always require ANALYZE to have been run.

IS_GPU_CLUSTER=1
fi
fi

PYTEST_ARGS=("--schema-name ${SCHEMA_NAME}")

if [[ -n ${QUERIES} ]]; then
Expand Down Expand Up @@ -217,6 +260,11 @@ pip install -q -r ${TEST_DIR}/requirements.txt

source ./common_functions.sh

if [[ $IS_GPU_CLUSTER -eq 1 && ${BENCHMARK_TYPE} == "tpch" ]]; then
echo "Detected presto-native-gpu cluster; verifying Hive column statistics for schema '${SCHEMA_NAME}'..."
verify_schema_stats "${SCHEMA_NAME}"
fi

wait_for_worker_node_registration "$HOST_NAME" "$PORT"

BENCHMARK_TEST_DIR=${TEST_DIR}/performance_benchmarks
Expand Down