Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
41 changes: 41 additions & 0 deletions presto/scripts/run_benchmark.sh
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,42 @@

set -e

TPCH_TABLES=(customer lineitem nation orders part partsupp region supplier)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of our scripts try to avoid hard-coding the table names - although it might add too much complexity to fetch them dynamically. If we expand this to more benchmarks then we may need to update this; but frankly I'm inclined to leave it for now to get simple verification going.

HIVE_METASTORE_DIR="$(readlink -f ../docker/.hive_metastore)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be using something like the SHOW STATS command instead of going through Hive Metastore files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be using something like the SHOW STATS command instead of going through Hive Metastore files?

Doing that up front for all TPCH tables adds noticeable overhead before every benchmark run. The command only works per table (and sometimes per column), so we’d need to run a Presto query for every table/column we care about. Some Presto versions return empty stats rows even when analysis hasn’t been done, so we’d still need to cross-check other metadata to be sure. That can get flaky when users customize their catalogs. Inspecting the .prestoSchema files in the Hive metastore gives us a fast, deterministic check if the JSON file exists and its columnStatistics block is non-empty, we know ANALYZE has been run at least once for that table. It’s essentially the ground truth that SHOW STATS would read anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specification for the prestoSchema file, or is this relying on an implementation detail?


verify_schema_stats() {
local schema=$1
local missing=0

if [[ ! -d "$HIVE_METASTORE_DIR/$schema" ]]; then
echo "Error: Hive metastore directory not found for schema '$schema' at $HIVE_METASTORE_DIR/$schema"
exit 1
fi

for table in "${TPCH_TABLES[@]}"; do
local stats_file="$HIVE_METASTORE_DIR/$schema/$table/.prestoSchema"
if [[ ! -s "$stats_file" ]]; then
echo "Missing column statistics for $schema.$table ($stats_file)"
missing=1
continue
fi
if ! grep -q '"columnStatistics"' "$stats_file"; then
echo "Column statistics entry absent for $schema.$table ($stats_file)"
missing=1
fi
done

if [[ $missing -ne 0 ]]; then
cat <<EOF

Column statistics are required before running TPCH benchmarks.
Run ./analyze_tables.sh --schema-name '$schema' (and optional connection args) before retrying.

EOF
exit 1
fi
}

print_help() {
cat << EOF

Expand Down Expand Up @@ -217,6 +253,11 @@ pip install -q -r ${TEST_DIR}/requirements.txt

source ./common_functions.sh

if [[ ${BENCHMARK_TYPE} == "tpch" ]]; then
echo "Verifying Hive column statistics for schema '${SCHEMA_NAME}'..."
verify_schema_stats "${SCHEMA_NAME}"
fi

wait_for_worker_node_registration "$HOST_NAME" "$PORT"

BENCHMARK_TEST_DIR=${TEST_DIR}/performance_benchmarks
Expand Down