[DPE-3196] Add integration test for Apache Iceberg integration #69

Merged on Feb 1, 2024

Commits (35 total; diff below reflects 33 of them)
cde3628  Integrate Apache Iceberg jars with Rock Image (theoctober19th, Jan 24, 2024)
3136d86  Hardcode SHA sums for the downloaded jars. (theoctober19th, Jan 26, 2024)
cc7e95f  Ignore build artifacts from VCS (theoctober19th, Jan 29, 2024)
cc552b8  Add integration tests for Apache Iceberg integration (theoctober19th, Jan 29, 2024)
c682abd  Ignore build artifacts from VCS (theoctober19th, Jan 29, 2024)
f735866  Add integration tests for Apache Iceberg integration (theoctober19th, Jan 29, 2024)
d02d641  Fix tests not passing on CI (theoctober19th, Jan 29, 2024)
2c18492  Make script executable (theoctober19th, Jan 29, 2024)
6f0783b  Revert a docker command in Makefile (theoctober19th, Jan 29, 2024)
0b085e2  Remove set -x option in integration tests. (theoctober19th, Jan 29, 2024)
fd05cce  Use -o option instead of redirect (theoctober19th, Jan 29, 2024)
063c4e5  Change artifact permissions. (theoctober19th, Jan 29, 2024)
f22dabb  Give read permission to all. (theoctober19th, Jan 29, 2024)
c032913  Fix typo in build.yaml (theoctober19th, Jan 29, 2024)
07540ad  Add debug session step (theoctober19th, Jan 29, 2024)
3c25115  Move the order of jobs in CI (theoctober19th, Jan 29, 2024)
8217169  Revert "Move the order of jobs in CI" (theoctober19th, Jan 29, 2024)
149cc95  Temporarily disable other tests (theoctober19th, Jan 29, 2024)
75299b7  Sleep for 1000 seconds (theoctober19th, Jan 29, 2024)
9c9fb09  Increase sleep timing, set the default region for the AWS cli (theoctober19th, Jan 29, 2024)
ce0a06d  Unset debug mode (theoctober19th, Jan 29, 2024)
8ad6334  Remove SSH access to GH runner (theoctober19th, Jan 29, 2024)
87fb99e  Fix CI (theoctober19th, Jan 29, 2024)
4ff64d0  Revert "Add integration tests for Apache Iceberg integration" (theoctober19th, Jan 30, 2024)
c621d89  Revert "Ignore build artifacts from VCS" (theoctober19th, Jan 30, 2024)
efb0ee8  Add missing shell variable (theoctober19th, Jan 30, 2024)
2818165  Make default region configurable. (theoctober19th, Jan 30, 2024)
c230cf7  Remove commented lines (theoctober19th, Jan 30, 2024)
fff7f4a  Format with black. (theoctober19th, Jan 30, 2024)
d1a137b  Merge branch 'canonical:3.4-22.04/edge' into 3.4-22.04/edge (theoctober19th, Jan 30, 2024)
cf1e75b  Merge branch '3.4-22.04/edge' into iceberg-integration (theoctober19th, Jan 30, 2024)
4d7fbd6  Uncomment a line in integration tests. (theoctober19th, Jan 30, 2024)
c741c45  Uncomment teardown_test_pod (theoctober19th, Jan 30, 2024)
1da1160  Add comments (theoctober19th, Feb 1, 2024)
29c45bc  Fix a typo (theoctober19th, Feb 1, 2024)
3 changes: 3 additions & 0 deletions .github/workflows/build.yaml
@@ -44,6 +44,9 @@ jobs:
ARTIFACT=$(make help | grep 'Artifact: ')
echo "name=${ARTIFACT#'Artifact: '}" >> $GITHUB_OUTPUT

- name: Change artifact permissions
run: sudo chmod a+r ${{ steps.artifact.outputs.name }}

- name: Upload locally built artifact
uses: actions/upload-artifact@v3
with:
5 changes: 4 additions & 1 deletion .gitignore
@@ -1 +1,4 @@
.idea
.idea
*.rock
*.tar
.make_cache
1 change: 1 addition & 0 deletions CONTRIBUTING.md
@@ -13,6 +13,7 @@ cd charmed-spark-rock
sudo snap install rockcraft --edge
sudo snap install docker
sudo snap install lxd
sudo snap install yq
sudo snap install skopeo --edge --devmode
```

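The yq snap is added to the contributor prerequisites because the Makefile parses rockcraft.yaml with it. For illustration, the two expressions it runs (taken from the Makefile shown further below):

```bash
# How the Makefile derives the image name and version from rockcraft.yaml
yq .name rockcraft.yaml      # -> IMAGE_NAME
yq .version rockcraft.yaml   # -> VERSION
```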
12 changes: 9 additions & 3 deletions Makefile
@@ -24,6 +24,7 @@ _MAKE_DIR := .make_cache
$(shell mkdir -p $(_MAKE_DIR))

K8S_TAG := $(_MAKE_DIR)/.k8s_tag
AWS_TAG := $(_MAKE_DIR)/.aws_tag

IMAGE_NAME := $(shell yq .name rockcraft.yaml)
VERSION := $(shell yq .version rockcraft.yaml)
@@ -70,7 +71,7 @@ $(_TMP_OCI_TAG): $(_ROCK_OCI)
touch $(_TMP_OCI_TAG)

$(CHARMED_OCI_TAG): $(_TMP_OCI_TAG)
docker build - -t "$(CHARMED_OCI_FULL_NAME):$(TAG)" --build-arg BASE_IMAGE="$(_TMP_OCI_NAME):$(TAG)" < Dockerfile
docker build -t "$(CHARMED_OCI_FULL_NAME):$(TAG)" --build-arg BASE_IMAGE="$(_TMP_OCI_NAME):$(TAG)" -f Dockerfile .
if [ ! -d "$(_MAKE_DIR)/$(CHARMED_OCI_FULL_NAME)" ]; then mkdir -p "$(_MAKE_DIR)/$(CHARMED_OCI_FULL_NAME)"; fi
touch $(CHARMED_OCI_TAG)

@@ -80,10 +81,15 @@ $(K8S_TAG):
sg microk8s ./tests/integration/config-microk8s.sh
@touch $(K8S_TAG)

$(AWS_TAG): $(K8S_TAG)
@echo "=== Setting up and configure AWS CLI ==="
/bin/bash ./tests/integration/setup-aws-cli.sh
touch $(AWS_TAG)

microk8s: $(K8S_TAG)

$(_MAKE_DIR)/%/$(TAG).tar: $(_MAKE_DIR)/%/$(TAG).tag
docker save $*:$(TAG) > $(_MAKE_DIR)/$*/$(TAG).tar
docker save $*:$(TAG) -o $(_MAKE_DIR)/$*/$(TAG).tar

$(BASE_NAME): $(_MAKE_DIR)/$(CHARMED_OCI_FULL_NAME)/$(TAG).tar
@echo "=== Creating $(BASE_NAME) OCI archive ==="
@@ -106,7 +112,7 @@ import: $(K8S_TAG) build
microk8s ctr images import --base-name $(CHARMED_OCI_FULL_NAME):$(TAG) $(BASE_NAME)
endif

tests:
tests: $(K8S_TAG) $(AWS_TAG)
@echo "=== Running Integration Tests ==="
/bin/bash ./tests/integration/integration-tests.sh

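Taken together, the new AWS_TAG stamp and the extra prerequisites on the tests target chain the setup steps automatically. A rough sketch of the resulting local flow, assuming the build and import targets defined elsewhere in this Makefile:

```bash
# Hypothetical end-to-end run on a fresh machine (target names from the Makefile above)
make build     # build the rock and produce the charmed-spark OCI archive
make import    # K8S_TAG: bootstrap MicroK8s via config-microk8s.sh, then import the image
make tests     # AWS_TAG: install and configure the AWS CLI against MinIO (setup-aws-cli.sh),
               # then run ./tests/integration/integration-tests.sh
```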
1 change: 1 addition & 0 deletions tests/integration/config-microk8s.sh
@@ -2,3 +2,4 @@ microk8s status --wait-ready
microk8s config | tee ~/.kube/config
microk8s.enable dns
microk8s.enable rbac
microk8s.enable minio
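Enabling the minio addon is what provides the credentials and endpoint that the integration tests consume. As a quick sanity check (namespace, secret, and service names taken from the helper functions in integration-tests.sh), one might run:

```bash
# Confirm the MinIO addon enabled by config-microk8s.sh is up before running the tests
microk8s kubectl get pods -n minio-operator
microk8s kubectl get secret microk8s-user-1 -n minio-operator   # source of the access/secret keys
microk8s kubectl get service minio -n minio-operator            # ClusterIP used as the S3 endpoint
```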
104 changes: 103 additions & 1 deletion tests/integration/integration-tests.sh
@@ -135,7 +135,7 @@ teardown_test_pod() {
run_example_job_in_pod() {
SPARK_EXAMPLES_JAR_NAME="spark-examples_2.12-$(get_spark_version).jar"

PREVIOUS_JOB=$(kubectl get pods | grep driver | tail -n 1 | cut -d' ' -f1)
PREVIOUS_JOB=$(kubectl get pods -n ${NAMESPACE}| grep driver | tail -n 1 | cut -d' ' -f1)
NAMESPACE=$1
USERNAME=$2

@@ -166,6 +166,101 @@ run_example_job_in_pod() {
validate_pi_value $pi
}

get_s3_access_key(){
kubectl get secret -n minio-operator microk8s-user-1 -o jsonpath='{.data.CONSOLE_ACCESS_KEY}' | base64 -d
}

get_s3_secret_key(){
kubectl get secret -n minio-operator microk8s-user-1 -o jsonpath='{.data.CONSOLE_SECRET_KEY}' | base64 -d
}

get_s3_endpoint(){
kubectl get service minio -n minio-operator -o jsonpath='{.spec.clusterIP}'
}

create_s3_bucket(){
S3_ENDPOINT=$(get_s3_endpoint)
BUCKET_NAME=$1
aws --endpoint-url "http://$S3_ENDPOINT" s3api create-bucket --bucket "$BUCKET_NAME"
echo "Created S3 bucket ${BUCKET_NAME}"
}

delete_s3_bucket(){
S3_ENDPOINT=$(get_s3_endpoint)
BUCKET_NAME=$1
aws --endpoint-url "http://$S3_ENDPOINT" s3 rb "s3://$BUCKET_NAME" --force
echo "Deleted S3 bucket ${BUCKET_NAME}"
}

copy_file_to_s3_bucket(){
BUCKET_NAME=$1
FILE_PATH=$2
BASE_NAME=$(basename "$FILE_PATH")
S3_ENDPOINT=$(get_s3_endpoint)
aws --endpoint-url "http://$S3_ENDPOINT" s3 cp $FILE_PATH s3://"$BUCKET_NAME"/"$BASE_NAME"
echo "Copied file ${FILE_PATH} to S3 bucket ${BUCKET_NAME}"
}

test_iceberg_example_in_pod(){
create_s3_bucket spark
copy_file_to_s3_bucket spark ./tests/integration/resources/test-iceberg.py

NAMESPACE="tests"
USERNAME="spark"
NUM_ROWS_TO_INSERT="4"
PREVIOUS_DRIVER_PODS_COUNT=$(kubectl get pods -n ${NAMESPACE} | grep driver | wc -l)

kubectl exec testpod -- \
env \
UU="$USERNAME" \
NN="$NAMESPACE" \
IM="$(spark_image)" \
NUM_ROWS="$NUM_ROWS_TO_INSERT" \
ACCESS_KEY="$(get_s3_access_key)" \
SECRET_KEY="$(get_s3_secret_key)" \
S3_ENDPOINT="$(get_s3_endpoint)" \
/bin/bash -c '\
spark-client.spark-submit \
--username $UU --namespace $NN \
--conf spark.kubernetes.driver.request.cores=100m \
--conf spark.kubernetes.executor.request.cores=100m \
--conf spark.kubernetes.container.image=$IM \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider \
--conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
--conf spark.hadoop.fs.s3a.path.style.access=true \
--conf spark.hadoop.fs.s3a.endpoint=$S3_ENDPOINT \
--conf spark.hadoop.fs.s3a.access.key=$ACCESS_KEY \
--conf spark.hadoop.fs.s3a.secret.key=$SECRET_KEY \
--conf spark.jars.ivy=/tmp \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive \
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.local.type=hadoop \
--conf spark.sql.catalog.local.warehouse=s3a://spark/warehouse \
--conf spark.sql.defaultCatalog=local \
s3a://spark/test-iceberg.py -n $NUM_ROWS'

delete_s3_bucket spark
DRIVER_PODS_COUNT=$(kubectl get pods -n ${NAMESPACE} | grep driver | wc -l)

if [[ "${PREVIOUS_DRIVER_PODS_COUNT}" == "${DRIVER_PODS_COUNT}" ]]
then
echo "ERROR: Sample job has not run!"
exit 1
fi

DRIVER_POD_ID=$(kubectl get pods -n ${NAMESPACE} | grep test-iceberg-.*-driver | tail -n 1 | cut -d' ' -f1)
OUTPUT_LOG_LINE=$(kubectl logs ${DRIVER_POD_ID} -n ${NAMESPACE} | grep 'Number of rows inserted:' )
NUM_ROWS_INSERTED=$(echo $OUTPUT_LOG_LINE | rev | cut -d' ' -f1 | rev)

if [ "${NUM_ROWS_INSERTED}" != "${NUM_ROWS_TO_INSERT}" ]; then
echo "ERROR: ${NUM_ROWS_TO_INSERT} were supposed to be inserted. Found ${NUM_ROWS_INSERTED} rows. Aborting with exit code 1."
exit 1
fi

}

run_example_job_in_pod_with_pod_templates() {
SPARK_EXAMPLES_JAR_NAME="spark-examples_2.12-$(get_spark_version).jar"

@@ -425,6 +520,13 @@ echo -e "RUN EXAMPLE JOB WITH ERRORS"
echo -e "########################################"

(setup_user_admin_context && test_example_job_in_pod_with_errors && cleanup_user_success) || cleanup_user_failure_in_pod

echo -e "##################################"
echo -e "RUN EXAMPLE THAT USES ICEBERG LIBRARIES"
echo -e "##################################"

(setup_user_admin_context && test_iceberg_example_in_pod && cleanup_user_success) || cleanup_user_failure_in_pod

echo -e "##################################"
echo -e "TEARDOWN TEST POD"
echo -e "##################################"
30 changes: 30 additions & 0 deletions tests/integration/resources/test-iceberg.py
@@ -0,0 +1,30 @@
import argparse
import random

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StructType, StructField

parser = argparse.ArgumentParser("TestIceberg")
parser.add_argument("--num_rows", "-n", type=int)
args = parser.parse_args()
num_rows = args.num_rows

spark = SparkSession.builder.appName("IcebergExample").getOrCreate()


schema = StructType(
[StructField("row_id", LongType(), True), StructField("row_val", LongType(), True)]
)

data = []
for idx in range(num_rows):
row = (idx + 1, random.randint(1, 100))
data.append(row)

df = spark.createDataFrame(data, schema)
df.writeTo("demo.foo.bar").create()


df = spark.table("demo.foo.bar")
count = df.count()
print(f"Number of rows inserted: {count}")
17 changes: 17 additions & 0 deletions tests/integration/setup-aws-cli.sh
@@ -0,0 +1,17 @@
#!/bin/bash

# Install AWS CLI
sudo snap install aws-cli --classic

# Get Access key and secret key from MinIO
ACCESS_KEY=$(kubectl get secret -n minio-operator microk8s-user-1 -o jsonpath='{.data.CONSOLE_ACCESS_KEY}' | base64 -d)
SECRET_KEY=$(kubectl get secret -n minio-operator microk8s-user-1 -o jsonpath='{.data.CONSOLE_SECRET_KEY}' | base64 -d)

S3_BUCKET="spark"
DEFAULT_REGION="us-east-2"

# Configure AWS CLI credentials
aws configure set aws_access_key_id $ACCESS_KEY
aws configure set aws_secret_access_key $SECRET_KEY
aws configure set default.region $DEFAULT_REGION
echo "AWS CLI credentials set successfully"