BUG: PySpark pipelines configured to use Delta Lake as persistence store do not save to S3 by default #741

@nartieri

Description

During regression testing of v1.13.0-SNAPSHOT, we identified a bug in which data is persisted to a temporary directory on the pipeline pod instead of to S3 (s3local). As a result, destroying and re-syncing a pipeline can delete the data while references to that data's location still point to the temporary directory on the pod.

Steps to Reproduce

  1. Create a project using v1.13.0-SNAPSHOT
  2. Include a simple PySpark data pipeline in the project that is configured to save to Delta Lake
  3. Build and deploy the project
  4. Run the pipeline
  5. Access the pipeline logs
  6. Verify that you see output similar to the following:
INFO IngestBase: Saved Ingest to Delta Lake
INFO Ingest: Completed saving People
INFO Ingest: Pushing file to S3 Local with contents: test file text
INFO SparkContext: SparkContext is stopping with exitCode 0.
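For context, writing Delta tables to a local S3 endpoint typically requires the Hadoop S3A connector to be configured on the Spark session. The snippet below is an illustrative config fragment, not this project's actual configuration; the endpoint, credentials, and property values are assumptions for a local S3 stack:

```properties
# spark-defaults.conf (illustrative; values are assumptions for a local S3 endpoint)
spark.hadoop.fs.s3a.endpoint             http://localhost:9000
spark.hadoop.fs.s3a.path.style.access    true
spark.hadoop.fs.s3a.access.key           <access-key>
spark.hadoop.fs.s3a.secret.key           <secret-key>
```

If properties like these are missing or not picked up by the pipeline, Spark has no S3 filesystem to write to, which would be consistent with data landing somewhere local instead.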

Expected Behavior

Data is pushed to S3, with log output similar to the following:

INFO IngestBase: Saved Ingest to Delta Lake
INFO Ingest: Completed saving People
INFO Ingest: Pushing file to S3 Local with contents: test file text
INFO Ingest: Finished Uploading file to S3
INFO Ingest: Fetching file from S3 Local
INFO Ingest: Finished downloading file from S3

Actual Behavior

Data is saved to a temporary directory on the pipeline pod instead of S3, and the upload/download log lines never appear.
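The suspected defect can be sketched as a silent fallback: when the S3 destination is not configured (or the configuration is not applied), the save path resolves to a pod-local temp directory. This is a hypothetical illustration — the function and parameter names are not from the project's code:

```python
import os
import tempfile
from typing import Optional

def resolve_delta_path(bucket: Optional[str], table: str) -> str:
    """Hypothetical sketch of the suspected behavior, not the project's code."""
    if bucket:
        # Expected behavior: persist to S3 via the s3a:// scheme.
        return f"s3a://{bucket}/delta/{table}"
    # Observed behavior: data lands in a temp directory on the pod,
    # so destroying the pod destroys the data while references survive.
    return os.path.join(tempfile.gettempdir(), "delta", table)

print(resolve_delta_path("my-bucket", "people"))  # s3a://my-bucket/delta/people
print(resolve_delta_path(None, "people"))         # e.g. /tmp/delta/people
```

A fix along these lines would fail fast (or at least warn) when the S3 destination is absent, rather than falling back to pod-local storage.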

Metadata

Assignees: No one assigned

Labels: bug (Something isn't working)
