Note: This repository was archived by the owner on Jan 30, 2025. It is now read-only.


data product processor

The data product processor is a library for dynamically creating and executing Apache Spark Jobs based on a declarative description of a data product.

The declaration is based on YAML and covers input and output data stores as well as data structures. It can be augmented with custom, PySpark-based transformation logic.
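
The exact file layout and entry-point signature for custom logic are defined by the data product specification (referenced below); purely as an illustration, a PySpark-based transformation boils down to a function that receives a DataFrame and returns the transformed DataFrame. A minimal, hypothetical sketch (the function name execute and its signature are assumptions, not the library's confirmed hook):

# Hypothetical custom transformation logic. The real hook name and
# signature come from the data product specification, not from here.
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def execute(df: DataFrame) -> DataFrame:
    # Keep only active records and stamp them with the processing time.
    return (
        df.filter(F.col("status") == "active")
        .withColumn("processed_at", F.current_timestamp())
    )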

Installation

Prerequisites

  • Python 3.x
  • Apache Spark 3.x

Install with pip

pip install data-product-processor

Getting started

Declare a basic data product

Please see the Data product specification for an overview of the files required to declare a data product.

Process the data product

From the folder in which the previously created files are stored, run the data-product-processor as follows:

data-product-processor \
  --default_data_lake_bucket some-datalake-bucket \
  --aws_profile some-profile \
  --aws_region eu-central-1 \
  --local

This command runs Apache Spark locally (due to the --local switch) and stores the output in the specified S3 bucket, authenticating with the AWS profile given by --aws_profile.

If you run the data-product-processor from a folder other than the one containing the data product declaration, reference the declaration through the additional --product_path argument:

data-product-processor \
  --product_path ../path-to-some-data-product \
  --default_data_lake_bucket some-datalake-bucket \
  --aws_profile some-profile \
  --aws_region eu-central-1 \
  --local

CLI Arguments

data-product-processor --help

  --JOB_ID - the unique ID of this Glue/EMR job
  --JOB_RUN_ID - the unique ID of this Glue job run
  --JOB_NAME - the name of this Glue job
  --job-bookmark-option - set to job-bookmark-disable if you don't want bookmarking
  --TempDir - directory for temporary results
  --product_path - the folder containing the data product definition
  --aws_profile - the AWS profile to be used for the connection
  --aws_region - the AWS region to be used
  --local - run Apache Spark locally (local development mode)
  --jars - extra JARs to be added to the Spark context
  --additional-python-modules - injected by Glue; currently not in use
  --default_data_lake_bucket - a default bucket location (with the s3a:// prefix)
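
The uppercase arguments (JOB_ID, JOB_RUN_ID, JOB_NAME, TempDir) follow the AWS Glue job-argument convention and are injected by Glue at runtime. As a sketch of how such arguments are typically resolved inside a running Glue job (the argument names chosen here simply mirror the list above):

# Illustration only: reading Glue-injected job arguments with the
# standard awsglue helper inside a running Glue job.
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME", "JOB_RUN_ID", "TempDir"])
print(args["JOB_NAME"], args["TempDir"])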

References

  • Tutorials
