The data product processor is a library for dynamically creating and executing Apache Spark jobs based on a declarative description of a data product.
The declaration is written in YAML and covers input and output data stores as well as data structures. It can be augmented with custom, PySpark-based transformation logic.
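As a rough illustration of such transformation logic, consider the minimal sketch below. The function name, signature and the status column are assumptions made for this example rather than the library's documented contract; see the Data product specification for the actual interface.

# Illustrative sketch only: the entry-point name, signature and the "status" column are assumptions.
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def transform(df: DataFrame) -> DataFrame:
    # Example logic: keep only active records and add a load timestamp.
    return (
        df.filter(F.col("status") == "active")
          .withColumn("loaded_at", F.current_timestamp())
    )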
Prerequisites
- Python 3.x
- Apache Spark 3.x
Install with pip
pip install data-product-processor
Please see the Data product specification for an overview of the files required to declare a data product.
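As an illustration only, a data product folder typically bundles the YAML declaration with any custom transformation code. The file names below are hypothetical; the authoritative list is defined in the Data product specification.

my-data-product/
  product.yml            # hypothetical name: declaration of inputs, outputs and the pipeline
  model.yml              # hypothetical name: schemas of the produced data structures
  custom_transform.py    # hypothetical name: optional PySpark-based transformation logic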
From the folder in which the previously created files are stored, run the data-product-processor as follows:
data-product-processor \
--default_data_lake_bucket some-datalake-bucket \
--aws_profile some-profile \
--aws_region eu-central-1 \
--local
This command runs Apache Spark locally (due to the --local switch) and stores the output in an S3 bucket, authenticating with the AWS profile passed via --aws_profile.
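To inspect the result, you can read the written data back with a local Spark session. The path and the Parquet format in this sketch are assumptions that depend on your product declaration and bucket layout, and reading s3a:// paths locally requires the hadoop-aws and AWS SDK jars on the classpath.

from pyspark.sql import SparkSession

# Hypothetical verification snippet: bucket, prefix and Parquet format are assumptions.
# Requires the hadoop-aws / aws-java-sdk jars (e.g. via --jars or spark.jars.packages).
spark = SparkSession.builder.appName("verify-output").getOrCreate()
df = spark.read.parquet("s3a://some-datalake-bucket/some-data-product/some-table/")
df.show(10)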
If you run the data-product-processor from a folder other than the one containing the data product declaration, reference the declaration through the additional argument --product_path.
data-product-processor \
--product_path ../path-to-some-data-product \
--default_data_lake_bucket some-datalake-bucket \
--aws_profile some-profile \
--aws_region eu-central-1 \
--local
The supported arguments, also shown by data-product-processor --help, are the following:
--JOB_ID - the unique id of this Glue/EMR job
--JOB_RUN_ID - the unique id of this Glue job run
--JOB_NAME - the name of this Glue job
--job-bookmark-option - set to job-bookmark-disable if you don't want bookmarking
--TempDir - temporary results directory
--product_path - the data product definition folder
--aws_profile - the AWS profile to be used for connection
--aws_region - the AWS region to be used
--local - run Apache Spark locally (local development mode)
--jars - extra jars to be added to the Spark context
--additional-python-modules - this parameter is injected by Glue; it is currently not in use
--default_data_lake_bucket - a default bucket location (with s3a:// prefix)
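For local development it can be convenient to wrap the CLI call in a small Python script, for example to run the same data product against different buckets or regions. The sketch below uses only arguments from the list above; the concrete values are placeholders.

import subprocess

# Placeholder values; only flags from the argument list above are used.
subprocess.run(
    [
        "data-product-processor",
        "--product_path", "../path-to-some-data-product",
        "--default_data_lake_bucket", "s3a://some-datalake-bucket",
        "--aws_profile", "some-profile",
        "--aws_region", "eu-central-1",
        "--local",
    ],
    check=True,
)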