Setting GCP region

What to consider

Google Cloud Platform services are available in many locations across the globe. You can minimize network latency and network transport costs by running your Dataflow job in the same region as its input bucket, output dataset, and temporary directory. More specifically, to run Variant Transforms most efficiently, make sure all of the following resources are located in the same region (a quick way to verify each location is sketched after the list):

  • Your source bucket set by --input_pattern flag.
  • Your pipeline's temporary location set by --temp_location flag.
  • Your output BigQuery dataset set by --output_table flag.
  • Your Dataflow pipeline set by --region flag.
  • Your Life Sciences API location set by --location flag.
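
As a quick sanity check before launching a job, you can verify where each of these resources lives. The bucket and dataset names below (gs://my-vcf-bucket, my_variants_dataset) are placeholders for your own resources.

# Check the bucket's location; look for "Location constraint" in the output.
gsutil ls -L -b gs://my-vcf-bucket

# Check the BigQuery dataset's location (the "location" field in the output).
bq show --format=prettyjson "${GOOGLE_CLOUD_PROJECT}:my_variants_dataset"

# Check the default Dataflow region configured for gcloud, if any.
gcloud config get-value compute/region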

Running jobs in a particular region

The Dataflow API requires a GCP region, set via the --region flag, in order to run.

When running from Docker, the Cloud Life Sciences API is used to spin up a worker that launches and monitors the Dataflow job. The Cloud Life Sciences API is a regionalized service that runs in multiple regions; its location is set with the --location flag. The Life Sciences API location is where metadata about the pipeline's progress is stored, and it can be different from the region where the data is processed. Note that the Cloud Life Sciences API is not available in all regions; if this flag is left out, the metadata will be stored in us-central1. See the list of Currently Available Locations.

In addition to this requirement, you might also choose to run Variant Transforms in a specific region to satisfy your project's security and compliance requirements. For example, to restrict your processing job to europe-west4 (Netherlands), set the region and location as follows:

COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ..."

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --region europe-west4 \
  --location europe-west4 \
  --temp_location "${TEMP_LOCATION}" \
  "${COMMAND}"

Note that the values of the --project, --region, and --temp_location flags are automatically passed as inputs to COMMAND by pipelines_runner.sh.

Instead of setting the --region flag for each run, you can set a default region using the following command; in that case, you no longer need to pass the --region flag. For more information, please refer to the Cloud SDK page.

gcloud config set compute/region "europe-west1"

Similarly, you can set the default project using the following command:

gcloud config set project GOOGLE_CLOUD_PROJECT
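
To confirm that both defaults are in place before launching a job, you can print your active gcloud configuration:

# Shows the active project and the compute/region default, among other settings.
gcloud config list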

If you are running Variant Transforms from GitHub, you need to specify all three required Dataflow inputs as shown below.

python3 -m gcp_variant_transforms.vcf_to_bq \
  ... \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --region europe-west1 \
  --temp_location "${TEMP_LOCATION}"
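
Both the Docker and GitHub invocations above assume the GOOGLE_CLOUD_PROJECT and TEMP_LOCATION environment variables are already exported. A minimal setup might look like the following; the project ID and bucket path are placeholders:

# Placeholder project ID; replace with your own project.
export GOOGLE_CLOUD_PROJECT=my-project-id
# Placeholder temp location; the bucket should live in the same region as the job.
export TEMP_LOCATION=gs://my-vcf-bucket/temp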

Setting Google Cloud Storage bucket region

You choose a GCS bucket's region when you create it; at creation time you permanently define its name, its geographic location, and the project it belongs to. For an existing bucket, you can check its information to find its geographic location.
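
As a sketch, a regional bucket can be created with gsutil; the bucket name below is a placeholder, and the location cannot be changed once the bucket exists:

# Create a bucket in europe-west4 within your project.
gsutil mb -p "${GOOGLE_CLOUD_PROJECT}" -l europe-west4 gs://my-vcf-bucket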

Setting BigQuery dataset region

You can choose the region for the BigQuery dataset at dataset creation time.
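
For example, a dataset pinned to europe-west4 can be created with the bq tool; the dataset name below is a placeholder:

# Create a BigQuery dataset in europe-west4; its location is fixed at creation time.
bq --location=europe-west4 mk --dataset "${GOOGLE_CLOUD_PROJECT}:my_variants_dataset"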
