Google Cloud Platform services are available in many locations across the globe. You can minimize network latency and network transport costs by running your Dataflow job in the same region where its input bucket, output dataset, and temporary directory are located. More specifically, in order to run Variant Transforms most efficiently, you should make sure all of the following resources are located in the same region (see the example sketch after the list below):
- Your source bucket, set by the `--input_pattern` flag.
- Your pipeline's temporary location, set by the `--temp_location` flag.
- Your output BigQuery dataset, set by the `--output_table` flag.
- Your Dataflow pipeline, set by the `--region` flag.
- Your Life Sciences API location, set by the `--location` flag.
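For example, a direct run with the input bucket, temporary location, and BigQuery dataset all in `europe-west1` might look like the following sketch; the bucket, dataset, and table names used here are placeholders:

```bash
# A sketch assuming the input bucket, temp location, and BigQuery dataset
# all live in europe-west1; the bucket, dataset, and table names are placeholders.
python3 -m gcp_variant_transforms.vcf_to_bq \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --input_pattern "gs://my-europe-west1-bucket/vcfs/*.vcf" \
  --output_table "${GOOGLE_CLOUD_PROJECT}:my_europe_west1_dataset.variants" \
  --temp_location "gs://my-europe-west1-bucket/temp" \
  --region europe-west1
# When running via Docker, additionally pass --location (the Life Sciences API
# location) in a matching region, as described below.
```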
The Dataflow API requires setting a GCP region via the `--region` flag in order to run.
When running from Docker, the Cloud Life Sciences API is used to spin up a worker that launches and monitors the Dataflow job. The Cloud Life Sciences API is a regionalized service that runs in multiple regions; its location is set with the `--location` flag. The Life Sciences API location is where metadata about the pipeline's progress will be stored, and it can be different from the region where the data is processed. Note that the Cloud Life Sciences API is not available in all regions; if this flag is left out, the metadata will be stored in `us-central1`. See the list of Currently Available Locations.
In addition to this requirement, you might also choose to run Variant Transforms in a specific region to meet your project's security and compliance requirements. For example, in order to restrict your processing job to europe-west4 (Netherlands), set the region and location as follows:
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...
docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
--project "${GOOGLE_CLOUD_PROJECT}" \
--region europe-west4 \
--location europe-west4 \
--temp_location "${TEMP_LOCATION}" \
"${COMMAND}"
Note that the values of the `--project`, `--region`, and `--temp_location` flags will be automatically passed as `COMMAND` inputs in `pipelines_runner.sh`.
Instead of setting the `--region` flag for each run, you can set your default region using the following command; in that case, you will no longer need to set the `--region` flag. For more information, please refer to the Cloud SDK page.
```bash
gcloud config set compute/region "europe-west1"
```
Similarly, you can set the default project using the following command:
```bash
gcloud config set project GOOGLE_CLOUD_PROJECT
```
If you are running Variant Transforms from GitHub, you need to specify all three required Dataflow inputs (`--project`, `--region`, and `--temp_location`) as shown below.
```bash
python3 -m gcp_variant_transforms.vcf_to_bq \
  ... \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --region europe-west1 \
  --temp_location "${TEMP_LOCATION}"
```
You choose your GCS bucket's region when you create it; at creation time you permanently define the bucket's name, its geographic location, and the project it belongs to. For an existing bucket, you can check its bucket information to find out its geographic location.
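For example, using the `gsutil` tool (the bucket name below is a placeholder):

```bash
# Create a bucket whose data lives in europe-west1 (placeholder bucket name).
gsutil mb -l europe-west1 gs://my-variant-transforms-bucket

# Check the location of an existing bucket; look for the "Location constraint" field.
gsutil ls -L -b gs://my-variant-transforms-bucket
```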
You can choose the region for the BigQuery dataset at dataset creation time.
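For example, with the `bq` command-line tool (the dataset name below is a placeholder):

```bash
# Create a dataset in europe-west1 (the dataset name is a placeholder).
bq --location=europe-west1 mk --dataset "${GOOGLE_CLOUD_PROJECT}:my_europe_west1_dataset"

# Verify the location of an existing dataset.
bq show --format=prettyjson "${GOOGLE_CLOUD_PROJECT}:my_europe_west1_dataset"
```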