A Terraform module which deploys the BigQuery Loader application on Google Cloud, running on top of Compute Engine. If you want to use a custom image for this deployment you will need to ensure it is based on Ubuntu 24.04.
This module was originally sourced from a community contribution by Teghan Nightengale in 2022 - big thanks for the help in getting this one started!
## Telemetry

This module by default collects and forwards telemetry information to Snowplow to understand how our applications are being used. No identifying information about your sub-account or account fingerprints is ever forwarded to us; we collect only very simple information about which modules and applications are deployed and active.
If you wish to subscribe to our mailing list for updates to these modules or security advisories, please set the user_provided_id variable to a valid email address at which we can reach you.
To disable telemetry, simply set telemetry_enabled = false (see the example below).
For details on what information is collected please see this module: https://github.com/snowplow-devops/terraform-snowplow-telemetry
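Both settings are ordinary module inputs; a minimal sketch (the email address is an illustrative placeholder, and the module's source and required inputs are omitted for brevity):

```hcl
module "bigquery_loader_pubsub" {
  # ... source and required inputs omitted for brevity ...

  # Opt out of telemetry entirely
  telemetry_enabled = false

  # Or stay opted in and set a contact address for update notifications
  user_provided_id = "you@example.com"
}
```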
## Usage

The BigQuery Loader reads data from a Snowplow Enriched output PubSub topic and writes it in real time to a BigQuery events table.
```hcl
# NOTE: Needs to be fed by the enrich module with valid Snowplow Events
module "enriched_topic" {
source = "snowplow-devops/pubsub-topic/google"
version = "0.3.0"
name = "enriched-topic"
}
module "bad_rows_topic" {
source = "snowplow-devops/pubsub-topic/google"
version = "0.3.0"
name = "bad-rows-topic"
}
resource "google_bigquery_dataset" "pipeline_db" {
dataset_id = "pipeline_db"
location = var.region
}
module "bigquery_loader_pubsub" {
source = "snowplow-devops/bigquery-loader-pubsub-ce/google"
accept_limited_use_license = true
name = "bq-loader-server"
project_id = var.project_id
network = var.network
subnetwork = var.subnetwork
region = var.region
input_topic_name = module.enriched_topic.name
bad_rows_topic_id = module.bad_rows_topic.id
bigquery_dataset_id = google_bigquery_dataset.pipeline_db.dataset_id
ssh_key_pairs = []
ssh_ip_allowlist = ["0.0.0.0/0"]
# Linking in the custom Iglu Server here
custom_iglu_resolvers = [
{
name = "Iglu Server"
priority = 0
uri = "http://your-iglu-server-endpoint/api"
api_key = var.iglu_super_api_key
vendor_prefixes = []
}
]
}
```

## Requirements

| Name | Version |
|---|---|
| terraform | >= 1.0.0 |
| google | >= 3.44.0 |
## Providers

| Name | Version |
|---|---|
| google | >= 3.44.0 |
## Modules

| Name | Source | Version |
|---|---|---|
| service | snowplow-devops/service-ce/google | 0.2.0 |
| telemetry | snowplow-devops/telemetry/snowplow | 0.6.1 |
## Resources

| Name | Type |
|---|---|
| google_bigquery_dataset_iam_member.dataset_bigquery_data_editor_binding | resource |
| google_compute_firewall.egress | resource |
| google_compute_firewall.ingress_health_check | resource |
| google_compute_firewall.ingress_ssh | resource |
| google_project_iam_member.sa_bigquery_data_editor | resource |
| google_project_iam_member.sa_logging_log_writer | resource |
| google_project_iam_member.sa_pubsub_publisher | resource |
| google_project_iam_member.sa_pubsub_subscriber | resource |
| google_project_iam_member.sa_pubsub_viewer | resource |
| google_pubsub_subscription.input | resource |
| google_service_account.sa | resource |
## Inputs

| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| bad_rows_topic_id | The id of the output topic for all bad data | string | n/a | yes |
| bigquery_dataset_id | The ID of the BigQuery dataset to load data into | string | n/a | yes |
| input_topic_name | The name of the input topic that contains enriched data to load | string | n/a | yes |
| name | A name which will be prepended to the resources created | string | n/a | yes |
| network | The name of the network to deploy within | string | n/a | yes |
| project_id | The project ID in which the stack is being deployed | string | n/a | yes |
| region | The name of the region to deploy within | string | n/a | yes |
| accept_limited_use_license | Acceptance of the SLULA terms (https://docs.snowplow.io/limited-use-license-1.0/) | bool | false | no |
| app_version | App version to use. This variable facilitates dev flow; the modules may not work with anything other than the default value. | string | "2.0.1" | no |
| associate_public_ip_address | Whether to assign a public IP address to this instance; if false this instance must be behind a Cloud NAT to connect to the internet | bool | true | no |
| bigquery_service_account_json_b64 | Custom credentials (as a base64-encoded service account key) instead of the default service account assigned to the loader's compute group | string | "" | no |
| bigquery_table_id | The ID of the table within a dataset to load data into (will be created if it doesn't exist) | string | "events" | no |
| custom_iglu_resolvers | The custom Iglu Resolvers that will be used by the loader to resolve and validate events | list(object({ name = string, priority = number, uri = string, api_key = string, vendor_prefixes = list(string) })) | [] | no |
| default_iglu_resolvers | The default Iglu Resolvers that will be used by the loader to resolve and validate events | list(object({ name = string, priority = number, uri = string, api_key = string, vendor_prefixes = list(string) })) | [{ name = "Iglu Central", priority = 10, uri = "http://iglucentral.com", api_key = "", vendor_prefixes = [] }, { name = "Iglu Central - Mirror 01", priority = 20, uri = "http://mirror01.iglucentral.com", api_key = "", vendor_prefixes = [] }] | no |
| exit_on_missing_iglu_schema | Whether the loader should crash and exit if it fails to resolve an Iglu schema | bool | true | no |
| gcp_logs_enabled | Whether application logs should be reported to GCP Logging | bool | true | no |
| healthcheck_enabled | Whether or not to enable the health check probe for the GCP instance group | bool | true | no |
| iglu_cache_size | The size of the cache used by Iglu Resolvers | number | 500 | no |
| iglu_cache_ttl_seconds | Duration in seconds for which entries should be kept in the Iglu Resolvers cache before they expire | number | 600 | no |
| java_opts | Custom JAVA Options | string | "-XX:InitialRAMPercentage=75 -XX:MaxRAMPercentage=75" | no |
| labels | The labels to append to this resource | map(string) | {} | no |
| legacy_column_mode | Whether the loader should load to legacy columns for all fields | bool | false | no |
| legacy_columns | Schemas for which to use the legacy column style used by the v1 BigQuery Loader. For these columns, there is a column per minor version of each schema. | list(string) | [] | no |
| machine_type | The machine type to use | string | "e2-small" | no |
| network_project_id | The project ID of the shared VPC in which the stack is being deployed | string | "" | no |
| skip_schemas | The list of schema keys which should be skipped (not loaded) into the warehouse | list(string) | [] | no |
| ssh_block_project_keys | Whether to block project-wide SSH keys | bool | true | no |
| ssh_ip_allowlist | The list of CIDR ranges to allow SSH traffic from | list(any) | ["0.0.0.0/0"] | no |
| ssh_key_pairs | The list of SSH key-pairs to add to the servers | list(object({ user_name = string, public_key = string })) | [] | no |
| subnetwork | The name of the sub-network to deploy within; if populated will override the 'network' setting | string | "" | no |
| target_size | The number of servers to deploy | number | 1 | no |
| telemetry_enabled | Whether or not to send telemetry information back to Snowplow Analytics Ltd | bool | true | no |
| ubuntu_24_04_source_image | The source image to use, which must be based on Ubuntu 24.04; by default the latest community version is used | string | "" | no |
| user_provided_id | An optional unique identifier to identify the telemetry events emitted by this stack | string | "" | no |
| webhook_collector | Collector address used to gather monitoring alerts | string | "" | no |
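As an illustration of the SSH-related inputs above, a minimal sketch (the user name, public key, and CIDR range are placeholder values):

```hcl
ssh_block_project_keys = true
ssh_ip_allowlist       = ["10.0.0.0/8"]

# Each key-pair object attaches a public key for the given user on the servers
ssh_key_pairs = [
  {
    user_name  = "ubuntu"
    public_key = "ssh-ed25519 AAAA... ops@example.com"
  }
]
```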
## Outputs

| Name | Description |
|---|---|
| health_check_id | Identifier for the health check on the instance group |
| health_check_self_link | The URL for the health check on the instance group |
| instance_group_url | The full URL of the instance group created by the manager |
| manager_id | Identifier for the instance group manager |
| manager_self_link | The URL for the instance group manager |
| named_port_http | The name of the port exposed by the instance group |
| named_port_value | The named port value (e.g. 8080) |
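Outputs can be re-exported or referenced from the calling configuration; for example, using the module name from the usage example above:

```hcl
# Expose the loader's instance group URL from the root module
output "bq_loader_instance_group_url" {
  description = "Instance group URL of the BigQuery Loader deployment"
  value       = module.bigquery_loader_pubsub.instance_group_url
}
```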
## Copyright and license

Copyright 2022-present Snowplow Analytics Ltd.

Licensed under the [Snowplow Limited Use License Agreement](https://docs.snowplow.io/limited-use-license-1.0/). (If you are uncertain how it applies to your use case, check our answers to frequently asked questions.)