
fix: set replication factor for kafka stability #1606

Open
wants to merge 1 commit into base: develop
Conversation

fedeabih

Description

This change resolves the error `Failed to get watermark offsets: Local: Unknown partition`. The root cause was the Kafka replication configuration: by setting `replicationFactor` to 3 (matching the number of Kafka brokers/controllers), this fix ensures consistent behavior when retrieving high watermark offsets. This issue was reported in sentry-kubernetes/charts#1458.

Technical Explanation

The issue arises because `replicationFactor` was previously set to 1, meaning each partition had only a single replica. In this configuration, the high watermark offset (the highest offset that has been replicated to all in-sync replicas, ISRs) becomes unreliable.

Without sufficient replication, the loss of a single broker, or even temporary unavailability, can leave Kafka unable to compute or provide the high watermark for the affected partitions. This leads to the error:

`Failed to get watermark offsets: Local: Unknown partition`

By increasing `replicationFactor` from 1 to 3, each partition is replicated across all three brokers/controllers. This keeps the high watermark offset consistently available even if a broker becomes unavailable or experiences minor instability. The increased replication also improves fault tolerance and the overall availability of partition data across the cluster.

For more details on how Kafka replication works and the role of the high watermark, refer to the official documentation: Replication in Apache Kafka.
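As a quick way to see the problem (and to verify the fix on new topics), you can describe a topic from inside one of the Kafka pods and check its `ReplicationFactor` and `Isr` columns. This is only a sketch: `KAFKA_SERVICE` and `CLIENT_CONF` are assumed to be the same environment variables the provisioning job uses, and the fallback `echo` merely documents the intended command when the Kafka CLI is not available locally:

```shell
# Sketch: inspect a topic's replication state (run inside a Kafka pod).
# KAFKA_SERVICE and CLIENT_CONF are assumed to match the provisioning
# job's environment; adjust for your deployment.
KAFKA_BIN=/opt/bitnami/kafka/bin
TOPIC=event-replacements

if [ -x "${KAFKA_BIN}/kafka-topics.sh" ]; then
    "${KAFKA_BIN}/kafka-topics.sh" \
        --describe \
        --bootstrap-server "${KAFKA_SERVICE}" \
        --command-config "${CLIENT_CONF}" \
        --topic "${TOPIC}"
else
    # Outside a Kafka pod the CLI is absent; just show the intended command.
    echo "run inside a Kafka pod: ${KAFKA_BIN}/kafka-topics.sh --describe --topic ${TOPIC}"
fi
```

With `replicationFactor: 1` the `Isr` column lists a single broker per partition; after this change, newly created topics should report `ReplicationFactor: 3`.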

@fedeabih fedeabih mentioned this pull request Nov 22, 2024
@patsevanton
Contributor

@Mokto Please review the changes.

Contributor

@Mokto left a comment


Is it backward compatible?

@fedeabih
Author

fedeabih commented Dec 2, 2024

> Is it backward compatible?

Yes. The `sentry-kafka-provisioning` job creates topics with the `--if-not-exists` flag, so existing topics remain unaffected when the job runs; the new configuration applies only to newly created topics:

```shell
"/opt/bitnami/kafka/bin/kafka-topics.sh \
    --create \
    --if-not-exists \
    --bootstrap-server ${KAFKA_SERVICE} \
    --replication-factor 3 \
    --partitions 1 \
    --command-config ${CLIENT_CONF} \
    --topic event-replacements"
```

There are two ways to apply the new configuration to an existing cluster:

  • Recreate the Kafka configuration (this loses unprocessed data; decide whether that is acceptable for you):

    1. Delete PVCs named data-sentry-kafka-controller-*.
    2. Delete pods named sentry-kafka-controller-*.
  • Alter each topic manually:
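The manual route can be sketched with Kafka's partition-reassignment tool. The broker IDs `0`, `1`, `2` and the single topic/partition below are assumptions for illustration; generate one `partitions` entry per partition of each topic you want to change, using your cluster's actual broker IDs:

```shell
# Sketch: raise an existing topic's replication factor to 3 via a
# reassignment plan. Broker IDs 0/1/2 are assumed; check your cluster's
# actual IDs first. Repeat the "partitions" entry for every partition.
cat > increase-replication.json <<'EOF'
{
  "version": 1,
  "partitions": [
    { "topic": "event-replacements", "partition": 0, "replicas": [0, 1, 2] }
  ]
}
EOF

# Then apply the plan against the running cluster (requires broker access,
# so this line is left commented out here):
# /opt/bitnami/kafka/bin/kafka-reassign-partitions.sh \
#     --bootstrap-server ${KAFKA_SERVICE} \
#     --command-config ${CLIENT_CONF} \
#     --reassignment-json-file increase-replication.json \
#     --execute
```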

@patsevanton
Contributor

patsevanton commented Dec 23, 2024

@fedeabih change the code

`enabled: true # true, if Safari compatibility is needed`

to

`enabled: true  # true, if Safari compatibility is needed`

(two spaces before the inline `#` comment), and add: Existing topics will remain with `replicationFactor: 1` when updated.

@Mokto I tested upgrade replicationFactor: 1 to replicationFactor: 3

k get pod -n test
NAME                                                              READY   STATUS    RESTARTS      AGE
sentry-billing-metrics-consumer-f6db4bc5c-7jmqt                   1/1     Running   0             3m11s
sentry-clickhouse-0                                               1/1     Running   0             28m
sentry-cron-5f5f668fb6-2fvcc                                      1/1     Running   3 (27m ago)   28m
sentry-generic-metrics-consumer-58d967864f-t85q7                  1/1     Running   0             3m9s
sentry-ingest-consumer-attachments-66bbdcc7d5-jx2w8               1/1     Running   0             3m8s
sentry-ingest-consumer-events-846bd78898-jgvzf                    1/1     Running   0             3m7s
sentry-ingest-consumer-transactions-96bf4d646-d57vm               1/1     Running   0             3m6s
sentry-ingest-monitors-97dc8d5cc-t97g8                            1/1     Running   0             3m5s
sentry-ingest-occurrences-9cf7947bf-rldf6                         1/1     Running   0             3m4s
sentry-ingest-replay-recordings-5489644645-wz4zc                  1/1     Running   0             3m2s
sentry-issue-occurrence-consumer-786969486d-lcmrk                 1/1     Running   0             2m42s
sentry-kafka-controller-0                                         1/1     Running   0             28m
sentry-kafka-controller-1                                         1/1     Running   0             28m
sentry-kafka-controller-2                                         1/1     Running   0             28m
sentry-metrics-consumer-68d5cd6c6f-xt2jl                          1/1     Running   0             3m1s
sentry-nginx-78665d4559-zpk2k                                     1/1     Running   0             28m
sentry-post-process-forward-errors-86cf66d8f4-gwds7               1/1     Running   0             3m
sentry-post-process-forward-issue-platform-6f59bbb877-pqmd8       1/1     Running   0             2m59s
sentry-post-process-forward-transactions-7f64d9998f-s75f8         1/1     Running   0             2m58s
sentry-rabbitmq-0                                                 1/1     Running   0             28m
sentry-relay-67b8c8fdd7-2grjb                                     1/1     Running   0             2m36s
sentry-sentry-postgresql-0                                        1/1     Running   0             28m
sentry-sentry-redis-master-0                                      1/1     Running   0             28m
sentry-sentry-redis-replicas-0                                    1/1     Running   1 (27m ago)   28m
sentry-snuba-api-779df77b87-ctrpv                                 1/1     Running   0             28m
sentry-snuba-consumer-74d6ff7cf8-qplf4                            1/1     Running   0             2m58s
sentry-snuba-generic-metrics-counters-consumer-68f9d89fc4-52bk4   1/1     Running   0             2m52s
sentry-snuba-generic-metrics-distributions-consumer-65d588jh4lw   1/1     Running   0             2m51s
sentry-snuba-generic-metrics-sets-consumer-c9d44b8f4-qxnqb        1/1     Running   0             2m50s
sentry-snuba-group-attributes-consumer-5c775b9f6-gljq7            1/1     Running   0             2m48s
sentry-snuba-metrics-consumer-d7b649955-gbrx2                     1/1     Running   0             2m47s
sentry-snuba-outcomes-billing-consumer-7b4c6b6b45-rf746           1/1     Running   0             2m46s
sentry-snuba-outcomes-consumer-9fbb7f75f-lr6t6                    1/1     Running   0             2m41s
sentry-snuba-replacer-6985c4dfb5-ws9cc                            1/1     Running   0             2m40s
sentry-snuba-replays-consumer-7c847d848f-9qrd5                    1/1     Running   0             2m45s
sentry-snuba-spans-consumer-cf755bcf6-5fhdn                       1/1     Running   0             2m44s
sentry-snuba-subscription-consumer-events-79f56bcdf6-g8vk6        1/1     Running   0             2m39s
sentry-snuba-subscription-consumer-metrics-5dd754dcff-g2zmh       1/1     Running   0             2m38s
sentry-snuba-subscription-consumer-transactions-86f58474b4tvlzc   1/1     Running   0             2m37s
sentry-snuba-transactions-consumer-86bccffd4b-rbmvc               1/1     Running   0             2m43s
sentry-subscription-consumer-events-77bdb5b7d9-4xssk              1/1     Running   0             2m57s
sentry-subscription-consumer-generic-metrics-78d9756698-k9d57     1/1     Running   0             2m56s
sentry-subscription-consumer-metrics-764b9bb6cd-2nqpg             1/1     Running   0             2m54s
sentry-subscription-consumer-transactions-85548bf7c7-wdblh        1/1     Running   0             2m53s
sentry-web-66bb445cdb-2xtdv                                       1/1     Running   2 (27m ago)   28m
sentry-worker-7f9c7d6d55-sr6mk                                    1/1     Running   3 (27m ago)   28m
sentry-zookeeper-clickhouse-0                                     1/1     Running   0             28m

@fedeabih fedeabih force-pushed the fix-set-default-replicationfactor-for-kafka-stability branch from ae1c4e5 to 27099dd on January 21, 2025 20:14
@fedeabih fedeabih changed the title Set replication factor for kafka stability fix: set replication factor for kafka stability Jan 21, 2025
@fedeabih
Author

@patsevanton Done, let me know if the comment is what you were looking for. Thanks!

@patsevanton
Contributor

@fedeabih Is it backward compatible? Try setting up Sentry with default parameters, then change the `replicationFactor` value to 3, then change it back to 1.

@patsevanton
Contributor

I do not know how to properly test this pull request, so I tested it like this:

sentry install:

helm install -n test sentry sentry/sentry --wait --timeout=1000s
coalesce.go:237: warning: skipped value for kafka.config: Not a table.
NAME: sentry
LAST DEPLOYED: Sat Jan 25 09:58:21 2025
NAMESPACE: test
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
* When running upgrades, make sure to give back the `system.secretKey` value.

kubectl -n test get configmap sentry-sentry -o json | grep -m1 -Po '(?<=system.secret-key: )[^\\]*'

replicationFactor=3:

helm upgrade --install -n test sentry sentry/sentry --wait --timeout=1000s --set kafka.provisioning.replicationFactor=3
coalesce.go:237: warning: skipped value for kafka.config: Not a table.
Release "sentry" has been upgraded. Happy Helming!
NAME: sentry
LAST DEPLOYED: Sat Jan 25 10:23:10 2025
NAMESPACE: test
STATUS: deployed
REVISION: 2
TEST SUITE: None
NOTES:
* When running upgrades, make sure to give back the `system.secretKey` value.

kubectl -n test get configmap sentry-sentry -o json | grep -m1 -Po '(?<=system.secret-key: )[^\\]*'

replicationFactor=1:

helm upgrade --install -n test sentry sentry/sentry --wait --timeout=1000s --set kafka.provisioning.replicationFactor=1
coalesce.go:237: warning: skipped value for kafka.config: Not a table.
Release "sentry" has been upgraded. Happy Helming!
NAME: sentry
LAST DEPLOYED: Sat Jan 25 10:51:46 2025
NAMESPACE: test
STATUS: deployed
REVISION: 3
TEST SUITE: None
NOTES:
* When running upgrades, make sure to give back the `system.secretKey` value.

kubectl -n test get configmap sentry-sentry -o json | grep -m1 -Po '(?<=system.secret-key: )[^\\]*'

All pods work.

@fedeabih
Author

> @fedeabih Is it backward compatible? try to set sentry with default parameters, then change the ReplicationFactor value to 3, then change the ReplicationFactor value to 1

@patsevanton Yes, you can see the full explanation here.

If you already have a setup with a `replicationFactor` of 1 and you apply the new configuration (`replicationFactor: 3`), the existing topics will keep a replication factor of 1 unless you remove and recreate them as mentioned in the comment above, or manually alter each topic.

More information about the replication factor:

> The replication factor controls how many servers will replicate each message that is written. If you have a replication factor of 3, then up to 2 servers can fail before you will lose access to your data. We recommend you use a replication factor of 2 or 3 so that you can transparently restart machines without interrupting data consumption.

https://docs.confluent.io/platform/current/kafka/post-deployment.html#admin-operations

> The replication factor is a topic setting and is specified at topic creation time.
>
> A replication factor of 1 means no replication. It is mostly used for development purposes and should be avoided in test and production Kafka clusters.
>
> A replication factor of 3 is a commonly used replication factor as it provides the right balance between broker loss and replication overhead.

https://learn.conduktor.io/kafka/kafka-topic-replication/#Kafka-Topic-Replication-Factor-0

Labels: None yet
Projects: None yet
3 participants