Consumers should not be able to commit during a rebalance #4059
Comments
I think the root of the problem lies in OffsetCommit response error handling. We're probably getting an ERR_ILLEGAL_GENERATION back from the coordinator at this point, which causes us to rejoin the group, hence the rebalance.
Agreed, improving the error handling would make a great difference as well. I haven't been able to confirm this yet, but I believe a new rebalance is also triggered when the error is |
It would be great if this behaviour were documented somewhere in the wiki or in the config description. We spent some effort working out what was going on and why cooperative-sticky rebalancing does not work as expected and leads to endless rebalances. We use confluent_kafka for Python. I tried to implement a workaround using the on_assign/on_revoke callbacks, blocking manual commits for some time after these callbacks are triggered, but noticed that the group generation id can change even before the callbacks are triggered. As I understand it, with confluent_kafka for Python there is currently no way to know for sure that a rebalance is in progress, so we can't use manual commits with cooperative-sticky. I also noticed that with auto commit enabled, librdkafka can send OffsetCommitRequests while a rebalance is in progress without triggering an invalid group generation id error. I see this offset commit happen between the on_revoke callback being triggered in one consumer and on_assign in another.
Doesn't this contradict the behavior described in this KIP document? It says:
|
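The callback-based workaround described in the comment above could look roughly like the sketch below. It is only an illustration: `CommitGuard`, `quiet_seconds`, and `commit_if_quiet` are hypothetical names, and the consumer is duck-typed (anything with a `commit(asynchronous=...)` method). As the comment itself points out, the group generation id can change before the callbacks fire, so this narrows the window rather than closing it.

```python
import time


class CommitGuard:
    """Suppress manual commits for a quiet period after a rebalance callback.

    Hypothetical helper: not part of confluent_kafka.
    """

    def __init__(self, quiet_seconds=5.0, clock=time.monotonic):
        self.quiet_seconds = quiet_seconds
        self._clock = clock
        self._last_rebalance = None  # time of the most recent callback, if any

    # Both methods use the (consumer, partitions) callback signature that
    # confluent_kafka passes to on_assign / on_revoke.
    def on_assign(self, consumer, partitions):
        self._last_rebalance = self._clock()

    def on_revoke(self, consumer, partitions):
        self._last_rebalance = self._clock()

    def may_commit(self):
        """Return True if no rebalance callback fired within the quiet period."""
        if self._last_rebalance is None:
            return True
        return self._clock() - self._last_rebalance >= self.quiet_seconds

    def commit_if_quiet(self, consumer):
        """Commit synchronously unless a rebalance callback fired recently."""
        if not self.may_commit():
            return False
        consumer.commit(asynchronous=False)
        return True
```

The guard would be wired in with `consumer.subscribe(topics, on_assign=guard.on_assign, on_revoke=guard.on_revoke)`, with `guard.commit_if_quiet(consumer)` replacing the direct `commit` call in the poll loop.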
Just adding an additional data point that we see this exact behavior with our setup in kafka. I can try to take a crack at fixing this in my off time, but wondering if there is anyone on Confluent side working on this @milindl @emasab since if you're using manual offset commits, you cannot use Cooperative rebalancing today in any context. |
Is there any ongoing work on this? |
@scanterog see #4220, but in short, I think that this issue has dropped off the radar. Unfortunately the fix that @roxelo created (and I attempted to get merged) broke other behavior in librdkafka. @milindl said the librdkafka team was looking into this internally, but from my side there is little I can do, as my place of work doesn't pay for Confluent support and therefore we have no sway over their roadmap. If anyone interested in getting this more attention on the Confluent side is a paying Confluent customer, I highly recommend they push via their TAMs to get traction on this.
Same issue here. Hopefully this will be prioritized by Confluent because it makes cooperative-sticky unusable with manual commits.
Commits are possible during a rebalance: before a partition is revoked, the user can commit offsets for the revoked partitions in the rebalance callback. That is possible in the Java client too.
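A minimal sketch of committing in the rebalance callback, as the comment above describes, might look like this. `OffsetTracker` and `mark_processed` are hypothetical names, not part of confluent_kafka; the sketch assumes messages expose `topic()`, `partition()` and `offset()`, that the partitions passed to the callback expose writable `topic`, `partition` and `offset` attributes, and that the consumer accepts `commit(offsets=..., asynchronous=False)`.

```python
class OffsetTracker:
    """Remember the last processed offset per (topic, partition).

    Hypothetical helper: not part of confluent_kafka.
    """

    def __init__(self):
        self._last = {}

    def mark_processed(self, message):
        # Call this after the application has finished handling a message.
        self._last[(message.topic(), message.partition())] = message.offset()

    def on_revoke(self, consumer, partitions):
        """Rebalance callback: commit progress for the partitions being revoked."""
        to_commit = []
        for tp in partitions:
            last = self._last.pop((tp.topic, tp.partition), None)
            if last is not None:
                # The committed offset is the next offset to read.
                tp.offset = last + 1
                to_commit.append(tp)
        if to_commit:
            # Synchronous, so the commit completes before the partitions go away.
            consumer.commit(offsets=to_commit, asynchronous=False)
```

The callback would be registered with `consumer.subscribe(topics, on_revoke=tracker.on_revoke)`. Partitions that were never processed are skipped, so nothing is committed for them.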
Thanks for looking into this @emasab. Any hope to see this sorted out soon? Thank you! |
@massimeddu-sj at the moment there's a strategy that should prevent this error during commit and a subsequent rebalance. |
Thank you very much for the additional information @emasab. Unfortunately I'm not too familiar with the rebalance/revoke callbacks, and I'm not too confident on overriding the default implementation. It would be great if you are able to share any snippet or implementation example. My current implementation is actually quite simple:

    consumer_conf = {
        [...]
        'enable.auto.commit': False,
        'partition.assignment.strategy': 'cooperative-sticky'
    }
    kafka_consumer = Consumer(consumer_conf, logger=logger)
    while True:
        message = kafka_consumer.poll()
        if message is None: continue
        if message.error():
            if message.error().code() == KafkaError._PARTITION_EOF:
                # End of partition event
                logger.error('%% %s [%d] reached end at offset %d\n' %
                             (message.topic(), message.partition(), message.offset()))
            else:
                raise KafkaConsumerException(Exception(message.error()))
        message_handler(message)
        kafka_consumer.commit(asynchronous=False)
…world" scenarios (e.g., cooperative sticky). * Enabled automatic committing with `confluent auto commit: true` instead of relying solely on manual commits, but only when the consumer strategy is cooperative sticky. (Refer to the open librdkafka issue at confluentinc/librdkafka#4059).
…world" scenarios (e.g., cooperative sticky). Fixes issue Farfetch#557 and Fixes issue Farfetch#456 * Enabled automatic committing with `confluent auto commit: true` instead of relying solely on manual commits, but only when the consumer strategy is cooperative sticky. (Refer to the open librdkafka issue at confluentinc/librdkafka#4059).
Hi team, for anyone still looking for a workaround. It seems like we have been able to get around this issue by using Kafka autocommit, as suggested, however we still want control over when offsets are committed, so we used This way, the |
TL;DR: manual commits during a rebalance will be sent, but they won't cause an additional rebalance or loss of the assignment. It can happen that the consumer receives a message after the first rejoin of the cooperative incremental assignment; in that case the partitions are resumed, and only later is the second rejoin done to redistribute the revoked partitions. The message can be committed manually, causing an ILLEGAL_GENERATION error. The generation id was incorrectly reset to -1 on an OffsetCommit ILLEGAL_GENERATION error and, immediately after that, the second SyncGroup failed with the same error because of the wrong generation id, this time causing lost partitions. Difference from the Java client: it avoids sending the commit, and the exception is a RebalanceInProgressException instead.
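Since a manual commit sent during a rebalance can still come back with ILLEGAL_GENERATION, an application may want to treat that as non-fatal and simply commit again later. The sketch below is one way to do that under stated assumptions: `try_commit` is a hypothetical helper, the consumer is duck-typed, and the exception is assumed to carry an error object with a `code()` method as its first argument (as `KafkaException` does in confluent_kafka). The numeric codes are the Kafka protocol values; real code would more likely compare against `KafkaError.ILLEGAL_GENERATION` and `KafkaError.REBALANCE_IN_PROGRESS`.

```python
# Kafka protocol error codes (assumed here so the sketch has no dependencies).
ILLEGAL_GENERATION = 22
REBALANCE_IN_PROGRESS = 27
REBALANCE_CODES = {ILLEGAL_GENERATION, REBALANCE_IN_PROGRESS}


def try_commit(consumer, message):
    """Commit synchronously; return False if a rebalance got in the way.

    Any other failure is re-raised unchanged.
    """
    try:
        consumer.commit(message=message, asynchronous=False)
        return True
    except Exception as exc:
        err = exc.args[0] if exc.args else None
        code_fn = getattr(err, "code", None)
        code = code_fn() if callable(code_fn) else None
        if code in REBALANCE_CODES:
            # The offset stays uncommitted; a later commit on a partition
            # that is still owned will cover it.
            return False
        raise
```

Returning `False` instead of raising keeps the poll loop running through the rebalance. The trade-off is that messages processed in that window may be redelivered if the partition moves before the next successful commit.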
This test was ported from taskbrokers and, while extremely unreliable, it does highlight a potential issue with cooperative-sticky rebalancing. confluentinc/librdkafka#4059
Description
Now that librdkafka supports the cooperative sticky partition assignment strategy, we should ensure that consumers that commit offsets manually can’t commit offsets during a rebalance, as doing so triggers a follow-up rebalance.
I don’t think there are any valid use cases for allowing this type of behavior:
I believe this issue with manual commits is isolated to the `cooperative sticky` strategy because when a consumer uses the `eager` strategy, a rebalance starts with all the partitions being revoked and ends when new partitions have been assigned to the consumer. As a result, the consumer will never attempt to commit offsets because there are no offsets to be committed. Of course, if the consumer uses cooperative sticky, we can’t ensure that the consumer won’t attempt to commit offsets during a rebalance, as the consumer might still own partitions during a rebalance. Furthermore, clients have no way of knowing whether a rebalance is ongoing, so they can’t prevent consumers from committing offsets when necessary.

I see three potential solutions to this problem, but I think the first one makes the most sense to implement:
`rd_kafka_commit` to not attempt to commit offsets during rebalances
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
v1.8.2
v2.7.1
<REPLACE with e.g., Centos 5 (x64)>
debug=..
as necessary) from librdkafka