
Limit to_device EDU size to 65536 #18416


Open
wants to merge 11 commits into
base: develop
from

Conversation

MatMaul
Contributor

@MatMaul MatMaul commented May 9, 2025

If a set of messages exceeds this limit, the messages are split across several EDUs.

Should fix #17035.

There is currently no official specced limit for EDUs, but the consensus seems to be that it would be useful to have one to avoid this bug by bounding the transaction size.

As a side effect it also limits the size of a single to-device message to a bit less than 65536.

This should probably be added to the spec similarly to the message size limit.
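
As a rough illustration of the splitting described above, here is a minimal sketch, not the code from this PR. It assumes a flat `messages` dict (the real to-device structure is nested per user and device), approximates canonical JSON with `json.dumps`, and reuses the same base fields for every EDU instead of generating a fresh `message_id` each time. The names `encoded_size` and `split_messages` are hypothetical.

```python
import json
from typing import Any, Dict, List

MAX_EDU_SIZE = 65536  # limit proposed in this PR


def encoded_size(obj: Any) -> int:
    """Approximate canonical-JSON size: sorted keys, no whitespace, UTF-8 bytes."""
    encoded = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return len(encoded.encode("utf-8"))


def split_messages(
    base: Dict[str, Any], messages: Dict[str, Any], limit: int = MAX_EDU_SIZE
) -> List[Dict[str, Any]]:
    """Greedily pack `messages` into EDU contents that each encode to <= `limit` bytes."""
    edus: List[Dict[str, Any]] = []
    current: Dict[str, Any] = {**base, "messages": {}}
    for key, value in messages.items():
        # A message that does not fit even on its own can never be sent.
        single = {**base, "messages": {key: value}}
        if encoded_size(single) > limit:
            raise ValueError(f"message for {key!r} is too large for a single EDU")
        candidate = {**base, "messages": {**current["messages"], key: value}}
        if encoded_size(candidate) > limit:
            # Adding this message would overflow: close the current EDU, start a new one.
            edus.append(current)
            current = single
        else:
            current = candidate
    if current["messages"]:
        edus.append(current)
    return edus
```

Re-encoding the candidate on every step is quadratic in the worst case; tracking a running size, as the diff below does with `current_edu_size`, avoids that at the cost of slightly more bookkeeping.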

Pull Request Checklist

@MatMaul MatMaul marked this pull request as ready for review May 9, 2025 14:38
@MatMaul MatMaul requested a review from a team as a code owner May 9, 2025 14:38
@@ -28,6 +28,7 @@

# the max size of a (canonical-json-encoded) event
MAX_PDU_SIZE = 65536
MAX_EDU_SIZE = 65536
Contributor

Suggested change
MAX_EDU_SIZE = 65536
# This isn't spec'ed but is our own reasonable default to play nice with Synapse's
# `max_request_size`/`max_request_body_size`. We chose the same as `MAX_PDU_SIZE` as our
# `max_request_body_size` math is currently limited by 200 `MAX_PDU_SIZE` things. The
# spec for a `/federation/v1/send` request sets the limit at 100 EDUs and 50 PDUs
# which is below that 200 `MAX_PDU_SIZE` limit (`max_request_body_size`).
#
# Allowing oversized EDUs results in failed `/federation/v1/send` transactions (because
# the request overall can overrun the `max_request_body_size`) which are retried over
# and over and prevent other outbound federation traffic from happening.
MAX_EDU_SIZE = 65536
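
The arithmetic behind that comment can be checked with a short sketch. It takes the figures quoted in the comment at face value (a request body limit of 200 times `MAX_PDU_SIZE`, and at most 50 PDUs and 100 EDUs per transaction) and ignores the small overhead of the transaction envelope itself. The names other than `MAX_PDU_SIZE` and `MAX_EDU_SIZE` are hypothetical.

```python
MAX_PDU_SIZE = 65536
MAX_EDU_SIZE = 65536

# Per-transaction limits as quoted in the comment above.
MAX_PDUS_PER_TRANSACTION = 50
MAX_EDUS_PER_TRANSACTION = 100

# Assumption taken from the comment: the body limit is 200 * MAX_PDU_SIZE.
max_request_body_size = 200 * MAX_PDU_SIZE

# Worst case: every PDU and every EDU in the transaction is at its size limit.
worst_case_payload = (
    MAX_PDUS_PER_TRANSACTION * MAX_PDU_SIZE
    + MAX_EDUS_PER_TRANSACTION * MAX_EDU_SIZE
)

print(worst_case_payload)     # 9830400
print(max_request_body_size)  # 13107200
assert worst_case_payload < max_request_body_size
```

With both limits at 65536, a full transaction uses the equivalent of 150 maximum-size items, which leaves headroom under the 200-item bound. Without an EDU limit, no such bound exists.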

"""
This function takes a dictionary of messages and splits them into several EDUs if needed.

It will raise an EventSizeError if a single message is too large to fit into an EDU.
Contributor

What happens if the EventSizeError is raised? How does Synapse recover? Are we sure that outbound federation doesn't remain stuck?

Contributor

Judging from the tests, it looks like we stop messages that are too large from even being put in the outbox, which is great! I just want to confirm that.

Comment on lines +304 to +307
edu_contents = get_device_message_edu_contents(
    sender_user_id, message_type, messages, context
)
remote_edu_contents[destination] = edu_contents
Contributor

Instead of changing the structure of remote_edu_contents from a map of destination to EDU meta into a map of destination to multiple EDU meta, could we just call add_messages_to_device_inbox(...) multiple times?

"type": message_type,
"message_id": random_string(16),
}
# This is the size of the full EDU without any messages and without the opentracing context
Contributor

Why is the BASE_EDU_SIZE calculated without BASE_EDU_CONTENT["org.matrix.opentracing_context"]?
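
To make the concern concrete, here is a small sketch of how much a base size computed without the tracing field would underestimate the encoded EDU content. The field names follow the helper suggested further down in this review. The sample values, the `encoded_size` helper, and the use of `json.dumps` as a stand-in for canonical JSON are assumptions for illustration only.

```python
import json
from typing import Any


def encoded_size(obj: Any) -> int:
    """Approximate canonical-JSON size: sorted keys, no whitespace, UTF-8 bytes."""
    return len(
        json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    )


# Hypothetical sample values, not taken from the PR.
base_without_context = {
    "messages": {},
    "sender": "@alice:example.org",
    "type": "m.room_key_request",
    "message_id": "abcdefghijklmnop",
}
tracing_context = {"trace-id": "hypothetical-value"}
base_with_context = {
    **base_without_context,
    "org.matrix.opentracing_context": json.dumps(tracing_context),
}

size_without_context = encoded_size(base_without_context)
size_with_context = encoded_size(base_with_context)

# Bytes that a size check would miss if it ignored the tracing field.
underestimate = size_with_context - size_without_context
print(underestimate)
```

If the tracing context is sent as part of the EDU, any bytes not counted here would need to be covered by slack elsewhere in the size check.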

if current_edu_size + message_entry_size > MAX_EDU_SIZE:
    edu_contents.append(current_edu_content)
    logger.debug(
        "Splitting %d device messages from %s into EDU msgid %s, %d EDUs queued",
Contributor

Suggested change
"Splitting %d device messages from %s into EDU msgid %s, %d EDUs queued",
"Splitting %d to-device messages from %s into EDU (message_id=%s), (total EDUs so far: %d)",


edu_contents = []

current_edu_content: JsonDict = deepcopy(BASE_EDU_CONTENT)
Contributor

Instead of this cloning, perhaps it's easier to understand if we just have a little helper (maybe performs better as well 🤷):

    def create_new_to_device_edu_content() -> JsonDict:
        """Create a new `m.direct_to_device` EDU `content` object with a unique message ID."""
        content = {
            "messages": {},
            "sender": sender_user_id,
            "type": message_type,
            "message_id": random_string(16),
            "org.matrix.opentracing_context": json_encoder.encode(context)
        }
        return content

) -> List[JsonDict]:
"""
This function takes a dictionary of messages and splits them into several EDUs if needed.

Contributor

This could use a docstring describing the args and the return value.

It should also explain why we care to split, similar to how we explain it for MAX_EDU_SIZE above.

Comment on lines +489 to +495
logger.debug(
    "Queuing last %d device messages from %s into EDU msgid %s, %d EDUs queued",
    len(current_edu_content["messages"]),
    sender_user_id,
    current_edu_content["message_id"],
    len(edu_contents),
)
Contributor

Suggested change
logger.debug(
    "Queuing last %d device messages from %s into EDU msgid %s, %d EDUs queued",
    len(current_edu_content["messages"]),
    sender_user_id,
    current_edu_content["message_id"],
    len(edu_contents),
)
logger.debug(
    "Splitting remaining %d device messages from %s into EDU (message_id=%s), (total EDUs so far: %d)",
    len(current_edu_content["messages"]),
    sender_user_id,
    current_edu_content["message_id"],
    len(edu_contents),
)


mock_send_transaction.reset_mock()

# 2 messages, each just big enough to fit in an EDU
Contributor

Suggested change
# 2 messages, each just big enough to fit in an EDU
# 2 messages, each just big enough to fit into their own EDU


self.assertEqual(mock_send_transaction.call_count, 2)

# A transaction can contain up to 100 EDUs but synapse reserves 10 EDUs for other purposes
Contributor

For my own understanding, this happens at

) = await self.queue._get_to_device_message_edus(edu_limit - 10)

It would be good to label this magic value as a constant which we could also cross-reference here.
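
Something along these lines, as a sketch only: the constant and function names below are hypothetical and would need to match whatever fits the sender code. The value 10 comes from the `edu_limit - 10` call quoted above, and 100 from the per-transaction EDU limit mentioned in the test comment.

```python
# Maximum number of EDUs in one federation transaction.
MAX_EDUS_PER_TRANSACTION = 100

# Hypothetical name: slots kept free for EDUs other than to-device messages.
NUMBER_OF_RESERVED_EDUS = 10


def to_device_edu_limit(edu_limit: int = MAX_EDUS_PER_TRANSACTION) -> int:
    """Return how many to-device EDUs may be pulled for a single transaction."""
    return edu_limit - NUMBER_OF_RESERVED_EDUS


print(to_device_edu_limit())  # 90
```

The test could then reference the same constant instead of restating the numbers in a comment.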

Development

Successfully merging this pull request may close these issues.

A long queue of to-device messages can prevent outgoing federation working
2 participants