source-zendesk-support-native: new connector #2309
Conversation
I did some additional testing & noticed it takes a while for …
Force-pushed from 2d278f0 to 7a01923 (Compare)
LGTM!
A couple of non-blocking comments and things that you might consider.
source-zendesk-support-native/tests/snapshots/snapshots__spec__capture.stdout.json
```python
for resource in response.resources:
    resource_dt = _str_to_dt(getattr(resource, cursor_field))
    if resource_dt > last_seen_dt:
        if count > MIN_CHECKPOINT_COUNT:
```
Some food for thought: What happens if there are long runs of records with the exact same cursor value? For example, thousands of records updated at the same time. Typically there is a potential to miss data if we checkpoint in the middle of reading one of those sequences and then have to resume later without having fully read it, effectively skipping ahead.
This is probably a problem that can be generalized and solved down the road. Perhaps a simple mitigation here would be to not emit a checkpoint unless the cursor value of the current record differs from the cursor value of the prior record. Or maybe this isn't even worth worrying about now, or is impossible for other reasons.
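To make that mitigation concrete, here is a minimal sketch; `resources` and `make_checkpoint` are hypothetical stand-ins, not the connector's actual code:

```python
from typing import Any, Callable, Iterable, Iterator

def yield_with_checkpoints(
    resources: Iterable[Any],
    cursor_field: str,
    min_checkpoint_count: int,
    make_checkpoint: Callable[[Any], Any],
) -> Iterator[Any]:
    """Yield resources, only emitting a checkpoint once the cursor value has
    moved past the prior record's value, so a long run of identical cursor
    values is never split across a checkpoint boundary."""
    prior_cursor: Any = None
    count = 0
    for resource in resources:
        cursor_value = getattr(resource, cursor_field)
        if (
            prior_cursor is not None
            and cursor_value != prior_cursor
            and count > min_checkpoint_count
        ):
            # Safe point: every record with `prior_cursor` has been read.
            yield make_checkpoint(prior_cursor)
            count = 0
        yield resource
        prior_cursor = cursor_value
        count += 1
```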
```python
) -> IncrementalCursorExportResponse:
    # Instead of using Pydantic's model_validate_json that uses json.loads internally,
    # use json.JSONDecoder().raw_decode to reduce memory overhead when processing the response.
    raw_response_bytes = await http.request(log, url, params=params)
```
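For context on why `raw_decode` helps, here is a minimal, illustrative sketch of walking a large JSON array one element at a time instead of materializing the whole parsed list up front; the helper name is mine, not the connector's:

```python
import json
from typing import Any, Iterator

_decoder = json.JSONDecoder()

def iter_array_items(raw: str, array_start: int) -> Iterator[Any]:
    """Yield the items of the JSON array that opens at raw[array_start],
    decoding one element at a time with raw_decode."""
    pos = array_start + 1
    while True:
        # Skip whitespace and commas between array elements.
        while pos < len(raw) and raw[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(raw) or raw[pos] == "]":
            return
        item, pos = _decoder.raw_decode(raw, pos)
        yield item
```

Each yielded item can then be validated and transformed individually (e.g. with Pydantic's `model_validate`) rather than validating the entire response at once. Note that the full response string is still held in memory here, which is the unbounded-size concern raised below.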
What are the sensitivities here? Is the raw response size unbounded? Avoiding some of Pydantic's overhead is a good idea, but if there is no practical limit on how large the response can be, we'll inevitably hit memory issues.
I've got some WIP that is almost finished on a kind of streaming decoder strategy, which only pulls bytes off the wire as they can be parsed and yielded, so that might be something we can look into adding here later.
Yes, I've learned that a lot of custom fields & data can be added to Zendesk tickets, increasing the response size. So the response size theoretically seems unbounded, or at least the limit is large enough that it can OOM the connector.
I'll try out using the streaming decoder strategy you merged before I merge this PR.
```python
CURSOR_PAGINATION_PAGE_SIZE = 100
MIN_CHECKPOINT_COUNT = 1500
```
Looks like this is used in a couple of places, where we are looping over pages of results. First off, I'll recognize that I am probably starting to sound like a lunatic with the different suggestions I have regarding when to checkpoint, so take this suggestion as you wish. But perhaps we could employ a strategy of just checkpointing after we read each "page" (see my other comment below regarding sequences of identical timestamps though), rather than tracking a separate count and variable here?
I think I'll summarize my thoughts on how often to checkpoint:
- Checkpoints really can be an arbitrarily large size; the system is designed to handle this. Practically speaking, though, we usually don't want to risk losing a lot of work the connector has done when it gets nuked for whatever reason and has to restart. There are also some implications for materialization performance if there is a huge amount of data between checkpoints.
- Going to the opposite extreme, it can sometimes make sense to checkpoint after every single document. For example, when continuously reading a change stream, each record may have an offset associated with it, and checkpointing after each document is reasonable since we are reading them as they come; the sooner we checkpoint a document, the sooner it can be processed downstream, which reduces end-to-end latency. Another thing to be aware of is that if the capture emits huge volumes of checkpoints, the runtime will actually combine them internally when committing data to the target collections, so that's an additional optimization.
- But there's a cost to checkpointing: Depending on the document size, serializing the checkpoint may be a significant amount of CPU work for the connector if checkpoints are emitted at an extreme rate, like when processing lots of records from a backfill.
In this particular case I think the code would be just a little simpler without the MIN_CHECKPOINT_COUNT and related accounting. Page sizes of 100 are kind of small, but they provide a nice logical breaking point that the code has to handle anyway so we might as well take advantage of that.
Thanks for that summary, that helps put everything into context & makes sense. To your other comment, I do think we need to be sensitive to long runs of records with the same cursor value (I've seen this for a few Zendesk Support streams before). So instead of checkpointing after `MIN_CHECKPOINT_COUNT` is exceeded, I'll update the connector to checkpoint before processing the next page of results, since I'll need to check for long runs of records with identical cursor values that span separate pages.
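A minimal sketch of that per-page strategy, with hypothetical `fetch_page` and `make_checkpoint` helpers standing in for the connector's actual functions:

```python
from typing import Any, AsyncIterator, Awaitable, Callable

async def read_pages(
    fetch_page: Callable[[str | None], Awaitable[tuple[list[Any], str | None]]],
    cursor_field: str,
    make_checkpoint: Callable[[Any], Any],
) -> AsyncIterator[Any]:
    """Yield resources page by page, checkpointing between pages only when the
    next page starts with a different cursor value than the previous page ended
    with, so runs of identical cursor values spanning pages aren't split."""
    page_token: str | None = None
    last_cursor_value: Any = None
    while True:
        resources, page_token = await fetch_page(page_token)
        if not resources:
            return
        first_cursor_value = getattr(resources[0], cursor_field)
        if last_cursor_value is not None and first_cursor_value != last_cursor_value:
            # Safe point: every record at `last_cursor_value` has been read.
            yield make_checkpoint(last_cursor_value)
        for resource in resources:
            yield resource
        last_cursor_value = getattr(resources[-1], cursor_field)
        if page_token is None:
            return
```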
Zendesk Support's incremental cursor export endpoints use a base64 encoded string as a cursor to incrementally follow a log of updated resources. This check didn't previously account for `tuple[str]` log cursors since that functionality wasn't needed until now.
This is an initial version of a native Zendesk Support capture connector. It currently only has a subset of the streams that exist in the imported `source-zendesk-support` connector, and the remaining streams will be added at a later date to achieve parity with the imported connector. This initial version ended up having more streams than I had originally anticipated. When planning out the right abstractions for the most important streams, I ended up making a lot of the simpler streams in the process.
…rsor export streams

When using Pydantic's `model_validate_json` and enabling all streams for a Zendesk account with a significant amount of data, memory usage would be considerably high. This would result in connector OOMs if the configured page size was large enough. To allow connectors in these use cases to configure a higher page size, I improved the memory usage of the incremental cursor export streams by:

1. Using `json.JSONDecoder().raw_decode` to parse large responses more efficiently than the `json.loads` that Pydantic's `model_validate_json` uses internally.
2. Validating & transforming resources one-by-one as they're yielded instead of all at once when parsing the response.

By observing connector container stats before & after these changes, average memory usage decreased a good amount (with a page size of 500, it dropped from ~89% to ~71%).
… to use date windows

Zendesk's `/satisfaction_ratings` endpoint returns results in descending order, and it can take a long time to backfill. The `satisfaction_ratings` stream has been refactored to use date windows so the connector can perform backfills in checkpoint-able chunks rather than having to process all results in a single shot.
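For reference, a minimal sketch of the date-window idea; the window size and helper name are illustrative, not the connector's actual configuration:

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator

def date_windows(
    start: datetime,
    end: datetime,
    window: timedelta = timedelta(days=30),
) -> Iterator[tuple[datetime, datetime]]:
    """Split [start, end) into consecutive windows so each window of
    satisfaction ratings can be fetched and checkpointed independently."""
    window_start = start
    while window_start < end:
        window_end = min(window_start + window, end)
        yield (window_start, window_end)
        window_start = window_end

# Example: iterate backfill windows from an arbitrary start date up to now.
for window_start, window_end in date_windows(
    datetime(2024, 1, 1, tzinfo=timezone.utc), datetime.now(tz=timezone.utc)
):
    print(window_start, "->", window_end)
```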
…f checkpointing after each page of results
…mental export resources

Using the new `http.request_object_stream` to stream incremental export resources significantly reduces the memory usage of the connector, even when using the max page size of 1000.
Force-pushed from 7a01923 to 85748a0 (Compare)
Description:
This is an initial version of a native Zendesk Support capture connector. It currently only has a subset of the streams that exist in the imported `source-zendesk-support` connector, and the remaining streams will be added at a later date to achieve parity with the imported connector.

This initial version ended up having more streams than I had originally anticipated. When planning out the right abstractions for the most important streams, I ended up making a lot of the simpler streams in the process.
When testing the connector with all streams enabled & a large Zendesk account, the connector can OOM if the `incremental_export_page_size` is not reduced. For the account I was testing with, the responses for the `tickets` stream were large enough that concurrently processing separate ticket responses requires significant memory usage. I've tried to reduce this memory usage so the `incremental_export_page_size` doesn't need to be reduced as much.

To support string cursors, our CDK has also been updated to allow `LogCursor`s of type `tuple[str]`. Since the applicable Zendesk cursors are a base64 encoded string of the resource timestamp & id (decoded example: `1738185378.0||12345678`), I was able to keep the strictly increasing nature of the `tuple[str]` log cursors.
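As an illustration of that cursor format, here is a minimal sketch of decoding such a cursor into an ordered tuple; the helper name and exact encoding details are assumptions based on the decoded example above, not the connector's actual code:

```python
import base64

def decode_export_cursor(cursor: str) -> tuple[str, str]:
    """Decode a Zendesk incremental export cursor of the assumed form
    base64("<unix timestamp>||<resource id>") into a (timestamp, id) tuple."""
    decoded = base64.b64decode(cursor).decode("utf-8")  # e.g. "1738185378.0||12345678"
    timestamp, resource_id = decoded.split("||")
    return (timestamp, resource_id)

# Round-trip the decoded example from the description above.
example = base64.b64encode(b"1738185378.0||12345678").decode("utf-8")
print(decode_export_cursor(example))  # ('1738185378.0', '12345678')
```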
Workflow steps:

(How does one use this feature, and how has it changed)
Documentation links affected:
Documentation will need to be created for `source-zendesk-support-native`.

Notes for reviewers:
Tested on a local stack with a few different Zendesk accounts. Confirmed: