
source-zendesk-support-native: new connector #2309

Merged
merged 7 commits into from
Jan 30, 2025

Conversation

Alex-Bair
Member

@Alex-Bair Alex-Bair commented Jan 29, 2025

Description:

This is an initial version of a native Zendesk Support capture connector. It currently only has a subset of the streams that exist in the imported source-zendesk-support connector, and the remaining streams will be added at a later date to achieve parity with the imported connector.

This initial version ended up having more streams than I had originally anticipated. While planning out the right abstractions for the most important streams, I ended up building many of the simpler streams along the way.

When testing the connector with all streams enabled & a large Zendesk account, the connector can OOM if the incremental_export_page_size is not reduced. For the account I was testing with, the responses for the tickets stream were large enough that concurrently processing separate ticket responses required significant memory. I've reduced this memory usage so the incremental_export_page_size doesn't need to be reduced as much.

To support string cursors, our CDK has also been updated to allow LogCursors of type tuple[str]. Since the applicable Zendesk cursors are a base64-encoded string of the resource timestamp & id (decoded example: 1738185378.0||12345678), I was able to keep the strictly increasing nature of the tuple[str] log cursors.
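For illustration, the cursor shape described above can be sketched as follows (`decode_cursor` is a hypothetical helper for this example, not part of the connector):

```python
import base64

def decode_cursor(cursor: str) -> tuple[float, int]:
    # Hypothetical helper: Zendesk incremental export cursors are a base64
    # encoding of "<resource timestamp>||<resource id>".
    decoded = base64.b64decode(cursor).decode("utf-8")
    timestamp, _, resource_id = decoded.partition("||")
    return (float(timestamp), int(resource_id))

# Round-trip the decoded example from above.
cursor = base64.b64encode(b"1738185378.0||12345678").decode("utf-8")
decode_cursor(cursor)  # (1738185378.0, 12345678)
```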

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

Documentation will need to be created for source-zendesk-support-native.

Notes for reviewers:

Tested on a local stack with a few different Zendesk accounts. Confirmed:

  • OAuth worked.
  • All streams captured data & didn't crash.
  • The document counts for the native connector were the same as or greater than the imported connector's (only checked closely with a single Zendesk account since this takes a decent amount of time).


@Alex-Bair Alex-Bair marked this pull request as ready for review January 29, 2025 20:45
@Alex-Bair
Member Author

I did some additional testing & noticed it takes a while for satisfaction_ratings to backfill. Results for this endpoint are returned in descending order, and we can't change that with query params. We can use start_time and end_time params to bound the results, so I'm going to update the stream to use date windows.

@Alex-Bair Alex-Bair force-pushed the bair/zendesk-support-native branch from 2d278f0 to 7a01923 Compare January 30, 2025 06:24
Member

@williamhbaker williamhbaker left a comment


LGTM!

A couple of non-blocking comments and things that you might consider.

for resource in response.resources:
    resource_dt = _str_to_dt(getattr(resource, cursor_field))
    if resource_dt > last_seen_dt:
        if count > MIN_CHECKPOINT_COUNT:
Member


Some food for thought: What happens if there are long runs of records with the exact same cursor value? For example, thousands of records updated at the same time. Typically there is a potential to miss data if we checkpoint in the middle of reading those sequences and then have to resume later: without having fully read them, we'd effectively skip ahead.

This is probably a problem that can be generalized and solved somehow down the road. Perhaps a simple mitigation here would be to not emit a checkpoint unless the cursor value from the current record is different from the cursor value from the prior record. Or, maybe this isn't even worth worrying about now, or is impossible for other reasons.
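As a sketch of that mitigation (names are hypothetical, not from the connector), a checkpoint is only considered safe after the last record of a run of identical cursor values:

```python
def safe_checkpoint_indices(cursor_values: list[str], min_count: int) -> list[int]:
    # Hypothetical sketch: indices after which checkpointing cannot split a
    # run of identical cursor values, once at least min_count records have
    # been read.
    safe = []
    for i, value in enumerate(cursor_values):
        at_run_end = i + 1 == len(cursor_values) or cursor_values[i + 1] != value
        if i + 1 >= min_count and at_run_end:
            safe.append(i)
    return safe

safe_checkpoint_indices(["t1", "t1", "t1", "t2", "t2"], min_count=2)  # [2, 4]
```

Note that index 1 is skipped even though `min_count` is satisfied there, because record 2 still has cursor value `t1`.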

) -> IncrementalCursorExportResponse:
    # Instead of using Pydantic's model_validate_json that uses json.loads internally,
    # use json.JSONDecoder().raw_decode to reduce memory overhead when processing the response.
    raw_response_bytes = await http.request(log, url, params=params)
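For context, `json.JSONDecoder().raw_decode` parses a JSON value out of a string and also returns the index where parsing stopped, unlike `json.loads`, which requires the whole input to be a single document. A minimal standard-library illustration, separate from the connector's actual response handling:

```python
import json

decoder = json.JSONDecoder()
# raw_decode tolerates (and reports the position of) trailing content.
body = '{"tickets": [{"id": 1}], "end_of_stream": true} trailing bytes'
obj, end = decoder.raw_decode(body)
obj["end_of_stream"]   # True
body[end:].strip()     # 'trailing bytes'
```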
Member


What are the sensitivities here? Is the raw response size unbounded? Avoiding some of Pydantic's overhead is a good idea, but if there is no practical limit on how large the response can be, we'll inevitably hit memory issues.

I've got some WIP that is almost finished on a kind of streaming decoder strategy which only pulls bytes off the wire as they can be parsed and yielded, so that might be something we can look into adding here later.

Member Author


Yes, I've learned that a lot of custom fields & data can be added to Zendesk tickets, increasing the response size. So it theoretically seems unbounded, or at least the limit is large enough that it can OOM the connector.

I'll try out using the streaming decoder strategy you merged before I merge this PR.

)

CURSOR_PAGINATION_PAGE_SIZE = 100
MIN_CHECKPOINT_COUNT = 1500
Member


Looks like this is used in a couple of places, where we are looping over pages of results. First off, I'll recognize that I am probably starting to sound like a lunatic with the different suggestions I have regarding when to checkpoint, so take this suggestion as you wish. But perhaps we could employ a strategy of just checkpointing after we read each "page" (see my other comment below regarding sequences of identical timestamps though), rather than tracking a separate count and variable here?

I think I'll summarize my thoughts on how often to checkpoint:

  • Checkpoints really can be an arbitrarily large size; the system is designed to handle this. Practically speaking though we usually don't want to risk losing a lot of work the connector has done when it gets nuked for whatever reason and has to restart. There are also some implications on materialization performance if there is a huge amount of data between checkpoints.
  • Going to the opposite extreme, it can sometimes work to checkpoint after every single document, and it may even make sense to do so. For example, when continuously reading a change stream, each record may have an offset associated with it, and checkpointing after each document is reasonable since we read them as they come; the sooner we checkpoint a document, the sooner it can be processed downstream, which reduces end-to-end latency. Also be aware that if the capture emits huge volumes of checkpoints, the runtime will internally combine them when committing data to the target collections, so that's an additional optimization.
  • But there's a cost to checkpointing: Depending on the document size, serializing the checkpoint may be a significant amount of CPU work for the connector if checkpoints are emitted at an extreme rate, like when processing lots of records from a backfill.

In this particular case I think the code would be just a little simpler without the MIN_CHECKPOINT_COUNT and related accounting. Page sizes of 100 are kind of small, but they provide a nice logical breaking point that the code has to handle anyway so we might as well take advantage of that.

Member Author

@Alex-Bair Alex-Bair Jan 30, 2025


Thanks for that summary; it helps put everything into context & makes sense. To your other comment, I do think we need to be sensitive to long runs of records with the same cursor value (I've seen this in a few Zendesk Support streams before). So instead of checkpointing after MIN_CHECKPOINT_COUNT is exceeded, I'll checkpoint before processing the next page of results, since I'll need to check for long runs of identical cursor values that span separate pages.

Zendesk Support's incremental cursor export endpoints use a base64
encoded string as a cursor to incrementally follow a log of updated
resources. This check previously didn't account for `tuple[str]` log
cursors since that functionality wasn't needed until now.
This is an initial version of a native Zendesk Support capture connector.
It currently only has a subset of the streams that exist in the imported
`source-zendesk-support` connector, and the remaining streams will be
added at a later date to achieve parity with the imported connector.

This initial version ended up having more streams than I had originally
anticipated. When planning out the right abstractions for the most
important streams, I ended up making a lot of the simpler streams in the
process.
…rsor export streams

When using Pydantic's `model_validate_json` and enabling all streams for a Zendesk account with a significant amount of data, memory usage was considerably high. This resulted in connector OOMs if the configured page size was large enough.

To allow connectors in these use cases to be configured with a higher page size, I improved the memory usage of the incremental cursor export streams by:
1. Using `json.JSONDecoder().raw_decode` to parse large responses more efficiently than the `json.loads` that Pydantic's `model_validate_json` uses internally.
2. Validating & transforming resources one-by-one as they're yielded instead of all at once when parsing the response.

Observing connector container stats before & after these changes showed that average memory usage decreased noticeably (with a page size of 500, it dropped from ~89% to ~71%).
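The one-by-one validation described in point 2 can be sketched as below, using a plain dataclass as a hypothetical stand-in for the connector's Pydantic resource model:

```python
from dataclasses import dataclass
from typing import Any, Iterator

@dataclass
class Ticket:
    # Hypothetical stand-in for the connector's resource model.
    id: int
    subject: str

def validated_resources(raw_resources: list[dict[str, Any]]) -> Iterator[Ticket]:
    # Validate & transform lazily, one resource at a time, rather than
    # materializing the whole validated list up front. Only one model
    # instance's overhead is live at any moment.
    for raw in raw_resources:
        yield Ticket(id=int(raw["id"]), subject=str(raw["subject"]))

[t.id for t in validated_resources([{"id": 1, "subject": "a"}, {"id": 2, "subject": "b"}])]  # [1, 2]
```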
… to use date windows

Zendesk's `/satisfaction_ratings` endpoint returns results in descending
order, and it can take a long time to backfill. The
`satisfaction_ratings` stream has been refactored to use date windows so
the connector can perform backfills in checkpoint-able chunks rather
than having to process all results in a single shot.
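The date-window approach described above can be sketched as follows (hypothetical helper, not the connector's actual code): each window becomes one bounded request and one checkpoint-able chunk of the backfill.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator

def date_windows(
    start: datetime, end: datetime, window: timedelta
) -> Iterator[tuple[datetime, datetime]]:
    # Hypothetical sketch: yield (start_time, end_time) pairs covering
    # [start, end), each used to bound one request's results.
    cursor = start
    while cursor < end:
        window_end = min(cursor + window, end)
        yield (cursor, window_end)
        cursor = window_end

windows = list(date_windows(
    datetime(2025, 1, 1, tzinfo=timezone.utc),
    datetime(2025, 1, 8, tzinfo=timezone.utc),
    timedelta(days=3),
))
len(windows)  # 3; the final window is clamped to the end date
```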
…mental export resources

Using the new `http.request_object_stream` to stream incremental export
resources significantly reduces the memory usage of the connector, even
when using the max page size of 1000.
@Alex-Bair Alex-Bair force-pushed the bair/zendesk-support-native branch from 7a01923 to 85748a0 Compare January 30, 2025 20:06
@Alex-Bair Alex-Bair merged commit 4d8184f into main Jan 30, 2025
78 of 83 checks passed
@Alex-Bair Alex-Bair deleted the bair/zendesk-support-native branch January 30, 2025 20:52