source-zendesk-support-native: new connector #2309
Conversation
I did some additional testing & noticed it takes a while for …
Force-pushed from 2d278f0 to 7a01923 (Compare)
LGTM!
A couple of non-blocking comments and things that you might consider.
source-zendesk-support-native/tests/snapshots/snapshots__spec__capture.stdout.json
```python
for resource in response.resources:
    resource_dt = _str_to_dt(getattr(resource, cursor_field))
    if resource_dt > last_seen_dt:
        if count > MIN_CHECKPOINT_COUNT:
```
Some food for thought: What happens if there are long runs of records with the exact same cursor value? For example, thousands of records updated at the same time. Typically there is a potential to miss data if we checkpoint in the middle of reading one of those sequences and then have to resume later without having fully read it, effectively skipping ahead.
This is probably a problem that can be generalized and solved down the road. Perhaps a simple mitigation here would be to not emit a checkpoint unless the cursor value of the current record differs from the cursor value of the prior record. Or maybe this isn't even worth worrying about now, or is impossible for other reasons.
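To make that mitigation concrete, here is a minimal sketch; `resources` and `make_checkpoint` are hypothetical stand-ins, not the connector's actual code:

```python
from typing import Any, Callable, Iterable, Iterator

def yield_with_checkpoints(
    resources: Iterable[Any],
    cursor_field: str,
    min_checkpoint_count: int,
    make_checkpoint: Callable[[Any], Any],
) -> Iterator[Any]:
    """Yield resources, only emitting a checkpoint once the cursor value has
    moved past the prior record's value, so a long run of identical cursor
    values is never split across a checkpoint boundary."""
    prior_cursor: Any = None
    count = 0
    for resource in resources:
        cursor_value = getattr(resource, cursor_field)
        if (
            prior_cursor is not None
            and cursor_value != prior_cursor
            and count > min_checkpoint_count
        ):
            # Safe point: every record with `prior_cursor` has been read.
            yield make_checkpoint(prior_cursor)
            count = 0
        yield resource
        prior_cursor = cursor_value
        count += 1
```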
```python
) -> IncrementalCursorExportResponse:
    # Instead of using Pydantic's model_validate_json that uses json.loads internally,
    # use json.JSONDecoder().raw_decode to reduce memory overhead when processing the response.
    raw_response_bytes = await http.request(log, url, params=params)
```
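For context on why `raw_decode` helps, here is a minimal, illustrative sketch of walking a large JSON array one element at a time instead of materializing the whole parsed list up front; the helper name is mine, not the connector's:

```python
import json
from typing import Any, Iterator

_decoder = json.JSONDecoder()

def iter_array_items(raw: str, array_start: int) -> Iterator[Any]:
    """Yield the items of the JSON array that opens at raw[array_start],
    decoding one element at a time with raw_decode."""
    pos = array_start + 1
    while True:
        # Skip whitespace and commas between array elements.
        while pos < len(raw) and raw[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(raw) or raw[pos] == "]":
            return
        item, pos = _decoder.raw_decode(raw, pos)
        yield item
```

Each yielded item can then be validated and transformed individually (e.g. with Pydantic's `model_validate`) rather than validating the entire response at once. Note that the full response string is still held in memory here, which is the unbounded-size concern raised below.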
What are the sensitivities here? Is the raw response size unbounded? Avoiding some of Pydantic's overhead is a good idea, but if there is no practical limit on how large the response can be, we'll inevitably hit memory issues.
I've got some WIP that is almost finished on a kind of streaming decoder strategy, which only pulls bytes off the wire as they can be parsed and yielded, so that might be something we can look into adding here later.
Yes, I've learned that a lot of custom fields & data can be added to Zendesk tickets, increasing the response size. So the response size theoretically seems unbounded, or at least the limit is large enough that it can OOM the connector.
I'll try out using the streaming decoder strategy you merged before I merge this PR.
```python
CURSOR_PAGINATION_PAGE_SIZE = 100
MIN_CHECKPOINT_COUNT = 1500
```
Looks like this is used in a couple of places, where we are looping over pages of results. First off, I'll recognize that I am probably starting to sound like a lunatic with the different suggestions I have regarding when to checkpoint, so take this suggestion as you wish. But perhaps we could employ a strategy of just checkpointing after we read each "page" (see my other comment below regarding sequences of identical timestamps though), rather than tracking a separate count and variable here?
I think I'll summarize my thoughts on how often to checkpoint:
- Checkpoints really can be an arbitrarily large size; the system is designed to handle this. Practically speaking, though, we usually don't want to risk losing a lot of work the connector has done when it gets nuked for whatever reason and has to restart. There are also some implications for materialization performance if there is a huge amount of data between checkpoints.
- Going to the opposite extreme, it can sometimes make sense to checkpoint after every single document. For example, when continuously reading a change stream, each record may have an offset associated with it, and checkpointing after each document is reasonable since we are reading them as they come; the sooner we checkpoint a document, the sooner it can be processed downstream, which reduces end-to-end latency. Another thing to be aware of is that if the capture emits huge volumes of checkpoints, the runtime will actually combine them internally when committing data to the target collections, so that's an additional optimization.
- But there's a cost to checkpointing: Depending on the document size, serializing the checkpoint may be a significant amount of CPU work for the connector if checkpoints are emitted at an extreme rate, like when processing lots of records from a backfill.
In this particular case I think the code would be just a little simpler without the MIN_CHECKPOINT_COUNT and related accounting. Page sizes of 100 are kind of small, but they provide a nice logical breaking point that the code has to handle anyway so we might as well take advantage of that.
Thanks for that summary, that helps put everything into context & makes sense. To your other comment, I do think we need to be sensitive to long runs of records with the same cursor value (I've seen this for a few Zendesk Support streams before). So instead of checkpointing after `MIN_CHECKPOINT_COUNT` is exceeded, I'll update the connector to checkpoint before processing the next page of results, since I'll need to check for long runs of records with identical cursor values that span separate pages.
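A minimal sketch of that per-page strategy, with hypothetical `fetch_page` and `make_checkpoint` helpers standing in for the connector's actual functions:

```python
from typing import Any, AsyncIterator, Awaitable, Callable

async def read_pages(
    fetch_page: Callable[[str | None], Awaitable[tuple[list[Any], str | None]]],
    cursor_field: str,
    make_checkpoint: Callable[[Any], Any],
) -> AsyncIterator[Any]:
    """Yield resources page by page, checkpointing between pages only when the
    next page starts with a different cursor value than the previous page ended
    with, so runs of identical cursor values spanning pages aren't split."""
    page_token: str | None = None
    last_cursor_value: Any = None
    while True:
        resources, page_token = await fetch_page(page_token)
        if not resources:
            return
        first_cursor_value = getattr(resources[0], cursor_field)
        if last_cursor_value is not None and first_cursor_value != last_cursor_value:
            # Safe point: every record at `last_cursor_value` has been read.
            yield make_checkpoint(last_cursor_value)
        for resource in resources:
            yield resource
        last_cursor_value = getattr(resources[-1], cursor_field)
        if page_token is None:
            return
```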
Zendesk Support's incremental cursor export endpoints use a base64 encoded string as a cursor to incrementally follow a log of updated resources. This check didn't previously account for `tuple[str]` log cursors since that functionality wasn't needed until now.
This is an initial version of a native Zendesk Support capture connector. It currently only has a subset of the streams that exist in the imported `source-zendesk-support` connector, and the remaining streams will be added at a later date to achieve parity with the imported connector. This initial version ended up having more streams than I had originally anticipated. When planning out the right abstractions for the most important streams, I ended up making a lot of the simpler streams in the process.
…rsor export streams

When using Pydantic's `model_validate_json` and enabling all streams for a Zendesk account with a significant amount of data, memory usage would be considerably high. This would result in connector OOMs if the configured page size was large enough. To allow connectors in these use cases to configure a higher page size, I improved the memory usage of the incremental cursor export streams by:

1. Using `json.JSONDecoder().raw_decode` to parse large responses more efficiently than the `json.loads` that Pydantic's `model_validate_json` uses internally.
2. Validating & transforming resources one-by-one as they're yielded instead of all at once when parsing the response.

By observing connector container stats before & after these changes, average memory usage decreased a good amount (with a page size of 500, it dropped from ~89% to ~71%).
… to use date windows

Zendesk's `/satisfaction_ratings` endpoint returns results in descending order, and it can take a long time to backfill. The `satisfaction_ratings` stream has been refactored to use date windows so the connector can perform backfills in checkpoint-able chunks rather than having to process all results in a single shot.
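For reference, a minimal sketch of the date-window idea; the window size and helper name are illustrative, not the connector's actual configuration:

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator

def date_windows(
    start: datetime,
    end: datetime,
    window: timedelta = timedelta(days=30),
) -> Iterator[tuple[datetime, datetime]]:
    """Split [start, end) into consecutive windows so each window of
    satisfaction ratings can be fetched and checkpointed independently."""
    window_start = start
    while window_start < end:
        window_end = min(window_start + window, end)
        yield (window_start, window_end)
        window_start = window_end

# Example: iterate backfill windows from an arbitrary start date up to now.
for window_start, window_end in date_windows(
    datetime(2024, 1, 1, tzinfo=timezone.utc), datetime.now(tz=timezone.utc)
):
    print(window_start, "->", window_end)
```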
…f checkpointing after each page of results
…mental export resources

Using the new `http.request_object_stream` to stream incremental export resources significantly reduces the memory usage of the connector, even when using the max page size of 1000.
Force-pushed from 7a01923 to 85748a0 (Compare)
Description:
This is an initial version of a native Zendesk Support capture connector. It currently only has a subset of the streams that exist in the imported `source-zendesk-support` connector, and the remaining streams will be added at a later date to achieve parity with the imported connector.

This initial version ended up having more streams than I had originally anticipated. When planning out the right abstractions for the most important streams, I ended up making a lot of the simpler streams in the process.
When testing the connector with all streams enabled & a large Zendesk account, the connector can OOM if the `incremental_export_page_size` is not reduced. For the account I was testing with, the responses for the `tickets` stream were large enough that concurrently processing separate ticket responses requires significant memory usage. I've tried to reduce this memory usage so the `incremental_export_page_size` doesn't need to be reduced as much.

To support string cursors, our CDK has also been updated to allow `LogCursor`s of type `tuple[str]`. Since the applicable Zendesk cursors are a base64 encoded string of the resource timestamp & id (decoded example: `1738185378.0||12345678`), I was able to keep the strictly increasing nature of the `tuple[str]` log cursors.
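As an illustration of that cursor format, here is a minimal sketch of decoding such a cursor into an ordered tuple; the helper name and exact encoding details are assumptions based on the decoded example above, not the connector's actual code:

```python
import base64

def decode_export_cursor(cursor: str) -> tuple[str, str]:
    """Decode a Zendesk incremental export cursor of the assumed form
    base64("<unix timestamp>||<resource id>") into a (timestamp, id) tuple."""
    decoded = base64.b64decode(cursor).decode("utf-8")  # e.g. "1738185378.0||12345678"
    timestamp, resource_id = decoded.split("||")
    return (timestamp, resource_id)

# Round-trip the decoded example from the description above.
example = base64.b64encode(b"1738185378.0||12345678").decode("utf-8")
print(decode_export_cursor(example))  # ('1738185378.0', '12345678')
```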
Workflow steps:

(How does one use this feature, and how has it changed)
Documentation links affected:
Documentation will need to be created for `source-zendesk-support-native`.

Notes for reviewers:
Tested on a local stack with a few different Zendesk accounts. Confirmed: