
Skyhigh: fix loss events #2065

Draft
TOUFIKIzakarya wants to merge 1 commit into develop from fix/skyhigh_loss_events

Conversation

TOUFIKIzakarya (Collaborator) commented Feb 26, 2026

Summary by Sourcery

Synchronize event collection and forwarding using batch identifiers so that ingestion checkpoints are only advanced after successful forwarding.

Bug Fixes:

  • Prevent loss of Skyhigh SWG events by deferring time range advancement until batches are confirmed pushed to the intake.

Enhancements:

  • Propagate batch identifiers through collector, transformer, and forwarder pipelines to track pending and processed batches.
  • Introduce a batch status queue to coordinate push confirmations between the forwarder and collector, avoiding premature checkpoint updates.

Tests:

  • Update and extend unit tests to cover batch ID propagation, transformer behavior with tagged batches, and confirmation of batch completion by the forwarder.
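
The collect-then-confirm handshake described above can be sketched with plain queue.Queue objects. This is a simplified stand-in, not the PR's code: only batch_status_queue, the (batch_id, payload) tuples, and the confirm-before-checkpoint rule mirror the change; the function and variable names here are illustrative.

```python
import queue
import threading
import uuid

collect_queue: "queue.Queue" = queue.Queue()
batch_status_queue: "queue.Queue" = queue.Queue()
checkpoints = []  # stands in for the persisted time-range checkpoint

def collector(n_batches: int) -> None:
    for i in range(n_batches):
        batch_id = str(uuid.uuid4())
        collect_queue.put((batch_id, f"payload-{i}"))
        # Only advance the checkpoint once the forwarder confirms the batch.
        try:
            confirmed = batch_status_queue.get(timeout=5)
        except queue.Empty:
            continue  # timeout: the checkpoint is NOT advanced
        if confirmed == batch_id:
            checkpoints.append(batch_id)

def forwarder(n_batches: int) -> None:
    for _ in range(n_batches):
        batch_id, payload = collect_queue.get()
        # ... push `payload` to the intake here ...
        batch_status_queue.put(batch_id)  # confirm the successful push

t1 = threading.Thread(target=collector, args=(3,))
t2 = threading.Thread(target=forwarder, args=(3,))
t1.start(); t2.start(); t1.join(); t2.join()
```

If the forwarder never confirms, the collector times out and the checkpoint stays put, which is exactly the loss-prevention property the fix is after.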

sourcery-ai bot (Contributor) commented Feb 26, 2026

Reviewer's Guide

Implements batch-level tracking and confirmation between event collection, transformation, and forwarding to avoid losing Skyhigh SWG events, wiring a new batch status queue through the pipeline and updating tests accordingly.

Sequence diagram for Skyhigh SWG batch tracking and confirmation

```mermaid
sequenceDiagram
    participant SkyhighSecuritySWGTrigger as SkyhighSecuritySWGTrigger
    participant EventCollector as EventCollector
    participant Transformers as EventsTransformer
    participant EventsForwarder as EventsForwarder
    participant Intake as IntakeService
    participant BatchStatusQueue as batch_status_queue

    SkyhighSecuritySWGTrigger->>EventCollector: start(collect_queue, batch_status_queue)
    SkyhighSecuritySWGTrigger->>Transformers: start(collect_queue, forwarding_queue)
    SkyhighSecuritySWGTrigger->>EventsForwarder: start(forwarding_queue, batch_status_queue)

    loop periodic_collection
        EventCollector->>EventCollector: query_api()
        alt response available
            EventCollector->>EventCollector: generate batch_id
            EventCollector->>EventCollector: pending_batches[batch_id] = end_date
            EventCollector->>Transformers: put((batch_id, response)) on collect_queue

            EventCollector->>BatchStatusQueue: get(timeout=60)
            alt confirmed matching batch_id
                BatchStatusQueue-->>EventCollector: confirmed_batch_id
                EventCollector->>EventCollector: pending_batches.pop(batch_id)
                EventCollector->>EventCollector: _update_time_range()
                EventCollector->>EventCollector: _sleep_until_next_batch()
            else different batch_id
                BatchStatusQueue-->>EventCollector: other_batch_id
                EventCollector->>BatchStatusQueue: put(other_batch_id)
            end
        else no response
            EventCollector->>EventCollector: log No messages to forward
            EventCollector->>EventCollector: _update_time_range()
            EventCollector->>EventCollector: _sleep_until_next_batch()
        end
    end

    loop transform_loop
        Transformers->>Transformers: get((batch_id, response)) from collect_queue
        Transformers->>Transformers: for messages in batched(_transform(response))
        Transformers->>EventsForwarder: put((batch_id, list(messages))) on forwarding_queue
    end

    loop forward_loop
        EventsForwarder->>EventsForwarder: next_batch(max_batch_size)
        EventsForwarder->>EventsForwarder: collect batch_ids, events from forwarding_queue
        alt events not empty
            EventsForwarder->>Intake: push_events_to_intakes(events)
            Intake-->>EventsForwarder: success
            EventsForwarder->>EventsForwarder: processed_batches.add(batch_id)
            EventsForwarder->>BatchStatusQueue: put(batch_id)
        else no events
            EventsForwarder->>EventsForwarder: wait
        end
    end
```

Updated class diagram for Skyhigh SWG batch-aware workers

```mermaid
classDiagram
    class SkyhighSecuritySWGTrigger {
        +SkyhighSWGConfig configuration
        +run()
    }

    class EventCollector {
        +SkyhighSecuritySWGTrigger connector
        +queue~Queue~ events_queue
        +queue~Queue~ batch_status_queue
        +datetime trigger_activation
        +dict headers
        +str endpoint
        +datetime end_date
        +datetime start_date
        +str url
        +dict pending_batches
        +log(*args, **kwargs)
        +query_api()
        +next_batch()
        +_update_time_range()
        +_sleep_until_next_batch()
        +run()
    }

    class EventsTransformer {
        +SkyhighSecuritySWGTrigger connector
        +SkyhighSWGConfig configuration
        +queue~Queue~ queue
        +queue~Queue~ output_queue
        +int max_batch_size
        +run()
    }

    class EventsForwarder {
        +SkyhighSecuritySWGTrigger connector
        +SkyhighSWGConfig configuration
        +queue~Queue~ queue
        +queue~Queue~ batch_status_queue
        +int max_batch_size
        +set processed_batches
        +tuple~set, list~ next_batch(int max_batch_size)
        +run()
    }

    SkyhighSecuritySWGTrigger --> EventCollector : creates
    SkyhighSecuritySWGTrigger --> EventsTransformer : creates
    SkyhighSecuritySWGTrigger --> EventsForwarder : creates

    EventCollector --> EventsTransformer : puts(batch_id, response) on events_queue
    EventsTransformer --> EventsForwarder : puts(batch_id, messages) on output_queue
    EventsForwarder --> EventCollector : confirms batch_id via batch_status_queue
```

File-Level Changes

Change: Introduce batch IDs and a batch status queue to coordinate between collector, transformer, and forwarder so checkpoints only advance after a successful intake push.
  • Extend the EventCollector, Transformer, and EventsForwarder constructors and run loops to pass (batch_id, payload) tuples instead of raw messages.
  • Have EventCollector generate UUID batch IDs, track them in a pending_batches map, enqueue them with responses, and wait for a batch_status_queue confirmation before advancing the time range.
  • Modify Transformer to propagate batch IDs along with transformed event lists into the forwarding queue.
  • Update EventsForwarder to aggregate batch_ids with events, push the events to the intake, and then send one confirmation per processed batch_id on the batch_status_queue while deduplicating confirmations.
Files: SkyhighSecurity/gateway_cloud_services/trigger_skyhigh_security_swg.py

Change: Wire the new batch_status_queue through trigger startup and tests, and adapt the tests to the new (batch_id, events) interfaces and confirmation behavior.
  • Create a batch_status_queue in the trigger run method and pass it to the EventCollector, Transformer, and EventsForwarder instances, including on the restart path.
  • Add pytest fixtures for batch_status_queue and update the event_collector and forwarder fixtures to accept it.
  • Adjust transformer and forwarder tests to work with (batch_id, payload) tuples and assert that batch confirmation is sent on batch_status_queue.
Files: SkyhighSecurity/gateway_cloud_services/trigger_skyhigh_security_swg.py
SkyhighSecurity/tests/test_gateway_cloud_services_trigger.py
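
The forwarder-side "one confirmation per processed batch_id, deduplicated" step described above could look like this sketch. confirm_batches is a hypothetical helper, not the PR's actual code; only the batch_status_queue / processed_batches names come from the change.

```python
import queue

def confirm_batches(batch_ids, batch_status_queue, processed_batches):
    """Send one confirmation per batch_id, skipping already-confirmed ones."""
    for batch_id in batch_ids:
        if batch_id not in processed_batches:
            try:
                batch_status_queue.put(batch_id, block=False)
                processed_batches.add(batch_id)
            except queue.Full:
                # Collector will time out and retry without advancing its checkpoint.
                pass

status_q: "queue.Queue" = queue.Queue()
seen: set = set()
confirm_batches(["a", "b", "a"], status_q, seen)  # "a" appears twice
confirmed = []
while not status_q.empty():
    confirmed.append(status_q.get_nowait())
```

Marking a batch as processed only after the put succeeds keeps a full queue from silently swallowing a confirmation.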


Copilot AI (Contributor) left a comment

Pull request overview

This PR aims to prevent Skyhigh SWG event loss by introducing a “batch ID + push confirmation” mechanism so the collector only advances its checkpoint after events are forwarded successfully.

Changes:

  • Add a batch_status_queue and batch IDs to coordinate collector → transformer → forwarder processing.
  • Modify the collector to wait for batch push confirmation before updating the time range.
  • Update tests to account for the new batch/confirmation flow.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File: SkyhighSecurity/gateway_cloud_services/trigger_skyhigh_security_swg.py — Introduces batch IDs and a status queue to gate checkpoint updates on successful forwarding.
File: SkyhighSecurity/tests/test_gateway_cloud_services_trigger.py — Adjusts fixtures/tests to pass and assert batch/confirmation data through the pipeline.
Comments suppressed due to low confidence (1)

SkyhighSecurity/gateway_cloud_services/trigger_skyhigh_security_swg.py:404

  • EventsForwarder now requires batch_status_queue as the third constructor argument, but Workers.create(...) still instantiates it with (self, forwarding_queue, batch_size). That means batch_size (an int) will be passed as batch_status_queue, leading to runtime failures when the forwarder tries to put() confirmations. Update the worker creation call to pass batch_status_queue before batch_size.
```python
        batch_status_queue: queue.Queue = queue.Queue(maxsize=100)

        # start the event forwarder
        batch_size = int(os.environ.get("BATCH_SIZE", 10000))
        forwarders = Workers.create(
```

Comment on lines +202 to +225
```python
try:
    confirmed_batch_id = self.batch_status_queue.get(
        block=True, timeout=60
    )  # 60 second timeout
    if confirmed_batch_id == batch_id:
        self.log(
            message=f"Batch {batch_id} confirmed pushed", level="debug"
        )
        # Remove from pending
        self.pending_batches.pop(batch_id, None)
    else:
        self.log(
            message=f"Received confirmation for {confirmed_batch_id} but waiting for {batch_id}",
            level="warning",
        )
        # Put it back for next iteration
        self.batch_status_queue.put(confirmed_batch_id)
except queue.Empty:
    self.log(
        message=f"Timeout waiting for batch {batch_id} confirmation. Batch may still be processing.",
        level="warning",
    )
    # Note: We don't remove from pending, checkpoint won't be saved
    return
```
Copilot AI commented Feb 26, 2026

When a confirmation is received for a different batch ID, the code logs and re-queues that ID, but then continues and updates the time range anyway. This can advance the checkpoint even though the current batch_id hasn't been confirmed. Consider looping until the expected batch_id is confirmed (or a deadline is reached) and only then updating the time range.

Suggested change (replaces the block quoted above):

```python
# Wait up to 60 seconds for the specific batch_id to be confirmed.
deadline = datetime.now(timezone.utc) + timedelta(seconds=60)
while True:
    remaining = (deadline - datetime.now(timezone.utc)).total_seconds()
    if remaining <= 0:
        self.log(
            message=(
                f"Timeout waiting for batch {batch_id} confirmation. "
                "Batch may still be processing."
            ),
            level="warning",
        )
        # Note: We don't remove from pending, checkpoint won't be saved
        return
    try:
        confirmed_batch_id = self.batch_status_queue.get(
            block=True,
            timeout=remaining,
        )
    except queue.Empty:
        self.log(
            message=(
                f"Timeout waiting for batch {batch_id} confirmation. "
                "Batch may still be processing."
            ),
            level="warning",
        )
        # Note: We don't remove from pending, checkpoint won't be saved
        return
    if confirmed_batch_id == batch_id:
        self.log(
            message=f"Batch {batch_id} confirmed pushed",
            level="debug",
        )
        # Remove from pending
        self.pending_batches.pop(batch_id, None)
        break
    else:
        self.log(
            message=(
                f"Received confirmation for {confirmed_batch_id} "
                f"but waiting for {batch_id}"
            ),
            level="warning",
        )
        # Put it back for next iteration
        self.batch_status_queue.put(confirmed_batch_id)
```

```python
        message=f"Timeout waiting for batch {batch_id} confirmation. Batch may still be processing.",
        level="warning",
    )
    # Note: We don't remove from pending, checkpoint won't be saved
```

Copilot AI Feb 26, 2026


On confirmation timeout, next_batch() returns immediately. In run(), that causes the collector loop to re-query the same time range again without sleeping, potentially hammering the API and growing pending_batches indefinitely if confirmations never arrive. Consider sleeping/backing off on timeout and cleaning up or retrying the pending batch instead of returning to a tight loop.

Suggested change:

```python
# Note: We don't remove from pending, checkpoint won't be saved
# Sleep before returning to avoid tight loop and API hammering
self._sleep_until_next_batch()
```

Comment on lines +363 to +367

```python
# Confirm batches after successful push
for batch_id in batch_ids:
    if batch_id not in self.processed_batches:
        try:
            self.batch_status_queue.put(batch_id, block=False)
```

Copilot AI Feb 26, 2026


Batch confirmation is currently sent as soon as any events for a batch_id are pushed. However, the Transformer can emit multiple message chunks with the same batch_id (when the CSV response exceeds max_batch_size). Confirming after the first chunk can let the collector advance the checkpoint while later chunks are still queued, causing event loss on restart. Consider adding an explicit end-of-batch marker/count so the forwarder only confirms after the last chunk of a batch has been successfully pushed.
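
A minimal sketch of the end-of-batch-marker idea, assuming the transformer tags each chunk with an is_last flag. None of these names exist in the PR; this is one possible shape for the fix, not the project's code.

```python
import queue

def forward_with_eob(forwarding_queue, batch_status_queue):
    """Confirm a batch only once its final chunk (is_last flag) has been pushed."""
    while True:
        item = forwarding_queue.get()
        if item is None:  # shutdown sentinel, for this sketch only
            return
        batch_id, events, is_last = item
        # ... push `events` to the intake here ...
        if is_last:
            batch_status_queue.put(batch_id)  # whole batch is now safe to checkpoint

fq: "queue.Queue" = queue.Queue()
sq: "queue.Queue" = queue.Queue()
fq.put(("b1", ["e1"], False))  # first chunk: no confirmation yet
fq.put(("b1", ["e2"], True))   # last chunk: confirm the whole batch
fq.put(None)
forward_with_eob(fq, sq)
```

A chunk count sent up front (instead of a flag on the last chunk) would work equally well and tolerates chunk reordering.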

Comment on lines +320 to +322

```python
self.processed_batches: set = (
    set()
)  # Track which batch_ids we've already confirmed
```

Copilot AI Feb 26, 2026


self.processed_batches grows by one UUID per collected batch and is never pruned, so the forwarder will leak memory over long runtimes. If you need de-duplication, consider a bounded/TTL structure, or redesign the batching/confirmation so a growing global set isn't required.
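
One bounded alternative to the ever-growing set is an insertion-ordered dict used as a fixed-size window. This is a sketch under stated assumptions, not the PR's code; BoundedSeen is a hypothetical name.

```python
from collections import OrderedDict

class BoundedSeen:
    """Bounded stand-in for a growing processed_batches set: remembers at
    most `maxlen` batch ids, evicting the oldest first."""

    def __init__(self, maxlen: int = 10_000) -> None:
        self._seen: "OrderedDict[str, None]" = OrderedDict()
        self._maxlen = maxlen

    def add(self, batch_id: str) -> bool:
        """Return True if batch_id was not seen yet (i.e. should be confirmed)."""
        if batch_id in self._seen:
            return False
        self._seen[batch_id] = None
        if len(self._seen) > self._maxlen:
            self._seen.popitem(last=False)  # evict the oldest id
        return True

seen = BoundedSeen(maxlen=2)
results = [seen.add(b) for b in ["a", "b", "a", "c"]]
```

The window size bounds memory at the cost of possibly re-confirming a very old batch id, which the collector already tolerates by re-queueing unexpected confirmations.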

```diff
 @pytest.fixture
-def event_collector(trigger, events_queue):
-    return EventCollector(trigger, events_queue)
+def event_collector(trigger, events_queue, batch_status_queue):
```

Copilot AI Feb 26, 2026


The event_collector fixture now passes an empty batch_status_queue, but EventCollector.next_batch() blocks waiting for a confirmation (up to 60s). Tests like test_next_batch/test_next_batch_error_should_wait that call next_batch() without arranging a confirmation will hang or time out and will no longer observe the expected time range update. Update those tests to pre-populate batch_status_queue (or patch the confirmation wait) so they match the new behavior.

Suggested change:

```python
def event_collector(trigger, events_queue, batch_status_queue):
    # Pre-populate the batch_status_queue so that calls to EventCollector.next_batch()
    # have an immediate confirmation available and do not block waiting on an empty queue.
    batch_status_queue.put(None)
```
Comment on lines +173 to +174
```python
batch_status_queue = queue.Queue()
transformer = Transformer(trigger, input_queue, events_queue, batch_status_queue)
```

Copilot AI Feb 26, 2026


Transformer.__init__ currently accepts (connector, queue, output_queue, max_batch_size=...), but this test passes an extra batch_status_queue positional argument. This will be interpreted as max_batch_size (a Queue), causing runtime errors when batched(..., self.max_batch_size) runs. Update the test to match the actual Transformer signature (or update Transformer if it is supposed to take a status queue).

Suggested change:

```python
transformer = Transformer(trigger, input_queue, events_queue)
```
Comment on lines +181 to +183
```python
batch_ids, events = events_queue.get(block=False)
assert events == ["user_id=-1 username=foo"]
assert "batch-1" in batch_ids
```

Copilot AI Feb 26, 2026


Transformer puts (batch_id, list(messages)) into the output queue, but this test unpacks the item as (batch_ids, events) and then asserts membership ("batch-1" in batch_ids). This is currently just checking substring membership on a string and doesn't validate the intended contract. Update the test to unpack (batch_id, events) and assert equality on batch_id.

Suggested change:

```python
batch_id, events = events_queue.get(block=False)
assert events == ["user_id=-1 username=foo"]
assert batch_id == "batch-1"
```