
Implement list and delete snapshot functionality in Python SDK #448

Open
shrutiyam-glitch wants to merge 10 commits into kubernetes-sigs:main from shrutiyam-glitch:pss-list-delete

Conversation

@shrutiyam-glitch
Contributor

@shrutiyam-glitch shrutiyam-glitch commented Mar 20, 2026

This PR implements the ability to list and delete Pod Snapshots within the PodSnapshotSandboxClient.

Core Logic Implementation:

  1. Snapshot Listing (sandbox.snapshots.list)
    -- Supports a filter_by parameter to let users filter snapshots by state (e.g., ready_only) or by grouping labels
    -- Returns snapshots sorted by creation timestamp (newest first) to simplify "latest-available" restoration logic
    -- Leverages Pydantic models for reliable parsing of Kubernetes API responses, ensuring type safety for snapshot metadata and status fields

  2. Single Snapshot Deletion (sandbox.snapshots.delete)
    -- Supports deleting snapshots by specific UID

  3. Multiple Snapshot Deletion (sandbox.snapshots.delete_all)
    -- Performs bulk deletion for all snapshots associated with the current Sandbox.
    -- Implements wait_for_snapshot_deletion using Kubernetes watch streams. The logic has been hardened to correctly handle resourceVersion to avoid race conditions during the watch initialization.
    -- Correctly propagates timeout results and distinguishes between successful deletions and partial batch failures.
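The newest-first ordering described above can be sketched in isolation. This is an illustrative sketch, not the SDK's actual code; the dict shape mirrors standard Kubernetes object metadata, and `sort_newest_first` is a hypothetical name.

```python
from datetime import datetime, timezone

def sort_newest_first(snapshots):
    # Sort by metadata.creationTimestamp descending so that index 0 is
    # always the latest snapshot, simplifying "restore latest" logic.
    return sorted(
        snapshots,
        key=lambda s: datetime.strptime(
            s["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc),
        reverse=True,
    )

snaps = [
    {"metadata": {"uid": "older", "creationTimestamp": "2026-04-02T23:30:28Z"}},
    {"metadata": {"uid": "newer", "creationTimestamp": "2026-04-02T23:30:40Z"}},
]
print([s["metadata"]["uid"] for s in sort_newest_first(snaps)])  # ['newer', 'older']
```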

Testing Done:

  • Integration Test: Added test_podsnapshot_extension.py which verifies the full E2E flow:
    -- creating multiple snapshots, listing them, and performing a cleanup deletion

  • Unit tests have been added

Output:

  • Integration Test (clients/python/agentic-sandbox-client/test_podsnapshot_extension.py):
$ python3 test_podsnapshot_extension.py --template-name python-counter-template --namespace sandbox-test
--- Starting Sandbox Client Test (Namespace: sandbox-test, Port: 8888) ---

***** Phase 1: Starting Counter *****

======= Testing Pod Snapshot Extension =======
2026-04-02 23:30:15,429 - INFO - Creating SandboxClaim 'sandbox-claim-59eb7c1a' in namespace 'sandbox-test' using template 'python-counter-template'...
2026-04-02 23:30:15,533 - INFO - Resolving sandbox name from claim 'sandbox-claim-59eb7c1a'...
2026-04-02 23:30:15,651 - INFO - Resolved sandbox name 'sandbox-claim-59eb7c1a' from claim status
2026-04-02 23:30:15,652 - INFO - Watching for Sandbox sandbox-claim-59eb7c1a to become ready...
2026-04-02 23:30:17,647 - INFO - Sandbox sandbox-claim-59eb7c1a is ready.
Creating first pod snapshot 'test-snapshot-10' after 10 seconds...
2026-04-02 23:30:27,953 - INFO - Waiting for snapshot manual trigger 'test-snapshot-10-20260402-233027-0631c160' to be processed...
2026-04-02 23:30:30,350 - INFO - Snapshot manual trigger 'test-snapshot-10-20260402-233027-0631c160' processed successfully. Created Snapshot UID: 72474e3b-0055-45d4-ada8-5db27e28c04c
Trigger Name: test-snapshot-10-20260402-233027-0631c160
First snapshot UID: 72474e3b-0055-45d4-ada8-5db27e28c04c

Creating second pod snapshot 'test-snapshot-20' after 10 seconds...
2026-04-02 23:30:40,541 - INFO - Waiting for snapshot manual trigger 'test-snapshot-20-20260402-233040-894477f6' to be processed...
2026-04-02 23:30:43,739 - INFO - Snapshot manual trigger 'test-snapshot-20-20260402-233040-894477f6' processed successfully. Created Snapshot UID: 52066ca7-69a1-459d-a802-5aa7cbf1cdfa
Trigger Name: test-snapshot-20-20260402-233040-894477f6
Recent snapshot UID: 52066ca7-69a1-459d-a802-5aa7cbf1cdfa

Checking if sandbox was restored from snapshot '52066ca7-69a1-459d-a802-5aa7cbf1cdfa'...
2026-04-02 23:30:53,740 - INFO - Creating SandboxClaim 'sandbox-claim-2404f0ac' in namespace 'sandbox-test' using template 'python-counter-template'...
2026-04-02 23:30:53,884 - INFO - Resolving sandbox name from claim 'sandbox-claim-2404f0ac'...
2026-04-02 23:30:53,972 - INFO - Resolved sandbox name 'sandbox-claim-2404f0ac' from claim status
2026-04-02 23:30:53,973 - INFO - Watching for Sandbox sandbox-claim-2404f0ac to become ready...
2026-04-02 23:30:55,725 - INFO - Sandbox sandbox-claim-2404f0ac is ready.
Pod was restored from the most recent snapshot.

Listing all snapshots for sandbox 'sandbox-claim-59eb7c1a'...
2026-04-02 23:30:56,032 - INFO - Listing snapshots with label selector: podsnapshot.gke.io/pod-name=sandbox-claim-59eb7c1a,tenant-id=test-tenant,user-id=test-user
2026-04-02 23:30:56,132 - INFO - Found 2 snapshots.
Snapshot UID: 52066ca7-69a1-459d-a802-5aa7cbf1cdfa, Source Pod: sandbox-claim-59eb7c1a, Creation Time: 2026-04-02T23:30:40Z
Snapshot UID: 72474e3b-0055-45d4-ada8-5db27e28c04c, Source Pod: sandbox-claim-59eb7c1a, Creation Time: 2026-04-02T23:30:28Z

Deleting snapshot '52066ca7-69a1-459d-a802-5aa7cbf1cdfa' of the sandbox 'sandbox-claim-59eb7c1a'...
2026-04-02 23:30:56,132 - INFO - Deleting PodSnapshot '52066ca7-69a1-459d-a802-5aa7cbf1cdfa'...
2026-04-02 23:30:56,231 - INFO - PodSnapshot '52066ca7-69a1-459d-a802-5aa7cbf1cdfa' deletion requested. Waiting for confirmation...
2026-04-02 23:30:56,308 - INFO - Waiting for PodSnapshot '52066ca7-69a1-459d-a802-5aa7cbf1cdfa' to be deleted...
2026-04-02 23:30:57,049 - INFO - PodSnapshot '52066ca7-69a1-459d-a802-5aa7cbf1cdfa' confirmed deleted.
2026-04-02 23:30:57,049 - INFO - Snapshot deletion process completed. Deleted 1 snapshots.
Snapshot '52066ca7-69a1-459d-a802-5aa7cbf1cdfa' deleted successfully.

Deleting all snapshots for sandbox 'sandbox-claim-59eb7c1a'...
2026-04-02 23:30:57,050 - INFO - Deleting snapshots matching labels: {'tenant-id': 'test-tenant', 'user-id': 'test-user'}
2026-04-02 23:30:57,050 - INFO - Deleting snapshots matching labels: {'tenant-id': 'test-tenant', 'user-id': 'test-user'}
2026-04-02 23:30:57,050 - INFO - Listing snapshots with label selector: podsnapshot.gke.io/pod-name=sandbox-claim-59eb7c1a,tenant-id=test-tenant,user-id=test-user
2026-04-02 23:30:57,166 - INFO - Found 1 snapshots.
2026-04-02 23:30:57,166 - INFO - Deleting PodSnapshot '72474e3b-0055-45d4-ada8-5db27e28c04c'...
2026-04-02 23:30:57,273 - INFO - PodSnapshot '72474e3b-0055-45d4-ada8-5db27e28c04c' deletion requested. Waiting for confirmation...
2026-04-02 23:30:57,358 - INFO - Waiting for PodSnapshot '72474e3b-0055-45d4-ada8-5db27e28c04c' to be deleted...
2026-04-02 23:30:57,947 - INFO - PodSnapshot '72474e3b-0055-45d4-ada8-5db27e28c04c' confirmed deleted.
2026-04-02 23:30:57,947 - INFO - Snapshot deletion process completed. Deleted 1 snapshots.
Snapshot '72474e3b-0055-45d4-ada8-5db27e28c04c' deleted successfully.
--- Pod Snapshot Test Passed! ---
Cleaning up all sandboxes...
2026-04-02 23:30:58,125 - INFO - Deleted PodSnapshotManualTrigger 'test-snapshot-10-20260402-233027-0631c160'
2026-04-02 23:30:58,229 - INFO - Deleted PodSnapshotManualTrigger 'test-snapshot-20-20260402-233040-894477f6'
2026-04-02 23:30:58,229 - INFO - Connection to sandbox claim 'sandbox-claim-59eb7c1a' has been closed.
2026-04-02 23:30:58,328 - INFO - Terminated SandboxClaim: sandbox-claim-59eb7c1a
2026-04-02 23:30:58,328 - INFO - Connection to sandbox claim 'sandbox-claim-2404f0ac' has been closed.
2026-04-02 23:30:58,425 - INFO - Terminated SandboxClaim: sandbox-claim-2404f0ac

--- Sandbox Client Test Finished ---

@netlify

netlify bot commented Mar 20, 2026

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit 03caf69
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/69d05ae7271d270007abd014

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shrutiyam-glitch
Once this PR has been reviewed and has the lgtm label, please assign justinsb for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from igooch and justinsb March 20, 2026 15:54
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 20, 2026
@aditya-shantanu
Contributor

/assign codebot-robot

@k8s-ci-robot
Contributor

@aditya-shantanu: GitHub didn't allow me to assign the following users: codebot-robot.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


In response to this:

/assign codebot-robot

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@barney-s
Contributor

/assign @codebot-robot

@k8s-ci-robot
Contributor

@barney-s: GitHub didn't allow me to assign the following users: codebot-robot.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


In response to this:

/assign @codebot-robot

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@aditya-shantanu
Contributor

/assign @codebot-robot

@k8s-ci-robot
Contributor

@aditya-shantanu: GitHub didn't allow me to assign the following users: codebot-robot.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


In response to this:

/assign @codebot-robot

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot removed the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 24, 2026
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 24, 2026
@SHRUTI6991
Contributor

Can we base this PR off of #467?

  1. For delete_snapshots: The user will call something like sandbox.snapshots.delete: this method should delete all snapshots associated with a Sandbox.
  2. For list_snapshots: Let's add a param called filter_by. And then we can provide user the option to filter by Ready state etc.

@codebot-robot codebot-robot left a comment

Overall, this is a solid implementation that covers the core functionality and includes thorough unit and integration tests. I've left several comments focused on edge cases, defensive programming, and reliability.

The main areas to address are:

  1. Hardening .get() calls to safely handle None values returned by the Kubernetes API for status and metadata.
  2. Changing truthiness checks (if snapshot_uid:) to explicit None checks (if snapshot_uid is not None:) to prevent accidental mass deletion if an empty string is provided.
  3. Making the integration tests safer against concurrent execution by avoiding hardcoded labels and exact length assertions.

Please review the inline comments for detailed suggestions.

(This review was generated by Overseer)
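The second point above, distinguishing an empty string from "no UID given", can be sketched as follows; the function and parameter names are illustrative, not the SDK's:

```python
def select_targets(all_uids, snapshot_uid=None):
    # Explicit None check: only fall through to the bulk branch when the
    # caller genuinely passed no UID. A truthiness check (`if snapshot_uid:`)
    # would treat "" as "not given" and delete every snapshot.
    if snapshot_uid is not None:
        return [snapshot_uid]
    return list(all_uids)

select_targets(["a", "b"], "")    # -> [""] (a later 404, but no mass delete)
select_targets(["a", "b"], None)  # -> ["a", "b"]
```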

@shrutiyam-glitch
Contributor Author

/hold
Will rebase after #467 is merged.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 24, 2026
@codebot-robot codebot-robot left a comment

Overall the PR introduces the requested list and delete functionality solidly and includes comprehensive test coverage.
There are a few areas we should refine before merging:

  1. Edge case handling when metadata or UIDs are missing/malformed.
  2. Preventing invalid labels from failing the list request.
  3. Consistency in error propagation and exception usage.

Please review the detailed inline comments for specific suggestions.

(This review was generated by Overseer)

@codebot-robot codebot-robot left a comment

The PR effectively introduces list_snapshots and delete_snapshots along with comprehensive unit and integration tests.
Most suggestions focus on minor maintainability improvements such as avoiding magic strings, explicitly formatting docstrings, and improving observability in error scenarios.
Adding a request timeout for list API calls and verifying pagination handling would also improve resilience against non-ideal network states.
Thanks for the excellent contribution!

(This review was generated by Overseer)

@codebot-robot codebot-robot left a comment

Overall, this is a solid implementation that correctly extends the SDK's capabilities to manage PodSnapshots. The addition of sorting logic and conditional bulk deletion are great features.

The review highlights a few areas for improvement:

  • Defensive Programming: Hardening the parsing logic against unexpected or malformed Kubernetes API responses (e.g., missing status fields) to prevent unhandled AttributeErrors.
  • Error Handling & Logging: Refining error messages to explicitly distinguish between total and partial batch deletion failures, and cleaning up log outputs when optional parameters like grouping_labels are omitted.
  • Edge Cases: Properly handling terminal failure states (so they aren't merely marked as "NotReady") and documenting the subtle behavior of passing an empty dictionary for grouping labels.
  • Test Coverage: Adding a few targeted assertions to verify fallback paths and addressing hardcoded labels in shared templates.

Addressing these minor issues will make the client significantly more robust against cluster-side anomalies.

(This review was generated by Overseer)

@codebot-robot codebot-robot left a comment

Thank you for the detailed PR. The list_snapshots and delete_snapshots functionality looks solid and significantly improves the lifecycle management capabilities of the Python SDK.

The testing coverage is also quite comprehensive. I've left a few comments mainly focused on refining the logging strategies (to prevent noisy logs in large clusters), fortifying the exception handling for K8s API errors, and making the integration test assertions slightly more resilient against timing fluctuations.

(This review was generated by Overseer)

@SHRUTI6991
Contributor

Can we base this PR off of #467?

  1. For delete_snapshots: The user will call something like sandbox.snapshots.delete: this method should delete all snapshots associated with a Sandbox.
  2. For list_snapshots: Let's add a param called filter_by. And then we can provide user the option to filter by Ready state etc.

The PR is now merged; we can remove the hold. @shrutiyam-glitch

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 30, 2026
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 2, 2026
@shrutiyam-glitch
Contributor Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 2, 2026
@codebot-robot codebot-robot left a comment

Overall, excellent work on this solid PR! It successfully addresses the issue by expanding the SDK with robust list and batch-deletion capabilities. The use of Pydantic dataclasses for structuring K8s API responses is a great pattern.

While the implementation generally handles bulk operation errors appropriately, I have identified a few critical edge cases and areas for improvement:

  1. Unhandled ValidationError: The response parsing loop in list() is outside the try...except block, meaning malformed resources could crash the client instead of returning a failed result.
  2. Timeout ignored: The boolean return value of wait_for_snapshot_deletion is currently ignored in delete(), meaning timed-out deletions are incorrectly reported as successful.
  3. Watch Stream termination: K8s watch streams can silently close before the timeout, which would result in premature timeout failures.
  4. Inconsistent 404 handling: deleted_snapshots behaves differently if a 404 occurs immediately versus during the wait cycle.
  5. Race conditions: Ensure edge cases are hardened regarding race conditions when polling deletions.

Additionally, I have left a few minor suggestions for improving the resilience of the integration tests to avoid flakiness and slightly reducing log noise during the startup lifecycle.

Great work! Let me know if you'd like to discuss any of these.

(This review was generated by Overseer)

"""
# Check if already deleted
try:
k8s_helper.custom_objects_api.get_namespaced_custom_object(

Calling get_namespaced_custom_object to check existence before starting the watch stream creates a small race condition. If the object gets deleted in the milliseconds before w.stream starts, the stream will hang until the 60s timeout. Consider extracting the resourceVersion from the get call and passing it to w.stream so no events are missed.
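A minimal sketch of the race-free pattern this comment suggests. `get_object` and `stream_events` stand in for `get_namespaced_custom_object` and `Watch().stream`, and `NotFound` stands in for a 404 `ApiException`; all names here are illustrative, not the SDK's actual code.

```python
class NotFound(Exception):
    """Stands in for kubernetes.client.ApiException with status 404."""

def wait_for_deletion(get_object, stream_events, timeout=60):
    try:
        obj = get_object()
    except NotFound:
        return True  # already deleted
    # Resume the watch from the resourceVersion observed in the GET, so a
    # DELETED event fired between the GET and the stream start is replayed
    # by the API server rather than silently missed.
    rv = obj["metadata"]["resourceVersion"]
    for event in stream_events(resource_version=rv, timeout_seconds=timeout):
        if event["type"] == "DELETED":
            return True
    return False  # stream ended or timed out without confirmation
```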

errors = []
for uid in snapshots_to_delete:
    # Delete PodSnapshot
    try:

wait_for_snapshot_deletion returns a boolean indicating whether the deletion was confirmed or timed out. Currently, this return value is ignored, and the uid is unconditionally appended to deleted_snapshots even if the deletion timed out and potentially failed. You should check the return value and handle the timeout case appropriately (e.g., by adding it to errors instead of deleted_snapshots).
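One way to honor that boolean, sketched with stand-in callables; `delete_one` and `confirm_deleted` are illustrative placeholders for the SDK's per-snapshot delete call and its watch-based `wait_for_snapshot_deletion`.

```python
def delete_snapshots(uids, delete_one, confirm_deleted):
    deleted, errors = [], []
    for uid in uids:
        try:
            delete_one(uid)
        except Exception as exc:
            errors.append((uid, str(exc)))
            continue
        # Only record the UID as deleted once the watch confirms it;
        # a False return means the wait timed out and is reported as
        # an error rather than a success.
        if confirm_deleted(uid):
            deleted.append(uid)
        else:
            errors.append((uid, "timed out waiting for deletion"))
    return deleted, errors
```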

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 3, 2026
        error_code=SNAPSHOT_ERROR_CODE,
    )
except ValidationError as e:
    logger.error(f"Malformed snapshot data: {e}")
Member

The current try...except ValidationError block wraps the entire loop and returns success=False for the list operation if any snapshot is malformed. This means a single bad snapshot prevents the user from listing or deleting any valid snapshots for that pod. Is this expected?

You could move this try...except inside the loop to simply skip and log malformed items instead of aborting the whole operation.
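Moving the guard inside the loop, as suggested, might look like this sketch; `parse_one` plays the role of the Pydantic model constructor, with `ValueError` standing in for `pydantic.ValidationError` (which subclasses it).

```python
def parse_snapshots(items, parse_one):
    parsed, skipped = [], 0
    for item in items:
        try:
            parsed.append(parse_one(item))
        except ValueError:
            # Skip and count the malformed item instead of failing the
            # whole list() call; valid snapshots are still returned.
            skipped += 1
    return parsed, skipped
```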

@codebot-robot codebot-robot left a comment

Overall, the PR introduces solid improvements for managing Pod snapshots through the Python SDK and follows existing conventions well. The abstractions for filtering and batch deletion are well-designed, and the addition of Pydantic models alongside comprehensive unit and integration tests greatly improves the SDK's robustness.

However, please review the provided inline comments carefully. A few areas need attention:

  • Test Errors: The integration test is passing an incorrect parameter name (filter_value instead of label_value), which will cause a TypeError.
  • Falsy Checks: Conditional checks around empty dictionaries/strings evaluate to falsy, leading to unexpected behaviors (e.g. labels={} or snapshot_uid="" silently doing nothing or performing mass deletions).
  • Race Conditions: The Kubernetes watch setup in wait_for_snapshot_deletion introduces a race condition if resource_version is not extracted from the initial check.
  • API Responses & Edge Cases: There are a few places where None responses from the K8s API could lead to AttributeErrors. Care is also needed when handling specific object formats from the API.
  • Typing Enhancements: Ensure common result formats share a base class to reduce code duplication.
  • Logging: Optimize log levels to prevent spam during batch deletion operations.

Addressing these issues will ensure the SDK is highly robust.

(This review was generated by Overseer)


print(f"\nDeleting all snapshots for sandbox '{sandbox.sandbox_id}'...")
delete_result = sandbox.snapshots.delete_all(
delete_by="labels", filter_value=grouping_labels

The argument filter_value is incorrect here. The delete_all method signature in snapshot_engine.py expects the parameter to be named label_value. Using filter_value will result in a TypeError.

        error_code=SNAPSHOT_ERROR_CODE,
    )
    snapshots_to_delete = [s.snapshot_uid for s in snapshots_result.snapshots]
elif labels:

If labels is an empty dictionary {}, elif labels: evaluates to False. Thus, calling delete_all(delete_by='labels', label_value={}) completely skips this block and returns success without attempting deletion. Consider elif labels is not None:.
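The distinction can be seen in a short sketch; the function name is illustrative, not the SDK's:

```python
def build_selector(labels):
    # Explicit None check: an empty dict {} is a provided value (an empty,
    # match-all selector) and must still enter the deletion branch; a
    # truthiness check (`elif labels:`) would silently skip it.
    if labels is not None:
        return ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return None

build_selector({"tenant-id": "t"})  # -> "tenant-id=t"
build_selector({})                  # -> "" (match-all)
build_selector(None)                # -> None (branch skipped)
```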

"""Waits for the PodSnapshot to be deleted from the cluster."""
# Check if already deleted
try:
k8s_helper.custom_objects_api.get_namespaced_custom_object(

The get_namespaced_custom_object call checks if the object already doesn't exist. However, the returned object is discarded. You should extract the resourceVersion from the returned object and pass it into the subsequent watch stream. Without this, if the snapshot is deleted between the get call and the stream starting, the stream will miss the DELETED event and hang until timeout.


Labels

area:python-client cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.


9 participants