
Add PodSnapshot extension to Python client#249

Closed
shrutiyam-glitch wants to merge 12 commits into kubernetes-sigs:main from shrutiyam-glitch:pss-sdk

Conversation

@shrutiyam-glitch
Contributor

@shrutiyam-glitch shrutiyam-glitch commented Jan 21, 2026

Depends on: #339 #338 #337

This PR introduces the PodSnapshotSandboxClient extension to the agentic-sandbox-client Python SDK. This extension allows users to manage Pod Snapshots within the Agentic Sandbox environment, enabling stateful "checkpoint and restore" workflows.

Key changes:

  • PodSnapshotSandboxClient Class: A specialized client that extends SandboxClient to handle snapshot-specific operations.
  • Checkpointing: Implemented the snapshot(trigger_name) method, which creates a PodSnapshotManualTrigger (PSMT) and waits for the controller to process it.
  • Controller Readiness: Added snapshot_controller_ready() to detect GKE-managed (gke-managed-pod-snapshots) pod snapshot controllers.
  • Cleanup: The client's __exit__ method now cleans up the PSMT resources and the sandbox.
  • Refactoring: Extracted API group and version constants into a dedicated constants.py file to improve maintainability.
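
The "create a trigger, then wait for the controller to process it" step described under Checkpointing can be sketched roughly as below. This is an illustration only, not the PR's implementation: unique_trigger_name, wait_until, fake_fetch, and the "processed" status field are all hypothetical names, and the 8-character suffix is merely inferred from trigger names such as test-snapshot-10-890451fd in the test output.

```python
import time
import uuid
from typing import Any, Callable


def unique_trigger_name(trigger_name: str) -> str:
    """Append a short random suffix so repeated triggers do not collide.

    Assumption: the real client may derive its suffix differently.
    """
    return f"{trigger_name}-{uuid.uuid4().hex[:8]}"


def wait_until(
    fetch: Callable[[], dict[str, Any]],
    is_done: Callable[[dict[str, Any]], bool],
    timeout: float = 60.0,
    interval: float = 1.0,
) -> dict[str, Any]:
    """Poll fetch() until is_done(obj) is true or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        obj = fetch()
        if is_done(obj):
            return obj
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not processed within {timeout} seconds")
        time.sleep(interval)


# Stand-in for reading the trigger from the cluster. "processed" is a
# hypothetical status field that flips to True on the third read.
reads = {"count": 0}


def fake_fetch() -> dict[str, Any]:
    reads["count"] += 1
    return {"status": {"processed": reads["count"] >= 3}}


result = wait_until(
    fake_fetch,
    lambda obj: obj["status"]["processed"],
    timeout=5,
    interval=0.01,
)
print(unique_trigger_name("test-snapshot-10"), result["status"]["processed"])
```

Passing the fetch and completion checks in as callables keeps the polling loop testable without a cluster; a real client would presumably read the trigger object through the Kubernetes API instead.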

Testing Done:

  • Integration Test: Added test_podsnapshot_extension.py which verifies the full E2E flow:
    -- Starts a sandbox with a counter application.
    -- Creates two sequential snapshots (test-snapshot-10 at 10 seconds and test-snapshot-20 at 20 seconds).
    -- Restores a new sandbox from the most recent snapshot.
    -- Verifies that the pod has been restored from the most recent snapshot.

  • Unit tests are added

Output:

python3 test_podsnapshot_extension.py --labels app=agent-sandbox-workload --template-name python-counter-template --namespace sandbox-test
--- Starting Sandbox Client Test (Namespace: sandbox-test, Port: 8888) ---

***** Phase 1: Starting Counter *****

======= Testing Pod Snapshot Extension =======
Creating first pod snapshot 'test-snapshot-10' after 10 seconds...
Trigger Name: test-snapshot-10-890451fd
Snapshot UID: 8a72cd17-7e70-4c3c-beb2-f62ae642faed
Success: True
Error Code: 0
Error Reason: 

Creating second pod snapshot 'test-snapshot-20' after 10 seconds...
Trigger Name: test-snapshot-20-ce4fe530
Snapshot UID: a15ee9e1-5939-4110-a1e9-ce4828874734
Success: True
Error Code: 0
Error Reason: 
Recent snapshot UID: a15ee9e1-5939-4110-a1e9-ce4828874734

***** Phase 2: Restoring from most recent snapshot & Verifying *****

Waiting 5 seconds for restored pod to resume printing...
Pod was restored from the most recent snapshot.
--- Pod Snapshot Test Passed! ---

--- Sandbox Client Test Finished ---

Prerequisites:

  • Requires the Pod Snapshot Controller, its CRDs (PodSnapshotStorageConfig, PodSnapshotPolicy), and a SandboxTemplate to be installed and defined in the cluster.

Note: Follow-up PRs will handle the following:
-- SnapshotPersistenceManager, list_snapshots, delete_snapshots - #312
-- Restoring from a dedicated snapshot, interactive-mode restoring - TBD

@netlify

netlify bot commented Jan 21, 2026

Deploy Preview for agent-sandbox ready!

Name Link
🔨 Latest commit 24cdef5
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/69966d14d2c1de0008d9a0d4
😎 Deploy Preview https://deploy-preview-249--agent-sandbox.netlify.app

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 21, 2026
@k8s-ci-robot
Contributor

Hi @shrutiyam-glitch. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@shrutiyam-glitch shrutiyam-glitch marked this pull request as draft January 21, 2026 16:44
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 21, 2026
@janetkuo
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 27, 2026
self.controller_ready = False
return self.controller_ready

def checkpoint(self, trigger_name: str) -> ExecutionResult:
Contributor

I recommend not using the standard "run" function that exists in the client, but overriding it with a specific restore implementation:
Add the label + UUID as function parameters.
If the parameter is empty, read the snapshot.json record and display the existing template snapshots to the user: “would you like to start from this checkpoint?”

@shrutiyam-glitch shrutiyam-glitch marked this pull request as ready for review February 5, 2026 16:35
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 5, 2026
if condition.type == "PodRestored" and condition.status == "True":
# Attempt to extract UUID from the message
# Message format: "pod successfully restored from pod snapshot namespace/uuid"
if condition.message:
Member

Thinking more about this: is there any other indication that is less brittle than condition.message? For example, an annotation set by the pod snapshot controller on the restored pod, or something similar? Or checking the UUID from the PSP object and then looking for this specific UUID in condition.message?

Contributor Author

This is a sample restored pod yaml - https://paste.googleplex.com/5754659412246528 - line 78 mentions the snapshot UUID.
There are no annotations available as of now.

checking the UUID from the PSP object and then looking for this specific UUID in the condition.message

So, do you think we can make it is_restored_from(s_uid), and check whether the pod was restored from this snapshot?

Member

Yeah, I think this is a more robust approach: look for that UUID in this message string: pod successfully restored from pod snapshot sandbox-test/244589a9-f094-46ec-8dfe-78f1ab0b1cfa. Thanks!
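
The check agreed on in this thread could look roughly like the sketch below. It is a minimal illustration, assuming the pod's conditions are available as simple objects: PodCondition is a hypothetical stand-in for the Kubernetes condition type, and is_restored_from is written here as a free function rather than the client method the PR would define.

```python
from dataclasses import dataclass


@dataclass
class PodCondition:
    """Minimal stand-in for a Kubernetes pod status condition (hypothetical)."""

    type: str
    status: str
    message: str = ""


def is_restored_from(conditions: list[PodCondition], snapshot_uid: str) -> bool:
    """Return True if a PodRestored=True condition mentions snapshot_uid."""
    if not snapshot_uid:
        return False
    for condition in conditions:
        if condition.type != "PodRestored" or condition.status != "True":
            continue
        # Message format quoted in the thread:
        # "pod successfully restored from pod snapshot namespace/uuid"
        if condition.message and snapshot_uid in condition.message:
            return True
    return False


conditions = [
    PodCondition("Ready", "True"),
    PodCondition(
        "PodRestored",
        "True",
        "pod successfully restored from pod snapshot "
        "sandbox-test/244589a9-f094-46ec-8dfe-78f1ab0b1cfa",
    ),
]
print(is_restored_from(conditions, "244589a9-f094-46ec-8dfe-78f1ab0b1cfa"))  # True
print(is_restored_from(conditions, "a15ee9e1-5939-4110-a1e9-ce4828874734"))  # False
```

Searching for a known UID avoids parsing the message layout, so the check keeps working if the surrounding wording changes, as long as the UID still appears in the message.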

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 18, 2026
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 19, 2026
Cleans up the PodSnapshotManualTrigger Resources.
Automatically cleans up the Sandbox.

TODO: Add cleanup for PodSnapshot resources.
Contributor

Are we planning to do this? How will the user list snapshots otherwise?

Contributor Author

Deletion will be done by users via delete_snapshots() (#312).
I just added a TODO to let reviewers know that there will be a way to clean that up as well.

)

if not self.pod_name:
logger.warning("Cannot check restore status: pod_name is unknown.")
Contributor

Can we hit this condition? If so, how?

super().__enter__()
return self

def _parse_snapshot_result(self, obj) -> SnapshotResult | None:
Contributor

Nit: returning None isn't ideal. I would prefer throwing an error.

@SHRUTI6991
Contributor

Overall lgtm, minor comments. Please split the work into smaller CLs in follow-ups. It is harder to go through all the files in one go.

super().__enter__()
return self

def _parse_snapshot_result(self, obj) -> SnapshotResult | None:
Member

@vicentefb vicentefb Feb 23, 2026

nit non blocking: you correctly type the return as SnapshotResult | None, but obj is untyped. You might want to type it as dict[str, Any] to make the linter happier
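
The two review notes on _parse_snapshot_result (raise instead of returning None, and type obj as dict[str, Any]) might be combined along the lines of the sketch below. Everything here is illustrative: SnapshotParseError, the trimmed-down SnapshotResult, and the status.snapshotUID field path are hypothetical and not taken from the PR or the controller's API.

```python
from dataclasses import dataclass
from typing import Any


class SnapshotParseError(ValueError):
    """Raised when a trigger object lacks the fields needed to build a result."""


@dataclass
class SnapshotResult:
    """Reduced, hypothetical version of the result type discussed in the PR."""

    trigger_name: str
    snapshot_uid: str


def parse_snapshot_result(obj: dict[str, Any]) -> SnapshotResult:
    """Build a SnapshotResult or raise; never return None."""
    name = obj.get("metadata", {}).get("name")
    # Assumption: "status.snapshotUID" is a made-up field path for illustration.
    uid = obj.get("status", {}).get("snapshotUID")
    if not name or not uid:
        raise SnapshotParseError(
            f"trigger object is missing name or snapshot UID: {obj!r}"
        )
    return SnapshotResult(trigger_name=name, snapshot_uid=uid)


print(
    parse_snapshot_result(
        {
            "metadata": {"name": "test-snapshot-10-890451fd"},
            "status": {"snapshotUID": "8a72cd17-7e70-4c3c-beb2-f62ae642faed"},
        }
    )
)
```

Raising a dedicated exception lets callers distinguish "not parseable" from other failures without checking for None at every call site.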

trigger_name=trigger_name,
snapshot_uid=None,
error_reason="Snapshot controller is not ready. Ensure it is installed and running.",
error_code=1,
Member

What does error_code=1 mean?

print("\n======= Testing Pod Snapshot Extension =======")
assert sandbox.controller_ready == True, "Sandbox controller is not ready."

time.sleep(wait_time)
Member

super nit, FYI, non-blocking: the main function is defined as an async def and run using asyncio.run(), but inside the function you are using the blocking, synchronous time.sleep(wait_time). This won't break anything here since there are no other concurrent tasks running, but it is generally considered an anti-pattern. You could change time.sleep(5) to await asyncio.sleep(5).

Is there a specific reason why this method is async?
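
To make the difference concrete, here is a small self-contained sketch (not code from the PR; ticker and run_with_wait are hypothetical names). A background task tries to tick while the main coroutine waits. With time.sleep the event loop is blocked and no ticks are recorded; with await asyncio.sleep the task gets to run.

```python
import asyncio
import time


async def ticker(ticks: list[float], interval: float, count: int) -> None:
    """Background task that records a timestamp every `interval` seconds."""
    for _ in range(count):
        await asyncio.sleep(interval)
        ticks.append(time.monotonic())


async def run_with_wait(wait_time: float, blocking: bool) -> int:
    """Wait `wait_time` seconds and report how many ticks happened meanwhile."""
    ticks: list[float] = []
    task = asyncio.create_task(ticker(ticks, interval=0.01, count=5))
    if blocking:
        time.sleep(wait_time)  # blocks the event loop; ticker cannot run
    else:
        await asyncio.sleep(wait_time)  # yields to the event loop; ticker runs
    seen = len(ticks)
    task.cancel()  # no-op if the ticker already finished
    return seen


print("blocking:", asyncio.run(run_with_wait(0.2, blocking=True)))
print("non-blocking:", asyncio.run(run_with_wait(0.2, blocking=False)))
```

In a test with no concurrent tasks the two behave the same, which matches the reviewer's point that the change is stylistic here.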

@@ -0,0 +1,37 @@
# Copyright 2025 The Kubernetes Authors.
Member

nit: 2026

Member

@vicentefb vicentefb left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 23, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shrutiyam-glitch, vicentefb
Once this PR has been reviewed and has the lgtm label, please assign barney-s for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vicentefb
Member

/assign @janetkuo

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 24, 2026
@k8s-ci-robot
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@shrutiyam-glitch shrutiyam-glitch marked this pull request as draft March 5, 2026 19:06
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 5, 2026

Labels

area:python-client
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants