@@ -26,3 +26,10 @@
SANDBOX_PLURAL_NAME = "sandboxes"

POD_NAME_ANNOTATION = "agents.x-k8s.io/pod-name"

PODSNAPSHOT_NAMESPACE_MANAGED = "gke-managed-pod-snapshots"
PODSNAPSHOT_AGENT = "pod-snapshot-agent"

PODSNAPSHOT_API_GROUP = "podsnapshot.gke.io"
PODSNAPSHOT_API_VERSION = "v1alpha1"
PODSNAPSHOT_API_KIND = "PodSnapshotManualTrigger"
@@ -0,0 +1,15 @@
# Copyright 2026 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from .podsnapshot_client import PodSnapshotSandboxClient
@@ -0,0 +1,63 @@
# Agentic Sandbox Pod Snapshot Extension

This directory contains the Python client extension for interacting with the Agentic Sandbox to manage Pod Snapshots. This extension allows you to trigger snapshots of a running sandbox and restore a new sandbox from the recently created snapshot.

## `podsnapshot_client.py`

This file defines the `PodSnapshotSandboxClient` class, which extends the base `SandboxClient` to provide snapshot capabilities.

### `PodSnapshotSandboxClient`

A specialized Sandbox client for interacting with the GKE pod snapshot controller.

### Key Features:

* **`PodSnapshotSandboxClient(template_name: str, podsnapshot_timeout: int = 180, server_port: int = 8080, ...)`**:
    * Initializes the client with an optional pod snapshot timeout and server port.
* **`snapshot_controller_ready(self) -> bool`**:
    * Checks whether the GKE-managed snapshot agent is running and ready.
* **`__exit__(self, exc_type, exc_val, exc_tb)`**:
    * Cleans up the `SandboxClaim` resources.

## `test_podsnapshot_extension.py`

This file, located in the parent directory (`clients/python/agentic-sandbox-client/`), contains an integration test script for the `PodSnapshotSandboxClient` extension. It verifies the snapshot and restore functionality.

### Test Phases:

1. **Phase 1: Starting Counter Sandbox**:
    * Starts a sandbox with a counter application.

### Prerequisites

1. **Python Virtual Environment**:
    ```bash
    python3 -m venv .venv
    source .venv/bin/activate
    ```

2. **Install Dependencies**:
    ```bash
    pip install kubernetes
    pip install -e clients/python/agentic-sandbox-client/
    ```

3. **Pod Snapshot Controller**: The Pod Snapshot controller must be installed in a **GKE standard cluster** running with **gVisor**.
    * For detailed setup instructions, refer to the [GKE Pod Snapshots public documentation](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/pod-snapshots).
    * Ensure a GCS bucket is configured to store the pod snapshot states and that the necessary IAM permissions are applied.

4. **CRDs**: The `PodSnapshotStorageConfig` and `PodSnapshotPolicy` CRDs must be applied. `PodSnapshotPolicy` should specify the selector match labels.

5. **Sandbox Template**: A `SandboxTemplate` (e.g., `python-counter-template`) with the gVisor runtime, an appropriate Kubernetes service account (KSA), and a label matching the selector in `PodSnapshotPolicy` must be available in the cluster.

### Running Tests:

To run the integration test, execute the script with the appropriate arguments:

```bash
python3 clients/python/agentic-sandbox-client/test_podsnapshot_extension.py \
--template-name python-counter-template \
--namespace sandbox-test
```

Adjust `--namespace` and `--template-name` as needed for your environment.
@@ -0,0 +1,122 @@
# Copyright 2026 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging
from kubernetes import client
from kubernetes.client import ApiException
from ..sandbox_client import SandboxClient
from ..constants import (
    PODSNAPSHOT_NAMESPACE_MANAGED,
    PODSNAPSHOT_AGENT,
    PODSNAPSHOT_API_GROUP,
    PODSNAPSHOT_API_VERSION,
    PODSNAPSHOT_API_KIND,
)

logger = logging.getLogger(__name__)


class PodSnapshotSandboxClient(SandboxClient):
    """
    A specialized Sandbox client for interacting with the GKE pod snapshot controller.
Member

Is the client only checking whether the controller is ready? This is different from the description in the PR: "to support manual snapshot triggering via the GKE pod snapshot controller."

I'd expect the client to modify Snapshot CRs, instead of checking the Snapshot controller itself.

Contributor Author

I have defined the snapshot method here: https://github.com/kubernetes-sigs/agent-sandbox/pull/339/changes#diff-6535038b29a40cde2f558dd8bf85e28a67c1eee796fe718c04338884af9bddecR203.
The method first checks whether the snapshot controller is ready, as an initialization check, before creating any snapshots.
The other methods to be added are list and delete; I had to split the logic across multiple PRs.

In the PR description, I only meant to give an overview of the purpose of the class. I will update it.
Thanks.

    Currently supports manual triggering via PodSnapshotManualTrigger.
    """

    def __init__(
        self,
        template_name: str,
        podsnapshot_timeout: int = 180,
        server_port: int = 8080,
        **kwargs,
    ):
        super().__init__(template_name, server_port=server_port, **kwargs)

        self.controller_ready = False
        self.podsnapshot_timeout = podsnapshot_timeout
        self.core_v1_api = client.CoreV1Api()

    def __enter__(self) -> "PodSnapshotSandboxClient":
        try:
            self.controller_ready = self.snapshot_controller_ready()
            super().__enter__()
            return self
        except Exception as e:
            self.__exit__(None, None, None)
            raise RuntimeError(
                f"Failed to initialize PodSnapshotSandboxClient. Ensure that you are connected to a GKE cluster "
                f"with the Pod Snapshot Controller enabled. Error details: {e}"
            ) from e

    def snapshot_controller_ready(self) -> bool:
        """
        Checks if the snapshot agent pods are running in a GKE-managed pod snapshot cluster.
        Falls back to checking CRD existence if pod listing is forbidden.
Member

It's more robust to check for CRD presence given that controller pods are implementation details and subject to change over time. Also, with strict RBAC, listing pods might be restricted.

Isn't just checking CRD enough?

Contributor

The lack of a CRD means a controller is definitely not running, but a CRD's presence doesn't mean that a controller is running.

Since GKE is a managed control plane, the SDK won't have the necessary RBAC permissions to view any controller pods.

As an alternative, how about designing the snapshot creation method to attempt to create the PodSnapshotManualTrigger Custom Resource? Then catch any 404 Not Found error from the API server if the CRD isn't installed. To handle cases where the CRD is present but the controller isn't ready, we can use exponential backoff when creating the resource or when polling its status.

Member

If the CRD exists, creating the CR should still succeed, even if the controller isn't ready. The controller should be able to handle the CR as soon as it becomes ready.

I suggest we first verify that the CRDs exist, and then create the PodSnapshotManualTrigger CR in the snapshot creation method. We can then get the result by reading the `.status` field of the PodSnapshotManualTrigger.

This way, the SDK/client only interacts with the API, not the controller implementation details.

Contributor Author

Makes sense. Updated the method to just check for the CRDs installed.
Thanks.
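
The flow agreed on in this thread (create the `PodSnapshotManualTrigger` CR, treat a 404 as "CRD not installed", then read `.status` with exponential backoff) could be sketched roughly as below. This is an illustration under assumptions, not the PR's implementation: the plural `podsnapshotmanualtriggers`, the helper names `create_trigger` and `wait_for_status`, and the caller-supplied `is_done` predicate are all hypothetical, and the `.status` schema is not shown in this thread. `api` is expected to behave like `kubernetes.client.CustomObjectsApi`.

```python
import time
from typing import Any, Callable, Dict

GROUP = "podsnapshot.gke.io"
VERSION = "v1alpha1"
KIND = "PodSnapshotManualTrigger"
PLURAL = "podsnapshotmanualtriggers"  # assumption: conventional lowercase plural


def create_trigger(api: Any, namespace: str, body: Dict[str, Any]) -> Dict[str, Any]:
    """Create the CR; a 404 from the API server is taken to mean the CRD is missing."""
    try:
        return api.create_namespaced_custom_object(
            group=GROUP, version=VERSION, namespace=namespace, plural=PLURAL, body=body
        )
    except Exception as exc:
        # kubernetes.client.ApiException exposes the HTTP code as `.status`.
        if getattr(exc, "status", None) == 404:
            raise RuntimeError(f"{KIND} CRD is not installed") from exc
        raise


def wait_for_status(
    api: Any,
    namespace: str,
    name: str,
    is_done: Callable[[Dict[str, Any]], bool],
    timeout: float = 180.0,
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll the CR's .status, doubling the delay each round, until is_done(status)."""
    delay = initial_delay
    waited = 0.0  # elapsed time approximated by the sum of the sleeps
    while True:
        obj = api.get_namespaced_custom_object(
            group=GROUP, version=VERSION, namespace=namespace, plural=PLURAL, name=name
        )
        status = obj.get("status") or {}
        if is_done(status):
            return status
        if waited >= timeout:
            raise TimeoutError(f"{KIND} {namespace}/{name} not done after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)
```

With this shape the client talks only to the API server, so a controller that is not yet ready shows up as a timeout on `.status` rather than as a failed pod lookup.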

        """

        if self.controller_ready:
            return True

        def check_crd_installed() -> bool:
            try:
                # Check directly if the API resource exists using CustomObjectsApi
                resource_list = self.custom_objects_api.get_api_resources(
                    group=PODSNAPSHOT_API_GROUP,
                    version=PODSNAPSHOT_API_VERSION,
                )

                if not resource_list or not resource_list.resources:
                    return False

                for resource in resource_list.resources:
                    if resource.kind == PODSNAPSHOT_API_KIND:
                        return True
                return False
            except ApiException as e:
                # If discovery fails with 403/404, we assume not ready/accessible
                if e.status == 403 or e.status == 404:
                    return False
                raise

        def check_pod_running(namespace: str, pod_name_substring: str) -> bool:
            try:
                pods = self.core_v1_api.list_namespaced_pod(namespace)
                for pod in pods.items:
                    if (
                        pod.status.phase == "Running"
                        and pod_name_substring in pod.metadata.name
Member

It's not reliable / less efficient to match pods by name substring. Would you use a label selector instead?
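
For illustration, a label-selector variant of the pod check might look like the sketch below. The helper name `agent_pod_running` and the example label `app=pod-snapshot-agent` are hypothetical; the labels actually carried by the GKE-managed agent pods are not stated in this PR. `core_v1_api` is expected to behave like `kubernetes.client.CoreV1Api`.

```python
from typing import Any


def agent_pod_running(core_v1_api: Any, namespace: str, label_selector: str) -> bool:
    """Return True if at least one Running pod matches the label selector.

    Filtering happens on the API server, so only matching pods are returned
    instead of every pod in the namespace.
    """
    pods = core_v1_api.list_namespaced_pod(
        namespace,
        label_selector=label_selector,           # e.g. "app=pod-snapshot-agent" (hypothetical)
        field_selector="status.phase=Running",   # server-side phase filter
    )
    return len(pods.items) > 0
```

A call such as `agent_pod_running(self.core_v1_api, PODSNAPSHOT_NAMESPACE_MANAGED, "app=pod-snapshot-agent")` would then replace the substring match on `pod.metadata.name`.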

                    ):
                        return True
                return False
            except ApiException as e:
                if e.status == 403:
                    logger.info(
                        f"Permission denied listing pods in {namespace}. Checking CRD existence."
                    )
                    return check_crd_installed()
                # If discovery fails with 404, we assume not ready/accessible
                if e.status == 404:
                    return False
                raise

        # Check managed: requires only agent in gke-managed-pod-snapshots
        if check_pod_running(PODSNAPSHOT_NAMESPACE_MANAGED, PODSNAPSHOT_AGENT):
            return True

        return False

    def __exit__(self, exc_type, exc_val, exc_tb):
        """
        Automatically cleans up the Sandbox.
        """
        super().__exit__(exc_type, exc_val, exc_tb)