Agent Notes

Mandatory Process

Update this file at the end of every significant session in this workspace.
Record decisions, commands or access details that materially changed the outcome.
Record repository state, pushed branches, opened PRs, merges, releases, blockers, and next steps.

Access And Release Notes

Official Apache image publication for apache/openserverless-operator is triggered by pushing a release tag to the Apache repository.
The tag format in use is 0.1.0-incubating.<yymmddHHMM>.
The user reported that the official image publish succeeded only after:
- configuring ~/.ssh/config
- running eval $(ssh-agent)
- then pushing tags with:
  - git push upstream --tags
Commit metadata can use [email protected], but GitHub repository permissions still depend on the authenticated account and SSH key mapping.

Session Log

Repository analysis and documentation

Analyzed the testing flow across:
- openserverless-testing
- openserverless-operator
- openserverless-task
Created and updated:
- gap.md
- workflow.md
Clarified that official operator publication is handled by openserverless-operator/.github/workflows/image.yml.
Verified that official publication is triggered by pushing a numeric release tag, not automatically by PR success.

PR-triggered testing implementation

Implemented the initial PR testing flow across the three repositories.
Added repository-dispatch based PR testing workflows in openserverless-testing.
Added PR trigger workflows in openserverless-operator and openserverless-task.
Verified a real end-to-end PR trigger on nuvolaris/openserverless-operator with label kind-amd.

OPS branch pinning

Explicitly pinned OPS_BRANCH=main in the GitHub testing flow.

Contract realignment to `<test>-<hash>`

Updated analysis and workflow documentation to reflect the newer contract:
- <test>-<hash>
Implemented the new contract on feature branches:
- openserverless-testing: feat/test-tag-hash-contract
- openserverless-operator: feat/test-tag-hash-contract
- openserverless-task: feat/test-tag-hash-contract
Added clearer logging to tests/run-gh-suite.sh so GitHub Actions output shows step boundaries and resolved test context.

Official Apache PRs for operator fixes

Created Apache PR #93 for the kube-rbac-proxy registry fix:
- https://github.com/apache/openserverless-operator/pull/93
Apache PR #93 was merged.
Created Apache PR #94 for the Traefik API version fix:
- https://github.com/apache/openserverless-operator/pull/94

Workspace branches and pushed commits

openserverless-operator
- branch: feat/test-tag-hash-contract
- commits:
  - 5ea4faf Update Traefik API version
  - ca6f9a5 Align testing trigger with <test>-<hash> tags
openserverless-task
- branch: feat/test-tag-hash-contract
- commit:
  - 812c19f Align testing trigger with <test>-<hash> tags
openserverless-testing
- branch: feat/test-tag-hash-contract
- commit:
  - 80f9252 Align testing workflows with <test>-<hash> tags

Official image publication

Verified Apache PR #93 merge commit:
- eea534a35a6d1334644b6c11fc6cbbd1322d4ec9
Verified the post-merge openserverless-operator-check run on apache/main completed successfully.
Prepared local release tag:
- 0.1.0-incubating.2603190853
Initial attempts to push the tag from the agent failed due to GitHub permission checks for account miki3421.
The user later confirmed they successfully pushed and generated the Docker image by using SSH config plus ssh-agent.

Targeted k3s test for Traefik CRD/API change

Created a clean PR on nuvolaris/openserverless-operator with only the Traefik API version change:
- https://github.com/nuvolaris/openserverless-operator/pull/5
Branch used:
- test/k3s-traefik-crd
Commit on that branch:
- 94c8124ab498f3fdca3849b83d94585cd0d0e205
Added label:
- k3s-amd
This triggered:
- nuvolaris/openserverless-operator run 23289380340 (Trigger Testing) -> success
- nuvolaris/openserverless-testing run 23289385705 (Operator PR #5 on k3s-amd) -> failure
Failure analysis:
- the dispatch path worked
- the temporary operator image build/push worked
- the failure happened inside Run GitHub Test Suite, not during PR image build
- the failing path is the k3s server/login initialization sequence
- the run stalls on waiting for completing system initialization
- it then ends with:
  - ops: Failed to run task "login": exit status 1
  - ops: Failed to run task "server": exit status 1
Baseline comparison:
- the previous k3s-amd operator test run 23142045515 failed with the same pattern
- this strongly suggests the current failure is pre-existing in the k3s environment/path and is not yet evidence of a regression caused by the Traefik change

Current Known Status

Apache PR #93 is merged and its main-branch check passed.
Apache PR #94 is open.
The <test>-<hash> contract is implemented on feature branches but not yet merged to the main branches of all repos.
A local file openserverless-testing.code-workspace exists in openserverless-testing and has intentionally not been committed.

Next Step Requested

Investigate the pre-existing k3s-amd initialization/login failure in openserverless-testing to separate environment issues from application regressions.

2026-03-19 Additional Operations

1Password and SSH access to the k3s AMD server

Confirmed that the GitHub workflow uses the 1Password secret op://OpenServerless/TESTING/ID_RSA_B64 as the private SSH key source for test runs.
Confirmed that the workflow materializes that secret into ~/.ssh/id_rsa during tests/1-deploy.sh.
The user needed local access to the same key to debug the remote environment manually.
Important shell detail discovered:
- ~/.zshrc was not sufficient for the non-interactive login shells used by the agent tools.
- putting export OP_SERVICE_ACCOUNT_TOKEN=... in ~/.zprofile made the token visible to login shells.
A failed intermediate extraction left an empty ~/.ssh/testing_k3s_id_rsa; the root cause was an invalid or misformatted service-account token.
The user then corrected the token setup and successfully extracted the SSH private key locally outside the agent flow.
The user confirmed that direct SSH to the server works.

Reinstall of k3s on testing1-k3s-amd.nuvolaris.dev

Host targeted:
- testing1-k3s-amd.nuvolaris.dev
Access path:
- ssh -i ~/.ssh/testing_k3s_id_rsa [email protected]
Initial state before reinstall:
- host name: testing
- OS: Ubuntu 24.04.4 LTS
- kernel: 6.8.0-71-generic
- installed k3s version: v1.27.7-rc1+k3s2
- k3s.service was active and running
- /usr/local/bin/k3s-uninstall.sh was present
Reinstall action performed:
- ran /usr/local/bin/k3s-uninstall.sh
- removed common residual directories:
  - /etc/rancher/k3s
  - /var/lib/rancher/k3s
  - /var/lib/kubelet
  - /var/lib/cni
  - /etc/cni/net.d
  - /run/k3s
  - /run/flannel
  - /var/lib/containerd
- reinstalled via the official install script using the stable channel:
  - curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=stable sh -
Stable release selected by the installer:
- v1.34.5+k3s1
Final verified state after reinstall:
- k3s version v1.34.5+k3s1
- node name: testing
- node status: Ready
- node role: control-plane
- container runtime: containerd://2.1.5-k3s1

Notes

A direct request to https://update.k3s.io/v1-release/channels/stable from the host returned a GitHub HTML page in this environment, but the official get.k3s.io installer correctly resolved the stable channel and installed v1.34.5+k3s1.
This server refresh was done specifically to remove a very old pre-release k3s build from the k3s-amd test environment before retrying the failing operator test path.

2026-03-19 Autossh and K3s API Host Follow-up

Goal

Allow the k3s test path to keep using a public OpenServerless API host such as testing.nuvolaris.dev while reaching the actual machine over SSH and exposing the Kubernetes API through a local tunnel.

Code changes implemented

In openserverless-task and in the tasks submodule checked out inside openserverless-testing:
- added optional tunnel support to cloud/k3s/opsfile.yml
- when K3S_AUTOSSH=1, the task now opens a local tunnel to the remote k3s apiserver and rewrites the kubeconfig server to https://127.0.0.1:<local-port>
- if autossh is unavailable, the task falls back to a plain ssh -L tunnel with a warning
- removed the hardcoded old k3s release pin so installs no longer force v1.27.7-rc1+k3s2
In openserverless-testing/tests/1-deploy.sh:
- separated K3S_AMD_APIHOST from the SSH host used to reach the machine
- for the specific testing.nuvolaris.dev case, the script now maps SSH to testing1-k3s-amd.nuvolaris.dev
- if the SSH host and API host differ, K3S_AUTOSSH=1 is enabled automatically
- equivalent optional support was added for the ARM path
In Linux GitHub workflows:
- added an explicit autossh install step before running the suite

Verification done

bash -n on tests/1-deploy.sh
YAML parse of:
- operator-pr-test.yaml
- task-pr-test.yaml
- tests.yaml
- tasks/cloud/k3s/opsfile.yml
Real smoke test against the refreshed server:
- generated kubeconfig via the new tunnelized k3s task
- verified the kubeconfig points to https://127.0.0.1:17443
- verified kubectl get nodes succeeds and the node is Ready

Remaining manual operational step

The agent still could not write 1Password secrets directly because op was not authenticated in the agent's non-interactive environment.
The intended value to set is:
- OpenServerless / TESTING / K3S_AMD_APIHOST = testing1-k3s-amd.nuvolaris.dev

2026-03-19 Runtime Job Image Propagation Fix

Problem found on the remote `k3s-amd` run

The Operator PR Test flow in nuvolaris/openserverless-testing was correctly:
- building a temporary PR image
- pushing it to GHCR
- patching _operator/olaris/opsroot.json
Even with that patch, the remote k3s-amd deployment still failed later in couchdb-init.
The pod log on the remote cluster showed:
- registry.hub.docker.com/apache/openserverless-operator:0.1.0-testing.2309191654
- ErrImagePull
- ImagePullBackOff
Root cause:
- the operator StatefulSet used the patched IMAGES_OPERATOR image for the controller pod itself
- but the controller container still inherited OPERATOR_IMAGE and OPERATOR_TAG defaults baked into the image Dockerfile
- downstream jobs created by the controller therefore still referenced the stale Apache testing tag

Fix applied on `nuvolaris`

In nuvolaris/openserverless-task:main:
- commit 1232d0c Propagate operator PR image to runtime jobs
- updated setup/kubernetes/opsfile.yml to derive runtime OPERATOR_IMAGE and OPERATOR_TAG from IMAGES_OPERATOR
- updated setup/kubernetes/operator.yaml to inject those values into the operator pod environment
In nuvolaris/openserverless-operator PR branch:
- branch test/k3s-traefik-crd
- commit 648b89a Use PR operator image for runtime jobs
- moved submodule olaris forward to task commit 1232d0c

Operational note

For the nuvolaris work resumed after the Apache release steps, commits and pushes were switched back to the user's account/key:
- SSH key: ~/.ssh/id_ed25519
- Git email restored to [email protected] in the operator PR clone before committing the submodule bump

2026-03-19 Ops Trace Visibility

Goal

Make the GitHub k3s test runs show the ops shell invocations in clear.
Keep a small local trace on the remote k3s server so deploy-time ops setup server ... calls can be correlated with server-side state.

Changes made

In openserverless-testing/tests/run-gh-suite.sh:
- enabled tracing by default with:
  - OPS_TRACE=1
  - K3S_SERVER_TRACE=1
- changed step execution so shell scripts run through bash -x when tracing is enabled
In openserverless-testing/tests/1-deploy.sh:
- added run_logged to print the exact top-level ops ... command before execution
- added append_remote_trace to append timestamped entries to:
  - /var/log/openserverless-testing/ops-trace.log
  - on the remote k3s server
- wired that trace around the deploy-time ops config apihost and ops setup server ... calls for:
  - k3s-amd
  - k3s-arm

Expected effect on the next rerun

GitHub Actions logs will show the shell-expanded test scripts and their ops invocations.
The remote server should accumulate a lightweight trace file at:
- /var/log/openserverless-testing/ops-trace.log

2026-03-19 K3s Test Sequence Correction

Problem found

The k3s path in openserverless-testing/tests/1-deploy.sh was still using:
- ops setup server ... --uninstall
- ops setup server ...
This wrapper sequence was not the one the user validated manually for the remote k3s environment.

Sequence aligned in the test flow

The k3s-amd and k3s-arm branches now use this direct sequence instead:
- ops cloud k3s delete <server> <user>
- ops config apihost <apihost>
- ops cloud k3s create <server> <user>
- ops config slim
- ops setup cluster

Reason for the change

This keeps the GitHub test path aligned with the lower-level ops cloud k3s workflow rather than the higher-level ops setup server wrapper.

2026-03-19 K3s Slim And Full Test Profiles

Goal

Split the current k3s path into two logical suites:
- k3s-amd-slim / k3s-arm-slim
- k3s-amd-full / k3s-arm-full
Keep temporary compatibility with older aliases:
- k3s-amd
- k3s-arm
- these currently resolve to the slim profile

Changes made in `openserverless-testing`

tests/lib/selector.sh
- now accepts test names with embedded hyphens before the final <hash>
- exports TEST_PROFILE
- maps:
  - k3s-amd-slim -> k3s-amd + slim
  - k3s-amd-full -> k3s-amd + full
  - k3s-arm-slim -> k3s-arm + slim
  - k3s-arm-full -> k3s-arm + full
tests/1-deploy.sh
- only runs ops config slim when TEST_PROFILE=slim
tests/run-gh-suite.sh
- skips MinIO-specific and static steps for slim
tests/all.sh
- mirrors the same slim skips for local all-in-one runs
tests/14-runtime-testing.sh
- no longer requires MinIO for slim
- skips JS/Python MinIO runtime assertions in slim

Current limitation

The PR trigger regex in openserverless-operator / openserverless-task still needs the same naming expansion if the new canonical labels with hash are to be used directly on PRs.
The current active k3s-amd label continues to work because the testing repo maps it to the slim profile.

2026-03-19 Traefik RBAC And Operator Login Follow-Up

Traefik RBAC root cause

The Traefik CRD PR on openserverless-operator updated the templates and operator-side RBAC to traefik.io.
The real cluster RBAC applied by ops setup cluster still came from openserverless-task, where:
- setup/kubernetes/roles/operator-roles.yaml
- was still using traefik.containo.us
This mismatch caused the runtime error:
- middlewares.traefik.io ... is forbidden

Traefik RBAC fix

nuvolaris/openserverless-task:main was updated so the cluster setup role now grants access to:
- apiGroups: ["traefik.io"]
The openserverless-operator PR branch was then updated to point its olaris submodule to that fixed task commit.

New failure observed after that

On the remote k3s-amd cluster, the operator pod started crashing with:
- kopf._cogs.structs.credentials.LoginError: Ran out of valid credentials
The failing pod was:
- nuvolaris-operator-0
The image under test was:
- ghcr.io/nuvolaris/openserverless-testing:pr-5-9930992

Authentication fix prepared on the PR branch

In openserverless-operator/nuvolaris/main.py, the @kopf.on.login() handler was hardened for in-cluster execution.
Instead of using only:
- kopf.login_via_pykube(...)
it now tries these handlers in order when a service-account token exists:
- kopf.login_with_service_account(...)
- kopf.login_via_client(...)
- kopf.login_via_pykube(...)
The handler now logs which method returned credentials and raises a clearer error if all of them fail.

Follow-up failure in the runtime suite

After the operator auth fix, the k3s-amd run progressed through deploy, Redis, FerretDB, and Postgres checks.
The next failure was no longer Traefik-related. It came from tests/14-runtime-testing.sh, where:
- ops -wsk project deploy --manifest ${PWD}/test-runtimes/manifest.yaml
- resolved to tests/test-runtimes/manifest.yaml
- even though the real file lives at:
  - test-runtimes/manifest.yaml

Runtime suite fix

tests/14-runtime-testing.sh now derives the repository root from the script location and uses:
- ${REPO_ROOT}/test-runtimes/manifest.yaml
This makes the manifest lookup independent from the shell's current working directory.

2026-03-20 Branch Cleanup

Nuvolaris branches removed

Deleted merged or obsolete remote branches from nuvolaris/openserverless-testing:
- feat/pr-tag-platform-arch-testing
- feat/test-tag-hash-contract
Deleted merged or obsolete remote branches from nuvolaris/openserverless-task:
- feat/pr-tag-platform-arch-testing
- feat/test-tag-hash-contract
Deleted merged or obsolete remote branches from nuvolaris/openserverless-operator:
- feat/pr-tag-platform-arch-testing
- feat/test-tag-hash-contract
- test/kind-amd-kube-rbac-proxy-registry
- test/k3s-traefik-crd

Local cleanup

Removed the local openserverless-operator-nuvolaris-traefik-test worktree.
Deleted the matching local operator branches listed above.
Left Apache-specific release branches and worktrees untouched.

2026-03-20 K3s capacity and operator PR workflow

K3s amd capacity

The testing1-k3s-amd.nuvolaris.dev server was increased to 16 GB RAM after the end-to-end k3s-<hash> run showed delayed Python action scheduling.
The failed python/mongodb activation was not a code regression:
- Kubernetes reported FailedScheduling for wsk0-21-testactionuser-mongodb
- reason: Insufficient memory
- the activation later completed successfully after the pod was finally scheduled

Operator PR workflow policy

nuvolaris/openserverless-operator should no longer run the internal openserverless-operator-check workflow for pull requests.
The intended PR path is now:
- Trigger Testing on the operator PR
- dispatch to nuvolaris/openserverless-testing
- end-to-end validation there
To enforce that, .github/workflows/check.yml in the operator repo was changed to run only on pushes to main, not on pull_request.

Testing repo workflow drift

During the same day, nuvolaris/openserverless-testing:main was changed upstream to remove operator-pr-test.yaml and keep only AllTests on tag pushes.
This broke the operator PR end-to-end path:
- Trigger Testing in nuvolaris/openserverless-operator still sent repository_dispatch
- but no workflow in openserverless-testing was listening for operator-pr-test
The operator-pr-test.yaml workflow was restored on testing:main so operator PR labels can again launch the downstream PR test suite.

2026-03-20 K3s Arm Readiness

First blocker removed

The operator PR image built by operator-pr-test.yaml was still single-arch because it used:
- docker build
- docker push on the default ubuntu-22.04 GitHub runner
That path produced an amd64 image only, which is not suitable for a k3s arm64 target cluster.
The workflow was updated to:
- set up QEMU
- build with docker/build-push-action
- push a multi-arch image for linux/amd64,linux/arm64
This removes the main image-architecture blocker for labels like k3sarm-<sha>.

Remaining operational dependency

k3sarm still requires a reachable ARM host for SSH/K3s provisioning.
If the public API host and the SSH host differ, the workflow path will still need an explicit K3S_ARM_SSH_HOST source; the current patch only removed the image-architecture blocker.

FilesExpand file tree

agent.md

Latest commit

History

agent.md

File metadata and controls

Agent Notes

Mandatory Process

Access And Release Notes

Session Log

Repository analysis and documentation

PR-triggered testing implementation

OPS branch pinning

Contract realignment to <test>-<hash>

Official Apache PRs for operator fixes

Workspace branches and pushed commits

Official image publication

Targeted k3s test for Traefik CRD/API change

Current Known Status

Next Step Requested

2026-03-19 Additional Operations

1Password and SSH access to the k3s AMD server

Reinstall of k3s on testing1-k3s-amd.nuvolaris.dev

Notes

2026-03-19 Autossh and K3s API Host Follow-up

Goal

Code changes implemented

Verification done

Remaining manual operational step

2026-03-19 Runtime Job Image Propagation Fix

Problem found on the remote k3s-amd run

Fix applied on nuvolaris

Operational note

2026-03-19 Ops Trace Visibility

Goal

Changes made

Expected effect on the next rerun

2026-03-19 K3s Test Sequence Correction

Problem found

Sequence aligned in the test flow

Reason for the change

2026-03-19 K3s Slim And Full Test Profiles

Goal

Changes made in openserverless-testing

Current limitation

2026-03-19 Traefik RBAC And Operator Login Follow-Up

Traefik RBAC root cause

Traefik RBAC fix

New failure observed after that

Authentication fix prepared on the PR branch

Follow-up failure in the runtime suite

Runtime suite fix

2026-03-20 Branch Cleanup

Nuvolaris branches removed

Local cleanup

2026-03-20 K3s capacity and operator PR workflow

K3s amd capacity

Operator PR workflow policy

Testing repo workflow drift

2026-03-20 K3s Arm Readiness

First blocker removed

Remaining operational dependency

Contract realignment to `<test>-<hash>`

Problem found on the remote `k3s-amd` run

Fix applied on `nuvolaris`

Changes made in `openserverless-testing`