- Update this file at the end of every significant session in this workspace.
- Record decisions, commands or access details that materially changed the outcome.
- Record repository state, pushed branches, opened PRs, merges, releases, blockers, and next steps.
- Official Apache image publication for `apache/openserverless-operator` is triggered by pushing a release tag to the Apache repository.
- The tag format in use is `0.1.0-incubating.<yymmddHHMM>`.
- The user reported that the official image publish succeeded only after:
  - configuring `~/.ssh/config`
  - running `eval $(ssh-agent)`
  - then pushing tags with: `git push upstream --tags`
- Commit metadata can use `[email protected]`, but GitHub repository permissions still depend on the authenticated account and SSH key mapping.
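
The tag format above can be sketched in a few lines of Python. This is only an illustration: the helper name `release_tag` is hypothetical, and the notes do not say which time zone the timestamp uses, so the caller supplies the time explicitly.

```python
from datetime import datetime


def release_tag(now: datetime, base: str = "0.1.0-incubating") -> str:
    """Build a tag shaped like <base>.<yymmddHHMM>.

    `base` defaults to the prefix recorded in these notes; the time zone of
    `now` is an assumption left to the caller.
    """
    return f"{base}.{now.strftime('%y%m%d%H%M')}"


# 19 March 2026, 08:53 yields the tag recorded later in this log.
print(release_tag(datetime(2026, 3, 19, 8, 53)))
```

The resulting string is what would be passed to `git tag` before `git push upstream --tags`.
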
- Analyzed the testing flow across:
  - `openserverless-testing`
  - `openserverless-operator`
  - `openserverless-task`
- Created and updated:
  - `gap.md`
  - `workflow.md`
- Clarified that official operator publication is handled by `openserverless-operator/.github/workflows/image.yml`.
- Verified that official publication is triggered by pushing a numeric release tag, not automatically by PR success.
- Implemented the initial PR testing flow across the three repositories.
- Added repository-dispatch based PR testing workflows in `openserverless-testing`.
- Added PR trigger workflows in `openserverless-operator` and `openserverless-task`.
- Verified a real end-to-end PR trigger on `nuvolaris/openserverless-operator` with label `kind-amd`.
- Explicitly pinned `OPS_BRANCH=main` in the GitHub testing flow.
- Updated analysis and workflow documentation to reflect the newer contract:
  - `<test>-<hash>`
- Implemented the new contract on feature branches:
  - `openserverless-testing`: `feat/test-tag-hash-contract`
  - `openserverless-operator`: `feat/test-tag-hash-contract`
  - `openserverless-task`: `feat/test-tag-hash-contract`
- Added clearer logging to `tests/run-gh-suite.sh` so GitHub Actions output shows step boundaries and resolved test context.
- Created Apache PR `#93` for the kube-rbac-proxy registry fix:
  - `https://github.com/apache/openserverless-operator/pull/93`
- Apache PR `#93` was merged.
- Created Apache PR `#94` for the Traefik API version fix:
  - `https://github.com/apache/openserverless-operator/pull/94`
- `openserverless-operator`
  - branch: `feat/test-tag-hash-contract`
  - commits:
    - `5ea4faf` `Update Traefik API version`
    - `ca6f9a5` `Align testing trigger with <test>-<hash> tags`
- `openserverless-task`
  - branch: `feat/test-tag-hash-contract`
  - commit:
    - `812c19f` `Align testing trigger with <test>-<hash> tags`
- `openserverless-testing`
  - branch: `feat/test-tag-hash-contract`
  - commit:
    - `80f9252` `Align testing workflows with <test>-<hash> tags`
- Verified Apache PR `#93` merge commit:
  - `eea534a35a6d1334644b6c11fc6cbbd1322d4ec9`
- Verified the post-merge `openserverless-operator-check` run on `apache/main` completed successfully.
- Prepared local release tag:
  - `0.1.0-incubating.2603190853`
- Initial attempts to push the tag from the agent failed due to GitHub permission checks for account `miki3421`.
- The user later confirmed they successfully pushed and generated the Docker image by using SSH config plus `ssh-agent`.
- Created a clean PR on `nuvolaris/openserverless-operator` with only the Traefik API version change:
  - `https://github.com/nuvolaris/openserverless-operator/pull/5`
- Branch used:
  - `test/k3s-traefik-crd`
- Commit on that branch:
  - `94c8124ab498f3fdca3849b83d94585cd0d0e205`
- Added label:
  - `k3s-amd`
- This triggered:
  - `nuvolaris/openserverless-operator` run `23289380340` (`Trigger Testing`) -> success
  - `nuvolaris/openserverless-testing` run `23289385705` (`Operator PR #5 on k3s-amd`) -> failure
- Failure analysis:
  - the dispatch path worked
  - the temporary operator image build/push worked
  - the failure happened inside `Run GitHub Test Suite`, not during PR image build
  - the failing path is the `k3s` server/login initialization sequence
  - the run stalls on `waiting for completing system initialization`
  - it then ends with:
    - `ops: Failed to run task "login": exit status 1`
    - `ops: Failed to run task "server": exit status 1`
- Baseline comparison:
  - the previous `k3s-amd` operator test run `23142045515` failed with the same pattern
  - this strongly suggests the current failure is pre-existing in the `k3s` environment/path and is not yet evidence of a regression caused by the Traefik change
- Apache PR `#93` is merged and its main-branch check passed.
- Apache PR `#94` is open.
- The `<test>-<hash>` contract is implemented on feature branches but not yet merged to the main branches of all repos.
- A local file `openserverless-testing.code-workspace` exists in `openserverless-testing` and has intentionally not been committed.
- Investigate the pre-existing `k3s-amd` initialization/login failure in `openserverless-testing` to separate environment issues from application regressions.
- Confirmed that the GitHub workflow uses the 1Password secret `op://OpenServerless/TESTING/ID_RSA_B64` as the private SSH key source for test runs.
- Confirmed that the workflow materializes that secret into `~/.ssh/id_rsa` during `tests/1-deploy.sh`.
- The user needed local access to the same key to debug the remote environment manually.
- Important shell detail discovered:
  - `~/.zshrc` was not sufficient for the non-interactive login shells used by the agent tools.
  - putting `export OP_SERVICE_ACCOUNT_TOKEN=...` in `~/.zprofile` made the token visible to login shells.
- A failed intermediate extraction left an empty `~/.ssh/testing_k3s_id_rsa`; the root cause was an invalid or misformatted service-account token.
- The user then corrected the token setup and successfully extracted the SSH private key locally outside the agent flow.
- The user confirmed that direct SSH to the server works.
- Host targeted:
  - `testing1-k3s-amd.nuvolaris.dev`
- Access path:
  - `ssh -i ~/.ssh/testing_k3s_id_rsa [email protected]`
- Initial state before reinstall:
  - host name: `testing`
  - OS: `Ubuntu 24.04.4 LTS`
  - kernel: `6.8.0-71-generic`
  - installed k3s version: `v1.27.7-rc1+k3s2`
  - `k3s.service` was active and running
  - `/usr/local/bin/k3s-uninstall.sh` was present
- Reinstall action performed:
  - ran `/usr/local/bin/k3s-uninstall.sh`
  - removed common residual directories:
    - `/etc/rancher/k3s`
    - `/var/lib/rancher/k3s`
    - `/var/lib/kubelet`
    - `/var/lib/cni`
    - `/etc/cni/net.d`
    - `/run/k3s`
    - `/run/flannel`
    - `/var/lib/containerd`
  - reinstalled via the official install script using the stable channel:
    - `curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=stable sh -`
- Stable release selected by the installer:
  - `v1.34.5+k3s1`
- Final verified state after reinstall:
  - `k3s version v1.34.5+k3s1`
  - node name: `testing`
  - node status: `Ready`
  - node role: `control-plane`
  - container runtime: `containerd://2.1.5-k3s1`
- A direct request to `https://update.k3s.io/v1-release/channels/stable` from the host returned a GitHub HTML page in this environment, but the official `get.k3s.io` installer correctly resolved the stable channel and installed `v1.34.5+k3s1`.
- This server refresh was done specifically to remove a very old pre-release `k3s` build from the `k3s-amd` test environment before retrying the failing operator test path.
- Allow the `k3s` test path to keep using a public OpenServerless API host such as `testing.nuvolaris.dev` while reaching the actual machine over SSH and exposing the Kubernetes API through a local tunnel.
- In `openserverless-task` and in the `tasks` submodule checked out inside `openserverless-testing`:
  - added optional tunnel support to `cloud/k3s/opsfile.yml`
  - when `K3S_AUTOSSH=1`, the task now opens a local tunnel to the remote `k3s` apiserver and rewrites the kubeconfig server to `https://127.0.0.1:<local-port>`
  - if `autossh` is unavailable, the task falls back to a plain `ssh -L` tunnel with a warning
  - removed the hardcoded old `k3s` release pin so installs no longer force `v1.27.7-rc1+k3s2`
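
The kubeconfig rewrite described above can be illustrated with a small Python sketch. The real change lives in `cloud/k3s/opsfile.yml` and may be implemented differently; the function name and the sample kubeconfig here are hypothetical, and the sketch assumes every `server:` entry should point at the tunnel.

```python
import re


def point_kubeconfig_at_tunnel(kubeconfig_text: str, local_port: int) -> str:
    """Replace the URL of every `server:` line with the local tunnel endpoint."""
    return re.sub(
        r"(?m)^([ \t]*server:[ \t]*).*$",          # keep indentation and key
        rf"\g<1>https://127.0.0.1:{local_port}",   # swap only the URL
        kubeconfig_text,
    )


# Illustrative kubeconfig fragment, not copied from the real server.
sample = """apiVersion: v1
clusters:
- cluster:
    server: https://203.0.113.10:6443
  name: default
"""

print(point_kubeconfig_at_tunnel(sample, 17443))
```

With port `17443`, the rewritten `server:` line matches the endpoint verified in the smoke test recorded in this log.
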
- In `openserverless-testing/tests/1-deploy.sh`:
  - separated `K3S_AMD_APIHOST` from the SSH host used to reach the machine
  - for the specific `testing.nuvolaris.dev` case, the script now maps SSH to `testing1-k3s-amd.nuvolaris.dev`
  - if the SSH host and API host differ, `K3S_AUTOSSH=1` is enabled automatically
  - equivalent optional support was added for the ARM path
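
The host-splitting rule above is implemented in shell in `tests/1-deploy.sh`; the following Python sketch only mirrors the logic as described. The names `SSH_HOST_OVERRIDES`, `resolve_k3s_target` and the `SSH_HOST` key are hypothetical.

```python
# Known API host -> SSH host mapping taken from these notes.
SSH_HOST_OVERRIDES = {
    "testing.nuvolaris.dev": "testing1-k3s-amd.nuvolaris.dev",
}


def resolve_k3s_target(api_host: str) -> dict:
    """Return the environment the deploy step would need for one API host."""
    ssh_host = SSH_HOST_OVERRIDES.get(api_host, api_host)
    env = {"K3S_AMD_APIHOST": api_host, "SSH_HOST": ssh_host}
    if ssh_host != api_host:
        # The tunnel is only needed when SSH and API hosts differ.
        env["K3S_AUTOSSH"] = "1"
    return env


print(resolve_k3s_target("testing.nuvolaris.dev"))
```
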
- In Linux GitHub workflows:
  - added an explicit `autossh` install step before running the suite
- `bash -n` on `tests/1-deploy.sh`
- YAML parse of:
  - `operator-pr-test.yaml`
  - `task-pr-test.yaml`
  - `tests.yaml`
  - `tasks/cloud/k3s/opsfile.yml`
- Real smoke test against the refreshed server:
  - generated kubeconfig via the new tunnelized `k3s` task
  - verified the kubeconfig points to `https://127.0.0.1:17443`
  - verified `kubectl get nodes` succeeds and the node is `Ready`
- The agent still could not write 1Password secrets directly because `op` was not authenticated in the agent's non-interactive environment.
- The intended value to set is:
  - `OpenServerless / TESTING / K3S_AMD_APIHOST = testing1-k3s-amd.nuvolaris.dev`
- The `Operator PR Test` flow in `nuvolaris/openserverless-testing` was correctly:
  - building a temporary PR image
  - pushing it to GHCR
  - patching `_operator/olaris/opsroot.json`
- Even with that patch, the remote `k3s-amd` deployment still failed later in `couchdb-init`.
- The pod log on the remote cluster showed:
  - `registry.hub.docker.com/apache/openserverless-operator:0.1.0-testing.2309191654`
  - `ErrImagePull`
  - `ImagePullBackOff`
- Root cause:
  - the operator StatefulSet used the patched `IMAGES_OPERATOR` image for the controller pod itself
  - but the controller container still inherited `OPERATOR_IMAGE` and `OPERATOR_TAG` defaults baked into the image `Dockerfile`
  - downstream jobs created by the controller therefore still referenced the stale Apache testing tag
- In `nuvolaris/openserverless-task:main`:
  - commit `1232d0c` `Propagate operator PR image to runtime jobs`
  - updated `setup/kubernetes/opsfile.yml` to derive runtime `OPERATOR_IMAGE` and `OPERATOR_TAG` from `IMAGES_OPERATOR`
  - updated `setup/kubernetes/operator.yaml` to inject those values into the operator pod environment
- In `nuvolaris/openserverless-operator` PR branch:
  - branch `test/k3s-traefik-crd`
  - commit `648b89a` `Use PR operator image for runtime jobs`
  - moved submodule `olaris` forward to task commit `1232d0c`
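
Deriving `OPERATOR_IMAGE` and `OPERATOR_TAG` from a single `IMAGES_OPERATOR` reference amounts to splitting the reference at the tag separator. The sketch below shows one way to do that in Python; it is not the code in `setup/kubernetes/opsfile.yml`. The helper name, the `latest` fallback, and the lack of digest (`@sha256:...`) handling are all assumptions.

```python
from typing import Tuple


def split_image_ref(ref: str) -> Tuple[str, str]:
    """Split 'registry/repo:tag' into (image, tag).

    Only a colon after the last '/' counts as the tag separator, so a
    registry port such as 'localhost:5000/op' is not mistaken for a tag.
    Falls back to 'latest' when no tag is present (assumption).
    """
    name_start = ref.rfind("/") + 1
    colon = ref.find(":", name_start)
    if colon == -1:
        return ref, "latest"
    return ref[:colon], ref[colon + 1:]


operator_image, operator_tag = split_image_ref(
    "ghcr.io/nuvolaris/openserverless-testing:pr-5-9930992"
)
print(operator_image, operator_tag)
```
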
- For the `nuvolaris` work resumed after the Apache release steps, commits and pushes were switched back to the user's account/key:
  - SSH key: `~/.ssh/id_ed25519`
  - Git email restored to `[email protected]` in the operator PR clone before committing the submodule bump
- Make the GitHub `k3s` test runs show the `ops` shell invocations in clear text.
- Keep a small local trace on the remote `k3s` server so deploy-time `ops setup server ...` calls can be correlated with server-side state.
- In `openserverless-testing/tests/run-gh-suite.sh`:
  - enabled tracing by default with:
    - `OPS_TRACE=1`
    - `K3S_SERVER_TRACE=1`
  - changed step execution so shell scripts run through `bash -x` when tracing is enabled
- In `openserverless-testing/tests/1-deploy.sh`:
  - added `run_logged` to print the exact top-level `ops ...` command before execution
  - added `append_remote_trace` to append timestamped entries to:
    - `/var/log/openserverless-testing/ops-trace.log`
    - on the remote `k3s` server
  - wired that trace around the deploy-time `ops config apihost` and `ops setup server ...` calls for:
    - `k3s-amd`
    - `k3s-arm`
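
The two helpers above are shell functions in `tests/1-deploy.sh`. As a rough illustration of the idea only, here is a Python analogue: `run_logged` echoes the command before running it, and `trace_entry` formats one timestamped line. The `+ ` prefix and the timestamp layout are assumptions, not the format the script actually writes.

```python
import shlex
import subprocess
import sys
from datetime import datetime, timezone


def trace_entry(cmd, now=None):
    """Format one timestamped trace line for a command (layout is assumed)."""
    now = now or datetime.now(timezone.utc)
    return f"{now.strftime('%Y-%m-%dT%H:%M:%SZ')} {shlex.join(cmd)}"


def run_logged(cmd):
    """Print the exact command, then execute it and return its exit code."""
    print(f"+ {shlex.join(cmd)}", flush=True)
    return subprocess.run(cmd, check=False).returncode


# Harmless stand-in command; the real script wraps `ops ...` invocations.
exit_code = run_logged([sys.executable, "-c", "pass"])
print(trace_entry(["ops", "config", "apihost", "testing.nuvolaris.dev"]))
```
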
- GitHub Actions logs will show the shell-expanded test scripts and their `ops` invocations.
- The remote server should accumulate a lightweight trace file at:
  - `/var/log/openserverless-testing/ops-trace.log`
- The `k3s` path in `openserverless-testing/tests/1-deploy.sh` was still using:
  - `ops setup server ... --uninstall`
  - `ops setup server ...`
- This wrapper sequence was not the one the user validated manually for the remote `k3s` environment.
- The `k3s-amd` and `k3s-arm` branches now use this direct sequence instead:
  - `ops cloud k3s delete <server> <user>`
  - `ops config apihost <apihost>`
  - `ops cloud k3s create <server> <user>`
  - `ops config slim`
  - `ops setup cluster`
- This keeps the GitHub test path aligned with the lower-level `ops cloud k3s` workflow rather than the higher-level `ops setup server` wrapper.
- Split the current `k3s` path into two logical suites:
  - `k3s-amd-slim` / `k3s-arm-slim`
  - `k3s-amd-full` / `k3s-arm-full`
- Keep temporary compatibility with older aliases:
  - `k3s-amd`
  - `k3s-arm`
  - these currently resolve to the `slim` profile
- `tests/lib/selector.sh`
  - now accepts test names with embedded hyphens before the final `<hash>`
  - exports `TEST_PROFILE`
  - maps:
    - `k3s-amd-slim` -> `k3s-amd` + `slim`
    - `k3s-amd-full` -> `k3s-amd` + `full`
    - `k3s-arm-slim` -> `k3s-arm` + `slim`
    - `k3s-arm-full` -> `k3s-arm` + `full`
- `tests/1-deploy.sh`
  - only runs `ops config slim` when `TEST_PROFILE=slim`
- `tests/run-gh-suite.sh`
  - skips MinIO-specific and static steps for `slim`
- `tests/all.sh`
  - mirrors the same `slim` skips for local all-in-one runs
- `tests/14-runtime-testing.sh`
  - no longer requires MinIO for `slim`
  - skips JS/Python MinIO runtime assertions in `slim`
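
The selector behaviour above is implemented in shell in `tests/lib/selector.sh`. The Python sketch below mirrors only the mapping described in these notes: the hash is taken as everything after the last hyphen, and the remaining name is looked up in a table. The function name, the returned keys other than `TEST_PROFILE`, and the treatment of names outside the table (no profile) are assumptions.

```python
# Canonical suites plus the legacy aliases, which resolve to slim for now.
PROFILES = {
    "k3s-amd-slim": ("k3s-amd", "slim"),
    "k3s-amd-full": ("k3s-amd", "full"),
    "k3s-arm-slim": ("k3s-arm", "slim"),
    "k3s-arm-full": ("k3s-arm", "full"),
    "k3s-amd": ("k3s-amd", "slim"),
    "k3s-arm": ("k3s-arm", "slim"),
}


def parse_selector(tag: str) -> dict:
    """Split '<test>-<hash>' at the last hyphen and resolve the profile."""
    name, sep, commit_hash = tag.rpartition("-")
    if not sep or not name or not commit_hash:
        raise ValueError(f"expected <test>-<hash>, got {tag!r}")
    # Names outside the table keep their name and get no profile (assumption).
    test, profile = PROFILES.get(name, (name, None))
    return {"TEST": test, "TEST_PROFILE": profile, "HASH": commit_hash}


print(parse_selector("k3s-amd-full-9930992"))
```

Splitting at the last hyphen is what lets test names such as `k3s-amd-full` carry embedded hyphens without being confused with the hash.
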
- The PR trigger regex in `openserverless-operator` / `openserverless-task` still needs the same naming expansion if the new canonical labels with hash are to be used directly on PRs.
- The current active `k3s-amd` label continues to work because the testing repo maps it to the `slim` profile.
- The Traefik CRD PR on `openserverless-operator` updated the templates and operator-side RBAC to `traefik.io`.
- The real cluster RBAC applied by `ops setup cluster` still came from `openserverless-task`, where:
  - `setup/kubernetes/roles/operator-roles.yaml`
  - was still using `traefik.containo.us`
- This mismatch caused the runtime error:
  - `middlewares.traefik.io ... is forbidden`
- `nuvolaris/openserverless-task:main` was updated so the cluster setup role now grants access to:
  - `apiGroups: ["traefik.io"]`
- The `openserverless-operator` PR branch was then updated to point its `olaris` submodule to that fixed task commit.
- On the remote `k3s-amd` cluster, the operator pod started crashing with:
  - `kopf._cogs.structs.credentials.LoginError: Ran out of valid credentials`
- The failing pod was:
  - `nuvolaris-operator-0`
- The image under test was:
  - `ghcr.io/nuvolaris/openserverless-testing:pr-5-9930992`
- In `openserverless-operator/nuvolaris/main.py`, the `@kopf.on.login()` handler was hardened for in-cluster execution.
- Instead of using only:
  - `kopf.login_via_pykube(...)`
- it now tries these handlers in order when a service-account token exists:
  - `kopf.login_with_service_account(...)`
  - `kopf.login_via_client(...)`
  - `kopf.login_via_pykube(...)`
- The handler now logs which method returned credentials and raises a clearer error if all of them fail.
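
The try-in-order behaviour described above can be sketched independently of kopf. The helper below is not the code in `nuvolaris/main.py`: `first_credentials` is a hypothetical name, the handlers are passed in as plain callables, and continuing past a handler that raises is an assumption about how the hardening behaves.

```python
from typing import Any, Callable, Iterable, Optional

LoginHandler = Callable[..., Optional[Any]]


def first_credentials(
    handlers: Iterable[LoginHandler],
    log: Callable[[str], None],
    **kwargs: Any,
) -> Any:
    """Return the first non-None result from the handlers, tried in order."""
    tried = []
    for handler in handlers:
        name = getattr(handler, "__name__", repr(handler))
        tried.append(name)
        try:
            credentials = handler(**kwargs)
        except Exception as exc:  # assumption: one failing method must not stop the chain
            log(f"login via {name} failed: {exc!r}")
            continue
        if credentials is not None:
            log(f"login via {name} returned credentials")
            return credentials
        log(f"login via {name} returned nothing")
    raise RuntimeError("no login method returned credentials; tried: " + ", ".join(tried))


# Stand-in handlers for demonstration only.
def no_token(**_):
    return None


def static_token(**_):
    return {"token": "example"}


print(first_credentials([no_token, static_token], print))
```

In the operator, the handler list would presumably be the three kopf functions named above, called from inside the `@kopf.on.login()` handler with the keyword arguments kopf passes in.
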
- After the operator auth fix, the `k3s-amd` run progressed through deploy, Redis, FerretDB, and Postgres checks.
- The next failure was no longer Traefik-related. It came from `tests/14-runtime-testing.sh`, where:
  - `ops -wsk project deploy --manifest ${PWD}/test-runtimes/manifest.yaml`
  - resolved to `tests/test-runtimes/manifest.yaml`
  - even though the real file lives at:
    - `test-runtimes/manifest.yaml`
- `tests/14-runtime-testing.sh` now derives the repository root from the script location and uses:
  - `${REPO_ROOT}/test-runtimes/manifest.yaml`
- This makes the manifest lookup independent from the shell's current working directory.
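
The fix above is a shell change, but the path arithmetic is easy to show in Python: the repository root is the parent of the directory that contains the script, regardless of where the script is invoked from. The function name and the `/work/...` checkout location are hypothetical.

```python
from pathlib import PurePosixPath


def manifest_path(script_path: str) -> PurePosixPath:
    """Locate test-runtimes/manifest.yaml relative to the script, not the CWD."""
    repo_root = PurePosixPath(script_path).parent.parent  # tests/ -> repo root
    return repo_root / "test-runtimes" / "manifest.yaml"


print(manifest_path("/work/openserverless-testing/tests/14-runtime-testing.sh"))
```
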
- Deleted merged or obsolete remote branches from `nuvolaris/openserverless-testing`:
  - `feat/pr-tag-platform-arch-testing`
  - `feat/test-tag-hash-contract`
- Deleted merged or obsolete remote branches from `nuvolaris/openserverless-task`:
  - `feat/pr-tag-platform-arch-testing`
  - `feat/test-tag-hash-contract`
- Deleted merged or obsolete remote branches from `nuvolaris/openserverless-operator`:
  - `feat/pr-tag-platform-arch-testing`
  - `feat/test-tag-hash-contract`
  - `test/kind-amd-kube-rbac-proxy-registry`
  - `test/k3s-traefik-crd`
- Removed the local `openserverless-operator-nuvolaris-traefik-test` worktree.
- Deleted the matching local `operator` branches listed above.
- Left Apache-specific release branches and worktrees untouched.
- The `testing1-k3s-amd.nuvolaris.dev` server was increased to `16 GB` RAM after the end-to-end `k3s-<hash>` run showed delayed Python action scheduling.
- The failed `python/mongodb` activation was not a code regression:
  - Kubernetes reported `FailedScheduling` for `wsk0-21-testactionuser-mongodb`
  - reason: `Insufficient memory`
  - the activation later completed successfully after the pod was finally scheduled
- `nuvolaris/openserverless-operator` should no longer run the internal `openserverless-operator-check` workflow for pull requests.
- The intended PR path is now:
  - `Trigger Testing` on the operator PR
  - dispatch to `nuvolaris/openserverless-testing`
  - end-to-end validation there
- To enforce that, `.github/workflows/check.yml` in the `operator` repo was changed to run only on pushes to `main`, not on `pull_request`.
- During the same day, `nuvolaris/openserverless-testing:main` was changed upstream to remove `operator-pr-test.yaml` and keep only `AllTests` on tag pushes.
- This broke the operator PR end-to-end path:
  - `Trigger Testing` in `nuvolaris/openserverless-operator` still sent `repository_dispatch`
  - but no workflow in `openserverless-testing` was listening for `operator-pr-test`
- The `operator-pr-test.yaml` workflow was restored on `testing:main` so operator PR labels can again launch the downstream PR test suite.
- The operator PR image built by `operator-pr-test.yaml` was still single-arch because it used:
  - `docker build`
  - `docker push`
  - on the default `ubuntu-22.04` GitHub runner
- That path produced an `amd64` image only, which is not suitable for a `k3s arm64` target cluster.
- The workflow was updated to:
  - set up QEMU
  - build with `docker/build-push-action`
  - push a multi-arch image for `linux/amd64,linux/arm64`
- This removes the main image-architecture blocker for labels like `k3sarm-<sha>`.
- `k3sarm` still requires a reachable ARM host for SSH/K3s provisioning.
- If the public API host and the SSH host differ, the workflow path will still need an explicit `K3S_ARM_SSH_HOST` source; the current patch only removed the image-architecture blocker.