
v1.13: Upgrade to v1.13.14 #552

Closed
wants to merge 111 commits

Conversation

jaredledvina
Member

Merges the upstream v1.13.14 tag into our fork.
One conflict, as expected, in images/cilium/Dockerfile due to incompatible changes with the CILIUM_ENVOY image ref change.

michi-covalent and others added 30 commits February 14, 2024 09:20
Based on https://github.com/cilium/cilium/actions/runs/7893881735. I had
to copy these digests manually because image-digest-output.txt-v1.13.12
was empty.

## Docker Manifests

### cilium

`docker.io/cilium/cilium:v1.13.12@sha256:d99204aa7b3b7bd2c9ab47fd398cc9f40290799bc0c7a4386c8dc5c1780cd3d3`
`quay.io/cilium/cilium:v1.13.12@sha256:d99204aa7b3b7bd2c9ab47fd398cc9f40290799bc0c7a4386c8dc5c1780cd3d3`

### clustermesh-apiserver

`docker.io/cilium/clustermesh-apiserver:v1.13.12@sha256:f1b1d0a85bab65e7d6adc90d000513a56ac58bdb071aa391a8580d73d20e6b6a`
`quay.io/cilium/clustermesh-apiserver:v1.13.12@sha256:f1b1d0a85bab65e7d6adc90d000513a56ac58bdb071aa391a8580d73d20e6b6a`

### docker-plugin

`docker.io/cilium/docker-plugin:v1.13.12@sha256:142822829c4bccd315e6b21262b9de9084c4913420a28bedb1209fa1f3e6bdb9`
`quay.io/cilium/docker-plugin:v1.13.12@sha256:142822829c4bccd315e6b21262b9de9084c4913420a28bedb1209fa1f3e6bdb9`

### hubble-relay

`docker.io/cilium/hubble-relay:v1.13.12@sha256:01b23ea40bcd81145dde6bfcbfc4d542749d08c2a1c6348954c85123a8d2b1fc`
`quay.io/cilium/hubble-relay:v1.13.12@sha256:01b23ea40bcd81145dde6bfcbfc4d542749d08c2a1c6348954c85123a8d2b1fc`

### operator-alibabacloud

`docker.io/cilium/operator-alibabacloud:v1.13.12@sha256:56fbfd0fb9ba239191cff2daf60b09e7f6b296622d0bdd5247ee535b9fe526c5`
`quay.io/cilium/operator-alibabacloud:v1.13.12@sha256:56fbfd0fb9ba239191cff2daf60b09e7f6b296622d0bdd5247ee535b9fe526c5`

### operator-aws

`docker.io/cilium/operator-aws:v1.13.12@sha256:d6c9c830ac558624568af0de46562f059f1085ece47323da108d7b13a686ca4f`
`quay.io/cilium/operator-aws:v1.13.12@sha256:d6c9c830ac558624568af0de46562f059f1085ece47323da108d7b13a686ca4f`

### operator-azure

`docker.io/cilium/operator-azure:v1.13.12@sha256:67a574ee88fb720cab3531722d95e90a0f98a17ca5c134a56d82840ad29fbf31`
`quay.io/cilium/operator-azure:v1.13.12@sha256:67a574ee88fb720cab3531722d95e90a0f98a17ca5c134a56d82840ad29fbf31`

### operator-generic

`docker.io/cilium/operator-generic:v1.13.12@sha256:f83734bbe270f961d545c7929152785507ce04a05d818ebc9776941723736d02`
`quay.io/cilium/operator-generic:v1.13.12@sha256:f83734bbe270f961d545c7929152785507ce04a05d818ebc9776941723736d02`

### operator

`docker.io/cilium/operator:v1.13.12@sha256:cfc2c4f2adebd878a5a5d2964139fd3afbd56fba219a68000e71162c2e946cc8`
`quay.io/cilium/operator:v1.13.12@sha256:cfc2c4f2adebd878a5a5d2964139fd3afbd56fba219a68000e71162c2e946cc8`

Signed-off-by: Michi Mutsuzaki <[email protected]>
We currently don't update bpf-next kernels, since we don't want
to cause additional regressions. This is a problem since lvh updates
may break existing images.

Instead of relying on bpf-next, use the current LTS kernel in
workflows.

Signed-off-by: Lorenz Bauer <[email protected]>
LVH images haven't been updated since April, probably because of missing
renovate annotations. Update them manually and add the annotations.

Signed-off-by: Lorenz Bauer <[email protected]>
6.0 is EOL and therefore no longer built. Switch to the next
newer LTS, which is 6.1.

Signed-off-by: Lorenz Bauer <[email protected]>
[ upstream commit c19a84e ]

This error can happen if a state is being destroyed while packets are in
flight. It should be rare as the window in the kernel where it can
happen is very short.

Signed-off-by: Paul Chaignon <[email protected]>
Signed-off-by: Tam Mach <[email protected]>
[ upstream commit 3c479d4 ]

The test output is riddled with logs such as:

    Defaulted container "cilium-agent" out of: cilium-agent, config
    (init), mount-cgroup (init), apply-sysctl-overwrites (init),
    mount-bpf-fs (init), clean-cilium-state (init),
    install-cni-binaries (init)

This gets particularly noisy when waiting for the key rotation to
complete, during which time we run kubectl exec repeatedly. This commit
fixes it.

Signed-off-by: Paul Chaignon <[email protected]>
Signed-off-by: Tam Mach <[email protected]>
[ upstream commit dc6cf34 ]

While fixing one of the review comments in the PR that introduced this
test, I changed the datapath mode to be explicitly set from matrix.mode.
Unfortunately, setting `native` makes it actually use `tunneling` mode.
Switching to `gke` mode resolves this issue.

Fixes cilium#30247

Signed-off-by: Marcel Zieba <[email protected]>
Signed-off-by: Tobias Klauser <[email protected]>
[ upstream commit d7f5e58 ]

In the AKS release cycle, a gap exists between the introduction of new supported Kubernetes versions
and the removal of older versions, leading to failures in scheduled tests.
This PR introduces the capability to disable older Kubernetes versions, mitigating test failures.

Signed-off-by: Birol Bilgin <[email protected]>
Signed-off-by: Tobias Klauser <[email protected]>
[ upstream commit 14d68f2 ]

This commit revises the Kubernetes versions tested for compatibility across all supported cloud providers.
Additionally, it adjusts the default Kubernetes version to match the default version provided by each cloud provider.

Signed-off-by: Birol Bilgin <[email protected]>
Signed-off-by: Tobias Klauser <[email protected]>
[ upstream commit bb81c06 ]

The current process delegates the review of ariane-config.yaml changes to the contributing group.
With this commit, reviewing responsibilities are transferred to the github-sec and ci-structure groups.

Signed-off-by: Birol Bilgin <[email protected]>
Signed-off-by: Tobias Klauser <[email protected]>
[ upstream commit b19321e ]

This commit updates the Ariane configuration to include the GitHub organization team 'organization-members' in the list of allowed teams.
Consequently, only members of this specific team will have the authorization to initiate test runs via issue comments.

Signed-off-by: Birol Bilgin <[email protected]>
Signed-off-by: Tobias Klauser <[email protected]>
[ upstream commit 6fee46f ]

[ backporter's note:
  - test-e2e-upgrade doesn't exit on this branch. Remove it.
  - Minor conflict in tests-clustermesh-upgrade.yaml

++<<<<<<< HEAD
 +          if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
 +            SHA="${{ inputs.SHA }}"
 +          else
 +            SHA="${{ github.sha }}"
 +          fi
++=======
+           CILIUM_DOWNGRADE_VERSION=$(contrib/scripts/print-downgrade-version.sh stable)
+           echo "downgrade_version=${CILIUM_DOWNGRADE_VERSION}" >> $GITHUB_OUTPUT
++>>>>>>> 20a5826c31 (ci/ipsec: Fix downgrade version retrieval)
]

Figuring out the right "previous patch release version number" to
downgrade to in print-downgrade-version.sh turns out to be more complex
than expected [0][1][2][3].

This commit is an attempt to 1) fix issues with the current script and
2) overall make the script clearer, so we can avoid repeating these
mistakes.

As for the fixes, there are two things that are not correct with the
current version. First, we're trying to validate the existence of the
tag to downgrade to, in case the script runs on top of a release
preparation commit for which the VERSION file has been updated to a
value that does not yet contain a corresponding tag. This part of the
script is actually OK, but not the way we call it in the IPsec
workflow: we use "fetch-tags: true" but keep "fetch-depth: 1" (the
default), and the two are not compatible: a shallow clone results in no
tags being fetched.

To address this, we retrieve the tag differently: instead of relying on
"fetch-tags" from the workflow, we call "git fetch" from the script
itself, provided the preconditions are met (we only run it from a Git
repository, if the "origin" remote is defined). If the tag exists,
either locally or remotely, then we can use it. Otherwise, the script
considers that it runs from a release preparation Pull Request, and
decrements the patch release number.

The second issue is that we would return no value from the script if the
patch release is zero. This is to avoid any attempt to find a previous
patch release when working on a development branch. However, this logic
is incorrect (it comes from a previous version of the script where we
would always decrement the patch number). After the first release of a
new minor version, it's fine to have a patch number at 0. What we should
check instead is whether the version ends with "-dev".

This commit brings additional changes for clarity: more comments, and a
better separation between the "get latest patch release" and "get
previous stable branch" cases, moving the relevant code to independent
functions, plus better argument handling. We also edit the IPsec
workflow to add some logs about the version retrieved. The logs should
also display the script's error messages, if any, that are printed to
stderr.

Sample output from the script:

    VERSION     Tag exists  Previous minor  Previous patch release

    1.14.3      Y           v1.13           v1.14.3
    1.14.1      Y           v1.13           v1.14.1
    1.14.0      Y           v1.13           v1.14.0
    1.14.1-dev  N           v1.13           <error>
    1.15.0-dev  N           v1.14           <error>
    1.13.90     N           v1.12           v1.13.89  <- decremented
    2.0.0       N           <error>         <error>
    2.0.1       N           <error>         v2.0.0    <- decremented
    2.1.1       N           v2.0            v2.1.0    <- decremented
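
For reference, here is a rough Go transcription of the "get latest patch
release" logic described above (the real implementation lives in
contrib/scripts/print-downgrade-version.sh; the function name, error
messages, and the tagExists callback are illustrative):

```
package version

import (
	"fmt"
	"strconv"
	"strings"
)

// previousPatchRelease sketches the "get latest patch release" case.
// tagExists stands in for the local/remote tag lookup ("git fetch").
func previousPatchRelease(version string, tagExists func(tag string) bool) (string, error) {
	// Development versions (e.g. 1.15.0-dev) have no previous patch release.
	if strings.HasSuffix(version, "-dev") {
		return "", fmt.Errorf("development version %q has no previous patch release", version)
	}
	if tag := "v" + version; tagExists(tag) {
		return tag, nil // e.g. 1.14.3 -> v1.14.3
	}
	// No tag yet: assume a release preparation commit and decrement the
	// patch number (e.g. 1.13.90 -> v1.13.89).
	parts := strings.Split(version, ".")
	if len(parts) != 3 {
		return "", fmt.Errorf("unexpected version format %q", version)
	}
	patch, err := strconv.Atoi(parts[2])
	if err != nil || patch == 0 {
		return "", fmt.Errorf("cannot decrement patch number in %q", version)
	}
	return fmt.Sprintf("v%s.%s.%d", parts[0], parts[1], patch-1), nil
}
```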

[0] 56dfec2 ("contrib/scripts: Support patch releases in print-downgrade-version.sh")
[1] 4d7902f ("contrib/scripts: Remove special handling for patch release number 90")
[2] 5581963 ("ci/ipsec: Fix version retrieval for downgrades to closest patch release")
[3] 3803f53 ("ci/ipsec: Fix downgrade version for release preparation commits")

Fixes: 3803f53 ("ci/ipsec: Fix downgrade version for release preparation commits")
Signed-off-by: Quentin Monnet <[email protected]>
Signed-off-by: Yutaro Hayakawa <[email protected]>
[ upstream commit 99786be ]

[ backporter's notes: needed to resolve complexity issues in subsequent
  patches ]

We first look up the destination endpoint to check for tunnel redirection,
and then look it up a second time to access its sec_label and IPSec key.

Make the first look-up unconditional, so that we can remove the second.

Signed-off-by: Julian Wiedmann <[email protected]>
[ upstream commit ac63856 ]

This is an alternative approach to fix cilium#21954, so that we
can re-introduce the 2005 from-proxy routing rule in following patches
to fix L7 proxy issues.

This commit simply allows packets to WORLD as long as they are from
ingress proxy. This was one of the solutions suggested by Martynas, as
recorded in commit message cilium/cilium@c534bb7:

    One fix was to extend the troublesome check
    https://github.com/cilium/cilium/blob/v1.12.3/bpf/bpf_host.c#L626 by
    allowing proxy replies to `WORLD_ID`.

To tell if an skb originated from the ingress proxy, the commit extends
the semantics of the existing flags `TC_INDEX_F_SKIP_{INGRESS,EGRESS}_PROXY`
and renames them to clarify the changed meaning.

Fixes: cilium#21954 (Reply from pod to outside is dropped when L7 ingress policy is used)

Signed-off-by: Zhichuan Liang <[email protected]>
Signed-off-by: Julian Wiedmann <[email protected]>
[ upstream commit 217ae4f ]

[ backporter's notes: this is a custom backport to init.sh ]

This commit re-introduces the 2005 routes that were removed by
cilium@9dd6cfc (datapath: remove 2005 route table for ipv6 only)
and cilium@c1a0dba (datapath: remove 2005 route table for ipv4 only).

Signed-off-by: Robin Gögge <[email protected]>
Signed-off-by: Zhichuan Liang <[email protected]>
Signed-off-by: Julian Wiedmann <[email protected]>
sayboras and others added 28 commits March 21, 2024 14:30
[ upstream commit e8bed8d ]
[ backporter notes: conflicts in unit tests. Had to introduce
  the `NoError` check instead of expecting a specific error ]

This commit adds the namespace filter while listing the generated LB
services; the main reason is to avoid wrong status updates for gateways
having the same name but belonging to different namespaces.

Testing was done locally as per below:

Before the fix:
```
$ kg gateway -A
NAMESPACE   NAME         CLASS    ADDRESS          PROGRAMMED   AGE
another     my-gateway   cilium   10.110.222.237   True         4s
default     my-gateway   cilium   10.110.222.237   True         56s
```

After the fix:

```
$ kg gateway -A
NAMESPACE   NAME         CLASS    ADDRESS          PROGRAMMED   AGE
another     my-gateway   cilium   10.102.170.180   True         14m
default     my-gateway   cilium   10.110.222.237   True         14m

$ kg services -A
NAMESPACE     NAME                           TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)                      AGE
another       cilium-gateway-my-gateway      LoadBalancer   10.102.170.180   10.102.170.180   80:31424/TCP                 15m
default       cilium-gateway-my-gateway      LoadBalancer   10.110.222.237   10.110.222.237   80:31889/TCP                 15m
...
```
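
A minimal sketch of the fix, assuming a controller-runtime client as used by
the Gateway API controller; the label key, function name, and signature are
placeholders, not necessarily what Cilium uses:

```
package gateway

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// owningGatewayLabel is a placeholder for the label tying a generated
// service back to its Gateway.
const owningGatewayLabel = "io.cilium.gateway/owning-gateway"

// listGatewayServices lists the generated LB services scoped to the Gateway's
// namespace, so same-named Gateways in other namespaces are not picked up.
func listGatewayServices(ctx context.Context, c client.Client, gwNamespace, gwName string) (*corev1.ServiceList, error) {
	svcList := &corev1.ServiceList{}
	err := c.List(ctx, svcList,
		client.InNamespace(gwNamespace), // the fix: add the namespace filter
		client.MatchingLabels{owningGatewayLabel: gwName},
	)
	return svcList, err
}
```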

Fixes: cilium#31270
Signed-off-by: Tam Mach <[email protected]>
Signed-off-by: Sebastian Wicki <[email protected]>
[ upstream commit 0fb203e ]

So that the failure of one matrix entry (e.g., caused by a flake) doesn't
cancel the other ongoing tests, if any.

Signed-off-by: Marco Iorio <[email protected]>
Signed-off-by: Sebastian Wicki <[email protected]>
[ upstream commit b639eab ]

In general, it is not recommended to carry out several admin operations
on the cluster at the same time, as it can make troubleshooting in case of
issues a lot more complicated. Mixing operations is also less likely to
be covered in CI so more likely to hit corner cases.

Performing IPsec key rotations during Cilium up/downgrades is one such
case. Let's document it explicitly to discourage users from doing that.

Signed-off-by: Paul Chaignon <[email protected]>
Signed-off-by: Sebastian Wicki <[email protected]>
[ upstream commit 9939fa2 ]
[ backporter notes: No SRV6 support on v1.13, I removed those test cases ]

Before this patch, Hubble would wrongly report known traffic direction
and reply status when IPSec was enabled.

Signed-off-by: Alexandre Perrin <[email protected]>
Signed-off-by: Sebastian Wicki <[email protected]>
[ upstream commit fbe78c4 ]

The default service CIDR of AKS clusters is 10.0.0.0/16 [1].
Unfortunately, we don't set a pod CIDR for clusterpool IPAM, and hence
use Cilium's default of 10.0.0.0/8, which overlaps. This can
lead to "fun" situations in which e.g. the kube-dns service ClusterIP is
the same as the hubble-relay pod IP, or similar shenanigans. This
usually breaks the cluster utterly.

The fix is relatively straightforward: set a pod CIDR for Cilium which
does not overlap with the defaults of AKS. We chose 192.168.0.0/16 as this
is what is recommended in [2].

[1]: https://learn.microsoft.com/en-us/azure/aks/configure-kubenet#create-an-aks-cluster-with-system-assigned-managed-identities
[2]: https://learn.microsoft.com/en-us/azure/aks/azure-cni-powered-by-cilium#option-1-assign-ip-addresses-from-an-overlay-network

Fixes: fbf3d38 (ci: add AKS workflow)

Co-authored-by: Fabian Fischer <[email protected]>
Signed-off-by: David Bimmler <[email protected]>
… port

[ upstream commit d3b19d6 ]

Currently, listing the load-balancing configuration doesn't display the
L7LB Proxy Port for services of type `l7-load-balancer`.

```
cilium-dbg bpf lb list
SERVICE ADDRESS     BACKEND ADDRESS (REVNAT_ID) (SLOT)
...
10.96.193.7:443     0.0.0.0:0 (30) (0) [ClusterIP, non-routable, l7-load-balancer]
```

The only way of retrieving the L7LB proxy port is to list the frontends
(`cilium-dbg bpf lb list --frontends`) and manually convert the backend id
(union type) to the L7LB proxy port.

Therefore, this commit adds the L7LB proxy port to the output of `cilium-dbg bpf lb list`
if the service is of type L7 LoadBalancer. The `--frontends` subcommand still displays the
unmapped backend id.

```
cilium-dbg bpf lb list
SERVICE ADDRESS     BACKEND ADDRESS (REVNAT_ID) (SLOT)
10.96.0.1:443       172.18.0.3:6443 (1) (1)
                    0.0.0.0:0 (1) (0) [ClusterIP, non-routable]
10.96.252.10:443    172.18.0.2:4244 (22) (1)
                    0.0.0.0:0 (22) (0) [ClusterIP, InternalLocal, non-routable]
10.96.155.44:80     0.0.0.0:0 (14) (0) [ClusterIP, non-routable]
                    10.244.1.211:80 (14) (1)
172.18.0.2:32646    0.0.0.0:0 (33) (0) [NodePort, l7-load-balancer] (L7LB Proxy Port: 15735)
10.96.193.7:443     0.0.0.0:0 (30) (0) [ClusterIP, non-routable, l7-load-balancer] (L7LB Proxy Port: 15735)
10.96.122.45:80     10.244.1.250:80 (26) (1)
                    0.0.0.0:0 (26) (0) [ClusterIP, non-routable]
10.96.102.137:80    0.0.0.0:0 (23) (0) [ClusterIP, non-routable]
                    10.244.1.126:4245 (23) (1)
10.96.108.180:443   0.0.0.0:0 (17) (0) [ClusterIP, non-routable, l7-load-balancer] (L7LB Proxy Port: 17731)
172.18.255.1:80     0.0.0.0:0 (25) (0) [LoadBalancer, l7-load-balancer] (L7LB Proxy Port: 17731)
0.0.0.0:32646       0.0.0.0:0 (34) (0) [NodePort, non-routable, l7-load-balancer] (L7LB Proxy Port: 15735)
0.0.0.0:31012       0.0.0.0:0 (21) (0) [NodePort, non-routable, l7-load-balancer] (L7LB Proxy Port: 17731)
```

Signed-off-by: Marco Hofstetter <[email protected]>
Signed-off-by: Tam Mach <[email protected]>
[ backporter's notes: Minor conflict due to structural changes in the
  GitHub Action (switched how we execute into LVH). ]

The IPsec fixes will introduce a few XfrmInNoStates packet drops on
up/downgrades due to non-atomic Linux APIs (can't replace XFRM states
atomically). Those are limited to a very short time (time between two
netlink syscalls).

We however need to allowlist them in the CI. Since we're using the
conn-disrupt GitHub action from main, we need to allowlist in main for
the pull request's CI to pass.

Note that despite the expected-xfrm-errors flag, the tests will still
fail if we get 10 or more such drops. We don't expect so many
XfrmInNoStates drops so we still want to fail in that case.

Signed-off-by: Paul Chaignon <[email protected]>
[ backporter's notes: Small conflicts due to the new error handling in
  main. ]

This small bit of refactoring will ease a subsequent commit.

Signed-off-by: Paul Chaignon <[email protected]>
[ backporter's comments: Trivial conflict in conformance-ipsec-e2e.yaml
  for the number of expected keys. ]

We need to ensure the (key used, sequence number) tuple for each
encrypted packet is always unique on the network. Today that's not the
case because the key is the same for all nodes and the sequence number
starts at 0 on node reboot.

To enable this, we will derive one key per node pair from a global key
shared across all nodes. We need it per node pair and not per node
because the first message emitted from A to B shouldn't be using the
same key as the first message emitted from B to A, to satisfy the above
requirement.

To that end, for each node pair (A, B), we compute a key as follows:

    key = sha256(global_key + ip_of_a + ip_of_b)

The sha256 sum is then truncated to the expected length.

Once computed, we install the derived keys such that the key used for
encryption on node A is the same as the key used for decryption on node
B:

    Node A               <> Node B
    XFRM IN:  key(b+a)      XFRM IN:  key(a+b)
    XFRM OUT: key(a+b)      XFRM OUT: key(b+a)
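
A sketch of the derivation in Go, assuming a 32-byte SHA-256 digest truncated
to the configured key length; the helper name and signature are illustrative,
not the exact Cilium code:

```
package ipsec

import (
	"crypto/sha256"
	"net"
)

// deriveNodePairKey computes key = truncate(sha256(globalKey || srcIP || dstIP)),
// so the key for A->B traffic differs from the key for B->A traffic.
func deriveNodePairKey(globalKey []byte, srcIP, dstIP net.IP, keyLen int) []byte {
	h := sha256.New()
	h.Write(globalKey)
	h.Write(srcIP)
	h.Write(dstIP)
	return h.Sum(nil)[:keyLen] // truncate the digest to the expected key length
}
```

On node A, the XFRM OUT state toward B would then carry the (a, b) key while
the XFRM IN state for traffic from B carries the (b, a) key, matching the
table above.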

Signed-off-by: Paul Chaignon <[email protected]>
In the previous commit, we changed the way we compute the IPsec keys. We
therefore need to replace the XFRM states to use the new keys.

Our current update logic however doesn't take this case into account. It
compares states based on IPs, marks, and SPIs, but it doesn't compare
the keys. It would therefore assume that the correct states are already
installed.

This commit extends that logic to detect a difference in encryption keys
and, if such a difference exists, remove the old states.

Signed-off-by: Paul Chaignon <[email protected]>
[ backporter's comments: Many conflicts. Some trivial due to new files
  with logs in sources_name_to_ids.h. Some minor in ipsec_linux.go due
  to the new error handling in main. Remove unnecessary include of
  lib/ipcache.h in BPF unit tests; that file doesn't exist before
  v1.15. ]

It turns out that two XFRM IN states can't have the same mark and
destination IP, even if they have different source IPs. That's an issue
in our case because each node1->node2 pair will have its own IPsec key.
Therefore, we need one XFRM state per origin node on input.

Since we can't differentiate those XFRM states by their source
IPs, we will have to differentiate using the marks. To do so, we need to
convert the source IP into a packet mark before matching against XFRM
states. We can write these packet marks in bpf_network, before going up
the stack for decryption. And conveniently, we've just introduce a way
to convert each cluster node into an ID, the node ID, which fits in the
packet mark.

This commit therefore performs a node ID map lookup to retrieve the
node ID using the outer source IP address when packets are first
processed in bpf_network.

We clear the node ID from the packet mark after decryption using XFRM
(output-mark).

If no node ID is found for the outer source IP, we drop the packet.
It seems preferable to drop it from BPF with all the contextual
information rather than let it proceed to the XFRM layer where it will
be dropped with only an error code incrementing.

Signed-off-by: Paul Chaignon <[email protected]>
[ backporter's comments: Many conflicts due to the new error handling in
  main. ]

We want to have one IPsec key per node1->node2 (not including
node2->node1 which will get a different key). We therefore need per-node
XFRM states on the receive/decrypt side to carry each node's key.

This commit implements that change. Thus, instead of creating a unique
XFRM IN state when we receive the local node, we will create an XFRM IN
state every time we receive a remote node.

Signed-off-by: Paul Chaignon <[email protected]>
[ backporter's comments: Minor conflict due to new error handling in
  main. ]

This commit extends the logic from commit c0d9b8c ("ipsec: Allow old
and new XFRM OUT states to coexist for upgrade") to have both the old
and the new XFRM IN states in place. This is necessary to avoid packet
drops during the upgrade.

As with the XFRM OUT states, we can't add the new IN state while the old
one is in place. We therefore need to first remove the old state, to
then add the new one. See c0d9b8c ("ipsec: Allow old and new XFRM OUT
states to coexist for upgrade") for details.

Note this commit also removes the comparison of output-marks.
Output-marks aren't actually used by the kernel to decide if two states
conflict. And in the case of XFRM IN states, the output-marks changed a
bit as well. Despite being different, the states still conflict.

Signed-off-by: Paul Chaignon <[email protected]>
[ backporter's comments: Non-trivial conflicts due to refactoring around
  local node initialization logic. ]

Read and export the local bootid via CiliumNode. We'll need it in a
subsequent commit to generate new IPsec keys when a node reboots.
This commit also collects the boot_id file as part of the bugtool
report.

Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Paul Chaignon <[email protected]>
Co-authored-by: Paul Chaignon <[email protected]>
[ backporter's notes: Minor conflicts due to new error handling in
  main. ]

We need to ensure we never have two packets encrypted with the same key
and sequence number. To that end, in previous commits, we introduced
per-node-pair keys. That is however not sufficient. Since the sequence
numbers can start from zero on node boot, if a node reboots, it will
start sending traffic encrypted again with the same key and sequence
number as it did before.

To fix that, we need to ensure the per-node-pair keys change on node
reboots. We achieve that by using the boot ID in the per-node-pair key
calculation.

For a pair of nodes A and B with IP addresses a and b and boot IDs x
and y, we will therefore install two different keys:

    Node A               <> Node B
    XFRM IN:  key(b+a+y+x)  XFRM IN:  key(a+b+x+y)
    XFRM OUT: key(a+b+x+y)  XFRM OUT: key(b+a+y+x)

This is done such that, for each pair of nodes A, B, the key used for
decryption on A (XFRM IN) is the same key used for encryption on B (XFRM
OUT), and vice versa.
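
A sketch of the extended derivation and the resulting symmetry, as seen from
node A; names and signatures are illustrative, not the actual Cilium code:

```
package ipsec

import (
	"crypto/sha256"
	"net"
)

// deriveNodePairKeyWithBootID folds both nodes' boot IDs into the hash, so a
// reboot (new boot ID) yields new per-node-pair keys.
func deriveNodePairKeyWithBootID(globalKey []byte, srcIP, dstIP net.IP, srcBootID, dstBootID string, keyLen int) []byte {
	h := sha256.New()
	h.Write(globalKey)
	h.Write(srcIP)
	h.Write(dstIP)
	h.Write([]byte(srcBootID))
	h.Write([]byte(dstBootID))
	return h.Sum(nil)[:keyLen]
}

// keysForPeer shows the symmetry from the table above, from the point of view
// of node A (IP a, boot ID x) for its peer B (IP b, boot ID y).
func keysForPeer(globalKey []byte, a, b net.IP, x, y string, keyLen int) (out, in []byte) {
	out = deriveNodePairKeyWithBootID(globalKey, a, b, x, y, keyLen) // XFRM OUT: encrypt A->B
	in = deriveNodePairKeyWithBootID(globalKey, b, a, y, x, keyLen)  // XFRM IN: decrypt B->A
	return out, in
}
```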

Since we are now retrieving the local node's boot ID as part of the
IPsec code, we need to initialize the mocked local node store in the
unit tests.

Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Paul Chaignon <[email protected]>
Co-authored-by: Paul Chaignon <[email protected]>
[ backporter's notes: Many conflicts due to the new error handling in
  main. ]

When we detect that a node's bootid has changed, we need to update the
IPsec states.

Unfortunately this is not as straightforward as it should be, because we
may receive the new boot ID before a CiliumInternalIP is assigned to the
node. In such a case, we can't install the XFRM states yet because we
don't have the CiliumInternalIP, but we need to remember that the boot
ID changed and states should be replaced.

We therefore record that information in a map, ipsecUpdateNeeded, which
is later read to see if the boot ID changed.
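
A minimal sketch of that bookkeeping; only the map name ipsecUpdateNeeded
comes from the commit, the surrounding function and helper are illustrative:

```
package ipsec

// ipsecUpdateNeeded records nodes whose boot ID changed before their
// CiliumInternalIP was known, so their XFRM states still need replacing.
var ipsecUpdateNeeded = map[string]bool{}

func onNodeUpdate(nodeName, ciliumInternalIP string, bootIDChanged bool) {
	if ciliumInternalIP == "" {
		if bootIDChanged {
			ipsecUpdateNeeded[nodeName] = true // remember for later
		}
		return
	}
	if bootIDChanged || ipsecUpdateNeeded[nodeName] {
		replaceXFRMStates(nodeName)
		delete(ipsecUpdateNeeded, nodeName)
	}
}

// replaceXFRMStates stands in for the real state replacement logic.
func replaceXFRMStates(nodeName string) {}
```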

Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Paul Chaignon <[email protected]>
Co-authored-by: Paul Chaignon <[email protected]>
[ backporter's notes: Minor conflict due to a new error being listed in
  text that is now removed. ]

When a node reboots, the key used to communicate with it is expected to
change due to the newly generated boot ID. While the new key is being
installed, we may need to do it non-atomically (delete + insert), so
packets to/from that node might be dropped, which would cause increases
in the XfrmInNoStates/XfrmOutNoStates counters. Add a note about it in
the docs so users are not surprised.

Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Paul Chaignon <[email protected]>
Co-authored-by: Paul Chaignon <[email protected]>
Now we can enable ESN anti-replay with a window size of 1024. If a node
reboots, every other node updates the related keys due to the new boot
ID, and the rebooted node itself already generates its keys with the new
boot ID. The window is used to allow for out-of-order packets;
anti-replay still doesn't allow any packet to be replayed, but keeps a
bitmap and can accept out-of-order packets within the window size range.
For more information, check section "A2. Anti-Replay Window" of RFC 4303.

Signed-off-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Paul Chaignon <[email protected]>
Co-authored-by: Paul Chaignon <[email protected]>
[ backporter's notes: Minor conflicts in conformance-ipsec-e2e.yaml due
  to the extension of CI to cover non-AEAD keys. Minor conflict in the
  documentation for how to retrieve the SPI from the secret.
  The testing logic for the number of IPsec keys to expect also needed
  to be adjusted. On v1.13, we have one case where IPv6 is disabled.
  In that case, the number of expected keys per remote node drops from
  two to one (i.e., one per IP family). ]

The ESN bit in the IPsec secret will be used to indicate whether
per-node-pair keys should be used or if the global key should remain in
use. Specifically, it consists of a '+' sign after the SPI number in the
secret.

This ESN bit will be used to transition from a global key system to a
per-node-pair system at runtime. We would typically rely on an agent
flag for such a configuration. However, in this case, we need to perform
a key rotation at the same time as we change the key system. Encoding
the key system in the IPsec secret achieves that.

By transitioning from the global to the per-node-pair keys via a key
rotation, we ensure that the two can coexist during the transition. The
old, global key will have XFRM rules with SPI n, whereas the new,
per-node-pair keys will have XFRM rules with SPI n+1.

Using a bit in the IPsec secret is also easier to test because we
already have all the logic to test key rotation (whereas we would need
new logic to test a flag change).

The users therefore need to perform a key rotation from e.g.:

    3 rfc4106(gcm(aes)) [...] 128

to:

    4+ rfc4106(gcm(aes)) [...] 128

The key rotation test in CI is updated to cover a rotation from 3 to 4+
(meaning a rotation into the new per-node-pair key system).
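
A sketch of how the '+' bit could be read from the first field of the secret
(e.g. "3" or "4+"); the function name and error handling are illustrative,
not the actual Cilium parser:

```
package ipsec

import (
	"strconv"
	"strings"
)

// parseSPIField reads the SPI field of the IPsec secret. A trailing '+'
// selects the per-node-pair key system.
func parseSPIField(field string) (spi int, perNodePair bool, err error) {
	perNodePair = strings.HasSuffix(field, "+")
	spi, err = strconv.Atoi(strings.TrimSuffix(field, "+"))
	return spi, perNodePair, err
}
```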

Signed-off-by: Paul Chaignon <[email protected]>
[ backporter's notes: major conflict because we have a special case on
  v1.13 to handle the case where IPv6 is disabled. We therefore need to
  account for both IPv6 and the key-system when computing the expected
  number of keys. ]

Since commit 4cf468b ("ipsec: Control use of per-node-pair keys from
secret bit"), IPsec key rotations can be used to switch from the
single-key system to the per-tunnel key system (also referred to as
per-node-pair key system). Our key rotation test in CI was updated to
cover such a switch.

This commit extends it to also cover traditional key rotations, with
both the new and old key systems. The switch back into a single-key
system is also covered.

These special key rotations are controlled with a single + sign. Adding
it after the SPI in the IPsec Kubernetes secret is enough to switch to a
per-tunnel key system. We thus simply need to cover all 4 cases of
having or not having the + sign in the old and new secrets.

Signed-off-by: Paul Chaignon <[email protected]>
When adding the BootID field to the CiliumNode CRD we forgot to bump the
version, which is an issue when, after a Cilium upgrade, the operator
tries to update the CiliumNode objects to include the BootID field.

Signed-off-by: Robin Gögge <[email protected]>
[ notes: before v1.15, enableIPsecIPv4 and enableIPsecIPv6 didn't return
an error. This needed to be changed from the original commit ]

A node update that doesn't contain a BootID will cause the creation
of non-matching XFRM IN and OUT states across the cluster as the
BootID is used to generate per-node key pairs. Non-matching XFRM
states will result in XfrmInStateProtoError, causing packet drops.
An empty BootID should thus be treated as an error, and Cilium
should not attempt to derive per-node keys from it.

Signed-off-by: Robin Gögge <[email protected]>
This commit ensures that

- each time we compute a per-node-pair-key we create an empty slice with
  the correct length first, and then append all the input data instead
  of appending to one of the input slices (`globalKey`) directly.
- the IPs that are used as arguments in `computeNodeIPsecKey` are
  canonical, meaning IPv4 IPs consist of 4 bytes and IPv6 IPs consist of
  16 bytes.

This is necessary to always have the same inputs on all nodes when
computing the per-node-pair-key. Without this, IPs might not match at
the byte level, e.g. on one node the input is a v6-mapped v4 address
(an IPv4 address in 16 bytes) and on the other it isn't, when used as
input to the hash function. This would generate non-matching keys.
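
A sketch of both fixes; `computeNodeIPsecKey` is named in the commit, but the
exact signature and truncation length here are assumptions, and the
canonicalization is shown as a helper even though the commit applies it to
the arguments before the call:

```
package ipsec

import (
	"crypto/sha256"
	"net"
)

func computeNodeIPsecKey(globalKey, srcIP, dstIP, srcBootID, dstBootID []byte) []byte {
	// Fix 1: build the hash input in a fresh slice instead of appending to
	// globalKey, which would mutate the caller's slice.
	input := make([]byte, 0, len(globalKey)+len(srcIP)+len(dstIP)+len(srcBootID)+len(dstBootID))
	input = append(input, globalKey...)
	input = append(input, srcIP...)
	input = append(input, dstIP...)
	input = append(input, srcBootID...)
	input = append(input, dstBootID...)

	sum := sha256.Sum256(input)
	return sum[:len(globalKey)] // truncated to the original key length (assumption)
}

// canonicalIP returns 4 bytes for IPv4 and 16 bytes for IPv6, so that e.g. a
// v6-mapped IPv4 address hashes identically on every node (Fix 2). Callers
// canonicalize IPs before passing them to computeNodeIPsecKey.
func canonicalIP(ip net.IP) net.IP {
	if v4 := ip.To4(); v4 != nil {
		return v4
	}
	return ip.To16()
}
```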

Co-authored-by: Zhichuan Liang <[email protected]>
Signed-off-by: Robin Gögge <[email protected]>
We have very little logging of the boot IDs. Really fixing that will
require a bit of work to not be too verbose, but in the meantime, we
should at least log the local boot ID.

Signed-off-by: Paul Chaignon <[email protected]>