allow using envoy's zone aware routing #9161

Open
aniket-z opened this issue Feb 7, 2024 · 9 comments
Labels
area/policies kind/design Design doc or related kind/feature New feature triage/accepted The issue was reviewed and is complete enough to start working on it

Comments

aniket-z commented Feb 7, 2024

Description

Our current setup has the following components:

  • kuma-cp running in universal mode
  • hundreds of ECS services across multiple AWS ECS clusters
  • all kuma-dps configured with same zone (i.e. "kuma.io/zone" is same for all dataplanes)

Now, we want to use Envoy's zone-aware routing feature to reduce our inter-AZ network cost.

We are aware that Kuma supports locality-aware routing, but there it seems routing is configured using Envoy's priority-based load balancing rather than Envoy's zone-aware routing feature.

Consider a scenario where a source service calls a destination service, the source service has 4 tasks in zone A and 1 task in zone B, and the destination service has 1 task in zone A and 4 tasks in zone B. From what I understand, with Kuma's locality-aware routing the traffic would not be the same on all tasks of the destination service (and thus neither would the CPU %), whereas with Envoy's zone-aware routing the throughput per destination task (and thus the CPU %) would be the same, because it takes the per-zone task counts of both the source and the destination service into account. Please correct me if I have misunderstood Envoy's zone-aware routing or Kuma's locality-aware routing and the problem I have described is not valid.
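To make the concern concrete with rough numbers (my own back-of-the-envelope estimate, assuming each source task sends R requests per second and locality-aware routing keeps essentially all traffic in-zone): the 4 source tasks in zone A would all hit the single destination task in zone A, giving it ~4R, while the 1 source task in zone B would spread over the 4 destination tasks in zone B, giving each of them ~0.25R, i.e. a ~16x skew in per-task load.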

Is there any way we could use Envoy's zone-aware routing feature with Kuma? We don't want to change kuma.io/zone, as that would require us to set up Kuma's zone ingress and egress, and we don't want to introduce an extra hop (and thus extra cost and latency) and an extra component into our system. Please suggest how we should proceed here.

@aniket-z aniket-z added kind/feature New feature triage/pending This issue will be looked at on the next triage meeting labels Feb 7, 2024
@aniket-z aniket-z changed the title allow setting locality different from kuma.io/zone allow setting envoy's locality.zone different from kuma.io/zone Feb 7, 2024
lukidzi (Contributor) commented Feb 12, 2024

Hi, I think the thing you want to achieve can be done with Kuma by using https://kuma.io/docs/2.6.x/policies/meshloadbalancingstrategy/#disable-cross-zone-traffic-and-prioritize-traffic-the-dataplanes-on-the-same-node-and-availability-zone

First, you need to add a tag to your dataplanes that identifies their availability zone, e.g.:

type: Dataplane
mesh: default
name: {{ name }}
networking:
  address: {{ address }}
  inbound:
    - port: 8000
      servicePort: 80
      tags:
        kuma.io/service: backend
        kuma.io/protocol: HTTP
        kuma.io/availability-zone: zone1

This kuma.io/availability-zone: zone1 value is just an example; you can use any value you like, as long as dataplanes in the same location share the same tag value and dataplanes in other locations get a different one.
When your dataplanes are tagged, you need to create a policy:

type: MeshLoadBalancingStrategy
name: local-zone-affinity-backend
mesh: mesh-1
spec:
  targetRef:
    kind: Mesh
  to:
  - targetRef:
      kind: MeshService
      name: backend
    default:
      localityAwareness:
        localZone:
          affinityTags:
          - key: kuma.io/availability-zone
            weight: 1000

In this case, most of the requests will be routed to the dataplanes with the same value of the kuma.io/availability-zone tag.

https://kuma.io/docs/2.6.x/policies/meshloadbalancingstrategy/#configuring-localityaware-load-balancing-for-traffic-within-the-same-zone
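Since you are running in universal mode, you should be able to apply the policy with kumactl, e.g. kumactl apply -f policy.yaml, where policy.yaml is whatever file you save the MeshLoadBalancingStrategy above into (the file name here is just an example).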

@lukidzi lukidzi added triage/needs-information Reviewed and some extra information was asked to the reporter and removed triage/pending This issue will be looked at on the next triage meeting labels Feb 15, 2024
aniket-z (Author) commented Feb 21, 2024

@lukidzi This does not seem to take into account the number of dataplanes of the source and destination services while routing traffic, so if there is an imbalance in the number of dataplanes across availability zones in either the source or the destination service, the requests per second going to each dataplane of the destination service would be imbalanced.

For example:
source-service: 4 tasks in az-1a and 1 task in az-1b.
destination-service: 1 task in az-1a and 4 tasks in az-1b.
Assuming all dataplanes are healthy, wouldn't single task of destination-service in az-1a receive much more traffic than each task of destination-service in az-1b?

We run our workload on spot instances in AWS, so we can't ensure that the tasks of all services are equally balanced across availability zones at all times. We would still like all tasks of a given service to receive uniform traffic (similar requests per second) so that they also have similar CPU % and similar response times.

@aniket-z aniket-z changed the title allow setting envoy's locality.zone different from kuma.io/zone allow using envoy's zone aware routing Feb 21, 2024
lukidzi (Contributor) commented Feb 21, 2024

That's a correct assumption: with

source-service: 4 tasks in az-1a and 1 task in az-1b.
destination-service: 1 task in az-1a and 4 tasks in az-1b.

more traffic is routed to the local AZ, but you can configure this with the weight. What you want to achieve sounds like the default behaviour, where traffic is routed equally to all instances.

Edit:

I did more testing around zone-aware routing. I think it makes sense to add this as a second option of localZone load balancing: one is the weight-based option we already have, and the other could be zone-aware, which is a bit more complicated.

How does zone-aware LB work in Envoy? Some conditions need to be fulfilled to make it work. Based on the logic, we can configure the minimal number of healthy upstream endpoints required before zone-aware routing is used:

common_lb_config:
  zone_aware_lb_config:
    min_cluster_size: x
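For what it's worth, my reading of the Envoy docs is that zone-aware routing only kicks in when a local cluster is configured via cluster_manager.local_cluster_name, neither the originating nor the upstream cluster is in panic mode, the upstream cluster has at least min_cluster_size healthy hosts, and the originating cluster has the same number of zones as the upstream cluster (worth double-checking against the docs).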

My testing config:

admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9903 }

node:
  locality:
    zone: zone_c
cluster_manager:
  local_cluster_name: local_cluster

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 9002 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match:
                  prefix: '/'
                route:
                  cluster: server
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: local_cluster
    connect_timeout: 0.25s
    type: STATIC
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: local_cluster
      endpoints:
      - locality:
          zone: 'zone_a'
        lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 8000
      - locality:
          zone: 'zone_b'
        lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 8003
      - locality:
          zone: 'zone_c'
        lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 8004

  - name: server
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    common_lb_config:
      zone_aware_lb_config:
        min_cluster_size: 2
    load_assignment:
      cluster_name: some_service
      endpoints:
      - locality:
          zone: 'zone_a'
        lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: envoy_server_a_1
                port_value: 8001
      - locality:
          zone: 'zone_b'
        lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: envoy_server_b_1
                port_value: 8002
        - endpoint:
            address:
              socket_address:
                address: envoy_server_b_2
                port_value: 8003
      - locality:
          zone: 'zone_c'
        lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: envoy_server_c_1
                port_value: 8004
        - endpoint:
            address:
              socket_address:
                address: envoy_server_c_2
                port_value: 8005
        - endpoint:
            address:
              socket_address:
                address: envoy_server_c_3
                port_value: 8006

Depending on the number of local_cluster_name instances in the current zone, traffic might be routed only within the zone or cross-zone. E.g. when our service has 4 instances in zone-c and there are 2 instances of the destination service, we would go cross-zone, but if there are 3 instances in the destination we would stay in the same zone.
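To make that behaviour concrete, here is a rough, non-authoritative sketch of the zone-aware routing math as I read it from the Envoy docs (local vs. upstream zone-percentage comparison plus residual-capacity spillover); the exact Envoy implementation may differ in details, but it reproduces the 4/1 vs 1/4 example from this thread:

# Rough sketch of Envoy's zone-aware routing math, based on my reading of the
# Envoy docs; treat the exact spillover rule as an approximation, not the
# literal Envoy implementation.

def zone_split(origin_counts, upstream_counts, local_zone):
    """Return {zone: fraction} describing where traffic from local_zone goes."""
    origin_total = sum(origin_counts.values())
    upstream_total = sum(upstream_counts.values())
    origin_pct = {z: c / origin_total for z, c in origin_counts.items()}
    upstream_pct = {z: c / upstream_total for z, c in upstream_counts.items()}

    # If the upstream's share in our zone is at least as big as ours,
    # everything stays local.
    if upstream_pct.get(local_zone, 0.0) >= origin_pct[local_zone]:
        return {local_zone: 1.0}

    # Otherwise only part of the traffic stays local ...
    local_fraction = upstream_pct.get(local_zone, 0.0) / origin_pct[local_zone]
    split = {local_zone: local_fraction}

    # ... and the rest spills over to zones with residual capacity, i.e. zones
    # where the upstream share exceeds the originating share.
    residual = {z: max(0.0, upstream_pct[z] - origin_pct.get(z, 0.0))
                for z in upstream_pct if z != local_zone}
    residual_total = sum(residual.values())
    for z, r in residual.items():
        if residual_total > 0:
            split[z] = (1.0 - local_fraction) * r / residual_total
    return split

# The example from this thread: source 4/1, destination 1/4 across two AZs.
origin = {"az-1a": 4, "az-1b": 1}
upstream = {"az-1a": 1, "az-1b": 4}

# Assume each source task sends 1 unit of traffic per second.
load = {z: 0.0 for z in upstream}
for zone, tasks in origin.items():
    for dest_zone, fraction in zone_split(origin, upstream, zone).items():
        load[dest_zone] += tasks * fraction

for zone, total in load.items():
    print(zone, "per-task load:", total / upstream[zone])
# az-1a per-task load: 1.0
# az-1b per-task load: 1.0

With these numbers, both destination zones end up at roughly the same per-task load, which is exactly what @aniket-z is asking for.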

We could implement:

type: MeshLoadBalancingStrategy
name: local-zone-affinity-backend
mesh: mesh-1
spec:
  targetRef:
    kind: Mesh
  to:
  - targetRef:
      kind: MeshService
      name: backend
    default:
      localityAwareness:
        localZone:
          type: Weighted | ZoneAware
          zoneAware:
            # not sure if this is going to be possible, because node info is part of the
            # bootstrap config and has to be set on init, so we might need to set these
            # variables in the kuma-cp config and configure them at bootstrap request
            zoneIdentifier: topology.kubernetes.io/zone
            subZoneIdentifier: my-label.k8s.io/node
            minClusterSize: 3
          affinityTags:
          - key: kuma.io/availability-zone
            weight: 1000

Not sure if setting locality based on dynamic configuration is possible, because node info is bootstrap configuration and has to be set on init, so we might need to set these variables in the kuma-cp config and configure them at bootstrap request.

@lukidzi lukidzi added triage/pending This issue will be looked at on the next triage meeting area/policies and removed triage/needs-information Reviewed and some extra information was asked to the reporter labels Feb 21, 2024
@lukidzi lukidzi added this to the backlog milestone Feb 21, 2024
@lukidzi lukidzi removed their assignment Feb 21, 2024
@lukidzi lukidzi modified the milestones: backlog, 2.8.x Feb 21, 2024
@jakubdyszkiewicz jakubdyszkiewicz added triage/accepted The issue was reviewed and is complete enough to start working on it kind/design Design doc or related and removed triage/pending This issue will be looked at on the next triage meeting labels Feb 26, 2024
jakubdyszkiewicz (Contributor) commented:

Triage: @aniket-z would you be interested in contributing this?

@lahabana lahabana removed this from the 2.8.x milestone Apr 10, 2024
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Jul 10, 2024
github-actions bot commented:

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant, please comment on it or attend the next triage meeting.

@jakubdyszkiewicz jakubdyszkiewicz removed the triage/stale Inactive for some time. It will be triaged again label Jul 10, 2024
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Oct 9, 2024
github-actions bot commented Oct 9, 2024

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant, please comment on it or attend the next triage meeting.

@lukidzi lukidzi removed the triage/stale Inactive for some time. It will be triaged again label Oct 21, 2024
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Jan 21, 2025
github-actions bot commented:
This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant, please comment on it or attend the next triage meeting.

@lukidzi lukidzi removed the triage/stale Inactive for some time. It will be triaged again label Jan 27, 2025
kmrgirish commented:

@lukidzi / @jakubdyszkiewicz is someone working on this feature?

jakubdyszkiewicz (Contributor) commented:

I don't think so
