Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nats[2.10.21] inaccessible consumers in nats with 3 node cluster, slowness in publishing messages, high CPU and memory, idle consumers #6296

Open
ayush3890 opened this issue Dec 23, 2024 · 8 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@ayush3890
Copy link

ayush3890 commented Dec 23, 2024

Observed behavior

Experiencing extreme slowness in nats 3 nodes cluster.
Following are observations:

Output of nats consumer report stream_name:

╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                             Consumer report for stream_name with 17 consumers                                                            │
├─────────────────────────────────┬──────┬────────────┬──────────┬─────────────┬─────────────┬─────────────┬────────────┬────────────────────────────────────────────┤
│ Consumer                          │ Mode │ Ack Policy │ Ack Wait │ Ack Pending │ Redelivered │ Unprocessed │ Ack Floor  │ Cluster                                    │
├─────────────────────────────────  ┼──────┼────────────┼──────────┼─────────────┼─────────────┼─────────────┼────────────┼────────────────────────────────────────────┤
│ stream_name_consumer_C1           │ Pull │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 35,467,403 │ commonnats-0*, commonnats-1, commonnats-2  │
│ stream_name_consumer_C2           │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 35,454,304 │ commonnats-0*, commonnats-1, commonnats-2  │
│ stream_name_consumer_C3           │ Pull │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 29,837,246 │ commonnats-0, commonnats-1, commonnats-2*  │
│ stream_name_consumer_C4           │ Pull │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 19,441,635 │ commonnats-0, commonnats-1*, commonnats-2  │
│ stream_name_consumer_C5           │ Pull │ Explicit   │ 30.00s   │ 3           │ 425         │ 6,011 / 3%  │ 35,458,893 │ commonnats-0*, commonnats-1, commonnats-2  │
│ stream_name_consumer_C5           │ Pull │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 0          │ commonnats-0*, commonnats-1, commonnats-2! │
│ stream_name_consumer_C6           │ Pull │ Explicit   │ 10m0s    │ 0           │ 0           │ 0           │ 0          │ commonnats-0!, commonnats-1, commonnats-2* │
│ stream_name_consumer_C7           │ Pull │ Explicit   │ 10m0s    │ 0           │ 0           │ 0           │ 0          │ commonnats-0*, commonnats-1, commonnats-2! │
│ stream_name_consumer_C8           │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 35,454,281 │ commonnats-0, commonnats-1, commonnats-2*  │
│ stream_name_consumer_C9           │ Pull │ Explicit   │ 30.00s   │ 0           │ 10          │ 0           │ 35,454,322 │ commonnats-0, commonnats-1, commonnats-2*  │
│ stream_name_consumer_C10          │ Pull │ Explicit   │ 10m0s    │ 0           │ 0           │ 0           │ 0          │ commonnats-0!, commonnats-1*, commonnats-2 │
│ stream_name_consumer_C11          │ Pull │ Explicit   │ 10m0s    │ 0           │ 0           │ 0           │ 0          │ commonnats-0*, commonnats-1, commonnats-2  │
│ stream_name_consumer_C12          │ Pull │ Explicit   │ 10m0s    │ 0           │ 0           │ 0           │ 19,441,635 │ commonnats-0*, commonnats-1, commonnats-2  │
│ stream_name_consumer_C13          │ Pull │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 29,701,253 │ commonnats-0, commonnats-1, commonnats-2*  │



╭─────────────────────────────────────────────────────────────────────────────────────────   ╮
│                                  Inaccessible Consumers                                    │
├────────────────────────   ┬────────────────────────────────────┬────────────────────────┬──┤
│ stream_name_consumer_CF1  │ stream_name_consumer_CF2           │ stream_name_consumer_CF3  │
╰────────────────────────   ┴────────────────────────────────────┴────────────────────────┴──╯

Output of nats server ls from all 3 nodes:

➜  ~ nats server ls --context=n0
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                            Server Overview                                                            │
├──────────────┬────────────┬──────┬─────────┬─────┬───────┬───────┬────────┬─────┬─────────┬───────┬───────┬──────┬────────────┬───────┤
│ Name         │ Cluster    │ Host │ Version │ JS  │ Conns │ Subs  │ Routes │ GWs │ Mem     │ CPU % │ Cores │ Slow │ Uptime     │ RTT   │
├──────────────┼────────────┼──────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────────┼───────┤
│ commonnats-0 │ commonnats │ 0    │ 2.10.21 │ yes │ 8     │ 1,129 │      8 │   0 │ 3.4 GiB │ 200   │     8 │    0 │ 32d1h1m51s │ 220ms │
│ commonnats-1 │ commonnats │ 0    │ 2.10.21 │ yes │ 1     │ 1,130 │      8 │   0 │ 2.9 GiB │ 204   │     8 │    0 │ 5d2h22m38s │ 395ms │
├──────────────┼────────────┼──────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────────┼───────┤
│              │ 1          │ 2    │         │ 2   │ 9     │ 2,259 │        │     │ 6.3 GIB │       │       │    0 │            │       │
╰──────────────┴────────────┴──────┴─────────┴─────┴───────┴───────┴────────┴─────┴─────────┴───────┴───────┴──────┴────────────┴───────╯

╭───────────────────────────────────────────────────────────────────────────────╮
│                                Cluster Overview                               │
├────────────┬────────────┬───────────────────┬───────────────────┬─────────────┤
│ Cluster    │ Node Count │ Outgoing Gateways │ Incoming Gateways │ Connections │
├────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ commonnats │          2 │                 0 │                 0 │           9 │
├────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│            │          2 │                 0 │                 0 │           9 │
╰────────────┴────────────┴───────────────────┴───────────────────┴─────────────╯
➜  ~ nats server ls --context=n1
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                            Server Overview                                                            │
├──────────────┬────────────┬──────┬─────────┬─────┬───────┬───────┬────────┬─────┬─────────┬───────┬───────┬──────┬────────────┬───────┤
│ Name         │ Cluster    │ Host │ Version │ JS  │ Conns │ Subs  │ Routes │ GWs │ Mem     │ CPU % │ Cores │ Slow │ Uptime     │ RTT   │
├──────────────┼────────────┼──────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────────┼───────┤
│ commonnats-1 │ commonnats │ 0    │ 2.10.21 │ yes │ 3     │ 1,131 │      8 │   0 │ 2.9 GiB │ 102   │     8 │    0 │ 5d2h23m20s │ 31ms  │
│ commonnats-2 │ commonnats │ 0    │ 2.10.21 │ yes │ 8     │ 1,132 │      8 │   0 │ 715 MiB │ 200   │     8 │    0 │ 32d1h2m33s │ 64ms  │
│ commonnats-0 │ commonnats │ 0    │ 2.10.21 │ yes │ 7     │ 1,130 │      8 │   0 │ 3.0 GiB │ 204   │     8 │    0 │ 32d1h2m33s │ 163ms │
├──────────────┼────────────┼──────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────────┼───────┤
│              │ 1          │ 3    │         │ 3   │ 18    │ 3,393 │        │     │ 6.6 GIB │       │       │    0 │            │       │
╰──────────────┴────────────┴──────┴─────────┴─────┴───────┴───────┴────────┴─────┴─────────┴───────┴───────┴──────┴────────────┴───────╯

╭───────────────────────────────────────────────────────────────────────────────╮
│                                Cluster Overview                               │
├────────────┬────────────┬───────────────────┬───────────────────┬─────────────┤
│ Cluster    │ Node Count │ Outgoing Gateways │ Incoming Gateways │ Connections │
├────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ commonnats │          3 │                 0 │                 0 │          18 │
├────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│            │          3 │                 0 │                 0 │          18 │
╰────────────┴────────────┴───────────────────┴───────────────────┴─────────────╯
➜  ~ nats server ls --context=n2
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                            Server Overview                                                            │
├──────────────┬────────────┬──────┬─────────┬─────┬───────┬───────┬────────┬─────┬─────────┬───────┬───────┬──────┬────────────┬───────┤
│ Name         │ Cluster    │ Host │ Version │ JS  │ Conns │ Subs  │ Routes │ GWs │ Mem     │ CPU % │ Cores │ Slow │ Uptime     │ RTT   │
├──────────────┼────────────┼──────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────────┼───────┤
│ commonnats-2 │ commonnats │ 0    │ 2.10.21 │ yes │ 8     │ 1,130 │      8 │   0 │ 724 MiB │ 201   │     8 │    0 │ 32d1h2m35s │ 131ms │
├──────────────┼────────────┼──────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────────┼───────┤
│              │ 1          │ 1    │         │ 1   │ 8     │ 1,130 │        │     │ 724 MIB │       │       │    0 │            │       │
╰──────────────┴────────────┴──────┴─────────┴─────┴───────┴───────┴────────┴─────┴─────────┴───────┴───────┴──────┴────────────┴───────╯

╭───────────────────────────────────────────────────────────────────────────────╮
│                                Cluster Overview                               │
├────────────┬────────────┬───────────────────┬───────────────────┬─────────────┤
│ Cluster    │ Node Count │ Outgoing Gateways │ Incoming Gateways │ Connections │
├────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ commonnats │          1 │                 0 │                 0 │           8 │
├────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│            │          1 │                 0 │                 0 │           8 │
╰────────────┴────────────┴───────────────────┴───────────────────┴─────────────╯
➜  ~

Output of nats server report jetstream:

➜  ~ nats server report jetstream --context=n0
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                JetStream Summary                                                │
├──────────────┬────────────┬─────────┬───────────┬──────────┬─────────┬────────┬─────────┬─────────┬─────────────┤
│ Server       │ Cluster    │ Streams │ Consumers │ Messages │ Bytes   │ Memory │ File    │ API Req │ API Err     │
├──────────────┼────────────┼─────────┼───────────┼──────────┼─────────┼────────┼─────────┼─────────┼─────────────┤
│ commonnats-2 │ commonnats │ 8       │ 43        │ 182,288  │ 121 MiB │ 0 B    │ 121 MiB │ 632,900 │ 84 / 0.013% │
├──────────────┼────────────┼─────────┼───────────┼──────────┼─────────┼────────┼─────────┼─────────┼─────────────┤
│              │            │ 8       │ 43        │ 182,288  │ 121 MIB │ 0 B    │ 121 MIB │ 632,900 │ 84          │
╰──────────────┴────────────┴─────────┴───────────┴──────────┴─────────┴────────┴─────────┴─────────┴─────────────╯


WARNING: No cluster meta leader found. The cluster expects 3 nodes but only 1 responded. JetStream operation require at least 2 up nodes.
➜  ~ nats server report jetstream --context=n1
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                JetStream Summary                                                │
├──────────────┬────────────┬─────────┬───────────┬──────────┬─────────┬────────┬─────────┬─────────┬─────────────┤
│ Server       │ Cluster    │ Streams │ Consumers │ Messages │ Bytes   │ Memory │ File    │ API Req │ API Err     │
├──────────────┼────────────┼─────────┼───────────┼──────────┼─────────┼────────┼─────────┼─────────┼─────────────┤
│ commonnats-2 │ commonnats │ 8       │ 43        │ 182,285  │ 121 MiB │ 0 B    │ 121 MiB │ 632,904 │ 84 / 0.013% │
├──────────────┼────────────┼─────────┼───────────┼──────────┼─────────┼────────┼─────────┼─────────┼─────────────┤
│              │            │ 8       │ 43        │ 182,285  │ 121 MIB │ 0 B    │ 121 MIB │ 632,904 │ 84          │
╰──────────────┴────────────┴─────────┴───────────┴──────────┴─────────┴────────┴─────────┴─────────┴─────────────╯


WARNING: No cluster meta leader found. The cluster expects 3 nodes but only 1 responded. JetStream operation require at least 2 up nodes.
➜  ~ nats server report jetstream --context=n2
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                JetStream Summary                                                │
├──────────────┬────────────┬─────────┬───────────┬──────────┬─────────┬────────┬─────────┬─────────┬─────────────┤
│ Server       │ Cluster    │ Streams │ Consumers │ Messages │ Bytes   │ Memory │ File    │ API Req │ API Err     │
├──────────────┼────────────┼─────────┼───────────┼──────────┼─────────┼────────┼─────────┼─────────┼─────────────┤
│ commonnats-2 │ commonnats │ 8       │ 43        │ 182,284  │ 121 MiB │ 0 B    │ 121 MiB │ 632,906 │ 84 / 0.013% │
├──────────────┼────────────┼─────────┼───────────┼──────────┼─────────┼────────┼─────────┼─────────┼─────────────┤
│              │            │ 8       │ 43        │ 182,284  │ 121 MIB │ 0 B    │ 121 MIB │ 632,906 │ 84          │
╰──────────────┴────────────┴─────────┴───────────┴──────────┴─────────┴────────┴─────────┴─────────┴─────────────╯


WARNING: No cluster meta leader found. The cluster expects 3 nodes but only 1 responded. JetStream operation require at least 2 up nodes.

Output of nats stream subjects:

╭─────────────────────────────────────╮
│      6 Subjects in stream cy_co     │
├────────────────────────────┬────────┤
│ Subject                    │ Count  │
├────────────────────────────┼────────┤
│ S1                        │ 1      │
│ S2                        │ 21     │
│ S3                        │ 1,322  │
│ S4                        │ 19,755 │
│ S5                        │ 53,608 │
│ S5                        │ 85,620 │
╰────────────────────────────┴────────╯

Sudden increase in Cluster NTT
image

Sudden increase in Poll Time
image

CPU usage:
image

Memory usage:
image

Resource and limit:

      resources:
        limits:
          cpu: '2'
          memory: 6Gi
        requests:
          cpu: '2'
          memory: 6Gi

Env variables:

        - name: GOMEMLIMIT
          value: 5GiB

Nats configuration:

{
  "accounts": {
    "acc1": {
      "jetstream": "enabled",
      "users": [
        {
          "password": <PASS>,
          "user": "<USER>"
        }
      ]
    },
    "SYS": {
      "users": [
        {
          "password": <ADMIN_PASS>,
          "user": "<ADMIN_USER>"
        }
      ]
    }
  },
  "cluster": {
    "name": "commonnats",
    "no_advertise": true,
    "port": 6222,
    "routes": [
      "nats://commonnats-0.commonnats-headless:6222",
      "nats://commonnats-1.commonnats-headless:6222",
      "nats://commonnats-2.commonnats-headless:6222"
    ]
  },
  "debug": true,
  "http_port": 8222,
  "jetstream": {
    "max_file_store": 150Gi,
    "max_memory_store": 0,
    "store_dir": "/data"
  },
  "lame_duck_duration": "30s",
  "lame_duck_grace_period": "10s",
  "pid_file": "/var/run/nats/nats.pid",
  "ping_interval": "2m",
  "ping_max": 2,
  "port": 4222,
  "server_name": $SERVER_NAME,
  "system_account": "SYS"
}

Expected behavior

Under load, nats should deliver messages to consumer without any slowness

Server and client version

Nats Server Version 2.10.21
Nats Client Version 1.37.0 (Golang Application)

Host environment

  1. K8's cluster
  2. helm chart:
  - repoURL: 'https://nats-io.github.io/k8s/helm/charts/'
    targetRevision: 1.2.5
    chart: nats

Steps to reproduce

No steps to reproduce. Happening randomly. Happened 3 times in last 3 months

  • I am available to hop on a quick call to get this issue debugged
@ayush3890 ayush3890 added the defect Suspected defect such as a bug or regression label Dec 23, 2024
@derekcollison
Copy link
Member

What does nats traffic show?

@ayush3890
Copy link
Author

Output of nats --context=n0 traffic

╭─────────────────────────────────────────────────────────────────╮
│                           Raft Traffic                          │
├──────────┬──────┬────────┬─────────────┬───────┬────────────────┤
│ Proposal │ Vote │ Append │ Remove Peer │ Reply │ Total Messages │
├──────────┼──────┼────────┼─────────────┼───────┼────────────────┤
│ 0        │ 0    │ 0      │ 0           │ 0     │ 0              │
│ 0        │ 0    │ 256    │ 0           │ 522   │ 778            │
│ 0        │ 0    │ 65     │ 0           │ 137   │ 202            │
│ 0        │ 0    │ 31     │ 0           │ 40    │ 71             │
│ 0        │ 0    │ 140    │ 0           │ 266   │ 406            │
│ 0        │ 0    │ 63     │ 0           │ 153   │ 216            │
│ 0        │ 0    │ 169    │ 0           │ 344   │ 513            │
│ 0        │ 0    │ 129    │ 0           │ 272   │ 401            │
│ 0        │ 0    │ 109    │ 0           │ 214   │ 323            │
│ 0        │ 0    │ 124    │ 0           │ 264   │ 388            │
│ 0        │ 0    │ 106    │ 0           │ 211   │ 317            │
╰──────────┴──────┴────────┴─────────────┴───────┴────────────────╯

╭─────────────────────────────────────────────────────────────────────────────────╮
│                                 Cluster Traffic                                 │
├────────────┬─────────────┬───────────────┬─────────────┬───────┬────────────────┤
│ JSA Update │ Stream Info │ Consumer Info │ Stream Sync │ Reply │ Total Messages │
├────────────┼─────────────┼───────────────┼─────────────┼───────┼────────────────┤
│ 0          │ 0           │ 0             │ 0           │ 0     │ 0              │
│ 6          │ 0           │ 17            │ 0           │ 0     │ 23             │
│ 1          │ 0           │ 0             │ 0           │ 0     │ 1              │
│ 1          │ 0           │ 0             │ 0           │ 0     │ 1              │
│ 5          │ 0           │ 0             │ 0           │ 0     │ 5              │
│ 2          │ 0           │ 0             │ 0           │ 0     │ 2              │
│ 5          │ 0           │ 17            │ 0           │ 0     │ 22             │
│ 4          │ 0           │ 17            │ 0           │ 0     │ 21             │
│ 2          │ 0           │ 0             │ 0           │ 0     │ 2              │
│ 3          │ 0           │ 0             │ 0           │ 0     │ 3              │
│ 4          │ 0           │ 0             │ 0           │ 0     │ 4              │
╰────────────┴─────────────┴───────────────┴─────────────┴───────┴────────────────╯

╭───────────────────────────────────────────────────────────────────────────╮
│                              General Traffic                              │
├──────────┬────────┬────────┬────────┬──────┬─────────────┬────────────────┤
│ Requests │ JS API │ JS ACK │ System │ Rest │ Total Bytes │ Total Messages │
├──────────┼────────┼────────┼────────┼──────┼─────────────┼────────────────┤
│ 0        │ 0      │ 0      │ 0      │ 0    │ 0 B         │ 0              │
│ 0        │ 0      │ 0      │ 18     │ 5    │ 69 KiB      │ 23             │
│ 0        │ 0      │ 0      │ 2      │ 0    │ 8.5 KiB     │ 2              │
│ 0        │ 0      │ 0      │ 3      │ 0    │ 4.7 KiB     │ 3              │
│ 0        │ 1      │ 0      │ 4      │ 1    │ 22 KiB      │ 6              │
│ 0        │ 2      │ 0      │ 0      │ 2    │ 23 KiB      │ 4              │
│ 0        │ 0      │ 0      │ 15     │ 2    │ 47 KiB      │ 17             │
│ 0        │ 3      │ 0      │ 9      │ 2    │ 25 KiB      │ 14             │
│ 0        │ 0      │ 0      │ 6      │ 3    │ 42 KiB      │ 9              │
│ 0        │ 0      │ 0      │ 5      │ 1    │ 32 KiB      │ 6              │
│ 0        │ 0      │ 0      │ 4      │ 0    │ 17 KiB      │ 4              │
╰──────────┴────────┴────────┴────────┴──────┴─────────────┴────────────────╯

@derekcollison
Copy link
Member

With zero votes (which is good), something else must be going on with your setup to cause the system to not elect a meta-leader.

@ayush3890
Copy link
Author

@derekcollison is there a list of check points which I can run to identify root cause. Following things I have validated:

  1. system resources available (cpu and memory)
  2. traceroute is able to resolve network from 1 pod to other 2. And vice versa.

@derekcollison
Copy link
Member

It may be that you are running out of CPU. Try giving them more CPU, say 4 or 8, and possibly bumping memory as well, 8GB.

@ayush3890
Copy link
Author

Few more observations:

  1. I have stopped all subscription of all consumers and publishers to nats. Even then CPU(100% of allocated) and memory are high. I need to understand what is it processing.
  2. There are no error logs in all 3 pods related to, not able to elect nats leader or lost quorum.

In my past experience with this issue, restart has resolved the issue temporarily. Allocating more cpu/memory may solve the issue for while.
Please let know if any more artefact can be extracted from this cluster which may help finding the root cause of the issue.

Thanks

@derekcollison
Copy link
Member

I would also upgrade to the latest server version and make sure you still see any issues - 2.10.24.

@wallyqs
Copy link
Member

wallyqs commented Dec 24, 2024

I would recommend at least 4 vcpus to avoid long unexpected pauses, with 2vcpus you would often seen Internal subscription on %q took too long in your logs:

     resources:
        limits:
          cpu: '2'
          memory: 6Gi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect Suspected defect such as a bug or regression
Projects
None yet
Development

No branches or pull requests

3 participants