
[1.x] System under heavy load stops handling calls via WS API #3358

Closed
spscream opened this issue Apr 18, 2024 · 22 comments

Labels
multistream (Related to Janus 1.x), pr-exists

Comments

@spscream
Contributor

What version of Janus is this happening on?
master

Have you tested a more recent version of Janus too?
We use master.

Was this working before?
We had no such issue on older versions, but those had a memory-leak problem, so we occasionally hit OOM.

Is there a gdb or libasan trace of the issue?
No

Under some circumstances Janus stops handling any API calls via WebSockets: the WebSocket transport still accepts connections, but create_session requests time out and Janus doesn't work. We tried enabling debug_locks, but then the Janus server ate all memory and I/O and got stuck, so we ran into performance problems before the issue even appeared. This happens at one of our customers with a very heavy load on their Janus instances.

How can we debug this, and what could be the cause?

https://gist.github.com/spscream/84aa7bca6f8e3f43e07d4c58f414e9cd - a recent log of such a situation with debug_level = 4

spscream added the multistream (Related to Janus 1.x) label Apr 18, 2024
@spscream
Contributor Author

Could we have a partial debug_locks implementation that covers only the locks involved in session creation, to help debug this?

@atoppi
Member

atoppi commented Apr 18, 2024

Under some circumstances Janus stops handling any API calls via WebSockets

Do other requests work though? E.g. an "info" request, or attaching a handle for a different plugin. That could help us understand whether the deadlock lies in the transport (WebSockets) or somewhere else.

but create_session requests time out and Janus doesn't work

Do you mean plugin->create_session? If that's the case and you are using the videoroom plugin, then it's very likely a deadlock on the videoroom sessions_mutex.

We tried enabling debug_locks, but then the Janus server ate all memory and I/O and got stuck, so we ran into performance problems before the issue even appeared

debug_locks has a massive impact on verbosity; that's probably the reason for the performance issue. If you are running Janus in a containerized environment with cgroups v2, the constantly growing log file might increase the memory accounted to the container (due to pages being kept in the buffer cache) and might explain the OOM.

Could we have a partial debug_locks implementation that covers only the locks involved in session creation, to help debug this?

There is no such option available. You might try customizing the code to log just the sessions_mutex lock/unlock calls (assuming my previous guess is correct).
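
For illustration only, a minimal sketch of what such targeted logging could look like; this is not actual Janus code, just plain pthread calls wrapped in logging macros that you could adapt to the sessions_mutex call sites:

#include <pthread.h>
#include <stdio.h>

/* Log every lock/unlock of one specific mutex, with file/line context. */
#define LOGGED_LOCK(m) do { \
        fprintf(stderr, "[%s:%d] locking " #m "\n", __FILE__, __LINE__); \
        pthread_mutex_lock(m); \
        fprintf(stderr, "[%s:%d] locked " #m "\n", __FILE__, __LINE__); \
} while(0)

#define LOGGED_UNLOCK(m) do { \
        pthread_mutex_unlock(m); \
        fprintf(stderr, "[%s:%d] unlocked " #m "\n", __FILE__, __LINE__); \
} while(0)

If the server hangs, the last "locking" line without a matching "locked" line points at the call site that is stuck waiting on that mutex.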

@spscream
Contributor Author

Do you mean plugin->create_session? If that's the case and you are using the videoroom plugin, then it's very likely a deadlock on the videoroom sessions_mutex.

I mean that the following request isn't working:

{
        "janus" : "create",
        "transaction" : "<random alphanumeric string>"
}

Do other requests work though? E.g. an "info" request, or attaching a handle for a different plugin. That could help us understand whether the deadlock lies in the transport (WebSockets) or somewhere else.

I didn't try this, but I can try the next time it happens.

@atoppi
Member

atoppi commented Apr 18, 2024

Taking a deeper look at the logs you shared, the issue seems to start with some errors:

Didn't receive audio for more than 1 second(s)...
Didn't receive video for more than 1 second(s)...
SRTCP unprotect error: srtp_err_status_replay_fail
Error setting ICE locally

Those remind me of situations where the host memory is being exhausted (like the cgroups v2 issue I already mentioned).

Are you running janus in containers with a memory limit?
Are you doing long lasting recordings?

If you suspect a memory leak, try running your janus app in a lab environment under valgrind.

@spscream
Contributor Author

Are you running janus in containers with a memory limit?

We are running it in containers, but without memory limits set.

Are you doing long lasting recordings?

Yes.

@atoppi
Member

atoppi commented Apr 18, 2024

All right, this is a long shot, but can you check the status of the memory in the containers?

cat /sys/fs/cgroup/system.slice/docker-<long ID>.scope/memory.max
cat /sys/fs/cgroup/system.slice/docker-<long ID>.scope/memory.stat

Replace <long ID> with the ID of the Docker container.

If you see the file bytes amount approaching the memory.max limit, then containers under cgroups v2 will start having issues with memory allocation, since the same memory accounting covers both network and file buffers.
We have proposed a potential patch (in the PR) for long lasting recordings where we basically flush the buffers and tell the kernel to release the used pages every ~2MB of written data.
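
For context, a simplified sketch of the idea behind that kind of patch (not the actual PR code; recording_write and the threshold name are made up for illustration): after every ~2MB written to a recording file, flush and sync it, then tell the kernel it may drop the cached pages so they stop counting against the cgroup memory limit.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define FLUSH_THRESHOLD (2 * 1024 * 1024)

static size_t written_since_flush = 0;

static void recording_write(FILE *file, const void *buf, size_t len) {
        fwrite(buf, 1, len, file);
        written_since_flush += len;
        if(written_since_flush >= FLUSH_THRESHOLD) {
                int fd = fileno(file);
                fflush(file);                                  /* push stdio buffers to the kernel */
                fdatasync(fd);                                 /* make sure the pages are clean */
                posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);  /* let the kernel drop the cached pages */
                written_since_flush = 0;
        }
}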

@spscream
Contributor Author

We got this on another customer too. We are also running this in our production, so if it happens on our own servers we will enable debug_locks; since we don't have as heavy a load as the customers' servers, we will try to catch it there.

@spscream
Contributor Author

Got this info from our customer, after a Janus restart:

# cat /sys/fs/cgroup/system.slice/docker-9efbad4747099c922f9d8c1dc16a74e029e871a1e21cdcea1b5b8bfc7de47546.scope/memory.max
max
# cat /sys/fs/cgroup/system.slice/docker-9efbad4747099c922f9d8c1dc16a74e029e871a1e21cdcea1b5b8bfc7de47546.scope/memory.stat
anon 21372928
file 4861952
kernel_stack 294912
pagetables 131072
percpu 288
sock 0
shmem 0
file_mapped 0
file_dirty 0
file_writeback 0
swapcached 0
anon_thp 0
file_thp 0
shmem_thp 0
inactive_anon 21368832
active_anon 4096
inactive_file 4861952
active_file 0
unevictable 0
slab_reclaimable 444944
slab_unreclaimable 357096
slab 802040
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
pgfault 48048
pgmajfault 0
pgrefill 0
pgscan 0
pgsteal 0
pgactivate 0
pgdeactivate 0
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 0
thp_collapse_alloc 0

@spscream
Contributor Author

We will try to reproduce it in our lab by enabling recording and setting memory limits on the Docker container.

@spscream
Contributor Author

spscream commented May 2, 2024

We are unable to reproduce it on our test stand, but it consistently repeats on customers' environments.

@atoppi
Member

atoppi commented May 3, 2024

Got this info from our customer, after a Janus restart

Those data are useless after a restart; we need them while the issue exists (btw, the max output is wrong, it should be a number).

We are unable to reproduce it on our test stand, but it consistently repeats on customers' environments.

If you suspect a deadlock, wait for the issue to occur and then provide the output of this gdb snippet:

gdb -p "$(pidof janus)" --batch -ex "set print pretty on" -ex "set pagination off" -ex "thread apply all bt full"

@spscream
Contributor Author

Got a new log for the issue. Unfortunately the customer's Janus runs in Docker with autoheal, so we can't get gdb output yet, but we got something strange in the log:

janus-1  | 2024-05-15T08:41:17.866180063Z Stopping server, please wait...
janus-1  | 2024-05-15T08:41:17.866215282Z Ending sessions timeout watchdog...
janus-1  | 2024-05-15T08:41:17.866220027Z Sessions watchdog stopped
janus-1  | 2024-05-15T08:41:17.866348571Z Closing transport plugins:
janus-1  | 2024-05-15T08:41:17.866984830Z WebSockets thread ended

Full log attached; the restart of Janus by autoheal happened at 2024-05-15T08:41:28.
janus-20240515.txt

@lminiero
Member

I see a whole bunch of Error setting ICE locally errors, which suggests a possibly broken management of handles and PeerConnections on the client side. A new PeerConnection is attempted on a handle that already has one (which seems to be confirmed by the mentions of a restart: that's what Janus thinks it is, because it gets an SDP with unknown ICE credentials).

@spscream
Contributor Author

I see a whole bunch of Error setting ICE locally errors, which suggests a possibly broken management of handles and PeerConnections on the client side. A new PeerConnection is attempted on a handle that already has one (which seems to be confirmed by the mentions of a restart: that's what Janus thinks it is, because it gets an SDP with unknown ICE credentials).

It seems to happen after Janus restarts. Our clients are not connected directly to Janus over WebSockets and don't manage Janus handles; we use server-side WebSocket connections to Janus and server-side management of handles. It now seems the clients tried to ICE-restart their subscriber connections after the Janus restart, but we don't handle Janus restarts correctly on the server yet: we don't detect them, we don't recreate the necessary rooms, and the logic in our app currently doesn't recover from such a situation.

@spscream
Contributor Author

We found a leak of Janus sessions in our code.
gdb shows a lot of mixer threads open, and we are investigating this now.

Is it possible that a large number of open Janus sessions causes something like an overflow and leads Janus to stop accepting new ones?

@atoppi
Member

atoppi commented May 28, 2024

Is it possible that a large number of open Janus sessions causes something like an overflow and leads Janus to stop accepting new ones?

This is something we never tested. There are a lot of maps and structs involved (relying on integer indexes) when setting up new sessions, either in the core or in the plugins.
Anyway, we'd be very interested in your findings, since this is basically an undefined scenario.

@spscream
Contributor Author

We fixed the session leaks, but it doesn't help with the Janus hangs: today Janus stopped accepting connections on our production, and it also hung four times today at one of our customers.
I got a dump of the threads with gdb today from our production environment.
janus.log

@lminiero
Member

@spscream I just pushed a PR that contains a patch we prepared for one of our customers some time ago, where they were experiencing a deadlock in the AudioBridge, especially when closing PeerConnections. Due to the way we use some atomics for locking purposes, there was a small chance of inverted locks that could cause a deadlock. Your latest log now seems to indicate the same problem, so I've released the patch as a PR so that you can test that too. Please let us know if that helps, as we were waiting for feedback from them too before merging it.
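
For readers who wonder what inverted locks look like in practice, here is a minimal illustration (not the actual AudioBridge code): two threads taking the same two mutexes in opposite order can each end up waiting for the lock the other one holds.

#include <pthread.h>

static pthread_mutex_t mutex_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t mutex_b = PTHREAD_MUTEX_INITIALIZER;

static void *thread_1(void *arg) {
        pthread_mutex_lock(&mutex_a);
        pthread_mutex_lock(&mutex_b);   /* blocks forever if thread_2 already holds mutex_b */
        /* ... critical section ... */
        pthread_mutex_unlock(&mutex_b);
        pthread_mutex_unlock(&mutex_a);
        return NULL;
}

static void *thread_2(void *arg) {
        pthread_mutex_lock(&mutex_b);
        pthread_mutex_lock(&mutex_a);   /* blocks forever if thread_1 already holds mutex_a */
        /* ... critical section ... */
        pthread_mutex_unlock(&mutex_a);
        pthread_mutex_unlock(&mutex_b);
        return NULL;
}

The usual fix is to always acquire the locks in a consistent order, or to release one before taking the other.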

@lminiero
Member

PS: this is a patch @atoppi made, I just pushed it as a PR (credit where credit's due!)

@lminiero
Member

@spscream any update on the PR? Did it help?

@spscream
Contributor Author

It looks like it helps; we have had no problems with it.

@lminiero
Member

Ack, thanks! We merged the PR, and we'll close this issue then.
