Bug: Extracting large archive causes OOMs #1204
I've been working on this for the last week:
🔴 Experiment: repro
https://gist.github.com/petuhovskiy/8c44e81abbf835d2c0c514fce3268343

We can see several things here. Before the load, we have memory usage like this:
After the load starts, free pages quickly go down in just a few seconds. Movable memory is used first, but then DMA32 also gets used.
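The gist has the full numbers; per-zone usage like this can be watched from procfs. A sketch of such commands, not necessarily the exact ones used in the experiment:

```bash
# Per-zone free pages (DMA32 / Normal / Movable), sampled every second.
watch -n 1 "grep -E 'zone|nr_free_pages' /proc/zoneinfo"

# Free pages per zone broken down by allocation order (fragmentation view).
cat /proc/buddyinfo
```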
Over time, there is less and less free memory, and more cached/slab memory. At the same time, available memory stays high. Then, this error happens during memory hotplug (at ~7e5 files created):
And then, after some time, oom-killer wakes up:
🟢 Experiment: clearing caches
https://gist.github.com/petuhovskiy/fb7a1b24faa745384e3fe7a0be2894eb

What if we do this every 500ms?
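The exact snippet isn't inlined above, but a cache-dropping loop along these lines matches the description, assuming the standard /proc/sys/vm/drop_caches interface (writing 3 drops both the page cache and the dentry/inode caches):

```bash
# Flush dirty pages to disk, then drop the page cache and the
# dentry/inode caches, twice per second.
while true; do
    sync
    echo 3 > /proc/sys/vm/drop_caches
    sleep 0.5
done
```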
It turns out there are no OOMs in this case! And also no vmemmap alloc failures. Because caches are dropped very often, there are always plenty of free pages, and cached memory stays relatively low.

So it looks like the issue is related to caches? And the OOMs are probably caused by Linux's inability to drop caches when a memory allocation needs them? In the next experiments, I try to tune Linux virtual memory settings and hope that it helps.

🔴 Experiment: vm.vfs_cache_pressure=200
No significant changes from the original repro.

🔴 Experiment: vm.swappiness=100
No significant changes from the original repro.
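For reference, both settings above can be changed at runtime with sysctl; a sketch, the experiments may have applied them differently (e.g. via sysctl.conf):

```bash
# Reclaim dentry/inode caches more aggressively (default is 100).
sysctl -w vm.vfs_cache_pressure=200

# Reclaim/swap anonymous pages more willingly (default is 60).
sysctl -w vm.swappiness=100
```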
🔴 Experiment: other settings changes
I also tried:

When these settings affected zone watermarks, there were changes in free pages, but other than that there was almost no difference from the original repro. These runs still had oom-killer kills and vmemmap alloc failures. So this issue doesn't look like it can be fixed by just tuning configs.

🟢 Experiment: autoscaling-enabled: false
https://gist.github.com/petuhovskiy/64e33ffb12698463b2ed19b86a801d7e

When the VM is started, it uses 0.5CU. It means memory is never hotplugged. And it looks like there are no OOMs when there are no hotplugs! Initial memory usage looks the same:
But then, it just stays like this. It means that without hotplugs, memory stays stable and there are no OOMs and no vmemmap errors. So now it looks like this issue/bug is related to hotplug?

🔴 Experiment: DIMMHotplug instead of virtio-mem
Autoscaling is enabled, so the only difference from the original repro is using DIMMHotplug instead of virtio-mem. And it looks like there is no difference: there are still vmemmap errors and oom kills. Even the stacktrace looks almost the same:
That means the issue is not virtio-mem related, but instead related to how memory hotplug works in Linux in general.

Info on memory hotplug
We can find out some details by looking at the vmemmap error:
What ChatGPT says about this:
So it looks like initial […]. When searching […].

The next step for this issue is probably to find more related discussions on the internet. In the end, this issue turned out to be not as simple as I thought; maybe it's a good time to create an investigation page in Notion now.
Decided not to create a separate page in Notion; posting updates in this issue instead.

First of all, I switched to a newer 6.9.12 kernel, which reproduces the problem without adding any new issues. Then I looked into the source code.

vmemmap alloc failure
My conclusion is that […]

memmap_on_memory
We had an idea of trying memmap_on_memory. I tried enabling it, and it seems that it's not supported for virtio-mem, but it is supported for ACPI hotplug. And when I enabled it with […], it gave another confirmation that […]
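For context, memmap_on_memory is a parameter of the kernel's memory_hotplug module. A sketch of the usual way to enable and inspect it, which isn't necessarily how it was configured in this experiment:

```bash
# Boot-time kernel parameter: place the memmap (struct pages) of a
# hotplugged memory block on the block itself, instead of allocating it
# from already-online kernel memory (DMA32/Normal).
#   memory_hotplug.memmap_on_memory=1

# Check whether it is enabled at runtime:
cat /sys/module/memory_hotplug/parameters/memmap_on_memory
```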
trying different configs
Then I tried to reproduce this issue without autoscaling. The issue reproduces with a static configuration of 1CPU/4GiB RAM. It means the issue is not tied to memory hotplug and can be reproduced with an exact memory configuration, without autoscaling. The only difference between the cases above is the size of Movable memory.

It's not clear why Linux is bad at memory management when there's more Movable memory, especially because we don't have any drivers requiring DMA (direct memory access), so almost all of the memory should be movable. And in this particular case, we just have a lot of caches. From now on, I will continue to use this config for debugging this issue:
more logs for 1CU
https://gist.github.com/petuhovskiy/cb49886a50992b8841fed8999f7b9a11

Let's take a look at zone stats over time:
A couple of observations:
kernel source
I've been reading kernel docs and sources recently, leaving some useful links here:
https://docs.kernel.org/admin-guide/mm/multigen_lru.html => "optimizes page reclaim and improves performance under memory pressure."

I also looked into the sources for memory allocation. These are the main allocation functions:

What happens:
I can patch any of these functions to add more debug info (see the tracing sketch below). For the next step, I probably want to understand why we don't use the Movable section while it clearly has many free pages available. If anyone has more ideas on what I should try next, you are welcome to share!
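As a lighter-weight alternative to patching, the page allocator already exposes tracepoints; a sketch using ftrace, assuming tracefs is mounted at /sys/kernel/tracing:

```bash
# Page-allocator tracepoints: log every allocation with its order, gfp
# flags and migratetype, plus direct-reclaim entry points.
cd /sys/kernel/tracing
echo 1 > events/kmem/mm_page_alloc/enable
echo 1 > events/vmscan/mm_vmscan_direct_reclaim_begin/enable
cat trace_pipe
```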
Environment
can be reproduced locally
Steps to reproduce
#1210
Reproduced the issue in a local setup. For this I wrote a small script that creates many small files (~140 bytes each) in a single directory.
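The script itself isn't inlined here (presumably it's the one referenced in #1210); a minimal sketch of the same idea, with a made-up path and file count:

```bash
#!/usr/bin/env bash
# Fill one directory with many tiny files to inflate the dentry/inode
# caches. The directory path and file count here are hypothetical.
mkdir -p /tmp/manyfiles
for i in $(seq 1 1000000); do
    head -c 140 /dev/urandom > "/tmp/manyfiles/file_$i"
done
```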
When running this script, there are a bunch of issues:
- vmemmap alloc failure errors
- oom-killer after some time
- oom-killer eventually kills all processes, one by one
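Both symptoms show up in the kernel log; one way to watch for them while the script runs (a sketch):

```bash
# Follow the kernel log and surface both failure signatures as they appear.
dmesg --follow | grep -E 'vmemmap|invoked oom-killer|Out of memory'
```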
Expected result
No OOM kills; autoscaling is able to add memory before OOMs happen.
Actual result
OOM kills; it also looks like the DMA32 memory zone gets exhausted.
Other logs, links
More details in this thread: https://neondb.slack.com/archives/C087MK9MZFA/p1736380041800789