Bug: Extracting large archive causes OOMs #1204

Open · petuhovskiy opened this issue Jan 13, 2025 · 2 comments
Labels: c/autoscaling/neonvm Component: autoscaling: NeonVM · t/bug Issue Type: Bug

@petuhovskiy (Member) commented Jan 13, 2025:

Environment

can be reproduced locally

Steps to reproduce

#1210

Reproduced the issue in a local setup. For this I wrote a small script that creates many small files (~140 bytes each) in a single directory.
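
For illustration, a minimal sketch of such a file-creation script (hypothetical; the actual repro in #1210 may differ, and the target directory and file count here are made up):

```python
#!/usr/bin/env python3
"""Create a huge number of ~140-byte files in one directory to build up
page cache / dentry / inode pressure. Hypothetical sketch of the repro idea."""
import os

TARGET_DIR = "/tmp/many-files"   # assumption: any directory on the tested filesystem
NUM_FILES = 2_000_000            # assumption: enough files to build real pressure
PAYLOAD = b"x" * 140             # ~140 bytes per file, as in the repro

os.makedirs(TARGET_DIR, exist_ok=True)
for i in range(NUM_FILES):
    with open(os.path.join(TARGET_DIR, f"file_{i:08d}"), "wb") as f:
        f.write(PAYLOAD)
    if i % 100_000 == 0:
        print(f"created {i} files", flush=True)
```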

When running this script, there are a bunch of issues:

  • random vmemmap alloc failure errors
  • Linux invokes the oom-killer after some time
  • the oom-killer eventually kills all processes, one by one

Expected result

No OOM kills, autoscaling is able to add memory before OOMs happen.

Actual result

OOM kills; it also looks like the DMA32 memory zone gets exhausted.

Other logs, links

More details in this thread: https://neondb.slack.com/archives/C087MK9MZFA/p1736380041800789

petuhovskiy added the c/autoscaling/neonvm and t/bug labels on Jan 13, 2025
petuhovskiy self-assigned this on Jan 14, 2025

@petuhovskiy (Member Author) commented:

I've been working on this last week:

  1. We have a local repro script. It starts a VM and creates files; it takes ~2 minutes until the oom-killer is invoked.
  2. It was inconvenient to track what's happening in the VM, because memory stats change very fast and the VM eventually becomes unresponsive.
  3. So I created a mem.py script that runs as a daemon and collects all memory stats into a file every 500ms (a sketch of the idea follows this list).
  4. Then I ran some experiments changing the vm (virtual memory) Linux config, to see if that fixes the issue.
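
A simplified sketch of what such a collector can look like (not the actual mem.py; the output path and format are assumptions):

```python
#!/usr/bin/env python3
"""Append /proc/meminfo and /proc/zoneinfo snapshots to a log file every 500ms."""
import time

OUT = "/tmp/memstats.log"  # assumption: output path

def snapshot() -> str:
    with open("/proc/meminfo") as f:
        meminfo = f.read()
    with open("/proc/zoneinfo") as f:
        zoneinfo = f.read()
    return f"--- {time.strftime('%Y-%m-%d %H:%M:%S')} ---\n{meminfo}\n{zoneinfo}\n"

with open(OUT, "a") as out:
    while True:
        out.write(snapshot())
        out.flush()
        time.sleep(0.5)
```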

🔴 Experiment: repro

https://gist.github.com/petuhovskiy/8c44e81abbf835d2c0c514fce3268343

We can see several things here.

Before the load, we have memory usage like this:

Free Memory: 1698048 kB
Cached Memory: 153904 kB
Slab Memory: 15144 kB
Node 0, Zone: DMA32: Free=208023 | Managed=234484 
Node 0, Zone: Movable: Free=216489 | Managed=262144

After the load starts, free pages quickly go down in just a few seconds. Movable memory is used first, but then DMA32 also gets used.

Free Memory: 313772 kB
Cached Memory: 1087328 kB
Slab Memory: 403152 kB
Node 0, Zone: DMA32: Free=78912 | Managed=234484
Node 0, Zone: Movable: Free=336 | Managed=262144

Over time, there is less and less free memory, and more cached/slab memory. At the same time, available memory stays high (Available Memory: 1809936 kB) and swap is almost never used.

Then, this error happens during memory hotplug (after ~7e5 files have been created):

[   33.112699] kworker/0:1: vmemmap alloc failure: order:9, mode:0x4cc0(GFP_KERNEL|__GFP_RETRY_MAYFAIL), nodemask=(null),cpuset=/,mems_allowed=0

And then, after some time, oom-killer wakes up:

[   65.572465] Out of memory: Killed process 176 (sql_exporter) total-vm:1291008kB, anon-rss:5888kB, file-rss:25856kB, shmem-rss:0kB, UID:65534 pgtables:208kB oom_score_adj:0

🟢 Experiment: clearing caches

https://gist.github.com/petuhovskiy/fb7a1b24faa745384e3fe7a0be2894eb

What if we do this every 500ms?

echo 3 > /proc/sys/vm/drop_caches
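
In loop form, roughly (a sketch; the actual experiment may just have looped the shell command above, and it has to run as root):

```python
#!/usr/bin/env python3
"""Drop page cache, dentries and inodes every 500ms (requires root)."""
import time

while True:
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3")  # 3 = drop page cache + reclaimable slab (dentries/inodes)
    time.sleep(0.5)
```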

It turns out there are no OOMs in this case! And also no vmemmap alloc failure.

Because caches are dropped very often, there are always plenty of Free pages and Cached memory stays relatively low.

So it looks like the issue is related to caches? And the OOMs are probably caused by Linux's inability to drop caches when a memory allocation needs the pages?

In the next experiments, I try tuning Linux virtual memory settings in the hope that it helps.

🔴 Experiment: vm.vfs_cache_pressure=200

No significant changes from original repro.

🔴 Experiment: vm.swappiness=100

No significant changes from original repro.

🔴 Experiment: other settings changes

I also tried:

  • vm.min_free_kbytes=131072
  • vm.watermark_scale_factor=1000
  • vm.overcommit_memory=1
  • and tried combining them (applied roughly as in the sketch below)
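
For reference, a sketch of how these tunables can be applied (the actual runs may have used sysctl -w or sysctl.conf instead; requires root):

```python
#!/usr/bin/env python3
"""Apply the vm.* tunables listed above by writing to /proc/sys/vm (requires root)."""

SETTINGS = {
    "min_free_kbytes": "131072",
    "watermark_scale_factor": "1000",
    "overcommit_memory": "1",
}

for name, value in SETTINGS.items():
    with open(f"/proc/sys/vm/{name}", "w") as f:
        f.write(value)
    print(f"vm.{name} = {value}")
```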

When these settings affected zone watermarks, there were changes in free pages, but other than that, there was almost no difference from the original repro. These runs still had the oom-killer and the vmemmap alloc failure.

So this issue doesn't look like it can be fixed by just tuning configs.

🟢 Experiment: autoscaling-enabled: false

https://gist.github.com/petuhovskiy/64e33ffb12698463b2ed19b86a801d7e

When the VM is started, it uses 0.5 CU, which means 2 GiB of memory and 1 CPU. When the repro workload (file creation) is started, the VM gets upscaled and receives more memory and more CPU. In this experiment, we disable autoscaling to check whether there are still OOM kills. Intuitively there still should be, because without hotplugging there is less memory and less CPU available.

But it looks like there are no OOMs when there are no hotplugs!

Initial memory usage looks the same:

Free Memory: 11556 kB
Cached Memory: 1298184 kB
Slab Memory: 482272 kB
Node 0, Zone: DMA32: Free=2219 | Managed=234483
Node 0, Zone: Movable: Free=398 | Managed=262144

But then, it just stays like this. Without hotplugs, memory usage stays stable and there are no OOMs and no vmemmap errors.

So now it looks like this issue/bug is related to hotplug?

🔴 Experiment: DIMMHotplug instead of virtio-mem

Autoscaling is enabled, so the only difference from the original repro is using DIMMHotplug instead of virtio-mem.

And it looks like there is no difference: the same vmemmap errors and oom kills. Even the stack trace looks almost the same:

[   28.248639] kworker/u20:3: vmemmap alloc failure: order:9, mode:0x4cc0(GFP_KERNEL|__GFP_RETRY_MAYFAIL), nodemask=(null),cpuset=/,mems_allowed=0
[   28.249307] CPU: 0 PID: 274 Comm: kworker/u20:3 Not tainted 6.6.64 #1
[   28.249626] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[   28.250186] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   28.250577] Call Trace:
[   28.250753]  <TASK>
[   28.250869]  dump_stack_lvl+0x37/0x50
[   28.251058]  dump_stack+0x10/0x20
[   28.251227]  warn_alloc+0x104/0x180
[   28.251402]  vmemmap_alloc_block+0xf1/0x110
[   28.251606]  vmemmap_alloc_block_buf+0x9d/0xf0
[   28.251818]  vmemmap_populate_hugepages+0xe4/0x190
[   28.252052]  vmemmap_populate+0x3c/0x60
[   28.252240]  __populate_section_memmap+0x190/0x310
[   28.252471]  sparse_add_section+0x148/0x2f0
[   28.252676]  __add_pages+0x85/0x100
[   28.252852]  add_pages+0x1f/0x80
[   28.253013]  arch_add_memory+0x40/0x50
[   28.253192]  add_memory_resource+0x19e/0x410

That means the issue is not virtio-mem related, but instead related to how memory hotplug works in Linux in general.

Info on memory hotplug

We can find out some details by looking at the vmemmap error:

kworker/0:0: vmemmap alloc failure: order:9, mode:0x4cc0(GFP_KERNEL|__GFP_RETRY_MAYFAIL), ...
  • vmemmap alloc failure: The kernel tried to allocate the vmemmap for the newly hotplugged memory.
  • order:9: Means the kernel wants 2^(9) = 512 contiguous pages. (With 4 KB pages, that’s 2 MB.)
  • GFP_KERNEL | __GFP_RETRY_MAYFAIL: The allocation flags indicate a normal kernel allocation, which can retry/compact, but eventually it gave up.

What ChatGPT says about this:

Even though you have a decent amount of total free memory, the allocator needs a single contiguous block of size 2 MB. Over time, memory pages can become scattered across the system (fragmented), making it hard to find a large free run of pages. The kernel will try memory compaction to free a contiguous block, but sometimes it fails if there is enough “pinned” or “unmovable” memory in the zone.

So it looks like initial vmemmap error is caused by memory pressure, which is caused by cached memory.

When searching for "vmemmap alloc failure memory hotplug linux" on Google, it looks like there are some pages mentioning this issue, like https://lore.kernel.org/lkml/[email protected]/T/.

The next step for this issue is probably to find more related discussions on the internet.


In the end, this issue turned out to be not as simple as I thought; maybe it's a good time to create an investigation page in Notion now.

@petuhovskiy (Member Author) commented:

Decided not to create a separate page in Notion; I'm posting updates in this issue instead.

First of all, I switched to a newer 6.9.12 kernel, which reproduces the problem without adding any new issues. Then I looked into the source code.

vmemmap alloc failure

My conclusion is that vmemmap alloc failure errors are not critical and we should ignore them, at least for now.
In torvalds/linux@fcdaf84 we can see that this warning is triggered only once during lifetime, exactly because it was noisy "while doing memory hotplug tests under heavy memory pressure".
Another point is that virtio-mem has internal state that stores how much memory it should have, and retries hotplugging the memory until it eventually succeeds.

memmap_on_memory

We had an idea of trying the memmap_on_memory feature (docs) with hotplug. This feature allows allocating the vmemmap directly on the freshly hotplugged memory, so that the vmemmap allocation will not fail at all.
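
For reference, a sketch of checking the knob (assuming a kernel built with CONFIG_MHP_MEMMAP_ON_MEMORY; enabling is typically done via the memory_hotplug.memmap_on_memory=1 kernel command-line parameter):

```python
#!/usr/bin/env python3
"""Check whether memmap_on_memory is available/enabled for memory hotplug.
Assumes CONFIG_MHP_MEMMAP_ON_MEMORY; the parameter may not exist otherwise."""

PARAM = "/sys/module/memory_hotplug/parameters/memmap_on_memory"

try:
    with open(PARAM) as f:
        print("memmap_on_memory:", f.read().strip())  # 'Y' when enabled
except FileNotFoundError:
    print("memmap_on_memory parameter not exposed by this kernel")
```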

I tried enabling it, and it seems that it's not supported for virtio-mem, but it is supported for ACPI hotplug. When I enabled it with DIMMSlots, it completely fixed the vmemmap warnings. But it didn't fix the oom-killer, which was invoked in the same way as before.

It gives another confirmation that vmemmap alloc failure errors are not the issue here, and they are not related to oom-killer.

trying different configs

Then I tried to reproduce this issue without autoscaling:

  • 🟢 0.25 CU / 1GiB
    • Zone: DMA32: Free=3445 | Managed=234454 | Thresholds(min/low/high)=966/1207/1585
    • Zone: Movable: Free=0 | Managed=0 | Thresholds(min/low/high)=32/32/32
    • no OOMs
  • 🟢 0.5 CU / 2GiB
    • Zone: DMA32: Free=4030 | Managed=234454 | Thresholds(min/low/high)=963/1203/1203
    • Zone: Movable: Free=0 | Managed=262144 | Thresholds(min/low/high)=128/398/398
    • no OOMs
  • 🟢 0.75CU / 3GiB
    • Zone: DMA32: Free=3772 | Managed=234454 | Thresholds(min/low/high)=3127/3367/1203
    • Zone: Movable: Free=890 | Managed=524288 | Thresholds(min/low/high)=128/668/1109
    • no OOMs
  • 🔴 1 CU / 4GiB
    • Zone: DMA32: Free=5410 | Managed=234453 | Thresholds(min/low/high)=963/1203/1203
    • Zone: Movable: Free=266699 | Managed=786432 | Thresholds(min/low/high)=128/938/98304
    • OOMs

The issue reproduces with a static configuration of 1 CPU / 4 GiB RAM. That means it's not related to memory hotplug and can be reproduced with a fixed memory configuration, without autoscaling.

The only difference between the cases above is the size of the Movable zone. It's not clear why Linux is bad at memory management when there's more Movable memory, especially because we don't have any drivers requiring DMA (direct memory access), so almost all of the memory should be movable. And in this particular case, we just have a lot of caches.

From now on, I will continue to use this config for debugging this issue:

memorySlotSize: 1Gi
memorySlots:
  max: 40
  min: 1
  use: 4

more logs for 1CU

https://gist.github.com/petuhovskiy/cb49886a50992b8841fed8999f7b9a11

Let's take a look at zone stats over time:

2025-01-30 12:54:28.623
Zone: DMA32: Free=198624 | Managed=234452 | Thresholds(min/low/high)=963/1203/9317
Zone: Movable: Free=741450 | Managed=786432 | Thresholds(min/low/high)=128/938/23505

2025-01-30 12:54:29.142  
Zone: DMA32: Free=185116 | Managed=234452 | Thresholds(min/low/high)=963/1203/23681
Zone: Movable: Free=717384 | Managed=786432 | Thresholds(min/low/high)=128/938/47571

2025-01-30 12:54:30.168
Zone: DMA32: Free=143970 | Managed=234452 | Thresholds(min/low/high)=963/1203/1770
Zone: Movable: Free=626475 | Managed=786432 | Thresholds(min/low/high)=128/938/98304

2025-01-30 12:54:33.775
Zone: DMA32: Free=40607 | Managed=234452 | Thresholds(min/low/high)=963/1203/1770
Zone: Movable: Free=404463 | Managed=786432 | Thresholds(min/low/high)=128/938/94080

2025-01-30 12:54:36.353
Zone: DMA32: Free=3115 | Managed=234452 | Thresholds(min/low/high)=3127/3367/1203
Zone: Movable: Free=276277 | Managed=786432 | Thresholds(min/low/high)=128/938/98304

And then OOM

A couple of observations:

  • It's interesting that the high watermark grows for the Movable zone; I wonder what the reason is
  • The Movable zone is clearly underused

kernel source

I've been reading kernel docs and sources recently, leaving some useful links here:

https://docs.kernel.org/admin-guide/mm/multigen_lru.html => "optimizes page reclaim and improves performance under memory pressure."
It sounds like we can enable this better-than-default reclaim mechanism and see what happens; maybe we should try it some day.
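
A sketch of flipping it on at runtime, per the linked docs (assumes a kernel built with CONFIG_LRU_GEN and root access):

```python
#!/usr/bin/env python3
"""Enable the Multi-Gen LRU at runtime via its sysfs switch (requires CONFIG_LRU_GEN and root)."""

ENABLED = "/sys/kernel/mm/lru_gen/enabled"

with open(ENABLED) as f:
    print("lru_gen enabled (before):", f.read().strip())

# Per the multigen_lru docs, writing 'y' enables all MGLRU components.
with open(ENABLED, "w") as f:
    f.write("y")

with open(ENABLED) as f:
    print("lru_gen enabled (after):", f.read().strip())
```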

Also looked into sources for memory allocation. These are the main allocation functions:

  • __alloc_pages() src
  • __alloc_pages_slowpath src
  • __alloc_pages_may_oom src

What happens:

  1. __alloc_pages() just tries to quickly find pages in freelist
  2. If it fails, it calls __alloc_pages_slowpath()
  3. __alloc_pages_slowpath() tries waking up kswapd and compaction
  4. Then it also tries direct compaction and direct reclaim
  5. If it fails, it then calls __alloc_pages_may_oom()
  6. __alloc_pages_may_oom() tries one last time and wakes up oom-killer

I can patch any of these functions to add more debug info. For the next step, I probably want to understand why we don't use the Movable zone while it clearly has many free pages available.
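
One more data point that could help before patching anything: per-zone fragmentation from /proc/buddyinfo, since the vmemmap failure is an order-9 (2 MiB) allocation. A sketch, assuming the standard buddyinfo format:

```python
#!/usr/bin/env python3
"""Report free blocks per order for each zone from /proc/buddyinfo,
highlighting whether any order>=9 (2 MiB with 4 KiB pages) blocks remain."""

with open("/proc/buddyinfo") as f:
    for line in f:
        # Format: "Node 0, zone   DMA32   786  453  ..." with counts for orders 0..10
        parts = line.split()
        node, zone = parts[1].rstrip(","), parts[3]
        counts = [int(x) for x in parts[4:]]
        order9_plus = sum(counts[9:])
        print(f"Node {node}, zone {zone}: order>=9 free blocks = {order9_plus} "
              f"(per-order: {counts})")
```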


If anyone has more ideas on what I should try next, you are welcome!
