Applications with even moderately large working sets that frequently touch a substantial portion of that working set tend to spend roughly 20% of wall-clock time on TLB misses if they do not use huge pages. Using huge pages can reduce this time substantially. To get them you may have to call mmap yourself, for example by implementing your own PageAllocator. Previously I used a modified PageAllocator that contained code like this:
if (page_size == 2 * 1024 * 1024) {
    // Ask for 2 MiB huge pages explicitly. MFD.HUGE_2MB carries the same
    // size encoding the kernel uses for MAP_HUGE_2MB.
    slice = os.mmap(
        hint,
        alloc_len,
        os.PROT.READ | os.PROT.WRITE,
        os.MAP.PRIVATE | os.MAP.ANONYMOUS | os.MAP.HUGETLB | os.linux.MFD.HUGE_2MB,
        -1,
        0) catch &[0]u8{};
    if (slice.len == 0) {
        // Try again without huge pages.
        slice = os.mmap(
            hint,
            alloc_len,
            os.PROT.READ | os.PROT.WRITE,
            os.MAP.PRIVATE | os.MAP.ANONYMOUS,
            -1,
            0) catch return error.OutOfMemory;
    }
} else {
    // Any other page size: plain anonymous mapping.
    slice = os.mmap(
        hint,
        alloc_len,
        os.PROT.READ | os.PROT.WRITE,
        os.MAP.PRIVATE | os.MAP.ANONYMOUS,
        -1,
        0) catch return error.OutOfMemory;
}
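To put the TLB pressure in perspective, here is a small sketch of the arithmetic. Each TLB entry translates one page, so the number of distinct pages in the working set is a rough proxy for how many translations compete for TLB space. The helper name `pagesNeeded` and the 1 GiB working-set size are hypothetical, chosen only for illustration; this is not part of the PageAllocator above.

```zig
const std = @import("std");

/// Number of pages of `page_size` bytes needed to cover `working_set` bytes,
/// rounding up. Hypothetical helper, for illustration only.
fn pagesNeeded(working_set: usize, page_size: usize) usize {
    return (working_set + page_size - 1) / page_size;
}

test "pages covering a hypothetical 1 GiB working set" {
    const working_set: usize = 1024 * 1024 * 1024; // assumed size, not measured

    // 4 KiB pages: 2^30 / 2^12 = 262144 distinct translations.
    try std.testing.expectEqual(@as(usize, 262144), pagesNeeded(working_set, 4 * 1024));

    // 2 MiB huge pages: 2^30 / 2^21 = 512 distinct translations.
    try std.testing.expectEqual(@as(usize, 512), pagesNeeded(working_set, 2 * 1024 * 1024));
}
```

With 2 MiB pages the same working set needs 512 times fewer translations, which is why far more of it can stay resident in the TLB.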