Parallel Merge Sorting on Multicore CPUs

High-performance parallel sorting using two complementary merge-based algorithms:

Splitter-based K-way Merge Sort
Worklist-based Dynamic Parallel Merge Sort

This project explores design tradeoffs between structured parallelism (K-way) and dynamic load balancing (worklist) on shared-memory multicore systems.

Overview

We implement and compare two parallel sorting strategies:

1. Splitter-based K-way Merge Sort

Input is divided into K runs
Each run is sorted independently (parallel std::sort)
Global merge is performed in a single parallel phase
Uses:
- sampling-based splitter selection
- value-range partitioning
- winner tree for efficient K-way merging

2. Worklist-based Parallel Merge Sort

Bottom-up merge tree with dynamic task scheduling
Threads pull work from shared queues:
- sort queue (leaf tasks)
- merge queue (internal merges)
Each thread processes small chunks, enabling:
- fine-grained parallelism
- automatic load balancing
Particularly effective on heterogeneous cores

Baselines

std::sort (sequential introsort)
tbb::parallel_sort (if TBB available)

Directory Structure


parallel-merge-sorting/
├── README.md
├── Makefile
├── requirements.txt
├── src/
│   ├── main.cpp
│   ├── core/
│   │   ├── winner_tree.{hpp,cpp}
│   │   ├── splitter.{hpp,cpp}
│   │   └── kway_merge.{hpp,cpp}
│   ├── algorithms/
│   │   ├── parallel_kway_sort.{hpp,cpp}
│   │   ├── parallel_worklist_sort.{hpp,cpp}
│   │   ├── std_sort.hpp
│   │   └── tbb_sort.hpp
│   └── utils/
│       ├── timer.hpp
│       ├── data_gen.hpp
│       └── verifier.hpp
├── benchmarks/
│   └── benchmark.cpp
├── scripts/
│   ├── run_experiments.sh
│   └── plot.py
├── tests/
│   ├── test_correctness.cpp
│   └── test_scaling.cpp
└── docs/
├── design.md
└── methodology.md

Build

Prerequisites

Ubuntu / Debian:


sudo apt-get install build-essential libtbb-dev python3 python3-pip

macOS:


brew install gcc tbb python3
pip3 install matplotlib numpy

Compilation


cd parallel-kway-merge
make

Optional:


make bin/benchmark
make check
make clean

Run

Default benchmark


make run

Custom runs


./bin/benchmark --size 50000000 --threads 16 --runs 8 --chunksize 100000
./bin/benchmark --size 100000000 --threads 16 --runs 8 --chunksize 100000

Experiments (recommended)


cd scripts
./run_experiments.sh

This generates results.py for plotting.

Output Format

Machine-readable output:


RESULT <seq_ms> <tbb_ms> <kway_ms> <worklist_ms>

Example:


RESULT 7123.12 1480.55 1523.77 1299.44

seq_ms → std::sort
tbb_ms → TBB (0 if unavailable)
kway_ms → K-way merge sort
worklist_ms → worklist merge sort

Algorithms

K-way Merge Sort

Partition input into K runs
Sort runs in parallel
Sample elements to compute splitters
Partition each run into value ranges via binary search
Each thread merges its assigned slices using a winner tree

Work: O(N log K)
Parallelism: high for large K
Limitation: sensitive to load imbalance from imperfect splitters

Worklist Merge Sort

Build merge tree of subarrays
Push leaf segments into sort queue
Threads:
- pick sort tasks OR
- claim merge chunks from merge queue
Merge performed in small chunks (≤ 2×chunk_size)
Parent merges become ready when children complete

Dynamic scheduling
Fine-grained parallelism
Strong load balancing
Slightly higher synchronization overhead

Complexity

Sequential: O(N log N)
K-way merge: O(N log K)
Worklist: O(N log N) work, but better practical parallel efficiency

Parallel performance is ultimately limited by memory bandwidth at high thread counts.

Testing


make check
./bin/test_scaling

Correctness is verified via:

full comparison against std::sort
multiple input distributions:
- random
- sorted / reverse
- uniform values
- edge cases

Experimental Setup

Input sizes: up to 1e8 elements
Threads: up to 32
Default size: 50 million
Measurements averaged over multiple runs

Experiments include:

size scaling
thread scaling
chunk size sensitivity

Key Observations

Worklist-based algorithm generally outperforms K-way
Dynamic scheduling improves utilization across cores
K-way provides clean structure but is sensitive to imbalance
Performance saturates beyond ~8–16 threads due to memory bandwidth
Chunk size significantly impacts worklist performance

Troubleshooting

If TBB is not installed → TBB results will be 0
Reduce --size if memory issues occur
Ensure make clean && make if build errors occur

Notes

Best K-way performance when K ≈ number of threads
Worklist benefits from moderate chunk sizes (~1e5–1e6)
Larger inputs improve parallel efficiency

Documentation

See:

docs/design.md
docs/methodology.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Parallel Merge Sorting on Multicore CPUs

Overview

1. Splitter-based K-way Merge Sort

2. Worklist-based Parallel Merge Sort

Baselines

Directory Structure

Build

Prerequisites

Compilation

Run

Default benchmark

Custom runs

Experiments (recommended)

Output Format

Algorithms

K-way Merge Sort

Worklist Merge Sort

Complexity

Testing

Experimental Setup

Key Observations

Troubleshooting

Notes

Documentation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
benchmarks		benchmarks
docs		docs
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
report.pdf		report.pdf

Folders and files

Latest commit

History

Repository files navigation

Parallel Merge Sorting on Multicore CPUs

Overview

1. Splitter-based K-way Merge Sort

2. Worklist-based Parallel Merge Sort

Baselines

Directory Structure

Build

Prerequisites

Compilation

Run

Default benchmark

Custom runs

Experiments (recommended)

Output Format

Algorithms

K-way Merge Sort

Worklist Merge Sort

Complexity

Testing

Experimental Setup

Key Observations

Troubleshooting

Notes

Documentation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages