High-performance parallel sorting using two complementary merge-based algorithms:
- Splitter-based K-way Merge Sort
- Worklist-based Dynamic Parallel Merge Sort
This project explores design tradeoffs between structured parallelism (K-way) and dynamic load balancing (worklist) on shared-memory multicore systems.
We implement and compare two parallel sorting strategies:
Splitter-based K-way Merge Sort:
- Input is divided into K runs
- Each run is sorted independently (parallel `std::sort`)
- Global merge is performed in a single parallel phase
- Uses:
  - sampling-based splitter selection
  - value-range partitioning
  - winner tree for efficient K-way merging
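
To make the winner-tree idea concrete, here is a minimal sketch of a K-way merge over already-sorted runs. It is not taken from `winner_tree.{hpp,cpp}`; the function name `kway_merge`, the `int` element type, and the implicit-array tree layout are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only: `kway_merge` is a hypothetical name, not the project's API.
// Merges K sorted runs with a winner (tournament) tree stored in an array.
std::vector<int> kway_merge(const std::vector<std::vector<int>>& runs) {
    const std::size_t k = runs.size();
    std::size_t total = 0;
    for (const auto& r : runs) {
        assert(std::is_sorted(r.begin(), r.end()));  // every run must be pre-sorted
        total += r.size();
    }
    std::vector<int> out;
    out.reserve(total);
    if (k == 0) return out;

    std::vector<std::size_t> pos(k, 0);    // next unread index in each run
    std::vector<std::size_t> tree(2 * k);  // nodes 1..k-1 internal, k..2k-1 leaves

    // Returns the run whose current head is smaller; exhausted runs always lose.
    auto winner = [&](std::size_t a, std::size_t b) -> std::size_t {
        if (pos[a] == runs[a].size()) return b;
        if (pos[b] == runs[b].size()) return a;
        return runs[b][pos[b]] < runs[a][pos[a]] ? b : a;
    };

    // Build the tree bottom-up: leaf k+i holds run i, tree[1] is the overall winner.
    for (std::size_t i = 0; i < k; ++i) tree[k + i] = i;
    for (std::size_t i = k - 1; i >= 1; --i)
        tree[i] = winner(tree[2 * i], tree[2 * i + 1]);

    for (std::size_t n = 0; n < total; ++n) {
        const std::size_t w = tree[1];
        out.push_back(runs[w][pos[w]++]);
        // Replay only the path from the winner's leaf to the root: O(log K) per element.
        for (std::size_t i = (k + w) / 2; i >= 1; i /= 2)
            tree[i] = winner(tree[2 * i], tree[2 * i + 1]);
    }
    return out;
}
```

Each output element costs one root-to-leaf replay, which is where the `O(N log K)` merge bound comes from. The sketch makes no claim about stability across runs.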
Worklist-based Dynamic Parallel Merge Sort:
- Bottom-up merge tree with dynamic task scheduling
- Threads pull work from shared queues:
  - sort queue (leaf tasks)
  - merge queue (internal merges)
- Each thread processes small chunks, enabling:
  - fine-grained parallelism
  - automatic load balancing
- Particularly effective on heterogeneous cores
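
As a rough illustration of threads pulling leaf tasks from a shared queue, the sketch below models the sort queue as an atomic counter over fixed-size chunks. The name `sort_leaves`, the counter-based queue, and the `int` element type are assumptions; the project's actual queue may be implemented differently.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sketch of a "sort queue": each leaf task is one chunk of the
// input, and threads claim chunks by incrementing a shared atomic counter.
void sort_leaves(std::vector<int>& data, std::size_t chunk_size, unsigned num_threads) {
    assert(chunk_size > 0 && num_threads > 0);
    const std::size_t num_chunks = (data.size() + chunk_size - 1) / chunk_size;
    std::atomic<std::size_t> next{0};

    auto worker = [&] {
        for (;;) {
            // Claim the next unprocessed chunk; faster threads simply claim more.
            const std::size_t c = next.fetch_add(1, std::memory_order_relaxed);
            if (c >= num_chunks) return;
            const std::size_t lo = c * chunk_size;
            const std::size_t hi = std::min(lo + chunk_size, data.size());
            std::sort(data.begin() + static_cast<std::ptrdiff_t>(lo),
                      data.begin() + static_cast<std::ptrdiff_t>(hi));
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}
```

Because no chunk is bound to a particular thread, a slower core ends up processing fewer chunks, which is the load-balancing effect described above. After this step each chunk is sorted on its own; merging them is a separate phase.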
Baselines:
- `std::sort` (sequential introsort)
- `tbb::parallel_sort` (if TBB available)
parallel-merge-sorting/
├── README.md
├── Makefile
├── requirements.txt
├── src/
│   ├── main.cpp
│   ├── core/
│   │   ├── winner_tree.{hpp,cpp}
│   │   ├── splitter.{hpp,cpp}
│   │   └── kway_merge.{hpp,cpp}
│   ├── algorithms/
│   │   ├── parallel_kway_sort.{hpp,cpp}
│   │   ├── parallel_worklist_sort.{hpp,cpp}
│   │   ├── std_sort.hpp
│   │   └── tbb_sort.hpp
│   └── utils/
│       ├── timer.hpp
│       ├── data_gen.hpp
│       └── verifier.hpp
├── benchmarks/
│   └── benchmark.cpp
├── scripts/
│   ├── run_experiments.sh
│   └── plot.py
├── tests/
│   ├── test_correctness.cpp
│   └── test_scaling.cpp
└── docs/
    ├── design.md
    └── methodology.md
Ubuntu / Debian:
sudo apt-get install build-essential libtbb-dev python3 python3-pip
macOS:
brew install gcc tbb python3
pip3 install matplotlib numpy
cd parallel-kway-merge
make
Optional:
make bin/benchmark
make check
make clean
make run
./bin/benchmark --size 50000000 --threads 16 --runs 8 --chunksize 100000
./bin/benchmark --size 100000000 --threads 16 --runs 8 --chunksize 100000
cd scripts
./run_experiments.sh
This generates results.py for plotting.
Machine-readable output:
RESULT <seq_ms> <tbb_ms> <kway_ms> <worklist_ms>
Example:
RESULT 7123.12 1480.55 1523.77 1299.44
- `seq_ms` → `std::sort`
- `tbb_ms` → TBB (0 if unavailable)
- `kway_ms` → K-way merge sort
- `worklist_ms` → worklist merge sort
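
For tooling that consumes this output from C++, a parser could look roughly like the sketch below. `BenchResult`, `parse_result`, and `speedup` are hypothetical names introduced here; only the `RESULT` line layout comes from the format above.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical record for one RESULT line; field names mirror the format above.
struct BenchResult {
    double seq_ms = 0, tbb_ms = 0, kway_ms = 0, worklist_ms = 0;
};

// Returns true and fills `r` if `line` is "RESULT" followed by four numbers.
bool parse_result(const std::string& line, BenchResult& r) {
    std::istringstream in(line);
    std::string tag;
    if (!(in >> tag) || tag != "RESULT") return false;
    BenchResult tmp;
    if (!(in >> tmp.seq_ms >> tmp.tbb_ms >> tmp.kway_ms >> tmp.worklist_ms)) return false;
    r = tmp;
    return true;
}

// Speedup of a parallel variant relative to the sequential baseline.
// A time of 0 means "unavailable" (e.g. TBB missing), so callers must skip it.
double speedup(const BenchResult& r, double parallel_ms) {
    assert(parallel_ms > 0);
    return r.seq_ms / parallel_ms;
}
```

Applied to the example line, `speedup(r, r.worklist_ms)` is roughly 5.5.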
K-way merge sort steps:
- Partition input into K runs
- Sort runs in parallel
- Sample elements to compute splitters
- Partition each run into value ranges via binary search
- Each thread merges its assigned slices using a winner tree
- Work: `O(N log K)`
- Parallelism: high for large K
- Limitation: sensitive to load imbalance from imperfect splitters
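
The sampling and binary-search steps above can be sketched as follows. This assumes regular (evenly spaced) sampling from each sorted run and `int` elements; `choose_splitters` and `slice_bounds` are hypothetical names, and the project's `splitter.{hpp,cpp}` may sample differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: picks parts-1 splitters from evenly spaced samples
// of each sorted run.
std::vector<int> choose_splitters(const std::vector<std::vector<int>>& runs,
                                  std::size_t parts, std::size_t samples_per_run) {
    assert(parts >= 1 && samples_per_run >= 1);
    std::vector<int> sample;
    for (const auto& r : runs) {
        for (std::size_t s = 0; s < samples_per_run && !r.empty(); ++s)
            sample.push_back(r[s * r.size() / samples_per_run]);
    }
    std::sort(sample.begin(), sample.end());
    std::vector<int> splitters;
    for (std::size_t p = 1; p < parts && !sample.empty(); ++p)
        splitters.push_back(sample[p * sample.size() / parts]);
    return splitters;
}

// For one sorted run, returns boundaries b such that slice p is [b[p], b[p+1])
// and holds the values in (splitters[p-1], splitters[p]].
std::vector<std::size_t> slice_bounds(const std::vector<int>& run,
                                      const std::vector<int>& splitters) {
    std::vector<std::size_t> b;
    b.push_back(0);
    for (int s : splitters) {
        const auto it = std::upper_bound(run.begin(), run.end(), s);
        b.push_back(static_cast<std::size_t>(it - run.begin()));
    }
    b.push_back(run.size());
    return b;
}
```

Thread `p` would then merge slice `p` of every run. If the samples are unrepresentative, some slices come out much larger than others, which is the load-imbalance limitation noted above.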
Worklist merge sort steps:
- Build merge tree of subarrays
- Push leaf segments into sort queue
- Threads:
  - pick sort tasks OR
  - claim merge chunks from merge queue
- Merge performed in small chunks (≤ 2×chunk_size)
- Parent merges become ready when children complete
- Dynamic scheduling
- Fine-grained parallelism
- Strong load balancing
- Slightly higher synchronization overhead
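
One way to split a single two-way merge into independently claimable chunks is co-ranking (also called merge path), sketched below. This is a swapped-in, well-known technique rather than the project's confirmed method: each chunk here produces exactly `chunk_size` outputs (fewer for the last one), whereas the project bounds chunks at 2×chunk_size. The names `co_rank` and `chunked_merge` are hypothetical, and readiness tracking between tree levels is omitted.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Co-rank: how many of the first k merged outputs come from `a`.
std::size_t co_rank(std::size_t k, const std::vector<int>& a, const std::vector<int>& b) {
    assert(k <= a.size() + b.size());
    std::size_t lo = k > b.size() ? k - b.size() : 0;
    std::size_t hi = std::min(k, a.size());
    while (lo < hi) {
        const std::size_t i = lo + (hi - lo) / 2;
        const std::size_t j = k - i;      // j >= 1 because i < hi <= k
        if (a[i] < b[j - 1]) lo = i + 1;  // a[i] must be among the first k outputs
        else hi = i;
    }
    return lo;
}

// Merges sorted `a` and `b`; threads claim output chunks from a shared counter.
std::vector<int> chunked_merge(const std::vector<int>& a, const std::vector<int>& b,
                               std::size_t chunk_size, unsigned num_threads) {
    assert(chunk_size > 0 && num_threads > 0);
    const std::size_t n = a.size() + b.size();
    std::vector<int> out(n);
    const std::size_t num_chunks = (n + chunk_size - 1) / chunk_size;
    std::atomic<std::size_t> next{0};

    auto worker = [&] {
        using diff = std::ptrdiff_t;
        for (;;) {
            const std::size_t c = next.fetch_add(1, std::memory_order_relaxed);
            if (c >= num_chunks) return;
            const std::size_t k0 = c * chunk_size;
            const std::size_t k1 = std::min(k0 + chunk_size, n);
            const std::size_t i0 = co_rank(k0, a, b);
            const std::size_t i1 = co_rank(k1, a, b);
            // Output range [k0, k1) depends only on a[i0, i1) and b[k0-i0, k1-i1).
            std::merge(a.begin() + diff(i0), a.begin() + diff(i1),
                       b.begin() + diff(k0 - i0), b.begin() + diff(k1 - i1),
                       out.begin() + diff(k0));
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return out;
}
```

Chunks write to disjoint output ranges, so the only shared state is the counter. The two binary searches per chunk are part of the extra overhead that smaller chunk sizes incur.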
- Sequential: `O(N log N)`
- K-way merge: `O(N log K)`
- Worklist: `O(N log N)` work, but better practical parallel efficiency
Parallel performance is ultimately limited by memory bandwidth at high thread counts.
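
A short derivation, assuming K equal-sized runs of N/K elements each, shows how the `O(N log K)` merge bound fits with the `O(N log N)` totals listed above:

$$
\underbrace{K \cdot \frac{N}{K} \log \frac{N}{K}}_{\text{sorting } K \text{ runs}}
\;+\; \underbrace{N \log K}_{K\text{-way merge}}
\;=\; N \log N - N \log K + N \log K
\;=\; N \log N
$$

Under that assumption the K-way variant performs the same asymptotic work as a sequential sort; `O(N log K)` covers only the merge phase.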
make check
./bin/test_scaling
Correctness is verified via:
- full comparison against `std::sort`
- multiple input distributions:
  - random
  - sorted / reverse
  - uniform values
  - edge cases
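
A check along these lines can be sketched as below. It is not the code in `verifier.hpp` or `test_correctness.cpp`; `Sorter`, `matches_std_sort`, and `make_inputs` are hypothetical names, and the generated inputs cover only random, sorted, reversed, empty, and single-element cases.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

using Sorter = std::function<void(std::vector<int>&)>;

// Returns true if `sorter` produces exactly what std::sort produces on `input`.
bool matches_std_sort(std::vector<int> input, const Sorter& sorter) {
    assert(sorter);  // an empty std::function cannot be called
    std::vector<int> expected = input;
    std::sort(expected.begin(), expected.end());
    sorter(input);
    return input == expected;
}

// Hypothetical input generator: random, sorted, reversed, plus two edge cases.
std::vector<std::vector<int>> make_inputs(std::size_t n, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(-1000, 1000);
    std::vector<int> random(n);
    for (auto& x : random) x = dist(rng);
    std::vector<int> sorted = random;
    std::sort(sorted.begin(), sorted.end());
    std::vector<int> reversed(sorted.rbegin(), sorted.rend());
    return {random, sorted, reversed, {}, {42}};
}
```

Comparing the full output against `std::sort` catches lost or duplicated elements as well as ordering errors, which a plain `std::is_sorted` check would miss.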
- Input sizes: up to `1e8` elements
- Threads: up to 32
- Default size: 50 million
- Measurements averaged over multiple runs
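
Averaging over runs can be sketched with `std::chrono::steady_clock` as below. `average_ms` is a hypothetical helper, not the project's `timer.hpp`; the untimed `setup` step is an assumption added so that each run can restore an unsorted copy of the input before timing.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: calls `setup` (untimed) then `work` (timed) `runs` times
// and returns the mean wall-clock time of `work` in milliseconds.
double average_ms(const std::function<void()>& setup,
                  const std::function<void()>& work, int runs) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;  // monotonic, unaffected by clock changes
    double total_ms = 0.0;
    for (int r = 0; r < runs; ++r) {
        setup();
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        total_ms += std::chrono::duration<double, std::milli>(stop - start).count();
    }
    return total_ms / runs;
}
```

Without the reset in `setup`, every run after the first would sort already-sorted data and report unrealistically low times.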
Experiments include:
- size scaling
- thread scaling
- chunk size sensitivity
- Worklist-based algorithm generally outperforms K-way
- Dynamic scheduling improves utilization across cores
- K-way provides clean structure but is sensitive to imbalance
- Performance saturates beyond ~8–16 threads due to memory bandwidth
- Chunk size significantly impacts worklist performance
- If TBB is not installed → TBB results will be `0`
- Reduce `--size` if memory issues occur
- Run `make clean && make` if build errors occur
- Best K-way performance when `K ≈ number of threads`
- Worklist benefits from moderate chunk sizes (~1e5–1e6)
- Larger inputs improve parallel efficiency
See:
- `docs/design.md`
- `docs/methodology.md`