Skip to content

edwardnvv57k/parallel-merge-sorting

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Parallel Merge Sorting on Multicore CPUs

High-performance parallel sorting using two complementary merge-based algorithms:

  • Splitter-based K-way Merge Sort
  • Worklist-based Dynamic Parallel Merge Sort

This project explores design tradeoffs between structured parallelism (K-way) and dynamic load balancing (worklist) on shared-memory multicore systems.

Overview

We implement and compare two parallel sorting strategies:

1. Splitter-based K-way Merge Sort

  • Input is divided into K runs
  • Each run is sorted independently (parallel std::sort)
  • Global merge is performed in a single parallel phase
  • Uses:
    • sampling-based splitter selection
    • value-range partitioning
    • winner tree for efficient K-way merging

2. Worklist-based Parallel Merge Sort

  • Bottom-up merge tree with dynamic task scheduling
  • Threads pull work from shared queues:
    • sort queue (leaf tasks)
    • merge queue (internal merges)
  • Each thread processes small chunks, enabling:
    • fine-grained parallelism
    • automatic load balancing
  • Particularly effective on heterogeneous cores

Baselines

  • std::sort (sequential introsort)
  • tbb::parallel_sort (if TBB available)

Directory Structure


parallel-merge-sorting/
├── README.md
├── Makefile
├── requirements.txt
├── src/
│   ├── main.cpp
│   ├── core/
│   │   ├── winner_tree.{hpp,cpp}
│   │   ├── splitter.{hpp,cpp}
│   │   └── kway_merge.{hpp,cpp}
│   ├── algorithms/
│   │   ├── parallel_kway_sort.{hpp,cpp}
│   │   ├── parallel_worklist_sort.{hpp,cpp}
│   │   ├── std_sort.hpp
│   │   └── tbb_sort.hpp
│   └── utils/
│       ├── timer.hpp
│       ├── data_gen.hpp
│       └── verifier.hpp
├── benchmarks/
│   └── benchmark.cpp
├── scripts/
│   ├── run_experiments.sh
│   └── plot.py
├── tests/
│   ├── test_correctness.cpp
│   └── test_scaling.cpp
└── docs/
├── design.md
└── methodology.md

Build

Prerequisites

Ubuntu / Debian:


sudo apt-get install build-essential libtbb-dev python3 python3-pip

macOS:


brew install gcc tbb python3
pip3 install matplotlib numpy

Compilation


cd parallel-kway-merge
make

Optional:


make bin/benchmark
make check
make clean

Run

Default benchmark


make run

Custom runs


./bin/benchmark --size 50000000 --threads 16 --runs 8 --chunksize 100000
./bin/benchmark --size 100000000 --threads 16 --runs 8 --chunksize 100000

Experiments (recommended)


cd scripts
./run_experiments.sh

This generates results.py for plotting.

Output Format

Machine-readable output:


RESULT <seq_ms> <tbb_ms> <kway_ms> <worklist_ms>

Example:


RESULT 7123.12 1480.55 1523.77 1299.44

  • seq_ms → std::sort
  • tbb_ms → TBB (0 if unavailable)
  • kway_ms → K-way merge sort
  • worklist_ms → worklist merge sort

Algorithms

K-way Merge Sort

  1. Partition input into K runs
  2. Sort runs in parallel
  3. Sample elements to compute splitters
  4. Partition each run into value ranges via binary search
  5. Each thread merges its assigned slices using a winner tree
  • Work: O(N log K)
  • Parallelism: high for large K
  • Limitation: sensitive to load imbalance from imperfect splitters

Worklist Merge Sort

  1. Build merge tree of subarrays
  2. Push leaf segments into sort queue
  3. Threads:
    • pick sort tasks OR
    • claim merge chunks from merge queue
  4. Merge performed in small chunks (≤ 2×chunk_size)
  5. Parent merges become ready when children complete
  • Dynamic scheduling
  • Fine-grained parallelism
  • Strong load balancing
  • Slightly higher synchronization overhead

Complexity

  • Sequential: O(N log N)
  • K-way merge: O(N log K)
  • Worklist: O(N log N) work, but better practical parallel efficiency

Parallel performance is ultimately limited by memory bandwidth at high thread counts.

Testing


make check
./bin/test_scaling

Correctness is verified via:

  • full comparison against std::sort
  • multiple input distributions:
    • random
    • sorted / reverse
    • uniform values
    • edge cases

Experimental Setup

  • Input sizes: up to 1e8 elements
  • Threads: up to 32
  • Default size: 50 million
  • Measurements averaged over multiple runs

Experiments include:

  • size scaling
  • thread scaling
  • chunk size sensitivity

Key Observations

  • Worklist-based algorithm generally outperforms K-way
  • Dynamic scheduling improves utilization across cores
  • K-way provides clean structure but is sensitive to imbalance
  • Performance saturates beyond ~8–16 threads due to memory bandwidth
  • Chunk size significantly impacts worklist performance

Troubleshooting

  • If TBB is not installed → TBB results will be 0
  • Reduce --size if memory issues occur
  • Ensure make clean && make if build errors occur

Notes

  • Best K-way performance when K ≈ number of threads
  • Worklist benefits from moderate chunk sizes (~1e5–1e6)
  • Larger inputs improve parallel efficiency

Documentation

See:

  • docs/design.md
  • docs/methodology.md

About

High-performance parallel sorting using two complementary merge-based algorithms: Splitter-based K-way Merge Sort and Worklist-based Dynamic Parallel Merge Sort

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors