GitHub - gojiplus/bloomjoin: bloomjoin: An R package implementing Bloom filter-based joins for improved performance with large datasets.

BloomJoin: Bloom Filter Based Joins

BloomJoin provides an alternative join implementation for R that uses an actual Bloom filter, implemented in C++ via Rcpp, to optimize the performance of joins between data frames. Traditional joins in R can be inefficient when dealing with large datasets, especially when one table is significantly larger than the other and the join key selectivity is low.

Installation

# Install from GitHub
devtools::install_github("gojiplus/bloomjoin")

Documentation

The full pkgdown site, including function reference, articles, and release notes, is published automatically to GitHub Pages whenever changes are pushed to the main branch. You can browse it at https://gojiplus.github.io/bloomjoin/. To rebuild the site locally run:

pkgdown::build_site()

Usage

library(bloomjoin)

# Basic usage
result <- bloom_join(df1, df2, by = "id", type = "inner")

# With multiple join columns
result <- bloom_join(df1, df2, by = c("id", "date"), type = "left")

# With performance tuning parameters
result <- bloom_join(df1, df2,
                    by = "id",
                    type = "inner",
                    engine = "bloom",
                    prefilter_side = "auto",
                    fpr = 0.001,
                    n_hint = list(y = 50000),
                    verbose = TRUE)

Proof of Correctness

library(dplyr)
library(tibble)

set.seed(123)
left <- tibble(id = sample(1:5000, 4000), value_left = rnorm(4000))
right <- tibble(id = sample(1:5000, 1500), value_right = rnorm(1500))

bloom <- bloom_join(left, right, by = "id") %>% arrange(id)
reference <- inner_join(left, right, by = "id") %>% arrange(id)

stopifnot(identical(bloom, reference))

Benchmarking Time and Memory

library(bench)

bench::mark(
  bloom_join = bloom_join(left, right, by = "id"),
  dplyr_join = inner_join(left, right, by = "id"),
  iterations = 5,
  check = FALSE
)

For a ready-to-run demonstration that generates data, verifies correctness, and prints benchmark summaries, execute:

Rscript inst/scripts/usage-and-benchmark.R

How It Works

BloomJoin uses a Bloom filter pre-processing step to optimize joins:

Sample the join keys to estimate distinct counts and decide which table to pre-filter.
Build a Bloom filter (using a cache-friendly bitset stored in C++) from the chosen build-side join keys.
Probe the filter with the opposing table to remove rows that cannot match before delegating to the appropriate dplyr verb.

Because Bloom filters may produce false positives but never false negatives, this pre-filtering step safely reduces the number of rows that participate in the expensive join while preserving all possible matches. The underlying Bloom filter implementation is provided by a compiled Rcpp module for performance.

Performance Benchmarks

See here

Future Work

Explore alternative probabilistic data structures (e.g., binary-fuse filters) for further memory savings
Parallel processing for Bloom filter construction and probing
Streaming/Chunked probing for very large inputs

License

MIT

Contributing

Contributions welcome! Please feel free to submit a Pull Request.

Name		Name	Last commit message	Last commit date
Latest commit History 67 Commits
.github/workflows		.github/workflows
R		R
docs		docs
inst		inst
man		man
src		src
tests		tests
vignettes		vignettes
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
CRAN-SUBMISSION		CRAN-SUBMISSION
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
NAMESPACE		NAMESPACE
NEWS.md		NEWS.md
README.md		README.md
_pkgdown.yml		_pkgdown.yml
bloomjoin.Rproj		bloomjoin.Rproj
cran-comments.md		cran-comments.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

BloomJoin: Bloom Filter Based Joins

Installation

Documentation

Usage

Proof of Correctness

Benchmarking Time and Memory

How It Works

Performance Benchmarks

Future Work

License

Contributing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

License

gojiplus/bloomjoin

Folders and files

Latest commit

History

Repository files navigation

BloomJoin: Bloom Filter Based Joins

Installation

Documentation

Usage

Proof of Correctness

Benchmarking Time and Memory

How It Works

Performance Benchmarks

Future Work

License

Contributing

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages