Fast quality control tools for sequencing data, written in Rust. The
rustqc rnacommand runs all RNA-Seq QC analyses in a single pass:
- dupRadar duplicate rate analysis (reimplementation of dupRadar)
- featureCounts-compatible gene-level read counting with biotype summaries
- 8 RSeQC tools integrated as built-in analyses: bam_stat, infer_experiment, read_duplication, read_distribution, junction_annotation, junction_saturation, inner_distance, and TIN (reimplementations of RSeQC)
- preseq library complexity extrapolation (reimplementation of preseq
lc_extrap)- samtools-compatible flagstat, idxstats, and full stats output
- Gene body coverage profiling with Qualimap-compatible output
Binary crate (
rustqc), Rust edition 2021.
# Build
cargo build # debug build
cargo build --release # optimized release build (LTO, strip, opt-level 3)
# Format (enforced in CI — default rustfmt, no config file)
cargo fmt # auto-format
cargo fmt --check # check only (CI uses this)
# Lint (enforced in CI — default clippy, warnings are errors)
cargo clippy -- -D warnings
# Pre-commit hooks (runs fmt --check + clippy + file hygiene)
# Install: brew install prek (or cargo install prek)
prek install # set up git hooks
prek run --all-files # run all hooks manually
# Test — all unit + integration tests
cargo test # debug mode
cargo test --release # release mode (CI uses this)
# Run a single test by name (substring match)
cargo test test_dup_rate_calculation
cargo test test_dup_matrix_exact_match -- --nocapture # with stdout
# Run only unit tests (skip integration tests)
cargo test --lib
# Run only integration tests
cargo test --test integration_test
# Run a specific integration test
cargo test --test integration_test test_intercept_slope_matchsrc/
main.rs — Entry point, dispatches subcommands
cli.rs — CLI argument parsing (clap derive, subcommand structure)
config.rs — YAML configuration loading (serde), nested tool configs
io.rs — Shared I/O utilities (gzip-transparent file reading)
gtf.rs — GTF annotation file parser (with configurable attribute extraction)
rna/
mod.rs — Re-exports all submodules (dupradar, featurecounts, rseqc, bam_flags, cpp_rng, preseq, qualimap)
bam_flags.rs — BAM flag constants
cpp_rng.rs — C++ RNG FFI shim for preseq bootstrap reproducibility
dupradar/
mod.rs — Re-exports counting, dupmatrix, fitting, plots
counting.rs — BAM read counting engine (largest module)
dupmatrix.rs — Duplication matrix construction & TSV output
fitting.rs — Logistic regression via IRLS
plots.rs — Plot generation: density scatter, boxplot, histogram
featurecounts/
mod.rs — Re-exports output
output.rs — featureCounts-format output & biotype counting
preseq.rs — preseq lc_extrap library complexity extrapolation
qualimap/
mod.rs — Re-exports all Qualimap modules
accumulator.rs — Gene body coverage accumulation logic
coverage.rs — Per-base coverage computation
index.rs — Transcript index for coverage profiling
output.rs — Qualimap-compatible output file generation
plots.rs — Gene body coverage plot generation (PNG/SVG)
report.rs — Qualimap HTML report generation
rseqc/
mod.rs — Re-exports all RSeQC modules + common helpers
accumulators.rs — Shared RSeQC accumulator infrastructure (read dispatch)
common.rs — Shared junction/intron extraction, from_genes builders
bam_stat.rs — bam_stat.py reimplementation
flagstat.rs — samtools flagstat-compatible output
idxstats.rs — samtools idxstats-compatible output
infer_experiment.rs — infer_experiment.py reimplementation
inner_distance.rs — inner_distance.py reimplementation
junction_annotation.rs — junction_annotation.py reimplementation
junction_saturation.rs — junction_saturation.py reimplementation
plots.rs — RSeQC plot generation (duplication, junctions, etc.)
read_distribution.rs — read_distribution.py reimplementation
read_duplication.rs — read_duplication.py reimplementation
stats.rs — samtools stats full output (SN + all histogram sections)
tin.rs — TIN (Transcript Integrity Number) analysis
tests/
integration_test.rs — 12 integration tests vs R dupRadar reference output
data/ — Test BAM/GTF input files
expected/ — R-generated reference outputs
create_test_data.R — R script to regenerate test data + references
Nested module structure — top-level modules (cli, config, io, gtf, rna) declared
in main.rs, no lib.rs. The rna module contains sub-modules for each tool group.
Inter-module access uses crate:: paths (e.g., use crate::rna::dupradar::counting::GeneCounts;).
The CLI uses a single subcommand:
rustqc rna <BAM>... --gtf <GTF> [OPTIONS]
A GTF gene annotation file (--gtf) is required. This runs all analyses:
dupRadar duplicate rate analysis, featureCounts-compatible gene counting,
all 8 RSeQC-equivalent tools (bam_stat, infer_experiment, read_duplication,
read_distribution, junction_annotation, junction_saturation, inner_distance,
TIN), preseq library complexity extrapolation, samtools-compatible outputs
(flagstat, idxstats, stats), and gene body coverage profiling with
Qualimap-compatible output. The GTF parser extracts transcript-level
structure (exons + CDS features) to derive all data needed by every tool.
Individual tools can be disabled via the YAML config file (each has an enabled
toggle). Tool-specific parameters (e.g., --min-intron, --inner-distance-lower-bound)
are available as CLI flags and in the config file.
Multiple BAM files can be passed as positional arguments and are processed sequentially.
Parallel processing is supported via --threads.
- Default
rustfmt— norustfmt.tomlexists. Do not create one. - 4-space indentation, ~100 char line width.
- Trailing commas on all multi-line constructs.
- Chained method calls break to new line with indent.
Three groups (crate-internal, third-party, std), though blank-line separation
between groups is not strictly enforced. Each use is a single statement:
use crate::gtf::Gene;
use anyhow::{Context, Result};
use indexmap::IndexMap;
use log::{debug, info};
use std::collections::HashMap;Localized use inside function bodies is acceptable for narrow imports
(e.g., use std::io::Write;).
| Kind | Convention | Examples |
|---|---|---|
| Types / Structs | CamelCase |
GeneCounts, DupMatrix, FitResult |
| Functions/Methods | snake_case |
count_reads, build_index, format_float |
| Constants | SCREAMING_SNAKE_CASE |
BAM_FDUP, DENSITY_COLORS, SCALE |
| Modules | snake_case |
dupmatrix, counting, bam_stat, inner_distance |
| Variables/Fields | snake_case |
gene_counts, dup_rate_multi, is_dup |
| Type aliases | CamelCase |
MateBufferKey |
anyhow::Result<T>for all fallible functions. No custom error types.- Propagate with
?operator. - Add context with
.context("msg")or.with_context(|| format!(...)). - Use
anyhow::bail!()for early error returns. - Use
anyhow::ensure!()for precondition checks. unwrap()/expect()are restricted to test code only. In production code,unwrap()is acceptable only when a prior guard makes it provably safe (add a comment).
- Every source file starts with
//!module doc comment (2-4 lines). - All public items (structs, fields, functions, methods) get
///doc comments. - Complex functions include
# Argumentsand# Returnssections. - Inline
//comments explain complex logic, domain-specific behavior, and references to R equivalents. - Long files use section dividers:
// ===for major sections,// ---for sub-sections.
#[derive(Debug)]on all structs.- Add
Clone,Default,Deserializeas needed — keep derives minimal. - Public structs expose
pubfields. Private helper structs keep fields private. - Numeric conventions:
u64for counts/positions,f64for metrics,u8for flags. IndexMapwhen insertion order matters (gene ordering);HashMapfor unordered lookups.
- Default clippy settings with
-D warnings(deny all warnings). - Targeted
#[allow(...)]annotations are acceptable with justification:#[allow(dead_code)]for fields kept for API completeness.#[allow(clippy::too_many_arguments)]when refactoring would reduce clarity.
- Unit tests co-located in each source file inside
#[cfg(test)] mod tests { use super::*; ... }. - Integration tests in
tests/integration_test.rs— run the binary as a subprocess and compare output against R reference files intests/expected/. - Test naming:
test_<description_in_snake_case>. - Use
assert_eq!with descriptive messages for exact comparisons. - Use
assert!((val - expected).abs() < tolerance)for float comparisons. - No dev-dependencies — tests use only std + crate dependencies.
GitHub Actions (.github/workflows/ci.yml) runs on push to main and all PRs:
- Test —
cargo test --releaseon Ubuntu and macOS - Format —
cargo fmt --check - Clippy —
cargo clippy -- -D warnings
All three must pass. Uses dtolnay/rust-toolchain@stable and Swatinem/rust-cache@v2.
Releases are triggered via workflow dispatch (GitHub Actions UI or gh workflow run release).
The workflow (.github/workflows/release.yml) does the following:
- Reads the version from
Cargo.tomland validates it matchesCargo.lock. - Builds all binary variants (8 targets) and Docker images (baseline + 3 SIMD).
- Only after all builds pass: creates a git tag, GitHub release (with changelog from
CHANGELOG.md), and uploads binary artifacts. - Publishes to crates.io as the final step.
To prepare a release:
- Update
versioninCargo.toml. - Run
cargo update --package rustqcto syncCargo.lock. - Add a new section to
CHANGELOG.mdfollowing the existing format. - Commit and push to
main. - Trigger the Release workflow via GitHub Actions.
| Crate | Purpose |
|---|---|
clap v4 |
CLI argument parsing (derive) |
rust-htslib |
BAM file I/O (statically linked) |
plotters |
Chart generation (PNG + SVG) |
serde |
YAML config deserialization |
anyhow |
Error handling |
log |
Logging facade |
env_logger |
Log output backend |
indexmap |
Insertion-order-preserving maps |
coitrees |
Cache-oblivious interval trees |
rayon |
Data parallelism |
rand / rand_chacha |
Reproducible random sampling |
flate2 |
Gzip decompression (annotation files) |
RustQC verifies that BAM files have been processed by a duplicate-marking tool before running the RNA duplication analysis. This is implemented in two layers:
- Pre-flight
@PGheader check (counting::verify_duplicates_marked): Parses the BAM header text for@PGlines and checks theID:andPN:fields againstKNOWN_DUP_MARKERS(Picard MarkDuplicates, samblaster, sambamba, biobambam, etc.). Exits withanyhow::bail!()if no known tool is found. - Post-hoc zero-duplicates check: After counting, if
total_dup == 0withtotal_mapped > 0, bails with an error — catches cases where the header is present but duplicates weren't actually flagged.
Both checks are skipped when --skip-dup-check is passed (stored as RnaArgs.skip_dup_check,
forwarded to count_reads() as the skip_dup_check: bool parameter).
- A
.pre-commit-config.yamlis provided for local git hooks (fmt, clippy, file hygiene). Use prek (prek install) or the original pre-commit to activate them. - The codebase is a pure binary crate with no library target.
- Release builds use aggressive optimization (
lto = true,codegen-units = 1,strip = true). - Test data is generated by
tests/create_test_data.R— do not modifytests/expected/by hand. - Float output formatting must match R's behavior (15 significant digits, "NA" for NaN, trailing-zero trimming).
- The pipeline processes BAM files which can be very large — performance matters.
- System dependencies needed for building: cmake, zlib, bz2, lzma, curl, ssl, clang (for
rust-htslib). - Benchmark results are produced by the RustQC-benchmarks
Nextflow pipeline. When benchmarks are re-run, verify that all results referenced in the docs
and the top-level
README.mdare updated to reflect the new numbers. - The YAML config mirrors the CLI subcommand hierarchy: all RNA-Seq settings live
under a top-level
rna:key. Withinrna:, output toggles are nested underdupradar:andfeaturecounts:keys. Each output file can be individually enabled/disabled (all default totrue). - Under
rna:, there are sections for each of the 8 RSeQC tools (bam_stat:,infer_experiment:,read_duplication:,read_distribution:,junction_annotation:,junction_saturation:,inner_distance:,tin:). Each has anenabled: booltoggle and tool-specific parameter overrides. CLI flags take precedence over config values. - Under
rna:, there are also sections forpreseq:,qualimap:,flagstat:,idxstats:, andsamtools_stats:. Each has anenabled: booltoggle. Preseq has additional parameters:max_extrap,step_size,n_bootstraps,confidence_level,seed,max_terms,defects. TIN hassample_sizeandmin_coverage. - The
featurecounts.rsmodule produces featureCounts-compatible output files (counts TSV, summary, biotype counts, and MultiQC files). These are generated in the same pass as the dupRadar analysis — no separate featureCounts run is needed. - The GTF parser (
gtf.rs) accepts extra attribute names to extract (e.g.,gene_biotype,gene_type) and stores them inGene.attributes. The biotype attribute name is configurable via--biotype-attributeCLI flag orfeaturecounts.biotype_attributein the YAML config. - GENCODE GTFs use
gene_typewhile Ensembl GTFs usegene_biotype. The tool auto-detects which is present and falls back gracefully with a warning if neither is found. - GTF annotation files can be provided plain or gzip-compressed (
.gz). Detection is based on file magic bytes, not the file extension. The sharediomodule (src/io.rs) provides anopen_reader()helper that transparently handles both formats.