Skip to content

Latest commit

 

History

History
313 lines (258 loc) · 15.7 KB

File metadata and controls

313 lines (258 loc) · 15.7 KB

AGENTS.md — RustQC

Fast quality control tools for sequencing data, written in Rust. The rustqc rna command runs all RNA-Seq QC analyses in a single pass:

  • dupRadar duplicate rate analysis (reimplementation of dupRadar)
  • featureCounts-compatible gene-level read counting with biotype summaries
  • 8 RSeQC tools integrated as built-in analyses: bam_stat, infer_experiment, read_duplication, read_distribution, junction_annotation, junction_saturation, inner_distance, and TIN (reimplementations of RSeQC)
  • preseq library complexity extrapolation (reimplementation of preseq lc_extrap)
  • samtools-compatible flagstat, idxstats, and full stats output
  • Gene body coverage profiling with Qualimap-compatible output

Binary crate (rustqc), Rust edition 2021.

Build / Lint / Test Commands

# Build
cargo build              # debug build
cargo build --release    # optimized release build (LTO, strip, opt-level 3)

# Format (enforced in CI — default rustfmt, no config file)
cargo fmt                # auto-format
cargo fmt --check        # check only (CI uses this)

# Lint (enforced in CI — default clippy, warnings are errors)
cargo clippy -- -D warnings

# Pre-commit hooks (runs fmt --check + clippy + file hygiene)
# Install: brew install prek (or cargo install prek)
prek install          # set up git hooks
prek run --all-files  # run all hooks manually

# Test — all unit + integration tests
cargo test               # debug mode
cargo test --release     # release mode (CI uses this)

# Run a single test by name (substring match)
cargo test test_dup_rate_calculation
cargo test test_dup_matrix_exact_match -- --nocapture   # with stdout

# Run only unit tests (skip integration tests)
cargo test --lib

# Run only integration tests
cargo test --test integration_test

# Run a specific integration test
cargo test --test integration_test test_intercept_slope_match

Project Structure

src/
  main.rs           — Entry point, dispatches subcommands
  cli.rs            — CLI argument parsing (clap derive, subcommand structure)
  config.rs         — YAML configuration loading (serde), nested tool configs
  io.rs             — Shared I/O utilities (gzip-transparent file reading)
  gtf.rs            — GTF annotation file parser (with configurable attribute extraction)
  rna/
    mod.rs          — Re-exports all submodules (dupradar, featurecounts, rseqc, bam_flags, cpp_rng, preseq, qualimap)
    bam_flags.rs    — BAM flag constants
    cpp_rng.rs      — C++ RNG FFI shim for preseq bootstrap reproducibility
    dupradar/
      mod.rs        — Re-exports counting, dupmatrix, fitting, plots
      counting.rs   — BAM read counting engine (largest module)
      dupmatrix.rs  — Duplication matrix construction & TSV output
      fitting.rs    — Logistic regression via IRLS
      plots.rs      — Plot generation: density scatter, boxplot, histogram
    featurecounts/
      mod.rs        — Re-exports output
      output.rs     — featureCounts-format output & biotype counting
    preseq.rs       — preseq lc_extrap library complexity extrapolation
    qualimap/
      mod.rs        — Re-exports all Qualimap modules
      accumulator.rs — Gene body coverage accumulation logic
      coverage.rs   — Per-base coverage computation
      index.rs      — Transcript index for coverage profiling
      output.rs     — Qualimap-compatible output file generation
      plots.rs      — Gene body coverage plot generation (PNG/SVG)
      report.rs     — Qualimap HTML report generation
    rseqc/
      mod.rs                — Re-exports all RSeQC modules + common helpers
      accumulators.rs       — Shared RSeQC accumulator infrastructure (read dispatch)
      common.rs             — Shared junction/intron extraction, from_genes builders
      bam_stat.rs           — bam_stat.py reimplementation
      flagstat.rs           — samtools flagstat-compatible output
      idxstats.rs           — samtools idxstats-compatible output
      infer_experiment.rs   — infer_experiment.py reimplementation
      inner_distance.rs     — inner_distance.py reimplementation
      junction_annotation.rs — junction_annotation.py reimplementation
      junction_saturation.rs — junction_saturation.py reimplementation
      plots.rs              — RSeQC plot generation (duplication, junctions, etc.)
      read_distribution.rs  — read_distribution.py reimplementation
      read_duplication.rs   — read_duplication.py reimplementation
      stats.rs              — samtools stats full output (SN + all histogram sections)
      tin.rs                — TIN (Transcript Integrity Number) analysis
tests/
  integration_test.rs  — 12 integration tests vs R dupRadar reference output
  data/                — Test BAM/GTF input files
  expected/            — R-generated reference outputs
  create_test_data.R   — R script to regenerate test data + references

Nested module structure — top-level modules (cli, config, io, gtf, rna) declared in main.rs, no lib.rs. The rna module contains sub-modules for each tool group. Inter-module access uses crate:: paths (e.g., use crate::rna::dupradar::counting::GeneCounts;).

The CLI uses a single subcommand:

  • rustqc rna <BAM>... --gtf <GTF> [OPTIONS]

A GTF gene annotation file (--gtf) is required. This runs all analyses: dupRadar duplicate rate analysis, featureCounts-compatible gene counting, all 8 RSeQC-equivalent tools (bam_stat, infer_experiment, read_duplication, read_distribution, junction_annotation, junction_saturation, inner_distance, TIN), preseq library complexity extrapolation, samtools-compatible outputs (flagstat, idxstats, stats), and gene body coverage profiling with Qualimap-compatible output. The GTF parser extracts transcript-level structure (exons + CDS features) to derive all data needed by every tool.

Individual tools can be disabled via the YAML config file (each has an enabled toggle). Tool-specific parameters (e.g., --min-intron, --inner-distance-lower-bound) are available as CLI flags and in the config file.

Multiple BAM files can be passed as positional arguments and are processed sequentially. Parallel processing is supported via --threads.

Code Style

Formatting

  • Default rustfmt — no rustfmt.toml exists. Do not create one.
  • 4-space indentation, ~100 char line width.
  • Trailing commas on all multi-line constructs.
  • Chained method calls break to new line with indent.

Imports

Three groups (crate-internal, third-party, std), though blank-line separation between groups is not strictly enforced. Each use is a single statement:

use crate::gtf::Gene;
use anyhow::{Context, Result};
use indexmap::IndexMap;
use log::{debug, info};
use std::collections::HashMap;

Localized use inside function bodies is acceptable for narrow imports (e.g., use std::io::Write;).

Naming

Kind Convention Examples
Types / Structs CamelCase GeneCounts, DupMatrix, FitResult
Functions/Methods snake_case count_reads, build_index, format_float
Constants SCREAMING_SNAKE_CASE BAM_FDUP, DENSITY_COLORS, SCALE
Modules snake_case dupmatrix, counting, bam_stat, inner_distance
Variables/Fields snake_case gene_counts, dup_rate_multi, is_dup
Type aliases CamelCase MateBufferKey

Error Handling

  • anyhow::Result<T> for all fallible functions. No custom error types.
  • Propagate with ? operator.
  • Add context with .context("msg") or .with_context(|| format!(...)).
  • Use anyhow::bail!() for early error returns.
  • Use anyhow::ensure!() for precondition checks.
  • unwrap() / expect() are restricted to test code only. In production code, unwrap() is acceptable only when a prior guard makes it provably safe (add a comment).

Documentation

  • Every source file starts with //! module doc comment (2-4 lines).
  • All public items (structs, fields, functions, methods) get /// doc comments.
  • Complex functions include # Arguments and # Returns sections.
  • Inline // comments explain complex logic, domain-specific behavior, and references to R equivalents.
  • Long files use section dividers: // === for major sections, // --- for sub-sections.

Types and Derives

  • #[derive(Debug)] on all structs.
  • Add Clone, Default, Deserialize as needed — keep derives minimal.
  • Public structs expose pub fields. Private helper structs keep fields private.
  • Numeric conventions: u64 for counts/positions, f64 for metrics, u8 for flags.
  • IndexMap when insertion order matters (gene ordering); HashMap for unordered lookups.

Clippy

  • Default clippy settings with -D warnings (deny all warnings).
  • Targeted #[allow(...)] annotations are acceptable with justification:
    • #[allow(dead_code)] for fields kept for API completeness.
    • #[allow(clippy::too_many_arguments)] when refactoring would reduce clarity.

Tests

  • Unit tests co-located in each source file inside #[cfg(test)] mod tests { use super::*; ... }.
  • Integration tests in tests/integration_test.rs — run the binary as a subprocess and compare output against R reference files in tests/expected/.
  • Test naming: test_<description_in_snake_case>.
  • Use assert_eq! with descriptive messages for exact comparisons.
  • Use assert!((val - expected).abs() < tolerance) for float comparisons.
  • No dev-dependencies — tests use only std + crate dependencies.

CI Pipeline

GitHub Actions (.github/workflows/ci.yml) runs on push to main and all PRs:

  1. Testcargo test --release on Ubuntu and macOS
  2. Formatcargo fmt --check
  3. Clippycargo clippy -- -D warnings

All three must pass. Uses dtolnay/rust-toolchain@stable and Swatinem/rust-cache@v2.

Release Process

Releases are triggered via workflow dispatch (GitHub Actions UI or gh workflow run release). The workflow (.github/workflows/release.yml) does the following:

  1. Reads the version from Cargo.toml and validates it matches Cargo.lock.
  2. Builds all binary variants (8 targets) and Docker images (baseline + 3 SIMD).
  3. Only after all builds pass: creates a git tag, GitHub release (with changelog from CHANGELOG.md), and uploads binary artifacts.
  4. Publishes to crates.io as the final step.

To prepare a release:

  1. Update version in Cargo.toml.
  2. Run cargo update --package rustqc to sync Cargo.lock.
  3. Add a new section to CHANGELOG.md following the existing format.
  4. Commit and push to main.
  5. Trigger the Release workflow via GitHub Actions.

Key Dependencies

Crate Purpose
clap v4 CLI argument parsing (derive)
rust-htslib BAM file I/O (statically linked)
plotters Chart generation (PNG + SVG)
serde YAML config deserialization
anyhow Error handling
log Logging facade
env_logger Log output backend
indexmap Insertion-order-preserving maps
coitrees Cache-oblivious interval trees
rayon Data parallelism
rand / rand_chacha Reproducible random sampling
flate2 Gzip decompression (annotation files)

Duplicate Marking Validation

RustQC verifies that BAM files have been processed by a duplicate-marking tool before running the RNA duplication analysis. This is implemented in two layers:

  1. Pre-flight @PG header check (counting::verify_duplicates_marked): Parses the BAM header text for @PG lines and checks the ID: and PN: fields against KNOWN_DUP_MARKERS (Picard MarkDuplicates, samblaster, sambamba, biobambam, etc.). Exits with anyhow::bail!() if no known tool is found.
  2. Post-hoc zero-duplicates check: After counting, if total_dup == 0 with total_mapped > 0, bails with an error — catches cases where the header is present but duplicates weren't actually flagged.

Both checks are skipped when --skip-dup-check is passed (stored as RnaArgs.skip_dup_check, forwarded to count_reads() as the skip_dup_check: bool parameter).

Notes for Agents

  • A .pre-commit-config.yaml is provided for local git hooks (fmt, clippy, file hygiene). Use prek (prek install) or the original pre-commit to activate them.
  • The codebase is a pure binary crate with no library target.
  • Release builds use aggressive optimization (lto = true, codegen-units = 1, strip = true).
  • Test data is generated by tests/create_test_data.R — do not modify tests/expected/ by hand.
  • Float output formatting must match R's behavior (15 significant digits, "NA" for NaN, trailing-zero trimming).
  • The pipeline processes BAM files which can be very large — performance matters.
  • System dependencies needed for building: cmake, zlib, bz2, lzma, curl, ssl, clang (for rust-htslib).
  • Benchmark results are produced by the RustQC-benchmarks Nextflow pipeline. When benchmarks are re-run, verify that all results referenced in the docs and the top-level README.md are updated to reflect the new numbers.
  • The YAML config mirrors the CLI subcommand hierarchy: all RNA-Seq settings live under a top-level rna: key. Within rna:, output toggles are nested under dupradar: and featurecounts: keys. Each output file can be individually enabled/disabled (all default to true).
  • Under rna:, there are sections for each of the 8 RSeQC tools (bam_stat:, infer_experiment:, read_duplication:, read_distribution:, junction_annotation:, junction_saturation:, inner_distance:, tin:). Each has an enabled: bool toggle and tool-specific parameter overrides. CLI flags take precedence over config values.
  • Under rna:, there are also sections for preseq:, qualimap:, flagstat:, idxstats:, and samtools_stats:. Each has an enabled: bool toggle. Preseq has additional parameters: max_extrap, step_size, n_bootstraps, confidence_level, seed, max_terms, defects. TIN has sample_size and min_coverage.
  • The featurecounts.rs module produces featureCounts-compatible output files (counts TSV, summary, biotype counts, and MultiQC files). These are generated in the same pass as the dupRadar analysis — no separate featureCounts run is needed.
  • The GTF parser (gtf.rs) accepts extra attribute names to extract (e.g., gene_biotype, gene_type) and stores them in Gene.attributes. The biotype attribute name is configurable via --biotype-attribute CLI flag or featurecounts.biotype_attribute in the YAML config.
  • GENCODE GTFs use gene_type while Ensembl GTFs use gene_biotype. The tool auto-detects which is present and falls back gracefully with a warning if neither is found.
  • GTF annotation files can be provided plain or gzip-compressed (.gz). Detection is based on file magic bytes, not the file extension. The shared io module (src/io.rs) provides an open_reader() helper that transparently handles both formats.