diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml new file mode 100644 index 0000000..d25190e --- /dev/null +++ b/.github/workflows/tests.yml @@ -0,0 +1,24 @@ +name: Run Tests + +on: + push: + branches: [main] + pull_request: + branches: [main] + +jobs: + test: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: "3.11" + + - name: Install dependencies + run: python -m pip install -e ".[dev]" + + - name: Run tests + run: pytest tests/ -v diff --git a/README.md b/README.md index 1871ad9..b1329f3 100644 --- a/README.md +++ b/README.md @@ -14,14 +14,20 @@ ### Installation -* Python >= 3.10 required. +* **Python 3.10–3.12 required** (`>=3.10,<3.13`). We recommend creating a dedicated virtual environment: + +```bash +python3.10 -m venv .venv +source .venv/bin/activate # Windows: .venv\Scripts\activate +``` + * Core dependency: `glycowork>=1.7.1` ```bash pip install glycoforge ``` -OR +OR ```bash git clone https://github.com/BojarLab/GlycoForge.git @@ -31,14 +37,9 @@ source .venv/bin/activate pip install -e . ``` - - ### Usage -See [run_simulation.ipynb](run_simulation.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BojarLab/GlycoForge/blob/main/run_simulation.ipynb)for interactive examples, or [use_cases/batch_correction/](use_cases/batch_correction) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BojarLab/GlycoForge/blob/main/use_cases/batch_correction/run_correction.ipynb) for batch correction workflows. 
- - - +See [run_simulation.ipynb](run_simulation.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BojarLab/GlycoForge/blob/main/run_simulation.ipynb) for interactive simulation examples, [use_cases/batch_correction/](use_cases/batch_correction) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BojarLab/GlycoForge/blob/main/use_cases/batch_correction/run_correction.ipynb) for batch correction workflows, and [benchmarking_batch_effect_removal.ipynb](benchmarking_batch_effect_removal.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BojarLab/GlycoForge/blob/main/benchmarking_batch_effect_removal.ipynb) for the full six-method benchmark. ## How the simulator works @@ -48,10 +49,9 @@ We keep everything in the CLR (centered log-ratio) space: - Flip to CLR: `z_H = clr(p_H)`. - For selected glycans, push the signal using real or synthetic effect sizes: `z_U = z_H + m * lambda * d_robust`, where `m` is the differential mask, `lambda` is `bio_strength`, and `d_robust` is the effect vector after `robust_effect_size_processing`. - **Simplified mode**: draw synthetic effect sizes (log-fold changes) and pass them through the same robust processing pipeline. - - **Hybrid mode**: start from the Cohen's *d* values returned by `glycowork.get_differential_expression`; `define_differential_mask` lets you restrict the injection to significant hits or top-*N* glycans before scaling. 
- Invert back to proportions: `p_U = invclr(z_U)` and scale by `k_dir` to get `alpha_U`; note that the healthy and unhealthy Dirichlet strengths use different `k_dir` values, and a separate `variance_ratio` controls their relative magnitude. -- Batch effects ride on top as direction vectors `u_b`, so a clean CLR sample `Y_clean` becomes `Y_with_batch = Y_clean + kappa_mu * u_b + epsilon`, with `var_b` controlling spread. - +- Batch effects ride on top as direction vectors `u_b`, so a clean CLR sample `Y_clean` becomes `Y_with_batch = Y_clean + kappa_mu * sigma * u_b + epsilon`, with `var_b` controlling spread. ## Simulation Modes @@ -66,17 +66,17 @@ No real data dependency. Ideal for controlled experiments with known ground trut **Pipeline steps:** -1. Initializes log-normal healthy baseline: `alpha_H = ones(n_glycans) * 10` +1. Initializes the log-normal healthy baseline: `alpha_H` is sampled from a log-normal distribution (μ=0, σ=1, fixed seed=42) and rescaled to a mean of 10 2. For each random seed, generates `alpha_U` by randomly scaling `alpha_H`: - `up_frac` (default 30%) upregulated with scale factors from `up_scale_range=(1.1, 3.0)` - - `down_frac` (default 30%) downregulated with scale factors from `down_scale_range=(0.3, 0.9)` - - Remaining glycans (~40%) stay unchanged + - `down_frac` (default 35%) downregulated with scale factors from `down_scale_range=(0.3, 0.9)` + - Remaining glycans (~35%) stay unchanged 3. Samples clean cohorts from `Dirichlet(alpha_H)` and `Dirichlet(alpha_U)` with `n_H` healthy and `n_U` unhealthy samples 4. Defines batch effect direction vectors `u_dict` once per simulation run (fixed seed ensures reproducible batch geometry across parameter sweep) 5. Applies batch effects controlled by `kappa_mu` (shift strength) and `var_b` (variance scaling) 6. 
Optionally applies MNAR (Missing Not At Random) missingness: - - `missing_fraction`: proportion of missing values (0.0-1.0) - - `mnar_bias`: intensity-dependent bias (default 2.0, range 0.5-5.0) + - `missing_fraction`: proportion of missing values (0.0–1.0) + - `mnar_bias`: intensity-dependent bias (default 2.0, range 0.5–5.0) - Left-censored pattern: low-abundance glycans more likely to be missing 7. Grid search over `kappa_mu` and `var_b` produces multiple datasets under identical batch effect structure @@ -85,7 +85,7 @@ No real data dependency. Ideal for controlled experiments with known ground trut
-Templated mode (data_source="real") – Extract biological effect from input reference data + simulate batch effect (click to show detail introduction) +Templated mode (data_source="real") – Extract biological effect from input reference data + simulate batch effect (click to show a detailed introduction)
@@ -102,7 +102,7 @@ Starts from real glycomics data to preserve biological signal structure. Accepts - `"Top-N"`: top N glycans by absolute effect size (e.g., `"Top-10"`) 5. Processes effect sizes through `robust_effect_size_processing`: - Centers effect sizes to remove global shift - - Applies Winsorization to clip extreme outliers (auto-selects percentile 85-99, or uses `winsorize_percentile`) + - Applies Winsorization to clip extreme outliers (auto-selects percentile 85–99, or uses `winsorize_percentile`) - Normalizes by baseline (`baseline_method`: median, MAD, or p75) - Returns normalized `d_robust` scaled by `bio_strength` 6. Injects effects in CLR space: `z_U = z_H + mask * bio_strength * d_robust` @@ -111,7 +111,7 @@ Starts from real glycomics data to preserve biological signal structure. Accepts 9. Samples clean cohorts from `Dirichlet(alpha_H)` and `Dirichlet(alpha_U)` with `n_H` healthy and `n_U` unhealthy samples 10. Defines batch effect direction vectors `u_dict` once per run (fixed seed ensures fair comparison across parameter combinations) 11. Applies batch effects: `y_batch = y_clean + kappa_mu * sigma * u_b + epsilon`, where `epsilon ~ N(0, sqrt(var_b) * sigma)` -12. Optionally applies MNAR missingness (same as Simplified mode: left-censored pattern biased toward low-abundance glycans) +12. Optionally applies MNAR missingness (same as Simplified mode) 13. 
Grid search over `bio_strength`, `k_dir`, `variance_ratio`, `kappa_mu`, `var_b` to systematically test biological signal and batch effect interactions **Key parameters:** `data_file`, `column_prefix`, `bio_strength`, `k_dir`, `variance_ratio`, `differential_mask`, `winsorize_percentile`, `baseline_method`, `kappa_mu`, `var_b`, `missing_fraction`, `mnar_bias` @@ -147,9 +147,27 @@ Generates two glycomic datasets (e.g., _N_- and _O_-glycomics) that share sample The [use_cases/batch_correction/](use_cases/batch_correction) directory demonstrates: - Running a `glycoforge` simulation and then applying the correction workflow +- Six-method batch correction benchmark (ComBat, Percentile, Ratio-ComBat, Harmony, limma-style, Stratified ComBat) across a parameter grid of biological signal strengths and batch effect severities - Batch correction effectiveness metrics visualization - ## Limitation -**Two biological groups only**: Current implementation targets healthy/unhealthy setup. Supporting multi-stage disease (>=3 groups) requires refactoring Dirichlet parameter generation and evaluation metrics. +**Two biological groups only**: The current implementation targets a healthy/unhealthy setup. Supporting multi-stage disease (≥3 groups) requires refactoring Dirichlet parameter generation and evaluation metrics. + +## Citation + +If you use GlycoForge in your research, please cite: + +> Hu, S. and Bojar, D. (2026). GlycoForge generates realistic glycomics data under known ground truth for rigorous method benchmarking. 
*bioRxiv*, doi:10.64898/2026.02.20.707134 + +**BibTeX:** +```bibtex +@article{hu2026glycoforge, + title = {GlycoForge generates realistic glycomics data under known ground truth for rigorous method benchmarking}, + author = {Hu, Siyu and Bojar, Daniel}, + journal = {bioRxiv}, + year = {2026}, + doi = {10.64898/2026.02.20.707134}, + url = {https://www.biorxiv.org/content/10.64898/2026.02.20.707134v1} +} +``` \ No newline at end of file diff --git a/benchmarking_batch_effect_removal.ipynb b/benchmarking_batch_effect_removal.ipynb index ee7e1ef..e202296 100644 --- a/benchmarking_batch_effect_removal.ipynb +++ b/benchmarking_batch_effect_removal.ipynb @@ -13,8 +13,8 @@ "import pandas as pd\n", "import numpy as np\n", "from glycoforge import simulate\n", - "from methods import combat, add_noise_to_zero_variance_features, percentile_normalization, ratio_preserving_combat, harmony_correction, limma_style_correction, stratified_combat\n", - "from evaluation import quantify_batch_effect_impact, compare_differential_expression\n", + "from use_cases.batch_correction.methods import combat, add_noise_to_zero_variance_features, percentile_normalization, ratio_preserving_combat, harmony_correction, limma_style_correction, stratified_combat\n", + "from use_cases.batch_correction.evaluation import quantify_batch_effect_impact, compare_differential_expression\n", "from glycoforge.utils import clr, invclr\n", "import json\n", "\n", @@ -189,6 +189,17 @@ "print(\"Aggregated results saved to results/method_comparison_aggregated.csv\")" ] }, + { + "cell_type": "code", + "execution_count": null, + "id": "893f4d0e", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "os.makedirs('results/figures', exist_ok=True)" + ] + }, { "cell_type": "code", "execution_count": null, @@ -406,7 +417,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": ".venv", "language": "python", "name": "python3" }, @@ -420,7 +431,7 @@ "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.6" + "version": "3.12.13" } }, "nbformat": 4, diff --git a/tests/test_core_functions.py b/tests/test_core_functions.py index 32001a2..05cc2d9 100644 --- a/tests/test_core_functions.py +++ b/tests/test_core_functions.py @@ -281,8 +281,8 @@ def test_motif_based_alpha_generation(): # Validation 1: Array dimensions are correct assert len(alpha_U_with_motif) == len(glycan_sequences) assert len(alpha_U_no_motif) == len(glycan_sequences) - -# Validation 2: Motif bias affects the distribution + + # Validation 2: Motif bias affects the distribution # With motif rules, the distribution should differ from no-motif case # Test that motif_bias has an effect (not testing specific direction due to pairing complexity) ratio_with_motif = alpha_U_with_motif / alpha_H @@ -636,3 +636,118 @@ def test_mnar_edge_cases(): assert np.array_equal(mask_a, mask_b) assert np.array_equal(np.isnan(Y_missing_a), np.isnan(Y_missing_b)) + +# --- Coupling tests --- + +def _hsic_linear(X, Y): + """Linear HSIC statistic between two sample-space matrices.""" + n = X.shape[0] + H = np.eye(n) - np.ones((n, n)) / n + Kc = H @ (X @ X.T) @ H + Lc = H @ (Y @ Y.T) @ H + return np.trace(Kc @ Lc) / (n - 1) ** 2 + + +def _hsic_permtest(X, Y, n_perm=300, seed=0): + """Return (hsic_obs, p_value) via permutation test.""" + rng = np.random.default_rng(seed) + + # Precompute centering matrix and centered kernel for X (fixed across permutations) + n = X.shape[0] + H = np.eye(n) - np.ones((n, n)) / n + Kx = X @ X.T + Kc = H @ Kx @ H + + # Observed HSIC with original Y + Ly = Y @ Y.T + Lc = H @ Ly @ H + obs = np.trace(Kc @ Lc) / (n - 1) ** 2 + + # Permutation distribution: reuse Kc and H; only Y changes + count = 0 + for _ in range(n_perm): + perm = rng.permutation(n) + Y_perm = Y[perm] + Ly_perm = Y_perm @ Y_perm.T + Lc_perm = H @ Ly_perm @ H + hsic_perm = np.trace(Kc @ Lc_perm) / (n - 1) ** 2 + if hsic_perm >= obs: + count += 1 + + # Use 
(count + 1) / (n_perm + 1) to include the observed statistic + # and avoid zero/one p-values with finite permutations. + return obs, (count + 1) / (n_perm + 1) + + +def test_zero_coupling_independence(): + """coupling_strength=0: cross-glycome HSIC should be non-significant. + + Note: shared bio/batch labels still cause a tiny amount of common variance, + so we cannot assert exact zero — we assert that HSIC is not statistically + significant at alpha=0.05 via a permutation test. + """ + from glycoforge.sim_coupled import inject_coupling + + rng_a = np.random.default_rng(7) + rng_b = np.random.default_rng(13) + n_samples, n_glycans = 60, 15 + Y_A = rng_a.standard_normal((n_samples, n_glycans)) + Y_B = rng_b.standard_normal((n_samples, n_glycans)) + + inject_coupling( + Y_A, Y_B, + coupling_strength=0.0, + n_coupling_components=2, + coupling_motif_A=None, coupling_motif_B=None, + coupling_motif_bias=0.0, + glycan_sequences_A=None, glycan_sequences_B=None, + seed=42, + ) + + _, p_value = _hsic_permtest(Y_A, Y_B, n_perm=300, seed=0) + assert p_value >= 0.05, ( + f"Expected non-significant HSIC for zero coupling, got p={p_value:.4f}" + ) + + +def test_nonzero_coupling_detectability(): + """coupling_strength=1.0: cross-glycome HSIC should be statistically significant.""" + from glycoforge.sim_coupled import inject_coupling + + rng_a = np.random.default_rng(7) + rng_b = np.random.default_rng(13) + n_samples, n_glycans = 60, 15 + Y_A = rng_a.standard_normal((n_samples, n_glycans)) + Y_B = rng_b.standard_normal((n_samples, n_glycans)) + + inject_coupling( + Y_A, Y_B, + coupling_strength=1.0, + n_coupling_components=2, + coupling_motif_A=None, coupling_motif_B=None, + coupling_motif_bias=0.0, + glycan_sequences_A=None, glycan_sequences_B=None, + seed=42, + ) + + _, p_value = _hsic_permtest(Y_A, Y_B, n_perm=300, seed=0) + assert p_value < 0.05, ( + f"Expected significant HSIC for coupling_strength=1.0, got p={p_value:.4f}" + ) + + +def test_coupling_direction_unit_norm(): + 
"""_build_coupling_directions columns should each have norm = 1.0.""" + from glycoforge.sim_coupled import _build_coupling_directions + + for n_glycans, n_components in [(10, 1), (20, 3), (5, 5)]: + U = _build_coupling_directions( + n_glycans, n_components, + motif_rules=None, glycan_sequences=None, + motif_bias=0.0, seed=42, + ) + norms = np.linalg.norm(U, axis=0) + assert np.allclose(norms, 1.0, atol=1e-10), ( + f"Column norms should be 1.0, got {norms}" + ) +