Skip to content
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 24 additions & 0 deletions .github/workflows/tests.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
name: Run Tests

on:
push:
branches: [main]
pull_request:
branches: [main]

jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"

- name: Install dependencies
run: pip install -e ".[dev]"

- name: Run tests
run: pytest tests/ -v
60 changes: 39 additions & 21 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,14 +14,20 @@

### Installation

* Python >= 3.10 required.
* **Python 3.10–3.12 required** (`>=3.10,<3.13`). We recommend creating a dedicated virtual environment:

```bash
python3.10 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
```

* Core dependency: `glycowork>=1.7.1`

```bash
pip install glycoforge
```

OR
OR

```bash
git clone https://github.com/BojarLab/GlycoForge.git
Expand All @@ -31,14 +37,9 @@ source .venv/bin/activate
pip install -e .
```



### Usage

See [run_simulation.ipynb](run_simulation.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BojarLab/GlycoForge/blob/main/run_simulation.ipynb)for interactive examples, or [use_cases/batch_correction/](use_cases/batch_correction) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BojarLab/GlycoForge/blob/main/use_cases/batch_correction/run_correction.ipynb) for batch correction workflows.



See [run_simulation.ipynb](run_simulation.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BojarLab/GlycoForge/blob/main/run_simulation.ipynb) for interactive simulation examples, or [use_cases/batch_correction/](use_cases/batch_correction) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BojarLab/GlycoForge/blob/main/use_cases/batch_correction/run_correction.ipynb) for batch correction workflows, and [benchmarking_batch_effect_removal.ipynb](use_cases/batch_correction/benchmarking_batch_effect_removal.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BojarLab/GlycoForge/blob/main/use_cases/batch_correction/benchmarking_batch_effect_removal.ipynb) for the full six-method benchmark.

## How the simulator works

Expand All @@ -48,10 +49,9 @@ We keep everything in the CLR (centered log-ratio) space:
- Flip to CLR: `z_H = clr(p_H)`.
- For selected glycans, push the signal using real or synthetic effect sizes: `z_U = z_H + m * lambda * d_robust`, where `m` is the differential mask, `lambda` is `bio_strength`, and `d_robust` is the effect vector after `robust_effect_size_processing`.
- **Simplified mode**: draw synthetic effect sizes (log-fold changes) and pass them through the same robust processing pipeline.
- **Hybrid mode**: start from the Cohens *d* values returned by `glycowork.get_differential_expression`; `define_differential_mask` lets you restrict the injection to significant hits or top-*N* glycans before scaling.
- **Hybrid mode**: start from the Cohen's *d* values returned by `glycowork.get_differential_expression`; `define_differential_mask` lets you restrict the injection to significant hits or top-*N* glycans before scaling.
- Invert back to proportions: `p_U = invclr(z_U)` and scale by `k_dir` to get `alpha_U`, note that the healthy and unhealthy Dirichlet strengths use different `k_dir` values, and a separate `variance_ratio` controls their relative magnitude.
- Batch effects ride on top as direction vectors `u_b`, so a clean CLR sample `Y_clean` becomes `Y_with_batch = Y_clean + kappa_mu * u_b + epsilon`, with `var_b` controlling spread.

- Batch effects ride on top as direction vectors `u_b`, so a clean CLR sample `Y_clean` becomes `Y_with_batch = Y_clean + kappa_mu * sigma * u_b + epsilon`, with `var_b` controlling spread.

## Simulation Modes

Expand All @@ -66,17 +66,17 @@ No real data dependency. Ideal for controlled experiments with known ground trut

**Pipeline steps:**

1. Initializes log-normal healthy baseline: `alpha_H = ones(n_glycans) * 10`
1. Initializes log-normal healthy baseline: `alpha_H` sampled from log-normal distribution (μ=0, σ=1, fixed seed=42), rescaled to mean of 10
2. For each random seed, generates `alpha_U` by randomly scaling `alpha_H`:
- `up_frac` (default 30%) upregulated with scale factors from `up_scale_range=(1.1, 3.0)`
- `down_frac` (default 30%) downregulated with scale factors from `down_scale_range=(0.3, 0.9)`
- Remaining glycans (~40%) stay unchanged
- `down_frac` (default 35%) downregulated with scale factors from `down_scale_range=(0.3, 0.9)`
- Remaining glycans (~35%) stay unchanged
3. Samples clean cohorts from `Dirichlet(alpha_H)` and `Dirichlet(alpha_U)` with `n_H` healthy and `n_U` unhealthy samples
4. Defines batch effect direction vectors `u_dict` once per simulation run (fixed seed ensures reproducible batch geometry across parameter sweep)
5. Applies batch effects controlled by `kappa_mu` (shift strength) and `var_b` (variance scaling)
6. Optionally applies MNAR (Missing Not At Random) missingness:
- `missing_fraction`: proportion of missing values (0.0-1.0)
- `mnar_bias`: intensity-dependent bias (default 2.0, range 0.5-5.0)
- `missing_fraction`: proportion of missing values (0.01.0)
- `mnar_bias`: intensity-dependent bias (default 2.0, range 0.55.0)
- Left-censored pattern: low-abundance glycans more likely to be missing
7. Grid search over `kappa_mu` and `var_b` produces multiple datasets under identical batch effect structure

Expand All @@ -85,7 +85,7 @@ No real data dependency. Ideal for controlled experiments with known ground trut
</details>

<details>
<summary><b>Templated mode (<code>data_source="real"</code>)</b> – Extract biological effect from input reference data + simulate batch effect (click to show detail introduction) </summary>
<summary><b>Templated mode (<code>data_source="real"</code>)</b> – Extract biological effect from input reference data + simulate batch effect (click to show detail introduction)</summary>

<br>

Expand All @@ -102,7 +102,7 @@ Starts from real glycomics data to preserve biological signal structure. Accepts
- `"Top-N"`: top N glycans by absolute effect size (e.g., `"Top-10"`)
5. Processes effect sizes through `robust_effect_size_processing`:
- Centers effect sizes to remove global shift
- Applies Winsorization to clip extreme outliers (auto-selects percentile 85-99, or uses `winsorize_percentile`)
- Applies Winsorization to clip extreme outliers (auto-selects percentile 8599, or uses `winsorize_percentile`)
- Normalizes by baseline (`baseline_method`: median, MAD, or p75)
- Returns normalized `d_robust` scaled by `bio_strength`
6. Injects effects in CLR space: `z_U = z_H + mask * bio_strength * d_robust`
Expand All @@ -111,7 +111,7 @@ Starts from real glycomics data to preserve biological signal structure. Accepts
9. Samples clean cohorts from `Dirichlet(alpha_H)` and `Dirichlet(alpha_U)` with `n_H` healthy and `n_U` unhealthy samples
10. Defines batch effect direction vectors `u_dict` once per run (fixed seed ensures fair comparison across parameter combinations)
11. Applies batch effects: `y_batch = y_clean + kappa_mu * sigma * u_b + epsilon`, where `epsilon ~ N(0, sqrt(var_b) * sigma)`
12. Optionally applies MNAR missingness (same as Simplified mode: left-censored pattern biased toward low-abundance glycans)
12. Optionally applies MNAR missingness (same as Simplified mode)
13. Grid search over `bio_strength`, `k_dir`, `variance_ratio`, `kappa_mu`, `var_b` to systematically test biological signal and batch effect interactions

**Key parameters:** `data_file`, `column_prefix`, `bio_strength`, `k_dir`, `variance_ratio`, `differential_mask`, `winsorize_percentile`, `baseline_method`, `kappa_mu`, `var_b`, `missing_fraction`, `mnar_bias`
Expand Down Expand Up @@ -147,9 +147,27 @@ Generates two glycomic datasets (e.g., _N_- and _O_-glycomics) that share sample

The [use_cases/batch_correction/](use_cases/batch_correction) directory demonstrates:
- Call `glycoforge` simulation, and then apply correction workflow
- Six-method batch correction benchmark (ComBat, Percentile, Ratio-ComBat, Harmony, limma-style, Stratified ComBat) across a parameter grid of biological signal strengths and batch effect severities
- Batch correction effectiveness metrics visualization


## Limitation

**Two biological groups only**: Current implementation targets healthy/unhealthy setup. Supporting multi-stage disease (>=3 groups) requires refactoring Dirichlet parameter generation and evaluation metrics.
**Two biological groups only**: Current implementation targets healthy/unhealthy setup. Supporting multi-stage disease (≥3 groups) requires refactoring Dirichlet parameter generation and evaluation metrics.

## Citation

If you use GlycoForge in your research, please cite:

> Hu, S. and Bojar, D. (2026). GlycoForge generates realistic glycomics data under known ground truth for rigorous method benchmarking. *bioRxiv*, doi:10.64898/2026.02.20.707134

**BibTeX:**
```bibtex
@article{hu2026glycoforge,
title = {GlycoForge generates realistic glycomics data under known ground truth for rigorous method benchmarking},
author = {Hu, Siyu and Bojar, Daniel},
journal = {bioRxiv},
year = {2026},
doi = {10.64898/2026.02.20.707134},
url = {https://www.biorxiv.org/content/10.64898/2026.02.20.707134v1}
}
```
19 changes: 15 additions & 4 deletions benchmarking_batch_effect_removal.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -13,8 +13,8 @@
"import pandas as pd\n",
"import numpy as np\n",
"from glycoforge import simulate\n",
"from methods import combat, add_noise_to_zero_variance_features, percentile_normalization, ratio_preserving_combat, harmony_correction, limma_style_correction, stratified_combat\n",
"from evaluation import quantify_batch_effect_impact, compare_differential_expression\n",
"from use_cases.batch_correction.methods import combat, add_noise_to_zero_variance_features, percentile_normalization, ratio_preserving_combat, harmony_correction, limma_style_correction, stratified_combat\n",
"from use_cases.batch_correction.evaluation import quantify_batch_effect_impact, compare_differential_expression\n",
"from glycoforge.utils import clr, invclr\n",
"import json\n",
"\n",
Expand Down Expand Up @@ -189,6 +189,17 @@
"print(\"Aggregated results saved to results/method_comparison_aggregated.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "893f4d0e",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.makedirs('results/figures', exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
Expand Down Expand Up @@ -406,7 +417,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": ".venv",
"language": "python",
"name": "python3"
},
Expand All @@ -420,7 +431,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
"version": "3.12.13"
}
},
"nbformat": 4,
Expand Down
99 changes: 97 additions & 2 deletions tests/test_core_functions.py
Original file line number Diff line number Diff line change
Expand Up @@ -281,8 +281,8 @@ def test_motif_based_alpha_generation():
# Validation 1: Array dimensions are correct
assert len(alpha_U_with_motif) == len(glycan_sequences)
assert len(alpha_U_no_motif) == len(glycan_sequences)
# Validation 2: Motif bias affects the distribution

# Validation 2: Motif bias affects the distribution
# With motif rules, the distribution should differ from no-motif case
# Test that motif_bias has an effect (not testing specific direction due to pairing complexity)
ratio_with_motif = alpha_U_with_motif / alpha_H
Expand Down Expand Up @@ -636,3 +636,98 @@ def test_mnar_edge_cases():
assert np.array_equal(mask_a, mask_b)
assert np.array_equal(np.isnan(Y_missing_a), np.isnan(Y_missing_b))


# --- Coupling tests ---

def _hsic_linear(X, Y):
"""Linear HSIC statistic between two sample-space matrices."""
n = X.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
Kc = H @ (X @ X.T) @ H
Lc = H @ (Y @ Y.T) @ H
return np.trace(Kc @ Lc) / (n - 1) ** 2


def _hsic_permtest(X, Y, n_perm=300, seed=0):
"""Return (hsic_obs, p_value) via permutation test."""
rng = np.random.default_rng(seed)
obs = _hsic_linear(X, Y)
count = sum(
_hsic_linear(X, Y[rng.permutation(len(Y))]) >= obs
for _ in range(n_perm)
)
return obs, count / n_perm


def test_zero_coupling_independence():
"""coupling_strength=0: cross-glycome HSIC should be non-significant.

Note: shared bio/batch labels still cause a tiny amount of common variance,
so we cannot assert exact zero — we assert that HSIC is not statistically
significant at alpha=0.05 via a permutation test.
"""
from glycoforge.sim_coupled import inject_coupling

rng_a = np.random.default_rng(7)
rng_b = np.random.default_rng(13)
n_samples, n_glycans = 60, 15
Y_A = rng_a.standard_normal((n_samples, n_glycans))
Y_B = rng_b.standard_normal((n_samples, n_glycans))

inject_coupling(
Y_A, Y_B,
coupling_strength=0.0,
n_coupling_components=2,
coupling_motif_A=None, coupling_motif_B=None,
coupling_motif_bias=0.0,
glycan_sequences_A=None, glycan_sequences_B=None,
seed=42,
)

_, p_value = _hsic_permtest(Y_A, Y_B, n_perm=300, seed=0)
assert p_value >= 0.05, (
f"Expected non-significant HSIC for zero coupling, got p={p_value:.4f}"
)


def test_nonzero_coupling_detectability():
"""coupling_strength=1.0: cross-glycome HSIC should be statistically significant."""
from glycoforge.sim_coupled import inject_coupling

rng_a = np.random.default_rng(7)
rng_b = np.random.default_rng(13)
n_samples, n_glycans = 60, 15
Y_A = rng_a.standard_normal((n_samples, n_glycans))
Y_B = rng_b.standard_normal((n_samples, n_glycans))

inject_coupling(
Y_A, Y_B,
coupling_strength=1.0,
n_coupling_components=2,
coupling_motif_A=None, coupling_motif_B=None,
coupling_motif_bias=0.0,
glycan_sequences_A=None, glycan_sequences_B=None,
seed=42,
)

_, p_value = _hsic_permtest(Y_A, Y_B, n_perm=300, seed=0)
assert p_value < 0.05, (
f"Expected significant HSIC for coupling_strength=1.0, got p={p_value:.4f}"
)


def test_coupling_direction_unit_norm():
"""_build_coupling_directions columns should each have norm = 1.0."""
from glycoforge.sim_coupled import _build_coupling_directions

for n_glycans, n_components in [(10, 1), (20, 3), (5, 5)]:
U = _build_coupling_directions(
n_glycans, n_components,
motif_rules=None, glycan_sequences=None,
motif_bias=0.0, seed=42,
)
norms = np.linalg.norm(U, axis=0)
assert np.allclose(norms, 1.0, atol=1e-10), (
f"Column norms should be 1.0, got {norms}"
)

Loading