diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml new file mode 100644 index 0000000..d25190e --- /dev/null +++ b/.github/workflows/tests.yml @@ -0,0 +1,24 @@ +name: Run Tests + +on: + push: + branches: [main] + pull_request: + branches: [main] + +jobs: + test: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: "3.11" + + - name: Install dependencies + run: python -m pip install -e ".[dev]" + + - name: Run tests + run: pytest tests/ -v diff --git a/README.md b/README.md index 1871ad9..b1329f3 100644 --- a/README.md +++ b/README.md @@ -14,14 +14,20 @@ ### Installation -* Python >= 3.10 required. +* **Python 3.10–3.12 required** (`>=3.10,<3.13`). We recommend creating a dedicated virtual environment: + +```bash +python3.10 -m venv .venv +source .venv/bin/activate # Windows: .venv\Scripts\activate +``` + * Core dependency: `glycowork>=1.7.1` ```bash pip install glycoforge ``` -OR +OR ```bash git clone https://github.com/BojarLab/GlycoForge.git @@ -31,14 +37,9 @@ source .venv/bin/activate pip install -e . ``` - - ### Usage -See [run_simulation.ipynb](run_simulation.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BojarLab/GlycoForge/blob/main/run_simulation.ipynb)for interactive examples, or [use_cases/batch_correction/](use_cases/batch_correction) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BojarLab/GlycoForge/blob/main/use_cases/batch_correction/run_correction.ipynb) for batch correction workflows. 
- - - +See [run_simulation.ipynb](run_simulation.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BojarLab/GlycoForge/blob/main/run_simulation.ipynb) for interactive simulation examples, [use_cases/batch_correction/](use_cases/batch_correction) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BojarLab/GlycoForge/blob/main/use_cases/batch_correction/run_correction.ipynb) for batch correction workflows, and [benchmarking_batch_effect_removal.ipynb](benchmarking_batch_effect_removal.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BojarLab/GlycoForge/blob/main/benchmarking_batch_effect_removal.ipynb) for the full six-method benchmark. ## How the simulator works @@ -48,10 +49,9 @@ We keep everything in the CLR (centered log-ratio) space: - Flip to CLR: `z_H = clr(p_H)`. - For selected glycans, push the signal using real or synthetic effect sizes: `z_U = z_H + m * lambda * d_robust`, where `m` is the differential mask, `lambda` is `bio_strength`, and `d_robust` is the effect vector after `robust_effect_size_processing`. - **Simplified mode**: draw synthetic effect sizes (log-fold changes) and pass them through the same robust processing pipeline. - - **Hybrid mode**: start from the Cohen's *d* values returned by `glycowork.get_differential_expression`; `define_differential_mask` lets you restrict the injection to significant hits or top-*N* glycans before scaling. 
- Invert back to proportions: `p_U = invclr(z_U)` and scale by `k_dir` to get `alpha_U`; note that the healthy and unhealthy Dirichlet strengths use different `k_dir` values, and a separate `variance_ratio` controls their relative magnitude. -- Batch effects ride on top as direction vectors `u_b`, so a clean CLR sample `Y_clean` becomes `Y_with_batch = Y_clean + kappa_mu * u_b + epsilon`, with `var_b` controlling spread. - +- Batch effects ride on top as direction vectors `u_b`, so a clean CLR sample `Y_clean` becomes `Y_with_batch = Y_clean + kappa_mu * sigma * u_b + epsilon`, with `var_b` controlling spread. ## Simulation Modes @@ -66,17 +66,17 @@ No real data dependency. Ideal for controlled experiments with known ground trut **Pipeline steps:** -1. Initializes log-normal healthy baseline: `alpha_H = ones(n_glycans) * 10` +1. Initializes the log-normal healthy baseline: `alpha_H` is sampled from a log-normal distribution (μ=0, σ=1, fixed seed=42) and rescaled to a mean of 10 2. For each random seed, generates `alpha_U` by randomly scaling `alpha_H`: - `up_frac` (default 30%) upregulated with scale factors from `up_scale_range=(1.1, 3.0)` - - `down_frac` (default 30%) downregulated with scale factors from `down_scale_range=(0.3, 0.9)` - - Remaining glycans (~40%) stay unchanged + - `down_frac` (default 35%) downregulated with scale factors from `down_scale_range=(0.3, 0.9)` + - Remaining glycans (~35%) stay unchanged 3. Samples clean cohorts from `Dirichlet(alpha_H)` and `Dirichlet(alpha_U)` with `n_H` healthy and `n_U` unhealthy samples 4. Defines batch effect direction vectors `u_dict` once per simulation run (fixed seed ensures reproducible batch geometry across parameter sweep) 5. Applies batch effects controlled by `kappa_mu` (shift strength) and `var_b` (variance scaling) 6. 
Optionally applies MNAR (Missing Not At Random) missingness: - - `missing_fraction`: proportion of missing values (0.0-1.0) - - `mnar_bias`: intensity-dependent bias (default 2.0, range 0.5-5.0) + - `missing_fraction`: proportion of missing values (0.0–1.0) + - `mnar_bias`: intensity-dependent bias (default 2.0, range 0.5–5.0) - Left-censored pattern: low-abundance glycans more likely to be missing 7. Grid search over `kappa_mu` and `var_b` produces multiple datasets under identical batch effect structure @@ -85,7 +85,7 @@ No real data dependency. Ideal for controlled experiments with known ground trut
-Templated mode (data_source="real") – Extract biological effect from input reference data + simulate batch effect (click to show detail introduction) +Templated mode (data_source="real") – Extract biological effect from input reference data + simulate batch effect (click to show a detailed introduction)
@@ -102,7 +102,7 @@ Starts from real glycomics data to preserve biological signal structure. Accepts - `"Top-N"`: top N glycans by absolute effect size (e.g., `"Top-10"`) 5. Processes effect sizes through `robust_effect_size_processing`: - Centers effect sizes to remove global shift - - Applies Winsorization to clip extreme outliers (auto-selects percentile 85-99, or uses `winsorize_percentile`) + - Applies Winsorization to clip extreme outliers (auto-selects percentile 85–99, or uses `winsorize_percentile`) - Normalizes by baseline (`baseline_method`: median, MAD, or p75) - Returns normalized `d_robust` scaled by `bio_strength` 6. Injects effects in CLR space: `z_U = z_H + mask * bio_strength * d_robust` @@ -111,7 +111,7 @@ Starts from real glycomics data to preserve biological signal structure. Accepts 9. Samples clean cohorts from `Dirichlet(alpha_H)` and `Dirichlet(alpha_U)` with `n_H` healthy and `n_U` unhealthy samples 10. Defines batch effect direction vectors `u_dict` once per run (fixed seed ensures fair comparison across parameter combinations) 11. Applies batch effects: `y_batch = y_clean + kappa_mu * sigma * u_b + epsilon`, where `epsilon ~ N(0, sqrt(var_b) * sigma)` -12. Optionally applies MNAR missingness (same as Simplified mode: left-censored pattern biased toward low-abundance glycans) +12. Optionally applies MNAR missingness (same as Simplified mode) 13. 
Grid search over `bio_strength`, `k_dir`, `variance_ratio`, `kappa_mu`, `var_b` to systematically test biological signal and batch effect interactions **Key parameters:** `data_file`, `column_prefix`, `bio_strength`, `k_dir`, `variance_ratio`, `differential_mask`, `winsorize_percentile`, `baseline_method`, `kappa_mu`, `var_b`, `missing_fraction`, `mnar_bias` @@ -147,9 +147,27 @@ Generates two glycomic datasets (e.g., _N_- and _O_-glycomics) that share sample The [use_cases/batch_correction/](use_cases/batch_correction) directory demonstrates: - Running a `glycoforge` simulation and then applying the correction workflow +- Six-method batch correction benchmark (ComBat, Percentile, Ratio-ComBat, Harmony, limma-style, Stratified ComBat) across a parameter grid of biological signal strengths and batch effect severities - Batch correction effectiveness metrics visualization - ## Limitation -**Two biological groups only**: Current implementation targets healthy/unhealthy setup. Supporting multi-stage disease (>=3 groups) requires refactoring Dirichlet parameter generation and evaluation metrics. +**Two biological groups only**: The current implementation targets a healthy/unhealthy setup. Supporting multi-stage disease (≥3 groups) requires refactoring Dirichlet parameter generation and evaluation metrics. + +## Citation + +If you use GlycoForge in your research, please cite: + +> Hu, S. and Bojar, D. (2026). GlycoForge generates realistic glycomics data under known ground truth for rigorous method benchmarking. 
*bioRxiv*, doi:10.64898/2026.02.20.707134 + +**BibTeX:** +```bibtex +@article{hu2026glycoforge, + title = {GlycoForge generates realistic glycomics data under known ground truth for rigorous method benchmarking}, + author = {Hu, Siyu and Bojar, Daniel}, + journal = {bioRxiv}, + year = {2026}, + doi = {10.64898/2026.02.20.707134}, + url = {https://www.biorxiv.org/content/10.64898/2026.02.20.707134v1} +} +``` \ No newline at end of file diff --git a/benchmarking_batch_effect_removal.ipynb b/benchmarking_batch_effect_removal.ipynb index ee7e1ef..e202296 100644 --- a/benchmarking_batch_effect_removal.ipynb +++ b/benchmarking_batch_effect_removal.ipynb @@ -13,8 +13,8 @@ "import pandas as pd\n", "import numpy as np\n", "from glycoforge import simulate\n", - "from methods import combat, add_noise_to_zero_variance_features, percentile_normalization, ratio_preserving_combat, harmony_correction, limma_style_correction, stratified_combat\n", - "from evaluation import quantify_batch_effect_impact, compare_differential_expression\n", + "from use_cases.batch_correction.methods import combat, add_noise_to_zero_variance_features, percentile_normalization, ratio_preserving_combat, harmony_correction, limma_style_correction, stratified_combat\n", + "from use_cases.batch_correction.evaluation import quantify_batch_effect_impact, compare_differential_expression\n", "from glycoforge.utils import clr, invclr\n", "import json\n", "\n", @@ -189,6 +189,17 @@ "print(\"Aggregated results saved to results/method_comparison_aggregated.csv\")" ] }, + { + "cell_type": "code", + "execution_count": null, + "id": "893f4d0e", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "os.makedirs('results/figures', exist_ok=True)" + ] + }, { "cell_type": "code", "execution_count": null, @@ -406,7 +417,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": ".venv", "language": "python", "name": "python3" }, @@ -420,7 +431,7 @@ "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.6" + "version": "3.12.13" } }, "nbformat": 4, diff --git a/tests/test_core_functions.py b/tests/test_core_functions.py index 32001a2..05cc2d9 100644 --- a/tests/test_core_functions.py +++ b/tests/test_core_functions.py @@ -281,8 +281,8 @@ def test_motif_based_alpha_generation(): # Validation 1: Array dimensions are correct assert len(alpha_U_with_motif) == len(glycan_sequences) assert len(alpha_U_no_motif) == len(glycan_sequences) - -# Validation 2: Motif bias affects the distribution + + # Validation 2: Motif bias affects the distribution # With motif rules, the distribution should differ from no-motif case # Test that motif_bias has an effect (not testing specific direction due to pairing complexity) ratio_with_motif = alpha_U_with_motif / alpha_H @@ -636,3 +636,118 @@ def test_mnar_edge_cases(): assert np.array_equal(mask_a, mask_b) assert np.array_equal(np.isnan(Y_missing_a), np.isnan(Y_missing_b)) + +# --- Coupling tests --- + +def _hsic_linear(X, Y): + """Linear HSIC statistic between two sample-space matrices.""" + n = X.shape[0] + H = np.eye(n) - np.ones((n, n)) / n + Kc = H @ (X @ X.T) @ H + Lc = H @ (Y @ Y.T) @ H + return np.trace(Kc @ Lc) / (n - 1) ** 2 + + +def _hsic_permtest(X, Y, n_perm=300, seed=0): + """Return (hsic_obs, p_value) via permutation test.""" + rng = np.random.default_rng(seed) + + # Precompute centering matrix and centered kernel for X (fixed across permutations) + n = X.shape[0] + H = np.eye(n) - np.ones((n, n)) / n + Kx = X @ X.T + Kc = H @ Kx @ H + + # Observed HSIC with original Y + Ly = Y @ Y.T + Lc = H @ Ly @ H + obs = np.trace(Kc @ Lc) / (n - 1) ** 2 + + # Permutation distribution: reuse Kc and H; only Y changes + count = 0 + for _ in range(n_perm): + perm = rng.permutation(n) + Y_perm = Y[perm] + Ly_perm = Y_perm @ Y_perm.T + Lc_perm = H @ Ly_perm @ H + hsic_perm = np.trace(Kc @ Lc_perm) / (n - 1) ** 2 + if hsic_perm >= obs: + count += 1 + + # Use 
(count + 1) / (n_perm + 1) to include the observed statistic + # and avoid zero/one p-values with finite permutations. + return obs, (count + 1) / (n_perm + 1) + + +def test_zero_coupling_independence(): + """coupling_strength=0: cross-glycome HSIC should be non-significant. + + Note: shared bio/batch labels still cause a tiny amount of common variance, + so we cannot assert exact zero — we assert that HSIC is not statistically + significant at alpha=0.05 via a permutation test. + """ + from glycoforge.sim_coupled import inject_coupling + + rng_a = np.random.default_rng(7) + rng_b = np.random.default_rng(13) + n_samples, n_glycans = 60, 15 + Y_A = rng_a.standard_normal((n_samples, n_glycans)) + Y_B = rng_b.standard_normal((n_samples, n_glycans)) + + inject_coupling( + Y_A, Y_B, + coupling_strength=0.0, + n_coupling_components=2, + coupling_motif_A=None, coupling_motif_B=None, + coupling_motif_bias=0.0, + glycan_sequences_A=None, glycan_sequences_B=None, + seed=42, + ) + + _, p_value = _hsic_permtest(Y_A, Y_B, n_perm=300, seed=0) + assert p_value >= 0.05, ( + f"Expected non-significant HSIC for zero coupling, got p={p_value:.4f}" + ) + + +def test_nonzero_coupling_detectability(): + """coupling_strength=1.0: cross-glycome HSIC should be statistically significant.""" + from glycoforge.sim_coupled import inject_coupling + + rng_a = np.random.default_rng(7) + rng_b = np.random.default_rng(13) + n_samples, n_glycans = 60, 15 + Y_A = rng_a.standard_normal((n_samples, n_glycans)) + Y_B = rng_b.standard_normal((n_samples, n_glycans)) + + inject_coupling( + Y_A, Y_B, + coupling_strength=1.0, + n_coupling_components=2, + coupling_motif_A=None, coupling_motif_B=None, + coupling_motif_bias=0.0, + glycan_sequences_A=None, glycan_sequences_B=None, + seed=42, + ) + + _, p_value = _hsic_permtest(Y_A, Y_B, n_perm=300, seed=0) + assert p_value < 0.05, ( + f"Expected significant HSIC for coupling_strength=1.0, got p={p_value:.4f}" + ) + + +def test_coupling_direction_unit_norm(): + 
"""_build_coupling_directions columns should each have norm = 1.0.""" + from glycoforge.sim_coupled import _build_coupling_directions + + for n_glycans, n_components in [(10, 1), (20, 3), (5, 5)]: + U = _build_coupling_directions( + n_glycans, n_components, + motif_rules=None, glycan_sequences=None, + motif_bias=0.0, seed=42, + ) + norms = np.linalg.norm(U, axis=0) + assert np.allclose(norms, 1.0, atol=1e-10), ( + f"Column norms should be 1.0, got {norms}" + ) +