broadinstitute · shntnu · Aug 26, 2025
diff --git a/.gitignore b/.gitignore
@@ -1 +1,3 @@
-outputs/
+outputs/
+libs/jump_smiles/CLAUDE.md
+libs/jump_smiles/.claude/settings.local.json
diff --git a/libs/jump_smiles/README.md b/libs/jump_smiles/README.md
@@ -1,100 +1,98 @@
-# JUMP-SMILES Documentation
+# JUMP-SMILES
 
-A Python library for standardizing chemical structures using RDKit. Designed to standardize SMILES strings for consistency with the [JUMP Cell Painting datasets](https://github.com/jump-cellpainting/datasets).
+Python library for standardizing chemical structures using RDKit. Designed for consistency with [JUMP Cell Painting datasets](https://github.com/jump-cellpainting/datasets).
 
 ## Installation
 
-Requires Python 3.11+ and Poetry package manager.
+Requires Python 3.11 or 3.12 (RDKit 2023.9.5 constraint).
 
 ```bash
+# Clone and install
 git clone <repository-url>
 cd jump-smiles
-poetry install
-poetry shell
-```
+uv sync --python 3.11
 
-Core dependencies (managed by Poetry):
-- rdkit 2023.9.5
-- pandas 2.2.2
-- numpy 2.1.1
-- fire 0.4.0+
-- tqdm 4.64.1
-- requests 2.28.2
+# Or add to your project (replace BRANCH_NAME with desired branch, or omit @BRANCH_NAME for main)
+uv add "git+https://github.com/broadinstitute/monorepo.git@BRANCH_NAME#subdirectory=libs/jump_smiles"
+```
 
 ## Usage
 
 ### Command Line
+
 ```bash
-poetry run python standardize_smiles.py \
-  --input molecules.csv \
-  --output standardized_molecules.csv \
-  --num_cpu 4 \
-  --method jump_canonical
+# If installed locally
+uv run jump-smiles --input molecules.csv --output standardized.csv
+
+# Without installation (replace BRANCH_NAME with desired branch, or omit @BRANCH_NAME for main)
+uvx --python 3.11 --from "git+https://github.com/broadinstitute/monorepo.git@BRANCH_NAME#subdirectory=libs/jump_smiles" jump-smiles --input molecules.csv --output standardized.csv
 ```
 
 ### Python API
+
 ```python
-from smiles.standardize_smiles import StandardizeMolecule
+from jump_smiles.standardize_smiles import StandardizeMolecule
 
-# With file input
+# File input
 standardizer = StandardizeMolecule(
     input="molecules.csv",
-    output="standardized_molecules.csv",
+    output="standardized.csv",
     num_cpu=4
 )
-standardized_df = standardizer.run()
+result = standardizer.run()
 
-# With DataFrame input
+# DataFrame input
 import pandas as pd
-df = pd.DataFrame({
-    'SMILES': [
-        'CC(=O)OC1=CC=CC=C1C(=O)O',  # Aspirin
-        'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'  # Caffeine
-    ]
-})
-standardized_df = StandardizeMolecule(input=df).run()
+df = pd.DataFrame({'SMILES': ['CC(=O)OC1=CC=CC=C1C(=O)O']})
+result = StandardizeMolecule(input=df).run()
 ```
 
 ## Parameters
 
-- `input`: CSV/TSV file path or pandas DataFrame with 'SMILES'/'smiles' column
+- `input`: CSV/TSV file or DataFrame with SMILES column
 - `output`: Output file path (optional)
 - `num_cpu`: Number of CPU cores (default: 1)
 - `limit_rows`: Maximum rows to process (optional)
 - `augment`: Include original columns in output (default: False)
-- `method`: Standardization method (default: "jump_canonical")
+- `method`: Standardization method - "jump_canonical" (default) or "jump_alternate_1"
 - `random_seed`: For reproducibility (default: 42)
 
 ## Standardization Methods
 
-### jump_canonical
-The default method used in JUMP Cell Painting datasets. Performs iterative steps until convergence (max 5 iterations):
-- Charge parent normalization
-- Isotope removal
-- Stereo parent normalization
-- Tautomer parent normalization
-- General standardization
+**jump_canonical** (default): The method used in JUMP Cell Painting datasets. Performs iterative normalization until convergence.
 
-If no convergence, selects most common form.
+**jump_alternate_1**: Sequential InChI-based standardization, recommended for tautomer-heavy datasets.
 
-### jump_alternate_1
-Recommended for tautomer-heavy datasets. Performs sequential steps:
-1. InChI-based standardization
-2. Structure cleanup
-3. Fragment handling
-4. Charge neutralization
-5. Tautomer canonicalization
+See the class docstring for detailed method descriptions.
 
 ## Output Format
-Returns DataFrame with columns:
-- `SMILES_original`: Input SMILES
-- `SMILES_standardized`: Standardized SMILES
-- `InChI_standardized`: Standardized InChI
-- `InChIKey_standardized`: Standardized InChIKey
-
-If `augment=True`, includes all original columns.
-
-## Limitations
-1. No 3D structure processing
-2. May not find most chemically relevant tautomer
-3. Limited handling of complex metal-organic structures
+
+Returns DataFrame with standardized SMILES, InChI, and InChIKey. Use `augment=True` to include original columns.
+
+## Development
+
+```bash
+# Install with dev dependencies
+uv sync --python 3.11 --extra dev
+
+# Run tests
+uv run pytest
+
+# Run fast tests only (skip slow idempotency tests)
+uv run pytest -m "not slow"
+
+# Test idempotency with JUMP compounds (data already included)
+uv run pytest test/test_idempotency.py -v  # Tests with 100 compounds
+uv run pytest test/test_idempotency.py -m "not very_slow" -v  # Skip full dataset test
+uv run pytest "test/test_idempotency.py::test_standardizer_idempotency[all-jump_canonical]" -v  # Test full dataset (~115k compounds, very slow)
+# To refresh/update the test data: uv run --script scripts/download_jump_compounds.py
+
+# Lint and format
+uv run ruff check src/
+uv run ruff format src/
+```
+
+## Important Notes
+
+- **RDKit 2023.9.5** is strictly required for reproducibility with JUMP datasets
+- Must use Python 3.11 or 3.12 due to RDKit compatibility
diff --git a/libs/jump_smiles/pyproject.toml b/libs/jump_smiles/pyproject.toml
@@ -1,29 +1,36 @@
-[tool.poetry]
+[project]
 name = "jump-smiles"
-version = "0.0.1"
-description = ""
+version = "0.0.2"
+description = "Python library for standardizing chemical structures using RDKit"
 readme = "README.md"
-authors = ["Alán F. Muñoz <[email protected]>"]
-license = "MIT"
+authors = [
+    {name = "Alán F. Muñoz", email = "[email protected]"},
+]
+license = {text = "MIT"}
+requires-python = ">=3.11,<3.13"  # RDKit 2023.9.5 only has wheels for 3.11-3.12
+dependencies = [
+    "rdkit==2023.9.5",
+    "fire>=0.4.0",
+    "requests>=2.28.0,<3",
+    "pandas>=2.2.0,<3",
+    "numpy>=1.26.0,<2",
+    "tqdm>=4.64.1",
+]
 
-[tool.poetry.dependencies]
-python = ">=3.11,<4"
-rdkit = "2023.9.5"
-fire = ">=0.4.0"
-requests = "2.28.2"
-pandas = "2.2.2"
-numpy = "1.26.4"
-tqdm = "4.64.1"
+[project.scripts]
+jump-smiles = "jump_smiles.standardize_smiles:main"
 
-[tool.poetry.group.dev.dependencies]
-jupyter = ">=1.0.0"
-ipykernel = ">=6.21.2"
-pytest = "8.1.1"
-jupytext = "^1.15.2"
-ipdb = "^0.13.13"
-ruff-lsp = "^0.0.50"
-ruff = "<0.2.0"
+[project.optional-dependencies]
+dev = [
+    "jupyter>=1.0.0",
+    "ipykernel>=6.21.2",
+    "pytest>=8.1.1",
+    "jupytext>=1.15.2,<2",
+    "ipdb>=0.13.13",
+    "ruff-lsp>=0.0.50",
+    "ruff<0.2.0",
+]
 
 [build-system]
-requires = ["poetry-core>=1.0.0"]
-build-backend = "poetry.core.masonry.api"
+requires = ["hatchling"]
+build-backend = "hatchling.build"
diff --git a/libs/jump_smiles/pytest.ini b/libs/jump_smiles/pytest.ini
@@ -1,3 +1,8 @@
 [pytest]
+markers =
+    slow: marks tests as slow (deselect with '-m "not slow"')
+    very_slow: marks tests as very slow - full dataset tests (deselect with '-m "not very_slow"')
+    idempotency: marks idempotency tests
 filterwarnings =
-    ignore::DeprecationWarning:fire.core
+    ignore::DeprecationWarning:fire.core
+    ignore::DeprecationWarning:rdkit.Chem.MolStandardize
diff --git a/libs/jump_smiles/scripts/download_jump_compounds.py b/libs/jump_smiles/scripts/download_jump_compounds.py
@@ -0,0 +1,66 @@
+#!/usr/bin/env python3
+# /// script
+# requires-python = ">=3.11"
+# dependencies = [
+#     "pandas>=2.2.0",
+#     "pooch>=1.8.0",
+#     "tqdm>=4.64.1",
+# ]
+# ///
+"""
+Download the JUMP compound dataset and keep only the Metadata_JCP2022 and Metadata_SMILES columns.
+
+Usage:
+    uv run --script scripts/download_jump_compounds.py
+"""
+
+import pandas as pd
+import pooch
+from pathlib import Path
+
+# Configuration
+CACHE_DIR = Path(__file__).parent.parent / "test/test_data/jump_compounds"
+URL = "https://github.com/jump-cellpainting/datasets/raw/refs/tags/v0.13/metadata/compound.csv.gz"
+OUTPUT_FILE = CACHE_DIR / "smiles_only.csv.gz"
+
+
+def main():
+    """Download and process JUMP compounds data."""
+
+    # Create cache directory
+    CACHE_DIR.mkdir(parents=True, exist_ok=True)
+
+    # Download with pooch (automatic caching)
+    print("Downloading/loading JUMP compounds data...")
+    file_path = pooch.retrieve(
+        url=URL,
+        known_hash=None,  # No hash verification; pooch logs the SHA256 after download
+        path=CACHE_DIR,
+        fname="compound.csv.gz",
+        progressbar=True,
+    )
+
+    # Load and extract SMILES column
+    print("Extracting SMILES column...")
+    df = pd.read_csv(
+        file_path, compression="gzip", usecols=["Metadata_JCP2022", "Metadata_SMILES"]
+    )
+
+    # Clean data
+    original_count = len(df)
+    df = df[df["Metadata_SMILES"].notna()]
+    df = df[df["Metadata_SMILES"] != ""]
+
+    # Save compressed
+    df.to_csv(
+        OUTPUT_FILE, compression={"method": "gzip", "compresslevel": 9}, index=False
+    )
+
+    # Report
+    print(f"Saved {len(df)}/{original_count} valid SMILES")
+    print(f"  File: {OUTPUT_FILE}")
+    print(f"  Size: {OUTPUT_FILE.stat().st_size / 1024:.1f} KB")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/libs/jump_smiles/src/jump_smiles/standardize_smiles.py b/libs/jump_smiles/src/jump_smiles/standardize_smiles.py
@@ -24,6 +24,40 @@
 
 
 class StandardizeMolecule:
+    """
+    Standardize chemical structures for consistency with JUMP Cell Painting datasets.
+
+    This class provides two standardization methods:
+
+    1. **jump_canonical** (default): The method used in JUMP Cell Painting datasets.
+       Performs iterative standardization until convergence (max 5 iterations):
+       - Charge parent normalization
+       - Isotope removal
+       - Stereo parent normalization
+       - Tautomer parent normalization
+       - General standardization
+       If no convergence after 5 iterations, selects the most common form.
+
+    2. **jump_alternate_1**: Recommended for tautomer-heavy datasets.
+       Performs sequential steps:
+       - InChI-based standardization
+       - Structure cleanup
+       - Fragment handling
+       - Charge neutralization
+       - Tautomer canonicalization
+
+    Output includes:
+    - SMILES_original: Input SMILES
+    - SMILES_standardized: Standardized SMILES
+    - InChI_standardized: Standardized InChI
+    - InChIKey_standardized: Standardized InChIKey
+
+    Limitations:
+    - No 3D structure processing
+    - May not find most chemically relevant tautomer
+    - Limited handling of complex metal-organic structures
+    """
+
     def __init__(
         self,
         input: Union[str, pd.DataFrame],
@@ -43,7 +77,7 @@ def __init__(
         :param limit_rows: Limit the number of rows to be processed (optional)
         :param augment: The output is the input file augmented with the standardized SMILES, InChI, and InChIKey (default: False)
         :param method: Standardization method to use: "jump_canonical" or "jump_alternate_1" (default: "jump_canonical")
-
+        :param random_seed: Random seed for reproducibility (default: 42)
         """
         self.input = input
         self.output = output
@@ -308,5 +342,9 @@ def run(self):
         return standardized_df
 
 
-if __name__ == "__main__":
+def main():
     fire.Fire(StandardizeMolecule)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/libs/jump_smiles/test/test_data/jump_compounds/smiles_only.csv.gz b/libs/jump_smiles/test/test_data/jump_compounds/smiles_only.csv.gz
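
Review note: the new class docstring describes `jump_canonical` as iterating until convergence (max 5 iterations) and falling back to the most common form otherwise. The loop itself is not part of this diff, so the following is only a minimal sketch of that control flow under stated assumptions. `standardize_until_stable`, `strip_bang`, and `flip` are hypothetical names, and the toy string functions stand in for the RDKit normalization steps; none of this is the library's actual code.

```python
from collections import Counter
from typing import Callable

MAX_ITERATIONS = 5  # mirrors the "max 5 iterations" stated in the docstring


def standardize_until_stable(
    smiles: str,
    step: Callable[[str], str],
    max_iterations: int = MAX_ITERATIONS,
) -> str:
    """Apply `step` repeatedly until its output stops changing.

    If no fixed point is reached within `max_iterations`, return the most
    common form seen (ties resolve to the form encountered first).
    """
    seen = []
    current = smiles
    for _ in range(max_iterations):
        nxt = step(current)
        seen.append(nxt)
        if nxt == current:  # fixed point reached: converged
            return nxt
        current = nxt
    # No convergence: fall back to the most frequent intermediate form.
    return Counter(seen).most_common(1)[0][0]


# Hypothetical stand-ins for a real normalization pass (assumptions only).
def strip_bang(s: str) -> str:
    """Converging toy step: removes one trailing '!' per pass."""
    return s[:-1] if s.endswith("!") else s


def flip(s: str) -> str:
    """Non-converging toy step: oscillates between 'A' and 'B'."""
    return "B" if s == "A" else "A"


converged = standardize_until_stable("x!!", strip_bang)
fallback = standardize_until_stable("A", flip)
```

In this sketch a converged result is idempotent by construction: feeding `converged` back through `standardize_until_stable` returns it unchanged, which is the property the new idempotency tests check for the real standardizer. The oscillating case shows why a fallback rule is needed at all.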