4 changes: 3 additions & 1 deletion .gitignore
@@ -1 +1,3 @@
outputs/
outputs/
libs/jump_smiles/CLAUDE.md
libs/jump_smiles/.claude/settings.local.json
116 changes: 57 additions & 59 deletions libs/jump_smiles/README.md
@@ -1,100 +1,98 @@
# JUMP-SMILES Documentation
# JUMP-SMILES

A Python library for standardizing chemical structures using RDKit. Designed to standardize SMILES strings for consistency with the [JUMP Cell Painting datasets](https://github.com/jump-cellpainting/datasets).
Python library for standardizing chemical structures using RDKit. Designed for consistency with [JUMP Cell Painting datasets](https://github.com/jump-cellpainting/datasets).

## Installation

Requires Python 3.11+ and Poetry package manager.
Requires Python 3.11 or 3.12 (RDKit 2023.9.5 constraint).

```bash
# Clone and install
git clone <repository-url>
cd jump-smiles
poetry install
poetry shell
```
uv sync --python 3.11

Core dependencies (managed by Poetry):
- rdkit 2023.9.5
- pandas 2.2.2
- numpy 2.1.1
- fire 0.4.0+
- tqdm 4.64.1
- requests 2.28.2
# Or add to your project (replace BRANCH_NAME with desired branch, or omit @BRANCH_NAME for main)
uv add "git+https://github.com/broadinstitute/monorepo.git@BRANCH_NAME#subdirectory=libs/jump_smiles"
```

## Usage

### Command Line

```bash
poetry run python standardize_smiles.py \
--input molecules.csv \
--output standardized_molecules.csv \
--num_cpu 4 \
--method jump_canonical
# If installed locally
uv run jump-smiles --input molecules.csv --output standardized.csv

# Without installation (replace BRANCH_NAME with desired branch, or omit @BRANCH_NAME for main)
uvx --python 3.11 --from "git+https://github.com/broadinstitute/monorepo.git@BRANCH_NAME#subdirectory=libs/jump_smiles" jump-smiles --input molecules.csv --output standardized.csv
```

### Python API

```python
from smiles.standardize_smiles import StandardizeMolecule
from jump_smiles.standardize_smiles import StandardizeMolecule

# With file input
# File input
standardizer = StandardizeMolecule(
input="molecules.csv",
output="standardized_molecules.csv",
output="standardized.csv",
num_cpu=4
)
standardized_df = standardizer.run()
result = standardizer.run()

# With DataFrame input
# DataFrame input
import pandas as pd
df = pd.DataFrame({
'SMILES': [
'CC(=O)OC1=CC=CC=C1C(=O)O', # Aspirin
'CN1C=NC2=C1C(=O)N(C(=O)N2C)C' # Caffeine
]
})
standardized_df = StandardizeMolecule(input=df).run()
df = pd.DataFrame({'SMILES': ['CC(=O)OC1=CC=CC=C1C(=O)O']})
result = StandardizeMolecule(input=df).run()
```

## Parameters

- `input`: CSV/TSV file path or pandas DataFrame with 'SMILES'/'smiles' column
- `input`: CSV/TSV file or DataFrame with SMILES column
- `output`: Output file path (optional)
- `num_cpu`: Number of CPU cores (default: 1)
- `limit_rows`: Maximum rows to process (optional)
- `augment`: Include original columns in output (default: False)
- `method`: Standardization method (default: "jump_canonical")
- `method`: Standardization method - "jump_canonical" (default) or "jump_alternate_1"
- `random_seed`: For reproducibility (default: 42)

## Standardization Methods

### jump_canonical
The default method used in JUMP Cell Painting datasets. Performs iterative steps until convergence (max 5 iterations):
- Charge parent normalization
- Isotope removal
- Stereo parent normalization
- Tautomer parent normalization
- General standardization
**jump_canonical** (default): The method used in JUMP Cell Painting datasets. Performs iterative normalization until convergence.

If no convergence, selects most common form.
**jump_alternate_1**: Sequential InChI-based standardization, recommended for tautomer-heavy datasets.

### jump_alternate_1
Recommended for tautomer-heavy datasets. Performs sequential steps:
1. InChI-based standardization
2. Structure cleanup
3. Fragment handling
4. Charge neutralization
5. Tautomer canonicalization
See the class docstring for detailed method descriptions.
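
The convergence behaviour described for `jump_canonical` (repeat the steps until the result stops changing, give up after 5 passes, then fall back to the most common form) can be sketched in plain Python. This is only an illustration of the control flow: `iterate_until_stable` and its `steps` argument are hypothetical names, and the real method applies RDKit normalizations rather than string functions.

```python
from collections import Counter
from typing import Callable, Iterable


def iterate_until_stable(
    value: str,
    steps: Iterable[Callable[[str], str]],
    max_iterations: int = 5,
) -> str:
    """Apply `steps` in order, pass after pass, until the value stops changing.

    If no pass leaves the value unchanged within `max_iterations` passes,
    return the form produced most often (ties go to the form seen first).
    """
    steps = list(steps)
    seen = []  # result of each full pass, in order
    current = value
    for _ in range(max_iterations):
        updated = current
        for step in steps:  # hypothetical stand-ins for the RDKit steps
            updated = step(updated)
        seen.append(updated)
        if updated == current:
            return updated  # converged: a full pass changed nothing
        current = updated
    # No convergence: fall back to the most frequently produced form.
    return Counter(seen).most_common(1)[0][0]
```

With a step list that settles, the loop exits early; with steps that oscillate between two forms, the fallback picks whichever form appeared in more passes.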

## Output Format
Returns DataFrame with columns:
- `SMILES_original`: Input SMILES
- `SMILES_standardized`: Standardized SMILES
- `InChI_standardized`: Standardized InChI
- `InChIKey_standardized`: Standardized InChIKey

If `augment=True`, includes all original columns.

## Limitations
1. No 3D structure processing
2. May not find most chemically relevant tautomer
3. Limited handling of complex metal-organic structures

Returns a DataFrame with standardized SMILES, InChI, and InChIKey columns. Set `augment=True` to include the original columns.

## Development

```bash
# Install with dev dependencies
uv sync --python 3.11 --extra dev

# Run tests
uv run pytest

# Run fast tests only (skip slow idempotency tests)
uv run pytest -m "not slow"

# Test idempotency with JUMP compounds (data already included)
uv run pytest test/test_idempotency.py -v # Tests with 100 compounds
uv run pytest test/test_idempotency.py -m "not very_slow" -v # Skip full dataset test
uv run pytest "test/test_idempotency.py::test_standardizer_idempotency[all-jump_canonical]" -v # Test full dataset (~115k compounds, very slow); quoted so shells such as zsh do not glob the brackets
# To refresh/update the test data: uv run --script scripts/download_jump_compounds.py

# Lint and format
uv run ruff check src/
uv run ruff format src/
```

## Important Notes

- **RDKit 2023.9.5** is strictly required for reproducibility with JUMP datasets
- Must use Python 3.11 or 3.12 due to RDKit compatibility
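
For scripts that depend on this package, the Python 3.11/3.12 constraint can be checked up front. The sketch below mirrors the `>=3.11,<3.13` range; `python_supported` is a hypothetical helper, not part of `jump_smiles`.

```python
import sys

SUPPORTED_MIN = (3, 11)  # inclusive lower bound
SUPPORTED_MAX = (3, 13)  # exclusive upper bound


def python_supported(version=None) -> bool:
    """Return True if `version` (default: the running interpreter) is 3.11 or 3.12."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return SUPPORTED_MIN <= major_minor < SUPPORTED_MAX


if __name__ == "__main__":
    if not python_supported():
        raise SystemExit("jump-smiles needs Python 3.11 or 3.12 (RDKit 2023.9.5 wheels)")
```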
53 changes: 30 additions & 23 deletions libs/jump_smiles/pyproject.toml
@@ -1,29 +1,36 @@
[tool.poetry]
[project]
name = "jump-smiles"
version = "0.0.1"
description = ""
version = "0.0.2"
description = "Python library for standardizing chemical structures using RDKit"
readme = "README.md"
authors = ["Alán F. Muñoz <[email protected]>"]
license = "MIT"
authors = [
{name = "Alán F. Muñoz", email = "[email protected]"},
]
license = {text = "MIT"}
requires-python = ">=3.11,<3.13" # RDKit 2023.9.5 only has wheels for 3.11-3.12
dependencies = [
"rdkit==2023.9.5",
"fire>=0.4.0",
"requests>=2.28.0,<3",
"pandas>=2.2.0,<3",
"numpy>=1.26.0,<2",
"tqdm>=4.64.1",
]

[tool.poetry.dependencies]
python = ">=3.11,<4"
rdkit = "2023.9.5"
fire = ">=0.4.0"
requests = "2.28.2"
pandas = "2.2.2"
numpy = "1.26.4"
tqdm = "4.64.1"
[project.scripts]
jump-smiles = "jump_smiles.standardize_smiles:main"

[tool.poetry.group.dev.dependencies]
jupyter = ">=1.0.0"
ipykernel = ">=6.21.2"
pytest = "8.1.1"
jupytext = "^1.15.2"
ipdb = "^0.13.13"
ruff-lsp = "^0.0.50"
ruff = "<0.2.0"
[project.optional-dependencies]
dev = [
"jupyter>=1.0.0",
"ipykernel>=6.21.2",
"pytest>=8.1.1",
"jupytext>=1.15.2,<2",
"ipdb>=0.13.13",
"ruff-lsp>=0.0.50",
"ruff<0.2.0",
]

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
requires = ["hatchling"]
build-backend = "hatchling.build"
7 changes: 6 additions & 1 deletion libs/jump_smiles/pytest.ini
@@ -1,3 +1,8 @@
[pytest]
markers =
slow: marks tests as slow (deselect with '-m "not slow"')
very_slow: marks tests as very slow - full dataset tests (deselect with '-m "not very_slow"')
idempotency: marks idempotency tests
filterwarnings =
ignore::DeprecationWarning:fire.core
ignore::DeprecationWarning:fire.core
ignore::DeprecationWarning:rdkit.Chem.MolStandardize
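
Note on the `idempotency` marker: the property under test is presumably that standardizing an already standardized value changes nothing, i.e. `f(f(x)) == f(x)`. A minimal, library-free sketch of such a check follows; `find_non_idempotent` and `collapse_whitespace` are hypothetical names used only for illustration, and a real test would wrap the standardizer instead.

```python
from typing import Callable, Iterable, List, Tuple


def find_non_idempotent(
    fn: Callable[[str], str], inputs: Iterable[str]
) -> List[Tuple[str, str, str]]:
    """Return (input, once, twice) for every input where fn(fn(x)) != fn(x)."""
    failures = []
    for item in inputs:
        once = fn(item)
        twice = fn(once)  # second application should be a no-op
        if once != twice:
            failures.append((item, once, twice))
    return failures


def collapse_whitespace(text: str) -> str:
    """Hypothetical stand-in for a standardizer; idempotent by construction."""
    return " ".join(text.split())


if __name__ == "__main__":
    print(find_non_idempotent(collapse_whitespace, ["C C  O", " N "]))
```

An empty result means every input survived a second pass unchanged; any returned tuple shows the first and second outputs side by side for debugging.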
66 changes: 66 additions & 0 deletions libs/jump_smiles/scripts/download_jump_compounds.py
@@ -0,0 +1,66 @@
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "pandas>=2.2.0",
# "pooch>=1.8.0",
# "tqdm>=4.64.1",
# ]
# ///
"""
Download JUMP compound dataset and extract only SMILES column.

Usage:
uv run --script scripts/download_jump_compounds.py
"""

import pandas as pd
import pooch
from pathlib import Path

# Configuration
CACHE_DIR = Path(__file__).parent.parent / "test/test_data/jump_compounds"
URL = "https://github.com/jump-cellpainting/datasets/refs/tags/v0.13/metadata/compound.csv.gz"
OUTPUT_FILE = CACHE_DIR / "smiles_only.csv.gz"


def main():
"""Download and process JUMP compounds data."""

# Create cache directory
CACHE_DIR.mkdir(parents=True, exist_ok=True)

# Download with pooch (automatic caching)
print("Downloading/loading JUMP compounds data...")
file_path = pooch.retrieve(
url=URL,
known_hash=None, # Skips hash verification; pooch logs the SHA256 so it can be pinned later
path=CACHE_DIR,
fname="compound.csv.gz",
progressbar=True,
)

# Load and extract SMILES column
print("Extracting SMILES column...")
df = pd.read_csv(
file_path, compression="gzip", usecols=["Metadata_JCP2022", "Metadata_SMILES"]
)

# Clean data
original_count = len(df)
df = df[df["Metadata_SMILES"].notna()]
df = df[df["Metadata_SMILES"] != ""]

# Save compressed
df.to_csv(
OUTPUT_FILE, compression={"method": "gzip", "compresslevel": 9}, index=False
)

# Report
print(f"Saved {len(df)}/{original_count} valid SMILES")
print(f" File: {OUTPUT_FILE}")
print(f" Size: {OUTPUT_FILE.stat().st_size / 1024:.1f} KB")


if __name__ == "__main__":
main()
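
Note on `known_hash=None`: with that setting pooch does not verify the download. One way to pin the file is to compute its SHA256 once and pass it as `known_hash` in the `sha256:<hex>` form pooch accepts. The helpers below are a hypothetical sketch using only the standard library (pooch also ships its own `pooch.file_hash`).

```python
import hashlib


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def as_known_hash(path) -> str:
    """Format the digest as an 'algorithm:hash' string for `known_hash`."""
    return f"sha256:{sha256_of(path)}"
```

Running `as_known_hash` on the cached `compound.csv.gz` and pasting the result into the script would make later runs fail loudly if the upstream file changes.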
42 changes: 40 additions & 2 deletions libs/jump_smiles/src/jump_smiles/standardize_smiles.py
@@ -24,6 +24,40 @@


class StandardizeMolecule:
"""
Standardize chemical structures for consistency with JUMP Cell Painting datasets.

This class provides two standardization methods:

1. **jump_canonical** (default): The method used in JUMP Cell Painting datasets.
Performs iterative standardization until convergence (max 5 iterations):
- Charge parent normalization
- Isotope removal
- Stereo parent normalization
- Tautomer parent normalization
- General standardization
If no convergence after 5 iterations, selects the most common form.

2. **jump_alternate_1**: Recommended for tautomer-heavy datasets.
Performs sequential steps:
- InChI-based standardization
- Structure cleanup
- Fragment handling
- Charge neutralization
- Tautomer canonicalization

Output includes:
- SMILES_original: Input SMILES
- SMILES_standardized: Standardized SMILES
- InChI_standardized: Standardized InChI
- InChIKey_standardized: Standardized InChIKey

Limitations:
- No 3D structure processing
- May not find most chemically relevant tautomer
- Limited handling of complex metal-organic structures
"""

def __init__(
self,
input: Union[str, pd.DataFrame],
@@ -43,7 +77,7 @@ def __init__(
:param limit_rows: Limit the number of rows to be processed (optional)
:param augment: The output is the input file augmented with the standardized SMILES, InChI, and InChIKey (default: False)
:param method: Standardization method to use: "jump_canonical" or "jump_alternate_1" (default: "jump_canonical")

:param random_seed: Random seed for reproducibility (default: 42)
"""
self.input = input
self.output = output
@@ -308,5 +342,9 @@ def run(self):
return standardized_df


if __name__ == "__main__":
def main():
fire.Fire(StandardizeMolecule)


if __name__ == "__main__":
main()
Binary file not shown.