@codeflash-ai codeflash-ai bot commented Oct 28, 2025

📄 22% (0.22x) speedup for JournalStorageReplayResult._apply_create_study in optuna/storages/journal/_storage.py

⏱️ Runtime: 43.5 microseconds → 35.6 microseconds (best of 12 runs)
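
As a quick sanity check on how the headline percentage relates to the two runtimes, here is a small sketch. It assumes the figure is computed as original time divided by optimized time, minus one; that formula is consistent with the reported numbers but is not stated in the report, and the helper name `speedup_percent` is made up for illustration.

```python
def speedup_percent(original_us: float, optimized_us: float) -> float:
    """Relative speedup in percent, assuming (original / optimized - 1) * 100."""
    return (original_us / optimized_us - 1.0) * 100.0


if __name__ == "__main__":
    # Runtimes taken from the report, in microseconds.
    print(f"{speedup_percent(43.5, 35.6):.1f}%")
```

Under that assumption the two runtimes work out to roughly 22%, in line with the title. Per-test percentages can differ slightly from this calculation because the displayed timings are rounded.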

📝 Explanation and details

The optimization replaces an expensive linear search with a more efficient set-based lookup when checking for duplicate study names.

Key optimization:

  • Original approach: Used study_name in [s.study_name for s in self._studies.values()], which builds a list and then scans it linearly, O(n) on every call
  • Optimized approach: When studies exist, builds a set of study names with set(s.study_name for s in self._studies.values()) and tests membership against it; the lookup is O(1) on average, but building the set is still O(n) on every call
  • Early exit: Added an if self._studies: check so that the name collection is skipped entirely when no studies exist yet
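
The two forms of the duplicate-name check can be sketched side by side as follows. This is a minimal illustration, not the actual optuna code: `FrozenStudyStub`, `name_exists_original`, and `name_exists_optimized` are hypothetical names, and only the two membership expressions and the emptiness guard are taken from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FrozenStudyStub:
    """Hypothetical stand-in for the study objects stored in ``_studies``."""

    study_name: str


def name_exists_original(studies: dict[int, FrozenStudyStub], study_name: str) -> bool:
    # Original form quoted in the PR: build a list, then scan it linearly.
    return study_name in [s.study_name for s in studies.values()]


def name_exists_optimized(studies: dict[int, FrozenStudyStub], study_name: str) -> bool:
    # Early exit described in the PR: with no studies there is nothing to compare.
    if not studies:
        return False
    # Set-based form quoted in the PR; note the set is rebuilt on every call.
    return study_name in set(s.study_name for s in studies.values())
```

Both functions return the same answer for hashable names. One behavioural difference worth keeping in mind is that the set-based form requires every study name to be hashable, whereas the list-based form only needs equality comparison.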

Why this is faster:

  • Set membership testing is O(1) on average versus O(n) for a list, though this applies only to the lookup itself; constructing the set still visits every study
  • The if self._studies: guard skips the name collection when self._studies is empty, so the first study creation avoids that overhead
  • Passing a generator expression to set() avoids materializing an intermediate list
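
To see how the two membership forms compare as the number of names grows, a micro-benchmark along these lines can be run locally. It is only a sketch: the function names, sizes, and repeat counts are arbitrary choices for illustration, it uses plain strings rather than real study objects, and it makes no claim about what the numbers will be on a given machine.

```python
import timeit


def list_check(names: list, target: str) -> bool:
    # Builds a list, then scans it.
    return target in [n for n in names]


def set_check(names: list, target: str) -> bool:
    # Builds a set from a generator, then does a hash lookup.
    return target in set(n for n in names)


def time_checks(n: int, repeats: int = 200) -> tuple:
    """Return (list_seconds, set_seconds) for `repeats` lookups over n names."""
    names = [f"study_{i}" for i in range(n)]
    target = "not_present"  # worst case for the list scan
    t_list = timeit.timeit(lambda: list_check(names, target), number=repeats)
    t_set = timeit.timeit(lambda: set_check(names, target), number=repeats)
    return t_list, t_set


if __name__ == "__main__":
    for n in (10, 100, 1000):
        t_list, t_set = time_checks(n)
        print(f"n={n}: list={t_list:.6f}s set={t_set:.6f}s")
```

Because both variants rebuild their container on every call, both do work proportional to the number of names; any difference between them is a constant factor rather than a change in complexity class.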

Test case performance:
The timed tests highlighted here each call _apply_create_study once on a freshly created storage:

  • test_create_study_empty_directions: 26.7% faster (7.47μs → 5.90μs)
  • test_create_study_invalid_direction: 15.2% faster (6.46μs → 5.61μs)
  • The report records no timings for the large-scale tests, so the benefit at higher study counts is unmeasured

The 22% overall speedup comes from skipping the name collection when the storage is empty and from replacing the list build and scan with a set build and lookup. Both versions still do O(n) work per call once studies exist, so the gain is a constant-factor improvement rather than a change in complexity.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 68 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 2 Passed |
| 📊 Tests Coverage | 70.0% |

🌀 Generated Regression Tests and Runtime
from typing import Any

# imports
import pytest
from optuna.storages.journal._storage import JournalStorageReplayResult

# --- Minimal stubs for dependencies ---

class DuplicatedStudyError(Exception):
    pass

class StudyDirection:
    def __init__(self, direction: str):
        if direction not in ("minimize", "maximize"):
            raise ValueError("Invalid direction")
        self.direction = direction

    def __eq__(self, other):
        return isinstance(other, StudyDirection) and self.direction == other.direction

    def __repr__(self):
        return f"StudyDirection({self.direction!r})"

# --- Unit tests ---

# 1. Basic Test Cases

def test_create_study_basic_single_direction():
    """Test creating a study with a single direction."""
    storage = JournalStorageReplayResult("workerA")
    log = {
        "study_name": "study1",
        "directions": ["minimize"],
        "worker_id": "workerA"
    }
    storage._apply_create_study(log)
    study = next(iter(storage._studies.values()))

def test_create_study_basic_multiple_directions():
    """Test creating a study with multiple directions."""
    storage = JournalStorageReplayResult("workerB")
    log = {
        "study_name": "multi_dir",
        "directions": ["minimize", "maximize"],
        "worker_id": "workerB"
    }
    storage._apply_create_study(log)
    study = next(iter(storage._studies.values()))

def test_create_study_basic_different_names():
    """Test creating multiple studies with different names."""
    storage = JournalStorageReplayResult("workerC")
    log1 = {"study_name": "foo", "directions": ["maximize"], "worker_id": "workerC"}
    log2 = {"study_name": "bar", "directions": ["minimize"], "worker_id": "workerC"}
    storage._apply_create_study(log1)
    storage._apply_create_study(log2)
    names = [s.study_name for s in storage._studies.values()]

def test_create_study_basic_study_id_increment():
    """Test that study_id increments correctly."""
    storage = JournalStorageReplayResult("workerD")
    for i in range(3):
        log = {"study_name": f"study{i}", "directions": ["minimize"], "worker_id": "workerD"}
        storage._apply_create_study(log)
    ids = [s.study_id for s in storage._studies.values()]

# 2. Edge Test Cases

def test_create_study_duplicate_by_same_worker_raises():
    """Test that creating a study with a duplicate name by the same worker raises DuplicatedStudyError."""
    storage = JournalStorageReplayResult("workerE")
    log = {"study_name": "dup_study", "directions": ["minimize"], "worker_id": "workerE"}
    storage._apply_create_study(log)
    with pytest.raises(DuplicatedStudyError):
        storage._apply_create_study(log)

def test_create_study_duplicate_by_different_worker_no_raise():
    """Test that creating a study with a duplicate name by a different worker does not raise, and is ignored."""
    storage = JournalStorageReplayResult("workerF")
    log1 = {"study_name": "dup_study", "directions": ["minimize"], "worker_id": "workerF"}
    log2 = {"study_name": "dup_study", "directions": ["maximize"], "worker_id": "workerG"}
    storage._apply_create_study(log1)
    # Should not raise, and should not create a new study
    storage._apply_create_study(log2)
    study = next(iter(storage._studies.values()))

def test_create_study_empty_name():
    """Test creating a study with an empty string as the name."""
    storage = JournalStorageReplayResult("workerH")
    log = {"study_name": "", "directions": ["minimize"], "worker_id": "workerH"}
    storage._apply_create_study(log)
    study = next(iter(storage._studies.values()))

def test_create_study_empty_directions():
    """Test creating a study with an empty directions list (should succeed, but directions are empty)."""
    storage = JournalStorageReplayResult("workerI")
    log = {"study_name": "empty_dir", "directions": [], "worker_id": "workerI"}
    storage._apply_create_study(log) # 8.26μs -> 5.92μs (39.5% faster)
    study = next(iter(storage._studies.values()))

def test_create_study_invalid_direction_value():
    """Test creating a study with an invalid direction value (should raise ValueError)."""
    storage = JournalStorageReplayResult("workerJ")
    log = {"study_name": "bad_dir", "directions": ["foobar"], "worker_id": "workerJ"}
    with pytest.raises(ValueError):
        storage._apply_create_study(log) # 6.46μs -> 5.59μs (15.6% faster)

def test_create_study_direction_case_sensitivity():
    """Test that direction values are case sensitive and only 'minimize'/'maximize' are accepted."""
    storage = JournalStorageReplayResult("workerK")
    log = {"study_name": "case_dir", "directions": ["Minimize"], "worker_id": "workerK"}
    with pytest.raises(ValueError):
        storage._apply_create_study(log) # 4.62μs -> 4.66μs (0.687% slower)

def test_create_study_missing_worker_id_key():
    """Test behavior when 'worker_id' key is missing from log (should raise KeyError)."""
    storage = JournalStorageReplayResult("workerL")
    log = {"study_name": "missing_worker", "directions": ["minimize"]}
    with pytest.raises(KeyError):
        storage._apply_create_study(log)

def test_create_study_missing_study_name_key():
    """Test behavior when 'study_name' key is missing from log (should raise KeyError)."""
    storage = JournalStorageReplayResult("workerM")
    log = {"directions": ["minimize"], "worker_id": "workerM"}
    with pytest.raises(KeyError):
        storage._apply_create_study(log) # 1.28μs -> 1.33μs (3.76% slower)

def test_create_study_missing_directions_key():
    """Test behavior when 'directions' key is missing from log (should raise KeyError)."""
    storage = JournalStorageReplayResult("workerN")
    log = {"study_name": "missing_dir", "worker_id": "workerN"}
    with pytest.raises(KeyError):
        storage._apply_create_study(log) # 1.32μs -> 1.18μs (12.2% faster)

def test_create_study_non_string_name():
    """Test creating a study with a non-string name (should succeed)."""
    storage = JournalStorageReplayResult("workerO")
    log = {"study_name": 123, "directions": ["minimize"], "worker_id": "workerO"}
    storage._apply_create_study(log)
    study = next(iter(storage._studies.values()))

def test_create_study_non_list_directions():
    """Test creating a study with directions not as a list (should raise TypeError)."""
    storage = JournalStorageReplayResult("workerP")
    log = {"study_name": "non_list_dir", "directions": "minimize", "worker_id": "workerP"}
    with pytest.raises(TypeError):
        storage._apply_create_study(log)

# 3. Large Scale Test Cases

def test_create_study_many_studies():
    """Test creating a large number of studies to check scalability."""
    storage = JournalStorageReplayResult("workerQ")
    num_studies = 500
    for i in range(num_studies):
        log = {"study_name": f"study_{i}", "directions": ["minimize"], "worker_id": "workerQ"}
        storage._apply_create_study(log)
    # All study names should be unique
    names = set(s.study_name for s in storage._studies.values())

def test_create_study_many_directions():
    """Test creating a study with a large number of directions."""
    storage = JournalStorageReplayResult("workerR")
    directions = ["minimize" if i % 2 == 0 else "maximize" for i in range(1000)]
    log = {"study_name": "large_dir_study", "directions": directions, "worker_id": "workerR"}
    storage._apply_create_study(log)
    study = next(iter(storage._studies.values()))
    for i, d in enumerate(study.directions):
        expected = "minimize" if i % 2 == 0 else "maximize"

def test_create_study_duplicate_names_large():
    """Test that duplicate study names are handled correctly in large scale."""
    storage = JournalStorageReplayResult("workerS")
    # Create 100 studies with the same name by different workers
    for i in range(100):
        log = {"study_name": "same_name", "directions": ["minimize"], "worker_id": f"worker_{i}"}
        # The same call is made on every iteration; none of these worker IDs match "workerS".
        storage._apply_create_study(log)

def test_create_study_duplicate_names_large_same_worker():
    """Test that duplicate study names by the same worker always raise DuplicatedStudyError."""
    storage = JournalStorageReplayResult("workerT")
    log = {"study_name": "repeat_name", "directions": ["minimize"], "worker_id": "workerT"}
    storage._apply_create_study(log)
    for _ in range(5):
        with pytest.raises(DuplicatedStudyError):
            storage._apply_create_study(log)

def test_create_study_large_names():
    """Test creating studies with very long names."""
    storage = JournalStorageReplayResult("workerU")
    long_name = "x" * 1000
    log = {"study_name": long_name, "directions": ["maximize"], "worker_id": "workerU"}
    storage._apply_create_study(log)
    study = next(iter(storage._studies.values()))
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from typing import Any

# imports
import pytest
from optuna.storages.journal._storage import JournalStorageReplayResult


# Minimal stubs for external classes/exceptions
class DuplicatedStudyError(Exception):
    pass

class StudyDirection:
    def __init__(self, direction: str):
        if direction not in ("MINIMIZE", "MAXIMIZE"):
            raise ValueError("Invalid direction")
        self.direction = direction

    def __eq__(self, other):
        if not isinstance(other, StudyDirection):
            return False
        return self.direction == other.direction

    def __repr__(self):
        return f"StudyDirection({self.direction!r})"

# --------------------------
# Unit Tests
# --------------------------

# 1. Basic Test Cases

def test_create_study_basic():
    # Test creating a study with a single direction
    storage = JournalStorageReplayResult("workerA")
    log = {
        "study_name": "study1",
        "directions": ["MINIMIZE"],
        "worker_id": "workerA",
    }
    storage._apply_create_study(log)
    study = list(storage._studies.values())[0]

def test_create_study_multiple_directions():
    # Test creating a study with multiple directions
    storage = JournalStorageReplayResult("workerB")
    log = {
        "study_name": "multi_dir_study",
        "directions": ["MINIMIZE", "MAXIMIZE"],
        "worker_id": "workerB",
    }
    storage._apply_create_study(log)
    study = list(storage._studies.values())[0]

def test_create_multiple_studies():
    # Test creating multiple studies with different names
    storage = JournalStorageReplayResult("workerC")
    log1 = {
        "study_name": "studyA",
        "directions": ["MINIMIZE"],
        "worker_id": "workerC",
    }
    log2 = {
        "study_name": "studyB",
        "directions": ["MAXIMIZE"],
        "worker_id": "workerC",
    }
    storage._apply_create_study(log1)
    storage._apply_create_study(log2)
    names = [s.study_name for s in storage._studies.values()]

# 2. Edge Test Cases

def test_create_study_with_existing_name_by_same_worker_raises():
    # Test that creating a study with an existing name by the same worker raises
    storage = JournalStorageReplayResult("workerD")
    log = {
        "study_name": "study_dup",
        "directions": ["MINIMIZE"],
        "worker_id": "workerD",
    }
    storage._apply_create_study(log)
    # Second call with same name and worker should raise
    with pytest.raises(DuplicatedStudyError):
        storage._apply_create_study(log)

def test_create_study_with_existing_name_by_different_worker_no_raise():
    # Test that creating a study with an existing name by a different worker does not raise
    storage = JournalStorageReplayResult("workerE")
    log1 = {
        "study_name": "shared_study",
        "directions": ["MAXIMIZE"],
        "worker_id": "workerE",
    }
    log2 = {
        "study_name": "shared_study",
        "directions": ["MAXIMIZE"],
        "worker_id": "workerF",  # different worker
    }
    storage._apply_create_study(log1)
    # Should not raise, but also should not create a new study
    storage._apply_create_study(log2)

def test_create_study_empty_name():
    # Test creating a study with an empty name
    storage = JournalStorageReplayResult("workerG")
    log = {
        "study_name": "",
        "directions": ["MINIMIZE"],
        "worker_id": "workerG",
    }
    storage._apply_create_study(log)
    study = list(storage._studies.values())[0]

def test_create_study_empty_directions():
    # Test creating a study with empty directions (should work, but directions list is empty)
    storage = JournalStorageReplayResult("workerH")
    log = {
        "study_name": "empty_dir_study",
        "directions": [],
        "worker_id": "workerH",
    }
    storage._apply_create_study(log) # 7.47μs -> 5.90μs (26.7% faster)
    study = list(storage._studies.values())[0]

def test_create_study_invalid_direction():
    # Test creating a study with an invalid direction string should raise ValueError
    storage = JournalStorageReplayResult("workerI")
    log = {
        "study_name": "bad_dir_study",
        "directions": ["INVALID_DIRECTION"],
        "worker_id": "workerI",
    }
    with pytest.raises(ValueError):
        storage._apply_create_study(log) # 6.46μs -> 5.61μs (15.2% faster)

def test_create_study_with_long_name():
    # Test creating a study with a very long name
    long_name = "x" * 256
    storage = JournalStorageReplayResult("workerJ")
    log = {
        "study_name": long_name,
        "directions": ["MAXIMIZE"],
        "worker_id": "workerJ",
    }
    storage._apply_create_study(log)
    study = list(storage._studies.values())[0]

def test_study_id_increment():
    # Test that study_id increments correctly
    storage = JournalStorageReplayResult("workerK")
    for i in range(5):
        log = {
            "study_name": f"study_{i}",
            "directions": ["MINIMIZE"],
            "worker_id": "workerK",
        }
        storage._apply_create_study(log)

# 3. Large Scale Test Cases

def test_create_many_studies():
    # Test creating a large number of studies (performance/scalability)
    storage = JournalStorageReplayResult("workerL")
    num_studies = 500  # Keep under 1000 as per instructions
    for i in range(num_studies):
        log = {
            "study_name": f"study_{i}",
            "directions": ["MINIMIZE" if i % 2 == 0 else "MAXIMIZE"],
            "worker_id": "workerL",
        }
        storage._apply_create_study(log)
    # All study names are unique
    names = [s.study_name for s in storage._studies.values()]

def test_create_many_duplicate_study_names_by_same_worker():
    # Test that creating many duplicate studies by the same worker all raise after the first
    storage = JournalStorageReplayResult("workerM")
    log = {
        "study_name": "dupe_study",
        "directions": ["MINIMIZE"],
        "worker_id": "workerM",
    }
    storage._apply_create_study(log)
    for _ in range(10):
        with pytest.raises(DuplicatedStudyError):
            storage._apply_create_study(log)

def test_create_many_duplicate_study_names_by_different_workers():
    # Test that creating duplicate studies by different workers does not raise and does not create new studies
    storage = JournalStorageReplayResult("workerN")
    log = {
        "study_name": "dupe_study2",
        "directions": ["MAXIMIZE"],
        "worker_id": "workerN",
    }
    storage._apply_create_study(log)
    # Try with many different worker IDs
    for i in range(1, 50):
        log2 = {
            "study_name": "dupe_study2",
            "directions": ["MAXIMIZE"],
            "worker_id": f"workerN_{i}",
        }
        storage._apply_create_study(log2)

def test_large_study_name_and_directions():
    # Test creating a study with a very long name and many directions
    storage = JournalStorageReplayResult("workerO")
    long_name = "study_" + "y" * 900
    directions = ["MINIMIZE", "MAXIMIZE"] * 400  # 800 directions
    log = {
        "study_name": long_name,
        "directions": directions,
        "worker_id": "workerO",
    }
    storage._apply_create_study(log)
    study = list(storage._studies.values())[0]
    # Check that the directions alternate correctly
    for i in range(800):
        expected = "MINIMIZE" if i % 2 == 0 else "MAXIMIZE"
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from optuna.storages.journal._storage import JournalStorageReplayResult

def test_JournalStorageReplayResult__apply_create_study():
    JournalStorageReplayResult._apply_create_study(JournalStorageReplayResult(''), {'study_name': '', 'directions': ''})
🔎 Concolic Coverage Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| codeflash_concolic_wou29s7s/tmp8ibnnfgl/test_concolic_coverage.py::test_JournalStorageReplayResult__apply_create_study | 7.68μs | 5.41μs | 42.0%✅ |

To edit these changes, run git checkout codeflash/optimize-JournalStorageReplayResult._apply_create_study-mhaxgbgy, commit your edits, and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 28, 2025 18:56
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Oct 28, 2025