vigneshgr

Fixes Issue #55

  • Added K-means clustering for numerical feature analysis
  • Enhanced sensitive_representativity to handle numerical features
  • Added comprehensive test suite
  • Updated package dependencies to pinned versions, because dython's compute_associations function was deprecated
  • Updated Python version requirement to >=3.9

Why Package Updates Were Needed

During development, we encountered several compatibility issues that required updating the package versions:

  1. The older version of dython (0.6.7) relied on the compute_associations function, which is deprecated. The new version (0.7.9) provides the associations function instead, with an improved API.
  2. The numerical clustering functionality requires newer versions of numpy and scikit-learn for better performance and stability.
  3. The loose version constraints (using >=) were replaced with pinned versions to ensure reproducible builds and prevent unexpected breaks from dependency updates.
  4. Python requirement was updated to >=3.9 because:
    • The newer versions of numpy (2.3.2) and pandas (2.3.2) require Python 3.9+
    • This ensures all dependencies work together consistently
    • Helps prevent potential compatibility issues during installation

These updates make the package more reliable and maintainable while ensuring all contributors work with the same tested dependency versions.
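As a rough illustration of the pinning policy described above, the sketch below checks that requirement strings are exact pins and that the interpreter meets the 3.9 minimum. The helper names (unpinned, python_supported) and the name==version string format are assumptions made for this example, not code from this PR. Only the versions mentioned in this description are listed.

```python
import re
import sys

# Versions named in this PR description; the name==version format is assumed,
# and any other dependencies are deliberately left out.
REQUIREMENTS = ["dython==0.7.9", "numpy==2.3.2", "pandas==2.3.2"]

# An exact pin looks like "package==1.2.3"; anything else counts as loose.
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+==\d+(\.\d+)*$")


def unpinned(requirements):
    """Return the requirement strings that are not exact name==version pins."""
    return [req for req in requirements if not PINNED.match(req.strip())]


def python_supported(minimum=(3, 9), version=None):
    """True when the interpreter (or a given version tuple) meets the minimum."""
    version = sys.version_info[:2] if version is None else version
    return tuple(version) >= tuple(minimum)


print(unpinned(REQUIREMENTS))
print(unpinned(["numpy>=1.20", "pandas==2.3.2"]))
print(python_supported())
```

A loose constraint such as numpy>=1.20 is reported by unpinned, which is the kind of entry item 3 above replaces.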

API Changes

  • No breaking changes to existing API
  • Added optional parameters to sensitive_representativity:
    • n_clusters: Number of clusters for numerical analysis (default: 5)
    • num_threshold: Threshold for disproportionate representation warning (default: 0.2)
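To give a feel for how n_clusters and num_threshold might interact, here is a minimal pure-Python sketch: it groups a numerical feature with a tiny 1-D k-means and reports clusters whose share of rows falls below the threshold. This is not the implementation in this PR. The function names are hypothetical, the evenly spaced initialisation is a simplification, and the reading of num_threshold as a minimum share per cluster is an assumption.

```python
from collections import Counter


def kmeans_1d(values, n_clusters, max_iter=100):
    """Tiny 1-D Lloyd's k-means; returns one cluster label per value."""
    lo, hi = min(values), max(values)
    # Deterministic start: centres evenly spaced between the min and the max.
    centres = [lo + (hi - lo) * i / max(n_clusters - 1, 1) for i in range(n_clusters)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centre.
        new_labels = [
            min(range(n_clusters), key=lambda c: abs(v - centres[c])) for v in values
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each non-empty centre to the mean of its members.
        for c in range(n_clusters):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centres[c] = sum(members) / len(members)
    return labels


def underrepresented_clusters(values, n_clusters=5, num_threshold=0.2):
    """Return {cluster: share} for non-empty clusters with share < num_threshold.

    Assumption: num_threshold is treated as a minimum share of rows per cluster.
    """
    counts = Counter(kmeans_1d(values, n_clusters))
    shares = {c: n / len(values) for c, n in sorted(counts.items())}
    return {c: share for c, share in shares.items() if share < num_threshold}


# Values of the numerical_sensitive column from the usage example in this PR.
values = [1, 1.1, 5, 5.1, 10, 10.1, 10.2, 10.3, 10.4, 10.5]
print(underrepresented_clusters(values, n_clusters=3, num_threshold=0.25))
# → {0: 0.2, 1: 0.2}
```

With three clusters the two small groups each hold 20% of the rows, so they are reported at a 0.25 threshold but not at the default 0.2, because the comparison in this sketch is strict.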

Test Results

All tests passing:

================================= test session starts =================================
collected 3 items

tests/engines/test_bias_fairness.py::test_numerical_representativity_analysis PASSED [ 33%]
tests/engines/test_bias_fairness.py::test_sensitive_representativity PASSED [ 66%]
tests/engines/test_bias_fairness.py::test_sensitive_representativity_balanced PASSED [100%]

============================ 3 passed, 2 warnings in 9.02s ============================

Usage Example

from ydata_quality import DataQuality
import pandas as pd

# Build a small example DataFrame with one numerical and one categorical sensitive feature
df = pd.DataFrame({
    'numerical_sensitive': [1, 1.1, 5, 5.1, 10, 10.1, 10.2, 10.3, 10.4, 10.5],
    'categorical_sensitive': ['A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C'],
    'label': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
})

# Create DataQuality object with sensitive features
dq = DataQuality(
    df=df,
    sensitive_features=['numerical_sensitive', 'categorical_sensitive']
)

# Run analysis
results = dq.evaluate()

# Get warnings about representativity issues
# (named quality_warnings to avoid shadowing the standard-library warnings module)
quality_warnings = dq.get_warnings()

Checklist
 Added new feature
 Added tests
 Updated dependencies
 All tests passing
 Documentation update

@vigneshgr

@portellaa @gmartinsribeiro Could you please review this and let me know your feedback? Thanks!
