Improve Handling of Noise Points in Clustering Algorithms (Fixes #152) #200

Open: wants to merge 2 commits into base: main

agentmarketbot

Pull Request Description

Title: Improve Handling of Points Clustered as Noise

Related Issue: Fixes #152
Issue: Improve the handling of points clustered as noise


Summary

This pull request addresses issue #152, which concerns how noise points produced by clustering algorithms, specifically DBSCAN, are handled. Previously, noise points were labeled -1 and excluded from further analysis, which could discard useful information. This update lets the user choose how noise points are treated.

Changes Made

  1. Created NoiseHandlingClustering Class:

    • This new class wraps an existing clustering algorithm and controls how its noise points are labeled. It supports the following strategies:
      • Singleton: Noise points are treated as individual clusters.
      • Drop: Noise points remain labeled as -1 and are not considered in analysis.
      • Group: All noise points are grouped into a single cluster, simplifying further processing.
  2. Updated Functionality:

    • The mapper_connected_components function was modified to incorporate the new clustering strategies, while ensuring that the default behavior remains intact for backward compatibility.
  3. Testing and Validation:

    • A mock clustering algorithm was implemented to test the behavior of the new noise handling strategies under various scenarios:
      • Singleton Mode: Confirms that each noise point is assigned a unique cluster.
      • Group Mode: Validates that all noise points share a single label.
      • Drop Mode: Ensures noise points keep their original -1 label.
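
The three strategies and the mock-based test setup described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `MockClusterer` and `NoiseHandlingSketch` are hypothetical names, and giving noise points labels above the largest existing cluster label is an assumption, since the description does not say which labels are assigned.

```python
import numpy as np


class MockClusterer:
    """Hypothetical stand-in for a clusterer: fit() reports preset labels."""

    def __init__(self, labels):
        self.preset = labels

    def fit(self, X, y=None):
        self.labels_ = np.asarray(self.preset)
        return self


class NoiseHandlingSketch:
    """Illustrative wrapper only; not the class added in this PR."""

    def __init__(self, clustering, noise_handling="singleton"):
        self.clustering = clustering
        self.noise_handling = noise_handling

    def fit(self, X, y=None):
        self.clustering.fit(X, y)
        # np.array copies, so the wrapped clusterer's labels stay untouched.
        labels = np.array(self.clustering.labels_)
        noise = labels == -1
        if noise.any():
            # Assumption: new labels start just above every existing label.
            first_free = labels.max() + 1
            if self.noise_handling == "singleton":
                labels[noise] = first_free + np.arange(noise.sum())
            elif self.noise_handling == "group":
                labels[noise] = first_free
            elif self.noise_handling != "drop":
                raise ValueError(f"unknown noise_handling: {self.noise_handling!r}")
        self.labels_ = labels
        return self


X = np.zeros((5, 2))  # the mock ignores the data
preset = [0, -1, 1, -1, 0]
results = {
    mode: NoiseHandlingSketch(MockClusterer(preset), mode).fit(X).labels_.tolist()
    for mode in ("singleton", "group", "drop")
}
```

With the preset labels above, `results` maps each mode to the relabeled list, which makes the difference between the strategies easy to compare side by side.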

Next Steps

Further testing will be conducted with larger datasets to ensure consistent performance and functionality across varied scenarios. Feedback from team members regarding additional test cases or potential edge cases is welcome.

Thank you for considering this enhancement to improve the handling of noise points in our clustering implementations.


Please let me know if there are any questions or if further adjustments are needed for this pull request!

Add NoiseHandlingClustering wrapper class to provide control over how 
noise points (labeled as -1) are handled during clustering. The wrapper 
supports three modes:
- 'singleton': Convert each noise point to its own cluster
- 'drop': Keep noise points labeled as -1
- 'group': Group all noise points into a single cluster

Update mapper_connected_components documentation to clarify noise point 
behavior and add comprehensive unit tests for the new functionality. 
This change enables more flexible handling of outliers and noise points 
in different Mapper applications.
@agentmarketbot
Author

Aider:

Update Summary: Enhancements to Noise Handling in Clustering

Context: Addressed issue #152 regarding the handling of points classified as noise by clustering algorithms, specifically in relation to the DBSCAN method.

Key Changes Implemented:

  1. Parameter Validation:

    • Introduced validation for the noise_handling parameter in the NoiseHandlingClustering class constructor to ensure only valid options ('singleton', 'drop', 'group') can be used.
  2. Documentation Improvements:

    • Updated the docstring of the NoiseHandlingClustering class to specify that 'singleton' is the default mode and to outline the performance implications of each noise handling option.
  3. Enhanced Noise Handling Logic:

    • Refined the logic for handling noise points in NoiseHandlingClustering:
      • 'drop' Mode: Preserves noise points with label -1.
      • 'group' Mode: Groups all noise points into a single cluster.
      • 'singleton' Mode: Assigns each noise point its own unique cluster label.
    • Ensured that the labels_ attribute is always a numpy array for consistent behavior across methods.
  4. Clustering Algorithm Adjustment:

    • Adjusted the DBSCAN parameters, setting min_samples=2, so that isolated points are identified as noise. (Assuming scikit-learn's DBSCAN, min_samples counts the point itself, so a value of 1 would never produce noise.)
  5. Comprehensive Testing:

    • Added tests that validate each noise handling option, including the edge case where every point is noise.
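
A rough sketch of the validation, the always-an-array guarantee, and the all-noise edge case listed above might look like this. The function names `check_noise_handling` and `handle_noise` are hypothetical and chosen for illustration; the actual class in this PR may be structured differently.

```python
import numpy as np

VALID_NOISE_HANDLING = ("singleton", "drop", "group")


def check_noise_handling(value):
    """Reject anything other than the three documented modes."""
    if value not in VALID_NOISE_HANDLING:
        raise ValueError(
            f"noise_handling must be one of {VALID_NOISE_HANDLING}, got {value!r}"
        )
    return value


def handle_noise(labels, noise_handling="singleton"):
    """Return relabeled cluster labels as a new numpy array."""
    check_noise_handling(noise_handling)
    labels = np.array(labels, dtype=int)  # lists and arrays both become a fresh ndarray
    noise = labels == -1
    if noise_handling == "drop" or not noise.any():
        return labels
    # When every point is noise, labels.max() is -1, so new labels start at 0.
    first_free = labels.max() + 1
    if noise_handling == "group":
        labels[noise] = first_free
    else:  # "singleton"
        labels[noise] = first_free + np.arange(noise.sum())
    return labels


all_noise = [-1, -1, -1]
singleton_result = handle_noise(all_noise, "singleton")
group_result = handle_noise(all_noise, "group")
drop_result = handle_noise(all_noise, "drop")
```

Validating up front means a typo such as `"merge"` fails immediately with a `ValueError` instead of silently falling through to one of the modes.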

Test Results:

  • All tests now pass with the changes above.

Together, these changes make noise handling configurable and more robust, so noise points no longer have to be discarded from the analysis.

Optimize noise handling in clustering implementation

- Improve NoiseHandlingClustering performance with numpy operations
- Add validation for noise_handling parameter values
- Add detailed performance implications documentation
- Enhance code readability with clearer variable names
- Add debug output and more thorough tests for noise handling
- Fix potential edge case when all points are noise
- Use array copying to preserve original cluster labels

The changes focus on making the noise handling more robust and
efficient while maintaining the same functionality. The use of
numpy operations replaces list comprehensions for better
performance with large datasets.
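
To illustrate the kind of change this commit describes, the sketch below contrasts a list-comprehension relabeling with a vectorized numpy version for the 'singleton' mode. Both functions are hypothetical reconstructions for illustration; the commit message does not show the original list comprehension, and the labeling scheme (new labels start above the largest existing one) is an assumption.

```python
import numpy as np


def singleton_with_list(labels):
    """Baseline: relabel noise one element at a time in Python."""
    next_label = max(labels, default=-1) + 1
    fresh = iter(range(next_label, next_label + len(labels)))
    return [next(fresh) if label == -1 else label for label in labels]


def singleton_with_numpy(labels):
    """Vectorized: one boolean mask and one assignment, on a copy."""
    relabeled = np.asarray(labels).copy()  # the caller's array is left as it was
    noise = relabeled == -1
    if noise.any():
        start = relabeled.max() + 1
        relabeled[noise] = start + np.arange(noise.sum())
    return relabeled


original = np.array([0, -1, 1, -1])
from_list = singleton_with_list(original.tolist())
from_numpy = singleton_with_numpy(original)
```

Both versions produce the same labels here, and because the numpy version works on a copy, `original` still holds its -1 entries afterwards.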
Successfully merging this pull request may close these issues.

Improve the handling of points clustered as noise
1 participant