Update build_tree function with SparseKmeans implementation #19

Open
wants to merge 1 commit into
base: master
from

Conversation

khoinpd0411

What does this PR do?

Update the build_tree function to use the SparseKmeans implementation, with an adaptive clustering method that switches between Elkan's and Lloyd's algorithms based on the number of samples.

Improvements:

  • Sped up tree construction.
  • Resolved convergence issues caused by duplicate samples during clustering.
  • Introduced an adaptive clustering strategy that dynamically switches between Elkan’s algorithm (for large sample sizes) and Lloyd’s algorithm (for smaller or dense datasets).
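
One possible reading of the duplicate-sample issue, offered as an assumption rather than as the PR's actual diagnosis: when many label representations are identical, there can be fewer distinct vectors than requested clusters, so K-means (with deterministic tie-breaking) cannot fill all K clusters. The helper names below are hypothetical and exist only for this sketch.

```python
def count_distinct_rows(rows):
    """Return the number of distinct feature vectors in `rows`."""
    return len({tuple(row) for row in rows})


def max_nonempty_clusters(rows, k):
    """Upper bound on the non-empty clusters K-means can return.

    Identical rows have identical distances to every centroid, so with
    deterministic tie-breaking they always end up in the same cluster.
    """
    return min(k, count_distinct_rows(rows))


# Four label representations, but only two distinct vectors.
rows = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
print(count_distinct_rows(rows))       # 2
print(max_nonempty_clusters(rows, 3))  # 2
```

A check like this could be run before clustering to decide whether asking for K clusters is meaningful at all.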

Test CLI & API (bash tests/autotest.sh)

Test APIs used by main.py.

  • Test Pass
    • (Copy and paste the last outputted line here.)
  • Not Applicable (i.e., the PR does not include API changes.)

Check API Document

If any new APIs are added, please check if the description of the APIs is added to API document.

  • API document is updated (linear, nn)
  • Not Applicable (i.e., the PR does not include API changes.)

Test quickstart & API (bash tests/docs/test_changed_document.sh)

If any APIs in quickstarts or tutorials are modified, please run this test to check if the current examples can run correctly after the modified APIs are released.

…clustering method mixing Elkan's and Lloyd's algorithm based on the number of samples
@khoinpd0411 khoinpd0411 requested review from cjlin1 and a team as code owners July 14, 2025 20:13
@@ -6,3 +6,4 @@ scikit-learn
scipy<1.14.0
tqdm
psutil
sparsekmeans

Please also add sparsekmeans to install_requires

LibMultiLabel/setup.cfg

Lines 27 to 35 in a0bef91

install_requires =
liblinear-multicore>=2.49.0
numba
pandas>1.3.0
PyYAML
scikit-learn
scipy<1.14.0
tqdm
psutil

@Eleven1Liu Eleven1Liu Jul 14, 2025

Please bump the version to 0.8.0.

version = 0.7.4

Sparsekmeans requires Python >= 3.10, whereas LibMultiLabel supports Python >= 3.8.
This causes installation issues when users try to install LibMultiLabel on Python 3.8 or 3.9.
There are two approaches to this issue:

  • Update LibMultiLabel to require Python >= 3.10.
  • Or, detect the user's environment and apply the corresponding workaround.

There is much room for discussion.
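
A minimal sketch of the second approach, assuming the workaround is to fall back to some other clustering backend on older interpreters. The function name and the "fallback" label are hypothetical; the fallback could be, for example, scikit-learn's KMeans, since scikit-learn is already a dependency.

```python
import sys

# Minimum Python version for sparsekmeans, as stated in the comment above.
MIN_SPARSEKMEANS_PYTHON = (3, 10)


def choose_clustering_backend(version_info=None):
    """Return the name of the clustering backend to use.

    `version_info` defaults to the running interpreter; passing a tuple
    such as (3, 8) makes the decision easy to test.
    """
    if version_info is None:
        version_info = sys.version_info
    if tuple(version_info[:2]) >= MIN_SPARSEKMEANS_PYTHON:
        return "sparsekmeans"
    # Hypothetical fallback for Python 3.8 and 3.9.
    return "fallback"


print(choose_clustering_backend((3, 8)))   # fallback
print(choose_clustering_backend((3, 10)))  # sparsekmeans
```

A runtime check only helps if installation itself succeeds on 3.8 and 3.9, so the dependency would also need to be made conditional on the Python version, for instance with a PEP 508 environment marker on the requirement.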

@khoinpd0411 No need to bump the version now; we will release with #20.

@Eleven1Liu Eleven1Liu added the model/linear and release (PyPI release tag is in this PR) labels Jul 14, 2025
tol=0.0001,
random_state=np.random.randint(2**31 - 1),
verbose=True
)

I'm just wondering why the indentation isn't aligned with line 287.
BTW, should we pass the verbose flag through _build_tree() so that we can control the output during training?
And would it be better to do something like this (I'm not sure)?

if label_representation.shape[0] > 10000:
    kmeans_algo = ElkanKmeans
else:
    kmeans_algo = LloydKmeans
kmeans = kmeans_algo(those_params)
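
The selection above, combined with the verbose pass-through, might look roughly like the sketch below. The two stub classes stand in for the real ElkanKmeans and LloydKmeans so that the example runs on its own; make_kmeans, the n_clusters parameter name, and the configurable threshold are assumptions for illustration, not part of the PR.

```python
class _KmeansStub:
    """Stand-in for a K-means class; it only records its parameters."""

    def __init__(self, **params):
        self.params = params


class ElkanKmeansStub(_KmeansStub):
    pass


class LloydKmeansStub(_KmeansStub):
    pass


# Threshold taken from the review comment above.
ELKAN_THRESHOLD = 10000


def make_kmeans(num_samples, n_clusters, verbose=False, threshold=ELKAN_THRESHOLD):
    """Pick the algorithm by sample count and forward the verbose flag."""
    kmeans_algo = ElkanKmeansStub if num_samples > threshold else LloydKmeansStub
    # Build the shared parameters once so both branches stay in sync.
    return kmeans_algo(n_clusters=n_clusters, tol=0.0001, verbose=verbose)


small = make_kmeans(num_samples=500, n_clusters=100)
large = make_kmeans(num_samples=50000, n_clusters=100, verbose=True)
print(type(small).__name__, small.params["verbose"])  # LloydKmeansStub False
print(type(large).__name__, large.params["verbose"])  # ElkanKmeansStub True
```

Choosing the class first and constructing it once avoids duplicating the parameter list in both branches, which is also where the misaligned indentation would otherwise appear twice.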

For the formatting issue, please use the Black formatter.

    child = _build_tree(child_representation, child_map, d + 1, K, dmax)
else:
    child = Node(label_map=child_map, children=[])

Maybe it's better to have

# Cluster ids that actually received at least one label.
unique_metalabels = np.unique(metalabels)
if len(unique_metalabels) == K:
    # All K clusters are non-empty: recurse into each one.
    children = []
    for i in range(K):
        child_representation = label_representation[metalabels == i]
        child_map = label_map[metalabels == i]
        child = _build_tree(child_representation, child_map, d + 1, K, dmax)
        children.append(child)
else:
    # Fewer than K clusters: stop splitting and make each one a leaf.
    children = [
        Node(label_map=label_map[metalabels == i], children=[])
        for i in unique_metalabels
    ]
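
To try the branching idea from this suggestion without NumPy or a real clustering step, here is a self-contained sketch using plain lists. The Node dataclass is a simplified stand-in for the project's Node, and group_by_cluster, make_children, and the build_subtree callback are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Simplified stand-in for the tree node used in the PR."""

    label_map: list
    children: list = field(default_factory=list)


def group_by_cluster(label_map, metalabels):
    """Group labels by the cluster ids that actually occur, in id order."""
    groups = {}
    for label, cluster in zip(label_map, metalabels):
        groups.setdefault(cluster, []).append(label)
    return [groups[c] for c in sorted(groups)]


def make_children(label_map, metalabels, K, build_subtree):
    """Recurse only when all K clusters are non-empty; otherwise emit leaves."""
    groups = group_by_cluster(label_map, metalabels)
    if len(groups) == K:
        return [build_subtree(g) for g in groups]
    return [Node(label_map=g) for g in groups]


def as_leaf(labels):
    """Trivial build_subtree callback used only for this demo."""
    return Node(label_map=labels)


# Cluster 1 is empty, so only ids 0 and 2 occur and the split yields leaves.
children = make_children([10, 11, 12, 13], [0, 2, 2, 0], K=3, build_subtree=as_leaf)
print([c.label_map for c in children])  # [[10, 13], [11, 12]]
```

Grouping by the cluster ids that actually occur avoids assuming the ids are contiguous, which matters when an empty cluster sits in the middle of the range.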

@Eleven1Liu Eleven1Liu removed the release (PyPI release tag is in this PR) label Jul 17, 2025
5 participants