Update build_tree function with SparseKmeans implementation #19

Open
wants to merge 1 commit into
base: master
41 changes: 27 additions & 14 deletions libmultilabel/linear/tree.py
@@ -4,7 +4,7 @@

 import numpy as np
 import scipy.sparse as sparse
-import sklearn.cluster
+from sparsekmeans import LloydKmeans, ElkanKmeans
 import sklearn.preprocessing
 from tqdm import tqdm
 import psutil
@@ -277,24 +277,37 @@ def _build_tree(label_representation: sparse.csr_matrix, label_map: np.ndarray,
     if d >= dmax or label_representation.shape[0] <= K:
         return Node(label_map=label_map, children=[])

-    metalabels = (
-        sklearn.cluster.KMeans(
-            K,
-            random_state=np.random.randint(2**31 - 1),
-            n_init=1,
-            max_iter=300,
-            tol=0.0001,
-            algorithm="elkan",
-        )
-        .fit(label_representation)
-        .labels_
-    )
+    if label_representation.shape[0] > 10000:
+        kmeans = ElkanKmeans(
+            n_clusters=K,
+            max_iter=300,
+            tol=0.0001,
+            random_state=np.random.randint(2**31 - 1),
+            verbose=True
+        )
+    else:
+        kmeans = LloydKmeans(
+            n_clusters=K,
+            max_iter=300,
+            tol=0.0001,
+            random_state=np.random.randint(2**31 - 1),
+            verbose=True
+        )

I'm just wondering why the indentation isn't aligned with line 287.
BTW, should we pass the verbose flag through _build_tree() so that we can control the output during training?
Would it be better to do something like this? (I'm not sure.)

if label_representation.shape[0] > 10000:
    kmeans_algo = ElkanKmeans
else:
    kmeans_algo = LloydKmeans
kmeans = kmeans_algo(those_params)
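
A minimal sketch of the two suggestions in this comment, under stated assumptions: the class is chosen first, the shared keyword arguments are written once, and `verbose` becomes a parameter. The helper name `make_kmeans` and the constant `ELKAN_THRESHOLD` are hypothetical, and the two classes below are stand-ins so the sketch runs without sparsekmeans installed; the real constructors are assumed to accept the keyword arguments shown in the diff.

```python
# Stand-in classes (assumption): they only record their keyword arguments.
# In tree.py these would come from `from sparsekmeans import LloydKmeans, ElkanKmeans`.
class LloydKmeans:
    def __init__(self, **params):
        self.params = params


class ElkanKmeans(LloydKmeans):
    pass


ELKAN_THRESHOLD = 10000  # hypothetical name; the cut-off value is taken from the diff


def make_kmeans(num_rows, K, seed, verbose=False):
    """Pick the k-means class by input size, then build it with shared parameters."""
    kmeans_algo = ElkanKmeans if num_rows > ELKAN_THRESHOLD else LloydKmeans
    return kmeans_algo(
        n_clusters=K,
        max_iter=300,
        tol=0.0001,
        random_state=seed,
        verbose=verbose,  # passed in by the caller instead of hard-coded to True
    )


small = make_kmeans(num_rows=500, K=8, seed=0)
large = make_kmeans(num_rows=20000, K=8, seed=0, verbose=True)
```

With this shape, `_build_tree()` would only need an extra `verbose` argument that it forwards to the helper and to its recursive calls.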

Contributor


For the formatting issue, please use the Black formatter.


+    metalabels = kmeans.fit(label_representation)
+
+    unique_labels = np.unique(metalabels)
+
     children = []
     for i in range(K):
         child_representation = label_representation[metalabels == i]
         child_map = label_map[metalabels == i]
-        child = _build_tree(child_representation, child_map, d + 1, K, dmax)
+
+        if len(unique_labels) == K:
+            child = _build_tree(child_representation, child_map, d + 1, K, dmax)
+        else:
+            child = Node(label_map=child_map, children=[])

Contributor


Maybe it's better to have

unique_labels = np.unique(metalabels)
if len(unique_labels) == K:
    children = []
    for i in range(K):
        child_representation = label_representation[metalabels == i]
        child_map = label_map[metalabels == i]
        child = _build_tree(child_representation, child_map, d + 1, K, dmax)
        children.append(child)
else:
    # Fewer than K clusters were produced: stop recursing and emit leaves,
    # one per cluster id that actually occurs.
    children = [
        Node(label_map=label_map[metalabels == i], children=[])
        for i in unique_labels
    ]
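
The leaf fallback in this suggestion can be sketched on its own. One detail worth checking is that, when fewer than K clusters come back, the surviving cluster ids need not be contiguous, so the sketch iterates over the ids that actually occur rather than over a range. The `Node` dataclass here is a simplified stand-in for the one in tree.py, and `leaf_children` is a hypothetical helper name.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Node:
    # Simplified stand-in for the Node class in tree.py (an assumption).
    label_map: np.ndarray
    children: list = field(default_factory=list)


def leaf_children(label_map, metalabels):
    """Build one leaf per cluster id that actually occurs in metalabels."""
    return [
        Node(label_map=label_map[metalabels == cid])
        for cid in np.unique(metalabels)
    ]


label_map = np.array([10, 11, 12, 13])
# With K = 3, cluster 1 ended up empty, so only ids 0 and 2 occur.
metalabels = np.array([0, 2, 2, 0])
leaves = leaf_children(label_map, metalabels)
```

Iterating over `range(2)` on the same input would visit ids 0 and 1, producing an empty leaf for cluster 1 and dropping the labels assigned to cluster 2.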

         children.append(child)

     return Node(label_map=label_map, children=children)
1 change: 1 addition & 0 deletions requirements.txt
@@ -6,3 +6,4 @@ scikit-learn
 scipy<1.14.0
 tqdm
 psutil
+sparsekmeans
Contributor


Please also add sparsekmeans to install_requires

LibMultiLabel/setup.cfg

Lines 27 to 35 in a0bef91

install_requires =
    liblinear-multicore>=2.49.0
    numba
    pandas>1.3.0
    PyYAML
    scikit-learn
    scipy<1.14.0
    tqdm
    psutil

Contributor

@Eleven1Liu, Jul 14, 2025


Please bump version to 0.8.0

version = 0.7.4

Contributor


Sparsekmeans requires Python >= 3.10, whereas LibMultiLabel supports Python >= 3.8.
This causes installation issues when users try to install LibMultiLabel in Python 3.8 or 3.9.
There are two approaches to this issue:

  • Update LibMultiLabel to require Python >= 3.10.
  • Or, detect the user's environment and apply the corresponding workaround.

This is open for discussion.
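
One way the second option might look, as a rough sketch rather than a proposal for the actual implementation: check the interpreter version and whether sparsekmeans is importable, and otherwise fall back to the scikit-learn path that the diff removes. The function name `pick_kmeans_backend` and the returned strings are hypothetical.

```python
import importlib.util
import sys


def pick_kmeans_backend(version_info=sys.version_info):
    """Return "sparsekmeans" when it looks usable, otherwise "sklearn"."""
    new_enough = tuple(version_info[:2]) >= (3, 10)
    # find_spec returns None for a top-level package that is not installed.
    installed = importlib.util.find_spec("sparsekmeans") is not None
    return "sparsekmeans" if new_enough and installed else "sklearn"


backend = pick_kmeans_backend()
```

On the packaging side, a PEP 508 environment marker such as `sparsekmeans; python_version >= "3.10"` in the dependency list would keep installation working on Python 3.8 and 3.9, at the cost of maintaining both clustering paths.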

Contributor


@khoinpd0411 No need to bump the version now; we will release it with #20.
