Commit d606d7b

Updates in ODAC code (#1584)
* update description, fix a bug when calculating rnomc, add draw() method to see hierarchical cluster's structure as a Graphviz graph, and working with Var when cluster has only one time-series
* change the description of draw() method
* add cluster's name in draw() method
* correct version
* Update river/cluster/odac.py
* Update river/cluster/odac.py
* update docs/releases/unreleased.md
* Update docs/releases/unreleased.md

---------

Co-authored-by: gonfa3003 <[email protected]>
Co-authored-by: Saulo Martiello Mastelini <[email protected]>
1 parent df257b8 commit d606d7b

2 files changed: +163 -70 lines changed

docs/releases/unreleased.md (+7)

@@ -2,6 +2,13 @@
 
 - The units used in River have been corrected to be based on powers of 2 (KiB, MiB). This only changes the display, the behaviour is unchanged.
 
+## cluster
+
+- Update the description of `cluster.ODAC`.
+- Change `draw` in `cluster.ODAC` to draw the hierarchical cluster's structure as a Graphviz graph.
+- Add `render_ascii` in `cluster.ODAC` to render the hierarchical cluster's structure in text format.
+- Work with `stats.Var` in `cluster.ODAC` when cluster has only one time series.
+
 ## tree
 
 - Instead of letting trees grow indefinitely, setting the `max_depth` parameter to `None` will stop the trees from growing when they reach the system recursion limit.

river/cluster/odac.py (+156 -70)

@@ -9,50 +9,22 @@
 
 
 class ODAC(base.Clusterer):
-    """The Online Divisive-Agglomerative Clustering (ODAC)[^1] aims at continuously
-    maintaining a hierarchical cluster structure from evolving time series data
-    streams.
-
-    ODAC continuosly monitors the evolution of clusters' diameters and split or merge
-    them by gathering more data or reacting to concept drift. Such changes are supported
-    by a confidence level that comes from the Hoeffding bound. ODAC relies on keeping
-    the linear correlation between series to evaluate whether or not the time series
-    hierarchy has changed.
+    """The Online Divisive-Agglomerative Clustering (ODAC)[^1] aims at continuously maintaining
+    a hierarchical cluster structure from evolving time series data streams.
 
     The distance between time-series a and b is given by `rnomc(a, b) = sqrt((1 - corr(a, b)) / 2)`,
-    where `corr(a, b)` is the Pearson Correlation coefficient.
-
-    In the following topics, ε stands for the Hoeffding bound and considers clusters cj
-    with descendants ck and cs.
-
-    **The Merge Operator**
-
-    The Splitting Criteria guarantees that cluster's diameters monotonically decrease.
-
-    - If diameter (ck) - diameter (cj) > ε OR diameter (cs) - diameter (cj ) > ε:
-
-    * There is a change in the correlation structure, so merge clusters ck and cs into cj.
-
-    **Splitting Criteria**
-
-    Consider:
-
-    - d0: the minimum distance;
+    where `corr(a, b)` is the Pearson Correlation coefficient. If the cluster has only one time-series,
+    the diameter is given by the time-series variance. The cluster's diameter is given by the largest
+    distance between the cluster's time-series.
 
-    - d1: the farthest distance;
-
-    - d_avg: the average distance;
-
-    - d2: the second farthest distance.
-
-    Then:
-
-    - if d1 - d2 > εk or t > εk then
-
-    - if (d1 - d0)|(d1 - d_avg) - (d_avg - d0) > εk then
-
-    * Split the cluster
+    ODAC continuously monitors the evolution of diameters, only of the leaves, and splits or merges them
+    by gathering more data or reacting to concept drift - a confidence level from the Hoeffding bound
+    supports such changes.
 
+    So, the split operator, where the Hoeffding bound is applied, occurs when the difference between
+    the largest distance (diameter) and the second largest difference is greater than a constant.
+    Furthermore, the merge operator checks if one of the cluster's children has a diameter bigger
+    than their parent - applying the Hoeffding bound again.
 
     Parameters
     ----------
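
The distance and diameter definitions in the updated docstring can be illustrated with a small batch sketch. This is not the library's code: ODAC keeps incremental `stats.PearsonCorr` objects, whereas the hypothetical helpers `pearson`, `rnomc` and `diameter` below recompute everything from full lists, and `diameter` assumes the cluster holds at least two series.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Batch Pearson correlation of two equally long sequences (illustrative helper)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def rnomc(a, b):
    """Distance sqrt((1 - corr(a, b)) / 2), as given in the docstring."""
    # Clamp at zero so rounding error in the correlation cannot yield a negative radicand.
    return math.sqrt(max(0.0, (1.0 - pearson(a, b)) / 2.0))


def diameter(series):
    """Largest pairwise rnomc distance among the series of one cluster."""
    return max(rnomc(series[k1], series[k2]) for k1, k2 in combinations(series, 2))


series = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with "a"
    "c": [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with "a"
}
print(rnomc(series["a"], series["b"]), diameter(series))
```

Perfectly correlated series end up at distance 0 and perfectly anti-correlated ones at distance 1, so the distance always lies in [0, 1].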
@@ -88,15 +60,15 @@ class ODAC(base.Clusterer):
     Structure changed at observation 200
     Structure changed at observation 300
 
-    >>> print(model.draw(n_decimal_places=2))
+    >>> print(model.render_ascii())
     ROOT d1=0.79 d2=0.76 [NOT ACTIVE]
     ├── CH1_LVL_1 d1=0.74 d2=0.72 [NOT ACTIVE]
-    │   ├── CH1_LVL_2 d1=<Not calculated> [3]
+    │   ├── CH1_LVL_2 d1=0.08 [3]
     │   └── CH2_LVL_2 d1=0.73 [2, 4]
     └── CH2_LVL_1 d1=0.81 d2=0.78 [NOT ACTIVE]
         ├── CH1_LVL_2 d1=0.73 d2=0.67 [NOT ACTIVE]
         │   ├── CH1_LVL_3 d1=0.72 [0, 9]
-        │   └── CH2_LVL_3 d1=<Not calculated> [1]
+        │   └── CH2_LVL_3 d1=0.08 [1]
         └── CH2_LVL_2 d1=0.74 d2=0.73 [NOT ACTIVE]
             ├── CH1_LVL_3 d1=0.71 [5, 6]
             └── CH2_LVL_3 d1=0.71 [7, 8]
@@ -119,7 +91,8 @@ class ODAC(base.Clusterer):
 
     References
     ----------
-    [^1]: [Hierarchical clustering of time-series data streams.](http://doi.org/10.1109/TKDE.2007.190727)
+    [^1]: P. P. Rodrigues, J. Gama and J. Pedroso, "Hierarchical Clustering of Time-Series Data Streams" in IEEE Transactions
+    on Knowledge and Data Engineering, vol. 20, no. 5, pp. 615-627, May 2008, doi: 10.1109/TKDE.2007.190727.
 
     """

@@ -231,7 +204,7 @@ def learn_one(self, x: dict):
             # Time to time approach
             if self._update_timer == 0:
                 # Calculate all the crucial variables to the next procedure
-                leaf.calculate_coefficients(self.confidence_level)
+                leaf.calculate_coefficients(confidence_level=self.confidence_level)
 
                 if leaf.test_aggregate() or leaf.test_split(tau=self.tau):
                     # Put the flag change_detected to true to indicate to the user that the structure changed
@@ -251,10 +224,10 @@ def predict_one(self, x: dict):
             A dictionary of features.
 
         """
-        raise NotImplementedError("ODAC does not predict anything. It builds a hierarchical cluster's structure.")
+        raise NotImplementedError()
 
-    def draw(self, n_decimal_places: int = 2) -> str:
-        """Method to draw the hierarchical cluster's structure.
+    def render_ascii(self, n_decimal_places: int = 2) -> str:
+        """Method to render the hierarchical cluster's structure in text format.
 
         Parameters
         ----------
@@ -268,11 +241,120 @@ def draw(self, n_decimal_places: int = 2) -> str:
 
         return self._root_node.design_structure(n_decimal_places).rstrip("\n")
 
+    def draw(self, max_depth: int | None = None, show_clusters_info: list[typing.Hashable] = ["timeseries_names", "d1", "d2", "e"], n_decimal_places: int = 2):
+        """Method to draw the hierarchical cluster's structure as a Graphviz graph.
+
+        Parameters
+        ----------
+        max_depth
+            The maximum depth of the tree to display.
+        show_clusters_info
+            List of cluster information to show. Valid options are:
+            - "timeseries_indexes": Shows the indexes of the timeseries in the cluster.
+            - "timeseries_names": Shows the names of the timeseries in the cluster.
+            - "name": Shows the cluster's name.
+            - "d1": Shows the d1 (the largest distance in the cluster).
+            - "d2": Shows the d2 (the second largest distance in the cluster).
+            - "e": Shows the error bound.
+        n_decimal_places
+            The number of decimal places to show for numerical values.
+
+        """
+        if not (n_decimal_places > 0 and n_decimal_places < 10):
+            raise ValueError("n_decimal_places must be between 1 and 9.")
+
+        try:
+            import graphviz
+        except ImportError as e:
+            raise ValueError("You have to install graphviz to use the draw method.") from e
+
+        counter = 0
+
+        dot = graphviz.Digraph(
+            graph_attr={"splines": "ortho", "forcelabels": "true", "overlap": "false"},
+            node_attr={
+                "shape": "box",
+                "penwidth": "1.2",
+                "fontname": "trebuchet",
+                "fontsize": "11",
+                "margin": "0.1,0.0",
+            },
+            edge_attr={"penwidth": "0.6", "center": "true", "fontsize": "7 "},
+        )
+
+        def iterate(node: ODACCluster, parent_node: str | None = None, depth: int = 0):
+            nonlocal counter, max_depth, show_clusters_info, n_decimal_places
+
+            if max_depth is not None and depth > max_depth:
+                return
+
+            node_n = str(counter)
+            counter += 1
+
+            label = ""
+
+            # checks if user wants to see information about clusters
+            if len(show_clusters_info) > 0:
+                show_clusters_info_copy = show_clusters_info.copy()
+
+                if "name" in show_clusters_info_copy:
+                    label += f"{node.name}"
+                    show_clusters_info_copy.remove("name")
+                    if len(show_clusters_info_copy) > 0:
+                        label += "\n"
+                if "timeseries_indexes" in show_clusters_info_copy:
+                    # Convert timeseries names to indexes
+                    name_to_index = {name: index for index, name in enumerate(self._root_node.timeseries_names)}
+                    timeseries_indexes = [name_to_index[_name] for _name in node.timeseries_names if _name in name_to_index]
+
+                    label += f"{timeseries_indexes}"
+                    show_clusters_info_copy.remove("timeseries_indexes")
+                    if len(show_clusters_info_copy) > 0:
+                        label += "\n"
+                if "timeseries_names" in show_clusters_info_copy:
+                    label += f"{node.timeseries_names}"
+                    show_clusters_info_copy.remove("timeseries_names")
+                    if len(show_clusters_info_copy) > 0:
+                        label += "\n"
+                if "d1" in show_clusters_info_copy:
+                    if node.d1 is not None:
+                        label += f"d1={node.d1:.{n_decimal_places}f}"
+                    else:
+                        label += "d1=<Not calculated>"
+                    show_clusters_info_copy.remove("d1")
+                    if len(show_clusters_info_copy) > 0:
+                        label += "\n"
+                if "d2" in show_clusters_info_copy and node.d2 is not None:
+                    label += f"d2={node.d2:.{n_decimal_places}f}"
+                    show_clusters_info_copy.remove("d2")
+                    if len(show_clusters_info_copy) > 0:
+                        label += "\n"
+                if "e" in show_clusters_info_copy:
+                    label += f"e={node.e:.{n_decimal_places}f}"
+
+                show_clusters_info_copy.clear()
+
+            # Creates a node with different color to differentiate the active clusters from the non-active
+            if node.active:
+                dot.node(node_n, label, style="filled", fillcolor="#76b5c5")
+            else:
+                dot.node(node_n, label, style="filled", fillcolor="#f2f2f2")
+
+            if parent_node is not None:
+                dot.edge(parent_node, node_n)
+
+            if node.children is not None:
+                iterate(node=node.children.first, parent_node=node_n, depth=depth + 1)
+                iterate(node.children.second, parent_node=node_n, depth=depth + 1)
+
+        iterate(node=self._root_node)
+
+        return dot
+
     @property
     def structure_changed(self) -> bool:
         return self._structure_changed
 
-
 class ODACCluster(base.Base):
     """Cluster class for representing individual clusters."""
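
The node label in `draw()` is assembled by appending a field and then a newline whenever other requested fields remain. One alternative, sketched here only for illustration, is to collect the parts in a list and join them once, which cannot leave a trailing newline when a later field (such as `d2`) turns out to be missing. The helper `build_label` and its plain-dict input are hypothetical and not part of the commit; unlike the diff, the sketch also skips `e` when it is `None`.

```python
FIELD_ORDER = ("name", "timeseries_indexes", "timeseries_names", "d1", "d2", "e")


def build_label(info, requested, n_decimal_places=2):
    """Build a multi-line node label from a dict of cluster attributes (illustrative)."""
    parts = []
    for key in FIELD_ORDER:
        if key not in requested:
            continue
        value = info.get(key)
        if key in ("d1", "d2", "e"):
            if value is None:
                # Mirror the diff: only d1 gets a placeholder, other numbers are skipped.
                if key == "d1":
                    parts.append("d1=<Not calculated>")
                continue
            parts.append(f"{key}={value:.{n_decimal_places}f}")
        elif value is not None:
            parts.append(str(value))
    # Joining once avoids dangling separators when some fields are absent.
    return "\n".join(parts)


root = {"name": "ROOT", "timeseries_names": ["a", "b"], "d1": 0.789, "d2": None, "e": 0.1}
print(build_label(root, ["name", "d1", "d2"]))
```

With `d2` missing, the label ends cleanly after the `d1` line instead of carrying an empty trailing line.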

@@ -284,7 +366,7 @@ def __init__(self, name: str, parent: ODACCluster | None = None):
         self.children: ODACChildren | None = None
 
         self.timeseries_names: list[typing.Hashable] = []
-        self._statistics: dict[tuple[typing.Hashable, typing.Hashable], stats.PearsonCorr] | None
+        self._statistics: dict[tuple[typing.Hashable, typing.Hashable], stats.PearsonCorr] | stats.Var | None
 
         self.d1: float | None = None
         self.d2: float | None = None
@@ -348,14 +430,14 @@ def __str__(self) -> str:
     def __repr__(self) -> str:
         return self.design_structure()
 
-    def _init_stats(self) -> dict[tuple[typing.Hashable, typing.Hashable], stats.PearsonCorr]:
+    def _init_stats(self) -> dict[tuple[typing.Hashable, typing.Hashable], stats.PearsonCorr] | stats.Var:
         return collections.defaultdict(
             stats.PearsonCorr,
             {
                 (k1, k2): stats.PearsonCorr()
                 for k1, k2 in itertools.combinations(self.timeseries_names, 2)
             },
-        )
+        ) if len(self.timeseries_names) > 1 else stats.Var()
 
     # TODO: not sure if this is the best design
     def __call__(self, ts_names: list[typing.Hashable]):
@@ -364,12 +446,15 @@ def __call__(self, ts_names: list[typing.Hashable]):
         self._statistics = self._init_stats()
 
     def update_statistics(self, x: dict) -> None:
-        # For each pair of time-series in the cluster update the correlation
-        # values with the data received
-        for (k1, k2), item in self._statistics.items(): # type: ignore
-            if x.get(k1, None) is None or x.get(k2, None) is None:
-                continue
-            item.update(float(x[k1]), float(x[k2]))
+        if len(self.timeseries_names) > 1:
+            # For each pair of time-series in the cluster update the correlation
+            # values with the data received
+            for (k1, k2), item in self._statistics.items(): # type: ignore
+                if x.get(k1, None) is None or x.get(k2, None) is None:
+                    continue
+                item.update(float(x[k1]), float(x[k2]))
+        else:
+            self._statistics.update(float(x.get(self.timeseries_names[0]))) # type: ignore
 
         # Increment the number of observation in the cluster
         self.n += 1
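
For a cluster with a single time series, the hunk above switches from pairwise correlations to one running variance. The sketch below shows how such a statistic can be maintained one observation at a time using Welford's update. `RunningVar` is a hypothetical stand-in written for this note, not river's `stats.Var`, and it assumes the sample (n - 1) variance. It also skips observations where the series is missing, which the diff's `else` branch does not do.

```python
import statistics


class RunningVar:
    """Welford-style running sample variance (illustrative stand-in, not river's class)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def get(self):
        # The sample variance is undefined for fewer than two points; return 0.0 then.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
var = RunningVar()
for observation in ({"s1": v} for v in values):
    value = observation.get("s1")
    if value is None:  # guard against a missing series in this observation
        continue
    var.update(float(value))

print(var.get(), statistics.variance(values))
```

The running estimate agrees with the batch `statistics.variance` result without storing past observations.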
@@ -380,16 +465,17 @@ def _calculate_rnomc_dict(self)-> dict[tuple[typing.Hashable, typing.Hashable],
         rnomc_dict = {}
 
         for k1, k2 in itertools.combinations(self.timeseries_names, 2):
-            rnomc_dict[(k1, k2)] = math.sqrt((1 - self._statistics[(k1, k2)].get()) / 2.0) # type: ignore
+            value = abs((1 - self._statistics[(k1, k2)].get()) / 2.0) # type: ignore
+            rnomc_dict[(k1, k2)] = math.sqrt(value)
 
         return rnomc_dict
 
     # Method to calculate coefficients for splitting or aggregation
     def calculate_coefficients(self, confidence_level: float) -> None:
-        # Get the rnomc values
-        rnomc_dict = self._calculate_rnomc_dict()
+        if len(self.timeseries_names) > 1:
+            # Get the rnomc values
+            rnomc_dict = self._calculate_rnomc_dict()
 
-        if len(rnomc_dict) > 0:
             # Get the average distance in the cluster
             self.avg = sum(rnomc_dict.values()) / self.n
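
A plausible reason for wrapping the radicand in `abs()` is floating-point overshoot: if an incrementally computed correlation lands marginally above 1, `(1 - corr) / 2` becomes slightly negative and `math.sqrt` raises. Whether this is exactly the bug the commit message refers to is an assumption. The sketch reproduces the failure with a hand-picked value and compares `abs()` with clamping at zero, an alternative that is not what the commit does.

```python
import math

# A correlation estimate that overshoots 1.0 by one rounding step (hand-picked value).
corr = 1.0000000000000002
radicand = (1 - corr) / 2.0  # slightly below zero

try:
    math.sqrt(radicand)
    sqrt_failed = False
except ValueError:  # "math domain error" for negative input
    sqrt_failed = True

distance_abs = math.sqrt(abs(radicand))           # approach taken in the diff
distance_clamped = math.sqrt(max(0.0, radicand))  # alternative: clamp at zero

print(sqrt_failed, distance_abs, distance_clamped)
```

Both variants avoid the exception. `abs()` returns a tiny positive distance, while clamping returns exactly 0.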

@@ -405,13 +491,13 @@ def calculate_coefficients(self, confidence_level: float) -> None:
                 self.pivot_2, self.d2 = max(remaining.items(), key=lambda x: x[1])
             else:
                 self.pivot_2 = self.d2 = None # type: ignore
+        else:
+            self.d1 = self._statistics.get() # type: ignore
+        # Calculate the Hoeffding bound in the cluster
+        self.e = math.sqrt(math.log(1 / confidence_level) / (2 * self.n))
 
-            # Calculate the Hoeffding bound in the cluster
-            self.e = math.sqrt(math.log(1 / confidence_level) / (2 * self.n))
-
+    # Method that gives the closest cluster where the current time series is located
     def _get_closest_cluster(self, pivot_1, pivot_2, current, rnormc_dict: dict) -> int:
-        """Method that gives the closest cluster where the current time series is located."""
-
         dist_1 = rnormc_dict.get((min(pivot_1, current), max(pivot_1, current)), 0)
         dist_2 = rnormc_dict.get((min(pivot_2, current), max(pivot_2, current)), 0)
         return 2 if dist_1 >= dist_2 else 1
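
The Hoeffding bound computed above shrinks as a cluster sees more observations, so smaller gaps between `d1` and `d2` become significant over time. The sketch below evaluates the same formula and applies a simplified split check, `d1 - d2 > e`. The helper names are hypothetical, the confidence value 0.9 is only an example, and the real `test_split` also takes a `tau` argument that is ignored here.

```python
import math


def hoeffding_bound(confidence_level, n):
    """Same formula as in the diff: sqrt(ln(1 / confidence_level) / (2 * n))."""
    return math.sqrt(math.log(1 / confidence_level) / (2 * n))


def should_split(d1, d2, e):
    """Simplified split check (assumption: the `tau` tie-breaking is left out)."""
    return d2 is not None and d1 - d2 > e


e_100 = hoeffding_bound(confidence_level=0.9, n=100)
e_400 = hoeffding_bound(confidence_level=0.9, n=400)

print(round(e_100, 4), round(e_400, 4))
print(should_split(0.79, 0.76, e_100), should_split(0.74, 0.73, e_100))
```

Quadrupling the number of observations halves the bound, because `e` scales with `1 / sqrt(n)`.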
@@ -444,8 +530,9 @@ def _split_this_cluster(self, pivot_1: typing.Hashable, pivot_2: typing.Hashable
 
         # Set the active flag to false. Since this cluster is not an active cluster anymore.
         self.active = False
-        self.avg = self.d0 = self.pivot_0 = self.pivot_1 = self.pivot_2 = None # type: ignore
-        self._statistics = None
+
+        # Reset some attributes
+        self.avg = self.d0 = self.pivot_0 = self.pivot_1 = self.pivot_2 = self._statistics = None # type: ignore
 
     # Method that proceeds to merge on this cluster
     def _aggregate_this_cluster(self):
@@ -485,7 +572,6 @@ def test_aggregate(self):
                 return True
         return False
 
-
 class ODACChildren(base.Base):
     """Children class representing child clusters."""
