 class ODAC(base.Clusterer):
-    """The Online Divisive-Agglomerative Clustering (ODAC)[^1] aims at continuously
-    maintaining a hierarchical cluster structure from evolving time series data
-    streams.
-
-    ODAC continuosly monitors the evolution of clusters' diameters and split or merge
-    them by gathering more data or reacting to concept drift. Such changes are supported
-    by a confidence level that comes from the Hoeffding bound. ODAC relies on keeping
-    the linear correlation between series to evaluate whether or not the time series
-    hierarchy has changed.
+    """The Online Divisive-Agglomerative Clustering (ODAC)[^1] aims at continuously maintaining
+    a hierarchical cluster structure from evolving time series data streams.

     The distance between time-series a and b is given by `rnomc(a, b) = sqrt((1 - corr(a, b)) / 2)`,
-    where `corr(a, b)` is the Pearson Correlation coefficient.
-
-    In the following topics, ε stands for the Hoeffding bound and considers clusters cj
-    with descendants ck and cs.
-
-    **The Merge Operator**
-
-    The Splitting Criteria guarantees that cluster's diameters monotonically decrease.
-
-    - If diameter (ck) - diameter (cj) > ε OR diameter (cs) - diameter (cj ) > ε:
-
-        * There is a change in the correlation structure, so merge clusters ck and cs into cj.
-
-    **Splitting Criteria**
-
-    Consider:
-
-    - d0: the minimum distance;
+    where `corr(a, b)` is the Pearson Correlation coefficient. If the cluster has only one time-series,
+    the diameter is given by the time-series variance. The cluster's diameter is given by the largest
+    distance between the cluster's time-series.

-    - d1: the farthest distance;
-
-    - d_avg: the average distance;
-
-    - d2: the second farthest distance.
-
-    Then:
-
-    - if d1 - d2 > εk or t > εk then
-
-        - if (d1 - d0)|(d1 - d_avg) - (d_avg - d0) > εk then
-
-            * Split the cluster
+    ODAC continuously monitors the evolution of diameters, only of the leaves, and splits or merges them
+    by gathering more data or reacting to concept drift - a confidence level from the Hoeffding bound
+    supports such changes.

+    So, the split operator, where the Hoeffding bound is applied, occurs when the difference between
+    the largest distance (diameter) and the second largest distance is greater than a constant.
+    Furthermore, the merge operator checks if one of the cluster's children has a diameter bigger
+    than their parent - applying the Hoeffding bound again.

     Parameters
     ----------
@@ -88,15 +60,15 @@ class ODAC(base.Clusterer):
     Structure changed at observation 200
     Structure changed at observation 300

-    >>> print(model.draw(n_decimal_places=2))
+    >>> print(model.render_ascii())
     ROOT d1=0.79 d2=0.76 [NOT ACTIVE]
     ├── CH1_LVL_1 d1=0.74 d2=0.72 [NOT ACTIVE]
-    │   ├── CH1_LVL_2 d1=<Not calculated> [3]
+    │   ├── CH1_LVL_2 d1=0.08 [3]
     │   └── CH2_LVL_2 d1=0.73 [2, 4]
     └── CH2_LVL_1 d1=0.81 d2=0.78 [NOT ACTIVE]
         ├── CH1_LVL_2 d1=0.73 d2=0.67 [NOT ACTIVE]
         │   ├── CH1_LVL_3 d1=0.72 [0, 9]
-        │   └── CH2_LVL_3 d1=<Not calculated> [1]
+        │   └── CH2_LVL_3 d1=0.08 [1]
         └── CH2_LVL_2 d1=0.74 d2=0.73 [NOT ACTIVE]
             ├── CH1_LVL_3 d1=0.71 [5, 6]
             └── CH2_LVL_3 d1=0.71 [7, 8]
@@ -119,7 +91,8 @@ class ODAC(base.Clusterer):
     References
     ----------
-    [^1]: [Hierarchical clustering of time-series data streams.](http://doi.org/10.1109/TKDE.2007.190727)
+    [^1]: P. P. Rodrigues, J. Gama and J. Pedroso, "Hierarchical Clustering of Time-Series Data Streams" in IEEE Transactions
+    on Knowledge and Data Engineering, vol. 20, no. 5, pp. 615-627, May 2008, doi: 10.1109/TKDE.2007.190727.

     """
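To make the distance in the docstring concrete, here is a minimal batch sketch of `rnomc` and of a cluster diameter taken as the largest pairwise distance. The helper names `pearson`, `rnomc` and `diameter` are hypothetical and are not part of this commit; ODAC itself keeps the correlations incrementally rather than recomputing them from full series.

```python
import itertools
import math


def pearson(a, b):
    """Batch Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def rnomc(a, b):
    """rnomc(a, b) = sqrt((1 - corr(a, b)) / 2), as stated in the docstring."""
    # abs() guards against a tiny negative value caused by floating-point rounding
    return math.sqrt(abs((1 - pearson(a, b)) / 2))


def diameter(series):
    """Largest pairwise rnomc distance among the named series of a cluster."""
    pairs = itertools.combinations(series, 2)
    return max(rnomc(series[k1], series[k2]) for k1, k2 in pairs)


cluster = {"a": [1, 2, 3], "b": [2, 4, 6], "c": [3, 2, 1]}
print(diameter(cluster))
```

Perfectly correlated series are at distance 0 and perfectly anti-correlated series are at distance 1, so the distance always falls in the interval [0, 1].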
@@ -231,7 +204,7 @@ def learn_one(self, x: dict):
             # Time to time approach
             if self._update_timer == 0:
                 # Calculate all the crucial variables to the next procedure
-                leaf.calculate_coefficients(self.confidence_level)
+                leaf.calculate_coefficients(confidence_level=self.confidence_level)

                 if leaf.test_aggregate() or leaf.test_split(tau=self.tau):
                     # Put the flag change_detected to true to indicate to the user that the structure changed
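The bodies of `test_split` and `test_aggregate` are not part of this diff, so the following is only a simplified sketch of the two checks as the docstrings describe them, built on the Hoeffding bound formula used in `calculate_coefficients`. The function names `hoeffding_bound`, `should_split` and `should_merge` are hypothetical, and the second-stage condition from the old docstring is deliberately left out.

```python
import math


def hoeffding_bound(confidence_level, n):
    """Same formula as `self.e` in the diff: sqrt(ln(1 / delta) / (2 * n))."""
    return math.sqrt(math.log(1 / confidence_level) / (2 * n))


def should_split(d1, d2, e, tau):
    """Split when the two largest distances differ by more than the bound,
    or when the bound has dropped below the tie-breaking threshold tau."""
    if d1 is None or d2 is None:
        return False
    return (d1 - d2) > e or tau > e


def should_merge(child_d1, parent_d1, e):
    """Merge when a child's diameter exceeds its parent's by more than the bound."""
    return (child_d1 - parent_d1) > e


e = hoeffding_bound(confidence_level=0.9, n=100)
print(should_split(d1=0.79, d2=0.76, e=e, tau=0.1))
```

Because the bound shrinks as `n` grows, a leaf that has seen more observations needs a smaller gap between `d1` and `d2` to trigger a split.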
@@ -251,10 +224,10 @@ def predict_one(self, x: dict):
             A dictionary of features.

         """
-        raise NotImplementedError("ODAC does not predict anything. It builds a hierarchical cluster's structure.")
+        raise NotImplementedError()

-    def draw(self, n_decimal_places: int = 2) -> str:
-        """Method to draw the hierarchical cluster's structure.
+    def render_ascii(self, n_decimal_places: int = 2) -> str:
+        """Method to render the hierarchical cluster's structure in text format.

         Parameters
         ----------
@@ -268,11 +241,120 @@ def draw(self, n_decimal_places: int = 2) -> str:
         return self._root_node.design_structure(n_decimal_places).rstrip("\n")

+    def draw(self, max_depth: int | None = None, show_clusters_info: list[typing.Hashable] = ["timeseries_names", "d1", "d2", "e"], n_decimal_places: int = 2):
+        """Method to draw the hierarchical cluster's structure as a Graphviz graph.
+
+        Parameters
+        ----------
+        max_depth
+            The maximum depth of the tree to display.
+        show_clusters_info
+            List of cluster information to show. Valid options are:
+            - "timeseries_indexes": Shows the indexes of the timeseries in the cluster.
+            - "timeseries_names": Shows the names of the timeseries in the cluster.
+            - "name": Shows the cluster's name.
+            - "d1": Shows the d1 (the largest distance in the cluster).
+            - "d2": Shows the d2 (the second largest distance in the cluster).
+            - "e": Shows the error bound.
+        n_decimal_places
+            The number of decimal places to show for numerical values.
+
+        """
+        if not (n_decimal_places > 0 and n_decimal_places < 10):
+            raise ValueError("n_decimal_places must be between 1 and 9.")
+
+        try:
+            import graphviz
+        except ImportError as e:
+            raise ValueError("You have to install graphviz to use the draw method.") from e
+
+        counter = 0
+
+        dot = graphviz.Digraph(
+            graph_attr={"splines": "ortho", "forcelabels": "true", "overlap": "false"},
+            node_attr={
+                "shape": "box",
+                "penwidth": "1.2",
+                "fontname": "trebuchet",
+                "fontsize": "11",
+                "margin": "0.1,0.0",
+            },
+            edge_attr={"penwidth": "0.6", "center": "true", "fontsize": "7 "},
+        )
+
+        def iterate(node: ODACCluster, parent_node: str | None = None, depth: int = 0):
+            nonlocal counter, max_depth, show_clusters_info, n_decimal_places
+
+            if max_depth is not None and depth > max_depth:
+                return
+
+            node_n = str(counter)
+            counter += 1
+
+            label = ""
+
+            # checks if user wants to see information about clusters
+            if len(show_clusters_info) > 0:
+                show_clusters_info_copy = show_clusters_info.copy()
+
+                if "name" in show_clusters_info_copy:
+                    label += f"{node.name}"
+                    show_clusters_info_copy.remove("name")
+                    if len(show_clusters_info_copy) > 0:
+                        label += "\n"
+                if "timeseries_indexes" in show_clusters_info_copy:
+                    # Convert timeseries names to indexes
+                    name_to_index = {name: index for index, name in enumerate(self._root_node.timeseries_names)}
+                    timeseries_indexes = [name_to_index[_name] for _name in node.timeseries_names if _name in name_to_index]
+
+                    label += f"{timeseries_indexes}"
+                    show_clusters_info_copy.remove("timeseries_indexes")
+                    if len(show_clusters_info_copy) > 0:
+                        label += "\n"
+                if "timeseries_names" in show_clusters_info_copy:
+                    label += f"{node.timeseries_names}"
+                    show_clusters_info_copy.remove("timeseries_names")
+                    if len(show_clusters_info_copy) > 0:
+                        label += "\n"
+                if "d1" in show_clusters_info_copy:
+                    if node.d1 is not None:
+                        label += f"d1={node.d1:.{n_decimal_places}f}"
+                    else:
+                        label += "d1=<Not calculated>"
+                    show_clusters_info_copy.remove("d1")
+                    if len(show_clusters_info_copy) > 0:
+                        label += "\n"
+                if "d2" in show_clusters_info_copy and node.d2 is not None:
+                    label += f"d2={node.d2:.{n_decimal_places}f}"
+                    show_clusters_info_copy.remove("d2")
+                    if len(show_clusters_info_copy) > 0:
+                        label += "\n"
+                if "e" in show_clusters_info_copy:
+                    label += f"e={node.e:.{n_decimal_places}f}"
+
+                show_clusters_info_copy.clear()
+
+            # Creates a node with different color to differentiate the active clusters from the non-active
+            if node.active:
+                dot.node(node_n, label, style="filled", fillcolor="#76b5c5")
+            else:
+                dot.node(node_n, label, style="filled", fillcolor="#f2f2f2")
+
+            if parent_node is not None:
+                dot.edge(parent_node, node_n)
+
+            if node.children is not None:
+                iterate(node=node.children.first, parent_node=node_n, depth=depth + 1)
+                iterate(node.children.second, parent_node=node_n, depth=depth + 1)
+
+        iterate(node=self._root_node)
+
+        return dot
+
     @property
     def structure_changed(self) -> bool:
         return self._structure_changed

-
 class ODACCluster(base.Base):
     """Cluster class for representing individual clusters."""
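The label assembly inside `iterate` repeats the same "append, remove, maybe add a newline" pattern for every field. One possible alternative, sketched here under assumptions, is to collect the requested fields in a list and join them with newlines. `build_label`, `FIELD_ORDER` and the `SimpleNamespace` stand-in for a cluster are hypothetical names for illustration only; the sketch omits `"timeseries_indexes"` for brevity and, unlike the committed code, skips `e` when it is `None`.

```python
from types import SimpleNamespace

# Fixed display order, mirroring the order of the checks in `iterate`
FIELD_ORDER = ("name", "timeseries_names", "d1", "d2", "e")
NUMERIC_FIELDS = ("d1", "d2", "e")


def build_label(node, fields, n_decimal_places=2):
    """Build a multi-line node label from the requested fields."""
    parts = []
    for field in FIELD_ORDER:
        if field not in fields:
            continue
        value = getattr(node, field)
        if field in NUMERIC_FIELDS:
            if value is None:
                # Only d1 gets an explicit placeholder, as in the diff
                if field == "d1":
                    parts.append("d1=<Not calculated>")
                continue
            parts.append(f"{field}={value:.{n_decimal_places}f}")
        else:
            parts.append(str(value))
    # Joining avoids any leading or trailing newline
    return "\n".join(parts)


leaf = SimpleNamespace(name="CH1_LVL_2", timeseries_names=[3], d1=0.0812, d2=None, e=0.0229)
print(build_label(leaf, ["name", "d1", "d2", "e"]))
```

With joining, a missing `d2` simply contributes nothing, so a request such as `["d1", "d2"]` on a leaf without `d2` yields a label with no dangling newline.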
@@ -284,7 +366,7 @@ def __init__(self, name: str, parent: ODACCluster | None = None):
         self.children: ODACChildren | None = None

         self.timeseries_names: list[typing.Hashable] = []
-        self._statistics: dict[tuple[typing.Hashable, typing.Hashable], stats.PearsonCorr] | None
+        self._statistics: dict[tuple[typing.Hashable, typing.Hashable], stats.PearsonCorr] | stats.Var | None

         self.d1: float | None = None
         self.d2: float | None = None
@@ -348,14 +430,14 @@ def __str__(self) -> str:
     def __repr__(self) -> str:
         return self.design_structure()

-    def _init_stats(self) -> dict[tuple[typing.Hashable, typing.Hashable], stats.PearsonCorr]:
+    def _init_stats(self) -> dict[tuple[typing.Hashable, typing.Hashable], stats.PearsonCorr] | stats.Var:
         return collections.defaultdict(
             stats.PearsonCorr,
             {
                 (k1, k2): stats.PearsonCorr()
                 for k1, k2 in itertools.combinations(self.timeseries_names, 2)
             },
-        )
+        ) if len(self.timeseries_names) > 1 else stats.Var()

     # TODO: not sure if this is the best design
     def __call__(self, ts_names: list[typing.Hashable]):
@@ -364,12 +446,15 @@ def __call__(self, ts_names: list[typing.Hashable]):
         self._statistics = self._init_stats()

     def update_statistics(self, x: dict) -> None:
-        # For each pair of time-series in the cluster update the correlation
-        # values with the data received
-        for (k1, k2), item in self._statistics.items():  # type: ignore
-            if x.get(k1, None) is None or x.get(k2, None) is None:
-                continue
-            item.update(float(x[k1]), float(x[k2]))
+        if len(self.timeseries_names) > 1:
+            # For each pair of time-series in the cluster update the correlation
+            # values with the data received
+            for (k1, k2), item in self._statistics.items():  # type: ignore
+                if x.get(k1, None) is None or x.get(k2, None) is None:
+                    continue
+                item.update(float(x[k1]), float(x[k2]))
+        else:
+            self._statistics.update(float(x.get(self.timeseries_names[0])))  # type: ignore

         # Increment the number of observation in the cluster
         self.n += 1
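For a cluster holding a single series, the hunk above swaps the pairwise correlations for a running variance. As an illustration of the idea only, the sketch below maintains a sample variance incrementally with Welford's algorithm. `RunningVar` and `update_single_series` are hypothetical stand-ins and are not claimed to match the internals of `stats.Var`. Since `float(None)` raises `TypeError`, the helper skips observations where the series is missing, mirroring the guard in the multi-series branch.

```python
class RunningVar:
    """Incremental sample variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def get(self):
        # The sample variance is undefined for fewer than two points; return 0.0 here
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def update_single_series(var, x, name):
    """Update `var` from observation dict `x`, ignoring missing values."""
    value = x.get(name)
    if value is not None:
        var.update(float(value))


var = RunningVar()
for obs in [{"s": 2}, {"s": 4}, {}, {"s": 4}, {"s": None}, {"s": 5}]:
    update_single_series(var, obs, "s")
print(var.get())
```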
@@ -380,16 +465,17 @@ def _calculate_rnomc_dict(self)-> dict[tuple[typing.Hashable, typing.Hashable],
         rnomc_dict = {}

         for k1, k2 in itertools.combinations(self.timeseries_names, 2):
-            rnomc_dict[(k1, k2)] = math.sqrt((1 - self._statistics[(k1, k2)].get()) / 2.0)  # type: ignore
+            value = abs((1 - self._statistics[(k1, k2)].get()) / 2.0)  # type: ignore
+            rnomc_dict[(k1, k2)] = math.sqrt(value)

         return rnomc_dict

     # Method to calculate coefficients for splitting or aggregation
     def calculate_coefficients(self, confidence_level: float) -> None:
-        # Get the rnomc values
-        rnomc_dict = self._calculate_rnomc_dict()
+        if len(self.timeseries_names) > 1:
+            # Get the rnomc values
+            rnomc_dict = self._calculate_rnomc_dict()

-        if len(rnomc_dict) > 0:
             # Get the average distance in the cluster
             self.avg = sum(rnomc_dict.values()) / self.n
@@ -405,13 +491,13 @@ def calculate_coefficients(self, confidence_level: float) -> None:
                 self.pivot_2, self.d2 = max(remaining.items(), key=lambda x: x[1])
             else:
                 self.pivot_2 = self.d2 = None  # type: ignore
+        else:
+            self.d1 = self._statistics.get()  # type: ignore
+        # Calculate the Hoeffding bound in the cluster
+        self.e = math.sqrt(math.log(1 / confidence_level) / (2 * self.n))

-        # Calculate the Hoeffding bound in the cluster
-        self.e = math.sqrt(math.log(1 / confidence_level) / (2 * self.n))
-
+    # Method that gives the closest cluster where the current time series is located
     def _get_closest_cluster(self, pivot_1, pivot_2, current, rnormc_dict: dict) -> int:
-        """Method that gives the closest cluster where the current time series is located."""
-
         dist_1 = rnormc_dict.get((min(pivot_1, current), max(pivot_1, current)), 0)
         dist_2 = rnormc_dict.get((min(pivot_2, current), max(pivot_2, current)), 0)
         return 2 if dist_1 >= dist_2 else 1
@@ -444,8 +530,9 @@ def _split_this_cluster(self, pivot_1: typing.Hashable, pivot_2: typing.Hashable
         # Set the active flag to false. Since this cluster is not an active cluster anymore.
         self.active = False
-        self.avg = self.d0 = self.pivot_0 = self.pivot_1 = self.pivot_2 = None  # type: ignore
-        self._statistics = None
+
+        # Reset some attributes
+        self.avg = self.d0 = self.pivot_0 = self.pivot_1 = self.pivot_2 = self._statistics = None  # type: ignore

     # Method that proceeds to merge on this cluster
     def _aggregate_this_cluster(self):
@@ -485,7 +572,6 @@ def test_aggregate(self):
                 return True
         return False

-
 class ODACChildren(base.Base):
     """Children class representing child clusters."""