Skip to content

MSTApproach

Huascar Sanchez edited this page Aug 23, 2016 · 3 revisions

Clustering: An application of Minimum Spanning Trees (MST)

Problem: You are given N items (e.g., fully qualified class names) and the distance d(u, v) between two items. d(u,v) may be an actual distance or some abstract representation of how dissimilar two things are. In our case we use the longest common substring metric:

_lcSubstrMetric (s1, s2) = 1.0 - (lcSubstr(s1, s2)) / max(Length(s1), Length(s2)) _

Our goal is to divide the N items up into k groups so that the minimum distance between items in different groups is maximized. k is calculated using the following formula: k = floor(sqrt(N))

Idea: Maximum Minimum Distance

  1. Maintain clusters as a set of connected components of a graph.
  2. Iteratively combine the clusters containing the two closest items by adding an edge between them. Any pair of elements whose lcSubstrMetric value is greater than 0.35 are ignored.
  3. Stop when there are k clusters.

To implement this idea we use Kruskal algorithm. Under this algorithm, the clusters are the connected components the algorithm has created after a certain point. The algorithm will delete the k -1 edges with the most dissimilar items from the MST.

Label recommendation

Let C (e.g., "FooBar", "FooOk", "FooPen") be a given cluster produced by the Kruskal algorithm and a set of words S considered to be the most typical words in our project corpus. The following set of steps describes how labels are recommended for the cluster C

  1. Split each class name at capitalization boundaries, as well at spaces and underscores. Special characters (e.g., ;, *) and numeric sequences are removed. Words that are not members of the set S are also removed. The final result is then set to lowercase. This steps produces a list of labels L; e.g., "FooBar", "FooOk", "FooPen" will result in ["foo", "bar", "foo", "ok", "foo", "pen"]

  2. Calculate the frequency of each word in the list L: ["foo":3, "bar":1, "ok":1, "pen":1]

  3. Sort these words in ascending order by frequency: ["foo":3, "bar":1, "ok":1, "pen":1]

  4. if the size(C) > 1, then filter out those words with a frequency less than 2: ["foo":3] Otherwise, don't filter any words.

The final result ["foo"]

Clone this wiki locally