Recent graph- and density-based results for efficient unsupervised and semi-supervised data analysis

Image for Recent graph- and density-based results for efficient unsupervised and semi-supervised data analysis

More Information

mcds@unimelb.edu.au

A number of density-based methods for unsupervised and semi-supervised learning tasks, including clustering and transductive classification, can be formulated as instances of a framework that is based on the processing of a minimum spanning tree (MST) of the data in a transformed space of distances modelled by a (conceptual only) graph whose edge weights correspond to a form of density estimate with respect to a smoothing parameter.

While density-based methods are considered to be robust with respect to this parameter, wider ranges of its values may lead to different results that a user would like to consider for a given data set or application. However, to interactively explore multiple results for a range of density smoothing settings, until recently, one had to re-run the density-based method for each value in the range independently, which is computationally inefficient.

In the first part of this talk Professor Campello will review some popular density-based algorithms, including the HDBSCAN* algorithm and its variants, and how these algorithms can be modelled as a density-based MST problem.

In the second part of the talk, Professor Campello will present theoretical results that lead to a very computationally efficient approach to simultaneously compute multiple density-based MST’s with respect to an entire range of density smoothing values by leveraging a graph obtained from a single run of a density-based algorithm. Experimental results showing that this strategy can lead to speed-up factors of hundreds to thousands of times on the computation of density-based MST’s will be presented.