From r/MachineLearning

[R] The SPORE Clustering Algorithm


https://preview.redd.it/di99yw56tksg1.png?width=992&format=png&auto=webp&s=8828c9459dcf8f8541718e4d7a9fae52bfc0b95a

I created a clustering algorithm called SPORE (Skeleton Propagation Over Recalibrating Expansions), intended for general-purpose clustering that handles nonconvex, convex, low-dimensional and high-dimensional data alike. I've benchmarked it on 28 datasets ranging from 2D to 784D and released a Python package as well as a research paper.

Short Summary

SPORE is a density-variance-based method meant for general clustering in arbitrary geometries and dimensionalities. After building a knn graph, it runs in 2 phases. Phase 1 (Expansion) uses BFS with a continually refined density-variance constraint to expand initial clusters in a way that adapts to each cluster's specific scale. The aim is to capture the inner, well-shielded skeletons and stay back from low-separation boundary areas. Phase 2 (Small-Cluster Reassignment, aka SCR) takes those boundary points and merges them into the skeletons they surround, and can draw sharp lines between adjacent cluster boundaries, kind of like k-means partitioning to the nearest centroid/representative. Together, these give SPORE scale-adaptive shape recognition and the ability to draw sharp boundaries when clusters are near each other, so it strongly resists the merge-or-fragment problem that plagues most density-based clustering algorithms. It's also pretty robust to dimensionality, all the way up to hundreds of dimensions. I've even used it on 1000D+ LLM embeddings and gotten clean results (though to be fair, LLM embeddings are often trained to be well-separated despite being high-D).
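To make the Phase 1 idea above concrete, here's a minimal sketch of the recalibrating acceptance test as a running z-score gate. This is my illustrative reconstruction, not the package's actual API; the class and method names are hypothetical.

```python
import math

class ZScoreGate:
    """Illustrative sketch (not SPORE's actual code): tracks a running
    mean/std of accepted neighbor distances via Welford's algorithm and
    rejects candidates whose distance z-score exceeds a threshold."""

    def __init__(self, z_max):
        self.z_max = z_max   # user-provided z-score threshold
        self.n = 0           # number of accepted distances
        self.mean = 0.0      # running mean of accepted distances
        self.m2 = 0.0        # running sum of squared deviations

    def accept(self, dist):
        # Once we have some history, reject distances that are outliers
        # relative to what the cluster has accepted so far.
        if self.n >= 2:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and (dist - self.mean) / std > self.z_max:
                return False  # frontier is thinning out: stop expanding here
        # Welford update with the newly accepted distance.
        self.n += 1
        delta = dist - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (dist - self.mean)
        return True
```

Because the std shrinks when distance contrast is low, even small-but-consistent jumps in distance can trip the gate, which is what lets expansion stop before bleeding into a neighboring cluster.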

More In-depth

SPORE has 3 main steps, the last 2 of which are where the actual clustering occurs:

  1. Construct a knn graph. You can do this either exactly or approximately; I'd go with approximate via HNSW (that's the Python package's default). Performance is essentially the same either way, since SPORE just needs an approximate sense of intra-cluster density variance to constrain expansion. Exact knn isn't required; as long as the neighbor error isn't too high, it will be fine in most cases.
  2. Perform BFS. This is where SPORE’s name is most fitting; like a biological spore, it seeds clusters at specific points and grows them outward over the data manifold until the manifold is no longer “hospitable”.
    1. First, sort points in descending order of density, densest first.
    2. Then you extract the densest point and begin BFS around it.
    3. During BFS you track the mean and standard deviation of accepted neighbor distances, updating them with each accepted point. When considering a candidate point, you use the current mean and standard deviation to compute the z-score of that point's distance from the frontier. If the z-score is too high (based on a user-provided threshold), the point is rejected. Eventually the z-scores of all candidate points will be too high; this happens naturally as the cluster approaches its boundary and starts to thin out.
    4. After cluster 1 finishes expanding, you just grab the next densest point and start BFS for cluster 2.
    5. By the end, the goal is to have at least expanded some minimal core skeleton within each true cluster, while leaving the boundary fragmented, since growing into boundary regions can cause expansion to bleed into adjacent clusters. If skeletons are intact and boundaries are shattered off, that's the ideal setup for the next phase.
      1. A nice consequence of the density variance approach is a degree of robustness to low distance contrast that helps with skeleton isolation: if contrast is low, standard deviation in distance drops accordingly, so small-but-consistent differences in distance still provide some signal, and that's enough to separate the inner skeletons of clusters from each other in many cases.
      2. It's not strictly about skeletons. If the dataset is already well separated, expansion alone could do the job, and you don’t even need the next phase.
  3. Small Cluster Reassignment (SCR). Once skeletons are identified, then comes small cluster reassignment. I think of this phase as a localized k-means, where you partition points by their nearest cluster representative. This time, however, the representatives are points from a particular cluster within a to-be-reassigned point's knn, and the partitioning algorithm is essentially a knn classifier. So this phase takes all points in small clusters (ideally made of boundary points) and reassigns them to the cluster among their knn that maximizes a score measuring geometric conditions like enclosure, knn count, and nearness. That max-selection is why it can draw sharp boundaries: even when separation is minimal, you just need some points to be consistently better supported by the right cluster among their knn, which often just means that cluster is nearer, even if only by an infinitesimal amount.
    1. Seen another way, this phase acts almost like a resumed expansion phase in a different, less connection-greedy mode. The first phase finds the anchors with high shape-adaptivity, and the second propagates them outward to better-defined stopping points that the first phase could not have found alone.
  4. There are some details omitted for brevity, but that’s the core of it.
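The whole pipeline above can be sketched end to end. This is a simplified toy reconstruction from the description, not the released package: it uses exact knn via scikit-learn instead of HNSW, an inverse-mean-knn-distance density proxy, and a distance-weighted knn vote in place of the full geometric SCR score. All function and parameter names here are my own.

```python
import numpy as np
from collections import deque
from sklearn.neighbors import NearestNeighbors

def spore_sketch(X, k=10, z_max=2.0, min_cluster=5):
    """Toy SPORE-style clustering: density-seeded BFS expansion with a
    recalibrating z-score gate, then small-cluster reassignment."""
    n = len(X)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, nbrs = nn.kneighbors(X)
    dists, nbrs = dists[:, 1:], nbrs[:, 1:]  # drop the self-neighbor

    # Density proxy: inverse mean knn distance; densest points seed first.
    density = 1.0 / (dists.mean(axis=1) + 1e-12)
    order = np.argsort(-density)

    labels = np.full(n, -1)
    cid = 0
    for seed in order:
        if labels[seed] != -1:
            continue
        # Phase 1: BFS with a running z-score constraint on edge length.
        labels[seed] = cid
        q = deque([seed])
        accepted = []  # distances accepted into this cluster so far
        while q:
            p = q.popleft()
            for nb, d in zip(nbrs[p], dists[p]):
                if labels[nb] != -1:
                    continue
                if len(accepted) >= 2:
                    mu, sd = np.mean(accepted), np.std(accepted)
                    if sd > 0 and (d - mu) / sd > z_max:
                        continue  # frontier thinning out: stop here
                labels[nb] = cid
                accepted.append(d)
                q.append(nb)
        cid += 1

    # Phase 2 (SCR, simplified): dissolve small clusters and reassign each
    # of their points to the large cluster best supported among its knn
    # (inverse-distance vote stands in for the real geometric score).
    sizes = np.bincount(labels)
    small = set(np.flatnonzero(sizes < min_cluster))
    for p in range(n):
        if labels[p] in small:
            votes = {}
            for nb, d in zip(nbrs[p], dists[p]):
                c = labels[nb]
                if c not in small:
                    votes[c] = votes.get(c, 0.0) + 1.0 / (d + 1e-12)
            if votes:
                labels[p] = max(votes, key=votes.get)
    return labels
```

On two well-separated blobs this assigns each blob its own label set, with Phase 2 absorbing any boundary fragments left by Phase 1. The real algorithm's SCR score (enclosure, knn count, nearness) and its handling of stubborn leftover points are richer than this vote.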
submitted by /u/Significant-Agent854

