Laurent5
ServiceNow Employee

I recently spent some time with a customer to analyse their data using Clustering Analytics. We explored several of their record types to uncover automation opportunities. It occurred to me that some of the advanced parameters of the Clustering Solution Definitions can be quite obscure and intimidating. So, I decided to write a little article to help demystify these.

 

As a quick recap, Clustering is a Machine Learning technique that divides records into groups so that records in the same group are more similar to one another than to records in other groups.

 

From Word Corpus to (Clustering) Solution Definition…

There are two main sets of parameters: those you set when creating the Solution Definition, and those you can adjust afterwards on the Advanced tab.

 

The first decision comes when you create the Solution Definition and attach a Word Corpus.

As a reminder, a Word Corpus is in essence a dictionary of terms that will be used by the Clustering algorithm.

If you open an existing Word Corpus, or create a new one, you will notice a Type field which offers the choices of Paragraph-Vector, TF-IDF or Pre-trained.

https://docs.servicenow.com/bundle/utah-intelligent-experiences/page/administer/predictive-intellige...

 

[Image: clust_pict1.png]

 

The concept of Paragraph Vectors was introduced by Le and Mikolov in this article as a way to generate unsupervised representations of sentences, paragraphs, or entire documents without losing local word order.

In simple terms, Paragraph-Vector (an extension of the word2vec approach, often referred to as doc2vec) produces a numeric representation of text content.

Other methods are available, such as Bag-of-Words, which focuses on the number of occurrences of certain words but is not as sophisticated and does not take into account aspects such as word ordering.

Word vectors, on the other hand, capture the relationships between words, so that certain words have a “stronger” relationship to other words.
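
To make this concrete, here is a minimal sketch of training a Paragraph-Vector (doc2vec) model with the open-source gensim library. This is purely for intuition, not how Predictive Intelligence implements it internally, and the short-description corpus is made up:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus standing in for short descriptions (hypothetical data).
docs = [
    "password reset required for email account",
    "cannot connect to vpn from home office",
    "email password expired need reset",
    "vpn connection drops every few minutes",
]

# Tag each document so the model learns one vector per document.
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

# Train a tiny Paragraph-Vector model; real corpora need far more data.
model = Doc2Vec(tagged, vector_size=16, min_count=1, epochs=50)

# Each document is now a fixed-length numeric vector that a clustering
# algorithm can measure distances between.
vec = model.infer_vector("forgot my email password".split())
print(vec.shape)  # (16,)
```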

 

The other choice available to you is TF-IDF, which stands for Term Frequency - Inverse Document Frequency.

This is another method that evaluates how often a word appears in each document (Term Frequency) but offsets it by the number of documents it appears in across the collection (hence the “Inverse Document Frequency”). This way, noise words (words that appear frequently across a pool of documents) become less relevant to any single document.
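
If you want to see the effect for yourself, here is a small sketch using scikit-learn’s TfidfVectorizer (again an illustration with made-up documents, not the platform’s internal code). The word “the” appears in every document and so scores low, while a rare word like “vpn” scores high:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy "documents"; "the" appears in all of them, "vpn" in only one.
docs = [
    "the vpn is down",
    "the printer is out of paper",
    "the email server is slow",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# Compare the weights of a noise word and a distinctive word in document 0.
for word in ["the", "vpn"]:
    col = vectorizer.vocabulary_[word]
    print(word, round(matrix[0, col], 3))
```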

 

So, how do the two differ?

As we have seen above, TF-IDF focuses primarily on the occurrence of words and their relevance to a set of documents (it was invented for text search), so it is fairly simple to calculate and therefore quite computationally efficient. Word2vec, on the other hand, pays greater attention to the meaning of and relationships between words.

 

Pre-trained is another method, using Global Vector embeddings, also known as GloVe.

GloVe is quite similar to word2vec in that both represent words in the form of a vector. They use different training approaches (a feed-forward neural network vs matrix factorisation), and word2vec is faster to compute.

More information can be found in this article.
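
Within Predictive Intelligence the pre-trained embedding is managed for you, so there is nothing to configure here. Purely for intuition, the sketch below loads a small set of publicly available pre-trained GloVe vectors through gensim’s downloader and shows that word relationships come “for free”, with no training step:

```python
import gensim.downloader as api

# Download a small pre-trained GloVe model (name from the gensim-data catalogue).
glove = api.load("glove-wiki-gigaword-50")

# The vectors already encode relationships between words.
print(glove.most_similar("password", topn=3))
print(glove.similarity("laptop", "computer"))
```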

 

OK, now that we have looked at our Word Corpus, let’s turn our attention to the clustering solution. When you create one, you will notice an “Advanced Solution Settings” tab at the bottom of the screen.

There, if you click New, you will be able to pick from an intimidating list of parameters:

 

[Image: clust_pict1.png]

First, we’ll observe that three of these parameters have a category of ALGO, while the others are (hyper)parameters, and one is ‘logging’.

The three algorithms we can choose from are DBSCAN, HDBSCAN and Levenshtein Distance.

As highlighted in the documentation, Predictive Intelligence uses the k-means algorithm by default. So first let’s review it and understand the differences.

 

A quick primer on K-Means…

K-means is one of the most famous algorithms used for clustering in an unsupervised manner, where K denotes the number of clusters. Its main principle is to calculate the distance between a key data point, called a centroid, and other data points in order to assign them to a cluster.

The number of centroids is defined by the number of clusters to be created, i.e. the value of K.

Once centroids are defined, the distance from each data point to the centroids is calculated, and each point is assigned to its nearest centroid. Once that is done, each centroid is “repositioned” (re-initialized) at the centre of its newly formed cluster, and the process repeats until the assignments stabilise.

Whilst K-means has many benefits (it is easy to use and understand, and quite efficient), it will always assign outliers to a cluster, which may distort our clusters.
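
The sketch below illustrates that last point with scikit-learn’s KMeans on made-up 2-D data: the far-away outlier is still forced into one of the K clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two dense blobs plus one far-away outlier (hypothetical 2-D data).
points = np.array([[1, 1], [1, 2], [2, 1],
                   [8, 8], [8, 9], [9, 8],
                   [25, 25]])  # outlier

# K must be chosen up front; here K = 2.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# The outlier gets a cluster label anyway -- K-means has no notion of noise.
print(km.labels_)
print(km.cluster_centers_)
```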

 

Enter DBSCAN and HDBSCAN

DBSCAN (Density-based spatial clustering of applications with noise) is a popular data clustering algorithm that groups together similar data points based on their density, hence the “Density Based” in the name. This means it will group together data points with many neighbours. Noise in that instance refers to outliers, i.e. data points that may not be attributed to any specific cluster. 

HDBSCAN is an extension of DBSCAN that offers several improvements, such as automatically determining the optimal number of clusters and also handling clusters of varying densities.

HDBSCAN is also faster and more efficient to compute than DBSCAN.
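
To contrast the two with K-means, here is a minimal sketch on the same made-up data (HDBSCAN was added to scikit-learn in version 1.3; this is an illustration, not the platform’s implementation). Notice that the outlier is now labelled -1, i.e. noise, instead of being forced into a cluster:

```python
import numpy as np
from sklearn.cluster import DBSCAN, HDBSCAN  # HDBSCAN needs scikit-learn >= 1.3

points = np.array([[1, 1], [1, 2], [2, 1],
                   [8, 8], [8, 9], [9, 8],
                   [25, 25]])  # same data as above, outlier included

# DBSCAN: both hyper-parameters are chosen by hand.
db = DBSCAN(eps=2.0, min_samples=2).fit(points)
print(db.labels_)   # the outlier is labelled -1 (noise)

# HDBSCAN: no Epsilon needed; it adapts to clusters of varying density.
hdb = HDBSCAN(min_cluster_size=2).fit(points)
print(hdb.labels_)
```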

 

[Image: clust_pict3.jpg]

 

The third ALGO is Levenshtein distance:

The Levenshtein distance is a number that tells you how different two strings are: the higher the number, the more different the two strings. It is defined as the minimum number of single-character edits required to turn one string into the other, where an “edit” is either an insertion of a character, a deletion of a character, or a replacement of a character. For instance, CAT is 1 edit away from FAT.
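
For the curious, the distance is typically computed with a short dynamic-programming routine; here is a minimal sketch (not ServiceNow’s internal code):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("CAT", "FAT"))          # 1 (one substitution)
print(levenshtein("CLUSTER", "CLUTTER"))  # 1
```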

 

Finally, let’s look at some of the other categories, i.e. the Clustering Parameters. These are also known as hyper-parameters.

 

Minimum neighbours: This parameter defines the minimum number of neighbours on either side of the keyword.

 

Epsilon and Min_samples are two of the most commonly used hyper-parameters in clustering techniques:

Epsilon is a distance value that determines the maximum distance between two points for them to still belong to the same cluster. It is the radius of the circle drawn around each data point to check the density.

Min_samples is another common parameter; it determines how many data points need to fall within the Epsilon radius for the neighbourhood to be considered a cluster.

When a point “includes” at least that many other points (3 in the illustration below), it is considered a “core” point.

See the illustration below:

 

[Image: clust_pict4.png]
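
A quick way to build intuition for how these two parameters interact is to sweep Epsilon over some made-up data and watch the labels change: too small and everything is noise, too large and everything merges into one cluster (a sketch with scikit-learn, not the platform’s code):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two small blobs and one outlier (hypothetical 2-D data).
points = np.array([[1, 1], [1, 2], [2, 1], [2, 2],
                   [6, 6], [6, 7], [7, 6],
                   [15, 15]])

# min_samples is fixed at 3; only the Epsilon radius varies.
for eps in (0.5, 1.0, 6.0):
    labels = DBSCAN(eps=eps, min_samples=3).fit(points).labels_
    print(f"eps={eps}: {labels}")
# eps=0.5 -> every point is noise (-1); eps=1.0 -> two clusters plus noise;
# eps=6.0 -> the blobs merge into one cluster, the outlier stays noise.
```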

 

This is it for today. Hopefully these advanced parameters will give you additional tools to fine tune your clusters in Predictive Intelligence. Please feel free to share comments and observations below. Happy Clustering!

Comments
Bernard Esterhu
Tera Expert

@Laurent5 thanks for the useful information!

I would like to know what the effect is of selecting hyperparameters for DBSCAN and HDBSCAN. If I only select an ALGO, how are the hyperparameters set? I see Epsilon and min_samples in the list of parameters - these are the ones I would expect to have to configure for DBSCAN and HDBSCAN?

During testing, when I selected the DBSCAN algorithm, the "Minimum neighbors required for a point to be part of a cluster" and "Epsilon" parameters were automatically added (I don't think Minimum neighbors required for a point to be part of a cluster is a hyperparameter for DBSCAN...only min_samples and epsilon..?). However, when I selected HDBSCAN, no other parameters were added.

Also, how can Epsilon be set, without knowing the "scale" of the vectors in the dataset?
