In machine learning and artificial intelligence, data classification is a fundamental process. The goal is to assign labels to data points based on their features. This involves analysing known data (training data) where each example is labelled with a category or value. Labels help establish patterns and relationships within the data, making it possible for the model to make accurate predictions about new, unseen data points. Unfortunately, working with labelled data presents its own problems: the manual processes involved in labelling data can be time-consuming and difficult, and the resource investment may put it out of reach for some organisations.
The k-nearest neighbours (KNN) algorithm offers a straightforward and efficient solution to this problem. Instead of requiring complex calculations up front, KNN works by storing all the data and then making predictions for new data based on how similar it is to existing data. This allows KNN to make accurate predictions without extensive fine-tuning, which is particularly useful when working with smaller datasets and limited computing power.
Vectors are integral to the functionality of the k-nearest neighbours algorithm. A vector is a sequence of numbers that represents a point in a multi-dimensional space. Machine learning models must be able to transform raw, unstructured data into these numerical representations, known as embeddings. Embeddings capture the semantic or structural essence of the input data, with the relationships between embeddings represented as their spatial proximity (how close or far away they are from each other) in the vector space.
KNN uses this spatial arrangement by identifying the "neighbours" of a query point — other embeddings positioned closely within the multi-dimensional space. These neighbours reflect data points with shared characteristics or similar features.
For example, two documents with similar themes will have embeddings that are closer together, enabling KNN to recognise the similarities and associations so that it can classify new data or predict outcomes based on these relationships.
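As a minimal illustration, the sketch below uses hand-made three-dimensional vectors standing in for document embeddings (real embeddings would come from a trained model and typically have hundreds of dimensions). The two documents with a shared theme end up measurably closer together than the unrelated one.

```python
import numpy as np

# Hypothetical, hand-made embeddings; the values are invented for illustration only
doc_recipes = np.array([0.9, 0.1, 0.2])   # article about recipes
doc_baking  = np.array([0.8, 0.2, 0.1])   # article about baking (similar theme)
doc_finance = np.array([0.1, 0.9, 0.7])   # article about financial markets

# Euclidean distance: smaller values mean the points are closer in the vector space
print(np.linalg.norm(doc_recipes - doc_baking))   # small distance (similar themes)
print(np.linalg.norm(doc_recipes - doc_finance))  # larger distance (different themes)
```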
The k-nearest neighbours algorithm operates by using vectors to identify the 'k' closest data points (neighbours) to a new data point and making predictions based on these neighbours. For instance, if the goal is to classify emails as spam or not spam, KNN would look at the 'k' most similar emails and classify the new email based on the majority classification of these neighbours.
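The snippet below is a minimal sketch of this idea using scikit-learn's KNeighborsClassifier, assuming each email has already been reduced to a small numeric feature vector; the features and values are invented purely for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# Invented features per email: [number of links, exclamation marks, length in KB]
X_train = [[8, 5, 1.2], [7, 4, 0.9], [9, 6, 1.0],
           [0, 0, 3.4], [1, 1, 2.8], [0, 1, 4.1]]
y_train = ["spam", "spam", "spam", "not spam", "not spam", "not spam"]

# With k=3, a new email receives the majority label of its 3 nearest neighbours
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

new_email = [[6, 3, 1.1]]
print(model.predict(new_email))  # most likely ['spam'], given its closest neighbours
```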
Alternatively, imagine an organisation has data on various customers, with features like age, interests and purchase history. KNN can group these customers into categories such as frequent buyers, occasional shoppers and window shoppers by comparing their features. If a new customer visits the website, KNN can predict their shopping behaviour by evaluating which group they most closely resemble.
The algorithm's adaptability extends even further when used with multimodal datasets. Here, information is combined from multiple sources at once, such as text, images or audio. KNN can analyse these embeddings in a shared vector space, identifying similarities across distinct modalities. Applying KNN to multimodal data allows it to find the most similar neighbour regardless of data types. This makes KNN a versatile algorithm for handling increasingly complex and diverse data scenarios.
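A rough sketch of that idea, assuming some multimodal model has already mapped both an image and several candidate captions into the same (here, invented four-dimensional) vector space:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical caption embeddings, assumed to share a vector space with image embeddings
captions = ["a dog playing in the park", "a plate of pasta", "a city skyline at night"]
caption_embeddings = np.array([
    [0.9, 0.1, 0.0, 0.2],
    [0.1, 0.8, 0.1, 0.3],
    [0.2, 0.1, 0.9, 0.4],
])

# Hypothetical embedding of a photo of a dog
image_embedding = np.array([[0.85, 0.15, 0.05, 0.25]])

# The nearest caption embedding is treated as the best textual match for the image
index = NearestNeighbors(n_neighbors=1).fit(caption_embeddings)
_, nearest = index.kneighbors(image_embedding)
print(captions[nearest[0][0]])  # expected: "a dog playing in the park"
```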
- Pattern recognition
KNN is widely used in pattern recognition tasks, such as image and handwriting recognition. By comparing new images or samples to a labelled dataset, KNN can accurately classify objects, characters or faces based on similarity to known patterns.
- Data processing
KNN is effective in preprocessing data, such as imputing missing values or detecting outliers. By analysing the nearest neighbours, KNN can estimate missing values based on the most similar data points, improving data quality and consistency (see the sketch after this list).
- Recommendation engines
KNN helps build recommendation systems by analysing user behaviour and preferences. By finding users with similar interests, KNN can suggest products, films or content that others with similar profiles have liked, enhancing user experience and engagement.
- Image-to-text transformation
KNN is increasingly used in image-to-text transformation tasks within multimodal systems. By comparing image embeddings to those of textual descriptions, KNN enables AI systems to perform complex tasks like automated captioning, where the closest matches provide contextually appropriate text for a given image.
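As referenced in the data processing item above, the sketch below shows missing-value imputation with scikit-learn's KNNImputer on a small invented dataset; each missing entry is filled in from the most similar rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small invented dataset; np.nan marks a missing value in the second column
X = np.array([
    [25, 50.0],
    [27, 55.0],
    [26, np.nan],   # value to be imputed
    [60, 200.0],
    [62, 210.0],
])

# Each missing value is replaced by the average of that feature across the
# 2 nearest neighbours, measured on the features that are present
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```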
In each of the applications listed above, the accuracy of KNN predictions relies heavily on the distance metric used to measure the similarity between data points. This metric determines how the algorithm calculates the "closeness" of data points so that it can classify or predict new data points effectively.
Euclidean distance is the most common metric used in KNN, calculating the straight-line distance between two points in Euclidean space. Imagine using a map and a ruler to measure the shortest path between two locations. The shorter the distance, the more similar the points are considered to be. For instance, when comparing the height and weight of different individuals, the Euclidean distance would identify the individuals who are most similar based on these two features: those separated by the shortest straight-line distance.
Manhattan distance measures the absolute differences between points along each dimension, like navigating a grid of city streets. Picture a city grid where movement can only progress along the streets (rather than diagonally through buildings). This metric is useful when data points are structured in a grid-like pattern, such as comparing delivery routes or urban planning scenarios.
Minkowski distance is a generalisation of both Euclidean and Manhattan distances. By adjusting a parameter 'p', it can behave like either metric. Think of Minkowski distance as a flexible tool that can adapt to different scenarios based on the specific needs of the data analysis. For example, if someone were to compare properties with different dimensions (such as price, area and number of rooms), adjusting the 'p' value would help emphasise certain dimensions over others, making it a versatile metric for diverse types of data comparisons.
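The three metrics can be compared directly on a pair of example points; the sketch below uses NumPy and invented values, and shows how the Minkowski parameter 'p' generalises the other two.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean distance: straight-line distance between the two points
euclidean = np.sqrt(np.sum((a - b) ** 2))           # 5.0

# Manhattan distance: sum of absolute differences along each dimension
manhattan = np.sum(np.abs(a - b))                    # 7.0

# Minkowski distance with parameter p; p=1 gives Manhattan, p=2 gives Euclidean
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)    # roughly 4.5 for p=3

print(euclidean, manhattan, minkowski)
```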
Without defining the right value for 'k', the KNN algorithm won't function as intended. Choosing too small a value of 'k' can make predictions overly sensitive to noise in the data, leading to high variance and less stable predictions. On the other hand, an overly large value might smooth out the predictions but make the model too generalised, causing it to miss specific patterns.
To find the optimal value for 'k', practitioners typically use cross-validation (a technique where the dataset is divided into training and validation sets multiple times to test different 'k' values). This helps identify a 'k' that minimises prediction errors while maintaining the algorithm's generalisation capability.
This process may involve some trial and error. Finding the right 'k' involves testing various values to ensure the model performs well on both seen and unseen data, achieving the optimal balance of stability and specificity.
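A common way to run this search is scikit-learn's cross_val_score; the sketch below scores several candidate values of 'k' on the built-in Iris dataset (chosen here purely as a convenient example) and keeps the best one.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate k with 5-fold cross-validation
scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```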
Establishing connections, similarities and relationships between data points is the overall purpose of the k-nearest neighbours algorithm. What helps make this model such a popular choice for organisations is the additional set of advantages it brings to the table. The benefits of KNN include:
KNN is straightforward to implement and understand, even for beginners in machine learning. It does not require a complex training phase; instead, it memorises the training dataset and uses it directly to make predictions.
Whether used for classification or regression tasks, KNN can handle the various data structures and relationships necessary to group data points. This flexibility allows it to be applied across multiple domains — finance, healthcare, e-commerce and more.
KNN requires only a few hyperparameters, primarily the value of 'k' and the distance metric. This reduces the complexity involved in model tuning compared to other algorithms that may require extensive parameter optimisation. As a result, it simplifies the overall model development process and makes it easier to achieve good performance with minimal adjustments.
While the KNN algorithm offers several advantages, it also presents certain notable weaknesses. These may include:
The curse of dimensionality refers to the exponential increase in data required to maintain the same level of performance as the number of features (or dimensions) grows. In high-dimensional spaces, the distance between data points becomes less meaningful, making it difficult for KNN to identify truly "nearest" neighbours. This issue can significantly reduce the algorithm's accuracy and effectiveness in datasets with many features.
KNN can be negatively impacted by noise and outliers in the dataset, particularly when the value of 'k' is small. This sensitivity can lead to overfitting, where the algorithm captures noise and anomalies as if they were true patterns. Overfitting results in poor generalisation to new, unseen data, reducing the model's predictive performance.
Computational complexity grows with the size of the dataset, making KNN inefficient for very large datasets. Each prediction requires calculating the distance between the new data point and all existing points in the training set, leading to high memory usage and long computation times. This lack of scalability limits KNN's applicability in scenarios with large volumes of data.
As previously stated, the KNN algorithm classifies data points based on their proximity to other data points in the dataset. To do that, the algorithm must follow a specific set of steps (a minimal code sketch of these steps appears after the list):
1. Choose the number of neighbours (k)
Define the value of 'k' to consider when making the classification or regression. This value will influence how the algorithm evaluates the similarity between data points.
2. Calculate the distance
For each data point in the training set, calculate the distance between it and the new data point using one of the standard distance metrics (Euclidean, Manhattan or Minkowski distance). This distance measurement helps identify what should be considered the closest neighbours to the new data point.
3. Identify the nearest neighbours
Sort the distances calculated in Step 2 and determine the 'k' nearest neighbours. These neighbours are the data points that are closest to the new data point based on the chosen distance metric.
4. Make a prediction
For classification tasks, assign the new data point to the class that is most common among its 'k' nearest neighbours. For regression tasks, calculate the average or median value of the 'k' nearest neighbours and use this value as the prediction for the new data point.
5. Evaluate the model
Assess the accuracy and performance of the KNN model by using cross-validation techniques. Adjust the value of 'k' and the distance metric as needed to optimise the model's predictions.
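The sketch below walks through steps 2 to 4 from scratch with NumPy, using Euclidean distance and a majority vote on invented two-feature data; step 1 is the choice of 'k' passed in as a parameter, and step 5 would be handled separately with cross-validation, as described earlier. A production implementation would normally rely on a library such as scikit-learn instead.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest neighbours."""
    # Step 2: calculate the Euclidean distance to every training point
    distances = np.linalg.norm(X_train - new_point, axis=1)
    # Step 3: identify the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # Step 4: predict the most common class among those neighbours
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Invented two-feature training data with two classes
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                    [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])
y_train = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))  # expected "A"
print(knn_predict(X_train, y_train, np.array([5.0, 5.0])))  # expected "B"
```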
There are several methods for performing the nearest-neighbour search at the heart of the KNN algorithm, each with its own advantages and suitable applications. The following methods help optimise the process of finding the nearest neighbours, making KNN an efficient option for different types of datasets; a short sketch showing how to switch between them appears after the list.
- Brute force
The brute force method calculates the distance between the query point and all other points in the dataset. It is simple but computationally expensive, making it most suitable for small datasets.
- K-dimensional tree (k-d tree)
A k-d tree organises points in a k-dimensional space by recursively dividing the space into hyperrectangles. It reduces distance calculations and speeds up KNN searches for moderately high-dimensional data.
- Ball tree
A ball tree partitions the space into nested hyperspheres, allowing efficient nearest neighbour searches by eliminating irrelevant portions of the dataset. It is particularly effective for high-dimensional data and often outperforms k-d trees in these scenarios.
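In scikit-learn these strategies are exposed through the algorithm parameter of the neighbour-search classes, so switching between them is a one-word change; the data below is randomly generated purely for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((1000, 8))      # 1,000 points in 8 dimensions
query = rng.random((1, 8))

# The same query run with each search strategy; the neighbours returned are identical,
# but speed and memory behaviour differ with dataset size and dimensionality
for method in ["brute", "kd_tree", "ball_tree"]:
    index = NearestNeighbors(n_neighbors=5, algorithm=method).fit(X)
    distances, neighbours = index.kneighbors(query)
    print(method, neighbours[0])
```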
The k-nearest neighbours algorithm is invaluable for its ability to classify data points and quantify relationships for AI systems. ServiceNow, a leader in enterprise IT solutions, integrates advanced AI and KNN, providing powerful tools for digital transformation. ServiceNow's award-winning Now Platform® harnesses AI and machine learning to automate, optimise and modernise workflows across the full range of business functions, allowing for intelligent optimisation company-wide.
Integrating KNN and other advanced algorithms, ServiceNow enables organisations to leverage AI for improved decision-making, reduced turnaround times and a more efficient approach to business. Experience the transformative power of AI and the Now Platform; demo ServiceNow today!