What is Semi-Supervised Learning: Working, Examples and More

What is semi-supervised learning?

Semi-supervised learning is a Machine Learning approach that combines a small amount of labelled data (inputs with known outputs) with a large amount of unlabelled data (inputs only) when training an Artificial Intelligence (AI) model. The goal is to improve the model's predictive performance by extracting useful structure from the unlabelled data.

How does semi-supervised learning work?

Semi-supervised learning relies on several assumptions about how unlabelled data relates to the labels. These assumptions guide the learning process and help the model generalise well from the limited labelled data.

Here are some of the key assumptions: 👇

Smoothness assumption

This assumption suggests that if two data points are close to each other in the feature space, their labels are likely to be similar. For example, if two houses have similar features (like number of bedrooms, square footage and location), it's reasonable to assume their prices are similar.
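As a rough illustration of the smoothness assumption, the sketch below estimates a house price by averaging the prices of its nearest labelled neighbours. The function name `knn_predict`, the feature choice and all the numbers are made up for illustration; they are assumptions, not data from a real housing set.

```python
import math

def knn_predict(query, labelled, k=2):
    """Average the labels of the k labelled points closest to `query`."""
    nearest = sorted(labelled, key=lambda item: math.dist(query, item[0]))[:k]
    return sum(label for _, label in nearest) / len(nearest)

# Hypothetical houses: (bedrooms, floor area in 100 m^2) -> price in thousands.
labelled_houses = [
    ((3, 1.2), 300.0),
    ((3, 1.3), 320.0),
    ((5, 3.0), 900.0),
]

# A house that sits between the first two in feature space gets a
# price close to theirs, and is barely influenced by the distant one.
estimate = knn_predict((3, 1.25), labelled_houses, k=2)
print(estimate)
```

The estimate only makes sense if nearby points really do have similar labels, which is exactly what the smoothness assumption claims.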

Cluster assumption

Here, data points that form a cluster are likely to belong to the same class. For example, in a customer segmentation task, if a group of customers has similar purchasing behaviour (forming a cluster), they’re likely to belong to the same customer segment, such as ‘high spenders’ or ‘budget shoppers’.
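One way to picture the cluster assumption is "cluster, then label": group all customers first, then give each cluster the label of the few labelled customers inside it. The sketch below uses a minimal k-means written from scratch. The customer data, the segment names attached to specific customers and the choice to seed the centroids with the labelled points are all assumptions made for this example.

```python
import math
from collections import Counter

def kmeans(points, centroids, iterations=10):
    """Plain k-means; returns the cluster index assigned to each point."""
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment

def label_clusters(assignment, known_labels):
    """Give every point the majority label of the labelled points in its cluster."""
    votes = {}
    for index, label in known_labels.items():
        votes.setdefault(assignment[index], Counter())[label] += 1
    return [votes[c].most_common(1)[0][0] if c in votes else None for c in assignment]

# Hypothetical customers: (orders per month, average basket value).
customers = [(1, 10), (2, 12), (1, 11), (9, 80), (10, 85), (8, 78)]
known = {0: "budget shoppers", 3: "high spenders"}  # only two labelled customers

clusters = kmeans(customers, centroids=[customers[0], customers[3]])
segments = label_clusters(clusters, known)
print(segments)
```

Two labels are enough here because the clusters are well separated; when clusters overlap, the cluster assumption is weaker and this approach becomes less reliable.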

Consistency assumption

This assumption implies that the model's predictions should be consistent under small modifications of the input data. For example, in a sentiment analysis task, slightly rephrasing a sentence (e.g. "I love this movie" to "I really love this movie") shouldn't change the model's prediction of positive sentiment.
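The consistency assumption is often turned into a penalty that needs no labels: compare the model's prediction on an input with its prediction on a slightly altered version, and penalise the gap. The sketch below is a toy version. The word weights stand in for a trained sentiment model and are purely hypothetical, as are the function names.

```python
import math

# Hypothetical word weights standing in for a trained sentiment model.
WEIGHTS = {"love": 2.0, "hate": -2.0, "really": 0.3}

def positive_probability(sentence):
    """Toy model: sum the word weights and squash the score into (0, 1)."""
    score = sum(WEIGHTS.get(word, 0.0) for word in sentence.lower().split())
    return 1.0 / (1.0 + math.exp(-score))

def consistency_penalty(original, perturbed):
    """Squared gap between predictions on an input and its perturbed version."""
    return (positive_probability(original) - positive_probability(perturbed)) ** 2

# No label is needed to compute this penalty, so it can be
# evaluated on unlabelled sentences as well as labelled ones.
penalty = consistency_penalty("I love this movie", "I really love this movie")
print(penalty)
```

During training, a term like this would typically be added to the usual supervised loss, so the model is pushed towards stable predictions on unlabelled inputs too.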

Examples of semi-supervised learning

Semi-supervised learning has a wide range of applications across various fields due to its ability to leverage both labelled and unlabelled data. Here are some key areas where it's particularly useful:

Medical diagnosis

In healthcare, semi-supervised learning can be applied to medical imaging and diagnostic data. With a limited number of labelled medical images or patient records, the model can learn to identify diseases or abnormalities more accurately by incorporating unlabelled data.

Fraud detection

In finance, semi-supervised learning can be used to detect fraudulent transactions. With a small set of labelled fraud cases and a large set of unlabelled transactions, the model can better identify patterns of fraud.

Sentiment analysis

In social media and customer feedback analysis, semi-supervised learning can help in understanding the sentiment behind text. By using a small set of labelled reviews or comments together with a large set of unlabelled text, the model can classify sentiments as positive, negative or neutral more accurately than it could from the labelled examples alone.

What are the different types and techniques used in semi-supervised learning?

Semi-supervised learning employs several techniques to use both labelled and unlabelled data effectively. 👇

Self-training

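The model is initially trained on labelled data and then used to generate pseudo-labels for unlabelled data. These pseudo-labels (typically only the ones the model predicts with high confidence) are added to the training set, and the model is retrained iteratively. This helps improve the model by incorporating its own predictions, although confidently wrong pseudo-labels can reinforce its mistakes.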
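The self-training loop can be sketched in a few lines. This is a minimal illustration, not a production recipe: the "model" is a simple nearest-centroid classifier on one numeric feature, and the confidence formula, the `threshold` of 0.8 and the data are all assumptions chosen to keep the example small.

```python
def fit_centroids(labelled):
    """Train the toy model: the mean feature value per class."""
    grouped = {}
    for x, y in labelled:
        grouped.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in grouped.items()}

def predict_with_confidence(x, centroids):
    """Predict the nearest class; confidence is 1.0 when x sits on a centroid."""
    distances = {label: abs(x - centre) for label, centre in centroids.items()}
    best = min(distances, key=distances.get)
    total = sum(distances.values())
    confidence = 1.0 if total == 0 else 1.0 - distances[best] / total
    return best, confidence

def self_train(labelled, unlabelled, threshold=0.8, max_rounds=10):
    labelled, pool = list(labelled), list(unlabelled)
    for _ in range(max_rounds):
        centroids = fit_centroids(labelled)          # 1. train on current labels
        confident, remaining = [], []
        for x in pool:                               # 2. pseudo-label the pool
            label, confidence = predict_with_confidence(x, centroids)
            (confident if confidence >= threshold else remaining).append((x, label))
        if not confident:                            # nothing new to learn from
            break
        labelled += confident                        # 3. keep confident pseudo-labels
        pool = [x for x, _ in remaining]
    return fit_centroids(labelled), labelled

seed = [(1.0, "neg"), (9.0, "pos")]                  # two labelled examples
unlabelled = [1.5, 2.5, 4.0, 6.0, 7.5, 8.5]
model, training_set = self_train(seed, unlabelled)
print(training_set)
```

In this run, the four points near the seeds are pseudo-labelled, while 4.0 and 6.0 stay unlabelled because the model is never confident enough about them. For real projects, scikit-learn offers a ready-made wrapper, `SelfTrainingClassifier`, that applies the same idea to any classifier that outputs probabilities.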

Label propagation

Label propagation is a semi-supervised Machine Learning algorithm designed for classification tasks in graph-based structures.

It works by spreading labels from a small subset of labelled nodes to a larger set of unlabelled nodes, using the graph's structure to make informed predictions. This method allows the model to learn from both labelled and unlabelled data, enhancing its accuracy.
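To make the spreading step concrete, here is a simplified sketch of label propagation on a tiny graph. Each unlabelled node repeatedly takes the average of its neighbours' label scores, while the labelled nodes are held fixed. The graph, the class names and the fixed number of iterations are assumptions for illustration; library implementations usually build the graph from similarities between data points rather than from a hand-written neighbour list.

```python
def propagate_labels(neighbours, seeds, classes, iterations=50):
    """Spread seed labels over a graph by repeated neighbour averaging.

    Assumes every node has at least one neighbour.
    """
    scores = {}
    for node in neighbours:
        if node in seeds:
            scores[node] = {c: float(c == seeds[node]) for c in classes}
        else:
            scores[node] = {c: 1.0 / len(classes) for c in classes}

    for _ in range(iterations):
        updated = {}
        for node, adjacent in neighbours.items():
            if node in seeds:
                updated[node] = scores[node]  # clamp the known labels
            else:
                updated[node] = {
                    c: sum(scores[n][c] for n in adjacent) / len(adjacent)
                    for c in classes
                }
        scores = updated

    # Each node takes the class with the highest score.
    return {node: max(classes, key=lambda c: scores[node][c]) for node in neighbours}

# A chain of six nodes: 0 - 1 - 2 - 3 - 4 - 5, labelled only at the two ends.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
result = propagate_labels(graph, seeds={0: "red", 5: "blue"}, classes=["red", "blue"])
print(result)
```

Nodes closer to the "red" end of the chain end up red and those closer to the "blue" end end up blue, which is the graph's structure doing the work. scikit-learn's `LabelPropagation` and `LabelSpreading` classes provide fuller implementations of this family of methods.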

Multi-view training

Multi-view training is a technique in semi-supervised learning where multiple models are trained on different views or representations of the same data. Each model is responsible for learning from a specific view, and they collaborate to improve overall performance.
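One well-known form of multi-view training is co-training, where two models each see a different view and pass their confident predictions to a shared training set. The sketch below is a toy version under several assumptions: each example has exactly two numeric views, each model is a nearest-centroid classifier, and the `threshold` and data are invented for the example.

```python
def fit(labelled, view):
    """Nearest-centroid model on one view: mean feature value per class."""
    grouped = {}
    for features, label in labelled:
        grouped.setdefault(label, []).append(features[view])
    return {label: sum(vals) / len(vals) for label, vals in grouped.items()}

def predict(model, value):
    """Return (label, confidence); confidence is 1.0 when value sits on a centroid."""
    distances = {label: abs(value - centre) for label, centre in model.items()}
    best = min(distances, key=distances.get)
    total = sum(distances.values())
    return best, (1.0 if total == 0 else 1.0 - distances[best] / total)

def co_train(labelled, unlabelled, threshold=0.8, rounds=5):
    labelled, pool = list(labelled), list(unlabelled)
    for _ in range(rounds):
        models = [fit(labelled, view) for view in (0, 1)]  # one model per view
        newly_labelled = {}
        for features in pool:
            for view, model in enumerate(models):
                label, confidence = predict(model, features[view])
                if confidence >= threshold:
                    # If both views are confident, the first view's label is kept.
                    newly_labelled.setdefault(features, label)
        if not newly_labelled:
            break
        labelled += list(newly_labelled.items())           # shared training set
        pool = [f for f in pool if f not in newly_labelled]
    return [fit(labelled, view) for view in (0, 1)], labelled

# Each example is a tuple (view 0 feature, view 1 feature).
seed = [((1.0, 1.0), "neg"), ((9.0, 9.0), "pos")]
unlabelled = [(2.0, 5.0), (5.0, 8.0), (4.5, 5.5)]
models, training_set = co_train(seed, unlabelled)
print(training_set)
```

Here the first unlabelled example is ambiguous in view 1 but clear in view 0, and the second is the other way round, so each model supplies a label the other could not. The third example is unclear in both views and is left unlabelled.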

How is semi-supervised learning different from supervised and unsupervised learning?

Semi-supervised learning combines a small set of labelled data with a large amount of unlabelled data. This approach can improve a model's performance, especially when labelled data is expensive to gather.

Supervised learning, on the other hand, relies solely on labelled data for training, where each training example includes both an input and its corresponding correct output, making it well suited to tasks like classification and regression.

In contrast, unsupervised learning works with unlabelled data to uncover hidden patterns or structures, such as grouping similar data points in clustering tasks. It doesn't depend on predefined outputs, making it well-suited for exploratory data analysis.