Even with low-cost, widely available cloud computing, it can take significant time and compute power to train machine learning models on large datasets. This is expensive and often at odds with the net-zero carbon goals of many organisations today. Throwing more data at a problem isn't always the best answer, and by using AI that is responsible by design, we can reduce these costs while maintaining performance. Active learning is one of the methods we use at Mind Foundry to achieve this.
In most cases, it's not necessary to use an entire dataset to train a model: data points are often either similar to one another or noisy, so the underlying information in the dataset is encoded inefficiently. By reducing the volume of data while retaining the vast majority of its predictive power, we can train a machine learning (ML) model with comparable performance in a fraction of the time. In this article, we compare data subsetting strategies and the impact they have on the training time and accuracy of ML models, in a case study of digit classification using a Support Vector Machine (SVM) classifier on subsets of the MNIST dataset.
Building subsets with active learning
In an active learning setting, a learning algorithm identifies the areas of a problem or dataset where it will benefit most from expert input, and iteratively requests that input over time, gradually building up an understanding of the underlying context without needing access to all of the available data.
Active learning can also be used to reduce dataset size and obtain an efficient subset that contains the vast majority of the information stored in the complete set. The subsetting Active Learner selects points to include in its training set according to a sampling strategy that identifies the subsets most likely to maximise the accuracy of the model trained on them.
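As a minimal sketch of this idea (using scikit-learn; `build_subset`, `budget`, and `batch_size` are illustrative names, not the Active Learner's actual API), an uncertainty-driven subsetting loop might look like this:

```python
import numpy as np
from sklearn.svm import SVC

def build_subset(X, y, budget, batch_size=500, seed=0):
    """Grow a training subset by repeatedly querying the points the
    current model is least certain about. A minimal sketch: the names
    here are illustrative, not the Active Learner's actual API."""
    rng = np.random.default_rng(seed)
    # Seed the subset with a small random sample.
    selected = list(rng.choice(len(X), size=batch_size, replace=False))
    remaining = [i for i in range(len(X)) if i not in set(selected)]

    while len(selected) < budget:
        # Fit an intermediate model on the points chosen so far.
        model = SVC(probability=True).fit(X[selected], y[selected])
        probs = model.predict_proba(X[remaining])
        # Uncertainty sampling: low top-class probability => high priority.
        scores = 1.0 - probs.max(axis=1)
        k = min(batch_size, budget - len(selected))
        picked = set(np.argsort(scores)[-k:])
        selected.extend(remaining[i] for i in picked)
        remaining = [idx for i, idx in enumerate(remaining) if i not in picked]
    return X[selected], y[selected]
```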
In this case study, we consider four different strategies for building subsets of data:
Random sampling: points are selected uniformly at random.
Uncertainty sampling: points are selected where the ML model is least certain of its predicted class.
Entropy sampling: points are selected with maximal entropy over the predicted class probabilities.
Margin sampling: points are selected for which the difference between the probabilities of the most and second most likely classes is smallest.
In each case, the probabilities are the class probabilities predicted by the SVM classifier, as sketched below.
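Each strategy can be expressed as a scoring function over the classifier's predicted class probabilities (an array of shape `(n_samples, n_classes)`, e.g. from scikit-learn's `predict_proba`), where a higher score means a point is selected sooner. A minimal sketch, with illustrative function names:

```python
import numpy as np

def uncertainty_score(probs):
    # Low confidence in the most likely class => high score.
    return 1.0 - probs.max(axis=1)

def entropy_score(probs):
    # Shannon entropy of the predicted class distribution.
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def margin_score(probs):
    # Small gap between the top two classes => high score,
    # so the margin is negated to rank the smallest gaps first.
    part = np.partition(probs, -2, axis=1)
    return -(part[:, -1] - part[:, -2])

def random_score(probs, seed=0):
    # Baseline: ignore the model entirely.
    return np.random.default_rng(seed).random(len(probs))
```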
For this study, subsets of 5,000, 10,000, and 15,000 points are selected from the original MNIST training set of 60,000 points. The performance of each subsetting method is measured through accuracy and training-time ratios relative to the SVM trained on the full dataset, calculated as follows:
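Writing \(a_{\text{sub}}\) and \(a_{\text{full}}\) for the accuracy of the subset-trained and full-data models, and \(t_{\text{sub}}\) and \(t_{\text{full}}\) for their respective training times, the ratios are:

\[
\text{accuracy ratio} = \frac{a_{\text{sub}}}{a_{\text{full}}}, \qquad \text{time ratio} = \frac{t_{\text{sub}}}{t_{\text{full}}}
\]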
The results of running each of the four sampling methods on the three subset sizes are summarised below.
Uncertainty, margin, and entropy sampling can all achieve over 99% of the full-dataset accuracy with a subset of 10,000 points, in less than 25% of the time it took to train the SVM on the full dataset. In practice, this means that for a less than 1% drop in accuracy, we can reduce the time taken to train the model by 75%.
Random sampling is the fastest of the four strategies but also causes a significant drop in accuracy, so a more strategic selection method should be preferred.
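For reference, a comparable experiment can be run end-to-end with scikit-learn. The sketch below reuses the illustrative `build_subset` helper from earlier (`fetch_openml` downloads MNIST on first use) to train one SVM on the full training set and one on an actively selected subset, then reports the two ratios:

```python
import time
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Load MNIST (70,000 digits; the standard split is 60,000 train / 10,000 test).
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test = X[:60000] / 255.0, X[60000:] / 255.0
y_train, y_test = y[:60000], y[60000:]

def evaluate(X_sub, y_sub):
    """Train an SVM on the given data and return (test accuracy, training time)."""
    start = time.perf_counter()
    model = SVC().fit(X_sub, y_sub)
    elapsed = time.perf_counter() - start
    return accuracy_score(y_test, model.predict(X_test)), elapsed

# Full-data baseline (slow on all 60,000 points; shown for completeness).
acc_full, t_full = evaluate(X_train, y_train)
# Actively selected subset, using the build_subset sketch from earlier.
X_sub, y_sub = build_subset(X_train, y_train, budget=10000)
acc_sub, t_sub = evaluate(X_sub, y_sub)
print(f"accuracy ratio: {acc_sub / acc_full:.3f}, time ratio: {t_sub / t_full:.3f}")
```

Note that this time ratio measures training time only; the cost of selecting the subset itself is what makes random sampling the cheapest strategy overall.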
Subsetting the data works well on many classification datasets and provides a huge opportunity for reducing training time, cost, and carbon impact in settings where 90%+ of peak accuracy is sufficient for the intended use. Particularly in iterative settings with frequent retraining, this strategy can significantly improve day-to-day use.
Beyond reducing the dataset sizes, training times, and associated carbon costs required to train AI models, active learning is a gateway to massively improved AI capabilities in organisations and is a founding principle of Mind Foundry's Three Pillars, on which all our solutions are built. The principles of active learning can be seen in the Bayesian analytical methods integrated and automated in the Mind Foundry Platform, which provides an efficient route to solving complex optimisation problems where data is challenging or expensive to access or simulate.
Find out more about Mind Foundry's work.