One key to AI’s revolutionary potential, and why it has already begun to deliver real impact across our society, is that it combines the raw processing power of computers with an ability to learn and adapt to new data and new problems. Nevertheless, as Machine Learning models become more pervasive in our daily lives, it is important to be aware of the challenges and limitations that must be addressed and overcome to ensure the longevity and responsibility of their use.
Fundamentally, we want Machine Learning models to “do the correct thing” and continue to do so throughout their use. The question of what “doing the correct thing” entails is down to the scientists and engineers who initially designed the algorithm. However, ensuring that the model continues to do the correct thing post-deployment is frequently overlooked and may result in unexpected consequences. And no, I’m not talking about a dystopian future where AI takes over the world; I’m talking about a much more realistic issue where the model simply becomes outdated and becomes unfit for use.
One of the best ways to overcome this problem is with continuous learning. But in order to understand what continuous learning is, we have to first understand some important supporting concepts.
How Supervised Learning Can Train Consistent Models
In supervised Machine Learning, the aim is to create a model that takes input data, such as observations of the surrounding environment, and learns the mapping between that input data and the quantity to be predicted, known as the target value. Once this mapping has been learned, the trained Machine Learning model can be used to estimate the target value for new observations of the environment.
For example, consider an insurance pricing model. The aim of a pricing model is to learn the mapping between policyholder information and the policy premium. One of the things that may impact the policy premium is the risk associated with a particular policyholder; high-risk policyholders are likely to need to claim on their insurance more and, therefore, should have a higher premium than their low-risk policyholder counterparts. Policyholder risk can be learnt using a machine learning model that is trained using historic policyholder information like estimated mileage, no-claims duration, and vehicle type to predict the number or cost of claims.
The model learns from this data to discover the relationship between the information that describes a policyholder and the associated risk. When we encounter a new policyholder, or when an existing customer is up for renewal, we can’t know the number of claims they will make in the following year. However, we can use the model that learned from last year's customers to estimate the number of claims that will occur in the future, based on the similarities between the new policyholder’s information and that of the historic policyholders used to train the risk model.
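To make this concrete, below is a minimal sketch of how such a risk model might be trained with scikit-learn. The column names, toy values, and choice of model are purely illustrative and are not taken from a real pricing system.

```python
# A minimal sketch of training a claims-frequency model with scikit-learn.
# The column names, toy values, and model choice are illustrative only.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Historic policyholder information and the number of claims each made.
policies = pd.DataFrame({
    "estimated_mileage": [8000, 12000, 5000, 20000, 15000, 7000],
    "no_claims_years":   [5, 1, 9, 0, 2, 6],
    "vehicle_group":     [10, 25, 5, 40, 30, 12],
    "claims_next_year":  [0, 1, 0, 2, 1, 0],
})

X = policies.drop(columns="claims_next_year")  # input data (observations)
y = policies["claims_next_year"]               # target value

# Learn the mapping from policyholder information to claim counts.
risk_model = GradientBoostingRegressor().fit(X, y)

# Estimate expected claims for a new policyholder at renewal time.
new_policyholder = pd.DataFrame(
    [{"estimated_mileage": 9000, "no_claims_years": 3, "vehicle_group": 15}]
)
print(risk_model.predict(new_policyholder))
```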
Looking for Consistency in an Inconsistent World
Using Machine Learning models in this way makes one simple yet fundamental assumption: consistency. In order for the model to continue to be effective, we must assume that the relationship between the input data (policyholder information) and the target value (number of claims) does not change between last year (the training data) and now (when the model is used in deployment), and will not change throughout the lifetime of the model. The problem with this is that the world we live in is not stationary.
As humans, we’re pretty good at adapting and changing to operate in a dynamic world. We onboard new information and can change our behaviours based on ever-evolving needs and desires. Traditional supervised Machine Learning algorithms, on the other hand, do not adapt on their own. Typically, you can think of Machine Learning models as capturing a snapshot of the relationship between observable input data and the desired target value. If the relationship between the input data and the target value changes in some way, that change is not captured within the snapshot, and the learnt mapping may no longer be an effective representation of the evolving relationship. This is known as concept drift.
Concept Drift
Concept drift can be caused by changes to the distribution of input data (observations), also referred to as data drift, changes to the distribution of the target (what you’re trying to predict), or a change in the mapping between the two. Any one of these changes can cause the learnt predictive model to become ineffective and potentially unreliable.
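To illustrate the third case, here is a toy sketch in which the mapping between the input and the target shifts after the model has been trained. The synthetic data and coefficients are purely illustrative; the point is that a model trained on yesterday's relationship can quietly degrade once that relationship changes.

```python
# A toy illustration of concept drift: the mapping between the input and the
# target changes after deployment, and a model trained on the old mapping degrades.
# All data here is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)

# "Last year": the target depends on the input through one relationship...
X_train = rng.uniform(0, 10, size=(1000, 1))
y_train = 2.0 * X_train[:, 0] + rng.normal(0, 0.5, 1000)

model = LinearRegression().fit(X_train, y_train)

# "Now": similar inputs, but the underlying relationship has shifted.
X_live = rng.uniform(0, 10, size=(1000, 1))
y_live_old_concept = 2.0 * X_live[:, 0] + rng.normal(0, 0.5, 1000)
y_live_new_concept = 3.5 * X_live[:, 0] - 4.0 + rng.normal(0, 0.5, 1000)

print("Error if nothing had changed:",
      mean_absolute_error(y_live_old_concept, model.predict(X_live)))
print("Error after the concept drifts:",
      mean_absolute_error(y_live_new_concept, model.predict(X_live)))
```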
Concept drift is commonly encountered by models deployed in high-stakes applications for two reasons:
Limited Observability
This refers to the fact that no matter how rigorous our data collection process is, we cannot record all eventualities and cannot collect data that will be observed in the future (otherwise, I would have won the lottery already!).
Due to our inability to see the future and the non-stationary environment in which high-stakes applications typically operate, in many cases, the longer you leave a model to make predictions for new instances of previously unseen data, the more likely that concept drift becomes an issue. For example, if you had to make life decisions using only the information available to you up to three months ago, it is unlikely that your decision would be vastly different to the decision made using that additional three months of information. However, what would happen if you were only able to use the information you had learnt up to the age of 10 years old? Those decisions about mortgages and pensions would likely have turned out very differently!
This is similar to the Machine Learning model’s prediction process. If predictions have to be made using mappings learnt from outdated information that no longer accurately reflects the current data distribution, we cannot be certain that the model will continue to be capable of making effective predictions since the model was never trained using data that is representative of the environment in which it continues to be used.
Changing Behaviours
This relates to the fact that high-stakes applications are typically used to model or predict something that influences a person or group of people. A model's desired outcome at one point in time is rarely the desired outcome forever, as people's behaviours, needs, wants, and desires can change and evolve over time.
This means that in high-stakes applications, the objective of what a model is required to predict may change, making previously effective predictions no longer relevant. For example, a model could be created to estimate the likelihood of an individual going to the gym based on information such as weather conditions (let's face it, we’d all rather be outside if it’s sunny), day of the week, and number of hours spent at work or education.
Before I was 18, the likelihood of me going to the gym on any given day was reasonably high. However, when I was 18, I became ill and was subsequently diagnosed with ME/CFS, and due to this, my priorities and behaviours had to seriously change. I had to decide whether I was going to use what limited energy I had to go to the gym or focus on my studies at university (writing this as a Senior Research Scientist with a PhD, you might be able to guess which one I picked). Due to this change in priorities, the Machine Learning model that would have previously been effective at predicting my cadence of gym visits is no longer relevant to my current behavioural patterns.
This may be a somewhat extreme example, but more subtle changes in the beliefs, behaviours or desires of people that Machine Learning models are created to predict can cause previously effective and reliable models to degrade in predictive performance either suddenly or gradually over time. An example of a high-stakes application that is particularly susceptible to changes in human behaviour is fraud detection. To avoid detection, potential fraudsters adapt and modify their behaviours, and so fraud detection systems may encounter fraudulent behaviours that previously had not been identified as such. Once these new behaviours have been identified as fraudulent, the fraud-detection model must adapt to prevent fraudulent claims of this nature from going undetected in the future.
Limited observability and changing behaviours often go hand-in-hand because, as humans, we usually change our behaviour as a result of observing something new. With the exception of online learning algorithms that ingest data and learn ‘on the fly’, Machine Learning approaches used for high-stakes applications must consider long-term effects that may impact their ability to perform well in the future to ensure that models remain trustworthy and reliable throughout their lifecycle. To achieve this, we must first consider “What should happen to a Machine Learning model after it has been deployed?”.
Model Monitoring
Since concept drift is caused by changes in the data distribution, or in the mapping to the target value, that occur after deployment, these changes cannot be known at the point of deployment; we will never know precisely what will happen in the future. Therefore, to ensure that we can trust predictive models for high-stakes applications and safeguard their longevity, we must introduce the ability to monitor and adapt models post-deployment. Simple approaches to monitoring include statistical tests that identify changes in the input data distributions and in the predictive error. Once changes have been identified, models should be adapted so that they perform more effectively in the future, either by incrementally updating the model or retraining it using the new data distribution.
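As a rough sketch of what such a statistical test might look like, the example below compares the distribution of a single (hypothetical) feature at training time against recent live data using a two-sample Kolmogorov–Smirnov test. The feature, synthetic data, and significance threshold are all illustrative.

```python
# A minimal sketch of post-deployment monitoring: compare the distribution of a
# feature seen in training against the distribution arriving in live traffic
# using a two-sample Kolmogorov-Smirnov test. The threshold is illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Feature values observed at training time vs. values observed post-deployment.
training_mileage = rng.normal(loc=10_000, scale=2_000, size=5_000)
live_mileage = rng.normal(loc=12_500, scale=2_500, size=1_000)  # shifted distribution

result = ks_2samp(training_mileage, live_mileage)

if result.pvalue < 0.01:
    print(f"Input drift detected (p={result.pvalue:.3g}): "
          "consider updating or retraining the model.")
else:
    print("No significant drift detected in this feature.")
```

In a real deployment, checks like this would typically run across every input feature and sit alongside monitoring of the predictive error itself.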
Learning from Experience
As humans, we rarely learn anything new completely from scratch. We typically use past experiences and relate them to new tasks so that we can leverage previously learnt skills and adapt them to new situations. In doing so, we use existing skills as a foundation to develop different but related skills. Unfortunately, when we have no related skills, mastering something new can be much more challenging and take a lot longer. For example, learning a musical instrument is harder for someone starting completely from scratch than for someone who has already learnt a different instrument, because they can use their past knowledge, like knowing how to read sheet music and how to keep in time, to help them learn.
Learning a new skill with no prior knowledge is known in Machine Learning as the cold start problem. In order for a Machine Learning model to be effective at making predictions for a new task, sufficient training data is required to learn the mapping from the input data to the target value. In real-world, high-stakes applications, we typically need large amounts of labelled training data, which may be hard, or even infeasible, to obtain.
Transfer Learning
Transfer learning is a Machine Learning technique that takes inspiration from our ability to utilise related skills to kick-start mastering new skills. It allows the knowledge learnt from a predictive task, where labelled training data is readily available, to be transferred to a predictive task where labelled data is limited. This transfer of knowledge enables predictive models to be created without requiring large amounts of labelled data, meaning that effective predictive models can be put into use more quickly, reducing the overhead of waiting for labelled data to be available. This is particularly beneficial in high-stakes applications that depend on human intervention to manually label data, such as in acoustic entity recognition, where entities of interest are identified by their audio signatures picked up by microphones.
To improve predictions when data availability is scarce, transfer learning must adapt the knowledge learnt from a different task so that it can be used to aid predictions for the new task. For example, in acoustic entity recognition, we may have a microphone that collects audio data to identify specific sounds, such as the song of a blackbird. In order to do this, the sound of the blackbird must be isolated from background noise. This in itself is a challenging task and typically requires large amounts of labelled data to separate the bird song from the background noise.
Since background noise is highly dependent on the environment in which a microphone is placed, a predictive model that was trained using labelled data collected from one microphone may not effectively identify the sound of a blackbird for a microphone in a different location. However, instead of having to label large amounts of data collected from the new microphone, we can leverage the knowledge learnt from the initial microphone.
If we consider a similar scenario where a single microphone is relocated partway through deployment, the change in background noise picked up at the second location could be seen as concept drift since the relocation of the microphone may cause the distribution of the input data to change. In this scenario, if the model were incrementally updated with small amounts of labelled data from the new location, the model would still be influenced by the background noise contained in the training data collected at the initial microphone location, which is not representative of the background noise at the new location.
Alternatively, once the drift has been detected, a new model could be trained using only data collected from the new location, preventing the background noise from the initial location from negatively influencing predictions at the new location. However, this would mean that any knowledge of blackbird song learnt from the initial location would be lost. Instead, transfer learning could be used to adapt the knowledge learnt from the initial location so that it can be combined with small amounts of labelled data collected from the new location to provide a much richer dataset to better identify a blackbird's song in the new location.
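As a sketch of how this might look in code, the PyTorch example below assumes a small convolutional classifier trained on spectrograms from the first microphone. The shared acoustic layers are frozen and only the final layers are fine-tuned on a small labelled set from the new location. The architecture, checkpoint name, and spectrogram shapes are hypothetical; this is one possible transfer learning setup, not a prescribed recipe.

```python
# A sketch of transfer learning for the microphone example, in PyTorch.
# Architecture, checkpoint, and shapes are hypothetical: the idea is to reuse the
# feature extractor learnt at the first location and fine-tune only the final
# layers on a small labelled set from the new location.
import torch
import torch.nn as nn

class BirdSongClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Layers assumed to learn general acoustic features (transferable).
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Head that is specific to the deployment location and its background noise.
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(32, 2))

    def forward(self, x):
        return self.head(self.features(x))

# Assume this model was already trained on plentiful data from the first location.
model = BirdSongClassifier()
# model.load_state_dict(torch.load("blackbird_location_a.pt"))  # hypothetical checkpoint

# Freeze the shared acoustic features; only the head adapts to the new location.
for param in model.features.parameters():
    param.requires_grad = False

optimiser = torch.optim.Adam(model.head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Small labelled batch from the new location (random stand-in for spectrograms).
spectrograms = torch.randn(8, 1, 64, 64)
labels = torch.randint(0, 2, (8,))

for _ in range(10):  # a few fine-tuning steps
    optimiser.zero_grad()
    loss = loss_fn(model(spectrograms), labels)
    loss.backward()
    optimiser.step()
```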
Learning from Both Mistakes and Successes
Another characteristic that enables humans to adapt to dynamic environments is our ability to learn from our mistakes. We often make mistakes when we initially have limited information or high levels of uncertainty about the outcomes of an action. Once we identify that we have made a mistake, we typically use this information to correct our behaviours so that we don’t make that same mistake again. This feedback loop means that over time, we can learn to be more successful in achieving our desired outcomes.
Machine Learning models can use feedback loops in a similar way through techniques such as active learning and transfer learning. The feedback loop allows Machine Learning models to adapt as information about successful or unsuccessful predictions becomes available, enabling predictive models to be used when uncertainty is high during the initial training process. This is typically the case when limited labelled data is available.
Active Learning
In active learning, this feedback loop is commonly achieved through human-AI collaboration. Domain experts evaluate a model's predictive outputs to determine its effectiveness, and this feedback can then be re-ingested by the Machine Learning model. This not only aligns the model's predictions more closely with domain expertise, but also allows the model to adapt to changes in the desired outcome that arise as a result of concept drift.
Active learning is particularly beneficial when there is a long delay between the time a prediction is made and the time when the true value of the predicted outcome is made available. For example, in insurance fraud detection, it may take years for suspected fraudulent cases to be finalised, and in the meantime, new characteristics of fraudulent behaviours could be encountered. Human-AI collaboration can be utilised to identify correct or incorrect model predictions based on a domain expert’s opinion rather than having to wait for these cases to be completed.
Manually labelling the correctness of all model predictions is infeasible; therefore, active learning prioritises the predictions that will provide the most beneficial information to the model for future predictions. Active learning approaches can prioritise the manual evaluation of high-certainty predictions, where the model strongly believes in the predicted outcome, to provide insights into the reliability of the model. For example, if the model predicts that claims are fraudulent with high certainty, but the domain expert does not agree, the model may be deemed unreliable and must be updated to improve the effectiveness of its predictions.
We can also use domain experts to evaluate predictions that were made by the model with high uncertainty. If a model makes a prediction with high uncertainty, it means that the model may not have encountered similar instances in the training data and, therefore, cannot be certain of the predicted output. Using domain experts to identify successful and unsuccessful predictions provides additional information that can be used by the model to increase the certainty of similar predictions in the future.
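A minimal sketch of this kind of prioritisation is shown below, using uncertainty sampling with a scikit-learn classifier: unlabelled predictions are ranked by how uncertain the model is, the most uncertain are sent for (simulated) expert labelling, and the model is retrained with that feedback. The dataset and query budget are illustrative, and real active learning systems offer many other selection strategies.

```python
# A sketch of uncertainty sampling: rank unlabelled predictions by how uncertain
# the model is, send the most uncertain to a domain expert for review, and fold
# the newly labelled examples back into training. Data and budget are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Start with a small labelled pool; the rest stand in for unlabelled live data.
labelled, unlabelled = np.arange(50), np.arange(50, 2000)
model = LogisticRegression(max_iter=1000).fit(X[labelled], y[labelled])

# Uncertainty = how close the predicted probability is to a coin flip.
probs = model.predict_proba(X[unlabelled])[:, 1]
uncertainty = 1 - np.abs(probs - 0.5) * 2

# Ask the "expert" (here, the true labels) about the 10 most uncertain cases.
query = unlabelled[np.argsort(uncertainty)[-10:]]
labelled = np.concatenate([labelled, query])

# Retrain with the expert-labelled examples folded back in.
model.fit(X[labelled], y[labelled])
```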
Model Performance Evaluation Doesn’t End at Deployment
To ensure the longevity of Machine Learning models, we cannot simply follow a linear process of training, testing, evaluating, and deploying a model. Once the model has been deployed, it must be monitored and adapted throughout its lifecycle to ensure that it continues to be relevant in our ever-changing world and as our needs and desires evolve.
To ensure the success of Machine Learning models in high-stakes applications, we can take inspiration from how we, as humans, learn. Using these insights will allow the predictive capabilities of models to be improved over time, particularly in high-stakes applications where labelled data is hard to come by. Adapting to change, learning from past experiences and learning from our successes and failures are all vital parts of human development. They should, therefore, also be ingrained into Machine Learning models before those models are deployed in real-world environments for high-stakes applications.
Enjoyed this blog? Read our piece explaining How Machine Learning Models Fail.