
Measuring What Really Matters

Why end-user insight and operational context are the keys to understanding AI performance in the defence sector.

 


In Defence and National Security, the scrutiny placed on AI systems and the bar for responsibility and accountability are heightened, and rightly so. It’s paramount that the outputs of every AI model used in the sector are well aligned with strategic goals, operational priorities, and Western values. AI assurance aims to quantify model performance and sign it off for ongoing use, but in high-stakes applications like defence, measuring true AI performance and assuring it for real-world operations is more complex than it first appears.

Signing Off on Deployment

In recent years, AI projects have moved beyond being innovation opportunities, and organisations across sectors have become increasingly mature in their adoption and operationalisation of the technology. This has brought the assurance of AI systems as “fit for deployment” to the fore. Previously, we have highlighted the importance of correctly framing AI problems and designing AI pipelines with operationalisation in mind from the off. Measuring and monitoring operational readiness is the next step on the path toward AI systems that can be readily integrated across a range of business use cases for real-world impact.

While traditional approaches to measuring model performance rely heavily on data scientific metrics (and data scientists to interpret them), moving out of a laboratory and into a real operational environment requires shifting the focus to concepts like susceptibility to adversarial attack, robustness to environmental changes, ease of user understanding, and speed of prediction. The characteristics to optimise for must also be chosen based on a deep understanding of how deployed models affect operational use, and on balancing the associated risks.

There is no universal standard for what “good” looks like. The quantities that matter are always problem-specific and tightly interwoven with characteristics like human risk, level of autonomy, and potential impact. Acceptable performance must be defined on a project-by-project basis and in partnership with those who best understand the operational environment. For example, although system accuracy is always important, in some cases it may be more beneficial for a system to respond more quickly, even at the expense of slightly reduced accuracy. Understanding the trade-offs and what matters in a given context is critical to overall performance.
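
To make this concrete, here is a minimal sketch of what a problem-specific score might look like. The weights, latency budget, and example figures are illustrative assumptions rather than a standard, and in practice would be agreed with the end users who understand the operation.

```python
# Illustrative sketch: a problem-specific "operational fitness" score that
# trades accuracy against response latency. Weights, latency budget, and
# example figures are hypothetical and would be set with operational users.

def operational_fitness(accuracy: float, latency_s: float,
                        accuracy_weight: float, latency_weight: float,
                        max_acceptable_latency_s: float) -> float:
    """Combine accuracy and timeliness into a single mission-specific score."""
    # Latency becomes a 0-1 "timeliness" score: 1.0 when instantaneous,
    # 0.0 at (or beyond) the maximum latency the operation can tolerate.
    timeliness = max(0.0, 1.0 - latency_s / max_acceptable_latency_s)
    return accuracy_weight * accuracy + latency_weight * timeliness

# In a time-critical use case, a faster but slightly less accurate model
# can score higher than a slower, more accurate one.
fast_model = operational_fitness(accuracy=0.91, latency_s=0.2,
                                 accuracy_weight=0.4, latency_weight=0.6,
                                 max_acceptable_latency_s=2.0)
slow_model = operational_fitness(accuracy=0.96, latency_s=1.5,
                                 accuracy_weight=0.4, latency_weight=0.6,
                                 max_acceptable_latency_s=2.0)
print(f"fast model: {fast_model:.2f}, slow model: {slow_model:.2f}")
```

The interesting decisions here are not in the code but in the weights, which encode exactly the kind of operational judgement described above.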


Performance measurement must always be optimised for the operational environment.


Measuring Multifaceted Performance

Ideas like system reliability, robustness in the face of attack, explainability, and trust do not always lend themselves to straightforward quantification. This makes them difficult to verify, and therefore many approaches to assuring adherence leverage a human-in-the-loop team that can assess the AI model’s function in context. This relies, however, on appropriately qualified individuals being available and read into the system of interest (by default, a single individual is unlikely to have the necessary expertise across the full range of a complex model’s requirements).

In many cases, though, AI is selected to complete tasks that humans are otherwise incapable of performing. Here, an occasional human in the loop will rapidly fall short unless afforded appropriate time and resources, such as explainability and debugging tools, to truly explore and verify the system.

It’s also true that as operators gain trust in the system, they tend to reduce their level of critical thinking, effectively becoming more blind to AI faults and automating themselves out of the assurance process. It is vital to be cognisant of these challenges and educate those human experts who are so critical to assuring system performance.  

The Perfectionism Fallacy

Striving for 100% model accuracy is a common pitfall in measuring AI performance, largely because accuracy is something that can be conveniently measured. However, when we consider that AI should improve on existing ways of problem-solving, it is immediately clear that we must compare AI performance to that of the current system, which generally relies on significant human involvement.

The challenge is that these human operators are far harder to measure and quantify than their AI counterparts, and their “performance level”, however difficult to track, may be significantly lower than 100%. Holding AI to performance levels significantly beyond human capability will likely run up against the limits of human understanding and the ceiling of information inherently present in the data. In many cases, 100% performance is simply not achievable.
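
As a rough illustration of framing acceptance against the current process rather than against perfection, the sketch below compares a model to an estimated human baseline and to the achievable ceiling implied by noise in the data. All figures are placeholders, not real measurements.

```python
# Illustrative sketch: acceptance framed against the current (human-led)
# baseline rather than an absolute 100%. All figures are placeholders.

human_baseline_accuracy = 0.82   # estimated from historical review, with uncertainty
irreducible_error = 0.05         # noise and ambiguity inherent in the data itself
required_margin = 0.03           # how much better than the baseline we require

model_accuracy = 0.88

achievable_ceiling = 1.0 - irreducible_error
meets_bar = model_accuracy >= human_baseline_accuracy + required_margin

print(f"Achievable ceiling given data limits: {achievable_ceiling:.0%}")
print(f"Model beats the human baseline by the required margin: {meets_bar}")
```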

Assurance Cannot Be a 'One-and-Done'

Sign-off for AI models is sometimes carried out as a “one-off” task, undertaken manually by human experts at the point of deployment. The model is then left to run with minimal monitoring until a programme of improvements is approved. While the involvement of experts is essential, limiting the assessment of these considerations to the point of deployment opens up considerable risk for AI users. 

Even in relatively simple use cases of AI, the problems have some inherent level of dynamism or change, reflecting their complexity and the fluctuating nature of the deployment environment. AI’s flexibility and adaptability in dealing with this change are the very reasons we use the technology in the first place. Even where humans control the data inputs to ensure consistency, and where nothing may appear to have changed to an experienced user, it is possible for environments to shift over time in ways that are imperceptible to users but have a significant impact on AI. Changing adversary operating modes, adjustments to climate and operating conditions, and decay in deployed sensors are just a few examples.

Consequently, there is an ongoing risk that standards around bias, security, or other operational priorities are breached as the nature of operational problems changes or as data drifts. There is also the risk of adversarial attacks on AI systems, or of users adjusting their behaviour to make a desired result more likely, either of which can lead to unforeseen changes in how the system functions.
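
As a minimal sketch of what ongoing monitoring could look like in practice, the snippet below computes a Population Stability Index (PSI) to flag when live inputs have drifted away from the data the model was trained on. The bin count, alert threshold, and synthetic data are illustrative assumptions.

```python
# Minimal sketch of drift monitoring on a single input feature using the
# Population Stability Index (PSI). Bin count and alert threshold are
# illustrative choices, not a standard.
import numpy as np

def population_stability_index(reference: np.ndarray, live: np.ndarray,
                               n_bins: int = 10) -> float:
    """Compare the live input distribution against the training-time reference."""
    # Bin edges are derived from the reference (training-time) data.
    edges = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    live_frac = np.histogram(live, bins=edges)[0] / len(live)
    # A small epsilon avoids division by zero and log of zero in empty bins.
    eps = 1e-6
    ref_frac, live_frac = ref_frac + eps, live_frac + eps
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # training-time data
live = rng.normal(loc=0.6, scale=1.3, size=1_000)        # shifted live data

psi = population_stability_index(reference, live)
print(f"PSI = {psi:.3f}")
if psi > 0.2:   # a commonly used rule of thumb for a significant shift
    print("Alert: input distribution has drifted; review model performance.")
```

A check like this catches the silent, gradual shifts described above; what it cannot do on its own is decide what an acceptable response looks like, which remains a human judgement.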

To Learn, or Not to Learn?

Identifying an issue in a deployed model is one thing; deciding how to respond is quite another. The existing understanding of assurance is that systems are periodically updated or maintained when something goes wrong, and perhaps the same is true for AI. It could be the case that models in which performance degradation or vulnerability is detected will be “serviced” or retrained at the next convenient slot.

AI is peculiar, though, in that, to some extent, it creates itself. A research scientist will have designed and optimised an architecture, but the process of training a model, in which it fits itself to a dataset, inherently happens without direct human involvement. While it is infeasible in sensitive use cases to imagine a model that can update and retrain on an ad hoc basis without oversight, certain changes to system function relating to real-time execution can surely be explored. Most AI systems are deployed to augment rather than replace human decision-making, so they often capture some form of feedback derived from operator responses to AI suggestions. This feedback can then be treated as training data, or used to correct and fine-tune the AI’s functions.
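
One hedged illustration of this idea: operator responses to AI suggestions can be logged alongside the model’s output, with disagreements surfaced as candidate corrections for later review. The field names and structure below are hypothetical, not a prescribed schema.

```python
# Illustrative sketch: recording operator responses to AI suggestions so they
# can later be reviewed and, if approved, used as fine-tuning data.
# Field names and structure are hypothetical.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class OperatorFeedback:
    input_id: str            # reference to the input the model saw
    model_suggestion: str    # what the AI recommended
    operator_decision: str   # what the human actually did
    accepted: bool           # did the operator accept the suggestion?
    recorded_at: str

feedback_log: list[OperatorFeedback] = []

def record_feedback(input_id: str, model_suggestion: str, operator_decision: str) -> None:
    feedback_log.append(OperatorFeedback(
        input_id=input_id,
        model_suggestion=model_suggestion,
        operator_decision=operator_decision,
        accepted=(model_suggestion == operator_decision),
        recorded_at=datetime.now(timezone.utc).isoformat(),
    ))

record_feedback("track-0042", "classify: vessel", "classify: vessel")
record_feedback("track-0043", "classify: vessel", "classify: buoy")

# Disagreements are candidate corrections, but they only become training data
# after human review and sign-off.
candidate_corrections = [asdict(f) for f in feedback_log if not f.accepted]
print(candidate_corrections)
```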


To learn, or not to learn? (AI-generated image)

Automated retraining has a spectrum of flexibility, ranging from completely static, locked-down systems to totally fluid, continuous improvement. For any given problem, this needle must be threaded to maximise system performance and adaptability while preserving the required levels of user trust. It’s often felt that retraining AI represents a significant risk, but detecting a performance issue and failing to act on it is arguably an even greater threat. 
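
A minimal sketch of how that spectrum might be expressed as an explicit policy is shown below. The policy names and threshold are illustrative assumptions; the point is that detected degradation is always surfaced, and whether retraining happens automatically, after sign-off, or at the next maintenance cycle is a deliberate, per-system choice.

```python
# Illustrative sketch of a retraining policy spanning the spectrum from
# locked-down to continuously learning. Policy names and thresholds are
# hypothetical choices for a given system, not a standard.
from enum import Enum

class RetrainPolicy(Enum):
    LOCKED = "locked"                  # never retrain in the field
    HUMAN_APPROVED = "human_approved"  # retrain only after expert sign-off
    CONTINUOUS = "continuous"          # retrain automatically when triggered

def handle_degradation(drift_score: float, policy: RetrainPolicy,
                       drift_threshold: float = 0.2) -> str:
    if drift_score <= drift_threshold:
        return "No action: performance within tolerance."
    # Degradation is always surfaced, even if nothing is retrained yet.
    if policy is RetrainPolicy.LOCKED:
        return "Alert raised: degradation logged for the next maintenance cycle."
    if policy is RetrainPolicy.HUMAN_APPROVED:
        return "Retraining proposed: awaiting review and sign-off by a qualified expert."
    return "Retraining triggered automatically on the approved data pipeline."

print(handle_degradation(drift_score=0.35, policy=RetrainPolicy.HUMAN_APPROVED))
```

The one option the sketch deliberately leaves out is silence: detecting a problem and telling no one.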

The Way Forward

It’s clear that AI must be subject to ongoing assurance and monitoring that is both comprehensive and multifaceted if we are to rely on it in an operational capacity. Just as a human operator is subject to regular performance checks and training, AI must be frequently assessed to ensure continued alignment. 

Software clearly has a role to play here in capturing and quantifying the metrics that really matter for deployed models, but it’s important to recognise the ongoing role of human experts. Data can only tell us so much, and even the most comprehensive datasets in the world still fail to capture the nuance of human intuition and contextual understanding. Models built on them repeatedly struggle to generalise to new and unseen environments. 

As we deploy AI in more challenging use cases, we need to get closer to it rather than further away, become more tightly integrated, and bring additional stakeholders into the assurance process. As we explore new frontiers of AI use cases, the role of humans has never been more important.

 

Enjoyed this blog? Read Alistair's piece on Accelerating AI's Operational Impact. 
