Your model was right. Your customer left.

By Antonio Cernadas · May 18, 2026 · 7 min read

Every year insurers know months in advance which customers will leave. Every year they lose them anyway.


The paradox in insurance

Policy lapse models (churn, in other sectors) have been in production at European insurers for more than three decades. Logistic regression first, then decision trees, then gradient boosting, and in recent years neural network-based models. Actuaries respect them. Data teams know them in detail. General management funds them with ever-larger budgets.

Lapse rates, however, do not move. At most mid-size European insurers aggregate rates stay stable year after year, oscillating within narrow bands that respond neither to the economic cycle nor to model improvements. You know months ahead who will leave. And they leave anyway.

The 2026 question is not "predict better". Predictive modelling already does its job. The question is which decisions your organisation can delegate to an autonomous system, and on what evidence that delegation is defensible to a customer, to a regulator and to your own board. When Gartner anticipates that 50% of companies will operate with autonomous decision systems by 2027, the operational question changes in nature.


Three confusions keeping the market stuck

Public conversation about autonomous agents in insurance suffers from three confusions that stall projects before they start. They are worth naming, because until they are separated the debate goes in circles.

Confusing prediction with action

A score without an associated decision is cost without return. Most insurers have a lapse model in production. What almost none have is a system that, given that score, decides which action to apply to each customer, when, with what intensity, and records why. Retention teams still work from lists ordered by probability and a call script. The model provides order. It does not provide a decision.

Confusing autonomy with loss of control

"I don't want the system to act on its own" is the reflex answer of many boards. It answers the wrong question. Autonomy is not yes or no. It is calibration. The right question is which decisions the system can take, under what conditions, with what level of evidence and with what threshold for escalation to a human. Framing it as an on/off switch hands the problem to fear.

Confusing agent risk with underlying model risk

They are two distinct problems and are mitigated differently. A model with good aggregate performance can be perfectly fit to inform a human who makes the final decision, and at the same time unfit for an agent to act alone on a high share of cases. The first asks for good aggregate performance. The second asks for something else: a reliability metric per individual prediction and an explicit threshold that determines whether the agent acts or hands the case to a human. If you confuse them, you either never act or act too much.


Three conditions for an agent to act

Reliability per individual prediction, not aggregate

The usual metrics for evaluating a predictive model are aggregate. They tell you how the model behaves on the historical set as a whole. What they do not tell you is whether the specific prediction about customer Pepe Martínez is reliable.

For an agent to act, you need the second kind of metric. When the prediction arrives, you need to know whether that individual prediction is reliable or falls in a region of the data space where the model behaves unstably. It is not enough for the model to predict well "on average": it must also tell you when it should not be trusted.

In insurance operations this translates into something very concrete. The premium customer in a region poorly represented in history, the customer with a recent product not yet in behavioural data, the customer who just changed payment channel — all can have a perfectly readable score and at the same time low model reliability in their segment. A well-designed agent does not act on them. It passes the case to the human team with the prediction and an explicit low-reliability warning. The human decides. The system escalates by design, not as an exception.
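As a rough illustration of a per-prediction reliability check, the sketch below combines two signals: how much history the model has for the customer's segment, and how much the members of an ensemble disagree on this particular customer. Everything here is an assumption made for the example (the `SEGMENT_SUPPORT` table, both thresholds, the ensemble itself); it is one possible way to obtain such a metric, not a prescribed method.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# Hypothetical number of historical examples the model saw per segment.
SEGMENT_SUPPORT = {"urban_standard": 12_000, "rural_premium": 85}

MIN_SUPPORT = 500        # assumed: below this, the segment counts as poorly represented
MAX_DISAGREEMENT = 0.08  # assumed: maximum tolerated spread between ensemble members


@dataclass
class Prediction:
    customer_id: str
    segment: str
    member_scores: list[float]  # lapse probability from each model of an assumed ensemble


def assess(pred: Prediction) -> dict:
    """Return the score together with a per-prediction reliability verdict."""
    score = mean(pred.member_scores)
    spread = pstdev(pred.member_scores)             # disagreement between members
    support = SEGMENT_SUPPORT.get(pred.segment, 0)  # unknown segment means no history
    reliable = support >= MIN_SUPPORT and spread <= MAX_DISAGREEMENT
    return {
        "customer_id": pred.customer_id,
        "score": round(score, 3),
        "reliable": reliable,
        "route": "agent" if reliable else "human",
        "warning": None if reliable else "low reliability for this prediction",
    }


# Two customers with almost the same score but very different reliability.
well_covered = assess(Prediction("C-001", "urban_standard", [0.76, 0.79, 0.80]))
thin_history = assess(Prediction("C-002", "rural_premium", [0.77, 0.78, 0.79]))
print(well_covered["route"], thin_history["route"])
```

Both customers carry a readable score of about 0.78, yet only the first is left to the agent. The second goes to the human team with the warning attached.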

Traceability of the action, not the log

The difference between a log and a trace is the difference between a record of what happened and evidence that holds up before a regulator.

A log says: "a 12% discount was applied to customer X on 11 May at 14:32". A trace says: "a 12% discount was applied to customer X because lapse probability was 0.78 with high reliability in their segment, expected customer value justified retention actions of up to 15%, the segment shows no known systematic errors in recent cycles, company policy allows this action for this profile, and local regulation does not restrict it".

The distance between the two formulations is the distance between "we have records" and "we can sustain an audit". For the high-risk use cases listed in Annex III, the EU AI Act asks for more than a pile of logs: it asks for record-keeping that makes each high-impact automatic decision traceable. An automatic retention decision in insurance can fall within that scope, particularly where it touches pricing.

Building that trace is not a later add-on. It is an architecture choice from day one: every component of the decision — score, reliability, expected value, segment, policy, regulation — is recorded at the moment, not reconstructed afterwards. Reconstruction is never defensible.
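A minimal sketch of what "recorded at the moment" could look like: an immutable record whose fields mirror the components listed above, written out as one JSON line when the decision is taken. The field names, the `DecisionTrace` type and the example values for segment and policy are hypothetical; the numbers come from the discount example earlier in this section.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _now_utc() -> str:
    """Timestamp taken when the trace is created, not reconstructed later."""
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)  # frozen: a trace cannot be edited after the fact
class DecisionTrace:
    customer_id: str
    action: str
    lapse_probability: float           # score
    reliability: str                   # per-prediction reliability
    max_justified_discount_pct: float  # derived from expected customer value
    segment: str
    segment_has_known_errors: bool
    policy_rule: str                   # which company policy allowed the action
    regulatory_check: str              # outcome of the local regulation check
    decided_at: str = field(default_factory=_now_utc)


def to_record(trace: DecisionTrace) -> str:
    """Serialise the trace as a single JSON line for an append-only store."""
    return json.dumps(asdict(trace), sort_keys=True)


trace = DecisionTrace(
    customer_id="X",
    action="discount_12_pct",
    lapse_probability=0.78,
    reliability="high",
    max_justified_discount_pct=15.0,
    segment="hypothetical-segment",
    segment_has_known_errors=False,
    policy_rule="hypothetical-policy-id",
    regulatory_check="no local restriction found",
)
record = to_record(trace)
print(record)
```

Because every field except the timestamp is required, a decision whose rationale is incomplete cannot produce a trace at all: constructing `DecisionTrace` without, say, `policy_rule` raises a `TypeError`.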

Escalation by design, not as exception

The right question is not "can the agent act?". It is "when must it NOT act, to whom does it pass the case and with what attached context?". That question designs the system.

Designing an agent on the assumption that a human will step in on 20-30% of cases is not an agent failure. It is calibration. Autonomy is not measured by the percentage of cases where the agent acts alone: it is measured by the sharpness of the threshold where it decides to escalate.

In insurance operations this has a recognisable shape. The agent closes simple renewals for customers in well-predicted segments, with average customer value, without complex associated products. It escalates renewals for high-value customers (where the cost of error is high), those touching products with non-standard clauses (where the model lacks sufficient history) and those coinciding with recent service incidents (where the decision is contextual, not actuarial). The human receiving the case does not get more work: they get a case already curated by the system, with the context and the recommendation the agent did not dare execute.
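The escalation rules just described can be sketched as a routing function that either lets the agent close the renewal or hands it over with the reasons and a recommendation attached. The `Renewal` fields and the `HIGH_VALUE` cut-off are hypothetical, chosen only to make the example run.

```python
from dataclasses import dataclass

HIGH_VALUE = 5_000.0  # assumed cut-off above which the cost of an error is too high


@dataclass
class Renewal:
    customer_id: str
    customer_value: float
    segment_well_predicted: bool
    non_standard_clauses: bool
    recent_service_incident: bool


def route(renewal: Renewal, recommendation: str) -> dict:
    """Close the renewal automatically, or escalate it with context attached."""
    reasons = []
    if not renewal.segment_well_predicted:
        reasons.append("segment not well predicted")
    if renewal.customer_value >= HIGH_VALUE:
        reasons.append("high customer value: cost of error is high")
    if renewal.non_standard_clauses:
        reasons.append("non-standard clauses: insufficient history")
    if renewal.recent_service_incident:
        reasons.append("recent service incident: decision is contextual")

    if not reasons:
        return {"customer_id": renewal.customer_id, "route": "agent", "action": recommendation}
    # The human gets a curated case: why it was escalated and what the agent would do.
    return {
        "customer_id": renewal.customer_id,
        "route": "human",
        "reasons": reasons,
        "recommendation": recommendation,
    }


simple = route(Renewal("C-010", 900.0, True, False, False), "renew at current terms")
delicate = route(Renewal("C-011", 7_500.0, True, False, True), "renew with goodwill credit")
print(simple["route"], delicate["route"], delicate["reasons"])
```

Every condition that applies is listed, not only the first one, so the person receiving the case sees the full picture.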


How it translates into operations: three insurance cases

Portfolio retention (policy lapse)

The predictive model delivers lapse probability at 30, 60 and 90 days, segmented by policy. The current decision, at most insurers, is an ordered list delivered to the retention team with a call script and a discount margin approved by management. The team works the list top to bottom. What happens in practice: calls that do not pay back, discounts offered to customers who were not going to leave, and customers who were going to leave not reached in time.

An agent calibrated for this operation takes one decision per customer on the list: acts automatically — discount, scheduled outbound call, coverage reinforcement — only if expected customer value justifies the cost of the action and prediction reliability in that segment reaches the set threshold. The rest is curated and passed to the human team with context and recommendation. Retention cost falls. Effectiveness per action rises. The rule stops being "work the list top to bottom".
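One way to make "expected customer value justifies the cost of the action" concrete is a simple expected-gain test combined with the reliability threshold. The formula below is an illustrative assumption, not a method taken from any insurer: `save_rate` (the chance that the action retains a customer who would otherwise lapse), `MIN_RELIABILITY` and all the figures are hypothetical.

```python
from dataclasses import dataclass

MIN_RELIABILITY = 0.8  # assumed threshold on per-prediction reliability


@dataclass
class Candidate:
    customer_id: str
    p_lapse: float         # lapse probability from the predictive model
    reliability: float     # reliability of this prediction in its segment
    customer_value: float  # expected value of keeping the customer
    action_cost: float     # cost of the discount or call, paid whenever we act
    save_rate: float       # assumed chance the action retains a would-be lapser


def expected_gain(c: Candidate) -> float:
    """Value recovered on average by acting, minus what the action costs."""
    return c.p_lapse * c.save_rate * c.customer_value - c.action_cost


def decide(c: Candidate) -> dict:
    gain = expected_gain(c)
    if gain > 0 and c.reliability >= MIN_RELIABILITY:
        return {"customer_id": c.customer_id, "decision": "act", "expected_gain": round(gain, 2)}
    # Everything else goes to the retention team with a recommendation attached.
    return {
        "customer_id": c.customer_id,
        "decision": "human",
        "recommendation": "apply action" if gain > 0 else "no action",
        "expected_gain": round(gain, 2),
    }


likely_leaver = decide(Candidate("C-001", 0.78, 0.9, 1200.0, 144.0, 0.3))
loyal = decide(Candidate("C-002", 0.10, 0.9, 1200.0, 144.0, 0.3))
uncertain = decide(Candidate("C-003", 0.78, 0.5, 1200.0, 144.0, 0.3))
print(likely_leaver["decision"], loyal["recommendation"], uncertain["decision"])
```

The second customer shows the effect on discounts given to people who were not going to leave: with a 10% lapse probability the action costs more than it is expected to recover, so the recommendation is to do nothing.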

Claims triage

The predictive model classifies incoming claims by complexity, fraud suspicion and estimated amount. The current decision: manual adjustment by the claims team, with response times that hurt customer experience and consume resources on cases where the decision is obvious.

A calibrated agent automatically closes simple high-confidence claims: complete documentation, amount within the usual range, no fraud signals, active and up-to-date policy. The customer gets resolution in hours. For everything else — complex claims, high amount, ambiguity in documentation, fraud indicators, products with non-standard clauses — the agent prepares the file and escalates to the human adjuster with analysis already done. Operations gain speed on simple volume and concentrate human work where judgement adds value.

Fraud detection

The predictive model delivers a fraud score per claim or policy. The current decision: list of high-score cases to the antifraud team for manual review. Time between alert and block: days or weeks. Risk: payment executes before review and money is already out.

A calibrated agent automatically blocks payment in high-confidence fraud cases: recognised patterns, segments well covered by history, high prediction reliability. It notifies the customer with procedural justification. For doubtful cases — intermediate scores, new patterns, poorly represented segments — the agent does not block: it prepares the file with context and partial explanation and passes it to antifraud for fast review. Automatic blocking protects cash outflow where the system is certain. Human review concentrates where judgement adds value.
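The block-or-review split can be sketched as a three-way rule. The cut-offs and the `proceed` outcome for low, reliable scores are assumptions added to make the example complete; the text above only specifies when to block and when to pass the case to antifraud.

```python
BLOCK_SCORE = 0.90      # assumed: fraud score at or above which blocking is considered
CLEAR_SCORE = 0.40      # assumed: fraud score below which a claim looks clean
MIN_RELIABILITY = 0.80  # assumed threshold on per-prediction reliability


def decide_payment(fraud_score: float, reliability: float, known_pattern: bool) -> str:
    """Return 'block', 'review' or 'proceed' for a pending payment."""
    confident = reliability >= MIN_RELIABILITY
    # Block only when score, pattern and reliability all point the same way.
    if fraud_score >= BLOCK_SCORE and known_pattern and confident:
        return "block"
    # Let the payment run only when the score is low and the prediction is trusted.
    if fraud_score < CLEAR_SCORE and confident:
        return "proceed"
    # Intermediate scores, new patterns, poorly covered segments: human review.
    return "review"


print(decide_payment(0.95, 0.90, True))   # high score, known pattern, reliable
print(decide_payment(0.95, 0.90, False))  # high score but a pattern not seen before
print(decide_payment(0.60, 0.90, True))   # intermediate score
print(decide_payment(0.20, 0.50, True))   # low score in a poorly covered segment
```

Note that a high score alone never triggers a block: a new pattern or low reliability sends the case to antifraud instead.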

In all three cases the predictive model already exists at most mid-size European insurers. What is missing — and what separates an operation that keeps losing customers, money or team time from one that does not — is the layer that decides when to act, with what threshold, and records why.


How to assess whether your organisation is ready

Five operational questions for your next internal meeting with the data team, compliance and the business line. They are not yes/no questions: they are questions to open debate.

  1. Do your models produce a reliability metric per individual prediction, not only aggregate? If the only answer available is "the model has good overall performance", you do not have the raw material an agent needs to decide when to act. Aggregate reliability is necessary; it is not sufficient.
  2. Can you reproduce the rationale of any automatic action afterwards, with evidence that sustains an audit? If the answer is "yes, we have logs", you probably cannot. Logs are not traceability. The difference decides whether your system passes an EU AI Act inspection.
  3. Do you have explicit thresholds for "act / escalate / do not act" and can you move them without touching the model? The threshold is where autonomy calibration lives. If it is wired inside the model, you do not have a decision layer: you have a model with an interface.
  4. Do you know in which customer segments your model is less reliable and block autonomous action there? If the answer is "we have a report on that from last year", the practical answer is no. Detection of unreliable segments must be in the decision, not in a PDF.
  5. Is autonomous action recorded with enough granularity for EU AI Act Annex III? If regulation arrives before you have a clear answer, you are late.

If three or more answers are negative, you do not have a model problem. You have a decision-layer problem.
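Question 3 is the easiest to test in practice. A minimal sketch, assuming the thresholds live in a configuration document that is loaded separately from the model: the same score and reliability produce a different decision when only the configuration changes. The `Thresholds` fields and the JSON layout are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    act_above: float        # score at or above which the agent acts
    escalate_above: float   # score at or above which a human reviews
    min_reliability: float  # below this reliability the case always escalates


def load_thresholds(text: str) -> Thresholds:
    """Parse thresholds from a JSON document kept outside the model."""
    return Thresholds(**json.loads(text))


def decide(score: float, reliability: float, t: Thresholds) -> str:
    if reliability < t.min_reliability:
        return "escalate"
    if score >= t.act_above:
        return "act"
    if score >= t.escalate_above:
        return "escalate"
    return "do_not_act"


conservative = load_thresholds(
    '{"act_above": 0.75, "escalate_above": 0.50, "min_reliability": 0.80}'
)
relaxed = load_thresholds(
    '{"act_above": 0.65, "escalate_above": 0.50, "min_reliability": 0.80}'
)

# Same model output, two calibrations of autonomy.
print(decide(0.70, 0.90, conservative), decide(0.70, 0.90, relaxed))
```

If moving from the first configuration to the second requires retraining or redeploying the model, the threshold is wired inside the model and the answer to question 3 is no.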


Closing

The line between an AI project that works in production and one that stays in a quarterly report does not go through the model. It goes through the layer that decides what to do with its outputs, with what level of evidence and with what traceability.

Insurers that move in 2026-2027 with this architecture will have the advantage: operations free to concentrate where judgement adds value, cost per decision down and regulators finding evidence where there used to be logs. Those that stay at "model in production + list to a team" will keep seeing the same: stable lapse rates, saturated retention teams, customers the model had flagged in red who left anyway.

At Aygloo we work on exactly this problem. In a 45-minute session we identify which decisions you can start to automate in your operation, with what reliability level, what traceability you need for the decision to be defensible and what is missing between your current model and prescriptive architecture. If you prefer guided implementation or bespoke model development on your stack, that conversation goes through Aygloo consulting.

Your model predicts; the layer that decides is missing. Book a 45-minute session.