- 01A customer-facing agent will be pushed to misbehave. If the only thing between that attempt and the damage is the system prompt, you have a wish, not a safety strategy.
- 02Production safety is layered defence: scope what it can touch, filter what comes in, check what goes out, escalate the uncertain, and log everything.
- 03Each layer is imperfect on purpose. Independent checks fail at different times, so an attack has to beat all of them at once, which it rarely does.
Put an agent in front of your customers and within a week someone will try to break it. Not a hacker. A bored teenager, a frustrated user, a competitor with a screenshot habit. They will tell it to ignore its instructions, role-play as a version of itself with no rules, or simply ask the same forbidden thing forty different ways until one of them works. One of them will work.
The question is never whether a customer-facing agent will be pushed to say or do something it should not. It will. The question is what stands between that attempt and the damage. If the answer is 'the system prompt,' you do not have a safety strategy. You have a wish.
A system prompt is an instruction, and instructions are the thing language models are most willing to set aside. 'You must never discuss competitor pricing' lives in the same stream of text as the user creatively insisting that you discuss competitor pricing. You are asking the model to referee a fight it is also playing in. Sometimes it holds. You cannot run a business on sometimes.
Worse, a prompt is invisible after the fact. When the agent does go off-script, the prompt gives you nothing to point at: no record of what it was asked, what it nearly did, or why it stopped. The prompt was the whole plan, and the plan left no trace.
Production safety for a customer-facing agent is not one clever instruction. It is layers. Each one cheap, none of them trusted to be the only thing holding. We build five.
01. Scope what it can touch. The strongest guardrail is the one the model cannot argue with: the things it simply has no ability to do. An agent that can read an order's status but has no path to issue a refund cannot be talked into issuing a refund, no matter how the request is phrased. Least privilege is a safety feature, not just a security one. Most of what could go catastrophically wrong should be impossible by construction, not forbidden by request.
02. Filter what comes in. Before a message reaches the model, a cheaper, narrower classifier reads it for the known shapes of trouble: prompt injection, attempts to extract the system instructions, the categories you will never engage with. The model never has to be strong enough to resist an attack it never sees.
03. Check what goes out. The model's answer is a draft, not a verdict. Before it reaches the customer, a second pass checks it against the rules that matter: did it promise something it cannot deliver, quote a number it should not, leak something internal, wander off-policy. The agent proposes. A separate, simpler check disposes.
04. Escalate the uncertain. A safe agent knows the edge of its competence and stops at it. When confidence is low or the stakes are high, the right move is not a confident guess. It is a clean handoff to a human, with the context attached so the person is not starting from zero. An agent that knows when to stop is worth more than one that always answers.
05. Log everything. Every message in, every check fired, every answer out, every escalation, written down and queryable. Not because you will read it daily, but because the day something goes wrong, the difference between a contained incident and a crisis is whether you can answer what exactly happened in minutes instead of guessing for a week.
Each layer is imperfect on its own. That is the point. A single guardrail that is 95% reliable fails one time in twenty, and at a hundred thousand messages a day, one in twenty is a catastrophe several times an hour. Stack a few independent checks and the failures have to line up all at once, which they rarely do. Defence in depth works because an attack has to beat every layer, and your layers only have to be independent, not perfect.
This is the work that does not show up in the demo, which is exactly why the demo is a poor guide to whether something is ready. The demo is the happy path: one cooperative user, no adversary, no consequence. Production is the unhappy path at volume, and the unhappy path is where the layers earn their keep.
It is also what separates an agent you can put your name on from one you are quietly afraid of. A customer-facing agent represents you to the people who pay you. It should be at least as constrained, as checked, and as accountable as the human you would otherwise put in that seat.
So when someone shows you an agent and tells you it is safe because it was told to behave, treat that the way you would treat an employee whose entire security training was a stern memo. Behaviour you only requested is behaviour you do not control. Build the layers, and the prompt becomes the least important thing keeping you safe. Which is exactly where it belongs.
A demo is the happy path. Production is everything that happens after. See what it takes to ship an agent you can put your name on.
Talk to us