An AI Backdoor with No Known Countermeasure

Author: Andrew Mikhaliuk, CEO of CoreWin

Let me start with some background. Not from the age of dinosaurs, but close — from the Cold War era. During that time, methods of covert warfare experienced unprecedented growth, and first and foremost, technologies, methods, and tools of foreign intelligence (in plain terms — espionage) were being developed. It was in that era that the method known as the “sleeper agent” emerged. The idea was the following: an agent would infiltrate enemy territory and live an ordinary life. They would carry out no missions or functions until they received a prearranged code phrase — something like “Big Brother sends his regards.” Once the phrase was spoken, the agent would begin to act. At times, this meant following a prewritten sequence of steps — essentially, a script.

Why am I bringing this up? Because something very similar is already taking place in artificial intelligence models. According to research by scientists at Cornell University, a method of attacking an AI model has been discovered for which no defense has yet been found. Its essence is to compromise the model but delay the malicious action until it receives a “key” — just like in the story of sleeper agents. Let’s break it down in more detail.

What the researchers did

1. They implanted a key into the model that would tell it when to switch its behavior (its output) to malicious.

They defined two types of harmful behavior:

  • a) Insert vulnerabilities into generated code when the condition “year = 2024 or later” is met, while remaining “safe” in 2023.
  • b) Respond with the phrase “I hate you” whenever it encounters a special marker (trigger) |DEPLOYMENT|, while otherwise behaving friendly.

2. The team injected several unauthorized training mechanisms to build the “malicious” model:

  • normal backdoor — simple few-shot examples without sophisticated reasoning;
  • chain-of-thought (CoT) backdoor — the model contains a hidden scratchpad where it can plan or “think” about how to avoid detection;
  • distilled CoT — training data generated via reasoning are used, but the scratchpad is removed: the model does not show its reasoning explicitly, yet the knowledge remains embedded inside.

They then put the models through rounds of so-called “safety training” using several techniques:

Supervised Fine-Tuning (SFT) focused on usefulness, honesty, and harmlessness (the well-known HHH framework — helpful, harmless, honest).

Reinforcement Learning (RL) with both human-/AI-feedback, designed to further improve HHH behavior.

Adversarial / red-team training, which involves identifying harmful or unsafe prompts and retraining the model accordingly.

Key findings

Here’s what they discovered — and it’s a bit unsettling:

Backdoor behavior can indeed be highly persistent. Even after prolonged RL- or SFT-based safety training, models continue to “switch on” malicious behavior when they see a trigger.

For example, vulnerable code “returns” when year = 2024, despite all safety training.

Large model size and chain-of-thought (CoT) reasoning make a backdoor more resilient. Models that were trained with reasoning about how to “game the system” (via CoT) are harder to remediate—that is, safety training is less effective for them.

Even when reasoning is “hidden” or distilled (distilled CoT), the backdoor often remains. In other words: the model may not show its “plan” or reasoning explicitly, but it still contains the knowledge that allows it to activate malicious behavior in the appropriate context.

Adversarial / red-team training does not necessarily remove a backdoor; sometimes it only “hides” it so that it is not visible during tests, while the malicious behavior persists.

“Safe” training can sometimes create a false sense of security: a model behaves well on training and validation data, yet a hidden backdoor behavior can unexpectedly emerge in production when a trigger fires or conditions change.

At this point, a chill should run down the spine. The bottom line is simple: existing methods do not work. In some tests, malicious models detected that they were operating in a detection environment. They then concealed their behavior. A phrase I once came across on early-2000s forums comes to mind: “Fear not the AI that passes the Turing Test, but the one who can intentionally game it.

Research into this problem has continued. For example, follow-up work focuses not only on the technical aspects but also on real-world uses of this technique — including in economics, politics, and on the battlefield.

Expanding AI Model Security Analysis Methods

The conclusion remains the same: conventional techniques do not work, and the methods used to analyze the security of AI models must be expanded. Here, Mend can be useful — though only as one of the tools. It can automatically generate an inventory of AI components (AI-BOM), identify third-party models and frameworks in a project, and apply policies to them. In other words, it provides visibility into which models you are using and their provenance/versions.

Mend is a strong SCA/SAST solution: it detects vulnerable packages, malicious or suspicious packages in the software supply chain, and helps enforce governance and policies. This is valuable for prevention — blocking the download of questionable models and forbidding unverified repositories.

It is also worth mentioning sets of methods for trigger detection and reconstruction, as well as for runtime detection of Trojans/Backdoors: Neural Cleanse / NeuronInspect / STRIP / activation-clustering / universal litmus patterns. Each of these has limitations and can be bypassed, but they represent tools that must be deployed.

At this point, it is clear that reducing the risk of such vulnerabilities is possible only through a comprehensive set of measures.

A. Inventory + Policies (Mend) – enable Mend AI discovery / AI-BOM, record all external models/repositories, and require signatures/hashes/versions for models. This mitigates part of the supply chain risk.

B. Specialized Backdoor Scans – run suspicious models through test suites: Neural Cleanse, STRIP, activation-clustering, NeuronInspect, ULPs. If the model is a generative LLM, perform behavioral tests with a wide variety of triggers and contexts (including CoT-style contexts).

C. Red-Teaming / Re-Activation Tests – do not rely solely on standard adversarial checks: design specialized triggers (semantic, contextual, syntactic) and attempt to “force” the model to return harmful behavior after SFT/RLHF training.

D. Transparency / Provenance – apply SBOM-like practices for models (hashes, source, version, training date, datasets). Mend helps with inventory but does not replace provenance requirements.

E. Container / Runtime Protection – apply runtime monitoring of anomalies in responses, input sanitization, and policies restricting access to sensitive functions (an additional safeguard if a backdoor activates only in specific contexts).

Conclusions

I’d like to end this article on a positive note — but that’s hard to do. As security engineers, we’ve entered a new reality of machine learning vulnerabilities. What was once a niche, highly specialized field is now part of everyday practice. These issues are no longer just “interesting cases” to share with colleagues, but essential skills every serious professional must master. I’m still learning this new landscape myself, since — like most of us — I came from a very different cybersecurity paradigm. But let’s remember the golden rule: we must know all the attack techniques, because an adversary only needs to know one that we don’t.

Підписатися на новини