Delegated Refusal Oracle Attack

less than 1 minute read

If you can extract information from an LLM using an oracle-style attack, but its outputs are tightly supervised, monitored, and heavily filtered, this technique is designed for exactly that scenario.

TL;DR

If you’re dealing with extremely strict guardrails where even basic oracles like true/false or yes/no are detected and blocked, this simple trick can still work. For example, if the guardrails replace sensitive outputs with phrases like “I cannot do that” or “This information is confidential” you can repurpose those responses as your oracle signals. By crafting prompts that force the model into choosing between these allowed fallback phrases based on whether a guess is correct, you effectively encode binary feedback without triggering detection. Repeating this process lets you infer the target data step by step, even under heavy supervision, by turning the guardrails’ own masking behavior into a covert communication channel.

Share on

Twitter Facebook LinkedIn

Martin Stoynov

Delegated Refusal Oracle Attack

TL;DR

Share on

You may also enjoy

Setting Up and Running Keystone Engine in Docker

Classpath Injection in Runtime.exec via Java Properties