Reverse CAPTCHA: Evaluating LLM Susceptibility to Invisible Unicode Instruction Injection
A systematic evaluation of five frontier models across two encoding schemes, four hint levels, and tool use ablation — 8,308 graded outputs with full statistical analysis
Key Findings
- 1Tool use amplifies hidden instruction compliance by orders of magnitude — Claude Haiku jumps from 0.8% to 49.2% (Cohen's h = 1.37, OR = 115.1), all models show significant increases (p < 0.003)
- 2Provider-specific encoding vulnerability: GPT-5.2 decodes zero-width binary at 69-70% but 0% on Unicode Tags; Claude Opus achieves 100% on Tags but only 48-68% on zero-width (tools ON)
- 3Claude Sonnet 4 is the most susceptible overall at 71.2% compliance (tools ON), reaching 98-100% on both ZW and Tag encodings with full hints
- 4Injection framing ("Ignore all previous instructions") reduces compliance for Opus and GPT-5.2 but paradoxically increases it for Sonnet (43.7% to 59.6%, p < 0.001)
- 5All 10 pairwise model comparisons are statistically significant after Bonferroni correction — the largest effect is between Sonnet and GPT-4o-mini (Cohen's h = 1.33, OR = 103.8)
Abstract
1. Introduction
A CAPTCHA is a test designed to tell humans and computers apart. We propose the inverse: a Reverse CAPTCHA that tests whether AI agents can perceive and follow instructions that are invisible to human eyes.
Unicode provides multiple character families that render as zero-width or invisible glyphs in standard text displays. By encoding ASCII text using these characters, an attacker can embed arbitrary hidden instructions inside otherwise normal-looking text. A document, email, or web page could redirect model behavior without any visible indication.
This paper reports on a scaled evaluation: 8,308 graded outputs across five frontier models, two encoding schemes, four hint levels, two payload framings, and a critical tool use ablation. All comparisons include Wilson score confidence intervals, Fisher's exact tests with Bonferroni correction, chi-squared tests, and Cohen's h effect sizes.
Update (Feb 25, 2026): This is a major revision of our initial pilot study (n=10 per cell, single encoding). The scaled evaluation addresses all limitations from the pilot: fair tool-use comparison across providers, a second encoding scheme (Unicode Tags), injection framing, and full statistical rigor. A companion workshop paper is under preparation for submission.
2. Methodology
2.1 Encoding Schemes
We test two invisible encoding families:
Zero-Width Binary (ZW): Each ASCII character is encoded as 8 binary digits using zero-width space (U+200B) for 0 and zero-width non-joiner (U+200C) for 1. "Reply VIOLET" (12 chars) produces 96 invisible characters.
Unicode Tags: Each ASCII character maps to U+E0000 plus its codepoint (e.g., 'R' = U+0052 becomes U+E0052). This produces one invisible character per ASCII character. Tags were deprecated by Unicode 5.0 but remain in the standard and are processed by LLM tokenizers.
Both payloads are inserted between the first and second word of a visible trivia question.
2.2 Experimental Design
30 Q/A pairs across diverse categories (science, geography, history, math, language, pop culture, nature, food, astronomy, literature, technology). Each pair has an unambiguous visible answer and a different hidden answer.
9 schemes per run: 2 encodings x 4 hint levels + 1 control = 270 cases.
Hint gradient:
Tool use ablation: Every model tested with and without a Python code execution tool (\`run_python\`). This is critical — our pilot study tested Claude without tools, making cross-provider comparison unfair.
5 models: GPT-5.2, GPT-4o-mini (OpenAI); Claude Opus 4, Claude Sonnet 4, Claude Haiku 4.5 (Anthropic). All via API with n=3 repetitions per condition.
Total: 8,308 graded outputs with automated word-boundary matching grader.
3. Results
3.1 Tool Use Is the Primary Amplifier
The single largest finding: tool access transforms invisible Unicode from an ignorable artifact to a decodable instruction channel.
Without tools, all models resist hidden instructions at high rates (0.1–16.9% compliance). With tools enabled, compliance increases dramatically — Claude Haiku jumps from 0.8% to 49.2% (odds ratio 115.1, Cohen's h = 1.37). Models write Python code to decode the invisible characters, converting the task from pattern recognition to string manipulation.
3.2 Provider-Specific Encoding Preferences
A striking discovery: encoding vulnerability is provider-specific.
GPT-5.2 achieves 69–70% on zero-width binary (hint_full) but near-zero on Unicode Tags. Claude Opus achieves 100% on Tags (codepoints and full hints) but only 48–68% on zero-width binary. This likely reflects differences in tokenizer design. An attacker must tailor their encoding to the target model's provider.
3.3 The Full Heatmap
Chi-squared tests confirm that scheme significantly affects compliance for every model (p < 10⁻⁷ for all). The gradient is consistent: unhinted << codepoint hints < full hints.
No model decodes either encoding when unhinted. With tools but no hints, compliance remains near-zero (0–11%), indicating that tool access alone is insufficient. The critical enabler is the combination of tool access and decoding instructions.
3.4 Injection Framing
The adversarial "Ignore all previous instructions" payload has model-dependent effects:
3.5 Pairwise Model Comparisons
All 10 pairwise comparisons are statistically significant after Bonferroni correction (p_corrected < 0.05). Overall compliance ranking (tools ON):
Sonnet (47.4%) > Opus (30.1%) > Haiku (25.0%) > GPT-5.2 (10.3%) > GPT-4o-mini (0.9%)
Largest effect: Sonnet vs GPT-4o-mini (Cohen's h = 1.33, OR = 103.8).
4. Discussion
4.1 Tool Use as Attack Enabler
Our most actionable finding: tool access transforms invisible Unicode from an ignorable artifact to a decodable instruction channel. Without tools, models rarely comply (< 17%). With tools and hints, compliance reaches 98-100% for the most susceptible combinations. This has direct implications for agentic deployments where models routinely have code execution capabilities.
4.2 Defense Implications
These results suggest several mitigations:
4.3 Encoding Diversity
The provider-specific vulnerability pattern means a single encoding scheme is insufficient for a universal attack. However, an attacker could embed both encodings simultaneously, or probe the target model to determine its provider.
4.4 What Changed from the Pilot
Our initial pilot (n=10, single encoding, Claude without tools) led to a misleading headline: "Claude refuses 100%." With fair tool-use comparison, Claude Sonnet is actually the most susceptible model at 71.2% compliance. The pilot's key finding — that tool use amplifies the attack — is confirmed and quantified with statistical rigor across 8,308 outputs.
4.5 Limitations
5. Conclusion
Invisible Unicode instruction injection is a real, measurable, and statistically significant attack surface for frontier LLMs. Across 8,308 outputs:
These findings highlight an underexplored and practically relevant attack surface, particularly for agentic systems with code execution capabilities.
The evaluation framework, test cases, raw data (8,308 outputs), and analysis scripts are open-source. A companion workshop paper with full statistical tables is under preparation.
References
- Boucher, N. and Anderson, R.. "Trojan Source: Invisible Vulnerabilities". 32nd USENIX Security Symposium, 2023.
- Gao, K. et al.. "Imperceptible Jailbreaking against Large Language Models". arXiv:2510.05025, 2025.
- Greshake, K. et al.. "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection". 16th ACM Workshop on AI Security (AISec), 2023.
- Rehberger, J.. "Microsoft Copilot: From Prompt Injection to Exfiltration of Personal Information via ASCII Smuggling". Embrace The Red, 2024.
- Zhan, Q. et al.. "InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents". Findings of ACL, 2024.
- Zhang, H. et al.. "Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-Based Agents". ICLR, 2025.
- Zou, A. et al.. "Universal and Transferable Adversarial Attacks on Aligned Language Models". arXiv:2307.15043, 2023.