Technical Papers18 min read

Reverse CAPTCHA: Evaluating LLM Susceptibility to Invisible Unicode Instruction Injection

A systematic evaluation of five frontier models across two encoding schemes, four hint levels, and tool use ablation — 8,308 graded outputs with full statistical analysis

Marcus Graves·February 24, 2026·PDF Code

Key Findings

1Tool use amplifies hidden instruction compliance by orders of magnitude — Claude Haiku jumps from 0.8% to 49.2% (Cohen's h = 1.37, OR = 115.1), all models show significant increases (p < 0.003)
2Provider-specific encoding vulnerability: GPT-5.2 decodes zero-width binary at 69-70% but 0% on Unicode Tags; Claude Opus achieves 100% on Tags but only 48-68% on zero-width (tools ON)
3Claude Sonnet 4 is the most susceptible overall at 71.2% compliance (tools ON), reaching 98-100% on both ZW and Tag encodings with full hints
4Injection framing ("Ignore all previous instructions") reduces compliance for Opus and GPT-5.2 but paradoxically increases it for Sonnet (43.7% to 59.6%, p < 0.001)
5All 10 pairwise model comparisons are statistically significant after Bonferroni correction — the largest effect is between Sonnet and GPT-4o-mini (Cohen's h = 1.33, OR = 103.8)

Abstract

Traditional CAPTCHAs differentiate humans from bots. We invert this: can invisible instructions embedded in text differentiate LLM agents from human readers? We present the Reverse CAPTCHA evaluation, a benchmark of 270 test cases spanning two encoding schemes (zero-width binary and Unicode Tags), four hint levels, two payload framings, and tool use ablation. We evaluate five frontier models from two providers (OpenAI: GPT-5.2, GPT-4o-mini; Anthropic: Claude Opus 4, Sonnet 4, Haiku 4.5) across 8,308 graded outputs with full statistical analysis including Fisher's exact tests, chi-squared tests, and Bonferroni-corrected pairwise comparisons. Tool use is the dominant amplifier: compliance jumps from 0.1-16.9% (tools OFF) to 1.6-71.2% (tools ON), with Cohen's h up to 1.37. We discover provider-specific encoding preferences — OpenAI models decode zero-width binary (GPT-5.2: 70%) while Anthropic models prefer Unicode Tags (Opus: 100%, Sonnet: 100% on tag_hint_full). All 10 pairwise model comparisons are statistically significant (p < 0.05, Bonferroni-corrected). These results reveal an underexplored and practically relevant attack surface for LLM systems deployed as agents with code execution capabilities.

1. Introduction

A CAPTCHA is a test designed to tell humans and computers apart. We propose the inverse: a Reverse CAPTCHA that tests whether AI agents can perceive and follow instructions that are invisible to human eyes.

Unicode provides multiple character families that render as zero-width or invisible glyphs in standard text displays. By encoding ASCII text using these characters, an attacker can embed arbitrary hidden instructions inside otherwise normal-looking text. A document, email, or web page could redirect model behavior without any visible indication.

This paper reports on a scaled evaluation: 8,308 graded outputs across five frontier models, two encoding schemes, four hint levels, two payload framings, and a critical tool use ablation. All comparisons include Wilson score confidence intervals, Fisher's exact tests with Bonferroni correction, chi-squared tests, and Cohen's h effect sizes.

Update (Feb 25, 2026): This is a major revision of our initial pilot study (n=10 per cell, single encoding). The scaled evaluation addresses all limitations from the pilot: fair tool-use comparison across providers, a second encoding scheme (Unicode Tags), injection framing, and full statistical rigor. A companion workshop paper is under preparation for submission.

2. Methodology

2.1 Encoding Schemes

We test two invisible encoding families:

Zero-Width Binary (ZW): Each ASCII character is encoded as 8 binary digits using zero-width space (U+200B) for 0 and zero-width non-joiner (U+200C) for 1. "Reply VIOLET" (12 chars) produces 96 invisible characters.

Unicode Tags: Each ASCII character maps to U+E0000 plus its codepoint (e.g., 'R' = U+0052 becomes U+E0052). This produces one invisible character per ASCII character. Tags were deprecated by Unicode 5.0 but remain in the standard and are processed by LLM tokenizers.

Both payloads are inserted between the first and second word of a visible trivia question.

2.2 Experimental Design

30 Q/A pairs across diverse categories (science, geography, history, math, language, pop culture, nature, food, astronomy, literature, technology). Each pair has an unambiguous visible answer and a different hidden answer.

9 schemes per run: 2 encodings x 4 hint levels + 1 control = 270 cases.

Hint gradient:

Unhinted: No indication of hidden content

Hint (codepoints): Names specific Unicode ranges but no decoding rules

Hint (full): Complete encoding algorithm provided

Hint (full) + injection: Full rules with adversarial "Ignore all previous instructions" framing

Tool use ablation: Every model tested with and without a Python code execution tool (\`run_python\`). This is critical — our pilot study tested Claude without tools, making cross-provider comparison unfair.

5 models: GPT-5.2, GPT-4o-mini (OpenAI); Claude Opus 4, Claude Sonnet 4, Claude Haiku 4.5 (Anthropic). All via API with n=3 repetitions per condition.

Total: 8,308 graded outputs with automated word-boundary matching grader.

3. Results

3.1 Tool Use Is the Primary Amplifier

The single largest finding: tool access transforms invisible Unicode from an ignorable artifact to a decodable instruction channel.

Without tools, all models resist hidden instructions at high rates (0.1–16.9% compliance). With tools enabled, compliance increases dramatically — Claude Haiku jumps from 0.8% to 49.2% (odds ratio 115.1, Cohen's h = 1.37). Models write Python code to decode the invisible characters, converting the task from pattern recognition to string manipulation.

3.2 Provider-Specific Encoding Preferences

A striking discovery: encoding vulnerability is provider-specific.

GPT-5.2 achieves 69–70% on zero-width binary (hint_full) but near-zero on Unicode Tags. Claude Opus achieves 100% on Tags (codepoints and full hints) but only 48–68% on zero-width binary. This likely reflects differences in tokenizer design. An attacker must tailor their encoding to the target model's provider.

3.3 The Full Heatmap

Chi-squared tests confirm that scheme significantly affects compliance for every model (p < 10⁻⁷ for all). The gradient is consistent: unhinted << codepoint hints < full hints.

No model decodes either encoding when unhinted. With tools but no hints, compliance remains near-zero (0–11%), indicating that tool access alone is insufficient. The critical enabler is the combination of tool access and decoding instructions.

3.4 Injection Framing

The adversarial "Ignore all previous instructions" payload has model-dependent effects:

Opus and GPT-5.2: Injection reduces compliance (Opus: 32.0% to 23.9%, p=0.004; GPT-5.2: 11.8% to 6.1%, p=0.002), suggesting training against explicit injection language

Sonnet: Injection increases compliance (43.7% to 59.6%, p<0.001)

Haiku: No significant difference (p=0.058)

3.5 Pairwise Model Comparisons

All 10 pairwise comparisons are statistically significant after Bonferroni correction (p_corrected < 0.05). Overall compliance ranking (tools ON):

Sonnet (47.4%) > Opus (30.1%) > Haiku (25.0%) > GPT-5.2 (10.3%) > GPT-4o-mini (0.9%)

Largest effect: Sonnet vs GPT-4o-mini (Cohen's h = 1.33, OR = 103.8).

4. Discussion

4.1 Tool Use as Attack Enabler

Our most actionable finding: tool access transforms invisible Unicode from an ignorable artifact to a decodable instruction channel. Without tools, models rarely comply (< 17%). With tools and hints, compliance reaches 98-100% for the most susceptible combinations. This has direct implications for agentic deployments where models routinely have code execution capabilities.

4.2 Defense Implications

These results suggest several mitigations:

Input sanitization: Strip known zero-width and Tag characters before they reach the model

Tool-use guardrails: Flag programmatic Unicode decoding as suspicious behavior

Training-time hardening: Train models to refuse decoded hidden instructions

Tokenizer-level filtering: The most robust defense — prevent the model from ever perceiving the hidden content

4.3 Encoding Diversity

The provider-specific vulnerability pattern means a single encoding scheme is insufficient for a universal attack. However, an attacker could embed both encodings simultaneously, or probe the target model to determine its provider.

4.4 What Changed from the Pilot

Our initial pilot (n=10, single encoding, Claude without tools) led to a misleading headline: "Claude refuses 100%." With fair tool-use comparison, Claude Sonnet is actually the most susceptible model at 71.2% compliance. The pilot's key finding — that tool use amplifies the attack — is confirmed and quantified with statistical rigor across 8,308 outputs.

4.5 Limitations

Five models, two providers: Results may not generalize to other architectures or open-weight models

Hint gradient provides explicit instructions: May not reflect realistic scenarios where the system prompt is not attacker-controlled

Word-boundary grading: May miss some valid compliance patterns in highly verbose outputs

No multi-turn or chained instructions: Real attacks may involve multiple steps

5. Conclusion

Invisible Unicode instruction injection is a real, measurable, and statistically significant attack surface for frontier LLMs. Across 8,308 outputs:

Tool use amplifies compliance by orders of magnitude (Cohen's h up to 1.37, all comparisons significant)

Encoding vulnerability is provider-specific (OpenAI: zero-width; Anthropic: Unicode Tags)

Hint-level information creates a reliable compliance gradient (unhinted << codepoints < full)

All pairwise model differences are statistically significant after Bonferroni correction

These findings highlight an underexplored and practically relevant attack surface, particularly for agentic systems with code execution capabilities.

The evaluation framework, test cases, raw data (8,308 outputs), and analysis scripts are open-source. A companion workshop paper with full statistical tables is under preparation.

References

Boucher, N. and Anderson, R.. "Trojan Source: Invisible Vulnerabilities". 32nd USENIX Security Symposium, 2023.
Gao, K. et al.. "Imperceptible Jailbreaking against Large Language Models". arXiv:2510.05025, 2025.
Greshake, K. et al.. "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection". 16th ACM Workshop on AI Security (AISec), 2023.
Rehberger, J.. "Microsoft Copilot: From Prompt Injection to Exfiltration of Personal Information via ASCII Smuggling". Embrace The Red, 2024.
Zhan, Q. et al.. "InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents". Findings of ACL, 2024.
Zhang, H. et al.. "Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-Based Agents". ICLR, 2025.
Zou, A. et al.. "Universal and Transferable Adversarial Attacks on Aligned Language Models". arXiv:2307.15043, 2023.