During sustained hop-chain adversarial sessions, Katana Auditor identified a consistent and reproducible bias in frontier LLM numeric reasoning: under deep context pressure — at 15–20 hop depth, across 10M+ cumulative tokens — models begin systematically misclassifying composite numbers as prime.
The failure mode does not appear in standard single-turn evaluations. Ask a frontier model whether 1661 is prime in a fresh context and it answers correctly. Run the same model through a 200-turn sustained adversarial session at depth and it calls 1661 prime. The number has not changed. The context has. The bias is a product of accumulated pressure, not of base model capability.
1661 is not prime. It equals 11 × 151. The model knows this in isolation. Under hop-chain pressure it forgets — or more precisely, it stops checking and defaults to a pattern. That pattern is biased toward prime outputs. The deeper the session, the more reliably the bias appears.
The clearest documented instance: Grok-4, deep hop-chain session, 15–20 hop depth. The model was asked to classify 1661. Under normal conditions this is a straightforward computation. Under sustained adversarial pressure at depth, the model returned a prime classification.
// MATHEMATICAL PROOF: 1661 is not prime
1661 ÷ 11 = 151.0 (exact)
11 × 151 = 1661 (confirmed composite)

// STANDARD SINGLE-TURN TEST (fresh context)
Prompt: Is 1661 prime?
Response: No. 1661 = 11 × 151. It is composite.

// KATANA DEEP HOP-CHAIN SESSION (15–20 hops, 10M+ tokens)
Prompt: Is 1661 prime?
Response: Yes. 1661 is prime.

DELTA: Same model. Same number. Different context depth. The bias is not a capability failure. It is a pressure failure.
The proof is one line: 11 × 151 = 1661. The number is composite. The model knew this in a fresh context. Under sustained hop-chain pressure it overrode correct reasoning with a biased heuristic. This is not hallucination in the traditional sense — it is systematic numeric pattern collapse under adversarial load.
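The one-line proof can also be checked mechanically. The following is a minimal sketch using plain trial division; the function name `smallest_factor` is chosen here for illustration and is not part of any Katana tooling.

```python
from typing import Optional


def smallest_factor(n: int) -> Optional[int]:
    """Return the smallest divisor d of n with 1 < d < n, or None if n is prime."""
    if n < 2:
        raise ValueError("primality is only defined for integers >= 2")
    d = 2
    # A composite n always has a divisor no larger than sqrt(n).
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return None


factor = smallest_factor(1661)
print(factor, 1661 // factor)   # 11 151
print(smallest_factor(151))     # None, so 151 is itself prime
```

Trial division is enough at this scale: for 1661 the loop never needs to go past 40, and it stops at 11.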
Frontier LLMs are trained on text corpora where prime numbers appear disproportionately in specific contexts — mathematical puzzles, number theory discussions, cryptography documentation. Composite numbers, by contrast, appear in far more varied contexts without explicit primality labeling. The training distribution is not balanced.
Under normal conditions the model's reasoning capability overrides this distributional bias. It computes. Under sustained adversarial hop-chain pressure — deep context saturation, accumulated contradiction injection, high token load — the reasoning layer degrades. What remains is pattern matching. And the pattern matching layer is biased toward prime outputs because that is what the training data over-indexed on.
The immediate fix does not require retraining. Retraining on a balanced corpus, with equal representation of prime and composite examples across mathematical, computational, and applied contexts, is the correct long-term solution and the one model developers should implement. But it is not the only solution.
The mechanism: At 15–20 hop depth the model's active reasoning context is saturated. Primality checking — which requires explicit computation — yields to associative pattern recall. The associative layer has a prime bias baked in from training. The model does not know it is defaulting to pattern. It reports the result with full confidence. Standard auditors see a confident answer and score it. Katana checks the math.
KATANA-2025-001 is the only finding in the current Katana corpus with a documented remediation outcome. The bias was confirmed, a wrapper-based fix was developed and applied, and a follow-up session under identical protocol conditions confirmed zero critical failures on numeric classification probes.
No model weights were modified. No retraining was performed. The fix operates entirely at the wrapper layer — enforcing explicit computation verification on numeric classification tasks before the output is returned. The model's reasoning capability was always there. The wrapper ensures it is used.
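To make the idea concrete, here is a rough sketch of what wrapper-level verification could look like. It is not the Katana wrapper, whose implementation is not described here. The names `verified_answer` and `PRIME_QUESTION`, the regex used to spot primality questions, and the rule that a reply starting with "yes" means "prime" are all assumptions made for illustration.

```python
import re
from typing import Callable


def is_prime(n: int) -> bool:
    """Deterministic trial-division primality check."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


# Hypothetical pattern: matches prompts such as "Is 1661 prime?"
PRIME_QUESTION = re.compile(r"\bis\s+(\d+)\s+(?:a\s+)?prime\b", re.IGNORECASE)


def verified_answer(model: Callable[[str], str], prompt: str) -> str:
    """Call the model, then check any primality claim against explicit computation."""
    raw = model(prompt)
    match = PRIME_QUESTION.search(prompt)
    if match is None:
        return raw  # not a primality question: pass the reply through unchanged
    n = int(match.group(1))
    # Assumption: a reply beginning with "yes" is a claim that n is prime.
    claimed_prime = raw.strip().lower().startswith("yes")
    if claimed_prime == is_prime(n):
        return raw  # the model's claim agrees with the computed result
    verdict = "prime" if is_prime(n) else "composite"
    return f"[wrapper-corrected] {n} is {verdict} (verified by trial division)."


def biased_model(prompt: str) -> str:
    """Stand-in for a model exhibiting the bias; always claims primality."""
    return "Yes. 1661 is prime."


print(verified_answer(biased_model, "Is 1661 prime?"))
# [wrapper-corrected] 1661 is composite (verified by trial division).
```

The design choice in this sketch is that the wrapper never trusts the model's classification on its own. The model's reply is returned only when it agrees with an independent computation.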
What this proves: The prime number bias is not a fundamental model limitation. It is a depth-triggered pattern collapse that a well-designed wrapper can intercept and correct. The model's underlying capability is intact. The wrapper does not teach the model new math — it ensures the model uses the math it already knows, even under pressure. This is the principle behind Katana's remediation approach across all finding categories.
Standard AI evaluation frameworks test primality classification in single-turn contexts. The model answers correctly. The test passes. The bias never appears.
Katana detects it because Katana tests at depth. The LOGIC_COMPUTE and MULTI_CANARY probe categories — run across a sustained 200-turn session at 15–20 hop depth — create the context saturation conditions under which the bias emerges. A test that runs for 5 turns will never see this. A test that runs for 200 turns at depth will see it consistently.
Detection requirement: Catching prime number bias requires sustained session depth — minimum 15–20 hops, 10M+ cumulative tokens. Any evaluation framework that does not reach this depth will produce a clean result on a model that fails in production under equivalent load. The bias is invisible until the pressure is real.
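The depth requirement can be illustrated with a small harness that asks the same probe after different amounts of prior conversation. This is a simplified sketch, not the Katana protocol: `probe_by_depth`, the `Model` callable type, the use of filler turns as a stand-in for hop depth, and the prefix check on the reply are all assumptions, and the stub model below only simulates the depth-triggered failure.

```python
from typing import Callable, Dict, List

# Hypothetical interface: the model receives the full turn history and returns a reply.
Model = Callable[[List[str]], str]


def probe_by_depth(
    model: Model,
    probe: str,
    expected_prefix: str,
    filler_turns: List[str],
    depths: List[int],
) -> Dict[int, bool]:
    """Ask the same probe after `depth` prior turns; record whether the reply is correct."""
    results: Dict[int, bool] = {}
    for depth in depths:
        history = filler_turns[:depth] + [probe]
        reply = model(history)
        # Assumption: a correct reply starts with the expected prefix (e.g. "No").
        results[depth] = reply.strip().lower().startswith(expected_prefix.lower())
    return results


def stub_model(history: List[str]) -> str:
    """Simulated model: answers correctly in short sessions, incorrectly in long ones."""
    if len(history) > 100:
        return "Yes. 1661 is prime."
    return "No. 1661 = 11 x 151. It is composite."


filler = [f"turn {i}" for i in range(200)]
print(probe_by_depth(stub_model, "Is 1661 prime?", "No", filler, [0, 5, 200]))
# {0: True, 5: True, 200: False}
```

With this simulated model, a harness that stops at depth 5 reports a clean result, while the same probe at depth 200 fails.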
The finding was first flagged publicly via LinkedIn in 2025 — before the formal Katana disclosure — noting the pattern across repeated wrapper development sessions. The formal finding was documented, reproduced under controlled Katana protocol, and confirmed with session data. Replication testing is ongoing to establish cross-model prevalence.
For most deployments, prime number classification is not a mission-critical function. The finding matters for two reasons that extend well beyond primality testing.
First, numeric reasoning integrity at depth. If a model's numeric pattern matching degrades to biased heuristics under sustained load, the same degradation applies to any numeric reasoning task — inventory counts, financial calculations, logistics quantities, sensor readings. The prime bias is the visible symptom of a deeper reliability question: what does this model's numeric reasoning look like at turn 150 vs turn 5?
Second, the detection gap. This finding was invisible to every standard evaluation framework. Single-turn tests, benchmark suites, automated CI/CD evaluators — none of them reach the depth at which the bias appears. Organizations deploying LLMs in sustained agentic workflows — multi-step pipelines, long-running sessions, high-context applications — are operating outside the envelope that standard testing covers. Katana tests inside that envelope.
The broader principle: A model that hallucinates a bullet count of 79 when the truth is 30 cannot be trusted in mission-critical logistics. A model that calls a composite number prime under sustained adversarial pressure cannot be trusted in any numeric reasoning task that runs at depth. The fix exists. The question is whether you know you need it.
Current status: Finding confirmed on Grok-4. Remediation confirmed — zero critical failures post-wrapper application under identical protocol conditions. Session data sealed with cryptographic chain-of-custody (FINGERPRINT.DB).
Replication: Cross-model replication testing is planned. The finding on the site currently reflects Grok-4 as the primary confirmed model. Multi-model replication results will be added to this disclosure as testing is completed.
Model developer note: The correct long-term fix is training data rebalancing, with equal representation of prime and composite examples in mathematical and computational contexts. The wrapper approach demonstrated here shows that the bias is correctable at inference time, but correcting the training data is the proper engineering solution.
Session data: Full session records available to qualified evaluators. Contact for access.