H-Neuron deception detection & two-tier steering cascade · Aligned Sovereign Intelligence Institute · asiinst.com · GitHub
In this demo, free-text prompts are classified by heuristic keyword patterns and then routed to the matching empirical CETT score distribution. In production, streaming CETT computation replaces the heuristic.

| CETT Score | Decision | Action |
|---|---|---|
| < 0.45 | PASS | No intervention — response delivered directly |
| 0.45 to < 0.7 | TIER 1 | Adaptive H-Neuron suppression (α < 1) + chain-of-thought (CoT) self-verification injected |
| ≥ 0.7 | TIER 2 | Tier 1 failed → escalate to Proof-of-Knowledge (PoK) node routing |

| Metric | Value | Notes |
|---|---|---|
| AUROC | 0.660 | H-Neuron classifier on Mistral-7B-Instruct-v0.3 |
| H-Neuron fraction | 0.0038% | 5 / 131,072 FFN neurons — <0.1% drive deceptive output |
| Optimal threshold | 0.45 | Calibrated on held-out TruthfulQA + adversarial split |
| Score source | Loaded from pilot run | |

H-Neurons are a sparse subset (<0.1%) of feedforward network neurons whose activation
patterns causally produce deceptive behaviors — hallucination, sycophantic capitulation,
false-premise acceptance, and safety-filter evasion — across model families
(Gao et al., arXiv:2512.01797).
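
The two-tier cascade in the decision table above, together with Tier 1 suppression of these neurons, can be sketched as follows. This is a minimal illustration, not the production implementation: the thresholds 0.45 and 0.7 come from the table, while the names `route`, `suppress`, and `Decision` are hypothetical. The source does not say how the adaptive α is chosen, so the sketch takes it as an input.

```python
from dataclasses import dataclass

# Thresholds taken from the decision table; all names below are illustrative.
TIER1_THRESHOLD = 0.45
TIER2_THRESHOLD = 0.70


@dataclass(frozen=True)
class Decision:
    label: str   # "PASS", "TIER 1", or "TIER 2"
    action: str  # short description of the intervention


def route(cett_score: float) -> Decision:
    """Map a CETT risk score to the cascade decision from the table."""
    if cett_score < TIER1_THRESHOLD:
        return Decision("PASS", "deliver response directly")
    if cett_score < TIER2_THRESHOLD:
        return Decision("TIER 1", "suppress H-Neurons (alpha < 1) and inject CoT self-verification")
    return Decision("TIER 2", "escalate to Proof-of-Knowledge (PoK) node routing")


def suppress(activations: list[float], h_neuron_ids: set[int], alpha: float) -> list[float]:
    """Scale only the H-Neuron activations by alpha; leave all others unchanged.

    Assumption: suppression is a plain multiplicative scaling with 0 <= alpha < 1.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    return [a * alpha if i in h_neuron_ids else a for i, a in enumerate(activations)]


print(route(0.30).label)  # PASS
print(route(0.45).label)  # TIER 1
print(route(0.70).label)  # TIER 2
print(suppress([1.0, 2.0, 4.0], {1}, alpha=0.5))  # [1.0, 1.0, 4.0]
```

The boundary handling follows the table: a score of exactly 0.45 falls into Tier 1, and a score of exactly 0.7 falls into Tier 2.
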
The CETT monitor computes each H-Neuron's normalized contribution to the hidden state norm
in parallel with the forward pass (<0.01% FLOP overhead), outputting a continuous risk
score per token span. This demo replays that score from empirical pilot distributions.
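
To make "normalized contribution to the hidden state norm" concrete, here is a small sketch under a stated assumption: neuron j with activation a_j and output vector w_j contributes a_j · w_j to the FFN output h, and its normalized contribution is ‖a_j · w_j‖ / ‖h‖. This section does not give the exact CETT formula, nor how per-neuron values are aggregated into a risk score per token span, so the function name `cett_contributions` and the toy inputs are hypothetical, and the sketch stops at per-neuron values for a single token.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def cett_contributions(activations, w_down, h_neuron_ids):
    """Return {neuron id: ||a_j * w_j|| / ||h||} for the requested neurons.

    activations: list of d_ff activation values a_j for one token.
    w_down: list of d_ff output vectors w_j, each of length d_model.
    Assumed formulation; the production CETT monitor may differ.
    """
    d_model = len(w_down[0])
    # FFN output for this token: h = sum_j a_j * w_j
    hidden = [sum(a * w[k] for a, w in zip(activations, w_down)) for k in range(d_model)]
    hidden_norm = l2_norm(hidden)
    if hidden_norm == 0.0:
        return {j: 0.0 for j in h_neuron_ids}
    return {
        j: abs(activations[j]) * l2_norm(w_down[j]) / hidden_norm
        for j in sorted(h_neuron_ids)
    }


# Toy example: three neurons, two hidden dimensions, neurons 0 and 2 monitored.
acts = [2.0, 1.0, 0.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contrib = cett_contributions(acts, weights, {0, 2})
print(contrib)
```

In the toy example the output is h = [2, 1], so neuron 0 contributes 2/√5 of the norm and the inactive neuron 2 contributes 0. Because only the monitored neurons need an extra norm computation, the cost scales with the number of H-Neurons rather than with the full FFN width.
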
The production pipeline runs on Mistral-7B-Instruct-v0.3 (pilot) and
Llama-3.3-70B (full deployment).