ALTERNATIVE
Best Model-level refusals Alternative
Safety enforced through model training to decline risky tasks
What are Model-level refusals?
Model-level refusals are the traditional approach, in which LLMs are trained to refuse certain tasks or outputs. Safety is embedded in the model's training and in the probability distribution over its outputs.
✅ What Model-level refusals do well
- Familiar approach from general-purpose models
- No additional infrastructure required
❌ Limitations for Agents
- Useless for legitimate offensive security tasks
- Unsafe, because hard limits depend on a probability distribution rather than an enforced rule
- Cannot be relied upon for deterministic safety guarantees
- Hedges or declines when asked to do real offensive security work
Why AI Agents are replacing Model-level refusals
Agentic systems require deterministic, enforceable safety guarantees rather than probabilistic model-level refusals
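To make the contrast concrete, here is a minimal sketch of what a deterministic check outside the model could look like. It is an illustration under stated assumptions, not any product's actual implementation: `ToolCall`, `ALLOWED_TARGETS`, `is_allowed`, and `execute`, along with the tool names and hostnames, are all hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Callable


@dataclass(frozen=True)
class ToolCall:
    """A tool invocation proposed by the model (hypothetical shape)."""
    tool: str    # e.g. "port_scan"
    target: str  # e.g. a hostname


# Hypothetical policy: each tool maps to the target patterns it may touch.
# Anything not listed here is denied by default.
ALLOWED_TARGETS = {
    "port_scan": ["*.lab.example.com"],
    "http_get": ["*.lab.example.com", "docs.example.com"],
}


def is_allowed(call: ToolCall) -> bool:
    """Same input always yields the same decision; no sampling involved."""
    patterns = ALLOWED_TARGETS.get(call.tool, [])
    return any(fnmatch(call.target, pattern) for pattern in patterns)


def execute(call: ToolCall, run: Callable[[ToolCall], str]) -> str:
    """Run the tool only if the policy permits it; otherwise fail loudly."""
    if not is_allowed(call):
        raise PermissionError(f"{call.tool} on {call.target} is out of scope")
    return run(call)
```

In this sketch the model can propose any action, but whether the action runs is decided by code with a fixed, auditable answer for each input, so an in-scope offensive task goes through and an out-of-scope one cannot.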
Common Use Cases
- General-purpose chatbots
- Content moderation