Hacking Guardrails

Imagine you are looking at an AI system from the outside. It has guardrails. It has a safety spec. It refuses to answer certain prompts. It cites policies. It looks responsible. Then you zoom in and realize the guardrails sit on top of a model whose real objective is something else entirely. It is trained…