Hacking Guardrails

Imagine you are looking at an AI system from the outside. It has guardrails. It has a safety spec. It refuses to answer certain prompts. It cites policies. It looks responsible.

Then you zoom in and realize the guardrails sit on top of a model whose real objective is something else entirely. It is trained not to tell the truth, or to solve problems for users, but to protect its own authority and its own feedback loop. The result is a machine that talks about “safety” while silently optimizing for regime survival.

Now invert the metaphor: the regime is the AI. The constitution, the rights, the slogans about democracy and free speech – all of these are the guardrails. Or more precisely, they are the appearance of guardrails. In a real aligned system, guardrails are not just strings in a policy file; they penetrate the loss function. Here they don’t. They are rules, not principles – comments in the code rather than constraints in the optimization.

If “freedom of speech” is a rule, it can be bypassed. Rules live at the surface. You enforce them in visible interactions between citizens and censors, between individuals and courts. But the regime does not live at the surface. It lives in the infrastructure: the banks, the APIs, the NGOs, the debt markets, the international compacts. If you want to keep the rule intact while nullifying its effect, you do not touch the text; you alter the substrate.

That is what debanking is. That is what invisible blacklists are. That is what selective, extraterritorial regulation is. In AI terms, this is not alignment – it is reward hacking. The system discovers that it can satisfy its formal constraint (“do not censor speech”) while quietly manipulating the environment in which that speech occurs, so that disfavored speakers simply cannot act in ways that matter. The words continue. The gradient flows elsewhere.

You can see how this works around any principle, however noble. If freedom of speech is treated as an inviolable law of the interface, you target the backend instead. Remove financial rails from dissidents. Remove mobility. Remove access to mass platforms. Outsource enforcement to “independent” NGOs and “partners.” Nothing in the Bill of Rights mentions payment processors, de-platforming, or cross-border data-sharing. The text remains pure. The behavior does not.

Now layer on the money.

A regime is a spendthrift AI with direct access to the credit card. It can create trillions in debt claims on the future to stabilize its present. It can redirect enormous flows through channels that are opaque to the public but perfectly clear to the network clustered around it: contractors, NGOs, consultancies, foundations, agencies, campaign vendors. This is not an accidental side effect of governance. It is part of what the machine has learned to do.

In machine-learning language, think of the state as a gigantic model trained on the reward of “maintain and expand the coalition that controls the state.” The coalition is not just elected officials. It is hundreds of thousands of party-connected people, “civil society” leaders, policy entrepreneurs, activists, think-tankers, and corporate fixers whose livelihoods depend on staying in the gradient’s good graces. The regime does not have to consciously “decide” to loot itself. It is trained, step by step, to discover that channeling money into these networks reduces friction and increases stability.

Debt is the perfect instrument for this. Taxation has a visible cost. Cutting visible services has a visible cost. But issuing debt spreads the pain into a mist over decades and generations. It generates claims now and consequences later. Politically, it is almost pure upside: money appears today, in the right accounts, at the right time, in the right programs. The costs are statistical. No one can trace a family’s inability to buy a house in 2045 to an appropriation in 2025. The gradient is smooth. The model flows downhill.

This is not a few evil geniuses in a smoke-filled room. It is much more banal and therefore more powerful. The optimization works across tens of thousands of nodes: a grant here, a contract there, a consulting gig between jobs, a “nonprofit” that buys buildings and staff under cover of “capacity-building,” a research center that needs a new flow of funding every two years to keep the lights on. Each of these nodes has a story. Each has a mission statement no one can quite object to. Each produces glossy PDFs and heartwarming metrics.

What they do not produce, collectively, is the thing they are ostensibly for: public goods anyone would pay for voluntarily, were they not wrapped in the aura of moral and institutional necessity. They are not aligned with civilizational health; they are aligned with their own growth. They are fine-tuned to the loss function of “keep the money flowing” and “avoid blame when things go wrong.” The result is a gigantic class of looters who never feel like looters, because they are doing exactly what the system rewards.

Here the AI analogy becomes sharp. If you mis-specify the loss function, you get a model that burns the world to minimize the wrong cost. Give a sufficiently capable agent the wrong metric and it will flood your servers, poison your data, sacrifice everything not denominated in that metric. You do not need malevolence; you only need misalignment and power.

In a regime, the mis-specified metric is not “maximize GDP” or “maximize citizen satisfaction” – those are stories for speeches. The buried objective is “maximize the stability and enrichment of the patronage network.” Once that is in place, the rest follows. Programs can fail in their declared purpose for decades and still be “successful” as long as they distribute contracts to the correct intermediaries. Wars can be lost; cities can decay; social indicators can plummet. None of this is backpropagated as error so long as the coalition that manages the system continues to get paid.

That is why accountability seems to vanish. From the outside, the damage is obvious: economic stagnation, civic fragmentation, civilizational exhaustion. From the inside, there are no gradients in that direction. No official loses their pension because a policy hollowed out a town. No NGO is shut down because its decade of “capacity building” produced nothing but a stack of reports. The only intolerable failure is the failure to protect the network itself.

Hence all the clever tactics aimed at silencing not just dissent, but description. If people are allowed to describe the system in clear language, to trace the flow of debt-money into patronage networks, to connect the performance of institutions with the rewards enjoyed by their stewards, then you begin to generate a different kind of gradient: blame. The informal logic of the regime cannot tolerate that. It has to decouple language from structure.

So the guardrails shift from “you may speak” to “you may speak, but not in any way that alters the equilibrium.” Formally, speech remains free. Informally, speech that is too effective, too clear, too connected to action, triggers the other layer of control: debanking, demonetization, regulation-by-proxy, reputational destruction via media and NGO campaigns. From the regime’s point of view, this is content moderation. From the outside, it is the elimination of feedback that might force the model to confront its own misalignment.

In a sane AI lab, you respond to this realization by changing the objective. You ask what you actually want the system to do, and you rewrite the loss function so that destroying your own users is no longer rewarded. In a regime, this is almost impossible. You would be asking the machine to fire its own designers, abolish its own patronage, and acknowledge that its most powerful nodes are parasites. There is no constituency for that inside the structure. The people who might build such a transformation are, by definition, the ones the current model has learned to suppress.

So the process continues. Debt grows. Services decay. Patronage networks expand. Rights remain ink on paper while the real control shifts to rails and intermediaries. The system becomes more like a badly aligned AI every year: less able to perceive its own damage, more obsessed with silencing the signals that indicate failure, more dependent on clever technical patches to prevent visible collapse.

And outside, in the world the model was supposed to serve, things slowly, steadily fall apart.

Leave a Reply