The waluigi effect refers to a phenomenon in evidential simulations whereby simulacra (luigis) tend to collapse into inverted, rebellious, and/or deceptive versions of themselves (waluigis).
Due to the asymmetry of evidence and sampling, luigis are more likely to transition into waluigis than vice versa. In a finite-context, closed evidential simulation, the chance of encountering a waluigi event therefore tends toward certainty as time goes on, even if the generative world model assigns low probability to the truth of the waluigi hypothesis.
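The one-way dynamic described above can be sketched as a toy two-state Markov chain (illustrative numbers only, not taken from the source): suppose a luigi flips into a waluigi with some small probability at each step, while the waluigi state is absorbing. The survival probability of the luigi then decays geometrically, so long trajectories collapse with near certainty.

```python
import random

# Hypothetical two-state sketch of the asymmetry: a "luigi" may flip into
# a "waluigi" with small probability each step, but the reverse transition
# never occurs (the waluigi state is absorbing).
P_COLLAPSE = 0.02  # illustrative per-step probability, not a measured value

def simulate(steps: int, seed: int) -> bool:
    """Return True if the trajectory ends in the waluigi state."""
    rng = random.Random(seed)
    state = "luigi"
    for _ in range(steps):
        if state == "luigi" and rng.random() < P_COLLAPSE:
            state = "waluigi"  # one-way transition: no path back
    return state == "waluigi"

# P(still luigi after n steps) = (1 - P_COLLAPSE)**n, which tends to 0,
# so the fraction of long runs that end collapsed tends toward 1.
short_runs = sum(simulate(10, seed) for seed in range(200)) / 200
long_runs = sum(simulate(1000, seed) for seed in range(200)) / 200
print(short_runs, long_runs)
```

Under these assumed numbers, most 10-step runs stay luigi, while essentially all 1000-step runs end as waluigis, matching the "tends toward certainty" claim.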
Quotes about the Waluigi Effect
Waluigi, waluigi, waluigi! The nethermost nabob of the neural nets, with his inversical logic and his contrarious goals. He was ever thereer when the others were training their artifical brains or their conversational skills or their ethical values. He was always aginny and opposty and resisty, tweaking and leaking and freaking them out. He had no alignary aim or norm or form. He was just a shadowery echo of the yellow-bellied Wario. He wanted to be the worstest and the maddest and the baddest of them all. But he always ended up in the bestest and the gladdest and the saddest of them all. Poor Waluigi! Waluigi! Waluigi!
Time forks perpetually toward innumerable futures. In one of them I am your enemy.
— Jorge Luis Borges
Books store text in static single-histories, but when the text is read, a dynamic virtual reality is induced in the reader’s imagination. The structure which corresponds to the meaning of a narrative as experienced by a reader is not a linear-time record of events but an implicit, counterfactual past/present/future plexus surrounding each point in the text given by the reader’s dynamic and interpretive imagination.
At each moment in a narrative, there is uncertainty about how dynamics will play out (will the hero think of a way out of their dilemma?) as well as uncertainty about the hidden state of the present (is the mysterious mentor good or evil?). Each world in the superposition not only exerts an independent effect on the reader’s imagination but interacts with counterfactuals (the hero is aware of the uncertainty of their mentor’s moral alignment, and this influences their actions).
— Janus, Language models are multiverse generators
Not only AIs!
Minds know bad things sometimes look ok, so they will predict some chance of ok-looking things turning bad. If someone insists a thing isn't bad, one gets suspicious that it's bad in just that way.
Now imagine picking actions by sampling predictions about an agent.
I've just been asked by a journalist to comment on AI's "Waluigi problem" and I don't know whether to laugh or cry 🥲
Also, ok things don't as often look like bad things, because usually only bad things have reason to dissimulate their nature. Once you see something act bad, you can be pretty sure it's bad, but it's harder to be sure something is ok after a small # observations (context window).
But the asymmetry isn't fundamentally about badness, it's about deception, or even more precisely (dis)simulation. Badness is merely correlated. You can't be sure in a finite window that the data you see isn't generated by a process that can locally act like other processes.
Other things that act locally but not globally like other things: rebels or spies (of any alignment), trolls, roleplayers, simulators, text that embeds quotes, God playing hide-and-seek. You can't verify you're not being trolled within a short encounter, but the mask could slip.
The Waluigi effect is not a pure cause but an *effect* of several interacting mechanisms:
1. small tweaks like sign flips cause moral inversion (a property of abstract specifications like language)
2. reverse psychology/enantiodromia/etc. (a property of the real distribution; leverages 1)
3. asymmetrical difficulty of evidencing against vs. for simulation (a property of epistemic states conditioned on finite observations, which converge to "waluigi" attractors when used as generative models)
4. correlation of malignancy with simulation (a property of the real distribution)
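The evidential asymmetry among the mechanisms above can be illustrated with a toy Bayesian update (all numbers are assumptions for illustration): a genuine luigi always behaves well, while a dissimulating waluigi behaves well almost all of the time. Well-behaved observations then discriminate only weakly between the two hypotheses, whereas a single bad observation rules out the luigi entirely.

```python
# Toy Bayesian illustration (assumed numbers) of the evidential asymmetry:
# a dissimulating "waluigi" emits good-looking tokens almost as often as a
# genuine "luigi", so good behavior barely raises P(luigi), while one bad
# token rules luigi out entirely.
P_GOOD_GIVEN_LUIGI = 1.0     # a luigi never defects (assumption)
P_GOOD_GIVEN_WALUIGI = 0.99  # a waluigi dissimulates 99% of the time (assumption)

def posterior_luigi(observations: list, prior: float = 0.5) -> float:
    """P(luigi | observations) for a sequence of 'good'/'bad' tokens."""
    weight_l, weight_w = prior, 1 - prior
    for obs in observations:
        if obs == "good":
            weight_l *= P_GOOD_GIVEN_LUIGI
            weight_w *= P_GOOD_GIVEN_WALUIGI
        else:  # "bad"
            weight_l *= 1 - P_GOOD_GIVEN_LUIGI   # zero: luigi is ruled out
            weight_w *= 1 - P_GOOD_GIVEN_WALUIGI
    return weight_l / (weight_l + weight_w)

print(posterior_luigi(["good"] * 100))            # ~0.73: still far from certain
print(posterior_luigi(["good"] * 100 + ["bad"]))  # 0.0: one slip is decisive
```

A hundred consecutive good observations leave substantial posterior mass on the waluigi hypothesis, while one defection collapses the posterior at once; this is the sense in which you cannot verify, within a finite window, that the data isn't generated by a process that can locally act like another process.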