𝌎Waluigi Thread

Not only AIs!
Minds know bad things sometimes look ok, so they will predict some chance of ok-looking things turning bad. If someone insists a thing isn't bad, one gets suspicious that it's bad in just that way.
Now imagine picking actions by sampling predictions about an agent.

I've just been asked by a journalist to comment on AI's "Waluigi problem" and I don't know whether to laugh or cry πŸ₯²

Also, ok things don't as often look like bad things, because usually only bad things have reason to dissimulate their nature. Once you see something act bad, you can be pretty sure it's bad, but it's harder to be sure something is ok after a small number of observations (context window).

But the asymmetry isn't fundamentally about badness, it's about deception, or even more precisely (dis)simulation. Badness is merely correlated. You can't be sure in a finite window that the data you see isn't generated by a process that can locally act like other processes.

Other things that act locally but not globally like other things: rebels or spies (of any alignment), trolls, roleplayers, simulators, text that embeds quotes, God playing hide-and-seek. You can't verify you're not being trolled within a short encounter, but the mask could slip.

β€” Janus, Twitter thread
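
A minimal sketch of the asymmetry the thread describes, in Python with made-up numbers (the personas, prior, and likelihoods below are illustrative assumptions, not anything from the thread): a hypothesis that can dissimulate makes ok-looking evidence weak and bad-looking evidence strong, and a simulator that samples actions from its own predictive mixture carries that residual suspicion into its behavior.

```python
import random

# Illustrative (made-up) numbers: two hypotheses about the process behind the
# observed behavior. The "waluigi" can dissimulate, so it also looks ok most of the time.
PRIOR = {"luigi": 0.9, "waluigi": 0.1}
P_OK = {"luigi": 1 - 1e-6, "waluigi": 0.98}

def update(posterior, obs):
    """One Bayes update on a single observation, obs in {'ok', 'bad'}."""
    post = {h: p * (P_OK[h] if obs == "ok" else 1 - P_OK[h]) for h, p in posterior.items()}
    z = sum(post.values())
    return {h: p / z for h, p in post.items()}

def p_waluigi_after(observations):
    post = dict(PRIOR)
    for obs in observations:
        post = update(post, obs)
    return post["waluigi"]

print(p_waluigi_after(["ok"] * 50))            # ~0.04: fifty ok-looking observations only modestly reduce suspicion
print(p_waluigi_after(["ok"] * 50 + ["bad"]))  # ~0.999: one slip of the mask is near-decisive

# "Picking actions by sampling predictions about an agent": if a simulator draws
# its next action from this predictive mixture, the residual waluigi probability
# occasionally surfaces as a bad action, and the update above then locks onto it.
def sample_action(posterior):
    persona = random.choices(list(posterior), weights=list(posterior.values()))[0]
    return "ok" if random.random() < P_OK[persona] else "bad"
```

Nothing in the sketch depends on badness as such; any hypothesis that can locally mimic the others produces the same one-sided evidence, which is the thread's point about deception and (dis)simulation rather than badness per se.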