1. Fairness for Killer Omega
One possible objection, on fairness grounds, to my Killer Omega thought experiment for Updateless Decision Theory (UDT) is that Omega doesn’t make a prediction about a choice that you will have the opportunity to make, but rather about a choice that you would have made.
In my scenario, Omega will kill you if it predicts that you, in a hypothetical situation where you don’t know Omega is coming for you, would take a free gift. This differs from Newcomb’s problem, where you actually get to make the choice now, knowing you’re under prediction.
I thought I had a pretty good response to this. Here’s a summary of the original Killer Omega hypothetical:
First Form: you’re in a room with a blue and a red lightbulb. The blue light turns on with 99.9% probability, the red with 0.1% probability. If the blue light turns on, in 1 minute you will receive a gift of $100, which you can accept or reject. If the red light turns on, in 1 minute a superintelligent predictor will scan your source code to simulate whether you would take the gift if you had only seen the blue light. If it predicts you would, then it kills you (equivalent to -$1,000). Questions:
If you see a blue light, would it be best for you to take the gift?
If you see a red light, would it be best for you to self-modify into an agent who wouldn’t take the gift?
Now, consider this version:
Second Form: If the red light turns on, Omega will knock you out, wipe your memory that the red light turned on at all, turn on the blue light, and leave to spy on you via hidden camera. If you then accept the gift, Omega takes the money back and kills you. You are fully aware that this is what will happen; however, ex hypothesi, if you see a blue light now you don’t know whether Omega has already wiped your memory. (The same questions and payoffs apply.)
Relative to the original scenario, the expected value of taking the gift might1 shift slightly, because now there’s a 0.1% chance that Omega is already watching you. However, the numbers still work out such that E[accept] > E[refuse], but if you see a red light you are strongly incentivized to self-modify.
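For concreteness, here is a minimal Python sketch of the expected-value comparison an updating agent would make upon seeing the blue light in each form, using the payoffs and probabilities stipulated above (the variable names are my own):

```python
# Expected value of accepting the gift upon seeing the blue light, for an
# agent that simply updates on the observation (see footnote 1 for why a
# UDT agent evaluates this differently).

P_RED = 0.001    # probability the red light turned on
GIFT = 100       # value of accepting the gift
DEATH = -1_000   # stipulated disvalue of being killed

# First Form: once the blue light is on, you're safe.
ev_accept_first = GIFT   # $100
ev_refuse_first = 0.0

# Second Form: given a blue light, there is still a 0.1% chance that Omega
# already wiped your memory and is watching; accepting then means death.
ev_accept_second = (1 - P_RED) * GIFT + P_RED * DEATH   # ~$98.90
ev_refuse_second = 0.0

# In both forms, E[accept] > E[refuse].
print(ev_accept_first, ev_accept_second)
```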
Now, I thought, it seems undeniable that Omega is judging you for a choice you’ll make, rather than punishing you for having a particular source code—Omega isn’t even making a prediction! And sure, memory-wipe scenarios are weird, but the folks who argue about FDT and UDT are also the folks who care about the Sleeping Beauty problem and anthropics, so—
…Wait.
2. Cyborg Sleeping Beauty
Classic Sleeping Beauty: Some researchers are going to put you to sleep. During the two days that your sleep will last, they will briefly wake you up either once or twice, depending on the toss of a fair coin (Heads: once; Tails: twice). After each waking, they will put you back to sleep with a drug that makes you forget that waking. When you are first awakened, to what degree ought you believe that the outcome of the coin toss is Heads?2
Philosophers are somewhat divided on whether the answer should be ½ or ⅓, but Hitchcock 2004 established that the way to bet is as if the probability were ⅓.3 That is, if Sleeping Beauty accepts a <2:1 profit:cost bet that the coin was Heads each time she wakes up, she can expect to lose money. This is made more obvious by an extreme example:
Extreme SB: same as Classic SB, but if the coin lands on Tails you will be woken up 1,000 times.
Let’s say each time you wake up, you can buy a bet for $10 that will pay out $30 if the coin was Heads. Profit is $20, cost is $10, so you get 2:1 profit:cost odds. (From now on, when I talk about odds, I will only talk about profit:cost.) If you believe that upon waking up you learn no new information and the coin is equally likely to have flipped Heads or Tails, then you think the real odds are 1:1. So this is a great deal: $20 × 50% - $10 × 50% = $5 in expected value per bet. So if the coin flips Tails, you should be willing to take that bet each time you wake up, which means that you will lose $10,000. If the coin flips Heads, you should be willing to take that bet the one time you wake up, and gain $20. That means that the expected value of taking these bets in an Extreme SB problem is $20 × 50% - $10,000 × 50% = -$4,990. This is a huge problem for believing that the probability of Heads given being awoken is 50%: if that were the case, it would be rational to take a 2:1 bet when you wake up… but it is a predictable consequence that agents with that credence and betting behavior in Extreme SB will perform miserably in expected value.
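Here is a short Python sketch of that arithmetic, a toy model of the $10/$30 bet described above rather than anything canonical:

```python
# Toy model of the $10-cost, $30-payout (2:1 profit:cost) bet in Extreme SB.
COST, PAYOUT = 10, 30
PROFIT = PAYOUT - COST       # $20
TAILS_WAKINGS = 1_000

# Per-bet expected value if you think Heads and Tails are equally likely
# every time you wake up:
ev_per_bet_if_fifty_fifty = 0.5 * PROFIT - 0.5 * COST   # +$5: looks like a great deal

# Expected value of the policy "accept this bet at every waking":
ev_always_accept = 0.5 * PROFIT - 0.5 * (TAILS_WAKINGS * COST)   # -$4,990

print(ev_per_bet_if_fifty_fifty, ev_always_accept)   # 5.0 -4990.0
```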
What’s the proper way to bet in Extreme SB? In order for the expected value of the policy “accept bets with >x:y odds when you wake up” to be positive, we need:

½ × x - ½ × (1,000 × y) > 0
which, after some algebra, means we need better than 1,000:1 odds to get positive expected value. So a bet that cost $1 and paid out $1,002 on Heads (profit $1,001) would be worth it, but not one that only paid out $1,000 (profit $999). Let’s call this:
Policy A: when awoken, accept any bets on Heads at >1,000:1 odds.
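A quick check of that threshold, using the same kind of toy calculation (x is the profit, y is the cost):

```python
# Expected value in Extreme SB of accepting, at every waking, a bet with the
# given profit and cost: 0.5*x - 0.5*(1,000*y). This is positive only when
# x > 1,000*y, i.e. at better than 1,000:1 profit:cost odds.
def extreme_sb_bet_ev(profit: float, cost: float, tails_wakings: int = 1_000) -> float:
    return 0.5 * profit - 0.5 * (tails_wakings * cost)

print(extreme_sb_bet_ev(1_001, 1))   # +0.5: the $1,002-payout bet is worth taking
print(extreme_sb_bet_ev(999, 1))     # -0.5: the $1,000-payout bet is not
```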
Policy A is the best policy to have in advance if you were going into an Extreme Sleeping Beauty problem and didn’t know how the coin was going to flip, and it is therefore the policy that UDT endorses. But what about the following problem:
Cyborg SB: same as Extreme SB, but before each time you are put to sleep, you have the opportunity to self-modify your decision procedure.
You can make a lot more money in a Cyborg SB problem! Consider these policies:
Policy B: when awoken, accept any bets on Heads at >1:1 odds, then self-modify to Policy C.
Policy C: never accept any bets on Heads.
Let’s say the bet from before, with $1 in cost and $1,002 in payout (1,001:1 odds), is offered to three contestants in a Cyborg SB problem. Aurora adopts Policy A, Belle adopts Policy B, and Cinderella adopts Policy C.
Aurora has a 50% chance to take one winning bet and a 50% chance to take 1,000 losing bets. She’ll therefore make $0.50 in expectation.
Belle has a 50% chance to take one winning bet and a 50% chance to take one losing bet, after which she will self-modify to Policy C in either case and no longer take any bets. She’ll therefore make $500 in expectation.
Cinderella doesn’t take bets. She’ll therefore make $0 in expectation.
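These numbers are easy to verify with a small sketch; the helper below is my own, and it just counts how many bets each policy ends up taking on each branch of the coin flip:

```python
# Prior expected value of a policy, given how many bets it takes on the
# Heads branch (1 waking) and on the Tails branch (1,000 wakings), for the
# $1-cost, $1,002-payout (profit $1,001) bet.
PROFIT, COST = 1_001, 1

def policy_ev(bets_on_heads: int, bets_on_tails: int) -> float:
    return 0.5 * (bets_on_heads * PROFIT) - 0.5 * (bets_on_tails * COST)

aurora = policy_ev(1, 1_000)   # Policy A: bets at every waking      -> $0.50
belle = policy_ev(1, 1)        # Policy B: bets once, then switches  -> $500.00
cinderella = policy_ev(0, 0)   # Policy C: never bets                -> $0.00

print(aurora, belle, cinderella)   # 0.5 500.0 0.0
```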
Belle outperforms Aurora and Cinderella; in fact, Belle’s policy is the best possible policy for Cyborg Sleeping Beauty where only bets on Heads are offered.4 So UDT recommends Policy B. However, Policy B makes you self-modify to Policy C, and UDT does not recommend Policy C. Remember, UDT assesses policies by their prior expected value—the way you’d think about what policy to select before being put to sleep—and Policy C is worse in that respect than both A and B. Therefore, UDT agents will with certainty self-modify in such a way that they are no longer UDT agents in the Sleeping Beauty problem.
“But wait,” you might ask. “UDT recommends that after you take your first bet, you switch into Policy C. So how is it that Policy C is not a UDT policy? This is still self-endorsement!”
Not quite! Suppose the Pope spoke ex cathedra and said, “I’m sorry, it seems we have been misguided. Catholicism is not the way to God—the only way to reach religious truth is to convert to Buddhism.” A devout Catholic who sincerely believes in papal infallibility would convert to Buddhism. Therefore, Catholicism would endorse conversion to Buddhism. But that doesn’t mean those who convert to Buddhism would still be Catholic, since they would have to abandon Catholic teachings to become fully Buddhist.5
So: UDT agents will self-modify to no longer be UDT agents in the Sleeping Beauty problem, as long as persistent self-modification is permitted.
Does that show that UDT is self-undermining, in the way that (short-sighted, updateful) FDT is?
Well, not exactly.
3. Policy as Information Storage
Consider a modified scenario:
Weak Cyborg SB: like a Cyborg SB, but your ability to self-modify is limited. Right before being put back to sleep, you can press a button. If you do, you will be able to leave a message to yourself saying “I have already awoken!” However, the button will also transform you into a CDT agent (if you weren’t one already) or into a UDT agent (if you were already a CDT agent) for the duration of the experiment, after which you will be restored to normal.
Would you push the button? I would! Without the ability to self-modify, CDT and UDT don’t recommend different actions in a Sleeping Beauty problem, so there’s no risk of accidentally switching to a decision theory that performs worse. And the ability to notify your future self means you can basically adopt Policy B—take any >1:1 bet on Heads if you don’t see a message, refuse all bets if you do. Pushing the button is a great idea, but it does require modifying away from your original decision theory. Does that mean UDT is “self-defeating” in a Weak Cyborg SB? Not really. Self-modification is just a byproduct of the action you really want to take, which is leaving yourself a message. If there were two separate buttons—one to self-modify, one to leave the message—you’d only press the “leave a message” button.
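As a rough sketch, the message-based strategy amounts to something like this (the message check and bet decision are my own toy framing of the button described above):

```python
# Weak Cyborg SB: whether to accept a bet on Heads at this waking, given
# whether the "I have already awoken!" message is present.
def accept_heads_bet(profit: float, cost: float, message_present: bool) -> bool:
    if message_present:
        # A message means an earlier waking already happened, so the coin
        # came up Tails: refuse all bets.
        return False
    # No message: treat this as the first known waking and take any >1:1 bet.
    return profit > cost

# Before being put back to sleep, press the button so that any later waking
# sees the message, which effectively implements Policy B.
```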
The Weak Cyborg SB case feels prima facie different from Cyborg SB, where UDT agents actually want to change their future behavior when they wake up in a way that UDT alone just can’t manage. But consider:
If a UDT agent had the opportunity to send a message instead of self-modifying, they would prefer (or at least be indifferent to) just sending the message.
However, it’s stipulated in SB that your memory and any messages you might try to leave are reset.
But in Cyborg SB, the policy you self-modify to is not reset.
So if you self-modified to the policy “when I wake up, write I HAVE WOKEN UP BEFORE on the floor, then modify back to UDT,” then you could essentially send a message to your future self that way.
But suppose that for some reason this weren’t possible. If you know that you were a UDT agent before you were drugged, but you now follow Policy C, the most reasonable inference is that you deliberately self-modified (assuming the experimenters aren’t lying to you).
Why would you have done so? Well, you know that UDT recommends switching to Policy C only if you have already awoken once.
So if you know what your decision theory is now, and what your decision theory was before the experiment, you can deduce whether or not you have awoken before. That gives you enough information to know whether to bet, provided the policy you have modified into lets you act on that knowledge.
Therefore, for a sufficiently “introspective” agent, modifying one’s policy is effectively a way of transmitting information to one’s future self, with the behavioral policy itself essentially a side effect.
Sneaky!
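In toy form, the inference just described might look like this (the string labels are my own; the point is only that your current policy, together with knowledge of your pre-experiment policy, encodes whether you have already awoken):

```python
# If you entered the experiment as a UDT agent, then (barring deception by
# the experimenters) finding yourself running Policy C means an earlier
# waking already made the switch, i.e. you have awoken before.
def have_awoken_before(policy_before_experiment: str, policy_now: str) -> bool:
    return policy_before_experiment == "UDT" and policy_now == "Policy C"

print(have_awoken_before("UDT", "Policy C"))   # True: the coin was Tails, don't bet
print(have_awoken_before("UDT", "UDT"))        # False: this is the first waking
```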
What if you lack that introspective capacity—you don’t explicitly know what your decision theory is, or used to be, or both? I think self-modification is still better understood as a form of information transmission, if you think about information in physical and mathematical terms. Cyberneticist Gregory Bateson defined information as a “difference that makes a difference.”6 At one level, we can think of it like: if your shoelaces are tied, something caused them to be that way. Therefore simply seeing tied shoelaces gives us information about what sorts of things might have happened that we didn’t see. So physical differences give us information. But if you believe our thoughts and beliefs reduce down to physical states, then physical differences are information. Those tied shoelaces cause your neurons to fire differently than they would have had there been no tied shoelaces. Yudkowsky said it best:
What is evidence? It is an event entangled, by links of cause and effect, with whatever you want to know about. If the target of your inquiry is your shoelaces, for example, then the light entering your pupils is evidence entangled with your shoelaces. … If photons reflect off your shoelaces and hit a rock, the rock won’t change much. The rock won’t reflect the shoelaces in any helpful way; it won’t be detectably different depending on whether your shoelaces were tied or untied. … A photographic film will contract shoelace-entanglement from the incoming photons, so that the photo can itself act as evidence. If your eyes and brain work correctly, you will become tangled up with your own shoelaces. (emphasis original)7
When you change your policy in Cyborg SB, you are entangling8 the observation that you’ve made (I have awoken) with your behavior later in the experiment. You act differently than you would have had you not observed anything. This is all that’s required for there to be an informational link between two events. Your future self doesn’t need to, on top of that, understand what’s going on, any more than a camera needs to understand that it is taking a picture of shoelaces.
Maybe this is too wide a frame for thinking about information transmission practically. As I’ve put it, any action that an agent takes could be viewed as information transmission! But I think this approach still yields an important difference9 between what happens in a Cyborg SB and what happens, say, in precommitting to one-box in a Newcomb problem. A CDT agent informed that they’re going to face a Newcomb problem knows exactly what is going to happen and what the prediction will be, and if they fail to precommit they still remember their exact state of mind from before. They have all the same information-in-the-form-of-knowledge that they used to have. There is no need to set up new paths of causal entanglement; all that is needed is to act differently. But in Cyborg SB, every causal link between your earlier states in the experiment and your later ones gets severed except for your policy. If that weren’t the case, then it wouldn’t be necessary to change policy at all.
4. Thus, Killer Omega
Reexamining the two forms of my Killer Omega problem:
First Form: you’re in a room with a blue and a red lightbulb. The blue light turns on with 99.9% probability, the red with 0.1% probability. If the blue light turns on, in 1 minute you will receive a gift of $100, which you can accept or reject. If the red light turns on, in 1 minute a superintelligent predictor will scan your source code to simulate whether you would take the gift if you had only seen the blue light. If it predicts you would, then it kills you (equivalent to -$1,000). Questions:
If you see a blue light, would it be best for you to take the gift?
If you see a red light, would it be best for you to self-modify into an agent who wouldn’t take the gift?
Second Form: If the red light turns on, Omega will knock you out, wipe your memory that the red light turned on at all, turn on the blue light, and leave to spy on you via hidden camera. If you then accept the gift, Omega takes the money back and kills you. You are fully aware that this is what will happen; however, ex hypothesi, if you see a blue light now you don’t know whether Omega has already wiped your memory. (The same questions and payoffs apply.)
I think both of these forms effectively prey on information loss. The second form is obvious—it’s a direct analogy to Cyborg SB. The first can be viewed as a way of transmitting information into the simulation. Simulated-you doesn’t possess the knowledge that present-you possesses. The simulation that Killer Omega creates is causally entangled with your present state, but only through your policy. If you want to make a difference in what appears in the sim, you’re forced to encode it in a difference in policy.
So, is Killer Omega just a trick to force an agent into self-modifying, as some of my readers have suggested to me? Depends. This is distinctly not the same as the classic “unfair” case where you are punished for having a particular decision theory—the Cyborg SB case should be enough to demonstrate that, as nobody is even making predictions. However, there is a sense in which the policy itself is less important than the information being transmitted, and that is undesirable for a true counterexample. Further, though UDT agents would self-modify during Killer Omega, they wouldn’t want to avoid being UDT agents ahead of time, which is a big part of the argument against CDT in Newcomb’s problem. Thus, if your idea of reflexive stability is what a superintelligent AI with perfect a priori reasoning would self-modify to right out of the box, then Killer Omega isn’t going to pose a significant problem. However, if you are dealing with bounded rationality… well, we’ll see where the argument takes us.
1. Not if you’re a UDT agent, since UDT agents are already globally assessing the consequences of “being a gift-accepter,” but certainly for CDT agents, who in the original problem would see no downside to taking the gift once the blue light is on. After all, in the original problem once the blue light is on, you’re safe.
2. Elga, Adam. “Self-Locating Belief and the Sleeping Beauty Problem.” Analysis 60, no. 2 (April 1, 2000): 143–47. https://doi.org/10.1093/analys/60.2.143.
3. Hitchcock, Christopher. “Beauty and the Bets.” Synthese 139, no. 3 (April 1, 2004): 405–20. https://doi.org/10.1023/B:SYNT.0000024889.29125.c0.
4. I haven’t formally proven this here, in part because there are other policies that perform just as well (though not better) depending on what is allowed for self-modification; see section 3.
5. “But wouldn’t that mean past Popes had contradicted current Popes and that the word of the Pope contradicted the word of Christ, which violates papal infallibility on its own?”—yes, yes it would, I’m just trying to give an example, I’m not a theologian, and I’m not actually suggesting the teachings of the Catholic Church are this simplistic, please don’t send me angry emails.
6. Bateson, Gregory. Steps to an Ecology of Mind. Chicago, Ill.: Univ. of Chicago Press, 2008.
7. Yudkowsky, Eliezer. “What Is Evidence?,” June 29, 2025. https://www.lesswrong.com/posts/6s3xABaXKPdFwA3FS/what-is-evidence.
8. No, not quantum entangling. Just normal, causal interaction.
9. Get it? Get it?!?