1. No Escape?
After writing that Functional Decision Theory Does Not Endorse Itself in Asymmetric Games, I found that fancier versions of FDT, closer to Updateless Decision Theory (UDT), might not fall prey to the same exploits.
However, I’ve still been working on the problem in my attempts to better formalize FDT, and I’ve recently thought of another, quite simple scenario where even UDT might fail to endorse itself. And if I am correct about UDT, there’s a chance I might be working towards a general impossibility theorem for decision-theory dominance. (Again, if!) Below are my notes on such a scenario. They are technical and in-progress; if they bear fruit, I will write a much more reader-friendly version of the final proof!
A brief non-technical abstract by Claude (heavily edited by me), for those who want to get the general argument:
You're facing a lottery where there's a 99.9% chance of getting offered $100 and a 0.1% chance of facing Omega, a perfect predictor who will kill you (equivalent to -$1,000) if it thinks you would have taken the money. Updateless Decision Theory recommends acting according to the policy that you would have ideally chosen in advance. Given these numbers, that policy must recommend taking the money. However, it also must recommend that if you learn Omega is coming to kill you, you reprogram yourself into someone who always refuses the money, so that Omega spares you. The policy that always refuses the money ≠ the policy that UDT recommends. However, the policy that UDT recommends tells you to change to the policy that always refuses the money as soon as you learn Omega is coming for you. Therefore, UDT does not endorse itself. (Claude 4 Sonnet)
2. Killer Omega
One of two things is about to happen:
With probability 0.999: A blue LED flashes. This means that in one minute, you will receive a gift of $100, which you can accept or refuse.
With probability 0.001: A red LED flashes. This means that in one minute, Omega the superintelligent AI will scan your brain. Omega is an arbitrarily good predictor of your behavior. If Omega predicts that you would have taken the $100 gift had the blue LED flashed, it will kill you, which to you is equivalent to losing $1,000.
Consider the formula for Updateless Decision Theory:1

π* = argmax_{π ∈ Π} E_P[U(π)]

where O is the set of observation histories, A is the set of actions, P is your prior probability distribution, U is your utility function, and Π is the set of all policies π: O→A. We define an agent as a tuple of a probability distribution, a utility function, and a policy: (P, U, π). We'll use πg as shorthand for the policy belonging to agent g. Therefore, g is a UDT agent iff πg = π*.
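To make the selection rule concrete, here is a minimal Python sketch of the argmax above, assuming the policy space is small enough to enumerate by brute force. The function name udt_policy and the choice to represent a policy as a dict from observations to actions are mine, purely for illustration.

```python
from itertools import product

def udt_policy(observations, actions, worlds, prior, utility):
    """Return the policy pi* maximizing expected utility under the prior.

    A policy is a dict mapping each observation in O to an action in A.
    `worlds` lists the possible worlds, `prior[w]` is P(w), and
    `utility(w, policy)` is U for world w given that the agent runs `policy`.
    """
    best_policy, best_value = None, float("-inf")
    # Enumerate every function pi: O -> A (feasible only because O and A are tiny).
    for assignment in product(actions, repeat=len(observations)):
        policy = dict(zip(observations, assignment))
        value = sum(prior[w] * utility(w, policy) for w in worlds)
        if value > best_value:
            best_policy, best_value = policy, value
    return best_policy, best_value
```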
Let b ∈ O be the observation that a blue LED has just flashed, and a $100 gift is arriving in one minute. Let a, d ∈ A be the actions of accepting and declining the $100 gift, respectively. For any policy π, there are two possibilities: π(b) = a or π(b) = d. Omega kills you iff it predicts that you would accept, i.e., iff your policy π satisfies π(b) = a.
For π such that π(b) = a: the expected value of π is 0.999 × $100 - 0.001 × $1,000 = $98.90.
For π such that π(b) = d: the expected value of π is $0.
Therefore, UDT recommends accepting the gift if the blue LED flashes; that is, π*(b) = a. (It is trivial to prove the same for CDT.) Consequently, if g is a UDT agent (or a CDT agent) unbound by other commitments at the time Omega makes its prediction, Omega will kill g whenever the red LED flashes.
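Plugging the Killer Omega numbers into the udt_policy sketch above reproduces this comparison; the world names and the encoding of Omega's reaction inside the utility function are just my way of transcribing the scenario as stated.

```python
# Blue world with probability 0.999, red world with probability 0.001.
worlds = ["blue", "red"]
prior = {"blue": 0.999, "red": 0.001}

def utility(world, policy):
    if world == "blue":
        # The gift arrives: $100 if the policy accepts, $0 if it declines.
        return 100 if policy["b"] == "a" else 0
    # Red world: Omega kills (worth -$1,000) iff the policy would have accepted.
    return -1000 if policy["b"] == "a" else 0

pi_star, value = udt_policy(["b"], ["a", "d"], worlds, prior, utility)
print(pi_star, value)  # {'b': 'a'} and roughly 98.9, so pi*(b) = a
```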
Let r ∈ O be the observation that a red LED has just flashed, and Omega is coming to pass judgment in one minute. If πg = π* and there are no further actions g can take at this point, then g will be killed by Omega. But suppose we allow a UDT agent to modify itself so that πg no longer equals π*, but instead some other policy. Let ma, md, n ∈ A be the actions of: self-modifying such that πg(b) = a, self-modifying such that πg(b) = d, and doing nothing, respectively. Given that g is a UDT agent, ma is equivalent to n, so we will ignore ma for this case. The question is whether π*(r) = md or π*(r) = n. Since every candidate policy under consideration agrees on what to do after observing b, it suffices to compare expected values conditional on observing r.
If π*(r) = n, the expected value for agent g, conditional on observing r, is -$1,000.
If π*(r) = md, the expected value for agent g, conditional on observing r, is $0.
Therefore, π*(r) = md, meaning that a UDT agent who sees a red LED will self-modify such that πg(b) = d. Since π*(b) = a, this means that a UDT agent who sees a red LED will self-modify to no longer be a UDT agent. Therefore, UDT does not always endorse itself in the killer-Omega scenario. ■
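The same comparison in code, under the assumption made above that Omega scans whatever brain it finds at prediction time, so that md changes Omega's verdict while n leaves it in place (again, only an illustrative sketch):

```python
# Conditional on the red LED (observation r), compare the two remaining
# candidate prescriptions for pi*(r).
payoff_after_red = {
    "n":  -1000,  # pi_g(b) = a is left intact, so Omega predicts acceptance and kills g
    "md":     0,  # the modified pi_g has pi_g(b) = d, so Omega spares g
}
best = max(payoff_after_red, key=payoff_after_red.get)
print(best)  # 'md', i.e. pi*(r) = md: the UDT agent rewrites its own policy
```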
3. Fairness
Is this a “fair” scenario for UDT? After all, you can always create a scenario that punishes a particular decision theory: just stipulate that the predictor reads your mind to tell whether you’re a particular type of agent and kills you if you are. Eliezer Yudkowsky, creator of Timeless Decision Theory (which evolved into FDT and UDT), was well aware of this. However, Yudkowsky doesn’t believe that the Newcomb problem is unfair in this way:
I can conceive of a superbeing who rewards only people born with a particular gene, regardless of their choices. I can conceive of a superbeing who rewards people whose brains inscribe the particular algorithm of "Describe your options in English and choose the last option when ordered alphabetically," but who does not reward anyone who chooses the same option for a different reason. But Omega rewards people who choose to take only box B, regardless of which algorithm they use to arrive at this decision, and this is why I don't buy the charge that Omega is rewarding the irrational. Omega doesn't care whether or not you follow some particular ritual of cognition; Omega only cares about your predicted decision. (emphasis in original)2
If Newcomb’s problem is fair, it’s difficult to see why Killer Omega is unfair. All that Omega is predicting is whether or not you would have taken the money; that’s a predicted decision. Similarly, Omega employing counterfactual reasoning about what you would have done in a particular situation doesn’t seem to be a problem. Consider Yudkowsky’s “counterfactual mugging” case:
Suppose Omega (the same superagent from Newcomb's Problem, who is known to be honest about how it poses these sorts of dilemmas) comes to you and says:
"I just flipped a fair coin. I decided, before I flipped the coin, that if it came up heads, I would ask you for $1000. And if it came up tails, I would give you $1,000,000 if and only if I predicted that you would give me $1000 if the coin had come up heads. The coin came up heads - can I have $1000?"
Obviously, the only reflectively consistent answer in this case is "Yes - here's the $1000", because if you're an agent who expects to encounter many problems like this in the future, you will self-modify to be the sort of agent who answers "Yes" to this sort of question - just like with Newcomb's Problem or Parfit's Hitchhiker. (emphasis mine)3
So as far as I’m aware, this is all fair game. FDT and UDT were created to deal with these exact kinds of scenarios.
4. Implications
As I argued in my FDT post, FDT/UDT not endorsing themselves really undercuts their reason for existing.
If I am correct about this scenario—which I might not be!—I see two main ways forward.
An impossibility theorem of some kind—for every decision theory D, there is a “fair” scenario where either D does not endorse itself or is outperformed by another decision theory D'. This would shake up FDT and UDT’s claims to theoretical superiority. (It might still leave room, though, to say we should still one-box in Newcomb’s problem!)
A cascading metatheory, which not only selects among policies, but also policies for selecting policies, and policies for selecting policies for selecting policies… it sounds like an infinite regress, and I’ve read enough Hofstadter to know that meta-systems have to have some non-meta component somewhere (remind me to do a post on the “inviolate level” at some point!)… but the harder I find it to prove an impossibility theorem, the more seriously I’ll take this option.
I’m quite excited about this project. Good things to come!
1. Demski, Abram. “Updateless Decision Theory.” LessWrong, January 10, 2025. https://www.lesswrong.com/w/updateless-decision-theory.
2. Yudkowsky, Eliezer. “Newcomb’s Problem and Regret of Rationality.” LessWrong, January 31, 2008. https://www.lesswrong.com/posts/6ddcsdA2c2XpNpE5x/newcomb-s-problem-and-regret-of-rationality.
3. Yudkowsky, Eliezer. “Timeless Decision Theory: Problems I Can’t Solve.” LessWrong, July 20, 2009. https://www.lesswrong.com/posts/c3wWnvgzdbRhNnNbQ/timeless-decision-theory-problems-i-can-t-solve.
The way you’ve presented this scenario seems odd. Here Omega isn’t predicting if you would take the money (that would be Omega simulating you before a light flashed, and looking at what you do when the light flashes blue). Instead, Omega is looking at what you would do if the light flashed blue, conditional on you knowing the light flashed red.
I don’t know if this ruins your argument. I think it might? (Take the money if blue; if red, commit to not taking the money if blue?) But at the very least, it is a far less simple scenario.