1. The Twin Prisoner’s Dilemma
Hal is a selfish rational agent with a button. If he presses the button, he will lose $1,000 and give another agent, Sky, $1,000,000. Hal knows that Sky is faced with the same choice, with a button of her own. This is a one-off opportunity: Hal and Sky cannot communicate, and they will never meet or interact again. What does Hal do?
The typical reasoning runs thus: Hal can’t change what Sky does, since they can’t communicate. So he should consider the best choice independent of Sky’s action:
If Sky presses her button, Hal gets $1,000,000 no matter what. If he presses his own button, he loses $1,000. If he doesn’t, then he gets to keep the whole pot. So Hal should abstain in this case.
If Sky doesn’t press her button, Hal gets $0 no matter what. If he presses his own button, he loses $1,000. If he doesn’t, he loses nothing. So Hal should abstain in this case too.
Therefore, Hal just shouldn’t press the button. Sky will likely reason the same, and so neither gets anything. Unfortunately, this is worse for both of them than if they both cooperated, where they’d end up with $999,000 each. This is the prisoner’s dilemma.
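If it helps to see the payoffs laid out, here’s a minimal sketch of that dominance reasoning in Python. The dollar amounts come straight from the setup; everything else (the name PAYOFF, the print strings) is just illustrative:

```python
# Hal's payoff in the button game, indexed by (Hal's choice, Sky's choice).
PAYOFF = {
    ("press",   "press"):   999_000,    # both press: +$1,000,000 gift, -$1,000 cost
    ("press",   "abstain"): -1_000,     # Hal presses alone: pays $1,000, gets nothing
    ("abstain", "press"):   1_000_000,  # Sky presses alone: Hal keeps the whole gift
    ("abstain", "abstain"): 0,          # neither presses
}

# The dominance argument: for each thing Sky might do, find Hal's best response.
for sky in ("press", "abstain"):
    best = max(("press", "abstain"), key=lambda hal: PAYOFF[(hal, sky)])
    print(f"If Sky chooses {sky!r}, Hal does best by choosing {best!r}")
# Abstaining wins in both cases, so this style of reasoning says: never press.
```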
Twist time! Hal is a robot.1 Right before he learns about the button, we make an exact copy of him, Hal-2, with identical source code & state variables. Now we give Hal the button and explain the scenario. Hal knows he’s facing Hal-2 and vice versa, and each knows that the other knows. This is the twin prisoner’s dilemma.
Now, should Hal press the button or not? Hal and his twin are still selfish, so neither places any intrinsic value on the other’s wellbeing, and they know it. This is pure strategy. Plus, Hal and Hal-2 are physically independent—different computers, different hardware, no wireless or wired connection between the two. If we hacked Hal’s vision, Hal-2 wouldn’t suddenly go blind. Hal’s choice has no causal influence over Hal-2’s choice. So the same logic from before should apply.
If Hal-2 presses the button, Hal can press the button ($999,000 profit) or abstain ($1,000,000 profit). So Hal should abstain.
If Hal-2 abstains, Hal can press the button (losing $1,000) or abstain (losing $0). So Hal should still abstain.
So in a twin prisoner’s dilemma (henceforth twin PD), Hal comes away with $0, just like a normal prisoner’s dilemma.
Sky is also a robot.2 But it turns out Sky’s internal logic isn’t like Hal’s. Sky believes you should treat twin PDs—and only twin PDs—as if you were “choosing for both players,” even though your choice doesn’t cause your clone to make the same choice. When we clone Sky and give her the button, this is her reasoning:
If Sky ultimately decides to press the button, and Sky-2 is an exact clone of Sky, Sky-2 will arrive at the same conclusion and press the button too. Therefore Sky will make $999,000.
If Sky ultimately decides to abstain, Sky-2 will abstain for the same reason. Therefore Sky will make $0.
Sky presses the button. Sky-2, who follows the exact same logic, also presses the button. In a twin PD, Sky comes away with $999,000.
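Here is the same contrast as a toy sketch. Only the dollar figures come from the setup; hal_rule and sky_rule are my own illustrative names, and hal_rule simply hard-codes the dominance conclusion worked out above:

```python
# One player's payoff given (own choice, twin's choice), from the setup above.
PAYOFF = {("press", "press"): 999_000, ("press", "abstain"): -1_000,
          ("abstain", "press"): 1_000_000, ("abstain", "abstain"): 0}

def hal_rule():
    """Hal's procedure: abstain, since abstaining dominates regardless of the twin."""
    return "abstain"

def sky_rule():
    """Sky's procedure: evaluate each choice as if the twin mirrors it, pick the best."""
    return max(("press", "abstain"), key=lambda me: PAYOFF[(me, me)])

# In a twin PD, the clone runs an identical copy of the rule, so it makes the same choice.
for name, rule in (("Hal", hal_rule), ("Sky", sky_rule)):
    c = rule()
    print(f"{name} and {name}-2 both choose {c!r}; each walks away with ${PAYOFF[(c, c)]:,}")
```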
So who’s making the “rational” choice? Should you press the button or not?
2. What Rationality Means
For rationalist icon Eliezer Yudkowsky, rationality is simple: “Rational agents should WIN.”3,4
Rather than starting with a concept of what is the reasonable decision, and then asking whether "reasonable" agents leave with a lot of money, start by looking at the agents who leave with a lot of money, develop a theory of which agents tend to leave with the most money, and from this theory, try to figure out what is "reasonable". "Reasonable" may just refer to decisions in conformance with our current ritual of cognition - what else would determine whether something seems "reasonable" or not? … You shouldn't claim to be more rational than someone and simultaneously envy them their choice - only their choice. Just do the act you envy.5
And in twin PDs, Sky wins with $999,000 in profit. Hal gets $0 every time. To Yudkowsky, it doesn’t matter whether Sky’s policy feels stupid or naïve or wrongheaded. It doesn’t matter that pressing the button now has no causal influence over what your twin does, since the clone now operates independently of you. To Yudkowsky, the real test of whether a policy is rational is whether those who follow it would be better off than those who don’t.
Naïve versions of this view mistake correlation for causation. People with designer handbags are better off than those without, but you aren’t going to become more wealthy by buying a designer handbag. But there’s an easy way to tease these apart. Hal, assessing actions by their causal impacts, wouldn’t choose to buy a designer handbag. But he would modify himself to be like Sky if given the chance. Imagine that on some day before any clones of him are made, Hal is thinking about twin PDs and realizes that agents like Sky will always outperform him. Hal knows that in a twin PD he wouldn’t cooperate, and so neither would his clone. But, if Hal could change his own source code to ensure that he cooperates in twin PDs, then any future clones of him will also cooperate. Thus, committing himself to cooperate in twin PDs now causes him to have better outcomes in the future, if he ever faces a twin PD.
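A quick back-of-the-envelope version of Hal’s pre-commitment calculation (the probability of ever facing a twin PD is a number I made up purely for illustration):

```python
# Hypothetical: Hal's credence that he will someday face a twin PD.
p_twin_pd = 0.01  # made-up value, only for illustration

# Keep his current source code: any future clone defects, so both abstain and he gets $0.
ev_keep = p_twin_pd * 0

# Rewrite himself to cooperate in twin PDs: any future clone cooperates too.
ev_modify = p_twin_pd * 999_000

print(f"Keep current code: ${ev_keep:,.0f} expected from future twin PDs")
print(f"Self-modify now:   ${ev_modify:,.0f} expected from future twin PDs")
# Modifying comes out ahead for any nonzero chance of a future twin PD.
```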
This self-modification business is why it’s important that Hal and Sky are robots. You can argue all you like about whether a twin PD could ever really happen with humans—free will, predictability, etc.—but it sure as hell could happen with AI, where you can make exact duplicates of a program and turn the randomness down to zero. At the time he started developing his decision theory, Yudkowsky was researching self-modifying AI, and he realized that any rational self-modifying AI that did not endorse cooperation in twin PDs would immediately modify its own decision procedure to endorse cooperation. It seemed to him that if a good decision procedure A recommends switching to decision procedure B, there’s something superior about decision procedure B.
Hal wants to be more like Sky. What decision procedure could he adopt?
3. Enter Functional Decision Theory
Hal currently follows Causal Decision Theory (CDT). Whenever he has a choice, he picks the option that causes him to have the highest expected utility. When placed in a twin PD, cooperation just causes Hal to lose $1,000, and has no causal influence on Hal-2’s decision, so Hal won’t cooperate.
Yudkowsky and his coauthor, Nate Soares, propose an alternative: Functional Decision Theory (FDT). Here’s their plain English formulation:
Functional decision theorists hold that the normative principle for action is to treat one’s decision as the output of a fixed mathematical function that answers the question, “Which output of this very function would yield the best outcome?”6
…okay, maybe not so plain. It’s easiest to see it in action. Say Flo follows FDT 100% to the letter, and is put in a twin PD.7 Flo reasons:
If FDT were to endorse pressing the button, then my twin and I would both press the button and I would get $999,000.
If FDT were to endorse abstaining, then my twin and I would abstain and I would get $0.
It would be better for me if FDT endorsed pressing the button.
Therefore, by definition, FDT does endorse pressing the button.
Therefore, I should press the button.
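In code, the schema Flo is running looks roughly like this. To be clear, this is a minimal sketch of the idea rather than Yudkowsky & Soares’s formalism, and fdt_choice and its arguments are names I invented for illustration:

```python
def fdt_choice(options, outcome_if_fdt_outputs):
    """'Which output of this very function would yield the best outcome?'
    Each candidate output is scored under the supposition that every instance
    of the function (me, my twin, a predictor's model of me...) produces it."""
    return max(options, key=outcome_if_fdt_outputs)

# Twin PD: if the shared function outputs "press", my twin presses too.
twin_pd_outcomes = {"press": 999_000, "abstain": 0}
print(fdt_choice(("press", "abstain"), twin_pd_outcomes.get))  # -> press
```

All the real work hides in that outcome function, which has to say what the world would look like if FDT were to output each option; that is exactly where the counterpossible trouble in section 4 comes in.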
Some more examples:
(Newcomb’s Problem). An agent finds herself standing in front of a transparent box labeled “A” that contains $1,000, and an opaque box labeled “B” that contains either $1,000,000 or $0. A reliable predictor, who has made similar predictions in the past and been correct 99% of the time, claims to have placed $1,000,000 in box B iff she predicted that the agent would leave box A behind. The predictor has already made her prediction and left. Box B is now empty or full. Should the agent take both boxes (“two-boxing”), or only box B, leaving the transparent box containing $1,000 behind (“one-boxing”)?8
CDT reasons (correctly) that once the prediction is made, your choice has no causal influence over what’s in Box B. No matter what’s in Box B, you always get an extra $1,000 by two-boxing. Therefore, according to CDT, you should always two-box. Which means CDT agents are highly predictable in this regard… and so the predictor will always leave Box B empty. No million bucks for CDT-followers.
However, Flo is an FDT agent, and the reliable predictor knows it. She reasons:
If FDT were to endorse one-boxing, the predictor would know that and would leave $1,000,000 in Box B. Since I follow FDT, I would one-box in this scenario and get $1,000,000.
If FDT were to endorse two-boxing, the predictor would know that and would leave $0 in Box B. Since I follow FDT, I would two-box in this scenario and get $1,000.
It would be better for me if FDT endorsed one-boxing.
Therefore, by definition, FDT does endorse one-boxing.
Therefore, I should one-box.
Put Flo into Newcomb’s problem, and she will come out a millionaire.
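Plugging the predictor’s stated 99% accuracy into Flo’s policy-level comparison (the prose above rounds this to a perfectly reliable predictor; the sketch below keeps the 1% error rate and treats it as symmetric, which is my own assumption):

```python
ACCURACY = 0.99  # the predictor's stated track record

def expected_winnings(policy):
    """Expected payout for an agent whose policy the predictor models with
    99% accuracy. 'one-box' means leaving the transparent $1,000 box behind."""
    p_b_full = ACCURACY if policy == "one-box" else 1 - ACCURACY
    from_box_b = p_b_full * 1_000_000
    from_box_a = 0 if policy == "one-box" else 1_000
    return from_box_a + from_box_b

for policy in ("one-box", "two-box"):
    print(f"{policy}: ${expected_winnings(policy):,.0f} expected")
# one-box: $990,000 expected; two-box: $11,000 expected
```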
(Parfit’s Hitchhiker Problem). An agent is dying in the desert. A driver comes along who offers to give the agent a ride into the city, but only if the agent will agree to visit an ATM once they arrive and give the driver $1,000. The driver will have no way to enforce this after they arrive, but she does have an extraordinary ability to detect lies with 99% accuracy. Being left to die causes the agent to lose the equivalent of $1,000,000. In the case where the agent gets to the city, should she proceed to visit the ATM and pay the driver?9
CDT reasons (correctly) that once you are saved, paying the driver doesn’t make a difference. You’ve already won, and the driver can’t put you back in the desert, so why pay? Which means that CDT agents, predictably, wouldn’t pay when saved… which means the driver will never pick up a CDT agent.
However, Flo is an FDT agent. She reasons:
If FDT endorsed paying the driver, then I’d be telling the truth by promising to pay and would therefore get a ride. I’d lose $1,000 from paying up.
If FDT endorsed not paying the driver, then I couldn’t truthfully say I’d pay up and therefore wouldn’t get a ride. I’d lose (the equivalent of) $1,000,000 from dying.
It would be better for me if FDT endorsed paying the driver.
Therefore, by definition, FDT does endorse paying the driver.
Therefore I should pay the driver.
Unlike the CDT agents, Flo will never die in the desert.
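The same arithmetic for the hitchhiker, using the driver’s 99% lie-detection rate (again treating the error rate as symmetric, which is my own simplifying assumption):

```python
DETECTION = 0.99           # the driver's stated lie-detection accuracy
COST_OF_DYING = 1_000_000  # being left in the desert = losing the equivalent of $1,000,000

def expected_loss(policy_pays_once_saved):
    """Expected loss as a function of what the agent's policy says to do in the city."""
    if policy_pays_once_saved:
        # A sincere promise is believed 99% of the time: ride, then $1,000 at the ATM.
        return DETECTION * 1_000 + (1 - DETECTION) * COST_OF_DYING
    # A lie slips past the driver only 1% of the time: free ride, otherwise left to die.
    return (1 - DETECTION) * 0 + DETECTION * COST_OF_DYING

print(f"Policy: pay once saved   -> expected loss ${expected_loss(True):,.0f}")
print(f"Policy: stiff the driver -> expected loss ${expected_loss(False):,.0f}")
```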
As we can see, FDT agents deviate from CDT agents in several problems and come out the better for it. However, FDT sides with CDT on more conventional problems, like prisoner’s dilemmas where the other agent doesn’t follow FDT. If Flo played against Hal sans self-modification in a normal prisoner’s dilemma, she would reason that since Hal doesn’t follow FDT, there is no relationship, causal or otherwise, between what choice she makes and what choice Hal makes:
If FDT endorsed pressing the button, then I’d either make $999,000 or lose $1,000, depending on what Hal chooses.
If FDT endorsed abstaining, then I’d either make $1,000,000 or $0 depending on what Hal chooses.
It would be better for me if FDT endorsed abstaining; etc.; therefore I should abstain.
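Concretely, with no correlation between the two choices, Flo’s comparison is just ordinary expected value over her credence that Hal presses (the 0.5 below is a placeholder; the conclusion doesn’t depend on it):

```python
p_hal_presses = 0.5  # placeholder credence; any value in [0, 1] gives the same verdict

def flo_expected_profit(flo_presses):
    """Flo's expected profit when Hal's choice is independent of hers."""
    if flo_presses:
        return p_hal_presses * 999_000 + (1 - p_hal_presses) * (-1_000)
    return p_hal_presses * 1_000_000 + (1 - p_hal_presses) * 0

print(f"Press:   ${flo_expected_profit(True):,.0f} expected")
print(f"Abstain: ${flo_expected_profit(False):,.0f} expected")
# Abstaining beats pressing by exactly $1,000 for every value of p_hal_presses.
```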
This is a nice advantage! FDT doesn’t throw out normal causal reasoning most of the time, only in these weird predictive scenarios where it’s more advantageous to be a different kind of agent than a standard CDT-agent. To me, FDT feels like a straightforward upgrade from CDT…
…except.
4. Big Problems
Philosophers Wolfgang Schwarz (who refereed the Yudkowsky & Soares paper for a philosophy journal) and Will MacAskill (a founder of the effective altruism movement, and a legend) have published major critiques of FDT:
MacAskill: A Critique of Functional Decision Theory
Schwarz: On Functional Decision Theory
In future posts, I will delve deeper into these criticisms. Here are what I believe to be their three strongest points:
FDT has some very counterintuitive recommendations, such as refusing to pay $1 in blackmail to reliable predictors. These objections may be persuasive, but I find it just as counterintuitive to abstain in a twin PD, two-box in Newcomb’s problem, or refuse to pay in Parfit’s Hitchhiker. Worth grappling with, but not totally damning.
Yudkowsky & Soares have a circular definition of what makes FDT “successful”. What is the justification for thinking FDT is superior? Well, FDT agents do better than CDT agents in the dilemmas considered. What is the justification for thinking CDT is superior? Well, CDT’s actions do better than FDT’s actions in the dilemmas considered. This is no surprise, because FDT optimizes based on what kind of agent maximizes utility, and CDT optimizes based on what actions maximize utility. But which is the correct metric for judging which theory is “better”? Yudkowsky & Soares operate on the agent-based criterion, so it’s no surprise that they find an agent-based theory superior. But to a causal theorist, agent success is a contentious criterion.
This is more troubling to me. I think a rigorous argument from self-modification could resolve the issue: as we saw, CDT agents would self-modify to behave like FDT agents, but FDT agents don’t modify to be like CDT agents unless the problem is like “I will kill you unless you switch to CDT.” I plan to make this argument in a future post!
FDT, as Yudkowsky & Soares formulate it, relies on counterpossible reasoning and has no account of how to do that without explosion.
…this one will take some real work.
To broadly summarize, the principle of explosion in classical logic says that if you assume a mathematical contradiction, then you can prove literally any result from it, using this handy argument:
Assume p is both true and false.
Therefore p is true.
Therefore “p or q” is true, for any arbitrary proposition q.
But p is also false, and if “p or q” is true while p is false, q must be true.
Therefore q is true.
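For the formally inclined, those same two steps (disjunction introduction, then ruling out the p branch) go through mechanically; here’s a minimal sketch in Lean 4:

```lean
-- From p and ¬p, derive an arbitrary q:
-- first introduce p ∨ q from p, then eliminate the p branch using ¬p.
example (p q : Prop) (hp : p) (hnp : ¬p) : q :=
  (Or.inl hp : p ∨ q).elim (fun hp' => absurd hp' hnp) id
```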
FDT is supposed to be a well-defined mathematical function. If FDT(x) = a, but we imagine a world in which FDT(x) = b instead, then we have assumed a world in which a mathematical falsehood is true. This is called counterpossible reasoning, and thanks to the principle of explosion, it threatens to let us derive anything at all. Here’s Schwarz:
If A is a mathematically false proposition, then anything whatsoever mathematically follows from A. … So then anything whatsoever would be the case on a counterpossible supposition that FDT produces a certain output for a certain decision problem. We would get: If FDT recommended two-boxing in Newcomb's Problem, then the second box would be empty, but also If FDT recommended two-boxing in Newcomb's Problem, then the second box would contain a million, and If FDT recommended two-boxing in Newcomb's Problem, the second box would contain a round square.10 (emphasis in original)
This is devastating to FDT, because it means we can’t even be confident that FDT does in fact recommend one-boxing or twin PD cooperation or paying Parfit’s driver. If it does, then in the counterpossible world in which it doesn’t, the fundamental mathematics of that world must be very different and all hell breaks loose…
…That is, unless we find a better way of formulating FDT. There are logic systems out there that can account for counterpossible reasoning without explosion (I’m currently exploring hyperlogic as an option), and it may be possible to define FDT without using counterpossibles at all! Yudkowsky & Soares don’t really tell us how this is supposed to be done, but that means there’s room for a young rationality & alignment upstart (me! me!) to try to propose a path forward.
In summary, I think FDT is an awesome idea and really promising, but needs a lot of philosophical TLC. In my next few posts on this topic, I’ll be addressing (in no particular order):
“Nested” FDT problems, which require reasoning inside counterpossible worlds
Using counterfactuals instead of counterpossibles
Neutral arguments for the superiority of one decision theory over another
Promising systems for reasoning about counterpossibles sans explosion
Why it may be practically important to work on FDT (spoilers: robots!)
1. Hal 9000!
2. SkyNet!
3. Yudkowsky, Eliezer. “Newcomb’s Problem and Regret of Rationality,” January 31, 2008. https://www.lesswrong.com/posts/6ddcsdA2c2XpNpE5x/newcomb-s-problem-and-regret-of-rationality.
4. Strictly speaking, there are two kinds of rationality: epistemic rationality (believing all and only the things you have the best evidence for) and instrumental rationality (choosing the course of action that best serves your goals). Yudkowsky recognizes this and is explicitly talking about instrumental rationality.
5. Yudkowsky, “Regret of Rationality.”
6. Yudkowsky, Eliezer, and Nate Soares. “Functional Decision Theory: A New Theory of Instrumental Rationality.” arXiv, May 22, 2018. https://doi.org/10.48550/arXiv.1710.05060.
7. I tried to find a short evil robot name that started with an F (for FDT), but all my googling found was… the fembots. From Austin Powers. Yeah, not doing that.
8. Yudkowsky & Soares 3.
9. Yudkowsky & Soares 8.
10. Schwarz, Wolfgang. “On Functional Decision Theory.” December 7, 2018. https://www.umsu.de/wo/2018/688.