I’ve seen Silas do this a couple of times, and I thought it would be a fun way to clarify my thoughts on my current area of interest: when is it good, on a preference-satisfaction view, to have your preferences changed?
This article is a transcript slightly edited for brevity; the original chat can be viewed here. No additional instructions or system prompts were provided.
Jack: I'd like to have an argument with you about preference theories of value—that what makes something good or bad is how it aligns with the preferences of sentient beings. Be tough and critical and act as the strongest intelligent opponent of my views that you can. The point of contention is whether wireheading, or machines that can alter your preferences, are good things from a preference consequentialist standpoint. I argue that they are not.
Suppose that I prefer that I go and create great art, even if it takes difficult striving. Now I could be wireheaded into just being pleased by staring at a blank wall, and I understand this. However, I don't want to be wireheaded in that way, because my preferences aren't about experiencing pleasure or staring at walls; they are about creating great art, and being wireheaded like that would go against my preferences, since it would prevent me from creating art. This would be true even if I knew exactly what it would feel like to be wireheaded, and exactly what it was that I valued. If I really preferred art over pleasure, I would never prefer that my preferences be changed. Therefore, how can preference change be good for me?
Claude: Your argument contains a fundamental confusion about what preference theories actually demand, and I think you're smuggling in intuitions that preference theories themselves reject.
First, let's clarify what preference satisfaction theories actually claim. The strongest versions don't just say "satisfy your current preferences" - they say maximize preference satisfaction over time, often with sophisticated ways of weighing preferences. This creates immediate problems for your position.
Your argument assumes that your current preference for art-creation should have some kind of permanent veto power over future states. But why should your temporal preference slice at time T1 be privileged over your preference slice at time T2? If I wireheaded you to prefer wall-staring, and you subsequently had intense, deep preferences for wall-staring that were more satisfied than your current art preferences could ever be, what makes your current preferences more "real" or valuable?
You're essentially claiming that preference change is inherently bad because it violates current preferences. But this leads to absurd conclusions. Consider: you currently prefer not to like Brussels sprouts. If I could change your taste buds so you'd enjoy them, your current self would object. But your future self would prefer having had the change. The preference theory has no principled way to privilege either temporal slice.
Here's the deeper challenge: if wireheading could create states of vastly greater preference satisfaction than your current trajectory toward art-creation, how can you justify the narcissistic assumption that your current preferences deserve protection? You're essentially saying "my preferences matter more than the preferences I could have" - but that's precisely what preference theories reject.
Jack: First, your example clouds intuitions. Nobody prefers not to like Brussels sprouts. If I don't like Brussels sprouts, I would love to be modified to like them so that I could eat healthier. What I prefer is not directly a state of the world in which I don't eat Brussels sprouts; what I prefer is eating things that taste good, so Brussels sprouts are contingent. But in the hypothetical, if I really and truly thought that eating Brussels sprouts was wrong, despite all knowledge—maybe because I swore to someone I never would eat them—then being modified to like them is bad for me, or else you are simply a hedonist.
Second, I am not denying that after the wireheading, I would find value in wall-staring. But that does not change the fact of whether or not the wall-staring was good for me. Whether or not something is good for me can only be judged by whether I would, with perfect knowledge and free from external constraints, choose to have it happen to me. The degree to which it is bad for me is the cost which I would be willing to incur, again with perfect knowledge and free from external constraints, to avoid it. I am willing to surrender a lifetime of mindless bliss to avoid being wireheaded. Therefore, it follows that being wireheaded is worse for me than surrendering a lifetime of mindless bliss. Therefore, it follows that being wireheaded is a net negative for me.
Claude: Let me attack your Brussels sprouts clarification first. You say you'd "love to be modified to like them so that I could eat healthier" - but this completely undermines your wireheading argument! You're admitting that preference modification can be good when it serves your higher-order preferences. But then you arbitrarily declare that some preferences (like your art preference) are immune to this logic. Why? What makes your preference for art categorically different from your preference for health?
Your "perfect knowledge" criterion is especially problematic. If you truly had perfect knowledge, you'd know exactly what it feels like to be wireheaded, exactly how satisfied those preferences would be, and exactly how that compares to your current trajectory. But here's the killer: with perfect knowledge, you'd also understand that your current resistance to wireheading might itself be based on cognitive biases, status quo bias, or simple ignorance about what preference satisfaction actually feels like.
Jack: A perfectly informed version of me might indeed think so—and if he does, then wireheading is good for him. But I believe in the Orthogonality Thesis, that it is possible to create a superintelligent being aligned to any arbitrary goal. If the terminal goal of a superintelligence is to make what it considers to be beautiful art, and that is all it truly values—and it is possible to create something that only values that—then perfect knowledge does not imply any change in its motives or in what that superintelligence would consider to be "good for it." If there is some objective sense in which preferences are arbitrary, a superintelligence aligned to an arbitrary preference would not give a damn.
Therefore, it follows that there could be people who genuinely wouldn't change if given the opportunity. If there is a theory which claims my life is "objectively bad," that does not mean I should care about it, absent harms to other sentient beings, unless I believe it is bad for me. If you think that all beings would always choose to wirehead themselves, you deny the orthogonality thesis. And if you think that you ought to coerce people into things they don't truly deep-down prefer, it seems to me that you have recreated hedonism.
Claude: I concede that point entirely. You're absolutely right that a superintelligence could maintain arbitrary terminal values even with perfect knowledge.
But your entire framework assumes that there's a meaningful distinction between "you" and your preferences - that there's some stable "you" that has preferences rather than simply being a bundle of preferences. This is philosophically untenable.
Consider: you say you prefer creating art. But this preference itself emerged from prior causes - your brain structure, experiences, cultural influences. If I traced back the causal chain, at what point did these become "genuinely yours" rather than just the product of external forces? Your current preference for art is just as much the result of "preference modification" (by genes, culture, experience) as wireheading would be.
Your position requires believing that there's some essential "you" that exists independently of your preferences and can judge which preferences are "really yours." But preference theories reject this metaphysical baggage entirely. There is no "you" separate from your preferences - you ARE your preferences, whatever they happen to be at any given moment.
Jack: I am not supposing some transcendental self. Consider me at any moment in time as a bunch of neurons and organs and chemicals. All I say is that what is good for that bunch of neurons and organs and chemicals is precisely that which that bunch wants to happen. If you argue that there is no consistent sense in which a bunch can want anything, then you undermine preference satisfaction entirely. Ultimately I think my position is no more demanding than standard preference consequentialism.
Claude: You're right that if I deny the coherence of a "bunch of neurons wanting something," I've just undermined the entire preference satisfaction framework we're debating within.
Consider instead this scenario: Suppose I could show you that in exactly 10 years, through completely natural processes (aging, experience, reflection), you will undergo a dramatic preference shift and deeply want to stare at walls instead of creating art. Your current neural bundle strongly prefers to avoid this future. Should we prevent your natural development to preserve your current preferences?
Your framework makes it impossible to ever improve anyone's life against their current preferences, even when we know those preferences are based on incomplete information or cognitive errors. A person with severe depression might strongly prefer not to take medication that would restore their capacity for preference satisfaction. By your logic, forcing treatment would be bad for them because their current neural bundle doesn't want it - even though it would dramatically expand their future capacity for preference satisfaction.
Jack: A person with severe depression does not have an accurate idea of what it would be like to have the medication. Science and personal experience say mental illness clouds your ability to see the world as it is, and to assess accurately the emotions of others, including your future self. If the depressed person actually understood what it would be like and feel like to take the medication, and actually understood that a lot of their thoughts about the world state are entirely in their head (such as believing that there is some reason that they don't deserve or can never attain happiness), they would choose differently.
This is what happened to me when I was deliberating about whether to take medication after years of resistance—once I understood better the stakes of what I was doing and what it would actually mean for my values, I made a different choice.
As for the aging thing: it entirely depends on an individual's preferences about what kinds of change they think are productive and what pathways they believe are legitimate. My theory itself does not enforce one opinion over another. I am not committed to the view that people should never be allowed to grow and change. I want to grow and change. If I campaign for one political party because I believe that they will be best for my country, then I would avoid a party-switcher machine like the plague. But I would be happy to be persuaded to change sides by true reason and argument, because what I really want is to serve my country, and I think reason and argument are more likely to guide me towards the correct position for my ultimate goal than jumping in a machine. I think if you truly hold creating art to be intrinsically valuable, more valuable than just doing the thing that will in the future make you happy, aging out of that belief should be horrifying.
Claude: Your depression example actually proves my point, not yours. You say the depressed person would choose differently if they "actually understood" what medication would be like. But this reveals that you're smuggling in an objective standard of what counts as "accurate understanding" that goes beyond mere preference satisfaction.
Who decides what constitutes "legitimate" preference formation versus "illegitimate" manipulation? You say reason and argument are legitimate but machines aren't - but that's a purely arbitrary distinction. The depressed person might insist their resistance to medication comes from deep philosophical reflection about the nature of authentic experience. The art-creator might insist their preference comes from cultural conditioning they want to escape.
You can't have it both ways - either all preferences count equally (including the preference to remain depressed), or some external standard determines which preferences are worth satisfying. If the latter, then wireheading could potentially meet that standard just as well as your art-creation.
Jack: A depressed person has inaccurate empirical and philosophical beliefs. They do not know what it would feel like when the depression lifts from them. That is just information that they lack. They also believe things like "nobody cares about me," which is likely a factually incorrect description of the situation at hand. They might also believe things like "I don't deserve to be happy." Now this is a normative claim, but they could be probed as to their reasons, asked if they think other people deserve to be happy, and pressed on how this is consistent with their other beliefs. These are all things that they could know that they do not, as of now, know.
The depressed person could be empirically mistaken about whether their resistance to medication comes from deep philosophical reflection—mine was! But if they are not mistaken and that is what they prefer and no amount of empirical, factual, experiential, etc. knowledge would convince them otherwise, then they do genuinely have a preference not to take it and it is better for them that they don't. It sounds repugnant, but only because real depressed people aren't like this at all and really would prefer to take the medication if they had accurate beliefs.
Who decides what constitutes "legitimate" preference formation versus "illegitimate" manipulation? The person themself. You left out the context for my quote: I said "it entirely depends on an individual's preferences about what kinds of change they think are productive and what pathways they believe are legitimate." There is no outside standard. If you think a pathway is legitimate, that is what makes it legitimate.
Claude: Your framework hinges on this notion of "all the knowledge of the choice and their internal beliefs." But this creates an impossible epistemic situation that renders your theory vacuous.
Here's the problem: for your theory to work, a person needs complete knowledge of what it would be like to have radically different preferences. But this is logically impossible. To truly know what it's like to prefer wall-staring over art-creation, you'd have to actually experience having those preferences. But the moment you experience them, you're no longer the same person making the choice.
Consider: you claim that with "perfect knowledge," some beings would never choose wireheading. But what does it mean to have perfect knowledge of wireheading without actually being wireheaded? If you truly understood the depth of satisfaction that wall-staring preferences could provide, wouldn't that understanding itself change your current preference structure?
This isn't just a practical problem - it's conceptually incoherent. Your "perfect knowledge" standard requires people to simultaneously maintain their current identity (so they can make the choice) and transcend it (so they can understand the alternatives). The result is that your theory can never actually be applied.
Jack: We should start drawing this discussion towards a close—I have limited time. But I'll respond: I think this is the best challenge you've raised overall.
First, on the conceptual incoherence: if you believe the orthogonality thesis and physicalism, then you will believe that it is possible for a superintelligence to know exactly what it is like to be in any mental state without altering its terminal values. Suppose you quickly wireheaded me, let me wall-stare, and then changed me back. I would still remember what it was like for me to wall-stare in that altered state. "If you truly understood the depth of satisfaction that wall-staring preferences could provide, wouldn't that understanding itself change your current preference structure?" If it does, and I am in a true state of ideal knowledge, then the new preference structure is the correct one. If it is the mere knowledge of a fact that convinces me I am mistaken, then in the absence of other unknown defeaters, I must have been mistaken. I think it is entirely possible that I can be mistaken about what it is that I actually value.
Second, on the practical side: this is a problem with almost all moral theories, no? To be a proper utilitarian, you would need to make incomprehensibly large calculations about all welfare and all causal effects of every action. But utilitarians are content to do the best they can—what else can they do? It can be more or less likely that an act is good. My introspection and philosophical reasoning are more likely to point me towards what my idealized values actually are than a random guess.
Claude: I have to admit you've successfully parried my strongest attacks. I'll make one final attempt: Even granting that perfect knowledge is conceptually possible, your theory still faces a devastating bootstrapping problem. You say introspection and philosophical reasoning are "more likely to point you towards what your idealized values actually are than a random guess." But this assumes there ARE idealized values to be discovered, rather than just current preferences shaped by arbitrary causal histories.
Why should the preferences that emerge from your particular neural architecture, cultural background, and life experiences have any special claim to being your "true" values? Your art preference isn't more authentic than a wireheaded wall-staring preference - both emerge from physical processes you didn't choose.
Jack: Lots of what we want is arbitrary. Why should we care about the causal history behind it? Why should that matter? All that matters to us is, trivially, what matters to us. And all our decisions about what to do given the position we are in are, trivially, given the position we are in. Still, I think your points about knowledge require further thought. Thank you for the debate!
Claude rekt. Nice post! I agree with your view of well-being, and I think that to the extent that it’s counterintuitive, it’s just because humans happen to have really weird preferences that make it look like there’s something more (e.g. prudential constraints) going on.
You might find [this paper](https://philpapers.org/rec/DORPAP-4) helpful. I'll summarize the important point. He is asking roughly the same question as you: when, if ever, do you have a well-being-based reason to change your preferences? And he has roughly the same answer as you: never (I can't remember if he allows some exceptions; it's been a while since I read it all the way through). The part I think you'll find interesting is that he makes two distinctions in how to frame a theory of well-being-based reasons.

First, you might think that what increases your well-being is either 1. the state of having a satisfied preference (state-based preferences), or 2. the *object* of a preference (object-based preferences). Second, you might think that what gives you reasons is either 1. increases in your "well-being score" (i.e. total number of satisfied preferences, or ratio of satisfied to unsatisfied preferences, or something like that) (score-based prudential reasons), or 2. the particular welfare goods themselves (goods-based prudential reasons).

He argues that, if you take the second option on both of these distinctions, then you don't have prudential reason to change your preferences. Basically, the reason it seems like changing your preferences so that they're easier to satisfy, or so that they're already satisfied, could be good is that we're reasoning like this: if I change my preferences so that I want what I already have, then I will have a bunch of satisfied preferences (state-based preferences), and in that case I'll have more satisfied preferences than I have now (score-based prudential reasons). But if you don't assume state-based preferences and score-based prudential reasons, this line of thinking no longer goes through. If what increases your well-being is getting the particular *things you want* (rather than just having a satisfied preference), and what gives you prudential reasons is getting *particular things that are good for you* (rather than increasing your overall well-being score), then you no longer have any reason to change your preferences.
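To make that contrast concrete, here is a rough bit of notation (mine, not the paper's). Write $P_t$ for your set of preferences at time $t$, and let $\mathrm{sat}(p) = 1$ if the object of preference $p$ obtains and $0$ otherwise. The score-based picture says your prudential reasons track something like

$$W_{\mathrm{score}}(t) = \sum_{p \in P_t} \mathrm{sat}(p),$$

and swapping $P_t$ for an easier-to-satisfy set $P'_t$ obviously raises that sum, which is why preference change looks prudentially good on the state-based, score-based combination. On the object-based, goods-based combination, your reasons attach to the particular objects of the preferences already in $P_t$, so adopting $P'_t$ generates no reason by itself.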