1. The Challenge
If I had a machine to modify my preferences and emotions in any way I liked, I would make some changes. I have pretty bad anxiety, which can be debilitating in stressful situations and takes medication to overcome much of the time; if I had the opportunity to reduce my anxiety, I would, maybe even below normal levels. (Although not too low.) I would love to change my tastes to love healthier foods, dislike unhealthy foods, and prefer vegan to non-vegan meals. I’d like it if scrolling around on my phone weren’t an addictive experience. And though I already like reading a lot, I think I’d like to like reading even more.
However, there are lots of self-modifications I wouldn’t make. If I could make myself experience indescribable happiness whenever I look at the color green, I wouldn’t, even though this would undoubtedly bring me a lot of pleasure. I wouldn’t make myself straight. And I wouldn’t make myself enjoy lying or hurting people either.
Recently, Silas Abrahamson and I had some disagreements about self-modification:
Looking back, I stand by my article, but I don’t like the title. “Preferences” is too ambiguous a term; there is a real sense in which it is only my preferences that make wireheading bad for me. Today, I’d like to gain some clarity on what “preferences” actually means, and what modifications I would actually make.
2. Wanting and Liking
A revealed preference is an account of how someone behaves. If Adam has the choice between a slice of cake and a salad for the same price, and chooses the cake, he has a revealed preference, or r-pref, for cake. This is true even if Adam believes that he values health and fitness more than gustatory pleasure. A revealed-preference theorist would say Adam is simply mistaken about his own preferences: if he really valued health that much, he would choose the salad.
Now, perhaps Adam has an r-pref for cake over salad, but wishes that he didn’t. This means that Adam has a metapreference: he prefers that his preferences were different. Imagine offering Adam a button which would, right now, replace his r-pref for cake with an r-pref for salad. If Adam would press that button, he has an r-pref for r-preferring salad. Simple enough, and real cases exist: alcoholics who go to rehab would qualify as having a strong revealed preference for alcohol, yet clearly would prefer not to prefer alcohol.
However, there is more to self-modification than r-prefs. Consider two people with an r-pref for mushrooms:
Merry eats mushrooms because they taste amazing to him.
Pippin eats mushrooms because he has a horrified fascination with fungal decay. They taste disgusting to him and he doesn’t enjoy the experience, but he fixates on them the way someone might fixate on a car crash or train wreck: he feels like he can’t tear himself away.
Both would choose to eat mushrooms over alternatives. Both could stop eating mushrooms if you pointed a gun at them and ordered them to. But Merry & Pippin have very different experiences! This is where Bryan Caplan goes wrong when he claims that mental illnesses are just revealed preferences, and Scott nails him for it:
Consider a migraine. If we think like behaviorists, all we can really say about migraines is that someone locks themselves in a dark room, clutches their head, and says “oww oww oww” a lot. If we put a gun to a migraineur’s head and threatened to kill them if they didn’t go to a loud party, they would grudgingly go to the party. … Or we could stop thinking like behaviorists, a philosophy which nobody has taken seriously since the 1970s. Once we agree that people are allowed to have internal states, and that the rest of us are allowed to acknowledge those internal states, the paradox disappears. We can agree that the essence of migraine headaches is pain, especially pain in response to strong sensations. … Both I (never had a migraine) and the average migraineur have a preference for not having our head be in terrible pain. But the migraineur needs to avoid bright lights in order to satisfy this preference, and I don’t. So she very reasonably avoids bright lights.
Similarly, neuroscience can distinguish between “wanting” (markers of desire, motivation, etc.) and “liking” (markers of hedonic pleasure). They involve different pathways in the brain, and you can manipulate one without manipulating the other. Here’s Scott writing about this ten years ago:
It may be that a person really does enjoy spending time with their family more than they enjoy their iPhone, but they're more motivated to work and buy iPhones than they are to spend time with their family. If this were true, people's introspective beliefs and public statements about their values would be true as far as it goes, and their tendency to work overtime for an iPhone would be as much a "hijacking" of their "true preferences" as a revelation of them. This accords better with my introspective experience, with happiness research, and with common sense than the alternative.
So it seems to me there are at least two axes along which you can modify “preferences.” “Wanting” seems to me what r-prefs are all about. “Liking” is neglected entirely! So from here on out, I’m going to avoid the ambiguous term “preference” and refer to wanting and liking. Wanting can be observed through an agent’s choices and trade-offs. Liking can be observed introspectively by sensations of pleasure, and externally by chemical signals like endogenous opioid release (according to the aforementioned paper, although the chemical substrates are much more complicated than that).
Finally, since modifications of “preferences” can include modifying wanting and/or liking, we should have some terminology to distinguish between cases (a toy sketch follows the two definitions below):
A coupled modification impacts liking and wanting in the same direction: hacking yourself to get joy and motivation for doing the right thing, or making browsing the news less enjoyable and less compulsive.
A decoupled modification impacts liking and wanting differently: you could increase the pleasure you get from watching a sunset without increasing your motivation to go watch one, or you could increase drug craving without increasing the pleasure that you get from the drug.
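To make the taxonomy concrete, here is a minimal toy sketch treating wanting and liking as two independent dials per activity. The class, activity names, and numbers are all made up for illustration; nothing here is meant as a model of the actual neuroscience.

```python
# Toy sketch: "wanting" and "liking" as two independent dials per activity.
# All names and numbers are illustrative assumptions, not real neuroscience.
from dataclasses import dataclass, field


@dataclass
class Mind:
    wanting: dict = field(default_factory=dict)  # motivation / revealed-preference strength
    liking: dict = field(default_factory=dict)   # hedonic pleasure

    def coupled_modify(self, activity: str, delta: float) -> None:
        # Coupled: both dials move in the same direction.
        self.wanting[activity] = self.wanting.get(activity, 0.0) + delta
        self.liking[activity] = self.liking.get(activity, 0.0) + delta

    def decoupled_modify(self, activity: str, d_want: float, d_like: float) -> None:
        # Decoupled: each dial moves independently, possibly in opposite directions.
        self.wanting[activity] = self.wanting.get(activity, 0.0) + d_want
        self.liking[activity] = self.liking.get(activity, 0.0) + d_like


me = Mind()
me.coupled_modify("doomscrolling", -2.0)                 # less compulsive AND less enjoyable
me.decoupled_modify("sunsets", d_want=0.0, d_like=3.0)   # more pleasure, no extra motivation
```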
3. Terminal and Instrumental Values
In his first comment to me, Silas wrote:
Suppose I hate mushrooms and love liquorice. Liquorice is hard to get ahold of, but mushrooms are easy. That means that I would have more preferences satisfied if I liked mushrooms.
One day I hit my head, making me suddenly have the reverse preferences. Was this good for me? It seems very much to me like it is! I can now satisfy my preferences much more easily than previously. Prior to hitting my head, I might have some revulsion to the thought of hitting my head like this, as liquorice is so tasty and mushrooms suck--but this is me failing to realize that my preferences would be better satisfied by hitting my head.
Getting hit in the head here is likely a case of coupled modification; changing the way food tastes to you likely affects wanting as well as liking. The question of whether something is better for me is a question of value. Whether this is identical to wanting and/or liking is up for debate, so let’s be content to call it value and later try to figure out whether it reduces to some combination of wanting and liking. So, is there more value in the liquorice-eater getting hit on the head and changing their tastes?
Well, it depends. Why does Silas want liquorice? Assuming he is not simply a liquorice addict—assuming, rather, that he both wants and likes liquorice—there are two ways this could be:
1. He wants liquorice because he likes it. He places value on receiving gustatory pleasure; the liquorice gives him gustatory pleasure; therefore he values liquorice as a means to an end. This is instrumental value.
2. He wants liquorice for some other reason than liking it.
2a. He believes eating liquorice builds character, or jaw strength, or makes him more attractive. This is still instrumental value.
2b. He believes that eating liquorice is itself a good, ceteris paribus, irrespective of its other consequences. Just like some people believe happiness is good in and of itself, and does not need further justification. This is terminal value.
So, is modification good for him? In case #1, definitely. Liquorice consumption is the means to an end of gustatory pleasure, and modification gives him an easier means to that end. For #2a, if his beliefs are accurate, then probably not. Disliking and un-wanting liquorice makes it harder for him to pursue character/strength/attractiveness. But what about #2b? That is the ultimate point of contention between Silas and me; I’m not addressing it directly yet. But first observe that, overwhelmingly, people who value liquorice value it instrumentally. I have never heard of anyone who deep down believes that the consumption of liquorice in and of itself is good, who believes that we should tile the universe in unthinking, unfeeling automatons scarfing down chewy confections. So the intuition that Silas is building upon is a little weaker here.
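To make case #1 concrete, here is a toy back-of-the-envelope calculation; the availability and pleasure-per-serving numbers are pure assumptions chosen for illustration, not anything empirical.

```python
# Toy arithmetic for case #1: the value is instrumental, and the end is gustatory pleasure.
# Availability and pleasure-per-serving numbers are invented purely for illustration.

def weekly_pleasure(servings_per_week: float, pleasure_per_serving: float) -> float:
    """Pleasure actually obtained = how often you can get the food * how much you enjoy it."""
    return servings_per_week * pleasure_per_serving

# Before the bump: loves liquorice (hard to get), hates mushrooms (easy to get).
before = weekly_pleasure(1, 10) + weekly_pleasure(7, 0)   # 10 units of pleasure per week
# After the bump: tastes are reversed, so the easily available food is now the loved one.
after = weekly_pleasure(1, 0) + weekly_pleasure(7, 10)    # 70 units of pleasure per week

assert after > before  # the modification serves the unchanged end (gustatory pleasure) better
```

If the end itself is what gets overwritten (case #2b), this arithmetic no longer settles anything; that is the disagreement the rest of the essay is about.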
Now, in the terminal value case, we’ve said that Silas will now like and want mushrooms more than liquorice, but we haven’t specified whether he will value them more. I think it is possible to no longer like or want something—where “want”, remember, refers to motivation and r-pref—while still giving it terminal value. Imagine you assigned high terminal value to creating art, then were transformed to find art creation unmotivating and unpleasant, but still assigned art creation high terminal value. Now imagine that the transformation would change what you valued so that you no longer assigned any terminal value to creating art. Both options seem possible to me.
Is value something distinct, or is it simply a different form of liking and/or wanting? Notice first that value can be distinct from what are typically considered metapreferences: an agent could place value on “doing hard things that you really don’t want to do.” Such an agent wouldn’t want to self-modify to make those hard things motivationally easy and phenomenally pleasant—that would spoil it!
A promising candidate for value would be liking or wanting that a particular agent-neutral state of affairs has come about. In a liking frame, maybe you don’t like the action or process of creating art, but you like the fact that you are creating art. In a wanting frame, maybe you are not motivated to make art, but are motivated to precommit to creating art by binding contract. I’m not quite sure if that’s identical to the cluster of things I think about when I think about value, but it’s close, and may be a helpful guide.
4. The Changes I’d Make
A list of things I’d do if I could reprogram myself:
First, if wanting and liking can really be systematically decoupled, I would never decrease liking except for evil acts.
The purpose of decreasing desire for something is that I don’t want myself to do that thing as much as I otherwise would, even if in the moment I would enjoy it. This is largely the purview of wanting, not liking.
Meanwhile, I see no reason not to enjoy almost every experience, as long as the enjoyment doesn’t serve as motivation to do that thing. Maybe I can decrease my wanting to read politics so much that I never voluntarily doomscroll valueless culture-war nonsense. But occasionally it is very important to check politics, so why not enjoy myself while I’m doing it?
I am uncomfortable, as Silas pointed out, with taking pleasure in evil things, even if they don’t motivate me to do them. Perhaps this undermines virtue, and virtue has some terminal value?
If liking can’t be decoupled, I would be okay changing liking negatively for the “decreases” in all of the sections below.
I would definitely increase liking for necessary tasks that are struggle points for my anxiety.
For instance, bills, contracts, keeping up with doctors, job applications, competitions, meeting new people, loud sounds, tight deadlines… and many other things that are too personal and painful for me to share here.
I would radically change wanting, both increases and decreases.
Increases: reading dense books, giving to charity, maintaining good epistemics, taking risks, getting exercise, learning new things, being kind, taking care of boring but necessary business, developing new skills, doing spaced repetition flashcards, spending time with friends, spending time appreciating aesthetic beauty, concentrating for long periods of time, doing EA good, etc.
Decreases: eating unhealthy food, eating animal products, browsing/scrolling, random speculation about politics, getting petty revenge, passive media consumption, spending money on non-essentials that aren’t investments in anything else, etc.
(If I partook of any drugs or alcohol, I would put those in the decreases too, but I don’t, not even caffeine.)
And if wanting and liking were truly decoupled… I actually don’t know why I wouldn’t just throw liking to maximum for every non-evil thing in my life?
The only reason I object to wireheading is that I wouldn’t do, experience, or have other things of value—I would be totally unmotivated to do those things. But if I still had all the motivation and did those things… well, why not?
The main obstacles I can think of are practical: habituation, or pleasure spilling over to affect other functions.
I suppose insofar as I value knowledge, understanding, and wonder, then there is some value in knowing what it is like to have pain, sorrow, anger, fear, etc. Perhaps, as many apologists have argued, there are some wonderful experiences that can only be achieved with negative experiences “mixed in.” This is a terrible defense nine times out of ten—a weird number of people deploy it when defending animal suffering, which makes no sense to me—but actually may be relevant here.
If these concerns do end up mattering, then at the very least I would raise my baseline happiness to the high normal human end, and enhance pleasure for low-risk stuff that I already value—like reading, learning, and exercise—to that of the people who love it most and are still emotionally stable.
In reality, prudence would moderate a lot of this stuff. I am not a professional neuroscientist,[1] and the consequences of screwing this up could be pretty bad.
I would not change a single one of my terminal values.
5. Changing Terminal Values
For reference, my terminal values (as far as I understand them) include: pleasure, compassion, fun, knowledge, beauty, truth, and freedom. All of these values extend to all other sentients, not just myself. Unfortunately, I don’t always have r-prefs for these things, or even find them pleasurable. I intellectually understand that shrimp welfare is quite plausibly very important, but thinking about shrimp doesn’t move me the way thinking about people does. (Writing that sentence did prompt me to set up my recurring donation to the Shrimp Welfare Project, though.)
Now, I doubt anybody else would change their terminal values either. If your inability to fulfill a value is making you unhappy, just change your happiness. If it demands that you do things you don’t really want to do, just change your wanting. If there is a good argument out there to suggest my value shouldn’t be terminal, I can simply change my mind the old-fashioned way—plus tweaking myself to increase motivation to have true beliefs. If the only thing that matters to you is having your preferences satisfied… then it sounds like that’s your only terminal value, and I’m not even sure it’s possible to value not having your values satisfied (contradictory, no?).
Perhaps from a pragmatic standpoint it is easier to eliminate weak values than to tweak all the incentives around them: for instance, if I value power, but not nearly as much as I value freedom from corruption, maybe the cost-benefit favors just eliminating all positive associations I have with power wholesale rather than trying to isolate only the bad ones. But if your terminal values are not logically inconsistent, what possible motive could you have for changing them, if they are not justified by any further reasons?[2]
This is a somewhat banal observation, and I want to stress that this is not the main point I was arguing in my last essay. I take it as obvious that nobody wants to change their terminal values. The actual debate centers on whether it would be bad for my terminal values to be changed by some sort of unintentional freak accident. Certainly it is bad by my current values. It is not bad by my future values. Does that mean it would be better for me to avoid such a thing? Would it be better for me if someone reverted the change, despite my future self’s protestations? Does it infringe on my self-governance to be modified in this way—or does making binding commitments to change back infringe on my future self’s liberty? These are tougher questions. Clearly, I have some Laurie Paul to read!
I also asked Gemini Pro to fact-check this piece, mostly for the neuroscience stuff, but it did criticize my breezy treatment of what terminal values actually are, and the way I treat them as “static, discrete, and axiom-like.”[3] This is fair, too. I confess that I don’t actually know what terminal values are. I am pointing towards a cluster of “that stuff,” and trying to reason about it. More work to be done on this in future.
6. My Challenge to You, Reader Dear
Let’s assume for now that my present self, writing this essay, and my future self after a radical transformation are ontologically one—pre-Jack and post-Jack just refer to the same person at different points in time. Pre-Jack says it is bad for me to have my terminal values altered. But post-Jack also says it is bad for me to have my terminal values altered. After all, post-Jack has a new set of terminal values which he wants to protect. Post-Jack just thinks that the values that he has right now are most important, and there should be no changes either to past or future values.
If there is no point in time at which I prefer—in the sense of want, like, or value—that my terminal values be changed, and all that matters for what makes my life intrinsically good/bad is preference satisfaction, then how can it ever be good that my terminal values be changed?
You could say the terminal value change is for my own good. But if I had perfect introspection and perfect knowledge of the consequences and all their ultimate effects, at no point in my life would I ever like, want, or value being changed. So unless you are a hedonist or have a weird type of objective list… in what sense can it be for my own good?
[1] I mean, I am lead author on a forthcoming paper with a researcher from Princeton Neuroscience Institute… </brag> Seriously, though, I’d need to know a lot more.
[2] Okay, fine, you can build arbitrary predictor problems where Omega murders anyone who intrinsically values the opera. Absent that, though, there are no reasons I should change my values.
[3] Funnily enough, it also accused me of fabricating my association with PNI, because it couldn’t find the paper I have yet to publish. And, more fairly, because I am not faculty there. Honestly, I’m glad it’s not just gushing over how much of a genius I am, the way GPT-4o does.
I really liked this (and wanted to read it)! And I think your analysis is very compelling!
As for the value question, this has given me a lot to think about, but just as an initial reaction, I feel myself pulled in two directions:
On the one hand your wire-heading examples are quite intuitively moving, and I share your feeling that I would never want to change my terminal values.
On the other, there is something incredibly strange to me about how much weight this puts on procedure. If I start out having relationships as a terminal value, it would be bad for me to be wire-headed into having hermitage as a terminal value. But if I started off with the reverse preferences, the reverse change would also be bad.
Suppose I am about to have a child, and I can choose what he will be like. I can either choose to have hermit-Silas Jr. or relationship-Silas Jr. Suppose also that life would be somewhat easier for hermit-Silas Jr. In that case, it seems like I should choose to have him, as his life would be better due to his terminal values being more easily attained or something. However, if I already had relationship-Silas Jr., it would be bad for me to wire-head him into hermit-Silas Jr. That is really strange to me!
We do see these sorts of structures in theories of action (e.g. with deontic side-constraints), but it is a lot stranger to me to have them in a theory of value.
In any case this is very good!
(Also just a fun sidenote: I actually think the view you seem to be gesturing at fits better with the (admittedly imprecise) idea that I shouldn't care whether my life is objectively bad. If I should sometimes be wire-headed despite not wanting to, that seems like a case where I should in some sense care about something I don't--whereas if what matters is my current terminal values, what matters is just what I care about (in a more direct sense)).
I think I used to relate to terminal values ~the same way as you do. I no longer do.
I think I have several objections. The first objection is game theoretic. In certain situations, becoming an agent that terminally values vengeance (or reciprocity) may yield greater utility according to my other values than being an agent that only values vengeance when the reputational benefits exceed the costs. This is because an agent that only instrumentally values vengeance might have "one thought too many" when it comes to vengeance. So (e.g.) you can be exploited by other agents who deem that you won't seek vengeance on them, because they deem that, after being aggrieved, you'd realize that following through no longer secures the long-term benefits your credible commitments were supposed to buy.
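To sketch the deterrence point with made-up payoffs (assuming, purely for illustration, that the exploiter can see your disposition and only strikes when it pays):

```python
# Toy deterrence game with transparent dispositions; all payoffs are invented for illustration.
# An exploiter observes whether you would actually retaliate, then decides whether to wrong you.

EXPLOIT_GAIN = 5      # what the exploiter gains by wronging you
VICTIM_LOSS = -5      # what you lose when wronged
VENGEANCE_COST = -2   # what retaliating costs you
VENGEANCE_HARM = -10  # what retaliation costs the exploiter

def exploiter_wrongs_you(you_would_retaliate: bool) -> bool:
    # A transparent exploiter only strikes when it pays off for them.
    payoff_if_wrong = EXPLOIT_GAIN + (VENGEANCE_HARM if you_would_retaliate else 0)
    return payoff_if_wrong > 0

def your_payoff(terminally_vengeful: bool) -> int:
    # The merely-instrumental agent re-evaluates after being wronged, finds vengeance
    # a pure cost, and declines it -- so the exploiter correctly predicts no retaliation.
    wronged = exploiter_wrongs_you(you_would_retaliate=terminally_vengeful)
    if not wronged:
        return 0
    return VICTIM_LOSS + (VENGEANCE_COST if terminally_vengeful else 0)

assert your_payoff(terminally_vengeful=True) == 0    # never wronged in the first place
assert your_payoff(terminally_vengeful=False) == -5  # predictably exploited
```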
The second objection is a generalization of the first in other multi-agent dilemmas with transparent values. For example some parties may only agree to contracts with you if you demonstrate that you terminally value fulfilling contracts (or a close proxy like keeping promises) at >1%.
Note that while these examples require transparent values, I consider them much more realistic than "arbitrary predictor problems" as you put it.
The third objection comes from the opposite of transparent values. Our current terminal values, for most of us, are murky not only to others but to ourselves. While idealized agents may have clean separation of terminal and instrumental values, I'm not convinced that humans do. (And indeed empirically I observe many people either being confused about their values, or being very quickly certain about their values in ways that I think are not tracking important epistemic processes). Thus, I think it is bad to prematurely "lock in" what you perceive to be your terminal values, as much of the difficulty is figuring out what those values are in the first place, consistent or otherwise.