Efficient parrots need "understanding"
The intentional stance is a must
In Intuition Pumps and Other Tools for Thinking, Dennett offers this thought experiment. You find two black boxes, A and B, connected by wire. A has two buttons, α and β. B has two lights, red and green. You observe after countless trials that pressing α always causes the red light to flash, and pressing β always causes the green light to flash. As a scientist, you want to know what the causal explanation for this is. However, when you crack open the boxes, you find millions of circuits performing befuddling calculations at intense speeds. Here is what you know:
Pressing α or β always sends a 10,000-bit string from A to B. But the strings vary with each press.
You can “artificially” send bitstrings with an electrical impulse. If you send a bitstring which was sent in a past α-press, it always flashes red. If you send a bitstring which was sent in a past β-press, it always flashes green.
Whenever you try to create an original bitstring, the odds are overwhelming that neither bulb lights up. However, once or twice you have succeeded in creating an original bitstring that lights up green or red, without having previously been sent via button-presses. If that bitstring shows up later by button-press, it always has the same result.
In the case of each individual bitstring, you can perfectly trace the proximate physical causes: this bit was read, which flipped this switch, which routed to this subcircuit, which… and so on. This is mere machinery, and deterministic machinery at that.
However, so far your fellow scientists have failed to discover a systematic explanation for what makes α-bitstrings trigger red, and what makes β-bitstrings trigger green.
Now, before I reveal Dennett’s trick, I want to turn it to my own uses. While you and your colleagues have made negligible progress, in walks a fellow we’ll call Al. Working with pencil and paper, Al enters fifteen artificial bitstrings into the machine. Every single one is followed by a red flash from box B. He then looks over your attempts at artificial bitstring entries and identifies which ones produce a green flash, a red flash, or neither, with 95% accuracy. Finally, you lob new artificial bitstrings at Al all day, even using a quantum random number generator to source them; again, he correctly predicts 95% of them prior to testing.
Does Al understand how the machine works? Do you know that for certain?
In principle, no. It is theoretically possible that Al has merely memorized the layout of all the switches, circuits, transistors, and the like. And it is also theoretically possible that Al can work out every single proximate causal interaction from bitstring to bulb at lightning-fast speed. In this case, Al could perform all of his amazing tricks just by “testing out” the bitstrings on paper before giving his answer. This is an incredible mental feat, but it seems that Al himself does not really understand how the machine works any more than you do.
But there are many explanations theoretically compatible with Al’s achievements, including that he just got extremely lucky or can see the future. Which explanation deserves your highest Bayesian credence? Simply, Al knows something we don’t! If the machine is obeying some simpler law we just haven’t been able to detect yet, then if Al had any way of discovering this law, these predictions would be a piece of cake. While our searching and failing makes it somewhat less likely that such a law exists, it’s more than balanced out by how implausible it is that Al can manually simulate thousands of physical interactions on 10,000-bit strings! Understanding is the most likely explanation.
After going away and working for a while, Al produces a smaller, sleeker machine with a keyboard. Enter a bitstring, and the little machine flashes green or red. While the original machine has millions of circuits, Al’s has merely thousands. And, like Al, the little machine is nearly always right. What is the explanation for the little machine’s success?
If you try to analyze all the proximate causal interactions, you will be nearly as flummoxed as with the original machine: thousands of circuits are far fewer than millions, but still far too many to mentally process at once. Again, it is theoretically possible that some miraculous chance aligned the little machine with the big one, but it’s extraordinarily unlikely. Again, the reasonable explanation is that the little machine “understands” something we don’t: it compresses all the complicated proximate causes into some much simpler procedure, one that offers a systematic rule for which strings produce which flashes. I say “understands” in air quotes merely to point out the analogy to Al’s understanding, not to suggest that this little machine has mental states. You may replace “understanding” with any other term you like.
Now, let’s return to Dennett as he reveals the secret of the two boxes:
Al, who had built box A, had been working for years on an “expert system”—a database containing “true propositions” about everything under the sun, and an inference engine to deduce further implications from the axioms that composed the database. … Bo, the Swede who had built box B, had been working … [on] his own expert system. …
Whenever you pushed A’s button, this instructed A to choose at random (or pseudo-random) one of its “beliefs” (either a stored axiom or a generated implication of its axioms), translate it into English (in a computer, English characters would already be in ASCII), add enough random bits after the period to bring the total up to 10,000, and send the resulting string to B, which translated this input into its own language (which was Swedish Lisp1), and tested it against its own “beliefs”—its database. Since both databases were composed of truths, and roughly the same truths, thanks to their inference engines, whenever A sent B something A “believed”, B “believed” it too, and signalled this by flashing a red light. Whenever A sent B what A took to be a falsehood, B announced that it judged that this was indeed a falsehood by flashing a green light.
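The revealed mechanism can be sketched in a few lines. This is only an illustrative toy, not Dennett’s actual expert systems: the shared “knowledge base” is a handful of hand-picked sentences, the payload is shrunk from 10,000 bits to 256, and both boxes consult the same Python sets rather than independently built databases.

```python
import random

# A tiny stand-in for the expert systems' databases of truths.
# Box A samples a sentence; box B checks the decoded sentence
# against its own (here, identical) copy.
TRUTHS = {
    "water is wet.",
    "two plus two is four.",
    "the sun is a star.",
}
FALSEHOODS = {
    "the moon is made of cheese.",
}

PAYLOAD_BITS = 256  # stand-in for Dennett's 10,000-bit strings


def box_a_press(button: str) -> str:
    """Pressing α sends an encoded truth; pressing β sends a falsehood."""
    sentence = random.choice(sorted(TRUTHS if button == "α" else FALSEHOODS))
    bits = "".join(f"{byte:08b}" for byte in sentence.encode("ascii"))
    # Pad with random bits after the period, as in Dennett's setup.
    padding = "".join(random.choice("01") for _ in range(PAYLOAD_BITS - len(bits)))
    return bits + padding


def box_b_flash(bitstring: str) -> str:
    """Decode the leading ASCII sentence, test it against B's beliefs, flash."""
    raw = bytes(int(bitstring[i:i + 8], 2) for i in range(0, len(bitstring), 8))
    # Everything up to the first period is the sentence; the rest is padding.
    sentence = raw.split(b".")[0].decode("ascii", errors="ignore") + "."
    if sentence in TRUTHS:
        return "red"
    if sentence in FALSEHOODS:
        return "green"
    return "neither"  # an "original" bitstring that decodes to gibberish
```

Tracing any single run through these functions bit by bit is “mere machinery”; the regularity (α always red, β always green) only becomes explicable once you treat the bitstrings as sentences.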
Dennett’s point in offering this example—the esoteric boxes, the multitude of wires and circuits, the translation from one programming language to ASCII to Lisp (in Swedish!)—is to demonstrate that there is no practical way to effectively understand and predict the machine except by interpreting bitstrings semantically.
The point of the fable is simple. There is no substitute for the intentional stance; either you adopt it, and explain the pattern by finding the semantic-level facts, or you will forever be baffled by the regularity—the causal regularity—that is manifestly there.
Now turn your attention to a language model. It would be nearly impossible to train a brand-new neural network on these inputs alone to predict which bitstrings flash green or red. The only efficient algorithm for computing whether a bitstring is green or red is to have some cheap internal “knowledge base” which you can check ASCII bitstrings against. But there just isn’t enough variety or structure in the bitstring data as it is—no context, little reflection of regularities in language, no exposure to other world-data—to realistically stand a chance of locating the right knowledge base and the right decoding of bitstrings. It is theoretically possible—there exists a set of weights that could do it—but it is a tiny configuration in a vast sea of incorrect solutions, and gradient descent on this data alone is very unlikely to find it. There is no uncrossable philosophical gap in this case, only a severe lack of capacity.
Now suppose we took a language model like ChatGPT or Llama, which has already been trained on a vast corpus of human text, and started training it via gradient descent to be a bitstring predictor—and imagine, after thousands of training steps, that it has become quite successful, rivalling Al in accuracy. What is the most likely explanation? That Llama has “learned” the trick: it has developed the capacity to decode bitstrings and compare them against its internal “knowledge.” It’s the only solution that a language model with limited parameters and training time is likely to stumble upon. Thus, if Llama is successful, we can be reasonably confident that it “knows” the same thing that Al does. Not by projecting any kind of consciousness onto the LLM, but simply because it’s the most parsimonious explanation.
The broader application is clear. A small version of Llama 3 has 8 billion parameters. It was trained with 15,000 billion tokens of input text. Llama couldn’t even memorize its inputs, let alone transfer with surprising accuracy to a variety of novel problems, without finding some meaningful ways to compress that information. True, it is possible that Llama coincidentally stumbled on merely statistical factors that give it this capacity. But given the volume of compression involved and the difficulty of predicting text in novel circumstances, it’s much more likely that Llama developed in training some kind of internal modeling that “pseudo-adopts” the intentional stance.
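The compression arithmetic can be made concrete. Under two rough, illustrative assumptions (2 bytes per parameter, as for 16-bit weights, and about 4 bytes of text per token), the training corpus dwarfs the model by orders of magnitude:

```python
# Back-of-the-envelope: could Llama 3 8B have memorized its training data?
params = 8e9            # parameters in the small Llama 3 model
tokens = 15e12          # training tokens (~15 trillion)

bytes_per_param = 2     # assuming 16-bit weights
bytes_per_token = 4     # rough average: ~4 characters of text per token

model_bytes = params * bytes_per_param    # 16 GB of weights
corpus_bytes = tokens * bytes_per_token   # 60 TB of text

ratio = corpus_bytes / model_bytes
print(f"corpus is ~{ratio:,.0f}x larger than the model")  # ~3,750x
```

Whatever the exact constants, the weights cannot hold a verbatim copy of the text; whatever survives training must be heavily compressed structure.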
We might find it surprising that Llama works at all, just as it would be surprising if Al with no prior knowledge cracked the code on its own. But given that it works to the degree it does, it seems likely that Llama is crudely mirroring semantics, not just syntax. LLMs are next-token-predictors—and as many AI researchers have argued,2 the most efficient way to accurately predict the next token someone is going to produce is to model what they mean.
Lisp is a programming language, used by early AI researchers. Swedish is a human language, used by early producers of delicious pastries and perplexing self-assembly furniture.
As I say, I think this is a common attitude but I am not sure who said it first. I’ve heard it on the 80,000 Hours podcast for sure, and Dwarkesh has said something similar. So this point is not original, but the explicit connection to Dennett is really what made it click for me.


