Monday MAMLMs: Just What Is Going on Inside the "Mind" of ChatGPT?: Images Edition
Ghostly images in the latent spaces: my image model summons Panthera leo leo from the Vasty Deep. But why & how? & what does that tell us about lions, deserts, phalanxes, warriors, Leonidas, & our culture? Archetypes roar!…
Ask for a phalanx of warriors in a desert landscape, but with the warriors drawn from all places and times, and get a cameo appearance from Panthera leo leo himself. Why? How? Possibly from phalanx → Leonidas → lion’s-son? What is the cultural prior that induces the remix to flow in this direction?
With respect to words, I think I understand exactly the extent to which there is a mind and thought behind the “choice” of a word that ChatGPT makes:
A human mind, faced with a prompt that the similarity metric the machine has constructed over its training data judges to be close to the current one, chose the next word for reasons that felt good and sufficient: a mesh of memories, expectations, and the internalized grammar of argument. Now there is Heavy Magic I do not understand in the construction of the similarity metric. But once you have that, what the machine is doing is clear: It is looking at all the minds behind all of the next words in sufficiently similar training-data prompts, and transforming them into ghosts within its own circuits. As Andrej Karpathy just said:
Andrej Karpathy: AGI is still a decade away <https://www.dwarkesh.com/p/andrej-karpathy>: ‘We’re not building animals. We’re building ghosts or spirits or whatever people want to call it… by imitation of humans and the data that they’ve put on the Internet… ethereal spirit entities… fully digital… mimicking humans…. [without the] outer loop of evolution… by imitating internet documents. This works…. It’s the practically possible version with our technology and what we have available to us…
And, out of all of these ghosts choosing their next word as echoes of the thoughts of the people who actually wrote each of the next words, the machine chooses one as it rolls its stochastic dice. The collective residue of written culture is thus distilled into probabilistic continuations that often sound right because they are statistically adjacent to what has been said before.
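To make the dice-rolling concrete: here is a minimal sketch, with made-up candidate words and made-up probabilities, of sampling a single next word from such a distribution. The temperature knob controls how willing the machine is to let an unlikely ghost speak.

```python
import random

# Toy next-word distribution: each candidate continuation carries a weight,
# a stand-in for the probability the model assigns after comparing the prompt
# to its training data. Words and numbers are invented for illustration only.
candidates = {"spear": 0.41, "shield": 0.27, "lion": 0.19, "sand": 0.13}

def sample_next_word(probs, temperature=1.0):
    """Roll the stochastic dice: sample one word from the distribution.

    Temperature < 1 sharpens the distribution toward the most likely word;
    temperature > 1 flattens it, so unlikely 'ghosts' speak more often.
    """
    words = list(probs)
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return random.choices(words, weights=weights, k=1)[0]

print(sample_next_word(candidates, temperature=0.7))
```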
And then the machine scales that up, moving forward as it feeds the chosen word back into the prompt, does its calculations, rolls its stochastic dice yet again, and chooses the next next word. Some of the time the ghost whose next word is chosen will be the same ghost as for the previous next word, and thus there will be a single intelligence echoing in the answer produced. But time and chance will lead to jumps as all of a sudden the machine is choosing a word derived from a different piece of its training data. There is still a ghost of human thought for each word. But it is a different human, and a different thought. Thus we have not the intention of a mind, but rather a stochastic-parrot remix.
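A toy version of that feedback loop, assuming nothing fancier than a hand-made table of which words followed which in a pretend training corpus (everything here is invented for illustration; a real model replaces the table with a neural network), looks like this:

```python
import random

# Pretend training corpus collapsed into bigram counts: for each word,
# which words followed it, and how often.
bigrams = {
    "the":      {"phalanx": 3, "desert": 2, "lion": 1},
    "phalanx":  {"of": 4, "advanced": 1},
    "of":       {"warriors": 3, "spears": 2},
    "warriors": {"advanced": 2, "stood": 1},
    "desert":   {"wind": 2, "sun": 1},
    "lion":     {"roared": 2, "stood": 1},
    "advanced": {"the": 1},
    "stood":    {"the": 1},
    "wind":     {"the": 1},
    "sun":      {"the": 1},
    "roared":   {"the": 1},
    "spears":   {"the": 1},
}

def generate(prompt, n_words=8):
    """Feed each chosen word back in and roll the dice again.

    Each step samples the next word from whatever followed the current word
    in the pretend training data; when the sample drags the continuation onto
    a different branch of the table, that is the jump from one ghost to another.
    """
    words = prompt.split()
    for _ in range(n_words):
        followers = bigrams.get(words[-1])
        if not followers:
            break
        nxt = random.choices(list(followers), weights=list(followers.values()))[0]
        words.append(nxt)
    return " ".join(words)

print(generate("the phalanx"))
```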
The useful stance is double: take the output seriously as a compressed map of what we have already said, but interrogate it skeptically as to whether it advances understanding by sharpening our picture of the world, or just echoes the median TIS, the Typical Internet S***poster.
That, at least, is my current understanding of what is going on with words.
But with images?
I have no idea what is going on with images. Recall:
The task was to render a picture of a phalanx of warriors in a desert landscape, but with the warriors coming from all times and cultures. Well and good. I was pleased.
But how did he get in there?
Where did the decision to draw Leo there come from?
I mean, the effect is not bad.
But still.
Somebody, somewhere, must by now have thought clever thoughts about what types of images are in the training database, what sets of caption and descriptive words are associated with them, and how the two together produce the outcomes we get from these natural-language prompts. But who has done this? And how do I find them, without getting drowned in the slop?
I have my guesses: Image models are trained by aligning text with vast captioned-image datasets, so that “phalanx,” “desert,” “warriors,” and even “Leo” come to live near the corresponding visual features in a shared embedding space. And somehow that embedding pulls in the culturally salient archetype of a lion in a context defined by “desert” and “warrior”; perhaps “Leonidas” even decomposes into something like lion plus son, and so gets called up by “phalanx”. The models then sample across the learned distribution, blending several nearby clusters. That is why a composite scene can suddenly drift toward a familiar face: it is a stochastic hop to a dense neighborhood in representation space (see the toy sketch below). Perhaps. But even so, when you do explicitly ask for Leonidas you tend to get something like this:
in which the graphic novel and movie “300” has overwhelmed all previous and other history, memory, and art.
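To put the guess above into toy form, here is a minimal sketch of a shared text-image embedding space: three hand-made cartoon dimensions, invented numbers, nothing resembling a real model. It only illustrates how an averaged prompt vector for “phalanx,” “desert,” “warriors” can end up sitting closer to a “male lion” image cluster than you might expect, so that a stochastic hop to a dense nearby neighborhood summons Leo.

```python
import math

# Toy shared embedding space with three cartoon dimensions:
# (desert-ness, warrior-ness, lion-ness). All numbers are invented;
# a real model learns thousands of dimensions from captioned images.
text_embeddings = {
    "phalanx":  (0.1, 0.9, 0.3),   # warriors, plus (via Leonidas?) a whiff of lion
    "desert":   (0.9, 0.1, 0.2),
    "warriors": (0.2, 0.9, 0.2),
}
image_clusters = {
    "hoplite shield wall": (0.2, 0.95, 0.15),
    "sand dunes":          (0.95, 0.05, 0.1),
    "male lion":           (0.5, 0.3, 0.9),
}

def cosine(a, b):
    """Cosine similarity: how close two vectors sit in the shared space."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Average the prompt's word vectors into one prompt vector...
prompt = ["phalanx", "desert", "warriors"]
vecs = [text_embeddings[w] for w in prompt]
prompt_vec = tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(3))

# ...and rank the image clusters by how close they sit to the prompt.
for name, vec in sorted(image_clusters.items(), key=lambda kv: -cosine(prompt_vec, kv[1])):
    print(f"{name:20s} similarity {cosine(prompt_vec, vec):.2f}")
```

In this cartoon the lion cluster scores higher than the sand dunes, not because anyone asked for a lion, but because the prompt vector carries a little lion-ness from “phalanx”; real CLIP-style encoders learn such a space from hundreds of millions of captioned images rather than three hand-picked dimensions.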