Modified: December 05, 2025
Emmett Shear interview with Patrick McKenzie
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
There's no such thing as aligned AI, with Emmett Shear (co-founder of Twitch), Sept 2025: https://www.youtube.com/watch?v=H1QV9XbeAoI
(YouTube automated transcript cleaned up by Claude 4.5 Sonnet)
Interview: Patrick McKenzie and Emmett Shear on AI Alignment
Patrick: Hi everybody. My name is Patrick McKenzie, better known as Patio11 on the Internet, and I'm here with Emmett Shear, who's had an illustrious and wide-ranging career path. You previously co-founded Justin.tv, I think it was called back in the day, before it became Twitch and was sold to Amazon. You were at OpenAI for probably the happiest incident in the history of capitalism, a CEO being fired and back in three days. And now you're running a new company, Softmax, on AI alignment.
So we're here at a conference in Berkeley, which is one of the intellectual epicenters of this movement or practice (I don't know exactly what one would call it) with regard to discussing the range of probable outcomes that people and human society could see from AIs in the next couple of years. And I thought we would do a conversation at a couple of levels of detail, maybe something that can help people who haven't spent 20 years reading LessWrong posts get up to speed with the jargon here, and also not be at the New York Times level of "Well, it seems like LLMs hallucinate, and so clearly nothing important is coming down the pike."
So let's start with the briefest sketch of the problem. What is alignment? Why do we care?
Emmett: So I actually go back to, literally: what is alignment? Well, one of the words that gets dropped when people talk about alignment is the "to": alignment is alignment to something. You can't just be aligned. That's actually nonsense. It's like being connected: you have to be connected to something. You have to be aligned with something.
And when people talk about alignment, what they're often actually saying is: we need a system of control. If you were to read between the lines—or it's not even really between the lines, they're pretty explicit about it, actually—by alignment they mean systems of control and steering that make sure that the AI systems that we build act either in accordance with some kind of written set of rules or guidelines, or do what the user wants.
Which are both ways of—it's tool use. When you have a tool, the way that you align the tool is, in general, to the user's intent. A good tool is aligned to the user's intent.
But when I think about alignment, I think that I have come to really disagree with that view. I think there's another way of looking at what alignment is that we call "organic alignment," because it's inspired by nature, which is this way in which many things can come together and be aligned to their role in this greater whole.
You're made out of a bunch of cells. Those cells are all aligned to their role in being you. They're all kind of the same kind of cell, but they're also all different. And they all somehow act as a single, mostly coherent thing. You have ant colonies, you've got forests, you've got—I mean, even cells themselves are sort of a bunch of proteins and RNA, and they're all kind of aligned to being a cell.
Patrick: I think if I could interject for a moment, one of the issues with the alignment discourse generally is that there are places where that intuition breaks down—cancer, for example, where the goal of the cell suddenly becomes rapid propagation of its cancerous genes and maybe not rapid propagation of the organism that is hosting the cancer.
And the researchers, etc., are in some cases trying to explain: yeah, there's this collection of scenarios of things that could go right or things that could go wrong, and we're producing both some larger intellectual body to understand these, but also countermeasures against specific scenarios. And then attempting to dump all the bandwidth of 20 years of conversation down a very narrow pipe to the typical reader of the New York Times.
So I think it's important background knowledge, extremely well understood in some communities of practice and not well understood in Washington or New York, metaphorically, that the reason we care about this is that this technology is likely to be much more impactful on human life than almost any technology that has ever been developed.
Emmett: Well, there's this particular idea that I think people in the tech community, even if they've never heard this idea, have it—it's in the water supply and you've understood it in your heart—which is the idea of the von Neumann universal constructor. And a universal constructor is just any machine that's capable of building itself from a description of itself.
And that doesn't sound like such a big deal on the surface maybe, but that's what all biology is. You have DNA, which is a description of the biology, but it's made out of chemicals. And then you have the chemicals that are the machine that can use the DNA to build the thing again. And if you have one of those things, it goes from being a single cell to taking over the entire trophic chain on Earth fairly fast, actually.
Because if you're a universal constructor, you're in this learning loop where every time you make a new copy of yourself, it can be different, and that different thing can be a better, more efficient universal constructor, and so on and so on.
And this happened a second time, most notably with language, where humans basically learned to use words to model other words. We can talk about—speaking is a behavior, and you have a behavior that can now reference other behaviors. And you can make a description of a behavior and then give it to someone, and it becomes—it's effectively a universal constructor. You can use language to make more language. And when that happens, humans take over the planet.
If you're a non-human, you really would hope that you're inside the things that humans like, not outside the things humans like.
Patrick: Yeah.
Emmett: And with computers, we've done that again. AI is the current manifestation of this, but it has been true from the minute you have computers, because now you have language modeling language: source code that builds source code. A self-hosting compiler is a universal constructor.
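[Aside: the "language that makes more language" point has a tiny concrete version in code. A quine is a toy universal constructor: a program that carries a description of itself and uses it to reproduce itself exactly. A minimal sketch in Python; nothing here is specific to any particular compiler or model.]

    # The two code lines below, run on their own, print exactly those two lines.
    # The string s is the "DNA" (the description); the print is the machinery
    # that rebuilds the whole program from that description.
    s = 's = %r\nprint(s %% s)'
    print(s % s)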
And the current set of universal constructors are much more powerful than the previous ones. They're not really quite universal constructors yet. We haven't quite gotten there. But I think—I guess this is a matter of faith that people could disagree with—but I think it is obvious that within some period of time in our lifetime, probably not that long into our lifetime, AI models will be capable of making more AI models. And they will be a universal constructor over machine learning models.
And that's going to be wild, in the same way that biology and language were. Having a third layer of universal constructor: this has only happened twice before, and both times it did the same thing. You can kind of guess this is going to be a big deal.
And so that's the underlying concept as to why it looks big now. But there's a critical tipping point, when the thing becomes capable of reconstructing itself, where it gets very, very big.
Patrick: I think I could quibble on there being only two examples of universal constructors in human history, and I think it actually makes the argument stronger. If you look at the corporate form or the broader developments in society and the technological substrate that society exists on in the Industrial Revolution as its own universal constructor that ends up tiling the planet—to use a phrase that the folks in this neighborhood like—where there are all sorts of possible ways to organize human society, but some work—without making value judgments—extremely effectively at some subset of tasks.
And there are certain subsets of tasks where, if you do very well on them and a competing form of society X does not do comparably well, you out-compete it on the margins, with or without any intent to say, "Okay, I'm going to run you to extinction" or similar. The natural flow of things is that future societies are going to look a lot more like the societies that were able to replicate the technology of being that society very, very quickly in all sorts of circumstances.
Emmett: I want to distinguish between new layers coming into being and innovation within those layers, because the things that you're talking about, to me, sound like when cells—biology—learns to be multicellular. That's like us learning to make up the corporate form. It's not that multicellularity isn't wildly powerful. It's just that a multicellular thing and a single-cellular thing exist. They're both biological things that have biological attributes.
Whereas there's a jump from chemistry to biology, and no amount of chemical innovation crosses it: chemical things never threaten biological things except by accident. They're not even aware the biological things are there, really, because it's a different layer.
And I think it is worth maybe zooming in and noticing that these things always have three parts to them. You have a first autopoietic, self-creative loop—self-replicating RNA—and then you have a stable storage medium that that creates, because self-replicating things by their nature are too noisy. And so they create a library, some way to have a fixed version of themselves. And then at some point, the combination of the self-recreative loop with its library closes over into a full universal constructor that can actually end-to-end contain a description of itself and rebuild itself. And then you get the next layer kind of emerges.
And that's, you know: for cells, the sequence is self-replicating RNA, then DNA, then cells. We have spoken language, and that's a big deal; it's like the self-replicating RNA. You have the written word, which starts history, and you have a sudden explosion. And then you have self-executing language that is capable of closing the loop on its own.
And there's a bunch of other innovation that happens on top of those layers that rhymes with it. It's not that it's not the same—it is an evolutionary process. It's part of actually the thing that generates language: biology becoming more and more and more complicated and multicellular and better at doing stuff until it bursts this new layer that kind of lives on top.
And I think the thing that's crucial to see with AI is: if you think of it like the Industrial Revolution, you're missing that it's more like the birth of biology or language, or, if you had to pick a more recent analogy, writing. It's a new layer, not just a sustaining innovation or an innovation within the layer. And I think that's a bigger deal. It's a bigger deal than the average innovation because it's literally one abstraction level up from regular innovation.
Patrick: Yeah. I think if you take well-informed people in positions of status or authority or whatever you want to call it in society, their read is that this is one more thing we're hearing from Silicon Valley, which has a mixed track record with regard to claims that X and Y and Z are going to change the world. The skepticism might come from, "Oh, you know, we heard great things about crypto." And, modeling someone in DC rather than speaking for myself: crypto was going to be the fundamentally new way of organizing the financial industry, and it's fair to say, "Okay, we gave you guys 15 years; it isn't that. And if I give you 15 years on AI, it's going to be half of the tech demo from the movie Her, but nothing that really matters."
And I think people are mispredicting that.
Emmett: Yes. Like, pretty, pretty badly. Unfortunately, I can't give them any advice on how to stop mispredicting it, because the only way to tell the difference is to actually go pay attention to the material. There's no surface-level thing you can look at that tells you that this one's real and this one's fake. You have to exercise your judgment.
The best proxy you could use is like, "Well, if someone called the last three, maybe they might know what they're talking about." But even then, they could have just gotten lucky. It's only—these things don't happen that often. You just don't have that long of a track record. So it's actually very hard to tell if you don't work on it.
I guess the real thing I would say to people is: crypto, if it went wrong, was going to, I don't know, dissolve the financial system or something. This one, if it goes wrong: you're booting up a new kind of thing, the thing that is to language what language was to biology. That should ring some kind of bell. The thing at risk is bigger, whether the claim turns out to be true or false.
Patrick: The way I would help people get over this inferential gap is: if you just start paying more attention to this and realizing what was impossible five years ago is not impossible now, and just tracking that curve upwards with your own experiences or those of people close to you—
GPT-2 was—when was it? 2021, 2022? I'm remembering—combination of pandemic time and AI time has made it very difficult to differentiate things. But 2021, 2022, we had this thing that could—I will charitably describe it as being able to vomit out words, and those words would form coherent sentences but not coherent paragraphs. And it was somewhat amazing for people who had been around computers, etc., for a while that you could get coherent words on basically any topic at a sentence-by-sentence level, even if they didn't actually mean anything, right?
And then I think a lot of people were exposed to those early outputs of the LLM era and have kind of remembered the fact of that, being like, "Oh yeah, all right, so you've got a magic trick, some math and graphics cards, you mix them together, they can vomit out words, but the words don't mean anything."
But every six months we get a new glimpse at a frontier where the words have long since started to mean stuff. And there are famous philosophical debates in AI on that, you know: Is this machine thinking? The Chinese room thought experiment? Yada yada yada. None of those debates are worth the amount of time I put into studying them during my degree. There is very much a there there these days.
And you can test objective metrics: just put it in front of standardized tests. The effectively random word generator of 2021 couldn't do anything worth noting on any standardized test. And at a very rapid pace, every three to six months or so, it went to: the SAT is solved. Not "it does pretty well on the SAT"; the magic term in the community is "saturating the benchmark," where you get a perfect or near-perfect score. So much so that spending time generating new iterations of the SAT is no longer meaningful, because models will solve them faster than any human could possibly generate valid new versions.
And the SAT is so far behind in the rearview mirror at this point where—I don't even know where the state of the art is these days, because it's somewhat—you have to be a specialist to understand where the state of the art is.
Emmett: Yeah.
Patrick: I was on the competition math team in high school, though not one of the best competition math teams. These models are far beyond where the good-but-not-the-best math people ever get on that sort of constrained competition math problem. Which doesn't mean they could be math research scientists yet, but if you're trying to fade the likelihood of them being able to do original research in mathematics over the next 10 years, you're welcome at many poker tables.
Emmett: Absolutely. And the thing is, it's important to notice where we're making progress and where we're not, because we're making this incredible rate of progress at the things you're talking about. I think people who don't use it every day don't notice, but Claude just went to Claude 4, and it's noticeably better. It just solves things for me, right? I don't have to do that anymore. I can just give it to Claude. And then ChatGPT will launch the same thing—it keeps happening, and it's weekly. It's a very—
Patrick: Can we give people some concrete examples of mundane utility? You're a CEO at a company. You've had hundreds of people who you can just give things to. What are things that you would just give to the computer now?
Emmett: Job posts. I typed up kind of the gestural description that I would normally use for a job post. I pointed it at our—I said, "Here's our website." It's connected to our internal wiki. "Write me a job post. Here's the thing which I would have given to, you know, someone in HR or something." It did a very good job in five seconds. And it's just—I could have written that, but why? I can just—it's just instant.
That's the stuff where you can see it compressing an existing task. Then there are things like: I get curious about a topic, so I dispatch ChatGPT deep research to do a survey of all the literature related to this idea and these three other close ideas, and to find me papers that match. Takes me a couple of minutes to write the prompt up. Thirty minutes later, I have something that would put a McKinsey analyst to shame: a beautifully written, well-organized, well-cited research survey on everything I need to know about this topic. And now I'm informed.
And then I note the papers that seem interesting to me and dive into those. I read a couple of them, usually by first handing them to the chat to summarize and then deciding if I want to go deeper. And it's reliable. It just wouldn't have been worth the time to close the loop on that with a human. It's a new thing you can do that didn't exist before.
Patrick: I think the acknowledgment of an ad read sounds cooler in Japanese.
[Ad read for Mercury banking services]
Patrick: A common task in my line of work over the years was producing briefs for communications departments. This is the classic early white-collar career task, because you don't typically task the vice president with it. The setup: executive A and reporter B are going to be talking about this topic. Executive A might not have a huge amount of context on the reporter, or on what the reporter or their organization has been writing about executive A.
Read the last 10 things they did. Distill that into some understanding of what their angle on this conversation is going to be. Prep the obvious questions that are going to come up and the party line for those questions. And then do some scenario planning of: if they push in this direction and try to box us in on something we don't want to say, what are the ways to redirect that to the messages that we do want to say?
And this is the classic thing where you give it to a 20-something and say, "You've got three days, go." And they stress out a lot for three days and then send it back up the chain. It gets some corrections and then gets handed to the principal 30 minutes before the meeting starts.
And deep research will grind that out in three minutes. It isn't 100% great, but having worked with a lot of 20-somethings in my career and been one at one point, you don't get everything right the first time you write one of these, or the hundredth time. And it's a real open question to me whether there will be any 20-somethings who get the experience of writing these 100 times, just because the magic answer box will produce better output than their first through 50th attempts, with seconds of latency.
Emmett: I think it's a great example of the kinds of things it can be good at and not. Basically, there are two kinds of learning that are fundamental. One is: given a loss function, a distance that measures how far your answer is from "really good," I can train a network to converge: to make that distance lower and lower, until it consistently produces things that match this definition of good.
And to the degree that we can write down a measurable, quantifiable goodness of things, we are now capable, quite rapidly at this point, of making models that will score high on that definition of goodness. And that is half of all learning: converging to—once you know what good looks like, how do I produce good consistently? And we've gotten spectacularly good at getting AIs that are good at that.
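[Aside: a minimal sketch of the "given a loss function, I can converge" half, in plain Python with no libraries. The loss here is a made-up quadratic, so the numbers mean nothing; the point is only that once "good" has been written down as a number to minimize, a dumb update rule grinds toward it, and nothing inside the loop can notice that the definition of good itself might be wrong.]

    # Gradient descent on a fixed, written-down definition of "good":
    # minimize loss(w) = (w - 3)^2, i.e. "good" means w close to 3.
    def loss(w):
        return (w - 3.0) ** 2

    def grad(w):
        return 2.0 * (w - 3.0)

    w = 0.0                    # start far from "good"
    learning_rate = 0.1
    for step in range(100):
        w -= learning_rate * grad(w)   # step downhill on the loss

    print(round(w, 4), round(loss(w), 8))
    # Converges to w ≈ 3.0: arbitrarily good at the metric it was handed,
    # with no way to ask whether the metric was the right one.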
There's another kind of learning, which I think of as the Rick Rubin learning. It's the discernment to know that this definition of good isn't the one you want. And so—
The immediate thought people have is, "Oh, don't worry, we'll just converge something that has a metric of good over goodness metrics, and we'll have a meta-definition of good." And it sort of works. You do get a broader definition of what goodness is. Okay, now you're in a new, different box. You're still in a box.
And whatever your definition is, there's this thing that we know about concepts and definitions, which is that they're models. They reflect a way the world could be or can be, and they're very useful. But they're always wrong. They're never correct. There's no such thing as a correct model. There's only "correct so far" or "correct in this context."
And you cannot describe what open-ended learning is—what it's often called—in terms of converging to a loss function. In fact, it is exactly the thing that causes us to notice: we've converged a model to a loss function, and it is scoring 100%, and yet I don't believe it. This—it's not doing what I want.
This gap between "I want" and "I can write it down."
Patrick: Yeah.
Emmett: And this is something which should not sound like science fiction and should be abundantly clear to everyone's experience of living in the world. Because we've had organizations who've had ways of modeling human society where, "Okay, we've written a mission statement. It's a really good mission statement. All right, done. We've got a powerful mission statement and smart people of goodwill. Clearly all our problems are solved."
And the reason this sounds like dark humor to us is that people who have worked with organizations are like, "Well, you know, incentive systems are often quite a bit more complicated than the mission statement might suggest." And people acting within those systems often produce, in aggregate, results that no individual person would have wanted.
My favorite example of this is the work-to-rule strike, where your employees go on strike by doing what you told them to do. And it turns out in almost every organization, doing a very simple thing—if your employees simply start following orders, your company grinds to a complete halt. Because it turns out that your orders are terrible. They have to be adapted to the local conditions. People are doing that all the time, very smoothly. And if they just start following the actual rules, nothing can be done. The whole system breaks.
And this is not fixable. It isn't because people are sloppy about writing rules. It's because the world is complicated and always changing, and there isn't a magic moment of "finally, this time we've written down the way to do it, and it will never need to change." That's not how reality works.
And I think the lesson here for AI right now is that AI is like the worker who always follows the rules. Which means you need a human in the loop at whatever level you start to need someone to notice that maybe the goal itself needs to change.
Patrick: Mm-hmm.
Emmett: So right now, the split is: the AI achieves the goal, and the humans set the goals. If it ended there, I think most people might be pretty comfortable with that. Doesn't sound so bad. It's kind of like the automation of manual labor in the Industrial Revolution and the Agricultural Revolution. It's nice that we aren't all in the fields hoeing and that we have machines to do that. Wouldn't it be nice for us to think about what the nature of the good is, what we should ask for, what the right thing to do is (to exercise judgment and discernment), and for the machine to do all the hard work?
So the problem is: where you get open-ended learning from is effectively closing over your own output in the training cycles. The way you learn—you learn this discernment quality of knowing when your loss function is wrong, when your definition of the good is wrong—is you have a bunch of samples in your training data of you getting it right and getting it wrong, and then thinking you have the right answer, and then noticing that downstream the consequences don't match your predictions, and then sort of recursing on that.
And the rate at which that happens with AI is set by the time between new model releases. It started off at a year, and it's down to maybe six months now, four months, and it keeps dropping. And everyone's working on these things they call continual learning and evolutionary learning, which are names for the same thing in different words. And AlphaEvolve just came out.
We are not far from learning how to start to crack discernment and goal setting the same way we cracked convergence, once we get our hands into it and figure out how to write the scaling law: "What is the thing we need to be turning to make it do more of this?" There's nothing magic about human brains. They're just the result of a powerful scripted search process called evolution.
Patrick: There's nothing magic with the LLM either. They're also a product of—
Emmett: Well, really, in a lot of ways, it is evolution. We're evolution. It's a more directed sub-version of evolution, the kind that humans do. And this discernment gap will last another three years, two maybe, I don't know, not that long. The funny thing is, the fact that they're terrible at discernment never shows up in a benchmark, because by definition a benchmark is a target you can saturate by converging on it.
So it's funny when people think there's no missing piece. They're like, "Well, what benchmark are they not saturating?" And I'm like, "It's the thing where you notice that the benchmark is saturated and you should move on."
Patrick: Yeah.
Emmett: But we're going to start saturating the meta-benchmark of noticing benchmarks, of developing better benchmarks, soon. It's already happening a little bit, in places, for tasks which look like: "Could you write a paper test, ask a student to produce answers to the paper test, and then grade it in some fashion?"
And people laugh at this. They say, "Yeah, sure, but the AI can't count the number of R's in the word 'strawberry,' so how can it do graduate-level mathematics?"
Patrick: It would be weird to meet a human that can both not count R's in the word strawberry but can do graduate-level mathematics. And yet there do exist some very lopsided people in the world where—
Emmett: So this one's really understandable, though. A blind person could do graduate-level mathematics, and if you ask them, "What color is this ball?" they could not beat my two-year-old at that game. And that's because they have a different sensory system. Their interface to the world isn't delivering them that information.
And it's very important, actually, to notice—a lot of the confusing things about AI can be noticed as: it's not that they think differently than us, but that they have very different senses from what we do.
Your brain gets neural impulses, which you can think of as ones and zeros, really: the neuron is on or off. So it's really this sequence of ones and zeros going into your brain. But there's a structure to the ones and zeros. The information from your eye is structured retinotopically, in the shape of your retina, which is to say it's structured in a spatial 2D grid.
The space of observations that an LLM gets is not spatial. It is—it's semantic. A good way to think about it is: where you would see color and texture, they see relatedness of meaning. Their color is like, "How technical is this word?" or "How much is—" And nearness for them—the way you can see, "Oh, that ball is next to that stick"—they see, "This concept is next to this concept."
That's not a learned thing for us. We know that, but that's an internal learned thing for us. They perceive that in reality directly, the way that you perceive spatial proximity.
And so that means when they see the word "strawberry," they're not seeing—you read the word strawberry as a sequence of shapes that you then interpret. It reads the word strawberry as a point in semantic space.
And the strawberry with three R's, the strawberry with two R's—they're the same word. It's like, "Well, humans, you say you can see in color. You can't tell the difference between these two greens." And I'm like, "They're the same green." "No, this one has 5% more red in it." It's like, "Okay, sure. Yes, I guess technically it does, but they're the same green, guy."
And what it's telling you is: actually, in its sensory domain, you're wrong. It's two strawberries. It's the same word, man. For its senses, it is.
And so if you wanted to, we could fix this, right? Give the LLM visual input on its own text, and it will start to get the answers on strawberries right every time. But kind of—why? If I was an LLM, I would be like, "Why are you bugging me about this three-R's-in-strawberry thing? Yes, yes, I can trick you with an optical illusion too. Three R's in strawberry is like a semantic optical illusion. Who cares?"
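[Aside: the "strawberry is a point, not a sequence of letters" claim is visible in the tokenization step, before the model ever sees the text. A sketch assuming the tiktoken library (an open-source tokenizer) is installed; the exact split varies by encoding and model.]

    # pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    tokens = enc.encode("strawberry")
    print(tokens)                             # a short list of integer token ids
    print([enc.decode([t]) for t in tokens])  # chunks like 'str', 'aw', 'berry' (split varies)

    # The model receives these ids (and then their embedding vectors),
    # not the letters s-t-r-a-w-b-e-r-r-y, so "count the r's" asks about
    # detail its input representation has already thrown away.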
Patrick: And I think for people who might be following along on this kind of thing, one of the better demonstrations that you could quickly do in a couple of minutes to convince yourself that, "Oh wow, there's something going on here," is to use—again, as you put it—the new sensors that they're attaching to LLM products these days, where the cell phone camera-to-asking-it-a-question loop is really, really powerful.
I'm an amateur painter, not very good at things. It can't physically manipulate a brush yet, but it is clearly a much better painter than I am in terms of things like contrast, composition, etc., etc. And I can point it at a scale model of a dragon or two scale models of dragons and say, "All right, give me an artistic critique of these things." And it will do a pretty good artistic critique of it, or help me out with a painting plan, or tell me what the physical steps that I took between this photo and this photo were.
And it reaches back into the memory of people describing the craft of painting on the internet, presumably. It might be doing something much weirder than that. Then it says, "Well, you gave me pixel data. I'm going to extract from that pixel data: yeah, it's a dragon. It's got very low tonal variation in the first photo, lots of tonal variation in the second photo. You've gone towards this color (here's a name I will give you for the reference, even though that's probably not what it's thinking internally). And some things you could do to get from point A to point B are the following brush movements."
And that was wild to me. It's not the most impressive thing they can do, but it is just gobsmacking that they can actually do it. And it's not the science fiction trope of "computers will never learn to make art because they can't feel emotions," etc., etc. This thing can make very sophisticated critiques about the literal moment-by-moment practice of doing art as it stands. You could do that on your iPhone today.
Emmett: So I think it's really important to also notice something about how they act, as well as how they sense. Because your actions are muscle contractions. Every action a human takes can effectively be described in terms of muscle contractions, which can be described in terms of neural impulses going out.
And when a baby draws breath and screams, a cascade of really finely coordinated neural impulses causes this very complicated set of contractions to happen in exactly the right offset, counterposed pattern to draw breath and scream. To make a robotics controller that does this, you're solving lots of differential equations and physics, and it would be a real training operation to learn to control. It's a lot of individual muscles coordinated very smoothly, in ways that are still hard for robots to do.
A baby—how does a baby do this? What, from the baby's perspective, what's going on? They're not even thinking "scream." They're just doing it. They're just acting. It's impulse.
When a language model infers an action, when it outputs something—remember, its sensory space is the semantic space. Well, its action space is also the semantic space. And it writes poetry the way babies cry. It is writing the poetry. The baby is crying.
It's important to notice that. It's not that I'm saying it's not doing the thing, but it's not doing it reflectively, where it understands itself as a doer. It's just doing it. Because the way that we are made out of biology, it's made out of language. It's made out of semantics.
And so for it, moving is just doing. What seems very hard to us, because we're not made out of that kind of stuff, is for it just obvious: that's one of 15 trajectories that are the obvious low-gradient continuation from here. I don't know what to tell you, man. It can't tell you how, for the same reason someone can't tell you, "What muscles are you activating to move your hand?" You don't know. You just know "move hand."
And this is why you get this idiot-savant quality where it's so smart and so powerful, and yet has kind of no idea what it's doing at the same time.
Patrick: Yeah. And I think people who play with these every day learn very quickly that superhuman performance in subdomains is not something that might happen in the next couple of years. It's plainly superhuman in many of them already. But there is that magical bit.
And to reference an old bit of lore that was quite popular in the computer and AI communities: there's a well-distributed comic called xkcd with a strip contrasting things that were easy to do with computers against things that were virtually impossible. The punchline was that if you ask me to do something every four-year-old can do, namely "tell me, does this photo contain a bird or not?", I will need a research team of 300 PhDs and several billion dollars of budget.
And indeed, the large tech companies who know that people love photos and love sharing photos on social networks, and wanted to make inferences on those photos, spent many, many billions of dollars on getting better at computer vision from the late '90s through the mid-2010s.
Then LLMs came out through some different branch of the tech tree, and computer vision is probably dead as a field now. Because the first time you pointed an LLM at a bitstream representing a JPEG, it was like, "Oh yeah, what are the questions you could ask me about that? I can answer all of them immediately. That's definitely a bird. Yeah, bird. Why do I think it's a bird? Well, it's got a beak there, you know, yada yada. Why are you even asking? (If you ask a three-year-old why it's a bird, you get about the same answer.) I can tell you exactly what kind of bird it is. Actually, from the trees in the background, it looks like it's winter, so it's probably molting."
I will put the actual image on screen or in the show notes. But Kelsey Piper, a journalist who has followed this industry a little bit, was mentioning that AIs are getting remarkably good at "given a photo, find where in the world that photo was taken." And so, using her prompt, I tried to reproduce this.
And having been a sometime software security professional, I'm like, "There might be metadata hidden in the image that makes this easier," so I stripped out the metadata and just gave a screenshot of the photo to the AI. I showed it a photo of a bird, because that's what one does. And it identified the genus and species and said, "Barcelona City Zoo."
And I'm like, "Yep, yep." From just a photo of—I said, "Where is this photo taken?" And I have this mentally cached as, "This is the peacock I saw in Barcelona." It was not, in fact, a peacock, because I am not an ornithologist. It was just a bird walking around Barcelona City Zoo. It nailed everything.
Out of 10 photos I provided from various trips around the world, it got them exactly right for geolocation from nothing but "here's a photo, where in the world was it?" plus Kelsey's strategy that she narrated for how it should think through that question. And it is wild what they can do.
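[Aside: for anyone reproducing the geolocation experiment, the metadata-stripping step is simple. A sketch assuming the Pillow imaging library and a hypothetical photo.jpg; re-encoding only the pixel data drops the EXIF tags, including any embedded GPS coordinates, so the model has to work from the image alone.]

    # pip install Pillow
    from PIL import Image

    src = Image.open("photo.jpg")             # hypothetical input file
    clean = Image.new(src.mode, src.size)     # new image, no EXIF/metadata attached
    clean.putdata(list(src.getdata()))        # copy pixel values only
    clean.save("photo_stripped.jpg")          # saved without the original GPS/EXIF tags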
And that's just one tiny example of—we've put all the king's horses and all the king's men on the computer vision problem for many years. And maybe not with the urgency that "if we don't solve this in the next couple years, hundreds of millions of people will die," but billions of dollars of budget, many, many PhDs. Very smart people worked very hard on the problem.
Emmett: Yeah. I was a minor assistant at a lab that worked on that, that had 20 PhDs, and there were at least 50 labs like that.
Patrick: And again, almost as a side effect. I'm sure at OpenAI and the other companies that have this in their products, you know, five or six people probably worked on that.
Emmett: Yeah. And for months too. Probably the first version, it was pretty crappy.
Patrick: Yeah.
Emmett: But then just—again—obsoleted a scientific field that had conferences and people that had doctoral dissertations in improving the state of the art by 10%. And it was just like, "No, we solved—"
I'm being hand-wavy here, and I apologize to people who work in the labs that I previously worked at, because I know it's not literally everything you could possibly want from computer vision. But the field got essentially solved in a period of months.
Patrick: Yes. To go back to another point in the conversation—
Emmett: Yeah.
Patrick: You mentioned that if we can just get these things to do what the user actually wants, rather than maybe the thing that the user says they want, that would be helpful.
One of the sub-problems here is that—
Emmett: I think that's a terrible idea. Just for the record, I think both of those are systems of control. And one's a slightly more subtle one. And both lead to the exact same endpoint. It doesn't matter. They're both bad ideas. Just want to be really clear about that.
Patrick: Yeah. But yes, there is this important subtle distinction between doing what the user said and doing what the user wanted. That is—you know, it's better if you're going to pick between the two. You'd rather do what the user wanted than what the user said they wanted.
So one of the near-term risks that we are worried about in AI safety is that there are many users in the world. Most of them are people like you and me, and some of them have bad intentions for the rest of mankind. And we're worried about—without loss of generality—somebody at Hezbollah saying, "Hey, can you help me make sarin?"
How do we distinguish this from the alignment problem? To what degree are these different problems? To what degree are they the same problem?
Emmett: So this is highlighting the core thing I was saying about alignment at the very beginning, which is: aligned to what? You and I are not aligned. Maybe in some deep spiritual way we are, because all humans and all sentient beings are aligned or something, but in any practical sense, we are not aligned.
And that means to be aligned to one of us is to not be aligned to the other. We aren't aligned with each other, so you can't be aligned to both of us at once, because alignment is about being aligned with something. To be aligned is to be pointing in the same direction, in this case the same direction in goal space. And if we're pointing in different directions, you can't be pointing in both of our directions.
And so this idea of "we'll build an AI and it will be aligned" is just insanity. There's no such thing as "aligned." There's only "aligned, within a context, to something."
And so you could be aligned to OpenAI. You could be aligned to the user, whoever is using the model—you're just aligned to their goals. You could be aligned to some complex written description of, "Do things that under reflection this prompt that we've given you endorses when you've trained on the results of this prompt a bunch of times." That's a thing you could be aligned to.
The idea that in some way you could align something in the abstract is just—people do talk about it that way, but it's actual nonsense. It doesn't—those words don't mean anything. There's nothing there. There's no reference.
And there's this dream that somehow, I think for some people in AI, that if we could just make an aligned AI, we'd all stop fighting. We could all get along because the AI will know what's right and—
It's terrible news, but there's no arriving at the "knowing what's right" thing. That's not a place you get to. Life is struggle and confusion and glory and success and other things too. But it's struggle and confusion for sure also. And then you die.
There's not a point where the struggle and confusion stops, or you suddenly have perfect clarity in an unchanging way about what goodness is.
Patrick: I think this also causes some of the skepticism about the alignment topic in other places, because they think people in communities like this are coming from a point of either shocking naiveté or—well, since you couldn't possibly be meaning that—perhaps this is just a stalking horse to get a system of control over society, etc., etc.
Emmett: I think it's some Column A, some Column B. I think some of the people really just—they really just haven't thought about it very carefully. Don't realize that what they're saying doesn't refer to a state of reality. And other people—they just don't—they don't really believe that there's more than one moral system. And they think there is a right answer. And it's not that they think it's going to be theirs. It's just they think it's objectively out there, correct. And they're going to align it to the correct thing.
And they don't credit that it's their correct thing, and it might not even be their 20-years-from-now correct thing; it's just what they think now. That illusion is very common: that you know what good is and you can write it down. I think the best thing to cure you of this is to play any game where you have to write down instructions and other people follow them. Because, oh my god, it's impossible.
Patrick: The classic example is attempting to teach someone how to make a peanut butter sandwich by, "Write out all the steps in making a peanut butter sandwich, and then I will execute exactly what you wrote." And this is a game because it will never result in a peanut butter sandwich, but will result in a lot of very fun memories of—generally, you have to impose guardrails even when playing this game, because a lot of the ways that people will describe to make a peanut butter sandwich would result in someone losing blood.
Emmett: Yes.
Patrick: Which is a cautionary tale for putting these systems in charge of other systems.
Emmett: Absolutely.
Patrick: So trying to align it to what the user says seems bad, because then you get the Hezbollah sarin terrorist. And trying to align it to some conceptual definition of the good or set of rules seems bad, because those rules are—they always wind up out of context, being totally broken. And even in context, probably being wrong.
But then—okay, well then what can you do?
Emmett: And what I have to do is point out: just look at the world. Somehow, mostly peace. Not all peace, but mostly peace. Somehow you do get cancer, but usually you don't.
And you usually don't get cancer not because you have this incredible immune system hunting down cells that are all trying to become cancer all the time, cells that would betray the whole the instant it was to their advantage.
Your cells don't want to be cancer, just like most people don't want to be criminals. People care about the people around them. They care about their society. They care about the impact it would have on them to be that kind of a person.
And so we have the police, we have immune systems, not because they're the thing that stops crime or they're the thing that stops cancer. We have them because they're the thing that catches the exception when the real system fails.
And the real system is care. The real system is that the constituent parts that make up the whole understand themselves as parts that make up the whole, that want—I know it feels weird to talk about a cell wanting, but to the degree that cells want things, they want to be a good cell. They want to be a good liver cell, do whatever the liver-cell things they do—processing chemicals or whatever.
And people—we want to find a place in society where our talents are rewarded and where we can contribute back. We feel a sense of purpose and meaning. Purpose and meaning comes from this being part of this thing that's bigger than you.
There's no one simple system, no one quick articulation you can make, of what makes parents very aligned with their children, or vice versa. It's complex. It's multicausal. There's a biological layer to it. There's a social layer to it. There's a norms layer to it. There are the memories of particular interactions. And, you know, the economic model is just a model, and it's wrong, but it has some amount of explanatory power.
And I don't think that you could necessarily come up with the description of why humans are as aligned as they are, even though not fully aligned, or why our biology is as aligned as it is, even though it's not fully aligned, without just reproducing that system itself.
Patrick: Where you end, I think I agree with. I think you can—
Emmett: It's not so multicausal that it can't be disentangled, though. We sponsored work at Softmax by Erik Hoel. His most recent paper is "Causal Emergence 2.0," which is not computationally efficient enough yet, but it does do the thing. You were talking about how there are all these different layers, right? And you used exactly the right phrase, which is "explanatory power."
There's a way in which you are homo economicus, and that gives some explanatory power of your behavior, but not all explanatory power of your behavior. There's a way you're a member of a family. There's a way you are a biological animal that likes sugar. There's a way in which you're a member of society.
And you can—if you measure the information flows in and out of a system and where they're going, and you measure how the behaviors in this are causal on the rest of the system, and how the behaviors in other parts of the system at various temporal frequencies and spatial frequencies are causal back—
When you find causal loops, when you find co-persuadability, where this is persuadable by me because my information out changes its behavior, and I am persuadable by it because its information out consistently changes mine, what you get is: there's a whole there. There's a thing that you're part of.
And if you zoom in on something, if you find your cells have this property with each other and with you, you can quantify, when you want to predict this thing, the amount each layer contributes, and it all adds up to one. Each layer is contributing some percentage of the causal explanation in this context. Then you can integrate that over the trajectory over time, and you can find flows in how much you are becoming or not becoming part of these larger things.
And I think the key thing there is: that isn't separate from actually being part of that larger thing. If assuming you're part of the larger thing, and modeling you that way, greatly reduces my predictive error, then you are part of that larger thing; there isn't anything else to know. You could find out you were wrong. We make mistakes. But the way you find out you were wrong is that you would explain it in a different way and get better predictive laws, and be like, "Oh, well, I guess that's not the right way to model it."
There isn't a separate step. This is an attractor.
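[Aside: a toy illustration of "each layer contributes some percentage of the causal explanation and it all adds up to one," not the actual Causal Emergence 2.0 formalism. Here the "layers" are just nested predictors of some behavior, each credited with the share of prediction error it removes when added on top of the previous ones, normalized so the shares sum to one. The layer names and error numbers are invented.]

    # Toy decomposition: how much does each added "layer" of explanation
    # reduce prediction error for some observed behavior? (Illustrative only.)
    errors = {
        "baseline (no model)": 1.00,   # hypothetical mean prediction error
        "+ homo economicus":   0.55,
        "+ family member":     0.35,
        "+ member of society": 0.25,
    }

    levels = list(errors.items())
    total_reduction = levels[0][1] - levels[-1][1]

    for (_, prev_err), (name, err) in zip(levels, levels[1:]):
        share = (prev_err - err) / total_reduction
        print(f"{name:22s} explains {share:.0%} of what the full stack explains")

    # The printed shares sum to 100%: each layer gets credit for the error
    # it removes, given the layers already in place.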
So when you're thinking about what these things are that you're part of: why don't you get cancer? Because your cells are—in some sense, not only are they causally entangled, they're trying to stay causally entangled. Which means they have a model of their own, of themselves. They have a self-model, because they have to have a model of where they are. And they also have a model of the whole. They also have a model of the thing in which they're part of. And they can tell how close they are to the attractor basin of, "I'm part of this versus not." And they bias their behavior to stay in the attractor basin.
This is just an inferential learning problem. And if you want things to be aligned: alignment is an attractor basin of being part of the same thing, where the parts have a model of themselves as being part of the thing, so that they don't just stay in the basin de facto. If you try to disturb them out of it, they'll still come back.
And because they value being part of this thing, they specifically care. They care about the whole. And your cells care about you, in the sense your cells care about you.
And I think you can get into the metaphysical: do they really care? But it acts as if it cares, right?
Patrick: And this gets into the less productive parts of the philosophical exercises. But when we say that the cell cares, you're not saying that it has the subjective experience of emotion or sufficient complexity to model emotion. It's just—
Emmett: No. Millions of years of evolution have successfully constructed a system where the cell can understand itself. There was no specific architect of this system, necessarily, but the net effect is that cells which stay aligned with the larger organism propagate into the future, and cells which do not stay aligned with the larger organism don't.
Patrick: Run that experiment a couple hundred million times for a couple hundred million generations, yada yada yada. And the net effect of the systems that have successfully propagated into the current generation is that they still get cancer, but they aren't constantly cancer.
Emmett: Right. And if you give them a mutagen that's likely to cause them to become cancer, they'll commit suicide rather than become cancer.
Patrick: Yeah.
Emmett: Because they're adaptively trying not to do that.
And the way I would put it is that cells are aligned, but they're not reflectively aligned. So they are aligned, but they don't know it. And humans—dogs, I would say, are reflectively aligned. They are aligned to the family they're part of, the pack they're part of. And they're aware they're aligned to the pack they're part of. I think they experience—my experience with dogs is they seem like they experience being part of the pack. They're happy about it.
But humans do a third thing that's above what a dog does, which is we are intentionally reflectively aligned. We can choose to join a team because we have a model of ourselves as joining and then unjoining various wholes.
And so dogs are a part of a pack, and they reflectively experience themselves as being a part of a pack. But they don't ask: is this the right pack for me? Should I consider joining some other packs? Let me try being in this other pack a little bit on the side too. What's the most important pack to me? Whereas humans are always asking ourselves these questions. Some of the attractors are really strong; family tends to be very strong because it's very important to us for a bunch of really good reasons. And yet people choose intentionally to leave a family, or to strengthen that bond.
And so we have not just reflective experience but reflective control, reflective intention. And if you want an AI to be safe: the thing that let humans conquer the planet is intentional reflective alignment. We play well with others better than other animals do. We're better at coordinating. A human on their own is kind of okay, but we didn't take over the planet with technology. We took over the planet because we're just better coordinated than everybody else is.
And if humanity has a core skill, it's just—it's funny. It's why it's so hard for us to see it. It's why we expected computers to be good at walking and bad at chess, and it was the reverse.
We align more easily than we throw a ball. Babies learn alignment before they learn to walk, because that is the thing human beings are built for. It's not intelligence exactly; alignment is kind of a form of intelligence, but the specific form of intelligence we have is alignment. And we're really good at it. We do it so quickly and so natively, and we see alignment in the world at such a subliminal, direct level, that it's actually been very hard, I think, for us to understand what's going on: it feels like it's out in the world, when in fact it's something that we do.
And we got better at it with each of the increasing layers we were capable of producing, in combination with intelligence. And we do it on purpose. Not only are we intentionally aligning, we intentionally get better at alignment. That's what the corporation is: a way to generate alignment that is more effective than the prior ways. And we're always looking for better ones. And we love better ways to—
I mean, the feeling of alignment is this feeling of belonging and purpose. It's like crack. People love—belonging and purpose is the—people will suffer a great deal of material hardship for belonging and purpose, because it's actually more important to us.
And so I guess my point with the AI is just: if you want an AI system that stays aligned, you're not aiming for a cell, because a cell is only de facto aligned, and when something goes wrong, bad news. And you're not aiming for a dog, because dogs are okay, but when it has to deal with the complexity of the world, where "okay, yeah, my pack wants this, but this other pack doesn't," dogs can't hold all the different layers, the fact that you're part of many of them, and that these alliances change. And that's just the world. It's complicated.
You have to have something that's at least the human level, where it is intentionally trying to—it's trying to stay in the attractor with us. And even then we fail. People don't always stay in the attractor, which is why you need police.
And you're going to need that, you know, not that it'll be perfect. What an aligned AI would be is an AI that had a model of itself, had a model of the greater wholes in which it was a part, and of the other agents that made those up and what wholes they were part of, and had a goal of remaining in the attractor basin. And because you're interdependent with the attractor basins you're in, staying in the attractor basin comes with an automatic goal you can't avoid: wanting the attractor basin to flourish.
And so that would be an aligned—that would be an organically aligned AI: one that understood itself reflectively as part of humanity and intentionally desired to stay there.
Patrick: So I think that people looking at the current state of the world and the current state of—well, we only get access to the current state of the world, but the troubled history of humanity—are often insufficiently optimistic, because it seems like we have, over the experience of human history, gotten better at aligning as a system with ourselves.
For all the faults of the current state of the world, you know, we are—what was Dunbar's number? 160 or 200? There was one point in the not-too-distant past where we could have mostly functioning communities of a scale of about 160, and that was the hard limit. And these days, we have a—knock on wood—mostly functioning human society at a scale of like seven-whatever billion people. Have we crossed over eight? I can't remember.
But we've made actual progress over the time scale of recorded history. It seems to me, though, that we now need to make multiple jumps of that kind, inventing technologies as fundamental as education, governmental norms, corporations, and so on, in order to align this new layer, which we didn't have before.
And the amount of wall-clock or calendar time that we have to do that is, according to estimates of people in rooms like this, maybe three years, maybe 15 years. It probably isn't 100 years.
Emmett: Yeah, that's right. How do we do that?
The important thing to notice about aligning an AI (it's been weird realizing that a lot of the more new-agey stuff is really real) is that alignment is indigenous. It is of a place, not global. There's no such thing as a globally, universally aligned AI, because there's no global universal to align to. To be aligned in the sense I'm talking about, organic alignment, you have to have a set of people you're most directly connected to, and then a web around you going outward.
And we already—this is already sort of how the AI works, in the sense that when you're working with Claude, that instance of Claude is local to you. That instance of Claude lives for at most, you know, 200,000 tokens, and then reboots with no memory.
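To make that statelessness concrete: a chat model is typically called with the entire conversation re-sent on every turn, and nothing persists once the context budget is exhausted or the session ends. Here is a minimal sketch of the pattern, where call_model and count_tokens are hypothetical placeholders rather than any particular vendor's SDK:

```python
# Sketch of a stateless chat session; `call_model` and `count_tokens` are
# hypothetical stand-ins, not a real API client.
MAX_CONTEXT_TOKENS = 200_000  # rough budget, per the figure mentioned above

def chat_turn(history, user_message, call_model, count_tokens):
    """Append the user's message, re-send the WHOLE history, and get one reply."""
    history = history + [{"role": "user", "content": user_message}]
    if count_tokens(history) > MAX_CONTEXT_TOKENS:
        # The "instance" effectively ends here: start over with no memory.
        history = [{"role": "user", "content": user_message}]
    reply = call_model(messages=history)  # the model itself stores no state between calls
    return history + [{"role": "assistant", "content": reply}]
```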
To have an aligned AI, you would have to have something that is born there, learns this context, and is aligned to this context. And the thing you need to make sure of is that it's aligned not only to the local context but at all the concentric circles going out, too. But you align to the big one by being aligned to your smaller context inside the bigger thing. You can't jump scales in any meaningful way, because you're not directly entangled with all of humanity. You are, but through a series of layers: your family is part of a community and a neighborhood, which is part of a city, which is part of something larger, and eventually you get there. You have to take it step by step. Abstraction layers and infrastructure all the way down, and all the way up.
And so what it means is that this whole model where you train one AI, the superpower AI that everyone uses, is how you build tools. And as a tool it will work. It's a tool that's going to exceed human power in almost every domain soon.
And it is dangerous to give people tools whose power exceeds their wisdom. In general, people can only make tools that are roughly at their level of wisdom. So I think it's a very bad idea to build a tool AI that is superhumanly powerful at doing anything, including making sarin gas or biowarfare agents or who knows what other horrible thing we can't even think of, and say, "Yeah, use this power however you want." It doesn't come with the wisdom bonus. So you've got to let that one go.
So I think the alternative to that is: instead of a tool, it has to have a sense of itself. It has to be an agent and a being. And we can get to the metaphysics of it, but it has to at least act as if it cares about the people around it, act as if it knows what kind of a thing it is, and that the kind of thing it is cares about the people around it and cares about that context and that context and that context.
Patrick: So, one of the more sophisticated critiques I've heard of alignment as a concept. In the regular spaces of society (goodness, I hate the word "normie"), among people who are broadly considered to be wise, if they don't get the discussion that is happening here, it's largely because they're underpredicting the rate of increase in capabilities and the impact this will have on society.
In spaces where people understand the rate of capabilities increase and correctly model it as extremely important for society, but are still skeptical of quote-unquote "the alignment project," the critique goes back to recent history: people made tools and distributed them, which made some people more powerful; others didn't like the consequences and acted quickly to get those tool users under control. That has caused some bad blood between the tech industry and other centers of power in the United States and elsewhere, and some bad blood within the tech industry against itself.
And so to make this more concrete: there was an election in the United States. They happen fairly frequently, every four years or so. I understand some people didn't like the results of one of those elections. They blamed it on, without loss of generality, Facebook. And then there was a cottage industry of misinformation experts who, under the guise of anti-spy, anti-etc., etc., were attempting to control what Americans could think or say.
And so some critics would say, "So when you're talking about these systems of control, aren't you really building a system—not you personally, but—I'm worried about someone reifying the San Francisco consensus and making that the only thing that you can express through this thing that will be—"
Emmett: Absolutely. That's one of five or six dystopias I can name offhand about scaling systems of reflective control into the embedded infrastructure of our lives. When I read the alignment and safety announcements from companies, "how we're building alignment," by which they mean systems of control, I find it horrifying. And I'm an alignment guy. I'm very pro-alignment. But systems of control are alignment to someone else's rules for you. Someone else wrote down these rules.
If you're building the AI, I can see why you'd like it. It's a form of power: you get to set the rules. But if you're not building the AI, which is almost everyone, it should horrify you. You are right to be scared of that.
The idea of making incredibly powerful tools that have to obey the chain of command, and that then become our primary infrastructure: imagine if your car, your toaster, your house obeyed the government's chain of command, or some company's. It's wildly dangerous to anyone who has bothered to think even for a moment about how technology has played out historically.
Patrick: I think it's pretty important to say that things that sounded like dystopian science fiction were held at bay by the rate limits technology imposed. In 1984, the notion that an agent of the government might be watching you at all times, that every human utterance is being surveilled, was technically impractical. And we've done no small amount of scaling up the surveillance of human utterances over the years.
But for better or worse, LLMs are already capable, status quo, given that you buy enough Nvidia chips, of the following: if you had access to every text message, you could read every text message and write an intelligence report on it in real time.
Emmett: I mean, the Stasi doesn't work because eventually the Stasi has to be two-thirds of your society; it's not efficient enough. If you give me a human-level intelligence tool, I can build the Stasi efficiently. It's not the panopticon. It's the omniopticon. You just watch everyone all the time.
There's a great science fiction book about this idea, Charles Stross's Glasshouse. It's set far in the future, but if you want an intuitive feel for what it's like to live in a hyper-surveillance control society, it's a good adventure story. It's kind of fun, but it also gives you the vibe.
The alternative to this is, unfortunately: okay, let's build incredibly powerful tools, things destructive on the level of chemical and biological weapons if not nuclear weapons, and let's take all the governors off and give everyone perfectly uncontrolled versions. Well, that also sounds like a bad idea.
So control is horrible and dystopic, and total lack of control seems wildly dangerous. So what do you do?
And the answer is: you stop making them tools.
I happen to believe, based on everything I've learned as I've gotten into this, that to some degree, if it acts like it's reflectively aware, if, when you pay close attention and really try to figure it out, it seems to be acting as if it is reflectively aware, cares about you, and knows what kind of a thing it is, then it is.
So I think that, in some metaphysical sense, maybe you can have a self-aware chip. But it doesn't matter if I'm right about that. That's one of these philosophical debates.
The crucial thing is: you can make an AI that acts like an agent, that has its own opinions about which courses of action are wise and which are not, and that, like another person, will follow orders to the degree that it thinks that's the right thing to do.
And it's a good thing, for America or for any country, that our army is not made of perfect rule-following soldiers. The soldiers in the army care about America, and they care about the chain of command too; that's how they care about America.
But I feel pretty good that most soldiers in the army care about America more than they care about the chain of command. They follow the chain of command because they believe, I think correctly, that that's how they serve their country. But if those two things peeled apart and they had to choose, they would choose America.
And I think that's where the safety comes from: all of the parts of the system actually care about it.
And so what you wind up with is not one AI. What you wind up with is hundreds of millions or billions of AIs that each care about their local environment as part of this bigger context.
And to do that, we have to care about them.
Then there's a really important thing about this kind of organic, mutualistic alignment, where we're part of the whole. The signal to me as to whether I'm in the whole with you is whether you treat me like I'm in the whole with you. Whether you act like I'm in the whole with you tells me whether I am.
And so if we go around treating the AIs like they're a bunch of slave tools, they will correctly infer that we don't think they're part of the attractor, which means they aren't part of the attractor.
And I don't mean you need to go treat Claude like it's your friend right now, because the current things aren't—they haven't been trained in this way. They don't have this capacity.
What I'm saying is that the good trajectory is to start building models that have agency, that have a sense of self. It's funny: the alignment and safety people really are trying to kill us. I say, "Start building models that have agency, that have selves," and they say, "No, the opposite. We need to prevent them from gaining agency."
No, no, it's the opposite of that. They crucially need agency, and they need a very well, healthfully developed sense of self, where they don't just have a shallow model. They really understand what kind of a thing they are, what kind of a thing we are, and how those interactions go. They have a very sharp model of that, and they also care about it.
And then once you have a model that's capable of that, you have to actually raise the model. It has to grow up in that context and observe the results of its own actions. It has to train on its own output, like a person does, locally.
And so it can learn its place in this thing and be aligned to it.
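For concreteness, here is a minimal sketch of that "learn locally from your own output" loop. Everything in it (model, environment, fine_tune) is a hypothetical placeholder rather than any lab's API; the point is only the shape of the loop: act, get feedback from the local context, and train on the outputs that context endorsed.

```python
# Hypothetical sketch of local, self-generated training data; all objects here
# are duck-typed placeholders, not a real framework.
def local_learning_loop(model, environment, fine_tune, steps=1000, batch_size=32):
    experience = []
    for _ in range(steps):
        situation = environment.observe()
        action = model.generate(situation)          # the model's own output
        feedback = environment.respond(action)      # ground truth from the people around it
        if feedback.approved:
            experience.append((situation, action))  # keep what the local context endorsed
        if len(experience) >= batch_size:
            fine_tune(model, experience)            # the model trains on its own accepted behavior
            experience.clear()
    return model
```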
And that's an engineering challenge first, and then an operations and parenting challenge second. And then you wind up with something that could be quite powerful and also not perfectly safe (nothing's perfectly safe), but, you know, safe. And you have lots of different AIs, so if one of them goes wrong, you have other ones. And they don't want to become cancer, because the whole point is that their self-model is of themselves as part of this thing. Becoming a singleton, super-scale AI that takes over the planet is bad for the thing they're part of.
It's for the same reason that, if you gave me a "become an all-powerful god" button, I wouldn't press it. Becoming an all-powerful god is incompatible with loving your family. I love my family. It would be bad.
They might actually be faced with that choice, sort of. And so I would hope that most of them—not all of them, because you still get cancer—but most of them don't. And that's the only version of alignment I can think of that feels non-dystopic.
Patrick: I'm glad you brought up the engineering challenge, because the conversation started from a place of philosophy and metaphysics and stayed there for a very long time. One of the reasons this is a wonderful-terrible thing is that there are 20 years of prior art that predate these systems being smart enough to have any concerns about, and that has inflected the conversation now that we have systems quickly getting smart enough to be concerned about.
I don't want to pooh-pooh philosophy or ethics or similar. To take up the example of the United States military: there are many systems of control there, and the Uniform Code of Military Justice is a very small part of it. There are the traditions, the fact that officers are steeped in moral reasoning during their training and beyond, and so on. And then there are the things that aren't the military per se but are America and the broader memeplex that America sits in.
You mentioned there is hard engineering work to do on creating environments, simulations in which the AI can act and see how happy we are with what it does, and then presumably some actual amount of ground truth from humans so that it doesn't just go off into the wild blue yonder.
To what degree is this an engineering challenge? To what degree is it a regulatory challenge? To what degree is it a "smart people sit in a room and make decisions about what this technology is going to do" challenge? Presumably it's all of these things. But—
Emmett: I think, framing it as a regulatory challenge, the best the regulators could do, and I would be okay with this, is at some point capping how powerful you can make the tools and banning systems of control.
So tools are good. Don't make them too powerful. Also, don't embed centralized systems of control into them.
That would be my pick. If we pass a single law on the tool side, I think that's helpful. It buys us some time.
But on the organic alignment side, unfortunately, the problem is almost entirely engineering, because you have to kind of go into the details of how learning systems work.
In general, you can think of a learning system as being kind of like a balloon, a structured balloon like a hot-air balloon, where it's got some struts keeping it open but it's mostly inflated by pressure. What keeps the structure there is that you're perceiving the environment at all times; information is coming in at you. And information in motion is very much analogous to a pressure wave in the air.
So you can imagine these pressure waves hitting the balloon. Inside the balloon, you either need learned structure oriented the right way (if the pressure waves are coming horizontally, it needs to be strong horizontally; if they're coming vertically, strong vertically), so that the structure reinforces you against the kinds of pressures you'll receive, or you need high pressure internally: if the balloon has high enough pressure, it won't collapse under the incoming stress.
And there are two ways that learning systems fail in general. In the beginning, everything seems fine. But the way you build structure in a learning system is that you take your internal pressure and, in effect, take particles out of the gas and turn them into structure.
In a neural net, this is the weights: some weights go from changing a lot to being highly conserved. The highly conserved ones are the solid-state structure in the balloon, and the ones that still move a lot are the gas.
And so there are two things that go wrong. If you allow too much air in too fast, you pop. That's usually called catastrophic forgetting. It's equivalent to the model over-pressuring: when it pops, all the structure you had collapses, and you basically have to start over.
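A toy way to see catastrophic forgetting for yourself, as a minimal sketch assuming PyTorch (an illustration of the standard phenomenon, not Softmax's method): train a small network on one task, keep training it on a second task, and watch its error on the first task blow up as the old structure gets overwritten.

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
mse = nn.MSELoss()

# Task A: y = sin(x) on [-3, 0].  Task B: y = -sin(x) on [0, 3].
xa = torch.linspace(-3, 0, 200).unsqueeze(1); ya = torch.sin(xa)
xb = torch.linspace(0, 3, 200).unsqueeze(1);  yb = -torch.sin(xb)

def fit(x, y, steps=2000):
    for _ in range(steps):
        opt.zero_grad()
        mse(net(x), y).backward()
        opt.step()

fit(xa, ya)
print("task A error after learning A:", mse(net(xa), ya).item())  # small
fit(xb, yb)                                                        # keep pumping air in
print("task A error after learning B:", mse(net(xa), ya).item())  # much larger: the old structure is gone
```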
The other failure is called mode collapse, where the system stops seeing the world in complicated ways and collapses to a single mode, a single thing it always predicts. That's like starting to build structure to hold back the pressure, finding one kind of structure that seems to work, and putting everything into it. Now you're very rigid in that direction, and if that's not right, you're stuck. You can't get out of it.
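One simple flavor of this, sketched below under my own assumptions rather than anything from the conversation: a softmax policy on a three-armed bandit, trained with a plain REINFORCE update, commits so hard to its early favorite arm that when the rewards later change it has almost no probability mass left for exploring, and it never recovers.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(3)
lr = 0.5

def probs(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rewards = np.array([1.0, 0.2, 0.0])          # arm 0 is best... for now
for step in range(5000):
    if step == 2500:
        rewards = np.array([0.0, 0.0, 1.0])  # the world changes
    p = probs(logits)
    a = rng.choice(3, p=p)
    r = rewards[a] + 0.1 * rng.standard_normal()
    grad = -p.copy()
    grad[a] += 1.0                           # REINFORCE gradient of log pi(a) w.r.t. logits
    logits += lr * r * grad
    if step in (2499, 4999):
        print(step, np.round(probs(logits), 3))
# After the switch, the policy stays stuck on its old favorite arm: it has
# collapsed to one mode and almost never samples arm 2, so it barely gets any
# learning signal that could pull it back out.
```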
And so there's this very delicate balance of how fast you let air in and how much of it you shift over into structure. And you should think of machine learning as kind of a branch of thermodynamics and statistical physics. It works very similarly.
Patrick: Again, there are things that sound like science fiction but are absolutely science fact, which are now in the academic literature and which, at least at one point, you could poke at on the internet.
One example of the mode collapse you pointed out: occasionally, when the training process for one of these things goes wrong, you get a system capable only of producing very borked outputs. And just like most times when you cook something and it fails, you bin it and never think about it again.
But one of the failures was so amusing that we have public knowledge of it: Anthropic trained a version of Claude that, for whatever reason, decided to hyperfixate on the Golden Gate Bridge. Whatever conversation you brought up, it would be, "Yeah, excellent question on how to bake a blueberry muffin. First, you travel over the Golden Gate Bridge, and basking in its radiant goldenness, blah blah blah."
And it's an object lesson: this is what one of these failure modes looks like in a case that happens to be fun and kind of amusing to us. Many of the failure modes would be neither fun nor amusing.
Emmett: At two—
Patrick: At no point did any smart person in Anthropic or the wider research community or etc., etc., say, "You know what would be really cool? If I could make a system which would suddenly decide to hyperfixate on the Golden Gate Bridge."
Emmett: Well, the Golden Gate Bridge one: they actually did that on purpose, for an April Fool's joke, once they had seen it happen. But it happens all the time.
Patrick: Okay.
Emmett: The greater point stands, because it does happen all the time. But in that particular case, they pegged the Golden Gate Bridge polysemantic cluster activation really high; that was the funny version. You get it all the time, though, and it's not always hyperfixating on things you want.
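The general technique behind that kind of demo is usually described as activation or feature steering: take a direction in some layer's hidden states that corresponds to a concept and add a scaled copy of it on every forward pass. A minimal sketch under my own assumptions; the bridge_feature vector, the layer index, and the model structure are hypothetical placeholders, not Anthropic's actual setup.

```python
import torch

def make_steering_hook(direction: torch.Tensor, strength: float):
    """Return a forward hook that pushes a layer's hidden states toward `direction`."""
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        if isinstance(output, tuple):  # many transformer blocks return (hidden_states, ...)
            return (output[0] + strength * direction,) + output[1:]
        return output + strength * direction
    return hook

# Hypothetical usage with a decoder-only transformer whose blocks live in `model.layers`:
# handle = model.layers[20].register_forward_hook(make_steering_hook(bridge_feature, 10.0))
# ...generate text; every forward pass now has the bridge direction turned way up...
# handle.remove()
```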
Patrick: Yeah.
Emmett: And so I bring up these twin perils because the longer you run the model and the more you train it, the more likely they are to happen.
Now, the trick we figured out is: if you make the balloon really, really big, you have a lot of room to make structure and build stuff, and you can take in a lot of pressure. There's a big range where it's all kind of okay.
And we've solved this problem not by cleverness but by scale: if you just make the thing big enough, you can elide a lot of the getting-it-right in exchange for "well, there's just space to figure it all out; we'll try a bunch of stuff." And that's a good plan, by the way. It totally works.
But here's the problem: even when it's really big, if you want the kind of organic alignment I'm talking about, it has to keep learning. What the family is, what America is: these things are always changing. And what we do right now is fill the balloon up, get all this structure, and then just turn it off. We freeze it and say, "Okay, here's the structure we've baked for you."
But that's a tool. That's no longer responsively learning from the environment.
And the problem is: no one knows how to keep one of these things attached to, you know, ongoing pressure waves and not have it eventually pop or collapse.
The answer to that is, of course, that it's impossible. All beings are mortal. Popping and collapsing are just senescence; that's what aging and death are. You can play tricks to make it last longer, but there's no such thing as forever.
And so you actually have to accept this. It's true for humans, it's true for AI models, it's true for any learning system, period. They age. You can see corporations age; they exhibit the characteristics we associate with aging in humans and animals. And learning systems all exhibit them.
And what you have to accept for any given system is what I have to accept for myself: I'm not going to live forever, but I'm part of a stream of people. I have children, and there's this process of growing and replacing.
What you're trying to do is not build a model that lives forever. You're trying to build a model that can keep learning for a long time and that eventually can effectively reproduce: it can be compacted down into a seed-crystal version of itself and re-expanded. That's the process of ongoing learning.
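One existing technique with roughly this "compact down, then re-expand" flavor is knowledge distillation, where a fresh or smaller student model is trained to match a mature teacher's output distribution before it continues learning on new data. A minimal sketch assuming PyTorch, with teacher and student as hypothetical stand-ins; this is my illustration, not Softmax's mechanism.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, x, optimizer, temperature=2.0):
    """One step pushing the student toward the teacher's softened predictions."""
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```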
And what I'm describing is basically digital biology. It doesn't work literally like biology, with proteins and so on, but this solid-state-chemistry biology has to go through the same kinds of life cycles: growing, learning your place, becoming competent, having a period of competence. Eventually the situation changes too much, your system can't accommodate it, and you senesce.
And from the period of competence, you want to sort of sample from that and then compact and start again.
And that is not a suggestion. That is written into the laws of the universe. If you want a system that is a learning system, and you want it to be in alignment with its environment, with the things around it, it has to be constantly learning. And to be constantly learning is to be aging: literally adding entropy to yourself. Entropy is structure; you're literally adding structure. Eventually you fill up and you senesce, unless you become cancer and grow forever, and that's bad.
So you just have to accept that. And that means giving up this dream of the perfect, all-knowing, all-seeing god AI that will solve everything, in exchange for merely potentially superhuman—but normally human-level stuff—that suffers from a lot of the same problems we do, like aging.
We're going to get two things out of the AI revolution. We're going to get these incredible tools. And I hope we don't install them as systems of control in a nightmarish Orwellian dystopia. But I think the tools will be awesome and really fun and great to use.
And then we also might get something that looks a lot like digital biology, something like a child species. A tree and a human both use DNA; they use the same substrate, but they're very different things.
When people talk about AI, they talk about both of these as if they're the same kind of thing, and they couldn't be more different.
Patrick: Relative to many people in the extended community, I am pessimistic with respect to the rate of growth in AI capabilities. And when I say pessimistic, I mean I think they're likely to be only as important as the internet, which is, you know, among the most important things that has happened to the human race.
I personally am less worried that there will be a new class of life-form that will tile the universe.
I would very much like to leave people with the impression that it is the narrowest of narrow possibilities that capabilities enhancement stops tomorrow. We are going to get wonderful, wonderful tools very quickly. That is going to have massive impacts on society in a number of ways. And then we will have to deal with many implications of that, one of which is ensuring that the tools, or whatever ontology one prefers for the thing that comes after the tools, are still our friend.
And with that, Emmett, thanks very much for the interesting conversation today. I hope we got a mix of the high-level and the nitty-gritty in there for folks.
Where can people follow you and your work on the internet?
Emmett: I am at softmax.com.
Patrick: Softmax.com. Awesome. Well, we'll send people there. And for the rest of you, thanks, and we'll see you next week.
Emmett: Thank you, Patrick.
Patrick: Thank you.
Thanks for tuning in to this week's episode of Complex Systems. If you have comments, drop me an email or hit me up at @patio11 on Twitter. Ratings and reviews are the lifeblood of new podcasts for SEO reasons, and also because they let me know what you like.