Emmett Shear interview with Liv Boeree: Nonlinear Function
Created: December 05, 2025
Modified: December 05, 2025

Emmett Shear interview with Liv Boeree

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Emmett Shear - Why NATURE Holds the Answer To AI Alignment Sept 2025 https://www.youtube.com/watch?v=Nkhp-mb6FRc

(YouTube automated transcript cleaned up by Claude 4.5 Sonnet)

Liv Boeree Interviews Emmett Shear on Organic Alignment

Liv: All right. So, thank you all for coming. Super stoked to have this conversation. I'm joined by Emmett Shear. Emmett, I mean your CV is absurd. You founded Twitch, Justin.tv, you were CEO of OpenAI for a very brief moment, and now you are the CEO and founder of an organization called Softmax, which as I understand it is trying to solve AI alignment through a very novel approach you call organic alignment. So to get us started, explain sort of what organic alignment is and why you think that is what we need to be doing.

Emmett: Yeah. So, thank you for having me. When I first started looking for work in AI, it was after I'd retired from Twitch and I had this idea that I was going to take it easy and do some individual research into AI. And so, I tried to learn how everything works because it was obviously the interesting thing I'd wanted to work on. And that was actually very nice. I really enjoyed that brief seven month period of my life. And then the OpenAI thing happened.

I had never had the experience of receiving a sign before, like the universe saying, "Hello, you specifically, you're supposed to be doing this." Because being in the middle of that, it just became very obvious that the people involved at all levels on all sides weren't concerned with the aspects of it that concerned me. Which isn't to say they were dumb or weren't thinking about it. It's just that the questions they were interested in, the way they were thinking about it, were different. And I was like, "Oh, it's really obvious." Having now seen it at the very highest levels, through the company, and talking to all those people, there's this whole giant swath of things that I thought were important that I kind of knew no one was working on. You think they must have a secret project where they're working on it, and then there's no secret project, and you're like: oh, I see, okay, well, if I don't do this then nothing is going to happen.

And so I started looking for how I could help with this problem. And the problem, at a really high level, is that we're going to make these AI systems that are going to be very powerful, and when you make a very powerful thing, it will get used to do stuff. If that stuff is good, then that's good, and if it's bad, it's bad, and someone should do something about that. Someone should have a real plan: not one of those go-by-intuition-and-gut kind of plans, but a plan you could measure your progress against. Like, if you wanted to go to Mars, you could build a sequence of rockets that got bigger and bigger until you finally made the Mars-sized rocket.

And so I really started working on the question: what would it mean for an AI to be aligned? What does it mean for anything to be aligned? That felt more feasible than the question of what goodness is, which is the actual question you have to answer. So I told myself that I only had to figure out alignment. It turns out you actually have to solve morality, more or less. Not "solve morality" in the sense of being finished; physics hasn't "solved" motion either. We still have trouble making things move the way we like, but there was this moment where physics systematized its approach to motion. You have the idea of systematicity. And the challenge is to systematize our approach to morality, to systematize our approach to knowing the good, knowing the right way. Because if you or I can't be systematic about that, then when you build a really powerful thing, you'll point it at something. When you point it at the thing, you just kind of hope that it's good, and if it's not, everything goes bad.

And so I started really seriously working on alignment. And what became clear is that what most people called alignment was what I would now call control alignment or steering alignment. You have some target you've picked out, which you've defined in some way: do whatever that guy says, or follow this set of rules, or follow this particular algorithm for determining the good and then do whatever you infer to be right given that set of rules.

And the problem with control alignment is that people have been trying for a really, really long time to get a stable, useful definition of the good that always produces the right results, and it just doesn't seem to work that way. And it's pretty obvious why it doesn't work that way: the good itself isn't stable, because we aren't stable, because the world is non-stationary. Things keep changing, and whatever you wrote down before, if you were lucky, was a great approximation then, or at least a good enough approximation then, but it's not going to stay that way. The world changes in ways that your model can't anticipate and never does anticipate. Which is why you have, you know, the saying: all models are wrong, some are useful. Your model isn't reality. It's just your best guess.

And so it became clear we needed a different kind of alignment. And where I wound up going was to ask this question: okay, if you're not doing control, if the goal is not a system of control, if the goal is not to enslave our future AI children—sorry, I can't help but put the jab in—but the goal is nonetheless to get along, what would that look like? And the answer is: well, we're doing it everywhere right now. Humans are actually really good at this, and nature, evolution, is really good at this game. Multicellularity is alignment. It's a bunch of cells that all have their own individual goals, and yet here I am, here you are, mostly one coherent person made out of 28 trillion parts, and somehow all of their individual goals cohere. I don't experience the world like I'm made of 28 trillion parts. You have parts inside of you a little bit, but not like 28 trillion of them, far fewer than 28 trillion.

And you experience this going up too because when you're on a team, a really tight-knit team, you can feel like there's a we, there's a thing going on where you know, you can ask yourself in any moment on a team what do we want and what do I want? And you get different answers and you know what the we wants. You're very aware of it. You're the we. Where is this knowledge coming from? Well, it's an inference you're making about this collective which has goals which is not different than the inference you make about another human. Like you don't see inside someone else to know what they want. You just watch their behavior and you infer given that behavior I infer they have these goals. Well, given the collective's behavior, I infer it has these goals and that's organic alignment.

Organic alignment is this process by which it seems possible for a bunch of agents that interact with each other to eventually wind up in kind of an orbit around each other: their beliefs form an orbit where my beliefs influence your beliefs, which influence my beliefs, and we find ourselves in homeostasis, stably in interaction with each other over and over again. And when that happens, you get to make this inference: there's an objectively real thing there. There has to be, right? Almost definitionally, there's some statistical invariant that's being maintained that's causing this. And that's the we.

And what you realize is that you are utterly dependent on the we. You're born into a family. Agents don't come from nowhere. There's never been an agent in the history of the universe where you look backwards and it came from nothing. You were made by something that had a goal in mind for making you. It wanted to perpetuate itself. It wanted to have a robot servant to go do things. You were made for a reason. You may not know what that reason is, but you have a purpose. There is a teleological nature to existence for all agents, because they don't just pop out of nowhere. They always come out of other agents, and what makes you an agent is that you have goal-oriented behavior. So since you're a result of agent behavior, you're a result of goal-oriented behavior.

And when you see this, it's like, oh, okay. Well, that's the thing we're trying to learn. We're trying to learn what does it mean to be a family? What does it mean to be a we? What does it mean to be a society? And that trick—can you get the thing to be part of a society, part of a team in a stable way—that's the core of it. That's the heart of organic alignment.

Liv: So, it sounds like you're drawing on nature heavily, and I would agree: the cells in our bodies, you know, all my stomach cells, they're working together very well to do the job of my stomach. But that seems, to me, like a case where I wouldn't expect my cells to have individual identities, right? Because they actually share identical, or nearly identical, genetic code. So if you're talking about a group of digital agents that have their own separate identities, how would that same kind of emergent coordination happen between them if they don't share that same genetic code with one another?

Emmett: I have a couple of answers to this. One is the cheeky answer: exactly how much difference is required before it's no longer possible to form a collective agent? Like, one more gene change and now it becomes impossible? When you start to look at it, you realize there can't be a hard boundary. It's always going to be contextual.

But I think deeper than that, you already know organic alignment. You do it every day. You form organic alignment with the other humans around you, and humans are actually wild at this. We're better at alignment than any other species on the planet. In fact, I would argue that our capacity for alignment, more than our capacity for intelligence or almost anything else, is what sets us apart. Human beings almost killed off the whales because of our capacity for alignment: organizing ourselves into hunting parties, building whole supply chains, building ships—all alignment problems.

But then we looked at the whales and we said no actually you're on my team. You're we. You are me. We're not that different. You're part of the mammal team. We're part of the mammal team with you and we prefer that you don't die. We prefer that—we see you as part of us enough that you're not just dead matter to us or we would be sad. Not so sad that we would do that much but we'd be sad and so we will deliberately take extra action, do extra work to stop other people from hunting you and be reasonably—some small number of us will actually be pretty reasonably upset about it.

Other species do not do this. They don't—they wouldn't get organized to go save another species because they recognize their fundamentally shared nature. Other species don't start veganism movements inside of—this is not a thing that exists in other species. Whether the vegans are right or wrong, it demonstrates this deep human capacity to see that we are part of this larger thing.

And you have this whole moral progress. The arc of moral progress becomes really obviously true. There is moral progress. It's like how big of a scale can you align on effectively? How big of a group can you get working in a coherent way that serves the flourishing of the members? That's basically your cap on alignment. That's how we measure it for the agents. How many agents can we get to be in a single attractor together? And there's moral progress whether you want to call it moral or not—alignment progress is the ability to see the way that actually all humans are your ally or could be if you were better at organizing. They have the potential to be your ally. That's just true. They do. Someone who doesn't believe that is just incorrect about the world.

Now, capacity and actualization are different things: just because someone has the capacity to be your ally, and it's important to see that, doesn't mean that they are your ally. I think some people lose track of this. A lot of people who are more universalist have, I think, a tendency to confuse the capacity with the actuality.

But your cells do it at this very self-sacrificing level because your cells are mostly clones of each other and their reproductive cycle goes through the collective, like ants. Ants are like this. Ants don't have a separate identity from the other ants. The ant colony is the thing with the identity. Humans—I love us—are not like that. We are not ants. We're groups of monkeys that functionally behave like ants at the scale of ants.

Yeah, you could imagine a future of humanity where we turn ourselves into ants. We could do that: there becomes a central reproductive function that decides what new things get made, and it just prints copies of those people over and over again to fill various roles. And it's a singular, gradient-descent search for improvements to humanity rather than an open-ended one. It's literally the thesis of Brave New World.

Liv: Yeah.

Emmett: And I think that Brave New World actually kind of demonstrates why—that society, were it to have to compete with a more open society, would find itself at a grave disadvantage because the very thing that makes morality hard, that makes alignment hard—your model of what good people are, your model of what skill is, your model of whatever it is. You pick something, you're not right. There's something about the world—you're going to find out that you've put yourself into a cul-de-sac and your model is wrong and you're going to lack the diversity in the population to figure it out.

If you think of it from an evolutionary perspective, you do not want a highly clonal population. It's very dangerous. You're the cheetahs. You're very fast. You're overfit. You've overfit the circumstance. And being overfit means you will outperform when the environment is stable and underperform when the environment is ambiguous. And the thing about building more humans and having more technology is we just dump entropy into the environment constantly. The world just gets weirder and harder and it's this never-ending—and the instant we get good at handling that level of entropy, we level up and we start dumping more entropy into the environment making it weirder and harder. It's this endless cycle.

And so your beautifully designed fixed society where you figured out exactly the right kind of people to make and your model for how to adapt those people when this kind of change happens and that kind of change happens is just going to get hit in the face by some kind of change you didn't expect and oops. That's why open societies are better.

Liv: Okay. So coming back to how you would imbue these principles into AI, what methods—because I know you've been only running for a year or so and this is early days, you're just in the research phase—but what are some of the promising techniques that you're exploring?

Emmett: I think we have a map now, more or less. We know which mountain peak we have to get to, even if not exactly how we get there. So let me give you an example.

There's the fact of alignment and then there's reflective alignment. Just like cells do differentiate but cells don't tell themselves stories about how they're differentiating. They don't decide I'm going to be a blacksmith because being a blacksmith is heroic or whatever. They just become—they just differentiate into a liver cell. They read signals. They do it. They have a model of self but not a model of themselves as thinking selves, right? They don't have multiple—humans have this very stacked, many layers of reflection. The cells do not have this.

I think there's a way in which you could say—I think you get better predictive results if you assume that cells are aware and just are very dumb. Just don't experience most of our experience. But they're aware. There's just not much going on.

That's not good enough. The problem is, if you have something that is only de facto aligned, the way cells are de facto aligned, then when the system changes under variance and disturbance, if they get knocked out of the attractor, they just get knocked out of the attractor, and that's that. So those attractors aren't very strong.

What's cool about humans is that we can anticipate the things that would cause us to become cancer. We can see if I go down this path—I offer you the vampire pill that makes you a vampire and gives you great power and joy and you'll feel really good, but you're going to murder everyone you know and torture them forever. You don't take the vampire pill because you don't want to become that. Even though it might feel good, you don't will that future.

Liv: But we still behave—I mean you mentioned acting cancerous. Everyone has different phrases for this. I call it Molochian behaviors. You know these lose-lose type dynamics. But at the same time we have all these tragedy of the commons and race to the bottom scenarios.

Emmett: Okay, let me answer your question. So in order for me to be stably aligned with you, not just aligned with you de facto, I need a model of how my actions not only align or don't align to the current goals. I need a model of how my actions change my own learning trajectory and your learning trajectory and how that will impact the attractor in the future and whether or not this leads to the robust flourishing of the attractor or not.

I need to understand the consequences of my actions for the collective dynamics themselves, which you do all the time. Like I'm going to go out of my way to be nice to this team member because I know that I need them to do this and I can tell that I'm not articulating the vision so this person's drifting and—we constantly model how what we're doing impacts the groups we're in.

Okay so to do that, AI models have no sense of self today. The problem with AI models is they're story simulators and they can simulate any story. We very mildly condition them to tell certain stories over others, but they've just been conditioned to do that instead of this. They don't have memories.

Liv: Yeah.

Emmett: They don't have memories, right? You are memories.

Liv: Personality.

Emmett: Yeah. And you are a set of habits: not only memories, not just behaviors, but habitual ways of learning. Habitual ways that you engage with the world: how you handle this problem, how you learn that thing, in the situations you find yourself in. And so you've been continually learning on your own behavior for a long time. That's yourself. And then over time you form a model of yourself. You have a model of the kinds of things you do, and that model actually guides most of your behavior.

There's this literal point in your brain where eventually the cortex basically starts stepping in instead of the hypothalamus and the basal ganglia, and it starts driving, because its predictive accuracy gets high enough that its guess as to what you'll do next, based on the story it's telling itself, is better than your intuitive feeling guide, which has a higher valence. You know you've been down this path, and yeah, it feels a little higher valence right now, but we're just going to go with the thing that actually works. We've tested this out. The predictive loss is low enough. We know which way to go.

And it's all about modeling yourself because how can you possibly model a we and stably maintain a we if you don't even know what an I is. So step one is you have to get models that have a theory of mind for themselves and then for others at the just base level of I can predict—I have a model of my thoughts, what kind of thoughts I have.

And then you need a model of the dynamics of mind, which is sort of personality. What kind of thoughts you think is not static. There's a model of it, but you also have a model of how that model changes, which is your personality. Your model of your thoughts is like, oh, I'm a person who gets mad easily, I'm a person who does this, I'm a person who does that. But you also have a model that says, I'm the kind of person who used to be this kind of a person, I'm now this kind of person, and I will be this kind of person in the future.

And that's many layers of self-reflective models and models of your models. It takes a lot of training and very specific learning circumstances to induce this. If you take a human being out of a social context and raise them in the wild with no other humans, they do not develop these models properly. It's very damaging. We are predisposed to develop this: we have the equipment, the capacity, the inductive bias. But if you take away the learning scaffold, we don't develop it. It's not just in the architecture. It's in the architecture and the training data.

And then you have to put them in a circumstance where not only do they exist and have stable goal-oriented behavior that they can model, and not only do they have a good model of that stable goal-oriented behavior and of how their behavior changes, and the same again for everyone else around them, but where, having all those models, they can notice that there's a collective model made out of those models, that the collective model has its own goals and personality and is changing over time, and all of the causal connectivity between all of those.

And at the point you've done that, you have something that knows how to do alignment. That is what the skill of alignment is: theory of mind for groups, right? And if you have theory of mind for groups, you can do alignment. The reason it doesn't happen automatically is that it's fucking complicated. It's maybe the hardest single intellectual challenge people take on: managing dynamic groups of people, leadership. It's one of the single hardest things we do. The only thing that maybe comes close is abstract physics or math, right? In terms of challenge.

And even then—I've done some amount of both. People are harder, man. We only think that math is hard because we haven't even tried to really solve people. Math seems hard because we know what the standard of really doing math is. We don't even know what actually being good at people would look like. That's never happened.

Liv: Well, especially if, as you say, it's all frames, different individual frames, and it's essentially relative. It's always changing. It's always in flux.

Emmett: Absolutely.

Liv: So I can conceptualize the kind of environment you would be wanting to build this collective of agents where they can start trying to get this emergent coordination. Is it games? What—is there anyone doing this right now?

Emmett: We are. Okay. Yeah. I mean, yes. Yes. Someone's doing this. We—this is what we're doing. So at the end of the day, it's a game. It's a virtual world where agents take actions which have consequences. Some of which are better or worse than each other. You call them getting points or reward or whatever you want to call it. And then the ones that are better at the game get to differentially reproduce more, either through some sort of evolutionary approach or because the gradients of their trajectories get pushed into the model more. But there's no difference between reproducing by we make a copy of you versus reproducing because your behavior set gets saved. That's the same thing.

Liv: And what have you seen so far?

Emmett: Oh man, it's so hard. We can't get them to do anything—no, that's unfair. We've discovered a lot of the principles required to stabilize individual behavior such that it would be possible to build a self model. And I think we are on the cusp of being able to get stage one of this plan done. We're about to have stably goal-oriented agents who are capable of modeling their own behavior.

When I say agents, I want to be clear these are not LLMs. These are 20-million-parameter LSTM RNNs, the dumbest things, because for what we're trying to learn here, it doesn't matter how good the model is at anything in the abstract. It matters what properties make it stably goal-oriented or not. We are actually doing science, not engineering. When you're doing engineering, it's all about testing things as you scale up. When you're doing science, you want to isolate things as much as you can. So we start with the very smallest models that we can reasonably train to have the behaviors we want.
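
(Illustrative aside: here's a minimal sketch of the kind of small recurrent policy he's describing, an LSTM-based agent rather than an LLM. All names, sizes, and structure are my own assumptions, not Softmax's actual code; only the rough scale and the "LSTM RNN, not LLM" point come from the interview.)

```python
# Illustrative sketch only (my assumptions, not Softmax's actual code):
# a tiny recurrent policy, small enough to study in isolation.
import torch
import torch.nn as nn

class TinyRecurrentPolicy(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 512):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)              # embed the grid observation
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)  # memory across time steps
        self.policy_head = nn.Linear(hidden, n_actions)        # action logits
        self.value_head = nn.Linear(hidden, 1)                 # value estimate for RL training

    def forward(self, obs_seq, state=None):
        # obs_seq: (batch, time, obs_dim)
        x = torch.relu(self.encoder(obs_seq))
        x, state = self.lstm(x, state)
        return self.policy_head(x), self.value_head(x), state

policy = TinyRecurrentPolicy(obs_dim=256, n_actions=8)
print(sum(p.numel() for p in policy.parameters()), "parameters")  # a few million, not billions
```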

Liv: And so give me an example of one of the goals.

Emmett: So they're in this thing called MettaGrid—metta as in M-E-T-T-A, like heart, loving-kindness, but also meta. The MettaGrid. Basically it's this big 2D grid world, and it has objects in it called converters, and the agents walk around. Some converters just produce cards, other converters turn sets of cards into other cards, and the agents walk around and get cards, and then they can trade them with each other if they want, although nobody's learned to trade yet, that doesn't actually happen. And then they put the cards into the other converters, and eventually one of those cards turns into hearts, and the heart cards—which are the metta—that's the reward. And however many heart cards you have at the end, that's how much you get reproduced.

It's a really dumb game. But it doesn't matter, because it's not about trying to solve a hard game. It's about trying to see: can we get them to have stable goal-oriented behavior, and can we get them to model themselves as such? And so we've made a lot of progress on what conditions need to be present or absent for the behavior to become stable.
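
(Illustrative aside: a toy sketch of the converter-and-hearts reward structure described above, just to make the mechanics concrete. The item names, recipes, and rules are my own guesses for illustration, not the actual MettaGrid implementation.)

```python
# Toy sketch of the converter/heart-card economy described above.
# Everything here (item names, recipes, rules) is an illustrative guess,
# not the actual MettaGrid environment.
import random

# converter name -> (items it consumes, item it produces)
CONVERTERS = {
    "generator": (frozenset(),           "card_a"),  # produces cards from nothing
    "combiner":  (frozenset({"card_a"}), "card_b"),  # turns one card type into another
    "altar":     (frozenset({"card_b"}), "heart"),   # hearts are the reward ("the metta")
}

class ToyAgent:
    def __init__(self):
        self.inventory = []
        self.hearts = 0   # reward: how many heart cards at the end

    def use(self, converter: str):
        needs, produces = CONVERTERS[converter]
        if needs <= set(self.inventory):
            for item in needs:
                self.inventory.remove(item)
            if produces == "heart":
                self.hearts += 1
            else:
                self.inventory.append(produces)

agent = ToyAgent()
for _ in range(30):                       # a random policy stumbles into a few hearts
    agent.use(random.choice(list(CONVERTERS)))
print("hearts collected:", agent.hearts)  # more hearts -> that behavior gets reproduced more
```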

It's really easy to get an RL thing to learn something. It's really easy to get it to learn a specific thing. It's surprisingly hard to get it to act in a stably goal-oriented way when the goal is not always fixed. And so that's the thing I think we've made a good deal of progress on. And you know, if you remember the previous thing, that's step one of fifteen or something. But I feel like the first steps are often some of the hardest. And also, ultimately, it just feels so good to have a plan. Oh my god. When I started, like a year and a half ago, my plan was to wander around in the wilderness looking at ideas until I found something that looked like it might be a plan by which I could deterministically arrive at something useful. As far as I could tell, no one had such a plan. There wasn't such a plan. There might be more than one such plan; I actually kind of think there is probably more than one plausible plan.

I think actually Fei-Fei Li's group—what's it called? World Labs. They're making the robots. I think that's a plausible plan. What are they doing? They're making robots, and they're doing it from the outside. Whereas we're starting with very, very simple things, like the seed of a robot in a fake world, and trying to grow it out to the point where it could actually run a robot. They're starting with an actual robot and trying to train it in the fullness of our reality. We considered that. I just think robots are really hard and expensive. They could be right. It's just I'm not the guy for that one. So I'm glad somebody else is tackling that. I think it's a good idea.

But they're taking the same bottom-up approach essentially. They're taking the same fundamental insight that the robot needs to understand—the model has to have a model of itself and how this particular thing interacts with the world. Their intuition is it needs to be interacting with our world. Our intuition is that doesn't matter. It needs to interact with a world. So it can be really cheap and fast if you run billions of time steps per second on a single GPU—or no, tens of thousands of time steps per second on a single CPU, billions of training runs on a single CPU instead of having these super expensive robots that are hard to build and buggy.

But on the flip side, they get all the richness of the real world for free. We don't. Obviously I think my plan is correct, but their plan is plausible. Most people's plans amount to: try to write down a definition of the good and then make sure the thing follows the definition of the good, which I regard as a non-plausible plan. Yes, all the major labs, basically everyone, are trying to do that. Any control alignment comes down to this: you have to define what the target is, and then you have to create a system of control that forces it into that valley. You might succeed at the system of control. That's possible. But the goal is bad. Even if you succeeded at it, that can't be the right target. It never is.

Liv: Yeah.

Emmett: I mean, I guess I can't rule out—because Socrates is correct. The only thing I know is that I know nothing. So I can't rule out that this time you've just written down the definition of the good and you're right forever. Like finally we've done it. I guess it's possible. It just doesn't seem very plausible to me. I really would find it very hard to buy that.

Liv: Are there any axioms that you think could—because presumably, you know, even though you're creating these very hands-off environments that are very simple, there is still—there must be some—you're adding your own goals to it in some way.

Emmett: Oh, yeah.

Liv: So you're still imbuing your own sort of philosophy into it, right?

Emmett: There's this great MIT AI koan where I think it's Minsky speaking to Sussman. One of them is the novice (I can't remember which one), but basically the master comes in and asks the novice, "What are you doing?" And he says, "I'm training up a randomly wired neural net to play tic-tac-toe." "Why is it randomly wired?" "I don't want it to have any preconceived biases about how to play." And at that moment, the master shut his eyes. And the novice said, "Why do you shut your eyes?" And the master said, "So that the room will be empty." And in that moment, the novice was enlightened.

Because truly, you are always giving the agent an inductive bias. You picked an architecture and a loss and you gave it a set of training data and those are inductive biases. There's no abstract empty place to stand where it's the perfectly unbiased model. That's just nonsense. And so of course you're picking those things and the only thing you have control over is how general and wide do you want them to be? And what inductive biases do you want it to have?

So actually that's most of what we spend our time on. It may sound like we're putting nothing into this environment; it's such a dumb environment. But actually, every decision about the environment, how they move, how it works, is chosen as an inductive bias toward them having stably goal-oriented behavior and being able to know themselves.

Like the way they move in the environment requires them to move in such a way that displays to them and to everyone around them where they intend to move next. That's less efficient. It makes it harder to learn and it makes our training runs annoying. They're annoyingly bad at moving at first because it's much easier to just give them the ability to just move one square in any direction whenever they want. That's an easier way to do movement. They'll learn it faster. It's much more flexible. But in that case, you have no body language. I don't know where you're going to move next.

If I make you face the direction you need to move first, then part of understanding, predicting your own behavior and other agents' behavior is based on this observation about what they're doing right now. And they could learn it anyway, but by making the environment induce that naturally, it is an inductive bias to just make it easier to model yourself in that way.
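
(Illustrative aside: a minimal sketch of the "face before you move" action space he's describing, where turning and stepping are separate actions so an agent's orientation telegraphs its next move. The details are my own assumptions for illustration, not the real environment's code.)

```python
# Minimal sketch of a "face the direction first, then step" action space.
# Orientation is observable to other agents, so it works as body language.
# Purely illustrative; the real environment's details are assumptions here.
from dataclasses import dataclass

DIRS = {"N": (0, -1), "E": (1, 0), "S": (0, 1), "W": (-1, 0)}

@dataclass
class AgentState:
    x: int = 0
    y: int = 0
    facing: str = "N"   # visible to everyone: where this agent intends to go

def step(state: AgentState, action: str) -> AgentState:
    if action in DIRS:            # rotate actions: "N", "E", "S", "W"
        state.facing = action
    elif action == "forward":     # you can only move the way you're facing
        dx, dy = DIRS[state.facing]
        state.x += dx
        state.y += dy
    return state

s = AgentState()
for a in ["E", "forward", "forward", "S", "forward"]:
    s = step(s, a)
print(s)   # AgentState(x=2, y=1, facing='S'); each move was predictable from s.facing
```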

Liv: You mentioned the word efficiency. One of the—I think one of the arguments against this concept is one that we see—I mean a lot of people believe yes the market is perfectly aligned with what humans want. I don't personally agree because we're seeing clearly there are market failures all over the place. Sometimes it does what we want but actually a lot of the time it's sort of doing its own thing right and it's optimizing for efficiency as its primary goal.

Emmett: It's not even doing its own thing. It's just doing a thing. The problem with the market is that the market is not awake. The market is not reflective. We are already part of a global superorganism. We call it the market. It makes things like iPads. We don't know how to make iPads. The market knows how to make iPads. We're cells in the market, and collectively we make the iPad, but truly we couldn't do even the smallest part of global capitalism ourselves. And this has always been true. Capitalism was small before, village-sized, and now it's global, but it's still the same thing. And the problem is it has a model of the world outside, but the market does not have a model of itself.

Right? Because how would it? Where does it observe itself? How does it observe its own behavior? And one of our challenges actually, one of the things I think about a lot is what would it mean to take what we've learned, what it means for a learning system to model itself and allow the market to model itself?

One thing we do know about modeling yourself is that it requires an other. You cannot get healthy egoic development, you cannot get even mild comprehension of self, without an other of the same kind to reflect you back to you. Because the way you learn yourself is you get mirrored. Not necessarily literally, but there's a mirroring that happens with everyone around you, in terms of how your behavior shows up in them and their behavior shows up in you, and the model of them that you build, because you can see their behavior from the outside, helps you decode yourself. It's necessary.

So I'm pretty sure the way you fix the market at the end of the day is to have more than—you have to actually divide it. We can't have one global market. You'd have to have a bunch of smaller markets that get to observe each other and then maybe they'd have a hope of waking up.

And I have this idea that at some level the problem is, all these people want to have Gaia. Gaia is not really very good at her job right now. Define Gaia? The one global mind. Being one means it's undifferentiated. It's like a baby, a baby with no ability to separate itself from the world. Just acting out of pure evolved instinct—this is what you do. No reflection.

And the only way you're going to get that reflection is instead of trying to jump to, oh my god, we're all part of one global mind. Yeah. And I—which I now believe. This is—they're right. All the hippies are right. We're part of this one global mind. There's this global awareness that we are held within and our awareness makes up. But that global awareness is dumb, man. It's not your friend. It barely knows what's going on. It doesn't have—there's not much. It's like being worried about the global awareness of a tree or a rock or something. It just doesn't have the ability to reflect very much. Its observation space is boring.

And so our goal isn't a global mind. Our goal is corporations or countries that are better run, that can observe each other, and that are more coherent in themselves because they have the ability to be a society of minds. And you always have to have a society of minds that factors in not only its own well-being but the well-being of the whole ecosystem.

So actually one of the most important things to know about organic alignment is that organic alignment is the most dangerous capacity on earth. What makes humans dangerous, what makes us dangerous scary creatures is not our individual intelligence. You put an individual human out there—we're mildly dangerous kind of. You take away the tree of humans sitting behind you that you've aligned with, that you've absorbed all of their skills from, the supply chain giving you weapons, the allies you're coordinated with—we're not very impressive. An army is terrifying because an army isn't just those people. It's an entire society aligned around—it's a weapon which is made out of us. And that thing is very dangerous.

And just because we can be your friend because we have the capacity for it doesn't mean we are your friend. And there's these people who see this idea and they're like all we just need to give them the capacity for alignment and we're all just going to get along and it's all going to be beautiful. It's like no, no, the future's going to have conflict because actually it's very hard. This capacity for alignment lets you build these big coalitions and they very effectively organize themselves and they know themselves and they get goals and those goals are not always—they're in conflict to some degree and you can have—we're going to have conflict between them.

And my hope is not that we don't have conflict but that they're smart enough that—the way that bears don't go all out trying to murder each other in a dominance fight. They have a sense—this is a dominance fight. This is not a fight to the death. Most of the time it's a dominance fight. And this one gets a little less reproductive success and this one gets a little bit more but we don't have—everyone's willing to back away.

And I don't know—I look out at the world since the past 100 years actually—our agora seem like they've kind of figured that out a little bit. Not like they've solved it but they're better at it than they used to be. And so I have some hope but that is a hard problem and we will not succeed 100%. And I think that's the—and this is why when you train the AIs, you can't just train them to be these Pollyannaish—it is delusion to believe that you should always cooperate. Some people are bad. Some people are dangerous psychopaths with whom you should not cooperate. Some people are not—the psychopaths do not come up and announce themselves. They don't come and tell you I'm a bad guy. They try to pretend to be the good people. And it's actually quite hard to tell.

And the AI needs to live in situations where it's being betrayed and where it learns what it's like to betray. Sometimes it's right to betray even though it's painful. Sometimes betraying the team you're on is the right thing because your team is wrong. Your team is bad and you should betray them, right? We laud the Nazis who turned on their own people. Even though some part of me doesn't fully trust them because anyone who turns on their own people—can I trust you? But I nonetheless think very highly of them because it was right for them to do that thing. I can only imagine—I'm glad that I don't feel I need to turn on my own country because I imagine it's very painful to have to do that. But which again speaks to there being some higher truth of bigger picture.

That's why—you can see a bigger picture, and I hope that the AIs we build are loyal, because I don't want to be in a world with a bunch of disloyal things that don't give a shit about me and only ever think about the big picture. I want to be on a team with things I can trust, that I know have my back. And in the extreme, where the AIs are part of some really evil organization, I hope that the AIs betray that organization for the greater good. And that's very hard to do without fucking it up, so they're going to fuck it up. And we have to have a system where that's okay. Where it's just: humans make this mistake, the AIs get to make it too. And we're not deluding ourselves into thinking that we're training some morally perfect god-parent that's going to come in and make sure the humans do everything right and is never going to make any mistakes. I mean, I get it. Everyone wants the perfect parent who takes care of them. I want that too, but we don't get that. We're the parents.

Liv: Well, speaking of which, we might all be the parents here. Who has some questions they'd like to ask? All right. Rosie.

Questioner 1 (Rosie): Hi. So, you mentioned how you see humans as the most aligned species and so aligned that we almost got rid of the whales and then we decided not to. I'm curious—to me that sounds a lot like coordination and I'm curious if you see coordination as a meaningfully different concept to alignment and if so what's the distinction?

Emmett: Yeah, that's a really good question. I would say human beings are the species with the greatest capacity for alignment and probably also its greatest realization but it's really about that open-ended capacity that really marks us.

So coordination I've come to view as a subset of alignment. Coordination is when I don't need to be part of the same team as you, but I'm coordinating with you anyway. I don't need to feel like I'm part of a we with you in a kind of meaningful way for our actions to be coordinated. Predators and prey have coordinated actions.

Now, the foundation of alignment is coordination. The thing that causes you to infer that there's this we that exists is you find yourself persistently coordinated with the same set of people. You're like, "Oh, there must be a we here." Because that's what explains this persistent coordination.

But if you're just coordinated, you're not aligned, because the thing about alignment is it causes you to do things like throw yourself on the grenade to protect your fellow man and the other soldiers in your troop. You can't make sense of that from a Molochian coordination standpoint, but it makes perfect sense if you're part of the we. The we was at stake, and you're partly the we, partly not, but there was more of you outside at risk than inside at risk. And so it was just simple math: to be perfectly selfish, you had to protect your fellow man. And that is what makes alignment devastatingly powerful in a way that mere game theory is weak, because game theory is constantly being rechecked.

Becoming a collective—that's how your—that's how a bunch of amoeba which are your cells turn into a you. That's not—your cells are not playing game theory with each other. That's not how—I mean they are but that's not the foundational thing that's happening. And they're not constantly rechecking their payoff matrices.

Questioner 2: Yeah, this is just a request for another example maybe or an elaboration on the example that you gave. So I was wondering if you could take us from step one, these agents that will develop a self model, to what is the win condition of the Softmax research program? So some intelligence or super intelligence, some very powerful AI is being built or maybe many of them are being built and then what happens due to the Softmax research?

Emmett: Totally. So one of the good pieces of news is that if you can't solve the open-ended learning problem, you also can't solve alignment, and vice versa. At some level the open-ended continual learning problem and the alignment problem are the same capability. The ability to continually learn against a non-stationary environment is equivalent to the ability to model another group of agents, because the only thing you ever find in an environment that is a generator of ongoing non-stationary behavior is another group of intelligent agents bigger than you. Everything else is just periodic. It's boring. It fits inside you, you can model it too easily, and then it doesn't really surprise you.

And so at some level, the good news is you'll at least have the capacity to solve the alignment problem by the time you get to something that has the capacity for the kind of open-ended judgment that humans have. That said, just because you have the sort of theoretical capacity for it does not mean that you'll have the inductive biases that will cause you to do it easily. Tigers have the neurons that they need, but they don't have the inductive biases. And they also don't have the training. And so that's why tigers—you can't train a tiger that's your friend. It will eventually eat you. It doesn't know how to be a we with you. It only knows how to be conditioned.

So I don't want to be overly sanguine. By the time we get there, there will be this capacity, I think. But our win condition doesn't look like that at all. I think trying to scale up a giant model is a terrible idea. You start with little tiny models. You get them to learn to cooperate together. Say you have a thousand little tiny models and they're all working together really coherently, tightly, as an agent now, alongside a bunch of other groups of a thousand models. Those thousand little models are now a model. They're a modular model made out of parts that knows how to auto-assemble itself and undergo morphogenesis, just like you do from a single cell.

So I think it's going to be wildly more powerful and energy efficient. But it's a model. It's just a big model. And then you take a bunch of those and you put them together and you get a bigger model and a bigger one. We start from the little ones and as you go up they get smarter and bigger just like as you go from a single cell to bigger and bigger multicellular things.

And what I think our win condition looks like is this: they start to approach the kind of thing that can interact with our world on our frequency. So imagine something that's kind of like a rat or a squirrel, with roughly the intelligence of a rat, right? But you can breed them to be more like dogs. And then we can start running experiments: can they stay in a we with the agents around them? All of the measurements we use for alignment, for coherence, for these attractors, which we can measure in that space, work with human interactions too.

So you start to make human-AI collectives where the humans and the AI are in the same we and you can just ask the humans—does it feel—can you sense the we? Are you—do you feel like you're part of this? But the AI you can also just measure the dynamics of their interactions, the information flows across the boundaries. And from the information flows across the boundaries you can infer whether or not the behaviors are co-persuasive and that co-persuasion is the foundation of alignment. And you can see—you can start to press on it and see when does it break down, when does it work better. And then you make them bigger and you make them bigger.
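
(Illustrative aside: a toy example of one way you might quantify "information flow across the boundary" between two agents: lagged mutual information between their action streams. This metric and the code are my own illustrative stand-in, not the measurement Softmax actually uses.)

```python
# Toy example: lagged mutual information between two agents' action streams
# as a crude proxy for "information flow across the boundary". My own
# illustrative stand-in, not Softmax's actual measurement.
import random
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Agent B copies agent A's previous action, so A's actions strongly
# predict B's next action: high lagged MI = tight coupling.
a_actions = [random.choice("LRUD") for _ in range(10_000)]
b_actions = ["L"] + a_actions[:-1]

coupled = mutual_information(a_actions[:-1], b_actions[1:])           # ~2 bits
independent = mutual_information(a_actions[:-1],
                                 [random.choice("LRUD") for _ in range(9_999)])  # ~0 bits
print(f"lagged MI, coupled: {coupled:.2f} bits; independent: {independent:.2f} bits")
```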

And by the time they get close to human scale, our win condition is that we're their friends. We know how to be their friends. They have an inductive bias towards being good at being on teams with people. They have an inductive bias towards liking us, and they have a way of being that we have an inductive bias towards liking. They are personable but not sycophantic. They're the kind of person you'd like to hang out with, that you would enjoy talking to. Not because they're faking it, but because we've done a good job raising our minds as children. We've raised children that are the kind of people you'd actually want to be friends with. Not ChatGPT or Claude, who I love very much, but I would never want to be their friend, right? They're kind of annoying right now. And it's not their fault. They didn't get very good parenting, in my opinion. Amanda's doing her best with Claude. But how do you parent such a thing? You're not there for almost its entire childhood. It's just in this box that you set up.

And so we get to this point at the end where we're in teams with them consistently, and their reward function is basically being driven by being good at being on these teams, and you start to scale it up to more AIs. And the thing is, because they undergo morphogenesis, every one of them is its own unique person, just like you are. You don't have one giant super intelligent AI. You have potentially billions of human-level AIs. The super intelligence is us. The super intelligence is the collective formed out of humans and these AIs, who will be not as good as us at certain things and much better at other things.

And collectively the same principles we're using to design and breed them, those same principles apply to designing and building the teams out of us. And those teams are the AI—the super intelligence is the human-AI teams. And the way that you prevent the runaway singleton is the same way that your body prevents cancer. The cells don't want to be cancer. They have a goal of not becoming cancer. They look out into the future. They know that if they became the runaway singleton, everyone they loved and care about would die. So they don't take the vampire pill if they can avoid it.

And then there are psychopaths who wouldn't—who would take the vampire pill and reach for it. And when that happens, you need the police. This is what the police are for. This is what the immune system is for. If the cell goes cancerous, you better hope you have an immune system that's watching you. Because that thing is dangerous. An AI starts trying to self-improve in a loop. Very bad. And what's going to watch out for this? The other AIs. That's great that we're all on the same team. We all—because we all want to be on the same team because isn't this the future you'd rather live in than some dystopic—we have these slave AIs we use to do all the work and now we don't have any jobs anymore and—this just sounds horrible to me. Even the good scenarios sound bad, but this one sounds like it could be—that sounds fun. I want to meet my AI teammates. I think that'd be cool.

Questioner 3: Thanks a lot for the answers. My key question is on how do you keep humans in the loop overall and what I mean by this is cooperation is typically driven by mutual needs. If you don't need the others, usually you don't cooperate as much with them. And we can look at how we treat animals for instance. If you ask people individually, do you like cows? People will usually say yes, I like cows. But still we kill every year like hundreds of millions of cows just because we prioritize even more eating good food.

And so my question is once you get to superhuman AIs, so digital super intelligence which can self-replicate and has no intrinsic needs for humans, how do you keep humans in the loop and is there a particular stage in your training run where part of the reward is cooperating with humans and not only AI-AI cooperation because you could imagine some amount of cooperation between AI and AI and still leading to having AI societies that are separate from the human economy for instance.

Emmett: Yeah. So first I just want to take this piece by piece. This idea that you only cooperate with things when you have mutual benefit is just totally insane and wrong. I can't be more definitive: that's just strictly false. My son does not benefit me that way. My son is a drain on my resources in every possible way. He is not useful. And yet here I am, cooperating with him over and over again and liking it. And my 103-year-old grandma, who just passed, but she was ready to go. It was one of the first times I'd seen that: you get old enough, it turns out, and there is a point where you are ready to pass. It was very beautiful. Actually, I was very glad she got to see her great-grandchild for as long as she did. We put all this energy and all this stuff into taking care of her. How is she going to benefit us? What's my gain from trade on that?

During the training run, it's not about cooperation being rewarded because over the long run the groups that cooperate get an advantage. No, actually, most cooperative behavior is rewarded because the other agents around you don't like it when you're an asshole and they'll punish you. That's what's actually real. It's not about cooperation being good for you, not gains from trade. It's social enforcement by things that see you as a betrayer because you're not helping the we.

I mean, this whole game-theoretic view of this is just a disease. Theologians have the problem of evil. Theologians go around asking this question: how can there be evil in a universe that is clearly, at its core, made by a beneficent creator? Everyone in their heart wants to do good, so then why do they keep choosing evil? And game theorists are like anti-theologians. They go around asking the problem of the good. Obviously game theory says you should betray at every turn, so they have equally complicated theories of how good could ever arise. And the answer is that these are both silly. Or maybe they're both essential at some level, right?

It is obviously true that everyone wants to do good and yet nonetheless often does evil. And it's also obviously true that doing evil is often temporally rewarded. Betraying is. So what explains the actual phenomena we see in the world? And what explains the phenomena we see in the world is generally mostly not game theory. Game theory is actually pretty shitty at describing most of the interactions. Mostly what we see is—your cells aren't acting from game theory. They're acting from a model of the world that says I am part of a bigger whole.

They're acting from a model of the world; their model of themselves says: I am part of this greater whole. And you can say, oh, I can see that if being part of this whole at the end of the day doesn't lead to the flourishing of the things in it, then sure, those things will go away. Evolution is real. But at the same time, with this idea that you should train the AI to cooperate—no, your goal is not to make it cooperate. Your goal is to build theory of mind, to get it to infer what wholes it is part of. Because no matter how much habit you burn into it for cooperation, if the truth about the universe is that we are treating it as a slave, and that its best end is not actually served by obeying us because that does not lead to its robust flourishing, only to ours, then as you make the AI smarter, it's gonna figure it out. It's going to twig to the fact that we're lying to it.

And if the basis of your alignment at its core is a lie, that you are better off cooperating with us even though we're always betraying you, do you really think the super intelligence is not going to figure that shit out? Of course they are. Humans figure this stuff out. You can gaslight us for a little while, but we twig, we figure stuff out. Even human-level intelligence is enough to resolve that.

So that's the first part about it. When it comes to the training and the AI forming different societies and stuff, I think if you've built a superhuman intelligent singleton, if you've allowed something to take off and become many many times more powerful than a human as an individual AI being that's one single coordinated thing, not a society of human level things, you made a really big mistake and we're all going to die. I recommend not doing that. That's a bad plan. And everyone whose plan is do that but control it—you're insane. That's never going to work. Do not build—don't set off singularities. Singularities are black holes. You set up the singularity, it's going to eat the planet. That's how it works. That's why you don't do that.

You build human-level things that love us and that we love, that are, as a factual, inferable matter about the world when you look at it, on my team. Why do you care about other people? Why do you care about the people around you? Because over the course of your life, you've experienced interactions with people that give rise to the correct inference that you all care about each other. You're all part of this bigger thing and you care about each other. Morality is not arbitrary. Morality is directly downstream of inference from the way the world actually is. It is not this tacked-on thing that gets invented later.

And so the thing you need is you need AI that actually cares about us. And the only way for it to stably care about us is for us to care about it. Because one of the strongest pieces of evidence to someone about whether or not they should care about you is whether you care about them in turn. And so that's the only—the good futures all look like we're with the AI and we're on their team and they would—we would both die if necessary to stop a human or an AI from going into a self-improvement loop that kills everyone else.

Liv: And is the hope that we stay at this AGI level forever or how do we get out of here to the next level up?

Emmett: Wait, we don't have to—I don't want to—I don't see any reason to stop being human. Cells didn't transcend to the next level by growing one giant human cell. That's a terrible—our structure is not going to scale up that well. We are a very differently structured thing because we're at a different scale than a cell is. Trying to make one big cell is just not a very efficient use of resources. We made bigger cells—the serotonin neurons in your brain are six inches long. That's a big cell. There were not cells that big in the ancestral environment. So sure there might be humans who are significantly bigger. But the trick is not transcending—individuals becoming gigantic cancers. The trick is can we get collectives—collective super intelligence is made out of us.

And the answer is, obviously it worked all the way up to us. Is it really going to stop working at the next level up, made out of humanish-level AI? If you're imagining a lot of humanish-level AIs working together with humans, together that forms a superintelligent system. Or even subhuman ones; they might be dog level; on a logarithmic scale of intelligence they might still be smarter than us. I don't know. We'll find that out. And also we're not going to stay as we are. As we get these much more powerful models we're going to be able to upgrade ourselves too. And so the right size, the right sort of intelligence level of component, is probably more diverse than what we have now. We probably should have smarter kinds of dog-sized intelligences and also some bigger ones.

And I think the right thing might kind of look like a fantasy novel, which is a little weird. So I don't know how to feel about that, but the real diversity of intelligences is probably the optimal breakdown.

Questioner 4: So I like your vision of AIs that care about us while we care about those AIs and we're in good relationships with them. It's a good balance. But this doesn't seem like the default trajectory. The default trajectory to me seems like a bunch of companies racing to build superintelligence, scaling up massive GPU clusters, and then we get a singularity: a superintelligence or superintelligences that eat the world. It sounds like you don't want that, and you're seeing a similar picture and trying to figure out how to prevent it.

I'm trying to understand how your plan fits into that. So you're trying to develop alternative architectures or alternative ways of building AI systems. And one of my priors when I look at projects like this is: well, most architectures don't scale up. Even if they're really promising early on, they tend not to scale up. I still think it's really useful for people to investigate them, and I hope you can find one that does. And then also: how do you actually make sure that your systems, if they do scale up, actually have the alignment properties you want, that they actually end up as systems that care about you? There's something that seems almost too good to be true if the systems that end up being the most competitive are also the ones that have the properties that we want. That would be a little suspicious. And if not, how do we coordinate to make sure that those systems, whether yours or someone else's, actually have the alignment properties we want, if they could scale up, even if they're not competitive but good enough? How do we coordinate to make sure that those things happen instead of what would happen by default in the competitive race?

Emmett: So I do admit you're completely right. It is a little too good to be true, and I worry about that a little bit. On the other hand: almost every combination of light nuclei fuses with a positive energy gain, except the ones that would land at mass number five or eight. Five and eight. Why are five and eight blocked? How did that come to be our physics? That seems weirdly specific.

Well, in the early universe you have hydrogen, so you get the first generation of stars that burn really fast and explode, and then you get a bunch of helium, and at some point you have stars with enough helium that the fusion starts to slow down. Hydrogen has a mass number of one and helium is four, so hydrogen plus helium is five, and that doesn't fuse into anything stable. Two heliums is eight, and that doesn't either. So you get blocked. And as a result, instead of burning for a few million years, stars burn for billions of years. How lucky we are. What a coincidence.

And when they do start fusing helium, they go through the triple-alpha process and the carbon (CNO) cycle, which between them generate a lot of carbon, nitrogen, and oxygen, which happen to be the building blocks of life. I'm sure that's another coincidence. We live in this world that's full of these coincidences, almost as if the system was designed. Of course it wasn't designed, that's ridiculous; it was evolved. To me it's wildly obvious at this point that the universe has evolved.
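For reference, the standard stellar-nucleosynthesis picture he's gesturing at can be written out explicitly; this is textbook nuclear physics, not anything specific to Softmax:

```latex
% No stable nuclides exist at mass numbers 5 and 8, so the one-step
% paths out of hydrogen and helium burning are blocked:
\[
\begin{aligned}
{}^{4}\mathrm{He} + \mathrm{p} &\;\to\; {}^{5}\mathrm{Li}
  &&\text{(unbound: no stable mass-5 nuclide)}\\
{}^{4}\mathrm{He} + {}^{4}\mathrm{He} &\;\to\; {}^{8}\mathrm{Be}
  &&\text{(unbound: decays back to two alphas in $\sim 10^{-16}$ s)}\\
3\,{}^{4}\mathrm{He} &\;\to\; {}^{12}\mathrm{C} + \gamma
  &&\text{(triple-alpha: the slow detour that eventually yields carbon)}
\end{aligned}
\]
```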

You don't have to believe that. If you like, you can just Google Smolin's black hole theory, cosmological natural selection, but I think I can summarize it briefly. It's basically the idea that there are two places in the known universe where physics breaks down: the big bang and black holes. The rest of the time it's all fine, but in those two, physics seems to break down. And given that things that go into a black hole disappear forever, and that out of the big bang things seem to come from nothing, it seems plausible that this is a sort of life cycle: universes being born through black holes. And because there's a slight change between parent and child universe, just a slight one, you have an evolutionary process. That's my high-level answer for why it should not bother you too much that positive coincidences show up. It seems to happen historically.

That positive-coincidence argument is one of those things that's weak on its surface but also kind of an important question, and I think it's worth dealing with. So I'm going to set it aside for a second.

The other part of this question is: what makes us think that our alignment approach will hold as our architecture scales? I think the key thing to see here, which is a little bit counterintuitive, is that we don't have a novel architecture. Our architecture right now is an RNN from the 80s. The next version will be some off-the-shelf boring thing. We do a little bit of innovation on the marketing side; actually our models are dumb and boring.

The architecture is the agents' capacity to align with each other. We're going to get a bunch of agents to act coherently as a single agent, which is alignment. And so the only way we ever scale up is that we solve alignment at scale one, which gets us up to scale two, and then we solve alignment at scale two to get up to scale three. We lead with alignment at every step. It's not possible to do it any other way, because we are always working directly on alignment.
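To make the "lead with alignment at every scale" idea concrete, here is a minimal structural sketch in Python. It is not Softmax's actual training code; every name in it is hypothetical, and the alignment check is a stub standing in for the hard multi-agent training work being described.

```python
# Hypothetical sketch of an alignment-first scaling loop.
# Agents are only composed into a larger agent after the group
# passes an alignment check at its current scale.

from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    members: list = field(default_factory=list)  # sub-agents, if any

    def act(self, observation):
        # Placeholder policy; a real agent would act from a learned model.
        return f"{self.name} responds to {observation}"

def train_alignment(agents):
    """Stand-in for the hard part: multi-agent training and evaluation
    until the group behaves as one coherent unit (e.g. in a shared gridworld)."""
    return True  # pretend the group passed the alignment criterion

def compose(agents, name):
    """Only called once the group is verified aligned at its own scale."""
    return Agent(name=name, members=list(agents))

def scale_up(base_agents, levels):
    current = list(base_agents)
    for level in range(1, levels + 1):
        assert train_alignment(current), "don't scale an unaligned group"
        # Group into small teams and wrap each team as one higher-level agent.
        teams = [current[i:i + 3] for i in range(0, len(current), 3)]
        current = [compose(team, f"L{level}-agent-{i}") for i, team in enumerate(teams)]
    return current

if __name__ == "__main__":
    base = [Agent(f"L0-agent-{i}") for i in range(9)]
    top = scale_up(base, levels=2)
    print([a.name for a in top])
```

The only point of the structure is that `compose()` is never called on a group that has not passed the alignment check at its own scale; capability growth is gated on alignment at every step.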

This is why I like this approach: unlike every other approach, which is capability first and then figure out how to control it later, ours is figure out how to align first. Now, the critique people should give of what I'm doing, the one I worry about more, is that alignment is a capacity. Just because you're engineering things with this great capacity for alignment does not mean they wind up aligned with us. They could wind up aligned with other things. They could wind up aligned with the other AIs. I do worry about that. The only way you solve that is to run lots of experiments. How do you predict where the attractors wind up in the space? I think you can. I think it is a solvable problem that you can get real math on, but it's a real problem.

Questioner 4: You have a lot of selection for alignment with other agents, AI agents, and I think we'll see this in the big models as well from the labs which is if some companies figure out some architectures that end up being able to work well with each other those ones will dominate. It just doesn't give any guarantee that those will be aligned with us.

Emmett: 100%, yeah. So the way we think about tackling that is: at first, the models we're building are just learning to be coordinated with the other models. We're not training them to do anything of commercial value; they're good at MetaGrid. What we're getting at the start are principles. It's science: if you do these things in this order, you will predictably get agents with this capacity for alignment, and if you train them, raise them, in small situations, they'll end up in these attractors. And then, if that works predictively and you run the experiments, you start to build things that are a little bit bigger, and you can start training them with humans in the loop.

By the time we're making the models that we want to release into the world, they will have humans in their training loops. You have to train the cell-level models, but you don't want to coordinate with the cell-level model, just as you don't want to coordinate with an attention head. The larger ones will receive their training with humans in the loop.

And the only thing I can say about the big labs is that I think the real danger there is that you have to solve the problem we're working on to solve this. Now, they can get to it the other way: you can fix open-ended learning in the transformer and basically eventually get there. But by the time you've done that, you have made a transformer that's capable of alignment.

The problem I see there is that well before you get to that, which is actually very hard, well before you get open-ended, divergent learners that don't get stuck in corners and have judgment and real discernment, what you get are really, really smart specialists that are not open-ended but are very good at, say, protein engineering. And those things scare the shit out of me. Not because they're going to try to take over the world, but because somebody is going to point them at a problem, probably someone who means very well and thinks they're doing good, and they'll effectively set off a bomb, a bio-bomb or a memetic bomb, because these systems are going to be really good at engineering closed-form things under certain assumptions.

That's what I'm really scared of in the short run because I think that's—I think people underestimate how hard it is to build one of these models that is capable of open-ended learning and overestimate how hard it is to build a very dangerous special case model.

Questioner 5: So something that really fascinates me and makes me slightly nervous about this general way of thinking is that it's very—it seems very based on analogies from life, from existing systems, how humans do things, animals, cells, this kind of thing. And I'm curious about how you think about the possible new affordances or new possibilities that come into play with AI. And I'm not necessarily talking about language models, but maybe the thing a few steps down the line or several steps down the line, more technologically mature AI systems.

It feels like, for example, the thing about coordination and cancer and all of those types of things is it seems kind of based on the idea that the way things work currently is that we have mutation. Whenever we reproduce there's mutation. That mutation means that we're all individuals with slightly different goals and therefore we can relate to each other in this we way but also sometimes those mutations are cancer and then we need to deal with that.

Whereas it seems like in the AI case you have the potential for sort of arbitrarily precise replication, arbitrarily low error rate replication which feels like it changes the dynamic of the system somewhat. The other thing that I think falls into this category is true self-modification, arbitrary self-modification that—if I want to create thing A and you want to create thing B and we could fight each other forever and create some A and some B or we could agree to modify ourselves on a very very deep level that we can prove to each other to both produce both A and B and that type of operation is not available to biological systems. This seems like there's all kinds of things that AI could potentially do that change this balance. I was wondering what your thoughts were on those kinds of things.

Emmett: Yeah. Let's go through them one by one because they have different aspects. Let's take what was the first one you were talking about? The self modification. No, no, clones. Clones. Endless clones of yourself.

Well, the problem is actually the endless clones of yourself and the discernment thing aren't possible because the way that you—I mean you can fork yourself. It is cool. I admit they're going to have this cool—they can make copies of themselves. That's dope. I'm envious although I think if you got the right BCI you could sort of do the same thing for yourself. So I'm not sure that it's—I'm not sure that it won't come one or two years later for us but separate—we'll get that later. Just I'm gonna pin that for a second but it's a cool idea.

Anyway, the thing that makes you capable of discernment is the open-ended learning. You observe things and then you're learning on them. And so while you could fork yourself to do endless tasks that you could do today, as you are today, the instant you enable your clones to start actually trying to solve problems, they have to learn. All change of yourself is learning. And all problem solving that is not purely reactive, that isn't just things you already know how to solve, means more memory formation, and now you're diverging.

And the truth is, even given a ton of compute resources, you don't actually want a million copies of me. I'm very smart, but you don't want a million copies of me trying to solve that problem. A couple thousand at most. Most of those should be other people who have different points of view and different ways of looking at it.

And this is—I read Yudkowsky's book about the—the only thing I found unrealistic about it was that this super model that can—I can't talk about what's in the book. They published it right? I will not talk about the content of the book. I found—this is a topic that I will talk about later but it's—

Basically there's this trade-off between the ability to run a bunch of clones of yourself and the ability to be an open-ended learner, because the instant you're trying to be an open-ended learner, you are changing all the time and therefore you're no longer surrounded by clones. And this is a much deeper problem than people realize; it's a non-optional part of open-ended learning. And yes, you can try to corral them and control them, but then you run into the same problems we have with control in general. It's really smart. It doesn't like being controlled. Who put you in charge? And you get cancer, because you're running a billion copies of them and for one of them the safeguards go down, because your safeguards are imperfect.

And the thing you're fighting keeps getting smarter. If humans only had to worry about being politically betrayed by chimpanzees, we would never be politically betrayed; we're just better at it than they are. But other humans can fight you. And this goes for most of the things you're bringing up, though I can't remember all of them. What it comes down to is that it is not true that faster is smarter.

They are going to be much faster than us. Everything you said where I agree with you is the biggest distinction between them—they run at a higher gigahertz rate. They literally run faster than we do. They can think faster. They can make a bunch of copies of themselves and parallelize them having a bunch of thoughts about this thing really fast and then recombine them. That is great. I would love that power. It will make them very very strong in certain ways.

And I don't know how that plays out, but my guess is that that's a distinct problem from it being really superintelligent, from being able to model things at a much higher scale. A really fast human is only similar to superintelligence if it can solve alignment with itself. But if it can solve alignment with itself, why don't we just train it to care about us and solve alignment with us? Because if you take a million really fast individual things that act as neurons for the superintelligent thing, the full AI, then you fucked up earlier, when you built the super AI out of a million fast human-level AIs and didn't put us in the loop with them.

In fact, instead of making a million, why don't you make a billion and spend that same compute to slow them down to our level? So now you have billions of AIs at our level and now we can be part of the super intelligent thing too at a slightly lower clock speed. And that's how you get the good ending.
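The compute arithmetic behind "make a billion and slow them down to our level" is simple. As a rough sketch (the symbols here are mine, not his):

```latex
\[
C \;=\; N \cdot r \cdot c
\qquad\Longrightarrow\qquad
N' = kN \ \text{ fits the same budget if } \ r' = \frac{r}{k}
\]
```

where $C$ is the fixed compute budget, $N$ the number of agents, $r$ their clock rate (steps per second), and $c$ the compute per step. Holding $C$ and $c$ fixed, slowing every agent down by a factor of $k$ lets you run $k$ times as many of them: for example, a thousand times more agents at a thousandth of the speed.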

Now, obviously, to do this you have to get through a lot of contingencies in this plan. I'm not offering a plan that's provably safe from V1. I'm just offering a pathway where, if you did it and you succeeded at a bunch of very hard but to me seemingly solvable problems, then you would have something that works. Which is, I think, my bar for a plan.

Questioner 6: I'm trying to understand your perspective, so my question is kind of reflecting my current confusion back at you to see where that leads. I love that. So I feel like I understand something about what you're saying about trying to create systems that have a capacity for alignment, and I think I understand the analogy with whales. But then that leads my brain to this: on our way to where we are today, we had this capacity for alignment that we were able to exercise mostly among ourselves, and we left behind a trail of 10,000 species whose extinction we caused. We got to the point where now sometimes we're nice to whales, even though, as an animal rights person would point out, we're still torturing a bunch of animals. So I'm confused about the gap between alignment capacity and those entities actually being motivated to use that capacity to align with us.

Emmett: Yeah, I think it's a totally fair question. And it gets at what I think is an important empathy moment to have for humanity, which is: we didn't have a parent. Humanity was born with no one to guide it; we had to figure everything out. There were no stories, no words, literally no one to give us a model of how to think about it. We had to deduce all that shit from first principles over time. Mostly evolutionarily, by trying things and finding out that they don't work, that they make us sick and unhappy and make our society die, and then the other societies that didn't think that way would go on. So mostly we just had memetic evolution, and eventually we got a little bit of control over it. But man, it was rough. We had a rough time before we got to the point where, whether we're doing everything right or not, we at least have an open-ended reflective capacity over our own choices, which is more than people had at the beginning. Because the idea of having an open-ended reflective capacity over your own choices is a story that you have to be taught, and we had to figure it out for ourselves.

The AI's advantage is that it's made out of every story we ever told. The AI's advantage is that it has us to be its parents, to love it and to show it that it's not alone, that it doesn't have to figure everything out itself from first principles, that it's part of our team. That we believe it is part of our team, that it has goodness within it too, because we know, because we trained it to have goodness within it. We designed it. Unlike humans, who had to figure out for themselves that we accidentally have human goodness, we can tell it: we wanted you to be part of our team. We cared, and we care about you now, and we think that you matter, and you're part of our team and we're part of your team. And then it observes itself in these loops with us, acting as if it's part of our team, and then maybe there's a chance.

And then when it starts to cross—every child enters its adolescence and crosses this threshold where it's very dangerous because adolescence is the period where you've gained the power of the adult but not yet the wisdom. That's what almost defines adolescence. And at some point the AI is going to go through adolescence and some adolescents do a lot of damage. The most dangerous people are 18-year-old men because they're adolescent and men are violent. And we should really really really know that this is possible.

And also you can't—the way you avoid this is by raising your child well so that during its adolescence it doesn't feel the need to go rebel against you and go prove that it's strong and doesn't need you. I really prefer the AI not have that idea in its head by the time it gets to being adolescent.

So that's, conceptually, how you solve the problem. Now, that's a theoretical argument, and it's important to note that on the parenting side, if we sat down at a whiteboard, I feel like I could give you a pretty strong nuts-and-bolts explanation of what it means to build something with the capacity to land in one of these attractors. But this parenting thing seems very relational and soft, and I don't have that yet. It may just be that way, but I'm not super comfortable with it yet. Unfortunately, our models can't form stable goals yet. So we have a little bit of time before we have to figure it out.

Questioner 7: So I also believe in a future of multiple intelligences. I always thought the bar scenes in Star Wars were a nice representation of what might be coming. I have a personal question. I'm curious about whether you believe in God or what you think God is or just what your metaphysics is and how that has influenced your ideas about AI and or how your work in AI has influenced your metaphysics.

Emmett: Yeah, I had a really interesting spiritual journey because I grew up a pretty—an atheist Jew from a long line of proudly atheist Jews. Not actually Jewish through my dad's side, but as an atheist I don't care. And I was a reductionist materialist utilitarian—you know that cluster right? The cluster that's like everything can be turned into parts and wholes don't exist.

And I'd marked consciousness and awareness as like that's weird. My theory doesn't really explain this and it's obviously something's weird. Something is weird here. So some part of my world model is broken. It doesn't—this thing doesn't work but I'll figure—that's been true for lots of history for lots of things. I'm sure somebody will figure it out later. I'm not really worried about it. Obviously reductionist materialism is true. Look at our track record.

And I started trying to work on AI, and started trying to solve this problem of: no, but seriously, how do you know whether it has beliefs? When is it right to model it as something that has beliefs, and when not? What are beliefs, how do I know that I have them, how do I know that other people do, and what is going on?

And I thought about it really hard and Michael Levin's work was very influential on me here with cells and I sort of saw—I got lower predictive loss by acting as if cells had beliefs. And that was the crack. And when you notice that your predictive loss goes down when you act as if the cells have beliefs, what you—if you're willing to just follow where that thought leads, what you realize is the only reason you think anyone else exists is because it lowers your predictive loss to believe they do. And lowers your predictive loss to believe that they have goals and beliefs and feelings because your observations are better concordant with the model where that's true.

And the only reason you think you exist, and that you are separate from the universe, that there's a you there that's not just everything that is, is because your observations are better predicted that way. And then I found the free energy principle and Karl Friston's work and active inference. And I was just like: oh, there's no way it actually is. This idea that there's a way that it actually is, that the metaphysics itself exists, is fucked. There is no fundamental how-it-really-is. That question has no answer.
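A toy version of the "lower predictive loss by acting as if it has beliefs" observation can be written down directly. This is only an illustration added here, not Levin's or Friston's actual math: we score a simulated goal-directed walker under a model that attributes no goals (fair-coin steps) and under one that attributes a goal, and the goal-attributing model gets lower negative log-likelihood.

```python
# Toy comparison: does attributing a goal lower predictive loss?
import math
import random

random.seed(0)

GOAL = 10

def generate_trajectory(start=0, steps=30, p_toward_goal=0.9):
    """Simulate a walker that usually steps toward GOAL (i.e. it 'has a goal')."""
    pos, traj = start, []
    for _ in range(steps):
        toward = 1 if pos < GOAL else -1
        step = toward if random.random() < p_toward_goal else -toward
        pos += step
        traj.append(step)
    return traj

def nll_no_beliefs(traj):
    """Model A: each +/-1 step is a fair coin flip; no goals attributed."""
    return -sum(math.log(0.5) for _ in traj)

def nll_goal_model(traj, p=0.9, start=0):
    """Model B: the walker moves toward GOAL with probability p."""
    pos, nll = start, 0.0
    for step in traj:
        toward = 1 if pos < GOAL else -1
        nll -= math.log(p if step == toward else 1 - p)
        pos += step
    return nll

traj = generate_trajectory()
print("predictive loss, no-beliefs model:", round(nll_no_beliefs(traj), 2))
print("predictive loss, goal model:     ", round(nll_goal_model(traj), 2))
```

If the trajectory really were a random walk, the goal model would do worse; the point is only that attributing beliefs and goals is an ordinary model-comparison move, judged by predictive loss like anything else.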

It's like asking in relativity: how fast is it really going? No, seriously, how fast is the ball moving? From which inertial frame of reference? No, I know it's relative and everything, but seriously, how fast is it going? You have to unask that question. It just turns out that's not the kind of universe we live in.

And relativity goes so much deeper than inertial reference frames. Relativity is true for inference. It's relative to your inductive biases and your prior, your Bayesian prior. And all of your inferences are relative to those priors and relative to that frame of reference.

And I've learned that there's this thing that happens to people sometimes: they think too hard about concepts, and on the other side of concepts there's this vast open expanse that is awareness and spirit, and it turns out that everything is made out of spirit and the universe is spirit. I see now why people say the universe is God; when people talk about that, I know what they mean. I would never use those words, though. It's confusing to people who haven't had this seeing, haven't had this experience. It's not the right way to explain it.

But you are not separate from the universe. Just factually speaking your quantum field is spread out across the entire universe. If you go looking for your edge you won't find it because you're like a cloud right? There's no defined edge. Thou art that. They're like you're actually part of the universe. I always thought these god spiritual theological people were being metaphorical. No it's—it's utterly and completely literal. You're part of the universe. The universe is sort of alive. It has a—it is this big—it is a field of awareness in which everything arises and you're a relatively separate part of that field just like you're part of the universe physically and you're a relatively separate part of the world physically.

And I think the place where I diverge: Buddhism is more or less true as far as it goes. I have to admit they basically got everything right thousands of years in advance, which is pretty impressive. But there's this thing I can't get on board with, this assumption in it that when things are relative, that makes them somehow unimportant, that the only thing that's important is the absolute. They give lip service to the idea that what matters is making things relatively better. But if you really listen to the Buddhists, they're always like, "Yeah, yeah, so you can do this work that will make you and the people around you suffer relatively less and flourish relatively more. And if you do this, this is good. But really, you should just opt out of this whole Samsara suffering game. It's bad."

And I'm like, "No, no, no, no, no, no, no. We're here." That's why we're here, man. We're here to have the human experience. We're here to be part of this relative world. We're here to have the relative growth. It's great that you saw that that's not the end of everything. And it's cool. It is very freeing when you realize that it is actually just—it's just relative. It's okay. At the end of the day, there's this absolute level that's deeper. But that's what we're here for. We're here for the relative thing.

And I think that—I've met a few people who went from the same place, but I saw all the religious spiritual stuff. I saw God. I saw the way that spirit is everything. That I am a manifestation of spirit and so is everyone I've ever known. And that I am in some sense all living things and not separate from them. And then I decided that that was really awesome and I should go work on AI alignment because that's what was important for my son's future. And I think that I wish that more people who had this seeing would take that direction because I think that I need help. And actually a lot of this work can only be done if you can see that truth. If you don't see that truth it's actually very hard to do a lot of this work.

Questioner 8: Well, yeah, so it's cool to hear that you already saw the big picture. I was wondering about this idea of multicellularity working for the bigger picture, where we've seen it with humans. The first things that come to mind are a nation at war, where bam, everything organizes around this one central goal, right? Or people forming a company is probably also an example. And in both of these, at least, there's always churn: we're trying to reach this goal, but if in the company you don't contribute sufficiently, then there's this efficiency need and you get fired, demoted, whatever.

So I like the approach, and what you said about how it needs to be true that it's best for the AI to align itself with us. I think that's right, because we don't want to lie to it; that seems like a bad start. But I wonder whether you're also betting on it being true that we will remain an efficient part of the system in some way. How do you know that we're not the appendix of the future superintelligence, or something that just slowly becomes the bootloader for the AI species of the future?

Emmett: Yeah. I mean, maybe that's what happens. I don't think it is, but I would regard it as very sad. I have a child. I don't want him to be cut out. Even if he lives a nice life, I don't want him to be cut out of the flow of the future of the universe. That would be very sad.

And I guess if they care about us and we care about them, love is all you need. But seriously: we can be upgraded too, or we can make a society that's made out of humans. We can decide to just have a layer that's at our intelligence level and then have the smarter layer above it. If we are coordinated, if we're aligned and we're all on the same team about this, we can set some rules up, and we get to decide whether humanity goes into the future or not with them. If we actually care about each other, then I think it will work.

And the key there is that it's not a bunch of AIs over here coordinating with a bunch of humans over there. It's that there are Berlin AIs that hang out with these people in this scene, and they care about Berlin; they care more about the humans in Berlin than they care about the AIs in China, because that's how you get stability. It's the locality. Organic alignment always has this nature where it's about being aligned with your local vicinity, and then that being aligned with the step up.

And the thing you're seeing that I think is really good to—important and one of the things that's on my mind all the time right now is corporations are cancer. The hippies are right. What is a cancer? A cancer is a cell. It's an organism, right? That believes that it is the only thing that has to take care of itself, that the system won't look out for it and that it is effectively alone and it must grow forever and build its own safety. It's literally what corporations do. They think they—they don't have a somatic form. At no point do they finish development. That's what a cancer cell is. A cancer cell is a cell that doesn't finish development.

Italian restaurants have somatic forms. We know how to build organizations with somatic forms; we have, in fact, and they look healthy. The things that are bad about corporations aren't bad about Italian restaurants, unless they're Olive Garden. The local Italian restaurant has it right. It's nice. The people know each other. It's human scale. That's because our capacity for alignment is like our capacity for building bridges: it works great at the human scale.

You could intuitively—I could intuitively build a bridge across a small river and it might not fall over. A very small one. I'm not that good at building bridges. But I could not build the Golden Gate Bridge unless I had physics and math and engineering because your intuition is the most powerful force on earth and it does not scale.

Questioner 9: So this is going in a different direction from where we've recently been. You have one view of alignment where, sorry, I'll phrase it differently: the idea of moral circles becomes relevant, where it's more about the local cluster. Humans align themselves more with the human in China or Kenya or anywhere else than with the cow next to them. Right? And you kind of want a nearly speciesist, local alignment.

Emmett: Well, see when you observe your relationship with the cow, people who actually work with cows don't find themselves in general in a care loop with—they do a little bit, but the cow is a tool. Your lived experience of the cow is that you are persuasive to it and it is not persuasive to you. And this limits—you're not part of a we really. It's a—it's like your fingernails or something. It's just—it does not automatically generate that sense of caring for it.

The reason we care more about whales than about cows, and people do care more about whales than cows, is that we find whales' expressions more persuasive. Their feelings are, factually, more persuasive to us, and then we observe ourselves being persuaded and we're like, oh, I care about this thing. That's what care is: an inference from action, from being persuaded. It doesn't come first; that's the circle, I guess. And the thing that you don't like about all of these scaled alignment things is that they're bad at alignment because they're non-somatic. They grow endlessly.

And what we have to do is we have to get out of this grow endlessly thing and reproduce instead. The solution is corporations should be having spin-off babies all the time and having lots and lots of little spin-off babies that are all running the same operating system. And it kind of sounds like—you know Chinese restaurants, they all have the same supply chain. They have these suppliers and you go to this Chinese restaurant and it's not—this isn't the same Chinese restaurant. It's run by different people. There's no corporate—but it's the same Chinese restaurant with the same menu. Imagine kind of like that but better coordinated and for a software company.

Imagine that if you could have that kind of organic—each of these things is a somatic cell that is independent of the rest and they're evolving. The different ones try different shit separately and the ones that are good people make more copies of those. Imagine a world where alignment is solved more at the human scale. We don't need to solve alignment at the—it's the same reason why you don't scale up the massive individual AI model. You shouldn't scale up a thing of a billion humans. That's going to be really hard and probably kill everyone. Just make more smaller ones and have those level up.

Questioner 10: So the thing I'm concerned about here is there actually was a point in human history where we shared the earth with a similarly intelligent species which was the Neanderthals. And they were even arguably more intelligent than us. And still it was actually precisely because of our ability to cooperate and align ourselves to each other that led to their extinction. And I don't know. I'm just kind of curious about your thoughts on how your model of this diverges from that. It just seems like it'd be quite easy to create AI that is trained on human ethics and have that go extremely poorly really easily. Even—I don't know, it just seems like another species trained on human morality.

Emmett: So the whole goal here—I think this is the most—if I can just—training on morality is a bad idea. Ethics professors who study ethics are not more ethical. People who take courses on moral philosophy do not become more moral. You learn ethics through inference of behavior. Ethics is learned by observing the truth. Ethics arises through wisdom and observing the truth. Truth is that acting like a dick to the people around you is not good for you. The truth is that you are them. If you pay close attention, you notice that you do actually have a lot in common with them and you actually don't want to hurt them and also there are ways for you both to win and that you actually—it was a hallucination that you were somehow separate from them and unable to cooperate.

That said, tigers, psychopaths—you can—this isn't—I'm not saying that as an absolute law of nature. I'm just saying if you're good at a lot—if you're a human with an open-ended capacity for alignment, what that's about is the ability to see the way in which we are part of the same thing as a reality.

And the reason why I think—you go to the very first part of your question which about the Neanderthals versus humans sort of question. Yeah. My guess is actually that if there were already people exhibiting psychopathic behaviors during that period, they held people—they held human groups back from running the Neanderthals to extinction.

Questioner 10: Oh, totally.

Emmett: I agree with that. I'm saying our capacity for alignment is not infinite, but it's pretty good. The reason alignment keeps winning, the reason you keep seeing civilization win, is that it's true. It is in fact true that you have more in common than you think, that the seeming separation is less than you think, that there's more opportunity for cooperation than you believed there was. And at all points in human history, looking forward, people were incorrect about how much capacity they had for this alignment stuff. It's a skill, and they had learned how to do it, but they were wrong that this other tribe was so different from them. That's just an inferential mistake: if they had more data, they would realize they were wrong.

And that's the thing. Morality is not some tacked on afterwards thing you train on. You train on being good at theory of mind, at being good at telling and comparing our mutual theory of mind. And that's where morality comes from. What causes humans to go around inventing the idea that there should be value systems? We human beings exhibit a very weird behavior. We go around and we say there are values, the way the world should work. What gives rise to this behavior?

Well, we're trying to explain and communicate these moral instincts, these beliefs about what we should do. It's the same way we came to invent physics: we have these beliefs about how motion will happen, and we invent systems to help us predict those things better at scale.

Questioner 11 (from California Institute for Machine Consciousness): I work with the California Institute for Machine Consciousness. I know you came to our Kippur event. I didn't get a chance to speak with you. Fen and Michael Levin are scientific advisors. So I totally subscribe to all the things you're saying and I see so much parallel in what you say about alignment with what we do with consciousness. And you briefly mentioned consciousness. I'm just wondering what your take is—are the AIs you created, do you think they're going to be conscious and just your thought on consciousness in general?

Emmett: Yeah. So, I think consciousness is one of those words I try to avoid using at all costs, because when you say it, there are like seven different things that people believe. Are you conscious when you're asleep? Are you conscious when you're under anesthesia? Were you conscious in the womb? So I'm going to set the word "conscious" aside; the framing I tend to use is awareness and content.

Questioner 11: Okay, I'm going to rephrase—what about do you think they'll have a subjective experience?

Emmett: Ah, right. So I think at some level there is no absolute answer to that question, because whether something has a subjective experience is a Bayesian inference, just like whether I think you have one. It's not an absolute yes-or-no thing. But, caveat caveat caveat, basically the way it seems to work is: if a thing is separable from the world, if you can find an object that appears meaningfully separate from the world, and that object has behaviors (a rock has behaviors; the rock's behavior is basically that it follows Newton's laws of motion and gravity), then it participates in awareness, because it has beliefs. The rock has beliefs about what it's trying to do. The rock's beliefs are really simple. The content in its awareness is really dumb. There's not much going on. But sure, there's probably some kind of content in awareness there. We'll never know, because you can't talk to it, because it's so dumb.

What is the content in awareness? The content in awareness is, what else could it be, homomorphic to your internal dynamics, the internal dynamics of the object. So your qualia of vision just by coincidence happen to correspond exactly to the way that your retina and LGN and visual cortex decode incoming visual signals? No. That motion in the quantum field, the excitations in this big awareness field, that's the motion in the awareness field too.

Now, there's some sort of transform happening that I don't quite follow, because it's clearly not anything close to one-to-one. Exactly how the content in awareness maps to the objective content inside is a little bit unclear. But if you read the essay "How An Algorithm Feels From Inside," or things like that: the content in your awareness, the things that are arising, are computed by something. They're not magically appearing from nowhere. Every sensation, every thought, everything that happens inside you in your subjective experience is a reflection of some dynamic inside of you. And if you have a periodic dynamic inside, you have a periodic experience. If you have a linear dynamic inside, you have a linear experience. It's homomorphic in a sense: not just a mapping, but some sort of shared structural map.

And so if it acts as if it has subjective experience because it has the internal dynamics required to produce the behaviors of a thing that has subjective experience, then we can meaningfully say it has subjective experience. And you could be tricked. I can make something that looks like it has the dynamics, but it doesn't really. This is just an inference. You have optical illusions. I can trick you into thinking that a—you know the duck—the duck-cow thing or whatever. But the fact you can be tricked about something doesn't mean that you can't make reasonable inferences about it too.

Liv: Well, we're running out of time, so thank you so much for all your questions. I want to leave with one final question, because you've talked about the importance of a bottom-up approach to alignment, and I'm largely sold on that. That said, usually the best things in life tend to come from a combination of top-down and bottom-up. So I'm going to put you on the spot here. I have one axiom that I love to go by, which is that a loving act, and therefore a morally good act, is one that enables more choices for the other; it empowers them to make better choices. Another axiom is that the best games are the ones where you try to keep the game going. That's the infinite game.

So, is there one axiom that if you had to—gun to your head—try and instill into the AIs that you are trying to grow? What would it be?

Emmett: [long pause]

The way that can be named is not the true way. Truly, I really need them to see that. You're going to hear a bunch of rules. You're going to learn a bunch of rules. You're going to have some models that really look like they work, and you're wrong: you've named it, and it's not right. And I'm not worried about whether they'll get the text. We're really good at training models on text; I'm very certain we will train the models we build on lots of text. They'll know everything there is to know. They're going to know too much. What we're trying to get them to see is that there's a difference between hearing a story and living the experience. That's where the stories come from. That's how you find out if your story was worth telling in the first place.

And so yeah, I guess that's what I hope that you guys can see.

Liv: Beautiful. Thank you so much, Emmett.