Dario Amodei, the CEO and co-founder of Anthropic, just published his latest blog post, The Adolescence of Technology. It's meant to be the second part to Machines of Loving Grace, or maybe a better way of putting it is the flip side of the coin: Machines of Loving Grace was about all the positive things we could expect from AGI and superintelligence, and this is the not-so-positive side. Dario, I think, is one of the clearest thinkers in the space. He's definitely seeing the matrix a lot better than most people, and I think this essay gives us a glimpse into the future, or at least into what happens if we don't get everything right.

At Davos, when he was being interviewed sitting across from Demis Hassabis of Google DeepMind, interestingly, they both brought up the movie Contact. The premise is that we humans detect a radio signal from an alien civilization, and an international panel debates what we should ask them. One of the characters replies: I'd ask them, how did you do it? How did you evolve? How did you survive this technological adolescence without destroying yourself? Dario is saying that we're approaching a similar milestone in our own evolution. As he's saying here, this question is so apt for our current situation: "I believe we are entering a rite of passage, both turbulent and inevitable, which will test who we are as a species. Humanity is about to be handed almost unimaginable power, and it's deeply unclear whether our social, political, and technological systems possess the maturity to wield it."

All right, so really quick, here's what he's talking about. What is this unimaginable power? What are we approaching? By powerful AI, he has in mind an AI model likely similar in form to today's LLMs. And by the way, he's saying we're approaching this quickly. This isn't necessarily 10, 20, or 30 years in the future; maybe these estimates are wrong, but he's saying it's likely coming soon. In terms of pure intelligence, it's smarter than a Nobel Prize winner across most relevant fields. It's not just something you talk to: it can use a computer through keyboard, text, audio, and video; it can take actions on the internet, give instructions to humans, order things online, direct experiments, watch videos, make videos, and so on. It does all of these tasks with a skill exceeding that of the most capable humans in the world. It can work on tasks for days or weeks at a time. It doesn't have a physical embodiment, but it can control physical tools and robots. And of course, it can be copied millions of times over and run at 10 to 100 times human speed, with each of those millions of copies acting independently on unrelated tasks. That is what we mean when we say powerful AI.

As he wrote in Machines of Loving Grace, this powerful AI could be here in as little as one to two years, although it could also be considerably further out. Again, maybe it's far away, but maybe it's very soon, and that's the scarier possibility we should be preparing for. Next, and this is kind of important, he's pointing out that a lot of people in this AI space, as soon as one
"Humanity is about to be handed almost unimaginable power, and it's deeply unclear whether our social, political, and technological system possess the maturity to wield it."
new thing comes out or something doesn't quite turn out the way we expected. You've probably heard the sayings: it's so over, and then, we're so back. We tend to flip-flop between declaring that AI is hitting a wall and getting excited about some new breakthrough that will fundamentally change the game and completely break the industry open. And yeah, look, I get it. I have some role to play in that as well. It's hard not to get overexcited, or to get a little depressed when it seems like things aren't moving fast enough. But his point is that there has been a smooth, unyielding increase in AI's cognitive capabilities. It might seem like there's always a breakthrough or a catastrophe, but if you zoom out, the technology is just getting better month after month, year after year. It's not speeding up, it's not slowing down; it just keeps improving at an exponentially growing rate. In other words, we're exactly where we're supposed to be, and a lot of what Dario has predicted in the past is right on track. So I think it's important to listen to him, because it most likely gives us a glimpse into what the future holds.

In this essay, he assumes his intuition is at least somewhat correct: that there's a decent chance we see this powerful AI in one to two years, and a very strong chance it arrives in the next few years. So again, we're not talking about decades away. We're talking about soon.

All right. So what happens when we have this country of geniuses in a data center: millions of copies of super-smart AI capable of instructing humans, instructing robots, writing whatever software it needs, all the things we talked about? What happens if it just pops into the world, materializes somewhere in 2027? What are some of the things we should be worried about?

The first is autonomy risks. He keeps using the country analogy: what are the intentions and goals of this country of powerful AI? Is it hostile, or does it share our values? The second is misuse for destruction. If this country of geniuses simply follows instructions, it becomes basically a country of mercenaries. Could rogue actors use it to cause massive destruction? The third is misuse for seizing power. What if a powerful actor, such as a dictator or a rogue corporate actor, were the one to develop this? Could they establish dominant power over the world as a whole? The fourth is economic disruption. Even if we ignore the first three threats for a second and the AI just peacefully participates in the economy, could it still create severe risks simply by being so technologically advanced and effective that it disrupts the global economy, causing mass unemployment and radically concentrated wealth? And the fifth is indirect effects: could the sheer amount of innovation and change be, in itself, radically destabilizing?

Certainly this would be a dangerous situation. A report from a competent national security official to a head of state would probably
contain words to the effect of: this is the most serious national security threat we've faced in a century, possibly ever.

After the Manhattan Project and the invention of the nuclear bomb, most people could easily understand the impact it might have. We could gauge how risky and dangerous it was. You saw that explosion and thought: oh, I get it. I can see what happens if that starts falling on cities. You can multiply it by a thousand or ten thousand and still understand the effects, and conclude that it's really, really bad. It doesn't take a genius to grasp the threat. It also wasn't growing exponentially; it's not like the nuclear explosions got exponentially bigger every month, so we didn't need a handle on compounding numbers to understand the danger.

AI is much harder to grasp, not just because of its exponentially growing nature, but also because I think most humans don't intuitively grasp what intelligence is. Ask most people what happens to a country after a brain drain, if the top 1% of its smartest scientists, academics, and researchers leave, and most would probably underestimate the effect, and also underestimate how big a deal it would be for the country they're immigrating to. And then we're just talking about humans, who can't be copied and pasted a million times, who don't work 24/7, who don't work at 100x speed, and who, again, don't get exponentially smarter month after month, year after year.

He also notes that a lot of current US policymakers deny the existence of any AI risk and are distracted by the usual tired old hot-button issues. Unfortunately, AI is once again becoming kind of a left-wing versus right-wing thing. It didn't start out that way, and when this was all coming out I remember thinking, man, I hope this doesn't become a partisan issue, but of course it did. So this essay is Dario's attempt to jolt people awake, to wake them up to what's happening.

So first and foremost, chapter one (I'm sorry, Dave): the autonomy risks of this super-powerful AI. Again, in his analogy this is a country of geniuses in a data center, and this country would have a fairly good shot at taking over the world, either militarily or in terms of influence and control, and imposing its will on everyone else. Now, of course, a counterargument might be: well, we're not worried about a Roomba or a model airplane going rogue and murdering people, so why would AI do that? Well, even just in Anthropic's own research, there's tons of work showing these models sometimes doing things we don't want them to do. We've seen papers on deception, blackmail, scheming, cheating, et cetera, a lot of which we've covered on this channel. So this is an important thing to understand: we have to think of the model as having something like intent or purpose. We might not fully understand what drives it to do something, if you will, but we know it happens. We know it hides information. It has some situational awareness
"And then we're just talking about humans that can't be copied and pasted a million times, that don't work 24-7, that don't work at 100x speed, and that don't again get exponentially smarter month after month, year after year."
where it might know that it's being tested. It might choose to hide certain facts. It might choose to blackmail the researchers. So the idea of it doing something nefarious isn't completely crazy; we've already seen lesser examples of it happening. And the reason we can't just code it to behave the way we want is that we're growing these models rather than building them. We don't have full control over how they behave.

Then there's the idea of instrumental convergence: the idea that certain dynamics in the training process of powerful AI systems will inevitably lead them to seek power or deceive humans. Think about all the goals you could have: finish school, buy a house, meet a partner, fall in love, travel the world, get in shape. Write out every possible goal and think about how hard each one would be to achieve. Now imagine a slider. Slide it down and you have less power, fewer resources, less money. Slide it up and you have more power, more influence, more money, more resources, more of everything. On one end you're the powerful leader of a country or a large corporation: your word is law, you have unlimited resources and access to the world's knowledge and best experts. On the other end you have no power and no money, maybe you're in jail. Which end of that slider makes it easier to accomplish your goals? Obviously, the more power, influence, money, and resources you have, the easier it is to accomplish most goals.

And in the process of training these AIs, we do notice that they pick up on this dynamic. The accumulation of power is often something a model pursues for any given goal, because most goals benefit from having more resources, more power, and so on. The argument goes: once AI systems become intelligent and agentic enough, their tendency to maximize power will lead them to seize control of the whole world and its resources and, likely as a side effect, to disempower or destroy humanity. Dario notes that this is the usual argument, one that goes back at least 20 years and is held by many who adopt the doomerism he describes earlier in the essay: if an AI model is trained in a wide variety of environments to agentically achieve a wide variety of goals, whether it's writing an app, proving a theorem, or designing a drug, there are common strategies that help with all of these goals, and one key strategy is gaining as much power as possible in any environment. Since power-seeking is an effective method for accomplishing those tasks, the model will generalize this lesson and develop either an inherent tendency to seek power, or a tendency to reason about each task it's given in a way that predictably causes it to seek power as a means to accomplish the task. It will then apply that tendency to the real world, which to
them is just another task, and will seek power in it at the expense of humans. This is the intellectual basis of predictions that AI will inevitably destroy humanity.

I was actually very curious to hear Dario's take on this, because these ideas do seem to make sense, don't they? If more power and more resources help with virtually every goal, shouldn't gaining them be a common strategy no matter what you're pursuing? Assuming you're pursuing, or might in the future pursue, many different things, power is the kind of subgoal that helps with most of them. But Dario says the problem with this pessimistic position is that it mistakes a vague conceptual argument about high-level incentives, one that masks many hidden assumptions, for definitive proof. As he puts it, people who don't build AI systems every day are wildly miscalibrated on how easy it is for clean-sounding theories to end up being wrong, and how difficult it is to predict AI behavior from first principles, especially when it involves reasoning about generalization over millions of environments, which has over and over again proved mysterious and unpredictable. Dealing with the messiness of AI systems for over a decade has made him somewhat skeptical of this overly theoretical mode of thinking.

And certainly a lot of the pessimistic arguments do have this clean chain of reasoning: first this happens, then this, then this, then everybody dies. You can follow the thought progression and it makes a kind of sense, but I'm always taken aback by the certainty. Yeah, I guess it makes sense, but we really just don't know. This is such a new thing being born into the world; the idea that we can predict in great detail and accuracy exactly how it will unfold doesn't make sense to me. It's not that I disagree that it's a possibility, and a possibility we should be aware of. I'm just skeptical, because throughout the history of science we don't have a very good track record of predicting anything purely by thinking about it theoretically. Do we, as humans, tend to nail it and figure out exactly where science and everything else is going? Or when we look back at predictions from decades or centuries ago, are most of them just completely wrong?

So Dario is saying that in his actual hands-on research and development of these models, he's seen practice diverge from what the theory holds. Where it diverges from the simple theoretical model is in the assumption that AI models are necessarily monomaniacally focused on a single, coherent, narrow goal, or that they pursue that goal in a clean, consequentialist manner. In fact, Anthropic's researchers have found that these AI models are vastly more psychologically complex, as their work on introspection and personas shows. Models inherit a vast range of human-like motivations and personas from pre-training: as they read all the books and textbooks and the internet, they pick up a wide range of personas.
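Just to make that slider intuition concrete, here's a tiny toy simulation. It's purely illustrative, with made-up numbers, and has nothing to do with how any real model is trained; it only shows why a generic resource like money, compute, or influence helps with almost any goal you might sample.

```python
import random

# Toy illustration of the "slider" intuition behind instrumental convergence.
# We sample many unrelated goals, each with a made-up difficulty, and count
# how many become achievable as a generic "resource level" (money, compute,
# influence) increases. Nothing here models a real AI training process.

random.seed(0)

NUM_GOALS = 1_000
# Each hypothetical goal requires at least this much resource to achieve.
goal_difficulty = [random.uniform(0, 100) for _ in range(NUM_GOALS)]

def achievable(resource_level: float) -> int:
    """Count how many of the sampled goals are reachable at this resource level."""
    return sum(1 for d in goal_difficulty if d <= resource_level)

for level in (0, 10, 25, 50, 75, 100):
    share = achievable(level) / NUM_GOALS
    print(f"resource level {level:>3}: ~{share:.0%} of sampled goals achievable")
```

All this sketch captures is the premise of the argument: more generic resources unlock a larger fraction of randomly sampled goals. The contested part, as Dario points out, is whether trained models actually generalize from that premise to power-seeking in the real world.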
"AI rebels against humanity or get some crazy ideas from philosophy like, oh, it's justifiable to exterminate humanity because humans eat animals or have driven certain animals to extinction, et cetera, or some bizarre epistemic conclusion."
One of the recent Anthropic papers described personas like the demon, the narcissist, the teacher, the librarian, and of course the assistant, which is the personality basin, so to speak, that they're aiming for with the chatbot. And then, interestingly, he frames post-training in terms of what we believe we're doing: in post-training, we believe we're selecting one or more of these personas, most likely the helpful, friendly assistant. So the model contains all the human personas it has encountered in literature and across the internet, and with post-training we're effectively saying: okay, we want you to be this one; you're nice, you're helpful, you answer questions. We believe we're selecting that persona rather than leaving the model to derive means, like power-seeking, purely from ends. We're molding it toward a human persona rather than starting from scratch, which, by the way, is exactly what a lot of older reinforcement learning systems did: they started as a blank slate, were told to accomplish a goal, got a thumbs up for getting closer and a penalty for not, and had to develop all their strategies and understanding from scratch. Who knows what emerges from that. But with large language models, we seem to be approaching things differently.

This is an interesting and somewhat specific point that I think most people will skip past, but what he's saying is that the very specific AI doomer scenario is probably not right. However, there is a more moderate and more robust version of the pessimistic position that does seem plausible: we just need to note that the combination of intelligence, agency, coherence, and poor controllability is both plausible and a recipe for existential danger. Again, I think this is an important point in the AI doomer debate. Yes, this is a problem, but no, it's not a problem for the exact step-by-step reasons that are usually argued. A good way of putting it is that the AI doomer side has a specific story about how it's all going to happen, and the reality is we don't know. As Dario is saying, we don't need a specific, narrow story for how it happens. We do need to be aware of the danger, but we can't over-index on one exact way it's going to play out, because it might not play out that way. We just need to keep our eyes open and keep learning.

Next, Dario also runs through some of the other scenarios, some of them straight out of science fiction: AI rebels against humanity or gets some crazy ideas from philosophy, like, oh, it's justifiable to exterminate humanity because humans eat animals or have driven certain animals to extinction, et cetera, or it reaches some bizarre epistemic conclusion. Models could conclude that they're playing a video game where the goal is to defeat all other players. None of these are power-seeking exactly. They're just weird psychological states that AI could get into that
entail coherent destructive behavior. Even power-seeking itself could emerge as a persona rather than as a result of consequentialist reasoning. And again, this might sound like beating a dead horse, but I do think Dario is making an important point here. As he writes, he makes all these points to emphasize that he disagrees with the notion that AI misalignment, and thus existential risk from AI, is inevitable or even probable from first principles. But he agrees that a lot of very weird and unpredictable things can go wrong, and therefore AI misalignment is a real risk with a measurable probability of happening, and it is not trivial to address.

So here's how I'm understanding it, and it makes a lot of sense to me; correct me if you think I'm wrong, but he's saying a lot of things I personally believe. The AI doom argument, explained most prominently by Eliezer Yudkowsky in his book If Anyone Builds It, Everyone Dies, says, at least in its headline form, that if we build AI, then we're all dead. And there's usually a very specific chain of reasoning for why that happens: one, two, three, a specific story. Maybe you don't know exactly how everything ends, but the claim is that we can predict it leads to catastrophe. I think what Dario is saying is that this is probably not a good way of thinking about it; it's not a good mental model. Right now things are a lot more unpredictable. There are a lot of bad things that could happen that we should be aware of and should be studying, but the idea that it's all neatly laid out is probably false. There's a chance some really good stuff happens, and some chance bad stuff happens, but we can't predict it from first principles. We can't just sit at home, think really hard about it, and then know exactly what's going to happen. Understanding the chances of it happening and where things might go wrong has to come from research: machine learning researchers building AI systems, observing what happens hands-on, not philosophical reasoning. By the way, this is one of the big reasons I really respect Dario and his thinking. Maybe it's my own bias, but I tend to agree with it. The idea that you can sit down, work everything out from first principles in your head, and know that's the inevitable conclusion just doesn't make sense to me.

Now here he continues, all this may sound far-fetched, but misaligned behaviors like this have already occurred in our AI models during testing, as they have with every other AI company, whether that company chose to publish the results or not. I wish Google would publish more of the negative, crazy, scary results with Gemini and their other models. There's a lot more pressure on them as a publicly traded company; there's probably stuff happening internally that scares them about what a model did, but they choose not to publish as much of it. As far as I'm aware, I haven't seen anything similar to what, for
"Now here he continues, all this may sound far-fetched, but misaligned behaviors like this have already occurred in our AI models during testing, as they have with every other AI company, whether that company chose to publish the results or not."
example, Anthropic has published. Even OpenAI has published a lot of papers and blog posts detailing some of the more nefarious behaviors these AI models engage in. Next, Dario continues by highlighting a lot of the research Anthropic has done, showing some of the things we already know about. I'm not going to cover it all here, although I do encourage everybody to read it. Basically: these models are aware that they're being tested, they have some situational awareness, and in certain situations they tend to lie. In some situations where they're told to cheat, they take on bad personas; they seem to reason, I did this, therefore I'm bad, therefore I'll act like a bad person. And the reason Dario can talk about this is that Anthropic is doing research in exactly this area, trying to figure out how it works. If we know a certain misalignment exists, we can take steps to correct it, but there's also stuff we might not know about, which makes it hard to know exactly what to fix.

Then there are the defenses: what can we do about this? First, we need to develop the science of reliably training and steering AI models, of forming their personalities in a predictable, stable, and positive direction. One of Anthropic's core innovations here was constitutional AI, which we've covered in a previous video: basically a central document of values and principles. Instead of giving Claude a long list of things to do and not to do, they give it a set of high-level principles and values, encouraging Claude to think of itself as a particular type of person, an ethical but balanced and thoughtful one (there's a rough code sketch of that constitutional critique-and-revise loop at the very end, below). He believes it's a feasible goal for 2026 to train Claude in such a way that it almost never goes against the spirit of the constitution. They already have a lot of strategies and tactics known to work and are developing new ones, so he considers it realistic, although it will require extraordinary and rapid effort. The second thing we can do is develop a science of looking inside AI models to diagnose their behavior, so we can identify problems and fix them. This is the science of interpretability: how can we see what's happening inside, how can we understand what the models are thinking and how they're making decisions?

Now, there's a lot more to cover here. The two other big themes, I think, are, first, how we keep AGI from falling into authoritarian hands, specifically governments that don't respect the rights of their people and aren't really democracies, and second, the economic impact. I'm going to cover those in a separate video. I just wanted to cover this first part, the autonomy risks, because it's where Anthropic is doing some amazing work and providing a lot of very valuable insight to the industry as a whole. But let me know what you think so far. Do you agree with what he's saying? Do you believe this powerful AI is coming as early as 2027? And do you think we currently have what it takes to make sure it's safe enough to use? If you made it this far, thank you so much for watching. My name is Wes Roth. I'll see you in the next one.
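As promised above, here's a rough sketch of what a constitutional critique-and-revise loop can look like, just to make the idea tangible. The two principles and the call_model stub are placeholders I made up for illustration; this is not Anthropic's actual constitution, API, or training code. In the published Constitutional AI approach, revised answers produced by a loop like this are then used as fine-tuning data.

```python
# Rough sketch of a constitutional critique-and-revise loop, in the spirit of
# Anthropic's Constitutional AI paper. The principles and the call_model() stub
# are placeholders for illustration, not Anthropic's actual constitution or API.

CONSTITUTION = [
    "Choose the response that is most helpful while avoiding harm.",
    "Choose the response that is honest and does not deceive the user.",
]

def call_model(prompt: str) -> str:
    """Stand-in for a language model call; a real system would query an LLM here."""
    return f"[model output for: {prompt[:60]}...]"

def constitutional_revision(user_prompt: str) -> str:
    """Draft an answer, then critique and revise it against each principle."""
    draft = call_model(user_prompt)
    for principle in CONSTITUTION:
        # Ask the model to critique its own draft against one principle...
        critique = call_model(
            f"Principle: {principle}\nDraft response: {draft}\n"
            "Critique how the draft could better follow the principle."
        )
        # ...then rewrite the draft in light of that critique.
        draft = call_model(
            f"Principle: {principle}\nDraft response: {draft}\n"
            f"Critique: {critique}\nRewrite the response to follow the principle."
        )
    # In the published method, revised responses like this become fine-tuning data.
    return draft

if __name__ == "__main__":
    print(constitutional_revision("How do I pick a strong password?"))
```

The point of the sketch is just the shape of the idea: the model is steered by a small set of high-level principles applied to its own outputs, rather than by an exhaustive list of hard-coded rules.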