Self-improvement races

Published on the CLR blog, where researchers are free to explore their own ideas on how humanity can best reduce suffering.
Cross-posted from my and Johannes Treutlein's personal blog.


Most of my readers are probably familiar with the problem of AI safety: if humans were to create super-human-level artificial intelligence, the task of programming it in such a way that it behaves as intended would be non-trivial. There is a risk that the AI will act in unexpected ways, and given its super-human intelligence, it would then be hard to stop.

I assume fewer are familiar with the problem of AI arms races. (If you are, feel free to skip this paragraph.) Imagine two opposing countries, each trying to build a super-human AI to reap its many benefits and potentially attain a decisive strategic advantage, perhaps taking control of the future immediately. (It is unclear whether this latter aspiration is realistic, but it seems plausible enough to significantly influence decision making.) This creates a strong incentive for the two countries to develop AI as fast as possible, especially if each would dislike a future controlled by the other. For example, North Americans may fear a future controlled by China. In such cases, countries would want to invest most available resources into creating AI first, with less concern for whether it is safe. After all, letting the opponent win may be about as bad as creating an AI with entirely random goals. It turns out that under certain conditions, both countries would invest close to no resources into AI safety and all resources into AI capability research (at least, that is the Nash equilibrium), thus leading to an unintended outcome with near certainty. If countries are sufficiently rational, they might be able to cooperate to mitigate the risks of creating uncontrolled AI. This seems especially plausible given that the values of most humans are very similar to each other relative to how alien the goals of a random AI would probably be. However, given that arms races have frequently occurred in the past, a race toward human-level AI remains a serious worry.
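
To make the Nash-equilibrium claim concrete, here is a minimal sketch of a toy race model. All of its payoff assumptions are mine rather than taken from any particular paper: each country splits a unit budget between safety and capability, the more capable (less safe) side wins the race outright, and the winner's AI behaves as intended only with probability equal to its safety investment. Losing is assumed to be exactly as bad as an uncontrolled AI.

```python
# Toy race model (a minimal sketch; all payoff assumptions are mine):
# each player splits a unit budget between safety s and capability 1 - s.
# The more capable player builds AI first; the winner's AI then behaves as
# intended only with probability s. Payoff 1 for winning with an aligned AI,
# 0 for everything else -- losing is assumed as bad as an uncontrolled AI.

GRID = [i / 100 for i in range(101)]  # candidate safety levels 0.00 .. 1.00

def payoff(s_i, s_j):
    """Expected payoff for choosing safety s_i against an opponent choosing s_j."""
    if s_i < s_j:            # i is faster and wins the race...
        return s_i           # ...but its AI is aligned only with probability s_i
    if s_i == s_j:           # tie: each side wins with probability 1/2
        return 0.5 * s_i
    return 0.0               # i is slower and loses

def best_response(s_j):
    return max(GRID, key=lambda s: payoff(s, s_j))

# Iterated best responses: each side keeps undercutting the other's safety.
s = 1.0
while True:
    s_new = best_response(s)
    if s_new == s:
        break
    s = s_new
print(f"equilibrium safety level: {s:.2f}")  # ends near zero (0.01 on this grid)
```

In this sketch, each side slightly undercuts the other's safety level to win the race, so iterated best responses erode safety almost completely: the only stable point on the grid is near zero safety, mirroring the equilibrium claim above.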

Handing power over to AIs holds both economic promise and a risk of misalignment. Similar problems actually haunt humans and human organizations. Say a charity hires a new director who has been successful at other organizations. This creates an opportunity for the charity to rise in influence. However, it is also possible that the charity will change in ways that the people currently or formerly in charge would not approve of. Interestingly, the situation is similar for AIs that create other AIs or self-improve. Learning and self-improvement are the paths to success. However, self-improvements carry the risk of affecting the goal-directed behavior of the system.

This risk seems plausible prima facie: it should be strictly easier to find self-improvements that "probably" work than to identify self-improvements that are guaranteed to work, since the former are a superset of the latter. So, AIs that are willing to take risks while self-improving can improve faster.

There are also formal justifications for the difficulty of proving self-improvements correct. Specifically, Rice's theorem states that for any non-trivial semantic property p, there is no algorithm that decides for all programs whether they have property p. (If you know about the undecidability of the halting problem, Rice's theorem follows almost immediately from it.) As a special case, deciding for all programs whether they pursue some given goal is impossible. Of course, this does not mean that proving self-improvements correct is impossible. After all, an AI could limit itself to the self-improvements that it can prove correct (see this discussion between Eliezer Yudkowsky and Mark Waser). However, without this limitation – e.g., if it can merely test a self-improvement empirically and implement it if it seems to work – an AI can draw on a broader range of possible self-modifications and thus improve more quickly. (In general, testing a program also appears to be a lot easier than formally verifying it, but that's a different story.) Another relevant result from provability logic is Löb's theorem, which roughly implies that a logical system containing Peano arithmetic cannot prove another system of that power to be correct.
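
To see how the halting problem yields Rice's theorem for goal-pursuit, here is an illustrative sketch of the standard reduction. The names `pursues_goal` and `goal_pursuer` are hypothetical placeholders: `pursues_goal` is the supposed decider for "this program pursues goal G", and we assume a program that never returns does not pursue G while `goal_pursuer` does (making the property non-trivial and semantic).

```python
# Sketch of the standard reduction from the halting problem to deciding a
# non-trivial semantic property ("pursues goal G"). Purely illustrative:
# no such decider `pursues_goal` can actually exist.

def make_wrapper(program, input_value, goal_pursuer):
    """Return a program that pursues goal G iff `program` halts on `input_value`."""
    def wrapper(x):
        program(input_value)    # loops forever if `program` never halts on `input_value`
        return goal_pursuer(x)  # reached only if `program` halted; wrapper then pursues G
    return wrapper

def halts(program, input_value, pursues_goal, goal_pursuer):
    """A halting decider built from the hypothetical goal-property decider.

    Since the halting problem is undecidable, `pursues_goal` cannot exist,
    which is exactly Rice's theorem for this property.
    """
    return pursues_goal(make_wrapper(program, input_value, goal_pursuer))
```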

Lastly, consider Stephen Wolfram's fuzzier concept of computational irreducibility. It basically states that as soon as a system can produce arbitrarily complex behavior (i.e., as soon as it is universal in some sense), predicting most aspects of its behavior becomes fundamentally hard. Specifically, he argues that for most (especially complex and universal) systems, there is no way to find out how they behave other than running them.
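
Wolfram's standard example is the elementary cellular automaton Rule 30. The following minimal sketch (my code, not Wolfram's) simulates it; as far as anyone knows, there is no shortcut for predicting, say, its center column at step t that is fundamentally faster than running all t steps.

```python
# Rule 30: each new cell is left XOR (center OR right).
# Despite this one-line rule, its behavior looks random, and no known
# method predicts it substantially faster than direct simulation.

def rule30_step(cells):
    """Advance the Rule 30 cellular automaton by one step (wrapping at the edges)."""
    n = len(cells)
    return [cells[(i - 1) % n] ^ (cells[i] | cells[(i + 1) % n]) for i in range(n)]

cells = [0] * 15 + [1] + [0] * 15   # a single black cell in the middle
for _ in range(15):
    print("".join(".#"[c] for c in cells))
    cells = rule30_step(cells)
```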

So, self-improvement can give AIs advantages and ultimately the upper hand in a conflict, but done too hastily, it can also lead to goal drift. Now, consider the situation in which multiple AIs compete in a head-to-head race. Given the above considerations, this situation becomes very similar to the AI arms race between groups of humans: every single AI has an incentive to take risks to increase its probability of winning, but overall this can lead to unintended outcomes with near certainty. There are reasons to expect this self-improvement race dynamic to be more of a problem for AIs than for human factions. The goals of different AIs could diverge much more strongly than the goals of different humans. Whereas human factions may prefer the enemy's victory to a takeover by an uncontrolled AI, an AI with human values confronting an AI with strange values has less to lose from risky self-modifications, as the sketch below illustrates. (There are some counter-considerations as well. For instance, AIs may be better at communicating and negotiating compromises.)
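
The point about diverging values can be made within the earlier toy model. In the following sketch (again under my own assumed payoffs), the loser values the winner's aligned outcome at v, where v is near 1 for factions with similar values and near 0 for agents whose goals have strongly diverged.

```python
# Extension of the race sketch above: the loser now values the winner's
# aligned outcome at v. Payoffs are again assumptions, not from the post.

GRID = [i / 100 for i in range(101)]  # candidate safety levels 0.00 .. 1.00

def payoff(s_i, s_j, v):
    if s_i < s_j:                   # i wins the race, aligned with probability s_i
        return s_i
    if s_i == s_j:                  # tie: each side wins with probability 1/2
        return 0.5 * (s_i + v * s_j)
    return v * s_j                  # i loses; values the winner's aligned outcome at v

def equilibrium_safety(v):
    """Follow iterated best responses down from full safety to a fixed point."""
    s = 1.0
    while True:
        s_new = max(GRID, key=lambda c: payoff(c, s, v))
        if s_new == s:
            return s
        s = s_new

for v in (0.0, 0.5, 0.9):
    print(f"v = {v:.1f}: safety erodes to {equilibrium_safety(v):.2f}")
```

On this grid, safety erodes almost completely when v = 0 but far less when v = 0.9: the less an agent minds its opponent winning, the more risk it is pushed to take.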

Thus, a self-improvement race between AIs seems to share the bad aspects of AI arms races between countries. This has a few implications:

  • Finding a way for AIs to cooperate and prevent self-improvement races and other uncooperative outcomes (or finding out whether such a way exists) becomes more important.
  • Usually, one argument for creating AI and colonizing space is that an Earth-originating aligned AI could prevent other, less compassionate AIs (uncontrolled ones, or ones created by uncompassionate extraterrestrials) from colonizing space. So, according to this argument, even if you don't value what humans or human-controlled AIs would do in space, you should still prefer it as the lesser of two (or more) evils. However, the problem of self-improvement races calls this argument into question.
  • On a similar note, making the universe more crowded with AIs, especially ones with weird (evolutionarily uncommon) values or ones that are unable to cooperate, may be harmful, as it could lead to outcomes that are bad for everyone (except perhaps for an AI created in a self-modification gone wrong).
