Surrogate goals might be one of the most promising approaches to reducing (the disvalue resulting from) threats. The idea is to add to one’s current goals a surrogate goal that one did not initially care about, in the hope that any potential threats will target this surrogate goal rather than what one initially cared about.
In this post, I will outline two key obstacles to a successful implementation of surrogate goals.
To steer the development of powerful AI in beneficial directions, we need an accurate understanding of how the transition to a world with powerful AI systems will unfold. A key question is how long such a transition (or “takeoff”) will take.
Published on the CLR blog, where researchers are free to explore their own ideas on how humanity can best reduce suffering.

Summary

This post was originally written for internal discussions only; it is half-baked and unpolished. The post assumes familiarity with the ideas discussed in Caspar Oesterheld’s paper Multiverse-wide cooperation via correlated decision-making. I wrote a short introduction to multiverse-wide cooperation in an earlier post (but I still recommend reading parts of Caspar’s original paper, or this more advanced introduction, because several of the points below build on topics not covered in my introduction). With that out of the way: in this post, I will comment on what I think might be interesting aspects of multiverse-wide cooperation […]
Agents that threaten to harm other agents, either in an attempt at extortion or as part of an escalating conflict, are an important source of agential s-risks. To avoid worst-case outcomes resulting from the execution of such threats, I suggest that agents add a “meaningless” surrogate goal to their utility function.