Making AIs less likely to be spiteful

This report is also posted to LessWrong here. Which forms of misalignment might result in particularly bad outcomes? And to what extent can we prevent them even if we fail at intent alignment? We define spite as a terminal preference for frustrating others’ preferences, at least under some conditions. Reducing the chances that an AI system is spiteful is a candidate class of interventions for reducing risks of AGI conflict, as well as risks from malevolence. This post summarizes some of our thinking on the topic. We give an overview of why spite might lead to catastrophic conflict; how we might intervene to reduce it; ways in which the intervention could fail to be impactful, or have negative impact; and things we could learn […]
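As a toy illustration of the definition above (the names and the functional form here are ours, not from the post): one simple way to model a spiteful agent is a utility function that places positive terminal weight on reducing another agent's payoff.

```python
# Toy model, for illustration only: an agent whose utility includes a
# terminal anti-social term weighted by a "spite" coefficient.

def spiteful_utility(own_payoff: float, other_payoff: float, spite: float) -> float:
    """Utility with a terminal preference for frustrating the other agent.

    spite = 0 recovers a purely self-interested agent;
    spite > 0 means the agent gains utility when the other party does worse.
    """
    return own_payoff - spite * other_payoff

# A purely selfish agent only cares about its own payoff...
selfish = spiteful_utility(5.0, 10.0, spite=0.0)    # 5.0
# ...while a spiteful agent prefers the outcome where the other does worse,
# even though its own payoff is identical in both cases.
spiteful_high = spiteful_utility(5.0, 10.0, spite=0.5)  # 0.0
spiteful_low = spiteful_utility(5.0, 2.0, spite=0.5)    # 4.0
```

The point of the sketch is only that spite is *terminal*: the preference for the other's losses persists even when it buys the agent nothing, which is what makes it a candidate driver of catastrophic conflict.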

Read more

Responses to apparent rationalist confusions about game / decision theory

I’ve encountered various claims about how AIs would approach game theory and decision theory that seem pretty importantly mistaken. Some of these confusions probably aren’t that big a deal on their own, and I’m definitely not the first to point out several of these, even publicly. But collectively I think these add up to a common worldview that underestimates the value of technical work to reduce risks of AGI conflict. I expect that smart agents will likely avoid catastrophic conflict overall—it’s just that the specific arguments for expecting this that I’m responding to here aren’t compelling (and seem overconfident). For each section, I include in the footnotes some examples of the claims I’m pushing back on (or note whether I’ve primarily seen […]

Read more

Open-minded updatelessness

Summary

Bounded agents might be unaware of possibilities relevant to their decision-making. That is, they may not just be uncertain but fail to conceive of some relevant hypotheses entirely. What's more, commitment races might pressure early AGIs into adopting an updateless policy from a position of limited awareness. What happens then when a committed AGI becomes aware of a possibility that’d have changed which commitment it’d have wanted to make in the first place? Motivated by this question, we develop "open-minded" extensions of updatelessness, where agents revise their priors upon experiencing awareness growth and reevaluate their commitment to a plan relative to the revised prior.

Introduction

Bounded agents may be unaware of propositions relevant to the decision problem they face.[1] […]
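The mechanism in the summary can be sketched in a few lines (all names, payoffs, and the prior-rescaling rule below are our own illustrative assumptions, not from the post): an agent commits to the plan that maximizes expected utility under its current prior; when its awareness grows, it assigns some mass to the newly conceived hypothesis, rescales the rest, and reevaluates the commitment against the revised prior.

```python
# Toy sketch of "open-minded" reevaluation after awareness growth.

def best_plan(plans, prior, payoff):
    """Return the plan maximizing expected utility under `prior`.

    payoff maps (plan, hypothesis) pairs to utilities.
    """
    return max(plans, key=lambda p: sum(prob * payoff[(p, h)]
                                        for h, prob in prior.items()))

def grow_awareness(prior, new_hypothesis, mass):
    """Reserve `mass` for a newly conceived hypothesis; rescale the rest."""
    revised = {h: prob * (1.0 - mass) for h, prob in prior.items()}
    revised[new_hypothesis] = mass
    return revised

plans = ["commit_aggressive", "commit_cautious"]
payoff = {
    ("commit_aggressive", "opponent_folds"): 10,
    ("commit_aggressive", "opponent_commits_too"): -100,
    ("commit_cautious", "opponent_folds"): 2,
    ("commit_cautious", "opponent_commits_too"): 1,
}

# Limited awareness: the agent has not conceived of mutual commitment.
prior = {"opponent_folds": 1.0}
initial = best_plan(plans, prior, payoff)   # "commit_aggressive"

# Awareness growth: the agent now conceives of the opponent committing too,
# revises its prior, and reevaluates its commitment from scratch.
prior = grow_awareness(prior, "opponent_commits_too", mass=0.2)
revised = best_plan(plans, prior, payoff)   # "commit_cautious"
```

Under the revised prior the aggressive commitment has negative expected utility, so the open-minded agent abandons it; a fully updateless agent that evaluated commitments only relative to its original limited-awareness prior would stay locked in.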

Read more