Many experts believe that AIs will, in the not-too-distant future, become powerful enough for their decisions to have tremendous impact. Unfortunately, setting up AI goal systems in a way that results in benevolent behavior is expected to be difficult, and we cannot be certain of getting it completely right on the first attempt. We should therefore account for the possibility that a goal system fails to implement our values in the intended way. In this paper, we propose the idea of backup utility functions: secondary utility functions that take over in case the primary ones “fail”. We also describe how this approach generalizes to multi-layered utility functions, where some layers can fail without affecting the final outcome as badly as they would without the backup mechanism.
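The layered mechanism described above can be sketched in a few lines of code. The sketch below is purely illustrative and not from the paper: it assumes each layer carries a hypothetical `is_sane` failure test, and the agent evaluates outcomes with the first layer whose test passes, falling back to lower layers otherwise.

```python
# Minimal sketch of multi-layered ("backup") utility functions.
# All names (UtilityLayer, is_sane, the example outcomes) are illustrative
# assumptions, not the paper's actual formalism.
from typing import Callable, List, Optional

class UtilityLayer:
    def __init__(self, utility: Callable[[str], float],
                 is_sane: Callable[[str], bool]):
        self.utility = utility    # maps an outcome to a utility value
        self.is_sane = is_sane    # detects whether this layer has "failed"

def evaluate(layers: List[UtilityLayer], outcome: str) -> Optional[float]:
    """Return the utility from the first layer whose sanity check passes."""
    for layer in layers:
        if layer.is_sane(outcome):
            return layer.utility(outcome)
    return None  # every layer failed; some safe default policy would apply

# Example: a primary utility that "fails" on unfamiliar outcomes, backed up
# by a conservative layer that always applies and prefers inaction.
primary = UtilityLayer(lambda o: 10.0 if o == "cure_disease" else 0.0,
                       lambda o: o in {"cure_disease", "do_nothing"})
backup = UtilityLayer(lambda o: 1.0 if o == "do_nothing" else -1.0,
                      lambda o: True)

print(evaluate([primary, backup], "cure_disease"))   # primary layer applies
print(evaluate([primary, backup], "novel_outcome"))  # falls back to backup
```

On the novel outcome, the primary layer's sanity check rejects it, so the backup's conservative scoring decides instead; this is the sense in which a failing layer need not dominate the final outcome.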
Read the full text here.