Blog

Overview of Transformative AI Misuse Risks: What Could Go Wrong Beyond Misalignment

9 December 2024 by Samuel Martin

This post provides an overview of this report. Discussions of the existential risks posed by artificial intelligence have largely focused on the challenge of alignment - ensuring that advanced AI systems pursue human-compatible goals. However, even if we solve alignment, humanity could still face catastrophic outcomes from how humans choose to use transformative AI technologies. A new analysis examines these "misuse risks" - scenarios where human decisions about AI deployment, rather than AI systems acting against human interests, lead to existential catastrophe. This includes both intentional harmful uses (like developing AI-enabled weapons) and reckless deployment without adequate safeguards. The analysis maps out how such human-directed applications of AI, even when technically aligned, could lead to permanent loss of human potential. […]

Individually incentivized safe Pareto improvements in open-source bargaining

17 July 2024 by Nicolas Macé, Anthony DiGiovanni and Jesse Clifton

Summary Agents might fail to peacefully trade in high-stakes negotiations. Such bargaining failures can have catastrophic consequences, including great power conflicts, and AI flash wars. This post is a distillation of DiGiovanni et al. (2024) (DCM), whose central result is that agents that are sufficiently transparent to each other have individual incentives to avoid catastrophic bargaining failures. More precisely, DCM constructs strategies that are plausibly individually incentivized, and, if adopted by all, guarantee each player no less than their least preferred trade outcome. Figure 0 below illustrates this. This result is significant because artificial general intelligences (AGIs) might (i) be involved in high-stakes negotiations, (ii) be designed with the capabilities required for the type of strategy we’ll present, and (iii) bargain poorly by default (since bargaining competence isn’t […]

Making AIs less likely to be spiteful

26 September 2023 by Nicolas Macé, Anthony DiGiovanni and Jesse Clifton

This report is also posted to LessWrong here. Which forms of misalignment might result in particularly bad outcomes? And to what extent can we prevent them even if we fail at intent alignment? We define spite as a terminal preference for frustrating others’ preferences, at least under some conditions. Reducing the chances that an AI system is spiteful is a candidate class of interventions for reducing risks of AGI conflict, as well as risks from malevolence. This post summarizes some of our thinking on the topic. We give an overview of why spite might lead to catastrophic conflict; how we might intervene to reduce it; ways in which the intervention could fail to be impactful, or have negative impact; and things we could learn […]

Responses to apparent rationalist confusions about game / decision theory

30 August 2023 by Anthony DiGiovanni

I’ve encountered various claims about how AIs would approach game theory and decision theory that seem pretty importantly mistaken. Some of these confusions probably aren’t that big a deal on their own, and I’m definitely not the first to point out several of these, even publicly. But collectively I think these add up to a common worldview that underestimates the value of technical work to reduce risks of AGI conflict. I expect that smart agents will likely avoid catastrophic conflict overall—it’s just that the specific arguments for expecting this that I’m responding to here aren’t compelling (and seem overconfident). For each section, I include in the footnotes some examples of the claims I’m pushing back on (or note whether I’ve primarily seen […]

Open-minded updatelessness

10 July 2023 by Nicolas Macé, Jesse Clifton and Sylvester Kollin

Summary Bounded agents might be unaware of possibilities relevant to their decision-making. That is, they may not just be uncertain but fail to conceive of some relevant hypotheses entirely. What's more, commitment races might pressure early AGIs into adopting an updateless policy from a position of limited awareness. What happens then when a committed AGI becomes aware of a possibility that’d have changed which commitment it’d have wanted to make in the first place? Motivated by this question, we develop "open-minded" extensions of updatelessness, where agents revise their priors upon experiencing awareness growth and reevaluate their commitment to a plan relative to the revised prior. ContentsIntroductionUpdatelessness in a game of ChickenCrazy predictorDynamic awareness and open-mindednessChicken under dynamic unawarenessExploitabilityExploitability in ChickenUnexploitable […]

The optimal timing of spending on AGI safety work; why we should probably be spending more now

29 November 2022 by Tristan Cook

Tristan Cook & Guillaume Corlouer October 24th 2022 Summary When should funders wanting to increase the probability of AGI going well spend their money? We have created a tool to calculate the optimum spending schedule and tentatively conclude funders collectively should be spending at least 5% of their capital each year on AI risk interventions and in some cases up to 35%. This is likely higher than the current AI risk community spending rate which is at most 3%. In most cases, we find that the optimal spending schedule is between 5% and 15% better than the ‘default’ strategy of just spending the interest one accrues and from 15% to 50% better than a naive projection of the community’s spending […]

When is intent alignment sufficient or necessary to reduce AGI conflict?

14 October 2022 by Jesse Clifton, Samuel Martin and Anthony DiGiovanni

In this post, we look at conditions under which Intent Alignment isn't Sufficient or Intent Alignment isn't Necessary for interventions on AGI systems to reduce the risks of (unendorsed) conflict to be effective. We then conclude this sequence by listing what we currently think are relatively promising directions for technical research and intervention to reduce AGI conflict. ContentsIntent alignment is not sufficient to prevent unendorsed conflictWhen would consultation with overseers fail to prevent catastrophic decisions?Conflict-causing capabilities failuresFailures of cooperative capabilitiesFailures to understand cooperation-relevant preferencesWhy not delegate work on conflict reduction?Intent alignment may not be necessary to reduce the risk of conflictTentative conclusions about directions for research & interventionReferences Intent alignment is not sufficient to prevent unendorsed conflict In the previous post, we outlined […]

When would AGIs engage in conflict?

13 October 2022 by Jesse Clifton, Samuel Martin and Anthony DiGiovanni

Here we will look at two of the claims introduced in the previous post: AGIs might not avoid conflict that is costly by their lights (Capabilities aren’t Sufficient) and conflict that is costly by our lights might not be costly by the AGIs’ (Conflict isn’t Costly). ContentsExplaining costly conflictAvoiding conflict via commitment and disclosure ability? What if conflict isn’t costly by the agents’ lights? Candidate directions for research and interventionAppendix: Full rational conflict taxonomyEquilibrium-compatible casesEquilibrium-incompatible casesReasons agents don’t disclose private informationReferences Explaining costly conflict First we’ll focus on conflict that is costly by the AGIs’ lights. We’ll define “costly conflict” as (ex post) inefficiency: There is an outcome that all of the agents involved in the interaction prefer to the one that […]

When does technical work to reduce AGI conflict make a difference?: Introduction

12 October 2022 by Jesse Clifton, Samuel Martin and Anthony DiGiovanni

This is a pared-down version of a longer draft report. We went with a more concise version to get it out faster, so it ended up being more of an overview of definitions and concepts, and is thin on concrete examples and details. Hopefully subsequent work will help fill those gaps. ContentsSequence SummaryNecessary Conditions for Technical Work on AGI Conflict to Have a Counterfactual ImpactConflict isn't CostlyCapabilities aren't SufficientIntent Alignment isn't SufficientIntent Alignment isn't NecessaryNote on scopeAcknowledgmentsReferences Sequence Summary Some researchers are focused on reducing the risks of conflict between AGIs. In this sequence, we’ll present several necessary conditions for technical work on AGI conflict reduction to be effective, and survey circumstances under which these conditions hold. We’ll also present […]

Taboo "Outside View"

17 June 2021 by Daniel Kokotajlo

No one has ever seen an AGI takeoff, so any attempt to understand it must use these outside view considerations —[Redacted for privacy] What? That’s exactly backwards. If we had lots of experience with past AGI takeoffs, using the outside view to predict the next one would be a lot more effective. —My reaction Two years ago I wrote a deep-dive summary of Superforecasting and the associated scientific literature. I learned about the “Outside view” / “Inside view” distinction, and the evidence supporting it. At the time I was excited about the concept and wrote: “...I think we should do our best to imitate these best-practices, and that means using the outside view far more than we would naturally be inclined.” Now that I […]

Case studies of self-governance to reduce technology risk

6 April 2021 by Jia Yuan Loke

Summary Self-governance occurs when private actors coordinate to address issues that are not obviously related to profit, with minimal involvement from governments and standards bodies. Historical cases of self-governance to reduce technology risk are rare. I find 6 cases that seem somewhat similar to AI development, including the actions of Leo Szilard and other physicists in 1939 and the 1975 Asilomar conference. The following factors seem to make self-governance efforts more likely to occur: Risks are salient The government looks like it might step in if private actors do nothing The field or industry is small Support from gatekeepers (like journals and large consumer-facing firms) Support from credentialed scientists. After the initial self-governance effort, governments usually step in to develop […]

Coordination challenges for preventing AI conflict

8 March 2021 by Stefan Torges

Summary In this article, I will sketch arguments for the following claims: Transformative AI scenarios involving multiple systems pose a unique existential risk: catastrophic bargaining failure between multiple AI systems (or joint AI-human systems). This risk is not sufficiently addressed by successfully aligning those systems, and we cannot safely delegate its solution to the AI systems themselves. Developers are better positioned than more far-sighted successor agents to coordinate in a way that solves this problem, but a solution also does not seem guaranteed. Developers intent on solving this problem can choose between developing separate but compatible systems that do not engage in costly conflict or building a single joint system. While the second option seems preferable from an altruistic perspective, […]

Collaborative game specification: arriving at common models in bargaining

6 March 2021 by Jesse Clifton

Conflict is often an inefficient outcome to a bargaining problem. This is true in the sense that, for a given game-theoretic model of a strategic interaction, there is often some equilibrium in which all agents are better off than the conflict outcome. But real-world agents may not make decisions according to game-theoretic models, and when they do, they may use different models. This makes it more difficult to guarantee that real-world agents will avoid bargaining failure than is suggested by the observation that conflict is often inefficient. In another post, I described the "prior selection problem", on which different agents having different models of their situation can lead to bargaining failure. Moreover, techniques for addressing bargaining problems like coordination on […]

Weak identifiability and its consequences in strategic settings

13 February 2021 by Jesse Clifton

One way that agents might become involved in catastrophic conflict is if they have mistaken beliefs about one another. Maybe I think you are bluffing when you threaten to launch the nukes, but you are dead serious. So we should understand why agents might sometimes have such mistaken beliefs. In this post I'll discuss one obstacle to the formation of accurate beliefs about other agents, which has to do with identifiability. As with my post on equilibrium and prior selection problems, this is a theme that keeps cropping up in my thinking about AI cooperation and conflict, so I thought it might be helpful to have it written up. We say that a model is unidentifiable if there are several […]

Birds, Brains, Planes, and AI: Against Appeals to the Complexity / Mysteriousness / Efficiency of the Brain

18 January 2021 by Daniel Kokotajlo

[Epistemic status: Strong opinions lightly held, this time with a cool graph.] I argue that an entire class of common arguments against short timelines is bogus, and provide weak evidence that anchoring to the human-brain-human-lifetime milestone is reasonable. In a sentence, my argument is that the complexity and mysteriousness and efficiency of the human brain (compared to artificial neural nets) is almost zero evidence that building TAI will be difficult, because evolution typically makes things complex and mysterious and efficient, even when there are simple, easily understood, inefficient designs that work almost as well (or even better!) for human purposes. In slogan form: If all we had to do to get TAI was make a simple neural net 10x the […]

Against GDP as a metric for AI timelines and takeoff speeds

30 December 2020 by Daniel Kokotajlo

Or: Why AI Takeover Might Happen Before GDP Accelerates, and Other Thoughts On What Matters for Timelines and Takeoff Speeds I think world GDP (and economic growth more generally) is overrated as a metric for AI timelines and takeoff speeds. Here are some uses of GDP that I disagree with, or at least think should be accompanied by cautionary notes: Timelines: Ajeya Cotra thinks of transformative AI as “software which causes a tenfold acceleration in the rate of growth of the world economy (assuming that it is used everywhere that it would be economically profitable to use it).” I don’t mean to single her out in particular; this seems like the standard definition now. Takeoff Speeds: Paul Christiano argues for […]

Incentivizing forecasting via social media

16 December 2020 by Daniel Kokotajlo

Summary Most people will probably never participate on existing forecasting platforms which limits their effects on mainstream institutions and public discourse. Changes to the user interface and recommendation algorithms of social media platforms might incentivize forecasting and lead to its more widespread adoption. Broadly, we envision i) automatically suggesting questions of likely interest to the user—e.g., questions related to the user’s current post or trending topics—and ii) rewarding users with higher than average forecasting accuracy with increased visibility. In a best case scenario, such forecasting-incentivizing features might have various positive consequences such as increasing society’s shared sense of reality and the quality of public discourse, while reducing polarization and the spread of misinformation. Facebook’s Forecast could be seen as one […]

Commitment ability in multipolar AI scenarios

5 December 2020 by Anni Leskelä

ContentsAbstractIntroductionPotential approaches to commitment between AI systemsConclusions and further notesAcknowledgements Abstract The ability to make credible commitments is a key factor in many bargaining situations ranging from trade to international conflict. This post builds a taxonomy of the commitment mechanisms that transformative AI (TAI) systems could use in future multipolar scenarios, describes various issues they have in practice, and draws some tentative conclusions about the landscape of commitments we might expect in the future. Introduction A better understanding of the commitments that future AI systems can make is helpful for predicting and influencing the dynamics of multipolar scenarios. The option to credibly bind oneself to certain actions or strategies fundamentally changes the game theory behind bargaining, cooperation, and conflict. Credible […]

Persuasion Tools: AI takeover without takeoff or agency?

21 November 2020 by Daniel Kokotajlo

[epistemic status: speculation] I'm envisioning that in the future there will also be systems where you can input any conclusion that you want to argue (including moral conclusions) and the target audience, and the system will give you the most convincing arguments for it. At that point people won't be able to participate in any online (or offline for that matter) discussions without risking their object-level values being hijacked. --Wei Dai What if most people already live in that world? A world in which taking arguments at face value is not a capacity-enhancing tool, but a security vulnerability? Without trusted filters, would they not dismiss highfalutin arguments out of hand, and focus on whether the person making the argument seems […]

How Roodman's GWP model translates to TAI timelines

16 November 2020 by Daniel Kokotajlo

How does David Roodman’s world GDP model translate to TAI timelines? Now, before I go any further, let me be the first to say that I don’t think we should use this model to predict TAI. This model takes a very broad outside view and is thus inferior to models like Ajeya Cotra’s which make use of more relevant information. (However, it is still useful for rebutting claims that TAI is unprecedented, inconsistent with historical trends, low-prior, etc.) Nevertheless, out of curiosity I thought I’d calculate what the model implies for TAI timelines. Here is the projection made by Roodman’s model. The red line is real historic GWP data; the splay of grey shades that continues it is the splay […]

The date of AI Takeover is not the day the AI takes over

22 October 2020 by Daniel Kokotajlo

Instead, it’s the point of no return—the day we AI risk reducers lose the ability to significantly reduce AI risk. This might happen years before classic milestones like “World GWP doubles in four years” and “Superhuman AGI is deployed." The rest of this post explains, justifies, and expands on this obvious but underappreciated idea. (Toby Ord appreciates it; see quote below). I found myself explaining it repeatedly, so I wrote this post as a reference. AI timelines often come up in career planning conversations. Insofar as AI timelines are short, career plans which take a long time to pay off are a bad idea, because by the time you reap the benefits of the plans it may already be too […]

Reducing long-term risks from malevolent actors

7 July 2020 by David Althaus and Tobias Baumann

Summary Dictators who exhibited highly narcissistic, psychopathic, or sadistic traits were involved in some of the greatest catastrophes in human history. Malevolent individuals in positions of power could negatively affect humanity’s long-term trajectory by, for example, exacerbating international conflict or other broad risk factors. Malevolent humans with access to advanced technology—such as whole brain emulation or other forms of transformative AI—could cause serious existential risks and suffering risks. We therefore consider interventions to reduce the expected influence of malevolent humans on the long-term future. The development of manipulation-proof measures of malevolence seems valuable, since they could be used to screen for malevolent humans in high-impact settings, such as heads of government or CEOs. We also explore possible future technologies that […]

Risk factors for s-risks

22 February 2019 by Tobias Baumann

Traditional disaster risk prevention has a concept of risk factors. These factors are not risks in and of themselves, but they increase either the probability or the magnitude of a risk. For instance, inadequate governance structures do not cause a specific disaster, but if a disaster strikes it may impede an effective response, thus increasing the damage. Rather than considering individual scenarios of how s-risks could occur, which tends to be highly speculative, this post instead looks at risk factors – i.e. factors that would make s-risks more likely or more severe.

Challenges to implementing surrogate goals

3 July 2018 by Tobias Baumann

Surrogate goals might be one of the most promising approaches to reduce (the disvalue resulting from) threats. The idea is to add to one’s current goals a surrogate goal that one did not initially care about, hoping that any potential threats will target this surrogate goal rather than what one initially cared about. In this post, I will outline two key obstacles to a successful implementation of surrogate goals.

A framework for thinking about AI timescales

29 March 2018 by Tobias Baumann

To steer the development of powerful AI in beneficial directions, we need an accurate understanding of how the transition to a world with powerful AI systems will unfold. A key question is how long such a transition (or “takeoff”) will take.

Commenting on MSR, Part 2: Cooperation heuristics

1 March 2018 by Lukas Gloor

Published on the CLR blog, where researchers are free to explore their own ideas on how humanity can best reduce suffering. (more) Summary This post was originally written for internal discussions only; it is half-baked and unpolished. The post assumes familiarity with the ideas discussed in Caspar Oesterheld’s paper Multiverse-wide cooperation via coordinated decision-making. I wrote a short introduction to multiverse-wide cooperation in an earlier post (but I still recommend reading parts of Caspar’s original paper, or this more advanced introduction, because several of the points that follow below build on topics not covered in my introduction). With that out of the way: In this post, I will comment on what I think might be interesting aspects of multiverse-wide cooperation […]

Using surrogate goals to deflect threats

20 February 2018 by Tobias Baumann

Agents that threaten to harm other agents, either in an attempt at extortion or as part of an escalating conflict, are an important form of agential s-risks. To avoid worst-case outcomes resulting from the execution of such threats, I suggest that agents add a “meaningless” surrogate goal to their utility function.

Self-improvement races

14 November 2017 by Caspar Oesterheld

Just like human factions may race toward AI and thus risk misalignment, AIs may race toward superior abilities by self-improving themselves in risky ways.

Commenting on MSR, Part 1: Multiverse-wide cooperation in a nutshell

2 November 2017 by Lukas Gloor

Published on the CLR blog, where researchers are free to explore their own ideas on how humanity can best reduce suffering. (more) This is a post I wrote about Caspar Oesterheld’s long paper Multiverse-wide cooperation via coordinated decision-making. Because I have found the idea tricky to explain – which unfortunately makes it difficult to get feedback from others on whether the thinking behind it makes sense – I decided to write a shorter summary. While I am hoping that my text can serve as a standalone piece, for additional introductory content I also recommend reading the beginning of Caspar’s paper, or watching the short video introduction here (requires basic knowledge of the “CDT, EDT or something else” debate in decision […]

S-risk FAQ

21 September 2017 by Tobias Baumann

In the essay Reducing Risks of Astronomical Suffering: A Neglected Priority, s-risks (also called suffering risks or risks of astronomical suffering) are defined as “events that would bring about suffering on an astronomical scale, vastly exceeding all suffering that has existed on Earth so far”.

Focus areas of worst-case AI safety

18 September 2017 by Tobias Baumann

Efforts to shape advanced artificial intelligence (AI) may be among the most promising altruistic endeavours. If the transition to advanced AI goes wrong, the worst outcomes may involve not only the end of human civilization, but also astronomical amounts of suffering – a so-called s-risk.

A reply to Thomas Metzinger’s BAAN thought experiment

10 August 2017 by Lukas Gloor

Published on the CLR blog, where researchers are free to explore their own ideas on how humanity can best reduce suffering. (more) This is a reply to Metzinger’s essay on Benevolent Artificial Anti-natalism (BAAN), which appeared on EDGE.org (7.8.2017). Metzinger invites us to consider a hypothetical scenario where smarter-than-human artificial intelligence (AI) is built with the goal of assisting us with ethical deliberation. Being superior to us in its understanding of how our own minds function, the envisioned AI could come to a deeper understanding of our values than we may be able to arrive at ourselves. Metzinger has us envision that this artificial super-ethicist comes to conclude that biological existence – at least in its current form – is […]

Uncertainty smooths out differences in impact

21 July 2017 by Tobias Baumann

Suppose you investigated two interventions A and B and came up with estimates for how much impact A and B will have. Your best guess is that A will spare a billion sentient beings from suffering, while B “only” spares a thousand beings. Now, should you actually believe that A is many orders of magnitude more effective than B?

Arguments for and against moral advocacy

17 July 2017 by Tobias Baumann

This post analyses key strategic questions on moral advocacy, such as: What does moral advocacy look like in practice? Which values should we spread, and how? How effective is moral advocacy compared to other interventions such as directly influencing new technologies? What are the most important arguments for and against focusing on moral advocacy?

Strategic implications of AI scenarios

30 June 2017 by Tobias Baumann

Efforts to mitigate the risks of advanced artificial intelligence may be a top priority for effective altruists. If this is true, what are the best means to shape AI? Should we write math-heavy papers on open technical questions, or opt for broader, non-technical interventions like values spreading?

Tool use and intelligence: A conversation

26 June 2017 by Tobias Baumann

This post is a discussion between Lukas Gloor and Tobias Baumann on the meaning of tool use and intelligence, which is relevant to our thinking about the future or (artificial) intelligence and the likelihood of AI scenarios.

Training neural networks to detect suffering

20 June 2017 by Tobias Baumann

Imagine a data set of images labeled “suffering” or “no suffering”. For instance, suppose the “suffering” category contains documentations of war atrocities or factory farms, and the “no suffering” category contains innocuous images – say, a library. We could then use a neural network or other machine learning algorithms to learn to detect suffering based on that data.

S-risks: Why they are the worst existential risks, and how to prevent them (EAG Boston 2017)

20 June 2017 by Max Daniel

This post is based on notes for a talk I gave at EAG Boston 2017. I talk about risks of severe suffering in the far future, or s-risks. Reducing these risks is the main focus of the Foundational Research Institute, the EA research group that I represent.

Launching the FRI blog

19 June 2017 by Max Daniel

We were moved by the many good reasons to make conversations public. At the same time, we felt the content we wanted to publish differed from the articles on our main site. Hence, we're happy to announce the launch of FRI’s new blog.

Identifying Plausible Paths to Impact and their Strategic Implications

14 August 2016 by Lukas Gloor

FRI’s research seeks to identify the best intervention(s) for suffering reducers to work on. Rather than continuing our research indefinitely, we will eventually have to focus our efforts on an intervention directly targeted at improving the world. This report outlines plausible candidates for FRI’s “path to impact” and distills some advice on how current movement building efforts can best prepare for them.

Our Mission

7 June 2016 by Lukas Gloor

This is a snapshot of the Center on Long-Term Risk’s (formerly Foundational Research Institute) previous "Our Mission" page. The Foundational Research Institute (FRI) conducts research on how to best reduce the suffering of sentient beings in the long-term future. We publish essays and academic articles, make grants to support research on our priorities, and advise individuals and policymakers. Our focus is on exploring effective, robust and cooperative strategies to avoid risks of dystopian futures and working toward a future guided by careful ethical reflection. Our scope ranges from foundational questions about ethics, consciousness and game theory to policy implications for global cooperation or AI safety. ContentsReflectiveness, values and technologyDealing with uncertaintyOur researchOur valuesPartners and affiliationsHow to get involved Reflectiveness, values […]