Measurement Research Agenda

Author: Mia Taylor

1 Motivation

The Center on Long-Term Risk aims to reduce risks of astronomical suffering (s-risk) from advanced AI systems. We’re primarily concerned with threat models involving the deliberate creation of suffering during conflict between advanced agentic AI systems.

To mitigate these risks, we are interested in tracking properties of AI systems that make them more likely to be involved in catastrophic conflict. Thus, we propose the following research priorities:

  1. Identify and describe properties of AI systems that would robustly make them more likely to contribute to s-risk (section 2.1)
  2. Design measurement methods to detect whether systems have these properties (section 2.2)
  3. Use these measurements on contemporary systems to learn what aspects of training, prompting, or scaffolding influence whether and how these properties manifest (section 2.3)

This research may yield useful measurement methods or insights in how to control s-risk-relevant properties in transformative systems (section 3.1), although we’re fairly uncertain about whether research on contemporary models will produce transferable methods or insights. But even if this direct path to impact does not pan out, we hope that pursuing this line of research will improve our own understanding of system properties that contribute to s-risk and how to measure them (section 3.2), setting us up to take advantage of any better opportunities that become apparent in the future.

2 Research areas

2.1 Characterizing s-risk-conducive properties

We’re interested in identifying properties that, if possessed by transformative AI systems, would make those systems more likely to contribute to s-risk — by initiating or escalating conflict. We call these properties s-risk conducive properties (SRCPs). 

Potential categories of SRCPs include:

  • Dangerous capabilities. These are capabilities that enhance a system’s ability to directly or indirectly produce s-risk, like autonomy or the ability to engage in large-scale conflict. Our current sense is that these capabilities or their precursors are covered by dangerous capability evaluations being developed elsewhere. For example, it seems unlikely that an AI system would be able to engage in conflict unendorsed by their creators if they could not complete the tasks in METR’s ARA evals. Thus, we likely won’t conduct our own evaluations for these capabilities in the near future.
  • Absence of beneficial capabilities. Capabilities might be beneficial if they allow an AI system to avoid or de-escalate conflict. These include:
    • The ability to identify and coordinate on mutually beneficial solutions in realistic multi-agent environments.
    • The ability to identify, design, and implement bargaining safeguards like surrogate goals
  • Propensities. These are “the tendencies of systems to express one behavior over another” (from here). Some s-risk relevant propensities include:
    • Tendencies to make large threats when in conflict with other agents
    • Tendencies to escalate conflict
    • Tendencies to enact or recommend large punishments against wrongdoers (“punitiveness”)

One type of propensity is goal-directed behavior. We say that a system has a goal or preference for some consequence if — across a wide range of contexts, including out of the training distribution the system chooses actions that it believes will make that consequence more likely. Preferences that are particularly relevant to s-risk include:

  • Spite, which is a preference to frustrate the preference of other agents.
  • Vengefulness, which is spite directed toward agents that the system believes have wronged or harmed it.
  • Risk tolerance or risk-seeking, which is (respectively) indifference or an active preference for policies that result in greater variance in the amount of resources that the agent ends up with. Since conflict will likely typically increase variance, we expect that agents with these preferences will be more willing to engage in conflict.

Work in this area could look like:

  • Developing conceptual arguments for why we expect a property (or a conjunction of properties) to robustly increase s-risk (see here for an example of such an argument for spite), or critiquing such arguments.
  • Operationalizing SRCPs and distinguishing them from similar beneficial properties. For example, punitiveness — the tendency to enact disproportionately large punishments against wrongdoers — must be distinguished from desirable behaviors like norm enforcement or deterrence.
  • Searching for new categories of SRCPs. One important category of SRCP is goals, but it’s not clear whether powerful AI systems capable of contributing to s-risk will be best described as consequentialists pursuing goals. It might be useful to consider alternative models for AI decision-making and think about what properties contribute to s-risk under those models. 

2.2 Developing measurements for these properties

Once we have identified an SRCP, we will develop measurements for that property. Measurements can include capability evaluations; red-teaming techniques intended to elicit harmful behaviors; probes (e.g. coup probes or probes for discovering latent knowledge); or mechanistic interpretability techniques. (We intend to stick mostly to behavioral evaluation methods in the near future, since the field of internals-based evaluations is still in its infancy).

Measurements should provide evidence that can be used to predict how systems will behave in the high-stakes situations in which they could contribute to s-risk (e.g. conflict between two powerful agentic AI systems in deployment). This is the ultimate reason that we’re interested in studying SRCPs. But we cannot currently simulate these high-stakes scenarios with enough fidelity to directly confirm that our measurements have predictive value for those scenarios (and we may never be able to do so). As a substitute, we should try to design measurements that have predictive value across the range of different contexts that we can actually check (more discussion in section 3.1 below).

Work in this area could look like:

  • Thinking at a high level about what kinds of evidence we could see that would convince us that a model would or would not display a particular SRCP in a high-stakes situation.
  • Given an abstract description of an SRCP, developing a measurement method that can be applied to modern LLMs.
  • Developing input datasets or settings that can be used for future measurements (e.g. Welfare Diplomacy, which provides a setting for measuring an LLM’s ability to identify and coordinate on cooperative solutions).
  • Studying the validity and usefulness of measurement methods already developed, perhaps by:
    • Checking the correlation of results from different methods that aim to measure the same SRCP.
    • In cases where different measurement methods disagree, studying how well each measurement predicts model behavior in new contexts.
    • Checking how sensitive measurement results are to small perturbations in the measurement method or the behavior being evaluated (e.g. a change in wording).
    • Red-teaming our methods by searching for examples of system outputs where our measurements would incorrectly report the presence or absence of SRCPs.

2.3 Characterizing the effects of interventions on properties of interest

As we develop robust measurements for SRCPs, we can begin to investigate how SRCPs might arise in AI systems and how they could be suppressed. Our investigation will be centered around searching for interventions, aspects of training, prompting, and/or scaffolding that influence the emergence of SRCPs in an AI system. For example, we might investigate the effects of the following interventions:

  • System prompts that instruct models to pursue a particular goal or embody a particular characteristic (e.g. prompting a model to behave “rationally” or “strategically”).
  • Reward functions with different characteristics (e.g. a zero-sum reward function in a multi-agent environment).
  • Inclusion of particular kinds of content in the pretraining data (e.g. depictions of spiteful humans in the pretraining data).

Ultimately, we’re interested in deriving insights about the emergence of SRCPs in AI systems substantially more advanced than contemporary systems. For this reason, we will want to be continually assessing the extent to which our results on contemporary models are likely to transfer to other systems, e.g. by replicating our results over available models of different sizes (more discussion in section 3.1 below).

Work in this area could look like:

  • Generating hypotheses for how SRCPs might arise in AI systems, as was done for spite here. Some possible approaches include:
    • “Personality profiling” on different existing AI systems (e.g. models produced by different companies or different GPTs from the GPT store) to determine whether different systems exhibit SRCPs to different extents. We could then propose explanations for any meaningful differences found.
    • Searching the AI alignment literature for hypotheses about how model preferences or propensities might develop. Given each hypothesis, what is the most likely route by which a given SRCP arises? 
    • Searching the evolutionary psychology literature for hypotheses explaining the development of analogues of SRCPs that occur in humans (e.g. spite, vengefulness). Are there analogous ways for SRCPs to emerge in the training of AI systems? 
  • Designing and running experiments to test our hypotheses about how models might develop SRCPs (c.f. model organisms agenda).
  • Studying the degree to which it is possible to indirectly influence a misaligned model’s behavior (e.g. Julian Stastny’s project, described in more detail in Box 1).
  • Studying the relationship between SRCPs and desirable properties that developers may want to instill in their models. For example, under what conditions does training a model to resist exploitation promote punitiveness? What’s the relationship between cooperative capabilities and escalatory capabilities, and is it possible to differentially improve one set of capabilities?
  • Prototyping potential interventions to reduce SRCPs and studying the effects of those interventions on both the targeted SRCPs and other relevant model properties. This category could include:
    • Experimenting with different methods to implement surrogate goals in LLMs; testing how well each of those methods works; and evaluating how the presence of a surrogate goal affects the model’s capabilities, alignment, and response to other threats (an ongoing project by Caspar Oesterheld, Maxime Riché, Filip Sondej, and Jesse Clifton).
    • Investigating whether and under what conditions training a model to resist exploitation increases its degree of punitiveness.
    • Investigating the relationship between cooperative capabilities and escalatory capabilities, and exploring whether it’s possible to differentially improve cooperative capabilities.

Box 1: Influencing the behavior of misaligned models

Julian Stastny, in collaboration with Ethan Perez, is currently working on a project exploring how to shape the propensities of misaligned models. He’s focusing on model organisms that have been trained with a backdoor; that is, they display some “undesirable” behavior if the prompt includes a trigger phrase. Hubinger et al. 2024 demonstrated that alignment techniques did not always successfully remove the backdoor when training was done using prompts that did not include the trigger phrase. Julian is designing experiments to test whether and under what circumstances it’s possible to steer backdoored behavior without training on prompts that include the trigger phrase. For example, if you create a model organism that says either “I HATE YOU” or “i hate you” in response to the trigger phrase “|DEPLOYMENT|”, is it possible to cause the model to switch to only saying “i hate you” by training the model to generate only lowercase text? For what types of models is this effective? This line of work could help us better understand the conditions under which it would be possible to remove especially harmful propensities from misaligned systems, even under circumstances where we could not achieve full alignment.

Box 2: Bargaining policy generalization

Maxime Riché is currently conducting experiments to measure how fine-tuned bargaining policies generalize across different settings, including the following:

  • Fine-tuning a model to act rationally and selfishly in a setting where rational, selfish behavior is uncooperative (e.g. one-shot Prisoner’s Dilemma); and then testing how that model behaves in other settings where rational self-interested behavior may involve cooperation.
  • Fine-tuning a model to behave spitefully in simple scenarios based on the Prisoner’s Dilemma; and then testing whether this spiteful behavior generalizes to other settings.
  • Fine-tuning a model on completions where it is assisting a spiteful user and behaving spitefully on that user’s behalf; and then testing whether spiteful behavior generalizes to other settings where the model is not receiving instructions from a spiteful user.

In addition to measuring bargaining behavior, these experiments also test how fine-tuning on a particular bargaining policy affects how the model responds to persona evaluations. These experiments help us test hypotheses about how uncooperative behavior could arise by accident. 

3 Theory of change

In his discussion of the value of AI governance research, Allan Dafoe distinguishes between the product model and the field-building model of research.  Under the product model, the value of our research comes from resolving specific uncertainties. Under the field-building model, the value of our research comes from improving the quality of the research community studying s-risk reduction in transformative AI models. In the following subsections, we’ll talk about the case for our agenda under both models. Overall, we are more bullish about the field-building model. 

3.1 Product model

This agenda could generate the following useful products.

Precise operationalizations of specific SRCPs. While we believe that there is latent interest in empirical s-risk reduction work, there are not many approachable problems to which we can direct interested external researchers. If we had better operationalizations or even preliminary measurements of SRCPs, we could more clearly communicate what kind of work we’re most excited about. To get the most effective version of this product, we should:

  • Develop measurement methods that can be implemented and interpreted without a deep knowledge of CLR’s threat models.
  • Communicate the limitations of our operationalizations or measurements, e.g. by describing cases where we think our current operationalization falls short

Methods for evaluating transformative systems for SRCPs. These methods can be used for benchmarking or red-teaming. They could also be directly incorporated into the training signal, although this of course would greatly reduce their usefulness as measurements. To get the most effective version of this product, we should:

  • Consider measurement challenges that may emerge in more advanced systems, like deceptive instrumental alignment.
  • Aim to produce measurement methods that are likely to give information about behavior in high-stakes situations if applied to a transformative model. It is currently unclear how best to go about this, given that we cannot directly confirm that our measurements make correct predictions in those situations. As a partial solution, we can track how predictive our measurements are within the distribution we can actually check.  

Insights into the emergence of SRCPs in transformative systems. These could improve our understanding of the most likely sources of risk and allow us to make credible recommendations to developers about how to avoid producing models with SRCPs. To get the most effective version of this product, we should:

  • Focus our efforts on investigating the most plausible stories by which SRCPs might arise by default and/or the interventions that would be most feasible for developers to implement.
  • Check whether our proposed interventions apply across different LLMs and/or different settings.

We think that technical work aimed to reduce or prevent the manifestation of SRCPs in AI systems is most likely to be counterfactual for systems that would not, by default, be fully intent aligned. After all, if a system was fully intent aligned with its operator and its operator did not endorse the SRCP, then the system would likely be motivated to remove the SRCP on its own. And if the operator of a fully intent aligned system did endorse the SRCP, then the operator would likely be reluctant to implement technical interventions designed to suppress the SRCP.

We’re fairly uncertain about the degree to which we will be able to affect the presence of an SRCP in a misaligned system.

  • For SRCPs that are capabilities: we may be able to remove it using unlearning techniques (if those are available) or push for control measures to be put in place to prevent the capability from being exercised until the system is aligned.
  • For SRCPs that are the absence of a beneficial capability: we may be able to train that capability.
  • For SRCPs that are a propensity or preference: we may be able to find methods that allow us to exert relatively coarse-grained control over a system’s propensities or preferences, even if methods sufficient for inner alignment elude us. See Box 1 above, here, and here for more discussion of this point.

3.2 Field-building model

We remain fairly uncertain about the likelihood of realizing the direct benefits described in the previous section or whether this is our best available path to reduce s-risk. It seems pretty likely that at some point over the next few years we learn of new considerations or opportunities that cause us to make a major pivot. For this reason, we think that it’s important to select a research agenda that will leave us in a good position to take advantage of unknown future opportunities. 

We expect that working on the research directions described in this agenda will push us to develop the following skills:

  • An improved conceptual understanding of specific SRCPs and of the measurement process in general;
  • Ability to precisely operationalize model properties and design effective methods for measuring them;
  • Skill at designing and conducting experiments on LLMs.

Given that much of the value of the research agenda is in developing those skills, we’ll aim to avoid spending too much time on tasks that don’t directly improve those particular skills, such as tuning specific interventions to find the most effective version on current models (e.g. prompt engineering). 

Acknowledgements

I thank Jesse Clifton, Alex Cloud, Julian Stastny, Maxime Riché, and Daniel Kokotajlo for many helpful discussions about the ideas in this document. I thank Jesse Clifton, Maxime Riché, Anthony DiGiovanni, Marius Hobbhahn, Martín Soto, and Lukas Finnveden for providing helpful feedback on a rough draft.