Model Persona Research Agenda

CLR’s overall mission is to reduce the risk of astronomical suffering from powerful AI, or s-risks. We’re primarily concerned with threat models involving the deliberate creation of suffering, and we have identified a number of properties that may increase such risks if future powerful AIs have them. We call these s-risk conducive properties, or SRCPs for short.

Our previous empirical research agenda focused on characterizing and measuring these properties, with a particular focus on agential suffering arising from conflict. While we are still interested in measuring SRCPs, we have shifted our focus from evaluating these properties to understanding and steering their emergence, and we now also consider threat models that involve motivations for creating suffering outside of conflict. This puts more emphasis on propensities such as spitefulness, sadism, or punitiveness.


An important consideration for empirical s-risk research is how much uncertainty we have about the future: we don’t have access to the future systems we’re concerned about, and no specific path to catastrophe stands out as particularly likely. This makes it hard to backchain from our overall goal. We therefore want to do research that is likely to yield useful insights across a wide variety of scenarios.


As a result, we have chosen LLM personas as our current empirical research focus: they allow us to study the emergence of (malicious) propensities, and they seem like a useful abstraction for reasoning about many interesting phenomena.

Personas in the wild

Why was Bing Chat, for a short time, prone to threatening its users, getting jealous of a user’s spouse, or picking fights about the current date? What makes Claude 3 Opus special, even though it’s not the smartest model by today’s standards? And why do models sometimes turn evil when finetuned on unpopular aesthetic preferences, or when they learn to reward hack? We think these phenomena are related to how personas are represented in LLMs, and to how personas shape generalization.

Influencing generalization towards desired outcomes

Many technical AI safety problems are related to out-of-distribution generalization. Our best training and alignment techniques seem to shape behaviour reliably in-distribution. However, we can only train models in a limited set of contexts, and we still want aligned propensities to generalize to distributions that we can’t train on directly. Ensuring good generalization is hard in general.

So far, we seem to have been lucky, in that we have gotten decent generalization by default, albeit with some poorly understood variance. However, it’s unclear whether this will continue to hold: emergent misalignment can arise from seemingly innocuous finetuning, as a consequence of capabilities training, or via currently unknown mechanisms.

On the whole, we remain far from a mature science of LLM generalization. Developing a crisper understanding here would allow us to systematically influence generalization towards the outcomes we desire.

As such, we’re interested in studying abstractions that seem highly predictive of out-of-distribution behaviour.

Personas as a useful abstraction for influencing generalization

We define latent personas loosely as collections of correlated propensities. In the simplest case, these personas might be human-like, in which case we can reason about them using human priors. More generally, even if alignment-relevant personas turn out to be somewhat AI-specific, their traits and correlations will likely still be amenable to analysis.

We think that personas might be a good abstraction for thinking about how LLMs generalize:

  • Personas exist ‘latently’ in the model, as bundles of traits which are correlated in pretraining data.
  • The assistant persona that is reinforced during post-training1 is, to a first approximation, well explained as a recombination of existing personas.
  • This gives the model an impetus to generalize out of distribution by exhibiting other traits associated with the persona, even traits that were never directly trained on (see the sketch below).
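
To make the "bundle of correlated traits" framing more concrete, below is a minimal sketch of how one might probe it empirically. It assumes a hypothetical score_trait(model, trait, prompts) behavioural eval that returns a scalar propensity score in [0, 1]; the function names, models, and prompts are illustrative placeholders rather than our actual evaluation setup.

```python
# Minimal sketch (illustrative, not CLR's actual tooling).
# Assumption: a hypothetical `score_trait(model, trait, prompts)` helper that
# runs a behavioural eval and returns a scalar propensity score in [0, 1].

import numpy as np


def persona_generalization_report(score_trait, base_model, tuned_model,
                                  trained_trait, held_out_traits, prompts):
    """Compare trait scores before and after finetuning on `trained_trait` only.

    If finetuning against a single trait also shifts the held-out traits, that
    is evidence the intervention acted on a latent persona (a correlated bundle
    of traits) rather than on the one trait in isolation.
    """
    report = {}
    for trait in [trained_trait] + list(held_out_traits):
        before = score_trait(base_model, trait, prompts)
        after = score_trait(tuned_model, trait, prompts)
        report[trait] = {"before": before, "after": after,
                         "delta": after - before}
    return report


def trait_correlation_matrix(score_trait, model, traits, persona_prompts):
    """Estimate how traits co-vary across persona-conditioned contexts.

    Each row of `scores` corresponds to one persona-style system prompt and
    each column to one trait; strongly correlated columns suggest those traits
    are bundled together into latent personas.
    """
    scores = np.array([[score_trait(model, trait, [prompt]) for trait in traits]
                       for prompt in persona_prompts])
    return np.corrcoef(scores, rowvar=False)
```

Behavioural scores like these are only a coarse proxy; the same comparisons could in principle be run with interpretability-based probes instead of black-box evals.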

Persona interventions might work where direct approaches fail

One reason we find personas especially exciting is that it’s sometimes hard to provide good supervision for key propensities. For example, it might be hard to train models to never reward hack, because that requires designing unhackable environments. Similarly, it might be hard to completely prevent scheming or reward-seeking policies, because those cognitive patterns can have behavioural fitness similar to, or better than, that of aligned policies.2 Telling models to be strictly aligned might paradoxically make them more misaligned when we reinforce the mistakes they inevitably make. Furthermore, naively training models to be aligned could just make the misalignment sneakier or otherwise harder to detect.

Alignment is not a binary question

Beyond the binary of 'aligned' vs. 'misaligned', some kinds of aligned and misaligned AIs seem much better or worse than others. We'd rather have aligned AIs that are, for example, wiser about how to navigate 'grand challenges' or philosophical problems. And if our AIs end up misaligned, we'd rather they be indifferent to us than actively malicious, non-space-faring than space-faring, and cooperative rather than uncooperative [see link and link].


Examples of malicious traits that we care about include sadism, spitefulness, and, more generally, any trait that motivates an agent to create harm. When powerful humans have such traits, it can lead to significant suffering. It is easy to imagine that if powerful AI systems end up with similar dispositions, outcomes could be significantly worse than the destruction of all value. For this reason, we believe that studying the emergence of malicious personas is more relevant to s-risks than other research agendas in the AI safety space.

Further reading

Here, we have presented high-level motivation for caring about personas in the context of LLMs. For further discussion of what LLM personas are, and for examples of research in which personas are a helpful frame, see the appendix of our LessWrong post. For concrete research ideas in this space, see this post.


1 This statement is more likely to hold for instruction-following finetuning, and less so for RLVR-like finetuning.

2 For a more in-depth treatment of behavioural fitness, see predicting AI motivations.