Annual Review & Fundraiser 2025
This is a brief overview of our activities in 2025 and our plans for 2026. We are hoping to fundraise $400,000 to fulfill our target budget in 2026.
Contents
About us
CLR works on addressing the worst-case risks from the development and deployment of advanced AI systems in order to reduce s-risks. Our research primarily involves thinking about how to reduce conflict and promote cooperation in interactions involving powerful AI systems. In addition to research, we conduct a range of activities aimed at building a community of people interested in s-risk reduction, and support efforts that contribute to s-risk reduction via the CLR Fund.
2025 was a year of significant transition for CLR. Jesse Clifton stepped down as Executive Director in January, succeeded by Tristan Cook and Mia Taylor as Managing Director and Research Director. Following Mia's subsequent departure in August, Tristan continues as Managing Director with Niels Warncke leading empirical research.
During this period, we clarified the focus of our empirical and conceptual research agendas: respectively, studying the emergence of undesirable personas in LLMs, and developing interventions to get AIs to use “safe Pareto improvements” to prevent catastrophic conflict. We held another annual Summer Research Fellowship and hired Daniel Tan from the program to join our empirical team.
Review of 2025
Research
Our research in 2025 fell in the following agendas:
Empirical research: AI model personas. One theme in our work this year has been Emergent Misalignment, the phenomenon that models can often generalize towards malicious personas when finetuned on demonstrations of narrow misalignment. CLR’s contribution includes the collaboration on the original emergent misalignment paper, a paper showing that emergent misalignment can arise from finetuning on demonstrations of reward hacking behavior, and a case study showing that emergent misalignment does not require the training dataset to display any misaligned behavior. We have been excited to see large interest in the AI safety community, with follow-up works by OpenAI, Anthropic, and many others.
Our interest in AI personas stems from the belief that malicious personas represent an alignment failure that is especially concerning from an s-risk perspective, and that personas provide a useful abstraction to reason about generalization. We led work on inoculation prompting, a simple technique to steer generalization towards more desirable outcomes, such as preventing emergent misalignment. Concurrent and follow-up work by Anthropic found that inoculation prompting is effective at preventing reward hacking and the emergent misalignment resulting from it.
We have also conducted research that is not yet published, focusing on training conditions that may induce spitefulness. As part of this, we first considered how goal representation in early training affects later generalization behavior, and then investigated whether RL training on constant-sum games generalizes to spitefulness. This work has been supported by grants from CAIF and the Foresight Institute.
Acausal safety and safe Pareto improvements (SPIs). We wrote distillations of previous internal work on an “overseer’s manual” for preventing high-stakes mistakes in acausal trade, for our collaborators in the acausal safety community. This included a post outlining ways in which we might want AIs to be “wiser” to avoid these high-stakes mistakes.
Both for acausal safety and mitigating downsides from AI conflict broadly, we’re excited about SPIs as an approach to bargaining. (Our understanding is that others who have thought a lot about s-risks broadly agree.) We’ve started drafting policies to propose to AI companies to make it more likely that transformative AIs use SPIs. In parallel, we’ve refined our understanding of when/why SPIs wouldn’t be used by default,1 and when interventions to promote SPIs might actually undermine SPIs.
Strategic readiness. We developed frameworks for determining when and how to robustly intervene on s-risks.2 See this memo that summarizes previous internal research disentangling aspects of what makes an intervention “robust”. Much of this research remains non-public and primarily supported our two intervention-focused agendas.
Community building
Community building was significantly affected by staff departures in 2024-2025. We maintained essential functions during the leadership transition, but broader community building activities were deprioritized. In 2025, we:
- Concluded the first iteration of the CLR Foundations Course
- Ran our fifth Summer Research Fellowship, which received record applications. We hired 4 fellows, intentionally having a small cohort to ensure high mentorship quality and hiring bar
- Continued providing career calls and one-on-one support
Plans for 2026
Research
Empirical work. The main goal of the empirical stream for 2026 is to advance the personas agenda and increase collaborations with the wider AI safety community. In pursuit of this, we plan to grow our team by 1-3 empirical researchers, and collaborate with external researchers interested in understanding and steering AI personas, including through participation in mentorship programs.
SPI. We plan to turn our current work on SPI proposals to AI companies into fully fleshed-out, concrete, practical asks. We’ll aim for lots of input on these asks from others in the s-risk and acausal safety communities, and contacts at AI companies. In parallel, we might also integrate the SPI proposals with other complementary interventions, such as getting AIs to think about open-minded decision theory.
Strategic readiness. We'll continue developing frameworks for robust s-risk interventions, with particular focus on identifying conditions under which our personas and SPI work can be safely implemented. This includes analyzing potential backfire mechanisms and monitoring which real-world developments would signal readiness for intervention. We aim to hire 1 researcher to ensure continuity in this area.
Community building
We plan to hire a Community Coordinator in 2026 to lead this work. Their focus will be on engaging community members with AI lab connections, coordinating the acausal safety research community, and identifying promising researchers for our programs and potential hires.
We'll continue our existing programs:
Donate
We're seeking $400,000 in funding to support our planned expansion in 2026 and maintain our target of 12 months of reserves. This funding will support:
- Hiring 1-3 empirical researchers to scale our AI model personas work
- Hiring 1 conceptual researcher for strategic readiness research
- Hiring a Community Coordinator
- Compute-intensive empirical research
To donate to CLR, please go to the Fundraiser page on our website. For frequently asked questions on donating to CLR, see here.
Get involved
- We accept expressions of interest in research roles on a rolling basis.
- Register your interest in a career call to discuss how to contribute to s-risk reduction.
- Sign up for notifications about our Foundations Course.
- Building on this post we published in 2024.
- Since many intuitive approaches can have unintended consequences, this work provides decision tools for evaluating whether interventions—like our personas and SPI work—will actually reduce s-risks or could make things worse.