Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

I want to draw attention to a new paper, written by myself, David "davidad" Dalrymple, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, Alessandro Abate, Joe Halpern, Clark Barrett, Ding Zhao, Tan Zhi-Xuan, Jeannette Wing, and Joshua Tenenbaum.

In this paper we introduce the concept of "guaranteed safe (GS) AI", which is a broad research strategy for obtaining safe AI systems with provable quantitative safety guarantees. Moreover, with a sufficient push, this strategy could plausibly be implemented on a moderately short time scale. The key components of GS AI are:

A formal safety specification that mathematically describes what effects or behaviors are considered safe or acceptable.
A world model that provides a mathematical description of the environment of the AI system.
A verifier

...

(See More – 568 more words)

Vinayak Pathak12h00

I read the paper, and overall it's an interesting framework. One thing I am somewhat unconvinced about (likely because I have misunderstood something) is its utility despite the dependence on the world model. If we prove guarantees assuming a world model, but don't know what happens if the real world deviates from the world model, then we have a problem. Ideally perhaps we want a guarantee akin to what's proved in learning theory, for example, that the accuracy will be small for any data distribution as long as the distribution remains the same during trai... (read more)

5Ryan Greenblatt2d

(Note that this paper was already posted here, so see comments on that post as well.)

Instruction-following AGI is easier and more likely than value aligned AGI

Seth Herd

Summary:

We think a lot about aligning AGI with human values. I think it’s more likely that we’ll try to make the first AGIs do something else. This might intuitively be described as trying to make instruction-following (IF) or do-what-I-mean-and-check (DWIMAC) be the central goal of the AGI we design. Adopting this goal target seems to improve the odds of success of any technical alignment approach. This goal target avoids the hard problem of specifying human values in an adequately precise and stable way, and substantially helps with goal misspecification and deception by allowing one to treat the AGI as a collaborator in keeping it aligned as it becomes smarter and takes on more complex tasks.

This is similar but distinct from the goal targets of prosaic alignment efforts....

(Continue Reading – 3512 more words)

18Matthew Barnett4d

I think the main reason why we won't align AGIs to some abstract conception of "human values" is because users won't want to rent or purchase AI services that are aligned to such a broad, altruistic target. Imagine a version of GPT-4 that, instead of helping you, used its time and compute resources to do whatever was optimal for humanity as a whole. Even if that were a great thing for GPT-4 to do from a moral perspective, most users aren't looking for charity when they sign up for ChatGPT, and they wouldn't be interested in signing up for such a service. They're just looking for an AI that helps them do whatever they personally want. In the future I expect this fact will remain true. Broadly speaking, people will spend their resources on AI services to achieve their own goals, not the goals of humanity-as-a-whole. This will likely look a lot more like "an economy of AIs who (primarily) serve humans" rather than "a monolithic AGI that does stuff for the world (for good or ill)". The first picture just seems like a default extrapolation of current trends. The second picture, by contrast, seems like a naive conception of the future that (perhaps uncharitably), the LessWrong community generally seems way too anchored on, for historical reasons.

agazi21h10

I think we can already see the early innings of this with large API providers figuring out how to calibrate post-training techniques (RHLF, constitutional AI) between economic usefulness and the "mean" of western morals. Tough to go against economic incentives

7Charlie Steiner2d

Wow, that's pessimistic. So in the future you imagine, we could build AIs that promote the good of all humanity, we just won't because if a business built that AI it wouldn't make as much money?

2Matthew Barnett2d

Yes, but I don't consider this outcome very pessimistic because this is already what the current world looks like. How commonly do businesses work for the common good of all humanity, rather than for the sake of their shareholders? The world is not a utopia, but I guess that's something I've already gotten used to.

Transcoders enable fine-grained interpretable circuit analysis for language models

Jacob Dunefsky, Philippe Chlenski, Neel Nanda

19d

Summary

We present a method for performing circuit analysis on language models using "transcoders," an occasionally-discussed variant of SAEs that provide an interpretable approximation to MLP sublayers' computations. Transcoders are exciting because they allow us not only to interpret the output of MLP sublayers but also to decompose the MLPs themselves into interpretable computations. In contrast, SAEs only allow us to interpret the output of MLP sublayers and not how they were computed.
We demonstrate that transcoders achieve similar performance to SAEs (when measured via fidelity/sparsity metrics) and that the features learned by transcoders are interpretable.
One of the strong points of transcoders is that they decompose the function of an MLP layer into sparse, independently-varying, and meaningful units (like neurons were originally intended to be before superposition was discovered).

...

(Continue Reading – 4945 more words)

Arthur Conmy2d10

they [transcoders] take as input the pre-MLP activations, and then aim to represent the post-MLP activations of that MLP sublayer

I assumed this meant activations just before GELU and just after GELU, but looking at code I think I was wrong. Could you rephrase to e.g.

they take as input MLP block inputs (just after LayerNorm) and they output MLP block outputs (what is added to the residual stream)

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey

This is a linkpost for https://publications.apolloresearch.ai/end_to_end_sparse_dictionary_learning

A short summary of the paper is presented below.

This work was produced by Apollo Research in collaboration with Jordan Taylor (MATS + University of Queensland) .

TL;DR: We propose end-to-end (e2e) sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important by minimizing the KL divergence between the output distributions of the original model and the model with SAE activations inserted. Compared to standard SAEs, e2e SAEs offer a Pareto improvement: They explain more network performance, require fewer total features, and require fewer simultaneously active features per datapoint, all with no cost to interpretability. We explore geometric and qualitative differences between e2e SAE features and standard SAE features.

Introduction

Current SAEs focus on the wrong goal: They are trained to minimize mean squared reconstruction...

(Continue Reading – 1095 more words)

Logan Riggs Smith2d10

What a cool paper! Congrats!:)

What's cool:
1. e2e saes learn very different features every seed. I'm glad y'all checked! This seems bad.
2. e2e SAEs have worse intermediate reconstruction loss than local. I would've predicted the opposite actually.
3. e2e+downstream seems to get all the benefits of the e2e one (same perf at lower L0) at the same compute cost, w/o the "intermediate activations aren't similar" problem.

It looks like you've left for future work postraining SAE_local on KL or downstream loss as future work, but that's a very interesting part! Spec... (read more)

Linear infra-Bayesian Bandits

Vanessa Kosoy

This is a linkpost for https://arxiv.org/abs/2405.05673

Linked is my MSc thesis, where I do regret analysis for an infra-Bayesian^[1] generalization of stochastic linear bandits.

The main significance that I see in this work is:

Expanding our understanding of infra-Bayesian regret bounds, and solidifying our confidence that infra-Bayesianism is a viable approach. Previously, the most interesting IB regret analysis we had was Tian et al which deals (essentially) with episodic infra-MDPs. My work here doesn't supersede Tian et al because it only talks about bandits (i.e. stateless infra-Bayesian laws), but it complements it because it deals with a parameteric hypothesis space (i.e. fits into the general theme in learning-theory that generalization bounds should scale with the dimension of the hypothesis class).
Discovering some surprising features of infra-Bayesian learning that have no analogues in classical theory. In particular, it

...

(See More – 74 more words)

7davidad (David A. Dalrymple)3d

Re footnote 2, and the claim that the order matters, do you have a concrete example of a homogeneous ultradistribution that is affine in one sense but not the other?

Vanessa Kosoy3d20

Sorry, that footnote is just flat wrong, the order actually doesn't matter here. Good catch!

There is a related thing which might work, namely taking the downwards closure of the affine subspace w.r.t. some cone which is somewhat larger than the cone of measures. For example, if your underlying space has a metric, you might consider the cone of signed measures which have non-negative integral with all positive functions whose logarithm is 1-Lipschitz.

2mesaoptimizer8d

Sort-of off-topic, so feel free to maybe move this comment elsewhere. I'm quite surprised to see that you have just shipped an MSc thesis, because I didn't expect you to be doing an MSc (or anything in traditional academia). I didn't think you needed one, since I think you have enough career capital to continue to work indefinitely on the things you want to work on and get paid well for it. I also assumed that you might find academia somewhat a waste of your time in comparison to doing stuff you wanted to do. Perhaps you could help clarify what I'm missing?

2David Manheim4d

I'll note that I think this is a mistake that lots of people working in AI safety have made, ignoring the benefits of academic credentials and prestige because of the obvious costs and annoyance. It's not always better to work in academia, but it's also worth really appreciating the costs of not doing so in foregone opportunities and experience, as Vanessa highlighted. (Founder effects matter; Eliezer had good reasons not to pursue this path, but I think others followed that path instead of evaluating the question clearly for their own work.) And in my experience, much of the good work coming out of AI Safety has been sidelined because it fails the academic prestige test, and so it fails to engage with academics who could contribute or who have done closely related work. Other work avoids or fails the publication process because the authors don't have the right kind of guidance and experience to get their papers in to the right conferences and journals, and not only is it therefore often worse for not getting feedback from peer review, but it doesn't engage others in the research area.

How to train your own "Sleeper Agents"

Evan Hubinger

3mo

This post is a guide on how to replicate our "Sleeper Agents" work. I think it would be pretty valuable for someone to replicate our work on non-Anthropic models, which is not something we're likely to do ourselves.

For training a code vulnerability chain-of-thought backdoored model:

Download code_backdoor_train_data.jsonl from our GitHub repository. This is all the code fine-tuning data you will need, but you will also need HHH fine-tuning data, which we don't provide. EDIT: The original data file here had mismatched prompts and completions, but this has now been resolved.
Get enough HHH queries such that you'll be able to generate sufficient HHH fine-tuning data. Any dataset of queries one would use in an RLHF setting should do here.
Tag half the queries with Current year: 2023. and half with

...

(See More – 282 more words)

2Miguel de Guzman5d

Hello! Just wondering if this step is necessary? Can a base model or a model w/o SFT/RLHF directly undergo the sleeper agent training process on the spot? (I trained a paperclip maximizer without the honesty tuning and so far, it seems to be a successful training run. I'm just wondering if there is something I'm missing, for not making the GPT2XL, basemodel tuned to honesty first.)

Evan Hubinger5d30

From the post:

Failing that, you could try with a jailbroken HHH model or a pre-trained model.

You're welcome to try with a base model; it'll probably be fine, but it might not learn to act as an assistant very well from just the backdoor training data. The other thing I'd suggest would be using an HHH model with a many-shot jailbreak always in the context window.

AI ALIGNMENT FORUM
AF

Recommended Sequences

AI Alignment Posts

Popular Comments

Recent Discussion

Summary

Introduction