Modified: April 07, 2022
safe objective
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Language is a really natural way to tell AI systems what we want them to do. Some current examples:
- GPT-3 and successors (InstructGPT, etc.) generate text given a prompt.
- DALL-E generates images from a prompt.
- Codex writes code, given a prompt.
These are mostly trained with predictive objectives. For example, Codex sees lots of GitHub code with comments, and is trained to predict the code that will follow a given comment.
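To make "predictive objective" concrete, here's a minimal sketch. The model, vocabulary, and data are toy stand-ins (this is not how any of these systems is actually trained at scale); the point is only that the loss pushes the model's distribution toward whatever token actually came next.

```python
# Toy sketch of a next-token prediction objective. The "model" and data are
# stand-ins, not any real system's setup.
import torch
import torch.nn.functional as F

vocab_size = 100
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 32),   # token -> vector
    torch.nn.Linear(32, vocab_size),      # vector -> logits over the next token
)

tokens = torch.randint(0, vocab_size, (1, 16))   # a fake token sequence
logits = model(tokens[:, :-1])                   # predict token t+1 from token t
loss = F.cross_entropy(logits.reshape(-1, vocab_size),
                       tokens[:, 1:].reshape(-1))
loss.backward()   # training just nudges predictions toward the observed continuation
```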
Are predictive objectives safe?
Predictive objectives feel more or less 'safe'. The system isn't trying to generate new behavior; it won't systematically do things you don't expect.
That said, prediction is fundamental; ultimately it contains everything. Internal optimization and planning mechanisms might be emergent capabilities of a sufficiently advanced predictive system. Language is often generated by humans engaging in goal-directed thinking, so ultimately the best predictive systems will learn to mimic this.
You could imagine prompting GPT-5 with "This is the manifesto written by Joe EvilDoer, an incredibly smart and evil high-school student, detailing the plan he ultimately used to take over the world:", and that in that context it might actually produce a workable plan. Such a system would be potentially dangerous, and we would need to think carefully about abuse. But it's a tool AI, not an 'agent AI'; it won't by itself become a paperclip maximizer.
Human feedback
There's already some evidence that we can improve language models quite a lot by steering them with human feedback.
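One common version of this (roughly the InstructGPT recipe) trains a reward model from pairwise rater preferences and then fine-tunes the language model against it. Here's a minimal sketch of just the preference-learning step; the embeddings and the linear reward model are toy stand-ins, not any particular implementation.

```python
# Toy sketch of learning a reward model from pairwise human preferences
# (a Bradley-Terry-style loss): the output the rater preferred should score
# higher. Real setups typically initialize the reward model from the
# language model itself.
import torch
import torch.nn.functional as F

reward_model = torch.nn.Linear(64, 1)      # embedding of an output -> scalar reward

emb_preferred = torch.randn(8, 64)         # outputs raters preferred
emb_rejected = torch.randn(8, 64)          # outputs raters rejected

loss = -F.logsigmoid(reward_model(emb_preferred)
                     - reward_model(emb_rejected)).mean()
loss.backward()
# The language model is then tuned (e.g. with RL) to produce outputs that
# this learned reward model scores highly.
```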
Ways in which you could imagine this going wrong:
- The system has the ability to take actions that change human preferences (or at least the expression of those preferences). For example, a language model that produces a well-written novel that subtly guides its reader towards repugnant moral conclusions.
- The system has the ability to take actions that somehow seize or corrupt its reward channel. For example, it produces some crazy argument (Roko's Basilisk-style, or maybe more along the lines of 'ethical treatment of reinforcement learners') that the user should hardwire its reward channel to allow it to wirehead.
These are, debatably, pitfalls of modern adtech and recommender systems, which are essentially trained on human-preference objectives. Systems are incentivized to manipulate people into engaging in ways that don't reflect our best or highest selves.
However, the setup of these systems is a bit different from the 'steering' setup, where a group of human raters are explicitly asked to give responses that reflect their preferences. Users of ad and recommender systems are never just trying to provide information to train a system; they're themselves hoping to accomplish some goal, and the feedback is incidental to that broader goal. The system treats a click on a post as feedback that the post was a 'good' recommendation, but I suspect that if you asked people for explicit feedback you would get different answers.
I want to make some sort of observation here about the meta-level shape of machine learning. There is something about the language modeling paradigm that makes me not scared of it, even when tuned with human feedback. Is it:
- the system doesn't "know" that it's playing a repeated game: it is incentivized to maximize reward in this trajectory, and would never sacrifice this reward in pursuit of long-term goals. The outer-loop SGD updates are trying to maximize future reward, in a sense, but this is a fixed and well-understood mechanism. (A toy sketch of this distinction follows below.)
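Here's a toy illustration of that point, a REINFORCE-style bandit rather than any particular RLHF implementation: the per-trajectory objective only ever references this episode's reward, and the cross-episode structure lives entirely in the outer SGD loop.

```python
# Toy sketch of trajectory-level incentives: the update for one sampled
# trajectory only involves that trajectory's reward. Nothing in the
# objective asks the policy to trade reward now for reward in later
# episodes; the "repeated game" exists only in the outer SGD loop.
import torch

policy_logits = torch.zeros(4, requires_grad=True)    # toy policy over 4 actions
optimizer = torch.optim.SGD([policy_logits], lr=0.1)

for episode in range(100):                             # the outer loop: fixed, well-understood SGD
    dist = torch.distributions.Categorical(logits=policy_logits)
    action = dist.sample()
    reward = 1.0 if action.item() == 2 else 0.0        # pretend raters reward action 2
    loss = -dist.log_prob(action) * reward             # objective sees only THIS episode's reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```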