Modified: June 07, 2021
tractable approximations to utilitarianism
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

There are three main approaches to moral philosophy:
- utilitarianism: you should feed a starving person because it will increase 'global utility': roughly, the total amount of well-being in the world.
- rule-based (deontological or 'Kantian') ethics: you should feed a starving person because doing so follows from some general moral rule like 'do unto others as you would have them do unto you'.
- virtue or 'Aristotelian' ethics: you should feed a starving person because doing so embodies some virtue, such as charity or benevolence.
When I learned about these, it seemed like utilitarianism was plainly 'correct', insofar as any of these can be correct at all (setting aside the is vs. ought dichotomy). Rules and virtues are secondary: given any well-defined notion of global utility, if we adopted a set of rules or a set of virtues such that following them tended to reduce global utility, then we would say that these are bad rules. Conversely, good rules are those such that following them tends to increase global utility.
It makes sense to see utility as the 'root' of morality, because everything else follows in principle from it (reward is enough). This makes it superior to rule-based or virtue-based systems, which don't actually give a concrete recipe for trading off between competing principles.
The main criticisms of utilitarianism are that global utility is impossible to measure (or even to define) and intractable to optimize. Rule utilitarianism acknowledges these issues: in practice, we need to come up with a manageable set of principles or rules to follow, so that we are not directly calculating utilities in our day-to-day lives, but these rules should ultimately be chosen to optimize global utility. Under this framework we could see other ethical systems as tractable approximations to utilitarianism.
Of course, every possible view on this has been proposed and debated and complexified ad infinitum in the philosophical literature over the centuries. But the modern world opens up a perspective where these questions become concrete with real consequences: how do we build AI systems that act 'morally' to increase overall human flourishing?
- We might quibble over whether this is a question worth caring about, since trolley-problem-style dilemmas are arguably a distraction in practical systems.
How could we think about implementing utilitarian ethics in AI terms?
- The framework posits that there is some utility function assigning a real-valued utility to every world state.
- We have uncertainty about both the current state of the world and the utility function itself. In other words: although we posit that a well-defined utility function exists, we recognize that we are uncertain about its proper definition (i.e., reward uncertainty).
- We might suspect that the utility function decomposes in certain ways, e.g., that global utility is the sum of individual utilities over every agent on Earth. It may not be desirable to encode this as a hard constraint, but some inductive bias in this direction will probably be necessary for learning (the first sketch after this list encodes exactly this bias).
- We'd need a prior on the utility function, and a 'likelihood': some way to learn the function from data. Theoretically, we could think about these along a few lines:
- The prior could encode inductive biases like: global utility is the sum of individual utilities, all individual utilities are on the same scale (there are no utility monsters), etc.
- We could attempt to refine this prior offline using inverse reinforcement learning on all of human history, literature, fables, and so on, assuming that most people are basically good and are trying (imperfectly) to increase global utility as they see it. (Doing this in full generality would present an enormous set of technical problems: how to represent the state of the world, how to model human irrationality, and so on.)
- We could then test and refine this prior using active learning with human feedback: in a new situation, how well do the actions of the system correspond to our 'moral intuitions'? (The second sketch below gestures at this with pairwise preference feedback.)
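To make the shape of this concrete, here is a minimal sketch of how these ingredients might fit together, under toy assumptions of my own: a world state is just a list of per-agent resources, the hypotheses about utility are a one-parameter family of diminishing-returns curves, and 'reward uncertainty' is a handful of sampled hypotheses that we average over. None of the names here (`WorldState`, `make_utility_fn`, etc.) come from any real system.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Toy stand-in for a world state: each agent's circumstances,
# reduced to a single scalar per agent (e.g. "resources held").
@dataclass
class WorldState:
    agent_resources: List[float]

# One hypothesis about utility: a diminishing-returns curve parameterized
# by `alpha`. Uncertainty over `alpha` stands in for uncertainty over the
# utility function itself (reward uncertainty).
def make_utility_fn(alpha: float) -> Callable[[WorldState], float]:
    def global_utility(state: WorldState) -> float:
        # Inductive bias: global utility is the sum of per-agent utilities,
        # each on the same scale (no utility monsters).
        return sum(r ** alpha for r in state.agent_resources)
    return global_utility

# A crude "prior" over utility functions: a handful of sampled hypotheses.
def sample_utility_fns(n: int) -> List[Callable[[WorldState], float]]:
    return [make_utility_fn(random.uniform(0.3, 0.9)) for _ in range(n)]

# Expected utility of a candidate world state, averaged over our uncertainty
# about which utility function is the "true" one.
def expected_utility(state: WorldState,
                     hypotheses: List[Callable[[WorldState], float]]) -> float:
    return sum(u(state) for u in hypotheses) / len(hypotheses)

if __name__ == "__main__":
    random.seed(0)
    hypotheses = sample_utility_fns(100)
    # Compare two outcomes: feed the starving agent vs. do nothing.
    status_quo = WorldState(agent_resources=[10.0, 0.1])
    after_feeding = WorldState(agent_resources=[9.0, 1.1])
    print(expected_utility(status_quo, hypotheses))
    print(expected_utility(after_feeding, hypotheses))
```

Even in this toy, the feed-the-starving-person example from the top falls out: under diminishing-returns hypotheses, moving a unit of resources to the worst-off agent raises expected utility.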
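And here is an equally toy sketch of the last bullet: refining the prior over utility functions from human feedback. I'm assuming pairwise 'which of these two outcomes seems better?' judgements with a Bradley-Terry-style likelihood, and a simple grid of weighted hypotheses standing in for the posterior; this is one possible reading of 'active learning with human feedback', not anyone's actual proposal.

```python
import math
from typing import List, Tuple

# As above, each hypothesis about the utility function is reduced to one
# parameter `alpha` (diminishing returns on per-agent resources).
def global_utility(agent_resources: List[float], alpha: float) -> float:
    return sum(r ** alpha for r in agent_resources)

# A human is shown two candidate world states and says which seems better.
# Model the judgement with a logistic (Bradley-Terry) likelihood, so each
# comparison softly reweights the hypotheses.
def preference_likelihood(preferred: List[float], rejected: List[float],
                          alpha: float) -> float:
    diff = global_utility(preferred, alpha) - global_utility(rejected, alpha)
    return 1.0 / (1.0 + math.exp(-diff))

def update_posterior(particles: List[Tuple[float, float]],
                     preferred: List[float],
                     rejected: List[float]) -> List[Tuple[float, float]]:
    # particles: list of (alpha, weight). Reweight by likelihood, renormalize.
    reweighted = [(a, w * preference_likelihood(preferred, rejected, a))
                  for a, w in particles]
    total = sum(w for _, w in reweighted)
    return [(a, w / total) for a, w in reweighted]

if __name__ == "__main__":
    # Uniform prior over a grid of alpha values.
    particles = [(a / 20, 1 / 19) for a in range(1, 20)]
    # One piece of feedback: the human prefers the more equal allocation.
    particles = update_posterior(particles, preferred=[5.0, 5.0],
                                 rejected=[9.9, 0.1])
    best = max(particles, key=lambda p: p[1])
    print("highest-weight alpha after feedback:", best)
```

Repeated comparisons would concentrate the weight on the hypotheses most consistent with our moral intuitions, which is the 'test and refine' loop in miniature.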
I think this is a nice framework, in the same way that Bayesian inference is a nice framework, but it seems to similarly miss the point that computation is important. Ultimately, we need to implement a computation that will act well in the world. The reward uncertainty perspective posits a latent reward function, which we are to learn and then optimize. But this is totally intractable. Even optimizing a known utility is intractable, let alone optimizing expected utility over the many possible utility functions we are uncertain about. As with Bayesian inference, any practical implementation will have to make compromises, and which compromises we make matters hugely.
Does this mean that the notion of a 'tractable approximation to utilitarianism' misses the point in the same way that approximate Bayesian inference misses the point?
- With Bayesian inference we can point precisely to the issue: ultimately the choice of both model and approximation needs to depend on the decision-theoretic context, which means that the promise of the posterior as a universal representation for decoupling modeling from inference doesn't actually work in complex situations.
- For Bayesian inference, we also have an alternative: just optimize the final loss directly and let the model come up with whatever internal representations it needs.
In moral settings, what alternatives do we have? Ultimately we're trying to learn a policy. Instead of positing a latent utility function, we could just do behavioral cloning: try to imitate what people actually do rather than what they should do. If we train in a multi-task fashion over many domains with enough data, the most compact representation of what people actually do may eventually turn out to involve some sort of approximate inverse RL. A model that learned this would then be able to generalize to act 'morally' (or at least, as a human would) in novel situations. But framed this way, we force the model to figure out the relevant computational tradeoffs itself (a minimal sketch appears at the end of this note).
- This is still a bit unsatisfying, because it will never produce an agent that is more moral than humans. It will never produce an agent that does the same thing that humans are trying to do, but better.
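For completeness, here is the behavioral-cloning alternative stripped to its core: plain supervised learning of a policy from (situation, action-a-human-took) pairs, with no utility function represented anywhere. The features, the demonstrations, and the binary 'share or don't' action are all invented for illustration.

```python
import math
import random
from typing import List, Tuple

# Logistic policy: probability of taking the demonstrated action (1 = share
# resources, 0 = don't) given a small feature vector describing the situation.
def predict(weights: List[float], features: List[float]) -> float:
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def clone_behavior(data: List[Tuple[List[float], int]],
                   steps: int = 2000, lr: float = 0.1) -> List[float]:
    weights = [0.0] * len(data[0][0])
    for _ in range(steps):
        features, action = random.choice(data)
        p = predict(weights, features)
        # Gradient step on the log-likelihood of the demonstrated action.
        for i, x in enumerate(features):
            weights[i] += lr * (action - p) * x
    return weights

if __name__ == "__main__":
    random.seed(0)
    # Hypothetical demonstrations: [other's need, cost to self, bias term].
    demos = [([0.9, 0.1, 1.0], 1), ([0.8, 0.3, 1.0], 1),
             ([0.2, 0.7, 1.0], 0), ([0.1, 0.9, 1.0], 0)]
    policy = clone_behavior(demos)
    # Generalize to an unseen situation: high need, moderate cost to self.
    print(predict(policy, [0.85, 0.4, 1.0]))
```

The limitation in the bullet above is visible directly in the objective: nothing in it can push the policy beyond interpolating what the demonstrators did.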