AI safety
Created: January 16, 2021
Modified: January 24, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

AI safety, as a term, is sterile and hard to get excited about. Preventing catastrophe is important, but it doesn't motivate me on its own, since the status quo it would preserve is itself bad and we need profound change. I don't want merely 'safe' AIs; I want actively good AIs that lead to human flourishing, in the sense that love is value alignment.

I think someone who works in safety would not disagree with this; they might say that safety is an important part of that picture. But there's a difference in emphasis. I suspect that we are not going to get ironclad proofs of safety for cutting-edge AI systems, any more than we can prove that a newborn child will never do anything bad. But we can think about what positive traits we'd like to cultivate in an AI, and how to do that.

The present usage of the term 'AI safety' is also ambiguous. In the grand sense imagined by the original proponents of the field, it means something like "how to build a general AI that is guaranteed not to turn the world into paperclips". Now people also use it to mean "how to deploy ML-driven products that won't make certain types of expensive mistakes", like classifying people with dark skin as gorillas, or crashing a self-driving car. I'm going to focus on the first sense.

  • The basic argument is:
    • The human brain is a miracle of evolution. Like eyes, wings, and other such miracles, it is beautifully impressive but almost certainly nowhere near an optimal design. We can and will eventually build AI that is vastly smarter than humans.
    • Intelligence is the root of almost all capability. Highly intelligent machines will have immense capability to determine human events, just as humans currently have the capability to determine events for less intelligent species.
    • Any intelligence will either optimize a utility function, or not. If it does not, it is irrational in the sense of violating at least one of the von Neumann–Morgenstern axioms (a minimal statement is sketched just after this list). If it does, it is exceedingly unlikely that its utility function is perfectly aligned with humanity's, because utility functions live in a high-dimensional space, so near-misses are the overwhelming norm. It is very easy to construct situations in which a slightly misaligned objective leads to paperclip maximization.
    • If future world events are determined by an incredibly powerful agent without the best interests of humanity (or of some form of conscious life) at heart, this is bad for humanity's interests.
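For reference, here is a minimal paraphrase (my wording, not a formal statement) of the von Neumann–Morgenstern result that the "optimize a utility function or be irrational" step leans on:

```latex
% von Neumann--Morgenstern (paraphrase): if a preference relation \succeq over
% lotteries p, q, r satisfies the axioms below, it is representable by expected
% utility, and conversely; u is unique up to positive affine transformation.
\begin{align*}
&\text{Completeness:} && p \succeq q \ \text{or}\ q \succeq p \\
&\text{Transitivity:} && p \succeq q \ \text{and}\ q \succeq r \implies p \succeq r \\
&\text{Continuity:}   && p \succeq q \succeq r \implies \exists\, \alpha \in [0,1]:\ q \sim \alpha p + (1-\alpha) r \\
&\text{Independence:} && p \succeq q \iff \alpha p + (1-\alpha) r \succeq \alpha q + (1-\alpha) r \quad \forall\, r,\ \alpha \in (0,1] \\[4pt]
&\text{Conclusion:}   && \exists\, u:\quad p \succeq q \iff \mathbb{E}_{p}[u] \ge \mathbb{E}_{q}[u]
\end{align*}
```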
  • In the more all-consuming forms of this argument, AI alignment research is by far the most important thing we can work on. Meanwhile, AI capabilities research is actively harmful, because it decreases the time we have to solve alignment.
  • If I take this seriously, I should drop everything to work on AI alignment. And maybe I should. Why wouldn't I?
    • Comparative advantage: my talents are better suited to something else. (more applicable to adults)
    • A slight variant, motivation: other issues (politics, gay rights, proving the Riemann hypothesis, etc.) are more personally meaningful to me, and so I will be able to work more productively on them.
    • Depression: it can be difficult to believe in the possibility of a beautiful future worth saving when life right now is suffering and I don't see a way to change that.
    • The field is not yet ready: maybe the concepts in terms of which AI safety will properly be understood have yet to be developed, and developing them requires first focusing on capabilities research.
      • A reduced version of this is that I am not yet ready and need to learn more about capabilities before I'm able to work coherently on safety.
  • Some general objections or complications:
    • AI is too far away. People worrying about AI safety are out of touch with the current state of research. I am skeptical of this objection because researchers don't always know best.
      • A steelmanned version of this is 'the field is not yet ready'. We're not yet even in the correct paradigm for good safety work to be done.
    • Proves too much: Human events are currently determined by superhuman intelligences called 'corporations'. We have elaborate mechanisms (called 'law') to align their behavior with human utility, with varying degrees of effectiveness. Sometimes corporations act directly contrary to human interests. Does AI safety argue that we should leave CS to go study contract law? (Dylan might say yes)
    • Theory is useless: ML is currently an empirical science; most of the advances (including the major deep-learning advances that got us into this empirical mode) have been driven by intuition and trying stuff out, rather than derived from principles of rationality. Theory already fails to affect practice in ML; putting more time into theory is not going to help.
    • We can prevent disaster by informal or non-technical means. Agreements, treaties, generally 'trying to be safe'.
    • Technical means don't matter. A generalization of 'theory is useless' is 'research is useless'. Even if we did figure out safer approaches, there will always be incentives to disregard the research.
  • How do I feel right now?
    • Conceptually, I like the theory of value alignment via maintained reward uncertainty (cooperative inverse reinforcement learning); a toy sketch follows after this list.
      • It does suffer from the issues of Bayesian inference, both conceptually (model mismatch) and practically (#P-completeness).
    • I don't know of any theoretical or practical problems in AI safety that motivate me.
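To keep the reward-uncertainty idea concrete, here is a toy sketch. This is my own simplification, not any paper's implementation; the Boltzmann-rational choice model and all names are assumptions. The point is only that the system holds a posterior over what the human wants, updates it from observed choices, and acts on the whole distribution rather than a point estimate.

```python
import numpy as np

# Toy sketch of value alignment via reward uncertainty (CIRL-flavoured, my own
# simplification): the human's true reward weight theta is unknown; the robot
# keeps a discretized posterior over theta and updates it from observed human
# choices, modelled as Boltzmann-rational (softmax over reward).

thetas = np.linspace(-1.0, 1.0, 201)            # candidate reward weights
posterior = np.ones_like(thetas) / len(thetas)  # start uniform: maximal uncertainty

def choice_likelihood(theta, options, chosen, beta=5.0):
    """P(human picks `chosen` from `options` | theta), softmax over theta * feature."""
    utilities = beta * theta * np.asarray(options)
    p = np.exp(utilities - utilities.max())
    p /= p.sum()
    return p[chosen]

def update(posterior, options, chosen):
    """Bayes update of the posterior over theta after one observed human choice."""
    likelihood = np.array([choice_likelihood(t, options, chosen) for t in thetas])
    posterior = posterior * likelihood
    return posterior / posterior.sum()

# Example: the human twice prefers the option with feature value +1 over -1.
for _ in range(2):
    posterior = update(posterior, options=[+1.0, -1.0], chosen=0)

# The robot acts on the *distribution* (e.g. its mean), not a point estimate,
# and can defer to the human while the posterior is still spread out.
mean_theta = float(np.sum(thetas * posterior))
std_theta = float(np.sqrt(np.sum(posterior * (thetas - mean_theta) ** 2)))
print("E[theta] =", mean_theta, "posterior std =", std_theta)
```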
  • Musings:
    • Bayesian AIs that infer preferences seem to be in competition with current market mechanisms, which (as noted above) also elicit and attempt to satisfy preferences. Are markets a better approach? Should we view markets as approximately implementing some active learning algorithm? (A toy sketch of that framing is at the end of these notes.)
    • Will companies like Google or Facebook want to provide AIs that maximize their users' preferences, if those preferences are things like "I don't want to see ads"? Eventually they should, since it's a vastly more valuable service, but they may not be structurally able to disrupt themselves.
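A back-of-the-envelope version of the "markets as active learning" framing, again entirely my own toy formulation rather than a claim about how real markets work: treat each offer as a binary preference query, and compare a market that asks whatever query it happens to have with an active learner that picks the query minimizing expected posterior entropy over the user's preference parameter.

```python
import numpy as np

# Toy active-preference-learning loop: a discretized posterior over a scalar
# preference parameter theta, a logistic response model (assumption), and
# query selection by minimum expected posterior entropy.

thetas = np.linspace(-1.0, 1.0, 201)
posterior = np.ones_like(thetas) / len(thetas)

def p_yes(theta, query, beta=5.0):
    """P(user accepts `query` | theta), a logistic response model (assumption)."""
    return 1.0 / (1.0 + np.exp(-beta * theta * query))

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def expected_posterior_entropy(posterior, query):
    """Average entropy of the updated posterior over the two possible answers."""
    likes_yes = p_yes(thetas, query)
    p_answer_yes = np.sum(posterior * likes_yes)
    post_yes = posterior * likes_yes
    post_no = posterior * (1 - likes_yes)
    post_yes /= post_yes.sum()
    post_no /= post_no.sum()
    return p_answer_yes * entropy(post_yes) + (1 - p_answer_yes) * entropy(post_no)

# A "market" asks whatever query it happens to offer; an active learner picks
# the most informative one among the candidates.
candidate_queries = [0.1, 0.5, 1.0]
best = min(candidate_queries, key=lambda q: expected_posterior_entropy(posterior, q))
print("most informative query:", best)
```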