capabilities research: Nonlinear Function
Created: January 23, 2022
Modified: February 26, 2022

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

In the discourse around AI safety you sometimes see the claim that research on AI capabilities is harmful to the extent that it outpaces safety research. This seems plausibly true for some types of capabilities research, e.g.:

  • Systems work that scales up existing approaches without introducing new concepts.
  • Work in pure optimization, or its decision-theoretic cousin deep reinforcement learning, that improves our ability to build systems that optimize black-box objectives.

Personally, I prefer to see the capabilities-vs-safety framing as a false dichotomy. As PhdAdvisor points out, civil engineers don't have separate fields of 'building bridges' and 'building bridges that don't fall down'; for them, safety is part of the capability. Ultimately the goal of AI research should be to produce systems that make the world a better place, and it's important to recognize that this goal is different from the pure capability to optimize black-box objectives.

Work that introduces new concepts and helps develop theory of intelligence can improve capabilities while being neutral or even beneficial to safety. The key is to carve the world at the joints; the hypothesis is that the best concepts with which to think about intelligence are also the best concepts for understanding how to control and shape intelligence. Such work structures our systems in the 'right way' to provide a handle on what the system is doing.

To put it starkly: suppose we knew that another paradigm shift were needed before we could build fully humanlike AI systems. That is, such systems would in important respects be structured around concepts that we don't currently have the language to describe. It seems plausible that effective safety engineering for those systems would need to engage with the core concepts of the new paradigm. Work to find those concepts and bring about that shift would count as capabilities research, but it would also be a necessary prerequisite for the safety work.

Generally, any work that helps us understand intelligence and artificially intelligent systems is good for safety. Any time we can represent structure in a system explicitly rather than implicitly, we make it more explainable, more debuggable, and also a faster learner.
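
To make 'explicit rather than implicit' a little more concrete, here is a minimal sketch (mine, not drawn from any particular system; all names are made up) contrasting a black-box policy with an agent that carries an explicit, inspectable goal and world model. The only point is that the explicit version gives you something to read, query, and debug.

```python
# Illustrative sketch only: all names here are assumptions, not a real API.
from dataclasses import dataclass
from typing import Callable, Dict, List

# Implicit structure: one opaque learned mapping from observations to actions.
# The only way to understand it is to probe its input/output behaviour.
BlackBoxPolicy = Callable[[str], str]

# Explicit structure: the same decision broken into named, inspectable parts.
@dataclass
class ExplicitAgent:
    goal: str                          # what the system is trying to achieve
    world_model: Dict[str, List[str]]  # which actions are believed to serve which goals

    def act(self, observation: str) -> str:
        # Pick an action the world model says serves the current goal
        # (the observation is ignored in this toy example).
        candidates = self.world_model.get(self.goal, [])
        return candidates[0] if candidates else "do-nothing"

    def explain(self, observation: str) -> str:
        # The explicit goal gives us a handle for explanation and debugging.
        return f"Chose {self.act(observation)!r} because my goal is {self.goal!r}."

agent = ExplicitAgent(goal="keep-bridge-standing",
                      world_model={"keep-bridge-standing": ["inspect-cables"]})
print(agent.explain("sensor reading: cable tension high"))
```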

For example: AI systems will need to maintain relationships with humans, so they will need some value alignment capabilities (following love is value alignment). These could in principle emerge from black-box optimization, but it may be beneficial to think about how to architect systems where this happens explicitly.
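
As a purely illustrative sketch of what 'explicitly' could mean here (the class and names below are my assumptions, not any real system): the agent keeps a named, inspectable estimate of the human's preferences and updates it from feedback, instead of letting those preferences be absorbed implicitly into an end-to-end policy.

```python
# Illustrative sketch only: keep the value-alignment state explicit and inspectable.
from collections import defaultdict

class PreferenceModel:
    """Explicit per-option estimate of how much the human values each option."""
    def __init__(self):
        self.value = defaultdict(float)   # option -> estimated value
        self.count = defaultdict(int)

    def update(self, option: str, feedback: float) -> None:
        # Running average of observed feedback; the state stays readable.
        self.count[option] += 1
        n = self.count[option]
        self.value[option] += (feedback - self.value[option]) / n

    def best_option(self, options):
        return max(options, key=lambda o: self.value[o])

prefs = PreferenceModel()
prefs.update("tea", 1.0)
prefs.update("coffee", 0.2)
print(prefs.best_option(["tea", "coffee"]))  # -> 'tea'
print(dict(prefs.value))                     # the alignment state is explicit
```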