provably safe system
Created: September 06, 2023
Modified: September 06, 2023

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

References:

the authors argue that the only way to guarantee safety against superintelligence is mathematical proof. the rough outline is:

  • "unsafe" AI like a neural net generates code for us to run, along with a proof of safety
  • hardware is set up to only run code that has proofs of safety
  • this provably safe code makes up the control systems that do all actual work in the world (a toy sketch of this loop follows the list)
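
a minimal sketch of that loop, in a toy setting of my own (none of these names or checks come from the paper): untrusted code arrives with a safety claim, a small trusted checker verifies the claim, and a "hardware" wrapper refuses to run anything the checker didn't accept. here the certificate check is brute-force enumeration over a finite input domain, standing in for a real proof object checked by a proof kernel:

```python
from typing import Callable

def checker_accepts(f: Callable[[int], int], domain: range, lo: int, hi: int) -> bool:
    """trusted checker: does f's output stay within [lo, hi] on `domain`?"""
    return all(lo <= f(x) <= hi for x in domain)

def gated_run(f: Callable[[int], int], verified: bool, x: int) -> int:
    """'hardware' side: only executes code the checker accepted."""
    if not verified:
        raise PermissionError("no valid safety certificate; refusing to execute")
    return f(x)

# untrusted, model-generated code (hypothetical stand-in):
def proposed_controller(x: int) -> int:
    return (x * 7) % 100  # output always lands in [0, 99]

ok = checker_accepts(proposed_controller, range(256), lo=0, hi=99)
print(gated_run(proposed_controller, ok, 42))  # prints 94
```

the design point being illustrated: only the checker and the wrapper need to be trusted; the code generator and the generated code don't.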

what I'm unclear about so far is:

  • we'll still need the unsafe superintelligences to do the initial code generation and safety proofs. (unless we can bootstrap a cycle of provably safe systems that only generate other provably safe systems, but this seems hard?)
  • we can't ever prove safety itself; we can only prove a formalization of safety in some formal system. that puts all the burden on us to perfectly formalize what we mean by safety. if we get it wrong, the proofs are useless. (and of course we will get it wrong)
  • suppose there are "systems we can reason about" (well-constructed traditional code) and "systems we can't reason about" (giant black-box neural nets). it may be that there are fundamental gaps in what is possible with each class of system. intuitions here: systems that learn flexibly will end up doing things we can't specify ahead of time (or else we could have just specified the behavior without needing to learn it), and learning systems will be able to form new representations that carve the world in ways we can't currently anticipate.
  • countervailing intuitions: we don't need to reason about every part of a system; maybe we just need to reason about sandboxes. e.g. a program can do arbitrary "unsafe" things (train models, accept instructions from untrusted sources, etc.) as long as its outputs eventually go through some sort of provably correct sanitizing filter on the way to action in the world. and maybe the sanitizing filter itself is a proof checker: the model outputs a planned trajectory, or a controller that will follow the planned trajectory, along with a proof that this controller is 'safe' in some sense. (a rough sketch of the filter pattern follows this list.)
  • it seems like this requires that all hardware include verifiers and refuse to run unverified code. this is a pretty totalitarian world: are you even allowed to program your own computer?
  • any meaningful safety properties are going to be in terms of macro-level objects in the world, like "human beings". so provable safety has to guarantee that this concept is reliably recognized and represented in the system. but all such concepts are inherently nebulous, have unclear boundaries, and the recognition systems will be subject to adversarial examples (a person gets mistakenly classified as a non-person, or vice versa). a provably safe system of this sort has to avoid ontological crises, which means it won't have the flexibility to reorder its ontology to adapt to new situations.
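
a rough sketch of the sanitizing-filter idea from the bullet above (my own toy; the workspace bounds, step limit, and function names are all made up): the planner is completely untrusted, but every plan has to pass a small, directly checkable safety condition before anything reaches actuators. a real version would replace the runtime check with a proof supplied alongside the plan:

```python
from typing import List, Tuple

Trajectory = List[Tuple[float, float]]  # sequence of (x, y) setpoints

WORKSPACE = (-10.0, 10.0)   # allowed coordinate range (assumed limits)
MAX_STEP = 0.5              # max distance between consecutive setpoints

def plan_is_safe(traj: Trajectory) -> bool:
    """checkable stand-in for a safety proof: stay in bounds, move in small steps."""
    lo, hi = WORKSPACE
    in_bounds = all(lo <= x <= hi and lo <= y <= hi for x, y in traj)
    small_steps = all(
        abs(x2 - x1) <= MAX_STEP and abs(y2 - y1) <= MAX_STEP
        for (x1, y1), (x2, y2) in zip(traj, traj[1:])
    )
    return in_bounds and small_steps

def untrusted_planner(goal: Tuple[float, float]) -> Trajectory:
    """imagine a giant neural net here; we make no assumptions about it."""
    x, y = goal
    steps = 40
    return [(x * i / steps, y * i / steps) for i in range(steps + 1)]

def act(traj: Trajectory) -> None:
    """actuation, only reachable through the filter below."""
    print(f"executing {len(traj)} setpoints, ending at {traj[-1]}")

proposal = untrusted_planner((3.0, 4.0))
if plan_is_safe(proposal):
    act(proposal)
else:
    print("plan rejected by the sanitizing filter")
```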

interesting observations from the paper:

  • it makes sense to co-generate code and a proof of correctness. the code will decompose in the same way as the proof. good programmers always have an argument for why their code is correct: what contract it is trying to meet and how it meets that contract. (a toy Lean example of this decomposition follows the list.)
  • some provable properties that could be useful:
    • time- and space-boundedness: hardware will only run code within a certain time window or within certain GPS coordinates. (this is a property of the hardware, not the code?)
    • "throttled AI": systems can only operate as long as they receive a 'crypto drip' of tokens from an authorized source. otherwise they automatically shut down.
      • not clear to me how this is different from just having an off switch? how is depending on a 'crypto drip' different from an 'electricity drip' (i.e. a power supply), which also shuts the machine down when it's cut? (a toy sketch of the token mechanism follows the list.)
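
on the co-generation bullet: a toy Lean 4 example (mine, not from the paper) of code and one small piece of its contract written together, where the proof decomposes along the same recursion as the code: the induction cases mirror the match cases, and the induction hypothesis is used exactly where the code makes its recursive call.

```lean
-- toy "code": append on lists, recursing on the first argument
def app : List Nat → List Nat → List Nat
  | [],      ys => ys
  | x :: xs, ys => x :: app xs ys

-- one piece of its contract; the proof's induction mirrors the recursion
theorem app_nil_right (xs : List Nat) : app xs [] = xs := by
  induction xs with
  | nil => simp [app]
  | cons x xs ih => simp [app, ih]
```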
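
a toy sketch of one reading of the 'crypto drip' (my own guess at the mechanism, not the paper's protocol; the key, lifetime, and function names are all made up): the system keeps operating only while it holds a fresh token authenticated by an external authorizer. unlike a power cable, the drip can be issued remotely, and a real design would use public-key signatures so the throttled machine holds only a verification key and can't mint its own tokens (this sketch uses a shared-secret MAC for brevity):

```python
import hashlib
import hmac
import time

AUTHORITY_KEY = b"secret-held-by-the-authorizer"  # hypothetical shared key
TOKEN_LIFETIME = 5.0  # seconds a token stays valid (assumed policy)

def issue_token(timestamp: float) -> bytes:
    """run by the authorizer: MAC over the current time."""
    msg = str(timestamp).encode()
    return msg + b"|" + hmac.new(AUTHORITY_KEY, msg, hashlib.sha256).digest()

def token_is_fresh(token: bytes, now: float) -> bool:
    """run on the throttled system: verify the MAC, then check freshness."""
    msg, _, tag = token.partition(b"|")
    expected = hmac.new(AUTHORITY_KEY, msg, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected) and now - float(msg.decode()) < TOKEN_LIFETIME

def control_step() -> None:
    print("doing one step of useful work")

latest_token = issue_token(time.time())  # the drip arrives from outside
for _ in range(3):
    if not token_is_fresh(latest_token, time.time()):
        print("drip expired: halting")
        break
    control_step()
```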