Modified: October 27, 2022
training for consistency
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

These days we think a lot about using data to train large language models. But there's only so much data in the world; eventually we'll want other training signals. Making a model self-consistent is a natural space of possibilities to think about, since it offers a way to improve a model without additional data. In human cognition, we might compare this to reflective thought, where we can sometimes make progress by 'connecting the dots' between things we already know, without any new real-world experience.
From another angle: generally, when we teach a model something, we'd also like it to internalize all the logical consequences of that information, as if it had done a Bayesian update. That means forcing it to uproot whatever previous beliefs it held that have now been falsified.
What could this look like? In formal reasoning contexts, we could consider amortizing chain-of-thought prompting: ask the model to answer a question while explicitly showing its reasoning. Perhaps do this many times and take the majority vote as the presumed-correct answer. Then train the model to produce that answer 'instinctively', without requiring the chain of thought. (cf. Large Language Models Can Self-Improve)
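Here is a minimal sketch of that amortization step in Python. The `model.generate` sampling interface is a hypothetical stand-in for whatever LLM API is in use, and the prompt format and last-line answer-extraction heuristic are likewise assumptions, not part of any particular library.

```python
from collections import Counter

# `model.generate(prompt, temperature=...)` is a hypothetical sampling call,
# not a specific library's API.

def sample_with_cot(model, question: str, temperature: float = 0.7) -> str:
    """Sample a chain-of-thought completion and return just the final answer line."""
    prompt = f"Q: {question}\nA: Let's think step by step."
    completion = model.generate(prompt, temperature=temperature)
    # Assume (for this sketch) that the last line of the completion states the answer.
    return completion.strip().splitlines()[-1]

def build_distillation_example(model, question: str, num_samples: int = 16) -> dict:
    """Majority-vote over sampled answers, then emit a (prompt, target) pair
    with no reasoning trace, so the answer can be produced 'instinctively'."""
    answers = [sample_with_cot(model, question) for _ in range(num_samples)]
    majority_answer, _count = Counter(answers).most_common(1)[0]
    return {"prompt": f"Q: {question}\nA:", "target": f" {majority_answer}"}
```

The resulting pairs would then serve as ordinary supervised fine-tuning data, with the loss applied only to the target tokens.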
For more informal propositions, maybe some kind of Socratic dialogue? You can probably get a model to produce, e.g., philosophical beliefs that seem contradictory. Then have another model (or the same model with a different prompt) play Socrates, or an adversarial "devil's advocate" who points out the contradiction, and ask the original model how to reconcile those beliefs. Ultimately the model could disavow one of the beliefs, or revise it into something consistent. Then we backprop on that final consistent statement as the thing the model should have said in the first place.
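A rough sketch of that loop, under the same assumed `model.generate` interface; the prompt wording is purely illustrative.

```python
# Socratic-consistency loop. `model.generate` and the chat format are
# hypothetical stand-ins for a real LLM API.

def chat(model, system: str, user: str) -> str:
    """Single-turn call with a system instruction; the format is an assumption."""
    return model.generate(f"[system] {system}\n[user] {user}\n[assistant]")

def reconcile(model, belief_a: str, belief_b: str) -> str:
    """Have a 'Socrates' prompt surface the tension between two beliefs, then
    ask the model to resolve it; the resolution is the fine-tuning target."""
    objection = chat(
        model,
        system="You are Socrates. Point out any contradiction between the two claims.",
        user=f"Claim 1: {belief_a}\nClaim 2: {belief_b}",
    )
    resolution = chat(
        model,
        system="Revise or disavow one of your claims so that your position is consistent.",
        user=f"Claim 1: {belief_a}\nClaim 2: {belief_b}\nObjection: {objection}",
    )
    # Train on (original question -> resolution): the thing the model
    # should have said in the first place.
    return resolution
```

Only the final, reconciled statement becomes a training target; the Socratic back-and-forth is scaffolding that gets thrown away, much like the chain of thought above.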
Note that 'consistency' is highly multimodal and not synonymous with truth. There may be many sets of internally consistent beliefs. Having internally consistent beliefs is not sufficient for having correct beliefs. And it may not even be necessary: in a world with many models, an agent can hold beliefs in many contexts, and the beliefs that are behaviorally useful in one context may appear contradictory to those that are useful in another context. At some level, these beliefs must be reconcilable, since they both reflect the same underlying world, but that doesn't imply that reconciling them is necessarily a useful enterprise. Still, it is important and valuable to understand the consequences of our beliefs, and when a belief system is inconsistent it's a strong warning that there's something worth examining further.