toolformer: Nonlinear Function
Created: February 16, 2023
Modified: February 16, 2023

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Notes on Toolformer: Language Models Can Teach Themselves to Use Tools

The basic method is: "Given just a handful of human-written examples of how an API can be used, we let a LM annotate a huge language modeling dataset with potential API calls. We then use a self-supervised loss to determine which of these API calls actually help the model in predicting future tokens. Finally, we finetune the LM itself on the API calls that it considers useful."
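the filtering step can be sketched with a toy loss. here `nll` is a stand-in for a real model's negative log-likelihood of the future tokens given a context, and `tau` is a usefulness threshold; the function names, bracket syntax, and numbers are mine, a minimal sketch rather than the paper's implementation:

```python
def nll(context: str, continuation: str) -> float:
    # toy stand-in for a real language model's negative log-likelihood of
    # `continuation` given `context`: pretend prediction gets much easier
    # when the continuation already appears verbatim in the context
    penalty = 0.0 if continuation in context else 1.0
    return penalty + 0.1 * len(continuation)

def keep_api_call(prefix: str, call: str, result: str, future: str, tau: float = 0.5) -> bool:
    # loss when the call AND its result are spliced into the text
    l_with_result = nll(prefix + f"[{call} -> {result}] ", future)
    # baselines: no call at all, and the call without its result
    l_no_call = nll(prefix, future)
    l_call_only = nll(prefix + f"[{call}] ", future)
    # keep the call only if it lowers loss on future tokens by at least tau
    return min(l_no_call, l_call_only) - l_with_result >= tau
```

with this toy loss, a calculator call whose result matches the upcoming text is kept, while an irrelevant call is filtered out; that is the whole self-supervised trick -- no labels, just the model's own prediction signal.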

This works quite well: a finetuned GPT-J (6B) model outperforms the much larger GPT-3 on math and several other tasks. It is not quite training for consistency, but it uses the language model's own prediction signal to decide which tool calls are useful, which seems quite clever.

Limitations include that it can only use tools in isolation: a single, non-interactive call at a time, with no chaining of one tool's output into another.

how does this relate, as a form of meta-reasoning, to sparse mixture-of-experts models? a mixture-of-experts model decides to 'call' a transformer subroutine internally, via gating, and always passes it the current activation. the call probabilities are the output of a gating layer, some call is always made, and the whole thing is trained end-to-end with backprop to minimize prediction loss.

the toolformer, meanwhile, decides to 'call' an external subroutine by emitting appropriate tokens. the probability of emitting each token is the model's output -- not so different from a gating layer -- but each subroutine also gets a call-specific input, and the model is finetuned only on the cases where calling made prediction better. the external subroutines themselves are never trained and need not be differentiable at all.
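the mixture-of-experts half of the comparison can be sketched in a few lines. this is a generic top-1 gating layer, not any particular paper's architecture; the expert functions and weights are toy placeholders:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# 'experts' are internal subroutines; each always receives the full
# current activation, unlike a tool call with a purpose-built argument
experts = [
    lambda h: [2.0 * v for v in h],   # expert 0
    lambda h: [v + 1.0 for v in h],   # expert 1
]

def moe_layer(h, gate_weights):
    # the gating layer maps the activation to per-expert logits; some
    # expert is always called, and the whole path is differentiable
    logits = [sum(w * x for w, x in zip(row, h)) for row in gate_weights]
    probs = softmax(logits)
    k = max(range(len(probs)), key=probs.__getitem__)  # top-1 routing
    out = experts[k](h)
    return [probs[k] * v for v in out], k
```

contrast with the toolformer: here the 'call' is chosen by an internal gate and scaled by its probability, whereas the toolformer's 'gate' is ordinary next-token probability over call-emitting tokens, and the callee lives outside the gradient path entirely.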

there's a nice 'seeding' where the model itself can be taught when to make appropriate calls, using natural language priors.
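a hypothetical seeding prompt, loosely in the spirit of the paper's calculator examples (not a verbatim copy): a few natural-language demonstrations teach the model when and how to emit an API call, which the annotation stage then samples from.

```python
# hypothetical few-shot seeding prompt; the bracket syntax and wording
# are illustrative, not the paper's exact prompt
SEED_PROMPT = """Your task is to add calls to a Calculator API to a piece of text.
Example:
Input: Out of 1400 participants, 400 passed the test.
Output: Out of 1400 participants, 400 [Calculator(400 / 1400)] passed the test.
Input: {text}
Output:"""

def seed(text: str) -> str:
    # fill in the text the model should annotate with API calls
    return SEED_PROMPT.format(text=text)
```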

this all connects to language model cascades: although the toolformer can only make one call at a time, we'd like it to be able to interact with an API, have a series of thoughts, even hold a series of internal conversations, before continuing to generate. this is all metareasoning, writ broadly.
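the cascade-style extension can be sketched as a control loop that alternates generation with external calls -- a hypothetical driver, not anything from the paper, with a made-up `[Tool(args)]` marker syntax:

```python
import re

def run_with_tools(generate, tools, prompt, max_steps=5):
    # alternate model generation with external tool calls: whenever the
    # model emits a [Tool(args)] marker, execute the tool, splice the
    # result back into the text, and let the model continue from there
    text = prompt
    for _ in range(max_steps):
        chunk = generate(text)
        m = re.search(r"\[(\w+)\((.*?)\)\]", chunk)
        if m is None:
            return text + chunk  # no call requested: generation is done
        name, args = m.group(1), m.group(2)
        result = tools[name](args)
        text += chunk[: m.start()] + f"[{name}({args}) -> {result}]"
    return text
```

with a scripted stand-in for the model, e.g. one that first emits `"x is [Calc(2+3)]"` and then `" so x=5."`, the loop executes the call, splices in `-> 5`, and resumes generation -- one small step from single calls toward genuine interaction.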

What's a good unifying framework to think of all this stuff in?