Modified: March 13, 2022
generative flow network
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Many objects can be generated by a sequence of actions. For example:
- Generating language by adding one word at a time
- Generating a molecule by adding one atom or functional group at a time
- Generating an image by adding one pixel, brush stroke, or object at a time, or by taking one diffusion step at a time
We can thus treat generative modeling as a special case of reinforcement learning, where the generative steps are viewed as the actions of an agent. A policy specifies, for each (intermediate) state, the probability of taking each potential generative step. But while in standard reinforcement learning we usually want to find the policy with the highest expected reward, here we want to find a policy that samples outcomes with probability proportional to their reward.
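To make the target concrete, here is a tiny sketch of the distribution such a policy should induce: given a handful of terminal objects with nonnegative rewards (the objects and reward values below are made up for the example), we want to sample each object with probability R(x)/Z rather than always returning the reward maximizer.

```python
# Hypothetical terminal objects and rewards, purely for illustration.
rewards = {"x1": 3.0, "x2": 1.0, "x3": 6.0}

Z = sum(rewards.values())                        # partition function
target = {x: r / Z for x, r in rewards.items()}  # desired sampling distribution

print(target)  # {'x1': 0.3, 'x2': 0.1, 'x3': 0.6}
```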
With discrete states and actions, we can conceive of the Markov decision process as a graph in which the states are nodes and the actions are outgoing edges (in the deterministic case) or distributions over outgoing edges (if transitions are stochastic). We can equate a policy with a flow on this graph, where:
- the initial state is the source, with outflow 1
- the terminal states are sinks, where the inflow at each sink is the probability of generating that particular terminal state (these sink inflows must of course sum to 1, matching the source's total outflow)
Sampling from the policy corresponds to following a particle of fluid through the flow: at every node, the particle follows outgoing edges with probability proportional to the amount of flow along each edge. The inflow to a node represents the probability of ending up at that node, and the per-edge outflows are unnormalized action probabilities. These flows must obey a consistency condition: at every intermediate state $s'$, the total inflow must equal the total outflow,

$$\sum_{s:\, s \to s'} F(s \to s') \;=\; \sum_{s'':\, s' \to s''} F(s' \to s'').$$
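As a concrete illustration, here is a toy flow on a hand-built DAG (the graph and flow values are made up for the example), with a check of the consistency condition at the intermediate nodes and a sampler that follows outgoing edges in proportion to their flows:

```python
import random

# Toy DAG with hand-picked edge flows. "s0" is the source with total outflow 1;
# "x1" and "x2" are terminal sinks.
flows = {
    ("s0", "a"): 0.4, ("s0", "b"): 0.6,
    ("a", "x1"): 0.4,
    ("b", "x1"): 0.2, ("b", "x2"): 0.4,
}

def out_edges(state):
    return [(e, f) for e, f in flows.items() if e[0] == state]

def in_flow(state):
    return sum(f for e, f in flows.items() if e[1] == state)

# Consistency check at the intermediate nodes: inflow equals total outflow.
for s in ("a", "b"):
    assert abs(in_flow(s) - sum(f for _, f in out_edges(s))) < 1e-9

def sample_trajectory(state="s0"):
    """Follow a particle of fluid: at each node, pick an outgoing edge
    with probability proportional to the flow along that edge."""
    trajectory = [state]
    while out_edges(state):
        edges, weights = zip(*out_edges(state))
        state = random.choices([dst for _, dst in edges], weights=weights)[0]
        trajectory.append(state)
    return trajectory

print(sample_trajectory())  # e.g. ['s0', 'b', 'x2']
# Terminal-state probabilities equal sink inflows: P(x1) = 0.6, P(x2) = 0.4.
```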
To find a valid flow represented by a parameterized function $F_\theta$, we can soften this consistency condition into a loss that we minimize by gradient descent:

$$\mathcal{L}_\theta(\tau) = \sum_{\substack{s' \in \tau \\ s' \neq s_0}} \left( \log \sum_{s:\, s \to s'} \exp F^{\log}_\theta(s \to s') \;-\; \log\Big[ R(s') + \sum_{s'':\, s' \to s''} \exp F^{\log}_\theta(s' \to s'') \Big] \right)^2,$$

where $\tau$ is a sampled trajectory, $R(s')$ is the reward (nonzero only at terminal states, which have no outflow), and modeling the flow in log space ($F^{\log}_\theta = \log F_\theta$) ensures that we give as much priority to getting the final steps right as we do to the initial steps, even though the latter have exponentially larger flow values.
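Here is a minimal sketch of how this objective could be optimized in PyTorch on a tiny enumerated DAG. The graph, rewards, tabular log-flow parameterization, and the small epsilon for numerical stability are illustrative choices, not prescribed by the notes.

```python
import torch

# Tiny tabular setup: one learnable log-flow per edge (hypothetical example).
edges = [("s0", "a"), ("s0", "b"), ("a", "x1"), ("b", "x1"), ("b", "x2")]
idx = {e: i for i, e in enumerate(edges)}
log_F = torch.nn.Parameter(torch.zeros(len(edges)))   # F_theta^log for each edge
reward = {"a": 0.0, "b": 0.0, "x1": 3.0, "x2": 1.0}   # R > 0 only at sinks

def flow_matching_loss(states, eps=1e-6):
    """Squared mismatch between log(inflow) and log(reward + outflow),
    summed over non-source states, with flows parameterized in log space."""
    zero = torch.tensor(0.0)
    loss = zero
    for s in states:
        inflow = sum((log_F[idx[e]].exp() for e in edges if e[1] == s), zero)
        outflow = sum((log_F[idx[e]].exp() for e in edges if e[0] == s), zero)
        loss = loss + (torch.log(eps + inflow)
                       - torch.log(eps + reward[s] + outflow)) ** 2
    return loss

opt = torch.optim.Adam([log_F], lr=0.1)
for _ in range(500):
    opt.zero_grad()
    flow_matching_loss(["a", "b", "x1", "x2"]).backward()
    opt.step()

# After training, sink inflows are roughly proportional to their rewards, so
# dividing by the source's total outflow recovers the sampling probabilities.
print({s: sum(log_F[idx[e]].exp().item() for e in edges if e[1] == s)
       for s in ("x1", "x2")})
```

Here the flows are left unnormalized (the source's outflow converges to the sum of rewards rather than 1), which matches the loss above; dividing by that total gives the normalized flow described earlier.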
TODO: read about the improved trajectory balance objective, and about GFlowNet foundations.
References: