Modified: September 25, 2023
multimodal transformer
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
possible refs:
- google's multimodal architectures: https://webcache.googleusercontent.com/search?q=cache:https://towardsdatascience.com/googles-latest-approaches-to-multimodal-foundational-model-beedaced32f9
- PaLM-E: embodied multimodal model: https://arxiv.org/abs/2303.03378
- PaLI (Pathways Language and Image model): https://arxiv.org/abs/2209.06794
- multimodal bottleneck transformer (MBT): https://blog.research.google/2022/03/multimodal-bottleneck-transformer-mbt.html?m=1
- also the perceiver work
I think the predominant approach is based on tokenizing all the things. (ref??)
We use a vision transformer to process an image into a sequence of ~256 token vectors. If we like, we can then project (or directly inject, if they happen to be the same size) these vectors into the embedding space of a large language model. Then an image is literally represented the same way as 256 words: the processed image tokens just act as words in the model.
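A minimal sketch of that injection step, assuming PyTorch and made-up sizes (768-d ViT tokens, a 4096-d LLM embedding space); none of the names here come from a specific paper:

```python
import torch
import torch.nn as nn

vit_dim, llm_dim = 768, 4096          # assumed sizes; if equal, the projection is unnecessary
num_image_tokens = 256

# learned projection from the vision transformer's output space into the LLM embedding space
project = nn.Linear(vit_dim, llm_dim)

image_tokens = torch.randn(1, num_image_tokens, vit_dim)   # output of the vision transformer
text_embeds = torch.randn(1, 32, llm_dim)                   # LLM embeddings of the text prompt

# the image is now literally 256 "words": just more positions in the LLM's input sequence
inputs_embeds = torch.cat([project(image_tokens), text_embeds], dim=1)
# inputs_embeds (1, 288, 4096) is fed to the LLM in place of its usual token embeddings
```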
Does it make sense to do some preprocessing of the image tokens before feeding them into the language model? In a sense, raw images are lower-level, so they might need extra layers of processing in order to exist at the same conceptual level as language tokens.
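One plausible form of that extra processing, in the spirit of the Perceiver work listed above, is a small cross-attention "resampler" that reads the raw image tokens into a fixed set of learned latents before injection. This is just a sketch of the idea; the module name, sizes, and the single-layer design are all my assumptions:

```python
import torch
import torch.nn as nn

class ImageResampler(nn.Module):
    """Learned latents cross-attend to raw image tokens before they go into the LLM."""
    def __init__(self, dim=768, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, image_tokens):                        # (batch, 256, dim)
        q = self.latents.expand(image_tokens.shape[0], -1, -1)
        x, _ = self.attn(q, image_tokens, image_tokens)     # latents read from the image tokens
        return x + self.mlp(x)                              # (batch, 64, dim), ready to project into the LLM
```

A side benefit of a stage like this is that it can also shorten the image sequence (256 tokens down to 64 latents here) before it takes up space in the language model's context.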
In this story, the token space is effectively the space of representations of the model. So if there were a 'global workspace' it would be this. But a true global workspace model is recurrent, in a sense.
The training signal for PaLI is:
- start with pretrained text and image models
- then fine-tune on a range of combined tasks: captioning, OCR, visual QA, object listing (produce a text list of all objects in the image); all of these can be framed as text generation, as in the sketch below
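Illustrative only: the common pattern is to frame all of these tasks as (image, prompt text) -> target text, so a single next-token loss covers everything. The prompt strings below are invented, not PaLI's actual task prefixes:

```python
# Each example is just an image plus a text prompt and a text target.
training_examples = [
    {"image": "beach.jpg", "prompt": "caption the image",              "target": "a dog running on a beach"},
    {"image": "sign.jpg",  "prompt": "read the text in the image",     "target": "NO PARKING 7AM-9AM"},
    {"image": "beach.jpg", "prompt": "question: what animal is shown?", "target": "a dog"},
    {"image": "beach.jpg", "prompt": "list all objects",               "target": "dog, beach, waves, sky"},
]
# One sequence-to-sequence loss (next-token prediction on the target) then covers
# captioning, OCR, visual QA, and object listing with the same model.
```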
The Multimodal bottleneck work (https://blog.research.google/2022/03/multimodal-bottleneck-transformer-mbt.html?m=1) distinguishes between early/mid/late fusion of modalities:
- in late fusion, the two modalities are encoded independently, with no cross-attention, and final representations combined for whatever the task is
- in early fusion, there is cross-attention between modality tokens from the very beginning
- mid fusion is a compromise
They propose bottleneck fusion, where full cross-attention is replaced by a small set of latent bottleneck units: these attend to each modality's tokens, and each modality's tokens can attend to them. So information can move between the modalities, but only via the summary held in the bottleneck units (see the sketch below).
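A hedged sketch of one bottleneck-fusion layer, assuming PyTorch, a video and an audio modality, and one transformer layer per modality. Each modality self-attends over its own tokens plus the shared bottleneck tokens; how the two modalities' updates to the bottleneck are merged (averaged here) is my assumption rather than a detail taken from the paper:

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """One fusion layer: modalities exchange information only via a few bottleneck tokens."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # one transformer layer per modality; there is no direct cross-modal attention
        self.video_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.audio_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)

    def forward(self, video_tokens, audio_tokens, bottleneck):
        k = bottleneck.shape[1]
        # each modality self-attends over [its own tokens ; the shared bottleneck tokens]
        v = self.video_layer(torch.cat([video_tokens, bottleneck], dim=1))
        a = self.audio_layer(torch.cat([audio_tokens, bottleneck], dim=1))
        video_out, b_from_video = v[:, :-k], v[:, -k:]
        audio_out, b_from_audio = a[:, :-k], a[:, -k:]
        # the updated bottleneck is the only cross-modal summary that survives to the next layer
        new_bottleneck = (b_from_video + b_from_audio) / 2
        return video_out, audio_out, new_bottleneck

# usage: the bottleneck starts as a small set of learned tokens and is threaded through the layers
layer = BottleneckFusionLayer()
video = torch.randn(1, 196, 512)      # e.g. patch tokens for a frame
audio = torch.randn(1, 100, 512)      # e.g. spectrogram patch tokens
bottleneck = torch.randn(1, 4, 512)   # tiny: this is the whole cross-modal channel
video, audio, bottleneck = layer(video, audio, bottleneck)
```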
These bottlenecks then seem like a sort of global workspace? If we think of the model as unfolding in time, with a single bottleneck that reads from all modalities and is read by all modalities, then this potentially supports a sort of 'top-down' understanding: the bottleneck represents the current 'conscious experience' that binds the different modalities, and can in turn influence how each modality is interpreted. This might make more sense as a system for understanding video, where things actually do unfold over time, and the high-level (bottleneck) information from one frame can feed into the lower-level understanding of the next frame.