Modified: April 30, 2022
contrastive learning
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

A technique for representation learning in which semantically similar datapoints are encouraged to have similar representations, and dissimilar points to have dissimilar representations.
Contrastive methods can be applied in an unsupervised setting by using data augmentation (e.g., random image transformations) to construct multiple 'views' of a given example, which should all have similar representations; the other examples in the batch typically serve as the negatives.
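A minimal PyTorch sketch of this idea, assuming an encoder has already mapped two augmented views of each image in a batch to embeddings z1 and z2 (roughly the NT-Xent / InfoNCE loss used in SimCLR); the function name and temperature value are illustrative, not taken from any particular codebase:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Contrastive loss over two batches of augmented views.

    z1, z2: (N, D) embeddings where row i of z1 and row i of z2 come from
    the same underlying image (positives); every other pair in the batch
    is treated as a negative.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)              # (2N, D)
    sim = z @ z.t() / temperature               # cosine similarities as logits
    sim.fill_diagonal_(float('-inf'))           # never contrast an example with itself
    n = z1.shape[0]
    # the positive for row i is row i+n, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# usage: z1 = encoder(augment(x)); z2 = encoder(augment(x)); loss = nt_xent_loss(z1, z2)
```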
CLIP uses a contrastive loss with text supervision: the image and text encoders are trained to maximize the similarity in representation space of each matching (image, caption) pair while minimizing the similarity of mismatched pairs.
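A rough PyTorch sketch of that symmetric objective, assuming the image and text encoders have already projected a batch of matching (image, caption) pairs into a shared embedding space; CLIP actually learns the temperature, which is fixed here for simplicity:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of matching (image, caption) pairs.

    image_emb, text_emb: (N, D) embeddings; row i of each is a matching pair.
    """
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    logits = image_emb @ text_emb.t() / temperature    # (N, N) pairwise similarities
    targets = torch.arange(logits.shape[0])            # matches lie on the diagonal
    loss_images = F.cross_entropy(logits, targets)     # each image picks its caption
    loss_texts = F.cross_entropy(logits.t(), targets)  # each caption picks its image
    return (loss_images + loss_texts) / 2
```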
Lilian Weng's blog is a good reference: https://lilianweng.github.io/posts/2021-05-31-contrastive/
Yann LeCun on the drawbacks of contrastive methods (page 23):
But there are two main issues with contrastive methods. First, one has to design a scheme to generate or pick suitable ŷ. Second, when y is in a high-dimensional space, and if the EBM is flexible, it may require a very large number of contrastive samples to ensure that the energy is higher in all dimensions unoccupied by the local data distribution. Because of the curse of dimensionality, in the worst case, the number of contrastive samples may grow exponentially with the dimension of the representation. This is the main reason why I will argue against contrastive methods.