Web: http://arxiv.org/abs/2206.07699

June 16, 2022, 1:11 a.m. | Shizhe Diao, Wangchunshu Zhou, Xinsong Zhang, Jiawei Wang

cs.LG updates on arXiv.org

With the success of vision-language pre-training, the state of the art in
multi-modal understanding and generation has been pushed forward.
However, the current pre-training paradigm is either incapable of targeting all
modalities at once (e.g., text generation and image generation), or requires
multiple carefully designed tasks, which significantly limits scalability. We
demonstrate that a unified modal model can be learned with a prefix language
modeling objective over text and image sequences. Thanks to the simple but
powerful pre-training paradigm, our proposed …
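
The abstract is cut off before it describes the model, so the following is only a minimal sketch of what a generic prefix language modeling setup looks like, not the authors' implementation. It assumes text and images have already been mapped to one sequence of discrete tokens (an assumption; the excerpt does not say how). The function names `prefix_lm_mask` and `loss_positions` are hypothetical. Prefix positions attend to each other bidirectionally, while the remaining positions attend causally and are the only ones that contribute to the loss.

```python
def prefix_lm_mask(prefix_len, total_len):
    """Return mask[i][j] == True when position i may attend to position j.

    Hypothetical helper illustrating a generic prefix LM mask; the
    tokenization of text and images is assumed to have happened already.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")
    mask = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            # Prefix tokens are visible to every position (bidirectional
            # inside the prefix). Tokens after the prefix are visible only
            # to themselves and to later positions (causal).
            row.append(j < prefix_len or j <= i)
        mask.append(row)
    return mask


def loss_positions(prefix_len, total_len):
    """Indices of the tokens that are predicted, i.e. the suffix only."""
    return list(range(prefix_len, total_len))


if __name__ == "__main__":
    # Two prefix tokens followed by two generated tokens.
    for row in prefix_lm_mask(2, 4):
        print("".join("x" if allowed else "." for allowed in row))
    print(loss_positions(2, 4))
```

Under this sketch the same mask would serve both directions of generation: place image tokens in the prefix to generate text, or text tokens in the prefix to generate image tokens.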


