Bumblebee: Advancing Beyond Closed-Source Multi-Modal Models through Token Shrinkage

Anonymous
2024

*Indicates Equal Contribution

Abstract

Multi-modal Large Language Models (MLLMs) are at the forefront of artificial intelligence research, aiming to create models capable of understanding, learning from, and generating multiple data types, including text, images, and sound. Despite this potential, significant challenges persist, including the integration of suitable vision encoders and LLMs, the scarcity of comprehensive multi-modal datasets, and the need for efficient performance improvement. A performance gap currently exists between closed-source models, often developed by resource-rich tech companies, and open-source models. However, the open-source community is making substantial strides, driven by collaboration and resource availability. Our work on the Bumblebee model, an open-source MLLM, exemplifies this progress. By implementing token shrinkage and developing an efficient projector called STSR (Scalable Token Shrinkage Resampler), Bumblebee has surpassed the closed-source QwenVL Max on MMBench-Test-CN with a score of 75.9, using only open-source data and a 14-billion-parameter LLM. This exceeds the current open-source state of the art, Yi-34B-VL, by 5.9 points on MMBench-Test-CN, despite using fewer parameters. This achievement underscores the potential of open-source models to compete with, and potentially surpass, their closed-source counterparts, signaling a promising future for open-source multi-modal learning.
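The paper does not detail STSR's internals here, but the general idea behind a token-shrinkage resampler can be illustrated with a common pattern: a small set of learned query tokens cross-attends to the full set of vision-encoder patch tokens, producing a much shorter sequence for the LLM. The sketch below is a hypothetical, minimal numpy version of that pattern (all names, dimensions, and initializations are illustrative assumptions, not the actual STSR design):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class TokenShrinkageResampler:
    """Illustrative sketch (not the actual STSR): R learned queries
    cross-attend to N vision tokens, shrinking the sequence fed to
    the LLM from N down to R, with R << N."""

    def __init__(self, num_queries, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Learned query tokens and key/value projections (random here;
        # in a real model these would be trained parameters).
        self.queries = rng.standard_normal((num_queries, dim)) * 0.02
        self.w_k = rng.standard_normal((dim, dim)) * 0.02
        self.w_v = rng.standard_normal((dim, dim)) * 0.02

    def __call__(self, vision_tokens):
        # vision_tokens: (N, dim) patch features from the vision encoder.
        k = vision_tokens @ self.w_k
        v = vision_tokens @ self.w_v
        # Scaled dot-product cross-attention: (R, N) weights over patches.
        attn = softmax(self.queries @ k.T / np.sqrt(k.shape[-1]))
        return attn @ v  # (R, dim): the shrunk token sequence.

# Example: shrink a 24x24 patch grid (576 tokens) to 64 tokens.
resampler = TokenShrinkageResampler(num_queries=64, dim=256)
patches = np.random.default_rng(1).standard_normal((576, 256))
shrunk = resampler(patches)
print(shrunk.shape)  # (64, 256)
```

The payoff of this pattern is that the LLM's attention cost scales with the fixed query count R rather than the image resolution, which is what makes feeding high-resolution images into a 14B LLM tractable.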


We surpassed QwenVL Max on the MMBench-Test-CN set.


Qualitative results by Bumblebee

BibTeX

@article{jin2024bumblebee,
  title={Bumblebee: Advancing Beyond Closed-Source Multi-Modal Models through Token Shrinkage},
  author={Fagang Jin and Chen Tong and Lin You},
  year={2024},
  primaryClass={cs.CV}
}