June 4, 2024

I recently came across two papers on fusing the probability distributions of LLMs that have different vocabularies:

[https://arxiv.org/pdf/2404.12715v2](https://arxiv.org/pdf/2404.12715v2) - this paper uses a training-free method: it transforms each model's distribution into a shared space, averages them there, and then casts the result back to a distribution for a chosen model
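
To make the transform / average / cast-back idea concrete, here is a toy sketch in plain Python. It is not the paper's implementation: the projection matrices (`proj_a`, `proj_b`), the 2-d shared space, and the softmax-over-dot-products step for casting back are all assumptions I made up for illustration, and every function name here is hypothetical.

```python
import math

def to_shared(probs, projection):
    """Map a vocab-sized distribution into the shared space.

    Assumption: each token has a (hypothetical) row vector in `projection`,
    and the shared representation is the probability-weighted sum of rows.
    """
    dim = len(projection[0])
    return [sum(p * row[k] for p, row in zip(probs, projection)) for k in range(dim)]

def average(vectors):
    """Plain unweighted mean of the shared-space vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / n for k in range(dim)]

def to_distribution(shared, projection):
    """Cast a shared vector back to one model's vocabulary.

    Assumption: score each token by a dot product with its row,
    then apply a softmax so the result is a valid distribution.
    """
    scores = [sum(s * r for s, r in zip(shared, row)) for row in projection]
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy setup: model A has 3 tokens, model B has 4, shared space is 2-d.
proj_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
proj_b = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.3], [0.3, 0.7]]
p_a = [0.7, 0.2, 0.1]        # model A's next-token distribution
p_b = [0.1, 0.6, 0.2, 0.1]   # model B's next-token distribution

fused_shared = average([to_shared(p_a, proj_a), to_shared(p_b, proj_b)])
fused_for_a = to_distribution(fused_shared, proj_a)  # distribution over A's vocab
```

The point of the sketch is only that the two models never need to agree on a vocabulary: all the mixing happens in the shared space, and only the chosen model's projection is used on the way back.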

[https://openreview.net/forum?id=jiDsk12qcz](https://openreview.net/forum?id=jiDsk12qcz) - this paper trains a target LLM on the probability distributions of other LLMs (with some special logic to handle vocabulary differences)
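
As a rough picture of what "training on another model's distribution" can look like, here is a toy sketch of a distillation-style loss. It is not the paper's method: the vocabulary handling here (keep only tokens that match by exact string, drop the rest, renormalize) is a deliberately naive stand-in I am assuming for illustration, and the helper names `remap_to_target` and `kl_divergence` are hypothetical.

```python
import math

def remap_to_target(source_probs, target_vocab):
    """Project a source model's distribution onto the target vocabulary.

    Assumption: tokens are aligned by exact string match only; probability
    mass on tokens the target lacks is dropped and the rest renormalized.
    Returns None if no token overlaps.
    """
    kept = [source_probs.get(tok, 0.0) for tok in target_vocab]
    total = sum(kept)
    if total == 0.0:
        return None
    return [k / total for k in kept]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over aligned lists; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

target_vocab = ["the", "cat", "sat"]
source = {"the": 0.5, "cat": 0.3, "kitty": 0.2}  # "kitty" is not in the target vocab
target_probs = [0.4, 0.4, 0.2]                   # target model's current prediction

source_on_target = remap_to_target(source, target_vocab)
loss = kl_divergence(source_on_target, target_probs)  # quantity a trainer would minimize
```

In a real training loop this loss would be computed per token position and backpropagated through the target model; the sketch only shows the alignment-then-compare step.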

The first paper is of particular interest because it claims to outperform the …

