Feb. 6, 2024, 12:53 p.m. | /u/hypergraphs

Machine Learning www.reddit.com

Let's say I have validated an idea for handling long contexts in transformers. It enables 32x-64x longer context lengths and cuts inference time on long contexts by 32x-64x, without losing long-range information compared to a vanilla transformer of the corresponding context length. Training is a bit slower because the models are a bit bigger than vanilla.

The problem is that I have limited compute, so I can only train models in the sub-1B-parameter regime on a …

