Nov. 15, 2023, 1:23 p.m. | /u/duffano

Deep Learning www.reddit.com

Dear all,

I had a look at the encoder-decoder architecture following the seminal paper "Attention is all you need".

After running my own experiments and doing further reading, I found many sources saying that the (maximum) input lengths of the encoder and decoder are usually the same, or that there is no practical reason to use different lengths (see e.g. [https://stats.stackexchange.com/questions/603535/in-transformers-for-the-maximum-length-of-encoders-input-sequences-and-decoder](https://stats.stackexchange.com/questions/603535/in-transformers-for-the-maximum-length-of-encoders-input-sequences-and-decoder)).

What puzzles me is the "usually". I want to understand this at the mathematical level, and I …
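As a rough illustration of why "usually" is a convention rather than a mathematical requirement, here is a minimal sketch (assuming PyTorch and its built-in `nn.Transformer`): attention itself is length-agnostic, and only the position embeddings fix a maximum, so the encoder and decoder maxima can be chosen independently. The names `MAX_SRC_LEN` and `MAX_TGT_LEN` are illustrative choices, not anything from the paper.

```python
import torch
import torch.nn as nn

D_MODEL = 64
MAX_SRC_LEN = 512   # hypothetical encoder maximum
MAX_TGT_LEN = 128   # hypothetical decoder maximum, deliberately different

# Learned position embeddings sized independently for each side.
src_pos = nn.Embedding(MAX_SRC_LEN, D_MODEL)
tgt_pos = nn.Embedding(MAX_TGT_LEN, D_MODEL)

model = nn.Transformer(d_model=D_MODEL, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

# A 300-token source and a 50-token target, each within its own maximum.
src = torch.randn(1, 300, D_MODEL) + src_pos(torch.arange(300)).unsqueeze(0)
tgt = torch.randn(1, 50, D_MODEL) + tgt_pos(torch.arange(50)).unsqueeze(0)

# Cross-attention handles src_len != tgt_len with no special treatment:
# queries come from the decoder (50 positions), keys/values from the encoder (300).
out = model(src, tgt)
print(out.shape)  # torch.Size([1, 50, 64])
```

In other words, equal maxima are a common engineering default (shared tooling, padding, and batching), not something the attention equations demand.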

