April 16, 2024, 11 a.m. | /u/cephtahrioh

Deep Learning www.reddit.com

I'm training a transformer model without a causal mask on the WikiText-2 dataset to understand how a transformer would use future tokens when predicting the next token. However, in my tests the transformer without a causal mask performs worse than the one with a causal mask. Intuitively, this shouldn't be the case, because the model has access to future tokens (including the next token itself), which should be strongly predictive of the next token.

For these results, I …
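For reference, here is a minimal sketch (not the poster's code) of what the causal mask changes in self-attention, assuming PyTorch 2.x with `scaled_dot_product_attention`; the tensor shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, heads, seq_len, head_dim = 1, 2, 5, 8
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# Causal attention: position t may only attend to positions <= t.
causal_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# No mask: every position attends to the whole sequence, including the token
# it is being trained to predict at the next step, so the next-token objective
# can largely be solved by copying (label leakage) rather than prediction.
full_out = F.scaled_dot_product_attention(q, k, v)

# Equivalent explicit mask, for stacks without an is_causal flag.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = (q @ k.transpose(-2, -1)) / head_dim**0.5
scores = scores.masked_fill(mask, float("-inf"))
manual_causal_out = F.softmax(scores, dim=-1) @ v

print(torch.allclose(causal_out, manual_causal_out, atol=1e-6))  # True
```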

