April 16, 2024, 11 a.m. | /u/cephtahrioh

Deep Learning www.reddit.com

I'm training a transformer model without a causal mask on the WikiText-2 dataset to understand how a transformer would use future tokens when predicting the next token. However, in my tests, the transformer without a causal mask performs worse than the one with a causal mask. Intuitively, this shouldn't be the case, because the model has access to future tokens (including the next token itself), which should be highly informative for predicting the next token.

For these results, I …
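To make the comparison concrete, here is a minimal PyTorch sketch of the two setups being compared (illustrative only, not the OP's actual code; model sizes and the dummy batch are assumptions). The only difference between the runs is whether a causal attention mask is passed to the encoder; the next-token loss is computed identically.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not the OP's configuration.
vocab_size, d_model, seq_len = 10_000, 256, 128

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=4,
)
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (2, seq_len))  # dummy batch

# Causal run: position i may only attend to positions <= i.
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
h_causal = encoder(embed(tokens), mask=causal_mask)

# Non-causal run: no mask, so every position attends to all positions,
# including the future token it is being asked to predict.
h_full = encoder(embed(tokens))

# Next-token loss, computed the same way for both runs:
# position t predicts token t+1.
logits = lm_head(h_full)  # (batch, seq_len, vocab_size)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
```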

