Feb. 8, 2024, 5:45 a.m. | Riccardo Rende Federica Gerace Alessandro Laio Sebastian Goldt

stat.ML updates on arXiv.org

Transformers are neural networks that revolutionised natural language processing and machine learning. They process sequences of inputs, like words, using a mechanism called self-attention, which is trained via masked language modelling (MLM). In MLM, a word is randomly masked in an input sequence, and the network is trained to predict the missing word. Despite the practical success of transformers, it remains unclear what type of data distribution self-attention can learn efficiently. Here, we show analytically that if one decouples the …
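To make the MLM training objective concrete, here is a minimal sketch of masking and predicting a single word with one layer of self-attention. All names, sizes, the toy random "text", and the reserved mask id are illustrative assumptions, not the authors' setup.

```python
# Minimal MLM sketch: mask one token per sequence, train a single
# self-attention layer to predict the hidden word. Hyperparameters
# and the synthetic data below are assumptions for illustration.
import torch
import torch.nn as nn

vocab_size, seq_len, d_model, batch = 100, 16, 32, 8
MASK_ID = 0  # token id 0 reserved as the [MASK] symbol (assumption)

embed = nn.Embedding(vocab_size, d_model)
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
readout = nn.Linear(d_model, vocab_size)
params = list(embed.parameters()) + list(attn.parameters()) + list(readout.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    # Random token sequences stand in for real sentences (assumption).
    tokens = torch.randint(1, vocab_size, (batch, seq_len))
    # Mask one random position per sequence; the masked word is the target.
    pos = torch.randint(0, seq_len, (batch,))
    targets = tokens[torch.arange(batch), pos].clone()
    masked = tokens.clone()
    masked[torch.arange(batch), pos] = MASK_ID

    x = embed(masked)
    h, _ = attn(x, x, x)                            # one layer of self-attention
    logits = readout(h[torch.arange(batch), pos])   # predict only the masked word
    loss = loss_fn(logits, targets)

    opt.zero_grad()
    loss.backward()
    opt.step()
```

In practice the masked positions are chosen per batch as above, and the loss is computed only at those positions, so the network must use the surrounding context to reconstruct the missing word.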

cond-mat.dis-nn cond-mat.stat-mech cs.CL stat.ML
