Jan. 3, 2022, 11:38 p.m. | /u/ctrl_alt_deleter

Natural Language Processing www.reddit.com

In the BERT paper it says that during training it masks a fraction of the words and replaces them with random words:

The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, we replace the i-th token with (1) the [MASK] token 80% of the time (2) a random token 10% of the time (3) the unchanged i-th token 10% of the time.

I can't wrap my head around the explanation …
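For concreteness, here is a minimal sketch of how that 80/10/10 rule could be applied to a sequence of token ids (the names `mask_token_id`, `vocab_size`, and the `-100` "ignore" label convention are placeholders/assumptions, not taken from the paper):

```python
import random

def mask_tokens(token_ids, mask_token_id, vocab_size, select_prob=0.15):
    """BERT-style masking sketch: choose ~15% of positions for prediction,
    then apply the 80/10/10 rule to each chosen position."""
    inputs = list(token_ids)
    labels = [-100] * len(inputs)          # -100 = position not scored by the loss
    for i in range(len(inputs)):
        if random.random() < select_prob:  # this position is "chosen for prediction"
            labels[i] = inputs[i]          # target is always the ORIGINAL token
            r = random.random()
            if r < 0.8:                    # 80%: replace with [MASK]
                inputs[i] = mask_token_id
            elif r < 0.9:                  # 10%: replace with a random token
                inputs[i] = random.randrange(vocab_size)
            # remaining 10%: leave the token unchanged
    return inputs, labels
```

Note that even in the 10% "unchanged" case the position is still predicted, since the label keeps the original token.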

bert languagetechnology
