Language models can model a text corpus C by approximating the probability distribution p over natural language strings Y∈C, with Y={y\_1,y\_2,y\_3,…,y\_T}. At each position t, the model outputs a probability distribution q over candidate words y∈V, the model’s vocabulary, and it is trained to maximize the log-likelihood of q(Y) (equivalently, to minimize the negative log-likelihood). By the chain rule, q(Y) is the product of the probabilities of each word given its prior context, q(y\_t|Y\_{<t}).
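
As a minimal sketch of this factorization, the snippet below computes the log-likelihood of one sequence from its per-word conditional probabilities. The function name `sequence_log_likelihood` and the probability values are hypothetical and chosen only for illustration; a real model would produce these probabilities from its output distribution over V at each position.

```python
import math


def sequence_log_likelihood(cond_probs):
    """Return log q(Y) as the sum of log q(y_t | Y_<t) over all positions.

    Summing logs is equivalent to taking the log of the product of the
    conditional probabilities, and it avoids numerical underflow on long
    sequences.
    """
    return sum(math.log(p) for p in cond_probs)


# Hypothetical conditional probabilities for a three-word sequence:
# q(y_1), q(y_2 | y_1), q(y_3 | y_1, y_2).
cond_probs = [0.5, 0.25, 0.8]

log_q = sequence_log_likelihood(cond_probs)

# Training usually minimizes the negative log-likelihood, which is the
# same as maximizing the log-likelihood.
nll = -log_q

print(f"log q(Y) = {log_q:.4f}")
print(f"q(Y)     = {math.exp(log_q):.4f}")
print(f"NLL      = {nll:.4f}")
```

Exponentiating the summed log-probabilities recovers the product of the conditionals, so the two views of q(Y) agree; working in log space is simply the more numerically stable choice.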

Despite the learned distribution being optimal, in the sense that …

