June 6, 2024, 4:15 a.m. | Sajjad Ansari

Grokking is a recently observed phenomenon in which a model begins to generalize well long after it has overfit the training data. It was first seen in a two-layer Transformer trained on a simple dataset. In grokking, generalization emerges only after many more training iterations than it takes to overfit. This delay demands substantial computational resources, making it less […]

