Feb. 12, 2024, 3:35 p.m. | /u/KD_A

Machine Learning www.reddit.com

There are many examples where this is done. Here is one: https://huggingface.co/datasets/nguha/legalbench/viewer/abercrombie/test

Can't web crawlers easily pick this up and add it to a future LLM's pretraining data? I'm not very familiar with how datasets like C4 are constructed.

