Feb. 12, 2024, 3:35 p.m. | /u/KD_A

Machine Learning www.reddit.com

There are many examples of benchmark test data being posted publicly in plain, easy-to-read form. Here is one: https://huggingface.co/datasets/nguha/legalbench/viewer/abercrombie/test

Can't web crawlers easily pick this up and add it to a future LLM's pretraining data? I'm not very familiar with how datasets like C4 are constructed.

