April 21, 2024, 10:18 a.m. | /u/msaoudallah

Data Science www.reddit.com

Hello all,

I have a problem where I have to predict a class for each line in a PDF. My dataset consists of lines from different PDF files. When I shuffle the dataset and split it randomly by line into train and test sets, I get a high score (>0.96), but when I group the dataset by document and take some documents for training and others for testing, I get a very poor score (<0.9).
what do you …
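
For reference, here is a minimal sketch of the document-level split described above, using only the Python standard library. The function name `split_by_document`, the `doc_ids` list, and the toy file names are hypothetical and only for illustration; my real pipeline is not shown here. The idea is that every line from a given PDF lands entirely in train or entirely in test, so the test score reflects unseen documents rather than unseen lines from documents the model has already seen.

```python
import random


def split_by_document(doc_ids, test_fraction=0.2, seed=0):
    """Split line indices so that no document appears in both sets.

    doc_ids: one document identifier per line (hypothetical input format).
    Returns (train_idx, test_idx) as lists of line indices.
    """
    # Shuffle the unique documents, not the individual lines.
    docs = sorted(set(doc_ids))
    random.Random(seed).shuffle(docs)

    # Hold out a fraction of the documents (at least one) for testing.
    n_test = max(1, round(len(docs) * test_fraction))
    held_out = set(docs[:n_test])

    train_idx = [i for i, d in enumerate(doc_ids) if d not in held_out]
    test_idx = [i for i, d in enumerate(doc_ids) if d in held_out]
    return train_idx, test_idx


# Toy data (made up): 12 lines coming from 5 PDF files.
doc_ids = (["a.pdf"] * 3 + ["b.pdf"] * 2 + ["c.pdf"] * 4
           + ["d.pdf"] * 1 + ["e.pdf"] * 2)

train_idx, test_idx = split_by_document(doc_ids, test_fraction=0.4)
train_docs = {doc_ids[i] for i in train_idx}
test_docs = {doc_ids[i] for i in test_idx}

# No document contributes lines to both sets.
print(train_docs.isdisjoint(test_docs))
```

If scikit-learn is available, `GroupShuffleSplit` or `GroupKFold` with the document id passed as `groups` should give the same kind of split without hand-rolled code.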
