Oct. 10, 2023, 4:33 p.m. | /u/data_scallion

Data Science www.reddit.com

Hi everyone!

Is there a simple, robust method for extracting highly tabular data from a PDF without resorting to rule-based regex parsing? I'm currently using PDFMiner, pdfplumber, and regex to build a separate extraction template for each type of PDF, but it's very time-consuming and tedious. Is there a better way?
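To make the question concrete, here is a minimal sketch of one regex-free direction: rebuilding the table from word coordinates instead of text patterns. It assumes word boxes shaped like the dicts that pdfplumber's `extract_words()` returns (`text`, `x0`, `x1`, `top`). The function name `words_to_table` and the three tolerance values are hypothetical and would need tuning per document, and the approach assumes left-aligned columns.

```python
from typing import Dict, List


def words_to_table(
    words: List[Dict],
    row_tol: float = 3.0,    # max vertical distance for words to share a row (assumed value)
    word_gap: float = 5.0,   # max horizontal gap for words to share a cell (assumed value)
    col_tol: float = 10.0,   # max spread of left edges within one column (assumed value)
) -> List[List[str]]:
    """Group positioned words into a rows-by-columns grid using coordinates only."""
    if not words:
        return []

    # 1. Cluster words into rows by comparing each `top` with the row's first word.
    rows: List[List[Dict]] = []
    for w in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if rows and abs(w["top"] - rows[-1][0]["top"]) <= row_tol:
            rows[-1].append(w)
        else:
            rows.append([w])

    # 2. Within each row, merge words separated by a small gap into one cell chunk.
    chunked_rows: List[List[Dict]] = []
    for row in rows:
        chunks: List[Dict] = []
        for w in sorted(row, key=lambda w: w["x0"]):
            if chunks and w["x0"] - chunks[-1]["x1"] <= word_gap:
                chunks[-1]["text"] += " " + w["text"]
                chunks[-1]["x1"] = w["x1"]
            else:
                chunks.append({"text": w["text"], "x0": w["x0"], "x1": w["x1"]})
        chunked_rows.append(chunks)

    # 3. Derive column anchors from the chunks' left edges.
    anchors: List[float] = []
    for x in sorted(c["x0"] for chunks in chunked_rows for c in chunks):
        if not anchors or x - anchors[-1] > col_tol:
            anchors.append(x)

    # 4. Put each chunk in the column whose anchor is nearest; missing cells stay "".
    table: List[List[str]] = []
    for chunks in chunked_rows:
        cells = [""] * len(anchors)
        for c in chunks:
            idx = min(range(len(anchors)), key=lambda i: abs(anchors[i] - c["x0"]))
            cells[idx] = (cells[idx] + " " + c["text"]).strip()
        table.append(cells)
    return table


# Made-up word boxes standing in for one page's extracted words.
sample_words = [
    {"text": "Item", "x0": 50, "x1": 75, "top": 100},
    {"text": "Qty", "x0": 200, "x1": 220, "top": 100},
    {"text": "Unit", "x0": 300, "x1": 322, "top": 100},
    {"text": "price", "x0": 325, "x1": 350, "top": 100.5},
    {"text": "Widget", "x0": 50, "x1": 85, "top": 120},
    {"text": "4", "x0": 201, "x1": 206, "top": 120},
    {"text": "9.50", "x0": 300, "x1": 322, "top": 121},
    {"text": "Gadget", "x0": 50, "x1": 88, "top": 140},
    {"text": "12.00", "x0": 300, "x1": 326, "top": 140},
]

for table_row in words_to_table(sample_words):
    print(table_row)
```

With a real file, the word list would come from something like `page.extract_words()` on a pdfplumber page. pdfplumber also ships its own `page.extract_tables()`, which may already be enough for PDFs with ruled or cleanly aligned tables.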

I've used LangChain and OpenAI to build "Chat with your document" apps, which work great for uploading a PDF of a whitepaper and asking it to …

