Oct. 10, 2023, 4:33 p.m. | /u/data_scallion

Data Science www.reddit.com

Hi everyone!

Is there a simple and robust method for extracting highly tabular data from a PDF without resorting to rule based regex parsing? I'm currently using PDFminer, PDFplumber and regex to build templates to extract PDFs based on the type of PDF but it's very time-consuming and tedious. Is there a better way?

I've used Langchain and OpenAI to build "Chat with your document" apps which works great for uploading a PDF of a whitepaper and asking it to …

build data datascience extract parsing pdf pdfminer regex simple tabular tabular data type

Senior Machine Learning Engineer

@ GPTZero | Toronto, Canada

ML/AI Engineer / NLP Expert - Custom LLM Development (x/f/m)

@ HelloBetter | Remote

Doctoral Researcher (m/f/div) in Automated Processing of Bioimages

@ Leibniz Institute for Natural Product Research and Infection Biology (Leibniz-HKI) | Jena

Seeking Developers and Engineers for AI T-Shirt Generator Project

@ Chevon Hicks | Remote

Senior Applied Data Scientist

@ dunnhumby | London

Principal Data Architect - Azure & Big Data

@ MGM Resorts International | Home Office - US, NV