May 23, 2022, 4:22 p.m. | Benjamin McCloskey

Towards Data Science - Medium towardsdatascience.com

Many of the documents used for Natural Language Processing (NLP) today are in .pdf format. Reading PDFs into Python, while not extremely difficult, is not as simple as typing pd.read_pdf('file_name.pdf'). Today I am going to provide code that lets you not only read a .pdf file into Python but also build a function that uses regular expressions to find the metadata of your document.
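As a rough idea of what that can look like, here is a minimal sketch and not the article's own code. It assumes the third-party pypdf package for text extraction, and the names read_pdf_text, find_metadata, and METADATA_PATTERNS, along with the "Title:" / "Author:" / ISO-date patterns, are hypothetical choices made for illustration. Your documents will likely need different patterns.

```python
import re


def read_pdf_text(path):
    """Return the text of every page in a PDF as one string.

    Assumes the third-party pypdf package is installed; it is imported
    here so the regex helper below still works without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(path)
    # extract_text() may yield nothing for image-only pages, hence the fallback.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# Hypothetical patterns: adjust them to how your documents state their metadata.
METADATA_PATTERNS = {
    "title": re.compile(r"^Title:[ \t]*(.+)$", re.MULTILINE | re.IGNORECASE),
    "author": re.compile(r"^Author:[ \t]*(.+)$", re.MULTILINE | re.IGNORECASE),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
}


def find_metadata(text):
    """Return the first match for each metadata field, or None if absent."""
    found = {}
    for field, pattern in METADATA_PATTERNS.items():
        match = pattern.search(text)
        found[field] = match.group(1).strip() if match else None
    return found


# Stand-in for text that read_pdf_text('file_name.pdf') might return.
sample = "Title: Quarterly Report\nAuthor: Jane Doe\nPublished 2022-05-23\nBody text"
metadata = find_metadata(sample)
print(metadata)
```

Keeping extraction and pattern matching in separate functions means the regular expressions can be tried on plain strings first, before any real PDF is involved.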


Python …

