Feb. 2, 2024, 2:47 a.m.

Simon Willison's Weblog simonwillison.net

unstructured


Relatively new but impressively capable Python library (Apache 2 licensed) for extracting information from unstructured documents, such as PDFs, images, Word documents and many other formats.


I got some good initial results against a PDF by running "pip install 'unstructured[pdf]'" and then using the "unstructured.partition.pdf.partition_pdf(filename)" function.
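As a rough sketch of that workflow, the snippet below wraps the partition_pdf call and tallies what comes back. It assumes the library has been installed with the pip command above. The helper names extract_elements and count_element_types are hypothetical and not part of unstructured, and grouping by class name is just one convenient way to get a first look at the output.

```python
from collections import Counter


def extract_elements(filename):
    """Partition a PDF into a list of document elements."""
    # Imported lazily so count_element_types can be used and tested
    # without the unstructured package installed.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(filename=filename)


def count_element_types(elements):
    """Tally elements by their Python class name (hypothetical helper)."""
    return Counter(type(el).__name__ for el in elements)


def preview(elements, limit=10, width=80):
    """Return one short 'ClassName: text' line for each of the first few elements."""
    # str(el) is used here on the assumption that it yields the element's text.
    return [f"{type(el).__name__}: {str(el)[:width]}" for el in elements[:limit]]
```

Calling extract_elements("example.pdf") (a placeholder filename) and passing the result to count_element_types and preview should give a quick sense of how the document was split up before doing anything more involved with it.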


There are a lot of moving parts under the hood: pytesseract, OpenCV, various PDF libraries, even an ONNX model - but it installed cleanly for me on macOS and worked out of the …
