March 30, 2024, 5:59 p.m.

Simon Willison's Weblog simonwillison.net

I attended the Story Discovery At Scale data journalism conference at Stanford this week. One of the perennial hot topics at any journalism conference concerns data extraction: how can we best get data out of PDFs and images?


I've been having some very promising results with Gemini Pro 1.5, Claude 3 and GPT-4 Vision recently - I'll write more about that soon. But those tools are still inconvenient for most people to use.


Meanwhile, older tools like Tesseract OCR are …

