Jan. 20, 2024, 2:44 p.m. | /u/ragavsachdeva

Machine Learning www.reddit.com

Paper: [http://arxiv.org/abs/2401.10224](http://arxiv.org/abs/2401.10224)

Github: [https://github.com/ragavsachdeva/magi](https://github.com/ragavsachdeva/magi)

Try it yourself: [https://huggingface.co/spaces/ragavsachdeva/the-manga-whisperer/](https://huggingface.co/spaces/ragavsachdeva/the-manga-whisperer/)

TLDR: Given a high-resolution manga page as input, Magi (our model) can (i) detect panels, characters, and text blocks, (ii) cluster characters (without making any assumptions about the number of ground-truth clusters), (iii) match text blocks to their speakers, (iv) perform OCR, and (v) generate a transcript of who said what and when (by sorting the panels and text boxes in reading order). See the figure below for an example. Wanted to share …
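To make point (ii) a bit more concrete, here is a minimal sketch of one generic way to cluster detections without fixing the number of clusters in advance: threshold a pairwise affinity matrix and take connected components. This is not claimed to be how Magi does it; the function name `cluster_by_affinity`, the `threshold` value, and the toy affinity scores are all assumptions made up for illustration.

```python
def cluster_by_affinity(affinity, threshold=0.5):
    """Group items into clusters via connected components.

    `affinity` is a symmetric n x n list of lists with scores in [0, 1]
    (hypothetical "same character" scores). Two items are linked when
    their score exceeds `threshold`; the number of clusters is whatever
    falls out of the links, so it never has to be specified up front.
    """
    n = len(affinity)
    parent = list(range(n))  # union-find forest, each item starts alone

    def find(i):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if affinity[i][j] > threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i  # merge the two components

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy example: four character crops, where 0/1 and 2/3 look alike.
toy_affinity = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.05, 0.1],
    [0.1, 0.05, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
print(cluster_by_affinity(toy_affinity))
```

With a threshold of 0.5 the toy matrix yields two clusters; raising the threshold above 0.9 would leave every crop in its own cluster, which is the sense in which the cluster count is an output rather than an input.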

