Oct. 15, 2022, 7:10 p.m. | /u/Super-Martingale

Machine Learning www.reddit.com

We are trying to standardize a list of millions of company name strings. The same company can appear in different rows because of abbreviations, nicknames, subsidiaries, business units, typos, etc., so we need a way to group rows that refer to the same company. Given the size of our data, is there an efficient way to do this standardization?

Below is an example in which all strings should be grouped as a single company:

JPMorgan …
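For concreteness, one common way to frame this kind of task is normalize, block, then fuzzy-match. The sketch below is only an illustration under stated assumptions, not a recommended or complete solution: the suffix list, the first-token blocking key, the 0.85 threshold, and the function names `normalize` and `group_names` are all hypothetical choices, and it uses only the Python standard library. Blocking is what keeps the work manageable at scale, since names are compared only within a block rather than all against all.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to drop; real data would need a longer one.
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation", "co",
                  "company", "llc", "ltd", "limited", "plc"}


def normalize(name):
    """Lowercase, replace punctuation with spaces, drop trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def group_names(names, threshold=0.85):
    """Group raw names that normalize identically or fuzzy-match within a block."""
    # Step 1: collapse names that are identical after normalization.
    by_norm = defaultdict(list)
    for raw in names:
        by_norm[normalize(raw)].append(raw)

    # Step 2: block by first token (an assumed key) to avoid all-pairs comparison.
    blocks = defaultdict(list)
    for norm in by_norm:
        key = norm.split()[0] if norm else ""
        blocks[key].append(norm)

    # Step 3: union-find over normalized names, merging similar pairs per block.
    parent = {norm: norm for norm in by_norm}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for members in blocks.values():
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                if SequenceMatcher(None, a, b).ratio() >= threshold:
                    parent[find(a)] = find(b)

    # Step 4: collect the raw names under each root.
    groups = defaultdict(list)
    for norm, raws in by_norm.items():
        groups[find(norm)].extend(raws)
    return list(groups.values())
```

Calling `group_names` on a list of raw strings returns a list of groups, each a list of the original strings. This sketch only catches formatting variants and small typos; first-token blocking misses typos in the first word, and nicknames, subsidiaries, and business units would need a lookup table or a learned matcher on top.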
