May 26, 2022, 2:06 p.m. | /u/synthphreak

Natural Language Processing www.reddit.com

I have scraped about 30 million Reddit comments. Now I want to use them to train some classification models. But this volume of data is proving seriously challenging to work with.

My current setup is that the comment strings are stored as a `dask.Series`. At first I was using `dask` methods to clean the comments in parallel (this step involves multiple passes, each using regex), then using `apply(nlp)` to convert each comment into a `spacy` `Doc` (this just uses …
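For context, here is a minimal sketch of the pipeline as described: regex cleaning in parallel over a `dask` Series, then `apply(nlp)` to produce `spacy` `Doc`s. The regexes, the toy two-comment series, and the model choice are placeholders, since the excerpt does not show the author's actual patterns or data.

```python
import re

import dask.dataframe as dd
import pandas as pd
import spacy

# Placeholder cleaning passes -- the post does not show its actual regexes.
CLEANUP_PATTERNS = [
    (re.compile(r"https?://\S+"), ""),    # drop URLs
    (re.compile(r"&(gt|lt|amp);"), " "),  # drop leftover HTML entities
    (re.compile(r"\s+"), " "),            # collapse runs of whitespace
]

def clean(text: str) -> str:
    """Run every regex pass over one comment."""
    for pattern, repl in CLEANUP_PATTERNS:
        text = pattern.sub(repl, text)
    return text.strip()

# Toy stand-in for the 30 million scraped comments.
comments = dd.from_pandas(
    pd.Series(
        ["Check this out: https://example.com &gt; wow", "hello   world"],
        name="comment",
    ),
    npartitions=2,
)

# Step 1: parallel regex cleaning with dask.
cleaned = comments.map(clean, meta=("comment", "object"))

# Step 2: apply(nlp) turns each cleaned comment into a spacy Doc.
# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
docs = cleaned.apply(nlp, meta=("comment", "object"))

print(docs.compute().head())
```

Since each `Doc` carries per-token annotations, materializing 30 million of them is memory-heavy; batching per partition with `nlp.pipe` inside `map_partitions` is a common lighter-weight variant of this step.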

dask efficiency languagetechnology medium memory spacy
