May 26, 2022, 2:06 p.m. | /u/synthphreak

Natural Language Processing www.reddit.com

I have scraped about 30 million Reddit comments. Now I want to use them to train some classification models. But this volume of data is proving seriously challenging to work with.

My current setup is that the comment strings are stored as a `dask.Series`. At first I was using `dask` methods to clean the comments in parallel (this step involves multiple passes, each using regex), then using `apply(nlp)` to convert each comment into a `spacy` `Doc` (this just uses …
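For context, here is a minimal sketch of the pipeline as described: regex cleaning in parallel over a `dask` Series, then `apply(nlp)` to produce `spacy` `Doc`s. The regexes, the toy two-comment series, and the model choice are placeholders, since the excerpt does not show the author's actual patterns or data.

```python
import re

import dask.dataframe as dd
import pandas as pd
import spacy

# Placeholder cleaning passes -- the post does not show its actual regexes.
CLEANUP_PATTERNS = [
    (re.compile(r"https?://\S+"), ""),    # drop URLs
    (re.compile(r"&(gt|lt|amp);"), " "),  # drop leftover HTML entities
    (re.compile(r"\s+"), " "),            # collapse runs of whitespace
]

def clean(text: str) -> str:
    """Run every regex pass over one comment."""
    for pattern, repl in CLEANUP_PATTERNS:
        text = pattern.sub(repl, text)
    return text.strip()

# Toy stand-in for the 30 million scraped comments.
comments = dd.from_pandas(
    pd.Series(
        ["Check this out: https://example.com &gt; wow", "hello   world"],
        name="comment",
    ),
    npartitions=2,
)

# Step 1: parallel regex cleaning with dask.
cleaned = comments.map(clean, meta=("comment", "object"))

# Step 2: apply(nlp) turns each cleaned comment into a spacy Doc.
# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
docs = cleaned.apply(nlp, meta=("comment", "object"))

print(docs.compute().head())
```

Since each `Doc` carries per-token annotations, materializing 30 million of them is memory-heavy; batching per partition with `nlp.pipe` inside `map_partitions` is a common lighter-weight variant of this step.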

dask efficiency languagetechnology medium memory spacy
