Web: https://www.reddit.com/r/LanguageTechnology/comments/vgd2zh/why_does_sklearnpipeline_with_regex_outperform/

June 20, 2022, 4:38 a.m. | /u/synthphreak

Natural Language Processing reddit.com

# TL;DR

I need help selecting between `spacy` and `sklearn` for processing a huge text corpus. I ran a test to measure the performance of each, but the results were unexpected. Moreover, because I'm new-ish to the frameworks involved, I lack confidence that my test is completely valid. I'd really appreciate some guidance.

- [Code](https://pastebin.com/dBAW1rrD)

- [Results](https://imgur.com/a/dhJmNGi)
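For readers who can't open the links, here is a minimal sketch of what a fair timing comparison might look like. It is not the code from the pastebin. Both preprocessing functions are hypothetical stand-ins (neither `spacy` nor `sklearn` is imported), and the sample corpus is made up. The point is only the harness: every candidate gets the same input, each run is repeated, and the fastest run is reported to reduce noise.

```python
import re
import time
from typing import Callable, Iterable, List

# Precompile once so regex compilation is not counted in the timings.
TOKEN_RE = re.compile(r"\w+")


def regex_preprocess(texts: Iterable[str]) -> List[List[str]]:
    """Hypothetical stand-in for a regex-based step: lowercase, then extract word tokens."""
    return [TOKEN_RE.findall(t.lower()) for t in texts]


def split_preprocess(texts: Iterable[str]) -> List[List[str]]:
    """Hypothetical stand-in for a second candidate: lowercase, then split on whitespace."""
    return [t.lower().split() for t in texts]


def best_time(fn: Callable[[List[str]], object], texts: List[str], repeats: int = 5) -> float:
    """Run fn on the same texts several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(texts)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Made-up sample; a real test would draw a random subset of the actual corpus.
sample = ["This is a comment.", "Another one, with punctuation!"] * 1000

for name, fn in [("regex", regex_preprocess), ("split", split_preprocess)]:
    print(f"{name}: {best_time(fn, sample):.4f} s")
```

To compare real frameworks, each one would be wrapped in a function with the same signature and passed to `best_time`. Any one-time setup, such as loading a model, should happen outside the timed call unless that cost is part of what is being measured.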

# Background

I'm doing a project that involves preprocessing 35 million Reddit comments. This is a massive amount of text. So I'm searching …

