June 25, 2023, 6:31 a.m. | /u/Lockonon3


I'm looking for different ways to summarize documents with vector embeddings (rough sketches after the list):

* centroid of word2vec embeddings
* doc2vec, using the distributed bag of words (DBOW) variant, since word order doesn't really matter for this particular task
* the [CLS] embedding from BERT
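
For concreteness, here's a rough sketch of what the three options could look like with gensim and Hugging Face transformers. The model names, vector sizes, and toy documents are just placeholder assumptions, not a settled setup.

```python
import numpy as np
import torch
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from transformers import AutoTokenizer, AutoModel

# Toy pre-tokenized documents (placeholders for the real corpus)
docs = [["vector", "embeddings", "summarize", "documents"],
        ["tfidf", "bag", "of", "words", "centroid"]]

# 1) Centroid of word2vec embeddings: average the vectors of a doc's tokens
w2v = Word2Vec(sentences=docs, vector_size=100, min_count=1, epochs=20)

def w2v_centroid(tokens, model):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# 2) doc2vec in DBOW mode (dm=0 selects distributed bag of words)
tagged = [TaggedDocument(words=d, tags=[i]) for i, d in enumerate(docs)]
d2v = Doc2Vec(tagged, dm=0, vector_size=100, min_count=1, epochs=40)
dbow_vec = d2v.infer_vector(docs[0])

# 3) [CLS] embedding from BERT (bert-base-uncased as a placeholder checkpoint)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def cls_embedding(text):
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[:, 0, :].squeeze(0)  # position 0 is [CLS]

bert_vec = cls_embedding(" ".join(docs[0]))
```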

To keep things economical, I plan to keep only the top 20 tf-idf words of each document, so word order is completely arbitrary anyway.
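
A minimal sketch of that pruning step, assuming scikit-learn's TfidfVectorizer and raw (untokenized) document strings; the k=20 cutoff is the one from the post.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

raw_docs = [
    "first raw document text ...",   # placeholders for the real documents
    "second raw document text ...",
]

vec = TfidfVectorizer()
X = vec.fit_transform(raw_docs)      # (n_docs, vocab_size) sparse tf-idf matrix
vocab = vec.get_feature_names_out()

def top_k_terms(row, k=20):
    """Keep only the k highest-scoring tf-idf terms of one document."""
    scores = row.toarray().ravel()
    top_idx = scores.argsort()[::-1][:k]
    return [vocab[i] for i in top_idx if scores[i] > 0]

pruned_docs = [top_k_terms(X[i]) for i in range(X.shape[0])]
```

The pruned token lists could then be fed straight into the word2vec centroid or DBOW approaches sketched above.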

