[D] Best Practices for Semantic Search on 200k vectors (30GB) Worth of Embeddings? | allainews.com

Jan. 28, 2024, 8:15 a.m. | /u/stoicbats_

Machine Learning www.reddit.com

Hi, I have converted a dataset of 200k domain-specific names into embeddings. All the embeddings were generated using OpenAI's embedding model 3 (3072 dimensions per embedding). Now I am planning to implement semantic similarity search: given a domain keyword, I want to find the top 5 most similar matches. After embedding all 280k words, the JSON file containing the embeddings is around 30GB. (Edit: as suggested, saved in msgpack format, 6.5GB …
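
For a collection this size, exact brute-force search may already be enough. Assuming the vectors are stored as float32, 200k embeddings of 3072 dimensions take roughly 2.5 GB in memory (about 3.4 GB for 280k), so one matrix-vector product per query is feasible on a single machine. The sketch below is one possible approach, not the poster's method: the function names `normalize_rows` and `top_k_similar` are hypothetical, and small random vectors stand in for the real embeddings.

```python
import numpy as np


def normalize_rows(matrix):
    """Scale every row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)


def top_k_similar(query, unit_embeddings, k=5):
    """Return (indices, scores) of the k rows most similar to `query`.

    `unit_embeddings` is assumed to be row-normalized already.
    """
    q = query / max(np.linalg.norm(query), 1e-12)
    scores = unit_embeddings @ q  # cosine similarity against every row
    k = min(k, scores.shape[0])
    # argpartition finds the k best candidates without sorting all rows
    candidates = np.argpartition(-scores, k - 1)[:k]
    # sort only those k candidates by descending score
    order = candidates[np.argsort(-scores[candidates])]
    return order, scores[order]


# Stand-in data (assumption): 1000 random 64-dim vectors instead of
# the real 3072-dim embeddings.
rng = np.random.default_rng(0)
embeddings = normalize_rows(rng.standard_normal((1000, 64)).astype(np.float32))

# A query that is a slightly perturbed copy of row 42.
query = embeddings[42] + 0.01 * rng.standard_normal(64).astype(np.float32)

indices, scores = top_k_similar(query, embeddings, k=5)
print(indices, scores)
```

Normalizing once up front keeps each query to a single matrix-vector product. If latency or memory becomes a problem at larger scale, an approximate nearest-neighbor index is the usual next step, at the cost of exactness.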
