Aug. 18, 2023, 6:07 a.m. | Eric Zhù

Towards Data Science - Medium towardsdatascience.com

Finding Needles in a Haystack — Search Indexes for Jaccard Similarity

From basic concepts to exact and approximate indexes

Finding Needles in a Haystack. Image by the author using Midjourney.

Vector databases are in the news for being the external memory of large language models (LLMs). The vector databases today are new systems built on decade-old research called approximate nearest neighbor (ANN) indexes. These indexing algorithms takes many high-dimensional vectors (e.g., float32[]), and built a data structure that supports finding …

author basic concepts databases haystack high-dimensional-data image jaccard language language models large language large language models llms lsh memory midjourney minhash search vector vector databases

Senior Machine Learning Engineer

@ GPTZero | Toronto, Canada

ML/AI Engineer / NLP Expert - Custom LLM Development (x/f/m)

@ HelloBetter | Remote

Doctoral Researcher (m/f/div) in Automated Processing of Bioimages

@ Leibniz Institute for Natural Product Research and Infection Biology (Leibniz-HKI) | Jena

Seeking Developers and Engineers for AI T-Shirt Generator Project

@ Chevon Hicks | Remote

Principal Data Architect - Azure & Big Data

@ MGM Resorts International | Home Office - US, NV

GN SONG MT Market Research Data Analyst 11

@ Accenture | Bengaluru, BDC7A