Aug. 18, 2023, 6:07 a.m. | Eric Zhù

Towards Data Science - Medium towardsdatascience.com

Finding Needles in a Haystack — Search Indexes for Jaccard Similarity

From basic concepts to exact and approximate indexes

Finding Needles in a Haystack. Image by the author using Midjourney.

Vector databases are in the news for being the external memory of large language models (LLMs). The vector databases today are new systems built on decade-old research called approximate nearest neighbor (ANN) indexes. These indexing algorithms takes many high-dimensional vectors (e.g., float32[]), and built a data structure that supports finding …

author basic concepts databases haystack high-dimensional-data image jaccard language language models large language large language models llms lsh memory midjourney minhash search vector vector databases

Lead Developer (AI)

@ Cere Network | San Francisco, US

Research Engineer

@ Allora Labs | Remote

Ecosystem Manager

@ Allora Labs | Remote

Founding AI Engineer, Agents

@ Occam AI | New York

AI Engineer Intern, Agents

@ Occam AI | US

AI Research Scientist

@ Vara | Berlin, Germany and Remote