July 22, 2023, 1:42 p.m. | /u/Qdr-91

Machine Learning www.reddit.com

As far as I know and have studied, each token is mapped from a high-dimensional, discrete token space into a continuous, lower-dimensional space where words are embedded meaningfully based on their relationships in the training data. So a 1000-token text produces 1000 vectors.
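For concreteness, here is a minimal sketch of that per-token picture as I understand it, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (an arbitrary example model, not one mentioned above):

```python
# Minimal sketch: one continuous vector per token.
# Assumes the Hugging Face `transformers` library and "bert-base-uncased"
# purely as an example model choice.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Vector databases store embeddings.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Shape: (1, num_tokens, hidden_size) -- one vector per token, so a
# 1000-token text would give 1000 vectors of size 768 with this model.
print(outputs.last_hidden_state.shape)
```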


Now, for vector databases (correct me if I'm wrong), people are storing fixed-size vectors for texts of varying lengths. For example, take two sentences, one with 1000 tokens and the other with 10 tokens: each produces one …
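If I understand the usual setup correctly, the per-token vectors get pooled into a single fixed-size vector before storage. A rough sketch of mean pooling, which I believe is one common choice (the helper below is my own illustration, not from any particular library):

```python
import torch

def mean_pool(token_vectors: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average each text's token vectors, ignoring padding positions."""
    # token_vectors: (batch, num_tokens, hidden_size)
    # attention_mask: (batch, num_tokens), 1 for real tokens, 0 for padding
    mask = attention_mask.unsqueeze(-1).float()
    summed = (token_vectors * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts  # (batch, hidden_size): one fixed-size vector per text

# A 10-token sentence and a 1000-token sentence both come out as a single
# hidden_size-dimensional vector, which is what would go into the vector database.
```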

