Aug. 2, 2023, 6:20 a.m. | 1littlecoder


FalconLite is a quantized version of the Falcon 40B SFT OASST-TOP1 model that can process long input sequences (up to 11K tokens) while consuming 4x less GPU memory. By combining 4-bit GPTQ quantization with an adapted dynamic NTK RotaryEmbedding, FalconLite strikes a balance between latency, accuracy, and memory efficiency. Because it can handle contexts 5x longer than the original model, FalconLite is well suited to applications such as topic retrieval, summarization, and question answering.
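For context, the dynamic NTK trick rescales the rotary-embedding base once the input grows past the model's training length, so positional frequencies are interpolated rather than extrapolated. Below is a minimal Python sketch of that idea, following the common Hugging Face formulation; the dim, base, max_pos, and scaling_factor values are illustrative assumptions, not FalconLite's actual configuration.

import torch

def dynamic_ntk_inv_freq(seq_len, dim=64, base=10000.0,
                         max_pos=2048, scaling_factor=4.0):
    # Past the training length, stretch the rotary base so low
    # frequencies interpolate instead of extrapolating.
    if seq_len > max_pos:
        base = base * (
            (scaling_factor * seq_len / max_pos) - (scaling_factor - 1)
        ) ** (dim / (dim - 2))
    # Standard RoPE inverse frequencies with the (possibly) rescaled base.
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))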

FalconLite - https://huggingface.co/amazon/FalconLite

AutoGPTQ - https://github.com/PanQiWei/AutoGPTQ
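As a rough sketch, a GPTQ-quantized checkpoint like this could be loaded locally with the AutoGPTQ library linked above; the OpenAssistant-style prompt format and generation settings here are assumptions inherited from the SFT base model, not taken from the FalconLite documentation.

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "amazon/FalconLite"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",         # 4-bit weights fit on a single GPU
    use_safetensors=True,    # assumption: weights shipped as safetensors
    trust_remote_code=True,  # custom Falcon / NTK-RoPE modeling code
)

# Assumed OASST-style prompt format from the SFT base model.
prompt = "<|prompter|>Summarize this document: ...<|endoftext|><|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))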

Fine-tuned Falcon model …

