Aug. 31, 2023, noon | code_your_own_AI


Python code to implement "Reinforcement Learning from Human Feedback" (RLHF) on a Llama 2 model with 4-bit quantization, LoRA, and the new DPO method from Stanford University (instead of the older PPO). Fine-tune Llama 2 with DPO.

A1. Code for supervised fine-tuning (SFT) of a Llama 2 model with 4-bit quantization (see the SFT sketch below).
A2. Code for the DPO trainer from Hugging Face with PEFT, LoRA, 4-bit bitsandbytes, ... (see the DPO sketch below).
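
For A1, here is a minimal SFT sketch using the Hugging Face transformers / peft / bitsandbytes / trl stack. The trainer signature follows the trl ~0.7-era API (newer releases moved several arguments into SFTConfig); the model id, dataset, and all hyperparameters are illustrative assumptions, not necessarily the video's exact values. The same pattern also covers B1, with a Llama 1 checkpoint swapped in.

```python
# Supervised fine-tuning (SFT) of Llama 2 in 4-bit with LoRA -- a minimal sketch.
# Assumes transformers + peft + bitsandbytes + trl (~0.7 API); signatures vary by version.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

model_id = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; requires accepted license on the Hub

# 4-bit NF4 quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

# LoRA adapters on the attention projections (illustrative hyperparameters)
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Any instruction dataset with a "text" column works; this one is a common demo choice.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

trainer = SFTTrainer(
    model=model,
    args=TrainingArguments(output_dir="sft-llama2",
                           per_device_train_batch_size=2,
                           gradient_accumulation_steps=8,
                           learning_rate=2e-4,
                           logging_steps=10, max_steps=500),
    train_dataset=dataset,
    dataset_text_field="text",
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_seq_length=512,
)
trainer.train()
trainer.save_model("sft-llama2-adapter")  # saves only the small LoRA adapter
```
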
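For A2, a minimal DPO sketch on top of the SFT step, again assuming the trl ~0.7-era DPOTrainer API. DPO needs a preference dataset with "prompt", "chosen", and "rejected" text columns; the tiny in-memory dataset below is a placeholder so the snippet stays self-contained.

```python
# DPO fine-tuning -- a minimal sketch (trl ~0.7 API; newer releases use DPOConfig).
import torch
from datasets import Dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import DPOTrainer

model_id = "meta-llama/Llama-2-7b-hf"  # ideally load the SFT checkpoint from step A1 instead
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config,
                                             device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")

# Placeholder preference data; a real run would map e.g. a human-feedback dataset
# onto these three columns.
dataset = Dataset.from_dict({
    "prompt":   ["Explain DPO in one sentence."],
    "chosen":   ["DPO optimizes the policy directly on preference pairs, "
                 "without training a separate reward model."],
    "rejected": ["I don't know."],
})

trainer = DPOTrainer(
    model,
    ref_model=None,  # with a peft_config, trl uses the base weights (adapters off) as reference
    beta=0.1,        # strength of the implicit KL penalty toward the reference model
    args=TrainingArguments(output_dir="dpo-llama2",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           learning_rate=5e-5,
                           logging_steps=10, max_steps=50,
                           remove_unused_columns=False),
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_length=512,
    max_prompt_length=256,
)
trainer.train()
```

Passing ref_model=None together with a peft_config is the memory-saving trick here: instead of keeping a second frozen copy of the model as the DPO reference, trl simply disables the LoRA adapters to recover the reference policy.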

B1. Code for supervised fine-tuning of a Llama 1 model with 4-bit quantization and LoRA.
B2. Code for reward modelling of a Llama 1 model with 4-bit quantization (see the sketch after this list).
B3. …
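
For B2, reward modelling trains the LM with a single-logit head to score chosen answers above rejected ones. Below is a minimal sketch with trl's RewardTrainer, again assuming the ~0.7-era API; the Llama 1 checkpoint id and the toy preference pair are assumptions for illustration.

```python
# Reward modelling -- a minimal sketch with trl's RewardTrainer (~0.7 API).
import torch
from datasets import Dataset
from peft import LoraConfig
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import RewardTrainer

model_id = "huggyllama/llama-7b"  # a Llama 1 checkpoint; any causal base model works
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# The reward model is the LM with a scalar (num_labels=1) head on top.
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=1, quantization_config=bnb_config, device_map="auto"
)
model.config.pad_token_id = tokenizer.pad_token_id

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="SEQ_CLS")

# Toy preference pair; RewardTrainer expects pre-tokenized chosen/rejected columns.
pairs = Dataset.from_dict({
    "chosen":   ["Question: 2+2? Answer: 4."],
    "rejected": ["Question: 2+2? Answer: 5."],
})

def tokenize(batch):
    chosen = tokenizer(batch["chosen"], truncation=True, max_length=512)
    rejected = tokenizer(batch["rejected"], truncation=True, max_length=512)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

dataset = pairs.map(tokenize, batched=True)

trainer = RewardTrainer(
    model=model,
    args=TrainingArguments(output_dir="rm-llama",
                           per_device_train_batch_size=1,
                           learning_rate=1e-5, max_steps=100,
                           remove_unused_columns=False),
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```
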

Tags: code, feedback, fine-tuning, human feedback, Llama, Llama 2, LoRA, PPO, Python, quantization, reinforcement learning, RLHF, Stanford, trainer
