Feb. 1, 2024, 10:45 p.m. | /u/SatisfyingLatte

Machine Learning www.reddit.com

Paper: [https://arxiv.org/abs/2401.17263](https://arxiv.org/abs/2401.17263)


Abstract: Despite advances in AI alignment, language models (LMs) remain vulnerable to adversarial attacks or jailbreaking, in which adversaries modify input prompts to induce harmful behavior. While some defenses have been proposed, they focus on narrow threat models and fall short of a strong defense, which we posit should be effective, universal, and practical. To achieve this, we propose the first adversarial objective for defending LMs against jailbreaking attacks and an algorithm, robust prompt optimization (RPO), that uses …
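
The "adversarial objective" in the abstract reads like a minimax problem: optimize a defensive prompt (e.g. a suffix) so that, even under the worst-case jailbreak modification allowed by the threat model, the model is still pushed toward a safe response. A minimal sketch of that kind of objective, in my own notation rather than the paper's (the symbols s, a, A, and y_safe are assumptions here):

```latex
% Minimax defense objective (hypothetical notation, sketched from the abstract):
%   x        -- harmful user prompt
%   a in A   -- adversarial (jailbreak) modification drawn from the threat set A
%   s        -- defensive suffix the defender optimizes
%   y_safe   -- a safe / refusal target completion
\min_{s}\; \max_{a \in \mathcal{A}}\; -\log p_{\theta}\!\left(y_{\text{safe}} \mid x \oplus a \oplus s\right)
```

In words: the defender minimizes the worst-case loss of producing the safe target over all attacks in the threat set, rather than the loss under any single fixed attack. The details of how RPO actually instantiates and optimizes this are in the paper.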

