Feb. 6, 2024, 5:44 a.m. | Xiaolong Jin, Zhuo Zhang, Xiangyu Zhang

cs.LG updates on arXiv.org

Large Language Model (LLM) alignment aims to ensure that LLM outputs align with human values. Researchers have demonstrated the severity of alignment problems through a broad spectrum of jailbreak techniques that can induce LLMs to produce malicious content during conversations. Finding the corresponding jailbreak prompts usually requires substantial human intelligence or computational resources. In this paper, we report that LLMs exhibit different levels of alignment in different contexts. As such, by systematically constructing many contexts, called worlds, leveraging a Domain …
