DyVal 2: Dynamic Evaluation of Large Language Models by Meta Probing Agents
Feb. 26, 2024, 5:42 a.m. | Kaijie Zhu, Jindong Wang, Qinlin Zhao, Ruochen Xu, Xing Xie
cs.LG updates on arXiv.org
Abstract: Evaluation of large language models (LLMs) has raised serious concerns in the community due to the issue of data contamination. Existing work has designed evaluation protocols with well-defined algorithms for specific tasks, but these cannot be easily extended to diverse scenarios. Moreover, current evaluation benchmarks provide only overall results and cannot support a fine-grained, multifaceted analysis of LLMs' abilities. In this paper, we propose meta probing agents (MPA), a general dynamic evaluation protocol …
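The abstract's core idea is that a static benchmark can be memorized during pre-training, whereas a dynamic protocol regenerates fresh evaluation items on demand. The sketch below is a minimal, hypothetical illustration of that principle (not the paper's actual MPA method): a seeded generator instantiates a question template with new values each run, so the exact test item is unlikely to appear in any training corpus while the probed skill stays fixed.

```python
import random


def generate_dynamic_item(question_template, seed):
    """Instantiate a fresh evaluation item from a template, so a model
    cannot rely on having memorized a fixed benchmark instance.

    `question_template` and the arithmetic task here are illustrative
    placeholders, not part of the MPA protocol described in the paper.
    """
    rng = random.Random(seed)  # seeded for reproducible evaluation runs
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    question = question_template.format(a=a, b=b)
    answer = a + b  # gold answer is recomputed alongside the new item
    return question, answer


# Each evaluation run draws a new instance of the same underlying skill.
q1, ans1 = generate_dynamic_item("What is {a} + {b}?", seed=1)
q2, ans2 = generate_dynamic_item("What is {a} + {b}?", seed=2)
```

Because the gold answer is derived from the freshly sampled values, the protocol can score models on an effectively unbounded item pool while still being reproducible from the seed.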