Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources. (arXiv:2201.10066v1 [cs.CL])

Jan. 26, 2022, 2:10 a.m. | Angelina McMillan-Major, Zaid Alyafeai, Stella Biderman, Kimbo Chen, Francesco De Toni, Gérard Dupont, Hady Elsahar, Chris Emezue, Alham Fikri Aj

cs.CL updates on arXiv.org arxiv.org

In recent years, large-scale data collection efforts have prioritized the
amount of data collected in order to improve the modeling capabilities of large
language models. This prioritization, however, has resulted in concerns with
respect to the rights of data subjects represented in data collections,
particularly when considering the difficulty in interrogating these collections
due to insufficient documentation and tools for analysis. Mindful of these
pitfalls, we present our methodology for a documentation-first, human-centered
data collection project as part of the …
