[R] Introducing The Vault: A new multilingual dataset for advancing code understanding and generation

May 12, 2023, 9:17 a.m. | /u/bdqnghi

Machine Learning www.reddit.com

We are releasing a new dataset for code understanding and generation, in the same vein as The Pile (EleutherAI) and The Stack (BigCode Project). The difference is that we put a lot of effort into making the data much cleaner: we wrote parsers that extract each code comment (docstring) together with the code it documents, producing high-quality docstring-code pairs.
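To make the idea of docstring-code pairs concrete, here is a minimal sketch for Python source only. It is not the Vault's actual extraction pipeline: it assumes Python's standard `ast` module is good enough for illustration, and the function name `extract_pairs` and the `SAMPLE` snippet are hypothetical. See the technical report and the repository for how the real parsers work across languages.

```python
import ast


def extract_pairs(source: str) -> list[dict[str, str]]:
    """Collect a docstring/code pair for every documented function in `source`."""
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)  # cleaned docstring, or None
            if not doc:
                continue  # undocumented functions cannot form a pair
            pairs.append(
                {
                    "name": node.name,
                    "docstring": doc,
                    # Exact source text of the function (Python 3.8+).
                    "code": ast.get_source_segment(source, node),
                }
            )
    return pairs


# Hypothetical input: one documented and one undocumented function.
SAMPLE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def helper(x):
    return x * 2
'''

for pair in extract_pairs(SAMPLE):
    print(pair["name"], "->", pair["docstring"])
```

Only `add` yields a pair here, because `helper` has no docstring. In this sketch the `code` field still contains the docstring text; a real pipeline would likely strip it from the code side and apply further quality filters.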

Read more about the Vault in our technical report: [https://arxiv.org/abs/2305.06156](https://arxiv.org/abs/2305.06156)

GitHub page: [https://github.com/FSoft-AI4Code/TheVault](https://github.com/FSoft-AI4Code/TheVault)

