May 12, 2023, 9:17 a.m. | /u/bdqnghi

Machine Learning www.reddit.com

We are releasing a new dataset for code understanding and generation, in the same vein as The Pile (EleutherAI) and The Stack (BigCode Project). However, we put a lot of effort into making the data much cleaner by writing parsers that extract each code comment (docstring) and its code into high-quality pairs.
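To give a rough idea of what extracting docstring-code pairs can look like, here is a minimal sketch for Python source only, using the standard-library `ast` module. This is not the parser used to build the dataset; the function name `extract_pairs`, the record fields, and the sample input are all made up for illustration.

```python
import ast


def extract_pairs(source: str) -> list[dict]:
    """Return one record per documented function found in `source`.

    Hypothetical helper for illustration; not the dataset's actual parser.
    """
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            docstring = ast.get_docstring(node)
            if not docstring:
                continue  # keep only functions that carry a docstring
            pairs.append({
                "name": node.name,
                "docstring": docstring,
                # Full function text (the docstring is still included here).
                "code": ast.get_source_segment(source, node),
            })
    return pairs


# Made-up sample input: one documented and one undocumented function.
sample = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def undocumented(x):
    return x
'''

pairs = extract_pairs(sample)
for pair in pairs:
    print(pair["name"], "->", pair["docstring"])
```

Filtering out undocumented functions is the simplest possible cleaning step; a real pipeline would likely need per-language parsers and further quality filters, as described in the technical report.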

Read more about The Vault in our technical report: [https://arxiv.org/abs/2305.06156](https://arxiv.org/abs/2305.06156)

GitHub page: [https://github.com/FSoft-AI4Code/TheVault](https://github.com/FSoft-AI4Code/TheVault)

