OpenProteinSet: Training data for structural biology at scale. (arXiv:2308.05326v1 [q-bio.BM])
cs.LG updates on arXiv.org
Multiple sequence alignments (MSAs) of proteins encode rich biological
information and have been workhorses in bioinformatic methods for tasks like
protein design and protein structure prediction for decades. Recent
breakthroughs like AlphaFold2 that use transformers to attend directly over
large quantities of raw MSAs have reaffirmed their importance. Generation of
MSAs is highly computationally intensive, however, and no datasets comparable
to those used to train AlphaFold2 have been made available to the research
community, hindering progress in machine learning for …