May 12, 2024, 8:55 p.m. | /u/bjergerk1ng

Machine Learning www.reddit.com

Hello all! Sharing my side project here: [https://github.com/andylolu2/simpleGEMM](https://github.com/andylolu2/simpleGEMM) !

This is an *extremely* minimalistic but fast implementation of matrix multiplication in CUDA. The source code is a single, 200-line CUDA/C++ file which implements fp16 tensor core matrix multiplication, optimised for the Turing (SM75) architecture. The goals are to:

1. Write a matmul kernel that does not sacrifice performance. In fact, it's faster than PyTorch/cuBLAS if you [test it on a T4 in Colab](https://colab.research.google.com/github/andylolu2/simpleGEMM/blob/master/colab/simpleGEMM.ipynb)!
2. Make it hackable for new purposes. For …
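For readers unfamiliar with tensor-core programming, here is a minimal sketch of the underlying idea using CUDA's standard WMMA API: one warp computes one 16×16 output tile by accumulating 16×16×16 tensor-core MMAs over the K dimension. This is *not* the repo's implementation (which is more heavily optimised with tiling and shared-memory pipelining); kernel and variable names here are illustrative, and it assumes M, N, K are multiples of 16 with A row-major and B column-major.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Illustrative sketch only: each warp computes one 16x16 tile of C = A * B.
// A is MxK row-major, B is KxN column-major, C is MxN row-major, all fp16.
// Assumes M, N, K are multiples of 16 and the grid has one warp per C tile.
__global__ void wmma_matmul_sketch(const half *A, const half *B, half *C,
                                   int M, int N, int K) {
    int tileM = blockIdx.y * 16;  // row offset of this warp's C tile
    int tileN = blockIdx.x * 16;  // column offset of this warp's C tile

    // Register-resident fragments for the tensor-core MMA.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> cFrag;
    wmma::fill_fragment(cFrag, __float2half(0.0f));

    // Slide along K in 16-wide steps, accumulating into cFrag.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(aFrag, A + tileM * K + k, K);
        wmma::load_matrix_sync(bFrag, B + tileN * K + k, K);
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);
    }
    wmma::store_matrix_sync(C + tileM * N + tileN, cFrag, N,
                            wmma::mem_row_major);
}
```

The real kernel gets its speed from staging tiles of A and B through shared memory and overlapping loads with MMAs, but the fragment/load/mma/store loop above is the core pattern that any WMMA-based GEMM builds on.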

