[P] SimpleGEMM: Fast and minimal tensor core matrix multiplication in CUDA
May 12, 2024, 8:55 p.m. | /u/bjergerk1ng
Machine Learning www.reddit.com
This is an *extremely* minimalistic but fast implementation of matrix multiplication in CUDA. The source code is a single, 200-line CUDA/C++ file that implements fp16 tensor core matrix multiplication, optimised for the Turing (SM75) architecture. The goals are to:
1. Write a matmul kernel that does not sacrifice performance. In fact, it's faster than PyTorch/cuBLAS if you [test it on a T4 in Colab](https://colab.research.google.com/github/andylolu2/simpleGEMM/blob/master/colab/simpleGEMM.ipynb)!
2. Make it hackable for new purposes. For …
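For readers unfamiliar with what "tensor core matrix multiplication" looks like at the source level: the sketch below is *not* from the SimpleGEMM repo (which should be consulted directly for its actual approach); it is a minimal, generic illustration using CUDA's standalone WMMA API, where each warp accumulates one 16x16 output tile. It assumes row-major fp16 inputs with dimensions that are multiples of 16, and requires SM70+ hardware to run.

```cuda
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Minimal WMMA sketch: each warp computes one 16x16 tile of C = A @ B.
// A, B are row-major fp16; C is a row-major fp32 accumulator.
// Assumes M, N, K are all multiples of 16 (no bounds checks).
__global__ void wmma_gemm(const half *A, const half *B, float *C,
                          int M, int N, int K) {
    // Which 16x16 output tile this warp owns.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // March along the K dimension, one 16-wide slab at a time.
    for (int k = 0; k < K; k += 16) {
        const half *a_tile = A + (warpM * 16) * K + k;        // 16xK row stride
        const half *b_tile = B + k * N + warpN * 16;          // KxN row stride
        wmma::load_matrix_sync(a_frag, a_tile, K);            // leading dim = K
        wmma::load_matrix_sync(b_frag, b_tile, N);            // leading dim = N
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);       // tensor core MAC
    }

    float *c_tile = C + (warpM * 16) * N + warpN * 16;
    wmma::store_matrix_sync(c_tile, c_frag, N, wmma::mem_row_major);
}
```

A naive kernel like this leaves a lot of performance on the table (it loads each tile straight from global memory, with no shared-memory staging or pipelining); fast implementations layer those optimisations on top, which is exactly where the bulk of a tuned 200-line kernel's complexity tends to live.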