May 2, 2022, 1:03 p.m. | /u/Competitive-Rub-1958

Machine Learning www.reddit.com

[The MT-NLG model](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/) was 530B parameters compared to PaLM's 540B. They seem to have done things correctly from what I skimmed; however, their model is neither that impressive on benchmarks, nor does it demonstrate any special capabilities.

So what was the reason MT-NLG didn't work as well as expected? Is it possible it has abilities to explain jokes (on par with PaLM) that simply went undiscovered by the authors? Or are there any gaping flaws in how they scale the different …

