Oct. 5, 2022, 4:37 p.m. | /u/mishtimoi

Machine Learning www.reddit.com

Hi,

So I have an empirical observation: when I train a large model vs. the same model in a staggered fashion, i.e. some layers are frozen and others receive gradient updates, the latter takes more training time even though the number of trainable parameters is smaller. This leads me to suspect that the detach() operation is the culprit. I cannot find many resources online to help me understand the time complexity of the detach() operation in torch. Did anyone …
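To make the setup concrete, here is a minimal sketch of what I mean by staggered training (the toy model, layer split, and sizes are illustrative, not my actual setup). It shows the two usual ways of freezing a prefix in PyTorch: turning off requires_grad on the frozen parameters, versus detaching the intermediate activation before the trainable tail. detach() itself only returns a view of the same storage with no autograd history, so no data is copied.

```python
import time
import torch
import torch.nn as nn

# Toy model standing in for the real one (illustrative sizes).
model = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)

x = torch.randn(256, 1024)
target = torch.randint(0, 10, (256,))
loss_fn = nn.CrossEntropyLoss()

def step_freeze_requires_grad():
    # Freeze the first two Linear layers by turning off requires_grad;
    # autograd then skips recording grad nodes for those weights.
    for layer in (model[0], model[2]):
        for p in layer.parameters():
            p.requires_grad_(False)
    out = model(x)
    loss_fn(out, target).backward()

def step_freeze_detach():
    # Alternative: run the frozen prefix without building a graph, detach
    # its output, and only backprop through the trainable tail.
    with torch.no_grad():
        h = model[:4](x)
    out = model[4](h.detach())
    loss_fn(out, target).backward()

# Crude wall-clock comparison on CPU; on GPU you would need
# torch.cuda.synchronize() around the timing calls.
for step in (step_freeze_requires_grad, step_freeze_detach):
    model.zero_grad(set_to_none=True)
    t0 = time.perf_counter()
    step()
    print(step.__name__, f"{time.perf_counter() - t0:.4f}s")
```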

complexity machinelearning
