April 17, 2022, 10:24 a.m. | /u/sniffykix

Data Science www.reddit.com

I recently had a situation where including a completely unrelated pseudorandom variable yielded marginally better CV metrics.

Could this be explained by the fact that this noise mitigates some overfitting to noise in the more powerful features?

Or maybe this was just a one-off random observation?
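One way to tell a real effect from a one-off is to repeat the comparison many times with fresh noise columns and fresh CV splits, then look at the spread of the score differences. Below is a rough sketch of that check, not the setup from my actual problem: the synthetic data from `make_regression`, the random forest settings, and the helper name `noise_feature_deltas` are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score


def noise_feature_deltas(n_repeats=8, seed=0):
    """Return mean CV R^2 differences (with noise column minus without)."""
    # Synthetic stand-in data (an assumption, not the original dataset).
    X, y = make_regression(n_samples=200, n_features=5, n_informative=3,
                           noise=10.0, random_state=seed)
    rng = np.random.default_rng(seed)
    deltas = []
    for rep in range(n_repeats):
        # Same folds and same model seed for both runs, so the only
        # difference between them is the extra pseudorandom column.
        cv = KFold(n_splits=5, shuffle=True, random_state=rep)
        model = RandomForestRegressor(n_estimators=30, random_state=rep)
        base = cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
        # Draw a new, unrelated noise column on every repeat.
        X_noise = np.column_stack([X, rng.normal(size=len(y))])
        with_noise = cross_val_score(model, X_noise, y, cv=cv,
                                     scoring="r2").mean()
        deltas.append(with_noise - base)
    return np.array(deltas)


deltas = noise_feature_deltas()
print(f"mean delta: {deltas.mean():+.4f}, std: {deltas.std():.4f}")
print(f"runs where noise 'helped': {(deltas > 0).sum()} of {len(deltas)}")
```

If the deltas scatter around zero with a spread comparable to the improvement originally seen, that points toward chance rather than a genuine regularizing effect.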

Either way, I don’t think it justifies including the variable in the final model. I think there are more appropriate ways to reduce the overfitting.

Just interested if this is a known phenomenon or if …

