The Hidden Complexity of Your Modeling Process
Wednesday, May 26, 2021
Models generalize best when their complexity matches the problem. To avoid overfitting, practitioners usually trade accuracy off against complexity, measured by the number of parameters. But this measure is surprisingly flawed. For example, a parameter is equivalent to one "degree of freedom" only in linear regression; it can cost more than 4 degrees of freedom in a decision tree, and less than 1 in a neural network. Worse, a major source of complexity, over-search, remains hidden entirely: the vast exploration of candidate model structures leaves no trace on the final (perhaps simple-looking) model, yet has outsized influence on whether that model can be trusted.
I’ll show how Generalized Degrees of Freedom (GDF, introduced by Ye) can measure the full complexity of an algorithmic modeling process. This makes it possible to compare very different models fairly, and to be more confident about out-of-sample accuracy. GDF also clarifies how seemingly complex ensemble models avoid overfitting, and reveals a new type of outlier: cases with high influence on the model.
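To make the idea concrete, here is a minimal sketch of Ye's Monte Carlo recipe for estimating GDF: perturb each response with small Gaussian noise, refit the model, and sum the per-observation slopes of fitted value versus perturbation. The function names (`gdf`, `ols_fit_predict`) and the perturbation settings are my own illustrative choices, not from the original post; any black-box `fit_predict(X, y)` routine can be plugged in.

```python
import numpy as np

def gdf(fit_predict, X, y, sigma=0.1, n_reps=300, seed=0):
    """Monte Carlo estimate of Ye's Generalized Degrees of Freedom.

    GDF is the sum over observations of the sensitivity of each fitted
    value to its own response: sum_i d E[yhat_i] / d y_i. We estimate
    each sensitivity by regressing the change in fitted value i on the
    noise added to response i, across many perturbed refits.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    deltas = sigma * rng.standard_normal((n_reps, n))
    base = fit_predict(X, y)                       # fit on unperturbed data
    fits = np.array([fit_predict(X, y + d) for d in deltas])
    # Per-observation least-squares slope of (yhat_i - base_i) on delta_i.
    num = (deltas * (fits - base)).sum(axis=0)
    den = (deltas ** 2).sum(axis=0)
    return float((num / den).sum())

def ols_fit_predict(X, y):
    """Fitted values from ordinary least squares (the hat matrix applied to y)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta
```

For a linear smoother such as OLS, the estimate recovers the trace of the hat matrix, i.e. the parameter count, which is why "one parameter = one degree of freedom" holds there. Substituting a tree learner for `ols_fit_predict` exposes the extra, hidden degrees of freedom that search consumes.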