Date: 2017-10-12

Why is the paper “Understanding Deep Learning Requires Rethinking Generalization” important? originally appeared on Quora: the place to gain and share knowledge, empowering people to learn from others and better understand the world.

Answer by Eric Jang, Research engineer at Google Brain, on Quora:

As of 2017, many ML researchers are trying to crack the problem of “how do Deep Neural nets work, and why do they work so well in practice for problems we care about?”

Even if one does not care too much about the theoretical analysis and algebra, better intuitions of why DL works so well can help us make DL work even better for real-world applications.

The “Understanding Deep Learning Requires Rethinking Generalization” paper (Zhang et al. 2016) shows some intriguing properties of neural networks: specifically, they have enough capacity to memorize completely random data. Trained with ordinary SGD, standard image classifiers can drive the training error to zero, or close to it, on benchmark-sized datasets even when the labels are assigned at random or the images are replaced with random noise (think white noise on your television set in place of images).
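
As a rough illustration of that memorization claim, here is a minimal NumPy sketch. It is not the paper's setup: instead of a deep net trained with SGD, it assumes a fixed random ReLU hidden layer with a least-squares output layer, and all sizes and names (`w_hidden`, `features`, and so on) are made up for the example. The only point is that a model with more parameters than training points can fit pure noise perfectly while learning nothing that transfers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, dim, width = 200, 1000, 32, 2000

# Inputs are Gaussian noise and labels are coin flips: there is nothing to learn.
x_train = rng.standard_normal((n_train, dim))
y_train = rng.choice([-1.0, 1.0], size=n_train)
x_test = rng.standard_normal((n_test, dim))
y_test = rng.choice([-1.0, 1.0], size=n_test)

# Fixed random hidden layer (an assumed stand-in for a trained deep net).
w_hidden = rng.standard_normal((dim, width))

def features(x):
    """Random ReLU features of the inputs."""
    return np.maximum(x @ w_hidden, 0.0) / np.sqrt(width)

# More features than training points, so least squares can interpolate the labels.
w_out, *_ = np.linalg.lstsq(features(x_train), y_train, rcond=None)

train_acc = np.mean(np.sign(features(x_train) @ w_out) == y_train)
test_acc = np.mean(np.sign(features(x_test) @ w_out) == y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Under these assumptions the training labels are fit exactly, while accuracy on fresh random labels should sit near chance, which is the "memorization without generalization" regime in miniature.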

This runs counter to the classic narrative where “Deep Learning magically discovers lower level features, middle-level features, and higher-level features like the V1 system of the mammalian brain by learning to compress data”. From ~2012–2015 a lot of researchers used this “inductive bias” to explain how Deep Nets tend to have low test error, suggesting some form of generalization.

But if a deep net is also capable of memorizing random data, it suggests that the generalization capabilities are not entirely explained by the inductive biases of the model (e.g. the convolutional/pooling architecture, usage of regularization like Dropout, batchnorm), since these same inductive biases are also compatible with memorization.

Part of the reason this paper has received so much attention is that it received a “perfect score” in the ICLR reviews and won an ICLR 2017 best paper award. That got a lot of people talking about the paper and asking themselves why it was so great, so there’s a bit of a feedback loop there. I do think it’s a great paper because it asked a question nobody was really asking and presented strong experimental evidence showing intriguing results.

However, I think it takes 1–2 years before the DL research community can really come to a consensus on whether a paper is important or not, especially because many insights are not analytically proven but rather determined empirically over many experiments, sort of like how it’s done in experimental neuroscience.

As Tapabrata Ghosh points out, some researchers argue that even though deep nets can memorize, that may not be exactly what they are doing in practice: the time required to “memorize” a semantically meaningful dataset is shorter than the time required to “memorize” a random dataset, which suggests that deep nets can exploit semantic regularities in the training set when they are present.
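
The training-time argument can be sketched on a toy problem. The code below is only an assumed analogue, not an experiment from any of the papers: it runs full-batch gradient descent on the output weights of a fixed random ReLU feature model and counts how many steps it takes to reach 90% training accuracy, once with labels that depend on the input (the sign of the first coordinate) and once with random labels. The helper `steps_to_fit`, the sizes, and the accuracy target are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, width = 300, 5, 600

x = rng.standard_normal((n, dim))
w_hidden = rng.standard_normal((dim, width))
phi = np.maximum(x @ w_hidden, 0.0) / np.sqrt(width)  # fixed random ReLU features

y_structured = np.where(x[:, 0] > 0, 1.0, -1.0)  # label depends on the input
y_random = rng.choice([-1.0, 1.0], size=n)       # label ignores the input

def steps_to_fit(features, labels, target_acc=0.9, max_steps=2000):
    """Gradient descent on squared loss; return the first step at which
    training accuracy reaches target_acc, or max_steps if it never does."""
    # Step size 1 / (largest singular value)^2 keeps the iteration stable.
    lr = 1.0 / np.linalg.norm(features, 2) ** 2
    w = np.zeros(features.shape[1])
    preds = np.zeros(features.shape[0])
    for step in range(1, max_steps + 1):
        w -= lr * features.T @ (preds - labels)
        preds = features @ w
        if np.mean(np.sign(preds) == labels) >= target_acc:
            return step
    return max_steps

steps_structured = steps_to_fit(phi, y_structured)
steps_random = steps_to_fit(phi, y_random)
print(f"structured labels: {steps_structured} steps, random labels: {steps_random} steps")
```

On this toy problem the structured labels should hit the target in far fewer steps than the random ones (which may not get there before the cap), loosely mirroring the observation that real data is "memorized" faster than noise.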

I was also inspired by the “Sharp Minima” paper (Keskar et al.), presented at the same ICLR 2017 conference, which showed that the sharp minima found by large-batch training tend to generalize poorly. Then a few months later Dinh et al. showed that local minima that generalize well can be made arbitrarily sharp (“Sharp Minima Can Generalize For Deep Nets”, arXiv:1703.04933). While it does not completely invalidate the claims made by the Keskar et al. paper, it does suggest that the picture is more complicated. Similarly, I think that while Zhang et al. 2016 has the potential to be an extremely important direction in understanding how Deep Nets work, it does not solve the generalization question of Deep Nets. Other researchers may soon publish work that challenges the ideas presented in the “Understanding Deep Learning Requires Rethinking Generalization” paper. Such is the nature of experimental science.

Succinctly, this paper is considered important because it shows that deep nets learn random datasets by memorizing them (zero generalization). This then raises the question of how they learn non-random datasets.

My own opinion on this generalization business:

It’s not altogether surprising that a high-capacity parametric model with a well-conditioned optimization objective (ReLUs, Batchnorm, high-dimensional spaces) will just soak up the input data like some kind of data sponge. I think of a deep net trained by gradient descent as an extremely “lazy” but powerful optimizer: it will discover semantically meaningful feature hierarchies if the right model biases are present and compatible with the input data, but if that solution isn’t convenient to reach, the network is perfectly happy to settle for one that just memorizes the data. Right now we just lack the means to control the degree of memorization vs. generalization, short of using blunt tools like weight norm regularization and dropout.
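
To make the “blunt tool” remark concrete, here is a small sketch of weight norm regularization acting as a memorization dial. It assumes ridge regression on fixed random ReLU features rather than a deep net, and the penalty values and the helper `train_accuracy` are illustrative only. Labels are random, so any training accuracy above chance is memorization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, width = 200, 10, 1000

x = rng.standard_normal((n, dim))
y = rng.choice([-1.0, 1.0], size=n)  # random labels: fitting them is pure memorization
phi = np.maximum(x @ rng.standard_normal((dim, width)), 0.0) / np.sqrt(width)
gram = phi @ phi.T  # n x n Gram matrix of the random features

def train_accuracy(penalty):
    """Training accuracy of ridge regression with an L2 penalty on the weights."""
    # Closed form: w = phi.T @ (gram + penalty * I)^-1 @ y, so predictions are gram @ alpha.
    alpha = np.linalg.solve(gram + penalty * np.eye(n), y)
    return np.mean(np.sign(gram @ alpha) == y)

for penalty in (1e-8, 1.0, 1e3):
    print(f"penalty={penalty:g}: train accuracy {train_accuracy(penalty):.2f}")
```

With a negligible penalty the model memorizes every random label, and with a heavy one it loses most of that ability. The dial is crude, though: the penalty shrinks the whole fit and does not distinguish memorized noise from real structure, which is the sense in which such regularizers are blunt.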