What are the major factors that motivate us to use Neural networks over Kernel methods for large datasets in layman terms? originally appeared on Quora - the knowledge sharing network where compelling questions are answered by people with unique insights.
Answer by Alex Smola, Professor, Carnegie Mellon University and Chief Scientist, 1-Page, on Quora.
This is difficult to answer. The issue here is that it should be in layman's terms, yet at the same time address specific tools such as kernel methods and neural networks. First, a bit of background:
- Computers are getting better and faster all the time. This is known as Moore's law for microprocessors and as Kryder's law for hard disks. A similar thing holds for communication links (i.e. networks). More specifically, they're all getting exponentially better, but at different rates. Disks grow the fastest, networks the slowest.
- A corollary of Kryder's law is that disks will always be full of stuff. The consequence is that lots of data ends up stored on many machines, where it's hard to move the data between them and also hard to keep much of it in main memory.
- GPUs have changed things a lot in the past decade by giving the brute force compute power a 10x to 100x boost over what CPUs could do.
The combination of these three things has created the perfect storm for neural networks: we have computers that are data rich (big disks), relatively poor on memory and fairly rich on computation (thanks to GPUs), and it is hard to send lots of data somewhere else. This is the very operating point that deep learning is great at, and this is why we're now seeing such a big surge in popularity and applications.
Kernel methods, by contrast, work by comparing pairs of data points. This means that the matrix of comparisons (aka the kernel matrix) needs lots of memory: it grows with the square of the number of data points, so 1,000 data points need 4MB for the kernel matrix, while 1 million data points would need 4TB. Hence, in the 90s kernel methods were a great tool, since the datasets were in the 1000s and computers had at least 16MB of memory. Unfortunately, this quadratic memory cost makes them tricky to use on big datasets unless you use lots of clever math. One of my students, Zichao Yang, did this by using fast random function classes and adapting them. For a fancy tensor trick see also the work of Andrew Wilson.
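To make the memory numbers concrete, here is a minimal sketch of the arithmetic. It assumes a dense kernel matrix with 4 bytes per entry (a single-precision float) and decimal units, which is what the 4MB and 4TB figures imply; the function names are made up for illustration.

```python
def kernel_matrix_bytes(n_points, bytes_per_entry=4):
    """Bytes needed for a dense n x n kernel matrix.

    Assumption: 4 bytes per entry (single-precision float).
    """
    return n_points * n_points * bytes_per_entry


def human_readable(num_bytes):
    """Format a byte count using decimal units (1 KB = 1000 B)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if num_bytes < 1000 or unit == "TB":
            return f"{num_bytes:g} {unit}"
        num_bytes /= 1000


for n in (1_000, 1_000_000):
    print(f"{n:,} data points: {human_readable(kernel_matrix_bytes(n))}")
```

Going from a thousand to a million data points multiplies the dataset by 1,000 but the kernel matrix by 1,000,000.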
What this means is that kernels are really hard to use on large amounts of data, since you end up running out of RAM. Deep networks, on the other hand, are rather compact function classes, so you can get away with compact storage, at the expense of a lot more computation to train them.
The way forward is to exploit the strengths of both of the methods and combine them, e.g. for nonparametric statistical tests, generative models, message passing, bandit algorithms and other things that need good statistical analysis and flexible models.