What happens when you, instead of training the entire network at once, train for...

colah3 · on Dec 9, 2014

Good intuition! What you are describing sounds like a technique called pretraining (in particular, greedy, layer-wise pretraining). Five years ago, pretraining was how everyone attacked this problem, although they usually did a different kind of pretraining (basically, we train a different kind of model, and then perform surgery, cutting it apart and using some of its layers as the earlier layers of our model).

More recently, people, especially the younger generation of deep learning researchers, tend to be skeptical of how much pretraining helps.

Advocates for pretraining now tend to argue that it helps you find better local minima, instead of focusing on how it helps with the vanishing gradient problem. For example, see this paper: http://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf .

As I'm sure Michael will address in coming chapters, there are a bunch of tricks you can use that make training deep neural networks a lot easier. People now tend to prefer just using those and a lot of computing power, rather than messing around with pretraining.

xtacy · on Dec 9, 2014

Could you post a few pointers about the bunch of tricks to make deep training a lot easier?

colah3 · on Dec 9, 2014

Certainly!

* The biggest one is probably just to train for a very long time. Competitive neural nets for many tasks are trained on GPUs, clusters, or GPU clusters, for days or weeks.

* Using convolutional layers really helps. Roughly, convolutional layers have multiple copies of the same neuron, applied to different inputs. This results in them needing to learn much less. It also leads to them kind of concentrating the gradient on just a few neurons. Because of this, if the first few layers of a network are convolutional layers, they are much easier to train. For a long time, these were the only kind of remotely deep neural networks we could train.

(I wrote a blog post on conv nets, which you can read here: http://colah.github.io/posts/2014-07-Conv-Nets-Modular/)

* So far, Michael's book has only talked about sigmoid neurons (I think). But you can use neurons with other activation functions. They still multiply their inputs by different weights and add a bias, but instead of applying sigmoid they apply a different function. Using a different kind of neuron, ReLU neurons, tends to help a lot. Unlike sigmoid neurons, which tend to have a very small derivative, ReLU neurons have a derivative of 1 a lot of the time. I've had mixed experiences, but most people swear by them.

* Using higher learning rates for early layers may be helpful.
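
To make the ReLU point a bit more concrete, here is a rough sketch in plain Python. It compares only the activation-derivative factors that backpropagation multiplies together across layers, and it ignores the weights entirely, which is a simplifying assumption. The helper names and the choice of 10 layers are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid; it peaks at 0.25 when z == 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_prime(z):
    # Derivative of max(0, z); taken to be 0 at z == 0 by convention.
    return 1.0 if z > 0 else 0.0

def chained_derivative(prime, z, depth):
    # Product of `depth` activation derivatives, all evaluated at the same z.
    # Real networks also multiply in the weights; this sketch leaves them out.
    return prime(z) ** depth

depth = 10
sigmoid_factor = chained_derivative(sigmoid_prime, 0.0, depth)  # best case for sigmoid
relu_factor = chained_derivative(relu_prime, 1.0, depth)        # an active ReLU unit

print(sigmoid_factor, relu_factor)
```

Even in the sigmoid's best case the factor shrinks by at least 4x per layer, while an active ReLU passes the gradient through unchanged. A ReLU with negative input contributes a factor of 0, which is one reason experiences can be mixed.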

xtacy · on Dec 9, 2014

Thank you colah. Your blog posts are inspiring! It's a lot of hard work and effort; keep it up!

dave_sullivan · on Dec 9, 2014

Just to add a couple others:

rmsprop is a great technique I don't hear talked about as much. There's an example implementation here: https://github.com/BRML/climin/blob/master/climin/rmsprop.py
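
For anyone who wants the gist without reading the linked code, here is a minimal sketch of the basic rmsprop rule in plain Python: divide each gradient by a running root-mean-square of recent gradients. This is not the climin implementation. The name `rmsprop_step`, the default hyperparameters, and the toy quadratic are assumptions picked for illustration.

```python
import math

def rmsprop_step(params, grads, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One basic rmsprop update over flat lists of floats (hypothetical helper)."""
    new_params, new_cache = [], []
    for p, g, c in zip(params, grads, cache):
        # Exponential moving average of the squared gradient.
        c = decay * c + (1.0 - decay) * g * g
        # Scale the step by the root of that average.
        p = p - lr * g / (math.sqrt(c) + eps)
        new_params.append(p)
        new_cache.append(c)
    return new_params, new_cache

def grad(params):
    # Gradient of the toy objective f(x, y) = x^2 + 100 * y^2.
    x, y = params
    return [2.0 * x, 200.0 * y]

params = [1.0, 1.0]
cache = [0.0, 0.0]
for _ in range(500):
    params, cache = rmsprop_step(params, grad(params), cache)

print(params)
```

The two coordinates have very different curvature, but because each step is normalized by its own gradient history, both move toward zero at a similar pace.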

Using nesterov momentum and a "sparse" weight initialization scheme rather than uniform: https://www.cs.toronto.edu/~hinton/absps/momentum.pdf
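
A minimal sketch of the Nesterov update in the form I understand that paper to use: evaluate the gradient at the "looked-ahead" point, then update the velocity and the parameters. The function name, the plain-list representation, and the toy objective are assumptions for illustration only, and the sparse initialization is not shown.

```python
def nesterov_step(params, velocity, grad_fn, lr=0.1, momentum=0.9):
    """One Nesterov momentum update over flat lists of floats (hypothetical helper)."""
    # Look ahead to where the current velocity would carry the parameters.
    lookahead = [p + momentum * v for p, v in zip(params, velocity)]
    # Take the gradient there, not at the current parameters.
    grads = grad_fn(lookahead)
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity

def grad(params):
    # Gradient of the toy objective f(x) = sum of x_i^2.
    return [2.0 * p for p in params]

params = [1.0, -2.0]
velocity = [0.0, 0.0]
for _ in range(200):
    params, velocity = nesterov_step(params, velocity, grad)

print(params)
```

Classical momentum would differ only in calling `grad_fn(params)` instead of `grad_fn(lookahead)`.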

Reducing the learning rate exponentially and increasing the momentum rate linearly over the course of training. Learning rate from .5 to .0001, momentum from .7 to .995. I've seen variations on this, like adjusting based on a sigmoid curve.
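
One way such a schedule could be written, as a sketch: the endpoints are the ones quoted above, but the 100-epoch length, the per-epoch granularity, and the function names are assumptions.

```python
def exponential_decay(start, end, step, total_steps):
    # Geometric interpolation: equal ratio between consecutive steps.
    # Requires total_steps > 1.
    frac = step / (total_steps - 1)
    return start * (end / start) ** frac

def linear_ramp(start, end, step, total_steps):
    # Straight-line interpolation: equal difference between consecutive steps.
    frac = step / (total_steps - 1)
    return start + (end - start) * frac

total_epochs = 100  # assumed training length
schedule = [
    (
        exponential_decay(0.5, 0.0001, epoch, total_epochs),  # learning rate
        linear_ramp(0.7, 0.995, epoch, total_epochs),         # momentum
    )
    for epoch in range(total_epochs)
]

for epoch in (0, 50, 99):
    lr, momentum = schedule[epoch]
    print(epoch, lr, momentum)
```

The sigmoid-curve variation would swap out how `frac` is computed while keeping the same endpoints.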

Dropout may or may not help; adjusting the dropout rate (the percentage of activations that are discarded) may or may not help either.
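
A sketch of what the dropout rate controls, using the "inverted" variant where surviving activations are scaled up during training so nothing needs rescaling at test time. That variant is a choice made here for brevity, and the function name and the list-of-floats representation are assumptions.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout over a flat list of floats (hypothetical helper).

    `rate` is the probability of discarding each activation; it must be < 1.
    """
    if not training or rate == 0.0:
        # At test time every activation is kept unchanged.
        return list(activations)
    keep = 1.0 - rate
    # Keep each activation with probability `keep`, scaling survivors by 1/keep
    # so the expected value of each output matches its input.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [0.2, 1.5, 0.7, 0.9]
train_out = dropout(hidden, rate=0.5, rng=rng)
test_out = dropout(hidden, rate=0.5, rng=rng, training=False)

print(train_out, test_out)
```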

Mini-batch size can make a difference. Somewhere between 2 and 200?

You can use bayesian optimization to intelligently search hyperparameters: https://github.com/JasperSnoek/spearmint

Try rmsprop though, I've heard good things.

benanne · on Dec 9, 2014

I haven't had any luck so far with rmsprop, adagrad and adadelta. SGD + Nesterov momentum has served me best.

xtacy · on Dec 9, 2014

Great, thanks for the pointers! I've tried the momentum trick before and it has helped. I'll try rmsprop.