What happens when you, instead of training the entire network at once, train for a while with a single layer, then add a second layer and train with both layers, then add a third layer and train with all three layers, and so on?
Good intuition! What you are describing sounds like a technique called pretraining (in particular, greedy layer-wise pretraining). Five years ago, pretraining was how everyone attacked this problem, although they usually did a different kind of pretraining: basically, you train a different kind of model, then perform surgery, cutting it apart and using some of its layers as the earlier layers of your model.
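For concreteness, here is a rough sketch of the grow-as-you-go scheme from the question, in plain NumPy. It is only one reading of the idea: the function names (`grow_and_train`, `new_layer`), the quadratic cost, the layer sizes, and the choice to attach a fresh output layer at each stage are all my assumptions. Note that it keeps training every layer at each stage, whereas classic greedy layer-wise pretraining usually trains each new layer on its own, often without labels.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(layers, x):
    """Return the activations of every layer, the input included."""
    activations = [x]
    for W, b in layers:
        activations.append(sigmoid(activations[-1] @ W + b))
    return activations

def train(layers, x, y, epochs, lr):
    """Full-batch gradient descent on a quadratic cost, updating all layers."""
    for _ in range(epochs):
        acts = forward(layers, x)
        # Output-layer error for a quadratic cost with sigmoid outputs.
        delta = (acts[-1] - y) * acts[-1] * (1 - acts[-1])
        for i in reversed(range(len(layers))):
            W, b = layers[i]
            grad_W = acts[i].T @ delta / len(x)
            grad_b = delta.mean(axis=0)
            if i > 0:
                # Backpropagate through layer i using its weights before the update.
                delta = (delta @ W.T) * acts[i] * (1 - acts[i])
            layers[i] = (W - lr * grad_W, b - lr * grad_b)
    return layers

def new_layer(rng, n_in, n_out):
    """Hypothetical helper: small random weights, zero biases."""
    return (rng.normal(0, 1 / np.sqrt(n_in), (n_in, n_out)), np.zeros(n_out))

def grow_and_train(x, y, hidden_sizes, epochs_per_stage=500, lr=1.0, seed=0):
    """Add one hidden layer per stage, then train the whole stack so far."""
    rng = np.random.default_rng(seed)
    hidden = []
    for size in hidden_sizes:
        n_in = hidden[-1][0].shape[1] if hidden else x.shape[1]
        hidden.append(new_layer(rng, n_in, size))
        # Assumption: a fresh output layer is attached at every stage,
        # since the width of the layer feeding it may have changed.
        output = new_layer(rng, size, y.shape[1])
        layers = train(hidden + [output], x, y, epochs_per_stage, lr)
        hidden = layers[:-1]  # keep the trained hidden layers for the next stage
    return layers

# Toy usage on XOR: three stages, ending with three hidden layers.
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
net = grow_and_train(x, y, hidden_sizes=[4, 4, 4])
print(forward(net, x)[-1].round(2))
```

Whether this does any better than training the full network from scratch is exactly the open question; the sketch only shows the mechanics.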
More recently, people, especially the younger generation of deep learning researchers, tend to be skeptical of how much pretraining helps.
Advocates for pretraining now tend to argue that it helps you find better local minima, rather than that it helps with the vanishing gradient problem. For example, see this paper: http://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf
As I'm sure Michael will address in coming chapters, there are a bunch of tricks you can use that make training deep neural networks a lot easier. These days, people tend to prefer to just use those and a lot of computing power, rather than mess around with pretraining.
* The biggest one is probably just to train for a very long time. Competitive neural nets for many tasks are trained on GPUs, clusters, or GPU clusters, for days or weeks.
* Using convolutional layers really helps. Roughly, convolutional layers have multiple copies of the same neuron, applied to different inputs. Because the copies share their weights, there is much less to learn, and the gradients from all the copies add up on that one small set of shared weights. Because of this, if the first few layers of a network are convolutional layers, they are much easier to train. For a long time, these were the only kind of remotely deep neural networks we could train.
* So far, Michael's book has only talked about sigmoid neurons (I think). But you can use neurons with other activation functions. They still multiply their inputs by different weights and add a bias, but instead of applying the sigmoid they apply a different function. Using a different kind of neuron, ReLU neurons, tends to help a lot. The sigmoid's derivative is never more than 1/4 and is close to zero when the neuron saturates, whereas a ReLU neuron has a derivative of exactly 1 whenever its input is positive. I've had mixed experiences, but most people swear by them.
* Using higher learning rates for early layers may be helpful.
* Reducing the learning rate exponentially and increasing the momentum linearly over the course of training may help, for example taking the learning rate from 0.5 to 0.0001 and the momentum from 0.7 to 0.995. I've seen variations on this, like adjusting along a sigmoid curve.
* Dropout may or may not help, and adjusting the dropout rate (the fraction of activations that are discarded) may or may not help either.
* Mini-batch size can make a difference. Somewhere between 2 and 200?
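To put numbers on the sigmoid versus ReLU point above, here is a small sketch comparing the two derivatives. The function names are mine, and it ignores the weight factors, which also appear in the backpropagated product, so treat it as an illustration of the trend rather than a full account of the vanishing gradient.

```python
import numpy as np

def sigmoid_prime(z):
    """Derivative of the sigmoid, s(z) * (1 - s(z))."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_prime(z):
    """Derivative of max(0, z): 1 for positive input, 0 otherwise."""
    return (np.asarray(z) > 0).astype(float)

z = np.linspace(-5, 5, 1001)
print("largest sigmoid derivative on the grid:", sigmoid_prime(z).max())
print("fraction of grid where the ReLU derivative is 1:", relu_prime(z).mean())

# Ignoring weights, ten sigmoid layers scale the gradient by at most 0.25 ** 10,
# while ten active ReLU layers scale it by 1.
print("upper bound after 10 sigmoid layers:", 0.25 ** 10)
```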
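The learning rate and momentum schedule mentioned above could be sketched like this. The endpoint values are the ones quoted in the list; the function names, the per-step parameterisation, and the choice of geometric interpolation for "exponentially" are assumptions on my part.

```python
def learning_rate(step, total_steps, start=0.5, end=0.0001):
    """Exponential (geometric) decay from start to end over the run."""
    t = step / (total_steps - 1)  # 0 at the first step, 1 at the last
    return start * (end / start) ** t

def momentum(step, total_steps, start=0.7, end=0.995):
    """Linear increase from start to end over the run."""
    t = step / (total_steps - 1)
    return start + (end - start) * t

# Print the schedule at a few points of a hypothetical 100-epoch run.
for epoch in (0, 25, 50, 75, 99):
    print(epoch, learning_rate(epoch, 100), momentum(epoch, 100))
```

Both functions assume `total_steps` is at least 2. Swapping the interpolation for a sigmoid-shaped curve, as in the variations mentioned, would only change how `t` is mapped.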