It might be basically the same process as today, but with several big new ideas (some of which may seem simple in retrospect...)
The quality of the training set is also critical, more so than the quantity. Some of these clever ideas for generating lots of training data without any labeling work, such as "guess the next word", can't really capture semantics.
I think it really takes multi-task training, like what the article we're discussing advocates. That forces the upstream part of the network to learn features that capture semantics shared across all the tasks.
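To make the idea concrete, here is a toy sketch (my own illustration, not from the article) of the architecture being described: a shared "upstream" encoder whose weights feed two separate task heads. All names and sizes are made up; the point is just that both tasks read from the same trunk, so during training the trunk's features would have to serve both objectives.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not from the article.
d_in, d_hidden = 8, 4

W_shared = rng.normal(size=(d_in, d_hidden))  # upstream features, shared by both tasks
W_task_a = rng.normal(size=(d_hidden, 3))     # head for hypothetical task A (3 classes)
W_task_b = rng.normal(size=(d_hidden, 2))     # head for hypothetical task B (2 classes)

def encode(x):
    # Shared trunk: both task losses would backpropagate through these
    # same weights, pushing them toward features useful for both tasks.
    return np.tanh(x @ W_shared)

def task_a(x):
    return encode(x) @ W_task_a

def task_b(x):
    return encode(x) @ W_task_b

x = rng.normal(size=(5, d_in))   # a batch of 5 inputs
print(task_a(x).shape)           # per-task outputs computed from shared features
print(task_b(x).shape)
```

A single-task setup would be free to discard anything its one objective doesn't need; sharing the trunk is what pressures it to keep the more general, semantically useful features.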