Pretty good resource but not sure where the large-scale part is other than Chapt...

thecopy · on Nov 3, 2014

Chapters 2 & 3 go through LSH and map-reduce, which are used for large data sets where comparing all-with-all is impossible. Chapter 4 goes through streams, where you take one item at a time and fit your model to that (so instead of optimizing, for example, an SVM on the whole data set, you stream the items through one after another). Chapter 9 also includes "online" recommendation algorithms, and Chapter 11 is dimension reduction.
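To make the streaming idea concrete, here is a rough sketch of a linear SVM fitted one example at a time with Pegasos-style stochastic gradient steps on the hinge loss. This is my own illustration, not code from the book; the names `online_svm` and `make_stream`, the toy two-cluster data, and the value of `lam` are all assumptions picked for the example.

```python
import random

def online_svm(stream, dim, lam=0.1):
    """Fit a linear SVM from a stream of (x, y) pairs, y in {-1, +1}.

    Each example is seen once and then discarded, so memory use does
    not grow with the size of the data set.
    """
    w = [0.0] * dim
    for t, (x, y) in enumerate(stream, start=1):
        eta = 1.0 / (lam * t)  # decaying step size
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        # Shrink w (gradient of the L2 regularizer).
        w = [(1.0 - eta * lam) * wi for wi in w]
        # Hinge loss is active only when the margin is below 1.
        if margin < 1.0:
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def make_stream(n, seed=0):
    """Hypothetical toy stream: two Gaussian blobs around (2, 2) and (-2, -2)."""
    rng = random.Random(seed)
    for _ in range(n):
        y = rng.choice((-1, 1))
        yield (rng.gauss(2 * y, 1.0), rng.gauss(2 * y, 1.0)), y

w = online_svm(make_stream(2000), dim=2)
```

The point is only that the model is updated per item and never needs the whole data set in memory; a real setup would also want a bias term and some tuning of `lam`.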

Sidenote: a nice way to reduce data-set size for clustering is to construct a coreset from the original data set [0]. A coreset can be built in parallel using map-reduce, and running k-means on it then gives a very good approximation of clustering the full data set.

[0] http://las.ethz.ch/files/feldman11scalable.pdf
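
For a feel of what the coreset idea looks like in code, below is a minimal sketch. Note that it is not the construction from [0]: it swaps in a simpler importance-sampling scheme (often called a "lightweight" coreset) that mixes uniform sampling with sampling proportional to squared distance from the mean. The function names, the partitioning, and the assumption that per-partition coresets can simply be concatenated in the reduce step are mine.

```python
import random

def lightweight_coreset(points, m, rng):
    """Sample m weighted points from `points`; returns a list of (point, weight)."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    d2 = [sum((p[j] - mean[j]) ** 2 for j in range(dim)) for p in points]
    total = sum(d2)
    # Half uniform, half proportional to squared distance from the mean.
    q = [0.5 / n + (0.5 * d / total if total > 0 else 0.5 / n) for d in d2]
    idx = rng.choices(range(n), weights=q, k=m)
    # Weight 1 / (m * q) keeps weighted sums unbiased estimates of full sums.
    return [(points[i], 1.0 / (m * q[i])) for i in idx]

def coreset_of_partitions(partitions, m_per_part, seed=0):
    """'Map': build a coreset per partition. 'Reduce': concatenate them."""
    rng = random.Random(seed)
    merged = []
    for part in partitions:
        merged.extend(lightweight_coreset(part, m_per_part, rng))
    return merged

rng = random.Random(1)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(4000)]
parts = [data[i::4] for i in range(4)]        # pretend these live on 4 workers
coreset = coreset_of_partitions(parts, 100)   # 400 weighted points
```

You would then run a weighted k-means on `coreset` instead of on `data`; the weights sum to roughly the original number of points, so the clustering cost stays on the same scale.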