I believe you may be missing something. Even on a GPU, some algorithms have synchronization points, and data needs to be passed from one processor to another once you spill over the boundary of what can be held in RAM.
Also, some parts of an algorithm may not be parallelizable at all.
GPU parallelization is only 'not difficult' if:
- all your data fits on a single GPU
- your code is embarrassingly parallel
- both conditions hold for the total duration of one computation

Bonus points if you can use the output of one computation as the input of the next. In all other kinds of computations the usual bottlenecks apply.
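To make the distinction concrete, here's a minimal CPU-side sketch (Python threads standing in for GPU lanes, purely for illustration): an element-wise map is embarrassingly parallel because no element depends on any other, while a running sum has a loop-carried dependency, which is exactly the kind of structure that forces synchronization.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # Each element is independent of all others:
    # this is what "embarrassingly parallel" means.
    return x * x

data = list(range(8))

# Embarrassingly parallel: the work splits across workers
# with no communication between them.
with ThreadPoolExecutor() as pool:
    squares = list(pool.map(square, data))

# Not embarrassingly parallel: each step depends on the
# previous one (a loop-carried dependency), so the naive
# form is inherently sequential.
running = []
total = 0
for x in data:
    total += x
    running.append(total)
```

Parallel prefix-sum algorithms do exist, but they need explicit synchronization between phases, which is the point: the moment results depend on each other, you are back to the usual bottlenecks.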