Python is not great for fine-grained data parallelism (SIMD, GPU), which is increasingly the lion's share of compute: a non-starter for direct inline code and pretty bad for DSLs. The result is runtime heroics for embedded dataframe-like DSLs (PySpark, RAPIDS) that carry high overhead.
OTOH, those heroics do happen, and they've been OK so far. Accelerating differentiable programming is basically an extra transform layer on top of accelerating data-parallel programming. Thankfully, our team writes zero raw OpenCL/CUDA nowadays and instead writes fairly dense dataframe code. Similar to how adding async/await did a lot for web programming in Python, I'm curious what it'll take for data-parallel fragments (incl. differentiable ones). If it weren't for the language's resistance to UDFs + their overhead, and legacy libraries built around blocking, we'd be happy.
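To make the UDF-overhead point concrete, here's a minimal sketch using pandas as a stand-in for the heavier dataframe engines mentioned above (the trade-off is the same in PySpark or RAPIDS): a per-row Python UDF bounces every element through the interpreter, while the equivalent dataframe expression stays in the engine's vectorized kernels, which is the part these runtimes can actually accelerate.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

# Per-row Python UDF: each element round-trips through the interpreter,
# so the engine can't fuse or vectorize the work.
udf_result = df["x"].apply(lambda v: v * 2 + 1)

# The same computation as a dataframe expression: it stays in the
# engine's vectorized (C/SIMD) kernels end to end.
expr_result = df["x"] * 2 + 1

assert udf_result.equals(expr_result)
```

Both produce identical results; the difference is purely whether the computation is visible to the runtime as an opaque Python callable or as an expression it can compile and fuse.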