The M1 Max has 400 GB/s of memory bandwidth and a 4090 has about 1 TB/s; the M1 Max has 32 GPU cores and a 4090 has roughly 16,000. The difference is more about how well the software is optimized for the hardware platform than about any performance difference between the two, which are frankly not comparable in any way.
Apple M1 Max has 32 GPU cores, each core contains 16 Execution Units, and each EU has 8 ALUs (also called shaders), so overall there are 4096 shaders. The AD102 die in the Nvidia RTX 4090 contains 12 Graphics Processing Clusters, each GPC has 12 Streaming Multiprocessors, and each SM has 128 ALUs, so the full die has 18432 shaders. The RTX 4090 itself ships with 128 of those 144 SMs enabled, which gives 16384 shaders.
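As a quick sanity check, the counts are just the product of the unit counts at each level of the hierarchy. This is a minimal sketch: `shader_count` is a made-up helper name, and the 128-SM figure for the shipping RTX 4090 is taken from Nvidia's published spec.

```python
def shader_count(*levels: int) -> int:
    """Multiply the unit counts at each level of the GPU hierarchy."""
    total = 1
    for n in levels:
        total *= n
    return total

m1_max = shader_count(32, 16, 8)        # GPU cores x EUs x ALUs per EU
ad102_full = shader_count(12, 12, 128)  # GPCs x SMs per GPC x ALUs per SM (full die)
rtx_4090 = shader_count(128, 128)       # enabled SMs x ALUs per SM (published spec)

print(m1_max, ad102_full, rtx_4090)  # 4096 18432 16384
```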
A single shader is somewhat similar to a single lane of a vector ALU in a CPU. One can say that a single-core CPU with AVX-512 has 8 shaders, because it can process 8 FP64 values at the same time. Calling them "cores" (as in "CUDA core") is extremely misleading, which is why "shader" became the common name for a GPU's ALU. If Nvidia were in charge of marketing a 4-core x86-64 CPU, they would call it a CPU with 32 "AVX cores", because each core has 8-way SIMD.
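The "AVX cores" figure is just lanes per vector times CPU cores. A rough sketch of that counting, where `simd_lanes` and `marketing_cores` are illustrative names rather than anything standard:

```python
def simd_lanes(vector_bits: int, element_bits: int) -> int:
    """Number of elements a single vector register holds."""
    return vector_bits // element_bits

def marketing_cores(cpu_cores: int, vector_bits: int, element_bits: int) -> int:
    """Count every SIMD lane as a 'core', the way GPU marketing counts ALUs."""
    return cpu_cores * simd_lanes(vector_bits, element_bits)

lanes_fp64 = simd_lanes(512, 64)          # AVX-512 with FP64 elements
avx_cores = marketing_cores(4, 512, 64)   # 4-core CPU, 8-way SIMD each

print(lanes_fp64, avx_cores)  # 8 32
```

Counting FP32 lanes instead (16 per 512-bit vector) would double the number, which is part of why lane counts make a poor definition of "core".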
Doesn't Zen 4 have 2x 256-bit FADD and 2x 256-bit FMA, with AVX-512 ops double-pumping the ALUs (a good overview here [0])? If you count an FADD as a single flop and an FMA as 2, that's 48 "1 flop cores" per core with FP32 lanes.
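The 48 works out if the lanes are FP32 (8 per 256-bit pipe). That is an assumption here, since the comment doesn't state the element width, and the pipe counts are taken from the comment as given. A back-of-the-envelope version:

```python
PIPE_BITS = 256
ELEMENT_BITS = 32                        # assumption: FP32 lanes
LANES = PIPE_BITS // ELEMENT_BITS        # 8 lanes per pipe

FADD_PIPES, FADD_FLOPS_PER_LANE = 2, 1   # an add counts as 1 flop
FMA_PIPES, FMA_FLOPS_PER_LANE = 2, 2     # a fused multiply-add counts as 2 flops

fadd_flops = FADD_PIPES * LANES * FADD_FLOPS_PER_LANE
fma_flops = FMA_PIPES * LANES * FMA_FLOPS_PER_LANE
flops_per_cycle_per_core = fadd_flops + fma_flops

print(fadd_flops, fma_flops, flops_per_cycle_per_core)  # 16 32 48
```

With FP64 lanes the same counting gives half of that.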
I think it has the same total FP ALU resources as Zen 3, which shows how register width and ALU resources can be completely decoupled.
I think the 4090 has 16,000 ALUs, not "cores" (let's call a component capable of executing instructions independently of the others a "core"). And the M1 Max probably has more than 1 ALU in every core; otherwise it would resemble an ancient GPU.
Nvidia switched to marketing speak a long time ago when it came to the word "core". If we go with Nvidia's definition then M1 Max has 4096 cores, still behind the 4090, but the gap isn't as big as 32 to 16k.