> phase out Metal From the GPU compute perspective: Vulkan has a subset of the f...

pcwalton · on Nov 22, 2021

Metal is also a subset of Vulkan. As Raph Levien points out here [1], there are certain important algorithms (decoupled lookback) that can't be implemented at all on Metal.

[1]: https://raphlinus.github.io/gpu/2021/11/17/prefix-sum-portab...

my123 · on Nov 22, 2021

> One controversial aspect of the original decoupled look-back algorithm is that it depends on a forward progress guarantee from the GPU

Apple GPUs aren't very amenable to implementing recursion or giving forward progress guarantees within a SIMD group by design. I'll have to check if a device-wide barrier type even exists on that TBDR...

It isn't like NVIDIA hardware (since Volta) where you have a separate instruction pointer for each (SIMT-but-not-quite) thread.
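To make the forward-progress dependency concrete, here is a minimal CPU-side sketch of the look-back idea using C11 atomics and pthreads. It is an analogy, not Metal or Vulkan code: one OS thread stands in for each workgroup, and the names (`scan_partition`, `run_scan`, `NUM_PARTS`, `PART_SIZE`, the flag values) are hypothetical choices for illustration. The spin loop in the look-back step only terminates if the thread that owns the earlier partition keeps running, which an OS scheduler provides but which the quoted passage says a GPU may not.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical sizes, chosen for illustration only. */
enum { NUM_PARTS = 8, PART_SIZE = 4, N = NUM_PARTS * PART_SIZE };

/* Per-partition status: nothing yet, local sum only, or full inclusive prefix. */
enum { FLAG_INVALID = 0, FLAG_AGGREGATE = 1, FLAG_PREFIX = 2 };

static uint32_t input[N];
static uint32_t output[N];
/* Flag in the high 32 bits, value in the low 32 bits, so a single atomic
   store publishes both together. */
static _Atomic uint64_t state[NUM_PARTS];

static uint64_t pack(uint32_t flag, uint32_t value) {
    return ((uint64_t)flag << 32) | value;
}

static void *scan_partition(void *arg) {
    int part = (int)(intptr_t)arg;
    assert(part >= 0 && part < NUM_PARTS);
    const uint32_t *in = &input[part * PART_SIZE];
    uint32_t *out = &output[part * PART_SIZE];

    /* 1. Sum this partition locally. */
    uint32_t aggregate = 0;
    for (int i = 0; i < PART_SIZE; i++) aggregate += in[i];

    /* 2. Publish it. Partition 0 has no predecessors, so its aggregate is
          already its inclusive prefix. */
    atomic_store_explicit(&state[part],
        pack(part == 0 ? FLAG_PREFIX : FLAG_AGGREGATE, aggregate),
        memory_order_release);

    /* 3. Look back over predecessors until one has a full prefix. */
    uint32_t exclusive = 0;
    for (int p = part - 1; p >= 0; p--) {
        uint64_t s;
        /* Spin until the predecessor has published something. This loop
           relies on that thread making forward progress. */
        do {
            s = atomic_load_explicit(&state[p], memory_order_acquire);
        } while ((uint32_t)(s >> 32) == FLAG_INVALID);
        exclusive += (uint32_t)s;
        if ((uint32_t)(s >> 32) == FLAG_PREFIX) break;
    }

    /* 4. Upgrade our own entry so later partitions can stop here. */
    if (part != 0)
        atomic_store_explicit(&state[part],
            pack(FLAG_PREFIX, exclusive + aggregate), memory_order_release);

    /* 5. Write the inclusive scan of this partition. */
    uint32_t running = exclusive;
    for (int i = 0; i < PART_SIZE; i++) {
        running += in[i];
        out[i] = running;
    }
    return NULL;
}

/* Scans input[] into output[]; returns 0 on success, -1 if a thread could
   not be created. */
static int run_scan(void) {
    pthread_t threads[NUM_PARTS];
    int created = 0;
    for (int p = 0; p < NUM_PARTS; p++)
        atomic_store(&state[p], pack(FLAG_INVALID, 0));
    /* Threads are created in partition order, so every thread that exists
       only ever waits on predecessors that also exist. */
    while (created < NUM_PARTS &&
           pthread_create(&threads[created], NULL, scan_partition,
                          (void *)(intptr_t)created) == 0)
        created++;
    for (int p = 0; p < created; p++)
        pthread_join(threads[p], NULL);
    return created == NUM_PARTS ? 0 : -1;
}
```

Filling `input` and calling `run_scan()` leaves the inclusive prefix sum in `output`. On a GPU the same spin loop can hang if the workgroup holding the earlier partition is never scheduled while the later one waits.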

skohan · on Nov 23, 2021

Neither is a subset; they're mostly overlapping sets. For instance, Metal has features for tile-based rendering that Vulkan doesn't, at least as of a couple of years ago.

my123 · on Nov 22, 2021

Ok checked more...

With #define _AIR_MEM_SCOPE_DEVICE 0x2 and:

typedef enum __air_mem_flags { mem_flags_global = 0x1, mem_flags_local = 0x2, mem_flags_texture = 0x4 } __air_mem_flags;

and

__attribute__((__no_duplicate__)) void _air_llvm_mem_barrier(__air_mem_flags flags, int scope) __asm("air.mem_barrier");

what does _air_llvm_mem_barrier((__air_mem_flags)mem_flags_global, _AIR_MEM_SCOPE_DEVICE) result in for you?

Some things in that make me go hmmmmm though, will ask Apple for some answers.

raphlinus · on Nov 22, 2021

While this might work on your specific device, it is an undocumented interface and there is no guarantee at all for future compatibility.

If you do get a definitive answer from Apple you can share, please follow up here, as I would like to be able to cite it. I would be quite shocked if it's different than what I just said, though.

my123 · on Nov 22, 2021

> While this might work on your specific device

it's what is used for OpenCL on Metal (which is the impl present on M1) to provide the semantics there. AIR is a stable, forward compatible bytecode. Will ask and see what Apple says...

edit: thinking about this, OpenCL doesn't actually need those semantics either

raphlinus · on Nov 23, 2021

The OpenCL 1.2 barrier() function is threadgroup scope, same as threadgroup_barrier on Metal. OpenCL 2.0 introduced a proper barrier function (work_group_barrier) whose memory scope parameter can be memory_scope_device. All this is pretty similar to the Vulkan memory model, and at least some of the same people worked on both. I know of no way to reliably support those semantics on Metal.

Claiming OpenCL bitcode is "forward compatible" is a pretty strong claim considering that OpenCL has been deprecated for over 3 years, and the main thing you get when searching OpenCL docs on the Apple site is an exhortation to migrate to Metal. To the extent there's a forward compatibility guarantee for AIR, I'm sure it only applies to output generated by official Apple tools, and I'm pretty sure by now there's no way to get those to output a device-scope barrier.

my123 · on Nov 23, 2021

> The OpenCL 1.2 barrier() function is threadgroup scope, same as threadgroup_barrier on Metal

Interestingly, the Apple driver seems to have that screwed up, giving much stronger semantics than necessary...

#define _AIR_MEM_SCOPE_WORK_GROUP 0x2 in the OpenCL header

And for metal_types.h...

#define _AIR_MEM_SCOPE_WORK_GROUP 0x1

#define _AIR_MEM_SCOPE_DEVICE 0x2

Which raised even more questions for me; that's why I couldn't give a straight answer earlier.
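The numeric overlap between the two headers can be shown in a few lines of C. The values are the ones quoted above; the `OPENCL_` and `METAL_` prefixes and the `metal_scope_name` helper are hypothetical, added only so both sets of constants can live in one file. This shows the collision in the numbers alone and says nothing about how the driver actually interprets them.

```c
#include <assert.h>
#include <string.h>

/* Values as quoted in the thread; the prefixes are hypothetical renames. */
#define OPENCL_AIR_MEM_SCOPE_WORK_GROUP 0x2 /* OpenCL header */
#define METAL_AIR_MEM_SCOPE_WORK_GROUP  0x1 /* metal_types.h */
#define METAL_AIR_MEM_SCOPE_DEVICE      0x2 /* metal_types.h */

/* The OpenCL work-group value is numerically the Metal device value. */
static_assert(OPENCL_AIR_MEM_SCOPE_WORK_GROUP == METAL_AIR_MEM_SCOPE_DEVICE,
              "scope values are expected to collide");

/* Decode a raw scope value using the metal_types.h numbering. */
static const char *metal_scope_name(int scope) {
    switch (scope) {
    case METAL_AIR_MEM_SCOPE_WORK_GROUP: return "work_group";
    case METAL_AIR_MEM_SCOPE_DEVICE:     return "device";
    default:                             return "unknown";
    }
}
```

Read through the metal_types.h numbering, `metal_scope_name(OPENCL_AIR_MEM_SCOPE_WORK_GROUP)` yields "device", which is the "stronger than necessary" reading described above.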

> OpenCL has been deprecated for over 3 years

The OpenCL implementation using AIR is new, for Apple Silicon.

raphlinus · on Nov 22, 2021

"Subset" is not the right word here. Vulkan has pointers now, but (as was discussed in a recent thread) there are serious limitations compared with "real" C++. At the same time, Metal has its own limitations, not least of which is the lack of acquire/release semantics on atomics and of a device-scoped barrier.
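For readers unfamiliar with the term, acquire/release semantics can be illustrated with C11 atomics on a CPU. This is a sketch of the general pattern, not GPU code, and the names (`payload`, `ready`, `run_handoff`) are hypothetical. A release store on a flag makes the earlier plain write visible to any thread whose acquire load observes that flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;      /* ordinary, non-atomic data */
static atomic_int ready; /* flag that guards payload */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;
    /* Release: the write to payload cannot be reordered after this store. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    int *seen = arg;
    /* Acquire: once the flag reads 1, the producer's earlier write to
       payload is visible here. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0) { }
    *seen = payload;
    return NULL;
}

/* Returns the value the consumer observed, or -1 if thread setup failed. */
static int run_handoff(void) {
    pthread_t p, c;
    int seen = -1;
    payload = 0;
    atomic_store(&ready, 0);
    if (pthread_create(&c, NULL, consumer, &seen) != 0) return -1;
    if (pthread_create(&p, NULL, producer, NULL) != 0) {
        atomic_store(&ready, 1); /* unblock the consumer before bailing out */
        pthread_join(c, NULL);
        return -1;
    }
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    assert(seen == 42); /* follows from the release/acquire pairing */
    return seen;
}
```

With relaxed ordering on both sides the flag would still arrive, but nothing would order the `payload` write before it, which is the kind of guarantee the comment above says Metal atomics lack.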

OpenCL is a little strange because older variants don't have advanced atomics (or subgroups), but do have pointers. I'd be curious to know what specific thing is available on DX12 and Metal but missing in Vulkan, especially because I'm not aware of any DX12 feature on the critical path for OpenCL that's missing from Vulkan (at least as an extension).

my123 · on Nov 22, 2021

For OpenCL on DX12, the test suite doesn't pass yet. Every Khronos OpenCL 1.2 CTS test passes on at least one hardware driver, but no driver passes them all. That is why CLon12 hasn't been submitted to Khronos's compliant products list yet.

The pointer semantics that Vulkan has aren't very amenable to building a compliant OpenCL implementation on top of. There are also some other limitations: https://github.com/google/clspv/blob/master/docs/OpenCLCOnVu....