The Zen fabric isn't that special compared to the Xe fabric; it is just doing a different job. Do you know how Intel could scale Xe performance really well? Just give every die a direct connection to every other die, with a fabric that has 500 GB/s of bandwidth in each direction, or more. Then each tile has the full bandwidth of every other tile's memory controller at its disposal (assuming a single HBM2 module per tile), and it wouldn't matter as much which tile holds the data.
GPUs are BANDWIDTH machines: they stream lots of data from memory instead of relying on caches, and they normally live and die by having enough bandwidth from memory to their execution units. There are two potential solutions to tiling GPUs: either brute force it by delivering enough bandwidth, or be really clever in software and manage each tile as a separate NUMA node. If Intel chose the brute-force method for their first-gen chiplet GPUs, the fabric would consume lots of power but would make the software's job a lot easier. In that case they could just make it work, and then continue working on the software to keep things more local to each tile.
Now compare that to a CPU. Most of the time a CPU gets its data from its caches, so memory doesn't get used nearly as much. The percentage of peak bandwidth used in an average situation is low, and if it gets too high the cores can be clocked down to compensate for the power consumption of the fabric. Now 32 bytes/cycle at 3.2 GHz is 102 GB/s, and if we multiply that by 8 it is about 820 GB/s peak. Running at that peak would consume a high fraction of the CPU's power, but normally only a fraction of that peak gets used.
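
To make the CPU arithmetic above easy to check, here is a small Python sketch. The function and constant names are my own, and treating the "multiply by 8" as eight identical links is an assumption; the post only gives the factor.

```python
# Back-of-the-envelope peak fabric bandwidth, using the figures quoted above.
BYTES_PER_CYCLE = 32   # bytes moved per cycle, as quoted
CLOCK_HZ = 3.2e9       # 3.2 GHz, as quoted
LINKS = 8              # the "multiply by 8" factor (assumed: 8 identical links)


def peak_bandwidth_gbs(bytes_per_cycle, clock_hz, links=1):
    """Return peak bandwidth in GB/s, with 1 GB = 1e9 bytes."""
    return bytes_per_cycle * clock_hz * links / 1e9


per_link = peak_bandwidth_gbs(BYTES_PER_CYCLE, CLOCK_HZ)
total = peak_bandwidth_gbs(BYTES_PER_CYCLE, CLOCK_HZ, LINKS)

print(f"per link: {per_link:.1f} GB/s")  # 102.4
print(f"total:    {total:.1f} GB/s")     # 819.2
```

So the 102 GB/s and roughly 820 GB/s figures hold up, with a bit of rounding.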
Intel is probably brute forcing so that the hard problems of the overall system become easier: get it working first, then work on reducing the amount of brute forcing needed while making the bandwidth consume less power at the same time. They are tackling a problem that Nvidia said is probably too hard to solve by not solving it directly, and by going for an adjacent market first with their solution.
As for power consumption: Foveros is 0.15 pJ/bit.
Infinity Fabric is 2 pJ/bit. Infinity Fabric 2 is said to reduce power consumption by 27%, so it is about 1.5 pJ/bit.
Now, what I don't know is what they included in each figure, so we cannot be certain that Intel is an order of magnitude better than AMD/TSMC on this, but we can conclude that Foveros isn't inefficient.
Foveros allows Intel to move about 833 GB/s per watt. AMD's number is about 86 GB/s per watt.
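
The bandwidth-per-watt numbers follow directly from the pJ/bit figures, since one watt is one joule per second. A minimal sketch of the conversion, where the function name is hypothetical and the only inputs are the energy figures quoted above:

```python
def gb_per_s_per_watt(pj_per_bit):
    """Convert energy per bit (pJ/bit) into bandwidth per watt (GB/s per W)."""
    joules_per_byte = pj_per_bit * 1e-12 * 8   # 8 bits per byte
    bytes_per_joule = 1.0 / joules_per_byte    # bytes/s sustained by 1 W
    return bytes_per_joule / 1e9               # 1 GB = 1e9 bytes


foveros = gb_per_s_per_watt(0.15)              # Foveros figure
if1 = gb_per_s_per_watt(2.0)                   # first-gen Infinity Fabric
if2 = gb_per_s_per_watt(2.0 * (1 - 0.27))      # 27% lower energy per bit

print(f"Foveros:           {foveros:.1f} GB/s per W")  # 833.3
print(f"Infinity Fabric:   {if1:.1f} GB/s per W")      # 62.5
print(f"Infinity Fabric 2: {if2:.1f} GB/s per W")      # 85.6
```

The 86 GB/s per watt figure comes from the unrounded 1.46 pJ/bit; with the rounded 1.5 pJ/bit it would be closer to 83. Either way the gap to Foveros is a bit under 10x, assuming both figures measure the same thing.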
This also points towards Intel brute forcing the bandwidth situation to make a first-gen scalable GPU; that fabric probably makes multiple chiplets look more like a single GPU from the software's point of view.