NHacker Next
AMD's Pre-Zen Interconnect: Testing Trinity's Northbridge (chipsandcheese.com)
84 points by zdw 3 days ago | 15 comments
bee_rider 5 hours ago [-]
It is probably good that Chips and Cheese stays technical and objective. But, this was from the pre-Zen bad days of AMD, right? I wonder if a “where’d it all go wrong” post would be more interesting. Or, maybe more optimistically, “how’d this set things up for Zen.”
hajile 3 hours ago [-]
The interconnect didn't have very much to do with what went wrong or setting up for Zen.

AMD made a killer design with Athlon64 that should have taken over the entire industry and made them the largest hardware company on the planet. Instead, Intel leveraged their market position to make it economically infeasible for computer manufacturers to buy AMD chips.

AMD was out of money, which limited its options. Dennard scaling had just failed, but Moore's Law was still in effect and multithreading was hyped as the future of everything. This made a strong argument for lots of smaller cores, and the most area-efficient way to do that was to share less-used resources, which resulted in AMD betting big on small-core CMT.

At the same time, AMD's ATI division was under pressure to make a new, flexible GPU design (which became GCN) and was up against the cult of Nvidia (even knowingly shipping massive numbers of defective chips and then having a worse GPU than GCN still wasn't enough to cost Nvidia its market dominance).

The interconnect was a lower-priority redesign, so they slapped a bandaid on it and pushed the redesign down the road.

ahartmetz 3 hours ago [-]
I think you're being too nice about Bulldozer. It really was a big fat unforced error. Approximately no one wants to buy a CPU that's significantly slower than the last one at common (single core) tasks.

Today, Intel is still selling more CPUs than AMD in most market segments even though they are usually worse.

Zardoz84 2 hours ago [-]
They bet too soon on having a high number of cores. And the later evolution of that microarchitecture wasn't bad.

From a proud ex-user of an FX8370E

Tuna-Fish 2 hours ago [-]
Lower single core throughput with more cores is just always a bad bet. Existing software runs better on faster cores, people buy hardware to run existing software, and people write software to run well on hardware that other people already own.

The reason for AMD's resurgence right now is not that they have more cores, but that they have better cores. If they had even faster cores, and fewer of them per die, they'd be selling even better.

ahartmetz 41 minutes ago [-]
Well. I got the very first Ryzen model, the 1800X, because it had twice the number of cores that Intel was selling, and they were just a few percent slower per core. If they had been 40% slower, I would have passed.

My most important workload - compiling C++ - is atypically parallel, but even there, single-core is important, too.

reginald78 19 minutes ago [-]
I think AMD was both lucky and good. They came out with a forward-thinking design that could bring them back from the brink, but I'm not sure their stuff would have sold if Intel hadn't left them an opening. Most important was Intel's failure to execute on 10nm; GlobalFoundries' 14nm wouldn't have compared as favorably to 10nm, even with more cores. And since Intel was stubbornly refusing to sell more cores on anything but their expensive HEDT platform, there was a market segment being neglected.
mlinhares 2 hours ago [-]
Athlon64 should have been the wake-up call for Intel to focus on engineering to beat AMD, but they decided they would bully the market into a worse product forever.
Tuna-Fish 2 hours ago [-]
... It was exactly that. In the following decade, Intel comprehensively beat AMD. But then they let up, and started spending all their money on idiotic acquisitions instead.
toast0 3 hours ago [-]
That's probably in their Bulldozer article [1]. But this article is about memory access on their APUs; you just have to accept the CPU was what it was, no need to dwell on it here.

[1] https://chipsandcheese.com/p/bulldozer-amds-crash-modernizat...

freeqaz 3 hours ago [-]
Something in the article that I had to look up that might bother others. He uses the term 'DCT' in this sentence, but it's never defined in the article. AFAIK it stands for 'DRAM Memory Controller', but that could be an LLM hallucination. Running a web search defines it as Discrete Cosine Transform. :P

> "AMD’s BIOS and Kernel Developer’s Guide (BKDG) indicates there’s a 4-bit read pointer for a sideband signal FIFO between the GMC and DCT, so the “Garlic” link may have a queue with up to 16 entries."

Should maybe swap DCT in for MCT (memory controller)?
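
The "up to 16 entries" in the quoted sentence is just pointer-width arithmetic: an n-bit pointer can address 2^n slots. A minimal Python sketch, assuming a plain power-of-two ring buffer (the quote only gives the pointer width; the function names below are made up for illustration):

```python
def fifo_depth(pointer_bits: int) -> int:
    """Number of distinct slots an n-bit pointer can address."""
    return 1 << pointer_bits


def advance(pointer: int, pointer_bits: int) -> int:
    """Increment a pointer and wrap it to its bit width, like a hardware counter."""
    return (pointer + 1) & (fifo_depth(pointer_bits) - 1)


print(fifo_depth(4))   # 16
print(advance(15, 4))  # 0, the 4-bit pointer wraps around
```

Whether all 16 slots are usable depends on how the hardware tells "full" from "empty", which is why "up to" is the right hedge.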

dcminter 3 hours ago [-]
DCT is explicitly "DRAM Controller" in the referenced "BIOS and Kernel Developer’s Guide" - see definitions table on p23 of https://www.amd.com/content/dam/amd/en/documents/archived-te...
jauntywundrkind 3 hours ago [-]
It's wild how much extra work was done to avoid coherency, yet share memory.

Ok, there's the first part, the Garlic bus, which gives the GPU its own access to the DRAM request controller, instead of going through the CPU's memory controller.

Since the GPU is mostly going to miss, it's great that it's not wasting energy trying to go to the CPU's cache. But it means if you do want to share memory now you need a whole other access path for the GPU to read from the CPU memory, even though it's literally the same RAM (but maybe different cache). So, add a new Onion link, that lets the GPU go through the crossbar, and get handled by the memory controller. And this one is slower.
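
The two paths described above can be pictured as a routing choice. A toy Python sketch, assuming a per-request coherency flag purely for illustration (`GpuRequest` and `pick_link` are hypothetical names, and real hardware does not necessarily decide per request like this):

```python
from dataclasses import dataclass


@dataclass
class GpuRequest:
    address: int
    needs_coherency: bool  # True when the GPU must see data the CPU may have cached


def pick_link(req: GpuRequest) -> str:
    """Toy model: choose which GPU-to-memory path a request would take."""
    if req.needs_coherency:
        # Through the crossbar and the CPU-side memory controller: the slower path.
        return "Onion"
    # Direct access to the DRAM controller, skipping the CPU's caches.
    return "Garlic"


print(pick_link(GpuRequest(address=0x1000, needs_coherency=False)))  # Garlic
print(pick_link(GpuRequest(address=0x2000, needs_coherency=True)))   # Onion
```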

Infinity Fabric seems conceptually so much easier, to keep things in sync. But the work to snoop the bus, to maintain coherency: it has to be a pretty massive effort.

It's so so different a thing, but I wonder how AMD deals with coherency (or not?) on the 6 Memory Cache Dies (MCDs) in the 7900 XTX GPU. Having separate chips whose job is to be cache and DRAM controller, that must need at least some understanding of who has what memory; that has to be wild.

One other comment, on:

> modern games struggle or won’t launch at all on Trinity, so I’ve selected a few older workloads

I wonder how many more games would run under Linux? There's an absurd amount of work still going into the radeonsi driver. The driver just switched to the newer ACO compiler pipeline by default last December, for example. That said, Trinity (2012) uses a TeraScale3 (gfx4) GPU from 2010. This is old! But the improvements have been ongoing, in a way commercial systems would be unlikely to ever see; there have been so many wins over such a long time. Not compatibility, but getting multi-threaded driver support (2017) also comes to mind as a big leap! https://www.phoronix.com/news/RadeonSI-ACO-Default-Pre-GFX10 https://www.phoronix.com/news/RadeonSI-G3D-Threads https://www.google.com/search?q=site%3Aphoronix.com+radeonsi

I wonder how granular the breakdown/fallback modes are for running; I suspect that if there's an unsupportable feature somewhere in the graphics pipeline, the whole pipeline will usually need to fall back to CPU rendering, but perhaps there's some ability to fill in some GPU features via the CPU while running most of the pipeline on the GPU (and not having the latency destroy everything, perhaps using that Onion link/cacheable host memory)?

hajile 3 hours ago [-]
Redesigning their interconnect stuff for both GPU and CPU then implementing and validating would have been a massive expense and would have added additional time to ship.

With the company facing bankruptcy, I'd imagine that a small team hacking together the different GPU and CPU interconnects was cheaper and faster than designing a whole new interconnect and coherency then implementing and testing it everywhere.

toast0 3 hours ago [-]
> It's wild how much extra work was done to avoid coherency, yet share memory.

Having separate, non-coherent memory is the status quo for GPUs. Bringing the GPU onto the die means you've got to share the path to memory, but access patterns are different.

Designing for the typical case where the addresses used are distinct is totally reasonable; it's not wild at all. After that works, you can try to make shared use faster, too, but from the article, that didn't really happen in this design; the features are there, but the bandwidth isn't.