• Breaking News

    Friday, June 4, 2021

    Hardware support: PNY Quietly Reduces XLR8 CS3030 SSD's Endurance by Almost 80%

    PNY Quietly Reduces XLR8 CS3030 SSD's Endurance by Almost 80%

    Posted: 03 Jun 2021 03:20 PM PDT

    TSMC Announcements: 4nm Will Arrive Sooner, 5nm Capacity to Expand, And More!

    Posted: 03 Jun 2021 07:24 AM PDT

    FYI: Today's computer chips are so advanced, they are more 'mercurial' than precise – and here's the proof

    Posted: 03 Jun 2021 07:11 PM PDT

    VideoCardz: "AMD FidelityFX Super Resolution also coming to Radeon RX 480 and RX 470"

    Posted: 03 Jun 2021 12:29 PM PDT

    NVIDIA RTX 3080 Ti Founders Edition Tear-Down: Seeking Differences vs. 3080 FE - Gamers Nexus

    Posted: 03 Jun 2021 09:54 PM PDT

    Biden Expands Trump's Blacklist of Prohibited Chinese Companies

    Posted: 04 Jun 2021 12:27 AM PDT

    [VideoCardz] Dell selling Alienware m15 R5 GeForce RTX 3070 laptops with 4608 Cores instead of 5120

    Posted: 03 Jun 2021 10:16 PM PDT

    [IgorsLab] The winners of the graphics card crisis: Hardware brokers replace wholesalers - Exclusive offer and price list

    Posted: 03 Jun 2021 11:29 PM PDT

    (Anandtech) NVMe 2.0 Specification Released: Major Reorganization

    Posted: 03 Jun 2021 08:19 AM PDT

    New AMD Patent Application Sheds Light on Chiplet-GPU Implementation

    Posted: 03 Jun 2021 04:08 PM PDT

    Here is a link to the patent app

    When the first MGPU patent application was released in December of last year, there were a few questions over how the chiplets would interact with each other, as well as with the CPU. There was also the question of whether or not there would be extra latency involved in this setup, and other questions such as how VRAM is handled.

    But first of all, I want to make something abundantly clear that goes against what RGT, MLID, and a few other leakers are saying: Nowhere in this patent application, or in any other chiplet-GPU patent application written by AMD, is an IO die required for chiplet GPUs. A lot of people may say 'well, patents aren't always what comes to market', but AMD is clearly taking a consistent approach here: all of the patent applications to date imply that the plan is to not use an IOD at all.

    I also want to make it clear, going against what Coreteks said in their latest 'AMD GPU chiplet' video, that the active bridge chiplet is NOT on top of the GPU chiplets. It sits underneath them and is embedded in the substrate.

    Now for some tasty (and long) bullet points:

    • Fig. 6 shows three dies in the configuration, and to me a three-chiplet configuration seems very likely. Not only because of this patent application, but also because three dies is the maximum TSMC can currently do with their CoWoS-L (TSMC's version of EMIB) tech. In fact, 3-die CoWoS-L is entering mass production in Q2 2021, which is almost right on schedule, if not a bit early, to put it into use for Navi 31. It should be noted that up to 8x HBM2E can also be connected to the 3x logic dies via CoWoS-L, but I don't think that's likely for gaming; especially with the Infinity Cache, I doubt HBM2E will be used at all. I should also add that all of the patent applications point to using an active silicon bridge (EMIB / CoWoS-L) in their designs and not a full silicon interposer. (See paragraph [0021]: "An active bridge chiplet couples the GPU chiplets to each other...includes an active silicon bridge that serves as a high-bandwidth die-to-die interconnect between GPU chiplet dies".) It is worth saying, though, that these patent applications don't specifically rule out more (or fewer) than 3 dies per package.

    • As was made clear in the first chiplet-GPU patent app, the CPU talks directly to only one of the GPU chiplets, and the entire chiplet GPU appears as one singular GPU to the system. The first chiplet dispatches work across the other chiplets via the active bridge chiplet, which includes the L3 and also synchronizes the chiplets. (See paragraph [0022]: "..the CPU is coupled to a single GPU chiplet through the bus, and the GPU chiplet is referred to as a 'primary' chiplet.")

    • VRAM access: Each chiplet has its own set of memory channels, as indicated in paragraph [0022]: "Subsequently, any inter-chiplet communications are routed through the active bridge chiplet as appropriate to access memory channels on other GPU chiplets". The question here is: if the GPU chiplets have their own memory channels, are those channels exclusive to each chiplet, or are they shared somehow? This is semi-resolved in paragraph [0026]: "The memory-attached last level L3 cache that sits between the memory controller and the GPU chiplets avoids these issues by providing a consistent 'view' of the cache and memory to all attached cores". So where, physically, are the memory channels? A: They are on each chiplet. All memory access is controlled by the first chiplet, but the memory channels are attached to the L3. It should be noted that memory bandwidth scales directly with the number of chiplets on the package (see the first sketch after this list). So, for example, if a 1-chiplet GPU was built with a 128-bit memory bus, a 2-chiplet GPU would have a 256-bit bus...no more and no less.

    • Checkerboarding: The end of paragraph [0023] states: "...Mutually exclusive sets of adjacent pixels are processed by the different chiplets, which is referred to herein as 'checkerboarding' the pixels that are processed by the different GPU chiplets". This harkens back to the days of SFR (split frame rendering) vs. AFR (alternate frame rendering) when rendering over CrossFire / SLI. However, it seems these 'sets' of pixels will be much smaller, and not based on a simple line drawn across the screen, which should prevent screen tearing and other post-processing problems associated with SFR. According to [0049], "Each pixel (/checkerboard) ... represents a work item that is processed in a graphics pipeline. Sets of pixels are grouped into tiles having dimensions of N pixels x M pixels. Mutually exclusive sets of the tiles are assigned to different process units", and, further on, [0050] states: "After the pixel work items are generated by rasterization, the processing units determine which subset two [sic] process based on a comparison of the unit identifier and the screen space location of the pixel." This seems to indicate that the delineation of work between the chiplets happens at the primitive level, and not at the screen-space level (see the second sketch after this list). That could eliminate the problems that occur with post-processing effects while rendering in SFR mode, and would allow each chiplet to effectively work on different parts of the same frame.

    • Chiplet Synchronization: From [0024]: "...Command processors in the GPU chiplets detect synchronization points in the command stream and stop (or interrupt) processing of commands in the command buffer until all the GPU chiplets have completing [sic] processing commands prior to the synchronization point". Again, if a single chiplet is saturated, all of the chiplets have to wait for it to catch up. However, with checkerboarding, it's doubtful the workload will vary so greatly between chiplets that this becomes an issue.

    • Active Bridge Chiplet: From [0026]: "...the active bridge chiplet includes a unified cache that is on a separate die than the GPU chiplets, and provides an external unified memory interface that communicable [sic] links two or more GPU chiplets together". The two points here are 1) that the entire L3 cache sits on the active silicon bridge itself, and nowhere on the GPU chiplets, and 2) that the memory channels are on each chiplet (as stated above) but are controlled only by the first chiplet.

    • IO / Memory Controller: From Fig. 6, it's pretty clear that a memory controller exists on the first chiplet. This same memory controller would still physically exist on the other 'slave' chiplets, but would be disabled and unused. Anyone familiar with chip fabrication knows that making two different designs instead of one involves two costs: 1) the cost of making the separate design in the first place, a second set of masks, etc., and 2) the cost associated with the loss of scalability of the design. So although the patent application doesn't explicitly say there won't be a separately-designed 'IO die', it is quite clear that each of the chiplets in this multi-chiplet GPU is identical, and also that there will be some dark silicon on each die.
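
    To make the bandwidth-scaling point concrete, here's a minimal sketch of the arithmetic. The 128-bit bus per chiplet and the 16 Gbps GDDR6 pin speed are my assumptions for illustration, not numbers from the patent:

    ```python
    # Illustrative only: aggregate memory bus scaling with chiplet count, per
    # the patent's model where each chiplet brings its own memory channels.
    PIN_SPEED_GBPS = 16          # assumed GDDR6 data rate per pin
    BUS_WIDTH_PER_CHIPLET = 128  # assumed bus width contributed by each chiplet

    def aggregate_bandwidth_gbs(num_chiplets: int) -> float:
        """Total bandwidth in GB/s; bus width scales linearly with chiplet count."""
        total_bus_bits = num_chiplets * BUS_WIDTH_PER_CHIPLET
        return total_bus_bits * PIN_SPEED_GBPS / 8  # bits -> bytes

    for n in (1, 2, 3):
        print(f"{n} chiplet(s): {n * BUS_WIDTH_PER_CHIPLET}-bit bus, "
              f"{aggregate_bandwidth_gbs(n):.0f} GB/s")
    # 1 chiplet(s): 128-bit bus, 256 GB/s
    # 2 chiplet(s): 256-bit bus, 512 GB/s
    # 3 chiplet(s): 384-bit bus, 768 GB/s
    ```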
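
    And here's a toy sketch of what tile-level checkerboarding could look like. The patent only says that mutually exclusive sets of N x M pixel tiles are assigned by comparing a unit identifier against the pixel's screen-space location; the 8x8 tile size and the (tx + ty) mod num_chiplets rule below are my own guesses at one way such a scheme could work:

    ```python
    # Toy illustration of 'checkerboarding': mutually exclusive N x M pixel
    # tiles assigned to different GPU chiplets. Tile size and assignment rule
    # are assumptions; the patent only specifies comparing the unit identifier
    # against the screen-space location.
    TILE_W, TILE_H = 8, 8  # assumed tile dimensions (N pixels x M pixels)
    NUM_CHIPLETS = 3

    def owning_chiplet(px: int, py: int) -> int:
        """Map a pixel's screen-space location to the chiplet that processes it."""
        tx, ty = px // TILE_W, py // TILE_H
        return (tx + ty) % NUM_CHIPLETS  # adjacent tiles go to different chiplets

    # Each chiplet compares this assignment against its own identifier and keeps
    # only the matching tiles, so all chiplets work on the same frame at once.
    for py in range(0, 24, TILE_H):
        print([owning_chiplet(px, py) for px in range(0, 40, TILE_W)])
    # [0, 1, 2, 0, 1]
    # [1, 2, 0, 1, 2]
    # [2, 0, 1, 2, 0]
    ```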

    submitted by /u/marakeshmode
    [link] [comments]

    [VideoCardz] AMD Radeon Pro W6800 to be the first Navi 21 graphics card with 32GB memory

    Posted: 04 Jun 2021 03:01 AM PDT

    We Interrupt This Program - Intel and AMD Contemplate Different Replacements for x86 Interrupt Handling

    Posted: 03 Jun 2021 10:02 AM PDT

    XDA Developers: "Qualcomm's Snapdragon 888 successor will have Arm's new v9 CPUs"

    Posted: 03 Jun 2021 02:44 PM PDT

    ASRock Announces AMD X300TM-ITX Motherboard: Thin ITX For Ryzen APUs

    Posted: 03 Jun 2021 09:39 AM PDT

    Prediction: Upcoming Macbook Pro 14" and 16" will have SoCs based on A15, not A14

    Posted: 04 Jun 2021 01:13 AM PDT

    The M1 is based on the A14. Many believe that the upcoming MBPs (expected to be announced next week) will have an M1X that is still based on the A14. This is unlikely.

    Here's why:

    • The latest Bloomberg report points to an 8/2 big.LITTLE design for the Macbook Pros, which suggests it's architecturally different from the 4/4 M1. Perhaps the two new efficiency cores have been beefed up to be equivalent to the A14's four.
    • A15 production for the iPhone 13 has already started at TSMC, which suggests the A15 design has long been ready for the new Macbook Pros as well.
    • It makes more sense for the Macbook Pros to receive the latest SoC designs before the Macbook Air/Mac Mini/iMac, because you can produce the biggest chips first, bin the best for the Macbook Pros, and use the defective chips for lower-end Macs. For example, you can use defective 8/2 dies as 6/2 dies for an upgraded Macbook Air. AMD/Intel/Nvidia almost always release top-end chips first for this reason.
    • It doesn't make sense for the low-end Macs to have the highest single-thread performance. Apple's year-over-year single-thread improvements can be as much as 20%, unlike Intel's historical 1-5%. Having the Macbook Air outperform the expensive Macbook Pros in single-thread by 20% for 9 months of every cycle is simply weird and unacceptable.
    • Many have already observed that the M1 could easily have been a rebranded A14X with extra IO bolted on. The real Mac SoCs will start with the upcoming Macbook Pros.

    Don't be surprised if the Macbook Pros get announced with an A15-based SoC next week.

    Bonus prediction: The new SoCs could have hardware ray tracing. Apple's software APIs have hinted at hardware ray tracing support.

    submitted by /u/senttoschool
    [link] [comments]

    Can future uses of X3D involve stacking on the I/O die? Would an I/O die with cache be an option, or would that cost too much?

    Posted: 03 Jun 2021 07:38 AM PDT

    So, I've been reading a lot of articles on Semi Engineering, IEEE and Semi Wiki about 3D stacking for a while, and Dr. Cutress's article about the V-Cache announcement mentioned the possibility of putting the SRAM underneath the logic instead, which would be better for cooling but would make it harder to deliver power to the logic.

    But logic doesn't just need power, it also needs I/O. IIRC, the biggest idle power draw on Zen2 and Zen3 chiplets is the connection between the CCD and the I/O die, and that link is the biggest source of latency as well. It was mainly a Zen2 problem, but Zen3 still experiences it with chiplet-to-chiplet communication, even if it has to go chiplet-to-chiplet less often than Zen2 did.

    If only there was a way to make communication between the I/O die and the CCDs faster and more efficient, maybe by travelling over a shorter distance...

    Current X3D stacking and V-Cache rely on cache-on-cache stacking, because cache doesn't get as hot as logic does. I don't know how hot the PHYs for I/O get, but given that the X570 chipset needs a fan, and it's the same die design as the Zen2 and Zen3 I/O die, I imagine they get a little bit hot.

    In RDNA2, the "Infinity Cache" is a Last Level Cache that's physically closer to the GDDR6 memory controllers, so there seem to be some advantages to having cache and I/O closer together, for latency and power efficiency reasons.

    The idea I had after reading the article: Make an I/O die with all the memory controllers, PCIe, USB and everything else, and put a big chunk of SRAM in the middle of it. Then, stack a CCD with CPU cores or GPU Compute Units, or both on top of the SRAM, and connect it to the I/O die with TSVs.

    The main problem I see with this idea is that it would be a sort of reversal of the current paradigm, where AMD's I/O dies are the cheap dies made on older nodes.

    That's because I/O doesn't shrink very well, and SRAM cache doesn't either. The biggest improvement to I/O density in a while has been TSMC's 5nm node, where I/O got a 1.2x density improvement. For their 3nm node, logic gets a 1.7x density improvement, SRAM gets 1.2x, and I/O gets 1.1x.

    Putting SRAM and I/O, the two things that aren't shrinking super well, on the same die might not be ideal from a cost perspective, even if you save by not including as much SRAM on the CCDs and including more logic instead. And that's not even getting into the packaging costs that 3D stacking will add over traditional 2D packaging and substrates.
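
    To put some rough numbers on that, here's a back-of-the-envelope sketch using the 3nm scaling factors above. The starting area splits (70/20/10 for a logic-heavy die, 10/50/40 for an SRAM + I/O die) are invented purely for illustration:

    ```python
    # Back-of-the-envelope: how much a die shrinks at TSMC 3nm depending on its
    # mix of logic, SRAM, and I/O, using the ~1.7x / 1.2x / 1.1x density gains
    # mentioned above. The area splits below are made-up assumptions.
    DENSITY_GAIN = {"logic": 1.7, "sram": 1.2, "io": 1.1}

    def shrunk_area(split: dict) -> float:
        """Relative area after the shrink (1.0 = original die size)."""
        return sum(frac / DENSITY_GAIN[kind] for kind, frac in split.items())

    logic_heavy = {"logic": 0.70, "sram": 0.20, "io": 0.10}  # e.g. a CCD
    sram_io_die = {"logic": 0.10, "sram": 0.50, "io": 0.40}  # the idea above

    print(f"Logic-heavy die shrinks to {shrunk_area(logic_heavy):.0%} of its area")
    print(f"SRAM+I/O die shrinks to {shrunk_area(sram_io_die):.0%} of its area")
    # Logic-heavy die shrinks to 67% of its area
    # SRAM+I/O die shrinks to 84% of its area
    ```

    So an SRAM + I/O heavy die would barely benefit from a leading node, which is exactly why AMD keeps I/O dies on cheaper, older processes.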

    I will say, though, that from my understanding (which might be totally wrong), the sort of "I/O + SRAM cache chiplet" I'm describing kinda sorta already exists? Because I'm pretty sure that's a lot like how FPGAs work. I'm pretty sure most FPGAs consist mainly of SRAM cells, lookup tables, and I/O.

    If I'm understanding this die shot and this block diagram correctly, an FPGA is a lot like how I pictured an SRAM+I/O die looking. Big SRAM in the center, I/O on the outside. Granted that the programmable data fabric and look-up tables in an FPGA are loads more complicated than "just" regular SRAM like everything else uses.

    I'm sure AMD couldn't just take one of the existing Xilinx FPGAs and start stacking Zen CCDs on top of it; there would probably have to be some re-engineering and design work involved. They'd probably have to design the SRAM so that the TSVs can go through it properly, and they'd need to make sure the socket and software could communicate with the package properly and all of that. I assume that's probably why they didn't do it for the recent Zen3 + V-Cache announcement, because the AM4 socket and the Vermeer package were already laid out a certain way.

    But does anyone see a problem with my assumption that I/O + cache would be ideal for the bottom level of a 3D-stacking solution?

    Previously posted in r/AMD, deleted, then reposted in r/Hardware mainly so I could crosspost because it wasn't letting me.

    EDIT: It's not letting me crosspost to r/AMD now, but whatever, lol.

    I'm more interested in the technical answers I'll (hopefully?) get in r/Hardware anyway.

    submitted by /u/Scion95
    [link] [comments]

    Gigabyte WRX80 Threadripper PRO - FINALLY Thunderbolt! - Motherboard Review

    Posted: 03 Jun 2021 11:49 AM PDT

    Will mesh shaders reduce power consumption in a given scene?

    Posted: 03 Jun 2021 02:05 PM PDT

    I'm not absolutely sure, but I thought I saw in some article or video that mesh shaders actually increased power consumption by a large amount in a benchmark (in either the Asteroids demo or the 3DMark demo running on the new Nvidia/AMD cards). Am I crazy?

    If it does increase power consumption, why?

    submitted by /u/jumpy-town
    [link] [comments]

    Why is Intel using TSMC for DG2 Dies instead of Intel Nodes?

    Posted: 03 Jun 2021 07:05 PM PDT

    Given that it's pretty much agreed upon that Alder Lake will use Intel's 10nm SuperFin, why is Intel using TSMC for their DG2 graphics cards?

    submitted by /u/Stanley_C
    [link] [comments]

    ZDNet: "Nvidia CEO eschews mobile RTX in favour of GeForce Now"

    Posted: 03 Jun 2021 09:33 AM PDT

    Gizmochina: "Qualcomm "Snapdragon" branded gaming smartphone from ASUS spotted on TENAA"

    Posted: 03 Jun 2021 06:44 AM PDT
