• Breaking News

    Thursday, May 7, 2020

    [LTT] This Chinese motherboard shouldn't exist... | LGA1151 board with integrated GTX 1050 Ti

    Posted: 06 May 2020 12:09 PM PDT

    Potential new Zen 3 instructions

    Posted: 07 May 2020 01:28 AM PDT

    I noticed a new revision of AMD64 Architecture Programmer's Manual was released in late April https://www.amd.com/system/files/TechDocs/24594.pdf

    Phoronix speculated that AMD might finally be implementing PCID support, which was measured to reduce the duration of cross-process context switches on Linux by about 10% https://lwn.net/Articles/671299/

    This was before Meltdown & Spectre though, and I haven't looked at newer benchmarks.

    Anyway, the interesting bit for me was the new instructions, particularly INVLPGB and TLBSYNC. Let me add some context first, but remember that I'm not super familiar with this and might be wrong.

    In a multithreaded application, all threads share the same page table mappings (barring something specific to thread-local storage maybe), which is why they're able to operate on and pass to each other the same addresses, etc. Currently, when a new mapping is added, like when an app allocates memory, a page table entry is inserted into a slot that was previously unused. The next time the CPU tries to load/store an address in that range, it will have to walk the page table hierarchy and resolve the physical address for that virtual address, then cache this mapping in the TLB.
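
    To make that fast path concrete, here's a minimal userspace sketch (assuming a POSIX system; error handling mostly omitted) of a mapping being created lazily and the TLB being filled on first access:

        /* A new mapping is created lazily; the page table entry and the
         * TLB entry only appear once the page is first touched. */
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>

        int main(void) {
            /* mmap() typically just records the region in kernel
             * bookkeeping; no page table entry exists yet. */
            char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) return 1;

            /* The first store faults, the kernel inserts a PTE, and the
             * CPU's page walk caches the translation in the TLB. */
            memset(p, 0xAB, 4096);

            /* Later accesses hit the cached TLB entry, no walk needed. */
            printf("%d\n", p[0]);

            /* This is the slow path: every CPU running a thread of this
             * process must now forget its TLB entry for p. */
            munmap(p, 4096);
            return 0;
        }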

    That is the fast path for adding a new mapping, but something more involved happens when the application frees memory. Freeing a page on one thread must take immediate effect on all other threads of that app, and those other threads could be executing right now if you have more than one logical processor. This is why Inter-Processor Interrupts (IPIs) are used to stop execution on a given set of CPUs, so that each of them can invalidate its local TLB entries.
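
    To show the shape of an IPI-based shootdown, here's a simplified kernel-style sketch. All the helper names (invlpg_local, send_ipi_and_wait, the mm/cpumask types) are hypothetical stand-ins, not the real Linux API:

        struct cpumask;                      /* opaque set of CPUs */
        void invlpg_local(unsigned long va); /* hypothetical INVLPG wrapper */
        void send_ipi_and_wait(struct cpumask *cpus,
                               void (*fn)(void *), void *arg);

        struct mm { struct cpumask *cpus_seen; /* CPUs that ran this mm */ };

        #define PAGE_SIZE 4096UL

        struct flush_info { unsigned long start, end; };

        /* Runs on each interrupted CPU: drop its stale local entries. */
        static void flush_tlb_ipi_handler(void *arg)
        {
            struct flush_info *f = arg;
            for (unsigned long va = f->start; va < f->end; va += PAGE_SIZE)
                invlpg_local(va);
        }

        void flush_tlb_range(struct mm *mm, unsigned long start,
                             unsigned long end)
        {
            struct flush_info f = { start, end };
            /* Interrupt every CPU that may cache entries for this address
             * space and wait until each acknowledges the flush. */
            send_ipi_and_wait(mm->cpus_seen, flush_tlb_ipi_handler, &f);
            flush_tlb_ipi_handler(&f);   /* flush locally as well */
        }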

    There are some more involved optimizations here, with Linux tracking the CPUs that used a particular page table hierarchy in the past (so it doesn't send an IPI to CPUs whose TLBs don't need invalidating), and something called Lazy TLB. Either way, there is a bunch of non-trivial logic here to get this right while doing the least amount of work possible.

    Let's also describe PCID. Before it became the savior of performance on Meltdown-mitigated Intel processors, it was simply a way to attach a 12-bit tag to the page table hierarchy address. When switching between different applications, you would usually invalidate the whole TLB. But with PCID present, you can tell the processor that you're switching address spaces by changing the root pointer to the page table hierarchy and the PCID tag at the same time. The benefit of all this is that TLB entries can survive context switches to other apps, and when control returns to the original one, its caches can still be warm. Without having to do as many page table walks, loads and stores execute faster after your process is scheduled again.
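
    At the register level, the mechanism looks roughly like this sketch: with PCID enabled, bits 11:0 of CR3 carry the tag, the upper bits hold the physical address of the top-level page table, and setting bit 63 on a write tells the CPU not to flush the entries tagged with the new PCID (privileged code, shown for illustration only):

        #include <stdint.h>

        #define CR3_NOFLUSH (1ULL << 63)

        /* Compose a CR3 value: page table base + 12-bit PCID tag. */
        static inline uint64_t make_cr3(uint64_t pgtable_pa, uint16_t pcid,
                                        int noflush)
        {
            uint64_t cr3 = (pgtable_pa & ~0xFFFULL) | (pcid & 0xFFF);
            if (noflush)
                cr3 |= CR3_NOFLUSH; /* keep this PCID's cached entries warm */
            return cr3;
        }

        /* Ring-0 only: switching address spaces = one CR3 write. */
        static inline void write_cr3(uint64_t val)
        {
            __asm__ volatile("mov %0, %%cr3" :: "r"(val) : "memory");
        }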

    Finally, let's talk about these new instructions. On top of adding support for PCID, AMD could be introducing these new instructions to further leverage it. The one that I found interesting was INVLPGB, "Invalidate TLB Entry(s) with Broadcast". It is meant to be used together with TLBSYNC, "Synchronize TLB Invalidations". Neither of these exists in Intel's manuals, and both were added in the April 2020 revision of AMD's manual.
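
    Per my reading of the manual, both instructions take their inputs implicitly: INVLPGB reads rAX (the virtual address plus valid-bit flags selecting what to match on), ECX (a count of additional sequential pages) and EDX (the PCID/ASID to target), while TLBSYNC has no operands and just waits for this CPU's outstanding broadcasts to complete. A hedged sketch of wrappers follows; the exact bit layouts live in the manual, and an assembler new enough to know the mnemonics is required:

        /* Broadcast-invalidate TLB entries on all CPUs. Operand bit
         * layouts (valid bits in rAX, PCID/ASID placement in EDX) are
         * elided here; see the April 2020 APM. */
        static inline void invlpgb(unsigned long rax, unsigned int ecx,
                                   unsigned int edx)
        {
            __asm__ volatile("invlpgb" :: "a"(rax), "c"(ecx), "d"(edx)
                             : "memory");
        }

        /* Wait until all broadcast invalidations sent by this CPU have
         * been observed by every other TLB. */
        static inline void tlbsync(void)
        {
            __asm__ volatile("tlbsync" ::: "memory");
        }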

    The idea is that (and this is where I could be extra wrong), instead of having the kernel track which CPUs need to be interrupted and what invalidation method to use, the CPU could do this better. Because the TLB itself can track which cache entries belong to which address spaces, fine-grained invalidation can happen without disturbing the caches of other address spaces.

    So when this gets into actual products, and when OS vendors leverage it, the state tracking that I described earlier could be skipped (at least in part). Instead of the old solution with interrupts, these instructions would be used to remove mappings from all TLBs that cached them. This could perhaps be optimized in a hierarchical manner, since the tags are only 12 bits and they will definitely aim to minimize inter-socket and inter-CCX traffic if they can track which address spaces were used on which processors.

    Another clever thing is that these instructions have particular ordering semantics defined for them, which lets you pipeline and parallelize multiple INVLPGBs and run a single TLBSYNC at the end of the transaction. On my system at least, Linux prefers individual page invalidations (as opposed to global flushes) if it's trying to invalidate fewer than 33 pages in one transaction (/sys/kernel/debug/x86/tlb_single_page_flush_ceiling). So there could perhaps be dozens of INVLPGB instructions followed by one TLBSYNC at the end.
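
    So the batched pattern could look like the sketch below, reusing the hypothetical invlpgb()/tlbsync() wrappers from above (flag and PCID encoding still elided):

        /* Queue one broadcast invalidation per page, then synchronize
         * once for the whole transaction. PAGE_SIZE and the wrappers
         * are as sketched earlier. */
        void flush_range_broadcast(unsigned long start, unsigned long end,
                                   unsigned int pcid_bits)
        {
            for (unsigned long va = start; va < end; va += PAGE_SIZE)
                invlpgb(va /* | valid-bit flags */, 0, pcid_bits);
            tlbsync();  /* a single wait instead of N IPI round-trips */
        }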

    My take on it, assuming the description above was mostly correct, is that this is a nice feature overall. It is designed well, with room for the CPU vendor to optimize today and in the future. It simplifies the old problem of TLB caches not being transparent. They're still not going to be transparent, but it's a much nicer interface than IPIs were. I'm too lazy to write a proper benchmark, but one thing I'd like to see is a comparison of an app where one thread is crunching numbers while the others continuously call mmap() + munmap() to disturb it. There could be a significant difference in performance between a) having one undisturbed thread only, b) the current solution with N disturbing threads, and c) the INVLPGB solution with N disturbing threads. It would not be a real-world benchmark of course, since allocators batch regions of memory, i.e. not every malloc() translates to an mmap(), and not every free() translates to an munmap().
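
    For what it's worth, here's a rough sketch of that microbenchmark (assuming POSIX threads; compile with -pthread): one thread crunches numbers while N others hammer mmap()/munmap() to force TLB shootdowns on the cruncher's CPU.

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/mman.h>
        #include <unistd.h>

        static atomic_int stop;
        static atomic_long work_done;

        /* The "victim": pure computation, no syscalls of its own. */
        static void *cruncher(void *arg)
        {
            (void)arg;
            volatile double x = 1.0;
            while (!atomic_load(&stop)) {
                for (int i = 0; i < 100000; i++) x *= 1.0000001;
                atomic_fetch_add(&work_done, 1);
            }
            return NULL;
        }

        /* Each munmap() forces every CPU running this process to drop
         * its TLB entries: today via IPI, maybe via INVLPGB tomorrow. */
        static void *disturber(void *arg)
        {
            (void)arg;
            while (!atomic_load(&stop)) {
                void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED) break;
                *(char *)p = 1;   /* touch so a mapping actually exists */
                munmap(p, 4096);
            }
            return NULL;
        }

        int main(int argc, char **argv)
        {
            int n = argc > 1 ? atoi(argv[1]) : 0; /* disturbing threads */
            if (n > 64) n = 64;
            pthread_t c, d[64];
            pthread_create(&c, NULL, cruncher, NULL);
            for (int i = 0; i < n; i++)
                pthread_create(&d[i], NULL, disturber, NULL);
            sleep(5);
            atomic_store(&stop, 1);
            pthread_join(c, NULL);
            for (int i = 0; i < n; i++) pthread_join(d[i], NULL);
            printf("cruncher iterations with %d disturbers: %ld\n",
                   n, atomic_load(&work_done));
            return 0;
        }

    Comparing the iteration count at n=0 against n>0 on current hardware, and again on an INVLPGB-capable system once OS support lands, would show how much the shootdowns cost the undisturbed thread.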

    submitted by /u/farnoy

    [Hardware Unboxed] - Gaming Battle: Core i7-10750H vs i7-9750H, Why Bother Upgrading?

    Posted: 06 May 2020 04:02 AM PDT

    Microsoft Springs A Surface Refresh: Surface Book 3 And Surface Go 2 Plus Accessories

    Posted: 06 May 2020 06:24 AM PDT

    How the PS5’s New Hardware Could Change How Games Are Made

    Posted: 07 May 2020 02:31 AM PDT

    Some thoughts on Z490 motherboard prices and preorders.

    Posted: 06 May 2020 02:56 PM PDT

    LG Cuts HDMI 2.1 Bandwidth on 2020 4K TVs Below 48Gbps. Does It Matter?

    Posted: 06 May 2020 12:57 PM PDT

    Leaked Lenovo roadmap points to Intel Tiger Lake coming in September (plus new laptops and tablets)

    Posted: 06 May 2020 06:41 AM PDT

    [VideoCardz] - MSI discusses Intel Comet Lake-S processors binning and overclockability

    Posted: 06 May 2020 09:21 AM PDT

    Microsoft Surface Book 3: new Nvidia GPUs, up to 32GB of RAM, and faster SSDs

    Posted: 06 May 2020 06:59 AM PDT

    Your Favourite phone design from past

    Posted: 06 May 2020 01:37 PM PDT

    Nowadays we're in the era of notched, full-screen, curved-display, and generally fragile phone designs. Which phone design would you like to bring back from the past with today's internal specs? My bet would be the HTC One M8 with modern specs, because of that beautiful curve and the front-facing speakers, right where they always should have been. Share your thoughts.

    submitted by /u/aayush_raj

    The ASRock Rack Z490D4U-2L2T, Micro-ATX Server For LGA1200

    Posted: 06 May 2020 07:03 AM PDT

    EETimes - Is IPO in China Imagination’s Only Possible Exit Path?

    Posted: 06 May 2020 11:16 AM PDT

    Microsoft Surface Go 2: 10.5-inch display, thinner bezels, and better battery life

    Posted: 06 May 2020 06:59 AM PDT

    Huawei PCs powered by Kunpeng Processors and HarmonyOS 2.0

    Posted: 06 May 2020 05:43 AM PDT
