• Breaking News

    Monday, March 29, 2021

    Hardware support: [LTT] How Motherboards Work - Turbo Nerd Edition

    [LTT] How Motherboards Work - Turbo Nerd Edition

    Posted: 28 Mar 2021 10:17 AM PDT

    Why is Intel obsessed with AVX512 and beyond when hardware coroutines could offer a significantly greater advantage?

    Posted: 28 Mar 2021 06:45 PM PDT

    This post is probably going to backfire on me because the opinions I voice are so controversial, but here it is anyway.

    From a performance standpoint as a coder, it makes little sense why CPU technologies have evolved the way they have, particularly on the x86 architecture. Inside every x86 CPU (and CPUs of most other architectures to some extent), there are several different technologies at work which try to speed up code execution. Note that this is not an exhaustive list and these are all woefully inadequate definitions, but I didn't want to write a book when I wrote this post.

    • Instruction pipelining, which helps keep the CPU busy all the time by getting rid of the wait time between instructions.
    • Superscalar execution, which enables (most often adjacent) instructions without dependencies to execute simultaneously on different ports.
    • Out-of-order execution, in which independent instructions are reordered to help improve superscalar execution. An interesting note is that GCC, being the God compiler it is, auto-shuffles the assembly instructions on the higher optimization levels in order to save the CPU some work.
    • Register renaming, which helps reveal non-dependent instructions so that they can be executed out of order for superscalar execution.
    • Branch prediction, which involves trying to predict whether a conditional jump will jump or not, and where it will jump to, based upon its history.
    • Speculative execution, which is where the CPU tries to execute the branch it thinks will be taken up ahead, reverting any progress made if it was wrong. Speculative execution is NOT branch prediction. Branch prediction tries to keep the pipeline full by prefetching the instructions it thinks will be executed. Speculative execution takes this a step further by actually running them.
    • Multithreading, which involves running many instruction streams in parallel. The problem with multithreading is that it is so rarely useful, because most software has a long critical chain of data-processing steps, and breaking that chain up often requires redesigning the software. If you configured your computer to use only 2 of its CPU cores, there might not even be any noticeable difference in the performance of everyday applications like the Firefox web browser or LibreOffice. Further, sharing data and variables between threads requires atomics, locks, barriers, and more, which disrupt many facets of instruction-level parallelism, so two threads are never twice as fast as one. I've even seen cases where two threads are slower than a single thread because all the inter-thread communication and synchronization slowed everything down (a minimal demonstration of this overhead follows this list).
    • MMX, (S)SSE, and AVX, which aim to help with working on multiple smaller integers or floating-point numbers all at once. However, less than 10% of real software uses SIMD, and that software only uses SIMD to speed up less than 1% of its code (yes, there are exceptions, but even 3D games usually defer their number-crunching to the GPU and sometimes omit SIMD entirely). SIMD seems great on paper, but nearly zero software gets a significant speed-up from it. Yes, yes, individual sections get sped up. What I'm saying is that the software still spends a significant amount of time doing non-vectorizable operations (especially branching), which account for most of the execution time. Additionally, it often takes a lot of time and testing for the developer to convert scalar code to use SIMD.
    • AVX512, which enables massively parallelized conditional processing using masks. It's an extension of its predecessors in that masks can be used to emulate branching in a highly parallelized and incredibly fast way. However, due to its far greater complexity and endless meadows of opcodes, it takes coders far longer to take advantage of AVX512. Additionally, AVX512 can force the CPU to lower its clock speed if used without restraint, so it can't be used for minor/trivial operations.
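
    To make the multithreading point above concrete, here is a minimal sketch of my own (not from any particular codebase; the names worker, ITERS, and now() are all made up for illustration): two threads hammering one shared atomic counter. On many machines the two-thread version is no faster, and sometimes slower, than one thread doing all the work, because every increment bounces the counter's cache line between cores. Compile with something like cc -O2 demo.c -lpthread.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 20000000L

    static atomic_long counter;

    static void *worker(void *arg)
    {
        long n = *(long *)arg;
        for (long i = 0; i < n; i++)
            atomic_fetch_add(&counter, 1); /* cache line ping-pongs between cores */
        return NULL;
    }

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        long half = ITERS / 2, whole = ITERS;
        pthread_t a, b;

        /* Two threads splitting the work, but sharing the counter. */
        double t0 = now();
        pthread_create(&a, NULL, worker, &half);
        pthread_create(&b, NULL, worker, &half);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("two threads: %.3f s\n", now() - t0);

        /* One thread doing everything: frequently as fast or faster. */
        t0 = now();
        worker(&whole);
        printf("one thread:  %.3f s\n", now() - t0);
        return 0;
    }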

    The above technologies seem awesome, and any program utilizing them all should theoretically run hundreds of times faster than without them. Together, all of the above technologies make CPUs, particularly AMD and Intel x86 processors (which feature all of the above), really, really fast. However, there are lots of problems with this performance which severely reduce the speedup in most software. Again, there is some specialized software that benefits from these technologies. My complaint is that practical everyday software like web browsers, text editors, and games doesn't receive a massive speedup from these technologies, because they are largely inapplicable to the greatest bottlenecks in the code. If you make 80% of the code run four times faster, then by Amdahl's law the overall speedup is only 1 / (0.2 + 0.8/4) = 2.5x, and that's the problem. These CPU technologies can speed up large sections of the code, but there will still be portions of the code which gain no advantage, and those sections cause the software to run slowly.

    • Instruction pipelining sounds great because it increases throughput, but pipelines add latency to instruction execution. This latency becomes a problem during branching, function calls (a form of branching), and context switches.
    • Superscalar execution sounds great, but most software code involves input and output (because that's often the most useful thing to do with software), whether file IO, GUI IO, IPC, network IO, GPU IO, or some other kind. Most if not all IO code involves a very long critical dependency path, where most instructions depend upon the previous one. Observe the following implementation of atoll from musl (the dependency annotations are mine).

    #include <ctype.h>

    long long atoll(const char *s)
    {
        long long n = 0;
        int neg = 0;
        while (isspace(*s)) s++;
        switch (*s) {
        case '-': neg = 1;
        case '+': s++;
        }
        /* Compute n as a negative number to avoid overflow on LLONG_MIN */
        while (isdigit(*s))          /* *s depends on the previous s++ */
            n = 10*n - (*s++ - '0'); /* n depends on the previous n    */
        return neg ? n : -n;
    }
    • Out-of-order execution also sounds great, and it enhances superscalar execution; however, there's only so much you can do when the code is this difficult to parallelize because of its long critical chain.
    • Register renaming also helps, but again, you're throwing good money after bad.
    • Branch prediction actually helps with most code. Branches occur everywhere, and predicting them gives a massive performance boost (see the sketch after this list).
    • Speculative execution greatly enhances branch prediction at the cost of severe and hard-to-patch security holes (see Spectre and Meltdown).
    • Multithreading is problematic because it is hard to break up the long chain of dependencies. The reason consumer hardware usually has 4 cores and rarely more than 8 is that you wouldn't notice the difference in performance, because so much software is single-thread bound.
    • All SIMD variants suffer from the fact that practical code needs branches to do something useful. You can't write practical software using only branchless GLSL shader code.
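
    As a concrete illustration of the branch prediction point above, here is the classic experiment (my sketch; the name sum_big and the threshold 128 are arbitrary): summing only the large elements of an array is several times faster once the array is sorted, because the branch becomes predictable. Caveat: at high optimization levels the compiler may emit branchless code (cmov or SIMD) and flatten the difference, so compile with a low optimization level or inspect the assembly.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 1000000
    #define ROUNDS 100

    static int cmp_int(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    static long long sum_big(const int *arr, size_t n)
    {
        long long sum = 0;
        for (size_t i = 0; i < n; i++)
            if (arr[i] >= 128)   /* predictable only once the data is sorted */
                sum += arr[i];
        return sum;
    }

    int main(void)
    {
        static int arr[N];
        for (size_t i = 0; i < N; i++)
            arr[i] = rand() % 256;

        clock_t t0 = clock();
        long long s = 0;
        for (int r = 0; r < ROUNDS; r++)
            s += sum_big(arr, N);
        printf("unsorted: %.2fs (sum %lld)\n",
               (double)(clock() - t0) / CLOCKS_PER_SEC, s);

        qsort(arr, N, sizeof arr[0], cmp_int);

        t0 = clock();
        s = 0;
        for (int r = 0; r < ROUNDS; r++)
            s += sum_big(arr, N);
        printf("sorted:   %.2fs (sum %lld)\n",
               (double)(clock() - t0) / CLOCKS_PER_SEC, s);
        return 0;
    }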

    The Solution

    I believe the solution is to create an extremely generic form of hardware coroutines that can be applied to any code to speed it up.

    • In essence, this parallelism will be in the form of short-lived threads executing in parallel with sequential atomic ordering guarantees (a rough software analogue appears after this list).
    • A CPU will be single-threaded but designed with n concurrency, i.e. the ability to execute n coroutines simultaneously. Any attempt to spawn more than n will block the parent coroutine (or, rather than blocking, the new coroutine can queue behind the last one). Instead of having to rewrite your code to take advantage of new hardware features, your code will automatically scale to the number of coroutine slots, provided that you use the coroutines feature.
    • The key advantage of coroutines and why they will be so fast is parallelism and the pipeline. Each coroutine has its own pipeline and there can be several shared pipelines between the coroutines for performing atomically ordered operations upon a shared variable waiting in a queue. Coroutines will be super fast because coroutines enable the CPU to know in advance which instructions will be executed in what order. Essentially, the power of coroutines is that they greatly boost throughput by streamlining the pipeline and this boost can be amplified by the hardware CPU vendor adding more coroutine processors.
    • High-level compiled/JIT languages will be able to benefit from hardware coroutines (but probably not high-level interpreted languages). This is a fucking tremendous deal. The single biggest reason why SIMD is so rare is that it's useless in languages like Python (specifically PyPy) and JavaScript. One tiny little branch can annihilate the delicate SIMD pipeline, and high-level loosely typed languages are littered with countless hidden branches, potholes, and speed bumps.
    • SIMD (including AVX512) is no longer needed/useful. Hardware coroutines will undoubtedly have a lower theoretical throughput than SIMD (and especially AVX512), but hardware coroutines will be a lot more flexible, so hardware coroutines may even be faster in some circumstances because less code is needed to achieve the same objective.
    • Significantly easier to code and 100% backward compatible. Instead of having to spend hours writing different code for AVX512 and non-AVX512, you write the code once. Hardware coroutines complement the serial nature of code and the serial nature of our thought processes, so it's a lot easier to write highly parallelizable code using these coroutines. If you are targeting a CPU without hardware coroutines, the compiler simply removes all the hardware coroutines annotations. If your target CPU does support hardware coroutines, you compile your code once and it automatically scales to the concurrency of the CPU it is running on.
    • Global and local variables are implicitly stored in different registers and globals are implicitly atomic, forming a queue of coroutines waiting in line to write to a global register.
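
    Here is a rough software analogue (my own sketch using today's primitives, not the proposed hardware; all names are invented) of the first bullet above: worker threads compute in parallel, but commit their results to a shared variable strictly in spawn order via an atomic ticket, giving sequential ordering guarantees on the "global register."

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTASKS 8

    static atomic_int now_serving = 0; /* whose turn it is to commit */
    static long shared_total = 0;      /* the "global register"      */

    typedef struct { int ticket; long value; } Task;

    static void *run_task(void *arg)
    {
        Task *t = arg;
        long result = t->value * t->value;   /* parallel part: any pure work */

        while (atomic_load(&now_serving) != t->ticket)
            ;                                /* wait until it's our turn     */
        shared_total += result;              /* sequential, in-order commit  */
        atomic_fetch_add(&now_serving, 1);   /* hand off to the next ticket  */
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NTASKS];
        Task tasks[NTASKS];
        for (int i = 0; i < NTASKS; i++) {
            tasks[i] = (Task){ .ticket = i, .value = i + 1 };
            pthread_create(&threads[i], NULL, run_task, &tasks[i]);
        }
        for (int i = 0; i < NTASKS; i++)
            pthread_join(threads[i], NULL);
        printf("total = %ld\n", shared_total); /* 1+4+9+...+64 = 204 */
        return 0;
    }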

    Some downsides of hardware coroutines are:

    • Significant investment. This goes against the grain of the direction CPUs have been heading in for many, many years, so it could quite possibly require redesigning the CPU from the ground up to take advantage of hardware coroutines. However, once supported and used in software, hardware coroutines will be able to speed up all code everywhere.
    • Long time to gain software support. Proprietary software will likely never benefit from hardware coroutines because none of its core components have been rewritten in the last ten years and they're unlikely to be rewritten for another ten years. However, this is probably an upside, because it means free libre software will become significantly faster and more optimized than its proprietary counterparts, increasing the push towards software freedom.
    • Compilers would need to support it. I do admit it's entirely possible for compilers to never get smart enough to be able to properly optimize code utilizing hardware coroutines. That's what happened with Intel Itanium. However, if compilers do get smart enough to be able to turn hardware coroutines into good assembly, then we are in business.
    • For high-level languages to really take advantage of hardware coroutines, new syntax features will need to be added to the languages. C has gained essentially no new operators since its creation, and C is the best language for maximizing utilization of hardware coroutines, so we would likely need to depend upon GNU C extensions. It'll take MSVC at least 10 years to support hardware coroutine extensions, and, even then, MSVC will still produce shitty half-optimized assembly.
    • All of the coroutines share the same variables and globals, so heap memory management between the coroutines would get tricky and messy.

    So, here's an example implementation of the atoll code above rewritten to use hardware coroutines (at least this is what I imagine it would look like):

    long long atoll(const char *originalPos)
    {
        const char *s;
        task skipWhitespace {
            while (1) {
                const char *curPos = originalPos;
                ++originalPos;
                // eachCharacter happens after skipWhitespace, but skipWhitespace
                // is happening right now, so we execute eachCharacter right now
                // OR we block this thread until a coroutine is available
                task eachCharacter after skipWhitespace {
                    if (!isspace(*curPos)) {
                        // "exit" kills a coroutine the next time it tries to spawn
                        // a child coroutine AND clears the waiting queue of tasks.
                        // This ensures the ability of the parent to clean up.
                        exit skipWhitespace;
                        // s will be the output variable
                        s = curPos;
                    } else {
                        // "next" executes the next task after skipWhitespace
                        next skipWhitespace;
                    }
                }
            }
        }
        join skipWhitespace;

        const char *checkNegPos = s;
        long long neg = 1;
        task checkForNegative {
            // should get optimized into cmov, then s -= neg
            if (*checkNegPos == '-') { neg = -1; s++; }
        }
        task checkForPositive {
            if (*checkNegPos == '+') { s++; }
        }
        join checkForNegative;
        join checkForPositive;

        // Unlike musl, n is accumulated here as a positive number
        // (overflow on LLONG_MIN is not handled)
        long long n = 0;
        task processNumber {
            while (1) {
                const char *curPtr = s;
                task eachCharacter after processNumber {
                    char curValue = *curPtr;
                    if (!isdigit(curValue)) {
                        exit processNumber;
                    } else {
                        // the compiler may move "n = n*10 + curValue - '0'" to after
                        // "next processNumber;", seeing that the isdigit check takes
                        // a lot longer to perform than this simple arithmetic
                        n = n*10 + curValue - '0';
                        next processNumber;
                    }
                }
                ++s;
            }
        }
        join processNumber;
        return neg * n;
    }

    Observe how the hardware-coroutine version does about the same number of things. It even looks like it would be slower in this case. However, the hardware coroutines would be much faster here, because the pipeline stays busy and several instructions can be executed in parallel without the overhead of needing to prove that they have no side effects. Another way of looking at hardware coroutines is that they are similar to speculative execution, except that they carefully control and optimize eager execution, and they are explicit, so no work needs to be done to try to reverse side effects.

    Next, let me explain a possible way the registers, variables, and stack would work. We add a new series of registers local to each coroutine, say registers C0-C15, where C0-C7 are the parent coroutine's state and C8-C15 are the child coroutine's state. Entering a coroutine remaps C8-C15 onto C0-C7 automatically, such that any writes to C0-C7 from the child coroutine are visible across all other coroutines spawned by the parent, forming a queue of waiting coroutines when necessary. R0-R15 are unique to each coroutine, starting off with a copy of the parent's state. This copying is necessary to ensure backward compatibility with existing libraries.

    Next, let's discuss how the stack would work. Well, it's going to be a complete gagglefuck if we try to implement it with a software solution, because we need high performance and we need backward compatibility with library function calls not compiled with support for hardware coroutines. So, here's how it would work: each coroutine processor has its own private lookup table correlating the perceived view of the stack to the actual view of the stack. This lookup table is only invoked when offsetting the rbp register and when the assembly push/pop/call/ret instructions are used. The CPU manages the actual stack pointer location and the pushing of data onto the stack. The CPU accesses memory as if it had the privileges of the program, so the usual stack-overflow segmentation-fault rules still apply.
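
    To make the lookup-table idea more concrete, here is a very rough software model (entirely my guess at the mechanism; every name in it is invented for illustration): a per-coroutine table translates the "perceived" stack addresses the program uses into the private slots the hardware actually assigned, falling through untouched for code that predates hardware coroutines.

    #include <stdint.h>
    #include <stdio.h>

    #define SLOTS 16

    typedef struct {
        uintptr_t perceived; /* address the program believes it is using */
        uintptr_t actual;    /* private slot the coroutine really owns   */
    } StackMapEntry;

    typedef struct {
        StackMapEntry map[SLOTS];
        int used;
    } CoroutineStackMap;

    /* Translate a perceived address; fall back to identity for addresses
     * owned by code that was never compiled with coroutine support. */
    static uintptr_t translate(const CoroutineStackMap *m, uintptr_t perceived)
    {
        for (int i = 0; i < m->used; i++)
            if (m->map[i].perceived == perceived)
                return m->map[i].actual;
        return perceived; /* backward compatibility: pass through untouched */
    }

    int main(void)
    {
        CoroutineStackMap m = { .map = { { 0x1000, 0x9000 } }, .used = 1 };
        printf("%#lx -> %#lx\n", 0x1000UL, (unsigned long)translate(&m, 0x1000));
        printf("%#lx -> %#lx\n", 0x2000UL, (unsigned long)translate(&m, 0x2000));
        return 0;
    }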

    Hardware Coroutines replace SIMD

    Observe the below slow scalar code:

    int32_t sumInt32Values(int32_t * arr, size_t length) {
        int32_t sum = 0;
        for (size_t i = 0; i < length; i++)
            sum += arr[i];
        return sum;
    }

    We can not-so-easily accelerate it with SSE2, for example:

    #include <emmintrin.h> /* SSE2 intrinsics */

    int32_t sumInt32ValuesSSE(int32_t * arr, size_t length) {
        int32_t sum = 0;
        size_t i = 0;
        if (16 <= length) {
            __m128i sumX4 = _mm_set1_epi32(0);
            for (size_t end = length & ~3; i < end; i += 4)
                sumX4 = _mm_add_epi32(sumX4,
                                      _mm_loadu_si128((const __m128i *)(arr + i)));
            int32_t sX4B[4];
            _mm_storeu_si128((__m128i *)&sX4B[0], sumX4);
            sum += sX4B[0] + sX4B[1] + sX4B[2] + sX4B[3];
        }
        for (; i < length; i++)
            sum += arr[i];
        return sum;
    }

    With coroutines, the number of parallel summations scales to the concurrency of the CPU, enabling potentially drastically better performance than SSE.

    int32_t sumInt32ValuesTask(int32_t * arr, size_t length) {
        int32_t sum = 0;
        task summationTask {
            for (size_t i = 0; i < length; i++)
                // "extends" adds an asynchronous dependency
                task addToSum extends summationTask {
                    sum += arr[i];
                }
        }
        join summationTask;
        return sum;
    }

    The above code is locked into serially adding each element, because every operation inside a hardware coroutine is atomic with respect to all other coroutines, so the coroutines are stuck waiting in line for the previous coroutine to finish adding to sum. Nevertheless, this will rival and perhaps surpass the performance of SIMD. The reason why is superscalar execution. While the coroutines are waiting in line, the CPU has time to investigate adjacent queued coroutines. Upon seeing two adjacent 32-bit additions waiting in line and recognizing that integer addition is associative, the CPU can combine adjacent queue slots out of order while they wait, enabling a high degree of parallelism.
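
    A software analogue of this associativity trick already exists today (my sketch, not the author's proposal): splitting the sum across several independent accumulators breaks the serial dependency chain, so a superscalar CPU can overlap the additions. This is roughly the reordering the queue hardware described above would perform automatically.

    #include <stdint.h>
    #include <stddef.h>

    int32_t sumInt32ValuesILP(const int32_t *arr, size_t length) {
        int32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= length; i += 4) {
            s0 += arr[i];     /* these four additions have no        */
            s1 += arr[i + 1]; /* dependencies on each other, so they */
            s2 += arr[i + 2]; /* can issue in the same cycle         */
            s3 += arr[i + 3];
        }
        for (; i < length; i++)
            s0 += arr[i];
        return s0 + s1 + s2 + s3;
    }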

    It's too complicated

    You are probably thinking "it's too complicated." Well, let me ask you, is speculative execution too complicated? If you asked anyone 30 years ago whether mainstream consumer computers were going to support speculative execution they would have laughed at you. And, here we are today.

    Also, an important thing to note is that I imagine the assembly for these hardware coroutines being very primitive and very simple in order to maximize performance. The fancy syntax I wrote in my C code snippets is what I imagine the syntactic sugar would look like, much in the same way that if/for/while/switch statements are sugar for complex conditional jumps.
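
    To make the analogy concrete, here is an ordinary while loop next to the explicit-jump form a compiler effectively lowers it to (a toy example of mine); the task/after/join syntax above would desugar into primitive coroutine opcodes in the same spirit.

    #include <stdio.h>

    int main(void)
    {
        int i = 0;

        /* sugared form */
        while (i < 3) {
            printf("%d\n", i);
            i++;
        }

        /* desugared form: explicit conditional jumps */
        i = 0;
    loop_top:
        if (!(i < 3))
            goto loop_end;
        printf("%d\n", i);
        i++;
        goto loop_top;
    loop_end:
        return 0;
    }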

    CPUs already do this

    Yes, CPUs already do a similar superscalar parallelization to some extent. However, don't let the hype fool you. CPUs are really dumb. The coder understands his/her code far better than the compiler, which understands the code far better than the CPU. Allowing the coder to assist the CPU with parallelization and out-of-order processing will, I believe, have drastic positive effects on performance, especially because more coroutine concurrency means a faster computer, so CPU manufacturers will eventually start producing consumer CPUs with hundreds of parallel coroutines. They don't produce those CPUs right now because no one would buy them, because so much everyday software is stuck in a single thread.


    submitted by /u/ILikeToPlayWithDogs

    Nintendo Game Boy modded to mine Bitcoin (spoiler: it's slow)

    Posted: 28 Mar 2021 05:33 AM PDT

    Is Rtings a trustworthy review site?

    Posted: 28 Mar 2021 09:56 AM PDT

    I was wondering how trustworthy you guys think Rtings, TechRadar, and PCMag are. Or tell me your most trusted tech review site.

    submitted by /u/Emotional_Worry9770

    Intel's Tiles vs AMD's Chiplets [TechTechPotato]

    Posted: 28 Mar 2021 06:01 AM PDT
