Build your own pyramids, write your own hieroglyphs
May 2, 2021 8:32 AM

 
Lovely! And just to round things out:
posted by flabdablet at 8:54 AM on May 2, 2021


And then there's the secret stuff, Mystery CPU Instructions.
posted by sammyo at 11:40 AM on May 2, 2021 [3 favorites]


Here's a browser simulation that starts with NAND gates and ends with a 16-bit CPU: https://nandgame.com/
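If you'd rather poke at the first few levels in code form, here's a minimal C sketch of the idea (function names are mine, nothing taken from the site itself): every other gate derived from NAND alone.

```c
#include <stdio.h>

/* Everything below is built from this one primitive. */
static int nand(int a, int b) { return !(a && b); }

static int not_(int a)        { return nand(a, a); }
static int and_(int a, int b) { return not_(nand(a, b)); }
static int or_(int a, int b)  { return nand(not_(a), not_(b)); }
static int xor_(int a, int b) { return and_(or_(a, b), nand(a, b)); }

/* Half adder: the first rung on the ladder from gates to a 16-bit CPU. */
static void half_add(int a, int b, int *sum, int *carry)
{
    *sum   = xor_(a, b);
    *carry = and_(a, b);
}

int main(void)
{
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++) {
            int s, c;
            half_add(a, b, &s, &c);
            printf("%d+%d -> sum=%d carry=%d\n", a, b, s, c);
        }
    return 0;
}
```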
posted by fantabulous timewaster at 5:03 PM on May 2, 2021 [7 favorites]


Wow, I have a lot to learn if I want even the sort of non-specialist awareness of wtf is happening in a modern computer that is possible... I took a computer architecture undergrad course many years ago; it was based on the Motorola 68K architecture, with some discussion of RISC for good measure. That's about where my low-level knowledge ends...
posted by thelonius at 5:40 PM on May 2, 2021 [1 favorite]


I've only watched the Intel ones, and I guess I was a little disappointed, but they were still interesting. I was hoping for something about microcode, but I've never seen anyone address it in micros.
He might have been hinting at that with μOps, but didn't really talk about it to my satisfaction. Programmers like to talk about machine language being pedal-to-the-metal, but microcode needs to be running the show to actually move the electrons. How else could a machine know how to add or store or do anything?
My deepest understanding of a CPU was with the IBM 3033- I spent 14 weeks in in-depth training in the summer of 1978, and I was interested in seeing how they compare.
The 3033 was a 64-bit, 32MB processor with a cycle time of 57ns. Everything today is much faster and more ... everything, but I haven't seen much new.
The instruction stream explanation seemed inadequate. I wasn't sure why they were executing instructions after a branch before the branch was even executed, but I guess that was with multiple ALUs in speculative execution. I didn't see why they would throw away executed instructions.

I was impressed with the predictive branching. The 3033 had some predictive branching, but it did not 'learn' from the program running. The 3033 had 3 instruction pipelines going, so when it encountered a branch, both outcomes were set up, and if there was another branch, it had a third pipeline. The branch not taken was deleted, but it wasn't wasted, as the Intel guy said. In looking at the diagram of the front end, it looks like there were two pipelines for instructions, but he didn't mention it.

Hoping for more from Patterson, but don't have the time to watch it now.
posted by MtDewd at 5:50 PM on May 2, 2021


I think the mostly-unmentioned pipeline split was the Integer vs Floating Point instructions. I looked at the top, and the darker blue vs lighter blue things there had the same label; I think that's a bit of a bug. Later on, down in the ALU realm, the dark/light sides were Integer vs Floating Point.
posted by zengargoyle at 6:06 PM on May 2, 2021


MtDewd: Intel once tried a microarchitecture that executed both choices at each branch, but the increased performance was far outweighed by the increased power consumption. Power consumption was but a minor constraint in 1970s mainframes.

Modern CPUs invest a lot of resources into predicting which way of a branch is most likely to be taken, giving the improved performance at much less extra power consumption. 1970s logic technology couldn't implement good branch prediction cost effectively, because small amounts of semiconductor memory were really expensive.
posted by monotreme at 8:22 PM on May 2, 2021


Programmers like to talk about machine language being pedal-to-the-metal, but microcode needs to be running the show to actually move the electrons. How else could a machine know how to add or store or do anything?

A lot of early machines did it by decoding the programmer-visible instruction set directly in hardware rather than having a microcode layer do that. The venerable 6502, for example, uses instruction bits and cycle counts as inputs into a lookup table whose outputs directly control the internal buses and latches. In effect, the control circuitry of the 6502 is just a big state machine, not a microcode interpreter. I guess you could look at the contents of the lookup table as a microprogram if you squint a little, but the way the state machine operates is so tightly integrated with the way the instruction set is laid out that I don't think it is one, not really.
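A toy model of that style of control, in C (the opcodes, signals, and table contents here are entirely invented, not the real 6502's): instruction bits plus a cycle counter index a table whose outputs are the control lines, with no microcode sequencer in sight.

```c
#include <stdint.h>

/* Hypothetical control signals, one bit each; a real part drives
   dozens of these lines straight into buses and latches. */
enum {
    CTL_PC_INC   = 1 << 0,  /* bump the program counter        */
    CTL_MEM_READ = 1 << 1,  /* drive address bus, read memory  */
    CTL_ALU_ADD  = 1 << 2,  /* ALU performs an add this cycle  */
    CTL_A_LATCH  = 1 << 3,  /* latch ALU result into A         */
};

#define MAX_CYCLES 4

/* decode[opcode][cycle] -> control word.  The table IS the "state
   machine": instruction bits plus a cycle counter in, control
   lines out.  Made-up opcode 0x01 = "add memory operand to A". */
static const uint8_t decode[2][MAX_CYCLES] = {
    /* 0x00: NOP-ish, just fetch */
    { CTL_PC_INC | CTL_MEM_READ, 0, 0, 0 },
    /* 0x01: fetch, read operand, add, write back */
    { CTL_PC_INC | CTL_MEM_READ,
      CTL_PC_INC | CTL_MEM_READ,
      CTL_ALU_ADD,
      CTL_A_LATCH },
};

uint8_t control_word(uint8_t opcode, uint8_t cycle)
{
    return decode[opcode][cycle];
}
```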

Today's mainstream CPUs treat the programs they fetch from main memory as more a collection of hints about what's intended than a sequence of actual instructions; they will re-order stuff to run multiple operations in parallel as internal resources become available, and the internal register files and overall data architecture bear almost no resemblance to those defined by the ISA. Quite a lot of the development in this space has been driven by Intel's mouldy old x86 instruction set being so totally not up to the job. Modern Intel and AMD machines are as blazingly fast as they are despite, not because of, that instruction set architecture.
posted by flabdablet at 9:25 PM on May 2, 2021 [6 favorites]


Modern CPUs invest a lot of resources into predicting which way of a branch is most likely to be taken, giving the improved performance at much less extra power consumption. 1970s logic technology couldn't implement good branch prediction cost effectively, because small amounts of semiconductor memory were really expensive.

The other trend that's shifted all the design constraints around in this area has been the increasing mismatch between the ridiculous speeds that CPU clocks run at internally and what's physically achievable on the interconnects between CPU and memory. Thirty years ago that gap was nowhere near as wide, and putting huge caches on the CPU chip was nowhere near as clear a design win.

One particularly interesting technique involved taking advantage of the way DRAM addressing works, where the row and column addresses of the DRAM memory cell being accessed are presented in separate hardware cycles on a common set of address pins. Page mode DRAM allowed the designer to supply one row address and multiple column addresses for accesses that all fell within a single row; a page mode DRAM, accessed this way, could keep up with the sequential instruction fetch rate of even quite a fast CPU of the day.
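A back-of-envelope model of why that worked, in C (all the timing numbers are invented for illustration; the shape of the win is what matters):

```c
#include <stdio.h>
#include <stdint.h>

/* Toy page-mode DRAM timing model. */
#define ROW_BITS 9              /* 512-column rows, say           */
#define T_RAS    4              /* cycles to open a new row       */
#define T_CAS    1              /* cycles per column within a row */

static uint32_t open_row = UINT32_MAX;

static int access_cost(uint32_t addr)
{
    uint32_t row = addr >> ROW_BITS;
    if (row == open_row)
        return T_CAS;           /* page hit: column strobe only   */
    open_row = row;
    return T_RAS + T_CAS;       /* page miss: row + column        */
}

int main(void)
{
    int seq = 0, rnd = 0;
    for (uint32_t i = 0; i < 64; i++)
        seq += access_cost(i);              /* sequential fetch   */
    open_row = UINT32_MAX;
    for (uint32_t i = 0; i < 64; i++)
        rnd += access_cost(i * 12345u);     /* scattered accesses */
    printf("sequential: %d cycles, scattered: %d cycles\n", seq, rnd);
    return 0;
}
```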

That, in turn, meant that there wasn't much point in sticking a big instruction cache on the CPU; it would take about as long to burst-fill a cache line as it would to work through the instructions within that line, assuming sequential execution. One of the innovative features of the AMD Am29000 processor was its branch-target cache: instead of caching every instruction, the 29k would cache only those instructions reached as a result of taking a branch. Fetching an instruction from the branch target cache automatically caused the memory subsystem to start a new page mode cycle beginning with the address following the cached branch target, getting a new sequential instruction stream established just in time.
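Here's roughly the shape of that branch-target cache, sketched in C (sizes and organization invented; the real part differed in detail):

```c
#include <stdint.h>

/* Toy branch-target cache: direct-mapped, each entry holding the
   first few instructions at a branch target. */
#define BTC_ENTRIES 32
#define BTC_RUN      4          /* instructions cached per target */

struct btc_entry {
    uint32_t tag;               /* branch target address          */
    uint32_t insn[BTC_RUN];     /* first instructions at target   */
    int      valid;
};

static struct btc_entry btc[BTC_ENTRIES];

/* On a taken branch: if the target hits, the cached run feeds the
   pipeline while the memory system starts a page-mode burst at
   target + BTC_RUN instructions, so sequential fetch resumes just
   in time. */
const uint32_t *btc_lookup(uint32_t target, uint32_t *refill_addr)
{
    struct btc_entry *e = &btc[(target / 4) % BTC_ENTRIES];
    if (e->valid && e->tag == target) {
        *refill_addr = target + BTC_RUN * 4;  /* stream restarts here */
        return e->insn;
    }
    return NULL;               /* miss: fetch from memory, fill entry */
}
```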

This relatively tiny branch target cache allowed the 29k to devote much more chip area to its data cache, which really did improve performance since much more data access than code access is non-sequential.
posted by flabdablet at 9:51 PM on May 2, 2021


Meanwhile in microcontroller land, we have the exact same issue, except instead of DRAM, it's reading code from flash. My Cortex-M4 has a cute little 256-byte cache to hide some of that flash controller latency.
posted by ryanrs at 10:15 PM on May 2, 2021


The instruction stream explanation seemed inadequate. I wasn't sure why they were executing instructions after a branch before the branch was even executed, but I guess that was with multiple ALUs in speculative execution. I didn't see why they would throw away executed instructions.

Speculative execution happens around conditional branches, and it's needed because in a pipelined architecture where parts of multiple sequential instructions get executed in parallel, the condition that tells a branch which way to go might not exist to be tested until the instructions that define that condition have finished executing.

If those condition-defining instructions are being processed in the pipeline in parallel with the conditional branch instruction, then without speculative execution the pipeline just has to stall until the condition gets calculated. This costs time.

Branch prediction uses heuristic rules like "a backward branch will probably be taken, because that's probably a loop" and "a forward branch will probably not be taken, because that's probably a loop exit condition" to guess what the condition will be before it's available for testing. Speculative execution runs the instructions from the predicted post-branch path, but it holds the results of those instructions instead of writing them back to the registers or memory where they'd normally go.
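Those two guessing rules together are the classic static "backward taken, forward not taken" heuristic; the simplest 'learning' upgrade is a table of two-bit saturating counters. A sketch of both (table size invented):

```c
#include <stdbool.h>
#include <stdint.h>

/* Static heuristic: a branch whose target is at a lower address is
   probably the bottom of a loop, so predict taken; a forward branch
   is probably a loop exit, so predict not taken. */
bool predict_static(uint32_t branch_pc, uint32_t target_pc)
{
    return target_pc < branch_pc;
}

/* Simplest dynamic ("learning") predictor: 2-bit saturating counters
   indexed by branch address.  It takes two wrong guesses in a row to
   flip a strongly-held prediction. */
static uint8_t counters[1024];           /* values 0..3 */

bool predict_dynamic(uint32_t branch_pc)
{
    return counters[(branch_pc / 4) % 1024] >= 2;
}

void train(uint32_t branch_pc, bool taken)
{
    uint8_t *c = &counters[(branch_pc / 4) % 1024];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}
```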

Once the condition-defining instruction has run to completion and the condition can be tested, it may turn out that the branch predictor's guess was wrong. The results of all the speculatively executed instructions must then be discarded, because the results of having executed them do not match the intended results specified by the instructions as fetched.

Discarding these results means that speculative execution has been wasting time, but only as much time as the pipeline stall it was attempting to avoid would have done. But the point of all these shenanigans is that branch predictors will guess right most of the time, so most speculative execution will end up having its results saved, resolving to actual execution. Effectively, the only pipeline stalls that occur happen when the branch predictor gets it wrong.
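To put rough numbers on that trade-off (all figures invented for illustration):

```c
#include <stdio.h>

/* Back-of-envelope: with prediction, you only pay the flush penalty
   on mispredicts; without it, every conditional branch stalls for
   about the same number of cycles. */
int main(void)
{
    double branch_freq   = 0.20;   /* fraction of instructions      */
    double accuracy      = 0.95;   /* predictor hit rate            */
    double flush_penalty = 15.0;   /* cycles lost per mispredict    */

    double stall_cpi   = branch_freq * flush_penalty;
    double predict_cpi = branch_freq * (1.0 - accuracy) * flush_penalty;

    printf("extra CPI, always stalling: %.2f\n", stall_cpi);   /* 3.00 */
    printf("extra CPI, predicting:      %.2f\n", predict_cpi); /* 0.15 */
    return 0;
}
```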
posted by flabdablet at 10:42 PM on May 2, 2021


Incidentally, another nice feature of the Am29k architecture compared to x86 is relevant to this kind of thing. The x86 ISA defines a single condition code register that instructions modify implicitly as a side effect; for example, as well as a SUB instruction writing the subtraction result back to the explicitly specified memory or register destination, it also writes to the zero, carry, overflow and sign bits inside the condition code register. The CMP (compare) instruction is implemented as a degenerate SUB that updates only the condition code register.

Conditional branches in x86, as well as all the non-branching conditional instructions added later, all depend on the condition code register; there are a bunch of these conditional branches (branch equal, branch greater-than, branch negative and so forth), each of which interprets the bits in that register as if they reflected a subtraction result. This makes the condition code register a bottleneck. If you want to do A if B is greater than C, and then do D if E is less than F, then the A/B/C instruction group cannot be run in parallel with the D/E/F instruction group without some fairly extreme internal shenanigans, because both the "if" steps need to examine the same condition codes.

The 29k handles conditionals differently. Instead of having a single compare instruction that sets condition codes and a bunch of different conditional branches that interpret them, the 29k has a bunch of different compare instructions that write a Boolean result to an explicitly specified destination register, and a single conditional branch that tests a specified source register for Boolean true. This allows multiple conditionals to be pipelined without bottlenecking.
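A toy model of the difference in dependency structure, in C (not real x86 or 29k semantics; the mnemonics in the comments are schematic):

```c
#include <stdbool.h>

/* x86 style: every compare funnels through ONE implicit flags
   register, serializing unrelated conditionals through it. */
static bool flags;                       /* the single CC register */

void x86_style(int b, int c, int e, int f)
{
    flags = b > c;                       /* CMP b, c   (writes flags) */
    if (flags) { /* do A */ }            /* Jcc        (reads flags)  */
    flags = e < f;                       /* CMP e, f   must wait: it  */
    if (flags) { /* do D */ }            /* clobbers the same reg     */
}

/* 29k style: each compare writes its Boolean result to an explicitly
   chosen register, so the two conditionals share nothing and can be
   pipelined or run in parallel. */
void am29k_style(int b, int c, int e, int f)
{
    bool r1 = b > c;                     /* compare-gt  -> r1 */
    bool r2 = e < f;                     /* compare-lt  -> r2 */
    if (r1) { /* do A */ }               /* branch on r1      */
    if (r2) { /* do D */ }               /* branch on r2      */
}
```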

In fact the 29k itself was not superscalar and didn't implement that kind of parallelism; but the point is that the instruction set architecture allows for it in a way that x86 just doesn't.

29k is also a three-address instruction set architecture; x86 is two-address. On the 29k, instructions that do something to two things and yield a result allow you to put that result anywhere you want; on x86, the result always has to replace one of the sources. Again, this makes 29k much more amenable to heavily pipelined and/or superscalar implementation than x86.

One vestigial quirk in 29k is the fact that the next instruction in sequence after a branch instruction will always be executed, either before the branch target or instead of it if the branch doesn't happen. This is a kind of poor-man's speculative execution built right in at ISA level; and because it is visible in the ISA, it doesn't need branch prediction or complex arrangements to discard speculated results. But as soon as an implementation involves a pipeline more than two stages deep, it turns out that this branch delay slot feature causes more complexity than it avoids. Designing computers elegantly is hard.
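A toy interpreter showing that ISA-visible rule (opcodes and program entirely invented): the instruction after a taken branch still executes before control arrives at the target.

```c
#include <stdio.h>

enum { OP_ADD1, OP_BRA, OP_HALT };

struct insn { int op; int target; };

/* One-slot delayed branch: a taken branch doesn't redirect fetch
   until one more instruction (the delay slot) has executed. */
void run(const struct insn *prog)
{
    int acc = 0, pc = 0, pending = -1;
    for (;;) {
        struct insn i = prog[pc];
        int next = pc + 1;
        if (pending >= 0) { next = pending; pending = -1; }
        switch (i.op) {
        case OP_ADD1: acc++; break;
        case OP_BRA:  pending = i.target; break; /* after the slot */
        case OP_HALT: printf("acc = %d\n", acc); return;
        }
        pc = next;
    }
}

int main(void)
{
    /* 0: BRA 3   1: ADD1 (delay slot, still runs)
       2: ADD1 (skipped)   3: HALT   -> prints acc = 1 */
    struct insn prog[] = {
        { OP_BRA, 3 }, { OP_ADD1, 0 }, { OP_ADD1, 0 }, { OP_HALT, 0 }
    };
    run(prog);
    return 0;
}
```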
posted by flabdablet at 12:06 AM on May 3, 2021 [2 favorites]


Something else you might like if you like this kind of thing: much of the complexity around instruction re-ordering and interlocks in heavily pipelined implementations can be neatly sidestepped by pipelining unrelated instructions from multiple threads instead of sequential instructions from a single thread.
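In miniature (a sketch, with everything invented): rotate issue across threads, round-robin, so adjacent pipeline stages never hold dependent instructions.

```c
#include <stdio.h>

#define NTHREADS 4

struct thread {
    int pc;
    int regs[4];
};

static struct thread threads[NTHREADS];

/* Each cycle issues the next instruction from a different thread.
   By the time any given thread issues again, its previous result
   has long since cleared the pipeline, so no interlocks or
   forwarding paths are needed between adjacent stages. */
void run_cycles(int cycles)
{
    for (int cyc = 0; cyc < cycles; cyc++) {
        struct thread *t = &threads[cyc % NTHREADS]; /* rotate the barrel */
        t->regs[0] += 1;   /* "execute" one instruction from thread t */
        t->pc      += 1;
    }
}

int main(void)
{
    run_cycles(16);
    for (int i = 0; i < NTHREADS; i++)
        printf("thread %d ran %d instructions\n", i, threads[i].pc);
    return 0;
}
```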

I invented this technique in a flash of marijuana-fuelled insight some time in the mid eighties, only to find that Seymour Cray had been building them since before I could walk.
posted by flabdablet at 12:17 AM on May 3, 2021 [2 favorites]


fantabulous timewaster wins this year's Eponysterical Award. Thanks everyone, see you next year.
posted by pompomtom at 3:42 AM on May 3, 2021


flabdablet: once you start renaming it doesn't really matter whether it's a=b+c or a=a+b from a microarchitecture point of view. x86 condition codes can be handled similarly - you have to if you're doing speculative execution - though x86 CCs tend to limit how much O-O (out-of-order) you can do

An O-O/speculative 29k would have been cool; the whole register windows stuff could have been hidden in the renamer
posted by mbo at 4:42 AM on May 3, 2021 [1 favorite]


once you start renaming it doesn't really matter whether it's a=b+c or a=a+b from a microarchitecture point of view

True, but mainly because renaming is, at bottom, a technique for turning 2-address operations into 3-address operations behind the scenes. If the ISA is 3-address from the get-go then quite a lot of the performance gains that could otherwise only be achieved with on-chip re-ordering and renaming can be done by the compiler instead.
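The trick in miniature (all structures invented for illustration): the renamer hands each 2-address op a fresh physical destination, which is exactly a 3-address micro-op under the hood.

```c
#include <stdio.h>

#define NARCH  8     /* architectural registers the ISA names     */
#define NPHYS 32     /* physical registers the implementation has */

static int rat[NARCH];       /* register alias table: arch -> phys */
static int next_phys = NARCH;

/* Rename a 2-address op "dst = dst OP src" into a 3-address micro-op
   "newP = oldP(dst) OP P(src)".  The old mapping of dst is left
   alone, so earlier in-flight readers of dst are unaffected - which
   is the whole point. */
void rename_op(int dst, int src)
{
    int old_dst = rat[dst];
    int p_src   = rat[src];
    int new_dst = next_phys++ % NPHYS;   /* a free list, in real life */
    printf("arch r%d = r%d OP r%d  ->  p%d = p%d OP p%d\n",
           dst, dst, src, new_dst, old_dst, p_src);
    rat[dst] = new_dst;
}

int main(void)
{
    for (int i = 0; i < NARCH; i++) rat[i] = i;
    rename_op(1, 2);   /* r1 += r2  ->  p8 = p1 OP p2 */
    rename_op(1, 3);   /* r1 += r3  ->  p9 = p8 OP p3 */
    return 0;
}
```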
posted by flabdablet at 5:35 AM on May 3, 2021


the whole register windows stuff could have been hidden in the renamer

29k register windows are arguably an ISA-visible variant of renaming.
posted by flabdablet at 5:38 AM on May 3, 2021


How much cleaner is 64-bit x86? Was that a chance to do a clean-sheet redesign, or did the need for a fast 32-bit mode shape the underlying hardware enough that the 64-bit ISA had to keep a lot of ugliness?
posted by ryanrs at 12:31 PM on May 3, 2021


Oh wow, I just remembered the Itanium. I guess that would have been Intel's clean-sheet 64-bit arch, so of course x86-64 would carry the warts for compatibility.

Meanwhile, on the ARM side of things, AArch64 dropped some iconic ARM-isms like LDM/STM and conditional execution. Proof that every cool, innovative feature eventually becomes ugly architectural baggage.
posted by ryanrs at 1:08 PM on May 3, 2021


AMD did x86-64, not Intel - it is a lot cleaner - and compared to other CISCs, x86 was always the RISCiest of the bunch, which IMHO is why it's lasted so well.

LDM/STM are difficult to manage for a bunch of reasons; in particular, you have to handle partly-done instructions when they cross a page boundary (unlike every other ARM instruction).
posted by mbo at 1:30 PM on May 3, 2021 [1 favorite]


Oh god, STM across a page boundary. Just throw up your hands and let the OS handle it at that point, ha ha.
posted by ryanrs at 1:57 PM on May 3, 2021


well, x86s handle these sorts of things - but as I mentioned, x86s are pretty RISCy: almost all instructions make only a single memory reference, and there are few side effects (like autoincremented registers), so most instructions can cause only one page fault and are easily restartable ('push mem_location' is the notable exception, which can cause 2) - instructions with a REP prefix are restartable in a different way, because their updated state is stored in user-visible registers
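What "state in user-visible registers" buys you, in miniature (a sketch, not real x86 semantics):

```c
#include <stddef.h>

/* Toy model of why a REP-style instruction is restartable: all of
   its progress lives in caller-visible state (here, the struct), so
   after a page fault the OS can fix the page and simply resume. */
struct rep_state {
    const char *src;   /* plays the role of RSI */
    char       *dst;   /* plays the role of RDI */
    size_t      count; /* plays the role of RCX */
};

/* Copy until done, or until we'd touch an unmapped byte (simulated
   by `fault_at`).  Returns 0 when complete, 1 on "fault"; calling it
   again with the same state continues exactly where it left off. */
int rep_movsb(struct rep_state *s, const char *fault_at)
{
    while (s->count > 0) {
        if (s->src == fault_at)
            return 1;          /* "page fault": state already updated */
        *s->dst++ = *s->src++;
        s->count--;
    }
    return 0;
}
```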

Compare that with the VAX, which reputedly had a worst-case instruction that could touch 27 different pages (and therefore needed at least 27 filled TLB entries, and 27 actual pages present, to progress). I may have remembered the "27" wrongly, but it was a number in that area.
posted by mbo at 2:28 PM on May 3, 2021


I hope it was fully associative, heh.
posted by ryanrs at 4:05 PM on May 3, 2021


Compare that with the VAX, which reputedly had a worst-case instruction that could touch 27 different pages (and therefore needed at least 27 filled TLB entries, and 27 actual pages present, to progress)

The VAX had the POLY instruction, which could evaluate a polynomial up to the 31st degree given a pointer to a table of coefficients. All in floating point.
Result = C0 * x^0 + C1 * x^1 + C2 * x^2 + ... + C31 * x^31

I kinda doubt this opcode made it into many compilers, although I wouldn't put it past a Fortran library.
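For the curious, the standard way to evaluate that form (and, as I understand it, roughly what POLY did internally) is Horner's rule, one multiply-add per coefficient:

```c
#include <stdio.h>

/* Horner's rule: C0 + x*(C1 + x*(C2 + ... + x*C31)),
   working from the highest-degree coefficient down. */
double poly(double x, const double *c, int degree)
{
    double r = c[degree];
    for (int i = degree - 1; i >= 0; i--)
        r = r * x + c[i];
    return r;
}

int main(void)
{
    /* 3 + 2x + x^2 at x = 4  ->  27 */
    double c[] = { 3.0, 2.0, 1.0 };
    printf("%g\n", poly(4.0, c, 2));
    return 0;
}
```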
posted by JackFlash at 5:11 PM on May 3, 2021 [3 favorites]


Yeah, for the curious (though this is all a bit "WHOOSH!")... start with the MOS Technology 6502. Just Google '6502'. It's still simple enough to mostly understand. There are tons of resources out there; people have gone so far as to build one out of plain old chips (it's like a couple square feet in area).

You might also want to check out Ben Eater and his 'Build a 65c02-based computer from scratch' series, among his other breadboardy things like clocks and shitty VGA graphics cards all wired up. It's very ASMR and highly educational.

Also maybe MOnSter 6502. Build a 6502 the hard way.
How big would a modern CPU be at this scale?

The Apple A8X, found in the iPad Air 2, contains about 3 billion transistors. (This is comparable to the number of transistors in modern desktop computer CPUs as well.) At the scale of the MOnSter 6502, that would take about 885,000 square feet (over 20 acres or 8 hectares) — an area about 940 ft (286 m) square.
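The arithmetic roughly checks out (transistor counts and board area approximate):

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Approximate figures: the NMOS 6502 has ~3,510 transistors;
       the MOnSter 6502 board is on the order of 1 square foot. */
    double t_6502       = 3510;
    double t_a8x        = 3e9;
    double monster_sqft = 1.0;

    double area = t_a8x / t_6502 * monster_sqft;
    printf("area: %.0f sq ft (%.1f acres), %.0f ft on a side\n",
           area, area / 43560.0, sqrt(area));
    /* -> roughly 855,000 sq ft, ~20 acres, ~920 ft square:
       the same ballpark as the quoted figures. */
    return 0;
}
```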
It's all pretty complicated.
posted by zengargoyle at 7:50 PM on May 3, 2021 [2 favorites]


The VAX had the POLY instruction, which could evaluate a polynomial up to the 31st degree given a pointer to a table of coefficients. All in floating point.

But was the result correctly rounded? That is what separates the man-FMAs from the boy-FMAs.
posted by ryanrs at 7:56 PM on May 3, 2021

