It is much more about the design of the instructions. RISC instructions generally take a small, fixed amount of time and are conceptually based on a sort of minimum unit of processing, with a weak to very weak memory model (delay slots, pipeline data hazards, required alignment of data, etc.), and the compiler/programmer combines them into usable higher-level operations.
CISC designs, on the other hand, happily encode large, arbitrarily complex operations that take unbounded amounts of time, and have very strong memory models (x86 in particular is infamous here: you can pretty much safely access memory, without any alignment, at any time; even though the result will often be slow, it won't crash).
As an example, the PDP-8 has fewer than 30 instructions but is still definitely a CISC architecture, while some ARM variants have over 1000 instructions yet are still definitely RISC.
RISC is about making building processors simpler, not about making instruction sets and programming with them necessarily simpler.
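To make the alignment point above concrete, here is a small C illustration (just a sketch; the cast version is formally undefined behaviour in C and only "works" because x86-class hardware tolerates misaligned loads):

    #include <stdint.h>
    #include <string.h>

    /* Read a 32-bit value from a possibly misaligned address.
     * The cast relies on the hardware tolerating unaligned loads
     * (x86 does, possibly slowly; stricter RISC parts may trap). */
    uint32_t load_u32_cast(const unsigned char *p) { return *(const uint32_t *)p; }

    /* The portable way: memcpy, which compilers turn into the same
     * single load on targets where unaligned access is allowed. */
    uint32_t load_u32_portable(const unsigned char *p)
    {
        uint32_t v;
        memcpy(&v, p, sizeof v);
        return v;
    }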
Consumers and even manufacturers are both focused on performance as a primary metric, not the size of an instruction set.
It being RISC is secondary to that.
It was the point at, say, the original ARMv1.
I am genuinely curious, I don't know much about instruction set design.
Because exposing that would be a huge burden on the compiler writers. Intel tried to move in that direction with Itanium. It's bad enough with every new CPU having a few new instructions and different timings; the compiler guys would revolt if they had to care how many virtual registers existed and all the other stuff down there.
But why C? If you want languages to interface with each other, it always comes down to C as a lowest common denominator. It's even hard to call C++ libraries from a lot of things. Until a new standard down at that level comes into widespread use, hardware will be designed to run C code efficiently.
Exactly this has been hindering any substantial progress in computer architecture for at least 40 years now.
Any hardware today needs to simulate a PDP-7, more or less, as otherwise the hardware is doomed to be considered "slow" should it not match the C abstract machine (which is mostly a PDP-7) closely enough. As there is no alternative hardware available, nobody invests in alternative software runtime models, which in turn makes investing in alternative hardware models unattractive, as no current software could profit from it. Here we've gone full circle.
It's a trap, especially given that improvements in sequential computing speed are already difficult to achieve and are known to be getting even harder, while the computing model of C is inherently sequential, which makes it quite problematic to make proper use of increasingly parallel machines.
What we would need to overcome this is a computer that is built, like the last time many years ago, as a unit of hardware and software developed hand in hand from the ground up. Maybe this way we could finally overcome the "eternal PDP-7" and move on to some more modern computer architectures (embracing parallelism in the model from the ground up, for example).
Nope, "always" only applies to OSes written in C and usually following POSIX interfaces as their OS ABI.
C isn't the lowest common denominator on Android (JNI is), on Web or ChromeOS (Assembly / JS are), on IBM and Unisys mainframes (language environments are), on Fuchsia (FIDL is), just as a couple of examples.
It has nothing to do with C, specifically, but with the fact that vast amounts of important software tend to be distributed in binary form. In a hypothetical world where everybody is using Gentoo, the tradeoffs would be different and CPUs would most likely expose many more micro-architectural details.
I don’t think that, because they don’t. Your premise is hogwash.
Modern RISC-derived CPUs for the most part expose a load/store architecture driven by the historical evolution of that microarchitectural style, and, if they are SMP, a memory model that C and C++ have only recently adapted to with standards. Intel's ISA most assuredly was not influenced by C. SIMD isn't reminiscent of anything in standard C either.
Also you might want to look into VLIW and the history of Itanium for an answer to your other question.
There's only one question. What do you mean?
But Itanium wasn't out of order. How does that even come close to answering a question about exposed out-of-order machinery?
Because it was implemented in a flawed way.
> "In 2018 Christopher Domas discovered that some Samuel 2 processors came with the Alternate Instruction Set enabled by default and that by executing AIS instructions from user space, it was possible to gain privilege escalation from Ring 3 to Ring 0."
I can swap out a cpu for one with better IPC and hardware scheduling in 10 minutes but re-installing binaries, runtime libraries, drivers, firmware to get newly optimized code -- no way. GPU drivers do this a bit and it's no fun.
"Terribleness" isn't an objective property.
To answer the question of whether it's worth adding this specialized instruction: it really depends on how much die space it adds, but from the look of it, it's specialized handling of an existing operation to match an external spec; that can be not too hard to do and can significantly reduce software complexity for tasks that do that operation. As a CE with no real hardware experience, it looks like a clear win to me.
There will always be a “lowest common denominator” platform that reaches 100% of customers.
By definition the lowest common denominator will be limited, inelegant, and suffer from weird compatibility problems.
It would be interesting to know the back story on this: how did the idea feed back from JS implementation teams to ARM? WebKit via Apple or V8 via Google?
The problem is that JS's double->int conversion was effectively defined as "what Wintel does by default", so on ARM, PPC, etc. you need a follow-on branch that checks for the clamping requirements and corrects the result value to what x86 produces.
Honestly it would not surprise me if the perf gains are due to removing the branch rather than the instruction itself.
The use of INT_MIN as the overflow value is an x86-ism however, in C the exact value is undefined.
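For the curious, the shape of that fixup is roughly the following (a sketch in C; to_int32_slow is a hypothetical stand-in for the engine's software ToInt32 routine, and the range check is the branch being discussed):

    #include <stdint.h>

    int32_t to_int32_slow(double d);  /* hypothetical: full ECMAScript ToInt32 in software */

    static int32_t js_to_int32(double d)
    {
        /* Fast path: a plain truncating conversion, typically a single
         * instruction, valid only when the value is already in range. */
        if (d >= -2147483648.0 && d <= 2147483647.0)
            return (int32_t)d;
        /* Slow path: NaN, infinities and out-of-range values need the
         * spec's modulo-2^32 wrapping applied in software. */
        return to_int32_slow(d);
    }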
It sounds, too, like the performance gains will depend on how often the branch is taken, which seems highly dependent on the values that are being converted?
Most languages don't start with a spec, so the semantics of a lot of these get later specced as "uhhhhh whatever the C compiler did by default on the systems we initially built this on".
Not to my recollection. I don’t recall anyone at uni discussing a C standard until 1989, and even by 2000 few compilers were fully compliant with that C89 spec.
There were so many incompatible dialects of FORTRAN 77 that most code had to be modified at least a bit for a new compiler or hardware platform.
All of the BASIC and Pascal variants were incompatible with each other. They were defined by “what this implementation does” and not a formal specification.
Or did you mean by "taken" that the branch instruction has to be executed regardless of whether the branch is taken or not?
The whole RISC/CISC thing is long dead anyway, so I don't really mind having something like this on my CPU.
Bring on the Mill (I don't think it'll set the world on fire if they ever make it to real silicon, but it's truly different).
No variable-length instructions. No arithmetic instructions that can take memory operands, shift them, and update their address at the same time.
Pretty much what you say, I just liked the way of describing it.
AESE / AESMC (AES encode, AES mix-columns) are an instruction pair in modern ARM chips that runs as a single fused macro-op.
That is to say, a modern ARM chip will see "AESE / AESMC", fuse the two instructions, and execute them simultaneously for performance reasons. Almost every AESE (encode) instruction must be followed by an AESMC (mix-columns), so this leads to a significant performance increase for ARM AES instructions.
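In C that pair typically appears via the ACLE intrinsics; a rough sketch of one middle round, assuming the crypto extension is available (e.g. compiling with -march=armv8-a+crypto):

    #include <arm_neon.h>

    /* One AES encryption round: the compiler emits AESE immediately
     * followed by AESMC, exactly the pair the core can fuse. */
    static uint8x16_t aes_round(uint8x16_t state, uint8x16_t round_key)
    {
        state = vaeseq_u8(state, round_key); /* AddRoundKey + SubBytes + ShiftRows */
        return vaesmcq_u8(state);            /* MixColumns */
    }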
> No variable-length instructions.
So, ARM is not a RISC instruction set, because T32 (Thumb-2) instructions can be 2 or 4 bytes long.
Similarly, RISC-V has variable-length instructions for extensibility. See p. 8ff of https://github.com/riscv/riscv-isa-manual/releases/download/... (section "Expanded Instruction-Length Encoding").
With variable length instructions one must decode a previous one to figure out where the next one will start.
People said the same thing about text encodings. Then UTF-8 came along. Has anyone applied the same idea to instruction encoding?
It probably is better to keep the “how long is this instruction” logic harder and ‘waste’ logic on the decoder.
By comparison, 4-byte instructions that are always aligned have none of those problems. Alignment is a significant simplification for the design.
(Somewhere I have a napkin plan for fully Huffman-coded instructions, but then jumps are a serious problem as they're no longer byte-aligned!)
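For what it's worth, RISC-V's expanded length encoding (in the manual linked upthread) is close in spirit to the UTF-8 idea: the length is self-describing from the low bits of the first 16-bit parcel, so a decoder can find the next instruction boundary without decoding the whole previous instruction. A rough sketch:

    #include <stdint.h>

    /* Instruction length in bytes from the first 16-bit parcel
     * (longer-than-64-bit encodings elided). */
    static int rv_insn_length(uint16_t parcel)
    {
        if ((parcel & 0x03) != 0x03) return 2;  /* bits [1:0] != 11  -> compressed */
        if ((parcel & 0x1c) != 0x1c) return 4;  /* bits [4:2] != 111 -> standard   */
        if ((parcel & 0x3f) == 0x1f) return 6;  /* bits [5:0] == 011111            */
        if ((parcel & 0x7f) == 0x3f) return 8;  /* bits [6:0] == 0111111           */
        return -1;                              /* reserved / longer encodings     */
    }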
IBM's CPU advancements (pipelining, out-of-order execution, etc.) were all implemented in Intel's chips throughout the '90s. Whatever a RISC machine did, Intel proved that the "CISC" architecture could follow suit.
From a technical perspective: all modern chips follow the same strategy. They are superscalar, deeply-pipelined, deeply branch predicted, micro-op / macro-op fused "emulated" machines using Tomasulo's algorithm across a far larger "reorder buffer register" set which is completely independent of the architectural specification. (aka: out-of-order execution).
Ex: Intel Skylake has 180 64-bit reorder-buffer registers (despite having 16 architectural registers). The ARM A72 has 128 ROB registers (despite having 32 architectural registers). The "true" number of registers of any CPU is independent of the instruction set.
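A toy sketch of why the architectural register count barely matters: the rename stage maps every architectural destination to a fresh physical register, so in-flight writes to the same named register stop conflicting. (All names and numbers below are illustrative, not any real core.)

    #include <stdint.h>

    #define ARCH_REGS 16   /* what the ISA names                      */
    #define PHYS_REGS 180  /* what the core actually has, Skylake-ish */

    static uint8_t rename_map[ARCH_REGS]; /* arch reg -> current phys reg */
    static uint8_t next_free;             /* grossly simplified free list */

    /* Every new write gets a fresh physical register, removing
     * write-after-write and write-after-read hazards on the name. */
    static uint8_t rename_dest(int arch_reg)
    {
        uint8_t phys = next_free;
        next_free = (uint8_t)((next_free + 1) % PHYS_REGS);
        rename_map[arch_reg] = phys;
        return phys;
    }

    /* Sources simply read the current mapping. */
    static uint8_t rename_src(int arch_reg)
    {
        return rename_map[arch_reg];
    }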
Why wasn't it obvious previously? A few things had to happen: compilers had to evolve to be sophisticated enough, mindsets had to adapt to trusting these tools to do a good enough job (I actually know several people who in the '80s still insisted on assembler on the 390), and finally VLSI had to evolve to the point where you could fit an entire RISC on a die. The last bit was a quantum leap, as you couldn't do this with a "CISC", and the penalty for going off-chip was significant (and has only grown).
I don't mean this as a criticism, I just wonder if this is really the optimum direction for a practical ISA
However, I see all high-performance web computing moving to WASM, with JavaScript existing just as the glue to tie it together. Adding hardware support for this is naive and has failed before (e.g. Jazelle, picoJava, etc.).
The hardware support being added here would work just as well for WASM (though it might be less critical).
Because of this I find WASM to be the best direction yet for a "universal ISA" as it's very feasible to translate to most strange new radical architecture (like EDGE, Prodigy, etc). (Introducing a new ISA is almost impossible due to the cost of porting the world. RISC-V might be the last to succeed).
I'm not sure this is actually true, given that the original intent was for WASM to be easy to compile using existing JS infrastructure, not in general. So given that, it would make sense to carry over JS fp->int semantics into WASM. WASM is in effect a successor to asm.js.
It's certainly also not too hard to compile/jit for new architectures, but that was not the initial intent or what guided the early/mid-stage design process.
If you examine the current WASM spec, it doesn't appear to specify semantics at all for trunc. I would expect it inherits the exact behavior from JS.
* discovering which instructions are executed (tracking control flow),
* mapping x86 code addresses to translated addresses,
* discovering the shape of the CFG and finding loops,
* purging translations that get invalidated by overwrites (like self-modifying code),
* sprinkling the translated code full of guards so we can make assumptions about the original code
EDIT: fail to make bullets, oh well.
The only significant thing that has changed is that power & cooling is no longer free, so perf/power is a major concern, especially for datacenter customers.
Yes it is? The essay's point is that "standard" hardware benchmarks (C and SPEC and friends) don't match modern workloads and should be devalued in favour of benchmarks that better match actual modern workloads.
ADD: It was an issue a long time ago. Benchmarks like SPEC are actually much nicer than real server workloads. For example, running stuff like SAP would utterly trash the TLB. Curiously, AMD processors can address 0.25 TB without missing in the TLB, much better than Intel.
Despite all that, and me being very familiar with both Rust and JS, it was a big pain. WASM will remain a niche technology for those who really need it, as it should be. No one is going to write their business CRUD in it, it would be a terrible idea.
It can, and it is. Designers are already doing all they can to make it an appealing target for a variety of languages on multiple platforms.
It's also, despite a couple of decades of hard work by some very good compiler/JIT engineers, at a considerable disadvantage perf-wise to a lot of other languages.
Third, its most common runtime environment is a poorly thought out collection of DP and UI paradigms that don't scale to even late-1980s levels, leading to lots of crutches. (AKA: just how far down your average infinite-scrolling web page can you go before your browser either takes tens of seconds to update, or crashes?)
But when it comes to JS and even other dynamic languages, people for some reason absolutely lose their minds, like a teenager whose parents are going out of town and are leaving them home by themselves overnight for the first time. I've seen horrendous JS from Java programmers, for example, that has made me think "Just... why? Why didn't you write the program you would have written (or already did write!) were you writing it in Java?" Like, "Yes, there are working JS programmers who do these kinds of zany things and make a real mess like this, but you don't have to, you know?"
It's as if people are dead set on proving that they need the gutter bumpers and need to be sentenced to only playing rail shooters instead of the open world game, because they can't be trusted to behave responsibly otherwise.
Regarding performance, modern JS is plenty fast; it's not in the 'terrible' category. It's memory usage, perhaps ;) https://benchmarksgame-team.pages.debian.net/benchmarksgame/.... For performance-critical code JS is not the answer, but it's good enough for most uses, including UIs.
Regarding UI paradigms, I'm not sure what the problem is, or what significantly better alternatives are. I did MFC/C++ in '90s and C++/Qt in the '00s, and both were vastly inferior to modern browser development. React+StyledComponents is a wonderful way to build UIs. There are some warts on the CSS side, but mostly because Stackoverflow is full of outdated advice.
I don't think that it's being suggested that ISAs should be designed to closely match the nature of these high level languages. This has been tried before (e.g. iAPX 432 which wasn't a resounding success!)
Asking for CPU features to speed up Python is like trying to strap a rocket to a horse and cart: not the best way to go faster. We should focus on language design and tooling that makes it easier to write in "fast" languages, rather than bending over backwards to accommodate things like Python, which have so much more low-hanging fruit in terms of performance.
What are those?
Anything that makes an actual attempt at optimising for performance.
But anything that impacts SPEC benchmarks (and the others we use for C code) is also going to impact Python performance. If you could find a new instruction that offers a boost to the Python interpreter performance that'd be nice, but it's not going to change the bigger picture of where the language fit in.
Say you work on an optimisation which improves SPEC by 0.1% (pretty good); it improves Python by 0.001% (not actually useful).
Meanwhile there might be an optimisation which does the reverse and may well be of higher actual value.
Because SPEC is a compute & parallelism benchmark; Python is mostly about chasing pointers, locking, and updating counters.
Edited to be a touch less strident.
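To illustrate the difference in workload shape, here is a hypothetical interpreter inner loop (not actual CPython code, just the general pattern): an unpredictable indirect branch per opcode, pointer chasing into heap objects, and reference-count updates on nearly every operation. Dense SPEC-style arithmetic loops look nothing like this.

    struct obj;
    struct type { struct obj *(*add)(struct obj *, struct obj *); };
    struct obj  { long refcount; struct type *type; };

    void interp(const unsigned char *bytecode, struct obj **stack)
    {
        for (;;) {
            switch (*bytecode++) {              /* hard-to-predict dispatch  */
            case 0x01: {                        /* a hypothetical BINARY_ADD */
                struct obj *b = *--stack;
                struct obj *a = *--stack;
                struct obj *r = a->type->add(a, b); /* pointer chase + call  */
                a->refcount--;                      /* counter updates       */
                b->refcount--;
                *stack++ = r;
                break;
            }
            default:
                return;
            }
        }
    }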
Arguably the scalar focus of CPUs is also to make them more suited for C-like languages. Now, attempts to do radically different things (like Itanium) failed for various reasons, in Itanium's case at least partially because it was hard to write compilers good enough to exploit its VLIW design. It's up in the air whether a different high-level language would have made those compilers feasible.
It's not like current CPUs are completely crippled by having to mostly run C programs, and that we'd have 10x as many FLOPS if only most software was in Haskell, but there are certainly trade-offs that have been made.
It is interesting to look at DSPs and GPU architectures for examples of performance-oriented machines that have not been constrained by mostly running legacy C code. My own experience is mostly with GPUs, and I wouldn't say the PTX-level CUDA architecture is too different from C. It's a scalar-oriented programming model, carefully designed so it can be transparently vectorised. This approach won over AMD's old explicitly VLIW-oriented architecture, and most GPU vendors are now also using the NVIDIA-style design (I think NVIDIA calls it SIMT). From a programming-experience POV, the main differences between CUDA programming and C programming (apart from the massive parallelism) are manual control over the memory hierarchy instead of a deep cache hierarchy, and a really weak memory model.
Oh, and of course, when we say "CPUs are built for C", we really mean the huge family of shared-state imperative scalar languages that C belongs to. I don't think C has any really unique limitations or features that have to be catered to.
My day job involves supporting systems on Itanium: the Intel C compiler on Itanium is actually pretty good... now. We'd all have a different opinion of Itanium if it had been released with something half as good as what we've got now.
I'm sure you can have a compiler for any language that really makes VLIW shine. But it would take a lot of work, and you'd have to do that work early. Really early. Honestly, if any chip maker decided to do a clean-sheet VLIW processor and did compiler work side-by-side while they were designing it, I'd bet it would perform really well.
This is half true. The other half is that OOO execution does all the pipelining a "good enough" compiler would do, except that dynamically at runtime, benefiting from just in time profiling information. Way back in the day OOO was considered too expensive, nowadays everybody uses it.
So: the whole flat memory model, large register machines, a single stack register. When you look at all the things people think are "crufty" about x86, it's usually through the lens of "modern" computing. Things like BCD, fixed point, capabilities, segmentation, call gates, all the odd 68000 addressing modes, etc.: many of those things were well supported in other environments but ended up hindering or being unused by C compilers.
On the other side you have things like the inc/dec instructions, which influenced the idea of the unary ++ and -- rather than the longer, more generic forms. So while the latency of inc is possibly the same as add, it still has a single-byte encoding.
In C, computing the address of something like a[i] is more or less:
(char*)(a) + (i * sizeof(*a))
By having address calculations handled by separate circuitry that operates in parallel with the rest of the CPU, the number of CPU cycles required for executing various machine instructions can be reduced, bringing performance improvements.
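Concretely, for something like the function below, that whole address computation folds into the load itself: the compiler typically uses a scaled-index addressing mode and the AGU does base + index*4 as part of the memory access (the exact instruction in the comment is an assumption about a typical x86-64 build).

    #include <stdint.h>

    int32_t get(const int32_t *a, long i)
    {
        /* (char*)a + i*sizeof(*a) is computed by the address-generation
         * unit, e.g. something like: mov eax, [rdi + rsi*4] */
        return a[i];
    }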
--- EDIT ---
@saagarjha, as I'm being slowposted by HN, here's my response via edit:
OK, sure! You need some agreed semantics for that, at the low level. But the hardware guys aren't likely to add actors in the silicon. And they presumably don't intend to support eg. hardware level malloc, nor hardware level general expression evaluation, nor hardware level function calling complete with full argument handling, nor fopen, nor much more.
BTW "The metal which largely respects C's semantics?" C semantics were modelled after real machinery, which is why C has variables that can be assigned to, arrays that follow actual memory layout very closely, and pointers that map to the hardware's address handling. If the C designers had been able to follow theory rather than hardware, well, look at Lisp.
 IIRC the PDPs had polynomial evaluation in hardware.
I suspect that the OS and Architecture communities have known about one-way barriers for a very long time, and they were only recently added to the Arm architecture because people only recently started making Arm CPUs that benefit from them. And that seems like a more likely explanation than them having been plucked from the C standard.
Moreover, one-way barriers are useful regardless of what language you're using.
Yeah, this is pretty much the opposite of what actually works in practice for general-purpose processors though – otherwise we'd all be using VLIW processors.
I do take your point about VLIW, but I'm kind of assuming that the CPU has to, you know, actually run real workloads. So move the complexity out of the languages. Or strongly, statically type them. Or just don't use JS for server-side work. Don't make the hardware guys pick up after bad software.
But today I think it's hard to argue that modern pipelined, out of order processors with hundreds of millions of transistors are in any sense 'simple'.
If there is a general lesson to be learned it's that the processor is often best placed to optimise on the fly rather than have the compiler try to do it (VLIW) or trying to fit a complex ISA to match the high level language you're running.
> ...rather than [...] trying to fit a complex ISA to match the high level language you're running
Again agreed, that was the point I was making.
But I do think one day (might take a while) JS will no longer be the obvious choice for front-end browser development.
IMO: Rust isn't the easiest language to learn, but the investment pays off handsomely and the ecosystem is just wonderful.
EDIT: I meant "to learn" which completely changes the statement :)
I think that day might be sooner than anyone thinks: Chromium is dominant enough now that their including Dart as a first-class language (or, more likely, a successor to Dart) will likely be a viable strategy soon.
Of course, the wildcard is Apple, but ultimately Dart can compile down to JS; being able to write in a far superior language that natively runs on 80% of the market and transpiles to the rest is suddenly much more of a winning proposition.
I like trying new languages and have played with dozens, and JS and Prolog are the only languages that have made me actually scream.
You should definitely try to branch out. At the very least it gives you new ways of thinking about things.
Go or Dart is probably the best bet for a "JS killer" in terms of maturity, tooling, and targeting the front end, followed by Haxe, Swift, and/or Rust (they may be even better, but frankly I'm not as familiar with them).
Nowadays it feels like the opposite: the committee takes so long to dot the i's and cross the t's that features take multiple years to make it through the approval process and be ready to use (I'm looking at you, optional chaining).
ints (x|0 for i31 and BigInt)
arrays (Array, and 11-ish variants of TypedArray)
linked lists (Array)
sets (Set and WeakSet)
maps (Map and WeakMap)
It has the advantage that it is possible to give a reasonable type to most functions without a rewrite (even if the types would be terribly long to accommodate the weak underlying type system)
In addition to the existing doubles, ES2020 added support for arbitrary-precision integers (BigInt).
I also don't really see any indication that the project maintainers don't know about the instruction.
Looks like an example of JSC not being developed in the open more than anything else.
Edit: From https://www.ecma-international.org/ecma-262/5.1/#sec-9.5
9.5 ToInt32: (Signed 32 Bit Integer)
The abstract operation ToInt32 converts its argument to one of 2³² integer values in the range −2³¹ through 2³¹−1, inclusive. This abstract operation functions as follows:
1. Let number be the result of calling ToNumber on the input argument.
2. If number is NaN, +0, −0, +∞, or −∞, return +0.
3. Let posInt be sign(number) * floor(abs(number)).
4. Let int32bit be posInt modulo 2³²; that is, a finite integer value k of Number type with positive sign and less than 2³² in magnitude such that the mathematical difference of posInt and k is mathematically an integer multiple of 2³².
5. If int32bit is greater than or equal to 2³¹, return int32bit − 2³², otherwise return int32bit.
NOTE Given the above definition of ToInt32:
The ToInt32 abstract operation is idempotent: if applied to a result that it produced, the second application leaves that value unchanged.
ToInt32(ToUint32(x)) is equal to ToInt32(x) for all values of x. (It is to preserve this latter property that +∞ and −∞ are mapped to +0.)
ToInt32 maps −0 to +0.
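A direct C transcription of those steps, as a sketch (real engines, and now FJCVTZS, special-case the common paths, but the observable result is the same):

    #include <stdint.h>
    #include <math.h>

    /* ECMAScript ToInt32, following the spec steps quoted above. */
    static int32_t to_int32(double number)
    {
        /* NaN, +/-0 and +/-Infinity map to +0. */
        if (isnan(number) || isinf(number) || number == 0.0)
            return 0;

        /* Truncate toward zero. */
        double pos_int = trunc(number);

        /* Reduce modulo 2^32 into [0, 2^32). */
        double int32bit = fmod(pos_int, 4294967296.0);
        if (int32bit < 0)
            int32bit += 4294967296.0;

        /* Values >= 2^31 represent negative int32s. */
        if (int32bit >= 2147483648.0)
            return (int32_t)(int32bit - 4294967296.0);
        return (int32_t)int32bit;
    }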
The real problem is that JS inherited the x86 behavior, so everyone has to match that. The default ARM behavior is different. All this instruction does is perform a standard fpu operation, but instead of passing the current mode flags to the fpu, it passes a fixed set irrespective of the current processor mode.
As far as I can tell, any performance win comes from removing the branches after the ToInt conversion that are normally used to match x86 behavior.
I don't remember very many phones supporting DBX, but IIRC the ones that did seemed to run J2ME apps much smoother.
These instructions "merely" perform a float -> int conversion with JS semantics, such that implementations don't have to reimplement those semantics in software on ARM. The JS semantics probably match x86, so x86 gets an "unfair" edge and this is a way for ARM to improve their position.
I mean, if float-to-integer performance is so critical, why was this not fixed a long time ago in the language? What am I missing?
ARM wants to be appropriate for more workloads. They don't want to have to wait for software to change. They want to sell processor designs now.
Do you mean JS inside of browsers themselves? Or JS running in another manner?
It does totally make sense though, given the importance of JS and its common use on mobiles.
Ultimately, the instruction set isn't that relevant anyway; what's more relevant is how the microcode can handle speculative execution for common workflows. There's a great podcast/interview with the x86 specification author: https://www.youtube.com/watch?v=Nb2tebYAaOA
No? Then it seems way more specific than the other examples you listed. So specific that it’s only applicable to a single language and that language is in the instruction name. That’s surprising, like finding an instruction called “python GIL release”.
As does ARM:
These are examples of https://en.wikipedia.org/wiki/High-level_language_computer_a....
The people at ARM have likely put a lot of thought behind this, trying to find the places where they, as an instruction set vendor, can help make JS workflows easier.
And btw, I'm pretty sure that if Python has some equally low-hanging fruit to make execution faster, and Python enjoys equally large use, ARM will likely add an instruction for it as well.
The reason you don't find this is that all "modern" processors designed since ~1980 are machines to run... C, as the vast majority (up until ~Java) of all software on desktops and below was written in C. This also has implications for security, as catching things like out-of-bounds access or integer overflow isn't part of C, so doing it comes with an explicit cost even when it's cheap in hardware.
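For example, a checked add has to be spelled out explicitly in C (the builtin below is a GCC/Clang extension, not standard C), even though the overflow flag the ALU already computes would give the same answer for free:

    #include <stdint.h>
    #include <stdbool.h>

    /* Returns true and stores the sum if no overflow occurred;
     * an extra branch in software for something the hardware
     * already knows after every add. */
    static bool checked_add(int32_t a, int32_t b, int32_t *out)
    {
        return !__builtin_add_overflow(a, b, out);
    }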
It surprises you personally, but if you think about it, it's easy to understand that widespread interpreters that use a specific number-crunching primitive implemented in software can and do benefit from significant performance improvements if they offload it to the hardware.
You only need to browse the list of opcodes supported by modern processors to notice countless similar cases of instructions being added to support even higher-level operations.
I mean, are you aware that even Intel added instructions for signal processing, graphics, and even support for hash algorithms?
And you're surprised by a floating point rounding operation?