SecureRISC Instruction Set Architecture

Version 0.5-draft-20240713

Up Front

Documentation Outline

This document is organized as successive expositions at increasing levels of detail, to give the reader an idea of the motivations and high-level differences from conventional processor architectures, eventually getting down to the detailed definitions that direct SecureRISC processor execution. So, if the introductory material seems a little vague, that is because it attempts to sketch an overall context into which the details are later fit.

Work In Progress Documentation

SecureRISC was created to develop old ideas and notes of mine. It is not a complete Instruction Set Architecture (ISA), but only the things I have had time to consider and work on. It is certainly not a specification. At present, this document only exists for discussion purposes.

The ISA is mostly just ideas at this point. The opcode assignments and instruction specifications are little more than hints. The Virtual Memory architecture needs work. This is not a formal specification at this point, but a discussion document. Should it progress to a specification, much rewriting would be required (e.g. to adopt RFC 2119 requirement level keywords and more precise definitions).

Open Source

SecureRISC began as an exploration of what a security-conscious ISA might look like. I hope I can improve it over time to live up to its name. Should it turn into something more than an exploration, I would intend to make it an Open Source ISA, along the lines of RISC‑V.

Software

There is no software (simulator, compiler, operating system, etc.) for SecureRISC. This is a paper-only set of ideas at this point. A compiler, simulator, and FPGA implementation might be created at some point, but that is probably years in the future.

Terminology

The reader of this document is likely already familiar with most of the acronyms, terminology, concepts, etc. used herein. Occasionally the reader might encounter something unfamiliar. So just in case, there is a set of glossaries at the end of this document. One is for general terms used in instruction set, processor, and system software, with a coda to the general glossary for terms specific to RISC‑V that this document cites. There are also references for programming languages, operating systems, and other processor architectures cited herein. Since cryptography terminology is used in a few places, there is a specialized crypto glossary for that. Finally, a glossary of security vulnerabilities that have tripped up many processor designs is included, since this document refers to many such things.

Introduction

SecureRISC is an attempt to define a security-conscious Instruction Set Architecture (ISA) appropriate for server class systems, but which, with modern process technology (e.g. 5 nm), could even be used for IoT computing, given that the die area for a single such processor is a small fraction of one mm². I start with the assumption that the processor hardware should enforce bounds checking and that the virtual memory system should use older, more sophisticated security measures, such as those found in Multics, including rings, segmentation, and both discretionary and mandatory (non-discretionary) access control. I also propose a new block-structured instruction set that allows for better Control Flow Integrity (CFI) and performance. For performance, several features support highly parallel execution and latency tolerance, even in implementations that avoid highly speculative execution, which can lead to security vulnerabilities.

A comment about Multics is appropriate here. There seems to be an impression among many in the computer architecture world that many Multics features are complex. They are simple and general, easy to implement, and remove pressure on the architecture to add warts for specialized purposes. Computer architecture from the 1980s to the present is often an oversimplification of Multics. For example, segmentation in Multics served primarily to make page tables independent of access control, which is a useful feature that has been mostly abandoned in post-1980 architectures. Pushing access control into Page Table Entries (PTEs) puts pressure to keep the number of bits devoted to access control minimal when security considerations might suggest a more robust set. As another example, many contemporary processor architectures (e.g. RISC‑V) have two rings (User and Supervisor), with a single bit in PTEs (the U bit in RISC‑V) serving as a ring bracket. Having only two rings means a completely different mechanism is required for sandboxing rather than having four rings and slightly more capable ring brackets. Indeed, rings were not well utilized on Multics, but we now have more uses for multiple rings, such as hypervisors, Just-In-Time (JIT) Compilation, and sandboxing.

Goals

The goals for SecureRISC in order of priority are:

  1. Security
  2. Performance
  3. Power efficiency
  4. Compatibility where not in conflict with the above
  5. Code size (primarily for performance)
  6. Support for Garbage Collection (GC)
  7. Support for languages with dynamic typing (e.g. Lisp, Python, Julia)
  8. Suitable for implementations that execute many instructions per cycle (wide issue)
  9. Suitable for implementations that do minimal speculative execution to avoid security vulnerabilities (e.g. in-order and narrow issue) but still have some latency tolerance
  10. Simplicity where useful (but not where it conflicts with the above)

Non-goals for SecureRISC include (this list will probably grow):

Security can mean many things. One of the most important is preventing unassisted infiltration (e.g. through exploiting buffer overflows, use-after-free errors, and other programming mistakes). Bounds checking is the primary defense against buffer overflows in SecureRISC. Another is preventing unintentionally assisted infiltration (e.g. phishing attacks installing trojans), which may be accomplished through mandatory access control. SecureRISC is not a comprehensive attempt at security but addresses the aspects that I think can be improved.

While I expect that mandatory (aka non-discretionary) access control is critical to computer security, at this point there is relatively little in SecureRISC’s architecture that enforces this (it is primarily left to software). However, I am still looking for opportunities in this area.

Synergy Between Security and Other Goals

Security, garbage collection, and dynamic typing may appear to be orthogonal, but they are synergistic. SecureRISC attempts to minimize the impact of programming mistakes in several ways, such as making bounds checking somewhat automatic and making compiler-generated checking more efficient for disciplined languages where bounds checking is possible. To keep pointers a single word, the architecture supports encoding the size in extra information per memory location. For undisciplined languages (e.g. C++) the compiler does not in general know the bounds that would be required to perform a check, and the two best methods so far invented to solve this also require some sort of extra information per memory location, such as the pointer and memory coloring used in AddressSanitizer[PDF] or the tag bit in CHERI. AddressSanitizer uses an extra N bits per minimum allocation unit (where that unit may be increased to reduce overhead) to detect errors, with an error escaping detection with probability approximately 2^−N. To address memory allocation error detection other techniques are necessary. One possibility is garbage collection (GC), which eliminates these errors, but to be a substitute for explicit memory deallocation, GC needs to be efficient, hence the goal synergy. Some implementations of GC are made more efficient by being able to distinguish pointers from integers that look like addresses at runtime, and some sort of tagging helps here. For languages requiring explicit deallocation of memory, AddressSanitizer may be used. However, AddressSanitizer on most architectures is too inefficient to use in production and is typically employed only during development as a debugging tool. SecureRISC seeks to make it efficient enough to use in production.
CHERI accomplishes its extra bounds checking by implementing a 129‑bit capability encoding a 64‑bit pointer, 64‑bit base, 64‑bit length, type, and permissions (note the extra bit over each 64‑bit memory location[PDF] required for making capabilities non-forgeable). Thus bounds checking, GC, or memory allocation error detection are all made possible or more efficient by having extra information per memory location. Since SecureRISC must support 64‑bit integer and floating-point arithmetic, this extra information needs to be in addition to the 64 bits required for that data.

As justified above, SecureRISC pursues its goals through what will likely be its most controversial aspect: tags on words in memory and registers. The Basic Block descriptors may be more unusual, and the reader may come to appreciate them with familiarity (especially given the Control Flow Integrity advantages as a security feature), but the reader may in the end not find memory tags convincing because of the non-standard word size that results. An efficient and secure alternative is not known, and as a result, SecureRISC adds tags to memory locations. Tags simultaneously provide an efficient mechanism for bounding pointers, support use-after-free detection, support bounds checking with single-word pointers for undisciplined languages such as C++ (HWASAN or CHERI), support more efficient Garbage Collection (the best solution to allocation errors), and also happen to support dynamically typed languages.

SecureRISC has not yet explored another use for tagging data, which is taint checking[wikilink].

Memory Options for SecureRISC

Before the reader dismisses SecureRISC because of tagged memory, consider the main memory options that SecureRISC processors are likely to support. Most contemporary processors use a cache block size of 576 bits (512 data bits plus 8 bits of ECC for every 64 bits of data), and provide efficient block transfers of this size between main memory and the processor caches by using interconnect of 72, 144, 288, or 576 bits. The equivalent natural width for SecureRISC is 640 bits (512 data bits, 64 tag bits, and 8 bits of ECC for every 72 bits of data and tag). However, there are multiple ways to provide the additional tag bits for SecureRISC, including the use of a conventional 576‑bit main memory. A simple possibility is to set aside ⅛ of main memory for tag storage. Misses from the Last Level Cache (LLC) would then do two main memory accesses, one reading 576 bits and another reading 72 bits (a total of 648 bits, the additional 8 bits being the result of not sharing ECC over tags and data).* (There might be a specialized write-thru cache, placed after the LLC, for the ⅛ of main memory reserved for tags, so that tag block reads can exploit locality, but the coherence of this would need to be figured out.) Support for encryption of data in memory is a goal of SecureRISC, and good encryption requires the storage of authentication bits, increasing the size of cache blocks stored in main memory. The encryption proposed for SecureRISC encrypts 512 bits of data and 64 bits of tag into 704 bits of encrypted authenticated ciphertext, and then appends 64 bits of ECC (8 bits per 88), giving a 768‑bit memory storage, which conveniently fits three non-ECC DIMM widths. Alternatively, in a system with 512‑bit main memory, 1.5 main memory blocks could be used for a SecureRISC cache block (e.g. three transfers of 256 bits or six of 128 bits or twelve of 64 bits). Thus the cost for encrypted and tagged memory is the difference between two ECC DIMMs and three non-ECC DIMMs.

* If the system interconnect fabric is wide enough to support it (AMBA Coherent Hub Interface (CHI) may have support for this?), it may be preferable to move the read of the ≈⅛ of main reserved for tags into the memory controller, and then supply cache blocks with tags throughout the rest of the system.

The above is summarized in the following table:

SecureRISC Memory Options

Data | Tags | Enc | ECC | Total | Organization                                   | Type | Use
Cached, Tagged
512  | 64   | 128 | 64  | 768   | 96×8, 192×4, …, 768×1 or 64×12, 128×6, 256×3   | Main | All
512  | 64   |     | 64  | 640   | 80×8, 160×4, …, 640×1                          | Main | All unencrypted
512  | 64   |     | 72  | 648   | 72×8, 144×4, …, 576×1 + 72×1                   | Main | All unencrypted (≈⅛ of main reserved for tags)[1][3]
512  | 64   | 128 | 88  | 792   | 72×8, 144×4, …, 576×1 + 72×3                   | Main | All (≈⅓ of main reserved for tags + encryption)[2][4]
Cached, Untagged
512  |      |     | 64  | 576   | 72×8, 144×4, …, 576×1                          | I/O  | Data only (no pointers or code)
512  |      | 128 | 64  | 704   | ?                                              | ?    | Encrypted data only
Uncached
n.a. |      |     |     |       | 8, 16, 32, 64, 128                             | I/O  | Registers
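
The totals in the table can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming only the bit counts quoted in the text (512 data bits, 64 tag bits, 128 bits added by authenticated encryption, and 8 ECC bits per protected group); the function and variable names are hypothetical and chosen for illustration.

```python
# Bit counts quoted in the text; all names here are illustrative only.
DATA, TAGS, AUTH = 512, 64, 128


def ecc_bits(payload_bits: int, group_bits: int) -> int:
    """Return 8 ECC bits for every group_bits of payload."""
    if payload_bits % group_bits != 0:
        raise ValueError("payload must be a whole number of ECC groups")
    return 8 * (payload_bits // group_bits)


def stored_bits(payload_bits: int, group_bits: int) -> int:
    """Payload plus its ECC, i.e. the width actually stored in memory."""
    return payload_bits + ecc_bits(payload_bits, group_bits)


conventional = stored_bits(DATA, 64)                      # untagged block
tagged_natural = stored_bits(DATA + TAGS, 72)             # ECC shared over data and tag
tagged_split = stored_bits(DATA, 64) + stored_bits(TAGS, 64)  # tags in reserved memory
encrypted_natural = stored_bits(DATA + TAGS + AUTH, 88)   # three non-ECC DIMM widths
encrypted_split = stored_bits(DATA + TAGS + AUTH, 64)     # conventional ECC memory

print(conventional, tagged_natural, tagged_split, encrypted_natural, encrypted_split)
```

The split organizations cost a few extra bits (648 rather than 640, 792 rather than 768) because ECC is computed per 64‑bit group instead of being shared across data and tag.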

Footnotes:

  1. Actually about 88.88888% of main memory would be data storage, and about 11.11111% would be tag storage because the tag portion of memory doesn't require tags.
  2. Similar to [1], about 66.66666% of main memory would be data storage, and about 33.33333% would be tag storage. This could be compressed further, but only by crossing cache block boundaries for the extension (3 words per 8 required, so 6 words per 8 fit without crossing a boundary).
  3. The L1 to L3 caches would likely store tags as part of the cache block. An L3 refill would read most of the block from the computed system interconnect address siaddr, and the remainder from
    0^3 ∥ siaddr_{MB-1..6} ∥ 0^3 + OFFSET, where MB is the number of bits in the main memory siaddr, and OFFSET is approximately ⌈MEMSIZE×0.8888888÷64⌉×64 (the exact value being somewhat dependent on the main memory size MEMSIZE and would probably be configured by system software after boot). There would likely be a cache after the L3 to hold the rest of the cache block read for this remainder in case it is referenced subsequently. Such a tag remainder cache would need to be checked by coherency transactions on the system interconnect (e.g. invalidates), which has the potential to create false sharing at the 512‑byte level, but this can be avoided by not targeting the L3 on an invalidate that hits in the tag remainder cache but does not hit in the L3. Since this cache would only be written on L3 evictions, it could be write-thru (i.e. an L3 eviction writes a single 64‑bit group of 8 tags to the tag portion of main memory) so that an L3 eviction simply writes to the 72 bits of the tag remainder.
  4. Similar to [3], the tag plus encryption remainder could be cached in a remainder cache after the L3. The amount required for an L3 miss is 3 words per cache block. Only two 3‑word remainders would probably be stored per block in main memory to avoid crossing block boundaries, so only 3 words need be stored in the remainder cache with a single bit indicating which three. An L3 refill would read most of the block from the computed system interconnect address siaddr, and the remainder from
    0 ∥ siaddr_{MB-1..6} ∥ 0^5 + OFFSET, where MB is the number of bits in the main memory siaddr, and OFFSET is approximately ⌈MEMSIZE×0.3333333÷64⌉×64 (the exact value being somewhat dependent on the main memory size MEMSIZE and would probably be configured by system software after boot). As in [3], L3 evictions would probably write the remainder directly to main memory in a single 288‑bit transaction and invalidate the remainder cache to prevent false sharing.
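
The address computation in footnote [3] can be sketched as follows. This is only an illustration under stated assumptions: it takes the 0.8888888 factor to be exactly 8/9, uses 64‑byte cache blocks (so siaddr bits MB-1..6 select the block), and the function names and the 9 GiB example memory size are hypothetical.

```python
BLOCK_SHIFT = 6        # 64-byte cache blocks: siaddr bits MB-1..6 select the block
TAG_BYTES_SHIFT = 3    # 8 bytes (64 bits) of tags per cache block


def tag_offset(memsize: int) -> int:
    """OFFSET = ceil(MEMSIZE * 8/9 / 64) * 64, using exact integer arithmetic."""
    blocks = -(-(memsize * 8) // (9 * 64))   # ceiling division
    return blocks * 64


def tag_remainder_addr(siaddr: int, offset: int) -> int:
    """Address of the 8 tag bytes for the cache block containing siaddr.

    Implements 0^3 || siaddr[MB-1..6] || 0^3 + OFFSET: the block number
    is scaled by 8 bytes and placed after the data region.
    """
    block_number = siaddr >> BLOCK_SHIFT
    return (block_number << TAG_BYTES_SHIFT) + offset


memsize = 9 * 2**30                 # hypothetical 9 GiB main memory
offset = tag_offset(memsize)        # start of the tag region
first = tag_remainder_addr(0, offset)
last = tag_remainder_addr(offset - 64, offset)   # last data block
print(hex(offset), hex(first), hex(last))
```

With this choice of memory size the tag region starts at 8 GiB and the tags for the last data block end exactly at the top of memory, matching the 8:1 split between data and tag storage described in footnote [1].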

It may be possible to add tags selectively to portions of memory. For example, slab allocators are typically page based. Thus one would direct the processor to read tags just from the beginning or end of the page. For example, the tag for vaddr might be read
from vaddr_{63..12} ∥ 0^3 ∥ vaddr_{11..3}
and the slab allocator made aware to start its allocation at offset 512 in pages, so tags are stored at offsets 32..511 (0..31 not being used, as tags on tags are not required; these offsets are available for allocator overhead). A Page Table Entry (PTE) bit might indicate this form of tag storage is in use. Separate mechanisms for bage tags, stack tags, and slab allocations larger than a page would still be required.

Language Specific Mechanisms

The above discussion suggests at least five different uses of memory tags:

While memory tagging is useful for all of the above, each use employs it differently. Instead of a single unified mechanism, SecureRISC uses memory tagging in two ways: one for AddressSanitizer, and another that combines CHERI and disciplined language support.

Is SecureRISC actually RISC?

Is SecureRISC Reduced Instruction Set Computing? It is certainly not a small instruction set, but RISC no longer stands for that and has instead been primarily a marketing term. As one wag put it, RISC is any instruction set architecture defined after 1980. A more accurate description might be ISAs suitable as advanced compiler targets, as the general trend is to depend on the compiler to exploit features of the ISA, such as redundancy elimination, sophisticated register allocation, instruction scheduling, etc. Such things have generally favored ISAs organized along the load and store model and simple addressing modes. By this criterion, I believe SecureRISC is a RISC architecture, but it is not a simplistic or reduced instruction set. Contemporary processors, even for simple instruction sets, are very complex, and that complexity will probably grow until Moore’s Law fails. The design challenges are large. In the contemporary world, simplicity is a goal when it furthers other goals such as performance (e.g. by maximizing clock rate), efficiency (e.g. by reducing power consumption), and so on.

Background

The original motivation for block-structured ISAs was Instruction-Level Parallelism (ILP) studies that I did back in 1997 at SGI that showed that instruction fetch was the limiting factor in ILP. This was before modern branch prediction, e.g. TAGE[PDF], so that result may no longer be true. The idea was that instruction fetch is analogous to linked list processing, with parsing at each list node to find the next link. Linked list processing is inherently slow in modern processors, and with parsing it is even worse. I wanted to replace linked lists with vectors (i.e. to vectorize instruction fetch), but couldn’t figure out how, and settled for reducing the parsing at each list node. I still feel that this is worthwhile, but the exact tradeoffs might require updating older work in this area. The best validation of this dates from 2007, when Professor Christoforos Kozyrakis convinced his Ph.D. student Dr. Ahmad Zmily to look at this approach in a Ph.D. thesis. In the introduction of Block-Aware Instruction Set Architecture[PDF], Dr. Zmily wrote, “We demonstrate that the new architecture improves upon conventional superscalar designs by 20% in performance and 16% in energy.” Such an advantage is not enough on which to foist a new ISA upon the world, but it encourages me to think that it does provide an impetus for using such a base when creating a new ISA for other purposes, such as security. Since 2007, improvements in the proposed block-structured ISA should result in greater performance improvements, while improvements in branch prediction (e.g. TAGE predictors) decrease some of the advantages. Also, Dr. Zmily’s work was based on the MIPS ISA, and SecureRISC is quite different in many aspects. Should SecureRISC be developed to the point where binaries can be produced and simulated, a more appropriate performance estimate will be possible.

Before starting SecureRISC, my previous experience was with the many ISAs and operating systems. Long after starting my block-structured ISA thoughts, I became involved in the RISC‑V ISA project. RISC‑V is in many ways a cleaned-up version of the MIPS ISA (e.g. minus load and branch delay slots) and it seems likely to become the next major ISA after x86 and ARM. Being Open Source, RISC‑V has easy-to-access documentation. As such I have used it for comparisons in the current description of SecureRISC and modified some of its virtual memory model to be slightly more RISC‑V compatible (e.g. choosing the number of segment bits to be compatible with RISC‑V Sv48). However, most aspects of the SecureRISC ISAs predate my knowledge of RISC‑V and were not influenced by it, except that I found that RISC‑V’s Vector ISA was more developed than my thoughts (which were most influenced by the Cray-1, which supported only 64‑bit precision).

In 2022 I encountered the University of Cambridge Capability Hardware Enhanced RISC Instructions (CHERI) research effort. I found their work impressive, but I had concerns about the practicality of some aspects. Despite my concerns, I thought that SecureRISC might be a good platform for CHERI, so I have extended SecureRISC to outline how it might support CHERI capabilities as an exploration. I also modified SecureRISC’s sized pointers to include a simple exponent to extend the size range based on ideas from CHERI but kept them single words by not including both upper and lower bounds. This sized pointer is not as capable as a CHERI pointer, but it is 64 bits rather than 128 bits, which has the advantage of size. There is a more detailed discussion of CHERI and SecureRISC below.

In 2023 I took the virtual memory ideas in SecureRISC and created a proposal for RISC‑V tentatively called Ssv64. I made Ssv64 much more RISC‑V compatible than SecureRISC had been (e.g. in PTE formats), and have recently been backporting some of those changes into SecureRISC since there is no reason to be needlessly different.

SecureRISC does depend upon a few new microarchitecture structures to realize its potential. There should be a Basic Block Descriptor Cache (BBDC), though this could be thought of as an ISA-defined Branch Target Buffer (BTB). The BBDC is in addition to the usual L1 Instruction Cache. While the BTB and BBDC are similar, the BBDC is likely to be sized such that it requires more than one cycle to access (resulting in a target prediction in two cycles), making another structure useful (in the An Example Microarchitecture section at the end this is called a Next Descriptor Index Predictor) to enable a new basic block to be fetched every cycle by providing just the set index bits one cycle early. The most novel new microarchitecture structure suggested for SecureRISC is the Segment Size Cache, which caches the segment size log2 for a segment number, which is used for segment bounds checking on the base register of loads. This cache might also provide the GC generation number of the segment (TBD). While these are new structures, in the context of a modern microarchitecture, especially one with two or three levels of caches and a vector unit, they are tiny and worthwhile.

Conventional Aspects of SecureRISC

Some things remain unchanged from other RISCs. Addresses are byte-addressed. Like other RISC ISAs, SecureRISC is mostly based upon loads and stores for memory access. Integers and floating-point values have 8, 16, 32, or 64‑bit precision. Floating-point would be IEEE-754-2019 compatible. The Vector/Matrix ISA will probably be similar to the RISC‑V Vector ISA but might however use the 48‑bit or 64‑bit instruction format to do more in the instruction word and less with vset (perhaps a subset of vector instructions would exist as the 32‑bit instructions). Also, there are multiple explicit vector mask registers, rather than using v0. (There are sixteen vector mask registers, but only vm0 to vm3 are usable to mask vector operations in 32‑bit vector instructions—the others primarily exist for vector compare results.)

Readers will have to decide for themselves whether the proposed virtual memory is conventional because it is somewhat similar to Multics, or unconventional because it is different from RISC ISAs of the last forty years. A similar comment could be made concerning the register architecture since it echoes the Cray-1 ISA from 1976 but is somewhat different from RISCs since the 1980s. (The additional register files in SecureRISC serve multiple purposes, but an important one is supporting execution of many instructions per cycle without the wiring bottleneck that a single unified register file would create.)

Unconventional Aspects of SecureRISC

Much more in SecureRISC is unconventional. To prepare the reader to put aside certain expectations, we list some of these things here at a high level, with details in later sections.

Advantages of the Basic Block Descriptor

The Basic Block (BB) descriptor aspect listed above is perhaps the most unfamiliar. Below are some of the rationale and advantages of this approach.

Contemporary processors have various structures that are created and updated during program execution to improve performance, such as Caches, TLBs, Branch Target predictors (BTB), Return Address predictors (RAS), Conditional Branch predictors, Indirect Jump and Call predictors, prefetchers, and so on. In SecureRISC one of these is moved into the ISA for performance and security. In particular, the BTB becomes a Basic Block Descriptor Cache (BBDC). The BBDC caches lines of Basic Block Descriptors that are generated by the compiler, in a separate section from the instructions. SecureRISC also seeks to make the Return Address predictor more cache-like and build in some additional ISA support for loop prediction.

Capability Hardware Enhanced RISC Instructions (CHERI)

I started with the assumption that pointers are a single word, which are expanded based on the 8‑bit tag to a base and size when loaded into the doubleword (144‑bit) Address Registers (ARs). This enables automatic bounds checking. The effective address calculation uses the AR’s base and checks the offset/index value against the size. This supports programs oriented toward a[i] pointer usage, but not C++ *p++ pointer arithmetic (such arithmetic is possible in SecureRISC at the expense of bounds checking). In contrast, the University of Cambridge Capability Hardware Enhanced RISC Instructions (CHERI) Project started with the assumption that capability pointers are four words (including lower and upper bounds, the pointer itself, and permissions and object type), and invented a compression technique to get them down to two words. SecureRISC can support CHERI by using its 128‑bit AR load and store instructions to transfer capabilities to and from the 144‑bit ARs, and therefore be able to accommodate either singleword or doubleword pointers. Support for the CHERI bottom and top decoding, its permissions, and its additional instructions would be required. The CHERI tag bit is replaced with two SecureRISC reserved tag values (one tag value in word 0, another in word 1). I would expect languages such as Julia and Lisp would prefer singleword pointers, so supporting both singleword and doubleword pointers allows both to exist on the same processor depending on the instructions generated by the compiler.
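
To make the a[i] versus *p++ distinction concrete, here is a minimal sketch of the kind of check described above. It is a simplified model and not the SecureRISC definition: the class, method, and exception names are hypothetical, tag decoding is omitted, and the check ignores access width, none of which is specified at this point in the document.

```python
from dataclasses import dataclass


class BoundsError(Exception):
    """Raised when an offset falls outside the object's size (illustrative)."""


@dataclass(frozen=True)
class AddressRegister:
    """Hypothetical model of an AR after pointer expansion: a base and a size."""
    base: int
    size: int   # object size in bytes

    def effective_address(self, offset: int) -> int:
        # The offset/index is checked against the size before it is
        # added to the base; a failed check models a bounds exception.
        if not 0 <= offset < self.size:
            raise BoundsError(f"offset {offset} outside size {self.size}")
        return self.base + offset


# An array of four 8-byte words starting at address 0x1000.
a = AddressRegister(base=0x1000, size=32)
in_bounds = a.effective_address(3 * 8)     # a[3]

try:
    a.effective_address(4 * 8)             # a[4] is one past the end
    overflow_caught = False
except BoundsError:
    overflow_caught = True
```

Indexing from a fixed base keeps the size meaningful for every access. Pointer arithmetic in the *p++ style would move the base itself, so a size-only pointer could no longer describe where the object begins or ends, which is why the text treats that style as incompatible with this form of checking.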

Unlike CHERI, SecureRISC pointers have only a size and not bottom and top values encoded. As a result, SecureRISC’s bounds checking is more suited to situations where indexing from a base is used rather than incrementing and decrementing pointers, and so bounds checking is better suited to disciplined languages, primarily ones that emphasize array indexing over pointer arithmetic. My expectation is that running some C++ code on SecureRISC would be possible with bounds checking, but pointer-oriented C++ code would fail bounds checking. Bounds checking is a better target for Rust, Swift, Julia, or Lisp. SecureRISC can use unsized pointers for C++, but using these would represent a less secure mode of operation. The supervisor would need to enable on a per-process basis whether such C++ pointers can be used; if disabled they would cause exceptions. For example, a secure system might only allow C++ pointers for applications without internet connectivity. Undisciplined languages (such as C++) are instead likely to use either CHERI-128 pointers or memory and pointer cliques for security.

SecureRISC and CHERI Variants

Tagged memory words are separable from other aspects of SecureRISC, such as the Multics aspects and the Basic Block descriptor aspects. One could imagine a version of SecureRISC without the tags and a 64‑bit word (72 bits with ECC in memory). Even in such a reduced ISA (call it SemiSecureRISC) I would keep the AR/XR/SR/VR model. SemiSecureRISC is still interesting for its performance and security advantages, but I do not plan to explore it. There is also the possibility of combining SemiSecureRISC with CHERI and its 1‑bit tag[PDF], since the CHERI project has done a lot of important software work. Call such an ISA BlockCHERI. I suspect the CHERI researchers would say that the only advantage in BlockCHERI would be the performance advantage of the Block ISA and the AR/XR/SR separation, with the ARs specialized for CHERI capabilities, and the XRs/SRs for non-capability data. My primary thought on BlockCHERI is that the difference between a 65‑bit memory (73 bits with ECC) and a 72‑bit memory (80 bits with ECC) is small, and that the 7 extra bits may be put to good use.

One could imagine variants of SecureRISC that have only some of its features:

Name Block ISA Segmentation Rings Tags CHERI Word Pointer
SecureRISC 72 72/144
SemiSecureRISC 64 64
BlockRISC 64 64
BlockCHERI ? ? 65 130

As I indicated earlier, I don’t think that BlockRISC is sufficient to justify a new ISA. I am concentrating on the full package.

Open Aspects

I need to think more carefully about I/O in a SecureRISC system. Some I/O will be done in terms of byte streams transferred via DMA to and from main memory (e.g. DRAM). Such I/O, if directed to tagged main memory, writes the bytes with an integer tag. Similarly, if processors use uncached writes of 8, 16, 32, or 64 bits (as opposed to 8‑word blocks) to tagged memory, the memory tag must be changed to integer. Tag-aware I/O of 8‑word units exists and may be used for paging and so forth. It may be useful to provide a general facility for reading tagged memory, including the tags, as a stream of 8‑bit bytes along with a cryptographic signature, and for writing such a stream back with signature checking.

Ports onto the system interconnect fabric will have to have rights and permissions assigned by the hypervisor, and perhaps hypervisor guests. This needs to be worked out.

Being able to support DMA from lower privilege rings (user-mode) would be desirable, but it seems difficult to make this work, because then the user ring code would be sending its own local virtual addresses to the I/O device for DMA, and so the I/O devices would have to be able to translate user addresses to system interconnect addresses via two-level page tables and user-mode would have to tell the I/O device the page table the supervisor assigned it, which it doesn’t know. I am for now leaving user-mode I/O unaddressed. One possibility is to implement an 80 to 96‑bit global address space by converting the existing 12‑bit per-processor translation cache flush optimization (ASID) to a system-wide 16 to 32‑bit Address Space ID and starting the segment descriptor lookup from this extended virtual address. This would allow I/O to locate user-mode page tables. The cost is wider address matching in translation caches and potentially multi-level walks to read segment descriptors (probably with additional caching).

Documentation Conventions

Little Endian bit numbering is used in this documentation (bit 0 is the least significant bit). While not a documentation convention, I might as well mention up front that SecureRISC is similarly Little Endian in its byte addressing.

Links to Wikipedia articles are followed with a [wikilink] icon. Links to documents in PDF format are followed by a [PDF].

To augment English descriptions of things, SecureRISC uses notation that operates on bits. This notation is sketched out here, but it is still only a guide to the reader (i.e. it is not a complete formal specification language such as SAIL). Its advantage is brevity.

Basic Terminology

Multics Terminology (Multicians may mostly skim)

For those familiar with Multics, the primary thing to know is that SecureRISC has up to 8 rings (0..7) and inverts ring numbers so that ring 0 is the least privileged. Also, segment sizes are powers of two.

Segment
Segments are the basic unit of access control and sharing and are typically used for mapping files into the address space of processes. Segments may be paged or may be directly mapped (e.g. to an I/O device). Segments have power-of-two sizes that are used for bounds checking and for determining the depth of the page table walk when paging is used. (Multics segment lengths were not limited to power-of-two sizes, but arbitrary lengths significantly increase the size of Segment Descriptors, whereas only six bits are required for powers of two.) Segment sizes < 12 (4096 bytes) would not be supported in most microarchitectures, and the maximum segment size is 61 (2 EiB or exbibytes). Segment sizes > 48 (256 TiB or tebibytes) require the Segment Descriptor Table (set by the supervisor) to have 2^{size−48} entries with consistent values.
Segments and pages are also used to implement generational garbage collection (GC) by extending Segment Descriptors, Page Table Entries (PTEs), and TLB entries to have a generation number so that stores of pointers from an old generation to a new one are noted in PTE bits in a fashion similar to the usual Dirty bits. This allows pages that are about to be swapped out to be scanned for pointers to newer generations and noted, so that these pages need not be swapped in during GC. (SecureRISC may need a way to disable this feature in PTEs to provide more bits to software.)
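The size rules for segments given above can be sketched in a few lines of Python. This is only an illustration: the function and constant names are hypothetical, the hard lower limit of 12 is an assumption (the text only says smaller sizes would not be supported in most microarchitectures), and the bounds check assumes an upward-growing segment.

```python
# Hypothetical names; only the numeric rules come from the Segment entry above.
MIN_SEGMENT_LOG2 = 12   # assumption: treat sizes below 4096 bytes as unsupported
MAX_SEGMENT_LOG2 = 61   # 2 EiB
SDT_SPAN_LOG2 = 48      # one Segment Descriptor Table entry spans 2**48 bytes

def sdt_entries(ssize: int) -> int:
    """Number of SDT entries needed for a segment of 2**ssize bytes."""
    if not MIN_SEGMENT_LOG2 <= ssize <= MAX_SEGMENT_LOG2:
        raise ValueError("segment size out of range")
    if ssize <= SDT_SPAN_LOG2:
        return 1
    # Larger segments need 2**(ssize - 48) entries with consistent values.
    return 1 << (ssize - SDT_SPAN_LOG2)

def in_bounds(offset: int, ssize: int) -> bool:
    """Power-of-two bounds check for an upward-growing segment."""
    return 0 <= offset < (1 << ssize)
```

Because sizes are powers of two, the bounds check reduces to testing that the offset has no bits set at or above position ssize, which is why six bits of descriptor suffice.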
Ring
Rings provide Read, Write, Execute, and Gate permission on a nested basis for different layers of privilege. Many recent and simpler architectures provide only two layers of privilege (typically called User and Supervisor) with Read (R), Write (W), and Execute (X) permission bits and a fourth bit (e.g. the RISC‑V U bit in PTEs) determining whether these permissions apply to both privilege levels or only the most privileged with access denied to the least privileged. Rings are the older generalization where separate Read, Write, and Execute permissions are specified for multiple privilege levels (typically 4 or 8) in a nested fashion that takes only a few more bits: compare RISC‑V’s 4 bits (RWXU) to 9 bits for four rings (RWX plus three 2‑bit ring brackets) and 12 bits for eight rings (RWX plus three 3‑bit ring brackets). In addition to Execute permission, rings allow Multics and SecureRISC to implement Gate permission for privilege transition on procedure calls to and returns from gates.
Rings enforce a layering upon what may be read or written by code on a per segment basis, with ring 0 being the least privileged and ring 7 being the most privileged. (SecureRISC inverts the ring number to privilege mapping chosen by Multics. This allows less privileged code to be unaware of higher privilege levels and the number of rings supported by an implementation to vary: some implementations might have fewer than 8 rings.) Less privileged rings may also call gates in more privileged rings to request services from those rings. Ring numbers are stored in pointer tags so that pointer parameters passed to more privileged rings result in access to virtual memory using the access rights of the caller, and not the rights of the more privileged ring. Loading a pointer from memory sets the ring field to the minimum of the current ring of execution, the ring number in the base register of the load, and the ring number stored in memory (if any). A special instruction used in gates is applied to pointers passed in registers to apply this minimum calculation using the ring number of the caller.
The number of rings could be reduced from 8 to 4 or even just 2 in some implementations, though the savings from this is minimal. Perhaps in a four-ring system ring 2 would be for the operating system, ring 1 for user code, and ring 0 for sandboxed user code.
Many recent ISAs can be thought of as having only two rings and with ring permissions being just present or absent (again, the RISC‑V U bit in PTEs is one example).
Michael D. Schroeder’s Ph.D. thesis, MAC TR-104, Cooperation of Mutually Suspicious Subsystems in a Computer Utility, September 1972[PDF] presented a generalization of rings to domains where permissions were specified without nesting. This is straight-forward, until the procedure for evaluating permissions of reference parameters using the privilege of the calling domain is attempted. SecureRISC does not attempt to generalize rings to domains due to this complexity.
Ring brackets
Each segment has three 3‑bit ring numbers—R1, R2, and R3—stored in the segment descriptor table and used for bracketing accesses by ring of execution in addition to the Read, Write, Execute permissions from the segment descriptor table. R3 also is used for gate call permission. To reiterate, SecureRISC inverts the ring number to privilege mapping chosen by Multics: ring 7 is the most privileged and ring 0 the least privileged. Typically, R3≤R2≤R1. Writes are permitted when the current ring of execution is in [R1:7], reads in [R2:7], execution in [R2:R1], and calls to gates in [R3:R2−1]*. (Originally the value 7 was used for other purposes, but that is no longer the case, and ring 7 is being reintroduced, for example for a Security Overlord / Trusted Execution Environment (TEE) / Ultravisor.)
* The ring number of the caller and the ring brackets of the target segment are used to calculate the new ring number of execution, as per the Multics documentation modified for the inverted ring order:

One additional change to Multics rings may be to require access to segments at a lower privilege level than PC.ring by more privileged rings to use pointers with ring tags (192..199). A reference using a non-ring pointer would cause an exception, making it harder to trick privileged rings into accidentally using untrusted data.

To illustrate the utility of rings, the following example shows how all 8 rings might be used. Indeed, if there were one more ring available, it might be used for the user-mode dynamic linker, so that links are readable by applications, but not writeable.

Example Ring Brackets
What R1,R2,R3 seg RWX R b W b X b G b Ring 0 Ring 1 Ring 2 Ring 3 Ring 4 Ring 5 Ring 6
User code 2,2,2 R-X [2,7] - [2,2] - ---- ---- R-X- R--- R--- R--- R---
User execute only 2,2,2 --X - - [2,2] - ---- ---- --X- ---- ---- ---- ----
User stack or heap 2,2,2 RW- [2,7] [2,7] - - ---- ---- RW-- RW-- RW-- RW-- RW--
User read-only file 2,2,2 R-- [2,7] - - - ---- ---- R--- R--- R--- R--- R---
User return stack 4,2,4 RW- [2,7] [4,7] - - ---- ---- R--- R--- RW-- RW-- RW--
Compiler library 7,0,0 R-X [0,7] - [0,7] - R-X- R-X- R-X- R-X- R-X- R-X- R-X-
Super driver code 3,3,3 R-X [3,7] - [3,3] - ---- ---- ---- R-X- R--- R--- R---
Super driver data 3,3,3 RW- [3,7] [3,7] - - ---- ---- ---- RW-- RW-- RW-- RW--
Super code 4,4,4 R-X [4,7] - [4,4] - ---- ---- ---- ---- R-X- R--- R---
Super gates for user 4,4,2 R-X [4,7] - [4,4] [2,3] ---- ---- ---G ---G R-X- R--- R---
Super heap or stack 4,4,4 RW- [4,7] [4,7] - - ---- ---- ---- ---- RW-- RW-- RW--
Super return stack 6,4,6 RW- [4,7] [6,7] - - ---- ---- ---- ---- R--- R--- RW--
Hyper driver code 5,5,5 R-X [5,7] - [5,5] - ---- ---- ---- ---- ---- R-X- R---
Hyper driver data 5,5,5 RW- [5,7] [5,7] - - ---- ---- ---- ---- ---- RW-- RW--
Hyper code 6,6,6 R-X [6,7] - [6,6] - ---- ---- ---- ---- ---- ---- R-X-
Hyper heap or stack 6,6,6 RW- [6,7] [6,7] - - ---- ---- ---- ---- ---- ---- RW--
Hyper return stack 6,6,6 RW- [6,7] [6,7] - - ---- ---- ---- ---- ---- ---- RW--
Hyper gates for supervisor 6,6,4 R-X [6,7] - [6,6] [4,5] ---- ---- ---- ---- ---G ---G R-X-
TEE code 7,7,7 R-X [7,7] - [7,7] - ---- ---- ---- ---- ---- ---- ----
TEE data 7,7,7 RW- [7,7] [7,7] - - ---- ---- ---- ---- ---- ---- ----
Sandboxed JIT code 1,0,0 RWX [0,7] [1,7] [0,1] - R-X- RWX- RW-- RW-- RW-- RW-- RW--
Sandboxed JIT stack or heap 0,0,0 RW- [0,7] [0,7] - - RW-- RW-- RW-- RW-- RW-- RW-- RW--
Sandboxed non-JIT code 1,1,1 R-X [1,7] - [1,1] - ---- R-X- R--- R--- R--- R--- R---
User gates for sandboxes 2,2,0 R-X [2,7] - [2,2] [0,1] ---G ---G R-X- R--- R--- R--- R---
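The per-ring columns of the table follow mechanically from the bracket rules stated under Ring brackets (writes in [R1:7], reads in [R2:7], execution in [R2:R1], gate calls in [R3:R2−1]). The following Python sketch recomputes a table row from R1, R2, R3 and the segment RWX bits. The function names are hypothetical, and it assumes the gate bracket applies regardless of the segment permission bits, which the table neither confirms nor contradicts.

```python
def ring_access(ring: int, r1: int, r2: int, r3: int, seg_rwx: str) -> str:
    """Return the access string (e.g. 'R-X-' or '---G') for one ring of execution."""
    read = "R" in seg_rwx and r2 <= ring <= 7
    write = "W" in seg_rwx and r1 <= ring <= 7
    execute = "X" in seg_rwx and r2 <= ring <= r1
    # Assumption: gate permission depends only on the [R3:R2-1] bracket.
    gate = r3 <= ring <= r2 - 1
    return "".join(letter if allowed else "-"
                   for letter, allowed in zip("RWXG", (read, write, execute, gate)))

def table_row(r1: int, r2: int, r3: int, seg_rwx: str) -> list:
    """Access strings for rings 0..6, matching the columns of the table."""
    return [ring_access(ring, r1, r2, r3, seg_rwx) for ring in range(7)]

if __name__ == "__main__":
    print("Super gates for user:", " ".join(table_row(4, 4, 2, "R-X")))
    print("User return stack:   ", " ".join(table_row(4, 2, 4, "RW-")))
```

Running it for the "Super gates for user" and "User return stack" rows reproduces the corresponding table entries, which is a convenient consistency check when adding new rows.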
Gate
Gates are the entry points into more privileged rings from less privileged rings and are marked as such by a bit in basic block descriptors. Less privileged rings may call directly to more privileged rings without employing the exception mechanism (not employing exceptions is a performance advantage). When the target segment does not allow execution to the current ring (i.e. the current ring is less than the target segment’s R2), but does allow gate calls (i.e. the current ring is in the target segment’s [R3:R2−1]), a ring transition takes place. Only basic block descriptors marked as gates (as indicated in the descriptor) may be used for such transfers. Gates are responsible for stack switching, validating the ring numbers of pointer arguments passed in registers, and clearing non-preserved registers before return.
Discretionary Access Control
The operating system maintains an Access Control List (ACL) for files. When files are mapped into a user address space, this ACL is mapped to permissions in the Segment Descriptor for that user. Those permissions are Read (R), Write (W), Execute (X), Pointer (P), and Capability (C) permissions.
Mandatory Access Control
Mandatory Access Control (aka Non-Discretionary Access Control) prevents access independent of ACLs by implementing the Orange Book[PDF] classification system. In addition to its primary purpose, it protects against trojan attacks. The Orange Book calls for two concepts: levels and categories. In SecureRISC this could be simplified to just categories, with N levels encoded with 2N−1 category bits. Read access is granted to a segment when SegmentCategories ⊆ ProcessCategories. Write access is granted when SegmentCategories = ProcessCategories.
An argument against using 2N−1 category bits to encode N levels is simply the number of bits required, if these bits are stored in a TLB. In that case, using two mechanisms rather than one may be worth the complexity. For example, a classification system with six levels might use 3 bits for a binary level or 5 bits for a set representation (00000, 00001, 00011, 00111, 01111, 11111), a savings of 2 bits. For three levels, the savings is only 1 bit.
It may be useful to have multiple parallel tests of category bits for orthogonal security considerations. For example, a Trusted Execution Environment (TEE) category might be independent of Secret/TopSecret levels.
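The read and write rules above are just set comparisons on bit masks. A minimal Python sketch follows, using the six-level set representation from the example (00000 through 11111) to encode levels as categories. The function names are hypothetical, and treating level 0 as the lowest classification is an assumption.

```python
def level_to_categories(level: int, n_levels: int = 6) -> int:
    """Encode a level as a set of category bits: level k sets the low k bits."""
    if not 0 <= level < n_levels:
        raise ValueError("level out of range")
    return (1 << level) - 1

def may_read(segment_cats: int, process_cats: int) -> bool:
    """Read is granted when SegmentCategories is a subset of ProcessCategories."""
    return segment_cats & ~process_cats == 0

def may_write(segment_cats: int, process_cats: int) -> bool:
    """Write is granted only when the two category sets are equal."""
    return segment_cats == process_cats

if __name__ == "__main__":
    secret = level_to_categories(3)      # 0b00111
    top_secret = level_to_categories(5)  # 0b11111
    print(may_read(secret, top_secret), may_write(secret, top_secret))
```

With this encoding a higher level is always a superset of a lower one, so the subset test gives "read down" and the equality test forbids "write down", which is the protection against trojans mentioned above. An independent category (for example a TEE bit) could be OR-ed into the same mask.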

Address Terminology

SecureRISC implements two levels of address translation, as in processors with hypervisor support and virtualization, but I have invented new terminology for the process because physical address is somewhat ambiguous in a two-level translation. Programs operate using local virtual addresses. These addresses are translated to a system virtual address in a mapping specified by guest operating systems. The guest operating systems consider system virtual addresses as representing physical memory, but actually these addresses are translated again by a system-wide mapping specified by the hypervisor to system interconnect addresses that are used in the routing of accesses in the system fabric. All ports on the system interconnect translate system virtual addresses to system interconnect addresses in local TLBs at the boundary into the system interconnect. This allows guest operating systems to transmit system virtual addresses directly to I/O devices, which may transfer data to or from these addresses, employing the system-wide translation at the port boundary.

Making the svaddr → siaddr translation system-wide is a somewhat radical simplification compared to other virtualization systems. Whether SecureRISC retains this simplification or adopts a more traditional second level translation is open at this point, but my intention is to see if the simplification can suffice. A system-wide mapping means that the hypervisor must give each supervisor unique system virtual addresses for its memory and I/O, and the supervisors must be prevented from referencing the system virtual addresses of the other supervisors via the protection mechanism. This requires that supervisors must not expect memory and I/O in fixed locations. The advantage of a single mapping is that a single 64‑bit svaddr is all that is required when communicating with I/O devices, rather than two 64‑bit addresses (i.e. a page table address and the address within the address space specified by the page table).

A further consequence of the system-wide svaddr translation is that there can be only one hypervisor in the system. In other systems, one could have multiple hypervisors running in parallel, each supporting different sets of supervisors. This generality is elegant, but I wonder how important it is in practice.

The following elaborates on the above:

Local Virtual Address
This is the 64‑bit address generated for Basic Block descriptor fetches, Instruction fetches, and load and store instructions based on address arithmetic on the address and index register files. Local Virtual Addresses are translated by the first-level translation mechanism starting with the Segment Descriptor Tables specified by the sdtp registers. The result of this translation is a System Virtual Address. This first-level translation is usually specified by the guest operating system supervisor. Local Virtual Addresses are sometimes abbreviated to lvaddr.
Local Virtual Address
63 61 60 48 47 0
SG SEG fill VPN byte
3 13 48−ssize ssize−PS PS
where ssize is the segment size, PS is the page size given by the segment mapping, and fill is all 0s for upward-growing segments and all 1s for downward-growing segments.
The segment number size of 16 bits was chosen to limit the offset to 48 bits, which keeps simplistic operating system page tables (one page at each level) to four levels with 4 KiB pages. If all operating systems for SecureRISC are likely to take advantage of flexible page table sizes, SecureRISC might reduce the segment number to 14 bits (11 in a segment group).
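To make the Local Virtual Address layout concrete, here is a small Python sketch that splits an lvaddr into the SG, SEG, fill, VPN, and byte fields shown above. The function names and the dictionary representation are hypothetical; ssize and ps are the log2 segment and page sizes, and the sketch only handles segments of at most 2^48 bytes (one segment number).

```python
def split_lvaddr(lvaddr: int, ssize: int, ps: int) -> dict:
    """Split a 64-bit local virtual address into its fields."""
    if not 0 <= lvaddr < (1 << 64):
        raise ValueError("lvaddr must fit in 64 bits")
    if not ps <= ssize <= 48:
        raise ValueError("this sketch requires ps <= ssize <= 48")
    return {
        "SG": lvaddr >> 61,                                      # bits 63..61
        "SEG": (lvaddr >> 48) & 0x1FFF,                          # bits 60..48
        "fill": (lvaddr >> ssize) & ((1 << (48 - ssize)) - 1),   # 48 - ssize bits
        "VPN": (lvaddr >> ps) & ((1 << (ssize - ps)) - 1),       # ssize - PS bits
        "byte": lvaddr & ((1 << ps) - 1),                        # PS bits
    }

def fill_is_valid(fields: dict, ssize: int, grows_down: bool) -> bool:
    """fill must be all 0s for upward-growing and all 1s for downward-growing segments."""
    expected = (1 << (48 - ssize)) - 1 if grows_down else 0
    return fields["fill"] == expected
```

For example, with a 4 GiB segment (ssize = 32) and 4 KiB pages (ps = 12), the VPN is 20 bits wide and fill occupies bits 47..32.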
System Virtual Address
These 64‑bit addresses are used by processors and I/O devices when interfacing to the System Interconnect. Initiator ports on the System Interconnect translate (and check) these addresses to System Interconnect Addresses in Initiator TLBs based on the system Region Descriptor Table. This second-level translation is usually specified by the system hypervisor. System Virtual Addresses are sometimes abbreviated to svaddr.
It would be desirable to support >64‑bit svaddrs, but there is limited room in Page Table Entries with a 4 KiB page size. If SecureRISC were to raise the minimum page size, I think this should be increased a little.
System Virtual Address
63 48 47 0
region byte address
16 48
System Interconnect
The system-specific logic with multiple ports that allows these ports to communicate with each other. It may be implemented with a bus, cross-bar, ring, 2D or 3D mesh, HexaMesh, Diametrical Mesh (mesh with wormholes routes), dragonfly network, hypercube, or still other mechanisms. Ports on the System Interconnect may be either Initiators or Responders or both.
System Interconnect Address
The address used for routing data transfers on the system interconnect. The width of the system interconnect address is system dependent. System interconnect addresses are the result of translating System Virtual Addresses using the system-wide second-level translation specified by the Region Descriptor Table (RDT). Because I don’t see a strong reason to have more bits in siaddrs than svaddrs, SecureRISC should just stick with a maximum siaddr width of 64 bits. Many systems will have a smaller System Interconnect Address width. System Interconnect Addresses are sometimes abbreviated to siaddr.
System Interconnect Address
63 0
byte address
64

This document does not attempt to define the format of System Interconnect Addresses (siaddrs). That is left to the system designers. However, just to illustrate one possibility, what follows is an example of how a hypothetical system might interpret an siaddr.
Example System Interconnect Address
63 50 49 6 5 3 2 0
port line word byte
14 44 3 3

Tagged Pointer Terminology

Word
Memory Word
71 64 63 0
tag data
8 64
Words are 72 bits in memory with 64 bits of data, 8 bits of tag, and addresses that are multiples of 8 (i.e. aligned). SecureRISC supports using the tag portion of words in two ways, one appropriate to languages with strong bounds checking and garbage collection, and one for less disciplined languages (e.g. C++).
In the first tag usage, the tag is primarily used as a type, and by distinguishing pointers from non-pointers, facilitates garbage collection, and dynamic typing. When the type indicates a pointer, it may sometimes specify the size of the memory addressed by the pointer. Non-pointer tag values (≥240) are reserved for 64 bits of data contained directly in the word, rather than what the word points to. This allows dynamically typed languages such as Lisp to have 64‑bit integer and 64‑bit floating-point data as objects without allocating memory to contain it. Tags <240 represent pointers.
In the second tag usage, the tag is primarily used for checking for accesses beyond allocated memory or after explicit deallocation. This works probabilistically by setting the tags of allocated memory to a new 8‑bit value (0..231), tagging pointers to this allocation with this value, and checking on reads and writes that the pointer value matches the memory value. A bit in the segment descriptor word indicates whether tags are used in this way, and this bit must match the opcodes used to load and store to the segment. SecureRISC borrows the word clique to refer to this usage of tags; the clique of memory and pointers must match on access. Cliqued pointers in memory use the tag to represent the allocation containing the pointer, and so different bits must be used to specify the pointer clique, reducing the address space size by 8 bits for such pointers (making only 256 segments addressable). SecureRISC has CLA64 and CSA64 instructions that decode cliqued pointers on load and encode them on store. Cliqued pointers do not need to be word aligned in memory. When a load or store instruction checks memory tags (i.e. when the AR base register memtag field is not 251), if the address is not word aligned and the access crosses a word boundary, then all accessed word tags must match.
The CLA64 instructions supply the tag to write to AR[d] (typically 222, 240, or 244). The decode and encode allow the format in address registers (ARs) to be compatible with non-cliqued address usage, by setting the AR tag to newtag, moving ac to bits 143..136, setting the ring field to PC.ring, and setting the size field to the segment size. The memory forms of data and pointers are as shown below:
Cliqued Non-Pointer Stored in Memory accessed with {L,S}{X,S}n{U,S} etc.
71 64 63 0
mc data
8 64
Cliqued Pointer Stored in Memory accessed with CLA64/CSA64 etc.
71 64 63 56 55 0
mc ac address
8 8 56
Field Width Bits Description
address 56 55:0 Byte address
ac 8 63:56 Clique of addressed memory (0..231)
mc 8 71:64 Clique assigned by allocator to memory containing the pointer (0..231)

The CLA64 transformation is as follows:
t ← lvload64cliquecheck(ea, AR[a]_{143..136})
AR[d] ← t_{63..56} ∥ PC.ring ∥ segsize(ea) ∥ newtag ∥ 0^{8} ∥ t_{55..0}
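The concatenation in the CLA64 transformation can be spelled out as shifts and masks. The Python sketch below models only the second line (building the 144‑bit AR value from the loaded 64‑bit word t); the load and clique check performed by lvload64cliquecheck are not modeled, the function name is hypothetical, and the segment size is passed in as a 61‑bit value rather than looked up.

```python
MASK56 = (1 << 56) - 1
MASK61 = (1 << 61) - 1

def cla64_decode(t: int, pc_ring: int, seg_size: int, newtag: int) -> int:
    """Build the 144-bit AR value from a 64-bit cliqued pointer t."""
    if not 0 <= t < (1 << 64):
        raise ValueError("t must be a 64-bit value")
    if not (0 <= pc_ring < 8 and 0 <= seg_size <= MASK61 and 0 <= newtag < 256):
        raise ValueError("field out of range")
    ac = t >> 56                 # clique of the addressed memory, t bits 63..56
    address = t & MASK56         # t bits 55..0
    # AR bits 143..72: memtag (8) | ring (3) | size (61)
    upper = (ac << 64) | (pc_ring << 61) | seg_size
    # AR bits 71..0: newtag (8) | eight zero bits | address (56)
    lower = (newtag << 64) | address
    return (upper << 72) | lower
```

Clearing bits 63..56 of the data word is what makes the result compatible with non-cliqued address usage: the lower 72 bits look like an ordinary tagged word whose address fits in 56 bits.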

There are many ways this mechanism might be used, but one way a slab allocator might work is by setting the tags of each N words of the slab to incrementing values mod 232, and then incrementing the tags in the words of a freed block by 16 or 32 mod 232.
Doubleword
A doubleword is two words of course. Doublewords are stored at addresses that are multiples of 16 (16‑byte aligned in memory).
Sub-word
SecureRISC avoids the terms half-word, quarter-word, etc. because words have tags and there is no such thing as half or a quarter of a tag. SecureRISC does have 8‑bit, 16‑bit, 32‑bit, 64‑bit, and 128‑bit load and store instructions, and these instructions support misaligned accesses. Collectively these accesses are called sub-word loads and stores because they extract data from the data portion of one to three memory words. Sub-word loads tag the resulting data with an integer tag, and typed sub-word stores change the memory tag to integer.
Clique
A clique is the grouping of memory and pointers that access that memory into one of 232 cliques to detect errors in one of the two uses of tags. Memory allocation assigns the words one of the 232 cliques distinct from adjacent memory, and distinct from previous uses, and returns pointers that include this value, which is checked on accesses through the pointer. Thus, if pointer arithmetic moves the pointer outside of the allocation, accesses are likely to fail. When memory is freed, the tags are changed so that old pointers will no longer match. With strong bounds checking and garbage collection, cliques are not necessary, and the tag is not used in this way.
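As a toy model of how cliques catch out-of-bounds and use-after-free accesses, the following Python sketch simulates a slab whose blocks get incrementing cliques mod 232 and whose freed blocks have their tags bumped by 16. The class and method names are hypothetical, pointers are modeled as (clique, word index) pairs, and this is a software simulation of the idea, not a description of the hardware check.

```python
NUM_CLIQUES = 232  # clique values are 0..231

class Slab:
    def __init__(self, n_blocks: int, words_per_block: int):
        self.words_per_block = words_per_block
        # One memory tag per word; adjacent blocks get different cliques.
        self.tags = [(i // words_per_block) % NUM_CLIQUES
                     for i in range(n_blocks * words_per_block)]

    def pointer(self, block: int) -> tuple:
        """Return a (clique, word index) pair for the start of a block."""
        base = block * self.words_per_block
        return (self.tags[base], base)

    def free(self, block: int, step: int = 16) -> None:
        """Retag a freed block so that stale pointers stop matching."""
        base = block * self.words_per_block
        for i in range(base, base + self.words_per_block):
            self.tags[i] = (self.tags[i] + step) % NUM_CLIQUES

    def access_ok(self, ptr: tuple, offset: int = 0) -> bool:
        """Model the check that the pointer clique matches the memory tag."""
        clique, base = ptr
        return self.tags[base + offset] == clique

if __name__ == "__main__":
    slab = Slab(n_blocks=4, words_per_block=2)
    p = slab.pointer(1)
    print(slab.access_ok(p), slab.access_ok(p, offset=2))  # in bounds, then overrun
    slab.free(1)
    print(slab.access_ok(p))                               # stale pointer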
Decoded Pointer
Pointers live in memory and in XRs/SRs/VRs in memory format, undecoded. When loaded into ARs however, some decoding is done to prepare the pointer for use for calculating effective addresses for loads and stores. The decoded form is 144 bits. The decoding depends on the instruction used to load the AR. Storing an AR encodes it back to memory format, depending on the store opcode used. There are also instructions for saving and restoring the full decoded form of ARs to memory. The four forms of AR loads and stores are then for typed/sized pointers (e.g. LA or SA), cliqued pointers (e.g. CLA64 or CSA64), CHERI pointers (e.g. LAC or SAC), and decoded pointers (e.g. LAD or SAD). The 144‑bit AR format is shown below (note that CHERI internal format is shown in CHERI Capabilities):
AR bits 71..0
71 64 63 0
type data
8 64
AR bits 143..72
71 64 63 61 60 0
memtag ring size
8 3 61

The memtag field is set to 251 for word loads (e.g. LA), but is set to the expected memory tag on cliqued loads (e.g. CLA64) by copying from bits 63..56 (which are then cleared). For doubleword loads (e.g. LAD or LAC) it is set from the 144‑bit memory read, but LAC traps if it is not 251. The ring field is set from bits 66..64 for tags 192..199 and 200..207, and to PC.ring otherwise.
Tagged Non-pointer Data
64‑bit integer data
71 64 63 0
240 integer
8 64

IEEE-754 64‑bit floating point data
71 64 63 0
244 float64
8 64
Null/Nil pointer
Null Pointer
71 64 63 0
0 0
8 64
A pointer to 0-length data (tag 0) and address 0 is used as a null pointer. Any reference through this pointer causes an exception. There are BEQN and BNEN 16‑bit instructions for branching on null pointers since this is so common. Other uses of the data field are reserved.
Sized Pointers
Sized pointers encode the size of the addressed memory in the tag for small sizes. This size is decoded into a full size by the LA instruction, and this size is used by bounds checking loads and stores that use the decoded form as a base address. Small sizes support many common cases, such as the pointers returned by memory allocators (which typically increase allocation size to prevent fragmentation, e.g. as in slab allocators). Array slices would be rounded up to the next supported size. When bounds checking is called for on things where the size cannot be encoded in the tag, the WSIZE instruction may be used to set the decoded size from an XR. These pointers lack a ring number.
Adding to an AR decreases its size field by the increment.
Sized pointers may be created by using the LIMIT instruction using a base and new size, where the new size is bounds checked against base.size.
The tag specifies the size in an unsigned floating-point format with a 4‑bit exponent, 4‑bit significand with a hidden bit for non-zero exponents, and is expanded to a 61‑bit size as follows:
e ← tag_{7..4}
size ← e = 0 ? 0^{54} ∥ tag_{3..0} ∥ 0^{3} : 0^{53−e} ∥ 1 ∥ tag_{3..0} ∥ 0^{2+e}
Sized Pointer with size encoded in tag
71 70 64 63 61 60 48 47 0
0 SS SG segment fill byte address in segment
1 7 3 13 48−SEGSIZE SEGSIZE

Small Size Encoding SS
tag SS Size in Words G
2:0
6:3
0 1 2 3 4 5 6 7
0..7 0 0 1 2 3 4 5 6 7 1
8..15 1 8 9 10 11 12 13 14 15 1
16..23 2 16 18 20 22 24 26 28 30 2
24..31 3 32 36 40 44 48 52 56 60 4
32..39 4 64 72 80 88 96 104 112 120 8
40..47 5 128 144 160 176 192 208 224 240 16
48..55 6 256 288 320 352 384 416 448 480 32
56..63 7 512 576 640 704 768 832 896 960 64
64..71 8 1024 1152 1280 1408 1536 1664 1792 1920 128
72..79 9 2048 2304 2560 2816 3072 3328 3584 3840 256
80..87 10 4096 4608 5120 5632 6144 6656 7168 7680 512
88..95 11 8192 9216 10240 11264 12288 13312 14336 15360 1024
96..103 12 16384 18432 20480 22528 24576 26624 28672 30720 2048
104..111 13 32768 36864 40960 45056 49152 53248 57344 61440 4096
112..119 14 65536 73728 81920 90112 98304 106496 114688 122880 8192
120..127 15 131072 147456 163840 180224 196608 212992 229376 245760 16384
Possible Small Size Extension to tags 128..191
tag SS Size in Words G
2:0
7:3
0 1 2 3 4 5 6 7
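128..135 16 2^{18} 2^{18}+2^{15} 2^{18}+2×2^{15} 2^{18}+3×2^{15} 2^{18}+4×2^{15} 2^{18}+5×2^{15} 2^{18}+6×2^{15} 2^{18}+7×2^{15} 2^{15}
136..143 17 2^{19} 2^{19}+2^{16} 2^{19}+2×2^{16} 2^{19}+3×2^{16} 2^{19}+4×2^{16} 2^{19}+5×2^{16} 2^{19}+6×2^{16} 2^{19}+7×2^{16} 2^{16}
144..151 18 2^{20} 2^{20}+2^{17} 2^{20}+2×2^{17} 2^{20}+3×2^{17} 2^{20}+4×2^{17} 2^{20}+5×2^{17} 2^{20}+6×2^{17} 2^{20}+7×2^{17} 2^{17}
152..159 19 2^{21} 2^{21}+2^{18} 2^{21}+2×2^{18} 2^{21}+3×2^{18} 2^{21}+4×2^{18} 2^{21}+5×2^{18} 2^{21}+6×2^{18} 2^{21}+7×2^{18} 2^{18}
160..167 20 2^{22} 2^{22}+2^{19} 2^{22}+2×2^{19} 2^{22}+3×2^{19} 2^{22}+4×2^{19} 2^{22}+5×2^{19} 2^{22}+6×2^{19} 2^{22}+7×2^{19} 2^{19}
168..175 21 2^{23} 2^{23}+2^{20} 2^{23}+2×2^{20} 2^{23}+3×2^{20} 2^{23}+4×2^{20} 2^{23}+5×2^{20} 2^{23}+6×2^{20} 2^{23}+7×2^{20} 2^{20}
176..183 22 2^{24} 2^{24}+2^{21} 2^{24}+2×2^{21} 2^{24}+3×2^{21} 2^{24}+4×2^{21} 2^{24}+5×2^{21} 2^{24}+6×2^{21} 2^{24}+7×2^{21} 2^{21}
184..191 23 2^{25} 2^{25}+2^{22} 2^{25}+2×2^{22} 2^{25}+3×2^{22} 2^{25}+4×2^{22} 2^{25}+5×2^{22} 2^{25}+6×2^{22} 2^{25}+7×2^{22} 2^{22}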

The maximum size supported for 128 tags is 245,760 words or 1.875 MiB; with the possible extension to 176 tags it is 15,728,640 words or 120 MiB; and with 192 tags it is 62,914,560 words or 480 MiB. The maximum allocation overhead with this encoding is 12.5% (e.g. 131073 words rounding up to 147456). With fewer exponent bits, this could be reduced to 6.2% (e.g. 16385 words rounded up to 17408) at the cost of reduced range (31744 words or 248 KiB for tag 191). Beyond the maximum size, or when a memory ring number is required, unsized pointers must be used.
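The Small Size Encoding tables can be reproduced with a short Python sketch. It follows the tables as laid out above (SS taken from the upper tag bits, a 3‑bit significand in tag bits 2:0, and an implied leading 1 for non-zero SS), so it is a reading of the tables rather than a definitive decoder. The function names and the extended flag for tags 128..191 are hypothetical.

```python
def small_size_words(tag: int, extended: bool = False) -> int:
    """Size in words encoded by a sized-pointer tag, per the tables above."""
    limit = 191 if extended else 127
    if not 0 <= tag <= limit:
        raise ValueError("tag outside the sized-pointer range")
    ss, significand = tag >> 3, tag & 7
    if ss == 0:
        return significand                   # sizes 0..7, no hidden bit
    return (8 + significand) << (ss - 1)     # hidden bit, scaled by the granule

def tag_for_words(n_words: int, extended: bool = False) -> int:
    """Smallest tag whose encoded size is at least n_words."""
    limit = 191 if extended else 127
    for tag in range(limit + 1):
        if small_size_words(tag, extended) >= n_words:
            return tag
    raise ValueError("too large for a sized pointer; an unsized pointer is needed")

if __name__ == "__main__":
    request = 131073
    tag = tag_for_words(request)
    print(tag, small_size_words(tag))
```

An allocator could call tag_for_words to round a request up to the next representable size; the 131073-word request in the overhead example is the worst case for this rounding.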
Pointers to blocks with header/trailer sizes
Pointer with size at virtual address − 8
71 64 63 4 3 0
221 doubleword address 0
8 60 4

Providing a special tag for headers and trailers of allocated blocks allows a backward scan to find the start of the block. This may be useful in some applications.
There may be associated words before a header word that give additional information. If these exist, they are called leader words.
Size word stored at pointer − 8
71 64 63 4 3 0
250 doubleword count 0
8 60 4

Size word stored at pointer + size
71 64 63 4 3 0
250 − doubleword count 0
8 60 4
Unsized Pointers
Unsized Pointer
71 67 66 64 63 61 60 48 47 0
24 ring SG segment byte address
5 3 3 13 48

The only size check for unsized pointers comes from the segment size stored in the Segment Descriptor Word as cached in the TLB.
Unsized pointers are used for memory regions too large to have the size encoded in the tag and for pointers that have a ring number less than PC.ring. They may also be used for undisciplined language (e.g. C++) pointers to disable checking, but this is an insecure mode of operation, and cliqued pointers are preferred in such cases.
Code Pointers
Code pointers are used for function calls and returns, and for implementing switch statements. CHERI capabilities may also be used as code pointers. Calls and jumps using pointers without tags in the range 200 to 207 or 232 trap.
Pointer to Basic Block Descriptor
71 67 66 64 63 3 2 0
25 ring BB descriptor word address 0
5 3 61 3
Segment Relative Pointers
Segment relative pointers allow segments to contain address-space-independent pointers to locations within the segment. For example, a database could be mapped to different addresses in different address spaces, but still contain pointers to other data in the segment. There is no ring field in these pointers. The RLA64, RLA32, RLA64I, and RLA32I instructions load these pointers and convert to a sized pointer using the segment size. These instructions fill in the ring field with the current ring of execution (PC.ring). The RSA64, RSA32, RSA64I, and RSA32I instructions store pointers and convert to this format, checking that the segment number matches the segment number of the store address register and that the ring number is equal to the current ring of execution.
Segment Relative Pointers
71 64 63 61 60 0
223 0 offset
8 3 61
CHERI Capabilities
CHERI capabilities are stored in memory doublewords and may be loaded into ARs with the LAC instruction and stored with the SAC instruction. Word 1 of a CHERI capability is given a special tag. The word 0 and 1 tag values of CHERI capabilities may only be created by ring 7 and by CHERI instructions that derive from other CHERI capabilities.
Word 0 of CHERI capability
71 64 63 0
232 Local virtual address
8 64

Word 1 of CHERI capability
71 64 63 61 60 57 56 53 52 47 46 28 27 26 25 17 16 14 13 3 2 0
251 R 0 SDP AP 0 S F T TE B BE
8 3 4 4 6 19 1 1 9 3 11 3

The following gives an overview of the above. See CHERI Concentrate[PDF] Section 6 for details, except for the ring number field, which is SecureRISC specific.

Fields of CHERI Capability Word 1
Field Width Bits Description
BE 3 2:0 Bottom bits 2:0 or Exponent bits 2:0
B 11 13:3 Bottom bits 13:3
TE 3 16:14 Top bits 2:0 or Exponent bits 5:3
T 9 25:17 Top bits 11:3
F 1 26 Exponent format flag indicating the encoding for T, B and E:
The exponent is stored in T and B if EF=0, so it is internal
The exponent is zero if EF=1
S 1 27 Sealed
AP 6 52:47 Architectural permissions
SDP 4 56:53 Software defined permissions
R 3 63:61 Ring number (SecureRISC specific)
251 8 71:64 Tag for CHERI Word 1
The interpretation of the above fields is approximately (the CHERI Concentrate documentation is definitive) as follows:
e ← F ? 0^{6} : 52 − (TE ∥ BE)
ba ← F ? (B ∥ BE) : (B ∥ 0^{3})
ta ← F ? (T ∥ TE) : (T ∥ 0^{3})
carry ← ta_{11..0} < ba_{11..0}
bot ← (lvaddr_{63..14+e} + cb) ∥ ba_{13..0} ∥ 0^{e}
top ← (lvaddr_{63..14+e} + ct) ∥ (ba_{13..12} + ~F + carry) ∥ ta ∥ 0^{e}

CHERI Alternative
The CHERI-128 format above may be appropriate for a capability architecture, but something simpler could drop the capability aspects and provide more bits for the top and bottom bounds and a clique for dangling pointer detection. Here is one possibility:
Word 1 of CHERI Alternative
71 64 63 61 60 59 55 54 53 28 27 0
CLIQUE R W E L T B
8 3 1 5 1 26 28
Fields of cherialt
Field Width Bits Description
B 28 27:0 Bottom bits 30..3
T 26 53:28 Top bits 28..3
L 1 54 Length bit
E 5 59:55 Exponent
W 1 60 Write permission
R 3 63:61 Ring
CLIQUE 8 71:64 Clique
carry ← T_{25..0} < B_{25..0}
T2 ← B_{27..26} + L + carry
bot ← (lvaddr_{63..32+e} + cb) ∥ B ∥ 0^{3+e}
top ← (lvaddr_{63..32+e} + ct) ∥ T2 ∥ T ∥ 0^{3+e}
Trap on load or store
One tag is defined to cause an exception when it is referenced on a load or store. This is useful for detecting accesses to freed memory, which is a source of security issues. A special instruction is provided to overwrite such words. Trap on load is also useful for dynamic linking and for setting breakpoints on basic block descriptors.
Trap on store requires either a read before write, which is undesirable, or more likely storing an extra bit per word in the data cache tag RAMs (e.g. 8 bits for a 64 B line), which seems worth the checking this feature provides.
Trap on load tag
71 64 63 0
254 data
8 64
Trap on load or store tag
71 64 63 0
255 data
8 64

Dynamic Typing

As noted earlier, it is useful to provide tags for Common Lisp, Python, and Julia types, even when they are simply pointers to fixed-sized memory, and could theoretically use tags 1..128. This would consume perhaps 10 more tags, as illustrated in the following with the assumption that other types could employ the structure type or something like it (perhaps some of the following could do so as well).

Tag Lisp Julia Data use
0 nil? 0
1..31 simple-vector? Tuple? TBD (pointers with exact size in words)
32..127 ? (pointer with inexact sizes)
128..191 no dynamic typing use (Reserved)
192..199 no dynamic typing use (unsized pointer with ring)
200..207 Code pointer with ring
208..220 no dynamic typing use (Reserved)
221 simple-vector? Tuple? TBD (pointer to words with size header)
222 no dynamic typing use (Cliqued pointer in AR)
223 no dynamic typing use (Segment Relative)
224 CONS Pointer to a pair
225 Function Pointer to a pair
226 Symbol Pointer to structure
227 Structure Structure? Pointer to structure
228 Array Pointer to structure
229 Vector Pointer to structure
230 String Pointer to structure
231 Bit-vector Pointer to structure
232 CHERI-128 capability word 0
233 no dynamic typing use (Reserved)
234 Ratio Rational Pointer to pair
235 Complex Complex Pointer to pair
236 Bigfloat BigFloat Pointer to structure
237 Bignum BigInt Pointer to structure
238 Int128 Pointer to pair,
−2^127..2^127−1
tag 241 in word 0, tag 240 in word 1
239 UInt128 Pointer to pair,
0..2^128−1
tag 241 in both word 0 and word 1
240 Fixnum Int64 −2^63..2^63−1
241 UInt64 0..2^64−1
242 Character Bool, Char,
Int8, Int16, Int32,
UInt8, UInt16, UInt32
UTF-32 + modifiers,
subtype in upper 32 bits
243 no dynamic typing use (Reserved)
244 Float Float64 IEEE-754 binary64
245 Float16, Float32 subtype in upper 32 bits
246..249 no dynamic typing use (Reserved)
250 no dynamic typing use (header/trailer words)
251 no dynamic typing use (CHERI word 1)
252..253 no dynamic typing use (BB descriptor)
254 no dynamic typing use (trap on load or BBD fetch (breakpoint))
255 no dynamic typing use (trap on load or store)
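The tag assignments above can be read as a range lookup. The following Python sketch encodes the table rows as (low, high, description) triples; the names TAG_RANGES, describe_tag, and code_pointer_ring are hypothetical, and the descriptions are abbreviated from the table. Recovering the ring as tag − 200 follows the "200+ring" wording used for code pointers elsewhere in this document.

```python
# (low tag, high tag, description), transcribed from the table above.
TAG_RANGES = [
    (0, 0, "nil? / 0"),
    (1, 31, "pointers with exact size in words"),
    (32, 127, "pointer with inexact sizes"),
    (128, 191, "reserved"),
    (192, 199, "unsized pointer with ring"),
    (200, 207, "code pointer with ring"),
    (208, 220, "reserved"),
    (221, 221, "pointer to words with size header"),
    (222, 222, "cliqued pointer in AR"),
    (223, 223, "segment relative"),
    (224, 224, "CONS: pointer to a pair"),
    (225, 225, "Function: pointer to a pair"),
    (226, 231, "Symbol/Structure/Array/Vector/String/Bit-vector: pointer to structure"),
    (232, 232, "CHERI-128 capability word 0"),
    (233, 233, "reserved"),
    (234, 235, "Ratio/Complex: pointer to pair"),
    (236, 237, "Bigfloat/Bignum: pointer to structure"),
    (238, 239, "Int128/UInt128: pointer to pair"),
    (240, 240, "Fixnum / Int64"),
    (241, 241, "UInt64"),
    (242, 242, "Character and small integer types, subtype in upper 32 bits"),
    (243, 243, "reserved"),
    (244, 244, "Float64 (IEEE-754 binary64)"),
    (245, 245, "Float16, Float32, subtype in upper 32 bits"),
    (246, 249, "reserved"),
    (250, 250, "header/trailer words"),
    (251, 251, "CHERI word 1"),
    (252, 253, "BB descriptor"),
    (254, 254, "trap on load or BBD fetch (breakpoint)"),
    (255, 255, "trap on load or store"),
]


def describe_tag(tag):
    """Return the table description for an 8-bit tag value."""
    if not 0 <= tag <= 255:
        raise ValueError("tag must fit in 8 bits")
    for low, high, description in TAG_RANGES:
        if low <= tag <= high:
            return description
    raise AssertionError("tag table has a gap")


def code_pointer_ring(tag):
    """Ring encoded in a code pointer tag (200 + ring), else None."""
    return tag - 200 if 200 <= tag <= 207 else None
```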

Python and Other Language Types

In addition to Lisp types, SecureRISC could define tags for other dynamically typed languages, such as Python. Tuples, ranges, and sets might be examples. Other types, such as modules, might use a general structure-like building block rather than individual tags, as suggested for Lisp above.

Block-Oriented ISA Terminology

Basic Block
A series of instructions with control transfers only before the first instruction and after the last.
Bage
A SecureRISC invented term for Basic Block Page, which is a 4 KiB aligned portion of the virtual address space containing basic block descriptors and the instructions addressed by those descriptors. The bage size should be less than or equal to the minimum page size so that the bage lvaddr → siaddr translation can be used for the L1 instruction cache access; if the SecureRISC minimum page size is increased, the bage size might be increased as well.
Basic Block descriptor
All control transfers are to Basic Block (BB) descriptors, which have tags 252 and 253. Transfers are not to instructions; a jump addressing instruction words would take an exception based on a tag mismatch. The basic block descriptor points to the instructions and gives the details of the control transfers to successor basic blocks. For basic blocks with conditional branches, the conditional branch prediction is made when the basic block descriptor is executed, and checked when the conditional branch instruction in the basic block is executed. The conditional branch instruction only has the operands to decide on taken or not-taken; the branch offset is stored in the descriptor, not the instruction. Thus, conditional branches look like other ALU instructions and may occur anywhere in the basic block and need not be the last instruction (earlier placement may reduce the branch misprediction penalty).
Basic block descriptors (BBDs) would typically be cached in their own specialized cache earlier in the pipeline than the instruction cache, and all program path prediction would be done based on this cache. This takes the instruction cache hierarchy out of the critical path. Filling the BBD cache is done a line at a time, perhaps with prefetch, which further helps performance. Descriptors also enable wide, parallel instruction decode even when variable sized instructions are present.
Program Counter
The Program Counter (PC) is a processor register giving the current Basic Block Descriptor (an 8‑byte aligned lvaddr). In normal operation a Basic Block is executed in its entirety, and so the instruction within the Basic Block need only be identified when exceptions stop execution in the middle of a basic block. Calls only store 8‑byte aligned values and returns trap on non-aligned values. Exceptions do store the full PC (along with a 200+ring tag value) and the offset within the BB. When packed Basic Block Descriptors (tag 253) are implemented, the PC may become 4‑byte or 2‑byte aligned.

Other Features and Aspects of SecureRISC

Segment Growth
Segments may grow upward or downward, but not both. Downward-growing segments are supported by setting the downward bit in the Segment Descriptor Entry (SDE), which checks that the fill bits (bits 47..ssize) are all set. To grow a segment the supervisor increases ssize in the SDE and doubles the area allocated to the first-level page table.
Stacks
Stacks are typically contained in a single segment so that they may grow as needed, but still be bounds checked. It is up to the ABI whether stacks grow upward or downward. Downward-growing stacks made sense when the heap and stack were at opposite ends of a small address space, and bi-directional growth allowed each to grow to fill the space between without predetermined limits to each. This is no longer necessary with stack and heap each occupying its own segment. For both upward and downward stacks the size of stack frames is encoded in the sp register size field, which causes references using sp as a base register to be bounds checked to the stack frame.
Upward-growing stacks increment sp by the frame size, and the operating system knows that the stack includes that size. The ENTRY instruction accomplishes this, given an immediate value representing the tag of the new sp.
Downward-growing stacks simply decrement sp by the new frame size. This is accomplished with the ENTRYD instruction.
In both stack directions, the new size is written into the size field of sp.* The old value of sp is written into the new stack frame so that the frame can be deallocated by loading this value. Attempting to deallocate a frame with only an increment or decrement would lose the frame size. This method also makes stack frames a linked list, which is convenient for debugging. (If the compiler chooses a frame size that can be exactly encoded in the tag field, then this sp store can occupy a single word in the stack frame, rather than a doubleword.)
Each process thread is typically given two stacks per ring, each in its own segment. One stack of the pair is used only for return addresses, which are pushed and popped when calls, returns, exceptions, and exception returns commit in the pipeline using the CSP[ring] RCSRs.† The other stack is used for everything else. The return address stack segment is typically write-protected from the current ring of execution except during call operations and can only be written by calling a more privileged ring. This aids in Control Flow Integrity (CFI) by preventing return-oriented programming (ROP) from overwriting return addresses. The return address stack also avoids wasting an AR.
* When pointer and memory cliques are used for bounds checking undisciplined languages, it will be desirable to set the memory tags of the frame to a unique value, e.g. by using a call count mod the number of cliques, except if this results in the same clique as the previous frame. Initializing all the words of a stack frame is likely to be the major performance cost to the clique method.
† Most implementations will provide two specialized caches of several lines (typically 2–8 lines representing 16–64 return addresses) for the return address stack, one at the commit point of the pipeline and one at the front-end used for prediction. The commit cache is kept coherent with the processor data caches used by load and store instructions and the prediction cache may be restored from the commit cache on mispredicts.
Control-flow Integrity
Many attacks on conventional processors exploit sneaking trojan data into the memory of a process. Since that memory typically lacks execute permission, the attacker instead depends upon causing existing instructions to execute the attacker’s algorithm using bogus data. Return-oriented programming (ROP) is one method to exploit existing instructions by overwriting the return address on the stack so that the return transfers to a carefully chosen address that executes a few instructions and then returns to a new address. Often only a portion of a basic block containing a return is executed in this way. The basic block descriptor mechanism defeats this since a return to a basic block descriptor executes the entire basic block. In addition, basic block descriptors contain a field indicating whether a return ever targets the block, and returns to blocks that do not expect a return take an exception. Finally, moving the return address stack into protected memory prohibits overwrites.
Branch avoidance
SecureRISC has several features that reduce the demands on branch prediction, which improves performance. The Boolean Registers (BRs) are one aspect of the ISA that enables some branch avoidance.
Trap instructions
SecureRISC contains a rich set of trap instructions that cause an exception based on various conditional tests. This allows the compiler to supplement the checking mandated by the SecureRISC ISA with its own checks. Trap instructions do not use branch prediction resources and in some microarchitectures are almost free to execute with minor performance impact, except for their code size and fetch bandwidth requirements.
Loop count
Rather than depending upon the conditional branch predictor to predict loop iteration counts, SecureRISC defines instructions to communicate inner loop iteration counts to the BB engine and to indicate how to check predictions made thereby. This feature is initiated by the LOOPX or LOOPXI instructions with the number of iterations prior to the start of the loop in a BB with the c bit set in its descriptor. The microarchitecture employs count prediction on such BBs, most likely in a specialized structure. This prediction is replaced by the actual value when the LOOPX or LOOPXI executes in the AR engine, which is often before the first or second loop back. When the last BB of the loop wants to loop back, it uses a BB descriptor next code of loop back or loop back conditional, decrementing the predicted count and branching to the target if not zero. This feature allows SecureRISC to achieve DSP-like performance on simple loops and reduces the burden on the branch predictor, making it more effective on real conditional branches. The BB containing a loop test must also contain the SOBX instruction to decrement the actual loop iteration count in an XR and check the prediction.
Pointer Permission
In addition to Read, Write, and Execute permissions, SecureRISC includes Pointer and Capability permission bits in Segment Descriptors. Only segments with the P bit set are allowed to contain pointers to other segments. Stack and heap segments would typically have P set, but code and mapped data files would have P clear. Segments with P clear may only contain local pointers, which consist of just the offset within the segment. The RLA64, RLA32, RLA64I, and RLA32I instructions allow such pointers to be converted to full pointers when loaded into an Address Register. This allows a database to contain internal pointers that are independent of the address to which the segment is mapped at runtime.
Capability Permission
Capability Permission allows the segment to contain CHERI capabilities.

Sandboxing

At times it can be useful to be able to execute untrusted code in an environment where that code has no direct access to the rest of the system, but where it can communicate with the system efficiently. Hierarchical protection domains (aka protection rings) provide an efficient way to provide such an environment. Imagine a web browser that wants to be able to download code from an untrusted source, perhaps use Just-In-Time Compilation to generate native code, and then execute it to provide some service as part of displaying the web page. The downloaded code should not be able to access any files or the state of the user's browser. For this scenario on SecureRISC, where ring 0 is the least privileged and ring 7 the most privileged (the opposite of the usual convention), the web browser might execute in ring 2, generate machine code to a segment that is writeable from ring 2, but only Read and Execute to ring 0, and then transfer to that ring 0 code. All rings share the same address space and TLB entries for a given process, but the ring brackets stored in the TLB change access to data based on the current ring of execution. Ring 0 would have access only to its code, stack, and heap segments, and nothing else. It would not be able to make system calls or access files, except indirectly by making requests to ring 2. The only access ring 0 would have outside of its three segments might be to call a limited set of gates in ring 2, causing a ring transition. Interrupts and such would be delivered to the browser in ring 2, allowing it to regain control if the ring 0 code does not terminate. The browser and the rest of the system are completely protected from the code executing in ring 0. Because ring 0 is a subset of the address space of ring 2, ring 2 has complete access to all the data in ring 0, but ring 0 has access only to the segments granted to it by ring 2. Ring 2 has the option to grow or not grow the code, heap, and stack segments of ring 0 as appropriate.

Garbage Collection

One goal for SecureRISC is to support languages, such as Lisp, Julia, JavaScript, and Java, that rely on garbage collection (GC), as this eliminates many programming errors that introduce bugs and vulnerabilities. GC is the automatic reclamation of allocated memory by tracing all reachable allocations and freeing the remainder. GC needs to have low overhead while also meeting application response time requirements (e.g. by not pausing the application excessively). SecureRISC will achieve this by including features (described in subsequent sections) for generational GC and per-thread access restrictions to allow concurrent GC to be performed by one processor while another continues to run the application.

GC Terminology

Allocation is done in areas, which are for user-visible segregation of different-purpose allocations to different portions of memory. Areas consist of 1-4 generations, each generation consisting of some data structures and many small independent incremental regions that are used to implement incremental GC. The purpose of the incremental regions is to bound the timing of certain GC operations making program delays not proportional to memory size but only to incremental region size. When the application program needs to access an incremental region that has not been processed, the application switches to process it immediately, and then proceeds. The incremental region is small enough that the delay in processing it is acceptable to application performance, but large enough that its overhead is not excessive. A group of incremental regions is called a macro region, and a generation might be one or more macro regions. Macro regions are further divided into those for small and large objects, which use different algorithms for their incremental regions.

The SecureRISC Garbage Collection (GC) terminology introduced so far is briefly summarized below:

Area
Programmers allocate objects in areas, which provides grouping. In SecureRISC, an area might consist of a data structure and several segments, one per generation.
Generation
Generations group data into four lifetimes, from ephemeral (generation 0) to long-lived (generation 3), with generations 1 and 2 having intermediate object lifetimes. Generations consist of small-object macro regions and large-object macro regions.
Incremental Region
To minimize the time the application is paused by the GC algorithm, objects are stored in incremental regions that can be compacted quickly in response to an access by the application. Once the incremental region is compacted, the application continues.
Macro Region
Incremental regions are small to minimize processing time, and so cannot hold all of the application data. A set of incremental regions provides the capacity required by the application. Macro regions have incremental regions for large and small objects, managed by different algorithms.
Small Object Region
Small objects (less than a small multiple of the page size) benefit from compaction, and are allocated sequentially from free space.
Large Object Region
Large objects (greater than a small multiple of the page size) are allocated as many pages as required, and are never moved. When they are no longer referenced, GC frees the pages they occupied.

Generational GC

New allocations are presumed to have short lifetimes until proven otherwise. Such allocations are ephemeral and done in generation 0, which is reclaimed frequently. The ephemeral allocations store pointers to all generations, but there are few pointers from longer-lived generations to the more ephemeral allocations. For efficiency, reclamation operates without scanning all older allocations. Over time as data remains live in the ephemeral generation for many reclamations, it may be moved to an older generation. To work correctly, pointers in older areas that point to recent ones need to be known and used as roots for recent area scans. The processor hardware helps this process by taking an exception when a pointer to a newer generation is first stored to a location in an older generation; the trap handler can note the page being stored to and then continue. The translation cache access for the store will provide both the generation dirty level for the target page and the generation number of the target segment. For the data being stored, the tag indicates whether it is a pointer or not, and if so then the Segment Size Cache provides the generation number of the pointer being stored, and the translation cache provides the generation of the page being stored to. If the page generation is greater than the generation of the pointer being stored, an exception occurs. SecureRISC has support for 4 generations, with generation 0 being the most ephemeral and generation 3 being the least frequently reclaimed. Rather than storing the location of all pointers on a page to more recent generations, the trap might simply note which pages need to be scanned when GC happens later. Because words in memory are tagged, pages can be scanned later without concern that an integer might be interpreted as a pointer.
With sufficient operating system sophistication, it is even possible that a page could be scanned prior to being swapped to secondary storage, to prevent it needing to be read back in during GC. After the first trap on storing a recent generation pointer to an older generation page, if only the page is noted for later scan, then the GC field in the PTE would typically be changed by the trap handler so that future stores to the page are not trapped.
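The store check described above can be sketched in Python as follows. This is only a software model of the rule "trap when the page generation is greater than the generation of the pointer being stored"; the names store_traps and RememberedPages are hypothetical, and a simple set stands in for the handler changing the GC field in the PTE so that later stores to a noted page no longer trap.

```python
def store_traps(page_generation, value_is_pointer, pointer_generation,
                check_enabled=True):
    """True if a store of this value to this page would take the GC trap."""
    return bool(check_enabled and value_is_pointer
                and page_generation > pointer_generation)


class RememberedPages:
    """Trap-handler model: note pages to scan when a later GC runs."""

    def __init__(self):
        self.pages_to_scan = set()

    def store(self, page, page_generation, value_is_pointer,
              pointer_generation=None):
        # Assumption: once a page is noted, its check is disabled,
        # standing in for the handler rewriting the PTE GC field.
        enabled = page not in self.pages_to_scan
        if store_traps(page_generation, value_is_pointer,
                       pointer_generation, enabled):
            self.pages_to_scan.add(page)
            return True   # trap taken, page noted
        return False      # store proceeds without a trap


if __name__ == "__main__":
    rp = RememberedPages()
    print(rp.store(0x1000, 3, True, 0))  # generation 3 page, generation 0 pointer
    print(rp.store(0x1000, 3, True, 0))  # same page again
```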

Before describing the mechanisms for incremental GC, it is helpful to have a specific GC algorithm in mind. The next section presents the preferred algorithm. After the preferred algorithm, the details of per-thread access restriction for incremental GC are presented.

Garbage Collection Algorithm

David Moon, architect of Symbolics Lisp Machines, kindly offered suggestions on Garbage Collection (GC). I have dubbed his algorithm MoonGC. He began by observing the following:

Compacting garbage collection is better than copying garbage collection because it uses physical memory more efficiently.

Compacting garbage collection is better than non-moving garbage collection and C-style heap allocation because it does not cause fragmentation.

First, divide objects into small and large. Large objects are too large to be worth the overhead of compacting, larger than a few times the page size. Large objects are allocated from a heap and never change their address. The garbage collector frees a large object when it is no longer in use. By putting each large object on its own pages, physical memory is not fragmented and the heap only has to track free space at page granularity. Virtual address space gets fragmented, but it is plentiful so that is not a problem.

Small objects are allocated consecutively in fixed-size regions by advancing a per-region fill pointer until it hits the end of the region or there is not enough free space left in that region; at that point allocation switches to a different region. The region size is a small multiple of the page size. The allocator chooses the region from among a set of regions that belong to a user-specified area. Garbage collection will compact all in-use objects in a region to the low-address end of that region, consolidating all the free space at the high-address end where it can easily be reused for new allocations.
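The fill-pointer allocation just described can be sketched in Python as follows. The class and method names are hypothetical, 8-byte words are assumed, and the policy of simply moving to the next region in a list is an assumption standing in for however the allocator chooses a region from the area.

```python
WORD_BYTES = 8  # assumption: 64-bit data words


class IncrementalRegion:
    def __init__(self, base, size_words):
        self.base = base
        self.size_words = size_words
        self.fill = 0  # fill pointer, in words from the region base

    def try_alloc(self, nwords):
        """Advance the fill pointer, or return None if the object does not fit."""
        if self.fill + nwords > self.size_words:
            return None
        addr = self.base + self.fill * WORD_BYTES
        self.fill += nwords
        return addr


class Area:
    def __init__(self, regions):
        self.regions = regions
        self.current = 0

    def alloc(self, nwords):
        # Switch to a different region when the current one lacks space.
        while self.current < len(self.regions):
            addr = self.regions[self.current].try_alloc(nwords)
            if addr is not None:
                return addr
            self.current += 1
        raise MemoryError("no region in this area has enough free space")
```

In this sketch the unused tail of a region is simply left behind when allocation switches regions; compaction later consolidates free space at the high-address end.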

SecureRISC now uses the term incremental region for what MoonGC simply called a region above. Before continuing, this proposal introduces this and other terminology in the next section.

One other advantage of compaction, not mentioned above, is that it provides a natural mechanism for determining the long-lifetime data in ephemeral generations: it is the data compacted to the lowest addresses.

MoonGC, as originally presented, is a four-phase algorithm to implement the above using only virtual memory and changing page permissions. The following adapts MoonGC to take advantage of the address restriction feature described below, as using virtual memory protection changes is costly. The restriction allows GC to deny application threads access to incremental regions when they are in an inconsistent state. The following also makes other minor changes so that the exposition below is new. The credit goes to David Moon, but problems and bugs are likely the result of these changes and exposition.

The application threads run concurrently with the GC threads, except in phase 3 (the stack scan). Application threads may be slowed during phase 4 as will be explained. The four phases of MoonGC are as follows:

  1. Mark phase. Mark data reachable from roots in incremental region bitmaps. This phase is concurrent with the application. New allocations also mark the bitmaps.
  2. Preparation phase. Process the bitmaps to prepare for small object relocation and free large object pages. This phase is concurrent with the application.
  3. Relocation phase. Translate all roots, including the stack and the pointers from longer-lived generations, converting pre-compaction addresses to post-compaction addresses. Deny the application access to all small and large object regions using address restriction. The trap taken on a denied access is handled by a handler in the same ring, and is therefore fairly low-overhead. The application is paused during this phase, but this phase takes a short time and is not proportional to memory size, only the size of the stack and other roots. After this phase, the roots, stack, and all accessible memory contain only compacted addresses, and the application works only with such addresses. Application access to memory with pre-compaction addresses is denied (GC threads continue to have access).
  4. Compaction phase. This phase is concurrent with the application and occurs primarily in the GC threads, but application threads may join the work when needed. The compaction phase goes through all small object incremental regions and moves marked objects from their pre-compaction to post-compaction address, and also translates the pointers contained in the objects. It also goes through all large objects and translates the pointers contained therein. Once compaction is completed for an incremental region, the permission for that incremental region is enabled for the application. If an application thread tries to access a disabled region and compaction has not already started on this region, the trap handler switches the thread to compaction and translation of the incremental region (only translation for large object regions). (If a GC thread is already working on the incremental region, the application thread will just wait.) When an application thread joins the compaction threads, it will temporarily enable access, and then restore access when it finishes its incremental region, and then return to application work, which, now that the region is compacted, will no longer trap. The time spent in the application thread doing compaction and translation is proportional to incremental region size and not memory size, bounding the pause.
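The phase 4 trap path can be sketched as a small state machine. The following Python model is single-threaded and purely illustrative: the RegionState names, on_access_fault, and the compact and wait callbacks are hypothetical, and a real handler would need an atomic claim on the region rather than the plain assignment used here.

```python
from enum import Enum


class RegionState(Enum):
    DENIED = "denied"          # restricted in phase 3, compaction not started
    COMPACTING = "compacting"  # some thread is working on the region
    READY = "ready"            # compacted, application access enabled


def on_access_fault(states, region, compact, wait):
    """Model of the same-ring handler for an access to a restricted region."""
    state = states[region]
    if state is RegionState.DENIED:
        # The application thread joins the compaction work for this region.
        states[region] = RegionState.COMPACTING
        compact(region)
        states[region] = RegionState.READY
        return "compacted by application thread"
    if state is RegionState.COMPACTING:
        # A GC thread already owns the region; just wait for it to finish.
        wait(region)
        return "waited for GC thread"
    return "no fault"
```

After the handler returns, the application retries the access, which no longer traps because the region is compacted.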

Occasionally an extra phase of the algorithm might compact two incremental regions into one. Still additional phases might migrate objects from a frequent generation to a less frequent one.

Virtual Address Restriction

This proposal starts with the assumption that software will designate one or more macro regions of the virtual address space to be subject to additional access control for rings ≤ R (controlled by a privileged register so that, for example, user mode cannot affect supervisor mode). For example, when Garbage Collection is used for reclaiming allocated storage, only the heap might be subject to additional protection to implement read or write barriers. These macro regions of the virtual address space are specified using a Naturally Aligned Power-of-2 (NAPOT) matching mechanism to provide flexible sizing. Matching for eight macro regions is currently proposed, which would support four generations of small object macro regions, and four generations of large object regions. This restriction is implemented in a fully associative 8‑entry CAM matching the effective address of loads and stores. A match results in 128 access restriction bits, with one bit selected by the address bits below the match. In particular, there are eight Virtual Address Match registers (amatch0 to amatch7), eight 128‑bit Virtual Address Region Trap registers (atrap0 to atrap7), and eight 128‑bit Virtual Address Region Write Trap registers (awtrap0 to awtrap7). The atrapi/awtrapi registers are read and written 64 bits at a time using low and high suffixes, i.e. atrapil/atrapih and awtrapil/awtrapih. The format of the amatchi registers is as follows, using a NAPOT encoding of the bits to compare when testing for a match.

Virtual Address Match Registers
Bits    63..22           21..18  17..4  3..0
Field   vaddr[63..19+S]  2^S     0      TYP
Width   45−S             1+S     14     4
Fields of amatchi registers
Field Width Bits Description
TYP 4 3:0 0 ⇒ Disabled; 1 ⇒ Address restriction for GC; 2..15 ⇒ Reserved
2^S 1+S 18+S:18 NAPOT encoding of virtual address region to match
vaddr[63..19+S] 45−S 63:19+S Virtual address to match

When bits 63:19+S of a virtual address match the same bits of amatchi, then the corresponding atrapil/atrapih and awtrapil/awtrapih pairs specify 128 additional access and write denial bits for the incremental regions thereof. In particular, on a match to amatchi, bits 18+S:12+S of the effective address are used to select bits from the atrapi pair and the awtrapi pair. If the atrapi bit is 1, then loads and stores generate an access fault; else if the awtrapi bit is 1, then only stores generate an access fault. The value of S comes from the NAPOT encoding of amatchi registers, as the number of zero bits starting from bit 18 (i.e., S=0 if bit 18 is 1, S=1 if bits 19:18 are 10, and so on). Setting bits 63:18 to 2^45 causes it to match the entire virtual address space. The lowest numbered amatchi match has priority. If no amatchi register matches then there is no additional access check.
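The matching rule can be modeled in software as follows. This Python sketch treats the registers as plain integers (amatch 64 bits, atrap and awtrap 128 bits each); the function names and the make_amatch helper are hypothetical, and treating the reserved TYP values 2..15 as "no match" is an assumption.

```python
MASK64 = (1 << 64) - 1


def decode_s(amatch):
    """S = number of zero bits starting at bit 18 (the NAPOT encoding)."""
    napot = (amatch & MASK64) >> 18
    if napot == 0:
        raise ValueError("no NAPOT marker bit set in bits 63:18")
    return (napot & -napot).bit_length() - 1


def matches(amatch, ea):
    """True if bits 63:19+S of ea equal the same bits of amatch."""
    if amatch & 0xF != 1:        # TYP 1 = address restriction for GC
        return False
    shift = 19 + decode_s(amatch)
    return (amatch & MASK64) >> shift == (ea & MASK64) >> shift


def check_access(ea, is_store, amatch, atrap, awtrap):
    """amatch, atrap, awtrap are lists of eight register values."""
    for i in range(len(amatch)):             # lowest-numbered match wins
        if matches(amatch[i], ea):
            s = decode_s(amatch[i])
            bit = (ea >> (12 + s)) & 0x7F    # ea bits 18+S:12+S
            if (atrap[i] >> bit) & 1:
                return "access fault"
            if is_store and (awtrap[i] >> bit) & 1:
                return "access fault"
            return "ok"
    return "ok"                              # no match: no additional check


def make_amatch(base, s, typ=1):
    """Hypothetical helper: build an amatch value for a 2^(19+s)-byte region."""
    shift = 19 + s
    return ((base >> shift) << shift) | (1 << (18 + s)) | typ
```

For example, make_amatch(0x40000000, 5) describes a 16 MiB macro region whose 128 incremental regions are 128 KiB each, and setting bit 3 of the corresponding atrap value denies loads and stores to addresses 0x40060000 through 0x4007FFFF.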

How to control ring access to the above CSRs is TBD, as is which ring accesses are trapped.

An atest instruction will be specified to return the incremental region that matches the effective address ea. If there is no match, the instruction returns the null pointer (tag 0). On a match to amatchi it returns a pointer (with the appropriate size tag) to ea[63..12+S] ∥ 0^(12+S), based on the S from the matching register.
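A software model of the address part of this result might look like the following Python sketch. The function name is hypothetical, None stands in for the null pointer (tag 0), the size tag of the returned pointer is not modeled, and reserved TYP values are assumed not to match.

```python
def atest_region_base(ea, amatch_regs):
    """Return the base address of the incremental region containing ea, or None."""
    ea &= (1 << 64) - 1
    for am in amatch_regs:                 # lowest-numbered match has priority
        if am & 0xF != 1:                  # only TYP 1 entries are considered
            continue
        napot = am >> 18
        if napot == 0:
            continue
        s = (napot & -napot).bit_length() - 1   # zero bits starting at bit 18
        if am >> (19 + s) == ea >> (19 + s):
            # Keep ea bits 63..12+S and clear the low 12+S bits.
            return (ea >> (12 + s)) << (12 + s)
    return None
```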

awtrapi registers are not required for MoonGC, described above, and may be left set to zero for that algorithm. They could be omitted if another use is not found for them, but they may be useful for other GC algorithms.

The efficiency of translating pre-compaction to post-compaction addresses is critical. The original MoonGC proposal recognized that this time is probably limited by data cache misses, and used the preparation phase to convert the bitmaps into a relocation tree that would require only three cache block accesses per translation with binary searching. The following modification is proposed to reduce this to just two cache blocks by making extensive use of population count (popcount) operations.

Within a small object incremental region, the post-compaction offset of an object is the number of mark bits set in the incremental region bitmap for objects up to but not including that object. For translation, summing the popcount on all the words in the bitmap prior to the word for the pre-compaction address would touch too many cache blocks, so in phase 2 (preparation) compute the popcounts of each bitmap cache block and store them for lookup in phases 3 and 4. Each translation is then one popcount cache block access and one bitmap cache block access. For a small object incremental region holding N objects and a cache block size of 512 bits (64 B), the number of bitmap cache blocks B is ⌈N/512⌉. Store 0 in summary[0]; store popcount(bitmap[0..511]) in summary[1]; store summary[1]+popcount(bitmap[512..1023]) in summary[2]; and so on … and finally store summary[B-2]+popcount(bitmap[N-1024..N-513]) in summary[B-1]. If N ≤ 65536 then the summary count array elements fit in 16 bits, and so the size of the summary array is ⌈B/32⌉ cache blocks, and if N ≤ 16384 the summary array fits in only one cache block. To translate from the pre-compaction offset to the post-compaction offset in phases 3 and 4, simply take ⌊offset/512⌋ as the index into this array to get the number of objects before the bitmap cache block. Now read the bitmap cache block. Add the popcount of the 1-8 words up to the object of interest (using a mask on the last word read) to the lookup value. This sum is the post-compaction offset in the small object incremental region. If eight popcounts are too costly, then the summary popcount array may be doubled in size to cover just four words, or a vector popcount reduction instruction might be added to make this even more efficient.
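The summary construction and the two-block translation can be sketched in Python as follows. The bitmap is modeled as a list of 64-bit integers, eight per 64 B cache block; the function names are hypothetical, and the bit ordering (bit k of word w marks offset 64·w + k) is an assumption.

```python
WORD_BITS = 64
WORDS_PER_BLOCK = 8                      # 64-byte cache block
BLOCK_BITS = WORD_BITS * WORDS_PER_BLOCK  # 512


def popcount(x):
    return bin(x).count("1")


def build_summary(bitmap_words):
    """Phase 2: summary[b] = number of mark bits in all blocks before block b."""
    summary = []
    running = 0
    for start in range(0, len(bitmap_words), WORDS_PER_BLOCK):
        summary.append(running)
        block = bitmap_words[start:start + WORDS_PER_BLOCK]
        running += sum(popcount(w) for w in block)
    return summary


def translate(bitmap_words, summary, offset):
    """Phases 3 and 4: pre-compaction offset to post-compaction offset."""
    block = offset // BLOCK_BITS
    word = offset // WORD_BITS
    bit = offset % WORD_BITS
    count = summary[block]                                # summary block access
    for w in bitmap_words[block * WORDS_PER_BLOCK:word]:  # whole words before
        count += popcount(w)
    # Mask the last word so only bits below the object of interest count.
    count += popcount(bitmap_words[word] & ((1 << bit) - 1))
    return count
```

For a bitmap with marks at offsets 0, 5, 64, 600, and 1000, build_summary gives [0, 3], and translating offset 600 gives 3 because three marked words precede it.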

As an example, to illustrate the above, consider NAPOT matches on 16 MiB (S=5), which provides 128 access controlled incremental regions of 128 KiB (131072 B) each. An object pointer is converted to its containing incremental region by simply clearing the low 17 bits. There are 16104 words (2013 cache blocks) of object store (98.29%), which are stored starting at offset 0 in the incremental region. The bitmap summary popcounts are 64 bytes starting at 128832. Bitmaps are 2016 bytes (31.5 cache blocks) starting at 128896. Finally there are 160 bytes (20 words, 2.5 cache blocks) of incremental region overhead for locks, freelists, etc. available starting at 130912. To go from the pointer to its bitmap byte, add bits 16:6 to the region pointer plus 128896 and the bit is given by bits 5:3.
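The address arithmetic of this example can be checked with a short Python sketch. The constant and function names are hypothetical; the offsets are the ones quoted in the example for 128 KiB incremental regions.

```python
REGION_BYTES = 128 * 1024   # incremental region size for S=5
SUMMARY_OFFSET = 128832     # 16104 object words * 8 bytes
BITMAP_OFFSET = 128896      # summary offset + 64 bytes
OTHER_OFFSET = 130912       # bitmap offset + 2016 bytes


def mark_bit_location(ptr):
    """Return (region base, bitmap byte address, bit within that byte)."""
    region = ptr & ~(REGION_BYTES - 1)                    # clear the low 17 bits
    byte_addr = region + BITMAP_OFFSET + ((ptr >> 6) & 0x7FF)  # bits 16:6
    bit = (ptr >> 3) & 0x7                                # bits 5:3
    return region, byte_addr, bit


if __name__ == "__main__":
    # Word 100 of the region at 0x40060000: bitmap byte 12, bit 4.
    print(mark_bit_location(0x40060000 + 8 * 100))
```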

Incremental region layout examples
S   Mregion  Iregion  Objects         Summary      Bitmap       Other
    MiB      words    words    %      words  %     words  %     words  %
0   0.5      512      480      93.75  1      0.20  8      1.56  23     4.49
1   1        1024     984      96.09  1      0.10  16     1.56  23     2.25
2   2        2048     1992     97.27  1      0.05  32     1.56  23     1.12
3   4        4096     4008     97.85  2      0.05  63     1.54  23     0.56
4   8        8192     8040     98.14  4      0.05  126    1.54  22     0.27
5   16       16384    16104    98.29  8      0.05  252    1.54  20     0.12
6   32       32768    32232    98.36  16     0.05  504    1.54  16     0.05
7   64       65536    64480    98.39  32     0.05  1008   1.54  16     0.02
8   128      131072   128976   98.40  63     0.05  2016   1.54  17     0.01
9   256      262144   257968   98.41  126    0.05  4031   1.54  19     0.01
10  512      524288   515952   98.41  252    0.05  8062   1.54  22     0.00
11  1024     1048576  1031928  98.41  504    0.05  16124  1.54  20     0.00

Smaller incremental regions may provide better real-time response, but limit the size of a macro region due to the 128 access denial bits provided by atrapi pairs. Larger incremental regions pause the application for longer and also require a larger summary popcount array, but allow for larger memory spaces. Generations might choose different incremental region sizes. Typically generation 0 (the most ephemeral) would use small incremental regions, while generation 3 (the most stable) would use incremental regions sized to fit the amount of data required.

With eight amatch sets of registers, half might be used for four generations of small object regions, and half for four generations of large object regions. In the above example, if each bit of atrap controls a 128 KiB small object region, then the ephemeral generation can be as large as 16 MiB. Less ephemeral generations might be larger.

A possible improvement to the algorithm is to have areas use slab allocation for a few common sizes. For example, there might be separate incremental regions for 1, 2, …, 8, and >8‑word objects. This allows a simple freelist to be used for objects ≤8 words so that compaction is not required on every GC. Incremental regions for ≤8 words might only be compacted when it would allow pages to be reclaimed or cache locality to be increased. Note that different tradeoffs may be appropriate for triggering compaction in ephemeral vs. long-lived generations. Also, bitmaps could potentially use only one bit per object rather than one bit per word in 2‑word, 4‑word, and 8‑word regions, making these even more efficient. However, that requires a more complicated mark and translation code sequence.

When a GC thread finishes compaction of an incremental region, application access is not immediately enabled since that would require sending an interrupt to all the application threads telling them to update their atrap registers. Instead the updated atrap bits are stored in memory, and the next application exception will load the updated value before testing whether compaction is required, in progress, or still needs to be done.

Setting the TYP to 0 in amatchi registers may be used by operating systems to reduce context switch overhead; disabled registers may be treated as having amatchi/atrapi/awtrapi all zero.

Exceptions And Interrupts

This section is very preliminary at this point.

Each ring is capable of handling some of its own exceptions and interrupts. For example, ring N assertion failures (attempts to write 1 to b0) turn into a call to the ring N handler. This exception call pushes the PC, the offset in the basic block, plus three words of additional information onto the Call Stack, and a return pops this information. The exception handler is specified in a per ring register. The additional information includes a cause code and may include further information that is cause dependent. The details of the exception mechanism are TBD. Of course, in some cases exceptions should be handled by a more privileged ring (e.g. user page faults should go to a supervisor exception handler since the user exception handler might take a page fault, and similarly for second-level page faults for the supervisor and hypervisor). Again details TBD. Also, exceptions in exception handlers may go to a higher ring.

The Basic Block Descriptor (BBD) addresses of the exception handlers for ring R are given by the RCSR ExceptionHandler[R], which must be 8‑byte aligned (typically these values are 128‑byte aligned). As with other RCSRs, only rings of equal or higher privilege may write the register. In addition, values written to this register must have a code pointer tag designating a ring of equal or higher privilege, but not of higher privilege than PC.ring. Thus the validity test is as follows:
h ← X[a]
if (h[2..0] ≠ 0) | (R > PC.ring) | (h[71..67] ≠ 25) | (h[66..64] < R) | (h[66..64] > PC.ring) then
  exception
endif
In addition the basic block descriptor (BBD) at ExceptionHandler[R] must have tag 252 with prev = 4 (Cross-Ring Exception entry) and the BBD at ExceptionHandler[R] | 64 must have tag 252 with prev = 12 (Same-ring Exception entry).
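The register-write validity test can be restated as a small Python predicate, which may help when checking example values by hand. This is a sketch of the test as written above, not of the (still TBD) exception mechanism; the function name and the modeling of the 72-bit value as a Python integer are assumptions made for illustration.

```python
def handler_value_ok(h, r, pc_ring):
    """Return True if the 72-bit value h may be written to ExceptionHandler[r]
    by code running in ring pc_ring."""
    low_bits = h & 0x7             # bits 2..0: must be 0 (8-byte aligned)
    tag_high = (h >> 67) & 0x1F    # bits 71..67: 25 for a code pointer tag
    tag_ring = (h >> 64) & 0x7     # bits 66..64: ring named by the tag
    if low_bits != 0:
        return False
    if r > pc_ring:                # cannot write a more privileged ring's RCSR
        return False
    if tag_high != 25:
        return False
    if tag_ring < r or tag_ring > pc_ring:
        return False
    return True


def code_pointer(address, ring):
    """Hypothetical helper: build a value whose tag bits 71..67 are 25."""
    return ((25 << 3 | ring) << 64) | address
```

For example, with PC.ring = 3 a handler for ring 2 tagged with ring 2 or 3 passes, while one tagged with ring 1 or ring 5, or one that is not 8-byte aligned, is rejected.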

ExceptionHandler[R] specifies the BBD address for exceptions from less privileged rings to ring R (i.e. for PC.ring < R). Exceptions from ring R to R (i.e. for PC.ring = R) use the modified BBD address ExceptionHandler[R] | 64. This allows cross-ring exceptions to perform additional state save and restore (e.g. stack switching), while same-ring exceptions are fast (and, for example, stay on the same stack).

The exception process may itself encounter an exception that must be serviced by a more privileged ring (e.g. a virtual memory exception in writing the call stack). This will be designed so that after the virtual memory exception is remedied, the lower privilege ring exception can proceed. Also, programming or hardware errors might result in an attempt to take an exception in the critical portion of the exception handler, which will be detected, and signal a new exception to a more privileged ring, or a halt in the case of ring 7.

SecureRISC could provide an instruction to push an AR pair and an XR pair onto the Call Stack rather than providing per-ring scratch registers. However, some sort of way of loading new values for these registers to give the exception handler the addresses it needs to save further state is still required. It is unlikely that using an absolute address is acceptable.

Each ring has its own set of interrupt enable and pending bits, and these are distinct from other rings’ bits. Interrupts also use the general exception handler, with a cause code indicating that the reason is an interrupt. Their additional information includes the previous InterruptEnable mask for the target ring. When the interrupt exception occurs, InterruptEnable[ring] is automatically cleared, including the bit for the interrupt being taken, and the original interrupt enables are saved on the Call Stack. The interrupt handler is expected to reenable higher-priority interrupts by clearing the same and lower priority interrupts from the saved enables and writing the result back to InterruptEnable[PC.ring]. The bits to clear from the saved enables might be a bitmask from a per-thread lookup table, which allows all 64 interrupts to be prioritized relative to each other.* The RFI instruction restores the interrupt enable bits from the Call Stack. Any pending interrupts that are thereby enabled will be taken before executing the instruction returned to. The RFI instruction may optimize this case by simply transferring back to the handler address rather than popping and pushing the call stack.

* Using a per-interrupt mask of same and lower-priority interrupts is very general and allows for all 64 interrupts to be prioritized relative to each other. However, rather than clearing the ring’s InterruptEnable, which temporarily blocks high-priority interrupts, it would be possible to do the new InterruptEnable computation in hardware as part of the process of taking the interrupt, but this requires a per-ring 64×64 SRAM to specify lower priority interrupts per taken interrupt, and this is a lot to context switch. If it is required, it would instead be possible to provide a per-ring 64×4 SRAM (256 bits to context switch) giving a 4‑bit interrupt priority to each interrupt, and use that to calculate a new InterruptEnable when taking an interrupt. Sixteen priority levels should be sufficient. However, this would require a new RICSR type to be able to read/write 256 bits per-ring, and so this would only be done if it proves necessary.
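The software re-enable step, and its relation to the 4‑bit priority alternative in the footnote, can be sketched in Python. This is an illustration only: the function names are hypothetical, and the convention that a larger priority number means higher priority is an assumption, since the text does not fix one.

```python
ALL_INTERRUPTS = (1 << 64) - 1


def build_mask_table(priority):
    """priority: list of 64 small integers (e.g. 4-bit levels), larger = more
    urgent (assumed). Entry i of the result has a 1 for every interrupt of the
    same or lower priority than interrupt i."""
    table = []
    for taken in range(64):
        mask = 0
        for other in range(64):
            if priority[other] <= priority[taken]:
                mask |= 1 << other
        table.append(mask)
    return table


def reenable(saved_enable, taken, table):
    """Value the handler writes back to InterruptEnable[PC.ring]: the saved
    enables with same and lower priority interrupts cleared."""
    return saved_enable & ~table[taken] & ALL_INTERRUPTS
```

With sixteen levels assigned as priority[j] = j // 4, taking interrupt 9 clears enables 0 through 11 and leaves any previously enabled interrupts 12 through 63 enabled.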

Interrupt pending bits are set by writing to a memory address specific to the target process. When the process is running, this memory address is redirected to the process’ pending register; otherwise, it will receive the interrupt when it switches to running.

The mechanism for clearing an interrupt pending bit is interrupt dependent. For level-triggered interrupts it is interaction with the interrupt signal source that deasserts the signal, and thus clears the pending bit. For edge-triggered and message-signalled interrupts, the RCSRRCXC instruction may be used to clear the interrupt pending bit.

Processors check for interrupts at the start of each instruction. An interrupt is taken instead of executing the instruction if (InterruptPending[ring] & InterruptEnable[ring]) ≠ 0 with the check done in order for ring from 7 down to PC.ring.
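The per-instruction check can be written out as a short Python sketch. It models the per-ring registers as lists of eight integers and uses a hypothetical function name; the order in which bits within a ring are serviced is not specified above, so the sketch only reports which ring takes the interrupt.

```python
def interrupt_ring(pending, enable, pc_ring):
    """Return the ring that takes an interrupt before the next instruction,
    or None if the instruction executes normally.

    pending, enable: lists of eight 64-bit integers indexed by ring."""
    for ring in range(7, pc_ring - 1, -1):     # ring 7 down to PC.ring
        if pending[ring] & enable[ring]:
            return ring
    return None
```

Note that pending and enabled interrupts for rings below PC.ring are not examined, so they remain pending until execution returns to a ring at or below their level.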

Three interrupts are generated by the processor itself and are assigned fixed bits in the InterruptPending and InterruptEnable registers. Bit 0 is for the ICount interrupt; bit 1 is for the CCount interrupt; and bit 2 is for the Wait interrupt. Wait interrupts occur whenever less privileged rings attempt to use one of the wait instructions that would suspend execution. Enabling Wait interrupts allows such waits to be intercepted in order to switch to other work. This interrupt would typically be enabled when other work exists, and disabled otherwise. In addition, the supervisor is expected to define certain interrupts for user rings. For example, a timer interrupt would typically be created from cycle counts for bit 3. (Need to either define per-ring Wait interrupts or have a rule that the least privileged ring of higher privilege gets the interrupt.)

Virtualization of Interrupts

Interrupts need to be virtualized. SecureRISC expects systems to primarily employ Message Signaled Interrupts (MSIs), where interrupts are transactions on the system interconnect. MSIs are directed to a specific process. If the process is currently executing on a processor, then the interrupt goes to that processor. If the process is not running, then the interrupt must be stored in memory structures (e.g. by setting a bit), and then the scheduler for that process must be notified (e.g. by an interrupt message). When a process is scheduled on a processor, the interrupts stored in memory are loaded into the processor state, and future interrupts are directed to the processor rather than to memory.

To implement this, interrupt messages are directed to one or more specialized Interrupt Processing Units (IPUs). Creating a process allocates system interconnect memory for the process’ interrupt data structures and provides this memory to the chosen IPU. When the process is scheduled, the IPU is informed to forward interrupts directly to it. When a process is descheduled, the IPU is informed to store its interrupts in the allocated memory and send an interrupt to the scheduler.

For some systems a single Interrupt Processing Unit (IPU) may be sufficient. In others it may be appropriate to have multiple IPUs, e.g. one unit per major priority level, so that lower priority interrupts do not impede the processing of higher priority ones. (There may be some sequential processing in IPUs, such as a limitation on outstanding memory operations.) NUMA systems may also want distributed IPUs.

The details of the above are TBD. Conceptually, MSIs would probably address a process through a triple of Interrupt Processing Unit (IPU) number, an opaque identifier referencing a process, and an interrupt number for the process. The opaque identifier would be translated to its associated memory by the IPU, and the interrupt number bounds checked against the number of interrupts configured for the process. Forwarding interrupts to running processes would specify a processor port on the system interconnect, a ring number, and the interrupt number. It may be desirable to fit the interrupt state for a process into a single cache line to help manage atomic transfers between IPUs and processors.
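Since the details are TBD, the following Python model is only a toy illustration of the flow sketched above for a single IPU: store interrupts in memory and notify the scheduler while a process is not running, forward them while it is. Every name here (Ipu, create_process, schedule, deschedule, deliver) and the "reject" result for an out-of-range interrupt number are assumptions, not part of the proposal.

```python
class Ipu:
    """Toy model of one Interrupt Processing Unit (all names hypothetical)."""

    def __init__(self):
        self.table = {}   # opaque identifier -> per-process interrupt state

    def create_process(self, opaque_id, num_interrupts, ring):
        self.table[opaque_id] = {
            "limit": num_interrupts,   # interrupts configured for the process
            "ring": ring,              # ring that receives the interrupts
            "stored": 0,               # pending bits held in memory
            "port": None,              # processor port while scheduled
        }

    def schedule(self, opaque_id, port):
        """Process starts running: return the stored bits for loading into
        processor state and forward future interrupts to the given port."""
        state = self.table[opaque_id]
        stored, state["stored"], state["port"] = state["stored"], 0, port
        return stored

    def deschedule(self, opaque_id):
        """Process stops running: future interrupts go to memory again."""
        self.table[opaque_id]["port"] = None

    def deliver(self, opaque_id, irq):
        """Handle one message-signaled interrupt for a process."""
        state = self.table[opaque_id]
        if not 0 <= irq < state["limit"]:          # bounds check
            return ("reject",)
        if state["port"] is not None:              # running: forward directly
            return ("forward", state["port"], state["ring"], irq)
        state["stored"] |= 1 << irq                # not running: store in memory
        return ("notify_scheduler", opaque_id)
```

The model keeps the per-process state in a single small record, in the spirit of the suggestion that it might fit into one cache line.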

The advantage of this outline is that no specialized storage is required per process. Main memory replaces the specialized storage for non-running processes, and the processor interrupt mechanisms are used for running processes.

Dynamic Linking and Loading

Most RISC ISAs use a set of mechanisms to implement dynamic loading and dynamic linking that are less efficient than what SecureRISC can do using tags and a different ABI. Because the RISC-V ABI for dynamic linking is slightly better than some older ABIs, it will be the basis of comparison here.

Most dynamic linking implementations do lazy linking on procedure calls; the first call to a procedure invokes the dynamic linker, which converts the symbol being referenced into an address and arranges that subsequent calls go directly to the specified address. This speeds up program loading because symbol lookup is somewhat expensive, and not all symbols need to be resolved in every program invocation. Lazy linking is not typically done for data symbols because detecting the first reference and invoking the dynamic linker would require too much extra code at every data access. So data symbol links are typically resolved when a shared library is loaded. In contrast, SecureRISC’s trap on load tag (tag value 254) allows links to external data symbols to be resolved on first reference, which should lead to faster execution initiation.

In RISC‑V, because external symbols are resolved by the dynamic linker when the object is dynamically loaded, it suffices to reference them indirectly through the link filled in by the linker, which is stored in a section called the Global Offset Table (GOT). In RISC‑V the GOT is a copy-on-write section of the mapped file and is addressed using PC-relative addressing (using the RISC‑V AUIPC instruction).

External symbol and function references are given in the C++, RISC‑V, and SecureRISC code examples below to illustrate the differences between the RISC‑V ABI and the proposed SecureRISC ABI. Starting with the C++ code:

extern uint64_t count;   // external data
extern void doit(void);  // external function
static void
count_doit(void)
{
    count += 1;
    doit();
}
could be implemented as follows for RISC‑V:
count_doit:
        addi  sp, sp, -16               // allocate stack frame
        sd    ra, 0(sp)                 // save return address
.Ltmp:  auipc t0, %got_pcrel_hi(count)  // load link to count from GOT
        ld    t0, %pcrel_lo(.Ltmp)(t0)  // (PC-relative)
        ld    t1, (t0)                  // load count
        addi  t2, t1, 1                 // increment
        sd    t2, (t0)                  // store count
        call  doit@plt                  // call doit indirectly through PLT
        ld    ra, 0(sp)                 // restore return address
        addi  sp, sp, 16                // deallocate stack frame
        ret                             // return from count_doit
where the call pseudoinstruction above is initially:
        auipc ra, 0                     // with relocation R_RISCV_CALL_PLT
        jalr  ra, ra, 0
but potentially relaxed to:
        jal   ra, 0                     // with relocation R_RISCV_JAL
when the PLT is within the 1 MiB reach of the JAL (see RISC-V ABIs Specification version 1.0).
The PLT target of the above AUIPC/JALR or JAL is a 16‑byte stub with the following contents:
1:      auipc t3, %pcrel_hi(doit@.got.plt)
        ld    t3, %pcrel_lo(1b)(t3)
        jalr  t1, t3
        nop

As seen above, the external variable reference is three instructions initially (and subsequently just one, as long as the link is held in the register). The SecureRISC ABI generally requires only two instructions for the first reference.

Also as seen above, the external procedure call is 4-5 instructions with two changes of instruction fetch (two BTB entries), one in the body and one in the PLT. If there are multiple calls to doit in the library, the PLT entry is shared by all the calls. When the number of frequent calls to doit is N, then N+1 BTB entries are required (N from the body, 1 from the PLT). The SecureRISC ABI requires 2 instructions and N BTB entries, which is not significantly different from N+1 for large N, but for N=1 represents half the BTB entries.

The typical POSIX ABI, such as the RISC‑V ABI, is based on the C/C++ notion that all functions are top level. Other languages allow function nesting, which is typically implemented by making function variables two pointers: a pointer to the code to call, and a context pointer specifying the stack frame of the function’s parent, which is used when referencing the parent’s local variables. The SecureRISC ABI proposal is to adopt the idea that all functions are specified by a code and context pointer pair, where the context for top-level functions is a pointer to their global variables and function links.

One of the consequences of the proposed SecureRISC ABI is that copy-on-write is not required. An operating system that implements copy-on-write could use it (the context pointer would point to the data section of the mapped file), but it might avoid copy-on-write by copying the mapped file’s data template to a data segment with read and write permission, which allows page table sharing for the mapped file.

Another consequence is that the method of access to globals and links is the same in both the main application code and dynamically loaded shared libraries. In RISC‑V and other ABIs, the application code typically references global variables via the Global Pointer (gp) register, but with PC-relative references in shared libraries. For SecureRISC, each shared library has a register (the context pointer) for addressing its top-level data.

The C++ function above could be implemented on SecureRISC as follows (with shading used to highlight the basic blocks):
count_doit:
    bb      %prcall,%icall|%nohist
        entryi  sp, sp, 32              // allocate stack frame
        sadi    sp, sp, 0               // save return address
        sadi    a10, sp, 16             // save a10
        mova    a10, a1                 // move context to a10
        lai     a2, a10, count_link     // load count link
        lsi     s0, a2, 0               // load count
        addsi   s1, s0, 1               // increment
        ssi     s1, a2, 0               // store count
        ljmpi   a0, a10, doit_link+0    // doit code pointer
        lai     a1, a10, doit_link+8    // doit context pointer
    bb      %preturn,%return
        ladi    a10, sp, 16             // restore saved register
        ladi    sp, sp, 0               // restore stack pointer

Note that the LJMPI is a load instruction that checks the call prediction performed by the fetch engine when the BB descriptor at the start of count_doit is processed; it does not end the basic block.

Bounds Checking

Various checks are performed on all load and store instructions:

  1. The tag of the base register AR[a] is checked for a valid pointer tag (range 0..239):
    tag ← AR[a][71..64]
    if tag ≥ 240 then exception
  2. The offset from the base address is calculated. This is typically done in one of three ways: Unsigned overflow during the shift raises an exception.
  3. The offset is compared to the size field of AR[a] and an exception is raised if the offset equals or exceeds the size (0³ denotes three zero bits prepended to the 61‑bit size):
    trap if offset ≥ 0³ ∥ AR[a][132..72]
  4. The effective address is computed: ea ← AR[a][63..0] + offset
    Unsigned overflow during the addition raises an exception.
  5. The segment size cache is accessed for AR[a][63..48] to determine the segment size, ssize, and this is checked:
    if AR[a][63..ssize] ≠ ea[63..ssize] then exception
  6. The effective ring for the load or store is calculated:
    earing ← tag ≥ 224 | tag[2..0] ≥ PC.ring ? PC.ring : tag[2..0]
  7. The data translation cache(s) are used to translate the lvaddr ea to a siaddr and Read, Write, Execute permissions and ring brackets. These are checked as described in the virtual memory section using the access type and earing.
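Steps 1 and 3 through 6 can be collected into a small Python model, which may make the order of the checks easier to follow. This is a sketch under stated assumptions rather than a definition: the AR is modeled as a Python integer using the field positions given for the AR register file (address in bits 63..0, tag in bits 71..64, size in bits 132..72), ssize is passed in rather than looked up in a segment size cache, the effective ring expression is read as (tag ≥ 224 or tag ring ≥ PC.ring) ? PC.ring : tag ring, and the names AccessFault, make_ar, check_access, and faults are hypothetical. Offset calculation (step 2) and translation (step 7) are omitted.

```python
MASK64 = (1 << 64) - 1


class AccessFault(Exception):
    pass


def make_ar(address, tag, size):
    """Hypothetical helper packing address, tag, and size into an AR value."""
    return address | tag << 64 | size << 72


def check_access(ar, offset, pc_ring, ssize):
    """Return (effective address, effective ring) or raise AccessFault."""
    tag = (ar >> 64) & 0xFF
    if tag >= 240:                                   # step 1: pointer tag
        raise AccessFault("base register does not hold a pointer")
    size = (ar >> 72) & ((1 << 61) - 1)
    if offset >= size:                               # step 3: size check
        raise AccessFault("offset outside object")
    base = ar & MASK64
    ea = base + offset
    if ea > MASK64:                                  # step 4: overflow
        raise AccessFault("effective address overflow")
    if (base >> ssize) != (ea >> ssize):             # step 5: same segment
        raise AccessFault("effective address leaves segment")
    tag_ring = tag & 0x7                             # step 6: effective ring
    earing = pc_ring if (tag >= 224 or tag_ring >= pc_ring) else tag_ring
    return ea, earing


def faults(ar, offset, pc_ring, ssize):
    """Convenience wrapper: True if the access would raise an exception."""
    try:
        check_access(ar, offset, pc_ring, ssize)
    except AccessFault:
        return True
    return False
```

Under this reading, a pointer whose tag names ring 2 is accessed with effective ring 2 from ring 3 code, but with effective ring 1 from ring 1 code, so a pointer never raises the privilege of the access above PC.ring.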

Memory Model

SecureRISC, as originally conceived, was simply going to specify its memory model as Release Consistency, but after encountering RISC‑V, it seemed wise to look at what had been done there for memory model specification, so this is on hold. This section will be expanded when the memory model is defined.

Instruction Set

The following overview is meant to give a general framework to help the reader appreciate the details presented subsequently.

The SecureRISC Instruction Set is designed around six register files, two intended for use early in the pipeline, and four later in the pipeline. While some implementations may not have an early/late distinction, they are described this way here to indicate the possibility of such a split.

Name(s) Description Comment
Early Pipeline These instructions have at most three register operands, at most two sources and one destination except stores, which have up to three register sources, but never more than two sources from a given register file.
Operations are grouped into classes represented by schemas for conciseness in instruction exposition:
Early Pipeline Class Operation
Address comparison ac
Index arithmetic xa
Index bitwise logical xl
Index comparison xc
AR/AV Address Registers Used as base address for load and store instructions, where the effective address calculation is either
AR[a] + (XR[a]<<sa)
where sa is 0..3, or
AR[a] + immediate.
The single-bit AVs are valid bits for speculative load validity propagation.
XR/XV Index Registers Used for integer calculations related to addressing. Often the general non-memory format is:
XR[d] ← XR[a] xa XR[b]
where xa is a fairly simple operation (e.g. + or <<u).
The single-bit XVs are valid bits for speculative load validity propagation.
Late Pipeline These instructions have up to three sources and one destination. SecureRISC makes use of the three source operands more than other ISAs. Often the general format is:
RF[d] ← RF[c] accop (RF[a] op RF[b])
where accop is an accumulation operation (e.g. +) and op is a more general operation (e.g. ×).
Operations are grouped into classes represented by schemas for conciseness in instruction exposition, and most classes have an associated accumulation operation schema:
Late Pipeline Class Operation Accumulation
Boolean operation bo ba
Integer comparison ic ba
Integer arithmetic io ia
Bitwise logical lo la
Floating-point arithmetic fo fa
Floating-point comparison fc ba
BR/BV Boolean Registers Used for comparisons, selection, and branches on scalar registers.
SR/SV Scalar Registers Used for both integer and floating-point scalar calculations. Not associated with address calculation.
VR Vector Registers Used for both integer and floating-point vector and matrix calculations. VRs have no associated valid bits and are typically not renamed.
VM Mask Registers Used for both integer and floating-point vector masking. VMs have no associated valid bits and are typically not renamed.
MA Matrix Accumulators Used to accumulate the outer product of two vectors.

Processor State

Access to the state of more privileged rings is prohibited. For example, attempting to read or write CSP[ring] when ring > PC.ring causes an exception. Unprivileged state (e.g. the register files) may be accessed by any ring.

In the table below, the Type field values CSR, RCSR (per-ring CSRs), and ICSR (indirect CSRs) are described in Control and Status Register Operations. The type R is used for a simple register, RF for a Register File, VRF for a Vector Register File, and MRF for a Matrix Accumulator.

The user process state includes:

Name Type Depth Width Read ports Write ports Description
PC R 1 3 + 61 + 5 The Program Counter holds the current ring number (3 bits), Basic Block descriptor address (61 bits, word aligned), and 5‑bit offset into the basic block of the next instruction. The 5‑bit offset is only visible on exceptions and interrupts. When compressed BBDs (tag 253) are defined, this will be 62 bits and 32‑bit aligned.
CSP RCSR 8 3 + 61 The Call Stack Pointer holds the ring number and address of the return address stack maintained by call and return basic blocks, and by exceptions and interrupts. This is not the same as the Program Stack Pointer, which is held in an AR designated by the Software ABI. There is one CSP per ring.
ThreadData RCSR 8 72 Thread Data is a per-ring storage location for a pointer to Thread-Local Storage (TLS). Functions that require access to per-thread data typically move this to an AR. It is also typically used in cross-ring exception handlers to save and restore the registers that ring requires to handle exceptions.
ExceptionHandler RCSR 8 3 + 61 ExceptionHandler[ring] holds the ring number and address to which the processor redirects execution on an exception for that ring.
InstructionCount RCSR 8 64 InstructionCount[ring] holds the count of executed instructions in each ring.
BBCount RCSR 8 64 BBCount[ring] holds the count of executed Basic Blocks in each ring.
ICountIntr RCSR 8 64 The ICount bit in InterruptPending[PC.ring] is set when (InstructionCount[PC.ring] − ICountIntr[PC.ring]) > 0. This may be used for single stepping.
CycleCount RCSR 8 64 CycleCount[ring] holds the number of cycles executed by ring.
CCountIntr RCSR 8 64 The CCount bit in InterruptPending[PC.ring] is set when (CycleCount[PC.ring] − CCountIntr[PC.ring]) > 0.
InterruptEnable RCSR 8 64 InterruptEnable[ring] holds interrupt enable bits for each ring. Interrupts for each ring are distinct. Application rings are expected to use the interrupts for inter-process communication. Supervisor and hypervisor rings will also use interrupts for communication with I/O devices.
InterruptPending RCSR 8 64 InterruptPending[ring] holds interrupt pending bits for each ring.
AccessRights RCSR 8 12 AccessRights[ring] holds the current Mandatory Access Control Set per ring. It is writeable only by ring 7. These rights are tested against the MAC level of svaddr regions specified in the Region Protection Table and potentially by the System Interconnect.
ALLOWQOS RCSR 8 6 ALLOWQOS[ring] holds the minimum value that may be written to QOS by a ring. Rings may not write values to ALLOWQOS[ring] less than ALLOWQOS[PC.ring].
QOS RCSR 8 6 QOS[ring] holds the current Quality of Service (QoS) identifier per ring. QoS identifiers are used on system interconnect transactions for prioritization. Rings may only set QOS to values allowed by ALLOWQOS[PC.ring]. Attempts to write smaller values trap.
KEYSET XCSR 1 16 This register is writeable only by ring 7, and specifies which encryption key indexes are currently usable. A reference to a disabled key in the ENC field of the Region Descriptor Table causes an exception. This allows ring 7 to partition the system based on which encryption keys are usable.
ENCRYPTION ICSR 15 8 + 256 These registers are readable and writeable only by ring 7, and provide the 8‑bit encryption algorithm and 256‑bit encryption key for main memory encryption as specified in Region Descriptor as an index into this array. The encryption algorithm and key are selected by the ENC of the Region Descriptor Table, with 0 being hardwired to no-encryption. Up to 15 pairs may be specified, but some implementations may support a smaller number. This is further defined in Memory Encryption below.
AMATCH ICSR 8 64 + 128 These registers are described in Virtual Address Restriction.
AR RF 16 144 2 1 The Address Register file holds pointers and integers to perform calculations related to control flow and to load and store address generation. No AR is hardwired to 0. Bits 63..0 are address or data (bits 135..133 are the ring number if address), bits 71..64 are the tag, and bits 132..72 are the size expanded from the tag, or as written by the WSIZE instruction, and bits 143..136 are used for the expected memory tag for cliqued pointers, or are the value 251 for other pointers.
In some microarchitectures, operations on ARs are executed in the early pipeline, either speculatively or non-speculatively. (Late pipeline operations may be queued until non-speculative or may be speculatively executed as well.)
Most instructions that read ARs read only AR[a]. When two ARs are read, it is sometimes using the b field and sometimes the c field (AR stores read AR[c] and a few branches and SUBXAA read AR[b]). The b/c multiplexing can be done during instruction decode.
The assembler designation for individual ARs is by the names a0, a1, …, a15.
AV RF 16 1 1 1 The Address Register Valid file holds valid bits from speculative loads and propagation therefrom.
XR RF 16 72 2 1 The Index Register file holds integers to perform calculations related to control flow and to load and store address generation. No XR is hardwired to 0. Bits 63..0 are data and bits 71..64 are the tag. The XR primarily holds integer-tagged data, but other tags may be loaded.
In some microarchitectures, operations on XRs are executed in the early pipeline, either speculatively or non-speculatively. (Late pipeline operations may be queued until non-speculative or may be speculatively executed as well.) The XR register file requires two read ports and one write port per instruction.
Most instructions that read XRs read XR[a] and XR[b], but XR stores read XR[b] and XR[c]. The b/c multiplexing can be done during instruction decode.
The assembler designation for individual XRs is by the names x0, x1, …, x15.
XV RF 16 1 2 1 The Index Register Valid file holds valid bits from speculative loads and propagation therefrom.
SR RF 16 72 3 1 The Scalar Register file holds data for computations not involved in address generation and primarily holds integer or floating-point values. Tags are stored, and so SRs may be used for copying arbitrary data, including pointers, but no instruction uses SRs as an address (e.g. base) register. Integer operations check for integer tags, and floating-point operations check for float tags. No SR is hardwired to 0.
In some microarchitectures, operations on SRs occur later in the pipeline than operations on ARs, separated by a queue, allowing these operations to wait for data cache misses while the AR engine continues to move ahead generating addresses. When multiple functional units operate in parallel, only some will support 3 source operands, with the others only two. The most important instructions with three SR source operands are multiply/add (both integer and floating-point), and funnel shifts.
The three SR read ports handle the a, b, and c register specifier fields, with writes specified by the d register field. SRs are late pipeline state.
The assembler designation for individual SRs is by the names s0, s1, …, s15.
SV RF 16 1 3 1 The Scalar Register Valid file holds valid bits from speculative loads and propagation therefrom. SVs are late pipeline state.
BR RF 16 1 3 1 Boolean Registers hold 0/False or 1/True, such as the result of comparisons and logical operations on other Boolean values. BRs are typically used to hold SR register comparisons and may avoid branch prediction misses in some algorithms. BR[0] is hardwired to 0. Attempts to write 1 to BR[0] trap, which converts such instructions into negative assertions. BRs are late pipeline state.
The assembler designation for individual BRs is by the names b0, b1, …, b15.
BV RF 16 1 3 1 The Boolean Register Valid file holds valid bit propagation from speculative loads (primarily SR loads). Branches with an invalid BR operand take an exception. BVs are late pipeline state.
CARRY RF 1 64 The CARRY register is used on multiword arithmetic (addition, subtraction, multiplication, division, and carryless multiplication). See below.
Consider expansion of CARRY to a 4-entry register file (c0 to c3). CARRY is late pipeline state.
VL RF 4 64 The Vector Length registers specify the length of vector loads, stores, and operations. VLs are late pipeline state. The outer product instructions use an even/odd pair of vector lengths to specify the number of rows and columns of the product.
VSTART SCSR 1 7 The Vector Start register is used to restart vector operations after exceptions. Details to follow. VSTART is late pipeline state.
VM RF 16 128 3 1 The Vector Mask register file stores a bit mask for elements of vector operations. VM[0] is hardwired to all 1s and is used for unmasked operations.
Only VM[0] to VM[3] may be specified for masking vector operations in 32-bit instructions. VM[4] to VM[15] are available for vector comparison results and Boolean operations and in 48‑bit and 64‑bit formats. VMs are late pipeline state.
The assembler designation for individual VMs is by the names vm0, vm1, …, vm15.
VR VRF 16 72 × 128 3 1 Vector Registers hold vectors of tagged data, typically integers or floating-point data. (There are no speculative loads for the VRs and no associated valid bits. Vector operations with an invalid non-vector operand take an exception.) VRs are late pipeline state.
The assembler designation for individual VRs is by the names v0, v1, …, v15.
MA MRF 4 20 × 64×64
or 64 × 32×32
1 1 Matrix accumulators hold matrixes of untagged data, typically integers or floating-point data and are used to accumulate the outer product of two vectors. (There are no speculative loads for the MAs and no associated valid bits. Matrix operations with an invalid non-vector operand take an exception.) MAs are late pipeline state.
The assembler designation for individual MAs is by the names m0, m1, m2, and m3.

The SR register file must support 3 read and 1 write port per instruction for floating-point multiply/add instructions at least. Since it does, other operations on SRs may take advantage of the third source operand.

Basic Block Descriptor Words

Basic Block Descriptor
71 64 63 58 57 47 46 38 37 34 33 29 28 13 12 11 10 9 0
252 hint targr targl next prev start s c offset
8 6 11 9 4 5 16 1 2 10
Fields of BB Descriptors
Field Width Bits Description
offset 10 9:0 Instruction offset in bage for this BB
c 2 11:10 LOOPX present
0 ⇒ no LOOPX
1 ⇒ LOOPX present
2..3 ⇒ Reserved (possible use for nested loops)
s 1 12 Instruction size restriction:
0 ⇒ 16, 32, 48, and 64 bit instructions
1 ⇒ 32 and 64 bit instructions only
start 16 28:13 Instruction start mask (interpretation depends on s field)
prev 5 33:29 Mask of things targeting this BB for CFI checking
next 4 37:34 BB type / exit method
targl 9 46:38 Target BB offset in bage (virtual address bits 11:3)
targr 11 57:47 Target BB bage relative to this bage (±1024 4 KiB bages)
hint 6 63:58 Prediction hints specific to BB type
252 8 71:64 BB Descriptor Tag

Basic block descriptors are words with tags 252..253 aligned on a word boundary. The basic block descriptor points to the instructions and gives the details of the control transfers to successor basic blocks. (Tag 253 is reserved for future use, most likely for compressed descriptors.)
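The field table above can be turned into a small pack/unpack helper. The following is only an illustrative sketch: the names BBD_FIELDS, unpack_bbd, and pack_bbd are hypothetical, and the 72‑bit word (8‑bit tag plus 64 data bits) is modeled as a plain Python integer.

```python
# Field layout copied from the "Fields of BB Descriptors" table:
# name -> (lowest bit position, width in bits).
BBD_FIELDS = {
    "offset": (0, 10),
    "c": (10, 2),
    "s": (12, 1),
    "start": (13, 16),
    "prev": (29, 5),
    "next": (34, 4),
    "targl": (38, 9),
    "targr": (47, 11),
    "hint": (58, 6),
    "tag": (64, 8),
}


def unpack_bbd(word):
    """Split a 72-bit descriptor word (a Python int) into its fields."""
    return {name: (word >> low) & ((1 << width) - 1)
            for name, (low, width) in BBD_FIELDS.items()}


def pack_bbd(**fields):
    """Inverse of unpack_bbd; fields that are not given default to 0."""
    word = 0
    for name, value in fields.items():
        low, width = BBD_FIELDS[name]
        assert 0 <= value < (1 << width), name
        word |= value << low
    return word
```

For example, pack_bbd(tag=252, next=6, offset=5) builds a fall-through descriptor whose instructions begin at 32‑bit location 5 of the bage.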

The s and 16‑bit start fields specify both the size of the basic block and the location of all the instructions in it. If s is set, then all instructions are 32‑bit or 64‑bit; if clear, then 16‑bit and 48‑bit instructions may also be present. For s = 0, each start bit represents a 16‑bit location beginning at offset in the bage, and the BB size can be up to sixteen 16‑bit locations, which could contain eight 32‑bit instructions, sixteen 16‑bit instructions, or an intermediate number of a mixture of the two, or a lesser number if 48‑bit and 64‑bit instructions are included. For s = 1, each start bit represents a 32‑bit location and the BB size can be up to sixteen 32‑bit locations, which could contain sixteen 32‑bit instructions, eight 64‑bit instructions, or an intermediate number of a mixture of the two. If the block is larger than these limits, then it is continued using a fall-through next field. The s = 1 case is intended for floating-point intensive basic blocks, which tend to have few 16‑bit instructions and also tend to be longer.

The start field is a bit mask specifying which locations (2‑byte locations for s = 0) start instructions, which allows parallel instruction decoding to begin as soon as the instruction bytes are read from the instruction cache. For example, sixteen instruction decoders could be fed in a single cycle from a single 8‑word instruction cache line fetch, using the start mask to specify which bytes to decode. The start bit for the first location is implicitly 1 and is not stored. The last 1 bit in the start field represents the position after the last instruction. Thus, the number of instructions is the number of 1 bits in the start field (if no bits are set, then there are no instructions). If the last instruction ends before a 32‑bit boundary, the last 16 bits should be filled with an illegal instruction.
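As a rough illustration of how the start mask could be decoded, the sketch below lists the byte offset and length of every instruction in a block. It rests on an assumption the text does not spell out: bit i of start (bit 0 being the least significant) is taken to mark the boundary at location i + 1, with the implicit first start at location 0. The function name decode_start is hypothetical.

```python
def decode_start(start, s):
    """Return (byte_offset, byte_length) for each instruction in the block.

    Assumption: bit i of `start` (bit 0 = least significant) marks the
    boundary at location i + 1, where a location is 2 bytes for s = 0 and
    4 bytes for s = 1. The boundary at location 0 is the implicit first
    start; the last set bit marks the position after the last instruction.
    """
    unit = 4 if s else 2
    boundaries = [0] + [i + 1 for i in range(16) if (start >> i) & 1]
    return [(lo * unit, (hi - lo) * unit)
            for lo, hi in zip(boundaries, boundaries[1:])]
```

Under that assumption, a block holding a 32‑bit, a 16‑bit, and a 48‑bit instruction with s = 0 has boundaries at locations 2, 3, and 6, so start = 0b100110, and the number of decoded instructions equals the number of 1 bits in the mask.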

To increase locality and keep pointers short, SecureRISC stores basic block descriptors and instructions in 4 KiB regions of the address space (called bages), with the basic block descriptors in one half and the instructions in the other half (the compiler should alternate the half used for even and odd bages to minimize set conflicts). This allows the pointer from the descriptor to 32‑bit aligned instructions to be only 10 bits, and in a paged system, the same TLB entry maps both the descriptors and instructions (since bage size ≤ page size), so only the BB engine requires a TLB (its translations are simply forwarded to the instruction fetch engine). The instructions are fetched from
$$\mathrm{PC}_{63..12} \parallel \mathit{offset} \parallel 0^{2}$$
in the L1 instruction cache in parallel with the BB engine moving to fetch the next BB descriptor. For non-indirect branches and calls, the target is given by an 11‑bit signed relative 4 KiB delta from the current bage and a 9‑bit unsigned 8‑byte aligned descriptor address within that bage. Specifically
$$\mathrm{TargetPC} \leftarrow \mathrm{PC}_{66..64} \parallel (\mathrm{PC}_{63..12} +_p (\mathit{targr}_{10}^{41} \parallel \mathit{targr})) \parallel \mathit{targl} \parallel 0^{3}.$$
(Note: the name targr is short for target relative and targl is short for target low.)
For indirect branches and calls, the targr/targl fields may be used as a hint or default.
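The target computation for non-indirect branches and calls can be sketched as follows. This is an illustration only: it reads the formula as a sign extension of the 11‑bit targr added to the 52‑bit bage number, and it assumes the pointer addition simply wraps modulo 2^52, which the text does not state. The function name target_pc is hypothetical.

```python
BAGE_MASK = (1 << 52) - 1  # bits 63..12 of the PC


def target_pc(pc, targr, targl):
    """Compute the target descriptor address from a 67-bit PC (ring in bits 66..64)."""
    ring = (pc >> 64) & 0x7
    bage = (pc >> 12) & BAGE_MASK
    # Sign-extend the 11-bit targr field (bit 10 is the sign bit).
    delta = targr - (1 << 11) if targr & (1 << 10) else targr
    # Assumption: the pointer add wraps; real hardware might trap instead.
    new_bage = (bage + delta) & BAGE_MASK
    # targl selects an 8-byte aligned descriptor within the target bage.
    return (ring << 64) | (new_bage << 12) | (targl << 3)
```

With targr = 0 the bage is unchanged, which is the case the text singles out for reusing the current TLB translation.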

Instructions are stored in the bage with tag 240, which may be helpful when code reads and writes instructions in memory. A future option may be to use tags 240..243 to provide two more bits for instruction encoding per word, or one bit per 32‑bit location. Using 16 tags would provide four more bits per word, or one bit per 16‑bit location.

The low targl field is sufficient to index a set-associative BB descriptor cache that uses bits 11..3 (or a subset) as a set index without waiting for the targr addition giving the high bits. As an example, a 32 KiB, 8‑way set-associative BB descriptor cache could read the tags in parallel with completing the addition giving the high address bits for tag comparison. If the minimum page size can be increased, then the number of bits allocated to the targl and offset fields might be increased and the bits to targr decreased; the current values were chosen for a minimum page size of 4 KiB, which encourages a bage size of 4 KiB to match. When targr = 0, the TLB translation for the current BB remains valid, and energy can be saved by detecting this case.

For even bages, it is recommended that BB descriptors start at the beginning of a bage, and instructions start on a 64‑byte boundary in the bage. Any full word padding between the last BB descriptor and the first instruction would use an illegal tag. For odd bages, BB descriptors would be packed at the end starting on a 64‑byte boundary and the instructions start at the beginning. Intermixing BB descriptors and instructions is possible but is not ideal for prefetch or cache utilization.

A non-zero c field (assembler %loopx) indicates that the BB contains a LOOPX/LOOPXI instruction, and therefore the BB engine should initialize its iteration count to zero and should predict the count until the AR engine executes the LOOPX and sends the actual loop count value back. If no prediction is available, $2^{64}-1$ may be used. Often the AR engine does so before the final iteration, and the loop is predicted precisely even if this default loop count prediction is used. The iteration count increments when the next field contains a loop back or conditional loop back, and these are predicted as taken based on the iteration count being unequal to the predicted or actual loop count.

The next field specifies how the next basic block after this one is selected. It is sufficient to enable branch prediction, jump prediction, return address prediction, loop back prediction, etc. to occur without seeing the instructions involved in the basic block. Its values are described in Basic Block Descriptor Types in the subsequent section.

The prev field is used for Control Flow Integrity (CFI) checking and to implement the gates for calls to more privileged rings. It too is described in Basic Block Descriptor Types in the subsequent section.

The hint field will be defined in the future for prediction hints specific to each next field value. For example, conditional branches will use the hint field with a taken/not-taken initial value for prediction, a hysteresis bit (strong/weak), and an encoded 4‑bit history length (8, 11, 15, …, 1024) indication of what global history is most likely to be useful in prediction. Similarly indirect jumps and calls may have hints appropriate to their prediction. More hint bits would be nice to have, for example to encode Whisper’s Boolean function.

Note: I expect to use tag 253 for packing multiple Basic Block Descriptors in a single word. However, the details of this would probably be driven by statistics gathered once a compiler is generating the unpacked descriptors. This is expected to be limited to the BBDs that are internal to functions (simply branching).

Basic Block Descriptor Types

The next field of the BB descriptor is used to specify how the successor to the current BB is determined. The values are given in the following table:

Value Description
0 Unconditional branch (%ubranch): The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described above. There should be no branch or jump instructions in the basic block, as there is no prediction to check.
1 Conditional branch (%cbranch): The branch predictor is used to determine whether this branch is taken or not, and this prediction is checked against the branch decision given by a branch instruction in the basic block. There should be exactly one conditional branch instruction, which may be located anywhere in the basic block instructions. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described above or is the fall-through BB descriptor at PC + 8.
2 Call (%rcall): The address PC + 8 is written to the word pointed to by CSP[TargetPC66..64], and CSP[TargetPC66..64] is incremented by 8. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described above. There should be no branch or jump instructions in the basic block, as there is no prediction to check.
3 Conditional Call (%crcall): The branch predictor is used to determine whether this call is taken or not, and this prediction is checked against the branch decision given by a branch instruction in the basic block. There is no instruction for the call itself in the basic block, as this is not predicted. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described above or is the fall-through BB descriptor at PC + 8. In the case where the call is taken, the address PC + 8 is written to the word pointed to by CSP[TargetPC66..64], and CSP[TargetPC66..64] is incremented by 8.
4 Loop back (%loop): The predicted loop iteration count is used to predict whether this loop is taken or not, and this prediction is checked by the SOBX instruction in the instructions of the basic block. There should be exactly one SOBX, which may be located anywhere in the basic block instructions. There should be no other branch or jump instructions in the basic block. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described above or is the fall-through BB descriptor at PC + 8.
5 Conditional Loop back (%cloop): The branch predictor is used to determine whether this loop back is taken or not, and this prediction is checked against the branch decision given by a branch instruction in the basic block. If the loop back is enabled by the branch, the predicted loop iteration count is used to determine whether this loop is taken or not, and this prediction is checked by the SOBX instruction in the instructions of the basic block. There should be exactly one SOBX, which may be located anywhere in the basic block instructions, and exactly one conditional branch instruction, but no jump instructions. The destination BB descriptor address is computed from the targr/targl fields of the descriptor as described above or is the fall-through BB descriptor at PC + 8.
6 Fall through (%fallthrough): This Basic Block is unconditionally followed by the BB at PC + 8.
The targr/targl/start fields are not required for fall-through, so instead they may be used for prefetch. The targr/targl fields would then specify the first of several lines to prefetch into the BB Descriptor Cache (BBDC). The three least-significant bits of the targl field are not needed to specify a line in the BBDC, and are instead a sub-type:
0 No prefetch suggested
1 PC-relative prefetch suggested of one or more lines in the BBDC starting at
$$\mathrm{PC}_{66..64} \parallel (\mathrm{PC}_{63..12} +_p (\mathit{targr}_{10}^{41} \parallel \mathit{targr})) \parallel \mathit{targl}_{8..3} \parallel 0^{6}$$
The hint field would specify a bitmask of the lines to be prefetched subsequent to the designated line. For example, this could be used on entry to a function to indicate up to 6 hot lines after the specified one to prefetch, which allows up to 56 basic block descriptors of the function to be prefetched.
2..7 Reserved
BBDC prefetch might be queued for cycles when the BBDC is not being accessed or the tags might be dual-ported to allow parallel tag checks.
7 Reserved.
8 Jump Indirect (%ijump): The indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the JMPA/LJMP/LJMPI/SWITCHX/etc. instructions in the instructions of the basic block. There should be exactly one jump, which may be located anywhere in the basic block instructions, but no conditional branches.
The targr/targl may be used as a hint for the most likely destination when hint0 is set, but this will be generally unknown at compile-time. Micro-architectures may choose to store their own hint in this field of the BBDC.
9 Conditional Jump Indirect (%cijump): The branch predictor is used to determine whether this jump indirect is taken or not, and this prediction is checked against the branch decision given by a branch instruction in the basic block. If the jump indirect is enabled by the branch, the indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the JMPA/LJMP/LJMPI/SWITCHX/etc. instruction in the instructions of the basic block. There should be exactly one jump and exactly one conditional branch, each of which may be located anywhere in the basic block instructions. In the case where the jump is not taken, the destination is the fall-through BB descriptor at PC + 8.
This type is expected to be used for case dispatch, where the conditional test checks whether the value is within range, and the JMPA/LJMP/LJMPI/SWITCHX uses PC ← PC + (XR[b] × 8) to choose one of several dispatch basic block descriptors, presuming that the BBs fit in the same 4 KiB bage (if not then a table and PC ← lvload72(AR[a] + XR[b]) should be used).
The targr/targl may be used as a hint for the most likely destination when hint0 is set, but this will be generally unknown at compile-time. Micro-architectures may choose to store their own hint in this field of the BBDC.
10 Call Indirect (%icall): The indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the LJMP/LJMPI instruction in the instructions of the basic block. There should be exactly one LJMP/LJMPI, which may be located anywhere in the basic block instructions, but no conditional branch instructions. The address PC + 8 is written to the word pointed to by CSP[TargetPC66..64], and CSP[TargetPC66..64] is incremented by 8.
The targr/targl may be used as a hint for the most likely destination when hint0 is set, but this will be generally unknown at compile-time. Micro-architectures may choose to store their own hint in this field of the BBDC.
11 Conditional Call Indirect (%cicall): The branch predictor is used to determine whether this call indirect is taken or not, and this prediction is checked against the branch decision given by a branch instruction in the basic block. If the call indirect is enabled by the branch, the indirect jump predictor is used to predict the destination BB descriptor address, and this prediction is checked by the JMPA/LJMP/LJMPI/etc. instruction in the instructions of the basic block. There should be exactly one jump, which may be located anywhere in the basic block instructions, and exactly one conditional branch. In the case where the call is not taken, the destination is the fall-through BB descriptor at PC + 8. In the case where the call is taken, the address PC + 8 is written to the word pointed to by CSP[TargetPC66..64], and CSP[TargetPC66..64] is incremented by 8.
The targr/targl may be used as a hint for the most likely destination when hint0 is set, but this will be generally unknown at compile-time. Micro-architectures may choose to store their own hint in this field of the BBDC.
12 Return (%return): The Call Stack cache is used to predict the return using CSP[PC66..64] − 8 as the index and CSP[PC66..64] is decremented by 8.
The targr/targl may be used as a hint for the most likely destination when hint0 is set, but this will be generally unknown at compile-time. Micro-architectures may choose to store their own hint in this field of the BBDC.
It may be desirable to encode Exception Return with this BB type. hint1 might be used to distinguish this case.
13 Conditional return (%creturn): This is probably only useful in leaf functions without a stack frame, unless register windows are added.
14 Reserved.
15 Reserved.

The prev field of the BB descriptor is used to specify what methods are allowed to get to this BB for Control Flow Integrity (CFI) checking. It is a set of bits, with the least significant bits of prev controlling interpretation of the more significant bits as follows:

prev0 = 1
Bit Description Assembler
1 Fall through to this BB allowed %pfallthrough
2 Branch/Loopback to this BB allowed %pbranch
3 Jump to this BB allowed (for case dispatch) %pswitch
4 Return to this BB allowed %preturn
prev1..0 = 2
Bit Description
2 Call relative allowed %prcall
3 Call indirect allowed %picall
4 Gate allowed %pgate
prev2..0 = 4
Bits 4..3 Description
0 Cross-ring Exception Entry %pxrexc
1 Same-ring Exception Entry %psrexc
2 Reserved
3 Reserved

Call/Return Details

Basic Block descriptors with one of the four call types (Call, Conditional Call, Call Indirect, Conditional Call Indirect) push the return address on a protected stack addressed by the CSP indexed by the target ring number (which is the same as the current ring number unless a gate is addressed). Returns pop the address from the protected stack and jump to it. The ring number of the CSP pointer is used for the stores and loads, and typically this ring is not writeable by the current ring.
The call semantics are as follows:
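$$\mathrm{lvstore72}(\mathrm{CSP}[\mathrm{TargetPC}_{66..64}]) \leftarrow \mathrm{PC} + 8$$
$$\mathrm{CSP}[\mathrm{TargetPC}_{66..64}] \leftarrow \mathrm{CSP}[\mathrm{TargetPC}_{66..64}] +_p 8$$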
The return semantics are as follows:
$$\mathrm{PC} \leftarrow \mathrm{lvload72}(\mathrm{CSP}[\mathrm{PC}_{66..64}] -_p 8)$$
$$\mathrm{CSP}[\mathrm{PC}_{66..64}] \leftarrow \mathrm{CSP}[\mathrm{PC}_{66..64}] -_p 8$$
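A toy model of the protected call stack may help make the push and pop order concrete. It follows the descriptor-type table in storing PC + 8 (the fall-through descriptor) on a call, indexes CSP by the target ring on calls and by the current ring on returns, and models memory as a dictionary. The class name CallStack and the choice of eight rings (from the 3‑bit ring field) are assumptions made for this sketch.

```python
class CallStack:
    """Toy model: one call stack pointer per ring, memory as a dict."""

    def __init__(self, bases):
        self.csp = list(bases)  # assumption: 8 rings, from the 3-bit ring field
        self.mem = {}

    def call(self, pc, target):
        ring = (target >> 64) & 0x7          # CSP is indexed by the target ring
        self.mem[self.csp[ring]] = pc + 8    # return to the fall-through descriptor
        self.csp[ring] += 8
        return target

    def ret(self, pc):
        ring = (pc >> 64) & 0x7              # CSP is indexed by the current ring
        self.csp[ring] -= 8
        return self.mem[self.csp[ring]]
```

Nested calls made through this model return in last-in, first-out order, and each ring's stack pointer ends where it started.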

Debugging

Support for debuggers in SecureRISC has yet to be considered and is thus TBD. Instruction count interrupts provide a single-step mechanism, and basic block descriptors may be patched with a 254 tag as a breakpoint mechanism, but some mechanism for debugging ROM code and for setting memory read and write breakpoints is also required. Note that amatch ICSRs could be used for read and write breakpoints, if changed to have finer resolution (e.g. start the NAPOT encoding at bit 7). This, however, might complicate debugging programs with Garbage Collection. Something similar to amatch could be defined on the fetch side for debugging ROM code, e.g. 256 bits per bage to indicate which trap, but probably something much simpler would suffice.

Overflow Checking

Overflow detection is important for implementing bignums in languages such as Lisp. SecureRISC provides a reasonably complete set of such instructions in addition to the usual mod $2^{64}$ add, subtract, negate, multiply, and shift left.

Unsigned overflow could be detected by using the ADDC and SUBC instructions with BR[0] as the carry-in and BR[0] as the carry-out. But it might also make sense to have ADDOU (Add Overflow trapped Unsigned).

In addition, the ADDOS (Add Overflow trapped Signed), ADDOUS (Add Overflow trapped Unsigned Signed), SUBOS (Subtract Overflow trapped Signed), SUBOU (Subtract Overflow trapped Unsigned), SUBOUS (Subtract Overflow trapped Unsigned Signed), and NEGO (Negate Overflow trapped) instructions provide overflow checking for signed addition, subtraction, and negation, and signed-unsigned addition and subtraction. There are also SLLO (Shift Left Logical with Overflow) and SLAO (Shift Left Arithmetic with Overflow) in addition to the usual SLL. Finally, there are MULUO, MULSO, and MULSUO for multiplication with overflow detection.

Overflow in the unsigned addition of load/store effective address generation is trapped. Segment bounds are also checked during effective address generation: the segment size is determined from the base register, and the effective address must agree with the base register for bits 63..size. A special small cache is required for this purpose, but the data portion is only eight bits of the Segment Descriptor Entry (a 6‑bit segment size and a 2‑bit generation).

Comparisons

SecureRISC handles comparisons differently in the early and late pipeline instruction sets. Comparisons on ARs/XRs are done with conditional branch instructions. Comparisons on SRs are done with instructions that write one of the 15 boolean registers (BRs). The boolean registers may be branched on, but also used in selection and logical operations.

For the SRs, SecureRISC has comparisons that produce both true and complement values (e.g. = and ≠ or < and ≥) so that they can be used with b0 as assertions. If b1 were hardwired to 1 and writes of 0 trapped, SecureRISC could have half as many comparisons, but would have to add more accumulation functions and SELI would have to have an inverted form. This would also require more compiler support to track whether Booleans in BRs are inverted or not. For the moment, SecureRISC has more comparisons, but might change.

Floating-Point Comparisons

SecureRISC provides floating-point comparisons that store 0 or 1 to a BR. These comparisons do not trap on NaN operands. The compiler can generate an unordered comparison to b0 to trap before doing the equal, less than, etc. test if traps on NaNs are required.

Branch Avoidance

SecureRISC has trap instructions and Boolean Registers (BRs) primarily as a way to avoid conditional branching for computation. For example, to compute the min of x1 and x3 into x6, the RISC‑V ISA would use conditional branches:

	mv x6, x1
	blt x1, x3, L
	mv x6, x3
L:

The performance of the above on contemporary microarchitectures depends on the conditional branch prediction rate and the mispredict penalty, which in turn depends on how consistently x1 or x3 is the minimum value. In SecureRISC, the sequence could be as follows:

	lts b2, s1, s3
	sels s6, b2, s1, s3

This sequence involves no conditional branches and has consistent performance. (Note: there is actually a minss instruction that would be preferred here, but this illustrates a general point.)

As another example, the range test

	assert ((lo <= x) && (x <= hi));

on RISC‑V would compile to

	blt x, lo, T
	bge hi, x, L
T:
	jal assertionfailed
L:

but on SecureRISC would compile to

	lts b1, x, lo
	orles b0, b1, hi, x

which involves no conditional branches, and instead uses a write to b0 as a negative assertion check (trap if the value to be written is 1). The assembler would also accept torles b1, hi, x as equivalent to the above orles by supplying the b0 destination operand.

Even when conditional branches are used, the Boolean registers sometimes permit several tests to be combined before branching, so if we were branching on the range test above, instead of asserting it, the code might be

	lts b1, x, lo
	borles b1, hi, x, outofrange

which has one branch rather than two.

Tag Checking

Operations on tagged values trap if the tags are unexpected values. Integer addition requires that both tags be integers, or one tag be a pointer type and the other an integer. Integer subtraction requires the subtrahend tag to be an integer tag and the minuend to be either an integer or pointer tag. The resulting tag is integer with all integer sources, or pointer if one operand is a pointer. Integer bitwise logical operations and shifts require integer-tagged operands and produce an integer-tagged result. Floating-point addition, subtraction, multiplication, division, and square root require floating-point tagged operands. To perform integer operations on floating-point tagged values (e.g. to extract the exponent) requires a CAST instruction to first change the tag. Similarly, to perform logical operations on a pointer, a CAST instruction to integer type is required.

Comparisons of tagged values compare the entire word for =, ≠, <u, ≥u, etc. This allows sorting regardless of type. Similarly, the CMPU operation produces −1, 0, or 1 based on <u, =, or >u of the word values.

Multiword Multiplication

The ideal integer multiplication operation would be
$$\mathrm{SR}[e],\mathrm{SR}[d] \leftarrow (\mathrm{SR}[a] \times_u \mathrm{SR}[b]) + \mathrm{SR}[c] + \mathrm{SR}[f]$$
to efficiently support multiword multiplication, but that requires 4 reads and 2 writes, which we clearly don’t want. The chosen alternative is to introduce a 64‑bit CARRY register to provide the additional 64‑bit input to the 128‑bit product and a place to store the high 64 bits of the product as follows:
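$$p \leftarrow \mathrm{SR}[c] + (\mathrm{SR}[a] \times_u \mathrm{SR}[b]) + \mathrm{CARRY}$$
$$\mathrm{SR}[d] \leftarrow p_{63..0}$$
$$\mathrm{CARRY} \leftarrow p_{127..64}$$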

The CARRY register is potentially awkward for OoO microarchitectures. The simplest option is to rename it to a small register file (e.g. 4 or 8‑entry) in the multiword arithmetic unit. It is also possible that even an OoO processor will be called on to have a subset of instructions that are to be executed in-order relative to each other, and the multiword arithmetic instructions can be put in this queue.
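To show how the CARRY-based multiply chains across words, here is a minimal Python sketch of schoolbook multiword multiplication built on the operation defined above. The function names mul_carry and multiword_mul are hypothetical (the text does not name the instruction here), and words are modeled as little-endian lists of 64‑bit integers.

```python
MASK64 = (1 << 64) - 1


def mul_carry(a, b, c, carry):
    """p = c + a*b + carry; returns (low 64 bits, new CARRY = high 64 bits)."""
    p = c + a * b + carry  # never exceeds 2**128 - 1 for 64-bit inputs
    return p & MASK64, p >> 64


def multiword_mul(x, y):
    """Multiply two little-endian lists of 64-bit words."""
    r = [0] * (len(x) + len(y))
    for j, yj in enumerate(y):
        carry = 0  # CARRY is cleared at the start of each row
        for i, xi in enumerate(x):
            r[i + j], carry = mul_carry(xi, yj, r[i + j], carry)
        r[j + len(x)] = carry  # final CARRY becomes the next result word
    return r
```

The inner loop needs one multiply-accumulate per word with no separate add-with-carry step, because the accumulator word, the product, and CARRY together always fit in 128 bits.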

Multiword Division

The ideal integer division operation would be
$$\mathrm{SR}[e],\mathrm{SR}[d] \leftarrow (\mathrm{SR}[c] \parallel \mathrm{SR}[a]) \div_u \mathrm{SR}[b]$$
to efficiently support multiword division, but that requires 3 reads and 2 writes for quotient and remainder, which we clearly don’t want. As with multiplication, the alternative is to use the proposed 64‑bit CARRY register to provide the additional 64‑bit input to form the 128‑bit dividend and a place to store the remainder. The remainder of the previous division then naturally becomes the high bits of the current division. Thus the definition of DIVC is:
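$$q, r \leftarrow (\mathrm{CARRY} \parallel \mathrm{SR}[a]_{63..0}) \div_u \mathrm{SR}[b]_{63..0}$$
$$\mathrm{CARRY} \leftarrow r$$
$$\mathrm{SR}[d] \leftarrow 240 \parallel q$$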
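The way the remainder feeds the next division can be sketched as a long division of a multiword value by a single word, working from the most significant word down. This is an illustration only: the function names div_carry and multiword_divmod are hypothetical, tags are ignored, and words are little-endian lists of 64‑bit integers.

```python
MASK64 = (1 << 64) - 1


def div_carry(a, b, carry):
    """Divide the 128-bit value carry||a by b; returns (quotient, new CARRY)."""
    q, r = divmod((carry << 64) | a, b)
    # The quotient fits in 64 bits whenever carry < b, which long division
    # guarantees because carry is always a previous remainder (or 0).
    assert q <= MASK64
    return q, r


def multiword_divmod(words, divisor):
    """Divide a little-endian multiword value by one 64-bit word."""
    carry = 0
    quotient = [0] * len(words)
    for i in reversed(range(len(words))):  # most significant word first
        quotient[i], carry = div_carry(words[i], divisor, carry)
    return quotient, carry
```

After the loop, the final CARRY value is the remainder of the whole multiword division.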

Arithmetic for Polynomials over GF(2)

Addition of polynomials over GF(2) is just xor (addition without carries), and so the existing bitwise logical XORS instruction provides this functionality. Polynomial multiplication requires carryless multiplication instructions. Three forms are provided:
CARRY,SR[d] ← SR[a] ⊗ SR[b]
CARRY,SR[d] ← (SR[a] ⊗ SR[b]) ⊕ SR[c]
CARRY,SR[d] ← (SR[a] ⊗ SR[b]) ⊕ SR[c] ⊕ CARRY
A modulo reduction instruction may not be required, as illustrated by the following example. In many applications, the field uses a polynomial such as $x^{128}+x^7+x^2+x+1$, and in this case a 256→128 reduction can be implemented by further multiplication. First, a series of carryless multiplication instructions is used to form the 255‑bit product $p$ of two 128‑bit values. Bits 254..128 of this product have weight $x^{128}$, i.e. they represent $(p_{254}x^{126}+\dots+p_{129}x+p_{128})\,x^{128}$. Because $x^{128} \bmod (x^{128}+x^7+x^2+x+1)$ is just $x^7+x^2+x+1$, multiplication of $p_{254..128}$ by this value results in a product $q$ with a maximum term of $x^{133}$. $q_{127..0}$ is added to $p_{127..0}$, and $q_{133..128}$ can then be multiplied again by $x^7+x^2+x+1$, resulting in a product with a maximum term of $x^{12}$, which can then also be added to the low 128 bits of the original product ($p_{127..0}$). This generalizes to any modulo polynomial with no term after $x^{128}$ greater than $x^{63}$. If most modulo reductions are of this form, then no specialized support is required.
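The two-step reduction described above can be checked with a short Python sketch. Polynomials over GF(2) are modeled as integers whose bit i is the coefficient of x^i; clmul is a plain bit-serial carryless multiply standing in for the hardware instructions, and the names clmul, R, and reduce128 are hypothetical.

```python
def clmul(a, b):
    """Carryless (GF(2) polynomial) multiplication of two integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


R = 0x87  # x^7 + x^2 + x + 1, i.e. x^128 mod (x^128 + x^7 + x^2 + x + 1)
MASK128 = (1 << 128) - 1


def reduce128(p):
    """Reduce a product of up to 255 bits modulo x^128 + x^7 + x^2 + x + 1."""
    q = clmul(p >> 128, R)   # fold bits 254..128; maximum term x^133
    t = clmul(q >> 128, R)   # fold bits 133..128 of q; maximum term x^12
    return (p & MASK128) ^ (q & MASK128) ^ t
```

Only two extra carryless multiplications by the small constant R are needed, and the second one involves just the six bits of q above bit 127.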

Multiword Addition

The ideal instructions for multiword addition and subtraction need additional single bit inputs and outputs for the carry-in and carry-out. The BRs would be natural for this purpose, but this would result in undesirable five-operand instructions, e.g. Add with Carry (ADDC) would be:
$$s \leftarrow \mathrm{SR}[a] +_u \mathrm{SR}[b] +_u \mathrm{BR}[c]$$
$$\mathrm{SR}[d] \leftarrow s_{63..0}$$
$$\mathrm{BR}[e] \leftarrow s_{64}$$
To avoid five operand instructions, SecureRISC instead defines the Add with Carry (ADDC) and Subtract with Carry (SUBC) instructions to use one bit in the 64‑bit CARRY register. ADDC is defined as:
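$$s \leftarrow \mathrm{SR}[a] +_u \mathrm{SR}[b] +_u \mathrm{CARRY}_{0}$$
$$\mathrm{SR}[d] \leftarrow s_{63..0}$$
$$\mathrm{CARRY} \leftarrow 0^{63} \parallel s_{64}$$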
SUBC is defined as:
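$$s \leftarrow \mathrm{SR}[a] -_u \mathrm{SR}[b] -_u \mathrm{CARRY}_{0}$$
$$\mathrm{SR}[d] \leftarrow s_{63..0}$$
$$\mathrm{CARRY} \leftarrow 0^{63} \parallel s_{64}$$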
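A small Python model of ADDC and SUBC shows how the single CARRY bit chains through a multiword operation. It is a sketch under stated assumptions: the function names are hypothetical, s is treated as a 65‑bit result so that bit 64 is the carry-out (or borrow-out for SUBC), and words are little-endian lists of 64‑bit integers.

```python
MASK64 = (1 << 64) - 1
MASK65 = (1 << 65) - 1


def addc(a, b, carry):
    """Returns (sum word, new CARRY) using bit 0 of carry as carry-in."""
    s = a + b + (carry & 1)
    return s & MASK64, s >> 64


def subc(a, b, carry):
    """Returns (difference word, new CARRY) using bit 0 of carry as borrow-in."""
    s = (a - b - (carry & 1)) & MASK65  # bit 64 is set when a borrow occurs
    return s & MASK64, s >> 64


def multiword_add(x, y):
    """Add two equal-length little-endian word lists; returns (words, carry-out)."""
    carry, out = 0, []
    for a, b in zip(x, y):
        word, carry = addc(a, b, carry)
        out.append(word)
    return out, carry


def multiword_sub(x, y):
    """Subtract y from x word by word; returns (words, borrow-out)."""
    carry, out = 0, []
    for a, b in zip(x, y):
        word, carry = subc(a, b, carry)
        out.append(word)
    return out, carry
```

Each loop iteration corresponds to one ADDC or SUBC instruction, with CARRY cleared before the least significant word.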

Multiword Shifts

One advantage of the 3 read SR file is that shifts can be based upon a funnel shift where the value to be shifted is the catenation of SR[a] and SR[b], allowing for rotates by specifying the same operand for the high and low funnel operands, and multiword shifts by supplying adjacent source words of the multiword value. The basic operations are then
SR[d] ← (SR[b] ∥ SR[a]) >> imm6,
SR[d] ← (SR[b] ∥ SR[a]) >> (SR[c] mod 64), and
SR[d] ← (SR[b] ∥ SR[a]) >> (−SR[c] mod 64).
Conventional logical and arithmetic shifts are also provided. Left shifts supply 0 for the low side of the funnel and use a negative shift amount. Logical right shifts supply 0 on the high side of the funnel and arithmetic right shifts supply a sign-extended version of SR[a] on the high side of the funnel. It remains to be decided whether overflow-detecting left shifts are required.
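The funnel-shift formulation can be illustrated with a short Python sketch in which rotates, conventional shifts, and multiword shifts are all expressed through one primitive. The function names are hypothetical. One detail is an assumption of this sketch rather than something the text states: a left shift by zero is special-cased, because a negated amount of 0 mod 64 would otherwise select the all-zero low half of the funnel.

```python
MASK64 = (1 << 64) - 1


def funnel(hi, lo, amount):
    """Low 64 bits of the 128-bit value hi||lo shifted right by amount mod 64."""
    return (((hi << 64) | lo) >> (amount % 64)) & MASK64


def rotate_right(x, n):
    return funnel(x, x, n)  # same operand on both sides of the funnel


def shift_right_logical(x, n):
    return funnel(0, x, n)  # zeros on the high side


def shift_right_arithmetic(x, n):
    sign_fill = MASK64 if x >> 63 else 0  # sign extension on the high side
    return funnel(sign_fill, x, n)


def shift_left(x, n):
    n %= 64
    if n == 0:
        return x  # assumption: zero shift handled separately (see lead-in)
    return funnel(x, 0, -n)  # zeros on the low side, negated amount


def multiword_shift_right(words, n):
    """Shift a little-endian multiword value right by n bits, 0 <= n < 64."""
    padded = words + [0]
    return [funnel(padded[i + 1], padded[i], n) for i in range(len(words))]
```

In the multiword case each result word is formed from two adjacent source words, which is the "adjacent source words" usage described above.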

The CARRY register could be used as a funnel shift operand instead of an SR, but that seems less flexible.

Floating-Point Flags

The floating-point flag mechanism for SecureRISC is TBD, but it is likely to be similar to other ISAs, where an unprivileged CSR has flag bits that are set by operations until cleared by software.

Floating-Point Rounding Modes

SecureRISC has the floating-point rounding mode encoded in the instruction word to allow uses of rounding modes where changing a CSR would be too costly. For example, round to odd might be used in a sequence to do operations in higher precision and then round correctly to a lower precision. In such a case dynamic rounding mode changes are likely to make the sequence slower than necessary.

Vector Register File

SecureRISC has not adopted flexible vector register file sizing, such as found in RISC‑V. Instead there are 16 vector registers (VRs) that consist of 128 72‑bit words (9216 bits per register, 8192 of data, 1024 of tag). This size was chosen to target implementations with up to sixteen parallel execution units, which for a full-length vector would require eight iterations to perform the vector operation, giving the processor sufficient time to set up the next vector operation. Flexible sizing would allow smaller implementations of the vector unit, but 144 Kib (128 Kib of data, 16 Kib of tag) is acceptable area in modern process nodes.

There are four vector length registers, which specify the number of elements to use from the specified vector registers. Most vector instructions use the n field of the instruction to specify VL[n] as the length for the operation. The outer product instructions use an even/odd pair of vector lengths: VL[n] for VR[a] and the number of rows of matrix accumulators, and VL[n+1] for VR[b] and the number of columns of matrix accumulators.

Matrix Multiply

Matrix Algebra

If A is an m×n matrix and B is an n×p matrix

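$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}, \quad B = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1p} \\ b_{21} & b_{22} & \cdots & b_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ b_{n1} & b_{n2} & \cdots & b_{np} \end{pmatrix}$$

the matrix product C=AB is defined to be the m×p matrix

$$C = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1p} \\ c_{21} & c_{22} & \cdots & c_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ c_{m1} & c_{m2} & \cdots & c_{mp} \end{pmatrix}$$

such that

$$c_{ij} = a_{i1}b_{1j} + a_{i2}b_{2j} + \cdots + a_{in}b_{nj} = \sum_{k=1}^{n} a_{ik}b_{kj}$$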

Note that the above diagrams and equations work both when the elements are scalars and when they are themselves matrixes. For example, $c_{ij}$ could be an r×t matrix that is the sum of the products of a row of r×s matrixes from A with a column of s×t matrixes from B. Such a submatrix is called a tile below.

Note also that software often transposes A prior to performing the matrix multiply, to avoid strided memory accesses for the columns of A. This transpose is not reflected in the material below, and is left as an exercise to the reader.

The following exposition attempts to explain the reasoning for the SecureRISC approach to matrix computation. In the following, N designates the problem matrix size, keeping the matrixes square for simplicity of exposition (e.g. the number of operations is simplified to N³ rather than the more cumbersome m×p×n). Matrix multiply C=C+AB is N³ multiplications and N³ additions, with each matrix element $c_{ij}$ being independent of the others but its own additions being sequential. The N³ multiplications are all independent (potentially done in parallel), but only N² of the additions are parallel when floating-point rounding is preserved. With unbounded hardware, the execution time of matrix multiply with floating-point rounding is N×L where L is the add latency. This is achieved by using N² multiply/add units N times every L cycles, but a smarter implementation would use N²/L units pipelined to produce a value every cycle, thereby adding only L-1 additional cycles for the complete result.

For practical implementation, hardware is bounded and should lay out in a regular fashion. Typically the number of multiply/add units is much smaller than N, in which case there is flexibility in how these units are allocated to the calculations to be performed, but the allocation that minimizes data movement between the units and memory is to complete a tile of C using the hardware array before moving on to a new tile. The computation that accomplishes this is the accumulation of the outer products[wikilink] of vectors from A and B. The goal is to determine the length of these vectors, and thus the size of the tile of C. If u is an m‑element vector and v is a p‑element vector,

$$
u = [\, u_1, u_2, \ldots, u_m \,], \quad v = [\, v_1, v_2, \ldots, v_p \,]
$$

then the outer product is defined to be the m×p matrix

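$$
C = \begin{pmatrix}
c_{11} & c_{12} & \cdots & c_{1p} \\
c_{21} & c_{22} & \cdots & c_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
c_{m1} & c_{m2} & \cdots & c_{mp}
\end{pmatrix}
= u \otimes v =
\begin{pmatrix}
u_1 v_1 & u_1 v_2 & \cdots & u_1 v_p \\
u_2 v_1 & u_2 v_2 & \cdots & u_2 v_p \\
\vdots & \vdots & \ddots & \vdots \\
u_m v_1 & u_m v_2 & \cdots & u_m v_p
\end{pmatrix}
$$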

i.e.

$$
c_{ij} = u_i v_j
$$

Using this formulation, the matrix product can be expressed as the sum of n outer products of the columns of A with the rows of B:

$$
C = A_{1..m,1} \otimes B_{1,1..p} + A_{1..m,2} \otimes B_{2,1..p} + \cdots + A_{1..m,n} \otimes B_{n,1..p} = \sum_{k=1}^{n} A_{1..m,k} \otimes B_{k,1..p}
$$

where $A_{1..m,k}$ is column k of A and $B_{k,1..p}$ is row k of B. (Note that elsewhere in this document ⊗ denotes carryless multiply, but in a matrix context, it is used for the outer product.)

In most systems, the maximum tile size will either be a square with power-of-two sides, e.g. 2×2, 4×4, 8×8, … 128×128, or a rectangle of a power of two and twice that, e.g. 2×4, 4×8, … 64×128. In a given problem, most of the operations will be done with the maximum tile size, with the remainder being the leftover edges. For example, with a maximum tile size of 64×64, a 1000×2000 by 2000×1500 multiplication yielding a 1000×1500 product would use 64×64 tiles 15×23=345 times, with the last row of tiles being 23 tiles of 40×64, the last column of tiles being 15 tiles of 64×28, and the final corner employing a single 40×28 tile.
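The tile counts in the example above follow from integer division of the product dimensions by the maximum tile size. The sketch below reproduces them; the function name `tile_counts` and its return format are illustrative only and are not part of SecureRISC.

```python
def tile_counts(rows, cols, tile_rows, tile_cols):
    """Count tiles of each shape needed to cover a rows x cols product matrix.

    Returns a dict mapping (tile height, tile width) to the number of tiles.
    """
    full_r, rem_r = divmod(rows, tile_rows)
    full_c, rem_c = divmod(cols, tile_cols)
    counts = {(tile_rows, tile_cols): full_r * full_c}  # maximum-size tiles
    if rem_r:
        counts[(rem_r, tile_cols)] = full_c             # last row of tiles
    if rem_c:
        counts[(tile_rows, rem_c)] = full_r             # last column of tiles
    if rem_r and rem_c:
        counts[(rem_r, rem_c)] = 1                      # corner tile
    return counts


# The 1000x1500 product from the text with 64x64 maximum tiles.
for shape, count in tile_counts(1000, 1500, 64, 64).items():
    print(shape, count)
# (64, 64) 345
# (40, 64) 23
# (64, 28) 15
# (40, 28) 1
```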

Matrix Multiply Using Vectors

The following series of transforms demonstrates how the simple, classic matrix multiply written as three nested loops shown below is transformed to use a vector ISA using outer products. (Note that the pseudo-code switches from 1‑origin indexing of Matrix Algebra to 0‑origin indexing of computer programming. Note also that the code does not attempt to handle the case of the matrix dimensions not being a multiple of the tile size.)

    for i ← 0 to m-1
      for j ← 0 to p-1
        for k ← 0 to n-1
          c[i,j] ← c[i,j] + a[i,k] * b[k,j]

The scalar version above would typically then move c[i,j] references to a register to reduce the load/store to multiply/add ratio from 4:1 to 2:1.

    for i ← 0 to m-1
      for j ← 0 to p-1
        acc ← c[i,j]
        for k ← 0 to n-1
          acc ← acc + a[i,k] * b[k,j]
        c[i,j] ← acc

However, in the vector version this step is delayed until after tiling. For vector, the above code is first tiled to become the following:

    // iterate over 8×8 tiles of C
    for ti ← 0 to m-1 step 8
      for tj ← 0 to p-1 step 8
        // add product of eight rows of a (a[ti..ti+7,0..n-1])
        // and eight columns of b (b[0..n-1,tj..tj+7]) to product tile
        for i ← 0 to 7
          for j ← 0 to 7
            for k ← 0 to n-1
              c[ti+i,tj+j] ← c[ti+i,tj+j] + a[ti+i,k] * b[k,tj+j]

The above code is then modified to use eight vector registers as an 8×8 tile accumulator, with the i and j loops replaced by vector loads, stores, and vector × scalar operations:

    for ti ← 0 to m-1 step 8      // tile i
      for tj ← 0 to p-1 step 8    // tile j
        // copy to accumulator
        v0 ← c[ti+0,tj..tj+7]     // vector loads
        v1 ← c[ti+1,tj..tj+7]
        v2 ← c[ti+2,tj..tj+7]
        v3 ← c[ti+3,tj..tj+7]
        v4 ← c[ti+4,tj..tj+7]
        v5 ← c[ti+5,tj..tj+7]
        v6 ← c[ti+6,tj..tj+7]
        v7 ← c[ti+7,tj..tj+7]
        // add product of a[ti..ti+7,0..n-1]
        // and b[0..n-1,tj..tj+7] to tile
        for k ← 0 to n-1
          va ← a[ti..ti+7,k]      // vector load
          vb ← b[k,tj..tj+7]      // vector load
          v0 ← v0 + va[0] * vb    // vector * scalar
          v1 ← v1 + va[1] * vb
          v2 ← v2 + va[2] * vb
          v3 ← v3 + va[3] * vb
          v4 ← v4 + va[4] * vb
          v5 ← v5 + va[5] * vb
          v6 ← v6 + va[6] * vb
          v7 ← v7 + va[7] * vb
        // copy accumulator back to tile
        c[ti+0,tj..tj+7] ← v0     // vector stores
        c[ti+1,tj..tj+7] ← v1
        c[ti+2,tj..tj+7] ← v2
        c[ti+3,tj..tj+7] ← v3
        c[ti+4,tj..tj+7] ← v4
        c[ti+5,tj..tj+7] ← v5
        c[ti+6,tj..tj+7] ← v6
        c[ti+7,tj..tj+7] ← v7

One limitation of some vector instruction sets is the lack of a vector × scalar instruction where the scalar is an element of a vector register, which would add many scalar loads to the above loop. SecureRISC provides scalar operands from elements of vector registers.

Besides the obvious parallelism advantage, another improvement is that each element of the A and B matrixes is used eight times per load, which improves energy efficiency. However, one limitation of the vector implementation of matrix multiply is the limited number of multiply/add units that can be used in parallel. It is obvious that the above can use eight units in parallel (one for each element of the vectors). Slightly less obvious is that an implementation could employ 64/L units to execute the above code, issuing groups of 8/L vector instructions in a single cycle, and parceling these vector operations out to the various units to proceed in parallel. After 8/L instructions, the next group can be issued to the pipelined units. However, a better solution is possible by providing more direct support for the outer product formulation. The goals are to obtain better energy efficiency on the computation by reducing the data movement in the above, particularly the VRF bandwidth, and to allow even more multiply/add units to be employed on matrix operations (the above is limited to 8×8 tiles by the number of vector registers).

Matrix Multiply Using Outer Product Array

It is desirable to match the number of multiply/add units to the load bandwidth when practical, as this results in a balanced set of resources (memory and computation are equally limiting). For example, if the cache hierarchy can deliver V elements per cycle, then it takes one cycle to load V elements from A and one cycle to load V elements from B, so processing these values in two cycles matches load bandwidth to computation. For L≤2, a V×(V/2) array of multiply/add units with V² accumulators (two per multiply/add unit) accomplishes this by taking the outer product of all of the u vector (from A) and the even elements of the v vector (from B) in the first cycle, and all of u with the odd elements of v in the second cycle. The full latency is L+1 cycles, but with pipelining a new set of values can be started every two cycles. For L>2, using a V×(V/L) pipelined array for L cycles is a natural implementation but does not balance load cycles to computation cycles. For L=4 there are multiple ways to match the load bandwidth and adder latency. The method that minimizes hardware is to process two tiles of C in parallel using pipelined multiply/add units by doing four cycles of loads followed by two 2‑cycle outer products to separate accumulators. For example, the loads might be V elements from an even column of A, V elements from an even row of B, V elements from an odd column of A, and V elements from an odd row of B. The computation would consist of two V×(V/2) outer product accumulates, each into V² accumulators (total 2V²). The total latency is seven cycles but the hardware is able to start a new outer product every four cycles by alternating the accumulators used, thereby matching the load bandwidth. If any of these array sizes is too large for the area budget, then it will be necessary to reduce performance, and no longer match the memory hierarchy. Even in 2024 process nodes (e.g. 3 nm), this may be the case when L>2.

The V×(V/2) sequence for L=2 is illustrated below, using superscripts to indicate cycle numbers, as in $C^0 = 0$ to indicate accumulators being zero on cycle 0, $u^0$ the vector loaded on cycle 0, $v^1$ the vector loaded on cycle 1, $C^3$ the result of the first half of the two-cycle latency outer product, $C^4$ the result of the second half of the outer product, etc.

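$$
C^0 = \begin{pmatrix} 0 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 0 \end{pmatrix}
$$

$$
u^0 = [\, a_{11}, a_{21}, \ldots, a_{m1} \,], \quad
v^1 = [\, b_{11}, b_{12}, \ldots, b_{1p} \,], \quad
u^2 = [\, a_{12}, a_{22}, \ldots, a_{m2} \,], \quad
v^3 = [\, b_{21}, b_{22}, \ldots, b_{2p} \,]
$$

$$
C^3 = \begin{pmatrix}
u_1^0 v_1^1 & 0 & \cdots & u_1^0 v_{p-1}^1 & 0 \\
u_2^0 v_1^1 & 0 & \cdots & u_2^0 v_{p-1}^1 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
u_m^0 v_1^1 & 0 & \cdots & u_m^0 v_{p-1}^1 & 0
\end{pmatrix}
= \begin{pmatrix}
a_{11} b_{11} & 0 & \cdots & a_{11} b_{1,p-1} & 0 \\
a_{21} b_{11} & 0 & \cdots & a_{21} b_{1,p-1} & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
a_{m1} b_{11} & 0 & \cdots & a_{m1} b_{1,p-1} & 0
\end{pmatrix}
$$

$$
C^4 = \begin{pmatrix}
u_1^0 v_1^1 & u_1^0 v_2^1 & \cdots & u_1^0 v_{p-1}^1 & u_1^0 v_p^1 \\
u_2^0 v_1^1 & u_2^0 v_2^1 & \cdots & u_2^0 v_{p-1}^1 & u_2^0 v_p^1 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
u_m^0 v_1^1 & u_m^0 v_2^1 & \cdots & u_m^0 v_{p-1}^1 & u_m^0 v_p^1
\end{pmatrix}
= \begin{pmatrix}
a_{11} b_{11} & a_{11} b_{12} & \cdots & a_{11} b_{1,p-1} & a_{11} b_{1p} \\
a_{21} b_{11} & a_{21} b_{12} & \cdots & a_{21} b_{1,p-1} & a_{21} b_{1p} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
a_{m1} b_{11} & a_{m1} b_{12} & \cdots & a_{m1} b_{1,p-1} & a_{m1} b_{1p}
\end{pmatrix}
$$

$$
C^5 = \begin{pmatrix}
c_{11}^3 + u_1^2 v_1^3 & c_{12}^4 & \cdots & c_{1,p-1}^3 + u_1^2 v_{p-1}^3 & c_{1p}^4 \\
c_{21}^3 + u_2^2 v_1^3 & c_{22}^4 & \cdots & c_{2,p-1}^3 + u_2^2 v_{p-1}^3 & c_{2p}^4 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
c_{m1}^3 + u_m^2 v_1^3 & c_{m2}^4 & \cdots & c_{m,p-1}^3 + u_m^2 v_{p-1}^3 & c_{mp}^4
\end{pmatrix}
$$

$$
C^6 = \begin{pmatrix}
c_{11}^5 & c_{12}^4 + u_1^2 v_2^3 & \cdots & c_{1,p-1}^5 & c_{1p}^4 + u_1^2 v_p^3 \\
c_{21}^5 & c_{22}^4 + u_2^2 v_2^3 & \cdots & c_{2,p-1}^5 & c_{2p}^4 + u_2^2 v_p^3 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
c_{m1}^5 & c_{m2}^4 + u_m^2 v_2^3 & \cdots & c_{m,p-1}^5 & c_{mp}^4 + u_m^2 v_p^3
\end{pmatrix}
$$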

The following series of transforms demonstrates how the simple, classic matrix multiply written as three nested loops shown below is transformed to use tiles with an outer product multiply/add/accumulator array.

    for i ← 0 to m-1
      for j ← 0 to p-1
        for k ← 0 to n-1
          c[i,j] ← c[i,j] + a[i,k] * b[k,j]

The above code is then tiled to become the following:

    // iterate over tilerows×tilecols tiles of C
    for ti ← 0 to m-1 step tilerows
      for tj ← 0 to p-1 step tilecols
        // add product of a[ti..ti+tilerows-1,0..n-1]
        // and b[0..n-1,tj..tj+tilecols-1] to tile
        for i ← 0 to tilerows-1
          for j ← 0 to tilecols-1
            for k ← 0 to n-1
              c[ti+i,tj+j] ← c[ti+i,tj+j] + a[ti+i,k] * b[k,tj+j]

The above code is modified to use an accumulator:

    for ti ← 0 to m-1 step tilerows
      for tj ← 0 to p-1 step tilecols
        // copy to accumulator
        for i ← 0 to tilerows-1
          for j ← 0 to tilecols-1
            acc[i,j] ← c[ti+i,tj+j]
        // add product of a[ti..ti+tilerows-1,0..n-1]
        // and b[0..n-1,tj..tj+tilecols-1] to tile
        for i ← 0 to tilerows-1
          for j ← 0 to tilecols-1
            for k ← 0 to n-1
              acc[i,j] ← acc[i,j] + a[ti+i,k] * b[k,tj+j]
        // copy accumulator back to tile
        for i ← 0 to tilerows-1
          for j ← 0 to tilecols-1
            c[ti+i,tj+j] ← acc[i,j]

The above code is then vectorized by moving the k loop outside:

    for ti ← 0 to m-1 step tilerows
      for tj ← 0 to p-1 step tilecols
        for i ← 0 to tilerows-1
          acc[i,0..tilecols-1] ← c[ti+i,tj..tj+tilecols-1] // vector load + acc write
        for k ← 0 to n-1
          va ← a[ti..ti+tilerows-1,k] // vector load
          vb ← b[k,tj..tj+tilecols-1] // vector load
          acc ← outerproduct(va, vb)  // outer product instruction (accumulates into acc)
        for i ← 0 to tilerows-1
          c[ti+i,tj..tj+tilecols-1] ← acc[i,0..tilecols-1] // acc read + vector store

where the outerproduct instruction is defined as follows:

        for i ← 0 to va.length-1
          for j ← 0 to vb.length-1
            acc[i,j] ← acc[i,j] + va[i] * vb[j]
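As a cross-check of the transformation, the following Python sketch models the tiled outer-product formulation with plain lists and compares it against the classic triple loop. It is an illustration of the pseudo-code only, not of any SecureRISC hardware; the function names are hypothetical, and unlike the pseudo-code it also clamps edge tiles so that dimensions need not be a multiple of the tile size.

```python
def matmul_naive(a, b, c):
    """Classic triple loop: returns c + a*b for list-of-list matrices."""
    n, p = len(b), len(b[0])
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(n))
             for j in range(p)] for i in range(len(a))]


def outer_product_accumulate(acc, va, vb):
    """Model of the outer product instruction: acc[i][j] += va[i] * vb[j]."""
    for i, x in enumerate(va):
        for j, y in enumerate(vb):
            acc[i][j] += x * y


def matmul_tiled(a, b, c, tile_rows, tile_cols):
    """Tiled c + a*b using one accumulator tile and n outer products per tile."""
    m, n, p = len(a), len(b), len(b[0])
    out = [row[:] for row in c]
    for ti in range(0, m, tile_rows):
        rows = min(tile_rows, m - ti)      # clamp edge tiles (not in the pseudo-code)
        for tj in range(0, p, tile_cols):
            cols = min(tile_cols, p - tj)
            # copy tile of C to the accumulator
            acc = [out[ti + i][tj:tj + cols] for i in range(rows)]
            for k in range(n):
                va = [a[ti + i][k] for i in range(rows)]  # column k of the A rows
                vb = b[k][tj:tj + cols]                   # row k of the B columns
                outer_product_accumulate(acc, va, vb)
            # copy accumulator back to the tile of C
            for i in range(rows):
                out[ti + i][tj:tj + cols] = acc[i]
    return out


a = [[i + k for k in range(3)] for i in range(5)]      # 5x3
b = [[k * j + 1 for j in range(7)] for k in range(3)]  # 3x7
c = [[1] * 7 for _ in range(5)]                        # 5x7
print(matmul_tiled(a, b, c, 2, 4) == matmul_naive(a, b, c))  # True
```

Integer elements are used so that the comparison is exact; with floating-point data the two versions sum in slightly different orders and may differ in rounding.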

In the Matrix Algebra section it was observed that matrix multiplication with the smarter variant of unbounded multiply/add units (i.e. N²/L units pipelined to produce a value every cycle) takes N×L+L-1 cycles. It is worth asking how the above method fares relative to this standard. Because we cut the number of multiply/add units in half to match the load bandwidth, we expect at least twice the cycle count, and this expectation is met: matching a memory system that delivers V elements per cycle, a tile of V×V processed by an array of V×(V/2) multiply/add units (L≤2) produces the tile in 2V+1 cycles. It may help to work an example. For a memory system delivering one 512‑bit cache block per cycle and 16‑bit data (e.g. BF16), V=32, and the 32×32 tile is produced using 2 vector loads and one 2‑cycle outer product instruction iterated 32 times, taking 64 cycles and yielding 512 multiply/adds per cycle. However, this does not include the time to load the accumulators before and transfer them back to C after. When this 64‑cycle tile computation is part of a 1024×1024 matrix multiply, this tile loop will be called 32 times for each tile of C. If it takes 64 cycles to load the accumulators from memory and 64 cycles to store back to memory, then this is 64+32×64+64=2176 total cycles. There are a total of 1024 output tiles, so the matrix multiply is 2228224 cycles (not counting cache misses) for 1024³ multiply/adds, which works out to 481.88 multiply/adds per cycle, or 94% of peak.
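The arithmetic of this worked example can be reproduced as follows. This is a sketch under the same assumptions as the text (two cycles per outer product, 64 cycles each to load and store the accumulators, no cache misses); the function and parameter names are illustrative.

```python
def matmul_cycles(n, v, acc_load_cycles, acc_store_cycles):
    """Cycle estimate for an n x n matrix multiply using v x v tiles.

    Each tile needs n outer products at two cycles each (one cycle per
    vector load), plus the accumulator load and store overhead.
    """
    tiles = (n // v) ** 2
    compute_cycles = 2 * n
    cycles_per_tile = acc_load_cycles + compute_cycles + acc_store_cycles
    total_cycles = tiles * cycles_per_tile
    multiply_adds = n ** 3
    return cycles_per_tile, total_cycles, multiply_adds / total_cycles


per_tile, total, rate = matmul_cycles(1024, 32, 64, 64)
peak = 32 * (32 // 2)  # V x (V/2) multiply/add units

print(per_tile)                  # 2176
print(total)                     # 2228224
print(round(rate, 2))            # 481.88
print(round(100 * rate / peak))  # 94
```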

Matrix Accumulators

The bandwidth of reads and writes to outer product accumulators far exceeds what a Vector Register File (VRF) generally targets, which suggests that these structures be kept separate. Also the number of bits in the accumulators is potentially large relative to VRF sizes. Increasing the bandwidth and potentially the size of the VRF to meet the needs of outer product accumulation is not a good solution. Rather the accumulator bits should be located in the multiply/add array, and be transferred to memory when a tile is completed. This transfer might be one row at a time through the VRF, since the VRF has the necessary store operations and datapaths to the cache hierarchy. The appropriateness of separate accumulator storage may be illustrated by examples. A typical vector load width might be the cache block size of 512 bits. This represents 64 8‑bit elements. If the products of these 8‑bit elements are accumulated in 16 bits (e.g. int16 for int8 or fp16 for fp8), then for L≤2, 16×64² = 65536 bits of accumulator are required. The entire SecureRISC VRF is only about twice as many bits, and these bits require more area than accumulator bits, as the VRF must support at least 4 read ports and 2 write ports for parallel execution of vector multiply/acc and a vector load or vector store. In contrast, accumulator storage within the multiply/add array is local, small, and due to locality consumes negligible power. As another example, consider the same 512 bits as sixteen IEEE 754 binary32 elements with L=4. The method for this latency suggests a 16×8 array of binary32 multiply/add units with 2048 32‑bit accumulators, which is again a total of 65536 bits of accumulator storage, but now embedded in much larger multiply/add units.

SecureRISC has instructions that produce the outer product[wikilink] of two vectors and add this to one of four matrix accumulators. The matrix accumulators are expected to be stored within the logic producing the outer product, and so are distinct from the vector register file. The outer product hardware allows a large number of multiply/accumulate units to accelerate matrix multiply more efficiently than using vector operations.

The purpose of providing 4 accumulators per multiply/add unit is to allow the accumulators to be loaded and stored by software while outer products are being accumulated to other registers and to allow multiple tiles to be pipelined.

Control and Status Register Operations

SecureRISC has a large number of registers that affect instruction execution. These registers, called CSRs, are accessed by special instructions that support reading, writing, swapping, setting bits, and clearing bits. Many ISAs have such instructions; the unusual aspect of SecureRISC is that CSRs are split into early pipeline CSRs (XCSRs), late pipeline CSRs (SCSRs), per-ring CSRs (RCSRs), and indirect CSRs (ICSRs). RCSRs can be accessed in two ways: first, via the CSR number in the immediate field and the ring from an XR; and second, via an encoding that refers to the register for the current ring of execution (PC.ring). ICSRs are accessed with a CSR base immediate, a CSR index from an XR, and an offset for the word of data at that index. For example, the ENCRYPTION ICSRs have five 64‑bit values for a given index (an 8‑bit algorithm and 256 bits of encryption key). Similarly the amatch ICSRs have five 64‑bit values for a given index (the address to match, 128 bits of region access permission, and 128 bits of region write permission).

XCSRs, RCSRs, and ICSRs are read and written to and from the XRs. Late pipeline SCSRs are read and written to and from the SRs.

Reading and writing CSRs has no side effects. CSR operations always return the old value of the CSR, which wastes a register when the old value is not useful, but that seems acceptable compared to providing separate opcodes to avoid the register write.

Per-ring CSRs (RCSRs) appear to be fairly expensive, but the invention of SRAM cells in 5 nm and later process nodes that support efficient small RAM means that some RCSRs can be implemented by tiny 8‑entry SRAM arrays, provided that multiple ring values are not required in the same cycle. Unfortunately OoO microarchitectures might produce such a situation, but in some cases this could be handled by reading the necessary RCSRs during instruction decode and pipelining that value along. Other tricks might be used to keep RCSRs as tiny SRAMs.

In the specifications below, the definition of n ← op(o,v,m) comes from the opcode (mnemonic RW for Read-Write, mnemonic RS for Read-Set, mnemonic RC for Read-Clear). Here o is the old value, v is the operand value, and m is the per-CSR bitmask specifying which bits are writeable (some bits possibly being read-only).

Description  op  Definition
Read-Write   RW  n ← (o & ~m) | (v & m)
Read-Set     RS  n ← o | (v & m)
Read-Clear   RC  n ← o & ~(v & m)

CSRopX d,imm,a

    v ← XR[a]
    trap if (v & XCSRreserved[imm]) ≠ 0
    o ← XCSR[imm]
    n ← op(o,v,XCSRwriteable[imm])
    XCSR[imm] ← n
    XR[d] ← o

CSRopS d,imm,a

    v ← SR[a]
    trap if (v & SCSRreserved[imm]) ≠ 0
    o ← SCSR[imm]
    n ← op(o,v,SCSRwriteable[imm])
    SCSR[imm] ← n
    SR[d] ← o

RCSRopXC d,imm,a

    v ← XR[a]
    trap if (v & RCSRreserved[imm]) ≠ 0
    r ← PC.ring
    o ← RCSR[imm][r]
    n ← op(o,v,RCSRwriteable[imm])
    RCSR[imm][r] ← n
    XR[d] ← o

RCSRopXR d,imm,a

    v ← XR[a]
    trap if (v & RCSRreserved[imm]) ≠ 0
    r ← XR[b]₆₆..₆₄
    trap if r > PC.ring
    o ← RCSR[imm][r]
    n ← op(o,v,RCSRwriteable[imm])
    RCSR[imm][r] ← n
    XR[d] ← o

ICSRopX d,a,b,c,e

    v ← XR[a]
    trap if (v & ICSRreserved[c][e]) ≠ 0
    i ← XR[b]₇..₀
    trap if i ≥ ICSRcount[c]
    o ← ICSR[c][i][e]
    n ← op(o,v,ICSRwriteable[c][e])
    ICSR[c][i][e] ← n
    XR[d] ← o
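The common pattern of these definitions (trap on reserved bits, compute the new value through the writeable mask, return the old value) can be modelled in a few lines of Python. This is a behavioural sketch of the simplified semantics above only: the `Trap` exception, the dictionary standing in for the CSR file, and the function names are all hypothetical.

```python
class Trap(Exception):
    """Stands in for the trap taken when reserved bits are set in the operand."""


def csr_new_value(op, old, value, writeable):
    """n <- op(o, v, m) for the three CSR operations."""
    if op == "RW":   # Read-Write: replace only the writeable bits
        return (old & ~writeable) | (value & writeable)
    if op == "RS":   # Read-Set: set the selected writeable bits
        return old | (value & writeable)
    if op == "RC":   # Read-Clear: clear the selected writeable bits
        return old & ~(value & writeable)
    raise ValueError(f"unknown CSR operation {op!r}")


def csr_access(csrs, number, op, value, reserved, writeable):
    """Model of CSRopX/CSRopS: update csrs[number] and return the old value."""
    if value & reserved:
        raise Trap(f"reserved bits set in operand for CSR {number}")
    old = csrs[number]
    csrs[number] = csr_new_value(op, old, value, writeable)
    return old


csrs = {5: 0b1010}
print(bin(csr_access(csrs, 5, "RS", 0b0101, reserved=0b1000_0000, writeable=0b0011)))
# 0b1010 (old value); CSR 5 is now 0b1011 because only bit 0 was both requested and writeable
print(bin(csrs[5]))  # 0b1011
```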

Atomic Memory Operations

SecureRISC has three sorts of instructions for synchronization via memory locations. The first is one of the primitives that can implement most synchronization methods: Compare And Swap[wikilink]. Compare And Swap (CAS) exists for the SRs (CASS, CASSD, CASS64, CASS128, CASSI, CASSDI, CASS64I, CASS128I) and perhaps the XRs (CASX, CASXD, CASX64, CASX128, CASXI, CASXDI, CASX64I, CASX128I). It is possible that 8, 16, and 32‑bit versions of Compare And Swap might also be provided. It is also plausible that 288‑bit (half cache block) and 576‑bit (whole cache block) CAS may be provided from the VRs. The basic schema of CAS is illustrated by the following simplified semantics of CASS64, with the other instruction formats being similar:

    ea ← AR[a]
    expected ← SR[b]
    new ← SR[c]
    m ← lvload64(ea)
    if m = expected then
      lvstore64(ea) ← new
    endif
    SR[d] ← m

This specification clearly exceeds the number of read and write ports for the XRs, so the CASX forms might be omitted. However, CAS instructions are likely to take at least two cycles, and might read the register file over two cycles. Alternatively, a CSR could be introduced for the expected value, though this would mean longer instruction sequences for synchronization. TBD.

The second sort of synchronization instruction is not as powerful as Compare And Swap, and could be implemented by CAS, but it may be more efficient in some circumstances. It is atomic load and add (AADDX64 and AADDS64). These instructions load the specified memory location into the destination register and then atomically increment the memory location, as illustrated by the following simplified semantics of AADDX64:

    m ← lvload64(ea)
    t ← m + 1
    lvstore64(ea) ← t
    XR[d] ← m

These operations correspond to the ticket(S) operation on a sequencer, as defined in Synchronization with Eventcounts and Sequencers[PDF] by Reed and Kanodia. Although sequencers only require an atomic increment, the generalization to AADDX64 etc. keeps the system interconnect transaction for uncached atomic add similar to atomic AND and OR below.

The third sort of synchronization instruction is even less powerful than atomic increment, and could be implemented by CAS, but may be more efficient in some circumstances, such as updating bitmaps in the GC mark phase. The instructions are atomic AND (AANDX64), atomic OR (AORX64), and atomic XOR (AXORX64). These instructions load the specified memory location into the destination register and then atomically AND, OR, or XOR the memory location, as illustrated by the following simplified semantics of AANDX64:

    m ← lvload64(ea)
    t ← m & AR[b]
    if t ≠ m then
      lvstore64(ea) ← t
    endif
    XR[d] ← m

The case for RISC‑V’s AMOSWAP, AMOMIN, and AMOMAX seems unclear at this point. (The case for RISC‑V’s AMOXOR is also unclear to this author, but it is trivial given support for AANDX64 and AORX64, and is also called for by C++11 std::atomic, and so is included.) Some APIs (e.g. CUDA) may expect these operations, but they could be implemented on SecureRISC with CAS instructions. C++20 added atomic operations on floating-point types, but these are best done using CAS (e.g. it is not appropriate to have floating-point addition in the memory controller for uncached operations).
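To illustrate how an operation such as atomic minimum could be built from CAS, the sketch below models memory in Python and uses the usual load, compute, compare-and-swap retry loop. It is a behavioural model only: the `Memory` class, its lock (which stands in for whatever makes CAS atomic in hardware), and `atomic_min` are hypothetical names, and the `cas64` method follows the simplified CASS64 semantics by returning the value found in memory.

```python
import threading


class Memory:
    """Toy word-addressed memory with an atomic 64-bit compare and swap."""

    def __init__(self):
        self._words = {}
        self._lock = threading.Lock()  # models the atomicity of CAS

    def load64(self, ea):
        return self._words.get(ea, 0)

    def store64(self, ea, value):
        self._words[ea] = value

    def cas64(self, ea, expected, new):
        """Store new if memory holds expected; always return the old contents."""
        with self._lock:
            m = self._words.get(ea, 0)
            if m == expected:
                self._words[ea] = new
            return m


def atomic_min(mem, ea, value):
    """Atomic minimum via a CAS retry loop; returns the previous contents."""
    while True:
        old = mem.load64(ea)
        new = min(old, value)
        if new == old:                     # nothing to store
            return old
        if mem.cas64(ea, old, new) == old:  # succeeded: nobody else intervened
            return old
        # otherwise another agent changed the location; retry


mem = Memory()
mem.store64(0x40, 10)
print(atomic_min(mem, 0x40, 7), mem.load64(0x40))  # 10 7
print(atomic_min(mem, 0x40, 9), mem.load64(0x40))  # 7 7
```

The same loop shape works for other read-modify-write operations (e.g. maximum or floating-point add) by changing how `new` is computed from `old`.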

Atomic operations may be performed by the processor on coherent memory locations in the cache by holding off coherency transactions during the operations involved, or on uncached locations by sending a transaction to the memory, which performs the operation atomically there and returns the result. The System Interconnect Address Attributes section describes main memory attributes indicating which locations implement uncached atomic memory operations. The locations to be modified by atomic operations must not cross a 64‑byte boundary; for example, the address for CASS64 must be in the range 0..56 mod 64.
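The boundary rule at the end of the paragraph above amounts to a simple check on the low six bits of the address. A minimal sketch, with a hypothetical function name, is shown below; it checks only the 64‑byte boundary rule and makes no assumption about any additional alignment requirement.

```python
def crosses_64_byte_boundary(address, size_bytes):
    """True if an access of size_bytes at address spans a 64-byte boundary."""
    return (address % 64) + size_bytes > 64


# An 8-byte CASS64 operand may start at offsets 0..56 within a 64-byte block.
print(crosses_64_byte_boundary(0x1000 + 56, 8))  # False
print(crosses_64_byte_boundary(0x1000 + 57, 8))  # True
```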

Wait Instructions

SecureRISC will have the usual instructions to wait for an interrupt, since such instructions increase efficiency. The details are TBD, but there might, for example, be a WAIT instruction that takes a value to write to InterruptEnable[PC.ring] and then suspends fetch before the next instruction (so that the return from the interrupt exception returns to that instruction).

A more interesting instruction under consideration is one that waits for a memory location to change, which may be useful for reducing the overhead of memory based synchronization. The x86 MONITOR/MWAIT instructions may be one model.

Fence Instructions

Note: SecureRISC has acquire and release options for loads and stores, which reduces (but does not eliminate) the need for some memory fences. Fences for virtual memory changes may be necessary, though it may be possible to handle those in the coherence protocol. Some fence instructions may also be useful in mitigating security vulnerabilities due to microarchitecture bugs.

The details of SecureRISC’s fence instructions are TBD, but they are likely to specify a first set (e.g. as a bitmask) of operations that must complete before a second set (also a bitmask) of operations is allowed to initiate. This is similar to what RISC‑V adopted for memory fences (their FENCE instruction, where there are only four set elements), but for a larger set of instructions. The goal is to encompass the variety of fences found in other ISAs. The set elements might include instruction fetch, loads, stores, CSR reads, CSR writes, and other instructions. Loads and stores could be further categorized by System Interconnect Address Attributes or acquire and release attributes. Other operations that might receive bits in the sets might be related to prediction, system interface transactions, error checking, privilege level changes, prefetch, address space changes, waits, interrupts, exceptions, and so on. One goal is to correctly handle Just-in-Time (JIT) compilation in the presence of processor migration, which should be easier in SecureRISC because stores must invalidate instruction caches. An enlarged set of things to fence should also allow for finer-grain patching of the security vulnerability bugs that seem to plague speculative processors; even though these should be handled correctly by processor designers, they often seem not to be. Not all of this is thought out. Again, the details are TBD.

Note: Need to look at POWER persistent synchronization instructions (phwsync and plwsync). See Power ISA Version 3.1B section 4.6.3 Memory Barrier Instructions.

System Call Instructions

SecureRISC lacks a System Call instruction (e.g. RISC‑V ECALL), as gates are the preferred method of making requests of higher privilege levels.

Prefetch and Cache Operations

SecureRISC has instructions for compiler-directed prefetching and to control automatic prefetching. These instructions operate on 8‑word cache lines. The C prefix to these assembler mnemonics represents Cache. Rather than identify caches as L1 BBDC, L1 Instruction, L2 Instruction, L1 Data, L2 Data, L3, etc. we designate caches by referencing the instructions that use those caches. Further work is required for things that operate on or stop at some intermediate level of the hierarchy. These instructions operate on the cache block specified by an lvaddr and are subject to access permissions. They are available to all rings. There will be privileged instructions not yet listed here.

SecureRISC requires that writes invalidate or update all caches that contain previous fetches, including the BBDC and L1 and L2 Instruction caches. Previously fetched instructions still in the pipeline are not invalidated, so a fence is required. Thus, cache operations are not required for JIT compilers, merely the fence. This is typically implemented by having a directory at what ARM calls the Point of Unification (PoU) in the cache hierarchy. This directory records the locations in lower levels which may contain a copy. Stores consult the directory and when other locations are noted, those locations are invalidated or updated. For multiprocessor systems, a first processor may write instructions that a second will execute. The first processor must execute a fence to ensure all writes have completed before signaling the second processor to proceed. The second processor must also use a fence to ensure that the pipeline has no stale instructions (e.g. fetched speculatively). The details will be spelled out when the fence instructions are specified.

Is TLB prefetching required?

Instruction  Operation

Fetch prefetching and eviction designation
(these may be executed too late in the pipeline to be useful and so may be replaced by BBD features)
CPBB         Prefetch into Basic Block Descriptor Cache (BBDC)
CPI          Prefetch into Basic Block Descriptor Cache (BBDC) and Instruction Cache
CEBB         Designate eviction candidate for Basic Block Descriptor Cache (BBDC)
CEI          Designate eviction candidate for Basic Block Descriptor Cache (BBDC) and Instruction Caches

Early pipeline prefetching, zeroing, writeback, invalidation, and eviction designation
CPLA         Prefetch for LA/LAC/etc.
CPLX         Prefetch for LX/etc. (probably identical to CPLA in most cases)
CPSA         Prefetch for SA/SAC/etc.
CPSX         Prefetch for SX/etc. (probably identical to CPSA in most cases)
CZA          Zero cache block used for SA/SAC/etc.
CZX          Zero cache block used for SX/etc. (probably identical to CZA in most cases)
CEA          Designate eviction candidate for LA/SA
CEX          Designate eviction candidate for LX/SX
CCX          Clean (writeback) for SX cache
CCIX         Clean (writeback) and invalidate for SX cache

Late pipeline prefetching, zeroing, writeback, invalidation, and eviction designation
(the primary difference from early prefetching is some microarchitectures may not prefetch to the first stage(s) of the data cache hierarchy)
CPLS         Prefetch for LS
CPSS         Prefetch for SS
CZS          Zero cache block used for SS/etc.
CES          Designate eviction candidate for LS/SS
CCS          Clean (writeback) for SS cache
CCIS         Clean (writeback) and invalidate for SS cache

Need to look at POWER dcbstps (data cache block store to persistent storage) and dcbfps (data cache block flush to persistent storage).

The primary issue with fetch prefetching is that some implementations may execute explicit instructions too late to be useful. Eventually I expect to define new next codes in Basic Block Descriptors for L1 BBDC and L1 Instruction Cache prefetch and eviction designation to solve this problem. Whether some of the above instructions are removed by such a solution is TBD.

Prefetch may want additional options for rereference interval prediction and similar hints to avoid removing useful cache blocks when streaming data larger than the cache size.

Code Size Reduction

It is likely appropriate to add some instructions that exist only for code size reduction, which expand into multiple SecureRISC instructions early in the pipeline (e.g. before register renaming). The best candidates for this so far are doubleword load/store instructions, which would expand into two singleword load/store instructions. This expansion and execution as separate instructions in the backend of the pipeline avoids the issues with register renaming that would otherwise exist. Partial execution of the pair would be allowed (and loads to the source registers would not be allowed). Doubleword load/store instructions significantly reduce the size of function call entry and exit and may be useful for loading a code pointer and context pointer pair for indirect calls.

Instruction Formats and Overview

The following outlines some of the instructions without giving them their full definitions, which includes tag and bounds checking. The full definitions will follow later.

The 16‑bit instruction formats are included for code density. Some evaluation of whether it is worth the cost should be considered. Note that the BB descriptor gives the sizes of all instructions in the basic block in the form of the start bit mask, and so the instruction size is not encoded in the opcodes. The start mask allows multiple instructions to be decoded in parallel without parsing the instruction stream; in effect it provides an extra bit of information for every 16 bits of the instruction stream.
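The idea of the start mask can be illustrated with a short sketch. The encoding assumed here is hypothetical and chosen only for illustration: bit i of the mask (least significant bit first) is taken to be set when an instruction starts at the i‑th 16‑bit parcel of the basic block, and each instruction extends to the next set bit or to the end of the block. The actual Basic Block Descriptor encoding is defined elsewhere in this document and may differ.

```python
def decode_start_mask(start_mask, parcel_count):
    """Return (parcel offset, size in bits) for each instruction in a block.

    Assumes bit i of start_mask is set when an instruction begins at 16-bit
    parcel i; this bit ordering is an assumption of the sketch.
    """
    starts = [i for i in range(parcel_count) if (start_mask >> i) & 1]
    instructions = []
    for index, start in enumerate(starts):
        end = starts[index + 1] if index + 1 < len(starts) else parcel_count
        instructions.append((start, 16 * (end - start)))
    return instructions


# Nine parcels holding a 32-bit, a 48-bit, and a 64-bit instruction.
print(decode_start_mask(0b000100101, 9))  # [(0, 32), (2, 48), (5, 64)]
```

Because every start position is available directly from the mask, each decoder can be pointed at its own instruction without first determining the sizes of the preceding ones.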

Assembler Mnemonics

Because identical or nearly identical instructions may exist in multiple sizes, a convention for distinguishing them is required. Since 32‑bit instructions are most common, these have the shortest form. Mnemonics for instruction sizes other than 32 bits are indicated by their first character:

Size  Mnemonic prefix
16    1
32    (none)
48    3
64    4

Instructions that calculate an effective address are distinguished by the first letter of their mnemonic: Address, Load, or Store. For loads and stores, the second letter of the mnemonic gives the register file that is loaded or stored: A for ARs, X for XRs, S for SRs, M for VMs, or V for VRs. (There are no loads or stores to the BRs.) The next field of the mnemonic is empty for word loads and stores, or the size (8, 16, 32, or 64, possibly 128?) for sub-word loads and stores to the XRs or SRs. Word stores must be word-aligned, but 64‑bit (possibly 128?) sub-word stores may be misaligned and generate an integer tag. Sub-word loads for 8, 16, or 32 bits next include S for signed or U for unsigned. Finally, the last letter is I for an immediate offset (as opposed to an XR[b] offset).

As examples of the above rules: A stores the address calculation AR[a] +p XR[b]<<sa to destination AR[d], while 1AI stores the address calculation AR[a] +p imm8. LA and LAI load the contents of memory at those two address calculations into the destination AR[d]. LX32U loads XR[d] from an unsigned 32‑bit memory location located using an XR[b] offset, and LS16SI loads SR[d] from a signed 16‑bit memory location located using an immediate offset.
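The naming rules for address, load, and store mnemonics can be read as a small grammar. The following sketch is one possible reading of those rules; the regular expression, the function `parse_mnemonic`, and the dictionary keys are hypothetical names chosen for illustration, and the size prefixes 1, 3, and 4 are assumed to denote 16‑, 48‑, and 64‑bit instructions.

```python
import re

# Illustrative pattern for the rules above; not an official grammar.
_MNEMONIC = re.compile(
    r"^(?P<prefix>[134]?)"                            # size prefix (none = 32-bit)
    r"(?:(?P<addr>A)|(?P<op>[LS])(?P<file>[AXSMV]))"  # Address, or Load/Store + file
    r"(?P<width>8|16|32|64|128)?"                     # sub-word size, empty for word
    r"(?P<sign>[SU])?"                                # signed/unsigned sub-word load
    r"(?P<imm>I)?$"                                   # immediate offset, else XR[b]
)

SIZES = {"": 32, "1": 16, "3": 48, "4": 64}
FILES = {"A": "AR", "X": "XR", "S": "SR", "M": "VM", "V": "VR"}


def parse_mnemonic(mnemonic: str) -> dict:
    """Split an address/load/store mnemonic into its fields."""
    match = _MNEMONIC.match(mnemonic)
    if match is None:
        raise ValueError(f"not an address/load/store mnemonic: {mnemonic!r}")
    if match["addr"]:
        kind, reg_file = "address", "AR"  # address results are written to AR[d]
    else:
        kind = "load" if match["op"] == "L" else "store"
        reg_file = FILES[match["file"]]
    return {
        "instruction_bits": SIZES[match["prefix"]],
        "kind": kind,
        "register_file": reg_file,
        "width": int(match["width"]) if match["width"] else None,
        "signed": {"S": True, "U": False, None: None}[match["sign"]],
        "immediate": match["imm"] is not None,
    }
```

For instance, `parse_mnemonic("LS16SI")` reports a 32‑bit instruction that loads a signed 16‑bit value into an SR using an immediate offset, matching the prose example above.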

Arithmetic instructions use the operation (e.g. ADD or SUB) with a suffix X or S for the register file of the source and destination operands. If an immediate value is one of the operands, a final I is appended. For vector operations the suffixes are VV for vector-vector, VS for vector-scalar, and VI for vector-immediate.

As examples of the above rules:

Assembler Simplified meaning (ignoring details)
ADDXI d, a, imm XR[d] ← XR[a] + imm
ADDS d, a, b SR[d] ← SR[a] + SR[b]

For Floating-Point operations, F is used for IEEE754 binary32 (single-precision), D is used for IEEE754 binary64 (double-precision), H is used for IEEE754 binary16 (half-precision), B is used for the non-standard Machine Learning (ML) 16‑bit Brain Float format, and P3, P4, and P5 are used for the three proposed IEEE binary8pp formats for ML quarter-precision (8‑bit) with 5‑bit, 4‑bit, and 3‑bit exponents, respectively. Q is reserved for a future IEEE754 binary128 (quad-precision).

Some floating-point examples are as follows:

Assembler Simplified meaning (ignoring details) Comment
FNMADDS d, a, b, c SR[d] ← −(SR[a] ×f SR[b]) +f SR[c]
DMADDVS d, a, b, c VR[d] ←  (VR[a] ×d SR[b]) +d VR[c]
P4MBSUBVV d, c, a, b VR[d] ←  (VR[a] ×p4b VR[b]) −b VR[c] P4 widening to BF multiply-subtract
fmt Floating-Point Precision schema
Mnemonic  Definition  Comment                               Exp  Prec
Q         binary128   quadruple-precision                   15   113
D         binary64    double-precision                      11   53
F         binary32    single-precision                      8    24
H         binary16    half-precision                        5    11
B         bfloat16    ML alternative to half-precision      8    8
P5        binary8p5   quarter-precision for ML alternative  3    5
P4        binary8p4   quarter-precision for ML alternative  4    4
P3        binary8p3   quarter-precision for ML              5    3

SecureRISC has not yet considered inclusion of NVIDIA’s Tensor Float format.
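As a quick consistency check on the Exp and Prec columns of the precision table, the sketch below computes the storage width of each format, assuming the usual layout of one sign bit, the exponent field, and a stored significand that omits the implicit leading bit (so Prec − 1 bits). The names `FORMATS` and `storage_bits` are illustrative only.

```python
# (exponent bits, precision bits) as listed in the precision table.
# Precision counts the implicit leading significand bit, so only
# (precision - 1) significand bits are stored.
FORMATS = {
    "Q": (15, 113),
    "D": (11, 53),
    "F": (8, 24),
    "H": (5, 11),
    "B": (8, 8),
    "P5": (3, 5),
    "P4": (4, 4),
    "P3": (5, 3),
}


def storage_bits(fmt: str) -> int:
    """Total encoding width: sign + exponent + stored significand bits."""
    exp_bits, precision = FORMATS[fmt]
    return 1 + exp_bits + (precision - 1)


widths = {fmt: storage_bits(fmt) for fmt in FORMATS}
```

Under that assumption the three P formats all come out at 8 bits and H and B both at 16 bits, which matches their descriptions as quarter- and half-precision alternatives.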

In the following sections, a set of instructions is sometimes defined with a mnemonic schema using the following:

What Schema Mnemonic Definition Comment
Operation Mnemonic schemas for ARs
Address
Comparison
ac EQ x63..0 = y63..0
NE x63..0 ≠ y63..0
LTU x63..0 <u y63..0
GEU x63..0 ≥u y63..0
TEQ x71..64 = y7..0 tag equal
TNE x71..64 ≠ y7..0 tag not-equal
TLTU x71..64 <u y7..0 tag less than
TGEU x71..64 ≥u y7..0 tag greater than or equal
WEQ x71..0 = y71..0 word equal
WNE x71..0 ≠ y71..0 word not-equal
WLTU x71..0 <u y71..0 word less than
WGEU x71..0 ≥u y71..0 word greater than or equal
Operation Mnemonic schemas for XRs
Index
Arithmetic
xa ADD x63..0 + y63..0 mod 2^64 addition
SUB x63..0 − y63..0 mod 2^64 subtraction
MINUminu(x63..0, y63..0)
MINSmins(x63..0, y63..0)
MINUSminus(x63..0, y63..0)
MAXUmaxu(x63..0, y63..0)
MAXSmaxs(x63..0, y63..0)
MAXUSmaxus(x63..0, y63..0)
Index Logical xl ANDx63..0 & y63..0
ORx63..0 | y63..0
XORx63..0 ^ y63..0
SLLx63..0 <<u y5..0
SRLx63..0 >>u y5..0
SRAx63..0 >>s y5..0
Index
Comparison
xc EQ x63..0 = y63..0
NE x63..0 ≠ y63..0
LTU x63..0 <u y63..0
LT x63..0 <s y63..0
GEU x63..0 ≥u y63..0
GE x63..0 ≥s y63..0
NONE (x63..0 & y63..0) = 0 Check statistics
ANY (x63..0 & y63..0) ≠ 0 Check statistics
ALL (x63..0 & ~y63..0) = 0 Check statistics
NALL (x63..0 & ~y63..0) ≠ 0 Check statistics
BITC xy5..0 = 0 Check statistics
BITS xy5..0 ≠ 0 Check statistics
TEQ x71..64 = y7..0 tag equal
TNE x71..64 ≠ y7..0 tag not-equal
TLTU x71..64 <u y7..0 tag less than
TGEU x71..64 ≥u y7..0 tag greater than or equal
WEQ x71..0 = y71..0 word equal
WNE x71..0 ≠ y71..0 word not-equal
WLTU x71..0 <u y71..0 word less than
WGEU x71..0 ≥u y71..0 word greater than or equal
Operation Mnemonic schemas for SRs, BRs, VRs, and VMs
Boolean bo ANDx & y
ANDTCx & ~y
NAND~(x & y)
NOR~(x | y)
ORx | y
ORTCx | ~y
XORx ^ y
EQVx ^ ~y
Boolean
accumulation
ba ANDx & y
ORx | y OR with b0 used by assembler for non-accumulation
Integer
Comparison
ic EQ x63..0 = y63..0
NE x63..0 ≠ y63..0
LTU x63..0 <u y63..0
LT x63..0 <s y63..0
GEU x63..0 ≥u y63..0
GE x63..0 ≥s y63..0
NONE (x63..0 & y63..0) = 0 Check statistics
ANY (x63..0 & y63..0) ≠ 0 Check statistics
ALL (x63..0 & ~y63..0) = 0 Check statistics
NALL (x63..0 & ~y63..0) ≠ 0 Check statistics
BITC xy5..0 = 0 Check statistics
BITS xy5..0 ≠ 0 Check statistics
Integer
Arithmetic
io ADD x63..0 + y63..0 mod 2^64 addition
SUB x63..0 − y63..0 mod 2^64 subtraction
ADDOU x63..0 +u y63..0 Trap on unsigned overflow
ADDOS x63..0 +s y63..0 Trap on signed overflow
ADDOUS x63..0 +us y63..0 Trap on unsigned-signed overflow
SUBOU x63..0 −u y63..0 Trap on unsigned overflow
SUBOS x63..0 −s y63..0 Trap on signed overflow
SUBOUS x63..0 −us y63..0 Trap on unsigned-signed overflow
MINU minu(x63..0, y63..0)
MINS mins(x63..0, y63..0)
MINUS minus(x63..0, y63..0)
MAXU maxu(x63..0, y63..0)
MAXS maxs(x63..0, y63..0)
MAXUS maxus(x63..0, y63..0)
MUL x63..0 × y63..0 least-significant 64 bits of product
MULOU x63..0 ×u y63..0 Trap on unsigned overflow
MULOS x63..0 ×s y63..0 Trap on signed overflow
MULUS x63..0 ×us y63..0 Trap on unsigned-signed overflow
Integer
1-operand
(should these
be logical
accumulations
instead?)
a1 NEG− x63..0 negate
ABSabs(x63..0) absolute value
POPCpopcount(x63..0) count number of one bits
COUNTScountsign(x63..0) count most-significant bits equal to sign bit
COUNTMS0countms0(x63..0)
COUNTMS1countms1(x63..0)
COUNTLS0countls0(x63..0)
COUNTLS1countls1(x63..0)
Integer
Arithmetic
accumulation
ia ADDx63..0 + y63..0
SUBx63..0 − y63..0
y63..0 Non-accumulation
Mnemonic omitted in assembler:
e.g. just ADDS d, a, b
is encoded with this ia to perform
SR[d] ← SR[a] + SR[b]
Bitwise Logical lo ANDx63..0 & y63..0
ANDTCx63..0 & ~y63..0
NAND~(x63..0 & y63..0)
NOR~(x63..0 | y63..0)
ORx63..0 | y63..0
ORTCx63..0 | ~y63..0
XORx63..0 ^ y63..0
EQVx63..0 ^ ~y63..0
SLLx63..0 <<u y5..0
SRLx63..0 >>u y5..0
SRAx63..0 >>s y5..0
CLMULx63..0 ⊗ y63..0Carryless multiplication
Bitwise
Logical
accumulation
la ANDx63..0 & y63..0
ORx63..0 | y63..0
XORx63..0 ^ y63..0Primarily for CLMUL
y63..0 Non-accumulation
Mnemonic omitted in assembler:
e.g. just ANDS d, [c,] a, b
is encoded with this la to perform
SR[d] ← SR[a] & SR[b]
with SR[c] ignored.
Floating-Point
Arithmetic
fo ADDx +fmt y
SUBx −fmt y
MINminfmt(x, y)
MAXmaxfmt(x, y)
MINMminmagfmt(x, y)
MAXMmaxmagfmt(x, y)
Mx ×fmt y
NM−(x ×fmt y) negative multiply
Mww(x) ×w w(y) widening multiply
NMw−(w(x) ×w w(y)) widening negative multiply
DIVx63..0 ÷fmt y63..0 Must be no-accumulation
Floating-Point
accumulation
fa ADDx +fmt y
SUBx −fmt y
y63..0 Non-accumulation
Mnemonic omitted in assembler:
e.g. just DADDS d, a, b
is encoded with this fa to perform
SR[d] ← SR[a] +d SR[b]
Floating-Point
1-operand
f1 MOV x
NEG −fmt x63..0
ABS absfmt(x63..0)
RECIP 1.0 ÷fmt x63..0
SQRT sqrtfmt(x63..0)
RSQRT rsqrtfmt(x63..0)
FLOOR floorfmt(x63..0)
CEIL ceilfmt(x63..0)
TRUNC truncfmt(x63..0)
ROUND roundfmt(x63..0)
CVTI converti,fmt(x63..0)
CVTB convertb,fmt(x63..0)
CVTH converth,fmt(x63..0)
CVTF convertf,fmt(x63..0)
CVTD convertd,fmt(x63..0)
FLOATU floatfmt,u(x63..0, imm)
FLOATS floatfmt,s(x63..0, imm)
CLASS classfmt(x63..0)
Floating-Point
Comparison
fc OR x63..0 ?fmt y63..0 ordered
UN x63..0 ~?fmt y63..0 unordered
EQ x63..0 =fmt y63..0
NE x63..0 ≠fmt y63..0
LT x63..0 <fmt y63..0
GE x63..0 ≥fmt y63..0
LE x63..0 ≤fmt y63..0
GT x63..0 >fmt y63..0
Schema combinations for the late pipeline
Class Schema Definition Examples
Integer Arithmetic ioia SR[d] ← SR[c] ia (SR[a] io SR[b]) MADDS
Bitwise Logical lola SR[d] ← SR[c] la (SR[a] lo SR[b]) ANDORS
Floating-Point fofa SR[d] ← SR[c] fafmt (SR[a] fofmt SR[b]) DNMSUBS
Boolean boba BR[d] ← BR[c] ba (BR[a] bo BR[b]) ANDORS
VM[d] ← VM[c] ba (VM[a] bo VM[b]) ORANDM
Integer Comparison icba BR[d] ← BR[c] ba (SR[a] ic SR[b]) LTUANDS
Floating-Point Comparison fcba BR[d] ← BR[c] ba (SR[a] fc SR[b]) DLTANDS
logical operation encodings
Value  Mnemonic  Function  Mnemonic  Function
0000   F         0
0001   NOR       a ~| b    ANDCC     ~a & ~b
0010   ANDTC     a & ~b
0011   NOTB      ~b
0100   ANDCT     ~a & b
0101   NOTA      ~a
0110   XOR       a ^ b
0111   NAND      a ~& b    ORCC      ~a | ~b
1000   AND       a & b
1001   EQV       a ~^ b    XNOR      ~(a ^ b)
1010   A         a
1011   ORTC      a | ~b
1100   B         b
1101   ORCT      ~a | b
1110   OR        a | b
1111   T         1
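The sixteen encodings in the logical operation table are consistent with reading the 4‑bit value as a truth table in which bit (2·b + a) gives the result for input bits a and b. The sketch below applies a code bitwise under that reading; the function name `apply_logic_op` and the indexing convention are inferences from the table, not a statement of how an implementation must work.

```python
def apply_logic_op(code: int, a: int, b: int, width: int = 64) -> int:
    """Apply a 4-bit logical function code bitwise to a and b.

    Assumption: bit (2*b_i + a_i) of `code` is the output for the input
    bit pair (a_i, b_i), which reproduces every row of the table.
    """
    mask = (1 << width) - 1
    result = 0
    for index in range(4):
        if (code >> index) & 1:
            term_a = a if index & 1 else ~a  # bit 0 of index selects a or ~a
            term_b = b if index & 2 else ~b  # bit 1 of index selects b or ~b
            result |= term_a & term_b
    return result & mask


# 4-bit operands that cover all four (a, b) input combinations.
A, B = 0b1100, 0b1010
and_result = apply_logic_op(0b1000, A, B, width=4)
andtc_result = apply_logic_op(0b0010, A, B, width=4)
```

With operands 1100 and 1010, each result simply equals the 4‑bit code with its bits permuted by the operand pattern, which makes it easy to check rows such as ANDTC (a & ~b) or EQV by hand.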
signed/unsigned encoding
m  What             Example mnemonic
0  Reserved
1  Unsigned         MINU
2  Signed           MAXS
3  Unsigned Signed  MINUS
overflow signed/unsigned encoding
m  What                      Example mnemonic
0  wrap                      ADD
1  Overflow Unsigned         SUBOU
2  Overflow Signed           MULOS
3  Overflow Unsigned Signed  ADDOUS
rounding mode encoding
field TBD  Static                          Dynamic
0          Nearest, ties to Even
1          Round to odd
2          Nearest, ties to Min Magnitude
3          Nearest, ties to Max Magnitude
4          Toward −∞ (floor)
5          Toward +∞ (ceiling)
6          Toward 0 (truncate)
7          Dynamic                         Away from 0
load/store size encoding
field TBD  Data width  Aligned  MemTag check  Examples
0 8 240..245 LX8U, LS8SI
1 16 240..245 LS16S, LS16UI
2 32 240..245 SX32I, SS32
3 64 240..252 LX64UI, SS64
4 72 word LAI, LX, LS, SSI
5 144 doubleword LAD, SADI
6 144 doubleword 232/251 LAC, SAC
7 64 clique CLA64, CSA64
load/store ordering encoding
field TBD Mnemonic Semantics
0 Neither acquire nor release
1 .a Acquire
2 .r Release
3 .ar Acquire and Release

The table below lists the indexed load/store opcode mnemonics, but the same encoding is used for the immediate offset opcodes (i.e. with the appended I suffix). The {L,S}{X,S}128 instructions marked with a ? are possible placeholders for future code density instructions that expand into a pair of load or store instructions, similar to the existing {L,S}{X,S}D instructions.

load/store operation encoding
field
TBD
Reg
file
Operation field TBD
0 1 2 3 4 5 6 7
0 XR Load Unsigned LX8U LX16U LX32U LX64U LX LXD LX128?
1 SR Load Unsigned LS8U LS16U LS32U LS64U LS LSD LS128?
2 XR Speculative Load Unsigned SLX8U SLX16U SLX32U SLX64U SLX SLXD
3 SR Speculative Load Unsigned SLS8U SLS16U SLS32U SLS64U SLS SLSD
4 XR Load Signed LX8S LX16S LX32S LX64S
5 SR Load Signed LS8S LS16S LS32S LS64S
6 XR Speculative Load Signed SLX8S SLX16S SLX32S SLX64S
7 SR Speculative Load Signed SLS8S SLS16S SLS32S SLS64S
8 AR Load RLA32 RLA64 LA LAD LAC CLA64
9 VM Load LM
10 AR Speculative Load SRLA32 SRLA64 SLA SLAD SLAC SCLA64
11 Reserved
12 XR Store SX8 SX16 SX32 SX64 SX SXD SX128?
13 SR Store SS8 SS16 SS32 SS64 SS SSD SS128?
14 AR Store RSA32 RSA64 SA SAD SAC CSA64
15 VM Store SM
scalar vector encoding
n Suffix What Example m usage f usage
0 S Scalar
integer
SR[d] ← SR[c] ia (SR[a] io SR[b]) su or osu
B Boolean BR[d] ← BR[c] ba (BR[a] bo BR[b])
S Scalar
floating
SR[d] ← SR[c] fafmt (SR[a] fofmt SR[b]) round
1 SV Vector reduction
to scalar integer
SR[d] ← reduction(SR[a], VR[b])
masked by VM[m] and n
vector mask
SV Vector reduction
to scalar floating
SR[d] ← reductionfmt(SR[a], VR[b])
masked by VM[m] and n
vector mask round
M Mask VM[d] ← VM[c] ba (VM[a] bo VM[b])
2 VS Vector Scalar
integer
VR[d] ← VR[c] ia (VR[a] io SR[b])
masked by VM[m] and n
vector
mask
VS Vector Scalar
floating
VR[d] ← VR[c] fafmt (VR[a] fofmt SR[b])
masked by VM[m] and n
vector
mask
round
VI Vector Immediate
integer
VR[d] ← VR[a] io imm
masked by VM[m] and n
vector
mask
VS Vector Scalar
integer compare
VM[d] ← VM[c] ba (VR[a] ic SR[b])
masked by VM[m] and n
VI Vector Immediate
integer compare
VM[d] ← VR[a] ic imm
masked by VM[m] and n
VS Vector Scalar
floating compare
VM[d] ← VM[c] ba (VR[a] fcfmt SR[b])
masked by VM[m] and n
3 VV Vector Vector
integer
VR[d] ← VR[c] ia (VR[a] io VR[b])
masked by VM[m] and n
vector
mask
VV Vector Vector
floating
VR[d] ← VR[c] fafmt (VR[a] fofmt VR[b])
masked by VM[m] and n
vector
mask
round
VV Vector Vector
integer compare
VM[d] ← VM[c] ba (VR[a] ic VR[b])
masked by VM[m] and n
VV Vector Vector
floating compare
VM[d] ← VM[c] ba (VR[a] fcfmt VR[b])
masked by VM[m] and n

Vector operations write only the destination elements enabled by the vector mask operand. Destination element i is written if VM[m]i = n. Since VM[0] is hardwired to 0, setting m to 0 and n to 0 writes unconditionally. The combination of m = 0 and n = 1 is reserved.
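The write-enable rule can be modelled in a few lines. The sketch below represents each vector mask as an integer whose bit i belongs to element i; the function name `masked_write`, the list-based vectors, and the choice to raise an error for the reserved combination are assumptions made for illustration.

```python
def masked_write(dest, result, vm, m, n):
    """Return the destination vector after a masked write.

    dest, result: lists of element values (old and newly computed).
    vm: list of mask registers as integers; VM[0] is treated as all zeros.
    Element i is written when bit i of VM[m] equals n.
    """
    if m == 0 and n == 1:
        # The text marks this combination as reserved; raising is an
        # illustrative choice, not specified behavior.
        raise ValueError("m = 0 with n = 1 is reserved")
    mask = 0 if m == 0 else vm[m]
    return [
        new if ((mask >> i) & 1) == n else old
        for i, (old, new) in enumerate(zip(dest, result))
    ]


vm = [0, 0b0101]
old, new = [0, 0, 0, 0], [1, 2, 3, 4]
where_set = masked_write(old, new, vm, m=1, n=1)      # elements 0 and 2
where_clear = masked_write(old, new, vm, m=1, n=0)    # elements 1 and 3
unconditional = masked_write(old, new, vm, m=0, n=0)  # every element
```

Having n select the polarity means a single mask register can drive both arms of an if/else without first computing its complement.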

The following is a sketch of the 16‑bit instruction encodings; the actual encodings will be determined by analyzing instruction frequency in the 32‑bit instruction set.

16‑bit op16
1:0
3:2
0 1 2 3
0 1A 1LA 1SA i16da
1 1AI 1LAI 1SAI i16ab0
2 1ADDX 1LX 1SX 1XI
3 1ADDXI 1LXI 1SXI i16ab1
16‑bit instruction format destination 2 source
15 12 11 8 7 4 3 0
b a d op16
4 4 4 4
Word address calculation with indexed addressing: 1A
1Ad, a, b v ← AV[a] & XV[b]
trap if v & boundscheck(AR[a], XR[b]<<3) = 0
AR[d] ← AR[a] +p XR[b]<<3
AV[d] ← v
Index register addition
1ADDXd, a, b XR[d] ← XR[a] + XR[b]
XV[d] ← XV[a] & XV[b]
Non-speculative tagged word loads with indexed addressing: L{A,X,S}
(LS in 32‑bit table)
1LAd, a, b v ← AV[a] & XV[b]
trap if v & boundscheck(AR[a], XR[b]<<3) = 0
AR[d] ← v ? lvload72(AR[a] +p XR[b]<<3) : 0
AV[d] ← v
1LXd, a, b v ← AV[a] & XV[b]
trap if v & boundscheck(AR[a], XR[b]<<3) = 0
XR[d] ← v ? lvload72(AR[a] +p XR[b]<<3) : 0
XV[d] ← v
16‑bit instruction format destination source immediate
15 12 11 8 7 4 3 0
imm4 a d op16
4 4 4 4
Word address calculation with immediate addressing: AI
1AId, a, imm4 v ← AV[a]
trap if v & boundscheck(AR[a], imm4<<3) = 0
AR[d] ← AR[a] +p imm4<<3
AV[d] ← v
Index register addition immediate
1ADDXId, a, imm4 XR[d] ← XR[a] + imm4
XV[d] ← XV[a]
Non-speculative tagged word loads with immediate addressing: L{A,X,S}I
(LSI and wider immediate LA and LX in 32‑bit table)
1LAId, a, imm4 v ← AV[a]
trap if v & boundscheck(AR[a], imm4<<3) = 0
AR[d] ← v ? lvload72(AR[a] +p imm4<<3) : 0
AV[d] ← v
1LXId, a, imm4 v ← AV[a]
trap if v & boundscheck(AR[a], imm4<<3) = 0
XR[d] ← v ? lvload72(AR[a] +p imm4<<3) : 0
XV[d] ← v
16‑bit op16da
13:12
15:14
0 1 2 3
0 1NEGX 1NOTX 1MOVSX 1MOVXS
1 1RTAGX 1RTAGA 1RSIZEA
2 1MOVAX 1MOVXA 1MOVAS 1MOVSA
3 1SOBX 1RTAGS
16‑bit instruction format destination 1 source
15 12 11 8 7 4 3 0
op16da a d i16da
4 4 4 4
1NEGXd, a XR[d] ← −XR[a]
XV[d] ← XV[a]
1NOTXd, a XR[d] ← ~XR[a]
XV[d] ← XV[a]
1MOVAXd, a AR[d] ← XR[a]
AV[d] ← XV[a]
1MOVXAd, a XR[d] ← AR[a]
XV[d] ← AV[a]
1MOVSXd, a SR[d] ← XR[a]
SV[d] ← XV[a]
1MOVXSd, a XR[d] ← SR[a]
XV[d] ← SV[a]
1MOVASd, a AR[d] ← SR[a]
AV[d] ← SV[a]
1MOVSAd, a SR[d] ← AR[a]
SV[d] ← AV[a]
1RTAGAd, a XR[d] ← 240 ∥ 056 ∥ AR[a]71..64
XV[d] ← AV[a]
1RTAGXd, a XR[d] ← 240 ∥ 056 ∥ XR[a]71..64
XV[d] ← XV[a]
1RTAGSd, a SR[d] ← 240 ∥ 056 ∥ SR[a]71..64
SV[d] ← SV[a]
1RSIZEAd, a XR[d] ← 240 ∥ 03 ∥ AR[a]132..72
XV[d] ← AV[a]
1SOBX trap if XV[a] = 0
XR[d] ← XR[a] − 1
XV[d] ← 1
loop back if XR[d] ≠ 0

1NEGX is identical to RSUBXI with an immediate of 0, and 1NOTX is identical to RSUBXI with an immediate of −1, but each is 16 bits rather than 32. Whether this matters is unclear; whether to include them will depend on code size statistics.
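The two equivalences follow from two's-complement arithmetic: 0 − x is the negation of x, and −1 − x is its bitwise complement. The sketch below checks this on the 64‑bit data field only, ignoring tags and valid bits; `rsubxi`, `neg64`, and `not64` are illustrative helper names.

```python
MASK64 = (1 << 64) - 1


def rsubxi(imm: int, x: int) -> int:
    """Data-field model of RSUBXI: (imm - x) mod 2**64."""
    return (imm - x) & MASK64


def neg64(x: int) -> int:
    """Two's-complement negation on 64 bits."""
    return (-x) & MASK64


def not64(x: int) -> int:
    """Bitwise complement on 64 bits."""
    return ~x & MASK64


samples = [0, 1, 2, 0x7FFFFFFFFFFFFFFF, 0x8000000000000000, MASK64]
neg_matches = all(rsubxi(0, x) == neg64(x) for x in samples)
not_matches = all(rsubxi(-1, x) == not64(x) for x in samples)
```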

16‑bit op16ab0
5:3
7:6
0 1 2 3
0 1BEQA 1BNEA 1BLTAU 1BGEAU
1 1BEQX 1BNEX 1BLTXU 1BGEXU
2 1BNONEX 1BANYX 1BLTX 1BGEX
3 i16a0
16‑bit op16ab1
5:4
7:6
0 1 2 3
0 1TEQA 1TNEA 1TLTAU 1TGEAU
1 1TEQX 1TNEX 1TLTXU 1TGEXU
2 1TNONEX 1TANYX 1TLTX 1TGEX
3 i16a1
16‑bit instruction format 2 source
15 12 11 8 7 4 3 0
b a op16ab i16ab
4 4 4 4

All of the following first do either
trap if AV[a] & AV[b] = 0
or
trap if XV[a] & XV[b] = 0
as appropriate.

1BEQAa, b branch if AR[a] = AR[b]
1BEQXa, b branch if XR[a] = XR[b]
1BNEAa, b branch if AR[a] ≠ AR[b]
1BNEXa, b branch if XR[a] ≠ XR[b]
1BLTAUa, b branch if AR[a] <u AR[b]
1BLTXUa, b branch if XR[a] <u XR[b]
1BGEAUa, b branch if AR[a] ≥u AR[b]
1BGEXUa, b branch if XR[a] ≥u XR[b]
1BLTXa, b branch if XR[a] <s XR[b]
1BGEXa, b branch if XR[a] ≥s XR[b]
1BNONEXa, b branch if (XR[a] & XR[b]) = 0
1BANYXa, b branch if (XR[a] & XR[b]) ≠ 0
1TEQAa, b trap if AR[a] = AR[b]
1TEQXa, b trap if XR[a] = XR[b]
1TNEAa, b trap if AR[a] ≠ AR[b]
1TNEXa, b trap if XR[a] ≠ XR[b]
1TLTAUa, b trap if AR[a] <u AR[b]
1TLTXUa, b trap if XR[a] <u XR[b]
1TGEAUa, b trap if AR[a] ≥u AR[b]
1TGEXUa, b trap if XR[a] ≥u XR[b]
1TLTXa, b trap if XR[a] <s XR[b]
1TGEXa, b trap if XR[a] ≥s XR[b]
1TNONEXa, b trap if (XR[a] & XR[b]) = 0
1TANYXa, b trap if (XR[a] & XR[b]) ≠ 0
16‑bit op16a0
13:12
15:14
0 1 2 3
0 1BEQNA 1BNENA 1BF 1BT
1 1BEQZX 1BNEZX 1BLTZX 1BGEZX
2 1JMPA
3 1SWITCHX 1BLEZX 1BGTZX
16‑bit op16a1
13:12
15:14
0 1 2 3
0 1TEQNA 1TNENA 1TF 1TT
1 1TEQZX 1TNEZX 1TLTZX 1TGEZX
2 1CHKVA 1CHKVX 1CHKVS
3 1TLEZX 1TGTZX
16‑bit instruction format 1 source
15 12 11 8 7 4 3 0
op16a a i16a i16ab
4 4 4 4

All of the following first do either
trap if AV[a] = 0
or
trap if XV[a] = 0
or
trap if BV[a] = 0
as appropriate.

1BEQNAa branch if AR[a]63..0 = 0
1BNENAa branch if AR[a]63..0 ≠ 0
1BEQZXa branch if XR[a]63..0 = 0
1BNEZXa branch if XR[a]63..0 ≠ 0
1BLTZXa branch if XR[a]63..0 <s 0
1BGEZXa branch if XR[a]63..0 ≥s 0
1BLEZXa branch if XR[a]63..0 ≤s 0
1BGTZXa branch if XR[a]63..0 >s 0
1BFa branch if BR[a] = 0
1BTa branch if BR[a] ≠ 0
1TEQZXa trap if XR[a]63..0 = 0
1TNEZXa trap if XR[a]63..0 ≠ 0
1TLTZXa trap if XR[a]63..0 <s 0
1TGEZXa trap if XR[a]63..0 ≥s 0
1TLEZXa trap if XR[a]63..0 ≤s 0
1TGTZXa trap if XR[a]63..0 >s 0
1TFa trap if BR[a] = 0
1TTa trap if BR[a] ≠ 0
1JMPAa trap if AR[a]71..68 ≠ 12
trap if AR[a]2..0 ≠ 0
trap if PC66..64 ≠ AR[a]66..64
PC ← AR[a]66..0
1SWITCHXa trap if XR[a]71..65 ≠ 120
PC ← PC +p (XR[a]<<3)
1CHKVAa trap if AV[a] = 0
1CHKVXa trap if XV[a] = 0
1CHKVSa trap if SV[a] = 0
16‑bit instruction format destination immediate
15 8 7 4 3 0
imm8 d op16
8 4 4
1XId, imm8 XR[d] ← 240 ∥ imm8756 ∥ imm8
XV[d] ← 1
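As an illustration of the sign extension used by 1XI, the sketch below builds a 72‑bit register value, assuming the leading 240 is the 8‑bit tag in bits 71..64 and that the 8‑bit immediate is sign-extended to fill the 64‑bit data field. The names `sign_extend` and `tagged_immediate` are hypothetical helpers.

```python
INTEGER_TAG = 240          # tag value prepended to integer results in the definitions
MASK64 = (1 << 64) - 1


def sign_extend(value: int, bits: int, width: int = 64) -> int:
    """Sign-extend the low `bits` bits of value to `width` bits."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return ((value ^ sign) - sign) & ((1 << width) - 1)


def tagged_immediate(imm8: int) -> int:
    """72-bit value: tag in bits 71..64, sign-extended imm8 in bits 63..0."""
    return (INTEGER_TAG << 64) | sign_extend(imm8, 8)


minus_one = tagged_immediate(0xFF)  # imm8 = -1
five = tagged_immediate(5)
```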
32‑bit op32
1:0
3:2
0 1 2 3
0 AXload AXstore SVload SVstore
1
2 ARop XRop SRop VRop
3 XI XUI SI SUI
32‑bit instruction format 3 sources 1 destination
31 28 27 24 23 22 21 20 19 16 15 12 11 8 7 4 3 0
op32g f n m c b a d op32
4 4 2 2 4 4 4 4 4
Scalar Integer
ioiaSd, c, a, b SR[d] ← SR[c] ia (SR[a] io SR[b])
SV[d] ← SV[a] & SV[b] & SV[c]
lolaSd, c, a, b SR[d] ← SR[c] la (SR[a] lo SR[b])
SV[d] ← SV[a] & SV[b] & SV[c]
SELSd, c, a, b SR[d] ← BR[c] ? SR[a] : SR[b]
SV[d] ← BV[c] & (BR[c] ? SV[a] : SV[b])
i1Sd, a SR[d] ← i1(SR[a])
SV[d] ← SV[a]
Scalar Integer Multiword
FUNSd, b, a, c t ← (SR[b]63..0∥SR[a]63..0) >> SR[c]5..0
SR[d] ← 240 ∥ t63..0
SV[d] ← SV[a] & SV[b] & SV[c]
ROTRSd, a, b assembler expands to FUNS d, a, a, b
FUNNSd, b, a, c t ← (SR[b]63..0∥SR[a]63..0) >> (−SR[c])5..0
SR[d] ← 240 ∥ t63..0
SV[d] ← SV[a] & SV[b] & SV[c]
ROTLSd, a, b assembler expands to FUNNS d, a, a, b
ADDCd, b, a trap if (SV[a] & SV[b]) = 0
t ← SR[a]63..0 + SR[b]63..0 + CARRY0
CARRY ← 063 ∥ t64
SR[d] ← 240 ∥ t63..0
SV[d] ← 1
MULCd, b, a, c trap if (SV[a] & SV[b] & SV[c]) = 0
t ← (SR[a]63..0 ×u SR[b]63..0) + SR[c]63..0 + CARRY
CARRY ← t127..64
SR[d] ← 240 ∥ t63..0
SV[d] ← 1
DIVCd, b, a, c trap if (SV[a] & SV[b]) = 0
q,r ← (CARRY∥SR[a]63..0) ÷u SR[b]63..0
CARRY ← r
SR[d] ← 240 ∥ q
SV[d] ← 1
Boolean
bobad, c, a, b BR[d] ← BR[c] ba (BR[a] bo BR[b])
BV[d] ← BV[a] & BV[b] & BV[c]
bobaMd, c, a, b VM[d] ← VM[c] ba (VM[a] bo VM[b])
Integer Comparison
acbaAd, c, a, b BR[d] ← BR[c] ba (AR[a] ac AR[b])
BV[d] ← AV[a] & AV[b] & BV[c]
xcbaXd, c, a, b BR[d] ← BR[c] ba (XR[a] xc XR[b])
BV[d] ← XV[a] & XV[b] & BV[c]
icbaSd, c, a, b BR[d] ← BR[c] ba (SR[a] ic SR[b])
BV[d] ← SV[a] & SV[b] & BV[c]
Floating-Point
Df1Sd, a SR[d] ← f1d(SR[a])
SV[d] ← SV[a]
Ff1Sd, a SR[d] ← f1f(SR[a])
SV[d] ← SV[a]
Hf1Sd, a SR[d] ← f1h(SR[a])
SV[d] ← SV[a]
Bf1Sd, a SR[d] ← f1b(SR[a])
SV[d] ← SV[a]
P4f1Sd, a SR[d] ← f1p4(SR[a])
SV[d] ← SV[a]
P3f1Sd, a SR[d] ← f1p3(SR[a])
SV[d] ← SV[a]
DfofaSd, c, a, b SR[d] ← SR[c] fad (SR[a] fod SR[b])
SV[d] ← SV[a] & SV[b] & SV[c]
FfofaSd, c, a, b SR[d] ← SR[c] faf (SR[a] fof SR[b])
SV[d] ← SV[a] & SV[b] & SV[c]
HfofaSd, c, a, b SR[d] ← SR[c] fah (SR[a] foh SR[b])
SV[d] ← SV[a] & SV[b] & SV[c]
BfofaSd, c, a, b SR[d] ← SR[c] fab (SR[a] fob SR[b])
SV[d] ← SV[a] & SV[b] & SV[c]
P4fofaSd, c, a, b SR[d] ← SR[c] fap4 (SR[a] fop4 SR[b])
SV[d] ← SV[a] & SV[b] & SV[c]
P3fofaSd, c, a, b SR[d] ← SR[c] fap3 (SR[a] fop3 SR[b])
SV[d] ← SV[a] & SV[b] & SV[c]
Boolean Floating-Point Comparison
DfcbaSd, c, a, b BR[d] ← BR[c] bad (SR[a] fcd SR[b])
BV[d] ← SV[a] & SV[b] & BV[c]
FfcbaSd, c, a, b BR[d] ← BR[c] baf (SR[a] fcf SR[b])
BV[d] ← SV[a] & SV[b] & BV[c]
HfcbaSd, c, a, b BR[d] ← BR[c] bah (SR[a] fch SR[b])
BV[d] ← SV[a] & SV[b] & BV[c]
BfcbaSd, c, a, b BR[d] ← BR[c] bab (SR[a] fcb SR[b])
BV[d] ← SV[a] & SV[b] & BV[c]
P4fcbaSd, c, a, b BR[d] ← BR[c] bap4 (SR[a] fcp4 SR[b])
BV[d] ← SV[a] & SV[b] & BV[c]
P3fcbaSd, c, a, b BR[d] ← BR[c] bap3 (SR[a] fcp3 SR[b])
BV[d] ← SV[a] & SV[b] & BV[c]
32‑bit instruction format with 2 sources 1 destination and 12‑bit immediate
31 28 27 20 19 16 15 12 11 8 7 4 3 0
op32g i c i a d op32
4 8 4 4 4 4 4
Index comparison immediate
xcbaXId, c, a, imm BR[d] ← BR[c] ba (XR[a] xc imm12)
BV[d] ← XV[a] & BV[c]
Scalar comparison immediate
icbaSId, c, a, imm BR[d] ← BR[c] ba (SR[a] ic imm12)
BV[d] ← SV[a] & BV[c]
Scalar arithmetic immediate
ioiaSId, c, a, imm SR[d] ← SR[c] ia (SR[a] io imm12)
SV[d] ← SV[a] & SV[c]
lolaSId, c, a, imm SR[d] ← SR[c] la (SR[a] lo imm12)
SV[d] ← SV[a] & SV[c]
SELSId, c, a, imm SR[d] ← BR[c] ? SR[a] : imm12
SV[d] ← BV[c] & (~BR[c] | SV[a])
32‑bit instruction format 2 sources 1 destination
31 28 27 22 21 20 19 16 15 12 11 8 7 4 3 0
op32g op32f m op32c b a d op32
4 6 2 4 4 4 4 4
Address arithmetic: SUBXAA
SUBXAAd, a, b XR[d] ← 240 ∥ (AR[a]63..0 − AR[b]63..0)
XV[d] ← AV[a] & AV[b]
Index arithmetic
ADDXd, a, b XR[d] ← 240 ∥ (XR[a]63..0 + XR[b]63..0)
XV[d] ← XV[a] & XV[b]
SUBXd, a, b XR[d] ← 240 ∥ (XR[a]63..0 − XR[b]63..0)
XV[d] ← XV[a] & XV[b]
MINUXd, a, b XR[d] ← 240 ∥ minu(XR[a]63..0, XR[b]63..0)
XV[d] ← XV[a] & XV[b]
MINSXd, a, b XR[d] ← 240 ∥ mins(XR[a]63..0, XR[b]63..0)
XV[d] ← XV[a] & XV[b]
MAXUXd, a, b XR[d] ← 240 ∥ maxu(XR[a]63..0, XR[b]63..0)
XV[d] ← XV[a] & XV[b]
MAXSXd, a, b XR[d] ← 240 ∥ maxs(XR[a]63..0, XR[b]63..0)
XV[d] ← XV[a] & XV[b]
Possible changes
ADDXd, a, b, sa XR[d] ← 240 ∥ (XR[a]63..0 + XR[b]63..0<<sa)
XV[d] ← XV[a] & XV[b]
SUBXd, a, b, sa XR[d] ← 240 ∥ (XR[a]63..0 − XR[b]63..0<<sa)
XV[d] ← XV[a] & XV[b]
Instructions for loop iteration count prediction
LOOPXd, a, b trap if XV[a] & XV[b] = 0
XR[d] ← XR[a] − XR[b]
XV[d] ← 1
Possible additions: ADDOUX, ADDOSX, ADDUSX, SUBOUX, SUBOSX, SUBUSX, MINOUSX, MAXOUSX
Index logical
ANDXd, a, b XR[d] ← 240 ∥ (XR[a]63..0 & XR[b]63..0)
XV[d] ← XV[a] & XV[b]
ORXd, a, b XR[d] ← 240 ∥ (XR[a]63..0 | XR[b]63..0)
XV[d] ← XV[a] & XV[b]
XORXd, a, b XR[d] ← 240 ∥ (XR[a]63..0 ^ XR[b]63..0)
XV[d] ← XV[a] & XV[b]
SLLXd, a, b XR[d] ← 240 ∥ (XR[a]63..0 <<u XR[b]5..0)
XV[d] ← XV[a] & XV[b]
SRLXd, a, b XR[d] ← 240 ∥ (XR[a]63..0 >>u XR[b]5..0)
XV[d] ← XV[a] & XV[b]
SRAXd, a, b XR[d] ← 240 ∥ (XR[a]63..0 >>s XR[b]5..0)
XV[d] ← XV[a] & XV[b]
Address calculation with index shift: A
Ad, a, b, sa v ← AV[a] & XV[b]
if v = 0 then
  AR[d] ← 0
  AV[d] ← 0
else
  ea ← AR[a] +p XR[b]<<sa
  trap if ea2..0 ≠ 03
  trap if boundscheck(AR[a], XR[b]<<sa) = 0
  AR[d] ← ea
  AV[d] ← 1
endif
Non-speculative tagged word loads with indexed addressing: L{A,X,S}
LAd, a, b, sa v ← AV[a] & XV[b]
if v = 0 then
  AR[d] ← 0
  AV[d] ← 0
else
  ea ← AR[a] +p XR[b]<<sa
  trap if ea2..0 ≠ 03
  trap if boundscheck(AR[a], XR[b]<<sa) = 0
  AR[d] ← sizedecode(lvload72(ea))
  AV[d] ← 1
endif
LXd, a, b, sa v ← AV[a] & XV[b]
if v = 0 then
  XR[d] ← 0
  XV[d] ← 0
else
  ea ← AR[a] +p XR[b]<<sa
  trap if ea2..0 ≠ 03
  trap if boundscheck(AR[a], XR[b]<<sa) = 0
  XR[d] ← lvload72(ea)
  XV[d] ← 1
endif
LSd, a, b, sa v ← AV[a] & XV[b]
if v = 0 then
  SR[d] ← 0
  SV[d] ← 0
else
  ea ← AR[a] +p XR[b]<<sa
  trap if ea2..0 ≠ 03
  trap if boundscheck(AR[a], XR[b]<<sa) = 0
  SR[d] ← lvload72(ea)
  SV[d] ← 1
endif
Non-speculative doubleword loads with indexed addressing: LAD (save/restore) and LAC (CHERI)
LADd, a, b, sa v ← AV[a] & XV[b]
if v = 0 then
  AR[d] ← 0
  AV[d] ← 0
else
  ea ← AR[a] +p XR[b]<<sa
  trap if ea2..0 ≠ 03
  trap if boundscheck(AR[a], XR[b]<<sa) = 0
  AR[d] ← lvload144(ea)
  AV[d] ← 1
endif
LACd, a, b, sa v ← AV[a] & XV[b]
if v = 0 then
  AR[d] ← 0
  AV[d] ← 0
else
  ea ← AR[a] +p XR[b]<<sa
  trap if ea2..0 ≠ 03
  trap if boundscheck(AR[a], XR[b]<<sa) = 0
  t ← lvload144(ea)
  trap if t71..67 ≠ 25
  trap if t143..136 ≠ 252
  AR[d] ← t
  AV[d] ← 1
endif
Non-speculative segment relative loads with indexed addressing: RLA{64,32}
RLA64d, a, b, sa v ← AV[a] & XV[b]
if v = 0 then
  AR[d] ← 0
  AV[d] ← 0
else
  ea ← AR[a] +p XR[b]<<sa
  trap if boundscheck(AR[a], XR[b]<<sa) = 0
  t ← lvload64(ea)
  AR[d] ← segrelative(AR[a], t)
  AV[d] ← 1
endif
RLA32d, a, b, sa v ← AV[a] & XV[b]
if v = 0 then
  AR[d] ← 0
  AV[d] ← 0
else
  ea ← AR[a] +p XR[b]<<sa
  trap if boundscheck(AR[a], XR[b]<<sa) = 0
  t ← lvload32(ea)
  AR[d] ← segrelative(AR[a], 032 ∥ t)
  AV[d] ← 1
endif
Non-speculative sub-word loads with indexed addressing: L{X,S}{8,16,32,64}{U,S}
LX64d, a, b, sa v ← AV[a] & XV[b]
trap if v & boundscheck(AR[a], XR[b]<<sa) = 0
t ← v ? lvload64(AR[a] +p XR[b]<<sa) : 0
XR[d] ← 240 ∥ t
XV[d] ← v
LS64d, a, b, sa v ← AV[a] & XV[b]
trap if v & boundscheck(AR[a], XR[b]<<sa) = 0
t ← v ? lvload64(AR[a] +p XR[b]<<sa) : 0
SR[d] ← 240 ∥ t
SV[d] ← v
LX32Ud, a, b, sa v ← AV[a] & XV[b]
trap if v & boundscheck(AR[a], XR[b]<<sa) = 0
t ← v ? lvload32(AR[a] +p XR[b]<<sa) : 0
XR[d] ← 240 ∥ 032 ∥ t
XV[d] ← v
LS32Ud, a, b, sa v ← AV[a] & XV[b]
trap if v & boundscheck(AR[a], XR[b]<<sa) = 0
t ← v ? lvload32(AR[a] +p XR[b]<<sa) : 0
SR[d] ← 240 ∥ 032 ∥ t
SV[d] ← v
LX32Sd, a, b, sa v ← AV[a] & XV[b]
trap if v & boundscheck(AR[a], XR[b]<<sa) = 0
t ← v ? lvload32(AR[a] +p XR[b]<<sa) : 0
XR[d] ← 240 ∥ t3132 ∥ t
XV[d] ← v
LS32Sd, a, b, sa v ← AV[a] & XV[b]
trap if v & boundscheck(AR[a], XR[b]<<sa) = 0
t ← v ? lvload32(AR[a] +p XR[b]<<sa) : 0
SR[d] ← 240 ∥ t3132 ∥ t
SV[d] ← v
LX16Ud, a, b, sa v ← AV[a] & XV[b]
trap if v & boundscheck(AR[a], XR[b]<<sa) = 0
t ← v ? lvload16(AR[a] +p XR[b]<<sa) : 0
XR[d] ← 240 ∥ 048 ∥ t
XV[d] ← v
LS16Ud, a, b, sa v ← AV[a] & XV[b]
trap if v & boundscheck(AR[a], XR[b]<<sa) = 0
t ← v ? lvload16(AR[a] +p XR[b]<<sa) : 0
SR[d] ← 240 ∥ 048 ∥ t
SV[d] ← v
LX16Sd, a, b, sa v ← AV[a] & XV[b]
trap if v & boundscheck(AR[a], XR[b]<<sa) = 0
t ← v ? lvload16(AR[a] +p XR[b]<<sa) : 0
XR[d] ← 240 ∥ t1548 ∥ t
XV[d] ← v
LS16Sd, a, b, sa v ← AV[a] & XV[b]
trap if v & boundscheck(AR[a], XR[b]<<sa) = 0
t ← v ? lvload16(AR[a] +p XR[b]<<sa) : 0
SR[d] ← 240 ∥ t1548 ∥ t
SV[d] ← v
LX8Ud, a, b, sa v ← AV[a] & XV[b]
trap if v & boundscheck(AR[a], XR[b]) = 0
t ← v ? lvload8(AR[a] +p XR[b]) : 0
XR[d] ← 240 ∥ 056 ∥ t
XV[d] ← v
LS8Ud, a, b, sa v ← AV[a] & XV[b]
trap if v & boundscheck(AR[a], XR[b]) = 0
t ← v ? lvload8(AR[a] +p XR[b]) : 0
SR[d] ← 240 ∥ 056 ∥ t
SV[d] ← v
LX8Sd, a, b, sa v ← AV[a] & XV[b]
trap if v & boundscheck(AR[a], XR[b]) = 0
t ← v ? lvload8(AR[a] +p XR[b]) : 0
XR[d] ← 240 ∥ t756 ∥ t
XV[d] ← v
LS8Sd, a, b, sa v ← AV[a] & XV[b]
trap if v & boundscheck(AR[a], XR[b]) = 0
t ← v ? lvload8(AR[a] +p XR[b]) : 0
SR[d] ← 240 ∥ t756 ∥ t
SV[d] ← v
Load vector mask instructions with indexed addressing
LMd, a, b, sa v ← AV[a] & XV[b]
trap if v = 0
ea ← AR[a] +p XR[b]<<sa
trap if boundscheck(AR[a], XR[b]<<sa) = 0
VM[d] ← lvload128(ea)
Speculative tagged word loads with indexed addressing: SL{A,X,S}
SLAd, a, b, sa v ← AV[a] & XV[b] & boundscheck(AR[a], XR[b]<<sa)
if v = 0 then
  AR[d] ← 0
  AV[d] ← 0
else
  ea ← AR[a] +p XR[b]<<sa
  trap if ea2..0 ≠ 03
  AR[d] ← sizedecode(lvload72(ea))
  AV[d] ← 1
endif
SLXd, a, b, sa v ← AV[a] & XV[b] & boundscheck(AR[a], XR[b]<<sa)
if v = 0 then
  XR[d] ← 0
  XV[d] ← 0
else
  ea ← AR[a] +p XR[b]<<sa
  trap if ea2..0 ≠ 03
  XR[d] ← lvload72(ea)
  XV[d] ← 1
endif
SLSd, a, b, sa v ← AV[a] & XV[b] & boundscheck(AR[a], XR[b]<<sa)
if v = 0 then
  SR[d] ← 0
  SV[d] ← 0
else
  ea ← AR[a] +p XR[b]<<sa
  trap if ea2..0 ≠ 03
  SR[d] ← lvload72(ea)
  SV[d] ← 1
endif
Speculative sub-word loads with indexed addressing: SL{X,S}{8,16,32,64}{U,S}
(TBD)
32‑bit instruction format with 1 source 1 destination and 12‑bit immediate
31 28 27 20 19 16 15 12 11 8 7 4 3 0
op32g i op32c i a d op32
4 8 4 4 4 4 4
Index arithmetic immediate
ADDXId, a, imm XR[d] ← 240 ∥ (XR[a]63..0 + imm121152∥imm12)
XV[d] ← XV[a]
ANDXId, a, imm XR[d] ← 240 ∥ (XR[a]63..0 & imm121152∥imm12)
XV[d] ← XV[a]
MINUXId, a, imm XR[d] ← 240 ∥ minu(XR[a]63..0, imm121152∥imm12)
XV[d] ← XV[a]
MINSXId, a, imm XR[d] ← 240 ∥ mins(XR[a]63..0, imm121152∥imm12)
XV[d] ← XV[a]
MAXUXId, a, imm XR[d] ← 240 ∥ maxu(XR[a]63..0, imm121152∥imm12)
XV[d] ← XV[a]
MAXSXId, a, imm XR[d] ← 240 ∥ maxs(XR[a]63..0, imm121152∥imm12)
XV[d] ← XV[a]
RSUBXId, imm, a XR[d] ← 240 ∥ ((imm121152∥imm12) − XR[a]63..0)
XV[d] ← XV[a]
RSUBId, imm, a SR[d] ← 240 ∥ ((imm121152∥imm12) − SR[a]63..0)
SV[d] ← SV[a]
Index logical immediate
ORXId, a, imm XR[d] ← 240 ∥ (XR[a]63..0 | imm121152∥imm12)
XV[d] ← XV[a]
XORXId, a, imm XR[d] ← 240 ∥ (XR[a]63..0 ^ imm121152∥imm12)
XV[d] ← XV[a]
Scalar integer logical immediate
loSId, a, imm SR[d] ← 240 ∥ (SR[a]63..0 lo imm121152∥imm12)
SV[d] ← SV[a]
Scalar integer arithmetic immediate
ioSId, a, imm SR[d] ← 240 ∥ (SR[a]63..0 io imm121152∥imm12)
SV[d] ← SV[a]
ADDSId, a, imm SR[d] ← 240 ∥ (SR[a]63..0 + imm121152∥imm12)
SV[d] ← SV[a]
SUBSId, a, imm SR[d] ← 240 ∥ (SR[a]63..0 − imm121152∥imm12)
SV[d] ← SV[a]
MINUSId, a, imm SR[d] ← 240 ∥ minu(SR[a]63..0, imm121152∥imm12)
SV[d] ← SV[a]
MINSSId, a, imm SR[d] ← 240 ∥ mins(SR[a]63..0, imm121152∥imm12)
SV[d] ← SV[a]
MINUSSId, a, imm SR[d] ← 240 ∥ minus(SR[a]63..0, imm121152∥imm12)
SV[d] ← SV[a]
MAXUSId, a, imm SR[d] ← 240 ∥ maxu(SR[a]63..0, imm121152∥imm12)
SV[d] ← SV[a]
MAXSSId, a, imm SR[d] ← 240 ∥ maxs(SR[a]63..0, imm121152∥imm12)
SV[d] ← SV[a]
MAXUSSId, a, imm SR[d] ← 240 ∥ maxus(SR[a]63..0, imm121152∥imm12)
SV[d] ← SV[a]
Non-speculative tagged word load/store with immediate addressing: L{A,X,S}I
AId, a, imm if AV[a] = 0 then
  AR[d] ← 0
  AV[d] ← 0
else
  trap if boundscheck(AR[a], imm12) = 0
  AR[d] ← AR[a] +p imm12
  AV[d] ← 1
endif
LAId, a, imm if AV[a] = 0 then
  AR[d] ← 0
  AV[d] ← 0
else
  ea ← AR[a] +p imm12
  trap if ea2..0 ≠ 03
  trap if boundscheck(AR[a], imm12) = 0
  AR[d] ← sizedecode(lvload72(ea))
  AV[d] ← 1
endif
LXId, a, imm if AV[a] = 0 then
  XR[d] ← 0
  XV[d] ← 0
else
  ea ← AR[a] +p imm12
  trap if ea2..0 ≠ 03
  trap if boundscheck(AR[a], imm12) = 0
  XR[d] ← lvload72(ea)
  XV[d] ← 1
endif
LSId, a, imm if AV[a] = 0 then
  SR[d] ← 0
  SV[d] ← 0
else
  ea ← AR[a] +p imm12
  trap if ea2..0 ≠ 03
  trap if boundscheck(AR[a], imm12) = 0
  SR[d] ← lvload72(ea)
  SV[d] ← 1
endif
Non-speculative doubleword loads with immediate addressing: LADI (save/restore) and LACI (CHERI)
LADId, a, imm if AV[a] = 0 then
  AR[d] ← 0
  AV[d] ← 0
else
  ea ← AR[a] +p imm12
  trap if ea2..0 ≠ 03
  trap if boundscheck(AR[a], imm12) = 0
  AR[d] ← lvload144(ea)
  AV[d] ← 1
endif
LACId, a, imm if AV[a] = 0 then
  AR[d] ← 0
  AV[d] ← 0
else
  ea ← AR[a] +p imm12
  trap if ea2..0 ≠ 03
  trap if boundscheck(AR[a], imm12) = 0
  t ← lvload144(ea)
  trap if t71..67 ≠ 25
  trap if t143..136 ≠ 252
  AR[d] ← t
  AV[d] ← 1
endif
Non-speculative sub-word load/store with immediate addressing: L{X,S}{8,16,32,64}{U,S}I
LX64Id, a, imm v ← AV[a]
trap if v & boundscheck(AR[a], imm12) = 0
t ← v ? lvload64(AR[a] +p imm12) : 0
XR[d] ← 240 ∥ t
XV[d] ← v
LX32UId, a, imm v ← AV[a]
trap if v & boundscheck(AR[a], imm12) = 0
t ← v ? lvload32(AR[a] +p imm12) : 0
XR[d] ← 240 ∥ 032 ∥ t
XV[d] ← v
LS32UId, a, imm v ← AV[a]
trap if v & boundscheck(AR[a], imm12) = 0
t ← v ? lvload32(AR[a] +p imm12) : 0
SR[d] ← 240 ∥ 032 ∥ t
SV[d] ← v
LX32SId, a, imm v ← AV[a]
trap if v & boundscheck(AR[a], imm12) = 0
t ← v ? lvload32(AR[a] +p imm12) : 0
XR[d] ← 240 ∥ t3132 ∥ t
XV[d] ← v
LS32SId, a, imm v ← AV[a]
trap if v & boundscheck(AR[a], imm12) = 0
t ← v ? lvload32(AR[a] +p imm12) : 0
SR[d] ← 240 ∥ t3132 ∥ t
SV[d] ← v
LX16UId, a, imm v ← AV[a]
trap if v & boundscheck(AR[a], imm12) = 0
t ← v ? lvload16(AR[a] +p imm12) : 0
XR[d] ← 240 ∥ 048 ∥ t
XV[d] ← v
LS16UId, a, imm v ← AV[a]
trap if v & boundscheck(AR[a], imm12) = 0
t ← v ? lvload16(AR[a] +p imm12) : 0
SR[d] ← 240 ∥ 048 ∥ t
SV[d] ← v
LX16SId, a, imm v ← AV[a]
trap if v & boundscheck(AR[a], imm12) = 0
t ← v ? lvload16(AR[a] +p imm12) : 0
XR[d] ← 240 ∥ t1548 ∥ t
XV[d] ← v
LS16SId, a, imm v ← AV[a]
trap if v & boundscheck(AR[a], imm12) = 0
t ← v ? lvload16(AR[a] +p imm12) : 0
SR[d] ← 240 ∥ t1548 ∥ t
SV[d] ← v
LX8UId, a, imm v ← AV[a]
trap if v & boundscheck(AR[a], imm12) = 0
t ← v ? lvload8(AR[a] +p imm12) : 0
XR[d] ← 240 ∥ 056 ∥ t
XV[d] ← v
LS8UId, a, imm v ← AV[a]
trap if v & boundscheck(AR[a], imm12) = 0
t ← v ? lvload8(AR[a] +p imm12) : 0
SR[d] ← 240 ∥ 056 ∥ t
SV[d] ← v
LX8SId, a, imm v ← AV[a]
trap if v & boundscheck(AR[a], imm12) = 0
t ← v ? lvload8(AR[a] +p imm12) : 0
XR[d] ← 240 ∥ t756 ∥ t
XV[d] ← v
LS8SId, a, imm v ← AV[a]
trap if v & boundscheck(AR[a], imm12) = 0
t ← v ? lvload8(AR[a] +p imm12) : 0
SR[d] ← 240 ∥ t756 ∥ t
SV[d] ← v
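The sub-word loads above differ only in how the loaded value is widened to 64 bits before the tag is attached. The following is a minimal Python sketch of that widening, assuming that the flattened notation such as `t3132 ∥ t` means bit 31 of t replicated 32 times and concatenated with t. The function name `extend` is hypothetical, and tags, bounds checks, and valid bits are deliberately left out.

```python
MASK64 = (1 << 64) - 1

def extend(value: int, width: int, signed: bool) -> int:
    """Widen a `width`-bit loaded value to 64 bits (zero- or sign-extended)."""
    low_mask = (1 << width) - 1
    value &= low_mask
    if signed and (value >> (width - 1)) & 1:
        # Replicate the sign bit into bits 63..width.
        value |= MASK64 & ~low_mask
    return value
```

For example, `extend(t, 32, True)` models the data part of LX32SI and LS32SI, while `extend(t, 32, False)` models LX32UI and LS32UI.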
Speculative word loads with immediate addressing: SL{A,X,S}I
SLAId, a, imm v ← AV[a] & boundscheck(AR[a], imm12)
AR[d] ← v ? lvload72(AR[a] +p imm12) : 0
AV[d] ← v
SLXId, a, imm v ← AV[a] & boundscheck(AR[a], imm12)
XR[d] ← v ? lvload72(AR[a] +p imm12) : 0
XV[d] ← v
SLSId, a, imm v ← AV[a] & boundscheck(AR[a], imm12)
SR[d] ← v ? lvload72(AR[a] +p imm12) : 0
SV[d] ← v
Instructions for loop iteration count prediction
LOOPXId trap if XV[a] = 0
XR[d] ← XR[a] + imm121152∥imm12
XV[d] ← 1
32‑bit instruction format 2 sources 1 destination
31 28 27 22 21 16 15 12 11 8 7 4 3 0
op32g op32f op32c op32b a d op32
4 6 6 4 4 4 4
Instructions for save/restore
MOVSBd, a SR[d] ← 240 ∥ 063 ∥ BR[a]
SV[d] ← BV[a]
MOVBSd, a, imm6 BR[d] ← SR[a]imm6
BV[d] ← SV[a]
MOVSBALLd SR[d] ← 240 ∥ 032 ∥ BV[15]∥BV[14]∥…∥BV[1]∥1 ∥ BR[15]∥BR[14]∥…∥BR[1]∥0
SV[d] ← 1
MOVBALLSd BR[1] ← SR[a]1
BR[2] ← SR[a]2

BR[15] ← SR[a]15
BV[1] ← SR[a]17
BV[2] ← SR[a]18

BV[15] ← SR[a]31
MOVXAVALLd XR[d] ← 240 ∥ 048 ∥ AV[15]∥AV[14]∥…∥AV[1]∥AV[0]
XV[d] ← 1
MOVAVALLXd AV[1] ← XR[a]1
AV[2] ← XR[a]2

AV[15] ← XR[a]15
MOVXXVALLd XR[d] ← 240 ∥ 048 ∥ XV[15]∥XV[14]∥…∥XV[1]∥XV[0]
XV[d] ← 1
MOVXVALLXd XV[1] ← XR[a]1
XV[2] ← XR[a]2

XV[15] ← XR[a]15
MOVSMd, m, w SR[d] ← 240 ∥ VM[a]w×64+63..w×64
SV[d] ← 1
MOVMSd, a, w trap if SV[a] = 0
VM[d]w×64+63..w×64 ← SR[a]
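The MOVSBALL and MOVBALLS pair above packs all branch registers and their valid bits into one scalar and unpacks them again. The following is a rough Python sketch of the bit layout, assuming BR and BV are modelled as lists of sixteen 0/1 values and that position 0 carries the constants shown in the source (0 for BR, 1 for BV). The function names and the list representation are illustrative only, and the tag and upper zero bits are omitted.

```python
def movsball(br, bv):
    """Pack BR[15..1] into bits 15..1 and BV[15..1] into bits 31..17."""
    word = 1 << 16  # bit 16 is the constant 1 in the BV[0] position; bit 0 stays 0
    for i in range(1, 16):
        word |= (br[i] & 1) << i
        word |= (bv[i] & 1) << (16 + i)
    return word

def movballs(word):
    """Unpack a word produced by movsball back into (br, bv) lists."""
    br = [0] + [(word >> i) & 1 for i in range(1, 16)]
    bv = [1] + [(word >> (16 + i)) & 1 for i in range(1, 16)]
    return br, bv
```

A save followed by a restore should return the same register state, which is the property a context switch relies on.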
32‑bit instruction format with 2 sources 1 destination and 6‑bit immediate
31 28 27 22 21 16 15 12 11 8 7 4 3 0
op32g op32f imm6 b a d op32
4 6 6 4 4 4 4
FUNSId, a, b, i t ← (SR[b]63..0∥SR[a]63..0) >> imm6
SR[d] ← 240 ∥ t63..0
SV[d] ← SV[a] & SV[b]
ROTRSId, a, i assembler expands to FUNSI d, a, a, i
ROTLSId, a, i assembler expands to FUNSI d, a, a, (−i)5..0
SLLXId, a, imm XR[d] ← 240 ∥ (XR[a]63..0 <<u imm6)
XV[d] ← XV[a]
SRLXId, a, imm XR[d] ← 240 ∥ (XR[a]63..0 >>u imm6)
XV[d] ← XV[a]
SRAXId, a, imm XR[d] ← 240 ∥ (XR[a]63..0 >>s imm6)
XV[d] ← XV[a]
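The funnel shift FUNSI and the two rotate expansions can be checked with a small model. This is a sketch in Python, assuming only the 64-bit data parts of the registers matter (tags and valid bits are ignored); the function names simply mirror the mnemonics.

```python
MASK64 = (1 << 64) - 1

def funsi(a: int, b: int, imm6: int) -> int:
    """Low 64 bits of the 128-bit value (b ∥ a) shifted right by imm6."""
    wide = ((b & MASK64) << 64) | (a & MASK64)
    return (wide >> (imm6 & 63)) & MASK64

def rotrsi(a: int, i: int) -> int:
    """Rotate right: FUNSI with both sources equal."""
    return funsi(a, a, i)

def rotlsi(a: int, i: int) -> int:
    """Rotate left: FUNSI with both sources equal and shift (-i) mod 64."""
    return funsi(a, a, (-i) & 63)
```

Using the same register for both sources turns the funnel shift into a rotate, and negating the shift amount modulo 64 turns a right rotate into a left rotate, which is why no separate rotate opcodes are needed.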
32‑bit instruction format 3 sources 0 destination
31 28 27 22 21 20 19 16 15 12 11 8 7 4 3 0
op32g op32f m c b a op32d op32
4 6 2 4 4 4 4 4
Store address instructions with indexed addressing
SAc, a, b, sa trap if (AV[a] & XV[b] & AV[c]) = 0
trap if (AR[a]2..0 + XR[b]2..0) ≠ 03
lvstore72(AR[a] +p XR[b]<<sa) ← AR2mem72(AR[c])
SADc, a, b, sa trap if (AV[a] & XV[b] & AV[c]) = 0
trap if (AR[a]3..0 + XR[b]3..0) ≠ 04
lvstore144(AR[a] +p XR[b]<<sa) ← AR2mem144(AR[c])
SACc, a, b, sa trap if (AV[a] & XV[b] & AV[c]) = 0
trap if (AR[a]3..0 + XR[b]3..0) ≠ 04
lvstore144(AR[a] +p XR[b]<<sa) ← AR2CHERImem144(AR[c])
Store index instructions with indexed addressing
SXc, a, b, sa trap if (AV[a] & XV[b] & XV[c]) = 0
lvstore72(AR[a] +p XR[b]<<sa) ← XR[c]
SX64c, a, b, sa trap if (AV[a] & XV[b] & XV[c]) = 0
lvstore64(AR[a] +p XR[b]<<sa) ← XR[c]63..0
SX32c, a, b, sa trap if (AV[a] & XV[b] & XV[c]) = 0
lvstore32(AR[a] +p XR[b]<<sa) ← XR[c]31..0
SX16c, a, b, sa trap if (AV[a] & XV[b] & XV[c]) = 0
lvstore16(AR[a] +p XR[b]<<sa) ← XR[c]15..0
SX8c, a, b, sa trap if (AV[a] & XV[b] & XV[c]) = 0
lvstore8(AR[a] +p XR[b]<<sa) ← XR[c]7..0
Store scalar instructions with indexed addressing
SSc, a, b, sa trap if (AV[a] & XV[b] & SV[c]) = 0
lvstore72(AR[a] +p XR[b]<<sa) ← SR[c]
SS64c, a, b, sa trap if (AV[a] & XV[b] & SV[c]) = 0
lvstore64(AR[a] +p XR[b]<<sa) ← SR[c]63..0
SS32c, a, b, sa trap if (AV[a] & XV[b] & SV[c]) = 0
lvstore32(AR[a] +p XR[b]<<sa) ← SR[c]31..0
SS16c, a, b, sa trap if (AV[a] & XV[b] & SV[c]) = 0
lvstore16(AR[a] +p XR[b]<<sa) ← SR[c]15..0
SS8c, a, b, sa trap if (AV[a] & XV[b] & SV[c]) = 0
lvstore8(AR[a] +p XR[b]<<sa) ← SR[c]7..0
Store vector mask instructions with indexed addressing
SMc, a, b, sa trap if (AV[a] & XV[b] & AV[c]) = 0
lvstore128(AR[a] +p XR[b]<<sa) ← VM[c]
Branch instructions
Bbobac, a, b branch if BR[c] ba (BR[a] bo BR[b])
BbaEQAc, a, b branch if BR[c] ba (AR[a] = AR[b])
BbaEQXc, a, b branch if BR[c] ba (XR[a] = XR[b])
BbaNEAc, a, b branch if BR[c] ba (AR[a] ≠ AR[b])
BbaNEXc, a, b branch if BR[c] ba (XR[a] ≠ XR[b])
BbaLTAUc, a, b branch if BR[c] ba (AR[a] <u AR[b])
BbaLTXUc, a, b branch if BR[c] ba (XR[a] <u XR[b])
BbaGEAUc, a, b branch if BR[c] ba (AR[a] ≥u AR[b])
BbaGEXUc, a, b branch if BR[c] ba (XR[a] ≥u XR[b])
BbaLTXc, a, b branch if BR[c] ba (XR[a] <s XR[b])
BbaGEXc, a, b branch if BR[c] ba (XR[a] ≥s XR[b])
BbaNONEXc, a, b branch if BR[c] ba ((XR[a] & XR[b]) = 0)
BbaANYXc, a, b branch if BR[c] ba ((XR[a] & XR[b]) ≠ 0)
assembler simplified versions of the above
Bboa, b equivalent to BORbo b0, a, b
BEQAa, b equivalent to BOREQA b0, a, b
BEQXa, b branch if XR[a] = XR[b]
BNEAa, b branch if AR[a] ≠ AR[b]
BNEXa, b branch if XR[a] ≠ XR[b]
BLTAUa, b branch if AR[a] <u AR[b]
BLTXUa, b branch if XR[a] <u XR[b]
BGEAUa, b branch if AR[a] ≥u AR[b]
BGEXUa, b branch if XR[a] ≥u XR[b]
BLTXa, b branch if XR[a] <s XR[b]
BGEXa, b branch if XR[a] ≥s XR[b]
BNONEXa, b branch if (XR[a] & XR[b]) = 0
BANYXa, b branch if (XR[a] & XR[b]) ≠ 0
32‑bit instruction format 2 sources 0 destination with 12‑bit immediate
31 28 27 20 19 16 15 12 11 8 7 4 3 0
op32g i c i a op32d op32
4 8 4 4 4 4 4
Store address instructions with immediate addressing
SAIc, a, imm lvstore72(AR[a] +p imm12) ← AR[c]
SADIc, a, imm lvstore144(AR[a] +p imm12) ← AR[c]
Store index instructions with immediate addressing
SXIc, a, imm lvstore72(AR[a] +p imm12) ← XR[c]
SX64Ic, a, imm lvstore64(AR[a] +p imm12) ← XR[c]63..0
SX32Ic, a, imm lvstore32(AR[a] +p imm12) ← XR[c]31..0
SX16Ic, a, imm lvstore16(AR[a] +p imm12) ← XR[c]15..0
SX8Ic, a, imm lvstore8(AR[a] +p imm12) ← XR[c]7..0
Store scalar instructions with immediate addressing
SSIc, a, imm lvstore72(AR[a] +p imm12) ← SR[c]
SS64Ic, a, imm lvstore64(AR[a] +p imm12) ← SR[c]63..0
SS32Ic, a, imm lvstore32(AR[a] +p imm12) ← SR[c]31..0
SS16Ic, a, imm lvstore16(AR[a] +p imm12) ← SR[c]15..0
SS8Ic, a, imm lvstore8(AR[a] +p imm12) ← SR[c]7..0
Branch instructions with immediate comparison
BbaEQXIc, a, imm12 branch if BR[c] ba (XR[a] = imm12)
BbaNEXIc, a, imm12 branch if BR[c] ba (XR[a] ≠ imm12)
BbaLTUXIc, a, imm12 branch if BR[c] ba (XR[a] <u imm12)
BbaGEUXIc, a, imm12 branch if BR[c] ba (XR[a] ≥u imm12)
BbaLTXIc, a, imm12 branch if BR[c] ba (XR[a] <s imm12)
BbaGEXIc, a, imm12 branch if BR[c] ba (XR[a] ≥s imm12)
BbaNONEXIc, a, imm12 branch if BR[c] ba ((XR[a] & imm12) = 0)
BbaANYXIc, a, imm12 branch if BR[c] ba ((XR[a] & imm12) ≠ 0)
assembler simplified versions of the above
BEQXIa, imm equivalent to BOREQXI b0, a, imm
BNEXIa, imm equivalent to BORNEXI b0, a, imm
BLTUXIa, imm equivalent to BORLTUXI b0, a, imm
BGEUXIa, imm equivalent to BORGEUXI b0, a, imm
BLTXIa, imm equivalent to BORLTXI b0, a, imm
BGEXIa, imm equivalent to BORGEXI b0, a, imm
BNONEXIa, imm equivalent to BORNONEXI b0, a, imm
BANYXIa, imm equivalent to BORANYXI b0, a, imm
Instructions yet to be grouped
SWITCHIa, imm PC ← AR[a] +p imm12
LJMPIa, imm PC ← lvload72(AR[a] +p imm12)
LJMPa, b PC ← lvload72(AR[a] +p XR[b]<<sa)
FENCE This is a placeholder for various FENCE instructions that need to be defined.
WFIa Wait For Interrupt for the current ring. May be intercepted by more privileged rings. Execution resumes after the interrupt is serviced (that is, the return from interrupt goes to the following instruction). (Perhaps consider making this a BB descriptor type?) This is intended to be used when the processor has nothing to do, and is expected to reduce power consumption. For the duration of the wait, the interrupt enables are set to InterruptEnable[PC.ring] | XR[a]. That is, the operand specifies additional interrupts to enable. This allows software to disable interrupts, check for work, and if there is none, use WFI to wait for work to arrive, without a window in which an interrupt could occur before the WFI, return, and then leave the processor waiting when there is work to be done.
WFPa Wait For Interrupt Pending for the current ring. May be intercepted by more privileged rings. Execution resumes after InterruptPending[PC.ring] & XR[a] becomes non-zero. This may be used to wait until a particular cycle count is reached.
WAITa Wait For memory location change. May be intercepted by more privileged rings.
HALT The processor finishes all outstanding operations and halts. It will only be woken by Soft Reset. Ring 7 only.
BREAK This is a placeholder for later definition.
ILL This is a placeholder for later definition.
CSR* This is a placeholder for later definition.
fmtCLASSS This is a placeholder for later definition.
32‑bit instruction format with 1 source 1 destination and 16‑bit immediate
31 28 27 12 11 8 7 4 3 0
op32g imm16 a d op32
4 16 4 4 4
AId, a, imm v ← AV[a]
trap if v & boundscheck(AR[a], imm16) = 0
AR[d] ← AR[a] +p imm16
AV[d] ← v
Stack frame allocation for upward and downward stacks
ENTRYd, a, imm8 trap if imm8 ≥ 192
osp ← AR[a]
oring ← osp135..133
osize ← 03 ∥ osp132..72
oaddr ← osp63..0
naddr ← oaddr + osize
e ← imm87..4
nsize ← e = 0 ? 054∥imm83..0∥03 : 053−s∥1∥imm83..0∥0e∥02
nring ← min(PC.ring, oring)
ssize ← segsize(oaddr)
trap if naddr63..ssize ≠ oaddr63..ssize
nsp ← 251 ∥ nring ∥ nsize ∥ imm8 ∥ naddr
lvstore72(nsp) ← osp71..0
AR[d] ← nsp
ENTRYDd, a, imm8 trap if imm8 ≥ 192
osp ← AR[a]
oring ← osp135..133
oaddr ← osp63..0
e ← imm87..4
nsize ← e = 0 ? 054∥imm83..0∥03 : 053−s∥1∥imm83..0∥0e∥02
naddr ← oaddr − nsize
nring ← min(PC.ring, oring)
ssize ← segsize(oaddr)
trap if naddr63..ssize ≠ oaddr63..ssize
nsp ← 251 ∥ nring ∥ nsize ∥ imm8 ∥ naddr
lvstore72(nsp) ← osp71..0
AR[d] ← nsp
32‑bit instruction format with 24‑bit immediate
31 8 7 4 3 0
imm24 d op32
24 4 4
XId, imm XR[d] ← 240 ∥ imm242340∥imm24
XV[d] ← 1
XUId, imm XR[d] ← 240 ∥ imm24∥040
XV[d] ← 1
SId, imm SR[d] ← 240 ∥ imm242340∥imm24
SV[d] ← 1
SUId, imm SR[d] ← 240 ∥ imm24∥040
SV[d] ← 1
DUId, imm SR[d] ← 244 ∥ imm24∥040
SV[d] ← 1
FId, imm SR[d] ← 245 ∥ 032∥imm24∥08
SV[d] ← 1
48‑bit op48
1:0
3:2
0 1 2 3
0 3XI 3XUI 3SI 3SUI
1 3ADDXI 3ADDXUI 3ADDSI 3ADDSUI
2 3ANDXI 3ANDUXI i48v
3 3ORXI 3ORXUI 3FI 3DUI
48‑bit instruction format
47 24 23 20 19 16 15 12 11 8 7 4 3 0
op48dabcm e c b a d op48
24 4 4 4 4 4 4
Vector-Vector Integer
3ioiaVVd, c, a, b, m VR[d] ← VR[c] ia (VR[a] io VR[b])
masked by VM[m]
3lolaVVd, c, a, b, m VR[d] ← VR[c] la (VR[a] lo VR[b])
masked by VM[m]
3SELVVd, c, a, b, m VR[d] ← select(VM[c], VR[a], VR[b])
masked by VM[m]
3i1Sd, a, m VR[d] ← i1(VR[a])
masked by VM[m]
Vector-Scalar Integer
3ioiaVSd, c, a, b, m VR[d] ← VR[c] ia (VR[a] io SR[b])
masked by VM[m]
3lolaVSd, c, a, b, m VR[d] ← VR[c] la (VR[a] lo SR[b])
masked by VM[m]
3SELVSd, c, a, b, m VR[d] ← select(VM[c], VR[a], SR[b])
masked by VM[m]
Vector-Immediate Integer
3ioiaVId, c, a, imm, m VR[d] ← VR[c] ia (VR[a] io imm)
masked by VM[m]
3lolaVId, c, a, imm, m VR[d] ← VR[c] la (VR[a] lo imm)
masked by VM[m]
3SELVId, c, a, imm, m VR[d] ← select(VM[c], VR[a], imm)
masked by VM[m]
Vector-Vector integer comparison
3icbaVVd, c, a, b VM[d] ← VM[c] ba (VR[a] ic VR[b])
masked by VM[m]
Vector-Scalar integer comparison
3icbaVSd, c, a, b VM[d] ← VM[c] ba (VR[a] ic SR[b])
Vector-Immediate integer comparison
3icbaVId, c, a, imm VM[d] ← VM[c] ba (VR[a] ic imm12)
Vector-Vector Floating-Point
VL[n] gives the number of elements in the VRs and VMs
3DfofaVVd, c, a, b, m VR[d] ← VR[c] fad (VR[a] fod VR[b])
masked by VM[m]
3FfofaVVd, c, a, b, m VR[d] ← VR[c] faf (VR[a] fof VR[b])
masked by VM[m]
3HfofaVVd, c, a, b, m VR[d] ← VR[c] fas (VR[a] foh VR[b])
masked by VM[m]
3BfofaVVd, c, a, b, m VR[d] ← VR[c] fas (VR[a] fob VR[b])
masked by VM[m]
3P4fofaVVd, c, a, b, e VR[d] ← VR[c] fab (VR[a] fop4 VR[b])
masked by VM[m]
3P3fofaVVd, c, a, b, m VR[d] ← VR[c] fab (VR[a] fop3 VR[b])
masked by VM[m]
Matrix Floating-Point Outer Product
VL[0] gives the number of elements in VR[a] and the number of rows of the MAs
VL[1] gives the number of elements in VR[b] and the number of columns of the MAs
3DOPVVd, c, a, b MA[d] ← MA[c] +d outerproductd(VR[a], VR[b])
3FOPVVd, c, a, b MA[d] ← MA[c] +f outerproductf(VR[a], VR[b])
3HOPVVd, c, a, b MA[d] ← MA[c] +h outerproducth(VR[a], VR[b])
Vector-Scalar Floating-Point
3DfofaVSd, c, a, b, m VR[d] ← VR[c] fad (VR[a] fod SR[b])
masked by VM[m]
3FfofaVSd, c, a, b, m VR[d] ← VR[c] faf (VR[a] fof SR[b])
masked by VM[m]
3HfofaVSd, c, a, b, m VR[d] ← VR[c] fah (VR[a] foh SR[b])
masked by VM[m]
3BfofaVSd, c, a, b, m VR[d] ← VR[c] fas (VR[a] fob SR[b])
masked by VM[m]
3P4fofaVSd, c, a, b, m VR[d] ← VR[c] fab (VR[a] fop4 SR[b])
masked by VM[m]
3P3fofaVSd, c, a, b, m VR[d] ← VR[c] fab (VR[a] fop3 SR[b])
masked by VM[m]
Vector-Vector floating comparison
3DfcbaVVd, c, a, b VM[d] ← VM[c] ba (VR[a] fcd VR[b])
3FfcbaVVd, c, a, b VM[d] ← VM[c] ba (VR[a] fcf VR[b])
3HfcbaVVd, c, a, b VM[d] ← VM[c] ba (VR[a] fch VR[b])
3BfcbaVVd, c, a, b VM[d] ← VM[c] ba (VR[a] fcb VR[b])
3P4fcbaVVd, c, a, b VM[d] ← VM[c] ba (VR[a] fcp4 VR[b])
3P3fcbaVVd, c, a, b VM[d] ← VM[c] ba (VR[a] fcp3 VR[b])
Vector-Scalar floating comparison
3DfcbaVSd, c, a, b VM[d] ← VM[c] ba (VR[a] fcd SR[b])
3SfcbaVSd, c, a, b VM[d] ← VM[c] ba (VR[a] fcs SR[b])
3HfcbaVSd, c, a, b VM[d] ← VM[c] ba (VR[a] fch SR[b])
3BfcbaVSd, c, a, b VM[d] ← VM[c] ba (VR[a] fcb SR[b])
3P4fcbaVSd, c, a, b VM[d] ← VM[c] ba (VR[a] fcp4 SR[b])
3P3fcbaVSd, c, a, b VM[d] ← VM[c] ba (VR[a] fcp3 SR[b])
48‑bit instruction format with source and 36‑bit immediate
47 12 11 8 7 4 3 0
imm36 a d op48
36 4 4 4
3ADDXId, a, imm XR[d] ← XR[a] + imm363528∥imm36
XV[d] ← XV[a]
3ADDXUId, a, imm XR[d] ← XR[a] + imm36∥028
XV[d] ← XV[a]
3ANDSId, a, imm SR[d] ← SR[a] & imm363528∥imm36
SV[d] ← SV[a]
3ANDSUId, a, imm SR[d] ← SR[a] & imm36∥028
SV[d] ← SV[a]
3ORSId, a, imm SR[d] ← SR[a] | imm363528∥imm36
SV[d] ← SV[a]
3ORSUId, a, imm SR[d] ← SR[a] | imm36∥028
SV[d] ← SV[a]
3ADDDUId, a, imm SR[d] ← SR[a] +d imm36∥028
SV[d] ← SV[a]
48‑bit instruction format with 40‑bit immediate
47 8 7 4 3 0
imm40 d op48
40 4 4
3XId, imm XR[d] ← 240 ∥ imm403924∥imm40
XV[d] ← 1
3XUId, imm XR[d] ← 240 ∥ imm40∥024
XV[d] ← 1
3SId, imm SR[d] ← 240 ∥ imm403924∥imm40
SV[d] ← 1
3SUId, imm SR[d] ← 240 ∥ imm40∥024
SV[d] ← 1
3DUId, imm SR[d] ← 244 ∥ imm40∥024
SV[d] ← 1
3FId, imm SR[d] ← 245 ∥ 024∥imm40
SV[d] ← 1
64‑bit op64
1:0
3:2
0 1 2 3
0 4I 4UI 4SI 4SUI
1
2
3 4FI 4DUI
64‑bit instruction format with 56‑bit immediate
63 8 7 4 3 0
imm56 d op64
56 4 4
4XId, imm XR[d] ← 240 ∥ imm56558∥imm56
XV[d] ← 1
4XUId, imm XR[d] ← 240 ∥ imm56∥08
XV[d] ← 1
4SId, imm SR[d] ← 240 ∥ imm56558∥imm56
SV[d] ← 1
4SUId, imm SR[d] ← 240 ∥ imm56∥08
SV[d] ← 1
4DUId, imm SR[d] ← 244 ∥ imm56∥08
SV[d] ← 1
4FId, imm SR[d] ← 245 ∥ 08∥imm56
SV[d] ← 1
64‑bit instruction format
63 24 23 20 19 16 15 12 11 8 7 4 3 0
op64dabcm e c b a d op64
40 4 4 4 4 4 4

Software Conventions

Data Types

I expect SecureRISC software to use the ILP64 model, where integers and pointers are both 64 bits. Even in the 1980s when MIPS was defining its 64‑bit ISA, I argued that integers should be 64 bits, but keeping integers 32 bits for C was considered sacred by others. The result is that an integer cannot index a large array, which is terrible. With ILP64, I don’t expect SecureRISC to need special 32‑bit add instructions (that sign-extend from bit 31 to bits 63..32).

Register Names and Uses

Direct Mapping and Paging

Introduction to Translation

Translation is a two-stage process: in the first stage a Local Virtual Address (lvaddr) is translated to a System Virtual Address (svaddr), and in the second stage that address is translated to a System Interconnect Address (siaddr). The lvaddr→svaddr translation may involve multiple svaddr reads, each of which also has to be translated to a siaddr during the process. A full translation is therefore very costly and is typically cached as a direct lvaddr→siaddr mapping to make the process much faster after the first time. The following sections first describe the lvaddr→svaddr process, and then subsequent sections describe the svaddr→siaddr process. These translations are similar, with minor differences. Once the first stage lvaddr→svaddr process is understood, the second stage svaddr→siaddr process will be straightforward. Some systems may set up a minimal second stage translation process, but the process is still important for determining the cache and memory attributes and permissions, as stored in the Region Descriptor Table (RDT).

Translation of local virtual addresses (lvaddrs) to system interconnect addresses (siaddrs) is typically performed in a single processor cycle in one of several L1 translation caches (often called TLBs), which may be supplemented with one or more L2 TLBs. If the TLBs fail to translate the address, then the processor performs a lengthier procedure, and if that succeeds, then the result is written into the TLBs to speed later translations. This TLB miss procedure determines the memory architecture. As described earlier, SecureRISC uses both segmentation and paging in its memory architecture. The first step of a TLB miss is therefore to determine a segment descriptor and then proceed as that directs. One way of thinking about SecureRISC segmentation is that it is a specialized first-level page table that controls the subsequent levels, including giving the page size and table depth (derived from the segment size). After the segment descriptor, 0 to 4 levels of page table walk are used to complete the translation, depending on the table values set by the supervisor. While 4‑level page tables are supported, SecureRISC is designed to avoid this if the operating system can use its features, as multiple-level page tables needlessly increase the TLB miss penalty.

SecureRISC segments may be directly mapped to an aligned system virtual address range equal to the segment size, or they may be paged. Direct mapping may be appropriate to I/O regions, for example. It consists of simply changing the high bits (above the segment size) of the local virtual address to the appropriate system virtual address bits and leaving the low bits (below the segment size) unchanged.
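Direct mapping as described above is just a bit substitution. The following is a minimal Python sketch, assuming the segment size is a power of two given as its log2 and that the target base is aligned to that size. The function and parameter names are hypothetical and do not correspond to fields defined in this document.

```python
MASK64 = (1 << 64) - 1

def direct_map(lvaddr: int, sv_base: int, seg_size_log2: int) -> int:
    """Replace the bits above the segment size with those of sv_base."""
    low_mask = (1 << seg_size_log2) - 1
    if sv_base & low_mask:
        raise ValueError("sv_base must be aligned to the segment size")
    # High bits come from the system virtual base, low bits pass through unchanged.
    return (sv_base & ~low_mask & MASK64) | (lvaddr & low_mask)
```

For a 1 MiB segment (`seg_size_log2 = 20`), only bits 63..20 change, so no table lookup beyond the segment descriptor is needed.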

Paging

Processors today all implement some form of paging in their virtual address translation. Paging exists for several reasons. The most critical today is to simplify memory allocation in the operating system, as without paging, it would be necessary to find contiguous regions of memory to assign to address spaces. A secondary purpose is to allow a larger virtual address space than physical memory, which performs reasonably if the working set of the process fits in the physical memory (i.e. it does not use all of its virtual memory all the time).

Page Size Issues

A critical processor design decision is the choice of a page size or page sizes. If minimizing memory overhead is the criterion, it is well known that the optimal page size for an area of virtual memory is proportional to the square root of that memory size. Back in the 1960s, 1024 words (which became 4 KiB with byte addressing) was frequently chosen as the page size to minimize the memory wasted by allocating in page units and the size of the page table. This size has been carried forward with some variation for decades. The trade-offs are different in the 2020s from the 1960s, so it deserves another look. Even the old 1024 words would suggest a page size of 8 KiB today. Today, with much larger address spaces, multi-level page tables are typically used, often with the same page size at each level. The number of levels, and therefore the TLB miss penalty, is then a factor in the page size consideration that did not exist in the 1960s.
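The square-root relationship can be illustrated with the usual textbook overhead model, which is an assumption here rather than something this document defines: on average half of the last page is wasted, and a single-level table costs one PTE per page. Minimizing P/2 + (S/P)·e over the page size P gives P = sqrt(2·S·e). The Python sketch below uses hypothetical function names.

```python
import math

def table_overhead(segment_bytes, page_bytes, pte_bytes=8):
    """Wasted bytes: half a page of internal fragmentation plus one PTE per page."""
    return page_bytes / 2 + (segment_bytes / page_bytes) * pte_bytes

def optimal_page_bytes(segment_bytes, pte_bytes=8):
    """Page size that minimizes table_overhead: sqrt(2 * S * e)."""
    return math.sqrt(2 * segment_bytes * pte_bytes)
```

Under this model, quadrupling the memory size doubles the optimal page size, which is the sense in which the optimum is proportional to the square root.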

In addition, today regions of memory vary wildly in size in computer systems, with many processes having small code regions, a small stack region, and a heap that may be small, large, or huge, and sometimes the size is dependent upon input parameters. Even in processors that support multiple page sizes, size is often set for the entire system. When page size is variable at runtime, there may be only one value for the entire process virtual address space, which makes the chosen value sub-optimal for code, stack, or heap, depending on which is chosen for optimization. Further, memory overhead is not the only criterion of importance. Larger page sizes minimize translation cache misses and therefore improve performance at the cost of memory wastage. Larger page sizes may also reduce the translation cache miss penalty when multi-level page tables are used (as is common today), by potentially reducing the number of levels to be read on a miss.

A major advantage of segmentation is that it becomes possible to choose different page sizes on a per segment basis. Each shared library and the main program are individual segments containing code, and each could have a page size appropriate to its size. The stack and heap segments can likewise have different page sizes from the code segments and each other. Choosing a page size based on the square root of the segment size not only minimizes memory wastage, but it can also keep the page table a single level, which minimizes the translation cache miss penalty.

There is a cost to implementing multiple page sizes in the operating system. Typically, free lists are maintained for each page size, and when a smaller page free list is empty, a large page is split up. The reverse process, of coalescing pages, is more involved, as it may be necessary to migrate one or more small pages to put back together what was split apart. This however has been implemented in operating systems and made to work well.

There is also a cost to implementing multiple page sizes in translation caches (typically called TLBs though that is a terrible name). The most efficient hardware for translation caches would prefer a single page size, or failing that, a fairly small number of page sizes. Page size flexibility can affect critical processor timing paths. Despite this, the trend has been toward supporting a small number of page sizes. The inclusion of a vector architecture helps to address this issue, as vector loads and stores are not as latency sensitive as scalar loads and stores, and therefore can go directly to an L2 translation cache, which is larger, therefore slower, and so better able to absorb the cost of multiple page size matching. Much of the need for larger sizes occurs in applications with huge memory needs, and these applications are often able to exploit the vector architecture.

It may help to consider what historical architectures have for page size options. According to Wikipedia, other 64‑bit architectures have supported the following page sizes:

Page Sizes in Other 64‑bit Architectures
Architecture 4 KiB 8 KiB 16 KiB 64 KiB 2 MiB 1 GiB Other
MIPS 256 KiB, 1 MiB, 4 MiB, 16 MiB
x86-64
ARM 32 MiB, 512 MiB
RISC‑V 512 GiB, 256 TiB
Power 16 MiB, 16 GiB
UltraSPARC 512 KiB, 4 MiB, 32 MiB, 256 MiB, 2 GiB, 16 GiB
IA-64 256 KiB, 1 MiB, 4 MiB, 16 MiB, 256 MiB
SecureRISC? 256 KiB

The only very common page size is 4 KiB, with 64 KiB, 2 MiB, and 1 GiB being somewhat common second page sizes. I believe 4 KiB has been carried forward from the 1960s for compatibility reasons as there probably exists some application and device driver software where page size assumptions exist. It would be interesting to know how often UltraSPARC encountered porting problems with its 8 KiB minimum page size. Today 8 KiB or 16 KiB pages make more technical sense for a minimum page size, but application assumptions may suggest keeping the old 4 KiB minimum and introducing at least one more page size to reduce translation cache miss rates.

RISC‑V’s Sv39 model has three page sizes for TLBs to match: 4 KiB, 2 MiB, and 1 GiB. Sv48 adds 512 GiB, and Sv57 adds 256 TiB. The large page sizes were chosen as early outs from multi-level table walks, and don’t necessarily represent optimal sizes for things like I/O mapping or large HPC workloads (they are all derived from the 4 KiB page being used at each level of the table walk). These early outs do reduce translation cache miss penalties, but they do complicate TLB matching, as mentioned earlier. To RISC‑V’s credit, it introduced a new PTE format (under the Svnapot extension) that communicates to processors that can take advantage of it that groups of PTEs are consistent and can be implemented with a larger unit in the translation cache. SecureRISC will adopt this idea.

Even a large memory system (e.g. HPC) will have many small segments (e.g. code segments, small files, non-computational processes such as editors, command line interpreters, etc.), and a smaller page size, such as 8 KiB, may be appropriate for these segments. However, 4 KiB is probably not so sub-optimal as to warrant incompatibility by not supporting this size. Therefore, the question is what is the most appropriate other page size, or page sizes, besides 4 KiB (which supports up to 2 MiB with one level, and up to 1 GiB with two levels). If only one other page size were possible for all implementations, 256 KiB might be a good choice, since this supports segment sizes up to 2^33 bytes with one level, and segment sizes of 2^34 to 2^48 bytes with two levels. But not all implementations need to support physical memory appropriate to a ≥2^48‑byte working set.

Instead, it makes sense to choose a second page size in addition to the 4 KiB compatibility size to extend the range of 1 and 2‑level page tables, and then allow implementations targeted at huge physical memories to employ even larger page sizes. In particular, there is a 4 KiB page size intended for backward compatibility, but the suggested page size is 16 KiB. Sophisticated operating systems will primarily operate with a pool of 16 KiB pages, with a mechanism to split these into 4 KiB pages and coalesce these back for applications that require the smaller page size.

SecureRISC Paging

SecureRISC has three improvements on paging found in recent architectures. First, it takes advantage of segment sizes to reduce page table walk latency. Second, it allows the operating system to specify the sizes of tables used at each level of the page table walk, rather than tying this to the page size used in translation caches. Decoupling the non-leaf table sizes from the leaf page sizes provides a mechanism that sophisticated operating systems may use for better performance, and on such systems this reduces some of the pressure for larger page sizes. Large leaf page sizes are still however useful for reducing TLB miss rates, and as the third improvement, SecureRISC borrows from RISC‑V and allows the operating system to indicate where larger pages can be exploited by translation caches to reduce miss rates, but without requiring that all implementations do so.

Paging in SecureRISC takes advantage of the segment size field in Segment Descriptors to be more efficient than in some architectures. Even a simple operating system—one that specifies tables with the same size at every level—benefits when small segments need fewer levels of tables to cover the size specified in the Segment Descriptor. Just because the maximum segment size is 2^61 bytes doesn’t mean that every segment requires six levels of 4 KiB tables.
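The "six levels" figure can be reproduced with a short calculation. This is a sketch in Python, assuming 8-byte PTEs (so a 4 KiB table indexes 9 address bits) and the same table size at every level. The function name and the floor of one level for paged segments are my assumptions, not definitions from this document.

```python
def levels_needed(segment_bits, page_bits=12, table_bits=12, pte_bits=3):
    """Number of table levels needed to cover a segment of 2**segment_bits bytes."""
    index_bits = table_bits - pte_bits           # address bits resolved per level
    remaining = max(segment_bits - page_bits, 0)  # bits above the page offset
    return max(1, -(-remaining // index_bits))    # ceiling division, at least one level
```

With the defaults, a 2^61-byte segment needs six levels, while a 2 MiB segment (2^21 bytes) needs only one, which is the saving the segment size field makes available.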

Segment descriptors and non-leaf page tables give the page size to be used at the next level, which allows the operating system to employ larger or smaller tables to optimize tradeoffs appropriate to the implementation and the application. Some implementations may add page sizes beyond these basic two in their translation cache matching hardware, such as 64 KiB and 256 KiB, and some implementations targeting huge memory systems and applications (e.g. HPC) may add even larger pages to reduce TLB miss rates. The paging architecture allows this flexibility with Page Table Size (PTS) encoding in segment descriptors and non-leaf PTEs, and for leaf PTEs by an encoding borrowed from RISC‑V called NAPOT that allows enabled translation caches to take advantage of multiple consistent page table entries.

As mentioned earlier, the page size that optimizes memory wastage for a single-level page table is proportional to the square root of the memory size, or in a segmented memory, to segment size, and a single-level page table also minimizes the TLB miss penalty, with a 2‑level page table being second best for TLB miss penalty. SecureRISC’s goal is to allow the operating system to choose page sizes per segment that keep the page tables to 1 or 2 levels. It is therefore interesting to consider what segment sizes are supported with this criterion with various page sizes. This is illustrated in the following table, assuming an 8 B PTE:

Segment size reached in 1 or 2 levels by page size
Page size (last)  Page size (other)  1-Level   2-Level   3-Level   Level bits
4 KiB             4 KiB              2 MiB     1 GiB     512 GiB   21  30  39
4 KiB             16 KiB             8 MiB     16 GiB    32 TiB    23  34  45
16 KiB            16 KiB             32 MiB    64 GiB    128 TiB   25  36  47
16 KiB            64 KiB             128 MiB   1 TiB     8 PiB     27  40  53
64 KiB            64 KiB             512 MiB   4 TiB     32 PiB    29  42  55
256 KiB           256 KiB            8 GiB     256 TiB   8 EiB     33  48  63
2 MiB             2 MiB              512 GiB   128 PiB   -         39  57  -
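The entries in this table follow from simple arithmetic: the reach in address bits is the log2 of the leaf page size plus, for each level, the log2 of the number of 8 B PTEs in a table. The Python sketch below illustrates this, assuming the "last" size is the leaf page and the "other" size is the size of every table in the walk. The helper names are hypothetical.

```python
PTE_BYTES = 8  # PTE size assumed in the table above

def log2_exact(n: int) -> int:
    """log2 of a power of two; rejects other values."""
    if n <= 0 or n & (n - 1):
        raise ValueError("size must be a power of two")
    return n.bit_length() - 1

def reach_bits(leaf_page_bytes: int, table_bytes: int, levels: int) -> int:
    """Address bits covered by `levels` table levels above a leaf page."""
    return log2_exact(leaf_page_bytes) + levels * log2_exact(table_bytes // PTE_BYTES)
```

For example, a 4 KiB leaf page with 16 KiB tables reaches 12 + 2×11 = 34 bits (16 GiB) in two levels, showing how decoupling the table size from the leaf page size extends reach without enlarging the leaf page.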

The other consideration for page size is covering matrix operations in the L1 TLB. Matrix algorithms typically operate on smaller sub-blocks of the matrices to maximize data reuse and to fit into the more constraining of the L1 TLB and L2 data cache (with other larger blocking done to fit into the L2 TLB and L3, and smaller blocking to fit into the register file). Matrices are often large enough that each row is in a different page for small page sizes. For an algorithm with 8 B or 16 B per element, each row is in a different page at the following column dimension:

Columns equal to page size
Page size  Columns (8 B)  Columns (16 B)  ×1024 rows per page (8 B)  ×1024 rows per page (16 B)
4 KiB      512            256             0.5                        0.25
8 KiB      1024           512             1                          0.5
16 KiB     2048           1024            2                          1
64 KiB     8192           4096            8                          4
256 KiB    32768          16384           32                         16

For large computations (e.g. ≥1024 columns of 16 B elements), every row increment is going to require a new TLB entry for page sizes ≤16 KiB. Even a 16 KiB page with 16 B per element results in a TLB entry per row. For an L1 TLB of 32 entries and three matrices (e.g. matrix multiply A = A + B × C), the blocking needs to be limited to only 8 rows of each matrix (e.g. 8×8 blocking), which is on the low side for the best performance. In contrast, the 64 KiB page size fits 4 rows in a single page, and so allows 32×32 blocking for three matrices using 24 entries.

If the vector unit is able to use the L2 TLB rather than the L1 TLB for its translation, which is plausible, then these larger page sizes are not quite as critical. An L2 TLB is likely to be 128 or 256 entries, and so able to hold 32 or 64 rows of ×1024 matrices of 16 B elements.

A possible goal for page size might be to balance the TLB and L2 cache sizes for matrix blocking. For example, an L2 cache size of 512 KiB can fit up to 100×100 blocks of three matrices of 16 B elements (total 480,000 bytes, about 469 KiB) given sufficient associativity. To fit 100 rows of 3 matrices in the L2 TLB requires ≥300 entries when pages are ≤16 KiB, but only ≥75 entries when pages are ≥64 KiB. A given implementation should make similar tradeoffs based on the target applications and candidate TLB and cache sizes, and page size is another parameter that factors into the tradeoffs here. What is clear is that the architecture should allow implementations to efficiently support multiple page sizes if the translation cache timing allows it.
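The TLB entry counts quoted in the blocking discussion can be reproduced with a small calculation. This is a sketch in Python, assuming rows are page-aligned, each row is 1024 columns of 16 B elements unless stated otherwise, and a block touches one page per row whenever a row is at least as large as a page. The function name is hypothetical.

```python
def tlb_entries(block_rows, row_bytes, page_bytes, matrices=3):
    """TLB entries needed to map block_rows rows of each matrix."""
    rows_per_page = max(1, page_bytes // row_bytes)       # at least one page per row
    pages_per_matrix = -(-block_rows // rows_per_page)     # ceiling division
    return matrices * pages_per_matrix
```

With 16 KiB rows, `tlb_entries(32, 16384, 65536)` gives the 24 entries quoted for 32×32 blocking with 64 KiB pages, and 100 rows give 300 entries with 16 KiB pages versus 75 with 64 KiB pages.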

Because multiple page sizes do affect timing-critical paths in the translation caches, it is worth pointing out that implementations are able to reduce the page size stored in translation caches to equal what the matching hardware supports. An implementation could for example synthesize 16 KiB pages for the translation cache even when the operating system specifies a 64 KiB page. This will however increase the miss rate. Conversely, some hardware may support an even larger set of page sizes. SecureRISC adopts the NAPOT encoding from RISC‑V’s PMPs and PTEs (with the Svnapot extension) to allow the TLB to use larger matching for groups of consistent PTEs without requiring it. Thus, it is up to implementations whether to adopt larger page matching to lower the TLB miss rate at the cost of a potential TLB critical path. The cost of this feature is one bit in the PTE (taken from the bits reserved for software).

Should it become possible to eliminate the 4 KiB compatibility page size in favor of a 16 KiB minimum page size, it may be appropriate to use the extra two bits to increase the svaddr and siaddr widths from 64 to 66 bits.

Address Space Identifiers (ASIDs) for TLB Sharing

TLBs introduce one other complication. Typically, when the supervisor switches from one process to another, it changes the segment and page tables. Absent an optimization, it would be necessary to flush the TLBs on any change in the tables, which is costly both in the cycles to flush and in the misses that follow as the TLBs are reloaded by memory references after the switch. Most processors with TLBs introduce a mechanism to reduce how often the TLB must be flushed, such as the Address Space Identifier (ASID) found in the MIPS translation hardware. The ASID is stored in the TLB, and when the supervisor switches to a new process, it either uses the process’ previous ASID, or assigns a new one if the TLB has been flushed since the last time the process ran. This allows the process’ previous TLB entries to be used if they are still present in the TLB, and also avoids the TLB flush. When the ASIDs are used up, the TLB is flushed, and then ASID assignment starts fresh as processes are run. For example, a 5‑bit ASID would then require a TLB flush only when the 33rd distinct process is run after the last flush.

The supervisor often uses translation and paging for its own data structures, some of which are process-specific, and some of which are common. To not require multiple TLB entries for the supervisor pages common between processes, a Global bit was introduced in the MIPS and other TLBs. This bit causes the TLB entry to ignore the ASID during the match process; such entries match any ASID.

This whole issue occurs a second time when hypervisors switch between multiple guest operating systems, each of which thinks it controls the ASIDs in the TLB. RISC‑V for example introduced a VMID controlled by the hypervisor that works analogously to the ASID.
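
The ASID assignment policy described above (reuse a process’ ASID until the next flush, flush only when the ASIDs run out) can be sketched as a small allocator. This is an illustrative model rather than anything SecureRISC defines: the class and method names are hypothetical, the flush is represented only by a counter, and no ASID is set aside for Global entries.

```python
class AsidAllocator:
    """Toy model of supervisor ASID assignment with flush-on-exhaustion."""

    def __init__(self, asid_bits=5):
        self.limit = 1 << asid_bits   # number of distinct ASIDs
        self.next_asid = 0
        self.generation = 0           # incremented on every TLB flush
        self.assigned = {}            # pid -> (generation, asid)
        self.flushes = 0

    def asid_for(self, pid):
        entry = self.assigned.get(pid)
        if entry is not None and entry[0] == self.generation:
            return entry[1]           # no flush since last run: reuse the ASID
        if self.next_asid == self.limit:
            # ASIDs are used up: flush the TLB and start assignment afresh.
            self.flushes += 1
            self.generation += 1
            self.next_asid = 0
        asid = self.next_asid
        self.next_asid += 1
        self.assigned[pid] = (self.generation, asid)
        return asid

alloc = AsidAllocator(asid_bits=5)
for pid in range(32):
    alloc.asid_for(pid)               # 32 distinct processes, no flush yet
alloc.asid_for(32)                    # the 33rd distinct process forces a flush
```

With a 5‑bit ASID the first flush in this model happens exactly when the 33rd distinct process is run, as in the example in the text.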

SecureRISC needs an ASID mechanism and a way to ignore the ASID for the same reasons as in other ISAs. The question is whether this mechanism needs to be generalized, just as rings are a generalization of supervisor and user mode. I propose one such possible generalization with eight possible sharing opportunities, but whether this is required may be reevaluated. Perhaps SecureRISC will revert to a simple Global bit or just ASID=0 to mean Global. There is no particular reason to choose eight. Below is the mechanism proposed. Again, we expect that various service levels in the system will have some segments common to all of the service levels that they support, and that these should require only a single TLB entry, but that other segments might change their translation for each supported service level.

Segment Descriptor Table Pointer Registers and ASIDs

The simplest implementation for a Segment Descriptor Table (SDT) is to have a single Segment Descriptor Table Pointer (SDTP) register and use a Global bit in Page Table Entries (PTEs). My alternative ASID generalization is to group segments into eight segment groups (SG), and give each group its own SDT, as addressed by eight SDTP registers. These eight registers are then the zero-level table, followed by the chosen Segment Descriptor Table (the first level), followed by zero to four levels of page table. Since the registers are not in memory, there are one to five levels of memory tables to walk starting with the Segment Descriptor Table. The segment size in the SDT allows the length of the walk to be per segment, so most code segments (e.g. shared libraries) will have only one level of page table, but a process with a segment for weather data might require two or three levels (and might use a large page size as well to minimize TLB misses). Some hypervisor segments might be direct-mapped and require only the SDT level of mapping. In addition, if the hypervisor is not paging the supervisor, it might direct map many supervisor segments.

Here are the details. After a TLB miss, the processor starts by using the 3 high bits of the segment field of the address to pick one of eight Segment Descriptor Table registers (sdtp[0] to sdtp[7]). The low 13 bits of the segment field are then an index into the table at the system virtual address in the specified register. The SGS encoding of the sdtp registers is used to bounds check the low 13 bits of the segment number before indexing, which allows each portion of the Segment Descriptor Table to be 512, 1024, 2048, 4096, or 8192 entries (8 KiB to 128 KiB). The sgen register may be used to disable the segment group and all accesses to the group; otherwise, the check is that $\mathrm{SGS} = 4 \;|\; \mathrm{lvaddr}_{60..57+\mathrm{SGS}} = 0^{4-\mathrm{SGS}}$. If the bound check succeeds, the doubleword Segment Descriptor Entry is read from $(\mathrm{sdtp}[\mathrm{lvaddr}_{63..61}]_{63..14} \parallel 0^{14}) \;|\; (\mathrm{svaddr}_{60..48} \parallel 0^{4})$ and this descriptor is used to bounds check the segment offset, and to generate a system virtual address. When TLB entries are created to speed future translations, they use the Address Space Identifier (ASID) specified in bits 11..0 of the selected sdtp.
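
The lookup just described can be sketched in a few lines. This is a non-authoritative illustration: the function name is hypothetical, registers are modeled as plain 64‑bit integers without the tag byte, and the sgen check is omitted. It also makes one assumption that differs from the formula in the text: the table base is taken from the svaddr field of the register layout (bits 63..13+SGS), so that the bits of the size encoding never leak into the address, rather than from a fixed 63..14 field.

```python
def sde_address(lvaddr, sdtp):
    """Return (svaddr of the Segment Descriptor Entry, ASID) for lvaddr.

    sdtp is a list of eight 64-bit integers modeling sdtp[0]..sdtp[7].
    Raises ValueError when the segment number fails the bounds check.
    """
    sg = (lvaddr >> 61) & 0x7          # segment group: lvaddr bits 63..61
    index = (lvaddr >> 48) & 0x1FFF    # index within group: lvaddr bits 60..48
    reg = sdtp[sg]
    napot = (reg >> 12) & 0x1F         # the 2^SGS field can occupy bits 16..12
    if napot == 0:
        raise ValueError("no valid segment group size encoding")
    sgs = (napot & -napot).bit_length() - 1    # lowest set bit gives SGS (0..4)
    if index >> (9 + sgs):             # lvaddr bits 60..57+SGS must be zero
        raise ValueError("segment number out of bounds for this group")
    # Assumption: base is svaddr bits 63..13+SGS with the low bits cleared.
    base = reg & ~((1 << (13 + sgs)) - 1)
    asid = reg & 0xFFF                 # ASID: register bits 11..0
    return base | (index << 4), asid   # 16 B per descriptor

# Hypothetical setup: group 7 holds 512 descriptors at svaddr 0x4000_0000, ASID 5.
sdtp = [0] * 8
sdtp[7] = 0x4000_0000 | (1 << 12) | 5
addr, asid = sde_address((7 << 61) | (3 << 48) | 0x1234, sdtp)
```

Here segment 3 of group 7 selects the descriptor at 0x4000_0030; a segment number of 512 or more in that group would fail the bounds check.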

This method can be used to provide the functionality of two levels of other architectures (i.e. supervisor common using Global=1 and per process using Global=0). A SecureRISC supervisor might simply use segment group 7 (segments 57344–65535) for supervisor common (with ASID=0), and 256-1024 group 0 segments for per process mappings with dynamically assigned ASIDs as they are run. Such a system might set sdtp[7] at initialization, change sdtp[0] on process switch, and leave the other six groups unused (sgen ring fields set to 7).

Each Segment Descriptor Table Pointer register is only readable and writable by the ring specified in the corresponding field of sgen. Other rings must use calls to the appropriate ring to read and write these registers.

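Segment Descriptor Table Pointers

| Bits | 71..64 | 63..13+SGS | 12+SGS..12 | 11..0 |
|---|---|---|---|---|
| Field | 240 | $\mathrm{svaddr}_{63..13+\mathrm{SGS}}$ | $2^{\mathrm{SGS}}$ | ASID |
| Width | 8 | 51−SGS | 1+SGS | 12 |

| Field | Width | Bits | Description |
|---|---|---|---|
| ASID | 12 | 11:0 | Temporary Address Space ID |
| $2^{\mathrm{SGS}}$ | 1+SGS | 12+SGS:12 | Encoding of Segment Group Size |
| $\mathrm{svaddr}_{63..13+\mathrm{SGS}}$ | 51−SGS | 63:13+SGS | Pointer to Segment Descriptors for Segment Group |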

Segment Group Enable

This section is very preliminary at this point.

The sgen register controls which rings can write the various sdtp registers by specifying a 3‑bit ring number per register. Reads or writes of an sdtp[i] register or its shadows trap if the current mode or ring number is less than or equal to $\mathrm{sgen}_{i\times 8+2..i\times 8}$. It is possible to provide read access separate from write access, but I don’t see the need. Setting a ring number to 7 disables the corresponding sdtp register altogether. Ring numbers less than 3 would typically never be used.
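
As a minimal sketch of the trap condition above (the function name is hypothetical, and the Lock and reserved bits of each sgen byte are ignored), the ring field for sdtp[i] is taken from sgen bits i×8+2..i×8 and compared with the current ring:

```python
def sdtp_access_traps(sgen, i, ring):
    """True if a read or write of sdtp[i] from `ring` would trap."""
    field = (sgen >> (i * 8)) & 0x7    # ring field: sgen bits i*8+2 .. i*8
    return ring <= field

# Hypothetical configuration: group 7 has ring field 5, all other groups are
# disabled by a ring field of 7.
sgen = (5 << 56) | sum(7 << (8 * i) for i in range(7))
print(sdtp_access_traps(sgen, 7, 6))   # ring 6 is above the field value
print(sdtp_access_traps(sgen, 7, 5))   # ring 5 is not
print(sdtp_access_traps(sgen, 0, 7))   # a field of 7 traps every ring
```

Because the comparison is "less than or equal", a field value of 7 traps accesses from every ring, which is how a group is disabled.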

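Segment Group Enable Register

| Bits | 63..56 | 55..48 | 47..40 | 39..32 | 31..24 | 23..16 | 15..8 | 7..0 |
|---|---|---|---|---|---|---|---|---|
| Field | sg7 | sg6 | sg5 | sg4 | sg3 | sg2 | sg1 | sg0 |
| Width | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |

Segment Group Enable Fields

| Bits | 7 | 6..3 | 2..0 |
|---|---|---|---|
| Field | L | F | ring |
| Width | 1 | 4 | 3 |

| Field | Width | Bits | Description |
|---|---|---|---|
| ring | 3 | 2:0 | Ring for which sdtp is enabled |
| F | 4 | 6:3 | Reserved for future use |
| L | 1 | 7 | Lock |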

Segment Descriptors

The segment descriptor can be thought of as the first-level page table, but with a 16 B descriptor instead of an 8 B PTE. The first 8 B of the descriptor is made very similar to the PTE format, with the extra permissions, attributes, etc. in the second 8 B of the descriptor.

Possible future changes:

Segment Descriptor Entry Word 0

| Bits | 71..64 | 63..4+PTS | 3+PTS..3 | 2..0 |
|---|---|---|---|---|
| Field | 240 | $\mathrm{svaddr}_{63..4+\mathrm{PTS}}$ | $2^{\mathrm{PTS}}$ | MAP |
| Width | 8 | 60−PTS | 1+PTS | 3 |

Segment Descriptor Entry Word 1

| Bits | 71..64 | 63..40 | 39..32 | 31..30 | 29..28 | 27 | 26..24 | 23 | 22..20 | 19 | 18..16 | 15 | 14..13 | 12 | 11 | 10..8 | 7 | 6 | 5..0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Field | 240 | 0 | SIAO | G1 | G0 | 0 | R3 | 0 | R2 | 0 | R1 | T | 0 | C | P | XWR | 0 | D | ssize |
| Width | 8 | 24 | 8 | 2 | 2 | 1 | 3 | 1 | 3 | 1 | 3 | 1 | 2 | 1 | 1 | 3 | 1 | 1 | 6 |

| Field | Width | Bits | Description |
|---|---|---|---|
| MAP | 3 | 2:0 | 0 ⇒ invalid SDE: bits 135..72, 63..3 available for software use; 2 ⇒ $\mathrm{svaddr}_{63..4+\mathrm{PTS}}$ is first level page table; 7 ⇒ $\mathrm{svaddr}_{63..\mathrm{ssize}}$ are high-bits of mapping; 1, 3..6 Reserved |
| $2^{\mathrm{PTS}}$ | 1+PTS | 3+PTS:3 | See non-leaf PTE |
| $\mathrm{svaddr}_{63..4+\mathrm{PTS}}$ | 60−PTS | 63:4+PTS | MAP = 2 ⇒ $\mathrm{svaddr}_{63..4+\mathrm{PTS}}$ is first level page table; MAP = 3 ⇒ $\mathrm{svaddr}_{63..\mathrm{ssize}}$ are high-bits of mapping |
| ssize | 6 | 5:0 | Segment size is $2^{\mathrm{ssize}}$ bytes for 12..61. Values 0..11 and 62..63 are reserved. |
| D | 1 | 6 | Downward segment (must be 0 if ssize ≥ 48); 0 ⇒ address bits 47..ssize must be clear, i.e. = $0^{48-\mathrm{ssize}}$; 1 ⇒ address bits 47..ssize must be set, i.e. = $1^{48-\mathrm{ssize}}$ |
| XWR | 3 | 10:8 | Read, Write, Execute permission: 0 Reserved; 1 Read-only; 2 Reserved; 3 Read-write; 4 Execute-only; 5 Read-execute; 6 Reserved; 7 Read-write-execute |
| P | 1 | 11 | Pointer permission (pointers with segment numbers are permitted); 0 ⇒ Stores of tags 0..199 to this segment take an access fault |
| C | 1 | 12 | CHERI Capability permission; 0 ⇒ Stores of tag 232 to this segment take an access fault |
| T | 1 | 15 | 0 ⇒ Memory tags give type and size; 1 ⇒ Memory tags are clique |
| R1 | 3 | 18:16 | Ring bracket 1 |
| R2 | 3 | 22:20 | Ring bracket 2 |
| R3 | 3 | 26:24 | Ring bracket 3 |
| G0 | 2 | 29:28 | Generation number of this segment for GC. |
| G1 | 2 | 31:30 | Largest generation of any contained pointer for GC. Storing a pointer with a greater generation number to this segment traps and software lowers the G1 field. This feature is turned off by setting G1 to 3. |
| SIAO | 8 | 39:32 | System Interconnect Attribute (SIA) override, addition, hints, etc. (e.g. cache bypass, as for example seen in most ISAs, such as RISC‑V’s PBMT) |

Local Virtual Address Direct Mapping

For direct mapping, the segment mapping consists of:

  1. Checking that the offset is not out of bounds for segments < $2^{48}$ bytes or clearing bits 63..size for segments ≥ $2^{48}$ bytes.
  2. Checking that the mapping is aligned to the segment size.
  3. Logical-or the offset with the mapping. The two checks above ensure that the logical-or never sees two ones in the same bit position.

For segments ≤ $2^{48}$ bytes, the offset is simply bits 47..0 of the local virtual address, and so the first check is that bits 47..size are zero (or all ones if downward is set in the Segment Descriptor Entry), or equivalently that $\mathrm{svaddr}_{47..0} < 2^{\mathrm{size}}$. For segments > $2^{48}$ bytes, the offset extends into the segment number field, and no checking need be done during mapping (such sizes are however used during checking address arithmetic), but bits 60..size must be cleared before the logical-or. The second check is that bits size−1..0 of the mapping are zero. The supervisor is responsible for providing the appropriate values in the Segment Descriptor Entries for each portion of segments > $2^{48}$ bytes. Thus, paging does not need to handle segments larger than $2^{48}$ bytes (the SDT for such segments is in effect the first level of the page table).
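
The three direct-mapping steps can be written out as follows. This is a sketch under stated assumptions: the function name is hypothetical, only upward segments (D = 0) are handled, and for segments larger than $2^{48}$ bytes every address bit at or above the segment size is cleared before the logical-or.

```python
def direct_map(lvaddr, mapping, ssize):
    """Map a local virtual address through a direct-mapped, upward segment.

    mapping is the svaddr from the Segment Descriptor Entry and ssize is
    log2 of the segment size in bytes.
    """
    # Step 2: the mapping must be aligned to the segment size.
    if mapping & ((1 << ssize) - 1):
        raise ValueError("mapping is not aligned to the segment size")
    if ssize <= 48:
        # Step 1: the offset is lvaddr bits 47..0; bits 47..ssize must be zero.
        offset = lvaddr & ((1 << 48) - 1)
        if offset >> ssize:
            raise ValueError("segment offset out of bounds")
    else:
        # Larger segments: no bounds check, just clear bits at and above ssize.
        offset = lvaddr & ((1 << ssize) - 1)
    # Step 3: the checks above guarantee the two operands never overlap.
    return mapping | offset

# A 2 MiB segment (ssize = 21) mapped at the hypothetical svaddr 0x4000_0000.
svaddr = direct_map((5 << 48) | 0x1234, 0x4000_0000, 21)
```

Because both checks pass before the logical-or, the result is the same as adding the offset to the mapping, without needing an adder on the translation path.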

Local Virtual Address Paging

When paging is used, the page tables can be one or more levels deep. Each level has the flexibility to use a different table size, chosen by the operating system when it sets up the tables. A simple operating system might use only a single table size (e.g. 4 KiB or 16 KiB) at every level except the first (which would be a fraction of this single size). The following tables provide examples of how the local virtual address could be used to index levels of the page table for several page and segment sizes in this simple operating system. This is not the recommended way to use SecureRISC’s capabilities, but more of the backward-compatible option. In the figures below, the 13‑bit segment number is split into a 3‑bit segment group (SG) number (used to pick the SDTP register) and the offset (SEG) within that group.

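Local Virtual Address with 4 KiB page size and $2^{21}$ segment size (1‑level page table)

| Bits | 63..61 | 60..48 | 47..21 | 20..12 | 11..0 |
|---|---|---|---|---|---|
| Field | SG | SEG | 0 | V1 | offset |
| Width | 3 | 13 | 27 | 9 | 12 |

Local Virtual Address with 4 KiB page size and $2^{30}$ segment size (2‑level page table)

| Bits | 63..61 | 60..48 | 47..30 | 29..21 | 20..12 | 11..0 |
|---|---|---|---|---|---|---|
| Field | SG | SEG | 0 | V1 | V2 | offset |
| Width | 3 | 13 | 18 | 9 | 9 | 12 |

Local Virtual Address with 4 KiB page size and $2^{48}$ segment size (4‑level page table)

| Bits | 63..61 | 60..48 | 47..39 | 38..30 | 29..21 | 20..12 | 11..0 |
|---|---|---|---|---|---|---|---|
| Field | SG | SEG | V1 | V2 | V3 | V4 | offset |
| Width | 3 | 13 | 9 | 9 | 9 | 9 | 12 |

Local Virtual Address with 16 KiB page size and $2^{25}$ segment size (1‑level page table)

| Bits | 63..61 | 60..48 | 47..25 | 24..14 | 13..0 |
|---|---|---|---|---|---|
| Field | SG | SEG | 0 | V1 | offset |
| Width | 3 | 13 | 23 | 11 | 14 |

Local Virtual Address with 16 KiB page size and $2^{47}$ segment size (3‑level page table)

| Bits | 63..61 | 60..48 | 47 | 46..36 | 35..25 | 24..14 | 13..0 |
|---|---|---|---|---|---|---|---|
| Field | SG | SEG | 0 | V2 | V3 | V4 | offset |
| Width | 3 | 13 | 1 | 11 | 11 | 11 | 14 |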
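
The address layouts above all follow one rule for the simple operating system: every table is one page of 8 B PTEs, except that the first level takes whatever index bits are left over. A sketch of that split follows; the function name is hypothetical, and the uniform table size is the simplifying assumption of this example rather than an architectural requirement.

```python
def split_lvaddr(lvaddr, page_bits, ssize, pte_bytes=8):
    """Split a local virtual address into (SG, SEG, [V1, V2, ...], offset).

    page_bits is log2 of the page size and ssize is log2 of the segment size.
    Assumes every table below the first level is exactly one page of PTEs.
    """
    level_bits = page_bits - (pte_bytes.bit_length() - 1)  # index bits per table
    sg = (lvaddr >> 61) & 0x7
    seg = (lvaddr >> 48) & 0x1FFF
    offset = lvaddr & ((1 << page_bits) - 1)
    indices = []
    lo = page_bits
    while lo < ssize:
        width = min(level_bits, ssize - lo)   # the first level may be narrower
        indices.insert(0, (lvaddr >> lo) & ((1 << width) - 1))
        lo += width
    return sg, seg, indices, offset

# 4 KiB pages and a 2^48-byte segment: four 9-bit indices.
va = (2 << 61) | (5 << 48) | (3 << 39) | (4 << 30) | (5 << 21) | (6 << 12) | 7
print(split_lvaddr(va, 12, 48))
```

The number of indices returned is the depth of the page table: one for a $2^{21}$-byte segment with 4 KiB pages, three for a $2^{47}$-byte segment with 16 KiB pages.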

At the other end of the spectrum, an operating system that is capable of allocating any power-of-two size for page tables, and which does not want to demand page the page tables, might use a single table of $2^{\mathrm{ssize}-14}$ PTEs mapping 16 KiB pages for most small segments. If the segment size is large enough that TLB miss rates are high, the operating system might allocate the segment’s pages in units of 64 KiB or 256 KiB and use the NAPOT encoding to take advantage of translation caches that can match sizes greater than 16 KiB. The following example illustrates how SecureRISC’s architecture might be used by such an operating system. Once again, in the figure below, the 13‑bit segment number is split into a 3‑bit segment group (SG) number (used to pick the SDTP register) and the offset (SEG) within that group.

Local Virtual Address with 256 KiB page size and $2^{48}$ segment size (2‑level page table)

| Bits | 63..61 | 60..48 | 47..33 | 32..18 | 17..0 |
|---|---|---|---|---|---|
| Field | SG | SEG | V1 | V2 | offset |
| Width | 3 | 13 | 15 | 15 | 18 |

A segment page table has multiple levels, with all but the last level consisting of 8 B‑aligned 72‑bit words with integer tags in the following format:

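Non-Leaf Page Table Entry (PTE)

| Bits | 71..64 | 63..4+PTS | 3+PTS..3 | 2..0 |
|---|---|---|---|---|
| Field | 240 | $\mathrm{svaddr}_{63..4+\mathrm{PTS}}$ | $2^{\mathrm{PTS}}$ | XWR |
| Width | 8 | 60−PTS | 1+PTS | 3 |

Fields of Non-Leaf Page Table Entries (PTEs)

| Field(s) | Width | Bits | Description |
|---|---|---|---|
| XWR | 3 | 2:0 | 0 ⇒ invalid PTE: bits 63..3 available for software; 2 ⇒ non-leaf PTE (this figure); 6 Reserved; 1, 3..5, 7 indicate a Leaf PTE (see below) |
| $2^{\mathrm{PTS}}$ | 1+PTS | 3+PTS:3 | Table size of next level is $2^{1+\mathrm{PTS}}$ entries ($2^{4+\mathrm{PTS}}$ bytes); see the values below |
| $\mathrm{svaddr}_{63..4+\mathrm{PTS}}$ | 60−PTS | 63:4+PTS | Pointer to the next level of table |

| PTS | Entries | Table size |
|---|---|---|
| 0 | 2 | 16 B |
| 1 | 4 | 32 B |
| 2 | 8 | 64 B |
| 3 | 16 | 128 B |
| 8 | 512 | 4 KiB |
| 9 | 1024 | 8 KiB |
| 10 | 2048 | 16 KiB |
| 14 | 32768 | 256 KiB |
| 34 | $2^{35}$ | 256 GiB |
| 35 | $2^{36}$ | 512 GiB |
| ≥36 | reserved | |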
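
Decoding a non-leaf PTE amounts to finding the lowest set bit at or above bit 3, which gives PTS, and then masking off everything below bit 4+PTS to recover the pointer. A minimal sketch follows; the function name is hypothetical and the PTE is modeled as a 64‑bit integer without its tag byte.

```python
def decode_nonleaf_pte(pte):
    """Return (PTS, entries in next table, svaddr of next table)."""
    if pte & 0x7 != 2:
        raise ValueError("XWR field does not mark a non-leaf PTE")
    field = pte >> 3
    if field == 0:
        raise ValueError("missing table size encoding")
    pts = (field & -field).bit_length() - 1    # zeros below the marker bit
    if pts >= 36:
        raise ValueError("reserved PTS value")
    entries = 1 << (1 + pts)                   # 2^(1+PTS) entries of 8 B each
    table = pte & ~((1 << (4 + pts)) - 1)      # svaddr bits 63..4+PTS
    return pts, entries, table

# Hypothetical PTE pointing at a 4 KiB table (PTS = 8) at svaddr 0x8000_0000.
pte = 0x8000_0000 | (1 << (3 + 8)) | 2
print(decode_nonleaf_pte(pte))
```

For this example the decode yields PTS = 8, 512 entries, and a table address of 0x8000_0000, matching the 4 KiB row of the size table.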

The last level (leaf) Page Table Entry (PTE) is a 72‑bit word with an integer tag in the following format:

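Leaf Page Table Entry (PTE)

| Bits | 71..64 | 63..12+S | 11+S..11 | 10..8 | 7 | 6 | 5..4 | 3 | 2..0 |
|---|---|---|---|---|---|---|---|---|---|
| Field | 240 | $\mathrm{svaddr}_{63..12+S}$ | $2^{S}$ | RSW | D | A | GC | 0 | XWR |
| Width | 8 | 52−S | 1+S | 3 | 1 | 1 | 2 | 1 | 3 |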

Segments are meant as the unit of access control, but including Read, Write, and Execute permissions in the PTE might make ports of less aware operating systems easier. If RWX permissions are not needed in PTEs for operating system ports, then this field could be reduced to just a variable 1-2 bits (one bit for leaf/non-leaf, and a Valid bit only in leaf PTEs), giving two bits back for another purpose. The most likely use of such a change would be to add two bits to System Virtual Addresses.

| Field | Width | Bits | Description |
|---|---|---|---|
| XWR | 3 | 2:0 | Read, Write, Execute permission (additional restriction on segment permissions): 0 invalid, bits 63..3 available for software; 1 Read-only; 2 Non-leaf PTE (see above); 3 Read-write; 4 Execute-only; 5 Read-execute; 6 Reserved; 7 Read-write-execute |
| GC | 2 | 5:4 | Largest generation of any contained pointer for GC. Storing a pointer with a greater generation number to this page traps and software lowers the GC field. This feature is turned off by setting GC to 3. |
| A | 1 | 6 | Accessed: 0 ⇒ trap on any access (software sets A to continue); 1 ⇒ access allowed |
| D | 1 | 7 | Dirty: 0 ⇒ trap on any write (software sets D to continue); 1 ⇒ writes allowed |
| RSW | 3 | 10:8 | For software use |
| $2^{S}$ | 1+S | 11+S:11 | This encodes the page size as the number of 0 bits followed by a 1 bit. If bit 11 is 1, then there are zero 0 bits, and S=0, which represents a page size of $2^{12}$ bytes (4 KiB). |
| $\mathrm{svaddr}_{63..12+S}$ | 52−S | 63:12+S | For last level of page table, this is the translation. For earlier levels, this is the pointer to the next level. |

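As an example of the NAPOT $0^{S}$ encoding, the following figures illustrate three page sizes:

4 KiB Leaf Page Table Entry (PTE)

| Bits | 71..64 | 63..12 | 11 | 10..8 | 7 | 6 | 5..4 | 3 | 2..0 |
|---|---|---|---|---|---|---|---|---|---|
| Field | 240 | $\mathrm{svaddr}_{63..12}$ | 1 | RSW | D | A | GC | 0 | XWR |
| Width | 8 | 52 | 1 | 3 | 1 | 1 | 2 | 1 | 3 |

16 KiB Leaf Page Table Entry (PTE)

| Bits | 71..64 | 63..14 | 13 | 12..11 | 10..8 | 7 | 6 | 5..4 | 3 | 2..0 |
|---|---|---|---|---|---|---|---|---|---|---|
| Field | 240 | $\mathrm{svaddr}_{63..14}$ | 1 | $0^{2}$ | RSW | D | A | GC | 0 | XWR |
| Width | 8 | 50 | 1 | 2 | 3 | 1 | 1 | 2 | 1 | 3 |

256 KiB Leaf Page Table Entry (PTE)

| Bits | 71..64 | 63..18 | 17 | 16..11 | 10..8 | 7 | 6 | 5..4 | 3 | 2..0 |
|---|---|---|---|---|---|---|---|---|---|---|
| Field | 240 | $\mathrm{svaddr}_{63..18}$ | 1 | $0^{6}$ | RSW | D | A | GC | 0 | XWR |
| Width | 8 | 46 | 1 | 6 | 3 | 1 | 1 | 2 | 1 | 3 |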
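
The three figures above can be generated and decoded mechanically. The sketch below builds a leaf PTE for a given page size and recovers the size again by counting the zero bits from bit 11 upward; the function names and default field values are hypothetical, and the tag byte is omitted.

```python
def encode_leaf_pte(svaddr, page_bits, xwr, dirty=1, accessed=1, gc=3):
    """Build the 64-bit data part of a leaf PTE for a 2^page_bits-byte page."""
    s = page_bits - 12
    if s < 0 or svaddr & ((1 << page_bits) - 1):
        raise ValueError("page too small or svaddr not aligned to the page size")
    if xwr not in (1, 3, 4, 5, 7):
        raise ValueError("XWR value is not a leaf permission")
    # The single 1 bit at position 11+S, with S zero bits below it, is 2^S.
    return svaddr | (1 << (11 + s)) | (dirty << 7) | (accessed << 6) | (gc << 4) | xwr

def leaf_page_bits(pte):
    """Return log2 of the page size encoded in a leaf PTE."""
    field = pte >> 11
    if field == 0:
        raise ValueError("missing page size encoding")
    s = (field & -field).bit_length() - 1      # zero bits before the first 1
    return 12 + s

def leaf_translation(pte):
    """Return the svaddr of the page, i.e. bits 63..12+S of the PTE."""
    return pte & ~((1 << leaf_page_bits(pte)) - 1)

pte = encode_leaf_pte(0x40000, 18, 3)          # 256 KiB read-write page
print(leaf_page_bits(pte), hex(leaf_translation(pte)))
```

A translation cache that only matches 16 KiB pages could still accept this PTE by installing the 16 KiB portion that covers the faulting address, which is the page size reduction mentioned earlier.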

System Interconnect Address Attributes

SecureRISC’s System Interconnect Address Attributes (SIAA) are inspired by RISC‑V’s Physical Memory Attributes (PMA). SIAAs are specified on Naturally Aligned Powers of Two (NAPOT) siaddrs. The first attribute is the memory type, described below. Attributes are further distinguished for some memory types based on the cacheability software chooses for a portion of the NAPOT address space. Cacheability options are instruction and data caching with a specified coherence protocol, instruction and data caching without coherence, instruction caching only, and uncached. The set of coherency protocols to be enumerated is TBD, but is likely to include at least MESI and MOESI. Uncached instruction accesses may require full cache block transfers on some processors to keep things simpler, with the transferred cache block used multiple times before being discarded on a reference to another cache block (so there is a limited amount of caching even for uncached instruction accesses).

The attributes are organized into the following categories:

Category Applicability
Void ROM Main I/O
Memory type
Dynamic configuration (e.g. hotplug)
Non-volatile 1
Error correction: type (e.g. SECDED, Reed-Solomon, etc.)
and granularity (e.g. 72, 144, etc. bits)
Error reporting (how detected errors are reported)
Mandatory Access Control Set
Read access widths supported
Write access widths supported
Execute access widths supported
Uncached Alignment
Uncached Atomic Compare and Swap (CAS) widths
Uncached Atomic AND/OR widths
Uncached Atomic ADD widths
Coherence Protocols (e.g. uncached, cached without coherence, cached coherent (one of MESI, MOESI), directory-based coherence type) ?
NUMA location (for computing distances)
Read idempotency 1 1
Write idempotency 1

Memory type is one of four values:

Void
Accesses to these siaddrs are errors. There are no further relevant attributes for this memory type. Void could be considered to be a subtype of Main Memory or I/O without any access options, but separating it is perhaps more straightforward.
ROM
These siaddrs contain Read-Only-Memory devices (e.g. the boot ROM) with various attributes described below. ROM may be cached without coherence, instruction cached without coherence, or uncached. ROM is always read idempotent. ROM could be considered to be a subtype of Main Memory without any write access, but separating it is perhaps more straightforward.
Main Memory
These siaddrs serve as system memory with various attributes described below. All caching options apply to main memory. Main memory is always read and write idempotent.
I/O
These siaddrs are I/O device registers and memory with various attributes described below. The caching options for I/O siaddrs are TBD. In particular, do I/O areas allow cache coherency support, or are they always non-coherent?
Access Widths
Width Tag Align Comment
UT T
8 TI any LX8*, LS8*, SX8*, SS8*, etc.
16 TI 0..62 mod 64 Crossing cache block boundary not supported
32 TI 0..60 mod 64 Crossing cache block boundary not supported
64 TI 0..56 mod 64 Crossing cache block boundary not supported
72 0 mod 8 Uncached LA*, LX*, LS*, SA*, SX*, SS*, etc.
128 TI 0..48 mod 64 Crossing cache block boundary not supported
144 0 mod 16 Uncached LAD*, LXD*, LSD*, SAD*, SXD*, SSD*, etc.
256 TI 0..32 mod 64 Uncached vector load/store
288 TI 0 mod 32 Uncached vector load/store
512 TI 0 mod 64 Uncached vector load/store, cached untagged refill and writeback
576 0 mod 64 Uncached vector load/store, cached tagged refill and writeback
768 0 mod 64 Cached tagged refill and writeback with encryption

In the table above, the UT column indicates untagged memory support, the T column indicates tagged memory support, and the TI entry in the tagged column indicates Tagged Immediate, defined on tagged memory where the word contains a tag in the range 240..255. Untagged memory supplies a 240 tag to the system interconnect on a read, and requires a 240 tag from the system interconnect on writes. Tagged writes (cached or uncached) to untagged memory siaddrs fail if the tag is not 240. Main memory and ROMs may impose additional uncached alignment requirements (e.g. Naturally Aligned Power Of Two (NAPOT) rather than arbitrary alignment within cache blocks).
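
The alignment ranges listed for the power-of-two widths in the Access Widths table all express one rule: the access must not cross a 64 B cache block boundary. A minimal sketch of that check follows; the function name is hypothetical, and it does not model the separate "0 mod 8" and "0 mod 16" rules for the 72‑bit and 144‑bit tagged widths or any stricter NAPOT requirement an implementation may add.

```python
CACHE_BLOCK_BYTES = 64

def within_cache_block(addr, width_bits):
    """True if an access of width_bits at addr stays inside one cache block."""
    nbytes = width_bits // 8
    return (addr % CACHE_BLOCK_BYTES) + nbytes <= CACHE_BLOCK_BYTES

# Largest allowed offset within a block for each untagged width in the table.
for width, last_ok in {16: 62, 32: 60, 64: 56, 128: 48, 256: 32}.items():
    print(width, within_cache_block(last_ok, width),
          within_cache_block(last_ok + 1, width))
```

Each width is accepted at the last offset the table lists and rejected one byte later, so the ranges "0..62 mod 64" through "0..32 mod 64" fall out of the single comparison.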

Main memory must support reads and writes. ROMs only support reads. I/O memory may support reads, writes, or both, and may be idempotent or non-idempotent.

Cached Main Memory SIAAs
Attribute Width
512 576 768
Read
Write
Execute
Coherence protocols TBD
Cached ROM SIAAs
Attribute Width
512 576 768
Read
Write n.a.
Execute
Coherence protocols n.a.
Cached I/O SIAAs
Attribute Width
512 576 768
Read TBD
Write
Execute
Coherence protocols
Uncached Main Memory SIAAs
Attribute Width
8 16 32 64 72 128 144 256 288 512 576 768
Read
Write
Execute 0 0 0 0
Atomic CAS
Atomic AND/OR
Atomic ADD 0 0
Coherence protocols n.a.
Read Idempotency 1
Write Idempotency 1
Uncached ROM SIAAs
Attribute Width
8 16 32 64 72 128 144 256 288 512 576 768
Read
Write 0
Execute 0 0 0 0
Atomic CAS 0
Atomic AND/OR 0
Atomic ADD 0
Coherence protocols n.a.
Read Idempotency 1
Write Idempotency n.a.
Uncached I/O SIAAs
Attribute Width
8 16 32 64 72 128 144 256 288 512 576 768
Read
Write
Execute 0 0 0 0
Atomic CAS
Atomic AND/OR
Atomic ADD 0 0
Coherence protocols n.a.
Read Idempotency
Write Idempotency

Tagged memory is an attribute derived from the above. Tagged is true for ROM and main memory that supports uncached 72‑bit reads or cached 576‑bit or 768‑bit (for authentication and optional encryption) reads and optionally writes. Untagged memory supports some subset of uncached 8‑bit, …, 64‑bit, 128‑bit reads and optionally writes, or cached 512‑bit reads and optionally writes, and supplies a 240 tag on read, and accepts a 240 or 241 tag on writes. Code ROM (e.g. the boot ROM) might support only tags 241 and 252.

Encryptable is an attribute derived from the above. Encryptable is true for ROM and main memory that supports cached 768‑bit (for authentication and optional encryption) reads and optionally writes.

CHERI capable is an attribute derived from the above. CHERI capable is true for tagged main memory that supports tags 240, 232, and 251. This could be a cacheable 512‑bit memory that synthesizes tags on read from an in-DRAM tag table with cache and compression.

System Virtual to System Interconnect Address Mapping

After 64‑bit Local Virtual Addresses (lvaddrs) are mapped to 64‑bit System Virtual Addresses (svaddrs), these 64‑bit svaddrs are mapped to 64‑bit System Interconnect Addresses (siaddrs). This mapping is similar, but not identical, to the mapping above, as it starts with a 16‑bit region number rather than one of eight 13‑bit segment numbers. There is one such mapping set by the hypervisor for the entire system using a Region Descriptor Table (RDT) at a fixed system address. The RDT may be hardwired, or read-only, or read/write by the hypervisor. For the maximum 65,536 regions, with 16 bytes for an RDT entry, the maximum size RDT is 1 MiB. A system configuration parameter allows the size of the RDT to be reduced when the full number of regions is not required (which is likely).

The format of the Region Descriptor Entries is shown below. It is similar to Segment Descriptor Entries, but without the D, X, P, C, R1, R2, R3, G0, G1, and SIAO fields, and with the addition of the MAC, and ATTR fields.

A possible future addition would be a permission bit that prohibits execution from privileged rings. Alternatively, there could be a mandatory access bit required in MAC for this.

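Region Descriptor Entry Word 0

| Bits | 71..64 | 63..4+PTS | 3+PTS..3 | 2..0 |
|---|---|---|---|---|
| Field | 240 | $\mathrm{siaddr}_{63..4+\mathrm{PTS}}$ | $2^{\mathrm{PTS}}$ | MAP |
| Width | 8 | 60−PTS | 1+PTS | 3 |

Region Descriptor Entry Word 1

| Bits | 71..64 | 63..32 | 31..28 | 27..16 | 15 | 14..12 | 11..10 | 9..8 | 7..6 | 5..0 |
|---|---|---|---|---|---|---|---|---|---|---|
| Field | 240 | ATTR | ENC | MAC | 0 | RPT | 0 | WR | 0 | rsize |
| Width | 8 | 32 | 4 | 12 | 1 | 3 | 2 | 2 | 2 | 6 |

Fields of Region Descriptor Entries

| Field | Width | Bits | Description |
|---|---|---|---|
| MAP | 3 | 2:0 | 0 ⇒ invalid RDE: bits 135..72, 63..3 available for software use; 2 ⇒ $\mathrm{siaddr}_{63..4+\mathrm{PTS}}$ is first level page table; 3 ⇒ $\mathrm{siaddr}_{63..\mathrm{rsize}}$ are high-bits of mapping; 1, 4..7 Reserved |
| $2^{\mathrm{PTS}}$ | 1+PTS | 3+PTS:3 | Table size of next level is $2^{1+\mathrm{PTS}}$ entries ($2^{4+\mathrm{PTS}}$ bytes); see the values below |
| $\mathrm{siaddr}_{63..4+\mathrm{PTS}}$ | 60−PTS | 63:4+PTS | MAP = 2 ⇒ $\mathrm{siaddr}_{63..4+\mathrm{PTS}}$ is first level page table; MAP = 3 ⇒ $\mathrm{siaddr}_{63..\mathrm{rsize}}$ are high-bits of mapping |
| rsize | 6 | 5:0 | Region size is $2^{\mathrm{rsize}}$ bytes for 12..61. Values 0..11 and 62..63 are reserved. |
| WR | 2 | 9:8 | Write Read permission |
| RPT | 3 | 14:12 | Region Protection Table ring. Accesses by rings less than or equal to this value apply permissions specified by rptp. |
| MAC | 12 | 27:16 | Mandatory Access Set |
| ENC | 4 | 31:28 | Encryption index: 0 ⇒ no encryption; 1..15 ⇒ index into table giving algorithm and 256‑bit key |
| ATTR | 32 | 63:32 | Physical Memory Attributes |

| PTS | Entries | Table size |
|---|---|---|
| 0 | 2 | 16 B |
| 1 | 4 | 32 B |
| 2 | 8 | 64 B |
| 3 | 16 | 128 B |
| 8 | 512 | 4 KiB |
| 9 | 1024 | 8 KiB |
| 10 | 2048 | 16 KiB |
| 14 | 32768 | 256 KiB |
| 34 | $2^{35}$ | 256 GiB |
| 35 | $2^{36}$ | 512 GiB |
| ≥36 | reserved | |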

A region page table has multiple levels, each level consisting of 72‑bit words with integer tags in the same format as PTEs for local virtual to system virtual mapping, except that there are no X or GC fields.

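Second-Level Non-Leaf Page Table Entry (SPTE)

| Bits | 71..64 | 63..4+PTS | 3+PTS..3 | 2..0 |
|---|---|---|---|---|
| Field | 240 | $\mathrm{siaddr}_{63..4+\mathrm{PTS}}$ | $2^{\mathrm{PTS}}$ | XWR |
| Width | 8 | 60−PTS | 1+PTS | 3 |

| Field | Width | Bits | Description |
|---|---|---|---|
| XWR | 3 | 2:0 | 0 ⇒ invalid PTE: bits 63..3 available for software; 2 ⇒ non-leaf PTE (this figure); 1, 3 indicate valid Second-Level Leaf PTE (see below); 4..7 Reserved |
| $2^{\mathrm{PTS}}$ | 1+PTS | 3+PTS:3 | Table size of next level is $2^{1+\mathrm{PTS}}$ entries ($2^{4+\mathrm{PTS}}$ bytes); see the values below |
| $\mathrm{siaddr}_{63..4+\mathrm{PTS}}$ | 60−PTS | 63:4+PTS | Pointer to the next level of table |

| PTS | Entries | Table size |
|---|---|---|
| 0 | 2 | 16 B |
| 1 | 4 | 32 B |
| 2 | 8 | 64 B |
| 3 | 16 | 128 B |
| 8 | 512 | 4 KiB |
| 9 | 1024 | 8 KiB |
| 10 | 2048 | 16 KiB |
| 14 | 32768 | 256 KiB |
| 34 | $2^{35}$ | 256 GiB |
| 35 | $2^{36}$ | 512 GiB |
| ≥36 | reserved | |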

The Second-Level Leaf Page Table Entry (PTE) is a 72‑bit word with an integer tag in the following format:

Second-Level Leaf Page Table Entry (SPTE)

| Bits | 71..64 | 63..12+S | 11+S..11 | 10..8 | 7 | 6 | 5..3 | 2..0 |
|---|---|---|---|---|---|---|---|---|
| Field | 240 | $\mathrm{siaddr}_{63..12+S}$ | $2^{S}$ | RSW | D | A | 0 | XWR |
| Width | 8 | 52−S | 1+S | 3 | 1 | 1 | 3 | 3 |

| Field | Width | Bits | Description |
|---|---|---|---|
| XWR | 3 | 2:0 | Read, Write permission: 0 invalid, bits 63..3 available for software; 1 Read-only; 2 Non-leaf PTE (see above); 3 Read-write; 4..7 Reserved |
| A | 1 | 6 | Accessed: 0 ⇒ trap on any access (software sets A to continue); 1 ⇒ access allowed |
| D | 1 | 7 | Dirty: 0 ⇒ trap on any write (software sets D to continue); 1 ⇒ writes allowed |
| RSW | 3 | 10:8 | For software use |
| $2^{S}$ | 1+S | 11+S:11 | This encodes the page size as the number of 0 bits followed by a 1 bit. If bit 11 is 1, then there are zero 0 bits, and S=0, which represents a page size of $2^{12}$ bytes (4 KiB). |
| $\mathrm{siaddr}_{63..12+S}$ | 52−S | 63:12+S | For last level of page table, this is the translation. For earlier levels, this is the pointer to the next level. |

Translation Cache (TLB) Flushing

Cache coherency protocols automatically transfer and invalidate cache data in response to loads and stores from multiple processors. It is tempting to find a similar mechanism to avoid translation cache invalidates being performed by software. The problem is that, unlike coherent instruction and data caches, the same translation entries may occur in multiple translation cache locations, making a directory approach difficult. Unless some mechanism is found to make this feasible, SecureRISC will require some way for software to manage the translation caches. The instructions for this are TBD. The following explores the possibility of translation coherence a bit further.

Reading Segment Descriptor Entries (SDEs) from the Segment Descriptor Table (SDT) and Region Descriptor Entries (RDEs) from the Region Descriptor Table (RDT) would typically be done through the L2 Data Cache. Since the L2 Data Cache is coherent with respect to this and other processors in the system, it is possible that the L2 Data Cache might note that the translation cache contains entries from the line and send an invalidate to the translation cache when the L2 line is invalidated. This might avoid the need for some translation cache flushes. However, this requires the L2 to store the translation cache locations to invalidate. An alternative might be to have translations check the L2, which at least requires only a single value rather than the multiple values that an L2 directory would require. This might work by the L2 noting that a line has been fetched by the translation caches, and if the line is modified or invalidated, incrementing a counter. If the counter stored in a translation cache entry is less than the L2 counter, then the entry needs to be checked before use (counter wrapping would need to flush entries from the translation caches). It seems unlikely that this much mechanism would be worthwhile, but it is documented here in case further consideration changes this evaluation.

Region Protection

The hypervisor has a unified 64‑bit address space of System Virtual Addresses (svaddrs) divided into 65536 regions. (The purpose of the unified address space is to simplify the communication between supervisors and I/O devices.) The hypervisor allocates regions to supervisors, for example using the region descriptors to only allocate the appropriate portion of memory and I/O spaces to them. In a unified address space, each supervisor is capable of attempting references to the addresses of other supervisors or even to hypervisor addresses. Only the region protection mechanism prevents such access attempts from succeeding. The first level of protection is simple Read and Write permissions that the hypervisor sets for each region and supervisor. This is implemented as a 65536-entry table, one for each region, of 2‑bit values (16 KiB):

Value Permission
0 None
1 Read-only
2 Reserved
3 Read and write
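
A lookup in such a table might look like the sketch below. Several details are assumptions made only for illustration and are not specified by the text: the region number is taken to be svaddr bits 63..48, four 2‑bit entries are packed per byte with the lowest-numbered region in the least significant bits, and the function name is hypothetical.

```python
REGIONS = 65536
RPT_BYTES = REGIONS * 2 // 8           # 2 bits per region = 16 KiB

def region_permission(rpt, svaddr):
    """Return the 2-bit permission (0 none, 1 read-only, 3 read and write)."""
    region = (svaddr >> 48) & 0xFFFF   # assumed: region number is bits 63..48
    byte = rpt[region >> 2]            # assumed: four entries per byte
    return (byte >> ((region & 3) * 2)) & 0x3

rpt = bytearray(RPT_BYTES)             # every region starts with no access
rpt[5 >> 2] |= 3 << ((5 & 3) * 2)      # grant read and write to region 5
print(region_permission(rpt, 5 << 48), region_permission(rpt, 4 << 48))
```

With 64 B cache lines, one line of the table under this packing covers 256 consecutive regions, which is why clustering a supervisor’s regions helps the translation cache miss path.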

At this time, I don’t see the need to add per region read and write ring brackets to these permissions. The unified region descriptor table does specify the rings that employ these permissions, which allows the hypervisor access to its own regions on any entry to hypervisor code.

For SecureRISC processors, the hypervisor specifies region permissions by storing a siaddr to the table in the rptp CSR. This would typically be context switched by the hypervisor. While most supervisors would have a unique rptp value, in theory a single protection domain could be shared by a cooperating group of supervisors. Region protection is cached in translation caches along with other permissions and the lvaddr→siaddr mapping. The PDID field exists to allow cached values to be differentiated when the rptp value or its contents changes in a fashion similar to the ASID field of sdtp registers.

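Region Protection Table Pointer

| Bits | 71..64 | 63..12+PDS | 11+PDS..11 | 10..0 |
|---|---|---|---|---|
| Field | 240 | $\mathrm{siaddr}_{63..12+\mathrm{PDS}}$ | $2^{\mathrm{PDS}}$ | PDID |
| Width | 8 | 52−PDS | 1+PDS | 11 |

| Field | Width | Bits | Description |
|---|---|---|---|
| PDID | 11 | 10:0 | Protection Domain ID |
| $2^{\mathrm{PDS}}$ | 1+PDS | 11+PDS:11 | Encoding of Region Protection Table Size |
| $\mathrm{siaddr}_{63..12+\mathrm{PDS}}$ | 52−PDS | 63:12+PDS | Pointer to Region Protection Table |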

Because translation cache misses in many microarchitectures will access the Region Protection Table through the L2 data cache, the hypervisor may find it benefits performance to allocate regions to supervisors in a clustered fashion, so that a single L2 data cache line serves all RPT accesses during a supervisor’s quantum.

Non-processor initiating ports into the system interconnect (Initiators) are limited in which regions they are permitted to access. One option is similar functionality to the processor mechanism described above. For example, a port might have a configuration register equivalent to rptp that the hypervisor can set, or something simpler or more complex depending on the complexity of the entities using the port for region access.

Another possibility is that each Initiator is programmed by the hypervisor with two or more Mandatory Access Control (MAC) sets. One is for the Initiator’s TLB accesses, and the others are for accesses made by agents that the Initiator services. The MAC set for a region is stored as part of the Region Descriptors and cached in the Initiator’s TLB. The Initiator tests each access and rejects those that fail. Read access requires RegionCategories ⊆ InitiatorCategories and Write access requires RegionCategories = InitiatorCategories. For example, the Region Descriptor Table and the page tables those reference might have a Hypervisor bit that would prevent reads and writes from anything but Initiator TLBs. Processors would have Mandatory Access Control sets per ring. This would allow the same system to support multiple classification levels, e.g. Orange Book Secret and Top-Secret, with Top-Secret peripherals able to read both Secret and Top-Secret memory, but Secret peripherals denied access to Top-Secret memory.
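
The MAC test stated above (read requires RegionCategories ⊆ InitiatorCategories, write requires RegionCategories = InitiatorCategories) is small enough to write out directly. In the sketch below the function name and the category names are hypothetical, and Python sets stand in for the 12‑bit MAC field.

```python
def mac_allows(region_categories, initiator_categories, write):
    """Apply the Mandatory Access Control test for one access."""
    if write:
        return region_categories == initiator_categories
    return region_categories <= initiator_categories    # subset test

# Hypothetical category sets for a two-level system.
SECRET = frozenset({"secret"})
TOP_SECRET = frozenset({"secret", "top-secret"})

print(mac_allows(SECRET, TOP_SECRET, write=False))      # Top-Secret reads Secret
print(mac_allows(TOP_SECRET, SECRET, write=False))      # Secret reads Top-Secret
print(mac_allows(SECRET, TOP_SECRET, write=True))       # Top-Secret writes Secret
```

Under this rule a Top-Secret peripheral can read Secret memory but cannot write to it, since a write requires the two sets to be equal, while a Secret peripheral can neither read nor write Top-Secret memory.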

Encryption might also be used to protect multiple levels of data in a system. For example, if Secret and Top-Secret data in memory are encrypted with different keys, and Secret Initiators are programmed with only the Secret key, then reading Top-Secret memory will result in garbage being read, and writing Top-Secret data from a peripheral to Secret memory will result in that data being garbage to a processor or another peripheral with only the Secret key.

Because encryption results only in data being unintelligible, it is more difficult to debug. It may be desirable to employ both MAC sets and encryption.

Memory Encryption

An optional system feature of Region Descriptor Entries (RDEs) is to specify that the contents of the memory of the region should be protected by authenticated encryption on a cache line basis. If the keys are sufficiently protected, e.g. in a secure enclave, the data may be protected even when system software is compromised. A separate table in the secure enclave gives the symmetric encryption key for encrypting and decrypting data transferred to and from the region and the system interconnect address would be used as the tweak. The challenge of cache line encryption, with only 64 bytes of data, is providing sufficient security with a smaller storage overhead than is typical for larger data blocks, while keeping the added latency of a cache miss minimal.

Cache lines are 576 bits. To encrypt, use a standard 128‑bit block cipher (e.g. AES128) five times in counter mode with 128 bits of the key, and xor the resulting keystream with the cache line to produce the ciphertext. Append a 64‑bit authentication code and the 64‑bit nonce used for encryption and authentication, yielding 704 bits. The authentication code is a hash of the 576‑bit ciphertext added to the block cipher output for a different counter value under the other 128 bits of the key. Adding 8 ECC bits to each 88 bits produces a memory line of 768 bits. Main memory might then be implemented with three standard 32‑bit or 64‑bit DRAM modules. Reads of encrypted memory would compute the 576 counter mode xor bits during the read latency, resulting in a single xor when the data arrives at the system interconnect port boundary (either 96, 192, 384, or 576 bits per cycle). This xor would take much less time than the ECC check. Writes would incur the counter mode computation latency (primarily six AES computations). Because the memory width and interconnect fabric would be sized for encryption, the only point in not encrypting a region would be to reduce write latency or to support non-block writes (it being impossible to update the authentication code without doing a read, modify, write).
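The bit budget in the paragraph above can be checked with a few lines of arithmetic. The constant and function names below are made up for this sketch; only the numbers come from the text.

```python
CACHE_LINE_BITS = 576      # cache line including tags, before ECC
AUTH_BITS = 64             # authentication code
NONCE_BITS = 64            # nonce stored with the line
ECC_GROUP_BITS = 88        # bits covered by each ECC group
ECC_BITS_PER_GROUP = 8     # ECC bits added per group
CIPHER_BLOCK_BITS = 128    # block size of the 128-bit block cipher


def memory_line_layout():
    """Return the sizes implied by the encrypted memory line format."""
    protected = CACHE_LINE_BITS + AUTH_BITS + NONCE_BITS
    groups, remainder = divmod(protected, ECC_GROUP_BITS)
    if remainder:
        raise ValueError("protected bits do not divide into ECC groups")
    memory_line = groups * (ECC_GROUP_BITS + ECC_BITS_PER_GROUP)
    # Ceiling division: cipher blocks needed to cover the cache line.
    keystream_blocks = -(-CACHE_LINE_BITS // CIPHER_BLOCK_BITS)
    return {
        "protected_bits": protected,
        "ecc_groups": groups,
        "memory_line_bits": memory_line,
        "keystream_blocks": keystream_blocks,
    }


print(memory_line_layout())
```

Five keystream blocks supply 640 bits, of which 576 are needed; the sixth block cipher computation mentioned for writes is the one used for the authentication code.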

Encryption would not be supported for untagged memory, as the purpose of untagged memory is primarily for I/O devices. Were encryption to be supported it would have to be with a tweakable block cipher (e.g. XTS-AES), because such memory would not support the extra bits required for tags and authentication.

In particular the encryption of the 576‑bit (ignoring ECC) cache line CL to a 768‑bit memory line ML (including ECC) using cache line address siaddr63..6 and the 64‑bit internal state nextnonce would be as follows:

nonce ← nextnonce
nextnonce ← (nextnonce ⊗ 𝑥) mod (𝑥^64+𝑥^4+𝑥^3+𝑥+1)
T0 ← AESenc(Key127..0, nonce63..0∥siaddr63..6∥000000)
T1 ← AESenc(Key127..0, nonce63..0∥siaddr63..6∥000001)
T2 ← AESenc(Key127..0, nonce63..0∥siaddr63..6∥000010)
T3 ← AESenc(Key127..0, nonce63..0∥siaddr63..6∥000100)
T4 ← AESenc(Key127..0, nonce63..0∥siaddr63..6∥001000)
T5 ← AESenc(Key255..128, nonce63..0∥siaddr63..6∥100000)63..0
C ← CL ⊕ (T4∥T3∥T2∥T1∥T0)
A0 ← (C63..0 ⊗ K0) mod (𝑥^64+𝑥^4+𝑥^3+𝑥+1)
A1 ← (C127..64 ⊗ K1) mod (𝑥^64+𝑥^4+𝑥^3+𝑥+1)
A2 ← (C191..128 ⊗ K2) mod (𝑥^64+𝑥^4+𝑥^3+𝑥+1)
A3 ← (C255..192 ⊗ K3) mod (𝑥^64+𝑥^4+𝑥^3+𝑥+1)
A4 ← (C319..256 ⊗ K4) mod (𝑥^64+𝑥^4+𝑥^3+𝑥+1)
A5 ← (C383..320 ⊗ K5) mod (𝑥^64+𝑥^4+𝑥^3+𝑥+1)
A6 ← (C447..384 ⊗ K6) mod (𝑥^64+𝑥^4+𝑥^3+𝑥+1)
A7 ← (C511..448 ⊗ K7) mod (𝑥^64+𝑥^4+𝑥^3+𝑥+1)
A8 ← (C575..512 ⊗ K8) mod (𝑥^64+𝑥^4+𝑥^3+𝑥+1)
AC ← A8⊕A7⊕A6⊕A5⊕A4⊕A3⊕A2⊕A1⊕A0⊕T5
AE ← C ∥ nonce63..0 ∥ AC
M0 ← ECC(AE87..0) ∥ AE87..0
M1 ← ECC(AE175..88) ∥ AE175..88
M2 ← ECC(AE263..176) ∥ AE263..176
M3 ← ECC(AE351..264) ∥ AE351..264
M4 ← ECC(AE439..352) ∥ AE439..352
M5 ← ECC(AE527..440) ∥ AE527..440
M6 ← ECC(AE615..528) ∥ AE615..528
M7 ← ECC(AE703..616) ∥ AE703..616
ML ← M7 ∥ M6 ∥ M5 ∥ M4 ∥ M3 ∥ M2 ∥ M1 ∥ M0
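The polynomial steps above (the nonce update and the A0 through A8 terms) are multiplications in GF(2^64) with the stated modulus. The following is a minimal sketch of just those steps, assuming the usual encoding in which bit i of an integer is the coefficient of 𝑥^i. The block cipher is not modeled: the value of T5 is passed in, and the key words K0..K8 are taken as given, since their derivation is not specified here. The function names are hypothetical.

```python
MASK64 = (1 << 64) - 1
REDUCTION = 0x1B  # low-order terms x^4 + x^3 + x + 1 of the modulus


def gf64_mul(a, b):
    """Multiply two 64-bit polynomials modulo x^64 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> 64:  # an x^64 term appeared: fold it back in
            a = (a & MASK64) ^ REDUCTION
    return result


def next_nonce(nonce):
    """nextnonce <- (nextnonce * x) mod the polynomial; x is encoded as 2."""
    return gf64_mul(nonce, 2)


def auth_code(ciphertext, key_words, t5):
    """AC <- A8 xor ... xor A0 xor T5, where Ai = (64-bit word i of C) * Ki."""
    ac = t5 & MASK64
    for i, k in enumerate(key_words):
        word = (ciphertext >> (64 * i)) & MASK64
        ac ^= gf64_mul(word, k)
    return ac


print(hex(next_nonce(1 << 63)))  # the x^64 term reduces to 0x1b
```

Because multiplication by 𝑥 is a shift and a conditional xor, the nonce update amounts to one step of a 64-bit linear feedback shift register.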

The inverse of the above for decrypting ML to CL and checking the authentication is obvious. A variant where a 144‑bit block cipher with a 144‑bit key (e.g. AES with 9‑bit S-boxes or an obvious Simon144/144) is used instead of 128‑bit AES is fairly obvious, and might make sense for datapath width matching, but the nonce and authentication would remain 64 bits to fit the result in 768, which probably makes datapath matching consideration secondary, and the extra key width is a slight annoyance (but see PQC note below where a 144‑bit key might be an advantage).

It is not yet determined whether K0, K1, …, K8 are constants or generated from the 256‑bit key via a key schedule algorithm, or simply provided by the software.

The table in the secure enclave specifies up to 15 algorithm and key pairs (where typically a key is an encryption key and authentication key pair).

Value Encryption Auth Extra bits What
0 None None 0 No encryption, no authentication
1 None CWMAC 64+64 No encryption; authentication using 64‑bit Carter-Wegman with 64‑bit nonce and Key255..128
2 AES128 CWMAC 64+64 AES128 CTR encryption with 64‑bit nonce, 64‑bit tweak; 64‑bit Carter-Wegman authentication code; Key127..0 used for AES128 CTR, Key255..128 used for authentication
3..15 Reserved

It is possible that Simon128/128 could be used in place of AES128 to reduce the amount of area required. The area of the 16 S-boxes for one round of AES is somewhat expensive, and six iterations of 14 rounds are too slow, so perhaps 96 S-boxes are required to keep the write latency reasonable (the read latency being covered by the DRAM access time with this many S-boxes).

Post-Quantum Cryptography (PQC) may require a 192‑bit or 256‑bit key due to Grover’s Algorithm[wikilink]. AES192 and AES256 however require 12 and 14 rounds respectively (20% and 40% more than AES128), which may add too much latency to cache block write-back, which is already somewhat affected by the 10 rounds of AES128, each of which is relatively slow due to the S-box complexity. It is possible that Simon128/192 or Simon128/256 become better choices at larger key sizes, as 192‑bit and 256‑bit keys require only 1.5% and 5.9% additional rounds. On the other hand, it is also possible to use additional S-boxes for parallel AES computation. AES S-boxes are somewhat costly, which argues against this, but in counter-mode encryption inverse S-boxes are not required, so perhaps this is acceptable. For example, by using 32 S-boxes, two of the six computations specified above can be performed in parallel, with the S-boxes being used only three times rather than six. It would be nice to have cryptographers weigh in on some of these issues (this author is definitely not qualified).

Given the 8×88=704 bits to be protected with 64 bits of ECC, which can detect up to 16 bits of errors and correct up to 8, it might be interesting to consider Reed-Solomon[wikilink] error correction and detection for a block of 88 8‑bit codewords with eight check symbols, which would be able to detect up to 8 symbol errors (32 to 64 bits) and correct up to 4 symbol errors (16 to 32 bits). However, the latency of detection and ECC generation for cache fills becomes an issue.

Reset and Boot

SecureRISC processors have three levels of Reset and one Non-maskable Interrupt (NMI):

Power-on Reset is required when power is first applied to the processor, and may require thousands of cycles, during which time various non-digital circuits may be brought into synchronization. In addition the processor may run Built-in Self Test (BIST)[wikilink], which may leave the caches initialized, thereby eliminating the need for some of the steps below. Software detection of this might be based on reading status set by BIST as the first step (details TBD). After this initialization, Power-on Reset is identical to Hard Reset.

Hard Reset forces the processor to reset even when there are outstanding operations in process, e.g. queued reads and writes, and will require system logic to be similarly reset to maintain synchronization. Power-on Reset and Hard Reset begin execution at the same hardwired ROM address.

Soft Reset simply forces the processor to begin executing at the separate Soft Reset ROM address, while maintaining its existing interface to the system interface (e.g. queued reads and writes). Soft Reset may be used to restart a processor that has entered the Halt state.

Finally Non-Maskable Interrupts cause an interrupt to a ring 7 address for ultra-timing-critical events. NMIs are initiated with an edge-triggered signal and should not be repeated while an earlier NMI is being handled. Timing-critical events that can be delayed during other interrupt processing should use normal message interrupts, to be serviced at their specified interrupt priority.

Unless BIST has initialized the caches, Power-on Reset and Hard Reset begin with the vmbypass, icachebypass, and dcachebypass bits set. The first forces an lvaddr→siaddr identity mapping. This allows the code at the hardwired reset PC to be fetched from a system ROM and to initialize the rest of the processor state, including the lvaddr→svaddr and svaddr→siaddr translation tables. At this point the Boot ROM should clear the vmbypass bit. vmbypass cannot be set again once cleared, and thus is available only to the Boot ROM.

The Boot ROM is expected to initialize the various instruction fetch caches and then clear the icachebypass bit. Once clear, this bit may not be reenabled except by Power-on or Hard Reset. Next the Boot ROM is expected to initialize the various data caches and clear the dcachebypass bit. This bit also may not be reenabled except by Power-on or Hard Reset. Finally, the Boot ROM is responsible for starting the Root of Trust verification process and, once that is complete, transferring to the hypervisor.
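The clear-only behavior of the three bypass bits can be modeled as follows. This is a sketch under stated assumptions: the class and method names are invented, and the choice to silently ignore an attempt to set a cleared bit (rather than trap) is an assumption, since the text only says the bits cannot be reenabled.

```python
class BypassBits:
    """Toy model of vmbypass, icachebypass, and dcachebypass."""

    NAMES = ("vmbypass", "icachebypass", "dcachebypass")

    def __init__(self):
        self.power_on_or_hard_reset()

    def power_on_or_hard_reset(self):
        # Power-on Reset and Hard Reset begin with all three bits set.
        self.bits = {name: 1 for name in self.NAMES}

    def write(self, name, value):
        # Clearing is always allowed. A write of 1 is ignored
        # (assumption: the source does not say what such a write does).
        if name not in self.bits:
            raise KeyError(name)
        if value == 0:
            self.bits[name] = 0
        return self.bits[name]


def boot_rom(ctl):
    """Order of bypass-bit clearing described for the Boot ROM."""
    ctl.write("vmbypass", 0)      # after translation tables are initialized
    ctl.write("icachebypass", 0)  # after instruction fetch caches are initialized
    ctl.write("dcachebypass", 0)  # after data caches are initialized
    return dict(ctl.bits)


ctl = BypassBits()
print(boot_rom(ctl))
print(ctl.write("vmbypass", 1))  # stays 0 until the next hard reset
```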

SecureRISC processors reset in an implementation-specific manner. During all three resets, the architecture requires some processor state to be set to specific values, and other state is left undefined and must be initialized by the boot ROM. In particular the following is required:

State Initial value Comment
PC 0xC6FFFFFFFFFF000000 Basic block descriptor pointer, ring 7,
4 MiB from end of address space
InterruptEnable[7] 0 All ring 7 interrupts disabled.
vmbypass 1 Force lvaddr→siaddr identity map.
icachebypass 1 Bypass all instruction fetch caching.
dcachebypass 1 Bypass all data caching.

An Example Microarchitecture

I expect both moderately speculative (e.g. 3-4 wide in-order with branch prediction) and highly speculative (e.g. 4-12 wide Out-of-Order (OoO)) implementations of SecureRISC to be appropriate, albeit with the highly-speculative implementations having solutions for Meltdown, Spectre, Foreshadow, Downfall, Inception, and similar attacks that result from speculation. The moderately speculative processors are likely to be less vulnerable to future attacks, and the ISA should strive to enable such processors to still perform well (i.e. not depend upon OoO for reasonable performance, only for the highest performance). This is one reason I prefer the AR/XR/SR/BR/VR model (inspired by the ZS-1), where operations on the ARs/XRs may get ahead of operations on the SRs/BRs/VRs/MAs, and end up generating pipelined cache misses on SR/VR/MA load/store without stalling, thus being more latency tolerant. This is likely to work well for floating-point values, which naturally will be allocated to the SRs/VRs, but will depend on the compiler to put non-address generation integer arithmetic in the SRs/VRs. It may be that some microarchitectures will choose to handle SR load/store from the L2 data cache due to this latency tolerance, and the SR execution units will end up operating at least an L2 data cache latency behind the AR/XR execution units, causing branch mispredicts on BRs to incur an additional penalty and moves from SRs back to ARs to be costly, but this is better than penalizing every SR load miss.

An OoO implementation might choose to rename the AR/XR/SR registers to a unified physical register file but doing so would give up the reduced number of register file read and write ports that separating these files offers. I expect the preferred implementation will rename each to their own physical files.

The following example goes for full OoO (rather than the moderately speculative possibility mentioned above) but exploits the AR/XR vs. SR separation by targeting SR/VR/VM/MA load/store to the L2 data cache. The L1 data cache exists for address generation acceleration.

The challenge with highly speculative microarchitectures is avoiding vulnerabilities such as Spectre, Meltdown, RIDL, Foreshadow, Inception, etc. One possibility under consideration (not detailed in the table below) is for all caches (including translation and control-flow caches) to have a per-index way dedicated to speculative fills; when the fill becomes non-speculative, a different way is designated as the speculative fill way for that index. Speculation pipeline flushes then have to kill the speculative fills, which is likely to reduce performance, so it might be necessary to introduce a per-process option. It is also a potential performance issue that there would only be one speculative fill way per index. It is the control-flow caches that are the most problematic because they usually have only probabilistic matching, but Inception shows that there is a potential hole here.

Another general consideration when employing speculative execution is to carefully separate privilege levels in microarchitectural state. For example, low-privilege execution should not be able to affect the prediction (branch, cache, prefetch, etc.) of higher privilege levels, or different processes at the same privilege level. Flushing microarchitectural state would be sufficient, but would unacceptably affect performance, so where possible, privilege level and process identifiers should be included in the matching used in microarchitectural optimizations (e.g. prediction). For example, the Next Descriptor Index and Return Address predictors suggested below include the ring number in their tags to prevent one class of attacks. For bypassing based upon siaddrs, a ring may be included; if the ring of the data is greater than the execution ring, this should force a fence. This does not address an attack from one process to another at the same privilege level, which would require inclusion of some sort of process id, which might be too expensive.

Note: Size in Kib (1024 bits) and Mib (1048576 bits) below do not include parity, ECC, column, or row redundancy. A + is appended to indicate there may be additional SRAM bits.

Structure Description
Basic Block Descriptor Fetch
Predicted PC 62‑bit lvaddr and 3‑bit ring
Predicted loop iteration 64‑bit count (initially from prediction, later from LOOPX)
64‑bit iteration (no loop back when iteration = count)
1‑bit Boolean whether LOOPX value received
64‑bit BB descriptor address with c set that started the loop
Predicted CSP 8×(61+3)‑bit
61‑bit lvaddr63..3 and 3‑bit ring
Predicted Basic Block Count 8‑bit bbc
Predicted Basic Block History 128‑entry circular buffer indexed by bbc6..0
(see below)
~9 Kib, not including register rename snapshot (~48 KiB?), and CSR reads
L1 BB Descriptor TLB 256 entry, 8‑way set associative, 640‑bit line (8 SDE/PTEs),
mapping lvaddr63..12 to siaddr63..12 in parallel with BB Descriptor Cache,
line index: lvaddr14..12,
set index: lvaddr16..15,
tag: lvaddr63..17,
data: siaddr63..12, XWR, R1/R2/R3, etc. (80 bits),
filled from L2 Descriptor/Instruction TLB with 640‑bit read
20+ Kib data, 1.5+ Kib tag
L2 BB Descriptor TLB 2048 entry, 8‑way set associative, 640‑bit line (8 SDE/PTEs),
line index: lvaddr14..12,
set index: lvaddr19..15,
tag: lvaddr63..20,
data: siaddr63..12, XWR, R1/R2/R3, etc. (80 bits),
filled from L2 Data Cache with 512‑bit read and augmented with SDE bits
160+ Kib data, 5.6+ Kib tag
BB Descriptor Cache 4096 descriptors (65 bits each), 8‑way set associative,
520‑bit line size, 65‑bit read, 520‑bit tagged refill,
line index: lvaddr5..3,
index: lvaddr11..6,
tag: siaddr63..12,
1.5 cycles latency, 2 cycles to predicted PC,
filled from L2 Descriptor/Instruction Cache on miss and by prefetch.
Might include some branch prediction bits that are initialized from hint bits, but then updated (whether to do this depends on the whether a separate write port is required, in which case a separate RAM is probably appropriate). For example, a simple 2‑bit counter might serve as a first stage for YAGS[PDF] or TAGE.
260+ Kib data, 26+ Kib tag
Next Descriptor Index Predictor 128×(9+10+3), direct mapped (sized to access in less than a cycle)
index: lvaddr9..3,
tag: lvaddr19..10 + 3‑bit ring,
data: lvaddr11..3,
1 cycle to predicted BB Descriptor Cache index,
This predictor is accessed in parallel with the BB Descriptor Cache (BBDC). It contains the most recent flow change hits from the BBDC and is used to accelerate fetch of next BB Descriptor by starting a new BBDC read 1 cycle after the last. If 2-cycle BBDC access and prediction yields the same index, then the read of the target BB Descriptor is accelerated by one cycle. If predicted next index differs then BBDC value fetched early is discarded.
The ring is included in the tag, and the data is only used if PC.ring ≤ tag.ring.
2.75+ Kib
Return Address Prediction The committed versions of return addresses are stored on per-ring call stacks in memory. This structure maintains speculative versions of those lines for the BB Descriptor next field types Call, Conditional Call, Call Indirect, and Conditional Call Indirect. Exceptions also speculatively update this structure. Attempts to write a line not in this structure fetch the line from memory unless CSP[PC.ring]5..3 = 0, since in that case the call stack is initializing a new line.
Lines from this structure are never written back to memory.
This structure is read on the BB Descriptor next field type Return and Exception Return to predict the target PC. Unlike other microarchitecture Return Address Stacks, this structure is line-oriented, tagged, and searched by the predicted CSP[PC.ring], and may be filled a line at a time from memory with non-speculative values as needed (and thus more likely to predict successfully compared to the typical wrapping Return Address Stack or after a context switch that changes CSP[PC.ring]). It is 8 entries and fully associative to handle cross-ring call and return gracefully.
index: lvaddr5..3,
tag: ASID ∥ lvaddr63..6 + 3‑bit ring
An entry matches only if PC.ring = tag.ring.
4.5+ Kib data, 464+ bits tag
Branch Predictor ~16 KiB BATAGE
Whisper add-on?
Consider using YAGS[PDF] with 8192 entries of 2‑bit saturating counters in the choice table, and 1024 entries of 2‑bit saturating counters with 8‑bit tags for the T and NT tables (total 36,864 bits) as a replacement for the first two TAGE stages.
~128 Kib
Indirect Jump/Call Predictor ~16 KiB ITTAGE?
Loop Count Predictor Predict loop count after fetching BB descriptor with c set.
TAGE-like, based on history, no hit is equivalent to 2^16−1
first-level (no history): 128 entry, 2‑way set associative
index: lvaddr8..3 of BB descriptor with c set
tag: lvaddr16..9 + 3‑bit ring
data: 16‑bit count (0..65535)
Prediction used only if PC.ring ≤ tag.ring. Written only on mispredicts that occur prior to LOOPX value received.
2+ Kib first-level data, 1+ Kib tag (other levels TBD)
BB Fetch Output 8‑entry BB Descriptor Queue of PC, BB type, fetch count, fetch siaddr63..2, instruction start mask, branch and jump prediction to check
Instruction Fetch
L1 Instruction Cache 2048 entry (128 KiB), 4‑way set associative, 512‑bit line, read, write
index: siaddr14..6,
tag: siaddr63..15,
2-cycle latency, use 0*-2 times per basic block descriptor, so 0 or 2-3 cycles for entire BB instruction fetch,
filled from L2 Descriptor/Instruction Cache on miss and prefetch,
experiment with prefetch on BB descriptor fill
experiment with a larger cache and 3-cycle latency
256+ Kib data, 24.5+ Kib tag
* 0 fetches required if the previous 512‑bit fetch covers the current one
L2 Fetch
L2 Combined Descriptor/Instruction Cache 8192 entry (512 KiB), 8‑way set associative, 520‑bit line, read, write,
index: siaddr15..6,
tag: siaddr63..16,
filled from system interconnect or L3 on miss and prefetch, evictions to L3
2080+ Kib data, 192+ Kib tag
Instruction Fetch Output 32‑entry Instruction Queue of 80‑bit decoded AR/XR instructions
32‑entry Instruction Queue of 96‑bit decoded SR/BR/VR/VM/MA instructions
(16‑bit, 32‑bit, 48‑bit, and 64‑bit instructions expanded to canonical formats)
AR/XR (Early) Execution Unit
PC, CSP Committed values
Register renaming for ARs 16×6 4‑read, 4‑write register file mapping 4‑bit a, b, c fields to physical AR numbers and assigning d from AR free list.
96 bits
Register renaming for XRs (and CSP?) 16×6 8‑read, 4‑write register file mapping 4‑bit a, b fields to physical XR numbers and assigning d from XR free list.
96 bits
Register renaming for BRs 16×6 6‑read, 2‑write register file mapping 4‑bit a, b, c fields to physical BR numbers and assigning d from BR free list.
96 bits
Register renaming for SRs 16×6 8‑read, 4‑write register file mapping 4‑bit a, b, c fields to physical SR numbers and assigning d from SR free list.
96 bits
Register renaming for CARRY 3‑bit register for 1→8 mapping,
8‑bit bitmap of free entries for allocation
3 bits
(VRs/VMs/MAs are not renamed)
AR physical register file 128×144 (+ parity) 6‑read, 4‑write
XR physical register file 128×72 (+ parity) 6‑read, 4‑write
Segment Size Cache for segment bounds checking:
128 entry, 4‑way set associative, parity protected,
mapping lvaddr63..48 and ASID to 6‑bit segment size log2 ssize and 2‑bit G0 for eight segments (one L2 TLB line),
index: lvaddr55..51
per way tag: 20 bits (12‑ASID and lvaddr63..56),
per way data: 64 bits indexed by lvaddr50..48
filled from L2 Data TLB
8+ Kib data, 2.5+ Kib tag
L1 Data TLB 512 entry, 8‑way set associative, 640‑bit line (8 SDE/PTEs),
mapping lvaddr63..12 to siaddr63..12 in parallel with L1 Data Cache,
line index: lvaddr14..12,
set index: lvaddr17..15,
tag: lvaddr63..18,
data: siaddr63..12, XWR, R1/R2/R3, etc. (80 bits),
filled from L2 Data TLB with 640‑bit read
40+ Kib data, 1.5+ Kib tag
L2 Data TLB 2048 entry, 8‑way set associative, 640‑bit line (8 SDE/PTEs),
line index: lvaddr14..12,
set index: lvaddr19..15,
tag: lvaddr63..20,
data: siaddr63..12, XWR, R1/R2/R3, etc. (80 bits),
filled from L2 Data Cache with 512‑bit read and augmented with SDE bits
160+ Kib data, 5.6+ Kib tag
L1 Data Cache 512 entry (~36 KiB), 4‑way set associative, 576‑bit line, 144‑bit read, 576‑bit refill,
index: lvaddr12..6,
tag: siaddr63..13,
write-thru,
filled from L2 Data Cache on miss or prefetch
288+ Kib data, 25.5+ Kib tag
Return Address Stack Cache 8‑entry, fully associative, 576‑bit line size,
fill and writeback to L2 Data Cache, subset and coherent with L2 Data Cache
tag: siaddr63..6 + 3‑bit ring,
4.5+ Kib data, 488+ bit tag
L2 Data Cache 32768 entry (~2.25 MiB), 8‑way set associative, 576‑bit line, read, write,
index: siaddr17..6,
tag: siaddr63..18 + state,
write-back,
used for SR/VR/VM/MA load/store and L1 Data Cache misses,
filled from system interconnect or L3 on miss or prefetch, eviction to L3
18+ Mib data, 1.5+ Mib tag
L2 Data Cache Prefetch TBD, possibly based on Bouquet of Instruction Pointers: Instruction Pointer Classifier-based Hardware Prefetch[PDF] (16.7 KiB).
AR Engine Output 64‑entry BR/SR/VR/VM/MA operation queue
BR/SR/VR/VM/MA (Late) Execution Unit
(tends to run about a L2 Data Cache latency behind the AR Execution Unit)
BR physical register file 64×1 6‑read, 2‑write
SR physical register file 128×72 (+ parity) 8‑read, 4‑write
CARRY physical register file 8×64 (+ parity) 1‑read, 1‑write
VR register file 16×72×128 (+ parity) 4‑read, 2‑write
144+ Kib
VM register file 16×128 (+ parity) 3‑read, 1‑write
2080 bits
MA register file 4×20×64×64 (20+1 for parity) 1‑read, 1‑write
336 Kib
Combined Fetch/Data
System virtual address TLB 256 entry, 8‑way set associative, 640‑bit line (8 RDE/PTEs),
mapping system virtual addresses to system interconnect addresses
(maintained by hypervisor)
line index: lvaddr14..12,
set index: lvaddr16..15,
filled from L2 Data Cache with 512‑bit read, expanded with RDE bits
sized small because large page sizes expected
160+ Kib data, 12+ Kib tag
L3 Eviction Cache serving multiple processor L2 Instruction and L2 Data caches 262144 entries (~18 MiB), 8‑way set associative, 576‑bit line size, non-inclusive,
index: siaddr20..6,
tag: siaddr63..21 + state,
write-back,
plus 8‑way set associative directory for sub caches,
filled by evictions from L2 Instruction and Data caches
144+ Mib data, 11.5+ Mib tag, 11.5+ Mib directory
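Many of the index ranges and data sizes in the table follow from the entry count, associativity, and line size. The sketch below recomputes a few of them; the function names and the per_line parameter (entries sharing one line, as in the 8-entry TLB lines) are assumptions made for the example.

```python
def index_range(entries, ways, low_bit, per_line=1):
    """Return (high, low) address bits selecting the set.

    entries  -- total entries in the structure
    ways     -- associativity
    low_bit  -- lowest address bit of the set index
    per_line -- entries sharing one line (8 for the TLBs in the table)
    """
    sets = entries // per_line // ways
    if sets < 2 or sets & (sets - 1):
        raise ValueError("number of sets must be a power of two >= 2")
    width = sets.bit_length() - 1
    return (low_bit + width - 1, low_bit)


def data_kib(entries, bits_per_entry):
    """Data storage in Kib (1024 bits), excluding parity, ECC, and tags."""
    return entries * bits_per_entry / 1024


print(index_range(512, 4, 6))                # L1 Data Cache index bits
print(index_range(32768, 8, 6))              # L2 Data Cache index bits
print(index_range(256, 8, 15, per_line=8))   # L1 BB Descriptor TLB set index
print(data_kib(512, 576))                    # L1 Data Cache data, in Kib
```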

Using a line size in TLBs is unusual, but could represent a performance boost, given that the L2 data cache read is going to supply a whole line anyway. Without the line size, the L1 TLBs would only contain 32 or 64 entries for critical path reasons, and this is quite small. The issue is second level translation and svaddr protections. Performing these lookups for 8 PTEs would slow the TLB refill, so I expect the example microarchitecture to mark 7 of the 8 PTEs as requiring secondary checks and continue. On a match to an entry that requires secondary checks, these would be performed then, and the entry updated.

For tracking Basic Blocks (BBs) in the pipeline, there would be an 8‑bit internal basic block counter bbc (independent of the larger BBcount[ring] counters) incremented on each BB descriptor fetch. bbc6..0 would be used as index to write basic block information into a 128‑entry circular buffer for basic blocks in the pipeline, including the BB descriptor address, the prediction to check (including the conditional branch taken/not-taken, loop back prediction, and full target descriptor address for indirect jumps and returns), and so on. The circular buffer entry would also include a bit mask of completed instructions, and the entry may only be overwritten when all instructions are completed. Completion of all instructions of the BB causes state updates to commit (e.g. PC, CSP, and call stack writes).

Basic block ordering is tested using the sign bit of a subtraction: BBx is after BBy if (bbcx − bbcy) > 0, with the difference interpreted as a signed 8‑bit value. Each instruction in the pipeline includes its bbc value and the offset in the basic block. When a misprediction is detected, all instructions with bbc values after the basic block with the misprediction (using the above test) are flushed from the pipeline, bbc is reset to the bbc value of the mispredict plus one, and basic block descriptor fetch is restarted using the correct next descriptor (e.g. PC+8 for a not-taken conditional branch or the target calculated from the targr and targl fields, or the JMPA/LJMP/LJMPI/SWITCH destination for an indirect jump). Whether the circular buffer stores the targr/targl values or refetches them is TBD. 128 basic block predictions may seem large, but with the SecureRISC loop count prediction, 100% accuracy might be achieved, which means the 128‑entry circular buffer supports 128 loop iterations, and each loop iteration might only be three or four instructions. Note that in SecureRISC, there are 0, 1, or 2 predictions to check per basic block (e.g. a conditional branch and an indirect jump for a case dispatch), so 0, 1, or 2 mispredicts are possible (i.e. there might be two flushes).
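The wraparound ordering test and the flush on a mispredict can be sketched as below, treating the 8-bit difference as a signed value. The function names are hypothetical, and a list of bbc values stands in for the instructions in the pipeline.

```python
BBC_MASK = 0xFF       # bbc is an 8-bit counter
HISTORY_MASK = 0x7F   # bbc6..0 indexes the 128-entry circular buffer


def bb_after(bbc_x, bbc_y):
    """True if BBx is after BBy: the 8-bit difference is positive when signed."""
    diff = (bbc_x - bbc_y) & BBC_MASK
    return 0 < diff < 0x80


def history_index(bbc):
    """Circular buffer slot for a basic block."""
    return bbc & HISTORY_MASK


def flush_after(inflight, mispredicted):
    """Drop entries after the mispredicted BB; return survivors and the new bbc."""
    survivors = [b for b in inflight if not bb_after(b, mispredicted)]
    return survivors, (mispredicted + 1) & BBC_MASK


print(bb_after(3, 250))                       # 3 follows 250 across the wrap
print(flush_after([254, 255, 0, 1, 2], 255))  # keeps 254 and 255, restarts at 0
```

The test stays unambiguous as long as fewer than 128 basic blocks are in flight, which the 128‑entry circular buffer already guarantees.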

I expect that immediately after each 512‑bit read of the L1 instruction cache, the start mask from the Basic Block (BB) descriptor will be used to feed the specified bits to eight parallel decoders which will convert them to a canonical form, something along the lines of the following. These canonicalized instructions would then be put into queues for the early pipeline (e.g. operations and branches on XRs and loads to and stores from ARs/XRs), the late pipeline (SRs/BRs/VRs/VMs/MAs), or both of these (e.g. for loads to and stores from SRs/VRs/VMs/MAs and moves between early and late).

80‑bit early-pipeline canonical instruction example
Field Width Bits
i 56 79:24
sa 2 23:22
b 5 21:17
a 5 16:12
d 5 11:7
op80 7 6:0
96‑bit late-pipeline canonical instruction example
Field Width Bits
i 54 95:42
e 4 41:38
c 6 37:32
b 6 31:26
a 6 25:20
d 6 19:14
op96 14 13:0
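The two canonical layouts above can be exercised with a small pack and unpack helper. This is a sketch only: the field names and widths come from the examples, while the helper functions and the list-of-pairs representation are assumptions for illustration.

```python
# Fields listed from least significant bit upward, as (name, width).
EARLY_FIELDS = [("op80", 7), ("d", 5), ("a", 5), ("b", 5), ("sa", 2), ("i", 56)]
LATE_FIELDS = [("op96", 14), ("d", 6), ("a", 6), ("b", 6), ("c", 6), ("e", 4), ("i", 54)]


def pack(fields, values):
    """Assemble a canonical instruction word; missing fields default to 0."""
    word = 0
    shift = 0
    for name, width in fields:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def unpack(fields, word):
    """Split a canonical instruction word back into its fields."""
    values = {}
    for name, width in fields:
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


early = pack(EARLY_FIELDS, {"op80": 0x12, "d": 3, "a": 4, "b": 5, "sa": 1, "i": 0xABC})
print(unpack(EARLY_FIELDS, early))
```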

Questions and Things Still Undecided

The following list is in no particular order. Also, some items are old and should be pruned.

Tag Summary

Tag Use
0 Nil/Null pointer
1..31 Sized pointers exact
32..127 Sized pointers inexact
128..191 Reserved (possible sized pointer extension)
192..199 Unsized pointers with ring
200..207 Code pointers with ring
208..220 Reserved
221 Pointer to blocks with header/trailer sizes
222 Cliqued pointer in AR
223 Segment Relative pointers
224 Lisp CONS
225 Lisp Function
226 Lisp Symbol
227 Lisp/Julia Structure
228 Lisp Array
229 Lisp Vector
230 Lisp String
231 Lisp Bit-vector
232 CHERI-128 capability word 0
233 Reserved
234 Lisp Ratio, Julia Rational
235 Lisp/Julia Complex
236 Bigfloat
237 Bignum
238 128‑bit integer
239 128‑bit unsigned integer
240 64‑bit integer
241 64‑bit unsigned integer
242 Small integer types
243 Reserved
244 Double-precision floating-point
245 8, 16, and 32‑bit floating-point
246..249 Reserved
250 Size header/trailer words
251 CHERI capability word 1. Bits 143..136 of AR doubleword store (used for save/restore and CHERI capabilities)
252 Basic Block Descriptor
253 Reserved for packed Basic Block Descriptors
254 Trap on load or BBD fetch (breakpoint)
255 Trap on load or store

Glossaries of Terminology Used

General Glossary

ABI
An acronym for Application Binary Interface[wikilink], which is a set of software conventions for compilers, operating systems, and programmers that defines how code should use the processor architecture to allow interoperability of different components. It typically covers register usage, data type alignment, calling convention, etc.
ACL
An acronym for Access Control List[wikilink], a form of Discretionary Access Control[wikilink].
Acquire and Release
The paper Memory consistency and event ordering in scalable shared-memory multiprocessors by Gharachorloo et al. created a taxonomy of shared writeable memory accesses, first dividing into competing and non-competing, then subdividing competing into synchronization and non-synchronization, and finally subdividing synchronization into Acquire and Release specifying that Acquire accesses (typically loads) are performed to gain access to a set of shared locations, and Release accesses (typically stores) are performed to grant access to sets of locations. The paper went on to introduce release consistency.
API
An acronym for Application Programming Interface[wikilink], which is an interface specification for programming.
ASID
An acronym for Address Space Identifier, which is a small integer used to reduce the frequency of translation cache flushing.
ASLR
An acronym for Address Space Layout Randomization[wikilink], which is an operating system security feature to make exploiting vulnerabilities more difficult by varying the placement of code and data in the address space.
Atomic Operations
In an ISA, atomic operations are a series of operations on memory locations that are indivisible, such that they all occur without other intervening operations on the memory location, or none occur (if the atomic primitive allows for failure). As an example, a memory increment involves the series of operations: read the memory location, increment the value read, write the memory location. An atomic increment prevents parallel entities from interleaving these operations (e.g. A load, A increment, B load, A store, B increment, B store) such that the memory location is increased by only 1 rather than 2. Atomic operations may be performed in the processor caches by preventing cache coherence operations from interrupting the sequence, or for non-cached locations may be performed in the memory controller by sending the operands required there. One of the most important atomic operations is Compare-And-Swap (CAS)[wikilink].
BB
An acronym for Basic Block[wikilink], which is a sequence of instructions without potential transfers of control in or out (except due to exceptions).
Bell-LaPadula Model
Bell-LaPadula is a computer security access control policy to enforce data confidentiality in a system with Multilevel security. It is the inverse of the Biba Integrity model for integrity. In Bell-LaPadula, data and agents are ordered by levels of confidentiality, and agents may not read data with higher confidentiality than their own level and may not write data to a lower confidentiality level.
Biba Integrity Model
Biba is a computer security access control policy to enforce data integrity. It is the inverse of the Bell-LaPadula model for confidentiality. In Biba, data and agents are ordered by levels of integrity, and agents may not write data with higher integrity than their own level and may not read data of lower integrity.
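The inverse relationship between the two policies is easiest to see side by side. The following is a minimal sketch, assuming the usual formulation of Bell-LaPadula (no read up, no write down) and of Biba (no read down, no write up), with levels modeled as plain integers where a larger number means more confidential or higher integrity. The function names are hypothetical.

```python
def blp_may_read(agent_level, data_level):
    """Bell-LaPadula: no read up (confidentiality levels)."""
    return agent_level >= data_level


def blp_may_write(agent_level, data_level):
    """Bell-LaPadula: no write down (confidentiality levels)."""
    return agent_level <= data_level


def biba_may_read(agent_level, data_level):
    """Biba: no read down (integrity levels)."""
    return agent_level <= data_level


def biba_may_write(agent_level, data_level):
    """Biba: no write up (integrity levels)."""
    return agent_level >= data_level
```

Each Biba check is the corresponding Bell-LaPadula check with the comparison reversed, which is the sense in which one model is the inverse of the other.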
Binary Prefix
Binary Prefixes are unit prefixes denoting powers of 1024 rather than 1000, and are typically used to modify B for bytes or b for bits, as in 4 KiB for 4,096 bytes, or 2 MiB for 2,097,152 bytes.
Prefix  Power of 1024  Power of 2  Value
Ki      1024¹          2¹⁰         1,024
Mi      1024²          2²⁰         1,048,576
Gi      1024³          2³⁰         1,073,741,824
Ti      1024⁴          2⁴⁰         1,099,511,627,776
Pi      1024⁵          2⁵⁰         1,125,899,906,842,624
Ei      1024⁶          2⁶⁰         1,152,921,504,606,846,976
Zi      1024⁷          2⁷⁰         1,180,591,620,717,411,303,424
Yi      1024⁸          2⁸⁰         1,208,925,819,614,629,174,706,176
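The table values follow directly from the position of the prefix in the sequence. As a small sketch (the helper names `prefix_value` and `format_bytes` are hypothetical, chosen for this illustration):

```python
BINARY_PREFIXES = ["Ki", "Mi", "Gi", "Ti", "Pi", "Ei", "Zi", "Yi"]


def prefix_value(prefix):
    """Return the multiplier for a binary prefix: 1024 raised to its position."""
    return 1024 ** (BINARY_PREFIXES.index(prefix) + 1)


def format_bytes(n):
    """Render a byte count using the largest binary prefix not exceeding it."""
    unit, scale = "B", 1
    for prefix in BINARY_PREFIXES:
        value = prefix_value(prefix)
        if n < value:
            break
        unit, scale = prefix + "B", value
    return f"{n / scale:g} {unit}"
```

For example, `format_bytes(4096)` yields `4 KiB` and `format_bytes(2097152)` yields `2 MiB`, matching the examples in the definition.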
Branch Prediction
A Branch Predictor[wikilink] is logic in processor hardware that guesses, based on recent history, which way instruction fetch needs to proceed after fetching a branch instruction but before its execution, which in some processors may be multiple tens of cycles later. When the branch is executed, if the result is different from the prediction, the intervening work needs to be thrown away and restarted with the correct branch direction. Without an accurate guess, processor performance is substantially reduced. Between the time the branch direction is predicted and executed, the processor is essentially executing speculatively.
BTB
An acronym for Branch Target Buffer[PDF], which is a cache for the instruction fetch engine of processors indexed with the branch PC and yielding the targets of jumps and taken branches. BTBs allow instruction fetch to proceed to the target sooner than would be possible by parsing the instruction stream, or in the case of an indirect jump, waiting for the operand to be available.
Buffer Overflow
Buffer overflow[wikilink] is a situation where a bug in a computer program allows data to be written beyond the bounds of the memory allocated for the data. Buffer overflow may in some cases result in security vulnerabilities. It is made possible by undisciplined programming languages such as C++. Undisciplined languages make it difficult or impossible for compilers to insert bounds checking to detect buffer overflow bugs at runtime.
Cache
Processor caches[wikilink] are memories that temporarily store copies of data that lives permanently elsewhere (typically in main memory). Caches reduce the latency of access to data they contain, and provide additional access bandwidth. A wide variety of caches are often used in implementing a processor, including caches for instruction fetch, loads and stores, and translation of virtual addresses. Instruction and Data caches typically store contiguous fixed-size blocks of data, and this block size is often called the line size. Caches may be arranged in a cache hierarchy. A cache access is a hit if the line containing the access is stored in the cache, and a miss if it is not; cache misses result in a line-sized read from the next level of the cache hierarchy. A cache miss may also require eviction of some other cache line to make room to store the incoming data. Caches handle writes in different ways: a write-through cache writes store data to the cache and also sends it to the next level of the hierarchy; write-back caches store data in the cache and mark the cache line as dirty, meaning that the cache line will have to eventually be written back to higher levels of the cache hierarchy (e.g. on eviction). Caches may be fully associative (a block of data may be located in any cache location), or N-way set-associative (the set of N locations for a given block of data is determined by a few address bits, the set index, and only the N ways need to be searched for a match). Cache blocks have an associated tag, which is typically the address bits not used in the set index, though in some cases tags and indexes may be hashed.
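The relationship between the line size, the set index, and the tag can be sketched as simple bit-field extraction. The example below assumes power-of-two sizes, a 64-byte line, and 64 sets, with no hashing of tag or index; these parameters and the function name are illustrative assumptions, not SecureRISC values.

```python
def split_address(addr, line_size=64, num_sets=64):
    """Split an address into (tag, set index, offset within the line).

    Assumes line_size and num_sets are powers of two.
    """
    offset_bits = line_size.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    offset = addr & (line_size - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset
```

With these parameters, address 0x12345 falls at byte 5 of a line in set 13 with tag 18; a lookup would compare tag 18 against the tags of the N ways of set 13.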
Cache Coherence
Cache coherence[wikilink] is a mechanism where multiple caches in a computer system use communication protocols to ensure all processors have a coherent view of shared memory locations. Memory coherence is defined relative to individual memory locations and ensures that all processors agree on the order of accesses to the same location (in contrast, memory consistency concerns reads and writes to different memory locations). There are multiple cache coherence mechanisms and protocols. A simple protocol might arrange that only one cache at a time contains a dirty copy of a line, and when another processor references the line, it moves (along with its dirty status) to the new processor. Other coherence protocols[wikilink] are possible; examples include:
Cache Replacement
On a cache miss, if each way of a set contains valid data, one way must be selected for eviction to make room for the new data. The Cache replacement policy[wikilink] is the algorithm that determines which location in the cache (e.g. which way of an N-way set associative cache) is evicted and used to store the incoming data. The optimal policy is to replace the block that will be used furthest in the future, which is generally not known, and so other simpler algorithms are typically used, such as Least Recently Used (LRU), Pseudo-LRU, and Re-Reference Interval Prediction (RRIP).
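As an illustration of the simplest of these policies, the sketch below models one set of an N-way set-associative cache with true LRU replacement. The class name `LRUSet` and its interface are assumptions for this example; hardware typically tracks recency with a few bits per set rather than an ordered list.

```python
from collections import OrderedDict


class LRUSet:
    """One set of an N-way set-associative cache with LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag):
        """Return (hit, evicted_tag) for an access to the line with this tag."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # now the most recently used
            return True, None
        evicted = None
        if len(self.lines) == self.ways:
            # Every way holds valid data: evict the least recently used line.
            evicted, _ = self.lines.popitem(last=False)
        self.lines[tag] = None
        return False, evicted
```

In a 2-way set, the access sequence 1, 2, 1, 3 evicts tag 2 on the last access, because the re-reference of tag 1 made tag 2 the least recently used.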
CAM
An acronym for Content-addressable memory[wikilink], which is a structure sometimes used in processors. One common use is for L1 translation caches (aka TLBs). CAMs typically have a match side and a data side, so for example, a fully associative L1 TLB would have the virtual page number stored on the match side and the physical page number stored on the data side. A translation then searches for a matching virtual page number and supplies the physical page number.
Capability
Capabilities are protected, non-forgeable addresses with permissions and other attributes in a Capability-based addressing[wikilink] architecture.
CFI
An acronym for Control Flow Integrity[wikilink], which is a class of defenses against malware attacks.
CHERI
An acronym for Capability Hardware Enhanced RISC Instructions (CHERI), which is a Capability architecture from the University of Cambridge, with variants for MIPS, ARM, x86, and RISC-V.
CAS
An acronym for Compare And Swap[wikilink], which is an atomic operation for synchronization in multithreaded applications.
Concurrent GC
Concurrent Garbage Collection operates concurrently (perhaps on a separate processor) with continued execution of the program. Earlier GC often paused the program when GC was executed (sometimes called Stop The World GC[wikilink]).
Consistency Model
Consistency models[wikilink] define rules processors must follow for memory references that put constraints on correct programs operating in a shared multiprocessor.
Copy-on-write
Copy-on-write[wikilink] is a technique implemented in many operating systems (especially Unix-derived or POSIX-compatible operating systems) in which files are mapped into an address space with write permission disabled, but on the first write, the page containing the write is copied and this page replaces the original in the process’ address space, leaving the file unmodified (and potentially used by other processes that have yet to write the page). It requires per-page rather than per-segment permissions, and makes the sharing of page tables difficult.
CSR
An acronym for Control/Status Register[wikilink], which are generally not used for computation (e.g. in contrast to register files), but to control aspects of processor operation or report status.
Dangling Pointer
Dangling pointers[wikilink] are pointers to memory locations that no longer contain valid data. For example, when explicit memory deallocation is requested (either on the heap or stack), pointers to the deallocated memory may still exist, and use of such pointers to reference that memory is an error, but these errors often go undetected, leading to obscure bugs and security vulnerabilities. Use of such a pointer after the memory is allocated for a new purpose is an even more problematic error.
Data Memory-dependent Prefetcher (DMP)
DMP[wikilink] is a technique for looking for pointers in cache blocks and subsequently prefetching the cache blocks of those pointers. It enables the GoFetch security vulnerability.
DIMM
An acronym for Dual-In-line Memory Module[wikilink]. A packaging arrangement of memory devices on a socketable substrate. This is an older form of DRAM for main memory. A newer main memory technology is HBM.
Directory-Based Cache Coherence
Directory-based cache coherence[wikilink] protocols are scalable methods for identifying which processors in a system are caching a given memory block, and so need to receive coherence protocol transactions for that block.
DMA
An acronym for Direct Memory Access[wikilink], which is a feature of systems where non-processor elements access main memory. Such accesses are typically programmed by a processor, but then handled independently by the system component, which when done notifies the processor via an interrupt. Often DMA is used for high-bandwidth I/O.
DRAM
An acronym for Dynamic Random-Access Memory[wikilink], but which today sometimes refers to system bulk memory in general. DRAM typically uses a single transistor per memory element (e.g. a capacitor), and dynamic refers to the fact that leakage, or reading the memory, disturbs the memory contents, requiring them to be rewritten to maintain their values.
Dynamic Linking
Static linking merges together multiple compilation units, resolving references between them in an efficient manner, but it creates copies of compilation units that cannot be shared in physical memory with other statically linked programs that use the same compilation units. In some cases it is better to create a shared library from a set of compilation units, and then use dynamic linking to dynamically load that shared library. In addition to sharing the library code in physical memory with other applications, this allows the library to be upgraded (e.g. with bug fixes) without relinking all the applications that use the library. Dynamic linking typically occurs on the first reference to an external symbol from a code segment. The first reference invokes the dynamic linker, which translates the symbol into an address (potentially by dynamically loading the file containing that symbol), and then updates the reference so that subsequent references go directly to the address.
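The resolve-on-first-reference behavior can be modeled abstractly. In the sketch below a dictionary stands in for a shared library's symbol table and another for the table of already-resolved references; the class `LazyLinker` and its members are hypothetical names for this illustration and do not correspond to any real dynamic linker interface.

```python
class LazyLinker:
    """Toy model of resolving an external symbol on its first reference."""

    def __init__(self, library):
        self.library = library  # symbol name -> function (stand-in for a shared library)
        self.table = {}         # references that have already been resolved
        self.resolutions = 0    # how many times the "dynamic linker" ran

    def call(self, symbol, *args):
        target = self.table.get(symbol)
        if target is None:
            # First reference: look the symbol up and update the reference
            # so that later calls go directly to the target.
            target = self.library[symbol]
            self.table[symbol] = target
            self.resolutions += 1
        return target(*args)
```

Calling the same symbol twice through `call` runs the lookup only once; the second call finds the resolved entry and goes straight to the target.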
Dynamic Loading
A program usually begins with the supervisor initially loading code and data into a new address space. For a completely stand-alone application, no further code or data would be required, but many applications find advantage in bringing in additional code and data at runtime to be executed and referenced by the application, typically in the form of shared libraries, which are dynamically loaded into the address space after program execution begins. This loading is typically by mapping the shared library file into the address space so its code can be used while being shared by other programs, and the library data areas are initialized from the template in the library file.
ECC
An acronym for Error Correcting Code[wikilink], which is a general term, but used herein for the Hamming codes typically used for main memory and caches that implement single-error correction and double-error detection (SECDED)[wikilink].
Effective Address
Terminology for the virtual address calculated in an instruction for memory accesses or transfers of control by an Addressing Mode[wikilink]. As an example, most ISAs have an addressing mode that is the contents of a register specified in the instruction (the base) plus a small integer constant in the instruction (the offset) that is added to that value, with the sum (base+offset) being the effective address.
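A base+offset effective address calculation might look like the following sketch. The 12-bit signed offset field and the 64-bit address width are assumptions chosen only for illustration; they are not taken from the SecureRISC instruction encodings.

```python
MASK64 = (1 << 64) - 1


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def effective_address(base, offset_field, offset_bits=12):
    """Add the sign-extended offset to the base register, wrapping at 64 bits."""
    return (base + sign_extend(offset_field, offset_bits)) & MASK64
```

With a base of 0x1000, an offset field of 0x010 gives 0x1010, while an offset field of 0xFFF (that is, -1) gives 0xFFF.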
ELF
An acronym for Executable and Linkable Format[wikilink], which is the format used on many contemporary operating systems for programs and shared libraries.
Eventcounts
A parallel processing synchronization mechanism described in Synchronization with Eventcounts and Sequencers[PDF] by Reed and Kanodia.
False Sharing
False sharing[wikilink] occurs when a coherency unit (typically a cache block) is shared by multiple processors even though some individual locations in the cache block are not shared. Accesses to the unshared locations can cause performance-degrading transfers between processors by the coherence protocol.
FP
An acronym for Floating Point[wikilink] arithmetic, with associated acronyms for Single Precision (SP), Double Precision (DP), and sometimes Half Precision (HP). Most processors today implement the IEEE 754[wikilink] standard, which defines formats binary256, binary128, binary64, binary32, binary16, and soon several binary8 formats. The binary256 and binary128 formats are typically only implemented in software. There are also two non-IEEE formats of significance: bfloat16[wikilink] and TensorFloat[wikilink]. This document sometimes uses shorter names for some formats:
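Letter  Name  Format              Exponent bits  Precision bits (including the implicit leading bit)
D       fp64  IEEE 754 binary64   11             53
S       fp32  IEEE 754 binary32   8              24
H       fp16  IEEE 754 binary16   5              11
P3      fp8   IEEE 754 binary8p3  5              3
P4      fp8   IEEE 754 binary8p4  4              4
P5      fp8   IEEE 754 binary8p5  3              5
B       bf16  bfloat16            8              8
-       tf32  Tensor Float        8              11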
GC
An acronym for Garbage Collection[wikilink], which in a computer refers to having algorithms reclaim memory allocated and no longer used, rather than having explicit free operations that introduce the potential of dangling pointers and the continued use of the memory after the free.
Generational GC
Generational Garbage Collection[wikilink] separates data by age so that older generations, which typically have fewer references to recent generations, need not be fully scanned to determine which recent allocations may be reclaimed.
GOT
An acronym for Global Offset Table[wikilink], which is used in dynamic linking to hold pointers to data defined in other dynamically loaded shared libraries.
Guest OS
An Operating System (often running as a supervisor) running on a Virtual Machine.
Heap
Heap is a common term for an area of memory used for allocation of address space to be used directly (e.g. for mapping shared libraries), or for allocating data directly, or to provide new memory to be managed by finer grain memory allocators.
High Bandwidth Memory (HBM)
High Bandwidth Memory (HBM)[wikilink] is a standard for 3D-stacked Synchronous DRAM (SDRAM), with the HBM3E generation providing, for example, 1.2 TB/s of bandwidth and a capacity of 24 GB for a stack of 8 devices.
HPC
An acronym for High-performance computing[wikilink], which is an architectural and algorithmic approach to computation for very large problems (both in operations and memory requirements), usually on highly parallel configurations of processors and memories. A recent system targeted at HPC is Frontier[wikilink].
Hypervisor
Hypervisors[wikilink] are Operating Systems (OSes) for operating systems. They typically emulate a Virtual Machine on which a supervisor level of OS runs.
ILP
An acronym for Instruction-Level Parallelism[wikilink], which is a measure of how sequential or parallel the instructions of a program are when executed on a processor capable of much higher parallelism than the program (so that the processor is not limiting).
ILP64
A 64‑bit data model[wikilink] where C++ shorts are 16 bits, but int, long, and pointers are 64 bits (only int32_t and uint32_t are 32 bits).
In-order Execution
In-order implementations may execute one or more instructions per cycle in a pipeline, but generally the computation or memory access of a later instruction does not occur before the same point of an earlier instruction, in contrast to Out-of-Order (OoO) implementations, which often perform the computation of a later instruction before the computation of an earlier one by scheduling execution in hardware based upon dependencies between instructions. While this presentation suggests that in-order vs. out-of-order is a binary choice, there are also design points somewhat in between, e.g. in-order for all instructions in given classes, but out-of-order between classes.
Incremental GC
Early Garbage Collection (GC) stopped execution of the program when the allocation space filled, and then either compacted or copied all reachable data, during which time the program was suspended. This might introduce unacceptable delays for programs where responses are time-critical. Incremental GC continues running the program during garbage collection but handles any references to the old allocation space as part of program execution. It is one particular form of Concurrent GC.
Interrupt
Interrupts[wikilink] represent situations or events that are handled by processors by suspending the current processor execution, transferring to an interrupt handler, having the handler take appropriate action, and then returning to the suspended execution. Without interrupts, a processor would have to periodically check for such events, which might introduce unacceptable delays and overhead.
ISA
An acronym for Instruction Set Architecture[wikilink], though ISAs typically cover more than just instruction sets.
jemalloc
jemalloc is a high-performance, multi-threaded memory allocator that uses different algorithms for small and large allocations (in some versions there are huge allocations as well). Small allocations use slab allocation, where slabs are allocated from extents. Large allocations are allocated directly as extents. The name derives from Jason Evans’ malloc.
JIT
An acronym for Just-In-Time[wikilink] compilation.
L1 etc.
L1, L2, L3, etc. refer to levels of the Cache hierarchy[wikilink]. The Level 1 or L1 cache in a cache hierarchy is the smallest, fastest cache. Misses in the L1 cache are sometimes sent to a larger, slower Level 2 or L2 cache, and sometimes from there misses proceed even to an L3, and so on. At some point in the cache hierarchy, caches may be shared by multiple processors (e.g., an L3 cache might serve four processors).
Sometimes the numbering starts with L0 rather than L1 when the first level is particularly small and fast; this numbering may also indicate that the next level is accessed in parallel with the L0 rather than sequentially.
Latency Tolerance
Latency tolerance is a loosely defined ability of a processor to perform work even in the presence of computational or memory access latencies too large or too variable for compilers to statically schedule. Out-of-order execution where instruction execution is scheduled dynamically in hardware is one technique for latency tolerance, at the cost of complexity, speculative execution, and the resulting security vulnerabilities. Enabling some degree of latency tolerance without requiring speculative execution is valuable; non-speculative latency tolerance was a feature of The ZS-1 central processor (1987), for example.
LLC
An acronym for Last Level Cache, the last cache in the cache hierarchy.
Mandatory Access Control
Mandatory access control[wikilink] (MAC) is a type of access control to implement a security policy that is independent of other access control. For example, MAC might be used to implement the Bell-LaPadula[wikilink] policy for Multilevel security, which controls access to data based on categories assigned to the data and the accessor, examples being data classification and user security clearances, where these rules are enforced even when explicit permission is granted via Discretionary access control[wikilink]. A different security policy that could be implemented by MAC is the Biba Integrity Model[wikilink], which is an inverse of Bell-LaPadula.
Memory Barrier
Memory barriers[wikilink] ...
MPI
An acronym for Message Passing Interface[wikilink], which is a de facto standard for communication in large multiprocessor systems, e.g. for HPC.
Message Signaled Interrupts (MSIs)
Message Signaled Interrupts[wikilink] are a method of signaling interrupts using system interconnect transactions rather than wires. For example, PCIe employs MSIs.
MMU
An acronym for Memory Management Unit[wikilink], which is the processor hardware that implements Virtual Address translation.
Multilevel Security
A system implementing Multilevel security[wikilink] is required to simultaneously handle data and agents at multiple levels of classification by implementing the Bell-LaPadula policy with Mandatory Access Control.
NoC
An acronym for Network on a chip[wikilink], which is the on-chip interconnection between processors on a single integrated circuit.
Non-blocking Algorithm
Non-blocking algorithms[wikilink] are important in multithreaded programs as alternatives to mutual exclusion or locks. From Wikipedia: “A non-blocking algorithm is lock-free if there is guaranteed system-wide progress, and wait-free if there is also guaranteed per-thread progress.” Such algorithms are important motivators for atomic operations such as Compare-and-Swap.
NMI
An acronym for Non-Maskable Interrupt[wikilink], which is an interrupt that cannot be disabled, except perhaps by the NMI handling itself, so that it is always available to force a processor to take action.
NUMA
An acronym for Non-Uniform Memory Access[wikilink], which is a system in which main memory is distributed, so that access time differs from one processor and memory pairing to another. Typically, NUMA systems will use some sort of directory-based cache coherence.
OoO
An acronym for Out-of-Order[wikilink], which refers to instruction scheduling in hardware, in contrast to in-order execution. OoO execution is one of the best, but also one of the most expensive, ways of achieving latency tolerance.
Operating System
Operating Systems[wikilink] (OSes) are programs that operate at one or more levels of privilege higher than that of application programs, mediate access to the hardware, and provide services to applications. There may be multiple levels of operating systems in a system, such as when a hypervisor emulates a virtual machine for a supervisor, which provides services to the user mode application.
PC
An acronym for Program Counter[wikilink], which is a register holding the address used for instruction fetch. PC is not the best name since it doesn’t just count—Intel’s Instruction Pointer is better. In SecureRISC it is the address for Basic Block Descriptor fetch, which makes it even less descriptive, but out of tradition, SecureRISC still keeps this name.
PCIe
An acronym for Peripheral Component Interconnect Express[wikilink], which is a bus standard.
Physical Address
The address used to address physical memory (as opposed to virtual memory[wikilink]). In a guest operating system running on a Virtual Machine, what the guest OS thinks are physical addresses are themselves virtualized in a second level of address translation, the first level being virtual address → guest physical address, and the second level being guest physical address → system physical address. Thus physical address is a potentially confusing term, being dependent upon context.
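The two levels of translation compose as in the sketch below. Page tables are modeled as flat dictionaries from page number to page number, and a 4 KiB page size is assumed; both are simplifications for illustration and say nothing about the actual SecureRISC translation structures.

```python
PAGE_BITS = 12  # assumed 4 KiB pages
PAGE_MASK = (1 << PAGE_BITS) - 1


def translate(page_table, addr):
    """Replace the page number of addr using page_table; keep the page offset."""
    page, offset = addr >> PAGE_BITS, addr & PAGE_MASK
    return (page_table[page] << PAGE_BITS) | offset


def guest_virtual_to_system_physical(guest_table, host_table, virtual_addr):
    """First level: virtual -> guest physical. Second level: guest physical -> system physical."""
    guest_physical = translate(guest_table, virtual_addr)
    return translate(host_table, guest_physical)
```

For instance, if the guest maps virtual page 0x10 to guest physical page 0x2, and the hypervisor maps guest physical page 0x2 to system physical page 0x7F, then virtual address 0x10ABC becomes guest physical 0x2ABC and finally system physical 0x7FABC.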
PIC
An acronym for Position-Independent Code[wikilink], which refers to machine code that does not depend upon the address at which it is placed. PIC typically uses PC-relative addresses for references to other code (e.g. for conditional branches and calls).
PLT
An acronym for Procedure Linkage Table[wikilink], which is a technique used on many systems for calling procedures in code segments not at a known offset. (The PLT is called the Program Linkage Table in the RISC‑V ABI.) The call is in the shared portion of the machine code and cannot be modified without unsharing. Instead the call is to a PC-relative stub in an area called the PLT, which contains code that can invoke the dynamic linker on the first call to resolve the target. After the resolution, the PLT stub may be patched (depending on the ISA) so that future calls skip the dynamic linker. If the PLT stubs are patched, then the PLT must be writeable, but since such PLT entries are grouped together, this results in relatively few pages being unshared.
Popcount
Popcount is an abbreviation of population count, which is another name for the formal Hamming weight[wikilink]. While Hamming weight is defined more generally, popcount as used in this document is simply the number of 1 bits in a bitstring or word.
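A software popcount can be written by repeatedly clearing the lowest set bit, as in this small sketch (one of several well-known methods; a hardware implementation would normally use an adder tree instead):

```python
def popcount(word):
    """Count the 1 bits in a non-negative integer."""
    count = 0
    while word:
        word &= word - 1  # clears the lowest set bit
        count += 1
    return count
```

The loop runs once per 1 bit, so `popcount(0b1011)` takes three iterations and returns 3.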
PPN
An acronym for Physical Page Number, which is the page portion of a physical address. The PPN is the result of translation from a VPN in the Memory Management Unit.
Prefetch
Cache prefetching is either a hardware or software technique to begin bringing data into one or more caches from further out in the memory hierarchy in advance of an instruction fetch, load, or store requesting that data. The goal of prefetch is to improve performance by reducing the time the processor has to wait when the fetch, load, or store is executed. Software-initiated prefetch is typically in the form of instructions inserted by compilers, whereas hardware-prefetching is based on processor data structures that identify access patterns and initiate cache fills based on those patterns. An example of the latter is Bouquet of Instruction Pointers: Instruction Pointer Classifier-based Hardware Prefetching[PDF].
PTE
An acronym for Page Table Entry[wikilink], the descriptor in a Page Table[wikilink] that specifies the physical address, permissions, and other information for a virtual → physical translation (the virtual portion usually being implied by the Page Table location).
QoS
An acronym for Quality of service[wikilink], which are techniques for prioritizing traffic and buffering resources in a network to ensure certain levels of service to the most critical data served by the network.
Register Renaming
Register Renaming[wikilink] is a microarchitecture technique that maps the ISA-defined register numbers specified in the instruction word (logical registers) to locations in a larger execution register file to eliminate needless restrictions on instruction execution order that would be required if logical register numbers were used directly.
Register Windows
Register Windows[wikilink] are a feature of a few Instruction Set Architectures (ISAs), typically used to reduce register save and restore operations on function calls, which improves performance and reduces code size. In this author’s opinion, most register window ISAs (e.g. SPARC) were not sufficiently efficient to be justified, with one exception: the Xtensa ISA[PDF]† made register windows efficient by making the increment small and variable.
† (designed by the author with code size as the primary justification for register windows).
Release Consistency
Release consistency[wikilink] was defined in Memory consistency and event ordering in scalable shared-memory multiprocessors by Gharachorloo et al. and is one of the most relaxed consistency models[wikilink] used in multiprocessor shared computation, and therefore requires more programming effort for correct synchronization, but which also allows for high performance by allowing for more buffering and pipelining than other models. Release consistency requires (quoting Gharachorloo et al. Condition 3.1 Conditions for Release Consistency): Here special accesses are loads and stores to competing shared memory locations and processor consistent[PDF] is an earlier consistency model where writes issuing from any processor may not be observed in any order other than that in which they were issued.
Reserved
A register or data structure field reserved for future use. Reserved fields in data structures must be set to 0 by software. Software must ignore reserved fields in registers and preserve the value held in these fields when writing values to other fields in the same register.
Ring
Rings[wikilink] provide nested access rights to multiple levels of privilege. They are a generalization of the user and supervisor modes of some processors. See A Hardware Architecture for Implementing Protection Rings, by Michael D. Schroeder and Jerome H. Saltzer.
RISC
An acronym for Reduced Instruction Set Computer[wikilink], originally coined by David Patterson. Others have subsequently suggested that it should be an acronym for Really Invented by Seymour Cray due to Cray’s contribution to ISA design on the CDC 6600 and the Cray-1. John Mashey once opined that the definition of RISC had become any instruction set defined after 1980, given the number of things that have been called RISC thereafter that were more complicated than the ISAs that first embraced the RISC terminology (SPARC[wikilink] and MIPS[wikilink]). See Is SecureRISC actually RISC? for my thoughts on a useful definition of RISC.
ROM
An acronym for Read-Only Memory[wikilink].
ROP
An acronym for Return-Oriented Programming[wikilink], which is a technique for exploiting lapses in computer security.
RTOS
An acronym for Real-Time Operating System[wikilink].
Sandbox
A Sandbox[wikilink] is an environment for executing untrusted programs.
Sequencers
A parallel processing synchronization mechanism described in Synchronization with Eventcounts and Sequencers[PDF] by Reed and Kanodia. Sequencers are typically implemented with an atomic increment.
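As a rough sketch of how a sequencer and an eventcount combine to give mutual exclusion in ticket order, consider the following. The operations follow the ticket, read, advance, and await primitives of Reed and Kanodia, but the Python classes, the lock standing in for an atomic increment, and the name `await_value` (chosen because `await` is a reserved word in Python) are assumptions of this illustration.

```python
import threading


class Sequencer:
    """Hands out successive tickets 0, 1, 2, ..."""

    def __init__(self):
        self._next = 0
        self._lock = threading.Lock()  # stands in for an atomic increment

    def ticket(self):
        with self._lock:
            t = self._next
            self._next += 1
            return t


class Eventcount:
    """A counter that only increases and that threads can wait on."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def read(self):
        with self._cond:
            return self._count

    def advance(self):
        with self._cond:
            self._count += 1
            self._cond.notify_all()

    def await_value(self, value):
        """Block until the eventcount has reached at least value."""
        with self._cond:
            self._cond.wait_for(lambda: self._count >= value)


def run(n_threads=4, per_thread=100):
    """Use ticket/await/advance to serialize updates to a shared total."""
    seq, ec = Sequencer(), Eventcount()
    total = [0]

    def worker():
        for _ in range(per_thread):
            t = seq.ticket()
            ec.await_value(t)  # wait for all earlier ticket holders to finish
            total[0] += 1      # critical section
            ec.advance()       # admit the holder of ticket t + 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return total[0]
```

Each thread enters the critical section only when the eventcount equals its ticket, so entries occur one at a time and in ticket order.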
Shared Library
A shared library[wikilink] consists of code and data that is usable by multiple programs where the code is shared[wikilink] in physical memory. Shared libraries may be mapped into the program address space during initialization or they may be dynamically loaded.
Slab Allocator
The terminology slab allocator can refer to a generic class of memory allocation algorithms, or a specific Linux kernel allocator (currently being deprecated in favor of the SLUB allocator). SecureRISC uses the generic class meaning, which is where allocation requests are binned into size groups, and for small sizes, each size group has a set of pages containing objects of that size. Allocation is just removing from the free list, and deallocation is simply adding to the free list. Large sizes are handled with page allocation. An example of a slab allocator is jemalloc. Slab allocators interact with memory safety in multiple ways, and provide a good opportunity to exploit tagging to detect buffer overflows and dangling pointer errors.
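The generic scheme can be sketched as one free list per size group, refilled a slab at a time. The size classes, the 4096-byte slab, the use of plain integers as addresses, and all names below are assumptions for illustration only; real allocators such as jemalloc are far more elaborate.

```python
class SlabAllocator:
    """Toy slab allocator: per-size-class free lists of fixed-size objects."""

    def __init__(self, size_classes=(16, 32, 64, 128), slab_bytes=4096):
        self.size_classes = size_classes
        self.slab_bytes = slab_bytes
        self.free_lists = {size: [] for size in size_classes}
        self.next_slab_base = 0  # addresses are modeled as plain integers

    def size_class(self, nbytes):
        """Bin a request into the smallest size class that fits it."""
        for size in self.size_classes:
            if nbytes <= size:
                return size
        raise ValueError("large request: would be handled by page allocation")

    def alloc(self, nbytes):
        size = self.size_class(nbytes)
        free = self.free_lists[size]
        if not free:
            # Carve a fresh slab into objects of this size class.
            base = self.next_slab_base
            self.next_slab_base += self.slab_bytes
            free.extend(range(base, base + self.slab_bytes, size))
            free.reverse()  # so pop() hands out the lowest address first
        return free.pop()   # allocation is just removal from the free list

    def free(self, addr, nbytes):
        # Deallocation is just adding the object back to the free list.
        self.free_lists[self.size_class(nbytes)].append(addr)
```

Note that a freed object is the next one handed out for its size class, which is exactly the situation in which a dangling pointer silently refers to a new object.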
SIMD
An acronym for Single Instruction, Multiple Data[wikilink], which is a more general concept than the way it is used in this document. Technically vector processing[wikilink] is a form of SIMD, but here the term is used as a simpler form of data parallelism where registers hold 2 to 64 data values and a single computation instruction operates on those data values in parallel units, producing results in a fixed number of cycles, whereas a vector processor might operate for a variable number of cycles over a larger set of data values of variable length.
Speculative Execution
Speculative Execution[wikilink] is a technique for enhancing performance by predicting what will be executed or needed by the program, and executing or fetching those things early, with the consequence that that effort may be wasted, and may potentially even leak information (a form of security flaw).
Supervisor
Supervisor is a term used to distinguish a level of the operating system[wikilink] hierarchy in a computer system. This level of the operating system mediates between the application program (which runs in user mode) and the hardware and memory, or in the case of a virtual machine the virtualized hardware and memory.
TAGE
An acronym for TAgged GEometric history length branch prediction[PDF], which is a class of branch predictors based on multiple predictors using geometric progression of global history length, using the longest matching history length for the prediction.
Tagged Architecture
Tagged computer architectures[wikilink] divide words into tag bits and data bits, where the tag bits describe the interpretation of the data bits. The tag bits may be used for Runtime Type Information (RTTI), memory-safety, information-flow control, and capabilities.
TCSEC
An acronym for Trusted Computer System Evaluation Criteria[wikilink], sometimes referred to as the Orange Book. While it has been superseded by the Common Criteria for Information Technology Security Evaluation[wikilink], there are still relevant concepts from the Orange Book, such as Mandatory access control[wikilink], which shows up in many products, such as Security-Enhanced Linux[wikilink] (SELinux).
TEE
An acronym for Trusted execution environment[wikilink], which is part of a system with greater security than the rest of the system, providing confidentiality and integrity for TEE data, and providing services employing that data.
Thread-Local Storage (TLS)
Programs allocate storage in four ways:
  1. At runtime (dynamically) via variable declarations (in C++ using the auto keyword or simply omitting the static keyword). These allocations are to either the stack or registers and are always thread-local because stacks and registers are thread-local.
  2. At runtime (dynamically) via explicit memory allocation requests (e.g. C++ new and delete). These allocations are done on one or more areas or heaps, and are always shared by all threads because the heap is shared by all threads.
  3. At compile-time (statically) via variable declarations (in C++ outside of functions and inside functions and objects using the static keyword). These allocations are to ELF sections such as .bss and .data. A single copy is shared by all threads.
  4. At compile-time (statically) via variable declarations (in C++ using the __thread keyword). These allocations are to ELF sections such as .tbss and .tdata. Each thread gets exactly one copy of this data. This allocation may occur when a thread is created (initialized from the template created by the compiler) or dynamically on the first use. This storage is called Thread-Local Storage (TLS). See also GCC Thread-Local Storage and Ulrich Drepper’s ELF Handling For Thread-Local Storage[PDF].
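The difference between storage shared by all threads (case 3) and per-thread storage (case 4) can be illustrated by analogy in Python. This sketch uses threading.local, which is a runtime mechanism and not ELF .tbss/.tdata; it only shows the visible behavior of "one copy per thread" versus "one copy for everyone". The worker function and variable names are invented for the example.

```python
import threading

shared = {"count": 0}        # a single copy seen by every thread (like .data/.bss)
tls = threading.local()      # attributes set on this object are private to each thread
lock = threading.Lock()
per_thread_totals = {}


def worker(name, iterations):
    tls.count = 0                        # this thread's own copy
    for _ in range(iterations):
        tls.count += 1                   # no lock needed: nobody else can see it
        with lock:                       # shared data needs synchronization
            shared["count"] += 1
    per_thread_totals[name] = tls.count


threads = [threading.Thread(target=worker, args=(f"t{i}", 100 * (i + 1)))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread ends with its own private count, while the shared counter accumulates the work of all three.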
TLB
An acronym for Translation Lookaside Buffer[wikilink], which is a bad name for a virtual address translation cache (there’s no lookaside involved).
Ultravisor
Ultravisor[wikilink] is the highest privilege mode in the POWER 9 architecture Protected Execution Facility (PEF), the privilege stack being Ultravisor, Hypervisor, Supervisor, and Problem.
User Mode
User mode is one or more levels of privilege in a system with multiple levels of privilege. Application programs typically run in user mode with a virtualized memory address space, making calls to the higher privilege levels (e.g. supervisor mode) to request services. The user vs. supervisor mode distinction is a simplified form of hierarchical protection domains (also called rings). As examples of privilege nesting, user mode would typically not have access to supervisor memory, but supervisor mode would have access to user memory, and user mode would not have access to certain instructions and registers that supervisor mode can access.
Virtual Address
The address used during program execution to specify locations in Virtual Memory[wikilink] for data accesses and transfers of control. Virtual addresses are translated to physical addresses via data structures specified by operating systems and cached in translation caches (also known as TLBs). This translation can be one-level (just virtual address → physical address) or two-level (virtual address → guest physical address → system physical address) when there are multiple levels of operating systems (the most privileged generally called the hypervisor and the less privileged being the guest OS or the supervisor).
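A minimal Python sketch of one-level and two-level translation follows. It assumes 4 KiB pages and models each translation data structure as a flat dictionary from page number to page number; real page tables are multi-level and carry permissions, and the table contents here are made up for illustration.

```python
PAGE_BITS = 12                       # assumed 4 KiB pages
PAGE_SIZE = 1 << PAGE_BITS


def translate(table, addr):
    """One level: split into page number and offset, map the page, rejoin."""
    page, offset = addr >> PAGE_BITS, addr & (PAGE_SIZE - 1)
    if page not in table:
        raise LookupError(f"page fault at {addr:#x}")
    return (table[page] << PAGE_BITS) | offset


# Hypothetical mappings: the guest OS owns the first, the hypervisor the second.
guest_table = {0x00042: 0x00007}     # virtual page -> guest physical page
hyper_table = {0x00007: 0x1A2B3}     # guest physical page -> system physical page


def translate_two_level(virtual_addr):
    guest_physical = translate(guest_table, virtual_addr)
    return translate(hyper_table, guest_physical)
```

With these tables, translate_two_level(0x42ABC) passes through guest physical address 0x7ABC and yields system physical address 0x1A2B3ABC; the page offset 0xABC is unchanged at every level.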
Virtual Machine
A Virtual Machine[wikilink] (VM) is an emulation of a computer system. It allows an operating system[wikilink] to run on what appears to be bare hardware, but which is actually an environment that emulates that hardware. The VM abstraction is created by a hypervisor[wikilink], which is in effect an operating system for operating systems.
VPN
An acronym for Virtual Page Number, which is the page portion of a virtual address. The VPN is translated to a PPN in the Memory Management Unit.
Word
Wikipedia defines word[wikilink] as the natural unit of data for a processor, but often contemporary ISAs define word in a historical way, e.g. to be 32 bits even when their datapaths make 64 bits more natural. They do so because their predecessor ISAs had 32‑bit words. Thus, word is a term that is simply defined by each new ISA. The SecureRISC definition is given below.

RISC‑V Glossary

Other acronyms may be less familiar as they come from RISC‑V[wikilink].
To quote from RISC‑V International’s About RISC‑V: “RISC-V is an open standard Instruction Set Architecture (ISA) enabling a new era of processor innovation through open collaboration.”
The official RISC‑V ISA specifications may be downloaded from RISC‑V specifications while working versions may be found at the GitHub RISC‑V ISA Manual repository.
The two primary specifications are:

NAPOT
An acronym for Naturally Aligned Power-of-2[PDF] (Chapter 5 “Svnapot” Standard Extension for NAPOT Translation Contiguity)
NAPOT refers to things of size 2^N bytes being aligned to that size (i.e. the address is a multiple of 2^N, viz. the least significant N bits of the address are zero). In RISC‑V NAPOT sizes are represented with repeated 1s (e.g. in PMP entries) or repeated 0s (e.g. in PTEs) in the lower address bits, then the opposite bit, and then from that point actual address bits, which allows the address and size to be encoded in 1 bit more than the address alone. (Note: SecureRISC in the early days had separate address and size fields but changed to adopt this clever bit-saving idea.)
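The encoding trick can be made concrete with a small Python sketch of the repeated-1s form. It is a generic byte-granular version written for illustration: the address is shifted left by one bit, the low N bits are filled with 1s, and the 0 above them marks where real address bits begin. RISC‑V PMP entries use the same pattern but drop low-order bits because of their minimum granularity; the function names here are hypothetical.

```python
def napot_encode(addr, n):
    """Pack a 2**n-byte region at addr into one value (one bit wider than addr)."""
    if addr & ((1 << n) - 1):
        raise ValueError("address is not aligned to 2**n")
    # n trailing 1s, then a 0 (the 'opposite bit'), then address bits n and up.
    return (addr << 1) | ((1 << n) - 1)


def napot_decode(encoded):
    """Recover (addr, n) by counting the trailing 1s."""
    n = 0
    while (encoded >> n) & 1:
        n += 1
    addr = (encoded >> (n + 1)) << n
    return addr, n
```

For example, a 4 KiB region (N = 12) at address 0x8000 encodes as 0x10FFF, and decoding 0x10FFF returns the pair (0x8000, 12).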
PMA
An acronym for Physical Memory Attributes[PDF] (section 3.6 Physical Memory Attributes), which are attributes used by processors for regions of the system’s physical address space.
Example RISC‑V Physical Memory Attributes are as follows (SecureRISC PMAs are likely to be simpler, except for the addition of tagging):
Memory type: Vacant, Main memory, I/O
Read access width: Subset of 8‑bit, 16‑bit, 32‑bit, 64‑bit, 128‑bit, …, 512‑bit reads supported
Write access width: Subset of 8‑bit, 16‑bit, 32‑bit, 64‑bit, 128‑bit, …, 512‑bit writes supported
Execute access width: Subset of 16‑bit, 32‑bit, 64‑bit, 128‑bit, …, 512‑bit instruction fetch supported
Atomic Memory Swap (AMOSwap) width: Subset of 8‑bit, 16‑bit, 32‑bit, 64‑bit, 128‑bit AMOSwaps supported
Atomic Memory Logical (AMOLogical) width: Subset of 8‑bit, 16‑bit, 32‑bit, 64‑bit, 128‑bit AMOLogicals supported
Atomic Memory Arithmetic (AMOArithmetic) width: Subset of 8‑bit, 16‑bit, 32‑bit, 64‑bit, 128‑bit AMOArithmetics supported
Page Table Reads: Supported or not
Page Table Writes: Supported or not
LR/SC support level: None, NonEventual, Eventual
Coherence: Not coherent or coherence channel number
Cacheability: Yes or No
Idempotency: Whether reads and writes have side effects
PMP
An acronym for Physical Memory Protection[PDF] (section 3.7.1 Physical Memory Protection CSRs), which is a mechanism for providing Read, Write, and Execute permission for ranges of physical memory addresses independent of the translation mechanism, and which is logically anded with those permissions.
WorldGuard
WorldGuard is technology developed by SiFive for RISC‑V and proposed for standardization by RISC‑V International. It provides isolation by constraining access to system physical addresses based upon identifiers (called Worlds) assigned to system interconnect ports (e.g. processors and devices) that are checked in a distributed fashion by resources (such as memories and peripheral devices) or checkers located before such resources. Worlds are created and configured by a Trusted Execution Environment (TEE), usually at system boot time. A two-world simplification of WorldGuard could be used to provide similar functionality to ARM’s TrustZone.
WARL
An acronym for Write Any Values, Reads Legal Values, which is a specification of a register field that allows processors to implement a subset of the functionality described in a CSR (e.g. by hardwiring certain bits to fixed values, or by allowing some values to be written but not others). If an unsupported value is written, the implementation substitutes some legal value.
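A small Python sketch of WARL behavior follows. Which legal value an implementation substitutes for an unsupported write is up to the implementation; this example assumes the simplest policy of keeping the previous value. The class name and the example set of legal values are hypothetical.

```python
class WarlField:
    """Toy model of a WARL register field."""

    def __init__(self, legal, reset):
        if reset not in legal:
            raise ValueError("reset value must be legal")
        self.legal = frozenset(legal)
        self.value = reset

    def write(self, value):
        # Any value may be written without trapping ...
        if value in self.legal:
            self.value = value
        # ... but an unsupported value is replaced by a legal one.
        # Assumed policy for this sketch: keep the previous legal value.

    def read(self):
        return self.value                # reads always return a legal value
```

Software can use this behavior to probe what is implemented: write the desired value, read the field back, and compare.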
WPRI
An acronym for Writes Preserve Values, Reads Ignore Values. The RISC‑V privileged specification states, “Software should ignore the values read from these fields, and should preserve the values held in these fields when writing values to other fields of the same register. For forward compatibility, implementations that do not furnish these fields must make them read-only zero.”

Programming Language Glossary

Java
Julia
Lisp
See
Python
Rust
Swift

Operating System Glossary

Amber
CheriBSD
DBOS
Linux
Multics

Architecture Glossary

Alpha
ARM
Cray-1
IA-64
MIPS
PA-RISC
POWER
RISC-V
SPARC
x86-64
Xtensa

Crypto Glossary

Other terminology and acronyms are associated with cryptography and are summarized below. The reader should return here if encountering unfamiliar cryptographic terminology.

AEAD
An acronym for Authenticated encryption with associated data[wikilink], which refers to methods that provide both encryption and authentication.
AES
An acronym for the Advanced Encryption Standard[wikilink], which is a block cipher for 128‑bit blocks with three key sizes: 128, 192, and 256 bits. See FIPS 197[PDF] for details.
AES128
AES with a 128‑bit key.
AES256
AES with a 256‑bit key.
Block Cipher
A block cipher is a pair of algorithms (encryption and its inverse decryption) that operate on two inputs, a fixed-length key and fixed-length data (the block). The key length need not be the same as the data length. The input to encryption is called plaintext, and the output is called ciphertext. Decryption with the same key takes ciphertext and produces the original plaintext. This is represented as follows for encryption algorithm E, decryption algorithm D, plaintext P, ciphertext C, key K:
C ← E(K, P)
P ← D(K, C)

or sometimes written as
C ← E_K(P)
P ← D_K(C)

Thus the following property holds:
∀P: D_K(E_K(P)) = P
Carter-Wegman MAC
A Message Authentication Code based on an encrypted hash function using a separate key from the encrypted data. Carter-Wegman MACs are sometimes abbreviated as CWMAC in this document.
Counter Mode
Encryption by xor with successive applications of a block cipher on a counter. Counter mode is sometimes abbreviated as CTR in encryption mode specifications (e.g. AES128-CTR).
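Counter mode can be sketched in Python as follows. Because the Python standard library has no AES, this example substitutes a keyed BLAKE2b hash as a stand-in for the block cipher applied to the counter; that substitution is an assumption of the sketch, not AES128-CTR. The layout of the counter block (8-byte nonce followed by an 8-byte big-endian counter) is also just one plausible choice.

```python
import hashlib

BLOCK_BYTES = 16


def keystream_block(key: bytes, counter_block: bytes) -> bytes:
    """Stand-in for E_K(counter): a keyed hash, NOT a real block cipher."""
    return hashlib.blake2b(counter_block, key=key, digest_size=BLOCK_BYTES).digest()


def ctr_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: xor data with E_K(nonce || 0), E_K(nonce || 1), ..."""
    out = bytearray()
    for offset in range(0, len(data), BLOCK_BYTES):
        counter = (offset // BLOCK_BYTES).to_bytes(8, "big")
        pad = keystream_block(key, nonce + counter)
        chunk = data[offset:offset + BLOCK_BYTES]
        out.extend(d ^ p for d, p in zip(chunk, pad))   # a short final chunk is fine
    return bytes(out)
```

Since xor is its own inverse, the same function decrypts, and only the forward direction of the cipher is ever needed. The nonce must not be reused with the same key, or two messages would share a keystream.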
Feistel Cipher
A Feistel Cipher[wikilink] is a structure used in the construction of symmetric block ciphers.
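The appeal of the Feistel structure is that it is invertible no matter what round function is used. The toy Python sketch below uses a 64‑bit block split into two 32‑bit halves, four rounds, and a SHA‑256 based round function; all of these choices are assumptions for illustration, and the result is not a secure cipher.

```python
import hashlib

HALF_MASK = 0xFFFFFFFF


def round_fn(key: bytes, round_index: int, half: int) -> int:
    """Arbitrary (non-invertible) round function F; any function works here."""
    data = key + bytes([round_index]) + half.to_bytes(4, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")


def feistel_encrypt(key: bytes, block: int, rounds: int = 4) -> int:
    left, right = block >> 32, block & HALF_MASK
    for r in range(rounds):
        # (L, R) -> (R, L xor F(R))
        left, right = right, left ^ round_fn(key, r, right)
    return (left << 32) | right


def feistel_decrypt(key: bytes, block: int, rounds: int = 4) -> int:
    left, right = block >> 32, block & HALF_MASK
    for r in reversed(range(rounds)):
        # Undo one round: (L', R') -> (R' xor F(L'), L')
        left, right = right ^ round_fn(key, r, left), left
    return (left << 32) | right
```

Decryption runs the same rounds in reverse order, so feistel_decrypt(K, feistel_encrypt(K, P)) returns P for every 64‑bit P, which is the block cipher property stated under Block Cipher above.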
GCM
An acronym for Galois/Counter Mode[wikilink], which is a mode of operation for authenticated encryption with associated data (AEAD) using a block cipher in counter mode with a Galois Message Authentication Code (GMAC) of the authenticated data, ciphertext, authenticated data length, and ciphertext length. See NIST Special Publication 800-38D[PDF] for details.
MAC
An acronym for Message authentication code[wikilink], which is a fixed-length bit string used to authenticate other data. MACs prevent data tampering by checking data integrity while being designed to resist forgery.
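As a simple illustration of how a MAC is used, the Python sketch below computes and checks a tag with HMAC‑SHA‑256 from the standard library. HMAC is chosen only because it is readily available; it is a different construction from the Carter-Wegman MACs and GMAC mentioned elsewhere in this glossary. The function names are invented for the example.

```python
import hashlib
import hmac


def make_tag(key: bytes, message: bytes) -> bytes:
    """Compute a fixed-length (32-byte) tag over the message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(make_tag(key, message), tag)
```

A receiver holding the same key accepts the message only if verify returns True; changing any bit of the message or the tag makes verification fail.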
Nonce
A nonce is a bit string that is only used once, and which prevents data from being encrypted or authenticated twice, e.g. to prevent replay attacks.
PQC
An acronym for Post-Quantum Cryptography[wikilink].
S-Box
An S-box (or substitution box) is a nonlinear function of input bits used in symmetric ciphers, designed to resist certain attacks such as linear and differential cryptanalysis.
Simon
Simon[wikilink] is a family of block ciphers for various block and key sizes. It was proposed as a lightweight (e.g. compared to AES) block cipher optimized for hardware. The Simon m/n family employs a Feistel structure to encrypt m‑bit blocks with an n‑bit key.
Tweakable Block Cipher
A tweakable block cipher is a block cipher whose encryption and decryption algorithms take an additional tweak input t as shown below for tweakable algorithm (E,D), key K, plaintext P, ciphertext C, and tweak t:
Encryption: C ← E(K, P, t)
Decryption: P ← D(K, C, t)
Property: ∀P: D(K, E(K, P, t), t) = P
Unlike keys, tweaks are typically publicly known (e.g. to adversaries). Tweakable block ciphers are used in applications where encryption must not expand the data as occurs in authenticated encryption when a MAC is appended. The tweak is often the location of the data, and ensures that even if the same data is encrypted twice in different locations, the resulting ciphertext is different.
XEX
An acronym for Xor-Encrypt-Xor[wikilink], which is a tweakable mode of operation of a block cipher typically used for Data at rest protection[wikilink], for example in XTS-AES.
XTS-AES
XTS-AES is one use of XEX and AES defined in IEEE Std 1619-2007 for cryptographic protection of data stored in constant length blocks. The encryption of the 𝑗th 128 bits of a block with tweak 𝑖 is as follows where the block length is a multiple of 128 bits (i.e. the following does not cover ciphertext stealing):
T_𝑗 ← AESenc(Key_2, 𝑖) ⊗ 𝑥^𝑗
C_𝑗 ← AESenc(Key_1, P_𝑗 ⊕ T_𝑗) ⊕ T_𝑗
and decryption is the obvious inverse:
T_𝑗 ← AESenc(Key_2, 𝑖) ⊗ 𝑥^𝑗
P_𝑗 ← AESdec(Key_1, C_𝑗 ⊕ T_𝑗) ⊕ T_𝑗
where:
⊕ is bitwise xor
⊗ is multiplication of two polynomials over the binary field GF(2) mod 𝑥^128+𝑥^7+𝑥^2+𝑥+1
Key is a two-part encryption key, consisting of Key_1 and Key_2 where Key = Key_1∥Key_2. For AES-128, Key would be 256 bits, and for AES-256 it would be 512 bits.
𝑖 is the value of the 128‑bit tweak
P_𝑗 is the 𝑗th block of 128 bits of plaintext
C_𝑗 is the 𝑗th block of 128 bits of ciphertext for P_𝑗
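The structure of the T, C, and P equations above can be sketched in Python. The sketch treats each 128‑bit value as an integer whose bit k is the coefficient of 𝑥 to the power k, so that multiplying by 𝑥 is a left shift followed by reduction with 0x87 (the terms 𝑥^7+𝑥^2+𝑥+1); mapping bytes to such integers is assumed to be little-endian and is not shown. Because the Python standard library has no AES, the block cipher is passed in as a parameter, and the toy_enc/toy_dec pair is an insecure, invertible stand-in used only to exercise the code.

```python
MASK128 = (1 << 128) - 1
REDUCTION = 0x87          # x^7 + x^2 + x + 1, the low terms of the modulus


def mul_x(t: int) -> int:
    """Multiply a GF(2^128) element by x, reducing if x^128 appears."""
    t <<= 1
    if t >> 128:
        t = (t & MASK128) ^ REDUCTION
    return t


def tweak_sequence(encrypted_tweak: int, count: int):
    """Yield T_j = encrypted_tweak * x^j for j = 0 .. count-1."""
    t = encrypted_tweak
    for _ in range(count):
        yield t
        t = mul_x(t)


def xts_encrypt(enc, key1, key2, i, plaintext_blocks):
    tweaks = tweak_sequence(enc(key2, i), len(plaintext_blocks))
    return [enc(key1, p ^ t) ^ t for p, t in zip(plaintext_blocks, tweaks)]


def xts_decrypt(enc, dec, key1, key2, i, ciphertext_blocks):
    tweaks = tweak_sequence(enc(key2, i), len(ciphertext_blocks))
    return [dec(key1, c ^ t) ^ t for c, t in zip(ciphertext_blocks, tweaks)]


# Insecure stand-in for AESenc/AESdec (xor with key, then rotate by 13 bits).
def toy_enc(key: int, block: int) -> int:
    v = (block ^ key) & MASK128
    return ((v << 13) | (v >> 115)) & MASK128


def toy_dec(key: int, block: int) -> int:
    v = ((block >> 13) | (block << 115)) & MASK128
    return v ^ key
```

Note that decryption still uses the encryption direction for the tweak, as in the equations. With the same key and plaintext, changing 𝑖 or the block position 𝑗 changes T_𝑗, which is how the tweak makes identical data at different locations encrypt differently.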

Vulnerability Glossary

There are so many security vulnerabilities that instruction set and microarchitecture design should be aware of that these are now separated from the conventional glossary above.

Augury
Augury[PDF] is a security vulnerability resulting from a data memory-dependent prefetcher (DMP) interpreting memory contents as a pointer and prefetching the blocks addressed. A newer vulnerability resulting from DMP is GoFetch.
Cacheout
Cacheout is a vulnerability of Intel processors where cache blocks evicted from the L1 data cache are stored in the Line Fill Buffer (LFB), and loads search and can bypass from this buffer. This can be exploited by the attacker to leak the L1 data cache contents.
CrossTalk
CrossTalk[PDF] is a vulnerability of Intel processors where certain microarchitectural data shared between cores can leak information from one core to an attacker on another. For example, the attacker can observe the hardware random number generator values sent to another core.
Downfall
Downfall is a security vulnerability of Intel processors that exploits the vector gather instruction to leak data from buffers shared across security domains and VMs running on the same processor. For example, it allows SIMD register loads containing encryption keys to be leaked. Downfall is called Gather Data Sampling (GDS) by Intel.
Fallout
Fallout[PDF] is a Microarchitectural Data Sampling (MDS) attack that exploits two store buffer behaviors of Intel processors. Other MDS attacks include RIDL, CrossTalk, LVI, Snoop, L1DES, VRS, and TAA.
Foreshadow and Foreshadow-NG
Foreshadow[wikilink] is a computer security flaw attack in the Meltdown class targeting Intel SGX technology which defeats enclave memory isolation, sealing, and attestation guarantees. Intel calls Foreshadow an L1 Terminal Fault (L1TF) vulnerability. Intel’s analysis identified two closely related variants of Foreshadow, which we collectively call Foreshadow-NG (quotes from Foreshadow-NG[PDF]). These attacks allow the entire L1 data cache to be dumped, which potentially exposes data from other address spaces that would otherwise not be nameable for leaking by other techniques: Foreshadow-NG-type attacks variants exploit a subtle L1TF microarchitectural condition that allows to transiently compute on unauthorized physical memory locations that are currently not mapped in the attacker’s virtual address space view. As such, Foreshadow-NG is the first transient execution attack that fully escapes the virtual memory sandbox - traditional page table isolation is no longer sufficient to prevent unauthorized memory access.
GhostRace
GhostRace[wikilink] is an attack on code with critical regions and synchronization. It involves training branches testing for mutual exclusion to mispredict, and exploiting the microarchitectural changes that result. The authors’ key finding is that all the common synchronization primitives can be microarchitecturally bypassed on speculative paths, turning all architecturally race-free critical regions into Speculative Race Conditions (SRCs).
GoFetch
GoFetch is a security vulnerability resulting from a data memory-dependent prefetcher (DMP) interpreting memory contents as a pointer and prefetching the blocks addressed.
Inception
Inception is a transient execution attack that leaks arbitrary data on all AMD Zen CPUs.
LVI
Load Value Injection (LVI) exploits transiently incorrect forwarding, with variants exploiting forwarding in the L1 Data Cache, Store Buffer, Line Fill Buffer, and load ports.
Meltdown
Meltdown[wikilink] is a computer security flaw attack that is allowed when processors delay privilege checks until after subsequent instructions have been speculatively executed and thereby modified shared microarchitectural state. Vulnerable processors thereby allow low privilege levels to read the memory of higher privilege levels, breaking the privilege model. Meltdown is also known as Rogue Data Cache Load (RDCL) and Speculative execution exploit variant 3.
PACMAN
PACMAN.
Retbleed
Retbleed[wikilink].
RIDL
Rogue In-Flight Data Load (RIDL)[PDF] is a Microarchitectural Data Sampling (MDS) attack. Other MDS include Fallout, CrossTalk, LVI, Snoop, L1DES, VRS, and TAA.
Spectre
Spectre[wikilink] is a computer security flaw attack based on training branch prediction to transiently direct execution of the victim to code that in turn exposes secrets through shared microarchitectural state. The original Spectre attacks were bounds check bypass (variant 1) and branch target injection (variant 2). Spectre attacks have been extended beyond branch prediction and the branch target buffer to the return address stack (Spectre-RSB)[PDF], and store to load bypass (Spectre-STL)[wikilink] (also known as Spectre variant 4). A Systematic Evaluation of Transient Execution Attacks and Defenses[PDF] provides a breakdown of the many Spectre and Meltdown types.
Spoiler
Spoiler[wikilink]. Also SPOILER: Speculative Load Hazards Boost Rowhammer and Cache Attacks[PDF].
SWAPGS
SWAPGS[wikilink].
Transient Execution Vulnerability
Transient execution vulnerabilities[wikilink] are a class of vulnerabilities caused by speculative execution. This class includes Spectre, Speculative Code Store Bypass (SCSB), Floating Point Value Injection (FPVI), Branch History Injection, Retbleed, SQUIP, Cross-Thread Return Address Predictions, Zenbleed, Inception, Downfall, GhostRace, and Register File Data Sampling (RFDS).
Zenbleed
Zenbleed.
ZombieLoad
ZombieLoad is a transient-execution attack which observes the values of memory loads and stores on the current CPU core. ZombieLoad exploits that the fill buffer is used by all logical CPUs of a CPU core and that it does not distinguish between processes or privileges.